With many colleagues (including from AISI themselves!), we recently reviewed 445 AI benchmarks & evaluations from the past few years. Our work was published at NeurIPS (https://openreview.net/pdf?id=mdA5lVvNcU) and we made eight recommendations for better evaluations. One is “use statistical methods to compare models”:
□ Report the benchmark’s sample size and justify its statistical power
□ Report uncertainty estimates for all primary scores to enable robust model comparisons (see the sketch after this list)
□ If using human raters, describe their demographics and mitigate potential demographic biases in rater recruitment and instructions
□ Use metrics that capture the inherent variability of any subjective labels, without relying on single-point aggregation or exact matching
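To make the first two items concrete, here is a minimal sketch of one common way to report uncertainty when comparing two models: a paired bootstrap over benchmark items, giving a confidence interval for the score difference. This is not the paper's prescribed procedure, just an illustration; the function name `paired_bootstrap_ci`, the synthetic per-item correctness arrays, and the 500-item benchmark size are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05):
    """Bootstrap CI for mean(scores_a) - mean(scores_b) over shared benchmark items."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    assert scores_a.shape == scores_b.shape, "both models must be scored on the same items"
    n = len(scores_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample item indices with replacement
        diffs[i] = scores_a[idx].mean() - scores_b[idx].mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return scores_a.mean() - scores_b.mean(), (lo, hi)

# Hypothetical per-item correctness (0/1) for two models on a 500-item benchmark.
model_a = rng.binomial(1, 0.72, size=500)
model_b = rng.binomial(1, 0.68, size=500)

diff, (lo, hi) = paired_bootstrap_ci(model_a, model_b)
print(f"accuracy difference: {diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
# If the interval contains 0, the benchmark may simply be too small to
# distinguish the two models, which is what the sample-size / power item
# above asks you to check and report.
```

A wide interval around the difference is a direct, reportable uncertainty estimate; shrinking it is where the sample-size justification comes in.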
I would strongly recommend taking these blog posts with a grain of salt, as there is very little that can be learned without proper evaluations.
Thanks for the concrete recommendations; unfortunately, most of these will fall flat, because nobody teaches how to do them or why they matter.