With many colleagues (including from AISI themselves!), we recently reviewed 445 AI benchmarks & evaluations from the past few years. Our work was published at NeurIPS (https://openreview.net/pdf?id=mdA5lVvNcU) and we made eight recommendations for better evaluations. One is “use statistical methods to compare models”:
□ Report the benchmark’s sample size and justify its statistical power
□ Report uncertainty estimates for all primary scores to enable robust model comparisons (see the sketch after this list)
□ If using human raters, describe their demographics and mitigate potential demographic biases in rater recruitment and instructions
□ Use metrics that capture the inherent variability of any subjective labels, without relying on single-point aggregation or exact matching
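To make the first two items concrete, here is a minimal sketch of one common way to report uncertainty when comparing two models: a paired bootstrap over benchmark items, giving a confidence interval for the score difference. This is not the paper's prescribed procedure, just an illustration; the function name `paired_bootstrap_ci`, the synthetic per-item correctness arrays, and the 500-item benchmark size are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05):
    """Bootstrap CI for mean(scores_a) - mean(scores_b) over shared benchmark items."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    assert scores_a.shape == scores_b.shape, "both models must be scored on the same items"
    n = len(scores_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample item indices with replacement
        diffs[i] = scores_a[idx].mean() - scores_b[idx].mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return scores_a.mean() - scores_b.mean(), (lo, hi)

# Hypothetical per-item correctness (0/1) for two models on a 500-item benchmark.
model_a = rng.binomial(1, 0.72, size=500)
model_b = rng.binomial(1, 0.68, size=500)

diff, (lo, hi) = paired_bootstrap_ci(model_a, model_b)
print(f"accuracy difference: {diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
# If the interval contains 0, the benchmark may simply be too small to
# distinguish the two models, which is what the sample-size / power item
# above asks you to check and report.
```

A wide interval around the difference is a direct, reportable uncertainty estimate; shrinking it is where the sample-size justification comes in.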
I would strongly recommend taking these blog posts with a grain of salt, as there is very little that can be learned without proper evaluations.
Thanks for the concrete recommendations; unfortunately, most of these will fall flat, because nobody teaches how to do them or why they matter.