Benchmark Scores Aren't Enough: A/B Testing AI in Production (growthbook.io)
2 points by royalfig 8 months ago | 1 comment


Goodhart's Law states, "When a measure becomes a target, it ceases to be a good measure."

This applies to our favorite LLMs, too. As they are optimized to score high on benchmarks, how do we know that also translates into real-world performance on things like accuracy, latency, or user engagement?

A/B testing AI models in production gives you real feedback on how your LLM and its configuration are performing.
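
For anyone who wants to see the mechanics, here is a rough sketch in Python of what an A/B test between two model configurations could look like. It is not GrowthBook's API or the article's method: the function names, the experiment key, and the example counts are all made up for illustration, and the fixed-horizon two-proportion z-test is just one common way to compare a binary metric such as "user accepted the answer".

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str = "llm-model-test") -> str:
    """Deterministically bucket a user into variant 'A' or 'B' (50/50 split).

    Hashing the experiment key together with the user ID keeps a user on the
    same model across requests. The experiment key is a hypothetical name.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def two_proportion_z_test(success_a: int, total_a: int,
                          success_b: int, total_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0% or all 100%")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: answers users accepted, out of answers served.
    z, p = two_proportion_z_test(success_a=100, total_a=1000,
                                 success_b=150, total_b=1000)
    print(f"user-42 sees model {assign_variant('user-42')}")
    print(f"z = {z:.2f}, p = {p:.4f}")
```

The same comparison works for any yes/no outcome you log per request. Continuous metrics like latency would need a different test, and peeking at results before the planned sample size is reached inflates false positives with this kind of test.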



