# A.I. Has a Measurement Problem - The New York Times

URL:: https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html
Author:: Kevin Roose

## AI-Generated Summary
Reliable measurement and evaluation of artificial intelligence systems like ChatGPT and Gemini is lacking, making it hard to determine their true capabilities. The current tests for AI models may not be sufficient, raising concerns about safety and effectiveness. Efforts from both governments and AI companies are needed to develop better testing programs and to ensure transparency in how AI systems are evaluated.

## Highlights
> Mr. Hendrycks said that while he thought MMLU “probably has another year or two of shelf life,” it will soon need to be replaced by different, harder tests. A.I. systems are getting too smart for the tests we have now, and it’s getting more difficult to design new ones. ([View Highlight](https://read.readwise.io/read/01kpndmxn9k6ybtn94wkyecabc))
> There may also be problems with the tests themselves. Several researchers I spoke to warned that the process for administering benchmark tests like MMLU varies slightly from company to company, and that various models’ scores might not be directly comparable. ([View Highlight](https://read.readwise.io/read/01kpndnsegvssy42698ezzh1vh))
> There is a problem known as “data contamination,” when the questions and answers for benchmark tests are included in an A.I. model’s training data, essentially allowing it to cheat. And there is no independent testing or auditing process for these models, meaning that A.I. companies are essentially grading their own homework. ([View Highlight](https://read.readwise.io/read/01kpndnszgfd0zf30k5zbabvvt))
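The contamination problem described above is often screened for with simple text-overlap heuristics: if a training document contains a long verbatim run of words from a benchmark question, the model may have effectively seen the answer key. Below is a minimal sketch of one such heuristic, word n-gram overlap. The function names and the toy documents are hypothetical, for illustration only; real contamination audits (where they exist at all) use more robust matching.

```python
# Minimal sketch of a "data contamination" check: flag a training document
# that shares a long verbatim word sequence with a benchmark question.
# All names and example texts are hypothetical illustrations.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, benchmark_item: str, n: int = 8) -> bool:
    """True if the training document shares any word n-gram with the benchmark item."""
    return bool(ngrams(train_doc, n) & ngrams(benchmark_item, n))

question = ("What is the main function of the mitochondria in a eukaryotic cell "
            "according to standard biology textbooks?")
clean_doc = "Mitochondria produce ATP through oxidative phosphorylation."
leaky_doc = ("Practice exam: What is the main function of the mitochondria in a "
             "eukaryotic cell according to standard biology textbooks?")

print(is_contaminated(clean_doc, question))  # False: no 8-word overlap
print(is_contaminated(leaky_doc, question))  # True: the question appears verbatim
```

The choice of n trades off false positives (short, common phrases) against false negatives (lightly paraphrased leaks, which this sketch cannot catch at all) — one reason the article's point about independent auditing matters.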