
One of the dangers of automated tests is that if you use an LLM to generate them, it can easily end up testing the implemented behavior rather than the desired behavior (sketched in the example below). Tell it to loop until the tests pass, and it will do exactly that if unsupervised.

And you can’t even treat the implementation as a black box by using a different LLM, because all the frontier models are trained toward similar biases: confidence and obsequiousness when making assumptions about the spec!

Verifying the solution in agentic coding is not nearly as easy as it sounds.
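A hypothetical Python sketch of that failure mode (the function, the bug, and the values are all invented for illustration): a test derived from the implementation simply locks in whatever the code already does, while a test derived from the spec catches the discrepancy.

    # Hypothetical example: the spec says orders of $100 *or more* get a
    # 10% discount, but the implementation uses a strict comparison.
    def apply_discount(total: float) -> float:
        if total > 100:  # bug: spec says ">= 100"
            return total * 0.9
        return total

    # A test generated by reading the implementation locks in the bug:
    def test_discount_generated_from_implementation():
        assert apply_discount(100) == 100  # passes, but contradicts the spec

    # A test written from the spec (desired behavior) catches it:
    def test_discount_from_spec():
        assert apply_discount(100) == 90  # fails until the bug is fixed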



Not only can it easily do this, I've found that Claude models do it as a matter of course. My strategy now is to write either the test or the implementation myself and use Claude for the other one. That keeps it a lot more honest.
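A minimal sketch of that split, with invented names: the human-authored test acts as the spec and stays fixed, and whatever implementation the model produces has to satisfy it, rather than the test being regenerated to match the code.

    import re

    # Human-written test: authored first and kept out of the model's hands,
    # so it can't be quietly adjusted until it passes.
    def test_make_slug():
        assert make_slug("Hello World") == "hello-world"
        assert make_slug("Rock & Roll!") == "rock-roll"
        assert make_slug("a  --  b") == "a-b"

    # The model is then asked to write make_slug() against that test; a
    # plausible implementation it might produce:
    def make_slug(text: str) -> str:
        text = text.lower()
        text = re.sub(r"[^a-z0-9]+", "-", text)  # collapse non-alphanumerics into hyphens
        return text.strip("-")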



