
I like the term "prompt performance"; I am definitely going to use it:

> prompt performance (n.)

> the behaviour of a language model in which it conspicuously showcases or exaggerates how well it is following a given instruction or persona, drawing attention to its own effort rather than simply producing the requested output.

:)



This might be a result of using LLMs to evaluate the output of other LLMs.

A response probably gets a higher score from an LLM judge if it explicitly states that it is following the instructions...
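
To make that incentive concrete, here is a minimal LLM-as-judge sketch in Python using the OpenAI client. The judge model name, the rubric wording, and the 1-10 scale are assumptions for illustration, not the setup of any particular eval harness:

    # Illustrative LLM-as-judge scorer; the model name and rubric
    # below are assumptions for this sketch, not a real benchmark's.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    JUDGE_PROMPT = (
        "You are grading a model response against an instruction.\n"
        "Instruction: {instruction}\n"
        "Response: {response}\n"
        "Rate from 1 to 10 how well the response follows the "
        "instruction. Reply with the number only."
    )

    def judge_score(instruction: str, response: str) -> int:
        # A judge reading quickly may reward responses that loudly
        # announce compliance ("As requested, I will now...") over
        # ones that simply do the task -- the failure mode the
        # parent comment describes.
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed judge model
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    instruction=instruction, response=response),
            }],
        )
        return int(reply.choices[0].message.content.strip())

With a rubric like this, a response that opens with "Understood. As instructed, here is the output in exactly the format you asked for:" can pick up points for visible compliance even when the output itself is no better.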


It's like writing an essay for a standardized test, as opposed to one for a college course or for a general audience. When taking a test, you only care about the evaluation of a single grader hurrying to get through a pile of essays, so you should usually attempt to structure your essay to match the format of the scoring rubric. Doing this on an essay for a general audience would make it boring, and doing it in your college course might annoy your professor. Hopefully instruction-following evaluations don't look too much like test grading, but this kind of behavior would make some sense if they do.


That's the model equivalent of a "performative male", so it might be better to call it performative model behaviour.



