I don't know if I feel cheated, but it seems a little unmanageable. How is this supposed to scale? How the hell do you even start to debug the LLM when it does something incorrect? It's not like you can attach a debugger to English.
The "vibe" I'm getting is that of a junior developer who solves problems by tacking on an ever-increasing amount of code, rather than going back and fixing underlying design flaws.
See it as a temporary workaround, and assume each instruction will also lead to additional training data that tries to achieve the same behavior directly in the next model.
It comes down to solving this: given instruction X, figure out how to change the training data such that X is obeyed and no other side effects appear. Given the amount of training data and the complexities involved in training, I don't think there is a clear way to do it.
I'm slightly less sceptical that they can do it, but we presumably agree that changing the prompt is far faster, so you change the prompt first. The prompt then effectively serves, in part, as documentation of issues to chip away at while working on the next iterations of the underlying models.