From the PDF - "One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are "search" and "learning".
The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done."
If I were to take the Lesson literally, we should not even study text-to-image. We should study how a machine with limitless CPU cycles would make our eyes see something we are currently thinking of.
My point being: optimization, or splitting the problem up into sub-problems before handing it over to the machine, makes sense.
I think the bitter lesson implies that if we could study/implement "how a machine with limitless cpu cycles would make our eyes see something we are currently thinking of" then it would likely lead to a better result than us using hominid heuristics to split things into sub-problems that we hand over to the machine.
The technology to probe brains and vision-related neurons exists today. With limitless CPU cycles we would surely be able to make ourselves see whatever we think about.
I'm not really familiar with that technology space, but if you take that as true, is your argument something like:
- We don't have limitless CPU cycles
- Thus we need to split things into sub-problems
If so, that might still be compatible with the bitter lesson, where Sutton is saying human heuristics will always lose out to computational methods at scale.
Meaning something like:
- We split up the thought-to-vision problem into N sub-problems based on some heuristic.
- We develop a method that works within our CPU-cycle constraint (it isn't some probe -> CPU interface). Perhaps it uses our voice or something similar as a proxy for our thoughts, and some composition of models.
Sutton would say:
Yeah, that's fine, but if we had limitless CPU cycles/adequate technology, the probe -> CPU solution would be better than whatever we develop.
I think Sutton is right that, given limitless CPU, any human-devised split would be inferior. So indeed, since we are far from limitless CPU, we divide and compose.
But I think we're onto something!
Voice-to-image might indeed give better results than text-to-image, since voice has some vibe to it (intonation, tone, color, stress on certain words, speed, and probably even traits we don't know about yet) that will shape or even drastically influence the image output.
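To make the "composition of models" idea concrete, here is a minimal sketch in Python, using librosa for the audio features. transcribe and generate_image are hypothetical placeholders for whichever speech-to-text and image-generation models you'd compose; the specific prosody features and the mood mapping are illustrative guesses, not a real design.

    import numpy as np
    import librosa

    def transcribe(path: str) -> str:
        # Hypothetical stand-in for a speech-to-text model.
        return "a quiet beach at dawn"  # dummy transcript for the sketch

    def generate_image(prompt: str) -> str:
        # Hypothetical stand-in for a text-to-image model.
        return f"[image conditioned on: {prompt}]"

    def prosody_features(path: str) -> dict:
        """Extract crude prosody cues: pitch, intonation range, loudness, pace."""
        y, sr = librosa.load(path, sr=16000)
        f0, voiced, _ = librosa.pyin(
            y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
        )
        f0 = f0[voiced] if voiced.any() else f0  # keep voiced frames only
        rms = librosa.feature.rms(y=y)[0]
        return {
            "pitch_mean_hz": float(np.nanmean(f0)),  # overall tone
            "pitch_var_hz2": float(np.nanvar(f0)),   # intonation range
            "loudness": float(rms.mean()),           # stress/energy proxy
            "duration_s": float(librosa.get_duration(y=y, sr=sr)),
        }

    def voice_to_image(path: str) -> str:
        text = transcribe(path)        # sub-problem 1: the words
        vibe = prosody_features(path)  # sub-problem 2: the "vibe" text alone loses
        # Crude hand-built mapping from prosody to mood words; a real system
        # might instead feed the features in as learned conditioning embeddings.
        mood = "calm, muted" if vibe["pitch_var_hz2"] < 500 else "energetic, vivid"
        return generate_image(f"{text}, {mood} mood")

Note the irony: the hand-written mood mapping is exactly the kind of hominid heuristic Sutton predicts a sufficiently scaled end-to-end learner would eventually beat.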
https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...