We have infinite data: a microphone and a camera can generate huge amounts of it, and the public domain literature is vast. Billions of people learn like that every day.
It’s impossible to learn any technical topic from 70+ year-old books. The public domain is small and basically zero if you want to learn anything current. A microphone and camera are fine for learning about daily life, but you cannot get “book smarts” without copyrighted media.
If you wanted to train a bomb making AI on the most up-to-date physics textbooks in existence, that'd be what, a few hundred bucks in textbooks? Doesn't look like any kind of barrier to me.
if FOSS AI folks need FOSS data, then it seems they need to recruit people to generate data. maybe it will even force them (them!) to finally sit down and make a viable Reddit alternative.
but more seriously, if data becomes a bottleneck there are trivial ways to get more data, from crowdsourcing to forming a foundation, getting universities and other stakeholders on board, and negotiating fees. and somewhere along this spectrum there's the option to simply wait, or to work on the problem of learning, on generating better training data from existing data, and so on.
The public domain literature is big, but not that big, and very far from infinite data. All the major current models already include all of the public domain literature that has ever been digitized, and much more. So public domain literature is not only not abundant, it isn't even sufficient to make anything competitive with even last-generation models. We would really like to get 10x or 100x more data than all the literature ever published (both PD and not), if we could. And we probably can, since the vast, vast majority of what people write happens in other contexts and only a tiny fraction of written words end up as published literature.