My understanding is that ChatGPT was trained on text from the Internet and public domain texts. There are orders of magnitude more text available to humans behind paywalls and otherwise inaccessible (currently) to these models.
No, it would be a gross misunderstanding to think ChatGPT has anywhere close to all the data possible. Not even close to all the data on the internet. Not even close to all text. Let alone data available by directly interacting with the world.
It’s a bit of an open question as to how much of that data is: high quality, unique, and available. It could be that OpenAI used most of what satisfies those constraints. Training on low quality data won’t help improve its accuracy on queries, nor will duplicative data.
I agree with your other points, but why would you think ChatGPT was not given all the data on the internet?
If you aren't storing the text, the only thing that stops you from retrieving every page that can possibly be found on the internet is a small amount of money.
I'm pretty certain that OpenAI has a lot more than a small amount of money.
You're severely underestimating how much content is on the internet and how hard it would be to see and index it all. OpenAI used the Common Crawl dataset, which is already pretty unwieldy and represents an amalgamation of data gathered over several years by many crawlers.
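For a sense of the scale being argued about, here is a rough back-of-envelope sketch. Every figure below is an illustrative assumption (round numbers on the order of Common Crawl's published monthly snapshots), not a measurement; the point it makes is that raw transfer cost is modest, so the hard part is discovery, deduplication, and processing rather than bandwidth:

```python
# Back-of-envelope estimate of what one broad web crawl means in storage
# and bandwidth. All figures are assumptions chosen for illustration.

PAGES = 3_000_000_000        # assumed pages per crawl (Common Crawl's monthly
                             # snapshots are on this order of magnitude)
AVG_PAGE_BYTES = 75_000      # assumed average compressed size per archived page

total_bytes = PAGES * AVG_PAGE_BYTES
total_tib = total_bytes / 2**40

COST_PER_GB_TRANSFER = 0.05  # assumed $/GB for network egress/transit
transfer_cost = total_bytes / 1e9 * COST_PER_GB_TRANSFER

print(f"~{total_tib:,.0f} TiB of archives per crawl")
print(f"~${transfer_cost:,.0f} in transfer alone, before storage, compute, "
      f"retries, or politeness delays")
```

Under these assumptions a single crawl is a couple of hundred TiB, and the bandwidth bill alone is small by big-tech standards; what doesn't show up in the arithmetic is finding the URLs in the first place, which is where "all the data on the internet" breaks down.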
There’s lots of paywalled content, and other content hidden behind logins and group memberships (e.g. Facebook posts, university alumni portals, university course portals).
Take the paywall issue alone: I can’t see how they could automate paywall signups at scale. Each paywall form is different, may require a local phone number in a different country to receive a verification text, etc.
You can find GPT-2's training dataset list - at a high level - in the GPT-2 repository on GitHub: https://github.com/openai/gpt-2/blob/master/model_card.md#da... However, OpenAI goes dark after that regarding the 'data soup' that was fed into their LLMs. In general, starting around 2019, and definitely by 2020, you'll notice that research labs became much less forthcoming about the data that went into their models. As far as I'm aware, BookCorpus is one of the more commonly used large books datasets that's been utilized in recent years to train large language models (LLMs) like generative pretrained transformers: https://12ft.io/proxy?q=https%3A%2F%2Ftowardsdatascience.com...
Yes, and while there were copyright issues with them putting the books out there in public, they still retain all the scans to use for search projects.
It was claimed to use book data, but IMHO nowadays the available internet data is larger than all the books ever published; so while book data definitely should be used, it's not a pathway to significant increases in data size.
It’s funny that the general internet pessimism about Google misses stuff like this.
I mean ChatGPT went viral and Google managed to ship Bard in a few weeks. I think the consensus is that ChatGPT is better, but Bard was literally sitting on the shelf ready to go.
You are right. It is trained on a lot of data, more than a person can read in many lifetimes, but not all of it.
In fact it will be interesting to see how much better it would be at copywriting for specific fields once it can train on that data. I imagine an LLM trained on all that dusty text in courthouse basements would become a much better paralegal (won't be a lawyer, I'm afraid) than vanilla ChatGPT.
I am very interested in what LLMs will be able to do when trained on something other than the content on the Internet, which is primarily generated to sell advertising views.
Did you arrive at this certainty through reading something other than what OpenAI has published? The document [0] that describes the training data for GPT-2 makes this assertion hilarious to me.
Am I missing something?