
> ChatGPT was trained with ALL the data possible

My understanding is that ChatGPT was trained on text from the Internet and public domain texts. There is orders of magnitude more text available to humans behind paywalls and otherwise inaccessible (currently) to these models.

Am I missing something?



No, it would be a gross misunderstanding to think ChatGPT has anywhere close to all the data possible. Not even close to all the data on the internet. Not even close to all text. Let alone data available by directly interacting with the world.


It’s a bit of an open question how much of that data is high quality, unique, and available. It could be that OpenAI used most of what satisfies those constraints. Training on low-quality data won’t help improve its accuracy on queries, nor will duplicative data.


> Not even close to all the data on the internet

I agree with your other points, but why would you think ChatGPT was not given all the data on the internet?

If you aren't storing the text, the only thing that stops you retrieving all the pages that can possibly be found on the internet is a small amount of money.

I'm pretty certain that OpenAI has a lot more than a small amount of money.


You're severely underestimating how much content is on the internet and how hard it would be to see and index it all. OpenAI used the Common Crawl dataset, which is already pretty unwieldy and represents an amalgamation of data gathered over several years by many crawlers.
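
For a rough sense of scale, here's a minimal Python sketch that counts how many captures of a single domain sit in one Common Crawl snapshot. It assumes the public CDX index at index.commoncrawl.org, and the crawl ID below is only an example; multiply by hundreds of millions of domains and dozens of crawls to see why "all the data on the internet" isn't a casual undertaking.

    # Query the Common Crawl CDX index for captures of one domain in one crawl.
    # Assumptions: the public index server at index.commoncrawl.org; the crawl ID
    # CC-MAIN-2023-06 is only an example and changes with every crawl.
    import requests

    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2023-06-index",
        params={"url": "example.com/*", "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    records = [line for line in resp.text.splitlines() if line]
    print(f"{len(records)} captures of example.com in this single crawl")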


Because if it was, it would mostly talk about porn? :)


There’s lots of paywalled content, and other content hidden behind logins and group memberships (e.g. Facebook posts, university alumni portals, university course portals).

Even on the paywall issue alone, I can’t see how they could scale paywall signups automatically. Each paywall form is different, may require a local phone number in a different country to receive a text, etc.


LLMs might be good enough to sign up for sites, though maybe not yet good enough to fool the “I am a human” tests.


In addition to what others have said, there is a significant amount of data on the internet that is not in text form.


Didn't Google have a project to scan and OCR all the books? I wonder whether these data were fed to Bard.


You can find GPT-2's training dataset list - at a high level - in the GPT-2 repository on GitHub: https://github.com/openai/gpt-2/blob/master/model_card.md#da... However, OpenAI goes dark after that regarding the 'data soup' that was fed into their LLMs. In general, starting around 2019 and definitely by 2020, you'll notice that research labs became much less forthcoming about the data that went into their models. As far as I'm aware, BookCorpus is one of the more commonly used 'large books' datasets that's been utilized in recent years to train large language models (LLMs) like generative pretrained transformers: https://12ft.io/proxy?q=https%3A%2F%2Ftowardsdatascience.com...
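
If you want to poke at BookCorpus yourself, here's a minimal sketch using the Hugging Face datasets library, assuming the corpus is still published on the Hub under the name "bookcorpus" with a "text" field:

    # Peek at BookCorpus via the Hugging Face `datasets` library.
    # Assumption: the mirror is still published under the dataset name
    # "bookcorpus" with a "text" field; the upstream corpus itself has a
    # complicated availability history.
    from datasets import load_dataset

    books = load_dataset("bookcorpus", split="train")
    print(books)              # number of rows and features
    print(books[0]["text"])   # first sentence-level record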

At my alma mater, the University of Michigan, I remember the large-scale Google book-scanning devices and what a herculean effort it was to digitize the largest university library system's books, although only about 7M texts from the entire collection of ~16 million were digitized: https://en.wikipedia.org/wiki/University_of_Michigan_Library. I too was curious about the state of the Google Books project: https://www.edsurge.com/news/2017-08-10-what-happened-to-goo...

This is an interesting piece of ephemera from 2005, when Google started digitizing books at UMich: https://apps.lib.umich.edu/files/services/mdp/faq.pdf

As far as I recall, the Books project allowed the early n-grams functionality to be built out: https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-...

The Google Books Ngram Viewer tool is actually still in existence; you can play around with it here: https://books.google.com/ngrams/graph?corpus=0&content=Vorsp...
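
For anyone who hasn't looked under the hood, the Ngram data is just frequency counts of short word sequences per publication year. A toy sketch of the counting step (illustrative only, not Google's actual pipeline):

    # Toy bigram counting, the kind of statistic behind the Ngram Viewer.
    # Illustrative only; the real pipeline counts n-grams per publication
    # year across the scanned Books corpus.
    from collections import Counter

    text = "all our n gram are belong to you all our base are belong to us"
    tokens = text.split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    print(bigrams.most_common(3))  # e.g. [(('all', 'our'), 2), (('are', 'belong'), 2), ...]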


Yes, and while there were copyright issues with them putting the books out there in public, they still retain all the scans to use for search projects.

https://books.google.com/


It was claimed to use book data, but IMHO nowadays the available internet data is larger than all the books ever published; so while book data definitely should be used, it's not a pathway to significant increases in data size.


I'd be crazy if I didn't think that Google is sitting on some stuff that nobody knows about and they are stroking their cat from the lair as we type.


It’s funny that the general internet pessimism about Google misses stuff like this.

I mean, ChatGPT went viral and Google managed to ship Bard in a few weeks. I think the consensus is that ChatGPT is better, but Bard was literally sitting on the shelf, ready to go.


"...and they are stroking their cat from the lair..."

On the first quick read-through, I thought to myself, "Can he use that sort of language here?"

Then I pictured Dr. Evil and it made more sense...


I think Blofeld was the reference. Dr Evil is a parody of Blofeld.


The cat was deprecated half a year ago. ;-)


If that was the case, it threw more than half of it up again, because it's not making much sense atm.


You are right. It is trained on a lot of data, more than what a person van read in many lifetimes, but not all.

In fact, it will be interesting to see how much better it would be at copywriting for specific fields once it can train on that data. I imagine an LLM trained on all that dusty text in courthouse basements would become a much better paralegal (won't be a lawyer, I'm afraid) than vanilla ChatGPT.


> person van

Makes sense to use Transformers' data to train autonomous vehicles.


Also, there are images and video that it didn't use for training.


I don’t think you needed to take it literally.


I am very interested in what LLMs will be able to do when trained on something other than the content on the Internet, which is primarily generated to sell advertising views.


I highly doubt it’s trained on that. I’m sure it was curated and trained on the good stuff.


Did you arrive at this certainty through reading something other than what OpenAI has published? The document [0] that describes the training data for GPT-2 makes this assertion hilarious to me.

[0]: https://github.com/openai/gpt-2/blob/master/model_card.md#da...


Yes, obvious hyperbole.



