
> ChatGPT was trained with ALL the data possible

My understanding is that ChatGPT was trained on text from the Internet and public domain texts. There is orders of magnitude more text available to humans behind paywalls and otherwise inaccessible (currently) to these models.

Am I missing something?



No, it would be a gross misunderstanding to think ChatGPT has anywhere close to all the data possible. Not even close to all the data on the internet. Not even close to all text. Let alone data available by directly interacting with the world.


It’s a bit of an open question how much of that data is high quality, unique, and available. It could be that OpenAI used most of what satisfies those constraints. Training on low-quality data won’t help improve its accuracy on queries, nor will duplicative data.


> Not even close to all the data on the internet

I agree with your other points, but why would you think ChatGPT was not given all the data on the internet?

If you aren't storing the text, the only thing that stops you retrieving all the pages that can possibly be found on the internet is a small amount of money.

I'm pretty certain that OpenAI has a lot more than a small amount of money.


You're severely underestimating how much content is on the internet and how hard it would be to see and index it all. OpenAI used the Common Crawl dataset, which is already pretty unwieldy and represents an amalgamation of data gathered over several years by many crawlers.
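
For a rough sense of scale, here's a minimal Python sketch that counts how many captures of a single domain sit in one Common Crawl snapshot. It assumes the public CDX index at index.commoncrawl.org, and the crawl ID below is only an example; multiply by hundreds of millions of domains and dozens of crawls to see why "all the data on the internet" isn't a casual undertaking.

    # Query the Common Crawl CDX index for captures of one domain in one crawl.
    # Assumptions: the public index server at index.commoncrawl.org; the crawl ID
    # CC-MAIN-2023-06 is only an example and changes with every crawl.
    import requests

    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2023-06-index",
        params={"url": "example.com/*", "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    records = [line for line in resp.text.splitlines() if line]
    print(f"{len(records)} captures of example.com in this single crawl")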


Because if it was, it would mostly talk about porn? :)


There’s lots of paywalled content, and other content hidden behind logins and group memberships (e.g. Facebook posts, university alumni portals, university course portals).

Even on the paywall issue alone, I can’t see how they could scale paywall signups automatically. Each paywall form is different, may require a local phone number in a different country to receive a text, etc.


LLMs might be good enough to sign up for sites, though maybe not yet good enough to fool the “I am a human” tests.


In addition to what others have said, there is a significant amount of data on the internet that is not in text form.


Didn't Google have a project to scan and OCR all the books? I wonder whether these data were fed to Bard.


You can find GPT-2's training dataset list - at a high level - in the GPT-2 repository on GitHub: https://github.com/openai/gpt-2/blob/master/model_card.md#da... However, OpenAI goes dark after that regarding the 'data soup' that was fed into their LLMs. In general, starting around 2019 and definitely by 2020, you'll notice that research labs became much less forthcoming about the data that went into their models. As far as I'm aware, BookCorpus is one of the more commonly used 'large books' datasets that's been utilized in recent years to train large language models (LLMs) like generative pretrained transformers: https://12ft.io/proxy?q=https%3A%2F%2Ftowardsdatascience.com...
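
If you want to poke at BookCorpus yourself, here's a minimal sketch using the Hugging Face datasets library, assuming the corpus is still published on the Hub under the name "bookcorpus" with a "text" field:

    # Peek at BookCorpus via the Hugging Face `datasets` library.
    # Assumption: the mirror is still published under the dataset name
    # "bookcorpus" with a "text" field; the upstream corpus itself has a
    # complicated availability history.
    from datasets import load_dataset

    books = load_dataset("bookcorpus", split="train")
    print(books)              # number of rows and features
    print(books[0]["text"])   # first sentence-level record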

At my alma mater, the University of Michigan, I remember the large-scale Google book-scanning devices and what a herculean effort it was to digitize the largest university library system's books, although only about 7M texts from the entire collection of ~16 million were digitized: https://en.wikipedia.org/wiki/University_of_Michigan_Library. I too was curious about the state of the Google Books project: https://www.edsurge.com/news/2017-08-10-what-happened-to-goo...

This is an interesting piece of ephemera from 2005, when Google started digitizing books at UMich: https://apps.lib.umich.edu/files/services/mdp/faq.pdf

As far as I recall, the Books project allowed the early n-grams functionality to be built out: https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-...

The Google Books Ngram Viewer tool is actually still in existence; you can play around with it here: https://books.google.com/ngrams/graph?corpus=0&content=Vorsp...
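
For anyone who hasn't looked under the hood, the Ngram data is just frequency counts of short word sequences per publication year. A toy sketch of the counting step (illustrative only, not Google's actual pipeline):

    # Toy bigram counting, the kind of statistic behind the Ngram Viewer.
    # Illustrative only; the real pipeline counts n-grams per publication
    # year across the scanned Books corpus.
    from collections import Counter

    text = "all our n gram are belong to you all our base are belong to us"
    tokens = text.split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    print(bigrams.most_common(3))  # e.g. [(('all', 'our'), 2), (('are', 'belong'), 2), ...]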


Yes, and while there were copyright issues with them putting the books out there in public, they still retain all the scans to use for search projects.

https://books.google.com/


It was claimed to use book data, but IMHO nowadays the available internet data is larger than all the books ever published; so while book data definitely should be used, it's not a pathway to significant increases in data size.


I'd be crazy if I didn't think that Google is sitting on some stuff that nobody knows about and they are stroking their cat from the lair as we type.


It’s funny that the general internet pessimism about Google misses stuff like this.

I mean, ChatGPT went viral and Google managed to ship Bard in a few weeks. I think the consensus is that ChatGPT is better, but Bard was literally sitting on the shelf, ready to go.


"...and they are stroking their cat from the lair..."

On the first quick read-through, I thought to myself, "Can he use that sort of language here?"

Then I pictured Dr. Evil and it made more sense...


I think Blofeld was the reference. Dr Evil is a parody of Blofeld.


The cat was deprecated half a year ago. ;-)


If that was the case, it threw more than half of it up again, because it's not making much sense atm.


You are right. It is trained on a lot of data, more than what a person van read in many lifetimes, but not all.

In fact, it will be interesting to see how much better it would be at copywriting for specific fields once it can train on that data. I imagine an LLM trained on all that dusty text in courthouse basements would become a much better paralegal (won't be a lawyer, I'm afraid) than vanilla ChatGPT.


> person van

Makes sense to use Transformers' data to train autonomous vehicles.


Also, there are images and video that it didn't use for training.


I don’t think you needed to take it literally.


I am very interested in what LLMs will be able to do when trained on something other than the content on the Internet, which is primarily generated to sell advertising views.


I highly doubt it’s trained on that. I’m sure it was curated and trained on the good stuff.


Did you arrive at this certainty through reading something other than what OpenAI has published? The document [0] that describes the training data for GPT-2 makes this assertion hilarious to me.

[0]: https://github.com/openai/gpt-2/blob/master/model_card.md#da...


Yes, obvious hyperbole.



