Also curious... I spent a while trying to get system to tell me directly, but no...

alchemist1e9 · on Dec 2, 2022

It’s odd how little discussion there is on inputs because the more reputable the inputs the more likely it can be trusted. I’d really like to know the body of knowledge it has been trained on.

My guess why this is obscured is legal, in that they have used a massive body of copyrighted data, and hope to avoid controversy over the inputs by trying not to talk about it.

I had seen once a huge collection of links to curated input data sets for language models but haven’t been able to find it yet in my notes/bookmarks unfortunately.

alchemist1e9 · on Dec 3, 2022

This was likely a significant percentage of the input data:

https://en.wikipedia.org/wiki/Common_Crawl

I also have a an odd hunch ChatGPT might have used a scihub mirror as inputs for example.