
Why would they need to release the training data? That's nonsense.


Because the training data is the source of the model. This thread may illuminate it for you: https://news.ycombinator.com/item?id=40035688

Most models that are described as "open source" are actually open weight, because their source is not open.


It's still open source and can be used. Open source refers to the code, not to all the design documents, discussions, plans, etc.


Open source for traditional software means that you can see how the software works and reproduce the executable by compiling the software from source code. For LLMs, reproducing the model means reproducing the weights. And to do that you need the training source code AND the training data. There are already other great models that do this (see my comment at https://news.ycombinator.com/item?id=40147298).

I get that there may be some training data that is proprietary and cannot be released. But in those scenarios, it would still be good to know what the data is, how it was curated or filtered (this greatly affects LLM performance), how it is weighted relative to other training data, and so forth. But a significant portion of data used to train models is not proprietary and in those cases they can simply link to that data elsewhere or release it themselves, which is what others have done.
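To make the reproduction point concrete, here is a minimal sketch (a toy model, not any real LLM's pipeline): the resulting weights are a function of the training code, the training data, and the seed, so reproducing them requires all three.

    # Toy illustration: weights = f(training code, data, seed).
    # Without the same data, you cannot re-derive the same weights,
    # no matter how much training code is published.
    import torch

    def train(data: torch.Tensor, targets: torch.Tensor, seed: int = 0) -> torch.nn.Module:
        torch.manual_seed(seed)                    # fix initialization
        model = torch.nn.Linear(data.shape[1], 1)  # stand-in for an LLM
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        for _ in range(100):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(data), targets)
            loss.backward()
            opt.step()
        return model

    x = torch.randn(64, 8)
    y = torch.randn(64, 1)
    w1 = train(x, y).weight
    w2 = train(x, y).weight
    # Same code + same data + same seed -> same weights (a reproducible "build").
    assert torch.allclose(w1, w2)
    # Different or undisclosed data -> different weights, no way to reproduce.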


There's no perfect analogy. It's far easier to usefully modify the weights of a model without the training data than it is to modify a binary executable without its source code.

I'd rather also have the data for sure! But in terms of what useful things I can do with it, weights are closer to source code than they are to a binary blob.
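As a rough sketch of what "usefully modifying the weights" looks like in practice: with open weights you can fine-tune on your own data without ever seeing the original corpus. This assumes PyTorch and the Hugging Face transformers library, and the model name is a placeholder, not anyone's actual release.

    # Hedged sketch: adapt released weights with your own data.
    # No access to the original training corpus is needed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "some-org/some-open-weight-model"   # placeholder, not a real repo
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    batch = tok(["Example text to adapt the model to."], return_tensors="pt")
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    model.train()
    out = model(**batch, labels=batch["input_ids"])  # causal-LM loss on your data
    out.loss.backward()
    opt.step()                                       # the weights are now usefully modified

Doing the equivalent to a stripped binary executable would mean patching machine code by hand, which is why weights sit closer to source code on that axis.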


They don't need to, but then they also should not call the model truly open. It is the equivalent of freeware, not open source.



