
Open source for traditional software means you can see how the software works and reproduce the executable by compiling from source. For LLMs, reproducing the model means reproducing the weights, and to do that you need the training source code AND the training data. There are already other great models that do this (see my comment at https://news.ycombinator.com/item?id=40147298).
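To make the "code AND data" point concrete, here is a minimal sketch (a toy linear model in NumPy, not a real LLM; all names and the training setup are illustrative assumptions): with the same training code, the same data, and the same seed, the weights reproduce bit-for-bit, while training on different data yields different weights.

```python
import numpy as np

def train(data, labels, seed=0, lr=0.1, steps=200):
    """Toy 'training run': same code + same data + same seed
    deterministically reproduces the same weights."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=data.shape[1])  # random init fixed by the seed
    for _ in range(steps):
        grad = data.T @ (data @ w - labels) / len(data)  # MSE gradient
        w -= lr * grad
    return w

# Hypothetical "training data" -- in a real model release, this is
# the part that is usually withheld.
rng = np.random.default_rng(42)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.5 * rng.normal(size=64)

w1 = train(X, y)
w2 = train(X, y)            # identical code + data -> identical weights
w3 = train(X[:32], y[:32])  # different data -> different weights

print(np.allclose(w1, w2))  # True
print(np.allclose(w1, w3))
```

The same logic scales up: without the data (and the curation/filtering pipeline), you cannot verify that published weights actually came from the claimed process.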

I get that some training data may be proprietary and cannot be released. But even in those scenarios, it would still be good to know what the data is, how it was curated or filtered (this greatly affects LLM performance), how it is weighted relative to other training data, and so forth. And a significant portion of the data used to train these models is not proprietary; in those cases the developers can simply link to the data elsewhere or release it themselves, which is what others have done.



There's no perfect analogy. It's far easier to usefully modify the weights of a model without the training data than it is to modify a binary executable without its source code.

I'd rather also have the data, for sure! But in terms of the useful things I can do with it, weights are closer to source code than they are to a binary blob.
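The claim that weights are usefully modifiable without the original training data is essentially a description of fine-tuning. A minimal sketch with a toy linear model in NumPy (the "pretrained" weights and the new task are invented for illustration): we never see the original training data, yet gradient descent on our own small dataset adapts the released weights and reduces the loss on the new task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are released pretrained weights; we never see the
# data that produced them (values are illustrative).
pretrained_w = np.array([1.0, -1.0, 0.5])

# A small task-specific dataset of our own.
X_new = rng.normal(size=(32, 3))
y_new = X_new @ np.array([1.2, -0.8, 0.9]) + 0.1 * rng.normal(size=32)

def mse(w):
    return float(np.mean((X_new @ w - y_new) ** 2))

w = pretrained_w.copy()
before = mse(w)
for _ in range(100):  # plain gradient descent on the new data only
    grad = X_new.T @ (X_new @ w - y_new) / len(X_new)
    w -= 0.1 * grad
after = mse(w)

print(after < before)  # True: fine-tuning improved the new-task loss
```

By contrast, doing anything comparably useful to a stripped binary executable without its source generally means disassembly and patching, which is far harder. That asymmetry is the sense in which weights sit between source and a blob.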




