
Isn't the whole point of the MoE architecture exactly this?

That you can individually train and improve smaller segments as necessary?



Generally you train all the experts simultaneously. The benefit of MoEs is that you get cheap inference because you only use the active expert parameters, which constitute a small fraction of the total parameter count. For example, DeepSeek R1 (which is especially sparse) only uses 1/18th of the total parameters per-query.
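
For intuition, here is a minimal sketch of a top-k gated MoE feed-forward layer. The class name, dimensions, and expert structure are invented for illustration; this is not DeepSeek's actual implementation, just the generic routing pattern:

  # Minimal sketch of a top-k gated MoE feed-forward layer (illustrative only;
  # names and sizes are invented, not DeepSeek's actual code).
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class MoELayer(nn.Module):
      def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
          super().__init__()
          self.top_k = top_k
          self.router = nn.Linear(d_model, n_experts)  # routing logits per token
          self.experts = nn.ModuleList([
              nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
              for _ in range(n_experts)
          ])

      def forward(self, x):                      # x: (n_tokens, d_model)
          logits = self.router(x)                # (n_tokens, n_experts)
          weights, idx = logits.topk(self.top_k, dim=-1)
          weights = F.softmax(weights, dim=-1)   # renormalize over chosen experts
          out = torch.zeros_like(x)
          # Only the top_k selected experts run for each token -- the other
          # experts' parameters are never touched, which is where the
          # "small fraction of total parameters" saving comes from.
          for k in range(self.top_k):
              for e in range(len(self.experts)):
                  mask = idx[:, k] == e
                  if mask.any():
                      out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
          return out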


> only uses 1/18th of the total parameters per-query.

only uses 1/18th of the total parameters per token. It may use a large fraction of them in a single query, since routing is done per token, not per query.
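
A toy simulation makes the distinction concrete (the expert count, sequence length, and sizes here are arbitrary, purely for illustration): each token activates only top_k experts, but the union over a whole query's tokens typically covers most of the experts.

  # Toy illustration (arbitrary sizes): each token activates only top_k experts,
  # but across a whole query the *union* of activated experts is much larger.
  import torch

  n_experts, top_k, seq_len, d_model = 64, 2, 200, 512
  router = torch.nn.Linear(d_model, n_experts)
  tokens = torch.randn(seq_len, d_model)          # stand-in for one query's tokens

  _, idx = router(tokens).topk(top_k, dim=-1)     # (seq_len, top_k) expert ids
  print(f"experts active per token: {top_k}/{n_experts}")
  print(f"experts touched by the whole query: {idx.unique().numel()}/{n_experts}")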


That's a good correction, thanks.


I think it's the exact opposite - you don't specifically train each 'expert' to be an SME at something. Each of the experts is a generalist but becomes better at portions of tasks in a distributed way. There is no 'best baker', but things evolve toward 'best applier of flour', 'best kneader', etc. I think explicitly domain-trained experts are pretty uncommon in modern schemes.


That's not entirely correct. Most MoEs right now are fully balanced, but there is the idea of a domain-expert MoE, where training benefits from fewer expert switches. https://arxiv.org/abs/2410.07490
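
For context, the "fully balanced" behaviour is usually enforced with an auxiliary load-balancing loss on the router. Below is a sketch of the Switch-Transformer-style version of that loss (my paraphrase of that common recipe, not the formulation from the linked paper):

  # Sketch of a Switch-Transformer-style load-balancing auxiliary loss
  # (a common recipe, not the method of the linked paper).
  import torch
  import torch.nn.functional as F

  def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 1) -> torch.Tensor:
      """router_logits: (n_tokens, n_experts). Encourages the router to spread
      tokens evenly instead of letting a few 'domain experts' dominate."""
      n_tokens, n_experts = router_logits.shape
      probs = F.softmax(router_logits, dim=-1)            # soft routing probabilities
      # Fraction of tokens actually dispatched to each expert (hard top-k assignment).
      assignment = torch.zeros_like(probs).scatter(
          1, router_logits.topk(top_k, dim=-1).indices, 1.0
      )
      tokens_per_expert = assignment.mean(dim=0)           # f_i in Switch notation
      mean_router_prob = probs.mean(dim=0)                 # P_i in Switch notation
      # Minimized when both are uniform (1 / n_experts for every expert).
      return n_experts * torch.sum(tokens_per_expert * mean_router_prob)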


Yes, explicitly trained experts were a thing for a little while, but not anymore. Yet another application of the Bitter Lesson.



