
Isn't the whole point of the MoE architecture exactly this?

That you can individually train and improve smaller segments as necessary?



Generally you train all the experts simultaneously. The benefit of MoEs is that you get cheap inference because you only use the active expert parameters, which constitute a small fraction of the total parameter count. For example, DeepSeek R1 (which is especially sparse) only uses 1/18th of the total parameters per-query.
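
For intuition, here is a minimal sketch of a top-k gated MoE feed-forward layer. The class name, dimensions, and expert structure are invented for illustration; this is not DeepSeek's actual implementation, just the generic routing pattern:

  # Minimal sketch of a top-k gated MoE feed-forward layer (illustrative only;
  # names and sizes are invented, not DeepSeek's actual code).
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class MoELayer(nn.Module):
      def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
          super().__init__()
          self.top_k = top_k
          self.router = nn.Linear(d_model, n_experts)  # routing logits per token
          self.experts = nn.ModuleList([
              nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
              for _ in range(n_experts)
          ])

      def forward(self, x):                      # x: (n_tokens, d_model)
          logits = self.router(x)                # (n_tokens, n_experts)
          weights, idx = logits.topk(self.top_k, dim=-1)
          weights = F.softmax(weights, dim=-1)   # renormalize over chosen experts
          out = torch.zeros_like(x)
          # Only the top_k selected experts run for each token -- the other
          # experts' parameters are never touched, which is where the
          # "small fraction of total parameters" saving comes from.
          for k in range(self.top_k):
              for e in range(len(self.experts)):
                  mask = idx[:, k] == e
                  if mask.any():
                      out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
          return out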


> only uses 1/18th of the total parameters per-query.

only uses 1/18th of the total parameters per token. It may use a large fraction of them in a single query, since routing is done per token, not per query.
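
A toy simulation makes the distinction concrete (the expert count, sequence length, and sizes here are arbitrary, purely for illustration): each token activates only top_k experts, but the union over a whole query's tokens typically covers most of the experts.

  # Toy illustration (arbitrary sizes): each token activates only top_k experts,
  # but across a whole query the *union* of activated experts is much larger.
  import torch

  n_experts, top_k, seq_len, d_model = 64, 2, 200, 512
  router = torch.nn.Linear(d_model, n_experts)
  tokens = torch.randn(seq_len, d_model)          # stand-in for one query's tokens

  _, idx = router(tokens).topk(top_k, dim=-1)     # (seq_len, top_k) expert ids
  print(f"experts active per token: {top_k}/{n_experts}")
  print(f"experts touched by the whole query: {idx.unique().numel()}/{n_experts}")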


That's a good correction, thanks.


I think it's the exact opposite - you don't specifically train each 'expert' to be an SME at something. Each of the experts is a generalist but becomes better at portions of tasks in a distributed way. There is no 'best baker', but things evolve toward 'best applier of flour', 'best kneader', etc. I think explicitly domain-trained experts are pretty uncommon in modern schemes.


That's not entirely correct. Most MoEs right now are fully balanced, but there is the idea of a domain-expert MoE, where training benefits from fewer expert switches. https://arxiv.org/abs/2410.07490
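
For context, the "fully balanced" behaviour is usually enforced with an auxiliary load-balancing loss on the router. Below is a sketch of the Switch-Transformer-style version of that loss (my paraphrase of that common recipe, not the formulation from the linked paper):

  # Sketch of a Switch-Transformer-style load-balancing auxiliary loss
  # (a common recipe, not the method of the linked paper).
  import torch
  import torch.nn.functional as F

  def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 1) -> torch.Tensor:
      """router_logits: (n_tokens, n_experts). Encourages the router to spread
      tokens evenly instead of letting a few 'domain experts' dominate."""
      n_tokens, n_experts = router_logits.shape
      probs = F.softmax(router_logits, dim=-1)            # soft routing probabilities
      # Fraction of tokens actually dispatched to each expert (hard top-k assignment).
      assignment = torch.zeros_like(probs).scatter(
          1, router_logits.topk(top_k, dim=-1).indices, 1.0
      )
      tokens_per_expert = assignment.mean(dim=0)           # f_i in Switch notation
      mean_router_prob = probs.mean(dim=0)                 # P_i in Switch notation
      # Minimized when both are uniform (1 / n_experts for every expert).
      return n_experts * torch.sum(tokens_per_expert * mean_router_prob)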


Yes, explicitly trained experts were a thing for a little while, but not anymore. Yet another application of the Bitter Lesson.



