I believe the main draw of MoE models is that the experts don't all need to be in memory at once; they can be swapped in based on context. In aggregate you get the performance of a much larger model (384B parameters) while using much less memory than such a model would otherwise require. If you had enough memory it could all be loaded, but it doesn't need to be.
"Expert" in MoE has no bearing on what you might think of as a human expert.
It's not like there is one expert that is proficient at science, and one that is proficient in history.
For a given inference request, you're likely to activate all the experts at various points. But for each individual forward pass (e.g. each token), you are only activating a few.
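Roughly, the per-token routing looks something like this. A toy top-k router sketch (hypothetical layer names and sizes, not this model's actual code):

```python
# Minimal sketch of top-k expert routing in PyTorch (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                             # x: (tokens, d_model)
        scores = self.router(x)                       # (tokens, n_experts)
        weights, idx = torch.topk(scores, self.k, dim=-1)  # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# Each token only runs k experts, but over a whole sequence the routing
# tends to touch nearly every expert at some point.
x = torch.randn(16, 64)                               # 16 tokens
print(TinyMoE()(x).shape)                             # torch.Size([16, 64])
```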
Wrong. MoE models like this one usually choose a different, unpredictable mix of experts for each token, so you need all the parameters in memory at once.
It reduces the number of parameters that need to be moved from memory to the compute chip for each token, not from disk to memory.
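Back-of-the-envelope version of that tradeoff (the numbers below are made up for illustration, not this model's actual specs):

```python
# All weights must be resident in memory, but only the "active" subset
# crosses the memory -> compute boundary for each token.

total_params    = 384e9   # hypothetical total parameter count
active_params   = 40e9    # hypothetical parameters activated per token
bytes_per_param = 1       # e.g. 8-bit quantization

memory_needed_gb  = total_params  * bytes_per_param / 1e9
read_per_token_gb = active_params * bytes_per_param / 1e9

print(f"weights resident in memory:        ~{memory_needed_gb:.0f} GB")
print(f"weights streamed per token:        ~{read_per_token_gb:.0f} GB")
# At, say, 20 tokens/s the weight traffic alone is roughly this much
# bandwidth, far less than streaming all 384 GB for every token. MoE
# helps bandwidth per token; it does not shrink the capacity you need.
print(f"approx. weight bandwidth @ 20 tok/s: ~{read_per_token_gb * 20:.0f} GB/s")
```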