Wrong. MoE models like this one usually choose a different, unpredictable mix of experts for each token, so you need all parameters in memory at once.
It lessens the number of parameters that need to be moved from memory to compute chip for each token, not from disk to memory.
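Here's a minimal sketch of top-k MoE routing to make that concrete (the expert count, dimensions, and router here are made-up for illustration, not this model's actual architecture): each token gets its own top-k expert set, so only a few experts' weights cross the memory-to-compute bus per token, but across a batch the union of selected experts quickly covers all of them, which is why everything has to stay resident.

```python
# Hypothetical top-k MoE layer; shapes and expert count are illustrative.
import torch

n_experts, top_k, d_model, n_tokens = 64, 2, 512, 16

router = torch.nn.Linear(d_model, n_experts)           # gating network
experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]

x = torch.randn(n_tokens, d_model)                     # one token per row
logits = router(x)                                     # [n_tokens, n_experts]
weights, idx = torch.topk(logits.softmax(-1), top_k)   # per-token expert choice

out = torch.zeros_like(x)
for t in range(n_tokens):
    for w, e in zip(weights[t], idx[t]):
        # Only top_k experts' weights are read for *this* token,
        # but which top_k it is differs token to token.
        out[t] += w * experts[int(e)](x[t])

print(idx)  # different expert sets per token; union spans most experts
```

Per token you touch top_k/n_experts of the FFN weights (bandwidth win), but since the router's choices are input-dependent you can't predict which experts to prefetch, so paging experts from disk per token would stall everything.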