I believe the main draw of MoE models is that the experts don't all need to be in memory at once; they can be swapped in based on context. In aggregate you get the performance of a much larger model (384B parameters) while using much less memory than such a model would otherwise require. If you had enough memory it could all be loaded, but it doesn't need to be.
"Expert" in MoE has no bearing on what you might think of as a human expert.
It's not like there is one expert that is proficient at science, and one that is proficient in history.
For a given inference request, you're likely to activate all the experts at various points. But for each individual forward pass (e.g. each token), you are only activating a few.
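Roughly, the per-token routing looks something like this. A toy top-k router sketch (hypothetical layer names and sizes, not this model's actual code):

```python
# Minimal sketch of top-k expert routing in PyTorch (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                             # x: (tokens, d_model)
        scores = self.router(x)                       # (tokens, n_experts)
        weights, idx = torch.topk(scores, self.k, dim=-1)  # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# Each token only runs k experts, but over a whole sequence the routing
# tends to touch nearly every expert at some point.
x = torch.randn(16, 64)                               # 16 tokens
print(TinyMoE()(x).shape)                             # torch.Size([16, 64])
```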
Wrong. MoE models like this one usually choose a different, unpredictable mix of experts for each token, so you need all the parameters in memory at once.
It reduces the number of parameters that need to be moved from memory to the compute chip for each token, not from disk to memory.
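Back-of-the-envelope version of that tradeoff (the numbers below are made up for illustration, not this model's actual specs):

```python
# All weights must be resident in memory, but only the "active" subset
# crosses the memory -> compute boundary for each token.

total_params    = 384e9   # hypothetical total parameter count
active_params   = 40e9    # hypothetical parameters activated per token
bytes_per_param = 1       # e.g. 8-bit quantization

memory_needed_gb  = total_params  * bytes_per_param / 1e9
read_per_token_gb = active_params * bytes_per_param / 1e9

print(f"weights resident in memory:        ~{memory_needed_gb:.0f} GB")
print(f"weights streamed per token:        ~{read_per_token_gb:.0f} GB")
# At, say, 20 tokens/s the weight traffic alone is roughly this much
# bandwidth, far less than streaming all 384 GB for every token. MoE
# helps bandwidth per token; it does not shrink the capacity you need.
print(f"approx. weight bandwidth @ 20 tok/s: ~{read_per_token_gb * 20:.0f} GB/s")
```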