My understanding is that, while all 8B parameters are loaded into memory, each token's inference step only selects and uses about 2B of them - so tokens are produced faster because less computation is needed per token.
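For what it's worth, here's a toy sketch of that mental model as a generic top-k mixture-of-experts layer (all sizes and names here are made up for illustration, not the actual model's code): every expert's weights sit in memory, but only the top-k experts the router picks actually run for a given token.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256
num_experts, top_k = 8, 2          # all 8 experts live in memory, only 2 run per token

# Every expert's weights are allocated up front (the "all parameters in memory" part).
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(num_experts)
]
router = rng.standard_normal((d_model, num_experts)) * 0.02

def moe_layer(x):
    """x: (d_model,) hidden state for one token."""
    logits = x @ router                       # one routing score per expert
    chosen = np.argsort(logits)[-top_k:]      # keep only the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                  # softmax over the chosen experts only

    out = np.zeros(d_model)
    for w, idx in zip(weights, chosen):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)   # only these experts do any compute
    return out

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)                 # (64,)
```

So the memory footprint is that of all experts, but the per-token FLOPs scale with the top-k active ones (at least, that's how I understand the standard top-k routing scheme).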
Hoping someone will correct me if that's not the right mental model!