My understanding is that, while all 8B parameters are loaded into memory, only about 2B of them are selected and used for each token's inference step - so tokens are produced faster because less computation is needed.

Hoping someone will correct me if that's not the right mental model!
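That's roughly the mental model behind Mixture-of-Experts routing. Below is a minimal, hypothetical sketch (the layer sizes, expert count, and top-k value are illustrative assumptions, not the actual model's architecture): every expert is allocated in memory, but a small router picks only the top-k experts for each token, so per-token compute is a fraction of the total parameter count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy Mixture-of-Experts feed-forward layer (illustrative sizes)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # All experts are allocated up front (the "all 8B loaded in memory" part).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                                 # x: (tokens, d_model)
        scores = self.router(x)                           # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)    # each token keeps top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts actually run (the "~2B active per token" part).
        for e, expert in enumerate(self.experts):
            mask = (idx == e)
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

x = torch.randn(4, 512)           # 4 tokens
print(MoELayer()(x).shape)        # torch.Size([4, 512])
```

In this sketch the memory footprint is that of all n_experts, but each token's forward pass only touches top_k of them, which is why active-parameter count (and latency), not total parameter count, governs per-token compute.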


