Arctic dev here. Yes, keeping all experts in memory is the recommendation here, and understandably that is a barrier for some. But once you have one H100 node or two (GPU middle class, I guess...?), a few things to note:
1. FP6/FP8 inference is pretty good. How-to for a single node: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main... (vLLM support coming soon; see the sketch after this list for roughly what that path could look like.)
2. The small number of activated parameters shines in the batch-inference case for cloud providers.
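For point 1, here is a rough sketch of what single-node FP8 serving could look like once vLLM support lands. The model id, quantization flag, and parallelism degree are my assumptions, not the official recipe; the linked repo has the currently supported how-to.

```python
# Hedged sketch: FP8 serving of Arctic on one 8x H100 node via vLLM,
# assuming the checkpoint id and fp8 path work for this architecture.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Snowflake/snowflake-arctic-instruct",  # HF checkpoint id (assumption)
    quantization="fp8",          # FP8 weights to fit the model on one node
    tensor_parallel_size=8,      # shard across the 8 GPUs of a single H100 node
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["Briefly, why does MoE help batched serving?"], params)
print(out[0].outputs[0].text)
```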
> 2. The small number of activated parameters shines in the batch-inference case for cloud providers.
Could you elaborate on that, please? Batch inference activates pretty much all the experts, since each token in every sequence of a batch could hit a different expert. So at BS=128 you're not really getting a sparsity win.
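To put numbers on that tension (a sketch, assuming Arctic-style routing of 128 experts with top-2 gating and idealized uniform token-to-expert assignment): essentially every expert is touched by at least one token in a big batch, even though each individual token still only runs 2 of the 128 experts.

```python
# Expert-coverage back-of-the-envelope under assumed uniform routing.
E, k = 128, 2                    # 128 experts, top-2 gating (Arctic-style assumption)
tokens = 128 * 512               # batch of 128 sequences, ~512 tokens each

p_idle = (1 - k / E) ** tokens   # chance a given expert sees no token in the batch
print(f"experts touched by the batch: ~{100 * (1 - p_idle):.2f}%")  # effectively 100%

# ...while each token still runs only k of E experts' worth of FLOPs:
print(f"expert FLOPs per token: {k / E:.1%} of running every expert")
```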
This is essentially 400B params, so roughly 400GB of weights in FP8. Comparing to Grok's ~320B model, which requires ~320GB of VRAM in int8, I think what the OP meant is actually 8 H100s.
Which is... a lot, to say the least.
And all of that optimization is for latency, not throughput, because with 8 H100s you could just as easily host 4 replicas of a 70B model.
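Weights-only arithmetic behind those GPU counts (a sketch: it uses the sizes mentioned in this thread plus the model card's ~480B total for Arctic, and ignores KV cache, activations, and serving overhead, which is why a full 8-GPU node ends up being the practical target):

```python
import math

# Back-of-the-envelope VRAM sizing: weights only, no KV cache or framework overhead.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}
H100_GB = 80

def h100s_needed(params_b: float, dtype: str) -> tuple[float, int]:
    gb = params_b * BYTES_PER_PARAM[dtype]   # billions of params * bytes/param ~= GB
    return gb, math.ceil(gb / H100_GB)

for name, params_b, dtype in [
    ("Arctic (~480B total)",     480, "fp8"),   # model card total; rounded to ~400B above
    ("Grok-1 (~314B)",           314, "int8"),
    ("4x 70B dense replicas", 4 * 70, "fp16"),
]:
    gb, n = h100s_needed(params_b, dtype)
    print(f"{name:24s} {dtype:5s} ~{gb:4.0f} GB -> {n} H100(s), weights only")
```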