Arctic dev here. Yes, keeping all experts in memory is the recommendation here, and understandably that is a barrier for some. But once you have one H100 node or two (GPU middle class, I guess...?), a few things to note:
1. FP6/FP8 inference is pretty good. How-to for a single node: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main... (vLLM support coming soon; see the sketch after this list for roughly what that path could look like.)
2. The small number of activated parameters shines in the batch-inference case for cloud providers.
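For point 1, here is a rough sketch of what single-node FP8 serving could look like once vLLM support lands. The model id, quantization flag, and parallelism degree are my assumptions, not the official recipe; the linked repo has the currently supported how-to.

```python
# Hedged sketch: FP8 serving of Arctic on one 8x H100 node via vLLM,
# assuming the checkpoint id and fp8 path work for this architecture.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Snowflake/snowflake-arctic-instruct",  # HF checkpoint id (assumption)
    quantization="fp8",          # FP8 weights to fit the model on one node
    tensor_parallel_size=8,      # shard across the 8 GPUs of a single H100 node
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["Briefly, why does MoE help batched serving?"], params)
print(out[0].outputs[0].text)
```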
> 2. The small number of activated parameters shines in the batch-inference case for cloud providers.
Could you elaborate on that, please? Batch inference activates pretty much all the experts, since each token in every sequence of a batch could hit a different expert. So at BS=128 you're not really getting a sparsity win.
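To put numbers on that tension (a sketch, assuming Arctic-style routing of 128 experts with top-2 gating and idealized uniform token-to-expert assignment): essentially every expert is touched by at least one token in a big batch, even though each individual token still only runs 2 of the 128 experts.

```python
# Expert-coverage back-of-the-envelope under assumed uniform routing.
E, k = 128, 2                    # 128 experts, top-2 gating (Arctic-style assumption)
tokens = 128 * 512               # batch of 128 sequences, ~512 tokens each

p_idle = (1 - k / E) ** tokens   # chance a given expert sees no token in the batch
print(f"experts touched by the batch: ~{100 * (1 - p_idle):.2f}%")  # effectively 100%

# ...while each token still runs only k of E experts' worth of FLOPs:
print(f"expert FLOPs per token: {k / E:.1%} of running every expert")
```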
This is essentially 400B params, so roughly 400GB of weights in FP8. Comparing to Grok's ~320B model, which requires ~320GB of VRAM in int8, I think what the OP meant is actually 8 H100s.
Which is... a lot, to say the least.
And all of that optimization is for latency, not throughput, because with 8 H100s you could just as easily host 4 replicas of a 70B model.
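Weights-only arithmetic behind those GPU counts (a sketch: it uses the sizes mentioned in this thread plus the model card's ~480B total for Arctic, and ignores KV cache, activations, and serving overhead, which is why a full 8-GPU node ends up being the practical target):

```python
import math

# Back-of-the-envelope VRAM sizing: weights only, no KV cache or framework overhead.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}
H100_GB = 80

def h100s_needed(params_b: float, dtype: str) -> tuple[float, int]:
    gb = params_b * BYTES_PER_PARAM[dtype]   # billions of params * bytes/param ~= GB
    return gb, math.ceil(gb / H100_GB)

for name, params_b, dtype in [
    ("Arctic (~480B total)",     480, "fp8"),   # model card total; rounded to ~400B above
    ("Grok-1 (~314B)",           314, "int8"),
    ("4x 70B dense replicas", 4 * 70, "fp16"),
]:
    gb, n = h100s_needed(params_b, dtype)
    print(f"{name:24s} {dtype:5s} ~{gb:4.0f} GB -> {n} H100(s), weights only")
```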