1 H100 is only 80GB of HBM. I guess you mean a server with 4xH100 is 1 node?

karmasimida · on April 25, 2024

This is essentially 400B params. With FP8, and comparing to Grok's 320B model, which requires 320GB of VRAM in int4, I think what the OP meant is actually 8 H100s.

Which is ... a lot, to say the least.

And all of the optimization is for latency, not throughput, because with 8 H100s you could easily host 4 replicas of a 70B model.
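For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. It counts weight memory only (no KV cache, activations, or framework overhead, so real deployments need more headroom), and it assumes the 70B replicas are served in FP16, which the comment does not actually state. The helper names `weight_gb` and `gpus_needed` are made up for illustration.

```python
import math

H100_HBM_GB = 80   # per-GPU HBM capacity, as stated upthread
GPUS_PER_NODE = 8  # the 8x H100 node discussed in this thread


def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (weights only, no KV cache)."""
    return params_billion * bits_per_param / 8


def gpus_needed(params_billion: float, bits_per_param: int,
                gpu_gb: int = H100_HBM_GB) -> int:
    """Minimum number of GPUs whose combined HBM holds the weights."""
    return math.ceil(weight_gb(params_billion, bits_per_param) / gpu_gb)


# ~400B params at FP8 (1 byte per param): 400 GB of weights.
big_model_gb = weight_gb(400, 8)
big_model_gpus = gpus_needed(400, 8)

# Assumption: a 70B model in FP16 (2 bytes per param) is 140 GB, so 2 GPUs each.
replica_gpus = gpus_needed(70, 16)
replicas_per_node = GPUS_PER_NODE // replica_gpus

print(f"400B @ FP8: {big_model_gb:.0f} GB, at least {big_model_gpus} H100s")
print(f"70B @ FP16: {replica_gpus} H100s per replica, "
      f"{replicas_per_node} replicas per node")
```

Under those assumptions the 400B weights alone need at least 5 H100s, so in practice the whole 8-GPU node serves one model, while the same node would fit 4 separate 70B replicas.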

kristianp · on April 25, 2024

Thanks for the correction, there are indeed 8x H100 nodes. https://developer.nvidia.com/blog/introducing-nvidia-hgx-h10...