This is essentially 400B params. With FP8, and comparing to Grok's 320B model, which requires 320GB of VRAM in int4, I think what the OP meant is actually 8 H100s.
Which is... a lot, to say the least.
And all of the optimization is for latency, not throughput, because with 8 H100s you could easily host 4 replicas of a 70B model.
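
For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. It assumes 80 GB per H100, counts weights only, and adds a 20% headroom for KV cache and activations; that headroom figure and the helper names are my own assumptions, not measurements.

```python
import math

H100_GB = 80  # memory per H100 (80 GB variant)

# Bits per parameter for each precision.
BITS_PER_PARAM = {"fp16": 16, "fp8": 8}


def weight_gb(params_billion, dtype):
    """Weights-only memory in GB: 1e9 params * bytes per param / 1e9."""
    return params_billion * BITS_PER_PARAM[dtype] / 8


def gpus_needed(params_billion, dtype, headroom_pct=20):
    """Minimum GPU count, with an assumed headroom for KV cache and activations."""
    total_gb = weight_gb(params_billion, dtype) * (100 + headroom_pct) / 100
    return math.ceil(total_gb / H100_GB)


def replicas_on_node(params_billion, dtype, node_gpus=8):
    """How many independent copies of the model fit on one node."""
    return node_gpus // gpus_needed(params_billion, dtype)


print(weight_gb(400, "fp8"))        # weights alone for a 400B model at FP8
print(gpus_needed(400, "fp8"))      # minimum GPUs under the headroom assumption
print(replicas_on_node(70, "fp8"))  # 70B replicas on an 8-GPU node
```

Under these assumptions the 400B model needs at least 6 GPUs at FP8, and since tensor-parallel setups usually want a power-of-two GPU count, that rounds up to a full 8x H100 node. The same node fits 4 replicas of a 70B model at FP8 (2 GPUs each), which is the throughput-oriented alternative.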