
The model is insane, but could this realistically be used in production?


Why not? I'm curious whether you have any specific roadblocks in mind. OpenAI makes their large models available through an API, which removes any issues with model hosting and operations.


Latency, mostly.

The GPT-3 APIs were very slow on release, and even with the current APIs it still takes a couple seconds to get results from the 175B model.


I too am curious what kind of hardware resources are needed to run the model once it is trained.


Yes. You don’t need the model in RAM; NVMe disks are fine.
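A minimal sketch of that idea, using numpy's memmap so the OS pages weights in from NVMe only when they're touched (the file name, dtype trick, and shapes are hypothetical, just for illustration):

    import numpy as np

    # Hypothetical raw weight file kept on NVMe; bfloat16 stored as raw 16-bit words
    # since numpy has no native bfloat16 dtype.
    weights = np.memmap("model_weights.bf16.bin", dtype=np.uint16,
                        mode="r", shape=(540_000_000_000,))

    # Only the slice a given layer needs is actually read off disk.
    layer_slice = weights[0:1_000_000]        # e.g. the first ~1M parameters
    layer = np.asarray(layer_slice).copy()    # pulls just those pages into RAM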


That would have very slow inference latency if you had to read the model off disk for every token.


540B parameters means ~1TB of weights (assuming bfloat16, i.e. 2 bytes per parameter). Quadruple that for other associated stuff, and you'd need a machine with ~4TB of RAM.
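Back-of-the-envelope in Python (the 4x overhead factor is the rough guess above, not a measured number):

    params = 540e9                 # 540B parameters
    bytes_per_param = 2            # bfloat16
    weights_tb = params * bytes_per_param / 1e12
    print(weights_tb)              # ~1.08 TB just for the weights
    print(weights_tb * 4)          # ~4.3 TB with the assumed 4x for everything else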


Right - and even if you did happen to have a machine with 4TB of RAM - what kind of latency would you get on a single machine running this as a service? How many machines would you need for Google Translate performance?

Doesn't seem like you can run this as a service, yet.


The total memory of the model is less important than the memory needed to compute one batch. I’ve worked with recommendation models used in serving that were 10ish terabytes. The simple trick was that most of the memory was embeddings, and only a small subset of embeddings were needed to do inference for one batch. If you fetch those embeddings as if they were features, you can run very large models on normalish compute. You never need to load the entire model into RAM at once.
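Roughly, that pattern looks like this (names like embedding_store and dense_model are hypothetical, not from any particular serving stack):

    # Only the embedding rows referenced by this batch are fetched from the
    # (remote or on-disk) parameter store; the dense part fits in normal RAM.
    def forward(batch_ids, embedding_store, dense_model):
        needed = sorted(set(batch_ids))            # unique rows this batch touches
        rows = embedding_store.lookup(needed)      # fetch just those rows, like features
        table = dict(zip(needed, rows))
        batch_embeddings = [table[i] for i in batch_ids]
        return dense_model(batch_embeddings)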

Another trick you can use is to load only some layers of the model into RAM at a time (with prefetching to minimize stalls).
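Something like this, as an illustrative sketch (a real implementation would overlap I/O and compute more aggressively, e.g. with double buffering):

    import concurrent.futures

    def run_layer_streamed(layer_files, load_layer, apply_layer, x):
        # Keep only one layer in RAM at a time, prefetching the next from disk
        # while the current layer is computing.
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
            pending = pool.submit(load_layer, layer_files[0])
            for i in range(len(layer_files)):
                layer = pending.result()                    # wait for current layer
                if i + 1 < len(layer_files):
                    pending = pool.submit(load_layer, layer_files[i + 1])  # prefetch next
                x = apply_layer(layer, x)                   # compute overlaps the prefetch
        return x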

Or, if you are Google, enjoy the fact that TPUs have a silly amount of RAM. TPU pods have a ton of it.



