
The model is insane, but could this realistically be used in production?


Why not? I'm curious whether you have any specific roadblocks in mind. OpenAI makes their large models available through an API, which removes any issues with model hosting and operations.


Latency, mostly.

The GPT-3 APIs were very slow on release, and even with the current APIs it still takes a couple seconds to get results from the 175B model.


I too am curious what kind of hardware resources are needed to run the model once it is trained.


Yes. You don’t need the model in RAM; NVMe disks are fine.
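A minimal sketch of that idea, using numpy's memmap so the OS pages weights in from NVMe only when they're touched (the file name, dtype trick, and shapes are hypothetical, just for illustration):

    import numpy as np

    # Hypothetical raw weight file kept on NVMe; bfloat16 stored as raw 16-bit words
    # since numpy has no native bfloat16 dtype.
    weights = np.memmap("model_weights.bf16.bin", dtype=np.uint16,
                        mode="r", shape=(540_000_000_000,))

    # Only the slice a given layer needs is actually read off disk.
    layer_slice = weights[0:1_000_000]        # e.g. the first ~1M parameters
    layer = np.asarray(layer_slice).copy()    # pulls just those pages into RAM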


That would have very slow inference latency if you had to read the model off disk for every token.


540B parameters means ~1TB of weights (assuming bfloat16, i.e. 2 bytes per parameter). Quadruple that for other associated stuff, and you'd need a machine with ~4TB of RAM.
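Back-of-the-envelope in Python (the 4x overhead factor is the rough guess above, not a measured number):

    params = 540e9                 # 540B parameters
    bytes_per_param = 2            # bfloat16
    weights_tb = params * bytes_per_param / 1e12
    print(weights_tb)              # ~1.08 TB just for the weights
    print(weights_tb * 4)          # ~4.3 TB with the assumed 4x for everything else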


Right - and even if you did happen to have a machine with 4TB of RAM - what kind of latency would you get on a single machine running this as a service? How many machines would you need for Google Translate performance?

Doesn't seem like you can run this as a service, yet.


The total memory of the model is less important than the memory needed to compute one batch. I’ve worked with recommendation models used in serving that were 10ish terabytes. The simple trick was that most of the memory was embeddings, and only a small subset of embeddings were needed to do inference for one batch. If you fetch those embeddings as if they were features, you can run very large models on normalish compute. You never need to load the entire model into RAM at once.
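Roughly, that pattern looks like this (names like embedding_store and dense_model are hypothetical, not from any particular serving stack):

    # Only the embedding rows referenced by this batch are fetched from the
    # (remote or on-disk) parameter store; the dense part fits in normal RAM.
    def forward(batch_ids, embedding_store, dense_model):
        needed = sorted(set(batch_ids))            # unique rows this batch touches
        rows = embedding_store.lookup(needed)      # fetch just those rows, like features
        table = dict(zip(needed, rows))
        batch_embeddings = [table[i] for i in batch_ids]
        return dense_model(batch_embeddings)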

Another trick you can use is to load only some layers of the model into RAM at a time (with prefetching to minimize stalls).
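Something like this, as an illustrative sketch (a real implementation would overlap I/O and compute more aggressively, e.g. with double buffering):

    import concurrent.futures

    def run_layer_streamed(layer_files, load_layer, apply_layer, x):
        # Keep only one layer in RAM at a time, prefetching the next from disk
        # while the current layer is computing.
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
            pending = pool.submit(load_layer, layer_files[0])
            for i in range(len(layer_files)):
                layer = pending.result()                    # wait for current layer
                if i + 1 < len(layer_files):
                    pending = pool.submit(load_layer, layer_files[i + 1])  # prefetch next
                x = apply_layer(layer, x)                   # compute overlaps the prefetch
        return x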

Or, if you are Google, enjoy the fact that TPUs have a silly amount of RAM. TPU pods have a ton of it.



