I was considering getting an RTX 5090 to run inference on some LLM models, but n...

apitman · 2025-08-28T02:38:09 1756348689

If you want to run small models fast get the 5090. If you want to run large models slow get the Spark. If you want to run small models slow get a used MI50. If you want to run large models fast get a lot more money.

Gracana · 2025-08-28T18:39:10 1756406350

You might be able to do "large models slow" better than the spark with a 5090 and CPU offload, so long as you stick with MoE architectures. With the kv cache and shared parts of the model on GPU and all of the experts on CPU, it can work pretty well. I'm able to run ~400GB models at 10 tps with some A4000s and a bunch of RAM. That's on a Xeon W system with poor practical memory bandwidth (~190GB/s), you can do better with EPYC.

Apes · 2025-08-28T01:32:47 1756344767

RTX 5090 is about as good as it gets for home use. Its inference speeds are extremely fast.

The limiting factor is going to be the VRAM on the 5090, but nvidia intentionally makes trying to break the 32GB barrier extremely painful - they want companies to buy their $20,000 GPUs to run inference for larger models.

skhameneh · 2025-08-28T02:46:59 1756349219

RTX 5090 for running smaller models.

Then the RTX Pro 6000 for running a little bit larger models (96gb VRAM, but only ~15-20% more perf than 5090).

Some suggest Apple Silicon only for running larger models on a budget because of the unified memory, but the performance won't compare.

BoorishBears · 2025-08-28T01:19:29 1756343969

No. These are practically useless for AI.

Their prompt processing speeds are absolutely abysmal: if you're trying to tinker from time to time, a GPU like a 5090 or renting GPUs is a much better option.

If you're just trying to prep for impending mainstream AI applications, few will be targeting this form factor: it's both too strong compared to mainstream hardware, and way too weak compared to dedicated AI-focused accelerators.

-

I'll admit I'm taking a less nuanced take than some would prefer, but I'm also trying to be direct: this is not ever going to be a better option than a 5090.

aurareturn · 2025-08-28T01:22:05 1756344125

  Their prompt processing speeds are absolutely abysmal

They are not. This is Blackwell with Tensor cores. Bandwidth is the problem here.

BoorishBears · 2025-08-28T01:28:12 1756344492

They're abysmal compared to anything dedicated at any reasonable batch size because of both bandwidth and compute, not sure why you're wording this like it disagrees with what I said.

I've run inference workloads on a GH200 which is an entire H100 attached to an ARM processor and the moment offloading is involved speeds tank to Mac Mini-like speeds, which is similarly mostly a toy when it comes to AI.

aurareturn · 2025-08-28T01:36:11 1756344971

Again, prompt processing isn't the major problem here. It's bandwidth. 256GB/s bandwidth (maybe ~210 in real world) limits the tokens per second well before prompt processing.

Not entirely sure how your ARM statement matters here. This is unified memory.

BoorishBears · 2025-08-28T02:28:34 1756348114

[flagged]

aurareturn · 2025-08-28T06:25:49 1756362349

What model are you running?

I suspect that you’re running a very large model like DeepSeek in coherent memory?

Keep in mind that this little DGX only has 128GB which means it can run fairly small models such as qwen3 coder where prompt processing is not an issue.

I’m not doubting your experience with GH200 but it doesn’t seem relevant here because the bandwidth for Spark is the bottleneck well before the prompt processing.

Y_Y · 2025-08-28T02:54:10 1756349650

I like the cut of your jib and your experience matches mine, but without real numbers this is all just piss in the wind (as far as online discussions go).

BoorishBears · 2025-08-28T03:14:28 1756350868

You're right, it's unfortunate I didn't keep the benchmarks around: I benchmark a lot of configurations and providers for my site and have a script I typically run that produces graphs for various batch sizes (https://ibb.co/0RZ78hMc)

The performance with offloading was just so bad I didn't even bother proceeding to the benchmark (without offloading you get typical H100 speeds)