
The mainstream options seem to be

Ryzen AI Max 395+, ~120 tops (fp8?), 128GB RAM, $1999

Nvidia DGX Spark, ~1000 tops fp4, 128GB RAM, $3999

Mac Studio max spec, ~120 tflops (fp16?), 512GB RAM, 3x bandwidth, $9499

The DGX Spark appears to potentially offer the most tokens per second, but it's less useful/less of a value as an everyday PC.



RDNA3 CUs do not have FP8 support, and their INT8 runs at the same speed as FP16, so Strix Halo's theoretical max is basically 60 TFLOPS no matter how you slice it (well, it has double-rate INT4, but I'm unclear on how generally useful that is):

    512 ops/clock/CU * 40 CU * 2.9e9 clock / 1e12 = 59.392 FP16 TFLOPS
Note: even with all my manual compilation bells and whistles and the latest TheRock ROCm builds, the best I've gotten mamf-finder up to is about 35 TFLOPS, which is still not amazing efficiency (most Nvidia cards hit 70-80% of theoretical), although it's a huge improvement over the single-digit TFLOPS you might get out of the box.
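
Putting the measured number against that theoretical peak (a quick back-of-the-envelope using only the figures quoted above):

    peak_tflops = 512 * 40 * 2.9e9 / 1e12    # 59.392 FP16 TFLOPS, same math as above
    measured_tflops = 35                     # best mamf-finder result so far
    print(f"{measured_tflops / peak_tflops:.0%}")   # ~59%, vs the 70-80% typical on Nvidia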

If you're not training, your inference speed will largely be limited by available memory bandwidth, so the Spark's token generation will be about the same as the 395's.
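
A back-of-the-envelope sketch of why (the bandwidth figure and rough model sizes are the ones floated elsewhere in this thread, quoted as roughly 256-275 GB/s for both boxes; real-world numbers come in below these ceilings):

    # Token generation is roughly memory bandwidth / bytes streamed per token,
    # since every active weight has to be read once per generated token.
    bandwidth_gb_s = 256        # ~256-bit LPDDR5X-8000
    dense_30b_q8_gb = 30        # dense 30B model at 8-bit
    moe_active_q8_gb = 3        # MoE like qwen3-coder-30b: ~3B active params at 8-bit

    print(bandwidth_gb_s / dense_30b_q8_gb)    # ~8.5 tok/s ceiling for a dense 30B
    print(bandwidth_gb_s / moe_active_q8_gb)   # ~85 tok/s ceiling for the 30B-A3B MoE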

On general utility, I will say that the 16 Zen 5 cores are impressive: they beat my 24-core EPYC 9274F in single- and multithreaded workloads by about 25%.


> Ryzen AI Max 395+, ~120 tops (fp8?), 128GB RAM, $1999

Just got my Framework PC last week. It's easy to set up to run LLMs locally - you have to use Fedora 42, though, because it has the latest drivers. It was super easy to get qwen3-coder-30b (8-bit quant) running in LMStudio at 36 tok/sec.


I'm pretty new to this, so if I wanted to benchmark my current hardware and compare it to your results, what would be the best way to do that?

I'm looking at going for a Framework Desktop and would like to know what kind of performance gain I'd get over my current hardware, which I so far only have a "feel" for from running Ollama and OpenWebUI, with no hard numbers.


What nobody seems to ever share is the context size and TTFT (time to first token). You can get a very good TPS number by using small prompts, even if the output is long. If you try to do any kind of agentic coding locally, where contexts are 7k+, local hardware completely falls over.

qwen-code (CLI) gives you something like 2k requests per day for free (and is fantastic), so unless you have a very specific use case, buying a system for local LLM use is not a good use of funds.

If you're in the market for a desktop PC anyway, and just want to tinker with LLMs, then the AMD systems are a fair value IMO, plus the drivers are open source so everything just works out of the box (with Vulkan, anyway).


> If you try to do any kind of agentic coding locally, where contexts are 7k+, local hardware completely falls over.

With my 5070 Ti + 2080 Ti I have Qwen 3 Coder 30B Q4_K_M running entirely on the GPUs with 16k context. Not great for larger code bases, but not nothing either.

When asking it to summarize llama-model-loader.cpp, which is about 12k tokens, the TTFT is ~13 seconds and generation speed is about 55 tok/sec.
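
For concreteness, the prompt-processing rate implied by those numbers (a rough division, ignoring generation startup overhead):

    prompt_tokens = 12_000
    ttft_s = 13
    print(prompt_tokens / ttft_s)   # ~920 tok/s prefill
    # at that rate a full 16k context takes roughly 17-18 s before the first output token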

So yeah, for local stuff it's pick any two of: large models, long contexts, and decent speed.


Yeah, that sounds decent for some one-shots. The unified memory systems can handle longer back-and-forth context chats, but at a slower speed (at least on AMD).

I find Qwen 3 Coder to be quite usable; I get around 20 TPS on my AMD AI 350 system, as long as the net-new context isn't too big.


I was looking at the 5060 Ti 16GB; it has about half the memory bandwidth of the 5070 Ti, but at half the price here. With four of them you'd have 64 GB of VRAM and still spend a lot less than on a 5090. They should get around 20-25 TPS for Qwen 3 Coder 30B, which is within usable range.
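
A rough scaling sketch from the 5070 Ti + 2080 Ti number upthread, assuming generation stays bandwidth-bound (it ignores the slower 2080 Ti in that setup and any multi-GPU overhead, so treat it as a ceiling):

    measured_tok_s = 55     # Qwen 3 Coder 30B Q4_K_M on the 5070 Ti + 2080 Ti setup above
    bw_ratio = 448 / 896    # 5060 Ti 16GB vs 5070 Ti memory bandwidth (GB/s)
    print(measured_tok_s * bw_ratio)    # ~27 tok/s, so 20-25 TPS across four cards seems plausible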

Need a big case tho or go bitcoin miner style.

Not seriously thinking about it, just playing around.


> If you're in the market for a desktop PC anyway, and just want to tinker with LLMs, then the AMD systems are a fair value

Yeah, this is why I bought it: to tinker with LLMs (and some more experimental ML algorithms like differential logic and bitnets). But it can also compile LLVM in a little under 7 minutes, and, though I didn't time it, it builds the riscv gcc toolchain very quickly as well. My current (soon to be previous) dev box took about an hour to compile LLVM (when it didn't fail linking due to running out of memory), so doing any kind of LLVM development or making changes to binutils was quite tedious.


My use cases are mostly for automation, and local-only is a must.

I currently use the GPU in my server for n8n and Home Assistant with small-ish tooling models that fit in my 8GB VRAM.

TTFT is pretty poor right now: I get 10+ seconds for the longer inputs from HA. n8n isn't too bad unless I'm asking it to handle a largish input, but that one is less time-sensitive as it's running things on schedules rather than when I need output.

Ideally I'd like to get Assistant responses in HA to under about 2s if possible.

I'm also looking for a new desktop at some point, but I don't want to use the same hardware; the inference GPU is in a server that's always on running "infrastructure" (Kubernetes, various pieces of software, NAS functionality, etc.). But I've always built desktops from components, ever since I was a wee child when a 1.44MB floppy was an upgrade, so a part of me is reluctant to switch to a mini-PC for that.

I might be convinced to get a Framework Desktop, though, if it'll do for Steam gaming on Linux, knowing that when I eventually need to upgrade it, it could supplement my server rack and be replaced entirely by a new model on the desktop, given that there's very little upgrade path other than replacing the entire mainboard.

No real interest in coding assistants, and running within my home network is an absolute must, which limits capability to "what's the best hardware I can afford?".


You could load up LMStudio on your current hardware, get qwen3-coder-30b (8-bit quant), and give it some coding tasks, something meaty (I had it create a recursive descent parser in C that parses the C programming language). At the end of its response it shows the tok/sec. I'm getting 36 tok/sec on the Framework running that model.
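
If you want a number that's easy to compare across machines, llama.cpp's llama-bench reports prompt-processing (pp) and token-generation (tg) rates separately; something like the line below (the GGUF filename is just a placeholder for whatever quant you download):

    llama-bench -m qwen3-coder-30b-a3b-q8_0.gguf -ngl 99 -p 512 -n 128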


Hi, could you share whether you get decent coding performance (quality-wise) with this setup? I.e., is it good enough to replace, say, Claude Code?


qwen3-coder-30b is surprisingly good for a smallish model, but it's not going to replace Claude Code. Maybe if you're using it for Python it could do well enough. I've been trying it with C code generation and it's not bad, but certainly not at Claude Code level. I hope they come out with a Qwen coder model in the 60B to 80B range - something like that would give higher-quality results and likely still run in the 15 tok/sec range, which would be usable.


Very encouraging result; I'm waiting super anxiously for mine! How much memory did you allocate to the iGPU?


I haven't done any fiddling with that yet. Out of the box it seems to allocate half to the iGPU. The qwen3-coder-30b 8-bit quant model was (as you would expect) only taking 30GB (a bit less than half of what was allocated). Though weirdly, htop shows the CPU has 125GB available to it, so I'm not sure what to make of that.


Nvidia Spark is $4000. Or will be, supposedly, whenever it comes out.

Also notably, Strix Halo and DGX Spark both have ~275 GB/s memory bandwidth. Not always, but in many machine learning cases, it feels like that's going to be the limiting factor.


GosuCoder's latest video is a well-timed test of Ryzen AI Max on some local models, getting 40 TPS on a quantized Qwen 3 Coder.

https://www.youtube.com/watch?v=0DET4YFzS6A


Maybe the real value of the DGX Spark is to work on Switch 2 emulation: ARM + Nvidia GPU. Start with Switch 2 emulation on this machine and then optimize for others. (Yeah, I know, kind of an expensive toy.)


I think you can get something a lot cheaper if that's all you want, e.g. something in the Jetson Orin line. That's also more similar to the Switch, since it's a Tegra chip.


Expensive today. But how quickly (in years) will these systems drop in value? At least on the Nvidia side of things they can be stacked... so maybe not so much =/


You should add memory bandwidth to your comparison, as it's usually the bottleneck in terms of TPS (at least for token generation; prompt processing is a different story).


  Mac Studio max spec, ~120 tflops (fp16?), 384GB RAM, 3x bandwidth, $9499
512GB.

DGX has 256GB/s bandwidth so it wouldn't offer the most tokens/s.


Perhaps they are referring to the default GPU allocation, which is 75% of the unified memory, but it is trivial to increase.


The GPU memory allocation refers to how capacity is allotted, not bandwidth. Sounds like the same 256-bit/quad-channel 8000 MT/s LPDDR5X you can get today with Strix Halo.
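
For reference, that bus configuration is where the ~256 GB/s number comes from:

    # 256-bit bus * 8000 MT/s = 32 bytes/transfer * 8e9 transfers/s
    print(256 / 8 * 8000e6 / 1e9)   # = 256.0 GB/s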


384GB is 75% of 512GB. The M3 Ultra bandwidth is over 800GB/s, though potentially less in practice.

Using an M3 Ultra, I think the performance is pretty remarkable for inference, and concerns about prompt processing being slow in particular are greatly exaggerated.

Maybe the advantage of the DGX Spark will be for training or fine tuning.


I very consistently see people say prompt processing is slow for larger context sizes ("notoriously slow"), something that is much less of an issue with, e.g., CUDA setups.


Depends on the model. gpt-oss-120b will easily crunch large prompts in a few seconds. It's remarkable. It's gpt-4-mini at home.


tokens/s/$ then.



