Hacker News

> Ryzen AI Max 395+, ~120 tops (fp8?), 128GB RAM, $1999

Just got my Framework PC last week. It's easy to set up to run LLMs locally - you have to use Fedora 42, though, because it has the latest drivers. It was super easy to get qwen3-coder-30b (8-bit quant) running in LMStudio at 36 tok/sec.



I'm pretty new to this, so if I wanted to benchmark my current hardware and compare to your results what would be the best way to do that?

I'm looking at getting a Framework Desktop and would like to know what kind of performance gain I'd see over my current hardware. So far I only have a "feel" for its performance from running Ollama and OpenWebUI, but no hard numbers.


What nobody ever seems to share is the context size and TTFT (time to first token). You can get very good TPS numbers with small prompts, even when the output is long. If you try to do any kind of agentic coding locally, where contexts are 7k+, local hardware completely falls over.
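
For anyone wanting to measure this on their own hardware: the key is to log when the first streamed token arrives, not just the average rate. A minimal sketch of the arithmetic (pure Python, illustrative numbers only, no particular server assumed):

```python
# TPS alone hides prompt-processing cost; TTFT captures it.
# Given per-token arrival times from a streaming response,
# report the two numbers separately, llama.cpp-style.

def split_metrics(request_t, token_times):
    """request_t: when the request was sent; token_times: arrival
    time (seconds) of each streamed output token."""
    ttft = token_times[0] - request_t            # dominated by prompt processing
    gen_time = token_times[-1] - token_times[0]  # pure generation phase
    gen_tps = (len(token_times) - 1) / gen_time if gen_time > 0 else 0.0
    return ttft, gen_tps

# Illustrative: a long prompt that takes 13 s to process, then
# 100 tokens generated at a steady 55 tok/sec.
times = [13.0 + i / 55 for i in range(100)]
ttft, tps = split_metrics(0.0, times)
print(f"TTFT: {ttft:.1f}s, generation: {tps:.1f} tok/sec")  # TTFT: 13.0s, generation: 55.0 tok/sec
```

This is the same split llama.cpp reports as "prompt eval" vs "eval" timings; a single tok/sec averaged over both phases will look much worse on long prompts.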

qwen-code (cli) gives like 2k requests per day for free (and is fantastic), so unless you have a very specific use case, buying a system for local LLM use is not a good use of funds.

If you're in the market for a desktop PC anyway, and just want to tinker with LLMs, then the AMD systems are a fair value IMO, plus the drivers are open source so everything just works out of the box (with Vulkan, anyway).


> If you try to do any kind of agentic coding locally, where contexts are 7k+, local hardware completely falls over.

With my 5070 Ti + 2080 Ti I have Qwen 3 Coder 30B Q4_K_M running entirely on the GPUs with 16k context. Not great for larger code bases, but not nothing either.

Asking it to summarize llama-model-loader.cpp, which is ~12k tokens, the TTFT is ~13 seconds and the generation speed is about 55 tok/sec.

So yeah, for local stuff it's pick any two of large models, long contexts and decent speed.


Yeah, that sounds decent for some one-shots. The unified-memory systems can handle longer back-and-forth chats with more context, but at slower speeds (at least on AMD).

I find Qwen 3 Coder quite usable; I get around 20 TPS on my AMD AI 350 system, as long as the net-new context isn't too big.


I was looking at the 5060 Ti 16GB, and it has about half the memory bandwidth of the 5070 Ti, but at half the price here. With four of them you'd have 64 GB VRAM and still a lot cheaper than a 5090. Should get around 20-25 TPS for Qwen 3 Coder 30B, which is within usable range.
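
That 20-25 TPS guess lines up with a common back-of-the-envelope: decoding is usually memory-bandwidth-bound, since each generated token has to stream the active weights through the GPU. A rough sketch, with the bandwidth and weight-size figures as assumptions rather than measurements:

```python
# Upper bound on decode speed for a bandwidth-bound model:
#   tok/sec <= effective memory bandwidth / bytes of weights read per token

def decode_tps_ceiling(mem_bw_gb_s, weights_read_gb):
    return mem_bw_gb_s / weights_read_gb

bw_5060ti = 448.0  # GB/s per card (assumed spec figure)
q4_30b = 18.0      # GB, if a dense Q4 30B streamed all weights per token (assumption)
print(f"{decode_tps_ceiling(bw_5060ti, q4_30b):.1f} tok/sec ceiling")  # 24.9 tok/sec ceiling

# Caveats: splitting layers across 4 cards runs them one after another,
# so bandwidth doesn't simply add up, and a MoE model like Qwen 3 Coder
# 30B only reads a fraction of its weights per token - so treat this as
# a rough order-of-magnitude check, not a prediction.
```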

Need a big case tho or go bitcoin miner style.

Not seriously thinking about it, just playing around.


> If you're in the market for a desktop PC anyway, and just want to tinker with LLMs, then the AMD systems are a fair value

Yeah, this is why I bought it: to tinker with LLMs (and some more experimental ML algorithms like differential logic and bitnets). It also helps that it can compile LLVM in a little under 7 minutes, and, though I didn't time it, it builds the riscv gcc toolchain very quickly as well. My current (soon to be previous) dev box took about an hour to compile LLVM (when it didn't fail linking due to running out of memory), so doing any kind of LLVM development or making changes to binutils was quite tedious.


My use cases are mostly for automation, and local-only is a must.

I currently use the GPU in my server for n8n and Home Assistant with small-ish tooling models that fit in my 8GB VRAM.

TTFT is pretty poor right now, I get 10+ seconds for the longer inputs from HA, n8n isn't too bad unless I'm asking it to handle a largish input, but that one is less time sensitive as it's running things on schedules rather than when I need output.

Ideally I'd like to get Assistant responses in HA to under about 2s if possible.

I'm also looking for a new desktop at some point, but I don't want to use the same hardware; the inference GPU is in a server that's always on, running "infrastructure" (Kubernetes, various pieces of software, NAS functionality, etc.). I've built desktops from components ever since I was a wee child, when a 1.44MB floppy was an upgrade, so a part of me is reluctant to switch to a mini-PC for that.

I might be convinced to get a Framework Desktop, though, if it'll do for Steam gaming on Linux - knowing that when I eventually need to upgrade it, it could supplement my server rack and be replaced with a new model on the desktop, given there's very little upgrade path other than replacing the entire mainboard.

No real interest in coding assistants, and running within my home network is an absolute must, which limits capability to "what's the best hardware I can afford?".


You could load up LMStudio on your current hardware, get qwen3-coder-30b (8-bit quant) and give it some coding tasks - something meaty (I had it create a recursive descent parser in C that parses the C programming language). At the end of its response it shows the tok/sec. I'm getting 36 tok/sec on the Framework running that model.


Hi, could you share whether you get decent coding performance (quality-wise) with this setup? I.e. is it good enough to replace, say, Claude Code?


qwen3-coder-30b is surprisingly good for a smallish model, but it's not going to replace Claude Code. Maybe if you're using it for Python it could do well enough. I've been trying it with C code generation and it's not bad, but certainly not at Claude Code's level. I hope they come out with a qwen coder model in the 60B to 80B range - something like that would give higher-quality results and likely still run at around 15 tok/sec, which would be usable.


Very encouraging result, I'm waiting super anxiously for mine! How much memory did you allocate for the iGPU?


I haven't done any fiddling with that yet. Out of the box it seems to allocate 1/2 for the iGPU. The qwen3-coder-30b 8bit quant model was (as you would expect) only taking 30GB (a bit less than half of what was allocated). Though weirdly, in htop it shows that the CPU has 125GB available to it, so I'm not sure what to make of that.



