
Given that it's a 400B-parameter model, but a sparse MoE with 13B active parameters per token, would it run well on an NVIDIA DGX Spark with 128 GB of unified RAM, or do you practically need to hold the full model in RAM even with sparse MoE?

Even with MoE, holding the model in RAM while individual experts are evaluated in VRAM is a bit of a compromise. Experts can be swapped in and out of VRAM for each token. So RAM <-> VRAM bandwidth becomes important. With a model larger than RAM, that bandwidth bottleneck gets pushed to the SSD interface. At least it's read-only, and not read-write, but even the fastest of SSDs will be significantly slower than RAM.
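
A rough back-of-envelope (my own numbers, assuming the 13B active parameters are 4-bit quantized and, worst case, have to be streamed for every token with no reuse):

    active_params = 13e9
    bytes_per_param = 0.5                               # 4-bit quantization
    per_token_bytes = active_params * bytes_per_param   # ~6.5 GB per token, worst case

    links = {
        "PCIe 4.0 x16 (RAM -> VRAM)": 32e9,             # ~32 GB/s
        "PCIe 5.0 x16 (RAM -> VRAM)": 64e9,
        "fast NVMe SSD (model bigger than RAM)": 7e9,
    }
    for name, bw in links.items():
        print(f"{name}: ~{per_token_bytes / bw:.2f} s/token just moving weights")

In practice expert reuse and caching make it better than this worst case, but it shows why the interconnect dominates.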

That said, there are folks out there doing it. https://github.com/lyogavin/airllm is one example.


> Experts can be swapped in and out of VRAM for each token.

I've often wondered how much that happens in practice. What does the per-token distribution of expert selection actually look like during inference? For example, does it act like a uniform random variable, or does it stick with the same 2 or 3 experts for 10 tokens in a row? I haven't been able to find much info on this.

Obviously it depends on which model you're talking about, so some kind of survey would be interesting. I'm sure this must be something that the big inference labs are knowledgeable about.
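
If you can instrument a run to dump the router's top-k choices per token, both questions are easy to quantify. A sketch of the two statistics I'd look at, with a random stand-in for the trace (in practice `choices` would come from the model's router):

    import numpy as np

    # Stand-in trace: choices[t] = the k expert ids picked for token t in one layer.
    rng = np.random.default_rng(0)
    num_experts, k, T = 64, 2, 10_000
    choices = rng.integers(0, num_experts, size=(T, k))

    # 1. Marginal load: how uniform is overall expert usage?
    load = np.bincount(choices.ravel(), minlength=num_experts) / (T * k)
    print("max/mean load ratio:", load.max() / load.mean())

    # 2. Temporal stickiness: how many experts are reused from token t-1 to token t?
    overlap = [len(set(choices[t]) & set(choices[t - 1])) for t in range(1, T)]
    print("mean experts reused between consecutive tokens:", np.mean(overlap))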

Although, I guess if you are batching things, then even if only a subset of experts is selected for a single query, over the batch the selection may look essentially random, which would destroy any efficiency gains. Perhaps it's possible to intelligently batch queries that are "similar" somehow? It's quite an interesting research problem when you think about it.

Come to think of it, how does it work then for the "prompt ingestion" stage, where it likely runs all experts in parallel to generate the KV cache? I guess that would destroy any efficiency gains due to MoE too, so the prompt ingestion and AR generation stages will have quite different execution profiles.


The model is explicitly trained to produce as uniform a distribution as possible, because it's designed for batched inference with a batch size much larger than the expert count: all experts are constantly active and latency is determined by the most heavily loaded expert, so you want to distribute the load evenly to maximize utilization.
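
For what it's worth, the usual way that's enforced in training (the Switch Transformer-style auxiliary loss, not anything specific to this model) is a load-balancing term that is minimized when both the dispatch fractions and the mean router probabilities are uniform:

    import numpy as np

    def load_balance_loss(router_probs, topk_ids, num_experts):
        # router_probs: [tokens, num_experts] softmax outputs of the router
        # topk_ids:     [tokens, k] experts actually selected per token
        f = np.bincount(topk_ids.ravel(), minlength=num_experts) / topk_ids.size
        P = router_probs.mean(axis=0)
        # equals 1 when both f and P are perfectly uniform; larger when load is skewed
        return num_experts * np.sum(f * P)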

Prompt ingestion is still fairly similar to that setting, so you can first compute the expert routing for all tokens, load the first set of expert weights and process only those tokens that selected the first expert, then load the second expert and so on.
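
A sketch of what that looks like for one MoE FFN layer during prefill; `load_expert` and `apply_expert` are placeholders for whatever the runtime does for paging and matmuls, and gating weights are omitted for brevity:

    import numpy as np

    def moe_ffn_prefill(hidden, topk_ids, load_expert, apply_expert):
        # hidden:   [tokens, d_model] activations (routing already computed)
        # topk_ids: [tokens, k] expert ids chosen per token
        out = np.zeros_like(hidden)
        for e in np.unique(topk_ids):
            weights = load_expert(e)                          # one page-in per expert
            rows = np.where((topk_ids == e).any(axis=1))[0]   # tokens routed to expert e
            out[rows] += apply_expert(weights, hidden[rows])
        return out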

But if you want to optimize for single-stream token generation, you need a completely different model design. E.g. PowerInfer's SmallThinker moved expert routing to a previous layer, so that the expert weights can be prefetched asynchronously while another layer is still executing: https://arxiv.org/abs/2507.20984
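
The trick is just overlapping the weight load for layer n+1's experts with layer n's compute, which becomes possible once the routing decision is known a layer early. A toy sketch of the overlap (hypothetical `route_early`, `load_experts`, `run_layer` helpers; not how PowerInfer actually structures it):

    from concurrent.futures import ThreadPoolExecutor

    def decode_step(hidden, layers, route_early, load_experts, run_layer):
        with ThreadPoolExecutor(max_workers=1) as io:
            pending = io.submit(load_experts, route_early(hidden, 0))
            for i in range(len(layers)):
                weights = pending.result()                   # wait only if I/O is behind
                if i + 1 < len(layers):
                    nxt = route_early(hidden, i + 1)         # routing known a layer early
                    pending = io.submit(load_experts, nxt)   # prefetch during compute
                hidden = run_layer(layers[i], weights, hidden)
        return hidden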


With a non-sequential generative approach, perhaps the RAM cache misses could be grouped together and swapped in on a when-available/when-needed prioritized basis.

You can run it with mmap(), but it is slower. 4-bit quantized, the ratio between the model size and the RAM is decent, so with a fast SSD one could try it and see how it works. However, when a model is 4-bit quantized there is often the doubt that it is no better than an 8-bit quantized model of 200B parameters; it depends on the model, on the use case, ... Unfortunately the road to local inference of SOTA models is being blocked by RAM prices and the companies' demand for GPUs, leaving us with little. Probably today the best bet is to buy Mac Studio systems and then run distributed inference (MLX supports this, for instance), or a 512 GB Mac Studio M4, which costs something like $13k.
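
For a sense of the sizes involved (weights only; quantization scales, KV cache and activations add more on top), and why the 4-bit 400B vs 8-bit 200B comparison keeps coming up:

    # Weights-only footprint; real runtimes need extra headroom on top of this.
    def gib(params, bits):
        return params * bits / 8 / 2**30

    print(f"400B @ 4-bit: ~{gib(400e9, 4):.0f} GiB")  # ~186 GiB, over 128 GB of unified RAM
    print(f"400B @ 8-bit: ~{gib(400e9, 8):.0f} GiB")  # ~373 GiB
    print(f"200B @ 8-bit: ~{gib(200e9, 8):.0f} GiB")  # ~186 GiB, same footprint as 4-bit 400B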

I think the 512 GB Mac Studio was the M3 Ultra.

Anyways, isn't a new Mac Studio due in a few months? It should be significantly faster as well.

I just hope RAM prices don't ruin this...


Talking about RAM prices, you can still get a Framework Desktop with the Ryzen AI Max+ 395 and 128 GB of RAM for ~$2,459 USD. They have not increased the price for it yet.

https://frame.work/products/desktop-diy-amd-aimax300/configu...


Pretty sure those used to be $1,999... but not entirely sure.

Yep. You'd be right. Looks like they increased it earlier this month. Bummer!

No.

128 GB of VRAM gets you enough space for roughly 256B-parameter models at 4-bit quantization. But 400B is too big for the DGX Spark, unless you connect two of them together and use tensor parallelism.


Impressive work.

I wonder if you've looked into what it would take to implement accessibility while maintaining your no-Rust-dependencies rule. On Windows and macOS, it's straightforward enough to implement UI Automation and the Cocoa NSAccessibility protocols respectively. On Unix/X11, as I see it, your options are:

1. Implement AT-SPI with a new from-scratch D-Bus implementation.

2. Implement AT-SPI with one of the D-Bus C libraries (GLib, libdbus, or sdbus).

3. Use GTK, or maybe Qt.
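
For context on what options 1 and 2 involve, the entry point is the same either way: discover the dedicated accessibility bus, connect to it, and then serve the org.a11y.atspi.* interfaces (Accessible, Component, Text, ...) on that connection. A rough sketch of just the discovery step, using dbus-python here only because it's short (the real thing would go through whichever D-Bus implementation you pick):

    import dbus  # dbus-python; assumes a session with at-spi2 running

    # AT-SPI lives on its own bus; ask org.a11y.Bus on the session bus where it is.
    session = dbus.SessionBus()
    proxy = session.get_object("org.a11y.Bus", "/org/a11y/bus")
    a11y_address = dbus.Interface(proxy, "org.a11y.Bus").GetAddress()

    # The toolkit then connects here, registers its application root with the
    # AT-SPI registry, and answers org.a11y.atspi.* method calls on this connection.
    a11y_bus = dbus.bus.BusConnection(a11y_address)
    print("accessibility bus at:", a11y_address)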


This is obviously bullshit. If he were really worried about the things he says he is, he'd put the brakes on his company, or would never have started it in the first place.

He's addressed this a multitude of times. He wants to slow down, but the Chinese will not; therefore you cannot cede the frontier to authoritarians. It's a nuclear arms race.

So if someone (actually, practically everyone) who runs an AI company says AI is dangerous, it's bullshit. If someone who is holding NVDA put options says it, they're talking their book. If someone whose job is threatened by AI says it, it's cope. If someone who doesn't use AI says it, it's fear of change. Is there someone in particular you want to hear it from, or are you completely immune to argument?

I actually do believe that AI is dangerous, though for different reasons than the ones he focuses on. But I don't think he really believes it, since if he did, he wouldn't be spending billions to bring it into existence.

> So if someone (actually, practically everyone) who runs an AI company says AI is dangerous, it's bullshit.

My instinct is to take his words as a marketing pitch.

When he says AI is dangerous, it is a roundabout way to say it is powerful and should be taken seriously.


Yes, exactly.

I was gratified to learn that the project used my own AccessKit for accessibility (or at least attempted to; I haven't verified if it actually works at all; I doubt it)... then horrified to learn that it used a version that's over 2 years old.

Glad to hear that the coming local version of Sprites will be open-source. I hope there will be some way to financially reward that work, aside from buying Fly services that I likely wouldn't use.


I like Partners In Health, myself. https://www.pih.org/


I want something like this, but running on my own box. I now have a Linux box with plenty of RAM and storage under my desk. (It happens to be an NVIDIA DGX Spark, but I'm not really interested in passing the GPU through to these sandboxed VMs; I know that's not practical anyway.) Maybe I'll see if I can hack together a local solution like this using Firecracker.


That's coming. It's what Jerome has been working on these past few months.


What about `docker run`? It'll be the same isolated container that keeps state. You can also mount some local directory into it.


Maybe bend smolvm to your needs?


Simple language isn't just for children. It's also good for non-native speakers. Besides, even for those who can understand complex grammar and obscure words, parsing unnecessarily complex language takes extra effort.

In this specific case, I don't think the rewritten version of the document is infantilizing.


It's not useless, but the original document is aimed at expert craftspeople and there's a lot of content in the texture of it.


> Did Microsoft seriously deprecate BitBlt and 2D draw calls?

Very unlikely. Far too many applications depend on those things. It's more likely that they accidentally changed something subtle that happened to break colorForth.


If you're allowed to say, are you referring to the Windows 10 ports of the iOS apps that were done via Osmeta in 2016, or the earlier WinRT-native version? If the former, that was a non-starter for me and my blind friends due to deep accessibility issues, probably having to do with the Osmeta port/reimplementation of UIKit. Edit to add: And we wanted something that was easier to use with a Windows screen reader than the desktop website, particularly for Facebook proper.


> I still like it -- for systems programming, that is.

Just curious, what language(s) do you prefer for things that you don't classify as "systems programming"?


Go and TypeScript are both nice.

