
I genuinely do not understand the valuations of the US AI industry. The Chinese models are so close and far cheaper.


Two aspects to consider:

1. Chinese models typically focus on text. US and EU models also bear the cross of handling images, and often voice and video. Supporting all of those adds training cost that isn't spent on further reasoning: one hand tied behind your back in exchange for being more generally useful.

2. The gap seems small because so many benchmarks get saturated so fast. But towards the top, every 1% increase on a benchmark is significantly better.

On the second point, I worked on a leaderboard that both normalizes scores and predicts unknown scores to help improve comparisons between models on various criteria: https://metabench.organisons.com/
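For the curious, the rough idea behind normalizing and predicting scores looks something like this (a generic sketch, not the actual method behind the leaderboard):

    # Hypothetical sketch: min-max normalize each benchmark, then fill
    # missing model/benchmark cells with an iterative low-rank (SVD)
    # approximation.
    import numpy as np

    # rows = models, columns = benchmarks; NaN = score not yet measured
    scores = np.array([
        [88.0, 72.0, np.nan],
        [85.0, np.nan, 61.0],
        [70.0, 55.0, 48.0],
    ])

    # Normalize each benchmark to [0, 1] so different scales are comparable
    lo, hi = np.nanmin(scores, axis=0), np.nanmax(scores, axis=0)
    norm = (scores - lo) / (hi - lo)

    # Start from column means, then repeatedly project onto a rank-1 SVD
    # approximation, keeping the known cells fixed
    filled = np.where(np.isnan(norm), np.nanmean(norm, axis=0), norm)
    for _ in range(50):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        approx = (u[:, :1] * s[:1]) @ vt[:1]
        filled = np.where(np.isnan(norm), approx, norm)

    print(filled.round(2))  # the NaN cells now hold predicted estimates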

You can notice that, while Chinese models are quite good, the gap to the top is still significant.

However, the US models are typically much more expensive at inference time, and Chinese models do have a niche on the Pareto frontier for cheaper but serviceable models (even though US models are eating into that part of the frontier too).


Nothing you said helps with the issue of valuation. Yes, the US models may be better by a few percentage points, but how can they justify being so costly, both operationally and in investment? Over the long run, this is a business, and you don't make money by being first; you have to be more profitable overall.


I think the investment race here is an "all-pay auction"*. Lots of investors have looked at the ultimate prize — basically winning something larger than the entire present world economy forever — and think "yes".

But even assuming that we're on the right path for that (which we may not be) and assuming that nothing intervenes to stop it (which it might), there may be only one winner, and that winner may not have even entered the game yet.

* https://en.wikipedia.org/wiki/All-pay_auction
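To make the mechanism concrete, here is a toy simulation (made-up numbers, just to show why every bidder's spend is sunk whether or not they win):

    # Toy all-pay auction: every bidder forfeits their bid, win or lose,
    # but only the highest bidder takes the prize.
    import random

    PRIZE = 100.0
    N_BIDDERS = 5
    random.seed(0)

    bids = [random.uniform(0, PRIZE) for _ in range(N_BIDDERS)]
    winner = max(range(N_BIDDERS), key=lambda i: bids[i])

    for i, bid in enumerate(bids):
        payoff = (PRIZE if i == winner else 0.0) - bid
        print(f"bidder {i}: bid {bid:6.1f}, payoff {payoff:+7.1f}")
    print(f"total spent: {sum(bids):.1f} for a prize of {PRIZE:.1f}")

In aggregate the bidders can easily spend more than the prize is worth, which is the worry with the current investment race.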


> investors have looked at the ultimate prize — basically winning something larger than the entire present world economy

This is what people like Altman want investors to believe. It seems like any other snake-oil scam because it doesn't match the reality of what he delivers.


Yeah, this is basically financial malpractice/fraud.


1. Have you seen the Qwen offerings? They have great multi-modality, some even SOTA.


Qwen Image and Image Edit were among the best image models until Nano Banana Pro came along. I have tried some open image models and can confirm: the Chinese models are easily the best or very close to the best, but right now the Google model is even better... we'll see if the Chinese catch up again.


I'd say Google still hasn't caught up on the smaller model side at all, but we've all been (rightfully) wowed enough by Pro to ignore that for now.

Nano Banana Pro starts at 15 cents per image at <2K resolution, and is not strictly better than Seedream 4.0; yet the latter does 4K for 3 cents per image.

Add in the power of fine-tuning on their open weight models and I don't know if China actually needs to catch up.

I finetuned Qwen Image on 200 generations from Seedream 4.0 that were cleaned up with Nano Banana Pro, and got results that were as good as, and more reliable than, either model could achieve on its own.
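For anyone curious about that kind of distillation workflow, the data-prep half of it looks roughly like this (paths and folder layout are hypothetical; the actual fine-tune would be a LoRA run on Qwen Image, which is out of scope here):

    # Hypothetical sketch of the data prep only. Most LoRA trainers
    # expect image files paired with same-named .txt caption files
    # in a single training folder.
    from pathlib import Path
    import shutil

    CLEANED = Path("nano_banana_cleaned")   # ~200 cleaned-up generations
    CAPTIONS = Path("seedream_captions")    # one .txt prompt per image
    TRAIN = Path("qwen_image_train")
    TRAIN.mkdir(exist_ok=True)

    for img in sorted(CLEANED.glob("*.png")):
        shutil.copy(img, TRAIN / img.name)
        cap = CAPTIONS / (img.stem + ".txt")
        if cap.exists():                    # skip images without a prompt
            shutil.copy(cap, TRAIN / cap.name)
    print(f"prepared {len(list(TRAIN.glob('*.png')))} training images")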


FWIW, Qwen Z-Image is much better than Seedream, and people (redditors) are saying it's better than Nano Banana in their first trials. It's also 7B, I think, and open.


I've used and finetuned Z-Image Turbo: it's nowhere near Seedream, or even Qwen-Image when the latter is finetuned (it also doesn't do image editing yet).

It is very good for its size and speed, and I'm excited for the Edit and Base variants... but Reddit has been a bit "over-excited" because it runs on their small GPUs and isn't overly resistant to porn.


> Chinese models typically focus on text

Not true at all. Qwen has a VLM (Qwen2 VL Instruct), which is the backbone of Bytedance's TARS computer-use model. Both Alibaba (Qwen) and Bytedance are Chinese.

Also, DeepSeek got a ton of attention with their OCR paper a month ago, which was an explicit example of using images rather than text.


> video

Most of the AI-generated videos we see on social media now are made with Chinese models.


Thanks for sharing that!

The scales are a bit murky here, but if we look at the 'Coding' metric, we see that Kimi K2 outperforms Sonnet 4.5, which I think is still considered the price-perf darling even today?

I haven't tried these models, but in general there have been lots of cases where a model performs much worse IRL than the benchmarks would suggest (certain Chinese models and GPT-OSS have been guilty of this in the past).


Good question. There are two points to consider.

• For both Kimi K2 and Sonnet, there's a non-thinking and a thinking version. Sonnet 4.5 Thinking is better than Kimi K2 non-thinking, but the K2 Thinking model came out recently, and it beats Sonnet on all comparable pure-coding benchmarks I know of: OJ-Bench (Sonnet: 30.4% < K2: 48.7%), LiveCodeBench (Sonnet: 64% < K2: 83%), and they tie on SciCode at 44.8%. This finding is shared by ArtificialAnalysis: https://artificialanalysis.ai/models/capabilities/coding

• The reason developers love Sonnet 4.5 for coding, though, is not just the quality of the code. They use Cursor, Claude Code, or some other increasingly agentic system such as GitHub Copilot. On the Agentic Coding criterion, Sonnet 4.5 Thinking scores much higher.

By the way, you can look at the Table tab to see all known and predicted results on benchmarks.


The table is confusing. It is not clear what is known and what is predicted (or how it is predicted). Why not measure the missing pieces instead of predicting them? Is it too expensive, or is the tooling missing?


Qwen, Hunyuan, and WAN are three of the major competitors in the vision, text-to-image, and image-to-video spaces. They are quite competitive. Right now WAN is only behind Google's Veo in image-to-video rankings on lmarena, for example:

https://lmarena.ai/leaderboard/image-to-video


Forgive me for bringing politics into it, but are Chinese LLMs more prone to censorship bias than US ones?


Since they're open source, I believe Chinese models are less prone to censorship; US corporations can add censorship in several ways simply by controlling a closed model.


It's not about an LLM being prone to anything; it's more about the way an LLM is fine-tuned (which can be subject to the requirements of those wielding political power).


That's what I meant, even though I could have been more precise.


Yes, they are extremely likely to be prone to censorship based on their training. Try running them locally with something like LM Studio and ask questions the government is uncomfortable about. I originally thought the bias was in the GUI, but it's baked into the model itself.


It's all about the hardware and infrastructure. If you check OpenRouter, no provider offers a SOTA Chinese model matching the speed of Claude, GPT, or Gemini. The Chinese models may benchmark close on paper, but real-world deployment is different. So you either buy your own hardware in order to run a Chinese model at 150-200 tps, or give up and use one of the Big 3.

The US labs aren't just selling models, they're selling globally distributed, low-latency infrastructure at massive scale. That's what justifies the valuation gap.

Edit: It looks like Cerebras is offering a very fast GLM 4.6



It doesn't work like that. You need to actually use the model and then go to /activity to see the actual speed. I constantly get 150-200 tps from the Big 3, while other providers barely hit 50 tps even though they advertise much higher speeds. GLM 4.6 via Cerebras is the only one faster than the closed-source models, at over 600 tps.
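If you want numbers you trust, you can also time a streamed completion yourself (a rough sketch; OpenRouter's API is OpenAI-compatible, and the model slug here is an assumption, swap in whatever you use):

    # Hypothetical sketch: time a streamed completion instead of trusting
    # advertised speeds.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1",
                    api_key="YOUR_OPENROUTER_KEY")

    start, chunks = time.time(), 0
    stream = client.chat.completions.create(
        model="z-ai/glm-4.6",  # assumed slug
        messages=[{"role": "user", "content": "Write 500 words about CPUs."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            chunks += 1
    elapsed = time.time() - start
    # one streamed chunk is roughly one token for most providers
    print(f"~{chunks / elapsed:.0f} tokens/sec over {elapsed:.1f}s")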


These aren't advertised speeds; they are average speeds measured by OpenRouter across different providers.


The network effects of using consistently behaving models and maintaining API coverage between updates are valuable too. Presumably the big labs include their own domains of competence in the training, so Claude is likely to remain very good at coding and to behave in similar ways, informed and constrained by their prompt frameworks. Interactions will continue to work in predictable ways even after major new releases, and upgrades can be clean.

It'll probably be a few years before all that stuff becomes as smooth as people need, but OAI and Anthropic are already doing a good job on that front.

Each new Chinese model requires a lot of testing and bespoke conformance work for every task you want to use it for. There's a lot of activity and shared prompt engineering, and some really competent people doing things out in the open, but it's generally going to take a lot more expert work to get the new Chinese models up to snuff than to work with the big US labs. Their product and testing teams do a lot of valuable work.


Qwen 3 Coder Plus has been braindead this past weekend, but Codex 5.1 has also been acting up. It told me updating UI styling was too much work and I should do it myself. I also see people complaining about Claude every week. I think this is an unsolved problem, and you also have to separate perception from actual performance, which I think is an impossible task.


> If you check OpenRouter, no provider offers a SOTA chinese model matching the speed of Claude, GPT or Gemini.

I think GLM 4.6 offered by Cerebras is much faster than any US model.


You're right, I forgot about that one.


According to OpenRouter, z.ai is 50% faster than Anthropic, which matches my experience. z.ai does have frequent downtimes, but so does Claude.


Assuming your hardware premise is right (and let's be honest, nobody really wants to send their data to Chinese providers), you can use a provider like Cerebras or Groq?


Cerebras offers models at 50x the speed of Sonnet?


If that's an honest question, the answer is pretty much yes, depending on the model.


The question mark was expressing confusion.


Valuation is not based on what they have done but what they might do. I agree, though, that it's an investment made with very little insight into Chinese research. I guess it's counting on DeepSeek being banned and all computers in America refusing to run open software by the year 2030. /snark


> Valuation is not based on what they have done but what they might do

Exactly what I'm thinking. Chinese models are catching up rapidly. Soon they'll be on par with the big dogs.


Even if they do continue to lag behind, they are a good bet against monopolisation by proprietary vendors.


They would if corporations were allowed to run these models. I fully expect the US government to prohibit corporations from doing anything useful with Chinese models (full censorship). It's the same game they use with chips.


>I guess it's counting on deepseek being banned

And the people making the bets are in a position to make sure the banning happens. The US government system being what it is.

Not that our leaders need any incentive to ban Chinese tech in this space. Just pointing out that it's not necessarily a "bet".

"Bet" imply you don't know the outcome and you have no influence over the outcome. Even "investment" implies you don't know the outcome. I'm not sure that's the case with these people?


Exactly. "Business investment" these days means that the people involved will have at least some amount of power to determine the winning results.


Third-party providers rarely support prompt caching.

With caching, the expensive US models end up being only about 2x the price (e.g. Sonnet) and often much cheaper (e.g. GPT-5 mini).

If they start caching, then US companies will be completely outpriced.
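A back-of-envelope example of why caching matters so much for agentic workloads (the prices are assumptions in the ballpark of Sonnet's published rates, and the cache-write surcharge is ignored for simplicity):

    # Hypothetical sketch: effective input cost with and without caching.
    INPUT = 3.00        # $/MTok, assumed base input price
    CACHE_READ = 0.30   # $/MTok, assumed cached-read price (~10% of base)

    context = 100_000   # tokens of repeated system prompt + project context
    turns = 20          # agent-loop iterations reusing that same context

    no_cache = turns * context / 1e6 * INPUT
    with_cache = (context / 1e6 * INPUT                  # first pass, uncached
                  + (turns - 1) * context / 1e6 * CACHE_READ)

    print(f"no cache:   ${no_cache:.2f}")    # $6.00
    print(f"with cache: ${with_cache:.2f}")  # $0.87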


Yet, to be honest, if the US industry had not moved ahead and created the race with FOMO, it would not have been as easy for the Chinese strategy to work either.

The nature of the race may yet change, though, and I am unsure whether the devil is in the details, as in very specific edge cases that will only work with frontier models?


They're not that close (on things like LMArena) and being cheaper is pretty meaningless when we are not yet at the point where LLMs are good enough for autonomy.


I would expect that one of the motivations for making these model weights open is to undermine the valuations of other players in the industry. Open models like this must diminish the value proposition of the frontier-focused companies if other companies can compete with similar results at competitive prices.


People pay for products, not models. OpenAI and Anthropic make products (ChatGPT, Claude Code).


There is a great deal of orientalism --- it is genuinely unthinkable to a lot of American tech dullards that the Chinese could be better at anything requiring what they think of as "intelligence." Aren't they Communist? Backward? Don't they eat weird stuff at wet markets?

It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; their Bolshevism dooms them; we have the will to power; we will succeed. Even now, when you ask of that era the kind of question you're asking here, the answers you get are genuinely no better than "yes, this should have been obvious at the time if you were not completely blinded by ethnic and especially ideological prejudice."


Back when DeepSeek came out and people were tripping over themselves shouting that it was so much better than what was out there, it just wasn't good.

It might be that this model is super good (I haven't tried it), but to say the Chinese models are better is just not true.

What I really love, though, is that I can run them (open models) on my own machine. The other day I categorised images locally using Qwen. What a time to be alive.

Going even further than local hardware, open models make it possible to run them on providers of your choice, such as European ones. Which is great!

So I love everything about the competitive nature of this.


If you thought DeepSeek "just wasn't good," there's a good chance you were running it wrong.

For instance, a lot of people thought they were running "DeepSeek" when they were really running some random distillation on ollama.


WDYM? Isn't https://chat.deepseek.com/ the real DeepSeek?


Good point, I was assuming the GP was running local for some reason. Hard to argue when it's the official providers who are being compared.

I ran the 1.58-bit Unsloth quant locally at the time it came out, and even at such low precision, it was super rare for it to get something wrong that o1 and GPT-4 got right. I have never actually used a hosted version of the full DeepSeek.


The early stages of Barbarossa were very successful, and much of the Soviet Air Force, which had been forward-positioned for invasion, was destroyed. Given the Red Army's attitude toward consent, I would keep the praise carefully measured. TV has taught us there are good guys and bad guys, when the reality is closer to just bad guys and bad guys.


I don't think that anyone, much less someone working in tech or engineering in 2025, could still hold beliefs about the Chinese not being capable scientists or engineers. I could maybe give a (naive) pass to someone in 1990 thinking China would never build more than junk. But in 2025, their production capacity, their scientific advancement, and just the number of us who have worked with extremely talented Chinese colleagues should dispel those notions. I think you are jumping to racism a bit fast here.

Germany was right in some ways and wrong in others about the Soviet Union's strength. The USSR failed to conquer Finland because of the military purges, and German intelligence vastly underestimated the number of tanks and the general preparedness of the Soviet army (Hitler was shocked the Soviets already had 40k tanks). The Lend-Lease Act also sent an astronomical amount of goods to the USSR, which allowed them to fully commit to the war and focus on increasing their weapons production; the numbers on the tractors, food, trains, ammunition, etc. that the US sent to the USSR are staggering.


I don't think anyone seriously believes that the Chinese aren't capable; it's more that people believe that, no matter what happens, the USA will still dominate in "high tech" fields. A variant of "American Exceptionalism," so to speak.

This is kinda reflected in the stock market, where the AI stocks are surging to new heights every day, yet their Chinese equivalents are relatively lagging behind in stock price, which suggests that investors are betting heavily on the US companies to "win" this "AI race" (if there are any gains to be made by winning).

Also, in the past couple of years (or maybe a couple of decades), there has been a lot of crap talk about how China has to democratize and free up its markets in order to be competitive with other first-world countries, together with a bunch of "doomsday" predictions for authoritarianism in China. This narrative has completely lost any credibility, but the sentiment dies slowly...


But didn't China already surpass the rest of the world in solar, batteries, and EVs, among other things?


They did, but the goalposts keep moving, so to speak. We're approximately here: advanced semiconductors, artificial intelligence, reusable rockets, quantum computing, etc. The Chinese will never catch up. /s


Not sure how the entire Nazi comparison plays out, but at the time there were good reasons to imagine the Soviets would fall apart (as they initially did).

Stalin had just finished purging his entire officer corps, which is not a good omen for war, and the USSR had failed miserably against the Finns, who were not the strongest of nations, while Germany had just steamrolled France, a country that was much more impressive in WW1 than the Russians (who collapsed against Germany).


"It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; ..."

Ideology played a role, but the data they worked with was the Finnish war, which was disastrous for the Soviet side. Hitler later famously said it was all an intentional distraction to make them believe the Soviet army was worth nothing. (The real reasons were more complex, like the earlier purges.)


> It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; their Bolshevism dooms them; we have the will to power; we will succeed

Though, because Stalin had decimated the Red Army leadership (including most of the veteran officers who had Russian Civil War experience) during the Moscow trials purges, the Germans almost succeeded.


> Though, because Stalin had decimated the Red Army leadership (including most of the veteran officers who had Russian Civil War experience) during the Moscow trials purges, the Germans almost succeeded.

There were many counter-revolutionaries among the leadership, even among those conducting the purges. Stalin was like, "ah fuck, we're hella compromised." Many revolutions fail at this step and often end up facing a CIA-backed coup. The USSR was under constant siege and attempted infiltration from its inception.


> There were many counter-revolutionaries among the leadership

Well, Stalin was, by far, the biggest counter-revolutionary in the Politburo.

> Stalin was like, "ah fuck, we're hella compromised."

There's no evidence that anything significant was compromised at that point, and clear evidence that Stalin was in fact clinically paranoid.

> Many revolutions fail at this step and often end up facing a CIA-backed coup. The USSR was under constant siege and attempted infiltration from its inception.

Can we please not recycle 90-year-old Soviet propaganda? That the Moscow trials were irrational self-harm was acknowledged by the USSR leadership as early as the fifties…


These Americans have no comprehension of intelligence being used to benefit humanity instead of being used to fund a CEO's new yacht. I encourage them to visit China to see how far the USA lags behind.


Lags behind meaning we haven't covered our buildings in LEDs?

America is mostly suburbs and car sewers but that's because the voters like it that way.


Then you should short the market.



