More

pico_creator · 2025-04-01T19:59:15 1743537555

(original article author)

I view it more as a shortcut. We have trained 7B and 14B models from scratch, matching transformer performance with similar sized datasets.

This has been shown to even slightly outperform transformer scaling law, with the training we done from 1B to 14B. And we expect it to do so as we scale.

However as of this point, answering and settling that debate for good at 72B scale - is a $5 Million dollar bill. So for now, we use the short cuts, to just show that it actually works - and use that money to iterate and improve the architecture faster.

inhumantsar · 2025-04-01T20:09:12 1743538152

Thanks for the explanation! Sounds pretty exciting. I'll keep my eyes peeled for the paper

pico_creator · on Jan 2, 2025

RWKV already solve the parallel compute problem for GPU, based on the changes it has done - so it is a recurrent model that can scale to thousands++ of GPU no issue.

Arguably with other recurrent architecture (State Space, etc) with very different design implementation. The issue of old recurrent design was just the way LSTM was designed. Not the recurrent nature.

pico_creator · on Jan 2, 2025

Currently the strongest RWKV model is 32B in size: https://substack.recursal.ai/p/q-rwkv-6-32b-instruct-preview

This is a full drop in replacement for any transformer model use cases on model sizes 32B and under, as it has equal performance to existing open 32B models in most benchmarks

We are in works on a 70B, which will be a full drop in replacement for most text use cases

lostmsu · on Jan 2, 2025

Why aren't you on lmarena (former chatbot arena) leaderboard?

pico_creator · on Jan 2, 2025

kinda on a todo list, the model is open source on HF for anyone who is willing to make it work with lmarena

swyx · on Jan 2, 2025

how about finetuning your 32B to be R1QWQKV?

pico_creator · on Jan 2, 2025

There is a current lack of "O1 style" reasoning dataset in open source space. QWQ did not release their dataset. So that would take some time for the community to prepare.

It's definitely something we are tracking to do as well =)

pico_creator · on Jan 2, 2025

Hey there, im Eugene / PicoCreator - co-leading the RWKV project - feel free to AMA =)

Ey7NFZ3P0nzAe · on Jan 2, 2025

I noticed the lack of support from ollama and llama.cpp for RWKV. As those are (to my eyes) very strong drivers of experimentation (i.e. supporting them means vastly more outreach) I was considering whether you were considering taking this into your own hands by contributing code to them? Or rather is the fact that you are not (AFAIK) doing it because you lack the bandwidth in terms of man power or any other reason?

nickpsecurity · on Jan 2, 2025

It’s really, interesting work. I’m glad you’ve kept at it. I’d like to ask you about two issues.

I keep seeing papers like “Repeat After Me” claiming serious weaknesses of state space vs transformer models. What are the current weaknesses of RWKV vs transformers? Have you mitigated them? If so, how?

The other issue is that file sharing being illegal, Wikipedia requiring derivatives to be copyleft, etc means I can’t train models with most data legally. Pre-1920’s works in Project Gutenberg are totally public domain. Both the model and the training data would be 100% legal for reproducible research. Would your team be willing to train a 3B-7B model on only Gutenberg and release it to the public domain?

(Note: The Stack without GitHub Issues can be used for permissive code. However, there could be contamination issues like incorrect licenses, PII, etc. So, maybe at least one, 100% legal model. Maybe a second with Gutenberg and The Stack for coding research.)

Example use of Gutenberg:

https://www.tensorflow.org/datasets/catalog/pg19

anon373839 · on Jan 2, 2025

> The other issue is that file sharing being illegal, Wikipedia requiring derivatives to be copyleft, etc means I can’t train models with most data legally.

That really depends on whether LLM pretraining ends up held as an infringing use. (Of course, it’ll take a while for the cases to work through the courts and for a body of jurisprudence to be developed on this subject.)

nickpsecurity · on Jan 2, 2025

There’s two legal issues: sharing copyrighted data; training on it. It’s the latter that’s ambiguous. My problem is the former.

Making copies of and sharing copyrighted works without the authors’ permission is already illegal as proven in countless, file-sharing cases. The AI trainers do this with data sets like Common Crawl, The Pile, and RefinedWeb. Just sharing them is illegal for most of the content in them.

I got ideas for how to deal with that in countries with TDM exceptions, like Singapore. For now, the only things we can share with others for model training are (a) public domain works and (b) content licensed for permissive use and sharing. Gutenberg entries before a certain year should be pretty risk-free.

Ey7NFZ3P0nzAe · on Jan 2, 2025

Has there been progress towards making RWKV multimodal? Can be use projector layers to send images to RWKV?

pico_creator · on Jan 2, 2025

There is work done for Vision RWKV, and audio RWKV, an example paper is here: https://arxiv.org/abs/2403.02308

Its the same principle as open transformer models where an adapter is used to generate the embedding

However currently the core team focus is in scaling the core text model, as this would be the key performance driver, before adapting multi-modal.

The tech is there, the base model needs to be better

Ey7NFZ3P0nzAe · on Jan 2, 2025

I'm quite interested in repeng [0] (representztion engineering) for steerability of (so fzr transformer based) LLMs and was wondering if anyone had tried such methods on rwkv (or mamba for that matter). Maybe there are some low hanging fruits about it.

[0] https://github.com/vgel/repeng/issues

pico_creator · on Jan 2, 2025

One of the interesting "new direction" for RWKV and Mamba (or any recurrent model), is the monitoring and manipulation of the state in between token. For steerability, alignment, etc =)

Not saying its a good or bad idea, but pointing out that having a fixed state in between has interesting applications in this space

low_tech_punk · on Jan 2, 2025

Thanks! The 0.1B version looks perfect for embedded system. What is the key benefit of attention-free architecture?

pico_creator · on Jan 2, 2025

lower compute cost especially over longer sequence length. Depending on context length, its 10x, 100x, or even 1000x+ cheaper. (quadratic vs linear cost difference)

bratao · on Jan 2, 2025

What would be the most performant way to run a inference using RWKV? Do you have and speed comparison to a similar sized transformer?

I have a task(OCR cleaning) that I´m evaluating faster options and look like RWKV would be a nice alternative.

littlestymaar · on Jan 2, 2025

Has there been any plans to build a “reasoning” llm using RWKV? With the increase in inference token count caused by such methods, the muhc lower footprint of recurrent architecture could really make a difference for such a use-case.

theLiminator · on Jan 2, 2025

Do you have an in depth comparison between RWKV and models like mamba or s4?

pico_creator · on Jan 2, 2025

Not sure how indepth you want it to be. But we did do a co-presentation with one of the coauthors of mamba at latent space : https://www.youtube.com/watch?v=LPe6iC73lrc

jharohit · on Jan 2, 2025

congrats and great work on RWKV and Recursal.ai

pico_creator · on Jan 2, 2025

This is actually the hypothesis for cartesia (state space team), and hence their deep focus on voice model specifically. Taking full advantage of recurrent models constant time compute, for low latencies.

RWKV team's focus is still however is first in the multi-lingual text space, then multi-modal space in the future.

swyx · on Jan 2, 2025

Karan from Cartesia explains SSMs+voice really well: https://www.youtube.com/watch?v=U9DPRZ0lSIQ

its one of those retrospectively obvious/genius insights that i wish i understood when i first met him

pico_creator · on Jan 2, 2025

Not an MoE, but we have already done hybrid models. And found it to be highly performant (as per the training budget)

https://arxiv.org/abs/2407.12077

pico_creator · on Oct 11, 2024

Someone is losing the money. It’s elaborated in the article how and why this happens

TLDR, VC money, is being burnt/lost

shermantanktop · on Oct 11, 2024

Tons of VC money burned in pursuit of low-probability success. It’s no wonder that some people find it easier to scam VCs than it is to build a real business.

pico_creator · on Oct 11, 2024

Im quite sure there is more than a 100 clusters even. Though that would be harder to prove.

So yea, it would be rough

pico_creator · on Oct 11, 2024

Also: how many of those consultants, have actually rented GPU's - used them for inference - or used them to finetune / train

aurareturn · on Oct 11, 2024

I’m guessing most of them are advising Wallstreet on AI demand.

pico_creator · on Oct 11, 2024

Feel free to forward to the clients of "paid consultant". Also how do i collect my cut.