
I think everyone who comes from a different literature where academic "rigor" is higher and similar results already exist (in the author's case, he is aware of kernel results) is infuriated by ML papers like "Attention Is All You Need".

They are, in fact, not really good academic papers. A clever name plus the most obtuse engineering-cosplay terminology does not make a good paper; it just makes one that is difficult to read. And so many well-known results get rediscovered to much acclaim in ML, and to head-scratching everywhere else.

For example, yes, they are kernel matrices. Indeed, the connection between reproducing kernel Hilbert spaces and attention matrices has been exploited to build approximating architectures whose attention is linear (not quadratic) in memory.
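
To make that concrete, here is a minimal NumPy sketch of the trick (assuming a simple non-negative feature map phi, in the spirit of the linear-attention papers, not any one architecture's exact construction): once exp(q·k) is replaced by phi(q)·phi(k), the matrix products reassociate and the n x n attention matrix never has to exist.

    import numpy as np

    def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
        # Standard attention materializes an n x n matrix: softmax(Q K^T) V.
        # With a kernel feature map phi, exp(q.k) ~ phi(q).phi(k), so the
        # products reassociate as phi(Q) (phi(K)^T V): O(n d^2) time and
        # O(n d + d^2) memory -- no n x n matrix anywhere.
        Qf, Kf = phi(Q), phi(K)
        KV = Kf.T @ V                      # d x d summary of keys/values
        Z = Qf @ Kf.sum(axis=0)            # per-query normalizer (length n)
        return (Qf @ KV) / Z[:, None]

    rng = np.random.default_rng(0)
    n, d = 512, 64
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    out = linear_attention(Q, K, V)        # shape (n, d)

The price, of course, is that phi(q)·phi(k) only approximates the softmax kernel, which is exactly the RKHS connection at work.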

Or, as the author of the article also recognizes, the fact that attention matrices are also adjacency matrices of a directed graph can be used to show that attention models are permutation-equivariant (or "unidentified", as the author puts it), and are therefore excellent tools for modeling graphs (see the entire geometric deep learning literature) and rather poor tools for modeling sequences of text.
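
The equivariance claim is easy to check numerically. A toy sketch (hypothetical weights, single head, no positional information): permuting the input tokens merely permutes the output rows the same way, so the model cannot tell a sentence from its shuffled version.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # Bare single-head self-attention, no positional information.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        A = np.exp(S - S.max(axis=1, keepdims=True))   # stable softmax
        return (A / A.sum(axis=1, keepdims=True)) @ V

    rng = np.random.default_rng(1)
    n, d = 6, 4
    X = rng.standard_normal((n, d))
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    perm = rng.permutation(n)

    # Shuffling the tokens merely shuffles the outputs the same way:
    assert np.allclose(self_attention(X, Wq, Wk, Wv)[perm],
                       self_attention(X[perm], Wq, Wk, Wv))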

LLMs may or may not collapse to a single centroid if the amount of text data and parameters and whatever else are not in some intricate balance that nobody understands, and so they are inherently unstable tools.

All of this is true.

But then, here is the infuriating thing: all this matters very little in practice. LLMs work, and on top of that, they work for stupid reasons!

The problem of "identification" was quickly solved by another engineering feat, which was to slap on "positional embeddings". As usual, this didn't happen because of any deep mathematical understanding; rather, it was attempted, and it worked.
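
For what it's worth, the "slap on" really is that literal. A sketch with a GPT-2-style learned positional table (shapes illustrative, not taken from any specific model): the positions are simply added to the token embeddings, which breaks the permutation symmetry.

    import numpy as np

    # Hypothetical shapes, GPT-2-style learned absolute positions.
    vocab, n_ctx, d = 50_000, 1_024, 768
    rng = np.random.default_rng(2)
    tok_emb = rng.standard_normal((vocab, d)) * 0.02   # token table
    pos_emb = rng.standard_normal((n_ctx, d)) * 0.02   # position table

    token_ids = np.array([17, 42, 42, 99])             # repeated token
    X = tok_emb[token_ids] + pos_emb[:len(token_ids)]  # the "slap on":
    # the two occurrences of token 42 now get different vectors, so
    # attention is no longer permutation-equivariant and order is learnable.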

Or, take the "efficient transformers" that "solve" the issue of quadratic memory growth by using kernel methods. It turns out that, in practice, it just doesn't matter. OpenAI, Anthropic, and Meta simply do not mind slapping on another thousand GPUs; what they care about is throughput. The only efficiency innovation that really established itself was fusing kernels (GPU kernels, that is) in a clever way to make it go brrrrr. And as clever as that is, there's little deep math behind it.

The results come from speculation and empirics. The proof is in the pudding, which is excellent.



> The proof is in the pudding, which is excellent.

Not for long. Steam engines existed long before statistical mechanics, but we don't get to modernity without the latter.


Yet we have many medicines that we have empirically shown to work without a deep understanding of the mechanisms behind them, and we're unlikely to understand many drugs, especially in psychiatry, any time soon.

Trial and error makes the universe go round.


Re:

> The problem of "identification" was quickly solved by another engineering feat, which was to slap on "positional embeddings". As usual, this didn't happen because of any deep mathematical understanding; rather, it was attempted, and it worked.

Wasn't that tried because of robotics?

It's a commonly solved issue: a robot's hand must know each joint's orientation in space. Typically, each joint (a degree of freedom) has a rotary encoder built in. There is more than one type, but the "absolute" version matches the one used in positional embeddings:

https://www.akm.com/content/www/akm/global/en/products/rotat...

(full article: https://www.akm.com/global/en/products/rotation-angle-sensor... )

I find that parallel very fitting, since a positional embedding uses a sequence of sinusoids of increasing frequency. In the "learned positional embedding" GPTs (such as GPT-2), where the network is free to learn whatever it likes, it seems to actually learn the same pattern as the predefined one (albeit a little more wonky).
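
For the curious, the predefined pattern from the original transformer paper is easy to generate (a sketch of the standard sinusoidal formula, not GPT-2's learned table); note how each column pair is a sinusoid at a geometrically increasing wavelength, like the tracks of an absolute encoder:

    import numpy as np

    def sinusoidal_positions(n_pos, d_model):
        # PE[p, 2i]   = sin(p / 10000**(2i/d_model))
        # PE[p, 2i+1] = cos(p / 10000**(2i/d_model))
        # A bank of sinusoids at geometrically increasing wavelengths,
        # much like the multiple tracks of an absolute rotary encoder.
        pos = np.arange(n_pos)[:, None]
        i = np.arange(0, d_model, 2)[None, :]
        angle = pos / 10000 ** (i / d_model)
        pe = np.zeros((n_pos, d_model))
        pe[:, 0::2] = np.sin(angle)
        pe[:, 1::2] = np.cos(angle)
        return pe

    pe = sinusoidal_positions(128, 64)   # each row: a unique position code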


Transformers don't need quadratic memory for attention unless you scale the head dimension proportional to the sequence length. And even that can be tamed.

The arithmetic intensity of unfused attention is too low on typical GPUs; it's a memory-bandwidth issue even more than a memory-capacity issue. Just look at how much faster FlashAttention is.
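
The trick FlashAttention exploits can be sketched in NumPy (online softmax over K/V tiles; obviously this captures only the algorithm, not the fused-kernel SRAM scheduling that delivers the actual speedup):

    import numpy as np

    def attention_tiled(Q, K, V, block=64):
        # Online softmax over K/V tiles: the n x n score matrix is never
        # materialized, only an n x block tile at a time. This is the
        # algorithmic core of FlashAttention; the real speedup also needs
        # the tiles to live in SRAM via fused GPU kernels.
        n, d = Q.shape
        out = np.zeros_like(V)
        m = np.full(n, -np.inf)              # running row-wise max
        l = np.zeros(n)                      # running softmax denominator
        for j in range(0, n, block):
            S = Q @ K[j:j+block].T / np.sqrt(d)
            m_new = np.maximum(m, S.max(axis=1))
            scale = np.exp(m - m_new)        # rescale previous partials
            P = np.exp(S - m_new[:, None])
            out = out * scale[:, None] + P @ V[j:j+block]
            l = l * scale + P.sum(axis=1)
            m = m_new
        return out / l[:, None]

Capacity-wise this is O(n d); the practical win is bandwidth, exactly as above.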


Thank you for this clarification. What do you think of geometric deep learning? What other more formal mathematical approaches/research are you aware of?


And on top of that, the nomenclature is really confusing.


+1 to that. It is like the ML people went out of their way to co-opt existing statistical terminology with a slightly different spin, to completely muddy the waters.


It's just that they did not study statistics, so they were unaware of it.

https://www.andrew.cmu.edu/user/mhydari/statistics.ML.dictio...


In my department, instructors were well aware of statistics; it was a prerequisite course on the AI path. Some early software (WEKA) used statistical nomenclature extensively.


The best part of DNNs, I think, is the brute-force backprop over essentially randomized feature generation (convolutions)… Statisticians would never do that.



