
I think everyone who comes from a different literature where academic "rigor" is higher and similar results already exist (in the author's case, he is aware of kernel results) is infuriated by ML papers like "Attention Is All You Need".

They are, in fact, not really good academic papers. A clever name plus the most obtuse engineering-cosplay terminology does not make a good paper; it just makes one that is difficult to read. And so many well-known results get rediscovered to much acclaim in ML, and to head-scratching everywhere else.

For example, yes, they are kernel matrices. Indeed, the connection between reproducing kernel Hilbert spaces and attention matrices has been exploited to build approximating architectures whose attention is linear (not quadratic) in memory.
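
To make that concrete, here is a minimal NumPy sketch of the trick (assuming a simple non-negative feature map phi, in the spirit of the linear-attention papers, not any one architecture's exact construction): once exp(q·k) is replaced by phi(q)·phi(k), the matrix products reassociate and the n x n attention matrix never has to exist.

    import numpy as np

    def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
        # Standard attention materializes an n x n matrix: softmax(Q K^T) V.
        # With a kernel feature map phi, exp(q.k) ~ phi(q).phi(k), so the
        # products reassociate as phi(Q) (phi(K)^T V): O(n d^2) time and
        # O(n d + d^2) memory -- no n x n matrix anywhere.
        Qf, Kf = phi(Q), phi(K)
        KV = Kf.T @ V                      # d x d summary of keys/values
        Z = Qf @ Kf.sum(axis=0)            # per-query normalizer (length n)
        return (Qf @ KV) / Z[:, None]

    rng = np.random.default_rng(0)
    n, d = 512, 64
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    out = linear_attention(Q, K, V)        # shape (n, d)

The price, of course, is that phi(q)·phi(k) only approximates the softmax kernel, which is exactly the RKHS connection at work.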

Or, as the author of the article also recognizes, the fact that attention matrices are also adjacency matrices of a directed graph can be used to show that attention models are permutation-equivariant (or "unidentified", as the author puts it), and are therefore excellent tools for modeling graphs (see the entire geometric deep learning literature) and rather poor tools for modeling sequences of text.
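
The equivariance claim is easy to check numerically. A toy sketch (hypothetical weights, single head, no positional information): permuting the input tokens merely permutes the output rows the same way, so the model cannot tell a sentence from its shuffled version.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # Bare single-head self-attention, no positional information.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        A = np.exp(S - S.max(axis=1, keepdims=True))   # stable softmax
        return (A / A.sum(axis=1, keepdims=True)) @ V

    rng = np.random.default_rng(1)
    n, d = 6, 4
    X = rng.standard_normal((n, d))
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    perm = rng.permutation(n)

    # Shuffling the tokens merely shuffles the outputs the same way:
    assert np.allclose(self_attention(X, Wq, Wk, Wv)[perm],
                       self_attention(X[perm], Wq, Wk, Wv))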

LLMs may or may not collapse to a single centroid if the amount of text data and parameters and whatever else are not in some intricate balance that nobody understands, and so they are inherently unstable tools.

All of this is true.

But then, here is the infuriating thing: all this matters very little in practice. LLMs work, and on top of that, they work for stupid reasons!

The problem of "identification" was quickly solved by another engineering feat, which was to slap on "positional embeddings". As usual, this didn't happen because of any deep mathematical understanding; rather, it was attempted, and it worked.
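
For what it's worth, the "slap on" really is that literal. A sketch with a GPT-2-style learned positional table (shapes illustrative, not taken from any specific model): the positions are simply added to the token embeddings, which breaks the permutation symmetry.

    import numpy as np

    # Hypothetical shapes, GPT-2-style learned absolute positions.
    vocab, n_ctx, d = 50_000, 1_024, 768
    rng = np.random.default_rng(2)
    tok_emb = rng.standard_normal((vocab, d)) * 0.02   # token table
    pos_emb = rng.standard_normal((n_ctx, d)) * 0.02   # position table

    token_ids = np.array([17, 42, 42, 99])             # repeated token
    X = tok_emb[token_ids] + pos_emb[:len(token_ids)]  # the "slap on":
    # the two occurrences of token 42 now get different vectors, so
    # attention is no longer permutation-equivariant and order is learnable.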

Or, take the "efficient transformers" that "solve" the issue of quadratic memory growth by using kernel methods. It turns out that, in practice, it just doesn't matter. OpenAI, Anthropic, and Meta simply do not mind slapping on another thousand GPUs; what they care about is throughput. The only efficiency innovation that really established itself was fusing kernels (GPU kernels, that is) in a clever way to make it go brrrrr. And as clever as that is, there's little deep math behind it.

The results come from speculation and empirics. The proof is in the pudding, which is excellent.



> The proof is in the pudding, which is excellent.

Not for long. Steam engines existed long before statistical mechanics, but we don't get to modernity without the latter.


Yet we have many medicines that we have empirically shown to work without a deep understanding of the mechanisms behind them, and we're unlikely to understand many drugs, especially in psychiatry, any time soon.

Trial and error makes the universe go round.


Re:

> The problem of "identification" was quickly solved by another engineering feat, which was to slap on "positional embeddings". As usual, this didn't happen because of any deep mathematical understanding; rather, it was attempted, and it worked.

Wasn't that tried because of robotics?

It's a commonly solved issue: a robot's hand must know each joint's orientation in space. Typically, each joint (a degree of freedom) has a rotary encoder built in. There is more than one type, but the "absolute" version matches the one used in positional embeddings:

https://www.akm.com/content/www/akm/global/en/products/rotat...

(full article: https://www.akm.com/global/en/products/rotation-angle-sensor... )

I find that parallel very fitting, since a positional embedding uses a sequence of sinusoids of increasing frequency. In the "learned positional embedding" GPTs (such as GPT-2), where the network is free to learn whatever it likes, it seems to actually learn the same pattern as the predefined one (albeit a little more wonky).
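
For the curious, the predefined pattern from the original transformer paper is easy to generate (a sketch of the standard sinusoidal formula, not GPT-2's learned table); note how each column pair is a sinusoid at a geometrically increasing wavelength, like the tracks of an absolute encoder:

    import numpy as np

    def sinusoidal_positions(n_pos, d_model):
        # PE[p, 2i]   = sin(p / 10000**(2i/d_model))
        # PE[p, 2i+1] = cos(p / 10000**(2i/d_model))
        # A bank of sinusoids at geometrically increasing wavelengths,
        # much like the multiple tracks of an absolute rotary encoder.
        pos = np.arange(n_pos)[:, None]
        i = np.arange(0, d_model, 2)[None, :]
        angle = pos / 10000 ** (i / d_model)
        pe = np.zeros((n_pos, d_model))
        pe[:, 0::2] = np.sin(angle)
        pe[:, 1::2] = np.cos(angle)
        return pe

    pe = sinusoidal_positions(128, 64)   # each row: a unique position code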


Transformers don't need quadratic memory for attention unless you scale the head dimension proportional to the sequence length. And even that can be tamed.

The arithmetic intensity of unfused attention is too low on typical GPUs; it's a memory-bandwidth issue even more than a memory-capacity issue. Just look at how much faster FlashAttention is.
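
The trick FlashAttention exploits can be sketched in NumPy (online softmax over K/V tiles; obviously this captures only the algorithm, not the fused-kernel SRAM scheduling that delivers the actual speedup):

    import numpy as np

    def attention_tiled(Q, K, V, block=64):
        # Online softmax over K/V tiles: the n x n score matrix is never
        # materialized, only an n x block tile at a time. This is the
        # algorithmic core of FlashAttention; the real speedup also needs
        # the tiles to live in SRAM via fused GPU kernels.
        n, d = Q.shape
        out = np.zeros_like(V)
        m = np.full(n, -np.inf)              # running row-wise max
        l = np.zeros(n)                      # running softmax denominator
        for j in range(0, n, block):
            S = Q @ K[j:j+block].T / np.sqrt(d)
            m_new = np.maximum(m, S.max(axis=1))
            scale = np.exp(m - m_new)        # rescale previous partials
            P = np.exp(S - m_new[:, None])
            out = out * scale[:, None] + P @ V[j:j+block]
            l = l * scale + P.sum(axis=1)
            m = m_new
        return out / l[:, None]

Capacity-wise this is O(n d); the practical win is bandwidth, exactly as above.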


Thank you for this clarification. What do you think of geometric deep learning? What other more formal mathematical approaches/research are you aware of?


And on top of that, the nomenclature is really confusing.


+1 to that. It is like the ML people went out of their way to co-opt existing statistical terminology with a slightly different spin, to completely muddy the waters.


It's just that they did not study statistics, so they were unaware of it.

https://www.andrew.cmu.edu/user/mhydari/statistics.ML.dictio...


In my department, instructors were well aware of statistics; it was a prerequisite course on the AI path. Some early software (WEKA) used statistical nomenclature extensively.


The best part of DNNs, I think, is the brute-force backprop over essentially randomized feature generation (convolutions)… Statisticians would never do that.



