>Second, and this is the big one: there were plenty of reasonably large datasets...

snek_case · on Aug 31, 2022

> Even that was not the decisive problem - given enough compute multilayer NNs could be trained just fine.

As someone who has worked in deep learning, this isn't true. Figuring out the right recipe is extremely important. The regime in which you can effectively train a neural network with good performance is just very small. If you structure things just slightly wrong, training just won't converge.

IMO we totally could have trained something like a smaller ImageNet with a decently large dataset on a supercomputer in the late 90s. We just didn't know how. In the late 90s and early 2000s, NN researchers didn't know about (or didn't appreciate the importance of) batchnorm, or ReLUs, or which hyper-parameters to use. Convnets had been invented but hadn't yet been popularized.

In the late 90s, you might have tried training an 8-layer deep MLP on a dataset of 50K images on your Cray T3E supercomputer. But you would most likely have failed, because you weren't using convnets, your sigmoid activation function lead to a vanishing gradient, and your hyper-parameters were off. You might let this run overnight, but 12 hours later, your loss barely went down, and you were out of compute budget on a supercomputer you were sharing with other researchers. Given the right code, the right recipe, you could have made it work on that 90s supercomputer and with the compute budget you had, but the right recipe just wasn't known at the time.

We needed time to figure out and to publish about the really basic ingredients needed to train bigger neural networks successfully. Having access to more compute can help figure out what the right recipe is through trial and error, but ReLUs are a super basic innovation. They are a very simple function, and essentially a theoretical innovation. Someone needed to sit down and figure out that vanishing gradients were actually a problem we needed to think about.

sinenomine · on Aug 31, 2022

> Convnets had been invented but hadn't yet been popularized.

> Given the right code, the right recipe, you could have made it work on that 90s supercomputer and with the compute budget you had, but the right recipe just wasn't known at the time.

We could expect the best of the best given access to these powerful computing machines, surely they would know their lenet (1989) http://karpathy.github.io/2022/03/14/lecun1989/ and could have studied statistics of gradients and activations to implement something like ReLu - which is one line of code.It was a low-hanging fruit.

> Someone needed to sit down and figure out that vanishing gradients were actually a problem we needed to think about.

If Sepp Hochreiter could do it, surely someone from CalTech could do it as well, given the caliber of people who study there. Again, looks like a simple problem of misapplication of best and brightest (who, surely, know what they want to work on in life, but are still influenced by academic prestige and their advisors' advice which would repel them from connectionism then).

snek_case · on Sept 1, 2022

I took a machine learning class in ~2006 and granted the prof wasn't specializing in studying NNs, but he seemed to have little awareness of the impact of dataset size, convnets were never mentioned, the suggested approach was two-layer MLPs and sigmoid activation functions. The course was based on the book "Artificial Intelligence: A Modern Approach" 2nd edition, which barely just mentions neural networks in passing.

I think there just weren't enough people looking at neural networks back then. At the time, they were still considered a niche machine learning technique, just one tool in the toolbox, and not necessarily the best one.

Also, it's one thing to say that the best of the best had access to computation, but if your university has just one supercomputer, which is shared among everyone, and you have a limited number of hours during which you can use it, that's not increasing your chances at running very thorough experiments. If anything, it's not that we didn't have the compute to do deep learning in the 90s and early 2000s, it's that more access to compute makes a broader range of experimentation a lot more accessible. That makes it a lot easier to find the right training recipe more quickly.

Back in 2009 the enthusiasm around neural nets was starting to grow, as there were some early successes with deep neural nets, but I remember some people were still trying to argue that support vector machines were just as effective for image classification. It seemed laughable to me at the time, but I think that everyone who wasn't a connectionist was starting to feel threatened and wanted to justify their chosen research specialty, which they would keep doing until they couldn't anymore.

sdenton4 · on Aug 31, 2022

In the early days (pre 2015) it was extremely difficult to train models with more than two or three layers.

Batch Norm and Residual connections opened the door to actual depth.

(On edit: ReLu activations were also vital for dealing with vanishing gradients. And there were some real advances in initialization and opportunities along the way, as well.)