As seen in the article Deep learning: roots, the techniques used in deep learning to date have their roots in mathematical and statistical foundations laid centuries ago. Of course, much has changed with the advent of new technologies. The huge amounts of data collected and disseminated through the World Wide Web have made data available for almost any problem, while also creating new challenges to address. The Internet is not the only thing that changed the rules of the game: so did the emergence of large companies in the ICT field, the spread of low-cost, high-quality sensors, the growth in storage capacity (Kryder’s law), and the increase in computing power (Moore’s law). But is this what really revolutionized the field of deep learning?
GPUs
The real revolution in the field of deep learning has been the steady and formidable progress of GPUs, i.e., cards originally designed for video games. Thanks to these cards, algorithms and models that once seemed computationally out of reach have come within everyone’s reach. To appreciate the progress made over the last fifty years, consider the following table.
| Decade | Dataset | Memory | Floating point calculations per second |
|---|---|---|---|
| 1970 | 100 (Iris) | 1 KB | 100 KF (Intel 8080) |
| 1980 | 1 K (house prices in Boston) | 100 KB | 1 MF (Intel 80186) |
| 1990 | 10 K (optical character recognition) | 10 MB | 10 MF (Intel 80486) |
| 2000 | 10 M (web pages) | 100 MB | 1 GF (Intel Core) |
| 2010 | 10 G (advertising) | 1 GB | 1 TF (NVIDIA C2050) |
| 2020 | 1 T (social network) | 100 GB | 1 PF (NVIDIA DGX-2) |
Note that random access memory has not kept pace with data growth. At the same time, the increase in computing power has outpaced the growth in datasets. This means that statistical models must become more memory-efficient, and that they are free to spend more compute cycles optimizing parameters thanks to the larger computational budget. As a result, the focus of machine learning and statistics has shifted from (generalized) linear models and kernel methods to deep neural networks. This is also one of the reasons why many of the pillars of deep learning, such as multilayer perceptrons (McCulloch and Pitts, 1943), convolutional neural networks (LeCun et al., 1998), long short-term memory (Hochreiter and Schmidhuber, 1997), and Q-Learning (Watkins and Dayan, 1992), have essentially been “rediscovered” in the past decade, after remaining relatively dormant for a long time.
The progress of the 21st century
Recent advances in statistical models, applications, and algorithms have sometimes been compared to the Cambrian explosion, a moment of rapid progress in the evolution of species. Indeed, the state of the art is not merely a consequence of available resources applied to decades-old algorithms.
For example, new methods for capacity control, such as dropout (Srivastava et al., 2014), have been introduced to help mitigate overfitting. Here, noise is injected throughout the neural network during training.
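As a minimal sketch of this idea, the snippet below implements the widely used "inverted" variant of dropout in plain Python. The function name `dropout`, the drop probability `p`, and the rescaling by `1 / (1 - p)` are illustrative assumptions rather than details taken from the text above; the noise here is a random mask that zeroes activations during training only.

```python
import random

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each activation with probability p during training."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # At evaluation time the layer is the identity.
        return list(activations)
    keep = 1.0 - p
    # Surviving units are scaled by 1/keep so the expected value is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                 # fixed seed for reproducibility
h = [1.0] * 10000                      # hypothetical hidden activations
h_train = dropout(h, p=0.5, rng=rng)
h_eval = dropout(h, p=0.5, rng=rng, training=False)
mean_train = sum(h_train) / len(h_train)
```

With `p=0.5`, roughly half of the entries of `h_train` are zero and the rest are doubled, so `mean_train` stays close to the original mean of 1.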
Attention mechanisms solved a second problem that has plagued statistics for more than a century: how to increase the memory and complexity of a system without increasing the number of learnable parameters. Researchers found an elegant solution using what can only be seen as a learnable pointer structure (Bahdanau et al., 2014). Instead of having to remember an entire sequence of text, e.g., for machine translation, in a fixed-size representation, the model only needs to store a pointer to the intermediate state of the translation process. This greatly increased accuracy for long sequences, since the model no longer has to remember the entire sequence before starting to generate a new one.
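The "learnable pointer" can be illustrated with a small sketch. Note that it swaps in dot-product scoring for simplicity, whereas Bahdanau et al. (2014) use an additive scoring function with learned parameters. The names `attend`, `encoder_states`, and `decoder_query` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    """Return attention weights over states and their weighted average (the context)."""
    # Dot-product similarity between the query and every stored state.
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]
    return weights, context

# Three stored encoder states and one decoder query (toy values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_query = [0.0, 5.0]
weights, context = attend(decoder_query, encoder_states)
```

The weights act as a soft pointer: they sum to one and concentrate on the states most similar to the query, so the decoder can look back at any position instead of compressing the whole sequence into one fixed-size vector.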
Built exclusively on attention mechanisms, the Transformer architecture (Vaswani et al., 2017) has demonstrated superior scaling behavior: it performs better as the size of the dataset, the size of the model, and the amount of training compute increase. This architecture has demonstrated convincing success in a wide range of areas, such as natural language processing, computer vision, speech recognition, reinforcement learning, and graph neural networks. For example, a single Transformer pre-trained on modalities as diverse as text, images, joint torques, and button presses can play Atari games, caption images, chat, and control a robot (Reed et al., 2022).
By modeling the probabilities of text sequences, language models are able to predict text given other text. Scaling up data, models, and compute has unlocked a growing number of capabilities, allowing language models to perform desired tasks by generating human-like text from input text (Anil et al., 2023, Brown et al., 2020, OpenAI, 2023, Touvron et al., 2023). For example, by aligning language models with human intent (Ouyang et al., 2022), OpenAI’s ChatGPT allows users to interact with it conversationally to solve problems, such as code debugging and creative writing.
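To make "modeling the probabilities of text sequences" concrete, here is a toy sketch that uses a count-based bigram model instead of a neural network. The helper names and the tiny corpus are hypothetical. It estimates the probability of the next token given the previous one and multiplies these conditionals to score a sequence, conditioned on its first token.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Estimate P(next | prev) from the counts; empty if prev was never seen."""
    if prev not in counts:
        return {}
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def sequence_prob(counts, tokens):
    """Chain rule: product of P(token_t | token_{t-1}), given the first token."""
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= next_token_probs(counts, prev).get(nxt, 0.0)
    return prob

corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")          # {'cat': 2/3, 'mat': 1/3}
p_seq = sequence_prob(model, ["the", "cat", "sat"])
```

Large language models replace the count table with a neural network that conditions on a much longer context, but the prediction target, a distribution over the next token, is the same.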
Multi-stage designs, for example via memory networks (Sukhbaatar et al., 2015) and the neural programmer-interpreter (Reed and De Freitas, 2015), have enabled statistical modelers to describe iterative approaches to reasoning. These tools allow the internal state of the deep neural network to be modified repeatedly, thus carrying out successive steps in a chain of reasoning, just as a processor can modify memory during a computation.
A key development in deep generative modeling was the invention of generative adversarial networks (Goodfellow et al., 2014). Traditionally, statistical methods for density estimation and generative models focused on finding suitable probability distributions and (often approximate) algorithms for sampling from them. As a result, these algorithms were largely limited by the lack of flexibility inherent in statistical models.
The crucial innovation in generative adversarial networks has been to replace the sampler with an arbitrary algorithm with differentiable parameters. These are then adjusted so that the discriminator (effectively a two-sample test) cannot distinguish fake from real data.
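This adversarial game is commonly written as the minimax objective from Goodfellow et al. (2014). The symbols are not defined in the text above and are introduced here for illustration: $G$ is the generator (the differentiable sampler), $D$ the discriminator, $p_{\text{data}}$ the data distribution, and $p_z$ the noise distribution fed to the generator.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator maximizes $V$ by assigning high probability to real samples and low probability to generated ones, while the generator minimizes it by producing samples that the discriminator classifies as real.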
Because arbitrary algorithms can be used to generate the data, density estimation has opened up to a wide variety of techniques. The examples of galloping zebras (Zhu et al., 2017) and fake celebrity faces (Karras et al., 2017) testify to this progress. Even amateur doodlers can produce photorealistic images just based on sketches describing the layout of a scene (Park et al., 2019).
Moreover, while the diffusion process gradually adds random noise to data samples, diffusion models (Ho et al., 2020) learn the denoising process, gradually building data samples from random noise by reversing the diffusion process. They have begun to replace generative adversarial networks in more recent deep generative models, such as DALL-E 2 and Imagen, which generate creative art and images from textual descriptions.
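The forward (noising) half of this process is simple enough to sketch directly. The snippet below applies the Gaussian step used by Ho et al. (2020), x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise, to scalar "data". The constant schedule `betas`, the sample count, and the function name are hypothetical choices for illustration. The learned reverse (denoising) network is not shown.

```python
import math
import random

def forward_diffusion(x0, betas, rng):
    """Repeatedly apply x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise."""
    x = list(x0)
    for beta in betas:
        keep, noise = math.sqrt(1.0 - beta), math.sqrt(beta)
        x = [keep * xi + noise * rng.gauss(0.0, 1.0) for xi in x]
    return x

rng = random.Random(0)      # fixed seed for reproducibility
x0 = [5.0] * 2000           # toy "data": every sample equals 5
betas = [0.02] * 500        # hypothetical constant noise schedule
xT = forward_diffusion(x0, betas, rng)

mean = sum(xT) / len(xT)
var = sum((v - mean) ** 2 for v in xT) / len(xT)
```

After enough steps the samples are close to standard Gaussian noise (`mean` near 0, `var` near 1) regardless of the starting data. A diffusion model is trained to undo these steps one at a time.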
Computational challenges
In many cases, a single GPU is not sufficient to process the large amounts of data available for training. Over the past decade, the ability to build parallel and distributed training algorithms has improved significantly.
A major challenge in designing scalable algorithms is that the workhorse of deep learning optimization, stochastic gradient descent, relies on relatively small minibatches of data. At the same time, small batches limit the efficiency of GPUs. Training on 1,024 GPUs with a minibatch of, say, 32 images per GPU therefore amounts to an aggregate minibatch of about 32,000 images (1,024 × 32 = 32,768).
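The arithmetic, and the reason the per-GPU batches behave like one large batch, can be checked with a small data-parallel sketch. It assumes equally sized shards and a toy one-parameter least-squares model. The names `grad_mse`, `num_workers`, and `per_worker_batch` are hypothetical, and the "workers" run sequentially here rather than on real GPUs.

```python
import random

def grad_mse(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

num_workers, per_worker_batch = 1024, 32
aggregate_batch = num_workers * per_worker_batch   # 32768

rng = random.Random(0)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(aggregate_batch)]
data = [(x, 3.0 * x) for x in inputs]              # toy targets with true w = 3

# Each worker receives its own shard of 32 examples.
shards = [data[i * per_worker_batch:(i + 1) * per_worker_batch]
          for i in range(num_workers)]

w = 0.0
worker_grads = [grad_mse(w, shard) for shard in shards]  # parallel in practice
averaged_grad = sum(worker_grads) / num_workers
full_batch_grad = grad_mse(w, data)
```

Averaging the per-worker gradients gives the same value (up to floating-point rounding) as computing the gradient on all 32,768 examples at once, which is why synchronous data parallelism effectively enlarges the minibatch.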
Some work in recent years has pushed the size to 64,000 observations, reducing the training time of the ResNet-50 model on the ImageNet dataset to less than 7 minutes. In comparison, training times were initially on the order of days.
The ability to parallelize computations has also contributed to advances in reinforcement learning. It has enabled computers to achieve superhuman performance in tasks such as Go, Atari games, StarCraft, and physics simulations (e.g., using MuJoCo) whenever environment simulators are available. Simply put, reinforcement learning works best when many (state, action, reward) tuples are available. Simulation offers this possibility.
Dissemination of deep learning
Deep learning frameworks have played a crucial role in the dissemination of ideas and models. The first generation of open-source frameworks for modeling neural networks consisted of Caffe, Torch, and Theano. Many seminal papers were written using these tools.
These have since been superseded by a second generation of tools: TensorFlow (often used through its high-level API Keras), CNTK, Caffe 2, and Apache MXNet. The third generation of frameworks consists of so-called imperative tools for deep learning, a trend that was probably triggered by Chainer, which used a syntax similar to Python NumPy to describe models. This idea has been adopted by PyTorch, MXNet’s Gluon API, and JAX.
The division of labor between systems researchers building better tools and statistical modelers building better neural networks has greatly simplified things. For example, training a linear logistic regression model was a nontrivial problem only a decade ago, worthy of being assigned to new PhD students. Today, this task can be done with fewer than 10 lines of code, putting it within the reach of any programmer.
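As a rough illustration of how compact this has become, the sketch below trains a one-feature logistic regression by full-batch gradient descent using only the Python standard library. It deliberately avoids any framework, so the function name, learning rate, epoch count, and toy data are all assumptions. With a modern framework the training loop would be shorter still.

```python
import math

def train_logistic_regression(xs, ys, lr=0.5, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by full-batch gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        preds = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]
        # Gradients of the mean cross-entropy loss with respect to w and b.
        w -= lr * sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        b -= lr * sum(p - y for p, y in zip(preds, ys)) / len(xs)
    return w, b

# Toy, deliberately non-separable data: larger x tends to mean label 1.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = train_logistic_regression(xs, ys)

def predict(x):
    """Predicted probability that the label is 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))
```

After training, `w` is positive, so `predict(2.0)` is above 0.5 and `predict(-2.0)` is below it.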