Deep learning: developments in the 21st century

Deep learning techniques have roots that go back centuries, but the developments of recent decades are what produced the algorithms we use today. Let's look at the advances that led to today's technology.

Reading time: 5 minutes

As seen in the article Deep learning: roots, the techniques used in deep learning today rest on mathematical and statistical foundations laid in past centuries. Of course, much has changed with the advent of new technologies. The huge amounts of data collected and shared through the World Wide Web have made data available for almost any problem, while also creating new challenges. The Internet is not the only thing that changed the rules of the game: so did the emergence of large ICT companies, the spread of low-cost, high-quality sensors, the growth in data storage capacity (Kryder’s law), and the increase in computing power (Moore’s law). But is this what really revolutionized the field of deep learning?

GPUs

The real revolution in deep learning came from the steady and formidable progress of GPUs, cards originally designed for video games. Thanks to them, algorithms and models that once seemed computationally out of reach are now available to everyone. To get a sense of the progress made over the last fifty years, consider the following table.

Decade	Dataset	Memory	Floating point calculations per second
1970	100 (Iris)	1 KB	100 KF (Intel 8080)
1980	1 K (house prices in Boston)	100 KB	1 MF (Intel 80186)
1990	10 K (optical character recognition)	10 MB	10 MF (Intel 80486)
2000	10 M (web pages)	100 MB	1 GF (Intel Core)
2010	10 G (advertising)	1 GB	1 TF (NVIDIA C2050)
2020	1 T (social network)	100 GB	1 PF (NVIDIA DGX-2)

Note that random access memory has not kept pace with the growth in data. At the same time, computing power has grown faster than datasets. This means that statistical models need to become more memory-efficient, while the larger computational budget leaves them free to spend more compute cycles optimizing parameters. As a result, the focus of machine learning and statistics has shifted from (generalized) linear models and kernel methods to deep neural networks. This is also one of the reasons why many of the pillars of deep learning, such as multilayer perceptrons (McCulloch and Pitts, 1943), convolutional neural networks (LeCun et al., 1998), long short-term memory (Hochreiter and Schmidhuber, 1997), and Q-learning (Watkins and Dayan, 1992), have essentially been “rediscovered” in the past decade, after lying relatively dormant for a long time.

The progress of the 21st century

Recent advances in statistical models, applications, and algorithms have sometimes been compared to the Cambrian explosion, a moment of rapid progress in the evolution of species. Indeed, the state of the art is not merely a consequence of available resources applied to decades-old algorithms.

For example, new methods for capacity control, such as dropout (Srivastava et al., 2014), have helped mitigate overfitting. Here, noise is injected throughout the neural network during training.
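To make the noise injection concrete, here is a minimal sketch of dropout applied to a list of activations. It assumes the common "inverted" variant, which rescales the surviving values at training time so that the expected activation is unchanged; the function name `dropout` and the parameter `p_drop` are illustrative choices, not taken from the cited paper.

```python
import random

def dropout(activations, p_drop, rng=random):
    """Zero each activation with probability p_drop and rescale the survivors.

    Assumption: inverted dropout, where kept values are divided by the keep
    probability so the expected value of each activation stays the same.
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10_000
noisy = dropout(hidden, p_drop=0.5, rng=rng)
# Each value is now either 0.0 or 2.0; the mean stays close to 1.0.
print(sum(noisy) / len(noisy))
```

At inference time the function would simply not be applied, which is why the rescaling is done during training in this variant.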

Attention mechanisms, in turn, solved a second problem that has plagued statistics for more than a century: how to increase the memory and complexity of a system without increasing the number of learnable parameters. Researchers found an elegant solution using what can only be seen as a learnable pointer structure (Bahdanau et al., 2014). Instead of having to remember an entire text sequence in a fixed-size representation, e.g., for machine translation, the model only needs to store a pointer to the intermediate state of the translation process. This greatly increased accuracy for long sequences, since the model no longer has to remember the entire sequence before starting to generate a new one.
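The "learnable pointer" idea can be sketched as a soft lookup over stored intermediate states. The example below is only an illustration: it scores states with a plain dot product, whereas Bahdanau et al. use a small learned scoring network, and the names `attend` and `encoder_states` are hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    """Return (weights, context) for one query over a list of state vectors.

    Assumption: dot-product scoring, chosen for brevity.
    """
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[i] for w, state in zip(weights, states))
               for i in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
weights, context = attend([1.0, 1.0], encoder_states)
print(weights)  # nearly all weight falls on the third state
```

Note that the list of states can grow with the input length without adding any parameters: the weights act as a soft pointer into whatever has been stored.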

Built exclusively on attention mechanisms, the Transformer architecture (Vaswani et al., 2017) has demonstrated superior scaling behavior: it performs better as the size of the dataset, the size of the model, and the amount of training compute increase. This architecture has shown convincing success in a wide range of areas, such as natural language processing, computer vision, speech recognition, reinforcement learning, and graph neural networks. For example, a single Transformer pre-trained on modalities as diverse as text, images, joint torques, and button presses can play Atari games, caption images, chat, and control a robot (Reed et al., 2022).

By modeling the probabilities of text sequences, language models can predict text given other text. Scaling up data, models, and compute has unlocked a growing number of capabilities, allowing language models to perform desired tasks by generating human-like text from input text (Anil et al., 2023; Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023). For example, by aligning language models with human intent (Ouyang et al., 2022), OpenAI’s ChatGPT lets users interact with it conversationally to solve problems such as code debugging and creative writing.
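As a toy illustration of what "modeling the probabilities of text sequences" means, the sketch below estimates next-token probabilities from bigram counts. This is a deliberately simple count-based stand-in, not how the large neural language models cited above work, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Estimate P(next | prev) from the counts (empty dict if prev is unseen)."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()}

def sequence_prob(counts, tokens):
    """Probability of a sequence given its first token, via the chain rule."""
    p = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        p *= next_token_probs(counts, prev).get(nxt, 0.0)
    return p

counts = train_bigram("the cat sat on the mat".split())
print(next_token_probs(counts, "the"))
print(sequence_prob(counts, ["the", "cat", "sat"]))
```

Neural language models replace the count table with a learned function of the whole preceding context, but the quantity being modeled is the same conditional probability.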

Multi-stage designs, e.g., via memory networks (Sukhbaatar et al., 2015) and the neural programmer-interpreter (Reed and De Freitas, 2015), have enabled statistical modelers to describe iterative approaches to reasoning. These tools allow the internal state of the deep neural network to be modified repeatedly, thus carrying out successive steps in a chain of reasoning, just as a processor can modify memory during a computation.

A key development in deep generative modeling was the invention of generative adversarial networks (Goodfellow et al., 2014). Traditionally, statistical methods for density estimation and generative models focused on finding suitable probability distributions and (often approximate) algorithms for sampling from them. As a result, these algorithms were largely limited by the lack of flexibility inherent in statistical models.

The crucial innovation in generative adversarial networks was to replace the sampler with an arbitrary algorithm with differentiable parameters. These are then adjusted so that the discriminator (effectively a two-sample test) cannot distinguish fake from real data.
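This adversarial game is usually written as a minimax objective. The formulation below follows the standard one from Goodfellow et al. (2014); the symbols are not defined in this article and are introduced here for illustration: G is the generator (the differentiable sampler), D is the discriminator, p_data is the data distribution, and p_z is the noise distribution fed to the generator.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator raises V by assigning high probability to real samples and low probability to generated ones, while the generator lowers V by producing samples that the discriminator accepts as real.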

Because arbitrary algorithms can be used to generate the data, density estimation has been opened up to a wide variety of techniques. The examples of galloping zebras (Zhu et al., 2017) and fake celebrity faces (Karras et al., 2017) testify to this progress. Even amateur doodlers can produce photorealistic images based only on sketches describing the layout of a scene (Park et al., 2019).

Moreover, while the diffusion process gradually adds random noise to data samples, diffusion models (Ho et al., 2020) learn the denoising process, gradually building data samples from random noise by reversing the diffusion process. They have begun to replace generative adversarial networks in more recent deep generative models, such as DALL-E 2 and Imagen, which generate creative art and images from text descriptions.
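The forward (noising) half of this process is simple enough to sketch directly. The code below assumes the linear noise schedule and closed-form sampling used in Ho et al. (2020); the function names and default values are illustrative, and the learned denoising network, which is the hard part, is not shown.

```python
import math
import random

def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule.

    Assumption: schedule endpoints chosen to mirror a commonly used setup.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars

def noise_sample(x0, alpha_bar, rng):
    """Sample x_t from x_0 in one step: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

bars = alpha_bars(1000)
rng = random.Random(0)
slightly_noisy, _ = noise_sample([1.0, -1.0], bars[0], rng)   # almost the data
almost_noise, _ = noise_sample([1.0, -1.0], bars[-1], rng)    # almost pure noise
print(bars[0], bars[-1])
```

A diffusion model is then trained to predict the noise `eps` from `xt` and the step index, so that it can run the process in reverse starting from pure noise.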

Computational challenges

In many cases, a single GPU is not sufficient to process the large amounts of data available for training. Over the past decade, the ability to build parallel and distributed training algorithms has improved significantly.

A major challenge in designing scalable algorithms is that the workhorse of deep learning optimization, stochastic gradient descent, relies on relatively small minibatches of data. At the same time, small batches limit the efficiency of GPUs. Hence, training on 1,024 GPUs with a minibatch size of, say, 32 images per GPU amounts to an aggregate minibatch of about 32,000 images.
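The arithmetic, and the reason the per-GPU batches add up to one large batch, can be checked with a small sketch. It assumes synchronous data parallelism, where each worker computes a gradient on its own equally sized shard and the gradients are averaged; the 1-D model and the function names are hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_grad(w, batch, num_workers):
    """Split the batch evenly, compute one gradient per worker, then average."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must be divisible by num_workers")
    shard = len(batch) // num_workers
    shards = [batch[i * shard:(i + 1) * shard] for i in range(num_workers)]
    return sum(grad_mse(w, s) for s in shards) / num_workers

# Aggregate minibatch size for the example in the text.
num_gpus, per_gpu_batch = 1024, 32
aggregate_batch = num_gpus * per_gpu_batch  # 32768, i.e. "about 32,000"

# Averaging per-worker gradients matches the gradient on the full batch.
batch = [(float(i), 3.0 * i + 1.0) for i in range(8)]
print(grad_mse(0.5, batch), data_parallel_grad(0.5, batch, num_workers=4))
```

Because the averaged gradient is the same as the large-batch gradient, adding GPUs at a fixed per-GPU batch size necessarily means optimizing with a much larger effective minibatch.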

Work in recent years has pushed the minibatch size up to 64,000 observations, reducing the training time of the ResNet-50 model on the ImageNet dataset to less than 7 minutes. By comparison, training times were initially measured in days.

The ability to parallelize computation has also contributed to progress in reinforcement learning, with computers achieving superhuman performance in tasks such as Go, Atari games, and StarCraft, and in physics simulations (e.g., using MuJoCo), when environment simulators are available. Simply put, reinforcement learning works best when plenty of (state, action, reward) tuples are available. Simulation provides exactly that.
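To show how a simulator feeds a learner with such tuples, here is a minimal tabular Q-learning sketch. Everything in it is an assumption made for illustration: the five-cell corridor environment, the hyperparameters, and the function names. Note that the update also uses the next state, which the simulator returns along with the reward.

```python
import random

def step(state, action):
    """Hypothetical corridor with cells 0..4: action 1 moves right, 0 moves left.

    Returns (next_state, reward, done); reaching cell 4 pays 1 and ends the episode.
    """
    nxt = max(0, min(4, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == 4 else 0.0), nxt == 4

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values from simulated (state, action, reward, next state) tuples."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(5)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore at random with probability epsilon, or when values are tied.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, action)
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q = q_learning()
print([round(max(row), 3) for row in q[:4]])
```

Each call to `step` costs almost nothing, so the learner can consume as many tuples as it needs; with a real-world system in place of the simulator, every tuple would be far more expensive to obtain.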

Dissemination of deep learning

Deep learning frameworks have played a crucial role in the dissemination of ideas and models. The first generation of open-source frameworks for modeling neural networks consisted of Caffe, Torch, and Theano. Many seminal papers were written using these tools.

These were superseded by a second generation: TensorFlow (often used through its high-level API Keras), CNTK, Caffe2, and Apache MXNet. The third generation of frameworks consists of so-called imperative tools for deep learning, a trend arguably ignited by Chainer, which used a syntax similar to Python NumPy to describe models. This idea was adopted by PyTorch, the Gluon API of MXNet, and JAX.

The division of labor between systems researchers building better tools and statistical modelers building better neural networks has greatly simplified things. For example, only a decade ago, training a linear logistic regression model was a nontrivial problem, worthy of being assigned to new PhD students. Today, this task can be done with fewer than 10 lines of code, putting it within the reach of any programmer.
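For a sense of scale, the sketch below fits a one-feature logistic regression by batch gradient descent in plain Python, without any framework. It is only an illustration under simple assumptions (one input feature, a fixed learning rate, a tiny hand-made dataset); with one of the frameworks mentioned above, the same task is typically even shorter.

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        preds = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]
        w -= lr * sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        b -= lr * sum(p - y for p, y in zip(preds, ys)) / len(xs)
    return w, b

# Hypothetical toy data: negative inputs are class 0, positive inputs are class 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
print(w, b)
```

The training function itself is eight lines long, which is the point of the comparison: the difficulty has moved from implementing the optimizer to choosing the model.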
