Deep learning: unsupervised and reinforcement learning

Supervised methods require that our example data also provide us with the labels we should predict. However, these learning methods are limited. In some cases we either do not know what we are looking for or the environment itself can give us hints as to how our model should evolve. Unsupervised and reinforcement learning methods address these types of problems.

Reading time: 6 minutes

Models can be built from data using several techniques. In the previous articles, Deep learning: Supervised learning [part 1] and Deep learning: Supervised learning [part 2], we focused on supervised learning, in which we feed the model a large dataset containing both features and the corresponding label values. For these problems, the desired output is known a priori: we want to assign new data the same kinds of labels already available to us, and to do so with high accuracy. However, in many contexts the goal is unclear, or no labeled dataset is available.

What do we do in these cases? We turn to a different family of techniques that fall under the name of unsupervised learning.

Unsupervised and Self-Supervised Learning

In unsupervised learning problems, the type and number of questions we can ask are limited only by our creativity. We will address unsupervised learning techniques in later articles. However, to pique your curiosity, let us describe some of the questions that can be asked.

Can we find a small number of examples that accurately summarize the data? Given a collection of photos, can we group them into different categories based on what they represent (e.g., photos of landscapes, animals, children, etc.)? Similarly, given a collection of users’ browsing activities, can we define certain groups of users who have similar behaviors? This problem is typically known as clustering.
Can we find a small number of parameters that accurately capture the relevant properties of the data? The trajectories of a ball are well described by the velocity, diameter, and mass of the ball. Similarly, tailors have developed a small number of parameters that describe the shape of the human body fairly accurately for the purpose of fitting clothes. These problems are referred to as subspace estimation. If the dependence is linear, it is called principal component analysis.
Is there a description of the root causes of much of the data we observe? For example, if we have demographic data on house prices, pollution, crime, location, education, and wages, can we find out how they are related to each other simply by relying on empirical data? The fields dealing with causality and probabilistic graphical models address these questions.
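
To make the clustering question above more concrete, here is a minimal sketch of k-means, one common clustering algorithm (the article does not prescribe a specific one). The function name kmeans, the two hypothetical browsing features, and the values in users are all assumptions made for illustration.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Hypothetical browsing features: (pages per visit, minutes per visit).
users = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0), (8.0, 8.2), (8.1, 7.9), (7.9, 8.0)]
centroids, labels = kmeans(users, k=2)
print(labels)  # the first three users share one label, the last three the other
```

No labels are supplied anywhere: the two groups emerge only from the distances between the points.
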
Another important and exciting recent development in unsupervised learning is the advent of deep generative models. These models estimate the density of the data, either explicitly or implicitly. Once trained, we can use a generative model either to score examples according to how likely they are or to sample synthetic examples from the learned distribution. The first deep learning breakthroughs in generative modeling came with the invention of variational autoencoders (Kingma and Welling, 2014, Rezende et al., 2014) and continued with the development of generative adversarial networks (Goodfellow et al., 2014). More recent advances include normalizing flows (Dinh et al., 2014, Dinh et al., 2017) and diffusion models (Ho et al., 2020, Sohl-Dickstein et al., 2015, Song and Ermon, 2019, Song et al., 2021).
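
The two uses of a generative model, scoring and sampling, can be illustrated without any deep learning at all. The sketch below fits a single Gaussian to a tiny, made-up dataset using Python's standard library; it is a deliberately simple stand-in for the deep models cited above, and the data values are hypothetical.

```python
from statistics import NormalDist

# Hypothetical one-dimensional dataset (for example, heights in cm).
data = [170.2, 168.5, 171.9, 169.4, 170.8, 172.1, 167.9, 170.0]

# "Training": estimate the density by fitting a mean and a standard deviation.
model = NormalDist.from_samples(data)

# Scoring: a typical value receives a higher density than an outlier.
typical_score = model.pdf(170.0)
outlier_score = model.pdf(190.0)
print(typical_score > outlier_score)

# Sampling: draw synthetic examples from the learned distribution.
synthetic = model.samples(5, seed=0)
print(synthetic)
```

Deep generative models follow the same pattern, but replace the two-parameter Gaussian with a neural network able to represent far more complex distributions.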

A further development in unsupervised learning has been the rise of self-supervised learning: techniques that exploit some aspect of the unlabeled data itself to provide supervision. For text, we can train models to "fill in the blanks" by predicting randomly masked words from the surrounding words (contexts) in large corpora, without any labeling effort (Devlin et al., 2018). For images, we can train models to tell the relative position of two cropped regions of the same image (Doersch et al., 2015), to predict an occluded part of an image from the remaining portions, or to predict whether two examples are perturbed versions of the same underlying image. Self-supervised models often learn representations that are later exploited by fine-tuning the resulting models on some downstream task of interest.
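
As a rough illustration of how "fill in the blanks" supervision is produced from raw text, the sketch below masks random words and records the originals as prediction targets. It is a simplification of the procedure used in practice; the helper name mask_tokens, the masking rate, and the example sentence are assumptions.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, targets), where targets maps position -> hidden word."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # the label comes from the text itself
        else:
            masked.append(tok)
    return masked, targets


tokens = "the agent receives a reward from the environment".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)   # model input, with some words hidden
print(targets)  # what the model should predict at the hidden positions
```

No human annotated anything here: every (input, target) pair is derived from the unlabeled sentence.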

Interacting with the environment

So far we have not talked about where the data come from or what happens when a machine learning model generates an output. This is because supervised learning and unsupervised learning do not address these issues in a very sophisticated way. In each case, a large amount of data is collected and machines are set in motion to create the models without interacting with the environment anymore. Because all learning occurs after the algorithm has been disconnected from the environment, this is sometimes called offline learning. For example, supervised learning assumes the simple interaction pattern shown below.

The simplicity of offline learning has its own appeal. The advantage is that we can deal with model building in isolation, without worrying about the complications arising from interactions with a dynamic environment. But this formulation of the problem is limiting. If you grew up reading Asimov's robot novels, you probably imagine artificially intelligent agents capable not only of making predictions but also of taking actions in the world. We want to think about intelligent agents, not just predictive models. That means we need to think about choosing actions, not just making predictions. Unlike mere predictions, actions have an actual impact on the environment. If we want to train an intelligent agent, we need to consider how its actions might affect its future observations, so offline learning is inappropriate.

Considering interaction with the environment opens up a number of new modeling issues. The following are just a few examples.

Does the environment remember what we have done previously?
Does the environment want to help us, e.g., a user reading text into a speech recognizer?
Does the environment want to beat us, e.g., spammers adapting their e-mails to evade spam filters?
Does the environment have changing dynamics? For example, will future data always look like the past, or will patterns change over time, either naturally or in response to our automated tools?

These questions raise the problem of distribution shift, where training and test data differ. An example that many of us may have encountered is taking an exam written by a lecturer after practicing on homework composed by the teaching assistants. Below we describe reinforcement learning, a framework for posing learning problems in which an agent interacts with an environment.

Reinforcement learning

If you are interested in using machine learning to develop an agent that interacts with an environment and performs actions, you will probably end up focusing on reinforcement learning. This could include applications to robotics, dialogue systems, and even the development of artificial intelligence (AI) for video games. Deep reinforcement learning, which applies deep learning to reinforcement learning problems, has become increasingly popular in recent years. The breakthrough deep Q-network (DQN), which beat humans at Atari games using only visual input (Mnih et al., 2015), and the AlphaGo program, which dethroned the world champion at the board game Go (Silver et al., 2016), are two famous examples.

Reinforcement learning provides a very general description of a problem in which an agent interacts with an environment in a series of time steps. At each time step, the agent receives an observation from the environment and must choose an action that is subsequently relayed back to the environment via a mechanism (sometimes called an actuator). After each cycle, the agent receives a reward from the environment based on its action. The agent subsequently receives a new observation and chooses an action by continuing the interaction with the environment. The behavior of an agent learning by reinforcement is governed by a policy. In short, a policy is nothing more than a function that maps from observations of the environment to actions. The goal of reinforcement learning is to produce good policies.
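
The observation, action, reward cycle described above can be sketched as a short loop. Everything in the block is hypothetical and chosen only for illustration: the toy CorridorEnv, its reset and step methods, and the two hand-written policies.

```python
import random


class CorridorEnv:
    """Toy environment: walk along cells 0..4; reaching the last cell pays 1."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the observation is simply the current cell

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clip the move.
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def random_policy(observation, rng):
    """A policy maps an observation to an action; this one ignores its input."""
    return rng.choice([-1, 1])


def always_right_policy(observation, rng):
    return 1


def run_episode(env, policy, max_steps=100, seed=0):
    rng = random.Random(seed)
    observation = env.reset()
    total_reward = 0.0
    for step in range(1, max_steps + 1):
        action = policy(observation, rng)             # the agent chooses an action
        observation, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward, step


print(run_episode(CorridorEnv(), always_right_policy))  # (1.0, 4)
print(run_episode(CorridorEnv(), random_policy))
```

Both policies are fixed functions; a learning algorithm would instead adjust the policy based on the rewards it collects.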

Reinforcement learning can be used in a wide variety of contexts. For example, supervised learning can be recast as reinforcement learning. Suppose we have a classification problem. We could create a reinforcement learning agent with one action corresponding to each class. We could then create an environment that gives a reward equal to the negative of the loss from the original supervised learning problem, so that maximizing reward is the same as minimizing loss.
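
A minimal sketch of this reformulation follows. It assumes a zero-one loss and uses the negative loss as the reward, so that a higher reward corresponds to a lower loss; the ClassificationEnv wrapper, the threshold policy, and the four-example dataset are hypothetical.

```python
class ClassificationEnv:
    """Presents a labeled dataset as a sequence of one-step decisions."""

    def __init__(self, examples):
        self.examples = examples  # list of (features, label) pairs
        self.index = 0

    def observe(self):
        features, _ = self.examples[self.index]
        return features  # the label stays hidden from the agent

    def step(self, action):
        _, label = self.examples[self.index]
        loss = 0.0 if action == label else 1.0  # zero-one loss
        self.index = (self.index + 1) % len(self.examples)
        return -loss  # reward is the negative loss


def threshold_policy(features):
    """Toy policy with one action per class (0 or 1)."""
    return 1 if features[0] > 0.5 else 0


dataset = [((0.1,), 0), ((0.9,), 1), ((0.7,), 1), ((0.4,), 1)]
env = ClassificationEnv(dataset)
total_reward = 0.0
for _ in dataset:
    total_reward += env.step(threshold_policy(env.observe()))
print(total_reward)  # -1.0: three correct predictions, one mistake
```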

Moreover, reinforcement learning can solve many problems that supervised learning cannot solve. For example, in supervised learning we always expect the training input to be associated with the correct label. In reinforcement learning, on the other hand, we do not assume that for each observation the environment will tell us the optimal action. In general, we only get a reward. Moreover, the environment may not even tell us which actions led to the reward.

Consider the game of chess. The only real reward signal comes at the end of the game: a reward of, say, 1 when we win, or -1 when we lose. Reinforcement learners must therefore deal with the credit assignment problem: determining which actions to credit or blame for an outcome.
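
One common (though not the only) way to spread a final reward back over earlier moves is the discounted return, in which each step is credited with its own reward plus a discounted share of everything that follows. The article does not introduce discounting, so the discount factor gamma and the four-move game below are assumptions made for this sketch.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every time step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A hypothetical four-move game that is won only on the last move.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]
```

Moves closer to the win receive more credit, which is a heuristic rather than a guarantee that they were the decisive ones.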

Reinforcement learners may also have to deal with the problem of partial observability. In other words, the current observation may not tell us everything about the current state. Suppose your cleaning robot became trapped in one of the many identical closets in your home. To rescue the robot, you need to infer its precise location, which may require considering the observations made before it entered the closet.

Finally, at any given time, the agent may know one good policy, but there may be many better policies that it has never tried. The agent must constantly choose whether to exploit the best (currently) known strategy as its policy, or to explore the space of strategies, potentially giving up some short-term reward in exchange for knowledge.

The general reinforcement learning problem has a very broad setting. Actions influence subsequent observations. Rewards are observed only for the actions actually chosen. The environment can be fully or partially observed. Taking all this complexity into account at once may be too much to ask. Moreover, not all practical problems exhibit all this complexity. Consequently, researchers have studied a number of special cases of reinforcement learning problems.

When the environment is fully observed, we call the reinforcement learning problem a Markov decision process. When the state does not depend on previous actions, we call it a contextual bandit problem. When there is no state, but only a set of available actions with initially unknown rewards, we have the classic multi-armed bandit problem.
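
The multi-armed bandit is also the simplest setting in which to see the trade-off between exploration and exploitation. The sketch below uses the epsilon-greedy rule, one standard strategy among several; the function name, the payout probabilities, and the value of epsilon are assumptions chosen for illustration.

```python
import random


def epsilon_greedy_bandit(win_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit; returns (estimated values, pull counts)."""
    rng = random.Random(seed)
    counts = [0] * len(win_probs)
    estimates = [0.0] * len(win_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm, even one that currently looks bad.
            arm = rng.randrange(len(win_probs))
        else:
            # Exploit: pull the arm with the best estimate so far.
            arm = max(range(len(win_probs)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


# Three hypothetical arms; their true payout rates are unknown to the agent.
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(counts)
print([round(e, 2) for e in estimates])
```

With epsilon set to 0 the agent can lock onto the first arm that happens to pay out; the occasional random pull is what lets it discover better arms.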
