Deep learning: Supervised learning [part 2]

In addition to classical classification and regression methods, there are other approaches and fields in which machine learning is used. Indeed, we can assign different labels to the same sample, create models to improve search in our applications, or suggest highly relevant content based on an individual user's profile. Finally, we can build models that use data sequences to produce new sequences, such as translations or conversion from audio to text and vice versa.

Reading time: 6 minutes

In the previous article, Deep learning: Supervised learning [part 1], we introduced the supervised methods of deep learning, including regression and classification. These two techniques are not the only ones available to us: depending on the goal of our analysis and the nature of the data itself, different techniques can be used. Below we look at other techniques that are still supervised, that is, based on the observation of already labeled examples. In particular, we focus on labeling, search, recommender systems, and sequence learning.

Labeling

Some classification problems, including those illustrated in the article Deep learning: Supervised learning [part 1], fit neatly into binary or multiclass classification setups. For example, we could train an ordinary binary classifier to distinguish dogs from cats. Given the current state of computer vision, we can do this easily with ready-made tools. However, regardless of the accuracy of our model, we may run into trouble when the classifier encounters the image below, which depicts the Bremen Town Musicians, the characters from a popular German fairy tale.

The image shows a cat, a rooster, a dog and a donkey, with some trees in the background. If we expect to encounter such images, multiclass classification may not be the right formulation of our problem, because it forces the model to pick exactly one class. Instead, we need our model to be able to indicate that the image depicts multiple animals: in our example a cat, a dog, a donkey and a rooster.

The problem of learning to predict classes that are not mutually exclusive is called multi-label classification. Auto-tagging problems are also generally best described in terms of multi-label classification. Think of the tags that people apply to blog posts: five to ten tags are usually applied to an article. The tags also tend to be correlated with one another. For example, posts on “cloud computing” are very likely to mention “AWS”, while posts on “machine learning” are likely to mention “GPU.”
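The difference between the two formulations can be illustrated with a minimal sketch. It assumes a hypothetical model that emits one raw score (logit) per class; the label set, the scores, and the 0.5 threshold are made up for illustration. With softmax the classes compete for a single unit of probability, whereas with one independent sigmoid per class several labels can be active at once.

```python
import math

# Hypothetical label set for the fairy-tale image example.
LABELS = ["cat", "dog", "donkey", "rooster"]


def sigmoid(z):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Map raw scores to probabilities that sum to 1 (mutually exclusive classes)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_labels(logits, threshold=0.5):
    """Multi-label decision: keep every label whose sigmoid probability exceeds the threshold."""
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) > threshold]


# Made-up logits for an image that contains all four animals.
logits = [2.0, 1.5, 1.8, 2.2]

multiclass_probs = softmax(logits)               # classes compete with each other
multilabel_probs = [sigmoid(z) for z in logits]  # each class is judged on its own

print("multiclass:", [round(p, 2) for p in multiclass_probs])
print("multi-label:", [round(p, 2) for p in multilabel_probs])
print("predicted tags:", predict_labels(logits))
```

In this toy case no class reaches 0.5 under softmax, while every class exceeds 0.5 under the per-class sigmoid, so all four animals are reported.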

Sometimes these tagging problems are based on huge sets of tags. The National Library of Medicine employs many professional annotators who associate each article to be indexed in PubMed with a set of tags drawn from the Medical Subject Headings (MeSH) ontology, a collection of about 28,000 tags. Proper tagging of articles is important because it enables researchers to conduct comprehensive literature reviews. Manual annotation, however, is time-consuming and labor-intensive. Machine learning could provide provisional tags that are later revised, where necessary, by experts in the field.

Search

In the field of information retrieval, we often impose a ranking on sets of items. Take Web search, for example. The goal is not so much to determine whether a particular page is relevant to a query, but rather which, among a set of relevant results, should be shown most prominently to a particular user. One way to do this is to first score each item in the set and then retrieve the highest-rated items. PageRank, the algorithm underlying Google’s search engine, was an early example of such a scoring system. Oddly enough, the score provided by PageRank did not depend on the actual query. Instead, the search engine relied on a simple relevance filter to identify the set of relevant candidates and then used PageRank to prioritize the most authoritative pages. Today, search engines use machine learning and behavioral models to obtain query-dependent relevance scores.
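The two-stage scheme described above (filter for relevance, then order by a query-independent score) can be sketched as follows. This is only an illustration: the pages, their texts, and the `authority` values are invented, and the keyword filter is a stand-in for a real relevance filter, not a description of how any actual search engine works.

```python
# Hypothetical pages: a short text plus a precomputed, query-independent authority score.
PAGES = {
    "a.example": {"text": "deep learning tutorial for beginners", "authority": 0.2},
    "b.example": {"text": "deep learning research lab", "authority": 0.9},
    "c.example": {"text": "cooking recipes and tips", "authority": 0.95},
    "d.example": {"text": "learning to cook deep dish pizza", "authority": 0.5},
}


def is_relevant(query, text):
    """Toy relevance filter: every query term must appear in the page text."""
    words = set(text.split())
    return all(term in words for term in query.lower().split())


def search(query, pages, k=2):
    """Stage 1: keep relevant candidates. Stage 2: rank them by authority only."""
    candidates = [url for url, page in pages.items() if is_relevant(query, page["text"])]
    ranked = sorted(candidates, key=lambda url: pages[url]["authority"], reverse=True)
    return ranked[:k]


results = search("deep learning", PAGES)
print(results)  # ['b.example', 'd.example']
```

Note that the pizza page outranks the tutorial because the second stage ignores the query entirely, which hints at why query-dependent relevance scores are preferable.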

Recommendation systems

Recommendation systems address another problem related to search and ranking. The problems are similar insofar as the goal is to display a set of items relevant to the user. The main difference from traditional search engines is the emphasis on personalizing results based on user profiles. For example, for movie recommendations, the results page for a science fiction fan and that for a connoisseur of Robin Williams comedies might differ significantly. Similar problems occur in other recommendation contexts, such as retail products, music, and news.

In some cases, customers provide explicit feedback, communicating how much they liked a particular product (e.g., product ratings and reviews on Amazon and IMDb). In other cases, they provide implicit feedback, such as skipping titles in a playlist, which might indicate dissatisfaction or perhaps just that the song was inappropriate in context. In the simplest formulations, these systems are trained to estimate some score, such as an expected star rating or the likelihood that a given user will purchase a particular item.

Given such a model, for any given user we could retrieve the set of items with the highest scores and recommend them to that user. The recommendation systems used in production are much more advanced and take into account the detailed activity of the user and the characteristics of the item when calculating these scores.
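As a rough sketch of the "score, then retrieve the top items" idea, the example below assumes that a model has already learned a small vector for each user and each item, and that the predicted score is their dot product. The user names, item titles, and vector values are all invented for illustration; real systems use far richer features.

```python
# Hypothetical learned vectors: dimensions loosely mean (science fiction, comedy).
USER_VECTORS = {
    "sci_fi_fan": [0.9, 0.1],
    "comedy_fan": [0.1, 0.9],
}
ITEM_VECTORS = {
    "Space Saga": [1.0, 0.0],
    "Robot Uprising": [0.8, 0.2],
    "Family Laughs": [0.0, 1.0],
    "Alien Roommate": [0.5, 0.5],
}


def score(user, item):
    """Predicted affinity of a user for an item: dot product of their vectors."""
    return sum(u * i for u, i in zip(USER_VECTORS[user], ITEM_VECTORS[item]))


def recommend(user, already_seen=(), k=2):
    """Return the k highest-scoring items the user has not seen yet."""
    candidates = [item for item in ITEM_VECTORS if item not in already_seen]
    return sorted(candidates, key=lambda item: score(user, item), reverse=True)[:k]


print(recommend("sci_fi_fan"))  # ['Space Saga', 'Robot Uprising']
print(recommend("comedy_fan"))  # ['Family Laughs', 'Alien Roommate']
```

The same catalog yields different results pages for the two users, which is the personalization that distinguishes recommendation from plain search.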

Despite their enormous economic value, recommendation systems built naively on top of predictive models have some serious conceptual flaws. First, the feedback we observe is censored: users preferentially rate the items they feel strongly about. On a five-point scale, for example, movies and products tend to receive many one- and five-star ratings, while three-star ratings are decidedly few. Second, users’ buying habits are often themselves influenced by the recommendation algorithm in use on the platform, yet learning algorithms do not always take this into account. Feedback loops can therefore form, in which a recommender system favors an item that is deemed better (because it is purchased more often), and that item is in turn recommended even more frequently.
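The feedback loop can be made concrete with a deliberately simplified simulation. It assumes a naive recommender that always promotes the item with the most purchases, and it assumes (unrealistically) that every recommendation leads to one purchase. The item names and starting counts are hypothetical.

```python
def simulate_feedback_loop(initial_purchases, rounds):
    """Repeatedly recommend the most-purchased item and record one more purchase for it."""
    counts = dict(initial_purchases)  # copy so the caller's data is not modified
    for _ in range(rounds):
        recommended = max(counts, key=counts.get)  # naive rule: most purchased wins
        counts[recommended] += 1  # assumption: each recommendation becomes a purchase
    return counts


# Two items of comparable quality; item A happens to start one purchase ahead.
start = {"A": 11, "B": 10}
end = simulate_feedback_loop(start, rounds=100)
print(end)  # {'A': 111, 'B': 10}
```

A one-purchase head start turns into a gap of more than a hundred purchases, even though nothing in the simulation says that item A is actually better.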

Sequential learning

So far we have examined problems in which we have a fixed number of inputs and produce a fixed number of outputs. For example, we have considered the prediction of house prices with a fixed set of characteristics: square footage, number of bedrooms, number of bathrooms, and distance from the city center. We also discussed the mapping from an image (of fixed size) to the predicted probabilities of belonging to each of a fixed number of classes and the prediction of star ratings associated with purchases. In these cases, once our model has been trained, after each test example has been entered into our model, it is immediately forgotten. We assumed that subsequent observations were independent, so there was no need to maintain this context.

But how should we deal with video fragments? In this case, each fragment could be composed of a different number of frames, and our guess about what happens in each frame could be much stronger if we take into account the preceding or following frames. The same is true for language. For example, a popular deep learning problem is machine translation: the task of taking in sentences in a source language and predicting their translation into another language.

These problems also arise in medicine. We might want a model to monitor patients in the intensive care unit and issue alerts whenever their risk of death in the next 24 hours exceeds a certain threshold. In this case, we would not want to throw away everything we know about the patient’s history every hour, because we would not want to make predictions based only on the most recent measurements.

Questions like these are among the most interesting applications of machine learning and are instances of sequential learning. They require a model that can take sequences as input, produce sequences as output, or both. In particular, sequence-to-sequence learning considers problems in which both inputs and outputs consist of sequences of varying lengths. Other learning tasks do not fit this scheme so neatly: determining the order in which a user reads a Web page, for example, is a two-dimensional layout analysis problem. Dialogue problems present all kinds of additional complications, since determining what to say next requires taking into account real-world knowledge and the previous state of the conversation over long time spans. Below we present some examples.

Tagging

This type of problem requires annotating a sequence of text with attributes. In this case, the inputs and outputs are aligned, that is, they have the same length and occur in corresponding order. For example, in part-of-speech (PoS) tagging, we annotate each word in a sentence with the corresponding part of speech, e.g., “noun” or “verb.” Alternatively, we might want to know which groups of contiguous words refer to entities, such as people, places, or organizations. In the simple example below, we indicate only whether or not a word in the sentence is part of an entity (labeled “Ent”).

				
Tom  has  dinner  in  Washington  with  Sally
Ent  -    -       -   Ent         -     Ent
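The alignment property of tagging, one output label per input token in the same order, can be shown with a small sketch. The lookup set `KNOWN_ENTITIES` is a hypothetical stand-in for a trained model, which would normally decide from context rather than from a fixed list.

```python
# Hypothetical stand-in for a trained tagger: a fixed set of known entity words.
KNOWN_ENTITIES = {"Tom", "Washington", "Sally"}


def tag_entities(tokens):
    """Return one tag per token, in order: 'Ent' for entity words, '-' otherwise."""
    return ["Ent" if token in KNOWN_ENTITIES else "-" for token in tokens]


tokens = "Tom has dinner in Washington with Sally".split()
tags = tag_entities(tokens)

# Inputs and outputs are aligned: same length, same order.
for token, tag in zip(tokens, tags):
    print(f"{token:<12}{tag}")
```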

Automatic speech recognition

In speech recognition, the input sequence is an audio recording, as shown in the figure, while the output is a transcript of what is said. The challenge is that there are many more audio samples than characters of text, i.e., there is no 1:1 correspondence between audio and text. Sound is typically sampled at rates between 8 kHz and 16 kHz, which means that thousands of samples can correspond to a single spoken word. These are sequence-to-sequence learning problems in which the output is much shorter than the input. While humans are extraordinarily good at recognizing speech, even from low-quality audio, getting computers to accomplish the same feat is a real challenge.
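A quick back-of-the-envelope calculation shows the size of the mismatch. The word duration of 400 ms is an assumed, illustrative value; actual durations vary widely with the speaker and the word.

```python
def samples_per_word(sample_rate_hz, word_duration_ms):
    """Number of audio samples covering one spoken word of the given duration."""
    return sample_rate_hz * word_duration_ms // 1000


WORD_DURATION_MS = 400  # assumption: a spoken word lasting about 0.4 seconds

for rate in (8_000, 16_000):
    print(f"{rate} Hz -> {samples_per_word(rate, WORD_DURATION_MS)} samples per word")
# 8000 Hz -> 3200 samples per word
# 16000 Hz -> 6400 samples per word
```

Under this assumption, a single word of text corresponds to several thousand input values.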

Text to speech

This is the reverse of the automatic speech recognition described above: the input is text and the output is an audio file. Here the output is much longer than the input.

Automatic translation

In speech recognition, corresponding inputs and outputs occur in the same order. In machine translation, by contrast, misaligned data pose a new challenge: input and output sequences may have different lengths, and the corresponding regions of the respective sequences may appear in a different order. Consider the following illustrative example of the German tendency to place verbs at the end of sentences:

				
German:           Haben Sie sich schon dieses grossartige Lehrwerk angeschaut?
English:          Have you already looked at this excellent textbook?
Wrong alignment:  Have you yourself already this excellent textbook looked at?
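The "wrong alignment" line can be reproduced by translating word by word while keeping the German order. The gloss dictionary below is built only from the example above and is purely illustrative; it is not a translation method.

```python
# Word-by-word glosses taken from the example sentence above (illustrative only).
GLOSS = {
    "Haben": "Have",
    "Sie": "you",
    "sich": "yourself",
    "schon": "already",
    "dieses": "this",
    "grossartige": "excellent",
    "Lehrwerk": "textbook",
    "angeschaut": "looked at",
}


def word_by_word(sentence):
    """Translate each word independently, preserving the source word order."""
    words = sentence.rstrip("?").split()
    return " ".join(GLOSS[word] for word in words) + "?"


german = "Haben Sie sich schon dieses grossartige Lehrwerk angeschaut?"
reference = "Have you already looked at this excellent textbook?"

naive = word_by_word(german)
print(naive)               # Have you yourself already this excellent textbook looked at?
print(naive == reference)  # False: the order and the length both differ
```

A correct translation has to drop "sich" and move the verb forward, which is why the model must learn the correspondence between regions of the two sequences rather than assume a fixed position-by-position mapping.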
