Deep learning: Supervised learning [part 1]

When we need to analyze data, we have several techniques at our disposal. In deep learning, and in data science more generally, one family of techniques is called supervised learning. These techniques learn from labeled example data in order to predict values or labels for new data.

Reading time: 6 minutes

Data scientists have several techniques at their disposal to analyze data and build models for their projects. Identifying the right technique for a project is not always trivial. Depending on the inherent characteristics of the data and on our goal, different techniques can produce very different results. There are several families of machine learning and deep learning algorithms. In this article we will focus on supervised learning techniques.

Supervised learning

Supervised learning starts with a dataset containing both features and labels. Tasks in this category aim to produce a model that predicts labels when given input features. Each feature-label pair is called an example or sample. Supervision comes into play because, to choose the model's parameters, we data scientists (the supervisors) provide the model with a dataset of already labeled examples. In probabilistic terms, we are typically interested in estimating the conditional probability of a label given the input features. Although it is only one of many paradigms, supervised learning accounts for most of the successful applications of machine learning in industry. In part this is because many important tasks can be clearly described as estimating the probability of something unknown given a particular set of available data. Some examples are:

  • Predicting the presence or absence of a disease given a computed tomography image.
  • Predicting the correct translation into a foreign language given a sentence in our native language.
  • Predicting the price of a stock next month based on this month’s financial reporting data.

While all supervised learning problems are captured by the simple description “predict labels given input features,” supervised learning itself can take different forms and requires many modeling decisions that depend, among other things, on the type, size and quantity of the inputs and outputs. For example, we may use different models to process sequences of arbitrary length and vector representations of fixed length.

Informally, the learning process works as follows. First, we take a large collection of examples whose features we know and select a random subset of them, acquiring a label for each. Sometimes these labels are already available in the collected data, while other times human annotators must label our samples. Together, these inputs and their corresponding labels constitute the training set. The training set is fed into a supervised learning algorithm, a function that takes a dataset as input and produces another function: the model. Finally, we can feed the model new, previously unseen inputs and use its outputs as predictions of the corresponding labels. The complete process is shown below.
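The idea that a learning algorithm is "a function that produces another function" can be made concrete with a minimal sketch. The nearest-neighbor rule used here is only one possible learning algorithm, chosen because it fits in a few lines; the toy data and the names `train_nearest_neighbor` and `squared_distance` are assumptions made for illustration, not something taken from a specific library.

```python
from typing import Callable, List, Tuple

Features = List[float]
Example = Tuple[Features, str]        # (feature vector, label)
Model = Callable[[Features], str]     # a model maps features to a label


def squared_distance(a: Features, b: Features) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_nearest_neighbor(training_set: List[Example]) -> Model:
    """Learning algorithm: takes a labeled dataset and returns a model."""
    if not training_set:
        raise ValueError("training set must not be empty")

    def model(features: Features) -> str:
        # Predict the label of the closest training example.
        _, label = min(
            training_set,
            key=lambda example: squared_distance(example[0], features),
        )
        return label

    return model


# Made-up training set: two features per example, one label each.
training_set = [
    ([1.0, 1.0], "small"),
    ([1.2, 0.8], "small"),
    ([5.0, 5.5], "large"),
    ([6.0, 5.0], "large"),
]

model = train_nearest_neighbor(training_set)   # training step
prediction = model([5.5, 5.2])                 # prediction on an unseen input
print(prediction)
```

Note that `train_nearest_neighbor` never sees the new input: everything it knows comes from the labeled training set, which is exactly the "supervision" described above.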

Regression

Consider, for example, a dataset collected from a database of home sales. We could represent our data as a table in which each row corresponds to a house and each column to a feature of the house, such as square footage, number of bedrooms, number of bathrooms, and number of minutes (walking) to the city center. In this dataset, each example would be a specific house and the corresponding feature vector would be a row in the table. If you live in New York or San Francisco, the feature vector (square footage, number of bedrooms, number of bathrooms, walking distance) for your house might look like [70, 1, 1, 60]. If, on the other hand, you live in Pittsburgh, the vector corresponding to your house might be [800, 4, 3, 10]. Fixed-length feature vectors like these are essential for most classical machine learning algorithms.

But why should we use regression on this dataset? The choice of one approach over another depends on the objective of the analysis. Suppose you are in the market for a new house. You might want to estimate the fair market value of a house, given features such as those described above. The data might consist of historical house listings, and the labels might be the observed sale prices. When the labels take arbitrary numerical values (even within a range), we have a regression problem. The goal is to produce a model whose predictions are very close to the actual values of the labels.
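As a rough illustration of what "producing a model" means for regression, here is a minimal sketch that fits a straight line to a single feature with ordinary least squares. This is one simple regression technique among many, not necessarily the one you would choose in practice, and the areas and prices are invented numbers used only to make the example runnable.

```python
from statistics import mean
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Invented data: house area (feature) and sale price in thousands (label).
areas = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 210.0, 270.0, 330.0]

slope, intercept = fit_line(areas, prices)

# Use the fitted line to estimate the price of a house not in the dataset.
estimated_price = slope * 80.0 + intercept
print(slope, intercept, estimated_price)
```

With several features, as in the house table above, the same idea generalizes to fitting one weight per column of the feature vector.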

Many practical problems are easily described as regression problems. For example, predicting the rating a user will assign to a movie can be considered a regression problem. Just think, if you had designed an algorithm in 2009 that could accomplish this feat, you could have won the million-dollar Netflix prize. Predicting how long patients will stay in the hospital, how long a surgery will take, and how much rain will fall in the next few hours are also regression problems.

To assess the quality of our regression model we will primarily use the mean squared error, that is, the average of the squared differences between the predicted values and the true values. The lower this error on examples the model has not seen during training, the more we can trust its predictions.
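The mean squared error is simple enough to compute by hand. The sketch below assumes two equally long lists of true and predicted values; the numbers are made up for illustration.

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one value is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up sale prices and the corresponding model predictions.
actual = [200.0, 310.0, 150.0]
predicted = [210.0, 300.0, 150.0]

mse = mean_squared_error(actual, predicted)
print(mse)  # (10**2 + 10**2 + 0**2) / 3
```

Because the differences are squared, one large mistake weighs more than several small ones.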

Classification

While regression models are great for answering questions that ask "how much" or "how many", some problems do not fit this mold. Consider, for example, a bank that wants to develop a check-scanning function for its mobile application. Ideally, the customer would simply take a picture of the check, and the app would automatically recognize the text in the image. Assuming we are able to segment the image patches corresponding to each handwritten character, the main remaining task is to determine which character, among the known ones, is represented in each image patch. This type of problem is called classification and requires a different set of tools than those used for regression, although many techniques carry over.

In classification, we want our model to examine features, such as the pixel values of an image, and then predict to which category (usually called a class), among a discrete set of options, the observed example belongs. For handwritten digits, we might have ten classes, corresponding to the digits 0 through 9. The simplest form of classification has only two classes and is called binary classification. For example, our dataset might consist of pictures of animals and our labels might be the classes {dog, cat}. Whereas in regression we look for a model that outputs a numerical value, in classification we look for a model that assigns one of the available classes.

For various reasons, it can be difficult to optimize a model that outputs only a hard categorical assignment, for example, “dog” or “cat.” In such cases, it is usually much easier to express our model in the language of probability. Given the features of an example, our model assigns a probability to each possible class. Returning to our animal classification example, in which the classes are {dog, cat}, a classifier might see a picture and output a probability of 0.9 that the picture shows a dog. We can interpret this number by saying that the classifier is 90 percent sure that the image depicts a dog.
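How does a model end up with one probability per class? A common approach, which the article does not name and which is assumed here for illustration, is to let the model produce one raw score per class and pass the scores through the softmax function. The scores below are hypothetical values picked so that the result lands close to the 0.9 of the example.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum does not change the result but avoids overflow.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["dog", "cat"]
scores = [2.0, -0.2]          # hypothetical raw model outputs for one picture

probabilities = softmax(scores)

# If a hard decision is needed, pick the most probable class.
predicted_class = classes[probabilities.index(max(probabilities))]
print(dict(zip(classes, probabilities)), predicted_class)
```

The probabilities change smoothly when the scores change, which is what makes this formulation easier to optimize than a hard “dog” or “cat” output.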

In some cases the most probable class alone is not enough to make the final decision. Suppose you want to create an app that, from a photo, estimates the probability that a mushroom you picked in the woods is poisonous. Suppose that our poison-detection classifier reports a probability of 0.2 that the mushroom is poisonous. In other words, the classifier is 80% sure that our mushroom is not poisonous. However, I don’t think anyone would dream of eating it knowing that there is a 20% risk that our dinner will be somewhat indigestible. Thus, to decide whether or not to eat the mushroom, we need to calculate the expected harm associated with each action, which depends on both the likely outcomes and the benefits or harms associated with each. If we treat death as an infinite loss and a wasted edible mushroom as a loss of 1, the expected harm of eating the mushroom is 0.2 × ∞ + 0.8 × 0 = ∞, while the expected loss from discarding it is 0.2 × 0 + 0.8 × 1 = 0.8. Our caution is therefore justified.
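The mushroom calculation can be written as a small expected-loss computation. The sketch below uses the probabilities and losses from the example (infinite loss for eating a poisonous mushroom, loss 1 for discarding an edible one); the function and variable names are chosen for illustration.

```python
import math
from typing import Dict


def expected_loss(probabilities: Dict[str, float], losses: Dict[str, float]) -> float:
    """Sum of probability * loss over all possible outcomes."""
    total = 0.0
    for outcome, p in probabilities.items():
        if p == 0.0:
            continue  # skip impossible outcomes: 0 * inf would give nan
        total += p * losses[outcome]
    return total


# Output of the classifier for one mushroom.
probabilities = {"poisonous": 0.2, "edible": 0.8}

# Loss of each action under each outcome.
loss_if_eaten = {"poisonous": math.inf, "edible": 0.0}
loss_if_discarded = {"poisonous": 0.0, "edible": 1.0}

risks = {
    "eat": expected_loss(probabilities, loss_if_eaten),          # inf
    "discard": expected_loss(probabilities, loss_if_discarded),  # 0.8
}

# Choose the action with the smallest expected loss.
best_action = min(risks, key=risks.get)
print(risks, best_action)
```

With less extreme losses the same code can recommend a different action for the same probabilities, which is the point: the decision depends on the costs as much as on the classifier.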

When there are more than two possible classes, the problem is called multiclass classification. Common examples include the recognition of handwritten characters, as in the check-scanning example. While for regression problems we try to minimize the squared error loss, the most common loss function for classification problems is called cross-entropy.
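For a single example whose true class is known, the cross-entropy loss reduces to the negative logarithm of the probability the model assigned to that class. The sketch below assumes the model already outputs a list of class probabilities; the names are illustrative.

```python
import math
from typing import List


def cross_entropy(probabilities: List[float], true_index: int) -> float:
    """Cross-entropy loss for one example with a single correct class."""
    p = probabilities[true_index]
    if p <= 0.0:
        return math.inf  # the model ruled out the correct class entirely
    return -math.log(p)


# Classes are [dog, cat]; the model says 0.9 dog, 0.1 cat.
predicted = [0.9, 0.1]

loss_if_dog = cross_entropy(predicted, true_index=0)  # confident and right
loss_if_cat = cross_entropy(predicted, true_index=1)  # confident and wrong
print(loss_if_dog, loss_if_cat)
```

The loss is small when the model gives high probability to the correct class and grows quickly when it is confidently wrong. Over a dataset, the per-example losses are typically averaged.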

Classification, of course, can be much more complicated than simple binary or multiclass classification. For example, some variants of classification involve hierarchically structured classes. In these cases, not all errors are equal: if we have to make a mistake, we may prefer to misclassify into a related class rather than into a distant one. This is usually referred to as hierarchical classification.

An example of hierarchical classification is the classification of animals. In this case, it might not be so bad to confuse a poodle with a schnauzer, but our model would pay a huge penalty if it confused a poodle with a dinosaur. The right hierarchy depends on the intended use of the model. For example, rattlesnakes and garter snakes might be close in the phylogenetic tree, but mistaking a rattlesnake for a garter snake could have fatal consequences.
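One simple way to express "not all errors are equal" is to make the penalty grow with the distance between the predicted and the true class in the hierarchy. The sketch below is only one possible choice: the toy hierarchy is a deliberately simplified assumption, and `tree_distance` is a hypothetical helper, not a standard library function.

```python
from typing import Dict, List, Optional

# Toy hierarchy, invented for illustration: each class points to its parent.
PARENT: Dict[str, Optional[str]] = {
    "animal": None,
    "mammal": "animal",
    "reptile": "animal",
    "dog": "mammal",
    "poodle": "dog",
    "schnauzer": "dog",
    "dinosaur": "reptile",
}


def path_to_root(node: str) -> List[str]:
    """List of classes from the given node up to the root of the hierarchy."""
    path = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def tree_distance(a: str, b: str) -> int:
    """Number of edges between two classes, via their closest common ancestor."""
    steps_from_b = {node: steps for steps, node in enumerate(path_to_root(b))}
    for steps_from_a, node in enumerate(path_to_root(a)):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    raise ValueError("the two classes share no common ancestor")


near_miss = tree_distance("poodle", "schnauzer")  # siblings under "dog"
far_miss = tree_distance("poodle", "dinosaur")    # only share "animal"
print(near_miss, far_miss)
```

As the rattlesnake example shows, distance in the tree is not always the right penalty: when a mistake is dangerous, an explicit cost for that specific pair of classes would need to override it.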
