In the previous article, Deep learning: introduction, we saw how machine learning techniques are used in our daily activities. In the keyword recognition example, we described a dataset consisting of audio fragments and binary labels and gave an idea of how we could train a model to approximate a mapping from fragments to classifications. This type of problem, in which we try to predict a designated unknown label from known inputs, given a dataset of examples for which the labels are known, is called supervised learning. It is just one of many types of machine learning problems.
What, then, are the basic concepts that underlie all machine learning techniques regardless of the type of problem we address? Here are the four pillars on which everything we will see in this series of articles is based:
- The data we can learn from
- A data transformation model
- An objective function that quantifies how well (or poorly) the model is doing
- An algorithm that adjusts the model parameters to optimize the objective function
Data
It goes without saying that you cannot do data science without data. In general, in this area, we are concerned with handling a collection of examples. To work with these examples in a useful and efficient way, we need a suitable representation, usually a numerical one. Each example (also called a data point, data instance, or sample) typically consists of a set of attributes called features (also called covariates or inputs), based on which the model must make its predictions. In supervised learning problems, the goal is to predict the value of a special attribute, called a label (or target), that is not part of the model’s inputs.
If we are working with image data, each example might consist of a single photograph (the features) and a number indicating the category to which the photograph belongs (the label). The photograph would be represented numerically as three grids of numerical values representing the brightness of red, green, and blue light at each pixel position. For example, a 100 × 100 pixel color photograph would consist of 100 × 100 × 3 = 30,000 numerical values.
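The arithmetic above can be checked with a small sketch. It is only an illustration in plain Python: the helper names `blank_image` and `flatten`, the all-black placeholder image, and the label value are assumptions made here, and the nested layout `image[row][col][channel]` is just one of several common ways to store the three color grids.

```python
# Dimensions chosen to match the example in the text.
HEIGHT, WIDTH, CHANNELS = 100, 100, 3  # channels: red, green, blue

def blank_image(height, width, channels):
    """Build an all-black image as nested lists: image[row][col][channel]."""
    return [[[0 for _ in range(channels)] for _ in range(width)]
            for _ in range(height)]

def flatten(image):
    """Unroll the three-dimensional grid into one flat feature vector."""
    return [value for row in image for pixel in row for value in pixel]

image = blank_image(HEIGHT, WIDTH, CHANNELS)
image[0][0] = [255, 0, 0]  # make the top-left pixel pure red

features = flatten(image)  # the model's inputs
label = 1                  # hypothetical category index for this photograph

print(len(features))  # 30000
```

Every photograph of this size yields a vector of the same length, whatever its content.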
Alternatively, we could work with data from electronic medical records and tackle the task of predicting the probability that a given patient will survive the next 30 days. In this case, our features could consist of a collection of readily available attributes and frequently recorded measurements, such as age, vital signs, comorbidities, current medications, and recent procedures. The label available for training would be a binary value indicating whether each patient in the historical data survived the 30-day window.
In these cases, when each example has the same number of features, we say that the inputs are fixed-length vectors and call the length of the vectors the dimensionality of the data. Fixed-length inputs are convenient because they leave us one less complication to worry about. However, not all data can be easily represented as fixed-length vectors. For example, while we can expect clinical images produced by a laboratory microscope to share one specific resolution, we cannot expect images extracted from the Internet to all have the same resolution or shape. For images, we might think of cropping them to a standard size, but this strategy risks discarding relevant information. Text data resist fixed-length representations even more stubbornly. Just think of customer reviews left on sites such as Amazon, IMDb, and TripAdvisor. Some are short and perhaps uninformative; others are very long and verbose. One of the main advantages of deep learning over traditional methods is the comparative grace with which modern models can handle data of varying lengths.
In general, the more data we have, the easier our work becomes. When we have more data, we can train more powerful models and rely less on preconceived assumptions.
Finally, it is not enough to have a lot of data and process it intelligently. We need the right data. If the data are full of errors, or if the chosen features are not predictive of the target quantity of interest, learning is bound to fail. The situation is well captured by the cliché: garbage in, garbage out. Moreover, poor predictive performance is not the only potential consequence. In sensitive applications of machine learning, such as predictive policing, resume screening, and risk models used for loans and insurance, we need to be particularly alert to the consequences of garbage data. One commonly encountered failure mode involves datasets in which certain groups of people are not represented in the training data. Imagine applying a skin cancer recognition system that has never seen black skin. The system is likely to make more mistakes on exactly the patients it has never seen.
Failures can also occur when the data do not merely under-represent some groups but reflect existing prejudices. For example, if past hiring decisions are used to train a predictive model that will be used to screen resumes, the model might inadvertently capture and automate the biases behind those decisions. Note that all of this can happen without the data analyst being aware of it.
Models
By model we mean a computational machine that ingests data of a certain type and produces predictions in a certain format. For example, we might want to build a system that takes in photos and predicts whether people are present. Alternatively, we might want to take in a series of readings from environmental sensors and detect anomalies in them.
Models differ widely in complexity. Simple models, such as classical statistical models, are perfectly capable of dealing with suitably simple problems. Deep learning models, by contrast, consist of many successive transformations of the data chained together from top to bottom, which is where the name deep learning comes from.
Objective functions
Previously, we introduced machine learning as learning from experience. By learning, we mean improving at some task over time. But how do we say that there has been improvement? The quantification of the improvement and performance of our model cannot be subjective. Therefore, we need formal measures of how good (or bad) our models are. In machine learning, and more generally in optimization, we call these measures objective functions. By convention, we usually define objective functions so that lower values are better. This is only a convention: any function for which higher is better can be turned into one for which lower is better by flipping its sign. Because lower is better, these functions are sometimes called loss functions.
When trying to predict numerical values, the most common loss function is the squared error, that is, the square of the difference between the prediction and the correct value. For classification, the most common goal is to minimize the error rate, that is, the fraction of examples on which our predictions disagree with the truth. Some objectives (e.g., squared error) are easy to optimize, while others (e.g., error rate) are difficult to optimize directly, due to non-differentiability or other complications. In these cases, it is common to optimize a surrogate objective instead.
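As a rough illustration of the two measures just described, the following sketch computes a mean squared error and an error rate in plain Python. The function names and the toy numbers are assumptions chosen for this example; they do not come from any particular library.

```python
def squared_error(prediction, target):
    """Square of the difference between a prediction and the correct value."""
    return (prediction - target) ** 2

def mean_squared_error(predictions, targets):
    """Average squared error over a set of examples."""
    total = sum(squared_error(p, t) for p, t in zip(predictions, targets))
    return total / len(targets)

def error_rate(predicted_labels, true_labels):
    """Fraction of examples whose predicted label disagrees with the truth."""
    mistakes = sum(1 for p, t in zip(predicted_labels, true_labels) if p != t)
    return mistakes / len(true_labels)

# Hypothetical regression outputs: only the third prediction is off (by 2).
mse = mean_squared_error([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 5.0, 4.0])

# Hypothetical binary classification outputs: one mistake out of four.
err = error_rate([1, 0, 1, 1], [1, 0, 0, 1])

print(mse, err)  # 1.0 0.25
```

The squared error changes smoothly as a prediction moves toward the target, whereas the error rate only changes when a predicted label flips. That is one reason the former is easier to optimize directly.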
During optimization, we consider the loss as a function of the model parameters and treat the training dataset as a constant. We learn the best values of our model parameters by minimizing the loss incurred on a set of examples collected for training. However, a good result on the training data does not guarantee a good result on unseen data. For this reason, we usually divide the available data into two partitions: the training dataset (or training set), used to learn the model parameters, and the test dataset (or test set), which is held out for evaluation. Encouraging results on the training set are therefore not the figure to trust; the test set provides the estimate of how the model behaves on unseen data. When a model performs well on the training set but fails to generalize to unseen data, we say that it is overfitting to the training data.
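The split into two partitions, and the gap between training and test performance, can be sketched as follows. Everything here is an assumption made for illustration: the 80/20 ratio, the toy even/odd dataset, and the deliberately extreme "memorizer" model, which simply stores the training examples in a lookup table.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into two disjoint partitions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

def error_rate(model, examples):
    """Fraction of examples on which the model's prediction is wrong."""
    return sum(model(x) != y for x, y in examples) / len(examples)

# Hypothetical toy dataset: (feature, label) pairs.
dataset = [(x, "even" if x % 2 == 0 else "odd") for x in range(100)]
train_set, test_set = train_test_split(dataset)

# A model that memorizes the training set instead of learning a rule.
memory = dict(train_set)

def memorizer(x):
    # Anything not seen during training gets the answer "unknown".
    return memory.get(x, "unknown")

train_error = error_rate(memorizer, train_set)
test_error = error_rate(memorizer, test_set)
print(train_error, test_error)
```

The memorizer is perfect on the data it was fitted to and useless on everything else, which is overfitting in its most extreme form. Only the held-out partition reveals the problem.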
Optimization algorithms
Once we have a data source, a suitable representation of the data, a model, and a well-defined objective function, we need an algorithm that can search for the parameters that minimize the loss function. The most popular optimization algorithms for deep learning are based on an approach called gradient descent. In short, at each step this method checks, for each parameter, how the loss on the training set would change if that parameter were changed by a small amount. It then updates the parameter in the direction that reduces the loss.
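A minimal sketch of this procedure, assuming a linear model with two parameters and a mean squared error loss, might look like the following. The toy data, the learning rate, and the number of steps are assumptions picked for the example, and the gradients are written out by hand rather than computed by a deep learning framework.

```python
# Hypothetical training set generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def loss(w, b):
    """Mean squared error of the linear model w * x + b on the training set."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradients(w, b):
    """Partial derivatives of the loss with respect to w and b."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b

w, b = 0.0, 0.0       # arbitrary starting parameters
learning_rate = 0.05  # step size (an assumption; too large a value diverges)
initial_loss = loss(w, b)

for step in range(2000):
    grad_w, grad_b = gradients(w, b)
    # Move each parameter a small step in the direction that reduces the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = loss(w, b)
print(round(w, 3), round(b, 3), final_loss < initial_loss)
```

The gradient tells us how the loss responds to a small change in each parameter, so stepping against it lowers the loss. On this toy problem the parameters should approach the values used to generate the data.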