New deep learning algorithms and models appear every day to tackle problems both new and old. In the articles Deep learning: Supervised learning [part 1], Deep learning: Supervised learning [part 2], and others, we examined a small subset of the problems that deep learning can address. Many techniques are available, some better suited to a given problem than others, and it is up to us to work out which one to use in a specific context. But were all these techniques developed only in the last few years? Many of them were, but their roots go much further back. In this article we will study a bit of the history of data analysis that underlies the techniques we use today.
From the Middle Ages to the 19th century
The desire to analyze data and predict future outcomes has always been present in humans and is the basis of much of the natural sciences and mathematics. Two examples are the Bernoulli distribution, named after Jacob Bernoulli (1655-1705), and the Gaussian distribution, named after Carl Friedrich Gauss (1777-1855). Gauss developed, for example, the method of least squares, which is still used today for a multitude of problems, from insurance calculations to medical diagnostics. Such tools have improved the experimental approach in the natural sciences: for example, Ohm’s law, which relates current and voltage in a resistor, is perfectly described by a linear model.
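As a minimal sketch of how a least-squares fit recovers a law such as Ohm’s, the Python snippet below estimates the resistance R in the linear model V = R * I from a handful of measurements. The function name `fit_resistance` and the measurement values are made up for illustration and do not come from a real experiment.

```python
def fit_resistance(currents, voltages):
    """Least-squares estimate of R in the model V = R * I.

    Minimizing sum((V_k - R * I_k) ** 2) over R gives the closed form
    R = sum(I_k * V_k) / sum(I_k ** 2).
    """
    if len(currents) != len(voltages):
        raise ValueError("currents and voltages must have the same length")
    numerator = sum(i * v for i, v in zip(currents, voltages))
    denominator = sum(i * i for i in currents)
    if denominator == 0:
        raise ValueError("at least one current must be non-zero")
    return numerator / denominator


# Hypothetical noisy measurements (amperes and volts), invented for this sketch.
currents = [0.1, 0.2, 0.3, 0.4]
voltages = [1.02, 1.98, 3.05, 3.96]

resistance = fit_resistance(currents, voltages)
print(f"Estimated resistance: {resistance:.2f} ohm")
```

The line is forced through the origin here because Ohm’s law predicts zero voltage at zero current. A general linear fit would also estimate an intercept.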
As early as the Middle Ages, mathematicians had a keen intuition for estimates. For example, the geometry book of Jacob Köbel (1460-1533) illustrates averaging the foot lengths of 16 adult men to estimate the typical foot length in the population.
The illustration in Köbel’s book depicts the experiment. On leaving a church, a group of 16 adult men were asked to line up and have their feet measured. The sum of these measurements was then divided by 16 to obtain an estimate of the unit of length now called the foot. This “algorithm” was later improved to handle deformed feet: the two men with the shortest and the longest feet were sent away, and only the remaining measurements were averaged. This is one of the earliest examples of truncated mean estimation.
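The procedure described above can be sketched in a few lines of Python. The function name `truncated_mean`, its `trim` parameter, and the foot lengths are hypothetical, chosen only to illustrate the idea of dropping the extremes before averaging.

```python
def truncated_mean(values, trim=1):
    """Average the values after dropping the `trim` smallest and `trim` largest."""
    if trim < 0:
        raise ValueError("trim must be non-negative")
    if len(values) <= 2 * trim:
        raise ValueError("not enough values left after trimming")
    ordered = sorted(values)
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)


# Made-up foot lengths in centimetres for 16 men; two values are deliberate outliers.
foot_lengths_cm = [
    26.1, 27.3, 25.8, 26.9, 28.0, 26.4, 27.1, 25.5,
    26.7, 27.8, 26.2, 19.0, 27.0, 26.6, 33.5, 26.8,
]

plain_mean = sum(foot_lengths_cm) / len(foot_lengths_cm)
robust_mean = truncated_mean(foot_lengths_cm, trim=1)
print(f"plain mean:     {plain_mean:.2f} cm")
print(f"truncated mean: {robust_mean:.2f} cm")
```

With `trim=1` the function mirrors the improved procedure: the shortest and the longest foot are discarded and the remaining 14 measurements are averaged, so a single extreme value cannot pull the estimate far.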
The 20th century
Statistics took off with the availability and collection of data. One of its pioneers, Ronald Fisher (1890-1962), contributed significantly to its theory and to its applications in genetics. Many of his algorithms (such as linear discriminant analysis) and concepts (such as the Fisher information matrix) still occupy a prominent place in the foundations of modern statistics. The data he published have also had a lasting impact: the Iris dataset that Fisher published in 1936 is still used today to demonstrate machine learning algorithms. Fisher was also an advocate of eugenics, which should remind us that the morally questionable use of data science has as long and enduring a history as its productive use in industry and the natural sciences.
Other influences on machine learning come from the information theory of Claude Shannon (1916-2001) and the theory of computation proposed by Alan Turing (1912-1954). Turing posed the question “can machines think?” in his famous paper Computing Machinery and Intelligence (Turing, 1950). Describing what is now known as the Turing test, he proposed that a machine can be considered intelligent if a human evaluator finds it difficult to distinguish the machine’s responses from a human’s, based on purely textual interactions.
Further influences came from neuroscience and psychology. After all, humans clearly exhibit intelligent behavior. Many scholars have wondered whether it is possible to explain and possibly decode this ability. One of the first biologically inspired algorithms was formulated by Donald Hebb (1904-1985). In his groundbreaking book The Organization of Behavior (Hebb, 1949), he stated that neurons learn through positive reinforcement. This principle became known as the Hebbian learning rule. These ideas inspired later work, such as Rosenblatt’s perceptron learning algorithm, and laid the foundation for many stochastic gradient descent algorithms that today form the basis of deep learning: reinforcing desirable behavior and decreasing undesirable behavior to achieve good parameter settings in a neural network.
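To make the idea of adjusting parameters in response to mistakes concrete, here is a minimal sketch of the classic perceptron update rule as it is commonly presented in textbooks. It is not taken from Rosenblatt’s original work. The function names, the choice of the logical AND as training data, and the `max_epochs` parameter are assumptions made for this illustration.

```python
def predict(weights, bias, x):
    """Return +1 if the weighted sum is positive, otherwise -1."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else -1


def train_perceptron(samples, labels, max_epochs=100):
    """Train a perceptron; labels must be +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if predict(weights, bias, x) != y:
                # Move the decision boundary towards the correct side for x.
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            break  # every sample is classified correctly
    return weights, bias


# Toy data: the logical AND of two inputs, which is linearly separable.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, -1, -1, 1]

weights, bias = train_perceptron(samples, labels)
for x, y in zip(samples, labels):
    print(x, "target:", y, "predicted:", predict(weights, bias, x))
```

The weights change only when a prediction is wrong, which is a very literal form of strengthening desirable behavior and weakening undesirable behavior.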
Biological inspiration is what gave neural networks their name. For more than a century (beginning with the models of Alexander Bain, 1873, and Charles Sherrington, 1890), researchers have sought to assemble computational circuits that resemble networks of interacting neurons. Over time, the interpretation of biology has become less literal, but the name has remained. Underlying it are some key principles found in most networks today:
- The alternation of linear and nonlinear processing units, often referred to as layers.
- The use of the chain rule (also known as backpropagation) to adjust the parameters of the entire network at once.
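Both principles can be illustrated with a deliberately tiny network. The sketch below alternates a linear unit, a tanh nonlinearity, and a second linear unit, then applies the chain rule by hand to update all four parameters in one step. The architecture, the squared-error loss, the learning rate, and all function names are assumptions chosen for illustration. Real networks use vectors and matrices and rely on automatic differentiation.

```python
import math


def forward(params, x):
    """Linear unit -> tanh nonlinearity -> linear unit."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1      # first linear unit
    h = math.tanh(z)     # nonlinear unit
    y = w2 * h + b2      # second linear unit
    return z, h, y


def loss(params, x, target):
    """Squared error between the network output and the target."""
    _, _, y = forward(params, x)
    return 0.5 * (y - target) ** 2


def gradients(params, x, target):
    """Apply the chain rule from the output back to the first layer."""
    w1, b1, w2, b2 = params
    z, h, y = forward(params, x)
    dy = y - target            # dL/dy
    dw2 = dy * h               # dL/dw2 = dL/dy * dy/dw2
    db2 = dy                   # dL/db2
    dh = dy * w2               # dL/dh
    dz = dh * (1.0 - h * h)    # the derivative of tanh(z) is 1 - tanh(z)^2
    dw1 = dz * x               # dL/dw1
    db1 = dz                   # dL/db1
    return (dw1, db1, dw2, db2)


def train(params, x, target, learning_rate=0.1, steps=200):
    """Plain gradient descent: all parameters are updated together."""
    for _ in range(steps):
        grads = gradients(params, x, target)
        params = tuple(p - learning_rate * g for p, g in zip(params, grads))
    return params


initial = (0.5, 0.0, 0.5, 0.0)  # arbitrary starting values for w1, b1, w2, b2
trained = train(initial, x=1.0, target=0.8)
print("loss before training:", loss(initial, 1.0, 0.8))
print("loss after training: ", loss(trained, 1.0, 0.8))
```

Each gradient reuses the one computed just before it, which is what makes it cheap to adjust the parameters of the entire network at once.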
After rapid initial progress, neural network research came to a standstill from about 1995 to 2005. This was mainly due to two reasons. First, training a network is computationally expensive. While random access memory was abundant at the end of the last century, computational power was scarce. Second, the datasets were relatively small. In fact, Fisher’s Iris dataset of 1936 was still a popular tool for testing the effectiveness of algorithms. The MNIST dataset, with its 60,000 handwritten digits, was considered huge.
Given the scarcity of data and computation, strong statistical tools such as kernel methods, decision trees, and graphical models proved empirically superior in many applications. Moreover, unlike neural networks, they did not require weeks of training and provided predictable results with strong theoretical guarantees.