In the previous article, NLP: a comprehensive guide [Part 2], we introduced what natural language processing (NLP) is and in which application fields the resulting models are used. Many users, including some companies, use these models without fully knowing how they work, because the results are more than satisfactory in many cases. To take full advantage of them, however, it is necessary to know the basics of how these systems process natural language.
In this article we will analyze precisely these aspects, starting from text processing and exploring the main techniques used in this context.
How does natural language processing (NLP) work?
NLP models work by finding relationships between the constituent parts of language, i.e., the letters, words and sentences present in a textual data set. NLP architectures use various methods for data preprocessing, feature extraction and modeling. Below we will introduce the main techniques used.
Data preprocessing
Before a model processes text for a specific task, the text often needs to be preprocessed to improve model performance or to transform words and characters into a format that the model can understand. Data-centric AI is a growing movement that prioritizes data preprocessing. Various techniques illustrated below can be used for data preprocessing.
Stemming and lemmatization
Stemming is an informal process of converting words into their base forms using heuristic rules. For example, “university,” “universities,” and “university’s” can all be mapped to the base form “univers.” Lemmatization is a more formal way of finding roots by analyzing the morphology of a word using the vocabulary of a dictionary. Both are available in NLTK, while spaCy provides lemmatization.
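To make the difference concrete, here is a minimal sketch in plain Python. The suffix list and the small lemma dictionary are hypothetical and exist only for illustration; real stemmers (such as the Porter stemmer in NLTK) and real lemmatizers use far richer rules and vocabularies.

```python
# Toy suffix-stripping stemmer. The rule list is an assumption made for this
# example, not the rule set of any real stemmer.
SUFFIXES = ["ities", "ity's", "ity", "ies", "es", "s"]  # tried in this order

def toy_stem(word):
    word = word.lower().replace("’", "'")  # normalize curly apostrophes
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A lemmatizer looks words up in a vocabulary instead of chopping suffixes.
# This tiny dictionary stands in for a real one.
TOY_LEMMAS = {"universities": "university", "university's": "university"}

def toy_lemmatize(word):
    word = word.lower().replace("’", "'")
    return TOY_LEMMAS.get(word, word)
```

Note that the stem “univers” is not itself a word, whereas the lemma “university” is a dictionary entry.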
Sentence segmentation
This technique breaks a large text into linguistically meaningful sentence units. It may seem straightforward in languages such as English, where the end of a sentence is marked by a period, but it is not always so simple: a period can mark an abbreviation as well as the end of a sentence, and in that case the period should remain part of the abbreviation token. The process becomes even more complex in languages, such as ancient Chinese, that have no delimiter marking the end of a sentence.
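The abbreviation problem can be sketched with a simple rule-based splitter. The abbreviation list below is a hypothetical, deliberately short one; production systems rely on much larger lists or on trained models.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_punct = re.search(r"[.!?]$", token) is not None
        # A period that belongs to a known abbreviation does not end the sentence.
        if ends_with_punct and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences
```

With this rule, "Dr. Smith arrived. He was late." is split into two sentences rather than three, because "Dr." is kept as a single token.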
Stop word removal
At this stage, the most frequent words that do not add much information to the text, such as articles, conjunctions, and prepositions, are removed.
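A minimal sketch of this step follows. The stop word set is a small hypothetical sample; libraries such as NLTK and spaCy ship with much longer, language-specific lists.

```python
# Small illustrative stop word set (an assumption for this example).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to", "is"}

def remove_stop_words(tokens):
    # Compare in lowercase so that "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```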
Tokenization
Tokenization divides text into individual words and word fragments. The result generally consists of a word index and tokenized text in which words can be represented as numeric tokens for use in various deep learning methods. A method that instructs language models to ignore unimportant tokens can improve efficiency.
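The word index and the numeric tokens can be illustrated with a short word-level sketch. The regular expression and the special tokens `<pad>` and `<unk>` are assumptions chosen for this example; splitting words into fragments (subword tokenization) is not shown.

```python
import re

def tokenize(text):
    # Lowercase the text and keep runs of letters or digits,
    # optionally followed by an apostrophe part (as in "don't").
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

def build_word_index(corpus):
    # Reserve two ids: padding and unknown words (hypothetical convention).
    index = {"<pad>": 0, "<unk>": 1}
    for document in corpus:
        for token in tokenize(document):
            index.setdefault(token, len(index))
    return index

def encode(text, index):
    # Words never seen while building the index map to the <unk> id.
    return [index.get(token, index["<unk>"]) for token in tokenize(text)]
```

For instance, after building the index from "The cat sat" and "The dog sat", the sentence "the dog barked" is encoded as the ids of "the" and "dog" followed by the id of `<unk>`.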
Feature extraction
Most conventional machine learning techniques work on features (usually numbers that describe a document in relation to the corpus that contains it) created by Bag-of-Words, TF-IDF, or generic features such as document length, word polarity, and metadata (e.g., whether the text has associated tags or scores). More recent techniques include Word2Vec, GloVe, and feature learning during the process of training a neural network.
Bag-of-Words
This technique counts the number of times each word or n-gram (combination of n words) appears in a document. The Bag-of-Words model thus represents each document as a vector of counts, with one entry for each word or n-gram in the vocabulary.
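As a rough sketch, the counting can be done with the standard library alone. The function name and the whitespace tokenization are simplifications made for this example.

```python
from collections import Counter

def bag_of_words(documents, n=1):
    def ngrams(tokens):
        # Join each run of n consecutive tokens into one n-gram string.
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokenized = [ngrams(doc.lower().split()) for doc in documents]
    # The vocabulary is the sorted set of all n-grams seen in the corpus.
    vocabulary = sorted({gram for doc in tokenized for gram in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[gram] for gram in vocabulary])
    return vocabulary, vectors
```

Each document becomes a vector with one count per vocabulary entry; word order inside the document is lost, apart from what the n-grams preserve.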
TF-IDF
Whereas Bag-of-Words counts the occurrences of each word or n-gram in a document, TF-IDF weights each word according to its importance. To assess the importance of a word, we consider two elements:
- Term frequency (TF): How important is the word in the document?
TF(word in a document) = number of occurrences of that word in the document / number of words in the document
- Inverse document frequency (IDF): How important is the term in the whole corpus?
IDF(word in corpus) = log(number of documents in corpus / number of documents that include the word)
A word is important if it recurs many times in a document. But this creates a problem. Words such as “a” and “the” appear often. Because of this, their TF score will always be high. This problem can be solved by removing stop words from the document and also by using Inverse Document Frequency. This measure will be high if the word is rare, while taking low values if the word is common in the corpus. The TF-IDF score of a word is the product of TF and IDF.
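The two formulas translate almost directly into code. This is a minimal sketch that follows the definitions given here; library implementations (for example scikit-learn's) typically use smoothed variants, so their numbers will differ.

```python
import math

def tf(word, doc_tokens):
    # Occurrences of the word divided by the number of words in the document.
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, corpus_tokens):
    # Assumes the word occurs in at least one document of the corpus.
    containing = sum(1 for doc in corpus_tokens if word in doc)
    return math.log(len(corpus_tokens) / containing)

def tf_idf(word, doc_tokens, corpus_tokens):
    return tf(word, doc_tokens) * idf(word, corpus_tokens)
```

A word such as “the” that appears in every document gets IDF = log(1) = 0, so its TF-IDF score is zero however often it occurs.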
Word2Vec
Introduced in 2013, it uses a vanilla neural network to learn high-dimensional word embeddings from raw text. It is available in two variants:
- Skip-Gram: tries to predict the surrounding words from a target word.
- Continuous Bag-of-Words (CBOW): tries to predict the target word from the surrounding words.
Once training is complete and the last layer has been discarded, these models take a word as input and produce a word embedding that can be used as input for many NLP tasks. Word2Vec embeddings capture context: if certain words appear in similar contexts, their embeddings will be similar.
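The difference between the two variants is easiest to see in the training examples they are built from. The sketch below only generates those examples; the neural network that learns the embeddings is not shown, and the `window` parameter name is an assumption of this example.

```python
def context_indices(i, length, window):
    # Positions within `window` words of position i, excluding i itself.
    return [j for j in range(max(0, i - window), min(length, i + window + 1)) if j != i]

def skipgram_pairs(tokens, window=2):
    # Skip-Gram: (target word, one surrounding word) pairs.
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in context_indices(i, len(tokens), window)]

def cbow_examples(tokens, window=2):
    # CBOW: (all surrounding words, target word) examples.
    return [([tokens[j] for j in context_indices(i, len(tokens), window)], tokens[i])
            for i in range(len(tokens))]
```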
GloVe
Like Word2Vec, it learns word embeddings, but it does so using matrix factorization techniques rather than neural learning. The GloVe model builds a matrix based on the global count of co-occurrences between words.
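The co-occurrence counts that such a matrix is built from can be sketched as follows. This is a simplified version that uses plain counts inside a symmetric window; the factorization step that turns the counts into embeddings is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(corpus_tokens, window=2):
    counts = defaultdict(float)
    for tokens in corpus_tokens:
        for i, word in enumerate(tokens):
            # Look back at most `window` words and count the pair in both directions.
            for j in range(max(0, i - window), i):
                counts[(word, tokens[j])] += 1.0
                counts[(tokens[j], word)] += 1.0
    return dict(counts)
```

The result is a sparse representation of the matrix: each key is a (row word, column word) pair and missing keys stand for a count of zero.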
Modeling
After preprocessing, the data is fed into an NLP architecture that models it to perform a series of tasks. Numerical features extracted with the techniques described above can be fed into various models, depending on the task at hand. For example, for classification, the output of the TF-IDF vectorizer can be fed to logistic regression, naive Bayes, decision trees or gradient boosted trees. For entity recognition, Markov models can be used along with n-grams.
Deep neural networks generally work without the use of extracted features, although TF-IDF or Bag-of-Words features can be used as input.
The goal of a language model is to predict the next word given a stream of input words. Probabilistic models based on the Markov assumption are an example: the next word is assumed to depend only on the word immediately before it.
P(Wn | W1, ..., Wn-1) ≈ P(Wn | Wn-1)
Deep learning is also used to create these language models. These models take word embeddings as input and, at each time step, return a probability distribution over the next word, that is, a probability for each word in the dictionary. Pre-trained language models learn the structure of a particular language by processing a large corpus, such as Wikipedia. They can then be fine-tuned for a particular task. For example, BERT has been fine-tuned for tasks ranging from fact-checking to headline writing.
The main techniques of natural language processing
Most of the NLP tasks discussed above can be modeled by some general techniques. These techniques can be classified into two categories:
- Traditional machine learning methods
- Deep learning methods
Traditional machine learning techniques
In this category we find the following techniques:
- Logistic regression
- Naive Bayes
- Decision trees
- Latent Dirichlet Allocation (LDA)
- Hidden Markov models
Below we introduce their main characteristics.
Logistic regression
Logistic regression is a supervised classification algorithm that aims to predict the probability of an event occurring based on some input. In NLP, logistic regression models can be applied to solve problems such as sentiment analysis, spam detection and toxicity classification.
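As an illustration, the sketch below trains a tiny logistic regression with stochastic gradient descent on made-up data. The two features (how often “free” and “meeting” occur in a message), the learning rate and the number of epochs are all hypothetical choices for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    # Probability that the input belongs to the positive class.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            # Gradient of the log loss for one example is (prediction - target) * x.
            error = predict_proba(x, weights, bias) - target
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

# Made-up training data: [count of "free", count of "meeting"], label 1 = spam.
X_TRAIN = [[2, 0], [1, 0], [0, 1], [0, 2]]
Y_TRAIN = [1, 1, 0, 0]
```

After training on `X_TRAIN` and `Y_TRAIN`, `predict_proba` returns a value above 0.5 for messages dominated by “free” and below 0.5 for those dominated by “meeting”.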
Naive Bayes
Naive Bayes is also a supervised classification algorithm. It calculates the conditional probability distribution P(label | text) using Bayes' formula:
P(label | text) = P(label) * P(text | label) / P(text)
and predicts the label with the highest resulting probability. The assumption of the Naive Bayes model is that the individual words are independent of one another given the label. Thus:
P(text | label) = P(word_1 | label) * P(word_2 | label) * ... * P(word_n | label)
In NLP, these statistical methods can be applied to solve problems such as spam detection or finding bugs in software code.
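A compact sketch of this classifier follows. Two choices here are assumptions of the example rather than part of the description above: add-one (Laplace) smoothing, so that unseen words do not produce a zero probability, and log probabilities, so that long products do not underflow. P(text) is the same for every label, so it is left out of the comparison.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def predict_label(tokens, model):
    label_counts, word_counts, vocabulary = model
    total_documents = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(count / total_documents)  # log P(label)
        for token in tokens:
            # log P(word | label) with add-one smoothing
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)
```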
Decision trees
Decision trees are a class of supervised classification models that partition the data set according to different features to maximize the information gain in those partitions. An example of a decision tree is shown in the figure below. Their structure is very simple and highly interpretable as nodes impose conditions on the data features to determine which class they belong to.
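The information gain of a candidate split can be computed with a few lines of code. This sketch assumes boolean features stored in dictionaries and uses entropy as the impurity measure; the feature names in the usage note are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    # Split the labels according to whether the boolean feature is true or false.
    yes = [label for row, label in zip(rows, labels) if row[feature]]
    no = [label for row, label in zip(rows, labels) if not row[feature]]
    # Entropy after the split, weighted by the size of each partition.
    after = sum(len(part) / len(labels) * entropy(part) for part in (yes, no) if part)
    return entropy(labels) - after
```

When building the tree, each node would pick the feature with the highest information gain. A feature such as `contains_free` that separates spam from non-spam perfectly has a gain equal to the entropy of the labels, while a feature unrelated to the label has a gain of zero.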
Latent Dirichlet Allocation (LDA)
LDA models a document as a collection of topics and a topic as a collection of words. It is a statistical approach used primarily for modeling the topics covered in a document. The intuition behind it is that we can describe any topic using only a small collection of words in the corpus.
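The view of a document as a mixture of topics, and of a topic as a distribution over words, can be sketched as a small generative procedure. The topics, words and probabilities below are invented for illustration, and the sketch covers only how a document would be generated: fitting LDA, that is, inferring these distributions from a corpus, is not shown.

```python
import random

# Hypothetical topics, each a probability distribution over a few words.
TOPIC_WORDS = {
    "sports": {"match": 0.5, "team": 0.3, "goal": 0.2},
    "finance": {"market": 0.5, "stock": 0.3, "bank": 0.2},
}

def generate_document(topic_mix, length, rng):
    words = []
    for _ in range(length):
        # First choose a topic according to the document's topic mixture...
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        # ...then choose a word according to that topic's word distribution.
        distribution = TOPIC_WORDS[topic]
        words.append(rng.choices(list(distribution), weights=list(distribution.values()))[0])
    return words
```

For example, `generate_document({"sports": 0.7, "finance": 0.3}, 10, random.Random(0))` produces ten words drawn mostly, but not only, from the sports topic.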
Hidden Markov models
Markov models are probabilistic models that decide the next state of a system based on the current state. For example, in NLP, we might suggest the next word based on the previous word. We can model this as a Markov model in which we estimate the transition probability from word1 to word2, that is, P(word2|word1). Then we can use the product of these transition probabilities to find the probability of a sentence.
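A minimal sketch of such a word-level Markov model follows. Transition probabilities are estimated from counts in a toy corpus; the start marker `<s>` is an assumption of this example, and no smoothing is applied, so unseen transitions get probability zero.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for the beginning of a sentence

def train_bigram_model(sentences):
    transitions = defaultdict(Counter)  # previous word -> next word -> count
    for tokens in sentences:
        padded = [START] + tokens
        for previous, following in zip(padded, padded[1:]):
            transitions[previous][following] += 1
    return transitions

def transition_probability(model, previous, following):
    total = sum(model[previous].values())
    return model[previous][following] / total if total else 0.0

def sentence_probability(model, tokens):
    probability = 1.0
    padded = [START] + tokens
    for previous, following in zip(padded, padded[1:]):
        probability *= transition_probability(model, previous, following)
    return probability

def suggest_next(model, previous):
    # Most frequent follower of the previous word, or None if it was never seen.
    return model[previous].most_common(1)[0][0] if model[previous] else None
```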
The Hidden Markov Model (HMM) is a probabilistic modeling technique that introduces a hidden state into the Markov model. A hidden state is a property of the data that is not directly observed. HMMs are used for part-of-speech (POS) tagging, where the words in a sentence are the observed states and the POS tags are the hidden states. The HMM adds a concept called emission probability: the probability of an observation given a hidden state. In the POS tagging example, this is the probability of a word given its POS tag. HMMs assume that this probability can be inverted: given a sentence, we can compute the part-of-speech tag of each word based on both the probability that a word has a certain part-of-speech tag and the probability that a particular part-of-speech tag follows the tag assigned to the previous word. In practice, this problem is solved using the Viterbi algorithm. In the next figure we show a possible Hidden Markov Model.
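The Viterbi algorithm can be sketched in a few lines. The probabilities are passed in as dictionaries; in practice they would be estimated from a tagged corpus, and the numbers in the usage note below are invented for illustration.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][tag] = (probability of the best tag path ending in `tag` at word i,
    #                 previous tag on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            # Combine transition and emission probabilities for every previous tag
            # and keep the most probable one.
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))
```

With two tags, NOUN and VERB, a high emission probability for “bark” as a verb and a high NOUN to VERB transition probability, the sentence “dogs bark” is tagged NOUN, VERB.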
Deep learning techniques
The following techniques belong to this category:
- Convolutional neural network (CNN)
- Recurrent neural network (RNN)
- Autoencoder
- Encoder-decoder sequence-to-sequence
- Transformers
We outline their characteristics below.
Convolutional neural network (CNN)
The idea of using a CNN to classify text was first presented in the article “Convolutional Neural Networks for Sentence Classification” by Yoon Kim. The central insight is to see a document as an image. However, instead of pixels, the input is sentences or documents represented as an array of words. A graphical representation of the intuition behind CNNs is shown in the figure.
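To give an idea of what a convolution over text looks like, the sketch below slides a single filter over a sequence of word vectors and keeps the strongest response (max-over-time pooling). The word vectors and filter weights would normally be learned; here they are plain arguments, and the function name is an assumption of this example.

```python
def conv1d_max_pool(embeddings, kernel, bias=0.0):
    # embeddings: one vector per word (sequence length x embedding size)
    # kernel: one weight vector per window position (window size x embedding size)
    # Assumes the sequence is at least as long as the window.
    window = len(kernel)
    activations = []
    for start in range(len(embeddings) - window + 1):
        total = bias
        for offset in range(window):
            total += sum(w * x for w, x in zip(kernel[offset], embeddings[start + offset]))
        activations.append(max(0.0, total))  # ReLU non-linearity
    return max(activations)  # strongest response anywhere in the sentence
```

A filter with a window of two words acts like a learned bigram detector: it responds strongly wherever a particular pair of adjacent word vectors appears, regardless of the position in the sentence.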
Recurrent Neural Network (RNN)
Many text classification techniques using deep learning process words in close proximity using n-grams or a window (CNN). They are able to see “New York” as a single instance. However, they cannot grasp the context provided by a particular sequence of text. They do not learn the sequential structure of the data, in which each word depends on the previous word or a word in the previous sentence. RNNs, on the other hand, remember previous information using hidden states and link it to the current task. Architectures known as Gated Recurrent Unit (GRU) and long short-term memory (LSTM) are types of RNNs designed to remember information for an extended period. In addition, the bidirectional LSTM/GRU retains contextual information in both directions, which is useful in text classification. RNNs have also been used to generate mathematical proofs and translate human thoughts into words. A representation of them is shown in the figure.
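The role of the hidden state can be illustrated with the forward pass of a plain (non-gated) RNN. This is a minimal sketch with the weights passed in as nested lists; training, and the gating used by LSTM and GRU cells, are not shown.

```python
import math

def rnn_forward(inputs, W_xh, W_hh, b_h):
    # inputs: one vector per time step
    # W_xh: hidden size x input size, W_hh: hidden size x hidden size, b_h: hidden size
    hidden = [0.0] * len(b_h)
    states = []
    for x in inputs:
        new_hidden = []
        for i in range(len(b_h)):
            z = b_h[i]
            z += sum(W_xh[i][j] * x[j] for j in range(len(x)))            # current input
            z += sum(W_hh[i][j] * hidden[j] for j in range(len(hidden)))  # previous state
            new_hidden.append(math.tanh(z))
        hidden = new_hidden
        states.append(hidden)
    return states
```

Because each state depends on the previous one, feeding the same vectors in a different order gives a different final state, which is how the network picks up sequential structure.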
Autoencoder
Autoencoders are deep learning encoder-decoders that approximate a mapping from X to X, i.e., input=output. They first compress the input features into a low-dimensional representation (sometimes called latent code, latent vector, or latent representation) and learn to reconstruct the input from it. The representation vector can be used as input to a separate model, so this technique can be used to reduce dimensionality. The figure shows the logic behind this technique.
Among specialists in many other fields, geneticists have applied autoencoders to detect disease-associated mutations in amino acid sequences.
Encoder-decoder sequence-to-sequence
The encoder-decoder seq2seq architecture is an adaptation of autoencoders specialized for translation, summarization and similar tasks. The encoder encapsulates the text information in an encoded vector. Unlike an autoencoder, instead of reconstructing the input from the encoded vector, the decoder’s task is to generate a different desired output, such as a translation or summary. A representation of a text translation model is shown in the figure.
Transformers
This architecture was first described in the 2017 article “Attention Is All You Need” (Vaswani, Shazeer, Parmar, et al.). The transformer forgoes recurrence and relies entirely on a self-attention mechanism to track global dependencies between input and output. Because this mechanism processes all words simultaneously (rather than one at a time), it can be parallelized, which reduces training time and inference cost compared to RNNs. In the following figure we show the architecture of a transformer.
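The core of the self-attention mechanism, scaled dot-product attention, can be sketched in plain Python. In a real transformer the query, key and value matrices Q, K and V come from learned projections of the word embeddings, and several attention heads run in parallel; here they are simply passed in as lists of vectors.

```python
import math

def softmax(scores):
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    d_k = len(K[0])
    outputs = []
    for query in Q:
        # Similarity of this position's query with the key of every position.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of all value vectors.
        outputs.append([sum(w * value[j] for w, value in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs
```

Each output row depends on every input position at once, and the rows can be computed independently of one another, which is what makes the mechanism easy to parallelize.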
The transformer architecture has revolutionized NLP in recent years, leading to models such as BLOOM, Jurassic-X and Turing-NLG. It has also been successfully applied to a variety of vision tasks, including 3D image creation.