NLP: a comprehensive guide [Part 2] 

Natural language processing (NLP) allows us to create systems that can interpret what we write. But how are the data underlying these models processed? And what techniques are most commonly used? In this guide we will look at these issues.

Reading time: 9 minutes

In the previous article, NLP: a comprehensive guide [Part 1], we introduced what natural language processing (NLP) is and in which application fields its models are used. Many users, including some companies, use these models without fully knowing how they work, because the results are more than satisfactory in many cases. To take full advantage of them, however, it is necessary to know the basics of how these systems process natural language.

In this article we will analyze precisely these aspects, starting from text processing and exploring the main techniques used in this context.

How does natural language processing (NLP) work?

NLP models work by finding relationships between the constituent parts of language, i.e., the letters, words and sentences present in a textual data set. NLP architectures use various methods for data preprocessing, feature extraction and modeling. Below we will introduce the main techniques used.

Data preprocessing

Before a model processes text for a specific task, the text often needs to be preprocessed to improve model performance or to transform words and characters into a format that the model can understand. Data-centric AI is a growing movement that prioritizes data preprocessing. Various techniques illustrated below can be used for data preprocessing.

Stemming and lemmatization

Stemming is an informal process of converting words into their base forms using heuristic rules. For example, “university,” “universities,” and “university’s” can all be mapped to the base form “univers.” Lemmatization is a more formal way of finding roots: it analyzes the morphology of a word using the vocabulary of a dictionary. NLTK provides both stemmers and lemmatizers, while spaCy provides lemmatization.
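To make the difference concrete, here is a minimal sketch in plain Python. The suffix list and the small lemma dictionary are hypothetical and chosen only to reproduce the “university” example; this is not the Porter algorithm or the behavior of any specific library.

```python
# Toy suffix-stripping stemmer: a handful of heuristic rules, NOT the Porter algorithm.
SUFFIXES = ("ities", "ity's", "ity", "ies", "es", "s")  # hypothetical rules, longest first

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so that short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a dictionary lookup standing in for real morphological analysis.
LEMMAS = {"universities": "university", "university's": "university", "mice": "mouse"}

def toy_lemmatize(word):
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["university", "universities", "university's"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer returns a truncated string that is not necessarily a real word, while the lemmatizer returns a dictionary form.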

Sentence segmentation

This technique breaks a large text into linguistically meaningful sentence units. This seems straightforward in languages such as English, where the end of a sentence is marked by a period, but it is not always so simple: a period can also mark an abbreviation, in which case it should remain part of the abbreviation’s token instead of ending the sentence. The process becomes even more complex in languages, such as Old Chinese, that do not have a delimiter marking the end of a sentence.
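The abbreviation problem can be illustrated with a small rule-based splitter. This is only a sketch: the abbreviation list is a hypothetical, deliberately tiny one, and real segmenters rely on much larger lists or on trained models.

```python
import re

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = re.search(r"[.!?]$", token) is not None
        # A final period closes the sentence only if the token is not a known abbreviation.
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived late. She apologized."))
# ['Dr. Smith arrived late.', 'She apologized.']
```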

Stop word removal

At this stage, the most frequent words that add little information to the text, such as articles, conjunctions, and prepositions, are removed.
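A minimal sketch of this step follows. The stop-word list is a hypothetical, very short one; libraries ship much longer, language-specific lists.

```python
# Hypothetical, minimal stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to"}

def remove_stop_words(tokens):
    # Compare in lowercase so that "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The cat sat on the mat and purred".split()
print(remove_stop_words(tokens))  # ['cat', 'sat', 'mat', 'purred']
```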

Tokenization

Tokenization divides text into individual words and word fragments. The result generally consists of a word index and tokenized text in which words can be represented as numeric tokens for use in various deep learning methods. A method that instructs language models to ignore unimportant tokens can improve efficiency.
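The word index and the numeric tokens can be sketched as follows. The regular expression and the reserved IDs for padding and unknown words are assumptions made for this example; production tokenizers typically split text into subword fragments with more elaborate algorithms.

```python
import re

def tokenize(text):
    # Lowercase, then split into runs of letters/digits and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_word_index(texts):
    # ID 0 is reserved for padding and 1 for unknown words (a common convention, assumed here).
    index = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for token in tokenize(text):
            index.setdefault(token, len(index))
    return index

def encode(text, index):
    # Words never seen while building the index map to the <unk> ID.
    return [index.get(token, index["<unk>"]) for token in tokenize(text)]

corpus = ["The cat sat.", "The dog barked."]
word_index = build_word_index(corpus)
print(encode("The dog sat!", word_index))  # [2, 6, 4, 1]
```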

Feature extraction

Most conventional machine learning techniques work on features (usually numbers that describe a document in relation to the corpus that contains it) created by Bag-of-Words, TF-IDF, or generic features such as document length, word polarity, and metadata (e.g., whether the text has associated tags or scores). More recent techniques include Word2Vec, GloVe, and feature learning during the training of a neural network.

Bag-of-Words

This technique counts the number of times each word or n-gram (a sequence of n consecutive words) appears in a document. The Bag-of-Words model thus creates a numerical representation of the dataset in which each document becomes a vector of counts, disregarding word order.
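As a rough illustration, the counting can be done with the standard library alone. The function name and the toy sentence are our own choices for this sketch.

```python
from collections import Counter

def bag_of_words(tokens, n=1):
    # Count every run of n consecutive tokens (n=1 gives plain word counts).
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams)

tokens = "the cat chased the other cat".split()
print(bag_of_words(tokens))       # word counts
print(bag_of_words(tokens, n=2))  # bigram counts
```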

TF-IDF

While in Bag-of-Words we count the occurrences of each word or n-gram in a document, with TF-IDF we weight each word according to its importance. To assess the importance of a word, we consider two elements:

  • Term frequency (TF): How important is the word in the document?

    TF(word in a document) = number of occurrences of that word in the document / number of words in the document

  • Inverse document frequency (IDF): How important is the term in the whole corpus?

    IDF(word in corpus) = log(number of documents in corpus / number of documents that include the word)

A word is important if it recurs many times in a document. But this creates a problem. Words such as “a” and “the” appear often. Because of this, their TF score will always be high. This problem can be solved by removing stop words from the document and also by using Inverse Document Frequency. This measure will be high if the word is rare, while taking low values if the word is common in the corpus. The TF-IDF score of a word is the product of TF and IDF.
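The two formulas translate almost directly into code. The sketch below uses a made-up three-document corpus and assumes that the word occurs in at least one document; libraries often use smoothed variants of IDF, which give slightly different numbers.

```python
import math

def tf(word, document):
    # Occurrences of the word divided by the number of words in the document.
    return document.count(word) / len(document)

def idf(word, corpus):
    containing = sum(1 for document in corpus if word in document)
    return math.log(len(corpus) / containing)  # assumes the word occurs in at least one document

def tf_idf(word, document, corpus):
    return tf(word, document) * idf(word, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
print(tf_idf("the", corpus[0], corpus))  # 0.0: "the" occurs in every document
print(tf_idf("mat", corpus[0], corpus))  # positive: "mat" is rare in the corpus
```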

Word2Vec

Introduced in 2013, Word2Vec uses a shallow neural network to learn dense word embeddings from raw text. It is available in two variants:

  • Skip-Gram: predicts the surrounding words from a target word.
  • Continuous Bag-of-Words (CBOW): predicts the target word from the surrounding words.

Once training is complete and the last layer has been discarded, these models take a word as input and produce a word embedding that can be used as input for many NLP tasks. Word2Vec embeddings capture context: if certain words appear in similar contexts, their embeddings will be similar.
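The two variants differ in how training examples are built from the text. The sketch below only generates those examples for a toy sentence; the neural network itself is omitted, and the window size of 1 is an arbitrary choice for illustration.

```python
def skip_gram_pairs(tokens, window=1):
    # (target, context) pairs: the model predicts each context word from the target.
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    # (context words, target) pairs: the model predicts the target from its context.
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        pairs.append((context, target))
    return pairs

tokens = "the cat sat".split()
print(skip_gram_pairs(tokens))
print(cbow_pairs(tokens))
```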

GloVe

Like Word2Vec, GloVe learns word embeddings, but it does so using matrix factorization techniques rather than neural learning. The GloVe model builds a matrix based on the global counts of co-occurrences between words.
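The co-occurrence matrix mentioned above can be sketched as a dictionary of pair counts. The window size and the 1/distance weighting are assumptions of this example, and the factorization step that turns the counts into embeddings is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                # Symmetric counts; closer words weigh more (1/distance, an assumed weighting).
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts

sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]]
counts = cooccurrence_counts(sentences)
print(counts[("the", "cat")])  # 2.0: adjacent in both sentences
print(counts[("sat", "the")])  # 0.5: two positions apart, once
```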

Modeling

After preprocessing, the data is fed into an NLP architecture that models it to perform a series of tasks. Numerical features extracted with the techniques described above can be fed into various models, depending on the task at hand. For example, for classification, the output of the TF-IDF vectorizer can be fed to logistic regression, naive Bayes, decision trees or gradient boosted trees. For named entity recognition, hidden Markov models can be used along with n-grams.

Deep neural networks generally work without the use of extracted features, although TF-IDF or Bag-of-Words features can be used as input.

The goal of a language model is to predict the next word given a stream of input words. Probabilistic models that use the Markov assumption are an example: the probability of the next word is taken to depend only on the word that precedes it.

    P(Wn | W1, ..., Wn-1) ≈ P(Wn | Wn-1)

Deep learning is also used to create these language models. These models take word embeddings as input and, at each time step, return a probability distribution over the next word, that is, a probability for each word in the vocabulary. Pre-trained language models learn the structure of a particular language by processing a large corpus, such as Wikipedia. They can then be fine-tuned for a particular task. For example, BERT has been fine-tuned for tasks ranging from fact-checking to headline writing.

The main techniques of natural language processing

Most of the NLP tasks discussed above can be modeled by some general techniques. These techniques can be classified into two categories:

  • Traditional machine learning methods
  • Deep learning methods

Traditional machine learning techniques

In this category we find the following techniques:

  • Logistic regression
  • Naive Bayes
  • Decision trees
  • Latent Dirichlet Allocation (LDA)
  • Hidden Markov models

Below we introduce their main characteristics.

Logistic regression

Logistic regression is a supervised classification algorithm that aims to predict the probability of an event occurring based on some input. In NLP, logistic regression models can be applied to solve problems such as sentiment analysis, spam detection and toxicity classification.
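As a minimal sketch of the idea, the following plain-Python code trains a logistic regression with stochastic gradient descent on a toy sentiment dataset. The two features (counts of the words “great” and “terrible”), the learning rate and the number of epochs are all hypothetical choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, weights, bias):
    # Probability of the positive class for feature vector x.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict_probability(x, weights, bias) - y
            # Stochastic gradient step on the log loss.
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical features: [count of "great", count of "terrible"] per review; label 1 = positive.
features = [[2, 0], [1, 0], [0, 1], [0, 2]]
labels = [1, 1, 0, 0]
weights, bias = train_logistic_regression(features, labels)
print(predict_probability([1, 0], weights, bias) > 0.5)  # True
print(predict_probability([0, 1], weights, bias) > 0.5)  # False
```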

Naive Bayes

Naive Bayes is also a supervised classification algorithm. It computes the conditional probability distribution P(label | text) using Bayes’ formula:

    P(label | text) = P(label) × P(text | label) / P(text)

and predicts the label with the highest resulting probability. The “naive” assumption of the model is that the individual words are independent of one another given the label. Thus:

    P(text | label) = P(word_1 | label) × P(word_2 | label) × ... × P(word_n | label)

In NLP, these statistical methods can be applied to solve problems such as spam detection or finding bugs in software code.
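A compact sketch of a Naive Bayes spam classifier is given below. The four training messages are made up, and two implementation choices go beyond the formulas: probabilities are summed in log space to avoid numerical underflow, and add-one (Laplace) smoothing is used so that unseen words do not produce a zero probability.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    # examples: list of (tokens, label) pairs
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def predict(tokens, model):
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        # log P(label) + sum of log P(word | label); P(text) is identical for all labels and is dropped.
        score = math.log(count / total_examples)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one (Laplace) smoothing, an assumption of this sketch.
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

examples = [
    ("win free money now".split(), "spam"),
    ("free prize claim now".split(), "spam"),
    ("meeting at noon tomorrow".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
model = train_naive_bayes(examples)
print(predict("free money".split(), model))        # spam
print(predict("meeting tomorrow".split(), model))  # ham
```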

Decision trees

Decision trees are a class of supervised classification models that partition the data set according to different features to maximize the information gain in those partitions. An example of a decision tree is shown in the figure below. Their structure is very simple and highly interpretable as nodes impose conditions on the data features to determine which class they belong to.
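The information gain that guides each split can be computed as the entropy of the labels before the split minus the weighted entropy of the resulting partitions. The sketch below uses a made-up dataset with two hypothetical boolean features; it scores candidate splits but does not build a full tree.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    # Entropy before the split minus the weighted entropy of each partition.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical features for four messages.
rows = [
    {"contains_free": True, "long": True},
    {"contains_free": True, "long": False},
    {"contains_free": False, "long": True},
    {"contains_free": False, "long": False},
]
labels = ["spam", "spam", "ham", "ham"]
print(information_gain(rows, labels, "contains_free"))  # 1.0: the split separates the classes perfectly
print(information_gain(rows, labels, "long"))           # 0.0: the split tells us nothing
```

A tree-building algorithm would pick “contains_free” for the first node, since it yields the larger gain.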

Latent Dirichlet Allocation (LDA)

LDA models a document as a collection of topics and a topic as a collection of words. It is a statistical approach used primarily for modeling the topics covered in a document. The intuition behind it is that we can describe any topic using only a small collection of words in the corpus.

Hidden Markov models

Markov models are probabilistic models that determine the next state of a system based only on the current state. For example, in NLP, we might suggest the next word based on the previous word. We can model this as a Markov model in which we estimate the transition probabilities from word1 to word2, that is, P(word2|word1). Then we can use the product of these transition probabilities to find the probability of a sentence.
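A word-level Markov model of this kind can be sketched by counting word pairs. The three training sentences are made up, and the start and end markers are an assumption added so that sentence boundaries also have transition probabilities.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]  # assumed start/end markers
        for previous, current in zip(padded, padded[1:]):
            counts[previous][current] += 1
    return counts

def transition_probability(counts, previous, word):
    # Estimate of P(word | previous) from relative frequencies.
    total = sum(counts[previous].values())
    return counts[previous][word] / total if total else 0.0

def sentence_probability(counts, tokens):
    # Product of the transition probabilities along the sentence.
    padded = ["<s>"] + tokens + ["</s>"]
    probability = 1.0
    for previous, current in zip(padded, padded[1:]):
        probability *= transition_probability(counts, previous, current)
    return probability

sentences = [s.split() for s in ["the cat sat", "the cat ran", "the dog sat"]]
model = train_bigram_model(sentences)
print(transition_probability(model, "the", "cat"))        # 2/3
print(sentence_probability(model, "the cat sat".split())) # 1 * 2/3 * 1/2 * 1 = 1/3
```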

The Hidden Markov Model (HMM) is a probabilistic modeling technique that introduces a hidden state into the Markov model. A hidden state is a property of the data that is not directly observed. HMMs are used for part-of-speech (POS) tagging, where the words in a sentence are the observed states and the POS tags are the hidden states. The HMM adds a concept called emission probability: the probability of an observation given a hidden state. In the example above, this is the probability of a word given its POS tag. HMMs assume that this probability can be inverted: given a sentence, we can compute the part-of-speech tag of each word based on both the probability that a word has a certain part-of-speech tag and the probability that a particular part-of-speech tag follows the tag assigned to the previous word. In practice, this problem is solved using the Viterbi algorithm. In the next figure we show a possible Hidden Markov Model.
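The following is a small sketch of the Viterbi algorithm for a two-tag HMM. All start, transition and emission probabilities are made-up illustrative values, not estimates from a real corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][tag] = (probability of the best tag path ending in `tag` at word i, previous tag)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None) for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            column[tag] = max(
                (best[-1][p][0] * trans_p[p][tag] * emit_p[tag].get(word, 0.0), p) for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# All probabilities below are made-up illustrative values.
tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"dogs": 0.6, "bark": 0.1}, "VERB": {"dogs": 0.1, "bark": 0.7}}
print(viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p))  # ['NOUN', 'VERB']
```

Here `trans_p` holds the probability that one tag follows another and `emit_p` holds the emission probabilities of words given tags.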

Deep learning techniques

The following techniques belong to this category:

  • Convolutional neural network (CNN)
  • Recurrent neural network (RNN)
  • Autoencoder
  • Encoder-decoder sequence-to-sequence
  • Transformers

We outline their characteristics below.

Convolutional neural network (CNN)

The idea of using a CNN to classify text was first presented in the paper “Convolutional Neural Networks for Sentence Classification” by Yoon Kim. The central insight is to see a document as an image. However, instead of pixels, the input is sentences or documents represented as a matrix of words. A graphical representation of the intuition behind CNNs is shown in the figure.

Recurrent Neural Network (RNN)

Many text classification techniques using deep learning process words in close proximity using n-grams or a window (CNN). They are able to see “New York” as a single instance. However, they cannot grasp the context provided by a particular sequence of text. They do not learn the sequential structure of the data, in which each word depends on the previous word or a word in the previous sentence. RNNs, on the other hand, remember previous information using hidden states and link it to the current task. Architectures known as Gated Recurrent Unit (GRU) and long short-term memory (LSTM) are types of RNNs designed to remember information for an extended period. In addition, the bidirectional LSTM/GRU retains contextual information in both directions, which is useful in text classification. RNNs have also been used to generate mathematical proofs and translate human thoughts into words. A representation of them is shown in the figure.
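The role of the hidden state can be sketched with a single recurrent update, h_t = tanh(W_xh x_t + W_hh h_(t-1) + b), written in plain Python. The weights are made-up, untrained values and the inputs are toy one-hot vectors; gated variants such as LSTM and GRU add further machinery that is not shown.

```python
import math

def rnn_step(x, h, w_xh, w_hh, b):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_xh[k], x))
            + sum(w * hi for w, hi in zip(w_hh[k], h))
            + b[k]
        )
        for k in range(len(h))
    ]

def run_rnn(sequence, w_xh, w_hh, b):
    h = [0.0] * len(b)  # initial hidden state
    for x in sequence:
        h = rnn_step(x, h, w_xh, w_hh, b)
    return h

# Made-up, untrained weights: 2-dimensional inputs and a 2-dimensional hidden state.
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.4], [-0.6, 0.1]]
b = [0.0, 0.0]
new_york = [[1.0, 0.0], [0.0, 1.0]]  # toy one-hot vectors for "New", "York"
york_new = new_york[::-1]
print(run_rnn(new_york, w_xh, w_hh, b) != run_rnn(york_new, w_xh, w_hh, b))  # True: order matters
```

Because each hidden state depends on the previous one, feeding the same words in a different order yields a different final state.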

Autoencoder

Autoencoders are deep learning encoder-decoders that approximate a mapping from X to X, i.e., input=output. They first compress the input features into a low-dimensional representation (sometimes called latent code, latent vector, or latent representation) and then learn to reconstruct the input from it. The representation vector can be used as input to a separate model, so this technique can be used to reduce dimensionality. The figure shows the logic behind this technique.

Among specialists in many other fields, geneticists have applied autoencoders to detect disease-associated mutations in amino acid sequences.

Encoder-decoder sequence-to-sequence

The encoder-decoder seq2seq architecture is an adaptation of autoencoders specialized for translation, summarization and similar tasks. The encoder encapsulates the text information in an encoded vector. Unlike an autoencoder, instead of reconstructing the input from the encoded vector, the decoder’s task is to generate a different desired output, such as a translation or summary. A representation of a text translation model is shown in the figure.

Transformers

This architecture was first described in the 2017 paper “Attention Is All You Need” (Vaswani, Shazeer, Parmar, et al.). The transformer forgoes recurrence and relies entirely on a self-attention mechanism to track global dependencies between input and output. Because this mechanism processes all words simultaneously (rather than one at a time), it is parallelizable, which shortens training time and can lower inference cost compared to RNNs. The following figure shows the architecture of a transformer.
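The core of the self-attention mechanism is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. The sketch below implements it in plain Python on toy vectors; the learned projections that produce the queries, keys and values, as well as multiple heads and positional encodings, are omitted.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    # Scaled dot-product attention: one output row per query position.
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs

# Toy 2-dimensional vectors for three tokens; in a real transformer Q, K and V
# come from learned linear projections of the token embeddings (omitted here).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(x, x, x):
    print([round(value, 3) for value in row])
```

Each output row is a weighted average of all value vectors, and the rows do not depend on one another, which is why they can be computed in parallel.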

Transformer architecture has revolutionized NLP in recent years, leading to models such as BLOOM, Jurassic-X and Turing-NLG. It has also been successfully applied to a variety of different vision tasks, including 3D image creation.
