Natural language processing (NLP) is one of the most fashionable areas of artificial intelligence (AI) in recent years thanks to applications such as text generators that create coherent texts, chatbots that trick people into thinking they are talking to a human interlocutor, and text-to-image programmes that produce photorealistic images from a textual description. Recent years have brought a revolution in the ability of computers to understand human languages, programming languages and even biological and chemical sequences, such as DNA and protein structures, that resemble language. The latest artificial intelligence models are unlocking these areas to analyse the meanings of input text and generate meaningful and expressive results.
In this article, we explain what natural language processing is and why it is so important today. We then look at some of the main application areas in which NLP models are being used.
What is natural language processing?
Natural language processing (NLP) is the discipline that deals with building models that can manipulate human language (or data that resemble it) in the way it is written, spoken and organised. It evolved from computational linguistics, which uses computer science to understand the principles of language, but rather than developing theoretical models, NLP is an engineering discipline that seeks to build technology to perform useful tasks. NLP can be divided into two overlapping subfields: natural language understanding (NLU), which focuses on semantic analysis, that is, determining the meaning of text, and natural language generation (NLG), which focuses on the generation of text by a computer. NLP is distinct from, though often combined with, speech recognition, which converts spoken language into text (speech synthesis does the reverse).
Why is natural language processing important?
NLP is now an integral part of everyday life, and increasingly so as language technology is applied to sectors such as retail (for example, customer service chatbots) and medicine (interpreting or summarising electronic medical records). Conversational agents, such as Amazon’s Alexa and Apple’s Siri, use NLP to listen to users’ questions and find answers. More advanced models, such as GPT-3 and its successors, can generate sophisticated texts on a wide variety of topics and power chatbots capable of holding coherent conversations. Google uses NLP to improve its search engine results, and social networks such as Facebook use it to detect and filter out inappropriate and offensive speech.
NLP is becoming increasingly sophisticated, but there is still much work to be done. Current systems are prone to bias and incoherence, and sometimes behave erratically. Despite these challenges, machine learning engineers have many opportunities to apply NLP in ways that are increasingly relevant to the functioning of society.
How is natural language processing used?
NLP is used for a wide variety of language-related tasks, including answering questions, classifying text in various ways and conversing with users.
Below we look at some of the tasks that can be addressed by NLP.
Sentiment analysis
Sentiment analysis is the process of classifying the emotional intent of a text. Typically, the input of a classification model is a text and the output is the probability that the sentiment expressed is positive, negative or neutral. This probability is usually computed either from hand-generated features, such as word n-grams weighted with TF-IDF, or by deep learning models that capture long- and short-term sequential dependencies. Sentiment analysis is used to classify customer reviews on various online platforms and for niche applications such as identifying signs of mental illness in online comments.
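To make the idea of hand-generated features more concrete, the following is a minimal sketch of how word n-grams and TF-IDF weights could be computed before being passed to a classifier. The function names, the tiny review set and the plain `tf * log(N / df)` weighting are assumptions chosen for illustration; real libraries use slightly different variants of the formula.

```python
import math
from collections import Counter

def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max, lowercased."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def tfidf(docs, n_max=2):
    """Compute one {ngram: weight} dict per document using tf * log(N / df)."""
    grams_per_doc = [Counter(word_ngrams(d, n_max)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each n-gram appears.
    doc_freq = Counter()
    for counts in grams_per_doc:
        doc_freq.update(counts.keys())
    vectors = []
    for counts in grams_per_doc:
        total = sum(counts.values())
        vectors.append({
            g: (c / total) * math.log(n_docs / doc_freq[g])
            for g, c in counts.items()
        })
    return vectors

# Hypothetical reviews, used only to illustrate the feature computation.
reviews = [
    "great phone great battery",
    "terrible phone",
    "battery is fine",
]
features = tfidf(reviews)
```

In the first review, "great" receives a higher weight than "phone" because it is frequent in that review and absent from the others. A classifier would then be trained on these weights to output the sentiment probabilities.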
Toxicity classification
This is a branch of sentiment analysis in which the objective is not only to classify hostile intent, but also to identify particular categories such as threats, insults, obscenities and hatred towards certain identities. The input of this model, as in sentiment analysis, is a text and the output is generally the probability associated with each class of toxicity. Toxicity classification models can be used to moderate and improve online conversations by silencing offensive comments, detecting hate speech or analysing documents for defamation.
Automatic translation
This task focuses on automating translation between different languages. Google Translate and DeepL are perhaps the most popular and widely used applications. Translation models are also used to improve communication between people on social media platforms. The most effective approaches are able to distinguish between words with similar meanings, and some systems also perform language identification.
Named entity recognition
Named entity recognition extracts the entities in a text that belong to predefined categories, such as names of people, organisations, places and quantities. The output consists of the named entities found, together with their positions in the original text. Entity recognition is useful in applications such as summarising news articles and combating misinformation.
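The shape of such an output can be illustrated with a small sketch. It uses a hard-coded lookup list instead of a trained model, so the `GAZETTEER` dictionary, the labels and the example sentence are all hypothetical; only the structure of the result (entity text, category and character offsets) reflects what the text describes.

```python
import re

# Hypothetical lookup list; a trained NER model would learn these categories from data.
GAZETTEER = {
    "PERSON": ["Maria Rossi"],
    "ORG": ["Acme Corp"],
    "LOC": ["Paris"],
}

def find_entities(text):
    """Return entities as dicts with surface text, label and character offsets."""
    entities = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                entities.append({
                    "text": name,
                    "label": label,
                    "start": match.start(),
                    "end": match.end(),
                })
    # Sort by position so the entities appear in reading order.
    return sorted(entities, key=lambda e: e["start"])

sentence = "Maria Rossi joined Acme Corp in Paris."
entities = find_entities(sentence)
for entity in entities:
    print(entity)
```

Each returned offset pair can be used to slice the original sentence and recover the entity, which is what downstream applications typically rely on.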
Spam detection
Spam detection is a common binary classification problem in NLP, the purpose of which is to classify e-mails as spam or not spam. Spam detectors usually use not only the text of an e-mail but also other information, such as its subject line and the name and address of the sender. Their output is the probability that the e-mail is spam. E-mail providers such as Gmail use these models to provide a better user experience by detecting unsolicited and unwanted e-mails and moving them to the spam folder.
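As a rough illustration of how text and sender information can be combined into a single spam probability, here is a sketch of a multinomial naive Bayes classifier with add-one smoothing. Naive Bayes is our choice for the example, not necessarily what any e-mail provider uses, and the training e-mails, field names and addresses are invented.

```python
import math
from collections import Counter

def tokens(email):
    """Combine subject, body and sender domain into one bag of features."""
    words = (email["subject"] + " " + email["body"]).lower().split()
    # The sender's domain is added as its own feature, separate from the words.
    return words + ["sender_domain=" + email["sender"].split("@")[-1].lower()]

def train(emails, labels):
    """Count features per class for a multinomial naive Bayes model."""
    counts = {True: Counter(), False: Counter()}
    class_totals = Counter(labels)
    for email, is_spam in zip(emails, labels):
        counts[is_spam].update(tokens(email))
    vocab = set(counts[True]) | set(counts[False])
    return counts, class_totals, vocab

def spam_probability(email, model):
    """Return P(spam | email) using add-one (Laplace) smoothing."""
    counts, class_totals, vocab = model
    log_scores = {}
    for cls in (True, False):
        total = sum(counts[cls].values())
        score = math.log(class_totals[cls] / sum(class_totals.values()))
        for tok in tokens(email):
            score += math.log((counts[cls][tok] + 1) / (total + len(vocab)))
        log_scores[cls] = score
    # Normalise the two log scores into a probability.
    top = max(log_scores.values())
    exp = {cls: math.exp(s - top) for cls, s in log_scores.items()}
    return exp[True] / (exp[True] + exp[False])

# Invented training data: True means spam.
training = [
    {"subject": "win money now", "sender": "promo@deals.example", "body": "claim your free prize"},
    {"subject": "free prize", "sender": "offers@deals.example", "body": "win now"},
    {"subject": "meeting agenda", "sender": "anna@work.example", "body": "see the agenda for monday"},
    {"subject": "lunch monday", "sender": "bob@work.example", "body": "are you free for lunch"},
]
model = train(training, [True, True, False, False])

suspicious = {"subject": "free money", "sender": "x@deals.example", "body": "win a prize"}
normal = {"subject": "monday meeting", "sender": "carl@work.example", "body": "agenda for lunch"}
p_suspicious = spam_probability(suspicious, model)
p_normal = spam_probability(normal, model)
```

A provider would then move a message to the spam folder when the probability exceeds a chosen threshold, for example 0.5.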
Correction of grammatical mistakes
Models in this category encode grammatical rules to correct a text. This is mainly treated as a sequence-to-sequence task, in which a model is trained on an ungrammatical sentence as input and a correct sentence as output. Online grammar checkers, such as Grammarly, and word processing systems, such as Microsoft Word, use them to provide a better writing experience for their customers. Schools also use them to evaluate student papers.
Topic modeling
Topic modeling is an unsupervised text mining task that takes a collection of documents and discovers the topics they cover. The output is a list of topics, each defined by its most characteristic words, together with the proportion of each topic in each document of the collection. Various techniques exist. For example, Latent Dirichlet Allocation (LDA), one of the most popular techniques, views a document as a mixture of topics and a topic as a collection of words.
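For readers who want to see the mechanics, the following is a compact sketch of LDA fitted with collapsed Gibbs sampling, one of several possible inference methods. The hyperparameters `alpha` and `beta`, the number of iterations and the toy documents are assumptions made for the example; production code would normally rely on an established library.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA with collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start by assigning a random topic to every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample: (how much the document uses topic k)
                # times (how much topic k uses word w).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Topic proportions per document and top words per topic.
    proportions = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [w for w, c in topic_word[k].most_common(3) if c > 0]
        for k in range(n_topics)
    ]
    return proportions, top_words

# Toy corpus, invented for illustration.
docs = [
    "cat dog pet cat".split(),
    "dog pet vet".split(),
    "stock market trade".split(),
    "market trade stock bank".split(),
]
proportions, top_words = lda_gibbs(docs)
```

The two return values correspond to the outputs described above: `top_words` lists the words that define each topic, and `proportions` gives the share of each topic in each document. Because sampling is random, the topics found on such a small corpus can vary with the seed.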
Text generation
Text generation, more formally known as natural language generation (NLG), produces text similar to that written by humans. Such models can be fine-tuned to produce text of different genres and formats, including tweets, blogs and even computer code. Text generation has been performed using Markov processes, LSTMs, BERT, GPT-3, GPT-4, LaMDA and other approaches. Its use has expanded greatly in recent years, but the two main applications are:
- Autocomplete: in this case, the model predicts the next word after those typed in by the user. For example, Google uses auto-completion to predict search queries. One of the most famous autocomplete models is OpenAI’s GPT, which has been used to write articles, song lyrics and much more.
- Chatbot: these systems automate one part of the conversation, while a human interlocutor provides the other part of the conversation. They can be divided into the following two categories:
  - Querying the database: starting from a database of questions and answers, the human user queries it using natural language.
  - Conversation generation: these chatbots can simulate a dialogue with a human interlocutor. Some are able to initiate wide-ranging conversations. A high-profile example is Google’s LaMDA, which provided such human-like answers to questions that one of its developers was convinced it had feelings.
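The autocomplete case can be sketched with the simplest of the approaches mentioned above, a Markov process over words (a bigram model). The search queries and function names below are invented for the example, and modern systems use far larger neural models, but the principle of predicting the next word from what has been typed is the same.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for every word, which words follow it in the corpus."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def autocomplete(prefix, following, k=2):
    """Suggest up to k likely next words after the last word of the prefix."""
    words = prefix.lower().split()
    if not words:
        return []
    return [w for w, _ in following[words[-1]].most_common(k)]

# Invented search queries used as training data.
queries = [
    "how to cook rice",
    "how to cook pasta",
    "how to cook rice quickly",
    "how to learn python",
]
bigram_model = train_bigrams(queries)
print(autocomplete("how to", bigram_model, k=1))       # ['cook']
print(autocomplete("how to cook", bigram_model, k=2))  # ['rice', 'pasta']
```

Because the model only looks at the last word typed, its suggestions ignore the wider context, which is one reason longer-range models replaced it.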
Information retrieval
Information retrieval finds the documents most relevant to a query. This is a problem that every search and recommendation system has to face. The objective is not to answer a particular query, but to retrieve the most relevant documents for it from a collection that may be very large. Information retrieval systems mainly perform two processes: indexing and comparison. One classic tool in this field is Elasticsearch, which, however, does not use natural language models. In more complex modern systems, indexing is performed with a vector space model via Two-Tower Networks, while comparison is performed, as in Elasticsearch, using similarity or distance scores. Google has recently supplemented its search function with a multimodal information retrieval model that works with text, image and video data.
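The two processes, indexing and comparison, can be sketched in a few lines. This example is a deliberately simple stand-in: it represents documents as sparse term-count vectors rather than the learned embeddings of a Two-Tower Network, and it compares them with cosine similarity. The documents and function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Indexing: build a term-count vector per document and an inverted index."""
    vectors = [Counter(doc.lower().split()) for doc in docs]
    inverted = defaultdict(set)
    for doc_id, vec in enumerate(vectors):
        for term in vec:
            inverted[term].add(doc_id)
    return vectors, inverted

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, vectors, inverted, k=2):
    """Comparison: score candidate documents against the query, return top k ids."""
    q = Counter(query.lower().split())
    # Only documents sharing at least one term with the query are scored.
    candidates = set()
    for term in q:
        candidates |= inverted.get(term, set())
    ranked = sorted(candidates, key=lambda d: cosine(q, vectors[d]), reverse=True)
    return ranked[:k]

# Invented document collection.
documents = [
    "python tutorial for beginners",
    "advanced python tricks",
    "cooking pasta for beginners",
]
vectors, inverted = build_index(documents)
print(search("python beginners", vectors, inverted))  # [0, 1]
```

The inverted index keeps the comparison step cheap, since documents that share no term with the query are never scored. A learned embedding model would replace the count vectors but keep the same index-then-compare structure.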
Summarization
The objective of this task is to shorten a text or a collection of documents to highlight the most relevant information. Researchers at Salesforce have developed a summariser that also evaluates the consistency of facts to ensure the accuracy of the result. Summarization is divided into two classes of methods:
- Extractive summarization focuses on extracting the most important sentences from a long text and combining them to form a summary. Typically, an extractive summariser scores each sentence of an input text and then selects the most relevant sentences to form the summary.
- Abstractive summarization, on the other hand, produces a summary by paraphrasing the original text. It resembles a summary written by a human being, in that it includes words and phrases not present in the original text. This type of summarization is usually modeled as a sequence-to-sequence task, in which the input is a long text and the output is a summary.
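The extractive approach, scoring each sentence and keeping the best ones, can be sketched as follows. The scoring rule used here (average frequency of a sentence's words in the whole text) is a naive assumption chosen for brevity; it ignores stop words and is not the method of any particular system. The sample text is invented.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Score sentences by average word frequency; keep the top ones in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Restore document order so the summary reads naturally.
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)

sample = "Solar panels convert sunlight into power. Solar power is cheap. Cats sleep a lot."
print(extractive_summary(sample, 1))  # Solar power is cheap.
```

The sentence about cats is dropped because its words appear nowhere else in the text. An abstractive summariser, by contrast, would need a trained sequence-to-sequence model and cannot be reduced to a scoring rule like this.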
Question answering
In this case, the model must answer questions posed by humans in natural language. One of the most prominent examples of question answering was IBM’s Watson, which in 2011 competed on the TV game show Jeopardy! against human champions, winning by considerable margins. In general, question-answering tasks are of two types:
- Multiple choice: The multiple choice question problem consists of a question and a series of possible answers. The learning task is to choose the correct answer.
- Open domain: In open domain question answering, the model provides answers to natural language questions without any options, often querying a large number of texts.