Artificial intelligence has made huge strides in recent years. It makes it possible to analyze large amounts of data and to build models capable of performing even very complex tasks. One particularly fascinating branch is Natural Language Generation: the ability to generate text from a few words of input.
In this article we will see how to generate articles on Elasticsearch, whose features we have already discussed in the articles ELK Stack: what it is and what it is used for, What is Kibana used for?, and Kibana: let’s explore data, and on Python, starting from a simple question and a few lines of Python code. In particular, we will use the pre-trained GPT-2 Large model and HuggingFace Transformers. First, however, let’s give a brief overview of the technologies we will be using.
Announced in 2019 by OpenAI, GPT-2 is the successor to GPT and was introduced in the paper Language Models are Unsupervised Multitask Learners. The language model is a transformer with 1.5 billion parameters, trained on a dataset of 8 million web pages. The training objective of GPT-2 is very simple: predict the next word given all the previous words in a text. The heterogeneity of the training dataset means that the resulting model can be adapted to different application domains. Compared to its predecessor, GPT-2 is a direct scale-up, with more than 10 times the parameters and trained on more than 10 times the amount of data.
Several versions of GPT-2 have been released, as shown in the figure. They differ in size: small (124M parameters), medium (355M parameters), large (774M parameters), and extra large (1.5B parameters, i.e., the full implementation).
In May 2020, OpenAI announced GPT-3. Unlike its predecessors, this model can only be used through paid APIs, which makes it less usable as a starting point for creating new models.
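The next-word objective can be written compactly. The notation below is ours and is only a sketch of the standard autoregressive formulation, not a formula quoted from the GPT-2 paper: the probability of a text x_1, ..., x_T is factored into next-word probabilities, and training minimizes the negative log-likelihood over the model parameters θ.

```latex
P_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} P_\theta(x_t \mid x_1, \dots, x_{t-1}),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})
```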
If you want more information you can read an interesting article directly from the OpenAI website.
HuggingFace Transformers (formerly known as PyTorch-transformers and PyTorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for natural language understanding (NLU) and natural language generation (NLG) with over 32 pre-trained models in more than 100 languages and deep interoperability between TensorFlow 2.0 and PyTorch.
Basically, Hugging Face Transformers is a Python package that provides ready-made functions, pipelines and pre-trained models that we can use for our natural language processing tasks.
The GPT-2 tokenizer and models are also included in HuggingFace Transformers.
For more information about the available Transformers and their use we refer you to the official documentation.
For this introductory tutorial on text generation with GPT-2 we will use a Jupyter Notebook. As we have seen in the article Jupyter Notebook: user’s guide, it can be installed simply through Anaconda.
To take advantage of HuggingFace Transformers you also need to install PyTorch. To do this, simply go to the Anaconda site and search for the packages you need. For each package, the platforms for which it is released are indicated. For example, if you have a Mac with an M1 chip, you’ll need to look for packages compatible with the osx-arm64 platform.
Clicking on the package will provide us with installation instructions, as you can see below.
To install the packages we need, PyTorch and transformers, you can use the following commands.
conda install -c pytorch pytorch
conda install -c conda-forge transformers
Once the installations are complete, you can launch Anaconda and then a Jupyter Notebook or JupyterLab.
Generating text about Elasticsearch
First, we import GPT2LMHeadModel for text generation and GPT2Tokenizer for text tokenization.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
Now we load the model into the Jupyter notebook.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large', pad_token_id=tokenizer.eos_token_id)
Since we are using the large model, it will take a while to download. Make sure you have more than 3GB of free space before running this command. Otherwise you can use one of the smaller versions of the GPT-2 model.
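As a rough guide for choosing a version, the size of the weights can be estimated from the parameter counts listed earlier. The sketch below assumes 4 bytes per parameter (32-bit floats), which is an assumption on our part; the actual download may differ somewhat. The checkpoint names are the identifiers used on the Hugging Face model hub, and the helper functions are hypothetical names introduced only for this illustration.

```python
# Parameter counts as listed in this article, keyed by Hugging Face checkpoint name.
GPT2_CHECKPOINTS = {
    "gpt2": 124_000_000,          # small
    "gpt2-medium": 355_000_000,   # medium
    "gpt2-large": 774_000_000,    # large
    "gpt2-xl": 1_500_000_000,     # extra large
}

BYTES_PER_PARAM = 4  # assumption: weights stored as 32-bit floats


def estimated_size_gb(checkpoint):
    """Rough size of the weights in GB (1 GB = 10**9 bytes)."""
    return GPT2_CHECKPOINTS[checkpoint] * BYTES_PER_PARAM / 1e9


def largest_checkpoint_that_fits(free_gb):
    """Name of the largest checkpoint whose estimated size fits, or None."""
    fitting = [name for name in GPT2_CHECKPOINTS if estimated_size_gb(name) <= free_gb]
    return max(fitting, key=GPT2_CHECKPOINTS.get) if fitting else None


for name in GPT2_CHECKPOINTS:
    print(name, round(estimated_size_gb(name), 1), "GB")
```

Under this estimate the large model comes to about 3.1 GB, which is consistent with the 3GB figure above, and with only 2 GB free the medium model would be the largest one that fits.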
For text generation, we need to provide some initial text to our model. In order for the input text to be correctly recognized by the model, we need to preprocess it. We then define the text to start with as shown below.
sentence = 'What is Elasticsearch?'
At this point we need to transform it into PyTorch tensors.
input_ids = tokenizer.encode(sentence, return_tensors='pt')
The text is converted into numeric token ids, which you can inspect by printing the tensor. The tokenizer's decode function maps the ids back to text.
print(input_ids)  # tensor([[ 2061, 318, 48567, 12947, 30]])
Finally, we generate the text using the generate function of GPT2LMHeadModel.
output = model.generate(input_ids, max_length=1000, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
Before showing you the result, let’s analyze in detail the various parameters used to generate the text.
- max_length: Maximum length of the generated text, measured in tokens rather than words and including the input text.
- num_beams: Beam search reduces the risk of missing high-probability word sequences that are hidden behind a low-probability word. It keeps the num_beams most likely hypotheses at each time step and finally chooses the hypothesis with the highest overall probability. Beam search typically finds an output sequence with a higher probability than greedy search, although it is not guaranteed to find the most likely output.
- no_repeat_ngram_size: Even with beam search, the output often still contains repetitions of the same word sequences. A simple workaround is to introduce n-gram penalties (an n-gram is a sequence of n words), as introduced by Paulus et al. (2017) and Klein et al. (2017). The penalty sets to 0 the probability of any next word that would create an n-gram already present in the text. With a value of 2, as used here, no sequence of two words appears twice.
- early_stopping: When set to True, generation ends once all beam hypotheses have reached the EOS token.
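The effect of num_beams can be illustrated with a toy example. The sketch below is a simplified illustration and not the implementation used by Transformers: the NEXT_WORD_PROBS table is an invented stand-in for a language model, and beam_search is a hypothetical helper. With a single beam the search is greedy and always follows the most likely next word; with two beams it also keeps the second-best hypothesis and finds a more probable sentence.

```python
import math

# Invented toy "language model": next-word probabilities given the last word.
NEXT_WORD_PROBS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.6, "drives": 0.4},
}


def beam_search(start, steps, num_beams):
    """Return (words, probability) of the best sequence found."""
    beams = [([start], 0.0)]  # each hypothesis: (words, log-probability)
    for _ in range(steps):
        candidates = []
        for words, logp in beams:
            # Extend every hypothesis with every possible next word.
            for nxt, p in NEXT_WORD_PROBS.get(words[-1], {}).items():
                candidates.append((words + [nxt], logp + math.log(p)))
        if not candidates:
            break
        # Keep only the num_beams most likely hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    best_words, best_logp = beams[0]
    return best_words, math.exp(best_logp)


greedy_words, greedy_prob = beam_search("The", steps=2, num_beams=1)
beam_words, beam_prob = beam_search("The", steps=2, num_beams=2)
print(greedy_words, round(greedy_prob, 2))  # ['The', 'nice', 'woman'] 0.2
print(beam_words, round(beam_prob, 2))      # ['The', 'dog', 'has'] 0.36
```

Greedy search commits to "nice" because it is the most likely first word, and so misses the very likely continuation "has" that follows "dog".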
More details about the arguments of the generation function can be found in the official documentation.
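The idea behind no_repeat_ngram_size can also be sketched in a few lines. This is a simplified illustration that works on words instead of token ids and is not the library's implementation; banned_next_words and apply_ngram_block are hypothetical helpers. Given the text generated so far, it finds the next words that would complete an n-gram already seen and sets their probability to 0.

```python
def banned_next_words(words, ngram_size):
    """Words that would complete an n-gram already present in `words`."""
    if ngram_size <= 0 or len(words) < ngram_size:
        return set()
    # The last n-1 words form the beginning of the n-gram about to be created.
    prefix = tuple(words[len(words) - (ngram_size - 1):])
    banned = set()
    for i in range(len(words) - ngram_size + 1):
        ngram = tuple(words[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


def apply_ngram_block(next_probs, words, ngram_size):
    """Set the probability of every banned next word to 0."""
    banned = banned_next_words(words, ngram_size)
    return {w: (0.0 if w in banned else p) for w, p in next_probs.items()}


history = "I like Python and I like".split()
print(banned_next_words(history, 2))  # {'Python'}
print(apply_ngram_block({"Python": 0.7, "Elasticsearch": 0.3}, history, 2))
```

Here the text already contains "like Python", so with an n-gram size of 2 the word "Python" cannot follow "like" a second time.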
The generate function will return the ids of the tokens in our new text. By simply decoding the result we can print our article. The text generation time varies depending on parameters, including the length of the text to be generated, and the power of your machine.
print(tokenizer.decode(output[0], skip_special_tokens=True))
Here’s the final result.
Since this is a technical topic, many commands have been included that are more related to Docker than Elasticsearch. The beginning is quite good, but then it deviates too much from the initial topic. Finally, the text does not conclude.
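The steps of this section can be collected into a single function, so that the question and the generation parameters can be changed in one place. This is only a sketch: generate_text is a name introduced here for illustration, the defaults mirror the values used above, and the model is reloaded on every call, which is simple but slow.

```python
def generate_text(prompt, model_name="gpt2-large", max_length=1000,
                  num_beams=5, no_repeat_ngram_size=2):
    """Generate a continuation of `prompt` with a pre-trained GPT-2 model."""
    # Imported here so that defining the function does not load the library.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(
        model_name, pad_token_id=tokenizer.eos_token_id
    )
    # Convert the prompt into a PyTorch tensor of token ids.
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(
        input_ids,
        max_length=max_length,
        num_beams=num_beams,
        no_repeat_ngram_size=no_repeat_ngram_size,
        early_stopping=True,
    )
    # generate returns a batch of sequences; decode the first (and only) one.
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Calling generate_text('What is Elasticsearch?') would reproduce the steps above, and passing model_name='gpt2' selects the small model when disk space or time is limited.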
Generating text about Python
Let us now try to modify the initial sentence with the following question.
sentence = 'What is Python?'
We lower the maximum length of the generated text to reduce both the runtime and the risk of the text drifting off topic. The result we get with a maximum length of 500 is as follows.
In this case, the topic is very well discussed even if it is a bit short. No one could guess that it was a machine and not a human that wrote it.
As we have seen, text generation using GPT-2 is very simple and opens up a variety of uses in many areas, including chatbots, content generation for websites, and Industry 4.0.