Large language models (LLMs) and knowledge graphs (KGs) are different ways to provide more people with access to data. KGs use semantics to connect datasets through their meaning, that is, the entities they represent. LLMs use vectors and deep neural networks to predict natural language. Both often aim to “unlock” data. For companies implementing KGs, the end goal is usually something like a data marketplace, semantic data analysis, and/or increased data centricity in the enterprise. These are different solutions with the same end goal: to make more data available to the right people faster. For companies implementing an LLM or other similar generative artificial intelligence solution, the goal is often similar: to provide end users (employees or customers) with a “digital assistant” that can deliver the right information more quickly. The potential symbiosis is evident: some of the major weaknesses of LLMs, namely being black-box models and having difficulty with factual knowledge, are some of the major strengths of KGs. KGs are essentially collections of facts and are fully interpretable. But how can and should KGs and LLMs be implemented together in a company?
Suppose we need to write a cover letter for a new job search. If we use ChatGPT or another LLM, the result will be a well-structured document focused on the specific job description, provided we explicitly include our existing cover letter and the job description in the prompt. However, the document will have some significant problems. It may include work experience we have never had or training courses we have never attended.
This example illustrates the strengths and weaknesses of LLMs and why KGs are an important part of their implementation. Moreover, this use case is not very different from what many large companies currently use LLMs for: automatic report generation.
In this article we are going to analyze how LLMs can help us build KGs correctly and efficiently.
LLMs to support KG generation and curation
LLMs are valuable tools for KG creation. One way to take advantage of LLM technology in the KG curation process is to load your KG into a vector database. A vector database is a database built to store vectors, that is, lists of numbers. Vectorization is one of the main, if not the main, technological components of language models. Trained on very large amounts of data, these models learn to associate words with vectors. The vectors capture semantic and syntactic information about each word based on its context in the training data. Using an embedding service trained on data at this scale, we can exploit the same semantic and syntactic information in our KG.
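To make the idea of a vector database concrete, here is a minimal in-memory sketch. The class name `TinyVectorStore` and the three-dimensional vectors are invented for illustration; a real embedding service would return vectors with hundreds or thousands of dimensions, and a real vector database would use an index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Hypothetical toy store: keeps vectors in a dict and scans them all."""

    def __init__(self):
        self._items = {}

    def add(self, key, vector):
        self._items[key] = list(vector)

    def nearest(self, query, k=1):
        # Rank every stored key by similarity to the query vector.
        ranked = sorted(
            self._items.items(),
            key=lambda kv: cosine(query, kv[1]),
            reverse=True,
        )
        return [key for key, _ in ranked[:k]]


# Made-up vectors standing in for embeddings of KG entities.
store = TinyVectorStore()
store.add("city", [0.9, 0.1, 0.0])
store.add("drug", [0.0, 0.2, 0.9])
store.add("country", [0.8, 0.3, 0.1])

print(store.nearest([1.0, 0.2, 0.0], k=2))
```

The query vector points in nearly the same direction as the "city" and "country" vectors, so those two keys are returned first. Closeness in vector space is what stands in for closeness in meaning.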
KG vectorization is, of course, by no means the only way to use LLM technology in KG curation and construction. Nor are these applications new to KG creation: Natural Language Processing (NLP) techniques have been used for decades for entity extraction, for example, and LLMs are simply a new capability that assists the ontologist or taxonomist in this work.
Some of the ways in which LLMs can assist in the KG creation process are as follows.
Entity resolution
Entity resolution is the process of aligning records that refer to the same real-world entity. For example, acetaminophen, a common painkiller used in the United States and sold under the trade name Tylenol, is called paracetamol in Italy and sold under the trade name Tachipirina. These four names do not look alike at all, but if the KG were embedded in a vector database, their vectors would lie close together, indicating that these entities are closely related.
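One possible way to turn "closely related vectors" into resolved entities is to link every pair of labels whose similarity exceeds a threshold and then read off the connected groups. The sketch below does this with a small union-find structure. The vectors are hand-made placeholders, not output from any real embedding model, and the function name `resolve_entities` and the 0.9 threshold are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve_entities(vectors, threshold=0.9):
    """Group labels whose vectors are at least `threshold` similar."""
    parent = {name: name for name in vectors}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in vectors:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Placeholder vectors (invented): the four names for the same drug are
# placed close together, a different drug is placed elsewhere.
toy_vectors = {
    "acetaminophen": [0.90, 0.10, 0.05],
    "Tylenol": [0.85, 0.15, 0.10],
    "paracetamol": [0.88, 0.12, 0.04],
    "Tachipirina": [0.86, 0.14, 0.08],
    "ibuprofen": [0.10, 0.90, 0.20],
}

clusters = resolve_entities(toy_vectors)
for cluster in clusters:
    print(sorted(cluster))
```

With real embeddings, related but distinct entities (two different painkillers, say) can also score high, so the threshold would need tuning and the proposed merges would still benefit from review by an ontologist.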
Tagging unstructured data
Suppose we want to incorporate some unstructured data into our KG. We have a collection of PDFs with vague file names, but we know they contain important information. We need to label each document with its document type and topic. If the topic taxonomy and the document type taxonomy have already been embedded in the vector database, we simply vectorize the documents and the vector database identifies the most relevant entities in each taxonomy.
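The tagging step can be sketched as "embed the document, embed each taxonomy entry, pick the closest entry per taxonomy". To keep the example runnable without any external service, the `embed` function below is a plain bag-of-words count, which is only a crude stand-in for an LLM embedding. The two taxonomies, their descriptions, and the sample text are all invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase word counts (not a real LLM embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    # Sparse cosine similarity over the words the two texts share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def tag(document, taxonomy):
    """Return the taxonomy label whose description is closest to the document."""
    doc_vec = embed(document)
    return max(taxonomy, key=lambda label: cosine(doc_vec, embed(taxonomy[label])))


# Hypothetical taxonomies: label -> short description used for embedding.
topics = {
    "Finance": "revenue budget invoice payment cost",
    "Human Resources": "employee hiring salary leave benefits",
}
doc_types = {
    "Invoice": "invoice amount due payment terms",
    "Policy": "policy rules employees must comply",
}

# Hypothetical text extracted from a PDF with an uninformative file name.
text = "Invoice 2023: amount due for consulting, payment terms 30 days."

print(tag(text, topics), "/", tag(text, doc_types))
```

Swapping `embed` for a call to an actual embedding service would let the same `tag` logic match documents to labels by meaning rather than by shared words.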
Extraction of entities and classes
Here the aim is to create or enhance a controlled vocabulary, such as an ontology or taxonomy, from a corpus of unstructured data. Entity extraction is similar to tagging, but the goal is to improve the ontology rather than to incorporate the unstructured data into the KG. Suppose we have a geographic ontology and want to populate it with instances of countries, cities, states, etc. We can use an LLM to extract entities from a text corpus to populate the ontology. Similarly, we can use the LLM to extract classes and relationships between classes from the corpus. Suppose we forgot to include “capital” in our ontology. The LLM might be able to propose it as a new class or as a property of a city.
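A rough sketch of this workflow follows, under heavy assumptions. The function `fake_llm` is a stub that returns a canned JSON answer in place of a real model call, and the JSON shape, the prompt wording, the `populate` helper, and the property name `capitalOf` are all hypothetical. The point is only to show how extracted instances could be added to known classes while unknown classes or properties are set aside as suggestions for the ontologist.

```python
import json


def fake_llm(prompt):
    """Stub standing in for a real LLM call; always returns the same JSON."""
    return json.dumps(
        {
            "instances": [
                {"name": "Rome", "class": "City"},
                {"name": "Italy", "class": "Country"},
            ],
            "relations": [
                {"subject": "Rome", "property": "capitalOf", "object": "Italy"}
            ],
        }
    )


def populate(ontology, text, llm=fake_llm):
    """Add extracted instances to known classes; return proposed additions."""
    prompt = f"Extract entities and relations as JSON from: {text}"
    data = json.loads(llm(prompt))

    proposed = []
    for item in data.get("instances", []):
        cls = item["class"]
        if cls in ontology["classes"]:
            ontology["instances"].setdefault(cls, set()).add(item["name"])
        else:
            proposed.append(("class", cls))  # unknown class: suggest, do not add
    for rel in data.get("relations", []):
        if rel["property"] not in ontology["properties"]:
            proposed.append(("property", rel["property"]))
    return proposed


# A small geographic ontology that is missing the notion of a capital.
ontology = {
    "classes": {"City", "Country"},
    "properties": {"locatedIn"},
    "instances": {},
}

suggestions = populate(ontology, "Rome is the capital of Italy.")
print(ontology["instances"])
print(suggestions)
```

Keeping the suggestions separate from the ontology itself leaves the decision to accept a new class or property with a human curator, which matters given that LLM output can contain errors.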