Elasticsearch: use of match queries

Elasticsearch offers a very good tool for textual queries. In this article we will begin to understand how to query textual fields using match queries. The various types of queries will allow us to refine the searches for our future projects.

In previous articles (ELK Stack: what it is and what it is used for, What is Kibana used for? and Kibana: build your own dashboard) we looked at some of the features of the Elasticsearch stack, in particular how to visualize the data we have available. In this series of articles we will instead see how to query the data through the API, so that we can build our own applications on top of this powerful search engine.

In particular, in this article, we are going to look at some examples of match-based queries. Below are the main examples of queries covered in the guide for quick reference:

| Category | Type | Match criteria | Query | Match | No match |
|---|---|---|---|---|---|
| match | full-text | Matches if any of the search keywords is present in the field (the search keywords are analyzed too) | "search better" | 1. can I search for better results; 2. search better please; 3. you know, for SEARCH; 4. there is a better place out there | 1. sear for the box; 2. I won the bet; 3. there are some things; 4. some people are good at everything |
| multi_match | full-text | Applies the match query to multiple fields | key1: "search", key2: "better" | Matches if key1 contains the word "search" OR key2 contains the word "better" | N/A |
| match_phrase | full-text | Matches the exact phrase, with the terms in the same order | "search better" | let me search better | 1. can I search for better results; 2. this is for search betterment |
| match_phrase_prefix | full-text | Matches the exact phrase in order, but the last term is matched as a prefix | "search better" | 1. let me search better; 2. this is for search betterment | can I search for better results |

Creating the working environment

In these tutorials, we will use the latest available version of Elasticsearch, namely 8.3. To make it easier to set up the working environment, you can use the Docker project found here. It installs a full ELK stack on your machine that you can also use for other projects.

Once we have downloaded the Docker project, built it and started it, we need to import the data we will use in this tutorial. To do so, connect to Kibana in a browser at http://localhost:5601/. It will prompt you for credentials, which are the default ones:

  • username: elastic
  • password: changeme

Now open the Console found in the Management > Dev Tools menu. We will use this tool to run queries.

The data we will use is available here. Before loading it, we create the employees index from the console with the following command:

PUT employees 

To handle dates appropriately, we define a mapping for the date_of_birth field with the following command.

PUT employees/_mapping
{
  "properties": {
    "date_of_birth": {
      "type": "date",
      "format": "dd/MM/yyyy"
    }
  }
}
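To make the "dd/MM/yyyy" format more concrete, here is a minimal Python sketch of the same day/month/year layout. It only illustrates which strings the pattern is meant to accept; the function name parse_date_of_birth is hypothetical, and Python's strptime is not the parser Elasticsearch uses, so edge cases may differ.

```python
from datetime import datetime

# Python counterpart of the "dd/MM/yyyy" layout used in the mapping
# (an approximation for illustration, not Elasticsearch's own date parser).
DAY_MONTH_YEAR = "%d/%m/%Y"

def parse_date_of_birth(value):
    """Parse a day/month/year string; raise ValueError for any other layout."""
    return datetime.strptime(value, DAY_MONTH_YEAR).date()

print(parse_date_of_birth("22/07/1987"))  # prints 1987-07-22

try:
    parse_date_of_birth("1987-07-22")
except ValueError:
    print("rejected: not in day/month/year layout")
```

A value in another layout, such as an ISO date, does not fit the pattern, which is why the format has to be declared before the data is imported.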
 

At this point we can import the data. We open a shell and type the following command.

curl --user elastic:changeme -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/employees/_bulk' --data-binary @employees.json 

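Doing so will load the data and create mappings for the fields that we have not defined previously. Specifically, with the default dynamic mapping, string fields are indexed as text (using the standard analyzer) with an additional keyword sub-field, while numeric fields are mapped according to their JSON type: whole numbers such as salary become long, and decimal numbers become float.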
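To get a feel for what analysis does to a text field, the following Python sketch approximates the standard analyzer by splitting on anything that is not a letter or digit and lowercasing the result. The function simple_analyze is a hypothetical stand-in: the real analyzer follows Unicode word-boundary rules and will differ on some inputs. The actual tokens can be inspected with the _analyze API.

```python
import re

def simple_analyze(text):
    """Rough approximation of the standard analyzer:
    split on non-alphanumeric characters, then lowercase each token."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

print(simple_analyze("Grass-roots heuristic help-desk"))
# ['grass', 'roots', 'heuristic', 'help', 'desk']
```

Note how the hyphenated words are split into separate tokens, which is why a search for "roots" or "help" can match this phrase later on.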

You can verify the mapping with the following command.

GET employees/_mapping 

Now that we have an index with documents and a specified mapping, we are ready to start with queries.

Match Query

The “match” query is one of the most basic and commonly used queries in Elasticsearch and is the standard query for full-text search. We can use this query to search for text, numbers, or Boolean values.

Let’s look for the word “heuristic” contained in the “phrase” field of the documents we previously uploaded.

POST employees/_search
{
  "query": {
    "match": {
      "phrase": {
        "query" : "heuristic"
      }
    }
  }
} 

Of the 4 documents in our index, only 2 contain the word “heuristic” in the “phrase” field:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.6785375,
    "hits": [
      {
        "_index": "employees",
        "_id": "2",
        "_score": 0.6785375,
        "_source": {
          "id": 2,
          "name": "Othilia Cathel",
          "email": "[email protected]",
          "gender": "female",
          "ip_address": "3.164.153.228",
          "date_of_birth": "22/07/1987",
          "company": "Edgepulse",
          "position": "Structural Engineer",
          "experience": 11,
          "country": "China",
          "phrase": "Grass-roots heuristic help-desk",
          "salary": 193530
        }
      },
      {
        "_index": "employees",
        "_id": "4",
        "_score": 0.62577873,
        "_source": {
          "id": 4,
          "name": "Alan Thomas",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "200.47.210.95",
          "date_of_birth": "11/12/1985",
          "company": "Yamaha",
          "position": "Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Emulation of roots heuristic coherent systems",
          "salary": 300000
        }
      }
    ]
  }
}
 

What if we want to search for more than one word? Using the same query we just executed, we search for “heuristic roots help”:

POST employees/_search
{
  "query": {
    "match": {
      "phrase": {
        "query" : "heuristic roots help"
      }
    }
  }
}
 

This returns the same documents as before because, by default, Elasticsearch combines the words in the search query with an OR operator. In our case, the query will match any document containing “heuristic” OR “roots” OR “help”.
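The default OR behavior can be sketched in a few lines of Python. This is only a model of the matching logic, not of scoring: analyze is a simplified, hypothetical stand-in for the standard analyzer, and the dictionary contains only the phrases that appear in the responses shown in this article.

```python
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def match_or(field_value, query):
    """True if the field shares at least one token with the query (OR semantics)."""
    return bool(set(analyze(query)) & set(analyze(field_value)))

# Phrases taken from the responses shown in this article, keyed by document id.
phrases = {
    1: "Multi-channelled coherent leverage",
    2: "Grass-roots heuristic help-desk",
    4: "Emulation of roots heuristic coherent systems",
}

hits = [doc_id for doc_id, phrase in phrases.items()
        if match_or(phrase, "heuristic roots help")]
print(hits)  # [2, 4]
```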

Changing the operator parameter

The default behavior of the OR operator applied to multiword searches can be modified using the “operator” parameter passed along with the “match” query.

We can specify the operator parameter with the values “OR” or “AND”.

Let’s see what happens when we supply the “AND” operator parameter in the same query executed earlier.

POST employees/_search
{
  "query": {
    "match": {
      "phrase": {
        "query" : "heuristic roots help",
        "operator" : "AND"
      }
    }
  }
}
 

Now the results will return only one document (document id=2), since it is the only document that contains all three search keywords in the “phrase” field.
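As a rough illustration of the AND operator, the sketch below requires every query token to be present in the field. The helper names are hypothetical, analyze only approximates the standard analyzer, and the phrases are the ones visible in the responses in this article.

```python
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def match_and(field_value, query):
    """True only if every query token appears in the field (AND semantics)."""
    return set(analyze(query)) <= set(analyze(field_value))

phrases = {
    2: "Grass-roots heuristic help-desk",
    4: "Emulation of roots heuristic coherent systems",
}

hits = [doc_id for doc_id, phrase in phrases.items()
        if match_and(phrase, "heuristic roots help")]
print(hits)  # [2]
```

Document 4 drops out because its phrase contains "heuristic" and "roots" but not "help".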

minimum_should_match

To take this a step further, we can set a threshold for the minimum number of matching words that a document must contain. For example, if we set this parameter to 1, the query will match all documents with at least one matching word.

If, on the other hand, we set the “minimum_should_match” parameter to 3, then all three words must appear in the document for it to be considered a match.

In our case, the following query would return only 1 document (with id=2), since it is the only one that matches our criteria.

POST employees/_search
{
  "query": {
    "match": {
      "phrase": {
        "query" : "heuristic roots help",
        "minimum_should_match": 3
      }
    }
  }
}
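The threshold logic can be sketched as a simple count of matching query terms. This is an illustration under simplifying assumptions: only integer thresholds are modelled (the real parameter also accepts other forms, such as percentages), analyze approximates the standard analyzer, and the function names are hypothetical.

```python
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def matches(field_value, query, minimum_should_match=1):
    """True if at least `minimum_should_match` distinct query terms occur in the field."""
    field_tokens = set(analyze(field_value))
    matched = sum(1 for term in set(analyze(query)) if term in field_tokens)
    return matched >= minimum_should_match

phrases = {
    2: "Grass-roots heuristic help-desk",
    4: "Emulation of roots heuristic coherent systems",
}

def matching_ids(threshold):
    return [doc_id for doc_id, phrase in phrases.items()
            if matches(phrase, "heuristic roots help", threshold)]

print(matching_ids(1))  # [2, 4]
print(matching_ids(2))  # [2, 4]
print(matching_ids(3))  # [2]
```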
 

Multi-Match Query

So far we have dealt with matches over a single field, that is, we have searched for keywords within a single field called “phrase.”

But what if we needed to search for keywords in multiple fields of a document? This is where the multi-match query comes in.

Let’s try an example that searches for the keywords “research help” in the “position” and “phrase” fields.

POST employees/_search
{
  "query": {
    "multi_match": {
      "query": "research help",
      "fields": ["position", "phrase"]
    }
  }
}

The response will be as follows:

{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.2613049,
    "hits": [
      {
        "_index": "employees",
        "_id": "1",
        "_score": 1.2613049,
        "_source": {
          "id": 1,
          "name": "Huntlee Dargavel",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "58.11.89.193",
          "date_of_birth": "11/09/1990",
          "company": "Talane",
          "position": "Research Associate",
          "experience": 7,
          "country": "China",
          "phrase": "Multi-channelled coherent leverage",
          "salary": 180025
        }
      },
      {
        "_index": "employees",
        "_id": "2",
        "_score": 1.1785964,
        "_source": {
          "id": 2,
          "name": "Othilia Cathel",
          "email": "[email protected]",
          "gender": "female",
          "ip_address": "3.164.153.228",
          "date_of_birth": "22/07/1987",
          "company": "Edgepulse",
          "position": "Structural Engineer",
          "experience": 11,
          "country": "China",
          "phrase": "Grass-roots heuristic help-desk",
          "salary": 193530
        }
      }
    ]
  }
} 
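To see why these two documents are returned, here is a small Python sketch of the matching side of multi_match: a document matches if any of the listed fields contains any of the query terms. Scoring is deliberately left out, analyze is a simplified stand-in for the standard analyzer, and the function names are hypothetical. The field values come from the responses shown in this article.

```python
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def multi_match(doc, query, fields):
    """True if any listed field shares at least one token with the query."""
    query_tokens = set(analyze(query))
    return any(query_tokens & set(analyze(doc.get(field, ""))) for field in fields)

docs = [
    {"id": 1, "position": "Research Associate", "phrase": "Multi-channelled coherent leverage"},
    {"id": 2, "position": "Structural Engineer", "phrase": "Grass-roots heuristic help-desk"},
    {"id": 4, "position": "Resources Manager", "phrase": "Emulation of roots heuristic coherent systems"},
]

hits = [d["id"] for d in docs if multi_match(d, "research help", ["position", "phrase"])]
print(hits)  # [1, 2]
```

Document 1 matches through "position" (the token "research"), document 2 through "phrase" (the token "help"), and document 4 matches neither field, since "resources" is a different token from "research".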

Match Phrase

Match_phrase is another commonly used query that, as its name indicates, looks for phrases in a field.

If we were to look for the phrase “roots heuristic coherent” in the “phrase” field of the index, we could use the following query:

GET employees/_search
{
  "query": {
    "match_phrase": {
      "phrase": {
        "query": "roots heuristic coherent"
      }
    }
  }
} 

This will return documents that contain the exact phrase “roots heuristic coherent”, with the words in the same order. In our case, only one result matches these criteria, as shown in the response below.

{
  "took": 32,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.877336,
    "hits": [
      {
        "_index": "employees",
        "_id": "4",
        "_score": 1.877336,
        "_source": {
          "id": 4,
          "name": "Alan Thomas",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "200.47.210.95",
          "date_of_birth": "11/12/1985",
          "company": "Yamaha",
          "position": "Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Emulation of roots heuristic coherent systems",
          "salary": 300000
        }
      }
    ]
  }
} 
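The idea behind phrase matching can be sketched as a search for the query tokens at consecutive positions in the field. This is a simplified model: analyze only approximates the standard analyzer and match_phrase is a hypothetical helper, not the engine's implementation.

```python
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def match_phrase(field_value, query):
    """True if the query tokens occur consecutively, in order, in the field."""
    field_tokens = analyze(field_value)
    query_tokens = analyze(query)
    if not query_tokens:
        return False
    n = len(query_tokens)
    return any(field_tokens[i:i + n] == query_tokens
               for i in range(len(field_tokens) - n + 1))

doc4 = "Emulation of roots heuristic coherent systems"
doc2 = "Grass-roots heuristic help-desk"

print(match_phrase(doc4, "roots heuristic coherent"))  # True
print(match_phrase(doc2, "roots heuristic coherent"))  # False
print(match_phrase(doc4, "heuristic roots"))           # False (wrong order)
```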

Slop parameter

One useful feature we can use in the match_phrase query is the “slop” parameter, which allows us to create more flexible searches.

Suppose we search for “roots coherent” with the match_phrase query. We would not receive any documents from the employees index, because for match_phrase to match, the terms must appear next to each other and in the same order as in the query.

Now let’s use the slop parameter and see what happens:

GET employees/_search
{
  "query": {
    "match_phrase": {
      "phrase": {
        "query": "roots coherent",
        "slop": 1
      }
    }
  }
} 

With slop=1, the query allows a term to be one position away from where the phrase expects it, so we receive the following response. In it, you can see that “roots coherent” matches the document containing “roots heuristic coherent”, because a slop of 1 allows one intervening term (“heuristic”) to be skipped.

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.7873249,
    "hits": [
      {
        "_index": "employees",
        "_id": "4",
        "_score": 0.7873249,
        "_source": {
          "id": 4,
          "name": "Alan Thomas",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "200.47.210.95",
          "date_of_birth": "11/12/1985",
          "company": "Yamaha",
          "position": "Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Emulation of roots heuristic coherent systems",
          "salary": 300000
        }
      }
    ]
  }
} 
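One way to picture slop is as a bound on how far the terms may drift from the positions the phrase expects. The sketch below implements that idea: each candidate position is shifted by the term's offset in the query, and the phrase matches when the spread of the shifted positions is at most slop. This is a simplified model with hypothetical helper names; it ignores repeated query terms and should not be read as the engine's exact algorithm.

```python
import re
from itertools import product

def analyze(text):
    """Simplified stand-in for the standard analyzer."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def match_phrase_slop(field_value, query, slop=0):
    """Simplified sloppy phrase match based on position displacement."""
    field_tokens = analyze(field_value)
    query_tokens = analyze(query)
    if not query_tokens:
        return False
    # For each query term, collect every position where it occurs in the field.
    positions = [[i for i, tok in enumerate(field_tokens) if tok == term]
                 for term in query_tokens]
    if any(not found for found in positions):
        return False
    for combo in product(*positions):
        # Subtract each term's offset in the query from its field position.
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False

doc4 = "Emulation of roots heuristic coherent systems"
print(match_phrase_slop(doc4, "roots coherent", slop=0))  # False
print(match_phrase_slop(doc4, "roots coherent", slop=1))  # True
```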

Match Phrase Prefix

The match_phrase_prefix query is similar to the match_phrase query, but in this case the last term of the search keyword is considered as a prefix and is used to match any term beginning with that prefix.

First, we insert a document into our index to better understand the match_phrase_prefix query:

PUT employees/_doc/5
{
  "id": 5,
  "name": "Jennifer Lawrence",
  "email": "[email protected]",
  "gender": "female",
  "ip_address": "100.37.110.59",
  "date_of_birth": "17/05/1995",
  "company": "Monsnto",
  "position": "Resources Manager",
  "experience": 10,
  "country": "Germany",
  "phrase": "Emulation of roots heuristic complete systems",
  "salary": 300000
} 

Now we apply the match_phrase_prefix query:

GET employees/_search
{
  "_source": ["phrase"],
  "query": {
    "match_phrase_prefix": {
      "phrase": {
        "query": "roots heuristic co"
      }
    }
  }
} 

In the results below, we can see that the documents containing “coherent” and “complete” both match the query. We can also use the slop parameter in the match_phrase_prefix query.

{
  "took": 72,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 3.0871696,
    "hits": [
      {
        "_index": "employees",
        "_id": "4",
        "_score": 3.0871696,
        "_source": {
          "phrase": "Emulation of roots heuristic coherent systems"
        }
      },
      {
        "_index": "employees",
        "_id": "5",
        "_score": 3.0871696,
        "_source": {
          "phrase": "Emulation of roots heuristic complete systems"
        }
      }
    ]
  }
} 

Note: match_phrase_prefix tries to match 50 expansions (by default) of the last supplied keyword (“co” in our example). This value can be increased or decreased by specifying the “max_expansions” parameter.

Because of this prefix behavior and the ease of setting up a match_phrase_prefix query, it is often used for autocomplete features.
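The combination of prefix expansion and phrase matching can be sketched as follows. The last query token is expanded to the indexed terms that start with it, capped at max_expansions, and the remaining tokens must precede it as an exact phrase. This is a simplified model: the helper names are hypothetical, analyze approximates the standard analyzer, and taking expansions in alphabetical order is an assumption made for the example.

```python
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def match_phrase_prefix(docs, query, max_expansions=50):
    """Return ids of docs containing the phrase, with the last term as a prefix."""
    query_tokens = analyze(query)
    if not query_tokens:
        return []
    *head, prefix = query_tokens
    # Expand the prefix against all indexed terms (alphabetical order assumed).
    vocabulary = sorted({tok for text in docs.values() for tok in analyze(text)})
    expansions = [tok for tok in vocabulary if tok.startswith(prefix)][:max_expansions]
    n = len(query_tokens)
    hits = []
    for doc_id, text in docs.items():
        tokens = analyze(text)
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if window[:-1] == head and window[-1] in expansions:
                hits.append(doc_id)
                break
    return hits

docs = {
    1: "Multi-channelled coherent leverage",
    2: "Grass-roots heuristic help-desk",
    4: "Emulation of roots heuristic coherent systems",
    5: "Emulation of roots heuristic complete systems",
}

print(match_phrase_prefix(docs, "roots heuristic co"))                    # [4, 5]
print(match_phrase_prefix(docs, "roots heuristic co", max_expansions=1))  # [4]
```

With only one expansion allowed, "co" is expanded to "coherent" alone in this sketch, so the document containing "complete" is no longer found. This is the kind of trade-off to keep in mind when tuning max_expansions for autocomplete.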

Now let’s delete the document we just added with id=5.

DELETE employees/_doc/5 
