Elasticsearch: compound query

Elasticsearch is a valuable tool for performing both simple and complex searches. In this article we will see how to include multiple conditions in the same query and how to modify the score calculation based on custom functions and data values.

In the articles Elasticsearch: use of match queries and Elasticsearch: use of term queries, we saw how to query both textual fields and structured data stored in an Elasticsearch index. Those queries are relatively simple, yet they let us extract the data of interest and run more refined text searches than a relational database typically allows. In real cases, however, we often need to check several conditions at once. We may also need to change how the score of a query is calculated, or change the behavior of individual queries. Compound queries address exactly these scenarios. In this article we will examine some of the most useful ones.

As in the previous tutorials, we will use the same employee data. We therefore recommend that you carefully follow the instructions given there to install the Elasticsearch stack on your PC via Docker and to import the data correctly.

Below are the main query examples covered in this guide for quick reference:

| Category | Type | Match criteria | Query | Match | No Match |
| --- | --- | --- | --- | --- | --- |
| bool | compound | To apply a combination of queries and logical operators | must, key1:"search"; should, key2:"better"; must_not, key3:"silk" | 1. search will be better; 2. search will be there | 1. search better for silk; 2. search for silk |
| function_score: weight | compound | Assigns higher scores for greater weights | search clause 1 - weight 50; search clause 2 - weight 25 | Documents matching search clause 1 get a higher score than documents matching search clause 2 | N/A |
| function_score: script_score | compound | Modify the score using custom scripts | N/A | N/A | N/A |
| function_score: field_value_factor | compound | Change the score based on a specific field | N/A | N/A | N/A |

Bool Query

The bool query combines multiple queries using boolean logic. For example, if we want to retrieve all documents with the keyword “researcher” in the “position” field and more than 12 years of experience, we need to combine a match query with a range query. This combination can be expressed with the bool query, which supports four types of occurrence:

must: The conditions or queries must match for a document to be considered a hit. These clauses also contribute to the score.
For example: if we put query A and query B in the must section, every document in the result satisfies both queries, i.e. query A AND query B.
should: The conditions/queries should match. When the bool query has no must or filter clause, at least one should clause has to match; otherwise they are optional, and each one that matches raises the score.
Result = queryA OR queryB
filter: Same as the must clause, but the clause does not contribute to the score.
must_not: The specified conditions/queries must not match the documents. These clauses run in filter context, so they do not contribute to the score.

The typical structure of a bool query is as follows:

POST _search
{
  "query": {
    "bool" : {
      "must" : [],
      "filter": [],
      "must_not" : [],
      "should" : []
    }
  }
} 
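When a request like this is assembled in application code, it can help to build the body from the four clause lists. The following is a minimal Python sketch under stated assumptions: the helper name `bool_query` and its parameters are hypothetical (they are not part of any Elasticsearch client), and the function only produces the request body as a dictionary without sending it anywhere.

```python
import json


def bool_query(must=None, filter_=None, must_not=None, should=None):
    """Build a bool query body, leaving out clause types that are empty.

    Hypothetical helper for illustration; it does not talk to Elasticsearch.
    """
    clauses = {
        "must": must or [],
        "filter": filter_ or [],
        "must_not": must_not or [],
        "should": should or [],
    }
    # Keep only the clause types that actually contain queries.
    return {"query": {"bool": {name: qs for name, qs in clauses.items() if qs}}}


# Scored match on "position", unscored range condition on "experience".
body = bool_query(
    must=[{"match": {"position": "manager"}}],
    filter_=[{"range": {"experience": {"gte": 12}}}],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a `POST employees/_search` request.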

Let us now see how to use bool query for different use cases.

Must

In our example, we want to find all employees who have 12 or more years of experience and who also have the word “manager” in the “position” field. We can do this with the following query:

POST employees/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "position": "manager"
          }
        },
        {
          "range": {
            "experience": {
              "gte": 12
            }
          }
        }
      ]
    }
  }
} 

The response to the above query contains the documents that match both queries in the “must” array, and is shown below:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.7261541,
    "hits": [
      {
        "_index": "employees",
        "_id": "4",
        "_score": 1.7261541,
        "_source": {
          "id": 4,
          "name": "Alan Thomas",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "200.47.210.95",
          "date_of_birth": "11/12/1985",
          "company": "Yamaha",
          "position": "Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Emulation of roots heuristic coherent systems",
          "salary": 300000
        }
      },
      {
        "_index": "employees",
        "_id": "3",
        "_score": 1.6099696,
        "_source": {
          "id": 3,
          "name": "Winston Waren",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "202.37.210.94",
          "date_of_birth": "10/11/1985",
          "company": "Yozio",
          "position": "Human Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Versatile object-oriented emulation",
          "salary": 50616
        }
      }
    ]
  }
}
 

Filter

The previous example showed the “must” parameter in the bool query. In the results of the previous example we can see that the results had values in the “_score” field. Let us now use the same query, but this time replace the “must” parameter with “filter.”

POST employees/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "position": "manager"
          }
        },
        {
          "range": {
            "experience": {
              "gte": 12
            }
          }
        }
      ]
    }
  }
} 

The result obtained will be as follows.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0,
    "hits": [
      {
        "_index": "employees",
        "_id": "3",
        "_score": 0,
        "_source": {
          "id": 3,
          "name": "Winston Waren",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "202.37.210.94",
          "date_of_birth": "10/11/1985",
          "company": "Yozio",
          "position": "Human Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Versatile object-oriented emulation",
          "salary": 50616
        }
      },
      {
        "_index": "employees",
        "_id": "4",
        "_score": 0,
        "_source": {
          "id": 4,
          "name": "Alan Thomas",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "200.47.210.95",
          "date_of_birth": "11/12/1985",
          "company": "Yamaha",
          "position": "Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Emulation of roots heuristic coherent systems",
          "salary": 300000
        }
      }
    ]
  }
} 

You may notice that the score is zero for all search results. This is because, in filter context, Elasticsearch does not calculate a score, which makes the search faster.

If you use a must condition with a filter condition, the scores are calculated for the clauses in must, but they are not calculated for the filter side.

Should

Let us now see the effect of the “should” section in the bool query. We add a “should” clause to the “must” query of the first example. This “should” clause matches the documents that contain the text “versatile” in the “phrase” field. The query for this condition looks like the one below:

POST employees/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "position": "manager"
          }
        },
        {
          "range": {
            "experience": {
              "gte": 12
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "phrase": "versatile"
          }
        }
      ]
    }
  }
} 

The results contain the same two documents returned by the “must” example, but the document with id=3, previously ranked last, now comes first. The clause in the “should” array matches that document, so its score increases and it is promoted to the top.

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 2.8970814,
    "hits": [
      {
        "_index": "employees",
        "_id": "3",
        "_score": 2.8970814,
        "_source": {
          "id": 3,
          "name": "Winston Waren",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "202.37.210.94",
          "date_of_birth": "10/11/1985",
          "company": "Yozio",
          "position": "Human Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Versatile object-oriented emulation",
          "salary": 50616
        }
      },
      {
        "_index": "employees",
        "_id": "4",
        "_score": 1.7261541,
        "_source": {
          "id": 4,
          "name": "Alan Thomas",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "200.47.210.95",
          "date_of_birth": "11/12/1985",
          "company": "Yamaha",
          "position": "Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Emulation of roots heuristic coherent systems",
          "salary": 300000
        }
      }
    ]
  }
} 
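The effect of the “should” clause can be read directly from the two responses above. The following Python sketch only redoes that comparison with the scores copied from this article; the variable names are illustrative, and nothing here queries Elasticsearch.

```python
# Scores copied from the "must" response and the "must + should" response.
must_only = {"3": 1.6099696, "4": 1.7261541}
with_should = {"3": 2.8970814, "4": 1.7261541}

# Extra score contributed by the should clause for each document.
should_gain = {doc_id: with_should[doc_id] - must_only[doc_id] for doc_id in must_only}

# Ranking (highest score first) before and after adding the should clause.
rank_before = sorted(must_only, key=must_only.get, reverse=True)
rank_after = sorted(with_should, key=with_should.get, reverse=True)

print(should_gain)
print(rank_before, rank_after)
```

Only the document with id=3 gains score, because it is the only one whose “phrase” contains “versatile”; the document with id=4 keeps exactly the score it had with “must” alone.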

Multiple conditions

A real-world bool query can be more complex than the simple ones given above. What if users want to get the employees who belong to the companies “Yamaha” or “Talane”, who have the title “manager” or “associate”, and who have a salary of at least 100,000?

Written compactly, the condition is:

(company = Yamaha OR company = Talane ) AND
(position = manager OR position = associate ) AND 
(salary>=100000) 

This can be achieved by using multiple bool queries within a single must clause, as shown in the following query:

POST employees/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match": {
                  "company": "Talane"
                }
              },
              {
                "match": {
                  "company": "Yamaha"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "position": "manager"
                }
              },
              {
                "match": {
                  "position": "Associate"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "range": {
                  "salary": {
                    "gte": 100000
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
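To sanity-check which employees such a nested condition should return, the same boolean expression can be evaluated locally. The sketch below is a rough Python model, not a call to Elasticsearch: it assumes the index holds only the four employees that appear in this article's responses, and the `has_term` helper is a hypothetical stand-in for a match query (case-insensitive whole-word lookup).

```python
# The four employee documents shown in this article's responses
# (only the fields used by the condition are kept).
employees = [
    {"id": 1, "company": "Talane", "position": "Research Associate", "salary": 180025},
    {"id": 2, "company": "Edgepulse", "position": "Structural Engineer", "salary": 193530},
    {"id": 3, "company": "Yozio", "position": "Human Resources Manager", "salary": 50616},
    {"id": 4, "company": "Yamaha", "position": "Resources Manager", "salary": 300000},
]


def has_term(text, term):
    """Rough stand-in for a match query: case-insensitive whole-word lookup."""
    return term.lower() in text.lower().split()


def matches(doc):
    """(company = Yamaha OR Talane) AND (position = manager OR associate) AND salary >= 100000."""
    company_ok = has_term(doc["company"], "Talane") or has_term(doc["company"], "Yamaha")
    position_ok = has_term(doc["position"], "manager") or has_term(doc["position"], "associate")
    salary_ok = doc["salary"] >= 100000
    return company_ok and position_ok and salary_ok


matching_ids = [doc["id"] for doc in employees if matches(doc)]
print(matching_ids)
```

Each inner `or` corresponds to one nested bool with a “should” array, and the outer `and` corresponds to the enclosing “must” array.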
 

Boosting Query

Sometimes, search criteria require that some results be downgraded, but not completely removed from the search results. In these cases, the boosting query is useful. Let’s look at a simple example to demonstrate this.

Let’s search for all employees in China and then downgrade the employees of the company “Talane” in the search results. We can use a boosting query such as the one below:

POST  employees/_search
{
    "query": {
        "boosting" : {
            "positive" : {
                "match": {
                  "country": "china"
                }
            },
            "negative" : {
                 "match": {
                  "company": "Talane"
                }
            },
            "negative_boost" : 0.5
        }
    }
} 

The response to the above query is shown below. The employee of the company “Talane” is ranked last, and its score is half that of the other results because it was multiplied by the negative_boost of 0.5.

{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 0.10536051,
    "hits": [
      {
        "_index": "employees",
        "_id": "2",
        "_score": 0.10536051,
        "_source": {
          "id": 2,
          "name": "Othilia Cathel",
          "email": "[email protected]",
          "gender": "female",
          "ip_address": "3.164.153.228",
          "date_of_birth": "22/07/1987",
          "company": "Edgepulse",
          "position": "Structural Engineer",
          "experience": 11,
          "country": "China",
          "phrase": "Grass-roots heuristic help-desk",
          "salary": 193530
        }
      },
      {
        "_index": "employees",
        "_id": "3",
        "_score": 0.10536051,
        "_source": {
          "id": 3,
          "name": "Winston Waren",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "202.37.210.94",
          "date_of_birth": "10/11/1985",
          "company": "Yozio",
          "position": "Human Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Versatile object-oriented emulation",
          "salary": 50616
        }
      },
      {
        "_index": "employees",
        "_id": "4",
        "_score": 0.10536051,
        "_source": {
          "id": 4,
          "name": "Alan Thomas",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "200.47.210.95",
          "date_of_birth": "11/12/1985",
          "company": "Yamaha",
          "position": "Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Emulation of roots heuristic coherent systems",
          "salary": 300000
        }
      },
      {
        "_index": "employees",
        "_id": "1",
        "_score": 0.052680254,
        "_source": {
          "id": 1,
          "name": "Huntlee Dargavel",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "58.11.89.193",
          "date_of_birth": "11/09/1990",
          "company": "Talane",
          "position": "Research Associate",
          "experience": 7,
          "country": "China",
          "phrase": "Multi-channelled coherent leverage",
          "salary": 180025
        }
      }
    ]
  }
}
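The scores in this response follow from a simple rule: a document that also matches the “negative” query keeps its “positive” score multiplied by negative_boost. A minimal Python sketch of that arithmetic, with the function name `boosting_score` chosen only for illustration and the input score copied from the response above:

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Positive score, scaled down only when the negative query also matches."""
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


positive_score = 0.10536051  # score of the match on "country" in the response above

talane = boosting_score(positive_score, matches_negative=True, negative_boost=0.5)
others = boosting_score(positive_score, matches_negative=False, negative_boost=0.5)
print(talane, others)
```

With a negative_boost between 0 and 1 the downgraded documents stay in the result set; they only move down in the ranking.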
 

You can use any query in the “positive” and “negative” sections of the boosting query. This is useful when several conditions need to be combined with a bool query. An example of such a query is given below:

GET employees/_search
{
  "query": {
    "boosting": {
      "positive": {
        "bool": {
          "should": [
            {
              "match": {
                "country": {
                  "query": "china"
                }
              }
            },
            {
              "range": {
                "experience": {
                  "gte": 10
                }
              }
            }
          ]
        }
      },
      "negative": {
        "match": {
          "gender": "female"
        }
      },
      "negative_boost": 0.5
    }
  }
} 

Function Score Query

The function_score query allows us to change the score of the documents returned by a query. It takes a query and one or more functions that calculate the score. If no functions are specified, the query is executed normally.

The simplest case of function_score, without any functions, is shown below as Case 2, after a plain match query (Case 1) for comparison:

Case 1: simple match query

GET   employees/_search
{
   "_source": ["position"],
    "query": {
        "match" : {
            "position" : "manager"
        }
    }
} 

Result

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.72615415,
    "hits": [
      {
        "_index": "employees",
        "_id": "4",
        "_score": 0.72615415,
        "_source": {
          "position": "Resources Manager"
        }
      },
      {
        "_index": "employees",
        "_id": "3",
        "_score": 0.60996956,
        "_source": {
          "position": "Human Resources Manager"
        }
      }
    ]
  }
}
 

Case 2: match query with the score modified by function_score

GET employees/_search
{
  "_source": ["position"],
  "query": {
    "function_score": {
      "query": {
        "match": {
          "position": "manager"
        }
      },
      "boost": 5,
      "boost_mode": "multiply"
    }
  }
}

Result

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 3.630771,
    "hits": [
      {
        "_index": "employees",
        "_id": "4",
        "_score": 3.630771,
        "_source": {
          "position": "Resources Manager"
        }
      },
      {
        "_index": "employees",
        "_id": "3",
        "_score": 3.0498476,
        "_source": {
          "position": "Human Resources Manager"
        }
      }
    ]
  }
} 

function_score: weight

As mentioned earlier, we can use one or more scoring functions in the “functions” array of the “function_score” query. One of the simplest but important functions is the “weight” scoring function.

According to the documentation, this function allows us to multiply the score by the given weight. A weight can be defined for each function in the “functions” array and is multiplied by the score calculated by the respective function.

Let us modify the previous query slightly to see how it works. We include two filters in the “functions” part of the query. The first will look for the term “coherent” in the “phrase” field of the document and, if found, will increase the score by a weight of 2. The second clause will look for the term “emulation” in the “phrase” field and increase the score by a factor of 10, for such documents. Here is the query:

GET employees/_search
{
  "_source": ["position","phrase"],
  "query": {
    "function_score": {
      "query": {
        "match": {
          "position": "manager"
        }
      },
      "functions": [
        {
          "filter": {
            "match": {
              "phrase": "coherent"
            }
          },
          "weight": 2
        },
        {
          "filter": {
            "match": {
              "phrase": "emulation"
            }
          },
          "weight": 10
        }
      ],
      "score_mode": "multiply",
      "boost": 5,
      "boost_mode": "multiply"
    }
  }
} 

The result of the previous query is as follows:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 72.61542,
    "hits": [
      {
        "_index": "employees",
        "_id": "4",
        "_score": 72.61542,
        "_source": {
          "phrase": "Emulation of roots heuristic coherent systems",
          "position": "Resources Manager"
        }
      },
      {
        "_index": "employees",
        "_id": "3",
        "_score": 30.498476,
        "_source": {
          "phrase": "Versatile object-oriented emulation",
          "position": "Human Resources Manager"
        }
      }
    ]
  }
} 

The match part of the query on the “position” field produced scores of 3.63 and 3.05 for the two documents (the raw match scores multiplied by the boost of 5). When the first function in the “functions” array (the match on the keyword “coherent”) was applied, only one document matched, the one with id=4.

The current score of this document was multiplied by the weight of the “coherent” match, which is 2. The new score of the document becomes 3.63*2 = 7.26.

Next, the second condition (the match on “emulation”) matched both documents.

So the score of the document with id=4 becomes 7.26*10 = 72.6, where 10 is the weight of the second clause.

The document with id=3 matched only the second clause, so its score is 3.05*10 = 30.5.
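The same arithmetic can be checked in a few lines. The Python sketch below models only the configuration used in this example (weight-only functions, score_mode and boost_mode both set to multiply); the function name is hypothetical, and the raw match scores are copied from the Case 1 response earlier in the article.

```python
import math


def weighted_function_score(query_score, matched_weights, boost):
    """Model of score_mode=multiply and boost_mode=multiply with weight-only functions.

    matched_weights: weights of the functions whose filter matched the document.
    """
    # math.prod of an empty list is 1, i.e. the query score is left unchanged.
    return query_score * boost * math.prod(matched_weights)


# id=4 matches both "coherent" (weight 2) and "emulation" (weight 10).
doc4 = weighted_function_score(0.72615415, matched_weights=[2, 10], boost=5)
# id=3 matches only "emulation" (weight 10).
doc3 = weighted_function_score(0.60996956, matched_weights=[10], boost=5)
print(round(doc4, 2), round(doc3, 2))
```

Up to rounding, both values agree with the `_score` fields in the response.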

function_score: script_score

It is often the case that we need to calculate the score based on one or more fields, and for this the default scoring mechanism is not sufficient. Elasticsearch provides the scoring function “script_score” to calculate the score based on custom requirements. We can provide a script that will return the score for each document based on the custom logic of the fields.

For example, we need to calculate scores based on salary and experience, i.e., employees with the highest salary/experience ratio should score higher. We can use the following function_score query for the same purpose:

GET employees/_search
{
  "_source": [
    "name",
    "experience",
    "salary"
  ],
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "source": "(doc['salary'].value/doc['experience'].value)/1000"
            }
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
} 

The script above generates the scores for the search results. For example, for an employee with salary = 180025 and experience = 7 the generated score will be (180025/7)/1000 = 25. The result is a whole number because both fields hold integer values, so the divisions in the script are integer divisions.

Since we are using boost_mode: replace, the scores calculated by the script are assigned to each document as they are. The results for the above query are shown below:

{
  "took": 48,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 25,
    "hits": [
      {
        "_index": "employees",
        "_id": "1",
        "_score": 25,
        "_source": {
          "name": "Huntlee Dargavel",
          "experience": 7,
          "salary": 180025
        }
      },
      {
        "_index": "employees",
        "_id": "4",
        "_score": 25,
        "_source": {
          "name": "Alan Thomas",
          "experience": 12,
          "salary": 300000
        }
      },
      {
        "_index": "employees",
        "_id": "2",
        "_score": 17,
        "_source": {
          "name": "Othilia Cathel",
          "experience": 11,
          "salary": 193530
        }
      },
      {
        "_index": "employees",
        "_id": "3",
        "_score": 4,
        "_source": {
          "name": "Winston Waren",
          "experience": 12,
          "salary": 50616
        }
      }
    ]
  }
} 
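The whole-number scores in this response can be reproduced offline. The following Python sketch mirrors the script under the assumption that both fields are integers, so each division truncates (Python's `//` behaves the same way for positive values); the function name is illustrative, and the salary and experience values are copied from the response above.

```python
def salary_experience_score(salary, experience):
    """Mirror of the script, assuming integer fields so that each division truncates."""
    return (salary // experience) // 1000


# (salary, experience) pairs copied from the response above.
employees = {
    "Huntlee Dargavel": (180025, 7),
    "Alan Thomas": (300000, 12),
    "Othilia Cathel": (193530, 11),
    "Winston Waren": (50616, 12),
}

scores = {
    name: salary_experience_score(salary, experience)
    for name, (salary, experience) in employees.items()
}
print(scores)
```

If fractional scores are preferred, one option is to make the script divide by a floating-point constant such as 1000.0.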

function_score: field_value_factor

We can use a document field to influence the score with the “field_value_factor” function. In some ways this is a simple alternative to “script_score.” In our example, we use the value of the field “experience” to influence the score as follows:

GET employees/_search
{
  "_source": ["name","experience"],
  "query": {
    "function_score": {
      "field_value_factor": {
        "field": "experience",
        "factor": 0.5,
        "modifier": "square",
        "missing": 1
      }
    }
  }
}

The result of the query is shown below:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 36,
    "hits": [
      {
        "_index": "employees",
        "_id": "3",
        "_score": 36,
        "_source": {
          "name": "Winston Waren",
          "experience": 12
        }
      },
      {
        "_index": "employees",
        "_id": "4",
        "_score": 36,
        "_source": {
          "name": "Alan Thomas",
          "experience": 12
        }
      },
      {
        "_index": "employees",
        "_id": "2",
        "_score": 30.25,
        "_source": {
          "name": "Othilia Cathel",
          "experience": 11
        }
      },
      {
        "_index": "employees",
        "_id": "1",
        "_score": 12.25,
        "_source": {
          "name": "Huntlee Dargavel",
          "experience": 7
        }
      }
    ]
  }
} 

The scoring calculation for the above would be as follows:

Square of (factor*doc[experience].value) 

For a document with “experience” of 12, the score will be:

square of (0.5*12) = square of (6) = 36 
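The calculation can be repeated for every employee in the response. The sketch below is a small Python model that covers only the “none” and “square” modifiers; the function is hypothetical and simply applies the factor and the modifier to the field value, falling back to `missing` when the field is absent.

```python
def field_value_factor(value, factor=1.0, modifier="none", missing=None):
    """Tiny model of field_value_factor covering only the 'none' and 'square' modifiers."""
    if value is None:
        # The factor and modifier are applied to the fallback value as well.
        value = missing
    scaled = factor * value
    if modifier == "square":
        return scaled ** 2
    if modifier == "none":
        return scaled
    raise ValueError(f"modifier not covered by this sketch: {modifier}")


# "experience" values copied from the response above.
for experience in (12, 11, 7):
    print(experience, field_value_factor(experience, factor=0.5, modifier="square", missing=1))
```

Because the query has no explicit inner query, every document starts from a score of 1, so the value computed by the function is also the final score shown in the response.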

function_score: Decay Functions

Consider the use case of searching for hotels near a location. The closer the hotel is, the more relevant the result; the farther away it is, the less meaningful the result becomes. To refine the concept further: hotels within a walking distance of 1 km from the location should score higher, while the score of those beyond 1 km should decline rapidly.

For this type of use case, a decaying score is the best choice: the score decreases as the distance from the point of interest grows. For this purpose, Elasticsearch has scoring functions called decay functions. There are three types of decay functions: “gauss,” “linear,” and “exp” (exponential).

Let’s take an example from our scenario. We need to score employees according to their salary. Those close to 200,000, roughly between 170,000 and 230,000, should score higher, while those further below or above that range should score significantly lower.

GET employees/_search
{
  "_source": [
    "name",
    "salary"
  ],
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
       {
         "gauss": {
           "salary": {
             "origin": 200000,
             "scale": 30000
           }
         }
       }
      ],
      "boost_mode": "replace"
    }
  }
} 

The origin is the point from which the distance is calculated. The scale is the distance from the origin at which the score has dropped to the value of the decay parameter (0.5 by default). There are other optional parameters, such as offset and decay, that can be consulted in the Elastic documentation.

The results of the query are shown below:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 0.9682744,
    "hits": [
      {
        "_index": "employees",
        "_id": "2",
        "_score": 0.9682744,
        "_source": {
          "name": "Othilia Cathel",
          "salary": 193530
        }
      },
      {
        "_index": "employees",
        "_id": "1",
        "_score": 0.7354331,
        "_source": {
          "name": "Huntlee Dargavel",
          "salary": 180025
        }
      },
      {
        "_index": "employees",
        "_id": "4",
        "_score": 0.00045208726,
        "_source": {
          "name": "Alan Thomas",
          "salary": 300000
        }
      },
      {
        "_index": "employees",
        "_id": "3",
        "_score": 3.4350627e-8,
        "_source": {
          "name": "Winston Waren",
          "salary": 50616
        }
      }
    ]
  }
} 
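The scores in this response can be reproduced with the Gaussian decay formula. The Python sketch below is a minimal model that assumes the default values offset = 0 and decay = 0.5, which the query above does not override; the function name is illustrative and nothing here calls Elasticsearch.

```python
import math


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Gaussian decay: 1.0 at the origin, `decay` at a distance of `scale` beyond `offset`."""
    distance = max(0.0, abs(value - origin) - offset)
    # Variance chosen so that the score equals `decay` when distance == scale.
    sigma_squared = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_squared))


# Salaries copied from the response above.
for salary in (193530, 180025, 300000, 50616):
    print(salary, gauss_decay(salary, origin=200000, scale=30000))
```

A salary exactly at the origin scores 1.0, a salary 30,000 away scores 0.5, and the score falls off quickly beyond that, which explains the very small values for the salaries of 300000 and 50616.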
