Elasticsearch: join and bonus queries

Elasticsearch is a very good NoSQL database for performing efficient searches on textual and structured data. Despite this, it does not natively support joins between documents. However, there are queries that, by means of an appropriate schema definition, allow searches on related documents between them. We will find out how to write join queries and also some particular queries that might be useful in our projects.

Share

Reading time: 7 minutes

In the articles Elasticsearch: use of match queries, Elasticsearch: use of term queries, and Elasticsearch: compound query, we saw how to query both textual fields and structured data by imposing the verification of multiple conditions simultaneously on an Elasticsearch index. Also, in the compound queries we saw how to modify the score calculation to fit the data of each document. These queries, however, do not address the need to search for documents that are related to each other through kinship relationships. Consider that Elasticsearch is a NoSQL database and therefore join operations between documents are not native and, while available, are limited and onerous.

In this article we will see how to define a hierarchy between documents for use in searches. In addition, we will provide some “bonus” queries that might come in handy for some projects.

As in previous tutorials, we will use the same environment. Thus, we encourage you to carefully read the instructions on how to install the Elasticsearch stack on your PC via the Docker repository and how to properly import data.

Category Type Match criteria Query Match No Match
has_child join Queries child documents and returns matching parent documents (of matching children). N/A N/A N/A
has_parent join It runs a query on the parent documents and returns the matching parent documents (of the matching parents). N/A N/A N/A
query_string full-text Multi-purpose query that "can be a collector" to the use of other queries such as "match", "multi-match", "regexp", "wildcard" etc. Has strict formatting (position:engineer) OR (salary:(>=10000 AND <=52000)) Documents with the text "engineer" in the "position" field OR documents that have a pay range between 10,000 and 52,000 (including 10,000 and 52,000) N/A
simple_query_string full-text Like query_string, but not strict (position:engineer) | (country:china) Documents with "engineer" in the "position" field OR china in the "country" field. N/A

Parent-Child Query

One-to-many relationships can be handled using the parent-child method (now called join operation) in Elasticsearch. Let us demonstrate this with an example scenario. Consider we have a forum, where anyone can post any topic (say posts). Users can comment on individual posts. Then, in this scenario, we can consider the individual posts as parent documents and the comments to them as children. This is best explained in the figure below:

For this operation, a separate index will be created with a special mapping (schema) applied.

Create the index with the join data type with the following request

PUT post-comments
{
  "mappings": {
    "properties": {
      "document_type": { 
        "type": "join",
        "relations": {
          "post": "comment" 
        }
      }
    }
  }
} 

In the above diagram, you can see the presence of a type named “join,” which indicates that this index will have parent and child related documents. In addition, the names of the parent and child identifiers are defined in the “relations” object.

That is, post:comment refers to the parent:child relationship. Each document will consist of a field called “document_type,” which will have the value “post” or “comment.” The value “post” will indicate that the document is a parent and the value “comment” will indicate that the document is a “child.”

Let’s index some documents:

PUT post-comments/_doc/1
{
"document_type": {
    "name": "post" 
  },
"post_title" : "Angel Has Fallen"
} 
PUT post-comments/_doc/2
{
"document_type": {
    "name": "post" 
  },
"post_title" : "Beauty and the beast - a nice movie"
} 

Let’s now index child documents

PUT post-comments/_doc/A?routing=1
{
"document_type": {
    "name": "comment",
    "parent": "1"
  },
"comment_author": "Neil Soans",
"comment_description": "'Angel has Fallen' has some redeeming qualities, but they're too few and far in between to justify its existence"
} 
PUT post-comments/_doc/B?routing=1
{
"document_type": {
    "name": "comment",
    "parent": "1"
  },
"comment_author": "Exiled Universe",
"comment_description": "Best in the trilogy! This movie wasn't better than the Rambo movie but it was very very close."
} 
PUT post-comments/_doc/D?routing=1
{
"document_type": {
    "name": "comment",
    "parent": "2"
  },
"comment_author": "Emma Cochrane",
"comment_description": "There's the sublime beauty of a forgotten world and the promise of happily-ever-after to draw you to one of your favourite fairy tales, once again. Give it an encore."
} 
PUT post-comments/_doc/E?routing=1
{
"document_type": {
    "name": "comment",
    "parent": "2"
  },
"comment_author": "Common Sense Media Editors",
"comment_description": "Stellar music, brisk storytelling, delightful animation, and compelling characters make this both a great animated feature for kids and a great movie for anyone"
} 

Query has_child

This query queries the child documents and returns the parents associated with them as results. Suppose we need to search for the term “music” in the “comments_description” field in the child documents and need to get the parent documents matching the search results, we can use the has_child query as follows:

GET post-comments/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": {
        "match": {
          "comment_description": "music"
        }
      }
    }
  }
} 

For the above query, the child documents that match the search are only the document with id=E, for which the parent is the document with id=2. The result of the search will take us to the parent document as below:

{
  "took": 46,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "post-comments",
        "_id": "2",
        "_score": 1,
        "_source": {
          "document_type": {
            "name": "post"
          },
          "post_title": "Beauty and the beast - a nice movie"
        }
      }
    ]
  }
} 

Query has_parent

The has_parent query performs the opposite of the has_child query, that is, it returns the child documents of the parent documents that match the query.

We search for the word “Beauty” in the parent document and return the child documents of the matching parents. For this purpose, we can use the following query

GET post-comments/_search
{
  "query": {
    "has_parent": {
      "parent_type": "post",
      "query": {
        "match": {
          "post_title": "Beauty"
        }
      }
    }
  }
} 

The parent document corresponding to the above query is the one with document id =1. As you can see from the answer below, the child documents corresponding to document id=1 are returned by the previous query:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "post-comments",
        "_id": "D",
        "_score": 1,
        "_routing": "1",
        "_source": {
          "document_type": {
            "name": "comment",
            "parent": "2"
          },
          "comment_author": "Emma Cochrane",
          "comment_description": "There's the sublime beauty of a forgotten world and the promise of happily-ever-after to draw you to one of your favourite fairy tales, once again. Give it an encore."
        }
      },
      {
        "_index": "post-comments",
        "_id": "E",
        "_score": 1,
        "_routing": "1",
        "_source": {
          "document_type": {
            "name": "comment",
            "parent": "2"
          },
          "comment_author": "Common Sense Media Editors",
          "comment_description": "Stellar music, brisk storytelling, delightful animation, and compelling characters make this both a great animated feature for kids and a great movie for anyone"
        }
      }
    ]
  }
} 

Return of child documents with parents

Sometimes, both the parent and child documents need to be displayed in the search results. For example, if we are listing posts, it would be nice to display some comments below them as well.

To achieve this, we use the has_child query to return the parents via the parameter “inner_hits” we also return the children.

GET post-comments/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": {
        "match": {
          "comment_description": "music"
        }
      },
      "inner_hits": {}
    }
    
  }
} 

The result of the query will be as follows:

{
  "took": 43,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "post-comments",
        "_id": "2",
        "_score": 1,
        "_source": {
          "document_type": {
            "name": "post"
          },
          "post_title": "Beauty and the beast - a nice movie"
        },
        "inner_hits": {
          "comment": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 1.1829326,
              "hits": [
                {
                  "_index": "post-comments",
                  "_id": "E",
                  "_score": 1.1829326,
                  "_routing": "1",
                  "_source": {
                    "document_type": {
                      "name": "comment",
                      "parent": "2"
                    },
                    "comment_author": "Common Sense Media Editors",
                    "comment_description": "Stellar music, brisk storytelling, delightful animation, and compelling characters make this both a great animated feature for kids and a great movie for anyone"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
} 

Query bonus

Below are some queries that might come in handy in some contexts.

query_string

The query “query_string” is a special multipurpose query, which can group the use of several other queries such as “match”, “multi-match”, “wildcard”, regexp” etc. The “query_string” query follows a strict format and its violation produces error messages. For this reason, despite its capabilities, it is rarely used for implementing user-facing search boxes.

Let us see an example of a query in action:

POST employees/_search
{
  "query": {
    "query_string": {
      "query": "(roots heuristic systems) OR (enigneer~) OR (salary:(>=10000 AND <=52000)) ",
      "fields": [
        "position",
        "phrase^3"
      ]
    }
  }
} 

The above query will search for the words “roots” OR “heuristics” OR “systems” OR “engineer” (the use of ~ in the query indicates the use of a fuzzy query) in the “location” and “phrase” fields and return the results. “phrase^3” indicates that matches found in the “phrase” field should be incremented by a factor of 3. Wage:(>10000 AND <=52000), indicates to retrieve documents that have the value of the “wage” field between 10000 and 52000.

The query simple_query_string

The query “simple_query_string” is a simplified form of query_string with two main differences

It is more error tolerant, meaning that it does not return errors if the syntax is wrong. In fact, it ignores the faulty part of the query. This makes it easier to use for UI search boxes.

The AND/OR/NOT etc. operators are replaced by +/|/-.

A simple example might be:

POST employees/_search
{
  "query": {
    "simple_query_string": {
      "query": "(roots) | (resources manager) + (male) ",
      "fields": [
        "gender",
        "position",
        "phrase^3"
      ]
    }
  }
} 

The above query would look for “roots” OR “resources” OR “manager” AND “male” in all the fields indicated in the “fields” array.

Named queries

Named queries, as the name suggests, are about naming queries. In some cases it is useful to identify which parts of the query match the document. Elasticsearch provides us with just such a feature, allowing us to name the query or parts of the query so that we can see these names with the corresponding documents.

For example, let’s submit the following query

POST employees/_search
{
  "query": {
    "match": {
      "phrase": {
        "query": "roots" ,
        "_name": "phrase_field_name" }
    }
  }
} 

The result obtained will be as follows

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.6785375,
    "hits": [
      {
        "_index": "employees",
        "_id": "2",
        "_score": 0.6785375,
        "_source": {
          "id": 2,
          "name": "Othilia Cathel",
          "email": "[email protected]",
          "gender": "female",
          "ip_address": "3.164.153.228",
          "date_of_birth": "22/07/1987",
          "company": "Edgepulse",
          "position": "Structural Engineer",
          "experience": 11,
          "country": "China",
          "phrase": "Grass-roots heuristic help-desk",
          "salary": 193530
        },
        "matched_queries": [
          "phrase_field_name"
        ]
      },
      {
        "_index": "employees",
        "_id": "4",
        "_score": 0.62577873,
        "_source": {
          "id": 4,
          "name": "Alan Thomas",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "200.47.210.95",
          "date_of_birth": "11/12/1985",
          "company": "Yamaha",
          "position": "Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Emulation of roots heuristic coherent systems",
          "salary": 300000
        },
        "matched_queries": [
          "phrase_field_name"
        ]
      }
    ]
  }
} 

In the above example, the matched query is provided with a parameter “_name,” which has the query name as “phrase_field_name.” In the results, we have the documents that have been matched to the results with an array field called “matched_queries,” which contains the names of the matched queries (here “phrase_field_name”).

The following example shows the use of named queries in a bool query, which is one of the most common use cases of named queries.

POST employees/_search
{
  "query": {
    "bool":{
        "should": [
            {"match": {
                "phrase": {
                    "query": "roots",
                    "_name": "phrase_field_name"}
                }
            },
            {"match": {
                "gender": {
                    "query": "female" ,
                    "_name": "gender_field_name"}
                }
            }
        ]
    }
  }
} 

The outcome will be as follows:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.8825103,
    "hits": [
      {
        "_index": "employees",
        "_id": "2",
        "_score": 1.8825103,
        "_source": {
          "id": 2,
          "name": "Othilia Cathel",
          "email": "[email protected]",
          "gender": "female",
          "ip_address": "3.164.153.228",
          "date_of_birth": "22/07/1987",
          "company": "Edgepulse",
          "position": "Structural Engineer",
          "experience": 11,
          "country": "China",
          "phrase": "Grass-roots heuristic help-desk",
          "salary": 193530
        },
        "matched_queries": [
          "phrase_field_name",
          "gender_field_name"
        ]
      },
      {
        "_index": "employees",
        "_id": "4",
        "_score": 0.62577873,
        "_source": {
          "id": 4,
          "name": "Alan Thomas",
          "email": "[email protected]",
          "gender": "male",
          "ip_address": "200.47.210.95",
          "date_of_birth": "11/12/1985",
          "company": "Yamaha",
          "position": "Resources Manager",
          "experience": 12,
          "country": "China",
          "phrase": "Emulation of roots heuristic coherent systems",
          "salary": 300000
        },
        "matched_queries": [
          "phrase_field_name"
        ]
      }
    ]
  }
} 

More To Explore

Artificial intelligence

Gradio: web applications in python for AI [part2]

Gradio is a python library that allows us to create web applications quickly and intuitively for our machine learning and AI models. Our applications always require user interaction and layout customization. Let us find out, through examples, how to improve our applications.

Artificial intelligence

Gradio: web applications in python for AI [part1]

Writing web applications for our machine learning and/or artificial intelligence models can take a lot of time and skills that we do not possess. To streamline and speed up this task we are helped by Gradio, a Python library designed to create web applications with just a few lines of code. Let’s discover its basic functionality with some examples.

Leave a Reply

Your email address will not be published. Required fields are marked *

Design with MongoDB

Design with MongoDB!!!

Buy the new book that will help you to use MongoDB correctly for your applications. Available now on Amazon!