Elasticsearch: bucket aggregations [part 1]

With Elasticsearch's bucket aggregations we can group documents into buckets. In this article we will focus mainly on aggregations over keyword-type fields, using several examples to understand the main differences between the available aggregation types.

As introduced in the article Elasticsearch: the aggregation types, Elasticsearch allows not only searching data but also analyzing it. Among the various aggregation types, in this article we will deal with bucket aggregations, which organize sets of documents into groups. The type of aggregation determines whether or not a given document falls into a bucket.

You can use bucket aggregations to implement faceted navigation (usually placed as a sidebar on a search results landing page) to help users narrow down results of interest.
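
For example, a single search request can return several facets at once simply by declaring one bucket aggregation per field. A minimal sketch against the Kibana sample web logs used throughout this article, with one facet on the response code and one on the operating system:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "by_response": {
      "terms": {
        "field": "response.keyword"
      }
    },
    "by_os": {
      "terms": {
        "field": "machine.os.keyword"
      }
    }
  }
}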

In this article, we will begin to analyze some aggregations belonging to this type. In particular, we will focus on aggregations involving terms.

terms

The terms aggregation dynamically creates a bucket for each unique term in a field.
The following example uses a terms aggregation to find the number of documents per response code in the sample web log data:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "response_codes": {
      "terms": {
        "field": "response.keyword",
        "size": 10
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "response_codes" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "200",
        "doc_count" : 12832
      },
      {
        "key" : "404",
        "doc_count" : 801
      },
      {
        "key" : "503",
        "doc_count" : 441
      }
    ]
  }
 }
} 

Each bucket is identified by the key field, and doc_count specifies the number of documents in the bucket. By default, the buckets are sorted in descending order of doc_count.
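
If you need a different ordering, the terms aggregation also accepts an order parameter. A minimal sketch that sorts the buckets by term instead of by document count (_key and _count are the built-in sort keys):

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "response_codes": {
      "terms": {
        "field": "response.keyword",
        "size": 10,
        "order": { "_key": "asc" }
      }
    }
  }
}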

The response also includes two keys named doc_count_error_upper_bound and sum_other_doc_count.

The terms aggregation returns the top unique terms. Therefore, if the data has many unique terms, some of them may not appear in the results. The sum_other_doc_count field is the total count of the documents that fall outside the returned buckets. In this case, the number is 0 because all unique values appear in the response.

The doc_count_error_upper_bound field represents the maximum possible count for a unique value that is excluded from the final results. Use this field to estimate the margin of error of the count.

The counts may not be exact. The coordinating node responsible for the aggregation asks each shard for its top unique terms. Imagine a scenario in which the size parameter is 3: the terms aggregation asks each shard for its top 3 unique terms, and the coordinating node then merges these partial results to compute the final result. If a shard holds a term that is not among its local top 3, the documents for that term on that shard are not counted in the final result.

This is especially true if the size is set to a low number. Since the default size is 10, an error is unlikely to occur. If you do not need high precision and want to increase performance, you can reduce the size.
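
Conversely, if you need more accurate counts, you can ask each shard for more candidate terms than the final size through the shard_size parameter, and request the per-bucket error with show_term_doc_count_error. A minimal sketch:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "response_codes": {
      "terms": {
        "field": "response.keyword",
        "size": 3,
        "shard_size": 25,
        "show_term_doc_count_error": true
      }
    }
  }
}

With this option each bucket also reports its own doc_count_error_upper_bound, which helps you judge whether shard_size is large enough.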

sampler, diversified_sampler

If you are aggregating millions of documents, you can use the sampler aggregation to reduce the scope to a small sample of documents and get a faster response. The sampler aggregation builds the sample from the highest-scoring documents.

The results are approximate but still representative of the distribution of the real data. The sampler aggregation significantly improves query performance, but the resulting counts are estimates rather than exact values.

The basic syntax is:

"aggs": {
  "SAMPLE": {
    "sampler": {
      "shard_size": 100
    },
    "aggs": {...}
  }
} 

The shard_size property tells Elasticsearch how many documents (at most) to collect from each shard.

The following example limits the number of documents collected on each shard to 1,000 and then groups them with a terms aggregation:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 1000
      },
      "aggs": {
        "terms": {
          "terms": {
            "field": "agent.keyword"
          }
        }
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "sample" : {
    "doc_count" : 1000,
    "terms" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1",
          "doc_count" : 368
        },
        {
          "key" : "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24",
          "doc_count" : 329
        },
        {
          "key" : "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)",
          "doc_count" : 303
        }
      ]
    }
  }
 }
} 

The diversified_sampler aggregation lets you reduce bias in the distribution of the sample pool. The field setting specifies the field used to de-duplicate the sample: on each shard, only a limited number of documents sharing the same value of that field are collected (one per value by default):

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "sample": {
      "diversified_sampler": {
        "shard_size": 1000,
        "field": "response.keyword"
      },
      "aggs": {
        "terms": {
          "terms": {
            "field": "agent.keyword"
          }
        }
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "sample" : {
    "doc_count" : 3,
    "terms" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1",
          "doc_count" : 2
        },
        {
          "key" : "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)",
          "doc_count" : 1
        }
      ]
    }
  }
 }
} 
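
The sample above contains only three documents because, by default, the diversified_sampler keeps at most one document per distinct value of the de-duplication field on each shard, and response.keyword has three distinct values here. The max_docs_per_value setting relaxes that limit; a sketch that allows up to 100 documents per response code:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "sample": {
      "diversified_sampler": {
        "shard_size": 1000,
        "field": "response.keyword",
        "max_docs_per_value": 100
      },
      "aggs": {
        "terms": {
          "terms": {
            "field": "agent.keyword"
          }
        }
      }
    }
  }
}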

significant_terms, significant_text

The significant_terms aggregation identifies terms that occur unusually or interestingly often in a filtered subset of documents compared with the rest of the data in an index.

The foreground set is the set of documents matched by the filter; the background set is the set of all documents in the index. The significant_terms aggregation examines the documents in the foreground set and computes a score for each term that occurs significantly more often there than in the background set.

In the sample web log data, each document has a field containing the visitor's user agent. This example filters for all requests coming from machines running iOS. A regular terms aggregation on this foreground set would return Firefox simply because it has the most documents in the bucket. A significant_terms aggregation, on the other hand, returns Internet Explorer (IE), because IE has a significantly larger presence in the foreground set than in the background set.

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "query": {
    "terms": {
      "machine.os.keyword": [
        "ios"
      ]
    }
  },
  "aggs": {
    "significant_response_codes": {
      "significant_terms": {
        "field": "agent.keyword"
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "significant_response_codes" : {
    "doc_count" : 2737,
    "bg_count" : 14074,
    "buckets" : [
      {
        "key" : "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)",
        "doc_count" : 818,
        "score" : 0.01462731514608217,
        "bg_count" : 4010
      },
      {
        "key" : "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1",
        "doc_count" : 1067,
        "score" : 0.009062566630410223,
        "bg_count" : 5362
      }
    ]
  }
 }
} 

If the significant_terms aggregation returns no results, it is possible that the results were not filtered with a query. Alternatively, the distribution of terms in the foreground set could be the same as in the background set, implying that there is nothing unusual in the foreground set.

The significant_text aggregation is similar to the significant_terms aggregation, but it works on unstructured text fields. significant_text measures the change in popularity of terms between the foreground and background sets using statistical analysis. For example, it might suggest Tesla when searching for the acronym TSLA.

This aggregation reanalyzes the source text on the fly, filtering out noisy data such as duplicate paragraphs, boilerplate headers and footers, and so on, which might otherwise skew the results.

Reanalysis of high-cardinality datasets can be a CPU-intensive task. We recommend using the significant_text aggregation within a sampler aggregation to limit the analysis to a small selection of documents with the best matches, for example 200.

The following parameters can be set:

  • min_doc_count: only returns terms that appear in at least the configured number of documents. Setting it to 1 is not recommended, because it tends to return terms that are typos or misspellings; finding more than one instance of a term helps confirm that it is not a one-off occurrence. The default value of 3 provides a minimum weight of evidence.
  • shard_size: setting a high value increases stability (and accuracy) at the expense of computational performance.
  • shard_min_doc_count: if the text contains many low-frequency terms that you are not interested in (e.g., typos), you can set this parameter to discard candidate terms at the shard level when it is reasonably certain they would not reach the required min_doc_count even after the shard-level frequencies are merged. The default value is 1, which has no effect until you explicitly set it. This value should be set much lower than min_doc_count (a sketch using these parameters follows this list).
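
As a sketch of the last two parameters, using the Shakespeare dataset introduced just below, the following request asks each shard for more candidate terms through shard_size and drops terms that occur fewer than two times on a shard through shard_min_doc_count:

GET shakespeare/_search
{
  "size": 0,
  "query": {
    "match": {
      "text_entry": "breathe"
    }
  },
  "aggregations": {
    "keywords": {
      "significant_text": {
        "field": "text_entry",
        "min_doc_count": 4,
        "shard_min_doc_count": 2,
        "shard_size": 200
      }
    }
  }
}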

Suppose you have the complete works of Shakespeare indexed in an Elasticsearch cluster. You can find significant terms related to the word “breathe” in the text_entry field:

GET shakespeare/_search
{
  "query": {
    "match": {
      "text_entry": "breathe"
    }
  },
  "aggregations": {
    "my_sample": {
      "sampler": {
        "shard_size": 100
      },
      "aggregations": {
        "keywords": {
          "significant_text": {
            "field": "text_entry",
            "min_doc_count": 4
          }
        }
      }
    }
  }
} 

Sample response

"aggregations" : {
  "my_sample" : {
    "doc_count" : 59,
    "keywords" : {
      "doc_count" : 59,
      "bg_count" : 111396,
      "buckets" : [
        {
          "key" : "breathe",
          "doc_count" : 59,
          "score" : 1887.0677966101694,
          "bg_count" : 59
        },
        {
          "key" : "air",
          "doc_count" : 4,
          "score" : 2.641295376716233,
          "bg_count" : 189
        },
        {
          "key" : "dead",
          "doc_count" : 4,
          "score" : 0.9665839666414213,
          "bg_count" : 495
        },
        {
          "key" : "life",
          "doc_count" : 5,
          "score" : 0.9090787433467572,
          "bg_count" : 805
        }
      ]
    }
  }
 }
} 

The most significant terms in relation to breathe are air, dead, and life.

The significant_text aggregation has the following limitations:

  • it does not support child aggregations, because they would have a high memory cost. As a workaround, you can run a follow-up query using a terms aggregation with an include clause and a child aggregation (see the sketch after this list).
  • it does not support nested objects, because it works on the JSON source of the document.
  • the document counts may have some (usually small) inaccuracies, because they are based on summing the samples returned by each shard. You can use the shard_size parameter to fine-tune the trade-off between precision and performance. By default, shard_size is set to -1, which means it is estimated automatically from the number of shards and the size parameter.
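
A sketch of the workaround mentioned in the first point, shown with the keyword field from the earlier web-log example because the pattern is identical: the significant terms found by a first query are passed to a terms aggregation through the include clause, and the child aggregation is attached there:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "selected_agents": {
      "terms": {
        "field": "agent.keyword",
        "include": [
          "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
        ]
      },
      "aggs": {
        "response_codes": {
          "terms": {
            "field": "response.keyword"
          }
        }
      }
    }
  }
}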

For both significant_terms and significant_text aggregations, the default source of statistical information for background term frequencies is the entire index. It is possible to narrow the scope with a background filter to achieve greater precision:

GET shakespeare/_search
{
  "query": {
    "match": {
      "text_entry": "breathe"
    }
  },
  "aggregations": {
    "my_sample": {
      "sampler": {
        "shard_size": 100
      },
      "aggregations": {
        "keywords": {
          "significant_text": {
            "field": "text_entry",
            "background_filter": {
              "term": {
                "speaker": "JOHN OF GAUNT"
              }
            }
          }
        }
      }
    }
  }
} 

missing

If some documents in the index do not contain the aggregated field at all, or the field has a null value, you can use the missing parameter to specify the name of the bucket into which those documents are placed.

The following example puts all documents with a missing value into a bucket named “N/A”:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "response_codes": {
      "terms": {
        "field": "response.keyword",
        "size": 10,
        "missing": "N/A"
      }
    }
  }
}

Since the default value of the min_doc_count parameter is 1 and no document in this index is missing the response field, the “N/A” bucket does not appear in the response. Set the min_doc_count parameter to 0 to display the “N/A” bucket in the response:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "response_codes": {
      "terms": {
        "field": "response.keyword",
        "size": 10,
        "missing": "N/A",
        "min_doc_count": 0
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "response_codes" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "200",
        "doc_count" : 12832
      },
      {
        "key" : "404",
        "doc_count" : 801
      },
      {
        "key" : "503",
        "doc_count" : 441
      },
      {
        "key" : "N/A",
        "doc_count" : 0
      }
    ]
  }
 }
} 
