Elasticsearch: the aggregation types

Elasticsearch is a NoSQL database widely used to build search engines thanks to its text indexing capabilities. But it does not stop there. With aggregations, Elasticsearch can also analyze data and extract statistics from large datasets. Let's explore this feature, which underlies many of the visualizations used by Kibana.

In the previous articles, Elasticsearch: use of match queries, Elasticsearch: use of term queries, Elasticsearch: compound query, and Elasticsearch: join and bonus queries, we have seen how to query documents saved within an Elasticsearch index. But Elasticsearch is not just for searching structured information or unstructured text. Aggregations allow you to leverage Elasticsearch’s powerful analytic engine to analyze data and extract statistics.

Use cases for aggregations range from analyzing real-time data in order to trigger an action to building visualization dashboards in Kibana. In fact, many of the visualizations we have already seen in the articles Kibana: let’s explore data and Kibana: build your own dashboard rely on aggregations.

The great strength of Elasticsearch is its ability to perform aggregations on huge datasets in milliseconds. Compared to plain queries, however, aggregations consume more CPU cycles and memory. For this reason, they are mainly used for building dashboards or performing complex analyses on data.

In this series of articles we will study, through examples, the various types of aggregation to understand what information and statistics we can extract. Specifically, in this article we will introduce the aggregation syntax and the various types, which we will analyze in detail in later articles.

Aggregations on text fields

By default, Elasticsearch does not support aggregations over a text field. Since text fields are tokenized, an aggregation on a text field must reverse the tokenization process to return to the original string and then formulate an aggregation based on it. This operation consumes a lot of memory and degrades cluster performance.

Although it is possible to enable aggregations on text fields by setting the fielddata parameter to true in the mapping, the aggregations are still based on the tokenized words and not on the raw text.

It is recommended to keep a raw version of the text field as a keyword type field on which aggregations can be performed. In this case, aggregations can be performed on the title.raw field instead of the title field:

PUT movies
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
} 
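
As a minimal sketch of how the raw sub-field would be used, the request below runs a terms aggregation on title.raw, which groups the documents by their exact, untokenized title. The aggregation name titles and the size value are illustrative choices and are not part of the original mapping; the request syntax itself is explained in the next section.

GET movies/_search
{
  "size": 0,
  "aggs": {
    "titles": {
      "terms": {
        "field": "title.raw",
        "size": 10
      }
    }
  }
}

Running the same aggregation on title would be rejected unless fielddata is enabled, and would then count individual tokens rather than whole titles.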

General aggregation structure

The structure of an aggregation query is as follows:

GET _search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "AGG_TYPE": {}
    }
  }
} 

If you are only interested in the aggregation result and not the query results, you should set the size to 0.

Any number of aggregations can be defined in the aggs property. Each aggregation is defined by its name and one of the aggregation types supported by Elasticsearch.

The name of the aggregation helps to distinguish the different aggregations in the response. The AGG_TYPE property allows you to specify the type of the aggregation.

Sample aggregation

This section uses the sample e-commerce and web log data sets provided by Kibana. To add them, log into Kibana, choose Home and then Try our sample data. For both the sample e-commerce orders and the sample web logs, choose Add data.

Example of average calculation

To find the average value of the taxful_total_price field:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "avg_taxful_total_price": {
      "avg": {
        "field": "taxful_total_price"
      }
    }
  }
} 

Sample response

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4675,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_taxful_total_price" : {
      "value" : 75.05542864304813
    }
  }
} 

The aggregations block in the response contains the average value of the taxful_total_price field, reported under the name given in the request.
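
Because any number of aggregations can be defined in the aggs property, several metrics can be requested at once. The sketch below reuses the same field for an average and a maximum; the aggregation names are arbitrary labels chosen for this example.

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "avg_taxful_total_price": {
      "avg": {
        "field": "taxful_total_price"
      }
    },
    "max_taxful_total_price": {
      "max": {
        "field": "taxful_total_price"
      }
    }
  }
}

Each result is returned in the response under its own name, so the two values can be told apart.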

Types of aggregations

There are three main types of aggregations:

  • Metric aggregations: calculate metrics such as sum, min, max, and avg over numeric fields.
  • Bucket aggregations: group documents into buckets based on some criteria.
  • Pipeline aggregations: take the output of other aggregations as their input.
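
To make the last two categories concrete, here is a hedged sketch that combines a bucket aggregation and a pipeline aggregation. It assumes that the sample e-commerce index contains a date field named order_date and that the cluster is recent enough to accept the calendar_interval parameter; the aggregation names are illustrative. The date_histogram aggregation creates one bucket per day, and the avg_bucket aggregation reads the document count of each bucket (the special _count path) to compute the average number of orders per day.

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "orders_per_day": {
      "date_histogram": {
        "field": "order_date",
        "calendar_interval": "day"
      }
    },
    "avg_orders_per_day": {
      "avg_bucket": {
        "buckets_path": "orders_per_day>_count"
      }
    }
  }
}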

Nested aggregations

Aggregations within aggregations are called nested aggregations or subaggregations. Not all types of aggregations allow nested aggregations to be defined. Metric aggregations produce simple values that cannot be aggregated further. In contrast, bucket aggregations produce groups of documents on which further aggregations can be computed. Complex data analyses can be performed by nesting metric and bucket aggregations inside bucket aggregations.

General syntax of nested aggregation

{
  "aggs": {
    "NAME": {
      "AGG_TYPE": {},
      "aggs": {
        "NESTED_NAME": {
          "AGG_TYPE": {}
        }
      }
    }
  }
}

The inner aggs keyword starts a new nested aggregation. The syntax of the parent aggregation and the nested aggregation is the same. A nested aggregation is executed in the context of its parent: for a bucket aggregation, it is computed once for each bucket.
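
As a concrete but hedged illustration, the request below groups the sample orders by day of the week and computes the average order value inside each group. It assumes that the sample e-commerce index has a keyword field named day_of_week; the aggregation names are illustrative.

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "orders_by_day": {
      "terms": {
        "field": "day_of_week"
      },
      "aggs": {
        "avg_order_value": {
          "avg": {
            "field": "taxful_total_price"
          }
        }
      }
    }
  }
}

In the response, each bucket of orders_by_day carries its key, its doc_count, and its own avg_order_value result.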

You can also associate aggregations with search queries to narrow down the elements to be analyzed before the aggregation. If you do not add a query, Elasticsearch implicitly uses the match_all query.
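
A minimal sketch of this combination: the range query below restricts the search to orders with a taxful_total_price of at least 100, and the average is then computed only over the matching documents. The threshold is an arbitrary value chosen for illustration.

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "query": {
    "range": {
      "taxful_total_price": {
        "gte": 100
      }
    }
  },
  "aggs": {
    "avg_taxful_total_price": {
      "avg": {
        "field": "taxful_total_price"
      }
    }
  }
}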

We will see in subsequent articles some examples for the various types of aggregations.
