Elasticsearch: bucket aggregations [part 2]

With Elasticsearch's bucket aggregations we can create groups of documents. Having covered aggregations on keyword fields in the previous article, we now focus on other data types. In particular, we will use aggregations that define numeric ranges, date ranges, and groups based on georeferenced data.

As seen in the previous article, Elasticsearch allows for bucket-based aggregations that define groups of documents. We have already seen how to define term-based buckets and handle missing values. In this article we will look at aggregations that involve value ranges (dates and numbers) and geographic data.

histogram, date_histogram

The histogram aggregation groups documents into buckets of a fixed numeric interval.

With histogram aggregations, you can easily visualize how the values of a numeric field are distributed across your documents. Of course, Elasticsearch does not return an actual graph; that is what Kibana is for. But it provides the JSON response you need to construct your own graph.

The following example buckets the bytes field into intervals of 10,000:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "number_of_bytes": {
      "histogram": {
        "field": "bytes",
        "interval": 10000
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "number_of_bytes" : {
    "buckets" : [
      {
        "key" : 0.0,
        "doc_count" : 13372
      },
      {
        "key" : 10000.0,
        "doc_count" : 702
      }
    ]
  }
 }
} 
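
Empty intervals also show up as zero-count buckets. If you only want populated buckets, you can add the min_doc_count parameter; a minimal sketch on the same index:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "number_of_bytes": {
      "histogram": {
        "field": "bytes",
        "interval": 10000,
        "min_doc_count": 1
      }
    }
  }
}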

The date_histogram aggregation uses date math to generate histograms for time series.

For example, you can find the number of visits to your website per month:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "logs_per_month": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "month"
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "logs_per_month" : {
    "buckets" : [
      {
        "key_as_string" : "2020-10-01T00:00:00.000Z",
        "key" : 1601510400000,
        "doc_count" : 1635
      },
      {
        "key_as_string" : "2020-11-01T00:00:00.000Z",
        "key" : 1604188800000,
        "doc_count" : 6844
      },
      {
        "key_as_string" : "2020-12-01T00:00:00.000Z",
        "key" : 1606780800000,
        "doc_count" : 5595
      }
    ]
  }
}
} 

The response contains three months of logs. If you graph these values, you can see the peaks and valleys of request traffic to your website month by month.
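
If you need buckets of a fixed length instead of calendar units (months vary in length), date_histogram also accepts a fixed_interval; a sketch that groups the same logs into 7-day buckets:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "logs_per_week": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "7d"
      }
    }
  }
}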

range, date_range, ip_range

The range aggregation allows you to define the boundaries of each bucket yourself.

For example, you can count the documents whose bytes value falls between 1,000 and 2,000, 2,000 and 3,000, and 3,000 and 4,000. Within the ranges parameter, you define each range as an object in an array.

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "number_of_bytes_distribution": {
      "range": {
        "field": "bytes",
        "ranges": [
          {
            "from": 1000,
            "to": 2000
          },
          {
            "from": 2000,
            "to": 3000
          },
          {
            "from": 3000,
            "to": 4000
          }
        ]
      }
    }
  }
} 

Each bucket includes the value of its from key and excludes the value of its to key.

Sample response

...
"aggregations" : {
  "number_of_bytes_distribution" : {
    "buckets" : [
      {
        "key" : "1000.0-2000.0",
        "from" : 1000.0,
        "to" : 2000.0,
        "doc_count" : 805
      },
      {
        "key" : "2000.0-3000.0",
        "from" : 2000.0,
        "to" : 3000.0,
        "doc_count" : 1369
      },
      {
        "key" : "3000.0-4000.0",
        "from" : 3000.0,
        "to" : 4000.0,
        "doc_count" : 1422
      }
    ]
  }
 }
} 
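
By default, each bucket key is built from its bounds ("1000.0-2000.0"). If you prefer readable names, you can attach a key to each range; a sketch with hypothetical labels:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "number_of_bytes_distribution": {
      "range": {
        "field": "bytes",
        "ranges": [
          { "key": "small", "from": 1000, "to": 2000 },
          { "key": "medium", "from": 2000, "to": 3000 },
          { "key": "large", "from": 3000, "to": 4000 }
        ]
      }
    }
  }
}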

The date_range aggregation is conceptually the same as the range aggregation, but it lets you use date math. For example, you can get all the documents from the last 10 days. To make the dates in the response more readable, specify a format parameter.

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "number_of_bytes": {
      "date_range": {
        "field": "@timestamp",
        "format": "MM-yyyy",
        "ranges": [
          {
            "from": "now-10d/d",
            "to": "now"
          }
        ]
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "number_of_bytes" : {
    "buckets" : [
      {
        "key" : "03-2021-03-2021",
        "from" : 1.6145568E12,
        "from_as_string" : "03-2021",
        "to" : 1.615451329043E12,
        "to_as_string" : "03-2021",
        "doc_count" : 0
      }
    ]
  }
 }
} 
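
Because the format above only contains month and year, from_as_string and to_as_string both render as 03-2021. A sketch with a day-level format makes the bounds easier to read:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "number_of_bytes": {
      "date_range": {
        "field": "@timestamp",
        "format": "dd-MM-yyyy",
        "ranges": [
          { "from": "now-10d/d", "to": "now" }
        ]
      }
    }
  }
}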

The ip_range aggregation is for IP addresses and works on fields of type ip. You can define explicit IP ranges or masks in CIDR notation.

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "access": {
      "ip_range": {
        "field": "ip",
        "ranges": [
          {
            "from": "1.0.0.0",
            "to": "126.158.155.183"
          },
          {
            "mask": "1.0.0.0/8"
          }
        ]
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "access" : {
    "buckets" : [
      {
        "key" : "1.0.0.0/8",
        "from" : "1.0.0.0",
        "to" : "2.0.0.0",
        "doc_count" : 98
      },
      {
        "key" : "1.0.0.0-126.158.155.183",
        "from" : "1.0.0.0",
        "to" : "126.158.155.183",
        "doc_count" : 7184
      }
    ]
  }
 }
} 

filter, filters

The filter aggregation wraps a query clause, just like in a search query: match, term, range, and so on. You can use the filter aggregation to narrow the entire set of documents down to a specific subset before computing sub-aggregations.

The following example shows an avg aggregation performed in the context of a filter. The avg aggregation only considers the documents that match the range query:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "low_value": {
      "filter": {
        "range": {
          "taxful_total_price": {
            "lte": 50
          }
        }
      },
      "aggs": {
        "avg_amount": {
          "avg": {
            "field": "taxful_total_price"
          }
        }
      }
    }
  }
} 

Sample response

"aggregations" : {
  "low_value" : {
    "doc_count" : 1633,
    "avg_amount" : {
      "value" : 38.363175998928355
    }
  }
 }
} 

The filters aggregation is the same as the filter aggregation, but it lets you define multiple filters. While the filter aggregation results in a single bucket, the filters aggregation returns multiple buckets, one for each of the defined filters.

To create a bucket for all documents that do not match any of the filter queries, set the other_bucket property to true:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "200_os": {
      "filters": {
        "other_bucket": true,
        "filters": [
          {
            "term": {
              "response.keyword": "200"
            }
          },
          {
            "term": {
              "machine.os.keyword": "osx"
            }
          }
        ]
      },
      "aggs": {
        "avg_amount": {
          "avg": {
            "field": "bytes"
          }
        }
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "200_os" : {
    "buckets" : [
      {
        "doc_count" : 12832,
        "avg_amount" : {
          "value" : 5897.852711970075
        }
      },
      {
        "doc_count" : 2825,
        "avg_amount" : {
          "value" : 5620.347256637168
        }
      },
      {
        "doc_count" : 1017,
        "avg_amount" : {
          "value" : 3247.0963618485744
        }
      }
    ]
  }
 }
} 
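
Note that with the array form the buckets in the response are anonymous and appear in the order of the filters (the other bucket last). You can also pass filters as an object to get named buckets, and use other_bucket_key to name the catch-all bucket; a sketch with hypothetical bucket names:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "200_os": {
      "filters": {
        "other_bucket_key": "everything_else",
        "filters": {
          "ok_responses": { "term": { "response.keyword": "200" } },
          "osx_machines": { "term": { "machine.os.keyword": "osx" } }
        }
      }
    }
  }
}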

global

The global aggregation lets you step outside the aggregation context of the search query. Even if you have included a query that narrows the set of documents, a global aggregation operates on all documents as if the query were not there: it ignores the query context and implicitly assumes a match_all query.

The following example returns the average value of the taxful_total_price field from all documents in the index:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "query": {
    "range": {
      "taxful_total_price": {
        "lte": 50
      }
    }
  },
  "aggs": {
    "total_avg_amount": {
      "global": {},
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "taxful_total_price"
          }
        }
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "total_avg_amount" : {
    "doc_count" : 4675,
    "avg_price" : {
      "value" : 75.05542864304813
    }
  }
 }
} 

It can be seen that the average taxful_total_price is 75.05 across all 4,675 documents, and not 38.36 as in the filter example, where only the documents matching the range query were aggregated.

geo_distance, geohash_grid

The geo_distance aggregation groups documents into concentric rings based on their distance from an origin geo-point. It is the same idea as the range aggregation, but it works on geographic distances.
For example, you can use a geo_distance aggregation to count all pizzerias within a 1 km radius, and add a second ring for those between 1 and 2 km.

You can use geo_distance aggregation only on fields mapped as geo_point.

A point is a single geographic coordinate, such as the current location shown by your smartphone. In Elasticsearch, a geo-point can be represented as an object with lat and lon keys:

{
  "location": {
    "lat": 83.76,
    "lon": -81.2
  }
}

You can also specify the point as an array [-81.20, 83.76] (longitude first) or as a string "83.76, -81.20" (latitude first).
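
The geo_point mapping must be declared explicitly before indexing, since Elasticsearch does not infer it from the JSON alone; a minimal mapping sketch for a hypothetical locations index:

PUT locations
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}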

The relevant parameters of a geo_distance aggregation are the following:

  • field (required): the geo-point field to compute distances on.
  • origin (required): the geo-point from which the distances are computed.
  • ranges (required): a list of distance ranges used to collect documents based on their distance from the origin.
  • unit (optional): the unit used in the ranges array; defaults to m (meters), but you can switch to other units such as km (kilometers), mi (miles), in (inches), yd (yards), cm (centimeters), and mm (millimeters).
  • distance_type (optional): how Elasticsearch calculates distances; the default, arc, is accurate but slower, while plane is faster but less accurate. Because of its high error margin, use plane only for small geographic areas.

The syntax is as follows:

{
  "aggs": {
    "aggregation_name": {
      "geo_distance": {
        "field": "field_1",
        "origin": "x, y",
        "ranges": [
          {
            "to": "value_1"
          },
          {
            "from": "value_2",
            "to": "value_3"
          },
          {
            "from": "value_4"
          }
        ]
      }
    }
  }
} 

This example forms buckets based on the following distances from a geo-point field (note the unit parameter, since ranges are expressed in meters by default):

  • Less than 10 km
  • From 10 to 20 km
  • From 20 to 50 km
  • From 50 to 100 km
  • More than 100 km

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "position": {
      "geo_distance": {
        "field": "geo.coordinates",
        "origin": {
          "lat": 83.76,
          "lon": -81.2
        },
        "unit": "km",
        "ranges": [
          {
            "to": 10
          },
          {
            "from": 10,
            "to": 20
          },
          {
            "from": 20,
            "to": 50
          },
          {
            "from": 50,
            "to": 100
          },
          {
            "from": 100
          }
        ]
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "position" : {
    "buckets" : [
      {
        "key" : "*-10.0",
        "from" : 0.0,
        "to" : 10.0,
        "doc_count" : 0
      },
      {
        "key" : "10.0-20.0",
        "from" : 10.0,
        "to" : 20.0,
        "doc_count" : 0
      },
      {
        "key" : "20.0-50.0",
        "from" : 20.0,
        "to" : 50.0,
        "doc_count" : 0
      },
      {
        "key" : "50.0-100.0",
        "from" : 50.0,
        "to" : 100.0,
        "doc_count" : 0
      },
      {
        "key" : "100.0-*",
        "from" : 100.0,
        "doc_count" : 14074
      }
    ]
  }
 }
} 
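
The unit and distance_type parameters described above plug into the same request. For example, a sketch that measures distances in miles using the faster (and less accurate) plane calculation:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "position": {
      "geo_distance": {
        "field": "geo.coordinates",
        "origin": { "lat": 83.76, "lon": -81.2 },
        "unit": "mi",
        "distance_type": "plane",
        "ranges": [
          { "to": 100 },
          { "from": 100 }
        ]
      }
    }
  }
}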

The geohash_grid aggregation buckets documents for geographic analysis. It organizes a geographic region into a grid of smaller cells at different precisions: lower precision values represent larger geographic regions, while higher values represent smaller, more precise regions.

The number of results returned by a query may be too large to display every single geographic point on a map. The geohash_grid aggregation therefore groups neighboring geo-points by computing the geohash of each point at a user-defined level of precision (from 1 to 12; the default is 5).

The data in the web log example is distributed over a large geographic area, so a lower precision value can be used. You can zoom in on the map by increasing the precision value:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "geo_hash": {
      "geohash_grid": {
        "field": "geo.coordinates",
        "precision": 4
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "geo_hash" : {
    "buckets" : [
      {
        "key" : "c1cg",
        "doc_count" : 104
      },
      {
        "key" : "dr5r",
        "doc_count" : 26
      },
      {
        "key" : "9q5b",
        "doc_count" : 20
      },
      {
        "key" : "c20g",
        "doc_count" : 19
      },
      {
        "key" : "dr70",
        "doc_count" : 18
      }
      ...
    ]
  }
 }
} 

You can view the aggregated response on a map using Kibana.

The more precise the aggregation, the more resources Elasticsearch consumes, because of the number of buckets it has to compute. By default, the aggregation does not generate more than 10,000 buckets. You can change this behavior with the size parameter, but keep in mind that performance may suffer for very wide queries spanning thousands of buckets.
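
For instance, a sketch that raises the bucket limit while increasing the precision (expect it to be slower on large indices):

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "geo_hash": {
      "geohash_grid": {
        "field": "geo.coordinates",
        "precision": 6,
        "size": 20000
      }
    }
  }
}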

adjacency_matrix

The adjacency_matrix aggregation allows you to define filter expressions and returns a matrix of the intersecting filters, where each nonempty cell in the matrix represents a bucket. You can find out how many documents fall into any combination of filters.

Use the adjacency_matrix aggregation to discover how concepts are related by displaying the data as a graph.

For example, in the sample eCommerce dataset, you can analyze how different manufacturers are related:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "interactions": {
      "adjacency_matrix": {
        "filters": {
          "grpA": {
            "match": {
              "manufacturer.keyword": "Low Tide Media"
            }
          },
          "grpB": {
            "match": {
              "manufacturer.keyword": "Elitelligence"
            }
          },
          "grpC": {
            "match": {
              "manufacturer.keyword": "Oceanavigations"
            }
          }
        }
      }
    }
  }
} 

Sample response

{
   ...
   "aggregations" : {
     "interactions" : {
       "buckets" : [
         {
           "key" : "grpA",
           "doc_count" : 1553
         },
         {
           "key" : "grpA&grpB",
           "doc_count" : 590
         },
         {
           "key" : "grpA&grpC",
           "doc_count" : 329
         },
         {
           "key" : "grpB",
           "doc_count" : 1370
         },
         {
           "key" : "grpB&grpC",
           "doc_count" : 299
         },
         {
           "key" : "grpC",
           "doc_count" : 1218
         }
       ]
     }
   }
 } 

Let’s take a closer look at the outcome:

{
    "key" : "grpA&grpB",
    "doc_count" : 590
} 
  • grpA: Products manufactured by Low Tide Media.

  • grpB: Products manufactured by Elitelligence.

  • 590: Number of products manufactured by both.

You can use Kibana to represent this data with a network graph.
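
The "&" separating intersection keys such as grpA&grpB is configurable through the separator option; a sketch with two of the previous filters:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "interactions": {
      "adjacency_matrix": {
        "separator": "_AND_",
        "filters": {
          "grpA": { "match": { "manufacturer.keyword": "Low Tide Media" } },
          "grpB": { "match": { "manufacturer.keyword": "Elitelligence" } }
        }
      }
    }
  }
}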

nested, reverse_nested

The nested aggregation lets you aggregate fields inside nested objects. The nested type is a specialized version of the object data type that allows arrays of objects to be indexed so that they can be queried independently of each other.

With the object type, all the data is stored in the same document, so a query can match values that come from different subdocuments. For example, imagine a log index with pages mapped as the object data type:

PUT logs/_doc/0
{
  "response": "200",
  "pages": [
    {
      "page": "landing",
      "load_time": 200
    },
    {
      "page": "blog",
      "load_time": 500
    }
  ]
} 

Internally, Elasticsearch flattens the array of objects into something like this:

{
  "logs": {
    "pages": ["landing", "blog"],
    "load_time": ["200", "500"]
  }
} 

Thus, if you searched this index for pages=landing and load_time=500, this document would match even though the load_time of the landing page is 200.
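
To see the problem concretely, the following bool query would (incorrectly) match the document above when pages is mapped as a plain object, because landing and 500 end up in the same flattened document even though they belong to different pages:

GET logs/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "pages.page": "landing" } },
        { "match": { "pages.load_time": 500 } }
      ]
    }
  }
}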

If you want to make sure such cross-object matches do not happen, map the field as a nested type:

PUT logs
{
  "mappings": {
    "properties": {
      "pages": {
        "type": "nested",
        "properties": {
          "page": { "type": "text" },
          "load_time": { "type": "double" }
        }
      }
    }
  }
} 

Nested documents allow the same JSON document to be indexed but keep the pages in separate Lucene documents, so that only queries such as pages=landing and load_time=200 return the expected result. Internally, nested objects index each object in the array as a separate hidden document, which means each nested object can be queried independently of the others.
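
With the nested mapping, queries on these fields must be wrapped in a nested query with the appropriate path; a sketch that now matches only when both conditions hold on the same page object:

GET logs/_search
{
  "query": {
    "nested": {
      "path": "pages",
      "query": {
        "bool": {
          "must": [
            { "match": { "pages.page": "landing" } },
            { "match": { "pages.load_time": 200 } }
          ]
        }
      }
    }
  }
}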

The nested aggregation likewise requires a path, relative to the parent, pointing at the nested documents:

GET logs/_search
{
  "query": {
    "match": { "response": "200" }
  },
  "aggs": {
    "pages": {
      "nested": {
        "path": "pages"
      },
      "aggs": {
        "min_load_time": { "min": { "field": "pages.load_time" } }
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "pages" : {
    "doc_count" : 2,
    "min_price" : {
      "value" : 200.0
    }
  }
 }
} 

You can also aggregate values from nested documents back to their parent documents; this is the reverse_nested aggregation. You can use reverse_nested to aggregate a field from the parent document after grouping by a field of the nested object: the aggregation "joins" back to the parent log entry for each bucket.

The reverse_nested aggregation is used as a sub-aggregation inside a nested aggregation. It accepts a single option, path, which defines how many steps back up the document hierarchy Elasticsearch should join; if path is omitted, it joins back to the root document.

GET logs/_search
{
  "query": {
    "match": { "response": "200" }
  },
  "aggs": {
    "pages": {
      "nested": {
        "path": "pages"
      },
      "aggs": {
        "top_pages_per_load_time": {
          "terms": {
            "field": "pages.load_time"
          },
          "aggs": {
            "comment_to_logs": {
              "reverse_nested": {},
              "aggs": {
                "min_load_time": {
                  "min": {
                    "field": "pages.load_time"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
} 

Sample response

...
"aggregations" : {
  "pages" : {
    "doc_count" : 2,
    "top_pages_per_load_time" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 200.0,
          "doc_count" : 1,
          "comment_to_logs" : {
            "doc_count" : 1,
            "min_load_time" : {
              "value" : null
            }
          }
        },
        {
          "key" : 500.0,
          "doc_count" : 1,
          "comment_to_logs" : {
            "doc_count" : 1,
            "min_load_time" : {
              "value" : null
            }
          }
        }
      ]
    }
  }
 }
} 

The response shows that the log index has one page with a loading time of 200 and one with a loading time of 500. Note that the min_load_time values are null: after reverse_nested, the aggregation runs in the context of the parent document, which has no direct access to the nested pages.load_time field, so in practice you would aggregate a parent-level field at that point.
