Elasticsearch: metric aggregations

In addition to text search, Elasticsearch supports data analysis through aggregations. Among the available aggregation types, metric aggregations compute statistics over one or more fields. Through examples, we will see what information this type of aggregation can extract.

As introduced in the article "Elasticsearch: the aggregation types", Elasticsearch supports not only data search but also data analysis. Of the aggregation types presented there, this article deals with metric aggregations. These perform simple calculations on a field, such as finding its minimum, maximum, and mean values.

Metric aggregations are of two types:

• single-value metric aggregations: return a single metric. For example, sum, min, max, avg, cardinality, and value_count.
• multi-value metric aggregations: return more than one metric. For example, stats, extended_stats, matrix_stats, percentiles, percentile_ranks, geo_bounds, top_hits, and scripted_metric.

Below we will look at some of these aggregations. The data we will use is the sample data provided by Kibana. To add it, log in to Kibana, go to Home and choose Try our sample data, then click Add data for both the eCommerce and the web logs data sets.

sum, min, max and avg

The sum, min, max, and avg metrics are single-value metric aggregations that return the sum, minimum, maximum, and average value of a field, respectively.

The following example calculates the total sum of the taxful_total_price field:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "sum_taxful_total_price": {
      "sum": {
        "field": "taxful_total_price"
      }
    }
  }
}

Sample response

  ...
  "aggregations" : {
    "sum_taxful_total_price" : {
      "value" : 350884.12890625
    }
  }
}

In a similar way, it is possible to find the minimum, maximum and mean values of a field.
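As a rough sketch of how the same request shape carries over to the other metrics, the Python snippet below builds a single body that asks for min, max, and avg together. The helper metric_aggs and the aggregation names (min_price and so on) are made up for illustration; only the field name and the metric names come from the text above.

```python
import json

def metric_aggs(field, metrics):
    """Build a search body with one single-value aggregation per metric.

    Hypothetical helper: the "<metric>_price" naming is only an example.
    """
    return {
        "size": 0,  # skip the hits, return only the aggregations
        "aggs": {f"{m}_price": {m: {"field": field}} for m in metrics},
    }

body = metric_aggs("taxful_total_price", ["min", "max", "avg"])
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of a GET kibana_sample_data_ecommerce/_search request in the Kibana console.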

cardinality

The cardinality metric is a single-value metric aggregation that counts the number of unique or distinct values of a field.

The following example finds the number of unique products in an e-commerce store:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "unique_products": {
      "cardinality": {
        "field": "products.product_id"
      }
    }
  }
}

Sample response

  ...
  "aggregations" : {
    "unique_products" : {
      "value" : 7033
    }
  }
}

The cardinality count is approximate. An exact count would require loading every distinct value into a hash set and returning its size. With tens of thousands of products or more, this approach does not scale: memory use grows with the number of distinct values, and latency grows with it. Elasticsearch therefore estimates the count with the HyperLogLog++ algorithm, which trades a small loss of accuracy for bounded memory use.
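To make the cost of exact counting concrete, here is a minimal Python sketch of the hash-set approach. It is only an illustration on toy data, not what Elasticsearch runs; the product_ids key and the sample orders are assumptions made for the example.

```python
def exact_cardinality(orders):
    """Count distinct product IDs by remembering every ID seen so far."""
    seen = set()  # grows by one entry per distinct value
    for order in orders:
        seen.update(order["product_ids"])
    return len(seen)

# Toy orders (hypothetical data); an ID may appear in several orders.
orders = [
    {"product_ids": [19024, 19260]},
    {"product_ids": [19260]},
    {"product_ids": [11238, 19024]},
]
print(exact_cardinality(orders))  # 3
```

The set holds one entry per distinct value, which is exactly the memory cost that an approximate count avoids.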

You can control the tradeoff between memory and precision with the precision_threshold setting. This setting defines the threshold below which counts are expected to be nearly accurate. Above this value, the counts may become somewhat less accurate. The default value of precision_threshold is 3,000. The maximum value supported is 40,000. The following example raises the threshold to 10,000:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "unique_products": {
      "cardinality": {
        "field": "products.product_id",
        "precision_threshold": 10000
      }
    }
  }
}

value_count

The value_count metric is a single-value metric aggregation that calculates the number of values on which an aggregation is based.

For example, you can use the value_count metric with the avg metric to find how many numbers the aggregation uses to calculate an average value.

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "number_of_values": {
      "value_count": {
        "field": "taxful_total_price"
      }
    }
  }
}

Sample response

  ...
  "aggregations" : {
    "number_of_values" : {
      "value" : 4675
    }
  }
}
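Note that value_count counts values, not documents, and does not remove duplicates. The two numbers coincide above because every order has exactly one taxful_total_price, but they differ for multi-valued fields. The Python sketch below illustrates the difference on toy documents; the documents and the value_count helper are assumptions for illustration, with the sku values borrowed from the sample data.

```python
def value_count(docs, field):
    """Count every value of `field`, including repeated ones."""
    total = 0
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # a document without the field contributes nothing
        total += len(value) if isinstance(value, list) else 1
    return total

docs = [
    {"sku": ["ZO0082400824", "ZO0071900719"]},  # two values
    {"sku": ["ZO0082400824", "ZO0071900719"]},  # same two values again
    {},                                         # field missing
]
print(len(docs))                 # 3 documents
print(value_count(docs, "sku"))  # 4 values
```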

stats, extended_stats, matrix_stats

The stats metric is a multi-value metric aggregation that returns the basic metrics count, min, max, avg, and sum in a single aggregation query.

The following example returns basic stats for the taxful_total_price field:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "stats_taxful_total_price": {
      "stats": {
        "field": "taxful_total_price"
      }
    }
  }
}

Sample response

  ...
  "aggregations" : {
    "stats_taxful_total_price" : {
      "count" : 4675,
      "min" : 6.98828125,
      "max" : 2250.0,
      "avg" : 75.05542864304813,
      "sum" : 350884.12890625
    }
  }
}
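The returned metrics are consistent with each other: avg is simply sum divided by count. As a quick sanity check, the following Python snippet recomputes the average from the values in the sample response above (the numbers are copied from that response; nothing is queried).

```python
import math

# Values copied from the stats sample response.
stats = {
    "count": 4675,
    "min": 6.98828125,
    "max": 2250.0,
    "avg": 75.05542864304813,
    "sum": 350884.12890625,
}

derived_avg = stats["sum"] / stats["count"]
print(derived_avg)
# Compare with a tolerance, since both numbers are floating-point values.
print(math.isclose(derived_avg, stats["avg"], rel_tol=1e-9))
```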

The extended_stats aggregation is an extended version of the stats aggregation. In addition to including basic statistics, extended_stats also returns statistics such as sum_of_squares, variance, and std_deviation.

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "extended_stats_taxful_total_price": {
      "extended_stats": {
        "field": "taxful_total_price"
      }
    }
  }
}

Sample response

  ...
  "aggregations" : {
    "extended_stats_taxful_total_price" : {
      "count" : 4675,
      "min" : 6.98828125,
      "max" : 2250.0,
      "avg" : 75.05542864304813,
      "sum" : 350884.12890625,
      "sum_of_squares" : 3.9367749294174194E7,
      "variance" : 2787.59157113862,
      "variance_population" : 2787.59157113862,
      "variance_sampling" : 2788.187974983536,
      "std_deviation" : 52.79764740155209,
      "std_deviation_population" : 52.79764740155209,
      "std_deviation_sampling" : 52.80329511482722,
      "std_deviation_bounds" : {
        "upper" : 180.6507234461523,
        "lower" : -30.53986616005605,
        "upper_population" : 180.6507234461523,
        "lower_population" : -30.53986616005605,
        "upper_sampling" : 180.66201887270256,
        "lower_sampling" : -30.551161586606312
      }
    }
  }
}
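The extra fields can be derived from count, avg, and sum_of_squares. The sketch below uses the textbook formulas (population variance as the mean of squares minus the squared mean, sampling variance with the n/(n-1) correction) and compares the results with the sample response above. How Elasticsearch computes these values internally is not covered here; the snippet only shows that the numbers are consistent.

```python
import math

# Values copied from the extended_stats sample response.
count = 4675
avg = 75.05542864304813
sum_of_squares = 3.9367749294174194e7

# Population variance: E[x^2] - (E[x])^2
variance_population = sum_of_squares / count - avg ** 2
# Sampling variance applies Bessel's correction n / (n - 1)
variance_sampling = variance_population * count / (count - 1)

std_deviation_population = math.sqrt(variance_population)
std_deviation_sampling = math.sqrt(variance_sampling)

print(variance_population, variance_sampling)
print(std_deviation_population, std_deviation_sampling)
```

In the sample response, the unqualified variance and std_deviation fields carry the same values as their _population counterparts.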

The std_deviation_bounds object gives the interval of plus or minus two standard deviations around the mean, which is a convenient way to visualize the variance of the data. To use a different number of standard deviations, such as 3, set sigma to 3:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "extended_stats_taxful_total_price": {
      "extended_stats": {
        "field": "taxful_total_price",
        "sigma": 3
      }
    }
  }
}

The matrix_stats aggregation generates advanced statistics for multiple fields in matrix form. The following example returns advanced statistics in matrix form for the taxful_total_price and products.base_price fields:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "matrix_stats_taxful_total_price": {
      "matrix_stats": {
        "fields": ["taxful_total_price", "products.base_price"]
      }
    }
  }
}

Sample response

  ...
  "aggregations" : {
    "matrix_stats_taxful_total_price" : {
      "doc_count" : 4675,
      "fields" : [
        {
          "name" : "products.base_price",
          "count" : 4675,
          "mean" : 34.994239430147196,
          "variance" : 360.5035285833703,
          "skewness" : 5.530161335032702,
          "kurtosis" : 131.16306324042148,
          "covariance" : {
            "products.base_price" : 360.5035285833703,
            "taxful_total_price" : 846.6489362233166
          },
          "correlation" : {
            "products.base_price" : 1.0,
            "taxful_total_price" : 0.8444765264325268
          }
        },
        {
          "name" : "taxful_total_price",
          "count" : 4675,
          "mean" : 75.05542864304839,
          "variance" : 2788.1879749835402,
          "skewness" : 15.812149139924037,
          "kurtosis" : 619.1235507385902,
          "covariance" : {
            "products.base_price" : 846.6489362233166,
            "taxful_total_price" : 2788.1879749835402
          },
          "correlation" : {
            "products.base_price" : 0.8444765264325268,
            "taxful_total_price" : 1.0
          }
        }
      ]
    }
  }
}

A description of the returned metrics is given in the table.

| Statistic | Description |
|---|---|
| count | The number of samples measured. |
| mean | The average value of the field, measured from the sample. |
| variance | How far the measured values of the field are spread out from the mean. The larger the variance, the more the values are spread around the mean. |
| skewness | A measure of the asymmetry of the distribution of the field's values around the mean. |
| kurtosis | A measure of the tail heaviness of a distribution. As the tails become lighter, kurtosis decreases. As the tails become heavier, kurtosis increases. |
| covariance | A measure of the joint variability of two fields. A positive value means that their values tend to move in the same direction; a negative value means that they tend to move in opposite directions. |
| correlation | A measure of the strength of the relationship between two fields. Valid values lie in the range [-1, 1]. A value of -1 means that the fields are perfectly negatively correlated, and a value of 1 means that they are perfectly positively correlated. A value of 0 means that there is no identifiable linear relationship between them. |
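Covariance and correlation are linked: the correlation is the covariance divided by the product of the two standard deviations. As a hedged check on the sample response above, the following Python snippet applies that standard formula to the returned variances and covariance and compares the result with the returned correlation. The variable names are illustrative only.

```python
import math

# Values copied from the matrix_stats sample response.
var_base_price = 360.5035285833703
var_total_price = 2788.1879749835402
covariance = 846.6489362233166

# Pearson correlation: cov(x, y) / (std(x) * std(y))
correlation = covariance / math.sqrt(var_base_price * var_total_price)
print(correlation)
```

The same formula explains the diagonal of the matrix: the covariance of a field with itself is its variance, so its correlation with itself is 1.0.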

percentiles, percentile_ranks

A percentile is the value at or below which a given percentage of the data falls. For example, the 95th percentile is the value that is greater than or equal to 95% of the observed values.

The percentiles metric is a multi-value metric aggregation that helps to find outliers in the data or to understand how the data is distributed.

Like the cardinality metric, the percentiles metric is also approximate.

The following example calculates the default percentiles (1, 5, 25, 50, 75, 95, and 99) of the taxful_total_price field:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "percentile_taxful_total_price": {
      "percentiles": {
        "field": "taxful_total_price"
      }
    }
  }
}

Sample response

  ...
  "aggregations" : {
    "percentile_taxful_total_price" : {
      "values" : {
        "1.0" : 21.984375,
        "5.0" : 27.984375,
        "25.0" : 44.96875,
        "50.0" : 64.22061688311689,
        "75.0" : 93.0,
        "95.0" : 156.0,
        "99.0" : 222.0
      }
    }
  }
}
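To make the definition concrete, here is a small Python sketch that computes an exact percentile on a toy list using the nearest-rank method. This is one of several common definitions and is not the approximate algorithm Elasticsearch uses, so results on real data can differ slightly from the response above. The function name and the sample prices are assumptions for illustration.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest value v such that at least p% of the data is <= v."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

prices = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # toy data
print(percentile_nearest_rank(prices, 50))  # 50
print(percentile_nearest_rank(prices, 95))  # 100
```

Unlike this sketch, an approximate algorithm does not need to sort and keep every value, which is what makes percentiles affordable on large indices.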

The percentile rank of a value is the percentage of observed values that are at or below it. For example, if a value is greater than or equal to 80% of the observed values, it has a percentile rank of 80. The following example computes the percentile ranks of the values 10 and 15 for the taxful_total_price field:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "percentile_rank_taxful_total_price": {
      "percentile_ranks": {
        "field": "taxful_total_price",
        "values": [
          10,
          15
        ]
      }
    }
  }
}

Sample response

  ...
  "aggregations" : {
    "percentile_rank_taxful_total_price" : {
      "values" : {
        "10.0" : 0.055096056411283456,
        "15.0" : 0.0830092961834656
      }
    }
  }
}
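The ranks in the response are percentages, so only about 0.055% of the orders have a total of 10 or less. A minimal exact version of the calculation, on toy data, might look like the following Python sketch. The helper name and the prices are made up for the example, and Elasticsearch itself returns an approximation rather than this exact count.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100 * at_or_below / len(values)

prices = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # toy data
print(percentile_rank(prices, 80))  # 80.0
print(percentile_rank(prices, 5))   # 0.0
```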

geo_bounds

The geo_bounds metric is a multi-value metric aggregation that calculates the bounding box, in terms of latitude and longitude, that contains all the values of a geo_point field.

The following example returns the geo_bounds metric for the geoip.location field:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "geo": {
      "geo_bounds": {
        "field": "geoip.location"
      }
    }
  }
}

Sample response

  ...
  "aggregations" : {
    "geo" : {
      "bounds" : {
        "top_left" : {
          "lat" : 52.49999997206032,
          "lon" : -118.20000001229346
        },
        "bottom_right" : {
          "lat" : 4.599999985657632,
          "lon" : 55.299999956041574
        }
      }
    }
  }
}
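Conceptually, the bounding box is built from the extreme coordinates: top_left combines the largest latitude with the smallest longitude, and bottom_right combines the smallest latitude with the largest longitude. The Python sketch below shows this on three toy points. It is a simplification that ignores boxes crossing the 180° meridian; the function is a hypothetical helper and, apart from the Cairo coordinates that appear in the sample data, the points are invented.

```python
def geo_bounds(points):
    """Bounding box of lat/lon points (no antimeridian handling)."""
    lats = [p["lat"] for p in points]
    lons = [p["lon"] for p in points]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }

points = [
    {"lat": 30.1, "lon": 31.3},    # Cairo, as in the sample data
    {"lat": 52.5, "lon": -118.2},  # invented point
    {"lat": 4.6, "lon": 55.3},     # invented point
]
print(geo_bounds(points))
```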

top_hits

The top_hits metric is a multi-value metric aggregation that returns the most relevant documents among those being aggregated, ranked by relevance score by default.

The following options can be specified:

• from: The offset of the first hit to return.
• size: The maximum number of hits to return. The default value is 3.
• sort: How the matching hits are sorted. By default, the hits are sorted by the relevance score of the main query.
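The interplay of the three options can be pictured as "sort, skip, take". The Python sketch below mimics that behaviour on a toy list, sorting by a field rather than by score. It is only an analogy for how sort, from, and size combine, not a description of how Elasticsearch implements the aggregation; the function, its parameter names, and all orders except the one with ID 583581 are assumptions.

```python
def top_hits(docs, sort_field, size=3, from_=0, descending=True):
    """Sort the documents, skip `from_` of them, then keep `size`."""
    ordered = sorted(docs, key=lambda d: d[sort_field], reverse=descending)
    return ordered[from_:from_ + size]

orders = [
    {"order_id": 583581, "taxful_total_price": 43.98},  # from the sample data
    {"order_id": 1, "taxful_total_price": 120.00},      # invented
    {"order_id": 2, "taxful_total_price": 15.50},       # invented
    {"order_id": 3, "taxful_total_price": 99.99},       # invented
]

# The two most expensive orders.
print([o["order_id"] for o in top_hits(orders, "taxful_total_price", size=2)])
# The same ranking, skipping the first hit.
print([o["order_id"] for o in top_hits(orders, "taxful_total_price", size=2, from_=1)])
```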

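The following example returns the top 5 documents in the e-commerce data, where each document is an order. Because the request contains no query, all documents have the same score of 1.0: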

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "top_hits_products": {
      "top_hits": {
        "size": 5
      }
    }
  }
}

Sample response

  ...
  "aggregations" : {
    "top_hits_products" : {
      "hits" : {
        "total" : {
          "value" : 4675,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "kibana_sample_data_ecommerce",
            "_type" : "_doc",
            "_id" : "glMlwXcBQVLeQPrkHPtI",
            "_score" : 1.0,
            "_source" : {
              "category" : [
                "Women's Accessories",
                "Women's Clothing"
              ],
              "currency" : "EUR",
              "customer_first_name" : "rania",
              "customer_full_name" : "rania Evans",
              "customer_gender" : "FEMALE",
              "customer_id" : 24,
              "customer_last_name" : "Evans",
              "customer_phone" : "",
              "day_of_week" : "Sunday",
              "day_of_week_i" : 6,
              "email" : "[email protected]",
              "manufacturer" : [
                "Tigress Enterprises"
              ],
              "order_date" : "2021-02-28T14:16:48+00:00",
              "order_id" : 583581,
              "products" : [
                {
                  "base_price" : 10.99,
                  "discount_percentage" : 0,
                  "quantity" : 1,
                  "manufacturer" : "Tigress Enterprises",
                  "tax_amount" : 0,
                  "product_id" : 19024,
                  "category" : "Women's Accessories",
                  "sku" : "ZO0082400824",
                  "taxless_price" : 10.99,
                  "unit_discount_amount" : 0,
                  "min_price" : 5.17,
                  "_id" : "sold_product_583581_19024",
                  "discount_amount" : 0,
                  "created_on" : "2016-12-25T14:16:48+00:00",
                  "product_name" : "Snood - white/grey/peach",
                  "price" : 10.99,
                  "taxful_price" : 10.99,
                  "base_unit_price" : 10.99
                },
                {
                  "base_price" : 32.99,
                  "discount_percentage" : 0,
                  "quantity" : 1,
                  "manufacturer" : "Tigress Enterprises",
                  "tax_amount" : 0,
                  "product_id" : 19260,
                  "category" : "Women's Clothing",
                  "sku" : "ZO0071900719",
                  "taxless_price" : 32.99,
                  "unit_discount_amount" : 0,
                  "min_price" : 17.15,
                  "_id" : "sold_product_583581_19260",
                  "discount_amount" : 0,
                  "created_on" : "2016-12-25T14:16:48+00:00",
                  "product_name" : "Cardigan - grey",
                  "price" : 32.99,
                  "taxful_price" : 32.99,
                  "base_unit_price" : 32.99
                }
              ],
              "sku" : [
                "ZO0082400824",
                "ZO0071900719"
              ],
              "taxful_total_price" : 43.98,
              "taxless_total_price" : 43.98,
              "total_quantity" : 2,
              "total_unique_products" : 2,
              "type" : "order",
              "user" : "rani",
              "geoip" : {
                "country_iso_code" : "EG",
                "location" : {
                  "lon" : 31.3,
                  "lat" : 30.1
                },
                "region_name" : "Cairo Governorate",
                "continent_name" : "Africa",
                "city_name" : "Cairo"
              },
              "event" : {
                "dataset" : "sample_ecommerce"
              }
            }
            ...
          }
        ]
      }
    }
  }
}


scripted_metric

The scripted_metric is a multi-value metric aggregation that returns metrics calculated by user-supplied scripts.

The aggregation runs in four phases, each with its own script: the init phase, the map phase, the combine phase, and the reduce phase.

• init_script: (OPTIONAL) Sets the initial state. It runs on each shard before any document is collected.
• map_script: Runs once for each collected document and updates the state.
• combine_script: Runs once on each shard after document collection is complete and returns that shard's state to the coordinating node.
• reduce_script: Runs once on the coordinating node. It has access to the states variable, an array that holds the result of combine_script from every shard, and returns the final value.

The following example aggregates the different HTTP response types in the web log data:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggregations": {
    "responses.counts": {
      "scripted_metric": {
        "init_script": "state.responses = ['error':0L,'success':0L,'other':0L]",
        "map_script": """
          def code = doc['response.keyword'].value;
          if (code.startsWith('5') || code.startsWith('4')) {
            state.responses.error += 1;
          } else if (code.startsWith('2')) {
            state.responses.success += 1;
          } else {
            state.responses.other += 1;
          }
        """,
        "combine_script": "state.responses",
        "reduce_script": """
          def counts = ['error': 0L, 'success': 0L, 'other': 0L];
          for (responses in states) {
            counts.error += responses['error'];
            counts.success += responses['success'];
            counts.other += responses['other'];
          }
          return counts;
        """
      }
    }
  }
}

Sample response

  ...
  "aggregations" : {
    "responses.counts" : {
      "value" : {
        "other" : 0,
        "success" : 12832,
        "error" : 1242
      }
    }
  }
}
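To see how the four phases fit together, the flow can be mimicked in plain Python. The sketch below is only an analogy for the scripts above: each inner list stands in for the documents of one shard, the Python functions mirror the corresponding Painless scripts, and the toy documents are invented.

```python
def init_script():
    """Create the per-shard state before any document is collected."""
    return {"responses": {"error": 0, "success": 0, "other": 0}}

def map_script(state, doc):
    """Update the shard state for one document."""
    code = doc["response"]
    if code.startswith("5") or code.startswith("4"):
        state["responses"]["error"] += 1
    elif code.startswith("2"):
        state["responses"]["success"] += 1
    else:
        state["responses"]["other"] += 1

def combine_script(state):
    """Return what the shard sends to the coordinating node."""
    return state["responses"]

def reduce_script(states):
    """Merge the per-shard results into the final value."""
    counts = {"error": 0, "success": 0, "other": 0}
    for responses in states:
        for key in counts:
            counts[key] += responses[key]
    return counts

# Two pretend shards with invented documents.
shards = [
    [{"response": "200"}, {"response": "404"}],
    [{"response": "503"}, {"response": "200"}, {"response": "200"}],
]

states = []
for shard_docs in shards:
    state = init_script()
    for doc in shard_docs:
        map_script(state, doc)
    states.append(combine_script(state))

print(reduce_script(states))  # {'error': 2, 'success': 3, 'other': 0}
```

Keeping the per-shard state small matters in practice, because each shard's combined result has to be sent to the coordinating node before the reduce phase can run.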
