In the previous articles, Elasticsearch: bucket aggregations [part 1] and Elasticsearch: bucket aggregations [part 2], we saw how Elasticsearch lets us extract statistics and create subsets from the data. Some of those functions return very useful information on their own. In many cases, however, the result of one aggregation has to feed a further analysis. This is what aggregation pipelines (pipeline aggregations, in Elasticsearch's terminology) are for: they chain aggregations, sending the results of one aggregation as input to another to obtain a more detailed result. Aggregation pipelines can compute statistical and mathematical measures such as derivatives, moving averages, and cumulative sums.

## Syntax of pipeline aggregation

An aggregation pipeline uses the buckets_path property to access the results of other aggregations. The buckets_path property has a specific syntax:

```
buckets_path = <AGG_NAME>[<AGG_SEPARATOR>,<AGG_NAME>]*
[<METRIC_SEPARATOR>, <METRIC>];
```

where:

- **AGG_NAME** is the name of the aggregation.
- **AGG_SEPARATOR** separates the aggregations. It is represented as >.
- **METRIC_SEPARATOR** separates aggregations from their metrics. It is represented as a period (.).
- **METRIC** is the name of the metric, in the case of multi-valued metric aggregations.

For example, my_sum.sum selects the sum metric of an aggregation named my_sum. popular_tags>my_sum.sum nests my_sum.sum in the popular_tags aggregation.
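As a rough illustration of the syntax, the following Python sketch splits a buckets_path string into its aggregation names and its optional metric. The function name parse_buckets_path is hypothetical, and the logic is a simplification: it assumes that aggregation names contain neither > nor a period, and it ignores any special paths that Elasticsearch may support.

```python
def parse_buckets_path(path: str):
    """Split a buckets_path into (list of aggregation names, metric or None).

    Simplified sketch: assumes names contain neither '>' nor '.'.
    """
    # '>' separates the aggregations along the path
    *parents, last = path.split(">")
    # '.' separates the last aggregation from its metric, if any
    name, _, metric = last.partition(".")
    return parents + [name], (metric or None)


print(parse_buckets_path("my_sum.sum"))
print(parse_buckets_path("popular_tags>my_sum.sum"))
print(parse_buckets_path("number_of_bytes>sum_total_memory"))
```

The last call uses the path from the quick example in this article: two aggregation names and no explicit metric, because sum is a single-valued metric.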

The following additional parameters can also be specified:

- **gap_policy**: Real-world data may contain gaps or null values. The gap_policy property specifies how to handle this missing data. Set it to skip to ignore the missing data and continue from the next available value, or to insert_zeros to replace the missing values with zero and continue execution.
- **format**: The format of the output value. For example, yyyy-MM-dd for a date value.
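To make the difference between the two gap policies concrete, here is a small Python sketch of how they might change the input of an average. The helper apply_gap_policy and the sample values are hypothetical; the real effect depends on the pipeline aggregation that consumes the values.

```python
def apply_gap_policy(values, gap_policy="skip"):
    """Return the values a pipeline would see under the given gap policy.

    None stands for a missing bucket value (hypothetical representation).
    """
    if gap_policy == "skip":
        # Ignore missing values and continue with the next available one
        return [v for v in values if v is not None]
    if gap_policy == "insert_zeros":
        # Replace missing values with zero
        return [0.0 if v is None else v for v in values]
    raise ValueError(f"unsupported gap_policy: {gap_policy}")


bucket_values = [10.0, None, 30.0]  # made-up values with one gap

for policy in ("skip", "insert_zeros"):
    seen = apply_gap_policy(bucket_values, policy)
    print(policy, seen, sum(seen) / len(seen))
```

With skip the average is taken over two values, with insert_zeros over three, so the same data can yield different results depending on the policy.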

### Quick example

To sum all the buckets returned by the sum_total_memory aggregation:

```
GET kibana_sample_data_logs/_search
{
"size": 0,
"aggs": {
"number_of_bytes": {
"histogram": {
"field": "bytes",
"interval": 10000
},
"aggs": {
"sum_total_memory": {
"sum": {
"field": "phpmemory"
}
}
}
},
"sum_copies": {
"sum_bucket": {
"buckets_path": "number_of_bytes>sum_total_memory"
}
}
}
}
```

### Sample response

```
...
"aggregations" : {
"number_of_bytes" : {
"buckets" : [
{
"key" : 0.0,
"doc_count" : 13372,
"sum_total_memory" : {
"value" : 9.12664E7
}
},
{
"key" : 10000.0,
"doc_count" : 702,
"sum_total_memory" : {
"value" : 0.0
}
}
]
},
"sum_copies" : {
"value" : 9.12664E7
}
}
}
```

## Types of aggregation pipelines

Aggregation pipelines are of two types:

### Sibling aggregations

Sibling aggregations take the output of a nested aggregation and produce new buckets or new aggregations at the same level as the nested buckets. The aggregation they read from must be a multi-bucket aggregation (one that groups the values of a field into several buckets), and the metric must be a numeric value. min_bucket, max_bucket, sum_bucket, and avg_bucket are common sibling aggregations.

### Parent aggregations

Parent aggregations take the output of an enclosing (outer) aggregation and produce new buckets or new aggregations at the same level as the existing buckets. The enclosing histogram must have min_doc_count set to 0 (the default value for histogram aggregations), and the specified metric must be a numeric value. If min_doc_count is greater than 0, some buckets are omitted, which may lead to incorrect results. derivative and cumulative_sum are common parent aggregations.

Below we will see the most commonly used aggregations belonging to the two categories.

## avg_bucket, sum_bucket, min_bucket, max_bucket

The avg_bucket, sum_bucket, min_bucket and max_bucket aggregations are sibling aggregations that calculate the mean, sum, minimum, and maximum of a metric across the buckets of a previous aggregation.

The following example creates a date histogram with an interval of one month. The sum sub-aggregation calculates the sum of all bytes for each month. Finally, the avg_bucket aggregation uses these sums to calculate the average number of bytes per month:

```
POST kibana_sample_data_logs/_search
{
"size": 0,
"aggs": {
"visits_per_month": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "month"
},
"aggs": {
"sum_of_bytes": {
"sum": {
"field": "bytes"
}
}
}
},
"avg_monthly_bytes": {
"avg_bucket": {
"buckets_path": "visits_per_month>sum_of_bytes"
}
}
}
}
```

### Sample response

```
...
"aggregations" : {
"visits_per_month" : {
"buckets" : [
{
"key_as_string" : "2020-10-01T00:00:00.000Z",
"key" : 1601510400000,
"doc_count" : 1635,
"sum_of_bytes" : {
"value" : 9400200.0
}
},
{
"key_as_string" : "2020-11-01T00:00:00.000Z",
"key" : 1604188800000,
"doc_count" : 6844,
"sum_of_bytes" : {
"value" : 3.8880434E7
}
},
{
"key_as_string" : "2020-12-01T00:00:00.000Z",
"key" : 1606780800000,
"doc_count" : 5595,
"sum_of_bytes" : {
"value" : 3.1445055E7
}
}
]
},
"avg_monthly_bytes" : {
"value" : 2.6575229666666668E7
}
}
}
```

Similarly, sum_bucket, min_bucket and max_bucket values can be calculated for bytes per month.
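The arithmetic behind these sibling aggregations can be checked with a few lines of Python. This is only a sketch of the calculation on the three monthly sums from the sample response above, not of how Elasticsearch computes it internally.

```python
# Monthly sum_of_bytes values copied from the sample response
monthly_sums = [9400200.0, 3.8880434e7, 3.1445055e7]

sibling_results = {
    "avg_bucket": sum(monthly_sums) / len(monthly_sums),
    "sum_bucket": sum(monthly_sums),
    "min_bucket": min(monthly_sums),
    "max_bucket": max(monthly_sums),
}

for name, value in sibling_results.items():
    print(name, value)
```

The avg_bucket entry corresponds to the avg_monthly_bytes value in the response (about 2.6575E7).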

## stats_bucket, extended_stats_bucket

The stats_bucket aggregation is a sibling aggregation that returns a set of statistics (count, min, max, avg, and sum) for the buckets from a previous aggregation.

The following example returns basic statistics for the buckets returned by the sum_of_bytes aggregation nested in the visits_per_month aggregation:

```
GET kibana_sample_data_logs/_search
{
"size": 0,
"aggs": {
"visits_per_month": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "month"
},
"aggs": {
"sum_of_bytes": {
"sum": {
"field": "bytes"
}
}
}
},
"stats_monthly_bytes": {
"stats_bucket": {
"buckets_path": "visits_per_month>sum_of_bytes"
}
}
}
}
```

### Sample response

```
"stats_monthly_bytes" : {
"count" : 3,
"min" : 9400200.0,
"max" : 3.8880434E7,
"avg" : 2.6575229666666668E7,
"sum" : 7.9725689E7
}
}
}
```

The extended_stats_bucket aggregation is an extended version of the stats_bucket aggregation. In addition to the basic statistics, extended_stats_bucket also provides statistics such as sum_of_squares, variance, and std_deviation.

### Sample response

```
"stats_monthly_visits" : {
"count" : 3,
"min" : 9400200.0,
"max" : 3.8880434E7,
"avg" : 2.6575229666666668E7,
"sum" : 7.9725689E7,
"sum_of_squares" : 2.588843392021381E15,
"variance" : 1.5670496550438025E14,
"variance_population" : 1.5670496550438025E14,
"variance_sampling" : 2.3505744825657038E14,
"std_deviation" : 1.251818539183616E7,
"std_deviation_population" : 1.251818539183616E7,
"std_deviation_sampling" : 1.5331583357780447E7,
"std_deviation_bounds" : {
"upper" : 5.161160045033899E7,
"lower" : 1538858.8829943463,
"upper_population" : 5.161160045033899E7,
"lower_population" : 1538858.8829943463,
"upper_sampling" : 5.723839638222756E7,
"lower_sampling" : -4087937.0488942266
}
}
}
}
```
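The extended statistics can be reproduced from the three monthly sums with Python's statistics module. This is a sketch for checking the numbers in the sample response; the function name extended_stats is hypothetical, and the bounds assume a width of two standard deviations around the mean, which matches the values shown above.

```python
import statistics

# Monthly sums copied from the sample responses in this article
monthly_sums = [9400200.0, 3.8880434e7, 3.1445055e7]


def extended_stats(values, sigma=2.0):
    """Compute a subset of the extended statistics for a list of bucket values."""
    avg = sum(values) / len(values)
    var_pop = statistics.pvariance(values)   # population variance (divide by n)
    var_samp = statistics.variance(values)   # sample variance (divide by n - 1)
    std_pop = var_pop ** 0.5
    return {
        "count": len(values),
        "avg": avg,
        "sum_of_squares": sum(v * v for v in values),
        "variance_population": var_pop,
        "variance_sampling": var_samp,
        "std_deviation_population": std_pop,
        "std_deviation_sampling": var_samp ** 0.5,
        # Bounds: mean plus/minus sigma population standard deviations
        "upper_population": avg + sigma * std_pop,
        "lower_population": avg - sigma * std_pop,
    }


for name, value in extended_stats(monthly_sums).items():
    print(name, value)
```

Small differences in the last digits with respect to the response are expected, since the order of floating-point operations is not the same.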

## bucket_script, bucket_selector

The bucket_script aggregation is a parent aggregation that executes a script to perform a calculation for each bucket, using the metrics of the enclosing multi-bucket aggregation. The metrics must be numeric, and the script must return a numeric value.

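Use the script parameter to add the script. The script can be inline, in a file, or in an index, depending on the Elasticsearch version. On old releases, inline scripts had to be enabled by adding the following line to the elasticsearch.yml file in the config folder:

**script.inline: on**

In the releases that accept the calendar_interval syntax used in this article, this setting is no longer needed: inline Painless scripts, like the ones shown below, are allowed by default.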

The buckets_path property consists of multiple entries. Each entry consists of a key and a value. The key is the name under which the value is available in the script (through params), and the value is the path to the metric.

The basic syntax is:

```
{
"bucket_script": {
"buckets_path": {
"my_var1": "the_sum",
"my_var2": "the_value_count"
},
"script": "params.my_var1 / params.my_var2"
}
}
```

The following example applies a sum aggregation to the buckets generated by a histogram on the bytes field. From the resulting bucket values, it calculates, for each 10,000-byte interval, the share of the total RAM that belongs to requests for files with the zip extension:

```
GET kibana_sample_data_logs/_search
{
"size": 0,
"aggs": {
"sales_per_month": {
"histogram": {
"field": "bytes",
"interval": 10000
},
"aggs": {
"total_ram": {
"sum": {
"field": "machine.ram"
}
},
"ext-type": {
"filter": {
"term": {
"extension.keyword": "zip"
}
},
"aggs": {
"total_ram": {
"sum": {
"field": "machine.ram"
}
}
}
},
"ram-percentage": {
"bucket_script": {
"buckets_path": {
"machineRam": "ext-type>total_ram",
"totalRam": "total_ram"
},
"script": "params.machineRam / params.totalRam"
}
}
}
}
}
}
```

### Sample response

```
"aggregations" : {
"sales_per_month" : {
"buckets" : [
{
"key" : 0.0,
"doc_count" : 13372,
"ext-type" : {
"doc_count" : 1558,
"total_ram" : {
"value" : 2.0090783268864E13
}
},
"total_ram" : {
"value" : 1.7214228922368E14
},
"ram-percentage" : {
"value" : 0.11671032934131736
}
},
{
"key" : 10000.0,
"doc_count" : 702,
"ext-type" : {
"doc_count" : 116,
"total_ram" : {
"value" : 1.622423896064E12
}
},
"total_ram" : {
"value" : 9.015136354304E12
},
"ram-percentage" : {
"value" : 0.17996665078608862
}
}
]
}
}
}
```

The RAM ratio (a fraction between 0 and 1; multiply it by 100 to obtain a percentage) is calculated and added to the end of each bucket.
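The per-bucket calculation can be reproduced in Python from the numbers in the sample response. This sketch only mirrors the division performed by the script; the dictionary layout and the function name ram_ratio are hypothetical.

```python
# Values copied from the sample response: RAM of the zip-filtered
# sub-aggregation and total RAM, for each histogram bucket
buckets = [
    {"key": 0.0, "zip_ram": 2.0090783268864e13, "total_ram": 1.7214228922368e14},
    {"key": 10000.0, "zip_ram": 1.622423896064e12, "total_ram": 9.015136354304e12},
]


def ram_ratio(bucket):
    """Mimic the script: params.machineRam / params.totalRam."""
    params = {"machineRam": bucket["zip_ram"], "totalRam": bucket["total_ram"]}
    return params["machineRam"] / params["totalRam"]


ratios = [ram_ratio(b) for b in buckets]
for bucket, ratio in zip(buckets, ratios):
    print(bucket["key"], ratio, f"{ratio * 100:.2f}%")
```

Each entry of buckets_path becomes one key of params, which is why the script can refer to params.machineRam and params.totalRam.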

The bucket_selector aggregation is a script-based aggregation that selects which buckets of a multi-bucket aggregation, such as histogram or date_histogram, are returned. Use it when certain buckets should be left out of the output based on user-supplied conditions.

The bucket_selector aggregation runs a script to decide whether a bucket remains in the parent multi-bucket aggregation.

The basic syntax is:

```
{
"bucket_selector": {
"buckets_path": {
"my_var1": "the_sum",
"my_var2": "the_value_count"
},
"script": "params.my_var1 > params.my_var2"
}
}
```

The following example calculates the sum of bytes for each month and then evaluates whether this sum is greater than 20,000. If it is, the bucket is retained in the bucket list. Otherwise, it is removed from the final output. In the sample response, all three monthly sums exceed 20,000, so every bucket is kept.

```
GET kibana_sample_data_logs/_search
{
"size": 0,
"aggs": {
"bytes_per_month": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "month"
},
"aggs": {
"total_bytes": {
"sum": {
"field": "bytes"
}
},
"bytes_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"totalBytes": "total_bytes"
},
"script": "params.totalBytes > 20000"
}
}
}
}
}
}
```

### Sample response

```
"aggregations" : {
"bytes_per_month" : {
"buckets" : [
{
"key_as_string" : "2020-10-01T00:00:00.000Z",
"key" : 1601510400000,
"doc_count" : 1635,
"total_bytes" : {
"value" : 9400200.0
}
},
{
"key_as_string" : "2020-11-01T00:00:00.000Z",
"key" : 1604188800000,
"doc_count" : 6844,
"total_bytes" : {
"value" : 3.8880434E7
}
},
{
"key_as_string" : "2020-12-01T00:00:00.000Z",
"key" : 1606780800000,
"doc_count" : 5595,
"total_bytes" : {
"value" : 3.1445055E7
}
}
]
}
}
}
```
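The selection logic amounts to a filter over the buckets. The following Python sketch applies the same condition to the monthly sums from the response; the function name bucket_selector and the second, higher threshold are only illustrative.

```python
# Monthly totals copied from the sample response
buckets = [
    {"key_as_string": "2020-10-01", "total_bytes": 9400200.0},
    {"key_as_string": "2020-11-01", "total_bytes": 3.8880434e7},
    {"key_as_string": "2020-12-01", "total_bytes": 3.1445055e7},
]


def bucket_selector(buckets, predicate):
    """Keep only the buckets for which the predicate returns True."""
    return [b for b in buckets if predicate(b)]


# Same condition as the script: params.totalBytes > 20000
kept = bucket_selector(buckets, lambda b: b["total_bytes"] > 20000)
print([b["key_as_string"] for b in kept])

# Hypothetical stricter threshold, to show a bucket being dropped
kept_strict = bucket_selector(buckets, lambda b: b["total_bytes"] > 2e7)
print([b["key_as_string"] for b in kept_strict])
```

With the threshold of 20,000 nothing is filtered out, while the stricter threshold would drop the October bucket.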

## bucket_sort

The bucket_sort aggregation is a parent aggregation that sorts the buckets of a previous aggregation.

You can specify different sort fields and the corresponding order. In addition, you can sort each bucket by its key, count, or sub-aggregations. You can also truncate the buckets by setting the from and size parameters.

The syntax is as follows:

```
{
"bucket_sort": {
"sort": [
{"sort_field_1": {"order": "asc"}},
{"sort_field_2": {"order": "desc"}},
"sort_field_3"
],
"from":1,
"size":3
}
}
```

The following example sorts the buckets of a date_histogram aggregation based on the calculated total_bytes values. The buckets are sorted in descending order so that the buckets with the largest number of bytes are returned first.

```
GET kibana_sample_data_logs/_search
{
"size": 0,
"aggs": {
"sales_per_month": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "month"
},
"aggs": {
"total_bytes": {
"sum": {
"field": "bytes"
}
},
"bytes_bucket_sort": {
"bucket_sort": {
"sort": [
{ "total_bytes": { "order": "desc" } }
],
"size": 3
}
}
}
}
}
}
```

### Sample response

```
"aggregations" : {
"sales_per_month" : {
"buckets" : [
{
"key_as_string" : "2020-11-01T00:00:00.000Z",
"key" : 1604188800000,
"doc_count" : 6844,
"total_bytes" : {
"value" : 3.8880434E7
}
},
{
"key_as_string" : "2020-12-01T00:00:00.000Z",
"key" : 1606780800000,
"doc_count" : 5595,
"total_bytes" : {
"value" : 3.1445055E7
}
},
{
"key_as_string" : "2020-10-01T00:00:00.000Z",
"key" : 1601510400000,
"doc_count" : 1635,
"total_bytes" : {
"value" : 9400200.0
}
}
]
}
}
}
```

You can also use this aggregation to truncate the resulting buckets without sorting. For this purpose, simply use the from and/or size parameters without the sorting.
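Sorting and truncation can be pictured as a sort followed by a slice. The Python sketch below works on the monthly totals from the response; the function bucket_sort and its parameter names (from_ stands in for from, which is a reserved word in Python) are hypothetical and cover a single sort field only.

```python
# Monthly totals in their original (chronological) order
buckets = [
    {"key_as_string": "2020-10-01", "total_bytes": 9400200.0},
    {"key_as_string": "2020-11-01", "total_bytes": 3.8880434e7},
    {"key_as_string": "2020-12-01", "total_bytes": 3.1445055e7},
]


def bucket_sort(buckets, sort_field=None, descending=False, from_=0, size=None):
    """Optionally sort the buckets by one field, then truncate with from_/size."""
    ordered = list(buckets)
    if sort_field is not None:
        ordered.sort(key=lambda b: b[sort_field], reverse=descending)
    end = None if size is None else from_ + size
    return ordered[from_:end]


# Sort by total_bytes, largest first, keep at most 3 buckets
sorted_buckets = bucket_sort(buckets, "total_bytes", descending=True, size=3)
print([b["key_as_string"] for b in sorted_buckets])

# Truncate without sorting: skip the first bucket, keep one
truncated = bucket_sort(buckets, from_=1, size=1)
print([b["key_as_string"] for b in truncated])
```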

## cumulative_sum

The cumulative_sum aggregation is a parent aggregation that calculates the cumulative sum of each bucket of a previous aggregation.

A cumulative sum is a sequence of partial sums of a given sequence. For example, the cumulative sums of the sequence {a,b,c,…} are a, a+b, a+b+c and so on. You can use the cumulative sum to visualize how the running total of a field grows over time.

The following example calculates the cumulative number of bytes on a monthly basis:

```
GET kibana_sample_data_logs/_search
{
"size": 0,
"aggs": {
"sales_per_month": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "month"
},
"aggs": {
"no-of-bytes": {
"sum": {
"field": "bytes"
}
},
"cumulative_bytes": {
"cumulative_sum": {
"buckets_path": "no-of-bytes"
}
}
}
}
}
}
```

### Sample response

```
...
"aggregations" : {
"sales_per_month" : {
"buckets" : [
{
"key_as_string" : "2020-10-01T00:00:00.000Z",
"key" : 1601510400000,
"doc_count" : 1635,
"no-of-bytes" : {
"value" : 9400200.0
},
"cumulative_bytes" : {
"value" : 9400200.0
}
},
{
"key_as_string" : "2020-11-01T00:00:00.000Z",
"key" : 1604188800000,
"doc_count" : 6844,
"no-of-bytes" : {
"value" : 3.8880434E7
},
"cumulative_bytes" : {
"value" : 4.8280634E7
}
},
{
"key_as_string" : "2020-12-01T00:00:00.000Z",
"key" : 1606780800000,
"doc_count" : 5595,
"no-of-bytes" : {
"value" : 3.1445055E7
},
"cumulative_bytes" : {
"value" : 7.9725689E7
}
}
]
}
}
}
```
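The running totals in the response can be verified with itertools.accumulate from the Python standard library. This is just a check of the arithmetic on the three monthly values.

```python
from itertools import accumulate

# Monthly no-of-bytes values copied from the sample response
monthly_bytes = [9400200.0, 3.8880434e7, 3.1445055e7]

# Partial sums: a, a+b, a+b+c
cumulative_bytes = list(accumulate(monthly_bytes))
print(cumulative_bytes)
```

The three results correspond to the cumulative_bytes values of the October, November, and December buckets.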

## derivative

The derivative aggregation is a parent aggregation that computes the 1st-order and 2nd-order derivatives of each bucket of a previous aggregation.

In mathematics, the derivative of a function measures its sensitivity to change. In other words, a derivative assesses the rate of change of a function with respect to a variable. To learn more about derivatives, see Wikipedia.

You can use derivatives to calculate the rate of change of numerical values with respect to previous time periods.

The 1st-order derivative indicates whether a metric is increasing or decreasing and by how much it is increasing or decreasing.

The following example calculates the 1st-order derivative for the sum of bytes per month. The 1st-order derivative is the difference between the number of bytes in the current month and the number of bytes in the previous month:

```
GET kibana_sample_data_logs/_search
{
"size": 0,
"aggs": {
"sales_per_month": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "month"
},
"aggs": {
"number_of_bytes": {
"sum": {
"field": "bytes"
}
},
"bytes_deriv": {
"derivative": {
"buckets_path": "number_of_bytes"
}
}
}
}
}
}
```

### Sample response

```
...
"aggregations" : {
"sales_per_month" : {
"buckets" : [
{
"key_as_string" : "2020-10-01T00:00:00.000Z",
"key" : 1601510400000,
"doc_count" : 1635,
"number_of_bytes" : {
"value" : 9400200.0
}
},
{
"key_as_string" : "2020-11-01T00:00:00.000Z",
"key" : 1604188800000,
"doc_count" : 6844,
"number_of_bytes" : {
"value" : 3.8880434E7
},
"bytes_deriv" : {
"value" : 2.9480234E7
}
},
{
"key_as_string" : "2020-12-01T00:00:00.000Z",
"key" : 1606780800000,
"doc_count" : 5595,
"number_of_bytes" : {
"value" : 3.1445055E7
},
"bytes_deriv" : {
"value" : -7435379.0
}
}
]
}
}
}
```

The 2nd-order derivative is a double derivative or a derivative of the derivative. It indicates how the rate of change of a quantity changes. It is the difference between the 1st-order derivatives of adjacent buckets.

To calculate a 2nd-order derivative, it is necessary to concatenate one aggregation of derivatives with another:

```
GET kibana_sample_data_logs/_search
{
"size": 0,
"aggs": {
"sales_per_month": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "month"
},
"aggs": {
"number_of_bytes": {
"sum": {
"field": "bytes"
}
},
"bytes_deriv": {
"derivative": {
"buckets_path": "number_of_bytes"
}
},
"bytes_2nd_deriv": {
"derivative": {
"buckets_path": "bytes_deriv"
}
}
}
}
}
}
```

### Sample response

```
...
"aggregations" : {
"sales_per_month" : {
"buckets" : [
{
"key_as_string" : "2020-10-01T00:00:00.000Z",
"key" : 1601510400000,
"doc_count" : 1635,
"number_of_bytes" : {
"value" : 9400200.0
}
},
{
"key_as_string" : "2020-11-01T00:00:00.000Z",
"key" : 1604188800000,
"doc_count" : 6844,
"number_of_bytes" : {
"value" : 3.8880434E7
},
"bytes_deriv" : {
"value" : 2.9480234E7
}
},
{
"key_as_string" : "2020-12-01T00:00:00.000Z",
"key" : 1606780800000,
"doc_count" : 5595,
"number_of_bytes" : {
"value" : 3.1445055E7
},
"bytes_deriv" : {
"value" : -7435379.0
},
"bytes_2nd_deriv" : {
"value" : -3.6915613E7
}
}
]
}
}
}
```

The first bucket does not have a 1st-order derivative because a derivative needs at least two points for comparison. The first and second buckets do not have a 2nd-order derivative because a 2nd-order derivative needs at least two data points of the 1st-order derivative.

The 1st-order derivative is 2.9480234E7 for the “2020-11-01” bucket and -7435379 for the “2020-12-01” bucket. Thus, the 2nd-order derivative of the “2020-12-01” bucket is -3.6915613E7 (-7435379 - 2.9480234E7).

In theory, one could continue to concatenate the aggregations of derivatives to compute the 3rd, 4th and even higher order derivatives. However, this would be of no value for most datasets.
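Both derivatives can be reproduced as successive differences. In the Python sketch below, the hypothetical function derivative returns None where no value can be computed, which plays the role of the missing fields in the responses above.

```python
def derivative(values):
    """Difference between each value and the previous one (None if unavailable)."""
    if not values:
        return []
    result = [None]  # the first bucket has no predecessor
    for previous, current in zip(values, values[1:]):
        if previous is None or current is None:
            result.append(None)
        else:
            result.append(current - previous)
    return result


# Monthly number_of_bytes values copied from the sample response
number_of_bytes = [9400200.0, 3.8880434e7, 3.1445055e7]

bytes_deriv = derivative(number_of_bytes)
bytes_2nd_deriv = derivative(bytes_deriv)  # derivative of the derivative

print(bytes_deriv)
print(bytes_2nd_deriv)
```

Applying the same function again to bytes_2nd_deriv would give a 3rd-order derivative, which for three buckets contains no values at all.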

## moving_avg

A moving_avg aggregation is a parent aggregation that computes the moving average metric.

The moving_avg aggregation finds the set of averages of different windows (subsets) of a data set. The size of a window represents the number of data points covered by the window in each iteration (specified by the window property and set to 5 by default). At each iteration, the algorithm averages all data points that fall within the window and then scrolls forward by excluding the first member of the previous window and including the first member of the next window.

For example, given data [1, 5, 8, 23, 34, 28, 7, 23, 20, 19], a simple moving average with a window size 5 can be calculated as follows:

```
(1 + 5 + 8 + 23 + 34) / 5 = 14.2
(5 + 8 + 23 + 34+ 28) / 5 = 19.6
(8 + 23 + 34 + 28 + 7) / 5 = 20
```

and so on…
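The remaining windows can be computed with a short Python sketch. It implements the simple moving average described above, where each window covers exactly five consecutive data points; the function name is hypothetical.

```python
def simple_moving_average(data, window=5):
    """Average of every run of `window` consecutive values in data."""
    return [
        sum(data[start:start + window]) / window
        for start in range(len(data) - window + 1)
    ]


data = [1, 5, 8, 23, 34, 28, 7, 23, 20, 19]
averages = simple_moving_average(data, window=5)
print(averages)
```

Ten data points and a window of five produce six averages, the first three of which are the ones worked out by hand above.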

You can use moving_avg aggregation to smooth short-term fluctuations or to highlight long-term trends or cycles in time series data.

Specify a small window size (e.g., window: 10) that closely follows the data to smooth out small-scale fluctuations. Alternatively, specify a larger window size (e.g., window: 100) that moves far away from the actual data to smooth out all higher-frequency fluctuations or random noise, making lower-frequency trends more visible.

The following example nests a moving average into a date_histogram aggregation. The moving_avg aggregation has been deprecated in favor of the more general moving_fn aggregation, so the request uses moving_fn with the MovingFunctions.unweightedAvg function:

```
GET kibana_sample_data_logs/_search
{
"size": 0,
"aggs": {
"my_date_histogram": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "month"
},
"aggs": {
"sum_of_bytes": {
"sum": { "field": "bytes" }
},
"moving_avg_of_sum_of_bytes": {
"moving_fn": { "buckets_path": "sum_of_bytes", "window": 10,
"script": "MovingFunctions.unweightedAvg(values)"}
}
}
}
}
}
```

### Sample response

```
...
"aggregations" : {
"my_date_histogram" : {
"buckets" : [
{
"key_as_string" : "2020-10-01T00:00:00.000Z",
"key" : 1601510400000,
"doc_count" : 1635,
"sum_of_bytes" : {
"value" : 9400200.0
}
},
{
"key_as_string" : "2020-11-01T00:00:00.000Z",
"key" : 1604188800000,
"doc_count" : 6844,
"sum_of_bytes" : {
"value" : 3.8880434E7
},
"moving_avg_of_sum_of_bytes" : {
"value" : 9400200.0
}
},
{
"key_as_string" : "2020-12-01T00:00:00.000Z",
"key" : 1606780800000,
"doc_count" : 5595,
"sum_of_bytes" : {
"value" : 3.1445055E7
},
"moving_avg_of_sum_of_bytes" : {
"value" : 2.4140317E7
}
}
]
}
}
}
```
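The values in this response are consistent with an unweighted average taken over the buckets that precede the current one, so the current bucket itself is not part of its own window. The Python sketch below reproduces that behavior under this assumption; the function name is hypothetical.

```python
def moving_unweighted_avg(values, window):
    """For each position, average up to `window` values that come before it."""
    result = []
    for index in range(len(values)):
        # Window of previous values only; the current value is excluded
        previous = values[max(0, index - window):index]
        result.append(sum(previous) / len(previous) if previous else None)
    return result


# Monthly sum_of_bytes values copied from the sample response
sum_of_bytes = [9400200.0, 3.8880434e7, 3.1445055e7]

print(moving_unweighted_avg(sum_of_bytes, window=10))
```

The November value equals the October sum alone, and the December value is the average of the October and November sums, as in the response.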

## serial_diff

The serial_diff aggregation is a parent pipeline aggregation that calculates the difference between the value of the current bucket and the value of a bucket a fixed number of positions earlier in a previous aggregation.

You can use the serial_diff aggregation to find how the data changes between time periods rather than looking at the total value.

With the lag parameter (a positive, nonzero integer value), you can specify which previous bucket to subtract from the current one. If you do not specify the lag parameter, Elasticsearch sets it to 1.

Suppose the population of a city grows over time. If you apply serial differencing with a period of one day, you can see the daily growth. The following example applies serial_diff with a lag of 30 to the monthly sum of bytes. Because the sample data spans only three monthly buckets, no bucket has a predecessor 30 positions back, so the sample response contains no thirtieth_difference values.

```
GET kibana_sample_data_logs/_search
{
"size": 0,
"aggs": {
"my_date_histogram": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "month"
},
"aggs": {
"the_sum": {
"sum": {
"field": "bytes"
}
},
"thirtieth_difference": {
"serial_diff": {
"buckets_path": "the_sum",
"lag" : 30
}
}
}
}
}
}
```

### Sample response

```
...
"aggregations" : {
"my_date_histogram" : {
"buckets" : [
{
"key_as_string" : "2020-10-01T00:00:00.000Z",
"key" : 1601510400000,
"doc_count" : 1635,
"the_sum" : {
"value" : 9400200.0
}
},
{
"key_as_string" : "2020-11-01T00:00:00.000Z",
"key" : 1604188800000,
"doc_count" : 6844,
"the_sum" : {
"value" : 3.8880434E7
}
},
{
"key_as_string" : "2020-12-01T00:00:00.000Z",
"key" : 1606780800000,
"doc_count" : 5595,
"the_sum" : {
"value" : 3.1445055E7
}
}
]
}
}
}
```
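The effect of the lag parameter can be illustrated with a short Python sketch. The function serial_diff below is a hypothetical stand-in that subtracts from each value the one lag positions earlier and returns None where no such value exists.

```python
def serial_diff(values, lag=1):
    """Subtract from each value the value `lag` positions earlier."""
    if lag < 1:
        raise ValueError("lag must be a positive, nonzero integer")
    return [
        None if index < lag else values[index] - values[index - lag]
        for index in range(len(values))
    ]


# Monthly the_sum values copied from the sample response
the_sum = [9400200.0, 3.8880434e7, 3.1445055e7]

print(serial_diff(the_sum, lag=30))  # three buckets only: nothing to subtract
print(serial_diff(the_sum, lag=1))   # month-over-month change
print(serial_diff(the_sum, lag=2))   # change with respect to two months earlier
```

With three monthly buckets, a lag of 30 yields no differences at all, which is why the response above shows only the_sum. A lag of 1 or 2 would produce values for the later buckets.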