Elasticsearch in Action: Pipeline Aggregations (1/2)

Madhusudhan Konda
Jan 31, 2023
Excerpts taken from my upcoming book: Elasticsearch in Action

The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, which also contains executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.

Me @ Medium || LinkedIn || Twitter || GitHub

Elasticsearch provides a third set of aggregations, called pipeline aggregations, that permit chaining aggregations together. These aggregations work on the output of other aggregations rather than on the individual documents or their fields. That is, we create a pipeline aggregation by passing it the output of a bucket or a metric aggregation.

Pipeline aggregation types

Broadly speaking, we can group pipeline aggregations into two types: parent and sibling. Parent pipeline aggregations work on the input from the parent aggregation to produce new buckets or new aggregations, which are then added to the existing buckets. Sibling pipeline aggregations produce a new aggregation at the same level as the sibling aggregation they read from.

Sample data

We will look at both parent and sibling aggregation types in detail as we execute some examples in this section. We’ll use the coffee_sales data as a sample data set for running these pipeline aggregations. Follow the usual process of indexing the data using the _bulk API, as the listing below shows. You can fetch the sample data from the book’s repository on GitHub.

PUT coffee_sales/_bulk
{"index":{"_id":"1"}}
{"date":"2022-09-01","sales":{"cappuccino":23,"latte":12,"americano":9,"tea":7},"price":{"cappuccino":2.50,"latte":2.40,"americano":2.10,"tea":1.50}}
{"index":{"_id":"2"}}
{"date":"2022-09-02","sales":{"cappuccino":40,"latte":16,"americano":19,"tea":15},"price":{"cappuccino":2.50,"latte":2.40,"americano":2.10,"tea":1.50}}

Executing this request indexes two sales documents into our coffee_sales index. Now that we have a couple of documents in coffee_sales, the next step is to create a set of pipeline aggregations to help us understand them in detail.
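The _bulk body is newline-delimited JSON: an action line followed by a source line for each document, with a newline after the last line. For readers who script their indexing rather than paste it into Kibana, here is a minimal sketch of assembling the same payload in Python. The helper name build_bulk_body is hypothetical, and sending the payload to a cluster is deliberately left out.

```python
import json

# The two sample documents from the listing above, keyed by document ID.
PRICES = {"cappuccino": 2.50, "latte": 2.40, "americano": 2.10, "tea": 1.50}
DOCS = {
    "1": {"date": "2022-09-01",
          "sales": {"cappuccino": 23, "latte": 12, "americano": 9, "tea": 7},
          "price": PRICES},
    "2": {"date": "2022-09-02",
          "sales": {"cappuccino": 40, "latte": 16, "americano": 19, "tea": 15},
          "price": PRICES},
}


def build_bulk_body(docs):
    """Return an NDJSON string: one action line plus one source line per document."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    # The _bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body(DOCS)
print(bulk_body)
```

The printed text can be used as the body of a request to coffee_sales/_bulk with the content type application/x-ndjson.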

Syntax for the pipeline aggregations

Pipeline aggregations, as discussed, work on the input from other aggregations. That means that when we declare a pipeline aggregation, we are expected to provide a reference to the metric or bucket aggregation that feeds it. We set this reference with buckets_path, which is made up of aggregation names joined by a separator (the > character between aggregation names). The buckets_path setting is the mechanism that identifies the input to the pipeline aggregation.

For example, the figure below shows the cappuccino_sales aggregation that feeds the pipeline, whereas the cumulative_sum pipeline aggregation, declared under the name total_cappuccinos, refers to it via buckets_path, whose value is the name of that aggregation.

Figure: Parent pipeline aggregation bucket path setting
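The figure itself is not reproduced in this text, so the following is only a sketch of a request of that shape, written as a Python dictionary that can be dumped to JSON and pasted into Kibana. The names cappuccino_sales and total_cappuccinos come from the text. The enclosing date_histogram, its name (borrowed from the sibling example), the daily interval, and the sum metric are assumptions made for illustration.

```python
import json

# Sketch of a request body for: GET coffee_sales/_search
# Assumptions (not confirmed by the text): the enclosing date_histogram,
# its name sales_by_coffee, the daily interval, and the sum metric.
parent_pipeline_request = {
    "size": 0,  # return aggregation results only, no search hits
    "aggs": {
        "sales_by_coffee": {
            "date_histogram": {"field": "date", "calendar_interval": "1d"},
            "aggs": {
                # Metric aggregation: cappuccinos sold in each daily bucket.
                "cappuccino_sales": {"sum": {"field": "sales.cappuccino"}},
                # Parent pipeline aggregation: declared next to the metric,
                # so buckets_path is simply the metric's name.
                "total_cappuccinos": {
                    "cumulative_sum": {"buckets_path": "cappuccino_sales"}
                },
            },
        }
    },
}

print(json.dumps(parent_pipeline_request, indent=2))
```

Because the pipeline aggregation sits inside the same histogram as the metric it reads, the path needs no separator at all.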

The buckets_path setting becomes a bit more involved if the aggregation that’s in play is a sibling aggregation. The figure below shows the aggregation.

Figure: Sibling pipeline aggregation bucket path setting

The max_bucket aggregation in the figure is a sibling pipeline aggregation (defined under the name highest_cappuccino_sales_bucket), which calculates its result by taking input from the aggregation identified by the buckets_path setting. In this case, it is fed by the aggregation called cappuccino_sales, which lives under the sales_by_coffee sibling aggregation.
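As with the parent case, the figure is not reproduced here, so this is a hedged sketch of such a request rather than the exact one from the figure. The aggregation names come from the text, while the date_histogram settings and the sum metric are assumptions. The helper resolve_buckets_path is hypothetical and exists only to show how the path walks down the aggregation tree; Elasticsearch resolves the path itself.

```python
# Sketch of a request body for: GET coffee_sales/_search
# Assumptions: the date_histogram settings and the sum metric are
# illustrative; the aggregation names come from the text.
sibling_pipeline_request = {
    "size": 0,
    "aggs": {
        "sales_by_coffee": {
            "date_histogram": {"field": "date", "calendar_interval": "1d"},
            "aggs": {
                "cappuccino_sales": {"sum": {"field": "sales.cappuccino"}}
            },
        },
        # Sibling pipeline aggregation: declared at the same level as
        # sales_by_coffee, so the path has to walk down into it with ">".
        "highest_cappuccino_sales_bucket": {
            "max_bucket": {"buckets_path": "sales_by_coffee>cappuccino_sales"}
        },
    },
}


def resolve_buckets_path(aggs, buckets_path):
    """Follow a '>'-separated buckets_path through a request's aggs tree.

    Hypothetical helper for illustration only. It handles just the '>'
    separator and raises KeyError if a name in the path is not defined.
    """
    node = {"aggs": aggs}
    for name in buckets_path.split(">"):
        node = node["aggs"][name]
    return node


top_level = sibling_pipeline_request["aggs"]
path = top_level["highest_cappuccino_sales_bucket"]["max_bucket"]["buckets_path"]
target = resolve_buckets_path(top_level, path)
print(path, "->", target)
```

The path starts at the level where the pipeline aggregation is declared, which is why the sibling form has to name sales_by_coffee first and the parent form does not.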

If you are puzzled by buckets_path or even by pipeline aggregations in general, hang in there. We will go over them in practice in the next few sections.

List of pipeline aggregations

Knowing which type a pipeline aggregation falls into, parent or sibling, helps us develop these aggregations with ease. The tables below list the pipeline aggregations and their definitions.

Table: Parent pipeline aggregations
Table: Sibling pipeline aggregations

We will not be able to go over all of the pipeline aggregations in this section, but we can learn the basics by working through a few common ones. To begin, let’s suppose we want to find our cumulative coffee sales. Instead of a separate count of the cappuccinos sold each day, we want the total number of cappuccinos sold since the first day of operation, accumulated daily. The cumulative_sum aggregation is a handy parent pipeline aggregation for this: it keeps a running total, so each day’s bucket holds that day’s sales added to the total of all the previous days.
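Before running anything on a cluster, the arithmetic can be worked locally. The sketch below applies a running total to the cappuccino counts from the two sample documents. It only imitates what a cumulative sum over daily buckets is expected to produce and does not call Elasticsearch.

```python
from itertools import accumulate

# Daily cappuccino sales taken from the two sample documents.
daily_cappuccinos = {"2022-09-01": 23, "2022-09-02": 40}

# Running total in date order: each day adds its own sales to the
# total of all the days before it.
dates = sorted(daily_cappuccinos)
running_totals = dict(zip(dates, accumulate(daily_cappuccinos[d] for d in dates)))

for day in dates:
    print(day, "daily:", daily_cappuccinos[day], "cumulative:", running_totals[day])
```

For the sample data, the first day stays at 23 and the second day becomes 23 + 40 = 63.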

Let’s see it in action in the next article.

These short articles are condensed excerpts taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository.
