Elasticsearch in Action: Pipeline Aggregations (1/2)
The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, which also contains executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.
Elasticsearch provides a third set of aggregations, called pipeline aggregations, that let us chain aggregations together. These aggregations work on the output of other aggregations rather than on individual documents or their fields. That is, we create a pipeline aggregation by feeding it the output of a bucket or a metric aggregation.
Pipeline aggregation types
Broadly speaking, we can group pipeline aggregations into two types: parent and sibling. As mentioned previously, the parent pipeline aggregations work on the input from the parent aggregation to produce new buckets or new aggregations, which are then added to the existing buckets. The sibling pipeline aggregations produce a new aggregation at the same level as the sibling aggregation.
Sample data
We will look at both parent and sibling aggregation types in detail as we execute some examples in this section. We’ll use the coffee_sales data as a sample data set for running these pipeline aggregations. Follow the usual process of indexing the data using the _bulk API, as the listing below shows. You can fetch the sample data from the book’s repository on GitHub.
PUT coffee_sales/_bulk
{"index":{"_id":"1"}}
{"date":"2022-09-01","sales":{"cappuccino":23,"latte":12,"americano":9,"tea":7},"price":{"cappuccino":2.50,"latte":2.40,"americano":2.10,"tea":1.50}}
{"index":{"_id":"2"}}
{ "date":"2022-09-02","sales":{"cappuccino":40,"latte":16,"americano":19,"tea":15},"price":{"cappuccino":2.50,"latte":2.40,"americano":2.10,"tea":1.50}}
Executing this request indexes two sales documents into our coffee_sales index. Now that we have a couple of documents in coffee_sales, the next step is to create a set of pipeline aggregations to help us understand them in detail.
Syntax for the pipeline aggregations
Pipeline aggregations, as discussed, work on the input from other aggregations. That means that when we declare the pipeline, we are expected to provide a reference to the metric or bucket aggregation it reads from. We set this reference with buckets_path, which is made up of the aggregation names joined by an appropriate separator. The buckets_path setting is the mechanism that identifies the input to the pipeline aggregation.
For example, the figure below indicates the parent aggregation cappuccino_sales, whereas the pipeline aggregation cumulative_sum, as defined by total_cappuccinos, refers to the parent aggregation via buckets_path, which is set to the name of the parent aggregation.
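A minimal sketch of a query with this shape is given here, hedged accordingly. The names cappuccino_sales and total_cappuccinos come from the text; the enclosing sales_by_coffee date histogram with a one-day calendar_interval and the sum metric on sales.cappuccino are assumptions based on the sample data, not necessarily the book's exact listing.

```console
# Sketch only: the date_histogram wrapper and the sum metric are assumptions
# based on the sample data; the aggregation names follow the text.
GET coffee_sales/_search
{
  "size": 0,
  "aggs": {
    "sales_by_coffee": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "1d"
      },
      "aggs": {
        "cappuccino_sales": {
          "sum": { "field": "sales.cappuccino" }
        },
        "total_cappuccinos": {
          "cumulative_sum": { "buckets_path": "cappuccino_sales" }
        }
      }
    }
  }
}
```

Because total_cappuccinos sits next to cappuccino_sales inside the same histogram, its buckets_path needs only the metric's name. With the two sample documents, each daily bucket should carry its own cappuccino_sales value (23 and 40) plus a running total_cappuccinos (23, then 63).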
The buckets_path setting becomes a bit more involved if the aggregation in play is a sibling aggregation. The figure below shows the aggregation.
The max_bucket in the aggregation in the figure is a sibling pipeline aggregation (defined under the highest_cappuccino_sales_bucket aggregation), which calculates the result by taking input from the other aggregations set by the buckets_path variable. In this case, it is fed by the aggregation called cappuccino_sales, which lives under the sales_by_coffee sibling aggregation.
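A sibling query along these lines might look like the following sketch. The names highest_cappuccino_sales_bucket, sales_by_coffee, and cappuccino_sales come from the text; treating sales_by_coffee as a daily date_histogram and cappuccino_sales as a sum on sales.cappuccino is an assumption drawn from the sample data.

```console
# Sketch only: the date_histogram and sum definitions are assumptions;
# the aggregation names follow the text.
GET coffee_sales/_search
{
  "size": 0,
  "aggs": {
    "sales_by_coffee": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "1d"
      },
      "aggs": {
        "cappuccino_sales": {
          "sum": { "field": "sales.cappuccino" }
        }
      }
    },
    "highest_cappuccino_sales_bucket": {
      "max_bucket": { "buckets_path": "sales_by_coffee>cappuccino_sales" }
    }
  }
}
```

Here the pipeline is declared at the same level as sales_by_coffee, so buckets_path has to walk down into it: the > separator joins the bucket aggregation's name to the metric nested inside it. On the two sample documents, the result should point to the 2022-09-02 bucket with a value of 40.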
If you are puzzled by buckets_path or even by pipeline aggregations in general, hang in there. We will go over them in practice in the next few sections.
List of pipeline aggregations
Knowing whether a pipeline aggregation is of the parent or the sibling type helps us develop these aggregations with ease. The tables below list the pipeline aggregations and their definitions.
We will not be able to go over all of the pipeline aggregations in this section, but we can learn and understand the basics by working through a few common ones. To begin, let’s suppose we want to find our cumulative coffee sales: how many cappuccinos have been sold to date, for example. Instead of a separate count for each day, we want the total number of cappuccinos sold since the first day of operation, accumulated daily. The cumulative_sum aggregation is a handy parent pipeline aggregation that keeps a running total: each day’s value is that day’s sales added to the total of all previous days.
Let’s see it in action in the next article.