Elasticsearch in Action: Bucket Aggregations (1/2)
This article is excerpted from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, which includes executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.
We learn about bucket aggregations in this article and child-level bucket aggregations in the next instalment.
Overview
A common requirement when working with data is to run grouping operations over it. Elasticsearch calls these grouping actions bucket aggregations. Their sole aim is to categorize data into groups, commonly called buckets.
Bucketing is the process of collecting data into interval-based buckets. For example:
- Grouping runners for a marathon according to their age bracket (21–30, 31–40, 41–50).
- Categorizing schools based on their inspection ratings (good, outstanding, exceptional).
- Getting the number of new houses constructed each month, each year, and so on.
Before we start playing with bucketing aggregations, let’s reuse a data set we’ve worked with in the past: the books data. Pick up the dataset from my GitHub page and index it.
This sample snippet provides a quick reminder (note that it is not the full dataset):
POST _bulk
{"index":{"_index":"books","_id":"1"}}
{"title": "Core Java Volume I – Fundamentals","author": "Cay S. Horstmann","edition": 11, "synopsis": "Java reference book that offers a detailed explanation of various features of Core Java, including exception handling, interfaces, and lambda expressions. Significant highlights of the book include simple language, conciseness, and detailed examples.","amazon_rating": 4.6,"release_date": "2018-08-27","tags": ["Programming Languages, Java Programming"]}
{"index":{"_index":"books","_id":"2"}}
{"title": "Effective Java","author": "Joshua Bloch", "edition": 3,"synopsis": "A must-have book for every Java programmer and Java aspirant, Effective Java makes up for an excellent complementary read with other Java books or learning material. The book offers 78 best practices to follow for making the code better.", "amazon_rating": 4.7, "release_date": "2017-12-27", "tags": ["Object Oriented Software Design"]}
Now that we’ve primed our server with book data, let’s run some common bucketing aggregations. There are at least two dozen of these aggregations out of the box; each one has its own bucketing strategy.
Documenting every one of these aggregations here would be repetitive. Once you grasp the concept of bucketing and how to work with it, you should be able to pick up the others by following the documentation. For now, let's start with a common bucket aggregation: histograms.
Histograms
Histograms are pretty neat bar charts representing grouped data. Most analytical software tools provide both visual and data representations of histograms. Elasticsearch exposes a histogram bucket aggregation out of the box.
You may have worked with histograms where data is split into multiple categories based on the interval it falls under. Histograms in Elasticsearch are no different: they create a set of buckets over all the documents using a predetermined interval.
Let's take an example of categorizing books by ratings. We want to find the number of books in each rating bracket, such as 2–3, 3–4, and 4–5. We can create a histogram aggregation with an interval of 1 so that the books fall into the respective one-step rating buckets. The query in the following listing demonstrates this.
GET books/_search
{
  "size": 0,
  "aggs": {
    "ratings_histogram": {
      "histogram": {
        "field": "amazon_rating",
        "interval": 1
      }
    }
  }
}
As the listing shows, the histogram aggregation expects the field on which we want to build the buckets as well as the interval of the buckets. In the listing, we split the books based on the amazon_rating field with an interval of 1, which groups the books into rating brackets such as 3–4, 4–5, and so on. Running the query fetches 2 books that fall in the bucket of 2–3 ratings, 35 books with a rating of 3–4, and so on. The response indicates that each bucket has two fields: key and doc_count. The key field represents the bucket classification, whereas the doc_count field indicates the number of documents that fall into the bucket.
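To make the bucket structure concrete, here is an abridged, illustrative fragment of the response to the query above (the exact counts depend on the dataset you indexed):

"aggregations" : {
  "ratings_histogram" : {
    "buckets" : [
      {
        "key" : 2.0,
        "doc_count" : 2
      },
      {
        "key" : 3.0,
        "doc_count" : 35
      },
      ...
    ]
  }
}

Note that the key of a histogram bucket is the inclusive lower bound of its interval, so the key 2.0 holds books rated from 2 up to (but not including) 3.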
Histogram aggregation with Kibana
In the previous listing, we developed an aggregation query and executed it in Kibana's console. The results in JSON format aren't visually appealing, as you can see in the figure. It is up to the client that receives the data to represent it as a visual chart. Kibana, however, has a rich set of visualizations for aggregated data. While working with Kibana visualizations is out of scope for this discussion, the figure below shows the same data represented as a histogram in Kibana's Dashboard, this time with an interval of 0.5.
As you can see from the bar chart in the figure, the data is categorized into buckets based on an interval of 0.5 and filled with the documents that fit in them. To learn all about Kibana visualizations, refer to the documentation at https://www.elastic.co/guide/en/kibana/current/dashboard.html.
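If you'd rather reproduce that finer-grained bucketing in the console than in a Dashboard, the change from the earlier query is a single attribute. A minimal sketch:

GET books/_search
{
  "size": 0,
  "aggs": {
    "ratings_histogram": {
      "histogram": {
        "field": "amazon_rating",
        "interval": 0.5
      }
    }
  }
}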
The date histogram
At times, we may want to group data based on dates rather than numbers. For example, we might want to find the number of books released each year, get the weekly sales of an iPhone product, or count server threat attempts hour by hour. This is where the date_histogram aggregation comes in handy.
While the histogram bucketing strategy we looked at in the last section is based on numerical intervals, Elasticsearch also provides a histogram based on dates, aptly called date_histogram. Let's say we want to categorize books based on their release dates. Here's the query that applies bucketing based on a book's release date.
GET books/_search
{
  "size": 0,
  "aggs": {
    "release_year_histogram": {
      "date_histogram": {
        "field": "release_date",
        "calendar_interval": "year"
      }
    }
  }
}
This query uses a date_histogram aggregation, which requires the field on which the aggregation runs and the bucket interval. In the example, we use release_date as the date field with a year interval.
> We can set the bucket's interval value to any of year, quarter, month, week, day, hour, minute, second, and millisecond, based on our requirements.
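For instance, switching to monthly buckets is just a change to the interval value. A minimal sketch (the aggregation name release_month_histogram is our own choice):

GET books/_search
{
  "size": 0,
  "aggs": {
    "release_month_histogram": {
      "date_histogram": {
        "field": "release_date",
        "calendar_interval": "month"
      }
    }
  }
}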
Running the year-interval query from the earlier listing produces an individual bucket for each year and the number of documents in that bucket, as shown here:
...
{
  "key_as_string" : "2020-01-01T00:00:00.000Z",
  "key" : 1577836800000,
  "doc_count" : 5
},
{
  "key_as_string" : "2021-01-01T00:00:00.000Z",
  "key" : 1609459200000,
  "doc_count" : 6
},
{
  "key_as_string" : "2022-01-01T00:00:00.000Z",
  "key" : 1640995200000,
  "doc_count" : 3
}
...
As you can deduce from the results, each key (expressed as key_as_string) represents a year: 2020, 2021, 2022. The results show that 5 books were released in 2020, 6 books in 2021, and 3 in 2022.
Interval setup for the date histogram
In the code for the previous listing, we set the interval to year in the calendar_interval attribute. In addition to calendar_interval, there's another type of interval: fixed_interval. That is, we can set the interval in one of two ways: as a calendar interval or as a fixed interval. There's a subtle difference between the two types, so let's look at each in the following subsections.
Calendar interval
The calendar interval, declared as calendar_interval, is calendar-aware, meaning that the hours in a day and the days in a month are adjusted for the calendar, including daylight saving time. The following units are acceptable values: year, quarter, month, week, day, hour, and minute. They can also be expressed as single units: 1y, 1q, 1M, 1w, 1d, 1h, and 1m, respectively. For example, we could write the query in the previous listing with "calendar_interval": "1y" instead of "calendar_interval": "year".
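As a sketch, here is the earlier query rewritten with the shorthand unit; it produces the same yearly buckets:

GET books/_search
{
  "size": 0,
  "aggs": {
    "release_year_histogram": {
      "date_histogram": {
        "field": "release_date",
        "calendar_interval": "1y"
      }
    }
  }
}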
Note that we can't use multiples like 5y (five years) or 4q (four quarters) when setting the interval using calendar_interval. For example, setting the interval as "calendar_interval": "4q" results in a parser exception: "The supplied interval [4q] could not be parsed as a calendar interval".
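For completeness, this is the kind of query that triggers that exception (a sketch; the aggregation name release_quarters is arbitrary, and the request fails rather than returning buckets):

GET books/_search
{
  "size": 0,
  "aggs": {
    "release_quarters": {
      "date_histogram": {
        "field": "release_date",
        "calendar_interval": "4q"
      }
    }
  }
}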
Fixed interval
The fixed_interval setting allows us to specify the time interval as a fixed number of units, such as 365d (365 days) or 12h (12 hours). When we don't need the interval to be calendar-aware, we can use these fixed intervals. The accepted units are days (d), hours (h), minutes (m), seconds (s), and milliseconds (ms).
Because fixed_interval knows nothing about the calendar, unlike calendar_interval, there are no units for month, quarter, or year; those units depend on the calendar (every month has a different number of days, and so on). As an example, the following listing buckets all the documents into intervals of 730 days (2 years).
GET books/_search
{
  "size": 0,
  "aggs": {
    "release_date_histogram": {
      "date_histogram": {
        "field": "release_date",
        "fixed_interval": "730d"
      }
    }
  }
}
As you can see, the query uses a fixed_interval of 730d (2 years). The results show the books in buckets of exactly 730 days. The following snippet demonstrates this:
{
  "key_as_string" : "2017-12-20T00:00:00.000Z",
  "key" : 1513728000000,
  "doc_count" : 11
},
{
  "key_as_string" : "2019-12-20T00:00:00.000Z",
  "key" : 1576800000000,
  "doc_count" : 11
},
{
  "key_as_string" : "2021-12-19T00:00:00.000Z",
  "key" : 1639872000000,
  "doc_count" : 3
}
If you are curious, run the same query with two different settings: "calendar_interval": "1y" and "fixed_interval": "365d". You can refer to my GitHub page for the executable code when experimenting with these settings.
Once the queries run successfully, check the keys. In the former (the one with "calendar_interval": "1y"), the keys start exactly on the 1st of January ("key_as_string": "2005-01-01"); in the latter (the one with "fixed_interval": "365d"), they start on the first release date, the 23rd of December 2004 ("key_as_string": "2004-12-23"). The second bucket then simply adds 365 days to the first release date, yielding the 23rd of December 2005 ("key_as_string": "2005-12-23").
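Here is a sketch of the fixed-interval variant; pair it with the 1y calendar version shown earlier:

GET books/_search
{
  "size": 0,
  "aggs": {
    "release_date_histogram": {
      "date_histogram": {
        "field": "release_date",
        "fixed_interval": "365d"
      }
    }
  }
}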
> When we use a fixed_interval, the range starts from the first document's available date, and the fixed_interval is added from there for each subsequent bucket. For example, if the first document's publish_date is 25-12-2020 and the interval is set to 30d, the first bucket starts at 25-12-2020, the next at 24-01-2021, then 23-02-2021, and so on.
In the next article, we look at child-level bucket aggregations.