Elasticsearch in Action: Child Level Bucket Aggregations (2/2)
These excerpts are taken from my upcoming book, Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, including executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.
In the last article, we learned about bucket aggregations. In this article, we look at child-level bucket aggregations.
Child-level aggregations
In addition to creating the buckets with the respective ranges, we may want to aggregate the data inside those buckets too. For example, we may want to find the average rating of a book for each bucket.
To satisfy such requirements, we can use a sub-aggregation: an aggregation that works on a bucket’s data. Bucketing aggregations support both metric and bucket aggregations applied at the child level. The query in the following listing fetches books bucketed by release year along with the average rating for each bucket.
GET books/_search
{
  "size": 0,
  "aggs": {
    "release_date_histogram": {
      "date_histogram": {
        "field": "release_date",
        "calendar_interval": "1y"
      },
      "aggs": {
        "avg_rating_per_bucket": {
          "avg": {
            "field": "amazon_rating"
          }
        }
      }
    }
  }
}
As you can see from the listing, there are two blocks of aggregation, one weaved inside the other. The outer aggregation (release_date_histogram) produces the data as a histogram based on a calendar interval of one year. The results of this aggregation are then fed to the next level: the inner aggregation (avg_rating_per_bucket). The inner aggregation takes each bucket as its scope and runs the average (avg) aggregation on that data, producing the average book rating per bucket. The figure below shows the expected result of the aggregation execution.
As the figure shows, the keys are dates that honor the calendar year, each with a set of documents in that bucket. The notable thing about this query is the additional object in each bucket, avg_rating_per_bucket, which contains the average book rating.
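In case you can’t run the query yourself, the response takes roughly the following shape; the bucket keys, counts, and averages below are placeholders for illustration, not actual values from the dataset:
"aggregations": {
  "release_date_histogram": {
    "buckets": [
      {
        "key_as_string": "2021-01-01T00:00:00.000Z",
        "key": 1609459200000,
        "doc_count": 2,
        "avg_rating_per_bucket": {
          "value": 4.5
        }
      },
      ...
    ]
  }
}
Each date_histogram bucket carries its usual key and doc_count, plus one extra object named after the sub-aggregation (avg_rating_per_bucket) that holds the computed value.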
Custom range aggregation
The histogram provides an automatic set of ranges for a given interval. At times, we may want the data segregated into specific ranges that aren’t dictated by a strict interval (for example, classifying people into three age groups: 18–21, 22–49, and 50+). A standardized interval doesn’t satisfy this requirement; what we need is a mechanism to customize the ranges. That’s exactly what the range aggregation is for.
The range aggregation groups documents into user-defined custom ranges. Let’s look at it in action by writing a query to fetch books that fall into just two rating categories: below and above a value of 4 (1–4 and 4–5). The query in the following listing does exactly that.
GET books/_search
{
  "size": 0,
  "aggs": {
    "book_ratings_range": {
      "range": {
        "field": "amazon_rating",
        "ranges": [
          { "from": 1, "to": 4 },
          { "from": 4, "to": 5 }
        ]
      }
    }
  }
}
The query constructs an aggregation with a custom range defined by an array (ranges) containing just two buckets: from 1 to 4 and from 4 to 5. The following response indicates that two books fall into the 1–4 rating bucket and the rest into the 4–5 bucket:
"aggregations" : {
"book_ratings_range" : {
"buckets" : [
{
"key" : "1.0-4.0",
"from" : 1.0,
"to" : 4.0,
"doc_count" : 2
},
{
"key" : "4.0-5.0",
"from" : 4.0,
"to" : 5.0,
"doc_count" : 35
}
]
}
}
The range aggregation is a slight variation on the histogram aggregation and is hence well suited for the special or custom ranges a user sometimes needs. Of course, if the system-provided buckets work for you and you don’t need customization, a histogram is the suitable tool for the purpose.
> The range aggregation’s ranges are made up of from and to attributes; the from value is included while the to value is excluded when determining which documents fall into a bucket.
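As a side note, if named buckets are easier to work with than the autogenerated "1.0-4.0"-style keys, the range aggregation also accepts an optional key on each range plus a keyed flag, which returns the buckets as an object keyed by those names. The key names below (low and high) are my own choice:
GET books/_search
{
  "size": 0,
  "aggs": {
    "book_ratings_range": {
      "range": {
        "field": "amazon_rating",
        "keyed": true,
        "ranges": [
          { "key": "low", "from": 1, "to": 4 },
          { "key": "high", "from": 4, "to": 5 }
        ]
      }
    }
  }
}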
IP address ranges
By the same principle, we can classify IP addresses into custom ranges using the dedicated ip_range aggregation. The code in the listing below demonstrates exactly that. Note that this code is for demonstration purposes only, because we don’t have a networks index primed with data consisting of the localhost_ip_address field.
GET networks/_search
{
  "aggs": {
    "my_ip_addresses_custom_range": {
      "ip_range": {
        "field": "localhost_ip_address",
        "ranges": [
          { "from": "192.168.0.10", "to": "192.168.0.20" },
          { "from": "192.168.0.20", "to": "192.168.0.100" }
        ]
      }
    }
  }
}
As you can see from this sample aggregation, we can segregate the IP addresses according to our custom ranges. The query produces two buckets: one covering 192.168.0.10 up to (but not including) 192.168.0.20, and a second covering 192.168.0.20 up to 192.168.0.100.
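The ip_range aggregation can also take CIDR masks instead of explicit from/to bounds, which is handy when the ranges align with subnets. A sketch, again assuming the hypothetical networks index:
GET networks/_search
{
  "aggs": {
    "my_ip_addresses_cidr": {
      "ip_range": {
        "field": "localhost_ip_address",
        "ranges": [
          { "mask": "192.168.0.0/25" },
          { "mask": "192.168.0.128/25" }
        ]
      }
    }
  }
}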
The terms aggregation
When we want to retrieve an aggregated count for a certain field, say, authors and their book counts, the terms aggregation comes in handy. The terms aggregation collects data into a bucket for each unique occurrence of a term. For example, in the following query, the terms aggregation creates a bucket for each author with the number of books they’ve written.
GET books/_search?size=0
{
  "aggs": {
    "author_book_count": {
      "terms": {
        "field": "author.keyword"
      }
    }
  }
}
The query uses the terms aggregation to fetch the list of authors in the books index along with their book counts. In the response, key holds the author’s name, and doc_count shows the number of books for that author:
"buckets" : [
{
"key" : "Herbert Schildt",
"doc_count" : 2
},
{
"key" : "Mike McGrath",
"doc_count" : 2
},
{
"key" : "Terry Norton",
"doc_count" : 2
},
{
"key" : "Adam Scott",
"doc_count" : 1
}
..
]}
As you can see from this response, each bucket represents an author along with the number of books that author wrote. By default, the terms aggregation returns only the top 10 buckets, but you can tweak this by setting the size parameter on the terms aggregation, as the following listing shows.
GET books/_search?size=0
{
  "aggs": {
    "author_book_count": {
      "terms": {
        "field": "author.keyword",
        "size": 25
      }
    }
  }
}
Here, setting size to 25 fetches up to 25 buckets (25 authors and their book counts).
Multi-terms aggregation
The multi_terms aggregation resembles the terms aggregation with an additional feature: it aggregates data based on multiple keys. For example, rather than just finding the number of books written by an author, we might want the number of books per author-and-title combination. The following listing shows the query to get the author and book title(s) as a map.
GET books/_search?size=0
{
  "aggs": {
    "author_title_map": {
      "multi_terms": {
        "terms": [
          { "field": "author.keyword" },
          { "field": "title.keyword" }
        ]
      }
    }
  }
}
As you can see in the listing above, multi_terms accepts a set of terms. In the example, we expect Elasticsearch to return book counts keyed by author/title. The response indicates that we retrieved exactly that:
{
  "key" : [
    "Adam Scott",
    "JavaScript Everywhere"
  ],
  "key_as_string" : "Adam Scott|JavaScript Everywhere",
  "doc_count" : 1
},
{
  "key" : [
    "Al Sweigart",
    "Automate The Boring Stuff With Python"
  ],
  "key_as_string" : "Al Sweigart|Automate The Boring Stuff With Python",
  "doc_count" : 1
},
...
This response shows two representations of the key: as an array of fields (both author and title) and as a string (key_as_string), which is simply both fields stitched together with a | delimiter. The doc_count indicates the number of documents (books) in the index for that key.
If you are curious, rerun the query in the previous listing, this time using the tags and the title as the terms. As you’d expect, you should see multiple books grouped under the same tag (the code is available in my GitHub repository).
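For reference, here’s one way to write that variation; it assumes the books documents carry a tags field with a keyword sub-field (tags.keyword), so adjust the field names to your mapping:
GET books/_search?size=0
{
  "aggs": {
    "tags_title_map": {
      "multi_terms": {
        "terms": [
          { "field": "tags.keyword" },
          { "field": "title.keyword" }
        ]
      }
    }
  }
}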