Elasticsearch in Action: Child Level Bucket Aggregations (2/2)
These excerpts are taken from my upcoming book, Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, including executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.
In the last article, we learned about bucket aggregations. In this article, we look at child-level bucket aggregations.
Child-level aggregations
In addition to creating the buckets with the respective ranges, we may want to aggregate the data inside those buckets too. For example, we may want to find the average rating of a book for each bucket.
To satisfy such requirements, we can use a sub-aggregation: an aggregation that works on a bucket’s data. Bucketing aggregations support both metric and bucket aggregations applied at the child level. The query in the following listing fetches books bucketed by release year along with the average rating for each bucket.
GET books/_search
{
  "size": 0,
  "aggs": {
    "release_date_histogram": {
      "date_histogram": {
        "field": "release_date",
        "calendar_interval": "1y"
      },
      "aggs": {
        "avg_rating_per_bucket": {
          "avg": {
            "field": "amazon_rating"
          }
        }
      }
    }
  }
}
As you can see from the listing, there are two blocks of aggregation, one weaved inside the other. The outer aggregation (release_date_histogram) produces the data as a histogram based on a calendar interval of one year. The results of this aggregation are then fed to the next level: the inner aggregation (avg_rating_per_bucket). The inner aggregation takes each bucket as its scope and runs the average (avg) aggregation on that data, producing the average book rating per bucket. The figure below shows the expected result of the aggregation execution.
As the figure shows, the keys are dates that honor the calendar year, each with a set of documents in that bucket. The notable thing about this query is the additional object in each bucket, avg_rating_per_bucket, which contains the average book rating.
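In case you can’t run the query yourself, the response takes roughly the following shape; the bucket keys, counts, and averages below are placeholders for illustration, not actual values from the dataset:
"aggregations": {
  "release_date_histogram": {
    "buckets": [
      {
        "key_as_string": "2021-01-01T00:00:00.000Z",
        "key": 1609459200000,
        "doc_count": 2,
        "avg_rating_per_bucket": {
          "value": 4.5
        }
      },
      ...
    ]
  }
}
Each date_histogram bucket carries its usual key and doc_count, plus one extra object named after the sub-aggregation (avg_rating_per_bucket) that holds the computed value.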
Custom range aggregation
The histogram provides an automatic set of ranges for a given interval. At times, we may want the data segregated into specific ranges that aren’t dictated by a strict interval (for example, classifying people into three age groups: 18–21, 22–49, and 50+). A standardized interval doesn’t satisfy this requirement; what we need is a mechanism to customize the ranges. That’s exactly what the range aggregation is for.
The range aggregation groups documents into user-defined custom ranges. Let’s look at it in action by writing a query to fetch books that fall into just two rating categories: below and above a value of 4 (1–4 and 4–5). The query in the following listing does exactly that.
GET books/_search
{
  "size": 0,
  "aggs": {
    "book_ratings_range": {
      "range": {
        "field": "amazon_rating",
        "ranges": [
          { "from": 1, "to": 4 },
          { "from": 4, "to": 5 }
        ]
      }
    }
  }
}
The query constructs an aggregation with a custom range defined by an array (ranges) containing just two buckets: from 1 to 4 and from 4 to 5. The following response indicates that two books fall into the 1–4 rating bucket and the rest into the 4–5 bucket:
"aggregations" : {
"book_ratings_range" : {
"buckets" : [
{
"key" : "1.0-4.0",
"from" : 1.0,
"to" : 4.0,
"doc_count" : 2
},
{
"key" : "4.0-5.0",
"from" : 4.0,
"to" : 5.0,
"doc_count" : 35
}
]
}
}
The range aggregation is a slight variation on the histogram aggregation and is hence well suited for the special or custom ranges a user sometimes needs. Of course, if the system-provided buckets work for you and you don’t need customization, a histogram is the suitable tool for the purpose.
> The range aggregation’s ranges are made up of from and to attributes; the from value is included while the to value is excluded when determining which documents fall into a bucket.
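As a side note, if named buckets are easier to work with than the autogenerated "1.0-4.0"-style keys, the range aggregation also accepts an optional key on each range plus a keyed flag, which returns the buckets as an object keyed by those names. The key names below (low and high) are my own choice:
GET books/_search
{
  "size": 0,
  "aggs": {
    "book_ratings_range": {
      "range": {
        "field": "amazon_rating",
        "keyed": true,
        "ranges": [
          { "key": "low", "from": 1, "to": 4 },
          { "key": "high", "from": 4, "to": 5 }
        ]
      }
    }
  }
}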
IP address ranges
By the same principle, we can classify IP addresses into custom ranges using the dedicated ip_range aggregation. The code in the listing below demonstrates exactly that. Note that this code is for demonstration purposes only, because we don’t have a networks index primed with data consisting of the localhost_ip_address field.
GET networks/_search
{
  "aggs": {
    "my_ip_addresses_custom_range": {
      "ip_range": {
        "field": "localhost_ip_address",
        "ranges": [
          { "from": "192.168.0.10", "to": "192.168.0.20" },
          { "from": "192.168.0.20", "to": "192.168.0.100" }
        ]
      }
    }
  }
}
As you can see from this sample aggregation, we can segregate the IP addresses according to our custom ranges. The query produces two buckets: one covering 192.168.0.10 up to (but not including) 192.168.0.20, and a second covering 192.168.0.20 up to 192.168.0.100.
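The ip_range aggregation can also take CIDR masks instead of explicit from/to bounds, which is handy when the ranges align with subnets. A sketch, again assuming the hypothetical networks index:
GET networks/_search
{
  "aggs": {
    "my_ip_addresses_cidr": {
      "ip_range": {
        "field": "localhost_ip_address",
        "ranges": [
          { "mask": "192.168.0.0/25" },
          { "mask": "192.168.0.128/25" }
        ]
      }
    }
  }
}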
The terms aggregation
When we want to retrieve an aggregated count for a certain field, say, authors and their book counts, the terms aggregation comes in handy. The terms aggregation collects data into a bucket for each unique occurrence of a term. For example, in the following query, the terms aggregation creates a bucket for each author with the number of books they’ve written.
GET books/_search?size=0
{
  "aggs": {
    "author_book_count": {
      "terms": {
        "field": "author.keyword"
      }
    }
  }
}
The query uses the terms aggregation to fetch the list of authors in the books index along with their book counts. In the response, key holds the author’s name, and doc_count shows the number of books for that author:
"buckets" : [
{
"key" : "Herbert Schildt",
"doc_count" : 2
},
{
"key" : "Mike McGrath",
"doc_count" : 2
},
{
"key" : "Terry Norton",
"doc_count" : 2
},
{
"key" : "Adam Scott",
"doc_count" : 1
}
..
]}
As you can see from this response, each bucket represents an author along with the number of books that author wrote. By default, the terms aggregation returns only the top 10 buckets, but you can tweak this by setting the size parameter on the terms aggregation, as the following listing shows.
GET books/_search?size=0
{
  "aggs": {
    "author_book_count": {
      "terms": {
        "field": "author.keyword",
        "size": 25
      }
    }
  }
}
Here, setting size to 25 fetches up to 25 buckets (25 authors and their book counts).
Multi-terms aggregation
The multi_terms aggregation resembles the terms aggregation with an additional feature: it aggregates data based on multiple keys. For example, rather than just finding the number of books written by an author, we might want the number of books per author-and-title combination. The following listing shows the query to get the author and book title(s) as a map.
GET books/_search?size=0
{
  "aggs": {
    "author_title_map": {
      "multi_terms": {
        "terms": [
          { "field": "author.keyword" },
          { "field": "title.keyword" }
        ]
      }
    }
  }
}
As you can see in the listing above, multi_terms accepts a set of terms. In the example, we expect Elasticsearch to return book counts keyed by author/title. The response indicates that we retrieved exactly that:
{
  "key" : [
    "Adam Scott",
    "JavaScript Everywhere"
  ],
  "key_as_string" : "Adam Scott|JavaScript Everywhere",
  "doc_count" : 1
},
{
  "key" : [
    "Al Sweigart",
    "Automate The Boring Stuff With Python"
  ],
  "key_as_string" : "Al Sweigart|Automate The Boring Stuff With Python",
  "doc_count" : 1
},
...
This response shows two representations of the key: as an array of fields (both author and title) and as a string (key_as_string), which is simply both fields stitched together with a | delimiter. The doc_count indicates the number of documents (books) in the index for that key.
If you are curious, rerun the query in the previous listing, this time using the tags and the title as the terms. As you’d expect, you should see multiple books grouped under the same tag (the code is available in my GitHub repository).
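For reference, here’s one way to write that variation; it assumes the books documents carry a tags field with a keyword sub-field (tags.keyword), so adjust the field names to your mapping:
GET books/_search?size=0
{
  "aggs": {
    "tags_title_map": {
      "multi_terms": {
        "terms": [
          { "field": "tags.keyword" },
          { "field": "title.keyword" }
        ]
      }
    }
  }
}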