Elasticsearch in Action: Gentle Introduction to Data Streams
These excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, along with executable Kibana scripts that you can run in Kibana straight away. All code is tested against Elasticsearch 8.4.
Time-series data
We have been working with indices (such as movies, movie_reviews, and so on) that hold and collect data over time. When the data grows too large, we usually add more indices and copy (or move) data across to accommodate it. The expectation is that this type of data doesn’t need to be rolled over into newer indices periodically, such as hourly, daily, or monthly. Keeping this in the back of your mind, let’s look at a different type of data: time-series data.
As the name indicates, time-series data is time sensitive and time dependent. Take the example of logs generated by an Apache web server, shown in the figure below:
The logs are continuously written to the current day’s log file, and each log statement has a timestamp associated with it. At midnight, the file is backed up with a date stamp and a new file is created for the new day. The logging framework initiates this rollover automatically at the day cutover.
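To make the midnight rollover concrete, here is a minimal sketch that uses Python's standard logging module as a stand-in for the web server's logging framework (an assumption for illustration; Apache itself is configured differently). The file name access.log, the logger name, and the retention of seven backups are hypothetical.

```python
import logging
import logging.handlers
import os
import tempfile

# Hypothetical log location; a temporary directory keeps the sketch self-contained.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "access.log")

# Roll the file over at midnight and keep seven dated backups (assumed retention).
handler = logging.handlers.TimedRotatingFileHandler(
    log_path, when="midnight", backupCount=7
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("access")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# Every statement is written, with a timestamp, to the current day's file.
logger.info("GET /index.html 200")
handler.flush()
```

With this handler, the file for a finished day is renamed with a date suffix (for example access.log.2021-10-24) and a fresh access.log is started for the new day.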
If we wish to hold the log data in Elasticsearch, we need to rethink our strategy for indexing data that rolls over periodically. Surely, we could write an index-rollover script that rolls over the indices at midnight every day. But there’s more to this than just rolling over the data. For example, we also need to direct search requests against a single parent index rather than multiple rolling indices. Ideally, we would create an alias for this purpose.
Alias: An alias is an alternate name for a single index or a set of indices. The ideal way to search across multiple indices is to create an alias that points to all of them. When we search against an alias, we are essentially searching against all the indices behind that alias.
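As a rough sketch of what creating such an alias might look like, the snippet below builds the request body for Elasticsearch's aliases API and prints it in a form that can be pasted into Kibana. The index names and the alias name my-app are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical daily indices and alias name, used for illustration only.
daily_indices = ["my-app-2021-10-24", "my-app-2021-10-25"]
alias_name = "my-app"

# Request body for POST /_aliases: one "add" action per index.
alias_request = {
    "actions": [
        {"add": {"index": index, "alias": alias_name}} for index in daily_indices
    ]
}

print("POST /_aliases")
print(json.dumps(alias_request, indent=2))
```

Once the alias exists, a search such as GET /my-app/_search would cover both daily indices.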
This leads us to an important concept called data streams, discussed in the next section.
Data streams
Data streams accommodate time-series data in Elasticsearch: they let us hold the data in multiple indices while allowing access as a single resource for search and analytics queries. As discussed earlier, data that is tied to a date or time axis, such as logs, events from an automated car, or pollution levels in a city, is expected to be hosted in timed indices. At a high level, a group of such indices is called a data stream. Behind the scenes, each data stream has a set of indices, one for each time point. These indices are auto-generated by Elasticsearch and hidden.
The figure below shows an example data stream for e-commerce order logs generated and captured daily. It also shows how the order data stream is composed of auto-generated hidden indices, one per day. In effect, the data stream acts as an alias for the time-series (rolling) hidden indices behind the scenes. While search/read requests span all of the data stream’s backing hidden indices, indexing requests are directed only to the newest (current) index.
A data stream consists of automatically generated hidden indices
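The read and write routing described above can be sketched with a small in-memory model. This is a toy illustration, not Elasticsearch code: the ToyDataStream class and its methods are hypothetical, and the backing index names only imitate the .ds-<stream>-<date>-<generation> pattern that Elasticsearch uses for hidden backing indices.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataStream:
    """In-memory stand-in for a data stream; not Elasticsearch code."""
    name: str
    backing: dict = field(default_factory=dict)  # index name -> list of docs

    def rollover(self, day: str) -> str:
        # Create a new hidden backing index; it becomes the write index.
        generation = len(self.backing) + 1
        index = f".ds-{self.name}-{day}-{generation:06d}"
        self.backing[index] = []
        return index

    @property
    def write_index(self) -> str:
        # Dicts preserve insertion order, so the last key is the newest index.
        return next(reversed(self.backing))

    def index_doc(self, doc: dict) -> str:
        # Writes go only to the current (newest) backing index.
        self.backing[self.write_index].append(doc)
        return self.write_index

    def search(self, predicate) -> list:
        # Reads fan out across every backing index.
        return [d for docs in self.backing.values() for d in docs if predicate(d)]


orders = ToyDataStream("orders")
orders.rollover("2021.10.24")
orders.index_doc({"order_id": 1, "status": "paid"})
orders.rollover("2021.10.25")
written_to = orders.index_doc({"order_id": 2, "status": "paid"})

print(written_to)
print(len(orders.search(lambda d: d["status"] == "paid")))
```

The second document lands in the newest backing index only, while the search sees documents from both days.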
Data streams are created using a matching index template. Templates are blueprints consisting of settings and configuration values that are applied when creating resources such as indices. Indices created from a template inherit the settings defined in that template.
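A minimal sketch of such a template follows, assuming a hypothetical template name my-app-template, a hypothetical pattern my-app-*, and illustrative shard, replica, and priority values. It builds the request body for the index template API; the empty data_stream object is what marks matching names as data streams, and documents in a data stream are expected to carry an @timestamp field.

```python
import json

# Hypothetical template: the names, pattern, and values below are assumptions.
template_request = {
    "index_patterns": ["my-app-*"],
    "data_stream": {},  # marks names matching the pattern as data streams
    "priority": 500,    # assumed value
    "template": {
        "settings": {"number_of_shards": 1, "number_of_replicas": 1},
        "mappings": {"properties": {"@timestamp": {"type": "date"}}},
    },
}

print("PUT /_index_template/my-app-template")
print(json.dumps(template_request, indent=2))

# With the template in place, a matching data stream can be created explicitly.
print("PUT /_data_stream/my-app-logs")
```

Backing indices created for my-app-logs would then inherit the settings and mappings defined under template.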
Take the example of logs that are written to a file daily. These logs are then exported to indices suffixed with a time period, such as my-app-2021-10-24.log. When one day rolls over to the next, you’d expect the respective index to roll over too; for example, from my-app-2021-10-24.log to my-app-2021-10-25.log (the date is incremented by a day), as figure 6.10 shows:
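The naming scheme is easy to express in code. The helper below is a hypothetical illustration (daily_index_name is not part of any Elasticsearch client) that derives the date-suffixed name for a given day and for the day after.

```python
from datetime import date, timedelta


def daily_index_name(day: date, prefix: str = "my-app") -> str:
    # Build the date-suffixed name, e.g. my-app-2021-10-24.log
    return f"{prefix}-{day:%Y-%m-%d}.log"


today = date(2021, 10, 24)
tomorrow = today + timedelta(days=1)

print(daily_index_name(today))
print(daily_index_name(tomorrow))
```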
We could write a scheduled job to do this for us, but fortunately, Elastic relatively recently released a feature called index lifecycle management (ILM). We discussed ILM in this article; have a look.
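For orientation, a daily rollover policy might look roughly like the sketch below, which builds the request body for the ILM policy API. The policy name my-app-daily-policy and the 30-day retention are assumptions made for illustration.

```python
import json

# Hypothetical policy name and retention values, chosen for illustration.
policy_name = "my-app-daily-policy"
ilm_policy_request = {
    "policy": {
        "phases": {
            "hot": {
                # Roll over to a new index once the current one is a day old.
                "actions": {"rollover": {"max_age": "1d"}}
            },
            "delete": {
                # Remove indices 30 days after they were rolled over.
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}

print(f"PUT /_ilm/policy/{policy_name}")
print(json.dumps(ilm_policy_request, indent=2))
```

Note that max_age is measured from the creation of the index, so the rollover happens roughly a day after the index was created rather than exactly at midnight.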