Elasticsearch in Action: Introducing Ingest Pipelines (1/3)
The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository. You can find executable Kibana scripts in the repository, so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.
Part 1/3 of the Ingest Pipelines mini-article series.
Overview
Data that’s expected to be indexed into Elasticsearch may need to undergo transformation and manipulation. Consider loading millions of legal documents, represented as PDF files, into Elasticsearch for searching. Bulk loading them as-is is one way, but it is inadequate, cumbersome, and error-prone.
If ETL (extract-transform-load) tools come to mind for such data manipulation tasks, you are absolutely right. There is a plethora of such tools, including Logstash. Logstash certainly manipulates our data before indexing it into Elasticsearch or persisting it to a database or some other destination. However, it is not lightweight, and it needs an elaborate (if not complex) setup, preferably on a separate machine.
Ingest pipelines, on the other hand, are pipelines made up of a set of processors, defined using the same syntax as Query DSL, that are applied to the incoming data to ETL it. The workflow is straightforward:
- Create one or more pipelines with the expected logic, based on the business requirements, for the transformations, enhancements, or enrichments to be carried out on the data
- Invoke the pipelines on the incoming data; the data goes through the series of processors in a pipeline, getting manipulated at every stage
- The processed data is then indexed (a sketch of creating a basic pipeline follows this list)
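To make the first step concrete, here is a minimal sketch of a pipeline definition. The pipeline name (legal_docs_pipeline) and the document fields (category, title) are hypothetical, chosen purely for illustration; the sketch chains two built-in processors, set and uppercase:

```
PUT _ingest/pipeline/legal_docs_pipeline
{
  "description": "Hypothetical pipeline: stamps a category and normalizes the title",
  "processors": [
    {
      "set": {
        "field": "category",
        "value": "legal"
      }
    },
    {
      "uppercase": {
        "field": "title"
      }
    }
  ]
}
```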
The figure given below shows the workings of two independent pipelines:
Here, we have created two independent pipelines with different sets of processors. These pipelines are hosted/created on an ingest node.
The data gets massaged while going through these processors before indexing. We can invoke these pipelines during a bulk load or when indexing individual documents, as sketched below.
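Here is a sketch of both invocation styles, reusing the hypothetical legal_docs_pipeline and an assumed index name (legal_docs) from the earlier example. The pipeline query parameter tells Elasticsearch to run the documents through the pipeline before indexing them:

```
# Index a single document through the pipeline
PUT legal_docs/_doc/1?pipeline=legal_docs_pipeline
{
  "title": "case notes",
  "author": "John Doe"
}

# Apply the same pipeline to every document in a bulk request
POST _bulk?pipeline=legal_docs_pipeline
{ "index": { "_index": "legal_docs", "_id": "2" } }
{ "title": "appeal hearing", "author": "Jane Doe" }
```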
A processor is a software component that performs one transformation activity on the incoming data. A pipeline is made of a series of these processors, each dedicated to a single task. A processor takes the input, “processes” it based on its logic, and emits the processed data to the next stage. We can chain as many of these processors as the requirements dictate; Elasticsearch provides over three dozen processors out of the box.
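To see such chaining in action without indexing anything, you can test a pipeline definition with the _simulate API. The following is a sketch with illustrative field names (author, reviewed); each sample document flows through the trim, lowercase, and set processors in turn:

```
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "trim": { "field": "author" } },
      { "lowercase": { "field": "author" } },
      { "set": { "field": "reviewed", "value": false } }
    ]
  },
  "docs": [
    { "_source": { "author": "  John DOE  " } }
  ]
}
```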
That’s a gentle introduction to ingest pipelines. In the next article, we will go over the mechanics of these pipelines and how they work together to manipulate the data.