Elasticsearch in Action: Mechanics of Ingest Pipelines (2/3)
The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, including executable Kibana scripts, so you can run the commands in Kibana straight away. All code was tested against Elasticsearch version 8.4.
Part 2/3 of the Ingest Pipelines mini-article series:
Overview
In the last article, we learned that ingest pipelines help transform and manipulate data with little or no costly setup. A pipeline is made of a chain of processors, each performing a specific operation on the incoming data. These ingest processors are loaded onto nodes that are assigned an ingest role; each node in a cluster can be assigned one or more roles, such as master, data, ingest, or machine learning.
Let’s take an example of MI5’s top secret operations data being loaded from a database into Elasticsearch to enable search. Before indexing, every document extracted from the database must be stamped with an additional category field set to “confidential”. The first step in addressing this requirement is to create a pipeline with a processor: we add a field called “category” with the value “confidential” using a set processor.
Creating an Ingest Pipeline
Elasticsearch exposes the _ingest API to create and test pipelines. We can use the _ingest/pipeline/<pipeline_name> URL to create a new pipeline. The code in the listing below creates a new pipeline with one processor, the set processor:
PUT _ingest/pipeline/confidential_files_pipeline
{
  "description": "Stamp confidential on the file (document)",
  "processors": [
    {
      "set": {
        "field": "category",
        "value": "confidential"
      }
    }
  ]
}
Here, we are creating an ingest pipeline called confidential_files_pipeline with a single set processor. The set processor’s job is to create a new field called “category” with the value “confidential”. When a new document is passed through this pipeline, the set processor adds the category field to the document on the fly.
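For reference, the same pipeline could be created from Python with the official elasticsearch client. This is a hedged sketch: the endpoint URL and client setup are assumptions, and the request itself is commented out so the snippet stands alone without a running cluster.

```python
# Pipeline definition mirroring the Kibana request above.
pipeline_body = {
    "description": "Stamp confidential on the file (document)",
    "processors": [
        {"set": {"field": "category", "value": "confidential"}},
    ],
}

# With a reachable cluster, the PUT request would look like this
# (commented out so the sketch stays self-contained):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")  # assumed local endpoint
# es.ingest.put_pipeline(id="confidential_files_pipeline", **pipeline_body)
```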
Dry Run the Pipeline
Once the pipeline definition is executed, it is stored in the cluster state and is ready to be put to use. However, we can dry-run the pipeline before we start indexing data, using the _simulate API. The code in the listing below demonstrates simulating the above pipeline:
POST _ingest/pipeline/confidential_files_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "op_name": "Operation Cobra"
      }
    }
  ]
}
Executing the above code returns the response shown below. When we simulate (dry-run) the pipeline, the document is not indexed; instead, the pipeline’s logic is exercised against it.
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "category": "confidential",
          "op_name": "Operation Cobra"
        },
        "_ingest": {
          "timestamp": "2022-11-03T23:42:33.379569Z"
        }
      }
    }
  ]
}
The _source object contains our modified document: an additional field, category, was added by the pipeline. This is the magic of the set processor.
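The simulate semantics can be sketched in plain Python. This is a toy model, not Elasticsearch’s implementation: the document is transformed and returned, but nothing is ever written to an index.

```python
def simulate_set(doc: dict, field: str, value: str) -> dict:
    """Toy model of _simulate with a single set processor: return a
    transformed copy of the document; nothing is indexed."""
    transformed = dict(doc)     # the original document is left untouched
    transformed[field] = value  # the set processor adds/overwrites the field
    return transformed

original = {"op_name": "Operation Cobra"}
result = simulate_set(original, "category", "confidential")
print(result)    # {'op_name': 'Operation Cobra', 'category': 'confidential'}
print(original)  # unchanged: {'op_name': 'Operation Cobra'}
```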
Chaining the Processors
As we discussed earlier, we can chain processors. Say we wish to uppercase the category value; all we do is add another processor, aptly called uppercase, to the pipeline and re-run the query. The query in the listing below adds an uppercase processor to the existing pipeline:
PUT _ingest/pipeline/confidential_files_pipeline
{
  "description": "Stamp confidential on the file (document)",
  "processors": [
    {
      "set": {
        "field": "category",
        "value": "confidential"
      }
    },
    {
      "uppercase": {
        "field": "category"
      }
    }
  ]
}
As the query shows, we added an uppercase processor so that the two processors are chained: the output of the first processor becomes the input of the second. Note that each processor is its own object in the processors array. The result of simulating the updated pipeline is shown in the snippet below:
"_source": {
  "category": "CONFIDENTIAL",
  "op_name": "Operation Cobra"
}
The category field is added by the set processor, and the same field is then uppercased by the uppercase processor, yielding the value CONFIDENTIAL in the final document.
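The chaining behavior can be modeled in a few lines of Python. This is a simplified sketch, not the actual processor implementations: each processor receives the previous processor’s output.

```python
def set_processor(doc: dict, field: str, value: str) -> dict:
    """Toy model of the set processor: add/overwrite a field."""
    return {**doc, field: value}

def uppercase_processor(doc: dict, field: str) -> dict:
    """Toy model of the uppercase processor: uppercase an existing field."""
    return {**doc, field: doc[field].upper()}

# Chain the processors: the output of set feeds into uppercase,
# mirroring the order in the pipeline definition.
doc = {"op_name": "Operation Cobra"}
doc = set_processor(doc, "category", "confidential")
doc = uppercase_processor(doc, "category")
print(doc["category"])  # CONFIDENTIAL
```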
In the next article, we will see a practical example of loading PDF files into Elasticsearch using ingest pipelines so we can enable search on their content.