Elasticsearch in Action: Custom Analyzers

Madhusudhan Konda

The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, where you'll find executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.

Elasticsearch provides a lot of flexibility when it comes to analyzers: if the off-the-shelf analyzers won't cut it for you, you can create your own custom analyzers. A custom analyzer is a mix-and-match of existing components, picked from Elasticsearch's large library of character filters, tokenizers, and token filters.

The usual practice is to define a custom analyzer in the index settings when creating an index, along with the required filters and a tokenizer. We can provide any number of character and token filters but only one tokenizer, as the figure below depicts.

Figure: Anatomy of a custom analyzer

As the figure shows, we define a custom analyzer on an index by setting its type to custom. The custom analyzer is composed of an array of character filters, represented by the `char_filter` object, and an array of token filters, represented by the `filter` attribute.

Note: The Elasticsearch folks should've named the filter attribute token_filter rather than filter, given that char_filter explicitly names the character filter. And one more thing: the plural forms char_filters and token_filters would've made more sense, in my opinion, as both attributes expect an array of filters!
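To make that anatomy concrete, here's a minimal sketch that stacks two token filters behind a single tokenizer; the index and analyzer names are hypothetical, and every component used is an off-the-shelf one:

PUT index_with_stacked_filters_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stacked_filters_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

The char_filter and filter attributes take arrays, so additional filters can be appended; the tokenizer attribute takes exactly one value.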

We are also expected to provide a tokenizer, picked from the list of off-the-shelf tokenizers, with custom configuration if needed. Let's look at an example of creating a custom analyzer. The listing below demonstrates the script for developing one. It has:

  • A character filter (html_strip) that strips HTML tags from the input field.
  • A standard tokenizer that tokenizes the field based on whitespace and punctuation.
  • A token filter that uppercases the words.
PUT index_with_custom_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["uppercase"]
        }
      }
    }
  }
}

We can test the analyzer using the following code snippet:

POST index_with_custom_analyzer/_analyze
{
  "text": "<H1>HELLO, WoRLD</H1>",
  "analyzer": "custom_analyzer"
}

This call produces two tokens: ["HELLO", "WORLD"], indicating that our html_strip filter removed the H1 HTML tags before letting the standard tokenizer split the field into two tokens on whitespace and punctuation. Finally, the tokens were uppercased as they passed through the uppercase token filter.
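For reference, the _analyze API returns each token along with its offsets, type, and position; the response to the call above looks roughly like this (the offsets are indicative of positions in the original, unstripped text):

{
  "tokens": [
    { "token": "HELLO", "start_offset": 4,  "end_offset": 9,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "WORLD", "start_offset": 11, "end_offset": 16, "type": "<ALPHANUM>", "position": 1 }
  ]
}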

While this sort of customization satisfies a range of requirements, even more advanced requirements can be met, as the next section shows.

Advanced customization

While the default configurations of the analyzer components work most of the time, sometimes we need to build analyzers from components with non-default configurations. Say we want to use a mapping character filter that maps characters such as & to "and", and < and > to "less than" and "greater than", respectively.
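As a quick sketch of that idea (the index, analyzer, and character-filter names here are made up for illustration), such a mapping character filter could be wired into a custom analyzer like this:

PUT index_with_symbol_mapper_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "symbol_mapper_analyzer": {
          "type": "custom",
          "char_filter": ["symbol_mapper"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      },
      "char_filter": {
        "symbol_mapper": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "< => less than",
            "> => greater than"
          ]
        }
      }
    }
  }
}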

Let’s suppose our requirement is to develop a custom analyzer that parses text for Greek letters and produces a list of Greek letters as a result. The following listing demonstrates the code to create an index with analysis settings.

PUT index_with_parse_greek_letters_custom_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "greek_letter_custom_analyzer": {
          "type": "custom",
          "char_filter": ["greek_symbol_mapper"],
          "tokenizer": "standard",
          "filter": ["lowercase", "greek_keep_words"]
        }
      },
      "char_filter": {
        "greek_symbol_mapper": {
          "type": "mapping",
          "mappings": [
            "α => alpha",
            "β => Beta",
            "γ => Gamma"
          ]
        }
      },
      "filter": {
        "greek_keep_words": {
          "type": "keep",
          "keep_words": ["alpha", "beta", "gamma"]
        }
      }
    }
  }
}

The code in the listing is a bit of a handful; however, it is easy to understand. In the first part, where we define the custom analyzer, we provide a list of filters (both character and token filters, if needed) and a tokenizer. You can imagine this as the entry point to the analyzer's definition.

The second part of the code then defines the filters that were declared earlier. For example, the greek_symbol_mapper, declared under its own char_filter section, uses mapping as the filter's type along with a set of mappings. The same goes for the filter block, which defines the greek_keep_words filter: a keep filter removes any token that isn't present in its keep_words list.
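Incidentally, such component definitions can be tried out before they're baked into an index: the _analyze API accepts a tokenizer, character filters, and token filters inline, so something along these lines should reproduce the behaviour of the listing above (the inline definitions simply mirror it):

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["α => alpha", "β => Beta", "γ => Gamma"]
    }
  ],
  "filter": [
    "lowercase",
    {
      "type": "keep",
      "keep_words": ["alpha", "beta", "gamma"]
    }
  ],
  "text": "α and β are roots of a quadratic equation. γ isn't"
}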

Once the script is ready, we can execute a test sample for analysis. In the following listing, we pass a sentence through the analyzer for testing.

POST index_with_parse_greek_letters_custom_analyzer/_analyze
{
  "text": "α and β are roots of a quadratic equation. γ isn't",
  "analyzer": "greek_letter_custom_analyzer"
}

The Greek letters (α, β, and γ in this case) are processed by the custom analyzer (greek_letter_custom_analyzer), and it outputs the following tokens:

["alpha", "beta", "gamma"]

The rest of the words, such as "roots" and "quadratic equation", were removed.

We can configure analyzers not just at the field level but in other places too, such as at the index level. We can also specify a different analyzer for search queries if the requirement dictates, as sketched below.
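As a sketch of that last point (the index and field names below are hypothetical), a field mapping can name one analyzer for indexing and a different one for search via the search_analyzer attribute:

PUT index_with_split_analyzers
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple"
      }
    }
  }
}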
