Language Analyzers in Action

Madhusudhan Konda
Jan 26, 2023

The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, including executable Kibana scripts, so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.

Elasticsearch provides a long list of language analyzers suitable for working with most common languages. Moreover, you can configure these out-of-the-box language analyzers with a stop words filter so you don’t index unnecessary (or common) words of that language. The supported analyzers are Arabic, Armenian, Basque, Bengali, Bulgarian, Catalan, Czech, Dutch, English, Finnish, French, Galician, German, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, and Turkish. The following code listing demonstrates three of them (English, German, and Hindi) in action.

# English Language Analyzer
POST _analyze
{
  "text": "She sells sea shells",
  "analyzer": "english"
}

# German Language Analyzer
POST _analyze
{
  "text": "Guten Morgen",
  "analyzer": "german"
}

# Hindi Language Analyzer
POST _analyze
{
  "text": "नमस्ते कैसी हो तुम",
  "analyzer": "hindi"
}

We can configure the language analyzers with a few additional parameters to provide our own list of stop words or to ask the analyzer to skip the stemming operation. For example, the stop token filter used by the English analyzer categorizes a handful of words as stop words, and we can override this list as needed. Say we want to narrow the stop words down to just “a”, “an”, “is”, “and”, and “for”. In this case, we can configure our stop words as the following listing shows.

PUT index_with_custom_english_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_with_custom_english_analyzer": {
          "type": "english",
          "stopwords": ["a", "an", "is", "and", "for"]
        }
      }
    }
  }
}

As the code indicates, we created an index with a custom English analyzer and a set of user-defined stop words. When we test a piece of text with this analyzer, as the following listing shows, we can see that the stop words are honored.

POST index_with_custom_english_analyzer/_analyze
{
  "text": "A dog is for a life",
  "analyzer": "index_with_custom_english_analyzer"
}

This code outputs just two tokens: “dog” and “life”. The words “a”, “is”, and “for” are removed because they match the stop words we specified earlier.
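The response from the _analyze call takes roughly the following shape. It is abridged here; each token entry also carries character offsets, a type, and a position.

{
  "tokens": [
    { "token": "dog", ... },
    { "token": "life", ... }
  ]
}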

Language analyzers offer one more notable feature: stemming. Stemming is a mechanism that reduces words to their root form. For example, any form of the word “author” (“authors”, “authoring”, “authored”, and so on) is reduced to the single word “author”. The following listing shows this behavior.

POST index_with_custom_english_analyzer/_analyze
{
  "text": "author authors authoring authored",
  "analyzer": "english"
}

This code produces four tokens (split by the standard tokenizer), all of them “author”, because “author” is the root of every form of the word. But sometimes stemming goes a bit too far. If you add “authorization” or “authority” to the text in the previous listing, unfortunately those words also get stemmed and indexed as “author”! You will then be unable to find pertinent answers when searching for “authority” or “authorization”, because those words never made it into the inverted index in the first place.
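You can observe this over-stemming directly with a request in the same pattern as the earlier listings:

# Over-stemming: both words are reduced to "author"
POST _analyze
{
  "text": "authorization authority",
  "analyzer": "english"
}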

All is not lost. We can configure our English analyzer so that certain words, in this case “authorization” and “authority”, are not stemmed at all. For this, we use the stem_exclusion parameter to list the words that must be excluded from stemming. The code in the following listing does exactly that, creating an index with custom settings and passing the words to the stem_exclusion parameter.

PUT index_with_stem_exclusion_english_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stem_exclusion_english_analyzer": {
          "type": "english",
          "stem_exclusion": ["authority", "authorization"]
        }
      }
    }
  }
}

Once you’ve created the index with these settings, the next step is to test it. The following listing runs a piece of text through the custom analyzer.

POST index_with_stem_exclusion_english_analyzer/_analyze
{
  "text": "No one can challenge my authority without my authorization",
  "analyzer": "stem_exclusion_english_analyzer"
}

The tokens produced by this code include our two words, “authority” and “authorization”, in their original form, indicating that both were left untouched by the stemmer.

While the built-in analyzers do what we want in most cases, at times we may need to tailor text analysis to additional requirements. For example, we may want to strip special content such as HTML tags from the incoming text, or to remove stop words. Stripping HTML tags is the job of the html_strip character filter, and unfortunately, not every analyzer includes it.

In such cases, we can customize the analyzer by configuring the required functionality ourselves: we can add a character filter like html_strip and perhaps enable the stop token filter too.
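As a sketch of this kind of customization, the following listing wires the html_strip character filter and the stop token filter into a custom analyzer (the index and analyzer names here are illustrative, not from the book's repository):

PUT index_with_html_strip_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

A custom analyzer is declared with "type": "custom" and composes exactly one tokenizer with any number of character filters and token filters, applied in the order listed.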

Written by Madhusudhan Konda

Madhusudhan Konda is a full-stack lead engineer, mentor, and conference speaker. He delivers live online training on Elasticsearch, Elastic Stack, and Spring Cloud.