Elasticsearch in Action: Anatomy of a Text Analyzer

Madhusudhan Konda
5 min read · Jan 25, 2023
Elasticsearch in Action by M Konda

The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, where you will also find executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.


The analyzer is a software module essentially tasked with two functions: tokenization and normalization. Elasticsearch employs these two processes so that text fields are thoroughly analyzed and stored in inverted indexes for advanced query matching. Let’s look at these concepts at a high level before drilling down into the anatomy of the analyzer.

Tokenization

Tokenization is the process of splitting sentences into individual words, and it follows certain rules. For example, we can instruct the process to break sentences on a delimiter such as whitespace, a letter, a pattern, or other criteria. This process is carried out by a component called a tokenizer, whose sole job is to chop the sentence into individual words, called tokens, by following certain rules. A whitespace tokenizer, for example, splits the text wherever it finds whitespace; other tokenizers, such as the standard tokenizer, also strip punctuation and other non-letter characters.
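As a quick way to see tokenization in isolation, Elasticsearch’s _analyze API lets you run just a tokenizer over a piece of text. The sketch below is a minimal example you can paste into Kibana Dev Tools; the sample sentence is only illustrative:

# Split text purely on whitespace, with no normalization applied
GET _analyze
{
  "tokenizer": "whitespace",
  "text": "Peter Piper picked a peck of pickled peppers"
}

The response lists the tokens “Peter”, “Piper”, “picked”, and so on, each with its position and character offsets; nothing has been lowercased or otherwise normalized yet.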

The words can also be split based on non-letters, colons, or some other custom separators. For example, a movie reviewer’s assessment saying, “The movie was sick!!! Hilarious :) :)”, can be split into individual words: “The”, “movie”, “was”, “sick”, “Hilarious”, and so on (note that the words are not yet lowercased). Or “pickled-peppers” can be tokenized to “pickled” and “peppers”, “K8s” can be tokenized to “K” and “s”, and so on. While this helps us search on words (individual or combined), it can only go so far: it cannot on its own answer queries involving synonyms, plurals, and the other searches we mentioned earlier. The normalization process takes the analysis from here to the next stage.
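Applying the same idea to the reviewer’s sentence above, here is a small sketch using the standard tokenizer (the exact token list can vary slightly across versions):

# The standard tokenizer splits on word boundaries and drops punctuation
GET _analyze
{
  "tokenizer": "standard",
  "text": "The movie was sick!!! Hilarious :) :)"
}

The exclamation marks and emoticons are discarded, leaving “The”, “movie”, “was”, “sick”, and “Hilarious”; note that the tokens are still not lowercased, because lowercasing is normalization work.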

Normalization

Normalization is where the tokens (words) are massaged, transformed, modified, and enriched, in the form of stemming, synonyms, stop words, and other features. This is where additional features are added to the analysis process to ensure the data is stored appropriately for searching purposes. One such feature is stemming: an operation in which words are reduced (stemmed) to their root form. For example, “author” is the root word for “authors”, “authoring”, and “authored”.
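To see stemming on its own, you can chain the lowercase and stemmer token filters onto the standard tokenizer in an _analyze call. This is a minimal sketch; the exact root forms depend on the stemmer algorithm in use:

# Lowercase the tokens, then reduce them to their root forms
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stemmer"],
  "text": "authors authoring authored"
}

With the default English stemmer, all three tokens should come back as the single root form “author”.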

In addition to stemming, normalization also deals with finding appropriate synonyms before adding them to the inverted index. For example, “author” may have synonyms such as “wordsmith”, “novelist”, and “writer”. And finally, every document contains a number of words such as “a”, “an”, “and”, “is”, “but”, and “the” that are called stop words, because they contribute little to finding the relevant documents.
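The sketch below exercises both ideas in one _analyze call: an inline synonym filter expands “author” into its synonyms, and the stop filter drops the common words. The inline synonym list is purely illustrative; in a real index you would configure synonyms in the index settings or load them from a file:

# Lowercase, expand synonyms for "author", then remove stop words
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": [ "author, wordsmith, novelist, writer" ]
    },
    "stop"
  ],
  "text": "The author is brilliant"
}

The stop words “the” and “is” disappear, while “author” is expanded so that “wordsmith”, “novelist”, and “writer” end up indexed at the same position.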

Both these functions — tokenization and normalization — are carried out by the analyzer module. An analyzer does this by employing filters and a tokenizer. Let’s dissect the analyzer module and see what it is made of.

Anatomy of an analyzer

Tokenization and normalization are carried out by three software components: character filters, tokenizers, and token filters, which are glued together as the analyzer module. As the figure below indicates, an analyzer module consists of a set of filters and a tokenizer. Filters work both on the raw text (character filters) and on the tokenized text (token filters), while the tokenizer’s job is to split the sentence into individual words (tokens).

Figure: Anatomy of an analyzer module

All text fields go through this pipeline: the raw text is cleaned by the character filters, and the resulting text is passed on to the tokenizer. The tokenizer then splits the text into tokens (individual words). The tokens then pass through the token filters, where they get modified, enriched, and enhanced. Finally, the finalized tokens are stored in the appropriate inverted indices. The search query is analyzed too, in the same manner as the text was analyzed at indexing time.
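Here is a hedged, end-to-end sketch of that pipeline in a single _analyze request: a character filter cleans the raw HTML, the tokenizer splits the remaining text, and two token filters normalize the tokens. The sample text is made up for this example:

# Character filter -> tokenizer -> token filters, in one request
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<h1>The Movie Was Hilarious</h1>"
}

The html_strip character filter removes the <h1> tags, the standard tokenizer produces “The”, “Movie”, “Was”, and “Hilarious”, and the token filters lowercase the tokens and drop the stop words, leaving roughly “movie” and “hilarious” to be written to the inverted index.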

The figure below shows an example that explains the analysis process.

Figure: An example of text analysis in action

The analyzer is composed of three low-level building blocks. These are

  • Character filters — Applied at the character level, where every character of the text passes through these filters. The filter’s job is to remove unwanted characters from the input string; it can, for example, purge HTML markup such as <h1> tags from the input text. It can also replace some text with other text (e.g., Greek letters with the equivalent English words) or match text against a regular expression (regex) and replace it with an equivalent (e.g., match an email address by a regex and extract the organization’s domain). Character filters are optional; an analyzer can exist without one. Elasticsearch provides three character filters out of the box: html_strip, mapping, and pattern_replace.
  • Tokenizers — Split sentences into words using a delimiter such as whitespace, punctuation, or some form of word boundary. Every analyzer must have one and only one tokenizer. Elasticsearch provides a handful of tokenizers to help split the incoming text into individual tokens, which can then be fed through the token filters for further normalization. The standard analyzer, which Elasticsearch uses by default, breaks words based on grammar and punctuation.
  • Token filters — Work on the tokens produced by the tokenizer for further processing. For example, a token filter can change the case, create synonyms, reduce words to their roots (stemming), or produce n-grams and shingles. Token filters are optional; an analyzer can have zero or more of them, and Elasticsearch provides a long list of token filters out of the box.

Note that both the character filters and the token filters are optional, but every analyzer must have exactly one tokenizer; the example below puts all three together.
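To wire the three building blocks together permanently rather than testing them ad hoc, you define a custom analyzer in the index settings. The sketch below is illustrative; the index name my_reviews, the analyzer name my_custom_analyzer, and the field name review are made up for this example and are not from the book:

# A custom analyzer gluing a character filter, a tokenizer, and token filters
PUT my_reviews
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "stemmer"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "review": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

Any text indexed into the review field, and any full-text query against it, now passes through exactly this chain: the optional character filter and token filters wrapped around the single mandatory tokenizer.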


Written by Madhusudhan Konda

Madhusudhan Konda is a full-stack lead engineer, mentor, and conference speaker. He delivers live online training on Elasticsearch, Elastic Stack & Spring Cloud.
