Elasticsearch in Action: Keyword, Fingerprint and Pattern Analyzers

Madhusudhan Konda
5 min read · Jan 25, 2023
Elasticsearch in Action by M Konda

The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository. You can find executable Kibana scripts in the repository so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.


Keyword analyzer

As the name suggests, the keyword analyzer stores the text as is, without any modification or tokenization. That is, the analyzer does not tokenize the text, nor does the text undergo any further analysis via character or token filters. Instead, it is stored as a string representing a keyword type. As the figure below depicts, the keyword analyzer is composed of just a noop (no-operation) tokenizer and no character or token filters.

Figure: Anatomy of the keyword analyzer

The text that gets passed through the analyzer is converted and stored as a keyword. For example, if we pass in “Elasticsearch in Action” through the keyword analyzer, the whole text string is stored as is, unlike earlier instances where the text was split into tokens. The code in the following listing demonstrates this.

POST _analyze
{
  "text": "Elasticsearch in Action",
  "analyzer": "keyword"
}

The output of this script is shown in the following snippet:

"tokens" : [{
"token" : "Elasticsearch in Action",
"start_offset" : 0,
"end_offset" : 23,
"type" : "word",
"position" : 0
}]

As you can see, only one token was produced when the text was processed by the keyword analyzer, and there’s no lowercasing either. However, the keyword analyzer changes the way we search: querying for a single word will not match the text string. We must provide an exact match, that is, the exact group of words as in the original sentence; in this case, “Elasticsearch in Action”.
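To see this search behaviour in action, here is a minimal sketch. The index name (books_keyword_demo) and field are made up for illustration and are not part of the book’s listings; the title field uses the keyword analyzer, so a single-word match query finds nothing, while the exact full phrase does (refresh=true simply makes the document searchable immediately).

PUT books_keyword_demo
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "keyword" }
    }
  }
}

PUT books_keyword_demo/_doc/1?refresh=true
{
  "title": "Elasticsearch in Action"
}

GET books_keyword_demo/_search
{
  "query": {
    "match": { "title": "Elasticsearch" }
  }
}

GET books_keyword_demo/_search
{
  "query": {
    "match": { "title": "Elasticsearch in Action" }
  }
}

The first search returns no hits because the field was indexed as the single token “Elasticsearch in Action”; only the second query, which analyzes to that exact token, matches the document.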

Fingerprint analyzer

The fingerprint analyzer lowercases the text, removes duplicate words, folds extended characters into their ASCII equivalents, and sorts the words alphabetically to create a single token. It consists of a standard tokenizer along with four token filters: fingerprint, lowercase, stop-words, and ASCII-folding filters. The figure below shows this pictorially.

Figure: Anatomy of the fingerprint analyzer

For example, let’s analyze the following text (a definition of a South Indian dish called dosa). The following listing includes a description of this fare.

POST _analyze
{
  "text": "A dosa is a thin pancake or crepe originating from South India. It is made from a fermented batter consisting of lentils and rice.",
  "analyzer": "fingerprint"
}

The output of the text processed by a fingerprint analyzer is shown here:

"tokens" : [{
"token" : "a and batter consisting crepe dosa fermented from india is it lentils made of or originating pancake rice south thin",
"start_offset" : 0,
"end_offset" : 130,
"type" : "fingerprint",
"position" : 0
}]

When you look closely at the response, you will find that the output is made up of only one token. The words are lowercased and sorted, and duplicate words (“a”, “is”, “from”) are removed before the set of words is turned into a single token.
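If you also want common words dropped from the fingerprint, the fingerprint analyzer accepts a stopwords parameter. The following is a minimal sketch; the index and analyzer names (index_with_fingerprint_analyzer, fingerprint_with_stopwords) are made up for illustration.

PUT index_with_fingerprint_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "fingerprint_with_stopwords": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST index_with_fingerprint_analyzer/_analyze
{
  "text": "A dosa is a thin pancake or crepe originating from South India.",
  "analyzer": "fingerprint_with_stopwords"
}

With English stop words enabled, words such as “a”, “is”, “or”, and “from” are dropped before the remaining terms are sorted and fused into the single fingerprint token.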

Pattern analyzer

Sometimes we may want to tokenize and analyze text based on a certain pattern (for example, stripping the first n digits of a phone number, or splitting a card number on the dash that appears after every four digits). Elasticsearch provides a pattern analyzer just for that purpose.

The default pattern analyzer splits sentences into tokens based on non-word characters; internally, this pattern is represented as \W+. As the figure below demonstrates, a pattern tokenizer, along with lowercase and stop filters, makes up the pattern analyzer:

Figure: Anatomy of the pattern analyzer

As the default pattern only splits on non-word characters, for any other behaviour we need to configure the analyzer by providing the required pattern. Patterns are regular expressions, provided as a string when configuring the analyzer, and they use Java regular expression syntax. To learn more about Java regular expressions, follow this link:

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html
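Before configuring a custom pattern, it is worth running the built-in pattern analyzer with its default \W+ pattern; the sample text below is just for illustration.

POST _analyze
{
  "text": "picture-perfect Pattern analyzer, isn't it?",
  "analyzer": "pattern"
}

The text is lowercased and split wherever a run of non-word characters appears, producing tokens roughly like [picture, perfect, pattern, analyzer, isn, t, it].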

Let’s just say we have an e-commerce payment-authorization application that receives payment authorization requests from various parties. A 16-digit card number is provided in the format 1234-5678-9000-0000. We want to tokenize this card data on the dash (-) and extract the four tokens individually. We can do so by creating a pattern that splits the field into tokens based on the dash delimiter.

To configure the pattern analyzer, we create an index with a custom analyzer of type pattern (named pattern_analyzer here) in the settings object. The following listing shows the configuration in action.

PUT index_with_dash_pattern_analyzer   #A Create an index with a custom pattern analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pattern_analyzer": {  #B The name of our custom analyzer
          "type": "pattern",   #C The analyzer type is pattern
          "pattern": "[-]",    #D The regex: split the text on a dash
          "lowercase": true    #E Lowercase the resulting tokens
        }
      }
    }
  }
}

In the code, we create an index with the pattern analyzer settings. The pattern attribute holds the regex, which follows Java’s regular expression syntax. In this case, we set the dash as our delimiter, so the text is tokenized wherever that character is encountered. Now that we have the index created, let’s put this analyzer into action, as the following listing shows.

POST index_with_dash_pattern_analyzer/_analyze
{
  "text": "1234-5678-9000-0000",
  "analyzer": "pattern_analyzer"
}

The output of this command produces four tokens: ["1234", "5678", "9000", "0000"]. The text can be tokenized based on a plethora of patterns. I suggest that you experiment with regex patterns to get the full benefit from the pattern analyzer.
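To use the analyzer outside the _analyze API, attach it to a field in the index mapping. The sketch below adds a hypothetical card_number field (not part of the book’s listings) that uses our pattern_analyzer, so card numbers are tokenized on the dash at index time.

PUT index_with_dash_pattern_analyzer/_mapping
{
  "properties": {
    "card_number": {
      "type": "text",
      "analyzer": "pattern_analyzer"
    }
  }
}

PUT index_with_dash_pattern_analyzer/_doc/1
{
  "card_number": "1234-5678-9000-0000"
}

A match query on card_number for, say, “9000” will now find this document, because the stored value is indexed as four separate tokens.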
