Elasticsearch in Action: Tokenizers

Madhusudhan Konda
5 min read · Jan 26, 2023
Elasticsearch in Action by M Konda

The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, which includes executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.


The job of a tokenizer is to create tokens based on certain criteria. Tokenizers split the incoming input fields into tokens that are, most likely, the individual words of a sentence. There are over a dozen tokenizers, each tokenizing fields according to its own rules.

Note: As you can imagine, going over every tokenizer in a print book is not only impractical but also tedious, as you would be reading essentially similar text with tiny changes. I have picked a few important and popular tokenizers here so you can understand the concept and mechanics behind a tokenizer. The code for most of them is available on my GitHub page: https://github.com/madhusudhankonda/elasticsearch-in-action

Standard tokenizer

A standard tokenizer splits words on word boundaries and punctuation. It tokenizes text fields on whitespace delimiters as well as on punctuation such as commas, hyphens, colons, and semicolons. The following code uses the _analyze API to execute the tokenizer on a field:

POST _analyze
{
  "text": "Hello,cruel world!",
  "tokenizer": "standard"
}

This results in three tokens: “Hello”, “cruel”, and “world”. The comma and the whitespace act as delimiters to tokenize the field into individual tokens.
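The _analyze API also returns the offsets, type, and position of each token, so the (abbreviated) response for the call above looks roughly like this:

{
  "tokens": [
    { "token": "Hello", "start_offset": 0,  "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "cruel", "start_offset": 6,  "end_offset": 11, "type": "<ALPHANUM>", "position": 1 },
    { "token": "world", "start_offset": 12, "end_offset": 17, "type": "<ALPHANUM>", "position": 2 }
  ]
}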

The standard tokenizer has only one attribute that can be customized: max_token_length. This attribute produces tokens no longer than the size defined by the max_token_length property (the default is 255). We can set this property by creating a custom analyzer with a custom tokenizer, as the following listing shows.

PUT index_with_custom_standard_tokenizer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_token_length_analyzer": { #A The custom analyzer is backed by our custom tokenizer
          "tokenizer": "custom_token_length_tokenizer"
        }
      },
      "tokenizer": {
        "custom_token_length_tokenizer": {
          "type": "standard",
          "max_token_length": 2 #B Sets the maximum token length to 2 characters
        }
      }
    }
  }
}

Just as we created an index with a custom character filter in an earlier section, we can create an index with a custom analyzer that wraps a standard tokenizer. The tokenizer is then extended by setting max_token_length (set to 2 in the previous listing). Once the index is created, we can use the _analyze API to test a field, as the following listing shows. This code spits out two tokens, "Bo" and "nd", honoring our request for a maximum token size of 2 characters.

POST index_with_custom_standard_tokenizer/_analyze
{
  "text": "Bond",
  "analyzer": "custom_token_length_analyzer"
}

N-gram and edge_ngram tokenizers

Before we jump into the n-gram tokenizers, let's recap n-grams and edge n-grams.

N-grams are sequences of letters of a given size, produced from a word by sliding a window of that size across it. Take as an example the word "coffee". The two-letter n-grams, usually called bi-grams, are "co", "of", "ff", "fe", and "ee". Similarly, the three-letter tri-grams are "cof", "off", "ffe", and "fee". As you can see from these two examples, the n-grams are prepared by sliding a letter window along the word.

Edge n-grams, on the other hand, are letter sequences anchored at the beginning of the word. Taking "coffee" as our example, the edge n-grams are "c", "co", "cof", "coff", "coffe", and "coffee". The figure below depicts n-grams and edge n-grams.

Figure: Pictorial representation of n-grams and edge_ngrams
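You don't need an index to experiment with these: the _analyze API also accepts an inline tokenizer definition. As a quick sketch, the following call reproduces the two-letter n-grams (bi-grams) of "coffee":

POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 2
  },
  "text": "coffee"
}

This returns "co", "of", "ff", "fe", and "ee", exactly the bi-grams listed above.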

The n-gram and edge_ngram tokenizers emit n-grams, as their names suggest. Let's look at them in action.

The n-gram tokenizer

We usually use n-grams for correcting spellings and breaking up words. By default, the n-gram tokenizer emits n-grams with a minimum size of 1 and a maximum size of 2. For example, this code produces n-grams of the word "Bond":

POST _analyze
{
  "text": "Bond",
  "tokenizer": "ngram"
}

The output is [B, Bo, o, on, n, nd, d]. You can see that each n-gram is made of one or two letters: this is the default behavior. We can customize the min_gram and max_gram sizes by specifying the configuration as demonstrated in the following listing:

Listing: An ngram tokenizer

PUT index_with_ngram_tokenizer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter"]
        }
      }
    }
  }
}

Using the min_gram and max_gram attributes of the ngram tokenizer (set to 2 and 3, respectively, in the example), we can configure the index to produce n-grams. Let's test the feature as shown in the following listing:

Listing: Testing the ngram tokenizer

POST index_with_ngram_tokenizer/_analyze
{
  "text": "bond",
  "analyzer": "ngram_analyzer"
}

This produces these n-grams: “bo”, “bon”, “on”, “ond”, and “nd”. As you can see, the n-grams were of size 2 and 3 characters.

The edge_ngram tokenizer

Following the same path, we can use the edge_ngram tokenizer to spit out edge n-grams, as the following snippet that configures the edge_ngram tokenizer demonstrates:

..
"tokenizer": {
  "my_edge_ngram_tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 6,
    "token_chars": ["letter", "digit"]
  }
}
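The snippet above shows only the tokenizer portion of the index settings. For completeness, here is a sketch of how the full index definition might look, wiring the tokenizer into the edge_ngram_analyzer used in the next call (the exact listing in the book's repository may differ slightly):

PUT index_with_edge_ngram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "tokenizer": "my_edge_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "my_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 6,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}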

Once we have the edge_ngram tokenizer attached to a custom analyzer, we can test the field using the _analyze API. The following shows how:

POST index_with_edge_ngram/_analyze
{
  "text": "bond",
  "analyzer": "edge_ngram_analyzer"
}

This invocation spits out these edge n-grams: "bo", "bon", and "bond" (with min_gram set to 2, the single-letter "b" is not produced). Note that all the tokens are anchored on the first letter of the word.

Other tokenizers

As you can imagine, there are a handful of other tokenizers, and listing them one by one is not only repetitious but also impractical. I have, however, created code examples on my GitHub page (https://github.com/madhusudhankonda/elasticsearch-in-action), so I suggest you work through the other tokenizer examples there. The table below briefly describes the other tokenizers.

Table: Out-of-the-box tokenizers

The final component of an analyzer is a token filter. Its job is to work on the tokens that were spit out by the tokenizers.
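As a quick preview, you can chain a token filter after a tokenizer directly in the _analyze API; for example, adding the built-in lowercase filter to the standard tokenizer from earlier produces "hello", "cruel", and "world":

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Hello,cruel world!"
}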
