Elasticsearch in Action: Standard Text Analyzer

Madhusudhan Konda
8 min read · Jan 25, 2023

The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository. You can find executable Kibana scripts in the repository so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.


Elasticsearch provides over half a dozen out-of-the-box analyzers that we can use in the text analysis phase. These analyzers most likely suffice for the basic cases, but should there be a need for a custom one, we can create one by assembling a new analyzer from the required components (character filters, a tokenizer, and token filters). The built-in analyzers Elasticsearch provides us with are the standard, simple, whitespace, stop, keyword, pattern, language, and fingerprint analyzers.

The standard analyzer is the default analyzer and is widely used during text analysis. Let’s look at the standard analyzer with an example in the next section, and following that we will look at each of the other analyzers in turn.

Note: Elasticsearch provides a handful of built-in analyzers and also lets us create a plethora of analyzers by mixing and matching tokenizers and filters. It would be too verbose and impractical to go over each of them, but I will present as many examples as possible in these articles. However, I advise you to refer to the official documentation for specific components and their integration into your application.

Standard analyzer

The standard analyzer is the default analyzer used in Elasticsearch. The standard analyzer’s job is to tokenize sentences based on whitespace, punctuation, and grammar. Let’s suppose we want to build an index with a weird combination of snacks and drinks. Consider the following text that mentions coffee with popcorn:

“Hot cup of ☕ and a 🍿is a Weird Combo :(!!”

We can index this text into a weird_combos index as shown in the following:

POST weird_combos/_doc
{
  "text": "Hot cup of ☕ and a 🍿is a Weird Combo :(!!"
}

The text gets tokenized, and the list of tokens is spit out as shown here in a condensed form:

["hot", "cup", "of", "☕", "and", "a", "🍿", "is", "a", "weird", "combo"]

The tokens are lowercased, as you can tell from the output. The smiley at the end as well as the exclamation marks are removed by the standard tokenizer, but the emojis are kept as if they were textual information. This is the default behavior of the standard analyzer, which tokenizes words based on whitespace and strips off non-letter characters like punctuation. The figure below shows the workings of the analyzer when the previous input text is passed through it.

Figure : The standard (default) analyzer in action

In fact, the following shows how we can use the _analyze API to check the output before we index the text.

GET _analyze
{
  "text": "Hot cup of ☕ and a 🍿is a Weird Combo :(!!"
}

The output of this GET command is shown in the following snippet. (For brevity, other than the first token, the rest of them are condensed.)

{
  "tokens" : [
    {
      "token" : "hot",          #A
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    { "token" : "cup", ... },
    { "token" : "of", ... },    #B
    { "token" : "☕", ... },     #C
    { "token" : "and", ... },
    { "token" : "a", ... },
    { "token" : """🍿""", ... },
    { "token" : "is", ... },
    { "token" : "a", ... },
    { "token" : "weird", ... },
    { "token" : "combo", ... }  #D
  ]
}

The output indicates the workings of the standard analyzer: the words were split based on whitespace and non-letters (punctuation), which is the mark of the standard tokenizer. The tokens were then passed through the lowercase token filter.
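To see that the lowercasing comes from the token filter rather than the tokenizer, here is a quick sketch (using the same _analyze API, nothing beyond what we have already covered) that runs only the standard tokenizer on the text:

GET _analyze
{
  "tokenizer": "standard",
  "text": "Hot cup of ☕ and a 🍿is a Weird Combo :(!!"
}

The words are split in exactly the same way, but tokens such as "Hot", "Weird", and "Combo" keep their original capitalization because no lowercase token filter is applied.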

Note: Components of a built-in analyzer. Each of the built-in analyzers comes with a predefined set of components such as character filters, tokenizers, and token filters. For example, the fingerprint analyzer is composed of a standard tokenizer along with a bunch of token filters (the fingerprint, lowercase, asciifolding, and stop token filters) but no character filters. It isn’t easy to tell the anatomy of an analyzer unless you have memorized it over time! So, my advice is to check the definition of the analyzer on the documentation page if you need to go over the nitty-gritty of an analyzer in detail.
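As a quick illustration of that note (a sketch, not one of the book’s listings), we can point the _analyze API at the built-in fingerprint analyzer with the same sentence:

GET _analyze
{
  "analyzer": "fingerprint",
  "text": "Hot cup of ☕ and a 🍿is a Weird Combo :(!!"
}

You should get back a single token made up of the lowercased, de-duplicated words sorted and concatenated together, which is the fingerprint token filter at work.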

The figure below shows a condensed output of the earlier GET _analyze command in DevTools. As you can observe, the tokens for the coffee and popcorn emojis are stored as is, and the non-letter characters such as :( and !! are removed.

Figure : The output tokens from a standard analyzer

Testing the standard analyzer

We can specify the analyzer explicitly during our text analysis testing phase by adding an analyzer attribute to the request. The following listing demonstrates this.

GET _analyze
{
  "analyzer": "standard",
  "text": "Hot cup of ☕ and a 🍿is a Weird Combo :(!!"
}

You can replace the value of the analyzer attribute with your chosen one if you are testing the text using a different analyzer, for example "analyzer": "whitespace", as the sketch below shows.
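For comparison, here is a minimal sketch of that whitespace variant:

GET _analyze
{
  "analyzer": "whitespace",
  "text": "Hot cup of ☕ and a 🍿is a Weird Combo :(!!"
}

The whitespace analyzer splits only on whitespace, so the tokens keep their original case and the trailing ":(!!" survives as a token, which makes the extra work done by the standard analyzer easy to see.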

The standard analyzer request produces the same result as that shown in the figure above: the output indicates that the text was tokenized and lowercased. The figure below gives us a pictorial representation of the standard analyzer with its internal components and anatomy.

Figure : Anatomy of a standard analyzer

As the figure depicts, the standard analyzer consists of a standard tokenizer and two token filters: the lowercase and stop filters. There is no character filter defined on the standard analyzer. To remind ourselves once again, an analyzer consists of zero or more character filters, exactly one tokenizer, and zero or more token filters.

Although the standard analyzer is clubbed with a stop words token filter, the stop words filter is disabled by default. We can, however, switch it on by configuring its properties.
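To make that anatomy concrete, here is a minimal sketch of rebuilding the standard analyzer as a custom analyzer from those same components (the index name rebuilt_standard_index and analyzer name rebuilt_standard are mine, chosen purely for illustration):

PUT rebuilt_standard_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {           # custom analyzer mirroring the standard analyzer's anatomy
          "type": "custom",
          "tokenizer": "standard",      # same tokenizer as the standard analyzer
          "filter": ["lowercase"]       # add "stop" here to switch on stop word removal
        }
      }
    }
  }
}

Adding "stop" to the filter array would enable stop word removal, which is exactly what the next section does directly on the standard analyzer via its configuration parameters.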

Configuring the standard analyzer

Elasticsearch allows us to configure a few parameters on the standard analyzer, such as the stop words filter, the stop words path, and the maximum token length. The way to configure these properties is via the index settings. When we create an index, we can configure the analyzer through the settings component:

PUT <my_index>
{
  "settings": {
    "analysis": {
      "analyzer": {
        ...
      }
    }
  }
}

Stop words configuration

Let’s take an example of enabling English stop words on the standard analyzer. We could do this by adding a filter during index creation as the following listing shows.

PUT my_index_with_stopwords
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_with_stopwords": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

As we’ve noticed earlier, the stop words filter on the standard analyzer is disabled by default. Now that we’ve created the index with a standard analyzer that is configured with stop words, any text that gets indexed goes through this modified analyzer. To test this, we can invoke the _analyze endpoint on the index, as demonstrated in the listing below:

POST my_index_with_stopwords/_analyze
{
  "text": ["Hot cup of ☕ and a 🍿is a Weird Combo :(!!"],
  "analyzer": "standard_with_stopwords"
}

The output of this call shows that the common (English) stop words such as “of”, “a”, and “is” were removed:

["hot", "cup", "☕" "🍿","weird", "combo"]

We can change the stop words for a language of our choice. For example, the code in the following listing shows the index with Hindi stop words and the standard analyzer.

PUT my_index_with_stopwords_hindi
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_with_stopwords_hindi": {
          "type": "standard",
          "stopwords": "_hindi_"
        }
      }
    }
  }
}

We can test the text using the aforementioned standard_with_stopwords_hindi analyzer:

POST my_index_with_stopwords_hindi/_analyze
{
  "text": ["आप क्या कर रहे हो?"],
  "analyzer": "standard_with_stopwords_hindi"
}

If you are curious to know what this Hindi sentence represents, its equivalent is “what are you doing?”

The output from the above script is shown here:

"tokens" : [{
"token" : "क्या",
"start_offset" : 3,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
}]

The only token in the output is क्या (the second word) because the rest of the words were stop words (they are common words in the Hindi language).
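If you want to confirm that it is the stop words configuration (and not the standard analyzer itself) removing those words, a quick check is to analyze the same text on the same index with the plain standard analyzer:

POST my_index_with_stopwords_hindi/_analyze
{
  "text": ["आप क्या कर रहे हो?"],
  "analyzer": "standard"
}

Because the default standard analyzer ships with its stop words filter disabled, you should see all five words come back as tokens this time.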

File-based stopwords

If our requirements aren’t catered to by the built-in stop words filters, we can provide the stop words via an explicit file.

Let’s say we don’t want users to input swear words in our application. We can create a file with all the blacklisted swear words and add the path of the file as a parameter to the standard analyzer. The file must be present at a path relative to the config folder of Elasticsearch’s home. The following listing creates the index with an analyzer that accepts a stop words file:

PUT index_with_swear_stopwords
{
  "settings": {
    "analysis": {
      "analyzer": {
        "swearwords_analyzer": {               #A
          "type": "standard",                  #B
          "stopwords_path": "swearwords.txt"   #C
        }
      }
    }
  }
}

The stopwords_path attribute looks for a file (swearwords.txt in this case) in a directory inside Elasticsearch’s config folder. Make sure you change directory to $ELASTICSEARCH_HOME/config and create the swearwords.txt file there. The following listing shows the contents of the file; notice that each blacklisted word goes on its own line.

file:swearwords.txt
damn
bugger
bloody hell
what the hell
sucks

Once the file is created and the index is in place as the earlier listing shows, we are ready to put the analyzer with the custom-defined swear words to use:

POST index_with_swear_stopwords/_analyze
{
  "text": ["Damn, that sucks!"],
  "analyzer": "swearwords_analyzer"
}
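Assuming the swearwords.txt file shown earlier, the condensed token output should look something like this:

["that"]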

The analyzer stops the first and last words from going through the indexing process because those two words are in our swear words blacklist. The next attribute that we can configure is the token length: how long we want the output tokens to be. This is discussed in the next section.

Configuring token length

We can also configure the maximum token length, in which case tokens longer than the configured length are split. For example, the listing below creates an index with a standard analyzer that is configured to have a maximum token length of 7 characters. If we provide a word that is 13 characters long, it is split into tokens of 7 and 6 characters (for example, "Elasticsearch" would be split into "elastic" and "search" after lowercasing).

PUT my_index_with_max_token_length
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_max_token_length": {
          "type": "standard",
          "max_token_length": 7
        }
      }
    }
  }
}
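To try the analyzer out, we can run a long word through the _analyze endpoint on this index (a quick sketch using the example from the paragraph above):

POST my_index_with_max_token_length/_analyze
{
  "text": ["Elasticsearch"],
  "analyzer": "standard_max_token_length"
}

The expected condensed output is ["elastic", "search"]: the 13-character word is split at the 7-character boundary by the tokenizer and then lowercased by the token filter.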
