Elasticsearch in Action: Testing Text Analyzers
The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, including executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.
You might be curious to find out how Elasticsearch breaks text apart, modifies it, and enriches it. After all, knowing upfront how the text is split and enhanced helps us choose the appropriate analyzer and customize it if needed. Fortunately, Elasticsearch exposes an _analyze endpoint just for testing the text analysis process. It is a handy API that lets us see exactly how the engine treats text when it is indexed, and it is easiest to explain with an example.
Let’s say we want to find out how Elasticsearch deals with this piece of text when it is indexed: “James Bond 007”. The following listing shows this in action.
GET _analyze
{
"text": "James Bond 007"
}
This script produces a set of tokens as shown in the figure below.
The output of the query shows us how the analyzer treats the text field. In this case, the text is split into three tokens (“james”, “bond”, and “007”), all lowercase. Because we didn’t specify an analyzer in the code, the standard analyzer is assumed by default. Each token has a type: <ALPHANUM> for a string, <NUM> for a numeric token, and so on. The position of each token is saved too, as you can see from the result shown in the figure. That brings us to the next point: specifying the analyzer explicitly during the _analyze test.
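If you’d like to see the raw response rather than the figure, the standard analyzer returns something like the abridged sketch below (the field names follow the _analyze response format; the offsets are character positions within the input text):
{
  "tokens": [
    { "token": "james", "start_offset": 0, "end_offset": 5, "type": "<ALPHANUM>", "position": 0 },
    { "token": "bond", "start_offset": 6, "end_offset": 10, "type": "<ALPHANUM>", "position": 1 },
    { "token": "007", "start_offset": 11, "end_offset": 14, "type": "<NUM>", "position": 2 }
  ]
}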
Explicit analyzer tests
In the listing in the previous section, we didn’t mention an analyzer, although one was applied implicitly by the engine: the standard analyzer, by default. However, we can also specify an analyzer explicitly. The code in the following listing shows how to use the simple analyzer.
GET _analyze
{
"text": "James Bond 007",
"analyzer": "simple"
}
The simple analyzer splits text on nonletter characters and discards them, so this code produces only two tokens, “james” and “bond” (“007” is dropped), as opposed to the three tokens from the earlier script that used the standard analyzer.
If you are curious, change the analyzer to english. The output tokens would then be “jame”, “bond”, and “007”. The notable point is that “james” has been stemmed to “jame” by the english analyzer.
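If you want to try this yourself, the request is identical to the earlier one with only the analyzer name swapped (a quick sketch; nothing else changes):
GET _analyze
{
  "text": "James Bond 007",
  "analyzer": "english"
}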
Configuring analyzers on the fly
We can also use the _analyze API to combine a few filters and a tokenizer, as though we were creating a custom analyzer on the fly. The idea is that we can assemble a custom analyzer by mixing and matching the built-in tokenizers and filters. (We are not really building or developing a new analyzer as such.) This on-demand custom analyzer is demonstrated in the following listing.
GET _analyze
{
"tokenizer": "path_hierarchy",
"filter": ["uppercase"],
"text": "/Volumes/FILES/Dev"
}
The code in this listing uses a path_hierarchy tokenizer with an uppercase filter and therefore produces three tokens: “/VOLUMES”, “/VOLUMES/FILES”, and “/VOLUMES/FILES/DEV”. The path_hierarchy tokenizer splits the text on the path separator; hence, you see three tokens, one for each level of the folder hierarchy.
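To drive home the mix-and-match idea, here is one more on-the-fly combination (an illustrative sketch using the built-in whitespace tokenizer with the lowercase and reverse token filters; it is not one of the book’s listings):
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "reverse"],
  "text": "James Bond 007"
}
This should yield “semaj”, “dnob”, and “700”: the whitespace tokenizer splits on spaces, lowercase normalizes the case, and reverse flips the characters of each token.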
The _analyze endpoint: The _analyze endpoint goes a long way toward explaining how text has been treated and indexed by the engine, as well as why a search query may not have produced the desired output. We can use it as the first step to test our text with the expected analyzers before we put the code into production.
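One related trick worth noting: once an index exists, you can call _analyze on that index and point it at a field, and the test runs the text through whatever analyzer the field’s mapping declares. A minimal sketch, assuming a hypothetical books index with a text field named title:
GET books/_analyze
{
  "field": "title",
  "text": "James Bond 007"
}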