Elasticsearch in Action: Simple and Whitespace Analyzers
The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, which also contains executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.
Simple analyzer
While the standard analyzer breaks text into tokens when it encounters whitespace or punctuation, the simple analyzer tokenizes text at the occurrence of any non-letter character, such as a number, space, apostrophe, or hyphen. It does this by using a lowercase tokenizer, with no character or token filters attached. This is represented pictorially in the figure below.
Let’s consider an example of analyzing the text “Lukša’s K8s in Action”, as the script in the following listing shows.
POST _analyze
{
  "text": ["Lukša's K8s in Action"],
  "analyzer": "simple"
}
This results in
["lukša","s","k","s","in","action"]
The tokens were split when an apostrophe (“Lukša’s” becomes “Lukša” and “s”) or a number (“K8s” becomes “k” and “s”) was encountered, and the resulting tokens were lowercased.
There is not much we can configure on the simple analyzer, but if we want to add a filter (character or token filter), the easiest way is to create a custom analyzer with the required filters and the lowercase tokenizer (the simple analyzer consists of a lone lowercase tokenizer), as the sketch below shows.
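For instance, a minimal sketch of such a custom analyzer might look like the following. The index name my_index_simple, the analyzer name simple_with_stop, and the choice of the stop token filter are illustrative assumptions, not part of the book’s listings:

# custom analyzer: lowercase tokenizer (as in the simple analyzer) plus a stop token filter
PUT my_index_simple
{
  "settings": {
    "analysis": {
      "analyzer": {
        "simple_with_stop": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": ["stop"]
        }
      }
    }
  }
}

We could then test it with GET my_index_simple/_analyze, passing "analyzer": "simple_with_stop" and some sample text; common English stop words such as “in” would now be dropped from the token stream.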
Whitespace analyzer
As the name suggests, the whitespace analyzer splits text into tokens when it encounters whitespace. This analyzer has no character or token filters; it consists of a lone whitespace tokenizer, as the figure shows.
The following listing shows the script for the whitespace analyzer; executing it produces the output shown below.
POST _analyze
{
  "text": "Peter_Piper picked a peck of PICKLED-peppers!!",
  "analyzer": "whitespace"
}
If we test this script, we’ll get this set of tokens:
["Peter_Piper", "picked", "a", "peck", "of", "PICKLED-peppers!!"]
Two points to note from the result. First, the text was tokenized only on whitespace; it was not split on dashes, underscores, or punctuation. Second, the case was preserved: the capitalization of the characters and words was kept intact.
As mentioned earlier, and just like the simple analyzer, the whitespace analyzer does not expose any configurable parameters. If we need to modify its behavior, we have to create a custom analyzer built around the whitespace tokenizer, as the sketch below shows.
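As an illustration, a minimal sketch of such a custom analyzer might look like this. The index name my_index_whitespace, the analyzer name whitespace_lowercase, and the choice of the lowercase token filter are illustrative assumptions:

# custom analyzer: whitespace tokenizer plus a lowercase token filter
PUT my_index_whitespace
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Running GET my_index_whitespace/_analyze with this analyzer on the earlier sentence would still split only on whitespace, but the tokens (including “PICKLED-peppers!!”) would now come back lowercased.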