Elasticsearch in Action: Stemmer, Shingles and Synonym Filters
The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, including executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.
The tokens produced by tokenizers may need further enrichment or enhancement, such as lowercasing (or uppercasing) the tokens, providing synonyms, stemming words to their root forms, removing apostrophes or punctuation, and so on. Token filters work on tokens to perform such transformations.
Elasticsearch provides almost 50 token filters and, as you can imagine, discussing all of them here is not feasible. I've picked out a few, but feel free to consult the official documentation for the rest of the token filters. We can test a token filter by simply attaching it to a tokenizer and using it in the _analyze API call, as the following listing shows:
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["uppercase", "reverse"],
  "text": "bond"
}
The filter parameter accepts an array of token filters; here, we provided the uppercase and reverse filters. The output is “DNOB” (“bond” is uppercased and then reversed).
You can also attach the filters to a custom analyzer, as the following listing demonstrates. Now that we know how to attach token filters, let's look at a few of them in detail.
PUT index_with_token_filters
{
  "settings": {
    "analysis": {
      "analyzer": {
        "token_filter_analyzer": { #A The custom analyzer definition
          "tokenizer": "standard",
          "filter": [ "uppercase", "reverse" ] #B The token filters, applied in order
        }
      }
    }
  }
}
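With the index in place, we can exercise the custom analyzer through the index-scoped _analyze endpoint. As a quick check against the index we just created, this should return “DNOB” as before:

POST index_with_token_filters/_analyze
{
  "analyzer": "token_filter_analyzer",
  "text": "bond"
}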
Stemmer filter
Stemming is a mechanism to reduce words to their root words (for example, the word “bark” is the root word for “barking”). Elasticsearch provides an out-of-the-box stemmer that reduces words to their root form. The following listing demonstrates an example of stemmer usage.
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["stemmer"],
  "text": "barking is my life"
}
When executed, this code produces a list of tokens: “bark”, “is”, “my”, and “life”. As you can see, the original word, “barking”, is transformed to “bark”.
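The stemmer filter defaults to the English algorithm. You can select a different algorithm through the filter's language parameter; the following is a sketch (the names index_with_stemmer, light_english_stemmer, and stemmer_analyzer are illustrative) that picks the lighter English stemmer:

PUT index_with_stemmer
{
  "settings": {
    "analysis": {
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        }
      },
      "analyzer": {
        "stemmer_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "light_english_stemmer" ]
        }
      }
    }
  }
}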
Shingle filter
Shingles are word n-grams that are generated at the token level (unlike the ngram and edge_ngram filters, which emit n-grams at the letter level). For example, the text “james bond” emits “james”, “james bond”, and “bond”. The following code shows an example usage of the shingle filter:
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["shingle"],
  "text": "java python go"
}
The result of this code execution is [java, java python, python, python go, go]. The default behavior of the filter is to emit unigrams and two-word n-grams. We can change this default behavior by creating a custom analyzer with a custom shingle filter. The following listing shows how this is configured.
PUT index_with_shingle
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingles_analyzer": {
          "tokenizer": "standard",
          "filter": [ "shingles_filter" ] #A Attaches the custom shingle filter
        }
      },
      "filter": {
        "shingles_filter": { #B The custom shingle filter definition
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": false #C Stops the filter from emitting one-word shingles
        }
      }
    }
  }
}
Invoking this analyzer on some text (as shown in the listing below) produces shingles of two and three words.
POST index_with_shingle/_analyze
{
  "text": "java python go",
  "analyzer": "shingles_analyzer"
}
The analyzer returns [java python, java python go, python go] because we've configured the filter to produce only two- and three-word shingles. The unigrams (one-word shingles) such as “java” and “python” are dropped from the output because we configured the filter not to emit them.
Synonym filter
We worked with synonyms earlier without really going into detail. Synonyms are different words with the same meaning; for example, football and soccer (the latter being what football is called in America) both refer to the same game. The synonym filter helps create a set of such words to produce a richer search experience for users.
Elasticsearch expects us to provide a set of words and their synonyms by configuring the analyzer with a synonym token filter. We create the synonym filter in an index's settings, as the listing demonstrates:
PUT index_with_synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms_filter": {
          "type": "synonym",
          "synonyms": [ "soccer => football" ]
        }
      }
    }
  }
}
In this code example, we created a synonyms list (soccer is treated as an alternate name for football) on a filter of the synonym type. Once we have the index configured with this filter, we can test it by analyzing some text:
POST index_with_synonyms/_analyze
{
  "text": "What's soccer?",
  "tokenizer": "standard",
  "filter": ["synonyms_filter"]
}
This produces two tokens: “What's” and “football”. As you can see from the output, the word “soccer” is replaced with the word “football”.
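The => arrow rewrites the token on its left to the word on its right. Alternatively, a comma-separated list makes the words equivalent, so analyzing either one emits all of them. Here's a sketch (the index and filter names are illustrative):

PUT index_with_equivalent_synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "equivalent_synonyms_filter": {
          "type": "synonym",
          "synonyms": [ "soccer, football" ]
        }
      }
    }
  }
}

With this configuration, analyzing “soccer” produces both “soccer” and “football” at the same token position.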
Synonyms from a file
We can provide the synonyms via a file on the filesystem rather than hard coding them as we did in the previous listing. To do that, we provide the file path in the synonyms_path parameter, as the following listing demonstrates.
PUT index_with_synonyms_from_file_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [ "synonyms_from_file_filter" ]
        }
      },
      "filter": {
        "synonyms_from_file_filter": {
          "type": "synonym",
          "synonyms_path": "synonyms.txt" #A Relative path of the synonyms file
        }
      }
    }
  }
}
Make sure a file called “synonyms.txt” is created under $ELASTICSEARCH_HOME/config with the following contents:
The synonyms.txt file with a set of synonyms
# file: synonyms.txt
important=>imperative
beautiful=>gorgeous
We can reference the file using a relative or an absolute path; a relative path is resolved against the config directory of Elasticsearch's installation folder. We can test the above analyzer by invoking the _analyze API with the following input, as shown in the listing:
POST index_with_synonyms_from_file_analyzer/_analyze
{
  "text": "important",
  "tokenizer": "standard",
  "filter": ["synonyms_from_file_filter"]
}
We should get the “imperative” token as the response, proving that the synonyms were picked up from the synonyms.txt file we dropped in the config folder. You can add more values to this file while Elasticsearch is running and try it out too.
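For instance, analyzing the second entry from our synonyms.txt file should swap “beautiful” for “gorgeous”:

POST index_with_synonyms_from_file_analyzer/_analyze
{
  "text": "beautiful",
  "tokenizer": "standard",
  "filter": ["synonyms_from_file_filter"]
}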