Elasticsearch in Action: Match Phrase (match_phrase) Queries
The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository. You can find executable Kibana scripts in the repository so you can run the commands in Kibana straight away. All code is tested against Elasticsearch 8.4 version.
In the last article, we looked at the match query in detail. In this article, we work with the match_phrase query.
The match phrase (match_phrase) query finds documents that contain a given phrase exactly. The idea is to search a given field for the phrase (a group of words) with the words in the same order. For example, if you are looking for the phrase “book for every Java programmer” in the synopsis of a book, only documents whose synopsis contains those words in that order are returned.
With a match query, the search text is split into individual words, which are then searched for with an OR or AND operator. The match_phrase query works differently: it returns only the results that match the search phrase exactly. The following listing illustrates the match_phrase query in action.
GET books/_search
{
  "query": {
    "match_phrase": {
      "synopsis": "book for every Java programmer"
    }
  }
}
The match_phrase query expects a phrase, as you can see in the previous listing. It returns exactly one document because only one document in our books index has that phrase in its synopsis field.
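To make "the same words in the same order" concrete, here is a minimal Python sketch of the idea. It is only an illustration, not how Elasticsearch works internally: the whitespace tokenizer is a crude stand-in for a real analyzer, and the function names and sample synopsis are hypothetical.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a crude stand-in for a real analyzer."""
    return text.lower().split()


def contains_phrase(text, phrase):
    """True if the phrase's words appear consecutively and in order in text."""
    doc, words = tokenize(text), tokenize(phrase)
    if not words:
        return False
    # Slide a window of len(words) over the document tokens.
    return any(
        doc[i:i + len(words)] == words
        for i in range(len(doc) - len(words) + 1)
    )


def contains_any_word(text, phrase):
    """True if at least one word of the phrase occurs anywhere in text (OR-style)."""
    doc = set(tokenize(text))
    return any(word in doc for word in tokenize(phrase))


# Hypothetical synopsis, made up for this sketch.
synopsis = "A must-have book for every Java programmer and developer"

print(contains_phrase(synopsis, "book for every Java programmer"))  # True
print(contains_phrase(synopsis, "book Java programmer"))            # False
print(contains_any_word(synopsis, "book Java programmer"))          # True
```

The second call fails because the words are present but not adjacent, which is the behaviour that separates a phrase search from a word-by-word search.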
Match phrase with the slop parameter
What if we drop a word or two from the middle of the phrase? Say, for example, we remove for or every (or both) from the phrase “book for every Java programmer” and rerun the same query. Unfortunately, the query wouldn’t return any results! The reason is that, by default, match_phrase expects the words in the search text to match the indexed phrase exactly, word by word. Searching for “book Java programmer” returns no results. Fortunately, there is a fix for this problem: a parameter called slop.
The slop parameter lets us leave out words from the middle of a phrase. We need to let the engine know how many words may be missing, and we do so by setting a value for slop. More precisely, slop is a non-negative integer giving the maximum number of positions by which the words in the search phrase may be moved to line up with the indexed text. For the common case of dropped in-between words, a slop of 1 forgives one missing word, a slop of 2 forgives two missing words, and so on. Swapping two adjacent words counts as two moves, so it needs a slop of 2. The default value of slop is 0, meaning a phrase with missing words is not forgiven.
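One way to picture slop is as a budget of position moves. The Python sketch below is a simplified model of that idea, not Lucene's actual matcher: it assumes a whitespace tokenizer, and the function names and sample synopsis are hypothetical. For each word of the phrase it compares the position in the text with the position in the phrase, and reports the smallest spread between those shifts.

```python
from itertools import product


def tokenize(text):
    """Lowercase and split on whitespace; a crude stand-in for a real analyzer."""
    return text.lower().split()


def min_slop_needed(text, phrase):
    """Smallest slop under which phrase would match text in this simplified model.

    Returns None if some word of the phrase does not occur in the text at all.
    """
    doc, words = tokenize(text), tokenize(phrase)
    # All positions in the text at which each phrase word occurs.
    positions = [[i for i, tok in enumerate(doc) if tok == w] for w in words]
    if not words or any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # each phrase word must sit on its own position
        # How far each word sits from where the phrase expects it.
        shifts = [pos - offset for offset, pos in enumerate(combo)]
        distance = max(shifts) - min(shifts)
        if best is None or distance < best:
            best = distance
    return best


def matches_with_slop(text, phrase, slop=0):
    """True if the phrase matches the text within the given slop budget."""
    needed = min_slop_needed(text, phrase)
    return needed is not None and needed <= slop


# Hypothetical synopsis, made up for this sketch.
synopsis = "A must-have book for every Java programmer"

print(min_slop_needed(synopsis, "book for every Java programmer"))  # 0
print(min_slop_needed(synopsis, "book every Java programmer"))      # 1
print(min_slop_needed(synopsis, "book Java programmer"))            # 2
print(min_slop_needed(synopsis, "every for"))                       # 2
```

In this model a dropped word costs one move per gap, and two swapped neighbours cost two moves, which lines up with the documented rule that transposed terms have a slop of 2.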
Coming back to our example, let’s drop a word from the given phrase: instead of “book for every Java programmer,” we’ll search for “book every Java programmer,” dropping the word for. Because we drop a single word, we need to set the slop parameter to 1. The query in the next listing demonstrates this. Note that we need to expand the query: the synopsis field now takes an object with two parameters, query and slop.
GET books/_search
{
  "query": {
    "match_phrase": {
      "synopsis": {
        "query": "book every Java programmer",
        "slop": 1
      }
    }
  }
}
If you want to use the slop parameter, you must use the long form of the query, as the previous listing demonstrates: the field takes an object in which the search phrase is given as query and slop is set alongside it. Because slop is set to 1, the query matches even though one word is missing from the phrase in the synopsis field.
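If you build these requests from application code rather than in Kibana, the difference between the short and long forms can be sketched with a small helper. This is a hypothetical helper written for illustration (match_phrase_body is not part of any Elasticsearch client); it only assembles the request body as a Python dictionary and does not send anything to a cluster.

```python
import json


def match_phrase_body(field, phrase, slop=None):
    """Build a match_phrase request body (hypothetical helper for illustration).

    Without slop, the short form is used: the field maps straight to the phrase.
    With slop, the long form is used: the field maps to an object that carries
    the phrase under "query" plus the "slop" value.
    """
    if slop is None:
        return {"query": {"match_phrase": {field: phrase}}}
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}


# Short form, equivalent to the first listing.
print(json.dumps(match_phrase_body("synopsis", "book for every Java programmer"), indent=2))

# Long form with slop, equivalent to the second listing.
print(json.dumps(match_phrase_body("synopsis", "book every Java programmer", slop=1), indent=2))
```

The resulting dictionaries have the same shape as the JSON bodies in the listings, so they could be passed as the body of a search request on the books index.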
As expected, this query returns the book whose synopsis contains the entire phrase. The takeaway from this example is that a match phrase query looks for an exact phrase, but if you are not sure of the exact phrase, you can use the slop parameter to indicate how forgiving your query should be.
There’s a slight variation on the match phrase query: the match phrase prefix (match_phrase_prefix) query. In addition to matching an exact phrase, it treats the last word as a prefix.
We look at the match_phrase_prefix query in the next article.