Overview of Term-Level Queries

Madhusudhan Konda
7 min read · Jan 30, 2023
Excerpts taken from my upcoming book: Elasticsearch in Action

Me @ Medium || LinkedIn || Twitter || GitHub

Structured data search

Term-level search is a structured search in which queries return only exact matches. These queries operate on structured data such as dates, numbers, and ranges. With this type of search, we don’t care how well a document matches the query, only whether it matches or not. Hence, we don’t expect a meaningful relevancy score on the results of a term-level search.

A term-level search produces a binary Yes or No outcome, similar to a database’s WHERE clause: a document is returned if it meets the condition and left out if it doesn’t.

Although the returned documents still have a score associated with them, the scores don’t really matter: a document is returned because it matches the query, not because of how relevant it is. In fact, we can run term-level queries with a constant score. The server can also cache them, which gives a performance benefit when the same query is rerun. Traditional database searches work the same way.
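
As a minimal sketch of running a term-level query with a constant score, the request below wraps a term query in a constant_score query. It assumes a movies index with a keyword field named certificate, like the one used later in this article; the boost value of 1.0 is only illustrative.

```console
# Assumption: the movies index has a keyword field named certificate.
# Every matching document gets the same score (the boost value).
GET movies/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "certificate": "R"
        }
      },
      "boost": 1.0
    }
  }
}
```

Because the term query sits in the filter clause, Elasticsearch does not compute relevancy for it, which matches the yes-or-no nature of term-level search.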

Term-level queries are not analyzed

One important characteristic of term-level queries is that, unlike full-text queries, they are not analyzed. The search terms are matched against the tokens stored in the inverted index as is, without first being passed through an analyzer to mirror what happened at index time. This means that the search words must exactly match the tokens stored in the inverted index for the field.

For example, if you search for Java in a title field using a term-level query, chances are that no documents match. During the indexing process, assuming the standard analyzer is in action, the word Java is converted to lowercase java before it is inserted into the inverted index. Because term-level queries are not analyzed, the engine tries to match the search word Java against the indexed token java, and the match fails. The same query (with the capitalized Java) does return results if the field is a keyword type instead (we go over the usage and explanation of this shortly, so hang tight).

Term-level queries are, hence, suitable for keyword fields rather than text fields, because any field declared as a keyword is not analyzed during the indexing process: its value is added to the inverted index as is. Like keywords, numerics, Booleans, ranges, and so on are not analyzed either; their values are indexed as is.

Let’s take a simple example: the movie “The Godfather.” The figure below illustrates indexing and term-level search. As you’ll see, the term-level query doesn’t find a hit for “The Godfather” because that string doesn’t exist as a single token in the inverted index (the standard analyzer split it into two tokens). Similarly, using just Godfather as the search word in the term-level query doesn’t return any results either because, again, the word Godfather does not match the lowercase token godfather.

Indexing and term-level searching for the movie, “The Godfather”

As the figure shows, there are two processes: indexing the document and searching for it. If the field is a text field, and assuming the standard analyzer is applied, the title is broken into two lowercased tokens, [“the”, “godfather”], during the indexing process.

On the other hand, during a term-level search, the search terms are passed as is, without any text analysis. If the term-level query searches for “The Godfather”, the engine looks for the exact string “The Godfather” in the inverted index.

We can still run term-level queries on text fields, although it’s not advisable on fields with lengthy text. If a text field holds enumeration-like values such as days of the week, movie certificates, or gender, term-level queries can work. For example, if we index gender values such as Male and Female, term-level queries must search for “male” and “female” to return any results, because the standard analyzer lowercases the values during indexing. The takeaway is that term-level queries search for exact words.
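
To see which tokens actually end up in the inverted index, one option is the _analyze API. The request below is a small sketch that runs the standard analyzer over the movie title; it does not depend on any index.

```console
# Ask Elasticsearch how the standard analyzer tokenizes the title.
GET _analyze
{
  "analyzer": "standard",
  "text": "The Godfather"
}
```

The response lists the tokens produced at index time (the and godfather, as described above), which are the only values a term-level query on that text field can match.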

Elasticsearch exposes a handful of term-level queries, including term, terms, ids, fuzzy, exists, range, and others.
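
As a rough sketch of what two of these look like, the requests below show a terms query and an exists query. They assume a movies index with a keyword field named certificate; the value "PG-13" is an assumed example and may not exist in your data.

```console
# terms: match documents whose field equals any of the listed exact values.
# Assumption: "PG-13" is an illustrative value only.
GET movies/_search
{
  "query": {
    "terms": {
      "certificate": ["R", "PG-13"]
    }
  }
}

# exists: match documents that have an indexed value for the field.
GET movies/_search
{
  "query": {
    "exists": {
      "field": "certificate"
    }
  }
}
```

Like the term query, neither of these analyzes the search values, so the values in the terms list must match the indexed values exactly.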

Term queries

The term query fetches the documents in which a given field exactly matches a given value. The search value is not analyzed; instead, it is matched against the value stored as is in the inverted index during indexing. For example, using our movie dataset, if we want to search for R-rated movies, we can write a term query as shown in the following listing.

GET movies/_search
{
  "query": {
    "term": {
      "certificate": "R"
    }
  }
}

The name of the query (term in this case) indicates that we are about to perform a term-level search. The query object expects the field (certificate, in this case) and the search value. Keep in mind that certificate is a keyword data type; hence, during the indexing process, the value “R” was not altered by analysis (strictly speaking, it goes through the keyword analyzer, which doesn’t change the case), so it’s stored as is.

If you run this query, you get all the R-rated movies (14 in our sample data set), wrapped in the JSON response. In the next section, we look at the effect of running a term-level search on text fields (instead of keyword types).

Term queries on text fields

Let’s see what happens if we lowercase the rating value in the query from R to r (that is, "certificate": "r"). Surprisingly, this query doesn’t return any results. Can you guess the reason?

Elasticsearch analyzes text fields during indexing as well as when searching. The certificate field, however, is a keyword type, so it never goes through the analysis process. When the document is indexed, the certificate value “R” is not tokenized or passed through any filters; it’s inserted into the inverted index as is, and a query is always matched against that stored value.

The other side of the coin is searching: term queries are not analyzed either. Unlike a full-text query, where the standard analyzer splits the search text into tokens and lowercases them, the search value of a term query remains as is. If you search for r, it stays r; no case conversion is applied behind the scenes. Therefore, when we search for the lowercased certificate r, there are no matches (R was indexed, not r), so there is no result.

This brings up an important point to consider when working with term queries: term queries are not suitable for text fields. Although nothing stops you from using them there, they are intended to be used on non-text fields like keywords, numbers, and dates.

If, for whatever reason, you want to use a term query on a text field, make sure the text field holds enumeration-like or constant values. For example, an order status field with the states CREATED, CANCELLED, and FULFILLED can be a good candidate for a term query even though it is a text field. Keep in mind that the standard analyzer lowercases these values during indexing, so the query must search for the lowercased form (cancelled, for example).

However, if the text field is populated with unstructured text rather than enumeration-style values, term queries run on it will not return the expected results. Let’s check out an example of what happens when we run a term query on a text field in the next section.
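
For reference, the lowercased variant of the query discussed in this section can be written as follows. It is the same term query as in the earlier listing, with only the case of the search value changed.

```console
# Same query as before, but with a lowercased search value.
# certificate is a keyword field, so the indexed value is "R", not "r".
GET movies/_search
{
  "query": {
    "term": {
      "certificate": "r"
    }
  }
}
```

Since neither the indexed value nor the search value is analyzed, the comparison is case-sensitive and this request finds nothing in the sample data.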
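
The order status case can be sketched as follows. The index name orders_demo, the field name status, and the document are hypothetical and exist only for illustration; the field is explicitly mapped as text so that the standard analyzer is applied.

```console
# Hypothetical index: status is a text field (standard analyzer by default).
PUT orders_demo
{
  "mappings": {
    "properties": {
      "status": {
        "type": "text"
      }
    }
  }
}

# Index one sample order; refresh=true makes it searchable immediately.
PUT orders_demo/_doc/1?refresh=true
{
  "status": "CANCELLED"
}

# The indexed token is lowercased, so the term query searches for "cancelled".
GET orders_demo/_search
{
  "query": {
    "term": {
      "status": "cancelled"
    }
  }
}
```

Searching for "CANCELLED" instead would find nothing here, because the standard analyzer stored the token in lowercase.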

Example: Applying a term query on a movie’s title

Let’s see what happens if we search a text field called title using a term query. In the listing below, we search for “The Godfather” in the title of the movie using the term query.

GET movies/_search
{
  "query": {
    "term": {
      "title": "The Godfather"
    }
  }
}

Running the code in the listing above, we receive no results (refer to the figure given earlier for an illustration). The reason is that the title field is a text field, meaning that its value went through an analysis process before it was stored in the index. “The Godfather” is broken down and stored as the lowercase tokens [“the”, “godfather”] in the inverted index (because the standard analyzer is used by default). The search text of a term query is not analyzed; the query takes it as is and compares it against the inverted index. In this case, the search value “The Godfather” does not match either of the tokens (the, godfather) stored for the title field.

Rerunning the query with “the godfather” does not return any results either (try running the query with the title lowercased like this). The term query tries to match the exact value, “the godfather”, which is not in the inverted index (remember, it’s tokenized and stored as two words: the and godfather). However, searching on the word “godfather” does return results, because the token “godfather” was produced by analysis and inserted into the inverted index during indexing, and hence a match is found.

The takeaway is that we should run term queries on non-text fields. Should you want to use a term query to search a text field, make sure the field holds data in the form of enumerations or constants.
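
The single-token search mentioned above can be written as the following request, reusing the movies index and title field from the previous listing.

```console
# "godfather" exists as a token in the inverted index for the title field.
GET movies/_search
{
  "query": {
    "term": {
      "title": "godfather"
    }
  }
}
```

Note that this matches any movie whose title contains the token godfather, so it is not a reliable way to look up one specific title.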
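
If an exact-title lookup is what you need, one common approach is to index the title as a keyword as well. The sketch below assumes a hypothetical index named movies_demo with a keyword sub-field called title.keyword; it is not the mapping used for the article’s movies index.

```console
# Hypothetical index: title is text, with a keyword sub-field for exact matches.
PUT movies_demo
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

# Index one sample movie; refresh=true makes it searchable immediately.
PUT movies_demo/_doc/1?refresh=true
{
  "title": "The Godfather"
}

# The keyword sub-field stores the title as is, so the exact string matches.
GET movies_demo/_search
{
  "query": {
    "term": {
      "title.keyword": "The Godfather"
    }
  }
}
```

The same term query against the plain title field would still return nothing, since that field only holds the analyzed tokens.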

These short articles are condensed excerpts taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository.

Written by Madhusudhan Konda

Madhusudhan Konda is a full-stack lead engineer, mentor, and conference speaker. He delivers live online training on Elasticsearch, Elastic Stack, and Spring Cloud.