Elasticsearch In Action: Core Data Types
I will be presenting these short, condensed articles as a mini-series on each of the topics over the next few months. The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository. You can find executable Kibana scripts in the repository so you can run the commands in Kibana straight away. All code is tested against Elasticsearch 7.15 (the book targets version 8.0).
This is part 4 of n in the series on mapping:
- 1. Overview of Mapping
- 2. Dynamic Mapping
- 3. Explicit Mapping
- 4. Core data types
- 5. Advanced data types
This is a slightly longer article with hands-on examples, so be sure to have a running instance of Elasticsearch and Kibana to try out the code.
Elasticsearch provides a rich list of data types, ranging from simple to complex to specialized types. The list of these types keeps growing, so watch for more along the way.
Data type classifications
Elasticsearch provides over two dozen different data types, so we have a good selection of appropriate data types based on our specific needs. Data types can be broadly classified under the following categories:
- Simple types: The common data types representing strings (textual information), dates, numbers, and other basic data variants. The examples are text, boolean, long, date, double, binary, etc.
- Complex types: The complex types are created by composing additional types, similar to an object construction in a programming language where the objects can hold inner objects. The complex types can be flattened or nested to create even more complex data structures based on the requirements at hand. The examples are object, nested, flattened, join, etc.
- Specialized types: These types are predominantly used for specialized cases such as geolocation and IP addresses. The common example types are geo_shape, geo_point, ip, and range types such as date_range, ip_range, and others.
Every field in a document can have one or more types associated with it, as per the business and data requirements. The table below provides a list of common data types with some examples.
At the time of writing, Elasticsearch defines about 29 data types. In some cases it defines the types at a very fine granularity to optimize search queries. For example, the text types are further classified into more specific types such as search_as_you_type, match_only_text, completion, token_count, and others.
Multiple data types
In programming languages like Java or C#, we can’t define a variable with two different types. However, there is no such restriction in Elasticsearch. The engine is pretty cool when it comes to representing a field with multiple data types, allowing us to create multiple types for the same field. For example, we may want the author of a book to be both a text type and a keyword type. Each of these types has its own characteristics; for instance, keyword values are not analyzed, meaning that the value gets stored as-is.
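As a minimal sketch of this idea, the mapping below declares the author field as text and attaches a keyword sub-field through the fields parameter (often called multi-fields). The index name books and the sub-field name raw are assumptions made for illustration; they are not names from the book.

```console
# Hypothetical index: author is analyzed as text, author.raw is kept as-is
PUT books
{
  "mappings": {
    "properties": {
      "author": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
```

Full-text queries would then target author, while exact matches, sorting, and aggregations would target author.raw.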
Elasticsearch provides a set of core data types such as text, keyword, date, long, boolean, etc., to represent data. In the next few sections of this article, we will run over the core basic data types with examples where possible.
The text datatype
We consume a lot of textual information in the current digital world, for example as blog posts, research articles, news items, tweets, and many others. This human-readable text, also termed full text or unstructured text in search engine lingo, is the bread and butter of a modern search engine. If there’s one type of data that search engines must handle well, it is full-text data. Elasticsearch defines a dedicated data type, the text datatype, to support such textual information fields.
We can set the text datatype on a property when creating an index explicitly as the code listing here demonstrates:
# Creating a property (name) with a text type explicitly
PUT my_index_with_text_type
{
  "mappings": {
    "properties": {
      "name":{
        "type": "text"
      }
    }
  }
}
Let’s take a slight detour and understand how Elasticsearch treats a text field.
Analyzing text fields
Any field that’s stamped with the text datatype gets analyzed before it gets persisted. The unstructured text (or full text, as it is commonly called) undergoes an analysis process whereby the data is split into tokens, characters are filtered out, words are reduced to their root word (stemming), synonyms are added, and other natural language processing rules are applied. The following shows a user’s review comment on a movie:
"The movie was sick!!! Hilarious :) :) and WITTY ;) a KiLLer 👍"
When this document is indexed, it undergoes analysis based on the configured analyzer. Analyzers are software modules employed by Elasticsearch to tokenize and normalize the incoming text. By default, Elasticsearch uses the standard analyzer, and analyzing the above review comment leads to the following steps:
- The punctuation and special characters are stripped away. (The standard analyzer configures no character filters of its own; its tokenizer discards these characters while splitting the text on word boundaries. The step is shown separately here for clarity.) This is how it looks after this step:
The movie was sick Hilarious and WITTY a KiLLer 👍
- The sentence is broken down into tokens using a tokenizer, resulting in:
[The, movie, was, sick, Hilarious, and, WITTY, a, KiLLer, 👍]
- The tokens are changed to lowercase using token filters, so it looks like this:
[the, movie, was, sick, hilarious, and, witty, a, killer, 👍]
These steps may vary, depending on your choice of analyzer. For example, if you choose the English analyzer, common stop words (the, was, and, a) are removed and the remaining tokens are reduced to their root words (a process called stemming):
[movi, sick, hilari, witti, killer, 👍]
Did you notice the stemmed words like movi, hilari, and witti? They are not real words per se, but the incorrect spellings don’t matter as long as all the derived forms can match the stemmed words.
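If you want to see the tokens for yourself, the _analyze API runs an analyzer over a piece of text without indexing anything. The requests below are a sketch using the built-in standard and english analyzers; the exact tokens returned (especially for the emoji) may differ between Elasticsearch versions.

```console
# Tokens produced by the default (standard) analyzer
GET _analyze
{
  "analyzer": "standard",
  "text": "The movie was sick!!! Hilarious :) :) and WITTY ;) a KiLLer 👍"
}

# The same text through the english analyzer (stop words removed, stemming applied)
GET _analyze
{
  "analyzer": "english",
  "text": "The movie was sick!!! Hilarious :) :) and WITTY ;) a KiLLer 👍"
}
```

Comparing the two responses side by side is a quick way to see what a change of analyzer does to the same input.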
Stemming
Stemming is a process of reducing words to their root words. For example, fighter, fight, and fought all may lead to one word, fight. Similarly, authoring, authors, and authored all point to author. The stemmers are language dependent; for example, one can employ a german stemmer if the chosen language of the documents is German. The stemmers are declared via token filters when composing the analyzer module during the text analysis phase in Elasticsearch. The same process is triggered again during execution of search queries on the same field. We have a dedicated chapter in the book for text analysis.
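To try stemming in isolation, one option is to pass a tokenizer and a list of token filters to the _analyze API. This sketch assumes the built-in stemmer token filter, which applies an English stemmer unless configured otherwise. The stems it returns depend on the stemming algorithm, so irregular forms such as fought may not be reduced the way you expect.

```console
# Ad hoc analysis chain: standard tokenizer, then lowercase and stemmer filters
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stemmer"],
  "text": "authoring authors authored"
}
```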
As mentioned earlier, Elasticsearch further classifies the text fields into more specific types such as search_as_you_type, match_only_text, completion, token_count, and others. We look at them in the latter part of the book in a separate chapter on advanced types (however, sample code is available on the book’s GitHub repository if you are curious). We continue our journey of learning about common data types, the keyword data type being the next in line.
The keywords data types
The keywords family of data types is composed of the keyword, constant_keyword, and wildcard field types. Let’s look at these types here.
The keyword type
Structured data, such as pin codes, bank accounts, and phone numbers, doesn’t need to be searched as partial matches or produce relevance-ranked results. The results tend to provide a binary output: return results if there is a match, or return none. This type of query doesn’t care about how well the document is matched, so you expect no relevance scores associated with the results. Such structured data is represented as the keyword data type in Elasticsearch.
The keyword datatype leaves the fields untouched. The field is untokenized and not analyzed. The advantage of keyword fields is that they can be used in data aggregations, range queries, and filtering and sorting operations on the data. To set a keyword type, use this format:
"field_name":{ "type": "keyword" }
For example, the following code listing creates an email field of the keyword type:
PUT faculty
{
  "mappings": {
    "properties": {
      "email": {
        "type": "keyword"
      }
    }
  }
}
We can also declare numeric values as keywords. For example, the credit_card_number may be declared as a keyword rather than as a numeric type such as long, for more efficient access. We are unlikely to build range queries on such data. The rule of thumb is that if numerical fields are not used in range queries, then declaring them as keyword types is advised, as it aids faster retrieval.
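As an illustration of exact matching on a keyword field, the sketch below indexes one document into the faculty index created above and looks it up with a term query. The document ID and the email address are made up for this example.

```console
# Index a sample faculty member (values are illustrative)
PUT faculty/_doc/1
{
  "email": "john.doe@example.com"
}

# The term query compares the whole, unanalyzed value
GET faculty/_search
{
  "query": {
    "term": {
      "email": {
        "value": "john.doe@example.com"
      }
    }
  }
}
```

Searching for only a fragment such as john.doe with the same term query should return nothing, because the value was indexed as a single, unanalyzed token.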
The constant_keyword type
When every document in an index is expected to have the same value for a field, irrespective of how many documents there are, the constant_keyword type comes in handy. Let’s just say the United Kingdom is carrying out a census in 2031, and for obvious reasons, the country field of each citizen’s individual census document is expected to be “United Kingdom” by default. Ideally, there is no need to send the country field for each of the documents when they are indexed into the census index. This is where the constant_keyword will be helpful. The mapping schema defines an index (census) with a field called country and its type being constant_keyword.
The code shown here demonstrates the mapping definition, with the country field’s type as constant_keyword. We are setting a default value for this field as “United Kingdom” at the time of declaring the mapping definition.
PUT census
{
  "mappings": {
    "properties": {
      "country":{
        "type": "constant_keyword",
        "value":"United Kingdom"
      }
    }
  }
}
We index a document for John Doe, with just his name (no country field):
PUT census/_doc/1
{
  "name":"John Doe"
}
When we search for all residents of the UK (though the document hasn’t got that field during indexing), we receive the positive result — returning John’s document:
GET census/_search
{
  "query": {
    "term": {
      "country": {
        "value": "United Kingdom"
      }
    }
  }
}
The constant_keyword field will have the same value for every document in that index.
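One consequence worth trying out: because the value is fixed in the mapping, a document that carries a different country should be rejected at index time. The request below is a sketch of that check; the name and the country value are made up.

```console
# Expected to fail: the value differs from the one configured in the mapping
PUT census/_doc/2
{
  "name": "Jane Doe",
  "country": "France"
}
```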
The wildcard data type
The wildcard data type is another special data type that belongs to the keywords family; it supports searching data using wildcards and regular expressions. We define the field as a wildcard type by declaring it as "type": "wildcard" in the mapping definition. We can query the field by issuing a wildcard query as demonstrated in the listing below (the document "description":"Null Pointer exception as object is null" was indexed prior to this query):
GET errors/_search
{
  "query": {
    "wildcard": {
      "description": {
        "value": "*obj*"
      }
    }
  }
}
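For completeness, the query above presumes an errors index along the following lines. This is a sketch reconstructed from the description in the text; the document ID is an assumption.

```console
# Map the description field as a wildcard type
PUT errors
{
  "mappings": {
    "properties": {
      "description": {
        "type": "wildcard"
      }
    }
  }
}

# Index the sample document mentioned in the text
PUT errors/_doc/1
{
  "description": "Null Pointer exception as object is null"
}
```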
Keyword fields are efficient and performant, so using them appropriately will improve the indexing and search query performance.
The date data type
Elasticsearch provides a date datatype to support indexing and searching of date-based data. The date fields are considered to be structured data; hence, you can use them in sorting, filtering, and aggregations. Elasticsearch parses the string value and infers it as a date if the value conforms to the ISO 8601 date standard. That is, the date value is expected to be in the format of yyyy-MM-dd or, with a time component, yyyy-MM-ddTHH:mm:ss.
JSON doesn’t have a date type, so dates in the incoming documents are expressed as strings. These are parsed by Elasticsearch and indexed appropriately. For example, a value such as "article_date":"2021-05-01" or "article_date":"2021-05-01T15:45:50" is considered a date and is indexed as date type because the value conforms to the ISO standard.
Just as we did earlier with other data types, we can create a field of date type during the mapping definition. The listing below creates a departure_date_time field for a flight document.
PUT flights
{
  "mappings": {
    "properties": {
      "departure_date_time": {
        "type": "date"
      }
    }
  }
}
When indexing a flight document, setting "departure_date_time":"2021-08-06" (or "2021-08-06T05:30:00" with a time component) will index the document with the date as expected.
When no mapping definition for a date field exists in an index, Elasticsearch parses a document successfully when the date is in either the yyyy-MM-dd (ISO date format) or the yyyy/MM/dd (non-ISO date format) format. However, once we’ve created the mapping definition for a date, the date in the incoming document is expected to follow the format defined in the mapping definition.
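To observe this date detection, one can index a document into an index that has no mapping yet and then inspect the mapping Elasticsearch derived. The index name articles is an assumption for this sketch; the field name comes from the earlier example.

```console
# No mapping exists yet, so the field type is inferred from the value
PUT articles/_doc/1
{
  "article_date": "2021/05/01"
}

# Inspect the inferred mapping; article_date is expected to appear as a date field
GET articles/_mapping
```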
We can of course change the format of the date if we need to. That is, instead of setting the date in ISO format (yyyy-MM-dd), we can customize the format as per our need by setting the required format on the field during its creation, as shown in the snippet here:
"departure_date_time":{
  "type": "date",
  "format": "dd-MM-yyyy||dd-MM-yy"
}
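Wrapped in a complete request, the snippet might look like the following sketch. The index name flights_custom_format is hypothetical (the flights index created earlier already maps this field with the default format), and the sample document and range query are assumptions added to show that query bounds are, by default, interpreted using the field's own format.

```console
# Hypothetical index whose date field accepts two custom formats
PUT flights_custom_format
{
  "mappings": {
    "properties": {
      "departure_date_time": {
        "type": "date",
        "format": "dd-MM-yyyy||dd-MM-yy"
      }
    }
  }
}

# A sample document using the four-digit-year format
PUT flights_custom_format/_doc/1
{
  "departure_date_time": "06-08-2021"
}

# The bounds below are written in the field's format (dd-MM-yyyy)
GET flights_custom_format/_search
{
  "query": {
    "range": {
      "departure_date_time": {
        "gte": "01-08-2021",
        "lte": "31-08-2021"
      }
    }
  }
}
```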
The incoming documents can now have the departure field set as
"departure_date_time" :"06-08-2021"
or
"departure_date_time" :"06-08-21"
Numeric data types
Elasticsearch supplies a handful of numeric data types to handle integer and floating-point data. The table shown below provides the list of numeric types:
We declare the field and its data type as "field_name":{ "type": "short"}. The following code snippet demonstrates how we can create a mapping schema with a few numeric fields:
"age":{
  "type": "short"
},
"grade":{
  "type": "half_float"
},
"roll_number":{
  "type": "long"
}
..
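Wrapped in a full request, the fragment above could look like the following sketch. The index name students and the sample values are assumptions; only the three field definitions come from the snippet.

```console
# Hypothetical index with the three numeric fields from the fragment
PUT students
{
  "mappings": {
    "properties": {
      "age": { "type": "short" },
      "grade": { "type": "half_float" },
      "roll_number": { "type": "long" }
    }
  }
}

# A sample document (values are illustrative)
PUT students/_doc/1
{
  "age": 14,
  "grade": 8.5,
  "roll_number": 10234
}

# Numeric fields support range queries
GET students/_search
{
  "query": {
    "range": {
      "age": { "gte": 12, "lte": 16 }
    }
  }
}
```

Choosing the smallest numeric type that covers the expected values (short for an age, for instance) is the design choice the fragment illustrates.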
The boolean data type
The boolean data type represents the binary value of a field: true or false. For example, we can declare the field’s type as boolean, as shown in the snippet below:
PUT blockbusters
{
  "mappings": {
    "properties": {
      "blockbuster": {
        "type": "boolean"
      }
    }
  }
}
We can then index a couple of movies (Avatar as a blockbuster and Mulan (2020) as a flop), as shown in the code snippet below:
# Avatar
PUT blockbusters/_doc/1
{
  "title":"Avatar",
  "blockbuster":true
}

# Mulan - note how we are setting stringified "false"
PUT blockbusters/_doc/2
{
  "title":"Mulan",
  "blockbuster":"false"
}
In addition to setting the field as JSON’s boolean type (true or false), the field also accepts “stringified” boolean values such as "true" or "false", as you can see in the second example (Mulan). You can use a term query (booleans are classified as structured data) to fetch the results. For example, the following query will fetch Avatar as the blockbuster:
GET blockbusters/_search
{
  "query": {
    "term": {
      "blockbuster": {
        "value": "true"
      }
    }
  }
}
You can also provide an empty string for a false value: "blockbuster":"".
The range data type
The range data types represent lower and upper bounds for a field. For example, if we want to select a group of volunteers for a vaccine trial, we can segregate the volunteers based on some categories such as age 25–50, 51–70, demographics such as income level, city dwellers, and so on.
Elasticsearch supplies a range data type for supporting search queries on range data. The range is defined by operators such as lte (less than or equal to) and lt (less than) for upper bounds and gte (greater than or equal to) and gt (greater than) for lower bounds.
There are various types of range data types provided in Elasticsearch: date_range, integer_range, float_range, ip_range, and others.
The date_range type example
The date_range data type helps index a range of dates for a field. We can then use range queries to match some criteria based on the lower and upper bounds of the dates.
Let’s code an example to demonstrate the date_range type. Venkat Subramaniam is an award-winning author who delivers training sessions on various subjects from programming to design to testing. Let’s consider a list of his training courses and the dates for our example.
We create a trainings index with two fields, the name of the course and the training dates, with text and date_range types respectively, as given in the listing below:
PUT trainings
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "training_dates": {
        "type": "date_range"
      }
    }
  }
}
Now that we have the index ready, let’s go ahead and index a few documents with Venkat’s training courses and dates.
# First document
PUT trainings/_doc/1
{
  "name":"Functional Programming in Java",
  "training_dates":{
    "gte":"2021-08-07",
    "lte":"2021-08-10"
  }
}

# Second document
PUT trainings/_doc/2
{
  "name":"Programming Kotlin",
  "training_dates":{
    "gte":"2021-08-09",
    "lte":"2021-08-12"
  }
}

# Third document
PUT trainings/_doc/3
{
  "name":"Reactive Programming",
  "training_dates":{
    "gte":"2021-08-17",
    "lte":"2021-08-20"
  }
}
The date_range type field expects two values: an upper bound and a lower bound. These are usually represented by abbreviations like gte (greater than or equal to), lt (less than), and so on. Now that we have prepped the data, let’s issue a search request (listing given below) to find out Venkat’s courses between two dates:
GET trainings/_search
{
  "query": {
    "range": {
      "training_dates": {
        "gt": "2021-08-10",
        "lt": "2021-08-12"
      }
    }
  }
}
As a response to the query, we see Venkat is delivering Programming Kotlin between these two dates (the second document matches for these dates). The date_range type made it easy to search among a range of dates.
In addition to date_range, we can create other ranges like ip_range, float_range, double_range, integer_range, and so on. Refer to my GitHub repository for more examples.
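The other range types follow the same pattern. As a sketch based on the vaccine trial scenario mentioned earlier, the requests below store an age bracket as an integer_range and then look for the trial groups that accept a 30-year-old. The index name, field name, and values are all assumptions for illustration.

```console
# Hypothetical index holding one age bracket per trial group
PUT vaccine_trials
{
  "mappings": {
    "properties": {
      "age_group": {
        "type": "integer_range"
      }
    }
  }
}

# A group that accepts volunteers aged 25 to 50
PUT vaccine_trials/_doc/1
{
  "age_group": {
    "gte": 25,
    "lte": 50
  }
}

# A term query on a range field matches documents whose range includes the value
GET vaccine_trials/_search
{
  "query": {
    "term": {
      "age_group": {
        "value": 30
      }
    }
  }
}
```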
The IP (ip) address data type
Elasticsearch provides a specific data type to support internet protocol (IP) addresses: the ip data type. This data type supports both IPv4 and IPv6 IP addresses. To create a field of ip type, use "field":{"type": "ip"} as the following example shows:
PUT networks
{
  "mappings": {
    "properties": {
      "router_ip": {
        "type": "ip"
      }
    }
  }
}
Indexing the document is then straightforward:
PUT networks/_doc/1
{
  "router_ip":"35.177.57.111"
}
We can use our _search endpoint to search for the IP addresses that match our query. The following query searches for data in the networks index to get the matching IP address:
GET networks/_search
{
  "query": {
    "term": {
      "router_ip": {
        "value": "35.177.0.0/16"
      }
    }
  }
}
It’s time to wrap up. In this article, we looked at some of the important core data types that Elasticsearch provides (of course, there are a few more!).
In the next article, we go over a list of advanced data types like object, nested, join, and others. Stay tuned!
These short articles are condensed excerpts taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository.