Just Elasticsearch: 3/n. Indexing Operations

9 min read · Aug 13, 2020

This is the third article in a series that explains Elasticsearch as simply as possible, with practical examples. The articles are condensed versions of topics from my upcoming book, Just Elasticsearch. My aim is to create a simple, straight-to-the-point, example-driven book that one can read over a weekend to get started with Elasticsearch.

All the code snippets, examples, and datasets related to this series of articles are available on my GitHub Wiki.

Overview

An index is a logical collection of our data, represented as documents. Documents of similar shape (for example, employees, orders, login audit data, or news stories by region) are each held in their own index. Each index is distributed across shards and replicas, and a newly created index is associated with a set number of both.

Elasticsearch provides a set of Index APIs for all indexing operations such as creating, deleting, reindexing, and others. We can access these APIs using REST over HTTP.

Every index is associated with three sets of configurations: settings, mappings, and aliases:

We use the settings configuration to set the number of shards and replicas, amongst other properties required for the index. Shards and replicas allow scaling and high availability of the data.
The mappings define the schema of our data. They define the datatype of each and every field of the data that is stored and searched.
Aliases are alternate names given to an index or a set of indices. Aliases make it easy to query across multiple indices and to reindex data with zero downtime.

We will look at these three components shortly, but first, let’s focus on the indexing operations.

Creating an Index

An index can be created in two ways: explicitly, using the Index APIs, or automatically (implicitly), when indexing documents. We whizzed through the details in our first article, but let’s recap what we learned in the next couple of sections.

Automatic Creation

When we index a document into an index that doesn’t exist yet, Elasticsearch doesn’t moan about it. Instead, it happily creates the index for us. Elasticsearch applies some default settings, such as one shard and one replica, to an index created in this manner.

Convention over Configuration is a software paradigm that advocates building software with sensible defaults. These defaults matter for application setup and runtime, so that the user journey is as smooth as possible. Elastic employs this philosophy so that the software runs out of the box, with minimal tweaks or configuration settings.

We create a document using the Document API, as the snippet below shows:

PUT cars/_doc/1 
{ 
  "make":"Maserati", 
  "model":"GranTurismo Sport", 
  "speed_mph":186 
}

When this request is sent to Elasticsearch, the server creates an index called cars if it doesn’t already exist. The index is set up with default settings, and then a document with ID 1 is indexed into it. Remember, we did not ask Elasticsearch to create the index; it went ahead and created one for us automatically.

We can fetch the details of the newly created index by calling GET <index_name>, which returns the details of the index, including the defaults our Elasticsearch friend applied for us:

// Retrieve the cars index
GET cars

// Response
{ 
  "cars" : { 
    "aliases" : { }, 
    "mappings" : { }, 
    "settings" : { 
      "index" : { 
        "creation_date" : "1587857239887", 
        "number_of_shards" : "1", 
        "number_of_replicas" : "1", 
        "uuid" : "T81am-KEQK6lVSLAd_vRiw", 
        "version" : { 
          "created" : "7040099" 
         }, 
        "provided_name" : "cars" 
      } 
    } 
  } 
}

As you can see from the above response, each index is made up of mappings, settings, and aliases. By default, Elasticsearch creates one shard and one replica (the number_of_shards and number_of_replicas properties) when an index is created. We can adjust these settings if we create the index explicitly, which is the subject of the next section.
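The implicit-creation behaviour described in this section can be modelled in a few lines of Python. This is only an in-memory sketch, not how the server is implemented: the indices dictionary and the create_index and index_document helpers are hypothetical names invented for illustration, and the defaults are the one shard and one replica seen in the response above.

```python
# Defaults taken from the GET cars response shown in the article
DEFAULT_SETTINGS = {"number_of_shards": 1, "number_of_replicas": 1}

# Hypothetical in-memory stand-in for a cluster: index name -> settings and documents
indices = {}

def create_index(name, settings=None):
    """Explicit creation: caller-supplied settings override the defaults."""
    if name in indices:
        raise ValueError(f"index {name!r} already exists")
    indices[name] = {
        "settings": {**DEFAULT_SETTINGS, **(settings or {})},
        "docs": {},
    }

def index_document(name, doc_id, doc):
    """Implicit creation: a missing index is created with defaults first."""
    if name not in indices:
        create_index(name)
    indices[name]["docs"][doc_id] = doc

# Mirrors PUT cars/_doc/1 from the article
index_document(
    "cars", "1",
    {"make": "Maserati", "model": "GranTurismo Sport", "speed_mph": 186},
)
print(indices["cars"]["settings"])
```

Passing a settings dictionary to create_index before indexing any document corresponds to the explicit route, where we decide the shard and replica counts ourselves.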

Explicitly Creating an Index

In this section, we will look at creating an index explicitly using the index API. We won’t let Elasticsearch decide the configuration of our index; instead, we will provide all the required inputs ourselves. Creating an index explicitly has advantages, and it is my preference too: we can tell Elasticsearch to create the index with the required number of shards and replicas rather than depending on the defaults, as shown in the snippet below:

// This command will create a cars index 
PUT cars 
{ 
  "settings": { 
    "number_of_shards":2, 
    "number_of_replicas":1 
  }
}

Creating an index is an easy task: the API call PUT <index_name> on its own creates a new index with default settings. Here our requirement is two shards and one replica, so we pass a settings object with number_of_shards set to 2 and number_of_replicas set to 1.

Now we have our cars index, equipped with two shards and one replica of each. When we index documents, Elasticsearch routes each document to one of the shards based on a routing hash algorithm. Once the document reaches the shard, that shard persists the data and, in parallel, sends a copy to its replica for backup.

When we create an index, it has a set number of shards allocated to it. This number cannot be changed afterwards (unless we reindex the data).

The reason is that the routing function depends on the number of shards: when a document is indexed, its target shard is computed as a function of the shard count. If Elasticsearch let us change the shard count (say, from 2 to 4), the routing function would point existing documents at different shards from the ones that actually hold them, and lookups would fail. This is one of the reasons why Elasticsearch doesn’t allow us to change the number of shards once an index is live.
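The routing argument can be made concrete with a short Python sketch. It assumes a simplified form of the routing formula, shard = hash(routing value) % number of primary shards, with the document ID as the routing value. The hash here is zlib.crc32, chosen purely as a stand-in for illustration (it is not the hash Elasticsearch uses), and route_to_shard is a hypothetical helper name.

```python
import zlib

def route_to_shard(doc_id: str, number_of_shards: int) -> int:
    """Pick a shard for a document ID (simplified routing sketch).

    zlib.crc32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards

doc_ids = [str(i) for i in range(1, 101)]

# Where each document is stored in an index with 2 primary shards
with_two_shards = {d: route_to_shard(d, 2) for d in doc_ids}

# Where the same lookup would point if the shard count became 4
with_four_shards = {d: route_to_shard(d, 4) for d in doc_ids}

# Documents whose computed home changed: a lookup would now
# go to a shard that never stored them
misplaced = [d for d in doc_ids if with_two_shards[d] != with_four_shards[d]]
print(f"{len(misplaced)} of {len(doc_ids)} documents would be looked up on the wrong shard")
```

Because the shard count sits inside the formula, changing it changes the answer for documents that were already stored, which is why a new shard count goes hand in hand with reindexing.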

Should we have to reconfigure the shards for whatever reason, reindexing is the best bet. We will discuss the reindexing APIs later in this series.

Deleting an Index

Deleting an existing index is straightforward: simply issue the DELETE <index_name> command:

// Delete the cars index
DELETE cars

// Response
{  
  "acknowledged" : true 
}

Of course, you can delete multiple indices too. Append a comma-separated list of indices to delete in one go:

// Delete multiple indices (comma-separated, no spaces)
DELETE cars,movies,order

// You can delete indices using a wildcard too (careful: this removes every index)
DELETE *

Opening and Closing Indices

The indices can be closed or open, depending on the use case. Closing the index means exactly that — it is closed for business. After the index is closed, there will be no indexing of fresh documents or search queries or analytics. Closed indices are not available for normal operations, so do take care before closing the indices (there’s a chance it could break the system if they are referenced in your code).

On the other hand, the opening of the index will kick start the shards back into business, so they are open for indexing and searching when ready.

You can use POST <index_name>/_close API to close an index. We can close multiple indices by using the comma-separated indices like POST <index_name1>,<index_name2>,<index_*>/_close or even all the indices by using a wild card: POST */_close

You can open up the closed index for business by simply calling open API: POST <index_name>/_open.

Index Settings

Now that we’ve got a good idea of indexing operations, let’s dig in to understand the properties of an index. Every index can be instantiated with some properties — whether default or custom ones — called settings. We played earlier by changing the number_of_replicas property on an existing index.

Index settings exist as two variants:

Dynamic Settings: These are the settings that can be modified on a live index, for example the number of replicas, allowing or disallowing writes, the refresh interval, and so on. We use the _settings API to update these properties on a live index.
Static Settings: These can only be applied when the index is created, for example the number of shards, the codec, and a couple of others. None of them can be changed while the index is in use. Some static settings can be re-applied after closing the index; the number of shards cannot, so changing it means re-creating the index.
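The split between the two variants can be sketched in Python as follows. This is a hedged illustration rather than a client library: plan_settings_update is a hypothetical helper, the two sets list only a handful of settings named in this article, and the function merely builds the method, path, and body of a _settings request without sending anything.

```python
import json

# Hand-picked examples for illustration only; the real lists are much longer
STATIC_SETTINGS = {"number_of_shards", "codec"}
DYNAMIC_SETTINGS = {"number_of_replicas", "refresh_interval"}

def plan_settings_update(index, changes):
    """Return (request, rejected) for a set of requested setting changes.

    request  -- (method, path, body) covering the dynamic settings, or None
    rejected -- names this sketch treats as static, so not updatable live
    """
    unknown = set(changes) - STATIC_SETTINGS - DYNAMIC_SETTINGS
    if unknown:
        raise ValueError(f"settings not covered by this sketch: {sorted(unknown)}")
    dynamic = {k: v for k, v in changes.items() if k in DYNAMIC_SETTINGS}
    rejected = sorted(k for k in changes if k in STATIC_SETTINGS)
    request = None
    if dynamic:
        request = ("PUT", f"{index}/_settings", json.dumps({"index": dynamic}))
    return request, rejected

request, rejected = plan_settings_update(
    "cars", {"number_of_replicas": 2, "number_of_shards": 4}
)
print(request)   # the request that could be sent to a live index
print(rejected)  # the changes that need a new index instead
```

In the usage above, the replica count ends up in the request for the live cars index, while the shard count is set aside because it is fixed at creation time.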

The settings can also be applied globally across multi-cluster platforms.

Index Templates

Copying the same settings across various indices one by one is a tedious job. It would be easier to have a predefined settings schema, so that any new index is implicitly molded from it. Every new index then follows the same settings, which keeps indices homogeneous across the organization. It also saves DevOps from having to advocate the optimal settings to individual teams over and over. One use case is to create a set of templates per environment: say, indices in a dev environment should have 3 shards and 2 replicas, while PROD must have 5 shards and 5 replicas.

This is where templating of indices comes into the picture. We can create a template with predefined patterns. So, when creating a new index, if the index name matches the pattern, the template is applied. In addition to this, we can create a template based on a glob pattern such as wild cards, prefixes, and others. We can create a set of templates with appropriate index patterns with predefined settings.

Glob (short for global) patterns are a common wildcard notation used in computer software. For example, we regularly use them to search for all files ending with a given extension, such as *.txt, *.java, or *.log.

We use the _template endpoint to create such patterns. Let’s create a template for cars pattern:

PUT /_template/dev_template 
{ 
  "index_patterns":["*_cars","cars*"], 
  "settings":{ 
    "number_of_shards":5, 
    "number_of_replicas":2, 
    "blocks":{ 
      "read_only_allow_delete":true 
    } 
  } 
}

The above command creates an index template with the settings provided in the request body. The index_patterns property takes a list of patterns, here *_cars and cars*, and each pattern can use wildcards to match prefixes or suffixes of index names.

Now that the template is in place, it is used whenever we create a new index whose name matches one of the index_patterns defined in the template. The templated settings are applied to any match, for example old_cars, family_cars, carsoflondon, carsnew, and so on.

// Creating old_cars - uses the first pattern: *_cars
PUT old_cars

// Creating carsoflondon - uses the second pattern: cars*
PUT carsoflondon
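To see which index names the two patterns would pick up, here is a small Python sketch. It approximates the matching with the standard library's fnmatchcase, which is an assumption made for illustration: shell-style globbing also gives special meaning to ? and [ ], so it only stands in for index pattern matching where * is the sole wildcard. The matching_patterns helper is a hypothetical name.

```python
from fnmatch import fnmatchcase

# The patterns from dev_template
index_patterns = ["*_cars", "cars*"]

def matching_patterns(index_name):
    """Return the template patterns that the index name matches."""
    return [p for p in index_patterns if fnmatchcase(index_name, p)]

for name in ["old_cars", "family_cars", "carsoflondon", "carsnew", "racecars"]:
    print(name, "->", matching_patterns(name) or "no template match")
```

Note that a name like racecars matches neither pattern: it has no _cars suffix and does not start with cars, so an index with that name would not receive the templated settings.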

To check whether Elasticsearch honored the template while creating these indices, simply fetch one of them with GET old_cars and confirm that number_of_shards is 5.

Do note that templating doesn’t work retroactively; that is, any pre-existing indices will not be altered. You can use the GET /_template/dev_template command to fetch the persisted template.

Aliases

Aliases are alternate names given to indices for various purposes such as:

Aggregating data from multiple indices (as a single alias) for easy searching
Enabling zero downtime during re-indexing

Once an alias is created, we can use it for indexing, querying, and all other purposes as if it were an index.

Aliases are a handy tool during development as well as in production. We can group multiple indices under one alias, so queries can be written against a single alias rather than a dozen indices!

We can use the _alias endpoint to create an alias for a given index. The request format is PUT <index_name>/_alias/<alias_name>. You can also create a single alias for a set of indices using comma-separated index names or even wildcards.

See the _alias API in action below:

// cars is the alias for the old_cars index 
PUT old_cars/_alias/cars 

// Multiple indices with one alias 
PUT vintage_cars,power_cars,rare_cars/_alias/highend_cars 

// Even wildcarded indices with a single alias name 
PUT *cars/_alias/all_cars

We can alias multiple indices to a single alias, as you can see from the second and third examples in the above snippet.

In addition to the _alias API, there is another API for performing multiple aliasing actions in one go: the _aliases API. It combines actions such as adding and removing aliases, as well as deleting indices, in a single request. Let’s see it in action:

POST /_aliases 
{ 
  "actions" : [ 
    { "remove" : { "index":"vintage_cars", "alias":"vintage_cars_alias" } }, 
    { "add" : { "index":"vintage_cars2", "alias":"vintage_cars_alias" } } 
  ] 
}

Here we remove the alias vintage_cars_alias from vintage_cars and reassign it to the new vintage_cars2 index.
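The value of bundling both actions into one request is that readers of the alias never see a moment where it points at nothing. The Python sketch below models that idea in memory only; it is not the server's implementation. The apply_alias_actions helper and the alias table (a dictionary from alias name to a set of index names) are hypothetical, and raising an error for a missing alias target is a choice made for this sketch.

```python
import copy

def apply_alias_actions(aliases, actions):
    """Apply add/remove actions to a copy of the alias table, all or nothing.

    The original table is never modified, so a failing action leaves
    the caller's view untouched.
    """
    updated = copy.deepcopy(aliases)
    for action in actions:
        (kind, params), = action.items()  # each action has exactly one key
        targets = updated.setdefault(params["alias"], set())
        if kind == "add":
            targets.add(params["index"])
        elif kind == "remove":
            targets.remove(params["index"])  # KeyError if not currently aliased
        else:
            raise ValueError(f"unsupported action: {kind}")
    return updated

before = {"vintage_cars_alias": {"vintage_cars"}}
actions = [
    {"remove": {"index": "vintage_cars", "alias": "vintage_cars_alias"}},
    {"add": {"index": "vintage_cars2", "alias": "vintage_cars_alias"}},
]
after = apply_alias_actions(before, actions)
print(after)
```

This remove-then-add swap is the building block for zero-downtime reindexing: queries keep using vintage_cars_alias while the index behind it changes.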

Summary

In this article, we looked at indexing operations in detail. We’ve learned about implicit and explicit index creation, deleting indices, closing and opening of indices, and other index operations. We went over the index templating with some examples. Finally, we’ve glanced over the index configurations such as settings and aliases.

In the next article, we will go over the document operations. Stay tuned!

All the code snippets and datasets are available at my GitHub Wiki

Originally published at https://www.linkedin.com.