HTML Strip, Mapping and Pattern Replace Character Filters

Madhusudhan Konda
6 min read · Jan 26, 2023
Elasticsearch in Action by M Konda

The excerpts are taken from my book Elasticsearch in Action, Second Edition. The code is available in my GitHub repository, which includes executable Kibana scripts so you can run the commands in Kibana straight away. All code is tested against Elasticsearch version 8.4.

Me @ Medium || LinkedIn || Twitter || GitHub

When a user searches for answers, the expectation is that they won’t search with punctuation or special characters. For example, there is a high chance a user will search for “cant find my keys” (without punctuation) rather than “can’t find my keys !!!”. Similarly, the user is not expected to search for the string “<h1>Where is my cheese?</h1>” (with the HTML tags), nor to search using XML tags like <operation>callMe</operation>. Search criteria don’t need to be polluted with unneeded characters. And, sometimes, we don’t expect users to search using symbols: α instead of alpha or β in place of beta, and so on.

Based on these assumptions, we can analyze and clean the incoming text using character filters. Character filters help purge unwanted characters from the input stream. They are optional, but when used, they form the first component of the analyzer module.

An analyzer can consist of zero or more character filters. A character filter carries out the following specific functions:

  • Removes unwanted characters from an input stream. For example, if the incoming text has HTML markup like “<h1>Where is my cheese?</h1>”, the requirement may be to drop the <h1> tags.
  • Adds to or replaces characters in the existing stream. If the input field has a set of 0s and 1s, perhaps we want to replace them with “false” and “true”, respectively. If the input stream has the character β, we might map it to the word “beta” and index the field. A quick combined sketch follows this list.
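As a preview of both behaviours, the following _analyze call (the sample text here is purely illustrative) chains the built-in html_strip filter with an inline mapping filter that maps the Greek letters to English words; both filters are covered in detail in the next sections.

POST _analyze
{
  "text": "<h1>β is better than α</h1>",
  "tokenizer": "standard",
  "char_filter": [
    "html_strip",
    {
      "type": "mapping",
      "mappings": [
        "α => alpha",
        "β => beta"
      ]
    }
  ]
}

The character filters run before the tokenizer, so the tokens that come out are “beta”, “is”, “better”, “than”, and “alpha”.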

Elasticsearch provides three character filters, which we will see in action in the next sections.

Types of character filters

There are three character filters that we can use when constructing an analyzer: the HTML strip, mapping, and pattern replace filters. We saw these in action in the earlier sections, so here we will go over their semantics briefly.

HTML strip (html_strip) filter

As the name suggests, this filter strips unwanted HTML tags from the input fields. For example, when an input field with a value of <h1>Where is my cheese?</h1> is processed by the HTML strip (html_strip) character filter, the <h1> tags get purged, leaving “Where is my cheese?”. Note that this filter does not touch the punctuation or casing of the words. We can test the html_strip character filter using the _analyze API, as the following listing shows:

POST _analyze
{
  "text": "<h1>Where is my cheese?</h1>",
  "tokenizer": "standard",
  "char_filter": ["html_strip"]
}
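The _analyze call returns the tokens it produced. Trimmed to just the token values (the offset, type, and position attributes are omitted here for brevity), the response looks roughly like this:

{
  "tokens": [
    { "token": "Where" },
    { "token": "is" },
    { "token": "my" },
    { "token": "cheese" }
  ]
}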

The character filter simply strips the <h1> tags from the input field, and the standard tokenizer then produces the tokens “Where”, “is”, “my”, and “cheese”. However, there might be a requirement to avoid parsing certain HTML tags in the input field; for example, the business requirement could be to strip the <h1> tags from the sentences but preserve the preformatted (<pre>) tags. For example,

<h1>Where is my cheese?</h1>
<pre>We are human beings that lookout for cheese constantly!</pre>

Fortunately, there is a way out. We can configure the html_strip filter with an additional escaped_tags array listing the tags that should be left unparsed. Let’s see it in action. The first step is to create an index with the required custom analyzer, as the following listing shows.

PUT index_with_html_strip_filter
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_html_strip_filter_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_html_strip_filter"] # The custom character filter defined below
        }
      },
      "char_filter": {
        "my_html_strip_filter": {
          "type": "html_strip",
          "escaped_tags": ["h1"] # <h1> tags are left untouched
        }
      }
    }
  }
}

We just created an index with a custom analyzer made of an html_strip character filter. The notable difference is that the html_strip filter is extended in this example with the escaped_tags option, so <h1> tags in the field will be left untouched. To test this, run the code in the following listing, which proves this point.

POST index_with_html_strip_filter/_analyze
{
  "text": "<h1>Hello,</h1> <h2>World!</h2>",
  "analyzer": "custom_html_strip_filter_analyzer"
}

This code leaves the word with the <h1> tags as is and strips the <h2> tags, resulting in this output: “<h1>Hello,</h1> World!”.
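Returning to the earlier motivating example, we could just as easily preserve preformatted text and strip the headings instead. The sketch below (the index, analyzer, and filter names are illustrative) sets escaped_tags to ["pre"], so <h1> tags are removed while any <pre> block survives intact.

PUT index_preserving_pre_tags
{
  "settings": {
    "analysis": {
      "analyzer": {
        "strip_html_keep_pre_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["keep_pre_filter"]
        }
      },
      "char_filter": {
        "keep_pre_filter": {
          "type": "html_strip",
          "escaped_tags": ["pre"] # Only <pre> tags are preserved
        }
      }
    }
  }
}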

Mapping character filter

The mapping character filter’s sole job is to match a key and replace it with a value. As we saw in our earlier example of converting Greek letters to English words, the mapping filter parses the symbols and replaces them with words: α as alpha, β as beta, and so on.

We can test the mapping character filter using the _analyze API. For example, “UK” in the following listing is replaced with “United Kingdom” when parsed by the mapping filter.

POST _analyze
{
  "text": "I am from UK",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "UK => United Kingdom"
      ]
    }
  ]
}

If we want to create a custom analyzer with a configured mapping character filter, we should follow the same process for creating an index with analyzer settings and the required filters. This code example shows the procedure for customizing a keyword analyzer to attach a character mapping filter:

PUT index_with_mapping_char_filter
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_social_abbreviations_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_social_abbreviations"]
        }
      },
      "char_filter": {
        "my_social_abbreviations": {
          "type": "mapping",
          "mappings": [
            "LOL => laughing out loud",
            "BRB => be right back",
            "OMG => oh my god"
          ]
        }
      }
    }
  }
}

We’ve now created an index with custom analyzer settings, providing a handful of mappings in the character filter. Now that we have the index with the custom analyzer, we can follow the same process of testing it using the _analyze API, as shown in the listing below:

POST index_with_mapping_char_filter/_analyze
{
  "text": "LOL",
  "analyzer": "my_social_abbreviations_analyzer"
}

The text results in “token” : “laughing out loud”, which indicates that “LOL” was replaced with the full form, “laughing out loud”.
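In practice, we would wire this custom analyzer to a field in the index mapping rather than only exercising it through the _analyze API. A minimal sketch, assuming a hypothetical comment field, might look like this:

PUT index_with_mapping_char_filter/_mapping
{
  "properties": {
    "comment": {
      "type": "text",
      "analyzer": "my_social_abbreviations_analyzer"
    }
  }
}

Any document indexed with a comment such as “BRB” would then have the abbreviation expanded before its tokens are stored.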

Mappings via a file

We can also provide a file with the mappings in it, rather than specifying them inline in the definition. The listing below demonstrates a character filter with mappings loaded from an external file, secret_organizations.txt. The file must be present in Elasticsearch’s config directory (<INSTALL_DIR>/elasticsearch/config) or be provided with an absolute path to its location.

POST _analyze
{
  "text": "FBI and CIA are USA's security organizations",
  "char_filter": [
    {
      "type": "mapping",
      "mappings_path": "secret_organizations.txt"
    }
  ]
}
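The mappings file is a plain UTF-8 text file with one key => value mapping per line, using the same arrow syntax as the inline mappings. A hypothetical secret_organizations.txt might contain entries like these:

FBI => Federal Bureau of Investigation
CIA => Central Intelligence Agency
USA => United States of America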

Pattern replace character filter

The pattern_replace character filter, as the name suggests, replaces characters with new characters when they match the given regular expression (regex). Following the same code pattern as the mapping filter, let’s create an index with an analyzer associated with a pattern_replace character filter. The code in the following listing does exactly that.

PUT index_with_pattern_replace_filter
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_replace_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["pattern_replace_filter"]
        }
      },
      "char_filter": {
        "pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "_",
          "replacement": "-"
        }
      }
    }
  }
}

The code in this example demonstrates a mechanism to define and develop a custom analyzer with a pattern_replace character filter. Here, we match the underscore (_) character in the input field and replace it with a dash (-). If you test the analyzer as shown in the following listing, you will see the output “Apple-Boy-Cat”, with all the underscores replaced by dashes.

POST index_with_pattern_replace_filter/_analyze
{
  "text": "Apple_Boy_Cat",
  "analyzer": "my_pattern_replace_analyzer"
}
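The pattern option accepts full Java regular expressions, and the replacement can reference capture groups. As a quick sketch (the sample text and pattern here are purely illustrative), the following call captures each group of digits in a dash-separated number and re-joins the groups with underscores:

POST _analyze
{
  "text": "My phone number is 123-456-789",
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ]
}

The single keyword token that comes back is “My phone number is 123_456_789”.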

While the sentences are cleaned and cleared of unwanted characters, there remains the job of splitting the sentences into individual tokens based on delimiters, patterns, and other criteria. And that job is undertaken by a tokenizer component, discussed in the next section.
