Basic Concepts of a Search Index

Written by Ben Wendt

Introduction

A text index is a way of storing information about textual data that makes retrieval easier and faster. The basic idea is to store the information in such a way that it is possible to retrieve what you are looking for without doing a full scan of the data. A text index does this by storing information about each indexed word separately. So where in the raw text you might have a text field with the contents “this is a description”, a full text index would store information about each individual word separately: a form of “this”, “is”, “a”, and “description”.

There are two motivations for using an indexing engine: speed and quality of results. Generally speaking, sifting through a massive amount of text is a slow process. (We discussed this earlier on this blog when we looked at the Knuth-Morris-Pratt algorithm.) There are also numerous intricacies of language that can trip up a naive search, which a search index can work around.

So let’s dive in, shall we?

Tokenizing

The first stage of generating an index is to split the source text up into individual words or phrases. This stage is called “tokenizing.” A simple way of tokenizing text is the “whitespace tokenizing” technique, which splits text up by whitespace, as I did with “this is a description” earlier. But punctuation is a pitfall here. Without considering punctuation, tokenizing a phrase like “this is a pizza, and the pizza is good” would create two distinct tokens: “pizza,” (with the trailing comma) and “pizza”. This is generally not a desired outcome, so most tokenization libraries allow for a whitespace tokenizer that also accepts a list of punctuation characters to strip, as sketched below.
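As a rough sketch (the function name and default punctuation list here are my own, chosen for illustration), such a tokenizer might look like this:

/**
 * A minimal whitespace + punctuation tokenizer: replaces the given
 * punctuation characters with spaces, then splits on runs of whitespace.
 */
function punctuationTokenize($text, $punctuation = ['.', ',', '!', '?', ';', ':']) {
    $text = str_replace($punctuation, ' ', $text);
    return preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

// "pizza," and "pizza" now collapse to the same token.
print_r(punctuationTokenize('this is a pizza, and the pizza is good'));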

There is a lot more room for nuance in tokenization: for example, what if you want to use a whitespace+punctuation tokenizer, but you have IP addresses that you want to be searchable in your data? Then tokenizing something like “The IP was 127.0.0.1” would yield separate tokens for “127”, “0”, “0”, and “1”.
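One way around this, sketched here with an illustrative regular expression and reusing the punctuationTokenize sketch from above, is to pull out IP-like tokens before applying the punctuation rules:

/**
 * Illustrative tokenizer that keeps dotted IPv4 addresses intact by
 * extracting them first, then tokenizing the remaining text normally.
 */
function ipAwareTokenize($text) {
    $ipPattern = '/\b\d{1,3}(?:\.\d{1,3}){3}\b/';
    $ips = [];
    if (preg_match_all($ipPattern, $text, $matches)) {
        $ips = $matches[0];
        $text = preg_replace($ipPattern, ' ', $text);
    }
    return array_merge($ips, punctuationTokenize($text));
}

// Yields a single "127.0.0.1" token instead of "127", "0", "0", "1".
print_r(ipAwareTokenize('The IP was 127.0.0.1'));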

Another possible issue is that you may want multiple terms to be treated as a single term. For example, records that contain the phrase “magnetic resonance imaging” should be treated as more relevant in searches when the terms all appear together than when they appear apart. There are different ways to approach this problem, but one is “NGram” tokenizing (a fancy way of saying that each run of N adjacent terms will be treated as one token), sketched below.
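A word-level NGram tokenizer fits in a few lines. Here, with N = 3, a phrase like “magnetic resonance imaging” survives as a single token (the function name is my own):

/**
 * Produce word-level n-grams: every run of $n adjacent tokens
 * becomes one combined token.
 */
function ngramTokenize(array $tokens, $n = 3) {
    $ngrams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

// ["magnetic resonance imaging", "resonance imaging results"]
print_r(ngramTokenize(['magnetic', 'resonance', 'imaging', 'results']));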

The myriad nuances of tokenization are why industry leading search indices allow a developer to define their own tokenizing routine.

Analysis

Fully featured search indices like ElasticSearch or Solr will allow a software developer to define an analysis chain on their indexed data. The analysis chain is a way of normalizing the language that will be indexed so that related terms will be able to match after the index has been created.

Once a collection of tokens is generated, further processing is generally desired before storing indexed terms. If a user searches for “skeet shooting”, records with “skeet shoot” should probably match. Generally this issue is tackled by storing “stemmed” words in the index, where the “stem” of a word is generally a form of the word with any suffixes removed. A popular example is the Lovins stemmer, available in the Snowball stemming framework. (The Snowball documentation is well written and highly fascinating.)
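Real stemmers like Lovins apply a large table of endings plus contextual rules; purely as a toy illustration of the idea, a naive suffix stripper might look like this:

/**
 * A toy suffix-stripping "stemmer" -- nothing like a real Lovins or
 * Snowball implementation, just enough to show the idea.
 */
function naiveStem($word) {
    foreach (['ing', 'ed', 'er', 's'] as $suffix) {
        // Only strip if a reasonably long stem would remain.
        if (strlen($word) > strlen($suffix) + 2 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, -strlen($suffix));
        }
    }
    return $word;
}

// Both map to "shoot", so "skeet shooting" matches "skeet shoot".
echo naiveStem('shooting') . "\n"; // shoot
echo naiveStem('shoot') . "\n";    // shoot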

There are two important points to consider here:

  1. Stemming is language-specific. Suffixes that denote different meanings vary from language to language. And pictogram-based languages are (to my knowledge) practically immune to this approach to stemming.
  2. Simple suffix removal is not a foolproof method of word stemming. For example, “flammable” and “inflammable” have precisely the same meaning, but because they differ by a prefix rather than a suffix, removing their suffixes leaves two different stems. You wouldn’t want someone using your index to search for whether a product is flammable, and find no results because it is inflammable.

The solution to the first issue is to use different stemming algorithms for different languages. The solution to the second issue is to support synonyms in the search index; if the system is aware that flammable and inflammable represent the same thing, your users will find the safety information they need even if they use the term that isn’t in your data.
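One common way to support synonyms, sketched here with a hand-rolled synonym map, is to expand each token into its full synonym group at index time so that either spelling matches:

/**
 * Expand a token into all members of its synonym group, if any.
 * The synonym map here is hand-built purely for illustration.
 */
function expandSynonyms($token, array $synonymGroups) {
    foreach ($synonymGroups as $group) {
        if (in_array($token, $group, true)) {
            return $group; // index the record under every member of the group
        }
    }
    return [$token];
}

$groups = [['flammable', 'inflammable']];
// A record containing "inflammable" gets indexed under both terms.
print_r(expandSynonyms('inflammable', $groups));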

Proper analysis is where an indexing engine will beat out a raw text search in terms of quality of results.

Storing

Storing indexed data can be done in any number of ways. The important thing is that lookups should be fast. Index sizes and data concerns will determine which data structure is best. For my demo code below I am using PHP, which basically limits me to using PHP’s built-in array data structure as a hash map.

An example index

Here are a couple of functions that create a simple index:

file_put_contents('index.dat', serialize(createIndex($records)));

/**
 * @param $records array collection of records
 * @return array record collection combined with indices
 */
function createIndex($records) {
    $output = ['records' => $records, 'index' => []];
    foreach ($records as $recordKey => $record) {
        // Crude normalization: strip basic punctuation and lowercase.
        $record = str_replace(['.', ','], ' ', $record);
        $record = strtolower($record);
        $tokens = whitespaceTokenize($record);

        foreach ($tokens as $token) {
            // Map each token back to the set of records that contain it.
            if (!isset($output['index'][$token])) {
                $output['index'][$token] = [];
            }
            $output['index'][$token][$recordKey] = 1;
        }
    }
    return $output;
}

/**
 * @param $records array|string record or records to be split by whitespace.
 * @return array tokenized collection
 */
function whitespaceTokenize($records) {
    $out = [];
    if (is_string($records)) {
        $records = [$records];
    }
    foreach ($records as $record) {
        // PREG_SPLIT_NO_EMPTY avoids indexing empty tokens produced by
        // leading or trailing whitespace.
        $out = array_merge($out, preg_split('/\s+/', $record, -1, PREG_SPLIT_NO_EMPTY));
    }
    return $out;
}

Note how the indexer splits out terms and then stores them in a hash that maps each term back to the records it can be found in. This is the basic idea of a search index.
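For example, indexing two tiny records with the code above would produce a structure along these lines (abridged):

$index = createIndex(['the pizza is good', 'the pizza is cold']);
// $index['index'] now looks roughly like:
// [
//     'the'   => [0 => 1, 1 => 1],
//     'pizza' => [0 => 1, 1 => 1],
//     'is'    => [0 => 1, 1 => 1],
//     'good'  => [0 => 1],
//     'cold'  => [1 => 1],
// ]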

And here is a quick way to read matches from the index:

$index = unserialize(file_get_contents('index.dat'));

$keyword = strtolower($argv[1]);

// Guard against search terms that never made it into the index.
$matches = isset($index['index'][$keyword]) ? array_keys($index['index'][$keyword]) : [];
echo count($matches) . " matches\n";
foreach ($matches as $match) {
    echo "$match => " . $index['records'][$match] . "\n";
}

You will note that I did not create a configurable punctuation + whitespace tokenizer here, and in fact none of the more advanced features I discussed earlier are implemented. This code is only meant to illustrate how an index works, not to be fully featured. The basic concept could be cleaned up, coded to a generalized storage interface, and then the more advanced features could be added as needed to make a functioning search index written in pure PHP.

Conclusion

We have seen a mile-high overview of how a search index works, discussed the tokenization and analysis processes, briefly discussed storage, and seen just about the most basic working example of a search index. Search indices are a fascinating technology that empower users through services like Bing, Million Short, and Duck Duck Go to find information on the internet, but we have also identified some issues in the technology and hinted at some methods of dealing with these problems. We may take a more in-depth look at some of those solutions in a future post.