Testimonials

Martin Martinez
Multi-Programming Solutions team has been there for me every step of the way for this huge project! Martin Martinez
Luke Mitchell/Managing Consultant/Reach Students
Multi-Programming Solutions delivered on time, within budget and precisely to the brief – everything a client could wish for. Luke Mitchell, Managing Consultant, Reach Students
Dane Westo
I have been working with multiprogr for about 4 months now. The programming and designing has been second to none. I have used several developers and designers over a 3 year period. multiprogr is the Dane Westo
Matthew Campbell
Fantastic job, with intuitive understanding of functionality and proactive recommendations. Matthew Campbell
Benjamin Melki, Kaeria CEO
Excellent job again. Benjamin Melki, Kaeria CEO
Ken Kimble
This team is awesome! Ken Kimble

Blog

Automatic spelling correction using trigram string matching algorithm

July 2016, 14
trigram string algorythm

Today, finding information is the main task of the computer. Searching the specified substring in the text is a simple, but important task for modern search engines, DBMSs, etc. There are quite a number of string matching algorithms, including a given substring search algorithms. A subset of this group of algorithms is fuzzy search algorithms.

 

Fuzzy string search algorithms

Fuzzy string searching algorithms (also known as approximate string matching ) are the basis of modern automatic spelling check systems, which are used in document editors, optical character recognition systems, and search engines. The main task to be solved with the help of fuzzy search is to look up the correct spelling of an erroneous word in the dictionary.

There are a large number of fuzzy string searching  algorithms with different complexity. Traditionally, the concept of algorithm complexity is associated with the computer resources usage: how much CPU time the program requires for its implementation, how much of the machine's memory is allocated? Memory accounting is typically maintained by the amount of data, the memory consumed for the execution of the program commands is not taken into account. Time is calculated in relative units so that this evaluation, if possible, was similar for machines with different clock frequencies.

String metrics

Fuzzy phrase search algorithms are characterized by string metric - a function of the distance between the two words, which allows the assessment of their similarity score in the present context. A rigorous mathematical definition of a metric attribute includes the need to comply with the condition of the triangle inequality. Meanwhile, in most cases, a metric refers to a more general concept, which does not require the implementation of such conditions; this concept can also be called the distance.

Among the most famous string metrics are Hamming, Levenshtein and Damerau-Levenshtein distances. Though it should be noted that Hamming distance is considered a string metric only on the set of words of the same length, which greatly limits the scope of its application. In addition to the above, many other distances measure is sometimes used in practice, such as the Jaro-Winkler distance, for example.

Fuzzy string search is a very costly task in terms of computing resources, especially if you need to provide a high accuracy of the received results. Many scientists are working on studies of fuzzy search algorithms in an effort to improve their performance. The whole portals dedicated to fuzzy search research, exist.  Most useful algorithms are examined in this great paper in detail. One of the most balanced today is deservedly considered the N-gram algorithm.

N gram algorithm

This method was invented a while ago and is the most widely used, because of its extremely simple implementation and reasonably good performance. The algorithm is based on the following principle: "If the word A coincides with the word B, taking into account a number of errors, then very likely they will have at least one common substring of length N”. These substrings of length N are called N-grams. During indexing, the words are broken into such N-grams, and then the word lists ordered by alphabet for each of the N-grams is created. During a search, the request is also divided into N-grams, and a sequential scan of the collection, containing this string is made for each of them. Another trigram search method using the suffix tree exists, but it is way less efficient.

Trigrams and their usage in spelling correction

The most commonly used in practice are trigram matching - substrings of length 3. Selecting a larger value of N leads to a limitation on the minimum word length at which the error detection is possible.

An important trigram method usage is the spelling correction based on the word level model. Trigram word level is a pattern of three words, and trigrams model includes information on the repetition frequency of each trigram in the text. The simplest algorithm using word trigram similarity is based on calculating the probability of the whole sentence, by dividing it into trigrams and calculating their sum frequency. Thus, the trigram based indexing is made. Then the words in a sentence are replaced by candidates, and the likelihood of proposals is evaluated again. Ultimately, the embodiment with the highest probability is considered most likely to be correct.

We used the trigram search algorithm in our website's search mechanism and would like to describe the details of its implementation a little.


The way it is coded

One of the projects that we work on is a large B2B and B2C portal, representing a huge directory where proper search functionality is a mandatory. The directory is based on MongoDB and Sphinx Search Engine, and is written with the use of PHP5. All the provided listings should be reviewed with this in mind.

Creating trigrams in PHP:

  /**
   * Builds N-grams for given word
   *
   * @param string  $word  -  word
   * @param integer $number – a number of N-grams
   *
   * @return string
   */
  public static function nGram($word, $number = 3)
  {
      $ngrams = "";
      // Add underscore suffix to the word:

      $unders = str_repeat('_', $number - 1);
      $fullWord = $unders . $word . $unders;
      for ($i = 0; $i < mb_strlen($fullWord) - ($number - 1); $i++) {
          $ngrams .= mb_substr($fullWord, $i, $number, 'UTF-8') . " ";
      }
      return trim($ngrams);
  }
Next trigram list is inserted into Sphinx as a separate table (we’ll be looking for similar words within this table). A copy of the same table is stored in the Mongo database.
Method that creates the data array to be inserted into Sphinx:
   /**
   *
   * @param array $words - an array of words for which we build trigrams
   *
   * @return array - the resulting array containing information about words
 */
public function buildFrequencies(array $words)
  {
      $freq = array();
      foreach ($words as $word) {
          if (isset($freq[$word])) {
              $freq[$word]['frequency']++;
          } else {
              $wordLen = mb_strlen($word, 'UTF-8');
              $freq[$word] = array(
                'frequency' => 1,
                'word' => $word,
                'length' => $wordLen,
                'trigramm' => Str::nGram($word, 3)
              );
          }
      }
      return $freq;
  }
When user enters a search request trying to find similar words:
   /**
   * Get an array of similar words for the given word
   *
   * @param array $words - words that we should look up similar words for
   *
   * @return array
   */
  public function getSimilar(array $words)
  {
Obtaining the base form of the original word:
    $wordsBaseForms = array();
Saving the base forms of words to filter them later:
    foreach($words as $word){
          $baseForm = $this->morphy->getBaseForm($word);
          
          if ($baseForm) {
              $baseForm = current($baseForm);
              $wordsBaseForms[] = $baseForm;
          } else {
              $wordsBaseForms[] = $word;
          }
      }
Preparing the request to Sphinx for getting similar words by their trigrams:
      $qIds = array();
      $this->sphinx->multiInit();
      foreach ($words as $word) {
          $trirams = explode(" ", Str::nGram($word, 3));
          $wordLen = mb_strlen($word, 'UTF-8');
          $where = "length>=" . ($wordLen - 2) . " AND length<=" . ($wordLen + 2);
          $qIds[$word] = $this->sphinx->createCommand()
              ->select('*,weight() AS wght')
              ->from($this->index)
              ->match($trirams, 'trigramm', 'OR')
              ->where($where)
              ->order('wght DESC, frequency DESC')
              ->limit(25)
              ->options(array('ranker' => 'matchany'))
              ->multiAdd();
      }
      $results = $this->sphinx->multiExecute();
This variable will contain ids of similar words from the frequency table:
      $possibleCorrections = array();
Array, containing all the ids of all the words to look them up in a single request:
      $allIds = array();
      
      foreach ($qIds as $word => $index) {
          if(isset($results[$index])){
              $ids = Arr::getFieldFromArray(@$results[$index], 'id');
Preparing data types because MongoDB is data type sensitive:
              $ids = array_map('intval', $ids);
              $allIds = array_merge($ids, $allIds);
              $possibleCorrections[$word] = $ids;
          }
      }
Pull words directly from MongoDB:
      $data = $this->getByPks(array_filter($allIds));
In $possibleCorrections change word’s id to its meaning:
      foreach ($possibleCorrections as $word => $ids) {
          foreach ($ids as $key => $id) {
              if(isset($data[$id])){
                  $possibleCorrections[$word][$key] = $data[$id]['word'];
              }
          }
      }
      
      $corr = array();
      foreach($possibleCorrections as $wordToCorrect => $words){
          foreach ($words as $mCorrect) {
Calculating the Damerau–Levenshtein distance for two lines (to weed out non-similar words):
      $dl = Str::DLevenshtein($wordToCorrect, $mCorrect);
Disregarding those that have little similarity:
      if ($dl <= 3 && $dl != 0) {
Fixes should not contain forms of the same words (as we search by all the forms). To achieve this, we use the initial form of the word-correction as one of the array.
      $corrBaseForm = $this->morphy->getBaseForm($mCorrect);
                  if ($corrBaseForm) {
                      $corrBaseForm = current($corrBaseForm);
                      if(!in_array($corrBaseForm, $wordsBaseForms)){
                          $corr[$wordToCorrect][$corrBaseForm][$dl][] = $mCorrect;
                      }
                  } else if(!in_array($mCorrect, $wordsBaseForms)) {
                      $corr[$wordToCorrect][$mCorrect][$dl][] = $mCorrect;
                  }
              }
          }
      }
Going through the array and choosing the first available form of the same word (sorted by relevance). Ignoring the rest.
      $corrs = array();
      
      foreach ($corr as $wordToCorrect => $mCorrects) {
          foreach ($mCorrects as $baseForm => $words) {
              $correctWord = reset($words);
              $dl = key($words);
              $corrs[$wordToCorrect][$dl][] = current($correctWord);
          }
      }
      $corrections = array();
      
      foreach($corrs as $wordToCorrect => $mCorrects){
Sorting by the Damerau–Levenshtein distance increase:
      ksort($mCorrects, SORT_NUMERIC);
          if(count($mCorrects)){
              $corrections[$wordToCorrect] = array_slice(Arr::streightArray($mCorrects), 0, 25);
          } else {
              $corrections[$wordToCorrect] = false;
          }
      }
      return $corrections;
  }


Final thoughts

The relevance of the search engines’ development issues is undoubted for the longest time. Today, however, a number of additional requirements are presented to the search engines, such as possibility to build a query in a natural language, an information retrieval by the informally specified terms, the ability to create complex queries with a consistent refinement of their parameters. Expanding the search boundaries and object domains is just the kind of job algorithms for approximate string matching can score excellently at.



Other articles

Thank you for getting in touch!

We appreciate you contacting us and will try to get back to you within a few hours. Have a great day ahead!