Thursday, March 24, 2011

Fuzzy string search

Intro

Fuzzy search algorithms (also known as similarity search algorithms) form the basis of spell checkers and of full-fledged search engines like Google or Yandex. For example, these algorithms are used to provide the "Did you mean ..." feature.

In this article I'll discuss the following concepts, methods and algorithms:
  • Levenshtein distance
  • Damerau-Levenshtein distance
  • Bitap algorithm with modifications by Wu and Manber
  • Spell-checker method
  • N-gram method
  • Signature hashing method
  • BK-trees
I'll also perform a comparative test of the quality and efficiency of algorithms.

So...

Fuzzy search is a very useful feature of any search engine. However, its effective implementation is much more complicated than implementing a simple search for an exact match.

The problem of fuzzy string searching can be formulated as follows:
"Find in the text or dictionary of size n all the words that match the given word (or start with the given word), taking into account k possible differences (errors)."

For example, if the query is "machine" with two possible errors, find the words "marine", "lachine", "martine", and so on.

Fuzzy search algorithms are characterized by a metric - a function of the distance between two words that provides a measure of their similarity. A strict mathematical definition of a metric includes the requirement of the triangle inequality (X is the set of words, p is the metric):

Triangle inequality: p(x, y) ≤ p(x, z) + p(z, y) for all x, y, z ∈ X

Meanwhile, in most cases a "metric" is understood as a more general concept that does not have to meet the condition above; such a concept can also be called a distance.

Among the best-known metrics are the Hamming, Levenshtein and Damerau-Levenshtein distances. Note that the Hamming distance is a metric only on a set of words of equal length, which greatly limits its scope of application.

In practice, however, the Hamming distance is of little use, yielding to metrics that are more natural from the human point of view and which are discussed below.

Levenshtein distance

The Levenshtein distance, also known as "edit distance", is the most commonly used metric, and algorithms for computing it can be found at every turn.
Nevertheless, a few comments should be made about the most popular algorithm for calculating it, the Wagner-Fischer method.
The original version of this algorithm has time complexity O(mn) and consumes O(mn) memory, where m and n are the lengths of the compared strings. The whole process can be represented by the following matrix:

Levenshtein distance matrix

If you look at how the algorithm works, it is easy to see that at each step only the last two rows of the matrix are used; hence, memory consumption can be reduced to O(min(m, n)).

Levenshtein algorithm's work process

But that's not all - the algorithm can be optimized further if no more than k differences need to be found. In this case only a diagonal band of width 2k+1 of the matrix needs to be computed (the Ukkonen cut-off), which reduces the time complexity to O(k·min(m, n)).
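To make this concrete, here is a minimal sketch of the two-row computation in Java (the class and method names are mine, for illustration, and are not taken from the fuzzy-search-tools sources); the Ukkonen cut-off would additionally restrict the inner loop to the band of cells within k of the diagonal:

    public final class Levenshtein {

        // Levenshtein distance using only the last two rows of the matrix,
        // i.e. O(min(m, n)) memory instead of O(mn).
        public static int distance(String a, String b) {
            if (a.length() > b.length()) { String t = a; a = b; b = t; } // keep a as the shorter string
            int[] prev = new int[a.length() + 1];
            int[] curr = new int[a.length() + 1];
            for (int i = 0; i <= a.length(); i++) prev[i] = i;
            for (int j = 1; j <= b.length(); j++) {
                curr[0] = j;
                for (int i = 1; i <= a.length(); i++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    curr[i] = Math.min(Math.min(curr[i - 1] + 1,   // insertion
                                                prev[i] + 1),      // deletion
                                       prev[i - 1] + cost);        // substitution or match
                }
                int[] t = prev; prev = curr; curr = t;             // rotate the two rows
            }
            return prev[a.length()];
        }
    }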

Prefix distance
Often it is necessary to calculate the distance between a prefix pattern and a string - that is, to find the distance between the specified prefix and the nearest prefix of the string. In this case, you must take the smallest of the distances from the prefix pattern to all prefixes of the string. Obviously, such a prefix distance cannot be considered a metric in the strict mathematical sense, which limits its applicability.

Often, the specific value of the distance matters less than the fact that it exceeds a certain threshold.

Damerau-Levenshtein distance

This variation adds one more rule to the definition of the Levenshtein distance: a transposition of two adjacent letters is also counted as a single operation, along with insertions, deletions, and substitutions.
Frederick Damerau showed that most typing errors are transpositions, which is why this metric gives the best results in practice.

Damerau-Levenshtein algorithm's work process

To calculate this distance, it suffices to slightly modify the regular Levenshtein algorithm: keep not two but the last three rows, and add an appropriate extra condition that accounts for the cost of a transposition.
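A sketch of that modification (again with illustrative names; this implements the restricted "optimal string alignment" variant of the Damerau-Levenshtein distance, which is what the three-row scheme above describes):

    public final class DamerauLevenshtein {

        // Damerau-Levenshtein distance: the regular Levenshtein recurrence
        // plus a transposition case, which requires keeping three rows.
        public static int distance(String a, String b) {
            int m = a.length(), n = b.length();
            int[] prevPrev = new int[n + 1];
            int[] prev = new int[n + 1];
            int[] curr = new int[n + 1];
            for (int j = 0; j <= n; j++) prev[j] = j;
            for (int i = 1; i <= m; i++) {
                curr[0] = i;
                for (int j = 1; j <= n; j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                                prev[j] + 1),      // deletion
                                       prev[j - 1] + cost);        // substitution or match
                    // Transposition of two adjacent letters counts as a single operation.
                    if (i > 1 && j > 1
                            && a.charAt(i - 1) == b.charAt(j - 2)
                            && a.charAt(i - 2) == b.charAt(j - 1)) {
                        curr[j] = Math.min(curr[j], prevPrev[j - 2] + 1);
                    }
                }
                int[] t = prevPrev; prevPrev = prev; prev = curr; curr = t;  // rotate the three rows
            }
            return prev[n];
        }
    }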

In addition to the above, there are many other distances that are sometimes used in practice, such as the Jaro-Winkler metric, many of which are available in the SimMetrics and SecondString libraries.

Fuzzy search algorithms without indexing (Online)

These algorithms are designed to search in previously unknown text and can be used, for example, in text editors, document viewers or web browsers to search within a page. They do not require text preprocessing and can operate on a continuous stream of data.

Linear search

A simple word-by-word computation of the metric (e.g., the Levenshtein metric) for the words of the input text. When a limit on the metric is used, this method achieves optimal speed.
But at the same time, the larger k is, the longer the search takes. Asymptotic time complexity: O(kn).
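A minimal sketch of such a scan, assuming whitespace tokenization and reusing the distance routine sketched earlier (a cutoff version of the metric would abandon a word as soon as the distance is guaranteed to exceed k):

    import java.util.ArrayList;
    import java.util.List;

    public final class LinearSearch {
        // Scan the words of the text one by one and keep those within distance k of the query.
        public static List<String> search(String text, String query, int k) {
            List<String> result = new ArrayList<String>();
            for (String word : text.split("\\s+")) {
                if (Levenshtein.distance(word, query) <= k) {
                    result.add(word);
                }
            }
            return result;
        }
    }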

Bitap (also known as Shift-Or or Baeza-Yates-Gonnet, and its modifications by Wu and Manber)

The bitap algorithm and its various modifications are most often used for fuzzy search without indexing. A variation of it is used, for example, in the unix utility agrep, which works like the standard grep but supports errors in the search query and even provides a limited ability to use regular expressions.

The idea of this algorithm was first proposed by Ricardo Baeza-Yates and Gaston Gonnet; the corresponding article was published in 1992.
The original version of the algorithm deals only with letter substitutions and, in fact, computes the Hamming distance. Later, Sun Wu and Udi Manber suggested a modification of this algorithm for computing the Levenshtein distance, i.e., added support for insertions and deletions, and developed the first version of the agrep utility.

Bit-shift operation (exact matching, k = 0):

    R^0_{j+1} = ((R^0_j << 1) | 1) & s_{x_{j+1}}

Result value with k > 0 errors, combining the match, insertion, deletion and substitution terms:

    R^k_{j+1} = (((R^k_j << 1) | 1) & s_{x_{j+1}})    (match, bit shift)
              |   R^{k-1}_j                           (insertions)
              |  (R^{k-1}_{j+1} << 1)                 (deletions)
              |  (R^{k-1}_j << 1)                     (substitutions)

where k is the error count, j is the letter index, and s_x is the letter mask (the mask has 1-bits at the positions where this letter occurs in the query).
Whether the query matches is determined by the last bit of the result vector R.

The high speed of this algorithm is ensured by bit parallelism: the calculations can be performed on 32 or more bits in a single operation.
A trivial implementation therefore supports searching for words shorter than 32 symbols; this restriction comes from the width of the standard int type (on 32-bit architectures). Wider types can be used, but this may slow the algorithm down.

Although the asymptotic time of this algorithm, O(kn), equals that of the linear method, it is much faster when the query is long and the number of errors k is greater than 2.
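Below is a compact sketch of the Shift-And formulation with the Wu-Manber extension (my own illustrative code, not the fuzzy-search-tools version). It uses a 64-bit long as the bit vector, so it handles queries shorter than 64 symbols, and it reports the end positions of matches with at most k errors:

    import java.util.ArrayList;
    import java.util.List;

    public final class Bitap {
        // Returns the end positions in the text where the query matches with at most k errors.
        public static List<Integer> search(String text, String query, int k) {
            int m = query.length();
            if (m == 0 || m > 63) throw new IllegalArgumentException("query length must be 1..63");

            // Letter masks: bit i of masks[c] is set if query.charAt(i) == c.
            long[] masks = new long[Character.MAX_VALUE + 1];
            for (int i = 0; i < m; i++) {
                masks[query.charAt(i)] |= 1L << i;
            }

            long[] r = new long[k + 1];              // r[d]: prefixes matched with at most d errors
            for (int d = 1; d <= k; d++) {
                r[d] = (1L << d) - 1;                // d leading characters may be dropped "for free"
            }
            long matchBit = 1L << (m - 1);

            List<Integer> matches = new ArrayList<Integer>();
            for (int j = 0; j < text.length(); j++) {
                long charMask = masks[text.charAt(j)];
                long prevExact = r[0];                        // old value of r[d-1] for d = 1
                r[0] = ((r[0] << 1) | 1) & charMask;          // exact matching (bit shift)
                for (int d = 1; d <= k; d++) {
                    long old = r[d];
                    r[d] = (((r[d] << 1) | 1) & charMask)     // match
                         | prevExact                          // insertion
                         | (prevExact << 1)                   // substitution
                         | (r[d - 1] << 1)                    // deletion (uses the updated row)
                         | 1;
                    prevExact = old;
                }
                if ((r[k] & matchBit) != 0) {
                    matches.add(j);                           // a match ends at position j
                }
            }
            return matches;
        }
    }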

Testing

Testing was performed on a text of 3.2 million words, with an average word length of 10.

Exact search
Search time: 3562 ms

Linear search using Levenshtein metric
Search time with k = 2: 5728 ms
Search time with k = 5: 8385 ms

Bitap search with Wu-Manber modifications
Search time with k = 2: 5499 ms
Search time with k = 5: 5928 ms

It is obvious that simple iteration with the metric, in contrast to the bitap algorithm, depends heavily on the number of errors k.

At the same time, if we need to search repeatedly in a constant, large text, the search time can be greatly reduced by preprocessing the text (indexing).

Fuzzy search algorithms with indexing (Offline)

A common feature of all fuzzy search algorithms with indexing is that the index is built over a dictionary compiled from the original text or from a list of records in a database.

These algorithms use different approaches to solve the problem: some reduce it to exact search, while others use properties of the metric to construct various spatial structures, and so on.

In the first step, we build a dictionary from the original text containing the words and their positions in the text. We can also compute word and phrase frequencies to improve search results.

It is assumed that the index, as well as the dictionary, is loaded entirely into memory.

Dictionary specifications:
  • Source text — 8.2 Gb of Moshkow's library (lib.ru), 680 million words;
  • Dictionary size — 65 Mb;
  • Word count - 3.2 million;
  • Average word length — 9.5 letters;
  • Root-mean-square word length — 10.0 letters;
  • Russian alphabet (32 letters). Words that contain characters not in the alphabet are not included in the dictionary.
The dependence of the dictionary size on the text size is not strictly linear: up to a certain volume a base word set is formed, ranging from 15% at 500 thousand words to 5% at 5 million words; the dependence then becomes close to linear, slowly decreasing and reaching 0.5% at 680 million words. Subsequent growth comes from rare words.

Growth of dictionary size

Spell-checker method

As the name implies, this algorithm is often used in spelling correction systems (i.e., spell checkers), when the dictionary is small or when speed is not the main criterion.
It is based on reducing the problem of fuzzy search to the problem of exact search.

A set of "mistaken" words is built from original query. Then every word from this set is searched in the dictionary for exact match.

Spell-checker method

Its time complexity depends strongly on the number of errors k and the alphabet size |A|; for a binary search over the dictionary it is roughly

    O(m^k |A|^k log n)

where m is the length of the query word and n is the size of the dictionary, since the set of "mistaken" words grows approximately as (m |A|)^k and each of them requires one dictionary lookup.

For example, for an error count k = 1 and a word of length 7 over the English alphabet, the set of misspelled words contains about 370 words, so we need to make about 370 dictionary queries, which is quite acceptable.
But already at k = 2 the size of this set exceeds 75,000 words, which corresponds to a full scan of a small dictionary, so the time becomes quite large. We should also keep in mind that each of these words requires an exact-match lookup in the dictionary.
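A sketch of this method for k = 1 (the class names and the English alphabet constant are assumptions for illustration; for k = 2 the same expansion is applied once more to every generated word, which is exactly what makes the set explode):

    import java.util.HashSet;
    import java.util.Set;

    public final class SpellCheckerSearch {
        // Illustrative alphabet; the dictionary in this article uses the 32-letter Russian alphabet.
        private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

        // Generate every word at distance 1 from the query and keep those found in the dictionary.
        public static Set<String> search(String query, Set<String> dictionary) {
            Set<String> found = new HashSet<String>();
            if (dictionary.contains(query)) found.add(query);
            for (String candidate : edits1(query)) {
                if (dictionary.contains(candidate)) found.add(candidate);
            }
            return found;
        }

        // All "mistaken" variants of the word with exactly one error.
        static Set<String> edits1(String w) {
            Set<String> result = new HashSet<String>();
            for (int i = 0; i < w.length(); i++) {
                result.add(w.substring(0, i) + w.substring(i + 1));                 // deletion
                for (char c : ALPHABET.toCharArray()) {
                    result.add(w.substring(0, i) + c + w.substring(i + 1));         // substitution
                }
                if (i + 1 < w.length()) {                                           // transposition
                    result.add(w.substring(0, i) + w.charAt(i + 1) + w.charAt(i) + w.substring(i + 2));
                }
            }
            for (int i = 0; i <= w.length(); i++) {
                for (char c : ALPHABET.toCharArray()) {
                    result.add(w.substring(0, i) + c + w.substring(i));             // insertion
                }
            }
            return result;
        }
    }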

Features:
The algorithm can easily be modified to generate "mistaken" words using arbitrary rules and, moreover, it does not require any dictionary preprocessing or additional memory.

Possible improvements:
Instead of generating the whole set of "mistaken" words, we can generate only those that are most likely to occur in real situations, such as words with common spelling mistakes or typos.

N-gram method

This method was invented long ago and is the most widely used, because its implementation is relatively simple and it provides reasonably good performance. The algorithm is based on the following principle:

"If the word A matches with the word B considering several errors, then they will most likely have at least one common substring of length N".

These substrings of length N are called "N-grams".
At the indexing step, each word is partitioned into N-grams and added to the lists corresponding to each of these N-grams. At the search step, the query is also partitioned into N-grams, and for each of them the corresponding lists are scanned using the metric.

N-gram method

The most frequently used in practice are trigrams - substrings of length 3. Choosing a larger value of N leads to a limit on the minimum word length at which error detection is still possible.
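A minimal sketch of a trigram index along these lines (illustrative names; the distance routine is the Damerau-Levenshtein sketch from earlier):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public final class TrigramIndex {
        private final Map<String, List<String>> lists = new HashMap<String, List<String>>();

        // Indexing: add the word to the list of every trigram it contains.
        public void add(String word) {
            for (String gram : grams(word)) {
                List<String> list = lists.get(gram);
                if (list == null) {
                    list = new ArrayList<String>();
                    lists.put(gram, list);
                }
                list.add(word);
            }
        }

        // Search: scan only the lists of the query's trigrams, filtering with the metric.
        public Set<String> search(String query, int k) {
            Set<String> result = new HashSet<String>();
            for (String gram : grams(query)) {
                List<String> list = lists.get(gram);
                if (list == null) continue;
                for (String word : list) {
                    if (DamerauLevenshtein.distance(word, query) <= k) {
                        result.add(word);
                    }
                }
            }
            return result;
        }

        private static List<String> grams(String word) {
            List<String> grams = new ArrayList<String>();
            for (int i = 0; i + 3 <= word.length(); i++) {
                grams.add(word.substring(i, i + 3));
            }
            return grams;
        }
    }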

Features:
The N-gram algorithm does not find all possible spelling errors. Take, for instance, the word VOTKA (which must be corrected to VODKA, of course) and split it into trigrams: VOTKA > VOT OTK TKA - all of these trigrams contain the error T. Thus, the word "VODKA" will not be found, because it does not contain any of these trigrams and therefore does not appear in their lists. The shorter the word and the more errors it contains, the higher the chance that it will be missing from the lists corresponding to the query's N-grams and will not appear in the result set.

Meanwhile, the N-gram method leaves ample scope for using custom metrics with arbitrary properties and complexity, but it still requires a brute-force scan of about 15% of the dictionary.

We can separate the N-gram hash tables by the position of the N-gram in the word (first modification, M1). Since the lengths of a word and of the query cannot differ by more than k, the positions of an N-gram in them cannot differ by more than k either. Thus, we only need to check the table that corresponds to the N-gram's position in the word, plus k tables to the left and k to the right, 2k+1 neighboring tables in total.

First modification of N-gram method

We can reduce the size of the scanned set even a bit further by also separating the tables by word length and, similarly, scanning only the 2k+1 neighboring tables (second modification, M2).

Signature hashing

This algorithm is described in L. M. Boytsov's article "Signature hashing". It is based on a fairly obvious representation of the "structure" of a word as a bit word, used as a hash (signature) in a hash table.

During indexing, such hashes are computed for each word, and the word is added to the corresponding table row. Then, during the search, the hash is computed for the query and the set of adjacent hashes that differ from the query's hash by no more than k bits is generated. For each of these hashes, the corresponding list of words is scanned using the metric.

The hash is computed as follows: each bit of the hash is matched with a group of characters from the alphabet. Bit 1 at position i in the hash means that the word contains a character from the i-th group of the alphabet. The order of the letters in the word does not matter at all.
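A sketch of the signature computation. The grouping scheme here (16 hash bits, letters assigned to groups by their position in the alphabet) is an assumption for illustration; the actual split of the alphabet into groups is a tunable parameter of the method:

    public final class SignatureHash {
        // Illustrative signature width; the alphabet grouping is a parameter of the method.
        private static final int HASH_BITS = 16;

        // Bit i of the signature is set when the word contains at least one letter from group i.
        public static int signature(String word, String alphabet) {
            int groupSize = (alphabet.length() + HASH_BITS - 1) / HASH_BITS;
            int hash = 0;
            for (int i = 0; i < word.length(); i++) {
                int pos = alphabet.indexOf(word.charAt(i));
                if (pos >= 0) {
                    hash |= 1 << (pos / groupSize);   // the letter's group contributes its bit
                }
            }
            return hash;
        }
    }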

Signature hashing

Deleting a single character either does not change the hash value (if the word still contains characters from the same alphabet group) or sets the bit of the corresponding group to 0. Similarly, an insertion either sets a bit to 1 or leaves the hash unchanged. Substituting a single character is a bit more complicated: the hash can either remain unchanged or change in 1 or 2 bits. Transpositions do not change the hash at all, because the order of symbols does not matter for hash construction, as noted earlier. Thus, to fully cover k errors we need to change at least 2k bits in the hash.
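The set of adjacent hashes can be generated, for example, by recursively flipping up to a given number of bits of the query's signature. A small sketch, assuming the 16-bit signature from the previous snippet:

    import java.util.ArrayList;
    import java.util.List;

    public final class AdjacentHashes {
        // All hashes that differ from the given one in at most maxBits of its 16 bits.
        public static List<Integer> neighbours(int hash, int maxBits) {
            List<Integer> result = new ArrayList<Integer>();
            collect(hash, maxBits, 0, result);
            return result;
        }

        private static void collect(int hash, int bitsLeft, int fromBit, List<Integer> result) {
            result.add(hash);
            if (bitsLeft == 0) return;
            for (int bit = fromBit; bit < 16; bit++) {
                collect(hash ^ (1 << bit), bitsLeft - 1, bit + 1, result);   // flip one more bit
            }
        }
    }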

List of hashes with one error

The average time complexity with k "incomplete" errors (insertions, deletions and transpositions, as well as part of the substitutions) is:
Signature hashing time complexity

Features:
Because the substitution of one character can change two bits at once, an algorithm that, for example, enumerates hashes differing in no more than 2 bits will in reality not return the full set of results: it misses a significant share (depending on the ratio of the hash size to the alphabet size) of the words with two substitutions (and the wider the hash, the more often two bits change at once and the more incomplete the result set becomes). In addition, this algorithm does not allow prefix search.

BK-trees

Burkhard-Keller trees are metric trees; the algorithms for constructing such trees are based on the ability of the metric to satisfy the triangle inequality:

Triangle inequality: p(x, y) ≤ p(x, z) + p(z, y) for all x, y, z ∈ X

This property allows metrics to form metric spaces of arbitrary dimension. Such metric spaces are not necessarily Euclidean; for example, the Levenshtein and Damerau-Levenshtein metrics form non-Euclidean spaces. Based on these properties, we can build a data structure for searching in a metric space - the Burkhard-Keller tree.
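A sketch of such a tree over the Damerau-Levenshtein metric (illustrative code; every child is stored under its distance to the parent, and the triangle inequality lets the search descend only into children whose edge label lies within k of the distance to the current node):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public final class BkTree {
        private final String word;                  // the word stored in this node
        private final Map<Integer, BkTree> children = new HashMap<Integer, BkTree>();

        public BkTree(String word) { this.word = word; }

        public void add(String newWord) {
            int d = DamerauLevenshtein.distance(newWord, word);
            if (d == 0) return;                     // the word is already in the tree
            BkTree child = children.get(d);
            if (child == null) {
                children.put(d, new BkTree(newWord));
            } else {
                child.add(newWord);
            }
        }

        // Collects all words of the tree within distance k of the query.
        public void search(String query, int k, List<String> result) {
            int d = DamerauLevenshtein.distance(query, word);
            if (d <= k) result.add(word);
            for (Map.Entry<Integer, BkTree> e : children.entrySet()) {
                if (e.getKey() >= d - k && e.getKey() <= d + k) {   // triangle inequality pruning
                    e.getValue().search(query, k, result);
                }
            }
        }
    }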

BK-tree

Improvements:
We can use the ability of some metrics to compute a bounded distance, setting an upper limit equal to the sum of the maximum distance to the node's descendants and the resulting distance, which speeds up the process a bit:

BK-trees metric limitations

Testing

Testing was performed on a laptop with Intel Core Duo T2500 (2GHz/667MHz FSB/2MB), 2Gb RAM, OS — Ubuntu 10.10 Desktop i686, JRE — OpenJDK 6 Update 20.

Time comparison

Testing was performed using the Damerau-Levenshtein distance with error count k = 2. The index size is specified including the dictionary size (65 Mb).

Spell-checker method
Index size: 65 Mb
Search time: 320 ms / 330 ms
Result recall: 100%

N-gram (original)
Index size: 170 Mb
Index creation time: 32 s
Search time: 71 ms / 110 ms
Result recall: 65%

N-gram (first modification)
Index size: 170 Mb
Index creation time: 32 s
Search time: 39 ms / 46 ms
Result recall: 63%

N-gram (second modification)
Index size: 170 Mb
Index creation time: 32 s
Search time: 37 ms / 45 ms
Result recall: 62%

Signature hashing
Index size: 85 Mb
Index creation time: 0.6 s
Search time: 55 ms
Result recall: 56.5%

BK-trees
Index size: 150 Mb
Index creation time: 120 s
Search time: 540 ms
Result recall: 63%

Conclusion

Most fuzzy search algorithms with indexing are not truly sublinear (i.e., with an asymptotic time of O(log n) or better), and their performance usually depends on the dictionary size. Nevertheless, numerous enhancements and improvements make it possible to achieve sufficiently short search times even with very large dictionaries.

There is also a number of other, mostly inefficient methods based on adapting techniques from other subject areas to the problem at hand. Among them is the adaptation of the prefix tree (trie) to fuzzy search problems, which I left aside due to its low efficiency. But there are also algorithms based on original approaches, for example the Maass-Novak algorithm: it has sublinear asymptotic time, but it is highly inefficient because of the huge constants hidden behind the asymptotic estimate, which lead to a huge index.

The use of fuzzy search algorithms in real search engines is closely tied to phonetic algorithms, lexical stemming algorithms that extract the base part from different forms of the same word (for example, such functionality is provided by Snowball), statistics-based ranking, and the use of complex, sophisticated metrics.

This link http://code.google.com/p/fuzzy-search-tools takes you to my Java implementation of the following stuff:
  • Levenshtein Distance (with cutoff and prefix version);
  • Damerau-Levenshtein Distance (with cutoff and prefix version);
  • Bitap (Shift-Or with Wu-Manber modifications);
  • Spell-checker Method;
  • N-Gram Method (with some modifications);
  • Signature Hashing Method;
  • Burkhard-Keller (BK) Trees.
I tried to make this code easy to understand, but effective enough for practical application. So enjoy.

It is worth noting that in the course of researching this subject I've done some work of my own that makes it possible to reduce the search time significantly at the cost of a moderate increase in index size and some restrictions on the freedom of the metrics used. But that's another cool story.

Links:

  1. Java source codes for this article. http://code.google.com/p/fuzzy-search-tools
  2. Levenshtein distance. http://en.wikipedia.org/wiki/Levenshtein_distance
  3. Damerau-Levenshtein distance. http://en.wikipedia.org/wiki/Damerau–Levenshtein_distance
  4. Shift-Or description with Wu-Manber modifications, in German. http://de.wikipedia.org/wiki/Baeza-Yates-Gonnet-Algorithmus
  5. N-gram method. http://www.cs.helsinki.fi/u/ukkonen/TCS92.pdf
  6. Signature hashing. http://www.springerlink.com/content/1aahunb7n41j32ne/
  7. Signature hashing in Russian. http://itman.narod.ru/articles/rtf/confart.zip
  8. Information retrieval and fuzzy string searching. http://itman.narod.ru/english/index.htm
  9. Shift-Or and some other algorithms implementation. http://johannburkard.de/software/stringsearch/
  10. Fast Text Searching with Agrep (Wu & Manber). http://www.at.php.net/utils/admin-tools/agrep/agrep.ps.1
  11. Damn Cool Algorithms - Levenshtein automata, BK-tree, and some others. http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
  12. BK-tree Java implementation. http://code.google.com/p/java-bk-tree/
  13. Maass-Novak algorithm. http://yury.name/internet/09ia-seminar.ppt
  14. SimMetrics metric library. http://staffwww.dcs.shef.ac.uk/people/S.Chapman/simmetrics.html
  15. SecondString metric library. http://sourceforge.net/projects/secondstring/

43 comments:

  3. Hrm, is there any reason you decided not to test suffix arrays/trees? It seems like they are one of the most popular approaches to fuzzy search.

    Another fairly new technique is called canopy clustering. It's beholden to the triangle inequality, but it's a great initial blocking method.

  4. For n-gram-based techniques, you may want to check an open-source C++ package at http://flamingo.ics.uci.edu/

    A trie-based solution has been shown to be very efficient, especially for instant search. Check http://ipubmed.ics.uci.edu/

    Chen Li

  5. Richard Minerich said:
    > Hrm, is there any reason you decided not to
    > test suffix arrays/trees? It seems like they
    > are one of the most popular approaches to fuzzy
    > search.

    But suffix arrays only find matches at the beginning of the sentence. If there is an error at char two then how will you get a good fuzzy match?

    Richard Minerich said:
    > Another fairly new technique is called canopy
    > clustering. It's beholden to the triangle
    > inequality, but it's a great initial blocking
    > method.

    I skimmed this paper and I fail to see the connection to fuzzy string matching.

  6. Thank you ! A concise and effective explanation. It would be great if some use case scenarios are included.

  7. Richard Minerich:

    In fact, I tried to use suffix array and I wrote the algorithm (and implemented it in Java) based on the following steps:
    1. Split the word into k + 1 parts skipping one letter between them: "crocodile" (k = 2) => "cro_cod_le"
    2. Search for each part in suffix array "cro" => "crocodile"[13421], "macro"[412345] (word index in the dictionary shown in square brackets)
    3. Intersect found word indexes to obtain the result (and filter them using levenshtein to remove rare mismatches)

    It works really fast but requires a lot of space (about 1G for 3.2M words)

    The problem is that we need to store an inverse mapping from a word's dictionary index to its suffix array index.

    If you want to see this implementation, I can place it somewhere.

    I'll try to read about "canopy clustering" soon.

  8. Very interesting and detailed exposition of current techniques!

    Any reason why you left out soundex? (I know it is very language specific, but it's really effective and can result in really good performance when the process of finding a soundex match is converted to one of an exact prefix match using Suffix Arrays, as Richard Minerich mentioned).

    I've done some work on this here: http://dhruvbird.com/autocomplete.pdf

  9. Your previous post is on soundex, so it seems like a natural incorporation of that technique to the one of inexact string match (albeit not a general one). But, in practice, I would think many use cases would ask for inexact phonetic string matching.

  10. Chen Li:
    The FLAMINGO project looks very interesting.
    Also I've implemented a lot of fuzzy search things for The Xapian Project (it is in C++ too) - some search algorithms, phonetic matching, context-sensitive spelling correction (including split and run-together words), transliteration, language detection.

  12. Richard,

    Suffix trees can be used for approximate string searching in a way similar to n-gram matching. This approach is called pattern partitioning.

    I have never used suffix trees or arrays to this end, because of the following:
    1) Uncompressed suffix trees/arrays are not compact.
    2) There are sophisticated compact versions (check the works of Navarro, Mäkinen, and Ferragina), but there is certainly a tradeoff in speed/index size involved. Not sure if those can beat an optimized n-gram index. I would love, however, to see these methods tested :-)
    3) I overall do not like pattern partitioning methods, because the average search time grows linearly with the dictionary size. In many cases, this *CAN* be important.

    Suffix trees/arrays are more pertinent in molecular biology, but for dictionary searching, they seem to be overkill. Here, you can do a much better job with some of the super-linear-index methods, or you can use a pair of regular tries (the second over reversed strings) to get excellent performance.

    BTW, I recently tested Chen Li's Flamingo n-gram implementation. It is very efficient (in fact 1.5-2 times faster than my own n-gram-based code) and it also supports pruning, in case you don't care about 100% recall.

  13. The advantage of a suffix tree or trie approach is that you don't have to calculate string distance for your entire dictionary, since this can become expensive.

    As for growing linearly with dictionary size, while true, this is easily defeated through sharding which scales in similar fashion.

  14. Chris, no doubt about a trie-like structure, I doubt that ST is the right trie-like structure. Unless u need to search for arbitrary substrings.

  15. Awhile back I ran across http://norvig.com/spell-correct.html, which covers writing a spell checker in python. It's an interesting read if anyone is interested.

  16. Itman, good call. I have the bad habit of conflating suffix-trees and tries. Indeed, I did mean tries.

  18. Joseph Turian: Suffix based approach can actually find any common substring down to a length you specify.

  19. Canopy Cluster is more of a pre-processing step to speed things up. You cluster your target corpus first and so can avoid the O(N^2) problem later on by only comparing within one cluster.

    I myself haven't had much luck with straight n-gram techniques; it's just too slow. My primary corpus is a few million names though, so I can see how I would need different sets of techniques than someone say, indexing academic papers.


    Nikita Smetanin:
    That is a big downside of the suffix-based approaches for sure. It's all about the time-space tradeoff.

    Itman:
    I'll look into Pattern Partitioning. Thanks for the tip!

  20. you should have a look at the Levenshtein Automata

    http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

    its the best solution i have found so far

  22. Do you happen to have the dictionary used for your tests available somewhere?

    I've implemented a solution for this exact problem using tries. And since you discarded it because you say it's not efficient enough, I'd like to compare it to your results :)

    Thanks

  23. I've written a blog article about fuzzy search with a Trie.

    http://blog.vjeux.com/2011/c/c-fuzzy-search-with-trie.html

    As for performance, I search every word at a distance of 2 from the 400 most popular English words. The dictionary contains 3 million words. It takes 22 seconds on my old 2GHz server to run, which is an average of 55 ms per fuzzy search, generating an average of 247 results.

    The inputs and times look similar to the ones you obtain in your benchmarks. When I will have some time, I will update my code in order to take your inputs, this way you'll be able to compare on your test computer.

  24. Hi everybody,

    Never mind my previous comment on Nick's blog. I missed a point and, apparently, Nick's solution makes sense.

  25. Hi, for developers who are interested in a new algorithm that combines token based, character based, and phonetic based algorithms into one fast and accurate name matching technology visit http://www.fame-api.com

  26. Many thanks Nikita, this was exactly the concise and practical overview I was looking for.

  27. Very nice! I've to deal with the search engine in company's web-based product and we're using htdig for this. Your article was very helpful in understanding fuzzyalgos, I'll implement these in our product :)

  28. Good day!

    I was wondering if you could help me in my report. It is about the use of "FUZZY LOGIC" in searching the database. I've searched the net for the basic concepts of fuzzy logic and I always get these three steps: fuzzification, inference engine and
    defuzzification.

    Now, my problem is this: how exactly can I apply these three steps with the algorithms that you've discussed here? I think they will fit the fuzzification stage but how about the other two?

    Thanks in advance..!

    1. Sorry, but this article is about fuzzy search, which has nothing in common with fuzzy logic. I hope you'll find answers somewhere else.

  29. Liked your post very much... I have been considering a consume-order-agnostic approach to approximate match, as an extension to exact match (presented here: http://arxiv.org/pdf/1209.4554v1.pdf). I think there might be an advantage here - would you agree?

  30. I like this post. However I would disagree that the Trie based methods are slow. Here is example of Fuzzy search based on Trie, it works pretty fast:
    http://www.softcorporation.com/products/people/index.html

    1. You're right, I've investigated the Levenshtein automata algorithm (which works on DAWG - a general case of a trie) and it seems to be fast enough with a moderate index (DAWG) size.

    2. For large string and larger errors, though, different methods are required. Because trie has a cost O(length^k alphabet_size^k). This works for names because, both length and k are small.

  31. I downloaded the dictionary, put it in our MatchMaker tool, and ran queries against it in around 10 ms. Index-size is around 110 MB and it calculates the full Damerau-Levenshtein distance with unbounded error-correction.
    The same index can even do a block-edit calculation (blocks of strings are shifted for cost 1!) within 20 ms.
    Optionally you can couple an approximate phonetics to it or do all kinds of extra comparison.

    So THERE IS A LEVENSHTEIN INDEX type.

    1. Yes, I know about Levenshtein automata, I implemented and tested it some time ago, but unfortunately I'd no enough time to write an article about it and some other very fast algorithms. Maybe I'll publish my Levenshtein automata java source code if it still exists.

      However, I think that Levenshtein automata may have performance issues if you store them in external memory (because of its structure), so this case should be analyzed separately.

  32. I did further tests with another Damerau-Levenshtein index type based on automata. It doesn't preprocess any potential error, but does real-time computation of all edit choices:

    up to 1 error: below 1 ms
    up to 2 errors:around 5 ms
    up to 3 errors:around 15 ms

    If you don't believe it, I can give more info!

  33. One can do better than 1 ms easily.

  34. Anybody know or can guess what algorithm QSR NVivo 10 use to find the exact or fuzzy text matching? The software works like this: store certain documents containing whatever text and in any language; then you can start searching for a certain word or multiple words. It does effectively catch the related words with its synonyms and it builds Word Tree Map for it. How does it do that? I mean what type of word distance calculations do they use? do they have a dictionary or do they use Wordnet or what?
