Wednesday, April 10, 2013

String Matching - What's the Right Match

Is this the Right Match?

We are mere human beings and when we want to find the right match we have so many tools at our disposal and that too without even our knowledge. That’s God's power of matching algorithm, you know you are using it but you don't realize its complexity because you are so much overwhelmed by its simplicity.

Matching engines are not unknown any more. In computing world when we talk about match and merge we try to quantify and qualify information under groups. Information’s no longer remains information they are data and there are attributes/nature of those data that we try manipulate, compute and calculate. 
From 1918 till this date, ever since Soundex was patented there are series of algorithms that mathematicians and computing engineers have tried creating but the very base remained same. In order to compute the likeness of a word we need to compute it mainly in two forms 
1. Exact, which is very much like stripping the word into ASCI or binary and comparing them. 
2. Fuzzy/Phonetic/Similar, which is to create codes out of its pronunciation to match, which is indeed complex but very matured. 

Now lets take it to a level above, what happens when we match sentence to sentence? You cannot break every word in a sentence to compare. Obviously for group of words the algorithm differs but so far we have the knowledge of tools matching names, and matching addresses. What if we have to compute a mechanism to match phrases? Well! The complexity is rising in many fold and so is the awes of the fact that when we use God's match rules we can apply that to any thing and any form and most of the time without even thinking that we are doing it. We can apply it to names, to places, to people and their types, to smells to colors even to relations.
 Next time if you ever think, "What's the Right Match?' try thinking how are you deriving that it is the right match.. most of the time the answer will be it's that guy whom we all know as the inner self. 

All boils down to the simplest form of matching that we can programmatically obtain and is often known as String Matching.

String Matching involves two major type of algorithms:
1. String Matching Algorithms a.k.a String Searching Algorithms
In order to match a string you will have to identify the string. Pick it and try searching for that string in a given set of strings. 

Kind of String Searching:
A common string searching method is to select the candidate to match or search across. This will narrow down the list and also the expense to perform the search.

Another common example involves "normalization". For many purposes, a search for a phrase such as "to be" should succeed even in places where there is something else intervening between the "to" and the "be":
  • More than one space
  • Other "whitespace" characters such as tabs, non-breaking spaces, line-breaks, etc.
  • Less commonly, a hyphen or soft hyphen
  • In structured texts, tags or even arbitrarily large but "parenthetical" things such as footnotes, list-numbers or other markers, embedded images, and so on.
Many symbol systems include characters that are synonymous (at least for some purposes):
  • Latin-based alphabets distinguish lower-case from upper-case, but for many purposes string search is expected to ignore the distinction.
  • Many languages include ligatures, where one composite character is equivalent to two or more other characters.
  • Many writing systems involve diacritical marks such as accents or vowel points, which may vary in their usage, or be of varying importance in matching.
  • DNA sequences can involve non-coding segments which may be ignored for some purposes, or polymorphisms that lead to no change in the encoded proteins, which may not count as a true difference for some other purposes.
  • Some languages have rules where a different character or form of character must be used at the start, middle, or end of words.
Finally, for strings that represent natural language, aspects of the language itself become involved. For example, one might wish to find all occurrences of a "word" despite it having alternate spellings, prefixes or suffixes, etc.
Another more complex type of search is regular expression searching, where the user constructs a pattern of characters or other symbols, and any match to the pattern should fulfill the search. 

2. Approximate String Matching

If you want to read more about it please visit:
String searching algorithm
Approximate string matching

Happy Wednesday!

Kinshuk Dutta

New York

Scala & Spark for Managing & Analyzing Big Data (Using Machine Learning)

Managing & Analyzing Big Data using Apache Scala & Apache Spark In this blog we will see how to use Scala and Spark to analyze Big D...