What is your question? Please explain it in more specific terms; I am confused by it.
For example, “a_logfile.txt” and “logfile_a.txt” should be rated as very similar, as should “loga_file.txt” and “logfile.text”, but not “myText.txt” and “logfile.txt”.
If it solved your problem, please click “Mark As Answer” on that post and “Mark as Helpful”. Happy Programming!
OK, I'll try it again 🙂
Actually, I want to compare filenames and get a percentage value for how similar they are. I don't know if this is possible at all.
For example, the filenames “a_filename.txt” and “filename_a.txt” are very similar to us, but how can I get the same result programmatically?
Another example: the filenames “file_abc_.txt” and “fil_abc_e.txt” are also similar, but again, how can I get that value programmatically?
That is probably more difficult than it seems at first.
Have a look at http://en.wikipedia.org/wiki/String_metrics and follow some of the links.
Regards, David R. “Every program eventually becomes rococo, then rubble.” – Alan Perlis. The only valid measurement of code quality: WTFs/minute.
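As a starting point for the filename question above, here is a minimal sketch that is not from the thread. It is written in Python rather than C# for brevity and uses the standard library's `difflib.SequenceMatcher`, whose `ratio()` returns 2*M/T (M = matched characters, T = total characters in both strings). The helper name `similarity_percent` is an assumption made for illustration, and this is only one of many possible string metrics.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 score based on matching character blocks."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


# Filename pairs taken from the posts above.
pairs = [
    ("a_filename.txt", "filename_a.txt"),
    ("file_abc_.txt", "fil_abc_e.txt"),
    ("myText.txt", "logfile.txt"),
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity_percent(left, right):.0f}%")
```

A character-block metric like this rewards long shared runs such as “filename”, so it tends to score the first two pairs well above the third, but it has no notion of words or word order.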
Welcome to MSDN Forum.
This article demonstrates a good solution for: How to Compute the similarity between two words/strings. The algorithm was written in C# and you can download the demo from the article.
The string similarity algorithm was developed to satisfy the following requirements:
- A true reflection of lexical similarity: strings with small differences should be recognized as being similar. In particular, a significant sub-string overlap should point to a high level of similarity between the strings.
- A robustness to changes of word order: two strings which contain the same words, but in a different order, should be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognized as dissimilar.
- Language independence: the algorithm should work not only in English, but also in many different languages.
The similarity is calculated in three steps:
- Partition each string into a list of tokens.
- Compute the similarity between tokens by using a string edit-distance algorithm (extension feature: semantic similarity measurement using the WordNet library).
- Compute the similarity between two token lists.
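The three steps can be sketched roughly as follows. This is an illustration in Python, not the article's C# code: the tokenizer regex, the normalisation of the edit distance, and especially the way token lists are combined (averaging each token's best match in both directions) are assumptions, and the WordNet extension is left out.

```python
import re


def tokenize(text: str) -> list:
    """Step 1: split on non-alphanumeric characters and lower-case."""
    return [t for t in re.split(r"[\W_]+", text.lower()) if t]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def token_similarity(a: str, b: str) -> float:
    """Step 2: turn the edit distance into a score in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def list_similarity(left: list, right: list) -> float:
    """Step 3 (assumed scheme): average every token's best match, both ways."""
    if not left or not right:
        return 1.0 if left == right else 0.0
    best_left = sum(max(token_similarity(a, b) for b in right) for a in left)
    best_right = sum(max(token_similarity(a, b) for a in left) for b in right)
    return (best_left + best_right) / (len(left) + len(right))


def string_similarity(s1: str, s2: str) -> float:
    return list_similarity(tokenize(s1), tokenize(s2))


print(string_similarity("a_logfile.txt", "logfile_a.txt"))  # same tokens, new order
print(string_similarity("listen", "silent"))                # anagram of one token
```

Because tokens are matched regardless of position, reordered words keep a high score, while an anagram inside a single token is penalised by the edit distance.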
Here is another discussion for your reference.
A better similarity ranking algorithm for variable length strings
Thanks all for your help and suggestions.
Martin Xie [MSFT], MSDN Community Support. Please remember to mark the replies as answers if they help and unmark them if they provide no help.
- Marked as answer by Martin_Xie Monday, September 26, 2011 8:48 AM
I have written some code for my project to find approximately similar names in a database.
At first I used the DIFFERENCE(string1, string2)>=4 function of SQL Server, but it didn't help me because, for example, when the first name was “21” and the second name was “21 jump street”, the result included both names even though they were clearly not similar. As a result, such a query returned more than 700 values, which was very poor in this case.
Then I found a similar DIFFERENCE function for C# that was almost the same as the SQL version of that function. For example, it rated the similarity of “asdcdfsdfgdsgdg” and “asdewwetqwetrwe” as Perfect, which is obviously incorrect.
So I developed a class for this problem to get a more accurate measure of similarity between strings.
The name of the class is StringCompare, and here is an introduction to this class:
WHAT IS STRING COMPARE?
StringCompare is a comparison tool for strings. It does not do an ordinal comparison, but a relative comparison that determines how similar or how different two strings are.
By setting good tradeoff values you can get a good comparison for strings.
HOW TO USE:
First you should create an instance of StringCompare with your own tradeoff values or with the default tradeoff values.
There are 4 values that can be set:
The minimum acceptable percentage of similarity between two strings compared with StringCompare. This value is used for strings with a length of at least 8.
The minimum acceptable percentage of similarity between two strings compared with StringCompare. This value is used for strings with a length below 8.
The maximum acceptable percentage of tolerance between two strings compared with StringCompare. This value is used for strings with a length of at least 8.
The maximum acceptable percentage of tolerance between two strings compared with StringCompare. This value is used for strings with a length below 8.
* After you have created an instance you can call InstanceName.IsEqual(string1, string2) to determine the equality of two strings.
* Note that this equality is relative to the minSimilarity and maxTolerance values you set before.
* Consider that higher minSimilarity values lead to more restricted results and vice versa.
* Consider that lower maxTolerance values lead to more restricted results and vice versa.
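To make the idea of length-dependent thresholds concrete, here is a small hypothetical sketch in Python. It is not the StringCompare class itself: the class name `RelativeComparer`, the default values, the choice of similarity metric (`difflib.SequenceMatcher`), and the use of the longer string's length to pick a threshold are all assumptions. The tolerance settings are omitted because the post does not define how tolerance is measured.

```python
from difflib import SequenceMatcher


class RelativeComparer:
    """Hypothetical stand-in for the class described above (similarity only)."""

    def __init__(self, min_similarity_long=80.0, min_similarity_short=70.0,
                 length_cutoff=8):
        # Assumed defaults; the post does not state the real ones.
        self.min_similarity_long = min_similarity_long
        self.min_similarity_short = min_similarity_short
        self.length_cutoff = length_cutoff

    def similarity_percent(self, a: str, b: str) -> float:
        # Assumed metric: share of matching characters, scaled to 0-100.
        return 100.0 * SequenceMatcher(None, a, b).ratio()

    def is_equal(self, a: str, b: str) -> bool:
        # Pick the threshold from the longer string's length (an assumption).
        if max(len(a), len(b)) >= self.length_cutoff:
            threshold = self.min_similarity_long
        else:
            threshold = self.min_similarity_short
        return self.similarity_percent(a, b) >= threshold


loose = RelativeComparer(min_similarity_long=50.0, min_similarity_short=50.0)
strict = RelativeComparer(min_similarity_long=95.0, min_similarity_short=95.0)
print(loose.is_equal("a_filename.txt", "filename_a.txt"))
print(strict.is_equal("a_filename.txt", "filename_a.txt"))
```

Raising the minimum similarity makes fewer pairs count as equal, which matches the note above about more restricted results.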