How to Implement Fuzzy String Matching in Python

Fuzzy string matching is the process of comparing two strings to determine how similar they are. Unlike exact string matching, fuzzy string matching tolerates differences such as spelling mistakes, typos, and variations in word order. This makes it a valuable tool in spell-checking, data deduplication, and natural language processing applications.

Fuzzy string matching algorithms measure the similarity between two strings with methods such as Levenshtein distance, Jaccard similarity, and cosine similarity. Each method turns the differences between the strings into a number, which makes it possible to rank candidates and decide whether two strings are close enough to count as a match.
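To make two of these measures concrete, here is a minimal pure-Python sketch of Levenshtein distance and a word-level Jaccard similarity. The function names are illustrative only, and splitting on whitespace for the Jaccard tokens is an assumption; neither function is taken from a library.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


def jaccard_similarity(a: str, b: str) -> float:
    """Shared distinct words divided by all distinct words (assumes whitespace tokens)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(levenshtein_distance("kitten", "sitting"))      # 3
print(jaccard_similarity("apple pie", "apple tart"))  # 1 shared word out of 3 distinct words
```

Levenshtein distance works on characters, so it suits typos, while the Jaccard measure works on whole words, so it suits strings that share vocabulary.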

Fuzzy string matching is helpful in many sectors where massive data must be processed rapidly and reliably. Fuzzy string matching, for instance, is used in e-commerce to match products with similar names or descriptions, making it simpler for buyers to locate what they’re looking for.

In data science, fuzzy string matching is used to find duplicate entries in massive datasets, which simplifies cleaning and analyzing the data. In natural language processing, it helps identify related phrases or concepts, which makes it easier to extract meaning from text data.

Installing and Importing the Fuzzywuzzy Library

  1. Open a terminal or command prompt.
  2. Install the fuzzywuzzy library with pip: pip install fuzzywuzzy
  3. Once the installation has completed, import the fuzz module in your Python script: from fuzzywuzzy import fuzz
  4. The fuzz module contains the functions that score the similarity of two strings.
  5. If you want to match a string against a list of candidates, also import the process module: from fuzzywuzzy import process

The process module has built-in methods for matching a string against a list of strings: process.extract() returns the closest matches together with their scores, and process.extractOne() returns the single most similar string.

Comparing Two Strings with Fuzzywuzzy Methods

In this section, we will discuss the four most commonly used methods from fuzzywuzzy to compare two strings.

Comparing Two Strings: fuzz.ratio()

The fuzz.ratio() method compares two complete strings character by character, using a Levenshtein-style measure of how many edits separate them, and returns a similarity score between 0 and 100. The more similar the strings are, the higher the score will be.

from fuzzywuzzy import fuzz

string1 = "apple" #1st string to compare
string2 = "aple" #2nd string to compare
similarity_score = fuzz.ratio(string1, string2)
print(similarity_score)

In the code snippet above, fuzz.ratio() returns a similarity score of 89, indicating that the two strings are quite similar to each other.
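To see where a score like this comes from, the sketch below reproduces the calculation with difflib from the standard library. It is an approximation of the idea rather than fuzzywuzzy's own code, and simple_ratio is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Score from 0 to 100 based on 2 * M / T.

    M is the number of matched characters and T is the total number of
    characters in both strings.
    """
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# "apple" and "aple" share 4 characters out of 9 in total: 2 * 4 / 9 = 0.889
print(simple_ratio("apple", "aple"))  # 89
```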

Comparing two strings based on the shortest string: fuzz.partial_ratio()

fuzz.partial_ratio() is similar to fuzz.ratio(), but it takes the shorter string and scores it against the best-matching substring of the longer one. It is useful when one string may be contained within another, for example a short product name inside a longer description.

from fuzzywuzzy import fuzz

string1 = "apple pie" #1st string to compare
string2 = "apple" #2nd string to compare
similarity_score = fuzz.partial_ratio(string1, string2)
print(similarity_score)

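In this code snippet, fuzz.partial_ratio() returns a similarity score of 100 because the shorter string, "apple", appears in full inside the longer one. A score of 100 here means a perfect partial match, not that the two strings are identical.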
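The partial-matching idea can be sketched with a sliding window: compare the shorter string with every same-length slice of the longer one and keep the best score. This is a simplified illustration built on difflib, not the library's implementation, and simple_partial_ratio is a hypothetical name.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any same-length slice of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0  # nothing to compare against
    window = len(shorter)
    best = 0.0
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        best = max(best, SequenceMatcher(None, shorter, candidate).ratio())
    return int(round(100 * best))


print(simple_partial_ratio("apple pie", "apple"))  # 100: "apple" is a slice of "apple pie"
```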

Comparing two strings after sorting the tokens: fuzz.token_sort_ratio()

fuzz.token_sort_ratio() splits each string into tokens (words), sorts the tokens alphabetically, and then compares the rearranged strings. It is useful when we want to compare two strings that have the same words but in a different order.

from fuzzywuzzy import fuzz

string1 = "apple pie with ice cream" #1st string to compare
string2 = "apple with ice cream pie" #2nd string to compare
similarity_score = fuzz.token_sort_ratio(string1, string2)
print(similarity_score)

In this example, fuzz.token_sort_ratio() returns a similarity score of 100, indicating that the two strings are identical once their tokens have been sorted.
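A rough sketch of the token-sorting step, again using difflib as a stand-in scorer. Lowercasing and splitting on whitespace are simplifying assumptions, and simple_token_sort_ratio is a hypothetical name.

```python
from difflib import SequenceMatcher


def simple_token_sort_ratio(a: str, b: str) -> int:
    """Sort the words of each string alphabetically, then compare the results."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return int(round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio()))


# Both strings become "apple cream ice pie with" after sorting
print(simple_token_sort_ratio("apple pie with ice cream", "apple with ice cream pie"))  # 100
```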

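Comparing two strings after removing duplicate tokens: fuzz.token_set_ratio()

fuzz.token_set_ratio() splits both strings into sets of unique tokens, so repeated words and word order are ignored, and then compares the tokens the strings share against the tokens that remain in each. It is useful when two strings contain the same key words but one of them has duplicates or extra words.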

from fuzzywuzzy import fuzz

string1 = "apple pie with ice cream"
string2 = "apple with ice cream pie"
similarity_score = fuzz.token_set_ratio(string1, string2)
print(similarity_score) # Output: 100

In the code above, fuzz.token_set_ratio() returns a similarity score of 100 because both strings contain exactly the same set of tokens.
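The set-based comparison can be sketched as follows: build the shared tokens and the leftovers of each string, then take the best of three pairwise scores. This follows the general approach under simplifying assumptions (whitespace tokens, difflib scoring), and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def _score(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def simple_token_set_ratio(a: str, b: str) -> int:
    """Compare shared tokens against shared-plus-leftover tokens of each string."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0  # nothing to compare against
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()
    return max(
        _score(common, combined_a),
        _score(common, combined_b),
        _score(combined_a, combined_b),
    )


print(simple_token_set_ratio("apple apple pie", "pie apple"))           # 100: duplicates ignored
print(simple_token_set_ratio("apple pie", "apple pie with ice cream"))  # 100: one set contains the other
```

Note that a string whose tokens are a subset of the other's also scores 100, which is why this method is forgiving of extra words.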

Handling large data with Fuzzywuzzy

Let’s explore three functions that are particularly useful for handling large data: fuzz.WRatio() from the fuzz module, and process.extract() and process.extractOne() from the process module.

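fuzz.WRatio()

The fuzz.WRatio() (weighted ratio) function goes a step beyond fuzz.ratio(). Rather than applying a single comparison, it runs several of the methods described above (plain, partial, and token-based), weights their results according to how different the two strings are in length, and returns the best weighted score. The underlying comparisons are still Levenshtein-style. The smallest number of single-character alterations (insertions, deletions, or substitutions) necessary to change one string into another is known as the Levenshtein distance.

fuzz.WRatio() returns a similarity score between 0 and 100. It is a reasonable general-purpose choice when strings may contain typos, misspellings, reordered words, or extra text, and it is the default scorer used by the functions in the process module.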

process.extract()

The process.extract() function scores a query string against every string in a list of choices. When you need to compare a lot of strings and get a similarity score for each comparison, this is helpful. It takes the query, the list of choices, an optional scorer, and an optional limit on the number of results (5 by default), and returns a list of tuples ordered from best to worst, where each tuple contains a candidate string and its similarity score.

The scorer argument accepts any of the fuzzywuzzy scoring functions, such as fuzz.ratio, fuzz.partial_ratio, fuzz.token_sort_ratio, and fuzz.token_set_ratio, among others.
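The ranking step can be sketched with the standard library alone. The helper below, extract_top, is a hypothetical stand-in that scores every candidate with a pluggable scorer and returns (string, score) tuples sorted from best to worst; the product names are made-up sample data. With fuzzywuzzy installed, the equivalent call would look like process.extract(query, choices, scorer=fuzz.token_sort_ratio, limit=2).

```python
from difflib import SequenceMatcher


def simple_score(a: str, b: str) -> int:
    """Case-insensitive 0-100 similarity, used here as a stand-in scorer."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


def extract_top(query, choices, scorer=simple_score, limit=5):
    """Score every choice against the query and return the best `limit` results."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Made-up sample data for illustration
products = ["Apple iPhone 13", "Apple iPhone 13 Pro", "Samsung Galaxy S21", "Apple Watch"]
for name, score in extract_top("iphone 13", products, limit=2):
    print(name, score)
```

Scoring every candidate is linear in the size of the list, so for very large datasets it helps to narrow the candidates first, for example by grouping records on a shared prefix or category.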

process.extractOne()

The process.extractOne() function extracts the single most similar string from a list of strings. This is useful when you have a large number of strings and you need to identify the one that is most similar to a given reference string.

It takes a reference string and a list of strings as input and returns a tuple containing the most similar string and its similarity score. Like process.extract(), it accepts a scorer argument, so it can be used with any of the fuzzywuzzy scoring functions, such as fuzz.ratio, fuzz.partial_ratio, fuzz.token_sort_ratio, and fuzz.token_set_ratio, among others. An optional score_cutoff makes it return None when no candidate scores high enough.
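A best-match lookup with a minimum score can be sketched like this. extract_best is a hypothetical helper built on difflib, and the city list is made-up sample data; it mirrors the shape of the result (a tuple, or None when nothing clears the cutoff) rather than the library's internals.

```python
from difflib import SequenceMatcher


def simple_score(a: str, b: str) -> int:
    """Case-insensitive 0-100 similarity, used here as a stand-in scorer."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


def extract_best(query, choices, scorer=simple_score, score_cutoff=0):
    """Return the (choice, score) pair with the highest score, or None below the cutoff."""
    best = None
    for choice in choices:
        score = scorer(query, choice)
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score)
    return best


# Made-up sample data for illustration
cities = ["New York", "Newark", "New Orleans"]
print(extract_best("new yrok", cities))                # best match despite the typo
print(extract_best("tokyo", cities, score_cutoff=80))  # None: no candidate is close enough
```

A cutoff is worth setting whenever a wrong match is worse than no match, since the best candidate in a list is not necessarily a good one.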

Choosing the appropriate function based on the requirement

The choice of the appropriate function depends on the specific requirements of the application. If you want to compare two complete strings character by character, you can use the fuzz.ratio() function. If you want to compare two strings of different lengths, where the shorter one may appear inside the longer one, you can use the fuzz.partial_ratio() function.

If you want to compare two strings that have the same words but in a different order, you can use the fuzz.token_sort_ratio() function. If you want to compare two strings that have the same words but with some duplicates, you can use the fuzz.token_set_ratio() function.

Best practices for fuzzy string matching in Python

A few practices help you get reliable results when implementing fuzzy string matching in Python:

  1. Preprocess data before applying fuzzy string matching functions: Clean the data first to remove noise, special characters, and stopwords, and to normalize case and whitespace. Techniques such as tokenization, stemming, and lemmatization can help. Preprocessing the data can improve the accuracy of fuzzy string matching algorithms.
  2. Choose the appropriate fuzzy string matching function based on the requirement: Fuzzywuzzy provides several functions for fuzzy string matching, such as fuzz.ratio(), fuzz.partial_ratio(), fuzz.token_sort_ratio(), and fuzz.token_set_ratio(). It is important to choose the appropriate function based on the application’s specific requirements. For example, fuzz.partial_ratio() can be more accurate if the strings being compared have different lengths.
  3. Understand the limitations of fuzzy string matching: Fuzzy string matching is not always accurate and can produce false positives or false negatives. A high score does not guarantee that two strings refer to the same thing, so combine it with other checks, such as matching on additional fields, to increase accuracy.
  4. Test the performance of fuzzy string matching functions: Before using them in production, test the functions on a sample dataset, both for speed and for the quality of the matches. This helps you choose a suitable scorer and score threshold.
  5. Consider other libraries for fuzzy string matching: Fuzzywuzzy is not the only option in Python. Depending on the requirements of the application, third-party libraries such as textdistance and jellyfish, as well as the difflib module in the standard library, offer comparable functionality.
  6. Customize fuzzy string matching methods: Fuzzywuzzy functions can be customized by changing their parameters, for example by passing a different scorer or a custom processor to the functions in the process module, or by implementing a custom algorithm. This can improve accuracy for specific use cases.
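As an illustration of the preprocessing step, the sketch below normalizes two strings before scoring them. The normalize helper is hypothetical, it assumes ASCII text (the regular expression would also strip accented letters), and difflib stands in for the scoring function; the company names are made-up sample data.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace (ASCII only)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # anything other than letters, digits, whitespace
    return " ".join(text.split())


def similarity(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Made-up sample data for illustration
raw_a = "  ACME Corp., Inc. "
raw_b = "acme corp inc"

print(similarity(raw_a, raw_b))                        # lower: case and punctuation differ
print(similarity(normalize(raw_a), normalize(raw_b)))  # 100 after normalization
```

How aggressive the cleaning should be depends on the data: stripping punctuation helps with company names, but it could erase meaningful differences in part numbers or codes.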

The Recap

In addition to the four comparison functions covered above, the fuzzywuzzy library also provides several other functions, such as fuzz.WRatio(), fuzz.QRatio(), and fuzz.UQRatio(), that are useful in specific situations.

You now have an understanding of how fuzzy string matching works in Python, illustrated with examples from the fuzzywuzzy library. If you follow the best practices above, you will be able to improve the accuracy of your fuzzy string matching and achieve better results in your applications.
