Many times I have had the unfortunate experience of wrangling several datasets with user-entered identifiers (e.g., real names, code names, etc.), and not once have I encountered a project free of errors. Years ago, I would spend hours sifting through spreadsheets, manually linking participants. Even after learning more efficient methods, scattered misspellings would plague my near-perfect merges. The magic of stripping whitespace and lowercasing was no match for the dreaded extra fat-fingered letter.

Something that has saved me hours of eye-strain and a few new grey hairs is fuzzy matching. In short, fuzzy matching links pieces of text that, although not spelled exactly the same, refer to the same thing. The technique has many uses, and the one I reach for most in my work is matching participant identifiers that have been entered incorrectly.

To illustrate this, let’s imagine a simple pre-post study design. Initially, participants complete a survey measuring their overall level of well-being; following the intervention, their well-being is measured again. While their name is entered for them on the pre-experimental survey, the post survey is completed online and requires them to enter their name themselves. This is when the headache begins.

Now, let’s generate some data to see what this looks like and what can be done.

Note: the names below are fictitious and conjured from http://listofrandomnames.com/

import pandas as pd
pre_experiment = pd.DataFrame({'participant':['Marvin Sprouse', 'Shala Pintor', 'Armanda Olivero', 'Shelby Bickett',
                                              'Reinaldo Averitt','Kareem Purser','Oneida Cadogan', 'Davida Pruett',
                                              'Lucas Catanzaro','Ernie Grajeda'],
                               'pre_wellbeing':[4,6,3,5,5,2,7,1,7,10]})
pre_experiment.head()
participant pre_wellbeing
0 Marvin Sprouse 4
1 Shala Pintor 6
2 Armanda Olivero 3
3 Shelby Bickett 5
4 Reinaldo Averitt 5
post_experiment = pd.DataFrame({'participant':['MarvinSprouse', 'Shalla Pintor', 'armanda oliver0', 'shelbey bicket',
                                              'Reinaldo Averitt','Karem Puser','Oneida cadogan', 'Davida Pruettt',
                                              'Lucas Catanzaro','Ernie grajda'],
                               'post_wellbeing':[6,5,5,7,4,9,10,5,7,10]})
post_experiment.head()
participant post_wellbeing
0 MarvinSprouse 6
1 Shalla Pintor 5
2 armanda oliver0 5
3 shelbey bicket 7
4 Reinaldo Averitt 4

In this example, we can easily tell which are the correct names, and it also helps that they’re in the same order! In actual experimental conditions where there are hundreds or thousands of participants, though, this is not so easy. Pandas has many native string methods that make cleaning text data easier (e.g., str.lower() can easily handle Oneida Cadogan and Oneida cadogan) - but instances like Ernie Grajeda and Ernie grajda are a bit more difficult.
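As a quick illustration of those built-in string methods, here is a minimal sketch (the cleaned values are only printed, not saved anywhere):

# Stripping whitespace and lowercasing fixes casing and spacing issues,
# but cannot repair missing or extra letters
post_experiment['participant'].str.strip().str.lower().head()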

Again, the end goal here is to merge our pre and post data into something we can analyze. To do this, we need the keys (here, the participant column in each dataframe) to be exact matches.

This is what happens if we simply tried to merge the pre and post conditions:

pre_experiment.merge(post_experiment)
participant pre_wellbeing post_wellbeing
0 Reinaldo Averitt 5 4
1 Lucas Catanzaro 7 7

Only 2 rows are retained! This is obviously a problem, and it is exactly the kind of job fuzzy matching is perfect for. We’ll be using the fuzzywuzzy library. For installation and more information on the library, you can visit the official GitHub page (https://github.com/seatgeek/fuzzywuzzy).

Fuzzywuzzy uses the Levenshtein distance to determine string similarity. Put simply, the Levenshtein distance measures how similar two strings are to each other based on how many single-character edits are required to transform one into the other.

Let’s import what we’ll use!

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
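
To get a feel for what these Levenshtein-based scores look like, here is a minimal sketch using fuzz.ratio() (I describe the scores only loosely in the comments, since the exact numbers depend on the strings):

# Identical strings score 100
print(fuzz.ratio('Ernie Grajeda', 'Ernie Grajeda'))

# A couple of edits apart - a high, but imperfect, score
print(fuzz.ratio('Ernie Grajeda', 'Ernie grajda'))

# Very different names score low
print(fuzz.ratio('Ernie Grajeda', 'Marvin Sprouse'))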

We’ll be using process.extract() for our matching. First, I’ll pass in just two arguments to demonstrate the output.

process.extract(post_experiment.loc[0,'participant'], pre_experiment['participant'])
[('Marvin Sprouse', 96, 0),
 ('Davida Pruett', 54, 7),
 ('Kareem Purser', 46, 5),
 ('Armanda Olivero', 36, 2),
 ('Shala Pintor', 32, 1)]

One way to read the syntax is that we are looking for a match to post_experiment.loc[0,'participant'] (i.e., MarvinSprouse) across the entire participant column of pre_experiment. We get 5 potential matches back (the default limit), each consisting of the proposed match itself, its similarity score, and the row position of the proposed match in pre_experiment.

The potential matches are returned in descending order of similarity, and the top match, Marvin Sprouse, with a similarity score of 96 at row 0 of pre_experiment, is indeed correct.
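
Since each proposed match is just a tuple, you can also unpack it directly if that reads more clearly (a small sketch; the variable names are my own):

# Unpack the top proposed match into its three parts
best_match, score, row_position = process.extract(post_experiment.loc[0, 'participant'],
                                                  pre_experiment['participant'])[0]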

We can use process.extractOne() to return only the best match. You can get a similar result by adding limit = 1 after the last argument to process.extract() (e.g., pre_experiment['participant'], limit = 1), though that still returns a one-element list rather than a bare tuple.

process.extractOne(post_experiment.loc[0,'participant'], pre_experiment['participant'])
('Marvin Sprouse', 96, 0)
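
For comparison, the limit = 1 route mentioned above still wraps the result in a list, whereas process.extractOne() hands back the bare tuple:

# Same query, capped at one result - note the surrounding list
process.extract(post_experiment.loc[0,'participant'], pre_experiment['participant'], limit = 1)
[('Marvin Sprouse', 96, 0)]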

When I use this method with Pandas dataframes, I like to have each part of the output in its own column so I can spot-check the matches if I suspect any discrepancies. We’ll generate these columns automatically in a moment, but I’ll print the pieces out here as an example too.

print(f"Proposed match: {process.extractOne(post_experiment.loc[0,'participant'], pre_experiment['participant'])[0]}")
print(f"Similarity: {process.extractOne(post_experiment.loc[0,'participant'], pre_experiment['participant'])[1]}")
Proposed match: Marvin Sprouse
Similarity: 96

We can now make a function to apply to the entire dataframe so we don’t have to match each row individually. Two things will happen at each row of post_experiment:

  1. That row’s participant will get matched to one in pre_experiment and placed in the dataframe
  2. The similarity score of the match will also be recorded
def fuzzy_match(row):

    # Best match in pre_experiment: (proposed match, similarity score, row position)
    best_match = process.extractOne(row['participant'], pre_experiment['participant'])

    row['fuzzy_match'] = best_match[0]
    row['similarity'] = best_match[1]

    return row

post_fuzzy = post_experiment.apply(fuzzy_match, axis = 1)
post_fuzzy.head()
participant post_wellbeing fuzzy_match similarity
0 MarvinSprouse 6 Marvin Sprouse 96
1 Shalla Pintor 5 Shala Pintor 96
2 armanda oliver0 5 Armanda Olivero 93
3 shelbey bicket 7 Shelby Bickett 93
4 Reinaldo Averitt 4 Reinaldo Averitt 100

Everything appears to have worked well, and the data are ready to be merged.
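
If you would rather spot-check programmatically than by eye, one option is to filter on the similarity column (the cutoff of 90 below is just an illustrative threshold, not anything built into fuzzywuzzy):

# Surface any matches whose similarity falls below an arbitrary cutoff
suspect_matches = post_fuzzy[post_fuzzy['similarity'] < 90]
suspect_matches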

Below, we inner join pre_experiment and post_fuzzy based on matching values in pre_experiment['participant'] and post_fuzzy['fuzzy_match']. We need to state these keys explicitly, or Pandas will try to merge on the shared participant column, which would give us the same result as the original merge we attempted.

pre_post_merge = (
    pre_experiment.merge(post_fuzzy,
                         left_on = 'participant',
                         right_on = 'fuzzy_match')
)

pre_post_merge.head()
participant_x pre_wellbeing participant_y post_wellbeing fuzzy_match similarity
0 Marvin Sprouse 4 MarvinSprouse 6 Marvin Sprouse 96
1 Shala Pintor 6 Shalla Pintor 5 Shala Pintor 96
2 Armanda Olivero 3 armanda oliver0 5 Armanda Olivero 93
3 Shelby Bickett 5 shelbey bicket 7 Shelby Bickett 93
4 Reinaldo Averitt 5 Reinaldo Averitt 4 Reinaldo Averitt 100

We can do a quick glance as a final sanity check, and everything looks good to go! After just a couple of finishing touches, we’ll have a completely clean and merged dataset.

cleaned_merged = (pre_post_merge[['participant_x', 'pre_wellbeing', 'post_wellbeing']]
                  .rename(columns = {'participant_x':'participant'})
                 )
cleaned_merged.head()
participant pre_wellbeing post_wellbeing
0 Marvin Sprouse 4 6
1 Shala Pintor 6 5
2 Armanda Olivero 3 5
3 Shelby Bickett 5 7
4 Reinaldo Averitt 5 4

I hope that this tutorial has been helpful and that this method saves you as much time as it has saved me. Although fuzzy matching is wonderful, there are still times when entries are too disparate to salvage. Like all techniques, it is also not immune to error.

One method I use to monitor match failures is a left join followed by an examination of missing values in the merged data (i.e., the data with the fuzzy matches). This lets me see which of the valid names never appeared in the other data and need further examination.
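
Here is a minimal sketch of that check using the frames from this example (the how = 'left' argument and the isna() filter are the whole idea):

# A left join keeps every valid participant from pre_experiment, matched or not
match_check = pre_experiment.merge(post_fuzzy,
                                   how = 'left',
                                   left_on = 'participant',
                                   right_on = 'fuzzy_match')

# Rows with a missing fuzzy_match had no counterpart in the post data
match_check[match_check['fuzzy_match'].isna()]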