Data Cleaning and Record Linkage

Due: Thursday, Feb 16th at 5pm

You may work alone or in a pair on this assignment.

In this assignment you will take a pair of datasets containing restaurant names and addresses, clean them, and link them, i.e., find records in the two datasets that refer to the same restaurant.

This assignment will give you extensive experience in data cleaning and munging with pandas, as well as experience with matplotlib and with probabilistic record linkage.

Here is an overview of the various tasks you’ll perform. Some of these will make more sense after you have read the full description.

  1. The restaurant datasets are lists of names and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.txt and fodors.txt. Your first task is to create a pandas dataframe for each consisting of four columns: (i) the original full name and address string, and extracted columns consisting of (ii) the restaurant name, (iii) the city, and (iv) the address (except for the city). Call the dataframes zagat and fodors.
  2. For record linkage you will use a training dataset of known links, available in the file known_links.txt. Your next task is to read this dataset into a pandas dataframe and find the correspondence between the pairs in the known links and the rows in zagat and fodors. Call a pair from the known links a match, and the dataframe matches.
  3. You will also need a dataset of pairs that are not matches. We will call such a pair an unmatch. You will create this by choosing random pairs from zagat and fodors. Call this the unmatches dataframe. (The terms match and unmatch follow the convention in record linkage. )
  4. You will use the Jaro-Winkler distance to measure the similarity between a field in a pair of records (one each from zagat and fodors). Your fourth task will be to explore the distribution of the Jaro-Winkler scores when a pair is a match and compare it to when the pair is an unmatch. (This step is not strictly required for record linkage, but will help clarify the rationale behind the approach we adopt.) For this you will plot histograms for the distance values for six cases, and create a pdf file containing all your histograms.
  5. To link records, you will use (a simplified version of) the probabilistic record linkage approach. In this you will represent the distances between the fields of a pair of restaurants as a vector of three integers. There are 27 possible such vectors. Your next task is to find the probability of each vector given whether the pair it came from is a match or an unmatch. Here you will use the matches and unmatches dataframes.
  6. Next, you will order the vectors, and partition them into three sets: match_vectors, possible_vectors, and unmatch_vectors. If a test pair maps to a vector in the first set, we classify it as a match; if it maps to a vector in the second set, we classify it as a possible match; else we classify it as an unmatch. (Possible matches need to be verified manually.)
  7. It will then be time to find matches. You will iterate through all potential pairs, for each find the corresponding vector, and the set it belongs to. Based on the set you will mark it as a match, possible match, or unmatch. You will print the number of each type you find, and also store all matches in a csv file.
  8. Iterating through all possible pairs is time consuming in general. In the final step, you will repeat the above step but only consider potential pairs that also have the same value for either name, city, or address.

Create restaurant dataframes

The restaurant datasets are lists of names and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.txt and fodors.txt. Your first task is to create a pandas dataframe for each consisting of four columns: (i) the original full name and address string, (ii) the restaurant name, (iii) the city, and (iv) the address (except for the city). You will extract the latter three columns from the first, and use them for record linkage later on.

Extracting new fields from text

pandas provides the string method extract() that takes a regular expression and creates new fields from text data. Assume, e.g., that the original name and address strings from Zagat are in a column named nameaddress in zagat. Then the call:

zagat['nameaddress'].str.extract(r'^([^\d]*)(\d.*)$', expand=True)

returns a dataframe with two columns—each corresponding to a group in the regular expression—the first containing the longest prefix of the nameaddress string that contains no digits, and the second containing the rest of the string. The columns of the returned dataframe can be appended to the zagat dataframe using the pandas function concat().
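Putting the two steps together, a minimal sketch might look like the following (the sample rows and the column names name and rest are made up for illustration; the real strings come from zagat.txt):

```python
import pandas as pd

# Hypothetical sample rows standing in for the contents of zagat.txt.
zagat = pd.DataFrame({"nameaddress": [
    "Medici 1327 E. 57th St. Chicago",
    "Noodles Etc. 1333 E. 57th St. Chicago",
]})

# Group 1: the longest digit-free prefix (the name); group 2: the rest.
parts = zagat["nameaddress"].str.extract(r"^([^\d]*)(\d.*)$", expand=True)
parts.columns = ["name", "rest"]

# Append the extracted columns to the original dataframe.
zagat = pd.concat([zagat, parts], axis=1)
```

Further calls to extract() on the rest column can then split off the city and address.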

Use (possibly multiple calls to) extract() to create the name, city, and address columns. We found the following useful in this data munging process:

  • Build the regular expression(s) you need by iterative refinement—look at some of the data, try out an initial regular expression, observe where it works and where it fails, and improve upon it. If the regular expression doesn’t match a given string, extract returns null values. So if you want to look at the strings where your regular expression failed, use, e.g., zagat.loc[df.iloc[:,0].isnull()], in which df is the dataframe returned by extract().
  • Even so, you may not find a single regular expression that extracts all three columns. It may be easier to extract the columns in stages, using separate calls to extract(). Further, there may be special cases that are unlike all other data and are best handled by hard-coding their values. In the approximately 850 rows in the combined data, we hard-coded the extraction for about 20 restaurants. You should not hard-code more than 50 cases.
  • Addresses often start with a street number. But not always, and sometimes the restaurant name itself has a number as well.
  • City names often begin after the name of a road, i.e., after a substring ending with “Blvd.”, “Ave.”, “PCH”, or “Broadway”. It is simple to build a complete list of such suffixes by iterative refinement.
  • We recommend using a jupyter notebook, at least for the iterative refinement process.
  • You should be comfortable with the different indexing and selection mechanisms in pandas.
  • It is best not to name the restaurant name column name, because pandas dataframes have a predefined attribute called name, and you get unexpected results if you use the dataframename.columnname style of referring to a column.
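The failure-inspection idiom from the first bullet can be sketched as follows (the rows are made up; the second one lacks a street number, so the illustrative regular expression fails on it):

```python
import pandas as pd

# Hypothetical rows; the second has no digits, so the regex fails on it.
zagat = pd.DataFrame({"nameaddress": [
    "Medici 1327 E. 57th St. Chicago",
    "Cafe Bistro on Broadway Chicago",   # no street number
]})

df = zagat["nameaddress"].str.extract(r"^([^\d]*)(\d.*)$", expand=True)

# Rows where the regex failed have NaN in every extracted column.
failed = zagat.loc[df.iloc[:, 0].isnull()]
```

Inspecting failed after each refinement tells you which strings still need a better pattern (or hard-coding).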

Create the matches dataframe

For record linkage, you will use a training dataset of known links, available in the file known_links.txt. It contains 50 pairs of restaurant names and addresses that are known to refer to the same restaurant. The first in the pair is from Zagat and the second from Fodor’s. Your next data munging task is to read this dataset into a pandas dataframe called matches. You will find read_csv() unsuitable for this due to the irregular formatting in known_links.txt. For example, sometimes a restaurant string is in a single line, and sometimes split across two lines. We recommend writing custom Python code to read the dataset.

You will also find that not all restaurant strings are exactly equal to the corresponding ones in the Zagat and Fodor’s datasets, and will have to correct those. This correspondence will be useful during record linkage.

Create the unmatches dataframe

It turns out that for record linkage you also need a dataset of unmatches. For this choose 1000 random restaurants from zagat and pair them with 1000 random restaurants from fodors. There are several ways to make random choices: Python’s random library, Numpy functions for creating random arrays, or pandas own sample function for sampling a dataframe.

Technically the random pairs may include some pairs that actually match, but the overwhelming majority will be unmatches, and we’ll ignore this minor source of error.

Why are we using 1000 random pairs for unmatches, when we only have 50 for matches? Because we can, and the more pairs we have, the better the accuracy of our record linkage. Of course, having more matches would have been helpful as well, but that requires expensive manual work.
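The sampling-and-pairing step might be sketched like this, using tiny stand-in frames (the z_/f_ column prefixes and the second seed 5678 are arbitrary choices, and replace=True is only needed here because the toy frames have fewer than 1000 rows):

```python
import pandas as pd

# Tiny stand-in frames; the real ones are the zagat and fodors dataframes.
zagat = pd.DataFrame({"name": ["medici", "subway", "noodles etc."]})
fodors = pd.DataFrame({"name": ["cafe med", "potbelly", "pasta place"]})

# Sample 1000 rows from each, independently, and pair them positionally.
zs = zagat.sample(1000, replace=True, random_state=1234).reset_index(drop=True)
fs = fodors.sample(1000, replace=True, random_state=5678).reset_index(drop=True)
unmatches = pd.concat([zs.add_prefix("z_"), fs.add_prefix("f_")], axis=1)
```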

Caution. The random choices you make will influence the rest of your results. This makes it hard to debug a program if the bug only shows up for some random choices. So during the development and debugging process, it is useful to seed the random generator with a fixed number. E.g., when using pandas the following call:

zagat.sample(1000, replace = True, random_state = 1234)

always returns the same 1000 rows because the random state is fixed through the seed 1234. Once your code is bug-free you should remove the random_state option.

Explore Jaro-Winkler scores, plot histograms

To link records we take a potential pair, one each from zagat and fodors, and decide if their strings are similar enough. Our results are better when the strings are broken up into natural components, which you have already accomplished—into the components name, city, and address.

When the components are strings, a natural measure of similarity, or rather the distance between two strings, is the edit distance, also known as the Levenshtein distance. It is defined as the minimum number of single-character insertions, deletions, and substitutions required to convert one string to another. In practice, however, the Levenshtein distance is computationally intensive, and a related measure, called the Jaro-Winkler distance is used instead.

The exact definition of the Jaro-Winkler distance is somewhat technical (but available here). Also, although it is called a distance, it actually measures the similarity between two strings: a complete match (“Medici” and “Medici”) has a score of 1, and a complete mismatch (“Medici” and “Subway”) has a score of 0.

You will use the Python library jellyfish to compute the Jaro-Winkler distance between two strings. Install it using the standard:

sudo pip3 install -U jellyfish

To find the Jaro-Winkler distance use, e.g.,:

import jellyfish
score = jellyfish.jaro_winkler("medici", "noodles etc.")

Your next task is to explore the distribution of the Jaro-Winkler distance between names, cities, and addresses of pairs of restaurants, both when the pairs are matches and when they are unmatches.

Your specific task here is to create a pdf file, named histograms.pdf, containing six histograms, labelled appropriately. The first pair of histograms is for the Jaro-Winkler distances between restaurant names, one histogram each for matches and unmatches. The second and third pairs are similar, but for cities and addresses, respectively.

Here you will find useful the pandas function plot.hist(), and the matplotlib.pyplot functions figure(), subplot(), xlabel(), title(), and savefig(). See, e.g., the tutorial on working with multiple figures. Our histograms are in histograms_sample.pdf. Your results may vary because of different random choices, and possibly different extraction processing.
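As a sketch of the plotting mechanics only (the scores below are randomly generated stand-ins, not real Jaro-Winkler values, and only two of the six series are shown):

```python
import matplotlib
matplotlib.use("Agg")          # render without a display
import matplotlib.pyplot as plt
import numpy as np

# Stand-in scores; in the assignment these are Jaro-Winkler distances
# computed for pairs drawn from matches and unmatches.
rng = np.random.default_rng(1234)
scores = {
    "matches: name": rng.uniform(0.8, 1.0, 50),
    "unmatches: name": rng.uniform(0.0, 0.7, 1000),
    # ... four more series, for city and address ...
}

fig = plt.figure(figsize=(8, 6))
for i, (label, values) in enumerate(scores.items(), start=1):
    plt.subplot(3, 2, i)       # a 3 x 2 grid of histograms
    plt.hist(values, bins=20)
    plt.title(label)
    plt.xlabel("Jaro-Winkler score")
fig.tight_layout()
plt.savefig("histograms.pdf")
```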

Estimate vector probabilities given match or unmatch

Observe that the probability of a match does not increase monotonically with the similarity for a field; rather, there are clusters in which this probability is high. For instance, in our results, for the address field, the probability is lower in the range 0.80-0.95 than just outside this range. The clusters are even more pronounced for the unmatch pairs.

The above implies that a single “cut-off” on the similarity for a field, or even a group of fields, will not serve our purpose. So we break up the range of similarity values into discrete chunks, and determine the probability of a match for each chunk separately from that of neighboring chunks. Each chunk (in the 3-dimensional space of name, city, and address similarities) can be identified by a vector as described below.

We have adopted (a simplified version of) the probabilistic record linkage approach proposed by Fellegi and Sunter. Provided in utils.py is a simple utility function get_jw_category() that takes a Jaro-Winkler distance and returns an integer category between 0 and 2, essentially breaking the range of the Jaro-Winkler score into three blocks. We apply this on all three fields: names, cities, and addresses. Thus for any pair of restaurants, we can create a vector (name_category, city_category, address_category) that represents the similarity of the fields for that pair. Whether, during record linkage, the pair should be classified as a match or unmatch, or something in between, will depend on certain probabilities associated with the corresponding vector.

We determine whether a vector should be classified as a match, possible match, or unmatch, based on estimates of the probability that a pair results in that vector when given that the pair is a match, as well as when given that the pair is an unmatch. Formally, assume that \((r_1, r_2)\) is a potential pair, \(v(r_1, r_2)\) is the vector formed from its field similarities, and \(V\) is the set of all possible vectors. For every \(w \in V\) we need estimates for two quantities:

\[\begin{split}\begin{array}{l} m(w) \text{ defined as }\mathrm{P}[v(r_1, r_2) = w \ |\ (r_1, r_2) \text{ is a match}], \text{ and} \\ u(w) \text{ defined as } \mathrm{P}[v(r_1, r_2) = w \ |\ (r_1, r_2) \text{ is an unmatch}]. \end{array}\end{split}\]

Compute estimates for the former by iterating through all the pairs in matches, determining their vectors, and counting the frequency of each of 27 possible vectors during this process. The relative frequency for a vector \(w\) is the estimate of the probability for \(w\) given a matching pair. Similarly find an estimate for the probability given an unmatch by iterating through vectors for all pairs in unmatches.
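The counting step can be sketched as follows. The observed vectors here are hypothetical; in the assignment they come from applying get_jw_category() to the three field similarities of each pair in matches and unmatches:

```python
from collections import Counter

# Hypothetical observed vectors; in the assignment each is
# (name_category, city_category, address_category), with entries in
# {0, 1, 2} produced by get_jw_category() from utils.py.
match_vectors_seen = [(2, 2, 2), (2, 2, 2), (2, 2, 1), (2, 1, 2)]
unmatch_vectors_seen = [(0, 0, 0), (0, 0, 0), (0, 1, 0), (1, 0, 0)]

def estimate_probs(vectors):
    """Relative frequency of each of the 27 possible vectors."""
    counts = Counter(vectors)
    n = len(vectors)
    return {(i, j, k): counts[(i, j, k)] / n
            for i in range(3) for j in range(3) for k in range(3)}

m = estimate_probs(match_vectors_seen)    # estimates of m(w)
u = estimate_probs(unmatch_vectors_seen)  # estimates of u(w)
```

Vectors that never occur simply get relative frequency 0.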

Partition vectors into match, possible match, and unmatch sets

How should we partition the set of vectors \(V\) into the three different sets: match_vectors, possible_vectors, and unmatch_vectors? It depends on our tolerance for two kinds of errors:

  • false positive rate, the probability that we classify an actual unmatch as a match, and
  • false negative rate, the probability that we classify an actual match as an unmatch.

Assume that \(\mu\) is the maximum false positive rate and \(\lambda\) is the maximum false negative rate we are willing to tolerate, and given these are satisfied we want to maximize the number of our matches.

In order to partition the set of vectors \(V\), we do the following—

  1. Assign all vectors \(w\) that have \(m(w) = 0\) and \(u(w) = 0\) to possible_vectors.
  2. Order the rest of the vectors so that the vectors with \(u(w) = 0\) come first, followed by the remaining vectors in decreasing (non-increasing) order of \(m(w)/u(w)\).
  3. Beginning from the first vector in the order above find the last vector \(w_1\) such that the cumulative sum of \(u(w)\) values (for \(w\) from first to \(w_1\)) is at most \(\mu\). Add all vectors from the first till \(w_1\) (inclusive) to match_vectors.
  4. Beginning from the last vector and moving backwards, find the vector \(w_2\) farthest from the end such that the cumulative sum of \(m(w)\) values (for \(w\) from last to \(w_2\)) is at most \(\lambda\). Add all vectors from the last till \(w_2\) (inclusive) to unmatch_vectors.
  5. Add any remaining vectors (in the ones that were ordered) to possible_vectors.

You should first convince yourself that the above approach makes sense. You will then realize that the above only works when \(w_1\) happens to be ahead of \(w_2\). Work out the version of the algorithm when this is not the case, and implement the complete algorithm.
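A sketch of steps 1–5 follows. The string keys "a" through "e" are toy labels standing in for vectors, and stopping the backward scan at the prefix boundary is just one possible resolution of the case where \(w_2\) would cross \(w_1\); check it against the semantics you work out yourself:

```python
def partition_vectors(m, u, mu, lam):
    """Partition vectors into (match, possible, unmatch) per steps 1-5."""
    possible, rest = [], []
    for w in m:
        if m[w] == 0 and u[w] == 0:
            possible.append(w)          # step 1
        else:
            rest.append(w)

    # Step 2: u(w) == 0 vectors first, then decreasing m(w)/u(w).
    rest.sort(key=lambda w: (u[w] != 0, -(m[w] / u[w]) if u[w] else 0))

    # Step 3: longest prefix whose cumulative u(w) stays within mu.
    match_set, i, cum_u = [], 0, 0.0
    while i < len(rest) and cum_u + u[rest[i]] <= mu:
        cum_u += u[rest[i]]
        match_set.append(rest[i])
        i += 1

    # Step 4: longest suffix whose cumulative m(w) stays within lam.
    # Bounding j by i is one way to handle w2 crossing w1.
    unmatch_set, j, cum_m = [], len(rest) - 1, 0.0
    while j >= i and cum_m + m[rest[j]] <= lam:
        cum_m += m[rest[j]]
        unmatch_set.append(rest[j])
        j -= 1

    # Step 5: whatever is left in the middle is a possible match.
    possible.extend(rest[i:j + 1])
    return match_set, possible, unmatch_set

# Toy probabilities, with string labels standing in for vectors.
m_probs = {"a": 0.6, "b": 0.3, "c": 0.1, "d": 0.0, "e": 0.0}
u_probs = {"a": 0.0, "b": 0.1, "c": 0.4, "d": 0.5, "e": 0.0}
match_set, possible_set, unmatch_set = partition_vectors(
    m_probs, u_probs, 0.15, 0.05)
```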

Find matches and count possible matches and unmatches

Once you have the three sets for given maximum false positive and false negative rates, simply iterate through every pair in zagat and fodors to determine counts of matches, possible matches, and unmatches.
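In outline, the iteration might look like this, with stand-in rows and a hypothetical vec() function in place of the real per-field Jaro-Winkler vector computation:

```python
from itertools import product

# Stand-in pieces: match_vectors and unmatch_vectors come from the
# partitioning step; vec() is a hypothetical placeholder for the real
# similarity-vector computation on a (zagat row, fodors row) pair.
match_vectors = {(2, 2, 2)}
unmatch_vectors = {(0, 0, 0)}

def vec(z, f):
    return (2, 2, 2) if z == f else (0, 0, 0)

zagat_rows = ["medici", "subway"]
fodors_rows = ["medici", "pasta place"]

counts = {"match": 0, "possible": 0, "unmatch": 0}
found_matches = []
for z, f in product(zagat_rows, fodors_rows):
    w = vec(z, f)
    if w in match_vectors:
        counts["match"] += 1
        found_matches.append((z, f))   # these rows go into the csv file
    elif w in unmatch_vectors:
        counts["unmatch"] += 1
    else:
        counts["possible"] += 1
```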

Block on name, city, or address

Iterating through all potential pairs is computationally prohibitive for large datasets. So the number of potential pairs is often reduced by applying a constraint, such as only considering pairs that have the same value for a blocking key. Implement a version of the record linking algorithm in which either name, city, or address is supplied as a blocking key.
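One way to generate only the blocked candidate pairs is an inner merge on the blocking key, sketched here with stand-in frames (the _z/_f suffixes are an arbitrary choice):

```python
import pandas as pd

# Tiny stand-ins; the real frames are zagat and fodors, and block_on is
# one of "name", "city", or "address".
zagat = pd.DataFrame({"name": ["medici", "subway"],
                      "city": ["chicago", "chicago"]})
fodors = pd.DataFrame({"name": ["medici", "pasta place"],
                       "city": ["chicago", "evanston"]})

# An inner merge on the blocking key yields exactly the candidate pairs
# that share a value for that key.
block_on = "city"
candidates = zagat.merge(fodors, on=block_on, suffixes=("_z", "_f"))
```

Only the rows of candidates then need to be run through the vector classification, rather than all |zagat| × |fodors| pairs.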

Task checklist

Write Python code, with appropriate auxiliary functions and documentation that implements the following function:

find_matches(mu, lambda_, outfile, block_on)

which takes mu, the maximum false positive rate, lambda_, the maximum false negative rate, outfile, the file name in which to store the matches, and block_on to indicate the blocking key to use, if any. It should create the file histograms.pdf, store all the matches it finds in outfile in csv format (Zagat name and address followed by Fodor’s name and address), and return a 3-tuple containing the number of matches, possible matches, and unmatches it found.

In particular, ensure you—

  1. Create zagat, fodors, matches, and unmatches.
  2. Plot histograms of name, city, and address similarity for pairs from matches and unmatches.
  3. Determine relative frequencies for 27 vectors corresponding to pairs from matches and unmatches.
  4. Order vectors and create sets match_vectors, possible_vectors, and unmatch_vectors.
  5. Count the number of matches, possible matches, and unmatches for all potential pairs from zagat and fodors, possibly after applying a blocking key.
  6. Store the matches in a csv file.

Getting started

Follow these start-up instructions if you plan to work alone.

Follow these start-up instructions if you are going to work with the same partner as you had for PA #2 or PA #3.

Follow these start-up instructions if you are going to work in a new pair.

Once you follow these instructions, your repository will contain a directory named pa4.

Submission

Follow these submission instructions if you are working alone.

Follow these submission instructions if you are working in a pair.