Deduping data without a key

One of the challenges I’m working through right now is identifying unique person records in a blizzard of data pulled from APIs. The data purports to have a unique key for each person but, in reality, not so much (or, honestly, at all). As a result, I’m working to identify unique people by the attributes their records share. Email, for instance, can be a reliable unique identifier, but we have many families in our records where dad or mom shares an email address with the kids. So we’re taking a more holistic approach and examining all the attributes of a record together. In the case below (credit: Melissa), do these two records represent one woman or two?

[Image: the two sample records being compared]

Here are some observations about the challenge to illustrate that it’s not simply a byte-for-byte comparison.

  • Can we identify “Beth” as a nickname for “Elizabeth,” and would this woman use both names? Note that the record with the more formal “Elizabeth” also spells out her city in full, while the informal “Beth” record abbreviates it.
  • Could the last name “Smithe” on the informal record be a typo that should count as a match for “Smith” on the formal record?
  • Both records list the same address, so that’s looking positive.
  • The purchase history looks consistently upscale across both records.
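
To make the observations above a bit more concrete, here is a minimal sketch of how a few of those signals might be computed with nothing but the Python standard library. Everything in it is an assumption for illustration: the `NICKNAMES` table, the field names, the helper functions, and the placeholder street address are all hypothetical, and this is not how the dedupe library scores records.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"beth": "elizabeth", "liz": "elizabeth", "bob": "robert"}


def canonical_first_name(name):
    """Map a known nickname to its formal form; otherwise return the name lowercased."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def similarity(a, b):
    """Return a 0.0-1.0 similarity ratio, so near-typos score high without being equal."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def compare_records(r1, r2):
    """Collect per-field signals for a pair of records (field names are assumed)."""
    return {
        "first_name": canonical_first_name(r1["first_name"])
        == canonical_first_name(r2["first_name"]),
        "last_name": similarity(r1["last_name"], r2["last_name"]),
        "address": similarity(r1["address"], r2["address"]),
    }


# Names come from the example above; the address is a made-up placeholder.
formal = {"first_name": "Elizabeth", "last_name": "Smith", "address": "12 Example St"}
informal = {"first_name": "Beth", "last_name": "Smithe", "address": "12 Example St"}

signals = compare_records(formal, informal)
print(signals)
```

None of these signals settles the question on its own. The point is only that each bullet can be turned into a number or a yes/no that a person, or later a model, can weigh.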

So now we inject human judgment to decide whether these records represent one woman or two. If we judge them to be the same woman, we have created a rule about nicknames, typos, formality of record source, addresses, and purchase history that we can reuse for other records.

Each rule we create makes our algorithm smarter and more accurate as it applies them across the entire dataset. To make that happen, I’m using the dedupe library in Python. It approaches this initially as a clustering problem; then, as I supply judgments on the examples it identifies, it can classify more of our blizzard of records into unique people.
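
Once pairs of records have been judged to be the same person, they still have to be rolled up into unique people. The sketch below shows one simple way to think about that step: treat each judged match as a link and group everything that is linked, directly or indirectly. This is a union-find illustration under my own assumptions, with hypothetical record IDs and a hypothetical `cluster_matches` helper. It is not the dedupe library's API or its internal clustering method.

```python
def cluster_matches(record_ids, matched_pairs):
    """Group record IDs so that records linked by any chain of matches share a cluster."""
    # Every record starts out as its own cluster.
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the cluster's root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two clusters behind each judged match.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each cluster under its root.
    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return list(clusters.values())


# Hypothetical IDs: r1 matches r2 and r2 matches r3, so all three are one person.
people = cluster_matches(["r1", "r2", "r3", "r4"], [("r1", "r2"), ("r2", "r3")])
print(people)
```

One caveat with this kind of transitive grouping: a single bad match can chain two different people together, which is one reason the human-judgment step matters so much.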

I’ll keep you updated on what I learn from the process, and I welcome your ideas and feedback. Feel free to comment or tweet back your thoughts.