One of the challenges I’m working through right now is identifying unique person records in a blizzard of API’d data that purports to have a unique key for each person but, in reality, not so much (or, honestly, at all). As a result, I’m working to identify unique people by the attributes their records share. Email, for instance, can be a reliable unique identifier, but we have many families in our records where dad or mom shares an email address with the kids. So we’re taking a more holistic approach and examining all the attributes of a record together. In the case below (credit: Melissa), do these two records represent one woman or two?
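To make the "look at all the attributes" idea concrete, here is a minimal sketch in Python. It is not the matching logic from the project described above, just one common way to do it: compare each field with a fuzzy string similarity and combine the results into a weighted score. The field names, weights, and sample records are all made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical field names and weights, chosen only for illustration.
# Email gets a modest weight because family members may share one address.
WEIGHTS = {"first_name": 0.3, "last_name": 0.3, "email": 0.2, "birth_date": 0.2}


def field_similarity(a, b):
    """Return a 0..1 similarity for two field values; 0.0 if either is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of per-field similarities, in the range 0..1."""
    total = sum(weights.values())
    score = sum(
        w * field_similarity(rec_a.get(f), rec_b.get(f)) for f, w in weights.items()
    )
    return score / total


# Made-up records: the same person entered twice (with a typo) vs. a child
# who shares the parent's email address.
parent = {"first_name": "Dana", "last_name": "Smith",
          "email": "smiths@example.com", "birth_date": "1980-04-12"}
parent_typo = {"first_name": "Dana ", "last_name": "Smyth",
               "email": "smiths@example.com", "birth_date": "1980-04-12"}
child = {"first_name": "Riley", "last_name": "Smith",
         "email": "smiths@example.com", "birth_date": "2011-09-30"}

print(match_score(parent, parent_typo))  # high: likely the same person
print(match_score(parent, child))        # lower, despite the shared email
```

Because the shared email is only one signal among several, the parent and child score noticeably lower than the two entries for the same person. Where to draw the "same person" threshold is a judgment call that would need tuning against real, hand-labeled records.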
One of the first things people working in data science / machine learning have to grapple with is that our claims are predictive, not inferential. Instead of making careful, logical moves from one clear, proven claim to another, we mix together a set of features, build a model, and then present our results as a reasonably probable forecast of the future.
Thoughtful people, however, know that “correlation does not equal causation.” Just because two things can be shown to occur in some sort of reliable sequence does not mean that Thing A necessarily caused Thing B. When I leave for work, I notice that the sun is often rising. But unlike the rooster, I can’t really take credit for that.
Jim Keeler graciously invited me to attend this past Tuesday’s meeting of the Austin Chapter of SIM (the Society for Information Management), which is doing a broader version of the work we used to do six years ago through the CIO/CTO Roundtable of the Austin Technology Council. Kudos to them for doing notably stronger work than we did to reach out to women in tech. It was warm and energizing to re-connect with old friends like Jim, Russ Finney, and Vijay George. It was fascinating to meet new people and learn from them as well. Here are a few things I learned that evening: