1
Using SQL Joins to Perform Fuzzy Matches on Multiple Identifiers
Jedediah J. Teres, MDRC, New York, NY
ABSTRACT
Matching observations from different data sources is problematic without a reliable shared identifier. Using
multiple identifiers can be more restrictive as it requires multiple exact matches. One way around this is to create a
match score based on the number of matching identifiers. This score can be weighted to favor certain matches or
sets of matches (e.g., first name and last name) over others (e.g., first name and date of birth). This paper builds
on a previous paper describing a technique for creating and using a match score in the context of a SQL join to
find matches in cases when all identifiers are not exactly the same and allows for the use of COMPGED to find
close matches without requiring strict equality.
INTRODUCTION
This paper expands upon a previously described method of combining data sets using multiple variables as
matching criteria in PROC SQL by using COMPGED to allow for fuzzy matching. Prior to SAS 9.2, using
COMPGED in the context of a SQL JOIN produced a note to the log each time a character was compared to a
blank space. With the release of SAS 9.2, this is no longer an issue, and COMPGED can be used to expand the
flexibility of JOINS in SQL. Knowledge of the SELECT statement and JOINs in PROC SQL is assumed. Prior
knowledge of COMPGED is helpful but not necessary. An understanding of the PUT function is needed.
SAMPLE DATA SETS
Two SAS data sets are used for illustration purposes in this paper: REF and CHK.
Data set REF
Obs FNAME
LNAME
SSN NINO
DOB
1 John
Baldwin
123-45-6789 BM567891E
1/3/1946
2 Robert Plant
234-56-7890 YZ912345H
8/20/1948
3 Jimmy
Page
345-67-8901 HL234567B
1/9/1944
4 John
Bonham
456-78-9012 YZ891234G
5/31/1948
5 Ray
Davies
567-89-0123 HL456789D
6/21/1944
6 Dave
Davies
678-90-1234 KR789123F
2/3/1947
7 Peter
Quaife
789-01-2345 AA123456A
12/31/1943
8 Mick
Avory
890-12-3456 HL345678C
2/15/1944
Data set REF contains 5 variables and 8 observations. Each observation represents a unique sample member;
there are no duplicate observations in the data set REF.
Data set CHK
Obs FNAME
LNAME
SSN NINO
DOB BAND
1 John
Jones
123-45-6789 BM567891E
1/3/1946 Led Zeppelin
2 Robert Plant
234-56-7890 YZ912235H
8/20/1948 Led Zeppelin
3 Jimmy
Page
345-67-8907 HL234567D
1/9/1944 Led Zeppelin
4 John
Bonham 456-78-9012 YZ891234G
5/31/1948 Led Zeppelin
5 Ray
Davies 567-89-0123 HL456789D
6/21/1944 The Kinks
6 Dave
Davies 678-90-1234 KR789213F
2/3/1947 The Kinks
7 Pete
Quiafe 789-01-2346 AA132456A 12/31/1943 The Kinks
8 Mick
Avery
890-21-3456 HL345678C
2/15/1944 The Kinks
9 Jay
Davie
567-89-1023 HI636789D
6/12/1984 Blue Devils
10 Dave
David
678-90-1324 KR789213F
2/3/1974 The Minks
Data set CHK contains 6 variables and 10 observations. Each observation represents a unique individual; there
are no duplicate observations in the data set CHK. The data set CHK contains the same variables as the data set
REF, but there is an additional variable, BAND. The goal is to merge the data set CHK to the data set REF in order
to pick up the BAND variable for the people in the sample.
PROC SQL
NESUG 2011
2
Using AN INNER JOIN TO COMBINE DATA SETS
In its most basic form, the syntax for combing data sets using an inner join in PROC SQL is as straightforward as
a match-merge in a DATA step. Here, the data sets REF and CHK are joined using the National Insurance
Number field (NINO) as the common identifier.
proc sql noprint ;
create table inner_join1 as
select ref.*, band
from ref inner join chk
on (ref.nino eq chk.nino)
order by band, lname, fname ;
quit ;
The resulting data set contains only 4 observations. The original data set REF had 8 observations, so 3
observations failed to return a match.
Obs FNAME LNAME
SSN NINO
DOB BAND
1 John
Baldwin
123-45-6789 BM567891E
1/3/1946 Led Zeppelin
2 John
Bonham
456-78-9012 YZ891234G
5/31/1948 Led Zeppelin
3 Mick
Avory
890-12-3456 HL345678C
2/15/1944 The Kinks
4 Ray
Davies
567-89-0123 HL456789D
6/21/1944 The Kinks
Upon further investigation, it becomes clear that several NINOs do not match.
John
Baldwin
BM567891E
John
Jones
BM567891E
Robert Plant
YZ912345H
Robert
Plant
YZ912235H
Jimmy
Page
HL234567 B
Jimmy
Page
HL234567D
John
Bonham
YZ891234G
John
Bonham
YZ891234G
Ray
Davies
HL456789D
Ray
Davies
HL456789D
Dave
Davies
KR789 123F
Dave
Davies
KR789213F
Peter
Quaife
AA1 23456A
Pete
Quiafe
AA132456A
Mick
Avory
HL345678C
Mick
Avery
HL345678C
These are all fairly minor differences consisting mostly of the transposition of two numbers or the errant
substitution of a letter. Each represents a possible data entry error. With a data set this small, values could be
edited so they match on both files. In a larger data set, that would not be a viable solution, so it's well worth
considering what other options are available.
COMPGED
Since NINO is a character variable, we can use the COMPGED function to compute the generalized edit distance
between the two character values.
The SAS documentation on the COMPGED function states that “
Generalized
edit distance is a generalization of Levenshtein edit distance, which is a measure of dissimilarity between two
strings.”
The finer points of COMPGED are beyond the scope of this paper, but the general syntax takes the form
COMPGED(string-1, string-2 <,cutoff> <,modifiers>)
The COMPGED function returns a value based on the difference between the two character strings. The default
values for minor transpositions and substitutions tend to be 100 or less. Using the COMPGED function as part of
the criteria
in the JOIN would allow for greater flexibility and not require strict equality, making “fuzzier” matches
possible. Technically, we are still matching on a binary outcome
—whether COMPGED returns a value less than
100 or not. It’s the method of calculating that value that allows for some fuzziness.
proc sql noprint ;
create table inner_join2a as
select ref.*, band
from ref inner join chk
on (compged(ref.nino,chk.nino) le 100)
order by band, lname, fname ;
quit ;
Note that in order to use COMPGED, we must specify the tolerance
—
that is, what is the maximum level of
dissimilarity between two values of NINO that is acceptable? By specifying that the value returned by the
PROC SQL
NESUG 2011
3
COMPGED function must be less than or equal to 100, we are allowing for only very slight differences. COMPGED
will return 0 if two strings are equal, so we are allowing strict equality as well.
The resulting data set has 8 observations, but all is not well. Robert Plant from Led Zeppelin is missing, and Dave
Davies returned a match
for someone in a band called “The Minks.”
Obs FNAME
LNAME
SSN NINO
DOB BAND
1 John
Baldwin
123-45-6789 BM567891E
1/3/1946 Led Zeppelin
2 John
Bonham
456-78-9012 YZ891234G
5/31/1948 Led Zeppelin
3 Jimmy
Page
345-67-8901 HL234567B
1/9/1944 Led Zeppelin
4 Mick
Avory
890-12-3456 HL345678C
2/15/1944 The Kinks
5 Dave
Davies
678-90-1234 KR789123F
2/3/1947 The Kinks
6 Ray
Davies
567-89-0123 HL456789D
6/21/1944 The Kinks
7 Peter
Quaife
789-01-2345 AA123456A
12/31/1943 The Kinks
8 Dave
Davies
678-90-1234 KR789123F
2/3/1947 The Minks
The members of these bands have been assigned both British National Insurance Numbers (NINOs) and
American Social Security Numbers (SSNs) for illustration purposes. While we were unable to match Robert Plant
on his NINO, perhaps looking for fuzzy matches on SSN would return the desired match.
The following code:
proc sql noprint ;
create table inner_join2b as
select ref.*, band
from ref inner join chk
on (compged(ref.ssn,chk.ssn) le 100)
order by band, lname, fname ;
quit ;
Returns the following error message:
ERROR: Function COMPGED requires a character expression as argument 1.
ERROR: Function COMPGED requires a character expression as argument 2.
ERROR: Expression using less than or equal (<=) has components that are of
different data types.
COMPGED requires character expressions, or strings. The PUT function returns a string value, and can be nested
in the COMPGED function.
The following code:
proc sql noprint ;
create table inner_join2c as
select ref.*, band
from ref inner join chk
on (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100)
order by band, lname, fname ;
quit ;
Creates the following data set:
Obs FNAME
LNAME
SSN NINO
DOB BAND
1 Ray
Davies
567-89-0123 HL456789D
6/21/1944 Blue Devils
2 John
Baldwin
123-45-6789 BM567891E
1/3/1946 Led Zeppelin
3 John
Bonham
456-78-9012 YZ891234G
5/31/1948 Led Zeppelin
4 Jimmy
Page
345-67-8901 HL234567B
1/9/1944 Led Zeppelin
5 Robert Plant
234-56-7890 YZ912345H
8/20/1948 Led Zeppelin
6 Mick
Avory
890-12-3456 HL345678C
2/15/1944 The Kinks
7 Dave
Davies
678-90-1234 KR789123F
2/3/1947 The Kinks
8 Ray
Davies
567-89-0123 HL456789D
6/21/1944 The Kinks
9 Peter
Quaife
789-01-2345 AA123456A
12/31/1943 The Kinks
10 Dave
Davies
678-90-1234 KR789123F
2/3/1947 The Minks
PROC SQL
NESUG 2011
4
We have successfully rescued Robert Plant from limbo but at the cost of exacerbating the over-match issue from
the previous JOIN. Now we have Dave Davies from The Minks as well as Ray Davies from Blue Devils. Clearly,
we must refine the match criteria.
USING A WEIGHTED MATCH SCORE
Until now, we have allowed for fuzzy matches, but we have treated them the same as exact matches. What would
happen if we assigned more weight to exact matches while still allowing for fuzzy matches?
The code below creates a weighted match score (see Teres 2009) that gives twice as much weight to exact
matches as fuzzy matches. This match score is then used to select the observation with the closest match as
defined by the HAVING clause.
The following query:
proc sql ;
create table inner_join3b as
select ref.*,
band,
((2*(ref.ssn eq chk.ssn)) +
(compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100)) as wms
from ref inner join chk
on (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100)
group by ref.ssn
having calculated wms eq max(calculated wms)
order by band, lname, fname ;
quit ;
Creates the following data set:
Obs FNAME
LNAME
SSN NINO
DOB BAND
wms
1 John
Baldwin 123-45-6789 BM567891E
1/3/1946 Led Zeppelin
3
2 John
Bonham
456-78-9012 YZ891234G
5/31/1948 Led Zeppelin
3
3 Jimmy
Page
345-67-8901 HL234567B
1/9/1944 Led Zeppelin
1
4 Robert Plant
234-56-7890 YZ912345H
8/20/1948 Led Zeppelin
3
5 Mick
Avory
890-12-3456 HL345678C
2/15/1944 The Kinks
1
6 Dave
Davies
678-90-1234 KR789123F
2/3/1947 The Kinks
3
7 Ray
Davies
567-89-0123 HL456789D
6/21/1944 The Kinks
3
8 Peter
Quaife
789-01-2345 AA123456A 12/31/1943 The Kinks
1
At this point, we have met the goal of combining the two data sets. However, the weighted match score (WMS)
takes on only 2 values, 1 and 3. It may be desirable to include all available identifying information in the creation of
the match score to allow for greater variability in the quality of the matches.
The following code expands the idea of including both the exact and fuzzy matches to all identifiers. Note the use
of the nested PUT function with the FORMAT MMDDYY10 for DOB.
The following query:
proc sql noprint ;
create table inner_join5 as
select ref.*,
band,
((2*(ref.ssn eq chk.ssn)) +
(compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) +
(2*(ref.nino eq chk.nino)) +
(compged(ref.nino, chk.nino) le 100) +
(2*(ref.fname eq chk.fname)) +
(compged(ref.fname, chk.fname) le 100) +
(2*(ref.lname eq chk.lname)) +
(compged(ref.lname, chk.lname) le 100) +
(2*(ref.dob eq chk.dob)) +
(compged(put(ref.dob,mmddyy10.), put(chk.dob,mmddyy10.)) le 100)) as
wms
PROC SQL
NESUG 2011
5
from ref inner join chk
on (
(compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) or
(compged(ref.nino, chk.nino) le 100) or
(compged(ref.fname, chk.fname) le 100) or
(compged(ref.lname, chk.lname) le 100) or
(compged(put(ref.dob,mmddyy10.), put(chk.dob,mmddyy10.)) le 100)
)
group by ref.ssn
having ((calculated wms eq max(calculated wms)))
order by band, ref.lname, ref.fname ;
quit ;
Produces the following data set:
Obs FNAME
LNAME
SSN NINO
DOB BAND
wms
1 John
Baldwin 123-45-6789 BM567891E
1/3/1946 Led Zeppelin
12
2 John
Bonham
456-78-9012 YZ891234G
5/31/1948 Led Zeppelin
15
3 Jimmy
Page
345-67-8901 HL234567B
1/9/1944 Led Zeppelin
11
4 Robert Plant
234-56-7890 YZ912345H
8/20/1948 Led Zeppelin
12
5 Mick
Avory
890-12-3456 HL345678C
2/15/1944 The Kinks
11
6 Dave
Davies
678-90-1234 KR789123F
2/3/1947 The Kinks
13
7 Ray
Davies
567-89-0123 HL456789D
6/21/1944 The Kinks
15
8 Peter
Quaife
789-01-2345 AA123456A 12/31/1943 The Kinks
7
Here the match score is weighted so that it favors exact matches, assigning them twice the weight of fuzzier
matches. It becomes much clearer that some matches are better than others. For example, Peter Quaife had a
WMS value of 7 because while his entries in REF and CHK matched exactly on NINO and DOB, there were slight
discrepancies in FNAME, LNAME, and SSN (each differed by one character). On the other hand, John Bonham
and Ray Davies had much stronger matches, with WMS values of 15, indicating that they were exact matches.
Because different identifiers might be more or less important, however, it’s hard to say whether matches with
weighted match scores of 15 are “twice as good” as a match with a value of 7.
This approach could be taken even further. Matches on SSN, fuzzy or otherwise, could be given more weight than
matches on first name, for example. COMPGED options can also
be refined to change the “penalties” associated
with certain types of mismatches between character expressions to best fit any given context.
CONCLUSIONS
The use of the COMPGED function in PROC SQL joins offers an incredible amount of flexibility when combining
data sets. Allowing for fuzzy matches opens the door to over-matching a data set by including spurious matches,
however. Creating a weighted match score to favor exact matches is a useful tool for refining results when
combining data sets.
REFERENCES
SAS Institute Inc. 2011. SAS 9.2 Language Reference: Dictionary, Fourth Edition. Cary, NC: SAS Institute Inc.
“
Functions and Call Routines: COMPGED
.”
Staum, Paulette (2007).
“Fuzzy Matching using the COMPGED Function
.
”
Proceedings of the 2007 NorthEast
SAS Users Group Conference.
Teres, Jedediah J (2009).
“
Using SQL Joins to Perform Weighted Matches on Multiple Identifiers
.”
Proceedings of
the 2009 NorthEast SAS Users Group Conference.
ACKNOWLEDGMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
The author wishes to thank Paulette Staum for her encouragement, and Chris Bost and Aaron Hill for reviewing
this paper.
PROC SQL
NESUG 2011
6
CONTACT INFORMATION
Jedediah Teres
MDRC
16 East 34
th
St, 19
th
Floor
New York, NY 10016
(212) 340-8807 telephone
(212) 684-0832 fax
jed.teres@mdrc.org
www.mdrc.org
PROC SQL
NESUG 2011
Dostları ilə paylaş: |