Using sql joins to Perform Fuzzy Matches on Multiple Identifiers

Yüklə 262,04 Kb.

Pdf görüntüsü

tarix	16.08.2018
ölçüsü	262,04 Kb.
	#63155

1

Using SQL Joins to Perform Fuzzy Matches on Multiple Identifiers

Jedediah J. Teres, MDRC, New York, NY

ABSTRACT

Matching observations from different data sources is problematic without a reliable shared identifier. Using

multiple identifiers can be more restrictive as it requires multiple exact matches. One way around this is to create a

match score based on the number of matching identifiers. This score can be weighted to favor certain matches or

sets of matches (e.g., first name and last name) over others (e.g., first name and date of birth). This paper builds

on a previous paper describing a technique for creating and using a match score in the context of a SQL join to

find matches in cases when all identifiers are not exactly the same and allows for the use of COMPGED to find

close matches without requiring strict equality.

INTRODUCTION

This paper expands upon a previously described method of combining data sets using multiple variables as

matching criteria in PROC SQL by using COMPGED to allow for fuzzy matching. Prior to SAS 9.2, using

COMPGED in the context of a SQL JOIN produced a note to the log each time a character was compared to a

blank space. With the release of SAS 9.2, this is no longer an issue, and COMPGED can be used to expand the

flexibility of JOINS in SQL. Knowledge of the SELECT statement and JOINs in PROC SQL is assumed. Prior

knowledge of COMPGED is helpful but not necessary. An understanding of the PUT function is needed.

SAMPLE DATA SETS

Two SAS data sets are used for illustration purposes in this paper: REF and CHK.

Data set REF

Obs FNAME

LNAME

SSN NINO

DOB

1 John

Baldwin

123-45-6789 BM567891E

1/3/1946

2 Robert Plant

234-56-7890 YZ912345H

8/20/1948

3 Jimmy

Page

345-67-8901 HL234567B

1/9/1944

4 John

Bonham

456-78-9012 YZ891234G

5/31/1948

5 Ray

Davies

567-89-0123 HL456789D

6/21/1944

6 Dave

Davies

678-90-1234 KR789123F

2/3/1947

7 Peter

Quaife

789-01-2345 AA123456A

12/31/1943

8 Mick

Avory

890-12-3456 HL345678C

2/15/1944

Data set REF contains 5 variables and 8 observations. Each observation represents a unique sample member;

there are no duplicate observations in the data set REF.

Data set CHK

Obs FNAME

LNAME

SSN NINO

DOB BAND

1 John

Jones

123-45-6789 BM567891E

1/3/1946 Led Zeppelin

2 Robert Plant

234-56-7890 YZ912235H

8/20/1948 Led Zeppelin

3 Jimmy

Page

345-67-8907 HL234567D

1/9/1944 Led Zeppelin

4 John

Bonham 456-78-9012 YZ891234G

5/31/1948 Led Zeppelin

5 Ray

Davies 567-89-0123 HL456789D

6/21/1944 The Kinks

6 Dave

Davies 678-90-1234 KR789213F

2/3/1947 The Kinks

7 Pete

Quiafe 789-01-2346 AA132456A 12/31/1943 The Kinks

8 Mick

Avery

890-21-3456 HL345678C

2/15/1944 The Kinks

9 Jay

Davie

567-89-1023 HI636789D

6/12/1984 Blue Devils

10 Dave

David

678-90-1324 KR789213F

2/3/1974 The Minks

Data set CHK contains 6 variables and 10 observations. Each observation represents a unique individual; there

are no duplicate observations in the data set CHK. The data set CHK contains the same variables as the data set

REF, but there is an additional variable, BAND. The goal is to merge the data set CHK to the data set REF in order

to pick up the BAND variable for the people in the sample.

PROC SQL

NESUG 2011

2

Using AN INNER JOIN TO COMBINE DATA SETS

In its most basic form, the syntax for combing data sets using an inner join in PROC SQL is as straightforward as

a match-merge in a DATA step. Here, the data sets REF and CHK are joined using the National Insurance

Number field (NINO) as the common identifier.

proc sql noprint ;

create table inner_join1 as

select ref.*, band

from ref inner join chk

on (ref.nino eq chk.nino)

order by band, lname, fname ;

quit ;

The resulting data set contains only 4 observations. The original data set REF had 8 observations, so 3

observations failed to return a match.

Obs FNAME LNAME

SSN NINO

DOB BAND

1 John

Baldwin

123-45-6789 BM567891E

1/3/1946 Led Zeppelin

2 John

Bonham

456-78-9012 YZ891234G

5/31/1948 Led Zeppelin

3 Mick

Avory

890-12-3456 HL345678C

2/15/1944 The Kinks

4 Ray

Davies

567-89-0123 HL456789D

6/21/1944 The Kinks

Upon further investigation, it becomes clear that several NINOs do not match.

John

Baldwin

BM567891E

John

Jones

BM567891E

Robert Plant

YZ912345H

Robert

Plant

YZ912235H

Jimmy

Page

HL234567B

Jimmy

Page

HL234567D

John

Bonham

YZ891234G

John

Bonham

YZ891234G

Ray

Davies

HL456789D

Ray

Davies

HL456789D

Dave

Davies

KR789123F

Dave

Davies

KR789213F

Peter

Quaife

AA123456A

Pete

Quiafe

AA132456A

Mick

Avory

HL345678C

Mick

Avery

HL345678C

These are all fairly minor differences consisting mostly of the transposition of two numbers or the errant

substitution of a letter. Each represents a possible data entry error. With a data set this small, values could be

edited so they match on both files. In a larger data set, that would not be a viable solution, so it's well worth

considering what other options are available.

COMPGED

Since NINO is a character variable, we can use the COMPGED function to compute the generalized edit distance

between the two character values.

The SAS documentation on the COMPGED function states that “

Generalized

edit distance is a generalization of Levenshtein edit distance, which is a measure of dissimilarity between two

strings.”

The finer points of COMPGED are beyond the scope of this paper, but the general syntax takes the form

COMPGED(string-1, string-2 <,cutoff> <,modifiers>)

The COMPGED function returns a value based on the difference between the two character strings. The default

values for minor transpositions and substitutions tend to be 100 or less. Using the COMPGED function as part of

the criteria

in the JOIN would allow for greater flexibility and not require strict equality, making “fuzzier” matches

possible. Technically, we are still matching on a binary outcome

—whether COMPGED returns a value less than

100 or not. It’s the method of calculating that value that allows for some fuzziness.

proc sql noprint ;

create table inner_join2a as

select ref.*, band

from ref inner join chk

on (compged(ref.nino,chk.nino) le 100)

order by band, lname, fname ;

quit ;

Note that in order to use COMPGED, we must specify the tolerance

—

that is, what is the maximum level of

dissimilarity between two values of NINO that is acceptable? By specifying that the value returned by the

PROC SQL

NESUG 2011

COMPGED function must be less than or equal to 100, we are allowing for only very slight differences. COMPGED

will return 0 if two strings are equal, so we are allowing strict equality as well.

The resulting data set has 8 observations, but all is not well. Robert Plant from Led Zeppelin is missing, and Dave

Davies returned a match

for someone in a band called “The Minks.”

Obs FNAME

LNAME

SSN NINO

DOB BAND

1 John

Baldwin

123-45-6789 BM567891E

1/3/1946 Led Zeppelin

2 John

Bonham

456-78-9012 YZ891234G

5/31/1948 Led Zeppelin

3 Jimmy

Page

345-67-8901 HL234567B

1/9/1944 Led Zeppelin

4 Mick

Avory

890-12-3456 HL345678C

2/15/1944 The Kinks

5 Dave

Davies

678-90-1234 KR789123F

2/3/1947 The Kinks

6 Ray

Davies

567-89-0123 HL456789D

6/21/1944 The Kinks

7 Peter

Quaife

789-01-2345 AA123456A

12/31/1943 The Kinks

8 Dave

Davies

678-90-1234 KR789123F

2/3/1947 The Minks

The members of these bands have been assigned both British National Insurance Numbers (NINOs) and

American Social Security Numbers (SSNs) for illustration purposes. While we were unable to match Robert Plant

on his NINO, perhaps looking for fuzzy matches on SSN would return the desired match.

The following code:

proc sql noprint ;

create table inner_join2b as

select ref.*, band

from ref inner join chk

on (compged(ref.ssn,chk.ssn) le 100)

order by band, lname, fname ;

quit ;

Returns the following error message:

ERROR: Function COMPGED requires a character expression as argument 1.

ERROR: Function COMPGED requires a character expression as argument 2.

ERROR: Expression using less than or equal (<=) has components that are of

different data types.

COMPGED requires character expressions, or strings. The PUT function returns a string value, and can be nested

in the COMPGED function.

The following code:

proc sql noprint ;

create table inner_join2c as

select ref.*, band

from ref inner join chk

on (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100)

order by band, lname, fname ;

quit ;

Creates the following data set:

Obs FNAME

LNAME

SSN NINO

DOB BAND

1 Ray

Davies

567-89-0123 HL456789D

6/21/1944 Blue Devils

2 John

Baldwin

123-45-6789 BM567891E

1/3/1946 Led Zeppelin

3 John

Bonham

456-78-9012 YZ891234G

5/31/1948 Led Zeppelin

4 Jimmy

Page

345-67-8901 HL234567B

1/9/1944 Led Zeppelin

5 Robert Plant

234-56-7890 YZ912345H

8/20/1948 Led Zeppelin

6 Mick

Avory

890-12-3456 HL345678C

2/15/1944 The Kinks

7 Dave

Davies

678-90-1234 KR789123F

2/3/1947 The Kinks

8 Ray

Davies

567-89-0123 HL456789D

6/21/1944 The Kinks

9 Peter

Quaife

789-01-2345 AA123456A

12/31/1943 The Kinks

10 Dave

Davies

678-90-1234 KR789123F

2/3/1947 The Minks

PROC SQL

NESUG 2011

We have successfully rescued Robert Plant from limbo but at the cost of exacerbating the over-match issue from

the previous JOIN. Now we have Dave Davies from The Minks as well as Ray Davies from Blue Devils. Clearly,

we must refine the match criteria.

USING A WEIGHTED MATCH SCORE

Until now, we have allowed for fuzzy matches, but we have treated them the same as exact matches. What would

happen if we assigned more weight to exact matches while still allowing for fuzzy matches?

The code below creates a weighted match score (see Teres 2009) that gives twice as much weight to exact

matches as fuzzy matches. This match score is then used to select the observation with the closest match as

defined by the HAVING clause.

The following query:

proc sql ;

create table inner_join3b as

select ref.*,

band,

((2*(ref.ssn eq chk.ssn)) +

(compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100)) as wms

from ref inner join chk

on (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100)

group by ref.ssn

having calculated wms eq max(calculated wms)

order by band, lname, fname ;

quit ;

Creates the following data set:

Obs FNAME

LNAME

SSN NINO

DOB BAND

wms

1 John

Baldwin 123-45-6789 BM567891E

1/3/1946 Led Zeppelin

2 John

Bonham

456-78-9012 YZ891234G

5/31/1948 Led Zeppelin

3 Jimmy

Page

345-67-8901 HL234567B

1/9/1944 Led Zeppelin

4 Robert Plant

234-56-7890 YZ912345H

8/20/1948 Led Zeppelin

5 Mick

Avory

890-12-3456 HL345678C

2/15/1944 The Kinks

6 Dave

Davies

678-90-1234 KR789123F

2/3/1947 The Kinks

7 Ray

Davies

567-89-0123 HL456789D

6/21/1944 The Kinks

8 Peter

Quaife

789-01-2345 AA123456A 12/31/1943 The Kinks

At this point, we have met the goal of combining the two data sets. However, the weighted match score (WMS)

takes on only 2 values, 1 and 3. It may be desirable to include all available identifying information in the creation of

the match score to allow for greater variability in the quality of the matches.

The following code expands the idea of including both the exact and fuzzy matches to all identifiers. Note the use

of the nested PUT function with the FORMAT MMDDYY10 for DOB.

The following query:

proc sql noprint ;

create table inner_join5 as

select ref.*,

band,

((2*(ref.ssn eq chk.ssn)) +

(compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) +

(2*(ref.nino eq chk.nino)) +

(compged(ref.nino, chk.nino) le 100) +

(2*(ref.fname eq chk.fname)) +

(compged(ref.fname, chk.fname) le 100) +

(2*(ref.lname eq chk.lname)) +

(compged(ref.lname, chk.lname) le 100) +

(2*(ref.dob eq chk.dob)) +

(compged(put(ref.dob,mmddyy10.), put(chk.dob,mmddyy10.)) le 100)) as

wms

PROC SQL

NESUG 2011

from ref inner join chk

on (

(compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) or

(compged(ref.nino, chk.nino) le 100) or

(compged(ref.fname, chk.fname) le 100) or

(compged(ref.lname, chk.lname) le 100) or

(compged(put(ref.dob,mmddyy10.), put(chk.dob,mmddyy10.)) le 100)

)

group by ref.ssn

having ((calculated wms eq max(calculated wms)))

order by band, ref.lname, ref.fname ;

quit ;

Produces the following data set:

Obs FNAME

LNAME

SSN NINO

DOB BAND

wms

1 John

Baldwin 123-45-6789 BM567891E

1/3/1946 Led Zeppelin

2 John

Bonham

456-78-9012 YZ891234G

5/31/1948 Led Zeppelin

3 Jimmy

Page

345-67-8901 HL234567B

1/9/1944 Led Zeppelin

4 Robert Plant

234-56-7890 YZ912345H

8/20/1948 Led Zeppelin

5 Mick

Avory

890-12-3456 HL345678C

2/15/1944 The Kinks

6 Dave

Davies

678-90-1234 KR789123F

2/3/1947 The Kinks

7 Ray

Davies

567-89-0123 HL456789D

6/21/1944 The Kinks

8 Peter

Quaife

789-01-2345 AA123456A 12/31/1943 The Kinks

Here the match score is weighted so that it favors exact matches, assigning them twice the weight of fuzzier

matches. It becomes much clearer that some matches are better than others. For example, Peter Quaife had a

WMS value of 7 because while his entries in REF and CHK matched exactly on NINO and DOB, there were slight

discrepancies in FNAME, LNAME, and SSN (each differed by one character). On the other hand, John Bonham

and Ray Davies had much stronger matches, with WMS values of 15, indicating that they were exact matches.

Because different identifiers might be more or less important, however, it’s hard to say whether matches with

weighted match scores of 15 are “twice as good” as a match with a value of 7.

This approach could be taken even further. Matches on SSN, fuzzy or otherwise, could be given more weight than

matches on first name, for example. COMPGED options can also

be refined to change the “penalties” associated

with certain types of mismatches between character expressions to best fit any given context.

CONCLUSIONS

The use of the COMPGED function in PROC SQL joins offers an incredible amount of flexibility when combining

data sets. Allowing for fuzzy matches opens the door to over-matching a data set by including spurious matches,

however. Creating a weighted match score to favor exact matches is a useful tool for refining results when

combining data sets.

REFERENCES

SAS Institute Inc. 2011. SAS 9.2 Language Reference: Dictionary, Fourth Edition. Cary, NC: SAS Institute Inc.

“

Functions and Call Routines: COMPGED

.”

Staum, Paulette (2007).

“Fuzzy Matching using the COMPGED Function

”

Proceedings of the 2007 NorthEast

SAS Users Group Conference.

Teres, Jedediah J (2009).

“

Using SQL Joins to Perform Weighted Matches on Multiple Identifiers

.”

Proceedings of

the 2009 NorthEast SAS Users Group Conference.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS

Institute Inc. in the USA and other countries. ® indicates USA registration.

The author wishes to thank Paulette Staum for her encouragement, and Chris Bost and Aaron Hill for reviewing

this paper.

PROC SQL

NESUG 2011

6

CONTACT INFORMATION

Jedediah Teres

MDRC

16 East 34

St, 19

Floor

New York, NY 10016

(212) 340-8807 telephone

(212) 684-0832 fax

jed.teres@mdrc.org

www.mdrc.org

PROC SQL

NESUG 2011

Yüklə 262,04 Kb.

Dostları ilə paylaş: