Fuzzy matching in SAS is a technique of deciding programmatically if one word is identical to the other.
For example, deciding if ‘John Doe‘ is identical to ‘Johnny Doe‘.
In SAS, there are a number of function that will let you carry out a fuzzy match. I’ll present you the most commonly used functions with an example.
1. COMPARE Function
The COMPARE function returns the position of the left most character by which two strings differ, or returns 0 if there isn’t any different.
This function is most helpful when comparing two strings that must be the identical, however, could only differ when
- The difference in cases like lowercase or uppercase
- There are leading or trailing blanks difference
This function takes the place of the following code:
if strip(upcase(name1))=strip(upcase(name2)) then do;
The COMPARE function can be used to make an identical comparison.
if compare(name1,name2,'il')=0 then do;
Read: Compare Function in SAS
2. COMPGED Function
The COMPGED function returns the generalized edit distance between two strings. Specifically, the
COMPED function returns a generalization of the Levenshtein edit distance, which is a measure of dissimilarity between two strings.
The Levenshtein edit distance is the variety of operations (deletions, insertions, or replacement) of a single character which can be required to transform string-1 into string-2.
Each operation basically ‘costs’ a certain value. For instance, if string-1 is the results of inserting a space into string-2, this has a cost of 10. The more dramatic the operation, the larger the cost. The COMPGED function will return the full cost for all operations that occur.
The costs returned by COMPGED will be altered by using CALL COMPOST in order that the cost is specific to your needs.
A typical use of the COMPGED function is to compare email addresses.
Example: Using COMPGED function to compare if two email addresses are identical
data test; email1='JohnDoe@abc.com'; email2='John_Doe@abc.com'; email@example.com'; cost1=compged(email1, email2); cost2=compged(email2, email3); run; proc print;
Note that the cost of comparing email1 – JohnDoe@abc.com with email2 – is John_Doe@abc.com is 30. This is the cost of adding underscore (_) to email 1.
The cost of comparing email1 – JohnDoe@abc.com with email3 – firstname.lastname@example.org is 300. This involves the replacement of uppercase letters and adding an underscore.
The COMPLEV function is similar to the COMPGED function. The difference is that the Levenshtein edit distance that’s computed by COMPLEV is a special case of the generalized edit distance that’s computed by COMPGED.
COMPLEV executes way more quickly than COMPGED. However, the COMPLEV function is just not as highly effective or versatile as the COMPGED function.
The Syntax for COMPLEV function is as below.
COMPLEV(string-1, string-2 <, cutoff> <, modifier(s)>)
String-1 and String-2 are the two strings which you want to compare.
Cutoff specifies a numeric constant, variable, or expression. If the actual Levenshtein edit distance is greater than the value of cutoff, the value that is returned is equal to the value of cutoff.
TIP: If the values of string-1 and string-2 are long use a small value of cutoff to improve the efficiency of COMPLEV function.
The optional modifiers for COMPLEV function are:
- i or I → Using this modifier will ignore the case in string-1 and string-2 while comparison.
- l or L → It removes leading blanks in string-1 and string-2 before comparing the values. The COMPLEV function ignores trailing blanks.
- n or N → It removes quotation marks from any argument that is an n-literal and also ignores the case of string-1 and string-2.
- : (colon)→ It truncates the longer of string-1 or string-2 to the length of the shorter string, or to 1, whichever is greater.
TIP: COMPLEV ignores blanks that are used as modifiers.
data test; string1="JohnDoe@abc.com"; string2="JohnDoe@cdb.com"; result=complev(string1, string2,1); run;
The COMPLEV function is usually most helpful when comparing simple strings and when the speed of comparison is vital.
3. SPEDIS Function
The SPEDIS function is the oldie-but-a-goodie of the bunch which can be used to perform fuzzy matching in SAS. The SPEDIS function determines the probability of two phrases matching, expressed as the asymmetric spelling distance between the 2 phrases.
SPEDIS is much like COMPGED in that it assigns a cost to the every operation equivalent to swap, append and delete. SPEDIS will sum the costs and then divide the sum by the length of the first argument.
Note: The order of the arguments important in the SPEDIS function. Therefore, SPEDIS(X, Y) doesn’t at all times equal SPEDIS(Y, X).
query – This argument is used to identify the word to query for the likelihood of a match. SPEDIS removes trailing blanks before comparing the value.
keyword – It specifies a target word for the query.
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.
The argument specifies any SAS character expression.
The SOUNDEX function returns a copy of the argument encoded by using the following steps.
Step 1: Retain the first letter in the argument and discard the following letters:
A E H I O U W Y
Step 2: Assign the following numbers to these classes of letters:
- B F P V
- C G J K Q S X Z
- D T
- M N
- If the string has any double letters, they should be treated as one letter.
Example for expresso, the code is E2162
E- Ignored, x-2,p-1, r-6,e-Ignored,ss-2,o-Ignored
- If there are two or more adjacent letters or side-by-side have the same classification from Step 2, it will be interpreted as one letter.
Example – For PV Sindhu the code is P253
PV – Ignored
S – 2, I – Ignored, N – 5, D – 3, H, U – Ignored
data test; string1="expresso"; string2="tokio"; result1=soundex(string1); result2=soundex(string2); run;