Fuzzy matching in SAS is a technique of deciding programmatically if one word is identical to the other.
For example, deciding if ‘John Doe‘ is identical to ‘Johnny Doe‘.
In SAS, several functions will let you carry out a fuzzy match. I’ll present you the most commonly used functions with an example.
What Is Fuzzy Matching in Sas?
Fuzzy matching in SAS refers to the process of finding records in one or more datasets that are approximately, but not exactly, equal to a specified target record. Unlike exact matching, where records must match perfectly to be considered a match, fuzzy matching allows for some level of discrepancy or error. This is particularly useful when dealing with human-entered data, which may contain typos, abbreviations, or other inconsistencies.
SAS offers various functions and techniques for fuzzy matching, such as the COMPGED
, COMPLEV
, SPEDIS
, and SOUNDEX
functions. These functions calculate similarity scores or distances between strings, helping you identify records that are likely to be equivalent. Fuzzy matching is commonly used in data cleaning, deduplication, and record linkage tasks, making it an essential skill for data analysts and SAS programmers.
1. COMPARE Function
The COMPARE function returns the position of the leftmost character by which two strings differ or returns 0 if there isn’t any difference.
This function is most helpful when comparing two strings that must be identical, however, could only differ when
- The difference in cases like lowercase or uppercase
- There are leading or trailing blanks difference
This function takes the place of the following code:
if strip(upcase(name1))=strip(upcase(name2)) then do
The COMPARE function can be used to make an identical comparison.
if compare(name1,name2,'il')=0 then do
Read: Compare Function in SAS
2. COMPGED Function
The COMPGED function returns the generalised edit distance between two strings. Specifically, the COMPED function returns a generalisation of the Levenshtein edit distance, a measure of dissimilarity between two strings.
The Levenshtein edit distance is the variety of operations (deletions, insertions, or replacement) of a single character that can be required to transform string-1 into string-2.
Each operation basically ‘costs’ a particular value. For instance, if string-1 is the result of inserting a space into string-2, this has a cost of 10. The more dramatic the operation, the larger the cost. The COMPGED function will return the full cost for all operations that occur.
The costs returned by COMPGED will be altered by using CALL COMPOST so that the cost is specific to your needs.
A typical use of the COMPGED function is to compare email addresses.
Example: Using the COMPGED function to compare if two email addresses are identical
data test;
email1='[email protected]';
email2='[email protected]';
email3='[email protected]';
cost1=compged(email1, email2);
cost2=compged(email2, email3);
run;
The cost of comparing email1 – [email protected] with email2 – is [email protected] is 30. This is the cost of adding underscore (_) to email 1.
The cost of comparing email1 – [email protected] with email3 – [email protected] is 300. This involves the replacement of uppercase letters and adding an underscore.
COMPLEV Function
The COMPLEV function is similar to the COMPGED function. The difference is that the Levenshtein edit distance that COMPLEV computes is a special case of the generalised edit distance that COMPGED computes.
COMPLEV executes way more quickly than COMPGED. However, the COMPLEV function is just not as highly effective or versatile as the COMPGED function.
The Syntax for the COMPLEV function is as below.
COMPLEV(string-1, string-2 <, cutoff> <, modifier(s)>)
String-1 and String-2 are the two strings that you want to compare.
Cutoff specifies a numeric constant, variable, or expression. If the actual Levenshtein edit distance is greater than the cutoff value, the value returned is equal to the cutoff value.
TIP: If the values of string-1 and string-2 are long, use a small cutoff value to improve the COMPLEV function’s efficiency.
The optional modifiers for COMPLEV function are:
- i or I → Using this modifier will ignore the case in string-1 and string-2 during comparison.
- l or L → It removes leading blanks in string-1 and string-2 before comparing the values. The COMPLEV function ignores trailing blanks.
- n or N → It removes quotation marks from any argument that is an n-literal and ignores the case of string-1 and string-2.
- : (colon)→ It truncates the longer of string-1 or string-2 to the length of the shorter string, or to 1, whichever is greater.
TIP: COMPLEV ignores blanks that are used as modifiers.
Example:
data test;
string1="[email protected]";
string2="[email protected]";
result=complev(string1, string2,1);
run;
Output:
The COMPLEV function is usually most helpful when comparing simple strings and when the speed of comparison is vital.
3. SPEDIS Function
The SPEDIS function is the oldie-but-a-goodie of the bunch, which can be used to perform fuzzy matching in SAS. The SPEDIS function determines the probability of two phrases matching, expressed as the asymmetric spelling distance between the 2 phrases.
SPEDIS is much like COMPGED in that it assigns a cost to every operation equivalent to swapping, appending and deleting. SPEDIS will sum the costs and then divide the sum by the length of the first argument.
Note: The order of the arguments is essential in the SPEDIS function. Therefore, SPEDIS(X, Y) doesn’t always equal SPEDIS(Y, X).
Syntax:
SPEDIS(query, keyword)
query – This argument is used to identify the word to query for the likelihood of a match. SPEDIS removes trailing blanks before comparing the value.
keyword – It specifies a target word for the query.
Example:
data test;
string1="enviroment";
string2="environment";
result=spedis(string1, string2);
run;
Output:
4. SOUNDEX Function
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.
Syntax:SOUNDEX(argument)
The argument specifies any SAS character expression.
The SOUNDEX function returns a copy of the encoded argument using the following steps.
Step 1: Retain the first letter in the argument and discard the following letters:
A E H I O U W Y
Step 2: Assign the following numbers to these classes of letters:
- B F P V
- C G J K Q S X Z
- D T
- L
- M N
- R
Additional rules:
- The string should be treated as one letter if it has double letters.
For example- for expresso, the code is E2162
E- Ignored, x-2,p-1, r-6,e-Ignored,ss-2,o-Ignored
- If there are two or more adjacent letters or side-by-side have the same classification from Step 2, it will be interpreted as one letter.
Example – For PV Sindhu, the code is P253
PV – Ignored
S – 2, I – Ignored, N – 5, D – 3, H, U – Ignored
Example:
data test;
string1="expresso";
string2="tokio";
result1=soundex(string1);
result2=soundex(string2);
run;
Output: