Fuzzy matching in SAS is a technique of deciding programmatically if one word is identical to the other.
For example, deciding if ’John Doe’ is identical to ’Johnny Doe‘.
In SAS, several functions will let you carry out a fuzzy match. I’ll present you the most commonly used functions with an example.
Fuzzy matching in SAS refers to the process of finding records in one or more datasets that are approximately, but not exactly, equal to a specified target record. Unlike exact matching, where records must match perfectly to be considered a match, fuzzy matching allows for some level of discrepancy or error. This is particularly useful when dealing with human-entered data, which may contain typos, abbreviations, or other inconsistencies.
SAS offers various functions and techniques for fuzzy matching, such as the COMPGED, COMPLEV, SPEDIS, and SOUNDEX functions. These functions calculate similarity scores or distances between strings, helping you identify records that are likely to be equivalent. Fuzzy matching is commonly used in data cleaning, deduplication, and record linkage tasks, making it an essential skill for data analysts and SAS programmers.
The COMPARE function returns the position of the leftmost character by which two strings differ or returns 0 if there isn’t any difference.
This function is most helpful when comparing two strings that must be identical, however, could only differ when
The difference in cases like lowercase or uppercase
There are leading or trailing blanks difference
This function takes the place of the following “code”:
if strip(upcase(name1))=strip(upcase(name2)) then do
The COMPARE function can be used to make an identical comparison.
if compare(name1,name2,'il')=0 then do
“Read”: Compare Function in SAS
The COMPGED function returns the [generalised](“http”://kodu.ut.ee/~varmo/tday-rouge/kaarik-slides.pdf) edit distance between two strings. Specifically, the COMPED function returns a generalisation of the Levenshtein edit distance, a measure of dissimilarity between two strings.
The Levenshtein edit distance is the variety of operations (deletions, insertions, or replacement) of a single character that can be required to transform string-1 into string-2.
Each operation basically ‘costs’ a particular value. For instance, if string-1 is the result of inserting a space into string-2, this has a cost of 10. The more dramatic the operation, the larger the cost. The COMPGED function will return the full cost for all operations that occur.
The costs returned by COMPGED will be altered by using CALL COMPOST so that the cost is specific to your needs.
A typical use of the COMPGED function is to compare email addresses.
data test;email1='JohnDoe@abc.com';email2='John_Doe@abc.com';email3='john_doe@abc.com';cost1=compged(email1, email2);cost2=compged(email2, email3);run;
The cost of comparing email1 - _JohnDoe@abc.com_ with email2 - is John_Doe@abc.com is 30. This is the cost of adding underscore (_) to email 1.
The cost of comparing email1 - _JohnDoe@abc.com_ with email3 - john_doe@abc.com is 300. This involves the replacement of uppercase letters and adding an underscore.

COMPGED Function Example in SAS
The COMPLEV function is similar to the COMPGED function. The difference is that the Levenshtein edit distance that COMPLEV computes is a special case of the generalised edit distance that COMPGED computes.
COMPLEV executes way more quickly than COMPGED. However, the COMPLEV function is just not as highly effective or versatile as the COMPGED function.
The Syntax for the COMPLEV function is as below.
COMPLEV(string-1, string-2 <, cutoff> <, modifier(s)>)
String-1 and String-2 are the two strings that you want to compare.
Cutoff specifies a numeric constant, variable, or expression. If the actual Levenshtein edit distance is greater than the cutoff value, the value returned is equal to the cutoff value.
“TIP”: If the values of string-1 and string-2 are long, use a small cutoff value to improve the COMPLEV function’s efficiency.
The optional modifiers for COMPLEV function “are”:
i or I → Using this modifier will ignore the case in string-1 and string-2 during comparison.
l or L → It removes leading blanks in string-1 and string-2 before comparing the values. The COMPLEV function ignores trailing blanks.
n or N → It removes quotation marks from any argument that is an n-literal and ignores the case of string-1 and string-2.
: (colon)→ It truncates the longer of string-1 or string-2 to the length of the shorter string, or to 1, whichever is greater.
“TIP”: COMPLEV ignores blanks that are used as modifiers.
“Example”:
data test;string1="JohnDoe@abc.com";string2="JohnDoe@cdb.com";result=complev(string1, string2,1);run;
“Output”:

COMPLEV Function Example in SAS
The COMPLEV function is usually most helpful when comparing simple strings and when the speed of comparison is vital.
The SPEDIS function is the oldie-but-a-goodie of the bunch, which can be used to perform fuzzy matching in SAS. The SPEDIS function determines the probability of two phrases matching, expressed as the asymmetric spelling distance between the 2 phrases.
SPEDIS is much like COMPGED in that it assigns a cost to every operation equivalent to swapping, appending and deleting. SPEDIS will sum the costs and then divide the sum by the length of the first argument.
“Note”: The order of the arguments is essential in the SPEDIS function. Therefore, SPEDIS(X, Y) doesn’t always equal SPEDIS(Y, X).
“Syntax”:
SPEDIS(query, keyword)
query - This argument is used to identify the word to query for the likelihood of a match. SPEDIS removes trailing blanks before comparing the value.
keyword - It specifies a target word for the query.
“Example”:
data test;string1="enviroment";string2="environment";result=spedis(string1, string2);run;
“Output”:

Example of SPEDIS function in SAS
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.
“Syntax”:
SOUNDEX(argument)
The argument specifies any SAS character expression.
The SOUNDEX function returns a copy of the encoded argument using the following steps.
Step “1”: Retain the first letter in the argument and discard the following “letters”:
A E H I O U W Y
Step “2”: Assign the following numbers to these classes of “letters”:
B F P V
C G J K Q S X Z
D T
L
M N
R
Additional “rules”:
For example- for expresso, the code is E2162
E- Ignored, x-2,p-1, r-6,e-Ignored,ss-2,o-Ignored
Example - For PV Sindhu, the code is P253
PV - Ignored
S - 2, I - Ignored, N - 5, D - 3, H, U - Ignored
“Example”:
data test;string1="expresso";string2="tokio";result1=soundex(string1);result2=soundex(string2);run;
“Output”:

Soundex Function Example in SAS
