Data Cleaning in SAS: A Complete Guide with Examples

Home Contact

Base SAS

Subhro Kar

December 19, 2024

2 min

Removing Leading Zeros from Numbers

Separating Combined Text Fields

Cleaning Up Text with COMPRESS Function

Using TRANSLATE Function for Character Replacement

Handling Missing Values Like a Pro

Putting It All Together: A Complete Cleaning Example

Key Takeaways

Data Cleaning in SAS: A Complete Guide with Examples

Data cleaning is like organizing your messy room before guests arrive - it’s essential for making your data analysis shine! When you’re working with real-world data, you’ll often find it’s not as clean as you’d like. Don’t worry though - SAS has some fantastic tools to help you tidy up your data quickly and efficiently.

In this guide, we’ll walk through practical examples using SAS functions like INDEXC, SUBSTR, LENGTH, COMPRESS, and TRANSLATE to tackle common data cleaning challenges. Let’s dive in!

Removing Leading Zeros from Numbers

Ever received data where numbers look like “00123” or “000456”? These leading zeros can cause problems when you’re trying to do calculations or comparisons. Here’s how to clean them up:

Related: If you need to add leading zeros instead, check out our guide on Add Leading Zeros in SAS.

data messy_data;
    input id_number $10.;
    datalines;
012345
000123
00789A
00123X
;
run;

data clean_data;
    set messy_data;
    /* Find the position of the first non-zero digit */
    first_digit = indexc(id_number, '123456789');
    
    /* Extract everything from the first non-zero digit onwards */
    if first_digit > 0 then
        clean_id = substr(id_number, first_digit);
    else
        clean_id = id_number; /* Keep original if no digits found */
run;

proc print data=clean_data;
    title "Before and After: Removing Leading Zeros";
run;

What’s happening here?

INDEXC finds the position of the first character that matches any of the digits 1-9
SUBSTR extracts the string starting from that position
The result: “012345” becomes “12345”, and “000123” becomes “123”

Learn More: For a deep dive into the SUBSTR function, see our complete SUBSTR in SAS tutorial.

Separating Combined Text Fields

Sometimes you’ll get data where city and state are mashed together like “LosAngelesCA”. Let’s separate them:

data combined_data;
    input City_State $20.;
    datalines;
LosAngelesCA
NewYorkNY
AustinTX
ChicagoIL
;
run;

data separated_data;
    set combined_data;
    total_length = length(City_State);
    
    /* Assuming state is always the last 2 characters */
    state = substr(City_State, total_length - 1, 2);
    city = substr(City_State, 1, total_length - 2);
run;

proc print data=separated_data;
    title "Separating City and State";
run;

Pro tip: This works great when you know the state abbreviation is always 2 characters. For more complex separations, you might need different approaches!

Cleaning Up Text with COMPRESS Function

Removing Letters from Street Numbers

Street addresses sometimes get messy with letters mixed into numbers. Here’s how to clean them:

data messy_addresses;
    input street_number $10.;
    datalines;
123A
456B
789
101C
;
run;

data clean_addresses;
    set messy_addresses;
    /* Remove all letters, keep only digits */
    clean_number = compress(street_number, , 'a'); /* 'a' removes alphabetic chars */
run;

proc print data=clean_addresses;
    title "Cleaning Street Numbers";
run;

Removing Unwanted Punctuation

Text fields often come with extra punctuation that you don’t need:

data raw_text_data;
    input original_text $50.;
    datalines;
Hello, world! Welcome@@@ to SAS.
Goodbye!!! See you soon.
What# is this$?
;
run;

data clean_text_data;
    set raw_text_data;
    /* Remove specific unwanted characters */
    cleaned_text = compress(original_text, '.,!@#$%^&*()_+');
run;

proc print data=clean_text_data;
    title "Before and After: Removing Punctuation";
run;

The result: “Hello, world! Welcome@@@ to SAS.” becomes “Hello world Welcome to SAS”

Deep Dive: Want to master the COMPRESS function? Read our comprehensive Compress Function in SAS guide.

Using TRANSLATE Function for Character Replacement

The TRANSLATE function is perfect when you need to swap one character for another. Here’s a common scenario:

Converting European Number Format

In many European countries, decimals use commas instead of periods. Let’s standardize them:

data european_numbers;
    input price_text $10.;
    datalines;
123,45
678,90
1234,56
;
run;

data standardized_numbers;
    set european_numbers;
    /* Replace commas with periods */
    us_format = translate(price_text, '.', ',');
    
    /* Convert to actual numeric value */
    price_numeric = input(us_format, best12.);
run;

proc print data=standardized_numbers;
    title "Converting European Number Format";
    format price_numeric dollar8.2;
run;

Fixing Common Typos

You can also use TRANSLATE to fix common data entry mistakes:

data typo_data;
    input response $10.;
    datalines;
YES
yes
Yes
NO
no
No
;
run;

data fixed_data;
    set typo_data;
    /* Convert to uppercase for consistency */
    clean_response = upcase(response);
run;

proc print data=fixed_data;
    title "Standardizing Responses";
run;

Handling Missing Values Like a Pro

Missing data is everywhere in real-world datasets. Here are some smart ways to handle it:

Essential Reading: For a complete guide on this topic, check out Missing Values in SAS.

Replacing Missing Values with Mean

data sample_data;
    input id score;
    datalines;
1 85
2 .
3 92
4 78
5 .
6 88
;
run;

/* First, calculate the mean of non-missing values */
proc means data=sample_data noprint;
    var score;
    output out=mean_score mean=avg_score;
run;

/* Merge back and replace missing values */
data complete_data;
    if _n_ = 1 then set mean_score;
    set sample_data;
    
    if missing(score) then
        final_score = avg_score;
    else
        final_score = score;
        
    drop _type_ _freq_ avg_score;
run;

proc print data=complete_data;
    title "Replacing Missing Values with Mean";
    format final_score 8.1;
run;

Creating Missing Value Indicators

Sometimes it’s useful to know which values were originally missing:

data indicator_example;
    input id income;
    datalines;
1 50000
2 .
3 75000
4 .
5 60000
;
run;

data with_indicators;
    set indicator_example;
    
    /* Create indicator for missing values */
    income_missing = (missing(income));
    
    /* Replace missing with median (example value) */
    if missing(income) then income_clean = 60000;
    else income_clean = income;
run;

proc print data=with_indicators;
    title "Missing Value Indicators";
run;

Putting It All Together: A Complete Cleaning Example

Let’s combine several techniques in one comprehensive example:

data messy_customer_data;
    input customer_id $10. name $20. phone $15. income $10.;
    datalines;
00123 John,Doe!!! (555)123-4567 45,000
00456 Jane@Smith (555)987-6543 .
00789 Bob##Johnson 5559876543 67,500
;
run;

data clean_customer_data;
    set messy_customer_data;
    
    /* Clean customer ID - remove leading zeros */
    first_digit = indexc(customer_id, '123456789');
    if first_digit > 0 then
        clean_id = substr(customer_id, first_digit);
    else
        clean_id = customer_id;
    
    /* Clean name - remove unwanted characters */
    clean_name = compress(name, ',!@#$%^&*()');
    
    /* Clean phone - keep only digits */
    clean_phone = compress(phone, , 'kd'); /* 'kd' keeps only digits */
    
    /* Clean income - replace comma with period and convert */
    income_standard = translate(income, '.', ',');
    if not missing(income_standard) then
        income_numeric = input(income_standard, best12.);
    
    /* Keep only the cleaned variables */
    keep clean_id clean_name clean_phone income_numeric;
run;

proc print data=clean_customer_data;
    title "Complete Data Cleaning Example";
    format income_numeric dollar10.;
run;

Key Takeaways

Data cleaning doesn’t have to be scary! Here are the main points to remember:

INDEXC and SUBSTR are great for finding and extracting specific parts of strings
COMPRESS is your friend for removing unwanted characters
TRANSLATE helps you swap characters consistently
Always handle missing values thoughtfully - don’t just ignore them
Test your cleaning logic on a small sample first

Remember, clean data leads to reliable analysis. Take the time to understand your data’s quirks, and use these SAS functions to transform messy data into something you can work with confidently.

Now that you’ve mastered data cleaning basics, explore these related topics: