Data cleaning is like organizing your messy room before guests arrive - it’s essential for making your data analysis shine! When you’re working with real-world data, you’ll often find it’s not as clean as you’d like. Don’t worry though - SAS has some fantastic tools to help you tidy up your data quickly and efficiently.
In this guide, we’ll walk through practical examples using SAS functions like INDEXC, SUBSTR, LENGTH, COMPRESS, and TRANSLATE to tackle common data cleaning challenges. Let’s dive in!
Ever received data where numbers look like “00123” or “000456”? These leading zeros can cause problems when you’re trying to do calculations or comparisons. Here’s how to clean them up:
Related: If you need to add leading zeros instead, check out our guide on Add Leading Zeros in SAS.
data messy_data;input id_number $10.;datalines;01234500012300789A00123X;run;data clean_data;set messy_data;/* Find the position of the first non-zero digit */first_digit = indexc(id_number, '123456789');/* Extract everything from the first non-zero digit onwards */if first_digit > 0 thenclean_id = substr(id_number, first_digit);elseclean_id = id_number; /* Keep original if no digits found */run;proc print data=clean_data;title "Before and After: Removing Leading Zeros";run;
What’s happening here?
INDEXC finds the position of the first character that matches any of the digits 1-9SUBSTR extracts the string starting from that positionLearn More: For a deep dive into the SUBSTR function, see our complete SUBSTR in SAS tutorial.
Sometimes you’ll get data where city and state are mashed together like “LosAngelesCA”. Let’s separate them:
data combined_data;input City_State $20.;datalines;LosAngelesCANewYorkNYAustinTXChicagoIL;run;data separated_data;set combined_data;total_length = length(City_State);/* Assuming state is always the last 2 characters */state = substr(City_State, total_length - 1, 2);city = substr(City_State, 1, total_length - 2);run;proc print data=separated_data;title "Separating City and State";run;
Pro tip: This works great when you know the state abbreviation is always 2 characters. For more complex separations, you might need different approaches!
Street addresses sometimes get messy with letters mixed into numbers. Here’s how to clean them:
data messy_addresses;input street_number $10.;datalines;123A456B789101C;run;data clean_addresses;set messy_addresses;/* Remove all letters, keep only digits */clean_number = compress(street_number, , 'a'); /* 'a' removes alphabetic chars */run;proc print data=clean_addresses;title "Cleaning Street Numbers";run;
Text fields often come with extra punctuation that you don’t need:
data raw_text_data;input original_text $50.;datalines;Hello, world! Welcome@@@ to SAS.Goodbye!!! See you soon.What# is this$?;run;data clean_text_data;set raw_text_data;/* Remove specific unwanted characters */cleaned_text = compress(original_text, '.,!@#$%^&*()_+');run;proc print data=clean_text_data;title "Before and After: Removing Punctuation";run;
The result: “Hello, world! Welcome@@@ to SAS.” becomes “Hello world Welcome to SAS”
Deep Dive: Want to master the COMPRESS function? Read our comprehensive Compress Function in SAS guide.
The TRANSLATE function is perfect when you need to swap one character for another. Here’s a common scenario:
In many European countries, decimals use commas instead of periods. Let’s standardize them:
data european_numbers;input price_text $10.;datalines;123,45678,901234,56;run;data standardized_numbers;set european_numbers;/* Replace commas with periods */us_format = translate(price_text, '.', ',');/* Convert to actual numeric value */price_numeric = input(us_format, best12.);run;proc print data=standardized_numbers;title "Converting European Number Format";format price_numeric dollar8.2;run;
You can also use TRANSLATE to fix common data entry mistakes:
data typo_data;input response $10.;datalines;YESyesYesNOnoNo;run;data fixed_data;set typo_data;/* Convert to uppercase for consistency */clean_response = upcase(response);run;proc print data=fixed_data;title "Standardizing Responses";run;
Missing data is everywhere in real-world datasets. Here are some smart ways to handle it:
Essential Reading: For a complete guide on this topic, check out Missing Values in SAS.
data sample_data;input id score;datalines;1 852 .3 924 785 .6 88;run;/* First, calculate the mean of non-missing values */proc means data=sample_data noprint;var score;output out=mean_score mean=avg_score;run;/* Merge back and replace missing values */data complete_data;if _n_ = 1 then set mean_score;set sample_data;if missing(score) thenfinal_score = avg_score;elsefinal_score = score;drop _type_ _freq_ avg_score;run;proc print data=complete_data;title "Replacing Missing Values with Mean";format final_score 8.1;run;
Sometimes it’s useful to know which values were originally missing:
data indicator_example;input id income;datalines;1 500002 .3 750004 .5 60000;run;data with_indicators;set indicator_example;/* Create indicator for missing values */income_missing = (missing(income));/* Replace missing with median (example value) */if missing(income) then income_clean = 60000;else income_clean = income;run;proc print data=with_indicators;title "Missing Value Indicators";run;
Let’s combine several techniques in one comprehensive example:
data messy_customer_data;input customer_id $10. name $20. phone $15. income $10.;datalines;00123 John,Doe!!! (555)123-4567 45,00000456 Jane@Smith (555)987-6543 .00789 Bob##Johnson 5559876543 67,500;run;data clean_customer_data;set messy_customer_data;/* Clean customer ID - remove leading zeros */first_digit = indexc(customer_id, '123456789');if first_digit > 0 thenclean_id = substr(customer_id, first_digit);elseclean_id = customer_id;/* Clean name - remove unwanted characters */clean_name = compress(name, ',!@#$%^&*()');/* Clean phone - keep only digits */clean_phone = compress(phone, , 'kd'); /* 'kd' keeps only digits *//* Clean income - replace comma with period and convert */income_standard = translate(income, '.', ',');if not missing(income_standard) thenincome_numeric = input(income_standard, best12.);/* Keep only the cleaned variables */keep clean_id clean_name clean_phone income_numeric;run;proc print data=clean_customer_data;title "Complete Data Cleaning Example";format income_numeric dollar10.;run;
Data cleaning doesn’t have to be scary! Here are the main points to remember:
Remember, clean data leads to reliable analysis. Take the time to understand your data’s quirks, and use these SAS functions to transform messy data into something you can work with confidently.
Now that you’ve mastered data cleaning basics, explore these related topics:
Happy data cleaning!
