The Mystery of Proc Sort Options
Sorting the data is always a resource-intensive operation. Therefore, using the
PROC SORT procedure efficiently can save you both time and computing resources.
There are a number of options associated with PROC SORT that can be used not only to control the performance and capabilities of the procedure but also to the resulting data set.
The NODUPREC Option in Proc Sort
NODUPREC option (as well as its aliases
NODUP) is used far too often.
You might think that using this option will remove all duplicate observations, and although this is what it nominally is supposed to do, it does not necessarily cause PROC SORT to remove all duplicate observations.
In fact, it only removes duplicate observations that are adjacent after sorting.
When the sorting process results in a data set in which duplicate observations are not next to each other (they do not come one after the other sequentially), they will not be detected and the duplicate observation(s) will not be removed.
data SortExample; input Subject visit labdt:mmddyy8. sodium chloride; format labdt mmddyy8.; datalines; 200 1 07/06/2006 140 103 200 2 07/13/2006 144 106 200 1 07/06/2006 140 103 200 4 07/13/2006 140 103 200 4 07/13/2006 140 103 200 5 07/07/2006 142 104 ; run;
proc sort data=SortExample out=sorted noduprecs; by subject; run;
The row Observation number 4 is removed and a NOTE in the log is written. Observation number 1 and 3 are also duplicate but ignored by the option.
NOTE: There were 6 observations read from the data set WORK.SORTEXAMPLE. NOTE: 1 duplicate observations were deleted. NOTE: The data set WORK.SORTED has 5 observations and 5 variables.
When key fields in the BY statement are sufficient to form a primary key that can uniquely identify the observations within a BY group, it will make the NODUPREC option work as we would hope that it would.
In the previous example if we include BY statement with VISIT and LABDT as
well as SUBJECT, the duplicate record will be removed.
proc sort data=SortExample out=sorted noduprecs; by subject visit labdt; run;
You may also specify the
_all_ option, which will remove the duplicate rows.
NOTE: There were 6 observations read from the data set WORK.SORTEXAMPLE. NOTE: 2 duplicate observations were deleted. NOTE: The data set WORK.SORTED has 4 observations and 5 variables.
NODUPKEY and DUPOUT
NODUPKEY option checks and eliminates observations with duplicate BY values keeping only the first occurrence in the BY group.
proc sort data=SortExample nodupkey; by visit; run;
Note that, all the duplicate records within the same BY group (Visit) have been removed.
When the NODUPREC or the NODUPKEY options are used, the LOG will note when observations are removed.
However, which observations were removed will not be written in the LOG.
If you want to be able to see these observations, the
DUPOUT= option can be used to save the duplicate observations into a separate data set
proc sort data=SortExample out=sorted nodupkey dupout=dup1; by visit; run;
NOUNIQUEKEY and UNIQUEOUT
NOUNIQUEKEY option checks and eliminates observations from the output data set that has a unique sort key.
A sort key is unique when the observation containing the key is the only observation within a BY group.
proc sort data=SortExample nouniquekey out=dupsreq uniqueout=uniqreq; by visit; run;
Duplicate records deleted from the original dataset. You can save the duplicate records in another dataset by using the OUT option.
If option OUT is not used, the original dataset is overwritten and will contain only the duplicate records.
UNIQUEOUT= option can be used with the
NOUNIQUEKEY option. UNIQUEOUT= SAS-data-set specifies the output data set for observations that will contain unique records.
OVERWRITE option will enable you to delete the input data set before the replacement output data set of the same name is populated with observations.
Using this option can reduce disk space requirements.
PRESORTED checks within the input data set to determine whether the sequence of observations is in order before sorting is done.
Use the PRESORTED option when you know or strongly suspect that a data set is already in order according to the key variables that are specified in the BY statement.
Using this option can eliminate the unnecessary overhead of the cost of sorting the data set.
EQUALS and NOEQUALS options
EQUALS option specifies the order of the observations in the output data set and it maintains the relative order of the observations from within the input data set to the output data set for observations with identical BY variable values.
NOEQUALS does not necessarily preserve this order in the output data set.
NOSORTEQUALS are the two system options that control whether the first observation in a group is selected or not.
SORTEQUAL specifies that observations with identical BY variable values are to retain the same relative positions in the output data set as in the input data set.
NOSORTEQUALS specifies that no resources be used to control the order of observations with identical BY variable values in the output data set.
To understand this, let’s look at an example below.
Data Example; input ID $ var $; datalines; 4 A4 4 B4 1 A1 1 A1 1 A2 1 A3 1 A5 3 C3 3 C3 3 C3 2 B2 ; run;
proc sort data=example nodupkey; by id; run;
Using the EQUALS options, which is also the default, Sort procedure maintains the order of Group 1 – A1 which is also the first observation in the BY group.
proc sort data=example nodupkey noequals; by id; run;
Using the NOEQUALS options, we can change this default behavior according to our needs. Thus, ID 1 has the value of A3.
Handling of observations with duplicate keys
PROC SORT with the NODUPKEY option removes records with identical keys. However, this has been enhanced, and now you have the ability to choose which of the observations is kept, and you can send the duplicates that aren’t kept to a dataset.
The following examples use a dataset containing customer ID numbers and the quantity purchased in a month.
data orders; input custid qty month $6.; datalines; 10270 5 JAN 12230 4 JAN 19323 7 JAN 19323 3 FEB 22779 2 FEB 22779 2 FEB 28819 4 FEB 29252 1 FEB 30457 9 MAR 30457 1 APR 30457 3 MAY 30457 2 MAY 31918 1 MAY 31918 1 MAY 30457 2 JUN ; run;
Let’s suppose, you need to find the customers who have order quantity more than one for a given month.
You can proc sort step as below which will sort the data based on custid and month and the unique orders are kept in the dataset. The duplicate observation is sent to a different dataset using the dupout option.
proc sort data=orders out=sort1 nodupkey dupout=dups; by custid month; run;
Note, the output in the dataset specified in DUPOUT option gives us the 1st records for each BY group. What if we want to have all the duplicate values in the output dataset.
Using, the NOUNIQUE key and UNIQUE out option you can have all the duplicate values as below which is also the output we want. i.e all customers who have more than one order quantity in a month.
proc sort data=orders nouniquekeys out=alldups uniqueout=uniques; by custid month; run;
This combination can be used instead of the First and last dataset variables and in a few lines of code. Below are all the unique records which are stored in uniques dataset.
PROC SORT uses the ‘collating sequence’ to determine the sorted order of values.
There are two commonly used collating sequences, EBCDIC (for mainframe systems) and ASCII (for most other machines running Operating Systems like Windows and UNIX).
You have long been able to select one or the other of these two different collating
sequences by specifying the EBCDIC or ASCII options on the PROC SORT statement.
The SORTSEQ option allows you to further refine the way the selected collating sequence is used.
This includes subsets or locals within a national collating sequence. Even without changing the base collating sequence the SORTSEQ option can be beneficial.
You can get more information on using Linguistic Collation options of Proc sort in the below article.