0 Comments

January 26, 2020

Sorting the data is always a resource-intensive operation. Therefore, using the PROC SORT procedure efficiently can save you both time and computing resources.

There are a number of options associated with PROC SORT that can be used not only to control the performance and capabilities of the procedure but also to the resulting data set.

The NODUPREC Option in Proc Sort

The NODUPREC option (as well as its aliases NODUPLICATES and NODUP) is used far too often.

You might think that using this option will remove all duplicate observations, and although this is what it nominally is supposed to do, it does not necessarily cause PROC SORT to remove all duplicate observations.

In fact, it only removes duplicate observations that are adjacent after sorting.

When the sorting process results in a data set in which duplicate observations are not next to each other (they do not come one after the other sequentially), they will not be detected and the duplicate observation(s) will not be removed.

data SortExample;
input Subject visit labdt:mmddyy8. sodium chloride;
format labdt mmddyy8.;
datalines;
200 1 07/06/2006 140 103
200 2 07/13/2006 144 106
200 1 07/06/2006 140 103
200 4 07/13/2006 140 103
200 4 07/13/2006 140 103
200 5 07/07/2006 142 104
;
run;
Proc Sort
Input dataset
proc sort data=SortExample out=sorted noduprecs;
by subject;
run;
Proc Sort

The row Observation number 4 is removed and a NOTE in the log is written. Observation number 1 and 3 are also duplicate but ignored by the option.

NOTE: There were 6 observations read from the data set WORK.SORTEXAMPLE.
NOTE: 1 duplicate observations were deleted.
NOTE: The data set WORK.SORTED has 5 observations and 5 variables.

When key fields in the BY statement are sufficient to form a primary key that can uniquely identify the observations within a BY group, it will make the NODUPREC option work as we would hope that it would.

In the previous example if we include BY statement with VISIT and LABDT as
well as SUBJECT, the duplicate record will be removed.

proc sort data=SortExample out=sorted noduprecs;
by subject visit labdt;
run;

You may also specify the _all_ option, which will remove the duplicate rows.

NOTE: There were 6 observations read from the data set WORK.SORTEXAMPLE.
NOTE: 2 duplicate observations were deleted.
NOTE: The data set WORK.SORTED has 4 observations and 5 variables.

NODUPKEY and DUPOUT

The NODUPKEY option checks and eliminates observations with duplicate BY values keeping only the first occurrence in the BY group.

proc sort data=SortExample nodupkey;
by visit;
run;
Proc Sort

Note that, all the duplicate records within the same BY group (Visit) have been removed.

When the NODUPREC or the NODUPKEY options are used, the LOG will note when observations are removed.

However, which observations were removed will not be written in the LOG.

If you want to be able to see these observations, the DUPOUT= option can be used to save the duplicate observations into a separate data set

proc sort data=SortExample out=sorted nodupkey dupout=dup1;
by visit;
run;

NOUNIQUEKEY and UNIQUEOUT

The NOUNIQUEKEY option checks and eliminates observations from the output data set that has a unique sort key.

A sort key is unique when the observation containing the key is the only observation within a BY group.

proc sort data=SortExample nouniquekey out=dupsreq uniqueout=uniqreq;
by visit;
run;
Proc Sort
Duplicate Observation from the dupreq dataset

Duplicate records deleted from the original dataset. You can save the duplicate records in another dataset by using the OUT option.

If option OUT is not used, the original dataset is overwritten and will contain only the duplicate records.

The UNIQUEOUT= option can be used with the NOUNIQUEKEY option. UNIQUEOUT= SAS-data-set specifies the output data set for observations that will contain unique records.

Proc Sort
Unique records from UNIQUEOUT option

OVERWRITE

The OVERWRITE option will enable you to delete the input data set before the replacement output data set of the same name is populated with observations.

Using this option can reduce disk space requirements.

PRESORTED

The PRESORTED checks within the input data set to determine whether the sequence of observations is in order before sorting is done.

Use the PRESORTED option when you know or strongly suspect that a data set is already in order according to the key variables that are specified in the BY statement.

Using this option can eliminate the unnecessary overhead of the cost of sorting the data set.

EQUALS and NOEQUALS options

EQUALS option specifies the order of the observations in the output data set and it maintains the relative order of the observations from within the input data set to the output data set for observations with identical BY variable values.

NOEQUALS does not necessarily preserve this order in the output data set.

The SORTEQUAL and NOSORTEQUALS are the two system options that control whether the first observation in a group is selected or not.

SORTEQUAL specifies that observations with identical BY variable values are to retain the same relative positions in the output data set as in the input data set.

NOSORTEQUALS specifies that no resources be used to control the order of observations with identical BY variable values in the output data set.

To understand this, let’s look at an example below.

Data Example;
 input ID $ var $;
 datalines;
4 A4
4 B4
1 A1
1 A1
1 A2
1 A3
1 A5
3 C3
3 C3
3 C3
2 B2
;
run;
Proc Sort
Input dataset
proc sort data=example nodupkey;
 by id;
run;
Proc Sort
Using the default sort

Using the EQUALS options, which is also the default, Sort procedure maintains the order of Group 1 – A1 which is also the first observation in the BY group.

proc sort data=example nodupkey noequals;
 by id;
run;
Proc Sort
Using the NOEQUALS option

Using the NOEQUALS options, we can change this default behavior according to our needs. Thus, ID 1 has the value of A3.

Handling of observations with duplicate keys

PROC SORT with the NODUPKEY option removes records with identical keys. However, this has been enhanced, and now you have the ability to choose which of the observations is kept, and you can send the duplicates that aren’t kept to a dataset.

The following examples use a dataset containing customer ID numbers and the quantity purchased in a month.

data orders;
input custid qty month $6.;
datalines;
10270 5 JAN
12230 4 JAN
19323 7 JAN
19323 3 FEB
22779 2 FEB
22779 2 FEB
28819 4 FEB
29252 1 FEB
30457 9 MAR
30457 1 APR
30457 3 MAY
30457 2 MAY
31918 1 MAY
31918 1 MAY
30457 2 JUN
;
run;
Proc Sort
Input dataset

Let’s suppose, you need to find the customers who have order quantity more than one for a given month.

You can proc sort step as below which will sort the data based on custid and month and the unique orders are kept in the dataset. The duplicate observation is sent to a different dataset using the dupout option.

proc sort data=orders out=sort1 nodupkey dupout=dups;
by custid month;
run;
Proc Sort
The contents of dups dataset

Note, the output in the dataset specified in DUPOUT option gives us the 1st records for each BY group. What if we want to have all the duplicate values in the output dataset.

Using, the NOUNIQUE key and UNIQUE out option you can have all the duplicate values as below which is also the output we want. i.e all customers who have more than one order quantity in a month.

proc sort data=orders nouniquekeys out=alldups uniqueout=uniques;
 by custid month;
run;
The Mystery of Proc Sort Options 1
The contents of alldups dataset

This combination can be used instead of the First and last dataset variables and in a few lines of code. Below are all the unique records which are stored in uniques dataset.

Proc Sort

SORTSEQ

PROC SORT uses the ‘collating sequence’ to determine the sorted order of values.

There are two commonly used collating sequences, EBCDIC (for mainframe systems) and ASCII (for most other machines running Operating Systems like Windows and UNIX).

You have long been able to select one or the other of these two different collating
sequences by specifying the EBCDIC or ASCII options on the PROC SORT statement.

The SORTSEQ option allows you to further refine the way the selected collating sequence is used.

This includes subsets or locals within a national collating sequence. Even without changing the base collating sequence the SORTSEQ option can be beneficial.

You can get more information on using Linguistic Collation options of Proc sort in the below article.

Linguistic Collation: Everyone Can Get What They Expect

related posts:


Column Input in SAS to Read Raw Data Arranged in Columns


Date Interval Functions – INTNX and INTCK in SAS


SAS Generation Datasets

Subhro Kar

About the author

Been in the realm with the professionals of the IT industry. I am passionate about Coding, Blogging, Web Designing and deliver creative and useful content for a wide array of audience.

Subhro

Leave a Reply

Your email address will not be published. Required fields are marked

This site uses Akismet to reduce spam. Learn how your comment data is processed.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}