7 Table lookup techniques for SAS Programmers

A table lookup is used to determine the value of a variable in the primary table to determine the value of another variable that is present in a secondary table.

There are a number of techniques for performing these table lookup. These techniques can be radically different both in terms of programming complexity and performance.

As a programmer, you will be faced with complexities and performance issues when dealing with large datasets.

In this article, you will learn about 8 different table lookup techniques and general guidelines, that will help you to make informed decisions on which table lookup technique to use on which situation.

DATA Step Merges and SQL Joins

The use of joins in an SQL step and the MERGE in the DATA step is another very common way to perform table lookup by matching values between two data sets.

The data must be sorted or indexed when using the MERGE statement, while the SQL step does not require sorting of data.

MERGE Statement

The MERGE statement is used to combine two or more data sets generally used with a common variable that is present in both the tables. For the purpose of this discussion, one of these data sets will contain the information that is to be looked up to.

data subjectVistits;
 merge mylib.subjects(in=a) mylib.visits(in=b);
 by subjid;
 if a;
run;

The BY statement is used to make sure that the observations are correctly aligned and should also include sufficient variables to form a unique key in all but at most one of the data sets.

For a successful table lookup using the MERGE statement, both of the incoming data sets must be indexed or sorted.

The IF statement has been used to eliminate any subjid in subjects dataset that does not appear in visits dataset. In this case, the visits will be missing for those subjid’s.

If we want to restrict the lookup to only those subjid with matches in visits dataset, the IF statement could be replaced with if a and b. The result is achieved directly using SQL join discussed next.

SQL Join

The merging process is called a Join when using SQL. The advantage is unlike data step merge, the input dataset need not be sorted. However, using SQL join to perform table lookup does use resources when the tables are large.

There are a number of different types of joins within SQL, and one that closely matches the previous step is shown below.

proc sql noprint;
 create table subjectvisits as select a.subjid, b.visitdt, lname, fname from 
  mylib.subjects a, mylib.visits b where a.subjid=b.subjid;
quit;

SQL does not require either of the two data sets to be sorted prior to the join, and unless we specifically request that the resulting data table be sorted using the ORDER BY clause otherwise it will reflect the order of the incoming data.

Merge Using Double SET Statements

The double SET statements can replace the single MERGE statement and the programmer can perform table lookup keeping the two datasets in sync.

data subjectvisits(keep=subjno visitdt fname lname);
 /*Primary dataset*/
 set mylib.subjects(rename=(subjid=subjno));

 if subjno=subjid then
  output;

 do while(subjno>subjid);

  /*Secondary dataset*/
  set mylib.visits(keep=subjid visitdt);

  if subjno=subjid then
   output;
 end;
run;

In this example, the two incoming data sets have already been sorted (by subjid). The primary data set(subjects) contains the observations for which we need the visitdt.

The secondary data set (visits) contains just the names that are to be retrieved. It is also possible that the lookup data set will contain subjid that have no match in the first or primary data set.

Each observation is read from the primary dataset subjects. Because subjects and visits both have the same variable subjid, it is renamed to subjno. This will allow you to access and compare the subjid’s from both data sets at the same time.

The value of subjno (subjid from visits) is compared to the value of subjid from subjects which is from the second dataset visits.

On the first pass, no observation will be read from visits and subjid will be missing. subjno that do not have matching names will not be written out.

The DO WHILE is used to read successive rows from the secondary dataset visits.

It is possible that there are subjno and visits in the visits data set are not in the primary dataset subjects. These observations will have subjid less than subjno. This loop cycles through any extra visit until the second dataset visit catch up to the primary dataset (subjno=subjid).

These techniques can be faster but more complicated than a MERGE. However, they do still require that both incoming data sets be sorted.

Using Formats

With the use of FORMATS, you can ignore the logical processing of assignment statements and thus improve the performance.

When a value is retrieved from a format, a binary search is used and this means that binary searches operate by iteratively splitting a list in half until the target is found, The search process tends to be faster than sequential searches—especially as the number of items increases.

Let’s go through the process of creating a user-defined format from the Visit dataset and then use the format in subject dataset to retrieve the visit dates.

First, I will be creating a control dataset which will store the formats.

data fmt;
 set mylib.visits(rename=(subjid=start visitdt=label)) end=eof;
 retain fmtname "visitfmt" type "N";

 If _n_ eq 1 then
  HLO='L';

 if eof then
  do;
   Label="";
   HLO='O';
  end;
run;
table lookup

The subjid and visitdt variables are renamed to start and label repectively. Start is the value that you want to format the label. In this example, we want to map the visitdt for corresponding subjid’s. Label is the value that you wish to generate.

A retain statement is used to assign visitdt to the subjid and they have the same value on every record. fmtname specifies the name of the format that is visitfmt. The TYPE 'N' states that the format is a Numeric format.

When all the input observations are processed the END option causes the variable EOF to be set to 1. When the last record of the dataset is read, HLO is set to ‘O’ which stands for Other. This indicates that all the other subjid’s which are not defined in the input dataset will be assigned the null value.

Now, our control dataset is ready, we can use PROC FORMAT to create the format as below.

proc format library=work cntlin=fmt fmtlib;
run;

The CNTLIN option tells SAS to create a format from a dataset fmt that is the control dataset created earlier. The FMTLIB option in PROC FORMAT is used to print the entire label of the format description

table lookup using Proc Format

Now, it’s time to use the vstfmt to the subject dataset.

data subjectsvisits;
 set mylib.subjects;
 Visitdt=input(put(subjid, visitfmt.), 10.);
 format visitdt mmddyy8.;
run;
table lookup

Since I have created the format and not informat, the put statement is used to convert this format to a character. The character dates are again converted back to numeric so that we can use the mmddyy8. format to display the date format.

Note that, subjid 20 is blank as this subjid is not present in the visit dataset and is also defined in HLO option in the control dataset.

Using Indexes

Indexes are techniques used to logically sort the data without physically sorting it. Indexed are helpful as you may not need to perform sorting of data prior to merges.

Indexes must be created, stored, and maintained. They are usually created through either PROC DATASETS (shown below), PROC SQL or through data step.

I have written a separate topic on the index which you can go through for detailed understanding.

proc datasets library=mylib nolist;
 modify visits;
 index create subjid / unique;
 modify subjects;
 index create subjid;
 run;

I have created an Index for both the subjects and the visit dataset. The MODIFY statement is used to name the datasets that are to receive the indexes. And the INDEX statement defines the index for each data set.

data subjectvisits;
 set mylib.subjects (keep=subjid fname lname);
 set mylib.visits key=subjid/unique;

 if _iorc_ ne 0 then
  visitdt=' ';
run;

The KEY= option on the SET statement identifies an index that is to be used. An observation is read from an unsorted dataset. This assigns a value for the index variable (subjid) into the PDV.

An observation is read from the lookup dataset visits. Since the KEY= option is specified, the observation corresponding to the current value of subjid is returned. The observations read from subjects dataset can be in any order, and values of subjid can also be repeated.

The temporary variable IORC will be 0 when an indexed value is successfully read. If the missing value for the visitdt is not found IORC will be 0. Otherwise, the value of visitdt will have been retained from the previous observation.

Indexes are stored in a separate file and hence, the size of this file can become large as the number of indexes, observations, and variables used to form the indexes increases.

Indexes can substantially speed up processes. They can also SLOW things down. Not all data sets are good candidates to be indexed and not all variables will form good indexes.

Using array lookup

Arrays can be used to speed up a search or when sorting is not the appropriate option. Key addressing is one of the forms of lookup technique where an array is the value of the variable that we want to lookup.

In the following data step, the list of codes is read sequentially, once into an array that stores the visitdt and using the subjid as the array subscript.

The second DO UNTIL then reads the primary dataset. In this loop, the visitdt is fetched from the array and assigned to the variable visitdt variable created in the primary dataset.

In addition to its speed of execution, a major advantage of this technique is that neither of the incoming data sets needs to be sorted.

data subjectvisits(keep=subjid visitdt fname lname);
 array myarray{99999} _TEMPORARY_;

 do until(alldates);
  set mylib.visits end=alldates;
  myarray[subjid]=visitdt;
 end;

 do until(allsubjid);
  set mylib.subjects end=allsubjid;
  visitdt=myarray[subjid];
  output subjectvisits;
 end;
 stop;
run;

A character array of temporary values is created. This array will hold the values to be retrieved and will be indexed by the subid. The length of the array elements must be sufficient to hold each value being inserted into the array.

The first DO LOOP is used to read all of the observations from the data set that contains the values to be looked up. The temporary variable alldates is defined using the END= option which will be 0 for all the observation except the last one.

The value of the visitdt is stored in the array element identified by the subjid.

As the observations are read from the visits dataset, the values of subjid will be loaded in the PDV where it can be used to retrieve the value of visitdt from myarray.

Using the HASH Object

Programmers with very large data sets are often limited by constraints that are put on them either by memory or processing speed. For instance, it is not practical to sort a very large data set.

You cannot use unsorted data sets for merging using the BY statement and without BY statement the merge may not be feasible.

Joins in SQL are possible by using the BUFFERSIZE option, while joining a large table. but this still may not be a useful solution. INDEX can be created for medium size datasets, however, for large datasets, the size of the index is larger and requires maintenance.

Fortunately, there are a number of techniques for handling these situations as well.

Essentially the hash object defines an array in memory, initializes its values with data from a table, and sets up an indexing variable or variables that can be either numeric or character.


data subjectvisits(keep=subjid lname fname visitdt);
 if 0 then
  set mylib.visits;
 declare hash lookup(dataset: 'mylib.visits', hashexp: 8);
 lookup.defineKey('subjid');
 lookup.defineData('visitdt');
 lookup.defineDone();

 do until(done);
  set mylib.subjects(keep=subjid lname fname) end=done;

  if lookup.find()=0 then
   output subjectvisits;
 end;
 stop;
run;

The attributes for the variables that are to be retrieved from the hash object need to established on the PDV. The set statement is used only during the data step compilation phase to determine variable values.

The HASH object itself is defined, named, and loaded using the DECLARE statement. The attributes of objects are then defined using the DEFINE key, DEFINEDATA and DEFINEDONE methods.

This hash object has been named LOOKUP, and has been loaded with the data that contains the values (visitdt) that we want to be able to look up.

The number of bins (2^8= 256) used by the hash table is specified.

The DEFINEKEY method is used to list one or more key variables (subjid) whose values are used to index the lookup hash table.

The DEFINEDATA method lists contain those variables which are to be added to the LOOKUP hash table. The value of these variables can be retrieved using the FIND method.

A DO UNTIL loop is used to cycle through the observations in the primary data set subjects.

Each observation is read and its value for subjid is loaded into the PDV. Since subjid is the key variable for the hash object, its value will be automatically retrieved by the find method when retrieving the value of visitdt.

The temporary variable DONE will be set to 1 when the last observation is read from the visits dataset.

The visitdt, which is being held in the hash table, is retrieved through the use of the FIND which returns 0 or 1 for success and failure respectively. When the result is successful out the resulting observation with the visitdt defined.

Since the SET statement is inside a loop, we should always make sure to stop the implied data step loop manually.

Download Exercise Files

Subhro Kar

Been in the realm with the professionals of the IT industry. I am passionate about Coding, Blogging, Web Designing and deliver creative and useful content for a wide array of audience.

>