A table lookup is used to determine the value of a variable in the primary table to determine the value of another variable present in a secondary table.
There are several techniques for performing these table lookups. These techniques can be radically different in terms of programming complexity and performance.
You will face complexities and performance issues when dealing with large datasets as a programmer.
In this article, you will learn about seven different table lookup techniques and general guidelines to help you decide which table lookup technique to use.
DATA Step Merges and SQL Joins
The use of joins in an SQL step and the MERGE
in the DATA step is another common way to perform table lookup by matching values between two data sets.
The data must be sorted or indexed using the MERGE statement, while the SQL step does not require data sorting.
1. MERGE Statement
The MERGE statement is used to combine two or more data sets generally used with a common variable that is present in both tables. For this discussion, one of these data sets will contain the information that is to be looked up to.
data subjectVistits;
merge mylib.subjects(in=a) mylib.visits(in=b);
by subjid;
if a;
run;
The BY
statement is used to ensure that the observations are correctly aligned and should also include sufficient variables to form a unique key in all but at most one of the data sets.
Both incoming data sets must be indexed or sorted for a successful table lookup using the MERGE statement.
The IF
statement has been used to eliminate any subjid in subjects dataset that does not appear in the visits dataset. In this case, the visits will be missing for those subjid’s.
If we want to restrict the lookup to only those subjid with matches in the visits dataset, the IF statement could be replaced with if a and b
. The result is achieved directly using SQL join discussed next.
2. SQL Join
The merging process is called a Join when using SQL. Unlike data step merge, the advantage is that the input dataset need not be sorted. However, using SQL join to perform table lookup does use resources when the tables are large.
There are many different types of joins within SQL, and one that closely matches the previous step is shown below.
proc sql noprint;
create table subjectvisits as select a.subjid, b.visitdt, lname, fname from
mylib.subjects a, mylib.visits b where a.subjid=b.subjid;
quit;
SQL does not require either of the two data sets to be sorted before the join. Unless we specifically request that the resulting data table be sorted using the ORDER BY clause, it will reflect the order of the incoming data.
3. Merge Using Double SET Statements
The double SET statements can replace the single MERGE statement, and the programmer can perform table lookup keeping the two datasets in sync.
data subjectvisits(keep=subjno visitdt fname lname);
/*Primary dataset*/
set mylib.subjects(rename=(subjid=subjno));
if subjno=subjid then
output;
do while(subjno>subjid);
/*Secondary dataset*/
set mylib.visits(keep=subjid visitdt);
if subjno=subjid then
output;
end;
run;
In this example, the two incoming data sets have already been sorted (by subjid). The primary data set(subjects) contains the observations for which we need the visitdt.
The secondary data set (visits) contains just the names to be retrieved. It is also possible that the lookup data set will contain subjid that have no match in the first or primary data set.
Each observation is read from the primary dataset subjects. Because subjects and visits both have the same variable subjid, it is renamed to subjno. This will allow you to access and compare the subjid’s from both data sets simultaneously.
The value of subjno (subjid from visits) is compared to the value of subjid from subjects which is from the second dataset visits.
On the first pass, no observation will be read from visits, and subjid will be missing. subjno that do not have matching names will not be written out.
The DO WHILE
is used to read successive rows from the secondary dataset visits.
There may be subjno and visits in the visits data set that are not in the primary dataset subjects. These observations will have subjid less than subjno. This loop cycles through any extra visit until the second dataset visit catches up to the primary dataset (subjno=subjid
).
These techniques can be faster but more complicated than a MERGE. However, they do still require that both incoming data sets be sorted.
4. Using Formats
With the use of FORMATS, you can ignore the logical processing of assignment statements and thus improve performance.
When a value is retrieved from a format, a binary search is used, and this means that binary searches operate by iteratively splitting a list in half until the target is found. The search process tends to be faster than sequential searches—especially as the number of items increases.
Let’s go through the process of creating a user-defined format from the Visit dataset and then use the format in the subject dataset to retrieve the visit dates.
First, we will create a control dataset that will store the formats.
data fmt;
set mylib.visits(rename=(subjid=start visitdt=label)) end=eof;
retain fmtname "visitfmt" type "N";
If _n_ eq 1 then
HLO='L';
if eof then
do;
Label="";
HLO='O';
end;
run;
The subjid and visitdt variables are renamed to start and label, respectively. The start is the value that you want to format the label.
In this example, we want to map the visitdt for corresponding subjid’s. The label is the value that you wish to generate.
A retain statement is used to assign visitdt to the subjid and they have the same value on every record. fmtname
specifies the name of the format that is visitfmt. The TYPE 'N'
states that the format is a Numeric format.
When all the input observations are processed, the END option causes the variable EOF to be set to 1. When the last record of the dataset is read, HLO is set to ‘O’, which stands for Other. This indicates that all the other subjid’s not defined in the input dataset will be assigned the null value.
Now that our control dataset is ready, we can use PROC FORMAT to create the format below.
proc format library=work cntlin=fmt fmtlib;
run;
The CNTLIN
option tells SAS to create a format from a dataset fmt, the control dataset created earlier. The FMTLIB
option in PROC FORMAT is used to print the entire label of the format description
Now, it’s time to use the vstfmt to the subject dataset.
data subjectsvisits;
set mylib.subjects;
Visitdt=input(put(subjid, visitfmt.), 10.);
format visitdt mmddyy8.;
run;
Since I have created the format and not informat, the put statement is used to convert this format to a character. The character dates are again converted to numeric, so we can use the mmddyy8. format to display the date format.
Note that subjid 20 is blank as this subjid is not present in the visit dataset and is also defined in the HLO option in the control dataset.
5. Using Indexes
Indexes are techniques used to sort the data without physically sorting it logically. Indexed are helpful as you may not need to perform sorting of data before merges.
Indexes must be created, stored, and maintained. They are usually created through either PROC DATASETS (shown below), PROC SQL, or the data step.
We have a separate topic on the index which you can go through for a detailed understanding.
proc datasets library=mylib nolist;
modify visits;
index create subjid / unique;
modify subjects;
index create subjid;
run;
We have created an Index for both the subjects and the visit dataset. The MODIFY statement is used to name the datasets that receive the indexes. And the INDEX statement defines the index for each data set.
data subjectvisits;
set mylib.subjects (keep=subjid fname lname);
set mylib.visits key=subjid/unique;
if _iorc_ ne 0 then
visitdt=' ';
run;
The KEY= option on the SET statement identifies an index to be used. An observation is read from an unsorted dataset. This assigns a value for the index variable (subjid) into the PDV.
An observation is read from the lookup dataset visits. Since the KEY
= option is specified, the observation corresponding to the current value of subjid is returned. The observations read from the subjects dataset can be in any order, and values of subjid can also be repeated.
The temporary variable IORC will be 0 when an indexed value is successfully read. If the missing value for the visitdt is not found, IROC will be 0. Otherwise, the value of visitdt will have been retained from the previous observation.
Indexes are stored in a separate file; hence, the size of this file can become large as the number of indexes, observations, and variables used to form the indexes increases.
Indexes can substantially speed up processes. They can also SLOW things down. Not all data sets are good candidates to be indexed, and not all variables will form good indexes.
6. Using array lookup
Arrays can be used to speed up a search or when sorting is not the appropriate option. Key addressing is one of the forms of lookup technique where an array is the value of the variable we want to lookup.
In the following data step, the list of codes is read sequentially, once into an array that stores the visitdt and using the subjid as the array subscript.
The second DO UNTIL then reads the primary dataset. In this loop, the visitdt is fetched from the array and assigned to the variable visitdt variable created in the primary dataset.
In addition to its execution speed, a major advantage of this technique is that neither of the incoming data sets needs to be sorted.
data subjectvisits(keep=subjid visitdt fname lname);
array myarray{99999} _TEMPORARY_;
do until(alldates);
set mylib.visits end=alldates;
myarray[subjid]=visitdt;
end;
do until(allsubjid);
set mylib.subjects end=allsubjid;
visitdt=myarray[subjid];
output subjectvisits;
end;
stop;
run;
A character array of temporary values is created. This array will hold the values to be retrieved and indexed by the subid. The length of the array elements must be sufficient to hold each value being inserted into the array.
The first DO LOOP is used to read all of the observations from the data set that contains the values to be looked up. The temporary variable alldates is defined using the END= option, which will be 0 for all the observations except the last one.
The value of the visitdt is stored in the array element identified by the subjid.
As the observations are read from the visits dataset, the values of subjid will be loaded in the PDV, where it can be used to retrieve the value of visitdt from myarray.
7. Using the HASH Object
Programmers with large data sets are often limited by constraints that are put on them by memory or processing speed. For instance, it is not practical to sort a large data set.
You cannot use unsorted data sets for merging using the BY statement; without the BY statement, the merge may not be feasible.
Joins in SQL are possible by using the BUFFERSIZE
option, while joining a large table. but this still may not be a useful solution. INDEX can be created for medium size datasets. However, the index size for large datasets is larger and requires maintenance.
Fortunately, there are several techniques for handling these situations as well.
Essentially the hash object defines an array in memory, initializes its values with data from a table, and sets up an indexing variable or variables that can be either numeric or character.
data subjectvisits(keep=subjid lname fname visitdt);
if 0 then
set mylib.visits;
declare hash lookup(dataset: 'mylib.visits', hashexp: 8);
lookup.defineKey('subjid');
lookup.defineData('visitdt');
lookup.defineDone();
do until(done);
set mylib.subjects(keep=subjid lname fname) end=done;
if lookup.find()=0 then
output subjectvisits;
end;
stop;
run;
The attributes for the variables to be retrieved from the hash object need to be established on the PDV. The set statement is used only to determine variable values during the data step compilation.
The HASH object itself is defined, named, and loaded using the <strong>DECLARE</strong>
statement. The attributes of objects are then defined using the DEFINE
key, DEFINEDATA
and DEFINEDONE
methods.
This hash object has been named LOOKUP and has been loaded with the data that contains the values (visitdt) that we want to be able to look up.
The number of bins used by the hash table is specified.
The DEFINEKEY
method lists one or more key variables (subjid) whose values are used to index the lookup hash table.
The DEFINEDATA
method lists contain variables to be added to the LOOKUP hash table. The value of these variables can be retrieved using the FIND
method.
A DO UNTIL
loop is used to cycle through the observations in the primary data set subjects.
Each observation is read, and its value for subjid is loaded into the PDV. Since subjid is the key variable for the hash object, its value will be automatically retrieved by the find method when retrieving the value of visitdt.
The temporary variable DONE
will be set to 1 when the last observation is read from the visits dataset.
The visitdt, held in the hash table, is retrieved using 0 or 1 for success and failure, respectively. When the result is successful, the resulting observation with the visitdt is defined.
Since the SET statement is inside a loop, we should always make sure to stop the implied data step loop manually.
I’ve been surfing on-line greater than 3 hours today, but I by no means found any fascinating article like yours. It is lovely price enough for me. Personally, if all site owners and bloggers made just right content as you probably did, the internet will likely be a lot more helpful than ever before.
Thank you a lot for all this very interesting job !
I am unable to find such interesting things in SAS in French.
Would you like tellling me if the method using formats require to the data sets to be sorted ?
Thanks again.
Sorting is not required for using the format method