FIRST and LAST processing are temporary variables created automatically by a DATA step when a BY statement is used.
The values of these Boolean variables will either be 1 for true or 0 for false.
- FIRST.variable: This variable gets a value of 1 the moment a new group begins within your sorted dataset (based on your BY variables). For all other records within that group, its value remains 0.
- LAST.variable: This variable acquires a value of 1 for the final record of each group within your sorted dataset. Otherwise, it stays at 0.
Note: Input Data set must be sorted or indexed before applying FIRST. and LAST. Variables.
A Simple Example
Suppose you have a dataset with sales figures organized by ‘Region’ and ‘Month’. Here’s how to use FIRST. and LAST. variables to calculate the first sale and total sales per region:
data transactions;
input customer_id $ date mmddyy10. purchase_amount;
format date date9.;
datalines;
A001 01/20/2024 25.50
B005 02/15/2024 12.99
A001 02/28/2024 45.00
C012 01/12/2024 30.75
B005 01/24/2024 8.50
;
run;
proc sort data=transactions;
by customer_id;
run;
data first_last_purchases;
set transactions;
by customer_id;
if first.customer_id then first_purchase_date = date;
if last.customer_id then last_purchase_date = date;
/* Output if we want to see the results */
if first.customer_id or last.customer_id;
format first_purchase_date last_purchase_date date9.;
run;
proc print;
Explanation:
- Sorting (Required): We sort by
customer_id
to group the data for FIRST. and LAST. to work correctly. - Detecting First/Last:
if first.customer_id then ...
: When a new customer ID appears, we store the date infirst_purchase_date
.if last.customer_id then ...
: When it’s the last record for a customer ID, we store the date inlast_purchase_date
.
- Output (Optional): The last
if
statement outputs rows only when it’s either the first or last record of each customer, making it easier to see the results.
Calculating Differences Between Records: Let’s say you have stock prices over time. You could use FIRST. and LAST. variables to calculate the change in price from the beginning to the end of each day, week, or month:
Example 1 : Calculating difference between records
data stock_data;
infile datalines truncover;
input ticker_symbol $4. date :yymmdd10. price best32.;
format date date9.; /* Apply the date format for output */
datalines;
AAPL 2024-02-22 150.50
TSLA 2024-02-21 205.75
AAPL 2024-02-21 148.30
MSFT 2024-02-22 250.22
GOOG 2024-02-22 1200.15
AMZN 2024-02-21 97.25
GOOG 2024-02-21 1195.00
AMZN 2024-02-22 98.50
FB 2024-02-22 175.32
;
run;
proc sort data=stock_data;
by ticker_symbol date;
quit;
data price_change;
set stock_data;
by ticker_symbol;
retain first_price last_price;
if first.ticker_symbol then first_price = price;
if last.ticker_symbol then do;
last_price = price;
price_change = last_price - first_price;
output;
end;
run;
Explanation:
- The RETAIN statement prevents
first_price
andlast_price
from being reset to missing at the beginning of each iteration of the data step. if first.ticker_symbol ...
: If it’s the first record of a ticker group, store the ‘price’ infirst_price
.if last.ticker_symbol ...
: If it’s the last record of a ticker group:- Store the ‘price’ in
last_price
. - Calculate
price_change
(last price – first price). output;
: Write a record to the ‘price_change’ dataset.
- Store the ‘price’ in
Example 2: Calculating Running Totals by Group
data sales;
input month date9. sales;
format month monyy7.;
datalines;
01Jan2024 100
01Jan2024 150
01Feb2024 200
01Feb2024 250
01Mar2024 300
;
run;
proc sort data=sales;
by month;
run;
proc print;
data sales_with_total;
set sales;
by month;
retain total_monthly 0 total_running 0;
if first.month then total_monthly = 0; /* Initialize total for each month */
total_monthly + sales; /* Calculate monthly total */
total_running + sales; /* Update running total */
if last.month then do;
output; /* Output total for each month */
total_monthly = .; /* Reset total for next month */
end;
run;
proc print;
Explanation:
retain ...;
: Preventstotal_monthly
andtotal_running
from resetting to missing at the start of each data step iteration.if first.month ...
: If it’s the first record of a month, resettotal_monthly
to 0.total_monthly + sales;
: Accumulates sales for the current month.total_running + sales;
: Accumulates the overall running total.if last.month ...
: When it’s the last record of a month:output;
: Writes a record to the ‘sales_with_total’ dataset (containing the calculated totals).total_monthly = .;
: Resetstotal_monthly
for the next month.
Example 3: Finding Maximum and Minimum Values
data grades;
input student_id grade;
datalines;
1 90
1 85
2 80
2 95
3 75
3 85
;
run;
data max_min_grade;
set grades;
by student_id;
retain max_grade min_grade;
if first.student_id then do;
max_grade = grade;
min_grade = grade;
end;
else do;
if grade > max_grade then max_grade = grade;
if grade < min_grade then min_grade = grade;
end;
if last.student_id;
run;
Example 4: Calculating Group wise Average
proc sort data=sales;
by region;
quit;
data region_average;
set sales;
by region;
retain total count;
if first.region then do;
total = 0;
count = 0;
end;
total + sales;
count + 1;
if last.region then do;
average = total / count;
output;
end;
run;
proc print;
Example 5 : Identifying Duplicate records
You can use first and last variables to identify duplicate records without using the proc sort procedure.
data customer;
input customer_id $;
datalines;
A
A
B
C
C
C
;
run;
data duplicate_records;
set customer;
by customer_id;
if first.customer_id then count = 1;
else count + 1;
if last.customer_id and count > 1 then output;
run;
Key Points to Keep in Mind
- Always remember to sort your data using a
PROC SORT
and aBY
statement for FIRST. and LAST. variables to function correctly. - FIRST. and LAST. variables are temporary; they exist within the DATA step and aren’t added to the final output dataset unless you explicitly create new variables to store their values.