PROC SQL is a powerful procedure in SAS that allows users to manage and retrieve data in a structured manner. It offers the flexibility of SQL (Structured Query Language) within the SAS environment, enabling intricate data manipulations and analyses.
The Essence of Subqueries
A proc sql subqueries, often termed an inner or nested query, is a query embedded within another SQL query. It can retrieve data that will be used in the main query as a condition to further refine the data that is to be retrieved.
Importance: Subqueries enhance the flexibility of data retrieval, allowing for dynamic data extraction based on conditions set by the main query.
Types of Proc SQL Subqueries
Proc SQL Subqueries can be broadly categorized into four types based on their function and the kind of result they return.
Scalar Subqueries: These return a single value. They are often used in the SELECT, WHERE, or HAVING clause of the main query.
Row Subqueries: These return a single row of values. They are commonly used in comparison operations.
Column Subqueries: These return a column of values. They are typically used in the WHERE clause of the main query.
Table Subqueries: Also known as derived tables, these return a table. They are used in the FROM clause of the main query.
Scalar Subqueries
Scalar subqueries are versatile tools that return a single value. This value can be used for comparison or arithmetic operations in the main query. For example, one could use a scalar subquery to retrieve the average salary of employees. Then, this value can be used in the main query to find employees who earn above the average.
Example:
proc sql;
select "Average Height" as Description, (select mean(Height) from
sashelp.class) as Value from sashelp.class where upcase(Name)='ALFRED';
quit;
Explanation:
Main Query:
select "Average Height" as Description, [...subquery...] as Value from
sashelp.class where upcase(Name)='ALFRED';
This is the main or outer query that retrieves the description “Average Height” and the result of the scalar subquery for the student named “ALFRED” from the ‘sashelp.class’ dataset.
Scalar Subquery:
(select mean(Height) from sashelp.class)
This subquery calculates the average height of all students in the `sashelp.class` dataset. Since it returns a single value (the average height), it’s a scalar subquery. If you run this subquery alone, it will give you the average height value.
The main query uses the result of the scalar subquery in its SELECT clause. Specifically, it retrieves the description “Average Height” and the average height value (calculated by the subquery) for the student named “ALFRED”.
So, the process is as follows:
- The scalar subquery runs first and calculates the average height of all students.
- The main query then runs, retrieving the description and the average height value for the student named “ALFRED”.
- The result is a table row with the description “Average Height” and the average height of all students in the `sashelp.class` dataset, but only for the record where the student’s name is “ALFRED”.
Row Subqueries: Exploring the Potential
Row subqueries return a single row of values. They are particularly useful when comparing records across different tables or datasets. For instance, one might want to retrieve records from a dataset that match a specific record in another dataset.
Imagine we have two datasets: students
and prefects
. We want to find the average age of students who are not prefects. Here’s how we can achieve this using a row subquery:
/* Sample data creation */
data students;
set sashelp.class;
run;
data prefects;
set sashelp.class(where=(Name in ('Alfred', 'Alice')));
keep Name;
run;
proc sql;
create table avg_age_non_prefects as
select AVG(Age) as AverageAge
from students
where Name NOT IN (select Name from prefects);
quit;
In this query, the subquery (select Name from prefects)
returns the names of students who are prefects. The main query then calculates the average age of students from the student's
dataset who are not in the list of prefects.
This is a basic example of how row subqueries can be used in SQL to filter results based on a list of values returned by a subquery.
When writing a SQL query, it’s important to understand the difference between a single-row subquery and a scalar subquery. A single-row subquery returns a value that is used in the WHERE clause, while a scalar subquery is a SELECT statement that is used in the column list. Think of it as an inline function in the SELECT column list.
Column Subqueries: Enhancing Data Retrieval
Column subqueries return a column of values. They are typically used to filter records in the main query based on a list of values. For instance, one might want to retrieve all records that match a list of product IDs.
Column subqueries are designed to return a single column of values. These values can then be used in an IN, NOT IN, EXISTS, or NOT EXISTS condition in the main query.
For this example, let’s assume we have a dataset selected_students that contains a list of student names who participated in a special school event. We want to retrieve all records from the sashelp.class that match the names in this list.
First, let’s create the selected_students dataset:
data selected_students;
input Name $15.;
datalines;
Alice
Brian
Emily
Henry
;
run;
Now, using a column subquery, we’ll retrieve the records of these selected students from the `sashelp.class` dataset:
proc sql;
select Name, Age, Height, Weight
from sashelp.class
where Name in (select Name from selected_students);
quit;
This SQL procedure will return the details of the students named “Alice”, “Brian”, “Emily”, and “Henry” from the sashelp.class dataset, as these are the names listed in our selected_students dataset.
Table Subqueries: Broadening the Horizon
Table subqueries, or derived tables, are used in the FROM clause of the main query. They allow for the creation of temporary tables that can be used for further processing in the main query.
Table subqueries, also known as derived tables or inline views, are subqueries that return a table which can be used in the FROM clause of an SQL statement. They allow for complex data manipulations and can be used in conjunction with joins aggregations, and other SQL operations.
For this example, let’s assume we have two datasets: sales
and returns
. We want to find out the total sales for products that have never been returned.
/* Sample data creation */
data sales;
input ProductID SalesAmount;
datalines;
1 100
2 150
3 200
4 250
5 300
;
data returns;
input ProductID ReturnAmount;
datalines;
2 50
4 100
;
Now, using a table subquery, we’ll calculate the average height for each age group and join it with the `sashelp.class` dataset:
proc sql;
create table sales_not_returned as
select ProductID, sum(SalesAmount) as TotalSales
from sales
where ProductID NOT IN (select ProductID from returns)
group by ProductID;
quit;
In this example, the table subquery (select ProductID from returns)
returns a list of product IDs that have been returned. The main query then sums the sales amounts from the sales
dataset for products that are not in this list, effectively giving us the total sales for products that have never been returned.
Correlated vs. Non-Correlated Subqueries
Correlated Subquery: A subquery that references columns from the outer query. (“TOP 50 SQL Interview Question and Answers – LinkedIn”) (“TOP 50 SQL Interview Questions and Answers – LinkedIn”) It is executed once for each row processed by the outer query. It relies on the outer query for its values.
The goal was to find students from the student's
dataset whose average score is higher than the average score of their respective class in the class_avg
dataset.
We’ll find students whose weight is above the average weight of all students.
proc sql;
create table above_avg_students as
select Name, Weight
from sashelp.class A
where Weight > (select mean(Weight) from sashelp.class B);
quit;
proc print;
In this example, the subquery calculates the average weight of all students, and the outer query selects students whose weight is above this average.
Non-Correlated Subquery: An independent subquery that can be executed separately from the main query. It does not rely on the outer query for its values.
proc sql;
create table taller_than_average as select Name, Height from sashelp.class
where Height > (select mean(Height) from sashelp.class);
quit;
proc print;
The result is a table (taller_than_average) having the names and heights of students who are taller than the average height of the entire class.
Subquery in the HAVING Clause
A subquery is used in the HAVING clause of a SQL statement. It allows you to filter grouped records by a condition that involves aggregate functions or calculations based on the grouped data.
proc sql;
create table classes_above_avg_weight as select Age, avg(Weight) as Avg_Weight
from sashelp.class group by Age having avg(Weight) > (select mean(Weight)
from sashelp.class);
quit;
proc print;
This returns age groups whose average weight is above the overall average weight of all students.
Subquery with EXISTS and NOT EXISTS
EXISTS and NOT EXISTS are operators used with subqueries to check for the existence or non-existence of rows. EXISTS returns true if the subquery returns one or more rows, while NOT EXISTS returns true if the subquery returns no rows.
proc sql;
create table exists_example as select Name, Age from sashelp.class a where
exists (select 1 from sashelp.class b where a.Age=b.Age and b.Height > 65);
quit;
proc print;
This returns student of ages where at least one student is taller than 65 inches.
Subquery with IN and NOT IN
IN and NOT IN are operators used with subqueries to filter results based on a list of values returned from a subquery. IN returns rows that match values in the list, while NOT IN returns rows that don’t match any values in the list.
proc sql;
create table in_example as select Name, Age from sashelp.class where Age
in (select Age from sashelp.class where Height > 65);
quit;
proc print;
This returns students whose age matches those of students taller than 65 inches.
Advanced Techniques in Subqueries
Using CASE Statements
Subqueries can be made more flexible with conditional logic. By incorporating the CASE statement within subqueries, you can return different values based on certain conditions.
Example:
Imagine you want to categorise students in the sashelp.class dataset based on their age: “Young” for age less than 13, “Teen” for age between 13 and 18, and “Adult” for age above 18.
proc sql;
create table age_categories as select Name, Age, (case when Age < 13 then
"Young" when Age between 13 and 18 then "Teen" else "Adult" end) as Category
from sashelp.class;
quit;
proc print;
In this example, the CASE statement within the main query classifies each student into a category based on their age.
Combining Multiple Subqueries
You can combine the results of multiple subqueries using set operations like the UNION or INTERSECT operator. This allows you to retrieve and merge data from various sources or conditions into a single result set.
Example:
Suppose you want to retrieve the names of students from the sashelp.class dataset who are either the tallest or the shortest:
proc sql;
create table extremes as select Name, Height from sashelp.class where
Height=(select max(Height) from sashelp.class) union select Name, Height from
sashelp.class where Height=(select min(Height) from sashelp.class);
quit;
proc print;
In this example:
The first subquery retrieves the tallest student(s) by comparing each student’s height to the maximum height in the dataset.
The second subquery retrieves the shortest student(s) by comparing each student’s height to the minimum height in the dataset.
The UNION operation then combines the results of these two subqueries into a single result set, which has both the tallest and shortest student(s).
When to use Joins and Subqueries
Both SQL Joins and subqueries are powerful tools in SQL that allow you to combine and filter data from multiple tables or datasets. The choice between using a join or a subquery often depends on the specific task at hand, the structure of your data, and personal preference. Let’s delve into when to use each and provide examples for clarity:
When to use Joins?
- When you need to combine rows from two or more tables based on a related column.
- When you want to retrieve data from multiple tables in a single query.
- When performance is a concern, as joins can sometimes be faster than subqueries.
Example:
Suppose you have two tables: `students` (with columns `student_id`, `name`) and `grades` (with columns `student_id`, `subject`, `grade`). You want to retrieve the names of students along with their grades:
data students;
input StudentID $ Name $;
datalines;
S01 John
S02 Jane
S03 Bob
;
run;
data scores;
input StudentID $ Score;
datalines;
S01 85
S02 90
S03 78
;
run;
/* Using Join */
proc sql;
create table combined_data as select A.StudentID, A.Name, B.Score from
students A join scores B on A.StudentID=B.StudentID;
quit;
When to use Subqueries?
- When you want to retrieve data based on the result of another query.
- When you need to perform operations on a subset of your data before joining it with another table.
- When you want to use aggregate functions to filter results (e.g., retrieving rows based on the average of a column).
Example:
Using the ‘sashelp.class’ dataset, suppose you want to retrieve the names of students whose weight is above the average weight:
proc sql;
create table above_avg_students as
select StudentID, Score
from scores
where Score > (select mean(Score) from scores);
quit;
proc print;
In this example, the subquery calculates the average weight of all students, and the main query retrieves students whose weight exceeds this average.
Joins are best suited for situations where you need to combine data from multiple tables based on a related column. Subqueries are ideal when you need to perform a calculation or filter data based on the result of another query.
In practice, the choice between joins and subqueries often comes down to the specific requirements of your task and the structure of your data. It’s also worth noting that in many cases, a task can be accomplished using either method, so understanding both tools and their strengths will allow you to choose the most efficient approach for your needs.
Important Points:
- Performance Considerations: Subqueries, especially correlated subqueries, can be slower because they might be executed multiple times. It’s essential to optimize and test the performance.
- Limitations: Not all SQL operations that can be performed in the main query can be performed in a subquery.
- Order of Execution: In SQL, the subquery is always executed before the main query.
- Subquery Depth: SAS has a limit on the nesting depth of subqueries. Avoid overly complex nested subqueries.
FAQs
-
1. What is the difference between a subquery and a join?
While both can be used to combine data from multiple tables, a subquery returns data that will be used in the main query as a condition, whereas a join combines rows from multiple tables based on a related column.
-
2. Can I use aggregate functions in a subquery?
Yes, aggregate functions like SUM, AVG, MAX, etc., can be used in subqueries.
-
3. Why is my correlated subquery running slow?
Correlated subqueries can be slow because they might be executed multiple times. Consider optimizing the query or using joins if appropriate.
-
4. Can I use a subquery in the SELECT clause?
Yes, but it should return a single value (scalar subquery) to be used in the SELECT clause.
-
5. What is the difference between IN and EXISTS in subqueries?
Both are used to filter results based on a subquery. However, IN checks for values in a list returned by a subquery, while EXISTS checks for the existence of rows in a subquery.
-
6. How do I handle NULL values when using IN or NOT IN with a subquery?
Be cautious when using NOT IN with subqueries that might return NULL values. It’s often safer to use NOT EXISTS in such cases.
Optimising Subqueries:
- Use JOIN operations where possible instead of subqueries.
- Use EXISTS instead of IN for correlated subqueries.
Common Errors and Their Resolutions
1. Error: Subquery returns more than one row in a scalar subquery context.
Resolution: Ensure that the subquery is designed to return only a single value.
2. Error: Mismatch in the number of columns between the main query and subquery.
Resolution: Ensure that both queries have the same number of columns, especially in row subqueries.
Sample Code and Dataset
For our examples, we used the sashelp.class dataset. However, for more complex scenarios, one can create custom datasets using the DATA step in SAS.
Final Thoughts on Subqueries in the Modern Data Landscape
Subqueries, when used effectively, can greatly enhance the power and flexibility of data retrieval in SAS. By understanding their nuances and potential, data analysts can harness their full potential in a myriad of data scenarios.