PROC SQL is a powerful procedure in SAS that allows users to manage and retrieve data in a structured manner. It offers the flexibility of SQL (Structured Query Language) within the SAS environment, enabling intricate data manipulations and analyses.
The Essence of Subqueries
A proc sql subqueries, often termed an inner or nested query, is a query embedded within another SQL query. It can retrieve data that will be used in the main query as a condition to further refine the data that is to be retrieved.
”Importance”: Subqueries enhance the flexibility of data retrieval, allowing for dynamic data extraction based on conditions set by the main query.
Proc SQL Subqueries can be broadly categorized into four types based on their function and the kind of result they return.
Scalar “Subqueries”: These return a single value. They are often used in the SELECT, WHERE, or HAVING clause of the main query.
Row “Subqueries”: These return a single row of values. They are commonly used in comparison operations.
Column “Subqueries”: These return a column of values. They are typically used in the WHERE clause of the main query.
Table “Subqueries”: Also known as derived tables, these return a table. They are used in the FROM clause of the main query.
Scalar subqueries are versatile tools that return a single value. This value can be used for comparison or arithmetic operations in the main query. For example, one could use a scalar subquery to retrieve the average salary of employees. Then, this value can be used in the main query to find employees who earn above the average.
“Example”:
proc sql;select "Average Height" as Description, (select mean(Height) fromsashelp.class) as Value from sashelp.class where upcase(Name)='ALFRED';quit;
“Explanation”:
Main Query:
select "Average Height" as Description, [...subquery...] as Value fromsashelp.class where upcase(Name)='ALFRED';
This is the main or outer query that retrieves the description “Average Height” and the result of the scalar subquery for the student named “ALFRED” from the ‘sashelp.class’ dataset.
Scalar “Subquery”:
(select mean(Height) from sashelp.class)
This subquery calculates the average height of all students in the `sashelp.class` dataset. Since it returns a single value (the average height), it’s a scalar subquery. If you run this subquery alone, it will give you the average height value.
The main query uses the result of the scalar subquery in its SELECT clause. Specifically, it retrieves the description “Average Height” and the average height value (calculated by the subquery) for the student named “ALFRED”.
So, the process is as “follows”:
The scalar subquery runs first and calculates the average height of all students.
The main query then runs, retrieving the description and the average height value for the student named “ALFRED”.
The result is a table row with the description “Average Height” and the average height of all students in the `sashelp.class` dataset, but only for the record where the student’s name is “ALFRED”.

Row subqueries return a single row of values. They are particularly useful when comparing records across different tables or datasets. For instance, one might want to retrieve records from a dataset that match a specific record in another dataset.
Imagine we have two “datasets”: students and prefects. We want to find the average age of students who are not prefects. Here’s how we can achieve this using a row “subquery”:
/* Sample data creation */data students;set sashelp.class;run;data prefects;set sashelp.class(where=(Name in ('Alfred', 'Alice')));keep Name;run;proc sql;create table avg_age_non_prefects asselect AVG(Age) as AverageAgefrom studentswhere Name NOT IN (select Name from prefects);quit;
In this query, the subquery (select Name from prefects) returns the names of students who are prefects. The main query then calculates the average age of students from the student's dataset who are not in the list of prefects.

This is a basic example of how row subqueries can be used in SQL to filter results based on a list of values returned by a subquery.
When writing a SQL query, it’s important to understand the difference between a single-row subquery and a scalar subquery. A single-row subquery returns a value that is used in the WHERE clause, while a scalar subquery is a SELECT statement that is used in the column list. Think of it as an inline function in the SELECT column list.
Column subqueries return a column of values. They are typically used to filter records in the main query based on a list of values. For instance, one might want to retrieve all records that match a list of product IDs.
Column subqueries are designed to return a single column of values. These values can then be used in an IN, NOT IN, EXISTS, or NOT EXISTS condition in the main query.
For this example, let’s assume we have a dataset selected_students that contains a list of student names who participated in a special school event. We want to retrieve all records from the sashelp.class that match the names in this list.
First, let’s create the selected_students “dataset”:
data selected_students;input Name $15.;datalines;AliceBrianEmilyHenry;run;
Now, using a column subquery, we’ll retrieve the records of these selected students from the `sashelp.class` “dataset”:
proc sql;select Name, Age, Height, Weightfrom sashelp.classwhere Name in (select Name from selected_students);quit;
This SQL procedure will return the details of the students named “Alice”, “Brian”, “Emily”, and “Henry” from the sashelp.class dataset, as these are the names listed in our selected_students dataset.

Table subqueries, or derived tables, are used in the FROM clause of the main query. They allow for the creation of temporary tables that can be used for further processing in the main query.
Table subqueries, also known as derived tables or inline views, are subqueries that return a table which can be used in the FROM clause of an SQL statement. They allow for complex data manipulations and can be used in conjunction with joins aggregations, and other SQL operations.
For this example, let’s assume we have two “datasets”: sales and returns. We want to find out the total sales for products that have never been returned.
/* Sample data creation */data sales;input ProductID SalesAmount;datalines;1 1002 1503 2004 2505 300;data returns;input ProductID ReturnAmount;datalines;2 504 100;
Now, using a table subquery, we’ll calculate the average height for each age group and join it with the `sashelp.class` “dataset”:
proc sql;create table sales_not_returned asselect ProductID, sum(SalesAmount) as TotalSalesfrom saleswhere ProductID NOT IN (select ProductID from returns)group by ProductID;quit;

In this example, the table subquery (select ProductID from returns) returns a list of product IDs that have been returned. The main query then sums the sales amounts from the sales dataset for products that are not in this list, effectively giving us the total sales for products that have never been returned.
Correlated Subquery: A subquery that references columns from the outer query. (“TOP 50 SQL Interview Question and Answers - LinkedIn”) (“TOP 50 SQL Interview Questions and Answers - LinkedIn”) It is executed once for each row processed by the outer query. It relies on the outer query for its values.
The goal was to find students from the student's dataset whose average score is higher than the average score of their respective class in the class_avg dataset.
We’ll find students whose weight is above the average weight of all students.
proc sql;create table above_avg_students asselect Name, Weightfrom sashelp.class Awhere Weight > (select mean(Weight) from sashelp.class B);quit;proc print;
In this example, the subquery calculates the average weight of all students, and the outer query selects students whose weight is above this average.

Non-Correlated Subquery: An independent subquery that can be executed separately from the main query. It does not rely on the outer query for its values.
proc sql;create table taller_than_average as select Name, Height from sashelp.classwhere Height > (select mean(Height) from sashelp.class);quit;proc print;
The result is a table (taller_than_average) having the names and heights of students who are taller than the average height of the entire class.

A subquery is used in the HAVING clause of a SQL statement. It allows you to filter grouped records by a condition that involves aggregate functions or calculations based on the grouped data.
proc sql;create table classes_above_avg_weight as select Age, avg(Weight) as Avg_Weightfrom sashelp.class group by Age having avg(Weight) > (select mean(Weight)from sashelp.class);quit;proc print;
This returns age groups whose average weight is above the overall average weight of all students.

EXISTS and NOT EXISTS are operators used with subqueries to check for the existence or non-existence of rows. EXISTS returns true if the subquery returns one or more rows, while NOT EXISTS returns true if the subquery returns no rows.
proc sql;create table exists_example as select Name, Age from sashelp.class a whereexists (select 1 from sashelp.class b where a.Age=b.Age and b.Height > 65);quit;proc print;

Subquery with EXISTS and NOT EXISTS
This returns student of ages where at least one student is taller than 65 inches.
IN and NOT IN are operators used with subqueries to filter results based on a list of values returned from a subquery. IN returns rows that match values in the list, while NOT IN returns rows that don’t match any values in the list.
proc sql;create table in_example as select Name, Age from sashelp.class where Agein (select Age from sashelp.class where Height > 65);quit;proc print;
This returns students whose age matches those of students taller than 65 inches.

Subquery with IN and NOT IN
Subqueries can be made more flexible with conditional logic. By incorporating the CASE statement within subqueries, you can return different values based on certain conditions.
“Example”:
Imagine you want to categorise students in the sashelp.class dataset based on their “age”: “Young” for age less than 13, “Teen” for age between 13 and 18, and “Adult” for age above 18.
proc sql;create table age_categories as select Name, Age, (case when Age < 13 then"Young" when Age between 13 and 18 then "Teen" else "Adult" end) as Categoryfrom sashelp.class;quit;proc print;
In this example, the CASE statement within the main query classifies each student into a category based on their age.

Using CASE Statements
You can combine the results of multiple subqueries using set operations like the UNION or INTERSECT operator. This allows you to retrieve and merge data from various sources or conditions into a single result set.
“Example”:
Suppose you want to retrieve the names of students from the sashelp.class dataset who are either the tallest or the “shortest”:
proc sql;create table extremes as select Name, Height from sashelp.class whereHeight=(select max(Height) from sashelp.class) union select Name, Height fromsashelp.class where Height=(select min(Height) from sashelp.class);quit;proc print;

Combining Multiple Subqueries
In this “example”:
The first subquery retrieves the tallest student(s) by comparing each student’s height to the maximum height in the dataset.
The second subquery retrieves the shortest student(s) by comparing each student’s height to the minimum height in the dataset.
The UNION operation then combines the results of these two subqueries into a single result set, which has both the tallest and shortest student(s).
Both SQL Joins and subqueries are powerful tools in SQL that allow you to combine and filter data from multiple tables or datasets. The choice between using a join or a subquery often depends on the specific task at hand, the structure of your data, and personal preference. Let’s delve into when to use each and provide examples for “clarity”:
When to use Joins?
When you need to combine rows from two or more tables based on a related column.
When you want to retrieve data from multiple tables in a single query.
When performance is a concern, as joins can sometimes be faster than subqueries.
Example:
Suppose you have two “tables”: `students` (with columns `student_id`, `name`) and `grades` (with columns `student_id`, `subject`, `grade`). You want to retrieve the names of students along with their “grades”:
data students;input StudentID $ Name $;datalines;S01 JohnS02 JaneS03 Bob;run;data scores;input StudentID $ Score;datalines;S01 85S02 90S03 78;run;/* Using Join */proc sql;create table combined_data as select A.StudentID, A.Name, B.Score fromstudents A join scores B on A.StudentID=B.StudentID;quit;

When to use Joins
When to use Subqueries?
When you want to retrieve data based on the result of another query.
When you need to perform operations on a subset of your data before joining it with another table.
When you want to use aggregate functions to filter results (e.g., retrieving rows based on the average of a column).
Example:
Using the ‘sashelp.class’ dataset, suppose you want to retrieve the names of students whose weight is above the average “weight”:
proc sql;create table above_avg_students asselect StudentID, Scorefrom scoreswhere Score > (select mean(Score) from scores);quit;proc print;

When to use Subqueries
In this example, the subquery calculates the average weight of all students, and the main query retrieves students whose weight exceeds this average.
Joins are best suited for situations where you need to combine data from multiple tables based on a related column. Subqueries are ideal when you need to perform a calculation or filter data based on the result of another query.
In practice, the choice between joins and subqueries often comes down to the specific requirements of your task and the structure of your data. It’s also worth noting that in many cases, a task can be accomplished using either method, so understanding both tools and their strengths will allow you to choose the most efficient approach for your needs.
Performance Considerations: Subqueries, especially correlated subqueries, can be slower because they might be executed multiple times. It’s essential to optimize and test the performance.
“Limitations”: Not all SQL operations that can be performed in the main query can be performed in a subquery.
Order of “Execution”: In SQL, the subquery is always executed before the main query.
Subquery “Depth”: SAS has a limit on the nesting depth of subqueries. Avoid overly complex nested subqueries.
1. What is the difference between a subquery and a join?
While both can be used to combine data from multiple tables, a subquery returns data that will be used in the main query as a condition, whereas a join combines rows from multiple tables based on a related column.
2. Can I use aggregate functions in a subquery?
Yes, aggregate functions like SUM, AVG, MAX, etc., can be used in subqueries.
3. Why is my correlated subquery running slow?
Correlated subqueries can be slow because they might be executed multiple times. Consider optimizing the query or using joins if appropriate.
4. Can I use a subquery in the SELECT clause?
Yes, but it should return a single value (scalar subquery) to be used in the SELECT clause.
5. What is the difference between IN and EXISTS in subqueries?
Both are used to filter results based on a subquery. However, IN checks for values in a list returned by a subquery, while EXISTS checks for the existence of rows in a subquery.
6. How do I handle NULL values when using IN or NOT IN with a subquery?
Be cautious when using NOT IN with subqueries that might return NULL values. It’s often safer to use NOT EXISTS in such cases.
Use JOIN operations where possible instead of subqueries.
Use EXISTS instead of IN for correlated subqueries.
Common Errors and Their Resolutions
1. “Error”: Subquery returns more than one row in a scalar subquery context.
“Resolution”: Ensure that the subquery is designed to return only a single value.
2. “Error”: Mismatch in the number of columns between the main query and subquery.
“Resolution”: Ensure that both queries have the same number of columns, especially in row subqueries.
Sample Code and Dataset
For our examples, we used the sashelp.class dataset. However, for more complex scenarios, one can create custom datasets using the DATA step in SAS.
Final Thoughts on Subqueries in the Modern Data Landscape
Subqueries, when used effectively, can greatly enhance the power and flexibility of data retrieval in SAS. By understanding their nuances and potential, data analysts can harness their full potential in a myriad of data scenarios.
