One of the most common questions that SAS users have when they use Proc SQL is: what is the difference between Count(*) and Count(variable)?
In this blog post, I will explain the difference and show some examples of how to use them correctly.
Count() is a function that returns the number of rows in a table or a group. It does not care about the values of any variables, it just counts how many rows there are. For example, if we have a table called Customers with 10 rows, then Count() will return 10.
Count(variable) is a function that returns the number of non-missing values of a variable in a table or a group. It only counts the rows where the variable has a value, and ignores the rows where the variable is missing. For example, if we have a table called Customers with 10 rows, but one of them has a missing value for the variable Age, then Count(Age) will return 9.
Here are some examples of how to use Count(*) and Count(variable) in Proc SQL:
data sample_data;
input ID Name $ Age;
datalines;
1 John 25
2 Alice 30
3 Bob .
4 Sarah 28
5 Mike 35
;
run;
proc print;
To get the total number of students in the table sample_data table, we can use:
proc sql;
/* Count(*) example */
select count(*) as TotalRows
from sample_data;
quit;
To get the number of non-missing values in the ‘Age’ column of the sample_data table, we can use:
proc sql;
/* Count(variable) example */
select count(Age) as NonMissingAgeCount
from sample_data;
quit;
Count(*) vs Count(variable) in Proc SQL Group By
When using COUNT(*)
and COUNT(variable)
in conjunction with the GROUP BY
clause in PROC SQL, the difference becomes more pronounced.Let’s demonstrate this with an example:
data sample_data;
input ID Name $ Age;
datalines;
1 John 25
2 Alice 30
3 Bob .
4 Sarah 28
5 Mike 35
1 Tom 22
2 Carol 30
;
run;
Use COUNT(*)
with GROUP BY
to count the total number of rows for each unique ID.
proc sql;
/* Count(*) with GROUP BY example */
select ID, count(*) as TotalRows
from sample_data
group by ID;
quit;
Use GROUP BY with COUNT(Age) to count non-missing values in ‘Age’ column for each unique ID.
proc sql;
/* Count(variable) with GROUP BY example */
select ID, count(Age) as NonMissingAgeCount
from sample_data
group by ID;
quit;
Performance implications of using Count(*) versus Count(variable) on large datasets.
When it comes to performance implications of using Count() versus Count(variable) on large datasets in SAS, it is generally recommended to use Count(variable) instead of Count().
Count() counts all rows in a table, including null values, while Count(variable) counts only non-null values. Count() can be slower than Count(variable) because it requires more processing power and memory.
It’s worth noting that using Count(*) to obtain the number of observations in a dataset can lead to performance issues, particularly when dealing with large datasets.
In addition, performance can be affected by the number of variables in the dataset, with datasets containing more variables generally taking longer to process. Proper use of indexes can also have an impact on SAS performance.
Conclusion
As you can see, Count(*) and Count(variable) are useful functions to perform some basic statistics on tables or groups. However, they are not interchangeable, and you need to be careful about which one you use depending on what you want to count. and gives count of non-missing records only.
Count(*) counts both missing and non-missing values while Count(column) only counts non-missing values.