One of the most common questions that SAS users have when they use Proc SQL “is”: what is the difference between Count(*) and Count(variable)?
In this blog post, I will explain the difference and show some examples of how to use them correctly.
Count() is a function that returns the number of rows in a table or a group. It does not care about the values of any variables, it just counts how many rows there are. For example, if we have a table called Customers with 10 rows, then Count() will return 10.
Count(variable) is a function that returns the number of non-missing values of a variable in a table or a group. It only counts the rows where the variable has a value, and ignores the rows where the variable is missing. For example, if we have a table called Customers with 10 rows, but one of them has a missing value for the variable Age, then Count(Age) will return 9.
Here are some examples of how to use Count(*) and Count(variable) in Proc “SQL”:
data sample_data;input ID Name $ Age;datalines;1 John 252 Alice 303 Bob .4 Sarah 285 Mike 35;run;proc print;
To get the total number of students in the table sample_data table, we can “use”:
proc sql;/* Count(*) example */select count(*) as TotalRowsfrom sample_data;quit;
To get the number of non-missing values in the ‘Age’ column of the sample_data table, we can “use”:
proc sql;/* Count(variable) example */select count(Age) as NonMissingAgeCountfrom sample_data;quit;
When using COUNT(*) and COUNT(variable) in conjunction with the GROUP BY clause in PROC SQL, the difference becomes more pronounced.Let’s demonstrate this with an “example”:
data sample_data;input ID Name $ Age;datalines;1 John 252 Alice 303 Bob .4 Sarah 285 Mike 351 Tom 222 Carol 30;run;
Use COUNT(*) with GROUP BY to count the total number of rows for each unique ID.
proc sql;/* Count(*) with GROUP BY example */select ID, count(*) as TotalRowsfrom sample_datagroup by ID;quit;
Use GROUP BY with COUNT(Age) to count non-missing values in ‘Age’ column for each unique ID.
proc sql;/* Count(variable) with GROUP BY example */select ID, count(Age) as NonMissingAgeCountfrom sample_datagroup by ID;quit;
When it comes to performance implications of using Count() versus Count(variable) on large datasets in SAS, it is generally recommended to use Count(variable) instead of Count().
Count() counts all rows in a table, including null values, while Count(variable) counts only non-null values. Count() can be slower than Count(variable) because it requires more processing power and memory.
It’s worth noting that using Count(*) to obtain the number of observations in a dataset can lead to performance issues, particularly when dealing with large datasets.
In addition, performance can be affected by the number of variables in the dataset, with datasets containing more variables generally taking longer to process. Proper use of indexes can also have an impact on SAS performance.
As you can see, Count(*) and Count(variable) are useful functions to perform some basic statistics on tables or groups. However, they are not interchangeable, and you need to be careful about which one you use depending on what you want to count. and gives count of non-missing records only.
Count(*) counts both missing and non-missing values while Count(column) only counts non-missing values.
