Home > Education > Chapter 6 – Group By

Chapter 6 – Group By

Section I – Usage

When to Use

In SQL, data aggregations are done using one or more aggregation functions and sometimes the GROUP BY statement coupled with the HAVING statement. Using these functions, we can return data in a more usable format for reporting and analysis purposes. The HAVING statement allows us to constrain the results of an aggregation, similar to a WHERE statement.

Whenever you are interested in summarizing your data in some manner, the GROUP BY (and HAVING) statements are used. Examples would include, determining the number of rows in a subset of data, averaging test scores for each student in a class, or finding every meter that has a minimum usage of x-kWh.

Section II – Aggregation Functions

Function Description
Count() Returns the number of rows in a column.
Avg() Returns a column’s average rules.
Sum() Returns the sum of a column’s values.
Min() Returns a column’s lowest value.
Max() Returns a column’s highest value.

Count

The count function is pretty self-evident; it counts. The count function will return the number of rows in a column, sometimes matching a set of constraints. Count() has two possible uses, Count(*) will return the number of all possible rows, regardless of missing values, whereas Count([ColumnName]) will only return the number of non-missing rows.

Proc SQL;

Create Table Example01 as

Select Count(*) as CountAll,

Count(Premise) as CountNonMissing

From Chapt06.Obj_Recorders;

Quit;

When we add in the GROUP BY statement, we can then aggregate at a more-in depth level. Let’s do the same example, but group by the Operating Company column (OpCo). When using the GROUP BY statement, only include the column names that are NOT using an aggregation function, and only columns that you wish to include in the output.

Proc SQL;

Create Table Example02 as

Select OpCo,

Count(*) as CountAll,

Count(Premise) as CountNonMissing

From Chapt06.Obj_Recorders

Group By OpCo;

Quit;

Sum

The SUM function is used to return total values in a specific numeric-only column. However, it is at its most powerful when combined with the GROUP BY statement. The difference between the COUNT and SUM functions, is that you can’t use the * wildcard in a SUM statement. It requires a specific numeric-only column.

Proc SQL;

Create Table Example03 as

Select Sum(Value) as TotalKWH

From Chapt06.Data_Load;

Quit;

Now, when we add the GROUP BY statement, it allows us to increase the granularity of our aggregation. We’ll do a couple of examples to really drive this home, and some common mistakes.

First, let’s add the RecorderID, so we can learn exactly how much each point used.

Proc SQL;

Create Table Example04 as

Select RecorderID,

Sum(Value) as TotalKWH

From Chapt06.Data_Load

Group By RecorderID;

Quit;

We still don’t know how long each customer has been recording their data, so at this level, the aggregation is a little useless. Let’s look at the usage of each recorder by year and month. Run the example and see if you can spot the problem before turning the page.

Proc SQL;

Create Table Example05 as

Select RecorderID,

Year(DatePart(LoadPeriodID)) as Year,

Month(DatePart(LoadPeriodID)) as Month,

Sum(Value) as TotalKWH

From Chapt06.Data_Load

Group By RecorderID;

Quit;

As I’m sure you’ve guessed by now, the previous example didn’t have the correct columns listed in the GROUP BY statement. Since we were trying to aggregate by RecorderID, Month, and Year, then we have to have those items listed, both in the SELECT statement, and in the GROUP BY statement.

In the next two examples we look at how the order of the columns in the GROUP BY statement changes the order of the output. To get around this, simply add an ORDER BY statement to the very end.

Proc SQL;

Create Table Example06 as

Select RecorderID,

Year(DatePart(LoadPeriodID)) as Year,

Month(DatePart(LoadPeriodID)) as Month,

Sum(Value) as TotalKWH

From Chapt06.Data_Load

Group By RecorderID, Month, Year;

Quit;

Proc SQL;

Create Table Example07 as

Select RecorderID,

Year(DatePart(LoadPeriodID)) as Year,

Month(DatePart(LoadPeriodID)) as Month,

Sum(Value) as TotalKWH

From Chapt06.Data_Load

Group By RecorderID, Year, Month;

Quit;

Proc SQL;

Create Table Example08 as

Select RecorderID,

Year(DatePart(LoadPeriodID)) as Year,

Month(DatePart(LoadPeriodID)) as Month,

Sum(Value) as TotalKWH

From Chapt06.Data_Load

Group By RecorderID, Year, Month

Order By Year, Month, RecorderID;

Quit;


Average

The AVG function is used to return the arithmetic mean of a column. AVG counts the number of rows, sums the total values of the column in question, and then divides the sum by the count. Let’s go through the previous SUM examples, but use the AVG function instead.

Proc SQL;

Create Table Example09 as

Select RecorderID,

Year(DatePart(LoadPeriodID)) as Year,

Month(DatePart(LoadPeriodID)) as Month,

Avg(Value) as TotalKWH

From Chapt06.Data_Load

Group By RecorderID, Month, Year;

Quit;

Proc SQL;

Create Table Example10 as

Select RecorderID,

Year(DatePart(LoadPeriodID)) as Year,

Month(DatePart(LoadPeriodID)) as Month,

Avg(Value) as AvgKW

From Chapt06.Data_Load

Group By RecorderID, Year, Month;

Quit;

Proc SQL;

Create Table Example11 as

Select RecorderID,

Year(DatePart(LoadPeriodID)) as Year,

Month(DatePart(LoadPeriodID)) as Month,

Avg(Value) as TotalKWH

From Chapt06.Data_Load

Group By RecorderID, Year, Month

Order By Year, Month, RecorderID;

Quit;


Max and Min

The MAX and MIN functions are similar, so we’ll do them both at once. The MAX function will return the largest value within a column, and the MIN function will return the smallest.

Proc SQL;

Create Table Example12 as

Select RecorderID,

Min(LoadPeriodID) as StartDate format=datetime19.,

Max(LoadPeriodID) as EndDate format=datetime19.

From Chapt06.Data_Load

Group By RecorderID;

Quit;

Proc SQL;

Create Table Example13 as

Select RecorderID,

Min(LoadPeriodID) as StartDate format=datetime19.,

Max(LoadPeriodID) as EndDate format=datetime19.,

Max(Value) as PeakKW

From Chapt06.Data_Load

Group By RecorderID;

Quit;

Section III – HAVING

The HAVING keyword allows us to constrain our results based on the aggregated outcome. Now, most beginners will be confused by the need for the HAVING statement since we already have the WHERE statement, but there is one very large difference in their usage.

First, the HAVING statement should only be used on aggregated columns post-query, and the WHERE statement should only be used on non-aggregated post-query results. Second, and this is VERY important, the WHERE statement follows the FROM statement, and is before the GROUP BY statement; The HAVING statement follows the GROUP BY statement, and precedes the ORDER BY statement.

Select

From

Where

Group By

Having

Order By;

If that seems confusing, don’t worry….it is! Let’s do our first example to show the difference.

Proc SQL;

Create Table Example14 as

Select RecorderID,

Avg(Value) as AvgKW

From Chapt06.Data_Load

Where Value > 2000

Group By RecorderID;

Quit;

In the previous example, we constrained on the Value column to only include those values greater than 2000 kW. In the next example, we’ll constrain the results to only include those items that have an AvgKW greater than 2000.

Proc SQL;

Create Table Example15 as

Select RecorderID,

Avg(Value) as AvgKW

From Chapt06.Data_Load

Group By RecorderID

Having AvgKW > 2000;

Quit;

Let’s do two more examples that show the HAVING statement as a way to exclude rows from the resultant dataset in a before and after:

Proc SQL;

Create Table Example16 as

Select OpCo,

Count(*) as Accounts

From Chapt06.Obj_Recorders

Group By OpCo;

Quit;

Proc SQL;

Create Table Example17 as

Select OpCo,

Count(*) as Accounts

From Chapt06.Obj_Recorders

Group By OpCo

Having Accounts > 5;

Quit;

I won’t do any more HAVING examples at this time, but it’s imperative to remember that the HAVING statement constrains the resultant data set, and the WHERE statement constrains the source dataset.


Section IV – Examples: Let’s tie ‘em together!

In this example, we want to find the peak demand and average demand for each recorder, by month and year where the load factor (avg demand / peak demand) is greater than 70% ordered by highest load factor to lowest:

Proc SQL;

Create Table Example18 as

Select RecorderID,

Year(DatePart(LoadPeriodID)) as Year,

Month(DatePart(LoadPeriodID)) as Month,

Avg(Value) as TotalKW,

Max(Value) as PeakKW,

Avg(Value)/Max(Value) as LoadFactor

From Chapt06.Data_Load

Group By RecorderID, Month, Year

Having LoadFactor > .7

Order By LoadFactor DESC;

Quit;

Now let’s do the same query, but let’s only do it for the summer months.

Proc SQL;

Create Table Example19 as

Select RecorderID,

Year(DatePart(LoadPeriodID)) as Year,

Month(DatePart(LoadPeriodID)) as Month,

Avg(Value) as TotalKW,

Max(Value) as PeakKW,

Avg(Value)/Max(Value) as LoadFactor

From Chapt06.Data_Load

Where Calculated Month between 6 and 9

Group By RecorderID, Month, Year

Having LoadFactor > .7

Order By LoadFactor DESC;

Quit;

Learn Forward

Ok, one, last example and it’s tricky. We want to determine two important values from this query; we want the Peak Demand for each recorder, and we went the exact interval that peak occurred. Let’s give it a try:

Proc SQL;

Create Table Example20 as

Select Distinct

RecorderID,

LoadPeriodID,

Value

From Chapt06.Data_Load

Group By RecorderID, LoadPeriodID

Having Max(Value) = Value;

Quit;

The reason this is such a difficult query is because we can’t include the LoadPeriodID column in the GROUP BY statement, or else we wouldn’t find the peak for the whole period. So let’s remove LoadPeriodID from the query:

Proc SQL;

Create Table Example21 as

Select Distinct

RecorderID,

Value

From Chapt06.Data_Load

Group By RecorderID

Having Max(Value) = Value;

Quit;

Not what we wanted either. If we don’t include the LoadPeriodID in the GROUP BY statement, then we’ll never know when the peak occurred. So how to we get it? We merge the summary statistics back into the original dataset! For this example, we’re going to let SAS do the dirty work of merging two datasets (original with summary), but in our next class we’ll do the merges explicitly.

Proc SQL;

Create Table Example22 as

Select Distinct

RecorderID,

LoadPeriodID,

Value

From Chapt06.Data_Load

Group By RecorderID

Having Max(Value) = Value;

Quit;

When you ran this query, you probably noticed a note in the SAS log that said it had to remerge summary statistics back with the original data. This is exactly what we want. You also probably noticed that one of the recorders came back with 200,000+ rows of peaks. Any guesses as to why?

Advertisement
Categories: Education Tags:
  1. No comments yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.