Chapter 7 – SQL Joins
Section I – Usage
What is a Join?
Up to this point, we’ve only used single-table queries. If the only analysis we ever did were on single-table databases, the queries we’ve learned so far would be all we would need! However, analysis is never that easy. Most production databases will be normalized in a snowflake schema, which require lengthy joins, and even datamarts using star schemas will require multiple table queries.
So, how does an analyst cope? With Joins. SQL utilizes seven joins, six distinct, which allow us to combine multiple input-tables into a single output-table.
Section II - Types of Joins
Sub-queries
Sub-queries are queries that are embedded within other queries. A sub-query is not a traditional join in that the output will only come from a single input-table, but you will be constraining your results versus one or more other tables.
Let’s start out with a simple example. Here we want to grab only interval data that we’ve estimated. Since the Data_Load table only has numbers, we need to get the data from the LKP_StatusCodes table. The first thing we’ll need to do is determine what constitutes an estimate. We could easily go to the table and lookup the values we need, but it’s easier to write a query.
Proc SQL;
Create Table Example01a as
Select ID
From Chapt07.LKP_StatusCodes
Where Lower(Description) contains “estimate”;
Quit;
Now, this lets us know what exactly it is we’re looking for, but since the Data_Load table doesn’t contain the word “estimate”, we need to get the tables in one query. When using a sub-query, we join the two tables using a WHERE statement. Please note that your sub-query can ONLY return a single column. Also, since it’s possible the sub-query contains more than one row, you MUST use the “in” keyword.
Proc SQL;
Create Table Example01b as
Select *
From Chapt07.Data_Load
Where StatusCodeID in (Select ID
From Chapt07.LKP_StatusCodes
Where Lower(Description) contains “estimate”);
Quit;
WHERE (Implicit Inner Join)
Creating a join is fairly straight-forward. All we need to do is specify every table name we wish to pull data from, and constrain the results using a WHERE statement. However, with this type of join, the resultant dataset will ONLY contain the rows that adhere to the constraints.
Proc SQL;
Create Table Example02 as
Select A.*,
B.City as WeatherStationCity,
B.State as WeatherStationState
From Chapt07.Obj_Recorders as A, Chapt07.Obj_Weatherstations as B
Where A.WeatherStationID = B.ID;
Quit;
We can also join multiple tables in the same way, as long as we remember to add the WHERE statement.
Proc SQL;
Create Table Example03 as
Select A.OpCo,
B.*,
C.Code,
C.Description
From Chapt07.Obj_Recorders as A,
Chapt07.Data_Load as B,
Chapt07.LKP_StatusCodes as C
Where A.ID = B.RecorderID
And B.StatusCodeID = C.ID;
Quit;
Cartesian Join
Before we get into the specific keywords involved in a proper join, let’s discuss the Cartesian Join. A Cartesian join (or Cartesian product for those familiar with linear algebra) occurs whenever we forget the WHERE keyword. You will probably experience the Cartesian join multiple times using SQL, but almost always due to forgetting a constraining column. However, when you do wish to utilize the Cartesian join, it can be very powerful. Take for example, suppose we wish to create a calendar dataset. Now, we could open Excel and type all of it, but it’s much easier to create from scratch. Let’s start by creating three sample datasets:
Proc SQL;
Create Table Example04a as Select Distinct IDYear from Chapt07.LKP_LoadPeriods;
Create Table Example04b as Select Distinct IDMonth from Chapt07.LKP_LoadPeriods;
Quit;
In the Year dataset, we found 6 years. In the month dataset, we obviously found 12. Now that we’ve got our base datasets, let’s merge them.
Proc SQL;
Create Table Example04c as
Select IDYear as Year,
IDMonth as Month
From Example04a, Example04b;
Quit;
Here, we can see that we have 72 rows (6*12), which is exactly what we’d expect.
Inner Join (Explicit Inner Join)
Now that we’ve seen how to merge multiple tables, let’s show another way of merging tables. The INNER JOIN is exactly the same as the WHERE Join, except it’s easier for others to decipher our code. This type of join is known as an equi-join based on the equality of two tables. Let’s examine Examples 2 and 3 and rewrite them using INNER JOIN syntax. In an INNER JOIN statement, we separate tables with the words “INNER JOIN” and then equate them using the “ON” keyword. The “ON” keyword works the same as the WHERE statement. In Set Theory, an Inner Join is also called an Intersection. (A Ç B)
Proc SQL;
Create Table Example05 as
Select A.*,
B.City as WeatherStationCity,
B.State as WeatherStationState
From Chapt07.Obj_Recorders as A
INNER JOIN Chapt07.Obj_Weatherstations as B ON A.WeatherStationID = B.ID;
Quit;
Proc SQL;
Create Table Example06 as
Select A.OpCo,
B.*,
C.Code,
C.Description
From Chapt07.Obj_Recorders as A
INNER JOIN Chapt07.Data_Load as B ON A.ID = B.RecorderID
INNER JOIN Chapt07.LKP_StatusCodes as C ON B.StatusCodeID = C.ID;
Quit;
Please note, SAS will evaluate multi-table joins based on the order in which they are written, unless parentheses are used. This can be very helpful when you need to speed up processes. Join the largest tables last to minimize resource usage.
Section III - Advanced Joins
Before we get to the more advanced joins, run the following query. This will allow us to magnify how the advanced joins work.
Proc SQL;
Update Chapt07.Obj_Recorders Set WeatherStationID = “XXX” Where Opco = “SWP”;
Quit;
Left Outer Join
The Left Outer Join works similar to the Inner Join, except that in the resultant dataset, everything in the “Left” table will be returned, along with any corresponding rows in the “Right” table. In Set theory, we’d represent this as “Everything in A”(A).
Proc SQL;
Create Table Example07 as
Select A.*,
B.*
From Chapt07.Obj_WeatherStations as A
Left Outer Join Chapt07.Obj_Recorders as B
On A.ID = B.WeatherStationID;
Quit;
Right Outer Join
The Right Outer Join is the exact same as the Left Outer Join, except we’re keeping everything that occurs in the “Right” table and any corresponding values that are in the “Left” table. In Set Theory, we’d represent this as “Everything in B” (B).
Proc SQL;
Create Table Example08 as
Select A.*,
B.*
From Chapt07.Obj_WeatherStations as A
Right Outer Join Chapt07.Obj_Recorders as B
On A.ID = B.WeatherStationID;
Quit;
Full Outer Join
The Full Outer Join will return everything from A and B, and when the values in A are not equitable to values in B, a Cartesian product occurs. In Set theory, we’d represent this as A Union B (A È B).
Proc SQL;
Create Table Example09 as
Select A.*,
B.*
From Chapt07.Obj_WeatherStations as A
Full Outer Join Chapt07.Obj_Recorders as B
On A.ID = B.WeatherStationID;
Quit;
