Understanding SQL Joins - From beginners to advanced
Posted in Algorithms, Hash, Joins, Merge, SQL, SQL Server 5 comments
In this article I will show some many samples for SQL Joins, having the following simple database, this database contains all possible relations to can preview all types joins, we have 1-1, 1-many, many-many relationships (as in the figure).
fig. (5)
As you notice, only records from table "Books" where "PublisherId" is not null - I'll talk about about the algorithm used in inner join later -, as in fig. (2), the inner join select only the intersection between two tables.
Left is the left input of the equality equation (in most cases the table that contains foreign key) and right is the input of the equality.
We will have the following result:
we will got the following result:
First notice what is the different between two queries? Left and right, and equality is vice-versa
it comes to physical join operators, hash join does the heavy lifting. While nested loops join works well with relatively small data sets and merge join helps with moderately sized data sets, hash join excels at performing the largest joins. Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.
Hash join shares many characteristics with merge join. Like merge join, it requires at least one equijoin predicate, supports residual predicates, and supports all outer and semi-joins. Unlike merge join, it does not require ordered input sets and, while it does support full outer join, it does require an equijoin predicate.
(*) starting from this part, I used the material of session I attended couple of years earlier by (Mohamed Moshrif - Microsoft SQL Server Development Team), it was a brilliant session, and it was the main reason that I started loving SQL Server :)
As a start, we have 5 types of joins ("Inner" join, Left join, Right join, Full Join, Cross Join) and Self join will be represented as inner, left or right join as I will show in the next examples.
Inner Join:
Let's start with Join (or Inner Join), as from the previous diagram, Inner join is the intersection between two tables, in our case I will use Books and Publishers tables, these tables have the following data:
when we perform inner join, we will select from the table (Books) records where the "PublisherId" column is not null, so when we write the following Query:
SELECT Books.BookName , Publishers.PublisherName FROM dbo.Books JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
fig. (5)
As you notice, only records from table "Books" where "PublisherId" is not null - I'll talk about about the algorithm used in inner join later -, as in fig. (2), the inner join select only the intersection between two tables.
What Does the word Left and Right exactly means??
Before talking Left and right join, what does the words left and right actually represents in the query? if we take the previous query like in the following graph:Left is the left input of the equality equation (in most cases the table that contains foreign key) and right is the input of the equality.
Left Join:
So when say I will use Left Join, this means you will select all records from the left table of the equality including the records in the intersection between the two tables, so if you looked to fig. (2), we will select the "Left" area UNION "Inner" area.
so if we execute the following query:
SELECT Books.BookName , Publishers.PublisherName FROM dbo.Books LEFT JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
We will have the following result:
Right Joins:
Also the same idea when you use right join, you select all records from the "right" area from fig. (2) union all records in the "inner" area, so if we executed the following query:
SELECT Books.BookName , Publishers.PublisherName FROM dbo.Books RIGHT JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
we will got the following result:
Fun with Left and Right joins:
After you understand what is left and right joins and what are the output of each one of them, the question now, what will be the result of of the of the next query?SELECT Books.BookName , Publishers.PublisherName FROM dbo.Books RIGHT JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID SELECT Books.BookName , Publishers.PublisherName FROM dbo.Publishers LEFT JOIN dbo.Books ON dbo.Publishers.ID = dbo.Books.PublisherId
First notice what is the different between two queries? Left and right, and equality is vice-versa
What about the result?? actually the result of these two queries are the same, since the first query I wrote Right join, and the table on the Right was Publishers, and the the second query I wrote Left query, and the table on the left was also Publisher, that's what I call magic :)
The Myth of the words "INNER" and "OUTER"
As you notice I did not use the keywords "inner" or "outer" in the last queries, what are they mean? and why many people uses them??
When you use these words, you just indicates to yourself what part I will select from this relationship, so when use "Inner" that means you will select the intersection between the two tables where where the foreign key in original table (Master Table) is not null, that means "Join" equals "Inner Join", so the next two queries are absolutely the same and will give the same results:
SELECT Books.BookName , Publishers.PublisherName FROM dbo.Books JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
SELECT Books.BookName , Publishers.PublisherName FROM dbo.Books INNER JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
The same Idea for both left and right join, when you write left/right join, you actually select the outer part of the relationship (check both fig. (10) and fig. (2), left and right are outer records of the relationship), this means the following queries are the same:
SELECT Books.BookName , Publishers.PublisherName FROM dbo.Books LEFT JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
and
SELECT Books.BookName , Publishers.PublisherName FROM dbo.Books LEFT OUTER JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
The same for right join also
SELECT Books.BookName , Publishers.PublisherName FROM dbo.Books RIGHT JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
and
SELECT Books.BookName , Publishers.PublisherName FROM dbo.Books RIGHT OUTER JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
Full Join
Full join is Union of both "Left" and "Right" Join, the full join get all data in tables in teh relationship, so in our sample, the Books table contains 13 records, the publishers table contains 7 records, the full join will get the 13 records of books union all records in table "Publisher" which are not in referenced in "Books" table, the syntax of "Full Join" is like the following query:
SELECT BookName , PublisherName FROM dbo.Books FULL JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
Wich will give the following value:
Cross Join:
Cross join used to get all available combination between two (or more) tables in the query, in our case the Books have 13 records, Publisher have 7 records, so all available records will be 13 x 9 = 91 records, and the syntax of cross join is a little bit different from any other join syntax, since you will not state the primary key and foreign key of the relationship "ON Table2.PrimaryKey = Table1.ForeignKey", so the query will be like this:
SELECT BookName , PublisherName FROM dbo.Books cross JOIN dbo.Publishers
And will give the following result:
Self Join:
Self join do not have a special syntax, but it's accomplished by setting the same table name with an alias to can set a join syntax, and hence we can use either "Inner" or "Outer" (Left/Right) joins, but it will make more since if we used Left join because inner join will not give the accurate data, consider the following example, the "Books" table (Check fig. (2)) we have a column called "ParentBookId" which indicates if this book is a part of series of books the "ParentBookId" will be the Id of the previous book, so if we used the "INNER JOIN", with the following syntax:
SELECT B1.BookName , B2.BookName AS ParentBookName FROM dbo.Books B1 JOIN dbo.Books B2 ON B1.ID = B2.ParentBookId
The result will be only 6 records (since INNER JOIN gets only where foreign key is not null), so the accurate result will be by using outer Join (Left or Right) like this:
SELECT B1.BookName , B2.BookName AS ParentBookName FROM dbo.Books B1 LEFT JOIN dbo.Books B2 ON B1.ID = B2.ParentBookId
Which will get the following result:
How does Join actually works??(*)
1- INNER JOINS:
Considering the following query:
SELECT ColumnA, ColumnB, ColumnC, ...., FROM OUTER_TABLE JOIN INNER_TABLE ON OUTER_TABLE.Column1 = INNER_TABLE.column2
When you execute this query the following algorithm applied:
for each row R1 in the outer_table for each row R2 in the inner_table if R1 joins with R2 return (R1, R2)
In other way, the SQL Engine generates a nested loop, first loop takes every row from the OUTER_TABLE and in the other loop takes every row from the INNER_TABLE then check if the row from outer table joins with the row in the inner table, if not join continue loop, if joins return the two row.
2- OUTER JOINS
The same idea for Outer (Left) join:
SELECT ColumnA, ColumnB, ColumnC, ...., FROM OUTER_TABLE LEFT JOIN INNER_TABLE ON OUTER_TABLE.Column1 = INNER_TABLE.column2
on executing this query, the following algorithm applied:
for each row R1 in the outer table begin for each row R2 in the inner table if R1 joins with R2 return (R1, R2) if R1 did not join return (R1, NULL) end
In other way, the SQL Engine generates a nested loop, first loop takes every row from the OUTER_TABLE and in the other loop takes every row from the INNER_TABLE then check if the row from outer table joins with the row in the inner table, if two rows are joins, return two rows, else return outer row and NULL.
Other Joins Algorithms:
All previous example are done by using the default join algorithm (which is called Loop join algorithm), there are other types of joins, like:
- Nested (Loop) 'Default'
- Merge
- Hash
Merge Joins:
Unlike the nested loops join which supports any join predicate, the merge join requires at least one equijoin predicate. Moreover, the inputs to the merge join must be sorted on the join keys. For example, if we have a join predicate “T1.a = T2.b,” table T1 must be sorted on T1.a and table T2 must be sorted on T2.b.
The merge join works by simultaneously reading and comparing the two sorted inputs one row at a time. At each step, we compare the next row from each input. If the rows are equal, we output a joined row and continue. If the rows are not equal, we discard the lesser of the two inputs and continue. Since the inputs are sorted, we know that we are discarding a row that is less than any of the remaining rows in either input and, thus, can never join.
We can express the algorithm in pseudo-code as:
get first row R1 from input 1 get first row R2 from input 2 while not at the end of either input begin if R1 joins with R2 begin return (R1, R2) get next row R2 from input 2 end else if R1 < R2 get next row R1 from input 1 else get next row R2 from input 2 end
Unlike the nested loops join where the total cost may be proportional to the product of the number of rows in the input tables, with a merge join each table is read at most once and the total cost is proportional to the sum of the number of rows in the inputs. Thus, merge join is often a better choice for larger inputs.
Hash Join:
it comes to physical join operators, hash join does the heavy lifting. While nested loops join works well with relatively small data sets and merge join helps with moderately sized data sets, hash join excels at performing the largest joins. Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.
Hash join shares many characteristics with merge join. Like merge join, it requires at least one equijoin predicate, supports residual predicates, and supports all outer and semi-joins. Unlike merge join, it does not require ordered input sets and, while it does support full outer join, it does require an equijoin predicate.
The hash join executes in two phases: build and probe. During the build phase, it reads all rows from the first input (often called the left or build input), hashes the rows on the equijoin keys, and creates an in-memory hash table. During the probe phase, it reads all rows from the second input (often called the right or probe input), hashes these rows on the same equijoin keys, and looks or probes for matching rows in the hash table. Since hash functions can lead to collisions (two different key values that hash to the same value), we typically must check each potential match to ensure that it really joins.
for each row R1 in the build table begin calculate hash value on R1 join key(s) insert R1 into the appropriate hash bucket end for each row R2 in the probe table begin calculate hash value on R2 join key(s) for each row R1 in the corresponding hash bucket if R1 joins with R2 return (R1, R2) end