Follow me on Twitter RSS FEED

Understanding SQL Joins - From beginners to advanced

Posted in By waelouf 5 comments

In this article I will show some many samples for SQL Joins, having the following simple database, this database contains all possible relations to can preview all types joins, we have 1-1, 1-many, many-many relationships (as in the figure).

 fig. (1)
As a start, we have 5 types of joins ("Inner" join, Left join, Right join, Full Join, Cross Join) and Self join will be represented as inner, left or right join as I will show in the next examples.

fig. (2)

Inner Join:


Let's start with Join (or Inner Join), as from the previous diagram, Inner join is the intersection between two tables, in our case I will use Books and Publishers tables, these tables have the following data:


fig.(3) fig. (4)
when we perform inner join, we will select from the table (Books) records where the "PublisherId" column is not null, so when we write the following Query:

SELECT  Books.BookName , 
        Publishers.PublisherName
FROM    dbo.Books
        JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID


 fig. (5)

As you notice, only records from table "Books" where "PublisherId" is not null - I'll talk about about the algorithm used in inner join later -, as in fig. (2), the inner join select only the intersection between two tables.

What Does the word Left and Right exactly means??

Before talking Left and right join, what does the words left and right actually represents in the query? if we take the previous query like in the following graph:

 fig. (6)



Left is the left input of the equality equation (in most cases the table that contains foreign key) and right is the input of the equality.

Left Join:

So when say I will use Left Join, this means you will select all records from the left table of the equality including the records in the intersection between the two tables, so if you looked to fig. (2), we will select the "Left" area UNION "Inner" area.

so if we execute the following query:

SELECT  Books.BookName ,
        Publishers.PublisherName
FROM    dbo.Books
        LEFT JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID

We will have the following result:

 fig. (7)

Right Joins:

Also the same idea when you use right join, you select all records from the "right" area from fig. (2) union all records in the "inner" area, so if we executed the following query:

SELECT  Books.BookName ,
        Publishers.PublisherName
FROM    dbo.Books
        RIGHT JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID

we will got the following result:

 fig. (8)

Fun with Left and Right joins:

After you understand what is left and right joins and what are the output of each one of them, the question now, what will be the result of of the of the next query?

SELECT  Books.BookName ,
        Publishers.PublisherName
FROM    dbo.Books 
   RIGHT JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID


SELECT  Books.BookName ,
        Publishers.PublisherName
FROM    dbo.Publishers 
 LEFT JOIN dbo.Books ON  dbo.Publishers.ID = dbo.Books.PublisherId

First notice what is the different between two queries? Left and right, and equality is vice-versa

fig. (9)

What about the result?? actually the result of these two queries are the same, since the first query I wrote Right join, and the table on the Right was Publishers, and the the second query I wrote Left query, and the table on the left was also Publisher, that's what I call magic :)

The Myth of the words "INNER" and "OUTER"

As you notice I did not use the keywords "inner" or "outer" in the last queries, what are they mean? and why many people uses them??

 fig. (10)

When you use these words, you just indicates to yourself what part I will select from this relationship, so when use "Inner" that means you will select the intersection between the two tables where where the foreign key in original table (Master Table) is not null, that means "Join" equals "Inner Join", so the next two queries are absolutely the same and will give the same results:

SELECT  Books.BookName ,
        Publishers.PublisherName
FROM    dbo.Books
        JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID

SELECT  Books.BookName ,
        Publishers.PublisherName
FROM    dbo.Books
        INNER JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID

The same Idea for both left and right join, when you write left/right join, you actually select the outer part of the relationship (check both fig. (10) and fig. (2), left and right are outer records of the relationship), this means the following queries are the same:

SELECT  Books.BookName ,
        Publishers.PublisherName
FROM    dbo.Books
        LEFT JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
and 

SELECT  Books.BookName ,
        Publishers.PublisherName
FROM    dbo.Books
        LEFT OUTER JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID

The same for right join also

SELECT  Books.BookName ,
        Publishers.PublisherName
FROM    dbo.Books
        RIGHT JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID
and 

SELECT  Books.BookName ,
        Publishers.PublisherName
FROM    dbo.Books
        RIGHT OUTER JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID

Full Join 

Full join is Union of both "Left" and "Right" Join, the full join get all data in tables in teh relationship, so in our sample, the Books table contains 13 records, the publishers table contains 7 records, the full join will get the 13 records of books union all records in table "Publisher" which are not in referenced in "Books" table, the syntax of "Full Join" is like the following query:

SELECT  BookName ,
        PublisherName
FROM    dbo.Books
        FULL JOIN dbo.Publishers ON dbo.Books.PublisherId = dbo.Publishers.ID

Wich will give the following value:

 fig. (11)

Cross Join:

Cross join used to get all available combination between two (or more) tables in the query, in our case the Books have 13 records, Publisher have 7 records, so all available records will be 13 x 9 = 91 records, and the syntax of cross join is a little bit different from any other join syntax, since you will not state the primary key and foreign key of the relationship "ON Table2.PrimaryKey = Table1.ForeignKey", so the query will be like this:

SELECT  BookName ,
        PublisherName
FROM    dbo.Books
        cross JOIN dbo.Publishers 

And will give the following result:

 fig. (12)

Self Join:

Self join do not have a special syntax, but it's accomplished by setting the same table name with an alias to can set a join syntax, and hence we can use either "Inner" or "Outer" (Left/Right) joins, but it will make more since if we used Left join because inner join will not give the accurate data, consider the following example, the "Books" table (Check fig. (2)) we have a column called "ParentBookId" which indicates if this book is a part of series of books the "ParentBookId" will be the Id of the previous book, so if we used the "INNER JOIN", with the following syntax:

SELECT  B1.BookName ,
        B2.BookName AS ParentBookName
FROM    dbo.Books B1
        JOIN dbo.Books B2 ON B1.ID = B2.ParentBookId

The result will be only 6 records (since INNER JOIN gets only where foreign key is not null), so the accurate result will be by using outer Join (Left or Right) like this:

SELECT  B1.BookName ,
        B2.BookName AS ParentBookName
FROM    dbo.Books B1
        LEFT JOIN dbo.Books B2 ON B1.ID = B2.ParentBookId

Which will get the following result:

 fig. (13)

How does Join actually works??(*)

1- INNER JOINS:

Considering the following query:

SELECT ColumnA, ColumnB, ColumnC, ...., FROM OUTER_TABLE 
       JOIN INNER_TABLE ON OUTER_TABLE.Column1 = INNER_TABLE.column2

When you execute this query the following algorithm applied:

for each row R1 in the outer_table
           for each row R2 in the inner_table
                    if R1 joins with R2
                          return (R1, R2)

In other way, the SQL Engine generates a nested loop, first loop takes every row from the OUTER_TABLE and in the other loop takes every row from the INNER_TABLE then check if the row from outer table joins with the row in the inner table, if not join continue loop, if joins return the two row.

2- OUTER JOINS

The same idea for Outer (Left) join:

SELECT ColumnA, ColumnB, ColumnC, ...., FROM OUTER_TABLE 
       LEFT JOIN INNER_TABLE ON OUTER_TABLE.Column1 = INNER_TABLE.column2

on executing this query, the following algorithm applied:

for each row R1 in the outer table
begin
    for each row R2 in the inner table
        if R1 joins with R2
            return (R1, R2)
        if R1 did not join
            return (R1, NULL)
end

In other way, the SQL Engine generates a nested loop, first loop takes every row from the OUTER_TABLE and in the other loop takes every row from the INNER_TABLE then check if the row from outer table joins with the row in the inner table, if two rows are joins, return two rows, else return outer row and NULL.

Other Joins Algorithms:

All previous example are done by using the default join algorithm (which is called Loop join algorithm), there are other types of joins, like:
  • Nested (Loop) 'Default'
  • Merge
  • Hash
I'll talk briefly about the other types.

Merge Joins:

Unlike the nested loops join which supports any join predicate, the merge join requires at least one equijoin predicate.  Moreover, the inputs to the merge join must be sorted on the join keys.  For example, if we have a join predicate “T1.a = T2.b,” table T1 must be sorted on T1.a and table T2 must be sorted on T2.b.

The merge join works by simultaneously reading and comparing the two sorted inputs one row at a time.  At each step, we compare the next row from each input.  If the rows are equal, we output a joined row and continue.  If the rows are not equal, we discard the lesser of the two inputs and continue.  Since the inputs are sorted, we know that we are discarding a row that is less than any of the remaining rows in either input and, thus, can never join.

We can express the algorithm in pseudo-code as:

get first row R1 from input 1
get first row R2 from input 2
while not at the end of either input
    begin
        if R1 joins with R2
            begin
                return (R1, R2)
                get next row R2 from input 2
            end
        else if R1 < R2
            get next row R1 from input 1
        else
            get next row R2 from input 2
    end

Unlike the nested loops join where the total cost may be proportional to the product of the number of rows in the input tables, with a merge join each table is read at most once and the total cost is proportional to the sum of the number of rows in the inputs.  Thus, merge join is often a better choice for larger inputs.

Hash Join:


it comes to physical join operators, hash join does the heavy lifting.  While nested loops join works well with relatively small data sets and merge join helps with moderately sized data sets, hash join excels at performing the largest joins.  Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.

Hash join shares many characteristics with merge join.  Like merge join, it requires at least one equijoin predicate, supports residual predicates, and supports all outer and semi-joins.  Unlike merge join, it does not require ordered input sets and, while it does support full outer join, it does require an equijoin predicate.

The hash join executes in two phases: build and probe.  During the build phase, it reads all rows from the first input (often called the left or build input), hashes the rows on the equijoin keys, and creates an in-memory hash table.  During the probe phase, it reads all rows from the second input (often called the right or probe input), hashes these rows on the same equijoin keys, and looks or probes for matching rows in the hash table.  Since hash functions can lead to collisions (two different key values that hash to the same value), we typically must check each potential match to ensure that it really joins.

for each row R1 in the build table
    begin
        calculate hash value on R1 join key(s)
        insert R1 into the appropriate hash bucket
    end
for each row R2 in the probe table
    begin
        calculate hash value on R2 join key(s)
        for each row R1 in the corresponding hash bucket
            if R1 joins with R2
                return (R1, R2)
    end


I'd like to finalize this post with a simple comparison between Nested Loop, Merge and Hash Joins:




(*) starting from this part, I used the material of session I attended couple of years earlier by (Mohamed Moshrif - Microsoft SQL Server Development Team), it was a brilliant session, and it was the main reason that I started loving SQL Server :)

Beyond Deadlines

Posted in By waelouf 0 comments

There has to be more to success than meeting deadlines... but what?

When I was a kid, I was happy just to play around. I loved the challenge of programming. When I got a program to work, it was a major victory. Back then, even a program that didn’t work was a success of some sort, as long as I had fun writing it. My definition of success centered on personal rewards.

As I gained experience, my software became more complicated and I often lost track of how it worked.
I had to abandon some programs before they were finished. I began to believe that maintainability was the key to success—an idea that was confirmed as I entered the workforce and began working with teams of other programmers. I prided myself on producing elegant, maintainable code. Success meant technical excellence.

Despite good code, some projects flopped. Even impeccably executed projects could elicit yawns from users. I came to realize that my project teams were part of a larger ecosystem involving dozens, hundreds, or even thousands of people. My projects needed to satisfy those people ... particularly the ones signing my paycheck. In fact, for the people funding the work, the value of the software had to exceed its cost. Success meant delivering value to the organization.

These definitions aren’t incompatible. All three types of success are important . Without personal success, you’ll have trouble motivating yourself and employees. Without technical success, your source code will eventually collapse under its own weight. Without organizational success, your team may find that they’re no longer wanted in the company.




The art of agile development
By James Shore, Shane Warden