Wednesday, March 11, 2009

To Join or not to Join, that is the Question!

One of the most difficult aspects of SQL queries are identifying and accurately completing the relationship identification between database objects. Let me try that again, in English: when querying more than one table at a time (in a relational database), the database needs the developer (you) to identify the relationship between the objects. That identification is difficult at times and leads to many of the problems with report results. In this post, I talk about Relational Databases, Cartesian Products, Inner, and Outer Joins. Before I get into the tips around JOINS, let's get into a little more depth regarding the theory behind join requirements.

Relational Database

A relational database is just like it sounds; a database whose contents are related. How they are related is the crux of the join issue. Take the following example:

SPRIDEN and SPBPERS are two Banner tables containing general person information. SPRIDEN contains ID and Name data, while SPBPERS contains demographic information (such as SSN, Birth Date, Citizenship, Ethnicity, etc). Referencing the first image below, we can see that the data stored in the tables is very organized, but how does the system know which SSN and Gender to associate with Mary Johnson?


As we can see, the individual's PIDM exists in both tables. If we connect (JOIN) the two, we will establish the required relationship; indicating to the system how the two records (and by extension, the two tables) are related. Clear as mud right! Hopefully the image below will help explain:


I don't want to spend too much time on theory, so I'm going to keep this brief. Hopefully you now have an idea about why joins are required. The next step is to understand that there are a few different kind of joins. For more information on the types, how and when they're used, and their syntax, click here. One thing I do want to discuss before moving onto more practical matters, is the concept of a cartesian join, or cartesian product.

Cartesian Join/Product

You have created a cartesian join when you query more than one object and DO NOT specify the join condition. You would know you've done that by several things:

  1. The query may bring down the database
  2. If the results do get returned, there are likely duplicate records
  3. The combination of information from the queried objects will make no sense; Mary Johnson's name will appear with other people's demographic information.
  4. Your query contains more than one object in the FROM clause, with no reference to their relationship in the WHERE clause.

As you probably have guessed, cartesian joins are VERY bad. The problem is that Oracle assumes that each row in the first object should be connected to each row in the second. As an example, reference the two tables below. Assuming that they are displaying all the data in each object, the following query would yeild the results below.



select ID, Lname, Gender
  from object1, object2;

As we would expect, we have 27 rows of results (3 from Object2 times 9 from Object1). Now, extrapolate; imagine two tables that contain hundreds of thousands of rows!

200,000 * 300,000 = 60,000,000,000 records returned...ouch!

It has potential to crash your system!!! What we want to do, is specify the join condition as demonstrated above, but before we do that, we need to talk about the different kinds of joins.




Inner Joins

An inner join only returns the "SELECTED" information if both sides of the join condition match. Observe the following example:

SELECT spriden_id, spbpers_ssn
FROM spriden, spbpers
WHERE spriden_pidm = spbpers_pidm;
*NOTE - the case is not important. I use UPPER case here to specify "clauses" of the select statement.

In the above example, the ID, and SSN would only return where the SPRIDEN_PIDM and SPBPERS_PIDM are the same. The implication of that is in regards to this questions, "What if they don't have a record in SPBPERS?". In this case, that individual would be REMOVED from the query results!


To help deal with this issue, we can use what's called an "outer join".

Outer Joins

There are a few different types of outer joins, but for the purposes of this discussion, we really only need to consider the reason you would use an outer vs. an inner join and what the possible repurcussions might be. For more information on outer joins (syntax, etc), click here. First off, why would you use an outer join? The most important reason is to combat the issue mentioned above under "Inner Joins"; if both sides of the join condition do not equate, then the record is dropped. In many cases (addresses, email, telephone, and others), you don't want to lose the record just because the individual in question doesn't have a record in a certain area. In those cases we use an outer join as in the script below:

SELECT spriden_id, spbpers_ssn
FROM spriden, spbpers
WHERE spriden_pidm = spbpers_pidm(+);

Notice the "(+)" at the end of SPBPERS_PIDM in the join condition. This indicates that Oracle should return ALL ROWS from spriden as well as the rows from SPBPERS where the PIDM matches. So, again looking at our sample SPRIDEN and SPBPERS tables below, we would get the following results from this query:


One word of caution: if you include a condition on a table that is outer joined, you MUST outer join on that condition as well!!! For example, let's say we only wanted to look at GENDER = 'M'. Normally, we would use the following to do so:

SELECT spriden_id, spbpers_ssn
FROM spriden, spbpers
WHERE spriden_pidm = spbpers_pidm(+)
and spbpers_sex = 'M';

However, because many individuals have no record in SPBPERS, this query would NOT return ONLY return one row, the row corresponding to the fellow with an 'M' for gender. As such, we must do the following to outer join on the condition as well as the join if we want to include folks with no SPBPERS record:

SELECT spriden_id, spbpers_ssn
FROM spriden, spbpers
WHERE spriden_pidm = spbpers_pidm(+)
and spbpers_sex(+) = 'M';

Notice the "(+)" by the spbpers_sex condition as well as the join condition! So as a general rule: any conditions against a table that is outer joined, must be outer join conditions!

Now that we have an idea of what an outer join is used for, let's discuss its implications. Basically, using an outer join introduces overhead to your query. That being the case, it's best to stick with inner joins whenever possible. Other than that, good luck! I hope this has given you a little insight into join theory, especially with regard to Banner and Oracle. Please post questions as needed! until next time, happy hunting!