Friday, February 27, 2009

Effective Dating and Sub Queries

Effective dating (no, not the romantic kind!) is one of the most common topics of discussion when reporting against Banner (Baseline, especially: the ODS and EDW go a long way to removing effective date and term code logic requirements). Most people are aware of the issue, but do not fully understand why things are the way they are and even if the "why" is understood, "how"
to go about solving the issue can sometimes be very confusing. There are two main examples I like to use when discussing this subject; job history and student records. Job history deals with "effective dates" (NBRJOBS_EFFECTIVE_DATE) while the
student base record deals with "effective term" (SGBSTDN_TERM_CODE_EFF). The following entry discusses both questions of "why" this situation is important and "how" it must be dealt with.

Why is this such a BIG deal!?!?


The most important thing to keep in mind is the two dimensional nature of the way data is stored in Banner (or any database, really). You can think of the Banner database as a really complex and robust series of spreadsheets that hold all your Banner data. Jobs and student information are no different in terms of storage; the difference is in the fact that both store historical information. Let's
take a look at a relatively simple example. Assume for a second that John Doe has only one job currently, but has moved around within the school and has been promoted a couple of times. The way his data is stored in the system would look something like this:


Figure 1 - NBRJOBS

As you can see, all job history is stored in the Banner system. If one wishes to look at the most current job information for John, one MUST find that record associated with his most current (or MAX) effective date. As we can see, that would correspond to the last record in the table, for position B2354 with a salary of $45,000.

Now let's take a look at the student side (SGBSTDN). Even though we are now dealing with effective terms, instead of effective dates, the concept is still the same (and can be applied to any situation in which you're dealing with having to figure out the most current record for a given set of records). As shown in Figure 2 which is very similar to Figure 1, the student information corresponding to John Doe in SGBSTDN can change a random terms throughout their experience in school. If their major or any other attribute (level, type, etc) changes during their tenure, a new record is added to SGBSTDN reflecting the change with a new effective term.


Figure 2 - SGBSTDN

As shown, the problem is the same. Mr. Doe's status changed from new freshmen (N in my simply coded universe) to continuing (C) after his first term. Then, after much thought, he changed his major from psychology (PSYCH) to sociology (SOC) in fall, 2008 (200910). NOTE: all codes are fictional and do not reflect real Banner data. For your school's codes, reference the validation tables associated with each data element. As with the job records, we would simply need to make sure to isolate John's most current record by looking for the record associated with his most current (or MAX) effective term. In either case, we must do this through the use of a correlated subquery.

Alright, alright, now tell me HOW!


Before we get into the details of how a subquery can be used to facilitate our needs, I'd like to briefly discuss why a subquery is required. This question of "why subquery" is a constant source of confusion among report writers everywhere. Typically, what I hear is, let's just include the MAX effective date or term in our query as in the following script:



Whenever an aggregate function (such as COUNT, AVG, MAX, or MIN) is used in a SQL query, all fields NOT being aggregated must exist in the where clause. What that translates into (in english) is that the aggregation occurs around all other fields being selected. So, in the case of the above script, it's looking for the maximum effective date for each combination of ID, Name, Position, Suffix, and Salary. Referencing Figure 1, we can see that this would be fine for the first position, because it would isolate around the terminated record (all selected fields except the date are the same), indicating the most current record for that position. However, for their currently active position, it would return a record for each salary (because it's looking at the maximum effective date for each different salary!!!). The crux of the issue, is the fact that Oracle is relatively "dumb", it simply takes the statement you write and returns the results according to pre-established rules. It cannot "guess" what we are trying to do. That being the case, it looks at each row in turn. For this reason, we must use a sub query to match the effective date in the current row to the MAX effective date for the individual, position, and suffix. Now, on to the solution!!!

A subquery is simply a query within a query. A correlated subquery, as we will see shortly, is a query in the WHERE CLAUSE, usually referencing the same table mroe than once.



There are several things that need to be pointed out about this modified query. First, take note of the subquery; it is the key that tells the system we want to keep only the most current record for each PIDM, POSITION, and SUFFIX combination. Hence, as the results show, the query shows both the most recent terminated AND active records. Additionally, note that the subquery contains only one table (NBRJOBS with the TABLE2 alias). The fact that two "instances" of the same table are referencing each other is what makes the subquery "correlated". An alias for at least one of the two tables is required, otherwise the system has no way to tell them apart. Now that we have excluded the result set, we can eliminate the individual's terminated positions by simply adding that condition to the MAIN QUERY'S WHERE CLAUSE (not the subquery) as follows:



Alright, sit back and take a deep breath, I know a whole new world of SQL querying has been opened to you:). Now, it's time to put it to practice. Using the script below, test yourself by running it on your system to get a list of student pidms who have more than one record on SGBSTDN. Then, write a script that will get ALL of one student's records based on one of the PIDMs in the list using the techniques described above. If you need help or want to compare answers, you can get my solution to this vexing problem in the first comment on this blog entry, but don't
be tempted to cheat!!!:)

SELECT SGBSTDN_PIDM, COUNT(*)
FROM SGBSTDN
HAVING COUNT(*) > 1
ORDER BY SPRIDEN_PIDM

I hope this has been helpful, let me know what you think!!!

4 comments:

  1. Test Answers - Per Zach

    STEP 1 - ALL DATA FOR ONE STUDENT

    SELECT SPRIDEN_ID,
    SPRIDEN_LAST_NAME,
    SPRIDEN_FIRST_NAME,
    SGBSTDN_MAJR_CODE_1,
    SGBSTDN_LEVL_CODE,
    SGBSTDN_STYP_CODE,
    SGBSTDN_TERM_CODE_EFF
    FROM SGBSTDN TABLE1, SPRIDEN
    WHERE SPRIDEN_PIDM = SGBSTDN_PIDM
    AND SPRIDEN_CHANGE_IND IS NULL
    AND SGBSTDN_PIDM = 123

    STEP 2 - ISOLATE MOST CURRENT STUDENT RECORD

    SELECT SPRIDEN_ID,
    SPRIDEN_LAST_NAME,
    SPRIDEN_FIRST_NAME,
    SGBSTDN_MAJR_CODE_1,
    SGBSTDN_LEVL_CODE,
    SGBSTDN_STYP_CODE,
    SGBSTDN_TERM_CODE_EFF
    FROM SGBSTDN TABLE1, SPRIDEN
    WHERE SPRIDEN_PIDM = SGBSTDN_PIDM
    AND SPRIDEN_CHANGE_IND IS NULL
    AND SGBSTDN_PIDM = 123
    AND SGBSTDN_TERM_CODE_EFF =
    (SELECT MAX(SGBSTDN_TERM_CODE_EFF)
    FROM SGBSTDN TABLE2
    WHERE TABLE1.SGBSTDN_PIDM =
    TABLE2.SGBSTDN_PIDM)

    ReplyDelete
  2. How would you go about creating a flat file from an effective term table with a record for each term and person/pidm? Specifically, I want to take the student attribute table, sgrsatt, and create a larger table with pidm and attribute listed for each term...
    Example:
    Term Pidm Attributes
    201140 10101 ATH,HOUS
    201210 10101 ATH

    ReplyDelete
  3. Linked to my previous comment, this is the code I came up with for presenting repeating table from the attribute banner table in a flat way...

    select sgrsatt_term_code_eff as term, sgrsatt_pidm as pidm, wm_concat(sgrsatt_atts_code) as attributes from sgrsatt a
    where sgrsatt_term_code_eff=(select max(sgrsatt_term_code_eff) from sgrsatt b where b.sgrsatt_pidm=a.sgrsatt_pidm and
    sgrsatt_term_code_eff<=(select term from (select term, rownum as row_num from (select stvterm_code as term from stvterm where stvterm_start_date <= sysdate and stvterm_start_date > sysdate-2500 order by term desc)) where row_num=1) and sgrsatt_term_code_eff>'200650') group by sgrsatt_term_code_eff, sgrsatt_pidm
    union
    select sgrsatt_term_code_eff as term, sgrsatt_pidm as pidm, wm_concat(sgrsatt_atts_code) as attributes from sgrsatt a
    where sgrsatt_term_code_eff=(select max(sgrsatt_term_code_eff) from sgrsatt b where b.sgrsatt_pidm=a.sgrsatt_pidm and
    sgrsatt_term_code_eff<=(select term from (select term, rownum as row_num from (select stvterm_code as term from stvterm where stvterm_start_date <= sysdate and stvterm_start_date > sysdate-2500 order by term desc)) where row_num=2) and sgrsatt_term_code_eff>'200650') group by sgrsatt_term_code_eff, sgrsatt_pidm
    union
    select sgrsatt_term_code_eff as term, sgrsatt_pidm as pidm, wm_concat(sgrsatt_atts_code) as attributes from sgrsatt a
    where sgrsatt_term_code_eff=(select max(sgrsatt_term_code_eff) from sgrsatt b where b.sgrsatt_pidm=a.sgrsatt_pidm and
    sgrsatt_term_code_eff<=(select term from (select term, rownum as row_num from (select stvterm_code as term from stvterm where stvterm_start_date <= sysdate and stvterm_start_date > sysdate-2500 order by term desc)) where row_num=3) and sgrsatt_term_code_eff>'200650') group by sgrsatt_term_code_eff, sgrsatt_pidm

    You can union as many of these statements as you want to include more terms.

    ReplyDelete
  4. . Your contents are wonderful and recommended
    dating script

    ReplyDelete