Recursive Queries on the iSeries and System i
June 12, 2006 Michael Sansoterra
Who would’ve guessed, but we’re almost done with the long list of SQL enhancements introduced in release V5R4 of i5/OS. Sorrow not, though, because this tip will bring abundant joy as a major limitation of SQL has just been lifted–the ability to run a query recursively.
Recursive queries are queries that need to do multiple self joins an unknown number of times. Take, for example, an employee master table that contains an employee’s employee number and the manager’s employee number. Say we need a list of all employees along with a listing of all manager’s above the employee. To do this query, we would have to join the employee master file to itself n number of times. n would vary based on how many employees are in the company and how fat the bureaucracy is.
In the past, we could write this query with some clumsy “work arounds” like writing an external user-defined table function in a language that can handle recursive calls. Or, if we knew of a maximum number of recursions for a query (say 10), we could write a query with 10 Left Outer Joins to a single table. Needless to say that makes for an ugly query.
This is where the benefit of using recursion comes in. With a recursive query we no longer need to resort to external programming or a terrible series of self joins. Now we let the system do the self joins and then stop when there is no more data.
Although there are many business applications for recursive queries including reservations, employee hierarchies and indented bill of material listings, let’s stick with about as simple a scenario as we can get. For this example, we’ll write a query to show a program’s references, which is recursive by nature.
In a program calls stack, Program A can call Program B, which can call Programs C and D, and so on n times until we reach a program that doesn’t call anything else. Consider the following table that holds program relationships:
Create Table ProgramRef (Program Char(10), /* This is the caller */ ProgramText Char(50), RefProgram Char(10), /* This is the called program */ Constraint ProgramRef Primary Key (Program,RefProgram))
I will illustrate recursive queries by writing a query to retrieve all of the programs that are potentially called by a given starting program. On my AS/400 system I populated the above table with program call data from the ever popular WRKDBF utility that happened to be on our internal system.
And here is the query that will get the program hierarchy for us:
With ProgramHierarchy (CallLevel,Program,ProgramText,RefProgram) As (Select 1 As CallLevel,Program,ProgramText,RefProgram From ProgramRef Where Program='DBCWRKF' Union All Select CallLevel+1 As CallLevel, PR.Program,PR.ProgramText,PR.RefProgram From ProgramHierarchy PH, ProgramRef PR Where PH.RefProgram=PR.Program ) Select CallLevel As CallStackLevel, Repeat(' ',CallLevel-1)||Program As Program, ProgramText,RefProgram From ProgramHierarchy;
This mess needs explanation! Recursive queries in DB2 are implemented using a Common Table Expression (CTE) and a Union All. The CTE defines a full select that utilizes a Union All. The CTE is a convenient way to assign a name and column list for a query that can later be referenced in a Select. The outline of the CTE portion of the above query is shown here:
With ProgramHierarchy (CallLevel,Program,ProgramText,RefProgram) As ( fullselect goes here... )
The CTE name is called ProgramHierarchy and it will return four columns.
Now let’s zoom in on the fullselect. The first subselect in the full select is:
Select 1 As CallLevel,Program,ProgramText,RefProgram From ProgramRef Where Program='DBCWRKF'
This is a “prime read” and serves as the starting point for the recursive query. In this case, I’ve hard coded a program name that I want the query to start with. In other words, the query should pull out all programs that are called directly or indirectly by program DBCWRKF.
The second subselect in the fullselect is where the self join occurs:
Select CallLevel+1 As CallLevel, PR.Program,PR.ProgramText,PR.RefProgram From ProgramHierarchy PH, ProgramRef PR Where PH.RefProgram=PR.Program
As shown here, the query selects data from the common table expression and then does a join back to the ProgramRef table.
The first time DB2 processes this subselect, the ProgramHierarchy CTE will contain the “prime” row from the first subselect. This row will be joined back to the original ProgramRef table using the RefProgram from the initial read (i.e. the first program called by program DBCWRKF). This cycle of repeat calls to the secondary subselect will continue until there is no more data.
The CallLevel column expression is used as a means to track how many levels deep we are into the query. It will allow us to answer the question, “are we 4 calls or 24 calls down the call stack from our starting program?” Column CallLevel is primed with a 1 in the first subselect. Then it is incremented by one with each successive successful join in the secondary subselect.
Finally, a CTE is always referenced by a Select. This is the final portion of our query. After the CTE has been defined, issue a Select referencing the CTE name so the recursive query will run:
Select CallLevel As CallStackLevel, Repeat(' ',CallLevel-1)||Program As Program, ProgramText,RefProgram From ProgramHierarchy
This simple Select will run the ProgramHierarchy CTE recursively and return the results to us in a tabular format. The only trick involved here is the use of the Repeat() function to artificially indent the program name as the query levels go deeper.
Figure 1, which you can see by clicking here, contains the recursive query results. The listing is impressive except it may not be what we want. This list contains all of the programs at call level one, followed by all the programs at call level two, followed by all programs that are called three levels.
But what if we want the query to walk through each path of every program? In other words, if initial program A calls program B and program C, how do we get DB2 to show us all of program B’s subprograms before it moves on to program C?
The answer is by using the SEARCH clause on the recursive query as shown below:
With Recursive ProgramHierarchy (CallLevel,Program,ProgramText,RefProgram) As (Select 1 As CallLevel,Program,ProgramText,RefProgram From ProgramRef Where Program='DBCWRKF' Union All Select CallLevel+1 As CallLevel, PR.Program,PR.ProgramText,PR.RefProgram From ProgramHierarchy PH, ProgramRef PR Where PH.RefProgram=PR.Program ) Search Depth First By RefProgram,Program Set NewOrder Select CallLevel As CallStackLevel, Repeat(' ',CallLevel-1)||Program As Program, ProgramText,RefProgram From ProgramHierarchy Order By NewOrder
The few modifications done to our original query are highlighted. The first is the insertion of the optional recursive keyword in the CTE. While not required it is a good documentation feature if your query is recursive.
The second modification is the “Search Depth First” search clause. This tells DB2 to process and complete the recursion for each row before moving on to the next. This answers the previous question as to how we can make DB2 finish reporting all subprograms under program B before moving on to program C. The “By” portion must contain the column names that define the unique parent and child relationship for the recursive data. In this case the Referenced Program is related to the Program column.
Finally the Set clause defines a column that will contain an identifier that will allow DB2 to retrieve the results in the desired order (Program A–> Program B –> Program C –> etc.) The column specified in the Set clause should only be used in the ORDER BY of the outermost select.
Figure 2, which you can see by clicking here, contains the modified query results.
There is one other thing to note about recursive queries–the potential for infinite recursion. This scenario will occur if there is a circular reference in the data. By reviewing my sample data table, I saw that program DBRWRKF calls program QCLSCAN. I created a circular reference in the data by adding one row that says QCLSCAN calls DBRWRKF:
Insert Into ProgramRef(Program,ProgramText,RefProgram) Values ('QCLSCAN','This is bad!','DBRWRKF')
This new data will cause an infinite recursion. However, if there is a circular reference, we can handle it easily with the Cycle clause.
With Recursive ProgramHierarchy (CallLevel,Program,ProgramText,RefProgram) As (Select 1 As CallLevel,Program,ProgramText,RefProgram From ProgramRef Where Program='DBCWRKF' Union All Select CallLevel+1 As CallLevel, PR.Program,PR.ProgramText,PR.RefProgram From ProgramHierarchy PH, ProgramRef PR Where PH.RefProgram=PR.Program ) Search Depth First By RefProgram,Program Set NewOrder Cycle Program,RefProgram Set DuplicateError To '*' Default ' ' Select CallLevel As CallStackLevel, Repeat(' ',CallLevel-1)||Program As Program, ProgramText,RefProgram, DuplicateError From ProgramHierarchy Order By NewOrder;
The Cycle clause, as illustrated here, requires the column list that uniquely describes the parent child relationship. In this case, if the Program and RefProgram columns contain duplicate values then there is a circular reference. This clause will allow DB2 to look for this circular condition.
When it finds the condition, how should it let the caller know? Here, the Set keyword is used to define a column that can be referenced in the Select list. This column is automatically defined as CHAR(1) and will contain a default value for rows that are not part of a circular reference and will have a special value if there is a circular reference. In the above example, column DuplicateError will contain a blank by default. When a circular reference is encountered this column will change to an asterisk (*). Circular references will have two rows in the recursive table, both rows having this asterisk non-default value.
One last thing to note is that the CREATE VIEW statement has been enhanced to allow a limited form of a recursive view. However, I have not been able to successfully create a recursive view that contains all the features available in the common table expression recursive query shown above.
For the record, the manual states that recursion is not allowed if the query specifies:
DB2 keeps improving and thereby makes SQL increasingly valuable. This value is not limited to DB2 either as these recursive query features were introduced in SQL Server 2005 as well. This simply means that new DB2 skills are increasingly portable to other database platforms and will make developers increasingly productive.