ADO/JDBC Performance Using Parameters
by Howard F. Arner, Jr.
A lot of my consulting practice is centered on Web and client/server development against the AS/400. Most of the time, I am called in to diagnose performance issues after systems are designed and deployed. In this article, I want to point out the most common performance problem I find when looking at n-tier systems: the use of dynamic SQL. I will also shed some light on how the AS/400 query optimizer works, and teach you a technique for avoiding dynamic SQL.
What Is Dynamic SQL?
SQL that can be used against the AS/400 comes in two flavors: static and dynamic. Static SQL is just what it sounds like: SQL statements that do not change and therefore can be optimized one time, after which the execution plan can be used over and over. Dynamic SQL statements are statements that must be optimized each time the AS/400 sees them. Let's look at an example. Imagine that I have a file full of inventory information. The file contains a part ID code, a description, a price, and a unit of measure. Now, if I want the price and description of an item, I could run the following SQL statement:
SELECT PART_DESC, PRICE FROM ITEMS WHERE PART_ID = 'ABC123'
The AS/400 will look at the statement, decide on a proper execution plan in order to retrieve the data, and then merrily go about getting the data and returning it to me. Now, let's say that later on in my program I want to find the description of another item. I might execute this statement:
SELECT PART_DESC, PRICE FROM ITEMS WHERE PART_ID = 'X3'
Both statements are doing the same operation: returning the description and price of an item in my parts file. However, the AS/400 might not recognize that the two statements are essentially the same, and therefore it may optimize the second statement all over again. This is potentially a very bad thing. In order to understand why, it is important to understand what the optimizer does when you submit an SQL request.
How the Optimizer Creates a Plan
When DB2 gets an SQL statement, it first decides if it has seen that statement before. It does this by searching through the current open data paths for a statement that matches the current statement. I believe that this initial search is done on string length. If the AS/400 sees an SQL string of the same length, it examines the string to see if it matches the string you just gave it. If they match, the AS/400 will use the previous access plan. If DB2 does not find a previous incarnation of the statement, it will have to optimize the statement to create an execution plan.
When DB2 begins optimization, it parses the statement to see which files will be used in the resolution of the statement and the criteria for retrieving records from those files. DB2 then looks at the files to determine statistics like how many records are in the file or if the file is uniquely keyed. Next, looking at all of the record selection criteria, DB2 creates default filter factors for each selection predicate. This filter factor is used to determine how many records DB2 thinks it will have to read from the physical file.
Next, for each physical file in the query, DB2 begins to look at each logical file over the physical to determine if the file is suitable for resolution of the query. When DB2 evaluates the logical, it computes a value called cost. Cost is how much processor and disk I/O DB2 thinks it will take to do the query with this logical file. If the cost of using the logical is lower than the cost of using the previous logical DB2 looked at, DB2 puts this access path on the stack and pops the old one. If the cost is higher, DB2 discards this access path. DB2 continues to look at logical files until either the list of logicals is exhausted or the optimizer times out. If the optimizer times out, you are stuck with the best execution plan that the optimizer found before it timed out, which is not necessarily the best way to get your data. (More on this later.)
Selectivity and Optimization
DB2 doesn't just look at what keys are in the logical file. It also looks at the selectivity of the key. Selectivity is a measure that lets DB2 know how many records, on average, a single key value points to. Understanding selectivity is really important, as it gives you a window into what data the optimizer is using to make its decisions. Also, understanding how to get selectivity and what it means will give you a better understanding of your database and how data is partitioned.
Let's look at an example of selectivity. Say my items physical file has 1,000,000 records in it, with one logical file keyed by PART_ID and another logical file keyed by CLASS, SUBCLASS, and PART_ID. If I run a DSPFD command for *MBR information over the logical file, the AS/400 will return the logical file's Access Path Activity Statistics. In that output is a section titled Number of Unique Partial Key Values, which contains a label called Key Field 1 with a number to the right of it. That number represents the number of unique values in the first key field. In the case of my first index by PART_ID, that number is 1,000,000 because PART_ID is a unique identifier. If the number of unique values matches the number of records in the physical, the key is unique. In my second logical file by CLASS, SUBCLASS, and PART_ID, that number is 100, which indicates that there are 100 different values for the key CLASS. It also tells DB2 that if I run a SELECT statement looking for a single CLASS, on average I will return 1,000,000/100, or 10,000, records. Remember, this is an average: DB2 does not know the actual distribution of the CLASS field; it only knows the average distribution.
Below the entry for Key Field 1 is an entry labeled Key Field 1-2, which contains the number of unique entries for the combination of key 1 + key 2. In the case of my second logical, that number is 2,157, which means there are 2,157 unique combinations of the CLASS+SUBCLASS key. This tells DB2 that if I run an SQL statement looking for a given CLASS and SUBCLASS, it will have to read, on average, 463 (1,000,000/2,157) records to satisfy my query. Given that there are 1,000,000 records in my physical file, the key combination of CLASS+SUBCLASS is very selective.
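To make the arithmetic concrete, here is a minimal sketch that computes the same estimates from the record count and the unique-key counts above (the numbers are the example values from this article, not actual DSPFD output):

public class SelectivityEstimate {
    public static void main(String[] args) {
        long totalRecords = 1000000L;      // records in the items physical file
        long uniqueClass = 100L;           // Key Field 1: unique CLASS values
        long uniqueClassSubclass = 2157L;  // Key Fields 1-2: unique CLASS+SUBCLASS values

        // Average number of records DB2 expects to read for a single key value
        System.out.println("CLASS only:     ~" + (totalRecords / uniqueClass) + " records");
        System.out.println("CLASS+SUBCLASS: ~" + (totalRecords / uniqueClassSubclass) + " records");
    }
}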
The reason I wanted to explain selectivity was so you could understand two things: how the optimizer makes decisions and how expensive the optimization process is. See, the optimizer has to look at the system catalogue to get information about the files your query is accessing. Then, for each file, it has to look at the catalogue to get a list of the logical files over the physical file. Then, for each logical file, the optimizer determines whether the keys are suitable to the query. If the keys are suitable, the optimizer calls an API to get the selectivity information about the keys so that it can determine the cost of using the logical in resolving the query. Starting to get an idea of how compute- and I/O-intensive optimization is? Well, think about this: I was doing a query against a legacy table that had 60,000,000 records and 60 logical files over the physical. I joined that table to another, which had 1,500,000 records and 30 logical files. That means the optimizer potentially had to look at 1+60+1+30 files, or 92 files in all, to determine the best way to implement my query. Optimization takes a lot of time and resources, so repeating it unnecessarily is a bad thing.
The most important way to avoid multiple statement optimizations is to use variables in your queries. These variables are called parameter markers. Here is an example of the parts statement using parameters:
SELECT PART_DESC, PRICE FROM ITEMS WHERE PART_ID = ?
In ODBC, OLE DB, and JDBC, the question mark (?) is a parameter marker. It lets DB2 know that although there is no data at this position when the statement is optimized, there will be data when the statement is executed. To use a statement with parameters, you first prepare the statement, which sends it to the AS/400, and the AS/400 optimizes the statement, if necessary. Next, you bind any parameters in the statement with data and then execute the statement.
Why is it important that you use parameters? There are two reasons. First, it makes it easier for the AS/400 to see whether the statement was used before in the execution of your job. As long as your job is connected to the AS/400, it will attempt to leave any reusable open data paths (ODPs) available. When you execute a query, the AS/400 creates an ODP when you fetch or retrieve the results of the query. When you are finished with the query, the AS/400 leaves the ODP open in a cache, in case you use that statement again. When you give the AS/400 a new statement, it first searches the available ODPs, and if the statement is found, it reuses the ODP. This cuts down on both optimization and open operations, which are very expensive.
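Here is a minimal JDBC sketch of that flow, using the items file from the earlier example (the system name, user ID, password, and part numbers are placeholders, and the Toolbox JDBC driver is assumed). The statement is prepared, and therefore optimized, once, and then bound and executed over and over, reusing the plan and ODP:

import java.sql.*;

public class PartLookup {
    public static void main(String[] args) throws SQLException {
        // Placeholder system name and credentials; assumes the Toolbox JDBC driver.
        Connection con = DriverManager.getConnection(
            "jdbc:as400://mySystem", "user", "password");

        // Prepared once: the AS/400 optimizes the statement a single time.
        PreparedStatement ps = con.prepareStatement(
            "SELECT PART_DESC, PRICE FROM ITEMS WHERE PART_ID = ?");

        // Bound and executed many times: the execution plan and ODP are reused.
        String[] parts = { "ABC123", "X3" };
        for (String part : parts) {
            ps.setString(1, part);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getString("PART_DESC") + "  " + rs.getBigDecimal("PRICE"));
            }
            rs.close();
        }

        ps.close();
        con.close();
    }
}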
Where does this method fail you? When you close the connection, the AS/400 closes all of the ODPs. In addition, if you are using connection pooling, connection 1 may run a query and get an ODP. Then the next user tries to do the same function but gets connection 2 from the pool, which has never seen the statement, so the statement is optimized again. Also, the ODP cache is not unlimited in size. As you run more and more statements against a connection, the AS/400 will start to close ODPs in order to conserve resources. This means that a statement's ODP could be closed, and when it is needed later the statement will have to be re-optimized. The method I describe in the next paragraph eliminates these problems.
The second reason to use parameterized queries is because of the packaging facility on the AS/400. A package is a collection of SQL statements and their execution plans. If you set up your connection to use packages, the AS/400 will store a copy of all parameterized queries in a package file after they are optimized. In addition, the AS/400 will apply a more rigorous optimization to the statements than it will with dynamic SQL, keeping the optimizer from timing out so that all available access paths are examined. When you attempt to execute an SQL statement from a connection that is using packages, the AS/400 will first search the package file for a copy of the query. If it finds the query, it will retrieve the execution plan and statement metadata, rather than preparing and optimizing the statement, thereby saving valuable I/O and CPU resources.
To use packaging with your queries, you need to use prepared statements and set up your connection to use packages. If you are using ODBC or ODBC via ADO and connect using a data source, you can control the use of packages via the ODBC data source settings in the control panel. The Client Access and HiT Software ODBC drivers all call packaging Extended Dynamic. If you have the Extended Dynamic box checked and you check use/add, your queries that use parameter markers will all be saved in a package file on the AS/400. You can also control the library, package file name, and other variables from the ODBC control panel applet.
If you are using connection strings in OLE DB, ODBC, or JDBC, you can control the use of packages via arguments passed in the connection string. For the Client Access ODBC driver, the keyword is XDYNAMIC=1 to turn on package support. The keyword PKG can be used to supply a comma-separated list of package keywords, in the format LIB/FILE(IBM),x,0,y,z,0. The x value can be 1, indicating to use the package but not add statements to it, or 2, indicating to use the package and add new queries to it. The y value specifies what to do when a package is unusable: 0 indicates to throw an error, 1 issues a warning, and 2 means ignore the unusable package. The z value indicates whether the package should be cached locally on the PC: a 0 (the default) indicates that the package should stay on the iSeries, whereas a 1 indicates that PC RAM should be used to cache package contents.
Here is an example addition to a connection string for ODBC that turns on packaging, allows the program to add statements, issues a warning for unusable packages, and does not cache packages locally (MYLIB and MYPKG are placeholder library and package names):
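XDYNAMIC=1;PKG=MYLIB/MYPKG(IBM),2,0,1,0,0;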
When you are using the JDBC Toolbox driver, the following connection properties can be specified in the connection URL:
"extended dynamic=true;package library=SQLTHING; package=HOWIE;package add=true; package error=warning"
You can also set the connection properties via a Properties object:
Properties p = new Properties();
p.put("extended dynamic", "true");
p.put("package library", "SQLTHING");
p.put("package", "HOWIE");
Connection c = DriverManager.getConnection("jdbc:as400://mySystem", p);
Note that you cannot use extended dynamic packages when using the native JDBC driver or when using the Client Access OLE DB provider, IBMDA400.
How to Execute a Parameterized Statement
When using ActiveX Data Objects (ADO), passing parameters to an SQL statement is done via the Parameters collection of a Command object. You can populate the Parameters collection yourself or have ADO do it automatically. The following code shows how to set the PART_ID parameter in our example statement (assume Con1 is an ADO connection that is already connected to the target AS/400):
Dim Cmd1 As New ADODB.Command
Dim Stmt As String
Dim Rs As New ADODB.Recordset

Cmd1.ActiveConnection = Con1
Stmt = "SELECT PART_DESC, PRICE "
Stmt = Stmt & " FROM PARTS "
Stmt = Stmt & " WHERE PART_ID = ? "
Cmd1.CommandText = Stmt
Cmd1.Parameters.Refresh
Cmd1.Parameters(0).Value = "MyPart"
Rs.Open Cmd1
Note the call to the Parameters.Refresh method in the above code. This causes the Parameters collection to be instantiated by ADO. ADO sends the command to the AS/400, asking for an SQL prepare, and then asks the AS/400 to send back metadata about what fields the statement returns and what parameters the statement requires. ADO uses that information to automatically build the Parameters collection. You can save a round trip from the client to the server by instantiating the Parameters collection manually, via the Parameters collection's Append method and the Command object's CreateParameter method. The following code shows this being done:
Cmd1.CommandText = Stmt
Cmd1.Parameters.Append _
    Cmd1.CreateParameter("P1", adChar, _
    adParamInput, 12, "MyPart")
Rs.Open Cmd1
Using JDBC, you can execute a parameterized query using the PreparedStatement object. Here is a snippet that uses a prepared statement to retrieve part information.
Connection con = DriverManager.getConnection(
    "jdbc:as400://mySystem;extended dynamic=true;" +
    "package criteria=select;package cache=true;",
    "arner", "secret");
PreparedStatement ps = con.prepareStatement(
    "SELECT PART_DESC, PRICE FROM PARTS WHERE PART_ID = ?");
ps.setString(1, VarClass);
ResultSet rs = ps.executeQuery();
The Real World
I always like to tell the following story because it illustrates how important packaging is. I had a client using dynamic SQL in a JavaServer Pages (JSP) application against a 60,000,000-record table. During testing, no page took longer than 10 seconds to render, so they assumed the application was okay for production. They opened up access to the application, and suddenly page render times increased to 45 and sometimes 180 seconds. They traced the performance bottleneck to SQL and then promptly called me.
First, I took a look at the database monitor output and determined that the statements were spending a ton of time in the optimizer. This was because the system had thousands of users and was already heavily loaded, so jobs were swapped in and out a lot. Also, the physical file had over 60 logicals and was joined to two other large tables with many logicals. Most of the time, the optimizer was timing out before a good execution plan was found and was therefore running a poor plan. Contributing to this, the optimizer examines logical files in descending order of creation, so the logical that uniquely identified the record was at the bottom of the list. In many cases, the optimizer was never seeing the best access plan.
Next, I examined their code and deduced that there were 13 variations of the statement that could be run based on the user input. We changed the code to select the proper statement and to use prepared statements with parameters. We tested each statement to ensure that it got a proper plan and returned data in a reasonable time frame. We turned on extended dynamic, packaged the queries, and released the application. Now no page takes more than 3 seconds to render and most database records are returned in less than a twentieth of a second.
Using Packages in Production
Now that we have talked about the optimizer, the benefits of packages, and how to use packages, it is time to talk about how to deploy solutions that use packages. Here is the scenario that I typically follow:

1. During development, write every query as a prepared statement with parameter markers.
2. During testing, connect with extended dynamic turned on and package add enabled, and exercise every function of the application so that each statement is prepared, fully optimized, and stored in the package file.
3. Test each statement to make sure it gets a proper execution plan and returns data in a reasonable time frame.
4. For production, point the application's connection at that fully populated package.
If you follow the above steps, you end up with a nice package file for your application, with all statements properly and fully optimized. Also, any page that uses dynamic SQL will not be adding its statements to the package file, so the file will not continue to grow.
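If you prefer to lock the package down at that point, the Toolbox JDBC driver also lets you use the package without adding new statements to it. Here is a minimal sketch, reusing the package and library names from the earlier examples (the system name is a placeholder):

Properties p = new Properties();
p.put("extended dynamic", "true");
p.put("package library", "SQLTHING");
p.put("package", "HOWIE");
p.put("package add", "false");      // use the package, but do not add new statements to it
p.put("package error", "warning");  // warn, rather than fail, if the package is unusable
Connection c = DriverManager.getConnection("jdbc:as400://mySystem", p);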
Last Bits of Advice
A few more ideas might be in order before I sign off. One way of ensuring performance is to execute your statements against the AS/400 during coding to verify that they perform well and use the proper access paths. Execute each statement using parameter markers, as this is how the iSeries will see the statement in production. IBM may say otherwise, but I have seen queries using hard-coded values optimize differently than queries using parameters, so it is very important to test each statement as it will appear in your production application.
Another benefit of testing statements is that if you know the statements are executing quickly and using proper execution plans, then when you deploy your application you will be assured that your data access is scalable and not a source of performance problems. I have a presentation available in the Resources section of my Web site that covers planning and tuning queries against legacy systems. To view it, go to www.sqlthing.com.
Finally, get familiar with your data by looking at the selectivity information available from your logical files. This metadata can really help you gain an understanding of how your data is partitioned. I think selectivity is so important that I wrote a report that lists the selectivity of all logicals over a physical, so I can quickly get a feel for the table.
Howard F. Arner, Jr., is a consultant and software developer with Client Server Development. He is the author of iSeries and AS/400 SQL at Work, a learn-by-example guide to SQL on the iSeries. He is also the creator and chief architect of SQLThing Enterprise Edition, a tool for developers that use SQL. You can purchase Howard's book at www.sqlthing.com. Howard spends way too much time talking to DB2, and not enough time talking to his dogs and cats. E-mail: firstname.lastname@example.org