Guru: IBM i Experience Sharing, Case 2 – Dealing With CPU Queuing Wait Time
March 21, 2022 Satid Singkorapoom
When we drive our cars, we hope to avoid red lights and traffic jams, because we all hate waiting immobile in traffic. I’m sure that you are aware, fully or subtly, that active jobs in any computer system can encounter wait as well. The IBM i developer team has categorized many types of wait.
In this article, let’s look at CPU Queuing wait time. Let’s see how we can interpret and address it in a sensible way to resolve poor performance. I’ll try to provide you with a useful approach to wait time analysis using a gloriously useful performance reporting tool.
First, how do we look at CPU Queuing wait time? As of IBM i release 6.1, a new built-in GUI performance report tool named IBM i Performance Data Investigator (PDI) was delivered under the browser-based Navigator for i system management GUI tool. (For more information, visit https://developer.ibm.com/tutorials/ibm-i-performance-data-investigator/.) You can use PDI to display many useful performance data charts that help you analyze many aspects of IBM i performance health, including CPU utilization, disk IO workload and response time, memory faulting, and wait times. In this article, I will try to enable you to interpret CPU Queuing wait time.
Let me mention that in 2021, the Navigator for i tool for system administration tasks was found to have a Log4j V1.x vulnerability. IBM delivered a new version of Navigator for i that does not use Log4j V1.x. Please consult this IBM security bulletin for information about how to use the new tool.
One very useful group of PDI charts is the Wait category, and CPU Queuing is in this group. Wait time data was introduced in IBM i 5.4, but that release has no PDI tool to display it. There, you must query the relevant performance data file directly to look at this wait time category, which is somewhat complicated to do. From my long experience in analyzing IBM i performance, I can gladly say that wait time data has been very helpful to me in analyzing performance issues since IBM i 6.1.
First of all, you always need to look at two charts under the Wait category: Waits Overview and Wait by Generic Job or Task. (There is also a Wait by Job or Task chart, but the generic chart is more useful, in my experience.) Let’s look at examples of these two charts. Here is a sample of Wait Overview.
In this 24-hour timeline Wait Overview chart, the Partition CPU Utilization line graph hits 100 percent from 3 a.m. to 4 a.m. (See the upper red arrow on the right side of the graph.) This was the critical period of the nightly batch process for the customer that I helped with this analysis. All five vertical bars of this batch run period show very large amounts of CPU Queuing Time right on top of relatively smaller bars of Dispatched CPU Time. Look at the color legend on the lower left half of the chart to identify these line and bar graph components. Each individual vertical bar represents one sampling period of performance data. In this case, the sampling period was set at 15 minutes, so five bars represent a 1.25-hour period.
During this critical period of 100 percent CPU utilization, compare in each vertical bar the proportion of Dispatched CPU Time against the sum of all wait times on top of it. If the proportion of all wait times overwhelms that of Dispatched CPU Time, this indicates bad performance, because jobs are kept in wait status much longer than they actively run. This is the case in those five bars above. Metaphorically, a group of cars (jobs) waited at red lights longer than they were in motion during this 1.25-hour period.
My rule of thumb for judging a case of good-to-acceptable performance health is that the proportion of Dispatched CPU Time is not less than the sum of all wait times in each bar. This is the case for the rest of the day in the chart above, where you can see that the sum of all wait times does not exceed that of CPU time in each bar. (The bars are small, but there is a “zoom-in” feature for you to use.)
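The rule of thumb above can be written down as a simple per-bar comparison. The following is only a minimal Python sketch: the Interval class, its field names, and all the numbers are hypothetical illustrations and are not part of PDI or of any IBM i performance data file.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """One sampling period, i.e. one vertical bar in the chart (hypothetical layout)."""
    label: str             # start time of the sampling period
    dispatched_cpu: float  # Dispatched CPU Time, in seconds
    waits: dict            # wait category name -> seconds


def total_wait(iv):
    """Sum of all wait categories stacked on top of Dispatched CPU Time."""
    return sum(iv.waits.values())


def is_healthy(iv):
    """Rule of thumb: Dispatched CPU Time is not less than the sum of all waits."""
    return iv.dispatched_cpu >= total_wait(iv)


# Made-up numbers, for illustration only.
intervals = [
    Interval("02:45", 900.0, {"CPU Queuing": 50.0, "Disk Page Faults": 100.0}),
    Interval("03:00", 1200.0, {"CPU Queuing": 5400.0, "Disk Page Faults": 150.0}),
]

for iv in intervals:
    print(iv.label, "healthy" if is_healthy(iv) else "wait-dominated")
```

With these made-up values, the 02:45 bar passes the check and the 03:00 bar is flagged as wait-dominated.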
If you see an overwhelming sum of all wait times in some scattered bars that are not contiguous in time, it may not be a serious concern. This is similar to your car waiting at a few red lights while passing through many more green lights along the way to your destination. The excellent performance case is when you see ONLY Dispatched CPU Time with very little or no summarized wait times in every bar. (I’ll give a sample of this later.)
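The difference between scattered and contiguous wait-dominated bars can be sketched as a search for the longest unbroken run. This Python sketch assumes you already have one True/False flag per sampling period (True meaning the sum of waits exceeds Dispatched CPU Time). The function name and the sample flags are hypothetical; the 15-minute interval is the one used in this case.

```python
from itertools import groupby


def longest_wait_dominated_run(flags):
    """Length of the longest run of consecutive wait-dominated periods."""
    return max(
        (sum(1 for _ in group) for is_bad, group in groupby(flags) if is_bad),
        default=0,
    )


INTERVAL_MINUTES = 15  # sampling period used in this case

# Made-up flags: True means waits exceed Dispatched CPU Time in that bar.
scattered = [False, True, False, False, True, False, False, False]
contiguous = [False, False, True, True, True, True, True, False]

for name, flags in (("scattered", scattered), ("contiguous", contiguous)):
    run = longest_wait_dominated_run(flags)
    print(name, run, "bars =", run * INTERVAL_MINUTES, "minutes")
```

A run of five bars at 15 minutes each corresponds to the 1.25-hour batch window discussed earlier, while isolated single bars are the less worrying pattern.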
Back to the critical batch period, it should be apparent to you that CPU Queuing time is the only dominant wait component, because the sum of all other wait times is very small at the top of each bar.
This is the “forest” view of the data, as all these times in each bar are gathered from all active jobs during each sampling period. If you see that all bars in the entire 24-hour chart have Dispatched CPU Time as the dominant component with all other wait times barely seen at all, then you can conclude that you have a healthy performance case for the day. But if this is not the case, you should next drill down to the “trees” view. Since CPU Queuing is the dominant wait in this example, you should next ask yourself whether this particular dominant wait time appears only in one or a small number of jobs, or is it distributed among most or all jobs?
If you see the latter, it may mean a simple case of overall workload overwhelming the available overall CPU power, and you just need to activate more CPU cores to address this situation.
If you see the former (the dominant wait time appears in only one job or in a specific group of similar jobs), then you have a chance to find ways to modify the jobs to reduce the dominant wait, CPU Queuing in this case. To perform this analysis step, you need to display the chart named Wait by Generic Job or Task, as shown below.
Each horizontal bar in this chart represents a generic group of job names. That is, each bar accumulates, into one entity, the times of all jobs whose names share the same first six characters. This is based on the assumption that if the names of any jobs running in IBM i share the same first six characters, it is likely that these jobs run the same programs. They may run concurrently or at different periods in the same 24-hour timeline.
The asterisk at the end of a generic job name means the full job names are longer than six characters. If you see a name that does not end with “*”, it means the full job name is exactly what you see, and such a bar may represent one or more jobs. This naming convention is important because you need to use the generic or specific job names you see to check their relevance to what you saw previously in the Wait Overview chart.
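The grouping convention described above (first six characters, with a trailing asterisk when the full name is longer) can be sketched in a few lines of Python. This is an illustration of the convention only, not PDI's actual implementation; the function names and the sample wait values are hypothetical.

```python
from collections import defaultdict


def generic_name(job_name, prefix_len=6):
    """Return the generic group name: the first six characters plus '*'
    when the full job name is longer than six characters."""
    if len(job_name) > prefix_len:
        return job_name[:prefix_len] + "*"
    return job_name


def group_by_generic_name(samples):
    """Accumulate (job_name, seconds) pairs into one total per generic name."""
    totals = defaultdict(float)
    for job_name, seconds in samples:
        totals[generic_name(job_name)] += seconds
    return dict(totals)


# Made-up job names and wait seconds, for illustration only.
samples = [
    ("PAYRUN01", 120.0),
    ("PAYRUN02", 80.0),
    ("QZDASOINIT", 15.0),
    ("BACKUP", 5.0),
]

print(group_by_generic_name(samples))
```

Here PAYRUN01 and PAYRUN02 collapse into one PAYRUN* bar, while BACKUP keeps its exact name because it is exactly six characters long.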
Having looked at both charts above, I trust it is clear to you that the generic job named ICS618* (the topmost bar in the chart above) is the only group that accumulates a dominant amount of CPU Queuing wait time. My customer informed me that ICS618* jobs ran only during the critical batch run period that we saw in the first chart, which highlights that these jobs are the main, if not the only, factor causing the inordinate amount of CPU Queuing time, and therefore deserve to be the focus of our problem resolution effort. My further inquiry revealed that my customer ran 300 concurrent ICS618* jobs, each of which ran the same group of RPG programs with embedded SQL and/or OPNQRYF.
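The question of whether the dominant wait is concentrated in one job group or spread across most jobs can also be expressed numerically. The Python sketch below computes each group's share of total CPU Queuing time. The 80 percent threshold, the function names, and all the numbers are assumptions made purely for illustration; the chart itself gives no such cutoff.

```python
def queuing_share_by_group(queuing_seconds):
    """Each group's share of total CPU Queuing time, largest first."""
    total = sum(queuing_seconds.values())
    if total == 0:
        return {}
    ranked = sorted(queuing_seconds.items(), key=lambda kv: kv[1], reverse=True)
    return {name: secs / total for name, secs in ranked}


def is_concentrated(queuing_seconds, top_n=1, threshold=0.8):
    """True when the top_n groups hold at least `threshold` of all queuing time.
    The 0.8 default is an assumed cutoff, not an IBM guideline."""
    shares = list(queuing_share_by_group(queuing_seconds).values())
    return sum(shares[:top_n]) >= threshold


# Made-up numbers. Only the ICS618* group name comes from the case above.
one_dominant_group = {"ICS618*": 9000.0, "QZDASO*": 300.0, "BACKUP": 200.0}
spread_out = {"GRPA*": 1000.0, "GRPB*": 1100.0, "GRPC*": 900.0, "GRPD*": 1000.0}

print(is_concentrated(one_dominant_group))  # candidate for job-level tuning
print(is_concentrated(spread_out))          # looks like overall CPU shortage
```

A concentrated result points toward tuning the jobs in that group, while a spread-out result points toward the simpler case of overall workload exceeding CPU capacity.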
Let’s now try to find a resolution to the issue.
Combining the immense but undesired amount of CPU Queuing wait time with the fact that my customer’s server has 10 POWER8 CPU cores, it is reasonable to conclude that 300 concurrent batch jobs (30 per core) are far too many for the available CPU cores. The customer also told me that this number of concurrent jobs was determined without any specific justification. I figured that it might not be simple to determine the optimal number of such concurrent jobs, but since I had handled this kind of situation several times before and also happened to be familiar with the customer’s core application (a popular ISV solution on IBM i), I came up with my rule of thumb of five or six “heavyweight jobs” per POWER8, POWER9, or Power10 core. I therefore suggested that the customer try running 60 jobs instead (six per core across the 10 cores) and observe the total run time and wait time. To everyone’s relief, the total run time of the batch process improved by about 15 minutes from the original 1.25 hours. More importantly, the Wait Overview chart showed that CPU Queuing wait time shrank dramatically to just a small patch on the top of each bar. A further slight reduction in concurrent jobs eliminated all CPU Queuing wait time.
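The arithmetic behind this starting point is simple enough to sketch in Python. The function name is hypothetical, and the five-to-six jobs per core figure is the rule of thumb from this case, meant as a starting value to be verified against the Wait Overview chart rather than a general formula.

```python
def suggested_concurrent_jobs(cores, jobs_per_core=(5, 6)):
    """Return the (low, high) starting range of concurrent heavyweight jobs
    for the given number of CPU cores, using the rule of thumb above."""
    low, high = jobs_per_core
    return cores * low, cores * high


cores = 10            # the customer's 10 POWER8 cores
original_jobs = 300   # the concurrency originally in use

print(original_jobs / cores)             # 30.0 jobs per core
print(suggested_concurrent_jobs(cores))  # (50, 60)
```

The upper end of the range, 60 jobs, is the value tried first in this case; the final number came from reducing it slightly until CPU Queuing wait time disappeared from the chart.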
Let me say that it is not batch run-time improvement that is the main benefit here, because it is only modest in absolute terms. The more important benefit in this case is the elimination of the dominant amount of CPU Queuing time, because when this type of wait time appears in an overwhelming amount, it causes any jobs running during that period to suffer severely dismal performance. Now that this wait time is eliminated, all jobs run with decent and consistent performance, which we all love to see.
In my experience, many IBM i customers (and I as well) have struggled with how to determine the optimal number of concurrent batch jobs based on the available CPU power in order to optimize the entire batch run time. Many customers resorted to a high-number guess just to be on the safe side, but the result has frequently been that too much CPU Queuing wait time appeared when I looked at their PDI reports. Now, thanks to this CPU Queuing wait time analysis, the struggle is over. I hope you agree with me that this analysis is quite straightforward once you learn how.
This approach of analyzing dominant wait time also applies to any other type of wait time that you see in the legend of the Wait charts, but the specific ways to address other wait times are different. For example, if you see substantial Journal Time, you need to buy and install the IBM i HA Journal Performance product and turn on the Journal Cache parameter in the relevant journal object(s). If you see substantial Disk Page Fault Wait time, you have bad disk response time or an abnormally high memory page faulting rate or both, and you need to upgrade disk hardware and/or identify the jobs that produce the abnormally high faulting rate and try to reduce it. The list goes on.
You may now wonder what good overall system performance looks like in the Wait Overview chart. I have a sample below.
You see here that in every vertical bar of this 24-hour chart, Dispatched CPU Time is persistently and overwhelmingly dominant over the sum of all wait times, which appears only at the small tip of each bar. (There’s no need to worry about the few exceptions.) In such a case, the overall CPU utilization in the entire daytime period is consistently on the high side, but you can hardly see CPU Queuing wait time at all.
This last chart is a case of very decent overall system performance from the perspective of server hardware and some operating system functions. It does not always mean there is no performance issue if your IT infrastructure involves multiple servers working together, which is quite typical of contemporary client/server and web-based application deployment. If users of such applications complain about bad performance of their business operations, but the Wait Overview chart exhibits little overall wait time like the sample above, then the cause of the problem is in neither the server hardware nor most OS functions. The rarity of overall wait time speaks for itself. We need to look elsewhere for the cause of the performance issue. I have helped some customers with such external causes of performance problems, and I intend to share this with you in another article.
The last point I would like to mention here is that IBM used to publish guidelines for CPU Utilization % for many old AS/400 and iSeries models. But since the era of POWER5-based servers, when IBM delivered virtual processors, uncapped partitioning, shared processor pools, and other virtualization technologies, the CPU % Busy guideline has become a less useful metric. You may see your server’s CPU frequently utilized at 90 to 100 percent all day long, but as long as you see little or no overall CPU Queuing (and Machine Level Gate Serialization Time, which I do not discuss here), there is no immediate and serious cause for concern about bad overall system performance in an IBM i server. But you still have a long-term concern about the server’s CPU power sizing, the deployment optimization of your application workload, or both.
I do hope this has helped you gain some understanding and appreciation for these two useful Wait charts of the PDI tool, and that you will find them beneficial in the future.
The IBM i performance investigative game is afoot!
Satid Singkorapoom has 31 years of experience as an IBM i technical specialist, dating back to the days of the AS/400 and OS/400. His areas of expertise are general IBM i system administration and performance of server hardware, the operating system, and database. He also has an acquired taste for troubleshooting problems for his IBM i customers in the ASEAN geography.