Guru: IBM i Experience Sharing Case 4 – Investigating Time-Sensitive Transaction Issues
June 6, 2022 Satid Singkorapoom
Among the central processing hardware resources in a computer system – CPU, GPU, memory, disk, PCI-Express I/O bus – disk has always been the slowest component. Even the latest flash disk or NVM-Express flash drive or card is still the slowest, though not by much. Back in the days of hard disks, disk I/O was the most common cause of performance problems. From experience, I always looked at it first in my investigations.
The modern SAN flash disk can still be the performance bottleneck when it is not deployed properly, as I shared in Guru: IBM i Experience Sharing, Case 3 – When Performance Issues Come From Without.
When I evaluate IBM i server performance, my rule-of-thumb is that a disk response time of five milliseconds or less is considered good to great, and if it stays consistent at this level over a varying disk I/O workload (IOPS and MB/sec rates), that’s even better. For SSD, I use 2.5 milliseconds as the guideline.
Let’s analyze a sample chart of good disk response time versus disk workload in MB/sec (see Figure 1).
This is from a SAN flash disk that exhibits very good response time (but not consistent enough – I’ll discuss that soon). In this 24-hour timeline, the response time (orange bar) never goes beyond 1.5 milliseconds, which is very desirable performance. Another good sign is that overall Disk Wait Time (green bar) is very rare.
As for the disk I/O workload, my performance evaluation guideline is that a disk data rate of 500 MB/sec and beyond is considered high, and beyond 1,000 MB/sec is very high (purple and red lines). You can see that the peak disk data rate reaches about 1,800 MB/sec on several occasions (four red arrows) but the response time never degrades beyond 1.5 milliseconds, an example of grace under stress.
In the middle part of the chart, you see a low data rate with very low response time, indicating a period of relatively low disk workload.
In my experience of assessing performance, it is difficult for disk response time not to degrade under a high I/O workload, and this sample chart is no exception. The good point about this sample is that the degraded response time stays in the good range. You need to configure a lot of disk controller cards and physical disk units to maintain a very consistent response time at high I/O workload, and that can cost a fortune. So, my personal guideline is that it is acceptable for disk response time to fluctuate with the disk I/O workload (IOPS and MB/sec) as long as it stays under 5 milliseconds (2.5 milliseconds for SSD).
When evaluating a chart of hard disk response time versus IOPS, I consider 10,000 IOPS a high I/O workload and 20,000 IOPS very high. For SSD, I use 50,000 IOPS as high.
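The rules of thumb above can be written down as a small helper, which is handy if you export chart data and want to label each interval. This is only a sketch: the threshold numbers come from the guidelines in this article, while the function names, the "hdd"/"ssd" keys, and the "not high" label are my own illustrative choices and not part of any IBM i tool.

```python
# Rule-of-thumb thresholds taken from the guidelines in this article.
RESPONSE_MS_LIMIT = {"hdd": 5.0, "ssd": 2.5}   # good at or below this value
HIGH_MB_PER_SEC = 500.0                         # 500 MB/sec and beyond is high
VERY_HIGH_MB_PER_SEC = 1000.0                   # beyond 1,000 MB/sec is very high
HIGH_IOPS = {"hdd": 10_000, "ssd": 50_000}
VERY_HIGH_IOPS = {"hdd": 20_000}                # no "very high" figure is given for SSD


def rate_response_time(response_ms: float, media: str = "hdd") -> str:
    """Label a disk response time as 'good' or 'degraded' for the given media."""
    return "good" if response_ms <= RESPONSE_MS_LIMIT[media] else "degraded"


def rate_data_rate(mb_per_sec: float) -> str:
    """Label a disk data rate (MB/sec) as 'not high', 'high', or 'very high'."""
    if mb_per_sec > VERY_HIGH_MB_PER_SEC:
        return "very high"
    if mb_per_sec >= HIGH_MB_PER_SEC:
        return "high"
    return "not high"


def rate_iops(iops: float, media: str = "hdd") -> str:
    """Label an IOPS figure as 'not high', 'high', or 'very high'."""
    very_high = VERY_HIGH_IOPS.get(media)
    if very_high is not None and iops >= very_high:
        return "very high"
    if iops >= HIGH_IOPS[media]:
        return "high"
    return "not high"
```

For the chart in Figure 1, `rate_response_time(1.5, "ssd")` gives "good" while `rate_data_rate(1800)` gives "very high", which is the grace-under-stress combination described earlier.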
This guideline served me well in most cases, but there were a few exceptions. I would like to discuss one such case here.
The exception has to do specifically with a time-sensitive workload, which is a kind of client/server transaction in which a client system/device needs a response from its server within a strictly short time period, in most cases a few seconds. A transaction from a cash ATM machine is one good example. Many financial transactions, such as money transfer, all kinds of payment, and account inquiry, are time-sensitive, as it would not make sense to allow a client system/device to wait too long for its server to respond during such transactions. If the server system cannot oblige in due time, it’s sensible to terminate the transaction, roll back any data changes, and let the user retry or not.
Here is a case study involving such a workload.
A banking customer upgraded their Power6 server to Power8 and ran the new server for a few months. The new server had internal hard disk units, no SSD. I was assigned to conduct an overall system performance health assessment of the new server, not to solve any problems. At the end of our discussion of my mission, I asked them whether they had anything they would like me to do in addition to my stated mission. The chief responded that he had one problem for me to see if I could be of any help to the bank. Here follows the description of the nagging problem.
Many times each month, the bank’s clients complained to the call center about failed ATM transactions. Some said they had to retry the transactions a few times to make them go through. On any normal day, there were just a few or no complaints, but the number was the highest (as many as a hundred) on the payday period, which was a cause for serious concern to the bank. I asked if they noticed any discernible pattern to the problem, and the chief responded with a yes. This problem happened only around 8:00 PM and lasted for about half an hour. It never happened during other time slots of the day! After a brief pause during which I needed to use my brain cells, my next question was whether the chief had observed what kind of workload was running on the IBM i server during that time slot. The answer was that 8:00 PM was the start of their nightly batch processing. This was getting more interesting with each second that went by.
I then asked the chief to tell me what operations were run between 8:00 PM and 8:30 PM and the answer was just one operation: pre-batch data backup, which took roughly half an hour for the many gigabytes of data. What a coincidence! I made such an utterance, and my colleague and the chief said they could not have agreed more. I guessed the chief already had an idea about the cause of this problem, but no idea how to resolve it.
Two more facts were that the backup was made to a number of IBM i save files, not to tape, and they ran some 15 concurrent backup jobs, aiming to finish the backup ASAP. This means the operation was performing a lot of reading from and at the same time writing to disks – a lot of disk IO workload for sure.
I thanked the chief for all this crucial information and told him I might be able to help after gathering and analyzing performance data. A few hours later, I looked at several PDI charts and obtained the following facts:
- CPU was initially high at about 90 percent for the first 5 to 10 minutes, but much lower for the rest of the data backup period. However, there was no corresponding CPU Queuing or Machine Level Gate Serialization wait time. No problem here.
- Memory faulting of the period was in a range of 300 faults per second. There was just a very small amount of corresponding Disk Page Fault wait time. No problem here.
- Disk IOPS was on the low side in the range of 2,000 IOPS. No problem.
The only dominant wait time of this period was Disk Other Time. My additional research revealed that this could have had to do with disk wait time from the backup operation. There was no clear clue for the problem here.
I didn’t think to produce the Wait by Generic Job or Task chart because this was the early days when the PDI tool was still new, and I had to learn more about it.
I looked at the disk response time versus MB/sec rate chart (Figure 2) and had a nasty surprise.
Right at the middle of this 24-hour chart is 8:00 PM, and you can see both high read and write data rates (two lines in different shades of blue). One peaks at 900 MB/sec and the other at almost 1,800 MB/sec (red arrows). Yet disk response time in this period (orange bar in red circle) is only about 3 milliseconds or less. The response time does degrade a lot compared to the preceding 12 hours in the chart (mostly under 1 millisecond), but that is understandable given the burst of data rate at 8:00 PM from the data backup operation to save files. By my guideline there was no disk performance problem at all, hence my nasty surprise about the transaction time-outs.
At the time, I was not familiar with time-sensitive transactions but luckily my colleague explained the concept to me. We checked with the chief, who verified with someone else that such was the case – there was a time-out specified for ATM transactions. Two brains are always better than one!
What went on in my head at that time was a hypothesis that the degraded disk response time, although considered good in a general sense, could be causing the relevant server jobs in IBM i to take a fraction more of a second beyond the time-out to respond to the ATM machines. If we could reduce the degradation, we might be able to solve the problem. But how?
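To make the hypothesis concrete, here is a back-of-the-envelope calculation. It is only an illustration: the time-out value, the number of synchronous disk I/Os per transaction, and the non-disk processing time are hypothetical figures I made up for the arithmetic, not measurements from the bank. Only the two response times (about 1 millisecond in the daytime, about 3 milliseconds during the backup) come from the chart in Figure 2.

```python
TIMEOUT_SECONDS = 3.0   # hypothetical ATM time-out
SYNC_DISK_IOS = 600     # hypothetical synchronous disk I/Os per transaction
OTHER_SECONDS = 1.5     # hypothetical non-disk time (network, CPU, other waits)


def transaction_seconds(sync_disk_ios: int, disk_response_ms: float,
                        other_seconds: float) -> float:
    """Estimate end-to-end time when every synchronous I/O waits one response time."""
    return other_seconds + sync_disk_ios * disk_response_ms / 1000.0


# Response times read from Figure 2: about 1 ms normally, about 3 ms at 8:00 PM.
for label, response_ms in (("daytime", 1.0), ("during backup", 3.0)):
    total = transaction_seconds(SYNC_DISK_IOS, response_ms, OTHER_SECONDS)
    verdict = "times out" if total > TIMEOUT_SECONDS else "completes"
    print(f"{label}: {total:.1f} s, {verdict}")
```

With these made-up inputs, the same transaction fits inside the time-out at 1 millisecond per I/O but overshoots it at 3 milliseconds, even though both response times count as good under my general guideline.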
The answer was based on the fact that 15 concurrent jobs ran the backup tasks. It was clear from the chart in Figure 2 that the high data rate workload was stressing the disk hardware. I asked the chief to cut the concurrent jobs by roughly half, to eight jobs, and observe the complaints, especially on payday. The chief later told us that payday complaints had decreased dramatically, but had not totally disappeared. I then asked him to reduce the concurrent backup jobs to six. This eliminated the complaints, and all were happy. Pre-batch backup time increased to about 40 minutes, which was a modest price to pay for the solution to the problem.
Now, if you look at the right half of the chart above (after midnight), you will see that disk response time degrades even more and for longer (post-batch backup, perhaps), but this does not cause the time-out problem simply because very few people use ATM machines after midnight. Saved by luck!
Later I encountered a few more transaction time-out problems from more financial services companies. They all exhibited the same degraded disk response time. Some were like the case I just described. They were easy to analyze. But others were different, and further analysis was required. In those cases, the cause of degradation differed in that disk hardware configuration had been changed for more usable disk space without proper understanding about the performance consequences.
When you have a compelling need to make changes to the disk hardware of an IBM i server running a production workload with a time-sensitive application, it is a good idea to consult an IBM i hardware expert, who can show you how to design the disk hardware change without degrading disk response time. Comparing PDI charts of disk response time from before and after the change is crucial for checking the performance results of the disk hardware change.
Performance data from PDI charts may not always speak clearly to you as to what the prominent cause of a performance-related problem is. In such a case, it is prudent to scan a basic set of PDI charts (on CPU utilization, waits, disk response time versus I/O workload, memory faulting, and a few more) to identify and pay attention to what you think is the best clue from the data, build a sensible hypothesis from it, and perform an experiment to test the idea and see whether you are on the right track or not.
The lesson learned here is also that in certain situations in which disk response time degrades from its normal range, time-sensitive transactions can be vulnerable to this and misbehave. Keep in mind that there is a random nature to this problem in that not all, but some to many, such transactions will misbehave during that vulnerable period. A general pattern to the problem may be discernible, and if you pay attention, you may identify it. IBM i PDI charts supply a good source of clues for you to look at the time periods when disk response time degrades. Check to see if they correspond to the times when time-out problems occur or not. If so, gather and analyze more PDI charts, formulate the most viable theory, and take actions to test your theory to resolve the degradation.
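The check described above, whether time-out incidents line up with periods of degraded disk response time, can also be done programmatically once you have the numbers out of the charts. The sketch below assumes you have a list of interval samples (start time and average response time in milliseconds) and a list of incident timestamps from the call center or application log. The data layout, the function names, the "twice the median" rule for flagging degradation, and the sample values are all my own assumptions for illustration, not a PDI feature.

```python
from datetime import datetime, timedelta
from statistics import median


def degraded_windows(samples, interval, factor=2.0):
    """Return (start, end) windows whose response time exceeds factor x the median.

    samples: list of (interval_start: datetime, response_ms: float)
    interval: timedelta length of each collection interval
    """
    baseline = median(ms for _, ms in samples)
    return [(start, start + interval)
            for start, ms in samples if ms > factor * baseline]


def share_of_incidents_in(windows, incidents):
    """Fraction of incident timestamps that fall inside any of the windows."""
    if not incidents:
        return 0.0
    hits = sum(1 for t in incidents
               if any(lo <= t < hi for lo, hi in windows))
    return hits / len(incidents)


# Hypothetical data: 30-minute intervals around 8:00 PM on a made-up date.
day = datetime(2022, 6, 1)
samples = [
    (day + timedelta(hours=19), 0.8),
    (day + timedelta(hours=19, minutes=30), 0.9),
    (day + timedelta(hours=20), 3.0),   # backup period
    (day + timedelta(hours=20, minutes=30), 1.0),
    (day + timedelta(hours=21), 0.8),
]
incidents = [
    day + timedelta(hours=20, minutes=5),
    day + timedelta(hours=20, minutes=17),
    day + timedelta(hours=14, minutes=40),
]

windows = degraded_windows(samples, timedelta(minutes=30))
share = share_of_incidents_in(windows, incidents)
```

A high share suggests the degraded periods deserve a closer look. It does not prove the cause, so you still need an experiment, like the reduction of concurrent backup jobs in this case, to test the theory.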