Guru: IBM i Experience Sharing, Case 3 – When Performance Issues Come From Without
April 4, 2022 Satid Singkorapoom
When I started my IT career 35 years ago, it was in the “centralized” universe that originated from the mainframe model. All core application code ran in one, and only one, big iron that all users accessed through “dumb” terminals devoid of any GUI. Problem solving on AS/400 systems was frequently straightforward and not time consuming because most cases were anything but elusive.
But the contemporary IT infrastructure universe has evolved into a big onion, with layers that we must peel while troubleshooting. I often find myself having to address a problem in multiple layers, and it no longer surprises me when the cause(s) of a problem lies outside of the IBM i server. When IBM i customers had problems, they almost always called IBM for help first, and my colleagues (including trustworthy business partners) and I dutifully did our best to oblige, only to find in some cases that the causes of the problem were not in IBM i servers. But none of us could have known from the outset that this was the case, because we lived in the embrace of a multi-layer onion of IT.
Let me share with you two such cases in the hope that you gain some insight from them.
The first case was a customer who upgraded from a Power6 server with internal disk to a Power8 server with IBM FlashSystem V9000 SAN storage using flash disk technology. Within the first few months, the customer was disappointed to find that their most important yearly batch process (the one with the heaviest workload to crunch) ran only slightly faster on the Power8 than it had on the old server. A two-generation server upgrade is quite a big boost in hardware performance, and everyone expected a corresponding boost in batch run time. Another strange thing was that all other operations yielded the expected level of improvement: data backup time, virtually all non-batch workload, remote SQL data access and such improved satisfactorily.
When I analyzed a set of IBM i PDI performance report charts, I found only two performance anomalies. All other charts exhibited no performance issues. The first issue was on wait time in batch jobs.
In the PDI chart for Wait by Generic Job or Task (Figure 1), batch jobs DD8800* encountered a substantial amount of Machine Level Gate Serialization Time and a dominant amount of Journal Time. I would ask IBM i users to regard Machine Gate Serialization wait time as similar to the CPU Queuing wait time that I discussed in my previous article; its cause and solution are similar in some respects. So, this issue was addressed in a straightforward way by reducing the number of concurrent batch jobs to a proper amount, perhaps some 10 to 20 percent less.
The customer had already bought and installed the IBM i HA Journal Performance option. To reduce Journal wait time, they just needed to identify all journal objects covering the physical files used by the DD8800* jobs and then use the Change Journal (CHGJRN) command to enable journal caching via the JRNCACHE parameter.
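As a sketch, the two fixes could look like the following CL commands (the library, journal, subsystem, and job queue names here are hypothetical, and the MAXACT value is only illustrative; your proper number of concurrent jobs depends on your workload):

```
/* Cap the number of batch jobs active at once from the job queue */
CHGJOBQE SBSD(QBATCH) JOBQ(MYLIB/BATCHJOBQ) MAXACT(8)

/* Enable journal caching on the journal covering the batch files  */
/* (requires the HA Journal Performance option)                    */
CHGJRN JRN(MYLIB/MYJRN) JRNCACHE(*YES)
```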
After taking those actions, the resulting run time improved, but was not yet to the customer’s satisfaction. We proceeded to consider the next performance anomaly that I detected.
This was degraded disk response time during the batch run period, which was quite surprising considering that the V9000 SAN storage uses flash disk! Figure 2 shows the PDI chart for disk response time vs. disk IOPS.
You can see from the left half of the chart above (the batch run period) that disk read and write response times (the blue and red lines in the red circle) range widely, from 5 to 80 milliseconds! (The right axis indicates “seconds,” but this is a bug; the correct unit is milliseconds.) The fluctuation span of response time is disappointingly wide. This is perplexingly unexpected of a flash disk system, since the disk IOPS workload (blue and green bar graphs) during the bad response time period is not particularly high for flash disk. It stays below 10,000 IOPS for most of the period, which is low for flash disk, with a few bursts that are still not very high.
Another aspect of this disk response time anomaly can be seen from the PDI chart on Disk Overview by Disk Unit (Figure 3 below). Each horizontal bar represents average response time of each logical unit number (LUN) over 24 hours. We have no need to see each bar clearly. The “forest” view is more informative here.
The chart in Figure 3, with its overall response time range of six milliseconds (msec) and less, would look great if it came from spinning disk units. Although the response time discrepancy of about 1.5 msec between the two disk groups is not a big deal for spinning disk, no discrepancy at all is definitely best. Since this chart is from LUNs on flash disk units, it does not look too bad, but why any response time discrepancy at all on flash disks?
When I reported my finding to all parties, I was informed that the new Power8 and FlashSystem V9000 servers were connected to an existing SAN switch the customer used to connect several existing Wintel servers to their SAN storage. The switch supported 16 Gbps fiber ports, but its internal fabric ran only at 8 Gbps speed. Someone wondered if this contributed to the response time fluctuation to some degree.
Luckily, this was not a tough question to answer. I asked for a disk response time report from the V9000 itself for the same date and time as my PDI chart in Figure 2. The report came as a CSV file listing the V9000’s response time in five-minute intervals, and this additional piece of evidence confirmed to us all that the V9000’s own response time fluctuated in step with what I saw in Figure 2. I also noticed that the response time reported by the V9000 was about 10 to 20 percent lower than that in my PDI chart. This made it clear that the SAN switch contributed only that minor difference; the bulk of the fluctuation originated within the V9000 itself.
So, we had a clearly identified anomaly we needed to address. The question was what caused this fluctuating response time degradation in V9000?
Disk response time degrades when disk IO workload rises, and there are two main disk IO workload types to consider in an analysis: IO operations per second (IOPS) and the read/write data transfer rate (MB/sec). A rise in either of these workloads to high values degrades disk response time to a degree that depends on the performance capacity of the disk subsystem hardware configuration and on which features are active. The degradation is considered bad if response time is consistently much higher than 5 msec.
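To make the relationship between the two metrics concrete, here is a small illustrative calculation (the IOPS figure and IO sizes are hypothetical, not taken from the customer's system):

```python
# Data transfer rate (MB/sec) = IOPS x average transfer size per IO,
# so the same IOPS level can represent very different MB/sec workloads.
def mb_per_sec(iops: int, avg_io_kb: float) -> float:
    """Convert an IO rate and an average IO size to MB/sec (1 MB = 1024 KB)."""
    return iops * avg_io_kb / 1024

# 10,000 IOPS of small 4 KB IOs is a light ~39 MB/sec stream ...
print(round(mb_per_sec(10_000, 4), 1))   # 39.1
# ... while 10,000 IOPS of large 64 KB IOs is a heavy 625 MB/sec stream.
print(round(mb_per_sec(10_000, 64), 1))  # 625.0
```

This is why a chart can show an unremarkable IOPS rate while the MB/sec rate is high: large average IO sizes, typical of batch processing, drive the transfer rate up.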
The Read/Write Rate and Response Time chart (Figure 2) shows, on the leftmost side and across the entire right half (left and right of the red circle), that response time does not degrade as the IOPS rate rises (vertical green and blue bars). The picture is different from the MB/sec perspective. This rate (the dark lines in the chart in Figure 4) during the batch run period is on the high side, as it exceeds 500 MB/sec in many instances and even goes beyond 1,000 MB/sec a few times. This is shown in the disk response time vs. read/write MB/sec rate chart in Figure 4 below.
According to my storage colleague, these high disk IO workloads were still within what the V9000 could handle with grace. But then someone mentioned that a feature called Data Reduction Pool (DRP), which performs data compression and deduplication, was activated on this V9000. This was a crucial piece of information, because past experience told me that this kind of feature normally lowers the upper limit of the disk hardware’s performance capacity for handling a high IO workload with decent response time. I did a bit of research in a few IBM sources and found some information about designing DRP for decent performance.
At this point, I had reached the end of my part of the responsibility, because the solution lay in addressing the issue within the V9000, about which I have insufficient knowledge and experience. I handed the next steps over to the storage specialist who had designed this V9000 so that he could rectify the wildly fluctuating response times and the discrepancy between the two disk groups.
When I checked back on the case, I was informed that the problem had been rectified, ensuring decent and consistent response time for the customer’s workload. The rectification process took some months, and when it was finished, I saw good, more consistent disk response time under 1 msec, with degradation at high workload that did not go beyond 5 msec, as shown in the Disk Throughput Overview for Disk Pools chart in Figure 5 below. Batch run time improved as well.
In the left half of the chart in Figure 5, you can see that the MB/sec workload (light blue line) rises to and beyond 1,000 MB/sec in a few instances. The first one reaches almost 2,000 MB/sec and stays in that range for about an hour, while response time (dark blue bars) manages to stay below 2 msec most of the time, with a few high bursts that are still under 5 msec. Such behavior is desirable in any disk subsystem. I still do not like seeing the disk wait time (brown bars on top of the dark blue ones), and I suspect it may have to do with the SAN switch I described earlier.
The lesson learned from this case is that, even with flash disk technology, sophisticated features that deliver more efficient use of disk space may come at the price of reduced disk hardware performance capacity if they are not designed and implemented with careful attention and sufficient relevant knowledge and experience. For a production business workload, we need to be careful with such features, since good and consistent disk performance is our foremost priority. It is important to seek assurance from the SAN storage vendor that the SAN storage server is designed to deliver decent and consistent disk response time under a fluctuating IO workload.
Here is one more case study of an external source of a performance problem in a complex IT universe.
A personal finance services company with about a million clients provides typical modern-day mobile/web services to clients. My customer runs a number of Wintel VMs (Virtual Machines) for web server frontend workloads that interact with many backend server jobs in an IBM i with nine Power8 CPU cores.
When my customer upgraded their fiber LAN from 1 Gbps to 10 Gbps, a frustrating problem followed. Attempts to correct it improved the situation somewhat but did not yield a decisive resolution. My customer added one more CPU core and more memory, and the ISV backend application team adjusted the number of backend server jobs and relevant parameters, such as the maximum allowed connections per IP port, but the problem persisted to a frustrating degree after a few months of collective effort.
The problem was that on the 25th and 26th of every month, the majority of my customer’s clients accessed the mobile/web services to pay their monthly dues. Many clients encountered annoying problems such as rejected initial connections, sessions freezing after being established, or connections dropping mid-transaction. Some had to try several times to complete their transactions, and naturally many complained to the call center. The problem also existed on the remaining days of each month, but to a much lesser degree. Eventually I was asked to come in and help.
When I initially obtained this description of the problem, it appeared to me this problem exhibited a strong performance-related nature. (But I also realized in the back of my mind that if I had a hammer, everything would look like a nail!) Since this was a distributed application architecture, the key question was where the bottleneck was. I started from the usual place: IBM i.
After obtaining sufficient IT background on the case, I spent many days checking many factors in IBM i, such as job logs of the server jobs, low-level error logs, history logs, Db2 Plan Cache, and PDI performance reports, but I did not find anything suspicious. The IBM i server hardware, operating system, and database management system appeared to be in decent, if not ideal, operational and performance health. This was when I told myself to look outside of IBM i, as I was by then quite certain the cause came from outside the IBM i environment.
I started with what I could do, which was to rely on IBM i communication trace and the Wireshark tool for basic network protocol analysis. The last time I used Wireshark was when it was new many years ago, so I took some time to familiarize myself with the current version, and I found two useful reports that helped me advance further along the path to a solution.
The first useful Wireshark report was Analyze –> Expert Information. Comparing reports from a normal day against those from the problematic peak workload day, I saw a large number of Connection Reset warnings on the peak day, as shown in Figure 6 below. (Notice the entry with the red arrow and 37,526 incidents.)
Drilling down further revealed that all the Connection Resets were sent from the IP sockets used by backend server jobs in IBM i to the corresponding IP sockets used by the frontend VMs for the troubled transactions. Deeper analysis help from a network protocol expert revealed that these resets were all due to the backend server jobs receiving corrupted data packets (incorrect data length information) from the frontend VMs, so the responding connection resets from the backend were expected. This raised the question: why were some packets from the frontend corrupted? The higher the transaction workload, the more corrupted packets there were. This strongly exhibited a performance-related pattern.
Searching back into my deep past, I remembered an incident of a similar nature. I was involved in a case in which SAP ECC on i was implemented with two iSeries servers, one as the DB server and the other as the application server. The purpose of this setup was to handle a very high workload from a thousand users, who complained about randomly failing transactions at peak workload periods, which occurred almost every day during a certain time window. Although I detected no serious performance issue within either iSeries server, network protocol analysis by an expert revealed many corrupted data frames sent from the app server. The expert told us he had detected what is called TCP buffer overrun and advised us to increase the TCP send and receive buffers in both servers to about 1 MB each. (They were about 200 KB at the time.) Happily, this solved the problem. TCP buffer overrun is the result of too small a buffer size for very high network traffic.
Based on this past experience, I suggested the customer increase the TCP buffers on both IBM i and the Wintel VMs to 500 KB. It turned out that, on the Wintel servers, the buffer had to be adjusted at the network adapter level, and my customer could not find anyone who knew how to do this.
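On systems where the buffers are adjustable per socket, the adjustment looks like this minimal Python sketch (the 512 KB figure mirrors the suggestion above; the kernel may round, double, or cap the granted size, so a sensible program reads the value back):

```python
import socket

# Request larger TCP send/receive buffers on a socket to reduce the
# risk of buffer overrun under very high traffic.
BUF_SIZE = 512 * 1024  # ~500 KB, as suggested in the text

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)

# The OS may grant a different size, so always read it back.
granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(granted_snd, granted_rcv)  # actual sizes granted by the kernel
sock.close()
```

On IBM i, the corresponding system-wide settings are the TCPSNDBUF and TCPRCVBUF parameters of the CHGTCPA command.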
The network expert also told me that the frontend servers opened and closed socket connections for every transaction, a wasteful practice that incurred protocol overhead for each transaction and added latency to each connection. It increased network traffic volume as well. A well-known approach to reducing such overhead is to create a pool of persistent connections in each frontend VM and allocate and deallocate each pooled connection as needed. This eliminates excessive connection opening and closing and reduces both overhead and latency. Since this approach would take time and effort to implement, the customer chose to attend to it later, after the immediate problem was addressed.
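The pooling idea can be sketched as follows (a minimal Python illustration, generic over any connection factory; the names are illustrative, not the customer's actual frontend code):

```python
import queue

class ConnectionPool:
    """Hold a fixed set of persistent connections and hand them out on
    demand, instead of opening and closing a connection per transaction."""

    def __init__(self, factory, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())  # open all connections once, up front

    def acquire(self, timeout=None):
        # Blocks until a pooled connection becomes free.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        # Return the connection for reuse; no close/reopen handshake needed.
        self._pool.put(conn)

# Illustrative use, with a dummy factory standing in for a real socket connect:
opened = []
pool = ConnectionPool(lambda: opened.append(1) or object(), size=3)
conn = pool.acquire()
pool.release(conn)
conn = pool.acquire()  # reuses a pooled connection; nothing new is opened
pool.release(conn)
print(len(opened))     # 3: total opens stay fixed however many transactions run
```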
So, the problem persisted. Luckily, I noticed another Wireshark report that looked promising: the Statistics –> Endpoints report shown in Figure 7 below.
Based on the sample result in Figure 7, I discussed this report with my customer, who informed me that they expected all six frontend VMs to carry the incoming workload equally (IP addresses ending in 9 to 14 in the report). The report clearly showed this was not the case, to my customer’s dismay. Only three frontend systems (addresses 9, 12, 13) carried most of the transaction workload, while the remaining three (addresses 10, 11, 14) carried very little. The customer immediately rectified this imbalance, and a subsequent Endpoints report verified the desired workload distribution.
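For readers who prefer the command line, both Wireshark reports used in this case have tshark equivalents (assuming tshark is installed; the capture filename here is hypothetical):

```
# Expert Information summary (counts warnings such as Connection Reset)
tshark -r peakday.pcap -q -z expert

# Endpoints report (per-IP-address traffic totals, as in Figure 7)
tshark -r peakday.pcap -q -z endpoints,ip
```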
We waited until the following peak day to review the Wireshark reports again, and the count of Connection Resets dropped from the original 37,000 range to about 9,000. However, all of us wanted a zero count. I proposed that a sensible way to eliminate the resets was to ensure no frontend VM carried an overwhelming workload that interfered with its TCP/IP protocol exchanges, and I suggested the customer increase the number of frontend VMs from the current six to at least 12. The customer agreed immediately and increased the count to 15 VMs. The following peak day went by without any problem for the first time in many months, to everyone’s joyful relief.
The last hurdle the customer wanted to address was to reduce network protocol overhead using a pool of persistent connections. They set up a new test frontend VM to implement this method and found one last problem for me.
Every morning, when they started testing the persistent connections, they found that all the supposedly persistent connections had been disconnected. They restarted the connection pool and tested for the day, only to find the same connection drop again the following morning.
When they consulted me, it reminded me of another situation I had encountered a few times before, involving the firewall standing between the frontend and backend servers. One thing a firewall is designed to do is drop any connection that stays idle too long, governed by a parameter called the idle connection timeout, which is normally set in the range of 5-10 minutes by default. On the server side, the TCP feature called keep-alive is available to prevent a persistent connection from appearing idle for too long. It is controlled by the keep-alive timer, which is normally set longer than 10 minutes by default; in IBM i, the default is 120 minutes. With the keep-alive timer far longer than the firewall’s idle timeout, the firewall drops an idle connection long before the first keep-alive probe is ever sent.
I explained this to the customer and asked them to try increasing the firewall’s idle connection timeout to something longer than 10 minutes, say 30 or 60 minutes, and to set the TCP keep-alive timer about five minutes shorter than that. My customer took this action and informed me later that the problem was solved. With that, I considered my contribution to the case complete, and the customer agreed.
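The keep-alive interplay described above can also be sketched at the socket-API level (a minimal Python illustration using Linux socket option names; the 25-minute idle value is hypothetical, chosen only to be shorter than an assumed 30-minute firewall timeout):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Turn keep-alive on so the TCP stack sends periodic probes on an idle
# connection; the firewall then never sees the connection as idle.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific knobs (hypothetical values): start probing after 25 minutes
# of idle time, probe every 60 seconds, give up after 5 unanswered probes.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 25 * 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)

enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print(enabled)  # 1 when keep-alive is active
sock.close()
```

On IBM i, the corresponding system-wide setting is the TCPKEEPALV parameter of the CHGTCPA command, whose 120-minute default is the one mentioned above.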
In summary, this problem had its root cause in an unbalanced and overwhelming workload on the frontend VMs, which interfered with the underlying TCP/IP mechanisms. But somehow the customer overlooked this part and reached out to IBM for help. My colleagues and I accommodated the request and gained new experience in the process, not to mention winning back a happy customer.
The multiple layers of the infrastructure onion made the troubleshooting process time consuming, and we had to move from layer to layer until we eventually succeeded. The Wireshark tool was undeniably crucial. All’s well that ends well, as the Bard wrote.
Troubleshooting in the IT universe these days means dealing with a multi-layer onion of IT infrastructure. I would say that those who appreciate this fact, keep an open mind and a sense of curiosity, and act accordingly as a team will be well on their way to the desired destination.
Satid Singkorapoom has 31 years of experience as an IBM i technical specialist, dating back to the days of the AS/400 and OS/400. His areas of expertise are general IBM i system administration and performance of server hardware, the operating system, and database. He also has an acquired taste for troubleshooting problems for his IBM i customers in the ASEAN geography.