Admin Alert: Dealing with i5 Critical Storage Errors, Part 1

April 25, 2007 Joe Hertvik

Critical storage errors can occur whenever i5 hard drive usage passes 90 percent of available storage space. Above 90 percent, system performance starts to degrade. When storage usage breaches 95 percent, the system can become unstable and turn itself off or spontaneously reboot. While you can’t predict when critical storage errors will occur, there are several things you can do to detect and clean up storage problems before they crash your system.

What’s the Disk Facts, Jack?

This week and next week, I’ll deal with the issues involved in monitoring and correcting critical storage errors or storage overflow conditions. Today, I’ll look at some tricks for monitoring your system for developing storage problems. Next week, I’ll discuss some common reasons why your i5 partition storage may be filling up and show you some simple things that you can do to reduce storage usage.

The first step in handling unexpected disk usage spikes is to monitor for them. The i5/OS operating system provides the following two monitoring functions to detect hazardous disk overflow situations.

A disk drive threshold value that tells the system to start sending error messages when system storage usage exceeds it. The threshold value can be set inside the System Service Tools menu (SST) or inside the Dedicated Service Tools menu (DST). By default, i5/OS sets this value to 90 percent of disk capacity.
Two system values that let you designate a specific action for the system to take when the percentage of available hard drive storage falls below a certain level. These system values are the auxiliary storage lower limit value (QSTGLOWLMT) and the auxiliary storage lower limit action (QSTGLOWACN). When available storage falls below the percentage specified in QSTGLOWLMT, the system triggers the action specified in QSTGLOWACN. By default, the auxiliary storage lower limit is set at 5 percent. The possible actions after available storage falls below the QSTGLOWLMT limit include sending messages, calling a special exit point program for custom processing, ending the system to a restricted state, or powering down the system.
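One detail that is easy to trip over is that the threshold is expressed as percent used, while QSTGLOWLMT is expressed as percent available. The following is a minimal Python sketch of that relationship, not an i5/OS interface: the function name and the returned labels are hypothetical, and the defaults are the 90 percent and 5 percent figures quoted above.

```python
def tripped_monitors(pct_used, threshold_pct_used=90.0, lower_limit_pct_avail=5.0):
    """Return which storage monitors would fire for a given percent used.

    Illustrative only. Defaults mirror the i5/OS defaults quoted in the
    article; the labels in the returned list are made up for this sketch.
    """
    if not 0.0 <= pct_used <= 100.0:
        raise ValueError("pct_used must be between 0 and 100")

    pct_avail = 100.0 - pct_used
    tripped = []

    # The disk threshold is a percent USED figure and fires when exceeded.
    if pct_used > threshold_pct_used:
        tripped.append("threshold")

    # QSTGLOWLMT is a percent AVAILABLE figure and fires when storage
    # falls below it.
    if pct_avail < lower_limit_pct_avail:
        tripped.append("QSTGLOWLMT")

    return tripped


if __name__ == "__main__":
    for used in (80.0, 92.0, 96.0):
        print(used, tripped_monitors(used))
```

With the defaults, a partition that is 92 percent full has tripped only the threshold, while one that is 96 percent full has tripped both monitors.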

You can find more information on setting and using these values in a previous article I wrote called Protecting Your System from Critical Storage Errors. Today’s article supplements that one with additional information about setting QSTGLOWLMT and QSTGLOWACN. These two storage monitoring settings help you identify when problems are starting to occur, but they are more effective when combined with a monitoring and paging program such as Bytware’s Messenger Plus product, which can read system logs and message queues and notify you as a problem is occurring.

Compensating for the Defaults

The big problem with IBM’s critical storage notification system is that the i5/OS defaults fire so late that administrators can do little more than panic when storage starts to fill up. So my first line of defense for monitoring disk usage is to set these values to levels that give me more time to react when a problem occurs. In my shops, I usually set the disk drive threshold value to 85 percent and my QSTGLOWLMT limit value (auxiliary storage lower limit) to 15 percent. This provides additional time to analyze and react to storage usage issues before a crisis appears, which makes it possible to handle these issues proactively as they occur.
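A quick back-of-the-envelope calculation shows what the earlier warning buys. The sketch below assumes a hypothetical partition with 1,000 GB of disk growing by a hypothetical 10 GB per day (neither figure comes from this article) and computes how much space, and how many days, separate the first warning from the 95 percent full danger point.

```python
def warning_headroom_gb(total_gb, warn_pct_used, critical_pct_used=95.0):
    """GB of growth between the first warning and the critical point."""
    return total_gb * (critical_pct_used - warn_pct_used) / 100.0


def days_of_warning(total_gb, warn_pct_used, growth_gb_per_day,
                    critical_pct_used=95.0):
    """Days between the first warning and the critical point,
    assuming steady growth (a simplification)."""
    headroom = warning_headroom_gb(total_gb, warn_pct_used, critical_pct_used)
    return headroom / growth_gb_per_day


# Hypothetical partition size and growth rate, for illustration only.
TOTAL_GB = 1000.0
GROWTH_GB_PER_DAY = 10.0

default_days = days_of_warning(TOTAL_GB, 90.0, GROWTH_GB_PER_DAY)  # 90 percent default
tuned_days = days_of_warning(TOTAL_GB, 85.0, GROWTH_GB_PER_DAY)    # 85 percent setting

print(default_days, tuned_days)  # prints: 5.0 10.0
```

Under these assumed numbers, the default threshold gives about five days of warning and the 85 percent setting gives ten. The default QSTGLOWLMT of 5 percent available gives none at all, because it fires at the danger point itself.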

However, there is one adjustment to consider when setting QSTGLOWLMT to a higher value, such as 15 percent available storage. By default, QSTGLOWLMT is set to 5 percent available storage, which tells the system to detect and react to the most critical situation, where the system can become unstable and spontaneously crash. When you set your auxiliary storage lower limit value higher than 5 percent, you are changing the rules of the game. Now, you are telling the system to look for possible storage overflow errors as the system approaches critical storage, not after it has already passed the 5 percent available storage mark. Depending on the value that you set QSTGLOWLMT to, your critical storage monitoring can change from a reactive situation (looking for storage problems when available storage is less than 5 to 10 percent of all hard drive space) to a proactive situation (where available hard drive space is approaching the critical 5 to 10 percent range but has not yet passed that mark).

So if I change my QSTGLOWLMT value from 5 percent to 15 percent, I should also reconsider which action to assign to my auxiliary storage lower limit action system value (QSTGLOWACN). At this higher value, I should have more time before I hit the crisis point of 5 percent available storage (95 percent full), and therefore I shouldn’t set QSTGLOWACN to any of its panic-mode values, such as automatically ending the system to a restricted state (*ENDSYS) or immediately powering down the system and restarting it (*PWRDWNSYS). Theoretically, I have more time to react and correct the problem at 15 percent available storage (85 percent full) than I have at 5 percent. At higher values (15 percent and above), QSTGLOWACN should only be set to one of the following milder settings, which let me investigate and correct the problem before it gets much worse.

*MSG: Tells the system to send a message to the QSYSOPR message queue.
*CRITMSG: Tells the system to send a message to the first available user specified in the Critical Messages to user list in my partition’s service attributes. For a user to receive a critical message, that user’s name or user class must be entered in the system service attributes by using the Change Service Attributes command (CHGSRVA). In addition to setting up the Critical Messages user list, you also need to set the Analyze problem automatically service attribute (ANZPRBAUTO) to *YES. ANZPRBAUTO can also be set with the CHGSRVA command.
*REGFAC: Submits a job to call any programs that are registered under the auxiliary storage lower limit exit point (QIBM_QWC_QSTGLOWACN). If you set QSTGLOWACN to this option, however, make sure that your exit point programs only perform mild functions that don’t have a significant effect on system processing. These programs can be set up to perform standard maintenance for cleaning up system storage or to monitor disk storage and email various system personnel as it fills up.
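The pairing rule described above can be expressed as a small sanity check. This is a sketch only: the function name is hypothetical, the 15 percent cutoff comes from the guideline in this article rather than from any i5/OS rule, and the only actions it knows about are the five named in the text.

```python
MILD_ACTIONS = {"*MSG", "*CRITMSG", "*REGFAC"}
PANIC_ACTIONS = {"*ENDSYS", "*PWRDWNSYS"}


def check_low_storage_config(lower_limit_pct_avail, action,
                             early_warning_pct=15.0):
    """Flag a panic-mode QSTGLOWACN paired with an early-warning QSTGLOWLMT.

    The 15 percent cutoff is this article's rule of thumb, not a system
    rule. Returns "ok" or a string starting with "warning:".
    """
    action = action.upper()
    if action not in MILD_ACTIONS | PANIC_ACTIONS:
        raise ValueError("action not covered by this sketch: " + action)

    # At an early-warning limit there should still be time to investigate,
    # so ending or powering down the system is more drastic than needed.
    if lower_limit_pct_avail >= early_warning_pct and action in PANIC_ACTIONS:
        return ("warning: " + action + " is a panic-mode action for a limit of "
                + str(lower_limit_pct_avail) + " percent available")
    return "ok"


if __name__ == "__main__":
    print(check_low_storage_config(15.0, "*MSG"))
    print(check_low_storage_config(15.0, "*PWRDWNSYS"))
```

A check like this could be run against the values you plan to set, as a reminder to revisit the action whenever you raise the limit.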

About Our Testing Environment

All configurations described in this article were tested on an i5 550 box running i5/OS V5R3. Most of the commands used here are also available in earlier versions of the i5/OS and OS/400 operating systems, so the configurations should be usable in prior releases. The QSTGLOWLMT and QSTGLOWACN system values are only available in OS/400 V4R2 and later operating systems, including the newer i5/OS V5Rx systems. However, you may notice minor variations in pre-V5R3 copies of these commands. These differences may be due to incremental command improvements that have occurred from release to release.