Admin Alert: Protecting Your System from Critical Storage Errors
April 5, 2006 Joe Hertvik
Like all computers, i5, iSeries, and AS/400 systems are vulnerable to system failures when their disk storage units fill up. i5/OS operating system performance rapidly degrades after system storage reaches 90 percent capacity, and the system can crash–or turn itself off–when storage passes 95 percent. Because of this, it’s wise to understand how your system handles critical storage situations and how it can alert you when disk capacity problems appear.
Critical storage situations can occur anytime system storage approaches and passes 90 percent, and several common situations can put you over the limit. A runaway job can fill up disk storage by constantly adding and deleting records, a programmer can decide to create a new test environment that contains copies of five of your biggest files, or a poorly written SQL statement or query can create incredibly large work files that threaten to overrun hard drive capacity. When these situations occur, it’s important for an i5 administrator to be quickly alerted so that corrective action can be taken before the system crashes.
For critical storage monitoring, the i5/OS V5 operating system, as well as its predecessor OS/400 at V4R5 and below, offers two monitoring tools that do an adequate job of alerting system personnel when a storage crisis is brewing. While these tools are not perfect, they are your best line of defense in detecting a potential crisis before it shuts down your business by shutting down your business system.
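To monitor and react to critical storage situations, the i5/OS operating system offers the following features:
ASP storage thresholds, which warn you when the percentage of used storage in an auxiliary storage pool (ASP) reaches a level that you set.
The QSTGLOWLMT and QSTGLOWACN system values, which define how little free storage may remain and what the system does when that limit is crossed.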
Together, these features provide a two-step warning system that alerts you when storage problems are imminent so that you can take remedial actions before the problem gets out of hand. Here’s how they work.
Monitoring ASP Storage Thresholds
The easiest way to set your partition’s ASP threshold is by using the System Service Tools menu. You can also change your threshold value through the Dedicated Service Tools menu (DST) but DST generally needs to be run from your system console, so if you do not have access to a system console, you will need to use the SST menu. Links to detailed instructions for working with SST and DST can be found in the Related Stories section at the bottom of this article.
For our purposes, let’s look at setting your ASP threshold through SST, which can be run from any 5250 green-screen session. You start SST by executing the Start System Service Tools command (STRSST) without any parameters, as follows:
STRSST
Once started, SST will prompt you to sign in before you can run any of its options. If you have never signed in to SST before or have forgotten the SST Security Officer (QSECOFR) password, check out this article on re-enabling locked out system service tool passwords.
After signing into SST, select the following functions to get to the Select ASP to Change Threshold screen.
When ASP system storage use reaches its threshold percentage, the system will send a CPF0907 message (Serious storage condition may exist. Press HELP) to the QSYSOPR message queue and to the system log. This message will repeat every hour that ASP storage usage remains above its storage threshold. Once storage use goes below its threshold, the message will disappear.
Because the CPF0907 message is sent to the QSYSOPR message queue, it’s important to monitor QSYSOPR on a regular basis. In larger shops, this task is usually delegated to the operations staff. If you don’t have system operators but you own a monitoring and messaging product, such as Bytware‘s MessengerPlus, you can set up a system monitor to alert you by email or pager whenever a CPF0907 message is received by the system.
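To give a rough idea of the filtering a monitoring product performs, here is a minimal Python sketch that picks CPF0907 messages out of a list of message records. The record layout (`msg_id` and `text` keys), the function name `find_storage_alerts`, and the sample data are all hypothetical. The sketch does not read a real QSYSOPR message queue, which would require a platform-specific interface.

```python
from typing import Dict, Iterable, List

# Message ID the system sends when ASP usage reaches its threshold.
STORAGE_THRESHOLD_MSG = "CPF0907"


def find_storage_alerts(messages: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the messages whose ID is CPF0907.

    Each message is assumed (hypothetically) to be a dict with
    'msg_id' and 'text' keys supplied by some queue reader.
    """
    return [m for m in messages if m.get("msg_id") == STORAGE_THRESHOLD_MSG]


if __name__ == "__main__":
    # Made-up sample data; "XXX0000" is a placeholder, not a real message ID.
    sample_queue = [
        {"msg_id": "XXX0000", "text": "Some unrelated message"},
        {"msg_id": "CPF0907", "text": "Serious storage condition may exist. Press HELP."},
    ]
    for alert in find_storage_alerts(sample_queue):
        # A real monitor would send an email or page here instead of printing.
        print(f"ALERT {alert['msg_id']}: {alert['text']}")
```

Because CPF0907 repeats hourly while usage stays above the threshold, a real monitor would probably also want to suppress duplicate alerts, which this sketch does not attempt.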
Monitoring Available ASP Storage
Until OS/400 V4R2, the ASP storage threshold was the only tool that administrators had for monitoring disk usage. With V4R2, IBM added the following two system values that also monitor disk usage.
1. The auxiliary storage lower limit system value (QSTGLOWLMT), which, in contrast to the ASP storage threshold, specifies how low your available storage capacity can go before the system takes action. Available storage capacity is defined as the percentage of free ASP storage space. This means that this system value, and its companion value, QSTGLOWACN, focus on how much storage is still available on your machine, rather than on how much storage is already used. To set up your system properly, your QSTGLOWLMT value should meet the following relationship:
QSTGLOWLMT <= (100 – Storage Threshold)
i5/OS does not check that QSTGLOWLMT satisfies this relationship, but it’s important to note that your ASP storage threshold monitoring will never be activated if the relationship is reversed and QSTGLOWLMT is greater than 100 less the storage threshold.
New i5 systems are shipped with a default QSTGLOWLMT value of 5 percent.
2. The auxiliary storage lower limit action system value (QSTGLOWACN), which specifies what action will be taken when available storage capacity falls below the percentage listed in the QSTGLOWLMT system value. You can set QSTGLOWACN to one of the following values:
*MSG: This option sends a CPI099C message (Critical storage lower limit reached) to the QSYSOPR message queue when your available storage percentage falls below QSTGLOWLMT. It’s worth noting that the system will also send CPI099C messages to QSYSOPR when QSTGLOWACN is set to any other value. By default, QSTGLOWACN is set to *MSG for new i5 systems.
*CRITMSG: This option sends message CPI099B (Critical storage condition exists) as a break message to any user or class of users that are defined in the Critical messages to user parameter (CRITMSGUSR) in your partition’s service attributes. You can view the list of critical message users by executing the Change Service Attributes command (CHGSRVA) like this:
CHGSRVA
And pressing the F4 key to prompt for your service attributes. By default, a partition’s service attributes are set up to send the CPI099B message to almost every class of system users running on an i5, iSeries, or AS/400 partition, including system operators, security officers, security administrators, programmers, and everyday users. Also note that this service attribute is in effect only if the Analyze Problem Automatically service attribute (ANZPRBAUTO) is also set to *YES on your system. You can check and reset your ANZPRBAUTO value from the CHGSRVA command.
*REGFAC: Under this value, the system will automatically call the exit programs that are registered under the QIBM_QWC_QSTGLOWACN exit point. This allows your system to automatically kick off a program to deal with serious storage conditions as they appear.
*ENDSYS: This option immediately ends your partition to a restricted state, where all subsystems and their associated jobs are ended, no new work can enter the system, and the system console is the only active job that can be running. For more information about i5/OS restricted state, see Getting In and Out of Restricted State.
*PWRDWNSYS: Choosing this option tells i5/OS to immediately power down the partition and restart it.
The drill here is clear. First, you use QSTGLOWLMT to specify the lowest percentage of free space that can be available on your system before the system declares that there is a critical storage problem. Then you use QSTGLOWACN to specify the action to be taken when QSTGLOWLMT is breached. If your system’s free storage falls below the QSTGLOWLMT limit, the action in QSTGLOWACN will automatically be taken.
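Since i5/OS does not verify that the two settings agree with each other, it can be handy to check the numbers yourself before changing anything. The short Python sketch below tests the relationship QSTGLOWLMT <= (100 – Storage Threshold) and confirms that an action is one of the five values listed above. The function names are my own and are not part of any IBM interface.

```python
# The five QSTGLOWACN values described in this article.
VALID_ACTIONS = {"*MSG", "*CRITMSG", "*REGFAC", "*ENDSYS", "*PWRDWNSYS"}


def limits_are_consistent(storage_threshold_pct: float, qstglowlmt_pct: float) -> bool:
    """True when QSTGLOWLMT <= (100 - storage threshold).

    storage_threshold_pct is the ASP threshold as a percentage of USED storage;
    qstglowlmt_pct is the lower limit as a percentage of FREE storage.
    """
    return qstglowlmt_pct <= 100.0 - storage_threshold_pct


def is_valid_action(qstglowacn: str) -> bool:
    """True when the action is one of the values listed in the article."""
    return qstglowacn in VALID_ACTIONS


if __name__ == "__main__":
    # A 90 percent threshold with a 5 percent lower limit satisfies the relationship.
    print(limits_are_consistent(90, 5))
    # A 90 percent threshold with a 15 percent lower limit does not.
    print(limits_are_consistent(90, 15))
```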
It’s also important to note that according to IBM documentation, these two values come into play during a system IPL. If the system detects that the amount of free storage space is below the QSTGLOWLMT limit during an IPL and QSTGLOWACN is not set to *MSG, the system will come up in restricted state and it will send the following messages to QSYSOPR and the history log.
CPI099D – System starting in storage restricted state.
At this point, the system console will be the only device available for use on your system. To start the system again, you must increase the amount of free storage space so that it is greater than the QSTGLOWLMT value.
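The IPL rule described above boils down to a single condition. Here it is as a tiny Python sketch. The function name `starts_restricted` is hypothetical, and the logic reflects only what this article says about IPL behavior, not the full IBM documentation.

```python
def starts_restricted(free_pct: float, qstglowlmt_pct: float, qstglowacn: str) -> bool:
    """Model the IPL rule: restricted state when free storage is below
    QSTGLOWLMT and QSTGLOWACN is anything other than *MSG."""
    return free_pct < qstglowlmt_pct and qstglowacn != "*MSG"


if __name__ == "__main__":
    # 3 percent free with a 5 percent limit and *ENDSYS: restricted state.
    print(starts_restricted(3, 5, "*ENDSYS"))
    # Same storage picture, but with *MSG the rule does not apply.
    print(starts_restricted(3, 5, "*MSG"))
```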
Making Everything Work Together
Taken together, your ASP storage threshold levels and the QSTGLOWLMT/QSTGLOWACN system values form a two-step critical storage warning system. The idea is to set your values so that they can work together to give you enough time to react when system storage is nearly overflowing.
For example, if you set your ASP storage threshold to 90 percent and your QSTGLOWLMT value is set to 5 percent, your system will start sending CPF0907 messages to QSYSOPR when your system storage usage first exceeds 90 percent. These messages indicate that you are approaching–but have not yet reached–a critical storage situation and that you still have time to do something about it before things get really bad. It will repeat this message hourly unless your system storage usage retreats back down below the 90 percent threshold level.
If total storage usage doesn’t retreat below 90 percent, the system will initiate a second round of preventative action when the total free space left in your ASP falls below the 5 percent free space level defined in QSTGLOWLMT (95 percent storage used). The system considers this a more desperate storage situation, and it will then kick off the action you specified in QSTGLOWACN. In addition to any other action it takes, the system will also start sending CPI099C messages to the QSYSOPR message queue in order to alert you that a more serious storage condition has appeared. If you specified *REGFAC, *ENDSYS, or *PWRDWNSYS as your QSTGLOWACN action, the system will take action on its own to help correct the problem.
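To see how the two steps fit together, the following Python sketch models the sequence for a given storage usage percentage. It is a simplified paper model of the behavior described in this article, using the 90 percent threshold and 5 percent lower limit from the example as defaults. The function name `storage_events` and the event strings are my own inventions, and nothing here talks to a real system.

```python
from typing import List


def storage_events(
    used_pct: float,
    threshold_pct: float = 90.0,
    qstglowlmt_pct: float = 5.0,
    qstglowacn: str = "*MSG",
) -> List[str]:
    """List the warnings and actions the article describes for a usage level."""
    events: List[str] = []
    free_pct = 100.0 - used_pct

    # Step 1: ASP threshold reached, so CPF0907 goes to QSYSOPR (repeated hourly).
    if used_pct >= threshold_pct:
        events.append("CPF0907 to QSYSOPR")

    # Step 2: free space below QSTGLOWLMT, so CPI099C is sent and,
    # for any setting other than *MSG, the configured action is taken as well.
    if free_pct < qstglowlmt_pct:
        events.append("CPI099C to QSYSOPR")
        if qstglowacn != "*MSG":
            events.append(f"action {qstglowacn}")

    return events


if __name__ == "__main__":
    for used in (85, 92, 96):
        print(used, storage_events(used, qstglowacn="*ENDSYS"))
```

With these defaults, 85 percent used produces no events, 92 percent produces only the CPF0907 warning, and 96 percent produces both messages plus the configured action.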
Easy, But Imperfect
As I noted above, these features are adequate but imperfect tools for monitoring when critical storage situations appear. But if you set your limits properly and are diligent in monitoring for problems, ASP storage thresholds and the QSTGLOWLMT and QSTGLOWACN system values can help you avoid unpleasant storage surprises.