Admin Alert: The Case of the Mysterious CPI0999 Storage Error
Published: May 10, 2006
by Joe Hertvik
Most i5 administrators know that system problems can occur when i5 disk units fill up. System degradation can occur when system storage reaches 90 percent, and the system can crash after storage usage passes 95 percent. But while disk drive capacity problems are fairly straightforward to handle, they are not the only storage problems that can occur on OS/400 boxes.
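The two thresholds above can be written down as a simple check. This is a minimal Python sketch rather than any IBM-supplied tool: the function name and the labels it returns are hypothetical, and the 90 and 95 percent cutoffs are just the rule-of-thumb figures quoted in the previous paragraph.

```python
def classify_storage_usage(percent_used: float) -> str:
    """Map a system storage percentage to a rough risk label.

    Cutoffs follow the article's rule of thumb (an assumption, not an
    IBM-defined limit): degradation can begin at 90 percent, and a crash
    becomes possible once usage passes 95 percent.
    """
    if not 0 <= percent_used <= 100:
        raise ValueError("percent_used must be between 0 and 100")
    if percent_used > 95:
        return "critical"
    if percent_used >= 90:
        return "degraded"
    return "ok"


if __name__ == "__main__":
    # Hypothetical sample readings, for illustration only.
    for sample in (72.5, 91.0, 96.3):
        print(sample, classify_storage_usage(sample))
```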
There is a second storage issue--identified when the system starts issuing CPI0999 error messages--that is not as clearly understood as traditional storage capacity problems. A CPI0999 issue can also create severe performance problems on your system, and it should be addressed as soon as it is discovered. The difficulty lies in identifying when the situation legitimately occurs and what to do about it.
CPI0999 can be issued by an i5/OS V5 system or by an iSeries or AS/400 system running OS/400 V4R5 and below. The message text reads "Storage directory threshold reached." CPI0999 can be triggered in conjunction with a CPF0907 message (Serious storage condition may exist. Press HELP), or it can be sent out as a message entirely by itself.
According to most sources, CPI0999 is issued when one of the following two situations occurs:
- System storage is filling up and hard drive usage has zoomed past its storage threshold values. Here, the system sends out a CPF0907 message in conjunction with CPI0999. In this situation, there's a good chance (90 percent or more) that the CPI0999 message was a byproduct of an accidental run-up in storage usage. Any situations associated with CPI0999 will probably resolve themselves when total storage usage is once again reduced below its threshold levels.
- A disk fragmentation problem occurs, and i5/OS indicates that the system has passed what IBM calls "a point of significant concern." Besides fragmentation, a CPI0999 error may also signal that the system directory contains too many entries and needs to be expanded. These conditions affect system performance, and IBM lists CPI0999 as a potentially serious condition that must be corrected immediately. In addition, a piece of older IBM documentation that I found states that a totally fragmented system may not IPL.
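The distinction between the two situations comes down to whether CPF0907 accompanies CPI0999. Here is a minimal Python sketch of that triage, assuming you have already collected the recent message IDs from the system operator message queue by some means of your own. The function name and the returned labels are hypothetical.

```python
def classify_cpi0999(message_ids):
    """Suggest which of the two situations a batch of message IDs points to.

    message_ids: an iterable of message ID strings gathered elsewhere
    (how they are collected is outside the scope of this sketch).
    """
    ids = {m.strip().upper() for m in message_ids}
    if "CPI0999" not in ids:
        return "none"
    if "CPF0907" in ids:
        # CPI0999 together with CPF0907: likely a storage run-up.
        return "situation 1"
    # CPI0999 on its own: fragmentation or storage directory problem.
    return "situation 2"


if __name__ == "__main__":
    print(classify_cpi0999(["CPF0907", "CPI0999"]))
    print(classify_cpi0999(["CPI0999"]))
```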
In either situation, when storage usage is exceeding its threshold values, it is imperative that you quickly reduce how much storage is used on the system. However, there is a problem. When storage usage is approaching critical mass and threatening to crash your system, there usually isn't time for measured thought. Storage overflow emergencies are usually caused by one or two rogue jobs that are quickly filling up hard drive space, and your job is to pinpoint and correct the problem before it damages the system.
So the first step in either situation is to find any job that is filling up system storage and end it before it ends the system. If the job was filling up temporary storage or work files in QTEMP, ending the rogue job should be enough to stop your storage run-up. If the job was filling up storage by writing excessive records to an application file or copying several large production files to a test environment, you will need to reverse that situation, either by deleting records and reorganizing files or by deleting unnecessary files that caused your storage to peak. For tips on reorganizing files with deleted records, see "Tips for Dealing with Deleted Records in AS/400 Files" in the Related Stories list at the end of this article.
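One way to spot a rogue job is to compare two samples of per-job storage usage taken a few minutes apart and rank the jobs by growth. The Python sketch below illustrates only the comparison. It assumes you can capture per-job storage figures (in megabytes) yourself, and the function name, job names, and numbers are all hypothetical.

```python
def fastest_growing_jobs(before, after, top=3):
    """Rank jobs by storage growth between two samples.

    before, after: dicts mapping a job name to storage used in MB.
    A job missing from `before` is treated as having started at 0 MB.
    Returns up to `top` (job, growth_mb) pairs, largest growth first.
    """
    growth = {}
    for job, used_now in after.items():
        delta = used_now - before.get(job, 0)
        if delta > 0:
            growth[job] = delta
    ranked = sorted(growth.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    first = {"JOB_A": 120, "JOB_B": 40, "JOB_C": 300}
    second = {"JOB_A": 125, "JOB_B": 40, "JOB_C": 4300, "JOB_D": 10}
    for job, grew in fastest_growing_jobs(first, second):
        print(f"{job}: +{grew} MB")
```

A job that dominates the ranking across successive samples is the first candidate to investigate and, if necessary, end.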
While storage recovery works well for a CPF0907 problem (situation 1), the diagnosis is different for a straight CPI0999 scenario (situation 2) because it is a fragmentation problem as well as a disk storage problem. Increasing unused disk space by deleting, clearing, or reorganizing files will help resolve situation 2, because there will be fewer fragmented files filling up your hard drive, but the fragmentation and system directory issues may still be there.
To make matters worse, no version of i5/OS or OS/400 can tell whether a situation 2 issue has been corrected until after an IPL occurs. So the operating system will continue to send you CPI0999 messages every hour until the problem is corrected and the system is IPLed. And if you run your i5/OS box like most shops, where your partitioned system can run for months without re-IPLing, that is a long time to go without knowing whether your problem is solved.
So what can you do if you find yourself in a CPI0999 situation? Here's my checklist for solving this problem and heading off critical issues, such as system degradation or a lack of contiguous storage that could prevent your system from IPLing.
- If you are able to, schedule a system IPL as soon as possible. An IPL may delete a large number of temporary objects that are filling up the storage directory and contributing to disk fragmentation. IBM's documentation on the CPI0999 problem is sketchy, but IPLing your box seems to be one of the more effective ways to solve this problem. It's also the only way to turn off the CPI0999 message, which, as I mentioned before, will be sent out every hour on the hour until an IPL occurs.
- Think about running the Reclaim Storage command (RCLSTG) on your system. RCLSTG attempts to validate and reclaim orphaned, damaged, or incompletely updated objects on your system. It also deletes unusable objects and fragments. The downside of RCLSTG is that it requires your system to be in a restricted state where no work can be done, and, if RCLSTG has not been run in a long time, it may take many hours to run, preventing your system from processing work during that time.
- Take a system inventory and delete or clear out any unnecessary files and members on your system. As I mentioned before, this will increase free disk space that can then be reorganized after an IPL or by the STRDSKRGZ command mentioned in the next point.
- i5/OS also provides the Start Disk Reorganization command (STRDSKRGZ), which allows you to start a disk reorganization function for one or more storage pools on your system. Similar to the Windows defragmentation tool, STRDSKRGZ consolidates unused disk space in order to reduce fragmentation and to allow future disk allocation requests to be performed more efficiently. What's nice about STRDSKRGZ is that it has a Time Limit parameter (TIMLMT) that allows you to run the command only for a set amount of time. Once the time limit is passed, the command finishes its current processing and ends. This allows you to perform a partial reorganization over time, rather than trying to reorganize your entire ASP at one time. The other nice thing about STRDSKRGZ is that it can be run anytime without the system being in a restricted state.
- You can also try running the Reclaim Temporary Storage command (RCLTMPSTG), which automatically reclaims storage used by temporarily decompressed copies of panel groups, menus, display files, and printer files. The kicker here is that the temporary storage is not fully reclaimed until the next time the system is IPLed.
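The time-limited STRDSKRGZ approach in the checklist above lends itself to a scheduled, repeatable command. The following Python sketch only assembles the command string for a given time limit and storage pool. The TIMLMT parameter is named in this article, while the ASP keyword and the minutes-based value are my assumptions about the command's syntax, so prompt the command on your own system before using anything like this. The function name is hypothetical.

```python
def build_strdskrgz(time_limit_minutes: int, asp_number: int = 1) -> str:
    """Assemble a STRDSKRGZ command string for a time-limited run.

    Assumptions (verify by prompting the command on your system):
      - TIMLMT takes a whole number of minutes.
      - The storage pool is selected with an ASP keyword.
    """
    if time_limit_minutes < 1:
        raise ValueError("time limit must be at least 1 minute")
    if asp_number < 1:
        raise ValueError("ASP number must be at least 1")
    return f"STRDSKRGZ TIMLMT({time_limit_minutes}) ASP({asp_number})"


if __name__ == "__main__":
    # For example, a one-hour reorganization window on the system ASP.
    print(build_strdskrgz(60))
```

Submitting a short window like this on several quiet nights gives you the partial reorganization over time that the TIMLMT parameter is designed for.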
If all else fails and you have done everything else to correct this situation short of an IPL, you may just have to run with hourly CPI0999 messages until you can get around to IPLing. Many shops aren't able to IPL frequently, and since it's impossible to know if a straight CPI0999 situation is resolved without an IPL, running with these messages is sometimes the only choice you have.
RELATED STORIES AND ITEMS
Getting In and Out of Restricted State
IBM APAR MA14464 for OS/400 V3R6M0 Dealing with DASD Fragmentation
iSeries Information Center, RCLTMPSTG Command Description
iSeries Information Center, STRDSKRGZ Command Description
Protecting Your System from Critical Storage Errors
Tips for Dealing with Deleted Records in AS/400 Files
Tips on Running RCLSTG