What Should I Monitor For On My IBM i Partition?
March 12, 2014
Hey, Joe:
I want to shore up my IBM i monitoring plan but I need ideas on what to monitor for. I know a problem when it hits me, but I’m not always sure what to look for. What should I be monitoring?
IBM i and Power i system monitoring is an important topic because, like a shipping manager, your number one priority is to keep the production lines running. You need to find problems both before and after they occur. Here are the general areas I believe every IBM i administrator should be watching on a daily basis. In the Related Stories section following these tips, I’ve also listed several other IT Jungle articles I’ve written on the subject for further research. I’ve posted a more comprehensive index of system monitoring articles on my joehertvik.com website that I’ll be adding to as time goes on. All these collected resources should provide a good idea of how to set up and configure IBM i partitions for system monitoring.
Check to see if your partition is available–This sounds obvious, but would you know if your Power i partition went offline? There are a number of good networking packages you can use to monitor whether your partition is up and running. In a high availability environment, you can configure ping tests between your high availability box and your production and development systems. Then set up your monitoring software to issue an alert if three or more pings are lost between the two partitions. That way, you would have two systems in different physical locations watching each other.
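The ping test with a three-lost-pings alert can be sketched in a few lines of Python. This is only an illustration, not a replacement for a monitoring package: the function names `ping_once`, `should_alert`, and `check_partition` are made up for this example, the `ping` flags shown are the Linux ones, and counting lost pings within a batch of five attempts is an assumption about how "three or more pings are lost" would be measured.

```python
import subprocess
from typing import Callable, Iterable


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo with the system ping command; True if it was answered."""
    # "-c 1" (one packet) and "-W" (reply timeout in seconds) are the Linux
    # iputils flags; other platforms spell these options differently.
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def should_alert(results: Iterable[bool], lost_threshold: int = 3) -> bool:
    """Return True when lost_threshold or more pings in the batch were lost."""
    lost = sum(1 for ok in results if not ok)
    return lost >= lost_threshold


def check_partition(
    host: str,
    probe: Callable[[str], bool] = ping_once,
    attempts: int = 5,
    lost_threshold: int = 3,
) -> bool:
    """Probe the host several times and report whether an alert is warranted."""
    results = [probe(host) for _ in range(attempts)]
    return should_alert(results, lost_threshold)
```

Running `check_partition` from the high availability box against the production partition, and again in the opposite direction, gives the two-way watch described above. The `probe` parameter is there so a different availability check can be swapped in without touching the alert logic.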
Environmental issues–Did something happen to the physical environment outside of the IBM i partition? Was there a power outage, network failure, or other outside issue that will affect your business processing? This isn’t strictly a Power i issue but you should check with your network team to ensure someone is watching the network your partition is connected to.
Power i hardware problems–Did a Power i component just fail, such as a disk drive, processor unit, power supply, backplane, Storage Area Network, etc.? When a partition element fails, IBM i always logs a system problem in the Work with Problems (WRKPRB) command screen, and it also sends a message to the QSYSOPR message queue. Set up your monitoring software to look for the following QSYSOPR hardware attention message IDs from message file QCPFMSG in library QSYS. Monitoring for these messages will catch the majority of hardware issues on your system.
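If your monitoring software lets you script the check, matching QSYSOPR messages against a watch list is simple. The Python sketch below assumes the messages have already been pulled off the queue by some other means and handed over as records. The `QueueMessage` type, the `find_watched` function, and the IDs in `WATCHED_IDS` are placeholders for illustration only, not real QCPFMSG message IDs; substitute the IDs you decide to monitor.

```python
from dataclasses import dataclass
from typing import AbstractSet, Iterable, List


@dataclass(frozen=True)
class QueueMessage:
    """One message taken from a message queue (hypothetical record layout)."""
    msg_id: str
    text: str


# Placeholder IDs only. Replace them with the message IDs you want to watch.
WATCHED_IDS: AbstractSet[str] = frozenset({"XXX0001", "XXX0002"})


def find_watched(
    messages: Iterable[QueueMessage],
    watched: AbstractSet[str] = WATCHED_IDS,
) -> List[QueueMessage]:
    """Return the messages whose ID is on the watch list, ignoring case."""
    wanted = {w.upper() for w in watched}
    return [m for m in messages if m.msg_id.upper() in wanted]
```

Keeping the watch list as data rather than hard-coding it in the check makes it easy to maintain separate lists for hardware, power, and application messages.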
Maintenance messages–Did your Power system just send off a message that a battery-powered item, such as a disk cache battery or a UPS, is nearing end of life and should be replaced? Power and battery messages are also included in the QCPFMSG message file in QSYS. You will want to check QCPFMSG for additional power messages to monitor, but some power messages you can monitor for in QSYSOPR include:
Operating system issues–Did the IBM i OS just throw off an operating system error? Sometimes these errors will be caught under the hardware problem messages listed above. They will also show up in the Work with Problems screen and in the QSYSOPR message queue. Check both of these locations every day.
Losing connectivity with key communication partners or servers–Are all your co-processing servers still available? Can you still reach your critical customers? Did a data transfer just fail? You can also set up ping testing for critical co-servers. Communication failures with co-servers will often show up in application errors.
Application issues–Is there a programming inquiry message in QSYSOPR that needs attention? Did a key batch process run longer or shorter than expected? Did a critical job fail to run? Application errors fall into two categories: 1) Hard errors, where the system either cancels a job or puts up an inquiry message that needs answering; or 2) Silent problems, where an application is malfunctioning but isn’t sending out any error messages. Check out the articles in the Related Stories section below for tips on how to monitor for application issues.
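Silent problems are the harder category because nothing announces them. One way to catch a batch job that ran too long, too short, or not at all is to compare its start and end times against an expected runtime window. Here is a rough Python sketch of that comparison. The function name `classify_run` and the status strings are invented for this example, and it assumes you already have the job's start and end timestamps from your job logs or scheduler.

```python
from datetime import datetime, timedelta
from typing import Optional


def classify_run(
    started: Optional[datetime],
    ended: Optional[datetime],
    expected_min: timedelta,
    expected_max: timedelta,
) -> str:
    """Compare one job run against its expected runtime window."""
    if started is None:
        return "did-not-run"      # the critical job never started
    if ended is None:
        return "still-running"    # started but has not finished yet
    elapsed = ended - started
    if elapsed < expected_min:
        return "too-short"        # may have ended early or processed nothing
    if elapsed > expected_max:
        return "too-long"         # may be hung or waiting on a message
    return "ok"
```

A job that normally runs 10 to 30 minutes but finishes in two would come back as "too-short", which is often the only sign that an input file was empty or a step was skipped.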
Backup failures–Is the system hung up because of a media failure or something as simple as waiting for a new tape? For IBM i full system backups (GO SAVE, option 21), the system is placed in restricted state before the backup runs. If a full system backup hangs and the system doesn’t restart, you can check for system availability again with a ping test from another server to make sure the system comes back up at the right time. For nightly backups where the system isn’t restricted but processing may be limited, these errors can generally be trapped through the hardware messages above.