    Admin Alert: Basic i/OS Error Monitoring and Response, Part 1

    January 5, 2011 Joe Hertvik

    I once worked for an insurance company that had a 24x7x365 operations staff. When iSeries error messages occurred, the operations staff handled most problem resolution, and other staff members were rarely alerted. Sadly, that world no longer exists. It now falls to i/OS administrators to detect system errors around the clock and to react quickly enough that production isn’t delayed.

    Toward that goal, this week and next, I will describe a basic game plan for managing i/OS error messages so that all system and programming issues are quickly handled without having dedicated 24x7x365 coverage on site. While most shops may have pieces of this plan already in place, it’s a good idea to periodically review the basic setup and goals of monitoring to ensure you are not missing anything. This plan is also helpful if your organization is cutting down on its iSeries, System i, or Power i staff, because the remaining staff can use these articles as a primer for creating their own 24x7x365 error monitoring and response plan.

    The Basic Structure of i/OS Error Monitoring and Response

    An i/OS error monitoring and response system contains the following components:

    • Detection–Knowing when there’s a problem
    • Notification–Alerting the on-call staff that something is wrong so they can mobilize the proper resources
    • Mobility–Ensuring that on-call staff can handle issues wherever they are
    • Automatic Answering–Configuring the system to automatically handle problems without outside input, wherever possible
    • Resolution–Analyzing the issue to ensure that the problem doesn’t occur again

    Once these components are in place, you’ll have a robust system that doesn’t require someone to constantly be watching the system. Your monitoring goals are threefold:

    1. You must be responsive and quickly attend to the problem
    2. You must maintain system production and availability
    3. You must still be able to get a good night’s sleep every night

    Let’s look at each component in turn.

    Detection: The Root of the System

    You can’t fix a problem if you don’t know it exists. The first question in putting together an automatic error detection and response system is: What do you want to detect?

    There are literally thousands of situations that you may need to handle to keep your system going. As a baseline, you should at least be monitoring for the following inquiry and error messages in each production system’s operator message queue (QSYSOPR).

    • RPG program messages that start with the literal RPG*. RPG messages are defined in the QRPGMSGE message file in the QSYS library.

    • Allocation and I/O errors that start with the RNQ* or RNX* literals, as defined in the QRNXMSG message file.

    • The following general purpose error messages as defined in the QCPFMSG file:

      • CPA0701 – Error message in job stream
      • CPA0702 – An error occurred in a procedure
      • CPA4072 – File full
      • CPA5305 – File member full

    The text for these messages can be viewed in the QCPFMSG message file by entering the following Work with Message Description (WRKMSGD) command from a 5250 green screen.

    WRKMSGD MSGF(QCPFMSG)
    

    • Any QSYSOPR error message that has a severity code of 81 or above. These indicate critical errors on the system.

    • Make sure that your monitoring system ignores any severity 81 or higher error messages that come from the QSPLJOB user. QSPLJOB messages are usually printer messages, such as printer out of paper, check printer alignment, or change form type. During off-hours you want your monitoring system to alert only on critical messages, and you will quickly become inundated if you send out alerts for standard printer messages.

    • Any application specific critical messages that may be issued from your third-party software.

    Other messages can be monitored as needed, but this basic list will take care of a number of issues, including system issues such as hardware errors.
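The baseline list above boils down to a small filtering rule. Here is a minimal sketch of that rule in Python, meant only to make the logic concrete. It is not IBM i code, and the `OperatorMessage` record and the `should_alert` function are hypothetical names invented for this illustration. It assumes you already have some way to read each QSYSOPR message's ID, severity, and sending user.

```python
from dataclasses import dataclass

# Message IDs named in the baseline list (defined in QCPFMSG).
WATCHED_IDS = {"CPA0701", "CPA0702", "CPA4072", "CPA5305"}
# Prefixes for the RPG*, RNQ*, and RNX* message families.
WATCHED_PREFIXES = ("RPG", "RNQ", "RNX")
# Severity at or above this value is treated as critical.
CRITICAL_SEVERITY = 81
# Users whose high-severity messages are skipped (printer traffic).
IGNORED_USERS = {"QSPLJOB"}


@dataclass
class OperatorMessage:
    """Hypothetical record for one QSYSOPR message (not an IBM i API type)."""
    msg_id: str
    severity: int
    user: str
    text: str = ""


def should_alert(msg, extra_ids=frozenset()):
    """Return True if the message matches the baseline monitoring list."""
    msg_id = msg.msg_id.upper()
    # Explicitly listed IDs, plus any application-specific IDs you add.
    if msg_id in WATCHED_IDS or msg_id in extra_ids:
        return True
    # RPG program, allocation, and I/O message families.
    if msg_id.startswith(WATCHED_PREFIXES):
        return True
    # Critical severity, unless the message came from an ignored user.
    if msg.severity >= CRITICAL_SEVERITY and msg.user.upper() not in IGNORED_USERS:
        return True
    return False
```

The `extra_ids` argument is where application-specific critical messages from third-party software would go. Note that the QSPLJOB exclusion applies only to the severity rule, which matches how the list above describes it.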

    The next step is deciding how your system will automatically monitor for messages after your staff has gone home for the night or weekend. You generally have three choices of monitoring software that can be configured to look for these errors.

    1. A third-party package such as Help/Systems’ Robot/ALERT, Bytware MessengerConsole, CCSS QSystem Monitor, Halcyon Software’s IBM i Monitoring, Scheduling & Automation Software, or SEA’s absMessage package. All of these packages provide the capability for monitoring many different types of i/OS messages and for alerting other devices (such as cell phones or email accounts) when an error is found.
    2. iSeries Navigator (OpsNav) provides message monitoring and notification services. See the IBM i and System i Information Center for more information on how to set up message monitoring inside OpsNav.
    3. Custom-written software that monitors the QSYSOPR message queue and sends out messages when it finds an error. You can roll your own solution by dumping all the current QSYSOPR inquiry messages into a printer spooled file, reading that file, and then using the Send Distribution (SNDDST) command to send email alerts out as text messages to on-call support cell phones. For an example of how to dump all your QSYSOPR inquiry messages to a spooled file, see this article on determining which locked object is holding up a job. For an example of how to use the SNDDST command to send emails to users when an error message occurs, check out this article on monitoring whether a specific subsystem is up. For information on sending out text messages as email messages from your i/OS partitions, see this article on configuring messaging software for overnight monitoring.
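If you go the roll-your-own route, the alert itself is just a short email addressed to a carrier's email-to-text gateway. The sketch below shows one way to build such a message in Python rather than with SNDDST, so treat it as an off-platform illustration of the idea and not as the method the linked articles use. The `build_alert` function, the sender address, and the 160-character limit are all assumptions made for this example.

```python
from email.message import EmailMessage

# Assumed length limit for a single text message; carriers vary.
SMS_LIMIT = 160


def build_alert(system, msg_id, text, to_addr,
                from_addr="monitor@example.com"):
    """Build a short plain-text email for an email-to-text gateway.

    All addresses here are placeholders; substitute your own.
    """
    body = f"{system} {msg_id}: {text}"
    # Trim long message text so the alert fits in one text message.
    if len(body) > SMS_LIMIT:
        body = body[:SMS_LIMIT - 3] + "..."
    email = EmailMessage()
    email["From"] = from_addr
    email["To"] = to_addr
    email["Subject"] = f"{system} alert {msg_id}"
    email.set_content(body)
    return email
```

To actually deliver the message you would hand it to a mail server, for example with `smtplib.SMTP(host).send_message(email)`, where `host` is your own mail relay.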

    Notification: Who Gets the Message When?

    In your message monitoring system, you will generally be sending out alerts to mobile device users via text messaging when an error occurs. As mentioned above, it’s easy to send out a text message through email software, and the assumption here is that all your on-call resources are reachable through text messages on their mobile phones. However, there are a number of management issues you should consider as you configure your notification system. Among these issues are:

    • Who are your responders?
    • Do you have a list of defined responders who will be responsible for ensuring all issues are resolved? What compensation are you offering them for responding to off-hours system issues? What procedure will they follow when they receive an alert? What happens if the off-hours responder doesn’t receive the alert because of mobile device issues, such as a dead battery, out of cell phone range, etc.?

    • Do you have a responder rotation?
    • If you have more than one off-hours responder, do you have a published schedule for who’s on duty each night? Is your software set up to follow the schedule and only send to the on-call person or does everyone get the alerts regardless of whether they are on duty that night or not? How do you handle vacations or business travel when the scheduled off-hours responder may not be available?

    • Do the responders have mobile devices they can carry with them and is your company handling the cost of the mobile device?
    • If you’re requiring people to drop everything and answer a call, do they at least have the required company equipment to do the job? Are they aware that they will have to answer the call whenever they are on duty?

    • Do you have a call tree?
    • If the off-hours responder can’t resolve the issue, what escalation procedure should they follow and who should they call? Have you defined your subject matter experts (SMEs) who need to resolve certain types of issues, such as ERP system errors, hardware errors, Web site errors, etc.? What happens if the designated SME isn’t available? A negotiated and published call tree can alleviate many of these issues.

    In my experience, these issues can be just as tricky as the software configuration. You need to carefully define your call trees and ensure that everyone knows what needs to be done in case of a problem. Nobody likes to be on-call during off-hours, so proceed carefully and make sure your responders are taken care of in some way, shape, or form for their trouble.
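The rotation and call tree questions above can be captured as data that your alerting software consults before it sends anything. The following Python sketch shows one possible shape for that data. Every name in it (the responders, the weekly rotation rule, the override table, and the issue types) is hypothetical and would be replaced by whatever your shop negotiates and publishes.

```python
from datetime import date

# Hypothetical rotation: responders take turns by ISO week number.
ROTATION = ["alice", "bob", "carol"]
# Hypothetical overrides for vacations or business travel.
OVERRIDES = {date(2011, 1, 7): "carol"}
# Hypothetical call tree: issue type -> SMEs in escalation order.
CALL_TREE = {
    "erp": ["erp_sme", "erp_backup"],
    "hardware": ["hw_sme"],
}
# Last resort when no SME is defined or available.
DEFAULT_ESCALATION = ["it_manager"]


def on_call(day):
    """Return the responder on duty for the given date."""
    if day in OVERRIDES:
        return OVERRIDES[day]
    week = day.isocalendar()[1]
    return ROTATION[week % len(ROTATION)]


def contact_order(day, issue_type):
    """Return who to alert first, followed by the escalation path."""
    order = [on_call(day)]
    for person in CALL_TREE.get(issue_type, []) + DEFAULT_ESCALATION:
        if person not in order:
            order.append(person)
    return order
```

Calling `contact_order(date(2011, 1, 5), "erp")` returns the on-call responder for that week followed by the ERP SMEs and the default escalation contact. Keeping the schedule in one table means only the person on duty gets the alert, and a vacation is handled by adding one override entry.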

    Still To Come

    Once you have your detection and notification configuration and procedures set up, your basic monitoring system is in place. Next issue, we’ll discuss more advanced topics for enhancing that system and freeing up more responder time, including i/OS techniques for automatically answering error messages, ways to keep your tech resources from being tethered to their computers on nights they are monitoring the system, and things you should do after the problem is resolved.

    RELATED STORIES

    Determining Which Locked Object is Holding Up a Job

    Configuring Messaging Software for Overnight Monitoring

    Monitoring the Monitors




Volume 11, Number 1 -- January 5, 2011