Admin Alert: Monitoring the Monitors
May 21, 2008 Joe Hertvik
Recently, my shop had to create a simple monitoring program to ensure that a critical subsystem was always running. This week, I’ll review the process that we used to create that program and I’ll also show you how to perform basic monitoring by using just a few simple CL commands. With some modification, these concepts can be applied to any situation where you can’t use an off-the-shelf monitoring program.
The Problem I Solved
The system problem this solution addresses reminds me of the old saying about who watches the watchmen. On one of our machines, we use a third-party monitoring, paging, and email package to send out emails whenever a system problem occurs. However, due to some configuration problems after our backups, the machine didn’t always restart the package, so we wanted to set up a second email monitor to ensure that our monitoring software was always up. After all, if your monitoring software isn’t up, you can’t exactly use it to send out email alerts that it’s down, can you? In this case, we needed something to watch the watchman.
Our shop already had SMTP configured on the target partition, so sending the email with the proper commands or APIs wasn’t a big issue. Put another way, you’ll need an active SMTP server and the i5/OS Mail Server Framework (QMSF) to run this program. The programming itself is a matter of using the right tools to monitor the situation and to send out an email when appropriate.
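As a rough sketch, and assuming your user profile has the authority to start servers, both prerequisites can be started from a command line with the standard STRTCPSVR and STRMSF commands. Whether they already start automatically at IPL depends on how your partition is configured, so check before adding these to a startup program.

```cl
/* Start the SMTP server */
STRTCPSVR SERVER(*SMTP)

/* Start the Mail Server Framework (QMSF) jobs */
STRMSF
```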
To determine whether the package was running, we decided to check whether the subsystem that runs it was active. This is relatively easy to do with the Work with Subsystem Jobs (WRKSBSJOB) command, like this:
WRKSBSJOB SBS(subsystem_name)
If the subsystem is up and running, WRKSBSJOB displays all the jobs in the subsystem, similar to what you would see when you run the Work with Active Jobs command (WRKACTJOB) with the subsystem parameter.
However, the big advantage that WRKSBSJOB has over WRKACTJOB is that when you’re using it to check subsystem jobs, it displays the following CPF1003 error if the subsystem isn’t active.
CPF1003 - Subsystem &1 not active
Because the CPF1003 error is generated when you check the status of a subsystem that isn’t active, WRKSBSJOB is a perfect candidate for my simple monitor package. In my utility, I use WRKSBSJOB as the element that checks for a running subsystem, and the program sends out an email when the subsystem is not active.
Anatomy of a Monitoring Program
The name of the monitoring program is MONSBS, and it contains the following code.
PGM

/* Work variables for the system name and the email subject line */
DCL VAR(&SYSTEM) TYPE(*CHAR) LEN(8) VALUE(*BLANKS)
DCL VAR(&SUBJECT) TYPE(*CHAR) LEN(44) +
      VALUE(*BLANKS)
DCL VAR(&SUBJECT1) TYPE(*CHAR) LEN(35) +
      VALUE(' :SBS subsystem_name NOT ACTIVE')

/* CPF1003 is issued if the subsystem is not active */
WRKSBSJOB SBS(subsystem_name)
MONMSG MSGID(CPF1003) EXEC(DO)
  RTVNETA SYSNAME(&SYSTEM)
  CHGVAR VAR(&SUBJECT) VALUE(&SYSTEM *CAT &SUBJECT1)
  SNDDST TYPE(*LMSG) +
         TOINTNET(('firstname.lastname@example.org')) +
         DSTD(&SUBJECT) +
         LONGMSG('PLEASE RESTART THE SUBSYSTEM')
ENDDO

/* Clean up the spooled file produced when running in batch */
DLTSPLF FILE(QPDSPSBJ)
MONMSG MSGID(CPF0000)

ENDPGM
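To make MONSBS work, each piece of code performs a specific function. The program runs WRKSBSJOB against the target subsystem and monitors for the CPF1003 message. If CPF1003 is issued, the program retrieves the system name with the Retrieve Network Attributes (RTVNETA) command and builds an email subject line that reads like this:
System_name :SBS subsystem_name NOT ACTIVE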
Once I have my subject line created, I can send the email. There are many different ways to send an email on an i5/OS system. For this program, I chose to use the Send Distribution (SNDDST) command, which works with my SNADS configuration. To use SNDDST to send an email, I coded the following parameters:
The Internet Recipient (TOINTNET) parameter contains the email address of the user that I want to send my subsystem down message to. If I wanted to, I could enter a list of addresses to receive the distribution. In addition, each address contains a companion value called Recipient Type that specifies whether the message should be sent with the recipient being designated as the primary recipient (*PRI), a copied recipient (*CC), or as a blind copied recipient (*BCC).
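As an illustration of the recipient list, here is a minimal sketch of SNDDST with one primary and one copied recipient. Both addresses and the subject text are placeholders invented for this example, not values from the MONSBS program.

```cl
/* Placeholder addresses: one primary recipient and one copied recipient */
SNDDST TYPE(*LMSG) +
       TOINTNET(('oncall@example.org' *PRI) +
                ('manager@example.org' *CC)) +
       DSTD('SUBSYSTEM NOT ACTIVE') +
       LONGMSG('PLEASE RESTART THE SUBSYSTEM')
```

Each list element pairs an address with its Recipient Type. If the type is left off, as in MONSBS, the address is treated as a primary recipient.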
Once the SNDDST completes, the email message informing my recipient that the target subsystem is down has been sent. In order to use SNDDST, the user running this program must be enrolled in the System Distribution Directory (SDD).
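If the user profile that runs the monitor is not yet enrolled, a directory entry can be added with the Add Directory Entry (ADDDIRE) command. The sketch below uses hypothetical values throughout: MONUSER as the user ID and user profile, SYSNAME1 as the address, and a made-up description. Substitute your own values, and note that other directory or distribution settings may also be needed on your system.

```cl
/* MONUSER, SYSNAME1, and the description are placeholders */
ADDDIRE USRID(MONUSER SYSNAME1) +
        USRD('Subsystem monitor user') +
        USER(MONUSER)
```

You can review existing entries with the Work with Directory Entries (WRKDIRE) command before adding a new one.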
As I noted, you can use any other i5/OS technique to send emails, and you are not limited to using the SNDDST command for this program. You could use the commands that come with many third-party email packages, such as RJS Software Systems’ SMTP/400 product or Gumbo Software’s Gumbo Mail package. You can also send email by using the i5/OS Send MIME Mail API (QtmmSendMail) or the JavaMail components that are shipped as part of the IBM Developer Kit for Java (i5/OS V5R2 or above). You can read about and find links for all of these techniques in the IBM eServer Email manual.
The end of the program performs some house cleaning to delete a QPDSPSBJ spooled file that might be generated when running this program in batch mode. To provide constant monitoring for my target subsystem, I run this program out of Help/Systems’ Robot/SCHEDULE package every half-hour. If you don’t have Robot/SCHEDULE or another job scheduler on your system, you can also use the native i5/OS Work with Job Schedule Entries (WRKJOBSCDE) command to add a job scheduler entry to enable this job to run at certain times during the day.
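For the native scheduler route, entries can also be added directly with the Add Job Schedule Entry (ADDJOBSCDE) command. The sketch below assumes MONSBS is compiled into a placeholder library named MYLIB. Because each entry takes a single scheduled time, half-hourly monitoring with this approach means one entry per run time; only the first two are shown.

```cl
/* MYLIB is a placeholder library; each entry runs once per day */
ADDJOBSCDE JOB(MONSBS0800) CMD(CALL PGM(MYLIB/MONSBS)) +
           FRQ(*WEEKLY) SCDDATE(*NONE) SCDDAY(*ALL) +
           SCDTIME(0800)

ADDJOBSCDE JOB(MONSBS0830) CMD(CALL PGM(MYLIB/MONSBS)) +
           FRQ(*WEEKLY) SCDDATE(*NONE) SCDDAY(*ALL) +
           SCDTIME(0830)
```

Putting the run time in the job name keeps the entries easy to tell apart when you review them with WRKJOBSCDE.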
MONSBS is a nice little skeleton program that can easily be modified to produce emails for a number of situations.
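One possible modification, sketched here under stated assumptions, is to pass the subsystem name in as a parameter so that a single program can watch several subsystems. The &SBS parameter and the way the subject line is assembled are illustrative additions, not part of the original MONSBS, and the sketch assumes that the SBS parameter of WRKSBSJOB accepts a CL variable.

```cl
PGM PARM(&SBS)

/* &SBS is a hypothetical parameter: the subsystem to check */
DCL VAR(&SBS) TYPE(*CHAR) LEN(10)
DCL VAR(&SYSTEM) TYPE(*CHAR) LEN(8)
DCL VAR(&SUBJECT) TYPE(*CHAR) LEN(44)

/* CPF1003 is issued if the subsystem is not active */
WRKSBSJOB SBS(&SBS)
MONMSG MSGID(CPF1003) EXEC(DO)
  RTVNETA SYSNAME(&SYSTEM)
  /* *TCAT trims trailing blanks before joining the pieces */
  CHGVAR VAR(&SUBJECT) VALUE(&SYSTEM *TCAT ' :SBS ' *CAT &SBS +
         *TCAT ' NOT ACTIVE')
  SNDDST TYPE(*LMSG) +
         TOINTNET(('firstname.lastname@example.org')) +
         DSTD(&SUBJECT) +
         LONGMSG('PLEASE RESTART THE SUBSYSTEM')
ENDDO

/* Clean up the spooled file produced when running in batch */
DLTSPLF FILE(QPDSPSBJ)
MONMSG MSGID(CPF0000)

ENDPGM
```

Assuming the program is compiled as MONSBS2 in a placeholder library MYLIB, it could be called as CALL PGM(MYLIB/MONSBS2) PARM('QINTER'), with one scheduled call per subsystem you want to watch.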