Admin Alert: Monitoring the Monitors
Published: May 21, 2008
by Joe Hertvik
Recently, my shop had to create a simple monitoring program to ensure that a critical subsystem was always running. This week, I'll review the process that we used to create that program and I'll also show you how to perform basic monitoring by using just a few simple CL commands. With some modification, these concepts can be applied to any situation where you can't use an off-the-shelf monitoring program.
The Problem I Solved
The system problem this solution addresses reminds me of the old saying about who watches the watchmen. On one of our machines, we use a third-party monitoring, paging, and email package to send out emails whenever a system problem occurs. However, due to some configuration problems after our backups, the machine didn't always restart the package, so we wanted to set up a second email monitor to ensure that our monitoring software was always up. After all, if your monitoring software isn't up, you can't exactly use it to send out email alerts that it's down, can you? In this case, we needed something to watch the watchman.
Our shop already had SMTP configured on the target partition, so sending the email with the proper commands or APIs wasn't a big issue. Put another way, you'll need an active SMTP server and the i5/OS Mail Server Framework (QMSF) to run this program. The programming itself is a matter of using the right tools to monitor the situation and to send out an email when appropriate.
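If you aren't sure whether the mail pieces are active on your partition, the following commands are a minimal sketch of how you might start and check them. It assumes the standard i5/OS commands for the SMTP server and the Mail Server Framework; whether your system already starts them automatically is something you would need to verify.

```cl
/* Start the SMTP server if it is not already active */
STRTCPSVR SERVER(*SMTP)

/* Start the Mail Server Framework (QMSF) jobs */
STRMSF

/* Check for QMSF jobs, which normally run in subsystem QSYSWRK */
WRKACTJOB SBS(QSYSWRK) JOB(QMSF)
```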
To check whether the package was running, we checked whether the subsystem that runs the package was active. This is relatively easy to do by using the Work with Subsystem Jobs (WRKSBSJOB) command, like this:
WRKSBSJOB SBS(subsystem_name)
If the subsystem is up and running, WRKSBSJOB displays all the jobs in the subsystem, similar to what you would see when you run the Work with Active Jobs command (WRKACTJOB) with the subsystem parameter.
However, the big advantage that WRKSBSJOB has over WRKACTJOB is that when you're using it to check subsystem jobs, it displays the following CPF1003 error if the subsystem isn't active.
CPF1003 - Subsystem &1 not active
Because the CPF1003 error is generated when you check the status of a subsystem that isn't active, WRKSBSJOB is a perfect candidate for my simple monitor package. In my utility, WRKSBSJOB is the element that checks for a running subsystem, and the program sends out an email when the subsystem is not active.
Anatomy of a Monitoring Program
The name of the monitoring program is MONSBS, and it contains the following code.
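PGM
DCL VAR(&SYSTEM) TYPE(*CHAR) LEN(8) VALUE(*BLANKS)
DCL VAR(&SUBJECT) TYPE(*CHAR) LEN(44) VALUE(*BLANKS)
DCL VAR(&SUBJECT1) TYPE(*CHAR) LEN(35) +
VALUE(' :SBS subsystem_name NOT ACTIVE')
WRKSBSJOB SBS(subsystem_name) /* Sends CPF1003 if not active */
MONMSG MSGID(CPF1003) EXEC(DO)
RTVNETA SYSNAME(&SYSTEM)
CHGVAR VAR(&SUBJECT) VALUE(&SYSTEM *CAT &SUBJECT1)
SNDDST TYPE(*LMSG) +
TOINTNET((email_address)) DSTD(&SUBJECT) +
LONGMSG('PLEASE RESTART THE SUBSYSTEM')
ENDDO
DLTSPLF FILE(QPDSPSBJ) SPLNBR(*LAST) /* Batch cleanup */
MONMSG MSGID(CPF0000) /* Ignore if no spooled file exists */
ENDPGM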
To make MONSBS work, each piece of code performs the following functions.
- The three Declare CL Variable (DCL) statements create program variables that will be used in the email I'll be sending out when the subsystem isn't up and running.
- The Work with Subsystem Jobs (WRKSBSJOB) command checks to see whether there is a subsystem named subsystem_name up and running on the system. If subsystem jobs are present (running), the program doesn't do anything else and it ends.
- Using the Monitor Message (MONMSG) command, the program monitors for message CPF1003, which is issued when you try to use WRKSBSJOB to display jobs in a subsystem that isn't active. When CPF1003 is detected, the code drops into a DO group whose commands build an email message that will be sent to the target user.
- The Retrieve Network Attributes (RTVNETA) command retrieves the system name and saves it in the &SYSTEM variable. I'm retrieving the system name for my email because I have several Power i/System i systems running at the same time, and I want to differentiate which system the message is coming from.
- The Change Variable (CHGVAR) command builds the subject line of the email in the &SUBJECT variable. It does this by concatenating the system name (as retrieved in the &SYSTEM variable) with the ':SBS subsystem_name not active' literal in the &SUBJECT1 variable, to make the following subject line.
System_name :SBS subsystem_name NOT ACTIVE
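One detail of the concatenation is worth seeing in isolation. The sketch below is not part of MONSBS; it hard-codes a hypothetical five-character system name (PROD1) to show that *CAT keeps the trailing blanks of the 8-character &SYSTEM variable, while *TCAT trims them before joining the two values.

```cl
PGM
DCL VAR(&SYSTEM) TYPE(*CHAR) LEN(8) VALUE('PROD1')
DCL VAR(&SUBJECT) TYPE(*CHAR) LEN(44) VALUE(*BLANKS)
DCL VAR(&SUBJECT1) TYPE(*CHAR) LEN(35) +
      VALUE(' :SBS subsystem_name NOT ACTIVE')

/* *CAT keeps the trailing blanks that pad &SYSTEM to 8 characters */
CHGVAR VAR(&SUBJECT) VALUE(&SYSTEM *CAT &SUBJECT1)
SNDPGMMSG MSG(&SUBJECT)

/* *TCAT removes the trailing blanks of &SYSTEM before joining */
CHGVAR VAR(&SUBJECT) VALUE(&SYSTEM *TCAT &SUBJECT1)
SNDPGMMSG MSG(&SUBJECT)
ENDPGM
```

Called from a command line, the program sends both versions of the subject line back as messages, so you can compare the spacing and pick the operator you prefer.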
Once I have my subject line created, I can send the email. There are many different ways to send an email on an i5/OS system. For this program, I chose to use the Send Distribution (SNDDST) command, which works with my SNADS configuration. To use SNDDST to send an email, I coded the following parameters:
- The Message Type (TYPE) parameter is *LMSG (Long Message), which supports sending a distribution to an email address. This is mandatory for this program.
- The Internet Recipient (TOINTNET) parameter contains the email address of the user that I want to send my subsystem down message to. If I wanted to, I could enter a list of addresses to receive the distribution. In addition, each address has a companion value called Recipient Type that specifies whether the recipient is designated as the primary recipient (*PRI), a copied recipient (*CC), or a blind copied recipient (*BCC).
- The Description (DSTD) parameter will be used as the subject line of the email. This parameter uses the &SUBJECT variable as its value, and the value of &SUBJECT will show up in the email subject line. Note that the DSTD parameter is limited to 44 characters, so the &SUBJECT variable must be no longer than 44 characters.
- Since we are sending this email distribution as a long message type (*LMSG), the body of the email must be contained in the Long Message (LONGMSG) parameter, not in the Message (MSG) parameter. So I put the body of the email message into this parameter.
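To illustrate the recipient list described above, here is a minimal sketch of an SNDDST command that sends to one primary and one copied recipient. The email addresses and the literal subject line are placeholders for illustration only, not values from my program.

```cl
/* Addresses below are hypothetical; substitute your own */
SNDDST TYPE(*LMSG) +
       TOINTNET(('oncall@example.com' *PRI) +
                ('manager@example.com' *CC)) +
       DSTD('SYSTEM1 :SBS subsystem_name NOT ACTIVE') +
       LONGMSG('PLEASE RESTART THE SUBSYSTEM')
```

Each address is paired with its Recipient Type inside its own set of parentheses, and the whole list is wrapped in the TOINTNET parameter.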
Once the SNDDST completes, the email message informing my recipient that the target subsystem is down has been sent. In order to use SNDDST, the user running this program must be enrolled in the System Distribution Directory (SDD).
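If the user profile that runs the monitor isn't enrolled yet, a directory entry can be added with the Add Directory Entry (ADDDIRE) command. The following is a sketch only: MONUSER and SYSTEM1 are hypothetical names, your shop may require additional directory parameters for mail routing, and the command generally needs to be run by someone with security administrator authority.

```cl
/* MONUSER and SYSTEM1 are placeholders, not values from MONSBS */
ADDDIRE USRID(MONUSER SYSTEM1) +
        USRD('Subsystem monitor job user') +
        USER(MONUSER)

/* Review the directory entries afterward */
WRKDIRE
```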
As I noted, you can use any other i5/OS technique to send emails, and you are not limited to using the SNDDST command for this program. You could use the commands that come with many third-party email packages, such as RJS Software Systems' SMTP/400 product or Gumbo Software's Gumbo Mail package. You can also send email by using the i5/OS Send MIME Mail API (QtmmSendMail) or the JavaMail components that are shipped as part of the IBM Developer Kit for Java (i5/OS V5R2 or above). You can read about and find links for all of these techniques in the IBM eServer Email manual.
The end of the program performs some house cleaning to delete a QPDSPSBJ spooled file that might be generated when running this program in batch mode. To provide constant monitoring for my target subsystem, I run this program out of Help/Systems' Robot/SCHEDULE package every half-hour. If you don't have Robot/SCHEDULE or another job scheduler on your system, you can also use the native i5/OS Work with Job Schedule Entries (WRKJOBSCDE) command to add a job scheduler entry to enable this job to run at certain times during the day.
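For shops using the native job scheduler, the entries can also be created directly with the Add Job Schedule Entry (ADDJOBSCDE) command. Because each entry carries a single scheduled time, running every half-hour means adding one entry per run time. The sketch below shows only the first two entries; the library name MYLIB and the job names are assumptions for illustration.

```cl
/* One entry per run time; MYLIB and the job names are hypothetical */
ADDJOBSCDE JOB(MONSBS0800) CMD(CALL PGM(MYLIB/MONSBS)) +
           FRQ(*WEEKLY) SCDDATE(*NONE) SCDDAY(*ALL) SCDTIME(0800)

ADDJOBSCDE JOB(MONSBS0830) CMD(CALL PGM(MYLIB/MONSBS)) +
           FRQ(*WEEKLY) SCDDATE(*NONE) SCDDAY(*ALL) SCDTIME(0830)
```

Repeating the command with the remaining times covers the rest of the day, and WRKJOBSCDE lets you review or change the entries afterward.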
MONSBS is a nice little skeleton program that can easily be modified to produce emails for a number of situations.
Gumbo Software Mail Product Page
Help/Systems' Robot/SCHEDULE Product Page
IBM eServer Email manual
RJS Software Systems' SMTP/400 Product Page