Catching Robot/SCHEDULE Job Failures As They Happen
March 25, 2009
Hey, Joe
We’re using Help/Systems Robot/SCHEDULE version 9.0 to schedule batch jobs on an i5/OS V5R4 machine, and it works great. However, I have a new requirement to alert my staff whenever a Robot job fails to run or terminates unexpectedly. Do you know of any utilities that can monitor and send out email pages when this happens?
Given that Robot/SCHEDULE is one of the most widely used utilities in iSeries, System i, and Power i shops, I’ve run into this problem before and wrote a utility to do exactly what you’re talking about. I’ve run this utility on Robot/SCHEDULE version 9.0 and version 10 installations and it works well under both versions without any changes. I offer my utility here as a service to the i5/OS community without any implied warranty or proof of fitness for your application (end legalese).
The utility uses the Robot/SCHEDULE file named RBTMSG, which records the results of all Robot/SCHEDULE operations. This file is stored in the ROBOTLIB library on my system (it may be stored in another library on your system). RBTMSG contains a lot of information for each job, but to determine when a SCHEDULE job fails, I’m really only interested in four pieces of information: the job name, the severity (status) code, and the time and date of the message.
To avoid interfering with Robot/SCHEDULE processing, I created a work file that only contains these RBTMSG fields. I call this file JHRBTMSG. Here’s the layout.
JHRBTMSG
     A          R RBTMSG
     A            CMRNAM        10A         TEXT('JOB NAME')
     A            CMMSEV         1A         TEXT('SEVERITY')
     A            CMSTIM         7P 0       TEXT('TIME')
     A            CMSDAT         7P 0       TEXT('DATE')
     A          K CMSDAT
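The work file has to exist before the monitoring program is compiled, because DCLF resolves the file at compile time and the CPYF in the program runs with CRTFILE(*NO). A minimal sketch of the two create commands follows. It assumes the DDS is stored as member JHRBTMSG in a source file QGPL/QDDSSRC and the CL as member CHKRBTMSG in QGPL/QCLSRC; both source file names and the program library are hypothetical, so substitute your own.

```cl
/* Sketch only: source file names and the program library are    */
/* assumptions, not taken from the article.                      */

/* Create the work file from the DDS layout shown above          */
CRTPF     FILE(QGPL/JHRBTMSG) SRCFILE(QGPL/QDDSSRC) SRCMBR(JHRBTMSG)

/* Compile the monitoring program once the work file exists      */
CRTCLPGM  PGM(QGPL/CHKRBTMSG) SRCFILE(QGPL/QCLSRC) SRCMBR(CHKRBTMSG)
```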
Once I had my work file, I wrote the following program to alert me whenever a Robot/SCHEDULE job failed with a status of either ‘T’ or ‘E’. I call this program CHKRBTMSG. At the end of the code, you can find an explanation of what each code line does, along with instructions for setting up this program and a necessary supporting program as scheduled jobs in Robot/SCHEDULE.
0001.00 PGM
0002.00   DCLF FILE(QGPL/JHRBTMSG)
0003.00   DCL VAR(&TDATE7) TYPE(*CHAR) LEN(7)
0004.00   DCL VAR(&TDATE7S) TYPE(*DEC) LEN(7 0) VALUE(0)
0005.00   DCL VAR(&TDATE) TYPE(*CHAR) LEN(6)
0006.00   DCL VAR(&TDAYMON) TYPE(*CHAR) LEN(4)
0007.00   DCL VAR(&TDAY) TYPE(*CHAR) LEN(2)
0008.00   DCL VAR(&TMONTH) TYPE(*CHAR) LEN(2)
0009.00   DCL VAR(&TYEAR) TYPE(*CHAR) LEN(2)
0010.00   DCL VAR(&FAILTIME) TYPE(*CHAR) LEN(8) VALUE(' ')
0011.00   DCL VAR(&CMDTIMCH) TYPE(*CHAR) LEN(7) VALUE(' ')
0012.00   DCL VAR(&FAILHR) TYPE(*CHAR) LEN(2) VALUE(' ')
0013.00   DCL VAR(&FAILMIN) TYPE(*CHAR) LEN(2) VALUE(' ')
0014.00   DCL VAR(&FAILSEC) TYPE(*CHAR) LEN(2) VALUE(' ')
0015.00   DCL VAR(&LASTTIME) TYPE(*CHAR) LEN(7) VALUE(' ')
0016.00
0017.00   DCL VAR(&MSG44) TYPE(*CHAR) LEN(44) VALUE(' ')
0018.00
0019.00   CHKOBJ OBJ(QGPL/LASTRBTCHK) OBJTYPE(*DTAARA)
0020.00   MONMSG MSGID(CPF9801) EXEC(DO)
0021.00     CRTDTAARA DTAARA(QGPL/LASTRBTCHK) TYPE(*CHAR) LEN(7) +
0022.00       TEXT('Data area for Robot job failure +
0023.00       testing') /* Data area to hold time of +
0024.00       last terminated object */
0025.00   ENDDO
0026.00
0027.00   RTVDTAARA DTAARA(QGPL/LASTRBTCHK) RTNVAR(&LASTTIME) /* +
0028.00     Last time that a terminated robot job +
0029.00     occurred */
0030.00
0031.00   CPYF FROMFILE(ROBOTLIB/RBTMSG) +
0032.00     TOFILE(QGPL/JHRBTMSG) MBROPT(*REPLACE) +
0033.00     CRTFILE(*NO) INCREL((*IF CMMSEV *EQ 'T') +
0034.00     (*OR CMMSEV *EQ 'E')) FMTOPT(*MAP *DROP) +
0035.00     /* Create a duplicate file of RBTMSG */
0036.00
0037.00   RTVSYSVAL SYSVAL(QDATE) RTNVAR(&TDATE)
0038.00   CHGVAR VAR(&TDAYMON) VALUE(%SST(&TDATE 1 4))
0039.00   CHGVAR VAR(&TYEAR) VALUE(%SST(&TDATE 5 2))
0040.00   CHGVAR VAR(&TDATE7) VALUE('1' *CAT &TYEAR *CAT +
0041.00     &TDAYMON)
0042.00   CHGVAR VAR(&TDATE7S) VALUE(&TDATE7)
0043.00
0044.00 LOOP: RCVF
0045.00   MONMSG MSGID(CPF0864) EXEC(GOTO CMDLBL(ENDPGM))
0046.00
0047.00   IF COND(&CMSDAT *EQ &TDATE7S) THEN(DO)
0048.00     CHGVAR VAR(&CMDTIMCH) VALUE(&CMSTIM)
0049.00     IF COND(&CMDTIMCH *GT &LASTTIME) THEN(DO)
0050.00       CHGDTAARA DTAARA(QGPL/LASTRBTCHK) VALUE(&CMDTIMCH) /* +
0051.00         Put the last time an error occurred in +
0052.00         the data area */
0053.00       CHGVAR VAR(&FAILMIN) VALUE(%SST(&CMDTIMCH 4 2))
0054.00       CHGVAR VAR(&FAILHR) VALUE(%SST(&CMDTIMCH 2 2))
0055.00       CHGVAR VAR(&FAILSEC) VALUE(%SST(&CMDTIMCH 6 2))
0056.00
0057.00       CHGVAR VAR(&FAILTIME) VALUE(&FAILHR *CAT ':' *CAT &FAILMIN +
0058.00         *CAT ':' *CAT &FAILSEC)
0059.00       CHGVAR VAR(&MSG44) VALUE('Robot job' *BCAT +
0060.00         &CMRNAM *BCAT 'failed at' *BCAT &FAILTIME)
0061.00       SNDDST TYPE(*LMSG) +
0062.00         TOINTNET(('firstname.lastname@example.org')) +
0063.00         DSTD(&MSG44) LONGMSG('Please check the +
0064.00         status of this Robot job and take +
0065.00         corrective action to rerun')
0066.00     ENDDO
0067.00   ENDDO
0068.00
0069.00   GOTO CMDLBL(LOOP)
0070.00
0071.00 ENDPGM:
0072.00
0073.00
0074.00
0075.00 ENDPGM
Here’s how the program works.
Line 2 declares that I am using my JHRBTMSG work file to drive this program.
Lines 3-17 declare the work variables that I use throughout the program.
Lines 19-25 check for the existence of a data area called LASTRBTCHK in QGPL. If the program doesn’t find the data area, it creates it. LASTRBTCHK contains the time of day of the last failed SCHEDULE job that was emailed to the staff.
Lines 27-29 retrieve the LASTRBTCHK data area into my program for comparison purposes. The time goes into a variable called &LASTTIME, which denotes the last time today that a Robot/SCHEDULE job failed on the system and the operations staff was alerted.
Lines 31-35 copy all the RBTMSG records that have a status of ‘T’ (abnormal termination) or ‘E’ (error in setup, so job submission failed) into my JHRBTMSG work file.
Lines 37-42 retrieve today’s date and format it into a seven-digit comparison value (a century digit of 1, then the year, month, and day) to be used against the records in the JHRBTMSG file. The substring positions assume that the system date format is MDY.
Lines 44-45 read the JHRBTMSG file and end the program when there are no records left.
Line 47 compares the incoming date on the ‘T’ or ‘E’ record with today’s date. If the job referenced by this record did not run today, the program gets the next record. The program is only interested in failures that happened today.
Lines 48 and 49 convert the time stamp on the incoming record to character form and compare it to the &LASTTIME variable (the time of the last Robot/SCHEDULE failure that was reported to the staff). The program processes the record if the time is greater than the last failure time. If the time is equal to or less than the last failure time, it assumes the failure has already been reported and goes to the next record.
Lines 50-52 update the LASTRBTCHK data area with the time that the latest Robot/SCHEDULE failure occurred (the time of the current record).
Lines 53-67 format the email message with the name of the Robot/SCHEDULE job that failed. This code then sends the message out to its intended recipients by using the Send Distribution (SNDDST) command.
Line 69 continues the loop to retrieve another record.
Lines 71-75 run the end of program procedure.
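One note on the date handling in lines 37-42: slicing QDATE at fixed positions only produces the right comparison value when the system date format is MDY. On such a system, March 25, 2009 comes back as 032509 and becomes 1090325. If your system uses a different format, one possible alternative (my suggestion, not part of the program above) is to let the Convert Date (CVTDAT) command do the work. The sketch below reuses the variable names from the program and drops the &TDAYMON and &TYEAR work fields.

```cl
/* Sketch only: build the CYYMMDD comparison value with CVTDAT   */
/* so the result does not depend on the system date format.      */
PGM
  DCL VAR(&TDATE)   TYPE(*CHAR) LEN(6)
  DCL VAR(&TDATE7)  TYPE(*CHAR) LEN(7)
  DCL VAR(&TDATE7S) TYPE(*DEC)  LEN(7 0) VALUE(0)

  /* Today's date in whatever format QDATFMT specifies           */
  RTVSYSVAL SYSVAL(QDATE) RTNVAR(&TDATE)

  /* Convert from the system format to CYYMMDD, no separators    */
  CVTDAT DATE(&TDATE) TOVAR(&TDATE7) FROMFMT(*SYSVAL) +
           TOFMT(*CYMD) TOSEP(*NONE)

  /* Numeric copy for the comparison against CMSDAT              */
  CHGVAR VAR(&TDATE7S) VALUE(&TDATE7)
ENDPGM
```

In CHKRBTMSG itself, the RTVSYSVAL, CVTDAT, and CHGVAR commands would stand in for lines 37-42; the surrounding PGM and DCL statements are there only to make the sketch complete.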
To run this job and be alerted as soon as possible when a Robot/SCHEDULE job fails, I set up two jobs inside (where else) Robot/SCHEDULE.
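The details of the two jobs are not reproduced here, so what follows is my own reading rather than the documented setup. CHKRBTMSG is presumably one of the jobs, scheduled to run at a short interval. Because LASTRBTCHK holds only a time of day and the program looks only at today's records, the stored time has to be cleared when the date changes; otherwise a failure early the next morning would compare lower than the previous day's last failure and never be reported. A hypothetical supporting program for that purpose, run shortly after midnight, might look like the sketch below. The program logic is an assumption; only the data area name comes from the code above.

```cl
/* Hypothetical supporting program (logic is an assumption, not  */
/* taken from the article): clear the last-reported time so the  */
/* first failure of a new day compares greater than the value in */
/* the data area.                                                */
PGM
  /* If the data area does not exist yet, there is nothing to    */
  /* reset; CHKRBTMSG creates it on its first run.               */
  CHKOBJ OBJ(QGPL/LASTRBTCHK) OBJTYPE(*DTAARA)
  MONMSG MSGID(CPF9801) EXEC(RETURN)

  /* Blank-fill all 7 bytes, the same state as a newly created   */
  /* data area                                                   */
  CHGDTAARA DTAARA(QGPL/LASTRBTCHK) VALUE(' ')
ENDPGM
```

Resetting to blanks rather than zeros keeps a failure stamped exactly at midnight reportable, since any digit compares greater than a blank.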
And that’s my routine for detecting and emailing staff whenever a Robot/SCHEDULE failure occurs. Feel free to modify this code however you see fit. Let me know how it works for you.