Guru: Find The Needle In The Haystack With Watches
February 25, 2019 Dawn May
IBM i does a great job of logging information using messages. Messages are sent to message queues, the history log, as well as the job log. IBM i job logs can be very useful for identifying and resolving application issues. No other system has such a great logging facility for reviewing what happened in a job.
What if you need to debug an intermittent application problem? What if the symptom is a message logged to the job log? And what if there are hundreds of jobs where this message could be sent? There are a variety of ways you can review messages sent to job logs: manually review the job log; use the message handling APIs; or use the QSYS2.JOBLOG_INFO Service. While you can automate things using the APIs or the IBM i service, they both require you to build tooling to identify and capture the desired message. The most serious deficiency with this approach is it is after the fact — the error has already occurred. You may find the messages, but what if you need additional diagnostic data at the time of the error?
The operating system supplies an easy way to identify and take immediate action upon messages sent to a job log with the Watch for Event function, which I casually refer to as watches. Watches are not to be confused with watch conditions used when debugging; they are two different things.
Watches are a very efficient way to watch for the occurrence of a message and be notified when that message is sent to the watched location. Watches are great as a diagnostic tool to help you easily find messages related to possible problems. You can watch for messages sent to message queues, the history log, and as this article reviews, to the job log.
A major advantage of watches is their efficiency. Watches do not read through a job log like the APIs or the IBM i Service. Rather, watches are implemented as an exit point within the Message Handling component in the operating system. Once a watch is set up, the message handling component checks for the watch when a message is sent. It’s a quick check, and if the watch condition is not matched, logic continues as normal. It is only when the watched condition is matched that any overhead occurs. It is possible to watch every job in the system without a performance impact.
Your watch program gets called in a different job, a prestart job named QSCWCHPS running in the QUSRWRK subsystem. The QSCWCHPS prestart job entries are shipped with the system. After the exit program completes, the QSCWCHPS job ends and is not reused.
When the watch condition is matched, the job where the message was sent continues on, unaware that the watch program is running. Any overhead from the watched condition occurs in the logic in your program. If you are trying to debug difficult, intermittent issues, your watch program could automatically take action, such as holding the job, dumping the job, displaying the joblog to *PRINT, servicing and tracing the job, sending an email to someone to manually intervene, or whatever action you desire.
The Start Watch command sets up the message watch. The example below will watch for all SQL messages with a severity of 30 or greater sent to the job log of any QZDASOINIT or QSQSRVR job.</p
STRWCH SSNID(WATCHSQL) WCHPGM(MYLIB/WATCHPGM) CALLWCHPGM(*WCHEVT) WCHMSG((SQL* *NONE *MSGDATA *ALL *GE 30)) WCHMSGQ((*JOBLOG)) WCHJOB((*ALL/*ALL/QZDASOINIT) (*ALL/*ALL/QSQSRVR))
When setting up a watch, you can get much more sophisticated with the WCHMSG criteria. For example, the comparison data can specify that the *FROMPGM or *TOPGM must match, or you can specify data in the message replacement text that must match. The comparison text is case-sensitive and may be quoted to preserve the case.
MYLIB/WATCHPGM is the watch for event exit program that you create. This program is called with several input parameters, one of which is the event data. The event data includes information about the message being watched and the job in which it occurred. This allows you to gather quite a bit of information about the error and allows you to determine programmatically what actions you want to take. IBM has provided two example exit programs. There is one in the IBM i Knowledge Center, Exit Program For Watch For Event, written in C. The IBM Rochester Support Center has an example in CL, STRWCH – Watch Exit Programs Explained with CL Example.
IBM uses watches for the Service Monitor feature of the operating system. You can review these watches with the WRKWCH *SRVMON command. If you look at the details for these watches, you will find at least one watch that is watching all job logs on the system. The example below shows the system is monitoring all job logs for the CPD1144 message, which is “An error occurred with the job log server.” IBM wants a problem log generated with diagnostic data if this error ever occurs.
Message ----Message queue----- -------------Job-------------- Message type Name Library Name User Number CPD1144 *ALL *JOBLOG *ALL *ALL *ALL
Watches also allow you to watch for a Product Activity Log (PAL) entry or a Licensed Internal Code Log (LICLOG) entry. These latter two will probably only be used if you are working with IBM on an intermittent system problem.
Watches were added to the operating system starting with V5R3. In that initial release, they were tied to trace tools: STRTRC, STRCMNTRC, TRCCNN, TRCINT, and TRCTCPAPP. With V5R4, watches became stand-alone. They’ve been around for a very long time. Take advantage of this hidden gem.