Controlling System i Shutdown Activities Using An Intelligent Power-Handling Program, Part II
October 17, 2007 Brian Kelly
Note: The code accompanying this article is available for download here.
For Part I of this article, click here.
Enabling a Power-Handling Program To Control System Activity During a Power Interruption
When the automatic shutdown driven by the QUPSDLYTIM system value does not serve your installation's needs, it is nice to know that there is a better way: an intelligent power-handling program. When the unexpected arrives, as it often does, a more controllable and granular approach to a system shutdown can save the day, or at least save your system so it will run tomorrow. The power-handling program that IBM offers in its Information Center literature is a great place to start learning about and implementing intelligent power handling.
Rather than providing a shell with a few (hopefully intentional) errors, as IBM does to make sure you have thought through the implementation of its code, the code in this article can be implemented after review and modification. Of course, there are no guarantees, just as with the IBM code. The code available with this article demonstrates how you might build an intelligent power-handling system by providing a number of helpful programs and objects that make the implementation easier to perform and understand. Besides all of the ancillary items that make this more achievable, you will also find IBM's power-handling program at the center of this code, a bit fattened up, more complete, and better documented than the shell provided in IBM's Information Center.
This approach works not only with UPS units but also with equipment that presents itself as a UPS, such as the IBM-supplied built-in battery feature found on older models. Depending on your needs, this feature may remove the need for a full UPS. If this is your situation, consider that the limited battery life and the age of this feature may make the QUPSDLYTIM approach just as effective, or perhaps just as ineffective, with this outdated gear. In other words, if you have no UPS and only this older feature, I would recommend that you scrap the old battery backup and its batteries and get a better UPS option. As I recommended in Part I of this article series, the first step is always to make sure the hardware is working as intended before moving to a system value solution or an intelligent power-handling application.
The operating system (i5/OS or OS/400) support is essentially the same for both the battery feature and the uninterruptible power supply attachment. However, not all companies using AS/400 heritage gear will choose the same options for riding out a power failure or for an orderly shutdown. In different environments, for example, you may want to perform different actions when the uninterruptible power supply begins supplying power to the system or when power is fluctuating. A power-handling program is an intelligent way for you to monitor and control these situations. It can be written to perform necessary functions and inform personnel of a critical power situation. Among other things, the power-handling program or programs can perform the following:
To specify that you have chosen to use a power-handling program, after you have created all of the other objects and performed all of the configuration steps to set up the power-handling environment, you would create a message queue and then change the QUPSMSGQ system value to the name of the queue you have created. Once you do this, you have turned on the intelligent power-handling program.
So, before you turn anything on, the best approach is to inch up on implementing the power handling program, which means saving the QUPSMSGQ change until last. Get all other things completed and tested first. Additionally, as we discussed in the hardware section in Part I, under no circumstances should you make any UPS software changes before you have proven that your UPS:
When the intelligent power-handling program is deployed, the system sends the power messages to both QSYSOPR and the queue you specify in the QUPSMSGQ system value. The very last step you need to take is to set the QUPSDLYTIM system value to *NOMAX. This last step assures that the QUPSDLYTIM alarm clock will never go off and will therefore never interfere with your intelligent power-handling program. To set the QUPSDLYTIM system value to *NOMAX:
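Using the same CHGSYSVAL form that appears in Step 11 of the checklist, the change can be sketched as follows. This is a minimal sketch, assuming you are signed on with authority to change system values; as noted above, hold it until everything else has been built and tested.

```cl
/* Disable the automatic delay-time shutdown so that only the     */
/* power-handling program decides when to power down.             */
CHGSYSVAL SYSVAL(QUPSDLYTIM) VALUE(*NOMAX)

/* Display the value to confirm the change took effect.           */
DSPSYSVAL SYSVAL(QUPSDLYTIM)
```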
The IBM Web site has a very descriptive system flowchart as to what happens in various power, UPS, and battery backup scenarios. The IBM chart is shown below:
Determining the amount of battery capacity is a guessing game with a lot of hints but little definitive guidance. The UPS vendor and your IBM Business Partner will give you the information you need to purchase the right amount of capacity for your particular System i model and configuration. Based on the initial capacity of the UPS plus the additional expansion batteries, you can calculate the amount of time that the UPS should hold power before your system must come down. For example, as referenced in Part I of this article last week, reader Elle's Powerware model on a small i5 system is capable of eight minutes of uptime at most when new and fully charged. But she did not get eight minutes. In fact, it appears that she did not get much more than three minutes to close the system down properly.
With the extra battery pack, this same model UPS can have its listed time increased to a maximum of 37 minutes. (Maximum and actual are not necessarily the same.) The actual battery runtime is a function of capacity and age. Believe it or not, there are some IT folks I know who have an aversion to anything electrical or mechanical. They argue with me that it is a good idea to keep your old UPS even when you get a new system. This argument is as good as suggesting there is nothing wrong with installing your old battery in your new car. Even if the battery is fully charged, it may not have 100 percent capacity. A typical UPS battery or set of batteries will lose 20 percent to 50 percent of its rated capacity in four to five years, depending on such things as load and ambient room temperature. Elevated operating temperatures tend to increase the loss of capacity. So, you must check your UPS regularly to see what temperature it is running at and how much capacity it has left.
Additionally, the actual battery runtime after a failure is a function of the discharge load. The more loads the uninterruptible power supply serves (system, console, lights, coffee maker, etc.), the less time it can sustain them. When the battery falls below a specific level of charge, the UPS issues a weak battery condition signal. The weak battery condition signal materializes as message CPI0964, which is sent to the message queue named in the QUPSMSGQ system value.
This weak battery condition signal from the uninterruptible power supply affects the shutdown mechanisms. If utility power fails during this condition, the system more than likely will begin an immediate power-down. This is not a good situation even if you have coded your program properly. This condition means that your batteries were not charged enough at initial power failure time to cover the time that you allowed for recovery. A well-written power-handling program cannot make up for a lack of battery power.
It is always better to be safe than to be sorry. If your UPS has about 37 minutes to hold power, write your program as if 10 or 15 or 20 minutes is the max. A certain amount of additional time is needed if your program actually must do a power-down. The power-down is not instantaneous, and in most cases will be at least five minutes. Moreover, the power-down needs power to perform the necessary shutdown tasks. The PWRDWNSYS command first saves main storage (the pages that have not been written) before it performs its power-down. The amount of time to do this is not an exact number.
Memory save time during power-down depends on the number of changed virtual pages in main storage that have not yet been written to disk. The number of disk arms available is also a factor. The more disk arms, the faster the system can write main storage to disk. The system power-down also depends on the number of jobs and the amount of time it takes to end them all. Typically a job will be close to an instruction boundary at which it can be terminated. However, some instructions are long-running and take longer. In IBM's worst-case formula for saving pages, a small four-drive system can save about 4 GB of main storage (assuming all pages need to be written) in about a minute. However, in my own test, a system of the same size took four minutes to come down from a restricted state. The bottom line is that you want to make sure you have that one minute or five minutes, because even a one-second deficiency is the difference between a gentle shutdown and a crash.
Implementing a Power-Handling Program
Once you begin the process of implementing an intelligent power-handling program for your UPS and System i, you must follow each direction carefully. You don’t want to cause a power-down crash while you are trying to prevent one from ever happening. The following to-do list can be very helpful in assuring a successful intelligent power-handling program implementation on your AS/400 or i5. Let’s assume that you have upgraded your UPS, checked it out, and are sure that given perfect conditions your UPS will give you 37 minutes of power once the power company has stopped doing the same. This would be your UPS with the extra battery pack included.
This set of steps starts by assuming that QCTL is the controlling subsystem (though yours may be QBASE or some other name). If your controlling subsystem is not QCTL, then substitute its name wherever QCTL appears, and where we use QCTL2 as a test subsystem, likewise append a 2 to the end of your subsystem's name.
Remember when you use any of this code that the original authors of the intelligent power-handling program (IBM as well as yours truly) offer no warranty, expressed or implied, for the code provided free of charge in this package. So be advised to use this code at your own risk. My clients have had success with this code package, but your results may not be the same. Take these actions in this order with due caution. Before you even begin the checklist of to-dos, examine the QCTLSBSD system value. This names your controlling subsystem. Make a note of it after you find out what it is. To get your controlling subsystem's name, execute the following command:
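The QCTLSBSD system value can be displayed with the standard DSPSYSVAL command. As a minimal sketch:

```cl
/* Shows the controlling subsystem description and its library,   */
/* for example QCTL in QSYS or QBASE in QSYS.                     */
DSPSYSVAL SYSVAL(QCTLSBSD)
```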
Now, let’s begin creating the objects you need to make this work.
Step 1 Because a power-handling program is critical to the normal recovery of computer operations, all objects used by the power-handling program should be isolated in their own library so they can be reasonably secured from the potential mishandling of other users. Therefore, you should create a special library (UPSLIB) for the UPS code as follows:
CRTLIB LIB(UPSLIB) AUT(*EXCLUDE) CRTAUT(*EXCLUDE)
Step 2 The power-handling program requires exclusive use of the defined message queue to which the UPS will talk. For this purpose, create a unique message queue (UPSMSGQ) and exclude its use from all other users and general system use, as follows:
CRTMSGQ MSGQ(UPSLIB/UPSMSGQ) AUT(*EXCLUDE)
Step 3 The next step is to name the power-handling program UPSPGM. Then type in the program or use the code I provide with your own modifications, compile it, and exclude its use from all other users, as follows:
STRSEU SRCFILE(UPSLIB/QCLSRC) SRCMBR(UPSPGM)
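The STRSEU command above only opens the editor. A hedged sketch of the surrounding commands follows; it assumes the source file UPSLIB/QCLSRC does not exist yet and that you compile with CRTCLPGM. Adjust it if your shop uses a different source file or compile command.

```cl
/* Run once, BEFORE the STRSEU command: create the source file    */
/* that will hold the UPSPGM member.                              */
CRTSRCPF FILE(UPSLIB/QCLSRC) TEXT('UPS power-handling source')

/* After entering and reviewing the source: compile the program   */
/* and exclude it from public use.                                */
CRTCLPGM PGM(UPSLIB/UPSPGM) SRCFILE(UPSLIB/QCLSRC) +
         SRCMBR(UPSPGM) AUT(*EXCLUDE)
```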
Step 4 Create a user profile for the power-handling system to use. For your UPS environment to work, you need the power-handling programs to be started automatically whenever the system or the controlling subsystem is started. For you to be able to set this up to automatically run, you need a few more objects created including a user profile, a job queue, and a job description. These can be created as follows:
CRTUSRPRF USRPRF(UPSPROFILE) PASSWORD(*NONE) +
          USRCLS(*SECOFR) +
          TEXT('UPS program profile (needs security authority)')
CRTJOBQ JOBQ(UPSLIB/UPSJOBQ) +
        TEXT('Job Queue to Launch UPS Power handling program') +
        AUT(*EXCLUDE)
CRTJOBD JOBD(UPSLIB/UPSJOBD) JOBQ(UPSLIB/UPSJOBQ) +
        JOBPTY(1) RQSDTA('CALL UPSLIB/UPSPGM') +
        AUT(*EXCLUDE) USER(UPSPROFILE)
Note: You must provide a user profile as shown above to use the job description as an auto-start job.
Step 5 The next step is to add the newly created job queue, UPSJOBQ in UPSLIB, to the controlling subsystem (QCTL, or QBASE in a strict S/36 environment shop). Make sure sequence number 990 is available. If it is not, you will get an error message and can change it to 991.
ADDJOBQE SBSD(QCTL) JOBQ(UPSLIB/UPSJOBQ) SEQNBR(990)
Note: IBM suggests that for testing purposes you should create a duplicate of the controlling subsystem and do your original testing using the duplicate subsystem. In the implementations that I managed, the client always chose to use the real QCTL subsystem description and tested off hours very carefully.
In case you want to use the IBM technique, I have included optional Steps 6 and 7 below. You do not need Steps 6 and 7 if you are not creating a duplicate of the controlling subsystem. Eventually you must be running in the real controlling subsystem, so this postponement can actually make the implementation more confusing. Therefore, I would recommend being very careful in the real subsystem rather than making the duplicate and implementing in a test environment. However, because experience is the best teacher and creating duplicates does give you additional familiarity, Steps 6 through 8 can serve as a guide to this approach.
Optional Step 6 Make a copy of the current controlling subsystem description, as follows:
CRTDUPOBJ OBJ(QCTL) FROMLIB(QSYS) OBJTYPE(*SBSD) TOLIB(QSYS) NEWOBJ(QCTL2)
Note: The IBM startup program for systems using QCTL checks for QCTL (not QCTL2) as the controlling subsystem in either QGPL or in QSYS. If this condition exists, it starts the rest of the system at startup. So, if you choose to use the IBM method for testing the system after a power-down using this approach, it will not start anything other than QCTL2 subsystem if you have made QCTL2 the controlling subsystem. It’s your call. Even if you choose to implement with the QCTL subsystem vs. the QCTL duplicate, it is a good idea to print and to copy the QCTL subsystem just in case you mess something up in the process of implementation.
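One way to print and copy the QCTL subsystem description before touching it is sketched below. The backup name QCTLSAVE and the choice of UPSLIB as the target library are hypothetical; use whatever naming your shop prefers.

```cl
/* Print the current subsystem description for your records.      */
DSPSBSD SBSD(QSYS/QCTL) OUTPUT(*PRINT)

/* Keep an untouched copy to fall back on.                        */
/* QCTLSAVE is an illustrative name, not part of the package.     */
CRTDUPOBJ OBJ(QCTL) FROMLIB(QSYS) OBJTYPE(*SBSD) +
          TOLIB(UPSLIB) NEWOBJ(QCTLSAVE)
```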
There is the possibility that an errant auto start job entry in the controlling subsystem can wreak havoc with your machine at startup. For example, there is the possibility that if the UPSPGM (the UPS power-handling program) fails and causes the system to power-down every time it comes up, your system will never come up for productive use. To correct such a problem, you would have to perform drastic measures such as using the system service tools or, even worse, you might have to perform a reload or partial reload of the operating system. So, however you choose to do this, be careful!
You do not have to modify the startup program to start all subsystems, because when you change the controlling subsystem to QCTL2, the system will bring up only the controlling subsystem and nothing else. This does have its advantages in that the test environment is less cluttered. You should check the system value QSTRUPPGM for the name and library of the startup program (typically QSTRUP in QSYS). Since this program can be retrieved and changed, take a look at the source so you know what it does. Do not change it unless you have some other reason.
In IBM's recommended scenario, if you do not modify the startup program (repeated for the third time), it will not check for QCTL2 in QSYS or QGPL, and thus the startup program will end without starting the rest of your subsystems. Then, since you are about to add a second auto-start job entry in QCTL2, UPSLIB/UPSPGM will start in the QCTL2 controlled environment. After the next power-down and restart, you can begin testing the power program by pulling the UPS plug from the wall outlet for durations that meet your testing criteria, but not yet.
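To look at the startup program without changing it, a sketch along these lines may help. It assumes the startup program is the shipped QSYS/QSTRUP and that the source file UPSLIB/QCLSRC already exists; substitute whatever DSPSYSVAL reports on your system.

```cl
/* Find the name and library of the startup program.              */
DSPSYSVAL SYSVAL(QSTRUPPGM)

/* Retrieve its CL source into a member for review only.          */
RTVCLSRC PGM(QSYS/QSTRUP) SRCFILE(UPSLIB/QCLSRC) SRCMBR(QSTRUP)
```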
Step 7 In IBM’s recommended approach, you are to change the controlling subsystem to QCTL2. This is accomplished as shown below:
CHGSYSVAL SYSVAL(QCTLSBSD) VALUE('QCTL2')
If you are not using the duplicate subsystem, then do not perform this task.
Step 8 Add the auto start job entry to the controlling subsystem description, as follows:
ADDAJE SBSD(QSYS/QCTL) JOB(UPSJOB) JOBD(UPSLIB/UPSJOBD)
If you have made a duplicate, then type a "2" after QCTL.
Step 9 Review, enter, modify to suit your needs, and compile all of the ancillary programs shown below which are needed to complement the power handling program. Type in the DDS for the UPSLOG file for error logging. Assure that all of the programs used in the UPSPGM are compiled cleanly.
Step 10 Test the programs and objects before testing the UPS. Once you have all the objects built, it’s time to see how the message queue behaves before you go live. Start your UPS program (UPSPGM) manually. To test it without messing with power, first examine the UPSTEST program source in this package to see what it does and how it does it. It is fairly simple but it can exercise your code and you can modify it to simulate just about all of the messages and timings in the system to assure that it all works prior to trying this in a test with a live UPS. When you have checked out the UPSTEST program, fire it up and watch it send MSGID(CPF1816) (Power Failure) to the UPSMSGQ. Your UPSPGM version should behave just as if a real power message arrived. If you are logging, you will see the message in the log. After a short delay, the program will send a MSGID(CPF1817) to the queue to simulate the end of the power outage. At this point you can feel good about your code and be prepared to go into UPS testing or you can test all the UPSPGM events by adding them to the UPSTEST program.
In the code attached at the beginning of this article, I have provided a more detailed testing program called TUPSTEST that is self documented. Additionally, I have provided a more sophisticated, more functional UPSPGM called TUPSPGM. These are worth examining before you implement. This test code also demonstrates how to use additional logging to help you know what is happening every step of the way.
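For readers who want to see the idea before opening the package, here is a minimal stand-in for such a test driver. It is not the author's UPSTEST or TUPSTEST; it only assumes the UPSLIB/UPSMSGQ queue from Step 2, a 15-second delay chosen to fit inside the 30-second first timer, and a caller with authority to the queue.

```cl
PGM
   /* Simulate a utility power failure by sending the same        */
   /* message ID the system would send to the UPS message queue.  */
   SNDPGMMSG  MSGID(CPF1816) MSGF(QSYS/QCPFMSG) +
              TOMSGQ(UPSLIB/UPSMSGQ)

   /* Wait a short time; adjust DLY to exercise your timers.      */
   DLYJOB     DLY(15)

   /* Simulate utility power being restored.                      */
   SNDPGMMSG  MSGID(CPF1817) MSGF(QSYS/QCPFMSG) +
              TOMSGQ(UPSLIB/UPSMSGQ)
ENDPGM
```

Lengthening the delay past the first timer lets you watch the shutdown-preparation path, so do that only when a real power-down is acceptable.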
Step 11 After you have exercised the test programs adequately without having to power down, it is time to test with the UPS. Make sure you have taken all of the necessary precautions, such as shutting down all unnecessary subsystems. Unfortunately, you cannot test in a restricted state because the UPSPGM is submitted via a job queue entry and batch jobs will not run in QCTL in a restricted state.
The best way to test is to make the changes to the controlling subsystem and the system values and perform a power down and restart. To start QCTL2 or QCTL as the controlling subsystem for the test, and to bring up the power-handling program at the same time, make the following changes to the UPS system values and then use Step 12 to power down. Remember, you already changed the QCTLSBSD value to QCTL2 or you left it at QCTL so the system is ready for the additional change. The following commands change the system values to allow the program to handle a power outage, as follows:
CHGSYSVAL SYSVAL(QUPSMSGQ) VALUE('UPSMSGQ UPSLIB')
CHGSYSVAL SYSVAL(QUPSDLYTIM) VALUE(*NOMAX)
Step 12 Perform an IPL of the system so that the new controlling subsystem description and your new power-handling program take effect, as follows:
PWRDWNSYS OPTION(*IMMED) RESTART(*YES)
Step 13 After fully understanding the UPSPGM and the ancillary programs it calls to provide additional functions, begin testing the UPS program in a controlled environment, either QCTL2 or QCTL.
The power-handling program should be activated (loaded into memory) at each IPL and remain active at all times. It should be accounted for in the activity level available in your work management subsystem specifications for QCTL or QBASE.
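After the IPL, a few standard commands can help confirm that the program really is active and holding the queue. This is a sketch that assumes the object names from the earlier steps and QCTL as the controlling subsystem.

```cl
/* Confirm the system value points at the UPS queue.              */
DSPSYSVAL SYSVAL(QUPSMSGQ)

/* The autostart job (UPSJOB) should hold an *EXCL lock here.     */
WRKOBJLCK OBJ(UPSLIB/UPSMSGQ) OBJTYPE(*MSGQ)

/* The job should appear among the controlling subsystem's jobs.  */
WRKACTJOB SBS(QCTL)
```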
The message queue that is specified in the QUPSMSGQ system value (UPSLIB/UPSMSGQ) is used for the uninterruptible power supply message processing. You can characterize it as the UPS talking to the message queue. So if your program can hear the message queue, it knows what the UPS is saying and the program can then take appropriate action. For QUPSDLYTIM to be set to *NOMAX, a program must allocate the UPSLIB/UPSMSGQ queue that was created earlier. The CL program named UPSPGM is launched via the autostart job entry in QCTL to run in QCTL and to converse with the UPS. Once launched, it allocates the queue and controls system operations based on what it "hears" from the queue. The command that allocates the queue looks like this:
ALCOBJ OBJ((xxx/yyy *MSGQ *EXCL))
This code is in the UPSPGM program. IBM suggests that just a few messages need to be tested for all of this to work fine. There are other messages that the UPS may send but, by and large, they are not of concern to this program. For your own enhancements to this code down the road, you will want to understand all messages, particularly:
As you will see, the UPSPGM program supplies the necessary checks for CPF1816 and CPF1817. Add any additional checks as you see fit. You may choose to ignore the other messages. The UPSPGM program is written to handle a brief power interruption without doing any unique processing. For example, when the CPF1816 message arrives indicating a power failure, the program sets a switch that indicates that the message occurred. The program then performs a RCVMSG with WAIT(30) to cause a message time-out in 30 seconds. The IBM default time is 10 seconds. This is to ride out short power failures. You make the call as to what to change this to before implementing.
In the example shown, it is set to 30 seconds to permit the system power to have an additional 20 seconds to be restored before moving on to preliminary shut-down type activities. If the CPF1817 message indicating that outlet power has been restored is received before the 30-second time-out occurs, the program resets the switch and performs no other action. In this case, the program and the UPS have saved the day.
If, however, the 30 seconds pass and power has not been restored, the safe bet is to begin an orderly shutdown so that most if not all work can be completed (other than batch jobs that may continue to run until stopped). The UPSPGM program calls a program named UPSEND to accomplish this. Together, the UPSPGM and UPSEND programs prepare the system for the normal power-down that will be needed if power is not restored after a brief time period.
Changing the System Power Down Options
You have control of what goes into the UPSEND program besides what is already there. For example, if you have remote workstations that are still active, you may want to send them a message requesting that they sign off quickly. You may want to send yourself a message in case you are in the building or signed on remotely. You may want to issue ENDSBS OPTION(*CNTRLD) for your interactive subsystems to prevent new workstations from signing on, or for your batch subsystems to prevent new batch work from beginning. The UPSEND program code in this package does supply this function. If you have batch jobs running, and you are notified via message, you may want to perform a WRKACTJOB and place a "4" next to specific jobs for a controlled cancel, or you may want to end them with a command in the following format:
ENDJOB JOB(123456/ANYUSER/ACTIVJOB) OPTION(*CNTRLD)
Be careful that these activities do not waste time. The ENDJOB operation sets an indicator to end the job. If the program does not end itself, the default on ENDJOB of 30 seconds is used. To determine which jobs were active when your system experienced the outage, perform a WRKACTJOB OUTPUT(*PRINT) in UPSEND when you are about to power down, and this will give you the record that you need. You can also add code to CPYSPLF the WRKACTJOB spool file to a database file and then check which programs are alive. You then scan for specific programs and have the program end them accordingly using the ENDJOB command (sample listed above). You may find this useful.
There are two timers in the UPSPGM. After the first timer expires, the program starts a second timer based on the IBM model program. This second timer performs a RCVMSG WAIT(600). Make sure your UPS can endure for 10 minutes, or reduce this value. Increase it if you want to wait longer to shut down. After 10 minutes, this is clearly a long power outage and not a lightning blip. If no message arrives, the program issues the PWRDWNSYS OPTION(*IMMED) command. The 10 minutes in this sample code should be adjusted based on your anticipated battery time and the time required for a power-down. Since this program allocates a message queue for the QUPSMSGQ system value and *NOMAX is set for the QUPSDLYTIM system value, the following conditions apply:
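The kinds of actions described above can be sketched as a small CL program. This is a hypothetical stand-in, not the UPSEND shipped with the article: the subsystem names QINTER and QBATCH, the 120-second delay, and the message wording are all assumptions to replace with your own.

```cl
PGM
   /* Ask signed-on workstation users to finish up and sign off.  */
   SNDBRKMSG  MSG('Power failure: system is on battery. +
              Please sign off now.') TOMSGQ(*ALLWS)
   MONMSG     MSGID(CPF0000)

   /* Keep a printed record of what was running at the outage.    */
   WRKACTJOB  OUTPUT(*PRINT)
   MONMSG     MSGID(CPF0000)

   /* Stop new interactive and batch work from starting.          */
   /* A subsystem that is not active raises an escape message,    */
   /* which the MONMSG ignores so the next step still runs.       */
   ENDSBS     SBS(QINTER) OPTION(*CNTRLD) DELAY(120)
   MONMSG     MSGID(CPF0000)
   ENDSBS     SBS(QBATCH) OPTION(*CNTRLD) DELAY(120)
   MONMSG     MSGID(CPF0000)
ENDPGM
```

Each command is followed by its own MONMSG so that a failure in one step does not stop the rest of the preparation, since every second counts while the system is on battery.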
These conditions must be met for the system to believe that you have a power-handling program. In the sample code, the conditions are met. The UPSPGM program in the UPSLIB library allocates the queue at program start up. Be sure that no other program allocates this queue. Otherwise, the system will be powered down immediately. Make sure you review the comments in the code before you begin. One caveat: Do not place the system in a restricted mode from a CL program running from a job queue. The command will die in the batch program. That’s why the UPSEND program shuts down selected subsystems. It cannot put the system in restricted state (though that would be nice). At this point, you should be all set to start.
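To make the two-timer flow concrete, here is a simplified sketch of the message loop. It is not the UPSPGM from the package: it hard-codes UPSLIB/UPSMSGQ, uses labels where the article's program uses a switch, does no logging, and does not restart subsystems that UPSEND ended if power returns during the second timer. Note also that in this simplified form any unrelated message restarts the current wait.

```cl
PGM
   DCL        VAR(&MSGID) TYPE(*CHAR) LEN(7)

   /* The queue must be held exclusively by this program.         */
   ALCOBJ     OBJ((UPSLIB/UPSMSGQ *MSGQ *EXCL))

   /* Idle state: wait indefinitely for a power failure.          */
WAITMSG:
   RCVMSG     MSGQ(UPSLIB/UPSMSGQ) WAIT(*MAX) RMV(*YES) +
              MSGID(&MSGID)
   IF         COND(&MSGID *NE 'CPF1816') THEN(GOTO CMDLBL(WAITMSG))

   /* First timer: ride out a short interruption (30 seconds).    */
TIMER1:
   RCVMSG     MSGQ(UPSLIB/UPSMSGQ) WAIT(30) RMV(*YES) +
              MSGID(&MSGID)
   IF         COND(&MSGID *EQ 'CPF1817') THEN(GOTO CMDLBL(WAITMSG))
   /* A non-blank ID means some other message arrived; wait again.*/
   IF         COND(&MSGID *NE ' ') THEN(GOTO CMDLBL(TIMER1))

   /* Timed out with no power: prepare for a possible shutdown.   */
   CALL       PGM(UPSLIB/UPSEND)

   /* Second timer: give power 10 more minutes to return.         */
TIMER2:
   RCVMSG     MSGQ(UPSLIB/UPSMSGQ) WAIT(600) RMV(*YES) +
              MSGID(&MSGID)
   IF         COND(&MSGID *EQ 'CPF1817') THEN(GOTO CMDLBL(WAITMSG))
   IF         COND(&MSGID *NE ' ') THEN(GOTO CMDLBL(TIMER2))

   /* Still no power: bring the system down while battery remains.*/
   PWRDWNSYS  OPTION(*IMMED)
ENDPGM
```

RCVMSG returns a blank message ID when the wait expires with nothing on the queue, which is how the sketch tells a time-out apart from an arriving message.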
Brian Kelly is an assistant professor in the Business Information Technology program at Marywood University, where he also serves as the System i technical advisor to the IT faculty. Kelly has developed and taught a number of college courses in the IT and business areas and runs an IT consultancy, Kelly Consulting. He is the author of 27 books and he has written numerous magazine and newsletter articles about current IT topics.