Admin Alert: A Checklist For Performing IBM i Planned Maintenance
November 14, 2012 Joe Hertvik
Taking down an IBM i system for planned system maintenance involves more than just performing the maintenance itself. To keep processing running without issue, you need to ensure that the machine ends cleanly before maintenance begins and that it restarts smoothly after maintenance completes. To that end, here’s a starter checklist for how to take down and restart an IBM i machine during planned system maintenance.
The Phases Of System Maintenance
When planning IBM i system maintenance, it’s helpful to divide the maintenance into three sets of tasks that you must successfully complete: pre-maintenance tasks, the system maintenance tasks themselves, and post-maintenance tasks.
This article contains a starter checklist of items to complete these tasks. This checklist contains the most common tasks I’ve run into when performing IBM i maintenance. Note that you may need to add more items to this checklist to accommodate tasks that are specific to your shop.
1. Determine how long the machine will be down and whether you want to perform any backups while the machine is unavailable–Before you can schedule downtime, you’ll need a timetable for how long the outage will last. The timetable helps you inform others about how long the box will be out of commission. The timetable should contain both pre- and post-maintenance tasks, not just the amount of time it will take to perform the maintenance tasks.
As you’re doing this, consider whether you’ll want to perform a full system backup while the machine is down. If you’re working on a production machine that’s usually running 24×7, you don’t get too many chances to perform a full system backup. So it might be worthwhile to grab a full system backup while the machine is already down, if you can squeeze the time into your maintenance window.
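As a rough illustration of the timetable arithmetic, the sketch below adds up the pre-maintenance, maintenance, and post-maintenance estimates and checks whether an optional full system backup still fits in the window. The function names and all of the minute values are hypothetical; substitute your own estimates.

```python
def outage_minutes(pre, maintenance, post, backup=0):
    """Total planned outage in minutes: quieting, maintenance, restart, plus an optional backup."""
    return pre + maintenance + post + backup


def fits_window(total_minutes, window_minutes):
    """True when the planned outage fits inside the approved maintenance window."""
    return total_minutes <= window_minutes


# Hypothetical estimates, all in minutes.
window = 240
base = outage_minutes(pre=30, maintenance=120, post=45)
with_backup = outage_minutes(pre=30, maintenance=120, post=45, backup=180)

print(base, fits_window(base, window))                # 195 True
print(with_backup, fits_window(with_backup, window))  # 375 False
```

In this made-up case the maintenance fits a four-hour window, but adding a three-hour backup would not, so you would either negotiate a longer window or skip the backup.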
2. Get permission to take your machine down–The next step is to determine when you can take the machine down (your maintenance window). This may require management permission, since most production systems are in use 24x7x365. You may also have to work around existing Service Level Agreements (SLAs), and you may need to negotiate a time frame depending on machine demands and the availability of a redundant backup system.
If you have a Capacity BackUp (CBU) system or a hot spare, consider whether you’ll need to switch over to the backup system while you perform your maintenance, especially if you have a long maintenance timetable. Whether to activate the CBU depends partly on how much time it takes to switch to the CBU and back. If it takes an hour to switch to the backup box and an hour to switch back, for example, it may not be worth activating the backup box when the machine is only going to be down for 45 minutes. Conversely, if you’re performing a hardware upgrade where the machine will be down for 12 to 24 hours, you may want to activate the backup box to keep production processing available.
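The switch-over trade-off can be expressed as a simple comparison. The sketch below assumes one rule of thumb: activate the backup box only when the planned downtime is longer than the round trip of switching over and back. That threshold and the function name are assumptions for illustration; a real decision would also weigh SLAs and business demand.

```python
def worth_activating_cbu(downtime_minutes, switch_to_minutes, switch_back_minutes):
    """Return True when planned downtime exceeds the round-trip switch time (assumed rule of thumb)."""
    switch_overhead = switch_to_minutes + switch_back_minutes
    return downtime_minutes > switch_overhead


# The two scenarios from the text: one hour each way to switch.
print(worth_activating_cbu(45, 60, 60))       # False: a 45-minute outage
print(worth_activating_cbu(12 * 60, 60, 60))  # True: a 12-hour hardware upgrade
```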
3. Print a copy of a current WRKACTJOB screen while the production partition is up and running–After you bring the machine back up from maintenance, you’ll want to check whether all your critical jobs and subsystems have restarted. To do this, print a copy of the Work with Active Jobs (WRKACTJOB) screen by executing the following WRKACTJOB command:
WRKACTJOB OUTPUT(*PRINT)
4. Determine whether you need to bring your system up in restricted state or semi-restricted state during or after the system maintenance process–Some changes may require the system to be in restricted mode after an IPL. There are different procedures to follow depending on how you’re going to quiet your system. Here are some of the more common scenarios.
5. Determine how you’ll quiet the system–Quieting the system means stopping production processing in an orderly way and bringing the system down without affecting data integrity. Determine what procedures you need to execute to end critical applications, prevent data corruption, and complete all pending processing before the system comes down.
If you’re running a CBU, you may be able to use the system quieting procedures in your CBU run book to quiet the system. If you don’t have existing procedures for quieting the system, you may want to create a new set of procedures for putting your partition into a quiet state, in an orderly way, where no production processing is happening. These procedures may include:
6. Create a run book for how the maintenance will occur–Game-plan the shutdown and create a task-specific run book of all the procedures that will be performed during maintenance. I’m often surprised by how many people (including consultants) perform maintenance using only the run book in their heads.
Creating a run book or checklist will help you thoroughly plan the maintenance and ensure you don’t miss any steps. It doesn’t have to be a massive document. A page or two will do, as long as it provides all the steps your technicians will need to perform during maintenance. Nothing causes more problems than skipping a step in an upgrade procedure or performing steps out of order. A filled-out checklist can also be used to review the steps already taken or to document the procedure for the next time you perform this maintenance function.
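If you keep your run book electronically, the same idea can be sketched in a few lines of code. The hypothetical RunBook class below holds an ordered list of steps, refuses to mark a step complete out of order, and keeps a note for each completed step for later review. The class and the step names are invented for illustration; a paper checklist does the same job.

```python
class RunBook:
    """An ordered checklist that rejects skipped or out-of-order steps."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.completed = []  # (step, note) pairs, in the order they were done

    def next_step(self):
        """Return the next step to perform, or None when the run book is finished."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete(self, step, note=""):
        """Mark a step done; raise ValueError if it is not the next expected step."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"Out of order: expected {expected!r}, got {step!r}")
        self.completed.append((step, note))

    def is_finished(self):
        return len(self.completed) == len(self.steps)


# Hypothetical steps for one maintenance session.
book = RunBook(["Notify users", "Quiet system", "Apply maintenance", "Restart system"])
book.complete("Notify users", note="Break message sent")
try:
    book.complete("Apply maintenance")  # skips "Quiet system"
except ValueError as err:
    print(err)
print(book.next_step())  # Quiet system
```

The completed list doubles as the filled-out checklist for project review.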
7. Notify any affected users before taking down the system–When planning scheduled maintenance, notify users twice before the system is taken down. The first notification should happen well in advance of taking down the system (say, three to four days before the maintenance occurs), so that the users have time to reschedule their work. Providing advance notification of a system outage helps ensure that no one schedules work during the maintenance window. You don’t want a remote warehouse shift showing up and expecting to access the system, for example, just when you’re ready to take down the box for six hours.
The second notification should occur right before the planned maintenance occurs, to notify the users the machine is ready to come down. You can send a message to all your active 5250 users that the system is ready to come down by using the procedure I outlined in a previous article on sending break messages to interactive users. Send the message about 10 to 15 minutes before you take down the system, to give any active users time to shut down their processing before the system is quieted.
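To put the two notifications on a calendar, you can work backward from the planned takedown time. The sketch below is a minimal example using the lead times suggested above (three days and 15 minutes) as defaults. The function name and the takedown date are hypothetical.

```python
from datetime import datetime, timedelta


def notification_times(takedown, advance_days=3, final_minutes=15):
    """Return (first_notice, second_notice) datetimes for a planned takedown."""
    first = takedown - timedelta(days=advance_days)
    second = takedown - timedelta(minutes=final_minutes)
    return first, second


# Hypothetical takedown: Saturday night at 10 p.m.
takedown = datetime(2012, 11, 17, 22, 0)
first, second = notification_times(takedown)
print(first)   # 2012-11-14 22:00:00
print(second)  # 2012-11-17 21:45:00
```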
System Maintenance Tasks
8. Take down the system by executing the plan created in the previous steps and perform the maintenance–This is where your planning comes into play. Use the checklists, procedures, and techniques you created in the previous steps to smoothly take down the system and perform the maintenance. Note any problems or detours for project review and correction later on.
Once you’ve finished your planned maintenance, it’s time to reverse course and restart your system. Here are the steps you need to take to bring your system back up into production mode.
9. Restart your system–Depending on which technique you used to quiet your system for maintenance, use one of the following methods to restart your system.
A. If you switched over to a CBU system, perform the CBU run book procedures needed to switch processing back to the production system from your backup system.
B. If you IPLed into restricted mode for the maintenance, start your controlling subsystem with the following Start Subsystem (STRSBS) command:
STRSBS SBSD(controlling_system_name)
Where controlling_system_name is the name and library of your IBM i controlling subsystem description. You can find your system’s controlling subsystem name and library in the Controlling Subsystem (QCTLSBSD) system value. Starting the controlling subsystem will execute the system startup program, which will restart your processing.
C. If you changed your system startup program name to *NONE as I recommended in my article on preventing your system from restarting after a full system backup, restore the startup program name to the Startup program system value (QSTRUPPGM) as documented in that article. Once the startup program name is re-entered, end and restart the controlling subsystem to restart your system.
10. Check your system’s jobs against the Work with Active Jobs (WRKACTJOB) printout you created before you quieted the system–Use the WRKACTJOB printout to make sure that all the system jobs, subsystems, and printers that were running before you performed the maintenance are running again after the system restarts. Since IBM i production partitions are so infrequently restarted, there’s a good chance a necessary subsystem or critical job was never added to your startup program and will need to be restarted manually. Checking against the pre-maintenance WRKACTJOB listing will help you catch jobs that didn’t start when you restarted the system.
If you find any necessary jobs that weren’t restarted when the system came up, make a note and change the startup routine as appropriate to start them next time.
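If you have the before and after job lists in electronic form, the comparison is a set difference. The sketch below assumes you have already reduced each WRKACTJOB listing to (subsystem, job name) pairs; it does not parse the actual printout format. The function name and the job names are hypothetical.

```python
def missing_after_restart(before, after):
    """Return (subsystem, job) pairs that were active before maintenance but not after."""
    return sorted(set(before) - set(after))


# Hypothetical job lists keyed from the pre- and post-maintenance listings.
before = [("QSPL", "PRT01"), ("APPSBS", "ORDERSRV"), ("APPSBS", "EDIMON")]
after = [("QSPL", "PRT01"), ("APPSBS", "ORDERSRV")]

for subsystem, job in missing_after_restart(before, after):
    print(f"Not restarted: {job} in {subsystem}")
```

Anything this reports is a candidate for a manual restart now and an addition to the startup program later.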
11. Check your job scheduling software to determine whether any scheduled jobs didn’t run at their appointed times and rerun them, if necessary–When you quiet a system, you also stop your job scheduling software. Determine whether any scheduled jobs were missed during the maintenance window and take appropriate action to rerun them, if necessary.
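One way to spot missed jobs is to list every job whose scheduled time fell inside the outage. The sketch below assumes you can export your schedule as job names with their scheduled run times; the function name, job names, and times are hypothetical, and your scheduling software may already offer its own missed-job report.

```python
from datetime import datetime


def missed_jobs(schedule, outage_start, outage_end):
    """Return names of jobs scheduled to run while the system was down, sorted by name."""
    return sorted(
        name
        for name, run_at in schedule.items()
        if outage_start <= run_at < outage_end
    )


# Hypothetical schedule and a six-hour outage starting at 10 p.m.
schedule = {
    "INVOICES": datetime(2012, 11, 17, 23, 0),
    "EDIPOLL": datetime(2012, 11, 18, 1, 30),
    "BACKUPLOG": datetime(2012, 11, 18, 6, 0),
}
outage_start = datetime(2012, 11, 17, 22, 0)
outage_end = datetime(2012, 11, 18, 4, 0)

print(missed_jobs(schedule, outage_start, outage_end))  # ['EDIPOLL', 'INVOICES']
```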
12. Check your replication software to ensure that data is still being replicated between your source and target systems–Make sure that replication is working correctly and that all new updates are being sent to your backup box.
A Good Starter List
This checklist can provide a good starter list for planning your own system maintenance procedures. Add more tasks to this list as needed, and you’ll easily be able to stop and restart your system whenever you need to perform planned maintenance.