Admin Alert: Critical Information That Every i Administrator Should Have Handy
Published: May 11, 2011
by Joe Hertvik
As with any other hardware platform, crises can happen whenever you administer iSeries, System i, and Power i equipment in a data center. Being a good administrator means knowing what to do and having the appropriate information available when you're handling a crisis. This week, I'll look at four key lists you should have handy to restore functionality after a system issue or disaster occurs.
The Big Four Lists
Generally, you'll need the following information at your fingertips when a critical issue occurs that eliminates some or all i operating system functionality in your shop.
- IP addresses, critical profile names, and passwords
- Emergency call trees
- Critical documentation
- Identifying information
Let's look at each list and see how they can help you recover from a localized or global system issue.
IP Addresses, Profile Names, and Passwords
Hardware and software breakdowns aren't always about your iSeries, System i, or Power i hardware. Sometimes a tape drive or other peripheral device is broken. Other times you need to sign on to your Hardware Management Console (HMC) to make an adjustment. Given the need for service access to Power i and non-Power i equipment, you should know how to retrieve the following information for use by service personnel.
- Retrieve or reset the Security Officer (QSECOFR) and Service User (QSRV) profile passwords. IBM may need to sign on as one of these users to fix your equipment.
- Your QSECOFR System Service Tools (SST) password. Better yet, you can use SST to set up a special service user profile that your maintenance people can use without having to start SST as QSECOFR. Be careful with SST, however, because it is notoriously easy to disable the QSECOFR SST password. If you encounter a disabled QSECOFR SST profile, you can easily reset the password to its default value by following the instructions in this article.
- IP addresses, user profiles, and passwords for any peripheral devices your i partition attaches to. This includes media devices (such as tape drives) that need to be accessed when there's a backup issue. It also includes any controllers that your warehouse line printers and leftover dumb terminals go through to start a session with your Power i or System i partition.
- HMC passwords, especially the hscroot user password. The HMC won't let you sign on to perform emergency work unless you have the correct password.
- Sign-ons for companion servers that perform specific functions for your i 5.4.x or i 6.1.x partitions, such as FTP servers, email servers, etc. If something needs to be reset on those servers, you need to be able to sign on and fix the related problem.
- Any other passwords that may be needed to restore critical system features.
I'm not necessarily advocating you keep a list of your critical passwords pinned to your cubicle wall. However, you will need a secured mechanism where you can quickly retrieve sign-on information for critical servers. The challenge is to keep this information accessible while protecting it from malicious or prying eyes.
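One way to keep this list accessible without writing passwords down is to record where each credential is stored rather than the credential itself. The following is only a minimal sketch under that assumption: the category names are shortened from the list above, and the `CredentialPointer` class, the sample addresses, the `admin` profile, and the storage locations are all hypothetical.

```python
from dataclasses import dataclass

# Categories shortened from the list above; the wording is made up for this sketch.
REQUIRED_CATEGORIES = [
    "QSECOFR and QSRV profiles",
    "QSECOFR SST password",
    "Peripheral devices",
    "HMC",
    "Companion servers",
]


@dataclass
class CredentialPointer:
    category: str
    system: str    # host name or IP address (placeholder values below)
    profile: str   # user profile to sign on with
    location: str  # where the password is kept, never the password itself


def missing_categories(entries):
    """Return the required categories that have no entry yet, in list order."""
    covered = {entry.category for entry in entries}
    return [c for c in REQUIRED_CATEGORIES if c not in covered]


# Hypothetical sample entries.
entries = [
    CredentialPointer("HMC", "192.0.2.10", "hscroot", "sealed envelope, site safe"),
    CredentialPointer("Peripheral devices", "192.0.2.20", "admin", "password vault, tape library entry"),
]

# Print the categories that still need an owner and a storage location.
print(missing_categories(entries))
```

Running a check like this periodically shows which parts of the list still have no documented retrieval path, while the sensitive values stay in whatever secured store your shop already uses.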
Emergency Call Trees for When Problems Occur
No matter how well you know your machines, you'll often have to call someone else for help during off-hours. To complicate matters, you're going to have to call different people for different issues. To handle this, build a call tree for your staff that has contact numbers for any or all of the following people who may be needed to fix an iSeries, System i, or Power i issue.
- IBM service.
- Programming staff numbers for applications issues. If possible, get a coverage schedule or central number from your applications staff, as well as back-up numbers for when you're unable to reach the programmer on call.
- Network staff contact information for network issues that can affect i performance (e.g., email problems, lost network connectivity, switch failures).
- Tech support phone numbers for critical software packages.
- Third-party maintenance vendor contacts for peripheral equipment and companion servers.
- The call center number for any off-site co-location facilities (Co-Lo). If you host equipment in a Co-Lo, you need to know who to call to add maintenance personnel to your Let-In Let-Out (LILO) list and who to contact when there's a Co-Lo emergency.
- Corporate Help Desk number for resetting critical i, network, and other server passwords when they expire or are accidentally disabled.
- Call trees and procedures for building problems, such as a roof collapse or leak. A building problem call list should include your maintenance people, as well as any executives who need to make decisions on whether staff should show up in an emergency. As a side note, card readers for building entry can also be disabled during a power outage. You may want to ensure that at least one person in the IT department has a physical key to the building for outage situations.
- If you store backup media off-site, make sure you have contact information for your media storage vendor for quickly retrieving restore media in the event of an emergency.
Make sure you have an electronic copy of this information as well as a physical copy for each of your responders to keep in their car or house. It doesn't do much good if your call trees are on the network and your building is damaged or unreachable.
In certain situations, paper backup is more than appropriate.
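As a rough illustration of the call tree idea, the sketch below keeps the tree as a simple mapping from issue type to an ordered list of contacts and renders it as plain text for the paper copy. The issue names, roles, and phone numbers are placeholders invented for this example, not a recommended layout.

```python
# Hypothetical call tree: issue type -> contacts in the order they should be called.
# Every role and phone number here is a placeholder.
CALL_TREE = {
    "hardware": [("IBM service", "555-0100"), ("Third-party maintenance vendor", "555-0101")],
    "application": [("Programmer on call", "555-0110"), ("Back-up programmer", "555-0111")],
    "network": [("Network staff", "555-0120")],
    "building": [("Building maintenance", "555-0130"), ("Executive decision maker", "555-0131")],
}


def contacts_for(issue):
    """Return the contacts for an issue type, or an empty list if it is unknown."""
    return CALL_TREE.get(issue, [])


def printable_sheet(tree):
    """Render the whole tree as plain text suitable for printing."""
    lines = []
    for issue in sorted(tree):
        lines.append(issue.upper())
        for position, (role, phone) in enumerate(tree[issue], start=1):
            lines.append(f"  {position}. {role}: {phone}")
    return "\n".join(lines)


# Print the sheet that responders would keep in their car or house.
print(printable_sheet(CALL_TREE))
```

Keeping the tree in one small structure like this makes it easy to regenerate the printed sheet whenever a number changes, so the electronic and paper copies stay in step.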
Critical Documents for When Things Go Wrong
If a real emergency hits your building (see above) and you have to restore off-site or put your disaster recovery (DR) plan into play, make sure that you can actually get to that plan. It sounds simple, but many times the worst place to store a DR plan is as a single copy on an internal server. Besides having an electronic copy of your DR plan, keep one or more physical copies off-site in the event your electronic copies aren't reachable. The same goes for your Capacity BackUp (CBU) system run book, so that you can reconfigure and restart your CBU as your production box when your network isn't available.
Other Identifying Information
Whenever I call IBM hardware service, the service representative always asks me for the phone number where the machine is located, so that they can verify service. And no, it's not enough to give them the machine serial number or the building's address or even the organization name. For some reason, IBM must check the official phone number in its database or they don't seem able to place the call for me. And since I sometimes have a hard time finding that number, I usually have to cajole and beg the rep to look it up for me. It's a heck of a way to run a help desk.
So make sure you always have the IBM registered phone number where your machine is located. In my experience, your service call will go more smoothly if you have it.
There may also be information other vendors require to provide service in an emergency. Off-site media storage vendors may require special passwords to release tapes. Co-Lo vendors may require additional information before they allow vendors or outside personnel to enter their secured facility (they may even require you to call from a certain cell phone or desk phone). Many outside entities and some internal ones will require identification before they help you. Make sure you know what ID is required and how to go about laying your hands on it if you can't reach your network or enter your building.
Resetting Your QSECOFR Service Tools Password