Admin Alert: Diary of a Production System Upgrade, Part 1
April 28, 2010 Joe Hertvik
Last week, my shop replaced an existing production System i 550 machine with an upgraded Power 6 system. Our plan was to switch production processing to our Capacity BackUp (CBU) machine, configure the new machine, and then switch back. This week and next, I’ll review our installation as a mini-case study and discuss what lessons it can provide to other i/OS system administrators.
At upgrade time, our System i/Power i configuration consisted of the following components:
Our goal was to replace the production machine with a new model 8204 IBM i Power 6 box with 96 GB of memory and 4 processors. Our plan to accomplish this task in a weekend was simple.
As always, proper planning helps for a smoother transition. But no matter how much planning you do, there’s always the unexpected. Here are some of the speed bumps we hit and the lessons we learned from them.
No Enemy But Time
It quickly became apparent that time was our enemy. A quick look at the task list above shows that we were overly optimistic in what we felt we could accomplish in a weekend. Some of these tasks were half-day projects or entire weekend projects. If I had to do it over again, I would have spread the upgrade out over two weekends, with the first weekend dedicated to bringing up the development partition on the new hardware and the second weekend devoted to bringing up the production machine.
The downside of the two-weekend scenario is that it’s harder to pull off with an upgrade, where some of the equipment from the old machine is being ported to the new machine, than it is when you’re installing an entirely new machine to replace your old hardware. With an upgrade, you may not have the option to run both machines at the same time, which can force you to go full bore and migrate all your partitions at once. However, if your hardware scenario allows it, I recommend spreading the work out over two weekends when dealing with multiple partitions.
Using Your CBU Properly
Our CBU system was both a blessing and a curse during the upgrade. Our plan called for us to switch production processing to the CBU early Saturday morning, leaving us the rest of the weekend to bring up the production machine and to switch back. However, our timing was off. The actual CBU switch took only 1.5 hours, but a few CBU applications needed extra attention, and that chewed up another 1.5 hours, putting us behind schedule and delaying the development partition migration. The biggest CBU issue involved applications that used digital certificates to communicate with other machines: we had to rebuild all the certificates on the CBU machine so that they would work with our applications.
The lesson here is that if you’re going to activate a CBU to minimize downtime on your partition, try to do it before your migration weekend so that it doesn’t push back your deployment schedule.
Feeding and Resting the Herd
When you’re performing a long upgrade like this, it’s important to look for opportunities to rest your key people. Nothing will torpedo an upgrade faster than a tired technician who makes a mistake that takes hours to fix. For this upgrade, we tried to handle our personnel in the following way:
The practical point here is to pace your installation personnel as if they were athletes operating in a peak performance situation, instead of installation resources that you run as hard as you can. Installation weekends always consist of long, grueling hours. You don’t want to burn people out on the first day, only to have them stumble the second day. Anything you can do to keep your crew fresh on the first day should pay off in fewer mistakes the second day.
The other thing that hurt our schedule was the unexpected detours.
For example, at about 6 p.m. on Saturday, my techs concluded that the machine’s cable management was a mess and decided to fix it on the spot, because they would never have another chance to straighten it out. Add two hours of deployment time to the project. And then there were configuration questions that we hadn’t considered before the upgrade, such as how to configure our new dual LTO4 tape drives. Should they be shared with our other machines on the network through our SAN switch, or should we dedicate one to each partition with an option to quickly switch drives between the partitions, if needed? Each unanticipated conversation usually added a half-hour to the deployment. And of course, there were the unexpected delays with the CBU I mentioned before.
The point to remember here is that while you can make as many checklists as you’d like, there will be some delays and it’s best to schedule extra time into your implementation schedule to deal with the unexpected. When you schedule too tightly, small delays create big problems in project completion.
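To put rough numbers on that advice, here is a minimal Python sketch that tallies the delays described above. The 1.5-hour CBU application fixes, the two-hour cabling detour, and the half-hour cost per unplanned conversation come from our weekend; the function names, the conversation count, and the planned-hours figure in the usage note are hypothetical placeholders, not numbers from our actual schedule.

```python
# Delays described in this article, in hours.
KNOWN_DELAYS = {
    "CBU application fixes": 1.5,
    "cable management cleanup": 2.0,
}

# Each unanticipated configuration discussion usually cost about half an hour.
CONVERSATION_COST = 0.5


def total_overrun(conversations: int) -> float:
    """Return the known delays plus half an hour per unplanned conversation."""
    return sum(KNOWN_DELAYS.values()) + conversations * CONVERSATION_COST


def buffer_fraction(planned_hours: float, conversations: int) -> float:
    """Return the overrun as a fraction of the planned schedule.

    planned_hours is a hypothetical input; the article does not state
    how many hours the original plan allotted.
    """
    if planned_hours <= 0:
        raise ValueError("planned_hours must be positive")
    return total_overrun(conversations) / planned_hours
```

For instance, assuming three unplanned conversations, `total_overrun(3)` comes to 5.0 hours. Against a hypothetical 20-hour plan, `buffer_fraction(20, 3)` is 0.25, meaning a quarter of the schedule would have had to be held in reserve just to absorb the detours we already know about.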
And then. . .
At this point, it was Saturday night and we had migrated and begun running the development partition on the new hardware. Next would come the fun part: putting the production machine online and switching production back from our CBU to the new hardware. I’ll cover that part of the story, complete with its own twists and turns, next issue.