Admin Alert: Diary of a Production System Upgrade, Part 1
Published: April 28, 2010
by Joe Hertvik
Last week, my shop replaced an existing production System i 550 machine with an upgraded Power 6 system. Our plan was to switch production processing to our Capacity BackUp (CBU) machine, configure the new machine, and then switch back. This week and next, I'll review our installation as a mini-case study and discuss what lessons it can provide to other i/OS system administrators.
At upgrade time, our System i/Power i configuration consisted of the following components:
- A System i 550 Power 5 machine equipped with 32 GB of memory, three processors, and two partitions: a production partition and a development partition.
- A new Power i/Power 6 520 CBU system that was installed at a remote location in January. This box was tested and certified to stand in for the production partition in February. We found and fixed a number of issues during that February switchover, and we felt certain we had caught the biggest ones (we were wrong).
Our goal was to replace the production machine with a new model 8204 IBM i Power 6 box with 96 GB of memory and four processors. Our plan to accomplish this task in a weekend was simple.
- Saturday morning: Switch production processing to the CBU so that production could run uninterrupted as we brought up the new configuration.
- Saturday afternoon: Migrate the development partition to the Power 6 hardware.
- Saturday evening to Sunday morning/afternoon: Bring up the new production partition to serve as the CBU.
- Sunday evening: Switch processing back to the new hardware.
As always, proper planning makes for a smoother transition. But no matter how much planning you do, there's always the unexpected. Here are some of the speed bumps we hit and the lessons we learned from them.
No Enemy But Time
It quickly became apparent that time was our enemy. A quick look at the task list above shows that we were overly optimistic in what we felt we could accomplish in a weekend. Some of these tasks were half-day projects or entire weekend projects. If I had to do it over again, I would have spread the upgrade out over two weekends, with the first weekend dedicated to bringing up the development partition on the new hardware and the second weekend devoted to bringing up the production machine.
The downside of the two-weekend scenario is that it's more difficult to pull off with an upgrade, where some of the equipment from the old machine is being ported to the new machine, than it is when you're installing an entirely new machine to replace your old hardware. With an upgrade, you may not have the option to run both machines at the same time, which can force you to go full gun and migrate all your partitions at once. However, if your hardware scenario allows it, I recommend spreading the work out over two weekends when dealing with multiple partitions.
Using Your CBU Properly
Our CBU system was both a blessing and a curse during the upgrade. Our plan called for us to switch production processing to the CBU early Saturday morning, leaving us the rest of the weekend to bring up the production machine and to switch back. However, our timing was off. The actual CBU switch took only 1.5 hours, but a few CBU applications needed extra attention, and that chewed up another 1.5 hours, putting us behind schedule and delaying the development partition migration. The biggest CBU issue involved applications that use digital certificates to communicate with other machines: we had to rebuild all the certificates on the CBU machine before those applications would work.
The lesson here is that if you're going to activate a CBU to minimize downtime on your partition, try to do it before your migration weekend so that it doesn't push back your deployment schedule.
Feeding and Resting the Herd
When you're performing a long upgrade like this, it's important to look for opportunities to rest your key people. Nothing will torpedo an upgrade faster than a tired technician who makes a mistake that takes hours to fix. For this upgrade, we handled our personnel in the following way:
- Our installation crew consisted of our business partner, his installation tech, our lead tech, and the data center manager (me).
- On Saturday, our lead tech started working at 5 a.m. to quiet our systems and to start production system backups. We made sure to give him a four-hour break during the afternoon, just to clear his head and nap, if needed. The business partner's people flew solo during that time. The lead returned to configuration in late afternoon and worked until 10 p.m.
- We also made sure that the entire installation team had a good lunch and dinner, taking appropriate time away from the install to eat and relax. Even though they may add time to the process, relaxed meals are another opportunity for refreshment and they help keep people sharper during a long day.
- We had the development partition configured and on the network by Saturday night. The funny thing is that some very good computer techs will work around the clock if you let them. One of the techs wanted to continue work after 10 p.m., but we kicked everyone out of the data center at 10 p.m. We needed them fresh to bring up the production machine on Sunday morning.
- Since we had worked a long Saturday, we let everyone sleep in and started production partition installation at 10 a.m. Sunday.
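For readers who haven't scripted the 5 a.m. quiet-down-and-backup step mentioned above, on an i/OS partition it generally looks something like the CL sequence below. This is a hedged sketch, not our shop's exact procedure: TAP01 is a placeholder tape device name, and your save strategy (GO SAVE menu option 21, BRMS, and so on) may differ.

```
/* Bring the partition to a restricted state (controlled end, 10-minute delay) */
ENDSBS SBS(*ALL) OPTION(*CNTRLD) DELAY(600)

/* Save licensed internal code, operating system, and configuration            */
/* (SAVSYS requires the restricted state established above)                    */
SAVSYS DEV(TAP01)

/* Save all non-system libraries                                               */
SAVLIB LIB(*NONSYS) DEV(TAP01)

/* Save IFS objects, omitting what the saves above already covered             */
SAV DEV('/QSYS.LIB/TAP01.DEVD') OBJ(('/*') ('/QSYS.LIB' *OMIT) ('/QDLS' *OMIT))

/* Restart the controlling subsystem once backups complete                     */
STRSBS SBSD(QCTL)
```

Option 21 of the GO SAVE menu wraps essentially this same sequence; the point is simply that the quiet-down and full save is a multi-hour job that has to be scheduled, not squeezed in.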
The practical point here is to pace your installation personnel as if they were athletes operating in a peak performance situation, instead of installation resources that you run as hard as you can. Installation weekends always consist of long, grueling hours. You don't want to burn people out on the first day, only to have them stumble the second day. Anything you can do to keep your crew fresh on the first day should pay off in fewer mistakes the second day.
The other thing that hurt our schedule was unexpected detours.
For example, at about 6 p.m. on Saturday, my techs decided the machine's cable management was a mess and fixed it on the spot, reasoning they would never have another chance to straighten it out. Add two hours to the project's deployment time. And then there were configuration questions we hadn't considered before the upgrade, such as how to configure our new dual LTO4 tape drives. Should they be shared with our other machines on the network through our SAN switch, or should we dedicate one to each partition with an option to quickly switch drives between the partitions, if needed? Each unanticipated conversation usually added a half-hour to the deployment. And of course, there were the unexpected delays with the CBU I mentioned before.
The point to remember here is that while you can make as many checklists as you'd like, there will be some delays and it's best to schedule extra time into your implementation schedule to deal with the unexpected. When you schedule too tightly, small delays create big problems in project completion.
And Then . . .
At this point, it was Saturday night, and we had migrated the development partition and begun running it on the new hardware. Next would come the fun part: putting the production machine online and switching production back from our CBU to the new hardware. I'll cover that part of the story, complete with its own twists and turns, next issue.