Admin Alert: Diary of a Production System Upgrade, Part 2
Published: May 12, 2010
by Joe Hertvik
Last issue, I began discussing a Power i/Power 6 upgrade that was recently completed in my shop. This review served as a case study for discussing some techniques and pitfalls in bringing up new hardware. I'm offering this examination to document some lessons I learned while upgrading in order to help other i/OS administrators who are installing new hardware. This week, let's continue the story.
I previously documented how we started swapping out an existing production System i/Power 5 machine for an upgraded model 8204 IBM i Power 6 box with 96 GB of memory and 4 processors. Over a single weekend, we attempted to switch processing to our Capacity BackUp (CBU) machine, rebuild our development and production partitions, and then switch processing back to the new production partition.
What became obvious during the install was that we were trying to do too much in 48 hours. In short order, we had a production delay while switching over to the CBU, struggled to keep the staff rested and alert, and experienced other delays that slowed our progress. By Saturday evening, we had switched production processing to the CBU, and we had migrated our development partition to the new machine. At 11 p.m., we called it a night before starting the next step on Sunday: bringing the production partition up live.
Meanwhile, Back at the Production Partition
On Sunday, we finished restoring our existing system from a full system backup tape and made the necessary adjustments to account for the new hardware and to activate the partition. Since we had activated our CBU machine to fill in for production, making the new partition live was a three-step process:
- Bringing the new production partition up as the target CBU system rather than as the source production box.
- Synchronizing the production partition's data with the off-site CBU that was currently running production. We had to ensure that all live data processed on the CBU was replicated back to the new partition.
- Switching processing roles between the CBU and the production machine, so that the new partition could take over servicing our users.
Here's how each step shook out.
Bringing up the new production partition as the CBU.
After creating the production partition and restoring its data, we IPLed the machine into restricted mode by doing the following.
1. We changed the Startup Program system value to *NONE so that on an IPL, it wouldn't fire up the production startup program. We did this by running the following Change System Value (CHGSYSVAL) command.
CHGSYSVAL SYSVAL(QSTRUPPGM) VALUE(*NONE)
2. We changed the partition's IPL attributes so that if we IPLed the system, it would automatically come up in Restricted Mode. For instructions on how to do this, see my article You Can Re-IPL an AS/400 into Restricted State.
We then tested and verified that the new partition worked. With the system ready to go, we reconfigured the production box as the target CBU system by following the switch instructions in our CBU run book.
Switching the CBU Back to the New System
By Sunday afternoon, we had the new production partition running as our CBU system. However, before we could switch production processing over to the new partition, we had to wait until the new production box finished synchronizing its data with the real CBU, which was functioning as our production partition.
Three hours later, we started investigating why the new machine hadn't finished synchronizing with the CBU. The delay was caused by running our regular Sunday morning maintenance jobs on the CBU, a process that reorganizes several large files. One job reorganized the largest file on our system, a 160 GB sales history file with 141 million records and 30 million deleted records, and synchronizing that file between machines was taking forever.
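To get a feel for why a file this size can stall a switch, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the replication software has to resend the entire 160 GB file after the reorganization. The link speeds and the efficiency factor are made-up values, not figures from the migration described here, and how any particular replication product handles a reorganized file will vary.

```python
# Rough transfer-time estimate for resending one large file.
# Only FILE_SIZE_GB comes from the article; everything else is assumed.

FILE_SIZE_GB = 160  # size of the sales history file quoted in the article


def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Return the hours needed to move size_gb gigabytes over a link.

    link_mbps is the nominal link speed in megabits per second.
    efficiency is an assumed fraction of that speed the replication
    traffic actually achieves (hypothetical default of 0.7).
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    size_megabits = size_gb * 1000 * 8  # decimal gigabytes -> megabits
    seconds = size_megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical link speeds between the production site and the CBU site.
    for mbps in (45, 100, 1000):
        hours = transfer_hours(FILE_SIZE_GB, mbps)
        print(f"{mbps:>5} Mbps: {hours:.1f} hours")
```

Plugging in your own link speed before migration weekend gives a quick sense of whether a large reorganization can realistically catch up inside the switch window.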
We spent several hours on the phone with technical support. At 12 a.m. Monday, we decided to delay bringing the new partition up for production processing that weekend. Instead, we would fix the replication issue to allow the production box to synchronize its data with the CBU and then switch processing over to the new machine the next weekend.
Running Live on the CBU For a Week
Come Monday morning, we were still live on our CBU system, and the new production partition was functioning as the target CBU system. We were running our systems as if it were an actual disaster. This produced a few hiccups that we dealt with during the week, including:
- Problems with running our check printing software. We discovered this problem over the weekend, but the vendor didn't have weekend support hours, so it couldn't be resolved until Monday.
- Problems with printing critical documents such as invoices, which ran on Friday night and were still on our old production machine. We saved those documents off the replaced box and restored them to the CBU for printing.
- Our automated job scheduling software stopped working on Tuesday because the vendor had only given us a three-day key to run the software on the CBU during the weekend. We retrieved a new extended key so that we could keep running our scheduled batch software.
- Minor configuration issues that slowed down processing until we straightened them out or created workarounds.
Overall, running production on the CBU for a week went fairly smoothly. It turned out to be an unexpected disaster recovery drill that we wound up passing.
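The expired scheduler key in the list above is the kind of problem a simple date check can catch ahead of time. The following Python sketch is only an illustration: the function names, the buffer idea, and the sample dates are hypothetical and are not taken from any vendor's licensing scheme.

```python
from datetime import date, timedelta


def key_expiry(start: date, valid_days: int) -> date:
    """Return the first day on which a temporary key no longer works."""
    return start + timedelta(days=valid_days)


def key_covers(start: date, valid_days: int, last_day_needed: date,
               buffer_days: int = 0) -> bool:
    """Check that a key stays valid through last_day_needed plus a buffer.

    buffer_days is extra cushion for a migration that overruns its window.
    """
    return key_expiry(start, valid_days) > last_day_needed + timedelta(days=buffer_days)


if __name__ == "__main__":
    # Made-up dates: a Saturday start and a Sunday planned finish.
    saturday = date(2010, 3, 6)
    sunday = date(2010, 3, 7)

    # A three-day key that starts on Saturday stops working on Tuesday.
    print(key_expiry(saturday, 3).strftime("%A"))

    # Fine for the planned weekend, but not if the switch slips a week.
    print(key_covers(saturday, 3, sunday))
    print(key_covers(saturday, 3, sunday, buffer_days=7))
```

Asking vendors for a key that passes the check with a week of buffer would have covered the extra days we spent running on the CBU.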
Switching Over to the New Production Partition. . . Finally
A week after starting the migration, we were ready to switch processing back from the CBU to the new partition. Before doing that, we held the weekend reorganization jobs so that the replication software wouldn't have to struggle to keep current again before switching back.
Our replication software was working correctly, keeping the data on the new production partition within 30 seconds of the data generated on the CBU. Since we weren't overwhelming the replication software with reorganizations, we were confident that we could switch production processing over to the new partition without unnecessary delays.
The next Sunday, we began switching production processing from the CBU to the new production partition. But there was one more surprise. Our replication software wouldn't let us reverse the replication flow to make the new production partition the new source machine. After a three-hour call to the replication software vendor, we discovered that one replication setting was not set up correctly on the new box. After correcting the mistake, we were finally able to switch, and the new production machine started servicing our users.
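A go/no-go check on replication lag can be written down as a small rule. The Python sketch below is hypothetical: the 30-second threshold comes from the figure above, but the function name, the sample count, and the idea of collecting lag readings as a list are assumptions, since each replication product reports lag in its own way.

```python
from typing import Sequence

MAX_LAG_SECONDS = 30  # lag reported in the article before the switch


def ready_to_switch(lag_samples: Sequence[float],
                    max_lag: float = MAX_LAG_SECONDS,
                    min_samples: int = 5) -> bool:
    """Return True when the most recent readings are all within max_lag.

    lag_samples holds replication lag readings in seconds, oldest first.
    min_samples (an assumed value) guards against deciding on too little data.
    """
    if len(lag_samples) < min_samples:
        return False
    recent = lag_samples[-min_samples:]
    return all(0 <= lag <= max_lag for lag in recent)


if __name__ == "__main__":
    # Made-up readings: lag that has settled after an earlier backlog.
    settled = [900, 400, 10, 9, 8, 7, 6]
    # Made-up readings: one recent spike above the threshold.
    spiking = [12, 8, 25, 31, 4]
    print(ready_to_switch(settled))
    print(ready_to_switch(spiking))
```

Looking only at the most recent readings means an earlier backlog, such as one caused by a reorganization, doesn't block the switch once replication has caught up.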
While our new hardware migration certainly had its share of ups and downs, it provided the following lessons that can be applied to other shops.
- If you're going to switch processing over to your CBU before a migration, do it before migration weekend so that you don't delay the migration.
- Don't plan more than you can handle in a limited time. When migrating multiple partitions to a new machine, do it over two weekends to provide adequate time. The more you schedule for a particular weekend, the more likely you are to encounter delays.
- When working an insane number of hours over a 48-hour weekend, schedule regular breaks for the crew. This helps keep them alert, which cuts down on mistakes.
- Print all critical documents such as invoices and checks before starting the migration. Don't risk holding up company processing while changing hardware.
- Make sure CBU keys and settings are correct. You don't want to be caught off-guard when you have to switch over.
- Before switching processing from a production machine to a CBU or vice versa, hold any automated jobs that produce a high number of transactions, since those transactions will have to be replicated between the machines during the switch.
- Before the migration, contact key vendors and make sure you have off-hours contact information. High availability vendors in particular must be notified whenever there is a switch test or a migration.
While these won't be the only issues you run into, I hope that my experience will help your next hardware migration run more smoothly.
Diary of a Production System Upgrade, Part 1
You Can Re-IPL an AS/400 into Restricted State