Admin Alert: Diary of a Production System Upgrade, Part 2
May 12, 2010 Joe Hertvik
Last issue, I began discussing a Power i/Power 6 upgrade that was recently completed in my shop. That story serves as a case study in the techniques and pitfalls of bringing up new hardware, and I'm documenting the lessons I learned so that they can help other i/OS administrators who are installing new equipment. This week, let's continue the story.
I previously documented how we started swapping out an existing production System i/Power 5 machine for an upgraded model 8204 IBM i Power 6 box with 96 GB of memory and 4 processors. Over a single weekend, we attempted to switch processing to our Capacity BackUp (CBU) machine, rebuild our development and production partitions, and then switch processing back to the new production partition.
What became obvious during the install was that we were trying to do too much in 48 hours. In short order, we had a production delay while switching over to the CBU, struggled to keep the staff rested and alert, and experienced other delays that slowed our progress. By Saturday evening, we had switched production processing to the CBU, and we had migrated our development partition to the new machine. At 11 p.m., we called it a night before starting the next step on Sunday: bringing the production partition up live.
Meanwhile, Back at the Production Partition
On Sunday, we finished restoring our existing system from a full system backup tape and made the necessary adjustments to account for the new hardware and to activate the partition. Since we had activated our CBU machine to fill in for production, making the new partition live was a three-step process.
Here’s how each step shook out.
Bringing Up the New Production Partition as the CBU
After creating the production partition and restoring its data, we IPLed the machine into restricted mode by doing the following.
1. We changed the Startup Program system value to *NONE so that on an IPL, it wouldn’t fire up the production startup program. We did this by running the following Change System Value (CHGSYSVAL) command.
CHGSYSVAL SYSVAL(QSTRUPPGM) VALUE(*NONE)
2. We changed the partition’s IPL attributes so that if we IPLed the system, it would automatically come up in Restricted Mode. For instructions on how to do this, see my article You Can Re-IPL an AS/400 into Restricted State.
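Taken together, the two preparation steps above, plus their eventual reversal at go-live time, can be sketched in CL like this. The startup program name shown in the reversal is the IBM-shipped default ('QSTRUP QSYS'); substitute your own shop's startup program.

```
/* Step 1: Keep the startup program from running at the next IPL      */
CHGSYSVAL  SYSVAL(QSTRUPPGM) VALUE(*NONE)

/* Step 2: Have the next IPL bring the partition up restricted        */
CHGIPLA    STRRSTD(*YES)

/* Later, before go-live, reverse both changes. 'QSTRUP QSYS' is the  */
/* shipped default; your shop's startup program may differ.           */
CHGIPLA    STRRSTD(*NO)
CHGSYSVAL  SYSVAL(QSTRUPPGM) VALUE('QSTRUP    QSYS')
```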
We then tested and verified that the new partition worked. With the system ready to go, we reconfigured the production box as the target CBU system by following the switch instructions in our CBU run book.
Switching the CBU Back to the New System
By Sunday afternoon, we had the new production partition running as our CBU system. However, before we could switch production processing over to the new partition, we had to wait until the new production box finished synchronizing its data with the real CBU, which was functioning as our production partition.
Three hours later, we started investigating why the new machine wasn’t finished synchronizing with the CBU. The delay was caused by running our regular Sunday morning maintenance jobs on the CBU, a process that reorganizes several large files. One job reorganized the largest file on our system, a 160 GB sales history file with 141 million records and 30 million deleted records. And it was taking forever to synchronize this file between machines.
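For context, the kind of reorganization that flooded the replication link is a standard Reorganize Physical File Member (RGZPFM) operation, which rewrites the member to reclaim the space held by deleted records. The file and library names below are placeholders, not our actual objects.

```
/* Reclaim the space held by roughly 30 million deleted records.      */
/* KEYFILE(*FILE) also rewrites the records in the file's own keyed   */
/* order. File and library names here are hypothetical.               */
RGZPFM     FILE(SALESLIB/SALESHST) KEYFILE(*FILE)
```

On a 160 GB file, a reorganization like this rewrites nearly every record, and replication software sees each rewritten record as a change that must be shipped to the target system, which is why the synchronization fell so far behind.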
We spent several hours on the phone with technical support. At 12 a.m. Monday morning, we decided to delay bringing the new partition up for production processing that weekend. Instead, we would fix the replication issue to allow the production box to synchronize its data with the CBU and then switch processing over to the new machine the next weekend.
Running Live on the CBU for a Week
Come Monday morning, we were still live on our CBU system, and the new production partition was functioning as the target system CBU. We were running our systems as if it were an actual disaster. This produced a few hiccups that we dealt with during the week.
Overall, running production on the CBU for a week went fairly smoothly. It turned out to be an unexpected disaster recovery drill that we wound up passing.
Switching Over to the New Production Partition. . . Finally
A week after starting the migration, we were ready to switch processing back from the CBU to the new partition. Before doing that, we held the weekend reorganization jobs so that the replication software wouldn’t have to struggle to keep current again before switching back.
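Holding the weekend reorganizations can be done through the job scheduler. Assuming the reorganizations run as basic job schedule entries (the entry name below is hypothetical), the hold before the switch and the release afterward look like this:

```
/* Hold the weekend reorganization before the switch . . .            */
HLDJOBSCDE JOB(REORGWKND) ENTRYNBR(*ONLY)

/* . . . and release it once the new partition is live.               */
RLSJOBSCDE JOB(REORGWKND) ENTRYNBR(*ONLY)
```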
Our replication software was working correctly so that the data on the new production partition matched the CBU to within 30 seconds of generation. Since we weren’t overwhelming the replication software with reorganizations, we were confident that we could switch production processing over to the new partition without unnecessary delays.
The next Sunday, we began switching production processing from the CBU to the new production partition. But there was one more surprise. Our replication software wouldn’t let us reverse the replication flow to make the new production partition the new source machine. After a three-hour call to the replication software vendor, we discovered there was one replication setting that was not set up correctly on the new box. After correcting the mistake, we were finally able to switch and the new production machine started servicing our users.
While our new hardware migration certainly had its share of ups and downs, it provided lessons that can be applied to other shops.
These won't be the only issues you can run into, but I hope that my experience will help your next hardware migration run more smoothly.