Admin Alert: Keep Your Data Synced Up During an HA Switch Over
Published: July 14, 2010
by Joe Hertvik
When performing high availability (HA) switch exercises where production processing is temporarily switched from a live system to an HA system and back again, there is one essential procedure that must be followed or you risk screwing up your databases and losing data. Here's the issue and how to avoid it.
Rule #1: Don't Do This!!!
In my humble opinion, this is the cardinal rule for an HA switch over:
Don't update your production data without replication running.
It's a good, simple rule to follow, but it's also an easy rule to mess up. Whenever you switch processing from a production system to a Power i Capacity Backup (CBU) system, there is a simple sequence you must follow or else you will lose any last-minute production data. The sequence is this:
- Shut down all processing that affects your production databases. This includes interactive users; batch jobs; Website interfaces; Systems Network Architecture Distribution Services (SNADS) jobs; SQL jobs using ODBC, OLE, or JDBC; and any other jobs where data is updated on your production system.
- Allow your replication software to catch up so that all recent changes have been transmitted from the production system to the CBU and the CBU database is up to date with production.
- Switch production processing over to the CBU.
This may sound straightforward, but it is a deceptively easy process to violate. During a recent test, all it took to knock our CBU database out of sync was accidentally leaving some Web servers up for 10 minutes after the replication software was shut down. Several orders came into the system, and these orders were not replicated to the CBU before we attempted the switch over. Because of this, we had to stop and reconcile the databases between the production and CBU machines, which took several hours and led us to cancel our planned switch over.
To prevent this from happening in any CBU role switches, here are three tips you can use to modify your CBU run book and configuration to ensure that all production data is replicated to the CBU before you switch processing.
Tip #1: Separate Replication and Production IP Traffic
You can avoid this issue by using separate IP addresses for production traffic and for CBU replication. In a simple HA scenario, you would have only the following two IP addresses active on your production system.
- IP address one (xxx.xxx.xxx.001) services production traffic on your system. All interactive users, Web servers, clients, and other partner machines communicate over the .001 interface.
- IP address two (xxx.xxx.xxx.002) services all replication tasks that transmit information between the production machine and the CBU. The .002 interface is dedicated to production-to-CBU traffic only.
By segmenting traffic this way, you can shut down all outside IP production traffic simply by taking down the .001 IP interface with the following End TCP/IP Interface (ENDTCPIFC) command.
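As a sketch, assuming the production interface is defined with the placeholder address used above (substitute the real .001 address from your own configuration), the command takes this form:

```cl
/* Ends only the production interface; the .002 replication interface stays up. */
/* 'xxx.xxx.xxx.001' is the article's placeholder, not a real address.          */
ENDTCPIFC INTNETADR('xxx.xxx.xxx.001')
```

The same interface can be brought back later with the Start TCP/IP Interface (STRTCPIFC) command and the same INTNETADR value.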
Ending your production TCP/IP interface separately from your CBU interface ensures that no outside clients or servers are updating production data as you start the switch over process. The separate .002 interface also allows you to finish replicating all production transactions to the CBU after the system goes quiet.
Tip #2: Checking CBU System Integrity
Many software packages contain integrity reports where you can list out the number of records in key system databases or create an aggregate dollar amount of all production orders or your inventory value. One way to partially double-check that production and CBU databases are in sync is to run and double-check your integrity reports on both systems after the production system is quieted. If you find that these totals are out of sync, it is a red flag that something is wrong with replication and you can delay the switch until you find the problem.
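If your software package has no integrity report, a crude stand-in is to compare record counts for a few key files. The CL sketch below retrieves the current record count of one file and sends it as a message. PRODLIB and ORDHDR are hypothetical names, and a matching count on both systems is only a rough signal, not proof that the data is identical.

```cl
PGM
  DCL VAR(&NBRRCD)  TYPE(*DEC)  LEN(10 0)
  DCL VAR(&NBRCHAR) TYPE(*CHAR) LEN(10)

  /* PRODLIB/ORDHDR is a hypothetical library and file; use your own key files. */
  RTVMBRD FILE(PRODLIB/ORDHDR) MBR(*FIRST) NBRCURRCD(&NBRRCD)

  /* Convert the decimal count to character so it can be put in a message. */
  CHGVAR VAR(&NBRCHAR) VALUE(&NBRRCD)
  SNDPGMMSG MSG('ORDHDR current records: ' *CAT &NBRCHAR)
ENDPGM
```

Run the same program on the production system and on the CBU after production is quiet, then compare the two messages.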
Tip #3: A Blueprint for Shutting Down Replication
Ensuring file synchronization during a switch over is a run book process issue. The solution is making sure that you have working procedures in place for shutting off the production system correctly and for ensuring that production and CBU data are synchronized before you switch roles. Here's a rough blueprint that you can use in a run book for shutting down a production system during switch over.
1. Start by ending QINTER interactive processing. Issue the following End Subsystem (ENDSBS) command to take down your interactive subsystem.
ENDSBS SBS(QINTER) DELAY(120)
This will give each user two minutes to finish their work before the system ends all interactive processing in QINTER. If you have multiple interactive subsystems running on your partition, run this command for each subsystem.
2. If you have SNA traffic on your system, end the QSNADS subsystem.
ENDSBS SBS(QSNADS) DELAY(120)
This shuts down SNADS processing, again allowing two minutes for any active jobs to finish.
3. Shut down any i/OS Web servers that are running in the QHTTPSVR subsystem, using this command:
ENDSBS SBS(QHTTPSVR) DELAY(120)
4. Shut down any TCP/IP servers that may be exchanging data with the outside world. Make sure that you specify each server you want to shut down instead of using the default value of all servers (*ALL) on the End TCP/IP Server (ENDTCPSVR) command. To end the TCP/IP server for i/OS FTP, for example, issue the following command:
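A minimal form of that command, naming only the FTP server, looks like this:

```cl
/* Ends just the FTP server; other TCP/IP servers keep running. */
ENDTCPSVR SERVER(*FTP)
```

Other servers have their own special values on the SERVER parameter (*TELNET and *SMTP, for example). Prompt the command to see which values are available on your release.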
End other TCP/IP servers, as needed.
5. iSeries, System i, and Power i systems use QZDASOINIT pre-start jobs to process SQL requests from clients using ODBC, JDBC, OLE DB, or other connectivity techniques. These jobs generally run in the QUSRWRK subsystem, but they can sometimes run in the QSERVER subsystem. Use this End Prestart Jobs (ENDPJ) command to end these servers.
ENDPJ SBS(QUSRWRK) PGM(QSYS/QZDASOINIT) OPTION(*CNTRLD) DELAY(120)
6. After steps 1 through 5 are completed, you can end the production IP interface (.001) as described above (if you have separate production and CBU IP interfaces).
7. End batch processing in the QBATCH subsystem by running this command:
ENDSBS SBS(QBATCH) OPTION(*CNTRLD) DELAY(*NOLIMIT)
By ending this subsystem in a controlled manner (*CNTRLD) with an unlimited controlled delay time of *NOLIMIT, i/OS allows all currently running QBATCH jobs to complete before ending the subsystem.
Repeat this command for any other subsystems that are running batch jobs.
8. After all of your batch jobs are finished running, check your HA software to ensure that all pending replication entries have been transferred from the production system to the CBU.
9. End all subsystems on the production machine to ensure that no more processing is occurring.
10. If you have set up integrity reports, run the reports on both systems and validate that the systems are in sync. If they are not in sync, investigate and correct.
11. Proceed with your switch over process.
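Steps 1 through 7 can be collected into a single CL program so that the shutdown order is the same on every exercise. The following is a sketch under the assumptions already stated in this article: the subsystem names and delays come from the steps above, the interface address is the article's placeholder, and error handling (MONMSG) is left out for brevity.

```cl
PGM
  /* Steps 1-3: end the interactive, SNADS, and HTTP subsystems. */
  ENDSBS SBS(QINTER)   OPTION(*CNTRLD) DELAY(120)
  ENDSBS SBS(QSNADS)   OPTION(*CNTRLD) DELAY(120)
  ENDSBS SBS(QHTTPSVR) OPTION(*CNTRLD) DELAY(120)

  /* Step 4: end named TCP/IP servers only, never the *ALL default. */
  ENDTCPSVR SERVER(*FTP)

  /* Step 5: end the SQL server prestart jobs. */
  ENDPJ SBS(QUSRWRK) PGM(QSYS/QZDASOINIT) OPTION(*CNTRLD) DELAY(120)

  /* Step 6: end the production interface (placeholder address). */
  ENDTCPIFC INTNETADR('xxx.xxx.xxx.001')

  /* Step 7: let running batch jobs finish, then end QBATCH. */
  ENDSBS SBS(QBATCH) OPTION(*CNTRLD) DELAY(*NOLIMIT)
ENDPGM
```

Keep in mind that ENDSBS returns control to the program as soon as the end request is accepted; the subsystems themselves end afterward. Steps 8 through 11 therefore remain manual checks, and an operator should confirm that the subsystems and batch jobs have actually ended before verifying replication.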
Don't Forget Synchronization on Switch Back
These three tips also apply when your CBU switch over exercise is complete and you are ready to switch processing back from the CBU to the production machine, so don't forget to add these items to your run book switch back procedures.
It's always the little things that get you, but if you implement some of the techniques here, you shouldn't get caught with out-of-sync data during a planned HA switch over.