Admin Alert: Three Common Problems with CBUs
September 14, 2011 Joe Hertvik
Power i Capacity BackUp (CBU) systems are complicated critters. Aside from the basic tasks of configuring, testing, and enabling a CBU to take over processing from a production box, many different problems can occur with CBU configuration, and even while the CBU is running in replication mode. This week, let’s look at a few not-so-obvious CBU configuration issues that can hurt your CBU or production system setup.
Issue #1: Problems with libraries that start with “Q”
By default, many Power i replication packages do not replicate objects in libraries whose names start with “Q”, such as QGPL, QSYS, and QUSRSYS. This is because many of these libraries contain system objects that are specific to a particular machine, and replicating them may damage the target machine’s setup.
However, there may be certain times when you need to replicate objects from a production system Q library to your CBU system. In particular, you may find that the following items need to be replicated on an exception basis from your QGPL or QUSRSYS libraries in order for your CBU to work correctly when it is running production processing:
Regarding that last item, be careful when replicating the QSTRUP program from one machine to another. By default, the Power i system startup program is named QSTRUP, and people generally keep modified versions of it in QGPL. You can easily clobber your CBU startup program if you’re replicating the entire QGPL library to the CBU and both machines use QSTRUP as their startup program name. For that reason, I usually recommend that you use different startup program names for your production and CBU machines, and that you locate the CBU startup program in a library that isn’t replicated.
Recommended solution: To avoid conflicts and to ensure you correctly replicate all production objects between your source and target machines, the best approach is to keep application objects like these out of QGPL, QUSRSYS, and any other library whose name starts with the letter Q. If that’s not possible, I’d recommend adding exceptions to your HA replication software data groups so that you only replicate relevant objects from QGPL and QUSRSYS, instead of replicating the entire library. Place all future objects in a special production library whose name does not start with a Q.
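To make the exception idea concrete, here is a minimal Python sketch of the selection rule described above. It is not the configuration syntax of any HA replication product. The exception list, the object names in it, and the should_replicate function are all hypothetical, and the sketch assumes for simplicity that everything outside a Q library is replicated.

```python
from fnmatch import fnmatchcase

# Hypothetical exception list: (library, object-name pattern) pairs that
# should be replicated even though the library name starts with "Q".
# These names are illustrative only; they do not come from a real system.
Q_LIBRARY_EXCEPTIONS = [
    ("QGPL", "APP*"),
    ("QUSRSYS", "APPDTAARA"),
]


def should_replicate(library, obj, exceptions=Q_LIBRARY_EXCEPTIONS):
    """Return True if the object should be replicated to the CBU.

    Assumption: objects in libraries that do not start with "Q" are always
    replicated; objects in Q libraries are replicated only when they match
    an explicit exception.
    """
    library = library.upper()
    obj = obj.upper()
    if not library.startswith("Q"):
        return True
    return any(
        library == exc_library and fnmatchcase(obj, pattern)
        for exc_library, pattern in exceptions
    )


if __name__ == "__main__":
    for lib, name in [("PRODLIB", "ORDERS"), ("QGPL", "APPCFG"), ("QGPL", "QSTRUP")]:
        print(lib, name, should_replicate(lib, name))
```

Note that QGPL/QSTRUP is not in the exception list, so the sketch leaves the CBU startup program alone, which is the behavior recommended above.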
Issue #2: System objects that need to be manually replicated
There are specific Power i OS objects and functions that are difficult to replicate and may need to be manually adjusted so that the CBU versions stay in sync with production. These objects may include:
• System host tables that contain host names and associated IP addresses of the hosts your production and CBU boxes contact. On the green screen, you can check your host names by running the Configure TCP/IP (GO CFGTCP) command and then taking Option 10, Work with TCP/IP host table entries. The entries on the two systems should match to ensure that inter-system communication works correctly on your CBU when it is impersonating production.
• Shared storage pools that allocate system memory to different subsystems. Use the Work with Shared Storage Pools (WRKSHRPOOL) command to determine whether your storage pool setup is the same on both systems. Storage pool configuration is a part of your production system’s work management setup. If your production subsystem descriptions are replicated to the CBU, they will be configured to use the same storage pools as the production machine. Because of that, your CBU storage pools should be configured roughly the same as your production system storage pools.
Recommended solution: As a matter of regular CBU maintenance, check to make sure non-replicated system objects such as these are the same between your production machine and your CBU.
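One way to make this regular check less error-prone is to record the settings from each machine and compare them mechanically. The Python sketch below is only an illustration: it assumes you have already transcribed the host table entries (or shared pool settings) from both systems into simple name-to-value mappings, and the host names and IP addresses shown are made up.

```python
def diff_settings(production, cbu):
    """Compare two {name: value} mappings and report where they differ."""
    prod_keys, cbu_keys = set(production), set(cbu)
    return {
        # Present on production but absent on the CBU
        "missing_on_cbu": sorted(prod_keys - cbu_keys),
        # Present on the CBU but absent on production
        "extra_on_cbu": sorted(cbu_keys - prod_keys),
        # Present on both, but with different values
        "mismatched": sorted(
            name for name in prod_keys & cbu_keys
            if production[name] != cbu[name]
        ),
    }


# Hypothetical host table entries (host name -> IP address)
production_hosts = {
    "WMSHOST": "10.1.1.20",
    "EDIHOST": "10.1.1.30",
    "LBLPRT01": "10.1.2.15",
}
cbu_hosts = {
    "WMSHOST": "10.1.1.20",
    "EDIHOST": "10.9.9.30",
}

report = diff_settings(production_hosts, cbu_hosts)
for category, names in report.items():
    print(category, names)
```

The same function works for shared storage pool settings if you key the mappings by pool name. Because the article only asks for pools to be roughly the same, a pool mismatch is a prompt to review rather than an automatic error.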
Issue #3: Spooled file replication gone wrong
Similar to what can happen when users mistake development system output for production data, a CBU can also send fake output to your production system users. Here’s a real-life example that I recently encountered.
One of my warehouse facilities complained that they were printing duplicate shipping labels for a big bulk shipment that was scheduled to go out the next day. There were 14,000 labels to print, but to our surprise, 25,000 to 28,000 labels printed, which included several duplicates of the original 14,000 labels.
We scratched our heads for three days trying to figure out what had happened while our users dutifully separated the first set of labels from the duplicates in their bulk run. Finally, it hit us.
For our CBU and HA solution, we were replicating spooled files from the production machine to the CBU. We did this to ensure that our users could still access their spooled files if we had to switch over to the CBU in an emergency. This meant that for every production output queue, the system tried to replicate any available spooled files over to the same output queue on the CBU.
On the CBU side, it turned out that the QSPL subsystem, which controls all operating system print jobs, was turned on. Further, it also turned out that the same printer output queue that sent labels to our warehouse label printer on the production system was also turned on inside the CBU system. Since the output queue sent production labels directly to our warehouse label printer, the following sequence of events occurred with this particular printer.
• On the production system: While spooled files were waiting in the source output queue to print, they were being replicated over to the same output queue on the CBU.
• On the CBU: Because the same printer device was active on both the production box and the CBU, and the remote output queue attached to the CBU’s printer device was also active, any new spooled files that reached that output queue on the CBU were also sent to the warehouse’s label printer.
• At the warehouse label printer: Labels were received at the printer both from our production system (good labels) and our CBU (replicated labels). The end result was that the staff received duplicated sets of shipping labels interspersed within the same print run.
Once we knew the problem was a second active printer on our CBU, it was easy to fix by turning off the QSPL subsystem that controls printing on that box.
Recommended solution: Be careful running CBU subsystems that communicate with devices or users outside of the CBU. Many of these subsystems, including QSPL (for printers) and QHTTPSVR (for Web sites), may be configured with the same parameters as their counterparts on the production system that the CBU is meant to replicate. If these subsystems are active when the CBU is in replication mode, they could easily send fake data out to your users, the same way a development partition sometimes sends out fake data when programmers are testing various functions.
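As a rough illustration of that check, the Python sketch below flags outward-facing subsystems that are active while the CBU is in replication mode. It assumes you have already gathered the list of active subsystem names on the CBU by some other means. The function name and the watch list are hypothetical, and the watch list holds only the two subsystems named above, so extend it to suit your own shop.

```python
# Subsystems named in this article that talk to devices or users outside
# the CBU. This watch list is illustrative and not exhaustive.
OUTWARD_FACING_SUBSYSTEMS = {
    "QSPL": "printers",
    "QHTTPSVR": "Web sites",
}


def risky_active_subsystems(active_subsystems, replication_mode=True,
                            watch_list=OUTWARD_FACING_SUBSYSTEMS):
    """Return the watched subsystems that are active during replication.

    When the CBU has taken over production processing (replication_mode is
    False), these subsystems are expected to run, so nothing is flagged.
    """
    if not replication_mode:
        return []
    active = {name.upper() for name in active_subsystems}
    return sorted(name for name in watch_list if name in active)


if __name__ == "__main__":
    # Hypothetical snapshot of active subsystems on the CBU
    snapshot = ["QBATCH", "QINTER", "QSPL", "QSYSWRK"]
    for name in risky_active_subsystems(snapshot):
        print(f"{name} is active in replication mode "
              f"(affects {OUTWARD_FACING_SUBSYSTEMS[name]})")
```

In the label printer incident described earlier, a check like this would have flagged QSPL on the CBU before the duplicate labels went out.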