IBM Issues PTFs to Patch RAID Controllers
by Timothy Prickett Morgan
I told you last October that I had heard some feature 2757 PCI-X RAID 5 disk controllers were on the fritz and that IBM had issued some PTFs to do diagnostics on the boards, to find failing components before the feature 2757 card failed and took down customers' machines. While that problem has apparently been solved, there are still issues with the feature 2757 and feature 2780 RAID controller cards.
On Friday, February 4, IBM sent a technical support bulletin to its business partners alerting them that if they have customers with mirrored disk arrays that are also using the feature 2757 or feature 2780 RAID controllers, they had better get in gear and install some PTFs to patch the microcode on those controllers. While electronic components within the feature 2757 and feature 2780 cards are not failing in the field this time around (as was the case last fall with the feature 2757 controllers, according to my sources at IBM's Rochester iSeries labs), they do have microcode issues that can nonetheless, under very specific circumstances, cause system crashes.
The text of the IBM tech support bulletin was brief and, as usual, did not provide much in the way of an explanation of the problem:
All users with a mirrored system AND with 2757 or 2780 IOA controllers
It has been discovered that, given certain conditions, systems that you have setup with Mirroring AND have 2757 or 2780 IOA controllers could result in system down time if a disk failure occurs. We have fixes available for both V5R3 (MF34472) and V5R2 (MF34589). They have been designated as HIPER PTFs and should be downloaded and applied to your systems along with scheduling a system IPL at your earliest possible date to remove the threat of exposure from this issue on your systems. Also, please remember to apply these PTFs and IPL on ALL of your V5R2 and V5R3 partitions on LPAR systems.
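For shops juggling several partitions, the bulletin boils down to a lookup from release to PTF. Here is a minimal sketch in Python of that lookup. Only the two PTF numbers come from the bulletin; the partition names, the inventory, and the function name are hypothetical and stand in for whatever inventory a shop actually keeps.

```python
# HIPER PTF numbers as quoted in the IBM bulletin (release -> PTF).
HIPER_PTFS = {"V5R3": "MF34472", "V5R2": "MF34589"}


def ptf_for_release(release):
    """Return the HIPER PTF for a release, or None if the bulletin lists none."""
    return HIPER_PTFS.get(release.upper())


# Hypothetical LPAR inventory: partition name -> OS release.
partitions = {"PROD": "V5R3", "TEST": "V5R2", "OLD": "V5R1"}

for name, release in partitions.items():
    ptf = ptf_for_release(release)
    if ptf:
        print(f"{name}: apply {ptf}, then schedule an IPL")
    else:
        print(f"{name}: no PTF listed in the bulletin for {release}")
```

The point of the loop is the bulletin's last sentence: every V5R2 and V5R3 partition on an LPAR system needs its own PTF and its own IPL.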
To better explain what is going on now with these two RAID controllers, let's talk about what they are, what went wrong with the feature 2757 card, and what still seems to be the problem with the feature 2780s.
The feature 2757 RAID 5 controller was announced in January 2003. It has 235 MB of write cache memory, and with its data compression turned on this cache memory turns into what is effectively (or, depending on whether or not yours crashed, maybe ineffectively) a 757 MB cache. This Ultra3 SCSI controller also supports a RAID 5 set with a minimum of three drives, but that RAID set can be expanded to 18 drives. Considering how disk drives have gotten increasingly capacious over the years, being able to make a RAID set with only three drives is a lot better than the 10-drive limit on feature 2778 and feature 4778 Ultra2 SCSI RAID 5 controllers. The feature 2757 controller supports up to four SCSI buses, which run at 160 MB/sec, compared with 80 MB/sec on the older Ultra2 SCSI cards. The maximum PCI burst rate on the feature 2757 controller is 532 MB/sec, four times that of the prior card. The compressed write cache, at 757 MB, is more than seven times as large as the 104 MB effective cache on the feature 2778 and feature 4778 cards. The new controller also supports SCSI bus tagged command queuing, which yields faster response times under heavy loads, and it has new hardware-assisted array parity checking and cache memory scrubbing algorithms that are five times faster than those on prior cards. The net effect of using the new controller plus 15K RPM disk drives (also new in January 2003) and the PCI-X slots or expansion towers was that customers could see the performance of their disk subsystems improve by a factor of three. Feature 2757 PCI-X RAID 5 controllers plug into second-generation iSeries and i5 models in their internal PCI-X slots or in PCI-X slots in I/O towers. Older iSeries machines must attach these new cards to their servers through I/O towers, since older iSeries machines did not support PCI-X slots.
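The ratios quoted for the feature 2757 are easy to check with a little arithmetic. The sketch below uses only the figures given above; the variable names are mine, and the prior card's PCI burst rate is not stated anywhere, so it is merely implied by the "four times" claim.

```python
# Figures quoted for the feature 2757 and its predecessors (MB and MB/sec).
raw_write_cache = 235        # feature 2757 write cache, uncompressed
compressed_cache = 757       # feature 2757 effective cache with compression
old_effective_cache = 104    # feature 2778/4778 effective cache
scsi_bus_new = 160           # Ultra3 SCSI bus speed
scsi_bus_old = 80            # Ultra2 SCSI bus speed
pci_burst_new = 532          # feature 2757 maximum PCI burst rate

compression_ratio = compressed_cache / raw_write_cache
cache_gain = compressed_cache / old_effective_cache
bus_gain = scsi_bus_new / scsi_bus_old
# Assumption: derived from "four times that of the prior card", not a quoted spec.
implied_old_pci_burst = pci_burst_new / 4

print(f"effective compression ratio: {compression_ratio:.2f}x")
print(f"cache versus prior cards:    {cache_gain:.2f}x")
print(f"SCSI bus speed gain:         {bus_gain:.1f}x")
print(f"implied prior PCI burst:     {implied_old_pci_burst:.0f} MB/sec")
```

So the compression buys a bit more than a threefold boost in effective write cache, and the "more than seven times" comparison with the older cards holds up.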
The feature 2780 card is a variant of the feature 2757 card; it was launched as part of the second wave of eServer i5 announcements back in July 2004. This RAID 5 controller is the same as the existing feature 2757 controller, but it adds 1 GB of read cache on top of the write cache. Feature 2780 was designed to replace a special RAM disk offering from a few years back that was used to boost the batch performance of iSeries machines. The feature 2780 controller was initially available only on the i5 520 and 570 servers, and it was then rolled out to the remaining i5 550 and 595 servers in the fall. At about the same time last year, after numerous customer requests, IBM made the feature 2780 card work on first-generation iSeries 270, 820, 830, and 840 machines, as well as on second-generation iSeries 800, 810, 825, 870, and 890 boxes.
According to my source, the original problem with the feature 2757 card was a faulty electronic component, which in some cases could leave the battery-backed write cache on the controller unable to flush itself to disk so it could accept new data. This could cause OS/400 objects to fail, and thanks to the single-level storage architecture of OS/400, failing objects can cause a big crash; sometimes they can even cause a crash so bad that you have to completely reload the system. In fact, according to my source, some customers lost objects, and some had to rebuild their systems from tape. Once the problem was discovered, IBM went through its sales and configuration database, figured out who had the faulty 2757 cards, and got them replaced. Just for good measure, it released two PTF diagnostic tools to check for bad 2757s. (That's PTF MF33849 for OS/400 V5R2 and MF33850 for i5/OS V5R3.)
This kind of failure is one of the reasons why IBM has been recommending that OS/400 shops that absolutely cannot deal with this kind of outage should mirror their disk subsystems at the bus level. This way, if one RAID 5 set blows, the other one is there, and single-level store is not corrupted in any way.
But there was a catch, and IBM has only just now figured it out. As IBM's techies looked over the microcode for the feature 2757 and 2780 controllers, they saw that there were still conditions under which even mirrored disk arrays using these sophisticated RAID 5 controllers could cause a system crash if a disk drive or the caches failed in a RAID group. What exactly those conditions are, IBM isn't saying. But the February 4 patches are all about fixing whatever the problem is.