IBM Nears The End of the Road for Server Reliability Improvements
October 21, 2024 Alex Woodie
Reliability has always been one of the hallmarks of IBM big iron, the midrange and mainframe systems hundreds of thousands of companies run their businesses on around the world. These sturdy systems simply don’t go down very often, certainly less frequently than “industry-standard” (i.e. Intel-based) systems. In fact, the System z and Power systems are so reliable that, statistically speaking, there’s very little room for improvement.
For data on uptime and reliability, we turn to Information Technology Intelligence Consulting (ITIC), which is led by longtime IT industry analyst Laura DiDio. ITIC has been conducting an annual server and operating system reliability survey since 2009, which was soon after DiDio left the Yankee Group to create ITIC.
For the 2023 Global Server Hardware, Server OS Reliability survey, which is based on a Web-based poll of individuals representing nearly 1,900 companies around the world, ITIC found that 88 percent of users of IBM’s Power10 servers running IBM i, AIX, or Linux reported 99.999999 percent uptime, or eight 9s of reliability. That corresponds with just 315 milliseconds of unplanned downtime per server per year, “due to underlying system flaws or component failures,” ITIC writes.
The only system that was more reliable than the Power10 was the IBM z16 mainframe running Linux or z/OS. ITIC says that 96 percent of z16 users report that their businesses achieved 99.9999999 percent server uptime, or nine 9s of availability. “This is the equivalent of a near-imperceptible 31.56 milliseconds of per server annual downtime due to any inherent flaws in the server hardware and its various components,” DiDio writes in her report.
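For readers who want to check the arithmetic behind these "nines," here is a minimal Python sketch that converts a count of 9s into unplanned downtime per year. The function name and the year length are our own assumptions, not anything ITIC publishes: we use a 365.25-day year because that is the value that reproduces the 31.56-millisecond figure quoted above.

```python
# Assumption: a 365.25-day year (31,557,600 seconds), chosen because it
# reproduces the 31.56 ms figure quoted for nine 9s.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def annual_downtime_seconds(nines: int) -> float:
    """Return yearly downtime in seconds for an availability of `nines` 9s.

    Five 9s means 99.999 percent uptime, so the unavailable fraction
    of the year is 10 ** -5. (Hypothetical helper for illustration.)
    """
    unavailable_fraction = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for nines in (3, 4, 5, 6, 8, 9):
        seconds = annual_downtime_seconds(nines)
        print(f"{nines} nines: {seconds:.5f} seconds per year")
```

Under that assumption, eight 9s works out to roughly 0.316 seconds per year and nine 9s to roughly 31.56 milliseconds, in line with the figures ITIC reports, while five 9s comes to a little over five and a quarter minutes.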
The rest of the server industry didn’t fare as well as IBM. The Lenovo ThinkSystem running Linux – which used to be the IBM System x server before IBM sold it to the Chinese firm for $2.1 billion back in October 2014 – came in third place with 31.5 seconds of annual downtime, which is six 9s of availability.
[Graphic: ITIC_downtime_report_2023.png. Source: ITIC 2023 Global Server Hardware, Server OS Reliability survey]
The next most reliable systems, per the 2023 ITIC survey, were Linux-based systems from Cisco Systems, Hewlett Packard Enterprise, and Huawei Technologies, which reported an average of 1.27 minutes to 1.39 minutes of downtime annually. That corresponds with six 9s of reliability. The lone occupier of the five 9s club was the Fujitsu Primergy, with 5.9 minutes of downtime.
From there, the downtime went up considerably. In the four 9s club were Dell PowerEdge running Linux (24 minutes of annual downtime); Oracle X86 running Linux (32 minutes); Oracle OpenSolaris (37 minutes); and HPE ProLiant running Linux (39 minutes). The worst performing server in DiDio’s analysis was a generic “white box server” running Linux, with 59 minutes of downtime per year, which puts it in the three 9s club.
So, how do these figures compare to the past? For a good comparison, we look back at some of ITIC’s archived reports, which indicate that both Power and System z have been eking out small enhancements in uptime over the years.
In 2021, ITIC reported that Power servers averaged 1.49 minutes of unplanned downtime per server per year, which is five 9s of availability. Cut another way, ITIC reported that 91 percent of Power9 and early Power10 customers reported five and six 9s of availability, while 94 percent of IBM System z servers reported six and seven 9s of availability.
In 2016, ITIC reported that 61 percent of IBM Power Systems servers and Lenovo System x servers achieved 99.999 percent availability. That corresponds with five 9s of reliability, or about 5.25 minutes of unplanned downtime per server per year, or about as much as the Fujitsu Primergy from the 2023 report.
So to summarize, over the past eight years, IBM i customers have benefited from a reduction in unexpected downtime from about five and a quarter minutes per year to less than one-third of a second. That corresponds to a move from about five or six 9s of reliability to eight 9s of reliability. System z mainframe shops enjoy a 10x advantage in downtime over their Power brethren, at just a scant 31.5 milliseconds per year.
IBM is certainly to be commended for reducing the downtime in its Power and System z servers. Big Blue is renowned for developing systems with better reliability and security than the industry at large, and it’s good to see IBM continuing that tradition.
But the truth is that there is nowhere left to go with the System z, and very little room for improvement for Power. The amount of unplanned downtime is already so minuscule, at just a fraction of a second, that even a 1,000x improvement in the number doesn’t do much.

ITIC 2024 Hourly Cost of Downtime Survey
DiDio recently published a pair of reports on the cost of downtime to companies. The vast majority of companies report that the cost of downtime exceeds $300,000 per year, with 20 percent saying it costs them more than $5 million annually. This is clearly an area where the server industry as a whole has a lot of room for improvement.
But not IBM. With less than a second of downtime per year for the Power10 server, DiDio reports that the average company is really not suffering much of a loss at all. “Power10 corporate enterprises spend just $7.18 per server/per year performing remediation due to unplanned server outages that occurred due to inherent flaws in the server hardware or component parts,” she writes. Mainframe shops spend even less.
So while it might look good on paper for the next generation of the System z mainframe to hit ten 9s of reliability, or for the Power11 servers that supposedly will ship next year to get bumped up to nine 9s of reliability, in real life, those improvements won’t move the needle at all for customers.
Clearly, there are ways that IBM can improve its enterprise systems. Security, which is the cause of a lot of unexpected downtime, is a constantly evolving threat that IBM must pay close attention to, which it does. Errors in applications and problems with data are also factors that can ultimately result in downtime. IBM’s systems business isn’t responsible for anything that far up the stack, and frankly, the cause of a lot of those errors can be traced to humans.
But when it comes to building servers themselves – the collection of processors, RAM, drives, power supplies, network adapters, and the firmware that glues it all together – IBM currently is building the most reliable systems the world has ever seen, which is definitely something to cheer for.
ITIC is an armchair analyst (it’s a one-person company) hired by IBM to create anonymous surveys using Survey Monkey, and anyone can fill out the responses without providing any proof of identity or company details. There are not thousands of customers across the world filling out the survey! It’s ITIC filling in what IBM wants to show. It’s quite evident from the ITIC findings that of course IBM is on top, while its biggest competitors are on the bottom. There is no substantiated proof of these numbers, and they should not be relied upon!