Where Did My Faulting Guidelines Go?

January 15, 2014 Doug Mewmaw

I remember when I first started in the industry how difficult performance management was, especially in the area of memory. The good news was IBM provided clearly defined service levels with regard to memory pools.

I’m not sure what year I discovered the guidelines, maybe in the 1980s, but I do remember that the published guidelines were so easy. In fact, I can still see the faded yellow sticky note on my desk with the following guidelines.

After managing the box for years and attending numerous performance sessions and the like, I discovered the guidelines were not only not talked about openly in sessions, but finding a faulting rate guideline in a published performance book was like finding an old newspaper ad where milk was $1.10 a gallon. You were so happy you found a guideline, but when you realized the information you found was so old, you went back on your endless journey to find the holy grail of guidelines.

Let me digress for a moment with an amazing story regarding two performance management gurus within the IT industry. One is a well-known IBM performance guru who I won’t name because I don’t want to embarrass him. The second person, who I don’t mind embarrassing, is Joe Camilli of Midrange Performance Group (MPG).

Both have amazing credentials in the industry. The IBM performance guru has worked for IBM since the 1960s and has specialized in helping all of us understand performance management via published books or working problems in the trenches. Joe developed the incredible functionality for IBM that we know as PM/400. Joe is now the creator of one of the industry’s leading performance management software products, MPG’s Performance Navigator.

I must be honest, as I have worked for MPG for nine years. Like most of you out there, I spent those early years just trying to figure out performance management. Way before I worked for MPG, I was blessed to receive many valuable hours of performance training from sessions with both Joe and the IBM guru. I think it’s safe to say that both gentlemen know more about performance management than I will probably ever know.

The very humble Joe Camilli tells a great memory guideline story. Remember this is the guy that created the vehicle that helped us all learn performance at a deeper level. I’m paraphrasing but this is the gist of what Joe said:

Years ago, IBM did publish guidelines for system-wide faults per second. Back in the ’80s on the S/38, as I learned more about the system, I began to wonder why there were system-wide faulting guidelines. It occurred to me that a single job running in a single, too-small pool could cause sufficient faulting to exceed the system-wide guideline. Logically, it seemed that the high faulting in the single pool would only affect the job(s) running in that pool. But there was a system-wide guideline that was being exceeded. This meant that the one job must be causing some system-wide problem, and I didn’t understand this. The question weighed so heavily on me that I was losing sleep. Eventually, at an IBM technical conference, I mustered the courage to approach one of IBM’s performance gurus. The IBM guru paused, smiled, and said, “Yeah, we shouldn’t have system-wide faulting guidelines.”

Joe added, “Sometime later they stopped publishing the guidelines.”

I love the process improvement story. Two performance gurus who both knew system-wide faulting guidelines were not a good idea when managing memory pools.

So we now know the answer to the question: Where did my faulting guidelines go?

The answer is that Joe Camilli, along with IBM performance folks, and probably many other aspiring performance experts, figured out that system-wide guidelines did not work the same for every system.

Today there are no longer any pool best practice guidelines, with the exception of machine pool faulting, which is 10 faults per second or less.

The ironic thing for me is that the faded guideline note worked perfectly at my old company. That is, my interactive response time service levels were really good when my faulting rates were under 200 faults per second. The problem is, your company’s response time might suffer when the interactive pool goes over 120 faults per second. Again, that is why IBM stopped publishing faulting rate guidelines.

So What Do We Do?

Do we just ignore memory? If you do, make sure your resume is up to date. Basically you have two options:

1. You can do a capacity planning project, pay an exorbitant amount of money for a ton of memory, and only worry about CPU and disk. As someone who has visited 100-plus companies worldwide, I assure you that buying performance is not a good idea; eventually performance problems reappear with normal system growth. Plus, CIOs don’t like to spend a fortune when they don’t have to.
2. Learn how to manage memory in each pool.

Option number two is the answer I recommend, as it’s the essence of what system administrators must do. It’s not an easy task, but each system administrator must be able to ensure system performance service levels for ALL core performance components are within best practice guidelines.

Read that last statement above again.

For those of you wondering if after recently moving to Colorado that I’m writing this article while taking advantage of the new Colorado marijuana law, I assure you I am not. Here are the hard truths when working with the memory component:

You must inventory what jobs are running in each pool.
You must measure pool performance at all times. And you must be able to answer questions like: What does the interactive pool look like on a Monday? Tuesday? Wednesday?
You must know the faulting rates when the system is busy. You should know what to expect at month end, during the Christmas season, etc.
When night batch processing took longer than expected, you must analyze all pools and answer these questions: How much memory was in the pool? What was the faulting rate? Is memory automatically moving in and out of pools?
And so on.
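To make the "what does the pool look like on a Monday?" question concrete, here is a minimal Python sketch that rolls faulting-rate samples up by pool and day of the week. Everything in it is an assumption for illustration: the sample values are made up, the function name `profile_by_weekday` is hypothetical, and how you collect the samples from your own system is left out entirely.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Made-up samples: (timestamp, pool name, faults per second).
samples = [
    ("2014-01-06 10:00", "*INTERACT", 120.0),
    ("2014-01-06 14:00", "*INTERACT", 180.0),
    ("2014-01-07 10:00", "*INTERACT", 90.0),
    ("2014-01-07 02:00", "*BASE", 40.0),
]

def profile_by_weekday(rows):
    """Return {(pool, weekday): (average rate, peak rate)}."""
    buckets = defaultdict(list)
    for stamp, pool, rate in rows:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        # weekday() returns 0 for Monday through 6 for Sunday.
        buckets[(pool, DAYS[when.weekday()])].append(rate)
    return {key: (mean(vals), max(vals)) for key, vals in buckets.items()}

profile = profile_by_weekday(samples)
for (pool, day), (avg, peak) in sorted(profile.items()):
    print(f"{pool:10} {day:10} avg={avg:6.1f} peak={peak:6.1f}")
```

The same grouping idea works for month end or a holiday season: swap the weekday key for whatever calendar label matters to your business.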

Do you remember back in the day when IBM told us to tune the pool until performance suffers, then throttle it back? The performance advice is right on the money, but one can get there a different way.

Start by measuring your systems (all pools) on a good day and on a bad day.

You will be amazed when you start to see patterns. In fact, on a bad day (when performance suffers, customers call in), you might see performance suffer when *interactive is at a certain faulting rate, if memory is indeed the issue. The more good and bad days you measure, the more you will learn about your system, and best of all, your system performance faulting rate guidelines will develop over time. For instance, take a look at the table below. In this example, one might create a service level for the interactive pool to ensure the faulting rate is well below the 300 faults per second guideline.
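One rough way to turn good-day and bad-day measurements into a starting guideline is sketched below in Python. It is only a heuristic I am assuming for illustration, not an IBM method: it looks for a gap between the highest faulting rate seen on good days and the lowest rate seen on bad days. The function name `suggest_guideline` and all the numbers are hypothetical.

```python
def suggest_guideline(good_day_peaks, bad_day_peaks):
    """Return (highest good-day rate, lowest bad-day rate), or None.

    The two numbers bracket the range where a guideline could sit.
    None means the good and bad days overlap, which hints that the
    faulting rate alone does not explain the bad days.
    """
    highest_good = max(good_day_peaks)
    lowest_bad = min(bad_day_peaks)
    if highest_good >= lowest_bad:
        return None
    return (highest_good, lowest_bad)

# Made-up peak faults per second for one pool.
good_days = [140, 180, 210, 250]
bad_days = [310, 340, 420]

bracket = suggest_guideline(good_days, bad_days)
if bracket is None:
    print("No clean separation; memory may not be the issue.")
else:
    low, high = bracket
    print(f"Guideline candidate: between {low} and {high} faults/sec")
```

With more good and bad days in the lists, the bracket tightens, which is the "guidelines develop over time" idea in practice.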

Once you have established your guidelines for each pool, you can begin grading each pool to ensure the faulting rate is below your company best practice guideline.

For instance, as illustrated in the graphic below, when we measure for an entire month, we can see that we met our defined faulting rate guideline 96 percent of the time. Our performance management grade in this instance: Solid A!
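The grading idea can be sketched in a few lines of Python. The letter-grade cutoffs (90, 80, 70, 60) are my own assumption borrowed from a school report card, and the sample data is made up so that it lands on the 96 percent figure mentioned above; the function names are hypothetical.

```python
def percent_within(rates, guideline):
    """Percent of measured intervals at or below the guideline."""
    met = sum(1 for rate in rates if rate <= guideline)
    return 100.0 * met / len(rates)

def letter_grade(pct):
    """Map a percentage to a letter grade (cutoffs are an assumption)."""
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if pct >= cutoff:
            return letter
    return "F"

# Made-up month of interval samples for one pool, in faults per second:
# 96 intervals under the guideline and 4 over it.
month_rates = [150.0] * 96 + [350.0] * 4
guideline = 300.0

pct = percent_within(month_rates, guideline)
grade = letter_grade(pct)
print(f"Met guideline {pct:.0f}% of the time: grade {grade}")
```

Run it per pool and per month, and you have a simple report card to track over time.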

Here is the bottom line. Managing memory is not easy. Even if IBM were providing guidelines, one must still take the following steps. Remember to:

Measure all pools on good days. This process never ends. The more historical data the better!
Measure all pools on bad days. This process also never ends. Again, the more historical data the better.
Measure all pools during peak periods, such as month end, Christmas, etc.
Establish guidelines from real historical data.
Be flexible. Guidelines might change as your company grows, new hardware is installed, when an application change occurs, or simply during your peak season.

The bottom line is measure, measure, measure! Historical data is your best friend.

Last week I discovered I lost 4 pounds by watching what I was consuming AND by hitting the gym for an hour every other day. As my stunning and fit wife has always told me, there is no magic pill. I hate that she’s always right. I’m afraid that getting your memory component in shape on today’s Power Systems also requires hard work and commitment.

And you can quit looking for the 2014 faulting rate guideline book. It’s not out there.

Doug Mewmaw is a “jack of all trades” IT veteran who currently is director of education and analysis at Midrange Performance Group, an IBM i business partner that specializes in performance management and capacity planning. Send your questions or comments for Doug to Ted Holt via the IT Jungle Contact page.
