Is Perception the Reality?

June 13, 2011 Doug Mewmaw

On a recent training engagement, I visited a very popular American company. This company fit the profile of ones that I have visited before. In this case, the customer had a Power System 570 running i5/OS V5R4 and IBM i 6.1, with CPU redlining at 100 percent utilization and resources shared between LPARs. Support consisted of a help desk (Level 1 support) and a systems analyst (Level 2 or 3), and the entire IT support staff for the IBM i environment was four system admins.

During the week-long class, we talk about best practice performance guidelines and teach the performance management methodologies for creating an environment where we not only have an efficiently running system, but where all internal and external customers are happy. One of the vital things we teach in class is the importance of structured monthly reporting. The goal is to give management peace of mind that the systems are being managed within best practice guidelines. Even more important, structured monthly reporting puts everyone on the same page with regard to performance. The following graphic is a great real-life example where we show the response time service level and how the metric has been performing historically:

Here we show where response time is trending upward. When we look at the previous month, we graded the metric at about a C+ (78 percent). If our kids bring home a C+ grade, as parents we start to investigate why. Do we need a tutor? Is our child turning in the homework? Managing system performance calls for exactly the same mentality. In this example, we definitely have a tuning opportunity.
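One way to arrive at a grade like this, sketched in Python under stated assumptions: score the month as the percentage of measured intervals that met the response time service level, then map that percentage onto a school-style letter scale. The 1.0 second service level, the sample values, and the grade cutoffs are all illustrative. The article does not say how the 78 percent was actually computed.

```python
def percent_within_service_level(samples, service_level):
    """Return the percentage of samples at or below the service level."""
    if not samples:
        raise ValueError("need at least one sample")
    met = sum(1 for s in samples if s <= service_level)
    return 100.0 * met / len(samples)


def letter_grade(score):
    """Map a 0-100 score to a letter grade (hypothetical school-style cutoffs)."""
    cutoffs = [(97, "A+"), (93, "A"), (90, "A-"),
               (87, "B+"), (83, "B"), (80, "B-"),
               (77, "C+"), (73, "C"), (70, "C-"),
               (60, "D")]
    for minimum, grade in cutoffs:
        if score >= minimum:
            return grade
    return "F"


# Hypothetical interval averages (seconds) for one month.
response_times = [0.6, 0.8, 1.4, 0.7, 0.9, 1.2, 0.5, 0.9, 0.8]
score = percent_within_service_level(response_times, service_level=1.0)
print(f"{score:.0f} percent -> {letter_grade(score)}")
```

With these made-up samples, seven of the nine intervals meet the service level, which lands in the C+ band of the assumed scale.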

For a lot of you out there, structured monthly reporting is not done on a regular basis. Worse yet, some customers tell me they’ve never done structured monthly reporting. My analogy would be sending your child to school and never having a report card. How would you ever know what level your child (or your company) is performing at?

When we encounter performance issues like the one above, I always ask a key performance management question:

What are your performance metric service levels?

In other words:

What is your service level for response time? (<1 second? <1/2 second?)
What is your service level for disk arm percentage? (Is it 15 percent? 25 percent? 40 percent?)
What is your service level for disk response time? (Is it four milliseconds? Five milliseconds?)
And so on…
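Writing the answers down is the first step. A minimal Python sketch of what that could look like, assuming hypothetical metric names and limits (every shop has to choose its own numbers): the service levels live in one place, and each month's measurements are checked against them.

```python
# Hypothetical service levels; these names and limits are illustrative only.
SERVICE_LEVELS = {
    "response_time_seconds": 1.0,
    "disk_arm_percent": 25.0,
}


def find_violations(measured, service_levels):
    """Return {metric: (measured, limit)} for each metric above its limit."""
    violations = {}
    for metric, limit in service_levels.items():
        value = measured.get(metric)
        if value is not None and value > limit:
            violations[metric] = (value, limit)
    return violations


# Hypothetical monthly averages.
measured = {"response_time_seconds": 1.3, "disk_arm_percent": 18.0}
for metric, (value, limit) in find_violations(measured, SERVICE_LEVELS).items():
    print(f"{metric}: {value} exceeds service level {limit}")
```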

More often than not, everyone in class starts looking at each other for answers. In those cases where there is no structured monthly reporting, I ask the following question:

How do you measure performance on your system?

In other words, how do you know when the system is performing well? Here is the answer I hear way too often:

When the phone rings, we know we have a problem. When it doesn’t, we know we are OK.

Which leads me to my next question: Is the perception really the reality?

Let me share a story from my first job. I worked at a full-service gas station when gasoline was $0.51 a gallon. This was 1976.

When gasoline reached $1 per gallon, people were outraged. People then stopped being mad. When gasoline reached $2, there was more outrage. People stopped being mad. When gasoline reached $3, people again were outraged, but later the outrage subsided. Why does this phenomenon always occur?

When gasoline reached a new price barrier (e.g., the first time it hit $1), customers were definitely upset. In the chart below, notice that gasoline was $1 to $2 for 672 weeks! The problem is that the new price barrier becomes the norm. People simply get used to the higher prices.

Historical Gasoline Prices

Period                           Average Price (per gallon)          Number of Weeks
1950 – 1959                      $0.19 to $0.26                      520
1960 – 1969                      $0.31 to $0.35                      520
1970 – 1979                      $0.36 to a little less than $1      520
1980 – 1989                      About $1                            520
May 1992 – January 2005          $1.10 – $1.99                       672
February 2005 – October 2007     $2.00 – $2.99                       155
November 2007 – February 2011    > $3.00                             156

Source: Buzzel.com and Energy Information Administration

I use this simple analogy to explain the fallacy of measuring your system based on whether the phone is ringing. Here is what actually happens when companies measure performance based on phone calls.

When users have bad response time, they indeed call the help desk. The problem is that when system problems don’t get fixed, guess what happens? The problem environment becomes the norm.

In other words:

Customers are familiar with a sub-second response time (Gasoline <$1)
The company grows and various application changes are made
Now those same customers week after week experience response time that is routinely one to two seconds (Gasoline is now $1 to $2)
After a time, no one calls in because the new service level is accepted by the user community as the new normal
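The drift described above becomes visible once each week is compared against a fixed service level. A minimal Python sketch with made-up weekly averages: a week-over-week check, which behaves much like waiting for the phone to ring, never fires because each step is small, while the comparison against a fixed limit flags every week past it. The function names, the 1.0 second service level, and the 0.25 second step are all assumptions for illustration.

```python
def weeks_over_service_level(weekly_avgs, service_level):
    """Return indexes of weeks whose average exceeds the fixed service level."""
    return [i for i, avg in enumerate(weekly_avgs) if avg > service_level]


def weeks_with_sudden_jump(weekly_avgs, max_step):
    """Return indexes of weeks that rose more than max_step over the prior week."""
    return [i for i in range(1, len(weekly_avgs))
            if weekly_avgs[i] - weekly_avgs[i - 1] > max_step]


# Hypothetical weekly average response times (seconds) creeping upward.
weekly = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5]

# No single week jumps by more than 0.25 seconds, so this list is empty.
print(weeks_with_sudden_jump(weekly, max_step=0.25))

# Against the fixed 1.0 second service level, weeks 3 through 7 are flagged.
print(weeks_over_service_level(weekly, service_level=1.0))
```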

This phenomenon is very common in IT shops that support legacy software. I remember in my old job when my company bought another company. We inherited a new system where the average response time was three to five seconds. Being accustomed to a sub-second response time my entire career, I was shocked at the accepted service level. When I asked a few customers about the response time, I heard responses like:

“The system was a lot faster in the old days.”
“We wish it was faster. We just kind of got used to the slower system.”

Therein lies the problem. The problem environment becomes the norm.

I’ve seen the same phenomenon in other areas of performance management. In some environments, companies utilize paging software when thresholds are hit. I’ve talked to system analysts who have learned to ignore support pages because they already know it’s not a problem they can fix: “The pager always goes when job XYZ is run.”

One of the things that I think is a lost art in the industry is the ability to understand the impact of change to the system. In the above examples, the system needs to be analyzed with historical data. Someone needs to ask:

When did response time increase?
What changed on the system?
What other performance metrics were affected?
Is my workload the same?
And so on…

One technique I like is to measure all core performance metrics BEFORE major changes to applications, the OS, or hardware. I measure CPU, CPW, disk I/O, disk percentage, faulting rates, and number of jobs.

Here we see a page from my Response Time analysis:

This graphic looks at response time historically:

Before the change (.38 seconds)
After the change (1.31 seconds)

Did you notice that after the change, response time increased 246 percent?
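The arithmetic behind a before-and-after comparison is simple enough to sketch in Python. The function name is hypothetical. With the rounded figures quoted above, it yields roughly 245 percent; the 246 percent in the text presumably comes from the unrounded measurements, which is an assumption.

```python
def percent_change(before, after):
    """Return the change from before to after as a percentage of before."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


# Rounded response times quoted in the article (seconds).
change = percent_change(before=0.38, after=1.31)
print(f"Response time changed by {change:.1f} percent")
```

The same function applies to any metric captured before and after a change, such as faulting rate or job run time.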

I teach that we should not wait for phone calls. After major changes are implemented, we measure all the components that can affect performance. It is important to always consider that a change in one area may negatively affect another area. That’s why we should measure all components before and after a change.

During this analysis, we measured the faulting rate on the system:

Notice in our real-life example, the faulting rate increased 272 percent after the change! Now we understand why response time increased on our system.

Another neat technique is to do the same exact BEFORE vs. AFTER analysis at the JOB level:

When one makes tuning changes on the system, one must measure the impact of those changes. After the change, the above job ran 51 percent faster. This is performance management at its best.

The bottom line is this: By performing the real-life examples detailed in this article, you can go to your CIO with confidence and supporting documentation. CIOs never want to hear, “I have a feeling this is going on.” They want you to do performance management with facts, not feelings or perception.

When you create this kind of performance management culture, some amazing things can be accomplished. And you will begin to learn your system at a deeper level.

Doug Mewmaw is a “jack of all trades” IT veteran who currently is director of education and analysis at Midrange Performance Group, an i business partner that specializes in performance management and capacity planning.
