Why Are Systems Programmers Always To Blame?
Published: June 1, 2009
by Doug Mewmaw
We've all been there. It's first thing Monday morning and the phone is ringing off the hook. The CIO is calling, and everyone is freaking. Even though your systems team didn't submit one single change over the weekend, there's a problem and you, the systems administrator, are being blamed. You know it's not your fault, and wish you had a way to prove what the heck is going on. Sound familiar?
Before I continue, I must share a funny story about my son. Let me preface the story by saying that my wife is a gifted teacher and educator.
Our son recently brought home a report card and it wasn't good. Apparently, the new football team, new cell phone, and new girlfriend didn't leave much time for our young high school freshman to work on earning some solid grades. When my wife and I questioned his poor performance, our son promptly looked at us like it was our fault. Then, we heard a lot of excuses:
"That teacher is not good."
"I just didn't test well in that class."
And my personal favorite, "No one does well in that class."
I questioned him: "Did you try your best? Did you turn in your homework?"
"YES!" was the resounding answer.
I then showed my son the report card where the teacher wrote:
"Needs More Effort. Does Not Turn In Homework."
He said, "That's crazy, Dad."
We went to the computer and logged onto the school's Website where I brought up the historical data for the class. Here is what I saw:
- Test grades were not horrible (B average)
- There were five missing homework assignments
- The average homework grade was 59 percent
It was crystal clear why my son got a C in the class. The teacher's assessment was spot on. The supporting documentation told the story perfectly.
At that point, my son's expression changed to that of a little boy with his hand caught in the cookie jar. What could he say? The proof was right in front of him.
I bring up this story for all the system programmers out there who are tired of being blamed and simply want a way to prove what is really occurring on the system.
Wouldn't you like a way to prove that the system problem is not your fault? Better yet, when changes occur in your environment, wouldn't you love a way to prove that the change had a positive impact to the system? You can get these answers, but you need to know how. Let's begin by first looking at why the systems admin group is always instantly blamed for any and every systems issue.
The Blame Game
I have visited more than 50 IT shops in the past five years. While in the trenches, I've talked to systems programmers, operators, help desk associates, and even CIOs. During those visits, I've determined five key reasons why system programmers are instantly blamed.
Reason 1. System programmers don't always "really know" their system.
Many times, I'm brought into organizations to help investigate existing performance issues. These are shops with veteran system programmers on site. During one engagement, it was determined that a third-party application had interactive and batch workload in the same subsystem. That's one of the cardinal sins in IBM i performance.
Amazingly, the systems administrators were all shocked that this setup was on their system.
Reason 2. System programmers are not given the time (resources) to analyze system performance, usually because they are too busy fighting fires.
This reason is especially prevalent in our current economic environment. In many organizations, IT support staffs tend to be lean and mean. In those instances, system programmers can barely complete their daily jobs, let alone analyze system performance.
Reason 3. Some companies don't publish performance metrics in regard to their core "bread and butter" systems.
Reason 4. Clearly defined service levels are not established in regard to system performance.
Some companies never measure their current performance. They base their performance tuning on perception and often use terms like "I have a feeling something is going wrong." Others measure their system internally within their systems group, but never report to the internal and external customers on the health of the system.
Reason 5. There is no supporting documentation to prove what is occurring.
This is a big one. When the system is hitting the fan, the systems programmer's role is to simply play detective and figure out what is going on. I can attest that CIOs look at you funny when you tell them "this is what I think is going on," but have no proof for your theory.
The goal of the systems programmer should always be to prove you are hitting the mark in regard to system performance.
However, the common "perception" of the system programmer's goal is to prove you know what you are doing.
Do you see the difference? Now let's look at where to begin.
Lessons from the Past: Look at Your System's Historical Data
We want to create an environment where we are able to prove what is actually occurring. In other words, we want to use "the power of historical data" to make our case. If you are measuring core performance metrics historically, you can really do amazing things in regard to performance management. Remember my son's reaction when he saw his actual progress at school online? The proof was in the documentation. You must be able to show your bosses what is really occurring.
In the following simple real-life example, we can see how faulting dramatically increased after a change occurred on the system:
The next example illustrates where an application programmer made one change to the system. The end result was that the system became "pegged" for months. (See #1.) The average CPU went from 31 percent to 97 percent! The humorous part of the story is that when the application team backed out the change, the systems team measured CPU% after the "back out." They found that the CPU dropped to only 80 percent, which means the application team got caught not really backing out the entire change. (See #2.)
These examples point out the power of historical data and the use of supporting documentation.
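A before/after check like the one described above is easy to sketch in code. The snippet below compares average CPU% on either side of a change date, using hypothetical daily samples and an assumed change date (the numbers echo the 31-to-97-percent story but are illustrative, not the article's actual data):

```python
from datetime import date
from statistics import mean

# Hypothetical daily average CPU% samples keyed by date (illustrative numbers,
# not the article's actual collected data).
samples = {date(2009, 3, d): 31.0 for d in range(1, 15)}     # before the change
samples.update({date(2009, 3, d): 97.0 for d in range(15, 31)})  # after the change

change_date = date(2009, 3, 15)  # assumed date the application change went in

before = [v for d, v in samples.items() if d < change_date]
after = [v for d, v in samples.items() if d >= change_date]

print(f"Avg CPU% before change: {mean(before):.1f}")
print(f"Avg CPU% after change:  {mean(after):.1f}")
```

With your real historical data behind it, this kind of two-number summary is exactly the supporting documentation that ends an argument before it starts.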
Structured Monthly Reporting
The next logical step is to use the historical data and simply "inform" your user community of the health of the system. This one hits close to home for me. Many years ago, my philosophy was to use internal reports for my team. We measured all core performance components and I provided my boss with the performance story of what was occurring. In January of that year, I showed my boss that we needed to do an upgrade in the fall and that we should start preparing. This particular year was very chaotic as my company was going through a merger. In February, I reminded my boss about the upgrade. I did the same in March. I reminded him in April, May, June, and even July. For whatever reason, the upgrade message was not filtering up the management ladder.
I'm sure you can guess where this story is going. I ended up in the president's office in August explaining why we needed $1.5 million yesterday. One senior director looked at me and said, "Doug, how could you have put us in this situation?" Let the record show that it took an amazing amount of restraint not to "rat out" my boss, who obviously dropped the ball. I was really mad and I walked out of the room saying, "How can I ensure this never happens to me ever again?"
The answer was structured monthly reporting to all internal and external customers. I wanted to create an environment where everyone was on the same page. Here is an example of one of the pages in my performance report. It's a monthly summary management report:
Here we have a report that shows CPU%. We indicate last month's performance as well as providing management with a historical trend. Not only did I provide best practice guidelines to help management understand the report, but I also provided text that explains the entire performance story. My wife once looked at the report and asked me if it was well received. I thought about it a minute and decided it was extremely well received. She laughed and asked if I would like to know why. With her having a master's degree in education, I was intrigued to hear what she thought. She said people learn in different ways: Some people like pictures; some people like numbers; and some people like text. I had provided it all on one document.
I admit I got lucky. Truth be told, I just thought it was a neat-looking report.
Here is the bottom line: Management simply wants peace of mind that everything is OK. So show them!
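A minimal monthly summary in the spirit described above can be generated from the same historical data: last month's value, a guideline to measure against, and a simple trend, all on one page. The month names, CPU% figures, and the 70 percent guideline below are all assumptions for illustration:

```python
from statistics import mean

# Hypothetical monthly average CPU% figures (illustrative, not real data).
history = {"Jan": 42.0, "Feb": 45.0, "Mar": 44.0, "Apr": 49.0, "May": 53.0}
guideline = 70.0  # assumed best-practice ceiling for average CPU%

months = list(history)
latest = months[-1]                 # most recent month in the report
trend = mean(history.values())      # simple trend over the months shown

status = "OK" if history[latest] <= guideline else "ATTENTION"
print(f"CPU% summary for {latest}: {history[latest]:.1f}% "
      f"(guideline <= {guideline:.0f}%) [{status}]")
print(f"Trend (average of months shown): {trend:.1f}%")
for m, v in history.items():
    # Crude text "picture" of the trend: one '#' per 2 percent of CPU.
    print(f"  {m}: {'#' * int(v / 2)} {v:.1f}%")
```

Notice the report mixes a picture (the bars), numbers, and text on one document, which is exactly why the format lands with different kinds of readers.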
Look at this real-life example where structured monthly reporting helped solve a performance issue:
Example of a month where everything looks normal.
The following month, we can see that something has changed in the environment. Notice the system workload has dramatically increased. This is a great real-life example where using the historical data can help one easily identify system problems.
Example of a month where something happened.
Drilling down into the data, it was discovered that a PTF was needed on the system. The graph on the left was sent to IBM (along with the identified system tasks that were active). This supporting documentation helped detect the culprit. Notice that the system tasks were taking 11 percent of the CPU resources. Once the needed PTF was applied, the system was easily fixed.
Supporting Documentation at its best.
Clearly, structured monthly reporting helps you understand your system at a deeper level. If you know your workload for each month, problems will stick out like a sore thumb.
Do you see that the system programmer is now proving what is actually occurring on the system? Now, let's get to know our system at even a deeper level.
Measuring Up: Know How Jobs Use Critical Resources
One neat technique to get to know your system better is to simply measure the jobs on your system in regard to critical resources, i.e., CPU, disk, and memory.
I ran the report shown above every single month so I could learn how production jobs were consuming system resources. Following is a real-life example that caused two 20-year systems programmers to run out of a room in a panic:
The reason was simple. The purple piece of pie was a key process at their company. This process that normally consumed 4 percent of the CPU resources was now taking 36 percent! Further historical data investigation showed the exact day when the problem started:
This company discovered that this key process had "deleted access paths" and not only had the problem been causing performance issues for months, it also was occurring in 10 other job streams. Keep in mind that these systems programmers routinely put out structured monthly reports for management. However, it wasn't until they looked at the monthly consumption of resources that they uncovered a "hidden" performance issue.
Can you guess when they fixed the problem?
This is systems management at its best.
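A month-over-month job check like the one that flushed out that hidden problem can be sketched simply: compare each job's share of CPU against last month and flag anything that jumped. The job names, percentages, and the "share doubled" rule below are all hypothetical:

```python
# Hypothetical per-job CPU share (% of total CPU) for two consecutive months.
last_month = {"PAYROLL": 4.0, "ORDERS": 12.0, "BACKUP": 8.0}
this_month = {"PAYROLL": 36.0, "ORDERS": 11.0, "BACKUP": 9.0}

THRESHOLD = 2.0  # assumed rule: flag any job whose share at least doubled

for job, share in this_month.items():
    prev = last_month.get(job, 0.0)
    if prev and share / prev >= THRESHOLD:
        print(f"{job}: {prev:.1f}% -> {share:.1f}% "
              f"(x{share / prev:.1f}) -- investigate")
```

Run against real monthly data, a report like this surfaces the "purple piece of pie" the moment it starts growing, instead of months later.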
Tuning Around: Measuring the Impact of Change
Performance management on the i is not easy. I'm so old that I can remember when IBM performance gurus would offer this performance memory tuning nugget:
Tune it until it breaks, then throttle it back.
OK, we all agree performance tuning is not easy, but what can we do to ensure we are doing performance management correctly and efficiently?
We must measure any change on the system using before vs. after analysis.
In other words, we must create an environment with supporting documentation in regard to the system impact. Instead of having a "feeling" the system is better, prove it!
Here is an example where a company has bad machine pool faulting. In fact, the system is only meeting the best practice guideline 48 percent of the time. (That would earn a grade of "F" from any of my son's teachers.)
So now we must make a tuning change. In our real-life example, we add memory to the machine pool. Now we must do a before vs. after analysis so we can answer this question:
Did my tuning change have a positive or negative impact to the system?
Wow! This is a historical machine pool faulting graph. Looking at this data, we can easily prove the impact to the system. In the above example, notice that faulting dropped considerably. You can then show management the following level of detail:
There is no more guessing. There is no more "I have a feeling." The fix is simple. Before making any change to your system, you must measure the metrics. Then measure the metrics after the change. Then, when management wants to know if the change worked, you show them this supporting documentation:
Before the change, the SLA was met 48 percent. (An "F" grade.)
After the change, the SLA was met 99.7 percent. (An "A+" grade.)
So the answer to my question is: My tuning change had a positive impact to the system.
The IT Culture Battle
Being in systems my entire career, I understand the nature of the beast. We are always blamed, and it will always be that way. Unfortunately, I've learned the motto "CYA" is the best way to deflect any accusations that might come my way. It is not uncommon to have people question your expertise. Case in point: I had an IT director once tell me I spent too much time doing performance management (I was a 20-year AS/400 veteran), and I finally had to pull out my actual job description to teach her I was simply doing my job. There have been many times when I would just shake my head in disbelief. But, instead of fighting the culture, I learned to embrace it. Remember: Learn your system at a deeper level:
- Do structured monthly reporting and distribute it to internal and external customers.
- Measure which jobs are consuming system resources each month.
- Always measure the impact of changes on your system.
If you want to blame me or my team, bring it on. But you better have proof, because I do. I always make certain I know exactly what's going on in my environment, and I have the documentation to back it up.
Doug Mewmaw is a "jack of all trades" IT veteran who currently is director of education and analysis at Midrange Performance Group, an i business partner that specializes in performance management and capacity planning.