Memory Management: It’s Your Fault, Now Fix It
July 25, 2007 Doug Mewmaw
This year marks my 19th year on the AS/400 platform. Wow, where did that time go? After all those years working with peers, customers, and the like, I'm convinced that the most misunderstood performance component is memory. In a previous article, I explained the importance of the machine pool, and we looked at the Performance Adjuster feature of i5/OS as well. The feedback I received from that article proved to me that we need more articles on managing memory.
I would like to share some neat tricks of the trade and best-practice techniques that I've used over the years to keep my systems running smoothly with regard to memory. We'll assume that machine pool faulting is within the best-practice guideline of under 10 faults per second.
Let’s start with a basic question that I received in class the other day as it is a perfect starting point for our memory discussion. A student asked this: “What is a fault?”
Here is a great definition one of my mentors gave me years ago: a fault occurs when a virtual address is referenced but is not in main storage. When a fault occurs, the job stops dead in its tracks and waits for an I/O; by definition, that wait is a synchronous I/O.
Since I’m a big picture person, let’s look at a non-technical description. My wife is a fifth grade teacher. Every year, she describes a phenomenon that I think explains faulting perfectly. A simple way to look at faulting is thinking of homework with our kids. The scenario is a child doing homework early in a new school year:
A child (our system job, by analogy) needs an answer to a homework question. The problem is that the child forgot some things they learned from the previous year (needed data not in memory). At that point, the child uses Google or Yahoo to search for it or pages through an old book until he or she finds the answer (accessing info from DASD). In other words, the info the child needed was not in their memory. The child is forced to find the answer. That process of not having the info is a fault. Now, the child may have to look for numerous answers (faulting rate). The more the child needs to look for answers, the longer their homework takes (sync I/O affecting how long a job takes to complete).
So why do we need to manage memory? Here are the short answers:
Interactive jobs: If there is a high faulting rate (Sync I/Os are occurring), the interactive job will slow down and response times will suffer.
Batch jobs: If there is a high faulting rate (Sync I/Os are occurring), the batch job simply takes a long time to complete. This phenomenon affects the nightly batch process window.
Where Do You Start?
Managing memory requires two very important work management prerequisites:
I’ll explain each in turn.
1. Inventorying your system pools. More often than not, I've run into situations where there were too many cooks in the kitchen. That is, jobs were set up by different areas within IT, and the environment itself was set up many years ago. As a result, it's imperative to ensure work management is set up efficiently. I would always ask questions like:
In a real-life example, notice what I encountered in a recent study where a site had performance issues.
2. Prioritize and categorize your system workload. I'm not a big fan of running all jobs out of *BASE. Instead of having all jobs run in one "hodgepodge" pool, I like the idea of separating core application workloads into separate pools. To appropriately prioritize and categorize workloads, I would talk to the application teams and the business side of the house to ensure the system is set up correctly. A simple example is shown below:
Once you have your system set up by core functional areas, then you have the starting point where you can begin to manage memory efficiently.
Note: While it is my philosophy to separate workloads into separate pools, I do not recommend slicing the workload into too many memory pools. What is too many? I believe one can separate workload types into no more than 10 total pools. Keep in mind that is my personal guideline; obviously, everyone has a number they feel comfortable with. The bottom line for me is that I don't want the process of managing memory to become a management nightmare. Too many pools also add overhead to the system when the Performance Adjuster is turned on.
Manage Your Memory With a Simple Methodology
Here’s the methodology I use to manage memory in systems.
1. Start with the big picture and measure total faults on the system
There is no better starting point than looking at total faults on the system. Here’s why:
This graph is telling. Not only does it tell you the faulting rate for the entire system, but it points out a huge faulting rate increase starting the week of July 25. For the record, the faulting increased on the system 369 percent! Check it out:
What happened? Was a new workload added to the system? Was there a memory change on the system? Was there simply an application change? This supporting documentation proves that you need to dig further to see what changed.
2. Determine your system’s faulting factor
Many years ago, I saw a neat article by IBM's Mike Denney (called "Analyze This") explaining that memory analysis should start with determining the percentage of time the system is page faulting. I wholeheartedly agree. Shortly after the article came out, my company worked with Mike to take the concept and turn it into a graph that would help people understand the memory component better. The end result was a neat memory graph called the Faulting Factor. Note: It is not my intention to teach you the technical aspects of how the faulting factor is calculated. Just know that the calculation uses sync I/Os, disk response time, CPU usage, and faults. For a complete description of how the factor is calculated, see "Analyze This" for details.
So what is a Faulting Factor? It is the percentage of time your system is faulting. It’s pretty easy when you think of a clock.
For example: Your system has a faulting factor of 50 percent. What does that mean? It means that in a 60-minute time frame, your system faulted for 30 minutes. Why is that important? It means your jobs were only in the CPU for 30 minutes! See the example below:
In this real-life example, notice that the faulting factor for this system has many intervals where the percentage of faulting is over 40 percent. Thinking of our one-hour clock as a measuring stick, this system had intervals where jobs were waiting almost half the time: 24 minutes out of 60.
Remember what the goal is: to have enough memory so that your system is processing jobs in the CPU efficiently. This is an example where more memory would help this system. If jobs are waiting a lot, service levels will suffer.
What is the faulting factor guideline? My personal best practice guideline is to be under 30 percent.
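The clock analogy above can be sketched in a few lines of code. This is just the arithmetic of the analogy, not the actual faulting factor calculation (which, as noted, involves sync I/Os, disk response time, CPU usage, and faults); the function names and the 30 percent threshold default are mine.

```python
# A minimal sketch of the clock analogy: given a faulting factor
# (the percentage of time the system spends faulting), show how many
# minutes per hour jobs spend waiting on faults instead of using the CPU.
# The 30 percent default threshold is the author's best-practice guideline.

def minutes_faulting_per_hour(faulting_factor_pct):
    """Translate a faulting factor into minutes of waiting per hour."""
    return 60 * faulting_factor_pct / 100

def within_guideline(faulting_factor_pct, threshold_pct=30):
    """True if the faulting factor is under the best-practice guideline."""
    return faulting_factor_pct < threshold_pct

print(minutes_faulting_per_hour(50))  # 30.0 -- half the hour spent waiting
print(minutes_faulting_per_hour(40))  # 24.0 -- matches the 24-of-60 example
print(within_guideline(21))           # True -- under the 30 percent guideline
```

A 50 percent factor means jobs spent as much time waiting on faults as they did in the CPU, which is why the guideline sits well below that.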
3. Measure each pool's faulting rate. I like to start with a normal day. That is, for every interval during the day, I like to see which pools are faulting the most.
In this example, each pool is measured for each interval.
Next, I like to look at the pools historically:
Pool 2 (*BASE) is analyzed. Note that the faulting rate is increasing.
Here’s Pool 3 and so on . . . .
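The per-interval pool measurement described above can be sketched as follows. The data layout and sample numbers here are purely illustrative; in practice these values would come from your performance data collection, not from hard-coded tuples.

```python
# A minimal sketch of step 3: for each measurement interval, find which
# pool is faulting the most. Each record is a hypothetical
# (interval, pool_name, faults_per_second) sample.

from collections import defaultdict

samples = [
    ("09:00", "*MACHINE", 4.0), ("09:00", "*BASE", 55.0), ("09:00", "POOL3", 12.0),
    ("09:15", "*MACHINE", 3.5), ("09:15", "*BASE", 80.0), ("09:15", "POOL3", 20.0),
]

# Group the samples by interval so each interval can be ranked separately.
by_interval = defaultdict(list)
for interval, pool, faults in samples:
    by_interval[interval].append((faults, pool))

# Report the worst-faulting pool in each interval.
for interval in sorted(by_interval):
    faults, pool = max(by_interval[interval])
    print(f"{interval}: {pool} faulting most at {faults} faults/sec")
```

Running the same grouping over a full day's intervals, then over weeks of history, gives you both views described above: the normal-day snapshot and the historical trend per pool.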
What am I looking for? To understand the faulting rate for each pool. I ask questions like:
4. Compare pool faulting and total system faults. I also like to understand how each pool is affecting the overall faulting. I ask this basic question: Are my pool percentages versus total faults constant or has one pool increased significantly?
Here are two graphs that I use. The first is a normal day, plotting pool faulting against total system faulting:
In this example, the Pool 2 (*BASE) faulting rate is compared to the total system faulting rate. A neat technique is to graph the pool's percentage of total faults as well (shown in green in the chart above). This shows you how the faulting changes throughout the day. This kind of data helps you manage the Performance Adjuster more efficiently, too.
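The "percentage of total faults" series is a simple per-interval division. The numbers below are illustrative, not taken from the article's charts:

```python
# A sketch of computing one pool's share of total system faults per
# interval (the green series in the chart). Sample values are hypothetical.

pool2_faults = [40.0, 55.0, 80.0, 60.0]    # *BASE faults/sec, one value per interval
total_faults = [90.0, 110.0, 160.0, 120.0]  # whole-system faults/sec, same intervals

pct_of_total = [round(100 * p / t, 1) for p, t in zip(pool2_faults, total_faults)]
print(pct_of_total)  # [44.4, 50.0, 50.0, 50.0]
```

If this percentage stays roughly constant while total faulting rises, the whole system grew busier; if one pool's share jumps, that pool is where to dig.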
I do this for every pool on the system. Next, we break down the Faulting Factor, showing faulting factors by pool:
Since the Faulting Factor is such a big component of measuring memory successfully, I like to understand how each pool is affecting the overall system faulting factor. In this example, notice that *BASE's average faulting factor is 5 percent (in gold in the chart). The total faulting factor is 21 percent (blue). That means about 23 percent of the total faulting factor comes from *BASE. I do this for each pool.
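The pool-share calculation above is a single division. A sketch, using the article's rounded averages (5 percent for *BASE, 21 percent total); note that with these rounded inputs the share comes out near 24 percent, while the article reports 23 percent, presumably computed from unrounded averages:

```python
# A sketch of attributing the total faulting factor to one pool:
# the pool's average faulting factor divided by the system total.

def pool_share_of_factor(pool_factor_pct, total_factor_pct):
    """Percentage of the total faulting factor contributed by one pool."""
    return 100 * pool_factor_pct / total_factor_pct

share = pool_share_of_factor(5, 21)   # *BASE at 5 percent of a 21 percent total
print(round(share, 1))  # 23.8 -- roughly a quarter of all faulting is *BASE
```

Repeating this for each pool shows at a glance which pools to tune (or give memory to) first.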
The Bottom Line
Measuring memory does not have to be an overwhelming experience. True, it's not an exact science, and it does take a bit of detective work. But if you create a structured plan to manage the memory component, you put yourself in a great position for solid performance service levels. Here's a neat thing to take away from this article: I created the above methodology not only to help me with memory analysis, but also to help with the overall performance of the system.
Even though the prerequisites seemed daunting at first, the effort was well worth it when I saw performance improvements immediately. From a work management standpoint, I also had peace of mind that my system was set up efficiently. So to summarize, my methodology is as follows:
In my next article, we will drill down deeper and analyze jobs within a pool.
Doug Mewmaw is a 25-year "jack of all trades" IT veteran who currently is director of Education & Analysis at Midrange Performance Group, an iSeries business partner that specializes in performance management and capacity planning. He can be reached at DMewmaw@mpginc.com.