Volume 3, Number 7 -- February 18, 2004

Linux 2.6: Let's Take a Look Under the Hood


by Justin Ward

It's been a few weeks since the new Linux 2.6 kernel came out. Linux 2.6 was created explicitly to be faster, to support more systems and peripherals, and to be better than the Linux 2.4 kernel in most features. But what makes the 2.6 kernel tick, and tick so much better than the old one? A new scheduler, new filesystem code, and better multiprocessor support are just the starting points. The kernel has been expanded in all directions with this new release, offering better support for mainframes, midrange machines, desktops, and embedded systems alike.

This week, I will give you an in-depth tour of the new Linux 2.6 technologies that relate to the server portions of the kernel.

Multiprocessor Support

Linux 2.6 has much better multiprocessor support than Linux 2.4 had, for both the symmetric multiprocessing (SMP) and the non-uniform memory access (NUMA) ways of ganging up many processors to look like one larger and more powerful processor. NUMA is important for the new architectures for high-end X86 servers based on processors from either Intel or Advanced Micro Devices. IBM's 16-way xSeries 440/445 servers and future eight-way Opteron-based systems use NUMA technologies, for instance. SMP tightly couples processors and their main memory together, whereas NUMA is a more loosely coupled approach that partitions all of the main memory in server cell boards into local and shared memory. Local memory runs at memory speed, but shared memory (which is what allows multiple cell boards to share work) has much higher latency, because even high-speed fiber-optic system interconnections between cell boards run a lot slower than main memory buses. NUMA is, however, a lot less expensive to do than SMP for machines with more than four processors, which is why server makers are adopting it.

The Linux 2.6 kernel is ready for production use on systems with up to 32 processors, and has even been tested on 64-processor systems. Linux 2.4 was said to scale to 16-way processing, but for most applications it ran out of gas at eight-way processing (whether on SMP or NUMA architectures). The word on the street is that Linux 2.6 will be able to scale to 32-way or 64-way processing, but most server makers are really only promising efficient performance on 16-way processing until Linux 2.6 gets some kinks worked out.

Of course, the kernel running on 32 processors won't do anything by itself, and this is why a new scheduler was incorporated into the 2.6 kernel. The scheduler is the part of the operating system that controls when a process runs and on what processor it runs. If the scheduler takes too long to decide what process runs next, the entire system performance suffers, since this has to be done every time the processor switches to a different process. In the old kernel, the time it took to decide which process went next was directly proportional to the number of processes running on the system. This is acceptable for small machines because the process count usually doesn't get too high. Even if the process count is high, there are probably other bottlenecks to worry about. A larger server, however, could have tens of thousands of processes running. This is why a scheduler that runs in constant time (for you bitheads, in O-notation, this is O(1)) was incorporated into the 2.6 kernel. It takes a fixed amount of time to run, regardless of how many processes it has to choose from. This has the effect of making performance scale more linearly as the process count grows.
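To make the constant-time idea concrete, here is a minimal sketch in Python of one way a run queue can pick the next process without looking at every process: one FIFO queue per priority level, plus a bitmap that records which queues are non-empty. The class name, the method names, and the figure of 140 priority levels are assumptions chosen for illustration; this is not the kernel's actual code.

```python
from collections import deque

# Assumed number of priority levels; a lower index means a higher priority.
NUM_PRIORITIES = 140


class RunQueue:
    """Toy run queue: one FIFO per priority level plus a bitmap."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]
        self.bitmap = 0  # bit p is set while queues[p] is non-empty

    def enqueue(self, task, priority):
        self.queues[priority].append(task)
        self.bitmap |= 1 << priority

    def pick_next(self):
        if self.bitmap == 0:
            return None  # nothing is runnable
        # Isolate the lowest set bit: the highest-priority non-empty queue.
        priority = (self.bitmap & -self.bitmap).bit_length() - 1
        task = self.queues[priority].popleft()
        if not self.queues[priority]:
            self.bitmap &= ~(1 << priority)  # queue drained, clear its bit
        return task
```

The work done in pick_next depends on the fixed number of priority levels, not on how many tasks are waiting, which is the property the O(1) label describes.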

One of the other big problems with the 2.4 kernel's scheduler was that it would "bounce" processes between CPUs. That is to say, a process would start running on one processor, but often finish on another one. This leads to overhead in transferring the process and processor state between processors, but the real problem becomes apparent when this happens on a NUMA machine: the memory that the process was using on its original CPU may be very far from its new CPU. This causes the process to slow down, as it has to wait longer for memory access, and it can also cause overall system degradation as more data has to contend for the memory bus. With this in mind, the new scheduler was designed to not bounce processes between CPUs, or at least not nearly as much. This is a principle known as "processor affinity." In effect, processors do as much of the work that they start as they can.
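The affinity principle can be sketched as a simple placement rule: keep a process on the CPU it last ran on unless that CPU is clearly busier than the others. The function below is a toy model in Python; choose_cpu, run_queue_lengths, and imbalance_threshold are hypothetical names, and the threshold value is an arbitrary assumption rather than anything taken from the kernel.

```python
def choose_cpu(last_cpu, run_queue_lengths, imbalance_threshold=2):
    """Pick a CPU for a process that is about to run.

    last_cpu: index of the CPU the process ran on previously, or None.
    run_queue_lengths: number of runnable processes queued on each CPU.
    """
    cpus = range(len(run_queue_lengths))
    idlest = min(cpus, key=lambda cpu: run_queue_lengths[cpu])
    if last_cpu is None:
        return idlest  # brand-new process: no affinity to respect yet
    # Move only when the old CPU is much busier than the idlest one;
    # otherwise stay put so cached data and local memory remain close.
    if run_queue_lengths[last_cpu] - run_queue_lengths[idlest] > imbalance_threshold:
        return idlest
    return last_cpu
```

With queue lengths of [0, 2, 1, 0], a process that last ran on CPU 1 stays there; with [0, 5, 1, 0], it is moved to CPU 0.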

The Linux 2.6 kernel is also now aware of simultaneous multithreading (SMT), which Intel calls HyperThreading. HyperThreading is a technology by which one processor virtualizes itself into two or more processors, allowing separate processes (or threads) to execute in different parts of the CPU simultaneously. This is different from SMP, which allows separate processes to run on completely separate CPUs at the same time. The new scheduler is aware of virtual processors, and schedules things between virtual processors so as to not overburden the underlying single processor. This is all in addition, of course, to the scheduling that takes place between multiple (actual, not virtual) processors.
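One way to picture scheduling that is aware of virtual processors is a placement rule that prefers a virtual processor whose physical chip is completely idle before doubling up on a chip that is already busy. The Python sketch below is a toy model under that assumption; pick_logical_cpu and its arguments are hypothetical names, not kernel interfaces.

```python
def pick_logical_cpu(cores, busy):
    """Choose a virtual (logical) processor for a new task.

    cores: list of tuples; each tuple holds the logical CPU ids that
           share one physical processor.
    busy:  set of logical CPU ids that are currently running a task.
    """
    # First choice: a physical processor with no busy siblings at all.
    for siblings in cores:
        if not any(cpu in busy for cpu in siblings):
            return siblings[0]
    # Fallback: any idle logical CPU, even though its sibling is busy.
    for siblings in cores:
        for cpu in siblings:
            if cpu not in busy:
                return cpu
    return None  # every logical CPU is occupied
```

With two physical processors exposed as (0, 1) and (2, 3), and only logical CPU 0 busy, the rule picks logical CPU 2 rather than 1, so the two tasks do not compete for the same underlying processor.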

The last big advance in multiprocess and multiprocessor support is that the kernel is now fully pre-emptable. In the 2.4 kernel, a process running kernel code (for example, doing I/O) could not be interrupted by another process, even if it had used its entire time allotment (as scheduled by the scheduler). By allowing the kernel to pre-empt itself, the system as a whole becomes much more responsive. This is important not only for desktop systems full of interactive programs where the user wants the computer to feel snappy, but also for time-critical server applications. It should be noted that a pre-emptable kernel doesn't necessarily make the system much faster. It just helps to make sure that every process receives its fair timeslice, which gives the perception of a faster system across all users and applications.

Anticipatory File I/O

Server applications are frequently bottlenecked by disk I/O. Waiting on a disk read is particularly painful for a server and often makes up a disproportionate amount of the total runtime of an application. The Linux 2.6 I/O subsystem has implemented a new scheduler for disk requests in an attempt to speed this up, and it has shown remarkable results. Some studies have shown the Apache Web server running almost 50 percent faster because of this.

The naive approach to scheduling disk requests is a very simple first-come, first-served queue. If application A makes an I/O request before application B, it will be processed first. This ensures a degree of fairness, but is not a very efficient algorithm. Modern-day operating systems have more advanced scheduling techniques that take drive geometry into account, and they promote reads above writes (writes can wait, since the application doesn't need anything from the disk to continue). The basic principle, however, is still to process a request, send the data to the application, then pull the next request out of the waiting list and process it.

Anticipatory scheduling is a very simple, though counterintuitive, modification: after processing a read request and sending the data back to the application, the I/O scheduler does nothing at all for a few milliseconds. Instead, it waits to see if another read request comes in. Most of the time, another read request will arrive right away. And most of the time, it will be a request from the same application, for data in the same file, immediately after the data that it just read. If such a request does come in during this wait period, the scheduler sends it to the disk right away. Since the scheduler hasn't done anything else in this interim, the drive head is right where it was before the new request came in: at the end of the last request and the beginning of this new request. By eliminating the seek operation normally necessary at the beginning of a read, the time required for several consecutive read operations is cut drastically. This means that an application can read an entire file in a fraction of the time it would normally take.
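The effect on seeks can be illustrated with a small simulation in Python. It assumes each application reads its file one block at a time and issues its next read only after the previous one returns, so a scheduler that never waits always finds some other application's request at the front of the queue. The function name count_seeks and the max_batch fairness cap are hypothetical, and time is not modeled: "waiting" is represented simply by serving the same application's next request.

```python
from collections import deque


def count_seeks(streams, anticipate, max_batch=4):
    """Count head seeks needed to serve every application's reads.

    streams:    dict mapping an application name to the block numbers
                it reads, in order, one request at a time.
    anticipate: if True, keep serving the same application for up to
                max_batch requests before moving to the next one.
    """
    queue = deque((app, deque(blocks)) for app, blocks in streams.items() if blocks)
    head = None  # block just past the last one read; None means unknown
    seeks = 0
    while queue:
        app, blocks = queue.popleft()
        for _ in range(max_batch if anticipate else 1):
            if not blocks:
                break
            block = blocks.popleft()
            if block != head:
                seeks += 1  # the head was somewhere else, so it must move
            head = block + 1
        if blocks:
            queue.append((app, blocks))  # more to read: back of the line
    return seeks


streams = {"A": list(range(100, 108)), "B": list(range(500, 508))}
print(count_seeks(streams, anticipate=False))  # 16
print(count_seeks(streams, anticipate=True))   # 4
```

In this toy workload, two applications each read eight consecutive blocks. Without anticipation every one of the 16 reads needs a seek; with anticipation and a batch of four, only four seeks are needed.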

Other filesystem updates in Linux 2.6 include built-in support for two new filesystems, JFS (created by IBM) and XFS (created by SGI). Both are journaling filesystems, and both are now included in the kernel. They join ext3 and ReiserFS, giving users a choice of four different journaling filesystems. Support for Microsoft's NTFS filesystem for Windows (which was added in Linux 2.4) has been improved in Linux 2.6, and NTFS support finally includes write support (the 2.4 NTFS code was read-only). The Windows filesharing code (for SMB and for CIFS, Microsoft's extension to SMB) has been brought up to date, and even the Novell NetWare affinity code inside Linux 2.6 has a few new features. What this means is that Linux, Windows, and NetWare can actually share (in terms of both reading and writing) a single filesystem.

Memory Management

In an effort to provide better support for systems both large and small, the Linux 2.6 kernel's memory management code has also been overhauled. In an example of the kernel being brought to smaller architectures, the Linux 2.6 kernel can run on architectures lacking a memory management unit (MMU), a unit that is often not incorporated into embedded devices like PDAs and cell phones (but is inside PCs and servers).

Larger systems, however, are not forgotten. A reverse mapping technique has been incorporated to make memory management on larger systems faster.

Processes share memory. While this provides for very fast inter-process communication and often saves overall memory consumption, it can be a big headache for the underlying operating system. Freeing memory becomes especially difficult, as the OS has to make sure that no process--not just the process that originally requested the memory--is currently using it. In the 2.4 kernel, this meant going through every process's memory page table and searching for the page in question. This becomes a real slowdown as the number of running processes increases.

In order to avoid this problem, the Linux 2.6 kernel keeps a per-page list of processes, in addition to the normal per-process page list. Memory can be freed when the list of processes using a given page is empty. It's much easier (and much faster) to check this than to check the page table of every process.
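The difference between the two checks can be sketched with a pair of dictionaries in Python: a forward map from each process to its pages, and a reverse map from each page to the processes using it. The class and method names here are hypothetical, and real kernel page tables are far more involved; this only illustrates why the reverse lookup is cheaper.

```python
from collections import defaultdict


class MemoryMap:
    """Toy model of per-process page lists plus a per-page process list."""

    def __init__(self):
        self.page_tables = defaultdict(set)  # process id -> pages it maps
        self.rmap = defaultdict(set)         # page -> process ids mapping it

    def map_page(self, pid, page):
        self.page_tables[pid].add(page)
        self.rmap[page].add(pid)

    def unmap_page(self, pid, page):
        self.page_tables[pid].discard(page)
        self.rmap[page].discard(pid)

    def can_free_scan(self, page):
        # 2.4-style check: look through every process's page list.
        return all(page not in pages for pages in self.page_tables.values())

    def can_free_rmap(self, page):
        # 2.6-style check: look only at this one page's list of processes.
        return not self.rmap.get(page)
```

Both checks give the same answer, but can_free_scan touches one entry per process while can_free_rmap touches a single entry. The price is the second dictionary, which is the extra memory the kernel has to claim for itself.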

Of course, this means that the operating system is claiming more memory for itself, and giving less to the applications. Kernel developers are working on a way to minimize the amount of extra memory consumed, but it is already viewed by most as a worthwhile trade. Memory costs have come down significantly in the past few years, and all systems consume much more of it, whether they are Unix, Windows, or proprietary platforms.

User Linux, or Linux Instances

The Linux 2.6 kernel also supports running the Linux kernel as a user-space application. Linux can be run from inside Linux, which gives a huge advantage to kernel driver developers and security analysts. It may also eventually pave the way for running several instances of the operating system, all completely separate from one another, on a single computer.

This type of logical partitioning is a feature in IBM mainframes, in various Unix midrange and enterprise servers from IBM, Hewlett-Packard, Sun Microsystems, and Fujitsu-Siemens, and even in OS/400 and OpenVMS midrange servers. Linux needs to go as virtual as these machines can. Whether or not Linux does it in exactly the same way as all of these machines remains to be seen. They all, in fact, create logical partitions in slightly different ways.

Where Can I Get Linux 2.6?

While no major commercial Linux distribution is shipping with a 2.6 kernel yet, most distributions are shipping with some 2.6 features that have been backported to 2.4. Red Hat's Enterprise Linux 3.0 has most of the big features, including the anticipatory I/O scheduler and the O(1) process scheduler. SuSE Linux Enterprise Server 8 has several 2.6 features, as well. Both of these distributors as well as others have committed to bring full-blown Linux 2.6 distros to market this year.

The new kernel is a huge step forward for Linux, and one that vendors are sure to jump on as soon as possible. Home users, of course, can download the source code and attempt to install the new kernel themselves, along with the updated system utilities to go with it. With an increase in everything but price, this isn't one to pass up.


Justin Ward is a Linux consultant with a bachelor's degree in computer science who works part-time as IT manager for Guild Companies' Linux and Windows cluster. He is looking for full-time work, and comes highly recommended by Guild Companies.

Editor: Timothy Prickett Morgan
Managing Editor: Shannon Pastore
Contributing Editors: Dan Burger, Joe Hertvik, Kevin Vandever,
Shannon O'Donnell, Victor Rozek, Hesh Wiener, Alex Woodie
Publisher and Advertising Director: Jenny Thomas
Advertising Sales Representative: Kim Reed

Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved.
Guild Companies, 50 Park Terrace East, Suite 8F, New York, NY 10034