VoIP and the Search for Single Points of Failure
June 25, 2007 Dan Burger
At its most basic level, Voice over Internet Protocol (VoIP) is about moving verbal communications from phone lines to the Internet. In an article last week, Terry Boulais, the director of business development at Key Information Systems, began our introduction to VoIP by explaining its benefits and providing an overview of the technology that could usher in a new era of application integration that takes advantage of corporate data and communication efficiencies.
At Key Information Systems, Boulais is intimately involved in helping organizations deploy VoIP. His focus begins with network infrastructure, yet is closely tied to business planning, and in this report he talks about his experience of building a VoIP system in a little more detail.
When a company moves forward with VoIP, it takes on the responsibility for voice communications, including a plan for keeping this system available to its users. This is usually a job done by the phone company. As you’ll find out, it’s one thing to maintain basic phone service at a satisfactory level. It’s another thing to manage phones and the data that help desks and telesales people need to perform their jobs more effectively and create a customer experience that outshines the competition. Building a redundant system to ensure an expected level of availability is a big part of this VoIP phenomenon, and the more applications that are hooked into the VoIP system, the higher the degree of difficulty becomes in making sure it all stays available. But at the heart of VoIP is the management of one system rather than two and an application workload that seems ideal for the System i platform. As with all things, rewards don’t come without work.
Dan Burger: Last week you provided some network examination advice for companies that are considering a move from PBX phone systems to VoIP systems. What can you add to that in terms of preparation and planning?
Terry Boulais: The process involves walking through the different types of scenarios for downtime and determining how to address them. You have to look at the phones as hardware/software integration and look for the single points of failure on the hardware, within the application, and on the interfaces. That implies a disaster recovery strategy. If everything is running on the primary box and something causes that to go down, the organization needs to be prepared to do what needs to be done to get everything up and running on the secondary box and cover all the single points of failure in between.
DB: So what are some of the important things to examine?
TB: Start with hardware points of failure. If the power goes out, is there a redundant power supply? If a disaster prevents access to the building, what are my alternate ways to talk with the box? If a company doesn’t have a remote disaster recovery location, should one be put into place? Does the business plan require recovery in five minutes? Thirty minutes? Sixty minutes? If the business cannot afford that 30 minutes of downtime, it needs to eliminate that single point of failure. If you are preparing a disaster recovery plan, you don’t want your system backup on the same box within a separate partition. You will want the second partition on a box at another location.
On the hardware side, power is the number one point of failure. I’m talking about power outages. California is famous for its rolling blackouts, as an example. Addressing that problem requires a backup power supply; otherwise you sit and wait it out and put up with the downtime until power is restored.
Another hardware example might be a switch that goes out. A switch can be replaced in 15 minutes if you have a spare one on hand. Most people do not unless they rob it from another department. Companies that have a plan in place usually have that “spare” switch in a test lab or in an active place where they know it can be borrowed in an emergency. Most will not spend $15,000, or whatever the cost of a switch is, just to have it in a closet waiting for an emergency.
For a lot of businesses, it comes down to customer service satisfaction. Those who can’t afford a five-minute or 15-minute outage need to begin with fault tolerance to determine how long an outage they can endure. If it’s a very short period, they go to high availability, and if they are concerned about the effect of a disaster on their business, they go to disaster recovery.
DB: 3Com, which first brought VoIP to the System i by porting its Linux solution to a partition and then integrating it with i5/OS features, has fault tolerance built into its VoIP software. Would you explain the importance of this and shed a little more light on applications as a point of failure?
TB: An example of fault tolerance would be to establish the promise to be up and running in a reasonable amount of time–say 30 minutes–on the backup box. Some of the things involved in this process include automatically initiating the failover, starting the apps in a certain order, and teaching users to point to the second box.
The 3Com software automatically picks up on the second partition if there is an issue. When you take that IP application and you have it at fault tolerance, it is role swapping on the fly. The telephone users never recognize the change. However, the VoIP application is only giving you voice communications capabilities on the second box. To get the additional computer applications swapped you need the HA partners to mirror those things over. That process includes the user piece and interface piece of the pie.
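The failover steps Boulais lists (initiate the failover, start the applications in a set order, point users at the second box) can be written down as an ordered runbook and checked against the recovery target. The following is only a minimal sketch: the application names, host name, and per-step minute estimates are hypothetical, and it does not represent how the 3Com software or any HA product works internally.

```python
# Illustrative failover runbook; all names and durations are assumptions.

def failover_steps(apps_in_order, backup_host):
    """Return the failover steps in the order they must run."""
    steps = [f"initiate failover to {backup_host}"]
    # Applications start in the given order, e.g. voice before dependent apps.
    steps += [f"start {app}" for app in apps_in_order]
    steps.append(f"repoint users to {backup_host}")
    return steps


def meets_target(step_minutes, target_minutes=30):
    """Check whether the estimated total time fits the recovery target."""
    return sum(step_minutes) <= target_minutes


steps = failover_steps(["voip", "order-entry"], "backup-box")
for step in steps:
    print(step)
```

Writing the order down this way makes it easy to see that adding applications lengthens the sequence, which is why the recovery target has to be revisited as more applications are tied to the phone system.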
DB: It sounds like companies that already have high availability or disaster recovery programs in place will be the most likely early adopters of VoIP. But I also see a potential problem because it’s my understanding that a lot of companies that have HA and DR are not testing to confirm they are truly ready for the time when a switchover is needed. Isn’t that an issue?
TB: I’d say about 90 percent of companies do not test their disaster recovery plan on a regular basis. The reason is that the testing involves more than just the software. Piece number one is the data portion, which is why the HA software is purchased. Piece number two is the users. It doesn’t matter how the users access that box, there needs to be a means to get them over to the other box when the switchover comes. For some companies, that process is hard. Piece number three is the interfaces. Whatever interfaces are going in and out of that production box (an example would be interfaces that download data from various vendors), you have to point those interfaces to the second box and that part is the hardest. That is the piece that stops most people from testing their DR plan. And pieces two and three do not rely on the availability software.
With HA software, you can actually role swap the box and not bring the users and the interfaces. And every single person using HA software can do that. Just flip it over to prove it works and flip it back. That takes five or 10 minutes. Bringing the users and the interfaces over is a pain in the butt. Among other things, Ethernet cards and IP addresses need to be duplicated and interfaces have to be pointed to the new IP address and the interface needs to start at the right sequence number. It usually takes 30 to 60 minutes to test and most people don’t want to take that time to test.
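The interface part of a switchover test, as described above, comes down to two checks per interface: does it point at the secondary box's IP address, and will it resume at the right sequence number. A rough sketch of that check follows. The `Interface` record, the interface names, the addresses, and the rule that the "right" sequence number is the last one sent plus one are all assumptions made for illustration, not details from the interview.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    name: str
    target_ip: str       # address the interface currently points at
    next_sequence: int   # sequence number it will resume from


def switchover_problems(interfaces, secondary_ip, last_sequence_sent):
    """List what still blocks each interface from running on the secondary box."""
    problems = []
    for itf in interfaces:
        if itf.target_ip != secondary_ip:
            problems.append(f"{itf.name}: still points at {itf.target_ip}")
        # Assumption: resuming correctly means continuing at last sent + 1.
        expected = last_sequence_sent[itf.name] + 1
        if itf.next_sequence != expected:
            problems.append(
                f"{itf.name}: resumes at {itf.next_sequence}, expected {expected}"
            )
    return problems


interfaces = [
    Interface("vendor-feed", "10.0.2.5", 101),
    Interface("price-download", "10.0.1.5", 40),
]
for problem in switchover_problems(
    interfaces, "10.0.2.5", {"vendor-feed": 100, "price-download": 41}
):
    print(problem)
```

An empty result would mean the interfaces are ready to follow the users to the secondary box; anything else is the manual work that makes a full test take 30 to 60 minutes.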
DB: Aren’t the people who need applications in addition to the voice communication piece going to be haunted by the same problem that prevents them from testing their HA systems?
TB: Sure. Take an application where somebody calls into the front desk and the person at the front desk wants to enter the caller’s name and account number into an application. That application has to be on the secondary box along with the VoIP application, and the person at the front desk has to be connected to that secondary box. And that’s where a decision has to be made: if those other applications are moved over in 15 minutes, is that OK? Is an hour OK? This is what is meant by tolerance for downtime.
DB: What can you briefly say about bandwidth determination and knowing exactly what a company will need?
TB: 3Com has an assessment tool that works based on the number of users and the number of phones. If a company is just doing generic phone conversations, it will take up a certain amount of bandwidth as an average. As you add bells and whistles, it adds to it.
I tell people that as they prepare to bring additional applications over to the secondary box, those applications will need to be mirrored, and that takes bandwidth. This bandwidth factor is based on transactions per hour. I help them size this aspect. There is an assessment for the VoIP portion and a separate assessment for a disaster recovery or high availability strategy.
Most of the major organizations have a pretty decent network. But for those that don’t, if extra bandwidth is needed, it may be because a company doesn’t have an Ethernet network in place. In that case there will be a need for new CAT 5 or CAT 6 cable and switches that can handle a 10/100 Mbit or Gigabit Ethernet load. With regard to the switches, the traffic on the local network needs to be segmented among the applications, interfaces, and the IP for the phone system. You don’t want to push all the bits and bytes to the other locations where they are not needed.
If the secondary box is going to be on a different network in a different location, then T1 or T3 lines become necessary. The more locations that are networked, the bigger the pipe needs to be.
There are two important factors in sizing bandwidth: determining whether the local network needs upgraded cables and switches and determining how big the pipes need to be between the two locations. That can be adjusted as you go along. You could jump from a T1 to a T3 and it’s no big deal, but if you don’t have the proper switches and you want to do video conferencing and multimedia, for example, you will need to upgrade to 10/100 Mbit or Gigabit Ethernet.
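As a rough illustration of sizing the pipe between two locations, the sketch below adds a voice estimate to a mirroring estimate and compares the total with T1 (1.544 Mbit/s) and T3 (44.736 Mbit/s) line rates. It is not the 3Com assessment tool. The per-call figure assumes the G.711 codec with 20 ms packets and Ethernet framing (about 87.2 kbit/s in each direction), and the average transaction size and the 75 percent headroom factor are hypothetical inputs that a real assessment would measure.

```python
T1_KBPS = 1544.0
T3_KBPS = 44736.0
# Assumption: G.711, 20 ms packets, IP/UDP/RTP plus Ethernet overhead, one direction.
G711_KBPS_PER_CALL = 87.2


def voice_kbps(concurrent_calls, per_call_kbps=G711_KBPS_PER_CALL):
    """Bandwidth for simultaneous calls crossing the link."""
    return concurrent_calls * per_call_kbps


def mirroring_kbps(transactions_per_hour, avg_bytes_per_transaction):
    """Average bandwidth to mirror application data, from transactions per hour."""
    bits_per_hour = transactions_per_hour * avg_bytes_per_transaction * 8
    return bits_per_hour / 3600 / 1000


def smallest_line(required_kbps, headroom=0.75):
    """Pick the smallest line that carries the load within the headroom factor."""
    for name, capacity in (("T1", T1_KBPS), ("T3", T3_KBPS)):
        if required_kbps <= capacity * headroom:
            return name
    return "more than one T3"


total = voice_kbps(10) + mirroring_kbps(36000, 2000)
print(round(total, 1), smallest_line(total))
```

Because the mirroring term is an hourly average, bursty transaction loads would need a larger allowance than this estimate suggests.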
Editor’s note: In a future edition of The Four Hundred, we will continue this VoIP discussion as the topics change to the role of the telecom carriers, deployment schedules, implementation costs, and training. We hope you’ll come back for more.