Splunking Through IBM i Log Mysteries
March 16, 2015 Alex Woodie
Tracking down problems in the modern data center can be an exercise in extreme patience. Today’s mantra of “loosely coupled yet tightly integrated” sounds great in theory, but doesn’t give you much to go on when things start to fail. When the going got tough for one multi-billion-dollar ecommerce operation that relies heavily on the IBM i server, it turned to Splunk to sort it all out.
Splunk made a name for itself as a provider of next-gen power tools that enable IT administrators to collect, analyze, and search vast volumes of log files generated by a multitude of systems. Instead of trying to track the log data using Excel or grep (a Unix command-line utility), administrators can use Splunk Enterprise, which provides a powerful and intuitive interface to correlate and make sense of the mess of data coming in.
The IBM i server is one of the log-file-generating servers that customers can monitor with Splunk. In a case study titled Finding Order(s) in the Chaos, Splunk detailed how a large IBM i shop successfully deployed Splunk Enterprise to track down and fix problems occurring within its complex IT system.
The problems stemmed from the retailer’s use of multiple, custom-built, ERP applications that face both internally and externally, including ordering systems running on IBM i, transaction processing systems running on Tuxedo, a webMethods enterprise service bus, and generous helpings of JMS and MQ messaging services and XML documents, all hooked together in a loosely coupled services oriented architecture.
The retailer’s setup evolved over the years and could scale to tremendous heights. (The retailer did not give Splunk permission to use its name, so it’s only described through its figures: $25 billion in annual revenue and more than 90,000 employees in over 25 countries.) However, the non-uniformity of the system made tracking down problems a real bear.
“Whenever orders went missing,” Splunk says in the case study, “troubleshooting issues through such a complex stack was a laborious task that required extensive IT resources over long periods of time. Since the IT systems involved–SOA, EAI-connected and platform-based services–were loosely integrated and the underlying data generated was siloed, tracing a particular customer’s transaction end-to-end was nearly impossible.”
Compounding the problems was the fact that multiple applications had common fields but used different identifiers, which made it extremely difficult to correlate data. “As a result,” Splunk says, “issue tracking and resolution often took weeks or longer, which undermined productivity and customer satisfaction.”
Instead of ripping out a working enterprise system in the hopes of gaining greater uniformity, as so many IBM i shops have attempted to do over the years, the retailer decided to bring in Splunk Enterprise, which features a powerful NoSQL database that’s able to store and correlate massive amounts of unstructured data.
The retailer uses the Splunk Processing Language (SPL) to extract relevant data from the various servers and to build end-to-end traces for purchase orders as they traverse the entire system. One of the key advantages that Splunk brought the retailer was the ability to generate a unique transaction ID for each order.
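To give a feel for what that kind of correlation involves, here is a minimal sketch in Python. It is not the retailer's SPL, which the case study does not publish. The source names, field names (ORDNUM, order_ref, orderId), the normalization rule, and the sample events are all hypothetical, chosen only to illustrate systems that label the same order differently.

```python
import uuid
from collections import defaultdict

# Hypothetical: each system's own name for the field holding the order number.
ORDER_ID_FIELDS = {
    "ibmi_orders": "ORDNUM",
    "tuxedo": "order_ref",
    "webmethods_esb": "orderId",
}


def normalize_order_id(raw):
    """Reduce differently formatted order numbers to one comparable key.

    Assumed rule: trim whitespace, drop leading zeros, uppercase.
    """
    return str(raw).strip().lstrip("0").upper()


def build_traces(events):
    """Group events from all systems by order and sort each group by time.

    Returns a dict mapping a generated transaction ID to that order's events.
    """
    by_order = defaultdict(list)
    for event in events:
        field = ORDER_ID_FIELDS[event["source"]]
        by_order[normalize_order_id(event[field])].append(event)

    traces = {}
    for key, order_events in by_order.items():
        order_events.sort(key=lambda e: e["ts"])
        # uuid5 is deterministic, so the same order always gets the same ID.
        transaction_id = str(uuid.uuid5(uuid.NAMESPACE_URL, "order:" + key))
        traces[transaction_id] = order_events
    return traces


# Hypothetical sample events; "ts" is a simple ordering timestamp.
EVENTS = [
    {"source": "webmethods_esb", "ts": 2, "orderId": "1001 "},
    {"source": "ibmi_orders", "ts": 1, "ORDNUM": "0001001"},
    {"source": "tuxedo", "ts": 3, "order_ref": 1001},
    {"source": "ibmi_orders", "ts": 4, "ORDNUM": "0002002"},
]

traces = build_traces(EVENTS)
for transaction_id, order_events in traces.items():
    print(transaction_id, [e["source"] for e in order_events])
```

Deriving the transaction ID deterministically from the normalized order number means an order keeps the same ID no matter which system's log turns up first. A real deployment would need normalization rules agreed for each source.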
A number of dashboards were also built that enable the retailer’s IT and business users to visualize orders as they progress through the system. These dashboards were critical for achieving the “management by exception” required for tracking problems occurring in an environment that processes 4 million orders per hour.
Splunk says that, by giving the retailer the capability to track customer orders, it has prevented up to 100 lost orders per week, with a corresponding $900,000 in annual revenue recovery. Splunk provides other benefits as well, including the capability to see when service level agreement obligations have been violated.