IBM i Gets An Influx Of Machine Learning Tooling
July 25, 2018 Alex Woodie
Thanks to the new RPM and Yum open source software delivery methods unleashed by IBM earlier this year, IBM i shops can now run the latest in Python-based machine learning tools, including NumPy, SciPy, Pandas, and Scikit-Learn.
Interest in using Python to build machine learning models has exploded in recent years, and we’re now at the point where Python is considered to be the predominant language for doing data science, followed closely by R, which is taught more in academic settings.
While the capability to train or run a machine learning model isn’t generally the number one requirement for organizations when they select the IBM i server as their business computing platform, having popular Python libraries like NumPy and Scikit-Learn run on it sure doesn’t hurt, says Jesse Grozinski, IBM’s business architect for open source software on IBM i.
“We have right now at least a dozen different companies we have talked to who are actually interested in application development using these packages,” Grozinski told IT Jungle in a recent interview. “I can’t disclose completely what they’re doing, but they see that machine learning could possibly help them and they’re willing to do their own application development on these packages to see if it will help them.”
Here are the Python-based tools that IBM i shops can now use:
NumPy is a Python-based library for creating large, multi-dimensional arrays and matrices, and also includes a collection of high-level math functions to operate on these arrays. It was created by Travis Oliphant and released as open source through a BSD license in 2005.
Oliphant, who did his Ph.D. work in biomedical imaging at the Mayo Clinic College of Medicine and Science in Rochester, Minnesota, also had a hand in SciPy, another Python library, which has modules for optimization, linear algebra, integration, interpolation, special functions, FFT, and signal and image processing. SciPy was initially released in 2001 and is also open source with a BSD license.
Scikit-Learn is a collection of classification, regression, and clustering machine learning algorithms for Python. Specifically, it includes support vector machines, random forests, gradient boosting, k-means, and DBSCAN algorithms. It was originally created by David Cournapeau in 2007 and is designed to complement NumPy and SciPy. It also has a BSD license.
Pandas, meanwhile, is a Python-based software library that’s designed to help data scientists with data manipulation and analysis. It includes structures and operations for manipulating numerical tables and time-series data, reshaping data sets, and merging and joining data. It was initially created by Wes McKinney and released in 2008 with a BSD license.
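To give a rough sense of how these libraries fit together, here is a minimal Python sketch. It is an illustration only, not something IBM or any IBM i shop has described: the customer data, column names, and choice of two clusters are all made up for the example. Pandas holds the table, NumPy does the array math, and Scikit-Learn runs the k-means algorithm mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: order totals and item counts for six made-up customers.
orders = pd.DataFrame(
    {
        "customer": ["A", "B", "C", "D", "E", "F"],
        "order_total": [120.0, 135.0, 110.0, 980.0, 1020.0, 995.0],
        "item_count": [3, 4, 2, 25, 30, 27],
    }
)

# NumPy: pull the numeric columns into a 2-D array and standardize each column
# so that both features contribute on a comparable scale.
features = orders[["order_total", "item_count"]].to_numpy(dtype=float)
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Scikit-Learn: group the customers into two clusters with k-means.
# n_clusters=2 is an assumption chosen for this toy data set.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
orders["cluster"] = model.fit_predict(scaled)

# Pandas: summarize the average order size within each cluster.
summary = orders.groupby("cluster")[["order_total", "item_count"]].mean()
print(summary)
```

In a real application the table would presumably come from a database query rather than a hard-coded dictionary, and the number of clusters would need to be chosen with care.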
Grozinski admits that a lot of it is experimental at this point. “A dozen companies say machine learning might help us and eight of them might say ‘That didn’t really do what we wanted to do’ but there could be two or three who say ‘This is great,'” he said. “So we definitely have that interest.”
While Python was initially created as a high-level interpreted language for general-purpose computing, interest in the language has exploded thanks to the big data and data science phenomena. Even though the IBM i is not currently considered a platform for machine learning, it just made sense for IBM to bring the most popular Python-based data science packages to the IBM i.
“When we first released Python in 2015 in 5733-OPS, one of the disheartening things with the development team is we said, ‘Hey everybody, here’s Python.’ And the first things people wanted to do with it was machine learning stuff, and it didn’t work,” Grozinski said. “None of these packages work. NumPy didn’t work. Pandas didn’t work, SciPy didn’t work. It wasn’t even close to working. That was kind of disheartening for us. But now that we’re doing things in a way that’s aligned with the rest of the open source world, it’s just plain working.”
That experience is similar to the one that many Python-using data scientists went through in the early days of the big data boom. Getting all the various Python packages required to do productive data science work with the language was difficult. There were various package dependencies involving languages, operating systems, and middleware, and it proved extremely frustrating for scientists who just wanted to do science.
That problem was largely solved by Anaconda, which Oliphant founded in 2012 with Peter Wang. The Austin, Texas, company packaged up all the required Python software and delivered it as an open source distribution.
Anaconda’s software is a standard part of almost every data scientist’s toolkit at this point, and is downloaded at the rate of 2.5 million copies per month. Anaconda also has a partnership with IBM to run the data science software on System z mainframes and to include it in IBM’s PowerAI software.
Could the full Anaconda package soon be available to IBM i shops via the Yum installer? “All I can say is stay tuned,” Grozinski said. “We’re having interesting conversations.”