This article is a guide to Python libraries for machine learning. It describes several libraries and their main features, so that readers can understand them and then go on to explore their more advanced capabilities.
Python is an excellent environment for building machine learning (ML) applications and systems. ML involves data analysis, data engineering, feature extraction and feature construction, and Python and its libraries are well suited to all of these tasks. In addition, Python is very popular and has become a de facto language for ML work, ahead of alternatives such as Ruby, Perl or R. Given the following advantages, it is a sound choice for building ML-based models or applications:
- It provides a wide variety of libraries, packages and tools for data analysis.
- It supports interactive computing through IPython and Jupyter Notebooks.
- It offers several data visualisation libraries to choose from.
- It has mature libraries such as Pandas, NumPy, Matplotlib, SciPy and Scikit-learn, which are among the most used in ML.
- It supports scientific computing, including interoperability with code written in C, C++ and Fortran.
- It is well suited to research and prototyping.
- Because Python integrates easily with code written in other languages, the same language can be used for prototyping and for production systems.
In this article, I will introduce the various Python libraries one by one, which should make it easier to understand how ML models and applications can be built with them. Where possible, I give examples as small code snippets. I have used the Anaconda distribution for Windows (https://www.anaconda.com/distribution/) for all the code examples here. Anaconda ships with many of these libraries and packages preinstalled, as shown in Figures 1 and 2.
NumPy
NumPy is short for Numerical Python. It is the foundational library for numerical computing in Python, and provides the data structures and algorithms needed by most applications that work with numerical data.
NumPy has a number of advantages as listed below:
- It is fast and efficient because its core is written in C.
- It supports multi-dimensional arrays through the ndarray object.
- It has built-in functions for mathematical operations and computations on whole arrays.
- Array operations are vectorised, so they avoid slow Python-level loops.
- It provides tools for reading array data from, and writing it to, files on disk.
- It provides a C API so that extensions written in C or C++ can access NumPy's data structures and computational facilities.
- It supports random number generation, Fourier transforms, linear algebra and matrix operations.
To import and start using the NumPy library, use the following command:

    import numpy as np
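As a rough illustration of the capabilities listed above, the following sketch creates a small ndarray and applies a few of the built-in operations. The array values and variable names are made up for the example; only the NumPy calls themselves are part of the library.

```python
import numpy as np

# A 2 x 3 ndarray of 64-bit floats (values invented for illustration)
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(a.shape)   # (2, 3)
print(a.dtype)   # float64

# Element-wise arithmetic runs in compiled code, with no Python loop
doubled = a * 2
col_means = a.mean(axis=0)   # mean of each column: [2.5, 3.5, 4.5]

# Linear algebra: solve the system m @ x = b
m = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(m, b)

# Random number generation with a seeded generator for reproducibility
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=5)

# Fourier transform of a short constant signal
spectrum = np.fft.fft(np.ones(4))
```

Seeding the generator is optional; it is done here only so that repeated runs produce the same samples.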
Pandas
Pandas is one of the most popular Python libraries for data manipulation and data cleaning. It is often used together with NumPy and SciPy, and with analytical and visualisation libraries. Pandas has the following capabilities:
- It integrates with the matplotlib library for plotting.
- It works well with Scikit-learn.
- It supports array-based functions and computing.
- Whereas NumPy is best suited to homogeneous numerical array data, Pandas is designed for tabular or heterogeneous data.
Pandas Series: A Series is a one-dimensional, array-like object containing a sequence of values of the same data type, together with an associated array of labels called its index.
In the following example, the Series ‘my_data’ holds four integers, and its data type is reported as ‘int64’.

    >>> import pandas as pd
    >>> my_data = pd.Series([1, 9, -5, 18])
    >>> my_data
    0     1
    1     9
    2    -5
    3    18
    dtype: int64

Various data operations can be performed on Series objects. We can supply our own index labels, select or assign values by label, filter with comparisons such as greater than, less than or equal to, and apply arithmetic to every element.

    >>> my_data2 = pd.Series([1, 9, -5, 18], index=['q', 'w', 'e', 'r'])
    >>> my_data2
    q     1
    w     9
    e    -5
    r    18
    dtype: int64
    >>> my_data2['w']
    9
    >>> my_data2['r'] = 98
    >>> my_data2
    q     1
    w     9
    e    -5
    r    98
    dtype: int64
    >>> my_data2[my_data2 < 100]
    q     1
    w     9
    e    -5
    r    98
    dtype: int64
    >>> my_data2 = my_data2 ** 2
    >>> my_data2
    q       1
    w      81
    e      25
    r    9604
    dtype: int64

Pandas DataFrames: A DataFrame in Pandas contains an ordered collection of columns, each of which can hold a different value type (Boolean, numeric, string and so on). Unlike a Series, a DataFrame has both a row index and a column index, so the data forms a two-dimensional table.
To import the Pandas library and build a DataFrame from a list, use the following commands:

    >>> import pandas as pd
    >>> my_list = ['shashi', 'dhar', 'soppin', 'is', 'writing', 'this', 'article']
    >>> print(my_list)
    ['shashi', 'dhar', 'soppin', 'is', 'writing', 'this', 'article']
    >>> df = pd.DataFrame(my_list)
    >>> df
             0
    0   shashi
    1     dhar
    2   soppin
    3       is
    4  writing
    5     this
    6  article

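The DataFrame above has a single column. As a minimal sketch of a DataFrame whose columns hold different value types, the following example builds one from a dictionary and selects from it by column, by row label and by condition. The column names, index labels and values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column with its own dtype
data = {
    "name": ["Asha", "Ben", "Chen"],      # strings
    "age": [31, 25, 42],                  # integers
    "score": [88.5, 92.0, 79.5],          # floats
    "passed": [True, True, False],        # Booleans
}
df = pd.DataFrame(data, index=["a", "b", "c"])

print(df.dtypes)                  # one dtype per column

ages = df["age"]                  # selecting a column returns a Series
row_b = df.loc["b"]               # selecting a row by its index label
over_30 = df[df["age"] > 30]      # Boolean filtering keeps rows 'a' and 'c'
mean_score = df["score"].mean()   # column-wise aggregation
```

Each column here behaves like a Series that shares the DataFrame's row index, which is why the Series operations shown earlier carry over.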
SciPy
SciPy is a fundamental library for scientific computing. It provides many efficient numerical routines, including routines for numerical integration, optimisation, linear algebra and statistics.
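To give a feel for these routines, here is a short sketch that calls one function from each area. The integrand, the objective function and the matrix are arbitrary choices made for the example, picked because their exact answers are easy to check by hand.

```python
import numpy as np
from scipy import integrate, linalg, optimize, stats

# Numerical integration: integral of sin(x) from 0 to pi (exact value is 2)
area, abs_err = integrate.quad(np.sin, 0, np.pi)

# Optimisation: minimise f(x) = (x - 3)^2 + 1, whose minimum is at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)

# Linear algebra: determinant of a 2 x 2 matrix (1*4 - 2*3 = -2)
det = linalg.det(np.array([[1.0, 2.0],
                           [3.0, 4.0]]))

# Statistics: cumulative probability of a standard normal at 0 (exactly 0.5)
p = stats.norm.cdf(0)

print(area, result.x, det, p)
```

The printed values are floating-point approximations, so they may differ from the exact answers in the last few digits.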
Scikit-learn
Scikit-learn is one of the most used ML libraries in Python. It provides simple and efficient tools for data mining and data analysis. The following are the major groups of algorithms and utilities supported by the library:
- Classification
- Regression
- Dimensionality reduction
- Model selection
- Preprocessing
- Clustering
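Several of these groups are typically combined in one workflow. The sketch below is one possible arrangement, not a recommended recipe: it uses the iris dataset bundled with the library, splits it into training and test sets (model selection), standardises the features (preprocessing) and fits a logistic regression classifier (classification). The choice of dataset, classifier, split size and random seed are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the bundled iris dataset: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Model selection: hold out a quarter of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing + classification chained in a single pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Fraction of test samples classified correctly
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in a pipeline means the scaling parameters are learned from the training data only and then reused on the test data.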