Here’s an introduction to Pandas, an open source software library that’s written in Python for data manipulation and analysis. Pandas facilitates the manipulation of numerical tables and the time series.
In recent times, it has been proven again and again that data has become an increasingly important resource. Now, with the Internet boom, large volumes of data are being generated every second. To stay ahead of the competition, companies need efficient ways of analysing this data, which can be represented as a matrix, using Python’s mathematical package, NumPy.
The problem with NumPy is that it doesn’t have sufficient data analysis tools built into it. This is where Pandas comes in. It is a data analysis package, which is built to integrate with NumPy arrays. Pandas has a lot of functionality, but we will cover only a small portion of it in this article.
Getting started
Installing Pandas is a one-step process if you use Pip. Run the following command to install Pandas.
sudo pip install pandas
If you face any difficulties, visit http://pandas.pydata.org/pandas-docs/stable/install.html. You can now try importing Pandas into your Python environment by issuing the following command:
import pandas
In this tutorial, we will be using data from Weather Underground. The dataset for this article can be downloaded from http://www.synesthesiam.com/assets/weather_year.csv and can be imported into Pandas using:
data = pandas.read_csv(weather_year.csv)
The read_csv function creates a dataframe. A dataframe is a tabular representation of the data read. You can get a summary of the dataset by printing the object. The output of the print is as follows:
data <class pandas.core.frame.DataFrame> Int64Index: 366 entries, 0 to 365 Data columns: EDT 366 non-null values Max TemperatureF 366 non-null values Mean TemperatureF 366 non-null values Min TemperatureF 366 non-null values Max Dew PointF 366 non-null values MeanDew PointF 366 non-null values Min DewpointF 366 non-null values Max Humidity 366 non-null values Mean Humidity 366 non-null values Min Humidity 366 non-null values Max Sea Level PressureIn 366 non-null values Mean Sea Level PressureIn 366 non-null values Min Sea Level PressureIn 366 non-null values Max VisibilityMiles 366 non-null values Mean VisibilityMiles 366 non-null values Min VisibilityMiles 366 non-null values Max Wind SpeedMPH 366 non-null values Mean Wind SpeedMPH 366 non-null values Max Gust SpeedMPH 365 non-null values PrecipitationIn 366 non-null values CloudCover 366 non-null values Events 162 non-null values WindDirDegrees 366 non-null values dtypes: float64(4), int64(16), object(3)
As you can see, there are 366 entries in the given dataframe. You can get the column names using data.columns.
The output of the command is given below:
data.columns
Index([EDT, Max TemperatureF, Mean TemperatureF, Min TemperatureF, Max Dew PointF, MeanDew PointF, Min DewpointF, Max Humidity, Mean Humidity, Min Humidity, Max Sea Level PressureIn, Mean Sea Level PressureIn, Min Sea Level PressureIn, Max VisibilityMiles, Mean VisibilityMiles, Min VisibilityMiles, Max Wind SpeedMPH, Mean Wind SpeedMPH, Max Gust SpeedMPH, PrecipitationIn, CloudCover, Events, WindDirDegrees], dtype=object)
To print a particular column of the dataframe, you can simply index it as data[‘EDT’] for a single column or data[[‘EDT’,’Max Humidity’]] for multiple columns. The output for data[‘EDT’] is:
data[EDT] 0 2012-3-10 1 2012-3-11 2 2012-3-12 3 2012-3-13 4 2012-3-14 5 2012-3-15 6 2012-3-16 ... ... ... 361 2013-3-6 362 2013-3-7 363 2013-3-8 364 2013-3-9 365 2013-3-10 Name: EDT, Length: 366
And the output for data[[EDT,Max Humidity]] is:
data[[EDT,Max Humidity]]
<class pandas.core.frame.DataFrame> Int64Index: 366 entries, 0 to 365 Data columns: EDT 366 non-null values Max Humidity 366 non-null values dtypes: int64(1), object(1)
Sometimes, it may be useful to only view a part of the data, just so that you can get a sense of what kind of data you are dealing with. Here you can use the head and tail functions to view the start and end of your dataframe:
data[Max Humidity].head()
0 74 1 78 2 90 3 93 4 93 Name: Max Humidity
Note: The head and tail functions take a parameter which sets the number of rows to be displayed. And can be used as data[Max Humidity].head(n), where ‘n’ is the number of rows. The default is 5.
Working with columns
Now that we have a basis on which to work with our dataframe, we can explore various useful functions provided by Pandas like std to compute the standard deviation, mean to compute the average value, sum to compute the sum of all elements in a column, etc. So if you want to compute the mean of the Max Humidity column, for instance, you can use the following commands:
data['Max Humidity'].mean()
90.027322404371589 data['Max Humidity'].sum() 32950 data['Max Humidity'].std() 9.10843757197798
Note: Most of the Pandas functions ignore NaNs, by default. These regularly occur in data and a convenient way of handling them must be established. This topic is covered more in detail later in this article.
The std and sum function can be used in a similar manner. Also, rather than running these functions on individual columns, you can run them on the entire dataframe, as follows:
data.mean()
Max TemperatureF 66.803279 Mean TemperatureF 55.683060 Min TemperatureF 44.101093 Max Dew PointF 49.549180 MeanDew PointF 44.057377 Min DewpointF 37.980874 Max Humidity 90.027322 Mean Humidity 67.860656 Min Humidity 45.193989 Max Sea Level PressureIn 30.108907 Mean Sea Level PressureIn 30.022705 Min Sea Level PressureIn 29.936831 Max VisibilityMiles 9.994536 Mean VisibilityMiles 8.732240 Min VisibilityMiles 5.797814 Max Wind SpeedMPH 16.418033 Mean Wind SpeedMPH 6.057377 Max Gust SpeedMPH 22.764384 CloudCover 2.885246 WindDirDegrees 189.704918
Using apply for bulk operations
As we have already seen, functions like mean, std and sum work on entire columns, but sometimes it may be useful to apply our own functions to entire columns of the dataframe. For this purpose, Pandas provides the apply function, which takes an anonymous function as a parameter and applies to every element in the column. In this example, let us try to get the square of every element in a column. We can do this with the following code:
data[Max Humidity].apply(lambda d: d**2) 0 5476 1 6084 2 8100 3 8649 4 8649 5 8100 ... ... ... 361 8464 362 7225 363 7744 364 5625 365 2916 Name: Max Humidity, Length: 366
Note: In the Lambda function, the parameter d is implicitly passed to it by Pandas, and contains each element of the a column.
Now you may wonder why you can’t just do this with a loop. Well, the answer is that this operation was written in one single line, which saves code writing time and is much easier to read.
Dealing with NaN values
Pandas provides a function called isnull, which returns a ‘True’ or ‘False’ value depending on whether the value of an element in the column is NaN or None. These values are treated as missing values from the dataset, and so it is always convenient to deal with them separately. We can use the apply function to test every element in a column to see if any NaNs are present. You can use the following command:
e = data[Events].apply(lambda d: pandas.isnull(d)) e 0 True 1 False 2 False 3 True 4 True 5 False ... 361 False 362 True 363 True 364 True 365 True Name: Events, Length: 366
As you can see, a list of Booleans was returned, representing values that are NaN. Now there are two options of how to deal with the NaN values. First, you can choose to drop all rows with NaN values using the dropna function, in the following manner:
data.dropna(subset=[Events]) <class pandas.core.frame.DataFrame> Int64Index: 162 entries, 1 to 361 Data columns: EDT 162 non-null values Max TemperatureF 162 non-null values Mean TemperatureF 162 non-null values Min TemperatureF 162 non-null values Max Dew PointF 162 non-null values MeanDew PointF 162 non-null values Min DewpointF 162 non-null values Max Humidity 162 non-null values Mean Humidity 162 non-null values Min Humidity 162 non-null values Max Sea Level PressureIn 162 non-null values Mean Sea Level PressureIn 162 non-null values Min Sea Level PressureIn 162 non-null values Max VisibilityMiles 162 non-null values Mean VisibilityMiles 162 non-null values Min VisibilityMiles 162 non-null values Max Wind SpeedMPH 162 non-null values Mean Wind SpeedMPH 162 non-null values Max Gust SpeedMPH 162 non-null values PrecipitationIn 162 non-null values CloudCover 162 non-null values Events 162 non-null values WindDirDegrees 162 non-null values dtypes: float64(4), int64(16), object(3)
As you can see, there are only 162 rows, which don’t contain NaNs in the column Events. The other option you have is to replace the NaN values with something easier to deal with using the fillna function. You can do this in the following manner:
data[Events].fillna() 0 1 Rain 2 Rain 3 4 5 Rain-Thunderstorm 6 7 Fog-Thunderstorm 8 Rain 362 363 364 365 Name: Events, Length: 366
Accessing individual rows
So far we have discussed methods dealing with indexing entire columns, but what if you want to access a specific row in your dataframe? Well, Pandas provides a function called irow, which lets you get the value of a specific row. You can use it as follows:
data.irow(0) EDT 2012-3-10 Max TemperatureF 56 Mean TemperatureF 40 Min TemperatureF 24 Max Dew PointF 24 MeanDew PointF 20 Min DewpointF 16 Max Humidity 74 Mean Humidity 50 Min Humidity 26 Max Sea Level PressureIn 30.53 Mean Sea Level PressureIn 30.45 Min Sea Level PressureIn 30.34 Max VisibilityMiles 10 Mean VisibilityMiles 10 Min VisibilityMiles 10 Max Wind SpeedMPH 13 Mean Wind SpeedMPH 6 Max Gust SpeedMPH 17 PrecipitationIn 0.00 CloudCover 0 Events NaN WindDirDegrees 138 Name: 0
Note: Indices start from 0 for indexing the rows.
Filtering
Sometimes you may need to find rows of special interest to you. Lets suppose we want to find out data points in our data frame, which have a mean temperature greater than 40 and less than 50.You can filter out values from your dataframe using the following syntax:
data[(data['Mean TemperatureF']>40) & (data['Mean TemperatureF']<50)] <class 'pandas.core.frame.DataFrame'> Int64Index: 51 entries, 1 to 364 Data columns: EDT 51 non-null values Max TemperatureF 51 non-null values Mean TemperatureF 51 non-null values Min TemperatureF 51 non-null values Max Dew PointF 51 non-null values MeanDew PointF 51 non-null values Min DewpointF 51 non-null values Max Humidity 51 non-null values Mean Humidity 51 non-null values Min Humidity 51 non-null values Max Sea Level PressureIn 51 non-null values Mean Sea Level PressureIn 51 non-null values Min Sea Level PressureIn 51 non-null values Max VisibilityMiles 51 non-null values Mean VisibilityMiles 51 non-null values Min VisibilityMiles 51 non-null values Max Wind SpeedMPH 51 non-null values Mean Wind SpeedMPH 51 non-null values Max Gust SpeedMPH 51 non-null values PrecipitationIn 51 non-null values CloudCover 51 non-null values Events 23 non-null values WindDirDegrees 51 non-null values dtypes: float64(4), int64(16), object(3)
Note: The output of the condition data[Mean TemperatureF]>40 and data[Mean TemperatureF]<50 return a NumPy array, and we must use the brackets to separate them before using the & operator, or else you will get an error message saying that the expression is ambiguous.
Now you can easily get meaningful data from your dataframe by simply filtering out the data that you aren’t interested in. This provides you with a very powerful technique that you can use in conjunction with higher Pandas functions to understand your data.
Getting data out
You can easily write data out by using the to_csv function to write your data out as a csv file.
data.to_csv(weather-mod.csv) Want to make a separate tab? No problem. data.to_csv(data/weather-mod.tsv, sep=\t)
Note: Generally, the dataframe can be indexed by any Boolean NumPy array. In a sense, only values that are true will be retained. For example, if we use the variable e, (e = data[Events].apply(lambda d: pandas.isnull(d))) which contains the list of all rows that have NaN values for data[Events], as data[e], we will get a dataframe which has rows that only have NaN values for data[Events]
You can also write your data out in other formats that can be found on the Pandas doc.
This article only covers the basics of what can be done with Pandas. It also supports a lot of higher level functions like plotting data to give a better feel of the data being dealt with. If you want to learn more about Pandas, check the online documentation. It is very readable, user-friendly and is a great place to get a better understanding of how Pandas works. Frameworks like Pandas let a Python application take advantage of such data analysis tools easily. There are also other languages which support data analysis and you may want to check them out. These include R, MATLAB, Julia and Octave. To wrap up, these languages and packages greatly increase your understanding of data. In a world where data is becoming increasingly important, it is critical that we deal with our data smartly.