Data Analytics with Pandas: An Introduction


Data science has emerged as an essential skill today. Analysing Big Data to generate insights that help businesses profit more is one of the key skills required to be a data scientist. The Pandas library is an open source data analytics tool in Python. It is a must-learn for every aspiring data scientist.

In an article, the Harvard Business Review rated ‘data scientist’ as the sexiest job to have in the 21st century. There are multiple reasons for this, the primary one being the abundance of Big Data available today and the potential to draw key insights from it that could help companies make more profits. Another important development has been the rise of Python as the single most in-demand programming language for data analytics, thanks to the wide variety of libraries available in Python to clean, analyse and visualise data.
In this article, I will introduce you to the Pandas library, an open source, Python-based data analytics tool that helps us organise and structure data for easier and more intuitive analysis.

What is Pandas?
Since its initial release in 2008, Pandas has emerged as one of the most popular open source tools for organising, structuring and manipulating data. At the time of writing, its latest version (1.0.3) was released in March 2020. The library is written in Python, Cython and C. It has a huge and active open source community that supplements its development, and has big sponsors such as Anaconda and the Chan Zuckerberg Initiative, to name a few.

Using Pandas, you can load csv, xlsx, txt, json, xml, html tables, hierarchical (HDF5) data or SQL data, including zip-compressed files, and view it in the form of structured columns and rows. Formats such as images, pdf, docx, mp3 or mp4 are not read by Pandas itself and need other libraries. Once loaded, you can manipulate the data easily to add, edit or delete rows and columns.
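As a rough sketch of what loading different formats looks like, the snippet below reads the same small table once from csv text and once from JSON text. The in-memory strings and the variable names are made up for illustration; with real files you would normally pass a file path instead.

```python
import io
import pandas as pd

# Illustrative data held in memory instead of files on disk
csv_text = "States,Confirmed\nTelangana,3496\nDelhi,27654\n"
json_text = (
    '[{"States": "Telangana", "Confirmed": 3496},'
    ' {"States": "Delhi", "Confirmed": 27654}]'
)

# read_csv and read_json accept a path or a file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Both calls should give a data frame with the same two columns, which is the point: whatever the source format, you end up working with rows and columns.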

Installing Pandas
Installing Pandas is as straightforward as installing any other library in Python. Just open your terminal (make sure you have Pip installed) and run the following command:

pip install pandas
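To check that the installation worked, one simple option is to import the library and print its version from a Python prompt. This is only a quick sanity check; the version number you see depends on what Pip installed.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the version string tells you which release
print(pd.__version__)
```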

Loading data
As discussed earlier, you can load multiple types of data using Pandas.
I will create a simple csv file and load the data from that. For the purposes of this tutorial, I will use the examples of the current COVID-19 cases in different states across India.

States,Confirmed
Telangana,3496
Maharashtra,82968
Tamil Nadu,30152
Delhi,27654
Gujarat,19592

Figure 1: CSV data

In order to create a better visual for the data, I will use Jupyter Notebook (.ipynb) for this tutorial. However, you can create a simple .py file and run the program as well. In order to follow this tutorial using Jupyter notebook, do create a simple csv file as shown in Figure 1, and save it into the same folder as your .ipynb file.

Navigate to the folder of your csv file using the terminal and run Jupyter Notebook. The browser dashboard for Jupyter will open. Click New → Python 3. This will open up a fresh Python 3 notebook for you.

In the notebook, the first thing we need to do is import the Pandas library and load the csv file into a Pandas data frame. We do this using the read_csv() method in Pandas.

import pandas as pd

df = pd.read_csv('data.csv')

A data frame in Pandas is a two-dimensional, labelled data structure that holds data in rows and columns, much like a spreadsheet.
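If you do not have a csv file at hand, a data frame can also be built directly from a Python dictionary. The sketch below types in the values from Figure 1 by hand and then looks at the row and column labels; the variable name data is just an illustrative choice.

```python
import pandas as pd

# Same values as Figure 1, typed in directly (no csv file needed)
data = {
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592],
}
df = pd.DataFrame(data)

print(df.columns.tolist())  # column labels
print(df.index.tolist())    # row labels, numbered from 0 by default
print(df["Confirmed"])      # selecting one column gives a Series
```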

Visualising a data frame
In order to visualise the data frame, you can simply type the name of the variable you used to store the data frame in (in this case, it’s df) and run the code. That will give you a visual of the data as shown in Figure 2.

  States Confirmed
0 Telangana 3496
1 Maharashtra 82968
2 Tamil Nadu 30152
3 Delhi 27654
4 Gujarat 19592

Figure 2: Pandas data frame

However, printing the entire data frame may not always be ideal, especially if the data set is huge. You can instead use the head() method in Pandas to visualise the first five rows of data from the data frame.

#Visualize the first 5 rows of data frame
df.head()

You can also use the shape attribute in Pandas, which outputs a tuple that gives the total number of rows and columns in the data frame.

#Total no. of rows and columns
df.shape
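As a small illustration of these two tools together, head() also accepts the number of rows you want, and the shape tuple can be unpacked into two variables. The data frame is rebuilt here from the Figure 1 values so that the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592],
})

# head() takes an optional row count; the default is 5
print(df.head(3))

# shape is a (rows, columns) tuple, so it can be unpacked
rows, cols = df.shape
print(rows, cols)
```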

To quickly generate basic statistical details about your data frame, such as the mean and standard deviation of its numeric columns, you can use the describe() method in Pandas.

#Generate Basic statistical details
df.describe()

Figure 3 shows the statistical details generated using the describe() method.

Confirmed
count 5.000000
mean 32772.400000
std 29931.143126
min 3496.000000
25% 19592.000000
50% 27654.000000
75% 30152.000000
max 82968.000000

Figure 3: Output of the describe() method
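The values in Figure 3 can also be picked out individually. The sketch below, which rebuilds the data frame from the Figure 1 values, reads single entries from the describe() output and compares them with the mean() and std() methods of the column.

```python
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592],
})

stats = df.describe()  # a data frame indexed by count, mean, std, min, ...

# Pick single statistics out of the summary by row and column label
mean_from_describe = stats.loc["mean", "Confirmed"]
std_from_describe = stats.loc["std", "Confirmed"]

# The same numbers are available directly on the column
print(mean_from_describe, df["Confirmed"].mean())
print(std_from_describe, df["Confirmed"].std())
```

Note that describe() only summarises numeric columns by default, which is why the States column does not appear in Figure 3.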

Adding rows and columns
Let us add the confirmed cases of COVID-19 in Rajasthan to our data frame. To do this, we have to add another row of data: we create a dictionary of the values, wrap it in a one-row data frame, and join it to the existing one using the concat() function in Pandas. (The append() method that older versions offered for this was deprecated in Pandas 1.4 and removed in 2.0.) Figure 4 shows the output after adding Rajasthan to the data frame.

#adding a row to data frame

row = {'States': 'Rajasthan', 'Confirmed': 10331}
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
df
States Confirmed
0 Telangana 3496
1 Maharashtra 82968
2 Tamil Nadu 30152
3 Delhi 27654
4 Gujarat 19592
5 Rajasthan 10331

Figure 4: Adding a row to data frame
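One more way to add a single row, shown here only as a sketch, is label-based assignment with loc. It assumes the data frame still has its default index numbered from 0, so that len(df) is the next unused row label.

```python
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592],
})

# With the default 0..n-1 index, len(df) is the next free row label
df.loc[len(df)] = ["Rajasthan", 10331]

print(df)
```

This is convenient for one row at a time; for many rows it is usually better to collect them first and add them in one step.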

Now let’s go ahead and add the values for all the patients who have recovered from COVID-19 in each state. In order to do this, we will have to add a separate column to our data frame. Figure 5 shows the output of adding the recovered patients’ column to the data frame.

#Adding a column to data frame
recovered = [1710, 37390, 16395, 10664, 13316, 7501]
df['Recovered'] = recovered
df
States Confirmed Recovered
0 Telangana 3496 1710
1 Maharashtra 82968 37390
2 Tamil Nadu 30152 16395
3 Delhi 27654 10664
4 Gujarat 19592 13316
5 Rajasthan 10331 7501

Figure 5: Adding a column to the data frame
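A new column does not have to come from a list; it can also be computed from existing columns. As an illustration, the sketch below adds a hypothetical RecoveryRate column (recovered cases as a percentage of confirmed cases). The column name and the rounding are my own choices for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat", "Rajasthan"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592, 10331],
    "Recovered": [1710, 37390, 16395, 10664, 13316, 7501],
})

# Column arithmetic is applied row by row
df["RecoveryRate"] = (df["Recovered"] / df["Confirmed"] * 100).round(1)

print(df)
```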

Dropping rows and columns
Dropping or deleting rows and columns in a Pandas data frame is easy using the drop() method. We can drop a row by its index label, or remove rows that match a condition. We can even drop a range of rows. Since this is an introductory article, I will not get into the details but will show readers the simplest way to do this. Let us drop the row containing data for Rajasthan. This can be done using the following code. Figure 6 shows the data frame after dropping the row containing Rajasthan.

#drop a row from data frame

df = df.drop(df.index[5])
df
States Confirmed Recovered
0 Telangana 3496 1710
1 Maharashtra 82968 37390
2 Tamil Nadu 30152 16395
3 Delhi 27654 10664
4 Gujarat 19592 13316

Figure 6: Dropping a row from the data frame
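For readers who want a glimpse of the other approaches mentioned above, here is a minimal sketch of removing rows by a condition and dropping a range of rows. The threshold of 20,000 cases and the variable names are arbitrary choices for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat", "Rajasthan"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592, 10331],
})

# Condition: keep only rows with at least 20,000 confirmed cases
high = df[df["Confirmed"] >= 20000]

# Range: drop the rows at positions 1 and 2 (the slice end is exclusive)
trimmed = df.drop(df.index[1:3])

print(high)
print(trimmed)
```

Neither line changes df itself; each returns a new data frame, which is why the results are stored in separate variables.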

Now let us drop the ‘Recovered’ column from our data set. Figure 7 shows the data frame after dropping the ‘Recovered’ column.

#drop a column from data frame

df = df.drop('Recovered', axis=1)
df

The axis=1 argument tells the drop() method that the label refers to a column instead of a row.

States Confirmed
0 Telangana 3496
1 Maharashtra 82968
2 Tamil Nadu 30152
3 Delhi 27654
4 Gujarat 19592

Figure 7: Dropping a column from the data frame
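If you find axis=1 hard to remember, drop() also accepts a columns keyword that does the same thing. The sketch below rebuilds the data frame with the Recovered column so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592],
    "Recovered": [1710, 37390, 16395, 10664, 13316],
})

# Equivalent to df.drop('Recovered', axis=1); a list lets you drop several columns at once
df = df.drop(columns=["Recovered"])

print(df)
```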

Update values
You can update existing values in the rows of the data frame. Again, there are many different ways that you can do this. For the purpose of this tutorial, I am going to show you what I think is the simplest technique, using the at accessor in Pandas, which reads or sets a single value by its row label and column name.

Let’s say we want to update the value of confirmed cases in Maharashtra from 82,968 to 83,000. We can achieve the output shown in Figure 8 using the following piece of code:

#Edit/Update values

df.at[1, 'Confirmed'] = 83000
df
States Confirmed
0 Telangana 3496
1 Maharashtra 83000
2 Tamil Nadu 30152
3 Delhi 27654
4 Gujarat 19592

Figure 8: Updating a value in the data frame
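The at accessor needs the row label, which you may not know in a larger data set. As a sketch of one alternative, loc can select the row by a condition on the state name instead.

```python
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592],
})

# Select rows where States is Maharashtra, then set their Confirmed value
df.loc[df["States"] == "Maharashtra", "Confirmed"] = 83000

print(df)
```

Unlike at, this updates every row that matches the condition, so it is worth checking that the condition selects only what you intend.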

Exporting data
You can export your Pandas data frame into multiple file types such as JSON, csv, xlsx, etc. The following snippet shows the export methods for some of the commonly used file types:

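#export data frame to csv (the file names here are placeholders)
df.to_csv('covid.csv', index=False)

#export data frame to xlsx (needs an Excel engine such as openpyxl installed)
df.to_excel('covid.xlsx', index=False)

#export data frame to JSON
df.to_json('covid.json')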
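When no file path is given, to_csv() and to_json() return the exported text as a string instead of writing a file. The sketch below uses that to do a quick round trip in memory, exporting to csv text and reading it back; the variable names are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592],
})

# Without a path, the export methods return the text instead of writing a file
csv_text = df.to_csv(index=False)  # index=False leaves out the row labels
json_text = df.to_json(orient="records")

# Read the csv text back in to confirm nothing was lost
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(json_text)
print(restored)
```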

Looking forward
It is important to keep in mind that we have only touched the basics of the Pandas library in this article. There is a whole lot more to this library with a wide array of methods, attributes and advanced techniques of dealing with complex data. My book, ‘Data Analytics with Pandas for Absolute Beginners’, covers some of the more advanced concepts and methods in detail.
