Data science has emerged as an essential skill today. Analysing Big Data to generate insights that help businesses profit more is one of the key skills required to be a data scientist. The Pandas library is an open source data analytics tool in Python. It is a must-learn for every aspiring data scientist.
In an article, the Harvard Business Review called ‘data scientist’ the sexiest job of the 21st century. There are multiple reasons for this, the primary one being the abundance of Big Data available today and the potential to draw key insights from it that could help companies make more profits. Another important development has been the rise of Python as the single most in-demand programming language for data analytics, thanks to the wide variety of Python libraries available to clean, analyse and visualise data.
In this article, I will introduce you to the Pandas library, an open source Python-based data analytics tool that helps us organise and structure data so that it is easier and more intuitive to analyse.
What is Pandas?
Since its initial release in 2008, Pandas has emerged as the most popular open source tool for organising, structuring and manipulating data. At the time of writing, its latest version (1.0.3) had been released in March 2020. The library is written in Python, Cython and C. It has a huge and active open source community that supports its development, and has big sponsors such as Anaconda and the Chan Zuckerberg Initiative, to name a few.
Pandas has readers for tabular and semi-structured formats such as csv, xlsx, txt, json, xml, html tables, hierarchical data (HDF5) and SQL query results, including zip-compressed text files, and presents the data as structured columns and rows. Formats such as images, pdf, docx, mp3 and mp4 have no built-in reader in Pandas and must first be converted to tabular form with other libraries. Once the data is loaded, you can manipulate it easily to add, edit or delete rows and columns.
Installing Pandas
Installing Pandas is as straightforward as installing any other library in Python. Just open your terminal (make sure you have Pip installed) and run the following command:
pip install pandas
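To check that the installation worked, one quick option is to import the library and print its version. This is only a sanity check; the version number you see will depend on what Pip installed on your machine.

```python
import pandas as pd

# If this import succeeds, Pandas is installed in the current environment.
# The printed version depends on your installation.
print(pd.__version__)
```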
Loading data
As discussed earlier, you can load multiple types of data using Pandas.
I will create a simple csv file and load the data from that. For the purposes of this tutorial, I will use the example of the confirmed COVID-19 cases in different states across India at the time of writing.
States | Confirmed |
Telangana | 3496 |
Maharashtra | 82968 |
Tamil Nadu | 30152 |
Delhi | 27654 |
Gujarat | 19592 |
Figure 1: CSV data
To display the data more clearly, I will use Jupyter Notebook (.ipynb) for this tutorial. However, you can create a simple .py file and run the program as well. To follow this tutorial using Jupyter Notebook, create a simple csv file containing the data shown in Figure 1, name it data.csv, and save it in the same folder as your .ipynb file.
Navigate to the folder of your csv file using the terminal and run Jupyter Notebook. The browser dashboard for Jupyter will open. Click New → Python 3. This will open up a fresh Python 3 notebook for you.
In the notebook, the first thing we need to do is import the Pandas library and load the csv file into a Pandas data frame. We do this using the read_csv() function in Pandas.
import pandas as pd
df = pd.read_csv('data.csv')
A data frame in Pandas is a two-dimensional data structure that stores data in labelled rows and columns, much like a spreadsheet.
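If you would like to try the loading step without creating a file first, here is a minimal sketch that keeps the same data in memory. The csv_text variable is my own stand-in for data.csv; read_csv() accepts a file-like object as well as a file name.

```python
import io

import pandas as pd

# Stand-in for data.csv: the same rows as Figure 1, held in a string
csv_text = """States,Confirmed
Telangana,3496
Maharashtra,82968
Tamil Nadu,30152
Delhi,27654
Gujarat,19592
"""

# read_csv() uses the first line as the column names
df = pd.read_csv(io.StringIO(csv_text))

# Each column has its own data type
print(df.dtypes)
```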
Visualising a data frame
In order to visualise the data frame, you can simply type the name of the variable you used to store the data frame in (in this case, it’s df) and run the code. That will give you a visual of the data as shown in Figure 2.
 | States | Confirmed |
0 | Telangana | 3496 |
1 | Maharashtra | 82968 |
2 | Tamil Nadu | 30152 |
3 | Delhi | 27654 |
4 | Gujarat | 19592 |
Figure 2: Pandas data frame
However, printing the entire data frame may not always be ideal, especially if the data set is huge. You can instead use the head() method in Pandas to display the first five rows of data from the data frame.
# Display the first five rows of the data frame
df.head()
You can also use the shape attribute in Pandas, which outputs a tuple that gives the total number of rows and columns in the data frame.
# Total number of rows and columns
df.shape
To quickly generate basic statistical details about the numeric columns of your data frame, such as the mean and standard deviation, you can use the describe() method in Pandas.
# Generate basic statistical details
df.describe()
Figure 3 shows the statistical details generated using the describe() method.
 | Confirmed |
count | 5.000000 |
mean | 32772.400000 |
std | 29931.143126 |
min | 3496.000000 |
25% | 19592.000000 |
50% | 27654.000000 |
75% | 30152.000000 |
max | 82968.000000 |
Figure 3: Output of the describe() method
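The three inspection tools above can be tried together in a plain .py file. The sketch below rebuilds the sample data from a dictionary so that it runs on its own; the variable names first_rows, n_rows, n_cols and stats are my own choices.

```python
import pandas as pd

# Rebuild the sample data so the example is self-contained
df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592],
})

# head() takes an optional row count; the default is 5
first_rows = df.head(3)

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = df.shape

# describe() summarises the numeric columns only
stats = df.describe()

print(first_rows)
print(n_rows, n_cols)
print(stats.loc["mean", "Confirmed"])
```

Outside a notebook, a bare df or df.head() on the last line prints nothing, which is why the sketch uses print().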
Adding rows and columns
Let us add the confirmed cases of COVID-19 in Rajasthan to our data frame. To do this, we need to add another row of data. We create a dictionary holding the new values, wrap it in a one-row data frame and join it to the existing data frame using the concat() function in Pandas. (Older tutorials use the append() method for this step, but it was deprecated in Pandas 1.4 and removed in Pandas 2.0.) Figure 4 shows the output after adding Rajasthan to the data frame.
# Add a row to the data frame
row = {'States': 'Rajasthan', 'Confirmed': 10331}
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
df
 | States | Confirmed |
0 | Telangana | 3496 |
1 | Maharashtra | 82968 |
2 | Tamil Nadu | 30152 |
3 | Delhi | 27654 |
4 | Gujarat | 19592 |
5 | Rajasthan | 10331 |
Figure 4: Adding a row to data frame
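Another way to add a single row is to assign to the next free index label with loc. This is a minimal sketch that assumes the data frame still has its default 0, 1, 2, ... index, so that len(df) is the first unused label.

```python
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592],
})

# With the default index, len(df) is the next unused label (5 here).
# Assigning to a label that does not exist yet creates a new row.
df.loc[len(df)] = ["Rajasthan", 10331]

print(df)
```

The values in the list must be given in the same order as the columns.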
Now let’s go ahead and add the number of patients who have recovered from COVID-19 in each state. To do this, we need to add a separate column to our data frame. The list of values must contain exactly one entry per row, in the same order as the rows. Figure 5 shows the output of adding the recovered patients’ column to the data frame.
# Add a column to the data frame
recovered = [1710, 37390, 16395, 10664, 13316, 7501]
df['Recovered'] = recovered
df
 | States | Confirmed | Recovered |
0 | Telangana | 3496 | 1710 |
1 | Maharashtra | 82968 | 37390 |
2 | Tamil Nadu | 30152 | 16395 |
3 | Delhi | 27654 | 10664 |
4 | Gujarat | 19592 | 13316 |
5 | Rajasthan | 10331 | 7501 |
Figure 5: Adding a column to the data frame
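A new column does not have to come from a hand-typed list; it can also be calculated from existing columns. As an illustration, the sketch below adds a recovery percentage. The column name RecoveryRate is my own choice and is not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu",
               "Delhi", "Gujarat", "Rajasthan"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592, 10331],
    "Recovered": [1710, 37390, 16395, 10664, 13316, 7501],
})

# Arithmetic on columns is applied row by row.
# RecoveryRate is a hypothetical column name used for illustration.
df["RecoveryRate"] = (df["Recovered"] / df["Confirmed"] * 100).round(1)

print(df)
```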
Dropping rows and columns
Dropping or deleting rows and columns in a Pandas data frame is easy using the drop() method. We can drop a row by its index label, by its position, or by a condition on its values. We can even drop a range of rows. Since this is an introductory article, I will not get into the details but will show the simplest way to do this. Let us drop the row containing data for Rajasthan, which sits at position 5. This can be done using the following code. Figure 6 shows the data frame after dropping the row containing Rajasthan.
# Drop a row from the data frame
df = df.drop(df.index[5])
df
 | States | Confirmed | Recovered |
0 | Telangana | 3496 | 1710 |
1 | Maharashtra | 82968 | 37390 |
2 | Tamil Nadu | 30152 | 16395 |
3 | Delhi | 27654 | 10664 |
4 | Gujarat | 19592 | 13316 |
Figure 6: Dropping a row from the data frame
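For readers curious about dropping rows by a condition, here is a minimal sketch. It finds the rows whose state is Rajasthan and passes their index labels to drop(), and then shows the related idea of keeping only the rows that satisfy a condition. The threshold of 20000 and the name high_counts are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu",
               "Delhi", "Gujarat", "Rajasthan"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592, 10331],
})

# Select the matching rows, then drop them by their index labels
rajasthan_rows = df[df["States"] == "Rajasthan"].index
df = df.drop(rajasthan_rows)

# Alternatively, keep only the rows where a condition is true
# (20000 is an arbitrary threshold for this example)
high_counts = df[df["Confirmed"] >= 20000]

print(df)
print(high_counts)
```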
Now let us drop the ‘Recovered’ column from our data set. Figure 7 shows the data frame after dropping the ‘Recovered’ column.
# Drop a column from the data frame
df = df.drop('Recovered', axis=1)
df
The axis=1 argument tells the drop() method that the label refers to a column instead of a row.
 | States | Confirmed |
0 | Telangana | 3496 |
1 | Maharashtra | 82968 |
2 | Tamil Nadu | 30152 |
3 | Delhi | 27654 |
4 | Gujarat | 19592 |
Figure 7: Dropping a column from the data frame
Update values
You can update existing values in the rows of the data frame. Again, there are many different ways that you can do this. For the purpose of this tutorial, I am going to show you what I think is the simplest technique, using the at accessor in Pandas, which takes a row label and a column name in square brackets.
Let’s say we want to update the value of confirmed cases in Maharashtra from 82,968 to 83,000. We can achieve the output shown in Figure 8 using the following piece of code:
# Edit/update a value
df.at[1, 'Confirmed'] = 83000
df
 | States | Confirmed |
0 | Telangana | 3496 |
1 | Maharashtra | 83000 |
2 | Tamil Nadu | 30152 |
3 | Delhi | 27654 |
4 | Gujarat | 19592 |
Figure 8: Updating a value in the data frame
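The at accessor needs the row label, which you may not always know. One alternative, sketched below, is to select the row by a condition with loc and assign the new value, so that the update is driven by the state name rather than its position.

```python
import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592],
})

# Update every row where the condition is true (only one row here)
df.loc[df["States"] == "Maharashtra", "Confirmed"] = 83000

print(df)
```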
Exporting data
You can export your Pandas data frame to several file formats, such as JSON, csv and xlsx. The following code snippet shows the export methods for some of the commonly used file formats:
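# Export the data frame to csv (the file names used here are placeholders)
df.to_csv('output.csv', index=False)
# Export the data frame to xlsx (requires an Excel engine such as openpyxl)
df.to_excel('output.xlsx', index=False)
# Export the data frame to JSON
df.to_json('output.json')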
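A simple way to confirm that an export worked is to read the file back in. The sketch below writes the data frame to a csv file in a temporary folder and loads it again; the file name covid_cases.csv and the variable names are placeholders of my own.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "States": ["Telangana", "Maharashtra", "Tamil Nadu", "Delhi", "Gujarat"],
    "Confirmed": [3496, 82968, 30152, 27654, 19592],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # covid_cases.csv is a placeholder file name
    csv_path = os.path.join(tmp_dir, "covid_cases.csv")

    # index=False leaves the 0, 1, 2, ... row labels out of the file
    df.to_csv(csv_path, index=False)

    # Read the file back to check what was written
    restored = pd.read_csv(csv_path)

print(restored)
```

Without index=False, the row labels are written as an extra unnamed column, which then shows up as a column when the file is read back.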
Looking forward
It is important to keep in mind that we have only touched the basics of the Pandas library in this article. There is a whole lot more to this library with a wide array of methods, attributes and advanced techniques of dealing with complex data. My book, ‘Data Analytics with Pandas for Absolute Beginners’, covers some of the more advanced concepts and methods in detail.