An Overview of the Python Data Analysis Library

0
4228
Python Panda

Pandas is a cross-platform library (abstraction) written in Python, Cython and C by Wes McKinney for the Python programming language. It is used for data analysis and data manipulation. This article lists a few important features of this library.

It is easy to install Pandas. Download Anaconda for your operating system and the latest Python version, and run the installer as available at https://pandas.pydata.org/getting_started.html. Pandas can be installed using pip from PyPI using the command pip install pandas.

You can verify the Python version by using the command python –version.

Output: Python 3.8.3

pip list and pip freeze will generate a list of installed packages:

1. pip freeze
2. anchorecli==0.8.2
3. astroid==2.4.2
4. .
5. .
6. .
7. openpyxl==3.0.5
8. .
9. pandas==1.1.3
10. .
11. py==1.8.2
12. .
13. .
14. .
15. xlrd==1.2.0
16. XlsxWriter==1.3.7

Table 1 lists some of the modules utilised in our script.

Module Description
Pandas  Data import, merge, experimentation, clean-up, and analysis
OpenPyXL  Read/write Excel 2010 xlsx/xlsm files
xlrd  Read Excel data
XlsxWriter  Write to Excel (xlsx) files
matplotlib  Data visualisation
ConfigParser  Configuration file parser

Table 1: Module details

Figure 1 offers a quick look at Pandas features and Figure 2 gives a Pandas cheat sheet.

Figure 1: The features of Pandas

Let’s first read a configuration file using the Python script. A configuration file looks like this:

Figure 2: Pandas cheat sheet (Reference: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
1. [FileLocations]
2. master.filename=project-data.xlsx
3. auditstatus.filename=audit-status.xlsx

To read this file, use the following script:

1. import pandas as pd
2. import numpy as np
3. import configparser as cfg
4. import array as arr
5. from pandas import DataFrame
6. import matplotlib.pyplot as plt
7. import io
8.
9. #Read Properties file
10. config = cfg.RawConfigParser()
11. config.read(‘master.properties’)
12. #access data from properties file
13. print(config.get(‘FileLocations’, ‘master.filename’))

Let’s now read data from the Excel file, the name for which is available in the properties file:

1. #Read Data
2. projectData = pd.read_excel(config.get(‘FileLocations’, ‘master.filename’))
3. auditStatus = pd.read_excel(config.get(‘FileLocations’, ‘auditstatus.filename’))
4. #Print data on console
5. print “=================================== ======================+++++++++++++++++=========== ===============”)
6. print(projectData)
7. print(“=====================”)
8. print(auditStatus)

Figure 3 gives the output of two DataFrames.

Figure 3: DataFrame outputs

Next, let’s merge the data of both Excel files based on the project ID. Merge the DataFrame with a database-style join on columns or indexes:

1. resultData = pd.merge(projectData, auditStatus, how=’left’)
2. print(resultData)
Figure 4: Merged data set

Figure 4 shows the merged data set.
Now if you want to merge specific columns of specific DataFrames, type:

1. resultData = pd.merge(projectData[[“Project Id”,”Project Name”,”Project Manager”]], auditStatus, how=’left’)

The result is given in Table 2.

Project ID Project name Project manager  Audit status
ZXY00001 PRJ_001 Jayesh P Pending
ZXY00002 PRJ_002 Mukund Completed
ZXY00003 PRJ_003 Radhika M Completed
ZXY00004 PRJ_004 Mukund Pending
ZXY00005 PRJ_005 Radhika M Pending
ZXY00006 PRJ_006 Radhika M Pending
ZXY00007 PRJ_007 Radhika M Pending
ZXY00008 PRJ_008 Radhika M Pending
ZXY00009 PRJ_009 Jayesh P Completed
ZXY00010 PRJ_010 Jayesh P Completed

Table 2: Data with specific columns

To save a DataFrame to the Excel file, use the to_excel method:

1. #Write Data Frame to Excel
2. resultData.to_excel(‘Results.xlsx’, index = False, header=True)

To get specific column data from a DataFrame, use auditStatus[“Audit Status”].

Creating a pivot table using Pandas
We will create a spreadsheet-style pivot table as a DataFrame using pivot_table. We will add values in the properties file to create pivot tables, as follows:

1. [FileLocations]
2. master.filename=project-data.xlsx
3. auditstatus.filename=audit-status.xlsx
4. [Pivot]
5. index-level-0=Project Manager
6. index-level-1=Project Manager,Domain
7. index-level-2=Project Manager,Domain,Technical Architect
8. columns=Audit Status
9. values=Project Id

Let’s create a pivot table using a single column index (see Table 3):

1. pivot = pd.pivot_table(resultData,index=config.get(‘Pivot’, ‘index-level-0’).split(‘,’),columns=config.get(‘Pivot’, ‘columns’).split(‘,’),values=config.get(‘Pivot’, ‘values’),aggfunc=len,margins=True,fill_value=0)
2. print(pivot)
Project manager Completed Pending All
Jayesh P 2 1 3
Mukund 1 1 2
Radhika M 1 4 5
All 4 6 10

Table 3: Pivot table with one index

Now let’s create a pivot table using two-column indexes (see Table 4):

1. pivot = pd.pivot_table(resultData,index=config.get(‘Pivot’, ‘index-level-1’).split(‘,’),columns=config.get(‘Pivot’, ‘columns’).split(‘,’),values=config.get(‘Pivot’, ‘values’),aggfunc=len,margins=True,fill_value=0)
Project manager Domain Completed Pending All
Jayesh P Banking 0 1 0
Commerce 2 0 2
Mukund Commerce 0 1 1
Insurance 1 0 1
Radhika M Aerospace 0 2 2
Banking 0 2 2
All Insurance 1 0 1
4 6 10

Table 4: Pivot table with two indexes

To create a pivot table using three-column indexes (Table 5), type:

1. pivot = pd.pivot_table(resultData,index=config.get(‘Pivot’, ‘index-level-3’).split(‘,’),columns=config.get(‘Pivot’, ‘columns’).split(‘,’),values=config.get(‘Pivot’, ‘values’),aggfunc=len,margins=True,fill_value=0)
Project manager Domain Technical architect Completed Pending All
Jayesh P Banking Ruby V 0 1 1
Commerce Ashish B 1 0 1
Jigisha P 1 0 1
Mukund Commerce Saurabh S 0 1 1
Insurance Shreyansh N 1 0 1
Aerospace Aish P 0 1 1
Radhika M Ramesh N 0 1 1
Banking Approva S 0 1 1
Shreyansh N 0 1 1
Insurance Jigisha P 1 0 1
All 4 6 10

Table 5: Pivot table with three indexes

Figure 5: Plot a chart

In the next section, we will create charts using Python.

Charts using Pandas
In this section, we will create charts using Python script. Let’s create chart data with pivot, as follows (see Table 6):

1. chartData = pd.pivot_table(resultData,index=config.get(‘Pivot’, ‘index-level-0’).split(‘,’),columns=config.get(‘Pivot’, ‘columns’).split(‘,’),values=config.get(‘Pivot’, ‘values’),aggfunc=len,margins=True,fill_value=0)
Project manager Completed Pending All
Jayesh P 2 1 3
Mukund 1 1 2
Radhika M 1 4 5
All 4 6 10

Table 6: Chart data

You can create and plot a chart based on the chart data given in Table 6:

1. ax = chartData.plot(kind=’bar’)
2. for p in ax.patches:
3. ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))
4. plt.show()

The following script helps to create a bar chart that is editable in the Excel file:

1. # pip install xlsxwriter
2. writer = pd.ExcelWriter(‘Results.xlsx’, engine=’xlsxwriter’)
3. #store data frames in a Dictionary object, where the key is the sheet name
4. frames = {‘MasterData’: chartData}
5. #store data frames in a Dictionary object, where the key is the sheet name
6. for sheet, frame in frames.items():
7. frame.to_excel(writer, sheet_name = sheet)
8.
9. worksheet = writer.sheets[‘MasterData’]
10. workbook = writer.book
11. chart = workbook.add_chart({‘type’: ‘column’})
12. # Some alternative colors for the chart.
13. colors = [‘#4DAF4A’, ‘#E41A1C’, ‘#377EB8’]
14.
15. chart.add_series({
16. ‘name’: ‘=MasterData!$C$1’,
17. ‘categories’: ‘=MasterData!$B$2:$B$6’,
18. ‘values’: ‘=MasterData!$C$2:$C$7’,
19. ‘fill’: {‘color’: colors[0]},
20. ‘overlap’: -10,
21. ‘gap’: 400,
22. ‘data_labels’: {‘value’: True},
23. })
24.
25. chart.add_series({
26. ‘name’: ‘=MasterData!$D$1’,
27. ‘categories’: ‘=MasterData!$B$2:$B$6’,
28. ‘values’: ‘=MasterData!$D$2:$D$7’,
29. ‘fill’: {‘color’: colors[1]},
30. ‘gap’: 400,
31. ‘data_labels’: {‘value’: True},
32. })
33.
34. chart.add_series({
35. ‘name’: ‘=MasterData!$E$1’,
36. ‘categories’: ‘=MasterData!$B$2:$B$6’,
37. ‘values’: ‘=MasterData!$E$2:$E$7’,
38. ‘fill’: {‘color’: colors[2]},
39. ‘gap’: 400,
40. ‘data_labels’: {‘value’: True},
41. })
42. chart.set_style(11)
43. # Add a chart title
44. chart.set_title ({‘name’: ‘Project Manager-wise Data’})
45.
46. # Add x-axis label
47. chart.set_x_axis({
48.
49. ‘name’: ‘Project Manager’,
50. ‘major_gridlines’: {
51. ‘visible’: True,
52. ‘line’: {‘width’: 1.25, ‘dash_type’: ‘dash’}
53. },
54. })
55.
56. # Add y-axis label
57. chart.set_y_axis({‘name’: ‘Audit Status’})
58.
59. chart.set_size({‘width’: 720, ‘height’: 576})
60.
61. worksheet.insert_chart(‘H1’, chart)
62.
63. #critical last step
64. writer.save()
Figure 6: Project manager-wise chart

A chart created using Python script is shown in Figure 6.

LEAVE A REPLY

Please enter your comment!
Please enter your name here