Interfacing Python With Excel

0
22565
In this article, the author discusses interfacing Python with Excel. Data stored in numerous Excel sheets can be extracted by using the Python packages written for the same (like xlrd). And the author’s demonstration of its benefits in a real world scenario is enlightening.

Most projects and companies maintain a lot of data in the form of Excel files. These are usually reports, and data needs to be generated from these Excel files in a periodic manner, like a status update. Sometimes, when a specific event takes place, there may be some follow-up action required like generating a mail or filing an escalation in case the data indicates something of interest has happened. So let’s look at how such Excel files can be interfaced by using Python.

The problem of work summaries

Let us a look at a specific problem in this regard. Assume that employees in a particular team are working across different projects for different customers. Figure 1 gives an idea about the association between an employee, a project and a customer.

An employee could be working on one or more projects. A project could also have one or more employees. A project is targeted to be delivered to only one customer (in our example) though a customer may be involved with multiple projects.

To track the effort being spent on these projects, employees could be asked to record their work in the following manner under the heads shown below.

blanktbl

Let us assume all employees are filing such reports in one Excel file each, in a common location. So how do we arrive at the overall summary of efforts put in by the team on a particular customer, over a period of time?

One of the ways in which this could be managed is to do some fancy linking inside a master Excel file. Then write some Excel macros to walk through each file to summarise the data. As can be imagined, this is tedious, involves manual and automated steps, and can be error prone. Also, summaries over different time periods will have to be managed through macros or filters based on pivot tables inside a master Excel file. Instead, let us look at a different way of doing the same task, using Python and Excel interfacing libraries.

Figure 1
Figure 1: ERD-Employee, Project, Customer
Table 1
Figure 2: Sample data – test.xlsx

Introducing the Python-Excel interfacing library

Let me introduce the xlrd package for Python available from the Python Foundation at https://pypi.python.org/pypi/xlrd

The package used to write into Excel files is available as xlwt. As a collective package, xlutils comprises both xlrd and xlwt. In this column, we use only xlrd.

xlrd is a library that can be used to work with Microsoft Excel spreadsheets (.xls and .xlsx) on any platform. Let us check the interfaces available in the following sections and look at how to use them.

Installation of the package

The installation instructions in the readme.html file are pretty clear. On any OS, unzip the .zip file into a suitable directory, chdir to that directory, and type in python setup.py install.
If the installation is successful, you should be able to do what’s shown below at the IDLE GUI prompt:

>>> import xlrd
>>>

Important interfaces of the xlrd package

xlrd provides the following interfaces for reading the Excel values.

  1. xlrd::open_workbook(path_to_Excel_file): This method returns an object of the Book class, with which we will work quite a bit to get access to the Excel data.
  2. Book::sheet_by_index(sheetx): This returns an object of the Sheet class, with which we will access the individual cells of a sheet.
  3. Sheet::cell(row,col): This returns a Cell object for the particular row and column values.
  4. A Cell object should always be retrieved from other methods described above from the Book and Sheet objects and not invoked directly. Critical for a Cell object is the value attribute, the type of which varies according to the data type of the cell represented by another int attribute which is ctype. The following table gives the type that value has, which depends upon the type of the cell.

TableNote: In this column, we will work mostly with text and date types only.

Working with dates

As quoted in the xlrd tutorial, in Excel, dates have many problems.

  1. Dates are not stored as a separate data type but as floating point numbers, and you have to rely on (a) the ‘number format’ applied to them in Excel and/or (b) knowing which cells are supposed to have dates in them.
  2. Dates are stored as elapsed days from epoch, which differs between Windows and Mac versions of Excel.
  3. The Excel implementation of the default Windows 1900 based date system works on the incorrect premise that 1900 was a leap year.

caution

Sample usage

Let us now put together a small spreadsheet with the data as shown in Figure 2.
We can open this using the interfaces of xlrd in the IDLE GUI.

>>> testbook=xlrd.open_workbook(‘C:\\test.xlsx‘)
>>> testbook
<xlrd.book.Book object at 0x0000000002F760B8>

Note: To qualify a file name via the open_workbook API in Windows, I had to escape the “\” backslash in the path with another, to indicate that I am not using any special sequence.

Once a book is available, we can open a sheet of the same.

>>> testsheet=testbook.sheet_by_index (0)
>>> testsheet
<xlrd.sheet.Sheet object at oxooooooooo2F73A90>

From the sheet, we print the following values. You can see the values and types associated with the values.

>>> testsheet.cell (1,0)
xldate:42036.0
>>> testsheet.cell (1,1)
text:u ‘ProjectA’
>>> testsheet.cell (1,2)
text:u ‘Jackie Chain’
>>> testsheet.cell (1,3)
number:10.0

Note: Indexes of both rows and columns start with 0 (zero), which is good for us as we naturally escape the header row.

The date, as can be seen above, is stored as a number, which xlrd shows as an xldate type. Let us do some more processing with the same. To work with Python dates, we would need the date and datetime classes from the datetime module.

>>> from datetime import date, datetime
>>> date.today()
date.date(2015, 2, 1)

We can see that the date ‘1 Feb 2015’ is represented in Python with the year, month and date components.

To work with the xldate type, let us import and use the xldate_as_tuple interface.

>>> from xlrd import xldate_as_tuple
>>> testdate=xldate_as_tuple (testsheet.
cell(1,0).value,testbook.datemode)
>>> testdate
(2015, 2, 1, 0, 0, 0)

We can see that the same date is stored as a datetime internally and has hours, minutes and seconds components too.

To extract only the date part and ignore the last three, we can do a sub-tuple selection and initialise the date with the same.

>>> testdate
(2015, 2, 1, 0, 0, 0)
>>> date (*testdate[:3])
datetime.date (2015, 2, 1)

Now, simple comparisons can be done between Python date classes in the following way.

>>> date.today()==date(*testdate[:3])
True

Note: In the above segment, we make a slice of testdate starting from 0 to 2 and <3, with only the first three values from testdate to get date components of the year, month and day. We use the * operator to unpack and list the values to be used to construct a date object.

Designing the program

So, having set those items up, let us come back to the original problem. We need to have a program that summarises the efforts by the customer’s name and prints it. Let us assume that the data now looks like what’s shown in Figure 3 in a file c:\test.xlsx:

Given that all reports will be in a particular directory, we will need a way to get a list of the report files so that we can process them in our program. Python provides us with the glob object which helps to do that.

>>> import glob
>>> glob.glob (“C:\*.xlsx“)
[‘C:\\test.xlsx’]

Here’s the sequence for the program.

  1. Create a dictionary object to store nHours for each customer name.
  2. Using the glob object, fetch the list of xlsx report files.
  3. For each report file:
    • Cycle through each row of sheet (0)
    • For each row with a valid entry
      • Check if the date is present and is valid
      • Check if the date is present and the date is within the last seven days

Fetch the hours spent
Fetch the customers name
Customer hours + hours spent

4. Print the final output from the map

a. Print the customers name and the total hours spent on the customer
The following is the complete program:

Table 2
Figure 3: Sample data
“””summarize effort values from xlsx reports and print the same.
for each report file in a directory
read each line and update corresponding customer hours summary
when done, print the summary

# import glob object to read xlsx file names
import glob
# import xl reader module
import xlrd

#get date and datetime
from datetime import date,datetime,time
from xlrd import open_workbook,xldate_as_tuple

# note position of different data items in the summary file
DATECOL = 0
CUSTOMERCOL = 2
HOURSCOL = 3

# constant to store value to check for a week’s data
ONEWEEK = 7
ONLYDATE = 3

#customer constants used
CUSTA=”Jackie Chan”
CUSTB=”Jet Li”

#initialize customer hours map to zero
CustHours = {CUSTA: 0, CUSTB: 0}

for report in glob.glob(“C:\*.xlsx”):
print “Processing :”+report
book = xlrd.open_workbook(report)
sh=book.sheet_by_index(0)
for rx in range(1,sh.nrows):
# if date in current week then add value to corresponding section
if sh.cell(rx,DATECOL).value <> “”:
try:
date_value = xldate_as_tuple(sh.cell(rx,DATECOL).value,book.datemode)
except:
print “error with date”
continue
t = date.today() - date(*date_value[:3])
if t.days <=ONEWEEK and sh.cell(rx,HOURSCOL).value <>””:
hoursspent = float(sh.cell(rx,HOURSCOL).value)
customer=str(sh.cell(rx,CUSTOMERCOL).value)
#summarize per customer also
#get project hours if already present
CustHours[customer]+=hoursspent

#print the summaries
print “Summary per customer”
print CUSTA+”:”+str(CustHours[CUSTA])+” hours”
print CUSTB+”:”+str(CustHours[CUSTB])+” hours”

Note 1: Note that we print the report being processed to help in debugging, in case any data makes the program misbehave.

Note 2: We process xldate_as_tuple in a try:, except: block so that we handle any exceptions thrown by incorrect dates. We print an error message in that case, which along with the report file name will help us know where the problem is.

When run through Python, the above program produces the following output:

Processing :C:\test.xlsx
Summary per customer
Jackie Chan:24.0 hours
Jet Li:16.0 hours

More ideas

Think about how you can improve the above solution. To start with, we could make all those constants configurable and pass them as input to our program through a cfg file (refer to Practical Python Programming in the February 2015 issue of OSFY for how to use a ConfigParser module).

We could use better formatting to output the summary (refer repr() and format() interfaces).

Also, readers familiar with pivot tables could start thinking about how to implement generating such pivots from Python.

LEAVE A REPLY

Please enter your comment!
Please enter your name here