In this column, the author discusses how large volumes of mail can be archived and retrieved using the Big Data tools HBase and Phoenix.
One pressing challenge for any organisation is the storage and retrieval of emails. There are some obvious fields, like the subject, the sender and the content, which need to be searchable. However, the header lines may be very important as well. For example, since it is easy to fake the sender, the server that actually delivered the mail may be significant.
An email's core components are the header, the content and the attachments. Each line of the header is a key-value pair, with the key naming a header field. Some fields are visible to the user, like the subject, the sender, the recipients, and the date and time. Other fields may be examined in case of problems, though on most graphical mail clients a user needs to make an additional effort to see them.
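As a quick illustration of this key-value structure, here is a minimal sketch using Python's standard mailbox module to print the header fields of the first message in an mbox file; the file name sample.mbox is only a placeholder.

#! /usr/bin/python
# Sketch only: print the header key-value pairs of the first message
# in an mbox file; 'sample.mbox' is just an example name.
import mailbox

mbox = mailbox.mbox('sample.mbox')
for message in mbox:
    for key in message.keys():
        print('%s: %s' % (key, message[key]))
    break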
Planning for storage
Typically, each email is stored as a record in a data store. The list of possible header fields is very large, so mapping each one to its own column would just not be convenient.
This is where the concept of a column family in HBase can provide an answer. For example, you may create an HBase table with the following column families (a sketch of how a single email's fields map on to them follows the list):
- Envelope (header fields like sender, recipients, subject, date)
- Header (header fields not included in the envelope)
- Content
- Attachment
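To make the intent concrete, here is a purely illustrative sketch of how one email might spread across these families, written as the Python dictionary of column names to values that the loading code later builds; the addresses and text are made up.

# Illustrative values only: one email becomes one row,
# with its columns spread across the four families.
row = {
    'envelope:From':    "['alice@example.com']",
    'envelope:Subject': "['Quarterly report']",
    'header:Received':  "['from mx.example.com ...']",
    'content:text':     'Please find the report attached.',
}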
Create the HBase table from the HBase shell as follows:
[fedora@h-mstr hbase-0.96.2-hadoop2]$ bin/hbase shell
hbase(main):001:0> create 'emails', 'envelope', 'header', 'content', 'attachment'
0 row(s) in 5.0390 seconds
=> Hbase::Table - emails
hbase(main):002:0> exit
Loading emails
Python has a very versatile mailbox module for working with various on-disk mail formats. As an illustration, consider messages stored in the mbox format. A message may have multiple parts, and each part may itself be another message or an attachment. To keep the code as simple as possible, convert and store each part as a string. Ideally, the unique key for each email would not be a UUID, as has been used in this example; it would start with a value that is useful for limiting the search of emails, for example, in a corporate set-up, a prefix made up of the department and the date. As in the previous article in this series, use happybase to connect to the Thrift server of HBase. The following code in load_mbox_file.py illustrates the idea.
#! /usr/bin/python
import mailbox
import happybase
import uuid
import sys

ENVELOPE = ['To', 'From', 'Subject', 'Return-Path', 'Date']

# A simple approach to multipart:
# store each part as a string, ignoring the type of content
def multipart_payloads(columns, msg):
    num = 1
    for part in msg.get_payload():
        columns['content:' + str(num)] = part.as_string()
        num += 1

def store(table, msg):
    # create a unique id for the row
    row_id = str(uuid.uuid1())
    # Process the headers
    columns = {}
    # some header keywords may appear multiple times
    for key in set(msg.keys()):
        if key in ENVELOPE:
            cf = 'envelope:' + key
        else:
            cf = 'header:' + key
        # Get all the entries for a key as a list.
        # Store as a string.
        # Can use eval to get the list back.
        columns[cf] = str(msg.get_all(key))
    # Multipart?
    if msg.is_multipart():
        multipart_payloads(columns, msg)
    else:
        columns['content:text'] = msg.get_payload()
    table.put(row_id, columns)

connection = happybase.Connection('h-mstr')
table = connection.table('emails')
mbox = mailbox.mbox(sys.argv[1])
for message in mbox:
    store(table, message)
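As noted earlier, a UUID is not an ideal row key. A minimal, untested sketch of a more useful key, assuming a made-up department name and using the Date header, could look like this:

# Sketch only: prefix the row key with department and date so that scans
# can be limited to a range of keys. 'sales' is a made-up department.
import uuid
from email.utils import parsedate

def make_row_id(msg, department='sales'):
    parsed = parsedate(msg.get('Date'))
    if parsed:
        date_part = '%04d%02d%02d' % parsed[:3]
    else:
        date_part = '00000000'
    return department + ':' + date_part + ':' + str(uuid.uuid1())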
Run load_mbox_file.py with the mbox filename as a parameter. The data should be stored in the HBase table emails.
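To check that the load worked, you could scan a few rows back with happybase; a minimal sketch along the lines of the code above:

#! /usr/bin/python
# Sketch: list the envelope:Subject column of the first few stored rows.
import happybase

connection = happybase.Connection('h-mstr')
table = connection.table('emails')
for row_id, columns in table.scan(columns=['envelope:Subject'], limit=5):
    print('%s -> %s' % (row_id, columns))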
In the above example, attachments have been treated like any other content. Hence, the column family attachment has been ignored.
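If you did want to use the attachment column family, one possible, untested variation of multipart_payloads is to check each part for a file name and store such parts under that family:

# Sketch only: store parts that carry a file name in the attachment
# column family instead of the content one.
def multipart_payloads(columns, msg):
    num = 1
    for part in msg.get_payload():
        if part.get_filename():
            columns['attachment:' + part.get_filename()] = part.as_string()
        else:
            columns['content:' + str(num)] = part.as_string()
        num += 1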
Searching the emails
There are quite a few options for SQL queries on HBase tables, including Apache Phoenix, Impala, HBase with Hive, Apache Drill, etc. Apache Drill is interesting as it seems to be very versatile and flexible. However, at present, it only works with HBase 0.94, while the current version is 0.98. So, let us experiment with Apache Phoenix instead (http://phoenix.apache.org/), which is an SQL layer specifically meant for HBase.
Installation involves copying the phoenix-*.jar files into the HBase lib directory and restarting HBase. Phoenix comes with a Python utility, sqlline.py. You may create and populate a table in this utility using standard SQL syntax, and it will create and populate the corresponding HBase table.
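For example, a throwaway table could be created and populated in sqlline along these lines; the table name and values are made up:

0: jdbc:phoenix:localhost> CREATE TABLE test_phoenix (id INTEGER PRIMARY KEY, name VARCHAR);
0: jdbc:phoenix:localhost> UPSERT INTO test_phoenix VALUES (1, 'a trial row');
0: jdbc:phoenix:localhost> SELECT * FROM test_phoenix;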
You might want to use the existing HBase table, emails, by creating a view over it. For example:
[fedora@h-mstr phoenix-4.2.2-bin]$ bin/sqlline.py localhost
0: jdbc:phoenix:localhost> CREATE VIEW "eview" (pk VARCHAR PRIMARY KEY, "envelope"."Subject" VARCHAR, "envelope"."To" VARCHAR, "envelope"."From" VARCHAR, "envelope"."Date" VARCHAR, "header"."Content-Type" VARCHAR) AS SELECT * FROM "emails";
The double quotes are needed as Phoenix assumes that unquoted table and column names are in upper case. The column names of the view are the column qualifiers within the respective column families.
Now, you can run some queries on it:
0: jdbc:phoenix:localhost> select * from "eview" where "Subject" is not null limit 5;
0: jdbc:phoenix:localhost> select * from "eview" where "Content-Type" ilike '%multipart%' limit 5;
0: jdbc:phoenix:localhost> select "From", "Subject" from "eview" where "Date" ilike '%jan%2011%' limit 5;
In case you need to examine some other column in a column family, create another view. You can find out more about the options available at http://phoenix.apache.org/language/; in particular, you can learn to create secondary indices on a view.
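For instance, a secondary index on the subject column of the view created above might look like the following; the index name is arbitrary and the statement is an untested sketch:

0: jdbc:phoenix:localhost> CREATE INDEX subject_idx ON "eview" ("envelope"."Subject");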
It would be easy to extend the above system to store text messages other than emails as well. You could also index the contents and attachments using Elasticsearch or a similar tool. Combining text search with the SQL query layer on HBase makes it possible to offer remarkable solutions to your users, helping them to analyse and search massive volumes of data with minimal effort.