The Complete Magazine on Open Source

Automating pattern discovery with open source software

To keep the competitive edge, enterprises need to discover as much as possible about their clients (the likes, dislikes, favourites) and marketing trends. There is a vast amount of data generated by enterprises (inventory, customer purchases, customer accounts, preferences, etc), which can help them do so. This huge store of data can be analysed with the help of open source pattern discovery algorithms to benefit enterprises.

To leverage the huge amount of data being generated, enterprises today want to look beyond conventional business intelligence, in order to discover the trends and patterns hidden in the data. According to analysts, smart pattern discovery will be one of the most in-demand technologies in the near future. Frequent pattern mining (FPM) and the associated algorithms play an important role in many business scenarios that depend on recognising relevant patterns from transaction databases. These algorithms are currently gaining traction in the retail domain, but they can be used effectively in many other use cases across various domains. In this article, I will share some of my experiences in implementing these algorithms to solve business problems across various domains using a set of open source software.
To provide actionable insights from the huge amount of data available with the modern enterprise, machine learning and related technologies are being adopted rapidly across various business domains. These techniques enable the learning of trends or patterns from historical data.

Every enterprise is interested in identifying patterns or trends from the huge gamut of data generated by its applications or coming from the external world. Traditionally, subject matter experts (SMEs) have been tasked with identifying trends and patterns in data, based on their experience. With data volumes surpassing levels that are humanly manageable, enterprises are looking towards tools and technologies that can help identify patterns with minimal effort.
Though machine learning algorithms can identify patterns, they still need to be validated by SMEs. This is because the quality of patterns identified is largely determined not only by the volume of data provided for discovery, but also by the quality of the data.

Pattern discovery: An overview
In today’s dynamic market, conventional business intelligence often falls short in enabling the discovery of actionable insights from the huge volumes of data available. Let’s consider the case of the retail industry, where consumer trends change very rapidly. A single social media post from an influential trendsetter may cause considerable disruption in the market. Hence, retailers need real-time insights to ensure that all their plans and forecasts stay relevant. In this situation, techniques like smart pattern discovery automate and accelerate the process of finding trends and predicting consumer behaviour.
While this technology is being quickly adopted in the retail domain, pattern mining and pattern discovery may be very useful in many other domains. Following this trend, many leading commercial vendors are enhancing their tools with the capability of discovering inconspicuous data patterns hidden in the enterprise data store. As many of these tools and platforms are quite expensive, it makes sense to look at the tool-sets available in the open source domain and pick an appropriate one.
Even with the wide range of algorithms and tools available, effective use of pattern discovery calls for a suitable implementation methodology.

Figure 1: How the algorithm works

Implementation methodology
Discovering patterns related to sequences of data or events can lead to very important insights, as these patterns may have occurred before an event of significance. If such patterns can be identified, it may be possible to proactively track them and take appropriate action in a timely manner. Due to the increase in data volumes, it is no longer possible for subject matter experts to scan the data and identify significant patterns. Leveraging suitable machine learning algorithms can be instrumental in reducing the time and effort humans would need to put into this task.
There is a huge need to identify hidden patterns in data sets across various business domains, and there are a handful of tools and techniques to address it. Even so, a systematic implementation methodology is needed for effective execution and to obtain the best results. The methodology has to combine a structured approach with the right choice of tools and algorithms.

The frequent pattern mining (FPM) algorithm
Finding a pattern manually in a huge and dynamic enterprise data store typically takes up a considerable amount of time and effort. Instead, this task can be performed using suitable machine learning algorithms like FPM and other related algorithms.
As depicted in Figure 1, pattern mining algorithms like FPM take token sequences as input, where each token represents an element of significance. For example, a token can be an error code, a transaction type or an item from a shopping cart. Given the token collection, these algorithms try to identify the most frequent tokens and token sequences along with their associated probability, based on the occurrence of the tokens in the complete set (the transaction database).
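To make the idea concrete, here is a minimal Python sketch of counting frequent token sets. It is not the implementation used in the solutions described in this article. It enumerates token combinations by brute force rather than using an optimised algorithm such as Apriori or FP-growth, and it treats each transaction as an unordered set. The function name `frequent_itemsets`, the thresholds and the sample transactions are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support=0.5, max_size=3):
    """Return {itemset: support} for itemsets with support >= min_support.

    Support is the fraction of transactions that contain every token
    in the itemset. Brute-force enumeration, for illustration only.
    """
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter()
    for tokens in transactions:
        unique = sorted(set(tokens))  # ignore duplicates and order
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}


# Hypothetical shopping-cart transactions (illustrative data only)
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
result = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

The support value plays the role of the "associated probability" mentioned above: it estimates how likely a token set is to appear in a transaction drawn from the database.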

Tool-set in open source
Until recently, machine learning was confined largely to the academic world. This may be the reason for the profusion of open source tools and libraries around machine learning and related technologies.

Some of the popular tools are R, Python, Apache Mahout, H2O and RapidMiner. While each of these tools was created to cater to a specific set of requirements, R provides good support for FPM implementation.

Figure 2: Framework architecture

Framework
While FPM algorithms are quite effective for pattern mining or pattern discovery, a systematic implementation approach ensures effective execution and the best results.

An enterprise that wishes to use FPM techniques would have to implement a framework as depicted in Figure 2. The FPM solution needs a data store that can hold data from a variety of sources. The framework also needs an analytics engine that operates on the data to identify frequently occurring patterns and create a rules database. In addition, the framework needs a monitoring application that continuously monitors incoming data, and generates alerts and reports whenever the incoming data matches a rule in the rules database.
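The monitoring part of such a framework can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a rule is a set of tokens that must all be present in an incoming record, and the rules database is an in-memory list. The `Rule` class, the `monitor` function, the codes and the actions are hypothetical names, not part of any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    pattern: frozenset  # tokens that must all appear in a record
    action: str         # what to report when the pattern matches


def match_rules(record, rules):
    """Return the rules whose pattern is fully contained in the record."""
    tokens = set(record)
    return [rule for rule in rules if rule.pattern <= tokens]


def monitor(stream, rules):
    """Yield (record, action) alerts for every incoming record that matches a rule."""
    for record in stream:
        for rule in match_rules(record, rules):
            yield record, rule.action


# Hypothetical rules database and incoming data (illustrative only)
rules = [
    Rule(frozenset({"E101", "E205"}), "schedule inspection"),
    Rule(frozenset({"E300"}), "notify support"),
]
incoming = [["E101", "E007"], ["E205", "E101", "E300"], ["E001"]]
alerts = list(monitor(incoming, rules))
for record, action in alerts:
    print(record, "->", action)
```

In a real deployment, the rules would be loaded from the rules database populated by the analytics engine, and the stream would come from the data store or a message queue.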

Pattern discovery to solve business problems
FPM algorithms can be used to provide solutions for various domains. In this section, I will describe some of the solutions built using the concept of pattern discovery. In each of them, various algorithms are leveraged to mine patterns from the data, which can then be used as per the need of the application.

Market-basket analysis: This solution was developed to discover purchasing patterns from the customers’ shopping cart data. Patterns were discovered using associations or by identification of co-occurrences from transactional data associated with customers’ shopping carts.
Market-basket analysis techniques were used to mine the consumer purchase patterns. This technique allows retailers to consider the customers’ market basket, in order to understand which products are purchased together, the frequency of purchases as well as the volume of purchases. Such a mining technique enables retailers to formulate an effective strategy for campaigns and promotions. This, in turn, results in increasing the customer base as well as increasing the value of the market-basket.
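As an illustration of how co-occurrences turn into usable rules, the following Python sketch derives simple "A implies B" rules between pairs of items, with their support and confidence. It is a simplified stand-in for a full association rule miner. The function name `pair_rules`, the thresholds and the sample baskets are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Return sorted (lhs, rhs, support, confidence) rules between single items.

    support    = fraction of baskets containing both items
    confidence = fraction of baskets with lhs that also contain rhs
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Check the rule in both directions: a -> b and b -> a
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical shopping baskets (illustrative data only)
baskets = [
    ["tea", "biscuits"],
    ["tea", "biscuits", "sugar"],
    ["tea", "sugar"],
    ["biscuits"],
    ["tea", "biscuits"],
]
rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

A retailer could use rules with high confidence, for example, to decide which products to promote together.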

Machine log analytics: One of our manufacturing customers wanted to implement an enterprise application to monitor software patches for the company’s medical devices across the world. The need to install a software patch was determined by the sequence of codes that the device generated during its operation.
While the SME provided some sequences of frequently occurring codes, this manual process of pattern identification was time-consuming and effort-intensive, particularly given the number of devices deployed and their geographical distribution. To save time and effort, we applied frequent item-set mining algorithms to identify sequences. The sequences generated by the algorithm were validated by the SME and then converted into business rules, which were deployed to identify the relevant software patch for a specific device. Automating pattern identification in this way improved productivity.
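The sketch below shows one simple way to find code sequences that recur across device logs. It counts contiguous runs of codes of a fixed length and keeps those seen in a minimum number of logs. This is an assumption-laden illustration, not the algorithm used in the project: the function name `frequent_sequences`, the code values and the thresholds are all hypothetical.

```python
from collections import Counter


def frequent_sequences(logs, length=2, min_logs=2):
    """Return {sequence: number_of_logs} for contiguous code sequences
    of the given length that occur in at least min_logs logs.

    Each sequence is counted at most once per log.
    """
    counts = Counter()
    for codes in logs:
        seen = {
            tuple(codes[i:i + length])
            for i in range(len(codes) - length + 1)
        }
        counts.update(seen)
    return {seq: c for seq, c in counts.items() if c >= min_logs}


# Hypothetical device logs, one list of codes per device (illustrative only)
logs = [
    ["C1", "C7", "C9", "C2"],
    ["C4", "C7", "C9"],
    ["C7", "C9", "C2", "C5"],
]
patterns = frequent_sequences(logs, length=2, min_logs=2)
for sequence, seen_in in sorted(patterns.items()):
    print(sequence, "seen in", seen_in, "logs")
```

Candidate sequences produced this way would still need SME validation before being turned into business rules.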

Fraud analytics: With the Internet being widely accessible, many customers have adopted e-commerce for their purchases. This development goes hand-in-hand with the use of credit and debit cards for making payments. In an e-commerce transaction, a person is not physically present for verification and it is possible to make a purchase simply with a card number. Many customers these days face attacks like phishing, pharming, skimming and dumpster diving. Hence, there is a need to protect against such attacks.
To address this business need, we built a model of customer behaviour using an unsupervised learning method. Every customer transaction is validated against the customer’s profile to identify anomalies, if any. If an anomaly is detected, an alert is generated. A tree-based pattern mining algorithm can be leveraged to solve this problem. This algorithm enables identification of the transaction patterns that typically indicate fraudulent behaviour.
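The idea of validating a transaction against a customer profile can be sketched as follows. Note that this example does not implement the tree-based pattern mining algorithm mentioned above. It swaps in a much simpler statistical check, purely to illustrate the profile-and-compare step. The profile fields, the `z_limit` threshold and the sample data are assumptions.

```python
from statistics import mean, pstdev


def build_profile(history):
    """Summarise past (amount, category) transactions of one customer.

    Assumes history is non-empty.
    """
    amounts = [amount for amount, _ in history]
    return {
        "mean": mean(amounts),
        "std": pstdev(amounts),
        "categories": {category for _, category in history},
    }


def anomalies(profile, amount, category, z_limit=3.0):
    """Return a list of reasons why a new transaction looks unusual."""
    reasons = []
    if category not in profile["categories"]:
        reasons.append("unseen category")
    std = profile["std"]
    # Flag amounts that deviate strongly from the customer's usual spend
    if std > 0 and abs(amount - profile["mean"]) / std > z_limit:
        reasons.append("unusual amount")
    return reasons


# Hypothetical transaction history (illustrative data only)
history = [(20.0, "grocery"), (25.0, "grocery"), (30.0, "fuel"), (25.0, "grocery")]
profile = build_profile(history)
print(anomalies(profile, 500.0, "electronics"))
print(anomalies(profile, 27.0, "fuel"))
```

Any non-empty list of reasons would correspond to an alert being generated for that transaction.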

Future work
Gartner predicts that, in a few years, smart pattern discovery will be in huge demand, enabling mainstream business consumers to get insights from data. These innovations have the potential to help enterprises gain access to sophisticated interactive analysis and insights.
FPM techniques can be leveraged effectively to discover patterns by finding frequent transaction sets and uncovering the hidden information.

While FPM techniques have a deep impact on businesses through smart pattern discovery, there are still some challenging research issues in terms of selecting the right variation of the techniques to address a specific business problem.