In this month's column, we discuss how to apply data analytics to data at rest in storage systems, and the buzz around insight-centric storage.
For the past few months, we have been discussing information retrieval and natural language processing (NLP), as well as the algorithms associated with them. In this month's column, let's focus on a related topic: how data analytic techniques are being applied to data at rest in storage systems.
Given the buzz around Big Data, there has been considerable interest in mining large volumes of data to obtain actionable insights. Like a beggar sitting on a gold mine without knowing it, many organisations have a wealth of data residing in data centres and storage archives, and in most cases, those within the organisation remain unaware of the information content of the data they have available. Gartner calls this 'dark data': the huge amounts of data that firms collect as a by-product of their normal business activities, but for which they have not found any other use. Typical examples are data centre system logs or the email archives of an organisation's internal mailing lists. This is in contrast to 'light data', which is typically actionable data generated specifically through a data analysis process for follow-up action by the organisation. A typical example of light data could be the monthly sales report or the software bug trend report generated specifically for consumption by organisational processes. Dark data is typically unstructured content and comes from multiple sources, both within the organisation and outside; it includes emails, process documents, videos, images, social network chats, tweets, etc. Light data, in contrast, is typically structured.
You may wonder what's the big deal about whether an organisation has (or does not have) huge amounts of dark data. It can either choose to keep the data archived for future needs or throw it away if it doesn't need it. Well, the choice is not as simple as that. If the organisation chooses to archive the data so that it can later analyse what the dark data contains, it needs to pay to keep the huge amounts of data archived, even if there is no immediate use or value for it. Given the rate at which data is generated in modern organisations, such an option is simply not cost-effective. Hence, tools and processes need to be in place for analysing dark data and identifying whether it needs to be retained or disposed of in a secure fashion. Data retention may be needed from the legal compliance perspective, or for mining long-term history for trends/insights. On the other hand, data also needs to be cleansed of personally identifiable information so that sensitive data about individuals does not get exposed inadvertently in case of data leakage or theft.
Dark data analysis involves multiple levels of scrutiny, starting from identifying the contents (what kind of data is hidden in the archives), to the more complex issues of mining the data for relevant actionable insights and identifying non-obvious relationships hidden in it. Each level of analysis may need a different set of tools. Given that dark data comes from multiple disparate sources, both within and outside the enterprise, it needs to be cleansed and formatted before any data mining or analytics can be applied to it. About 90 per cent of any data analysis task consists of cleaning up the data and getting it into a state in which it can be used in analytic pipelines.
Data wrangling and curating data at rest are key value-added services that should be delivered on storage systems. Data curation also includes the following transformations: verifying the data (to ascertain its composition), cleaning incoming data (e.g., 99999 is not a legal ZIP code), transforming the data (e.g., from the European date format to the US date format), integrating it with other data sources of interest (into a composite whole) and de-duplicating the resulting composite data set.
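As a minimal sketch of what such a curation pipeline could look like (the record fields, the helper names and the sample values below are illustrative assumptions, not part of any particular product), consider the following Python snippet:

from datetime import datetime

# Toy input records; the field names are assumptions for this sketch.
raw_records = [
    {"name": "Asha", "zip": "560001", "date": "31/03/2016"},
    {"name": "Ravi", "zip": "99999",  "date": "01/04/2016"},   # illegal ZIP code
    {"name": "Asha", "zip": "560001", "date": "31/03/2016"},   # exact duplicate
]

def verify(record):
    # Ascertain the composition of the record: are the expected fields present?
    return all(key in record for key in ("name", "zip", "date"))

def clean(record):
    # Drop values known to be illegal, e.g., the placeholder ZIP code 99999.
    if record["zip"] == "99999":
        record["zip"] = None
    return record

def transform(record):
    # Convert the European DD/MM/YYYY date format to the US MM/DD/YYYY format.
    record["date"] = datetime.strptime(record["date"], "%d/%m/%Y").strftime("%m/%d/%Y")
    return record

def deduplicate(records):
    # Remove exact duplicates from the resulting composite data set.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

curated = deduplicate([transform(clean(r)) for r in raw_records if verify(r)])
print(curated)

In a real curation service, each of these stages would of course be far more elaborate, but the verify-clean-transform-integrate-deduplicate order stays the same.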
Modern storage systems typically support rich data services like data replication, snapshot management, file system level encryption and compression. While such rich data services cater to today's storage demands, the brave new world of IoT requires revisiting the promise of storage appliances, turning them from dumb data containers into actionable insight generators. Raw data alone does not provide actionable value for end customers, so delivering insight-centric storage should be the key direction along which future storage systems evolve.
There has been a lot of interest in active data services that help mine data and offer data discovery. Deep storage contains historical data. Change data capture (CDC), in the context of the existing data at rest, allows us to answer the following questions:
- Does the new incoming piece of data, when combined with a summary/impression of the existing long-tailed deep storage data, give rise to new insights?
- Does it allow us to invalidate any existing insights that we have derived in the past?
- Is the new incoming piece of data interesting enough (an anomaly/outlier) that it needs to be kept track of, or can it be discarded?
Here are some examples:
- For a content serving network, incoming location data about a client's movement from, say, Karnataka to Gujarat triggers a data replication event to host the client's favourite Kannada film songs/movies on an edge network that is faster to access from Gujarat.
- For an advertising company's client website, an update to the email ID of a long-standing client can indicate a move to being a top corporate customer, giving your sales team insight into which loyal supporters now sit among your top customer accounts.
- A sudden spike in the energy meter data coming in daily from an IoT sensor, for a group of households served by a common solar panel, may indicate a malfunction in the common service infrastructure, allowing the solar company to initiate proactive maintenance.
From the above use cases, it is clear that any new incoming piece of data needs to be seen in the context of existing historical data in order to deliver insights. Incoming data also needs to be seen in the context of the consumer/client interested in the data, for whom the insight is significant. Privacy considerations may require anonymised insights applicable to a target population/group of users. The relevance and significance of incoming data is best noted at the point of ingest, in the frame of existing historical data and the current context.
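As a rough sketch of such point-of-ingest screening, assume that the "impression" of the long-tailed deep storage data is just a running mean and standard deviation per metric (maintained here with Welford's algorithm); the threshold and sample readings are illustrative assumptions:

import math

class RunningSummary:
    # A compact impression of the long-tailed historical data:
    # a running mean and variance maintained with Welford's algorithm.
    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0

def screen_at_ingest(summary, value, threshold=3.0):
    # Decide at the point of ingest whether the incoming value is interesting
    # (an outlier against history) or can simply be folded in and discarded.
    sd = summary.stddev()
    interesting = sd > 0 and abs(value - summary.mean) > threshold * sd
    summary.update(value)            # fold the new value into the impression
    return "flag-for-insight" if interesting else "fold-and-discard"

# Daily energy meter readings (made-up numbers); the final spike gets flagged.
history = RunningSummary()
for reading in [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 48.7]:
    print(reading, screen_at_ingest(history, reading))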
Just consider this scenario. Client X has a set of data stores associated with him. One data store is his entertainment album, another is his work folder containing his software projects, and a third holds his financial records. Your intelligent storage system has tagged these folders based on access times and historical patterns, inferring that the client accesses the entertainment data store between 7 p.m. and 10 p.m. Given that your tenant has purchased storage, you would like to improve the quality of service given to him. Knowing his usage pattern, you can proactively move his media album to low latency flash storage between 7 p.m. and 10 p.m., thereby improving his entertainment experience.
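A rough sketch of how such access-pattern-driven tiering could be expressed follows; the store names, the peak windows and the tier labels are purely illustrative assumptions:

from datetime import datetime

# Peak access windows inferred from historical access patterns (illustrative).
access_profile = {
    "entertainment_album": (19, 22),   # 7 p.m. to 10 p.m.
    "work_projects":       (9, 18),
    "financial_records":   None,       # no clear access pattern
}

def preferred_tier(store, now=None):
    # Place a store on low latency flash during its inferred peak window,
    # and on cheaper capacity storage otherwise.
    now = now or datetime.now()
    window = access_profile[store]
    if window and window[0] <= now.hour < window[1]:
        return "flash"
    return "capacity-disk"

# A background job in the storage system could poll this periodically
# and trigger the actual data movement between tiers.
for store in access_profile:
    print(store, "->", preferred_tier(store))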
The arrival of the Internet of Things (IoT) heralds a new way of looking at how storage has traditionally been used. Smart IoT devices typically come with a small amount of local storage (capable of holding a day or two of data, at most) and, over longer periods, push data to cloud-backed stores for long-tail analysis. Local storage on smart IoT devices also needs to cater to operating conditions of low power consumption and intermittent connectivity. Hence, analytic algorithms need to be power-aware and resilient in the face of intermittent network connectivity.
Increasingly, smart IoT devices are creating digital footprints of the physical world. The data generated locally in the context of a physical entity (such as a user, machine or software application) gets pushed to separate silos in the cloud and is analysed in isolation, without factoring in the context in which that entity generated these multiple data sources. While the data in each individual silo gets analysed and mined, insight discovery across these different data sources is typically not achieved because of the storage silos. In the context of the IoT, the digital footprints get lost when moved to separate silos; at their common origin, however, they can be unified and analysed for insight discovery in context. What is needed is local analytics at the origin of the data generating entity, with local context and real-time responses. Hence, there needs to be a shift in the current IoT data analytics paradigm. The current paradigm is to store the raw data locally, move it to the cloud, analyse it and then act on the insights; instead, it makes sense to filter and analyse the data locally, gain the insights applicable locally, and only then store the data in the cloud, if needed. If not required, the raw data can be discarded.
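A small sketch of that inverted pipeline follows: analyse locally first, act on local insights, and only then upload what the long-tail analysis actually needs. The upload_to_cloud stub, the baseline and the tolerance are assumptions for illustration:

def is_unremarkable(reading, baseline, tolerance=0.2):
    # Shallow, in-context analysis at the data generating entity:
    # is this reading within the expected band for this device?
    return abs(reading - baseline) / baseline <= tolerance

def upload_to_cloud(payload):
    # Stub: in a real deployment this would push to a cloud-backed store.
    print("uploading", payload)

def ingest_at_edge(readings, baseline):
    # Filter and analyse locally; discard unremarkable raw data and ship
    # only the anomalous readings that merit long-tail analysis in the cloud.
    anomalies = [r for r in readings if not is_unremarkable(r, baseline)]
    if anomalies:
        upload_to_cloud({"baseline": baseline, "anomalies": anomalies})

ingest_at_edge([10.1, 9.9, 10.0, 14.8, 10.2], baseline=10.0)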
There has been a lot of debate over deductive and predictive analytics and their ability to mine for insights and patterns. While computers have become extremely good at certain kinds of problems, such as finding the best route to a destination or enumerating all combinations, they rarely beat humans at identifying patterns, especially when the features are approximate. A typical example is human intuition's ability to tag a face as that of a friend or foe about 100 times faster than any machine.
There are various kinds of data discovery problems. Certain problems require precise prediction, along with information on how the prediction was made. For instance, consider the selection of a candidate for a job. A machine learning algorithm that analyses the data based on all the determined feature vectors will produce an objective ranking, with a hard record of how the ranking was arrived at. This avoids subjective judgment in such critical tasks. On the other hand, consider a fuzzy problem such as whether a conversation between a group of people contains racial undertones. Human intuition driven pattern detection (built over centuries of our evolution) can detect the undertones of the conversation and classify it more accurately than any machine-learnt algorithm. Accurate discovery based on such approximate impressions needs interactive local analytics on the raw data at rest.
When the classification feature vector has a huge number of dimensions and the context/scenario is fuzzy, human-driven contextual local analytics is more successful. This, in turn, necessitates that the data be presented in a manner conducive to human-driven pattern recognition. We propose building smart storage for IoT devices capable of approximate, shallow analytics tasks. Since IoT devices are thin, low power edge devices, they are capable only of running energy-aware shallow analytics in-situ. Decentralised in-situ local analytics, driven by human interaction at such smart IoT devices, needs to be built into IoT storage. Such edge devices will participate in hierarchical insight discovery, wherein local insights relevant to the data generating entity are computed at the origin, whereas insights gleaned from the long-tailed data are computed at the back-end cloud.
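One way to picture this hierarchical split is sketched below: each edge device computes only a cheap, shallow summary in-situ, and the back-end cloud merges those summaries for long-tailed insight discovery. The summary format here is an assumption made purely for this sketch:

def edge_summary(readings):
    # Shallow, energy-aware analytics on a thin edge device: a compact summary.
    return {"count": len(readings), "total": sum(readings),
            "min": min(readings), "max": max(readings)}

def cloud_merge(summaries):
    # Back-end aggregation of edge summaries for long-tailed insight discovery.
    total_count = sum(s["count"] for s in summaries)
    return {"count": total_count,
            "mean": sum(s["total"] for s in summaries) / total_count,
            "min": min(s["min"] for s in summaries),
            "max": max(s["max"] for s in summaries)}

# Two edge devices summarise locally; only the summaries travel to the cloud.
device_a = edge_summary([10.1, 9.9, 10.4])
device_b = edge_summary([11.0, 10.8, 48.7])   # contains a local outlier
print(cloud_merge([device_a, device_b]))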
In-situ visualisation is another key need for the analysis of data at rest. Human beings are great at detecting patterns. Insight discovery analytics needs to run locally on the sensor device, in a way that is interactive with humans. Visualisation is not about fancy pie charts, but about making sense of data and identifying patterns, with human-aided machine learning. We will discuss more on in-situ visualisation for data at rest in next month's column.
If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!