Ten facts highlighting the evolution of big data

The number of companies storing, processing and drawing value from data of all sizes and forms is increasing day by day. Systems that support huge volumes of both structured and unstructured data will continue to grow in the near future. The market will demand platforms that help data custodians secure and govern big data while empowering end users to analyse that data. Furthermore, these systems will mature to operate well within enterprise IT systems and standards.

Big data becomes approachable and fast

You can conduct sentiment analysis and perform machine learning on Hadoop, but the first question people often ask is: how fast is the interactive SQL? After all, SQL is the channel to business users who want to use Hadoop data for faster, more repeatable KPI dashboards as well as exploratory analysis.

This need for speed has driven the adoption of faster databases such as MemSQL and Exasol, Hadoop-based stores such as Kudu, and technologies that enable faster queries. With OLAP-on-Hadoop technologies (Kyvos Insights, Jethro Data and AtScale) and SQL-on-Hadoop engines (Drill, Phoenix, Presto, Hive LLAP and Apache Impala), these query accelerators are further blurring the lines between the big data world and traditional warehouses.

No longer exclusive to Hadoop

We saw various technologies rise with the big data wave to satisfy the need for analytics on Hadoop. But enterprises with complex, heterogeneous environments no longer wish to adopt an isolated BI (business intelligence) access point for just one source of data (read Hadoop). The answers to their questions are buried in a host of sources that range from systems of record to cloud warehouses, to structured and unstructured data from both Hadoop and non-Hadoop sources. Even relational databases are becoming big data-ready. For instance, SQL Server 2016 recently added JSON support.

Very soon, enterprise customers will demand analytics on all their data. Platforms that are data- and source-agnostic will flourish, while those that are purpose-built for Hadoop and fail to deploy across use cases will fall by the wayside. An early indicator of this trend is the exit of the big data analytics company Platfora.

Organisations make maximum use of data lakes from the get-go

Building a data lake resembles building a reservoir. First, you dam the end (the cluster), then you let it fill up with water (the data). Once the lake is established, you begin using the water for different purposes such as recreation, drinking and generating electricity. In the case of data, the purposes can be cyber security, machine learning and predictive analytics.

Until now, hydrating the lake has been an end in itself. This is changing as the business justification for Hadoop tightens. Companies are demanding agile and repeatable use of the lake for quicker answers. They are considering business outcomes carefully before investing in infrastructure, data and personnel. This will promote a stronger partnership between IT and the business, and self-service platforms will gain recognition as the tool for harnessing big data assets.

Architectures mature to reject one-size-fits-all frameworks

Hadoop has become a multi-purpose engine for ad hoc analysis. It is even being used for operational reporting on day-to-day workloads, the kind traditionally handled by data warehouses. Organisations are responding to these hybrid needs by pursuing use case-specific architecture design. Before committing to a data strategy, they research a host of factors that include user personas, questions, volumes, frequency of access, speed of data and level of aggregation. These modern reference architectures are needs-driven. They combine self-service data-prep tools, the Hadoop core and end-user analytics platforms in ways that can be reconfigured as those needs evolve. Ultimately, the flexibility of these architectures will drive technology choices.

Investments are driven by variety, not velocity or volume

Gartner defines big data as the three Vs: high-volume, high-velocity and high-variety information assets. While all three Vs are growing, variety is becoming the single largest driver of investments in big data. This trend will continue as organisations seek to integrate more sources and focus on the long tail of big data. From schema-free JSON to nested types in other databases (relational and NoSQL) to non-flat data (XML, Parquet and Avro), data formats are multiplying and connectors are becoming critical. Analytics platforms are being assessed on their ability to offer live, direct connectivity to these different sources.
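To make the point about nested and schema-free formats concrete, here is a minimal Python sketch of what a connector has to do before such data can be queried as flat columns. The function name `flatten`, the dotted column naming and the sample record are assumptions made for illustration; they do not come from any particular product.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts and lists into a single-level dict of column -> value."""
    if isinstance(record, dict):
        items = record.items()
    elif isinstance(record, list):
        # List positions become part of the column name, e.g. "tags.0".
        items = ((str(i), v) for i, v in enumerate(record))
    else:
        return {parent_key: record}

    flat = {}
    for key, value in items:
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, (dict, list)) and value:
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars (and empty containers) are stored as-is.
            flat[full_key] = value
    return flat


# Hypothetical schema-free JSON document, such as an IoT reading.
raw = '{"device": {"id": "t-17", "tags": ["roof", "north"]}, "reading": {"temp_c": 21.5}}'
row = flatten(json.loads(raw))
print(row)
```

Each nested field ends up as its own column (for example `device.id` or `reading.temp_c`), which is roughly the shape a SQL-based analytics tool expects. Real connectors also have to handle type inference and schema drift, which this sketch ignores.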

Machine learning and Spark light up big data

In a survey of BI analysts, IT managers and data architects, around 70 per cent of the respondents favoured Spark over the incumbent MapReduce, which is batch-oriented and does not lend itself to real-time stream processing or interactive applications. These big-compute capabilities on big data have elevated platforms featuring computation-intensive graph algorithms, AI and machine learning. Microsoft Azure in particular has taken off, thanks to its beginner-friendliness and easy integration with existing Microsoft platforms. Opening up machine learning to the masses is leading to the creation of more applications and models generating petabytes of data. As machines learn and systems become smarter, all eyes will be on the providers of self-service software to see how they make this data approachable for the end user.

Convergence of big data, cloud and IoT creates new opportunities for self-service analytics

IoT is generating huge volumes of structured as well as unstructured data, and a growing share of this data is being deployed on cloud services. The data resides across numerous relational and non-relational systems, from Hadoop clusters to NoSQL databases. While innovations in storage and managed services have sped up the capture process, accessing and understanding the data itself still poses a major last-mile challenge. Consequently, demand is growing for analytical tools that seamlessly connect to and combine a wide variety of cloud-hosted data sources. Such tools enable businesses to explore and visualise any kind of data stored anywhere, helping them find the hidden opportunity in their IoT investment.

Data preparation becomes mainstream

Self-service analytics platforms have made Hadoop data more accessible to business users. But these users want to further reduce the time and complexity of preparing data for analysis, which is significant, especially when dealing with a variety of data types and formats.

Agile self-service data-prep tools allow Hadoop data to be prepared at the source and also make the data available as snapshots for faster and easier exploration. In this space, we have witnessed a host of innovation from organisations focused on end-user data prep for big data, such as Paxata, Trifacta and Alteryx. These tools are lowering the barriers to entry for late adopters of Hadoop and laggards, and they continue to gain traction.
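As a rough illustration of the kind of work data-prep tools automate, the following Python sketch normalises column names and coerces loosely typed text values into numbers and dates. The helper names `coerce` and `prepare`, the list of accepted date formats and the sample row are all assumptions for this example; commercial tools such as those named above do far more.

```python
from datetime import date, datetime

# Assumed date formats for this sketch; a real tool would detect many more.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def coerce(value):
    """Turn a raw value into int, float, date, None (if blank) or cleaned text."""
    if value is None:
        return None
    text = str(value).strip()
    if text == "":
        return None
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return text


def prepare(rows):
    """Normalise column names and coerce every value in a list of dict rows."""
    return [{key.strip().lower(): coerce(val) for key, val in row.items()} for row in rows]


# Hypothetical messy input, as it might arrive from a CSV export.
messy = [{" Amount ": "12.50", "Qty": "3", "Date": "01/03/2017", "Note": ""}]
print(prepare(messy))
```

The design choice here is to try the strictest interpretation first (integer, then float, then date) and fall back to plain text, so no value is ever discarded during preparation.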

Hadoop adds to enterprise standards

Today, we are witnessing more investment in the security and governance components surrounding enterprise systems. Apache Sentry offers a system for enforcing fine-grained, role-based authorisation to data and metadata stored on a Hadoop cluster. Created as part of a data governance initiative, Apache Atlas empowers companies to apply consistent data classification across the data ecosystem. Apache Ranger offers centralised security administration for Hadoop.

Customers are beginning to expect these kinds of capabilities from their enterprise-grade RDBMS platforms. These capabilities are moving to the forefront of emerging big data technologies, thereby removing yet another barrier to enterprise adoption.
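The idea of fine-grained, role-based authorisation can be sketched in a few lines of Python. This is a generic illustration only: the `Privilege` and `Role` classes, the `is_allowed` function and the sample roles are hypothetical, and none of this reflects the actual API or policy format of Apache Sentry or Apache Ranger.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Privilege:
    action: str        # e.g. "select" or "insert"
    database: str
    table: str = "*"   # "*" grants the action on every table in the database


@dataclass
class Role:
    name: str
    privileges: set = field(default_factory=set)


def is_allowed(user_roles, action, database, table):
    """Return True if any of the user's roles grants the action on the table."""
    for role in user_roles:
        for priv in role.privileges:
            if (priv.action == action
                    and priv.database == database
                    and priv.table in ("*", table)):
                return True
    return False


# Hypothetical roles: an analyst limited to one table, a steward with wider access.
analyst = Role("analyst", {Privilege("select", "sales", "orders")})
steward = Role("steward", {Privilege("select", "sales"), Privilege("insert", "sales")})

print(is_allowed([analyst], "select", "sales", "orders"))
print(is_allowed([analyst], "select", "sales", "customers"))
```

Access is granted to roles rather than to individual users, so an administrator changes what a whole group can do by editing one role. "Fine-grained" here means the privilege can be scoped down to a single table instead of the entire database.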

Rise of metadata catalogs helps in finding analysis-worthy big data

Metadata catalogs help users discover and understand relevant data using self-service tools. Companies like Waterline and Alation are filling this gap in customer need. They are leveraging machine learning to automate the work of finding data in Hadoop. They also catalog files using tags, uncover relationships between data assets and offer query suggestions through searchable UIs. This helps both data consumers and data stewards reduce the time it takes to find, trust and accurately query the data. We see growing awareness of and demand for self-service discovery, which will evolve as a natural extension of self-service analytics.
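A toy version of such a catalog might look like the Python sketch below. The `Catalog` class, its method names and the file paths are invented for illustration, and the tags are assigned by hand here, whereas the products mentioned above are described as deriving them with machine learning.

```python
from collections import defaultdict


class Catalog:
    """A minimal in-memory metadata catalog keyed by file path."""

    def __init__(self):
        self._entries = {}                # path -> description
        self._by_tag = defaultdict(set)   # tag -> set of paths

    def register(self, path, description, tags):
        """Record a data asset with a description and a list of tags."""
        self._entries[path] = description
        for tag in tags:
            self._by_tag[tag.lower()].add(path)

    def find_by_tag(self, tag):
        """Return all paths carrying the given tag (case-insensitive)."""
        return sorted(self._by_tag.get(tag.lower(), set()))

    def search(self, term):
        """Return paths whose path or description contains the term."""
        term = term.lower()
        return sorted(
            path for path, desc in self._entries.items()
            if term in path.lower() or term in desc.lower()
        )

    def related(self, path):
        """Return other paths that share at least one tag with the given path."""
        found = set()
        for paths in self._by_tag.values():
            if path in paths:
                found |= paths
        found.discard(path)
        return sorted(found)


# Hypothetical assets in a data lake.
catalog = Catalog()
catalog.register("/lake/sales/orders.parquet", "Daily order lines", ["sales", "pii"])
catalog.register("/lake/sales/customers.avro", "Customer master records", ["sales", "pii"])
catalog.register("/lake/iot/sensors.json", "Raw sensor readings", ["iot"])

print(catalog.find_by_tag("PII"))
print(catalog.related("/lake/sales/orders.parquet"))
```

Shared tags act as a crude stand-in for the relationships between data assets that a production catalog would uncover from lineage, usage and content.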