Databricks, a cloud data and artificial intelligence (AI) specialist, has announced that it will contribute all enhancements to its Delta Lake platform to the open source community. Databricks was born from open source software, and making more of its features available is seen as a prudent move as the importance of open standards in cloud-based data storage and processing grows.
Delta Lake will be handed over to the Linux Foundation, which hosts open source projects, the company announced today at the Data + AI Summit, and all Delta Lake APIs will be open-sourced as part of the Delta Lake 2.0 release. In addition, the company announced a new version of its MLflow machine learning platform, which includes MLflow Pipeline, a new feature that simplifies the development of machine learning models.
“Open data lakehouses are quickly becoming the standard for how the most innovative companies handle their data and AI. Delta Lake, MLflow and Spark are all core to this architectural transformation, and we’re proud to do our part in accelerating their innovation and adoption,” said Ali Ghodsi, co-founder and CEO of Databricks.
Ghodsi and his co-founders set up Databricks in 2013 to commercialise Apache Spark, an open source data analytics engine. The company invented the concept of a ‘data lakehouse,’ which combines the structured data storage of a data warehouse with the unstructured data storage of a data lake. This combined approach makes it simpler to use data when deploying machine learning models.
The concept has proven popular, and Databricks claims to be used by over 7,000 organisations worldwide, including 40% of the Fortune 500. It makes money by selling subscriptions to analytics tools that are used on top of Delta Lake, and it reported $425 million in revenue for the fiscal year ending September 2021.
According to Beatriz Valle, senior technology analyst at GlobalData, making Delta Lake 2.0 completely open source appears to be “the natural next step” for Databricks. She describes the move as part of a “highly coherent long-term strategy for its future identity and roadmap as a company.”
The enterprise cloud AI and ML market is fiercely competitive, Valle says. HPE and Red Hat also announced a collaboration today that will make Red Hat’s open source-based tools, including AI and ML options, available on the HPE GreenLake ecosystem, which helps clients manage their cloud and edge deployments and includes a data lake platform. Given this competition, it makes sense for Databricks to emphasise its roots in open source projects.
The announcement of the new MLflow Pipeline feature is also intriguing, according to Valle. The feature provides a set of pre-defined templates for various types of ML models, allowing non-technical users to set up their own models more easily.
“Databricks is going headlong into full automation mode,” Valle says. A likely driver is the desire to put these technologies in the hands of regular business users who do not have significant technical knowledge.