Databricks open sources Delta Lake data lake

Databricks is open sourcing Delta Lake to counter criticism from rivals and take on Apache Iceberg as well as data warehouse products from competitors.

In an effort to push past doubts cast by rival firms, data lake provider Databricks on Tuesday said that it is open sourcing all Delta Lake APIs as part of the Delta Lake 2.0 release. The company also announced that it will be contributing all enhancements of Delta Lake to The Linux Foundation.

Databricks rivals such as Cloudera, Dremio, Google (BigLake), Microsoft, Oracle, SAP, Amazon Web Services (AWS), Snowflake, Hewlett Packard Enterprise (HPE) and Vertica have criticised the company, casting doubt on whether Delta Lake was open source or proprietary and thereby taking away a share of prospective customers, analysts said.

"The new announcement should provide continuity and clarity for users and help counter confusion (stoked in part by competitors) about whether Delta Lake is proprietary or open source," said Matt Aslett, research director at Ventana Research.

With the new announcements, Databricks is putting customer concerns and competitive criticism to bed, said Doug Henschen, principal analyst at Constellation Research.

"In competitive deals, rivals such as Snowflake would point out to would-be customers that aspects of Delta Lake were proprietary," Henschen said, adding that Databricks customers can now trust that their data is on an open platform and that they're not locked into Databricks.

Competition grows in commercial open source market

With an increasing number of commercial open source projects in the data lake market, Databricks' Delta Lake may find itself facing new competition, including Apache Iceberg, which offers high-performance querying for very large analytic tables.

"There are also open source projects that have recently started to be commercialised, such as OneHouse for Apache Hudi and both Starburst and Dremio coming out with their Apache Iceberg offerings," said Hyoun Park, chief analyst at Amalgam Insights.

"With these offerings coming out, Delta Lake faced pressure from other open source lakehouse formats to become more functionally robust as the lakehouse market starts to splinter and technologists have multiple options," Park added.

Many other players in this space are focused on Apache Iceberg as an alternative to Delta Lake tables, Ventana's Aslett said. Delta tables, in contrast to plain files in a data lake, support ACID (atomicity, consistency, isolation and durability) transactions and keep table metadata in a transaction log, which helps with faster data ingestion.
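To make the idea of transactional metadata more concrete, here is a toy sketch in Python of an append-only commit log. It is an illustration under stated assumptions, not Delta Lake's actual implementation: the class name ToyTableLog, the _toy_log directory and the "add"/"remove" action format are all hypothetical. Each commit is a numbered JSON file, a writer can only claim the next version number if no other writer already has, and readers rebuild the table's current state by replaying the commits in order.

```python
import json
import os
import tempfile


class ToyTableLog:
    """Toy append-only commit log. Illustrative only, not Delta Lake's code."""

    def __init__(self, table_dir):
        # "_toy_log" is a hypothetical directory name chosen for this sketch.
        self.log_dir = os.path.join(table_dir, "_toy_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def latest_version(self):
        """Return the highest committed version, or -1 for an empty log."""
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, read_version, actions):
        """Try to write version read_version + 1.

        Returns False if another writer already committed that version,
        in which case the caller should re-read and retry.
        """
        path = os.path.join(self.log_dir, f"{read_version + 1:020d}.json")
        try:
            # Mode "x" is an exclusive create: it fails if the file exists,
            # so two writers cannot both claim the same version number.
            with open(path, "x") as f:
                json.dump(actions, f)
        except FileExistsError:
            return False
        return True

    def snapshot(self):
        """Replay all commits in order and return the set of live data files."""
        live = set()
        for name in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, name)) as f:
                for action in json.load(f):
                    if action["op"] == "add":
                        live.add(action["file"])
                    elif action["op"] == "remove":
                        live.discard(action["file"])
        return live


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as table_dir:
        log = ToyTableLog(table_dir)
        # First writer adds a data file as version 0.
        print(log.commit(-1, [{"op": "add", "file": "part-0.parquet"}]))
        # A second writer that also read version -1 loses the race.
        print(log.commit(-1, [{"op": "add", "file": "part-x.parquet"}]))
        print(sorted(log.snapshot()))
```

The losing writer in the demo would be expected to re-read the log and retry against the newer version, which is the usual optimistic-concurrency pattern. A real table format also has to handle crashes mid-write, object stores without exclusive create, and log compaction, none of which this sketch attempts.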

Delta Lake is sometimes referred to as a data lakehouse, a data architecture that offers both storage and analytics capabilities, in contrast to conventional data lakes, which store data in its native format, and data warehouses, which store structured data (often queried with SQL).

In April, Google announced BigLake and Iceberg support, and earlier this month, Snowflake announced support for Apache Iceberg tables in private preview.

The Iceberg announcements, just like Databricks' open source strategy, aim to appeal to prospective customers who might have concerns about committing to one vendor and the prospect of having access to their own data encumbered down the road, Henschen said.

In the face of renewed competition, Databricks' decision to open source Delta Lake is a sound one, said Sanjeev Mohan, former research vice president for big data and analytics at Gartner.

"Databricks' announcement to open source the full capabilities of Delta Lake is an excellent step to drive wider adoption," Mohan said.

Delta Lake 2.0 offers faster query performance

Databricks' Delta Lake 2.0, which will be fully available later this year, is expected to offer faster query performance for data analysis, the company said.

Databricks on Tuesday also released MLflow 2.0, the second major version of its open source platform for managing the end-to-end machine learning lifecycle (MLOps).

MLflow 2.0 comes with MLflow Pipelines, which offer data scientists predefined, production-ready templates based on the model type they're building to allow them to accelerate model development without requiring intervention from production engineers, the company said.

According to analysts, MLflow 2.0 will serve as a more mature option for data scientists as machine learning production continues to be a challenging process, and translation of algorithmic models into production-grade application code on securely governed resources continues to be difficult.

"There are a number of vendor solutions in this space including Amazon Sagemaker, Azure Machine Learning, Google Cloud AI, Datarobot, Domino Data, Dataiku, and Iguazio," Amalgam's Park said.

"But Databricks serves as a neutral vendor compared to the hyperscalers and Databricks' unified approach to data and model management serves as a differentiator to MLOps vendors that focus on the coding and production challenges of model operationalisation."

The release of MLflow 2.0 eases the path to bringing streaming data and streaming analysis into production data pipelines, Henschen said, adding that many companies struggle with MLOps and fail even after successfully creating machine learning models.