Data engineering is a key component of the democratization of data and AI. Yet it faces formidable obstacles: complex and fragile connectors, difficulty consolidating data from many, often proprietary sources, and operational disruptions. Databricks aims to address several of these issues with the launch of a new platform.
Databricks unveiled LakeFlow, a new data engineering solution, at its annual Data + AI Summit. It is intended to simplify every step of the process, from data import to transformation and orchestration.
Built on top of Databricks' Data Intelligence Platform, LakeFlow automates pipeline deployment, operation, and monitoring at scale in production. It can also ingest data from external systems, such as databases, cloud sources, and enterprise applications.
In his keynote at the Data + AI Summit, Databricks CEO and co-founder Ali Ghodsi said that data fragmentation is one of the main obstacles businesses face in adopting GenAI. According to Ghodsi, juggling the high costs and proprietary lock-in of multiple platforms is a “complexity nightmare.”
Prior to LakeFlow, Databricks depended on partners such as dbt and Fivetran to supply tools for data loading and preparation. That dependency is no longer necessary: with deep Unity Catalog integration, end-to-end governance, and serverless compute, Databricks now offers a single platform for a more efficient and scalable setup.
A sizable portion of Databricks’ client base does not use the Databricks partner ecosystem; this significant market segment builds its own custom solutions to fit its own needs. Rather than writing connectors, maintaining data pipelines, and purchasing and configuring additional platforms, these customers want a service that is built into the platform.
LakeFlow Connect, which offers built-in connectors between various data sources and the Databricks service, is a crucial part of the new platform. Users can ingest data from enterprise applications like Google Analytics, SharePoint, and Salesforce, as well as databases like Oracle, MySQL, and Postgres.
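For context, here is a minimal sketch of the kind of hand-built ingestion these managed connectors are meant to replace: a plain Spark JDBC pull from Postgres into a Delta table. The hostname, credentials, and table names are placeholders, not real endpoints, and this is not LakeFlow Connect's own interface.

```python
# A hand-rolled JDBC pull from Postgres into a Delta table, the sort of
# one-off ingestion code managed connectors are meant to replace.
# Host, credentials, and table names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/sales")  # hypothetical host
    .option("dbtable", "public.orders")
    .option("user", "ingest_user")
    .option("password", "********")  # use a secret scope in practice
    .load()
)

# Land the raw pull as a Delta table. Incremental change capture, retries,
# and schema drift handling are exactly the chores a managed connector takes on.
orders.write.format("delta").mode("overwrite").saveAsTable("bronze.orders_raw")
```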
LakeFlow Pipelines, built on Databricks’ Delta Live Tables technology, let customers perform ETL and data transformation in either Python or SQL. The feature also provides a low-latency mode for near-real-time incremental data processing and transfer.
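Because LakeFlow Pipelines builds on Delta Live Tables, a small DLT pipeline in Python gives a feel for the declarative style involved. This is a sketch only; the source path and table names are illustrative.

```python
# A small Delta Live Tables pipeline in Python: an incremental bronze table
# fed by Auto Loader and a cleaned silver table with a quality gate.
# The `spark` session is provided by the pipeline runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw order events loaded incrementally from cloud storage.")
def orders_bronze():
    # Auto Loader picks up new files as they arrive in the landing path.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/ingest/orders/")  # hypothetical landing path
    )

@dlt.table(comment="Cleaned orders with basic typing and a quality expectation.")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")
        .withColumn("amount", F.col("amount").cast("double"))
        .withColumn("order_date", F.to_date("order_ts"))
    )
```

Declared expectations such as `valid_amount` are also what let the platform track data quality and surface pipeline health downstream.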
With the LakeFlow Jobs feature, which enables automated orchestration and data recovery, users can also monitor the health of their data pipelines. The tool integrates with alerting platforms like PagerDuty, and administrators are automatically notified when a problem is detected.
“Thus far, we have discussed using connectors to get the data in, and pipelines to transform it. But what if I have other priorities? What if I want to update a dashboard, or train a machine learning model on this data? What other steps do I need to take with Databricks? Jobs is the orchestrator for that,” Ghodsi explained.
Data teams can automate the deployment, orchestration, and monitoring of data pipelines in one location with LakeFlow Jobs’ centralized management and control flow capabilities.
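As a rough illustration of that kind of orchestration, the sketch below uses the existing Databricks Jobs API via the Python SDK to chain a pipeline run to a downstream notebook and alert operators on failure. The pipeline ID, notebook path, and email address are placeholders, and LakeFlow Jobs’ own interface may differ from this.

```python
# Orchestration sketch with the Databricks Jobs API (Python SDK):
# run the ingest/transform pipeline, then a downstream notebook
# (e.g. a dashboard refresh), and notify operators if anything fails.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up workspace credentials from the environment

job = w.jobs.create(
    name="orders-refresh",
    tasks=[
        # Step 1: run the pipeline that lands and transforms the data.
        jobs.Task(
            task_key="run_pipeline",
            pipeline_task=jobs.PipelineTask(pipeline_id="<pipeline-id>"),  # placeholder
        ),
        # Step 2: once the pipeline succeeds, refresh a dashboard or retrain a model.
        jobs.Task(
            task_key="refresh_dashboard",
            depends_on=[jobs.TaskDependency(task_key="run_pipeline")],
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/analytics/refresh_dashboard"),
            # Serverless job compute is assumed here; otherwise attach a cluster spec.
        ),
    ],
    # Failure notifications; a PagerDuty hook would be wired up as a workspace
    # notification destination rather than an email list.
    email_notifications=jobs.JobEmailNotifications(on_failure=["data-ops@example.com"]),
)

print(f"Created job {job.job_id}")
```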
The launch of LakeFlow marks an important step in Databricks’ ambition to democratize and simplify data and AI, enabling data teams to tackle the world’s most difficult challenges. LakeFlow is not yet in preview, but users can join Databricks’ waitlist to sign up for early access.