Our client operates multiple stores across Norway and Sweden. To put their data to work, we onboard data from their WooCommerce application and their marketing platforms into Databricks, our distributed data environment, where it is loaded into Delta Lake tables. Using Delta Live Tables (DLT) in Databricks, we consume this data in real time and make it available for reporting and for data scientists training machine learning models.
What is Delta Live Tables (DLT)?
Delta Live Tables (DLT) is a feature of Databricks, a unified data analytics platform. DLT is designed to simplify the process of building, deploying, and maintaining data pipelines at scale.
Here are some key benefits of Delta Live Tables:
- Reliability: DLT tables are stored in Delta Lake, which uses ACID (Atomicity, Consistency, Isolation, Durability) transactions to keep data consistent and protect its integrity.
- Scalability: DLT can handle large volumes of data and scale as your data grows. This makes it ideal for big data processing tasks.
- Simplified Data Engineering: DLT provides a structured framework for developing data pipelines. This simplifies the process of data engineering and reduces the time and effort required to build and maintain data pipelines.
- Real-time Data Processing: DLT supports both batch and real-time data processing. This allows for real-time analytics and decision-making.
- Version Control: Because DLT tables are Delta tables, a version history of your data is maintained. This allows you to access previous versions of your data for auditing or debugging purposes.
- Schema Enforcement and Evolution: DLT enforces schema on write operations, ensuring data consistency. It also supports schema evolution, allowing you to add, delete, or change columns in your data over time.
- Integration with Machine Learning and AI: DLT integrates seamlessly with Databricks’ machine learning and AI capabilities, making it easier to build and deploy predictive models.
- Unified Batch and Streaming: DLT unifies batch and streaming data processing, simplifying the architecture and reducing the maintenance overhead of having separate systems for batch and streaming data.
By leveraging these benefits, organizations can improve their data operations, gain insights faster, and make more informed decisions.
Learn more about Delta Live Tables in the Databricks documentation.
How did we implement DLT to achieve real-time data sync?
We handle various datasets such as Orders, OrderItems, Products, Variations, Refunds, and RefundItems for each store. These datasets are made available in Delta tables through extraction jobs. These jobs, most of which are scheduled to run every 5 to 10 minutes, load data incrementally based on each record's last-modified date.
We have three main extraction jobs:
- Other datasets (Products, Variations, Refunds, RefundItems, Customers)
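The incremental loading by last-modified date described above can be sketched in plain Python. This is a minimal illustration rather than the actual extraction job: the `date_modified` field name, the sample orders, and the `extract_incremental` helper are all assumptions made for the example.

```python
from datetime import datetime, timezone


def extract_incremental(rows, last_watermark):
    """Return the rows modified after last_watermark, plus the new watermark."""
    changed = [r for r in rows if r["date_modified"] > last_watermark]
    # If nothing changed, keep the previous watermark.
    new_watermark = max((r["date_modified"] for r in changed), default=last_watermark)
    return changed, new_watermark


def utc(hour, minute):
    """Hypothetical helper to build sample timestamps on a fixed day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Hypothetical source rows; a real job would read these from the store's API.
orders = [
    {"id": 1, "status": "completed", "date_modified": utc(10, 0)},
    {"id": 2, "status": "processing", "date_modified": utc(10, 7)},
    {"id": 3, "status": "pending", "date_modified": utc(10, 12)},
]

# The previous run finished at 10:05, so only orders 2 and 3 are picked up.
changed, watermark = extract_incremental(orders, utc(10, 5))
print([r["id"] for r in changed])  # [2, 3]

# Running again with the new watermark finds nothing new.
changed_again, watermark_again = extract_incremental(orders, watermark)
print(changed_again)  # []
```

Persisting the watermark between runs is what lets a job scheduled every few minutes fetch only the records that changed since the previous run.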
Bronze – DLT Job:
This job retrieves data from the Delta Lake table and loads it into a newly created Bronze DLT table in real time. We select the necessary fields from the Delta table and add a few additional fields for in-house purposes.
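The Bronze step (select the needed fields, then add in-house fields) might look roughly like the following plain-Python sketch. The field list, the `_store` and `_loaded_at` column names, and the `to_bronze` helper are hypothetical; the actual in-house fields are not specified here.

```python
from datetime import datetime, timezone

# Hypothetical selection of the fields kept in the Bronze table.
BRONZE_FIELDS = ("id", "status", "total", "date_modified")


def to_bronze(raw_row, store, loaded_at):
    """Keep only the selected fields and attach in-house metadata fields."""
    row = {field: raw_row.get(field) for field in BRONZE_FIELDS}
    row["_store"] = store          # assumed metadata: which store the row came from
    row["_loaded_at"] = loaded_at  # assumed metadata: when the row was ingested
    return row


# Hypothetical raw row with more fields than Bronze needs.
raw = {
    "id": 1,
    "status": "completed",
    "total": "499.00",
    "date_modified": "2024-01-01T10:00:00",
    "customer_note": "",
    "meta_data": [],
}

bronze_row = to_bronze(
    raw,
    store="store_no_1",
    loaded_at=datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc),
)
print(sorted(bronze_row))
# ['_loaded_at', '_store', 'date_modified', 'id', 'status', 'total']
```

In a DLT pipeline the same idea would be expressed as a column selection plus added columns on a streaming DataFrame; the dictionary version above only illustrates the shape of the transformation.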
Silver – DLT Job:
This job retrieves data from the Bronze DLT table and loads it into the Silver layer, applying Change Data Capture (CDC) to the data and deduplicating records.
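To make the CDC and deduplication step concrete, here is a minimal plain-Python sketch of the underlying logic: upsert each change by key and keep only the newest version according to a sequencing column. The key `id`, the sequencing column `date_modified`, and the helper `apply_changes_sketch` are assumptions for illustration. In DLT this role is typically played by the APPLY CHANGES API (`dlt.apply_changes` in Python), although which API is used in this pipeline is not stated.

```python
def apply_changes_sketch(target, changes, key="id", sequence_by="date_modified"):
    """Upsert changes into target, keeping the newest version of each key."""
    for change in changes:
        current = target.get(change[key])
        # Insert new keys; replace existing rows only if the change is not older.
        if current is None or change[sequence_by] >= current[sequence_by]:
            target[change[key]] = change
    return target


silver = {}

# Hypothetical Bronze batch: order 1 arrives twice unchanged, then is updated.
# Timestamps are ISO-8601 strings in one format, so string comparison orders them.
batch = [
    {"id": 1, "status": "processing", "date_modified": "2024-01-01T10:00:00"},
    {"id": 1, "status": "processing", "date_modified": "2024-01-01T10:00:00"},
    {"id": 1, "status": "completed", "date_modified": "2024-01-01T10:07:00"},
    {"id": 2, "status": "pending", "date_modified": "2024-01-01T10:03:00"},
]
apply_changes_sketch(silver, batch)

# A late-arriving, older version of order 1 must not overwrite the newer one.
late = [{"id": 1, "status": "on-hold", "date_modified": "2024-01-01T10:02:00"}]
apply_changes_sketch(silver, late)

print({k: v["status"] for k, v in silver.items()})
# {1: 'completed', 2: 'pending'}
```

Sequencing by the modification timestamp rather than by arrival order is what makes the result independent of duplicates and out-of-order delivery.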
Gold – DLT Job:
This job retrieves the latest data from the Silver DLT and loads it into the Gold layer.
Marketing Job:
We integrate data from various marketing sources such as Facebook, Snapchat, and Instagram via Hevo. The Hevo integration is configured to load this data into Databricks. Once the data is loaded into Delta tables, we consume it and load it directly into the Gold layer.