
Cloud Lakehouse with Change-Data-Capture Ingestion

Senior Data Engineer & Architect · 2022-2023

Dany D.

Lead Data & ML Engineer

Data Engineer & Big Data

Key Expertise

Declarative Data Engineering, Advanced Stream Processing, Real-time CDC Pipelines, Medallion Lakehouse Design, MLOps, Agentic AI Architecture

Experience

8+ years

Timezone

CET (UTC +1)

Skills

AI / ML

LightGBM, statsmodels, LangGraph, LangChain, MLflow

Languages

Python

Databases

Delta Lake, PostgreSQL, Unity Catalog, Auto Loader, Databricks

Infrastructure

Kafka, Terraform, Kubernetes, AWS, Azure, Azure DevOps Pipelines, GitLab CI, Datadog, centralized logging, ruff, mypy, bandit

Frameworks

Scikit-learn, PySpark, Pydantic, typed configuration framework, Databricks Asset Bundles, Databricks Workflows, declarative streaming pipelines, pytest

Integrations & Protocols

Model Context Protocol, log-based CDC connectors, Power BI

Overview

The project involved designing and delivering a cloud-native data platform for a financial-services institution moving off a fragmented legacy ETL stack. The platform is built around a medallion lakehouse on Databricks, declarative streaming transformations for the silver layer, and log-based change-data-capture from operational relational sources via a managed Kafka service. A config-driven pipeline layer decouples table onboarding from code changes, and a data-quality engine splits each stream into a clean sink and a quarantine sink for audit and remediation.
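
To make the config-driven layer concrete, a single silver-table onboarding entry might look roughly like the sketch below. This is a minimal illustration only: the field names, paths, and the Python-dict rendering of the JSON entry are assumptions inferred from the description above, not the project's actual configuration schema.

```python
# Illustrative sketch of one declarative silver-table entry (all names and paths are
# hypothetical). Onboarding another dataset means adding another entry like this one,
# not writing new pipeline code.
payments_silver = {
    "table": "payments_silver",
    "source_format": "cloudFiles",  # Auto Loader over the bronze landing zone
    "source_path": "abfss://bronze@<storage-account>.dfs.core.windows.net/payments/",
    "target_path": "abfss://silver@<storage-account>.dfs.core.windows.net/payments/",
    "quarantine_path": "abfss://silver@<storage-account>.dfs.core.windows.net/_quarantine/payments/",
    "schema": {
        "payment_id": "string",
        "amount": "decimal(18,2)",
        "booked_at": "timestamp",
    },
    "mandatory_columns": ["payment_id", "amount"],
    "type_casts": {"amount": "decimal(18,2)", "booked_at": "timestamp"},
}
```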

Achievements

  • Replaced hand-written, per-table pipeline code with a declarative JSON configuration model, so onboarding a new dataset becomes a configuration exercise rather than an engineering project.
  • Materially reduced downstream data-quality incidents through per-column mandatory and type-cast validation, with full lineage of failed records into a quarantine table.
  • Introduced log-based CDC with exactly-once delivery semantics, eliminating the polling overhead and latency of the previous arrangement while preserving schema evolution through the schema registry.
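
As a rough illustration of this validation and quarantine pattern, the per-column checks can be expressed as a split of the stream into a clean branch and a faulty branch. The helper below is a hedged sketch, not the project's actual engine; the column names, cast targets, paths, and the `bronze_stream` input are assumptions.

```python
from functools import reduce

from pyspark.sql import DataFrame, functions as F


def split_valid_and_faulty(df: DataFrame, mandatory_cols: list[str], type_casts: dict[str, str]):
    """Flag rows failing mandatory or type-cast checks, then split them (illustrative only)."""
    checks = [F.col(c).isNotNull() for c in mandatory_cols]
    # A non-null value whose cast comes back NULL fails the type-cast check.
    checks += [F.col(c).isNull() | F.col(c).cast(t).isNotNull() for c, t in type_casts.items()]
    is_valid = reduce(lambda a, b: a & b, checks, F.lit(True))
    flagged = df.withColumn("_is_valid", is_valid)
    return flagged.filter("_is_valid"), flagged.filter("NOT _is_valid")


# Usage, assuming `bronze_stream` is a streaming DataFrame read from the bronze layer;
# checkpoint locations and target paths below are placeholders.
valid_df, faulty_df = split_valid_and_faulty(
    bronze_stream,
    mandatory_cols=["payment_id", "amount"],
    type_casts={"amount": "decimal(18,2)"},
)
valid_df.writeStream.format("delta").option("checkpointLocation", "<chk/valid>").start("<silver/payments>")
faulty_df.writeStream.format("delta").option("checkpointLocation", "<chk/quarantine>").start("<silver/_quarantine/payments>")
```

Writing each branch to its own Delta destination keeps the failed records queryable for audit and remediation alongside the clean table.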

Responsibilities

  • Architected the bronze/silver/gold layout on cloud object storage and the declarative streaming transformation pipeline, where each silver table is materialised from a configuration entry declaring schema, constraints, paths, and source format.
  • Built the data-quality engine (per-column mandatory checks, type-cast verification, row-level faulty-record flagging) and the dual-stream writer pattern that sinks valid and faulty rows into separate Delta destinations for downstream reconciliation.
  • Implemented the CDC ingestion path: managed Kafka as the transport, log-based source connectors against the relational systems, an object-storage sink connector for landing, and a schema registry for evolution and serialization governance (an illustrative read over the landing area is sketched after this list).
  • Wrote the Terraform IaC covering resource group, redundant object storage with container layout, secret store with role-based access control and managed secrets for platform credentials, private virtual network with service endpoints, workspace, and orchestration layer.
  • Established the CI/CD pipeline, a local test harness with session-scoped Spark fixtures, and a dev path mirroring the production storage topology so engineers can iterate without hitting cloud resources (see the fixture sketch below).
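
Two hedged sketches of pieces referenced in the list above. First, an illustrative Auto Loader read over the object-storage area where the CDC sink connector lands files; the path, file format, and schema location are assumptions, and the real pipeline's payload handling was governed by the schema registry rather than shown here.

```python
# Assumes a Databricks `spark` session; all paths and options are placeholders.
landing = (
    spark.readStream.format("cloudFiles")  # Auto Loader
    .option("cloudFiles.format", "json")   # format of the files landed by the sink connector
    .option("cloudFiles.schemaLocation", "abfss://bronze@<storage-account>.dfs.core.windows.net/_schemas/payments/")
    .load("abfss://bronze@<storage-account>.dfs.core.windows.net/cdc/payments/")
)
```

Second, a minimal sketch of a session-scoped Spark fixture for the local test harness; the builder options are assumptions rather than the project's exact setup.

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # One local SparkSession shared across the whole test session, stopped afterwards.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("lakehouse-tests")
        .config("spark.sql.shuffle.partitions", "2")  # keep local shuffles small and fast
        .getOrCreate()
    )
    yield session
    session.stop()
```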

Technologies Used

Databricks, PySpark, Delta Lake, declarative streaming pipelines, Auto Loader, Kafka, log-based CDC connectors, Kubernetes, Terraform, Azure, centralized logging, Azure DevOps Pipelines

This project was delivered by

Dany D.


More Projects by Dany D.

2025-2026

Agentic Automation Platform for Document-Intensive Workflows

AI Architect & Tech Lead Data Engineer

The project involved architecting a greenfield agentic AI platform that automates the end-to-end processing of high-volume, document-heavy business cases for a regulated enterprise. A supervisor-style agent graph routes each case through a set of specialist agents that handle ingestion, enrichment, validation, coordination, and resolution, replacing manual review queues while keeping a human-in-the-loop checkpoint on high-stakes transitions. The agent layer sits on top of a cloud-native Databricks data platform with Unity Catalog governance, declarative streaming ingestion from an object-store landing zone, and a multi-region, multi-tenant infrastructure baseline.

LangGraph, LangChain, Python, Pydantic, PySpark +9
2023-2024

AI-Driven Retail Execution Platform

Lead Data & ML Engineer

The project involved delivering an enterprise data and AI platform for a multinational consumer-goods company to orchestrate daily sales-execution planning for its field teams across several major retail channels and international markets. The platform combines a medallion-architecture lakehouse on Databricks with a portfolio of production ML models that translate raw retailer feeds, inventory signals, compliance data, and third-party audits into a ranked set of outlet-level tasks delivered to reps each morning. The system operates as a multi-tenant codebase where each retailer channel is onboarded as a configurable tenant rather than a fork.

Databricks, PySpark, Delta Lake, Python, Scikit-learn +10
