Cloud Lakehouse with Change-Data-Capture Ingestion
Key Expertise
- Experience: 8+ years
- Timezone: CET (UTC +1)
- Skills: AI / ML, Languages, Databases, Infrastructure, Frameworks, Integrations & Protocols
Overview
The project involved designing and delivering a cloud-native data platform for a financial-services institution moving off a fragmented legacy ETL stack. The platform is built around a medallion lakehouse on Databricks, declarative streaming transformations for the silver layer, and log-based change-data-capture from operational relational sources via a managed Kafka service. A config-driven pipeline layer decouples table onboarding from code changes, and a data-quality engine splits each stream into a clean sink and a quarantine sink for audit and remediation.
Achievements
Replaced hand-written, per-table pipeline code with a declarative JSON configuration model, so onboarding a new dataset becomes a configuration exercise rather than an engineering project. Materially reduced downstream data-quality incidents through per-column mandatory and type-cast validation, with full lineage of failed records into a quarantine table. Introduced log-based CDC with exactly-once delivery semantics, eliminating the polling overhead and latency of the previous arrangement while preserving schema evolution through the schema registry.
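To make the configuration model concrete, a table's onboarding entry might look like the sketch below. Every field name, path, and type here is hypothetical, chosen only to illustrate the idea of declaring schema, constraints, paths, and source format in one place; the project's actual configuration schema is not shown in this document.

```json
{
  "table": "transactions",
  "source_format": "json",
  "source_path": "/mnt/bronze/transactions",
  "target_path": "/mnt/silver/transactions",
  "schema": {
    "customer_id": { "type": "bigint",    "mandatory": true  },
    "amount":      { "type": "double",    "mandatory": true  },
    "booked_at":   { "type": "timestamp", "mandatory": false }
  }
}
```

Adding a dataset then means adding one such entry rather than writing a new pipeline.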
Responsibilities
- Architected the bronze/silver/gold layout on cloud object storage and the declarative streaming transformation pipeline, where each silver table is materialised from a configuration entry declaring schema, constraints, paths, and source format.
- Built the data-quality engine (per-column mandatory checks, type-cast verification, row-level faulty-record flagging) and the dual-stream writer pattern that sinks valid and faulty rows into separate Delta destinations for downstream reconciliation.
- Implemented the CDC ingestion path: managed Kafka as the transport, log-based source connectors against the relational systems, an object-storage sink connector for landing, and a schema registry for evolution and serialization governance.
- Wrote the Terraform IaC covering resource group, redundant object storage with container layout, secret store with role-based access control and managed secrets for platform credentials, private virtual network with service endpoints, workspace, and orchestration layer.
- Established the CI/CD pipeline, a local test harness with session-scoped Spark fixtures, and a dev path mirroring the production storage topology so engineers can iterate without hitting cloud resources.
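The CDC ingestion path described in the third bullet is typically wired up through Kafka Connect. The fragment below is an illustrative log-based source connector configuration in the Debezium style, assuming a PostgreSQL source and Avro serialization against the schema registry; the connector class, hostnames, and table list are assumptions for illustration, not the project's actual configuration.

```json
{
  "name": "orders-cdc-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "orders-db.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.dbname": "orders",
    "table.include.list": "public.orders,public.order_lines",
    "topic.prefix": "cdc.orders",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```

Reading the database's transaction log rather than polling tables is what removes the query overhead and latency mentioned under Achievements, while the registry-backed Avro converters govern serialization and schema evolution.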
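The infrastructure bullet mentions a resource group, which suggests Azure; a minimal Terraform sketch of the storage portion might look as follows. Resource names, the region, and the replication setting are hypothetical, and the secret store, networking, and workspace resources listed in the bullet are omitted for brevity.

```hcl
resource "azurerm_resource_group" "platform" {
  name     = "rg-lakehouse-prod"   # hypothetical name
  location = "westeurope"
}

resource "azurerm_storage_account" "lake" {
  name                     = "stlakehouseprod"
  resource_group_name      = azurerm_resource_group.platform.name
  location                 = azurerm_resource_group.platform.location
  account_tier             = "Standard"
  account_replication_type = "GRS"  # geo-redundant, per the "redundant object storage" requirement
  is_hns_enabled           = true   # hierarchical namespace for data-lake workloads
}

# One container per medallion layer.
resource "azurerm_storage_container" "bronze" {
  name                 = "bronze"
  storage_account_name = azurerm_storage_account.lake.name
}
```

Expressing the container layout in code is what lets the dev environment mirror the production storage topology, as noted in the final bullet.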
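The dual-stream data-quality pattern from the second bullet can be sketched in plain Python. This is a deliberately simplified, non-Spark illustration of the logic (per-column mandatory checks, type-cast verification, and routing into clean versus quarantine sinks); the rule names and config shape are hypothetical, and the production engine applied the same checks as streaming transformations writing to separate Delta tables.

```python
# Simplified sketch of the dual-stream data-quality split.
# RULES stands in for the per-column section of a table's JSON config entry.
from typing import Any

RULES = {
    "customer_id": {"mandatory": True,  "cast": int},
    "amount":      {"mandatory": True,  "cast": float},
    "note":        {"mandatory": False, "cast": str},
}

def validate(row: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Apply mandatory and type-cast checks to one row.

    Returns the row with successful casts applied, plus a list of
    per-column error flags (empty when the row is clean)."""
    errors: list[str] = []
    out = dict(row)
    for col, rule in RULES.items():
        value = row.get(col)
        if value is None or value == "":
            if rule["mandatory"]:
                errors.append(f"{col}: missing mandatory value")
            continue
        try:
            out[col] = rule["cast"](value)
        except (TypeError, ValueError):
            errors.append(f"{col}: cannot cast {value!r} to {rule['cast'].__name__}")
    return out, errors

def split(rows: list[dict[str, Any]]) -> tuple[list[dict], list[dict]]:
    """Route each row to the clean sink or the quarantine sink."""
    clean: list[dict] = []
    quarantine: list[dict] = []
    for row in rows:
        cast_row, errors = validate(row)
        if errors:
            # Keep the original (uncast) values plus error flags for audit.
            quarantine.append({**row, "_dq_errors": errors})
        else:
            clean.append(cast_row)
    return clean, quarantine

clean, quarantine = split([
    {"customer_id": "42", "amount": "10.5", "note": "ok"},
    {"customer_id": "",   "amount": "oops"},
])
```

The key design choice mirrored here is that faulty rows are never dropped: they land in the quarantine stream with their original values and the reasons they failed, which is what makes downstream reconciliation and remediation possible.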
Technologies Used
This project was delivered by
Dany D.
More Projects by Dany D.
Agentic Automation Platform for Document-Intensive Workflows
AI Architect & Tech Lead Data Engineer
The project involved architecting a greenfield agentic AI platform that automates the end-to-end processing of high-volume, document-heavy business cases for a regulated enterprise. A supervisor-style agent graph routes each case through a set of specialist agents that handle ingestion, enrichment, validation, coordination, and resolution, replacing manual review queues while keeping a human-in-the-loop checkpoint on high-stakes transitions. The agent layer sits on top of a cloud-native Databricks data platform with Unity Catalog governance, declarative streaming ingestion from an object-store landing zone, and a multi-region, multi-tenant infrastructure baseline.
AI-Driven Retail Execution Platform
Lead Data & ML Engineer
The project involved delivering an enterprise data and AI platform for a multinational consumer-goods company to orchestrate daily sales-execution planning for its field teams across several major retail channels and international markets. The platform combines a medallion-architecture lakehouse on Databricks with a portfolio of production ML models that translate raw retailer feeds, inventory signals, compliance data, and third-party audits into a ranked set of outlet-level tasks delivered to reps each morning. The system operates as a multi-tenant codebase where each retailer channel is onboarded as a configurable tenant rather than a fork.