SoftBlues

Identity Verification Data Platform Modernization

Senior Big Data Engineer · 2025–2026

Yaroslav K

Big Data Engineer

Data Engineer & Big Data

Experience

8+ years

Timezone

CET (UTC +1)

Skills

AI / ML

SageMaker Unified Studio

Languages

Scala

Databases

HDFS, Pandas, Iceberg, Kudu, Oracle, Redshift, Athena

Infrastructure

AWS, Terraform, Jenkins CI, Azure DevOps, YARN, Azure Databricks

Frameworks

Airflow, Spark, Hadoop

Integrations & Protocols

EventBridge, Step Functions

Overview

The project involved modernizing a large-scale data processing platform used for identity validation, fraud detection, and analytical reporting. The system ingested data from external service providers and transformed it into reliable metrics for BI dashboards. A key part of the initiative was migrating the platform from Delta Lake to Apache Iceberg while preserving performance, stability, and cost efficiency. To reduce migration risk, a temporary dual-stack architecture was introduced, allowing Delta and Iceberg pipelines to run in parallel during the transition.
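The dual-stack idea described above can be sketched in a few lines: during the transition every batch is written to both the legacy Delta sink and the new Iceberg sink, with a cheap reconciliation check so divergence surfaces before cutover. This is a minimal, pure-Python illustration of the pattern, not the project's actual code; the class and field names are assumptions.

```python
# Sketch of a dual-stack write path: the legacy (Delta) sink stays
# authoritative while the new (Iceberg) sink runs in parallel, and
# row counts are compared after each batch to catch divergence early.
from typing import Callable, Dict, List

Row = Dict[str, object]

class DualStackWriter:
    def __init__(self, delta_write: Callable[[List[Row]], int],
                 iceberg_write: Callable[[List[Row]], int]) -> None:
        self.delta_write = delta_write
        self.iceberg_write = iceberg_write
        self.mismatches = 0  # batches where the two sinks disagreed

    def write(self, batch: List[Row]) -> int:
        delta_count = self.delta_write(batch)      # legacy path, source of truth
        iceberg_count = self.iceberg_write(batch)  # new path, validated in parallel
        if delta_count != iceberg_count:
            self.mismatches += 1                   # flag for investigation, don't fail
        return delta_count

# In-memory stand-ins for the two table formats.
delta_store: List[Row] = []
iceberg_store: List[Row] = []

writer = DualStackWriter(
    delta_write=lambda rows: delta_store.extend(rows) or len(delta_store),
    iceberg_write=lambda rows: iceberg_store.extend(rows) or len(iceberg_store),
)
writer.write([{"id": 1, "verified": True}, {"id": 2, "verified": False}])
```

Keeping the legacy path authoritative while merely counting mismatches is what makes the parallel run low-risk: the new stack can be validated against production traffic without ever serving it.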

Achievements

  • Led the migration from Delta Lake to Apache Iceberg while maintaining schema compatibility, partitioning consistency, and metadata reliability.
  • Improved effective cluster resource utilization by 60% while keeping the solution cost-efficient.
  • Tuned Spark execution and Iceberg configurations to achieve comparable or better performance than the legacy implementation.
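The tuning work typically touches settings like the ones below. The keys are standard Spark SQL options and Iceberg table properties; the values are placeholders for illustration, not the project's actual production numbers.

```python
# Illustrative Spark session settings and Iceberg table properties of the
# kind tuned during a Delta-to-Iceberg migration. Values are placeholders.
spark_conf = {
    "spark.sql.adaptive.enabled": "true",        # adaptive query execution
    "spark.sql.shuffle.partitions": "400",       # sized to cluster parallelism
    "spark.sql.catalog.prod": "org.apache.iceberg.spark.SparkCatalog",
}

iceberg_table_props = {
    "write.target-file-size-bytes": str(512 * 1024 * 1024),  # avoid small files
    "write.distribution-mode": "hash",           # cluster writes by partition
    "commit.retry.num-retries": "4",             # tolerate commit contention
}
```

File-size targets and write distribution mode are usually the first levers for matching a tuned Delta workload, since they control both scan efficiency and the small-file pressure on the metastore.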

Responsibilities

  • Designed and implemented PySpark-based pipelines for generating aggregated identity verification metrics for BI dashboards.
  • Built initial prototype pipelines using Athena and EventBridge before migrating the logic to production-grade Spark jobs.
  • Implemented a Scala-based dual-writer architecture to support parallel Delta Lake and Apache Iceberg writes during the migration phase.
  • Led the Delta Lake to Apache Iceberg migration, ensuring schema compatibility, partition strategy alignment, and metadata consistency.
  • Tuned Spark configurations and execution logic to improve Iceberg pipeline efficiency.
  • Integrated pipelines into Airflow DAGs and managed infrastructure changes using Terraform.
  • Worked across mixed Python and Scala codebases, including maintenance and extension of legacy Scala modules.
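The first responsibility above — aggregated identity verification metrics for BI dashboards — can be pared down to a simple group-and-rate computation. This pure-Python stand-in shows the shape of the output the PySpark jobs would produce; the field names (provider, verified) are illustrative assumptions.

```python
# Minimal stand-in for the metric-aggregation pipelines: group verification
# events by provider and compute check counts and pass rates for dashboards.
from collections import defaultdict
from typing import Dict, List

def verification_metrics(events: List[Dict]) -> Dict[str, Dict[str, float]]:
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for e in events:
        totals[e["provider"]] += 1
        if e["verified"]:
            passed[e["provider"]] += 1
    return {
        p: {"checks": totals[p], "pass_rate": passed[p] / totals[p]}
        for p in totals
    }

metrics = verification_metrics([
    {"provider": "acme_id", "verified": True},
    {"provider": "acme_id", "verified": False},
    {"provider": "veridoc", "verified": True},
])
```

In the production pipelines the same aggregation would be expressed as a PySpark `groupBy` over partitioned Iceberg tables, with the results landing in tables the BI dashboards query directly.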

Technologies Used

Python, Scala, AWS, EventBridge, SageMaker, S3, SageMaker Unified Studio, Airflow, Spark, Iceberg, Azure Databricks, Terraform, Azure DevOps

This project was delivered by

Yaroslav K

