SoftBlues

Spark Pipeline Migration from YARN to Kubernetes

Senior Big Data Engineer · 2024 – Present · Vitalii P.

Senior Big Data Engineer / Platform Engineer

Data Engineer & Big Data

Key Expertise

Big Data Engineering, Delta Lake Migration, Configuration-Driven Architecture, Cloud-Native Infrastructure, ETL Pipeline Optimization, Scalable Data Pipelines, Data Platform Architecting

Experience

12+ years

Timezone

CET (UTC +1)

Skills

AI / ML

AI-assisted Migration Tooling, Jupyter Notebooks

Languages

Python, Scala

Databases

HDFS, Apache Iceberg, Databricks, AWS S3, Delta Lake

Infrastructure

AWS CloudWatch, AWS Lambda, Jenkins CI/CD, Kubernetes, Docker, CI/CD, YARN

Frameworks

Configuration-Driven Architecture, Custom Logging & Tracing Frameworks, Apache Spark

Integrations & Protocols

AWS Kinesis, Concourse CI, Apache Kafka

Overview

Modernization of mission-critical content-moderation data infrastructure for one of the world’s largest technology companies, migrating legacy Spark-on-YARN pipelines to a cloud-native Spark-on-Kubernetes platform. The initiative enables elastic scaling, reduces operational overhead, and aligns the data stack with the broader enterprise shift toward containerized infrastructure across thousands of services.

Achievements

Migrated the full content-moderation pipeline portfolio (20+ production pipelines, each processing 1–2 TB per run) to Kubernetes while matching the performance of legacy YARN. Reduced the per-job compute footprint by ~6x, from 600 to 100 instances (3-core, 20 GB RAM each); eliminated ~30% of obsolete dependencies, reducing CVE exposure; and cut debugging time by ~40% through enhanced observability. Delivered ahead of compliance deadlines.
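As a rough sanity check on the footprint figures, the aggregate cores and memory per job can be worked out from the stated instance counts. This is only an illustrative sketch: the `footprint` helper is hypothetical, and it assumes the 3-core, 20 GB instance size applies both before and after the migration.

```python
def footprint(instances: int, cores_per_instance: int = 3, ram_gb_per_instance: int = 20) -> dict:
    """Aggregate compute footprint of one job run (assumed uniform instance size)."""
    return {
        "instances": instances,
        "cores": instances * cores_per_instance,
        "ram_gb": instances * ram_gb_per_instance,
    }


legacy = footprint(600)    # instance count quoted for legacy YARN
migrated = footprint(100)  # instance count quoted for Kubernetes

# Ratio of total cores before and after; memory scales by the same factor
# because the instance size is assumed unchanged.
reduction = legacy["cores"] / migrated["cores"]

print(legacy)
print(migrated)
print(f"reduction: {reduction:.0f}x")
```

Under that assumption, a single job drops from 1,800 cores and 12,000 GB of RAM to 300 cores and 2,000 GB.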

Responsibilities

  • Designed and tuned the containerized Spark resource model on Kubernetes — cluster sizing, executor configuration, partition strategy — driving the migration’s compute-efficiency gains while validating performance parity against legacy YARN through systematic benchmarking.
  • Owned the end-to-end migration of the content-moderation pipeline portfolio (Spark 2.x → 3.4), handling dependency upgrades, configuration tuning, and production cutover with zero data loss.
  • Conducted dependency-tree audits across the platform, eliminating ~30% of obsolete libraries and resolving CVEs to harden the security posture.
  • Built advanced logging and tracing instrumentation that surfaced root causes during complex builds and deployments, cutting debugging time by approximately 40%.
  • Partnered with Site Reliability Engineers to finalize production onboarding — ensuring deployment pipelines, health checks, and observability met operational SLAs.
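
The resource model described in the first responsibility (cluster sizing, executor configuration, partition strategy) can be sketched as a small configuration builder. This is a hypothetical illustration, not the project's actual setup: the image name, namespace, and `tasks_per_core` value are placeholders, and the heap sizing assumes Spark's default 10% executor memory overhead. The Spark property names themselves are standard.

```python
def k8s_spark_conf(
    instances: int = 100,
    cores: int = 3,
    ram_gb: int = 20,
    tasks_per_core: int = 2,  # assumption: Spark's tuning guide suggests 2-3 tasks per core
    image: str = "registry.example.com/spark:3.4",  # placeholder image
    namespace: str = "data-pipelines",              # placeholder namespace
) -> dict:
    """Build Spark-on-Kubernetes properties for a fixed-size executor pool."""
    overhead_factor = 0.10  # assumed default JVM memory overhead
    # Leave room for the overhead so heap + overhead fits in the pod's RAM budget.
    heap_gb = int(ram_gb / (1 + overhead_factor))
    return {
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.executor.instances": str(instances),
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{heap_gb}g",
        # Partition count tied to total task slots in the cluster.
        "spark.sql.shuffle.partitions": str(instances * cores * tasks_per_core),
    }


def to_submit_args(conf: dict) -> list:
    """Render the properties as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = k8s_spark_conf()
print(" ".join(to_submit_args(conf)))
```

Deriving the shuffle partition count from the executor pool keeps the two in step when the cluster is resized during benchmarking; the right multiplier depends on the data volume of each pipeline.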

Technologies Used

Apache Spark, Kubernetes, Scala, Python, Docker, YARN, Jenkins CI/CD, Custom Logging & Tracing Frameworks

This project was delivered by

Vitalii P.