Large-Scale Data Platform for AI-Driven Recruiting
Key Expertise
Experience
11+ years
Timezone
CET (UTC +1)
Skills
AI / ML
Languages
Databases
Infrastructure
Frameworks
Integrations & Protocols
Overview
WorkHQ is an AI-powered recruiting platform that helps companies source, contact, and manage talent at scale. The project centred on architecting and scaling a production-grade data platform that ingests, normalises, and serves nearly 1 billion candidate profiles sourced from multiple global data providers. The core engineering challenge was replacing an unstable, custom-built legacy stack (AWS S3 + Airflow) with a reliable, high-throughput Lakehouse architecture that supports real-time semantic search and AI-powered candidate matching across 7 global regions.
Achievements
Orchestrated a full architectural migration from Airflow + AWS S3 to Databricks and Delta Lake within 6 months, cutting pipeline execution time by roughly 80% (from 24+ hours to 5 hours) and moving from weekly to daily incremental processing. Expanded global profile coverage 2.3x to 700M+ records while maintaining cost efficiency. Increased Lightcast Occupation Taxonomy (LOT) enrichment coverage from 60% to 99% through custom ML model development, improving data richness 4x across all candidate work experiences.
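As an illustration of what daily incremental processing means in practice, the sketch below applies one day's batch of changed profiles to an existing table as an upsert. It is a minimal in-memory sketch, not the project's pipeline: the `profile_id` and `updated_at` field names and the `apply_daily_increment` function are assumptions made for this example, and a plain dictionary stands in for the Delta Lake table.

```python
from typing import Any, Dict, Iterable

Profile = Dict[str, Any]


def apply_daily_increment(
    current: Dict[str, Profile], batch: Iterable[Profile]
) -> Dict[str, int]:
    """Upsert a batch of changed profiles into `current`, keyed by profile_id.

    Hypothetical fields: `profile_id` (str) and `updated_at` (ISO 8601 date
    string, so plain string comparison orders the values chronologically).
    """
    stats = {"inserted": 0, "updated": 0, "skipped": 0}
    for row in batch:
        key = row["profile_id"]
        existing = current.get(key)
        if existing is None:
            # New profile: insert it.
            current[key] = row
            stats["inserted"] += 1
        elif row["updated_at"] > existing["updated_at"]:
            # Newer version of a known profile: overwrite it.
            current[key] = row
            stats["updated"] += 1
        else:
            # Stale or duplicate record: keep what is already stored.
            stats["skipped"] += 1
    return stats


table = {"a": {"profile_id": "a", "updated_at": "2024-01-01", "title": "Engineer"}}
daily_batch = [
    {"profile_id": "a", "updated_at": "2024-01-02", "title": "Senior Engineer"},
    {"profile_id": "b", "updated_at": "2024-01-02", "title": "Recruiter"},
    {"profile_id": "a", "updated_at": "2023-12-31", "title": "Intern"},
]
stats = apply_daily_increment(table, daily_batch)
```

Only the rows in the batch are touched, which is why an incremental run can be far cheaper than reprocessing every profile on each run.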
Responsibilities
- Led end-to-end technical strategy and architecture for the data platform, managing a team of two Data Engineers and driving key decisions across ingestion, transformation, and serving layers.
- Architected a Lakehouse-centric data delivery framework using Databricks and Delta Lake with Medallion Architecture, replacing legacy 2 TB S3-to-PostgreSQL pipelines.
- Designed a taxonomy mapping and normalisation system using vector embeddings and cosine similarity, with LLMs (OpenAI API) as an intelligent fallback for full alignment with global job-title standards.
- Implemented multi-region infrastructure to support daily ingestion across 7 global regions, scaling platform capacity 2.3x to 700M+ profiles.
- Directed the development of custom ML models for data extraction and LOT enrichment, boosting taxonomy coverage from 60% to 99%.
- Optimised synchronisation between OpenSearch and PostgreSQL to support high-throughput semantic search, with schema evolution managed through Alembic.
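The taxonomy mapping approach described above (embedding similarity first, an LLM as fallback) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the production system: the `map_title` function, the 0.85 threshold, the toy two-dimensional vectors, and the `fallback` callable are all hypothetical. In a real system the vectors would come from an embedding model and the fallback would wrap an LLM call; here the fallback is a stub so the example runs on its own.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def map_title(
    raw_title: str,
    title_vec: Vector,
    taxonomy: Dict[str, Vector],
    threshold: float = 0.85,  # assumed cut-off, chosen for illustration only
    fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Tuple[Optional[str], float, str]:
    """Return (label, best_similarity, source) for a raw job title."""
    best_label, best_score = None, -1.0
    for label, vec in taxonomy.items():
        score = cosine_similarity(title_vec, vec)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is not None and best_score >= threshold:
        return best_label, best_score, "embedding"
    if fallback is not None:
        # Low-confidence match: defer to the fallback (an LLM in the text above).
        return fallback(raw_title), best_score, "fallback"
    return None, best_score, "unmapped"


# Toy taxonomy with hand-made vectors standing in for real embeddings.
taxonomy = {"Software Developer": [1.0, 0.0], "Data Engineer": [0.0, 1.0]}

confident = map_title("Backend dev", [0.9, 0.1], taxonomy)
uncertain = map_title(
    "Chief Happiness Officer",
    [0.7, 0.7],
    taxonomy,
    fallback=lambda title: "HR Manager",  # stub standing in for an LLM call
)
```

Returning the source of each mapping alongside the label makes it possible to track how often the fallback is used, which is one way to decide whether the threshold needs tuning.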
Technologies Used
This project was delivered by
Ihor M.