Lyft Data Tech Stack

Explore the high-scale data stack Lyft uses to support 25M+ active riders, ingesting millions of real-time events every second.

Apr 15, 2026

Lyft operates one of the most data-intensive real-time platforms in the world, processing millions of rider–driver interactions every minute. Behind the scenes is a modern, high-scale data stack built on AWS, Kafka, Flink, Trino, and a 100+ PB warehouse on S3. In this deep dive, we’ll explore how these tools power Lyft’s data, analytics, machine learning, and real-time decision systems.

Metrics

28.7M active riders in Q3 2025, completing ~2.7M rides per day.
Apache Kafka processes millions of real-time events per second for streaming analytics.
Thousands of Airflow + Flyte pipelines orchestrate ETL and ML workflows.
Data warehouse exceeds 100+ PB stored in S3 with Hive Metastore.
Trino ETL executes ~250K queries/day, reading ~10 PB/day and writing ~100 TB/day.

Content is based on multiple sources including Lyft Blog, AWS Blog and other public articles etc. You will find references to dive deep as you read.

Platform

AWS

Lyft’s data and infrastructure are fully hosted on AWS, leveraging managed services and elastic scaling to support real-time transportation and logistics workloads. AWS powers everything from API services to large-scale data analytics, enabling Lyft to handle millions of rider-driver interactions efficiently.

📖Further Reading: Lyft Case Study

Messaging System

Kafka

Lyft adopted Kafka relatively recently to address scaling challenges, as explained in the video below. Kafka facilitates real-time data streaming, processing millions of events per minute, and supports various use cases such as trip updates, driver location tracking, and telemetry ingestion.

Kinesis

Alongside Kafka, Lyft also leverages AWS Kinesis which was introduced at Lyft long before Kafka. There is not enough public information about the plans if Kinesis will be replaced by Kafka.

Processing

Flink & Beam

Lyft adopted Apache Flink as the core real time and streaming engine to support wide variety of use cases. They leverage Apache Beam on top of the Flink runner due to portability and multi language capabilities.

Below is the video from the summit where you can learn more on how these tools are used at Lyft.

Spark

Lyft leverages Spark mostly on the Machine Learning side, e.g. LyftLearn, a ML platform which streamline Spark-on-Kubernetes for improved resource allocation, SQL adaptability, and integration with ML libraries. This seamlessly integrates with the orchestration tool Flyte.

📖 Recommended Reading: Spark At Lyft

Trino

Lyft uses Trino it to power large-scale ETL workloads that read 10 PB and write 100 TB of data daily from its Hive warehouse. Trino also serves as a fast, user-friendly query layer for teams, with over 90% of queries completing in under three minutes.

📖 Recommended Reading: Trino for large scale ETL at Lyft

Orchestrator

Airflow & Flyte

Lyft provides two orchestration platforms Airflow and Flyte, each having its pros and cons. Historically, Airflow has been widely used across Lyft for building data pipelines, while Lyft developed Flyte focusing on high intensive tasks like Machine Learning.

📖 Read more: Comparing Flyte and Airflow at Lyft

Warehouse

S3 & Hive

Lyft’s core data warehouse containing 100s of petabytes runs on top of Amazon S3, with Hive used as the metastore. While Lyft hasn’t publicly stated using modern open table formats like Iceberg or Delta, their S3 + Hive setup still powers large-scale historical storage.

Catalog

Amundsen

Developed in-house and open-sourced, Amundsen is Lyft’s metadata and data discovery platform. It provides a centralized catalog for datasets, dashboards, and pipelines, integrating with Hive Metastore, Trino, and Airflow to surface ownership, lineage, and documentation. Amundsen has since become a leading open metadata framework adopted across the industry.

📖 Learn more: Amundsen: Lyft’s Data Discovery Engine

Data Store

ClickHouse

Lyft faced key challenges like data deduplication using Apache Druid, which prompted their migration to ClickHouse. This switch not only resolved those issues but also delivered improved performance and operational efficiency for real-time analytics.

📖 Recommended Reading: Druid Deprecation and ClickHouse Adoption at Lyft

Dashboard

Superset

For visualization, Lyft primarily uses Apache Superset, which connects to Trino as its query engine, to deliver both analytics and operational dashboards.

As per this article from 2019, Lyft supported multiple dashboarding tools, there is no recent public information whether the tools are consolidated or still supported.

Related Content:

Spotify Data Tech Stack

Junaid Effendi

August 16, 2025

Read full story

Shopify Data Tech Stack

Junaid Effendi

November 8, 2025

Read full story

💬 Overall, Lyft blends AWS with cloud-native and open-source systems to support massive real-time data processing, analytics, and machine learning across its platform.

Spotify Data Tech Stack

Shopify Data Tech Stack

Discussion about this post

Ready for more?