Spotify Data Tech Stack
Learn how Spotify ingests 1.4T+ events daily on GCP via 38K+ data pipelines, leveraging BigQuery, Dataflow, and Flyte to power ~5K dashboards and scale data-driven insights.
Explore how Spotify processes over 1.4 trillion data points daily to power personalized experiences for hundreds of millions of users worldwide. This overview distills the essential tools, architectures, and innovations Spotify employs for data ingestion, processing, storage, and analytics.
Metrics
1.4+ trillion events processed daily.
670+ million monthly active users.
38,000+ data pipelines active in production.
Spotify runs the largest Hadoop cluster in Europe.
1,800+ distinct event types representing interactions from Spotify users.
~5K dashboards serving ~6K users.
This content is based on multiple public sources, including the Spotify engineering blog, open-source project sites, job descriptions, and other public articles. You will find references to dive deeper as you read.
Platform
Google Cloud Platform (GCP)
GCP is Spotify’s core cloud provider, supporting both compute and advanced analytics. Spotify migrated from AWS in the mid-2010s to leverage GCP's scalable infrastructure and its big data and machine learning services.
📖 Recommended Reading: Spotify Case Study
Messaging System
PubSub
Spotify moved from Kafka to Google Cloud Pub/Sub in 2016 to ingest its massive volume of event-driven data.

According to a recent article, their data platform supports automatic deployment of Pub/Sub resources, anonymization pipelines, and streaming jobs.
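
To make the ingestion side concrete, here is a minimal sketch of publishing an event to Cloud Pub/Sub from Scala using the Google Cloud Java client. The project name, topic, and event payload are hypothetical and for illustration only; this is not Spotify's actual event delivery code.

```scala
import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{PubsubMessage, TopicName}

object EventPublisher {
  def main(args: Array[String]): Unit = {
    // Hypothetical project/topic; Spotify's real event delivery is far more involved.
    val topic     = TopicName.of("my-gcp-project", "user-interaction-events")
    val publisher = Publisher.newBuilder(topic).build()
    try {
      // A toy "song played" event; real events follow one of 1,800+ schemas.
      val payload = """{"event_type":"song_played","track_id":"abc123"}"""
      val message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8(payload))
        .build()
      // publish() is asynchronous and returns a future resolving to the message ID.
      val messageId = publisher.publish(message).get()
      println(s"Published message $messageId")
    } finally {
      publisher.shutdown() // flush pending messages and release resources
    }
  }
}
```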
Processing
Apache Beam
Apache Beam, run primarily on GCP Dataflow, is Spotify's main processing framework for both real-time and batch workloads. Spotify maintains its own open-source Scala API for Beam, called Scio.
Scio is a high level Scala API for the Beam Java SDK created by Spotify to run both batch and streaming pipelines at scale. We run Scio mainly on the Google Cloud Dataflow runner, a fully managed service, and process data stored in various systems including most Google Cloud products, HDFS, Cassandra, Elasticsearch, PostgreSQL and more.
— source
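
As a flavor of what Scio pipelines look like, here is a minimal word-count sketch using Scio's Scala API. The input and output paths are placeholders; passing `--runner=DataflowRunner` (plus project and region options) would execute it on Dataflow instead of locally.

```scala
import com.spotify.scio._

object WordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    // ContextAndArgs parses pipeline options (e.g. --runner=DataflowRunner)
    // and returns a ScioContext plus the remaining key=value arguments.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))                         // e.g. gs://bucket/input.txt
      .flatMap(_.split("""\W+""").filter(_.nonEmpty))  // tokenize into words
      .countByValue                                    // (word, count) pairs
      .map { case (word, count) => s"$word\t$count" }
      .saveAsTextFile(args("output"))                  // e.g. gs://bucket/wordcount

    sc.run()                                           // launch the pipeline
  }
}
```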
Apache Flink
While most pipelines leverage Scio (Beam), the Data Platform also supports Apache Flink. There is little public information on exactly how Flink is used.
Orchestrator
Flyte
Spotify migrated from Luigi and Flo to Flyte starting in 2019 to address challenges like fragmented orchestration logic, limited visibility, and lack of extensibility. Flyte offered a centralized service with a thin SDK, better workflow visibility, caching, and multi-language support.
Today, Spotify uses Flyte to manage and introspect data workflows at scale (38k+ jobs), while execution remains on Kubernetes via their existing Styx scheduler.
📖 Recommended Reading: Why We Switched Our Data Orchestration Service
Warehouse
BigQuery
With the migration to GCP, Spotify also adopted BigQuery as its centralized warehouse. SQL-based workflows are managed through dbt, and the warehouse stores the analytical data served through dashboard tools such as Looker.
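
Since Scio ships with BigQuery connectors, warehouse tables can also be read directly inside pipelines. Below is a minimal sketch; the dataset, table, and column names (`events.playback`, `track_id`) are made up for illustration.

```scala
import com.spotify.scio._
import com.spotify.scio.bigquery._

object TopTracks {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // Hypothetical dataset, table, and columns, for illustration only.
    val query = Query(
      """SELECT track_id, COUNT(*) AS plays
        |FROM `my-gcp-project.events.playback`
        |GROUP BY track_id""".stripMargin)

    sc.bigQuerySelect(query)                   // SCollection[TableRow]
      .map(row => s"${row.get("track_id")}\t${row.get("plays")}")
      .saveAsTextFile("gs://my-bucket/top-tracks")

    sc.run()
  }
}
```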
Storage
HDFS / GCS
Spotify maintains the largest Hadoop cluster in Europe. With the ongoing migration to the cloud, data is served and stored on both on-premise HDFS and Google Cloud Storage.
There’s no public information confirming whether Spotify uses a lakehouse architecture.
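
One practical consequence of pairing these stores with Beam is that its filesystem abstraction resolves paths by URI scheme, so the same pipeline code can read from either store. A minimal sketch, assuming the Hadoop filesystem extension is on the classpath and using hypothetical paths:

```scala
import com.spotify.scio._

object DualStoreRead {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // Beam picks the filesystem from the URI scheme, so the same textFile
    // call works against GCS (gs://) or HDFS (hdfs://), given the right
    // filesystem registrars on the classpath. Paths below are hypothetical.
    val fromGcs  = sc.textFile("gs://my-bucket/events/part-*")
    val fromHdfs = sc.textFile("hdfs://namenode:8020/events/part-*")

    fromGcs.union(fromHdfs)
      .saveAsTextFile("gs://my-bucket/merged-events")

    sc.run()
  }
}
```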
Management
Spotify has in-house tooling for data management as part of its Data Platform, addressing problems in the following common areas:
Metadata
Lineage
Retention
Access Control
📖 Read more: Data Management & Data Processing
Dashboard
Looker / Tableau
Spotify provides both Looker and Tableau as dashboarding platforms. As of 2023, Spotify had 4,900+ dashboards serving 6,000+ users across the company.
Tableau is used for complex, highly customized dashboards, where full design flexibility is needed for deep-dive internal products with specific user needs.
Looker Studio is preferred for fast, lightweight dashboards, especially among engineering and product teams, thanks to its tight integration with BigQuery and its straightforward SQL-to-visualization workflows.

💬 Spotify’s cloud journey has been unique: starting on-premise, then leveraging AWS, and then migrating to GCP in 2016, while still operating some on-prem systems like Hadoop. Today, they rely heavily on GCP-native tools alongside in-house platforms that empower internal teams. I may have missed details; feel free to share in the comments!