How Airbnb built a Kubernetes sidecar to deliver dynamic configuration reliably at scale. By: Bo Teng, Cosmo Qiu, Siyuan Zhou, Ankur Soni, Xin Huang, Willis Harvey. Introduction: In our previous post, we explored Airbnb’s dynamic configuration system, Sitar, with a focus on service architecture and configuration change safety. Now for the harder question:…
#distributed-systems
11 posts
4 Jun
3 Jun
By Rajiv Shringi, Kaidan Fullerton, Oleksii Tkachuk, and Kartik Sathyanarayanan. Introduction: Netflix’s TimeSeries Abstraction is a scalable system for ingesting and querying petabytes of temporal event data with millisecond latency. We use Apache Cassandra 4.x as the underlying storage for these main reasons: Throughput, latency, and cost: Cassandra can handle millions of low-latency reads and writes in…
29 May
By Oleksii Tkachuk, Kartik Sathyanarayanan, Rajiv Shringi. Introduction: Netflix has a diverse range of graph use cases, each serving specific business needs with unique functionality and performance requirements. These use cases fall into two broad categories: OLAP: These use cases typically involve open-ended and algorithmic exploration of large graph datasets. They often utilize industry-standard models and languages…
By Parth Jain, Rakesh Sukumar, Yingwu Zhao, Renzo Sanchez & Nathan Fisher. How we built a living map of our distributed infrastructure to help engineers understand dependencies, troubleshoot faster, and keep Netflix running smoothly for our members around the world. The Puzzle with a Thousand Pieces. Picture this: It’s 3am, and an engineer gets paged. One of…
4 May
Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph
Netflix Technology Blog. Saish Sali, Nipun Kumar, Sura Elamurugu. Introduction: As Netflix has grown, machine learning continues to support our ability to deliver value to members and drive excellence across multiple areas of our business. When Netflix began investing in machine learning over a decade ago, it was primarily focused on a single domain: personalization. Scala was the industry standard, our…
1 May
By Nipun Kumar, Rajat Shah, Peter Chng. Introduction: This is the first blog post in a multi-part series that shares technical insights into how our ML model serving infrastructure powers several personalized experiences at scale across various domains (e.g., title recommendations, commerce). In this introductory blog post, we will dive into our domain-independent API abstraction and its traffic…
28 Apr
Expedia Group Technology - Engineering. A system that facilitates investigation of service degradations and outages using service telemetry data and AI. Recent advancements in artificial intelligence make us re-evaluate how work is done, from programming to designing systems to operating them in production. While there is considerable focus on automating…
14 Mar 2025
A brutally simple and effective implementation for long-running account move jobs at Zendesk. This article outlines some architectural changes we’ve been able to make to radically simplify the execution model of long-running jobs. By leveraging client behaviour, the resulting system improves overall functionality while removing the many complexities of distributed job execution. …
14 Apr 2023
The Jobteaser application contains a lot of different relatively independent modules to help universities provide career guidance to students: a job board, a career event management system, a career advice appointment management system… When we decided to migrate our application’s backend from a monolith to a service-oriented architecture, we strived to keep each module as isolated as possible from the…
27 Jul 2019
What a great book Designing Data-Intensive Applications is! It covers databases and distributed systems in clear language, great detail and without any fluff. I particularly like that the author Martin Kleppmann knows the theory very well, but also seems to …
18 Dec 2017
Or: Move That Loop into the Server Already! This article will illustrate the significance of something that I always thought to be common sense, but I keep seeing people getting this (very) wrong in their production systems. Chances are, in fact, that most applications out there suffer from this performance problem – and the fix …