How a screen-level performance metric reshaped platform decisions, engineering ownership, and release discipline. (Expedia Group Technology)
#observability
13 posts
3 Jun
29 May
By Parth Jain, Rakesh Sukumar, Yingwu Zhao, Renzo Sanchez & Nathan Fisher. How we built a living map of our distributed infrastructure to help engineers understand dependencies, troubleshoot faster, and keep Netflix running smoothly for our members around the world. The Puzzle with a Thousand Pieces Picture this: It’s 3am, and an engineer gets paged. One of…
15 May
Observability and human intuition in an AI world
Stack Overflow: In this two-for-one episode recorded at HumanX, Ryan is first joined by Christine Yen, CEO of Honeycomb, to discuss how AI compresses the software development lifecycle, making observability about capturing the right telemetry. Then, Spiros Xanthos, founder and CEO of Resolve AI, shares with us how AI coding increases code volume but decreases human intuition, making production operations harder than…
5 May
Designing monitoring that works when everything else doesn’t. By: Abdurrahman J. Allawala Introduction When an incident hits, teams lean on observability to answer the only questions that matter: what’s broken, and why? Monitoring systems are designed to help you answer these questions, and they usually do. But what happens when your observability stack is dependent on the same systems…
4 May
Over the last couple of months I've had performance issues with Cloudflare (CF) about twice, including today. That's a sentence I never thought I'd write, because Cloudflare genuinely doesn't have performance issues most of the time, and when they do, it's usually on the status page as part of a larger issue.
28 Apr
Expedia Group Technology, Engineering: A system that facilitates investigation of service degradations and outages using service telemetry data and AI. The recent advancements in the artificial intelligence space make us re-evaluate how work is done, from programming to designing systems to operating them in production. While there is considerable focus on automating…
16 Apr
Author: Fabian Höring Agentic systems powered by LLMs can be incredibly impressive in demos. With a few well-crafted prompts, they can demonstrate reasoning, calling tools, and solving complex tasks [1]. Demos are effective at showcasing what’s possible. Production environments, however, are where those capabilities are tested at scale and under real-world conditions. The same agent that performs perfectly on curated…
7 Apr
A production-tested approach for moving a large-scale metrics pipeline from StatsD to OpenTelemetry and Prometheus. By: Eugene Ma, Natasha Aleksandrova When migrating to a new monitoring system, you’ll want to frontload the work to collect all your metrics. This exposes bottlenecks at full write scale and unblocks the migration of assets which require real data for validation, such as…
31 Mar
The Problem: Legacy Tooling and Its Limitations Currently, Slack utilizes a hybrid approach to network measurement, incorporating both internal (such as traffic between AWS Availability Zones) and external (monitoring traffic from the public internet into Slack’s infrastructure) solutions. These tools comprise a combination of commercial SaaS offerings and custom-built network testing solutions developed by our…
26 Nov 2025
Introduction In an age where artificial intelligence (AI) and machine learning (ML) are integral to almost every aspect of our lives, ensuring the effectiveness, fairness, and reliability of ML models is paramount. Observability plays a crucial role in maintaining the performance of these models, allowing us to detect and resolve issues promptly. At Helpshift, we recognized the need for robust…
8 Sept 2025
Heartbeats: How Synthetic Traffic Keeps Us Running Let me take you on a journey of how we came to use heartbeats in our application design. It’s a happy story of love and no broken hearts along the way. What are heartbeats? What my teams have called heartbeats are a form of synthetic traffic generated by the application itself. The deployed…
26 Jul 2023
Customer-first: Moving from Hero Engineering to Reliability Engineering From the beginning, Slack has always had a strong focus on the customer experience, and customer love is one of our core values. Slack has grown from a small team to thousands of employees over the years and this customer love has always included a focus on…
7 Oct 2021
Slack is an integral part of where work happens for teams across the world, and our work in the Core Development Engineering department supports engineers throughout Slack that develop, build, test, and release high-quality services to Slack’s customers. In this article, we share how teams at Slack evolved our internal tooling and made infrastructure bets.…