#observability

17 posts

13 Jul

Netflix Technology Blog 13 Jul 2026 23 min read

Building Service Topology at Scale: Architecture, Challenges, and Lessons Learned

By Parth Jain, Rakesh Sukumar, Yingwu Zhao, Renzo Sanchez-Silva & Nathan Fisher A deep dive into the engineering challenges of building a real-time service dependency map at Netflix scale: from streaming architectures and distributed aggregation pipelines to time-travel queries and the methodology that made it work. Introduction In our first post, we introduced the problem: engineers…

backend-development distributed-systems data-engineering observability software-engineering

6 Jul

Sujitha Paduchuri, ManageEngine 6 Jul 2026 1 min read

Linux Server Health Checks: 10 Metrics Every Sysadmin Should Monitor

Hayden James

Servers give you warnings before they fail. Most sysadmins performing Linux server monitoring miss them because they're watching the wrong numbers. The metrics that actually matter are one level deeper: iowait instead of CPU percentage, active swap paging instead of memory usage, inode counts instead of just disk space.

blog guests observability performance server

18 Jun

Manav Mehta 18 Jun 2026 7 min read

Building a Centralized Alerting Framework for Data Quality Monitoring and Incident Management

Helpshift

Building a Centralized Alerting Framework for Data Quality using Snowflake Before We Knew Better As our data platform grew, so did the number of pipelines, scheduled tasks, and data quality checks running every day. While Snowflake provided a reliable platform for storing and processing data, operational monitoring was fragmented across multiple systems. Data quality failures were often discovered only after…

snowflake incident-management data-quality data-engineering observability

3 Jun

Divya Gupta Arora 3 Jun 2026 1 min read

The Most Expensive Milliseconds Are Unmeasured

Expedia

How a screen-level performance metric reshaped platform decisions, engineering ownership, and release discipline

time-to-interactive software-engineering observability performance devops

29 May

Netflix Technology Blog 29 May 2026 13 min read

From Silos to Service Topology: Why Netflix Built a Real-Time Service Map

Netflix Technology Blog

By Parth Jain, Rakesh Sukumar, Yingwu Zhao, Renzo Sanchez & Nathan Fisher How we built a living map of our distributed infrastructure to help engineers understand dependencies, troubleshoot faster, and keep Netflix running smoothly for our members around the world. The Puzzle with a Thousand Pieces Picture this: It’s 3am, and an engineer gets paged. One of…

distributed-systems software-engineering platform-engineering microservices observability

15 May

Phoebe Sajor 15 May 2026 1 min read

Observability and human intuition in an AI world

Stack Overflow

In this two-for-one episode recorded at HumanX, Ryan is first joined by Christine Yen, CEO of Honeycomb, to discuss how AI compresses the software development lifecycle, making observability about capturing the right telemetry. Then, Spiros Xanthos, founder and CEO of Resolve AI, shares with us how AI coding increases code volume but decreases human intuition, making production operations harder than…

podcast se-tech se-stackoverflow observability ai

5 May

Abdurrahman J. Allawala 5 May 2026 8 min read

Monitoring reliably at scale

Airbnb

Designing monitoring that works when everything else doesn’t. By: Abdurrahman J. Allawala Introduction When an incident hits, teams lean on observability to answer the only questions that matter: what’s broken, and why? Monitoring systems are designed to help you answer these questions, and they usually do. But what happens when your observability stack is dependent on the same systems…

engineering infrastructure technology observability site-reliability-engineer

4 May

Hayden James 4 May 2026 1 min read

Watching Cloudflare Data Center Locations in real-time

Hayden James

Over the last couple of months I've had performance issues with Cloudflare (CF) about 2 times, including today. That's a sentence I never thought I'd write, because Cloudflare genuinely doesn't have performance issues most of the time, and when they do, it's usually on the status page as part of a larger issue.

blog linux apm cloudflare observability

28 Apr

Nikos Katirtzis 28 Apr 2026 7 min read

Expedia’s Service Telemetry Analyzer

Expedia

A system that facilitates investigation of service degradations and outages using service telemetry data and AI. The recent advancements in the artificial intelligence space make us re-evaluate how work is done, from programming to designing systems or even operating them in production. While there is considerable focus on automating…

observability software-engineering distributed-systems generative-ai-tools site-reliability-engineer

16 Apr

Criteo Tech 16 Apr 2026 9 min read

Beyond the demo: Why agentic evaluation matters

Criteo

Author: Fabian Höring Agentic systems powered by LLMs can be incredibly impressive in demos. With a few well-crafted prompts, they can demonstrate reasoning, calling tools, and solving complex tasks [1]. Demos are effective at showcasing what’s possible. Production environments, however, are where those capabilities are tested at scale and under real-world conditions. The same agent that performs perfectly on curated…

langfuse agentic-ai ai llm observability

7 Apr

Eugene Ma 7 Apr 2026 9 min read

Building a high-volume metrics pipeline with OpenTelemetry and vmagent

Airbnb

A production-tested approach for moving a large-scale metrics pipeline from StatsD to OpenTelemetry and Prometheus. By: Eugene Ma, Natasha Aleksandrova When migrating to a new monitoring system, you’ll want to frontload the work to collect all your metrics. This exposes bottlenecks at full write scale and unblocks the migration of assets which require real data for validation, such as…

engineering technology infrastructure observability site-reliability-engineer

31 Mar

Carlo Preciado 31 Mar 2026 4 min read

From Custom to Open: Scalable Network Probing and HTTP/3 Readiness with Prometheus

Slack

The Problem: Legacy Tooling and Its Limitations Currently, Slack utilizes a hybrid approach to network measurement, incorporating both internal (such as traffic between AWS Availability Zones) and external (monitoring traffic from the public internet into Slack’s infrastructure) solutions. These tools comprise a combination of commercial SaaS offerings and custom-built network testing solutions developed by our…

uncategorized golang infrastructure networking observability

12 Mar

Rachel Revoy 12 Mar 2026 6 min read

From Fragmented Logs to Full-Stack Visibility with SolarWinds Papertrail

Heroku

Modern applications on Heroku don’t just consist of code. They are living ecosystems comprised of dynos, databases, third-party APIs, and complex user interactions. As these systems scale, so do the logs and metrics. To efficiently extract the signals from the noise you need to understand system health in the context of external factors, like resource […]

ecosystem engineering add-ons developer tools observability

26 Nov 2025

Sujit Singh 26 Nov 2025 7 min read

From Data to Insight: Helpshift’s Journey with ML Observability

Helpshift

Introduction In an age where artificial intelligence (AI) and machine learning (ML) are integral to almost every aspect of our lives, ensuring the effectiveness, fairness, and reliability of ML models is paramount. Observability plays a crucial role in maintaining the performance of these models, allowing us to detect and resolve issues promptly. At Helpshift, we recognized the need for robust…

analytics artificial-intelligence machine-learning observability

8 Sept 2025

Luke Stephenson 8 Sept 2025 4 min read

Heartbeats

Zendesk

Heartbeats: How Synthetic Traffic Keeps Us Running Let me take you on a journey of how we came to use heartbeats in our application design. It’s a happy story of love and no broken hearts along the way. What are heartbeats? What my teams have called heartbeats are a form of synthetic traffic generated by the application itself. The deployed…

observability monitoring kafka

26 Jul 2023

Ryan Katkov 26 Jul 2023 8 min read

Service Delivery Index: A Driver for Reliability

Slack

Customer-first: Moving from Hero Engineering to Reliability Engineering From the beginning, Slack has always had a strong focus on the customer experience, and customer love is one of our core values. Slack has grown from a small team to thousands of employees over the years and this customer love has always included a focus on…

uncategorized leadership observability

7 Oct 2021

Frank Chen 7 Oct 2021 10 min read

Infrastructure Observability for Changing the Spend Curve

Slack

Slack is an integral part of where work happens for teams across the world, and our work in the Core Development Engineering department supports engineers throughout Slack that develop, build, test, and release high-quality services to Slack’s customers. In this article, we share how teams at Slack evolved our internal tooling and made infrastructure bets.…

uncategorized developer-productivity infrastructure observability