~/devreads

#observability

13 posts

3 Jun

29 May

Netflix Technology Blog 13 min read

By Parth Jain, Rakesh Sukumar, Yingwu Zhao, Renzo Sanchez & Nathan Fisher. How we built a living map of our distributed infrastructure to help engineers understand dependencies, troubleshoot faster, and keep Netflix running smoothly for our members around the world. The Puzzle with a Thousand Pieces. Picture this: It’s 3am, and an engineer gets paged. One of…

distributed-systems software-engineering platform-engineering microservices observability

15 May

Phoebe Sajor 1 min read

Observability and human intuition in an AI world

Stack Overflow

In this two-for-one episode recorded at HumanX, Ryan is first joined by Christine Yen, CEO of Honeycomb, to discuss how AI compresses the software development lifecycle, making observability about capturing the right telemetry. Then, Spiros Xanthos, founder and CEO of Resolve AI, shares with us how AI coding increases code volume but decreases human intuition, making production operations harder than…

podcast se-tech se-stackoverflow observability ai

5 May

Abdurrahman J. Allawala 8 min read

Designing monitoring that works when everything else doesn’t. By: Abdurrahman J. Allawala. Introduction. When an incident hits, teams lean on observability to answer the only questions that matter: what’s broken, and why? Monitoring systems are designed to help you answer these questions, and they usually do. But what happens when your observability stack is dependent on the same systems…

engineering infrastructure technology observability site-reliability-engineer

4 May

Hayden James 1 min read

Over the last couple of months I've had performance issues with Cloudflare (CF) about 2 times, including today. That's a sentence I never thought I'd write, because Cloudflare genuinely doesn't have performance issues most of the time, and when they do, it's usually on the status page as part of a larger issue.

blog linux apm cloudflare observability

28 Apr

Nikos Katirtzis 7 min read

Expedia Group Technology - Engineering. A system that facilitates investigation of service degradations and outages using service telemetry data and AI. The recent advancements in the artificial intelligence space make us re-evaluate how work is done. From programming, to designing systems, or even operating them in production. While there is considerable focus on automating…

observability software-engineering distributed-systems generative-ai-tools site-reliability-engineer

16 Apr

Criteo Tech 9 min read

Author: Fabian Höring. Agentic systems powered by LLMs can be incredibly impressive in demos. With a few well-crafted prompts, they can demonstrate reasoning, calling tools, and solving complex tasks [1]. Demos are effective at showcasing what’s possible. Production environments, however, are where those capabilities are tested at scale and under real-world conditions. The same agent that performs perfectly on curated…

langfuse agentic-ai ai llm observability

7 Apr

Eugene Ma 9 min read

A production-tested approach for moving a large-scale metrics pipeline from StatsD to OpenTelemetry and Prometheus. By: Eugene Ma, Natasha Aleksandrova. When migrating to a new monitoring system, you’ll want to frontload the work to collect all your metrics. This exposes bottlenecks at full write scale and unblocks the migration of assets which require real data for validation, such as…

engineering technology infrastructure observability site-reliability-engineer

31 Mar

Carlo Preciado 4 min read

The Problem: Legacy Tooling and Its Limitations. Currently, Slack utilizes a hybrid approach to network measurement, incorporating both internal (such as traffic between AWS Availability Zones) and external (monitoring traffic from the public internet into Slack’s infrastructure) solutions. These tools comprise a combination of commercial SaaS offerings and custom-built network testing solutions developed by our…

uncategorized golang infrastructure networking observability

26 Nov 2025

Sujit Singh 7 min read

Introduction. In an age where artificial intelligence (AI) and machine learning (ML) are integral to almost every aspect of our lives, ensuring the effectiveness, fairness, and reliability of ML models is paramount. Observability plays a crucial role in maintaining the performance of these models, allowing us to detect and resolve issues promptly. At Helpshift, we recognized the need for robust…

analytics artificial-intelligence machine-learning observability

8 Sept 2025

Luke Stephenson 4 min read

Heartbeats: How Synthetic Traffic Keeps Us Running. Let me take you on a journey of how we came to use heartbeats in our application design. It’s a happy story of love and no broken hearts along the way. What are heartbeats? What my teams have called heartbeats are a form of synthetic traffic generated by the application itself. The deployed…

observability monitoring kafka

26 Jul 2023

Ryan Katkov 8 min read

Customer-first: Moving from Hero Engineering to Reliability Engineering. From the beginning, Slack has always had a strong focus on the customer experience, and customer love is one of our core values. Slack has grown from a small team to thousands of employees over the years and this customer love has always included a focus on…

uncategorized leadership observability

7 Oct 2021

Frank Chen 10 min read

Slack is an integral part of where work happens for teams across the world, and our work in the Core Development Engineering department supports engineers throughout Slack that develop, build, test, and release high-quality services to Slack’s customers. In this article, we share how teams at Slack evolved our internal tooling and made infrastructure bets.…

uncategorized developer-productivity infrastructure observability