During a reliability gameday, Datadog engineers discovered that their PostgreSQL clusters couldn’t safely fail over. Here’s how the team redesigned them for high availability using Patroni and synchronous replication.
Datadog
https://www.datadoghq.com/blog/engineering/ · 94 posts · history since 2016 · active
4 Jun
2 Jun
By combining stacked LLM evaluations with tool-driven investigation, we scaled malicious code detection from pull requests to dependency packages without sacrificing accuracy or cost control.
21 Apr
Learn how we embed widget metadata into screenshots using invisible, resilient watermarks, enabling self-describing visualizations at scale.
7 Apr
Find out how we built a scalable evaluation platform for Datadog’s Bits AI SRE agent that replays real incidents, detects regressions, and measures agent performance across production scenarios.
23 Mar
When a high-volume upsert doubled disk writes, Datadog engineers traced the issue to Postgres WAL behavior and rewrote the query to eliminate hidden costs.
9 Mar
Learn how Datadog detected and resolved issues from hackerbot-claw, an AI-powered automated attack campaign.
4 Mar
We share lessons learned building Datadog’s MCP server, from designing agent-friendly tools and managing context windows to using queries instead of raw data retrieval.
18 Feb
Get an inside look at shrinking large Go binaries in the Datadog Agent through dependency and linker analysis.
7 Jan
Learn field-tested lessons for eBPF-powered workload protection.
18 Nov 2025
Learn how Datadog scaled eBPF-powered file monitoring to handle more than 10 billion kernel events per minute while preserving full detection coverage.
4 Nov 2025
Discover how Datadog engineered a scalable Change Data Capture (CDC) platform to replicate data across systems in near real time—reducing search latency by 87%, increasing availability, and powering diverse, multi-tenant use cases across the company.
21 Oct 2025
Learn how Datadog’s SDLC Security team built an LLM-powered system to detect malicious pull requests at scale—without sacrificing developer velocity.
15 Oct 2025
After a major outage, we re-architected Datadog systems to degrade gracefully under failure. Here’s what we learned—and how we’re building forward.
1 Oct 2025
See how Husky enables interactive querying across 100 trillion events daily by combining caching, smart indexing, and query pruning.
18 Sept 2025
Learn how we turned hot-path optimizations into a system for continuous, AI-assisted performance improvements and saved thousands of cores in the process.
12 Aug 2025
We re-architected the real-time data pipeline for Datadog’s Processes and Containers views—cutting traffic by 100x and infrastructure use by 98%. This post explores the system challenges, architectural changes, and their impact.
4 Aug 2025
Discover how we reengineered our metrics storage engine for massive scale with Rust, a shard-per-core model, and real-time performance.
17 Jul 2025
Go 1.24’s Swiss Tables cut our map memory usage by up to 70% in high-traffic workloads. Here’s how we profiled the savings and improved performance.
We rolled out Go 1.24 and saw a memory regression. Here’s how we dug into system metrics, uncovered a bug in the runtime allocator, and worked with the Go team to help fix it.
24 Jun 2025
Learn how we implemented and open-sourced a noise filter for real-time audio chat without compromising performance. Better yet, try the demo and add it to your own project today.
23 Jun 2025
Learn how we built Datadog’s Log Forwarding system for low-latency, high-throughput delivery to thousands of unreliable third-party endpoints.
17 Jun 2025
Learn how Datadog engineered a highly reliable, low-latency system to distribute per-tenant configuration data across thousands of containers, enabling real-time log processing at scale.
Learn how Datadog is breaking up a shared production database at scale—defining clear ownership boundaries, minimizing migration risk, and building the tooling to make decoupling safe, automated, and sustainable.
3 Jun 2025
Learn how we developed Datadog Automatic Faulty Deployment Detection and improved precision, recall, and time to detection along the way.
9 Apr 2025
Learn how we rewrote the Datadog Lambda Extension in Rust, cutting cold starts by 82 percent, shrinking the binary 87 percent, and slashing memory use without sacrificing observability.
19 Feb 2025
Learn how we built the Streaming Platform at Datadog to provide more resilience and flexibility to our Kafka infrastructure.
29 Jan 2025
A deep dive into Husky’s underlying data storage and compaction system.
15 Jan 2025
Learn how an unassuming Postgres error led us to discover a bug in Postgres for Arm.
23 Dec 2024
Learn some tips and strategies to stay connected, visible, and effective as a remote worker in a predominantly office-based company.
20 Nov 2024
Learn how we used formal modeling and simulation to analyze a distributed, multi-tenant queueing system.
1 Nov 2024
Learn how we built a test impact analysis library in Ruby, the challenges we faced, the solutions we found, and what we discovered about tracing the Ruby VM.
23 Sept 2024
Learn about the challenges and solutions we discovered while using LLMs to automate writing postmortems.
28 Jun 2024
Learn how we implemented a new timeseries indexing strategy when the amount of data we ingested increased significantly.
23 May 2024
Learn how we enhanced our static analyzer by migrating from Java to Rust, tripling performance and reducing memory usage by 10x.
21 May 2024
In this interview, Ivo Dimitrov, Distributed Data Systems VP of Engineering, describes the engineering career that led him to Datadog and his commitment to helping build out our core backend platforms.
20 May 2024
How we handle memory usage in our .NET continuous profiler.
1 May 2024
Learn how we used DDSketch to enhance our heatmap visualizations, allowing us to represent and analyze high cardinality data distributions at scale.
18 Apr 2024
Learn how we built an internal graphing library to support data visualization in iOS using Swift and SwiftUI.
4 Apr 2024
How we handle exceptions and lock contention in our .NET continuous profiler.
26 Mar 2024
In this interview, Marie-Laure Bardonnet, Log Management Senior Engineering Manager, describes the journey of learning, growing, and scaling a team of 4 backend engineers to over 30 frontend and backend engineers at Datadog.
14 Feb 2024
Learn how Datadog’s Documentation team uses a linter to shift quality left.
13 Feb 2024
How we implemented CPU and wall time profiling in our .NET continuous profiler.
9 Jan 2024
Our .NET profiler was designed and implemented to run 24/7 in production, at any scale, with negligible impact. Here are the details of how we built it.
21 Dec 2023
In this video, Jean-Mathieu Saponaro, Data & Analytics Senior Engineering Manager, describes the journey of leading, growing, and scaling self-serve analytics within Datadog.
27 Oct 2023
Engineering spotlight: Jeromy Carriere
30 Jun 2023
How Datadog’s Frontend DevX team migrated a codebase from flaky, hard-to-maintain acceptance testing with Puppeteer to more robust Synthetic tests.
16 Jun 2023
This post walks through how we restored our platform after it was affected by the outage of March 8, 2023.
1 Jun 2023
This post sketches out our incident response process, where it succeeded and where it stumbled on March 8, and what we learned along the way.
26 May 2023
Learn how we tackled a case of high network latency in our usage estimation platform that required a multi-layered solution.
24 May 2023
A deep dive into what happened at the platform level during the outage of March 8, 2023.
17 Apr 2023
Learn how we developed a new scheduling algorithm for data fetching and rendering and how we built it for use across our suite of Datadog products.
22 Feb 2023
A closer look at storage routing in Husky, Datadog’s third-generation event storage system.
31 Jan 2023
We’ve recently improved the raw performance of the Datadog Agent, leading to 20% less CPU use on Agents flooded with custom metrics.
29 Sept 2022
Learn about Datadog’s repeatable design elements that we’ve documented in our design style guide called DRUIDS.
6 Jun 2022
Engineering spotlight: Tay Nishimura
17 May 2022
Husky is an unbundled, distributed, schemaless, vectorized column store. Here’s how we built it—and why.
12 May 2022
Employees at all modern software companies use a ton of outside pieces of software to do their jobs. Learn how Datadog’s IT team expanded Clarity to automate monitoring these accounts for inactivity and optimize how much we spend on them.
13 Apr 2022
It’s always DNS . . . except when it’s not: A deep dive through gRPC, Kubernetes, and AWS networking
The story of a seemingly simple issue that led us into the hidden complexities of gRPC, DNS, and Kubernetes.
25 Mar 2022
See Datadog’s proof of concept exploit for breaking out from unprivileged containers using the Dirty Pipe vulnerability.
28 Feb 2022
How several patches and fixes in Go 1.18 bring improved profiling accuracy.
22 Feb 2022
How the Datadog DesignOps team uses Datadog to understand our users and make well-informed design decisions
15 Oct 2021
Our story of contributing to kube-state-metrics, a popular open source Kubernetes service.
30 Sept 2021
We identified a performance issue caused by the `ForkJoinPool` in our Java application based on the Akka framework. This is how we solved our issue.
30 Apr 2021
Employees at all modern software companies use a ton of outside pieces of software to do their jobs. Learn how Datadog’s IT team built a tool to automate monitoring these accounts for security and compliance.
22 Feb 2021
Solving performance problems when moving an application to Kubernetes
10 Feb 2021
Engineering spotlight: Maël Nison, maintainer of Yarn
2 Feb 2021
What the observer API means for PHP 8 and the future of observability
2 Nov 2020
Glommio (pronounced glom-io or |glomjəʊ|) is a cooperative thread-per-core crate for Rust & Linux based on io_uring. It allows you to write asynchronous code that takes advantage of Rust async/await, but it doesn’t use helper threads anywhere.
7 Oct 2020
Adventures in developing a Python profiler
23 Sept 2020
How I used Datadog to become a better sailor.
23 Sept 2019
Introducing DDSketch, the first fully-mergeable, relative-error quantile sketching algorithm with formal guarantees.
3 Jun 2019
How to guarantee end-to-end security when using automation to package and publish Datadog Agent integrations
2 Apr 2019
A look at how Datadog builds and operates data pipelines reliably at scale.
22 Jan 2019
The introduction of advanced statistical methods is reshaping the UX of alerts
1 Nov 2018
Integrating Amazon Simple Email Service with Datadog to improve observability.
13 Aug 2018
Today, we’re open-sourcing Kafka-Kit, a toolset for scaling and recovering Kafka.
24 Apr 2018
Using Datadog to find performance bottlenecks, and contrasting tentative solutions using performance benchmarks.
16 Apr 2018
How the new Datadog Agent written in Go runs Python checks.
23 Mar 2018
How to be a better designer by being a better explainer.
20 Sept 2017
If you are part of the team managing the AWS infrastructure at your organization, you’ve likely had to wrestle with the complexity of managing multiple accounts for some time now.
6 Sept 2017
Designing powerful outlier and anomaly detection algorithms requires using the right tools. Discover how robust statistical distances can help.
23 Aug 2017
The Datadog Solutions Team reproduces problems that customers run into while they try using our many integrations in their own, always-unique environments.
15 Aug 2017
Highlights of our recent work to improve our cloud-based monitoring and alerting pipeline.
11 Jul 2017
A piecewise regression can model multiple trends in a single data set. Learn how Datadog automates piecewise regression on our timeseries data.
15 Jun 2017
At Datadog we see and gather metrics everywhere by using Datadog to monitor our applications and infrastructure. So our team thought it’d be fun to come up with creative solutions to “where can we display metrics?”
8 Jun 2017
Recently we extended the Datadog Agent to support extracting additional metrics from Kubernetes using kube-state-metrics via protobufs.
17 May 2017
Solutions Engineers at Datadog have to stay on top of what’s going on within the company and outside.
23 Jan 2017
When some of our customers reported that their agents were freezing, sometimes for hours at a time, we tracked down the issue to their disk mount options.
17 Jan 2017
It might surprise you to learn who built most of the prototype of the newest Datadog feature. Read more about Marie-Laure and her internship at Datadog.
14 Nov 2016
Today, we’re open-sourcing Redux-Doghouse, a library for Redux that helps you scope components so that they can be reused multiple times in multiple contexts without conflicting with one another.
27 Oct 2016
One of our colleagues, Christian, is participating in a tremendous 6-day-run challenge. Yes, you read that right, he will run around 850km (528 miles) over 6 days. As we like to graph everything, we thought it would be fun to cheer him on remotely and follow his progress in this crazy race via a Datadog dashboard.
23 Aug 2016
Do you ever walk to the bathroom across the office only to discover that it’s in use? Then you’ve got to decide if you want to awkwardly hover right outside, or hold it in for a while and try again later. This is obviously a first world problem, but bathroom contention was getting to be a challenge as we quickly…
11 Aug 2016
We’ve been using Consul for about 18 months at Datadog and it’s an important part of our production stack. In this post we will discuss some of the lessons we have learned.
11 Jul 2016
To commemorate the third annual GopherCon US in Denver this week, we’re releasing cgo bindings to two compression libraries that we’ve been using in production at Datadog for a while now: czlib and zstd.