Datadog

https://www.datadoghq.com/blog/engineering/ · 97 posts · history since 2016 · active

22 Jul

22 Jul 2026 1 min read

Unbiased Java CPU profiling with JFR in JDK 25

Modern Java profilers often rely on unsupported JVM internals for accurate CPU profiling. Here’s how engineers from Datadog, SAP, Amazon, and the OpenJDK community helped bring a new CPU profiling event to JDK 25.

1 Jul

1 Jul 2026 1 min read

How we measure data completeness at scale

Learn how Datadog helps ensure data completeness at scale, enabling accurate alerts and safer automated decisions across distributed pipelines.

23 Jun

23 Jun 2026 1 min read

How we migrated a live routing system using AI-assisted refactoring

Using AI-assisted refactoring, we migrated our live routing brain to a relational model, safely validating changes against live production traffic.

4 Jun

4 Jun 2026 1 min read

When failover isn’t safe: Building high-availability PostgreSQL on Kubernetes

During a reliability gameday, Datadog engineers discovered that their PostgreSQL clusters couldn’t safely fail over. Here’s how the team redesigned them for high availability using Patroni and synchronous replication.

2 Jun

2 Jun 2026 1 min read

From single pull requests to full software packages: Detecting malicious code at scale

By combining stacked LLM evaluations with tool-driven investigation, we scaled malicious code detection from pull requests to dependency packages without sacrificing accuracy or cost control.

21 Apr

21 Apr 2026 1 min read

Steganography at scale: Embedding share URLs in Datadog widget screenshots

Learn how we embed widget metadata into screenshots using invisible, resilient watermarks, enabling self-describing visualizations at scale.

7 Apr

7 Apr 2026 1 min read

How we built a real-world evaluation platform for autonomous SRE agents at scale

Find out how we built a scalable evaluation platform for Datadog’s Bits AI SRE agent that replays real incidents, detects regressions, and measures agent performance across production scenarios.

23 Mar

23 Mar 2026 1 min read

When upserts don’t update but still write: Debugging Postgres performance at scale

When a high-volume upsert doubled disk writes, Datadog engineers traced the issue to Postgres WAL behavior and rewrote the query to eliminate hidden costs.

9 Mar

9 Mar 2026 1 min read

When an AI agent came knocking: Catching malicious contributions in Datadog’s open source repos

Learn how Datadog detected and resolved issues from hackerbot-claw, an AI-powered automated attack campaign.

4 Mar

4 Mar 2026 1 min read

Designing MCP tools for agents: Lessons from building Datadog’s MCP server

We share lessons learned building Datadog’s MCP server, from designing agent-friendly tools and managing context windows to using queries instead of raw data retrieval.

18 Feb

18 Feb 2026 1 min read

How we reduced the size of our Agent Go binaries by up to 77%

Get an inside look at shrinking large Go binaries in the Datadog Agent through dependency and linker analysis.

7 Jan

7 Jan 2026 1 min read

Hardening eBPF for runtime security: Lessons from Datadog Workload Protection

Learn field-tested lessons for eBPF-powered workload protection.

18 Nov 2025

18 Nov 2025 1 min read

Scaling real-time file monitoring with eBPF: How we filtered billions of kernel events per minute

Learn how Datadog scaled eBPF-powered file monitoring to handle more than 10 billion kernel events per minute while preserving full detection coverage.

4 Nov 2025

4 Nov 2025 1 min read

Replication redefined: How we built a low-latency, multi-tenant data replication platform

Discover how Datadog engineered a scalable Change Data Capture (CDC) platform to replicate data across systems in near real time—reducing search latency by 87%, increasing availability, and powering diverse, multi-tenant use cases across the company.

21 Oct 2025

21 Oct 2025 1 min read

Detecting malicious pull requests at scale with LLMs

Learn how Datadog’s SDLC Security team built an LLM-powered system to detect malicious pull requests at scale—without sacrificing developer velocity.

15 Oct 2025

15 Oct 2025 1 min read

Failure is inevitable: Learning from a large outage, and building for reliability in depth at Datadog

After a major outage, we re-architected Datadog systems to degrade gracefully under failure. Here’s what we learned—and how we’re building forward.

1 Oct 2025

1 Oct 2025 1 min read

Inside Husky’s query engine: Real-time access to 100 trillion events

See how Husky enables interactive querying across 100 trillion events daily by combining caching, smart indexing, and query pruning.

18 Sept 2025

18 Sept 2025 1 min read

From hand-tuned Go to self-optimizing code: Building BitsEvolve

Learn how we turned hot-path optimizations into a system for continuous, AI-assisted performance improvements and saved thousands of cores in the process.

12 Aug 2025

12 Aug 2025 1 min read

Scaling down to speed up: How we improved efficiency of live process metrics by 100x

We re-architected the real-time data pipeline for Datadog’s Processes and Containers views—cutting traffic by 100x and infrastructure use by 98%. This post explores the system challenges, architectural changes, and their impact.

4 Aug 2025

4 Aug 2025 1 min read

Evolving our real-time timeseries storage again: Built in Rust for performance at scale

Discover how we reengineered our metrics storage engine for massive scale with Rust, a shard-per-core model, and real-time performance.

17 Jul 2025

17 Jul 2025 1 min read

How Go 1.24’s Swiss Tables saved us hundreds of gigabytes

Go 1.24’s Swiss Tables cut our map memory usage by up to 70% in high-traffic workloads. Here’s how we profiled the savings and improved performance.

17 Jul 2025 1 min read

How we tracked down a Go 1.24 memory regression across hundreds of pods

We rolled out Go 1.24 and saw a memory regression. Here’s how we dug into system metrics, uncovered a bug in the runtime allocator, and worked with the Go team to help fix it.

24 Jun 2025

24 Jun 2025 1 min read

How we built a real-time, client-side noise suppression library without server dependencies

Learn how we implemented and open-sourced a noise filter for real-time audio chat without compromising performance. Better yet, try the demo and add it to your own project today.

23 Jun 2025

23 Jun 2025 1 min read

How we built reliable log delivery to thousands of unpredictable endpoints

Learn how we built Datadog’s Log Forwarding system for low-latency, high-throughput delivery to thousands of unreliable third-party endpoints.

17 Jun 2025

17 Jun 2025 1 min read

How we scaled fast, reliable configuration distribution to thousands of workload containers

Learn how Datadog engineered a highly reliable, low-latency system to distribute per-tenant configuration data across thousands of containers, enabling real-time log processing at scale.

17 Jun 2025 1 min read

Breaking up a monolith: How we’re unwinding a shared database at scale

Learn how Datadog is breaking up a shared production database at scale—defining clear ownership boundaries, minimizing migration risk, and building the tooling to make decoupling safe, automated, and sustainable.

3 Jun 2025

3 Jun 2025 1 min read

Detecting faulty deployments: Our journey from unlabeled data to supervised learning

Learn how we developed Datadog Automatic Faulty Deployment Detection and improved precision, recall, and time to detection along the way.

9 Apr 2025

9 Apr 2025 1 min read

Squeezing every millisecond: How we rebuilt the Datadog Lambda Extension in Rust

Learn how we rewrote the Datadog Lambda Extension in Rust, cutting cold starts by 82 percent, shrinking the binary 87 percent, and slashing memory use without sacrificing observability.

19 Feb 2025

19 Feb 2025 1 min read

Achieving relentless Kafka reliability at scale with the Streaming Platform

Learn how we built the Streaming Platform at Datadog to provide more resilience and flexibility to our Kafka infrastructure.

29 Jan 2025

29 Jan 2025 1 min read

Husky: Efficient compaction at Datadog scale

A deep dive into Husky’s underlying data storage and compaction system.

15 Jan 2025

15 Jan 2025 1 min read

Unraveling a Postgres segfault that uncovered an Arm64 JIT compiler bug

Learn how an unassuming Postgres error led us to discover a bug in Postgres for Arm.

23 Dec 2024

23 Dec 2024 1 min read

Effective habits of remote workers

Learn some tips and strategies to stay connected, visible, and effective as a remote worker in a predominantly office-based company.

20 Nov 2024

20 Nov 2024 1 min read

How we use formal modeling, lightweight simulations, and chaos testing to design reliable distributed systems

Learn how we used formal modeling and simulation to analyze a distributed, multi-tenant queueing system.

1 Nov 2024

1 Nov 2024 1 min read

How we built a Ruby library that saves 50% in testing time

Learn how we built a test impact analysis library in Ruby, the challenges we faced, the solutions we found, and what we discovered about tracing the Ruby VM.

23 Sept 2024

23 Sept 2024 1 min read

How we optimized LLM use for cost, quality, and safety to facilitate writing postmortems

Learn about the challenges and solutions we discovered while using LLMs to automate writing postmortems.

28 Jun 2024

28 Jun 2024 1 min read

Timeseries indexing at scale

Learn how we implemented a new timeseries indexing strategy when the amount of data we ingested increased significantly.

23 May 2024

23 May 2024 1 min read

How we migrated our static analyzer from Java to Rust

Learn how we enhanced our static analyzer by migrating from Java to Rust, tripling performance and cutting memory usage by 10x.

21 May 2024

21 May 2024 1 min read

Engineering VP spotlight: Ivo Dimitrov

In this interview, Ivo Dimitrov, Distributed Data Systems VP of Engineering, describes the engineering career that led him to Datadog and his commitment to helping build out our core backend platforms.

20 May 2024

20 May 2024 1 min read

.NET Continuous Profiler: Memory usage

How we handle memory usage in our .NET continuous profiler.

1 May 2024

1 May 2024 1 min read

How we built the Datadog heatmap to visualize distributions over time at arbitrary scale

Learn how we used DDSketch to enhance our heatmap visualizations, allowing us to represent and analyze high cardinality data distributions at scale.

18 Apr 2024

18 Apr 2024 1 min read

How we brought Datadog’s data visualization to iOS: A focus on performance

Learn how we built an internal graphing library to support data visualization in iOS using Swift and SwiftUI.

4 Apr 2024

4 Apr 2024 1 min read

.NET Continuous Profiler: Exception and lock contention

How we handle exceptions and lock contention in our .NET continuous profiler.

26 Mar 2024

26 Mar 2024 1 min read

Engineering spotlight: Marie-Laure Bardonnet

In this interview, Marie-Laure Bardonnet, Log Management Senior Engineering Manager, describes the journey of learning, growing, and scaling a team of 4 backend engineers to over 30 frontend and backend engineers at Datadog.

14 Feb 2024

14 Feb 2024 1 min read

How we use Vale to improve our documentation editing process

Learn how Datadog’s Documentation team uses a linter to shift quality left.

13 Feb 2024

13 Feb 2024 1 min read

.NET Continuous Profiler: CPU and wall time profiling

How we implemented CPU and wall time profiling in our .NET continuous profiler.

9 Jan 2024

9 Jan 2024 1 min read

.NET Continuous Profiler: Under the hood

Our .NET profiler was designed and implemented to run 24/7 in production, at any scale, with negligible impact. Here are the details of how we built it.

21 Dec 2023

21 Dec 2023 1 min read

Scaling self-serve analytics: The tools empowering 5,000 employees

In this video, Jean-Mathieu Saponaro, Data & Analytics Senior Engineering Manager, describes the journey of leading, growing, and scaling self-serve analytics within Datadog.

27 Oct 2023

27 Oct 2023 1 min read

Engineering spotlight: Jeromy Carriere

30 Jun 2023

30 Jun 2023 1 min read

How we migrated our acceptance tests to use Synthetic Monitoring

How Datadog’s Frontend DevX team migrated a codebase from flaky, hard-to-maintain acceptance testing with Puppeteer to more robust Synthetic tests.

16 Jun 2023

16 Jun 2023 1 min read

2023-03-08 incident: A deep dive into the platform-level recovery

This post walks through how we restored our platform after it was affected by the outage of March 8, 2023.

1 Jun 2023

1 Jun 2023 1 min read

2023-03-08 incident: A deep dive into our incident response

This post sketches out our incident response process, where it succeeded and where it stumbled on March 8, and what we learned along the way.

26 May 2023

26 May 2023 1 min read

Not just another network latency issue: How we unraveled a series of hidden bottlenecks

Learn how we tackled a case of high network latency in our usage estimation platform that required a multi-layered solution.

24 May 2023

24 May 2023 1 min read

2023-03-08 incident: A deep dive into the platform-level impact

A deep dive into what happened at the platform level during the outage of March 8, 2023.

17 Apr 2023

17 Apr 2023 1 min read

Making fetch happen: Building a general-purpose query and render scheduler

Learn how we developed a new scheduling algorithm for data fetching and rendering and how we built it for use across our suite of Datadog products.

22 Feb 2023

22 Feb 2023 1 min read

Husky: Exactly-once ingestion and multi-tenancy at scale

A closer look at storage routing in Husky, Datadog’s third-generation event storage system.

31 Jan 2023

31 Jan 2023 1 min read

Performance improvements in the Datadog Agent metrics pipeline

We’ve recently improved the raw performance of the Datadog Agent, leading to 20% less CPU use on Agents flooded with custom metrics.

29 Sept 2022

29 Sept 2022 1 min read

DRUIDS, the design system that powers Datadog

Learn about Datadog’s repeatable design elements that we’ve documented in our design style guide called DRUIDS.

6 Jun 2022

6 Jun 2022 1 min read

Engineering Spotlight: Tay Nishimura


17 May 2022

17 May 2022 1 min read

Introducing Husky, Datadog’s third-generation event store

Husky is an unbundled, distributed, schemaless, vectorized column store. Here’s how we built it—and why.

12 May 2022

12 May 2022 1 min read

How Datadog’s IT team automated account inactivity and SaaS spend management

Employees at all modern software companies use a ton of outside pieces of software to do their jobs. Learn how Datadog’s IT team expanded Clarity to automate monitoring these accounts for inactivity and optimizing how much we spend on them.

13 Apr 2022

13 Apr 2022 1 min read

It’s always DNS . . . except when it’s not: A deep dive through gRPC, Kubernetes, and AWS networking

The story of a seemingly simple issue that led us into the hidden complexities of gRPC, DNS, and Kubernetes.

25 Mar 2022

25 Mar 2022 1 min read

Using the Dirty Pipe vulnerability to break out from containers

See Datadog’s proof of concept exploit for breaking out from unprivileged containers using the Dirty Pipe vulnerability.

28 Feb 2022

28 Feb 2022 1 min read

Profiling improvements in Go 1.18

How several patches and fixes in Go 1.18 bring improved profiling accuracy.

22 Feb 2022

22 Feb 2022 1 min read

How Datadog uses Datadog to gain visibility into the Datadog user experience

How the Datadog DesignOps team uses Datadog to understand our users and make well-informed design decisions

15 Oct 2021

15 Oct 2021 1 min read

Our journey taking Kubernetes state metrics to the next level

Our story of contributing to kube-state-metrics, a popular open source Kubernetes service.

30 Sept 2021

30 Sept 2021 1 min read

How we optimized our Akka application using Datadog’s Continuous Profiler

We identified a performance issue caused by the `ForkJoinPool` in our Java application based on the Akka framework. This is how we solved our issue.

30 Apr 2021

30 Apr 2021 1 min read

How Datadog’s IT team automated monitoring third-party accounts

Employees at all modern software companies use a ton of outside pieces of software to do their jobs. Learn how Datadog’s IT team built a tool to automate monitoring these accounts for security and compliance.

22 Feb 2021

22 Feb 2021 1 min read

How we minimized the overhead of Kubernetes in our job system

Solving performance problems when moving an application to Kubernetes

10 Feb 2021

10 Feb 2021 1 min read

Engineering spotlight: Maël Nison

Engineering spotlight: Maël Nison, maintainer of Yarn

2 Feb 2021

2 Feb 2021 1 min read

PHP 8: Observability baked right in

What the observer API means for PHP 8 and the future of observability

2 Nov 2020

2 Nov 2020 1 min read

Introducing Glommio, a thread-per-core crate for Rust and Linux

Glommio (pronounced glom-io or |glomjəʊ|) is a cooperative thread-per-core crate for Rust & Linux based on io_uring. It allows you to write asynchronous code that takes advantage of Rust async/await, but it doesn’t use helper threads anywhere.

7 Oct 2020

7 Oct 2020 1 min read

How we wrote a Python profiler

Adventures in developing a Python profiler

23 Sept 2020

23 Sept 2020 1 min read

The Old Datadog and the Sea

How I used Datadog to become a better sailor.

23 Sept 2019

23 Sept 2019 1 min read

Computing accurate percentiles with DDSketch

Introducing DDSketch, the first fully-mergeable, relative-error quantile sketching algorithm with formal guarantees.

3 Jun 2019

3 Jun 2019 1 min read

Secure publication of Datadog Agent integrations with TUF and in-toto

How to guarantee end-to-end security when using automation to package and publish Datadog Agent integrations

2 Apr 2019

2 Apr 2019 1 min read

Building highly reliable data pipelines at Datadog

A look at how Datadog builds and operates data pipelines reliably at scale.

22 Jan 2019

22 Jan 2019 1 min read

Rethinking UX for AI-driven alerting

The introduction of advanced statistical methods is reshaping the UX of alerts

1 Nov 2018

1 Nov 2018 1 min read

Improving trust with Datadog Log Management

Integrating Amazon Simple Email Service with Datadog to improve observability.

13 Aug 2018

13 Aug 2018 1 min read

Introducing Kafka-Kit: Tools for scaling Kafka

Today, we’re open-sourcing Kafka-Kit, a toolset for scaling and recovering Kafka.

24 Apr 2018

24 Apr 2018 1 min read

Using Datadog APM to improve the performance of Homebrew

Using Datadog to find performance bottlenecks, and contrasting tentative solutions using performance benchmarks.

16 Apr 2018

16 Apr 2018 1 min read

Cgo and Python

How the new Datadog Agent written in Go runs Python checks.

23 Mar 2018

23 Mar 2018 1 min read

What product designers can learn from explanatory journalism

How to be a better designer by being a better explainer.

20 Sept 2017

20 Sept 2017 1 min read

Secure (and usable) multi-AWS account IAM setup

If you are part of the team managing the AWS infrastructure at your organization, you’ve likely had to wrestle with the complexity of managing multiple accounts for some time now.

6 Sept 2017

6 Sept 2017 1 min read

Robust statistical distances for machine learning

Designing powerful outlier and anomaly detection algorithms requires using the right tools. Discover how robust statistical distances can help.

23 Aug 2017

23 Aug 2017 1 min read

Scaling support with Vagrant and Terraform

The Datadog Solutions Team reproduces problems that customers run into while they try using our many integrations in their own, always-unique environments.

15 Aug 2017

15 Aug 2017 1 min read

Improving cloud security visibility with ChatOps

Highlights of our recent work to improve our cloud-based monitoring and alerting pipeline.

11 Jul 2017

11 Jul 2017 1 min read

Piecewise regression: When one line simply isn’t enough

A piecewise regression can model multiple trends in a single data set. Learn how Datadog automates piecewise regression on our timeseries data.

15 Jun 2017

15 Jun 2017 1 min read

Hackathon project: Viewing Datadog metrics in Minecraft

At Datadog we see and gather metrics everywhere by using Datadog to monitor our applications and infrastructure. So our team thought it’d be fun to come up with creative solutions to “where can we display metrics?”

8 Jun 2017

8 Jun 2017 1 min read

Protobuf parsing in Python

Recently we extended the Datadog Agent to support extracting additional metrics from Kubernetes using kube-state-metrics via protobufs.

17 May 2017

17 May 2017 1 min read

Being a solutions engineer at Datadog

Solutions Engineers at Datadog have to stay on top of what’s going on within the company and outside.

23 Jan 2017

23 Jan 2017 1 min read

The trouble with mounting

When some of our customers reported that their agents were freezing, sometimes for hours at a time, we tracked down the issue to their disk mount options.

17 Jan 2017

17 Jan 2017 1 min read

Engineering spotlight: Marie-Laure Bardonnet

It might surprise you to learn who built most of the prototype of the newest Datadog feature. Read more about Marie-Laure and her internship at Datadog.

14 Nov 2016

14 Nov 2016 1 min read

Redux-Doghouse: Creating reusable React-Redux components through scoping

Today, we’re open-sourcing Redux-Doghouse, a library for Redux that helps you scope components so that they can be reused multiple times in multiple contexts without conflicting with one another.

27 Oct 2016

27 Oct 2016 1 min read

Cheering on coworkers: Building culture with Datadog dashboards

One of our colleagues, Christian, is participating in a tremendous 6-day-run challenge. Yes, you read that right, he will run around 850km (528 miles) over 6 days. As we like to graph everything, we thought it would be fun to cheer him on remotely and follow his progress in this crazy race via a Datadog dashboard.

23 Aug 2016

23 Aug 2016 1 min read

Restroom hacks

Do you ever walk to the bathroom across the office only to discover that it’s in use? Then you’ve got to decide if you want to awkwardly hover right outside, or hold it in for a while and try again later. This is obviously a first world problem, but bathroom contention was getting to be a challenge as we quickly…

11 Aug 2016

11 Aug 2016 1 min read

Consul at Datadog

We’ve been using Consul for about 18 months at Datadog and it’s an important part of our production stack. In this post we will discuss some of the lessons we have learned.

11 Jul 2016

11 Jul 2016 1 min read

Releasing czlib and zstd Go bindings

To commemorate the third annual GopherCon US in Denver this week, we’re releasing cgo bindings to two compression libraries that we’ve been using in production at Datadog for a while now: czlib and zstd.