Authors ( listed alphabetically ) Ads Feature Engineering Infra team: Ajay Venkatakrishnan, Le Zhang Core ML Infra team: Eric Shang, Pihui Wei ML Data team: Connor Votroubek, Yi He User Understanding team: Camilo Munoz, Simin Li If you work on ranking, retrieval, or recommendation systems, you’ve probably asked for some version of the same thing: “Give me the last N…
#data infrastructure
14 posts
21 May
12 May
Meta’s data ingestion system, which our engineering teams leverage for up-to-date snapshots of the social graph, has recently undergone a significant revamp to enhance its reliability at scale. Moving from our legacy system to our new architecture required a large-scale migration of our entire data ingestion system. We’re sharing the solutions and strategies that enabled [...] Read More... The post…
8 Apr
As AI increases developer speed and productivity it also increases the need for safeguards. On this episode of the Meta Tech Podcast, Pascal Hartig sits down with Ishwari and Joe from Meta’s Configurations team to discuss how Meta makes config rollouts safe at scale. Listen in to learn about canarying and progressive rollouts, the health checks [...] Read More... The…
2 Mar
Meta recognizes the long-term benefits of jemalloc, a high-performance memory allocator, in its software infrastructure. We are renewing focus on jemalloc, aiming to reduce maintenance needs and modernize the codebase while continuing to evolve the allocator to adapt to the latest hardware and workloads. We are committed to continuing to develop jemalloc development with the [...] Read More... The post…
19 Dec 2025
Incident investigation can be a daunting task in today’s digital landscape, where large-scale systems comprise numerous interconnected components and dependencies DrP is a root cause analysis (RCA) platform, designed by Meta, to programmatically automate the investigation process, significantly reducing the mean time to resolve (MTTR) for incidents and alleviating on-call toil Today, DrP is used [...] Read More... The post…
21 Nov 2025
Zoomer: Powering AI Performance at Meta’s Scale Through Intelligent Debugging and Optimization
FacebookWe’re introducing Zoomer, Meta’s comprehensive, automated debugging and optimization platform for AI. Zoomer works across all of our training and inference workloads at Meta and provides deep performance insights that enable energy savings, workflow acceleration, and efficiency gains in our AI infrastructure. Zoomer has delivered training time reductions, and significant QPS improvements, making it the [...] Read More... The post…
14 Oct 2025
At Open Compute Project Summit (OCP) 2025, we’re sharing details about the direction of next-generation network fabrics for our AI training clusters. We’ve expanded our network hardware portfolio and are contributing new disaggregated network platforms to OCP. We look forward to continued collaboration with OCP to open designs for racks, servers, storage boxes, and motherboards [...] Read More... The post…
29 Sept 2025
Over the past 21 years, Meta has grown exponentially from a small social network connecting a few thousand people in a handful of universities in the U.S. into several apps and novel hardware products that serve over 3.4 billion people throughout the world. Our infrastructure has evolved significantly over the years, growing from a [...] Read More... The post Meta’s…
13 Aug 2025
In this post, we explore the ways we’re evolving Meta’s data warehouse to facilitate productivity and security to serve both human users and AI agents. We detail how we’re developing agents that help users making data access requests to get to the data they need, and that help data owners process requests and maintain security. [...] Read More... The post…
22 Jul 2025
Hardware faults can have a significant impact on AI training and inference. Silent data corruptions (SDCs), undetected data errors caused by hardware, can be particularly harmful for AI systems that rely on accurate data for training as well as providing useful outputs. We are sharing methodologies we deploy at various scales for detecting SDC across [...] Read More... The post…
20 May 2025
As Meta has launched new, innovative products leveraging generative AI (GenAI), we need to make sure the underlying infrastructure components evolve along with it. Applying infrastructure knowledge and optimizations have allowed us to adapt to changing product requirements, delivering a better product along the way. Ultimately, our infrastructure systems need to balance our need to [...] Read More... The post…
8 May 2025
Meta and NVIDIA collaborated to accelerate vector search on GPUs by integrating NVIDIA cuVS into Faiss v1.10, Meta’s open source library for similarity search. This new implementation of cuVS will be more performant than classic GPU-accelerated search in some areas. For inverted file (IVF) indexing, NVIDIA cuVS outperforms classical GPU-accelerated IVF build times by up [...] Read More... The post…
31 Mar 2025
Mobile GraphQL is a framework used at Meta for fetching data in mobile applications using GraphQL, a strongly-typed, declarative query language. At Meta it handles data fetching for apps like Facebook and Instagram. Sabrina, a software engineer on Meta’s Mobile GraphQL Platform Team, joins Pascal Hartig on the Meta Tech podcast to discuss the evolution [...] Read More... The post…
26 Apr 2022
By Laura Nolan, with contributions from Glen D. Sanford, Jamie Scheinblum, and Chris Sullivan. Assessing conditions Slack experienced a major incident on February 22 this year, during which time many users were unable to connect to Slack, including the author — which certainly made my role as Incident Commander more challenging! This incident was a…