~/devreads

#data center engineering

14 posts

3 Jun

6 min read

We’re introducing Instantaneous PowerLoss Storm, a new testing paradigm within Meta’s infrastructure for handling and mitigating instant or zero-notice power loss in our data centers. We’re sharing: how we built readiness to tolerate instant failures into our existing systems with defense-in-depth strategies; tradeoffs made in implementing it, and how we validated our readiness. Disaster preparedness [...] Read More... The post…

core infradata center engineeringproduction engineering

30 Mar

7 min read

Meta is continuing its long-term roadmap to help the construction industry leverage AI to produce high-quality and more sustainable concrete mixes, as well as those exclusively produced in the United States. Concurrent with the 2026 American Concrete Institute (ACI) Spring Convention, Meta is releasing a new AI model for designing concrete mixes – Bayesian Optimization [...] Read More... The post…

data center engineeringml applicationsopen source

24 Feb

5 min read

We are open-sourcing the initial version of RCCLX – an enhanced version of RCCL that we developed and tested on Meta’s internal workloads. RCCLX is fully integrated with Torchcomms and aims to empower researchers and developers to accelerate innovation, regardless of their chosen backend. Communication patterns for AI models are constantly evolving, as are hardware [...] Read More... The post…

ai researchdata center engineeringml applicationsnetworking traffic

9 Feb

4 min read

We’re sharing details of the role backend aggregation (BAG) plays in building Meta’s gigawatt-scale AI clusters like Prometheus. BAG allows us to seamlessly connect thousands of GPUs across multiple data centers and regions. Our BAG implementation is connecting two different network fabrics – Disaggregated Schedule Fabric (DSF) and Non-Scheduled Fabric (NSF). Once it’s complete our AI [...] Read More... The…

data center engineering

21 Nov 2025

9 min read

We’re introducing Zoomer, Meta’s comprehensive, automated debugging and optimization platform for AI. Zoomer works across all of our training and inference workloads at Meta and provides deep performance insights that enable energy savings, workflow acceleration, and efficiency gains in our AI infrastructure. Zoomer has delivered training time reductions, and significant QPS improvements, making it the [...] Read More... The post…

data center engineeringdata infrastructureml applications

14 Nov 2025

1 min read

Most people have heard of open-source software. But have you heard about open hardware? And did you know open source can have a positive impact on the environment? On this episode of the Meta Tech Podcast, Pascal Hartig sits down with Dharmesh and Lisa to talk about all things open hardware, and Meta’s biggest announcements [...] Read More... The post…

data center engineeringml applicationsopen sourceproduction engineeringmeta tech podcast

20 Oct 2025

12 min read

Disaggregated Schedule Fabric (DSF) is Meta’s next-generation network fabric technology for AI training networks that addresses the challenges of existing Clos-based networks. We’re sharing the challenges and innovations surrounding DSF and discussing future directions, including the creation of mega clusters through DSF and non-DSF region interconnectivity, as well as the exploration of alternative switching technologies. [...] Read More... The post…

data center engineeringnetworking traffic

16 Oct 2025

9 min read

We’re sharing details on our journey to scale Meta’s Backbone network to support the increasing demands of new and existing AI workloads. We’ve developed new technologies and designs to address our 10x scaling needs and applying some of these same principles to help extend our AI clusters between multiple data centers. Meta’s Backbone network is [...] Read More... The post…

data center engineering

14 Oct 2025

11 min read

We’re presenting Design for Sustainability, a set of technical design principles for new designs of IT hardware to reduce emissions and cost through reuse, extending useful life, and optimizing design. At Meta, we’ve been able to significantly reduce the carbon footprint of our data centers by integrating several design strategies such as modularity, reuse, retrofitting, [...] Read More... The post…

data center engineering

10 min read

As we focus on our goal of achieving net zero emissions in 2030, we also aim to create a common taxonomy for the entire industry to measure carbon emissions. We’re sharing details on a new methodology we presented at the 2025 OCP regional EMEA summit that leverages AI to improve our understanding of our IT [...] Read More... The post…

data center engineeringml applications

6 min read

At Open Compute Project Summit (OCP) 2025, we’re sharing details about the direction of next-generation network fabrics for our AI training clusters. We’ve expanded our network hardware portfolio and are contributing new disaggregated network platforms to OCP. We look forward to continued collaboration with OCP to open designs for racks, servers, storage boxes, and motherboards [...] Read More... The post…

data center engineeringdata infrastructuredevinframl applicationsnetworking traffic

29 Sept 2025

17 min read

Over the past 21 years, Meta has grown exponentially from a small social network connecting a few thousand people in a handful of universities in the U.S. into several apps and novel hardware products that serve over 3.4 billion people throughout the world. Our infrastructure has evolved significantly over the years, growing from a [...] Read More... The post Meta’s…

ai researchdata center engineeringdata infrastructuredevinframl applications

16 Jul 2025

9 min read

Meta has developed an open-source AI tool to design concrete mixes that are stronger, more sustainable, and ready to build with faster—speeding up construction while reducing environmental impact. The AI tool leverages Bayesian optimization, powered by Meta’s BoTorch and Ax frameworks, and was developed with Amrize and the University of Illinois Urbana-Champaign (U of I) [...] Read More... The post…

ai researchdata center engineeringopen source

4 Mar 2025

5 min read

The growth of data and need for increased power efficiency are leading to innovative storage solutions. HDDs have been growing in density, but not performance, and TLC flash remains at a price point that is restrictive for scaling. QLC technology addresses these challenges by forming a middle tier between HDDs and TLC SSDs. QLC [...] Read More... The post A…

data center engineering