#incident-management

4 posts

Manav Mehta 18 Jun 2026 7 min read

Building a Centralized Alerting Framework for Data Quality Monitoring and Incident Management

Building a Centralized Alerting Framework for Data Quality using Snowflake. Before We Knew Better: As our data platform grew, so did the number of pipelines, scheduled tasks, and data quality checks running every day. While Snowflake provided a reliable platform for storing and processing data, operational monitoring was fragmented across multiple systems. Data quality failures were often discovered only after…

snowflake incident-management data-quality data-engineering observability

Scott Nelson Windels 14 Nov 2024 11 min read

There’s No Such Thing as a Free Lunch!

Slack

Incident Management takes time: Incidents need responders who are trained and experienced. At Slack, training is a foundation of our incident management program. Self-service training and live courses based mainly on prepared content are one piece of the puzzle, but there can be a missing piece in many organizations. How can staff get practical experience…

uncategorized incident-management incident-response

Frank Chen 19 Aug 2022 15 min read

Slowing Down to Speed Up – Circuit Breakers for Slack’s CI/CD

Slack

What happens when your distributed service has challenges with stampeding herds of internal requests? How do you prevent cascading failures between internal services? How might you re-architect your workflows when naive horizontal or vertical scaling reach their respective limits? These were the challenges facing Slack engineers during their day-to-day development workflows in 2020. Multiple internal…

uncategorized ci-cd developer-productivity incident-management infrastructure

Carlos Valdez 18 Feb 2022 12 min read

Balancing Safety and Velocity in CI/CD at Slack

Slack

In 2021, we changed developer testing workflows for Webapp, Slack’s main monorepo, from predominantly testing before merging to a multi-tiered testing workflow after merging. This changed our previous definition of safety and developer workflows between testing and deploys. In this project, we aimed to ensure frequent, reliable, and high-quality releases to our customers for a…

uncategorized automation-testing ci-cd deployment incident-management