In early 2023, Slack faced a foundational challenge: serving Large Language Models (LLMs) at enterprise scale with the security, reliability, and performance our customers expect. Over three years, we evolved from basic infrastructure to orchestrating a sophisticated multi-cloud architecture. We didn’t just want shiny new models; we needed a system resilient to regional outages and…
#cloud-computing
9 posts
28 May
15 May
No Dumb Questions: What is cloud computing and why is everyone doing it?
Stack Overflow
In this No Dumb Questions, Phoebe is joined by Stack Overflow’s tech lead for the infrastructure team, Josh Zhang, to learn about the cloud, compute, and data centers. …
21 Feb 2024
Leveraging Spark 3 and NVIDIA’s GPUs to Reduce Cloud Cost by up to 70% for Big Data Pipelines
PayPal
By Ilay Chen and Tomer Akirav. At PayPal, hundreds of thousands of Apache Spark jobs run on an hourly basis, processing petabytes of data and requiring a high volume of resources. To handle the growth of machine learning solutions, PayPal requires scalable environments, cost awareness and constant innovation. This blog explains how Apache Spark 3 and GPUs can help enterprises…
12 Dec 2023
We are heavy users of Amazon Elastic Compute Cloud (EC2) at Slack: we run approximately 60,000 EC2 instances across 17 AWS regions while operating hundreds of AWS accounts. A multitude of teams own and manage our various instances. The Instance Metadata Service (IMDS) is an on-instance component that can be used to gain an…
21 Mar 2023
This blog post discusses the strategies that Slack uses to manage the lifecycle (development, support, and eventual retirement) of infrastructure projects, through the lens of the migration through three successive internal “platform” offerings. Our challenges: Circa 2020, our Cloud Engineering team (now evolved into multiple teams responsible for narrower aspects) was responsible for managing our…
24 Mar 2022
Using Machine Learning to Understand How Branding in Photos Affects the Car Shopping Experience
TrueCar
By Samad Patel. This blog post delves into how we answered a challenging business question using pre-trained AWS Models. Our question required us to parse text from photos, then analyze the contents of that text. We used AWS Rekognition and Comprehend to extract and classify text from photos, followed by a few highly interpretable statistical methods to analyze the data.…
9 Mar 2022
According to a recent Thoughtworks radar, “the industry is increasingly gaining experience with platform engineering product teams that create and support internal platforms.” They caveated this with a piece of advice: “When creating a platform, it’s critical to have clearly defined customers and products that will benefit from it rather than building in a vacuum.”…
20 Oct 2021
About a year ago, I wrote a blog post called Building the Next Evolution of Cloud Networks at Slack. In it, we discussed how Slack’s AWS infrastructure has evolved over the years and the pain points that drove us to spin up a brand-new network architecture redesign project called Whitecastle. If you have not had…
17 Feb 2019
When we started Discourse in 2013, our server requirements were high: 1GB RAM; modern, fast dual core CPU; speedy solid state drive with 20+ GB. I’m not talking about a cheapo shared cpanel server, either; I mean a dedicated virtual private server with those specifications. We