~/devreads

#big-data

18 posts

29 May

Ratul Dawar 5 min read

A silent deadlock in our query engine was stalling inventory replenishment jobs with no error, no crash — just infinite waiting. This is the story of how we found it, traced it to an open-source bug, and fixed it upstream. TL;DR Trino’s Hudi connector used a single thread pool for both producing file splits and signalling when there was room…

blinkit, trinos, apache-hudi, big-data, open-source

5 May

Mahendran Vasagam 13 min read

By 2024, Slack’s data platform had accumulated 700+ SSH-based operators orchestrating critical data pipelines. We’re talking daily search indexing that processed terabytes of data, analytics jobs powering business intelligence, the whole shebang. Every single one of these jobs required direct SSH access to production AWS Elastic MapReduce (EMR) clusters. We had a massive security…

uncategorized, airflow, aws, big-data, data-engineering

6 Oct 2025

Poorva Patil 9 min read

As a data engineer, I used to see metrics as just numbers on a dashboard — until I realized they’re the lens through which customers view and run their operations. In customer support, for example, agent productivity metrics aren’t just figures, they’re actionable insights that drive efficiency, shape staffing decisions, and directly impact customer satisfaction. These aren’t just charts —…

apache-spark, analytics, big-data, data-analysis

2 Jul 2024

Nilanjana Mukherjee 9 min read

Slack Data Engineering recently underwent data workload migration from AWS EMR 5 (Spark 2/Hive 2 processing engine) to EMR 6 (Spark 3 processing engine). In this blog, we will share our migration journey, challenges, and the performance gains we observed in the process. This blog aims to assist Data Engineers, Data Infrastructure Engineers, and Product…

uncategorized, analytics, aws, big-data, data-engineering

21 Feb 2024

Ilay Chen 12 min read

By Ilay Chen and Tomer Akirav At PayPal, hundreds of thousands of Apache Spark jobs run on an hourly basis, processing petabytes of data and requiring a high volume of resources. To handle the growth of machine learning solutions, PayPal requires scalable environments, cost awareness and constant innovation. This blog explains how Apache Spark 3 and GPUs can help enterprises…

cloud-computing, gpu, big-data, machine-learning, apache-spark

31 Aug 2023

Edgar Trujillo 4 min read

On the racetrack of building ML applications, traditional software development steps are often overtaken. Welcome to the world of MLOps, where unique challenges meet innovative solutions and consistency is king. At Bazaarvoice, training pipelines serve as the backbone of our MLOps strategy. They underpin the reproducibility of our model builds. A glaring gap existed, however, […]

artificial intelligence, big data, devops, open source, software architecture

17 Aug 2021

Samuel Bock 8 min read

Reinventing how the world does work inevitably creates a lot of data. Each year, Slack’s scale has increased and the volume of data ingested and stored has kept pace. To make it possible to understand relationships within our data, we’ve invested heavily in an automated data lineage framework. This facilitates producer/consumer coordination, improves risk mitigation,…

uncategorized, big-data, data-engineering

16 Nov 2020

17 Jul 2020

Joe Minichino 10 min read

You need a Data Lake. The Context Teamwork has been around for more than 10 years, starting out as a project management and work collaboration platform and later expanding into other areas, such as help-desk, chat, document management and CRM software. As the company has grown and evolved, data has grown, changed, expanded, diversified, fragmented, then changed again. Analytics in…

aws, data-lake, big-data

7 Aug 2019

Parth Shah 10 min read

Parth Shah and Thai Bui Overview One of the reasons why Hadoop jobs are hard to operate is their inability to provide clear, actionable error diagnostic messages for users. This stems from the fact that Hadoop consists of many interrelated components. When a component fails or behaves poorly, the failure will be cascaded to its […]

big data, internships, testing, chaos tests, hadoop

2 Jan 2018

Edwin Wise 5 min read

Recently, during a holiday lull, I decided to look at another way of modeling event stream data (for the purposes of anomaly detection). I’ve dabbled with (simplistic) event stream models before but this time I decided to take a deeper look at Twitter’s anomaly detection algorithm [1], which in turn is based (more or less) […]

big data, bigdata, data modeling, statistics

21 Jun 2016

Gary Allison 8 min read

Divide and Conquer As Engineers, we often like nice clean solutions that don’t carry along what we like to call technical debt. Technical debt literally is stuff that we have to go back to fix/rewrite later or that requires significant ongoing maintenance effort. In a perfect world, we fire up the new platform and […]

big data, software architecture, software business

10 Jun 2016

Gary Allison 7 min read

At Bazaarvoice, we’ve pulled off an incredible feat, one that is such an enormous task that I’ve seen other companies hesitate to take on. We’ve learned a lot along the way and I wanted to share some of these experiences and lessons in hopes they may benefit others facing similar decisions. The Beginning Our original […]

big data, software architecture, software business

24 Dec 2015

admin 4 min read

Preparing for the Holiday season is a year round task for all of us here at Bazaarvoice. This year we saw many retailers extending their seasonal in-store specials to their websites as well. We also saw retailers going as far as closing physical stores on Thanksgiving (Nordstrom, Costco, Home Depot, etc.) and Black Friday (REI). Regardless […]

big data

4 Sept 2015

Frederick Feibel 2 min read

Every year Bazaarvoice R&D throws BVIO, an internal technical conference followed by a two-day hackathon. These conferences are an opportunity for us to focus on unlocking the power of our network, data, APIs, and platforms as well as have some fun in the process. We invite keynote speakers from within BV, from companies who use […]

big data, conferences, culture, talks

23 Mar 2015

Fahd Siddiqui 5 min read

A distributed data system consisting of several nodes is said to be fully consistent when all nodes have the same state of the data they own. So, if record A is in State S on one node, then we know that it is in the same state in all its replicas and data centers. Full […]

big data

20 Feb 2015

Trey Perry 4 min read

Every holiday season, the virtual doors of your favorite retailer are blown open by a torrent of shoppers who are eager to find the best deal, whether they’re looking for a Turbo Man action figure or a ludicrously discounted 4K flat screen. This series focuses on our Big Data analytics platform, which is used to learn more […]

big data, analytics, bigdata, reporting

21 Jun 2014

lukaseder 1 min read

One for the weekend: Big Data Big Data pic.twitter.com/18VPE9LGDq — Victor Agreda Jr (@superpixels) June 19, 2014

fun, big data, truth