~/devreads

#spark

4 posts

24 Jan 2025

Derick Yang 10 min read

Much of our heatmaps are built on batch data outputs stored in Rain At Strava, we love maps — some of our most loved features are nestled on map surfaces. My team, the Geo team, is focused on building and improving these products. On the Geo and Metro teams, we tend to work with large datasets: aggregations of open source…

mapssparkkey-value-storecaching

29 Oct 2021

Saurabh Jain 9 min read

Pinion — The Load Framework Part-2 This post is the 2nd part of the “Pinion — The Load Framework” series. In case you have not read the 1st post, you can read it here . In this post, we are going to cover the following topics. How does Pinion use Delta Lake for SCD operations? Small file problem with Delta…

cloudsparkawsdelta-lakeoptimization

28 Sept 2014

1 min read

It’s occasionally useful when writing map/reduce jobs to get a hold of the current filename that’s being processed. There’s a few ways to do this, depending on the version of Spark that you’re using. Spark 1.1.0 introduced a new method on HadoopRDD that makes this super easy: import org.apache.hadoop.io.LongWritable import org.apache.hadoop.io.Text import org.apache.hadoop.mapred.{FileSplit, TextInputFormat} import org.apache.spark.rdd.HadoopRDD // Create the text…

scalasparkhadoophdfs

7 Aug 2014