Video Title: Music Recommendations at Scale with Spark
Speaker: Christopher Johnson, Spotify
Duration: 26:29
Link: (https://www.youtube.com/watch?v=3LBgiFch4_g)
Spotify relies heavily on collaborative filtering for music recommendations, using implicit listener data such as stream counts rather than explicit ratings. Their core approach is matrix factorization via Alternating Least Squares (ALS), in which user and song vectors are learned iteratively to predict preferences. Initially this ran on Hadoop, but disk I/O bottlenecks slowed down the iterative ALS process.
To overcome this, Spotify experimented with Apache Spark, leveraging its in-memory caching to avoid redundant data reads. They tested three implementations:
Broadcast everything: all item vectors were broadcast to the workers. Simple, but it caused excessive network shuffling.
Full Gridify: the data was partitioned into user and item blocks, which reduced network traffic but required an extra shuffle phase.
Half Gridify: adopted from MLlib, this approach grouped ratings by user partitions, eliminating the extra shuffle and proving fastest (e.g., processing 4M users × 500K artists in 1 hour vs. Hadoop’s 8 hours).
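The Half Gridify idea can be illustrated on a single machine with a few lines of base R. This is only a sketch of the partitioning logic, not Spark code: the `ratings` triples, the `n_partitions` value, and the choice to assign users to blocks by user id modulo the block count are all assumptions made for illustration.

```r
# Hypothetical (user, item, count) triples standing in for the ratings data
ratings <- data.frame(
  user  = c(1, 1, 2, 3, 3, 4, 5, 6, 6),
  item  = c("a", "c", "b", "a", "b", "d", "c", "a", "d"),
  count = c(3, 1, 2, 5, 1, 4, 2, 1, 7),
  stringsAsFactors = FALSE
)

n_partitions <- 2  # assumed number of user blocks

# Assign every user to exactly one block (assumption: user id modulo block count)
ratings$block <- ratings$user %% n_partitions

# Group ratings by user block; in Spark these groups would be cached once
blocks <- split(ratings, ratings$block)

# Item vectors each block must receive before it can solve its users locally
items_needed <- lapply(blocks, function(b) sort(unique(b$item)))
items_needed
```

Because all of a user's ratings live in one block, that block can solve for its user vectors without further aggregation once the needed item vectors arrive. In this toy data each block needs three of the four item vectors; in the worst case a block needs all of them.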
While Spark showed a 10x speedup over Hadoop, challenges persisted:
Memory issues with Half Gridify: a partition that needs every item vector risks out-of-memory (OOM) errors.
Serialization hurdles: custom Kryo serializers were needed for non-standard types.
Stability problems at scale: executors failed beyond 20% of the data volume, and intensive tuning was required.
As of the talk, Spark ALS remained experimental at Spotify, with Hadoop still in production. Key learnings included using Pair RDDs for efficient joins and prioritizing data partitioning strategies to balance speed, memory, and network use.
Matrix Factorization:
Spotify uses Alternating Least Squares (ALS) to decompose a large user-item matrix into latent features representing users and items.
Implicit Feedback Modeling:
Since Spotify doesn’t use explicit ratings, they model whether a user listened to a track (binary) and how confident the system is in that interaction.
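As a minimal sketch of this idea, the hypothetical stream counts below are turned into a binary preference matrix and a confidence matrix. The linear weighting `1 + alpha * streams` follows the form proposed by Hu, Koren & Volinsky (2008); the counts and the value `alpha = 40` are illustrative assumptions, not Spotify's settings.

```r
# Hypothetical stream counts: rows are users, columns are tracks
streams <- matrix(
  c(12, 0, 3,
     0, 1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("User1", "User2"), c("Track1", "Track2", "Track3"))
)

alpha <- 40  # assumed confidence scaling factor

# Binary preference: 1 if the user streamed the track at all, else 0
preference <- (streams > 0) * 1

# Confidence grows linearly with the number of streams;
# unobserved pairs keep a baseline confidence of 1
confidence <- 1 + alpha * streams

preference
confidence
```

Zero-count cells still get a small confidence rather than being dropped, so "never listened" is treated as weak negative evidence rather than missing data.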
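ALS for Implicit Data (Hu, Koren & Volinsky, 2008):
\[
\min_{x, y} \sum_{u, i} c_{ui}\,(p_{ui} - x_u^T y_i)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)
\]
Here \(p_{ui}\) is the binary preference (1 if user \(u\) listened to item \(i\), otherwise 0), \(c_{ui}\) is the confidence in that observation, and \(x_u\) and \(y_i\) are the user and item latent vectors.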
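The objective above can be minimized by holding one set of vectors fixed and solving a regularized least-squares problem for each vector in the other set, then swapping roles. The base R sketch below does this with dense matrices at toy scale. The function name `als_implicit`, its default parameters, the confidence form `1 + alpha * streams`, and the example counts are assumptions for illustration; this is not the Spotify or MLlib implementation.

```r
# Minimal implicit-feedback ALS in base R (dense matrices, toy scale only)
als_implicit <- function(streams, n_factors = 2, lambda = 0.1,
                         alpha = 40, n_iter = 10, seed = 1) {
  set.seed(seed)
  P  <- (streams > 0) * 1      # binary preferences p_ui
  Cm <- 1 + alpha * streams    # confidences c_ui (assumed linear form)
  n_items <- ncol(streams)
  # Only the item vectors need a random start; users are solved first
  Y <- matrix(rnorm(n_items * n_factors, sd = 0.1), n_items, n_factors)
  reg <- lambda * diag(n_factors)

  # Solve one side with the other held fixed.
  # Row k of the result is (F' C_k F + lambda I)^-1 F' C_k p_k
  solve_side <- function(Fixed, Conf, Pref) {
    do.call(rbind, lapply(seq_len(nrow(Conf)), function(k) {
      ck <- Conf[k, ]
      A <- crossprod(Fixed, Fixed * ck) + reg  # F' diag(ck) F + lambda I
      b <- crossprod(Fixed, ck * Pref[k, ])    # F' diag(ck) p_k
      drop(solve(A, b))
    }))
  }

  loss <- numeric(n_iter)
  for (it in seq_len(n_iter)) {
    X <- solve_side(Y, Cm, P)        # update users, items fixed
    Y <- solve_side(X, t(Cm), t(P))  # update items, users fixed
    loss[it] <- sum(Cm * (P - X %*% t(Y))^2) +
      lambda * (sum(X^2) + sum(Y^2))
  }
  list(X = X, Y = Y, loss = loss)
}

# Hypothetical stream counts for four users and five tracks
streams <- matrix(
  c(5, 0, 2, 0, 1,
    0, 3, 0, 4, 0,
    2, 1, 0, 0, 0,
    0, 0, 6, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("User", 1:4), paste0("Track", 1:5))
)

fit <- als_implicit(streams)
scores <- fit$X %*% t(fit$Y)       # predicted preference for every pair
dimnames(scores) <- dimnames(streams)
round(scores, 2)
fit$loss                           # objective value after each iteration
```

Each half-step solves its subproblem exactly, so the recorded objective should not increase from one iteration to the next. A production implementation would exploit sparsity and distribute the per-user and per-item solves instead of looping over dense rows.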
Apache Spark:
Spotify experimented with Spark for scalable matrix factorization. Spark’s MLlib provides a distributed ALS implementation designed for large-scale, sparse datasets.
Offline vs Online Serving:
Blended Approach:
Collaborative filtering + content-based filtering (audio features) + editorial logic.
Spotify’s recommendation system demonstrates how machine learning and big data infrastructure combine to deliver personalized user experiences at scale. The shift from explicit to implicit feedback introduces modeling complexity, but also more authentic behavioral signals.
Below is an example of simple ALS matrix factorization using recommenderlab, which abstracts much of the complexity but follows the same core idea. It demonstrates the fundamental concept of matrix factorization by creating a binary user-item matrix and feeding it into an ALS recommender model to generate personalized music recommendations, albeit on a small scale.
library(recommenderlab)
# Create a mock user-item matrix with implicit feedback
ratings_matrix <- matrix(
  c(1, 0, 1, 0, 1,
    0, 1, 0, 1, 0,
    1, 1, 0, 0, 0,
    0, 0, 1, 1, 1),
  nrow = 4,
  byrow = TRUE
)
rownames(ratings_matrix) <- paste0("User", 1:4)
colnames(ratings_matrix) <- paste0("Track", 1:5)
# Convert to binaryRatingMatrix
binary_ratings <- as(ratings_matrix, "binaryRatingMatrix")
# Build ALS-based collaborative filtering recommender
recommender_model <- Recommender(binary_ratings, method = "ALS")
# Generate recommendations
recommendations <- predict(recommender_model, binary_ratings, n = 3)
# Show recommendations for each user
as(recommendations, "list")
## $`0`
## [1] "Track4" "Track2"
##
## $`1`
## [1] "Track3" "Track5" "Track1"
##
## $`2`
## [1] "Track3" "Track5" "Track4"
##
## $`3`
## [1] "Track1" "Track2"