Executive summary

Spotify is one of the world’s most popular streaming platforms, and churn is a major issue for any subscription service. In this project, we examine which user behaviors and characteristics are associated with the likelihood of churn. By comparing subscription types, listening behavior, and ad exposure, we aim to identify patterns that help explain why users stop using the platform. The three final figures show how churn varies by subscription type, by level of listening engagement, and by ad exposure among free users.

Data background

The dataset chosen for this project, “Spotify Analysis Dataset 2025,” is a secondary dataset created and uploaded to Kaggle by Nabiha Zahid. It contains information on 8,000 Spotify users, including demographic data (age, gender, and country) and behavioral data (subscription type, total listening time, number of songs played, ad frequency, and whether the user churned). The dataset is structured as a single CSV file in which each row represents one user and each column represents a measured attribute. Once downloaded, the CSV file was imported into RStudio for data cleaning, variable selection, and visualization.

Because this dataset includes both behavioral data and subscription data, it makes it possible to examine how these factors relate to the likelihood of churn. It also provides a clear outcome variable (churn) and groups that are well suited to comparison (for example, the behaviors of free and premium users). For these reasons, the dataset is well suited to analyzing how different factors relate to churn. Because it is publicly available, the analysis is also replicable and appropriate for academic use.

Data cleaning

Overall, the initial data is relatively straightforward; however, several preprocessing steps were necessary before conducting the analysis. Variables were renamed for clarity, and continuous variables were converted to numeric types. The churn variable was converted from a numeric format (0/1) into a labeled factor with the levels “Active” and “Churned.” Missing values in numeric variables were replaced using median imputation, and rows with missing subscription type or gender were removed. Outliers in listening time and songs played per day were removed using the IQR method to avoid skewed results, and duplicate rows were removed to protect the accuracy and reliability of the final dataset. After these steps, the cleaned dataset (spotify_clean) was consistent, interpretable, and ready for statistical analysis and visualization.

# Load packages and data
library(tidyverse)
spotify <- read.csv("spotify_churn_dataset.csv")

# Rename variables for clarity
spotify <- spotify %>%
  rename(
    ads_exposure = ads_listened_per_week,
    churn = is_churned)

# Clean data
spotify_clean <- spotify %>%
  
  # 1) Convert data types + standardize categorical values
  mutate(
    subscription_type = str_to_title(subscription_type),
    subscription_type = factor(subscription_type),
    listening_time = as.numeric(listening_time),
    ads_exposure = as.numeric(ads_exposure),
    skip_rate = as.numeric(skip_rate),
    songs_played_per_day = as.numeric(songs_played_per_day),
    offline_listening = as.numeric(offline_listening)) %>%
  
  # 2) Create labeled churn variable (0 = Active, 1 = Churned)
  mutate(
    churn = factor(churn,
                        levels = c(0, 1),
                        labels = c("Active", "Churned"))) %>%
  
  # 3) Handle missing values
  mutate(
    listening_time = ifelse(is.na(listening_time),
                            median(listening_time, na.rm = TRUE),
                            listening_time),
    songs_played_per_day = ifelse(is.na(songs_played_per_day),
                                  median(songs_played_per_day, na.rm = TRUE),
                                  songs_played_per_day),
    skip_rate = ifelse(is.na(skip_rate),
                       median(skip_rate, na.rm = TRUE),
                       skip_rate),
    ads_exposure = ifelse(is.na(ads_exposure),
                                   median(ads_exposure, na.rm = TRUE),
                                   ads_exposure),
    offline_listening = ifelse(is.na(offline_listening),
                               median(offline_listening, na.rm = TRUE),
                               offline_listening)) %>%
  drop_na(subscription_type) %>%
  drop_na(gender) %>%
  
  # 4) Remove outliers (IQR method for continuous variables)
  filter(
    listening_time >= quantile(listening_time, 0.25) - 1.5 * IQR(listening_time),
    listening_time <= quantile(listening_time, 0.75) + 1.5 * IQR(listening_time),
    
    songs_played_per_day >= quantile(songs_played_per_day, 0.25) - 1.5 * IQR(songs_played_per_day),
    songs_played_per_day <= quantile(songs_played_per_day, 0.75) + 1.5 * IQR(songs_played_per_day)) %>%
  
  # 5) Remove duplicates
  distinct()
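
The imputation and outlier steps above can be illustrated on a small scale. The following is a minimal sketch in base R, assuming a made-up data frame `toy` (its values are invented and not taken from the Spotify dataset) and two hypothetical helpers, `impute_median()` and `within_iqr()`, that do not appear in the analysis itself. It shows how a missing value is replaced by the median and how the 1.5 × IQR rule flags an extreme value.

```r
# Toy data, invented for illustration only (not from the Spotify dataset)
toy <- data.frame(
  listening_time       = c(120, 95, NA, 150, 110, 900),
  songs_played_per_day = c(40, NA, 55, 60, 45, 50)
)

# Hypothetical helper: replace missing values with the median of observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Hypothetical helper: TRUE for values inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
within_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x >= q[1] - fence & x <= q[2] + fence
}

# Impute every column, then keep rows that pass the IQR rule on both variables
toy[] <- lapply(toy, impute_median)
toy_clean <- toy[within_iqr(toy$listening_time) &
                   within_iqr(toy$songs_played_per_day), ]

toy_clean
```

In this toy example the missing listening time is filled with 120 (the median of the observed values), and the row with a listening time of 900 is dropped because it lies above the upper fence. As in the pipeline above, the quartiles for both variables are computed on the same rows before any filtering is applied.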

Data Summary

Variable Names

names(spotify)

##  [1] "user_id"              "gender"               "age"                 
##  [4] "country"              "subscription_type"    "listening_time"      
##  [7] "songs_played_per_day" "skip_rate"            "device_type"         
## [10] "ads_exposure"         "offline_listening"    "churn"

Summary Statistics

summary(spotify_clean[, c("listening_time",
                          "ads_exposure",
                          "songs_played_per_day",
                          "age")])

##  listening_time   ads_exposure    songs_played_per_day      age       
##  Min.   : 10.0   Min.   : 0.000   Min.   : 1.00        Min.   :16.00  
##  1st Qu.: 81.0   1st Qu.: 0.000   1st Qu.:25.00        1st Qu.:26.00  
##  Median :154.0   Median : 0.000   Median :50.00        Median :38.00  
##  Mean   :154.1   Mean   : 6.944   Mean   :50.13        Mean   :37.66  
##  3rd Qu.:227.0   3rd Qu.: 5.000   3rd Qu.:75.00        3rd Qu.:49.00  
##  Max.   :299.0   Max.   :49.000   Max.   :99.00        Max.   :59.00

Categorical Variable Summary

table(spotify_clean$subscription_type)

## 
##  Family    Free Premium Student 
##    1908    2018    2115    1959

table(spotify_clean$gender)

## 
## Female   Male  Other 
##   2659   2691   2650

table(spotify_clean$churn)

## 
##  Active Churned 
##    5929    2071

Distribution of Listening Time

p <- ggplot(data = spotify_clean,
            mapping = aes(x = listening_time))

p + geom_histogram(binwidth = 20,
                   fill = "#1DB954",
                   color = "white") +
  labs(title = "Distribution of Listening Time",
       x = "Listening Time (minutes)",
       y = "Count") +
  theme_minimal()

Distribution of Subscription Types

p <- ggplot(data = spotify_clean,
            mapping = aes(x = subscription_type))

p + geom_bar(fill = "#1ED760") +
  labs(title = "Number of Users by Subscription Type",
       x = "Subscription Type",
       y = "Count") +
  theme_minimal()

Ads Exposure Distribution (Free Users)

free_users <- spotify_clean %>%
  filter(subscription_type == "Free")

p <- ggplot(data = free_users,
            mapping = aes(x = ads_exposure))

p + geom_histogram(binwidth = 2,
                   fill = "#191414",
                   color = "white") +
  labs(title = "Ads Exposure Distribution (Free Users)",
       x = "Ads Exposure (per week)",
       y = "Count") +
  theme_minimal()

Individual figures

Figure 1: Churn Rate Across Subscription Types (Bar Chart)

Figure 1 is a bar chart of the four subscription plans (Family, Premium, Student, and Free) that allows a clear comparison of churn levels across membership types. To create the graph, we grouped users by subscription type and calculated the churn rate for each category. Subscription type directly reflects differences in user experience, such as feature access, pricing, and overall value, all of which are likely to influence whether users remain active or churn.

A bar chart is the most appropriate choice because subscription type is a categorical variable, and bar heights allow churn percentages to be compared clearly across groups. The design applies Spotify’s brand colors, minimal gridlines, and simple typography to maintain consistency and readability. This visual style helps present the data truthfully, without exaggeration or distortion.

According to the chart, Family plan users have the highest churn rate (27.5%), followed by Student, Premium, and Free users, with Free users having the lowest (24.9%). The gap between the highest and lowest rates is under three percentage points, so the differences are modest. Even so, the pattern suggests that churn is not higher among non-paying users; instead, users on shared or discounted plans (Family or Student) may be less committed individually, which could increase their likelihood of churning. Free users, by contrast, may be accustomed to the basic service or feel less pressure to leave because they are not paying. In summary, the figure shows that subscription type is associated with churn, but not in the stereotypical way in which free users churn the most. Instead, group-based plans may have distinct behavioral dynamics that contribute to higher churn.

# Calculate churn rate (%) by subscription type, using the cleaned data
churn_by_subscription <- spotify_clean %>%
    group_by(subscription_type) %>%
    summarise(churn = mean(churn == "Churned") * 100)

# draw graph
figure1 <- ggplot(data = churn_by_subscription,
                  mapping = aes(x = subscription_type,
                                y = churn,
                                fill = subscription_type)) +
  geom_col() +
  scale_fill_manual(values = c(
    "Family"  = "#1DB954",  # Spotify green
    "Free"    = "#1ED760",  # Light green
    "Premium" = "#535353",  # Dark grey
    "Student" = "#191414"   # Black
  )) +
  labs(title = "Figure 1: Churn Rate Across Subscription Types",
       x = "Subscription Type",
       y = "Churn Rate (%)",
       fill = "Subscription Type") +
  theme_minimal()

figure1

ggsave(figure1, filename = "images/figure1_bar_chart.png",
       width = 8, height = 5)
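
The churn-rate calculation relies on the fact that the mean of a logical vector is the proportion of `TRUE` values. A minimal base R sketch of that idea, assuming a small made-up data frame `toy` (invented values, not from the Spotify dataset), might look like this:

```r
# Toy data, invented for illustration only (not from the Spotify dataset)
toy <- data.frame(
  subscription_type = c("Free", "Free", "Free", "Free",
                        "Family", "Family", "Family", "Family"),
  churn = c(0, 0, 0, 1, 1, 0, 1, 0)
)

# Label the outcome the same way as in the cleaning step (0 = Active, 1 = Churned)
toy$churn <- factor(toy$churn,
                    levels = c(0, 1),
                    labels = c("Active", "Churned"))

# The mean of (churn == "Churned") within each group is the share of churned users
churn_rate <- tapply(toy$churn == "Churned", toy$subscription_type, mean) * 100

churn_rate
```

Here one of four Free users and two of four Family users churned, so the rates come out to 25% and 50%. The `group_by()` and `summarise()` calls used for Figure 1 compute the same quantity per subscription type.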

Figure 2: Relationship Between Listening Time and Churn Rate (Line Graph)

For Figure 2, we divided users into equal-width listening-time intervals (50 minutes each), calculated the churn rate within each interval, and plotted the results as a line graph. Listening time is a continuous measure of engagement, so displaying it as a trend helps illustrate how churn changes as engagement increases. The graph uses Spotify’s green and a simple, minimal theme so that the listening-time groups are easy to compare.

According to the resulting graph, users with moderate listening time (50–100 minutes) have the highest churn rate, while those with very low or very high listening time show lower churn. This indicates that the relationship between engagement and churn is non-linear rather than a simple upward or downward trend. Moderately engaged users may not feel strongly attached to the platform, making them more likely to stop using the application. High-engagement listeners, on the other hand, seem to be more committed to the platform, making them less likely to churn. Overall, this figure truthfully represents the chosen data and communicates the behavioral trend without unnecessary detail.

# 1) Create listening time bins
spotify_clean <- spotify_clean %>%
    mutate(time_group = cut(listening_time,
                            breaks = seq(0, 300, 50),
                            include.lowest = TRUE))

# 2) Calculate churn rate by time group
churn_by_time <- spotify_clean %>%
    group_by(time_group) %>%
    summarise(churn_rate = mean(churn == "Churned") * 100)

# 3) Line chart with Spotify green
figure2 <- ggplot(data = churn_by_time,
                  mapping = aes(x = time_group,
                                y = churn_rate,
                                group = 1)) +
    geom_line(linewidth = 1.2, color = "#1DB954") +
    geom_point(size = 3, color = "#1DB954") +
    theme_minimal() +
    labs(title = "Figure 2: Churn Rate Across Listening Time Groups",
         x = "Listening Time Range (minutes)",
         y = "Churn Rate (%)")

figure2

ggsave(figure2, filename = "images/figure2_line_graph.png",
       width = 8, height = 5)
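
When reading the x-axis of Figure 2, it helps to know how `cut()` assigns boundary values to bins. The sketch below, which uses a made-up vector of listening times (not taken from the dataset), applies the same `breaks` and `include.lowest` settings as the code above to show which interval each value falls into.

```r
# Made-up listening times chosen to sit on or near the bin boundaries
listening_time <- c(0, 10, 50, 51, 100, 299, 300)

# Same binning as in the analysis: 50-minute intervals from 0 to 300
time_group <- cut(listening_time,
                  breaks = seq(0, 300, 50),
                  include.lowest = TRUE)

# Show each value next to the interval it was assigned to
data.frame(listening_time, time_group)
```

By default the intervals are closed on the right, so a value of exactly 50 belongs to the first bin, `[0,50]`, while 51 and 100 both belong to `(50,100]`. The `include.lowest = TRUE` argument is what keeps a value of 0 in the first bin instead of turning it into `NA`.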

Figure 3: Effect of Ad Exposure on Churn Among Free Users (Boxplot)

Figure 3 shows how ad exposure relates to churn among free users. Since ad frequency is a continuous behavioral variable, a boxplot is used to compare the distribution of ad exposure between Active and Churned users. This chart type efficiently displays differences in medians, variability, and the spread of values within each group.

To create the figure, we focused only on free users and compared weekly ad frequency between active and churned users. As in the other figures, we used Spotify’s brand colors (green for Active users and dark grey for Churned users) to maintain design consistency while ensuring clear visual contrast.

The graph shows that churned users have slightly higher ad exposure on average; however, the two groups still overlap substantially. This suggests that although high ad exposure may contribute to churn, it is not the only factor that leads users to quit. Instead, ad exposure is likely to interact with other aspects of the user experience. The boxplot conveys this accurately by displaying full distributions rather than relying on a single summary value.

figure3 <- ggplot(data = free_users,
               mapping = aes(x = churn,
                             y = ads_exposure,
                             fill = churn)) +
  geom_boxplot(alpha = 0.7,
               outlier.color = "#535353") +
  scale_fill_manual(values = c(
    "Active"  = "#1DB954",
    "Churned" = "#191414"
  )) +
  labs(title = "Figure 3: Weekly Ad Frequency by User Status (Free Users)",
       x = "User Status",
       y = "Weekly Ad Frequency (times per week)",
       fill = "Churn Status") +
  theme_minimal()


figure3

ggsave(figure3, filename = "images/figure3_boxplot.png",
       width = 8, height = 5)
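
The comparison that the boxplot makes visually can also be expressed as numbers. The following is a minimal base R sketch, assuming a small made-up data frame `toy_free` (invented values, not from the Spotify dataset), that computes the median and quartiles of ad exposure for each group.

```r
# Toy data, invented for illustration only (not from the Spotify dataset)
toy_free <- data.frame(
  churn = factor(rep(c("Active", "Churned"), each = 4),
                 levels = c("Active", "Churned")),
  ads_exposure = c(10, 20, 30, 40, 20, 30, 40, 50)
)

# Median and quartiles of ad exposure within each churn group
box_summary <- sapply(
  split(toy_free$ads_exposure, toy_free$churn),
  function(x) c(median = median(x),
                q1 = quantile(x, 0.25, names = FALSE),
                q3 = quantile(x, 0.75, names = FALSE))
)

box_summary
```

In this toy example the Churned group has the higher median (35 versus 25), yet the upper quartile of the Active group (32.5) is above the lower quartile of the Churned group (27.5). That is the kind of overlap described for Figure 3: a shift in the center of the distribution without a clean separation between the groups.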

Conclusion

Overall, the analysis indicates that user churn on Spotify is associated with subscription type, listening engagement, and ad exposure. The highest churn appears among users on shared or discounted plans, which may imply lower individual commitment, whereas free users churn the least. Engagement has a non-linear relationship with churn: moderately active users leave most frequently, whereas very low- and high-engagement listeners are more likely to stay. Higher ad exposure is associated with churn among free users, but this factor alone is not strong enough to explain it. Together, these findings suggest that churn is shaped by a combination of pricing models, engagement patterns, and user experience factors rather than by a single variable.

What factors influence Spotify users to churn?

Alexandra Laguardia, Methaporn Sunthichaiyakul

2025-12-09