Project 1

Author

Bryce Williams

Spotify Stream analysis

This dataset comes from Spotify, and I chose it to see which variables are the strongest predictors of streams. The variables I chose are as follows. ‘Spotify Playlist Reach’ measures the potential audience reached through Spotify playlists, a primary method for gaining streams. ‘YouTube Views’ captures the song’s popularity on another major streaming platform and is often an indicator of overall reach. ‘TikTok Views’ is a critical measure of modern music virality, since songs that go viral on TikTok frequently see a large increase in streams. ‘AirPlay Spins’ represents radio play, which contributes somewhat to a song’s overall exposure and discovery. ‘Days Since Release’ (a new variable I created) controls for the age of the song; older songs have had more time to accumulate streams, making this a crucial control variable. ‘Explicit Track’ is a binary (0 or 1) variable that can be used to see whether explicit content has a statistically significant positive or negative effect on stream counts.
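As a quick illustration of how the ‘Days Since Release’ variable works, subtracting two Date objects in R gives the number of days between them. The release date below is a made-up example, not a row from the dataset, and the variable names are only for this sketch; the collection date is the same one assumed in the cleaning code.

```r
# Collection date assumed for this project (same value used in the cleaning code)
collection_date <- as.Date("2024-06-30")

# Hypothetical release date, written in month/day/year text form
release_date <- as.Date("4/26/2024", format = "%m/%d/%Y")

# Date - Date returns a difftime in days; as.numeric() turns it into a plain number
days_since_release <- as.numeric(collection_date - release_date)
days_since_release
```

A song released on that hypothetical date would have had 65 days to accumulate streams by the collection date.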

library(tidyverse)
library(scales)
setwd("~/Documents/Data 110") #getting my dataset in
rawspot <- read_csv("Most Streamed Spotify Songs 2024.csv") 
collectiondate <- as.Date("2024-06-30") #setup for my new variable
cleanspotify1 <- rawspot |>
  mutate(
    `Release Date` = as.Date(`Release Date`, format = "%m/%d/%Y"), #parse the month/day/year text into a Date (date subtraction idea carried over from Python)
    `Days Since Release` = as.numeric(collectiondate - `Release Date`) #new variable: days between release and collection date, stored as a plain number
  )
cleanspotify <- cleanspotify1 |> #Cleaning out N/As
  filter(
    !is.na(`Spotify Streams`),
    !is.na(`Spotify Playlist Reach`),
    !is.na(`YouTube Views`),
    !is.na(`TikTok Views`),
    !is.na(`AirPlay Spins`),
    !is.na(`Explicit Track`),
    !is.na(`Days Since Release`)
  )
model1 <- lm(`Spotify Streams` ~ 
               `Spotify Playlist Reach` + 
               `YouTube Views` + 
               `TikTok Views` + 
               `AirPlay Spins` + 
               `Explicit Track` + 
               `Days Since Release`, 
            data = cleanspotify)
summary(model1)

Call:
lm(formula = `Spotify Streams` ~ `Spotify Playlist Reach` + `YouTube Views` + 
    `TikTok Views` + `AirPlay Spins` + `Explicit Track` + `Days Since Release`, 
    data = cleanspotify)

Residuals:
       Min         1Q     Median         3Q        Max 
-2.392e+09 -1.289e+08 -2.102e+07  1.154e+08  1.795e+09 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -2.041e+07  9.875e+06  -2.067   0.0388 *  
`Spotify Playlist Reach`  7.619e+00  2.054e-01  37.094  < 2e-16 ***
`YouTube Views`           1.646e-01  8.589e-03  19.161  < 2e-16 ***
`TikTok Views`           -1.863e-03  8.982e-04  -2.075   0.0381 *  
`AirPlay Spins`           7.744e+02  4.463e+01  17.352  < 2e-16 ***
`Explicit Track`          6.945e+07  1.112e+07   6.243 4.83e-10 ***
`Days Since Release`      1.282e+05  4.528e+03  28.319  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 309300000 on 3352 degrees of freedom
Multiple R-squared:  0.6705,    Adjusted R-squared:  0.6699 
F-statistic:  1137 on 6 and 3352 DF,  p-value: < 2.2e-16
plot(model1)

I expected TikTok views to be more statistically significant than they turned out to be (p = 0.038, with a slightly negative coefficient), and I didn’t think explicitness would be a factor, because I didn’t think the audience of people who avoid explicit music would be that large. That is why my 2 visualizations explore these 2 variables further.
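One caveat when reading the model output: the raw coefficients are in different units (streams per view, streams per spin, streams per day), so their sizes cannot be compared directly to decide which predictor is "greatest". A rough sketch of one common workaround, standardizing every variable before fitting, is shown below. It uses simulated stand-in data, not the Spotify data, and the names `toy`, `views`, and `spins` are made up for the example.

```r
set.seed(1)
n <- 200

# Simulated stand-ins, NOT the Spotify data: two predictors on very different scales
views   <- rnorm(n, mean = 1e9, sd = 5e8)  # large-scale predictor
spins   <- rnorm(n, mean = 1e4, sd = 5e3)  # small-scale predictor
streams <- 0.1 * views + 5000 * spins + rnorm(n, sd = 1e7)
toy <- data.frame(streams, views, spins)

# Raw fit: coefficients are in "streams per one unit" of each predictor
raw_fit <- lm(streams ~ views + spins, data = toy)

# Standardized fit: every column is rescaled to mean 0 and SD 1 first,
# so coefficients are in "SDs of streams per one SD" of each predictor
toy_std <- as.data.frame(scale(toy))
std_fit <- lm(streams ~ views + spins, data = toy_std)

coef(raw_fit)
coef(std_fit)
```

In this toy setup the raw coefficient on `spins` is far larger than the one on `views`, but after standardizing, `views` has the larger coefficient. The same idea could be tried on the real model to compare predictors on a common scale.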

TikTok virality plot

virality_plot <- cleanspotify |>
  ggplot(aes(x = `TikTok Views`, y = `Spotify Streams`)) +
  #Scatterplot with custom color
  geom_point(alpha = 0.6, color = "#1f77b4") + 
  #lm trend line
  geom_smooth(method = "lm", color = "#d62728", linewidth = 1, se = FALSE) +
  #log scaling
  scale_x_log10(labels = label_comma()) +
  scale_y_log10(labels = label_comma()) +
  #title, axis, caption
  labs(
    title = "The Relationship Between TikTok Virality and Spotify Streams",
    x = "TikTok Views (Log Scale)",
    y = "Spotify Streams (Log Scale)",
    caption = "Source: Spotify"
  ) +
  #Theme swap
  theme_light() +
  #Messing with the theme a little more
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.caption = element_text(hjust = 0)
  )

virality_plot

Explicit Vs. Non-Explicit plot

Grade this one

spotifyexp1 <- cleanspotify |>
  mutate(
    #make a categorical explicit variable with readable labels so it can be plotted and colored
    `Explicit Status` = factor(`Explicit Track`,
                               levels = c(0, 1), 
                               labels = c("Non-Explicit", "Explicit"))
  )

# Making colors
fill_colors <- c("Non-Explicit" = "#2ca02c", "Explicit" = "#9467bd")
border_color <- "#333333"
spotifyexp <- spotifyexp1 |>
  ggplot(aes(x = `Explicit Status`, y = `Spotify Streams`, fill = `Explicit Status`)) +
  #Boxplot
  geom_boxplot(width = 0.5, alpha = 0.8, color = border_color) +
  #Log of streams for plot
  scale_y_log10(labels = label_comma()) + #label_comma() (found by testing things) keeps the axis labels out of scientific notation
  #Filling
  scale_fill_manual(values = fill_colors) +
  #Title, axis, caption
  labs(
    title = "Distribution of Spotify Streams: Explicit vs. Non-Explicit Tracks",
    x = "Track Status",
    y = "Spotify Streams (Log Scale)",
    fill = "Track Type",
    caption = "Source: Spotify"
  ) +
  #Changing the default ggplot theme
  theme_light() +
  #Changing a few more things for fun
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.caption = element_text(hjust = 0),
    legend.position = "bottom" # Position the legend (makes sense of colors) 
    )

spotifyexp

Thoughts

I cleaned the dataset by using mutate to create a new variable, days since release, and by converting another variable, explicit status, into a factor. I also used is.na to remove NAs from all of the variables I used. The virality graph didn’t surprise me, as I had expected a positive correlation between TikTok views and Spotify streams, since most popular songs get posted on TikTok. What I didn’t expect was a higher IQR and median for the Non-Explicit category, as I had underestimated the audience that actually wants non-explicit music. The one thing I wish the data included is categorical genre data, which would offer deeper, genre-specific control over stream prediction.
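The median and IQR comparison between the two categories can also be checked numerically rather than read off the boxplot. The sketch below shows the idea on a tiny made-up table (the values and the names `toy_streams` and `group_summary` are illustrative only, not from the Spotify data).

```r
library(dplyr)

# Made-up stream counts for illustration only, NOT values from the Spotify data
toy_streams <- tibble(
  status  = rep(c("Non-Explicit", "Explicit"), each = 5),
  streams = c(10, 20, 30, 40, 50, 5, 10, 15, 20, 25)
)

# Median and interquartile range of streams within each group
group_summary <- toy_streams |>
  group_by(status) |>
  summarise(
    median_streams = median(streams),
    iqr_streams    = IQR(streams)
  )

group_summary
```

Swapping in the real data frame and its `Explicit Status` and `Spotify Streams` columns should give the exact medians and IQRs behind the boxplot.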