library(tidyverse)
library(scales)
Project 1
Spotify Stream Analysis
This is a Spotify dataset that I chose in order to see which variables are the strongest predictors of streams. The variables I chose are:
‘Spotify Playlist Reach’: measures the potential audience reached through Spotify playlists, a primary way songs gain streams.
‘YouTube Views’: captures the song’s popularity on another major streaming platform, often an indicator of overall reach.
‘TikTok Views’: a critical measure of modern music virality. Songs that go viral on TikTok frequently see a large increase in streams.
‘AirPlay Spins’: represents radio play, such as car radio, which contributes somewhat to a song’s overall exposure and discovery.
‘Days Since Release’ (new variable): controls for the age of the song. Older songs have had more time to accumulate streams, which makes this a crucial control variable.
‘Explicit Track’: a binary (0 or 1) variable used to test whether explicit content has a statistically significant positive or negative effect on stream counts.
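To make the ‘Days Since Release’ idea concrete, here is a minimal sketch on three made-up release dates. The dates and the names `collection_date`, `release_dates`, and `days_since` are illustrative assumptions, not values from the dataset. The `as.numeric()` call is an added step that turns the `difftime` result of subtracting two dates into a plain number of days.

```r
# Hypothetical cutoff date and made-up release dates, for illustration only
collection_date <- as.Date("2024-06-30")
release_dates <- as.Date(c("6/30/2024", "6/1/2024", "6/30/2023"),
                         format = "%m/%d/%Y")

# Subtracting two Date objects gives a difftime measured in days;
# as.numeric() converts it to an ordinary numeric vector
days_since <- as.numeric(collection_date - release_dates)

print(days_since)
```

A song released on the cutoff date gets 0 days, and the one released a year earlier gets 366 because the span includes 29 February 2024.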
setwd("~/Documents/Data 110") #getting my dataset in
rawspot <- read_csv("Most Streamed Spotify Songs 2024.csv")
collectiondate <- as.Date("2024-06-30") # setup for my new variable

# Creating a days-since variable using some code I learned in Python that helped me out here
cleanspotify1 <- rawspot |>
  mutate(
    `Release Date` = as.Date(`Release Date`, format = "%m/%d/%Y"),
    `Days Since Release` = (collectiondate - `Release Date`)
  )

# Cleaning out NAs
cleanspotify <- cleanspotify1 |>
  filter(!is.na(`Spotify Streams`) &
           !is.na(`Spotify Playlist Reach`) &
           !is.na(`YouTube Views`) &
           !is.na(`TikTok Views`) &
           !is.na(`AirPlay Spins`) &
           !is.na(`Explicit Track`) &
           !is.na(`Days Since Release`))

model1 <- lm(`Spotify Streams` ~
               `Spotify Playlist Reach` +
               `YouTube Views` +
               `TikTok Views` +
               `AirPlay Spins` +
               `Explicit Track` +
               `Days Since Release`,
             data = cleanspotify)
summary(model1)
Call:
lm(formula = `Spotify Streams` ~ `Spotify Playlist Reach` + `YouTube Views` +
    `TikTok Views` + `AirPlay Spins` + `Explicit Track` + `Days Since Release`,
    data = cleanspotify)

Residuals:
       Min         1Q     Median         3Q        Max
-2.392e+09 -1.289e+08 -2.102e+07  1.154e+08  1.795e+09

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)              -2.041e+07  9.875e+06  -2.067   0.0388 *
`Spotify Playlist Reach`  7.619e+00  2.054e-01  37.094  < 2e-16 ***
`YouTube Views`           1.646e-01  8.589e-03  19.161  < 2e-16 ***
`TikTok Views`           -1.863e-03  8.982e-04  -2.075   0.0381 *
`AirPlay Spins`           7.744e+02  4.463e+01  17.352  < 2e-16 ***
`Explicit Track`          6.945e+07  1.112e+07   6.243 4.83e-10 ***
`Days Since Release`      1.282e+05  4.528e+03  28.319  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 309300000 on 3352 degrees of freedom
Multiple R-squared:  0.6705, Adjusted R-squared:  0.6699
F-statistic:  1137 on 6 and 3352 DF,  p-value: < 2.2e-16
plot(model1)
I would have thought that TikTok views would be more statistically significant than they ended up being, and I didn’t think that explicit content would be a factor, because I didn’t think the audience that avoids explicit music would be that large. That is why I chose to explore these two variables further in my two visualizations.
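One caution when reading the coefficient table: the predictors are measured in very different units (playlist reach, views, spins, days), so the raw estimates cannot be compared directly to judge which predictor matters most. A minimal sketch of one common workaround, standardizing every variable before fitting, is shown here on simulated data. The data frame `toy` and its columns `reach`, `spins`, and `streams` are made up for illustration and are not the Spotify data.

```r
set.seed(1)
n <- 200

# Simulated predictors on very different scales (illustrative values only)
toy <- data.frame(
  reach = rnorm(n, mean = 1e6, sd = 2e5),
  spins = rnorm(n, mean = 500, sd = 100)
)
toy$streams <- 5 * toy$reach + 1000 * toy$spins + rnorm(n, sd = 1e5)

# Raw fit: coefficients are "streams per unit of each predictor"
raw_fit <- lm(streams ~ reach + spins, data = toy)

# Standardized fit: every column rescaled to mean 0 and standard deviation 1,
# so coefficients are "standard deviations of streams per standard deviation of predictor"
toy_std <- as.data.frame(scale(toy))
std_fit <- lm(streams ~ reach + spins, data = toy_std)

print(coef(raw_fit))
print(coef(std_fit))
```

In this simulation `spins` has the larger raw coefficient, but `reach` has the larger standardized coefficient, because one standard deviation of `reach` moves `streams` much further than one standard deviation of `spins`.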
Explicit Vs. Non-Explicit plot
Grade this one
# Mutation to make a categorical explicit variable, with labels, so it can be plotted and colored
spotifyexp1 <- cleanspotify |>
  mutate(
    `Explicit Status` = factor(`Explicit Track`,
                               levels = c(0, 1),
                               labels = c("Non-Explicit", "Explicit"))
  )

# Making colors
fill_colors <- c("Non-Explicit" = "#2ca02c", "Explicit" = "#9467bd")
border_color <- "#333333"

spotifyexp <- spotifyexp1 |>
  ggplot(aes(x = `Explicit Status`, y = `Spotify Streams`, fill = `Explicit Status`)) +
  # Boxplot
  geom_boxplot(width = 0.5, alpha = 0.8, color = border_color) +
  # Log scale for streams; label_comma() (found by testing things) keeps scientific notation from showing up
  scale_y_log10(labels = label_comma()) +
  # Filling
  scale_fill_manual(values = fill_colors) +
  # Title, axis, caption
  labs(
    title = "Distribution of Spotify Streams: Explicit vs. Non-Explicit Tracks",
    x = "Track Status",
    y = "Spotify Streams (Log Scale)",
    fill = "Track Type",
    caption = "Source: Spotify"
  ) +
  # Changing the default ggplot theme
  theme_light() +
  # Changing a few more things for fun
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.caption = element_text(hjust = 0),
    legend.position = "bottom" # Position the legend (makes sense of colors)
  )
spotifyexp
Thoughts
I cleaned the dataset by using mutate to create a new variable, days since release, and to convert another variable, explicit status, into a factor, and I used is.na to remove NAs from all of the variables I used. The virality graph didn’t surprise me, as I had expected a positive correlation between TikTok views and Spotify streams, since most popular songs get posted on TikTok. What I didn’t expect was a higher IQR and median in the Non-Explicit category, as I had underestimated the audience that actually wants non-explicit music. The one thing I wish the data included is categorical genre data, which would allow deeper, genre-specific control in predicting streams.
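The median and IQR comparison can also be checked numerically instead of being read off the boxplot. This is a minimal sketch on a made-up toy table: `toy_streams` and its values are hypothetical stand-ins for `cleanspotify`, and `stream_summary` is an illustrative name.

```r
library(dplyr)

# Hypothetical toy data standing in for cleanspotify; the values are made up
toy_streams <- tibble(
  `Explicit Track` = c(0, 0, 0, 0, 0, 1, 1, 1),
  `Spotify Streams` = c(100, 200, 300, 400, 500, 150, 250, 350)
)

stream_summary <- toy_streams |>
  mutate(
    `Explicit Status` = factor(`Explicit Track`,
                               levels = c(0, 1),
                               labels = c("Non-Explicit", "Explicit"))
  ) |>
  group_by(`Explicit Status`) |>
  summarise(
    n = n(),                                    # number of tracks in the group
    median_streams = median(`Spotify Streams`), # middle value
    iqr_streams = IQR(`Spotify Streams`),       # 75th minus 25th percentile
    .groups = "drop"
  )

print(stream_summary)
```

Swapping `toy_streams` for the cleaned data would give the same table for the real tracks. The IQR here is computed on the raw scale, so it will not match the box heights drawn on a log axis.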