Source: Outside Insight
The dataset I will be using for this project is from Spotify. It has information on 2000 songs released between 1998 and 2020, with several variables describing each song. These include the popularity of the song (popularity), whether or not the song is explicit (explicit), how loud the song is (loudness), how easy or difficult it is to dance to the song (danceability), etc. Most of the variables are numeric; the exceptions are the artist, song, and genre columns (text) and the explicit column (logical). I'll be focusing on the danceability, popularity, year, and tempo variables. I plan on filtering the data to songs that were released between 1998 and 2008. I'm hoping to draw a conclusion on how the tempo and danceability of a song impact its popularity and how those two variables affect each other. I chose this topic because music plays an influential role in most people's lives. The topic doesn't necessarily have any personal meaning for me, but I thought it would be interesting to explore!
Loaded the necessary libraries.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.2
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
Loaded the dataset from the CSV file.
data <- read_csv("spotifysongs.csv")
## Rows: 2000 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): artist, song, genre
## dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
## lgl (1): explicit
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Filtered the data to songs released between 1998 and 2008 that have a popularity of 70 or higher.
years_popular <- data %>%
  filter(year >= 1998 & year <= 2008 &
           popularity >= 70) # Keep songs from 1998-2008 with a popularity of at least 70
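A quick way to confirm that a filter like this behaves as intended is to check the ranges of the filtered columns. The sketch below uses a small made-up data frame (`toy_songs` is hypothetical, not the Spotify data) and base R's `subset()` so that it runs on its own; the same checks could be applied to `years_popular`.

```r
# Hypothetical toy data standing in for the Spotify dataset
toy_songs <- data.frame(
  year       = c(1997, 1998, 2003, 2008, 2009, 2005),
  popularity = c(80,   70,   65,   90,   75,   72)
)

# Same conditions as the filter above, written with base R
toy_filtered <- subset(toy_songs,
                       year >= 1998 & year <= 2008 & popularity >= 70)

# Sanity checks: every remaining row should satisfy both conditions
range(toy_filtered$year)
min(toy_filtered$popularity)
nrow(toy_filtered)
```

For the real data, the regression summary further down reports 220 residual degrees of freedom, which for a one-predictor model corresponds to 222 songs passing the filter.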
Created a linear regression model using the lm() function that models popularity as a function of tempo for the filtered songs (1998-2008).
tempo_pop <- lm(popularity ~ tempo, data = years_popular) # Making sure to use my filtered data instead of the original dataset
Calculated the correlation between the two variables using the cor() function. The correlation between tempo and popularity is negative, which would mean that as one increases, the other tends to decrease. However, since the correlation is so close to 0, the linear relationship between these two variables is very weak.
cor(years_popular$popularity, years_popular$tempo)
## [1] -0.01895
Used the summary() function to view additional information about the model, such as the coefficients, p-values, and R-squared.
summary(tempo_pop)
##
## Call:
## lm(formula = popularity ~ tempo, data = years_popular)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0454 -2.9296 -0.8472 2.0835 12.0467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.238452 1.100072 68.394 <2e-16 ***
## tempo -0.002540 0.009036 -0.281 0.779
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.644 on 220 degrees of freedom
## Multiple R-squared: 0.0003591, Adjusted R-squared: -0.004185
## F-statistic: 0.07903 on 1 and 220 DF, p-value: 0.7789
popularity = (-0.002540 x tempo) + 75.24
The R^2 value, in this case 0.0003591, indicates that tempo explains only about 0.036% of the variation in popularity. The p-value for tempo, 0.779, is far greater than 0.05, so the relationship between tempo and popularity is not statistically significant in this sample.
Created a graph visualizing the linear regression model. As the graph shows, the regression line is almost completely horizontal, meaning that there isn't much of a linear relationship between the two variables: among these songs, the tempo of a song doesn't appear to affect its popularity.
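In a simple linear regression with one predictor, R-squared equals the square of the correlation coefficient, which gives a way to cross-check the cor() and summary() outputs against each other. The sketch below first demonstrates the identity on simulated data (the `x` and `y` vectors are made up for illustration) and then applies it to the correlation reported above.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated (hypothetical) data with a weak linear relationship
x <- rnorm(200)
y <- 0.1 * x + rnorm(200)

fit <- lm(y ~ x)
r   <- cor(x, y)

# In simple linear regression, R-squared is the squared correlation
all.equal(r^2, summary(fit)$r.squared)

# Applying the same identity to the correlation reported above
(-0.01895)^2        # about 0.000359, matching Multiple R-squared
(-0.01895)^2 * 100  # about 0.036 percent of the variation explained
```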
plot1 <- ggplot(years_popular, aes(x = tempo, y = popularity)) +
geom_point() + # Scatter plot of Tempo vs. Popularity
geom_smooth(method = "lm", se = TRUE, color = "red") + # Linear regression line
labs(x = "Tempo of Song (Beats Per Minute)",
y = "Popularity of Song",
title = "Linear Regression: Tempo vs. Popularity",
caption = "Source: Spotify")+ # Axis labels and title
theme_bw()
plot1
## `geom_smooth()` using formula = 'y ~ x'
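To put the flatness of the line in concrete terms, the fitted equation can be evaluated at two tempos. This is only a sketch: the coefficients are copied by hand from the summary() output, and `predict_popularity` is a hypothetical helper written for this illustration.

```r
# Coefficients copied from the summary() output above
intercept <- 75.238452
slope     <- -0.002540

# Hypothetical helper: predicted popularity for a given tempo (BPM)
predict_popularity <- function(tempo) intercept + slope * tempo

# Predicted popularity at 100 BPM and 150 BPM
predict_popularity(c(100, 150))

# Change in predicted popularity for a 50 BPM increase
slope * 50
```

A 50 BPM increase changes the predicted popularity by only about 0.13 points, which is small next to the residual standard error of 3.644 reported by the model.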
Created a graph that visualizes the relationship between danceability, tempo, and popularity. Both the color and the size of a point represent popularity: the redder and larger a point is, the higher the popularity, and the more gold and smaller it is, the lower the popularity.
plot2 <- ggplot(years_popular, aes(x = danceability, y = tempo, size = popularity, color = popularity)) +
geom_point(alpha = 0.9) +
scale_color_gradient(low = "gold", high = "red") +
labs(title = "Popularity of Songs Based On Tempo and Danceability",
x = "Danceability of Song",
y = "Tempo of Song",
caption = "Source: Spotify") +
theme_bw()
plot2
Used the mutate() function to group tempo into ranges for the legend.
years_tempo <- years_popular %>%
  mutate(mutated_tempo = case_when(
    tempo <= 100 ~ "0 to 100 BPM",
    tempo <= 150 ~ "100 to 150 BPM",
    tempo <= 202 ~ "150 to 202 BPM" # Any tempo above 202 BPM would become NA
  ))
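The same kind of binning can also be written with base R's `cut()`, which assigns each value to an interval and labels it in one step. This is an alternative sketch, not the approach used in this project; the `tempo` vector below is made up for illustration.

```r
# Hypothetical tempo values (BPM) for illustration
tempo <- c(90, 100, 120, 150, 180)

# Intervals are closed on the right by default: (0,100], (100,150], (150,202]
# include.lowest = TRUE also places a value of exactly 0 in the first bin
tempo_range <- cut(tempo,
                   breaks = c(0, 100, 150, 202),
                   labels = c("0 to 100 BPM", "100 to 150 BPM", "150 to 202 BPM"),
                   include.lowest = TRUE)

# Number of songs in each tempo range
table(tempo_range)
```

Because the intervals are closed on the right, a song at exactly 100 BPM lands in the first range, matching the `<=` comparisons used above.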
Used the mutated data to create a different graph that shows the relationship between tempo and danceability.
plot3 <- ggplot(years_tempo, aes(x = danceability, y = tempo, color = mutated_tempo)) +
geom_point() +
scale_color_brewer(palette = "Set1") +
labs(title = "Relationship Between the Danceability and Tempo of a Song",
x = "Danceability of Song",
y = "Tempo of Song",
caption = "Source: Spotify") +
theme_bw()
# Researched a few different ways to add interactivity to the plot and decided on plotly.
# ggplotly() converts plot3 into an interactive plot; hovering over a point shows its danceability, tempo, and mutated_tempo values.
plot3 <- ggplotly(plot3)
plot3
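Filtered the years_popular data by genre so I could focus on pop, rock, and hip hop specifically. Note that %in% matches the genre value exactly, so a song is kept only if its genre field is exactly one of these three strings.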
genres <- years_popular %>%
filter(genre %in% c("pop", "rock", "hip hop"))
Used the filtered data to create a visualization comparing the popularity of songs within each of these genres over time. From the plot, hip hop appears to be the most popular of the three genres.
genre_plot <- genres |>
  ggplot(aes(x = year, y = popularity, color = genre)) +
  geom_point() +
  facet_wrap(~genre)
genre_plot
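A visual comparison across facets can be backed up with a per-genre summary. The sketch below uses a made-up data frame (`toy_genres` is hypothetical) to show one way of computing the median popularity and the number of songs in each genre; with the real data, the same calls could be applied to `genres`.

```r
# Hypothetical toy data; the real values would come from the `genres` data frame
toy_genres <- data.frame(
  genre      = c("pop", "pop", "rock", "rock", "hip hop", "hip hop"),
  popularity = c(72, 78, 71, 75, 74, 82)
)

# Median popularity per genre
median_by_genre <- tapply(toy_genres$popularity, toy_genres$genre, median)
median_by_genre

# Number of songs per genre, to see how much data supports each median
songs_by_genre <- table(toy_genres$genre)
songs_by_genre
```

Since the data was already filtered to a popularity of 70 or higher, any such comparison is between songs that are all popular to begin with, so differences between genres are likely to be small.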
After creating all of my visualizations, I didn't notice any clear patterns. This was slightly disappointing because I had assumed that the tempo or danceability of a song would most definitely have an impact on its popularity, but in this data it doesn't appear to. I wish I could have explored the other variables as well, but for the sake of keeping things concise, I had to make sure I wasn't going overboard. In the future, I think I'll conduct research on songs that I like rather than the most popular ones on Spotify.
I did most of my research on my topic after I had finished my project so that I wasn't influenced by anyone else's research or findings. One article I found, by Daniel Hernandez Gonzalez on Medium.com, covered research very similar to my topic. His question was whether or not a song's BPM could influence its popularity. He concluded that these variables have a weak positive relationship, meaning that the variables increase together but there isn't enough evidence to show whether they influence each other. My conclusion was that they don't influence each other. However, Gonzalez says that there are a lot of other factors (lyrics, artist, genre, etc.) that go into whether or not a song becomes popular, so we wouldn't know for sure without being able to account for these variables.