This analysis explores trends in the characteristics of songs from Spotify, specifically pop songs from 1999 to 2020. The categories that will be explored include:
Danceability : How good a track is for dancing
Loudness : The general loudness of a track measured in decibels
Years : The year the song was released
Explicit : Whether the song contains explicit content or not.
The dataset is sourced from Spotify.
Load Library
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load Dataset
spotify <- read_csv("spotifysongs.csv")
Rows: 2000 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): artist, song, genre
dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
lgl (1): explicit
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
artist song duration_ms explicit
Length:2000 Length:2000 Min. :113000 Mode :logical
Class :character Class :character 1st Qu.:203580 FALSE:1449
Mode :character Mode :character Median :223280 TRUE :551
Mean :228748
3rd Qu.:248133
Max. :484146
year popularity danceability energy
Min. :1998 Min. : 0.00 Min. :0.1290 Min. :0.0549
1st Qu.:2004 1st Qu.:56.00 1st Qu.:0.5810 1st Qu.:0.6220
Median :2010 Median :65.50 Median :0.6760 Median :0.7360
Mean :2009 Mean :59.87 Mean :0.6674 Mean :0.7204
3rd Qu.:2015 3rd Qu.:73.00 3rd Qu.:0.7640 3rd Qu.:0.8390
Max. :2020 Max. :89.00 Max. :0.9750 Max. :0.9990
key loudness mode speechiness
Min. : 0.000 Min. :-20.514 Min. :0.0000 Min. :0.02320
1st Qu.: 2.000 1st Qu.: -6.490 1st Qu.:0.0000 1st Qu.:0.03960
Median : 6.000 Median : -5.285 Median :1.0000 Median :0.05985
Mean : 5.378 Mean : -5.512 Mean :0.5535 Mean :0.10357
3rd Qu.: 8.000 3rd Qu.: -4.168 3rd Qu.:1.0000 3rd Qu.:0.12900
Max. :11.000 Max. : -0.276 Max. :1.0000 Max. :0.57600
acousticness instrumentalness liveness valence
Min. :0.0000192 Min. :0.0000000 Min. :0.0215 Min. :0.0381
1st Qu.:0.0140000 1st Qu.:0.0000000 1st Qu.:0.0881 1st Qu.:0.3867
Median :0.0557000 Median :0.0000000 Median :0.1240 Median :0.5575
Mean :0.1289549 Mean :0.0152260 Mean :0.1812 Mean :0.5517
3rd Qu.:0.1762500 3rd Qu.:0.0000683 3rd Qu.:0.2410 3rd Qu.:0.7300
Max. :0.9760000 Max. :0.9850000 Max. :0.8530 Max. :0.9730
tempo genre
Min. : 60.02 Length:2000
1st Qu.: 98.99 Class :character
Median :120.02 Mode :character
Mean :120.12
3rd Qu.:134.27
Max. :210.85
Remove the N/A Values
spotify <- spotify %>% drop_na()
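Before dropping rows, it can help to check how many values are actually missing, since `drop_na()` removes every row that has an `NA` in any column. The following is only a minimal sketch on a small made-up tibble; the `toy` data and its values are hypothetical and are not taken from the Spotify file.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the real dataset
toy <- tibble(
  song         = c("a", "b", "c", "d"),
  danceability = c(0.70, NA, 0.55, 0.81),
  loudness     = c(-5.2, -6.1, NA, -4.0)
)

# Count the missing values in each column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Keep only the rows with no missing value in any column
toy_clean <- toy %>% drop_na()

print(na_counts)
print(nrow(toy_clean))
```

If every count is zero, `drop_na()` leaves the data unchanged, which is a quick way to confirm that the cleaning step did not silently discard songs.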
Let's filter information from the data set
On this occasion I want to use the characteristics of a single genre (pop) in my visualization.
pop_songs <- spotify %>% filter(genre == "pop")
ggplot(pop_songs, aes(x = year, y = danceability, color = as.factor(explicit))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, aes(color = "Trend Line")) +
  labs(
    title = "Danceability of Pop Songs Over the Years",
    x = "Year",
    y = "Danceability",
    caption = "Source: Spotify",
    color = "Song Type"
  ) +
  # breaks fixes the legend order so each label lines up with the right value
  scale_color_manual(
    values = c("TRUE" = "purple", "FALSE" = "orange", "Trend Line" = "green"),
    breaks = c("TRUE", "FALSE", "Trend Line"),
    labels = c("Explicit", "Non-Explicit", "Trend Line")
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "right")
`geom_smooth()` using formula = 'y ~ x'
Linear Regression Analysis
Let's analyze how danceability is affected by loudness and explicit content, with popularity and energy also included as predictors.
Linear regression analysis
model <- lm(danceability ~ loudness + as.numeric(explicit) + popularity + energy, data = pop_songs)
Summary of the linear regression analysis
summary(model)
Call:
lm(formula = danceability ~ loudness + as.numeric(explicit) +
popularity + energy, data = pop_songs)
Residuals:
Min 1Q Median 3Q Max
-0.44939 -0.07688 0.01187 0.08866 0.30136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4950636 0.0618644 8.002 1.19e-14 ***
loudness -0.0023410 0.0044841 -0.522 0.60190
as.numeric(explicit) 0.0444760 0.0192404 2.312 0.02128 *
popularity 0.0001389 0.0002834 0.490 0.62425
energy 0.1753682 0.0544465 3.221 0.00138 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1346 on 423 degrees of freedom
Multiple R-squared: 0.0451, Adjusted R-squared: 0.03607
F-statistic: 4.995 on 4 and 423 DF, p-value: 0.0006078
According to this information the formula for my model is:
Danceability = 0.4951 − 0.0023 × Loudness + 0.0445 × Explicit + 0.000139 × Popularity + 0.1754 × Energy
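Also, according to this model, explicit content (p = 0.021) and energy (p = 0.001) are statistically significant predictors of danceability in pop songs at the 0.05 level. Holding the other variables constant, explicit songs score about 0.044 higher in danceability, and danceability rises by about 0.175 for each one-unit increase in energy. Loudness (p = 0.60) and popularity (p = 0.62) are not statistically significant, so this model gives no evidence that they are related to danceability in pop songs. The R-squared value is also low (0.045), meaning the model explains only about 4.5% of the variation in danceability, so other variables likely affect it as well.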
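To make the formula concrete, here is a small worked example that plugs one song into it. The coefficients are copied from the summary output; the song's values (loudness of -5, explicit, popularity of 70, energy of 0.7) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model) output
coefs <- c(
  intercept  = 0.4950636,
  loudness   = -0.0023410,
  explicit   = 0.0444760,
  popularity = 0.0001389,
  energy     = 0.1753682
)

# Hypothetical song; explicit = 1 means TRUE, and intercept is always 1
song <- c(intercept = 1, loudness = -5, explicit = 1, popularity = 70, energy = 0.7)

# Predicted danceability is the sum of each coefficient times its value
predicted <- sum(coefs * song[names(coefs)])

# The same song marked as non-explicit (explicit = 0)
song_clean <- replace(song, "explicit", 0)
predicted_clean <- sum(coefs * song_clean[names(coefs)])

print(round(predicted, 4))
print(round(predicted - predicted_clean, 4))
```

For this made-up song the prediction comes out at roughly 0.68, and the gap between the explicit and non-explicit versions is exactly the explicit coefficient, which is how that coefficient should be read.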
Brief Essay
How you cleaned the dataset up
With str() I was able to see the internal structure of the data set. Secondly, with summary() I was able to check whether there were any missing values. Thirdly, with spotify <- spotify %>% drop_na() I removed any rows with missing values from the dataset. Finally, since I only wanted to work with one genre of songs, pop, I used pop_songs <- spotify %>% filter(genre == "pop") to keep only the songs that fit that genre.
What the visualization represents, any interesting patterns or surprises that arise within the visualization.
The visualization shows how the danceability of pop songs changed from 1999 to 2020. Each point represents a song: the x-axis shows the year of release, the y-axis shows the danceability score, and the color of the point shows whether the song is explicit or not. Finally, the trend line shows the general trajectory of danceability over the years, that is, whether it has increased, decreased, or stabilized.
Patterns or surprises
A surprise for me was that the trend line was higher in 2020 than in the years before, and that the line started increasing around the time explicit songs started appearing. Also, between 2000 and 2010, some of the outliers were explicit songs, which could relate to the change in song trends during that decade.
Anything that you might have shown that you could not get to work or that you wished you could have included.
I was also working on the percentage of explicit songs over the years; however, I could not get it to work or include it in the graph.
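One possible way to get that percentage is to group the songs by year and take the mean of the logical explicit column, since the mean of TRUE/FALSE values is the share of TRUE. This is only a sketch under assumptions: the `toy_songs` data below is made up, and with the real data `pop_songs` would take its place.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; with the real data this would be pop_songs
toy_songs <- tibble(
  year     = c(2000, 2000, 2001, 2001, 2001, 2002),
  explicit = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# mean() of a logical column gives the share of TRUE values
explicit_by_year <- toy_songs %>%
  group_by(year) %>%
  summarise(pct_explicit = 100 * mean(explicit), n_songs = n())

print(explicit_by_year)

# A separate line chart of the yearly percentage
p <- ggplot(explicit_by_year, aes(x = year, y = pct_explicit)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Share of Explicit Pop Songs by Year",
    x = "Year",
    y = "Explicit songs (%)"
  )
```

The percentage runs from 0 to 100 while danceability runs from 0 to 1, so a separate chart is probably simpler than forcing both onto the same axes.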