https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year
Context The top songs BY YEAR in the world by Spotify. This data set has several variables about the songs and is based on Billboard.
Content There are the most popular songs in the world by year and 13 variables to be explored. Data were extracted from: http://organizeyourmusic.playlistmachinery.com/
Inspiration What can we know about the genre? What is the mean duration, in minutes, of a top song? And what about the scenario by year?
names(Spotify)
## [1] "X1" "title" "artist" "top genre" "year" "bpm"
## [7] "nrgy" "dnce" "dB" "live" "val" "dur"
## [13] "acous" "spch" "pop"
In this data set there is information about 603 songs on Spotify Billboard charts from 2010 up to 2019.
## [1] 603
summary(Spotify)
##        X1           title              artist           top genre
##  Min.   :  1.0   Length:603         Length:603         Length:603
##  1st Qu.:151.5   Class :character   Class :character   Class :character
##  Median :302.0   Mode  :character   Mode  :character   Mode  :character
##  Mean   :302.0
##  3rd Qu.:452.5
##  Max.   :603.0
##       year           bpm             nrgy           dnce
##  Min.   :2010   Min.   :  0.0   Min.   : 0.0   Min.   : 0.00
##  1st Qu.:2013   1st Qu.:100.0   1st Qu.:61.0   1st Qu.:57.00
##  Median :2015   Median :120.0   Median :74.0   Median :66.00
##  Mean   :2015   Mean   :118.5   Mean   :70.5   Mean   :64.38
##  3rd Qu.:2017   3rd Qu.:129.0   3rd Qu.:82.0   3rd Qu.:73.00
##  Max.   :2019   Max.   :206.0   Max.   :98.0   Max.   :97.00
##        dB               live            val             dur
##  Min.   :-60.000   Min.   : 0.00   Min.   : 0.00   Min.   :134.0
##  1st Qu.: -6.000   1st Qu.: 9.00   1st Qu.:35.00   1st Qu.:202.0
##  Median : -5.000   Median :12.00   Median :52.00   Median :221.0
##  Mean   : -5.579   Mean   :17.77   Mean   :52.23   Mean   :224.7
##  3rd Qu.: -4.000   3rd Qu.:24.00   3rd Qu.:69.00   3rd Qu.:239.5
##  Max.   : -2.000   Max.   :74.00   Max.   :98.00   Max.   :424.0
##      acous            spch             pop
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.00
##  1st Qu.: 2.00   1st Qu.: 4.000   1st Qu.:60.00
##  Median : 6.00   Median : 5.000   Median :69.00
##  Mean   :14.33   Mean   : 8.358   Mean   :66.52
##  3rd Qu.:17.00   3rd Qu.: 9.000   3rd Qu.:76.00
##  Max.   :99.00   Max.   :48.000   Max.   :99.00
This barplot shows the artists with 10 or more songs in the charts over the period 2010-2019. The artist with the most songs is Katy Perry (17 songs), leading Justin Bieber (16 songs) by only 1.
library(dplyr)  # provides %>%, count() and filter()
Art <- Spotify %>% count(artist, sort = TRUE, name = "Count")
ArtFil <- Art %>% filter(Count >= 10)
par(mar = c(12, 5, 4, 2)+ 0.1)
barplot(ArtFil$Count,
ylab = "Number of songs",
col = "#30d6c8",
names.arg= ArtFil$artist,
width= 0.01,
ylim = c(0,20),
las = 2
)
This barplot shows the top genres of Spotify top songs from 2010 to 2019. As one can see, the most frequent genre in the billboards is pop.
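The code for the genre barplot is not shown in this report. A minimal sketch of how such a plot could be produced follows, using base R only. The `toy` data frame and its values are hypothetical stand-ins for `Spotify`; only the `top genre` column name comes from the data set.

```r
# Hypothetical toy sample standing in for the real Spotify data frame.
toy <- data.frame(
  `top genre` = c("pop", "dance pop", "pop", "canadian pop", "dance pop", "pop"),
  check.names = FALSE  # keep the space in the column name
)

# Count songs per genre and sort from most to least frequent.
genre_counts <- sort(table(toy$`top genre`), decreasing = TRUE)

# Draw the barplot; las = 2 rotates the genre labels so they stay readable.
par(mar = c(8, 5, 4, 2) + 0.1)
barplot(genre_counts,
        ylab = "Number of songs",
        col = "#30d6c8",
        las = 2)
```

With the real data, replacing `toy` by `Spotify` should be enough, possibly after keeping only the most frequent genres so that the labels stay legible.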
In general, most top songs fall between ~60 bpm and ~170 bpm. The mean tempo is ~118.5 bpm and the median is 120 bpm, as the summary above shows. Songs in the pop genre, which is the most frequent one, are no slower than ~80 bpm, and their average tempo does not differ greatly from that of the top songs overall.
Pop <- filter(Spotify, `top genre` == "pop")
boxplot(Spotify$bpm,Pop$bpm,
main = "Beats per minute",
xlab = "bpm",
names = c("bpm overall", "bpm Pop genre"),
col = c("orange", "red"),
border = "brown",
horizontal = TRUE,
notch = TRUE)
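The tempo figures quoted alongside the boxplot can also be checked numerically rather than read off the plot. The sketch below does this in base R; the `toy` data frame and its values are hypothetical, and only the column names `top genre` and `bpm` come from the data set.

```r
# Hypothetical toy sample; replace with the real Spotify data frame.
toy <- data.frame(
  `top genre` = c("pop", "dance pop", "pop", "dance pop", "pop"),
  bpm = c(100, 120, 130, 90, 160),
  check.names = FALSE
)

# Subset to the pop genre, mirroring the Pop data frame used for the boxplot.
pop_only <- toy[toy$`top genre` == "pop", ]

# Small helper (hypothetical name) returning the statistics of interest.
tempo_stats <- function(x) c(mean = mean(x), median = median(x), min = min(x))

# One row per group, so the two can be compared side by side.
bpm_stats <- rbind(
  overall = tempo_stats(toy$bpm),
  pop     = tempo_stats(pop_only$bpm)
)
print(bpm_stats)
```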
This scatterplot suggests that a song's popularity is related to its energy, although a plot alone cannot prove a direct effect. On both scales a higher number means more: a more popular or a more energetic song. Most of the popular songs lie between 50 and 85 on the energy scale; however, the most popular song in this data set is below 50 on the energy scale. This leads to the conclusion that more than one factor is responsible for determining the popularity of songs.
ggplot(Spotify,
aes(x=nrgy, y=pop, color = pop)) +
geom_point(size=5) +
ggtitle("Relations between song popularity and its energy")
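A scatterplot shows the shape of a relationship but not its strength. One way to quantify it is a Pearson correlation test between energy and popularity, sketched below in base R. The `toy` values are hypothetical; only the column names `nrgy` and `pop` come from the data set, and the choice of Pearson correlation is an assumption rather than the method used in this report.

```r
# Hypothetical toy sample; replace with the real Spotify data frame.
toy <- data.frame(
  nrgy = c(40, 55, 60, 70, 75, 80, 85, 90),
  pop  = c(99, 60, 72, 65, 80, 70, 78, 62)
)

# Pearson correlation with a test of the null hypothesis of zero correlation.
ct <- cor.test(toy$nrgy, toy$pop, method = "pearson")

# The estimate lies in [-1, 1]; values near 0 indicate a weak linear relation.
print(ct$estimate)
print(ct$p.value)
```

A coefficient close to zero, or a large p-value, would mean that the linear association between energy and popularity is weak, whatever the plot appears to show.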
Here we can see the relation between song popularity and valency. Valency is the positivity of the song: the higher the valency number, the more positive the song. As shown, the most positive songs are not necessarily the most popular ones. The majority lie between 50 and 85 on the popularity scale regardless of valency. However, duration seems to be a relevant criterion for popularity, since the longest songs (light blue) are lower on both the popularity and the valency scale.
ggplot(Spotify,
aes(x=val, y=pop, color = dur)) +
geom_point(size=4) +
ggtitle("Relations between Song popularity and valency")
The linear regression below assesses how well a song's popularity can be predicted from its danceability and tempo (bpm).
The Residuals vs Fitted plot shows no obvious distinct pattern, so there is no evidence of a non-linear relationship between the chosen variables. However, the Normal Q-Q plot indicates that there might be a problem with the distribution of the residuals. The Scale-Location plot suggests that the residuals do not have constant variance, since the points are concentrated along one part of the line. The Residuals vs Leverage plot marks observation 443 as the one with the greatest influence on the fit, while observation 363 does not reach the Cook's distance contour and is therefore not critical.
The adjusted R-squared is 0.01138, so the model explains only about 1% of the variance in popularity. The overall p-value of 0.01191 shows that the model fits better than one with no predictors, but with such a low R-squared the regression cannot be called successful at predicting a song's popularity. Of the two predictors, only dnce is significant (p = 0.00329); bpm is not (p = 0.39465), which suggests that danceability alone would predict popularity about as well.
par(mfrow = c(2,2))
fit <- lm(pop ~ dnce + bpm, data = Spotify)
plot(fit,
col = c("orange", "blue")
)
summary(fit)
##
## Call:
## lm(formula = pop ~ dnce + bpm, data = Spotify)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.056 -6.327 2.481 9.600 32.568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.67675 4.32196 12.882 < 2e-16 ***
## dnce 0.13090 0.04436 2.951 0.00329 **
## bpm 0.02039 0.02393 0.852 0.39465
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.43 on 600 degrees of freedom
## Multiple R-squared: 0.01466, Adjusted R-squared: 0.01138
## F-statistic: 4.464 on 2 and 600 DF, p-value: 0.01191
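Whether tempo adds anything once danceability is in the model can be checked by comparing two nested models with an F-test. The sketch below does so in base R on simulated data; the `toy` data frame, the seed and the simulation coefficients are all hypothetical, and only the variable names `pop`, `dnce` and `bpm` come from the data set.

```r
# Simulated toy data (hypothetical values); replace with the real Spotify data.
set.seed(1)
toy <- data.frame(
  dnce = runif(50, min = 30, max = 90),
  bpm  = runif(50, min = 60, max = 180)
)
toy$pop <- 55 + 0.13 * toy$dnce + rnorm(50, sd = 14)

# Smaller model: danceability only. Larger model: danceability and tempo.
fit_dnce <- lm(pop ~ dnce, data = toy)
fit_both <- lm(pop ~ dnce + bpm, data = toy)

# F-test for the nested models: a large p-value means bpm adds little.
cmp <- anova(fit_dnce, fit_both)
print(cmp)

# Adjusted R-squared: the share of variance explained, penalised for model size.
print(summary(fit_both)$adj.r.squared)
```

With a single added predictor, the p-value of this comparison matches the one reported for `bpm` in the coefficient table of the larger model.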
To analyze more recent data, an interactive scatterplot was created. This plot shows the most popular songs in the world in 2019 on the music streaming platform Spotify and the relations between various aspects of each song. The size of each point indicates the popularity of the song and the color indicates its duration; valency is on the y axis and beats per minute on the x axis. A common feature of the most popular songs is a rather average duration (~200 seconds), while valency does not seem to affect the success of the songs. Also, slower songs seem to have been more popular in 2019 than faster ones.
p <- Spotify %>%
filter(year == 2019) %>%
ggplot( aes(bpm, val, size = pop, color=dur)) +
geom_point() +
theme_bw()+
ggtitle("Interactive Scatterplot on the most popular songs in 2019")
ggplotly(p)
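The question of how long a top song lasts in minutes can be answered directly from `dur`. The sketch below assumes that `dur` is measured in seconds, which its summary range (134 to 424) suggests; under that assumption the overall mean of 224.7 corresponds to about 3.7 minutes. The `toy` data frame and its values are hypothetical.

```r
# Hypothetical toy sample; replace with the real Spotify data frame.
toy <- data.frame(
  year = c(2018, 2019, 2019, 2019),
  dur  = c(240, 180, 210, 195)  # assumed to be seconds
)

# Keep only the 2019 songs, as in the interactive scatterplot.
songs_2019 <- toy[toy$year == 2019, ]

# Convert the mean duration from seconds to minutes.
mean_minutes <- mean(songs_2019$dur) / 60
print(mean_minutes)
```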
Between 2010 and 2019, the most successful musical genre in the world according to the Spotify Billboard data was pop, staying true to its name, an abbreviation of “popular”.
The artist with the most songs on Spotify Billboard is Katy Perry, with 17 songs in the top charts, followed by Justin Bieber with 16 songs.
There is no strong relation between the valency of a song and its popularity; however, the more energetic songs seem to be more successful.
In 2019, slower-tempo songs were more popular than faster-tempo songs. Another notable factor is duration: a song of average length (~200 seconds) seems more likely to be successful than a longer or shorter one.
These data are useful for determining the best target audience for up-and-coming artists and for writing songs with a better chance of success, as well as for marketing and commercial analysis.