Source and content

https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year

Context: The top songs by year in the world on Spotify. This data set has several variables about the songs and is based on Billboard.

Content: The most popular songs in the world by year, with 13 variables to be explored. Data were extracted from: http://organizeyourmusic.playlistmachinery.com/

Inspiration: What can we learn about the genre? What is the mean duration, in minutes, of a top song? And what about the scenario by year?

names(Spotify)
##  [1] "X1"        "title"     "artist"    "top genre" "year"      "bpm"      
##  [7] "nrgy"      "dnce"      "dB"        "live"      "val"       "dur"      
## [13] "acous"     "spch"      "pop"
  1. X1 - ID number
  2. title - Song’s title
  3. artist - Song’s artist
  4. top genre - the genre of the track
  5. year - Song’s year in the Billboard
  6. bpm - Beats.Per.Minute - The tempo of the song.
  7. nrgy - Energy - The energy of a song - the higher the value, the more energetic the song.
  8. dnce - Danceability - The higher the value, the easier it is to dance to this song.
  9. dB - Loudness..dB.. - The higher the value, the louder the song
  10. live - Liveness - The higher the value, the more likely the song is a live recording
  11. val - Valence - The higher the value, the more positive mood for the song.
  12. dur - Length - The duration of the song in seconds.
  13. acous - Acousticness.. - The higher the value the more acoustic the song is.
  14. spch - Speechiness - The higher the value the more spoken word the song contains.
  15. pop - Popularity- The higher the value the more popular the song is.

Summary

In this data set there is information about 603 songs on Spotify Billboard charts from 2010 up to 2019.

## [1] 603
summary(Spotify)
##        X1           title              artist           top genre        
##  Min.   :  1.0   Length:603         Length:603         Length:603        
##  1st Qu.:151.5   Class :character   Class :character   Class :character  
##  Median :302.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :302.0                                                           
##  3rd Qu.:452.5                                                           
##  Max.   :603.0                                                           
##       year           bpm             nrgy           dnce      
##  Min.   :2010   Min.   :  0.0   Min.   : 0.0   Min.   : 0.00  
##  1st Qu.:2013   1st Qu.:100.0   1st Qu.:61.0   1st Qu.:57.00  
##  Median :2015   Median :120.0   Median :74.0   Median :66.00  
##  Mean   :2015   Mean   :118.5   Mean   :70.5   Mean   :64.38  
##  3rd Qu.:2017   3rd Qu.:129.0   3rd Qu.:82.0   3rd Qu.:73.00  
##  Max.   :2019   Max.   :206.0   Max.   :98.0   Max.   :97.00  
##        dB               live            val             dur       
##  Min.   :-60.000   Min.   : 0.00   Min.   : 0.00   Min.   :134.0  
##  1st Qu.: -6.000   1st Qu.: 9.00   1st Qu.:35.00   1st Qu.:202.0  
##  Median : -5.000   Median :12.00   Median :52.00   Median :221.0  
##  Mean   : -5.579   Mean   :17.77   Mean   :52.23   Mean   :224.7  
##  3rd Qu.: -4.000   3rd Qu.:24.00   3rd Qu.:69.00   3rd Qu.:239.5  
##  Max.   : -2.000   Max.   :74.00   Max.   :98.00   Max.   :424.0  
##      acous            spch             pop       
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 2.00   1st Qu.: 4.000   1st Qu.:60.00  
##  Median : 6.00   Median : 5.000   Median :69.00  
##  Mean   :14.33   Mean   : 8.358   Mean   :66.52  
##  3rd Qu.:17.00   3rd Qu.: 9.000   3rd Qu.:76.00  
##  Max.   :99.00   Max.   :48.000   Max.   :99.00
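
One of the questions above asks how long a top song is on average. The summary reports dur in seconds, so the mean can be converted to minutes directly. A small sketch using the mean printed above (the variable names are illustrative and not part of the original analysis):

```r
# Mean duration in seconds, copied from the summary(Spotify) output above
mean_dur_sec <- 224.7

# Convert to minutes, then split into whole minutes and leftover seconds
mean_dur_min <- mean_dur_sec / 60
whole_min <- mean_dur_sec %/% 60
rest_sec <- mean_dur_sec %% 60

mean_dur_min
whole_min
rest_sec
```

This puts the mean length of a top song at about 3.75 minutes, or roughly 3 minutes 45 seconds.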

Data analysis

This barplot shows the artists with 10 or more songs on the charts over the period 2010-2019. The artist with the most songs is Katy Perry (17 songs), leading Justin Bieber (16 songs) by only one.

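library(dplyr)  # provides %>%, count() and filter()

# Count songs per artist, most frequent first
Art <- Spotify %>% count(artist, sort = TRUE, name = "Count")

# Keep only artists with 10 or more songs
ArtFil <- Art %>% filter(Count >= 10)

# Widen the bottom margin so the rotated artist names fit
par(mar = c(12, 5, 4, 2) + 0.1)
barplot(ArtFil$Count,
        ylab = "Number of songs",
        col = "#30d6c8",
        names.arg = ArtFil$artist,
        width = 0.01,
        ylim = c(0, 20),
        las = 2
        )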

This barplot shows the top genres of Spotify top songs from 2010 to 2019. As one can see, the most frequent genre in the billboards is pop.
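
The code for the genre barplot is not shown here. A minimal sketch of how such a plot could be produced in base R, assuming the genre labels come from the `top genre` column; the small genres vector below is made-up stand-in data, not the real data set:

```r
# Hypothetical stand-in for Spotify$`top genre` (toy values only)
genres <- c("pop", "dance pop", "pop", "hip hop", "pop", "dance pop")

# Count songs per genre and sort from most to least frequent
genre_counts <- sort(table(genres), decreasing = TRUE)

# Draw the barplot; las = 2 rotates the genre labels
barplot(genre_counts,
        ylab = "Number of songs",
        col = "#30d6c8",
        las = 2)
```

On the real data, the toy vector would be replaced by the genre column itself.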

Most top songs fall between ~60 bpm and ~170 bpm, although the summary above shows values ranging from 0 to 206 bpm. The mean is ~119 bpm (median 120). Songs in the pop genre, the most frequent genre, have tempos of no less than ~80 bpm, and their average bpm does not differ much from that of the top songs overall.

# Songs whose genre is exactly "pop"
Pop <- filter(Spotify, `top genre` == "pop")

# Compare tempo overall with tempo in the pop genre
boxplot(Spotify$bpm, Pop$bpm,
        main = "Beats per minute",
        xlab = "bpm",
        names = c("bpm overall", "bpm Pop genre"),
        col = c("orange", "red"),
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)

This scatterplot suggests an association between a song's energy and its popularity, although it cannot prove that energy directly affects popularity. On both scales, a higher value means a more popular or a more energetic song. Most of the popular songs lie between 50 and 85 on the energy scale; however, the most popular song in this data set has an energy below 50. This suggests that more than one factor determines the popularity of a song.

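library(ggplot2)  # provides ggplot(), aes(), geom_point() and ggtitle()

# Energy on x, popularity on y; color also encodes popularity
ggplot(Spotify,
       aes(x = nrgy, y = pop, color = pop)) +
  geom_point(size = 5) +
  ggtitle("Relations between song popularity and its energy")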
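
A scatterplot alone cannot establish that energy drives popularity. One way to quantify the association is a correlation test. The sketch below uses simulated values in place of the nrgy and pop columns, so its numbers say nothing about the real data:

```r
set.seed(1)  # make the simulated stand-in data reproducible

# Simulated stand-ins for the nrgy and pop columns (not the real data)
nrgy_sim <- runif(100, min = 0, max = 100)
pop_sim  <- 60 + 0.1 * nrgy_sim + rnorm(100, sd = 15)

# Pearson correlation with a significance test
ct <- cor.test(nrgy_sim, pop_sim)

ct$estimate  # correlation coefficient, between -1 and 1
ct$p.value   # p-value for the null hypothesis of zero correlation
```

On the real data the call would be cor.test(Spotify$nrgy, Spotify$pop). Even a significant correlation would describe an association, not a cause.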

Here we can see the relation between song popularity and valence. Valence is the positivity of the song: the higher the valence, the more positive the song. As shown, the most positive songs are not necessarily the most popular ones. The majority lie between 50 and 85 on the popularity scale regardless of valence. However, duration seems to be a relevant factor in popularity, since the longest songs (light blue) are lower on both the popularity and the valence scale.

# Valence on x, popularity on y; color encodes duration in seconds
ggplot(Spotify,
       aes(x = val, y = pop, color = dur)) +
  geom_point(size = 4) +
  ggtitle("Relations between song popularity and valence")

The linear regression below tests how well a song’s popularity can be predicted from its danceability and tempo (bpm).

The Residuals vs Fitted plot shows no obvious distinct pattern, which suggests that the relationship between the chosen variables is not non-linear. However, the Normal Q-Q plot indicates that there might be a problem with the distribution of the residuals. The Scale-Location plot indicates that the residuals do not have constant variance, because the points are concentrated in one part of the line. The Residuals vs Leverage plot shows that observation 443 has a great impact on the fit, whereas observation 363 does not reach the Cook’s distance contour and is therefore not crucial.

The adjusted R-squared is 0.01138 and the p-value of the F-test is 0.01191. The model as a whole is therefore statistically significant at the 5% level, but it explains only about 1% of the variance in popularity, so danceability and tempo (bpm) are weak predictors in practice. Of the two, only dnce is significant (p = 0.00329); bpm is not (p = 0.39465), which suggests that danceability alone would predict popularity about as well as both variables together.

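# Show the four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
# Model popularity as a linear function of danceability and tempo
fit <- lm(pop ~ dnce + bpm, data = Spotify)
plot(fit,
     col = c("orange", "blue")
     )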
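
The influential observations in the Residuals vs Leverage panel can also be listed numerically. A sketch, using R's built-in mtcars data as a stand-in because the Spotify data frame is not bundled here; the 4/n cutoff is a common rule of thumb and an assumption, not something this analysis used:

```r
# Stand-in model on built-in data; for this report it would be
# lm(pop ~ dnce + bpm, data = Spotify)
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation
cd <- cooks.distance(fit_demo)

# Rule-of-thumb cutoff (assumption): flag points with distance > 4 / n
cutoff <- 4 / nrow(mtcars)
influential <- names(cd)[cd > cutoff]

# Most influential observation and all flagged ones
names(which.max(cd))
influential
```

Applied to the real model, the same calls would return row numbers that can be compared with the labels in the diagnostic plot.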

summary(fit)
## 
## Call:
## lm(formula = pop ~ dnce + bpm, data = Spotify)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.056  -6.327   2.481   9.600  32.568 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 55.67675    4.32196  12.882  < 2e-16 ***
## dnce         0.13090    0.04436   2.951  0.00329 ** 
## bpm          0.02039    0.02393   0.852  0.39465    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.43 on 600 degrees of freedom
## Multiple R-squared:  0.01466,    Adjusted R-squared:  0.01138 
## F-statistic: 4.464 on 2 and 600 DF,  p-value: 0.01191
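
The figures in this output can be cross-checked by hand. A short sketch that recomputes the t values and the adjusted R-squared from the numbers printed above (603 observations, 2 predictors); the variable names are illustrative:

```r
# Values copied from the summary(fit) output above
est <- c(dnce = 0.13090, bpm = 0.02039)
se  <- c(dnce = 0.04436, bpm = 0.02393)
r2  <- 0.01466
n   <- 603   # number of songs
p   <- 2     # number of predictors

# t value = estimate / standard error
t_val <- est / se

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(t_val, 3)
round(adj_r2, 5)

# Expected change in popularity for a 10-point rise in danceability
10 * est[["dnce"]]
```

The recomputed values agree with the printed ones up to rounding. The last line shows how small the effect is: a 10-point increase in danceability corresponds to a rise of only about 1.3 popularity points.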

To analyze more recent data, an interactive scatterplot was created. This plot contains information about the most popular songs in the world in 2019 on the music streaming platform Spotify. It shows the relations between various aspects of each song: the size of the point indicates the popularity of the song, the color indicates duration, the y axis shows valence and the x axis shows beats per minute. A common feature of the majority of the most popular songs is that their duration is rather average (~200 seconds), while valence does not seem to affect the success of the songs. Also, slower songs seem to be more popular in 2019 than faster ones.

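library(plotly)  # provides ggplotly()

# Songs from 2019 only; year is numeric, so compare against a number
p <- Spotify %>%
  filter(year == 2019) %>%
  ggplot(aes(bpm, val, size = pop, color = dur)) +
  geom_point() +
  theme_bw() +
  ggtitle("Interactive Scatterplot on the most popular songs in 2019")

# Convert the static ggplot into an interactive plotly widget
ggplotly(p)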

Conclusions of the analysis

Between 2010 and 2019, the most popular or successful musical genre in the world, according to the Spotify Billboard data, is pop, staying true to its name as an abbreviation of “popular”.

The artist with the most songs on Spotify Billboard is Katy Perry, with 17 songs in the top charts, followed by Justin Bieber with 16 songs.

There is no clear relation between the valence of a song and its popularity; however, the more energetic songs seem to be more successful.

In 2019, slower-tempo songs were more popular than faster-tempo songs. Another relevant factor is duration: songs of average length (~200 seconds) seem more likely to be successful than longer or shorter songs.

This data is useful for determining the best target audience for up-and-coming artists and how to write songs with a better chance of succeeding, as well as for marketing and commercial analysis.