Introduction

We challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.

Taking an analytics approach, we aim to use information about a song’s properties to predict its popularity. The dataset songs.csv consists of all songs that made it to the Top 10 of the Billboard Hot 100 Chart from 1990 to 2010, plus a sample of additional songs that didn’t make the Top 10. The data comes from three sources: Wikipedia, Billboard.com, and EchoNest.

The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.

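Load libraries and data

library(dplyr)

# Load the dataset (assumes songs.csv is in the working directory)
songs = read.csv("songs.csv")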

Variable Description

year  songtitle                           artistname         timesignature  loudness  tempo    key  energy     pitch  timbre_0_min  timbre_0_max
2010  This Is the House That Doubt Built  A Day to Remember  3              -4.262    91.525   11   0.9666556  0.024  0.002         57.342
2010  Sticks & Bricks                     A Day to Remember  4              -4.051    140.048  10   0.9847095  0.025  0.000         57.414
2010  All I Want                          A Day to Remember  4              -3.571    160.512  2    0.9899004  0.026  0.003         57.422
2010  It’s Complicated                    A Day to Remember  4              -3.815    97.525   1    0.9392072  0.013  0.000         57.765
2010  2nd Sucks                           A Day to Remember  4              -4.707    140.053  6    0.9877376  0.063  0.000         56.872
2010  Better Off This Way                 A Day to Remember  4              -3.807    160.366  4    0.9799530  0.038  0.000         57.083

Creating Model

Data Split

Train set: years 1990 to 2009

Test set: year 2010

SongsTrain = songs %>% filter(year >= 1990) %>% filter(year <= 2009)
SongsTest = songs %>% filter(year == 2010)
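As a quick sanity check, a year-based split like the one above should assign every row to exactly one of the two sets. The sketch below illustrates this on a small made-up data frame; `toy` and its values are hypothetical and not taken from songs.csv, and base R `subset` is used instead of dplyr only to keep the example dependency-free.

```r
# Hypothetical miniature of the songs data: only the columns needed for the split
toy <- data.frame(
  year  = c(1990, 2005, 2009, 2010, 2010),
  Top10 = c(1, 0, 0, 1, 0)
)

# Same split rule as above: 1990 through 2009 for training, 2010 for testing
toy_train <- subset(toy, year >= 1990 & year <= 2009)
toy_test  <- subset(toy, year == 2010)

# The two sets should be disjoint and together cover every row
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))
stopifnot(length(intersect(rownames(toy_train), rownames(toy_test))) == 0)
```

Splitting by year rather than at random means the model is always evaluated on songs released after the ones it was trained on.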

Creating Model

In this problem, our outcome variable is “Top10” - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. Since the outcome variable is binary, we will build a logistic regression model.

In our case, we want to exclude some of the variables in our dataset from being used as independent variables (“year”, “songtitle”, “artistname”, “songID”, and “artistID”). To do this, we can use the following trick. First define a vector of variable names called nonvars - these are the variables that we won’t use in our model.

Additional, we want to remove

To remove these variables from your training and testing sets, type the following commands in your R console:

nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
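To see what the `%in%` trick does, here is a minimal sketch on a made-up one-row data frame (`toy` and its values are hypothetical). `names(toy) %in% nonvars` gives a logical vector marking the columns to exclude, and negating it with `!` keeps everything else.

```r
# Hypothetical one-row data frame with two identifier columns and two model columns
toy <- data.frame(
  year      = 2010,
  songtitle = "Example Song",
  loudness  = -4.2,
  Top10     = 1
)

nonvars <- c("year", "songtitle")

# TRUE for columns listed in nonvars, FALSE otherwise
is_nonvar <- names(toy) %in% nonvars

# Keep only the columns that are not in nonvars;
# drop = FALSE keeps the result a data frame even if a single column remains
kept <- toy[, !is_nonvar, drop = FALSE]

print(names(kept))
```

Names in `nonvars` that do not occur in the data frame are simply ignored, so the same vector can be applied to both the training and the test set.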

Build the optimized model (formula selected with the step function)

model = glm(formula = Top10 ~ timesignature + timesignature_confidence + 
    loudness + tempo_confidence + key + key_confidence + energy + 
    pitch + timbre_0_min + timbre_0_max + timbre_1_min + timbre_2_min + 
    timbre_3_max + timbre_4_min + timbre_4_max + timbre_5_min + 
    timbre_6_min + timbre_6_max + timbre_7_min + timbre_7_max + 
    timbre_8_min + timbre_10_min + timbre_10_max + timbre_11_min + 
    timbre_11_max, family = binomial, data = SongsTrain)
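The call to `step` itself is not shown above. As a hedged sketch of how a reduced formula like this could be obtained, the example below fits a full logistic regression on synthetic data and lets `step` perform backward selection by AIC. The data frame `toy`, its columns, and the coefficients used to simulate the outcome are all assumptions made for illustration; this is not necessarily the exact procedure used for the model above.

```r
set.seed(1)
n <- 200

# Synthetic predictors: two that drive the outcome and one pure-noise column
toy <- data.frame(
  loudness = rnorm(n),
  energy   = rnorm(n),
  noise    = rnorm(n)
)

# Simulate a binary outcome that depends on loudness and energy only
p <- plogis(-1 + 1.0 * toy$loudness - 0.8 * toy$energy)
toy$Top10 <- rbinom(n, 1, p)

# Start from the model with all predictors
full <- glm(Top10 ~ ., data = toy, family = binomial)

# With no scope given, step() searches backward from the full model,
# dropping terms while doing so lowers the AIC
reduced <- step(full, trace = 0)

print(formula(reduced))
```

On the real data, the analogous starting point would be `glm(Top10 ~ ., data = SongsTrain, family = binomial)`.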

Prediction at threshold 0.35

testpredict = predict(model, newdata = SongsTest, type = 'response')
table(SongsTest$Top10, testpredict >= 0.35)
##    
##     FALSE TRUE
##   0   300   14
##   1    36   23

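Model Accuracy = 0.866 (323 of 373 test songs classified correctly)

Baseline Accuracy = 0.842 (always predicting "not Top 10": 314 of 373)

Model Sensitivity = 0.390 (23 of the 59 actual Top 10 songs detected)

Model Specificity = 0.955 (300 of the 314 non-Top 10 songs correctly rejected)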
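The metrics can be recomputed directly from the confusion matrix printed above. The sketch below re-enters the four counts by hand, with rows as the actual class and columns as the prediction, and treats Top10 = 1 as the positive class; the variable names are chosen for illustration.

```r
# Confusion matrix from the test set (matrix() fills column by column)
cm <- matrix(
  c(300, 36, 14, 23),
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

# Share of all test songs classified correctly: (300 + 23) / 373
accuracy <- sum(diag(cm)) / sum(cm)

# Baseline: always predict the majority class (not Top 10): 314 / 373
baseline <- sum(cm["0", ]) / sum(cm)

# Sensitivity: share of actual Top 10 songs that were detected: 23 / 59
sensitivity <- cm["1", "TRUE"] / sum(cm["1", ])

# Specificity: share of non-Top 10 songs correctly rejected: 300 / 314
specificity <- cm["0", "FALSE"] / sum(cm["0", ])

print(round(c(accuracy = accuracy, baseline = baseline,
              sensitivity = sensitivity, specificity = specificity), 3))
```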

Conclusion

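The model provides conservative predictions: it rarely predicts that a song will make it to the Top 10. It detects fewer than half of the actual Top 10 songs, but the songs it does flag are far more likely than average to be hits: of the 37 test songs predicted to reach the Top 10, 23 (about 62%) actually did, compared with a base rate of about 16% in the test set.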