We challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.
Taking an analytics approach, we aim to use information about a song’s properties to predict its popularity. The dataset songs.csv consists of all songs that made it to the Top 10 of the Billboard Hot 100 Chart from 1990 to 2010, plus a sample of additional songs that didn’t make the Top 10. The data comes from three sources: Wikipedia, Billboard.com, and EchoNest.
The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.
library(dplyr)
songs = read.csv("songs.csv")  # load the dataset described above
| year | songtitle | artistname | timesignature | loudness | tempo | key | energy | pitch | timbre_0_min | timbre_0_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2010 | This Is the House That Doubt Built | A Day to Remember | 3 | -4.262 | 91.525 | 11 | 0.9666556 | 0.024 | 0.002 | 57.342 |
| 2010 | Sticks & Bricks | A Day to Remember | 4 | -4.051 | 140.048 | 10 | 0.9847095 | 0.025 | 0.000 | 57.414 |
| 2010 | All I Want | A Day to Remember | 4 | -3.571 | 160.512 | 2 | 0.9899004 | 0.026 | 0.003 | 57.422 |
| 2010 | It’s Complicated | A Day to Remember | 4 | -3.815 | 97.525 | 1 | 0.9392072 | 0.013 | 0.000 | 57.765 |
| 2010 | 2nd Sucks | A Day to Remember | 4 | -4.707 | 140.053 | 6 | 0.9877376 | 0.063 | 0.000 | 56.872 |
| 2010 | Better Off This Way | A Day to Remember | 4 | -3.807 | 160.366 | 4 | 0.9799530 | 0.038 | 0.000 | 57.083 |
year = the year the song was released
songtitle = the title of the song
artistname = the name of the artist of the song
songID and artistID = identifying variables for the song and artist
timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate
loudness = a continuous variable indicating the average amplitude of the audio in decibels
tempo and tempo_confidence = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate
key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, ..., B), and the confidence in the estimate
energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch = a continuous variable that indicates the pitch of the song
timbre = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)
Train set: years 1990 to 2009
Test set: year 2010
SongsTrain = songs %>% filter(year >= 1990) %>% filter(year <= 2009)
SongsTest = songs %>% filter(year == 2010)
In this problem, our outcome variable is “Top10” - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. Since the outcome variable is binary, we will build a logistic regression model.
We want to exclude some of the variables in our dataset from being used as independent variables (“year”, “songtitle”, “artistname”, “songID”, and “artistID”). To do this, first define a vector of variable names called nonvars; these are the variables that we won’t use in our model. Then keep only the columns of the training and testing sets whose names are not in nonvars:
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
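The column-dropping idiom above can be tried in isolation. The following is a minimal sketch on a made-up data frame; `toy`, `drop_cols`, and the values are hypothetical and only mirror a few of the dataset's column names.

```r
# Toy data frame; column names mirror the dataset, values are made up.
toy = data.frame(
  year = c(2009, 2010),
  songtitle = c("A", "B"),
  loudness = c(-4.2, -3.8),
  Top10 = c(0, 1)
)

# Hypothetical list of columns to exclude from modeling.
drop_cols = c("year", "songtitle")

# names(toy) %in% drop_cols is TRUE for each column to drop;
# negating it gives a logical mask of the columns to keep.
keep = !(names(toy) %in% drop_cols)
toy_model = toy[, keep]

print(names(toy_model))
```

If the mask could leave only a single column, `toy[, keep, drop = FALSE]` keeps the result a data frame instead of collapsing it to a vector.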
Build the optimized model using the step function. The formula below contains the variables that stepwise selection retained:
model = glm(formula = Top10 ~ timesignature + timesignature_confidence +
loudness + tempo_confidence + key + key_confidence + energy +
pitch + timbre_0_min + timbre_0_max + timbre_1_min + timbre_2_min +
timbre_3_max + timbre_4_min + timbre_4_max + timbre_5_min +
timbre_6_min + timbre_6_max + timbre_7_min + timbre_7_max +
timbre_8_min + timbre_10_min + timbre_10_max + timbre_11_min +
timbre_11_max, family = binomial, data = SongsTrain)
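The selection run that led to this formula is not shown, so the following is only a sketch of how `step()` is typically applied to a logistic regression, using simulated data. The data frame `sim`, its columns, and the coefficients are all hypothetical. With the real data, the equivalent would presumably start from a full model such as `glm(Top10 ~ ., data = SongsTrain, family = binomial)`.

```r
set.seed(1)

# Simulated data: y depends on x1 and x2; "noise" is unrelated to y.
n = 500
sim = data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
p = plogis(-1 + 1.5 * sim$x1 - 1.0 * sim$x2)
sim$y = rbinom(n, 1, p)

# Full model with every remaining column as a predictor.
full = glm(y ~ ., data = sim, family = binomial)

# Backward stepwise selection by AIC (the default when no scope is given);
# trace = 0 suppresses the step-by-step log.
reduced = step(full, trace = 0)

print(formula(reduced))
print(c(full = AIC(full), reduced = AIC(reduced)))
```

Because `step()` only accepts a move that lowers AIC, the reduced model's AIC is never higher than the full model's. Which variables survive depends on the data.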
Prediction at threshold 0.35
testpredict = predict(model, newdata = SongsTest, type = 'response')
table(SongsTest$Top10, testpredict >= 0.35)
##
## FALSE TRUE
## 0 300 14
## 1 36 23
Model Accuracy = (300 + 23) / 373 = 0.866
Baseline Accuracy = (300 + 14) / 373 = 0.842
Model Sensitivity = 23 / (23 + 36) = 0.390
Model Specificity = 300 / (300 + 14) = 0.955
The model makes conservative predictions: it flags only 37 of the 373 test songs as Top 10 hits. It detects fewer than half of the actual Top 10 songs (23 of 59), but it rarely labels a non-hit as a hit (14 of 314). A Top 10 prediction from this model therefore carries real weight: 23 of the 37 flagged songs were actual hits.
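The evaluation metrics can be derived directly from the confusion matrix printed for the test set. The sketch below re-enters those four counts by hand; the variable names (`cm`, `TP`, and so on) are illustrative. Sensitivity is the share of actual Top 10 songs the model catches, and specificity is the share of non-hits it correctly rejects.

```r
# Confusion matrix from the test-set table (rows = actual, columns = predicted).
cm = matrix(c(300, 14,
               36, 23),
            nrow = 2, byrow = TRUE,
            dimnames = list(actual = c("0", "1"),
                            predicted = c("FALSE", "TRUE")))

TN = cm["0", "FALSE"]; FP = cm["0", "TRUE"]
FN = cm["1", "FALSE"]; TP = cm["1", "TRUE"]

accuracy    = (TP + TN) / sum(cm)   # share of all songs classified correctly
baseline    = (TN + FP) / sum(cm)   # accuracy of always predicting "not Top 10"
sensitivity = TP / (TP + FN)        # true positive rate
specificity = TN / (TN + FP)        # true negative rate
precision   = TP / (TP + FP)        # share of predicted hits that were real hits

print(round(c(accuracy = accuracy, baseline = baseline,
              sensitivity = sensitivity, specificity = specificity,
              precision = precision), 3))
```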