Predict the Top 10 of the Billboard

Introduction

We challenge ourselves to predict whether a song will reach the Top 10 of the Billboard Hot 100 Chart.

Taking an analytics approach, we aim to use information about a song’s properties to predict its popularity. The dataset songs.csv consists of all songs that made it to the Top 10 of the Billboard Hot 100 Chart from 1990 to 2010, plus a sample of additional songs that didn’t make the Top 10. The data comes from three sources: Wikipedia, Billboard.com, and EchoNest.

The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.

Fire up library

library(dplyr)
songs = read.csv("songs.csv")  # load the dataset used in the rest of the analysis

Variable Description

year	songtitle	artistname	timesignature	loudness	tempo	key	energy	pitch	timbre_0_min	timbre_0_max
2010	This Is the House That Doubt Built	A Day to Remember	3	-4.262	91.525	11	0.9666556	0.024	0.002	57.342
2010	Sticks & Bricks	A Day to Remember	4	-4.051	140.048	10	0.9847095	0.025	0.000	57.414
2010	All I Want	A Day to Remember	4	-3.571	160.512	2	0.9899004	0.026	0.003	57.422
2010	It’s Complicated	A Day to Remember	4	-3.815	97.525	1	0.9392072	0.013	0.000	57.765
2010	2nd Sucks	A Day to Remember	4	-4.707	140.053	6	0.9877376	0.063	0.000	56.872
2010	Better Off This Way	A Day to Remember	4	-3.807	160.366	4	0.9799530	0.038	0.000	57.083

year = the year the song was released
songtitle = the title of the song
artistname = the name of the artist of the song
songID and artistID = identifying variables for the song and artist
timesignature and timesignature_confidence = the estimated time signature of the song, and the confidence in the estimate
loudness = a continuous variable indicating the average amplitude of the audio in decibels
tempo and tempo_confidence = the estimated beats per minute of the song, and the confidence in the estimate
key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, ..., B), and the confidence in the estimate
energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch = a continuous variable that indicates the pitch of the song
timbre = variables (timbre_0_min, timbre_0_max, ..., timbre_11_max) that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)

Creating Model

Data Split

Train Set : Year 1990 ~ 2009

Test Set : Year 2010

SongsTrain = songs %>% filter(year >= 1990) %>% filter(year <= 2009)
SongsTest = songs %>% filter(year == 2010)

Creating Model

In this problem, our outcome variable is “Top10” - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. Since the outcome variable is binary, we will build a logistic regression model.
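For reference, logistic regression models the probability of the positive class as a logistic function of a linear combination of the independent variables. The symbols x_1, ..., x_k and the coefficients beta below are generic notation, not names taken from the dataset:

```latex
P(\text{Top10} = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

The output is always between 0 and 1, which is why it can later be compared against a threshold to produce a yes/no prediction.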

In our case, we want to exclude some of the variables in our dataset from being used as independent variables (“year”, “songtitle”, “artistname”, “songID”, and “artistID”). To do this, we can use the following trick. First define a vector of variable names called nonvars - these are the variables that we won’t use in our model.

Additional, we want to remove

To remove these variables from your training and testing sets, type the following commands in your R console:

nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
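To see what the %in% trick does, here is a minimal sketch on a made-up two-row data frame. The data frame toy and its values are assumptions for illustration only, not rows from songs.csv:

```r
# Toy data frame standing in for the songs data (values are invented)
toy <- data.frame(
  year      = c(2009, 2010),
  songtitle = c("Song A", "Song B"),
  loudness  = c(-4.2, -5.1),
  Top10     = c(0, 1)
)

# Names of the columns we do not want as independent variables
nonvars <- c("year", "songtitle")

# names(toy) %in% nonvars is TRUE for columns to drop;
# negating it with ! gives a logical mask of the columns to keep
keep <- !(names(toy) %in% nonvars)
toy_model <- toy[, keep]

print(names(toy_model))
```

Note that if the mask keeps only a single column, toy[, keep] returns a plain vector; adding drop = FALSE keeps the result a data frame.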

Build the optimized model (variables selected with the step function)

model = glm(formula = Top10 ~ timesignature + timesignature_confidence + 
    loudness + tempo_confidence + key + key_confidence + energy + 
    pitch + timbre_0_min + timbre_0_max + timbre_1_min + timbre_2_min + 
    timbre_3_max + timbre_4_min + timbre_4_max + timbre_5_min + 
    timbre_6_min + timbre_6_max + timbre_7_min + timbre_7_max + 
    timbre_8_min + timbre_10_min + timbre_10_max + timbre_11_min + 
    timbre_11_max, family = binomial, data = SongsTrain)
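The step call that led to this formula is not shown. As a rough sketch of how such a selection could be run, the example below fits a full logistic regression and lets step() prune it. It uses simulated data (the data frame sim and the variables x1, x2, x3, y are invented for this illustration), so it shows the technique rather than the actual selection performed on SongsTrain:

```r
set.seed(1)

# Simulated stand-in data: x1 drives the outcome, x2 and x3 are pure noise
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * sim$x1))

# Start from the model that uses every candidate variable
full <- glm(y ~ ., family = binomial, data = sim)

# With no scope argument, step() performs backward elimination,
# dropping terms as long as doing so lowers the AIC.
# trace = 0 suppresses the step-by-step log.
selected <- step(full, trace = 0)

print(formula(selected))
```

On the real data, the analogous starting point would be glm(Top10 ~ ., family = binomial, data = SongsTrain) after the nonvars columns have been removed.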

Prediction at threshold 0.35

testpredict = predict(model, newdata = SongsTest, type = 'response')
table(SongsTest$Top10, testpredict >= 0.35)

##    
##     FALSE TRUE
##   0   300   14
##   1    36   23

Model Accuracy = 0.866

Baseline Accuracy = 0.842

Model Sensitivity = 0.390

Model Specificity = 0.955
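These metrics can be recomputed directly from the confusion matrix printed above. The sketch below re-enters the four counts by hand and applies the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the variable names are chosen here for illustration:

```r
# Confusion matrix from the 2010 test set at threshold 0.35
# rows = actual Top10 (0/1), columns = predicted (FALSE/TRUE)
# matrix() fills column by column: first the FALSE column, then TRUE
cm <- matrix(c(300, 36, 14, 23), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

tn <- cm["0", "FALSE"]  # non-hits correctly predicted as non-hits
fp <- cm["0", "TRUE"]   # non-hits wrongly predicted as Top 10
fn <- cm["1", "FALSE"]  # Top 10 songs the model missed
tp <- cm["1", "TRUE"]   # Top 10 songs the model caught

accuracy    <- (tp + tn) / sum(cm)  # share of all songs classified correctly
baseline    <- (tn + fp) / sum(cm)  # always predicting "not Top 10"
sensitivity <- tp / (tp + fn)       # share of actual Top 10 songs detected
specificity <- tn / (tn + fp)       # share of non-hits correctly rejected

print(round(c(accuracy = accuracy, baseline = baseline,
              sensitivity = sensitivity, specificity = specificity), 3))
```

The baseline is the accuracy of always predicting the most frequent outcome, which here is "not Top 10" (314 of 373 test songs).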

Conclusion

The model makes conservative predictions: it rarely predicts that a song will reach the Top 10. It detects fewer than half of the actual Top 10 songs (23 of 59), but it also produces few false positives (14 of the 314 songs that missed the Top 10), so a Top 10 prediction from the model carries real weight.

Jinwook Chang

2016/05/03
