The music industry has a well-developed market with a global annual revenue around $15 billion. The recording industry is highly competitive and is dominated by three big production companies which make up nearly 82% of the total annual album sales.
Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.
Unfortunately, the success of an artist’s release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and digital downloads, while another single may turn out quite unpopular, and therefore unprofitable.
Knowing the competitive nature of the recording industry, record labels face the fundamental decision problem of which musical releases to support to maximize their financial success.
How can we use analytics to predict the popularity of a song? In this assignment, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.
Taking an analytics approach, we aim to use information about a song’s properties to predict its popularity. The dataset songs.csv consists of all songs which made it to the Top 10 of the Billboard Hot 100 Chart from 1990-2010 plus a sample of additional songs that didn’t make the Top 10. This data comes from three sources: Wikipedia, Billboard.com, and EchoNest.
The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.
Here’s a detailed description of the variables:
Use the read.csv function to load the dataset “songs.csv” into R.
# Load the data
library(knitr)  # provides kable() for printing tables
songs = read.csv("songs.csv")
# Count the number of observations per year
z = table(songs$year)
kable(z)

| Var1 | Freq |
|---|---|
| 1990 | 328 |
| 1991 | 196 |
| 1992 | 186 |
| 1993 | 324 |
| 1994 | 198 |
| 1995 | 258 |
| 1996 | 178 |
| 1997 | 329 |
| 1998 | 380 |
| 1999 | 357 |
| 2000 | 363 |
| 2001 | 282 |
| 2002 | 518 |
| 2003 | 434 |
| 2004 | 479 |
| 2005 | 392 |
| 2006 | 479 |
| 2007 | 622 |
| 2008 | 415 |
| 2009 | 483 |
| 2010 | 373 |
373 songs are from year 2010.
# How many songs are by Michael Jackson?
z = table(songs$artistname == "Michael Jackson")
kable(z)

| Var1 | Freq |
|---|---|
| FALSE | 7556 |
| TRUE | 18 |

MichaelJackson = subset(songs, artistname == "Michael Jackson")
18 songs are from the artist Michael Jackson.
# Show each Michael Jackson song title with its Top10 indicator
MichaelJackson[c("songtitle", "Top10")]
## songtitle Top10
## 4329 You Rock My World 1
## 6205 She's Out of My Life 0
## 6206 Wanna Be Startin' Somethin' 0
## 6207 You Are Not Alone 1
## 6208 Billie Jean 0
## 6209 The Way You Make Me Feel 0
## 6210 Black or White 1
## 6211 Rock with You 0
## 6212 Bad 0
## 6213 I Just Can't Stop Loving You 0
## 6214 Man in the Mirror 0
## 6215 Thriller 0
## 6216 Beat It 0
## 6217 The Girl Is Mine 0
## 6218 Remember the Time 1
## 6219 Don't Stop 'Til You Get Enough 0
## 6220 Heal the World 0
## 6915 In The Closet 1
In this listing, the songs with Top10 = 1 are You Rock My World, You Are Not Alone, Black or White, Remember the Time, and In The Closet.
# Tabulate time signature
z = table(songs$timesignature)
kable(z)

| Var1 | Freq |
|---|---|
| 0 | 10 |
| 1 | 143 |
| 3 | 503 |
| 4 | 6787 |
| 5 | 112 |
| 7 | 19 |

The only values that appear in the table for timesignature are 0, 1, 3, 4, 5, and 7.
6787 songs have a value of 4 for the timesignature.
# Find the song with the highest tempo
i = which.max(songs$tempo)
songs$songtitle[i]
## [1] Wanna Be Startin' Somethin'
## 7141 Levels: '03 Bonnie & Clyde '69 'O Surdato 'Nnammurato 'Til I Fell in Love with You #1 (Hot S**t) Country Grammar (Lay Your Head on My) Pillow ... Zumbi

We wish to predict whether or not a song will make it to the Top 10. To do this, first use the subset function to split the data into a training set “SongsTrain”, consisting of all the observations up to and including 2009 song releases, and a testing set “SongsTest”, consisting of the 2010 song releases.
# Split the data
SongsTrain = subset(songs, year <= 2009)
SongsTest = subset(songs, year == 2010)
nrow(SongsTrain)
## [1] 7201

7201 songs are in the training set.
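The same year-based split can be tried on a small made-up data frame before running it on the full data. This is only a sketch: the data frame `toy` and its values are hypothetical, and only `subset` and `nrow` come from the text above.

```r
# Hypothetical toy data: one row per song, with a release year and a Top10 flag
toy = data.frame(
  year  = c(2007, 2008, 2009, 2010, 2010),
  Top10 = c(0, 1, 0, 1, 0)
)

# Train on everything up to and including 2009, test on 2010 only
toyTrain = subset(toy, year <= 2009)
toyTest  = subset(toy, year == 2010)

# The two pieces partition the data because 2010 is the last year present
stopifnot(nrow(toyTrain) + nrow(toyTest) == nrow(toy))
nrow(toyTrain)
nrow(toyTest)
```

Splitting by year rather than at random mimics the real use case, where a model built on past releases is asked about future ones.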
# Remove the variables we wont use in our model
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
# Build the logistic regression model (Model 1)
SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)
summary(SongsLog1)
##
## Call:
## glm(formula = Top10 ~ ., family = binomial, data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9220 -0.5399 -0.3459 -0.1845 3.0770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 14.69998823 1.80638746 8.138 4.03e-16 ***
## timesignature 0.12639483 0.08673566 1.457 0.145050
## timesignature_confidence 0.74499227 0.19530526 3.815 0.000136 ***
## loudness 0.29987940 0.02916535 10.282 < 2e-16 ***
## tempo 0.00036340 0.00169146 0.215 0.829889
## tempo_confidence 0.47322705 0.14217401 3.329 0.000873 ***
## key 0.01588199 0.01038950 1.529 0.126349
## key_confidence 0.30867509 0.14115620 2.187 0.028760 *
## energy -1.50214447 0.30992402 -4.847 1.25e-06 ***
## pitch -44.90773986 6.83488314 -6.570 5.02e-11 ***
## timbre_0_min 0.02315894 0.00425625 5.441 5.29e-08 ***
## timbre_0_max -0.33098196 0.02569259 -12.882 < 2e-16 ***
## timbre_1_min 0.00588100 0.00077981 7.542 4.64e-14 ***
## timbre_1_max -0.00024486 0.00071524 -0.342 0.732087
## timbre_2_min -0.00212741 0.00112599 -1.889 0.058843 .
## timbre_2_max 0.00065857 0.00090658 0.726 0.467571
## timbre_3_min 0.00069196 0.00059845 1.156 0.247583
## timbre_3_max -0.00296730 0.00058149 -5.103 3.34e-07 ***
## timbre_4_min 0.01039562 0.00198505 5.237 1.63e-07 ***
## timbre_4_max 0.00611050 0.00155029 3.942 8.10e-05 ***
## timbre_5_min -0.00559796 0.00127670 -4.385 1.16e-05 ***
## timbre_5_max 0.00007736 0.00079354 0.097 0.922337
## timbre_6_min -0.01685618 0.00226395 -7.445 9.66e-14 ***
## timbre_6_max 0.00366807 0.00218950 1.675 0.093875 .
## timbre_7_min -0.00454922 0.00178148 -2.554 0.010661 *
## timbre_7_max -0.00377369 0.00183198 -2.060 0.039408 *
## timbre_8_min 0.00391105 0.00285101 1.372 0.170123
## timbre_8_max 0.00401134 0.00300298 1.336 0.181620
## timbre_9_min 0.00136726 0.00299806 0.456 0.648356
## timbre_9_max 0.00160266 0.00243364 0.659 0.510188
## timbre_10_min 0.00412631 0.00183907 2.244 0.024852 *
## timbre_10_max 0.00582498 0.00176941 3.292 0.000995 ***
## timbre_11_min -0.02625234 0.00369327 -7.108 1.18e-12 ***
## timbre_11_max 0.01967338 0.00338549 5.811 6.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4759.2 on 7167 degrees of freedom
## AIC: 4827.2
##
## Number of Fisher Scoring iterations: 6

AIC = 4827.2
In the output of summary(SongsLog1), the coefficient estimates for the confidence variables (timesignature_confidence, key_confidence, and tempo_confidence) are all positive. This means that higher confidence leads to a higher predicted probability of a Top 10 hit.
If we take low confidence in the time signature, tempo, and key as a sign that a song is more complex (these attributes are harder to pin down), then the positive coefficients imply that more complex songs have a lower predicted probability of being a hit. So the model suggests that mainstream listeners tend to prefer less complex songs.
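One way to read these coefficients is as odds ratios: in logistic regression, a one-unit increase in a predictor multiplies the odds of the outcome by exp(coefficient), holding the other variables fixed. The sketch below applies this to the three confidence estimates copied from the Model 1 summary; the vector name `confCoefs` is made up for illustration.

```r
# Coefficient estimates copied from the Model 1 summary above
confCoefs = c(
  timesignature_confidence = 0.74499227,
  tempo_confidence         = 0.47322705,
  key_confidence           = 0.30867509
)

# exp(coefficient) is the multiplicative change in the odds of a Top 10 hit
# for a one-unit increase in the variable, other variables held fixed
oddsRatios = exp(confCoefs)
round(oddsRatios, 3)
```

All three odds ratios are greater than 1, which is just another way of saying that the coefficients are positive.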
The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs, which are those with heavier instrumentation.
#### By inspecting the coefficient of the variable “energy”, do we draw the same conclusions as above?
However, the coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those with light instrumentation. These coefficients lead us to different conclusions!
# Check for collinearity: correlation between loudness and energy
cor(SongsTrain$loudness, SongsTrain$energy)
## [1] 0.7399067

Correlation = 0.73991
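To see why a correlation of this size matters, here is a sketch on simulated data (not the songs data). Two hypothetical predictors, `loud` and `energetic`, are generated with a correlation near 0.74, and only `loud` actually drives the outcome. Comparing the standard error of the `loud` coefficient with and without its correlated partner shows how collinearity makes individual estimates less stable. All names and numbers here are assumptions chosen for illustration.

```r
set.seed(1)
n = 2000

# Two simulated predictors with a population correlation of about 0.74
loud      = rnorm(n)
energetic = 0.74 * loud + sqrt(1 - 0.74^2) * rnorm(n)
cor(loud, energetic)

# Simulated binary outcome that depends on loud only
hit = rbinom(n, size = 1, prob = plogis(-1 + 1.0 * loud))

# Fit with both correlated predictors, then with loud alone
both  = glm(hit ~ loud + energetic, family = binomial)
alone = glm(hit ~ loud, family = binomial)

# Standard error of the loud coefficient under each model
seBoth  = summary(both)$coefficients["loud", "Std. Error"]
seAlone = summary(alone)$coefficients["loud", "Std. Error"]
c(both = seBoth, alone = seAlone)
```

When two predictors carry overlapping information, the model has trouble attributing the effect to either one, which is one plausible reason the loudness and energy coefficients in Model 1 point in opposite directions.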
# Logistic Regression
SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)
# Output summary
summary(SongsLog2)
##
## Call:
## glm(formula = Top10 ~ . - loudness, family = binomial, data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0983 -0.5607 -0.3602 -0.1902 3.3107
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2406121 0.7464843 -3.002 0.002686 **
## timesignature 0.1624613 0.0873408 1.860 0.062873 .
## timesignature_confidence 0.6884706 0.1924193 3.578 0.000346 ***
## tempo 0.0005521 0.0016651 0.332 0.740226
## tempo_confidence 0.5496567 0.1407363 3.906 9.40e-05 ***
## key 0.0174026 0.0102563 1.697 0.089740 .
## key_confidence 0.2953671 0.1394460 2.118 0.034163 *
## energy 0.1812603 0.2607678 0.695 0.486991
## pitch -51.4985789 6.8565442 -7.511 5.87e-14 ***
## timbre_0_min 0.0247895 0.0042397 5.847 5.01e-09 ***
## timbre_0_max -0.1006969 0.0117760 -8.551 < 2e-16 ***
## timbre_1_min 0.0071435 0.0007710 9.265 < 2e-16 ***
## timbre_1_max -0.0007830 0.0007064 -1.108 0.267650
## timbre_2_min -0.0015790 0.0011091 -1.424 0.154531
## timbre_2_max 0.0003889 0.0008964 0.434 0.664427
## timbre_3_min 0.0006500 0.0005949 1.093 0.274524
## timbre_3_max -0.0024622 0.0005674 -4.339 1.43e-05 ***
## timbre_4_min 0.0091146 0.0019519 4.670 3.02e-06 ***
## timbre_4_max 0.0063056 0.0015323 4.115 3.87e-05 ***
## timbre_5_min -0.0056411 0.0012549 -4.495 6.95e-06 ***
## timbre_5_max 0.0006937 0.0007807 0.889 0.374256
## timbre_6_min -0.0161221 0.0022350 -7.214 5.45e-13 ***
## timbre_6_max 0.0038138 0.0021566 1.768 0.076982 .
## timbre_7_min -0.0051019 0.0017548 -2.907 0.003644 **
## timbre_7_max -0.0031585 0.0018107 -1.744 0.081090 .
## timbre_8_min 0.0044882 0.0028103 1.597 0.110254
## timbre_8_max 0.0064225 0.0029504 2.177 0.029497 *
## timbre_9_min -0.0004282 0.0029549 -0.145 0.884792
## timbre_9_max 0.0035254 0.0023769 1.483 0.138017
## timbre_10_min 0.0029934 0.0018037 1.660 0.097004 .
## timbre_10_max 0.0073666 0.0017314 4.255 2.09e-05 ***
## timbre_11_min -0.0283702 0.0036300 -7.815 5.48e-15 ***
## timbre_11_max 0.0182939 0.0033405 5.476 4.34e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4871.8 on 7168 degrees of freedom
## AIC: 4937.8
##
## Number of Fisher Scoring iterations: 6

The coefficient estimate for energy is positive in Model 2, suggesting that songs with higher energy levels tend to be more popular. However, note that the variable energy is not significant in this model.
# Logistic Regression
SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family=binomial)
# Output summary
summary(SongsLog3)
##
## Call:
## glm(formula = Top10 ~ . - energy, family = binomial, data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9182 -0.5417 -0.3481 -0.1874 3.4171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.96056207 1.71419468 6.977 3.01e-12 ***
## timesignature 0.11509425 0.08726155 1.319 0.187183
## timesignature_confidence 0.71426976 0.19461751 3.670 0.000242 ***
## loudness 0.23055652 0.02527983 9.120 < 2e-16 ***
## tempo -0.00064600 0.00166547 -0.388 0.698107
## tempo_confidence 0.38409299 0.13983499 2.747 0.006019 **
## key 0.01649459 0.01035139 1.593 0.111056
## key_confidence 0.33940638 0.14087438 2.409 0.015984 *
## pitch -53.28405750 6.73285437 -7.914 2.49e-15 ***
## timbre_0_min 0.02204524 0.00423942 5.200 1.99e-07 ***
## timbre_0_max -0.31048005 0.02536544 -12.240 < 2e-16 ***
## timbre_1_min 0.00541597 0.00076427 7.086 1.38e-12 ***
## timbre_1_max -0.00051146 0.00071101 -0.719 0.471928
## timbre_2_min -0.00225435 0.00112029 -2.012 0.044190 *
## timbre_2_max 0.00041189 0.00090196 0.457 0.647915
## timbre_3_min 0.00031786 0.00058687 0.542 0.588083
## timbre_3_max -0.00296369 0.00057576 -5.147 2.64e-07 ***
## timbre_4_min 0.01104648 0.00197793 5.585 2.34e-08 ***
## timbre_4_max 0.00646679 0.00154132 4.196 2.72e-05 ***
## timbre_5_min -0.00513453 0.00126897 -4.046 5.21e-05 ***
## timbre_5_max 0.00029790 0.00078555 0.379 0.704526
## timbre_6_min -0.01784468 0.00224605 -7.945 1.94e-15 ***
## timbre_6_max 0.00344687 0.00218214 1.580 0.114203
## timbre_7_min -0.00512843 0.00176848 -2.900 0.003733 **
## timbre_7_max -0.00339351 0.00181976 -1.865 0.062208 .
## timbre_8_min 0.00368609 0.00283309 1.301 0.193229
## timbre_8_max 0.00465780 0.00298790 1.559 0.119022
## timbre_9_min -0.00009318 0.00295687 -0.032 0.974859
## timbre_9_max 0.00134171 0.00242391 0.554 0.579900
## timbre_10_min 0.00405001 0.00182697 2.217 0.026637 *
## timbre_10_max 0.00579252 0.00175858 3.294 0.000988 ***
## timbre_11_min -0.02637666 0.00368292 -7.162 7.96e-13 ***
## timbre_11_max 0.01983605 0.00336460 5.896 3.74e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4782.7 on 7168 degrees of freedom
## AIC: 4848.7
##
## Number of Fisher Scoring iterations: 6

In Model 3, loudness has a positive and highly significant coefficient estimate, so the model predicts that songs with heavier instrumentation tend to be more popular. This matches the conclusion drawn from the loudness coefficient in Model 1.
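The three AIC values can be checked by hand. For a logistic regression on a 0/1 outcome like this one, the AIC reported by `glm` equals the residual deviance plus twice the number of estimated coefficients. The sketch below redoes that arithmetic using figures copied from the three summaries; the data frame `fits` is only a container for illustration.

```r
# Residual deviance and residual degrees of freedom copied from the summaries
fits = data.frame(
  model    = c("Model 1", "Model 2", "Model 3"),
  deviance = c(4759.2, 4871.8, 4782.7),
  residDf  = c(7167, 7168, 7168)
)

# Number of estimated coefficients = observations - residual df
nObs = 7201
fits$k = nObs - fits$residDf

# AIC = residual deviance + 2 * (number of coefficients)
fits$AIC = fits$deviance + 2 * fits$k
fits
```

Model 3 (without energy) has an AIC much closer to Model 1 than Model 2 (without loudness) does, so dropping energy costs less in fit than dropping loudness.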
Make predictions on the test set using Model 3.
# Make predictions
testPredict = predict(SongsLog3, newdata=SongsTest, type="response")
# Tabulate actual Top 10 status against our predictions at a threshold of 0.45
z = table(SongsTest$Top10, testPredict >= 0.45)
kable(z)

| | FALSE | TRUE |
|---|---|---|
| 0 | 309 | 5 |
| 1 | 40 | 19 |

# Compute Accuracy
sum(diag(z))/sum(z)
## [1] 0.8793566

Accuracy = 0.87936
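The prediction step can be sketched end to end on simulated data. With `type = "response"`, `predict` returns probabilities, which are the inverse logit (`plogis`) of the linear predictor; comparing them to a threshold then gives the class predictions that go into the confusion matrix. The data and the names `toyFit`, `probs`, and `confusion` are hypothetical.

```r
set.seed(42)

# Simulated training and test data with one predictor (illustrative only)
train = data.frame(x = rnorm(200))
train$y = rbinom(200, size = 1, prob = plogis(-1 + 2 * train$x))
test = data.frame(x = rnorm(50))
test$y = rbinom(50, size = 1, prob = plogis(-1 + 2 * test$x))

toyFit = glm(y ~ x, data = train, family = binomial)

# Probabilities on the test set, and the same values from the log-odds scale
probs   = predict(toyFit, newdata = test, type = "response")
logOdds = predict(toyFit, newdata = test, type = "link")
stopifnot(isTRUE(all.equal(probs, plogis(logOdds))))

# Threshold at 0.45 and cross-tabulate against the true labels
confusion = table(actual = test$y, predicted = probs >= 0.45)
confusion

# Accuracy: correctly classified observations over all observations
accuracy = mean((probs >= 0.45) == (test$y == 1))
accuracy
```

Computing accuracy with `mean` on a logical comparison avoids relying on the confusion matrix being square, which `sum(diag(z))` needs.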
# Tabulate the baseline (actual Top 10 status in the test set)
z = table(SongsTest$Top10)
kable(z)

| Var1 | Freq |
|---|---|
| 0 | 314 |
| 1 | 59 |

# Compute the accuracy of always predicting "not Top 10"
z[1]/sum(z)
## 0
## 0.8418231

Baseline accuracy = 0.8418231
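The comparison with the baseline can be written out directly from the counts reported above. This sketch assumes the baseline always predicts the most frequent class (not Top 10); the variable names are illustrative.

```r
# Test-set class counts copied from the table above
classCounts = c("0" = 314, "1" = 59)

# Baseline: always predict the most frequent class
baselineAccuracy = max(classCounts) / sum(classCounts)

# Model 3 at threshold 0.45: 309 true negatives + 19 true positives
modelAccuracy = (309 + 19) / sum(classCounts)

round(c(baseline = baselineAccuracy, model3 = modelAccuracy), 4)
modelAccuracy - baselineAccuracy
```

Because only 59 of the 373 test songs are hits, the baseline is already hard to beat on accuracy alone, which is why sensitivity and specificity are worth examining separately.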
# Predict Top10 hits in 2010
z = table(SongsTest$Top10, testPredict >= 0.45)
kable(z)

| | FALSE | TRUE |
|---|---|---|
| 0 | 309 | 5 |
| 1 | 40 | 19 |

At a threshold of 0.45, Model 3 correctly predicts 19 of the 59 Top 10 hits in 2010.
Model 3 incorrectly labels 5 songs that did not make the Top 10 as Top 10 hits (false positives).
# Tabulate confusion matrix
z = table(SongsTest$Top10, testPredict >= 0.45)
kable(z)

| | FALSE | TRUE |
|---|---|---|
| 0 | 309 | 5 |
| 1 | 40 | 19 |

# Compute Sensitivity
z[4]/(z[2]+z[4])
## [1] 0.3220339

Sensitivity = 0.3220339
# Compute Specificity
z[1]/(z[1]+z[3])
## [1] 0.9840764

Specificity = 0.9840764
Model 3 has a very high specificity (0.984) and a low sensitivity (0.322), so it favors specificity over sensitivity. Although it captures only about a third of the Top 10 songs, it can still offer a competitive edge because it is very conservative in its predictions: of the 24 songs it flagged as hits, 19 actually reached the Top 10.
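These figures can be reproduced from the confusion matrix using named indexing, which is less error-prone than positional indices such as `z[4]`. The matrix is rebuilt here from the reported counts; the name `cm` is illustrative, and `precision` is an extra quantity not computed in the code above.

```r
# Confusion matrix rebuilt from the counts reported above
# rows = actual Top10 value, columns = whether the prediction is >= 0.45
cm = matrix(
  c(309, 40, 5, 19), nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

TP = cm["1", "TRUE"];  FN = cm["1", "FALSE"]
TN = cm["0", "FALSE"]; FP = cm["0", "TRUE"]

sensitivity = TP / (TP + FN)   # share of actual hits that were caught
specificity = TN / (TN + FP)   # share of non-hits correctly rejected
precision   = TP / (TP + FP)   # share of predicted hits that were real hits

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision), 4)
```

Lowering the 0.45 threshold would typically trade some specificity for higher sensitivity, so the right cutoff depends on how costly a missed hit is relative to a failed investment.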