Popularity of Music Records

Background Information on the Dataset

The music industry has a well-developed market, with global annual revenue of around $15 billion. The recording industry is highly competitive and is dominated by three big production companies, which together account for nearly 82% of total annual album sales.

Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.

Unfortunately, the success of an artist’s release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and digital downloads, while another single may turn out quite unpopular, and therefore unprofitable.

Given the competitive nature of the recording industry, record labels face the fundamental decision of which musical releases to support in order to maximize their financial success.

How can we use analytics to predict the popularity of a song? In this assignment, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.

Taking an analytics approach, we aim to use information about a song’s properties to predict its popularity. The dataset songs.csv consists of all songs that made it to the Top 10 of the Billboard Hot 100 Chart from 1990 to 2010, plus a sample of additional songs that didn’t make the Top 10. This data comes from three sources: Wikipedia, Billboard.com, and EchoNest.

The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.

Here’s a detailed description of the variables:

  • year = the year the song was released
  • songtitle = the title of the song
  • artistname = the name of the artist of the song
  • songID and artistID = identifying variables for the song and artist
  • timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate
  • loudness = a continuous variable indicating the average amplitude of the audio in decibels
  • tempo and tempo_confidence = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate
  • key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, . . ., B), and the confidence in the estimate
  • energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
  • pitch = a continuous variable that indicates the pitch of the song
  • timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, . . . , timbre_11_min, and timbre_11_max = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
  • Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)

Understanding the Data

Use the read.csv function to load the dataset “songs.csv” into R.

# Load the data
songs = read.csv("songs.csv")

How many observations (songs) are from the year 2010?

# Tabulate the number of songs released in each year
z = table(songs$year)
kable(z)
|Var1 | Freq|
|:----|----:|
|1990 |  328|
|1991 |  196|
|1992 |  186|
|1993 |  324|
|1994 |  198|
|1995 |  258|
|1996 |  178|
|1997 |  329|
|1998 |  380|
|1999 |  357|
|2000 |  363|
|2001 |  282|
|2002 |  518|
|2003 |  434|
|2004 |  479|
|2005 |  392|
|2006 |  479|
|2007 |  622|
|2008 |  415|
|2009 |  483|
|2010 |  373|

373 songs are from the year 2010.
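
The same count can also be obtained directly, without tabulating every year (a minimal sketch):

# Count the 2010 songs directly (should also give 373)
sum(songs$year == 2010)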

How many songs does the dataset include for which the artist name is “Michael Jackson”?

# How many songs are by Michael Jackson?
z = table(songs$artistname == "Michael Jackson")
kable(z)
|Var1  | Freq|
|:-----|----:|
|FALSE | 7556|
|TRUE  |   18|
# Keep only Michael Jackson's songs for the next question
MichaelJackson = subset(songs, artistname == "Michael Jackson")

18 songs are from the artist Michael Jackson.

Which of these songs by Michael Jackson made it to the Top 10?

# Show Michael Jackson's songs and whether each reached the Top 10
MichaelJackson[c("songtitle", "Top10")]
##                           songtitle Top10
## 4329              You Rock My World     1
## 6205           She's Out of My Life     0
## 6206    Wanna Be Startin' Somethin'     0
## 6207              You Are Not Alone     1
## 6208                    Billie Jean     0
## 6209       The Way You Make Me Feel     0
## 6210                 Black or White     1
## 6211                  Rock with You     0
## 6212                            Bad     0
## 6213   I Just Can't Stop Loving You     0
## 6214              Man in the Mirror     0
## 6215                       Thriller     0
## 6216                        Beat It     0
## 6217               The Girl Is Mine     0
## 6218              Remember the Time     1
## 6219 Don't Stop 'Til You Get Enough     0
## 6220                 Heal the World     0
## 6915                  In The Closet     1

According to this output, You Rock My World, You Are Not Alone, Black or White, Remember the Time, and In The Closet all made it to the Top 10.
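
Equivalently, a logical subset pulls out just the Top 10 titles (a small sketch; output not shown):

# Michael Jackson songs that reached the Top 10
MichaelJackson$songtitle[MichaelJackson$Top10 == 1]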

The variable corresponding to the estimated time signature (timesignature) is discrete, meaning that it only takes integer values (0, 1, 2, 3, . . . ). What are the values of this variable that occur in our dataset?

# Tabulate time signature
z = table(songs$timesignature)
kable(z)
|Var1 | Freq|
|:----|----:|
|0    |   10|
|1    |  143|
|3    |  503|
|4    | 6787|
|5    |  112|
|7    |   19|

The only values of timesignature that appear in the dataset are 0, 1, 3, 4, 5, and 7.

Which timesignature value is the most frequent among songs in our dataset?

# Tabulate time signature
z = table(songs$timesignature)
kable(z)
|Var1 | Freq|
|:----|----:|
|0    |   10|
|1    |  143|
|3    |  503|
|4    | 6787|
|5    |  112|
|7    |   19|

The most frequent value is a time signature of 4, which appears in 6787 songs.
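
The same answer can be obtained programmatically with which.max, which returns the most frequent level of the table (sketch):

# Most frequent time signature value (returns the level "4")
names(which.max(table(songs$timesignature)))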

Out of all of the songs in our dataset, the song with the highest tempo is one of the following songs. Which one is it?

# Find the song with the highest tempo
i = which.max(songs$tempo)
songs$songtitle[i]
## [1] Wanna Be Startin' Somethin'
## 7141 Levels: '03 Bonnie & Clyde '69 'O Surdato 'Nnammurato 'Til I Fell in Love with You #1 (Hot S**t) Country Grammar (Lay Your Head on My) Pillow ... Zumbi

The song with the highest tempo is Wanna Be Startin' Somethin'.
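
If we also want the tempo value itself, a one-line lookup works (a small sketch; output omitted):

# Title and tempo of the fastest song in the dataset
songs[which.max(songs$tempo), c("songtitle", "tempo")]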

Creating Our Prediction Model

We wish to predict whether or not a song will make it to the Top 10. To do this, first use the subset function to split the data into a training set “SongsTrain” consisting of all the observations up to and including 2009 song releases, and a testing set “SongsTest”, consisting of the 2010 song releases.

How many observations (songs) are in the training set?

# Split the data into a training set (1990-2009) and a test set (2010)
SongsTrain = subset(songs, year <= 2009)
SongsTest = subset(songs, year == 2010)
nrow(SongsTrain)
## [1] 7201

7201 songs are in the training set.
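
As a quick sanity check (purely illustrative), the two sets should together cover every song in the data:

# The test set holds the 2010 releases; train + test should account for all songs
nrow(SongsTest)
nrow(SongsTrain) + nrow(SongsTest) == nrow(songs)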

Looking at the summary of your model, what is the value of the Akaike Information Criterion (AIC)?

# Remove the variables we won't use in our model
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
# Build a logistic regression model (Model 1) using all remaining variables
SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)
summary(SongsLog1)
## 
## Call:
## glm(formula = Top10 ~ ., family = binomial, data = SongsTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9220  -0.5399  -0.3459  -0.1845   3.0770  
## 
## Coefficients:
##                              Estimate   Std. Error z value Pr(>|z|)    
## (Intercept)               14.69998823   1.80638746   8.138 4.03e-16 ***
## timesignature              0.12639483   0.08673566   1.457 0.145050    
## timesignature_confidence   0.74499227   0.19530526   3.815 0.000136 ***
## loudness                   0.29987940   0.02916535  10.282  < 2e-16 ***
## tempo                      0.00036340   0.00169146   0.215 0.829889    
## tempo_confidence           0.47322705   0.14217401   3.329 0.000873 ***
## key                        0.01588199   0.01038950   1.529 0.126349    
## key_confidence             0.30867509   0.14115620   2.187 0.028760 *  
## energy                    -1.50214447   0.30992402  -4.847 1.25e-06 ***
## pitch                    -44.90773986   6.83488314  -6.570 5.02e-11 ***
## timbre_0_min               0.02315894   0.00425625   5.441 5.29e-08 ***
## timbre_0_max              -0.33098196   0.02569259 -12.882  < 2e-16 ***
## timbre_1_min               0.00588100   0.00077981   7.542 4.64e-14 ***
## timbre_1_max              -0.00024486   0.00071524  -0.342 0.732087    
## timbre_2_min              -0.00212741   0.00112599  -1.889 0.058843 .  
## timbre_2_max               0.00065857   0.00090658   0.726 0.467571    
## timbre_3_min               0.00069196   0.00059845   1.156 0.247583    
## timbre_3_max              -0.00296730   0.00058149  -5.103 3.34e-07 ***
## timbre_4_min               0.01039562   0.00198505   5.237 1.63e-07 ***
## timbre_4_max               0.00611050   0.00155029   3.942 8.10e-05 ***
## timbre_5_min              -0.00559796   0.00127670  -4.385 1.16e-05 ***
## timbre_5_max               0.00007736   0.00079354   0.097 0.922337    
## timbre_6_min              -0.01685618   0.00226395  -7.445 9.66e-14 ***
## timbre_6_max               0.00366807   0.00218950   1.675 0.093875 .  
## timbre_7_min              -0.00454922   0.00178148  -2.554 0.010661 *  
## timbre_7_max              -0.00377369   0.00183198  -2.060 0.039408 *  
## timbre_8_min               0.00391105   0.00285101   1.372 0.170123    
## timbre_8_max               0.00401134   0.00300298   1.336 0.181620    
## timbre_9_min               0.00136726   0.00299806   0.456 0.648356    
## timbre_9_max               0.00160266   0.00243364   0.659 0.510188    
## timbre_10_min              0.00412631   0.00183907   2.244 0.024852 *  
## timbre_10_max              0.00582498   0.00176941   3.292 0.000995 ***
## timbre_11_min             -0.02625234   0.00369327  -7.108 1.18e-12 ***
## timbre_11_max              0.01967338   0.00338549   5.811 6.21e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6017.5  on 7200  degrees of freedom
## Residual deviance: 4759.2  on 7167  degrees of freedom
## AIC: 4827.2
## 
## Number of Fisher Scoring iterations: 6

AIC = 4827.2
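
Rather than reading the AIC off the printed summary, it can also be extracted directly from the fitted model object (sketch):

# Extract the AIC of Model 1 directly
AIC(SongsLog1)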

Our model seems to indicate that these confidence variables are significant (rather than the variables timesignature, key and tempo themselves). What does the model suggest?

Looking at the output of summary(SongsLog1), the coefficient estimates for the confidence variables (timesignature_confidence, key_confidence, and tempo_confidence) are all positive, meaning that a higher confidence in these estimates is associated with a higher predicted probability of a Top 10 hit.

What does Model 1 suggest in terms of complexity?

Since the coefficients for timesignature_confidence, tempo_confidence, and key_confidence are all positive, songs whose time signature, tempo, and key are estimated with low confidence have a lower predicted probability of being a hit. Low confidence typically corresponds to more complex, less regular songs, so Model 1 suggests that mainstream listeners tend to prefer less complex songs.
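
We can pull these three coefficients out of Model 1 directly to confirm their signs (a small sketch; output not shown):

# Coefficients of the confidence variables in Model 1 (all positive)
coef(SongsLog1)[c("timesignature_confidence", "tempo_confidence", "key_confidence")]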

By inspecting the coefficient of the variable “loudness”, what does Model 1 suggest?

The coefficient estimate for loudness is positive, so Model 1 suggests that mainstream listeners prefer louder songs, i.e., those with heavier instrumentation.

By inspecting the coefficient of the variable “energy”, do we draw the same conclusions as above?

No. The coefficient estimate for energy is negative, suggesting that mainstream listeners prefer less energetic songs, i.e., those with lighter instrumentation. The two variables lead us to opposite conclusions about instrumentation!

Beware of Multicollinearity Issues!

What is the correlation between the variables “loudness” and “energy” in the training set?

# Compute the correlation between loudness and energy in the training set
cor(SongsTrain$loudness, SongsTrain$energy)
## [1] 0.7399067

Correlation = 0.73991
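
Before dropping a variable, it can also be worth screening other predictor pairs for strong correlations; a minimal sketch over a few of the continuous variables:

# Correlation matrix of a few continuous predictors in the training set (illustrative)
round(cor(SongsTrain[, c("loudness", "energy", "tempo", "pitch")]), 3)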

Create Model 2, which is Model 1 without the independent variable “loudness”. Look at the summary of SongsLog2, and inspect the coefficient of the variable “energy”. What do you observe?

# Model 2: logistic regression excluding loudness
SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)
# Output summary
summary(SongsLog2)
## 
## Call:
## glm(formula = Top10 ~ . - loudness, family = binomial, data = SongsTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0983  -0.5607  -0.3602  -0.1902   3.3107  
## 
## Coefficients:
##                             Estimate  Std. Error z value Pr(>|z|)    
## (Intercept)               -2.2406121   0.7464843  -3.002 0.002686 ** 
## timesignature              0.1624613   0.0873408   1.860 0.062873 .  
## timesignature_confidence   0.6884706   0.1924193   3.578 0.000346 ***
## tempo                      0.0005521   0.0016651   0.332 0.740226    
## tempo_confidence           0.5496567   0.1407363   3.906 9.40e-05 ***
## key                        0.0174026   0.0102563   1.697 0.089740 .  
## key_confidence             0.2953671   0.1394460   2.118 0.034163 *  
## energy                     0.1812603   0.2607678   0.695 0.486991    
## pitch                    -51.4985789   6.8565442  -7.511 5.87e-14 ***
## timbre_0_min               0.0247895   0.0042397   5.847 5.01e-09 ***
## timbre_0_max              -0.1006969   0.0117760  -8.551  < 2e-16 ***
## timbre_1_min               0.0071435   0.0007710   9.265  < 2e-16 ***
## timbre_1_max              -0.0007830   0.0007064  -1.108 0.267650    
## timbre_2_min              -0.0015790   0.0011091  -1.424 0.154531    
## timbre_2_max               0.0003889   0.0008964   0.434 0.664427    
## timbre_3_min               0.0006500   0.0005949   1.093 0.274524    
## timbre_3_max              -0.0024622   0.0005674  -4.339 1.43e-05 ***
## timbre_4_min               0.0091146   0.0019519   4.670 3.02e-06 ***
## timbre_4_max               0.0063056   0.0015323   4.115 3.87e-05 ***
## timbre_5_min              -0.0056411   0.0012549  -4.495 6.95e-06 ***
## timbre_5_max               0.0006937   0.0007807   0.889 0.374256    
## timbre_6_min              -0.0161221   0.0022350  -7.214 5.45e-13 ***
## timbre_6_max               0.0038138   0.0021566   1.768 0.076982 .  
## timbre_7_min              -0.0051019   0.0017548  -2.907 0.003644 ** 
## timbre_7_max              -0.0031585   0.0018107  -1.744 0.081090 .  
## timbre_8_min               0.0044882   0.0028103   1.597 0.110254    
## timbre_8_max               0.0064225   0.0029504   2.177 0.029497 *  
## timbre_9_min              -0.0004282   0.0029549  -0.145 0.884792    
## timbre_9_max               0.0035254   0.0023769   1.483 0.138017    
## timbre_10_min              0.0029934   0.0018037   1.660 0.097004 .  
## timbre_10_max              0.0073666   0.0017314   4.255 2.09e-05 ***
## timbre_11_min             -0.0283702   0.0036300  -7.815 5.48e-15 ***
## timbre_11_max              0.0182939   0.0033405   5.476 4.34e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6017.5  on 7200  degrees of freedom
## Residual deviance: 4871.8  on 7168  degrees of freedom
## AIC: 4937.8
## 
## Number of Fisher Scoring iterations: 6

The coefficient estimate for energy is positive in Model 2, suggesting that songs with higher energy levels tend to be more popular. However, note that the variable energy is not significant in this model.

Now, create Model 3, which should be exactly like Model 1, but without the variable “energy”. Do we make the same observation about the popularity of heavy instrumentation as we did with Model 2?

# Model 3: logistic regression excluding energy
SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family=binomial)
# Output summary
summary(SongsLog3)
## 
## Call:
## glm(formula = Top10 ~ . - energy, family = binomial, data = SongsTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9182  -0.5417  -0.3481  -0.1874   3.4171  
## 
## Coefficients:
##                              Estimate   Std. Error z value Pr(>|z|)    
## (Intercept)               11.96056207   1.71419468   6.977 3.01e-12 ***
## timesignature              0.11509425   0.08726155   1.319 0.187183    
## timesignature_confidence   0.71426976   0.19461751   3.670 0.000242 ***
## loudness                   0.23055652   0.02527983   9.120  < 2e-16 ***
## tempo                     -0.00064600   0.00166547  -0.388 0.698107    
## tempo_confidence           0.38409299   0.13983499   2.747 0.006019 ** 
## key                        0.01649459   0.01035139   1.593 0.111056    
## key_confidence             0.33940638   0.14087438   2.409 0.015984 *  
## pitch                    -53.28405750   6.73285437  -7.914 2.49e-15 ***
## timbre_0_min               0.02204524   0.00423942   5.200 1.99e-07 ***
## timbre_0_max              -0.31048005   0.02536544 -12.240  < 2e-16 ***
## timbre_1_min               0.00541597   0.00076427   7.086 1.38e-12 ***
## timbre_1_max              -0.00051146   0.00071101  -0.719 0.471928    
## timbre_2_min              -0.00225435   0.00112029  -2.012 0.044190 *  
## timbre_2_max               0.00041189   0.00090196   0.457 0.647915    
## timbre_3_min               0.00031786   0.00058687   0.542 0.588083    
## timbre_3_max              -0.00296369   0.00057576  -5.147 2.64e-07 ***
## timbre_4_min               0.01104648   0.00197793   5.585 2.34e-08 ***
## timbre_4_max               0.00646679   0.00154132   4.196 2.72e-05 ***
## timbre_5_min              -0.00513453   0.00126897  -4.046 5.21e-05 ***
## timbre_5_max               0.00029790   0.00078555   0.379 0.704526    
## timbre_6_min              -0.01784468   0.00224605  -7.945 1.94e-15 ***
## timbre_6_max               0.00344687   0.00218214   1.580 0.114203    
## timbre_7_min              -0.00512843   0.00176848  -2.900 0.003733 ** 
## timbre_7_max              -0.00339351   0.00181976  -1.865 0.062208 .  
## timbre_8_min               0.00368609   0.00283309   1.301 0.193229    
## timbre_8_max               0.00465780   0.00298790   1.559 0.119022    
## timbre_9_min              -0.00009318   0.00295687  -0.032 0.974859    
## timbre_9_max               0.00134171   0.00242391   0.554 0.579900    
## timbre_10_min              0.00405001   0.00182697   2.217 0.026637 *  
## timbre_10_max              0.00579252   0.00175858   3.294 0.000988 ***
## timbre_11_min             -0.02637666   0.00368292  -7.162 7.96e-13 ***
## timbre_11_max              0.01983605   0.00336460   5.896 3.74e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6017.5  on 7200  degrees of freedom
## Residual deviance: 4782.7  on 7168  degrees of freedom
## AIC: 4848.7
## 
## Number of Fisher Scoring iterations: 6

Yes. In Model 3, loudness has a positive (and highly significant) coefficient estimate, so the model again suggests that songs with heavier instrumentation tend to be more popular, which agrees with the conclusion we drew from the energy coefficient in Model 2.
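
Since the summaries above report AIC values of 4827.2, 4937.8, and 4848.7 for Models 1, 2, and 3 respectively, a quick side-by-side comparison on the training data can also be produced directly (sketch; lower AIC is better):

# Compare the three models by AIC
AIC(SongsLog1, SongsLog2, SongsLog3)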

Validating our Model

Make predictions on the test set using Model 3.

What is the accuracy of Model 3 on the test set, using a threshold of 0.45?

# Make predictions
testPredict = predict(SongsLog3, newdata=SongsTest, type="response")
# Confusion matrix: actual Top10 outcome vs. predictions at a 0.45 threshold
z = table(SongsTest$Top10, testPredict >= 0.45)
kable(z)
|   | FALSE| TRUE|
|:--|-----:|----:|
|0  |   309|    5|
|1  |    40|   19|
# Compute Accuracy
sum(diag(z))/sum(z)
## [1] 0.8793566

Accuracy = 0.87936
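
Equivalently, the accuracy can be computed without building the table first (sketch):

# Accuracy as the fraction of correct classifications at the 0.45 threshold
mean((testPredict >= 0.45) == (SongsTest$Top10 == 1))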

What would the accuracy of the baseline model be on the test set?

# Baseline model: always predict "not a Top 10 hit"
z = table(SongsTest$Top10)
kable(z)
|Var1 | Freq|
|:----|----:|
|0    |  314|
|1    |   59|
# Compute Accuracy
z[1]/sum(z)
##         0 
## 0.8418231

Accuracy = 0.8418231
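
Because the baseline simply predicts "not a Top 10 hit" for every song, its accuracy is just the proportion of non-hits in the test set (sketch):

# Baseline accuracy: proportion of test-set songs that did not reach the Top 10
mean(SongsTest$Top10 == 0)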

How many songs does Model 3 correctly predict as Top 10 hits in 2010 (remember that all songs in 2010 went into our test set), using a threshold of 0.45?

# Confusion matrix on the test set at the 0.45 threshold
z = table(SongsTest$Top10, testPredict >= 0.45)
kable(z)
|   | FALSE| TRUE|
|:--|-----:|----:|
|0  |   309|    5|
|1  |    40|   19|

19 songs.

How many non-hit songs does Model 3 predict will be Top 10 hits (again, looking at the test set), using a threshold of 0.45?

# Confusion matrix on the test set at the 0.45 threshold
z = table(SongsTest$Top10, testPredict >= 0.45)
kable(z)
|   | FALSE| TRUE|
|:--|-----:|----:|
|0  |   309|    5|
|1  |    40|   19|

5 songs.

What is the sensitivity of Model 3 on the test set, using a threshold of 0.45?

# Tabulate confusion matrix
z = table(SongsTest$Top10, testPredict >= 0.45)
kable(z)
|   | FALSE| TRUE|
|:--|-----:|----:|
|0  |   309|    5|
|1  |    40|   19|
# Compute sensitivity = true positives / (true positives + false negatives)
z[2,2]/(z[2,1] + z[2,2])
## [1] 0.3220339

Sensitivity = 0.3220339

What is the specificity of Model 3 on the test set, using a threshold of 0.45?

# Tabulate confusion matrix
z = table(SongsTest$Top10, testPredict >= 0.45)
kable(z)
|   | FALSE| TRUE|
|:--|-----:|----:|
|0  |   309|    5|
|1  |    40|   19|
# Compute specificity = true negatives / (true negatives + false positives)
z[1,1]/(z[1,1] + z[1,2])
## [1] 0.9840764

Specificity = 0.9840764
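
Since accuracy, sensitivity, and specificity all come from the same confusion matrix, a small helper function (a hypothetical convenience, not part of the assignment) keeps the arithmetic in one place:

# Hypothetical helper: confusion-matrix metrics at a given threshold
# (assumes both outcome classes and both prediction outcomes appear in the table)
confusionStats = function(actual, predictedProb, threshold) {
  cm = table(actual, predictedProb >= threshold)
  TN = cm["0", "FALSE"]; FP = cm["0", "TRUE"]
  FN = cm["1", "FALSE"]; TP = cm["1", "TRUE"]
  c(accuracy    = (TP + TN) / sum(cm),
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP))
}
# Reproduces the three numbers above for Model 3 at the 0.45 threshold
confusionStats(SongsTest$Top10, testPredict, 0.45)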

What conclusions can you make about our model?

Model 3 strongly favors specificity over sensitivity: it rarely predicts that a song will be a hit, but when it does, it is usually right. Although it captures fewer than half of the actual Top 10 songs in the test set, such a conservative model can still give a record label a competitive edge, since the releases it does flag as likely hits are good bets to support.
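
To see how this sensitivity/specificity trade-off varies with the threshold, one optional follow-up (a sketch assuming the ROCR package is installed) is to plot the ROC curve and compute the AUC for Model 3 on the test set:

# Optional: ROC curve and AUC for Model 3 on the test set (requires the ROCR package)
library(ROCR)
predROCR = prediction(testPredict, SongsTest$Top10)
perfROCR = performance(predROCR, "tpr", "fpr")
plot(perfROCR, colorize = TRUE)                      # ROC curve, colored by threshold
as.numeric(performance(predROCR, "auc")@y.values)    # area under the curve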