Quick Question 1
In R, create a logistic regression model to predict “PoorCare” using the independent variables “StartedOnCombination” and “ProviderCount”. Use the training set we created in the previous video to build the model.
Note: If you haven’t already loaded and split the data in R, please run these commands in your R console to load and split the data set. Remember to first navigate to the directory where you have saved “quality.csv”.
quality = read.csv("quality.csv")
# install.packages("caTools")
library(caTools)
set.seed(88)
# Both training and test dataset have around 75% "good quality care" cases
split = sample.split(quality$PoorCare, SplitRatio = 0.75)
qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)
# Logistic regression model
QualityLog = glm(PoorCare ~ StartedOnCombination + ProviderCount, data=qualityTrain, family=binomial)
summary(QualityLog)
##
## Call:
## glm(formula = PoorCare ~ StartedOnCombination + ProviderCount,
## family = binomial, data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.61826 -0.72782 -0.64555 -0.08407 1.94662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.00097 0.55097 -3.632 0.000282 ***
## StartedOnCombinationTRUE 1.95230 1.22342 1.596 0.110541
## ProviderCount 0.03366 0.01983 1.697 0.089706 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.89 on 98 degrees of freedom
## Residual deviance: 104.37 on 96 degrees of freedom
## AIC: 110.37
##
## Number of Fisher Scoring iterations: 4
What is the coefficient for “StartedOnCombination”?
1.9523
Explanation
To construct this model in R, use the command: Model = glm(PoorCare ~ StartedOnCombination + ProviderCount, data=qualityTrain, family=binomial). If you look at the output of summary(Model), the value of the coefficient (Estimate) for StartedOnCombination is 1.95230.
Quick Question 2
StartedOnCombination is a binary variable, which equals 1 if the patient is started on a combination of drugs to treat their diabetes, and equals 0 if the patient is not started on a combination of drugs. All else being equal, does this model imply that starting a patient on a combination of drugs is indicative of poor care, or good care?
Poor Care
Explanation
The coefficient value is positive, meaning that positive values of the variable make the outcome of 1 more likely. This corresponds to Poor Care.
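As a quick check of this interpretation (a sketch; it assumes the QualityLog model fitted above), the coefficient can be converted into an odds ratio:
# Holding ProviderCount fixed, starting on a combination of drugs multiplies the odds of PoorCare by exp(coefficient)
exp(1.95230)  # approximately 7.04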
Quick Question 1
Given this ROC curve, which threshold would you pick if you wanted to correctly identify a small group of patients who are receiving the worst care with high confidence?
t = 0.7
Explanation
The threshold 0.7 is best to identify a small group of patients who are receiving the worst care with high confidence, since at this threshold we make very few false positive mistakes, and identify about 35% of the true positives. The threshold t = 0.8 is not a good choice, since it makes about the same number of false positives, but only identifies 10% of the true positives. The thresholds 0.2 and 0.3 both identify more of the true positives, but they make more false positive mistakes, so our confidence decreases.
Quick Question 2
Which threshold would you pick if you wanted to correctly identify half of the patients receiving poor care, while making as few errors as possible?
t = 0.3
Explanation
The threshold 0.3 is the best choice in this scenario. The threshold 0.2 also identifies over half of the patients receiving poor care, but it makes many more false positive mistakes. The thresholds 0.7 and 0.8 don’t identify at least half of the patients receiving poor care.
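The ROC curve referenced in these two questions can be reproduced with the ROCR package. The sketch below assumes the lecture's training-set model built from "OfficeVisits" and "Narcotics" (the same model refit in the next question) and the qualityTrain data frame from above:
library(ROCR)
# Refit the lecture's model and compute its predictions on the training set
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
predictTrain = predict(QualityLog, type="response")
# Build the ROC curve and label thresholds from 0 to 1 in steps of 0.1
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
ROCRperf = performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,1.7))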
Quick Question 1
IMPORTANT NOTE: This question uses the original model with the independent variables “OfficeVisits” and “Narcotics”. Be sure to use this model, instead of the model you built in Quick Question 4.
Compute the test set predictions in R by running the command:
predictTest = predict(QualityLog, type="response", newdata=qualityTest)
You can compute the test set AUC by running the following two commands in R:
ROCRpredTest = prediction(predictTest, qualityTest$PoorCare)
auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)
What is the AUC of this model on the test set?
The AUC of a model has the following nice interpretation: given a random patient from the dataset who actually received poor care, and a random patient from the dataset who actually received good care, the AUC is the percentage of time that our model will correctly classify which is which.
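As an informal check of this interpretation (a sketch; it assumes the test-set predictions predictTest from the OfficeVisits/Narcotics model, as computed in the code below), you can compare the predicted probabilities for every pair of one poor-care and one good-care patient in the test set:
# Fraction of (poor care, good care) pairs where the poor-care patient gets the higher predicted probability, counting ties as half
poorProbs = predictTest[qualityTest$PoorCare == 1]
goodProbs = predictTest[qualityTest$PoorCare == 0]
mean(outer(poorProbs, goodProbs, ">")) + 0.5 * mean(outer(poorProbs, goodProbs, "=="))  # should match the AUC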
library(ROCR)
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
summary(QualityLog)
##
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
## data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06303 -0.63155 -0.50503 -0.09689 2.16686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64613 0.52357 -5.054 4.33e-07 ***
## OfficeVisits 0.08212 0.03055 2.688 0.00718 **
## Narcotics 0.07630 0.03205 2.381 0.01728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.888 on 98 degrees of freedom
## Residual deviance: 89.127 on 96 degrees of freedom
## AIC: 95.127
##
## Number of Fisher Scoring iterations: 4
predictTest = predict(QualityLog, type="response", newdata=qualityTest)
# compute the test set AUC
ROCRpredTest = prediction(predictTest, qualityTest$PoorCare)
auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)
Quick Question 1
In the previous video, we computed the following confusion matrix for our logistic regression model on our test set with a threshold of 0.5:
| Actual/Prediction | FALSE | TRUE |
|---|---|---|
| 0 | 1069 | 6 |
| 1 | 187 | 11 |
Using this confusion matrix, answer the following questions.
What is the sensitivity of our logistic regression model on the test set, using a threshold of 0.5?
sens = 11 / (11+187)
sens
## [1] 0.05555556
What is the specificity of our logistic regression model on the test set, using a threshold of 0.5?
spec = 1069 / (1069+6)
spec
## [1] 0.9944186
Quick Question 1
For which of the following models should external validation be used? Consider both the population used to train the model, and the population that the model will be used on. (Select all that apply.)
Answer:
A model to predict obesity risk. Data from a random sample of California residents was used to build the model, and we want to use the model to predict the obesity risk of all United States residents.
A model to predict the probability of a runner winning a marathon. Data from all runners in the Boston Marathon was used to build the model, and we want to use the model to predict the probability of winning for all people who run marathons.
The music industry has a well-developed market with a global annual revenue around $15 billion. The recording industry is highly competitive and is dominated by three big production companies which make up nearly 82% of the total annual album sales.
Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.
Unfortunately, the success of an artist’s release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and digital downloads, while another single may turn out quite unpopular, and therefore unprofitable.
Knowing the competitive nature of the recording industry, record labels face the fundamental decision problem of which musical releases to support to maximize their financial success.
How can we use analytics to predict the popularity of a song? In this assignment, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.
Taking an analytics approach, we aim to use information about a song’s properties to predict its popularity. The dataset songs.csv consists of all songs which made it to the Top 10 of the Billboard Hot 100 Chart from 1990-2010 plus a sample of additional songs that didn’t make the Top 10. This data comes from three sources: Wikipedia, Billboard.com, and EchoNest.
The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.
Here’s a detailed description of the variables:
year = the year the song was released
songtitle = the title of the song
artistname = the name of the artist of the song
songID and artistID = identifying variables for the song and artist
timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate
loudness = a continuous variable indicating the average amplitude of the audio in decibels
tempo and tempo_confidence = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate
key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, . . ., B), and the confidence in the estimate
energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch = a continuous variable that indicates the pitch of the song
timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, . . . , timbre_11_min, and timbre_11_max = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)
Use the read.csv function to load the dataset “songs.csv” into R.
How many observations (songs) are from the year 2010?
songs <- read.csv("songs.csv")
str(songs)
## 'data.frame': 7574 obs. of 39 variables:
## $ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ songtitle : Factor w/ 7141 levels "̈́ l'or_e des bois",..: 6204 5522 241 3115 48 608 255 4419 2886 6756 ...
## $ artistname : Factor w/ 1032 levels "50 Cent","98 Degrees",..: 3 3 3 3 3 3 3 3 3 12 ...
## $ songID : Factor w/ 7549 levels "SOAACNI1315CD4AC42",..: 595 5439 5252 1716 3431 1020 1831 3964 6904 2473 ...
## $ artistID : Factor w/ 1047 levels "AR00B1I1187FB433EB",..: 671 671 671 671 671 671 671 671 671 507 ...
## $ timesignature : int 3 4 4 4 4 4 4 4 4 4 ...
## $ timesignature_confidence: num 0.853 1 1 1 0.788 1 0.968 0.861 0.622 0.938 ...
## $ loudness : num -4.26 -4.05 -3.57 -3.81 -4.71 ...
## $ tempo : num 91.5 140 160.5 97.5 140.1 ...
## $ tempo_confidence : num 0.953 0.921 0.489 0.794 0.286 0.347 0.273 0.83 0.018 0.929 ...
## $ key : int 11 10 2 1 6 4 10 5 9 11 ...
## $ key_confidence : num 0.453 0.469 0.209 0.632 0.483 0.627 0.715 0.423 0.751 0.602 ...
## $ energy : num 0.967 0.985 0.99 0.939 0.988 ...
## $ pitch : num 0.024 0.025 0.026 0.013 0.063 0.038 0.026 0.033 0.027 0.004 ...
## $ timbre_0_min : num 0.002 0 0.003 0 0 ...
## $ timbre_0_max : num 57.3 57.4 57.4 57.8 56.9 ...
## $ timbre_1_min : num -6.5 -37.4 -17.2 -32.1 -223.9 ...
## $ timbre_1_max : num 171 171 171 221 171 ...
## $ timbre_2_min : num -81.7 -149.6 -72.9 -138.6 -147.2 ...
## $ timbre_2_max : num 95.1 180.3 157.9 173.4 166 ...
## $ timbre_3_min : num -285 -380.1 -204 -73.5 -128.1 ...
## $ timbre_3_max : num 259 384 251 373 389 ...
## $ timbre_4_min : num -40.4 -48.7 -66 -55.6 -43.9 ...
## $ timbre_4_max : num 73.6 100.4 152.1 119.2 99.3 ...
## $ timbre_5_min : num -104.7 -87.3 -98.7 -77.5 -96.1 ...
## $ timbre_5_max : num 183.1 42.8 141.4 141.2 38.3 ...
## $ timbre_6_min : num -88.8 -86.9 -88.9 -70.8 -110.8 ...
## $ timbre_6_max : num 73.5 75.5 66.5 64.5 72.4 ...
## $ timbre_7_min : num -71.1 -65.8 -67.4 -63.7 -55.9 ...
## $ timbre_7_max : num 82.5 106.9 80.6 96.7 110.3 ...
## $ timbre_8_min : num -52 -61.3 -59.8 -78.7 -56.5 ...
## $ timbre_8_max : num 39.1 35.4 46 41.1 37.6 ...
## $ timbre_9_min : num -35.4 -81.9 -46.3 -49.2 -48.6 ...
## $ timbre_9_max : num 71.6 74.6 59.9 95.4 67.6 ...
## $ timbre_10_min : num -126.4 -103.8 -108.3 -102.7 -52.8 ...
## $ timbre_10_max : num 18.7 121.9 33.3 46.4 22.9 ...
## $ timbre_11_min : num -44.8 -38.9 -43.7 -59.4 -50.4 ...
## $ timbre_11_max : num 26 22.5 25.7 37.1 32.8 ...
## $ Top10 : int 0 0 0 0 0 0 0 0 0 1 ...
nrow(subset(songs, year == 2010))
## [1] 373
table(songs$year)
##
## 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
## 328 196 186 324 198 258 178 329 380 357 363 282 518 434 479
## 2005 2006 2007 2008 2009 2010
## 392 479 622 415 483 373
Explanation
First, navigate to the directory on your computer containing the file “songs.csv”. You can load the dataset by using the command: songs = read.csv("songs.csv"). Then, you can count the number of songs from 2010 by using the table function: table(songs$year).
How many songs does the dataset include for which the artist name is “Michael Jackson”?
MichaelJackson <- subset(songs, artistname == "Michael Jackson")
length(unique(MichaelJackson$songID))
## [1] 18
Which of these songs by Michael Jackson made it to the Top 10? Select all that apply.
MJ_top10 <- subset(MichaelJackson, Top10 == 1)
MJ_top10$songtitle
## [1] You Rock My World You Are Not Alone Black or White Remember the Time
## [5] In The Closet
## 7141 Levels: ̈́ l'or_e des bois _\x84_ _\x84\x8d ... Zumbi
MichaelJackson[c("songtitle", "Top10")]
## songtitle Top10
## 4329 You Rock My World 1
## 6205 She's Out of My Life 0
## 6206 Wanna Be Startin' Somethin' 0
## 6207 You Are Not Alone 1
## 6208 Billie Jean 0
## 6209 The Way You Make Me Feel 0
## 6210 Black or White 1
## 6211 Rock with You 0
## 6212 Bad 0
## 6213 I Just Can't Stop Loving You 0
## 6214 Man in the Mirror 0
## 6215 Thriller 0
## 6216 Beat It 0
## 6217 The Girl Is Mine 0
## 6218 Remember the Time 1
## 6219 Don't Stop 'Til You Get Enough 0
## 6220 Heal the World 0
## 6915 In The Closet 1
The variable corresponding to the estimated time signature (timesignature) is discrete, meaning that it only takes integer values (0, 1, 2, 3, . . . ). What are the values of this variable that occur in our dataset? Select all that apply.
unique(songs$timesignature)
## [1] 3 4 5 7 1 0
Which timesignature value is the most frequent among songs in our dataset?
table(songs$timesignature)
##
## 0 1 3 4 5 7
## 10 143 503 6787 112 19
Out of all of the songs in our dataset, the song with the highest tempo is one of the following songs. Which one is it?
max_tempo <- max(songs$tempo)
subset(songs, tempo == max_tempo)$songtitle
## [1] Wanna Be Startin' Somethin'
## 7141 Levels: ̈́ l'or_e des bois _\x84_ _\x84\x8d ... Zumbi
Explanation
You can answer this question by using the which.max function. The output of which.max(songs$tempo) is 6206, meaning that the song with the highest tempo is in row 6206. We can output the song title by typing: songs$songtitle[6206]. The song title is: Wanna Be Startin’ Somethin’.
We wish to predict whether or not a song will make it to the Top 10. To do this, first use the subset function to split the data into a training set “SongsTrain” consisting of all the observations up to and including 2009 song releases, and a testing set “SongsTest”, consisting of the 2010 song releases.
How many observations (songs) are in the training set?
SongsTrain <- subset(songs, year <= 2009)
nrow(SongsTrain)
## [1] 7201
SongsTest <- subset(songs, year == 2010)
In this problem, our outcome variable is “Top10” - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. Since the outcome variable is binary, we will build a logistic regression model. We’ll start by using all song attributes as our independent variables, which we’ll call Model 1.
We will only use the variables in our dataset that describe the numerical attributes of the song in our logistic regression model. So we won’t use the variables “year”, “songtitle”, “artistname”, “songID” or “artistID”.
We have seen in the lecture that, to build the logistic regression model, we would normally explicitly input the formula including all the independent variables in R. However, in this case, this is a tedious amount of work since we have a large number of independent variables.
There is a nice trick to avoid doing so. Let’s suppose that, except for the outcome variable Top10, all other variables in the training set are inputs to Model 1. Then, we can use the formula
SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)
to build our model. Notice that the “.” is used in place of enumerating all the independent variables. (Also, keep in mind that you can choose to put quotes around binomial, or leave out the quotes. R can understand this argument either way.)
However, in our case, we want to exclude some of the variables in our dataset from being used as independent variables (“year”, “songtitle”, “artistname”, “songID”, and “artistID”). To do this, we can use the following trick. First define a vector of variable names called nonvars - these are the variables that we won’t use in our model.
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
To remove these variables from your training and testing sets, type the following commands in your R console:
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
Now, use the glm function to build a logistic regression model to predict Top10 using all of the other variables as the independent variables. You should use SongsTrain to build the model.
Looking at the summary of your model, what is the value of the Akaike Information Criterion (AIC)?
# Variables that we won't use in the model
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
# Remove these variables from the training and testing sets
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
# Build logistic regression model
str(SongsTrain)
## 'data.frame': 7201 obs. of 34 variables:
## $ timesignature : int 3 3 4 4 4 4 4 4 4 4 ...
## $ timesignature_confidence: num 0.732 0.906 0.987 0.822 0.983 1 0.821 0.997 0.816 1 ...
## $ loudness : num -6.32 -9.54 -4.84 -5.27 -6.23 ...
## $ tempo : num 89.6 117.7 119 71.5 77.5 ...
## $ tempo_confidence : num 0.652 0.542 0.838 0.613 0.74 0.821 0.912 0.609 0.786 0.27 ...
## $ key : int 1 0 6 4 8 9 6 9 0 9 ...
## $ key_confidence : num 0.773 0.722 0.106 0.781 0.552 0.218 0.275 0.333 0.634 0.578 ...
## $ energy : num 0.599 0.363 0.76 0.755 0.524 ...
## $ pitch : num 0.004 0.006 0.003 0.014 0.008 0.012 0.002 0.003 0.001 0.006 ...
## $ timbre_0_min : num 0 0.739 0 0 0 ...
## $ timbre_0_max : num 57.8 57.1 57.8 58.3 57.6 ...
## $ timbre_1_min : num -62.3 -220.2 -189.7 -113.9 -160.6 ...
## $ timbre_1_max : num 286 241 187 171 217 ...
## $ timbre_2_min : num -81.8 -96.8 -139.1 -71.6 -79.5 ...
## $ timbre_2_max : num 211 215 135 195 114 ...
## $ timbre_3_min : num -217 -202 -116 -276 -184 ...
## $ timbre_3_max : num 203.2 124.2 94.7 146.3 108.7 ...
## $ timbre_4_min : num -55.9 -52.4 -55.6 -59.4 -31.9 ...
## $ timbre_4_max : num 97.6 131.9 79.3 121.7 169.7 ...
## $ timbre_5_min : num -62.5 -73.9 -73.5 -71.1 -73 ...
## $ timbre_5_max : num 82.2 73.6 41 39.6 233.9 ...
## $ timbre_6_min : num -82.1 -63.5 -41.5 -77.8 -76 ...
## $ timbre_6_max : num 59.2 70.1 62.8 94.5 58 ...
## $ timbre_7_min : num -109.4 -90.1 -69.3 -69.1 -78.8 ...
## $ timbre_7_max : num 71 112.9 90.4 93.4 100.8 ...
## $ timbre_8_min : num -71.8 -64.5 -52.5 -55.8 -61.4 ...
## $ timbre_8_max : num 58.4 58.1 40.7 79 50.3 ...
## $ timbre_9_min : num -53.8 -76.9 -50.4 -51.5 -63 ...
## $ timbre_9_max : num 88.6 74.4 58.8 70.5 96.8 ...
## $ timbre_10_min : num -89.8 -88.2 -78.2 -74.9 -90.4 ...
## $ timbre_10_max : num 38 42.2 35.3 30.8 60.5 ...
## $ timbre_11_min : num -52.1 -66.8 -54.2 -51.4 -52.1 ...
## $ timbre_11_max : num 52.8 40.7 46.5 27.8 48.1 ...
## $ Top10 : int 0 0 0 0 0 0 0 0 0 0 ...
SongsLog <- glm(Top10 ~ ., data = SongsTrain, family=binomial)
summary(SongsLog)
##
## Call:
## glm(formula = Top10 ~ ., family = binomial, data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9220 -0.5399 -0.3459 -0.1845 3.0770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.470e+01 1.806e+00 8.138 4.03e-16 ***
## timesignature 1.264e-01 8.674e-02 1.457 0.145050
## timesignature_confidence 7.450e-01 1.953e-01 3.815 0.000136 ***
## loudness 2.999e-01 2.917e-02 10.282 < 2e-16 ***
## tempo 3.634e-04 1.691e-03 0.215 0.829889
## tempo_confidence 4.732e-01 1.422e-01 3.329 0.000873 ***
## key 1.588e-02 1.039e-02 1.529 0.126349
## key_confidence 3.087e-01 1.412e-01 2.187 0.028760 *
## energy -1.502e+00 3.099e-01 -4.847 1.25e-06 ***
## pitch -4.491e+01 6.835e+00 -6.570 5.02e-11 ***
## timbre_0_min 2.316e-02 4.256e-03 5.441 5.29e-08 ***
## timbre_0_max -3.310e-01 2.569e-02 -12.882 < 2e-16 ***
## timbre_1_min 5.881e-03 7.798e-04 7.542 4.64e-14 ***
## timbre_1_max -2.449e-04 7.152e-04 -0.342 0.732087
## timbre_2_min -2.127e-03 1.126e-03 -1.889 0.058843 .
## timbre_2_max 6.586e-04 9.066e-04 0.726 0.467571
## timbre_3_min 6.920e-04 5.985e-04 1.156 0.247583
## timbre_3_max -2.967e-03 5.815e-04 -5.103 3.34e-07 ***
## timbre_4_min 1.040e-02 1.985e-03 5.237 1.63e-07 ***
## timbre_4_max 6.110e-03 1.550e-03 3.942 8.10e-05 ***
## timbre_5_min -5.598e-03 1.277e-03 -4.385 1.16e-05 ***
## timbre_5_max 7.736e-05 7.935e-04 0.097 0.922337
## timbre_6_min -1.686e-02 2.264e-03 -7.445 9.66e-14 ***
## timbre_6_max 3.668e-03 2.190e-03 1.675 0.093875 .
## timbre_7_min -4.549e-03 1.781e-03 -2.554 0.010661 *
## timbre_7_max -3.774e-03 1.832e-03 -2.060 0.039408 *
## timbre_8_min 3.911e-03 2.851e-03 1.372 0.170123
## timbre_8_max 4.011e-03 3.003e-03 1.336 0.181620
## timbre_9_min 1.367e-03 2.998e-03 0.456 0.648356
## timbre_9_max 1.603e-03 2.434e-03 0.659 0.510188
## timbre_10_min 4.126e-03 1.839e-03 2.244 0.024852 *
## timbre_10_max 5.825e-03 1.769e-03 3.292 0.000995 ***
## timbre_11_min -2.625e-02 3.693e-03 -7.108 1.18e-12 ***
## timbre_11_max 1.967e-02 3.385e-03 5.811 6.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4759.2 on 7167 degrees of freedom
## AIC: 4827.2
##
## Number of Fisher Scoring iterations: 6
Let’s now think about the variables in our dataset related to the confidence of the time signature, key and tempo (timesignature_confidence, key_confidence, and tempo_confidence). Our model seems to indicate that these confidence variables are significant (rather than the variables timesignature, key and tempo themselves). What does the model suggest?
Answer
The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10.
Explanation
If you look at the output of summary(model), where model is the name of your logistic regression model, you can see that the coefficient estimates for the confidence variables (timesignature_confidence, key_confidence, and tempo_confidence) are positive. This means that higher confidence leads to a higher predicted probability of a Top 10 hit.
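You can pull these three estimates directly out of the fitted model (a sketch; it assumes the SongsLog model built above):
# All three confidence coefficients are positive
coef(SongsLog)[c("timesignature_confidence", "key_confidence", "tempo_confidence")]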
In general, if the confidence is low for the time signature, tempo, and key, then the song is more likely to be complex. What does Model 1 suggest in terms of complexity?
Answer
Mainstream listeners tend to prefer less complex songs
Explanation
Since the coefficient values for timesignature_confidence, tempo_confidence, and key_confidence are all positive, lower confidence leads to a lower predicted probability of a song being a hit. So mainstream listeners tend to prefer less complex songs.
Songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”).
By inspecting the coefficient of the variable “loudness”, what does Model 1 suggest?
Answer
Mainstream listeners prefer songs with heavy instrumentation.
By inspecting the coefficient of the variable “energy”, do we draw the same conclusions as above?
Answer
No
Explanation
The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs, which are those with heavier instrumentation. However, the coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those with light instrumentation. These coefficients lead us to different conclusions!
What is the correlation between the variables “loudness” and “energy” in the training set?
cor(SongsTrain$loudness, SongsTrain$energy)
## [1] 0.7399067
Given that these two variables are highly correlated, Model 1 suffers from multicollinearity. To avoid this issue, we will omit one of these two variables and rerun the logistic regression. In the rest of this problem, we’ll build two variations of our original model: Model 2, in which we keep “energy” and omit “loudness”, and Model 3, in which we keep “loudness” and omit “energy”.
Create Model 2, which is Model 1 without the independent variable “loudness”. This can be done with the following command:
SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)
We just subtracted the variable loudness. We couldn’t do this with the variables “songtitle” and “artistname”, because they are not numeric variables, and the test set might contain values that the training set has never seen. But this approach (subtracting the variable from the model formula) will always work when you want to remove numeric variables.
Look at the summary of SongsLog2, and inspect the coefficient of the variable “energy”. What do you observe?
SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)
summary(SongsLog2)
##
## Call:
## glm(formula = Top10 ~ . - loudness, family = binomial, data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0983 -0.5607 -0.3602 -0.1902 3.3107
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.241e+00 7.465e-01 -3.002 0.002686 **
## timesignature 1.625e-01 8.734e-02 1.860 0.062873 .
## timesignature_confidence 6.885e-01 1.924e-01 3.578 0.000346 ***
## tempo 5.521e-04 1.665e-03 0.332 0.740226
## tempo_confidence 5.497e-01 1.407e-01 3.906 9.40e-05 ***
## key 1.740e-02 1.026e-02 1.697 0.089740 .
## key_confidence 2.954e-01 1.394e-01 2.118 0.034163 *
## energy 1.813e-01 2.608e-01 0.695 0.486991
## pitch -5.150e+01 6.857e+00 -7.511 5.87e-14 ***
## timbre_0_min 2.479e-02 4.240e-03 5.847 5.01e-09 ***
## timbre_0_max -1.007e-01 1.178e-02 -8.551 < 2e-16 ***
## timbre_1_min 7.143e-03 7.710e-04 9.265 < 2e-16 ***
## timbre_1_max -7.830e-04 7.064e-04 -1.108 0.267650
## timbre_2_min -1.579e-03 1.109e-03 -1.424 0.154531
## timbre_2_max 3.889e-04 8.964e-04 0.434 0.664427
## timbre_3_min 6.500e-04 5.949e-04 1.093 0.274524
## timbre_3_max -2.462e-03 5.674e-04 -4.339 1.43e-05 ***
## timbre_4_min 9.115e-03 1.952e-03 4.670 3.02e-06 ***
## timbre_4_max 6.306e-03 1.532e-03 4.115 3.87e-05 ***
## timbre_5_min -5.641e-03 1.255e-03 -4.495 6.95e-06 ***
## timbre_5_max 6.937e-04 7.807e-04 0.889 0.374256
## timbre_6_min -1.612e-02 2.235e-03 -7.214 5.45e-13 ***
## timbre_6_max 3.814e-03 2.157e-03 1.768 0.076982 .
## timbre_7_min -5.102e-03 1.755e-03 -2.907 0.003644 **
## timbre_7_max -3.158e-03 1.811e-03 -1.744 0.081090 .
## timbre_8_min 4.488e-03 2.810e-03 1.597 0.110254
## timbre_8_max 6.423e-03 2.950e-03 2.177 0.029497 *
## timbre_9_min -4.282e-04 2.955e-03 -0.145 0.884792
## timbre_9_max 3.525e-03 2.377e-03 1.483 0.138017
## timbre_10_min 2.993e-03 1.804e-03 1.660 0.097004 .
## timbre_10_max 7.367e-03 1.731e-03 4.255 2.09e-05 ***
## timbre_11_min -2.837e-02 3.630e-03 -7.815 5.48e-15 ***
## timbre_11_max 1.829e-02 3.341e-03 5.476 4.34e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4871.8 on 7168 degrees of freedom
## AIC: 4937.8
##
## Number of Fisher Scoring iterations: 6
Model 2 suggests that songs with high energy levels tend to be more popular. This contradicts our observation in Model 1.
Explanation
The coefficient estimate for energy is positive in Model 2, suggesting that songs with higher energy levels tend to be more popular. However, note that the variable energy is not significant in this model.
Now, create Model 3, which should be exactly like Model 1, but without the variable “energy”.
Look at the summary of Model 3 and inspect the coefficient of the variable “loudness”. Remembering that higher loudness and energy both occur in songs with heavier instrumentation, do we make the same observation about the popularity of heavy instrumentation as we did with Model 2?
SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family=binomial)
summary(SongsLog3)
##
## Call:
## glm(formula = Top10 ~ . - energy, family = binomial, data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9182 -0.5417 -0.3481 -0.1874 3.4171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.196e+01 1.714e+00 6.977 3.01e-12 ***
## timesignature 1.151e-01 8.726e-02 1.319 0.187183
## timesignature_confidence 7.143e-01 1.946e-01 3.670 0.000242 ***
## loudness 2.306e-01 2.528e-02 9.120 < 2e-16 ***
## tempo -6.460e-04 1.665e-03 -0.388 0.698107
## tempo_confidence 3.841e-01 1.398e-01 2.747 0.006019 **
## key 1.649e-02 1.035e-02 1.593 0.111056
## key_confidence 3.394e-01 1.409e-01 2.409 0.015984 *
## pitch -5.328e+01 6.733e+00 -7.914 2.49e-15 ***
## timbre_0_min 2.205e-02 4.239e-03 5.200 1.99e-07 ***
## timbre_0_max -3.105e-01 2.537e-02 -12.240 < 2e-16 ***
## timbre_1_min 5.416e-03 7.643e-04 7.086 1.38e-12 ***
## timbre_1_max -5.115e-04 7.110e-04 -0.719 0.471928
## timbre_2_min -2.254e-03 1.120e-03 -2.012 0.044190 *
## timbre_2_max 4.119e-04 9.020e-04 0.457 0.647915
## timbre_3_min 3.179e-04 5.869e-04 0.542 0.588083
## timbre_3_max -2.964e-03 5.758e-04 -5.147 2.64e-07 ***
## timbre_4_min 1.105e-02 1.978e-03 5.585 2.34e-08 ***
## timbre_4_max 6.467e-03 1.541e-03 4.196 2.72e-05 ***
## timbre_5_min -5.135e-03 1.269e-03 -4.046 5.21e-05 ***
## timbre_5_max 2.979e-04 7.855e-04 0.379 0.704526
## timbre_6_min -1.784e-02 2.246e-03 -7.945 1.94e-15 ***
## timbre_6_max 3.447e-03 2.182e-03 1.580 0.114203
## timbre_7_min -5.128e-03 1.768e-03 -2.900 0.003733 **
## timbre_7_max -3.394e-03 1.820e-03 -1.865 0.062208 .
## timbre_8_min 3.686e-03 2.833e-03 1.301 0.193229
## timbre_8_max 4.658e-03 2.988e-03 1.559 0.119022
## timbre_9_min -9.318e-05 2.957e-03 -0.032 0.974859
## timbre_9_max 1.342e-03 2.424e-03 0.554 0.579900
## timbre_10_min 4.050e-03 1.827e-03 2.217 0.026637 *
## timbre_10_max 5.793e-03 1.759e-03 3.294 0.000988 ***
## timbre_11_min -2.638e-02 3.683e-03 -7.162 7.96e-13 ***
## timbre_11_max 1.984e-02 3.365e-03 5.896 3.74e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4782.7 on 7168 degrees of freedom
## AIC: 4848.7
##
## Number of Fisher Scoring iterations: 6
Explanation
Model 3 can be created with the following command: SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family=binomial). Looking at the output of summary(SongsLog3), we can see that loudness has a positive coefficient estimate, meaning that our model predicts that songs with heavier instrumentation tend to be more popular. This is the opposite of the conclusion we got from Model 2. In the remainder of this problem, we’ll just use Model 3.
Make predictions on the test set using Model 3. What is the accuracy of Model 3 on the test set, using a threshold of 0.45? (Compute the accuracy as a number between 0 and 1.)
# Make predictions on the test set
predictTest = predict(SongsLog3, newdata = SongsTest, type="response")
# Create confusion matrix with threshold of 0.45
confMat <- table(SongsTest$Top10, predictTest >= 0.45)
confMat
##
## FALSE TRUE
## 0 309 5
## 1 40 19
accuracy <- (confMat[1, 1]+confMat[2, 2])/sum(confMat)
accuracy
## [1] 0.8793566
Let’s check if there’s any incremental benefit in using Model 3 instead of a baseline model. Given the difficulty of guessing which song is going to be a hit, an easier model would be to pick the most frequent outcome (a song is not a Top 10 hit) for all songs. What would the accuracy of the baseline model be on the test set? (Give your answer as a number between 0 and 1.)
confMat
##
## FALSE TRUE
## 0 309 5
## 1 40 19
# The baseline predicts 0 for every song, so it is correct for the 309 + 5 = 314 actual non-hits
acc_baseline <- (309+5)/sum(confMat)
acc_baseline
## [1] 0.8418231
# Or
table(SongsTest$Top10)
##
## 0 1
## 314 59
It seems that Model 3 gives us a small improvement over the baseline model. Still, does it create an edge?
Let’s view the two models from an investment perspective. A production company is interested in investing in songs that are highly likely to make it to the Top 10. The company’s objective is to minimize its risk of financial losses attributed to investing in songs that end up unpopular.
A competitive edge can therefore be achieved if we can provide the production company a list of songs that are highly likely to end up in the Top 10. We note that the baseline model does not prove useful, as it simply does not label any song as a hit. Let us see what our model has to offer.
How many songs does Model 3 correctly predict as Top 10 hits in 2010 (remember that all songs in 2010 went into our test set), using a threshold of 0.45?
confMat
##
## FALSE TRUE
## 0 309 5
## 1 40 19
# True positives
confMat[2, 2]
## [1] 19
How many non-hit songs does Model 3 predict will be Top 10 hits (again, looking at the test set), using a threshold of 0.45?
# false positives
confMat[1, 2]
## [1] 5
What is the sensitivity of Model 3 on the test set, using a threshold of 0.45?
sensitivity <- confMat[2, 2] / sum(confMat[2, ])
sensitivity
## [1] 0.3220339
What is the specificity of Model 3 on the test set, using a threshold of 0.45?
confMat##
## FALSE TRUE
## 0 309 5
## 1 40 19
specificity <- confMat[1, 1] / sum(confMat[1, ])
specificity
## [1] 0.9840764
What conclusions can you make about our model? (Select all that apply.)
Answer
Model 3 favors specificity over sensitivity.
Model 3 provides conservative predictions, and predicts that a song will make it to the Top 10 very rarely. So while it detects less than half of the Top 10 songs, we can be very confident in the songs that it does predict to be Top 10 hits.
Explanation
Model 3 has a very high specificity, meaning that it favors specificity over sensitivity. While Model 3 only captures less than half of the Top 10 songs, it still can offer a competitive edge, since it is very conservative in its predictions.
In many criminal justice systems around the world, inmates deemed not to be a threat to society are released from prison under the parole system prior to completing their sentence. They are still considered to be serving their sentence while on parole, and they can be returned to prison if they violate the terms of their parole.
Parole boards are charged with identifying which inmates are good candidates for release on parole. They seek to release inmates who will not commit additional crimes after release. In this problem, we will build and validate a model that predicts if an inmate will violate the terms of his or her parole. Such a model could be useful to a parole board when deciding to approve or deny an application for parole.
For this prediction task, we will use data from the United States 2004 National Corrections Reporting Program, a nationwide census of parole releases that occurred during 2004. We limited our focus to parolees who served no more than 6 months in prison and whose maximum sentence for all charges did not exceed 18 months. The dataset contains all such parolees who either successfully completed their term of parole during 2004 or those who violated the terms of their parole during that year. The dataset contains the following variables:
male: 1 if the parolee is male, 0 if female
race: 1 if the parolee is white, 2 otherwise
age: the parolee’s age (in years) when he or she was released from prison
state: a code for the parolee’s state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. The three states were selected due to having a high representation in the dataset.
time.served: the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months).
max.sentence: the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months).
multiple.offenses: 1 if the parolee was incarcerated for multiple offenses, 0 otherwise.
crime: a code for the parolee’s main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.
violator: 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation.
Load the dataset “parole.csv” into a data frame called parole, and investigate it using the str() and summary() functions.
How many parolees are contained in the dataset?
parole <- read.csv("parole.csv")
str(parole)
## 'data.frame': 675 obs. of 9 variables:
## $ male : int 1 0 1 1 1 1 1 0 0 1 ...
## $ race : int 1 1 2 1 2 2 1 1 1 2 ...
## $ age : num 33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
## $ state : int 1 1 1 1 1 1 1 1 1 1 ...
## $ time.served : num 5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
## $ max.sentence : int 18 12 12 18 12 18 18 12 13 12 ...
## $ multiple.offenses: int 0 0 0 0 0 0 0 0 0 0 ...
## $ crime : int 4 3 3 1 1 4 3 1 3 2 ...
## $ violator : int 0 0 0 0 0 0 0 0 0 0 ...
summary(parole)
## male race age state
## Min. :0.0000 Min. :1.000 Min. :18.40 Min. :1.000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:25.35 1st Qu.:2.000
## Median :1.0000 Median :1.000 Median :33.70 Median :3.000
## Mean :0.8074 Mean :1.424 Mean :34.51 Mean :2.887
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:42.55 3rd Qu.:4.000
## Max. :1.0000 Max. :2.000 Max. :67.00 Max. :4.000
## time.served max.sentence multiple.offenses crime
## Min. :0.000 Min. : 1.00 Min. :0.0000 Min. :1.000
## 1st Qu.:3.250 1st Qu.:12.00 1st Qu.:0.0000 1st Qu.:1.000
## Median :4.400 Median :12.00 Median :1.0000 Median :2.000
## Mean :4.198 Mean :13.06 Mean :0.5363 Mean :2.059
## 3rd Qu.:5.200 3rd Qu.:15.00 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :6.000 Max. :18.00 Max. :1.0000 Max. :4.000
## violator
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1156
## 3rd Qu.:0.0000
## Max. :1.0000
How many of the parolees in the dataset violated the terms of their parole?
sum(parole$violator == 1)
## [1] 78
You should be familiar with unordered factors (if not, review the Week 2 homework problem “Reading Test Scores”). Which variables in this dataset are unordered factors with at least three levels? Select all that apply.
Answer
While the variables male, race, state, crime, and violator are all unordered factors, only state and crime have at least 3 levels in this dataset.
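One way to check this (a sketch; it assumes the parole data frame loaded above) is to count the distinct values of each candidate variable:
# Number of distinct values for each unordered categorical variable
sapply(parole[, c("male", "race", "state", "crime", "violator")], function(x) length(unique(x)))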
In the last subproblem, we identified variables that are unordered factors with at least 3 levels, so we need to convert them to factors for our prediction problem (we introduced this idea in the “Reading Test Scores” problem last week). Using the as.factor() function, convert these variables to factors. Keep in mind that we are not changing the values, just the way R understands them (the values are still numbers).
How does the output of summary() change for a factor variable as compared to a numerical variable?
parole$state <- as.factor(parole$state)
parole$crime <- as.factor(parole$crime)
summary(parole)
## male race age state time.served
## Min. :0.0000 Min. :1.000 Min. :18.40 1:143 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:25.35 2:120 1st Qu.:3.250
## Median :1.0000 Median :1.000 Median :33.70 3: 82 Median :4.400
## Mean :0.8074 Mean :1.424 Mean :34.51 4:330 Mean :4.198
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:42.55 3rd Qu.:5.200
## Max. :1.0000 Max. :2.000 Max. :67.00 Max. :6.000
## max.sentence multiple.offenses crime violator
## Min. : 1.00 Min. :0.0000 1:315 Min. :0.0000
## 1st Qu.:12.00 1st Qu.:0.0000 2:106 1st Qu.:0.0000
## Median :12.00 Median :1.0000 3:153 Median :0.0000
## Mean :13.06 Mean :0.5363 4:101 Mean :0.1156
## 3rd Qu.:15.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :18.00 Max. :1.0000 Max. :1.0000
To ensure consistent training/testing set splits, run the following 5 lines of code (do not include the line numbers at the beginning):
set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
Roughly what proportion of parolees have been allocated to the training and testing sets?
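A quick way to check the resulting proportions (a sketch; it assumes the five lines above have been run):
nrow(train) / nrow(parole)  # roughly 0.7
nrow(test) / nrow(parole)   # roughly 0.3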
set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect?
If you instead ONLY re-ran lines [3]-[5], what would you expect?
If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?
Explanation
If you set a random seed, split, set the seed again to the same value, and then split again, you will get the same split. However, if you set the seed and then split twice, you will get different splits. If you set the seed to different values, you will get different splits. You can also verify this by running the specified code in R. If you have training sets train1 and train2, the function sum(train1 != train2) will count the number of values in those two data frames that are different.
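The following sketch illustrates this behavior with the split vectors themselves (split1, split2, and split3 are hypothetical names; it assumes caTools is loaded and parole is in memory):
set.seed(144)
split1 = sample.split(parole$violator, SplitRatio = 0.7)
set.seed(144)
split2 = sample.split(parole$violator, SplitRatio = 0.7)  # same seed, so the same split
split3 = sample.split(parole$violator, SplitRatio = 0.7)  # no re-seeding, so almost surely a different split
sum(split1 != split2)  # 0
sum(split1 != split3)  # greater than 0 with high probability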
If you tested other training/testing set splits in the previous section, please re-run the original 5 lines of code to obtain the original split.
Using glm (and remembering the parameter family="binomial"), train a logistic regression model on the training set. Your dependent variable is “violator”, and you should use all of the other variables as independent variables.
What variables are significant in this model? Significant variables should have a least one star, or should have a probability less than 0.05 (the column Pr(>|z|) in the summary output). Select all that apply.
mod <- glm(violator ~ ., data = train, family = binomial)
summary(mod)
##
## Call:
## glm(formula = violator ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7041 -0.4236 -0.2719 -0.1690 2.8375
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2411574 1.2938852 -3.278 0.00105 **
## male 0.3869904 0.4379613 0.884 0.37690
## race 0.8867192 0.3950660 2.244 0.02480 *
## age -0.0001756 0.0160852 -0.011 0.99129
## state2 0.4433007 0.4816619 0.920 0.35739
## state3 0.8349797 0.5562704 1.501 0.13335
## state4 -3.3967878 0.6115860 -5.554 2.79e-08 ***
## time.served -0.1238867 0.1204230 -1.029 0.30359
## max.sentence 0.0802954 0.0553747 1.450 0.14705
## multiple.offenses 1.6119919 0.3853050 4.184 2.87e-05 ***
## crime2 0.6837143 0.5003550 1.366 0.17180
## crime3 -0.2781054 0.4328356 -0.643 0.52054
## crime4 -0.0117627 0.5713035 -0.021 0.98357
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 340.04 on 472 degrees of freedom
## Residual deviance: 251.48 on 460 degrees of freedom
## AIC: 277.48
##
## Number of Fisher Scoring iterations: 6
What can we say based on the coefficient of the multiple.offenses variable?
The following two properties might be useful to you when answering this question:
If we have a coefficient c for a variable, then that means the log odds (or Logit) are increased by c for a unit increase in the variable.
If we have a coefficient c for a variable, then that means the odds are multiplied by e^c for a unit increase in the variable.
Answer
Our model predicts that a parolee who committed multiple offenses has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical.
Explanation
For parolees A and B who are identical other than A having committed multiple offenses, the predicted log odds of A is 1.61 more than the predicted log odds of B. Then we have:
$$
\begin{aligned}
\ln(\text{odds of A}) &= \ln(\text{odds of B}) + 1.61 \\
e^{\ln(\text{odds of A})} &= e^{\ln(\text{odds of B}) + 1.61} \\
e^{\ln(\text{odds of A})} &= e^{\ln(\text{odds of B})} \, e^{1.61} \\
\text{odds of A} &= e^{1.61} \times \text{odds of B} \\
\text{odds of A} &\approx 5.01 \times \text{odds of B}
\end{aligned}
$$
In the second step we raised e to the power of both sides. In the third step we used the exponentiation rule that e^(a+b) = e^a * e^b. In the fourth step we used the rule that e^(ln(x)) = x.
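Equivalently, a one-line sketch in R (it assumes the model mod fitted above):
# Odds multiplier for committing multiple offenses, holding all other variables fixed
exp(coef(mod)["multiple.offenses"])  # approximately 5.01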
Consider a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny. Answer the following questions based on the model’s predictions for this individual. (HINT: You should use the coefficients of your model, the Logistic Response Function, and the Odds equation to solve this problem.)
According to the model, what are the odds this individual is a violator?
mod$coefficients
## (Intercept) male race age
## -4.2411573912 0.3869904022 0.8867192411 -0.0001756156
## state2 state3 state4 time.served
## 0.4433006515 0.8349796539 -3.3967877633 -0.1238867027
## max.sentence multiple.offenses crime2 crime3
## 0.0802953765 1.6119918595 0.6837143152 -0.2781054228
## crime4
## -0.0117626521
# This parolee has male=1, race=1, age=50, state2=0, state3=0, state4=0,
# time.served=3, max.sentence=12, multiple.offenses=0, crime2=1, crime3=0, crime4=0
intercept <- mod$coefficients[[1]]
coef_male <- mod$coefficients[[2]]
coef_race <- mod$coefficients[[3]]
coef_age <- mod$coefficients[[4]]
coef_ts <- mod$coefficients[[8]]
coef_ms <- mod$coefficients[[9]]
coef_multi <- mod$coefficients[[10]]
coef_larceny <- mod$coefficients[[11]]
odds = exp(intercept + coef_male*1 + coef_race*1 + coef_age*50 + coef_ts*3 +
coef_ms*12 + coef_multi*0 + coef_larceny*1)
odds
## [1] 0.1825685
According to the model, what is the probability this individual is a violator?
probability <- odds / (1+odds)
probability
## [1] 0.154383
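As a cross-check (a sketch; it assumes mod and train from above, and newparolee is a hypothetical name), you can let predict() do the arithmetic on a one-row data frame instead of assembling the coefficients by hand:
# Maryland is not Kentucky, Louisiana, or Virginia, so state = 1; larceny is crime = 2
newparolee = data.frame(male = 1, race = 1, age = 50,
                        state = factor(1, levels = levels(train$state)),
                        time.served = 3, max.sentence = 12, multiple.offenses = 0,
                        crime = factor(2, levels = levels(train$crime)))
predict(mod, newdata = newparolee, type = "response")  # should match the probability computed above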
Use the predict() function to obtain the model’s predicted probabilities for parolees in the testing set, remembering to pass type="response".
What is the maximum predicted probability of a violation?
predTest <- predict(mod, newdata = test, type = "response")
# max(predTest)
summary(predTest)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002334 0.023777 0.057905 0.146576 0.147452 0.907279
In the following questions, evaluate the model’s predictions on the test set using a threshold of 0.5.
# confusion matrix
CM <- table(test$violator, as.numeric(predTest >= 0.5))
CM
##
## 0 1
## 0 167 12
## 1 11 12
What is the model’s sensitivity, specificity and accuracy?
# Sensitivity = TP / (TP+FN)
CM[2, 2] / sum(CM[2, ])
## [1] 0.5217391
# Specificity = TN / (TN+FP)
CM[1, 1] / sum(CM[1, ])
## [1] 0.9329609
# Accuracy = (TN + TP) / total observations
(CM[1, 1] + CM[2, 2]) / sum(CM)
## [1] 0.8861386
What is the accuracy of a simple model that predicts that every parolee is a non-violator?
table(test$violator)
##
## 0 1
## 179 23
179/(179+23)
## [1] 0.8861386
Explanation
If you table the outcome variable using the following command: table(test$violator) you can see that there are 179 negative examples, which are the ones that the baseline model would get correct. Thus the baseline model would have an accuracy of 179/202 = 0.886.
Consider a parole board using the model to predict whether parolees will be violators or not. The job of a parole board is to make sure that a prisoner is ready to be released into free society, and therefore parole boards tend to be particularly concerned about releasing prisoners who will violate their parole. Which of the following most likely describes their preferences and best course of action?
Answer
The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5.
Explanation
If the board used the model for parole decisions, a negative prediction would lead to a prisoner being granted parole, while a positive prediction would lead to a prisoner being denied parole. The parole board would experience more regret for releasing a prisoner who then violates parole (a negative prediction that is actually positive, or false negative) than it would experience for denying parole to a prisoner who would not have violated parole (a positive prediction that is actually negative, or false positive). Decreasing the cutoff leads to more positive predictions, which increases false positives and decreases false negatives. Meanwhile, increasing the cutoff leads to more negative predictions, which increases false negatives and decreases false positives. The parole board assigns high cost to false negatives, and therefore should decrease the cutoff.
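To see this effect concretely (a sketch; it assumes predTest and test from above), compare the confusion matrices at cutoffs 0.5 and 0.3:
# Lowering the cutoff produces more positive predictions: fewer false negatives, more false positives
table(test$violator, as.numeric(predTest >= 0.5))
table(test$violator, as.numeric(predTest >= 0.3))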
Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model’s accuracy as compared to the simple baseline model?
Answer
The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model’s value.
Explanation
The model at cutoff 0.5 has 12 false positives and 11 false negatives, while the baseline model has 0 false positives and 23 false negatives. Because a parole board is likely to assign more cost to a false negative, the model at cutoff 0.5 is likely of value to the board. From the previous question, the parole board would likely benefit from decreasing the logistic regression cutoffs, which decreases the false negative rate while increasing the false positive rate.
Using the ROCR package, what is the AUC value for the model?
library(ROCR)
ROCRpred = prediction(predTest, test$violator)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.8945834
Describe the meaning of AUC in this context.
Answer
The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.
Our goal has been to predict the outcome of a parole decision, and we used a publicly available dataset of parole releases for predictions. In this final problem, we’ll evaluate a potential source of bias associated with our analysis. It is always important to evaluate a dataset for possible sources of bias.
The dataset contains all individuals released from parole in 2004, either due to completing their parole term or violating the terms of their parole. However, it does not contain parolees who neither violated their parole nor completed their term in 2004, causing non-violators to be underrepresented. This is called “selection bias” or “selecting on the dependent variable,” because only a subset of all relevant parolees were included in our analysis, based on our dependent variable in this analysis (parole violation). How could we improve our dataset to best address selection bias?
Answer
We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.
Explanation
While expanding the dataset to include the missing parolees and labeling each as violator=0 would improve the representation of non-violators, it does not capture the true outcome, since the parolee might become a violator after 2004. Though labeling these new examples with violator=NA correctly identifies that we don’t know their true outcome, we cannot train or test a prediction model with a missing dependent variable. As a result, a prospective dataset that tracks a cohort of parolees and observes the true outcome of each is more desirable. Unfortunately, such datasets are often more challenging to obtain (for instance, if a parolee had a 10-year term, it might require tracking that individual for 10 years before building the model). Such a prospective analysis would not be possible using the 2004 National Corrections Reporting Program dataset.
In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.
To address this problem, we will use publicly available data from LendingClub.com, a website that connects borrowers and investors over the Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February 2010. The binary dependent variable not_fully_paid indicates that the loan was not paid back in full (the borrower either defaulted or the loan was “charged off,” meaning the borrower was deemed unlikely to ever pay it back).
To predict this dependent variable, we will use the following independent variables available to the investor when deciding whether to fund a loan:
credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments ($) owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
Load the dataset loans.csv into a data frame called loans, and explore it using the str() and summary() functions.
What proportion of the loans in the dataset were not paid in full? Please input a number between 0 and 1.
loans <- read.csv("loans.csv")
str(loans)
## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
sum(loans$not.fully.paid == 1) / nrow(loans)
## [1] 0.1600543
Which of the following variables has at least one missing observation? Select all that apply.
summary(loans)
## credit.policy purpose int.rate
## Min. :0.000 all_other :2331 Min. :0.0600
## 1st Qu.:1.000 credit_card :1262 1st Qu.:0.1039
## Median :1.000 debt_consolidation:3957 Median :0.1221
## Mean :0.805 educational : 343 Mean :0.1226
## 3rd Qu.:1.000 home_improvement : 629 3rd Qu.:0.1407
## Max. :1.000 major_purchase : 437 Max. :0.2164
## small_business : 619
## installment log.annual.inc dti fico
## Min. : 15.67 Min. : 7.548 Min. : 0.000 Min. :612.0
## 1st Qu.:163.77 1st Qu.:10.558 1st Qu.: 7.213 1st Qu.:682.0
## Median :268.95 Median :10.928 Median :12.665 Median :707.0
## Mean :319.09 Mean :10.932 Mean :12.607 Mean :710.8
## 3rd Qu.:432.76 3rd Qu.:11.290 3rd Qu.:17.950 3rd Qu.:737.0
## Max. :940.14 Max. :14.528 Max. :29.960 Max. :827.0
## NA's :4
## days.with.cr.line revol.bal revol.util inq.last.6mths
## Min. : 179 Min. : 0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2820 1st Qu.: 3187 1st Qu.: 22.70 1st Qu.: 0.000
## Median : 4140 Median : 8596 Median : 46.40 Median : 1.000
## Mean : 4562 Mean : 16914 Mean : 46.87 Mean : 1.572
## 3rd Qu.: 5730 3rd Qu.: 18250 3rd Qu.: 71.00 3rd Qu.: 2.000
## Max. :17640 Max. :1207359 Max. :119.00 Max. :33.000
## NA's :29 NA's :62 NA's :29
## delinq.2yrs pub.rec not.fully.paid
## Min. : 0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 0.0000 Median :0.0000 Median :0.0000
## Mean : 0.1638 Mean :0.0621 Mean :0.1601
## 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :13.0000 Max. :5.0000 Max. :1.0000
## NA's :29 NA's :29
apply(loans, 2, function(x) any(is.na(x)))## credit.policy purpose int.rate installment
## FALSE FALSE FALSE FALSE
## log.annual.inc dti fico days.with.cr.line
## TRUE FALSE FALSE TRUE
## revol.bal revol.util inq.last.6mths delinq.2yrs
## FALSE TRUE TRUE TRUE
## pub.rec not.fully.paid
## TRUE FALSE
Answer
From summary(loans), we can see that log.annual.inc, days.with.cr.line, revol.util, inq.last.6mths, delinq.2yrs and pub.rec have missing values (reported as NA's in the summary output).
Which of the following is the best reason to fill in the missing values for these variables instead of removing observations with missing data? (Hint: you can use the subset() function to build a data frame with the observations missing at least one value. To test if a variable, for example pub.rec, is missing a value, use is.na(pub.rec).)
Answer
We want to be able to predict risk for all borrowers, instead of just the ones with all data reported.
Explanation
Answering this question requires analyzing the loans with missing data. We can build a data frame limited to observations with some missing data with the following command:
missing = subset(loans,
is.na(log.annual.inc) | is.na(days.with.cr.line) | is.na(revol.util)
| is.na(inq.last.6mths) | is.na(delinq.2yrs) | is.na(pub.rec))
missing## credit.policy purpose int.rate installment log.annual.inc
## 782 1 all_other 0.1134 98.70 10.530495
## 804 1 educational 0.1103 52.41 10.532096
## 840 1 debt_consolidation 0.1134 263.20 10.714418
## 858 1 debt_consolidation 0.1229 23.35 9.852194
## 1214 1 major_purchase 0.1064 182.39 11.264464
## 1281 1 credit_card 0.1633 264.91 10.819778
## 1554 1 all_other 0.1557 314.51 10.596535
## 1783 1 educational 0.1695 97.98 8.342840
## 1928 1 all_other 0.1379 170.38 10.714418
## 2009 1 all_other 0.1316 405.25 11.350407
## 2194 1 all_other 0.1695 178.14 11.608236
## 2529 1 small_business 0.1537 48.79 10.491274
## 2533 1 home_improvement 0.1411 222.49 11.198215
## 2706 1 major_purchase 0.1505 346.92 10.757988
## 3351 1 educational 0.1505 60.71 9.210340
## 3980 1 all_other 0.1284 47.07 8.494539
## 4959 1 credit_card 0.1913 201.97 11.877569
## 5522 1 all_other 0.1426 41.17 8.476371
## 6274 1 all_other 0.1461 172.38 10.491274
## 6319 1 major_purchase 0.1531 87.04 13.003580
## 6639 1 debt_consolidation 0.1531 174.08 10.915088
## 7014 1 major_purchase 0.1461 517.13 10.491274
## 7702 1 all_other 0.1600 70.32 10.085809
## 7703 1 debt_consolidation 0.1821 181.30 10.714418
## 7726 0 all_other 0.0775 156.11 11.156251
## 7727 0 all_other 0.0838 204.84 NA
## 7728 0 all_other 0.0807 136.45 11.695247
## 7729 0 all_other 0.0933 80.69 11.608236
## 7730 0 all_other 0.0933 124.62 11.492723
## 7732 0 all_other 0.1028 164.42 11.461632
## 7733 0 all_other 0.0964 32.11 9.392662
## 7734 0 all_other 0.0996 103.20 11.918391
## 7735 0 all_other 0.1028 113.39 12.100712
## 7736 0 all_other 0.1122 211.85 10.373491
## 7737 0 all_other 0.0964 208.66 9.903488
## 7738 0 all_other 0.0996 64.50 8.699515
## 7739 0 all_other 0.1249 42.65 10.596635
## 7740 0 all_other 0.0933 95.86 9.903488
## 7741 0 all_other 0.0901 95.42 10.463103
## 7742 0 all_other 0.0775 209.18 NA
## 7743 0 all_other 0.0775 218.55 NA
## 7744 0 all_other 0.0743 155.38 NA
## 7745 0 all_other 0.1122 344.87 11.002100
## 7746 0 all_other 0.0712 30.94 10.819778
## 7747 0 all_other 0.0838 81.94 8.779557
## 7748 0 all_other 0.0743 93.23 11.289782
## 7749 0 all_other 0.0743 77.69 11.106820
## 7750 0 all_other 0.0775 156.11 12.611538
## 7751 0 all_other 0.1407 34.21 10.126631
## 7752 0 all_other 0.1091 45.78 10.596635
## 7753 0 all_other 0.1122 164.23 10.239960
## 7754 0 all_other 0.0901 38.17 10.491274
## 7759 0 debt_consolidation 0.0964 61.00 11.512925
## 7760 0 all_other 0.1312 180.57 8.294050
## 7799 0 all_other 0.1091 196.18 8.853665
## 7812 0 all_other 0.1565 131.19 8.892886
## 8555 0 debt_consolidation 0.1071 97.81 10.203592
## 8571 0 debt_consolidation 0.1387 163.75 10.736397
## 8677 0 major_purchase 0.0976 128.62 10.752484
## 8760 0 credit_card 0.0863 170.80 10.680516
## 9074 0 all_other 0.1442 859.57 12.542545
## 9365 0 debt_consolidation 0.1774 172.91 9.546813
## dti fico days.with.cr.line revol.bal revol.util inq.last.6mths
## 782 7.72 677 1680.0000 0 NA 1
## 804 15.84 682 1829.9583 0 NA 0
## 840 8.75 682 2490.0000 0 NA 1
## 858 12.38 662 1199.9583 0 NA 1
## 1214 4.26 697 4140.9583 0 NA 0
## 1281 10.80 667 5249.9583 0 NA 0
## 1554 0.00 687 2940.0417 0 NA 1
## 1783 0.00 687 1238.0417 0 NA 0
## 1928 6.00 722 5280.0417 0 NA 1
## 2009 12.01 752 3749.0417 0 NA 0
## 2194 14.95 677 6990.0000 49238 NA 0
## 2529 5.07 712 2220.9583 0 NA 2
## 2533 11.59 767 5522.9583 0 NA 1
## 2706 20.19 752 4621.0000 0 NA 2
## 3351 15.84 677 2070.0000 0 NA 1
## 3980 17.68 722 2100.0000 0 NA 3
## 4959 3.58 662 5130.0000 0 NA 1
## 5522 13.75 702 2580.0000 0 NA 0
## 6274 16.57 742 5400.0000 0 NA 0
## 6319 0.18 712 1710.0417 0 NA 1
## 6639 9.45 732 2643.0417 0 NA 0
## 7014 9.17 732 1892.0000 0 NA 2
## 7702 2.55 667 3509.9583 0 NA 0
## 7703 11.39 662 5400.0000 0 NA 3
## 7726 8.81 772 NA 0 NA NA
## 7727 4.00 742 NA 0 NA NA
## 7728 4.00 742 NA 0 NA NA
## 7729 10.00 712 NA 0 NA NA
## 7730 10.00 707 NA 0 NA NA
## 7732 10.00 702 NA 0 NA NA
## 7733 10.00 697 NA 0 NA NA
## 7734 10.00 692 NA 0 NA NA
## 7735 10.00 687 NA 0 NA NA
## 7736 10.00 687 NA 0 NA NA
## 7737 10.00 712 NA 0 NA NA
## 7738 10.00 687 NA 0 NA NA
## 7739 10.00 707 NA 0 NA NA
## 7740 10.00 712 NA 0 NA NA
## 7741 10.00 717 NA 0 NA NA
## 7742 1.00 802 NA 0 NA NA
## 7743 1.00 802 NA 0 NA NA
## 7744 1.00 802 NA 0 NA NA
## 7745 19.50 742 NA 0 NA NA
## 7746 1.10 772 NA 0 NA NA
## 7747 6.46 722 NA 0 NA NA
## 7748 0.39 797 NA 0 NA NA
## 7749 10.36 787 NA 0 NA NA
## 7750 5.38 757 NA 0 NA NA
## 7751 16.27 642 NA 0 NA NA
## 7752 8.61 672 NA 0 NA NA
## 7753 3.51 672 NA 0 NA NA
## 7754 3.27 707 NA 0 NA NA
## 7759 10.00 707 NA 0 NA NA
## 7760 15.00 662 178.9583 0 NA 1
## 7799 8.57 682 334.0000 0 NA 2
## 7812 17.31 642 3478.9583 0 NA 1
## 8555 18.89 697 3812.0000 0 NA 0
## 8571 22.72 662 2010.0000 0 NA 6
## 8677 7.29 707 990.0000 0 NA 0
## 8760 10.18 742 2399.9583 0 NA 3
## 9074 5.58 737 7890.0000 290291 NA 1
## 9365 9.51 692 1530.0417 0 NA 4
## delinq.2yrs pub.rec not.fully.paid
## 782 0 0 1
## 804 0 0 0
## 840 1 0 1
## 858 0 0 0
## 1214 0 1 0
## 1281 0 0 1
## 1554 0 0 0
## 1783 0 0 0
## 1928 0 0 0
## 2009 0 0 0
## 2194 4 0 0
## 2529 0 0 0
## 2533 0 0 1
## 2706 0 0 0
## 3351 0 0 0
## 3980 0 0 0
## 4959 0 0 0
## 5522 0 0 1
## 6274 0 0 0
## 6319 0 0 0
## 6639 0 0 0
## 7014 0 0 0
## 7702 0 0 1
## 7703 2 0 1
## 7726 NA NA 0
## 7727 NA NA 0
## 7728 NA NA 0
## 7729 NA NA 0
## 7730 NA NA 0
## 7732 NA NA 0
## 7733 NA NA 0
## 7734 NA NA 0
## 7735 NA NA 0
## 7736 NA NA 0
## 7737 NA NA 1
## 7738 NA NA 0
## 7739 NA NA 1
## 7740 NA NA 0
## 7741 NA NA 0
## 7742 NA NA 0
## 7743 NA NA 0
## 7744 NA NA 0
## 7745 NA NA 0
## 7746 NA NA 0
## 7747 NA NA 1
## 7748 NA NA 0
## 7749 NA NA 0
## 7750 NA NA 0
## 7751 NA NA 0
## 7752 NA NA 0
## 7753 NA NA 0
## 7754 NA NA 0
## 7759 NA NA 0
## 7760 0 0 0
## 7799 0 0 1
## 7812 0 0 1
## 8555 0 0 0
## 8571 0 0 0
## 8677 0 0 0
## 8760 0 0 0
## 9074 0 0 0
## 9365 0 0 0
We see that only 62 of the 9,578 loans have missing data; removing this small number of observations would not lead to overfitting. From table(missing$not.fully.paid), we see that 12 of the 62 loans with missing data were not fully paid, or 19.35%. This rate is similar to the 16.01% rate across all loans, so dropping these observations would not meaningfully bias the model. However, to predict risk for loans with missing data, we need to fill in the missing values instead of removing the observations.
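These counts can be reproduced with two short commands (a quick check using the missing data frame built above):
nrow(missing)                  # 62 observations have at least one missing value
table(missing$not.fully.paid)  # 12 of the 62 were not fully paid (about 19.35%)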
For the rest of this problem, we’ll be using a revised version of the dataset that has the missing values filled in with multiple imputation (which was discussed in the Recitation of this Unit). To ensure everybody has the same data frame going forward, you can either run the commands below in your R console (if you haven’t already, run the command install.packages("mice") first), or you can download and load into R the dataset we created after running the imputation: loans_imputed.csv.
IMPORTANT NOTE: On certain operating systems, the imputation results are not the same even if you set the random seed. If you decide to do the imputation yourself, please still read the provided imputed dataset (loans_imputed.csv) into R and compare your results, using the summary function. If the results are different, please make sure to use the data in loans_imputed.csv for the rest of the problem.
library(mice)
set.seed(144)
vars.for.imputation = setdiff(names(loans), "not.fully.paid")
imputed = complete(mice(loans[vars.for.imputation]))##
## iter imp variable
## 1 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
loans[vars.for.imputation] = imputed
summary(loans)## credit.policy purpose int.rate
## Min. :0.000 all_other :2331 Min. :0.0600
## 1st Qu.:1.000 credit_card :1262 1st Qu.:0.1039
## Median :1.000 debt_consolidation:3957 Median :0.1221
## Mean :0.805 educational : 343 Mean :0.1226
## 3rd Qu.:1.000 home_improvement : 629 3rd Qu.:0.1407
## Max. :1.000 major_purchase : 437 Max. :0.2164
## small_business : 619
## installment log.annual.inc dti fico
## Min. : 15.67 Min. : 7.548 Min. : 0.000 Min. :612.0
## 1st Qu.:163.77 1st Qu.:10.558 1st Qu.: 7.213 1st Qu.:682.0
## Median :268.95 Median :10.928 Median :12.665 Median :707.0
## Mean :319.09 Mean :10.932 Mean :12.607 Mean :710.8
## 3rd Qu.:432.76 3rd Qu.:11.290 3rd Qu.:17.950 3rd Qu.:737.0
## Max. :940.14 Max. :14.528 Max. :29.960 Max. :827.0
##
## days.with.cr.line revol.bal revol.util inq.last.6mths
## Min. : 179 Min. : 0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2820 1st Qu.: 3187 1st Qu.: 22.60 1st Qu.: 0.000
## Median : 4140 Median : 8596 Median : 46.25 Median : 1.000
## Mean : 4560 Mean : 16914 Mean : 46.80 Mean : 1.582
## 3rd Qu.: 5730 3rd Qu.: 18250 3rd Qu.: 70.90 3rd Qu.: 2.000
## Max. :17640 Max. :1207359 Max. :119.00 Max. :33.000
##
## delinq.2yrs pub.rec not.fully.paid
## Min. : 0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 0.0000 Median :0.00000 Median :0.0000
## Mean : 0.1635 Mean :0.06223 Mean :0.1601
## 3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :13.0000 Max. :5.00000 Max. :1.0000
##
loans = read.csv("loans_imputed.csv")
Explanation
Imputation predicts missing variable values for a given observation using the variable values that are reported. We called the imputation on a data frame with the dependent variable not.fully.paid removed, so we predicted the missing values using only other independent variables.
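As a quick sanity check (a sketch, assuming loans now holds the data from loans_imputed.csv), you can verify that no missing values remain and compare the summaries against your own imputation if you ran mice yourself:
colSums(is.na(loans))      # every entry should be 0: no missing values remain after imputation
summary(loans$revol.util)  # compare against the corresponding summary from your own imputation run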
Now that we have prepared the dataset, we need to split it into a training and testing set. To ensure everybody obtains the same split, set the random seed to 144 (even though you already did so earlier in the problem) and use the sample.split function to select the 70% of observations for the training set (the dependent variable for sample.split is not.fully.paid). Name the data frames train and test.
Now, use logistic regression trained on the training set to predict the dependent variable not.fully.paid using all the independent variables.
Which independent variables are significant in our model? (Significant variables have at least one star, or a Pr(>|z|) value less than 0.05.) Select all that apply.
set.seed(144)
spl = sample.split(loans$not.fully.paid, SplitRatio = 0.7)
train = subset(loans, spl==TRUE)
test = subset(loans, spl==FALSE)
loansLog = glm(not.fully.paid ~ ., data=train, family="binomial")
summary(loansLog)##
## Call:
## glm(formula = not.fully.paid ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2049 -0.6205 -0.4951 -0.3606 2.6397
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.187e+00 1.554e+00 5.910 3.42e-09 ***
## credit.policy -3.368e-01 1.011e-01 -3.332 0.000861 ***
## purposecredit_card -6.141e-01 1.344e-01 -4.568 4.93e-06 ***
## purposedebt_consolidation -3.212e-01 9.183e-02 -3.498 0.000469 ***
## purposeeducational 1.347e-01 1.753e-01 0.768 0.442201
## purposehome_improvement 1.727e-01 1.480e-01 1.167 0.243135
## purposemajor_purchase -4.830e-01 2.009e-01 -2.404 0.016203 *
## purposesmall_business 4.120e-01 1.419e-01 2.905 0.003678 **
## int.rate 6.110e-01 2.085e+00 0.293 0.769446
## installment 1.275e-03 2.092e-04 6.093 1.11e-09 ***
## log.annual.inc -4.337e-01 7.148e-02 -6.067 1.30e-09 ***
## dti 4.638e-03 5.502e-03 0.843 0.399288
## fico -9.317e-03 1.710e-03 -5.448 5.08e-08 ***
## days.with.cr.line 2.371e-06 1.588e-05 0.149 0.881343
## revol.bal 3.085e-06 1.168e-06 2.641 0.008273 **
## revol.util 1.839e-03 1.535e-03 1.199 0.230722
## inq.last.6mths 8.437e-02 1.600e-02 5.275 1.33e-07 ***
## delinq.2yrs -8.320e-02 6.561e-02 -1.268 0.204762
## pub.rec 3.300e-01 1.139e-01 2.898 0.003756 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5896.6 on 6704 degrees of freedom
## Residual deviance: 5485.2 on 6686 degrees of freedom
## AIC: 5523.2
##
## Number of Fisher Scoring iterations: 5
Consider two loan applications, which are identical other than the fact that the borrower in Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710.
Let Logit(A) be the log odds of loan A not being paid back in full, according to our logistic regression model, and define Logit(B) similarly for loan B. What is the value of Logit(A) - Logit(B)?
coef_fico <- loansLog$coefficients[["fico"]]
(700-710)*coef_fico## [1] 0.0931679
Explanation
Because Application A is identical to Application B other than having a FICO score 10 lower, its predicted log odds differ by -0.009317 * -10 = 0.09317 from the predicted log odds of Application B.
Now, let O(A) be the odds of loan A not being paid back in full, according to our logistic regression model, and define O(B) similarly for loan B. What is the value of O(A)/O(B)? (HINT: Use the mathematical rule that exp(A + B + C) = exp(A) * exp(B) * exp(C). Also, remember that exp() is the exponential function in R.)
exp((700-710)*coef_fico)## [1] 1.097646
Explanation
Using the answer from the previous question, the predicted odds of loan A not being paid back in full are exp(0.09317) = 1.0976 times larger than the predicted odds for loan B. Intuitively, it makes sense that loan A should have higher odds of non-payment than loan B, since the borrower has a worse credit score.
Predict the probability of the test set loans not being paid back in full (remember type="response" for the predict function). Store these predicted probabilities in a variable named predicted.risk and add it to your test set (we will use this variable in later parts of the problem). Compute the confusion matrix using a threshold of 0.5.
What is the accuracy of the logistic regression model? Input the accuracy as a number between 0 and 1.
summary(test)## credit.policy purpose int.rate
## Min. :0.0000 all_other : 688 Min. :0.0600
## 1st Qu.:1.0000 credit_card : 411 1st Qu.:0.1028
## Median :1.0000 debt_consolidation:1206 Median :0.1221
## Mean :0.8047 educational : 93 Mean :0.1225
## 3rd Qu.:1.0000 home_improvement : 186 3rd Qu.:0.1393
## Max. :1.0000 major_purchase : 105 Max. :0.2121
## small_business : 184
## installment log.annual.inc dti fico
## Min. : 15.67 Min. : 8.102 Min. : 0.00 Min. :612.0
## 1st Qu.:163.57 1st Qu.:10.560 1st Qu.: 7.16 1st Qu.:682.0
## Median :267.74 Median :10.933 Median :12.85 Median :707.0
## Mean :316.99 Mean :10.928 Mean :12.75 Mean :710.8
## 3rd Qu.:421.89 3rd Qu.:11.290 3rd Qu.:18.30 3rd Qu.:737.0
## Max. :926.83 Max. :13.459 Max. :29.96 Max. :822.0
##
## days.with.cr.line revol.bal revol.util inq.last.6mths
## Min. : 179 Min. : 0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2795 1st Qu.: 3362 1st Qu.: 23.40 1st Qu.: 0.000
## Median : 4140 Median : 8712 Median : 46.90 Median : 1.000
## Mean : 4494 Mean : 17198 Mean : 47.02 Mean : 1.576
## 3rd Qu.: 5670 3rd Qu.: 18728 3rd Qu.: 70.40 3rd Qu.: 2.000
## Max. :17640 Max. :1207359 Max. :108.80 Max. :24.000
##
## delinq.2yrs pub.rec not.fully.paid
## Min. : 0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 0.0000 Median :0.00000 Median :0.0000
## Mean : 0.1605 Mean :0.05743 Mean :0.1601
## 3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :13.0000 Max. :3.00000 Max. :1.0000
##
test$predicted.risk <- predict(loansLog, newdata = test, type = "response")
table(test$not.fully.paid, as.numeric(test$predicted.risk >= 0.5))##
## 0 1
## 0 2400 13
## 1 457 3
# 2403 predictions are correct (accuracy 2403/2873 = 0.8364),
# while 2413 predictions would be correct in the baseline model of guessing every loan would be paid back in full (accuracy 2413/2873 = 0.8399).
Use the ROCR package to compute the test set AUC.
library(ROCR)
pred = prediction(test$predicted.risk, test$not.fully.paid)
as.numeric(performance(pred, "auc")@y.values)## [1] 0.6720995
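The accuracy figures quoted in the comments above can also be computed directly from the objects already created (a small sketch, not part of the assignment):
confMat = table(test$not.fully.paid, as.numeric(test$predicted.risk >= 0.5))
sum(diag(confMat)) / nrow(test)             # model accuracy: 2403/2873 = 0.8364
sum(test$not.fully.paid == 0) / nrow(test)  # baseline accuracy: 2413/2873 = 0.8399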
At the threshold 0.5, the model's accuracy (0.8364) is no better than the baseline of predicting that every loan will be paid back in full (0.8399). But despite this, we will see later how an investor can still leverage this logistic regression model to make profitable investments.
In the previous problem, we built a logistic regression model that has an AUC significantly higher than the AUC of 0.5 that would be obtained by randomly ordering observations.
However, LendingClub.com assigns the interest rate to a loan based on their estimate of that loan’s risk. This variable, int.rate, is an independent variable in our dataset. In this part, we will investigate using the loan’s interest rate as a “smart baseline” to order the loans according to risk.
Using the training set, build a bivariate logistic regression model (aka a logistic regression model with a single independent variable) that predicts the dependent variable not.fully.paid using only the variable int.rate.
The variable int.rate is highly significant in the bivariate model, but it is not significant at the 0.05 level in the model trained with all the independent variables. What is the most likely explanation for this difference?
Answer
int.rate is correlated with other risk-related variables, and therefore does not incrementally improve the model when those other variables are included.
Explanation
To train the bivariate model, run the following command:
bivariate = glm(not.fully.paid ~ int.rate, data=train, family="binomial")
summary(bivariate)##
## Call:
## glm(formula = not.fully.paid ~ int.rate, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0547 -0.6271 -0.5442 -0.4361 2.2914
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6726 0.1688 -21.76 <2e-16 ***
## int.rate 15.9214 1.2702 12.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5896.6 on 6704 degrees of freedom
## Residual deviance: 5734.8 on 6703 degrees of freedom
## AIC: 5738.8
##
## Number of Fisher Scoring iterations: 4
Decreased significance between a bivariate and multivariate model is typically due to correlation. From cor(train$int.rate, train$fico), we can see that the interest rate is moderately well correlated with a borrower’s credit score. The training/testing set split rarely has a large effect on the significance of variables (this can be verified in this case by trying out a few other training/testing splits), and the models were trained on the same observations.
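To check this correlation yourself, one line suffices (the exact value depends on your data, but the relationship should be clearly negative, since riskier borrowers with lower FICO scores are assigned higher rates):
cor(train$int.rate, train$fico)  # negative: higher credit scores go with lower interest rates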
Make test set predictions for the bivariate model. What is the highest predicted probability of a loan not being paid in full on the testing set?
pred.bivariate = predict(bivariate, newdata=test, type="response")
summary(pred.bivariate)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06196 0.11549 0.15077 0.15963 0.18928 0.42662
According to the summary function, the maximum predicted probability of the loan not being paid back is 0.4266, which means no loans would be flagged at a logistic regression cutoff of 0.5.
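One quick way to confirm that no test-set loan crosses the 0.5 cutoff (a sketch using pred.bivariate from above):
sum(pred.bivariate >= 0.5)  # 0, since the maximum predicted probability is only 0.4266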
What is the test set AUC of the bivariate model?
prediction.bivariate = prediction(pred.bivariate, test$not.fully.paid)
as.numeric(performance(prediction.bivariate, "auc")@y.values)## [1] 0.6239081
While thus far we have predicted if a loan will be paid back or not, an investor needs to identify loans that are expected to be profitable. If the loan is paid back in full, then the investor makes interest on the loan. However, if the loan is not paid back, the investor loses the money invested. Therefore, the investor should seek loans that best balance this risk and reward.
To compute interest revenue, consider a $c investment in a loan that has an annual interest rate r over a period of t years. Using continuous compounding of interest, this investment pays back c * exp(rt) dollars by the end of the t years, where exp(rt) is e raised to the r*t power.
How much does a $10 investment with an annual interest rate of 6% pay back after 3 years, using continuous compounding of interest? Hint: remember to convert the percentage to a proportion before doing the math. Enter the number of dollars, without the $ sign.
c=10
r=0.06
t=3
# Using the formula above, the final value is
c * exp(r*t)## [1] 11.97217
While the investment has value c * exp(rt) dollars after collecting interest, the investor had to pay $c for the investment. What is the profit to the investor if the investment is paid back in full?
Answer
c * exp(rt) - c
Now, consider the case where the investor made a $c investment, but it was not paid back in full. Assume, conservatively, that no money was received from the borrower (often a lender will receive some but not all of the value of the loan, making this a pessimistic assumption of how much is received). What is the profit to the investor in this scenario?
Answer
-c
In the previous subproblem, we concluded that an investor who invested c dollars in a loan with interest rate r for t years makes c * (exp(rt) - 1) dollars of profit if the loan is paid back in full and -c dollars of profit if the loan is not paid back in full (pessimistically).
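This payoff rule can be wrapped in a small helper function; the sketch below is purely illustrative (the function name is mine, not part of the assignment):
# Hypothetical helper: profit on a c-dollar, t-year loan with annual rate r,
# under continuous compounding; paid indicates whether the loan was repaid in full
loanProfit = function(c, r, t, paid) ifelse(paid, c * exp(r * t) - c, -c)
loanProfit(1, 0.15, 3, paid = TRUE)   # about 0.568 dollars of profit per dollar invested
loanProfit(1, 0.15, 3, paid = FALSE)  # -1: the entire dollar is lost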
In order to evaluate the quality of an investment strategy, we need to compute this profit for each loan in the test set. For this variable, we will assume a $1 investment (aka c=1). To create the variable, we first assign the profit for a fully paid loan, exp(rt) - 1, to every observation, and we then replace this value with -1 in the cases where the loan was not paid in full. All the loans in our dataset are 3-year loans, meaning t=3 in our calculations. Enter the following commands in your R console to create this new variable:
test$profit = exp(test$int.rate*3) - 1
test$profit[test$not.fully.paid == 1] = -1
summary(test$profit)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.2858 0.4111 0.2094 0.4980 0.8895
What is the maximum profit of a $10 investment in any loan in the testing set (do not include the $ sign in your answer)?
Explanation
From summary(test$profit), we see the maximum profit for a $1 investment in any loan is $0.8895. Therefore, the maximum profit of a $10 investment is 10 times as large, or $8.895.
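Both this figure and the one quoted in the next paragraph can be reproduced directly from test$profit (a quick check using the variable created above):
10 * max(test$profit)    # 10 * 0.8895 = 8.895, the maximum profit of a $10 investment
100 * mean(test$profit)  # about 20.94, the profit from spreading $100 equally across all test-set loans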
A simple investment strategy of equally investing in all the loans would yield profit $20.94 for a $100 investment. But this simple investment strategy does not leverage the prediction model we built earlier in this problem. As stated earlier, investors seek loans that balance reward with risk, in that they simultaneously have high interest rates and a low risk of not being paid back.
To meet this objective, we will analyze an investment strategy in which the investor only purchases loans with a high interest rate (a rate of at least 15%), but amongst these loans selects the ones with the lowest predicted risk of not being paid back in full. We will model an investor who invests $1 in each of the most promising 100 loans.
First, use the subset() function to build a data frame called highInterest consisting of the test set loans with an interest rate of at least 15%.
What is the average profit of a $1 investment in one of these high-interest loans?
highInterest = subset(test, int.rate >= 0.15)
summary(highInterest$profit)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 -1.0000 0.5992 0.2251 0.6380 0.8895
What proportion of the high-interest loans were not paid back in full?
table(highInterest$not.fully.paid)##
## 0 1
## 327 110
110/437## [1] 0.2517162
Next, we will determine the 100th smallest predicted probability of not paying in full by sorting the predicted risks in increasing order and selecting the 100th element of this sorted list. Find the highest predicted risk that we will include by typing the following command into your R console:
cutoff = sort(highInterest$predicted.risk, decreasing=FALSE)[100]
Use the subset() function to build a data frame called selectedLoans consisting of the high-interest loans with predicted risk not exceeding the cutoff we just computed. Check to make sure you have selected 100 loans for investment.
What is the profit of the investor, who invested $1 in each of these 100 loans?
cutoff = sort(highInterest$predicted.risk, decreasing=FALSE)[100]
selectedLoans = subset(highInterest, predicted.risk <= cutoff)
sum(selectedLoans$profit)## [1] 31.27825
How many of 100 selected loans were not paid back in full?
table(selectedLoans$not.fully.paid)##
## 0 1
## 81 19
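For comparison with the 25.2% non-payment rate among all high-interest loans, the rate among the 100 selected loans can be checked as follows (a sketch using selectedLoans from above):
sum(selectedLoans$not.fully.paid) / nrow(selectedLoans)  # 19/100 = 0.19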
We have now seen how analytics can be used to select a subset of the high-interest loans that were paid back at only a slightly lower rate than average, resulting in a significant increase in the profit from our investor’s $100 investment. Although the logistic regression models developed in this problem did not have large AUC values, we see that they still provided the edge needed to improve the profitability of an investment portfolio.
We conclude with a note of warning. Throughout this analysis we assume that the loans we invest in will perform in the same way as the loans we used to train our model, even though our training set covers a relatively short period of time. If there is an economic shock like a large financial downturn, default rates might be significantly higher than those observed in the training set and we might end up losing money instead of profiting. Investors must pay careful attention to such risk when making investment decisions.