Modeling the Expert

Video 4: Logistic Regression in R

Quick Question 1

In R, create a logistic regression model to predict “PoorCare” using the independent variables “StartedOnCombination” and “ProviderCount”. Use the training set we created in the previous video to build the model.

Note: If you haven’t already loaded and split the data in R, please run these commands in your R console to load and split the data set. Remember to first navigate to the directory where you have saved “quality.csv”.

quality = read.csv("quality.csv")

# install.packages("caTools")

library(caTools)

set.seed(88)

# Both the training and test sets have around 75% "good quality care" cases
split = sample.split(quality$PoorCare, SplitRatio = 0.75)

qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)
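# Optional sanity check: sample.split should keep the proportion of
# PoorCare cases roughly the same in the training and test sets
prop.table(table(qualityTrain$PoorCare))
prop.table(table(qualityTest$PoorCare))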

# Logistic regression model
QualityLog = glm(PoorCare ~ StartedOnCombination + ProviderCount, data=qualityTrain, family=binomial)
summary(QualityLog)
## 
## Call:
## glm(formula = PoorCare ~ StartedOnCombination + ProviderCount, 
##     family = binomial, data = qualityTrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.61826  -0.72782  -0.64555  -0.08407   1.94662  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.00097    0.55097  -3.632 0.000282 ***
## StartedOnCombinationTRUE  1.95230    1.22342   1.596 0.110541    
## ProviderCount             0.03366    0.01983   1.697 0.089706 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111.89  on 98  degrees of freedom
## Residual deviance: 104.37  on 96  degrees of freedom
## AIC: 110.37
## 
## Number of Fisher Scoring iterations: 4

What is the coefficient for “StartedOnCombination”?

1.9523

Explanation

To construct this model in R, use the command: Model = glm(PoorCare ~ StartedOnCombination + ProviderCount, data=qualityTrain, family=binomial). If you look at the output of summary(Model), the value of the coefficient (Estimate) for StartedOnCombination is 1.95230.


Quick Question 2

StartedOnCombination is a binary variable, which equals 1 if the patient is started on a combination of drugs to treat their diabetes, and equals 0 if the patient is not started on a combination of drugs. All else being equal, does this model imply that starting a patient on a combination of drugs is indicative of poor care, or good care?

Poor Care

Explanation

The coefficient value is positive, meaning that positive values of the variable make the outcome of 1 more likely. This corresponds to Poor Care.
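As a quick check of the size of this effect, the coefficient can be exponentiated to get the implied odds multiplier (a small sketch using the QualityLog model fitted above):

exp(coef(QualityLog)["StartedOnCombinationTRUE"])
# exp(1.9523) is roughly 7, i.e. the model multiplies the odds of PoorCare by about 7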


Video 6: ROC Curves

Quick Question 1

Given this ROC curve, which threshold would you pick if you wanted to correctly identify a small group of patients who are receiving the worst care with high confidence?

t = 0.7

Explanation

The threshold 0.7 is best to identify a small group of patients who are receiving the worst care with high confidence, since at this threshold we make very few false positive mistakes, and identify about 35% of the true positives. The threshold t = 0.8 is not a good choice, since it makes about the same number of false positives, but only identifies 10% of the true positives. The thresholds 0.2 and 0.3 both identify more of the true positives, but they make more false positive mistakes, so our confidence decreases.


Quick Question 2

Which threshold would you pick if you wanted to correctly identify half of the patients receiving poor care, while making as few errors as possible?

t = 0.3

Explanation

The threshold 0.3 is the best choice in this scenario. The threshold 0.2 also identifies over half of the patients receiving poor care, but it makes many more false positive mistakes. The thresholds 0.7 and 0.8 don't identify at least half of the patients receiving poor care.
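The ROC curve these two questions refer to is not reproduced here. A minimal sketch of how it is drawn with ROCR, assuming QualityLog is the lecture's model (built with OfficeVisits and Narcotics) and that the split from the previous video is in the workspace:

library(ROCR)

# Training-set predictions from the lecture's model
predictTrain = predict(QualityLog, type="response")

# ROC curve with thresholds from 0 to 1 printed along the curve
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
ROCRperf = performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0, 1, by=0.1), text.adj=c(-0.2, 1.7))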

Video 7: Interpreting the Model

Quick Question 1

IMPORTANT NOTE: This question uses the original model with the independent variables “OfficeVisits” and “Narcotics”. Be sure to use this model, instead of the model you built in Quick Question 4.

Compute the test set predictions in R by running the command:

predictTest = predict(QualityLog, type="response", newdata=qualityTest)

You can compute the test set AUC by running the following two commands in R:

ROCRpredTest = prediction(predictTest, qualityTest$PoorCare)

auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)

What is the AUC of this model on the test set?

The AUC of a model has the following nice interpretation: given a random patient from the dataset who actually received poor care, and a random patient who actually received good care, the AUC is the percentage of time that our model will correctly classify which is which.

library(ROCR)

QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
summary(QualityLog)
## 
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial, 
##     data = qualityTrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.06303  -0.63155  -0.50503  -0.09689   2.16686  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.64613    0.52357  -5.054 4.33e-07 ***
## OfficeVisits  0.08212    0.03055   2.688  0.00718 ** 
## Narcotics     0.07630    0.03205   2.381  0.01728 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111.888  on 98  degrees of freedom
## Residual deviance:  89.127  on 96  degrees of freedom
## AIC: 95.127
## 
## Number of Fisher Scoring iterations: 4
predictTest = predict(QualityLog, type="response", newdata=qualityTest)

# compute the test set AUC
ROCRpredTest = prediction(predictTest, qualityTest$PoorCare)
auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)

The Framingham Heart Study: Evaluating Risk Factors to Save Lives

Video 3: A Logistic Regression Model

Quick Question 1

In the previous video, we computed the following confusion matrix for our logistic regression model on our test set with a threshold of 0.5:

Actual / Predicted    FALSE   TRUE
0                      1069      6
1                       187     11

Using this confusion matrix, answer the following questions.

What is the sensitivity of our logistic regression model on the test set, using a threshold of 0.5?

sens = 11 / (11+187)
sens
## [1] 0.05555556

What is the specificity of our logistic regression model on the test set, using a threshold of 0.5?

spec = 1069 / (1069+6)
spec
## [1] 0.9944186

Video 4: Validating the Model

Quick Question 1

For which of the following models should external validation be used? Consider both the population used to train the model, and the population that the model will be used on. (Select all that apply.)

Answer:

A model to predict obesity risk. Data from a random sample of California residents was used to build the model, and we want to use the model to predict the obesity risk of all United States residents.

A model to predict the probability of a runner winning a marathon. Data from all runners in the Boston Marathon was used to build the model, and we want to use the model to predict the probability of winning for all people who run marathons.


Popularity of Music Records

The music industry has a well-developed market with a global annual revenue around $15 billion. The recording industry is highly competitive and is dominated by three big production companies which make up nearly 82% of the total annual album sales.

Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.

Unfortunately, the success of an artist’s release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and digital downloads, while another single may turn out quite unpopular, and therefore unprofitable.

Knowing the competitive nature of the recording industry, record labels face the fundamental decision problem of which musical releases to support to maximize their financial success.

How can we use analytics to predict the popularity of a song? In this assignment, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.

Taking an analytics approach, we aim to use information about a song’s properties to predict its popularity. The dataset songs.csv consists of all songs which made it to the Top 10 of the Billboard Hot 100 Chart from 1990-2010 plus a sample of additional songs that didn’t make the Top 10. This data comes from three sources: Wikipedia, Billboard.com, and EchoNest.

The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.

Here’s a detailed description of the variables:

Problem 1.1 - Understanding the Data

Use the read.csv function to load the dataset “songs.csv” into R.

How many observations (songs) are from the year 2010?

songs <- read.csv("songs.csv")
str(songs)
## 'data.frame':    7574 obs. of  39 variables:
##  $ year                    : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ songtitle               : Factor w/ 7141 levels "̈́ l'or_e des bois",..: 6204 5522 241 3115 48 608 255 4419 2886 6756 ...
##  $ artistname              : Factor w/ 1032 levels "50 Cent","98 Degrees",..: 3 3 3 3 3 3 3 3 3 12 ...
##  $ songID                  : Factor w/ 7549 levels "SOAACNI1315CD4AC42",..: 595 5439 5252 1716 3431 1020 1831 3964 6904 2473 ...
##  $ artistID                : Factor w/ 1047 levels "AR00B1I1187FB433EB",..: 671 671 671 671 671 671 671 671 671 507 ...
##  $ timesignature           : int  3 4 4 4 4 4 4 4 4 4 ...
##  $ timesignature_confidence: num  0.853 1 1 1 0.788 1 0.968 0.861 0.622 0.938 ...
##  $ loudness                : num  -4.26 -4.05 -3.57 -3.81 -4.71 ...
##  $ tempo                   : num  91.5 140 160.5 97.5 140.1 ...
##  $ tempo_confidence        : num  0.953 0.921 0.489 0.794 0.286 0.347 0.273 0.83 0.018 0.929 ...
##  $ key                     : int  11 10 2 1 6 4 10 5 9 11 ...
##  $ key_confidence          : num  0.453 0.469 0.209 0.632 0.483 0.627 0.715 0.423 0.751 0.602 ...
##  $ energy                  : num  0.967 0.985 0.99 0.939 0.988 ...
##  $ pitch                   : num  0.024 0.025 0.026 0.013 0.063 0.038 0.026 0.033 0.027 0.004 ...
##  $ timbre_0_min            : num  0.002 0 0.003 0 0 ...
##  $ timbre_0_max            : num  57.3 57.4 57.4 57.8 56.9 ...
##  $ timbre_1_min            : num  -6.5 -37.4 -17.2 -32.1 -223.9 ...
##  $ timbre_1_max            : num  171 171 171 221 171 ...
##  $ timbre_2_min            : num  -81.7 -149.6 -72.9 -138.6 -147.2 ...
##  $ timbre_2_max            : num  95.1 180.3 157.9 173.4 166 ...
##  $ timbre_3_min            : num  -285 -380.1 -204 -73.5 -128.1 ...
##  $ timbre_3_max            : num  259 384 251 373 389 ...
##  $ timbre_4_min            : num  -40.4 -48.7 -66 -55.6 -43.9 ...
##  $ timbre_4_max            : num  73.6 100.4 152.1 119.2 99.3 ...
##  $ timbre_5_min            : num  -104.7 -87.3 -98.7 -77.5 -96.1 ...
##  $ timbre_5_max            : num  183.1 42.8 141.4 141.2 38.3 ...
##  $ timbre_6_min            : num  -88.8 -86.9 -88.9 -70.8 -110.8 ...
##  $ timbre_6_max            : num  73.5 75.5 66.5 64.5 72.4 ...
##  $ timbre_7_min            : num  -71.1 -65.8 -67.4 -63.7 -55.9 ...
##  $ timbre_7_max            : num  82.5 106.9 80.6 96.7 110.3 ...
##  $ timbre_8_min            : num  -52 -61.3 -59.8 -78.7 -56.5 ...
##  $ timbre_8_max            : num  39.1 35.4 46 41.1 37.6 ...
##  $ timbre_9_min            : num  -35.4 -81.9 -46.3 -49.2 -48.6 ...
##  $ timbre_9_max            : num  71.6 74.6 59.9 95.4 67.6 ...
##  $ timbre_10_min           : num  -126.4 -103.8 -108.3 -102.7 -52.8 ...
##  $ timbre_10_max           : num  18.7 121.9 33.3 46.4 22.9 ...
##  $ timbre_11_min           : num  -44.8 -38.9 -43.7 -59.4 -50.4 ...
##  $ timbre_11_max           : num  26 22.5 25.7 37.1 32.8 ...
##  $ Top10                   : int  0 0 0 0 0 0 0 0 0 1 ...
nrow(subset(songs, year == 2010))
## [1] 373
table(songs$year)
## 
## 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 
##  328  196  186  324  198  258  178  329  380  357  363  282  518  434  479 
## 2005 2006 2007 2008 2009 2010 
##  392  479  622  415  483  373

Explanation

First, navigate to the directory on your computer containing the file "songs.csv". You can load the dataset by using the command: songs = read.csv("songs.csv"). Then, you can count the number of songs from 2010 by using the table function: table(songs$year).

Problem 1.2 - Understanding the Data

How many songs does the dataset include for which the artist name is “Michael Jackson”?

MichaelJackson <- subset(songs, artistname == "Michael Jackson")

length(unique(MichaelJackson$songID))
## [1] 18

Problem 1.3 - Understanding the Data

Which of these songs by Michael Jackson made it to the Top 10? Select all that apply.

MJ_top10 <- subset(MichaelJackson, Top10 == 1)

MJ_top10$songtitle
## [1] You Rock My World You Are Not Alone Black or White    Remember the Time
## [5] In The Closet    
## 7141 Levels: ̈́ l'or_e des bois _\x84_ _\x84\x8d ... Zumbi
MichaelJackson[c("songtitle", "Top10")]
##                           songtitle Top10
## 4329              You Rock My World     1
## 6205           She's Out of My Life     0
## 6206    Wanna Be Startin' Somethin'     0
## 6207              You Are Not Alone     1
## 6208                    Billie Jean     0
## 6209       The Way You Make Me Feel     0
## 6210                 Black or White     1
## 6211                  Rock with You     0
## 6212                            Bad     0
## 6213   I Just Can't Stop Loving You     0
## 6214              Man in the Mirror     0
## 6215                       Thriller     0
## 6216                        Beat It     0
## 6217               The Girl Is Mine     0
## 6218              Remember the Time     1
## 6219 Don't Stop 'Til You Get Enough     0
## 6220                 Heal the World     0
## 6915                  In The Closet     1

Problem 1.4 - Understanding the Data

The variable corresponding to the estimated time signature (timesignature) is discrete, meaning that it only takes integer values (0, 1, 2, 3, ...). What are the values of this variable that occur in our dataset? Select all that apply.

unique(songs$timesignature)
## [1] 3 4 5 7 1 0

Which timesignature value is the most frequent among songs in our dataset?

table(songs$timesignature)
## 
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19

Problem 1.5 - Understanding the Data

Out of all of the songs in our dataset, the song with the highest tempo is one of the following songs. Which one is it?

max_tempo <- max(songs$tempo)

subset(songs, tempo == max_tempo)$songtitle
## [1] Wanna Be Startin' Somethin'
## 7141 Levels: ̈́ l'or_e des bois _\x84_ _\x84\x8d ... Zumbi

Explanation

You can answer this question by using the which.max function. The output of which.max(songs$tempo) is 6206, meaning that the song with the highest tempo is in row 6206. We can output the song title by typing: songs$songtitle[6206]. The song title is: Wanna Be Startin' Somethin'.
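The which.max approach from the explanation looks like this:

which.max(songs$tempo)                   # 6206, the row with the highest tempo
songs$songtitle[which.max(songs$tempo)]  # Wanna Be Startin' Somethin'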

Problem 2.1 - Creating Our Prediction Model

We wish to predict whether or not a song will make it to the Top 10. To do this, first use the subset function to split the data into a training set “SongsTrain” consisting of all the observations up to and including 2009 song releases, and a testing set “SongsTest”, consisting of the 2010 song releases.

How many observations (songs) are in the training set?

SongsTrain <- subset(songs, year <= 2009)
nrow(SongsTrain)
## [1] 7201
SongsTest <- subset(songs, year == 2010)

Problem 2.2 - Creating our Prediction Model

In this problem, our outcome variable is “Top10” - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. Since the outcome variable is binary, we will build a logistic regression model. We’ll start by using all song attributes as our independent variables, which we’ll call Model 1.

We will only use the variables in our dataset that describe the numerical attributes of the song in our logistic regression model. So we won’t use the variables “year”, “songtitle”, “artistname”, “songID” or “artistID”.

We have seen in the lecture that, to build the logistic regression model, we would normally explicitly input the formula including all the independent variables in R. However, in this case, this is a tedious amount of work since we have a large number of independent variables.

There is a nice trick to avoid doing so. Let’s suppose that, except for the outcome variable Top10, all other variables in the training set are inputs to Model 1. Then, we can use the formula

SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)

to build our model. Notice that the “.” is used in place of enumerating all the independent variables. (Also, keep in mind that you can choose to put quotes around binomial, or leave out the quotes. R can understand this argument either way.)

However, in our case, we want to exclude some of the variables in our dataset from being used as independent variables (“year”, “songtitle”, “artistname”, “songID”, and “artistID”). To do this, we can use the following trick. First define a vector of variable names called nonvars - these are the variables that we won’t use in our model.

nonvars = c("year", "songtitle", "artistname", "songID", "artistID")

To remove these variables from your training and testing sets, type the following commands in your R console:

SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]

SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]

Now, use the glm function to build a logistic regression model to predict Top10 using all of the other variables as the independent variables. You should use SongsTrain to build the model.

Looking at the summary of your model, what is the value of the Akaike Information Criterion (AIC)?

# Variables that we won't use in the model
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")

# Remove these variables from the training and testing sets
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]

# Build logistic regression model
str(SongsTrain)
## 'data.frame':    7201 obs. of  34 variables:
##  $ timesignature           : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ timesignature_confidence: num  0.732 0.906 0.987 0.822 0.983 1 0.821 0.997 0.816 1 ...
##  $ loudness                : num  -6.32 -9.54 -4.84 -5.27 -6.23 ...
##  $ tempo                   : num  89.6 117.7 119 71.5 77.5 ...
##  $ tempo_confidence        : num  0.652 0.542 0.838 0.613 0.74 0.821 0.912 0.609 0.786 0.27 ...
##  $ key                     : int  1 0 6 4 8 9 6 9 0 9 ...
##  $ key_confidence          : num  0.773 0.722 0.106 0.781 0.552 0.218 0.275 0.333 0.634 0.578 ...
##  $ energy                  : num  0.599 0.363 0.76 0.755 0.524 ...
##  $ pitch                   : num  0.004 0.006 0.003 0.014 0.008 0.012 0.002 0.003 0.001 0.006 ...
##  $ timbre_0_min            : num  0 0.739 0 0 0 ...
##  $ timbre_0_max            : num  57.8 57.1 57.8 58.3 57.6 ...
##  $ timbre_1_min            : num  -62.3 -220.2 -189.7 -113.9 -160.6 ...
##  $ timbre_1_max            : num  286 241 187 171 217 ...
##  $ timbre_2_min            : num  -81.8 -96.8 -139.1 -71.6 -79.5 ...
##  $ timbre_2_max            : num  211 215 135 195 114 ...
##  $ timbre_3_min            : num  -217 -202 -116 -276 -184 ...
##  $ timbre_3_max            : num  203.2 124.2 94.7 146.3 108.7 ...
##  $ timbre_4_min            : num  -55.9 -52.4 -55.6 -59.4 -31.9 ...
##  $ timbre_4_max            : num  97.6 131.9 79.3 121.7 169.7 ...
##  $ timbre_5_min            : num  -62.5 -73.9 -73.5 -71.1 -73 ...
##  $ timbre_5_max            : num  82.2 73.6 41 39.6 233.9 ...
##  $ timbre_6_min            : num  -82.1 -63.5 -41.5 -77.8 -76 ...
##  $ timbre_6_max            : num  59.2 70.1 62.8 94.5 58 ...
##  $ timbre_7_min            : num  -109.4 -90.1 -69.3 -69.1 -78.8 ...
##  $ timbre_7_max            : num  71 112.9 90.4 93.4 100.8 ...
##  $ timbre_8_min            : num  -71.8 -64.5 -52.5 -55.8 -61.4 ...
##  $ timbre_8_max            : num  58.4 58.1 40.7 79 50.3 ...
##  $ timbre_9_min            : num  -53.8 -76.9 -50.4 -51.5 -63 ...
##  $ timbre_9_max            : num  88.6 74.4 58.8 70.5 96.8 ...
##  $ timbre_10_min           : num  -89.8 -88.2 -78.2 -74.9 -90.4 ...
##  $ timbre_10_max           : num  38 42.2 35.3 30.8 60.5 ...
##  $ timbre_11_min           : num  -52.1 -66.8 -54.2 -51.4 -52.1 ...
##  $ timbre_11_max           : num  52.8 40.7 46.5 27.8 48.1 ...
##  $ Top10                   : int  0 0 0 0 0 0 0 0 0 0 ...
SongsLog <- glm(Top10 ~ ., data = SongsTrain, family=binomial)

summary(SongsLog)
## 
## Call:
## glm(formula = Top10 ~ ., family = binomial, data = SongsTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9220  -0.5399  -0.3459  -0.1845   3.0770  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               1.470e+01  1.806e+00   8.138 4.03e-16 ***
## timesignature             1.264e-01  8.674e-02   1.457 0.145050    
## timesignature_confidence  7.450e-01  1.953e-01   3.815 0.000136 ***
## loudness                  2.999e-01  2.917e-02  10.282  < 2e-16 ***
## tempo                     3.634e-04  1.691e-03   0.215 0.829889    
## tempo_confidence          4.732e-01  1.422e-01   3.329 0.000873 ***
## key                       1.588e-02  1.039e-02   1.529 0.126349    
## key_confidence            3.087e-01  1.412e-01   2.187 0.028760 *  
## energy                   -1.502e+00  3.099e-01  -4.847 1.25e-06 ***
## pitch                    -4.491e+01  6.835e+00  -6.570 5.02e-11 ***
## timbre_0_min              2.316e-02  4.256e-03   5.441 5.29e-08 ***
## timbre_0_max             -3.310e-01  2.569e-02 -12.882  < 2e-16 ***
## timbre_1_min              5.881e-03  7.798e-04   7.542 4.64e-14 ***
## timbre_1_max             -2.449e-04  7.152e-04  -0.342 0.732087    
## timbre_2_min             -2.127e-03  1.126e-03  -1.889 0.058843 .  
## timbre_2_max              6.586e-04  9.066e-04   0.726 0.467571    
## timbre_3_min              6.920e-04  5.985e-04   1.156 0.247583    
## timbre_3_max             -2.967e-03  5.815e-04  -5.103 3.34e-07 ***
## timbre_4_min              1.040e-02  1.985e-03   5.237 1.63e-07 ***
## timbre_4_max              6.110e-03  1.550e-03   3.942 8.10e-05 ***
## timbre_5_min             -5.598e-03  1.277e-03  -4.385 1.16e-05 ***
## timbre_5_max              7.736e-05  7.935e-04   0.097 0.922337    
## timbre_6_min             -1.686e-02  2.264e-03  -7.445 9.66e-14 ***
## timbre_6_max              3.668e-03  2.190e-03   1.675 0.093875 .  
## timbre_7_min             -4.549e-03  1.781e-03  -2.554 0.010661 *  
## timbre_7_max             -3.774e-03  1.832e-03  -2.060 0.039408 *  
## timbre_8_min              3.911e-03  2.851e-03   1.372 0.170123    
## timbre_8_max              4.011e-03  3.003e-03   1.336 0.181620    
## timbre_9_min              1.367e-03  2.998e-03   0.456 0.648356    
## timbre_9_max              1.603e-03  2.434e-03   0.659 0.510188    
## timbre_10_min             4.126e-03  1.839e-03   2.244 0.024852 *  
## timbre_10_max             5.825e-03  1.769e-03   3.292 0.000995 ***
## timbre_11_min            -2.625e-02  3.693e-03  -7.108 1.18e-12 ***
## timbre_11_max             1.967e-02  3.385e-03   5.811 6.21e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6017.5  on 7200  degrees of freedom
## Residual deviance: 4759.2  on 7167  degrees of freedom
## AIC: 4827.2
## 
## Number of Fisher Scoring iterations: 6

Problem 2.3 - Creating Our Prediction Model

Let’s now think about the variables in our dataset related to the confidence of the time signature, key and tempo (timesignature_confidence, key_confidence, and tempo_confidence). Our model seems to indicate that these confidence variables are significant (rather than the variables timesignature, key and tempo themselves). What does the model suggest?

Answer

The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10.

Explanation

If you look at the output summary(model), where model is the name of your logistic regression model, you can see that the coefficient estimates for the confidence variables (timesignature_confidence, key_confidence, and tempo_confidence) are positive. This means that higher confidence leads to a higher predicted probability of a Top 10 hit.
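To see these effects on the odds scale, the confidence coefficients can be exponentiated (a quick sketch using the SongsLog model above; each value is greater than 1 because the coefficients are positive):

exp(coef(SongsLog)[c("timesignature_confidence", "key_confidence", "tempo_confidence")])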

Problem 2.4 - Creating Our Prediction Model

In general, if the confidence is low for the time signature, tempo, and key, then the song is more likely to be complex. What does Model 1 suggest in terms of complexity?

Answer

Mainstream listeners tend to prefer less complex songs

Explanation

Since the coefficient values for timesignature_confidence, tempo_confidence, and key_confidence are all positive, lower confidence leads to a lower predicted probability of a song being a hit. So mainstream listeners tend to prefer less complex songs.

Problem 2.5 - Creating Our Prediction Model

Songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”).

By inspecting the coefficient of the variable “loudness”, what does Model 1 suggest?

Answer

Mainstream listeners prefer songs with heavy instrumentation.

By inspecting the coefficient of the variable “energy”, do we draw the same conclusions as above?

Answer

No

Explanation

The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs, which are those with heavier instrumentation. However, the coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those with light instrumentation. These coefficients lead us to different conclusions!
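The two coefficients can be read directly off the fitted model:

coef(SongsLog)[c("loudness", "energy")]   # positive for loudness, negative for energy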

Problem 3.1 - Beware of Multicollinearity Issues!

What is the correlation between the variables “loudness” and “energy” in the training set?

cor(SongsTrain$loudness, SongsTrain$energy)
## [1] 0.7399067

Given that these two variables are highly correlated, Model 1 suffers from multicollinearity. To avoid this issue, we will omit one of these two variables and rerun the logistic regression. In the rest of this problem, we’ll build two variations of our original model: Model 2, in which we keep “energy” and omit “loudness”, and Model 3, in which we keep “loudness” and omit “energy”.

Problem 3.2 - Beware of Multicollinearity Issues!

Create Model 2, which is Model 1 without the independent variable “loudness”. This can be done with the following command:

SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)

We just subtracted the variable loudness. We couldn't do this with the variables "songtitle" and "artistname", because they are not numeric variables, and the test set might contain values that the training set has never seen. But this approach (subtracting the variable from the model formula) will always work when you want to remove numeric variables.

Look at the summary of SongsLog2, and inspect the coefficient of the variable “energy”. What do you observe?

SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)

summary(SongsLog2)
## 
## Call:
## glm(formula = Top10 ~ . - loudness, family = binomial, data = SongsTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0983  -0.5607  -0.3602  -0.1902   3.3107  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.241e+00  7.465e-01  -3.002 0.002686 ** 
## timesignature             1.625e-01  8.734e-02   1.860 0.062873 .  
## timesignature_confidence  6.885e-01  1.924e-01   3.578 0.000346 ***
## tempo                     5.521e-04  1.665e-03   0.332 0.740226    
## tempo_confidence          5.497e-01  1.407e-01   3.906 9.40e-05 ***
## key                       1.740e-02  1.026e-02   1.697 0.089740 .  
## key_confidence            2.954e-01  1.394e-01   2.118 0.034163 *  
## energy                    1.813e-01  2.608e-01   0.695 0.486991    
## pitch                    -5.150e+01  6.857e+00  -7.511 5.87e-14 ***
## timbre_0_min              2.479e-02  4.240e-03   5.847 5.01e-09 ***
## timbre_0_max             -1.007e-01  1.178e-02  -8.551  < 2e-16 ***
## timbre_1_min              7.143e-03  7.710e-04   9.265  < 2e-16 ***
## timbre_1_max             -7.830e-04  7.064e-04  -1.108 0.267650    
## timbre_2_min             -1.579e-03  1.109e-03  -1.424 0.154531    
## timbre_2_max              3.889e-04  8.964e-04   0.434 0.664427    
## timbre_3_min              6.500e-04  5.949e-04   1.093 0.274524    
## timbre_3_max             -2.462e-03  5.674e-04  -4.339 1.43e-05 ***
## timbre_4_min              9.115e-03  1.952e-03   4.670 3.02e-06 ***
## timbre_4_max              6.306e-03  1.532e-03   4.115 3.87e-05 ***
## timbre_5_min             -5.641e-03  1.255e-03  -4.495 6.95e-06 ***
## timbre_5_max              6.937e-04  7.807e-04   0.889 0.374256    
## timbre_6_min             -1.612e-02  2.235e-03  -7.214 5.45e-13 ***
## timbre_6_max              3.814e-03  2.157e-03   1.768 0.076982 .  
## timbre_7_min             -5.102e-03  1.755e-03  -2.907 0.003644 ** 
## timbre_7_max             -3.158e-03  1.811e-03  -1.744 0.081090 .  
## timbre_8_min              4.488e-03  2.810e-03   1.597 0.110254    
## timbre_8_max              6.423e-03  2.950e-03   2.177 0.029497 *  
## timbre_9_min             -4.282e-04  2.955e-03  -0.145 0.884792    
## timbre_9_max              3.525e-03  2.377e-03   1.483 0.138017    
## timbre_10_min             2.993e-03  1.804e-03   1.660 0.097004 .  
## timbre_10_max             7.367e-03  1.731e-03   4.255 2.09e-05 ***
## timbre_11_min            -2.837e-02  3.630e-03  -7.815 5.48e-15 ***
## timbre_11_max             1.829e-02  3.341e-03   5.476 4.34e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6017.5  on 7200  degrees of freedom
## Residual deviance: 4871.8  on 7168  degrees of freedom
## AIC: 4937.8
## 
## Number of Fisher Scoring iterations: 6

Model 2 suggests that songs with high energy levels tend to be more popular. This contradicts our observation in Model 1.

Explanation

The coefficient estimate for energy is positive in Model 2, suggesting that songs with higher energy levels tend to be more popular. However, note that the variable energy is not significant in this model.

Problem 3.3 - Beware of Multicollinearity Issues!

Now, create Model 3, which should be exactly like Model 1, but without the variable “energy”.

Look at the summary of Model 3 and inspect the coefficient of the variable “loudness”. Remembering that higher loudness and energy both occur in songs with heavier instrumentation, do we make the same observation about the popularity of heavy instrumentation as we did with Model 2?

SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family=binomial)

summary(SongsLog3)
## 
## Call:
## glm(formula = Top10 ~ . - energy, family = binomial, data = SongsTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9182  -0.5417  -0.3481  -0.1874   3.4171  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               1.196e+01  1.714e+00   6.977 3.01e-12 ***
## timesignature             1.151e-01  8.726e-02   1.319 0.187183    
## timesignature_confidence  7.143e-01  1.946e-01   3.670 0.000242 ***
## loudness                  2.306e-01  2.528e-02   9.120  < 2e-16 ***
## tempo                    -6.460e-04  1.665e-03  -0.388 0.698107    
## tempo_confidence          3.841e-01  1.398e-01   2.747 0.006019 ** 
## key                       1.649e-02  1.035e-02   1.593 0.111056    
## key_confidence            3.394e-01  1.409e-01   2.409 0.015984 *  
## pitch                    -5.328e+01  6.733e+00  -7.914 2.49e-15 ***
## timbre_0_min              2.205e-02  4.239e-03   5.200 1.99e-07 ***
## timbre_0_max             -3.105e-01  2.537e-02 -12.240  < 2e-16 ***
## timbre_1_min              5.416e-03  7.643e-04   7.086 1.38e-12 ***
## timbre_1_max             -5.115e-04  7.110e-04  -0.719 0.471928    
## timbre_2_min             -2.254e-03  1.120e-03  -2.012 0.044190 *  
## timbre_2_max              4.119e-04  9.020e-04   0.457 0.647915    
## timbre_3_min              3.179e-04  5.869e-04   0.542 0.588083    
## timbre_3_max             -2.964e-03  5.758e-04  -5.147 2.64e-07 ***
## timbre_4_min              1.105e-02  1.978e-03   5.585 2.34e-08 ***
## timbre_4_max              6.467e-03  1.541e-03   4.196 2.72e-05 ***
## timbre_5_min             -5.135e-03  1.269e-03  -4.046 5.21e-05 ***
## timbre_5_max              2.979e-04  7.855e-04   0.379 0.704526    
## timbre_6_min             -1.784e-02  2.246e-03  -7.945 1.94e-15 ***
## timbre_6_max              3.447e-03  2.182e-03   1.580 0.114203    
## timbre_7_min             -5.128e-03  1.768e-03  -2.900 0.003733 ** 
## timbre_7_max             -3.394e-03  1.820e-03  -1.865 0.062208 .  
## timbre_8_min              3.686e-03  2.833e-03   1.301 0.193229    
## timbre_8_max              4.658e-03  2.988e-03   1.559 0.119022    
## timbre_9_min             -9.318e-05  2.957e-03  -0.032 0.974859    
## timbre_9_max              1.342e-03  2.424e-03   0.554 0.579900    
## timbre_10_min             4.050e-03  1.827e-03   2.217 0.026637 *  
## timbre_10_max             5.793e-03  1.759e-03   3.294 0.000988 ***
## timbre_11_min            -2.638e-02  3.683e-03  -7.162 7.96e-13 ***
## timbre_11_max             1.984e-02  3.365e-03   5.896 3.74e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6017.5  on 7200  degrees of freedom
## Residual deviance: 4782.7  on 7168  degrees of freedom
## AIC: 4848.7
## 
## Number of Fisher Scoring iterations: 6

Explanation

Model 3 can be created with the following command: SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family=binomial). Looking at the output of summary(SongsLog3), we can see that loudness has a positive coefficient estimate, meaning that our model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion we got from Model 2. In the remainder of this problem, we'll just use Model 3.

Problem 4.1 - Validating Our Model

Make predictions on the test set using Model 3. What is the accuracy of Model 3 on the test set, using a threshold of 0.45? (Compute the accuracy as a number between 0 and 1.)

# Make predictions on the test set
predictTest = predict(SongsLog3, newdata = SongsTest, type="response")

# Create confusion matrix with threshold of 0.45
confMat <- table(SongsTest$Top10, predictTest >= 0.45)
confMat
##    
##     FALSE TRUE
##   0   309    5
##   1    40   19
accuracy <- (confMat[1, 1]+confMat[2, 2])/sum(confMat)
accuracy
## [1] 0.8793566

Problem 4.2 - Validating Our Model

Let’s check if there’s any incremental benefit in using Model 3 instead of a baseline model. Given the difficulty of guessing which song is going to be a hit, an easier model would be to pick the most frequent outcome (a song is not a Top 10 hit) for all songs. What would the accuracy of the baseline model be on the test set? (Give your answer as a number between 0 and 1.)

confMat
##    
##     FALSE TRUE
##   0   309    5
##   1    40   19
acc_baseline <- (309+5)/sum(confMat)
acc_baseline
## [1] 0.8418231
# Or

table(SongsTest$Top10)
## 
##   0   1 
## 314  59
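Equivalently, since the baseline predicts "not Top 10" for every song, its accuracy is the share of non-hits in the test set:

314 / (314 + 59)   # 0.8418231, matching acc_baseline above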

Problem 4.3 - Validating Our Model

It seems that Model 3 gives us a small improvement over the baseline model. Still, does it create an edge?

Let’s view the two models from an investment perspective. A production company is interested in investing in songs that are highly likely to make it to the Top 10. The company’s objective is to minimize its risk of financial losses attributed to investing in songs that end up unpopular.

A competitive edge can therefore be achieved if we can provide the production company a list of songs that are highly likely to end up in the Top 10. We note that the baseline model does not prove useful, as it simply does not label any song as a hit. Let us see what our model has to offer.

How many songs does Model 3 correctly predict as Top 10 hits in 2010 (remember that all songs in 2010 went into our test set), using a threshold of 0.45?

confMat
##    
##     FALSE TRUE
##   0   309    5
##   1    40   19
# True positives
confMat[2, 2]
## [1] 19

How many non-hit songs does Model 3 predict will be Top 10 hits (again, looking at the test set), using a threshold of 0.45?

# false positives
confMat[1, 2]
## [1] 5

Problem 4.4 - Validating Our Model

What is the sensitivity of Model 3 on the test set, using a threshold of 0.45?

sensitivity <- confMat[2, 2] / sum(confMat[2, ])
sensitivity
## [1] 0.3220339

What is the specificity of Model 3 on the test set, using a threshold of 0.45?

confMat
##    
##     FALSE TRUE
##   0   309    5
##   1    40   19
specificity <- confMat[1, 1] / sum(confMat[1, ])
specificity
## [1] 0.9840764

Problem 4.5 - Validating Our Model

What conclusions can you make about our model? (Select all that apply.)

Answer

  • Model 3 favors specificity over sensitivity.

  • Model 3 provides conservative predictions, and predicts that a song will make it to the Top 10 very rarely. So while it detects less than half of the Top 10 songs, we can be very confident in the songs that it does predict to be Top 10 hits.

Explanation

Model 3 has a very high specificity, meaning that it favors specificity over sensitivity. While Model 3 only captures less than half of the Top 10 songs, it still can offer a competitive edge, since it is very conservative in its predictions.


Predicting parole violators

In many criminal justice systems around the world, inmates deemed not to be a threat to society are released from prison under the parole system prior to completing their sentence. They are still considered to be serving their sentence while on parole, and they can be returned to prison if they violate the terms of their parole.

Parole boards are charged with identifying which inmates are good candidates for release on parole. They seek to release inmates who will not commit additional crimes after release. In this problem, we will build and validate a model that predicts if an inmate will violate the terms of his or her parole. Such a model could be useful to a parole board when deciding to approve or deny an application for parole.

For this prediction task, we will use data from the United States 2004 National Corrections Reporting Program, a nationwide census of parole releases that occurred during 2004. We limited our focus to parolees who served no more than 6 months in prison and whose maximum sentence for all charges did not exceed 18 months. The dataset contains all such parolees who either successfully completed their term of parole during 2004 or those who violated the terms of their parole during that year. The dataset contains the following variables:


Problem 1.1 - Loading the Dataset

Load the dataset “parole.csv” into a data frame called parole, and investigate it using the str() and summary() functions.

How many parolees are contained in the dataset?

parole <- read.csv("parole.csv")

str(parole)
## 'data.frame':    675 obs. of  9 variables:
##  $ male             : int  1 0 1 1 1 1 1 0 0 1 ...
##  $ race             : int  1 1 2 1 2 2 1 1 1 2 ...
##  $ age              : num  33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
##  $ state            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ time.served      : num  5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
##  $ max.sentence     : int  18 12 12 18 12 18 18 12 13 12 ...
##  $ multiple.offenses: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ crime            : int  4 3 3 1 1 4 3 1 3 2 ...
##  $ violator         : int  0 0 0 0 0 0 0 0 0 0 ...
summary(parole)
##       male             race            age            state      
##  Min.   :0.0000   Min.   :1.000   Min.   :18.40   Min.   :1.000  
##  1st Qu.:1.0000   1st Qu.:1.000   1st Qu.:25.35   1st Qu.:2.000  
##  Median :1.0000   Median :1.000   Median :33.70   Median :3.000  
##  Mean   :0.8074   Mean   :1.424   Mean   :34.51   Mean   :2.887  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:42.55   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :2.000   Max.   :67.00   Max.   :4.000  
##   time.served     max.sentence   multiple.offenses     crime      
##  Min.   :0.000   Min.   : 1.00   Min.   :0.0000    Min.   :1.000  
##  1st Qu.:3.250   1st Qu.:12.00   1st Qu.:0.0000    1st Qu.:1.000  
##  Median :4.400   Median :12.00   Median :1.0000    Median :2.000  
##  Mean   :4.198   Mean   :13.06   Mean   :0.5363    Mean   :2.059  
##  3rd Qu.:5.200   3rd Qu.:15.00   3rd Qu.:1.0000    3rd Qu.:3.000  
##  Max.   :6.000   Max.   :18.00   Max.   :1.0000    Max.   :4.000  
##     violator     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1156  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

Problem 1.2 - Loading the Dataset

How many of the parolees in the dataset violated the terms of their parole?

sum(parole$violator == 1)
## [1] 78

Problem 2.1 - Preparing the Dataset

You should be familiar with unordered factors (if not, review the Week 2 homework problem “Reading Test Scores”). Which variables in this dataset are unordered factors with at least three levels? Select all that apply.

Answer

While the variables male, race, state, crime, and violator are all unordered factors, only state and crime have at least 3 levels in this dataset.

Problem 2.2 - Preparing the Dataset

In the last subproblem, we identified variables that are unordered factors with at least 3 levels, so we need to convert them to factors for our prediction problem (we introduced this idea in the “Reading Test Scores” problem last week). Using the as.factor() function, convert these variables to factors. Keep in mind that we are not changing the values, just the way R understands them (the values are still numbers).

How does the output of summary() change for a factor variable as compared to a numerical variable?

parole$state <- as.factor(parole$state)
parole$crime <- as.factor(parole$crime)
summary(parole)
##       male             race            age        state    time.served   
##  Min.   :0.0000   Min.   :1.000   Min.   :18.40   1:143   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.000   1st Qu.:25.35   2:120   1st Qu.:3.250  
##  Median :1.0000   Median :1.000   Median :33.70   3: 82   Median :4.400  
##  Mean   :0.8074   Mean   :1.424   Mean   :34.51   4:330   Mean   :4.198  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:42.55           3rd Qu.:5.200  
##  Max.   :1.0000   Max.   :2.000   Max.   :67.00           Max.   :6.000  
##   max.sentence   multiple.offenses crime      violator     
##  Min.   : 1.00   Min.   :0.0000    1:315   Min.   :0.0000  
##  1st Qu.:12.00   1st Qu.:0.0000    2:106   1st Qu.:0.0000  
##  Median :12.00   Median :1.0000    3:153   Median :0.0000  
##  Mean   :13.06   Mean   :0.5363    4:101   Mean   :0.1156  
##  3rd Qu.:15.00   3rd Qu.:1.0000            3rd Qu.:0.0000  
##  Max.   :18.00   Max.   :1.0000            Max.   :1.0000

Problem 3.1 - Splitting into a Training and Testing Set

To ensure consistent training/testing set splits, run the following 5 lines of code (do not include the line numbers at the beginning):

  1. set.seed(144)

  2. library(caTools)

  3. split = sample.split(parole$violator, SplitRatio = 0.7)

  4. train = subset(parole, split == TRUE)

  5. test = subset(parole, split == FALSE)

Roughly what proportion of parolees have been allocated to the training and testing sets?

set.seed(144)

library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)

Problem 3.2 - Splitting into a Training and Testing Set

Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect?

  • The exact same training/testing set split as the first execution of [1]-[5].

If you instead ONLY re-ran lines [3]-[5], what would you expect?

  • A different training/testing set split from the first execution of [1]-[5]

If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?

  • A different training/testing set split from the first execution of [1]-[5]

Explanation

If you set a random seed, split, set the seed again to the same value, and then split again, you will get the same split. However, if you set the seed and then split twice, you will get different splits. If you set the seed to different values, you will get different splits. You can also verify this by running the specified code in R. If you have training sets train1 and train2, the function sum(train1 != train2) will count the number of values in those two data frames that are different.
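A minimal sketch of that check, using hypothetical names split1 and split2 for the two split vectors:

set.seed(144)
split1 = sample.split(parole$violator, SplitRatio = 0.7)
set.seed(144)
split2 = sample.split(parole$violator, SplitRatio = 0.7)
sum(split1 != split2)   # 0: resetting the seed reproduces the same split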

Problem 4.1 - Building a Logistic Regression Model

If you tested other training/testing set splits in the previous section, please re-run the original 5 lines of code to obtain the original split.

Using glm (and remembering the parameter family="binomial"), train a logistic regression model on the training set. Your dependent variable is "violator", and you should use all of the other variables as independent variables.

What variables are significant in this model? Significant variables should have a least one star, or should have a probability less than 0.05 (the column Pr(>|z|) in the summary output). Select all that apply.

mod <- glm(violator ~ ., data = train, family = binomial)

summary(mod)
## 
## Call:
## glm(formula = violator ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7041  -0.4236  -0.2719  -0.1690   2.8375  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -4.2411574  1.2938852  -3.278  0.00105 ** 
## male               0.3869904  0.4379613   0.884  0.37690    
## race               0.8867192  0.3950660   2.244  0.02480 *  
## age               -0.0001756  0.0160852  -0.011  0.99129    
## state2             0.4433007  0.4816619   0.920  0.35739    
## state3             0.8349797  0.5562704   1.501  0.13335    
## state4            -3.3967878  0.6115860  -5.554 2.79e-08 ***
## time.served       -0.1238867  0.1204230  -1.029  0.30359    
## max.sentence       0.0802954  0.0553747   1.450  0.14705    
## multiple.offenses  1.6119919  0.3853050   4.184 2.87e-05 ***
## crime2             0.6837143  0.5003550   1.366  0.17180    
## crime3            -0.2781054  0.4328356  -0.643  0.52054    
## crime4            -0.0117627  0.5713035  -0.021  0.98357    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 340.04  on 472  degrees of freedom
## Residual deviance: 251.48  on 460  degrees of freedom
## AIC: 277.48
## 
## Number of Fisher Scoring iterations: 6

Problem 4.2 - Building a Logistic Regression Model

What can we say based on the coefficient of the multiple.offenses variable?

The following two properties might be useful to you when answering this question:

  1. If we have a coefficient c for a variable, then that means the log odds (or Logit) are increased by c for a unit increase in the variable.

  2. If we have a coefficient c for a variable, then that means the odds are multiplied by e^c for a unit increase in the variable.

Answer

Our model predicts that a parolee who committed multiple offenses has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical.

Explanation

For parolees A and B who are identical other than A having committed multiple offenses, the predicted log odds of A is 1.61 more than the predicted log odds of B. Then we have:

$$
\begin{aligned}
\ln(\text{odds of A}) &= \ln(\text{odds of B}) + 1.61 \\
e^{\ln(\text{odds of A})} &= e^{\ln(\text{odds of B}) + 1.61} \\
e^{\ln(\text{odds of A})} &= e^{\ln(\text{odds of B})} \, e^{1.61} \\
\text{odds of A} &= e^{1.61} \cdot \text{odds of B} \\
\text{odds of A} &\approx 5.01 \cdot \text{odds of B}
\end{aligned}
$$

In the second step we raised e to the power of both sides. In the third step we used the exponentiation rule that e^(a+b) = e^a * e^b. In the fourth step we used the rule that e^(ln(x)) = x.
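The same factor can be read directly off the fitted model:

exp(coef(mod)["multiple.offenses"])   # e^1.6119919, about 5.01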

Problem 4.3 - Building a Logistic Regression Model

Consider a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny. Answer the following questions based on the model’s predictions for this individual. (HINT: You should use the coefficients of your model, the Logistic Response Function, and the Odds equation to solve this problem.)

According to the model, what are the odds this individual is a violator?

mod$coefficients
##       (Intercept)              male              race               age 
##     -4.2411573912      0.3869904022      0.8867192411     -0.0001756156 
##            state2            state3            state4       time.served 
##      0.4433006515      0.8349796539     -3.3967877633     -0.1238867027 
##      max.sentence multiple.offenses            crime2            crime3 
##      0.0802953765      1.6119918595      0.6837143152     -0.2781054228 
##            crime4 
##     -0.0117626521
# This parolee has male=1, race=1, age=50, state2=0, state3=0, state4=0, 
# time.served=3, max.sentence=12, multiple.offenses=0, crime2=1, crime3=0, crime4=0
intercept <- mod$coefficients[[1]]
coef_male <- mod$coefficients[[2]]
coef_race <- mod$coefficients[[3]]
coef_age <- mod$coefficients[[4]]
coef_ts <- mod$coefficients[[8]]
coef_ms <- mod$coefficients[[9]]
coef_multi <- mod$coefficients[[10]]
coef_larceny <- mod$coefficients[[11]]

odds = exp(intercept + coef_male*1 + coef_race*1 + coef_age*50 + coef_ts*3 + 
               coef_ms*12 + coef_multi*0 + coef_larceny*1)

odds
## [1] 0.1825685

According to the model, what is the probability this individual is a violator?

probability <- odds / (1+odds)
probability
## [1] 0.154383
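As a cross-check, predict() can do the same arithmetic on a one-row data frame. This sketch assumes the dataset's coding, under which Maryland falls into the reference state category (state = 1) and larceny is crime = 2, consistent with the dummy variables set to 0 and crime2 = 1 above:

newobs <- data.frame(male = 1, race = 1, age = 50,
                     state = factor("1", levels = levels(train$state)),
                     time.served = 3, max.sentence = 12, multiple.offenses = 0,
                     crime = factor("2", levels = levels(train$crime)))
predict(mod, newdata = newobs, type = "response")   # should match the probability above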

Problem 5.1 - Evaluating the Model on the Testing Set

Use the predict() function to obtain the model's predicted probabilities for parolees in the testing set, remembering to pass type="response".

What is the maximum predicted probability of a violation?

predTest <- predict(mod, newdata = test, type = "response")

# max(predTest)
summary(predTest)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002334 0.023777 0.057905 0.146576 0.147452 0.907279

Problem 5.2 - Evaluating the Model on the Testing Set

In the following questions, evaluate the model’s predictions on the test set using a threshold of 0.5.

# confusion matrix
CM <- table(test$violator, as.numeric(predTest >= 0.5))
CM
##    
##       0   1
##   0 167  12
##   1  11  12

What is the model’s sensitivity, specificity and accuracy?

# Sensitivity = TP / (TP+FN)
CM[2, 2] / sum(CM[2, ])
## [1] 0.5217391
# Specificity = TN / (TN+FP)
CM[1, 1] / sum(CM[1, ])
## [1] 0.9329609
# Accuracy = (TN + TP) / total observations
(CM[1, 1] + CM[2, 2]) / sum(CM)
## [1] 0.8861386

Problem 5.3 - Evaluating the Model on the Testing Set

What is the accuracy of a simple model that predicts that every parolee is a non-violator?

table(test$violator)
## 
##   0   1 
## 179  23
179/(179+23)
## [1] 0.8861386

Explanation

If you table the outcome variable using the following command: table(test$violator) you can see that there are 179 negative examples, which are the ones that the baseline model would get correct. Thus the baseline model would have an accuracy of 179/202 = 0.886.

Problem 5.4 - Evaluating the Model on the Testing Set

Consider a parole board using the model to predict whether parolees will be violators or not. The job of a parole board is to make sure that a prisoner is ready to be released into free society, and therefore parole boards tend to be particularly concerned about releasing prisoners who will violate their parole. Which of the following most likely describes their preferences and best course of action?

Answer

The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5.

Explanation

If the board used the model for parole decisions, a negative prediction would lead to a prisoner being granted parole, while a positive prediction would lead to a prisoner being denied parole. The parole board would experience more regret for releasing a prisoner who then violates parole (a negative prediction that is actually positive, or false negative) than it would experience for denying parole to a prisoner who would not have violated parole (a positive prediction that is actually negative, or false positive). Decreasing the cutoff leads to more positive predictions, which increases false positives and decreases false negatives. Meanwhile, increasing the cutoff leads to more negative predictions, which increases false negatives and decreases false positives. The parole board assigns high cost to false negatives, and therefore should decrease the cutoff.
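To see the effect of a lower cutoff on the test set, the confusion matrix can be re-tabulated at, say, 0.3 (an illustrative choice) and compared to the matrix above:

table(test$violator, as.numeric(predTest >= 0.3))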

Problem 5.5 - Evaluating the Model on the Testing Set

Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model’s accuracy as compared to the simple baseline model?

Answer

The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model’s value.

Explanation

The model at cutoff 0.5 has 12 false positives and 11 false negatives, while the baseline model has 0 false positives and 23 false negatives. Because a parole board is likely to assign more cost to a false negative, the model at cutoff 0.5 is likely of value to the board. From the previous question, the parole board would likely benefit from decreasing the logistic regression cutoffs, which decreases the false negative rate while increasing the false positive rate.

Problem 5.6 - Evaluating the Model on the Testing Set

Using the ROCR package, what is the AUC value for the model?

library(ROCR)

ROCRpred = prediction(predTest, test$violator)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.8945834

Problem 5.7 - Evaluating the Model on the Testing Set

Describe the meaning of AUC in this context.

Answer

The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.

Problem 6.1 - Identifying Bias in Observational Data

Our goal has been to predict the outcome of a parole decision, and we used a publicly available dataset of parole releases for predictions. In this final problem, we’ll evaluate a potential source of bias associated with our analysis. It is always important to evaluate a dataset for possible sources of bias.

The dataset contains all individuals released from parole in 2004, either due to completing their parole term or violating the terms of their parole. However, it does not contain parolees who neither violated their parole nor completed their term in 2004, causing non-violators to be underrepresented. This is called “selection bias” or “selecting on the dependent variable,” because only a subset of all relevant parolees were included in our analysis, based on our dependent variable in this analysis (parole violation). How could we improve our dataset to best address selection bias?

Answer

We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.

Explanation

While expanding the dataset to include the missing parolees and labeling each as violator=0 would improve the representation of non-violators, it does not capture the true outcome, since the parolee might become a violator after 2004. Though labeling these new examples with violator=NA correctly identifies that we don’t know their true outcome, we cannot train or test a prediction model with a missing dependent variable. As a result, a prospective dataset that tracks a cohort of parolees and observes the true outcome of each is more desirable. Unfortunately, such datasets are often more challenging to obtain (for instance, if a parolee had a 10-year term, it might require tracking that individual for 10 years before building the model). Such a prospective analysis would not be possible using the 2004 National Corrections Reporting Program dataset.


Predicting loan repayment

In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.

To address this problem, we will use publicly available data from LendingClub.com, a website that connects borrowers and investors over the Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February 2010. The binary dependent variable not_fully_paid indicates that the loan was not paid back in full (the borrower either defaulted or the loan was “charged off,” meaning the borrower was deemed unlikely to ever pay it back).

To predict this dependent variable, we will use the independent variables available to the investor when deciding whether to fund a loan; these variables appear in the str() output in Problem 1.1 below.


Problem 1.1 - Preparing the Dataset

Load the dataset loans.csv into a data frame called loans, and explore it using the str() and summary() functions.

What proportion of the loans in the dataset were not paid in full? Please input a number between 0 and 1.

loans <- read.csv("loans.csv")

str(loans)
## 'data.frame':    9578 obs. of  14 variables:
##  $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
##  $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
##  $ installment      : num  829 228 367 162 103 ...
##  $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
##  $ dti              : num  19.5 14.3 11.6 8.1 15 ...
##  $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
##  $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
##  $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
##  $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
##  $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
##  $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ not.fully.paid   : int  0 0 0 0 0 0 1 1 0 0 ...
sum(loans$not.fully.paid == 1) / nrow(loans)
## [1] 0.1600543
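
Since not.fully.paid is coded as 0/1, the same proportion can also be obtained with mean() or read off a table of counts:

mean(loans$not.fully.paid)     # same proportion, 0.1600543
table(loans$not.fully.paid)    # counts of fully paid (0) vs. not fully paid (1) loans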

Problem 1.2 - Preparing the Dataset

Which of the following variables has at least one missing observation? Select all that apply.

summary(loans)
##  credit.policy                 purpose        int.rate     
##  Min.   :0.000   all_other         :2331   Min.   :0.0600  
##  1st Qu.:1.000   credit_card       :1262   1st Qu.:0.1039  
##  Median :1.000   debt_consolidation:3957   Median :0.1221  
##  Mean   :0.805   educational       : 343   Mean   :0.1226  
##  3rd Qu.:1.000   home_improvement  : 629   3rd Qu.:0.1407  
##  Max.   :1.000   major_purchase    : 437   Max.   :0.2164  
##                  small_business    : 619                   
##   installment     log.annual.inc        dti              fico      
##  Min.   : 15.67   Min.   : 7.548   Min.   : 0.000   Min.   :612.0  
##  1st Qu.:163.77   1st Qu.:10.558   1st Qu.: 7.213   1st Qu.:682.0  
##  Median :268.95   Median :10.928   Median :12.665   Median :707.0  
##  Mean   :319.09   Mean   :10.932   Mean   :12.607   Mean   :710.8  
##  3rd Qu.:432.76   3rd Qu.:11.290   3rd Qu.:17.950   3rd Qu.:737.0  
##  Max.   :940.14   Max.   :14.528   Max.   :29.960   Max.   :827.0  
##                   NA's   :4                                        
##  days.with.cr.line   revol.bal         revol.util     inq.last.6mths  
##  Min.   :  179     Min.   :      0   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.: 2820     1st Qu.:   3187   1st Qu.: 22.70   1st Qu.: 0.000  
##  Median : 4140     Median :   8596   Median : 46.40   Median : 1.000  
##  Mean   : 4562     Mean   :  16914   Mean   : 46.87   Mean   : 1.572  
##  3rd Qu.: 5730     3rd Qu.:  18250   3rd Qu.: 71.00   3rd Qu.: 2.000  
##  Max.   :17640     Max.   :1207359   Max.   :119.00   Max.   :33.000  
##  NA's   :29                          NA's   :62       NA's   :29      
##   delinq.2yrs         pub.rec       not.fully.paid  
##  Min.   : 0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 0.0000   Median :0.0000   Median :0.0000  
##  Mean   : 0.1638   Mean   :0.0621   Mean   :0.1601  
##  3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :13.0000   Max.   :5.0000   Max.   :1.0000  
##  NA's   :29        NA's   :29
apply(loans, 2, function(x) any(is.na(x)))
##     credit.policy           purpose          int.rate       installment 
##             FALSE             FALSE             FALSE             FALSE 
##    log.annual.inc               dti              fico days.with.cr.line 
##              TRUE             FALSE             FALSE              TRUE 
##         revol.bal        revol.util    inq.last.6mths       delinq.2yrs 
##             FALSE              TRUE              TRUE              TRUE 
##           pub.rec    not.fully.paid 
##              TRUE             FALSE

Answer

From summary(loans) (or equivalently the apply() output above), we can see that log.annual.inc, days.with.cr.line, revol.util, inq.last.6mths, delinq.2yrs and pub.rec have missing values.

Problem 1.3 - Preparing the Dataset

Which of the following is the best reason to fill in the missing values for these variables instead of removing observations with missing data? (Hint: you can use the subset() function to build a data frame with the observations missing at least one value. To test if a variable, for example pub.rec, is missing a value, use is.na(pub.rec).)

Answer

We want to be able to predict risk for all borrowers, instead of just the ones with all data reported.

Explanation

Answering this question requires analyzing the loans with missing data. We can build a data frame limited to observations with some missing data with the following command:

missing = subset(loans, 
                 is.na(log.annual.inc) | is.na(days.with.cr.line) | is.na(revol.util) 
                 | is.na(inq.last.6mths) | is.na(delinq.2yrs) | is.na(pub.rec))

missing
##      credit.policy            purpose int.rate installment log.annual.inc
## 782              1          all_other   0.1134       98.70      10.530495
## 804              1        educational   0.1103       52.41      10.532096
## 840              1 debt_consolidation   0.1134      263.20      10.714418
## 858              1 debt_consolidation   0.1229       23.35       9.852194
## 1214             1     major_purchase   0.1064      182.39      11.264464
## 1281             1        credit_card   0.1633      264.91      10.819778
## 1554             1          all_other   0.1557      314.51      10.596535
## 1783             1        educational   0.1695       97.98       8.342840
## 1928             1          all_other   0.1379      170.38      10.714418
## 2009             1          all_other   0.1316      405.25      11.350407
## 2194             1          all_other   0.1695      178.14      11.608236
## 2529             1     small_business   0.1537       48.79      10.491274
## 2533             1   home_improvement   0.1411      222.49      11.198215
## 2706             1     major_purchase   0.1505      346.92      10.757988
## 3351             1        educational   0.1505       60.71       9.210340
## 3980             1          all_other   0.1284       47.07       8.494539
## 4959             1        credit_card   0.1913      201.97      11.877569
## 5522             1          all_other   0.1426       41.17       8.476371
## 6274             1          all_other   0.1461      172.38      10.491274
## 6319             1     major_purchase   0.1531       87.04      13.003580
## 6639             1 debt_consolidation   0.1531      174.08      10.915088
## 7014             1     major_purchase   0.1461      517.13      10.491274
## 7702             1          all_other   0.1600       70.32      10.085809
## 7703             1 debt_consolidation   0.1821      181.30      10.714418
## 7726             0          all_other   0.0775      156.11      11.156251
## 7727             0          all_other   0.0838      204.84             NA
## 7728             0          all_other   0.0807      136.45      11.695247
## 7729             0          all_other   0.0933       80.69      11.608236
## 7730             0          all_other   0.0933      124.62      11.492723
## 7732             0          all_other   0.1028      164.42      11.461632
## 7733             0          all_other   0.0964       32.11       9.392662
## 7734             0          all_other   0.0996      103.20      11.918391
## 7735             0          all_other   0.1028      113.39      12.100712
## 7736             0          all_other   0.1122      211.85      10.373491
## 7737             0          all_other   0.0964      208.66       9.903488
## 7738             0          all_other   0.0996       64.50       8.699515
## 7739             0          all_other   0.1249       42.65      10.596635
## 7740             0          all_other   0.0933       95.86       9.903488
## 7741             0          all_other   0.0901       95.42      10.463103
## 7742             0          all_other   0.0775      209.18             NA
## 7743             0          all_other   0.0775      218.55             NA
## 7744             0          all_other   0.0743      155.38             NA
## 7745             0          all_other   0.1122      344.87      11.002100
## 7746             0          all_other   0.0712       30.94      10.819778
## 7747             0          all_other   0.0838       81.94       8.779557
## 7748             0          all_other   0.0743       93.23      11.289782
## 7749             0          all_other   0.0743       77.69      11.106820
## 7750             0          all_other   0.0775      156.11      12.611538
## 7751             0          all_other   0.1407       34.21      10.126631
## 7752             0          all_other   0.1091       45.78      10.596635
## 7753             0          all_other   0.1122      164.23      10.239960
## 7754             0          all_other   0.0901       38.17      10.491274
## 7759             0 debt_consolidation   0.0964       61.00      11.512925
## 7760             0          all_other   0.1312      180.57       8.294050
## 7799             0          all_other   0.1091      196.18       8.853665
## 7812             0          all_other   0.1565      131.19       8.892886
## 8555             0 debt_consolidation   0.1071       97.81      10.203592
## 8571             0 debt_consolidation   0.1387      163.75      10.736397
## 8677             0     major_purchase   0.0976      128.62      10.752484
## 8760             0        credit_card   0.0863      170.80      10.680516
## 9074             0          all_other   0.1442      859.57      12.542545
## 9365             0 debt_consolidation   0.1774      172.91       9.546813
##        dti fico days.with.cr.line revol.bal revol.util inq.last.6mths
## 782   7.72  677         1680.0000         0         NA              1
## 804  15.84  682         1829.9583         0         NA              0
## 840   8.75  682         2490.0000         0         NA              1
## 858  12.38  662         1199.9583         0         NA              1
## 1214  4.26  697         4140.9583         0         NA              0
## 1281 10.80  667         5249.9583         0         NA              0
## 1554  0.00  687         2940.0417         0         NA              1
## 1783  0.00  687         1238.0417         0         NA              0
## 1928  6.00  722         5280.0417         0         NA              1
## 2009 12.01  752         3749.0417         0         NA              0
## 2194 14.95  677         6990.0000     49238         NA              0
## 2529  5.07  712         2220.9583         0         NA              2
## 2533 11.59  767         5522.9583         0         NA              1
## 2706 20.19  752         4621.0000         0         NA              2
## 3351 15.84  677         2070.0000         0         NA              1
## 3980 17.68  722         2100.0000         0         NA              3
## 4959  3.58  662         5130.0000         0         NA              1
## 5522 13.75  702         2580.0000         0         NA              0
## 6274 16.57  742         5400.0000         0         NA              0
## 6319  0.18  712         1710.0417         0         NA              1
## 6639  9.45  732         2643.0417         0         NA              0
## 7014  9.17  732         1892.0000         0         NA              2
## 7702  2.55  667         3509.9583         0         NA              0
## 7703 11.39  662         5400.0000         0         NA              3
## 7726  8.81  772                NA         0         NA             NA
## 7727  4.00  742                NA         0         NA             NA
## 7728  4.00  742                NA         0         NA             NA
## 7729 10.00  712                NA         0         NA             NA
## 7730 10.00  707                NA         0         NA             NA
## 7732 10.00  702                NA         0         NA             NA
## 7733 10.00  697                NA         0         NA             NA
## 7734 10.00  692                NA         0         NA             NA
## 7735 10.00  687                NA         0         NA             NA
## 7736 10.00  687                NA         0         NA             NA
## 7737 10.00  712                NA         0         NA             NA
## 7738 10.00  687                NA         0         NA             NA
## 7739 10.00  707                NA         0         NA             NA
## 7740 10.00  712                NA         0         NA             NA
## 7741 10.00  717                NA         0         NA             NA
## 7742  1.00  802                NA         0         NA             NA
## 7743  1.00  802                NA         0         NA             NA
## 7744  1.00  802                NA         0         NA             NA
## 7745 19.50  742                NA         0         NA             NA
## 7746  1.10  772                NA         0         NA             NA
## 7747  6.46  722                NA         0         NA             NA
## 7748  0.39  797                NA         0         NA             NA
## 7749 10.36  787                NA         0         NA             NA
## 7750  5.38  757                NA         0         NA             NA
## 7751 16.27  642                NA         0         NA             NA
## 7752  8.61  672                NA         0         NA             NA
## 7753  3.51  672                NA         0         NA             NA
## 7754  3.27  707                NA         0         NA             NA
## 7759 10.00  707                NA         0         NA             NA
## 7760 15.00  662          178.9583         0         NA              1
## 7799  8.57  682          334.0000         0         NA              2
## 7812 17.31  642         3478.9583         0         NA              1
## 8555 18.89  697         3812.0000         0         NA              0
## 8571 22.72  662         2010.0000         0         NA              6
## 8677  7.29  707          990.0000         0         NA              0
## 8760 10.18  742         2399.9583         0         NA              3
## 9074  5.58  737         7890.0000    290291         NA              1
## 9365  9.51  692         1530.0417         0         NA              4
##      delinq.2yrs pub.rec not.fully.paid
## 782            0       0              1
## 804            0       0              0
## 840            1       0              1
## 858            0       0              0
## 1214           0       1              0
## 1281           0       0              1
## 1554           0       0              0
## 1783           0       0              0
## 1928           0       0              0
## 2009           0       0              0
## 2194           4       0              0
## 2529           0       0              0
## 2533           0       0              1
## 2706           0       0              0
## 3351           0       0              0
## 3980           0       0              0
## 4959           0       0              0
## 5522           0       0              1
## 6274           0       0              0
## 6319           0       0              0
## 6639           0       0              0
## 7014           0       0              0
## 7702           0       0              1
## 7703           2       0              1
## 7726          NA      NA              0
## 7727          NA      NA              0
## 7728          NA      NA              0
## 7729          NA      NA              0
## 7730          NA      NA              0
## 7732          NA      NA              0
## 7733          NA      NA              0
## 7734          NA      NA              0
## 7735          NA      NA              0
## 7736          NA      NA              0
## 7737          NA      NA              1
## 7738          NA      NA              0
## 7739          NA      NA              1
## 7740          NA      NA              0
## 7741          NA      NA              0
## 7742          NA      NA              0
## 7743          NA      NA              0
## 7744          NA      NA              0
## 7745          NA      NA              0
## 7746          NA      NA              0
## 7747          NA      NA              1
## 7748          NA      NA              0
## 7749          NA      NA              0
## 7750          NA      NA              0
## 7751          NA      NA              0
## 7752          NA      NA              0
## 7753          NA      NA              0
## 7754          NA      NA              0
## 7759          NA      NA              0
## 7760           0       0              0
## 7799           0       0              1
## 7812           0       0              1
## 8555           0       0              0
## 8571           0       0              0
## 8677           0       0              0
## 8760           0       0              0
## 9074           0       0              0
## 9365           0       0              0

We see that only 62 of the 9,578 loans have missing data; removing this small number of observations would not lead to overfitting. From table(missing$not.fully.paid), we see that 12 of the 62 loans with missing data were not fully paid, or 19.35%. This rate is similar to the 16.01% rate across all loans, so removing these observations would not meaningfully bias the remaining sample. However, to predict risk for loans with missing data we need to fill in the missing values instead of removing the observations.
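
The counts quoted above can be verified directly from the missing data frame:

nrow(missing)                    # 62 loans have at least one missing value
table(missing$not.fully.paid)    # 50 fully paid, 12 not fully paid
12/62                            # about 0.1935, similar to the overall rate of 0.1601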

Problem 1.4 - Preparing the Dataset

For the rest of this problem, we’ll be using a revised version of the dataset that has the missing values filled in with multiple imputation (which was discussed in the Recitation of this Unit). To ensure everybody has the same data frame going forward, you can either run the commands below in your R console (if you haven’t already, run the command install.packages(“mice”) first), or you can download and load into R the dataset we created after running the imputation: loans_imputed.csv.

IMPORTANT NOTE: On certain operating systems, the imputation results are not the same even if you set the random seed. If you decide to do the imputation yourself, please still read the provided imputed dataset (loans_imputed.csv) into R and compare your results, using the summary function. If the results are different, please make sure to use the data in loans_imputed.csv for the rest of the problem.

library(mice)

set.seed(144)

vars.for.imputation = setdiff(names(loans), "not.fully.paid")

imputed = complete(mice(loans[vars.for.imputation]))
## 
##  iter imp variable
##   1   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   1   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   1   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   1   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   1   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   2   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   2   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   2   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   2   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   2   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   3   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   3   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   3   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   3   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   3   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   4   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   4   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   4   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   4   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   4   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   5   1  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   5   2  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   5   3  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   5   4  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
##   5   5  log.annual.inc  days.with.cr.line  revol.util  inq.last.6mths  delinq.2yrs  pub.rec
loans[vars.for.imputation] = imputed

summary(loans)
##  credit.policy                 purpose        int.rate     
##  Min.   :0.000   all_other         :2331   Min.   :0.0600  
##  1st Qu.:1.000   credit_card       :1262   1st Qu.:0.1039  
##  Median :1.000   debt_consolidation:3957   Median :0.1221  
##  Mean   :0.805   educational       : 343   Mean   :0.1226  
##  3rd Qu.:1.000   home_improvement  : 629   3rd Qu.:0.1407  
##  Max.   :1.000   major_purchase    : 437   Max.   :0.2164  
##                  small_business    : 619                   
##   installment     log.annual.inc        dti              fico      
##  Min.   : 15.67   Min.   : 7.548   Min.   : 0.000   Min.   :612.0  
##  1st Qu.:163.77   1st Qu.:10.558   1st Qu.: 7.213   1st Qu.:682.0  
##  Median :268.95   Median :10.928   Median :12.665   Median :707.0  
##  Mean   :319.09   Mean   :10.932   Mean   :12.607   Mean   :710.8  
##  3rd Qu.:432.76   3rd Qu.:11.290   3rd Qu.:17.950   3rd Qu.:737.0  
##  Max.   :940.14   Max.   :14.528   Max.   :29.960   Max.   :827.0  
##                                                                    
##  days.with.cr.line   revol.bal         revol.util     inq.last.6mths  
##  Min.   :  179     Min.   :      0   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.: 2820     1st Qu.:   3187   1st Qu.: 22.60   1st Qu.: 0.000  
##  Median : 4140     Median :   8596   Median : 46.25   Median : 1.000  
##  Mean   : 4560     Mean   :  16914   Mean   : 46.80   Mean   : 1.582  
##  3rd Qu.: 5730     3rd Qu.:  18250   3rd Qu.: 70.90   3rd Qu.: 2.000  
##  Max.   :17640     Max.   :1207359   Max.   :119.00   Max.   :33.000  
##                                                                       
##   delinq.2yrs         pub.rec        not.fully.paid  
##  Min.   : 0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median : 0.0000   Median :0.00000   Median :0.0000  
##  Mean   : 0.1635   Mean   :0.06223   Mean   :0.1601  
##  3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :13.0000   Max.   :5.00000   Max.   :1.0000  
## 
# As instructed in the note above, load the provided imputed dataset for the rest of the problem
loans = read.csv("loans_imputed.csv")

Explanation

Imputation predicts missing variable values for a given observation using the variable values that are reported. We called the imputation on a data frame with the dependent variable not.fully.paid removed, so we predicted the missing values using only other independent variables.
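
As a quick sanity check (not part of the graded question), you can confirm that no missing values remain after the imputation or after loading loans_imputed.csv:

# Should report 0 missing values for every variable
colSums(is.na(loans))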

Problem 2.1 - Prediction Models

Now that we have prepared the dataset, we need to split it into a training and testing set. To ensure everybody obtains the same split, set the random seed to 144 (even though you already did so earlier in the problem) and use the sample.split function to select 70% of the observations for the training set (the dependent variable for sample.split is not.fully.paid). Name the data frames train and test.

Now, use logistic regression trained on the training set to predict the dependent variable not.fully.paid using all the independent variables.

Which independent variables are significant in our model? (Significant variables have at least one star, or a Pr(>|z|) value less than 0.05.) Select all that apply.

set.seed(144)
spl = sample.split(loans$not.fully.paid, SplitRatio = 0.7)
train = subset(loans, spl==TRUE)
test = subset(loans, spl==FALSE)
loansLog = glm(not.fully.paid ~ ., data=train, family="binomial")
summary(loansLog)
## 
## Call:
## glm(formula = not.fully.paid ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2049  -0.6205  -0.4951  -0.3606   2.6397  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                9.187e+00  1.554e+00   5.910 3.42e-09 ***
## credit.policy             -3.368e-01  1.011e-01  -3.332 0.000861 ***
## purposecredit_card        -6.141e-01  1.344e-01  -4.568 4.93e-06 ***
## purposedebt_consolidation -3.212e-01  9.183e-02  -3.498 0.000469 ***
## purposeeducational         1.347e-01  1.753e-01   0.768 0.442201    
## purposehome_improvement    1.727e-01  1.480e-01   1.167 0.243135    
## purposemajor_purchase     -4.830e-01  2.009e-01  -2.404 0.016203 *  
## purposesmall_business      4.120e-01  1.419e-01   2.905 0.003678 ** 
## int.rate                   6.110e-01  2.085e+00   0.293 0.769446    
## installment                1.275e-03  2.092e-04   6.093 1.11e-09 ***
## log.annual.inc            -4.337e-01  7.148e-02  -6.067 1.30e-09 ***
## dti                        4.638e-03  5.502e-03   0.843 0.399288    
## fico                      -9.317e-03  1.710e-03  -5.448 5.08e-08 ***
## days.with.cr.line          2.371e-06  1.588e-05   0.149 0.881343    
## revol.bal                  3.085e-06  1.168e-06   2.641 0.008273 ** 
## revol.util                 1.839e-03  1.535e-03   1.199 0.230722    
## inq.last.6mths             8.437e-02  1.600e-02   5.275 1.33e-07 ***
## delinq.2yrs               -8.320e-02  6.561e-02  -1.268 0.204762    
## pub.rec                    3.300e-01  1.139e-01   2.898 0.003756 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5896.6  on 6704  degrees of freedom
## Residual deviance: 5485.2  on 6686  degrees of freedom
## AIC: 5523.2
## 
## Number of Fisher Scoring iterations: 5
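
Rather than reading the significance stars off the summary, the coefficients with Pr(>|z|) below 0.05 can be extracted programmatically. This is just a convenience sketch; note that names such as purposecredit_card refer to individual dummy variables created from the factor purpose.

coefs = summary(loansLog)$coefficients

# Coefficient names (excluding the intercept) significant at the 0.05 level
setdiff(rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05], "(Intercept)")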

Problem 2.2 - Prediction Models

Consider two loan applications, which are identical other than the fact that the borrower in Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710.

Let Logit(A) be the log odds of loan A not being paid back in full, according to our logistic regression model, and define Logit(B) similarly for loan B. What is the value of Logit(A) - Logit(B)?

coef_fico <- loansLog$coefficients[["fico"]]

(700-710)*coef_fico
## [1] 0.0931679

Explanation

Because Application A is identical to Application B other than having a FICO score 10 lower, its predicted log odds differ by -0.009317 * -10 = 0.09317 from the predicted log odds of Application B.

Now, let O(A) be the odds of loan A not being paid back in full, according to our logistic regression model, and define O(B) similarly for loan B. What is the value of O(A)/O(B)? (HINT: Use the mathematical rule that exp(A + B + C) = exp(A) * exp(B) * exp(C). Also, remember that exp() is the exponential function in R.)

exp((700-710)*coef_fico)
## [1] 1.097646

Explanation

Using the answer from the previous question, the predicted odds of loan A not being paid back in full are exp(0.09317) = 1.0976 times larger than the predicted odds for loan B. Intuitively, it makes sense that loan A should have higher odds of non-payment than loan B, since the borrower has a worse credit score.

Problem 2.3 - Prediction Models

Predict the probability of the test set loans not being paid back in full (remember type=“response” for the predict function). Store these predicted probabilities in a variable named predicted.risk and add it to your test set (we will use this variable in later parts of the problem). Compute the confusion matrix using a threshold of 0.5.

What is the accuracy of the logistic regression model? Input the accuracy as a number between 0 and 1.

summary(test)
##  credit.policy                  purpose        int.rate     
##  Min.   :0.0000   all_other         : 688   Min.   :0.0600  
##  1st Qu.:1.0000   credit_card       : 411   1st Qu.:0.1028  
##  Median :1.0000   debt_consolidation:1206   Median :0.1221  
##  Mean   :0.8047   educational       :  93   Mean   :0.1225  
##  3rd Qu.:1.0000   home_improvement  : 186   3rd Qu.:0.1393  
##  Max.   :1.0000   major_purchase    : 105   Max.   :0.2121  
##                   small_business    : 184                   
##   installment     log.annual.inc        dti             fico      
##  Min.   : 15.67   Min.   : 8.102   Min.   : 0.00   Min.   :612.0  
##  1st Qu.:163.57   1st Qu.:10.560   1st Qu.: 7.16   1st Qu.:682.0  
##  Median :267.74   Median :10.933   Median :12.85   Median :707.0  
##  Mean   :316.99   Mean   :10.928   Mean   :12.75   Mean   :710.8  
##  3rd Qu.:421.89   3rd Qu.:11.290   3rd Qu.:18.30   3rd Qu.:737.0  
##  Max.   :926.83   Max.   :13.459   Max.   :29.96   Max.   :822.0  
##                                                                   
##  days.with.cr.line   revol.bal         revol.util     inq.last.6mths  
##  Min.   :  179     Min.   :      0   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.: 2795     1st Qu.:   3362   1st Qu.: 23.40   1st Qu.: 0.000  
##  Median : 4140     Median :   8712   Median : 46.90   Median : 1.000  
##  Mean   : 4494     Mean   :  17198   Mean   : 47.02   Mean   : 1.576  
##  3rd Qu.: 5670     3rd Qu.:  18728   3rd Qu.: 70.40   3rd Qu.: 2.000  
##  Max.   :17640     Max.   :1207359   Max.   :108.80   Max.   :24.000  
##                                                                       
##   delinq.2yrs         pub.rec        not.fully.paid  
##  Min.   : 0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median : 0.0000   Median :0.00000   Median :0.0000  
##  Mean   : 0.1605   Mean   :0.05743   Mean   :0.1601  
##  3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :13.0000   Max.   :3.00000   Max.   :1.0000  
## 
test$predicted.risk <- predict(loansLog, newdata = test, type = "response")

table(test$not.fully.paid, as.numeric(test$predicted.risk >= 0.5))
##    
##        0    1
##   0 2400   13
##   1  457    3
# 2403 predictions are correct (accuracy 2403/2873=0.8364), 
# while 2413 predictions would be correct in the baseline model of guessing every loan would be paid back in full (accuracy 2413/2873=0.8399).
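
The accuracy figures in the comments can be computed directly from the confusion matrix and the test set:

confMat = table(test$not.fully.paid, as.numeric(test$predicted.risk >= 0.5))

sum(diag(confMat)) / nrow(test)               # model accuracy, (2400 + 3)/2873 = 0.8364
sum(test$not.fully.paid == 0) / nrow(test)    # baseline accuracy, 2413/2873 = 0.8399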

Problem 2.4 - Prediction Models

Use the ROCR package to compute the test set AUC.

library(ROCR)
pred = prediction(test$predicted.risk, test$not.fully.paid)
as.numeric(performance(pred, "auc")@y.values)
## [1] 0.6720995

At a threshold of 0.5, the model’s accuracy (0.8364) is slightly worse than the baseline accuracy (0.8399), so the model adds little by this measure. But despite the poor accuracy, we will see later how an investor can still leverage this logistic regression model to make profitable investments.

Problem 3.1 - A “Smart Baseline”

In the previous problem, we built a logistic regression model that has an AUC significantly higher than the AUC of 0.5 that would be obtained by randomly ordering observations.

However, LendingClub.com assigns the interest rate to a loan based on its estimate of that loan’s risk. This variable, int.rate, is an independent variable in our dataset. In this part, we will investigate using the loan’s interest rate as a “smart baseline” to order the loans according to risk.

Using the training set, build a bivariate logistic regression model (aka a logistic regression model with a single independent variable) that predicts the dependent variable not.fully.paid using only the variable int.rate.

The variable int.rate is highly significant in the bivariate model, but it is not significant at the 0.05 level in the model trained with all the independent variables. What is the most likely explanation for this difference?

Answer

int.rate is correlated with other risk-related variables, and therefore does not incrementally improve the model when those other variables are included.

Explanation

To train the bivariate model, run the following command:

bivariate = glm(not.fully.paid ~ int.rate, data=train, family="binomial")
summary(bivariate)
## 
## Call:
## glm(formula = not.fully.paid ~ int.rate, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0547  -0.6271  -0.5442  -0.4361   2.2914  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.6726     0.1688  -21.76   <2e-16 ***
## int.rate     15.9214     1.2702   12.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5896.6  on 6704  degrees of freedom
## Residual deviance: 5734.8  on 6703  degrees of freedom
## AIC: 5738.8
## 
## Number of Fisher Scoring iterations: 4

Decreased significance between a bivariate and multivariate model is typically due to correlation. From cor(train$int.rate, train$fico), we can see that the interest rate is moderately well correlated with a borrower’s credit score. The training/testing set split rarely has a large effect on the significance of variables (this can be verified in this case by trying out a few other training/testing splits), and the two models were trained on the same observations.
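
You can check this correlation in your own session with:

# Correlation between the interest rate and the borrower's credit score
cor(train$int.rate, train$fico)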

Problem 3.2 - A “Smart Baseline”

Make test set predictions for the bivariate model. What is the highest predicted probability of a loan not being paid in full on the testing set?

pred.bivariate = predict(bivariate, newdata=test, type="response")
summary(pred.bivariate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06196 0.11549 0.15077 0.15963 0.18928 0.42662

According to the summary function, the maximum predicted probability of the loan not being paid back is 0.4266, which means no loans would be flagged at a logistic regression cutoff of 0.5.
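
This is easy to verify directly:

# Number of test-set loans the bivariate model flags at a cutoff of 0.5
sum(pred.bivariate >= 0.5)    # 0, since the maximum predicted probability is 0.4266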

Problem 3.3 - A “Smart Baseline”

What is the test set AUC of the bivariate model?

prediction.bivariate = prediction(pred.bivariate, test$not.fully.paid)
as.numeric(performance(prediction.bivariate, "auc")@y.values)
## [1] 0.6239081

Problem 4.1 - Computing the Profitability of an Investment

While thus far we have predicted if a loan will be paid back or not, an investor needs to identify loans that are expected to be profitable. If the loan is paid back in full, then the investor makes interest on the loan. However, if the loan is not paid back, the investor loses the money invested. Therefore, the investor should seek loans that best balance this risk and reward.

To compute interest revenue, consider a $c investment in a loan that has an annual interest rate r over a period of t years. Using continuous compounding of interest, this investment pays back c * exp(rt) dollars by the end of the t years, where exp(rt) is e raised to the r*t power.

How much does a $10 investment with an annual interest rate of 6% pay back after 3 years, using continuous compounding of interest? Hint: remember to convert the percentage to a proportion before doing the math. Enter the number of dollars, without the $ sign.

c=10
r=0.06
t=3

# Using the formula above, the final value is 
c * exp(r*t)
## [1] 11.97217

Problem 4.2 - Computing the Profitability of an Investment

While the investment has value c * exp(rt) dollars after collecting interest, the investor had to pay $c for the investment. What is the profit to the investor if the investment is paid back in full?

Answer

c * exp(rt) - c
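
For instance, reusing the numbers from Problem 4.1 (and assuming c, r and t are still defined from that chunk):

c * exp(r*t) - c    # about 1.97 dollars of profit on a $10 investment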

Problem 4.3 - Computing the Profitability of an Investment

Now, consider the case where the investor made a $c investment, but it was not paid back in full. Assume, conservatively, that no money was received from the borrower (often a lender will receive some but not all of the value of the loan, making this a pessimistic assumption of how much is received). What is the profit to the investor in this scenario?

Answer

-c

Problem 5.1 - A Simple Investment Strategy

In the previous subproblem, we concluded that an investor who invested c dollars in a loan with interest rate r for t years makes c * (exp(rt) - 1) dollars of profit if the loan is paid back in full and -c dollars of profit if the loan is not paid back in full (pessimistically).

In order to evaluate the quality of an investment strategy, we need to compute this profit for each loan in the test set. For this variable, we will assume a $1 investment (aka c=1). To create the variable, we first assign the profit for a fully paid loan, exp(rt) - 1, to every observation, and we then replace this value with -1 in the cases where the loan was not paid in full. All the loans in our dataset are 3-year loans, meaning t=3 in our calculations. Enter the following commands in your R console to create this new variable:

test$profit = exp(test$int.rate*3) - 1

test$profit[test$not.fully.paid == 1] = -1

summary(test$profit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.0000  0.2858  0.4111  0.2094  0.4980  0.8895

What is the maximum profit of a $10 investment in any loan in the testing set (do not include the $ sign in your answer)?

Explanation

From summary(test$profit), we see the maximum profit for a $1 investment in any loan is $0.8895. Therefore, the maximum profit of a $10 investment is 10 times as large, or $8.895.
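
Equivalently, in R:

10 * max(test$profit)    # maximum profit of a $10 investment, about 8.895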

Problem 6.1 - An Investment Strategy Based on Risk

A simple investment strategy of equally investing in all the loans would yield a profit of $20.94 for a $100 investment. But this simple strategy does not leverage the prediction model we built earlier in this problem. As stated earlier, investors seek loans that balance reward with risk, in that they simultaneously have high interest rates and a low risk of not being paid back.

To meet this objective, we will analyze an investment strategy in which the investor only purchases loans with a high interest rate (a rate of at least 15%), but amongst these loans selects the ones with the lowest predicted risk of not being paid back in full. We will model an investor who invests $1 in each of the most promising 100 loans.

First, use the subset() function to build a data frame called highInterest consisting of the test set loans with an interest rate of at least 15%.

What is the average profit of a $1 investment in one of these high-interest loans?

highInterest = subset(test, int.rate >= 0.15)

summary(highInterest$profit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.0000 -1.0000  0.5992  0.2251  0.6380  0.8895

What proportion of the high-interest loans were not paid back in full?

table(highInterest$not.fully.paid)
## 
##   0   1 
## 327 110
110/437
## [1] 0.2517162
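
Two equivalent cross-checks, assuming the objects above are still in memory:

100 * mean(test$profit)              # profit of the simple $100 strategy, about 20.94
mean(highInterest$not.fully.paid)    # proportion not paid back in full, 110/437 = 0.2517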

Problem 6.2 - An Investment Strategy Based on Risk

Next, we will determine the 100th smallest predicted probability of not paying in full by sorting the predicted risks in increasing order and selecting the 100th element of this sorted list. Find the highest predicted risk that we will include by typing the following command into your R console:

cutoff = sort(highInterest$predicted.risk, decreasing=FALSE)[100]

Use the subset() function to build a data frame called selectedLoans consisting of the high-interest loans with predicted risk not exceeding the cutoff we just computed. Check to make sure you have selected 100 loans for investment.

What is the profit of the investor, who invested $1 in each of these 100 loans?

cutoff = sort(highInterest$predicted.risk, decreasing=FALSE)[100]
selectedLoans = subset(highInterest, predicted.risk <= cutoff)

sum(selectedLoans$profit)
## [1] 31.27825

How many of 100 selected loans were not paid back in full?

table(selectedLoans$not.fully.paid)
## 
##  0  1 
## 81 19

We have now seen how analytics can be used to select a subset of the high-interest loans that were paid back at only a slightly lower rate than average, resulting in a significant increase in the profit from our investor’s $100 investment. Although the logistic regression models developed in this problem did not have large AUC values, we see that they still provided the edge needed to improve the profitability of an investment portfolio.

We conclude with a note of warning. Throughout this analysis we assume that the loans we invest in will perform in the same way as the loans we used to train our model, even though our training set covers a relatively short period of time. If there is an economic shock like a large financial downturn, default rates might be significantly higher than those observed in the training set and we might end up losing money instead of profiting. Investors must pay careful attention to such risk when making investment decisions.