Week 8: Class 13/04/2023

More deeply in classification models

Using the data “Default”

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.2.3

attach(Default)

mod1<-glm(default~balance + student + income, family = "binomial", data = Default)

summary(mod1)

## 
## Call:
## glm(formula = default ~ balance + student + income, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## income       3.033e-06  8.203e-06   0.370  0.71152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8

Just remind that the logistic function, adapted to this model, reads as follows:

\[ \mathbb{P}(default=1|balance,student,income)=\frac{1}{1+e^{-(\beta_0+\beta_1balance+\beta_2student+\beta_3income)}} \]

Interpret each coefficient (hint: brush up on what you should know about logistic regression)

Remember that you can’t interpret the coefficients as if they were the slopes of a linear regression (you can try to compute the partial derivative of the function with respect to each variable, and you’ll see that the slope will depend on the values of all the variables, so it is not unique and for that reason is harder to interpret). In general, it is easier to have an idea of the coefficient’s meaning, by paying attention to their sign. Therefore, you can say “it is the positive/negative weight of the associated variable on the probability of the target variable to get value 1”. For instance,

Balance the variable balance has a positive weight on the probability of default to be equal to 1.

Student A student has a negative weight on the probability of default to be equal to 1.

Income the variable income has a positive weight on the probability of default to be equal to 1.

Predict, by hand, the probability of default of an “no student” individual with a balance of 2000 and income of 20000

\[ \mathbb{P}(default=1|balance,student,income)=\frac{1}{1+e^{-(-0.11+0.0057\times 2000+0.0000030\times 20000)}}\approx0.66 \]

Do the same prediction by using an [R] function

library(ISLR)
attach(Default)

## The following objects are masked from Default (pos = 3):
## 
##     balance, default, income, student

mod1<-glm(default~balance + student + income, family = "binomial", data = Default)
datanew<-data.frame("student"="No","balance"=2000,"income"=20000)
predict(mod1,newdata=datanew,type="response")

##         1 
## 0.6603007

Build three confusion matrices and compute its accuracy under the different threshold values:

-0.3 -0.5 -0.8

set.seed(123)
train_ind <- sample(seq_len(nrow(Default)), size = floor(0.75 * nrow(Default)))
train <- Default[train_ind,]
test <- Default[-train_ind,]

mod1<-glm(default~balance + student + income, family = "binomial", data = train)


#case 1

predicted_prob<-predict(mod1,newdata=test,type="response")
prediction <- as.integer(predicted_prob > 0.3)
confusion_mat_03 <- addmargins(table(test$default, prediction))
# Labeling
names(dimnames(confusion_mat_03)) <- c("True status", "Prediction")
colnames(confusion_mat_03) <- c("Fail", "Success", "Total")
rownames(confusion_mat_03) <- c("Fail", "Success", "Total")



#case 2

predicted_prob<-predict(mod1,newdata=test,type="response")
prediction <- as.integer(predicted_prob > 0.5)
confusion_mat_05 <- addmargins(table(test$default, prediction))
# Labeling
names(dimnames(confusion_mat_05)) <- c("True status", "Prediction")
colnames(confusion_mat_05) <- c("Fail", "Success", "Total")
rownames(confusion_mat_05) <- c("Fail", "Success", "Total")



#case 3

predicted_prob<-predict(mod1,newdata=test,type="response")
prediction <- as.integer(predicted_prob > 0.8)
confusion_mat_08 <- addmargins(table(test$default, prediction))
# Labeling
names(dimnames(confusion_mat_08)) <- c("True status", "Prediction")
colnames(confusion_mat_08) <- c("Fail", "Success", "Total")
rownames(confusion_mat_08) <- c("Fail", "Success", "Total")
confusion_mat_03

##            Prediction
## True status Fail Success Total
##     Fail    2382      36  2418
##     Success   39      43    82
##     Total   2421      79  2500

confusion_mat_05

##            Prediction
## True status Fail Success Total
##     Fail    2409       9  2418
##     Success   58      24    82
##     Total   2467      33  2500

confusion_mat_08

##            Prediction
## True status Fail Success Total
##     Fail    2417       1  2418
##     Success   76       6    82
##     Total   2493       7  2500

Let us define to interesting measures:

True Positive Rate (specifity)

\[ TPR=\frac{TP}{FN+TP} \]

True Negative Rate (sensitivity)

\[ TNR=\frac{TN}{TN+FP} \]

False Positive Rate

\[ FPR=\frac{FP}{TN+FP} \]

Compute the values of TPR and TNR for each threshold value

Case 1: threshold=0.3

True Positive Rate

\[ TPR=\frac{86}{89+86}=0.49 \]

True Negative Rate

\[ TNR=\frac{4741}{4741+84}=0.98 \]

False Positive Rate

\[ FPR=\frac{89}{86+89}=0.50 \]

Case 2: threshold=0.5

True Positive Rate

\[ TPR=\frac{51}{51+124}=0.29 \]

True Negative Rate

\[ TNR=\frac{4796}{4796+29}=0.99 \]

False Positive Rate

\[ FPR=\frac{124}{51+124}=0.70 \]

Case 3: threshold=0.8

True Positive Rate

\[ TPR=\frac{16}{16+159}=0.09 \]

True Negative Rate

\[ TNR=\frac{ 4822}{ 4822+3}=0.9999 \]

False Positive Rate

\[ FPR=\frac{159}{16+159}=0.90 \]

Draw a curve, with the previous results, where in the x-axis you have the false positive rate and in the y-axis the true positive rate. How are related both quantities?

This curve is also interesting, because it help us to choose an adequate threshold value to classify individuals (instead of the “typical” 0.5).

library(ROCR)

## Warning: package 'ROCR' was built under R version 4.2.3

set.seed(123)
train_ind <- sample(seq_len(nrow(Default)), size = floor(0.75 * nrow(Default)))
train <- Default[train_ind,]
test <- Default[-train_ind,]

mod1<-glm(default~balance + student + income, family = "binomial", data = train)

pred_test <- predict(mod1,test,type="response")
ROCR_pred_test <- prediction(pred_test,test$default)
ROCR_perf_test <- performance(ROCR_pred_test,'tpr','fpr')
plot(ROCR_perf_test,colorize=TRUE,print.cutoffs.at=seq(0.1,by=0.1))

How to interpret it?

The ROC curve shows the trade-off between TPR and FPR. Classifiers that give curves closer to the top-left corner indicate a better performance. As a baseline, a random classifier is expected to give points lying along the diagonal (FPR = TPR). The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

Note that the ROC does not depend on the class distribution. This makes it useful for evaluating classifiers predicting rare events such as diseases or disasters. In contrast, evaluating performance using accuracy (TP + TN)/(TP + TN + FN + FP) would favor classifiers that always predict a negative outcome for rare events.

A nice threshold point should be chosen trying to have a small FALSE POSITIVE RATE and a high TRUE POSITIVE RATE In this case, it seems plausible to choose as a cutoff value 0.1. As you can guess, so different to choose 0.5 as a threshold as everybody does!

Predictive 8

Jorge Herrera

2023-03-14

Week 8: Class 13/04/2023

More deeply in classification models