Using the data “Default”
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.2.3
attach(Default)
mod1<-glm(default~balance + student + income, family = "binomial", data = Default)
summary(mod1)
##
## Call:
## glm(formula = default ~ balance + student + income, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## income 3.033e-06 8.203e-06 0.370 0.71152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
Just remind that the logistic function, adapted to this model, reads as follows:
\[ \mathbb{P}(default=1|balance,student,income)=\frac{1}{1+e^{-(\beta_0+\beta_1balance+\beta_2student+\beta_3income)}} \]
- Interpret each coefficient (hint: brush up on what you should know about logistic regression)
Remember that you can’t interpret the coefficients as if they were the slopes of a linear regression (you can try to compute the partial derivative of the function with respect to each variable, and you’ll see that the slope will depend on the values of all the variables, so it is not unique and for that reason is harder to interpret). In general, it is easier to have an idea of the coefficient’s meaning, by paying attention to their sign. Therefore, you can say “it is the positive/negative weight of the associated variable on the probability of the target variable to get value 1”. For instance,
Balance the variable balance has a positive weight on the probability of default to be equal to 1.
Student A student has a negative weight on the probability of default to be equal to 1.
Income the variable income has a positive weight on the probability of default to be equal to 1.
- Predict, by hand, the probability of default of an “no student” individual with a balance of 2000 and income of 20000
\[ \mathbb{P}(default=1|balance,student,income)=\frac{1}{1+e^{-(-0.11+0.0057\times 2000+0.0000030\times 20000)}}\approx0.66 \]
- Do the same prediction by using an [R] function
library(ISLR)
attach(Default)
## The following objects are masked from Default (pos = 3):
##
## balance, default, income, student
mod1<-glm(default~balance + student + income, family = "binomial", data = Default)
datanew<-data.frame("student"="No","balance"=2000,"income"=20000)
predict(mod1,newdata=datanew,type="response")
## 1
## 0.6603007
-0.3 -0.5 -0.8
set.seed(123)
train_ind <- sample(seq_len(nrow(Default)), size = floor(0.75 * nrow(Default)))
train <- Default[train_ind,]
test <- Default[-train_ind,]
mod1<-glm(default~balance + student + income, family = "binomial", data = train)
#case 1
predicted_prob<-predict(mod1,newdata=test,type="response")
prediction <- as.integer(predicted_prob > 0.3)
confusion_mat_03 <- addmargins(table(test$default, prediction))
# Labeling
names(dimnames(confusion_mat_03)) <- c("True status", "Prediction")
colnames(confusion_mat_03) <- c("Fail", "Success", "Total")
rownames(confusion_mat_03) <- c("Fail", "Success", "Total")
#case 2
predicted_prob<-predict(mod1,newdata=test,type="response")
prediction <- as.integer(predicted_prob > 0.5)
confusion_mat_05 <- addmargins(table(test$default, prediction))
# Labeling
names(dimnames(confusion_mat_05)) <- c("True status", "Prediction")
colnames(confusion_mat_05) <- c("Fail", "Success", "Total")
rownames(confusion_mat_05) <- c("Fail", "Success", "Total")
#case 3
predicted_prob<-predict(mod1,newdata=test,type="response")
prediction <- as.integer(predicted_prob > 0.8)
confusion_mat_08 <- addmargins(table(test$default, prediction))
# Labeling
names(dimnames(confusion_mat_08)) <- c("True status", "Prediction")
colnames(confusion_mat_08) <- c("Fail", "Success", "Total")
rownames(confusion_mat_08) <- c("Fail", "Success", "Total")
confusion_mat_03
## Prediction
## True status Fail Success Total
## Fail 2382 36 2418
## Success 39 43 82
## Total 2421 79 2500
confusion_mat_05
## Prediction
## True status Fail Success Total
## Fail 2409 9 2418
## Success 58 24 82
## Total 2467 33 2500
confusion_mat_08
## Prediction
## True status Fail Success Total
## Fail 2417 1 2418
## Success 76 6 82
## Total 2493 7 2500
Let us define to interesting measures:
- True Positive Rate (specifity)
\[ TPR=\frac{TP}{FN+TP} \]
- True Negative Rate (sensitivity)
\[ TNR=\frac{TN}{TN+FP} \]
- False Positive Rate
\[ FPR=\frac{FP}{TN+FP} \]
- Compute the values of TPR and TNR for each threshold value
Case 1: threshold=0.3
- True Positive Rate
\[ TPR=\frac{86}{89+86}=0.49 \]
- True Negative Rate
\[ TNR=\frac{4741}{4741+84}=0.98 \]
- False Positive Rate
\[ FPR=\frac{89}{86+89}=0.50 \]
Case 2: threshold=0.5
- True Positive Rate
\[ TPR=\frac{51}{51+124}=0.29 \]
- True Negative Rate
\[ TNR=\frac{4796}{4796+29}=0.99 \]
- False Positive Rate
\[ FPR=\frac{124}{51+124}=0.70 \]
Case 3: threshold=0.8
- True Positive Rate
\[ TPR=\frac{16}{16+159}=0.09 \]
- True Negative Rate
\[ TNR=\frac{ 4822}{ 4822+3}=0.9999 \]
- False Positive Rate
\[ FPR=\frac{159}{16+159}=0.90 \]
Draw a curve, with the previous results, where in the x-axis you have the false positive rate and in the y-axis the true positive rate. How are related both quantities?
This curve is also interesting, because it help us to choose an adequate threshold value to classify individuals (instead of the “typical” 0.5).
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.2.3
set.seed(123)
train_ind <- sample(seq_len(nrow(Default)), size = floor(0.75 * nrow(Default)))
train <- Default[train_ind,]
test <- Default[-train_ind,]
mod1<-glm(default~balance + student + income, family = "binomial", data = train)
pred_test <- predict(mod1,test,type="response")
ROCR_pred_test <- prediction(pred_test,test$default)
ROCR_perf_test <- performance(ROCR_pred_test,'tpr','fpr')
plot(ROCR_perf_test,colorize=TRUE,print.cutoffs.at=seq(0.1,by=0.1))
- How to interpret it?
The ROC curve shows the trade-off between TPR and FPR. Classifiers that give curves closer to the top-left corner indicate a better performance. As a baseline, a random classifier is expected to give points lying along the diagonal (FPR = TPR). The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
Note that the ROC does not depend on the class distribution. This makes it useful for evaluating classifiers predicting rare events such as diseases or disasters. In contrast, evaluating performance using accuracy (TP + TN)/(TP + TN + FN + FP) would favor classifiers that always predict a negative outcome for rare events.
A nice threshold point should be chosen trying to have a small FALSE POSITIVE RATE and a high TRUE POSITIVE RATE In this case, it seems plausible to choose as a cutoff value 0.1. As you can guess, so different to choose 0.5 as a threshold as everybody does!