setwd("~/Desktop/R Materials/mih140/Lecture 20 - Logistic Regression II")
cba = read.table("cba_admissions_1999.txt", sep = "\t", header = T, quote = "", allowEscapes = T)
cba = cba[!is.na(cba$SAT_composite), ] # remove rows with missing SAT_composite
cba = cba[!is.na(cba$HS_rank), ] # remove rows with missing HS_rank
Qu: Let’s predict whether a student is Honors College Eligible from SAT_composite and HS_rank.
Motivation: Could be a useful model for prospective students who want to know if Pitt will allow them to join the Honors College, without having to wait for the admissions office to make a final decision.
## First let's separate our data into a training set and a testing set
cba$Honors_college_eligible = as.factor(cba$Honors_college_eligible)
cba = cba[sample(nrow(cba)),] # shuffle the rows so the split is random
train_set = cba[1:500,] # first 500 shuffled rows for training
test_set = cba[501:nrow(cba),] # remaining rows for testing
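One aside, not from the original analysis: sample() gives a different shuffle, and therefore a different train/test split, on every run. If you want the split to be reproducible you can fix the random seed first; the seed value below is an arbitrary choice.
# Reproducible alternative to the shuffle-and-split above
set.seed(2024)                   # arbitrary seed, only so the split repeats exactly
cba = cba[sample(nrow(cba)), ]   # shuffle the rows
train_set = cba[1:500, ]         # first 500 shuffled rows for training
test_set = cba[501:nrow(cba), ]  # remaining rows for testing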
glm1 = glm(Honors_college_eligible ~ SAT_composite + HS_rank, data = train_set, family = "binomial") # Fits the logistic regression model on the training set
glm1$coefficients
## (Intercept) SAT_composite HS_rank
## -30.19492634 0.02523809 -0.04644333
# Model
The learned model is then
\(Pr(Honors\_college\_eligible = 1) = \frac{exp(-30.19492634 + 0.02523809*SAT\_composite - 0.04644333*HS\_rank)}{1+exp(-30.19492634 + 0.02523809*SAT\_composite - 0.04644333*HS\_rank)}\)
Goal: Given the model and a new observation, predict its class label.
How: We will choose a threshold \(T\) such that, given a new observation X, we predict class 1 ("yes") when the modelled value is at least \(T\), and class 0 ("no") otherwise.
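To make the formula concrete, here is a small sketch that evaluates the fitted equation by hand for one hypothetical student (the SAT and rank values are made up) and checks it against predict.glm():
# Evaluate the fitted logistic model by hand for one made-up student
b = glm1$coefficients # (Intercept), SAT_composite, HS_rank
new_student = data.frame(SAT_composite = 1400, HS_rank = 10) # hypothetical values
eta = b[1] + b[2] * new_student$SAT_composite + b[3] * new_student$HS_rank # linear predictor
exp(eta) / (1 + exp(eta)) # logistic transform gives the modelled value
predict.glm(glm1, new_student, type = "response") # should agree with the line above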
A contingency table displays our predictions, made using the threshold, against the true class labels. Contingency tables are nice because they tell us not just how accurate the method is, but also how the method makes its errors.
predicted_probs = predict.glm(glm1, test_set, type = "response") # modelled response values for every observation in the testing set
# Let's use a threshold of T = .5; that sounds smart, right? Interpreted as a probability, .5 represents the 50/50 mark.
T = .5 # (note: this assignment masks R's built-in shorthand T for TRUE)
predicted_labels = rep(0, nrow(test_set))
predicted_labels[predicted_probs >= T] = 1
# Contingency Table (Confusion Matrix): displays predicted labels vs. true labels on the testing set.
table(predicted_labels, test_set$Honors_college_eligible)
##
## predicted_labels no yes
## 0 161 5
## 1 3 22
# Sensitivity = 22/(22+5) ≈ .815 (fraction of true "yes" students correctly predicted)
# Specificity = 161/(161+3) ≈ .982 (fraction of true "no" students correctly predicted)
# Total Accuracy = (161+22)/(161+5+3+22) ≈ .958, pretty good!
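Since the table's rows are the predicted labels and its columns the true labels, these quantities can also be pulled out in R rather than by hand. The snippet below is a small sketch using the same predicted_labels and test_set defined above.
# Compute sensitivity, specificity, and accuracy directly from the confusion matrix
conf_mat = table(predicted_labels, test_set$Honors_college_eligible)
TN = conf_mat["0", "no"];  FN = conf_mat["0", "yes"]
FP = conf_mat["1", "no"];  TP = conf_mat["1", "yes"]
TP / (TP + FN)                       # sensitivity: fraction of true "yes" caught
TN / (TN + FP)                       # specificity: fraction of true "no" caught
sum(diag(conf_mat)) / sum(conf_mat)  # total accuracy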
So the analysis in Topic 3 with T = .5 was good, but just interpreting logistic regression output as a probability and using a threshold of .5 is too simplistic. Note:
1. The outputs of logistic regression are not actually probabilities; they are just the fit of a model, and we may do better by changing the threshold.
2. The threshold captures a tradeoff between sensitivity and specificity. If we care about one more than the other, we should use a more or less conservative threshold to accommodate that.
The receiver operating characteristic (ROC) curve is a way of looking at the model for every threshold between 0 and 1. For each threshold it plots the sensitivity and specificity of the model using that threshold.
# install.packages("pROC")
library(pROC) # This is the package that contains the ROC functions
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Training_set ROC
ROC_plot = roc(train_set$Honors_college_eligible, fitted(glm1)) # Makes the ROC object. It takes the true training labels and the fitted values from the logistic model trained on that set. Remember we're trying to figure out a good threshold that we can use on testing sets in the future.
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(ROC_plot) # This plots the ROC curve
ROC_plot$auc # Area under the ROC curve (AUC)
## Area under the curve: 0.9627
# Area under the curve: 0.9627. This is a good measure of model quality: closer to 1 is good, while .5 is no better than random guessing. We can use it to evaluate whether adding further features helps, whether a different set of features is better, etc.
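A hedged aside: pROC can also suggest a threshold directly from the ROC object via its coords() function. By default the "best" option maximizes Youden's index (sensitivity + specificity - 1) rather than raw accuracy, so it need not match the accuracy-maximizing threshold found below.
# Ask pROC for a suggested threshold from the training ROC curve
coords(ROC_plot, "best", ret = c("threshold", "sensitivity", "specificity"))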
# Testing_set ROC (usually we can't see this in a prediction problem, since we wouldn't have the true test labels yet)
ROC_plot = roc(test_set$Honors_college_eligible, predicted_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(ROC_plot)
Let’s use a quick loop to find the best threshold on the training data.
# This code loops over the fitted values on the training set, treating each one as a candidate threshold; for each candidate it counts the correct predictions, and it keeps the best one.
fitted_vals = fitted(glm1)
best_acc = 0
best_T = 0
for(i in 1:nrow(train_set)){
  curr = fitted_vals[i] # candidate threshold: the fitted value for observation i
  predicted_labels = rep(0, nrow(train_set))
  predicted_labels[fitted_vals >= curr] = 1
  curr_tab = table(predicted_labels, train_set$Honors_college_eligible)
  curr_acc = sum(diag(curr_tab)) # diag() picks out the correctly classified counts, so this is the number of correct predictions
  if(curr_acc > best_acc){
    best_T = curr
    best_acc = curr_acc
  }
}
print(best_acc) # number of correct training predictions (out of 500), not a proportion
## [1] 472
print(best_T) # the "600" printed below is just the name carried over from fitted_vals; the threshold is the value beneath it
## 600
## 0.4501405
Thresh = best_T
predicted_labels = rep(0, nrow(test_set))
predicted_labels[predicted_probs >= Thresh] = 1
# Contingency Table (Confusion Matrix) using the training-derived threshold on the testing set
table(predicted_labels, test_set$Honors_college_eligible)
##
## predicted_labels no yes
## 0 158 5
## 1 6 22
# In the previous code we searched for thresholds by considering the modelled response value of each observation. A more direct way is to search for a threshold by discretizing the interval [0,1] and testing each candidate value.
candidate_threshold = seq(0, 1, .01) # grid of candidate thresholds: 0, .01, .02, ..., 1
best_T = 0
best_acc = 0
predicted_probs = fitted(glm1) # note: this re-uses the name predicted_probs for the training-set fitted values
for(i in 1:length(candidate_threshold)){
  Thresh = candidate_threshold[i]
  curr_pred = rep(0, length(predicted_probs))
  curr_pred[predicted_probs > Thresh] = 1
  curr_acc = sum(diag(table(curr_pred, train_set$Honors_college_eligible))) # number of correct training predictions
  if(curr_acc > best_acc){
    best_T = Thresh
    best_acc = curr_acc
  }
}
print(best_T)
## [1] 0.42
print(best_acc)
## [1] 472
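As a closing sketch, the same grid search can be written without an explicit loop. This version reports accuracy as a proportion of the 500 training observations rather than a raw count, but the logic is otherwise the same as the loop above.
# Vectorized version of the grid search: training accuracy (as a proportion) at each threshold
accuracy_at = function(Thresh) {
  curr_pred = as.numeric(fitted(glm1) > Thresh)                    # 0/1 predictions on the training set
  mean(curr_pred == (train_set$Honors_college_eligible == "yes"))  # proportion of correct predictions
}
accs = sapply(candidate_threshold, accuracy_at)
candidate_threshold[which.max(accs)] # threshold with the highest training accuracy
max(accs)                            # that accuracy, as a proportion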