Topics for today!

  1. Example of Logistic Regression
  2. Logistic Regression for Prediction
  3. Contingency tables
  4. ROC Curves and AUC
  5. Finding the best choice of threshold

Data for today

setwd("~/Desktop/R Materials/mih140/Lecture 20 - Logistic Regression II")
cba = read.table("cba_admissions_1999.txt", sep = "\t", header = T, quote = "", allowEscapes = T)
cba = cba[!is.na(cba$SAT_composite), ] # remove rows with missing SAT scores
cba = cba[!is.na(cba$HS_rank), ] # remove rows with missing HS rank

Topic 1: Example of Logistic Regression

Qu: Let’s predict if a student is Honors College Eligible from SAT score and HS_rank.

Motivation: Could be a useful model for prospective students who want to know if Pitt will allow them to join the Honors College, without having to wait for the admissions office to make a final decision.

## First let's separate our data into a training and testing set
cba$Honors_college_eligible = as.factor(cba$Honors_college_eligible)
cba = cba[sample(nrow(cba)),] # shuffle the rows (call set.seed() first if you want a reproducible split)
train_set = cba[1:500,]
test_set = cba[501:nrow(cba),]

glm1 = glm(Honors_college_eligible ~ SAT_composite + HS_rank, data = train_set, family = "binomial") # Fits the logistic model
glm1$coefficients
##   (Intercept) SAT_composite       HS_rank 
##  -30.19492634    0.02523809   -0.04644333

The model learned is then

\(Pr(Honors\_college\_eligible = 1) = \frac{exp(-30.1949 + SAT\_composite*0.0252 - HS\_rank*0.0464)}{1+exp(-30.1949 + SAT\_composite*0.0252 - HS\_rank*0.0464)}\)
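
To sanity-check this formula, we can compute a probability by hand and compare it against predict(). The SAT and rank values below are made up purely for illustration:

b = glm1$coefficients
eta = b[1] + b[2]*1400 + b[3]*5 # linear predictor for a hypothetical student
exp(eta)/(1 + exp(eta)) # probability computed by hand
predict(glm1, data.frame(SAT_composite = 1400, HS_rank = 5), type = "response") # should match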

Topic 2: Logistic Regression for Prediction

Goal: Given the model and a new observation, predict its class label.

How: We will choose a threshold \(T\) such that, given a new observation \(X\):

  1. If \(Pr(Y = 1| X) \ge T\), predict \(Y = 1\)
  2. Else predict \(Y = 0\)

Two methodologies for evaluating such a strategy:

  1. Contingency table
  2. ROC Curves

Topic 3: Contingency Tables

A contingency table displays our predictions (made using the threshold) against the true class labels, here on the test set. Contingency tables are nice because they tell us not just how accurate the method is, but also how the method makes its errors.

predicted_probs = predict.glm(glm1, test_set, type = "response")

# Let's use a threshold of T = .5; that sounds smart, right? Interpreting the outputs as probabilities, .5 represents a 50/50 mark.
# (Careful: T is also R's built-in shorthand for TRUE, so a name like Thresh, used later, is safer.)
T = .5
predicted_labels = rep(0, nrow(test_set))
predicted_labels[predicted_probs >= T] = 1

# Contingency Table (Confusion Matrix): displays predicted labels vs true labels on the testing set.

table(predicted_labels, test_set$Honors_college_eligible)
##                 
## predicted_labels  no yes
##                0 161   5
##                1   3  22
# Sensitivity = 22/(22+5) = .815
# Specificity = 161/(161+3) = .982
# Total Accuracy = (161+22)/(161+22+5+3) = .958, pretty good!
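
Rather than reading the counts off by hand, here is a small sketch that computes the same rates directly from the table (assuming the row/column labels shown above):

tab = table(predicted_labels, test_set$Honors_college_eligible)
sensitivity = tab["1", "yes"] / sum(tab[, "yes"]) # true positive rate
specificity = tab["0", "no"] / sum(tab[, "no"]) # true negative rate
accuracy = sum(diag(tab)) / sum(tab) # overall fraction correct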

Topic 4: ROC Curves and AUC

So the analysis in Topic 3 with T = .5 was good, but just interpreting logistic regression output as a probability and using a threshold of .5 is too simplistic. Note:

  1. The outputs of logistic regression are not actually probabilities; they are just the fit of a model, and we may do better by changing the threshold.
  2. The threshold captures a tradeoff between sensitivity and specificity. If we care about one more than the other, we should use a more/less conservative threshold to accommodate that.

The receiver operating characteristic (ROC) curve is a way of looking at the model for every threshold between 0 and 1. For each threshold it plots the sensitivity and specificity of the model using that threshold.
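
For intuition, here is a rough hand-rolled sketch of what an ROC curve plots; the pROC package used below does this properly:

roc_points = t(sapply(seq(0, 1, .01), function(Thresh) {
  pred = fitted(glm1) >= Thresh # predicted labels at this threshold
  truth = train_set$Honors_college_eligible == "yes"
  c(sens = mean(pred[truth]), spec = mean(!pred[!truth])) # sensitivity and specificity
}))
plot(1 - roc_points[, "spec"], roc_points[, "sens"], type = "l",
     xlab = "1 - Specificity", ylab = "Sensitivity") # the hand-drawn ROC curve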

# install.packages("pROC")
library(pROC) # This is the package that contains the ROC functions
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Training_set ROC
ROC_plot = roc(train_set$Honors_college_eligible, fitted(glm1)) # Makes the ROC object from the training labels and the fitted values of the logistic model. Remember we're trying to figure out a good threshold that we can use on testing sets in the future.
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(ROC_plot) # This plots the ROC curve

ROC_plot$auc
## Area under the curve: 0.9627
# The area under the curve (AUC) is a good measure of model quality: 1 is perfect, while .5 is no better than random guessing. We can use it to evaluate whether adding further features helps, whether a different set of features is better, etc.
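
A useful way to read the AUC: up to ties, it is the probability that a randomly chosen "yes" student receives a higher fitted value than a randomly chosen "no" student. A quick sanity check of that interpretation:

pos = fitted(glm1)[train_set$Honors_college_eligible == "yes"]
neg = fitted(glm1)[train_set$Honors_college_eligible == "no"]
mean(outer(pos, neg, ">")) # should be very close to ROC_plot$auc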

# Testing_set ROC (usually we can't see this in a prediction problem)
ROC_plot = roc(test_set$Honors_college_eligible, predicted_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(ROC_plot)

Topic 5: Finding the Best Threshold

Let's use a quick loop to find the best threshold on the training data.

# This code loops over every fitted value on the training set, treats it as a candidate threshold, computes the prediction accuracy, and keeps the best one.

fitted_vals = fitted(glm1)
best_acc = 0
best_T = 0
for(i in 1:nrow(train_set)){
  curr = fitted_vals[i]
  predicted_labels = rep(0, nrow(train_set))
  predicted_labels[fitted_vals >= curr] = 1
  
  curr_tab = table(predicted_labels, train_set$Honors_college_eligible)
  
  curr_acc = sum(diag(curr_tab)) # sum of the diagonal = number of correct predictions
  if(curr_acc > best_acc){
    best_T = curr
    best_acc = curr_acc
  }
}
print(best_acc)
## [1] 472
# 472 correct predictions out of 500 training observations, i.e. 94.4% training accuracy.
print(best_T)
##       600 
## 0.4501405

Let's test out our choice of threshold.

Thresh = best_T
predicted_labels = rep(0, nrow(test_set))
predicted_labels[predicted_probs >= Thresh] = 1

# Contingency Table (Confusion Matrix)
table(predicted_labels, test_set$Honors_college_eligible)
##                 
## predicted_labels  no yes
##                0 158   5
##                1   6  22
# Test accuracy = (158+22)/191 = .942; tuning the threshold on the training set doesn't guarantee an improvement on the test set.

Alternative way to find the best threshold

# In the previous code we searched for thresholds by considering the modelled response value for each observation. A more direct way is to search for a threshold by discretizing the interval [0,1] and testing those values.

candidate_threshold = seq(0,1, .01)
best_T = 0
best_acc = 0
predicted_probs = fitted(glm1)

for(i in 1:length(candidate_threshold)){
  Thresh = candidate_threshold[i]
  curr_pred = rep(0, length(predicted_probs))
  curr_pred[predicted_probs > Thresh] = 1
  
  curr_acc = sum(diag(table(curr_pred, train_set$Honors_college_eligible)))
  
  if(curr_acc > best_acc){
    best_T = Thresh
    best_acc = curr_acc
  }
}
print(best_T)
## [1] 0.42
print(best_acc)
## [1] 472
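
As an aside, pROC can also suggest a threshold via coords(). Note that by default it optimizes Youden's J statistic (sensitivity + specificity - 1) rather than raw accuracy, so it may disagree with the loops above:

ROC_train = roc(train_set$Honors_college_eligible, fitted(glm1)) # re-make the training-set ROC object
coords(ROC_train, "best", ret = "threshold") # Youden-optimal threshold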