Discussion 1: Adjustment of a 2x2 table

Dr. Turner mentioned in the videos that, like MLR, logistic regression allows for estimating effects while taking additional variables into account. These additional variables could be confounding the effect you most wish to estimate. Consider the following example where two colleges held a science competition. Four hundred students from each college took an exam and received a grade of pass or fail. We additionally have information on whether each student is in the physics or math department.

comp<-read.csv("ScienceCompetition.csv",stringsAsFactors = T)
levels(comp$Pass)
## [1] "N" "Y"
head(comp)
##   School Dept Pass
## 1  Crane Math    Y
## 2  Crane Math    Y
## 3  Crane Math    Y
## 4  Crane Math    Y
## 5  Crane Math    Y
## 6  Crane Math    Y
  1. Consider the two logistic regression model fits. Interpret the regression coefficient on the school variable for each model. State the big finding here.
coef(summary(glm(Pass~School,data=comp,family="binomial")))
##                  Estimate Std. Error       z value  Pr(>|z|)
## (Intercept) -2.175938e-15  0.1000000 -2.175938e-14 1.0000000
## SchoolEagle  1.000835e-01  0.1415099  7.072540e-01 0.4794086
coef(summary(glm(Pass~School+Dept,data=comp,family="binomial")))
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  0.7023509  0.1473091  4.767871 1.861830e-06
## SchoolEagle -0.4833417  0.1709882 -2.826755 4.702231e-03
## DeptPhys    -1.2691942  0.1843913 -6.883159 5.853982e-12
exp(confint(glm(Pass~School,data=comp,family="binomial")))
## Waiting for profiling to be done...
##                 2.5 %   97.5 %
## (Intercept) 0.8218856 1.216714
## SchoolEagle 0.8375940 1.458941
exp(confint(glm(Pass~School+Dept,data=comp,family="binomial")))
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) 1.5186013 2.7080311
## SchoolEagle 0.4393646 0.8594695
## DeptPhys    0.1948056 0.4016388
  1. The more appropriate result is the multiple logistic model including both predictors. Use the following graphs to try to explain why the result changes the way it does between the two models.
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

g1<-comp %>% 
  group_by(School,Pass) %>%
  summarise(cnt=n()) %>%
  mutate(perc=round(cnt/sum(cnt),4))%>%
  arrange(desc(perc))
## `summarise()` has grouped output by 'School'. You can override using the
## `.groups` argument.
#g1

ggplot(g1[c(1,3),],aes(x=reorder(School,-perc),y=perc,colour=School))+
  geom_bar(aes(fill=School),show.legend=T,stat="identity")+
  ylab("Proportion of Passing ")+
  xlab("School")

g2<-comp %>% 
  group_by(Dept,Pass) %>%
  summarise(cnt=n()) %>%
  mutate(perc=round(cnt/sum(cnt),4))%>%
  arrange(desc(perc))
## `summarise()` has grouped output by 'Dept'. You can override using the
## `.groups` argument.
#g2

ggplot(g2[c(2,4),],aes(x=reorder(Dept,-perc),y=perc,colour=Dept))+
  geom_bar(aes(fill=Dept),show.legend=T,stat="identity")+
  ylab("Proportion of Passing ")+
  xlab("Dept.")

g3<-comp %>% 
  group_by(Dept,School) %>%
  summarise(cnt=n()) %>%
  mutate(perc=round(cnt/sum(cnt),4))%>%
  arrange(desc(perc))
## `summarise()` has grouped output by 'Dept'. You can override using the
## `.groups` argument.
#g3

ggplot(g3,aes(x=reorder(Dept,-perc),y=cnt,colour=Dept))+
  geom_bar(aes(fill=Dept),show.legend=T,stat="identity")+
  ylab("Count of Students ")+
  xlab("Dept.")+facet_grid(~School)

ANSWER: DISCUSSION 1

A. Logistic Regression Model Fits Interpretation

  • Model 1 (Pass ~ School):
    • Intercept: Essentially 0, indicating that the log odds of passing for the baseline school (Crane, the first level alphabetically) are near 0, i.e., a pass rate near 50%, when not accounting for department. The p-value is 1, so the intercept is not significantly different from 0.
    • SchoolEagle: The coefficient is 0.1000835, with a p-value of 0.4794086. The difference in log odds of passing between Eagle and the baseline (Crane) is not statistically significant. The exponentiated confidence interval (for the odds ratio) runs from 0.8375940 to 1.458941 and contains 1, so there is no strong evidence of a difference in pass rates between the schools when department is ignored. (See the odds-ratio sketch after this list.)
  • Model 2 (Pass ~ School + Dept):
    • Intercept (Crane, Math as the baseline): 0.7023509, with a very small p-value (1.861830e-06), indicating positive log odds of passing (odds of roughly exp(0.70), about 2) for Math students at Crane.
    • SchoolEagle: Now -0.4833417 and significantly different from 0 (p-value = 0.004702231). After adjusting for department, Eagle students are less likely to pass than Crane students, reversing the conclusion obtained without the department adjustment.
    • DeptPhys: Coefficient of -1.2691942 (p-value = 5.853982e-12); Physics students are significantly less likely to pass than Math students, regardless of school.
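Since the intervals quoted above are exponentiated, it may help to exponentiate the point estimates as well. A quick sketch using the same two fits (odds ratios to pair with the confidence intervals shown earlier):

#Odds ratios implied by each fit
exp(coef(glm(Pass~School,data=comp,family="binomial")))
exp(coef(glm(Pass~School+Dept,data=comp,family="binomial")))

Here exp(0.1000835) is about 1.11, consistent with a School odds-ratio interval that straddles 1, while exp(-0.4833417) is about 0.62, consistent with the adjusted interval lying entirely below 1.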

Big Finding:

Adjusting for department reveals a significant difference in pass rates between schools and between departments. Initially it appeared there was no significant difference between schools; after adjusting for department, however, being from Eagle and/or being in the Physics department is associated with a lower probability of passing. This highlights the importance of adjusting for potential confounders in statistical analysis.

B. Graphical Explanation of Model Result Changes

  • Graphs g1 and g2 show why the marginal comparison is incomplete: g1 suggests the overall pass rates of the two schools look similar, while g2 shows that one department (Physics) passes at a much lower rate than the other (Math). Because the composition of departments differs between the schools, the overall school pass rates are influenced by how many students each school draws from the lower-passing department.

  • Graph g3 (count of students by department and school) contextualizes this by showing the distribution of students across departments at each school. If, for example, Eagle has a larger share of Math students (the department with the higher pass rate) than Crane, that mix props up Eagle's overall pass rate and masks its lower within-department performance, which only becomes visible once department enters the model.

The key takeaway is that department acts as a confounding variable, influencing the pass rates and thereby affecting the comparison between schools. Without considering the department, the analysis might overlook significant differences in pass rates attributable to the uneven distribution of students across departments with inherently different pass rates.
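A numeric companion to the graphs: the raw two-way and three-way tables make the confounding visible without any plotting. This is only a sketch of tables one could check; the exact counts come from ScienceCompetition.csv.

#Overall pass rates by school
prop.table(table(comp$School,comp$Pass),margin=1)
#Pass rates within each school-by-department cell
prop.table(table(comp$School,comp$Dept,comp$Pass),margin=c(1,2))
#Department mix at each school
table(comp$School,comp$Dept)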

Discussion 2: Titanic Data Set Revisited

This section shows how many of the graphs from the videos were made. We'll also use it to provide a quick example of feature selection on top of that.

As explored in the videos, interactions potentially exist between Sex and Age.

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'
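The plotting code itself is not echoed in this document. The messages above indicate a geom_smooth() call was used; a plausible reconstruction (assuming Rose has a 0/1 Survived column along with Age and Sex; the original aesthetics may differ) is:

library(ggplot2)
#Empirical survival trend over Age, separately by Sex
ggplot(Rose,aes(x=Age,y=Survived,colour=Sex))+
  geom_jitter(height=0.05,alpha=0.3)+
  geom_smooth(method="loess",linewidth=1)+  #linewidth replaces the deprecated size aesthetic
  ylab("Survived (0/1)")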

To get a quick feel for whether interactions are important, we can compute an AIC for multiple models and compare. The following code fits a simple additive model using the key predictors of interest and two additional, more complex models: one that intuitively makes sense based on the EDA and another with even more complexity. Although you do not have to do this, I'm going to create a categorical version of the response as well.

Rose$Survived2<-factor(ifelse(Rose$Survived==1,"Survived","Perished"))
simple<-glm(Survived2~Pclass+Age+Sex,data=Rose,family="binomial")
complex1<-glm(Survived2~Pclass+Age+Sex+Age:Pclass+Age:Sex+Pclass:Sex,data=Rose,family="binomial")
complex2<-glm(Survived2~Pclass+poly(Age,3)+Sex+poly(Age,3):Pclass+poly(Age,3):Sex+Pclass:Sex,data=Rose,family="binomial")

AIC(simple)
## [1] 811.594
AIC(complex1)
## [1] 777.5174
AIC(complex2)
## [1] 770.323

An additional way to compare the models is with a deviance test. This is the logistic-regression equivalent of a partial F test: a global test of the coefficients that are unique to one model relative to a simpler model. When performing a deviance test between the simple model and the first complex model, the test is highly significant, suggesting that at least one of the coefficients unique to the complex model is not 0.

anova(simple,complex1,test='LR')
## Analysis of Deviance Table
## 
## Model 1: Survived2 ~ Pclass + Age + Sex
## Model 2: Survived2 ~ Pclass + Age + Sex + Age:Pclass + Age:Sex + Pclass:Sex
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       882     801.59                          
## 2       877     757.52  5   44.077 2.235e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
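As a sanity check, the p-value reported by anova() can be recovered by hand from the drop in residual deviance (44.077 on 5 degrees of freedom):

#Chi-square p-value from the deviance drop between the two models
pchisq(801.59-757.52,df=5,lower.tail=FALSE)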
  1. In this example, the AIC metric actually favors the most complex model. That model is most likely overfitting: the uncertainty in the estimates is quite large, producing many insignificant tests. To see this, compare the regression coefficients for the two complex models. What is the general theme here? (A small tally sketch follows the coefficient output below.)
coef(summary(complex1))
coef(summary(complex2))
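One way to see the theme is to tally how many Wald tests come out insignificant in each complex model. A minimal sketch (the 0.05 cutoff is just for illustration):

#Proportion of coefficients with p-value above 0.05 in each model
propInsig<-function(fit) mean(coef(summary(fit))[,"Pr(>|z|)"]>0.05)
propInsig(complex1)
propInsig(complex2)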
  1. To help interpret a logistic regression model, effects plots can be used. Three examples follow. Note that when a predictor is not used in the effect plot, R fixes that variable at a certain value. It's better to only fix variables that do not have interaction terms in the model. Use the last graph to estimate the probability of survival for a female in second class, age 60. (A predict()-based check appears after the plotting code.)
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 4.3.3
library(sjmisc)
## Warning: package 'sjmisc' was built under R version 4.3.3
## 
## Attaching package: 'sjmisc'
## The following object is masked from 'package:tidyr':
## 
##     replace_na
plot_model(complex1,type="pred",terms=c("Sex","Age[5,15,30,45]"))

plot_model(complex1,type="pred",terms=c("Pclass","Age[5,15,30,45]"))

plot_model(complex1,type="pred",terms=c("Age","Sex","Pclass"))
## Data were 'prettified'. Consider using `terms="Age [all]"` to get smooth
##   plots.
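The effects-plot reading can be checked directly with predict(). A minimal sketch, assuming Pclass is stored as a factor with a level "2" and Sex has a level "female" in Rose (adjust if they are coded differently):

#Predicted survival probability for a 60-year-old female in second class
#(assumes Pclass and Sex are factors; adjust to match how Rose codes them)
newdat<-data.frame(Pclass="2",Sex="female",Age=60)
predict(complex1,newdata=newdat,type="response")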

  1. GLMNET and stepwise feature selection tools exist in caret. These can be used to handle situations where the number of predictors is too large to handle manually. As with MLR, you can use these tools to determine candidate models, but it is up to you to decide whether interactions should be included, transformations applied, etc. You will also need to fit a final model using glm if you wish to provide statistical inference. Recall that fitting models after feature selection on the same data set tends to generate p-values that are too small. This is still typically done in practice, and the development of methods to handle this issue is ongoing.

Run the following feature selection code and examine which predictors are included. For stepwise selection, each step of the process is reported, so the output is lengthy. We can discuss this during live session. For this exercise, provide a quick summary of which coefficients were included using stepwise and GLMNET. Do these suggest a model closer to the simple, complex1, or complex2 models fit earlier?

library(caret)
## Loading required package: lattice
fitControl<-trainControl(method="repeatedcv",number=5,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)
#note CV and error metric are not really used here, but logLoss is reported for the final model.
Step.fit<-train(Survived2~Pclass+poly(Age,3)+Sex+poly(Age,3):Pclass+poly(Age,3):Sex+Pclass:Sex,
                    data=Rose,
                    method="glmStepAIC",
                    trControl=fitControl,
                    metric="logLoss")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

(The warning above is repeated for each model refit during the stepwise search; the duplicate lines are omitted.)
coef(Step.fit$finalModel)


#GLMNET
glmnet.fit<-train(Survived2~Pclass+poly(Age,3)+Sex+poly(Age,3):Pclass+poly(Age,3):Sex+Pclass:Sex,
                    data=Rose,
                    method="glmnet",
                    trControl=fitControl,
                    metric="logLoss")
coef(glmnet.fit$finalModel,glmnet.fit$finalModel$lambdaOpt)

##ANSWER DISCUSSION 2: This output represents a stepwise model selection process in which both forward and backward steps are used to determine the best set of predictors for the model based on Akaike's Information Criterion (AIC).

The terms beginning with poly(Age, 3) are polynomial terms for age, used to model non-linear effects of age. Terms such as Pclass3:poly(Age, 3)2 are interactions between passenger class and a polynomial term of age, indicating that the effect of age may differ across passenger classes.

Here’s a step-by-step interpretation:

  1. The AIC is used to compare models with different combinations of predictors. A lower AIC suggests a better model, balancing the trade-off between the model fit and the complexity of the model.

  2. In the model, terms are added or removed based on their effect on the AIC. The goal is to minimize the AIC.

  3. The terms listed in each step are those being considered for removal (-) or addition (+). After each step, the term that improves the AIC the most is either added or removed.

  4. The final model will be the one that has the lowest AIC while still providing a good fit to the data.

For the coefficients obtained from the glmnet model, which performs regularization, we see that some coefficients are exactly zero (represented by .). This indicates that glmnet has performed feature selection, penalizing some coefficients to zero to avoid overfitting.
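To see only the terms glmnet kept, the zero rows of the sparse coefficient matrix can be filtered out. A small sketch using the object fit above:

#Non-zero coefficients at the selected lambda
cf<-coef(glmnet.fit$finalModel,glmnet.fit$finalModel$lambdaOpt)
cf[as.vector(cf!=0),,drop=FALSE]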

Comparing the stepwise selection and glmnet coefficients:

The discrepancies between the models highlight the differences in approach: stepwise selection is based solely on AIC and does not penalize complexity as directly as glmnet, which uses shrinkage to penalize larger coefficients. This is why glmnet is often preferred when the goal is to improve predictive performance on new, unseen data.

A. Avoiding overfitting: both the stepwise selection process and the glmnet coefficients aim to find a model that balances complexity (which grows with more terms) against goodness of fit (which generally improves with more terms).

The stepwise selection includes a mix of main effects and interaction terms, suggesting it favors a model more complex than the simple model but not necessarily as complex as complex2, which has cubic terms for age and all of the specified interactions.

The coefficients from the glmnet model, chosen by penalizing complexity to prevent overfitting, suggest a simpler model than the full interaction model, though some interaction terms remain. This aligns it most closely with the complex1 model, but with regularization applied, fewer terms are included and their effect sizes are shrunk.

B. Less than 75%.

C. The model the feature selection results come closest to, though still quite different, is complex1.

Discussion 3: Introduction to the ROC curve

A point of discussion during live session will be the introduction of another error metric that can be used to investigate the performance of models as well as compare them. The metric is derived from a graphical display known as the receiver operating characteristic (ROC) curve. For now, the following code produces two ROC curves using the simple and complex1 models. Use Google to do a quick search on what an ROC curve is and how to read it. What does the ROC curve suggest when comparing the two models? Note these are computed on the training data, so they are a bit optimistic. In practice we will be doing this on a validation set or via CV.

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
#Predicting probabilities on the training data
#Age and Pclass only model
simple<-glm(Survived2~Age+Pclass,data=Rose,family="binomial")
simple.predprobs<-predict(simple,Rose,type="response")  #for a glm, type="response" returns probabilities; a caret model would use type="prob"

#Complex model from previous
complex1.predprobs<-predict(complex1,Rose,type="response")


simple.roc<-roc(response=Rose$Survived2,predictor=simple.predprobs,levels=c("Perished","Survived"))
## Setting direction: controls < cases
complex1.roc<-roc(response=Rose$Survived2,predictor=complex1.predprobs,levels=c("Perished","Survived"))
## Setting direction: controls < cases
#Note if using a caret model make sure to only feed one set of predicted probabilities. Caret models provide probabilities for both categories, so you have to pull the one you need.

plot(simple.roc)
plot(complex1.roc,print.thres="best",col="red",add=T,legend=T)
legend("bottomright",
       legend=c("Simple", "Complex"),
       col=c("black", "red"),
       lwd=4, cex =1, xpd = TRUE, horiz = FALSE)

Note that in the code above we never specified a threshold like we do for confusion matrices.
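For contrast, producing a confusion matrix does require choosing a threshold explicitly. A minimal sketch applying a 0.5 cutoff to the simple model's training-set probabilities (any other cutoff could be substituted):

#Classify with an explicit 0.5 threshold, then cross-tabulate against the truth
pred.class<-ifelse(simple.predprobs>0.5,"Survived","Perished")
table(Predicted=pred.class,Actual=Rose$Survived2)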

ROC curves in caret

No work to do past this. Coding is slightly different when computing an ROC curve from a model fit using caret. Next is the ROC curve on the training data of the glmnet fit earlier:

library(pROC)
#Predicting probabilities on the training data
glmnet.predprobs<-predict(glmnet.fit,Rose,type="prob")  #for a caret model, type="prob" returns class probabilities (type="raw" returns class labels)
glmnet.roc<-roc(response=Rose$Survived2,predictor=glmnet.predprobs$Survived,levels=c("Perished","Survived"))
## Setting direction: controls < cases
plot(simple.roc)
plot(complex1.roc,print.thres="best",col="red",add=T)
plot(glmnet.roc,add=T,col="lightblue")
legend("bottomright",
       legend=c("Simple", "Complex","GLMNET"),
       col=c("black", "red","lightblue"),
       lwd=4, cex =1, xpd = TRUE, horiz = FALSE)

#plot(log.roc,print.thres="best") #This graph is nice because the x axis is plotted in terms of specificity rather than FPR
#auc(log.roc)

##ANSWER Discussion 3:

The receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It is created by plotting the true positive rate (TPR, also known as sensitivity) against the false positive rate (FPR, 1 - specificity) at various threshold settings. The area under the ROC curve (AUC) measures how well the model can distinguish between the two classes (here, survived versus perished).

From the ROC curves provided for the simple and complex models, several points can be analyzed:

  1. Area Under the Curve (AUC): This reflects the model's ability to discriminate between the two classes. An AUC of 1 indicates perfect discrimination, while an AUC of 0.5 suggests no discriminative power, equivalent to random guessing. The closer the ROC curve is to the top left corner, the better the model.

  2. Curve Steepness: The left-hand portion of the curve shows the sensitivity of the model at high specificity levels (low false positive rates). A steeper initial curve implies the model quickly captures a high rate of true positives while incurring few false positives.

  3. Comparison of Models: By comparing the curves of the simple and complex models, it can be observed which model better differentiates between the two classes over the range of possible thresholds. If one curve dominates another (is consistently above the other), it indicates better performance.

In the context of the ROC curves for the simple and complex models, if one curve encloses a greater area than the other, it would suggest that this model has better performance. For the precise model comparison and interpretation, one would typically look at the AUC and specific points of interest on the curve that balance the trade-off between true positives and false positives according to the application’s needs.
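For a single-number comparison, the AUC of each curve can be pulled directly from the roc objects created above (the specific values are not reproduced here):

#Area under each ROC curve; a larger AUC indicates better discrimination
auc(simple.roc)
auc(complex1.roc)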

In practice, the threshold selection depends on the particular cost of false positives and false negatives in the problem domain. If no threshold is specified, it is often assumed that the optimal threshold is the one that maximizes the difference between the true positive rate and the false positive rate, which corresponds to the highest Youden’s J index (sensitivity + specificity - 1).
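pROC can report that threshold directly. A sketch using the complex model's curve:

#Threshold maximizing Youden's J, with its sensitivity and specificity
coords(complex1.roc,"best",best.method="youden",
       ret=c("threshold","sensitivity","specificity"))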

Lastly, note that the provided ROC curves are based on the training data, which can give an overly optimistic assessment of the model’s performance. It is generally more informative to evaluate ROC curves on a separate validation dataset that the model has not seen during training to get a realistic estimate of its performance.