Dr. Turner mentioned in the videos that, like MLR, logistic regression allows you to estimate effects while taking additional variables into account. These additional variables could be confounding the effect you are most interested in estimating. Consider the following example, in which two colleges held a science competition. Four hundred students from each college took an exam and were graded as pass or fail. We additionally have information on whether each student is in the physics or math department.
comp<-read.csv("ScienceCompetition.csv",stringsAsFactors = T)
levels(comp$Pass)
## [1] "N" "Y"
head(comp)
## School Dept Pass
## 1 Crane Math Y
## 2 Crane Math Y
## 3 Crane Math Y
## 4 Crane Math Y
## 5 Crane Math Y
## 6 Crane Math Y
coef(summary(glm(Pass~School,data=comp,family="binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.175938e-15 0.1000000 -2.175938e-14 1.0000000
## SchoolEagle 1.000835e-01 0.1415099 7.072540e-01 0.4794086
coef(summary(glm(Pass~School+Dept,data=comp,family="binomial")))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7023509 0.1473091 4.767871 1.861830e-06
## SchoolEagle -0.4833417 0.1709882 -2.826755 4.702231e-03
## DeptPhys -1.2691942 0.1843913 -6.883159 5.853982e-12
exp(confint(glm(Pass~School,data=comp,family="binomial")))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.8218856 1.216714
## SchoolEagle 0.8375940 1.458941
exp(confint(glm(Pass~School+Dept,data=comp,family="binomial")))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 1.5186013 2.7080311
## SchoolEagle 0.4393646 0.8594695
## DeptPhys 0.1948056 0.4016388
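If you want point estimates on the odds-ratio scale to go along with the intervals above, here is a small sketch using the same adjusted model:
#Adjusted odds ratios with their profile-likelihood confidence intervals
fit.adj<-glm(Pass~School+Dept,data=comp,family="binomial")
exp(cbind(OR=coef(fit.adj),confint(fit.adj)))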
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
g1<-comp %>%
group_by(School,Pass) %>%
summarise(cnt=n()) %>%
mutate(perc=round(cnt/sum(cnt),4))%>%
arrange(desc(perc))
## `summarise()` has grouped output by 'School'. You can override using the
## `.groups` argument.
#g1
ggplot(g1[c(1,3),],aes(x=reorder(School,-perc),y=perc,colour=School))+
geom_bar(aes(fill=School),show.legend=T,stat="identity")+
ylab("Proportion of Passing ")+
xlab("School")
g2<-comp %>%
group_by(Dept,Pass) %>%
summarise(cnt=n()) %>%
mutate(perc=round(cnt/sum(cnt),4))%>%
arrange(desc(perc))
## `summarise()` has grouped output by 'Dept'. You can override using the
## `.groups` argument.
#g2
ggplot(g2[c(2,4),],aes(x=reorder(Dept,-perc),y=perc,colour=Dept))+
geom_bar(aes(fill=Dept),show.legend=T,stat="identity")+
ylab("Proportion of Passing ")+
xlab("Dept.")
g3<-comp %>%
group_by(Dept,School) %>%
summarise(cnt=n()) %>%
mutate(perc=round(cnt/sum(cnt),4))%>%
arrange(desc(perc))
## `summarise()` has grouped output by 'Dept'. You can override using the
## `.groups` argument.
#g3
ggplot(g3,aes(x=reorder(Dept,-perc),y=cnt,colour=Dept))+
geom_bar(aes(fill=Dept),show.legend=T,stat="identity")+
ylab("Count of Students ")+
xlab("Dept.")+facet_grid(~School)
Adjusting for the department reveals a significant difference in pass rates between schools and between departments. Initially, it appeared there was no significant difference between schools; however, after adjusting for department, it is evident that being from Eagle and/or being in the Physics department negatively impacts the likelihood of passing, highlighting the importance of adjusting for potential confounders in statistical analysis.
Graphs g1 and g2 show that, when department is ignored, the pass rates at the two schools look similar, while the pass rates differ markedly between departments. Once you break the comparison down by department, it becomes apparent that the composition of departments at each school (more students from a department with a higher or lower pass rate) influences the overall school pass rates.
Graph g3 (count of students by department and school) further contextualizes this by showing the unequal distribution of students across departments at the two schools. Because the adjusted model indicates Eagle students pass less often within each department, the roughly equal crude pass rates imply Eagle has a larger share of students in the higher-passing Math department; this is why the school difference only becomes evident once department is accounted for in the model.
The key takeaway is that department acts as a confounding variable, influencing the pass rates and thereby affecting the comparison between schools. Without considering the department, the analysis might overlook significant differences in pass rates attributable to the uneven distribution of students across departments with inherently different pass rates.
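To make the confounding concrete without any modeling, a quick cross-tabulation (a minimal sketch using the comp data frame loaded above) shows the pass rate by school within each department, along with how the departments are split across the two schools:
#Pass rates by school within each department
with(comp,prop.table(table(Dept,School,Pass),margin=c(1,2)))
#Department mix at each school (why the crude school comparison is misleading)
with(comp,prop.table(table(School,Dept),margin=1))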
This section shows you how many of the graphs from the videos were made. We'll also use it to provide a quick example of feature selection on top of that.
As explored in the videos, interactions potentially exist between Sex and Age.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
To get a quick feel for whether the interactions are important or not, we can compute an AIC metric for multiple models and compare. The following code fits a simple additive model using the key predictors of interest and two additional, more complex models: one that intuitively makes sense based on the EDA and another with even more complexity. Although you do not have to do this, I'm going to create a categorical version of the response as well.
Rose$Survived2<-factor(ifelse(Rose$Survived==1,"Survived","Perished"))
simple<-glm(Survived2~Pclass+Age+Sex,data=Rose,family="binomial")
complex1<-glm(Survived2~Pclass+Age+Sex+Age:Pclass+Age:Sex+Pclass:Sex,data=Rose,family="binomial")
complex2<-glm(Survived2~Pclass+poly(Age,3)+Sex+poly(Age,3):Pclass+poly(Age,3):Sex+Pclass:Sex,data=Rose,family="binomial")
AIC(simple)
## [1] 811.594
AIC(complex1)
## [1] 777.5174
AIC(complex2)
## [1] 770.323
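As a small convenience (not required), the three AIC values can be gathered into a single table so the comparison is easier to scan:
#Collect the three AIC values in one place
data.frame(model=c("simple","complex1","complex2"),AIC=c(AIC(simple),AIC(complex1),AIC(complex2)))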
An additional way to test is to perform a deviance test. This is the equivalent of a partial F test in regression: a global test of the coefficients that are unique to one model relative to a simpler, nested model. When performing a deviance test between the simple model and the first complex model, the test is highly significant, suggesting that at least one of the coefficients unique to the complex model is not 0.
anova(simple,complex1,test='LR')
## Analysis of Deviance Table
##
## Model 1: Survived2 ~ Pclass + Age + Sex
## Model 2: Survived2 ~ Pclass + Age + Sex + Age:Pclass + Age:Sex + Pclass:Sex
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 882 801.59
## 2 877 757.52 5 44.077 2.235e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
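For intuition, the same test can be reproduced by hand: the drop in deviance between the nested models is compared to a chi-square distribution with degrees of freedom equal to the number of extra parameters in the larger model. A minimal sketch:
#Deviance test by hand: drop in deviance vs. chi-square with df equal to the number of extra parameters
dev.drop<-deviance(simple)-deviance(complex1)
df.diff<-df.residual(simple)-df.residual(complex1)
pchisq(dev.drop,df=df.diff,lower.tail=FALSE)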
coef(summary(complex1))
coef(summary(complex2))
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 4.3.3
library(sjmisc)
## Warning: package 'sjmisc' was built under R version 4.3.3
##
## Attaching package: 'sjmisc'
## The following object is masked from 'package:tidyr':
##
## replace_na
plot_model(complex1,type="pred",terms=c("Sex","Age[5,15,30,45]"))
plot_model(complex1,type="pred",terms=c("Pclass","Age[5,15,30,45]"))
plot_model(complex1,type="pred",terms=c("Age","Sex","Pclass"))
## Data were 'prettified'. Consider using `terms="Age [all]"` to get smooth
## plots.
After running the feature selection, you can refit the chosen model using glm if you wish to provide statistical inference. Recall that fitting models after feature selection on the same data set tends to generate p-values that are too small. This is nevertheless typically done in practice, and the development of methods to handle this issue is still ongoing; one common workaround (sample splitting) is sketched after this paragraph. Run the following feature selection code and examine what predictors are included. For stepwise selection, it will report each step of the process, so the output is lengthy. We can discuss this during live session. For this exercise, provide a quick summary of which coefficients were included by stepwise selection and by glmnet. Do these suggest a model closer to the simple, complex1, or complex2 models fit earlier?
library(caret)
## Loading required package: lattice
fitControl<-trainControl(method="repeatedcv",number=5,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)
#note CV and error metric are not really used here, but logLoss is reported for the final model.
Step.fit<-train(Survived2~Pclass+poly(Age,3)+Sex+poly(Age,3):Pclass+poly(Age,3):Sex+Pclass:Sex,
data=Rose,
method="glmStepAIC",
trControl=fitControl,
metric="logLoss")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## (warning repeated many times during the stepwise selection steps)
coef(Step.fit$finalModel)
#GLMNET
glmnet.fit<-train(Survived2~Pclass+poly(Age,3)+Sex+poly(Age,3):Pclass+poly(Age,3):Sex+Pclass:Sex,
data=Rose,
method="glmnet",
trControl=fitControl,
metric="logLoss")
coef(glmnet.fit$finalModel,glmnet.fit$finalModel$lambdaOpt)
##ANSWER DISCUSSION 2: This output represents a stepwise model selection process in which both forward and backward steps are used to determine the best set of predictors for the model based on Akaike's Information Criterion (AIC).
The terms beginning with poly(Age, 3) represent polynomial terms for age; these are used to model non-linear effects of age. Terms like Pclass3:poly(Age, 3)2 represent interaction terms between class and a polynomial term of age, indicating that the effect of age might be different across passenger classes.
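If the poly() notation is unfamiliar, the following minimal sketch (assuming Age has no missing values in Rose) shows that poly(Age, 3) simply expands Age into three orthogonal columns, which is what the numbered suffixes in the coefficient names refer to:
#poly(Age,3) builds an orthogonal polynomial basis with three columns (linear, quadratic, cubic)
head(poly(Rose$Age,3))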
Here's a step-by-step interpretation:
The AIC is used to compare models with different combinations of predictors. A lower AIC suggests a better model, balancing the trade-off between model fit and model complexity.
Terms are added or removed based on their effect on the AIC; the goal is to minimize the AIC.
The terms listed at each step are those being considered for removal (-) or addition (+). After each step, the term that improves the AIC the most is either added or removed.
The final model is the one that has the lowest AIC while still providing a good fit to the data.
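As a quick check on what AIC actually is, the value R reports is just -2 times the log-likelihood plus 2 times the number of estimated parameters; a minimal sketch using the simple model fit earlier:
#AIC = -2*logLik + 2*(number of estimated parameters)
ll<-logLik(simple)
c(by.hand=-2*as.numeric(ll)+2*attr(ll,"df"),AIC=AIC(simple))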
For the coefficients obtained from the glmnet model, which performs regularization, we see that some coefficients are exactly zero (represented by .). This indicates that glmnet has performed feature selection, penalizing some coefficients to zero to avoid overfitting.
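A quick way (sketch) to list only the predictors glmnet kept, i.e. the coefficients it did not shrink all the way to zero at the selected lambda:
#Pull just the nonzero coefficients at the chosen lambda
cf<-as.matrix(coef(glmnet.fit$finalModel,glmnet.fit$finalModel$lambdaOpt))
cf[cf[,1]!=0,,drop=FALSE]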
Comparing the stepwise selection and glmnet coefficients:
Terms with very large coefficients in the stepwise model, such as poly(Age, 3)3 and its interactions, are also present in the glmnet model. However, glmnet has shrunk the coefficients substantially, as it penalizes large coefficients to prevent overfitting.
Some terms that are present in the stepwise model are completely removed (coefficients shrunk to zero) in the glmnet model, such as Pclass3:poly(Age, 3)2.
Both models have retained key terms like Pclass3 and Sexmale, suggesting these are important predictors.
The discrepancies between the models highlight the differences in approach: stepwise selection is based solely on AIC and does not penalize complexity as directly as glmnet, which uses a shrinkage method to penalize larger coefficients. This is why glmnet is often preferred when the goal is to improve the model's predictive performance, especially on new, unseen data.
A. Avoid overfitting: both the stepwise selection process and the coefficients from the glmnet model are used to find a model that balances model complexity (which increases with more terms) against goodness of fit (which generally improves with more terms). The stepwise selection includes a mix of main effects and interaction terms, suggesting it favors a model more complex than the simple model but not necessarily as complex as the complex2 model with its cubic terms for age and all of the interactions. The coefficients from the glmnet model, chosen by penalizing complexity to prevent overfitting, suggest a simpler model than the full interaction model but with some interaction terms retained; this aligns it most closely with the complex1 model, although with regularization applied fewer terms are included and their effect sizes are shrunk.
B. Less than 75%.
C. The model the feature selection results are closest to, although still noticeably different, is complex1.
A point of discussion during live session will be the introduction of another error metric that can be used to investigate the performance of models as well as to compare models. The error metric is derived from a graphical display known as the receiver operating characteristic (ROC) curve. For now, the following code produces two ROC curves using the simple and complex1 models. Use Google to do a quick search on what an ROC curve is and how to read it. What do the ROC curves suggest when comparing the two models? Note these are done on the training data, so they are a bit optimistic. In practice we will be doing this on a validation set or via CV.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
#Predicting probabilities on the training data
#Age and Pclass only model
simple<-glm(Survived2~Age+Pclass,data=Rose,family="binomial")
simple.predprobs<-predict(simple,Rose,type="response") #note: for a caret model you would use type="prob" to get probabilities
#Complex model from previous
complex1.predprobs<-predict(complex1,Rose,type="response")
simple.roc<-roc(response=Rose$Survived2,predictor=simple.predprobs,levels=c("Perished","Survived"))
## Setting direction: controls < cases
complex1.roc<-roc(response=Rose$Survived2,predictor=complex1.predprobs,levels=c("Perished","Survived"))
## Setting direction: controls < cases
#Note: if using a caret model, make sure to only feed in one set of predicted probabilities. Caret models provide predicted probabilities for both categories, so you have to pull the one you need.
plot(simple.roc)
plot(complex1.roc,print.thres="best",col="red",add=T)
legend("bottomright",
legend=c("Simple", "Complex"),
col=c("black", "red"),
lwd=4, cex =1, xpd = TRUE, horiz = FALSE)
Note that in the code above we never specified a threshold like we do for confusion matrices.
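The curve sweeps over all possible thresholds, but a single threshold can always be pulled back out. Here is a minimal sketch using pROC's coords(), which by default reports the "best" cutoff by Youden's J, followed by a confusion matrix built at that cutoff (the "best" criterion is just an assumption for illustration):
#Extract the "best" threshold and build a confusion matrix from it
best<-coords(complex1.roc,x="best",ret=c("threshold","sensitivity","specificity"))
best
pred.class<-factor(ifelse(complex1.predprobs>best[["threshold"]],"Survived","Perished"),levels=c("Perished","Survived"))
table(Predicted=pred.class,Actual=Rose$Survived2)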
There is no work to do past this point. The coding is slightly different when computing an ROC curve from a model fit using caret. Next is the ROC curve, on the training data, for the glmnet fit from earlier:
library(pROC)
#Predicting probabilities on the training data
glmnet.predprobs<-predict(glmnet.fit,Rose,type="prob") #caret models return class probabilities with type="prob"; type="raw" returns class predictions
glmnet.roc<-roc(response=Rose$Survived2,predictor=glmnet.predprobs$Survived,levels=c("Perished","Survived"))
## Setting direction: controls < cases
plot(simple.roc)
plot(complex1.roc,print.thres="best",col="red",add=T)
plot(glmnet.roc,add=T,col="lightblue")
legend("bottomright",
legend=c("Simple", "Complex","GLMNET"),
col=c("black", "red","lightblue"),
lwd=4, cex =1, xpd = TRUE, horiz = FALSE)
#plot(log.roc,print.thres="best") #This graph is nice because the x axis is plotted in terms of specificity rather than FPR
#auc(log.roc)
##ANSWER Discussion 3:
The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It is created by plotting the true positive rate (TPR, also known as sensitivity) against the false positive rate (FPR, 1 - specificity) at various threshold settings. The area under the ROC curve (AUC) is a measure of how well a predictor can distinguish between the two groups (here, survived versus perished).
From the ROC curves provided for the simple and complex models, several points can be analyzed:
Area Under the Curve (AUC): This area reflects model accuracy. An AUC of 1 indicates perfect prediction, while an AUC of 0.5 suggests no discriminative power, equivalent to random guessing. The closer the ROC curve is to the top left corner, the higher the overall accuracy of the test.
Curve Steepness: The beginning of the curve indicates the sensitivity of the model while specificity is still high (i.e., at a low false positive rate). A steeper initial curve implies the model quickly captures a high rate of true positives while incurring only a limited number of false positives.
Comparison of Models: By comparing the curves of the simple and complex models, it can be observed which model better differentiates between the two classes over the range of possible thresholds. If one curve dominates another (is consistently above the other), it indicates better performance.
In the context of the ROC curves for the simple and complex models, if one curve encloses a greater area than the other, it would suggest that this model has better performance. For the precise model comparison and interpretation, one would typically look at the AUC and specific points of interest on the curve that balance the trade-off between true positives and false positives according to the application’s needs.
In practice, the threshold selection depends on the particular cost of false positives and false negatives in the problem domain. If no threshold is specified, it is often assumed that the optimal threshold is the one that maximizes the difference between the true positive rate and the false positive rate, which corresponds to the highest Youden’s J index (sensitivity + specificity - 1).
Lastly, note that the provided ROC curves are based on the training data, which can give an overly optimistic assessment of the model’s performance. It is generally more informative to evaluate ROC curves on a separate validation dataset that the model has not seen during training to get a realistic estimate of its performance.
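To make the visual comparison concrete, the AUC values can be printed directly, and pROC's roc.test() provides a formal test of whether the two (training-data, hence optimistic) AUCs differ; a short sketch:
#AUC for each model plus a formal comparison of the two curves (DeLong's test by default)
auc(simple.roc)
auc(complex1.roc)
roc.test(simple.roc,complex1.roc)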