Getting ready…
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
[30m-- [1mAttaching packages[22m ------------------------------------------------------------- tidyverse 1.3.0 --[39m
[30m[32mv[30m [34mggplot2[30m 3.3.3 [32mv[30m [34mpurrr [30m 0.3.4
[32mv[30m [34mtibble [30m 3.0.6 [32mv[30m [34mdplyr [30m 1.0.4
[32mv[30m [34mtidyr [30m 1.1.2 [32mv[30m [34mstringr[30m 1.4.0
[32mv[30m [34mreadr [30m 1.4.0 [32mv[30m [34mforcats[30m 0.5.1[39m
[30m-- [1mConflicts[22m ---------------------------------------------------------------- tidyverse_conflicts() --
[31mx[30m [34mdplyr[30m::[32mfilter()[30m masks [34mstats[30m::filter()
[31mx[30m [34mdplyr[30m::[32mlag()[30m masks [34mstats[30m::lag()[39m
fl <- read_csv('isbell_et_al_2019_firstlast.csv')
[36m--[39m [1m[1mColumn specification[1m[22m [36m------------------------------------------------------------------------------[39m
cols(
.default = col_double(),
Student.Identification = [31mcol_character()[39m,
Language = [31mcol_character()[39m
)
[36mi[39m Use [38;5;235m[48;5;253m[38;5;235m[48;5;253m`spec()`[48;5;253m[38;5;235m[49m[39m for the full column specifications.
Getting a sense of the data - from an ordinal/continous outcome to a dichotomous outcome.
table(fl$netchange)
-3 -2 -1 0 1 2 3
1 15 105 323 280 84 6
table(fl$Decrease)
0 1
693 121
Taking a look at how some predictors associate with the outcome:
ggplot(data = fl, aes(x = factor(start.prof), fill = factor(Decrease, levels = c(0,1)))) +
geom_bar(stat = "count")+
coord_flip()
And another…
ggplot(data = fl, aes(x = factor(Time), fill = factor(Decrease, levels = c(0,1)))) +
geom_bar(stat = "count")+
coord_flip()
An intercept-only model
mod0 <- glm(Decrease ~ 1, family = "binomial", data = fl)
summary(mod0)
Call:
glm(formula = Decrease ~ 1, family = "binomial", data = fl)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.5673 -0.5673 -0.5673 -0.5673 1.9525
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.74524 0.09853 -17.71 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 684.34 on 813 degrees of freedom
Residual deviance: 684.34 on 813 degrees of freedom
AIC: 686.34
Number of Fisher Scoring iterations: 4
Converting logits to OR…
exp(mod0$coefficients)
(Intercept)
0.1746032
Quick check: p = 121/(121+693) = .1486 odds = p/(1-p) = .1486/(1-.1486) = .1745 (matches our intercept)
An intercept only model in logistic regression doesn’t give you the grand mean like a linear regression, but instead gives you the overall odds of the outcome in the data.
Now to a model with predictors:
mod1 <- glm(Decrease ~ all.course + motiv + start.prof, family = "binomial", data = fl)
summary(mod1)
Call:
glm(formula = Decrease ~ all.course + motiv + start.prof, family = "binomial",
data = fl)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4942 -0.5820 -0.4322 -0.3128 2.4740
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.84787 0.48715 -5.846 5.04e-09 ***
all.course -0.10139 0.03640 -2.785 0.00535 **
motiv -0.22241 0.09081 -2.449 0.01432 *
start.prof 0.63563 0.08285 7.672 1.69e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 684.34 on 813 degrees of freedom
Residual deviance: 617.62 on 810 degrees of freedom
AIC: 625.62
Number of Fisher Scoring iterations: 5
And now to do some nifty conversions to odds, including CIs…
coef <- tidy(mod1, conf.int = T)
coef <- mutate(coef, OR = exp(estimate),
`2.5%` = exp(conf.low),
`97.5%` = exp(conf.high))
coef
Converting to odds is nice because there is a clear interpretation for effects and CI: less than 1 is a decrease in odds, and greater than 1 is an increase in odds. If a CI crosses 1, the effect is not statistically different from the null, which in the case of LR, is 50% chance/1:1 odds/0 logits
How about an effect plot?
coef %>% ggplot(aes(x = term, y = OR))+
geom_point()+
geom_errorbar(aes(ymin = `2.5%`, ymax = `97.5%`), width = .2)+
geom_hline(yintercept = 1, linetype = 2)+
labs(x = NULL, y = 'Odds of Decrease')+
coord_flip()+
theme_minimal()
It’s also useful to check the predictive accuracy of the model:
#check predicted probabilities
fl$pred.prob <- fitted(mod1) #note - fitted() returns probabilities for logistic reg, predict() gives logits
fl$pred.class <- cut(fl$pred.prob, breaks = c(-Inf, .5, Inf), labels= c(0, 1))
table(fl$Decrease, fl$pred.class)
0 1
0 683 10
1 115 6
What about R^2? There’s another package that returns this (Nagelkerke R2) when running a LR model:
library(rms)
mod1b <- lrm(Decrease ~ all.course
+ motiv + start.prof, data = fl)
mod1b$stats
Obs Max Deriv Model L.R. d.f. P C Dxy
8.140000e+02 2.437147e-06 6.672132e+01 3.000000e+00 2.142730e-14 7.348813e-01 4.697626e-01
Gamma Tau-a R2 Brier g gr gp
4.735235e-01 1.190452e-01 1.384067e-01 1.174376e-01 9.255851e-01 2.523344e+00 1.094682e-01