Introduction
This article looks at how to interpret the output of the glm() R function, using an attrition dataset.
A note on the p-value: the p-value is a test of significance for the null hypothesis \(H_0\) that
- there is no difference in the log-odds of the outcome between the reference group (captured by the intercept) and the group defined by the explanatory variable (or one of its categories), i.e. the difference between the two groups equals zero: \(H_0: b_1 = 0\) versus \(H_a: b_1 \neq 0\).
If \(p < 0.05\), we reject \(H_0\), as we have evidence to suggest that the difference between the two groups does not equal zero.
Types Of Modeling Techniques
Logistic regression
When you have a binary outcome (Yes/No), you can use a chi-square test to compare the differences in proportions across \(n\) groups. For instance, if you had two groups (exposed and unexposed) and a binary outcome (event and no event), you could create a 2 x 2 contingency table and use a chi-square test to test whether the frequency or proportion of the outcome differs across the two groups. However, this will not give you the magnitude of the difference, its direction, or the uncertainty around it.
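For illustration, a quick sketch with made-up counts (the numbers here are purely hypothetical):

tab <- matrix(c(30, 70,    # exposed:   30 events, 70 non-events
                15, 85),   # unexposed: 15 events, 85 non-events
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("exposed", "unexposed"),
                              outcome = c("event", "no_event")))
chisq.test(tab)  # tests whether the proportions differ, but gives no effect size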
Why not linear regression?
Generate an example dataset
set.seed(12345)
interest <- sort(rnorm(20, 175, 20))  # 20 simulated interest values, sorted
loan_status <- c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)  # binary outcome
loan_data <- data.frame(interest, loan_status)
Linear Regression Model
lm1 <- lm(loan_status ~ interest, data = loan_data)
summary(lm1)
#>
#> Call:
#> lm(formula = loan_status ~ interest, data = loan_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.73342 -0.32015 -0.00892 0.30666 0.65050
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.755029 1.028622 -2.678 0.01534 *
#> interest 0.018439 0.005802 3.178 0.00521 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4218 on 18 degrees of freedom
#> Multiple R-squared: 0.3594, Adjusted R-squared: 0.3238
#> F-statistic: 10.1 on 1 and 18 DF, p-value: 0.00521
Takeaways
- interest rate has a significant positive effect on loan status (a bit silly)
- the model is overall significant and better than the null model because the p-value < 0.05
- the R-squared and adjusted R-squared are quite small
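More fundamentally, nothing constrains the fitted line to the [0, 1] range. Predicting at interest rates just outside the observed range makes this obvious:

predict(lm1, newdata = data.frame(interest = c(100, 250)))
# from the coefficients above: roughly -0.91 and 1.85, impossible values for a probability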
Logistic Regression Model
blr1 <- glm(loan_status ~ interest, family = binomial, data = loan_data)
summary(blr1)
#>
#> Call:
#> glm(formula = loan_status ~ interest, family = binomial, data = loan_data)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -21.34790 9.55927 -2.233 0.0255 *
#> interest 0.12066 0.05381 2.242 0.0249 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 27.726 on 19 degrees of freedom
#> Residual deviance: 18.666 on 18 degrees of freedom
#> AIC: 22.666
#>
#> Number of Fisher Scoring iterations: 5
Takeaways
- interest rate still has a significant positive effect on loan status
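To read the fit on the probability scale, predict with type = "response"; a small sketch:

# predicted default probabilities at a few interest rates
predict(blr1, newdata = data.frame(interest = c(150, 175, 200)), type = "response")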
Compare the results graphically
# fitted values from both models, their plogis transforms, and a
# 0.5-threshold classification from the logistic model
loan_data <- loan_data %>%
  dplyr::mutate(Pred_lm = predict(lm1, loan_data),
                Pred_blm = predict(blr1, loan_data)) %>%
  dplyr::mutate(Prob_lm = plogis(predict(lm1, loan_data)),
                Prob_blm = plogis(predict(blr1, loan_data)),
                Pred_blm2 = ifelse(predict(blr1, loan_data, type = "response") >= .5, 1, 0))
p1 <- ggplot(loan_data, aes(x = interest, y = loan_status)) +
geom_point() +
geom_abline(intercept = coef(lm1)[1], slope = coef(lm1)[2], color = "red", size = .5) +
geom_point(aes(x = interest, y = Pred_lm), color = "blue") +
tvthemes::theme_avatar()+
coord_cartesian(ylim = c(-0.2, 1.2), xlim = c(125, 225)) +
scale_y_continuous(breaks=seq(0, 1, 1), labels = c("non-default", "default")) +
guides(fill = FALSE) +
labs(title = "linear regression", x = "amount", y = "")
p2 <- ggplot(loan_data, aes(x = interest, y = loan_status)) +
geom_point() +
geom_point(aes(x = interest, y = Prob_blm), color = "blue") +
geom_smooth(method = "glm", method.args = list(family = "binomial"),
se = FALSE, color = "red", size = .5) +
geom_segment(aes(xend = interest, yend = Prob_blm), color = "red", alpha = .2) +
tvthemes::theme_avatar()+
coord_cartesian(ylim = c(-0.2, 1.2), xlim = c(125, 225)) +
scale_y_continuous(breaks=seq(0, 1, 1), labels = c("non-default", "default")) +
guides(fill = FALSE) +
labs(title = "logistic regression", x = "amount", y = "",
caption="Bongani Ncube")
ggpubr::ggarrange(p1, p2, ncol =2, nrow = 1)
Deriving Logistic Regression
The structural form of the logistic regression model:
\[p_i = f(\mathbf{x}_i ; \beta) = \frac{\exp(\mathbf{x_i'\beta})}{1 + \exp(\mathbf{x_i'\beta})}\]
Equivalently,
\[ \text{logit}(p_i) = \log\left(\frac{p_i}{1-p_i}\right) = \mathbf{x_i'\beta} \]
where \(\frac{p_i}{1-p_i}\) is the odds.
In this form, the model is specified such that a function of the mean response is linear in the parameters; hence the name Generalized Linear Model.
The likelihood function
\[L(p_i) = \prod_{i=1}^{n} p_i^{Y_i}(1-p_i)^{1-Y_i}\]
where \(p_i = \frac{\exp(\mathbf{x'_i \beta})}{1+\exp(\mathbf{x'_i \beta})}\) and \(1-p_i = (1+ \exp(\mathbf{x'_i \beta}))^{-1}\)
Hence, our objective function is
\[ Q(\beta) = \log L(\beta) = \sum_{i=1}^n Y_i \mathbf{x'_i \beta} - \sum_{i=1}^n \log(1+ \exp(\mathbf{x'_i \beta})) \]
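As a sanity check, this objective can be maximised numerically and compared with the glm() estimates; a minimal sketch using the simulated loan_data from above:

# negative log-likelihood -Q(beta), so that optim() (a minimiser) can be used
neg_loglik <- function(beta, X, y) {
  eta <- X %*% beta                       # linear predictor x_i' beta
  -(sum(y * eta) - sum(log(1 + exp(eta))))
}

X <- cbind(1, loan_data$interest)         # design matrix with an intercept
y <- loan_data$loan_status
fit <- optim(c(0, 0), neg_loglik, X = X, y = y, method = "BFGS")
fit$par                                   # should closely match coef(blr1)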
Multivariable logistic regression model
The structural form of the multivariable logistic regression model (this example uses two X variables):
\[ \log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} \]
Interpreting the coefficients
Our coefficients indicate the linear impact on the log odds of a positive decision. A negative coefficient decreases the log odds and a positive coefficient increases the log odds.
We can easily extend the earlier manipulations to get a formula for the odds of an event in terms of the coefficients \(\beta_0, \beta_1, ..., \beta_p\):
\[ \begin{align*} \frac{P(y = 1)}{P(y = 0)} &= e^{\beta_0 + \beta_1x_1 + ... + \beta_px_p} \\ &= e^{\beta_0}(e^{\beta_1})^{x_1}...(e^{\beta_p})^{x_p} \end{align*} \]
- \(e^{\beta_0}\) is the odds of the event when all input variables equal zero
- \(e^{\beta_i}\) is the multiplier of the odds associated with a one-unit increase in \(x_i\) (for example, one extra point of interest rate), holding all else equal. Because of this multiplicative effect, we call it the odds ratio for \(x_i\).
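In R, exponentiating the fitted coefficients gives the odds ratios directly; for the loan model above:

exp(coef(blr1))     # odds ratios
exp(confint(blr1))  # profile-likelihood confidence intervals on the odds scale

# the same in a tidy table (assumes the broom package is installed)
broom::tidy(blr1, exponentiate = TRUE, conf.int = TRUE)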
Categorical Explanatory Variable (2 Categories)
- attrition_status is modelled as a function of gender.
- The reference group is female.
attrition_data <- readr::read_csv("data-attrition.csv") |>
  mutate(attrition_status = ifelse(attrition == "yes", 1, 0)) |>
  mutate(department = as.factor(department))
model1 <- glm(attrition_status ~ gender, family=binomial(link='logit'), data=attrition_data)
summary(model1)
#>
#> Call:
#> glm(formula = attrition_status ~ gender, family = binomial(link = "logit"),
#> data = attrition_data)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.7507 0.1161 -15.073 <2e-16 ***
#> gendermale 0.1656 0.1467 1.128 0.259
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1298.6 on 1469 degrees of freedom
#> Residual deviance: 1297.3 on 1468 degrees of freedom
#> AIC: 1301.3
#>
#> Number of Fisher Scoring iterations: 4
- Intercept: The log-odds of attrition for women is -1.750698.
- gendermale:
- The difference in the log-odds of attrition between men and women is 0.1655528, i.e. the chance of attrition is higher for men than for women.
- Given \(p = 0.259 > 0.05\), we cannot reject the null hypothesis (\(b_1=0\)) that there is no difference in the log-odds between men and women.
- Men have 1.18004522 times the odds of attrition compared to women (i.e. 18 % higher odds), although this difference is not statistically significant.
- Null deviance is the logistic analogue of SSTot in linear regression.
- Residual deviance is the analogue of SSResid.
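These numbers can be recovered straight from the fitted model:

exp(coef(model1)["gendermale"])      # odds ratio, men vs. women: about 1.18
plogis(coef(model1)["(Intercept)"])  # baseline attrition probability for women: about 0.15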
Categorical Explanatory Variable (More than 2 Categories)
attrition_status is modelled as a function of department.
The default reference group is human_resources (the first level alphabetically).
model2 <- glm(attrition_status ~ department, family=binomial(link='logit'),
data=na.omit(attrition_data))
summary(model2)
#>
#> Call:
#> glm(formula = attrition_status ~ department, family = binomial(link = "logit"),
#> data = na.omit(attrition_data))
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.44692 0.32084 -4.510 6.49e-06 ***
#> departmentresearch_development -0.38175 0.33417 -1.142 0.253
#> departmentsales 0.09941 0.34152 0.291 0.771
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1298.6 on 1469 degrees of freedom
#> Residual deviance: 1288.1 on 1467 degrees of freedom
#> AIC: 1294.1
#>
#> Number of Fisher Scoring iterations: 4
- Intercept: the log-odds of attrition for someone in human_resources (the reference group) is -1.446919.
- For each of the other departments, the coefficient tells us whether the log-odds of attrition for that group is smaller or larger than that of the reference group.
- Where \(p>0.05\), the null hypothesis \(b_k=0\) cannot be rejected i.e. there is insufficient statistical evidence that the chance of attrition is significantly smaller compared to the reference group.
Now let’s change the reference group to research_development:
- This is still based on the data without missing department rows.
levels(attrition_data$department)
#> [1] "human_resources" "research_development" "sales"
attrition_data$departmentR <- relevel(attrition_data$department, ref=2)
levels(attrition_data$departmentR) # new col is not attached so must use data$
#> [1] "research_development" "human_resources" "sales"
model3 <- glm(attrition_status ~ departmentR, family=binomial(link='logit'),
data=na.omit(attrition_data))
summary(model3)
#>
#> Call:
#> glm(formula = attrition_status ~ departmentR, family = binomial(link = "logit"),
#> data = na.omit(attrition_data))
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.82866 0.09342 -19.576 < 2e-16 ***
#> departmentRhuman_resources 0.38175 0.33417 1.142 0.25330
#> departmentRsales 0.48116 0.14974 3.213 0.00131 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1298.6 on 1469 degrees of freedom
#> Residual deviance: 1288.1 on 1467 degrees of freedom
#> AIC: 1294.1
#>
#> Number of Fisher Scoring iterations: 4
Note that releveling changes only which comparisons are reported, not the model itself: the residual deviance and AIC match model2. The sales coefficient is now significant because sales is compared against research_development rather than human_resources.
Continuous Explanatory Variable
- attrition_status is modelled as a function of Age.
- There is no reference group as such.
- R handles missing values in a continuous explanatory variable without error, i.e. glm() simply drops those rows under the default na.action.
model4 <- glm(attrition_status ~ age, family=binomial(link='logit'), data=attrition_data)
summary(model4)
#>
#> Call:
#> glm(formula = attrition_status ~ age, family = binomial(link = "logit"),
#> data = attrition_data)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.20620 0.30599 0.674 0.5
#> age -0.05225 0.00870 -6.006 1.91e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1298.6 on 1469 degrees of freedom
#> Residual deviance: 1259.1 on 1468 degrees of freedom
#> AIC: 1263.1
#>
#> Number of Fisher Scoring iterations: 4
- Intercept: The log-odds of attrition when \(Age = 0\) is 0.206199.
- Age:
- For every unit increase in Age, the log-odds of attrition change by -0.0522501, i.e. the chance of attrition decreases as an employee's age increases.
- Given \(p < 0.05\), we can reject the null hypothesis (\(b_1=0\)) that a one-unit increase in age does not affect the chance of attrition.
- For every unit increase in Age, the odds of attrition are 0.9490914 times the odds at one Age unit less (i.e. a 5.09 % decrease).
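Because the effect is multiplicative in the odds, the odds ratio for a larger age gap is the per-unit odds ratio raised to that power; for example, a ten-year increase:

exp(coef(model4)["age"] * 10)  # odds ratio for a 10-year age increase: about 0.59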