A Gentle Introduction to Logistic Regression

Bongani Ncube

2023-10-02

Introduction

This article looks at how to interpret the output of R's glm() function, using an employee attrition dataset.

A note on the p-value: the p-value is a test of significance for the null hypothesis \(H_0\) that

  • there is no difference in the log-odds of the outcome between the reference group (captured by the intercept) and the group defined by the explanatory variable (or one of its categories); in other words, the difference between the two groups equals zero: \(H_0: b_1 = 0\) versus \(H_a: b_1 \neq 0\).

If \(p < 0.05\), we reject \(H_0\): we have evidence to suggest that the difference between the two groups is not zero.

Types Of Modeling Techniques

Logistic regression

When you have a binary outcome (Yes/No), you can use a chi-square test to compare the differences in proportions across \(n\) groups. For instance, if you had two groups (exposed and unexposed) and a binary outcome (event and no event), you could create a 2 x 2 contingency table and use a chi-square test to test whether the proportion with the outcome differs between the two groups. However, this will not give you the magnitude of the difference, its direction, or the uncertainty around it.
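A minimal sketch of that chi-square approach (the counts here are made up purely for illustration):

# hypothetical 2 x 2 contingency table: exposure vs outcome
tab <- matrix(c(30, 70,    # exposed:   event, no event
                15, 85),   # unexposed: event, no event
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("exposed", "unexposed"),
                              outcome = c("event", "no event")))
chisq.test(tab)  # tests whether the proportions differ, but gives no
                 # effect size, direction, or confidence interval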

Why not linear regression

Generate an example dataset

set.seed(12345)

interest <- rnorm(20, 175, 20)  # 20 interest values (mean 175, sd 20)

interest <- sort(interest)      # sort ascending so defaults cluster at higher interest

loan_status <- c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)  # binary outcome: 1 = default

loan_data <- data.frame(interest, loan_status)

Linear Regression Model

lm1 <- lm(loan_status ~ interest, data = loan_data)
summary(lm1)
#> 
#> Call:
#> lm(formula = loan_status ~ interest, data = loan_data)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.73342 -0.32015 -0.00892  0.30666  0.65050 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)   
#> (Intercept) -2.755029   1.028622  -2.678  0.01534 * 
#> interest     0.018439   0.005802   3.178  0.00521 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4218 on 18 degrees of freedom
#> Multiple R-squared:  0.3594, Adjusted R-squared:  0.3238 
#> F-statistic:  10.1 on 1 and 18 DF,  p-value: 0.00521

Takeaways

  • interest rate has a significant positive effect on loan status (a bit silly)
  • the model is overall significant and better than the null model, since the p-value < 0.05
  • the R-squared and adjusted R-squared are quite small
  • crucially, the fitted values are not constrained to [0, 1], so they cannot be interpreted as probabilities (see the check below)
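A quick check using lm1 makes the last point concrete: predicting at interest values outside the observed range yields "probabilities" below 0 and above 1 (the two inputs here are arbitrary illustrative values):

predict(lm1, newdata = data.frame(interest = c(100, 250)))
# from the printed coefficients: -2.755 + 0.0184 * 100 is about -0.91,
# and -2.755 + 0.0184 * 250 is about 1.85 -- both outside [0, 1]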

Logistic Regression Model

blr1 <- glm(loan_status ~ interest, family = binomial, data = loan_data)
summary(blr1)
#> 
#> Call:
#> glm(formula = loan_status ~ interest, family = binomial, data = loan_data)
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)  
#> (Intercept) -21.34790    9.55927  -2.233   0.0255 *
#> interest      0.12066    0.05381   2.242   0.0249 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 27.726  on 19  degrees of freedom
#> Residual deviance: 18.666  on 18  degrees of freedom
#> AIC: 22.666
#> 
#> Number of Fisher Scoring iterations: 5

Takeaways

  • interest rate still has a significant positive effect on loan status, now on the log-odds scale (p = 0.0249)

Compare the results graphically

library(dplyr)    # for mutate() and the %>% pipe
library(ggplot2)  # for the plots below

loan_data <- loan_data %>%
  dplyr::mutate(Pred_lm = predict(lm1, loan_data),
                Pred_blm = predict(blr1, loan_data)) %>%
  dplyr::mutate(Prob_lm = plogis(predict(lm1, loan_data)),
                Prob_blm = plogis(predict(blr1, loan_data)),
                Pred_blm2 = ifelse(predict(blr1, loan_data, type = "response") >= .5, 1, 0))
 
p1 <- ggplot(loan_data, aes(x = interest, y = loan_status)) +
  geom_point() +
  geom_abline(intercept = coef(lm1)[1], slope = coef(lm1)[2], color = "red", size = .5) +
  geom_point(aes(x = interest, y = Pred_lm), color = "blue") +
  tvthemes::theme_avatar() +
  coord_cartesian(ylim = c(-0.2, 1.2), xlim = c(125, 225)) +
  scale_y_continuous(breaks = seq(0, 1, 1), labels = c("non-default", "default")) +
  guides(fill = "none") +
  labs(title = "linear regression", x = "interest", y = "")


p2 <- ggplot(loan_data, aes(x = interest, y = loan_status)) +
  geom_point() +
  geom_point(aes(x = interest, y = Prob_blm), color = "blue") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = FALSE, color = "red", size = .5) +
  geom_segment(aes(xend = interest, yend = Prob_blm), color = "red", alpha = .2) +
  tvthemes::theme_avatar() +
  coord_cartesian(ylim = c(-0.2, 1.2), xlim = c(125, 225)) +
  scale_y_continuous(breaks = seq(0, 1, 1), labels = c("non-default", "default")) +
  guides(fill = "none") +
  labs(title = "logistic regression", x = "interest", y = "",
       caption = "Bongani Ncube")

ggpubr::ggarrange(p1, p2, ncol =2, nrow = 1)

Deriving Logistic Regression

The structural form of the logistic regression model:

\(logit( E[Y_i | X_i]) = logit(p_i) = ln(\frac{p_i}{1 - p_i}) = \beta_0 + \beta_1 X_{1i}\)

(Note there is no additive error term here: the randomness enters through the Bernoulli distribution of \(Y_i\), not through an \(\epsilon\).)


\[p_i = f(\mathbf{x}_i ; \beta) = \frac{exp(\mathbf{x_i'\beta})}{1 + exp(\mathbf{x_i'\beta})}\]

Equivalently,

\[ logit(p_i) = log(\frac{p_i}{1-p_i}) = \mathbf{x_i'\beta} \]

where \(\frac{p_i}{1-p_i}\) is the odds.

In this form, the model is specified such that a function of the mean response is linear in the parameters; hence the name Generalized Linear Models.

The likelihood function

\[L(p_i) = \prod_{i=1}^{n} p_i^{Y_i}(1-p_i)^{1-Y_i}\]

where \(p_i = \frac{exp(\mathbf{x'_i \beta})}{1+ exp(\mathbf{x'_i \beta})}\) and \(1-p_i = (1+ exp(\mathbf{x'_i \beta}))^{-1}\)

Hence, our objective function is

\[ Q(\beta) = log(L(\beta)) = \sum_{i=1}^n Y_i \mathbf{x'_i \beta} - \sum_{i=1}^n log(1+ exp(\mathbf{x'_i \beta})) \]
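We can verify this objective function numerically against the earlier fit. A minimal sketch, assuming loan_data and blr1 from above: the hand-written log-likelihood evaluated at the glm() estimates should match logLik(blr1) (about -9.33, i.e. minus half the residual deviance of 18.666), and maximizing it with optim() should recover roughly the same coefficients.

# negative of Q(beta), written to avoid overflow in exp()
neg_Q <- function(beta, X, y) {
  eta <- drop(X %*% beta)                             # linear predictor x_i' beta
  log1pexp <- ifelse(eta > 30, eta, log1p(exp(eta)))  # stable log(1 + e^eta)
  -(sum(y * eta) - sum(log1pexp))
}
X <- cbind(1, loan_data$interest)  # design matrix with intercept column

-neg_Q(coef(blr1), X, loan_data$loan_status)  # should match logLik(blr1)
logLik(blr1)

fit <- optim(c(0, 0), neg_Q, X = X, y = loan_data$loan_status, method = "BFGS")
fit$par  # should be close to coef(blr1): roughly -21.35 and 0.12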

Multivariable logistic regression model

The structural form of the multivariable logistic regression model (this example uses two X variables):

\(logit( E[Y_i | X_i]) = logit(p_i) = ln(\frac{p_i}{1 - p_i}) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}\)


Interpreting the coefficients

Our coefficients indicate the linear effect on the log-odds of a positive outcome. A negative coefficient decreases the log-odds and a positive coefficient increases the log-odds.

We can easily extend the manipulations above to get a formula for the odds of an event in terms of the coefficients \(\beta_0, \beta_1, ..., \beta_p\):

\[ \begin{align*} \frac{P(y = 1)}{P(y = 0)} &= e^{\beta_0 + \beta_1x_1 + ... + \beta_px_p} \\ &= e^{\beta_0}(e^{\beta_1})^{x_1}...(e^{\beta_p})^{x_p} \end{align*} \]

  • \(e^{\beta_0}\) is the odds of the event when all input variables are zero
  • \(e^{\beta_i}\) is the multiplier of the odds associated with a one-unit increase in \(x_i\) (for example, one extra unit of interest), all else being equal. Because of this multiplicative effect, we call it the odds ratio for \(x_i\); a worked example follows below.
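For instance, returning to the loan model fitted earlier, the odds ratios are obtained by exponentiating the coefficients (the confidence-interval line uses profile likelihood and may take a moment):

exp(coef(blr1))     # e^{beta_0} and e^{beta_1}; from the printed output,
                    # exp(0.12066) is about 1.13 per extra unit of interest
exp(confint(blr1))  # 95% confidence intervals on the odds-ratio scale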

Categorical Explanatory Variable (2 Categories)

  • attrition_status is modelled as a function of gender.
  • The reference group is female¹.
attrition_data<-readr::read_csv("data-attrition.csv") |> 
  mutate(attrition_status= ifelse(attrition=="yes",1,0)) |> 
  mutate(department=as.factor(department))

model1 <- glm(attrition_status ~ gender, family=binomial(link='logit'), data=attrition_data)
summary(model1)
#> 
#> Call:
#> glm(formula = attrition_status ~ gender, family = binomial(link = "logit"), 
#>     data = attrition_data)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  -1.7507     0.1161 -15.073   <2e-16 ***
#> gendermale    0.1656     0.1467   1.128    0.259    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1298.6  on 1469  degrees of freedom
#> Residual deviance: 1297.3  on 1468  degrees of freedom
#> AIC: 1301.3
#> 
#> Number of Fisher Scoring iterations: 4
  • Intercept: The log-odds of attrition for women is -1.750698.
  • gendermale:
    • The difference in the log-odds of attrition between men and women is 0.1655528, i.e. the chance of attrition is slightly higher for men than for women.
    • Given \(p = 0.259 > 0.05\), we cannot reject the null hypothesis (\(b_1=0\)) that there is no difference in the log-odds between men and women.
    • Men have 1.18004522 times the odds² of attrition compared to women (i.e. 18 %³), although this difference is not statistically significant.
  • Null deviance is analogous to SSTot in linear regression, and residual deviance to SSResid.
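The same comparison can be read on the probability scale via plogis(), the inverse logit (a quick sketch using model1 from above):

plogis(coef(model1)[["(Intercept)"]])  # P(attrition | female), about 0.15
plogis(sum(coef(model1)))              # P(attrition | male), about 0.17
exp(coef(model1)[["gendermale"]])      # odds ratio, about 1.18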


Categorical Explanatory Variable (More than 2 Categories)

  • attrition_status is modeled as a function of department.

  • The default reference group is human resources.

model2 <- glm(attrition_status ~ department, family=binomial(link='logit'), 
              data=na.omit(attrition_data))
summary(model2)
#> 
#> Call:
#> glm(formula = attrition_status ~ department, family = binomial(link = "logit"), 
#>     data = na.omit(attrition_data))
#> 
#> Coefficients:
#>                                Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)                    -1.44692    0.32084  -4.510 6.49e-06 ***
#> departmentresearch_development -0.38175    0.33417  -1.142    0.253    
#> departmentsales                 0.09941    0.34152   0.291    0.771    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1298.6  on 1469  degrees of freedom
#> Residual deviance: 1288.1  on 1467  degrees of freedom
#> AIC: 1294.1
#> 
#> Number of Fisher Scoring iterations: 4
  • Intercept: the log-odds of attrition for someone in human resources (the reference group) is -1.446919.
  • For each of the other departments, the coefficient tells us how much larger or smaller the log-odds of attrition is compared to the reference group.
  • Where \(p>0.05\), the null hypothesis \(b_k=0\) cannot be rejected, i.e. there is insufficient statistical evidence that the chance of attrition differs from that of the reference group; an overall test of the department effect is sketched below.
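Because the z-tests above only compare each department to the reference, an overall likelihood-ratio test of the department effect is worth running; it tests the drop in deviance (1298.6 - 1288.1 = 10.5 on 2 df) shown in the output:

anova(model2, test = "Chisq")
# or by hand from the printed deviances:
pchisq(1298.6 - 1288.1, df = 2, lower.tail = FALSE)  # about 0.005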


Now let’s change the reference group to research_development (the second level):

  • this is still based on data without missing department rows.

levels(attrition_data$department)
#> [1] "human_resources"      "research_development" "sales"
attrition_data$departmentR <- relevel(attrition_data$department, ref=2)
levels(attrition_data$departmentR) # new col is not attached so must use data$
#> [1] "research_development" "human_resources"      "sales"
model3 <- glm(attrition_status ~ departmentR, family=binomial(link='logit'), 
              data=na.omit(attrition_data))
summary(model3)
#> 
#> Call:
#> glm(formula = attrition_status ~ departmentR, family = binomial(link = "logit"), 
#>     data = na.omit(attrition_data))
#> 
#> Coefficients:
#>                            Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)                -1.82866    0.09342 -19.576  < 2e-16 ***
#> departmentRhuman_resources  0.38175    0.33417   1.142  0.25330    
#> departmentRsales            0.48116    0.14974   3.213  0.00131 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1298.6  on 1469  degrees of freedom
#> Residual deviance: 1288.1  on 1467  degrees of freedom
#> AIC: 1294.1
#> 
#> Number of Fisher Scoring iterations: 4



Continuous Explanatory Variable

  • attrition_status is modelled as a function of age.
  • There is no reference group as such.
  • R can deal with missing data in a continuous explanatory variable, i.e. it throws no error: by default, glm() simply drops incomplete rows (na.action = na.omit).
model4 <- glm(attrition_status ~ age, family=binomial(link='logit'), data=attrition_data)
summary(model4)
#> 
#> Call:
#> glm(formula = attrition_status ~ age, family = binomial(link = "logit"), 
#>     data = attrition_data)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  0.20620    0.30599   0.674      0.5    
#> age         -0.05225    0.00870  -6.006 1.91e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1298.6  on 1469  degrees of freedom
#> Residual deviance: 1259.1  on 1468  degrees of freedom
#> AIC: 1263.1
#> 
#> Number of Fisher Scoring iterations: 4
  • Intercept: The log-odds of attrition when \(age = 0\) is 0.206199.
  • age:
    • For every one-unit increase in age, the log-odds of attrition decrease by 0.0522501, i.e. the chance of attrition decreases as an employee's age increases.
    • Given \(p < 0.05\), we can reject the null hypothesis (\(b_1=0\)) that a one-unit increase in age does not affect the chance of attrition.
    • For every one-unit increase in age, the odds of attrition are 0.9490914 times the odds of those one age unit younger (i.e. -5.091 %); see the sketch below for the corresponding probabilities.
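To make this more tangible, the model can be turned into a predicted probability at a given age (a sketch using model4; the age of 30 is just an illustrative value):

predict(model4, newdata = data.frame(age = 30), type = "response")
# by hand from the printed coefficients:
plogis(0.20620 - 0.05225 * 30)  # about 0.20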

  1. By default, R assigns the reference group based on alphabetical order of the factor levels.

  2. log-odds to odds ratio: exp(log-odds).

  3. odds ratio to %: (OR - 1) * 100.