Logistic regression is a classification algorithm in which the dependent variable is binary ([0, 1], [no, yes], [female, male], [not happened, happened], [not satisfied, satisfied], …).
\[ y = \begin{cases} 1, & \text{with probability p} \\ 0, & \text{with probability 1 - p} \end{cases} \]
Linear regression does not work here because it cannot be constrained to predict 0 or 1: it predicts a continuous dependent variable that can fall outside the range between 0 and 1. Given the independent variables (continuous, discrete, and/or factor), the logistic regression model estimates coefficients that predict the log of the odds (\(\frac{p(1)}{p(0)}\)), called the logit, as a linear function of the \(X_i\) (independent variables). The equation is: \[ log(odds) = \alpha + \beta \cdot X_i + \epsilon \] \[ odds = \frac{p}{1-p},~~~p = \frac{odds}{1 + odds} \] Recall that in a Bernoulli trial (only two possible outcomes), if p(yes) = 0.60, then p(no) = 1 - p(yes) = 0.40.
Therefore, exponentiating log(odds) gives the odds: \(odds = e^{\alpha + \beta \cdot X_i + \epsilon}\)
We can convert the odds to p by: \(probability~of~1~(yes) = \frac {e^{\alpha + \beta \cdot X_i + \epsilon}}{1+e^{\alpha + \beta \cdot X_i + \epsilon}}\)
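As a quick numerical illustration of this conversion chain (the intercept, slope, and x value below are made-up numbers, not estimates from any model fitted in this post):
# Minimal sketch of the logit -> odds -> probability conversion
# (alpha, beta and x are hypothetical values, only to show the algebra)
alpha = -2; beta = 0.08; x = 40
log_odds = alpha + beta * x      # linear predictor (logit)
odds = exp(log_odds)             # odds = p / (1 - p)
p = odds / (1 + odds)            # probability of "yes"
c(p, plogis(log_odds))           # plogis() performs the same conversion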
Probit regression analysis is similar to logistic regression. The probit model estimates the coefficients in terms of changes in z-scores (quantile values along the x-axis of the standard normal distribution) of the response variable. In probit regression, the linear combination of the predictors equals the inverse standard normal distribution of the probability of the binary dependent variable (\(z~score, \phi^{-1}(p)\)). The coefficients therefore indicate the change in the z-score of a positive (yes, 1) response, which can be converted to probabilities (ranging between 0 and 1) to explain the coefficients in terms of the probability of the dependent variable taking the value 1 (yes, positive).
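A minimal sketch of the probit link, using a made-up linear predictor rather than estimates from any fitted model: pnorm() converts a z-score into a probability, and qnorm() is its inverse.
# z is a hypothetical linear predictor (z-score) from a probit model
z = -0.5 + 0.03 * 40
pnorm(z)          # probability of a positive (1) response
qnorm(pnorm(z))   # qnorm() is the inverse link, recovering z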
The choice between logit and probit depends largely on the researcher's preference; the two typically lead to very similar conclusions.
In linear regression (OLS), we use the F-statistic and its p-value to check the model fit. Logit and probit models estimate the coefficients by the maximum likelihood method (the estimates with the highest probability of producing the observed data). Subtracting -2 ln(likelihood of the full model) from -2 ln(likelihood of the null model with only an intercept) gives the likelihood ratio, \(-2ln \frac{Likelihood_{null}}{Likelihood_{full}} = \chi^2\), which is interpreted using its p-value.
Although not as directly useful as the \(R^2\) of linear regression, we can also compute pseudo-\(R^2\) measures, such as McFadden's \(R^2\) and Nagelkerke's \(R^2\).
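As a sketch of these quantities (using a small simulated data set, not the data analyzed below), the likelihood-ratio \(\chi^2\) and McFadden's \(R^2\) can be computed by hand from the log-likelihoods of the full and null models:
# Toy example: likelihood-ratio chi-square and McFadden's pseudo-R^2
set.seed(2)
toy = data.frame(x = 1:30)
toy$y = rbinom(30, 1, plogis(-3 + 0.2 * toy$x))
full = glm(y ~ x, family = binomial, data = toy)
null = glm(y ~ 1, family = binomial, data = toy)        # intercept-only model
lr = as.numeric(-2 * (logLik(null) - logLik(full)))     # -2 ln(L_null / L_full)
pchisq(lr, df = 1, lower.tail = FALSE)                  # p-value of the chi-square
as.numeric(1 - logLik(full) / logLik(null))             # McFadden's R^2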
pacman::p_load(modeldata, car, ggpubr, rsample, broom, regressinator, jtools, tidyverse, caret, vip)
options(repr.plot.width=10, repr.plot.height=7) # for notebook figure width
theme_set(new = theme_bw()) # to set theme of ggplot
Let us examine what happens if we fit a linear model for a binary dependent variable.
# Data set
set.seed(1)
x = c(21:70)
y = c(rep(0, times = 15), sample(c(0,1), size = 20, replace = TRUE), rep(1, times = 15))
dt = data.frame(x, y)
head(dt)
## x y
## 1 21 0
## 2 22 0
## 3 23 0
## 4 24 0
## 5 25 0
## 6 26 0
# Linear regression model
linear_fit = lm(y ~ x, dt)
# Logistic regression model
logit_fit = glm(y ~ x, dt, family = binomial(link = 'logit'))
summ(linear_fit)
## MODEL INFO:
## Observations: 50
## Dependent Variable: y
## Type: OLS linear regression
##
## MODEL FIT:
## F(1,48) = 67.01, p = 0.00
## R² = 0.58
## Adj. R² = 0.57
##
## Standard errors:OLS
## ------------------------------------------------
## Est. S.E. t val. p
## ----------------- ------- ------ -------- ------
## (Intercept) -0.74 0.15 -4.81 0.00
## x 0.03 0.00 8.19 0.00
## ------------------------------------------------
# exp = TRUE to get the beta as odds
summ(logit_fit, exp = TRUE)
## MODEL INFO:
## Observations: 50
## Dependent Variable: y
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(1) = 38.33, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.72
## Pseudo-R² (McFadden) = 0.56
## AIC = 34.66, BIC = 38.48
##
## Standard errors:MLE
## ------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ----------------- ----------- ------ ------- -------- ------
## (Intercept) 0.00 0.00 0.01 -3.65 0.00
## x 1.23 1.10 1.38 3.69 0.00
## ------------------------------------------------------------
# Predict the data using the models. type = 'response' to get the p(y)|x [probability of y for given x].
linear_predict = predict(linear_fit)
logit_predict = predict(logit_fit, type = 'response')
# Visualize x vs y from dt
ggplot(dt, aes(x,y)) +
geom_point(size = 4, alpha = 0.6) +
geom_line(aes(y = linear_predict), color = 'blue', linewidth = 2) +
geom_line(aes(y = logit_predict), color = 'red', linewidth = 2) +
scale_y_continuous(breaks = c(0, 0.5, 1)) +
labs(title = 'Association membership vs. age', x = 'Age (year)',
y = 'Membership: no = 0, yes = 1 \n Linear predictions \n Logistic prediction as probability')
Here, we see that membership tends to be 1 (yes) as age increases. The linear predictions (blue line) fall outside the binary boundaries (below 0 and above 1), which is unrealistic. Another critical observation is that the relationship is not linear: each unit increase in age does not produce an equal change in membership. Now look at the S-shaped (red) probability curve obtained from the logistic regression coefficients [odds = exp(coef), p = odds/(1 + odds), since odds = p/(1 - p), where p = probability of happening and 1 - p = probability of not happening], because we have modeled logit = log(odds) as a linear combination of x. The red curve shows a gradual increase in the probability of being 1, followed by a sharp increase between roughly 40 and 55 years. Therefore, if we want to target an age group to increase the number of association members, we need to target this 40-55 age group.
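To see this steep region numerically, we can ask the fitted model for predicted probabilities at a few ages; a minimal sketch (the exact values depend on the simulated data and the seed):
# Predicted membership probability at selected ages (toy data from above)
predict(logit_fit, newdata = data.frame(x = c(30, 40, 45, 50, 55, 65)),
        type = 'response')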
Now observe the diagnostic plots of both models: left (linear) and right (logistic). First, try the traditional diagnostic plots, which do not help much in understanding the relationship between x and y.
par(mfrow = c(1, 2))
plot(linear_fit, which = 1)
plot(logit_fit, which = 1)
plot(linear_fit, which = 2)
plot(logit_fit, which = 2)
par(mfrow = c(1, 1))
# Box-Tidwell test to check the assumption of linearity of log(odds) with the predictors
# H0: the relationship is linear. Here p > 0.05, so H0 cannot be rejected: log(odds) is linear in x.
boxTidwell(y = y, x1 = x)
## MLE of lambda Score Statistic (t) Pr(>|t|)
## 1.7195 1.0142 0.3157
##
## iterations = 4
Now, we will plot the randomized quantile residuals against the predicted (fitted) values. A well-fitted model should show a homogeneous distribution of the quantile residuals on both sides of the 0.5 horizontal line. We will use augment_quantile() from the regressinator package, which (like augment() from the broom package) extracts the required values from the model output and adds them, together with randomized quantile residuals, to the original data set as columns. Please note that tidy() also extracts model output values but does not work on non-tidy objects in R.
augment_quantile(linear_fit) %>% head()
## # A tibble: 6 × 9
## y x .fitted .resid .hat .sigma .cooksd .std.resid .quantile.resid
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 21 -0.186 0.186 0.0776 0.331 0.0146 0.589 0.724
## 2 0 22 -0.160 0.160 0.0730 0.331 0.0100 0.504 0.688
## 3 0 23 -0.133 0.133 0.0686 0.331 0.00649 0.420 0.616
## 4 0 24 -0.107 0.107 0.0644 0.332 0.00388 0.336 0.6
## 5 0 25 -0.0804 0.0804 0.0604 0.332 0.00205 0.252 0.576
## 6 0 26 -0.0541 0.0541 0.0565 0.332 0.000859 0.169 0.572
# Residual vs. fitted plot
a = augment_quantile(logit_fit) %>%
ggplot(aes(x = .fitted, y = .quantile.resid)) +
geom_point() +
geom_hline(yintercept = 0.5) +
labs(x = "Predictor", y = "Randomized quantile residual",title = "Residual vs. fitted")
# Check linearity of the logit: plot the predictor against the log(odds) of the predicted probabilities
dt$log_odds = log(logit_predict/(1 - logit_predict))
b = ggplot(dt, aes(x = x, y = log_odds)) +
geom_point() +
geom_smooth() +
labs(x = "Predictor", y = "Logit", title = "Age vs. logit of predicted y")
# Merging the both figures
ggarrange(a, b, labels = c('a', 'b'))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Attrition: a situation where employees leave an organization's workforce faster than new ones are recruited.
attrition dataset of the modeldata package: Attrition is the response variable, coded as Yes and No. The other variables include:
- Personal information: Age, Gender, Marital Status.
- Job information: Job Role, Department, Job Level, Job Satisfaction.
- Compensation: Monthly Income, Stock Option Level.
- Performance metrics: Performance Rating, Training Times Last Year.
- Work-life balance: Work Life Balance, Overtime.
attrition = modeldata::attrition
str(attrition)
## 'data.frame': 1470 obs. of 31 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human_Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : Ord.factor w/ 5 levels "Below_College"<..: 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human_Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EnvironmentSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare_Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : Ord.factor w/ 4 levels "Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: Ord.factor w/ 4 levels "Low"<"Medium"<..: 1 4 2 3 4 3 1 2 2 2 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : Ord.factor w/ 4 levels "Bad"<"Good"<"Better"<..: 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
This dataset has 31 variables and 1470 observations. We will use Attrition as the dependent variable and Age, MaritalStatus, MonthlyIncome and OverTime as independent variables. Select only these variables and name the resulting data frame df. If you instead download the dataset as a .csv file, read it with read.csv() and convert the MaritalStatus and OverTime variables to factors (see the sketch below). We have taken the dataset directly from the package, so its variables are already well formatted.
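A minimal sketch of that CSV route, assuming a downloaded file named attrition.csv (the file name is hypothetical; this step is not needed when using the package data):
# Only needed if reading the data from a downloaded CSV instead of the package
df_csv = read.csv('attrition.csv') %>%                              # hypothetical file
  mutate(across(c(Attrition, MaritalStatus, OverTime), as.factor)) %>%
  select(Attrition, Age, MaritalStatus, MonthlyIncome, OverTime)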
df = attrition %>% select(Attrition, Age, MaritalStatus, MonthlyIncome, OverTime)
head(df)
## Attrition Age MaritalStatus MonthlyIncome OverTime
## 1 Yes 41 Single 5993 Yes
## 2 No 49 Married 5130 No
## 4 Yes 37 Single 2090 Yes
## 5 No 33 Married 2909 Yes
## 7 No 27 Married 3468 No
## 8 No 32 Single 3068 No
We want to see whether attrition depends on age, marital status, monthly income, and overtime.
logit = glm(Attrition ~ Age + MaritalStatus + MonthlyIncome + OverTime,
family = binomial(link = 'logit'), data = df)
# A continuous dependent variable could also be dichotomized (e.g., at its median) and modeled with a logit model.
summ(logit, confint = T, robust = F, vif = T)
## MODEL INFO:
## Observations: 1470
## Dependent Variable: Attrition
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(5) = 181.75, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.20
## Pseudo-R² (McFadden) = 0.14
## AIC = 1128.83, BIC = 1160.59
##
## Standard errors:MLE
## -------------------------------------------------------------------------
## Est. 2.5% 97.5% z val. p VIF
## -------------------------- ------- ------- ------- -------- ------ ------
## (Intercept) -1.16 -1.90 -0.42 -3.08 0.00
## Age -0.03 -0.05 -0.01 -2.94 0.00 1.25
## MaritalStatusMarried 0.29 -0.16 0.73 1.27 0.20 1.03
## MaritalStatusSingle 1.12 0.68 1.56 5.02 0.00 1.03
## MonthlyIncome -0.00 -0.00 -0.00 -3.97 0.00 1.23
## OverTimeYes 1.47 1.16 1.77 9.45 0.00 1.03
## -------------------------------------------------------------------------
# Levels of categories in factors
table(df$Attrition)
##
## No Yes
## 1233 237
table(df$MaritalStatus)
##
## Divorced Married Single
## 327 673 470
table(df$OverTime)
##
## No Yes
## 1054 416
summary(df$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 30.00 36.00 36.92 43.00 60.00
summary(df$MonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1009 2911 4919 6503 8379 19999
sd(df$MonthlyIncome)
## [1] 4707.957
By default, the alphabetically first level of a factor acts as the reference level. The coefficients of the other levels of that variable indicate how much more or less effect they have relative to this reference.
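A quick check of the reference level, and a sketch of how it could be changed with relevel() (re-basing to 'Married' here is only an illustration, not something used below):
levels(df$MaritalStatus)                             # 'Divorced' is the reference level
levels(relevel(df$MaritalStatus, ref = 'Married'))   # illustrative re-basing only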
We see that the variables with p < 0.05 have a significant effect on attrition. Thus, for each one-year increase in age, the log(odds) of attrition decreases by 0.03, holding the other variables constant. The log(odds) of attrition for singles is 1.12 higher than for divorcees (the reference level), holding all other variables constant. Similarly, personnel who work overtime attrit more than personnel who do not.
Notice that income has a significant effect, yet its coefficient prints as 0.00. This is because the unit of income (one dollar) is very small relative to its range [1009 to 19999], so a one-dollar change produces only a tiny change in the log(odds). Converting this variable to thousands of dollars would make the coefficient larger, as sketched below. This illustrates that regression coefficients are scale dependent (not scale free).
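A minimal sketch of that rescaling: refitting the same model with income expressed in thousands of dollars changes only the scale of the coefficient, not the model fit or significance.
# Same model with MonthlyIncome expressed in thousands of dollars
logit_k = glm(Attrition ~ Age + MaritalStatus + I(MonthlyIncome / 1000) + OverTime,
              family = binomial(link = 'logit'), data = df)
coef(logit_k)   # the income coefficient is now per $1000 instead of per $1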
Pseudo-\(R^2\) (Cragg-Uhler) is 0.20, which indicates that 20% of the variance in workforce attrition can be explained by the model. However, this statistic is not as straightforward as the \(R^2\) of OLS. McFadden's \(R^2\) is somewhat more conservative. We can also calculate Nagelkerke's \(R^2\) using the fmsb::NagelkerkeR2(logit) function, as below.
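A one-line sketch, assuming the fmsb package is installed (it is not loaded in the setup chunk above):
fmsb::NagelkerkeR2(logit)   # returns the sample size and Nagelkerke's R^2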
However, the logit coefficients indicate changes in the response variable in terms of log(odds). Interpretation becomes easier if we convert these to odds and eventually to probabilities.
coefs = c('Married' = 'MaritalStatusMarried', 'Single' = 'MaritalStatusSingle', 'Income' = 'MonthlyIncome', 'Overtime' = 'OverTimeYes')
plot_summs(logit, coefs = coefs, plot.distributions = T)
# Effect plot
effect_plot(logit, pred = MaritalStatus, interval = T, plot.points = T) +
theme_apa()
# However, if we convert the coefficients to odds by exponentiating the logit coefficients, the interpretation becomes easier.
tidy(exp(coef(logit)))
## Warning in tidy.numeric(exp(coef(logit))): 'tidy.numeric' is deprecated.
## See help("Deprecated")
## # A tibble: 6 × 2
## names x
## <chr> <dbl>
## 1 (Intercept) 0.312
## 2 Age 0.971
## 3 MaritalStatusMarried 1.33
## 4 MaritalStatusSingle 3.07
## 5 MonthlyIncome 1.00
## 6 OverTimeYes 4.34
Thus, the exponentiated coefficients indicate the magnitude of change in the odds of a positive response (yes, 1). It is even easier to explain if we convert these odds to probabilities, and we know how to do that.
# Conversion from logit to probability
log_odds = coef(logit)
odds = exp(coef(logit))
probability = odds/(1+odds)
plogis = plogis(coef(logit)) # built-in function
coef_all = data.frame(log_odds, odds, probability, plogis)
coef_all
## log_odds odds probability plogis
## (Intercept) -1.163531e+00 0.3123813 0.2380263 0.2380263
## Age -2.932094e-02 0.9711048 0.4926703 0.4926703
## MaritalStatusMarried 2.879825e-01 1.3337340 0.5715021 0.5715021
## MaritalStatusSingle 1.123215e+00 3.0747235 0.7545846 0.7545846
## MonthlyIncome -9.818986e-05 0.9999018 0.4999755 0.4999755
## OverTimeYes 1.467721e+00 4.3393345 0.8127107 0.8127107
We see that negative logits give p < 0.5, because \(e^0 = 1\) and the corresponding probability is 1/(1+1) = 0.5. Similarly, if the logit > 0, then odds > 1 and p > 0.5.
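A tiny numerical check of that boundary using plogis():
plogis(0)      # logit = 0  -> p = 0.5
plogis(-1.16)  # negative logit -> p < 0.5
plogis(1.47)   # positive logit -> p > 0.5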
However, we can easily predict the probability of a positive (yes, 1) response using predict(logit, type = 'response'). This produces a predicted probability for each observation. We can also predict the probability of the response for a specific combination of independent variables.
summary(predict(logit, type = 'response'))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.008462 0.066593 0.110660 0.161224 0.221994 0.678212
predict(logit, data.frame(Age = 0, MaritalStatus = 'Divorced', OverTime = 'Yes', MonthlyIncome = 0), type = 'response')
## 1
## 0.5754665
# For this combination of data, there is a 57.55% probability that the respondent will quit the job.
predict(logit, data.frame(Age = 0, MaritalStatus = 'Divorced', OverTime = 'No', MonthlyIncome = 0), type = 'response')
## 1
## 0.2380263
# For this combination of data, there is a 23.80% probability that the respondent will quit the job.
We have used OverTime = Yes and No in the first and second predictions, respectively, keeping all other variables at zero or their reference levels. The second prediction, 23.80%, equals the probability converted from the intercept alone: it is the baseline probability of attrition when all independent variables are at zero or their reference level, i.e., for workers without overtime. Switching OverTime to Yes adds the OverTime coefficient (1.47) to the linear predictor, and the predicted probability rises from 23.80% to 57.55%, an increase of about 33.75 percentage points. Note that the converted probability of the OverTime coefficient in the table above (0.8127) is the coefficient alone passed through the logistic function, not a predicted probability of attrition; differences in probability should therefore be computed from predictions, as done here, rather than by adding or subtracting transformed coefficients.
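A minimal sketch of computing that overtime effect directly as a difference of predicted probabilities, using the same reference combination as above:
# Overtime effect as a difference in predicted probabilities
new_data = data.frame(Age = 0, MaritalStatus = 'Divorced', MonthlyIncome = 0,
                      OverTime = c('No', 'Yes'))
p_hat = predict(logit, newdata = new_data, type = 'response')
diff(p_hat)   # increase in attrition probability due to overtime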
Let us use probit analysis to see the differences in interpretation.
probit = glm(Attrition ~ Age + MaritalStatus + MonthlyIncome + OverTime,
family = binomial(link = 'probit'), data = df) # nolint
summ(probit, confint = T, robust = F, vif = T)
## MODEL INFO:
## Observations: 1470
## Dependent Variable: Attrition
## Type: Generalized linear model
## Family: binomial
## Link function: probit
##
## MODEL FIT:
## χ²(5) = 175.63, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.19
## Pseudo-R² (McFadden) = 0.14
## AIC = 1134.96, BIC = 1166.71
##
## Standard errors:MLE
## -------------------------------------------------------------------------
## Est. 2.5% 97.5% z val. p VIF
## -------------------------- ------- ------- ------- -------- ------ ------
## (Intercept) -0.76 -1.16 -0.36 -3.72 0.00
## Age -0.02 -0.03 -0.00 -2.82 0.00 1.27
## MaritalStatusMarried 0.17 -0.06 0.41 1.46 0.14 1.02
## MaritalStatusSingle 0.62 0.38 0.86 5.15 0.00 1.02
## MonthlyIncome -0.00 -0.00 -0.00 -3.86 0.00 1.26
## OverTimeYes 0.81 0.64 0.98 9.31 0.00 1.02
## -------------------------------------------------------------------------
The probit coefficient of age is -0.02, which indicates that a one-year increase in age decreases the z-score of the response variable (attrition) by 0.02. Now, convert the z-scores (probit coefficients) to probabilities and add them as a probit column to the coef_all table to compare.
coef_all$probit = pnorm(coef(probit))
coef_all
## log_odds odds probability plogis probit
## (Intercept) -1.163531e+00 0.3123813 0.2380263 0.2380263 0.2242052
## Age -2.932094e-02 0.9711048 0.4926703 0.4926703 0.4939105
## MaritalStatusMarried 2.879825e-01 1.3337340 0.5715021 0.5715021 0.5692595
## MaritalStatusSingle 1.123215e+00 3.0747235 0.7545846 0.7545846 0.7323926
## MonthlyIncome -9.818986e-05 0.9999018 0.4999755 0.4999755 0.4999808
## OverTimeYes 1.467721e+00 4.3393345 0.8127107 0.8127107 0.7903507
We see that the probabilities computed using the logit and probit models are similar, with minor differences, because they are calculated using different link functions (the logistic versus the standard normal distribution).
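As a final sketch, we can also compare the fitted probabilities of the two models observation by observation; they should track each other closely:
# Compare fitted probabilities from the logit and probit models
head(data.frame(p_logit  = predict(logit,  type = 'response'),
                p_probit = predict(probit, type = 'response')))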