Logistic regression is a classification algorithm in which the dependent variable is binary ([0, 1], [no, yes], [female, male], [not happened, happened], [not satisfied, satisfied], …).
\[ y = \begin{cases} 1, & \text{with probability p} \\ 0, & \text{with probability 1 - p} \end{cases} \]
Linear regression does not work here because it cannot be constrained to predict 0 or 1: it predicts a continuous dependent variable that can fall outside the range between 0 and 1. Given the independent variables (continuous, discrete, and/or factor), the logistic regression model estimates coefficients that predict the log of the odds (\(\frac{p(1)}{p(0)}\)), called the logit, as a linear function of the \(X_i\) (independent variables). The equation is: \[ log(odds) = \alpha + \beta \cdot X_i + \epsilon \] \[ odds = \frac{p}{1-p},~~~p = \frac{odds}{1 + odds} \] Recall that in a Bernoulli trial (only two possible outcomes), if p(yes) = 0.60, then p(no) = 1 - p(yes) = 0.40.
Therefore, exponentiating log(odds) gives the odds: \(odds = e^{\alpha + \beta \cdot X_i + \epsilon}\)
We can convert the odds to p by: \(probability~of~1~(yes) = \frac {e^{\alpha + \beta \cdot X_i + \epsilon}}{1+e^{\alpha + \beta \cdot X_i + \epsilon}}\)
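As a quick numerical illustration of this conversion chain (the intercept, slope, and x value below are made-up numbers, not estimates from any model fitted in this post):
# Minimal sketch of the logit -> odds -> probability conversion
# (alpha, beta and x are hypothetical values, only to show the algebra)
alpha = -2; beta = 0.08; x = 40
log_odds = alpha + beta * x      # linear predictor (logit)
odds = exp(log_odds)             # odds = p / (1 - p)
p = odds / (1 + odds)            # probability of "yes"
c(p, plogis(log_odds))           # plogis() performs the same conversion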
Probit regression analysis is similar to logistic regression. The probit model estimates the coefficients in terms of changes in z-scores (quantile values along the x-axis of the standard normal distribution) of the response variable. In probit regression, the linear combination of the predictors equals the inverse standard normal distribution of the probability of the binary dependent variable (\(z~score, \phi^{-1}(p)\)). The coefficients therefore indicate the change in the z-score of a positive (yes, 1) response, which can be converted to probabilities (ranging between 0 and 1) to explain the coefficients in terms of the probability of the dependent variable taking the value 1 (yes, positive).
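A minimal sketch of the probit link, using a made-up linear predictor rather than estimates from any fitted model: pnorm() converts a z-score into a probability, and qnorm() is its inverse.
# z is a hypothetical linear predictor (z-score) from a probit model
z = -0.5 + 0.03 * 40
pnorm(z)          # probability of a positive (1) response
qnorm(pnorm(z))   # qnorm() is the inverse link, recovering z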
The choice between logit and probit depends largely on the researcher's preference; the two typically lead to very similar conclusions.
In linear regression (OLS), we use the F-statistic and its p-value to check the model fit. Logit and probit models estimate the coefficients by the maximum likelihood method (the estimates with the highest probability of producing the observed data). Subtracting -2 ln(likelihood of the full model) from -2 ln(likelihood of the null model with only an intercept) gives the likelihood ratio, \(-2ln \frac{Likelihood_{null}}{Likelihood_{full}} = \chi^2\), which is interpreted using its p-value.
Although not as directly useful as the \(R^2\) of linear regression, we can also compute pseudo-\(R^2\) measures, such as McFadden's \(R^2\) and Nagelkerke's \(R^2\).
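As a sketch of these quantities (using a small simulated data set, not the data analyzed below), the likelihood-ratio \(\chi^2\) and McFadden's \(R^2\) can be computed by hand from the log-likelihoods of the full and null models:
# Toy example: likelihood-ratio chi-square and McFadden's pseudo-R^2
set.seed(2)
toy = data.frame(x = 1:30)
toy$y = rbinom(30, 1, plogis(-3 + 0.2 * toy$x))
full = glm(y ~ x, family = binomial, data = toy)
null = glm(y ~ 1, family = binomial, data = toy)        # intercept-only model
lr = as.numeric(-2 * (logLik(null) - logLik(full)))     # -2 ln(L_null / L_full)
pchisq(lr, df = 1, lower.tail = FALSE)                  # p-value of the chi-square
as.numeric(1 - logLik(full) / logLik(null))             # McFadden's R^2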
pacman::p_load(modeldata, car, ggpubr, rsample, broom, regressinator, jtools, tidyverse, caret, vip)
options(repr.plot.width=10, repr.plot.height=7) # for notebook figure width
theme_set(new = theme_bw()) # to set theme of ggplot
Let us examine what happens if we fit a linear model for a binary dependent variable.
# Data set
set.seed(1)
x = c(21:70)
y = c(rep(0, times = 15), sample(c(0,1), size = 20, replace = TRUE), rep(1, times = 15))
dt = data.frame(x, y)
head(dt)
## x y
## 1 21 0
## 2 22 0
## 3 23 0
## 4 24 0
## 5 25 0
## 6 26 0
# Linear regression model
linear_fit = lm(y ~ x, dt)
# Logistic regression model
logit_fit = glm(y ~ x, dt, family = binomial(link = 'logit'))
summ(linear_fit)
## MODEL INFO:
## Observations: 50
## Dependent Variable: y
## Type: OLS linear regression
##
## MODEL FIT:
## F(1,48) = 67.01, p = 0.00
## R² = 0.58
## Adj. R² = 0.57
##
## Standard errors:OLS
## ------------------------------------------------
## Est. S.E. t val. p
## ----------------- ------- ------ -------- ------
## (Intercept) -0.74 0.15 -4.81 0.00
## x 0.03 0.00 8.19 0.00
## ------------------------------------------------
# exp = TRUE to get the beta as odds
summ(logit_fit, exp = TRUE)
## MODEL INFO:
## Observations: 50
## Dependent Variable: y
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(1) = 38.33, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.72
## Pseudo-R² (McFadden) = 0.56
## AIC = 34.66, BIC = 38.48
##
## Standard errors:MLE
## ------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ----------------- ----------- ------ ------- -------- ------
## (Intercept) 0.00 0.00 0.01 -3.65 0.00
## x 1.23 1.10 1.38 3.69 0.00
## ------------------------------------------------------------
# Predict the data using the models. type = 'response' to get the p(y)|x [probability of y for given x].
linear_predict = predict(linear_fit)
logit_predict = predict(logit_fit, type = 'response')
# Visualize x vs y from dt
ggplot(dt, aes(x,y)) +
geom_point(size = 4, alpha = 0.6) +
geom_line(aes(y = linear_predict), color = 'blue', linewidth = 2) +
geom_line(aes(y = logit_predict), color = 'red', linewidth = 2) +
scale_y_continuous(breaks = c(0, 0.5, 1)) +
labs(title = 'Association membership vs. age', x = 'Age (year)',
y = 'Membership: no = 0, yes = 1 \n Linear predictions \n Logistic prediction as probability')
Here, we see that membership tends to be 1 (yes) as age increases. The linear predictions (blue line) fall outside the binary boundaries (below 0 and above 1), which is unrealistic. Another critical observation is that the relationship is not linear: each unit increase in age does not produce an equal change in membership. Now look at the S-shaped (red) probability curve obtained from the logistic regression coefficients [odds = exp(coef), p = odds/(1 + odds), since odds = p/(1 - p), where p = probability of happening and 1 - p = probability of not happening], because we have modeled logit = log(odds) as a linear combination of x. The red curve shows a gradual increase in the probability of being 1, followed by a sharp increase between roughly 40 and 55 years. Therefore, if we want to target an age group to increase the number of association members, we need to target this 40-55 age group.
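To see this steep region numerically, we can ask the fitted model for predicted probabilities at a few ages; a minimal sketch (the exact values depend on the simulated data and the seed):
# Predicted membership probability at selected ages (toy data from above)
predict(logit_fit, newdata = data.frame(x = c(30, 40, 45, 50, 55, 65)),
        type = 'response')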
Now observe the diagnostic plots of both models: left (linear) and right (logistic). First, try the traditional diagnostic plots, which do not help much in understanding the relationship between x and y.
par(mfrow = c(1, 2))
plot(linear_fit, which = 1)
plot(logit_fit, which = 1)
plot(linear_fit, which = 2)
plot(logit_fit, which = 2)
par(mfrow = c(1, 1))
# Box-Tidwell test to check the assumption of linearity of log(odds) with the predictors
# H0: the relationship is linear. Here p > 0.05, so H0 cannot be rejected: log(odds) is linear in x.
boxTidwell(y = y, x1 = x)
## MLE of lambda Score Statistic (t) Pr(>|t|)
## 1.7195 1.0142 0.3157
##
## iterations = 4
Now, we will plot the randomized quantile residuals against the predicted (fitted) values. A well-fitted model should show a homogeneous distribution of the quantile residuals on both sides of the 0.5 horizontal line. We will use augment_quantile() from the regressinator package, which (like augment() from the broom package) extracts the required values from the model output and adds them, together with randomized quantile residuals, to the original data set as columns. Please note that tidy() also extracts model output values but does not work on non-tidy objects in R.
augment_quantile(linear_fit) %>% head()
## # A tibble: 6 × 9
## y x .fitted .resid .hat .sigma .cooksd .std.resid .quantile.resid
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 21 -0.186 0.186 0.0776 0.331 0.0146 0.589 0.724
## 2 0 22 -0.160 0.160 0.0730 0.331 0.0100 0.504 0.688
## 3 0 23 -0.133 0.133 0.0686 0.331 0.00649 0.420 0.616
## 4 0 24 -0.107 0.107 0.0644 0.332 0.00388 0.336 0.6
## 5 0 25 -0.0804 0.0804 0.0604 0.332 0.00205 0.252 0.576
## 6 0 26 -0.0541 0.0541 0.0565 0.332 0.000859 0.169 0.572
# Residual vs. fitted plot
a = augment_quantile(logit_fit) %>%
ggplot(aes(x = .fitted, y = .quantile.resid)) +
geom_point() +
geom_hline(yintercept = 0.5) +
labs(x = "Predictor", y = "Randomized quantile residual",title = "Residual vs. fitted")
# Check linearity of the logit: plot the predictor against the log(odds) of the predicted probabilities
dt$log_odds = log(logit_predict/(1 - logit_predict))
b = ggplot(dt, aes(x = x, y = log_odds)) +
geom_point() +
geom_smooth() +
labs(x = "Predictor", y = "Logit", title = "Age vs. logit of predicted y")
# Merging the both figures
ggarrange(a, b, labels = c('a', 'b'))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Attrition: a situation where employees leave an organization's workforce faster than new ones are recruited.
attrition dataset of the modeldata package: Attrition is the response variable, coded as Yes and No. The other variables include:
- Personal information: Age, Gender, Marital Status.
- Job information: Job Role, Department, Job Level, Job Satisfaction.
- Compensation: Monthly Income, Stock Option Level.
- Performance metrics: Performance Rating, Training Times Last Year.
- Work-life balance: Work Life Balance, Overtime.
attrition = modeldata::attrition
str(attrition)
## 'data.frame': 1470 obs. of 31 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human_Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : Ord.factor w/ 5 levels "Below_College"<..: 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human_Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EnvironmentSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare_Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : Ord.factor w/ 4 levels "Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: Ord.factor w/ 4 levels "Low"<"Medium"<..: 1 4 2 3 4 3 1 2 2 2 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : Ord.factor w/ 4 levels "Bad"<"Good"<"Better"<..: 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
This dataset has 31 variables and 1470 observations. We will use Attrition as the dependent variable and Age, MaritalStatus, MonthlyIncome and OverTime as independent variables. Select only these variables and name the resulting data frame df. If you instead download the dataset as a .csv file, read it with read.csv() and convert the MaritalStatus and OverTime variables to factors (see the sketch below). We have taken the dataset directly from the package, so its variables are already well formatted.
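A minimal sketch of that CSV route, assuming a downloaded file named attrition.csv (the file name is hypothetical; this step is not needed when using the package data):
# Only needed if reading the data from a downloaded CSV instead of the package
df_csv = read.csv('attrition.csv') %>%                              # hypothetical file
  mutate(across(c(Attrition, MaritalStatus, OverTime), as.factor)) %>%
  select(Attrition, Age, MaritalStatus, MonthlyIncome, OverTime)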
df = attrition %>% select(Attrition, Age, MaritalStatus, MonthlyIncome, OverTime)
head(df)
## Attrition Age MaritalStatus MonthlyIncome OverTime
## 1 Yes 41 Single 5993 Yes
## 2 No 49 Married 5130 No
## 4 Yes 37 Single 2090 Yes
## 5 No 33 Married 2909 Yes
## 7 No 27 Married 3468 No
## 8 No 32 Single 3068 No
We want to see whether attrition depends on age, marital status, monthly income, and overtime.
logit = glm(Attrition ~ Age + MaritalStatus + MonthlyIncome + OverTime,
family = binomial(link = 'logit'), data = df)
# A continuous dependent variable could also be dichotomized (e.g., at its median) and modeled with a logit model.
summ(logit, confint = T, robust = F, vif = T)
## MODEL INFO:
## Observations: 1470
## Dependent Variable: Attrition
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(5) = 181.75, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.20
## Pseudo-R² (McFadden) = 0.14
## AIC = 1128.83, BIC = 1160.59
##
## Standard errors:MLE
## -------------------------------------------------------------------------
## Est. 2.5% 97.5% z val. p VIF
## -------------------------- ------- ------- ------- -------- ------ ------
## (Intercept) -1.16 -1.90 -0.42 -3.08 0.00
## Age -0.03 -0.05 -0.01 -2.94 0.00 1.25
## MaritalStatusMarried 0.29 -0.16 0.73 1.27 0.20 1.03
## MaritalStatusSingle 1.12 0.68 1.56 5.02 0.00 1.03
## MonthlyIncome -0.00 -0.00 -0.00 -3.97 0.00 1.23
## OverTimeYes 1.47 1.16 1.77 9.45 0.00 1.03
## -------------------------------------------------------------------------
# Levels of categories in factors
table(df$Attrition)
##
## No Yes
## 1233 237
table(df$MaritalStatus)
##
## Divorced Married Single
## 327 673 470
table(df$OverTime)
##
## No Yes
## 1054 416
summary(df$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 30.00 36.00 36.92 43.00 60.00
summary(df$MonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1009 2911 4919 6503 8379 19999
sd(df$MonthlyIncome)
## [1] 4707.957
By default, the alphabetically first level of a factor acts as the reference level. The coefficients of the other levels of that variable indicate how much more or less effect they have relative to this reference.
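A quick check of the reference level, and a sketch of how it could be changed with relevel() (re-basing to 'Married' here is only an illustration, not something used below):
levels(df$MaritalStatus)                             # 'Divorced' is the reference level
levels(relevel(df$MaritalStatus, ref = 'Married'))   # illustrative re-basing only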
We see that the variables with p < 0.05 have a significant effect on attrition. Thus, for each one-year increase in age, the log(odds) of attrition decreases by 0.03, holding the other variables constant. The log(odds) of attrition for singles is 1.12 higher than for divorcees (the reference level), holding all other variables constant. Similarly, personnel who work overtime attrit more than personnel who do not.
Notice that income has a significant effect, yet its coefficient prints as 0.00. This is because the unit of income (one dollar) is very small relative to its range [1009 to 19999], so a one-dollar change produces only a tiny change in the log(odds). Converting this variable to thousands of dollars would make the coefficient larger, as sketched below. This illustrates that regression coefficients are scale dependent (not scale free).
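A minimal sketch of that rescaling: refitting the same model with income expressed in thousands of dollars changes only the scale of the coefficient, not the model fit or significance.
# Same model with MonthlyIncome expressed in thousands of dollars
logit_k = glm(Attrition ~ Age + MaritalStatus + I(MonthlyIncome / 1000) + OverTime,
              family = binomial(link = 'logit'), data = df)
coef(logit_k)   # the income coefficient is now per $1000 instead of per $1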
Pseudo-\(R^2\) (Cragg-Uhler) is 0.20, which indicates that 20% of the variance in workforce attrition can be explained by the model. However, this statistic is not as straightforward as the \(R^2\) of OLS. McFadden's \(R^2\) is somewhat more conservative. We can also calculate Nagelkerke's \(R^2\) using the fmsb::NagelkerkeR2(logit) function, as below.
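A one-line sketch, assuming the fmsb package is installed (it is not loaded in the setup chunk above):
fmsb::NagelkerkeR2(logit)   # returns the sample size and Nagelkerke's R^2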
However, the logit coefficients indicate changes in the response variable in terms of log(odds). Interpretation becomes easier if we convert these to odds and eventually to probabilities.
coefs = c('Married' = 'MaritalStatusMarried', 'Single' = 'MaritalStatusSingle', 'Income' = 'MonthlyIncome', 'Overtime' = 'OverTimeYes')
plot_summs(logit, coefs = coefs, plot.distributions = T)
# Effect plot
effect_plot(logit, pred = MaritalStatus, interval = T, plot.points = T) +
theme_apa()
# However, if we convert the coefficients to odds by exponentiating the logit coefficients, the interpretation becomes easier.
tidy(exp(coef(logit)))
## Warning in tidy.numeric(exp(coef(logit))): 'tidy.numeric' is deprecated.
## See help("Deprecated")
## # A tibble: 6 × 2
## names x
## <chr> <dbl>
## 1 (Intercept) 0.312
## 2 Age 0.971
## 3 MaritalStatusMarried 1.33
## 4 MaritalStatusSingle 3.07
## 5 MonthlyIncome 1.00
## 6 OverTimeYes 4.34
Thus, the exponentiated coefficients indicate the magnitude of change in the odds of a positive response (yes, 1). It is even easier to explain if we convert these odds to probabilities, and we know how to do that.
# Conversion from logit to probability
log_odds = coef(logit)
odds = exp(coef(logit))
probability = odds/(1+odds)
plogis = plogis(coef(logit)) # built-in function
coef_all = data.frame(log_odds, odds, probability, plogis)
coef_all
## log_odds odds probability plogis
## (Intercept) -1.163531e+00 0.3123813 0.2380263 0.2380263
## Age -2.932094e-02 0.9711048 0.4926703 0.4926703
## MaritalStatusMarried 2.879825e-01 1.3337340 0.5715021 0.5715021
## MaritalStatusSingle 1.123215e+00 3.0747235 0.7545846 0.7545846
## MonthlyIncome -9.818986e-05 0.9999018 0.4999755 0.4999755
## OverTimeYes 1.467721e+00 4.3393345 0.8127107 0.8127107
We see that negative logits give p < 0.5, because \(e^0 = 1\) and the corresponding probability is 1/(1+1) = 0.5. Similarly, if the logit > 0, then odds > 1 and p > 0.5.
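A tiny numerical check of that boundary using plogis():
plogis(0)      # logit = 0  -> p = 0.5
plogis(-1.16)  # negative logit -> p < 0.5
plogis(1.47)   # positive logit -> p > 0.5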
However, we can easily predict the probability of a positive (yes, 1) response using predict(logit, type = 'response'). This produces a predicted probability for each observation. We can also predict the probability of the response for a specific combination of independent variables.
summary(predict(logit, type = 'response'))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.008462 0.066593 0.110660 0.161224 0.221994 0.678212
predict(logit, data.frame(Age = 0, MaritalStatus = 'Divorced', OverTime = 'Yes', MonthlyIncome = 0), type = 'response')
## 1
## 0.5754665
# For this combination of data, there is a 57.55% probability that the respondent will quit the job.
predict(logit, data.frame(Age = 0, MaritalStatus = 'Divorced', OverTime = 'No', MonthlyIncome = 0), type = 'response')
## 1
## 0.2380263
# For this combination of data, there is a 23.80% probability that the respondent will quit the job.
We have used OverTime = Yes and No in the first and second predictions, respectively, keeping all other variables at zero or their reference levels. The second prediction, 23.80%, equals the probability converted from the intercept alone: it is the baseline probability of attrition when all independent variables are at zero or their reference level, i.e., for workers without overtime. Switching OverTime to Yes adds the OverTime coefficient (1.47) to the linear predictor, and the predicted probability rises from 23.80% to 57.55%, an increase of about 33.75 percentage points. Note that the converted probability of the OverTime coefficient in the table above (0.8127) is the coefficient alone passed through the logistic function, not a predicted probability of attrition; differences in probability should therefore be computed from predictions, as done here, rather than by adding or subtracting transformed coefficients.
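A minimal sketch of computing that overtime effect directly as a difference of predicted probabilities, using the same reference combination as above:
# Overtime effect as a difference in predicted probabilities
new_data = data.frame(Age = 0, MaritalStatus = 'Divorced', MonthlyIncome = 0,
                      OverTime = c('No', 'Yes'))
p_hat = predict(logit, newdata = new_data, type = 'response')
diff(p_hat)   # increase in attrition probability due to overtime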
Let us use probit analysis to see the differences in interpretation.
probit = glm(Attrition ~ Age + MaritalStatus + MonthlyIncome + OverTime,
family = binomial(link = 'probit'), data = df) # nolint
summ(probit, confint = T, robust = F, vif = T)
## MODEL INFO:
## Observations: 1470
## Dependent Variable: Attrition
## Type: Generalized linear model
## Family: binomial
## Link function: probit
##
## MODEL FIT:
## χ²(5) = 175.63, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.19
## Pseudo-R² (McFadden) = 0.14
## AIC = 1134.96, BIC = 1166.71
##
## Standard errors:MLE
## -------------------------------------------------------------------------
## Est. 2.5% 97.5% z val. p VIF
## -------------------------- ------- ------- ------- -------- ------ ------
## (Intercept) -0.76 -1.16 -0.36 -3.72 0.00
## Age -0.02 -0.03 -0.00 -2.82 0.00 1.27
## MaritalStatusMarried 0.17 -0.06 0.41 1.46 0.14 1.02
## MaritalStatusSingle 0.62 0.38 0.86 5.15 0.00 1.02
## MonthlyIncome -0.00 -0.00 -0.00 -3.86 0.00 1.26
## OverTimeYes 0.81 0.64 0.98 9.31 0.00 1.02
## -------------------------------------------------------------------------
The probit coefficient of age is -0.02, which indicates that a one-year increase in age decreases the z-score of the response variable (attrition) by 0.02. Now, convert the z-scores (probit coefficients) to probabilities and add them as a probit column to the coef_all table to compare.
coef_all$probit = pnorm(coef(probit))
coef_all
## log_odds odds probability plogis probit
## (Intercept) -1.163531e+00 0.3123813 0.2380263 0.2380263 0.2242052
## Age -2.932094e-02 0.9711048 0.4926703 0.4926703 0.4939105
## MaritalStatusMarried 2.879825e-01 1.3337340 0.5715021 0.5715021 0.5692595
## MaritalStatusSingle 1.123215e+00 3.0747235 0.7545846 0.7545846 0.7323926
## MonthlyIncome -9.818986e-05 0.9999018 0.4999755 0.4999755 0.4999808
## OverTimeYes 1.467721e+00 4.3393345 0.8127107 0.8127107 0.7903507
We see that the probabilities computed using the logit and probit models are similar, with minor differences, because they are calculated using different link functions (the logistic versus the standard normal distribution).
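As a final sketch, we can also compare the fitted probabilities of the two models observation by observation; they should track each other closely:
# Compare fitted probabilities from the logit and probit models
head(data.frame(p_logit  = predict(logit,  type = 'response'),
                p_probit = predict(probit, type = 'response')))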