Linear regression is used to approximate the (linear) relationship between a continuous response variable and a set of predictor variables. However, when the response variable is binary (e.g., Yes/No), linear regression is not appropriate. Fortunately, analysts can turn to an analogous method, logistic regression, which can also be extended to multinomial problems.
R packages used for logistic regression:
# Helper packages
library(dplyr) # for data wrangling
library(ggplot2) # for awesome plotting
library(rsample) # for data splitting
# Modeling packages
library(caret) # for logistic regression modeling
# Model interpretability packages
library(vip) # variable importance
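The examples that follow assume a training set named churn_train built from the IBM employee attrition data. A minimal sketch of one way to create it (the modeldata package as the data source, the 70/30 proportion, and the stratification choice are all assumptions):

# Assumed data prep: the attrition data ships with the modeldata package
library(modeldata)
data("attrition")
# Convert ordered factors to unordered so glm() uses standard dummy coding
df <- attrition %>% mutate_if(is.ordered, factor, ordered = FALSE)
# 70/30 split, stratified on the response
set.seed(123)
churn_split <- initial_split(df, prop = 0.7, strata = "Attrition")
churn_train <- training(churn_split)
churn_test <- testing(churn_split)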
Two logistic regression models that could predict the probability of an employee attriting:
model1 <- glm(Attrition ~ MonthlyIncome, family = "binomial", data = churn_train)
model2 <- glm(Attrition ~ OverTime, family = "binomial", data = churn_train)
The first predicts the probability of attrition based on an employee's monthly income (MonthlyIncome); the second, on whether or not the employee works overtime (OverTime).
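A quick sketch of how either model produces probabilities: predict() with type = "response" returns fitted values on the probability scale rather than the default log-odds scale.

# Predicted probability of attrition for the training observations
head(predict(model1, type = "response"))
head(predict(model2, type = "response"))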
The glm() function fits generalized linear models, a class of models that includes both logistic regression and simple linear regression as special cases.
We must pass the argument family = "binomial" to tell R to fit a logistic regression rather than some other type of generalized linear model (the default is family = "gaussian", which is equivalent to ordinary linear regression).
glm() uses maximum likelihood (ML) estimation to estimate the unknown model parameters.
Intuition behind using ML estimation to fit a logistic regression model: we try to find β0 and β1 such that plugging these estimates into the model for p(X),

p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}},

yields a number close to one for all employees who attrited, and a number close to zero for all employees who did not. This intuition can be formalized using a mathematical equation called a likelihood function:

\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr)
The estimates β0 and β1 are chosen to maximize this likelihood function. What results is the predicted probability of attrition.
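To make the likelihood maximization concrete, here is a minimal sketch that codes the negative log-likelihood directly and hands it to optim(). This is illustration only: glm() solves the same problem via iteratively reweighted least squares, and MonthlyIncome is rescaled to thousands purely for numerical stability.

# Hand-rolled maximum likelihood for model1 (illustration only)
x <- churn_train$MonthlyIncome / 1000 # rescaled for optimizer stability
y <- as.integer(churn_train$Attrition == "Yes")
neg_log_lik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x))) # the logistic model p(X)
  -sum(y * log(p) + (1 - y) * log(1 - p)) # negative log-likelihood
}
optim(c(0, 0), neg_log_lik)$par # slope should approximate coef(model1)[2] * 1000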
Once our preferred logistic regression model is identified, we need to interpret how the features are influencing the results. As with ordinary linear regression models, variable importance for logistic regression models can be computed using the absolute value of the z-statistic for each coefficient. Using vip::vip() we can extract our top influential variables.
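Note that cv_model3 is not defined in this excerpt. A minimal sketch of how such a model might be fit, assuming a full-feature logistic regression resampled with 10-fold cross-validation through caret (the formula and fold count are assumptions):

# Assumed definition of cv_model3: all predictors, 10-fold CV via caret
cv_model3 <- train(
  Attrition ~ .,
  data = churn_train,
  method = "glm",
  family = "binomial",
  trControl = trainControl(method = "cv", number = 10)
)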
vip(cv_model3, num_features = 20)
Logistic regression assumes a monotonic linear relationship between the predictors and the response. However, this linear relationship occurs on the logit scale; on the probability scale, the relationship is nonlinear (S-shaped).
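This S-shaped pattern is easy to see by plotting predicted probabilities against a predictor; a minimal sketch using the ggplot2 package loaded above (the plotting choices are assumptions):

# Nonlinear (sigmoidal) relationship on the probability scale
ggplot(churn_train, aes(MonthlyIncome, as.integer(Attrition == "Yes"))) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(y = "Probability of attrition")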
Logistic regression provides an alternative to linear regression for binary classification problems.
Logistic regression suffers from the many assumptions involved in the algorithm (e.g., a linear relationship on the logit scale, absence of multicollinearity among predictors).
Although multinomial extensions of logistic regression exist, the assumptions made only increase and, often, the stability of the coefficient estimates (and therefore the accuracy) decreases.
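For reference, a multinomial fit can be sketched with nnet::multinom(); the dataset and predictors below are purely illustrative:

# Minimal multinomial sketch on a built-in three-class dataset
library(nnet)
multi_fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
head(predict(multi_fit, type = "probs")) # one probability column per class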