Hands-On Machine Learning with R

Bradley Boehmke

Chapter 5 - Logistic Regression

Linear regression is used to approximate the (linear) relationship between a continuous response variable and a set of predictor variables. This chapter explores the use of logistic regression for binary response variables.

V.1 Prerequisites

Packages needed to build a logistic regression model in R:

# Helper packages
library(dplyr)     # for data wrangling
library(ggplot2)   # for awesome plotting
library(rsample)   # for data splitting

# Modeling packages
library(caret)     # for logistic regression modeling

# Model interpretability packages
library(vip)       # variable importance

To illustrate logistic regression concepts we’ll use the employee attrition data, where our intent is to predict the Attrition response variable (coded as "Yes"/"No").

V.2 Why logistic regression

According to the book, fitting a linear regression to a binary response can give inconsistent results, such as predicted probabilities below 0 or above 1. That is why logistic regression is used for binary responses in place of linear regression.

To avoid the inadequacies of the linear model fit on a binary response, we must model the probability of our response using a function that gives outputs between 0 and 1 for all values. Many functions meet this description.
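One function that meets this description is the logistic (sigmoid) function, which is the one logistic regression relies on. The following is a minimal sketch in base R; the helper name logistic and the input values are illustrative assumptions, not code from the book.

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, 1).
# `logistic` is a hypothetical helper name used only for this illustration.
logistic <- function(x) 1 / (1 + exp(-x))

# Linear predictor values ranging from strongly negative to strongly positive
eta <- c(-10, -2, 0, 2, 10)
p <- logistic(eta)

# Every output lies strictly between 0 and 1, and an input of 0 maps to 0.5
print(round(p, 4))

# Base R provides the same function as plogis(); its inverse (the logit) is qlogis()
print(all.equal(p, plogis(eta)))
print(all.equal(qlogis(p), eta))
```

Because the output can be read as a probability for any input, the linear predictor is free to take any real value without producing impossible predictions.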

V.3 Simple logistic regression

The glm() function fits generalized linear models, a class of models that includes both logistic regression and simple linear regression as special cases. The syntax of glm() is similar to that of lm(), except that we must pass the argument family = "binomial" to tell R to run a logistic regression rather than some other type of generalized linear model (the default is family = "gaussian", which is equivalent to ordinary linear regression assuming normally distributed errors).
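As a rough illustration of the glm() call described above, the sketch below fits a logistic regression to simulated data. The data frame sim, its columns x and y, and the true coefficients are assumptions made up for this example; they stand in for the attrition data so that the block runs with base R alone.

```r
# Simulated stand-in data (NOT the attrition data): one numeric predictor
# and a binary response generated from a known logistic relationship.
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
sim <- data.frame(x = x, y = y)

# family = "binomial" requests logistic regression (logit link by default)
fit_logit <- glm(y ~ x, family = "binomial", data = sim)

# Omitting family falls back to the gaussian default, which matches lm()
fit_gauss <- glm(y ~ x, data = sim)
fit_lm <- lm(y ~ x, data = sim)

print(coef(fit_logit))                           # estimates on the log-odds scale
print(all.equal(coef(fit_gauss), coef(fit_lm)))  # same coefficients as lm()

# type = "response" returns predicted probabilities rather than log-odds
p_hat <- predict(fit_logit, newdata = data.frame(x = c(-2, 0, 2)),
                 type = "response")
print(p_hat)
```

With the real data, the same call would use Attrition as the response and one or more predictors from the attrition data on the right-hand side of the formula.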

V.4 Feature interpretation

Similar to linear regression, once our preferred logistic regression model is identified, we need to interpret how the features are influencing the results. As with ordinary linear regression models, variable importance for logistic regression models can be computed using the absolute value of the z-statistic for each coefficient (albeit with the same issues previously discussed).
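The importance score described above can be computed by hand from the coefficient table of a fitted model. This is a minimal sketch on simulated data; the predictors x1, x2, x3 and their effect sizes are assumptions invented for the example, and the book's own workflow uses the vip package rather than this manual calculation.

```r
# Simulated data with one strong, one weak, and one irrelevant predictor
set.seed(42)
n <- 1000
x1 <- rnorm(n)   # strong effect
x2 <- rnorm(n)   # weak effect
x3 <- rnorm(n)   # no effect
y <- rbinom(n, size = 1, prob = plogis(0.2 + 1.5 * x1 - 0.3 * x2))
fit <- glm(y ~ x1 + x2 + x3, family = "binomial")

# The coefficient table has columns: Estimate, Std. Error, z value, Pr(>|z|)
z <- summary(fit)$coefficients[, "z value"]

# Importance = |z|, dropping the intercept and sorting from largest to smallest
importance <- sort(abs(z[names(z) != "(Intercept)"]), decreasing = TRUE)
print(importance)
```

Taking the absolute value means the score reflects how strongly a coefficient is distinguished from zero, not the direction of its effect.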

Similar to linear regression, logistic regression assumes a monotonic linear relationship. However, the linear relationship occurs on the logit scale; on the probability scale, the relationship will be nonlinear.
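The difference between the two scales can be checked numerically. The sketch below, again on simulated data with assumed coefficients, predicts on an evenly spaced grid of x values and compares the step sizes on the logit scale with those on the probability scale.

```r
# Simulated data from a known logistic relationship (illustrative only)
set.seed(1)
x <- rnorm(300)
y <- rbinom(300, size = 1, prob = plogis(0.5 + 1.0 * x))
fit <- glm(y ~ x, family = "binomial")

# Evenly spaced predictor values, one unit apart
grid <- data.frame(x = c(-2, -1, 0, 1, 2))
logit_hat <- predict(fit, newdata = grid, type = "link")      # log-odds
prob_hat  <- predict(fit, newdata = grid, type = "response")  # probabilities

# Equal steps in x give equal steps in log-odds (each equals the slope) ...
print(diff(logit_hat))
print(coef(fit)[["x"]])

# ... but unequal steps in probability
print(diff(prob_hat))
```

A one-unit change in x therefore shifts the log-odds by the same amount everywhere, while its effect on the predicted probability depends on where along the curve the observation sits.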

V.5 Final thoughts

Logistic regression provides an alternative to linear regression for binary classification problems. However, similar to linear regression, logistic regression suffers from the many assumptions involved in the algorithm (e.g., a linear relationship between the predictors and the log-odds, and the absence of multicollinearity). Moreover, we often have more than two classes to predict, which is commonly referred to as multinomial classification. Although multinomial extensions of logistic regression exist, the assumptions made only increase and, often, the stability of the coefficient estimates (and therefore the accuracy) decreases. Future chapters will discuss more advanced algorithms that provide a more natural and trustworthy approach to binary and multinomial classification prediction.
