Perhaps the quickest way to get an understanding of logistic regression is to compare it directly to linear regression. We’ll do exactly that: first we’ll look at the theory, then we’ll move on to the implementation. 1
Both linear regression and logistic regression are types of generalized linear models (GLMs). 2 For instance, the linear regression model, written in the conditional-distribution form used by GLMs, is
\[ Y \mid {\bf X} = {\bf x} \sim N(\beta_0 + \beta_1x_1 + \ldots + \beta_{p - 1}x_{p - 1}, \ \sigma^2) \]
The logistic regression framework’s inheritance of the GLM form is easy to see when comparing the general equations for linear and logistic regression. Recall that linear regression models can equivalently be written as \[ Y = \beta_0 + \beta_1x_1 + \ldots + \beta_{p - 1}x_{p - 1} + \epsilon, \ \ \epsilon \sim N(0, \sigma^2) \]
which calculates the response \(Y\) directly. Logistic regression is defined in a similar manner
\[ \log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p - 1} x_{p - 1} \]
However, in contrast to linear regression, note that the calculation of the response is not direct.
The left-hand side of the equation is known as the log odds. 3
In a binary context, the odds are the probability for a “positive” event \((Y = 1)\) divided by the probability of a “negative” event \((Y = 0)\). 4
\[ \frac{p({\bf x})}{1 - p({\bf x})} = \frac{P[Y = 1 \mid {\bf X} = {\bf x}]}{P[Y = 0 \mid {\bf X} = {\bf x}]} \]
The logistic regression equation guarantees that a value between \(0\) and \(1\) is calculated. This is evident when the inverse logit transformation is applied, which results in a “direct” probability prediction.
\[ p({\bf x_i}) = P[Y_i = 1 \mid {\bf X_i} = {\bf x_i}] = \frac{e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}}} \]
Note that this is a prediction of a probability, not of a numerical response value. This probability must then be translated into a categorical prediction.
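To make the transformation concrete, here is a small illustrative helper (not part of the tutorial’s code; the name inv_logit is hypothetical) that maps a log odds value to a probability:
# Inverse logit: maps a log odds value eta to a probability between 0 and 1.
inv_logit <- function(eta) {
  exp(eta) / (1 + exp(eta))
}
inv_logit(0)   # log odds of 0 correspond to a probability of 0.5
inv_logit(-3)  # negative log odds give probabilities below 0.5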
So, what does this all mean? Put simply, linear regression should be used to predict a quantitative (i.e. numerical) response variable, while logistic regression should be used to predict a qualitative (i.e. categorical) response variable. (More generally, predicting a categorical response variable is known as classification.) Visually, the linear model generates a straight line, and the logistic model generates an “S” curve.
For this tutorial, we will use the Default dataset from the ISLR package.
library("ISLR")
library("tibble")
as_tibble(Default)
## # A tibble: 10,000 x 4
## default student balance income
## <fctr> <fctr> <dbl> <dbl>
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
## 7 No No 825.5133 24905.227
## 8 No Yes 808.6675 17600.451
## 9 No No 1161.0579 37468.529
## 10 No No 0.0000 29275.268
## # ... with 9,990 more rows
Our goal is to properly classify people who have defaulted based on student status, credit card balance, and income.
It is always a good idea to visualize our data before trying to build a model for it. For example, density plots are useful for visualizing the distribution of each predictor, broken out by the levels of the response variable.
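The plotting code itself is not shown in this tutorial. As a rough sketch, the density plots discussed below might be generated with the ggplot2 package (an assumption; any plotting approach would do):
library("ggplot2")
# Density of each numeric predictor, broken out by the response (default).
ggplot(Default, aes(x = balance, fill = default)) + geom_density(alpha = 0.5)
ggplot(Default, aes(x = income, fill = default)) + geom_density(alpha = 0.5)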
Observing that the distributions of "No" and "Yes" for default given income do not differ much, we might believe that income will not be particularly useful for our model. On the other hand, there seems to be a big difference in the balance distributions at a value around 1400 for the two categories of default.
We can also observe the distributions of income and balance broken out by the other predictor, student, which is qualitative (like default). These density plots indicate that students have much less income than the rest of the population.
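Again as a sketch (assuming ggplot2 as above), these plots could be produced with:
# Density of the numeric predictors, broken out by student status.
ggplot(Default, aes(x = income, fill = student)) + geom_density(alpha = 0.5)
ggplot(Default, aes(x = balance, fill = student)) + geom_density(alpha = 0.5)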
Although we could do more to inspect the data, let’s go ahead and create a logistic regression model. To follow good modeling practices, we’ll create training and testing splits so that we can check for under/over-fitting when performing regression.
# Split into train/test splits first.
set.seed(42)
default_idx <- sample(nrow(Default), ceiling(nrow(Default) / 2))
default_trn <- Default[default_idx, ]
default_tst <- Default[-default_idx, ]
# Create the model.
model_glm <- glm(default ~ balance, data = default_trn, family = "binomial")
Creating a logistic regression model should look very similar to creating a linear regression model. However, instead of lm() we use glm(). Also, note that we must specify family = "binomial" for a binary classification context. (Actually, calling glm() with family = "gaussian" would be equivalent to lm().)
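As a quick, purely illustrative check of that last claim (this model is not part of the tutorial), fitting a numeric response with a gaussian GLM and with lm() should produce identical coefficient estimates:
# Illustrative only: a gaussian GLM reproduces the lm() coefficients.
coef(glm(balance ~ income, data = default_trn, family = "gaussian"))
coef(lm(balance ~ income, data = default_trn))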
Before making any predictions, let’s briefly examine the model with summary(). Among other things, it is important to see what coefficient values have been estimated for our model.
summary(model_glm)
##
## Call:
## glm(formula = default ~ balance, family = "binomial", data = default_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2227 -0.1560 -0.0634 -0.0237 3.7286
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.452183 0.500042 -20.90 <2e-16 ***
## balance 0.005368 0.000306 17.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1470.42 on 4999 degrees of freedom
## Residual deviance: 830.61 on 4998 degrees of freedom
## AIC: 834.61
##
## Number of Fisher Scoring iterations: 8
Like the summary values calculated for a linear regression model, we get quantile information regarding residuals, as well as significance estimates for our predictors. The interpretation of p-values in the logistic regression framework is the same as that for linear regression models. (So the rule of thumb that “significant” predictors are associated with p-values less than 5% also holds.)
However, note that we get a z value instead of a t value. Without getting too much into the theory regarding this difference, one should understand that the meaning of this value is analogous to that of a t value. 5 Also, note that we see Null deviance, AIC, and Number of Fisher Scoring iterations instead of Residual standard error, Multiple R-squared, Adjusted R-squared, and F-statistic.
The logistic regression diagnostic values generated by the summary() call are typically not used directly to interpret the “goodness of fit” of a model.
However, before looking more closely at model diagnostics that are more suitable for logistic regression, we should first understand how to use the predict() function with glm(). In order to return probabilities, we must specify type = "response". (The default setting is type = "link", which corresponds to the log odds value.)
head(predict(model_glm, type = "response"))
## 9149 9370 2861 8302 6415
## 9.466353e-04 3.298300e-01 7.437969e-03 8.170105e-05 1.183661e-04
## 5189
## 6.357530e-04
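As a quick check (not part of the original output), applying the inverse logit to the type = "link" predictions should reproduce the probabilities above; plogis() is the logistic distribution function in base R:
# The "link" predictions are log odds; the inverse logit maps them back to probabilities.
head(plogis(predict(model_glm, type = "link")))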
As mentioned before, these predicted values are probabilities, not classifications. We must “manually” convert the probabilities to classifications. Traditionally, a midpoint value such as 0.5 is used to “categorize” the probabilities. (This is actually equivalent to specifying type = "link" and using a threshold value of \(0\).)
trn_pred <- ifelse(predict(model_glm, type = "response") > 0.5, "Yes", "No")
head(trn_pred)
## 9149 9370 2861 8302 6415 5189
## "No" "No" "No" "No" "No" "No"
Probably the most common way to evaluate a classification model is to compare the actual response values with the predicted ones using a cross-table, which is often called a confusion matrix. This matrix can be generated with the base table() function.
# Making predictions on the train set.
trn_tab <- table(predicted = trn_pred, actual = default_trn$default)
trn_tab
## actual
## predicted No Yes
## No 4815 122
## Yes 17 46
# Making predictions on the test set.
tst_pred <- ifelse(predict(model_glm, newdata = default_tst, type = "response") > 0.5, "Yes", "No")
tst_tab <- table(predicted = tst_pred, actual = default_tst$default)
tst_tab
## actual
## predicted No Yes
## No 4817 113
## Yes 18 52
Perhaps unsurprisingly, the most common metrics for evaluating logistic regression models are error rate and accuracy (which is simply one minus the error rate). These metrics can be calculated directly from the confusion matrix.
# Classification error rate: the proportion of predictions that do not match the actual values.
calc_class_err <- function(actual, predicted) {
  mean(actual != predicted)
}
calc_class_err(actual = default_trn$default, predicted = trn_pred)
## [1] 0.0278
# Test error rate should be close to train error rate if model is fitted properly.
calc_class_err(actual = default_tst$default, predicted = tst_pred)
## [1] 0.0262
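For completeness, accuracy can also be read straight off the confusion matrix (a small sketch, not in the original code):
# Test accuracy: correct predictions (the diagonal) over all predictions.
sum(diag(tst_tab)) / sum(tst_tab)  # should equal 1 - 0.0262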
Now, let’s consider the confusion matrix in more detail. The names true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are often used to reference the four cells of the confusion matrix.
\[ \text{Sens} = \text{True Positive Rate} = \frac{\text{TP}}{\text{P}} = \frac{\text{TP}}{\text{TP + FN}} \] \[ \text{Spec} = \text{True Negative Rate} = \frac{\text{TN}}{\text{N}} = \frac{\text{TN}}{\text{TN + FP}} \] \[ \text{Prev} = \frac{\text{P}}{\text{Total Obs}}= \frac{\text{TP + FN}}{\text{Total Obs}} \]
Calculations of metrics such as sensitivity, specificity, and prevalence are derived from the confusion matrix. The importance of these (and other) metrics depends on the nature of the data (e.g. lower values may be acceptable if the data is deemed difficult to predict), as well as on the tolerance for each type of misclassification. For example, we may want to bias our predictions for classifying defaults such that we are more likely to predict a default when one does not occur. We must carefully identify whether we want to prioritize sensitivity or specificity.
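To make these definitions concrete, here is a small sketch (not in the original tutorial) computing the metrics by hand from the training confusion matrix trn_tab, whose rows are predictions and columns are actual values:
# Treat "Yes" (a default) as the positive class.
TP <- trn_tab["Yes", "Yes"]  # predicted Yes, actually Yes
FN <- trn_tab["No", "Yes"]   # predicted No, actually Yes
TN <- trn_tab["No", "No"]    # predicted No, actually No
FP <- trn_tab["Yes", "No"]   # predicted Yes, actually No
TP / (TP + FN)               # sensitivity
TN / (TN + FP)               # specificity
(TP + FN) / sum(trn_tab)     # prevalence
These values should match the confusionMatrix() output below.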
We can get the values of sensitivity, specificity, prevalence, etc. easily for our predictions using the confusionMatrix() function from the caret package. 6
library("caret")
confusionMatrix(trn_tab, positive = "Yes")
## Confusion Matrix and Statistics
##
## actual
## predicted No Yes
## No 4815 122
## Yes 17 46
##
## Accuracy : 0.9722
## 95% CI : (0.9673, 0.9766)
## No Information Rate : 0.9664
## P-Value [Acc > NIR] : 0.01101
##
## Kappa : 0.387
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.2738
## Specificity : 0.9965
## Pos Pred Value : 0.7302
## Neg Pred Value : 0.9753
## Prevalence : 0.0336
## Detection Rate : 0.0092
## Detection Prevalence : 0.0126
## Balanced Accuracy : 0.6351
##
## 'Positive' Class : Yes
##
Now, let’s consider another metric, unrelated to the confusion matrix. Remember where we chose the value \(0.5\) as a threshold for classification? How do we know that \(0.5\) is the “optimal” cutoff value for accuracy? In reality, other cutoff values may be better (although \(0.5\) will tend to be the best value if all model assumptions are true and the sample size is reasonably large).
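One quick way to explore this (a sketch, not part of the original code) is to compute the test error rate over a handful of candidate cutoffs:
# Test error rate for several candidate cutoff values.
cutoffs <- c(0.1, 0.3, 0.5, 0.7, 0.9)
tst_prob <- predict(model_glm, newdata = default_tst, type = "response")
sapply(cutoffs, function(cutoff) {
  calc_class_err(actual = default_tst$default,
                 predicted = ifelse(tst_prob > cutoff, "Yes", "No"))
})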
The ROC curve (receiver operating characteristic curve) illustrates the sensitivity and specificity for all possible cutoff values. We can use the roc() function from the pROC package to generate the ROC curve for our predictions.
library("pROC")
test_prob <- predict(model_glm, newdata = default_tst, type = "response")
test_roc <- roc(default_tst$default ~ test_prob, plot = TRUE, print.auc = TRUE)
as.numeric(test_roc$auc)
## [1] 0.9515076
In general, we would like the curve to “hug” the upper-left corner of the plot (indicating high sensitivity and specificity). The AUC (area under the curve) is used to quantify the visual profile of the ROC.
Based on the density plots, we might have reasonably created a very simplistic model using only the balance predictor. In particular, we could create a model based on the rule
\[ \hat{C}(\texttt{balance}) = \begin{cases} \text{Yes} & \texttt{balance} > 1400 \\ \text{No} & \texttt{balance} \leq 1400 \end{cases} \]
# One-variable classifier: return the "above" label when x exceeds the boundary, otherwise "below".
simple_class <- function(x, boundary, above = "Yes", below = "No") {
  ifelse(x > boundary, above, below)
}
tst_pred_basic <- simple_class(x = default_tst$balance, boundary = 1400)
Although this appears to be a reasonable model, it turns out that its prediction accuracy is actually worse than what we would calculate if we simply predicted that no one defaulted!
tst_pred_all_no <- simple_class(x = default_tst$balance, boundary = 1400, above = "No", below = "No")
(However, predicting all "No"s still gives worse prediction accuracy than our logistic regression model achieves on the test set. We get an error rate of 0.0262 for our model. Here, we calculate an error rate of 0.033 when predicting all "No"s, and 0.0994 when using \(1400\) as the cutoff value for our one-variable classifier.)
calc_class_err(actual = default_tst$default,
predicted = tst_pred)
## [1] 0.0262
calc_class_err(actual = default_tst$default,
predicted = tst_pred_basic)
## [1] 0.0994
calc_class_err(actual = default_tst$default,
predicted = tst_pred_all_no)
## [1] 0.033
Of course, even after reviewing the theory, we may return to the inevitable question: “Why not linear regression?” The main issue, in practice, involves how to interpret the linear regression predictions as probabilities and, subsequently, how to convert those numerical values to categories. To better understand this issue, let’s go ahead and try a linear regression model. (Here, we create a model using all predictors, not just balance.)
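The model-fitting code is not shown here; a minimal sketch of how default_trn_lm and model_lm might be created, assuming the factor response is coerced to a 0/1 numeric variable, is:
# Coerce the factor response to 0/1 so that lm() can fit it as a numeric response.
default_trn_lm <- default_trn
default_trn_lm$default <- ifelse(default_trn_lm$default == "Yes", 1, 0)
model_lm <- lm(default ~ ., data = default_trn_lm)
summary(model_lm)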
##
## Call:
## lm(formula = default ~ ., data = default_trn_lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23775 -0.07021 -0.02729 0.02013 0.98785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.074e-02 1.202e-02 -5.052 4.53e-07 ***
## studentYes -2.244e-02 8.131e-03 -2.759 0.00582 **
## balance 1.304e-04 5.033e-06 25.908 < 2e-16 ***
## income -2.638e-07 2.741e-07 -0.963 0.33583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1692 on 4996 degrees of freedom
## Multiple R-squared: 0.1193, Adjusted R-squared: 0.1187
## F-statistic: 225.5 on 3 and 4996 DF, p-value: < 2.2e-16
For this case, if we create a linear regression model using all of the predictors, then all of the predicted probabilities are below 0.5. Thus, if we were to classify every observation below 0.5 as "No", then all of our predictions would be "No". Additionally, because the linear model returns some predictions with values less than \(0\), we would have to figure out how to sensibly convert these values to categories (negative values cannot be interpreted as probabilities).
all(predict(model_lm) < 0.5)
## [1] TRUE
any(predict(model_lm) < 0)
## [1] TRUE
We could classify the “upper half” of values into the positive category and the “lower half” into the negative category (in order to avoid the issue of interpreting the linear regression model’s numerical predictions), but we would run into another issue when there are more than two categories for the response variable. Encoding the different categories as numerical values can become subjective and can bias the classifications in such a case. For example, we might encode different ethnicities as numerical values for a problem in which we are trying to predict a person’s heritage, but who is to say one ethnicity is “closer” to another? (Are 1, 2, and 3 proper values to encode Asian, Hispanic, and European? Order matters!)
In any case, it should be easy to see why the logistic regression framework is more appropriate for classification problems.
That’s it for now. There is certainly much more we could cover, both in terms of theory and implementation.
Notably, we haven’t discussed a case where there are more than two categories for the response variable. In that case, we might use multinomial logistic regression. Probably the most commonly used implementation is the multinom() function from the nnet package. However, we will leave this topic for another time.
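For reference only, a minimal sketch of what such a call might look like (with a two-level response like default it simply reduces to ordinary logistic regression, so this is purely illustrative):
library("nnet")
# multinom() uses the same formula interface as glm().
model_multinom <- multinom(default ~ balance, data = default_trn)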
GLMs have three components: a random component that specifies the distribution of the response \(Y\) given \({\bf X} = {\bf x}\), a linear predictor \(\eta({\bf x}) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}\), and a link function that connects the mean of the response to the linear predictor.
For reference purposes, the following table summarizes the differences between linear and logistic regression with respect to the broader GLM framework.
| | Linear Regression | Logistic Regression |
|---|---|---|
| Distribution of \(Y \mid {\bf X} = {\bf x}\) | \(N(\mu({\bf x}), \sigma^2)\) | \(\text{Bern}(p({\bf x}))\) |
| Distribution Name | Normal | Bernoulli (Binomial) |
| \(\text{E}[Y \mid {\bf X} = {\bf x}]\) | \(\mu({\bf x})\) | \(p({\bf x})\) |
| Support | Real: \((-\infty, \infty)\) | Integer: \(0, 1\) |
| Usage | Numeric Data | Binary (Class: Yes/No) Data |
| Link Name | Identity | Logit |
| Link Function | \(\eta({\bf x}) = \mu({\bf x})\) | \(\eta({\bf x}) = \log \left(\frac{p({\bf x})}{1 - p({\bf x})} \right)\) |
| Mean Function | \(\mu({\bf x}) = \eta({\bf x})\) | \(p({\bf x}) = \frac{e^{\eta({\bf x})}}{1 + e^{\eta({\bf x})}} = \frac{1}{1 + e^{-\eta({\bf x})}}\) |
Note that logistic regression models do not make the same assumptions that linear regression makes, including normally distributed errors and constant error variance (homoscedasticity).
The following sites were used as resources for this tutorial:
1. Some code and phrasing is taken directly from these resources.
2. See the appendix for more details regarding the theory discussed here.
3. The log odds are the logit transform applied to \(p({\bf x})\).
4. This means that when the odds are \(1\), the two events have equal probability. Moreover, odds greater than \(1\) indicate a higher probability for the positive event, while the converse is true for odds less than \(1\).
5. (Recall that the \(t\)-value is calculated by a \(t\)-test to assess the null hypothesis for a given predictor.)
6. Note that we could do a lot more with the caret package, which is perhaps R’s most powerful machine learning package.