To better understand how logistic regression works, I've investigated some of its core principles here.
First, there's some discussion of how probability and odds are related, how the logistic function uses these to make predictions, and how to interpret the regression coefficient beta, b1.
Second, graphical representations of logistic regression for a single predictor X with two classes generated by normal distributions are discussed. These charts show how the actual probability boundary of the distributions compares to the logistic function decision boundary, and are annotated with the main confusion matrix metrics to evaluate the model's performance.
Equations and figures referenced here can be found in An Introduction to Statistical Learning with Applications in R (ISLR).
library(dplyr)
library(ggplot2)
library(ISLR)
The inference for ordinary least squares linear regression coefficients is easy to understand: when other variables are held constant, an increase in the predictor X of 1 increases the response Y by the coefficient b1. Here, the equivalent interpretation for logistic regression is considered:
Meaning of coefficient b1: “increase in X by 1 changes the log odds by b1 (eq 4.4), or multiplies the odds by e^b1 (eq 4.3)”
odds = p(X) / (1 - p(X)), i.e. the probability of the event happening divided by the probability of it not happening. Rearranging gives p(X) = odds / (1 + odds).
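As a quick check that these two formulas are inverses of each other (the helper function names below are just for illustration, not part of the analysis):
p_to_odds <- function(p) p / (1 - p)          # odds = p(X) / (1 - p(X))
odds_to_p <- function(odds) odds / (1 + odds) # p(X) = odds / (1 + odds)
odds_to_p(p_to_odds(0.8)) # round-trips back to 0.8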
The chart below shows the relationship between odds and probability.
# odds vs probability
plotr <- data.frame(p_X = seq(0,0.99,0.01)) %>% mutate(odds = p_X / (1 - p_X))
ggplot(plotr, aes(p_X, odds)) + geom_line()
The logistic function is responsible for calculating probabilities for classifying the response. The figure below shows the probability of defaulting on a loan given a person’s bank balance (from ISLR::Default).
summary(glm(default ~ balance, family = binomial, data=Default))
##
## Call:
## glm(formula = default ~ balance, family = binomial, data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
## balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
b0 <- -10.65
b1 <- 5.5e-3
logistic_function <- function(b0, b1, X) { # (eq 4.2)
e <- exp(b0 + b1 * X)
e / (1 + e)
}
plotr <- data.frame(balance=Default$balance, P_default=logistic_function(b0=b0, b1=b1, Default$balance))
ggplot(plotr, aes(balance, P_default)) + geom_line() # (equivalent to fig 4.2)
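As a quick sanity check (not part of the original ISLR workflow), the hand-rolled logistic_function() using the rounded coefficients above should closely agree with predict() on the fitted glm; any small differences come from rounding b0 and b1:
fit <- glm(default ~ balance, family = binomial, data = Default)
balances <- c(1000, 1500, 2000)
logistic_function(b0, b1, balances) # rounded coefficients
predict(fit, newdata = data.frame(balance = balances), type = "response") # exact coefficients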
The odds are multiplied by:
exp(b1) # (eq 4.3)
## [1] 1.005515
If the odds are 4, or 4/1, the probability is p(X) = 4/(1+4) = 0.8. An increase in X (balance) of 1 gives an increase in odds of:
odds <- 4
new_odds <- odds * exp(b1)
delta_odds <- new_odds - odds
delta_odds
## [1] 0.02206061
This corresponds to an increase in probability, using p(X) = odds / (1 + odds), of:
(new_odds / (1 + new_odds)) - (odds / (1 + odds))
## [1] 0.0008785482
Checking this against the logistic function yields the same result (first find X for odds of 4).
X_at_odds <- (log(odds) - b0) / b1
X_at_odds
## [1] 2188.417
logistic_function(b0, b1, X_at_odds + 1) - logistic_function(b0, b1, X_at_odds)
## [1] 0.0008785482
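Equivalently, since p(X) / (1 - p(X)) = e^(b0 + b1 X), a one-unit increase in X multiplies the odds by exp(b1) regardless of the starting balance; a quick check (odds_at() is just a local helper for illustration):
odds_at <- function(X) {
  p <- logistic_function(b0, b1, X)
  p / (1 - p) # odds implied by the fitted probability
}
odds_at(1001) / odds_at(1000) # equals exp(b1), ~1.005515
odds_at(2001) / odds_at(2000) # same multiplier at a different balance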
The logistic_regression() function below gives a visual representation of how the logistic function classifies the response Y, compared to the actual probability density of the data, for a single predictor with two classes A and B. The data for each class are randomly generated from a normal distribution.
Arguments:
- nA, nB: number of observations classed as A or B
- meanA, meanB: mean of the normal distribution for A or B
- sdA, sdB: standard deviation of the normal distribution for A or B
- p_cutoff: probability cut-off/threshold for classifying as either A or B
- seed: seed for data generation, defaults to random
- show_conf_matrix: boolean, whether to print the confusion matrix
Graphical output:
- red and blue solid lines: probability density of data point counts for class A and B respectively
- red and blue shaded areas (bounded by dashed lines): probability area of data points for class A and B respectively
- black solid line: logistic function decision boundary
- lower rug plot: data points
- upper rug plot: predictions, red = A, blue = B
logistic_regression <- function(nA, meanA=0, sdA=1, nB, meanB=0, sdB=1,
p_cutoff=0.5, seed=round(runif(1)*1e6), show_conf_matrix=TRUE){
set.seed(seed)
# initialise data, cl = class, p1d = probability density
df <- data.frame(X = c(rnorm(nA, meanA, sdA), rnorm(nB, meanB, sdB)), cl = c(rep("A", nA), rep("B", nB))) %>%
mutate(p1d = ifelse(cl == "A", dnorm(X, meanA, sdA), dnorm(X, meanB, sdB)))
# fit logistic regression model and extract coefficients
logistic_reg <- glm(cl ~ X, data=df, family = "binomial")
summ_log_reg <- summary(logistic_reg)
b0 <- logistic_reg$coefficients[1]
b1 <- logistic_reg$coefficients[2]
p_val <- summ_log_reg$coefficients[2,4]
# calculate probabilities and classify predictions using logistic function
df <- df %>% mutate(p_X = logistic_function(b0, b1, X),
pred_cl = ifelse(p_X < p_cutoff, "A", "B"))
# create confusion matrix
conf_matrix <- table(df %>% select(pred_cl, cl))
prop_matrix_row <- prop.table(conf_matrix, margin=1)
prop_matrix_col <- prop.table(conf_matrix, margin=2)
accuracy <- mean(df$pred_cl == df$cl)
precision <- prop_matrix_row[1,1] # positive predictive value
sensitivity <- prop_matrix_col[1,1] # true positive rate
specificity <- prop_matrix_col[2,2] # true negative rate
annotation <- paste(paste("p value =", round(p_val, 2)),
paste("p(X) cutoff =", p_cutoff),
paste("accuracy =", round(accuracy, 2)),
paste("precision =", round(precision, 2)),
paste("true A rate =", round(sensitivity, 2)),
paste("true B rate =", round(specificity, 2)),
sep = "\n")
if (show_conf_matrix) print(conf_matrix)
# graphical output
ggplot(df, aes(X, col=cl)) +
geom_rug(alpha=0.1) + # data points
geom_density(aes(y=..count../max(..count..))) + # density of data point counts
geom_density(aes(y=..count.., fill=cl), position="fill", alpha=0.1, linetype=2) + # prob. area of data points
geom_line(aes(X, p_X), colour="black") + # logistic function of data points
geom_rug(aes(X, col=pred_cl), alpha=0.1, sides="t") + # predictions
annotate("text", x=min(df$X)*0.5, y=0.8, label=annotation) +
labs(y="normalised count & p(X)")
}
A few comments on how logistic regression performs on a single predictor with two classes generated by normal distributions:
1. A has a mean of 0 and B has a mean of 1. The logistic function decision boundary (black line) closely follows the actual probability areas of the two distributions (shown by the shaded areas bounded by dashed lines). Accuracy, precision, true A rate and true B rate are all very similar due to similar true positive and true negative counts, as well as similar false positive (type I error) and false negative (type II error) counts. Note that any bulges in the shaded probability areas, particularly at low and high X, indicate a disproportionate number of one class, visible because of the low number of observations at these points.
logistic_regression(nA=10000, nB=10000, meanB=1)
## cl
## pred_cl A B
## A 6968 3086
## B 3032 6914
2. B has a standard deviation of 0.5, so the majority of its observations are contained within the range of A. There is no region of X which solely contains B, which can be seen by the blue probability area not reaching p(X) = 1 at the top of the y-axis. Compared to 1. above, this leads to higher accuracy, precision and true A rate at the expense of the true B rate. The logistic function decision boundary (black line) does not closely match the probability area boundary (dashed line), as it does its best to correctly classify predictions given that at no point do the data show a 100% probability of B. Reducing the classification probability cut-off to 0.4 helps account for the way the two distributions overlap: it increases the true B rate at the expense of the true A rate, but increases the overall accuracy and precision of the model.
logistic_regression(nA=5000, nB=3000, meanB=1, sdB=0.5, seed = 687435)
## cl
## pred_cl A B
## A 4051 1224
## B 949 1776
logistic_regression(nA=5000, nB=3000, meanB=1, sdB=0.5, seed = 687435, p_cutoff = 0.4)
## cl
## pred_cl A B
## A 3648 662
## B 1352 2338
3. The mean of B is increased to 3 and both standard deviations are 1. Due to the high separation of the distributions, the black logistic function curve matches the dashed probability area line well and the prediction metrics are all high. Changing the probability cut-off to a low 0.2 again helps equalise the true A and true B rates, this time at a slight expense of accuracy.
logistic_regression(nA=5000, nB=1000, meanB=3, sdB=1, seed = 67312)
## cl
## pred_cl A B
## A 4893 158
## B 107 842
logistic_regression(nA=5000, nB=1000, meanB=3, sdB=1, seed = 67312, p_cutoff = 0.2)
## cl
## pred_cl A B
## A 4712 73
## B 288 927
General comments on these regressions:
- When nA and nB are the same or there is a clear separation between the classes, the boundary between probability areas is close to the logistic function.
- The logistic function decision boundary (black line) doesn't need to closely follow the actual probability boundary (dashed line) to give good results.
- Lots of overlap between the distributions leads to poorer predictions.
- Changing the classifier probability cut-off brings the true A rate and true B rate closer together, at the possible expense of other metrics such as accuracy. Which metric matters depends on the application, e.g. in sales you may be more interested in getting the true positive prediction rate as high as possible, as this represents successfully predicting when someone purchased your product (see the sketch after this list).
- What has been presented here is similar to LDA (linear discriminant analysis), i.e. estimating the probability that an observation belongs to a certain class using Bayes' theorem.
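To make the cut-off trade-off above more concrete, here is a rough sketch that sweeps a range of p_cutoff values on a single simulated data set, re-using the same simulation idea as logistic_regression(). The metric names mirror the chart annotations; exact numbers will depend on the seed.
# sweep classification cut-offs for one simulated data set (sketch, separate from logistic_regression())
set.seed(687435)
nA <- 5000; nB <- 3000
sim <- data.frame(X  = c(rnorm(nA, 0, 1), rnorm(nB, 1, 0.5)),
                  cl = c(rep("A", nA), rep("B", nB)))
fit <- glm(cl ~ X, data = sim, family = "binomial")
p_X <- predict(fit, type = "response") # fitted P(cl == "B")
sapply(seq(0.2, 0.8, 0.1), function(cutoff) {
  pred <- ifelse(p_X < cutoff, "A", "B")
  c(cutoff      = cutoff,
    accuracy    = mean(pred == sim$cl),
    true_A_rate = mean(pred[sim$cl == "A"] == "A"),
    true_B_rate = mean(pred[sim$cl == "B"] == "B"))
})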
Possible improvements:
- Make logistic_regression() universal, able to accept any data frame.
- Use confusion_matrix() from helpers.R in logistic_regression() to reduce the code chunk size.
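For reference, a minimal sketch of what such a confusion_matrix() helper could look like; the actual helpers.R version isn't shown here, so the name, arguments and return value are assumptions:
confusion_matrix <- function(actual, predicted, positive = "A") {
  conf <- table(predicted = predicted, actual = actual)
  list(table       = conf,
       accuracy    = mean(predicted == actual),
       precision   = prop.table(conf, margin = 1)[positive, positive], # positive predictive value
       sensitivity = prop.table(conf, margin = 2)[positive, positive]) # true positive rate
}
Inside logistic_regression(), a helper like this could replace the block that builds conf_matrix, prop_matrix_row and prop_matrix_col.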