To better understand how logistic regression works, I've investigated some of its core principles here.
First, there's some discussion of how probability and odds are related, how the logistic function uses these to make predictions, and how to interpret the regression coefficient beta, b1.
Second, graphical representations of logistic regression for a single predictor X with two classes generated by normal distributions are discussed. These charts show how the actual probability boundary of the distributions compares to the logistic function decision boundary, and are annotated with the main confusion matrix metrics to evaluate the model's performance.
Equations and figures referenced here can be found in An Introduction to Statistical Learning with Applications in R (ISLR).
library(dplyr)
library(ggplot2)
library(ISLR)
The inference for ordinary least squares linear regression coefficients is easy to understand: when other variables are held constant, an increase in the predictor X of 1 increases the response Y by the coefficient b1. Here, the equivalent interpretation for logistic regression is considered:
Meaning of coefficient b1: “increase in X by 1 changes the log odds by b1 (eq 4.4), or multiplies the odds by e^b1 (eq 4.3)”
odds = p(X) / (1 - p(X)), i.e. the probability of the event happening divided by the probability of it not happening. Rearranging gives p(X) = odds / (1 + odds).
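As a quick check that these two formulas are inverses of each other (the helper function names below are just for illustration, not part of the analysis):
p_to_odds <- function(p) p / (1 - p)          # odds = p(X) / (1 - p(X))
odds_to_p <- function(odds) odds / (1 + odds) # p(X) = odds / (1 + odds)
odds_to_p(p_to_odds(0.8)) # round-trips back to 0.8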
The chart below shows the relationship between odds and probability.
# odds vs probability
plotr <- data.frame(p_X = seq(0,0.99,0.01)) %>% mutate(odds = p_X / (1 - p_X))
ggplot(plotr, aes(p_X, odds)) + geom_line()
The logistic function is responsible for calculating probabilities for classifying the response. The figure below shows the probability of defaulting on a loan given a person’s bank balance (from ISLR::Default).
summary(glm(default ~ balance, family = binomial, data=Default))
##
## Call:
## glm(formula = default ~ balance, family = binomial, data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
## balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
b0 <- -10.65
b1 <- 5.5e-3
logistic_function <- function(b0, b1, X) { # (eq 4.2)
e <- exp(b0 + b1 * X)
e / (1 + e)
}
plotr <- data.frame(balance=Default$balance, P_default=logistic_function(b0=b0, b1=b1, Default$balance))
ggplot(plotr, aes(balance, P_default)) + geom_line() # (equivalent to fig 4.2)
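As a quick sanity check (not part of the original ISLR workflow), the hand-rolled logistic_function() using the rounded coefficients above should closely agree with predict() on the fitted glm; any small differences come from rounding b0 and b1:
fit <- glm(default ~ balance, family = binomial, data = Default)
balances <- c(1000, 1500, 2000)
logistic_function(b0, b1, balances) # rounded coefficients
predict(fit, newdata = data.frame(balance = balances), type = "response") # exact coefficients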
The odds are multiplied by:
exp(b1) # (eq 4.3)
## [1] 1.005515
If the odds are 4, or 4/1, the probability is p(X) = 4/(1+4) = 0.8. An increase in X (balance) of 1 gives an increase in odds of:
odds <- 4
new_odds <- odds * exp(b1)
delta_odds <- new_odds - odds
delta_odds
## [1] 0.02206061
This corresponds to an increase in probability, using p(X) = odds / (1 + odds), of:
(new_odds / (1 + new_odds)) - (odds / (1 + odds))
## [1] 0.0008785482
Checking this against the logistic function yields the same result (first find X for odds of 4).
X_at_odds <- (log(odds) - b0) / b1
X_at_odds
## [1] 2188.417
logistic_function(b0, b1, X_at_odds + 1) - logistic_function(b0, b1, X_at_odds)
## [1] 0.0008785482
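Equivalently, since p(X) / (1 - p(X)) = e^(b0 + b1 X), a one-unit increase in X multiplies the odds by exp(b1) regardless of the starting balance; a quick check (odds_at() is just a local helper for illustration):
odds_at <- function(X) {
  p <- logistic_function(b0, b1, X)
  p / (1 - p) # odds implied by the fitted probability
}
odds_at(1001) / odds_at(1000) # equals exp(b1), ~1.005515
odds_at(2001) / odds_at(2000) # same multiplier at a different balance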
The logistic_regression() function below gives a visual representation of how the logistic function classifies the response Y, compared to the actual probability density of the data, for a single predictor with two classes A and B. The data for each class are randomly generated from a normal distribution.
Arguments:
- nA, nB: number of observations classed as A or B
- meanA, meanB: mean of the normal distribution for A or B
- sdA, sdB: standard deviation of the normal distribution for A or B
- p_cutoff: probability cut-off/threshold for classifying as either A or B
- seed: seed for data generation, defaults to random
- show_conf_matrix: boolean, whether to print the confusion matrix
Graphical output:
- red and blue solid lines: probability density of data point counts for class A and B respectively
- red and blue shaded areas (bounded by dashed lines): probability area of data points for class A and B respectively
- black solid line: logistic function decision boundary
- lower rug plot: data points
- upper rug plot: predictions, red = A, blue = B
logistic_regression <- function(nA, meanA=0, sdA=1, nB, meanB=0, sdB=1,
p_cutoff=0.5, seed=round(runif(1)*1e6), show_conf_matrix=TRUE){
set.seed(seed)
# initialise data, cl = class, p1d = probability density
df <- data.frame(X = c(rnorm(nA, meanA, sdA), rnorm(nB, meanB, sdB)), cl = c(rep("A", nA), rep("B", nB))) %>%
mutate(p1d = ifelse(cl == "A", dnorm(X, meanA, sdA), dnorm(X, meanB, sdB)))
# fit logistic regression model and extract coefficients
logistic_reg <- glm(cl ~ X, data=df, family = "binomial")
summ_log_reg <- summary(logistic_reg)
b0 <- logistic_reg$coefficients[1]
b1 <- logistic_reg$coefficients[2]
p_val <- summ_log_reg$coefficients[2,4]
# calculate probabilities and classify predictions using logistic function
df <- df %>% mutate(p_X = logistic_function(b0, b1, X),
pred_cl = ifelse(p_X < p_cutoff, "A", "B"))
# create confusion matrix
conf_matrix <- table(df %>% select(pred_cl, cl))
prop_matrix_row <- prop.table(conf_matrix, margin=1)
prop_matrix_col <- prop.table(conf_matrix, margin=2)
accuracy <- mean(df$pred_cl == df$cl)
precision <- prop_matrix_row[1,1] # positive predictive value
sensitivity <- prop_matrix_col[1,1] # true positive rate
specificity <- prop_matrix_col[2,2] # true negative rate
annotation <- paste(paste("p value =", round(p_val, 2)),
paste("p(X) cutoff =", p_cutoff),
paste("accuracy =", round(accuracy, 2)),
paste("precision =", round(precision, 2)),
paste("true A rate =", round(sensitivity, 2)),
paste("true B rate =", round(specificity, 2)),
sep = "\n")
if (show_conf_matrix) print(conf_matrix)
# graphical output
ggplot(df, aes(X, col=cl)) +
geom_rug(alpha=0.1) + # data points
geom_density(aes(y=..count../max(..count..))) + # density of data point counts
geom_density(aes(y=..count.., fill=cl), position="fill", alpha=0.1, linetype=2) + # prob. area of data points
geom_line(aes(X, p_X), colour="black") + # logistic function of data points
geom_rug(aes(X, col=pred_cl), alpha=0.1, sides="t") + # predictions
annotate("text", x=min(df$X)*0.5, y=0.8, label=annotation) +
labs(y="normalised count & p(X)")
}
A few comments on how logistic regression performs on a single predictor with two classes generated by normal distributions:
1. A has a mean of 0 and B has a mean of 1. The logistic function decision boundary (black line) closely follows the actual probability areas of the two distributions (shown by the shaded areas bounded by dashed lines). Accuracy, precision, true A rate and true B rate are all very similar due to similar true positive and true negative counts, as well as similar false positive (type I error) and false negative (type II error) counts. Note that any bulges in the shaded probability areas, particularly at low and high X, indicate a disproportionate number of one class, visible because of the low number of observations at these points.
logistic_regression(nA=10000, nB=10000, meanB=1)
## cl
## pred_cl A B
## A 6968 3086
## B 3032 6914
2. B has a standard deviation of 0.5, so the majority of its observations are contained within the range of A. There is no region of X which solely contains B, which can be seen by the blue probability area not reaching p(X) = 1 at the top of the y-axis. Compared to 1. above, this leads to higher accuracy, precision and true A rate at the expense of the true B rate. The logistic function decision boundary (black line) does not closely match the probability area boundary (dashed line), as it does its best to correctly classify predictions given that at no point do the data show a 100% probability of B. Reducing the classification probability cut-off to 0.4 helps account for the way the two distributions overlap: it increases the true B rate at the expense of the true A rate, but increases the overall accuracy and precision of the model.
logistic_regression(nA=5000, nB=3000, meanB=1, sdB=0.5, seed = 687435)
## cl
## pred_cl A B
## A 4051 1224
## B 949 1776
logistic_regression(nA=5000, nB=3000, meanB=1, sdB=0.5, seed = 687435, p_cutoff = 0.4)
## cl
## pred_cl A B
## A 3648 662
## B 1352 2338
3. The mean of B is increased to 3 and both standard deviations are 1. Due to the high separation of the distributions, the black logistic function curve matches the dashed probability area line well and the prediction metrics are all high. Changing the probability cut-off to a low 0.2 again helps equalise the true A and true B rates, this time at a slight expense of accuracy.
logistic_regression(nA=5000, nB=1000, meanB=3, sdB=1, seed = 67312)
## cl
## pred_cl A B
## A 4893 158
## B 107 842
logistic_regression(nA=5000, nB=1000, meanB=3, sdB=1, seed = 67312, p_cutoff = 0.2)
## cl
## pred_cl A B
## A 4712 73
## B 288 927
General comments on these regressions:
- When nA and nB are the same or there is a clear separation between the classes, the boundary between probability areas is close to the logistic function.
- The logistic function decision boundary (black line) doesn't need to closely follow the actual probability boundary (dashed line) to give good results.
- Lots of overlap between the distributions leads to poorer predictions.
- Changing the classifier probability cut-off brings the true A rate and true B rate closer together, at the possible expense of other metrics such as accuracy. Which metric matters depends on the application, e.g. in sales you may be more interested in getting the true positive prediction rate as high as possible, as this represents successfully predicting when someone purchased your product (see the sketch after this list).
- What has been presented here is similar to LDA (linear discriminant analysis), i.e. estimating the probability that an observation belongs to a certain class using Bayes' theorem.
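To make the cut-off trade-off above more concrete, here is a rough sketch that sweeps a range of p_cutoff values on a single simulated data set, re-using the same simulation idea as logistic_regression(). The metric names mirror the chart annotations; exact numbers will depend on the seed.
# sweep classification cut-offs for one simulated data set (sketch, separate from logistic_regression())
set.seed(687435)
nA <- 5000; nB <- 3000
sim <- data.frame(X  = c(rnorm(nA, 0, 1), rnorm(nB, 1, 0.5)),
                  cl = c(rep("A", nA), rep("B", nB)))
fit <- glm(cl ~ X, data = sim, family = "binomial")
p_X <- predict(fit, type = "response") # fitted P(cl == "B")
sapply(seq(0.2, 0.8, 0.1), function(cutoff) {
  pred <- ifelse(p_X < cutoff, "A", "B")
  c(cutoff      = cutoff,
    accuracy    = mean(pred == sim$cl),
    true_A_rate = mean(pred[sim$cl == "A"] == "A"),
    true_B_rate = mean(pred[sim$cl == "B"] == "B"))
})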
Possible improvements:
- Make logistic_regression() universal, able to accept any data frame.
- Use confusion_matrix() from helpers.R in logistic_regression() to reduce the code chunk size.
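For reference, a minimal sketch of what such a confusion_matrix() helper could look like; the actual helpers.R version isn't shown here, so the name, arguments and return value are assumptions:
confusion_matrix <- function(actual, predicted, positive = "A") {
  conf <- table(predicted = predicted, actual = actual)
  list(table       = conf,
       accuracy    = mean(predicted == actual),
       precision   = prop.table(conf, margin = 1)[positive, positive], # positive predictive value
       sensitivity = prop.table(conf, margin = 2)[positive, positive]) # true positive rate
}
Inside logistic_regression(), a helper like this could replace the block that builds conf_matrix, prop_matrix_row and prop_matrix_col.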