Logistic regression predicts the probability of a binary outcome (the model itself is estimated on the log-odds scale). Because the outcome is binary, rather than fitting a line, we fit a sigmoid. So, some of what we are used to in OLS regression no longer applies. For an idea of the logistic function that we discussed in class, start with the formula itself.
\[F(y)=\frac1{1+e^{-y}} \]
There is a lot to explain about why this works, but we will have to cover that in class. Until then, play around with the formula to see what you get.
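If you want to experiment in R, here is a minimal sketch that wraps the formula in a function (the name “logistic” is just a convenient label, not anything built in).
logistic <- function(y) 1/(1 + exp(-y)) # F(y) = 1/(1 + e^(-y))
logistic(2) # returns a value between zero and one
plogis(2) # base R's built-in logistic distribution function gives the same answer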
Let’s start by calculating what three points on a 21-point scale give us when we plug them into the formula. Use negative ten, zero, and ten. That puts negative ten and ten at the extremes, with zero in the middle.
Before you run the calculation, also consider that a binary outcome takes on the values of zero or one. So, the predictions should run between zero and one.
Think of the output as just that: a prediction.
Now, run it.
numbers <- c(-10, 0, 10)
1/(1+exp(-(numbers)))
## [1] 4.539787e-05 5.000000e-01 9.999546e-01
As you can see, the negative value results in a calculation that is very close to zero (0.000045) and the positive number results in a value very close to one (0.99995). Correspondingly, zero is in the center of the scale and it produces something in the very center of the 0 to 1 scale (0.5).
This is all intended to make the point that the logistic formula is describing a continuum between one (yes/success) and zero (no/fail).
For a more compelling depiction, you can plot the formula using a fuller range of values.
This time, we are using all integers between negative ten and ten, rather than just three. That will allow us to create a visual depiction of what happens when each value is run through the logistic function.
nums <- -10:10 # Save all integers from -10 to 10 in an object named "nums"
plot(nums, 1/(1+exp(-nums))) # Run each value through the logistic function to trace the sigmoid curve of a logistic regression
abline(v=0, col="red") # Draw a vertical red line at zero to mark the midpoint of the "nums" scale we just created
To read the plot, above, keep in mind that the “outcome” is given on the vertical axis and ranges from zero to one. The “predictor,” in this case, is the set of integers in the “nums” object. Those values are given on the horizontal axis.
The logistic formula describes a situation where the outcome takes the value of either one or zero. Looking at the plot, you can see that most of the points on the plot are at or very near to either zero or one.
Around the center of the horizontal scale (the predictor), however, there is an inflection point. At the inflection point, there is a brief indeterminate space where the outcome could be either zero or one. This is describing a situation where the outcome is predicted to be zero (0) for negative values of “nums” and one (1) for positive values of “nums.” But, as the values of “nums” approach zero, the prediction (given on the vertical axis) passes through a transition from zero to one.
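If you would rather see that transition as a smooth curve than as 21 separate points, one option (just a sketch, using base R's curve() function) is:
curve(1/(1 + exp(-x)), from = -10, to = 10, xlab = "Predictor", ylab = "Predicted outcome") # smooth sigmoid over the same range
abline(v = 0, col = "red") # mark the midpoint, as before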
In the plot, we are simply plugging a set of integers into the logistic formula. This was done to provide a neat way of demonstrating how the logistic function works. But, we do not have to limit ourselves to just integers. Consider the equation that we use to describe a multiple regression:
\[Y=\alpha+\beta_1x_1+\beta_2x_2+\beta_3x_3+\dots+\beta_kx_k\]
The regression equation is essentially just predicting a set of numbers as well. Everything on the right side of the equation is predicting the outcome, \(Y\). So, if we treat those predicted values as something that tells us whether an activity will succeed (1) or fail (0), then we can just plug the equation into the logistic function. Then the logistic function becomes:
\[F(Y)=\frac{1}{1+e^{-(\alpha+\beta_1x_1+\beta_2x_2+\beta_3x_3+\dots+\beta_kx_k)}}\]
This means that we can use each individual predictor to help us understand the probability (modeled on the log-odds scale) of a binary outcome, similar to the way we do it with OLS regression.
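To make that concrete, here is a small sketch with made-up intercept, slopes, and predictor values (the numbers are purely illustrative).
a <- -1.2; b1 <- 0.8; b2 <- -0.5 # hypothetical intercept and slopes
x1 <- 2; x2 <- 1 # hypothetical predictor values
linpred <- a + b1*x1 + b2*x2 # the right-hand side of the regression equation
1/(1 + exp(-linpred)) # predicted probability of a "success" (1)
plogis(linpred) # the same result from base R's logistic function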
The basic form of a logistic regression also looks a lot like OLS regression. But, because it is a generalization of the linear model, we need a couple of additional inputs for the function. First, you are using a generalization, so use “glm” instead of “lm”. Next, there are a few options for the distribution you are modeling. For logistic regression, we specify the “binomial” family.
The generic equation would, therefore, look like this.
Model <- glm(y~x1+x2, family=binomial(), data=dataset)
The output from the model is available using:
summary(Model)
and you can transform the model estimates to odds ratios using:
exp(Model$coefficients)
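If you also want confidence intervals on the odds-ratio scale, one common approach (a sketch; the interval calculation may take a moment on larger models) is:
exp(cbind(OR = coef(Model), confint(Model))) # odds ratios with their 95% confidence intervals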
There is no F-Statistic for a logistic regression. But, there is something like it.
Did you notice the “deviance” measures in the output? Those summarize the model’s unexplained variation (they are based on the log-likelihood), somewhat like the residual sums of squares used to evaluate an OLS regression. You can use them to approximate what the F-statistic is doing in an OLS regression. But, this time, it will be a chi-square test that we use to evaluate whether the model is more informative than a coin flip.
The null hypothesis is, therefore: \(H_0\): The model predicts no better than chance.
To set it up, use the following steps.
modelChi <- Model$null.deviance - Model$deviance # Chi square value
chidf <- Model$df.null - Model$df.residual # calculate degrees of freedom for chi-square test
pchisq(modelChi, chidf, lower.tail = FALSE) # Use that info to derive p-value
(This code is more or less directly out of Field et al. Discovering Statistics using R)
It is possible to calculate a pseudo-R-square value for a logistic regression. I recommend you take a look at the Field et al. text for instructions.
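For reference, here is a rough sketch of three pseudo-R-squares you will commonly see reported (the formulas follow Field et al.; the object names are my own).
n <- length(Model$fitted.values) # number of cases used in the model
modelChi / Model$null.deviance # Hosmer-Lemeshow R-square ("modelChi" was computed above)
R2.cs <- 1 - exp(-modelChi / n) # Cox-Snell R-square
R2.cs
R2.cs / (1 - exp(-Model$null.deviance / n)) # Nagelkerke R-square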
You will find that, depending on the type of pseudo-R-square calculation you select, the answer can vary widely. Instead, consider evaluating the model in a different, but also intuitive, manner.
To evaluate model efficacy, we generally use a receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate. The area under that curve (AUC) summarizes how well the model discriminates: 0.5 is no better than a coin flip, and 1.0 is perfect discrimination.
Try the ROCit package for a fairly easy means of creating the ROC curve.
# First time: install the package
install.packages("rocit", dependencies = TRUE)
library(ROCit)
class <- Model$y # extract the observed outcomes (1s and 0s) from the model
score <- qlogis(Model$fitted.values) # extract predictions on the log-odds scale (base R's equivalent of car::logit())
# Create the curve
roc <- rocit(score=score, # predicted outcomes
class=class, # actual outcomes
method="bin") # binomial family
plot(roc) # plot the curve
ciAUC(roc) # Calculate the area under the curve w. confidence interval
# or
round(ciAUC(roc)$AUC, 2) # Just the area under the curve
Of course, the initial comparison can just be based on comparing AIC or BIC measures for each model.
AIC(Model) # smaller is better
BIC(Model) # smaller is better; penalizes additional parameters more heavily
But, if you want to see whether an additional variable or two contribute significantly to the model, compare the nested models with the anova() function (adding test = "Chisq" attaches a p-value to the change in deviance).
anova(Model, model.2, test = "Chisq") # "model.2" stands in for a second, expanded model
install.packages("ROCit", dependencies = TRUE)
library(ROCit)
class <- Model$y
score <- logit(Model$fitted.values)
roc <- rocit(score=score, class=class, method="bin")
plot(roc)
ciAUC(roc)
Test for multicollinearity, outliers, confounders, and interactions just as you did for OLS regression. Here are a few, so you don’t have to look them up.
library(car)
vif(Model) # values greater than 10 could indicate a problem
mean(vif(Model)) # a mean substantially greater than 1 could indicate a problem
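For outliers and influential cases, here is a quick sketch using base R diagnostics (the cutoffs are common rules of thumb, not strict limits).
plot(rstudent(Model)) # studentized residuals; cases beyond roughly +/- 2 deserve a closer look
abline(h = c(-2, 2), lty = 2, col = "red") # rule-of-thumb bounds
plot(cooks.distance(Model)) # influence; values approaching 1 are worth inspecting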
The script below pulls these pieces together in a worked example, using data on who survived the Donner Party.
##############################################################
#### Logistic Regression ####
#### Surviving the Donner Party ####
##############################################################
load(file.choose()) # Load the Donner party data
# Start with what we have already done:
## Test using categorical variables
c2 <- chisq.test(donner$male, donner$survive)
c2$observed
c2$expected # Compare observed with expected
c2 # Is there a relationship between gender and survival?
## Now, imagine we could model survival as a function of age:
agemodel <- lm(donner$survive~donner$age)
# Plot the relationship for linear regression:
plot(donner$age, donner$survive, main="Survival as a function of Age", xlab="Age", ylab="Survival")
# Next add the regression line:
abline(agemodel, col="red") # agemodel was already fit above
# To get a better idea of this relationship, we really need to draw a confidence
# interval around the predictor line:
plot.new()
plot(donner$age, donner$survive, xlim=c(10, 75), ylim=c(-1, 2), main="Survival as a function of Age", xlab="Age", ylab="Survival")
# Next add the regression line and rough confidence bounds:
abline(agemodel, col="red") # the fitted line from the model above
abline(confint(agemodel)[1], confint(agemodel)[2], col="green") # line built from the lower CI bounds of the intercept and slope
abline(confint(agemodel)[3], confint(agemodel)[4], col="purple") # line built from the upper CI bounds of the intercept and slope
# Check model assumption of homoskedasticity:
plot(agemodel)
# Alternative (to essentially do the same thing)
# qqnorm(residuals(agemodel))
# plot(rstudent(agemodel))
# abline(h=0)
### Now, try a logistic regression model
lreg1 <- glm(survive~male, data=donner, family=binomial("logit"))
summary(lreg1)
lreg2 <- glm(survive~age, data=donner, family=binomial("logit"))
summary(lreg2)
lreg3 <- glm(survive~male+age, data=donner, family=binomial("logit"))
summary(lreg3)
# To compare model 2 with model 3 (nested models), use the "anova" function.
comp <- anova(lreg2, lreg3) # difference in deviance between the two models
# Test the significance of that difference with a chi-square test.
chp <- 1 - pchisq(comp$Deviance[2], comp$Df[2]) # p-value for adding "male" on top of "age"
chp
# Interpreting individual variables in terms of odds ratios using "exp()"
exp(lreg3$coefficients)
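# Finally, a sketch (assuming "male" is coded 0/1 in the data, which is worth verifying first):
# ask the fitted model for predicted survival probabilities for specific cases.
newcases <- data.frame(male = c(1, 0), age = c(30, 30)) # a hypothetical 30-year-old man and woman
predict(lreg3, newdata = newcases, type = "response") # predicted probability of survival for each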