Logistic regression predicts the probability of a binary outcome (the model itself is estimated on the log-odds scale). Because the outcome is binary, rather than fitting a line, we fit a sigmoid. So, some of what we are used to in OLS regression no longer applies. For an idea of the logistic function that we discussed in class, start with the formula itself.
\[F(y)=\frac1{1+e^{-y}} \]
There is a lot to explain about why this works, but we will have to cover that in class. Until then, play around with the formula to see what you get.
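If you want to experiment in R, here is a minimal sketch that wraps the formula in a function (the name “logistic” is just a convenient label, not anything built in).
logistic <- function(y) 1/(1 + exp(-y)) # F(y) = 1/(1 + e^(-y))
logistic(2) # returns a value between zero and one
plogis(2) # base R's built-in logistic distribution function gives the same answer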
Let’s start by calculating what three points on a 21-point scale give us when we plug them into the formula. Use negative ten, zero, and ten. That puts negative ten and ten at the extremes, with zero in the middle.
Before you run the calculation, also consider that a binary outcome takes on the values of zero or one. So, the predictions should run between zero and one.
Think of the output as just that: a prediction.
Now, run it.
numbers <- c(-10, 0, 10)
1/(1+exp(-(numbers)))
## [1] 4.539787e-05 5.000000e-01 9.999546e-01
As you can see, the negative value results in a calculation that is very close to zero (0.000045) and the positive number results in a value very close to one (0.99995). Correspondingly, zero is in the center of the scale and it produces something in the very center of the 0 to 1 scale (0.5).
This is all intended to make the point that the logistic formula is describing a continuum between one (yes/success) and zero (no/fail).
For a more compelling depiction, you can plot the formula using a fuller range of values.
This time, we are using all integers between negative ten and ten, rather than just three. That will allow us to create a visual depiction of what happens when each value is run through the logistic function.
nums <- -10:10 # Save all integers from -10 to 10 in an object named "nums"
plot(nums, 1/(1+exp(-nums))) # Run each value through the logistic function to trace the sigmoid curve of a logistic regression
abline(v=0, col="red") # Draw a vertical red line at zero to mark the midpoint of the "nums" scale we just created
To read the plot, above, keep in mind that the “outcome” is given on the vertical axis and ranges from zero to one. The “predictor,” in this case, is the set of integers in the “nums” object. Those values are given on the horizontal axis.
The logistic formula describes a situation where the outcome takes the value of either one or zero. Looking at the plot, you can see that most of the points on the plot are at or very near to either zero or one.
Around the center of the horizontal scale (the predictor), however, there is an inflection point. At the inflection point, there is a brief indeterminate space where the outcome could be either zero or one. This is describing a situation where the outcome is predicted to be zero (0) for negative values of “nums” and one (1) for positive values of “nums.” But, as the values of “nums” approach zero, the prediction (given on the vertical axis) passes through a transition from zero to one.
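If you would rather see that transition as a smooth curve than as 21 separate points, one option (just a sketch, using base R's curve() function) is:
curve(1/(1 + exp(-x)), from = -10, to = 10, xlab = "Predictor", ylab = "Predicted outcome") # smooth sigmoid over the same range
abline(v = 0, col = "red") # mark the midpoint, as before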
In the plot, we are simply plugging a set of integers into the logistic formula. This was done to provide a neat way of demonstrating how the logistic function works. But, we do not have to limit ourselves to just integers. Consider the equation that we use to describe a multiple regression:
\[Y=\alpha+\beta_1x_1+\beta_2x_2+\beta_3x_3+\dots+\beta_kx_k\]
The regression equation is essentially just predicting a set of numbers as well. Everything on the right side of the equation is predicting the outcome, \(Y\). So, if we treat those predicted values as something that tells us whether an activity will succeed (1) or fail (0), then we can just plug the equation into the logistic function. Then the logistic function becomes:
\[F(Y)=\frac{1}{1+e^{-(\alpha+\beta_1x_1+\beta_2x_2+\beta_3x_3+\dots+\beta_kx_k)}}\]
This means that we can use each individual predictor to help us understand the probability (modeled on the log-odds scale) of a binary outcome, similar to the way we do it with OLS regression.
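To make that concrete, here is a small sketch with made-up intercept, slopes, and predictor values (the numbers are purely illustrative).
a <- -1.2; b1 <- 0.8; b2 <- -0.5 # hypothetical intercept and slopes
x1 <- 2; x2 <- 1 # hypothetical predictor values
linpred <- a + b1*x1 + b2*x2 # the right-hand side of the regression equation
1/(1 + exp(-linpred)) # predicted probability of a "success" (1)
plogis(linpred) # the same result from base R's logistic function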
The basic form of a logistic regression also looks a lot like OLS regression. But, because it is a generalization of the linear model, we need a couple of additional inputs for the function. First, you are using a generalization, so use “glm” instead of “lm”. Next, there are a few options for the distribution you are modeling. For logistic regression, we specify the “binomial” family.
The generic equation would, therefore, look like this.
Model <- glm(y~x1+x2, family=binomial(), data=dataset)
The output from the model is available using:
summary(Model)
and you can transform the model estimates to odds ratios using:
exp(Model$coefficients)
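If you also want confidence intervals on the odds-ratio scale, one common approach (a sketch; the interval calculation may take a moment on larger models) is:
exp(cbind(OR = coef(Model), confint(Model))) # odds ratios with their 95% confidence intervals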
There is no F-Statistic for a logistic regression. But, there is something like it.
Did you notice the “deviance” measures in the output? Those summarize the model’s unexplained variation (they are based on the log-likelihood), somewhat like the residual sums of squares used to evaluate an OLS regression. You can use them to approximate what the F-statistic is doing in an OLS regression. But, this time, it will be a chi-square test that we use to evaluate whether the model is more informative than a coin flip.
The null hypothesis is, therefore: \(H_0\): The model predicts no better than chance.
To set it up, use the following steps.
modelChi <- Model$null.deviance - Model$deviance # Chi square value
chidf <- Model$df.null - Model$df.residual # calculate degrees of freedom for chi-square test
pchisq(modelChi, chidf, lower.tail = FALSE) # Use that info to derive p-value
(This code is more or less directly out of Field et al. Discovering Statistics using R)
It is possible to calculate a pseudo-R-square value for a logistic regression. I recommend you take a look at the Field et al. text for instructions.
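For reference, here is a rough sketch of three pseudo-R-squares you will commonly see reported (the formulas follow Field et al.; the object names are my own).
n <- length(Model$fitted.values) # number of cases used in the model
modelChi / Model$null.deviance # Hosmer-Lemeshow R-square ("modelChi" was computed above)
R2.cs <- 1 - exp(-modelChi / n) # Cox-Snell R-square
R2.cs
R2.cs / (1 - exp(-Model$null.deviance / n)) # Nagelkerke R-square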
You will find that, depending on the type of pseudo-R-square calculation you select, the answer can vary widely. Instead, consider evaluating the model in a different, but also intuitive, manner.
To evaluate model efficacy, we generally use a receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate. The area under that curve (AUC) summarizes how well the model discriminates: 0.5 is no better than a coin flip, and 1.0 is perfect discrimination.
Try the ROCit package for a fairly easy means of creating the ROC curve.
# First time: install the package
install.packages("rocit", dependencies = TRUE)
library(ROCit)
class <- Model$y # extract the observed outcomes (1s and 0s) from the model
score <- qlogis(Model$fitted.values) # extract predictions on the log-odds scale (base R's equivalent of car::logit())
# Create the curve
roc <- rocit(score=score, # predicted outcomes
class=class, # actual outcomes
method="bin") # binomial family
plot(roc) # plot the curve
ciAUC(roc) # Calculate the area under the curve w. confidence interval
# or
round(ciAUC(roc)$AUC, 2) # Just the area under the curve
Of course, the initial comparison can just be based on comparing AIC or BIC measures for each model.
AIC(Model) # smaller is better
BIC(Model) # smaller is better; penalizes additional parameters more heavily
But, if you want to see whether an additional variable or two contribute significantly to the model, compare the nested models with the anova() function (adding test = "Chisq" attaches a p-value to the change in deviance).
anova(Model, model.2, test = "Chisq") # "model.2" stands in for a second, expanded model
install.packages("ROCit", dependencies = TRUE)
library(ROCit)
class <- Model$y
score <- logit(Model$fitted.values)
roc <- rocit(score=score, class=class, method="bin")
plot(roc)
ciAUC(roc)
Test for multicollinearity, outliers, confounders, and interactions just as you did for OLS regression. Here are a few, so you don’t have to look them up.
library(car)
vif(Model) # values greater than 10 could indicate a problem
mean(vif(Model)) # a mean substantially greater than 1 could indicate a problem
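For outliers and influential cases, here is a quick sketch using base R diagnostics (the cutoffs are common rules of thumb, not strict limits).
plot(rstudent(Model)) # studentized residuals; cases beyond roughly +/- 2 deserve a closer look
abline(h = c(-2, 2), lty = 2, col = "red") # rule-of-thumb bounds
plot(cooks.distance(Model)) # influence; values approaching 1 are worth inspecting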
The script below pulls these pieces together in a worked example, using data on who survived the Donner Party.
##############################################################
#### Logistic Regression ####
#### Surviving the Donner Party ####
##############################################################
load(file.choose()) # Load the Donner party data
# Start with what we have already done:
## Test using categorical variables
c2 <- chisq.test(donner$male, donner$survive)
c2$observed
c2$expected # Compare observed with expected
c2 # Is there a relationship between gender and survival?
## Now, imagine we could model survival as a function of age:
agemodel <- lm(donner$survive~donner$age)
# Plot the relationship for linear regression:
plot(donner$age, donner$survive, main="Survival as a function of Age", xlab="Age", ylab="Survival")
# Next add the regression line:
abline(agemodel, col="red") # agemodel was already fit above
# To get a better idea of this relationship, we really need to draw a confidence
# interval around the predictor line:
plot.new()
plot(donner$age, donner$survive, xlim=c(10, 75), ylim=c(-1, 2), main="Survival as a function of Age", xlab="Age", ylab="Survival")
# Next add the regression line and rough confidence bounds:
abline(agemodel, col="red") # the fitted line from the model above
abline(confint(agemodel)[1], confint(agemodel)[2], col="green") # line built from the lower CI bounds of the intercept and slope
abline(confint(agemodel)[3], confint(agemodel)[4], col="purple") # line built from the upper CI bounds of the intercept and slope
# Check model assumption of homoskedasticity:
plot(agemodel)
# Alternative (to essentially do the same thing)
# qqnorm(residuals(agemodel))
# plot(rstudent(agemodel))
# abline(h=0)
### Now, try a logistic regression model
lreg1 <- glm(survive~male, data=donner, family=binomial("logit"))
summary(lreg1)
lreg2 <- glm(survive~age, data=donner, family=binomial("logit"))
summary(lreg2)
lreg3 <- glm(survive~male+age, data=donner, family=binomial("logit"))
summary(lreg3)
# To compare model 2 with model 3 (nested models), use the "anova" function.
comp <- anova(lreg2, lreg3) # difference in deviance between the two models
# Test the significance of that difference with a chi-square test.
chp <- 1 - pchisq(comp$Deviance[2], comp$Df[2]) # p-value for adding "male" on top of "age"
chp
# Interpreting individual variables in terms of odds ratios using "exp()"
exp(lreg3$coefficients)
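# Finally, a sketch (assuming "male" is coded 0/1 in the data, which is worth verifying first):
# ask the fitted model for predicted survival probabilities for specific cases.
newcases <- data.frame(male = c(1, 0), age = c(30, 30)) # a hypothetical 30-year-old man and woman
predict(lreg3, newdata = newcases, type = "response") # predicted probability of survival for each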