logo logo

Try Logistic Regression in R!

Now that you know some of the technical details, let’s look at how this can be implemented in R. You can follow along with the code below or use this code as a reference when fitting your own models.

Objectives:

Learn about some plots to visualize binary-response data.
Learn how to code logistic regression with 1 or more variables.
Learn how to interpret coefficients in logistic regression.
Learn how to use ROC and AUC to assess predictive performance of a model.

The Setting:

In this exploration of logistic regression, we have survey data obtained from FiveThirtyEight’s github page. The survey asks respondents about whether they have seen any of the six Star Wars films, and then asks them to rate each film. You can access it at this link: Star Wars Survey Data (link).

The survey contains a lot of information about respondents as well, and not every person who was asked about the films had actually seen them. That’s the question that we’re going to try to analyze: What factors are associated with whether or not a survey respondent has actually seen a “Star Wars” film?

Our response variable:

1: A respondent HAS seen at least one Star Wars film
0: A respondent HAS NOT seen any Star Wars film

You might be surpised, but there are more people than you’d think who have never seen a Star Wars film.

Let’s load in the data directly from the web:

sw <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv", sep = ",")

Because these data are coming directly from a survey, there are a lot of columns that refer to survey questions. For this analysis, we’re mostly only interested in our response variable and some demographic predictors at the end. Let’s go ahead and use the ‘dplyr’ package in R to select just the variables we want. Be warned, some of the variable names in this data set are pretty crazy!

#let's get a vector of variable names that we want
var.names <- c("Have.you.seen.any.of.the.6.films.in.the.Star.Wars.franchise.",
               "Do.you.consider.yourself.to.be.a.fan.of.the.Star.Trek.franchise.",
               "Gender","Age","Household.Income",
               "Education", "Location..Census.Region.")
#told ya the variable names were crazy! 
#let's go ahead and create a new dataset

sw.new <- select(sw, one_of(var.names))
sw.new <- sw.new[-c(1),]

Ok, so we have a dataset. Before we analyze anything, let’s do a little bit of “data wrangling” to make things easier.

First, let’s change the variable names to something manageable.
Second, let’s recode the levels of each variable and treat them as a factor.

#rename the columns
names(sw.new) <- c("Y","Trekkie","Gender","Age","Income","Education","Location")

#recode the columns as factors:
sw.new.f <- lapply(sw.new, as.factor) #codes all the predictors as factors
sw.new.f$Ynew <- ifelse(sw.new.f$Y == "No",0,1) #need numeric, binary response

#It's OK to have the X variables (explanatory variables) coded as factors
#However, the response variable Y needs to be coded as a 0,1 binary vector of numbers

Great, now we have a dataset that we can analyze! A little bit of exploratory analysis is in order first. One way to analyze binary response data, especially if we only have a few predictors, is to look at mosaic plots. These are just visual representations of contingency tables that show the proportion of Yes/No responses by each level of a predictor. Let’s take a look at some for this dataset.

First, because many of our categories are ordinal, we want to make sure that their order is correct. We don’t need to worry about Gender, Location, or whether or not the respondent likes Star Trek. Those categories do not have implicit ordering. However, the order of age, income, and education do matter, so we want that to be reflected in the plots. We can order the levels of the factor using the relevel function in R.

par(mar = c(2,1,3,1))
#create a mosaicplot
mosaicplot(table(sw.new.f$Trekkie, sw.new.f$Ynew), col = c('steelblue4','seagreen1'), main = "Trek Fan Status vs. Whether Respondent Has Seen Star Wars")

mosaicplot(table(sw.new.f$Gender,sw.new.f$Ynew), col = c('steelblue4','seagreen1'), main = "Gender vs. Whether Respondent Has Seen Star Wars")

mosaicplot(table(sw.new.f$Age,sw.new.f$Ynew), col = c('steelblue4','seagreen1'), las = 2, main = "Age vs. Whether Respondent Has Seen Star Wars")

mosaicplot(table(sw.new.f$Income,sw.new.f$Ynew), col = c('steelblue4','seagreen1'), las = 2, main = "Income vs. Whether Respondent Has Seen Star Wars")

mosaicplot(table(sw.new.f$Education,sw.new.f$Ynew), col = c('steelblue4','seagreen1'), las = 1, main = "Education Level vs. Whether Respondent Has Seen Star Wars")

mosaicplot(table(sw.new.f$Location,sw.new.f$Ynew), col = c('steelblue4','seagreen1'), las = 2, cex = .6, main = "Location vs. Whether Respondent Has Seen Star Wars")

These plots produce nice visualizations of the proportion of success by each predictor. For instance, we can see that a much higher proportions of respondents who identified as not being fans of Star Trek also had not seen Star Wars. This makes sense; people who are not fans of science fiction probably pay less attention to both Star Trek and Star Wars films/TV shows. The mosaic plots for gender and age show that male respondents are slightly more likely to have seen Star Wars than women and that older respondents (age 60 and older) are slightly less likely to have seen a Star Wars film. Differences in viewership are nearly identical between lower levels of income and education - an interesting and probably related trend that we should be aware of. Finally,there are only slight differences in proportions of respondents that have seen Star Wars by location.

Based on these plots, which variables do you think are important? This is a bit of a subjective question, but it certainly appears that Star Trek Fan Status, Age, and Gender exhibit differences in the response. By contrast, the predictors Location, Education, and Income seem to have fairly similar proportions of successes and failures in each level of the predictor.

Note: In some cases, we may have too many variables to easily assess plots of the predictors vs. the binned response. In such cases, it’s a good idea to use variable selection processes such as forward-backward AIC selection or Lasso regression to inform model building.

Simple Logistic Regression:

Based on our mosaic plots, it looks like Gender might be an important variable to consider. Let’s ask the question: Does the probability of having seen Star Wars differ for men and women?

We can answer this question with the following model, where $\pi_i$ is defined as the estimated probability of having seen at least one Star Wars film:

\[\hat{logit(\pi_i )} = \hat{\beta_0} + \hat{\beta_1} * Gender_i \]

This model is relatively easy to run using the glm function in R, as shown in the code chunk below. When fitting logistic regression models, we do have to add an additional argument specifying the link function and the family of distributions we are assuming. Here, we assume that the response is distributed under a binomial distribution with a logit link function. We also want to make sure that our baseline category is FEMALE, which we do in the first line of code.

#sets the FEMALE level as the baseline category:
sw.new.f$Gender <- relevel(sw.new.f$Gender, ref = "Female")

#runs the logistic regression model
glm1 <- glm(Y ~ Gender, data = sw.new.f, family = binomial(link = logit))
pander(summary(glm1))

	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	0.9601	0.09538	10.07	7.863e-24
GenderNA	0.6155	0.2437	2.526	0.01155
GenderMale	0.7833	0.158	4.956	7.19e-07

(Dispersion parameter for binomial family taken to be 1 )

Null deviance:	1222 on 1185 degrees of freedom
Residual deviance:	1194 on 1183 degrees of freedom

Looking at the summary, we can see estimates for an intercept, GENDERNA, and GENDERMALE levels. The GENDERNA refers to respondents who simply did not respond to the question, and GENDERMALE refers to anyone who selected MALE as their gender.

Interpreting a Logistic Regression Model

Unlike Han Solo, we want to know all about the odds.

Unlike linear regression, we cannot directly interpret the coefficients because logistic regression provides estimates on a logit scale. Thankfully, we can get compare the odds of a male respondent having seen Star Wars vs. the odds of a female respondent using the exponential function:

#coefficient estimates
est <- coef(glm1)

#estimated odds of a female having seen Star Wars (b0 + 0*male + 0*NA)
exp(est[1])

## (Intercept) 
##    2.611842

#estimated odds of a male having seen star wars (b0 + 1*male + 0*NA)
exp(est[1] + est['GenderMale'])

## (Intercept) 
##    5.716216

What we now know is that the odds that a male respondent has seen Star Wars is 5.7 to 1, whereas for female respondents, that ratio is only 2.61 to 1. The odds ratio of males to females is $\frac{5.7}{2.6} = 2.19$, indicating that men are roughly 2.2 times as likely to have seen at least one Star Wars film.

Interpreting logistic regression models, especially their coefficients, can be hard. Most problems faced in industry involve prediction rather than interpreting inferential results, so this complexity can be avoided in many settings. However, if you are interested in learning more about interpreting odds ratios and probabilities, check out this link: UCLA Interpretations.

Multiple Predictor Variables

Actually, you can have many more than 2 variables in a logistic regression model.

Of course, we know much more about our respondents than just their gender. Let’s try fitting a model that includes all of the information in the dataset, including Gender, Age, Income, Education, Location and Start Trek fan status. The model would look something like the following:

\[logit(\pi_i) = \beta_0 + \beta_1Gender_i + \beta_2Age_i + \beta_3Location_i + \beta_4TrekFan_i + \beta_5Education_i + \beta_6Income_i \]

We can fit the model just as easily as in the single variable case by adding to the formula in the glm function.

#runs the logistic regression model
glm2 <- glm(Y ~ Gender + Age + Location + Trekkie + Education + Income, data = sw.new.f, family = binomial(link = logit))
pander(summary(glm2))

	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	2.345	1.49	1.573	0.1157
GenderNA	0.03483	1.453	0.02397	0.9809
GenderMale	0.7045	0.1824	3.863	0.0001122
Age18-29	1.148	0.267	4.301	1.699e-05
Age30-44	0.4	0.2304	1.736	0.08252
Age45-60	0.6053	0.2401	2.521	0.01169
LocationEN Central	-0.508	1.311	-0.3876	0.6983
LocationES Central	-0.3065	1.405	-0.2182	0.8272
LocationMiddle Atlantic	-0.7057	1.322	-0.5338	0.5935
LocationMountain	0.2277	1.348	0.1688	0.8659
LocationNew England	-0.2524	1.331	-0.1896	0.8496
LocationPacific	0.1837	1.316	0.1396	0.889
LocationSouth Atlantic	-0.2731	1.315	-0.2077	0.8354
LocationWN Central	-0.2244	1.329	-0.1689	0.8659
LocationWS Central	-1.197	1.32	-0.9069	0.3645
TrekkieNo	-3.472	0.6092	-5.699	1.205e-08
TrekkieYes	-0.4796	0.6679	-0.7181	0.4727
Education< HS	-0.362	1.121	-0.3228	0.7468
EducationHS	0.8633	0.7582	1.139	0.2548
EducationSome College	1.19	0.7376	1.613	0.1067
EducationBachelor	1.518	0.7385	2.056	0.03983
EducationGrad	1.693	0.7417	2.283	0.02245
Income$0 - $24,999	-0.4938	0.3048	-1.62	0.1052
Income$25,000 - $49,999	0.1359	0.2908	0.4674	0.6402
Income$50,000 - $99,999	0.1621	0.2673	0.6065	0.5442
Income$100,000 - $149,999	0.1734	0.3229	0.537	0.5913
Income$150,000+	-0.07356	0.3626	-0.2029	0.8392

(Dispersion parameter for binomial family taken to be 1 )

Null deviance:	1222 on 1185 degrees of freedom
Residual deviance:	926 on 1159 degrees of freedom

Take a look at the summary. Notice anything funny?

The baseline level for each factor (Female, Age NA, etc.) does not get a coefficient estimate. Instead, these respondents are the baseline category (i.e. the intercept of the model).
You might notice that the model does not estimate a coefficient for “Age > 60”. This is because that predictor does not add enough independent, new information to the model. This problem is known as colinearity and R tries do deal with it by dropping variables with redundant information. Another common way to solve this problem (and the way Einstein Discovery solves it) is to use lasso-based regression methods.

In a model like this, it is possible to interpret the individual coefficients on an odds scale like we did with the single-variable model. You still get a measure of statistical significance and can identify which variables differ in statistically-meaningful ways and which ones do not.

However, it is often more useful to use these types of models to make predictions, as we will do in the next section.

Making Predictions

In logistic regression, there are several types of predictions that we consider making:

Predicted Propensity Scores: These are predictions of $\pi_i$, the probability of success. These are useful for lead scoring models because they tell us which observations have the highest probability of being a success.
Predicted Classes: Based on a threshold (usually 0.5), we can directly predict which class a response will classified in (either 0 or 1).

We can assess both, and take a look at the resulting predictions. From our first row, we can see that a male between 18-29 who does not like Star Trek, has a high-school education and did not report his income, from the South Atlantic, has a propensity score of 0.79 and a predicted class of 1.

#gives the predicted probabilities
predict.prob <- predict(glm2, type = "response")

#gives the predicted classes
thresh <- 0.5
predict.class <- ifelse(predict.prob > thresh, 1, 0)

#data frame
predicted_data <- data.frame(cbind(data.frame(glm2$data), round(predict.prob,2),predict.class))
names(predicted_data) [c(9,10)] <- c("Propensity Score", "Predicted Class")

datatable(predicted_data, options = list(scrollX = 75, pageLength = 5))

Assess Model Performance

Finally, we can assess how well the model predicted using a tool called the Receiver Operating Characteristic (ROC) Curve. A measure of how well our model predicts the true class of an observation is obtained by calculating the area under the curve (AUC).

Quick AUC Interpretations:

0.5: A coin-flip. Your model isn’t doing very well.
0.6: Your model is doing a bit better than a coin flip, but not great.
0.7-0.85: Pretty good model fit.
> 0.85 : The model is predicting pretty close to the truth.

Let’s check the AUC of the model and examine a plot. As it turns out, this model does a pretty nice job of predicting whether or not someone has seen a Star Wars film.

#calculates the ROC curve
roc1 <- pROC::roc(glm2$data$Y, predict.prob)
roc1

## 
## Call:
## roc.default(response = glm2$data$Y, predictor = predict.prob)
## 
## Data: predict.prob in 250 controls (glm2$data$Y No) < 936 cases (glm2$data$Y Yes).
## Area under the curve: 0.8289

#let's plot the curve
plot(roc1, main = "ROC Curve of Full GLM", lwd = 2, col = "blue4")
legend('bottomright', legend = paste("AUC is: ",round(auc(roc1),2)))

Putting It All Together

We’ve just fit a logistic regresssion model that predicts whether or not someone has seen a Star Wars film based on just a few pieces of information about them. At Atrium, we are often tasked with fitting logistic regression models for lead lead scoring, opportunity scoring, or two-group classification problems. The methods presented in this document can be used for any of those types of models.

Atrium University ML Tools

Logistic Regression: Try It in R!