This week we continued our work on logistic regression. I will focus mostly on the interpretations of the coefficients, odds, and probabilities as I find these aspects to be the most challenging and confusing.
I will use the Titanic data set from this week's homework to review these concepts.
library(dplyr)
library(faraway)
library(stableGR)
data(titanic.complete)
attach(titanic.complete)
First, I will model whether a passenger survived or not with the predictor variable fare.
modelfare <- glm(Survived ~ Fare, family = binomial)
summary(modelfare)
##
## Call:
## glm(formula = Survived ~ Fare, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5623 -0.9077 -0.8716 1.3412 1.5731
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.894502 0.107385 -8.330 < 2e-16 ***
## Fare 0.015738 0.002489 6.323 2.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 960.90 on 711 degrees of freedom
## Residual deviance: 899.16 on 710 degrees of freedom
## AIC: 903.16
##
## Number of Fisher Scoring iterations: 5
Now I will interpret this model's coefficient in a couple of ways.
The coefficient on Fare is the focus of this analysis.
coef(modelfare)[2]
## Fare
## 0.01573786
Because the coefficient is positive, the log odds of surviving increase as the fare increases. This makes sense logically, as we would expect richer passengers to get priority on a sinking ship.
Next are the odds.
exp(coef(modelfare)[2])
## Fare
## 1.015862
The odds of surviving are multiplied by 1.0159 (an increase of about 1.6%) for every one-dollar increase in the fare a passenger paid.
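Because the coefficient lives on the log-odds scale, the multiplier for a larger fare difference comes from scaling the coefficient before exponentiating. Here is a minimal sketch of that arithmetic, with the Fare estimate copied from the summary output and a 10-dollar step chosen purely for illustration:

```r
# Fare estimate copied from the model summary (rounded as printed)
b_fare <- 0.015738

# Odds multiplier for a one-dollar increase in fare
or_one <- exp(b_fare)

# Odds multiplier for a k-dollar increase: exp(k * b), which equals or_one^k
k <- 10  # illustrative step size (an assumption, not from the homework)
or_k <- exp(k * b_fare)

# Percent change in the odds for each step size
pct_change <- 100 * (c(one_dollar = or_one, k_dollars = or_k) - 1)
round(pct_change, 1)
```

The multiplier compounds, so a 10-dollar increase raises the odds by somewhat more than ten times the one-dollar percentage.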
Lastly, the probability. It is useful to pick a reasonable example fare and evaluate the probability of survival at that value.
Let's say someone paid 75 dollars as their fare.
example_fare <- data.frame(Fare = 75)
predict(modelfare, example_fare, type = "response")
## 1
## 0.5709769
This can be interpreted as: a passenger who pays a 75 dollar fare has about a 57.1% chance of surviving.
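To see where that number comes from, the same prediction can be rebuilt by hand from the two coefficients. This is a sketch using the estimates as printed in the summary, so the result agrees with predict() only up to rounding:

```r
# Estimates copied from summary(modelfare), rounded as printed
b0     <- -0.894502
b_fare <-  0.015738
fare   <- 75

# Step 1: linear predictor = log odds of survival at this fare
log_odds <- b0 + b_fare * fare

# Step 2: exponentiate to get the odds
odds <- exp(log_odds)

# Step 3: convert odds to a probability
prob <- odds / (1 + odds)

# plogis() is base R's inverse logit and does steps 2 and 3 in one call
all.equal(prob, plogis(log_odds))
round(prob, 3)
```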
Another thing to consider is ranked categorical variables or binary variables. These operate and are interpreted slightly differently.
I'll examine this difference in a model with the passenger's class as the predictor.
modelclass <- glm(Survived ~ Pclass, family = binomial)
summary(modelclass)
##
## Call:
## glm(formula = Survived ~ Pclass, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4533 -0.7399 -0.7399 0.9246 1.6908
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6286 0.1548 4.061 4.88e-05 ***
## Pclass2 -0.7096 0.2171 -3.269 0.00108 **
## Pclass3 -1.7844 0.1986 -8.987 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 960.90 on 711 degrees of freedom
## Residual deviance: 868.11 on 709 degrees of freedom
## AIC: 874.11
##
## Number of Fisher Scoring iterations: 4
coef(modelclass)
## (Intercept) Pclass2 Pclass3
## 0.6286087 -0.7095777 -1.7843794
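The log odds for class 1 are built into the intercept, since class 1 is the reference level. The Pclass2 and Pclass3 coefficients are differences in log odds relative to class 1, so the log odds for those classes are the intercept plus the coefficient: about -0.08 for class 2 and -1.16 for class 3. Class 1 is therefore the only class with positive log odds of surviving.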
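Since each non-reference coefficient is an offset from the intercept, the per-class log odds and survival probabilities can be worked out directly. Here is a small sketch with the estimates copied from coef(modelclass); the names class1, class2 and class3 are labels introduced here for readability:

```r
# Estimates copied from coef(modelclass)
coefs <- c(intercept = 0.6286087, Pclass2 = -0.7095777, Pclass3 = -1.7843794)

# Log odds per class: intercept alone for the reference class,
# intercept plus the class coefficient for the others
log_odds <- c(
  class1 = coefs[["intercept"]],
  class2 = coefs[["intercept"]] + coefs[["Pclass2"]],
  class3 = coefs[["intercept"]] + coefs[["Pclass3"]]
)

# Inverse logit turns each log odds into a survival probability
prob <- plogis(log_odds)

round(rbind(log_odds, prob), 3)
```

A positive log odds corresponds to a probability above 0.5, and a negative log odds to a probability below 0.5.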
Exponentiating the coefficients moves them to the odds scale, which helps illustrate the relationship among the classes.
exp(coef(modelclass))
## (Intercept) Pclass2 Pclass3
## 1.8750000 0.4918519 0.1679012
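Only the first value is itself an odds: 1.875 is the odds of survival for a class 1 passenger, the reference level. The other two values are odds ratios relative to class 1. The odds of survival for a class 2 passenger are 0.49 times the odds for a class 1 passenger, and the odds for a class 3 passenger are 0.17 times the odds for a class 1 passenger.
Another way to interpret these odds ratios is to invert them so that class 1 is in the numerator.
1/exp(coef(modelclass)[2])
## Pclass2
## 2.033133
The odds of a class 1 passenger surviving are about 2.03 times the odds of a class 2 passenger surviving.
1/exp(coef(modelclass)[3])
## Pclass3
## 5.955882
Shockingly, a class 1 passenger has about 5.96 times the odds of surviving when compared to a class 3 passenger.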
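An odds ratio between any two classes is the exponentiated difference of their log odds, so the intercept cancels out. The sketch below checks this for class 2 versus class 3, a comparison where neither class is the reference level. The estimates are copied from coef(modelclass), and the variable names are my own:

```r
# Estimates copied from coef(modelclass)
b0 <-  0.6286087  # intercept (class 1 log odds)
b2 <- -0.7095777  # Pclass2 offset
b3 <- -1.7843794  # Pclass3 offset

# Odds of survival within each class
odds <- c(class1 = exp(b0), class2 = exp(b0 + b2), class3 = exp(b0 + b3))

# Odds ratio of class 2 versus class 3, computed two ways
or_2_vs_3        <- odds[["class2"]] / odds[["class3"]]  # ratio of the odds
or_2_vs_3_direct <- exp(b2 - b3)                         # intercept cancels

all.equal(or_2_vs_3, or_2_vs_3_direct)
round(or_2_vs_3, 2)
```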
The last thing I want to mention is how useful the dplyr package is when it comes to logistic regression. Categorical predictor variables are often easier to interpret and understand, so mutating a quantitative variable into a binary or categorical variable can be quite useful.
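As a rough illustration of that idea, here is how a fare column could be recoded with mutate(). The data frame, the cutoff of 30, and the band break points are all made up for this sketch and are not taken from titanic.complete:

```r
library(dplyr)

# Made-up fares for illustration only; not rows from titanic.complete
toy <- data.frame(Fare = c(7.25, 13.00, 26.55, 71.28, 151.55))

toy <- toy %>%
  mutate(
    # Binary version: split at a hypothetical cutoff of 30
    HighFare = if_else(Fare > 30, "high", "low"),
    # Three-level version using cut(); break points are assumptions,
    # and include.lowest = TRUE keeps a fare of exactly 0 in the first band
    FareBand = cut(Fare,
                   breaks = c(0, 10, 50, Inf),
                   labels = c("low", "mid", "high"),
                   include.lowest = TRUE)
  )

toy
```

A column built this way can go on the right-hand side of glm() just like Pclass, and its coefficients are then read as offsets from whichever level R treats as the reference.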