This week we continued our work on logistic regression. I will focus mostly on the interpretations of the coefficients, odds, and probabilities as I find these aspects to be the most challenging and confusing.

I will use the Titanic data set from this week's homework to review these concepts.

library(dplyr)
library(faraway)
library(stableGR)
data(titanic.complete)
attach(titanic.complete)

Logistic Regression

First, I will model whether a passenger survived or not with the predictor variable fare.

modelfare <- glm(Survived ~ Fare, family = binomial)
summary(modelfare)
## 
## Call:
## glm(formula = Survived ~ Fare, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5623  -0.9077  -0.8716   1.3412   1.5731  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.894502   0.107385  -8.330  < 2e-16 ***
## Fare         0.015738   0.002489   6.323 2.57e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 960.90  on 711  degrees of freedom
## Residual deviance: 899.16  on 710  degrees of freedom
## AIC: 903.16
## 
## Number of Fisher Scoring iterations: 5

Now I will interpret this model's coefficients in a couple of ways.

The coefficient of Fare will be the focus of my analysis.

coef(modelfare)[2]
##       Fare 
## 0.01573786

Because the coefficient is positive, the log odds of surviving increase as the fare increases: each one-dollar increase in fare raises the log odds by about 0.0157. This makes intuitive sense, as we would expect wealthier passengers to get priority on a sinking ship.

Next are the odds.

exp(coef(modelfare)[2])
##     Fare 
## 1.015862

The odds of surviving are multiplied by a factor of 1.0159 with every one-dollar increase in the fare a passenger paid. In other words, each extra dollar raises the odds of survival by about 1.6%.
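To make the "multiplicative factor" idea concrete, here is a minimal sketch that recomputes the odds by hand. It assumes the coefficients are typed in from the summary output above instead of being pulled from the fitted model, and `odds_at` is a helper name made up for this illustration.

```r
# Coefficients copied (rounded) from the summary(modelfare) output above
b0 <- -0.894502
b1 <-  0.015738

# Hypothetical helper: odds of survival at a given fare, exp(b0 + b1 * fare)
odds_at <- function(fare) exp(b0 + b1 * fare)

# The ratio of odds for fares one dollar apart is exp(b1),
# no matter which fare we start from
ratio_low  <- odds_at(11)  / odds_at(10)
ratio_high <- odds_at(101) / odds_at(100)

# A 10-dollar increase multiplies the odds by exp(10 * b1),
# which is not the same as 10 * exp(b1)
ratio_ten <- odds_at(60) / odds_at(50)

c(one_dollar_low = ratio_low, one_dollar_high = ratio_high, ten_dollars = ratio_ten)
```

The first two values match exp(coef(modelfare)[2]), and the third comes out to roughly 1.17, so a 10-dollar increase in fare raises the odds by about 17%.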

Lastly, the probability. Unlike the odds, the probability does not change by a constant amount per dollar, so it is useful to pick a reasonable fare and evaluate the probability of survival at that value.

Let's say someone paid 75 dollars as their fare.

example_fare <- data.frame(Fare = 75)

predict(modelfare, example_fare, type = "response")
##         1 
## 0.5709769

This can be interpreted as: a passenger who pays a 75 dollar fare has about a 57.1% chance of surviving.
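The prediction above can also be reproduced step by step, which I find helps connect log odds, odds, and probability. This is a sketch under the assumption that the rounded coefficients from the summary output are typed in by hand, so the last digits may differ slightly from predict().

```r
# Coefficients copied (rounded) from the summary(modelfare) output above
b0 <- -0.894502
b1 <-  0.015738

log_odds_75 <- b0 + b1 * 75             # linear predictor (log odds)
odds_75     <- exp(log_odds_75)         # odds of survival
prob_75     <- odds_75 / (1 + odds_75)  # probability; same as plogis(log_odds_75)

# The same one-dollar increase changes the probability by different
# amounts depending on the starting fare
fares <- c(10, 75, 300)
step  <- plogis(b0 + b1 * (fares + 1)) - plogis(b0 + b1 * fares)

round(c(log_odds = log_odds_75, odds = odds_75, prob = prob_75), 4)
data.frame(fare = fares, change_in_prob = step)
```

The probability for a 75 dollar fare should come out at about 0.571, in line with the predict() result.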

Another thing to consider is ranked categorical variables or binary variables. These operate and are interpreted slightly differently.

I’ll examine this difference using a model with the passenger’s class as the predictor.

modelclass <- glm(Survived ~ Pclass, family = binomial)
summary(modelclass)
## 
## Call:
## glm(formula = Survived ~ Pclass, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4533  -0.7399  -0.7399   0.9246   1.6908  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.6286     0.1548   4.061 4.88e-05 ***
## Pclass2      -0.7096     0.2171  -3.269  0.00108 ** 
## Pclass3      -1.7844     0.1986  -8.987  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 960.90  on 711  degrees of freedom
## Residual deviance: 868.11  on 709  degrees of freedom
## AIC: 874.11
## 
## Number of Fisher Scoring iterations: 4
coef(modelclass)
## (Intercept)     Pclass2     Pclass3 
##   0.6286087  -0.7095777  -1.7843794

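The log odds for Pclass1 are built into the intercept, because class 1 is the reference level. This is the only class that has positive log odds of surviving (0.63). The Pclass2 and Pclass3 coefficients are differences from class 1, so the log odds of survival drop to 0.63 - 0.71 = -0.08 for class 2 and 0.63 - 1.78 = -1.16 for class 3.

Exponentiating the coefficients puts this relationship on the odds scale.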
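A small sketch of how the per-class log odds can be rebuilt from the coefficients, and what they mean as probabilities. It assumes the values are copied by hand from the coef(modelclass) output above; the vector names are just labels chosen for this example.

```r
# Coefficients copied from the coef(modelclass) output above
b <- c(intercept = 0.6286087, Pclass2 = -0.7095777, Pclass3 = -1.7843794)

# Class 1 is the reference level, so its log odds are the intercept alone;
# the other classes add their coefficient to the intercept
log_odds <- c(
  class1 = b[["intercept"]],
  class2 = b[["intercept"]] + b[["Pclass2"]],
  class3 = b[["intercept"]] + b[["Pclass3"]]
)

# plogis() converts log odds to probabilities
prob <- plogis(log_odds)

round(rbind(log_odds, prob), 3)
```

On the probability scale this works out to roughly 65%, 48%, and 24% survival for classes 1, 2, and 3.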

exp(coef(modelclass))
## (Intercept)     Pclass2     Pclass3 
##   1.8750000   0.4918519   0.1679012

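Only the first value is an odds: exp(Intercept) = 1.875 is the odds of survival for a class 1 passenger. The other two values are odds ratios relative to class 1. The odds of survival for a class 2 passenger are 0.49 times those of a class 1 passenger, and for a class 3 passenger they are 0.17 times those of a class 1 passenger.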

Another way to interpret these coefficients is as odds ratios with class 1 in the numerator. Because the Pclass2 and Pclass3 values are already odds ratios relative to class 1, taking the reciprocal flips the comparison.

1/exp(coef(modelclass)[2])
##  Pclass2 
## 2.033133

The odds of a class 1 passenger surviving are 2.03 times the odds of a class 2 passenger surviving.

1/exp(coef(modelclass)[3])
##  Pclass3 
## 5.955882

Strikingly, a class 1 passenger has about 5.96 times the odds of surviving when compared to a class 3 passenger.
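As a cross-check on odds ratios between classes, the sketch below first builds the odds for each class from the coefficients and then divides them. It assumes the coefficients are copied by hand from the coef(modelclass) output; the variable names are made up for this example.

```r
# Coefficients copied from the coef(modelclass) output above
b <- c(intercept = 0.6286087, Pclass2 = -0.7095777, Pclass3 = -1.7843794)

# Odds of survival within each class
odds_class1 <- exp(b[["intercept"]])
odds_class2 <- exp(b[["intercept"]] + b[["Pclass2"]])
odds_class3 <- exp(b[["intercept"]] + b[["Pclass3"]])

# Odds ratios: the intercept cancels, leaving only the class coefficients
or_1_vs_2 <- odds_class1 / odds_class2   # equals exp(-Pclass2)
or_1_vs_3 <- odds_class1 / odds_class3   # equals exp(-Pclass3)
or_2_vs_3 <- odds_class2 / odds_class3   # equals exp(Pclass2 - Pclass3)

round(c(class1_vs_2 = or_1_vs_2, class1_vs_3 = or_1_vs_3, class2_vs_3 = or_2_vs_3), 2)
```

Because the intercept cancels in every ratio, an odds ratio between two classes depends only on the difference between their coefficients.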

The last thing I want to mention is how useful the dplyr package is when it comes to logistic regression. I find categorical predictors much easier to interpret and understand, so mutating a quantitative variable into a binary or categorical variable can be quite useful.
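Here is a rough sketch of what that mutation could look like. The data frame, the 50 dollar cutoff, and the band boundaries are all made up for illustration and are not taken from the homework; only the column names Fare and Survived mirror the Titanic data.

```r
library(dplyr)

# Hypothetical toy data: values are invented for this example
toy <- data.frame(
  Fare     = c(7.25, 8.05, 13, 26, 30.5, 52, 71.28, 80, 151.55, 263),
  Survived = c(0, 0, 0, 1, 0, 1, 1, 0, 1, 1)
)

toy <- toy %>%
  mutate(
    # Binary version of Fare (the 50 dollar cutoff is an assumption)
    HighFare = if_else(Fare > 50, 1L, 0L),
    # Three-level version of Fare (band boundaries are assumptions)
    FareBand = case_when(
      Fare < 15 ~ "low",
      Fare < 60 ~ "mid",
      TRUE      ~ "high"
    ),
    FareBand = factor(FareBand, levels = c("low", "mid", "high"))
  )

# With a binary predictor, exp(coefficient) is a single odds ratio:
# odds of survival for high-fare passengers relative to low-fare passengers
model_toy <- glm(Survived ~ HighFare, family = binomial, data = toy)
exp(coef(model_toy))
```

The trade-off is that binning throws away information about the original fare, so the cutoffs should be chosen with some care.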