RMS Titanic: A Logistic Regression Model for Survival

Part 1 - Introduction

A luxury ship like no other, the famed RMS Titanic met with tragedy on its maiden voyage. In the years since, many researchers have written books about the Titanic. Data have been gathered from published material and are available as public datasets. One such set, compiled by Thomas Cason of UVa, is called titanic3 and was last updated in August 1999. It is based on an earlier set found in Eaton & Haas’ (1994) Titanic: Triumph and Tragedy, published by Patrick Stephens Ltd. The titanic3 dataset contains data about 1309 passengers. Some of it is incomplete.

Our interest is in whether we can create a model that describes the characteristics of passengers who survived the disaster. The data set contains actual data gathered on real passengers. Because it is not 100% complete, we use statistical methods to build our model from a sample. In order to make a sound inference, we created a subset of our data by randomly choosing 130 passengers. Of these 130, complete data was found for 100 passengers. This number is above 30 but below 10% of the 1309 total passengers, which are the usual rule-of-thumb conditions on sample size and independence.

Part 2 - Data

As mentioned above, data were collected by archivists, authors and enthusiasts over the years since the Titanic sank. The cases are individuals who sailed on the Titanic on its voyage. We will be looking at age, gender and passenger class. Age is a numeric variable, understood as continuous. Gender is categorical and contains 2 categories in this dataset. Passenger class has three levels and is an ordinal variable, although our models enter it as a single numeric predictor.

The population of interest is only passengers who were on the Titanic when it sank. The findings here can be generalized to that population. Our selection was done by a pseudo-random function. It is possible that part of our study could align with the results from other shipwrecks; women and children may often have priority. However, we can’t make that kind of inference from the data that we have. Our data is observational data from a single event.

If we wanted to generalize to other wrecks, there would be many sources of bias. Our event was chosen for its fame and availability of data. The cultural reasons the Titanic evacuated the way it did may not hold at any other place or time. The design of the ship may also have caused the relationship that we found. Individual decisions by personnel or passengers may also have played a role. If we tried to make predictions or inferences about a different dataset, we would be introducing this bias.

We believe there was no systematic reason for some of the data to be missing. There is a possibility that more information on wealthy individuals was available and more 3rd class passengers were left out. We believe that it was random enough to make a proper inference.

Part 3 - Exploratory data analysis

Relevant summary statistics:

The mean age of passengers was 29.8811377.
The standard deviation of the age was 14.4134932.
The number of passengers who died was 809.
The number of passengers who lived was 500.
The number of passengers who were in first class was 323.
The number of passengers who were in second class was 277.
The number of passengers who were in third class was 709.
The mean age of the survivors was 28.9182436.
The mean age of the dead was 30.5453635.
The mean age of the first class passengers was 39.1599296.
The mean age of the second class passengers was 29.506705.
The mean age of the third class passengers was 24.8163673.

passenger.class  gender  number.survived  number.died
first class      female              139            5
second class     female               94           12
third class      female              106          110
first class      male                 61          118
second class     male                 25          146
third class      male                 75          418

A data table proves more useful than histograms because most of our variables have only 2 or 3 levels. From this table, we can see a possible trend. More men died than women, and a larger share of 3rd class passengers died than 1st class passengers. We follow this up with a statistical analysis to see whether our first impression holds up.
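The trend is easier to see as survival rates than as raw counts. The sketch below re-enters the counts from the table by hand (the data frame name `survival.table` and its added columns are ours, not part of the original analysis) and computes the proportion surviving in each group.

```r
# Counts copied by hand from the class-by-gender table above.
survival.table <- data.frame(
  passenger.class = rep(c("first", "second", "third"), times = 2),
  gender          = rep(c("female", "male"), each = 3),
  number.survived = c(139, 94, 106, 61, 25, 75),
  number.died     = c(5, 12, 110, 118, 146, 418)
)

# Group size and proportion surviving within each class-by-gender group.
survival.table$total <- survival.table$number.survived + survival.table$number.died
survival.table$survival.rate <-
  round(survival.table$number.survived / survival.table$total, 3)

print(survival.table)
```

On these counts, roughly 97% of first class women survived against roughly 15% of third class men, which is the pattern the regressions go on to test.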

We fitted 3 logistic regressions with 0, 2, and 3 independent variables. Our null hypotheses, each stating that the variable's coefficient is 0: passenger class contributes nothing to the log odds of survival; age contributes nothing to the log odds of survival; gender contributes nothing to the log odds of survival.
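For reference, the largest of the three models and its null hypotheses can be written out as follows. The \(\beta\) symbols are our notation for the coefficients and do not appear in the original analysis; \(\pi\) is the probability of survival.

\(\large\log\left(\frac{\pi}{1-\pi}\right)= \beta_{0}+\beta_{1}(pclass_{i})+\beta_{2}(age_{i})+\beta_{3}(maleness_{i})\)

\(H_{0}: \beta_{1}=0, \quad H_{0}: \beta_{2}=0, \quad H_{0}: \beta_{3}=0\)

The two-variable model drops the \(\beta_{3}\) term, and the minimal model keeps only \(\beta_{0}\).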

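library(dplyr)              # sample_n()
library(ggplot2)            # plots
library(ResourceSelection)  # hoslem.test()
# Draw a reproducible random subset of 130 passengers.
set.seed(243)
random.subset<-sample_n(Titanic_passengers, 130)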
titanic.glm.minimal <- glm(random.subset$survived ~ -1+1, family = binomial())
summary(titanic.glm.minimal)
## 
## Call:
## glm(formula = random.subset$survived ~ -1 + 1, family = binomial())
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.062  -1.062  -1.062   1.298   1.298  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.2787     0.1771  -1.574    0.116
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 177.72  on 129  degrees of freedom
## Residual deviance: 177.72  on 129  degrees of freedom
## AIC: 179.72
## 
## Number of Fisher Scoring iterations: 4
titanic.glm.2vars <- glm(random.subset$survived ~ random.subset$pclass + random.subset$age, family = binomial())
summary(titanic.glm.2vars)
## 
## Call:
## glm(formula = random.subset$survived ~ random.subset$pclass + 
##     random.subset$age, family = binomial())
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0939  -0.8653  -0.4611   0.9616   1.9424  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           5.72958    1.47820   3.876 0.000106 ***
## random.subset$pclass -1.71864    0.41089  -4.183 2.88e-05 ***
## random.subset$age    -0.07174    0.02282  -3.144 0.001667 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 137.63  on 99  degrees of freedom
## Residual deviance: 110.50  on 97  degrees of freedom
##   (30 observations deleted due to missingness)
## AIC: 116.5
## 
## Number of Fisher Scoring iterations: 4
titanic.glm.3vars <- glm(random.subset$survived ~ random.subset$pclass + random.subset$age + random.subset$sex, family = binomial())
summary(titanic.glm.3vars)
## 
## Call:
## glm(formula = random.subset$survived ~ random.subset$pclass + 
##     random.subset$age + random.subset$sex, family = binomial())
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4017  -0.6172  -0.3452   0.5440   2.3163  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            7.43985    1.81679   4.095 4.22e-05 ***
## random.subset$pclass  -1.66233    0.47166  -3.524 0.000424 ***
## random.subset$age     -0.07159    0.02588  -2.767 0.005665 ** 
## random.subset$sexmale -2.77370    0.61072  -4.542 5.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 137.63  on 99  degrees of freedom
## Residual deviance:  84.05  on 96  degrees of freedom
##   (30 observations deleted due to missingness)
## AIC: 92.05
## 
## Number of Fisher Scoring iterations: 5
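# Keep the four columns used below (pclass, survived, sex, age) and drop incomplete rows.
Titanic_passengers <-na.omit(Titanic_passengers[,c(1,2,4,5)])
# Column 1 will hold the predicted probability, column 2 the observed outcome.
hoslem.data<-c(1:1046)
hoslem.data<-cbind(hoslem.data,hoslem.data)
for (i in 1:1046){
  pc<-Titanic_passengers$pclass[i]
  a<-Titanic_passengers$age[i]
  if (Titanic_passengers$sex[i]=='male') {m<-1} else {m<-0}
  # Linear predictor (log odds) from the fitted three-variable model.
  prediction<-(7.43985-1.66233*pc-.07159*a-2.77370*m)
  # The inverse logit turns log odds into a probability.
  hoslem.data[i,1]<-exp(prediction)/(1+exp(prediction))
  hoslem.data[i,2]<-Titanic_passengers$survived[i]
}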
hoslem.test(hoslem.data[,2], hoslem.data[,1])
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  hoslem.data[, 2], hoslem.data[, 1]
## X-squared = 89.537, df = 8, p-value = 5.551e-16

When we add the gender variable to the two-variable model, the AIC falls from 116.5 to 92.05. The intercept-only model has an AIC of 179.72, although it was fitted to all 130 sampled rows rather than the 100 complete cases, so that comparison is only approximate. The model with gender fits our data best, even after the AIC penalty for its extra parameter. For this model, pclass and maleness are statistically significant at the .001 level. Age is significant at the .01 level.
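The AIC values can be checked by hand from the deviances in the summaries above. This sketch assumes the standard identity for ungrouped binary data, AIC = residual deviance + 2 × (number of coefficients), and adds a likelihood-ratio test for the gender term that was not part of the original analysis. The names `models`, `lr.statistic` and `lr.p.value` are ours.

```r
# Residual deviances and coefficient counts read from the three summaries above.
models <- data.frame(
  model             = c("intercept only", "pclass + age", "pclass + age + sex"),
  residual.deviance = c(177.72, 110.50, 84.05),
  parameters        = c(1, 3, 4),       # intercept plus slopes
  rows.used         = c(130, 100, 100)  # the null model kept all 130 rows
)

# For 0/1 responses the saturated log-likelihood is 0, so
# AIC = residual deviance + 2 * number of estimated coefficients.
models$AIC <- models$residual.deviance + 2 * models$parameters
print(models)

# Likelihood-ratio test for adding sex to the pclass + age model.
# Both models use the same 100 complete rows, so their deviances are comparable.
lr.statistic <- 110.50 - 84.05
lr.p.value   <- pchisq(lr.statistic, df = 1, lower.tail = FALSE)
print(c(statistic = lr.statistic, p.value = lr.p.value))
```

The recomputed AIC column reproduces 179.72, 116.5 and 92.05, and the deviance drop of 26.45 on 1 degree of freedom points the same way as the z test on the gender coefficient.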

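The Hosmer-Lemeshow test, a chi-squared goodness-of-fit test, gives X-squared = 89.537 on 8 degrees of freedom, for a p-value of 5.551e-16. The null hypothesis of this test is that the model fits the data, so a p-value this small is evidence of lack of fit, not of good fit. Note that the test applies coefficients estimated from our 100-passenger subset to all 1046 passengers with complete data, so it measures how well the subset model calibrates to the full data set.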
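As a reading aid for the Hosmer-Lemeshow output, the sketch below recomputes the upper-tail p-value from the reported statistic and compares the statistic with the 5% critical value. It uses only base R; the variable names are ours. Under this test the null hypothesis is that the model fits, so a statistic above the critical value counts against the fit.

```r
# Statistic and degrees of freedom as reported by hoslem.test() above.
hl.statistic <- 89.537
hl.df        <- 8

# Upper-tail chi-squared probability. lower.tail = FALSE avoids the
# rounding that 1 - pchisq(...) suffers for very small p-values.
hl.p.value <- pchisq(hl.statistic, df = hl.df, lower.tail = FALSE)

# Critical value at the 5% level for comparison.
hl.critical <- qchisq(0.95, df = hl.df)

# TRUE means the null hypothesis of adequate fit is rejected at the 5% level.
rejects.good.fit <- hl.statistic > hl.critical
print(c(p.value = hl.p.value, critical.value = hl.critical))
print(rejects.good.fit)
```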

# Deviance residuals of the three-variable model, plotted in observation order.
residuals.3vars<-residuals(titanic.glm.3vars)
ggplot()+
  geom_point(aes(x=seq_along(residuals.3vars), y=residuals.3vars),shape=21,fill='red')+
  theme(panel.background = element_rect(fill = '#7fc4e0'))+
  geom_abline(slope=0, intercept=0,color='red')+
  labs(x='',y='residuals')

# Spot-check the model on 30 passengers drawn at random from the complete cases.
predict.accuracy<-rep(0,30)
predict.value<-rep(0,30)
empirical.value<-rep(0,30)
for (i in 1:30) {
  random.passenger<-sample_n(Titanic_passengers, 1)
  pc<-random.passenger$pclass
  a<-random.passenger$age
  if (random.passenger$sex=='male') {m<-1} else {m<-0}
  # Log odds from the fitted model, then the inverse logit to get a probability.
  prediction<-(7.43985-1.66233*pc-.07159*a-2.77370*m)
  prediction<-exp(prediction)/(1+exp(prediction))
  predict.value[i]<-prediction
  empirical.value[i]<-random.passenger$survived
  # A prediction counts as correct when it is within 0.5 of the observed outcome.
  gap<-abs(random.passenger$survived-prediction)
  predict.accuracy[i]<-findInterval(gap, c(0,.5) ) == 1
}

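We ran a simulation to test how often our model would correctly guess whether a passenger survived. We drew 30 passengers at random from the complete data and counted a guess as correct when the predicted probability fell on the same side of 0.5 as the actual outcome. On this sample, the model produced 90% accuracy.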
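To make the prediction step concrete, here is a minimal sketch that wraps the fitted coefficients in a helper and scores two made-up passengers. The function `predict.survival` and both example passengers are hypothetical illustrations; only the coefficients come from the model summary. `plogis()` is base R's inverse logit and gives the same result as `exp(x)/(1+exp(x))`.

```r
# Coefficients copied from the three-variable model summary.
coefs <- c(intercept = 7.43985, pclass = -1.66233, age = -0.07159, male = -2.77370)

# Hypothetical helper: predicted survival probability for one passenger.
predict.survival <- function(pclass, age, male) {
  log.odds <- coefs[["intercept"]] + coefs[["pclass"]] * pclass +
    coefs[["age"]] * age + coefs[["male"]] * male
  plogis(log.odds)  # inverse logit: exp(x) / (1 + exp(x))
}

# Two illustrative (made-up) passengers, both aged 30.
p.first.class.woman <- predict.survival(pclass = 1, age = 30, male = 0)
p.third.class.man   <- predict.survival(pclass = 3, age = 30, male = 1)
print(c(first.class.woman = p.first.class.woman,
        third.class.man   = p.third.class.man))

# The model guesses "survived" when the predicted probability exceeds 0.5.
guesses <- c(p.first.class.woman, p.third.class.man) > 0.5
print(guesses)
```

With these inputs the first passenger is predicted to survive and the second is not, which matches the direction of the class and gender coefficients.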

mean(predict.accuracy,na.rm=TRUE)
## [1] 0.9
model.graphic<-data.frame(cbind(predict.value,empirical.value))
# Approximate 95% confidence interval for the pclass coefficient: estimate ± 1.96 standard errors.
lower.CI<-coefficients(titanic.glm.3vars)[2] -1.96*summary(titanic.glm.3vars)$coefficients[2,2]
upper.CI<-coefficients(titanic.glm.3vars)[2] +1.96*summary(titanic.glm.3vars)$coefficients[2,2]
# Predicted probability for each sampled passenger, coloured by the actual outcome.
ggplot(data=model.graphic)+
  geom_point(aes(x=seq(1,30),y=predict.value,color=empirical.value),size=2)+
  theme(panel.background = element_rect(fill = '#c4b8a1'))+
  theme(legend.position="none")+
  labs(title='Prediction Value vs. Actual',subtitle='light means survival',x='',y='predicted value')+
  theme(title = element_text(size = rel(1.5),color='#3988ad'))

Part 4 - Inference

Our final model for the log odds of surviving, where \(\pi\) is the probability of survival, is:

\(\large\log\left(\frac{\pi}{1-\pi}\right)= 7.44-1.66(pclass_{i})-.07(age_{i})-2.77(maleness_{i})\)

Our 95% confidence interval for the pclass coefficient is (-2.5867778, -0.7378763). The interval excludes 0.
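The interval above is on the log-odds scale. As a sketch of how it could be read on the odds scale, the code below rebuilds it from the rounded estimate and standard error printed in the model summary and then exponentiates. The variable names are ours, and `qnorm(0.975)` is used in place of the rounded 1.96, so the last digits can differ slightly from the figures above.

```r
# Estimate and standard error for pclass, as printed in the model summary.
estimate  <- -1.66233
std.error <-  0.47166

# 95% Wald interval on the log-odds scale.
z           <- qnorm(0.975)
ci.log.odds <- estimate + c(-1, 1) * z * std.error

# Exponentiating gives the multiplicative change in the odds of survival
# for a one-step move down in class (1st to 2nd, or 2nd to 3rd).
odds.ratio    <- exp(estimate)
ci.odds.ratio <- exp(ci.log.odds)

print(round(ci.log.odds, 4))
print(round(c(odds.ratio = odds.ratio,
              lower = ci.odds.ratio[1], upper = ci.odds.ratio[2]), 3))
```

Because pclass enters the model as a number, each step down in class multiplies the estimated odds of survival by roughly 0.19, holding age and gender fixed.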

We reject all three null hypotheses. Passenger class, gender and age were all associated with the odds of survival on the Titanic.

Our data, with a binary response variable, lent itself to a different model than the standard linear regression. A probit model is a possibility. That model uses the inverse of the standard normal cdf as its link function. We chose a logit model, whose cdf differs from the probit cdf by only a small amount once the two are put on the same scale. The tests used to evaluate a logit model are slightly different from those used for a linear regression. The conditions for inference are also different.
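The closeness of the two curves can be checked numerically. This sketch compares the logistic cdf with a normal cdf rescaled by 1.702, a conventional constant for matching the two; the constant and the grid are our assumptions and were not used in the analysis above.

```r
# Grid of linear-predictor values.
x <- seq(-6, 6, by = 0.01)

# Logistic cdf (inverse logit) and a rescaled standard normal cdf (inverse probit).
logit.cdf  <- plogis(x)
probit.cdf <- pnorm(x / 1.702)  # 1.702 puts the two curves on a common scale

# Largest vertical gap between the two curves over the grid.
max.gap <- max(abs(logit.cdf - probit.cdf))
print(max.gap)
```

The gap stays at about one percentage point or less over the whole range, which is why the choice between logit and probit rarely changes the conclusions.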

Part 5 - Conclusion

We investigated the possibility of finding a model to predict probability for a response variable that doesn’t work well with a linear model. We found that passengers on the Titanic were more likely to survive if they were female, young and in a higher passenger class, which we take as a proxy for wealth. A future research project could look at other shipwrecks to find out whether a model could predict survival in a seaborne accident.