I decided to analyze how well the survival of passengers on the Titanic can be predicted. For anyone seeking an interesting and motivating, yet introductory, problem in statistical learning, predicting the survival of the Titanic passengers is a great place to start. It is a commonly studied problem and, as an added benefit, the data set is publicly available.
The original data were compiled from Eaton & Haas (1994), Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.
The data for this observational study were provided by the Department of Biostatistics at Vanderbilt University and are available online.
The Titanic passenger data include the response variable survived and 13 other descriptive variables for 1309 passengers. A description of the variables in the Titanic dataset is provided on this page: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/Ctitanic3.html
The objective of this project is to predict which passengers survived the shipwreck. In particular, the response variable survived is modeled as a function of the available predictors, in order to find out whether factors other than passenger class affected the chance of survival.
Because this is an observational study of a single event that happened more than a hundred years ago, the results cannot be generalized to similar tragedies today. I believe that business-class and economy passengers would now be treated equally in an emergency.
First, let’s look at the survival rate for each passenger class.
# filter() and %>% come from the dplyr package
library(dplyr)
class1 <- filter(datatitanic, pclass == 1)
mean(class1$survived)
[1] 0.619195
class2 <- filter(datatitanic, pclass == 2)
mean(class2$survived)
[1] 0.4296029
class3 <- filter(datatitanic, pclass == 3)
mean(class3$survived)
[1] 0.2552891
plotdata <- datatitanic %>%
  group_by(pclass) %>%
  summarise(avg = mean(survived))
plotdata
# A tibble: 3 x 2
  pclass       avg
   <int>     <dbl>
1      1 0.6191950
2      2 0.4296029
3      3 0.2552891
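The object name plotdata suggests these rates were meant to be plotted. A minimal sketch of such a plot in base R, using only the three rates printed above; the data frame name class_rates and the variable rate_gap are illustrative and not part of the original analysis:

```r
# Survival rates by passenger class, copied from the output above.
class_rates <- data.frame(
  pclass = c(1, 2, 3),
  avg    = c(0.6191950, 0.4296029, 0.2552891)
)

# Bar chart of the share of passengers who survived in each class.
barplot(
  height    = class_rates$avg,
  names.arg = paste("Class", class_rates$pclass),
  ylim      = c(0, 1),
  xlab      = "Passenger class",
  ylab      = "Survival rate",
  main      = "Titanic survival rate by passenger class"
)

# Difference in survival rate between first and third class.
rate_gap <- class_rates$avg[1] - class_rates$avg[3]
```

The gap between first and third class is about 36 percentage points.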
First-class passengers had a higher chance of survival (about 62%) than second-class (about 43%) or third-class (about 26%) passengers on the same voyage.
But how was survival affected by sex, age, or having family members, such as children, aboard?
model <- lm(survived ~ pclass + sex + age + sibsp + parch + fare, data = datatitanic)
summary(model)

Call:
lm(formula = survived ~ pclass + sex + age + sibsp + parch +
    fare, data = datatitanic)

Residuals:
     Min       1Q   Median       3Q      Max
-1.11459 -0.24238 -0.07336  0.23053  1.03689

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.2958867  0.0653096  19.842  < 2e-16 ***
pclass      -0.1757732  0.0190349  -9.234  < 2e-16 ***
sexmale     -0.4920352  0.0260801 -18.866  < 2e-16 ***
age         -0.0059416  0.0009556  -6.217 7.31e-10 ***
sibsp       -0.0521300  0.0146598  -3.556 0.000394 ***
parch        0.0095417  0.0161583   0.591 0.554975
fare         0.0002600  0.0002753   0.945 0.345055
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3896 on 1038 degrees of freedom
  (264 observations deleted due to missingness)
Multiple R-squared:  0.3761,  Adjusted R-squared:  0.3725
F-statistic: 104.3 on 6 and 1038 DF,  p-value: < 2.2e-16

model2 <- lm(survived ~ pclass + sex + age + sibsp, data = datatitanic)
summary(model2)

Call:
lm(formula = survived ~ pclass + sex + age + sibsp, data = datatitanic)

Residuals:
     Min       1Q   Median       3Q      Max
-1.08810 -0.24819 -0.07416  0.22830  1.03435

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.3318420  0.0562191  23.690  < 2e-16 ***
pclass      -0.1852103  0.0159754 -11.593  < 2e-16 ***
sexmale     -0.4978088  0.0254885 -19.531  < 2e-16 ***
age         -0.0059632  0.0009506  -6.273 5.18e-10 ***
sibsp       -0.0466052  0.0136738  -3.408 0.000679 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3893 on 1041 degrees of freedom
  (263 observations deleted due to missingness)
Multiple R-squared:  0.3756,  Adjusted R-squared:  0.3732
F-statistic: 156.5 on 4 and 1041 DF,  p-value: < 2.2e-16
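
The adjusted R-squared values in the two summaries can be reproduced from the multiple R-squared, the number of observations used, and the number of predictors. A short check, assuming the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1); the function name adj_r2 is illustrative, n is taken as 1309 minus the observations deleted for missingness, and the inputs are the rounded values from the output, so agreement is only to about four decimals:

```r
# Adjusted R-squared from multiple R-squared (r2), sample size (n),
# and number of predictors excluding the intercept (p).
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_full    <- 1309 - 264  # observations used by the six-predictor model
n_reduced <- 1309 - 263  # observations used by the four-predictor model

adj_full    <- adj_r2(0.3761, n_full, 6)
adj_reduced <- adj_r2(0.3756, n_reduced, 4)

print(round(c(full = adj_full, reduced = adj_reduced), 4))
```

This gives 0.3725 for the full model and 0.3732 for the reduced model, matching the summaries: dropping two predictors slightly raises the adjusted R-squared even though the multiple R-squared is marginally lower.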
Checking the conditions for the linear model:
Residuals of the model are nearly normal: CHECK
Variability of the residuals is nearly constant: CHECK
Residuals are independent: CHECK
Each variable is linearly related to the outcome: CHECK
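Conditions like these are usually checked with residual plots. A minimal sketch of how that could be done in base R follows; the helper name residual_diagnostics is hypothetical and not part of the original analysis, and it assumes a fitted lm object such as model2:

```r
# Hypothetical helper (not from the original analysis): draws three standard
# residual plots for a fitted lm object and returns the plotted values.
residual_diagnostics <- function(fit) {
  res <- resid(fit)        # observed minus fitted values
  fit_vals <- fitted(fit)

  old_par <- par(mfrow = c(1, 3))  # three panels side by side
  on.exit(par(old_par))            # restore graphics settings afterwards

  # Normality: points should follow the reference line.
  qqnorm(res, main = "Normal Q-Q plot of residuals")
  qqline(res)

  # Constant variability and linearity: no fan shape or curve around zero.
  plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)

  # Independence: no trend when residuals are plotted in data order.
  plot(seq_along(res), res, xlab = "Order in data", ylab = "Residuals",
       main = "Residuals in data order")
  abline(h = 0, lty = 2)

  invisible(data.frame(fitted = fit_vals, residual = res))
}

# Usage, assuming model2 has been fitted as above:
# diag_values <- residual_diagnostics(model2)
```

With a 0/1 response such as survived, the points in the residuals-versus-fitted panel lie on two parallel bands, which is expected for this kind of outcome.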
Variables such as parch (number of parents or children aboard) and fare do not have a statistically significant effect on survival (p = 0.55 and p = 0.35), even though one might expect them to be important.
Using backward selection with the p-value as the selection criterion, parch and fare are removed. The reduced model has a slightly larger adjusted R-squared (0.3732 versus 0.3725), which suggests that the remaining variables are the better set of predictors of passenger survival:
survived = 1.33 - 0.50 × sexmale - 0.19 × pclass - 0.006 × age - 0.05 × sibsp
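To make the fitted equation concrete, it can be evaluated for individual passengers. A small sketch, using the unrounded coefficients from the summary(model2) output; the function name predicted_survival and the two example passengers are hypothetical, chosen only for illustration:

```r
# Coefficients copied from the summary(model2) output.
coefs <- c(intercept = 1.3318420, pclass = -0.1852103,
           sexmale = -0.4978088, age = -0.0059632, sibsp = -0.0466052)

# Hypothetical helper: fitted value for one passenger.
# sexmale is 1 for a male passenger and 0 for a female passenger.
predicted_survival <- function(pclass, sexmale, age, sibsp) {
  unname(coefs["intercept"] +
    coefs["pclass"] * pclass +
    coefs["sexmale"] * sexmale +
    coefs["age"] * age +
    coefs["sibsp"] * sibsp)
}

# Illustrative passengers (not taken from the data):
woman_first <- predicted_survival(pclass = 1, sexmale = 0, age = 25, sibsp = 0)
man_third   <- predicted_survival(pclass = 3, sexmale = 1, age = 25, sibsp = 0)

print(round(c(woman_first = woman_first, man_third = man_third), 3))
```

Under these assumptions the fitted values are about 0.998 for the 25-year-old woman in first class and about 0.129 for the 25-year-old man in third class. Because this is a linear model, fitted values are not restricted to the range 0 to 1.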
Let’s describe the passenger who had the highest predicted chance of survival on the Titanic.
According to the model, it would be a young female passenger in first class, travelling without siblings or a spouse aboard, since the coefficient for sibsp is negative.
When I started with statistics, I knew the theory behind these measures, but I was not clear about their practical uses and how they could help. In this project, I applied those statistical concepts to a practice problem (the Titanic data).
The results indicate that age, sex, and passenger class are the most important variables in predicting the survival of the Titanic passengers.
The data were collected by the Department of Biostatistics at Vanderbilt University and are available online here: http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets
Data for Titanic passengers. (2002, December 27). Retrieved October 28, 2017, from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html
Encyclopedia Titanica. (n.d.). Retrieved October 28, 2017, from http://www.encyclopedia-titanica.org/