Part 1 - Introduction:

This report analyzes how well the survival of passengers on the Titanic can be predicted. For anyone seeking an interesting and motivating, but introductory, problem in statistical learning, predicting the survival of the Titanic passengers is a great place to start. It is a common practice problem and, as an added benefit, the data set is publicly available.

Part 2 - Data:

The original data come from Eaton & Haas (1994), Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list compiled by many researchers and edited by Michael A. Findlay.

The data for this observational study were provided by the Department of Biostatistics of Vanderbilt University and are available online.

The Titanic passenger data include the response variable survived and 13 other descriptive variables for 1309 passengers. A description of the variables in the Titanic data set is provided on this page: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/Ctitanic3.html

The objective of this project is to predict which passengers survived the shipwreck. In particular, the response variable survived is modeled using the available predictors, to find out whether factors other than passenger class are associated with a higher chance of survival.

We can’t generalize this study to similar tragedies today, since the sinking happened more than a hundred years ago. I believe that business class and economy passengers would now be treated equally in an emergency.

Part 3 - Exploratory data analysis: survival rate.

First, let’s look at the survival rate for each passenger class.

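library(dplyr)  # provides filter(), group_by(), summarise() and %>%

# The mean of the 0/1 survived column is the survival rate within each class
class1 = filter(datatitanic, pclass == 1)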
mean(class1$survived)
[1] 0.619195
class2 = filter(datatitanic, pclass == 2)
mean(class2$survived)
[1] 0.4296029
class3 = filter(datatitanic, pclass == 3)
mean(class3$survived)
[1] 0.2552891
plotdata <- datatitanic %>%
  group_by(pclass) %>%
  summarise(avg = mean(survived))
plotdata
# A tibble: 3 x 2
  pclass       avg
   <int>     <dbl>
1      1 0.6191950
2      2 0.4296029
3      3 0.2552891

Survival rate.

First-class passengers had a higher survival rate (about 62%) than second-class (about 43%) or third-class (about 26%) passengers on the same voyage.

But how was the chance of survival affected by gender, age, or having children aboard?
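One way to start on this question before fitting a model is to tabulate survival by two variables at once. The sketch below uses base R's aggregate() on a tiny made-up data frame. The name toy and its values are hypothetical and are not the real passenger data; the column names are assumed to match datatitanic, so substituting datatitanic for toy should give the corresponding rates.

```r
# Hypothetical toy data with the same column names as datatitanic.
toy <- data.frame(
  pclass   = c(1, 1, 1, 1, 3, 3, 3, 3),
  sex      = c("female", "female", "male", "male",
               "female", "female", "male", "male"),
  survived = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# The mean of a 0/1 variable is the survival rate in each class/sex group.
rates <- aggregate(survived ~ pclass + sex, data = toy, FUN = mean)
print(rates)
```

Each row of rates is one combination of class and sex, which makes it easy to see whether the class effect holds within each sex.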

Part 4 - Inference:

# Full model: linear regression of the 0/1 survival indicator on all candidate predictors
model <- lm(survived ~ pclass + sex + age + sibsp + parch + fare, data = datatitanic)
summary(model)

Call:
lm(formula = survived ~ pclass + sex + age + sibsp + parch + 
    fare, data = datatitanic)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.11459 -0.24238 -0.07336  0.23053  1.03689 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.2958867  0.0653096  19.842  < 2e-16 ***
pclass      -0.1757732  0.0190349  -9.234  < 2e-16 ***
sexmale     -0.4920352  0.0260801 -18.866  < 2e-16 ***
age         -0.0059416  0.0009556  -6.217 7.31e-10 ***
sibsp       -0.0521300  0.0146598  -3.556 0.000394 ***
parch        0.0095417  0.0161583   0.591 0.554975    
fare         0.0002600  0.0002753   0.945 0.345055    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3896 on 1038 degrees of freedom
  (264 observations deleted due to missingness)
Multiple R-squared:  0.3761,    Adjusted R-squared:  0.3725 
F-statistic: 104.3 on 6 and 1038 DF,  p-value: < 2.2e-16

# Reduced model: parch and fare removed
model2 <- lm(survived ~ pclass + sex + age + sibsp, data = datatitanic)
summary(model2)

Call:
lm(formula = survived ~ pclass + sex + age + sibsp, data = datatitanic)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.08810 -0.24819 -0.07416  0.22830  1.03435 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.3318420  0.0562191  23.690  < 2e-16 ***
pclass      -0.1852103  0.0159754 -11.593  < 2e-16 ***
sexmale     -0.4978088  0.0254885 -19.531  < 2e-16 ***
age         -0.0059632  0.0009506  -6.273 5.18e-10 ***
sibsp       -0.0466052  0.0136738  -3.408 0.000679 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3893 on 1041 degrees of freedom
  (263 observations deleted due to missingness)
Multiple R-squared:  0.3756,    Adjusted R-squared:  0.3732 
F-statistic: 156.5 on 4 and 1041 DF,  p-value: < 2.2e-16

Conditions:

  • Residuals of model are nearly normal CHECK

  • Variability of residuals is nearly constant CHECK

  • Residuals are independent CHECK

  • Each variable is linearly related to the outcome CHECK
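Conditions like these are usually checked with residual plots. The following is a minimal sketch, not the check actually run for this report. It fits a model on simulated data so that the block runs on its own; sim, fit_lm and fit_glm are hypothetical names, and the simulated values are not the real passenger data. Because survived is a 0/1 variable, a logistic regression via glm(family = binomial) is a common alternative that keeps fitted probabilities between 0 and 1. It is shown only for comparison.

```r
set.seed(1)

# Simulated stand-in for datatitanic (hypothetical values, not the real data).
n <- 200
sim <- data.frame(
  pclass = sample(1:3, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = round(runif(n, 1, 70))
)
p <- plogis(2.5 - 0.9 * sim$pclass - 2.0 * (sim$sex == "male") - 0.02 * sim$age)
sim$survived <- rbinom(n, 1, p)

fit_lm <- lm(survived ~ pclass + sex + age, data = sim)

# Residuals vs fitted (linearity, constant variance) and normal Q-Q plot.
pdf(NULL)  # discard graphics so the sketch also runs non-interactively
plot(fit_lm, which = 1:2)
invisible(dev.off())

# Alternative for a binary response: logistic regression.
fit_glm <- glm(survived ~ pclass + sex + age, data = sim, family = binomial)

# Fitted values from lm() are not constrained to [0, 1]; those from glm() are.
print(range(fitted(fit_lm)))
print(range(fitted(fit_glm)))
```

In an interactive session, dropping the pdf(NULL) and dev.off() lines shows the two diagnostic plots on screen.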

Variables such as parch (the number of parents or children aboard) and fare do not have a statistically significant effect on survival in this model (p = 0.55 and p = 0.35), even though one might expect them to be important.

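Using backward selection with the p-value as the selection criterion, parch and fare were dropped. The reduced model has a slightly larger adjusted R-squared (0.3732 versus 0.3725), which suggests that the remaining variables are the better set of predictors of passenger survival.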
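The two summaries above were fit on slightly different samples (264 and 263 observations deleted due to missingness), so a strict comparison should use the same rows for both models. The sketch below shows one way to do that and to run a partial F-test with anova(). It uses simulated data so that it runs on its own; sim, complete, full and reduced are hypothetical names, and the values are not the real passenger data.

```r
set.seed(2)

# Simulated stand-in for datatitanic (hypothetical values, not the real data).
n <- 300
sim <- data.frame(
  pclass = sample(1:3, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = round(runif(n, 1, 70)),
  fare   = round(runif(n, 5, 200), 2)
)
sim$survived <- rbinom(n, 1, plogis(2 - 0.8 * sim$pclass - 2 * (sim$sex == "male")))
sim$fare[1:5] <- NA  # a few missing values, to mimic missingness

# Keep only rows that are complete on every variable used by the larger model.
complete <- na.omit(sim[, c("survived", "pclass", "sex", "age", "fare")])

full    <- lm(survived ~ pclass + sex + age + fare, data = complete)
reduced <- lm(survived ~ pclass + sex + age, data = complete)

# Adjusted R-squared penalizes extra predictors; plain R-squared can never
# increase when a predictor is dropped.
print(c(full    = summary(full)$adj.r.squared,
        reduced = summary(reduced)$adj.r.squared))

# Partial F-test for the dropped term (both fits must use the same rows).
print(anova(reduced, full))
```

A large p-value in the anova() table would support dropping the extra predictor.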

survived = 1.33 - 0.50 x sexmale - 0.19 x pclass - 0.006 x age - 0.047 x sibsp
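To make the equation concrete, it can be evaluated for a few passenger profiles. The sketch below copies the coefficients from the summary(model2) output above; predict_survival is a hypothetical helper written for illustration, and the three profiles are made-up examples rather than actual passengers.

```r
# Coefficients copied from the summary(model2) output.
coefs <- c(intercept = 1.3318420, pclass = -0.1852103,
           sexmale = -0.4978088, age = -0.0059632, sibsp = -0.0466052)

# Hypothetical helper: fitted value for one passenger profile.
# male is 1 for a male passenger and 0 for a female passenger.
predict_survival <- function(pclass, male, age, sibsp) {
  unname(coefs["intercept"] + coefs["pclass"] * pclass +
           coefs["sexmale"] * male + coefs["age"] * age +
           coefs["sibsp"] * sibsp)
}

print(predict_survival(pclass = 1, male = 0, age = 25, sibsp = 0))  # about 0.998
print(predict_survival(pclass = 3, male = 1, age = 25, sibsp = 0))  # about 0.129
print(predict_survival(pclass = 1, male = 0, age = 1,  sibsp = 0))  # about 1.14
```

The last value is above 1, so the fitted values of this linear model are better read as scores than as literal probabilities.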

Let’s describe the passenger who had the highest predicted chance of survival on the Titanic.

It would be a female passenger in 1st class. She would be young and traveling without a spouse or siblings aboard, since the sibsp coefficient is negative.

Part 5 - Conclusion:

When I started with statistics, I knew the theory behind these measures, but I was not clear about their practical uses and how they could help. In this research, I applied these statistical concepts to a practice problem (the Titanic).

The results indicate that age, sex, and passenger class are the most important variables in predicting the survival of the Titanic passengers.

References:

Data were provided by the Department of Biostatistics of Vanderbilt University and are available online here: http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets

Data for Titanic passengers. (2002, December 27). Retrieved October 28, 2017, from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html

Encyclopedia Titanica. (n.d.). Retrieved October 28, 2017, from http://www.encyclopedia-titanica.org/