Have you ever seen the movie Titanic? The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this report we will use visualizations and machine learning to explore survival rates among the people on board and to predict which passengers survived.
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
We’ve got a sense of our variables, their class type, and the first few observations of each. We know we’re working with 1309 observations of 12 variables. To make things a bit more explicit since a couple of the variable names aren’t 100% illuminating, here’s what we’ve got to deal with:
Variable Name | Description |
---|---|
Survived | Survived (1) or died (0) |
Pclass | Passenger’s class |
Name | Passenger’s name |
Sex | Passenger’s sex |
Age | Passenger’s age |
SibSp | Number of siblings/spouses aboard |
Parch | Number of parents/children aboard |
Ticket | Ticket number |
Fare | Fare |
Cabin | Cabin |
Embarked | Port of embarkation |
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin
## Length:1309 Min. : 0.000 Length:1309
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.454 Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
## Embarked
## Length:1309
## Class :character
## Mode :character
##
##
##
##
Our dataset lists 1309 passengers. Survived is missing for 418 of them, Age for 263, and Fare for 1. There are 3 passenger classes: 1st, 2nd and 3rd. The maximum number of siblings/spouses aboard is 8 and the maximum number of parents/children aboard is 9. The median fare was $14.45, while the maximum paid was $512.33.
## Warning: Removed 177 rows containing non-finite values (stat_bin).
Remember, ‘1’ represents Survived and ‘0’ represents Died.
## female male
## 0.7420382 0.1889081
Our first plot and the proportions above show that 74% of females survived, compared with only 19% of males.
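The proportions above can be computed in a couple of ways. The following is a minimal sketch, assuming a data frame with `Sex` and a 0/1 `Survived` column; the `train` data here is a made-up toy sample, not the real passenger list.

```r
# Toy stand-in for the training data (hypothetical values for illustration)
train <- data.frame(
  Sex      = c("female", "female", "female", "female",
               "male", "male", "male", "male", "male", "male"),
  Survived = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0)
)

# The mean of a 0/1 column within each group is the share that survived
surv_by_sex <- tapply(train$Survived, train$Sex, mean)
print(surv_by_sex)

# The same figures from a contingency table, normalised within each row (sex)
sex_table <- prop.table(table(train$Sex, train$Survived), margin = 1)
print(sex_table)
```

Because `Survived` is coded 0/1, a group mean is the same thing as a survival rate, which is why `tapply(..., mean)` is enough here.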
## Warning: Removed 177 rows containing non-finite values (stat_bin).
Combining the variables age and sex, our plot still shows that females are more likely to survive than males.
## Warning: Removed 12 rows containing missing values (geom_point).
This plot has 3 categories: 1 for 1st class passengers, 2 for 2nd class passengers, and 3 for 3rd class passengers. There are a lot of blue dots in the 1st category, which suggests that 1st class passengers were more likely to survive than those in the other classes.
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme
## female 0 0 0 1 1 0 1 0 0 260 2 1
## male 1 4 1 0 7 1 0 2 61 0 0 0
##
## Mr Mrs Ms Rev Sir the Countess
## female 0 197 2 0 0 1
## male 757 0 0 8 1 0
Explicitly, there was 1 Captain, 2 Majors, 4 Colonels, a Don and a Dona, 8 doctors (7 male and 1 female), and 8 Reverends. The common titles dominate: 757 passengers titled Mr, 260 Miss, 197 Mrs, and 61 Master.
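The table above depends on pulling a title out of each passenger's Name. The report does not show how this was done; one common approach, assumed here, is a regular expression that strips the surname before the comma and everything from the first period onward. The names below are copied from the `str()` output earlier in the report.

```r
# A few names and sexes taken from the first rows of the dataset
name <- c("Braund, Mr. Owen Harris",
          "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
          "Heikkinen, Miss. Laina",
          "Futrelle, Mrs. Jacques Heath (Lily May Peel)")
sex <- c("male", "female", "female", "female")

# Remove everything up to and including ", " and everything from the first "." on,
# leaving only the title in between
title <- gsub("(.*, )|(\\..*)", "", name)
print(title)

# Cross-tabulate sex by title, in the same layout as the table above
title_by_sex <- table(sex, title)
print(title_by_sex)
```

Rare titles can then be merged into broader groups, such as the Officer and Royalty groups used in this report; the exact mapping is a modelling choice.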
## Master Miss Mr Mrs Officer Royalty
## 0.5750000 0.7027027 0.1566731 0.7936508 0.2631579 0.7500000
Conditioning on title, a Master has a 58% chance of survival, a Miss 70%, a Mr 16%, a Mrs 79%, an Officer 26%, and Royalty 75%.
## 1 2 3 4 5 6 7
## 0.3035382 0.5527950 0.5784314 0.7241379 0.2000000 0.1363636 0.3333333
## 8 11
## 0.0000000 0.0000000
A family of 4 has the highest chance of survival at 72%. Those who travelled alone had a 30% survival rate. Those in big families (5 to 11 members) had the lowest survival rate at 16%.
## Alone Big Small
## 0.3035382 0.1612903 0.5787671
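The family-size groups above can be derived from `SibSp` and `Parch`. This is a sketch under the assumption that family size counts the passenger plus siblings/spouses plus parents/children, and that the groups are Alone (1), Small (2 to 4) and Big (5 or more), which matches the rates shown. The column names `Fsize` and `FsizeD` follow the output in this report; the rows are toy values.

```r
# Toy rows (hypothetical values) covering each family-size group
full <- data.frame(SibSp = c(0, 1, 1, 3, 8),
                   Parch = c(0, 0, 2, 2, 2))

# Family size = the passenger + siblings/spouses + parents/children
full$Fsize <- full$SibSp + full$Parch + 1

# Discretise into three groups; intervals are (0,1], (1,4] and (4,Inf]
full$FsizeD <- cut(full$Fsize,
                   breaks = c(0, 1, 4, Inf),
                   labels = c("Alone", "Small", "Big"))
print(full)
```

Grouping sparse family sizes this way gives the model three reasonably populated categories instead of nine thinly populated ones.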
## PassengerId Survived Pclass Name Sex Age SibSp Parch
## 1044 1044 NA 3 Storey, Mr. Thomas male 60.5 0 0
## Ticket Fare Cabin Embarked Title Surname Fsize FsizeD
## 1044 3701 NA S Mr Storey 1 Alone
## Warning: Removed 1 rows containing non-finite values (stat_density).
## 1 2 3
## 60.0000 15.0458 8.0500
Passenger 1044 (Mr. Thomas Storey, 3rd class) has a missing Fare. We will solve this by imputing the median fare. The median fare is about $60, $15 and $8 for 1st, 2nd and 3rd class respectively, so the relevant value for this 3rd class passenger is $8.05.
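A minimal sketch of median imputation by class, assuming a data frame with `Pclass` and `Fare` columns. The fares below are toy values chosen for easy checking, not the real data.

```r
# Toy data (hypothetical values); the last 3rd class fare is missing
full <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  Fare   = c(50, 60, 90, 13, 17, 7, 8, 9, NA)
)

# Median fare within each class, ignoring missing values
class_median <- tapply(full$Fare, full$Pclass, median, na.rm = TRUE)
print(class_median)

# Fill each missing fare with the median of that passenger's own class
missing <- is.na(full$Fare)
full$Fare[missing] <- as.numeric(class_median[as.character(full$Pclass[missing])])

# Confirm that no missing fares remain
print(sum(is.na(full$Fare)))
```

The median is used rather than the mean because fares are heavily right-skewed, so a few very expensive tickets would pull the mean upward.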
## 1 2 3
## 39 29 24
## [1] 0
The median age was 39, 29 and 24 in the 1st, 2nd, and 3rd class respectively.
##
## 0 1
## Adult 495 279
## Child 54 63
In the training data, 54 children died and 63 survived (a 54% survival rate), compared with 495 adults who died and 279 who survived (36%). Let’s explore further to see how well a model can predict survival.
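The Child/Adult table above comes from a derived variable. The report does not state the age cut-off, so the sketch below assumes that anyone under 18 counts as a child; both the threshold and the toy rows are illustrative assumptions.

```r
# Toy rows (hypothetical values for illustration)
full <- data.frame(Age      = c(2, 14, 17, 18, 35, 54),
                   Survived = c(1, 1, 0, 1, 0, 0))

# Assumed cut-off: passengers younger than 18 are labelled "Child"
full$Child <- ifelse(full$Age < 18, "Child", "Adult")

# Counts of died (0) and survived (1) within each group
counts <- table(full$Child, full$Survived)
print(counts)

# Row-wise proportions give the survival rate per group
print(prop.table(counts, margin = 1))
```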
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: num 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : num 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : num 1 2 2 2 1 1 1 1 2 2 ...
## $ FsizeD : num 1 1 2 1 2 2 2 3 1 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: num 1 3 1 1 1 2 1 1 1 3 ...
## $ Title : num 1 4 6 4 1 1 1 2 4 4 ...
## $ Child : num 1 1 1 1 1 1 1 2 1 2 ...
## ans_rf
## 0 1
## 608 283
##
## Call:
## randomForest(formula = factor(Survived) ~ Pclass + Sex + Fare + Embarked + Title + FsizeD + Child, data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.72%
## Confusion matrix:
## 0 1 class.error
## 0 504 45 0.08196721
## 1 104 238 0.30409357
## [1] 0.8327722
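Using Random Forest, our classification model predicts that, of the 891 passengers in the training set, 608 died and 283 survived. The out-of-bag (OOB) accuracy is 83.3%, which corresponds to the 16.72% OOB error rate. The errors are uneven: only 8.2% of passengers who died were misclassified, compared with 30.4% of those who survived.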
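The accuracy, per-class error and predicted totals can all be recovered from the confusion matrix printed above. The sketch below re-enters that matrix by hand (rows are actual outcomes, columns are predictions); the variable names are illustrative.

```r
# Confusion matrix copied from the random forest output above
conf <- matrix(c(504,  45,
                 104, 238),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Accuracy: correctly classified passengers (the diagonal) over all passengers
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)
print(class_error)

# Column totals give the number of passengers predicted to die or survive
predicted_totals <- colSums(conf)
print(predicted_totals)
```

The error rate is simply `1 - accuracy`, so the 16.72% error and the accuracy figure are two views of the same number.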
The mean decrease in Gini measures how much each variable contributes to the homogeneity (purity) of the nodes and leaves in the resulting random forest: the larger the decrease in impurity a variable's splits produce, averaged over all trees, the more important that variable is. A related measure, the mean decrease in accuracy, looks at how much the accuracy of the forest drops when a single variable is permuted. In both cases, variables with a larger mean decrease are more important for classifying the data.
Title is the most important variable for this analysis, followed by Sex.
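To make the Gini measure concrete, here is a small sketch of the impurity decrease produced by a single split. It uses the standard Gini impurity, one minus the sum of squared class proportions, on a toy sample of ten passengers. The helper `gini()` is written for illustration and is not a function from the randomForest package.

```r
# Gini impurity of a node: 1 - sum of squared class proportions
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

# Toy node (hypothetical values): 5 females (4 survived), 5 males (1 survived)
survived <- c(1, 1, 1, 1, 0,  0, 0, 0, 0, 1)
sex      <- c(rep("female", 5), rep("male", 5))

# Impurity before the split
parent_gini <- gini(survived)

# Impurity of each child node after splitting on sex, and each child's weight
child_gini <- as.numeric(tapply(survived, sex, gini))
weights    <- as.numeric(table(sex)) / length(sex)

# Decrease in impurity: parent minus the weighted average of the children
decrease <- parent_gini - sum(weights * child_gini)
print(c(parent = parent_gini, decrease = decrease))
```

A random forest accumulates decreases like this over every split that uses a given variable and averages them across trees, which is what the importance ranking summarises.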