Surviving the Titanic Disaster

Predicting which passengers survived the sinking of the Titanic using a provided training and test dataset.

This is a classification task, specifically, a binary classification task, which means that there are only two different states we are classifying.

Below are the descriptions contained in that data dictionary, with the response variable in bold:

PassengerID - A column added by Kaggle to identify each row and make submissions easier Survived - Whether the passenger survived or not and the value we are predicting (0=No, 1=Yes) Pclass - The class of the ticket the passenger purchased (1=1st, 2=2nd, 3=3rd) Sex - The passenger’s sex Age - The passenger’s age in years SibSp - The number of siblings or spouses the passenger had aboard the Titanic Parch - The number of parents or children the passenger had aboard the Titanic Ticket - The passenger’s ticket number Fare - The fare the passenger paid Cabin - The passenger’s cabin number Embarked - The port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton)

## Lets first visualise the data
glimpse(train)
## Observations: 891
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male"…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39…
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51…
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S…
glimpse(test)
## Observations: 418
## Variables: 11
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 90…
## $ Pclass      <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3…
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs…
## $ Sex         <chr> "male", "female", "male", "male", "female", "male", …
## $ Age         <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0…
## $ SibSp       <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0…
## $ Parch       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298", "…
## $ Fare        <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6…
## $ Cabin       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B45…
## $ Embarked    <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S…

The training-set has 891 examples and 11 features + the target variable (survived). 2 of the features are doubles, 5 are integers and 5 are character objects. The testing-set is the same minus the target variable.

Prior to modelling it is important to assess the features which may provide the most information. We assume that women and childern were favoured for spots on the life boats, that there was a large class disparity, that young may have been favoured over old. This suggests that Age, Sex, and PClass may be good predictors of survival.

Because the Survived column contains 0 if the passenger did not survive and 1 if they did, we can segment our data by sex and calculate the mean of this column.

Survived <- train %>%
  filter(Survived == 1)
Died <- train %>%
  filter(Survived == 0)


sum_train <- summary(train)


##Predictive utility of Age
Sur_Age_plot <- ggplot(train, aes(x = Age))+
  geom_histogram()+
  facet_wrap(~ Survived)+
  labs(title = "Age vs survival")

## Predicitve utility of Sex
Sur_Sex_plot <- ggplot(train, aes(x = factor(Survived), fill = factor(Sex)))+
  geom_bar(position = "stack")+
  labs(title = "Sex vs survival")


##Predictive utility of Pclass
Sur_Pclass_plot <- ggplot(train, aes(x = factor(Survived), fill = factor(Pclass)))+
  geom_bar(position = "stack")+
  labs(title = "Pclass vs survival")

## Visualise the plots
print(sum_train)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
print(Sur_Age_plot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite values (stat_bin).

print(Sur_Sex_plot)

print(Sur_Pclass_plot)

The tabluar summary indicates that approximatly 38% of people survived. Age contains 177 NAs. There is a large disparity in the fare paid, which is likely to be related to the Pclass. SibSP indicates that just under half of the passengers were traveling alone. We need to convert a lot of features into numeric so that the machine learning algorithms can process them. Furthermore, we can see that the features have widely different ranges, that we will need to convert into roughly the same scale. We can ignore Cabin, Name, Ticket from the dataset. It is difficult to make sense of both the SibSp and Parch together and perhaps these should be combined into a binary outcome, “alone” and “accompanied”

From Age vs Survival plot we can see that there appears to be a bimodal distribution in those that survived. The number of children aged < 10 who survived is greater than the number who died. It may be also may be useful to assume that these children were accompanied by their mothers. It is unclear if female children were favoured over male children. Since there seem to be certain ages, which have increased odds of survival and because I want every feature to be roughly on the same scale.

From Sex vs Survival plot we can see that a much larger proportion of women survived. It may be possible that the women who died were less likely to be mothers.

From Pclass vs Survival plot we can see that class 1 was more likely to survive than 2 and 2 more likely to survive than 3.

##Predictive utillity of Embarked and Fare
Sur_Embarked_Fare <- ggplot(train, aes(x = factor(Embarked), fill = factor(Survived)))+
  geom_bar(position = "stack")+
  #facet_wrap(~ Survived)+
  labs(title = "Embarked vs Fare by survival")


##Predictive utility of Age
Sur_Age_Sex_plot <- ggplot(train, aes(x = Age, fill = Sex))+
  geom_histogram(position = "stack")+
  facet_wrap(~ Survived)+
  labs(title = "Age vs Sex by survival")

## Predicitve utility of Sex
Sur_Fem_SibSP_plot <- ggplot(train, aes(x = factor(Survived), fill = factor(SibSp)))+
  geom_bar(position = "stack")+
  facet_wrap(~ Sex)+
  labs(title = "Sex vs SibSp")


## Visualise the plots
print(Sur_Age_Sex_plot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite values (stat_bin).

print(Sur_Fem_SibSP_plot)

print(Sur_Embarked_Fare)

From the Age vs Sex by survival it appears that there was no preferential treatment given to female children.

# Change Pclass to a factor
train$Pclass <- as.numeric(train$Pclass)

#Change Sex to numeric
train$Sex <- as.numeric(ifelse(train$Sex == "male", 1, 0))

#Replace NAs in the age variable with random digits from normal distribution
Age_replace <- rnorm(177, mean= mean(train$Age, na.rm = TRUE), sd = sd(train$Age, na.rm = TRUE))
train$Age[is.na(train$Age)] <- Age_replace

#Replace NAs in the age variable with C because it appears that there are more survivors from the same fare bracket in C than S.
Embarked_replace <- c("C", "C")
train$Embarked[is.na(train$Embarked)] <- Embarked_replace
train$Embarked <- as.numeric(as.factor(train$Embarked))

# Convert SibSp and Parch to Alone
train <- train %>%
  mutate(Alone = ifelse(Parch != 0 | SibSp !=0, 0, 1))

Predictors <- c("Age", "Alone", "Pclass", "Sex", "Fare")

train[Predictors] <-scale(train[Predictors], center = TRUE, scale = TRUE)


glimpse(train)
## Observations: 891
## Variables: 13
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1…
## $ Pclass      <dbl> 0.8269128, -1.5652278, 0.8269128, -1.5652278, 0.8269…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradl…
## $ Sex         <dbl> 0.737281, -1.354813, -1.354813, -1.354813, 0.737281,…
## $ Age         <dbl> -0.53091190, 0.55998645, -0.25818731, 0.35544301, 0.…
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803…
## $ Fare        <dbl> -0.50216314, 0.78640362, -0.48857985, 0.42049407, -0…
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G…
## $ Embarked    <dbl> 3, 1, 3, 3, 3, 2, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 2, 3…
## $ Alone       <dbl> -1.2309535, -1.2309535, 0.8114666, -1.2309535, 0.811…
Kaggle_glm <- glm(Survived ~ Age+Sex+Pclass+Alone, family = "binomial", train)
summary(Kaggle_glm)
## 
## Call:
## glm(formula = Survived ~ Age + Sex + Pclass + Alone, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6020  -0.6515  -0.4146   0.6285   2.6776  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.64491    0.08961  -7.197 6.16e-13 ***
## Age         -0.43129    0.09798  -4.402 1.07e-05 ***
## Sex         -1.25841    0.09281 -13.559  < 2e-16 ***
## Pclass      -0.96836    0.10004  -9.680  < 2e-16 ***
## Alone        0.03418    0.09166   0.373    0.709    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  806.57  on 886  degrees of freedom
## AIC: 816.57
## 
## Number of Fisher Scoring iterations: 5
#Identify missing values and variables to change
summary(test)
##   PassengerId         Pclass          Name               Sex           
##  Min.   : 892.0   Min.   :1.000   Length:418         Length:418        
##  1st Qu.: 996.2   1st Qu.:1.000   Class :character   Class :character  
##  Median :1100.5   Median :3.000   Mode  :character   Mode  :character  
##  Mean   :1100.5   Mean   :2.266                                        
##  3rd Qu.:1204.8   3rd Qu.:3.000                                        
##  Max.   :1309.0   Max.   :3.000                                        
##                                                                        
##       Age            SibSp            Parch           Ticket         
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   Length:418        
##  1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :27.00   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   :30.27   Mean   :0.4474   Mean   :0.3923                     
##  3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.0000                     
##  Max.   :76.00   Max.   :8.0000   Max.   :9.0000                     
##  NA's   :86                                                          
##       Fare            Cabin             Embarked        
##  Min.   :  0.000   Length:418         Length:418        
##  1st Qu.:  7.896   Class :character   Class :character  
##  Median : 14.454   Mode  :character   Mode  :character  
##  Mean   : 35.627                                        
##  3rd Qu.: 31.500                                        
##  Max.   :512.329                                        
##  NA's   :1
# Change Pclass to a factor
test$Pclass <- as.numeric(test$Pclass) 

#Change Sex to numeric
test$Sex <- as.numeric(ifelse(test$Sex == "male", 1, 0))

#Replace NAs in the age variable with random digits from normal distribution
Age_replace <- rnorm(86, mean= mean(test$Age, na.rm = TRUE), sd = sd(test$Age, na.rm = TRUE))
test$Age[is.na(test$Age)] <- Age_replace

#Convert embarked to factor
test$Embarked <- as.numeric(as.factor(test$Embarked))


# Convert SibSp and Parch to Alone
test <- test %>%
  mutate(Alone = ifelse(Parch != 0 | SibSp !=0, 0, 1))


# Center and scale all of the variables 
test[Predictors] <-scale(train[Predictors], center = TRUE, scale = TRUE)
## Warning in matrix(value, n, p): data length [4455] is not a sub-multiple or
## multiple of the number of rows [418]
glimpse(test)
## Observations: 418
## Variables: 12
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 90…
## $ Pclass      <dbl> -0.59909305, -0.26141314, 0.15089957, -0.43105479, -…
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs…
## $ Sex         <dbl> 0.8114666, -1.2309535, 0.8114666, -1.2309535, 0.8114…
## $ Age         <dbl> -0.53091190, 0.55998645, -0.25818731, 0.35544301, 0.…
## $ SibSp       <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0…
## $ Parch       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298", "…
## $ Fare        <dbl> -1.2309535, 0.8114666, -1.2309535, 0.8114666, 0.8114…
## $ Cabin       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B45…
## $ Embarked    <dbl> 2, 3, 2, 3, 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 1, 2, 1…
## $ Alone       <dbl> 0.01453728, -1.34908566, -0.33165146, -0.59909305, -…
eval_glm <- predict(Kaggle_glm, test)

summary(eval_glm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5.3564 -2.1775 -0.5646 -0.6237  0.2959  3.1727
test <- test %>%
  mutate(Outcome = eval_glm)%>%
mutate(Survived = ifelse(eval_glm > 0, 1, 0))

glimpse(test)
## Observations: 418
## Variables: 14
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 90…
## $ Pclass      <dbl> -0.59909305, -0.26141314, 0.15089957, -0.43105479, -…
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs…
## $ Sex         <dbl> 0.8114666, -1.2309535, 0.8114666, -1.2309535, 0.8114…
## $ Age         <dbl> -0.53091190, 0.55998645, -0.25818731, 0.35544301, 0.…
## $ SibSp       <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0…
## $ Parch       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298", "…
## $ Fare        <dbl> -1.2309535, 0.8114666, -1.2309535, 0.8114666, 0.8114…
## $ Cabin       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B45…
## $ Embarked    <dbl> 2, 3, 2, 3, 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 1, 2, 1…
## $ Alone       <dbl> 0.01453728, -1.34908566, -0.33165146, -0.59909305, -…
## $ Outcome     <dbl> -0.85646340, 0.86964219, -1.71217892, 1.14776886, -1…
## $ Survived    <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
x <- as.data.frame(cbind(as.integer(test$PassengerId), as.integer(test$Survived)))
colnames(x) <- c("PassengerId", "Survived")

write_csv(x, "/Users/Elliot/Documents/GitHub/Teo_Git/Kaggle/titanic_submission.csv", col_names = TRUE)