Titanic Machine Learning Model - Hope Owens, Jan 1, 2022

Project Background

  • This is an example case study to demonstrate my knowledge and skills in statistical science. With a master’s degree in biology, professional research experience, and experience in R, SAS, SQL, BigQuery, and Python, I bring a scientific approach to the analysis of complex systems. With the goal of working as a junior data analyst, this capstone showcases those skills on the question of Titanic survivorship.
library(leaflet)
library(geosphere)

content <- "Titanic Sinking location"

# Great-circle route from Cherbourg (49.6337° N, 1.6221° W) to the approximate
# sinking site (41.4332° N, 49.5649° W); gcIntermediate() expects (lon, lat),
# so western longitudes must be negative
gcIntermediate(c(-1.6221, 49.6337), c(-49.5649, 41.4332),
               n = 90, 
               addStartEnd = TRUE,
               sp = TRUE) %>% 
  leaflet() %>% 
  setView(lng = -49.5649, lat = 41.4332, zoom = 4) %>%  
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>% 
  addPolylines() %>%
  addMarkers(lng = -49.5649, lat = 41.4332) %>%
  addPopups(-49.5649, 41.4332, content,
            options = popupOptions(closeButton = FALSE))

Summary

The sinking of the Titanic is one of the most widely known disasters of modern times.

Shortly before midnight on April 14th, 1912, the Titanic was making her maiden voyage to New York City when she struck an iceberg in the North Atlantic. The “unsinkable” ship (the largest built of her time) sank within two and a half hours. Due to its isolated location, freezing water temperatures, a poorly trained crew, and insufficient lifeboats, the sinking caused a tremendous loss of life: 1,502 of the 2,224 passengers and crew died. The Titanic was a pivotal moment in travel safety, spurring many new regulations that persist in modern times.

As seen in the 1997 Titanic feature film, there was a degree of luck involved in surviving; however, some passengers had a significant advantage over others.

In this challenge, I built a predictive model to identify which kinds of passengers were more likely to survive the Titanic, using passenger data (e.g., name, age, gender, socio-economic class, whether there was room for two people on the door, etc.).

Key players

  • White Star Line - operator of the Titanic
  • Passenger subset

Stage 1: Ask

“The scientist is not a person who gives the right answers, he is one who asks the right questions.”- Claude Lévi-Strauss

1.1 Analysis Task

Identify trends in survivorship for passengers of the Titanic using machine learning techniques.

1.2 Dataset used:

The data source for my study is the Titanic survivorship dataset, which contains records for 887 of the real Titanic passengers. Each row represents one person, and the columns describe attributes of that person, including whether they survived (S), their age (A), their passenger class (C), their sex (G), and the fare they paid (X). The data represent a subset of the total passengers on the Titanic.

1.3 Data Organization and verification:

The data source consists of 2 CSV documents: a training set (train.csv) and a test set (test.csv). Each set is meant to reflect a similar distribution of passenger attributes on the Titanic, with the larger training set used to train machine learning models and the smaller test set used to test model accuracy. The data are in a wide format, since each row is one subject: every passenger has a unique ID with attributes listed in separate columns.

Because the Titanic training set is small, I sorted and filtered tables and created Pivot Tables in Excel, which let me verify the attributes and observations of the training dataset.
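The same checks can be sketched in R for reproducibility; this is a minimal snippet, assuming the train data frame imported in Stage 2 below:

# Verify the wide structure: one row per passenger with a unique ID
nrow(train)                               # total observations
anyDuplicated(train$PassengerId) == 0     # TRUE when every PassengerId is unique
colSums(is.na(train))                     # missing values per column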

Variable   Description
survival   Survival (0 = No; 1 = Yes)
class      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name       Name
sex        Sex
age        Age
sibsp      Number of Siblings/Spouses Aboard
parch      Number of Parents/Children Aboard
ticket     Ticket Number
fare       Passenger Fare
cabin      Cabin
embarked   Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Task 1: Choose data- For this analysis objective, I will work with the test and train datasets to compute descriptive statistics and run machine learning models.

Task 2: Know your audience- I am performing this analysis for the Kaggle competition and the Kaggle community. Kagglers have a working understanding of data analytics and science, so I need to tailor my results to be useful to them.

Task 3: Aims to address objective- To answer this analytics objective, I decided the analysis should focus on these variables: survived, sex, age, class, title, family size class (combined sibsp and parch), child/adult, fare, and embarked.

Stage 2: Prepare

“It is a capital mistake to theorize before one has data.” -Sherlock Holmes

Task 1: Data source validation- Verifying the metadata of the dataset confirmed it as open-source. The owner has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. We can copy, modify, and distribute the work without asking permission.

With information about only 887 of the 2,224 total passengers and crew (roughly 40%), there are limitations from potential sampling bias. This dataset is not representative of the entire population, so the model should be treated as an approximation of trends across the full Titanic passenger list rather than a description of it.

Task 2: Data exploration- I prepared my workspace and loaded the packages used in the analysis. I imported the files into R and made sure they were in a usable format, then used the summary() function on each dataset to find general trends and irregularities in the data. In terms of R syntax, I had no problem with the original column names or data distribution, so I kept them as they were.

Given the amount of data, the need to visualize and model it, and the need to share my results through a markdown document, I am using R for this analysis.

#load packages used
require(car)                    # package to run ANOVA
require(dplyr)                  # package to manipulate data
require(ggplot2)                # package to graph values
require(ggthemes)               # package to supplement visualization
require(magrittr)               # package to use pipe %>% function
require(pastecs)                # package for descriptive statistics
require(PerformanceAnalytics)   # package to create correlation matrix
require(plyr)                   # package to work with ddply
require(reshape2)               # package to create correlation heat map
require(scales)                 # package for visualization
require(tidyverse)              # package to clean data

#model packages-
require(aod)                    # package for logistic regression support
require(caret)                  # decision tree classifier, bagging, confusion matrix
require(ipred)                  # package for bagging classification
require(mice)                   # package for imputing missing values in R
require(modelr)                 # package to apply prediction models
require(randomForest)           # package for random forest (aggregated-tree) models
require(rpart.plot)             # package to plot regression trees
require(rpart)                  # package to create regression tree models
require(xgboost)                # package for gradient boosted tree models

### Import data
train <- read.csv('C:/Users/hoper/OneDrive/Documents/Cycling case study/train.csv', stringsAsFactors = F)
test  <- read.csv('C:/Users/hoper/OneDrive/Documents/Cycling case study/test.csv', stringsAsFactors = F)

titanic  <- bind_rows(train, test) # bind training & test data

Stage 3: Process

“No data is clean, but most is useful” - Dean Abbott

Task 1: Clean and process data- Before I could perform analysis, I needed my values in a usable format. Because I planned both to graph variables as categorical factors and to model them as numeric factors, I transformed the dataset into a numeric and a categorical data class. I filled in missing age values and created subclasses for title, family size, family size class, and child/adult to use later in the analysis.

With a single row (passenger) serving as the unit of replication, I wanted each row to have all future classification variables in separate columns so I could group them later. I also combined the test and train datasets to get a larger sample size for the descriptive statistics.

### Data exploration
# Find basic summary statistics and column names
str(titanic)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(titanic)
##   PassengerId      Survived          Pclass          Name          
##  Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309       
##  1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median : 655   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   : 655   Mean   :0.3838   Mean   :2.295                     
##  3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :1309   Max.   :1.0000   Max.   :3.000                     
##                 NA's   :418                                        
##      Sex                 Age            SibSp            Parch      
##  Length:1309        Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
##  Mode  :character   Median :28.00   Median :0.0000   Median :0.000  
##                     Mean   :29.88   Mean   :0.4989   Mean   :0.385  
##                     3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##                     Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##                     NA's   :263                                     
##     Ticket               Fare            Cabin             Embarked        
##  Length:1309        Min.   :  0.000   Length:1309        Length:1309       
##  Class :character   1st Qu.:  7.896   Class :character   Class :character  
##  Mode  :character   Median : 14.454   Mode  :character   Mode  :character  
##                     Mean   : 33.295                                        
##                     3rd Qu.: 31.275                                        
##                     Max.   :512.329                                        
##                     NA's   :1
# Count missing (NA) values per column
titanic_na <- colSums(is.na(titanic))
titanic_na
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0         418           0           0           0         263 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           1           0           0

3.1 Create new variables and fill missing values

# Missing Age values: Multiple Imputation by Chained Equations (MICE),
# appropriate for randomly distributed missing values; only the imputed
# Age column is used downstream
mice_mod <- titanic[,(names(titanic) %in% c("Survived", "Pclass", "Sex", 
                                            "Age", "SibSp", "Parch", 
                                            "Fare", "Embarked"))]
mice_mod <- mice(mice_mod)
## 
##  iter imp variable
##   1   1  Survived  Age  Fare
##   1   2  Survived  Age  Fare
##   1   3  Survived  Age  Fare
##   1   4  Survived  Age  Fare
##   1   5  Survived  Age  Fare
##   2   1  Survived  Age  Fare
##   2   2  Survived  Age  Fare
##   2   3  Survived  Age  Fare
##   2   4  Survived  Age  Fare
##   2   5  Survived  Age  Fare
##   3   1  Survived  Age  Fare
##   3   2  Survived  Age  Fare
##   3   3  Survived  Age  Fare
##   3   4  Survived  Age  Fare
##   3   5  Survived  Age  Fare
##   4   1  Survived  Age  Fare
##   4   2  Survived  Age  Fare
##   4   3  Survived  Age  Fare
##   4   4  Survived  Age  Fare
##   4   5  Survived  Age  Fare
##   5   1  Survived  Age  Fare
##   5   2  Survived  Age  Fare
##   5   3  Survived  Age  Fare
##   5   4  Survived  Age  Fare
##   5   5  Survived  Age  Fare
mice_output <- complete(mice_mod)

# Plot age distributions
par(mfrow=c(1,2))
hist(titanic$Age, freq=F, main='Age: original values', 
     col='#7da2c0', ylim=c(0,0.04))
hist(mice_output$Age, freq=F, main='Age: missing values filled', 
     col='#7da2c0', ylim=c(0,0.04))

# Replace Age variable from the mice model.
titanic$Age <- mice_output$Age

# Show new number of missing Age values
sum(is.na(titanic$Age))
## [1] 0
# Create child age classification
titanic$Child[titanic$Age < 18] <- 'Child'
titanic$Child[titanic$Age >= 18] <- 'Adult'

# Extract title from names
# Each Name is formatted "Last, Title. First ...", so stripping everything up
# to ", " and everything from the first "." onward leaves just the title,
# e.g. "Braund, Mr. Owen Harris" -> "Mr"
titanic$Title <- gsub('(.*, )|(\\..*)', '', titanic$Name)

# Find counts of different titles
table(titanic$Title)
## 
##         Capt          Col          Don         Dona           Dr     Jonkheer 
##            1            4            1            1            8            1 
##         Lady        Major       Master         Miss         Mlle          Mme 
##            1            2           61          260            2            1 
##           Mr          Mrs           Ms          Rev          Sir the Countess 
##          757          197            2            8            1            1
# Group titles that occur rarely (fewer than 40 times) into a single 'Rare' level
rare <- c('Capt', 'Col', 'Don', 'Dona', 'Dr', 'Jonkheer', 'Lady', 
          'Major', 'Mlle', 'Mme', 'Ms', 'Rev', 'Sir', 
          'the Countess')

titanic$Title[titanic$Title %in% rare]  <- 'Rare'

# Create variable for family size (self + siblings/spouses + parents/children)
titanic$family_size <- titanic$SibSp + titanic$Parch + 1

# Classify family size: single (1), small (2), medium (3-4), large (>4)
titanic$family_class[titanic$family_size == 1] <- 'single'
titanic$family_class[titanic$family_size == 2] <- 'small'
titanic$family_class[titanic$family_size >= 3 & titanic$family_size <= 4] <- 'medium'
titanic$family_class[titanic$family_size > 4] <- 'large'

3.2 Create dataset types: categorical for graphing, numeric for modelling

# Subset data
# Drop variables: name, ticket, cabin
titanic_subset <- titanic[,!(names(titanic) %in% c("Name","Ticket","Cabin"))]

### Create table types
# numeric: predictor variables as integers
titanic_numeric <- titanic_subset

# categorical: predictor variables as variable names
titanic_categorical <- titanic_subset
titanic_categorical<- na.omit(titanic_categorical)

################## numeric dataset
# Convert to integer: Sex; male = 0, female = 1
titanic_numeric$Sex <- factor(titanic_numeric$Sex, 
                              levels=c("male","female"), 
                              labels=c(0,1))
titanic_numeric$Sex <- as.integer(as.character(titanic_numeric$Sex))
# Convert to integer: Embarked; Q = 1, C = 2, S = 3
titanic_numeric$Embarked <- factor(titanic_numeric$Embarked, 
                                   levels=c("Q","C","S"), 
                                   labels=c(1,2,3))
titanic_numeric$Embarked <- as.integer(as.character(titanic_numeric$Embarked))
# Convert to integer: Title; Master = 1, Miss = 2, Mr = 3, Mrs = 4, Rare = 5
titanic_numeric$Title <- factor(titanic_numeric$Title, 
                                levels=c("Master", "Miss", "Mr", "Mrs", "Rare"), 
                                labels=c(1,2,3,4,5))
titanic_numeric$Title <- as.integer(as.character(titanic_numeric$Title))
# Convert to integer: Family class; Single = 1, Small = 2, Medium = 3, 
# Large = 4
titanic_numeric$family_class <- factor(titanic_numeric$family_class, 
                                levels=c("single", "small", "medium", "large"), 
                                labels=c(1,2,3,4))
titanic_numeric$family_class <- as.integer(as.character(titanic_numeric$family_class))
# Convert to integer: Child; Child = 1, Adult = 2
titanic_numeric$Child <- factor(titanic_numeric$Child, 
                                levels=c("Child", "Adult"), 
                                labels=c(1,2))
titanic_numeric$Child <- as.integer(as.character(titanic_numeric$Child))


################# categorical dataset
titanic_categorical$Pclass <- mapvalues(titanic_categorical$Pclass, 
                                       from = c(1, 2, 3), 
                                       to = c("First", "Second", 
                                              "Third"))
titanic_categorical$SibSp <- mapvalues(titanic_categorical$SibSp, 
                                       from = c(0, 1, 2, 3, 4, 5, 6, 7, 8), 
                                       to = c("Zero", "One", "Two", 
                                              "Three", "Four", "Five", 
                                              "Six", "Seven", "Eight"))
titanic_categorical$Parch <- mapvalues(titanic_categorical$Parch, 
                                       from = c(0, 1, 2, 3, 4, 5, 6), 
                                       to = c("Zero", "One", "Two", 
                                              "Three", "Four", "Five", 
                                              "Six"))
titanic_categorical <- mutate(titanic_categorical, 
                                       Survived = if_else(Survived== 0, "No", 
                                                          "Yes"))
titanic_categorical$family_size <- mapvalues(titanic_categorical$family_size, 
                                       from = c(0, 1, 2, 3, 4, 5, 6, 7, 8), 
                                       to = c("Zero", "One", "Two", "Three", 
                                              "Four", "Five", "Six", "Seven", 
                                              "Eight"))
# Drop the ID and collinear variables before building the correlation matrix
titanic_correlation <- titanic_numeric[,!(names(titanic_numeric) %in% 
                                            c("PassengerId", "SibSp",
                                              "Parch", "Embarked"))]
# Remove missing values
titanic_correlation <- na.omit(titanic_correlation)

titanic_numeric data class; predictor variables as numeric:

Variable       Description                                       Class
survival       0 = No; 1 = Yes                                   integer
class          1 = 1st; 2 = 2nd; 3 = 3rd                         integer
sex            male = 0; female = 1                              integer
age            Age                                               numeric
sibsp          Number of Siblings/Spouses Aboard                 integer
parch          Number of Parents/Children Aboard                 integer
fare           Passenger Fare                                    numeric
embarked       Queenstown = 1; Cherbourg = 2; Southampton = 3    integer
title          Master = 1; Miss = 2; Mr = 3; Mrs = 4; Rare = 5   integer
family_size    Passenger family size                             integer
family_class   Single = 1; small = 2; medium = 3; large = 4      integer
Child          Child = 1; Adult = 2                              integer

titanic_categorical data class; predictor variables as characters:

Variable       Description                                                            Class
survival       Survival (Yes/No)                                                      character
class          Passenger Class (First, Second, Third)                                 character
sex            Male/Female                                                            character
age            Age                                                                    numeric
sibsp          Number of Siblings/Spouses Aboard, zero - eight                        character
parch          Number of Parents/Children Aboard, zero - six                          character
fare           Passenger Fare                                                         numeric
embarked       Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)   character
title          Master, Miss, Mr, Mrs, Rare                                            character
family_size    Size of family on the Titanic, one - eight                             character
family_class   Classification of family size: single, small, medium, large           character
Child          Child/Adult                                                            character

Stages 4-5: Analyze and Visualize

“If you torture the data long enough, it will confess.” - Ronald Coase

4.1 Exploratory correlation

### Correlation overview-
# Correlation chart
chart.Correlation(titanic_correlation)

# Correlation matrix
cormat <- round(cor(titanic_correlation),3)
melted_cormat <- melt(cormat)

ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()

4.2 Titanic Survivorship: Sex

titanic_sex <- ddply(titanic_categorical, c("Sex", "Survived"), summarise,
                     N = sum(!is.na(Sex)))  # count non-missing passengers per group

ggplot(titanic_sex, aes(x=Sex, y=N, fill = Survived), color="Survived") +
         geom_col() + 
         labs(title = "Survivorship of the Titanic by gender",
              x="Gender",
              y="Total survivorship", 
              fill="Survived") +
         scale_fill_manual(values = c("Yes" = "#457696","No" = "#7da2c0"))+ 
         theme_minimal() +
         theme(legend.position="bottom") +
         theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
         theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
         theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
         theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
         theme(axis.line = element_line(colour = "black"),
              panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(),
              panel.border = element_blank(),
              panel.background = element_blank()) 

Figure 3: Survivorship of the Titanic sinking by gender (data subset), with passenger survivorship and death indicated by the dark and light bars respectively.

ggplot(data=titanic_categorical)+
        geom_bar(mapping=aes(x = factor(Sex), fill = factor(Sex))) +
        labs( x = "Gender", 
              y = "Number of passengers", 
              title ="Titanic passenger gender distribution", 
              fill = "Gender") + 
        scale_fill_manual(values = c("male" = "#457696","female" = "#7da2c0")) +
        theme_minimal() +
        theme(legend.position="bottom") +
        theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
        theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
        theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
        theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
        theme(axis.line = element_line(colour = "black"),
             panel.grid.major = element_blank(),
             panel.grid.minor = element_blank(),
             panel.border = element_blank(),
             panel.background = element_blank()) 

Figure 4: Distribution of Titanic passengers by sex (passenger subset), with women indicated by the light bar and men indicated by the dark bar.

######### ANOVA: survivorship/ sex #########
anova_sex <- aov(Survived~as.factor(Sex), data = titanic_numeric)
Anova(anova_sex, type="III")

4.3 Titanic Survivorship: Class

titanic_class <- ddply(titanic_categorical, c("Pclass", "Survived"), summarise,
                       N = sum(!is.na(Pclass)))  # count non-missing passengers per group

g3 <- titanic_class %>%
  mutate(Pclass = fct_relevel(Pclass, "First","Second","Third")) %>%
  ggplot(aes(x=Pclass, y=N, fill = Survived), color="Survived") +
       geom_col() + 
       labs(title = "Survivorship of the Titanic by class",
            x="Passenger class",
            y="Total survivorship", 
            fill="Survived") +
       scale_fill_manual(values = c("Yes" = "#457696","No" = "#7da2c0"))+ 
       theme_minimal() +
       theme(legend.position="bottom") +
       theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
       theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
       theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
       theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
       theme(axis.line = element_line(colour = "black"),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            panel.border = element_blank(),
            panel.background = element_blank()) 
g3 + coord_flip()

Figure 5: Survivorship of the Titanic sinking by passenger class (data subset, n = 877), with passenger survivorship and death indicated by the dark and light bars respectively.

g4 <- ggplot(data=titanic_categorical)+
      geom_bar(mapping=aes(x = factor(Pclass), fill = factor(Pclass))) +
      labs( x = "Passenger Class", y = "Number of Passengers", 
            title ="Titanic passenger class distribution", 
            fill = "Pclass") + 
      scale_fill_manual(values = c("First" = "#36556f",
                                   "Second" = "#476f91",
                                   "Third" = "#5c89af")) +
      theme_minimal() +
      theme(legend.position="bottom") +
      theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
      theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
      theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
      theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
      theme(axis.line = element_line(colour = "black"),
           panel.grid.major = element_blank(),
           panel.grid.minor = element_blank(),
           panel.border = element_blank(),
           panel.background = element_blank())
g4 + coord_flip()

Figure 6: Distribution of Titanic passengers by class (passenger subset, n = 877), with the light bar for third class, the medium bar for second class and the dark bar for first class.

######### ANOVA: survivorship/ class #########
anova_class <- aov(Survived~as.factor(Pclass), data = titanic_numeric)
Anova(anova_class, type="III")

4.4 Titanic Survivorship: Family Size

titanic_family <- titanic_categorical %>%
  group_by(family_class) %>%
  tally() 

#titanic_family
#714 distinct family groups

titanic_family$percent <- titanic_family$n / 714

titanic_family %>%
  ggplot(aes(x = "", y = n, fill = family_class)) +   # slice size = passengers per class
  geom_bar(stat = "identity", width = 1, col="black")+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("#92bde0", "#7da2c0", "#457696", "#5a84a3")) +
  geom_label(aes(label = family_class),
             color = "Black",
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  labs(title="Proportional family size of Titanic passengers") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank(),) +  
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = 'white', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = 'white', size = 12, hjust = 0.5))

Figure 7: Proportional distribution of family size on the Titanic (passenger subset, n = 877). The darkest color indicates the classification of single passengers, followed by small families (2 people), medium families (3-4 people) and large families (>4 people).

titanic_family2 <- ddply(titanic_categorical, c("family_class", "Survived"), summarise,
                         N = sum(!is.na(family_class)))  # count non-missing passengers per group

g9 <- titanic_family2 %>%
  mutate(family_class = fct_relevel(family_class, "single","small","medium",
                             "large")) %>%
  ggplot(aes(x=family_class, y=N, fill = Survived), color="Survived") +
  geom_col() + 
  labs(title = "Survivorship of the Titanic by family size",
       x="Family size class aboard the Titanic",
       y="Total survivorship", 
       fill="Survived") +
  scale_fill_manual(values = c("Yes" = "#457696","No" = "#7da2c0"))+ 
  theme_minimal() +
  theme(legend.position="bottom") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())
g9 + coord_flip()

Figure 8: Survivorship of the Titanic by family size class (passenger subset, n = 877), with passenger survivorship and death indicated by the dark and light bars respectively.

######### ANOVA: survivorship/ family size class #########
anova_f_size <- aov(Survived~as.factor(family_size), data = titanic_numeric)
Anova(anova_f_size, type="III")

4.5 Titanic Survivorship: Age

# Create age dataset
titanic_age<-titanic_categorical
# Group and tally passengers by age
titanic_age.2<-data.frame(titanic_age) %>% 
                   group_by(Age, Survived) %>% 
                   tally() 
# Subset data- survived vs. died
titanic_age_survived <- data.frame(titanic_age.2) %>% 
                            group_by(Age, Survived) %>% 
                            filter(Survived == "Yes")

titanic_age_died <- data.frame(titanic_age.2) %>% 
                            group_by(Age, Survived) %>% 
                            filter(Survived == "No")
### Figure: Survivorship: Age
ggplot() +
  geom_col(data = titanic_age_died, aes(x = Age, y = n), 
           fill = "#bc5090", 
           alpha = 0.7, 
           width = 0.6) +
  geom_col(data = titanic_age_survived, 
           aes(x = Age, y = n), 
           fill = "#003f5c", 
           alpha = 0.7, 
           width = 0.6) +
  geom_smooth(data = titanic_age.2, aes(x = Age, y = n, color=Survived), se=FALSE, size = 1.2) +
  labs(title = "Survivorship of the Titanic by Age",x="Age",y="Count", fill="Survived") +
  scale_fill_manual(values = c("Yes" = "#003f5c","No" = "#bc5090")) +
  scale_color_manual(values=c("Yes" = "#003f5c","No" = "#bc5090"))+
  theme_minimal() +
  theme(legend.position="bottom") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  xlim(0, 82) + 
  ylim(0, 20) + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank()) 

Figure 9: Survivorship of the Titanic by age, histogram (passenger subset, n = 877). Red bars indicate death, blue bars indicate survival, and the lines show smoothed trends for each group.

### Figure: Survivorship: Age/Gender
ggplot(titanic_age, aes(Age, fill = factor(Survived))) + 
  geom_histogram() +
  scale_fill_manual(values = c("Yes" = "#003f5c","No" = "#bc5090")) +
  theme_minimal() +
  theme(legend.position="bottom") +
  labs( x = "Age", 
        y = "Frequency", 
        title ="Survivorship of the Titanic by Sex and Age", 
        fill = "Survived") + 
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())+
  facet_grid(~Sex) + 
  theme_few()

Figure 10: Survivorship of the Titanic by age and sex (passenger subset, n = 877), with survivorship indicated by the dark blue bars and death by the red bars.

######### ANOVA: survivorship/ age #########
anova_age <- aov(Survived~as.factor(Age), data = titanic_numeric)
Anova(anova_age, type="III")

4.6 Titanic Embarkation Port vs. Fare and Class

# Remove the two passengers with missing Embarked values (IDs 62 and 830)
titanic_graph10 <- titanic_categorical %>%
  filter(PassengerId != 62 & PassengerId != 830)

ggplot(titanic_graph10, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
  geom_col() +
  labs( x = "Port", y = "Fare (£)", 
        title ="Titanic fare price by class distribution and embarking port", 
        fill = "Pclass") +
  scale_fill_manual(values = c("First" = "#36556f",
                               "Second" = "#476f91",
                               "Third" = "#5c89af")) +
  theme_minimal() +
  theme(legend.position="bottom") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank())

Figure 11: Proportional class distribution by fare price (£) and embarking port (passenger subset, n = 877). Class is indicated by the bar colors with third and first class having the lightest and darkest bars respectively.

ggplot(titanic_categorical, 
       aes(x = Fare)) +
  geom_density(fill = '#476f91', alpha=0.6) +
  labs( x = "Fare (£)", y = "Proportion", 
        title ="Titanic fare price by class distribution and embarking port", 
        fill = "Pclass") +
  scale_x_continuous(labels=dollar_format()) +
  theme_few()+
  xlim(0,200)+
  facet_grid(~Pclass)+
  theme_bw() +
  theme(legend.position="bottom") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())

Figure 12: Proportional frequency of passengers, fare price (£) and passenger class on the Titanic (passenger subset, n = 877).

######### ANOVA: Passenger class/ fare + embarking port #########
anova_fare <- lm(Pclass~factor(Fare), data = titanic_numeric)
Anova(anova_fare, type="III")
anova_embarked <- lm(Pclass~factor(Embarked), data = titanic_numeric)
Anova(anova_embarked, type="III")

4.7 Titanic Survivorship: Title, Class, and Embarkation Port

ggplot(titanic_graph10, aes(x=Title, fill = factor(Pclass))) + 
  geom_bar() +
  theme_bw() +
  theme(legend.position="bottom") +
  labs( x = "Title", 
        y = "Frequency", 
        title ="Embarked port by title and class", 
        fill = "Class") + 
  scale_fill_manual(values = c("First" = "#36556f",
                               "Second" = "#476f91",
                               "Third" = "#5c89af")) +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())+
        facet_grid(~Embarked)

Figure 13: Frequency of passenger titles by class and embarkation port on the Titanic (passenger subset, n = 877). Class is indicated by the bar colors, with third and first class having the lightest and darkest bars respectively.

ggplot(titanic_categorical, aes(Title, fill = factor(Survived))) + 
  geom_bar() +
  scale_fill_manual(values = c("Yes" = "#457696","No" = "#7da2c0")) +
  theme(legend.position="bottom") +
  theme_bw() +
  labs( x = "Age", 
        y = "Frequency", 
        title ="Survivorship of the Titanic by title and class", 
        fill = "Survived") + 
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())+
  facet_grid(~Pclass) 

Figure 14: Proportional frequency of survivorship by title and class onboard the Titanic (passenger subset, n = 877), with passenger survivorship and death indicated by the dark and light bars respectively.

anova_title <- lm(Survived~factor(Title), data = titanic_numeric)
Anova(anova_title, type="III")

Stage 6: Machine learning modeling

“All models are wrong, but some are useful” -George E. P. Box.

6.1 Set numeric dataset as factors

# Convert predictor variables to factors
factor_vars <- c('PassengerId','Pclass','Sex','Embarked',
                 'Title','family_size','Child')
titanic_numeric[factor_vars] <- lapply(titanic_numeric[factor_vars], function(x) as.factor(x))

# Remove the two passengers with missing Embarked values (IDs 62 and 830)
embark_fare <- titanic_numeric %>%
  filter(PassengerId != 62 & PassengerId != 830)

6.2 Regression Trees

First, we create a few fast regression trees to see what rough “branches” of classification exist within our variables. Splits in a regression tree mark the variables with the greatest classification effect in the dataset, so we can read the branches to see which variables are likely to affect Titanic survivorship. A regression tree searches the input and output values to find the predictor and split point that best separate the response (survival), splitting the dataset so that the overall sum of squared errors is minimized: \({\texttt{minimize} \bigg\{ SSE = \sum_{i \in R_1}(y_i - c_1)^2 + \sum_{i \in R_2}(y_i - c_2)^2 \bigg\}\tag{1}}\) where \(R_1\) and \(R_2\) are the two regions created by the split and \(c_1\) and \(c_2\) are the mean responses within each region. Regression trees are grown in a top-down, greedy fashion, meaning that earlier splits are not affected by later ones. Their advantages are that they are fast, they handle missing values automatically, and they can model complex, non-linear “jagged” responses, so they work well with a multiple-factor system like Titanic survivorship.

However, due to the high variance that can exist with this method of classification, single regression trees can have poor predictive accuracy. They work best for exploratory analysis and in combination with other modelling techniques.
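To make equation (1) concrete, here is a minimal toy sketch of the SSE for one candidate split point, using made-up x and y vectors rather than Titanic data; rpart performs this search over every predictor and every split point:

# Toy illustration of equation (1): summed within-region SSE for a split at s
x <- c(1, 2, 3, 10, 11, 12)          # one predictor
y <- c(0, 0, 0, 1, 1, 1)             # response
split_sse <- function(x, y, s) {
  left  <- y[x <  s]                 # region R1
  right <- y[x >= s]                 # region R2
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}
split_sse(x, y, 5)                   # 0.00 - a perfect split
split_sse(x, y, 3)                   # 0.75 - a worse split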

# Set a random seed
set.seed(129)
# Create regression tree models with different data splitting parameters
fit <- rpart(Survived ~ Pclass + Sex + SibSp + Parch + 
               Fare + Embarked + Title + Child + family_size, data = titanic_numeric)
fit2 <- rpart(Survived ~ Pclass + Sex + SibSp + Parch + 
                Fare + Embarked + Title + Child + family_size, data = titanic_numeric,
              parms = list(prior = c(.65,.35), split = "information"))
fit3 <- rpart(Survived ~ Pclass + Sex + SibSp + Parch + 
                Fare + Embarked + Title + Child + family_size, data = titanic_numeric,
              control = rpart.control(cp = 0.05))
par(mfrow = c(1,2), xpd = NA) # otherwise on some devices the text is clipped

# Plot trees
rpart.plot(fit, 
           box.palette = "GnBu", # color scheme
           branch.lty = 3, # dotted branch lines
           shadow.col = "gray", # shadows under the node boxes
           nn = TRUE) # display the node numbers

Figure 15. Regression tree classification of the Titanic survivorship dataset using default rpart parameters.

rpart.plot(fit2, 
           box.palette = "GnBu", # color scheme
           branch.lty = 3, # dotted branch lines
           shadow.col = "gray", # shadows under the node boxes
           nn = TRUE) # display the node numbers

Figure 16. Regression tree classification of the Titanic survivorship dataset using prior class probabilities (0.65/0.35) and the information split criterion.

rpart.plot(fit3,
           box.palette = "GnBu", # color scheme
           branch.lty = 3, # dotted branch lines
           shadow.col = "gray", # shadows under the node boxes
           nn = TRUE) # display the node numbers

Figure 17. Regression tree classification of the Titanic survivorship dataset with the complexity parameter raised to cp = 0.05.

6.3 Create machine learning models for Titanic dataset

I am going to use four different model types to explore prediction on our training data. The training portion of the dataset has known survivorship labels, so we can measure each model's accuracy in predicting passenger life or death.

The models we are going to use are-

  • the Random Forest model: An ensemble learning method that builds many decision trees, each on a bootstrapped subset of the data and with a random subset of variables considered at each tree node. The trees then vote, and the “majority wins” prediction is taken. Bootstrapping and a large number of trees reduce the error of any individual tree (a minimal sketch of this voting idea follows the list).

  • the Logistic Regression model: Logistic regression is similar to linear regression, which fits a line through a spread of data; however, logistic regression models the probability of a binomial outcome. This fits the binomial output of survivorship, with 0 = no and 1 = yes.

  • the Decision Tree Classifier model: Similar to regression trees, predictor variables become nodes that split the dataset to minimize the sum of squared errors (SSE). More nodes let the tree fit the training data more closely, though at the risk of overfitting.

  • the Bagging Classifier model: Similar to the random forest; however, in random forest models only a random subset of features is considered when splitting each node, while in bagging models all features are considered at every split.
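Before fitting the real models, here is a minimal sketch of the “majority wins” idea shared by the random forest and bagging models; this is illustrative only, with d standing in for any hypothetical data frame holding a 0/1 Survived column:

# Hand-rolled bootstrap aggregation: fit one rpart tree per bootstrap resample,
# then let the trees vote on each passenger
bagged_vote <- function(d, n_trees = 25) {
  votes <- sapply(seq_len(n_trees), function(i) {
    boot <- d[sample(nrow(d), replace = TRUE), ]       # bootstrap resample
    tree <- rpart(factor(Survived) ~ ., data = boot)   # one tree per resample
    as.integer(as.character(predict(tree, d, type = "class")))
  })
  as.integer(rowMeans(votes) > 0.5)                    # majority vote, 0/1 output
}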

# Split the data back into a train set and a test set
train <- titanic_numeric[1:891,]
test <- titanic_numeric[892:1309,]
train <- na.omit(train)   # drop rows with remaining missing values
# Set a random seed
set.seed(754)
# (1a) Create random forest model
rf_model <- randomForest(factor(Survived) ~ Pclass + Age + Sex + SibSp + Parch +
                           Fare + Title + Child + family_size,
                           data = train)
## Use the model to predict survivorship on the train dataset, where survivorship is known
predict_rf <- predict(rf_model, train)
## predict() on a randomForest classifier returns a factor with levels "0"/"1";
## as.integer() maps those levels to 1/2, so they are recoded back to 0/1 below
predict_rf <- as.integer(predict_rf)
# Add the predicted survivorship vector to the train dataset
train$model_rf <- predict_rf
## Convert values from 1/2 back to binomial 0/1
train$model_rf[train$model_rf == 1] <- 0
train$model_rf[train$model_rf == 2] <- 1

6.3.1 Perform Machine Learning Models

###### (2a) LogisticRegression
mylogit <- glm(Survived ~ Pclass +Age + Sex + SibSp + Parch +
                 Fare + Title + Child + family_size, data = train, family = "binomial")
glm <- predict(mylogit, train)   # default type = "link": predictions on the log-odds scale
# Note: a 0.5 cutoff on log-odds corresponds to a survival probability of about
# 0.62; predict(..., type = "response") would allow a 0.5 probability cutoff
predict_glm <- as.numeric(glm > 0.5)
train$model_glm <- predict_glm

###### (3a) DecisionTreeClassifier
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(Survived ~ Pclass +Age + Sex + SibSp + Parch +
                     Fare + Title + Child + family_size, data = train, method = "rpart",
                   parms = list(split = "information"),
                   trControl=trctrl,
                   tuneLength = 10)
dtree_fit <- predict(dtree_fit, train)
predict_dtree_fit<-as.numeric((dtree_fit)>0.5)
train$model_dtree_fit <- predict_dtree_fit 

###### (4a) BaggingClassifier
fit <- bagging(Survived ~ Pclass +Age + Sex + SibSp + Parch +
                 Fare + Title + Child + family_size, data = train, coob = T, nbagg = 100)
bagging <- predict(fit, train)
predict_bagging <- as.numeric((bagging)>0.5)
train$model_bagging <- predict_bagging 

6.4 Testing model accuracy

Confusion matrix and F-score

Our training dataset now has both the model predictions and the known survivorship values as separate columns. Using these, we can statistically test the accuracy of the model predictions. Our next goal is to find the most accurate model and use it to predict survivorship in the test dataset, where the values are unknown.

Model accuracy can be measured through a confusion matrix using these four counts:

  • true positives (survival predicted correctly)

  • false positives (survival predicted incorrectly)

  • true negatives (death predicted correctly)

  • false negatives (death predicted incorrectly)

We are also going to find the F-score of our models, which provides a numeric ranking of model fit. \({\texttt{F-score} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \tag{2}}\)

Models with higher F-score values, which range from 0 to 1, are more accurate at predicting survivorship on the Titanic; a worked example follows.
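As a check on equation (2), the hand computation below reproduces the F1 value that confusionMatrix() reports for the logistic regression model further down (counts for the positive class, Survived = 1: TP = 216, FP = 40, FN = 103):

# Worked F-score example from the logistic regression confusion matrix
tp <- 216; fp <- 40; fn <- 103
precision <- tp / (tp + fp)                         # 0.8438
recall    <- tp / (tp + fn)                         # 0.6771
2 * (precision * recall) / (precision + recall)     # 0.7513, matching the F1 below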

# Confusion matrices
# Define vectors of actual values and predicted values
actual <- factor(train$Survived)
pred_glm <- factor(train$model_glm)
pred_dtree_fit <- factor(train$model_dtree_fit)
pred_bagging <- factor(train$model_bagging)
pred_rf <- factor(train$model_rf)

# Create a confusion matrix for each model; mode = "everything" also reports
# precision, recall, and the F1 score
confusionMatrix(pred_glm, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 501 103
##          1  40 216
##                                          
##                Accuracy : 0.8337         
##                  95% CI : (0.8071, 0.858)
##     No Information Rate : 0.6291         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6287         
##                                          
##  Mcnemar's Test P-Value : 2.164e-07      
##                                          
##             Sensitivity : 0.6771         
##             Specificity : 0.9261         
##          Pos Pred Value : 0.8437         
##          Neg Pred Value : 0.8295         
##               Precision : 0.8438         
##                  Recall : 0.6771         
##                      F1 : 0.7513         
##              Prevalence : 0.3709         
##          Detection Rate : 0.2512         
##    Detection Prevalence : 0.2977         
##       Balanced Accuracy : 0.8016         
##                                          
##        'Positive' Class : 1              
## 
confusionMatrix(pred_dtree_fit, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 493  88
##          1  48 231
##                                           
##                Accuracy : 0.8419          
##                  95% CI : (0.8157, 0.8656)
##     No Information Rate : 0.6291          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6522          
##                                           
##  Mcnemar's Test P-Value : 0.0008251       
##                                           
##             Sensitivity : 0.7241          
##             Specificity : 0.9113          
##          Pos Pred Value : 0.8280          
##          Neg Pred Value : 0.8485          
##               Precision : 0.8280          
##                  Recall : 0.7241          
##                      F1 : 0.7726          
##              Prevalence : 0.3709          
##          Detection Rate : 0.2686          
##    Detection Prevalence : 0.3244          
##       Balanced Accuracy : 0.8177          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(pred_bagging, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 513  83
##          1  28 236
##                                           
##                Accuracy : 0.8709          
##                  95% CI : (0.8467, 0.8926)
##     No Information Rate : 0.6291          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7133          
##                                           
##  Mcnemar's Test P-Value : 2.968e-07       
##                                           
##             Sensitivity : 0.7398          
##             Specificity : 0.9482          
##          Pos Pred Value : 0.8939          
##          Neg Pred Value : 0.8607          
##               Precision : 0.8939          
##                  Recall : 0.7398          
##                      F1 : 0.8096          
##              Prevalence : 0.3709          
##          Detection Rate : 0.2744          
##    Detection Prevalence : 0.3070          
##       Balanced Accuracy : 0.8440          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(pred_rf, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 531  50
##          1  10 269
##                                           
##                Accuracy : 0.9302          
##                  95% CI : (0.9111, 0.9463)
##     No Information Rate : 0.6291          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8466          
##                                           
##  Mcnemar's Test P-Value : 4.782e-07       
##                                           
##             Sensitivity : 0.8433          
##             Specificity : 0.9815          
##          Pos Pred Value : 0.9642          
##          Neg Pred Value : 0.9139          
##               Precision : 0.9642          
##                  Recall : 0.8433          
##                      F1 : 0.8997          
##              Prevalence : 0.3709          
##          Detection Rate : 0.3128          
##    Detection Prevalence : 0.3244          
##       Balanced Accuracy : 0.9124          
##                                           
##        'Positive' Class : 1               
## 

6.4.1 Confusion Matrix plots

# Create confusion matrices for graphing (model predictions first, actual values second)
cm_glm <- confusionMatrix(factor(train$model_glm), factor(train$Survived), dnn = c("Prediction", "Reference"))
cm_dtree_fit <- confusionMatrix(factor(train$model_dtree_fit), factor(train$Survived), dnn = c("Prediction", "Reference"))
cm_bagging <- confusionMatrix(factor(train$model_bagging), factor(train$Survived), dnn = c("Prediction", "Reference"))
cm_rf <- confusionMatrix(factor(train$model_rf), factor(train$Survived), dnn = c("Prediction", "Reference"))

# Set table as a dataframe for graph
plt_glm <- as.data.frame(cm_glm$table)
plt_dtree_fit <- as.data.frame(cm_dtree_fit$table)
plt_bagging <- as.data.frame(cm_bagging$table)
plt_rf <- as.data.frame(cm_rf$table)

## Set factors for prediction
plt_glm$Prediction <- factor(plt_glm$Prediction, levels=rev(levels(plt_glm$Prediction)))
plt_dtree_fit$Prediction <- factor(plt_dtree_fit$Prediction, levels=rev(levels(plt_dtree_fit$Prediction)))
plt_bagging$Prediction <- factor(plt_bagging$Prediction, levels=rev(levels(plt_bagging$Prediction)))
plt_rf$Prediction <- factor(plt_rf$Prediction, levels=rev(levels(plt_rf$Prediction)))

### Plot confusion matrices
# GLM model
ggplot(plt_glm, aes(Prediction, Reference, fill = Freq)) +
  geom_tile() + geom_text(aes(label = Freq)) +
  scale_fill_gradient(low = "white", high = "#009194") +
  labs(x = "Prediction", y = "Reference") +
  scale_x_discrete(labels = c("Survived (1)", "Died (0)")) +
  scale_y_discrete(labels = c("Died (0)", "Survived (1)"))

Figure 18. Confusion matrix for the logistic regression model. Frequency of values are indicated by the intensity of color, with greater frequency shown by dark blue.

# DTree Fit
ggplot(plt_dtree_fit, aes(Prediction, Reference, fill = Freq)) +
  geom_tile() + geom_text(aes(label = Freq)) +
  scale_fill_gradient(low = "white", high = "#009194") +
  labs(x = "Prediction", y = "Reference") +
  scale_x_discrete(labels = c("Survived (1)", "Died (0)")) +
  scale_y_discrete(labels = c("Died (0)", "Survived (1)"))

Figure 19. Confusion matrix for the decision tree classifier model. Frequency of values are indicated by the intensity of color, with greater frequency shown by dark blue.

# Bagging
ggplot(plt_bagging, aes(Prediction, Reference, fill = Freq)) +
  geom_tile() + geom_text(aes(label = Freq)) +
  scale_fill_gradient(low = "white", high = "#009194") +
  labs(x = "Prediction", y = "Reference") +
  scale_x_discrete(labels = c("Survived (1)", "Died (0)")) +
  scale_y_discrete(labels = c("Died (0)", "Survived (1)"))

Figure 20. Confusion matrix for the bagging classifier model. Frequency of values are indicated by the intensity of color, with greater frequency shown by dark blue.

# Random Forest
ggplot(plt_rf, aes(Prediction, Reference, fill = Freq)) +
  geom_tile() + geom_text(aes(label = Freq)) +
  scale_fill_gradient(low = "white", high = "#009194") +
  labs(x = "Prediction", y = "Reference") +
  scale_x_discrete(labels = c("Survived (1)", "Died (0)")) +
  scale_y_discrete(labels = c("Died (0)", "Survived (1)"))

Figure 21. Confusion matrix for the random forest model. Frequency of values are indicated by the intensity of color, with greater frequency shown by dark blue.

### Graph F-scores, find best fitting model
# Model names
model <- c('glm', 'dtree', 'bagging', 'rf')
# F1 scores taken from the confusionMatrix() output above
f_score <- c(0.7513, 0.7726, 0.8096, 0.8997)
# Create dataframe
f_score_data <- data.frame(model, f_score)
### Figure: Plot f-scores
# Create color gradient function for graph
colfunc <- colorRampPalette(c("#2f5269", "#7da2c0"))
# Plot values
ggplot(f_score_data, aes(x = reorder(model, f_score), 
  y = f_score, fill = f_score)) +
  geom_bar(stat='identity') + 
  labs(x = 'Model Type',
       y = 'F-Score',
       title = 'Model Fit') +
  geom_text(aes(x = model, y = 0.7, label = f_score),
            hjust=0.5, vjust=-0.5, size = 5, colour = 'white') +
  scale_fill_gradientn(colours=colfunc(4))+
  theme_bw() +
  theme(legend.position="bottom") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank()) 

Figure 22. F-Score of machine learning models used on the Titanic training dataset. Bars indicate machine learning models used. Accuracy of the model is shown by the intensity of color, with more accurate models having a brighter color.

6.5 Random Forest model

The random forest model had the highest F-score, making it the most accurate model on the training data. We will now investigate it further to understand its error and the importance of different variables in making predictions.

# Show model error
plot(rf_model, ylim=c(0,0.36))
legend('topright', colnames(rf_model$err.rate), col=1:3, fill=1:3)

Figure 23. Error of rf_model on the Titanic training dataset as trees are added (out-of-bag error plus the per-class error rates).

# Get importance
importance    <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))

# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))

# Create color gradient function for graph
colfunc <- colorRampPalette(c("#2f5269", "#7da2c0"))
### Figure: relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, -Importance), 
                           y = Importance, fill = Importance)) +
  geom_bar(stat='identity') + 
  labs(x = 'Predictor Variables',
       y = 'Importance (mean decrease in Gini)',
       title = 'Variable impact on Titanic survivorship') +
  geom_text(aes(x = Variables, y = 0.9, label = Rank),
            hjust=0.5, vjust=-0.1, size = 5, colour = 'white') +
  scale_fill_gradientn(colours=colfunc(4))+
  theme_bw() +
  theme(legend.position="bottom") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())

Figure 24. Variable importance for predicting survivorship on the Titanic. Importance (mean decrease in Gini) is indicated by the intensity of color, with more important variables shown in brighter blue.

The most influential variables in predicting survival are title (e.g., Duchess vs. Mr.), fare paid, sex, and age. This is interesting: when I ran descriptive statistics, survival was not a cut-and-dried matter of first-class passengers plus "women and children first." The results make sense, though they are sobering, with title and fare paid being the two strongest determinants of survival.

6.6 “Tuning” the Random Forest model

We will now experiment with the variables used in the random forest model to see whether we can achieve a better fit. I subset the model with different permutations of the significant variables and then compare the models' F-scores. As before, these F-scores are computed on the training data, so they measure goodness of fit rather than out-of-sample performance.

### Narrow down rf_model for more accurate predictions

# original model
set.seed(754)
rf_model_1 <- randomForest(factor(Survived) ~ Pclass + Age + Sex + SibSp + Parch +
                           Fare + Title + Child + family_size,
                           data = train)
predict_rf_1 <- predict(rf_model_1, train)
# as.integer() on a factor returns the level index (1 = "0", 2 = "1"),
# so recode the indices back to 0/1
predict_rf_1 <- as.integer(predict_rf_1)
train$model_rf_1 <- predict_rf_1
train$model_rf_1[train$model_rf_1 == 1] <- 0
train$model_rf_1[train$model_rf_1 == 2] <- 1
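# Aside (an equivalent shortcut, not from the original workflow): the three
# recode lines above collapse to one, because as.character() keeps the
# factor's "0"/"1" labels rather than the level indices:
# train$model_rf_1 <- as.integer(as.character(predict_rf_1))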

# subset 1
set.seed(754)
rf_model_2 <- randomForest(factor(Survived) ~ Pclass + Age + Sex + SibSp + Parch +
                             Fare + Title + family_size,
                             data = train)
predict_rf_2 <- predict(rf_model_2, train)
predict_rf_2 <- as.integer(predict_rf_2)
train$model_rf_2 <- predict_rf_2
train$model_rf_2[train$model_rf_2 == 1] <- 0
train$model_rf_2[train$model_rf_2 == 2] <- 1

# subset 2
set.seed(754)
rf_model_3 <- randomForest(factor(Survived) ~ Pclass + Age + Sex + SibSp +
                             Fare + Title + Child + family_size,
                             data = train)
predict_rf_3<-predict(rf_model_3, train)
predict_rf_3 <- as.integer(predict_rf_3)
train$model_rf_3 <- predict_rf_3
train$model_rf_3[train$model_rf_3 == 1] <- 0
train$model_rf_3[train$model_rf_3 == 2] <- 1



# subset 3
set.seed(754)
rf_model_4 <- randomForest(factor(Survived) ~ Pclass + Age + Sex + Parch +
                             Fare + Title + family_size,
                             data = train)
predict_rf_4<-predict(rf_model_4, train)
predict_rf_4 <- as.integer(predict_rf_4)
train$model_rf_4 <- predict_rf_4
train$model_rf_4[train$model_rf_4 == 1] <- 0
train$model_rf_4[train$model_rf_4 == 2] <- 1

# subset 4
set.seed(754)
rf_model_5 <- randomForest(factor(Survived) ~ Pclass + Age + Sex +
                             Fare + Title + family_size,
                             data = train)
predict_rf_5 <-predict(rf_model_5, train)
predict_rf_5 <- as.integer(predict_rf_5)
train$model_rf_5 <- predict_rf_5
train$model_rf_5[train$model_rf_5 == 1] <- 0
train$model_rf_5[train$model_rf_5 == 2] <- 1

# subset 5
set.seed(754)
rf_model_6 <- randomForest(factor(Survived) ~ Pclass + Age + Sex +
                             Fare + Title,
                             data = train)
predict_rf_6 <- predict(rf_model_6, train)
predict_rf_6 <- as.integer(predict_rf_6)
train$model_rf_6 <- predict_rf_6
train$model_rf_6[train$model_rf_6 == 1] <- 0
train$model_rf_6[train$model_rf_6 == 2] <- 1
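The subsets above vary only which predictors enter the formula. As an aside (not something done in this analysis), the forest's own hyperparameters can be tuned too: randomForest ships a helper, tuneRF(), that searches over mtry, the number of variables tried at each split. A minimal sketch, assuming train is prepared as above:

# Search for a good mtry over the full predictor set (an assumed extension,
# not part of the original workflow)
set.seed(754)
predictors <- train[, c("Pclass", "Age", "Sex", "SibSp", "Parch",
                        "Fare", "Title", "Child", "family_size")]
tuneRF(x = predictors, y = factor(train$Survived),
       ntreeTry = 500, stepFactor = 1.5, improve = 0.01)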

6.6.1 Compare subset and full model accuracy

#define vectors of actual values and predicted values
actual <- factor(train$Survived)
pred_rf_1 <- factor(train$model_rf_1)
pred_rf_2 <- factor(train$model_rf_2)
pred_rf_3 <- factor(train$model_rf_3)
pred_rf_4 <- factor(train$model_rf_4)
pred_rf_5 <- factor(train$model_rf_5)
pred_rf_6 <- factor(train$model_rf_6)

# Create confusion matrix and calculate metrics related to confusion matrix
# Look up F-Score
confusionMatrix(pred_rf_1, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 531  55
##          1  10 264
##                                           
##                Accuracy : 0.9244          
##                  95% CI : (0.9047, 0.9412)
##     No Information Rate : 0.6291          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8332          
##                                           
##  Mcnemar's Test P-Value : 4.828e-08       
##                                           
##             Sensitivity : 0.8276          
##             Specificity : 0.9815          
##          Pos Pred Value : 0.9635          
##          Neg Pred Value : 0.9061          
##               Precision : 0.9635          
##                  Recall : 0.8276          
##                      F1 : 0.8904          
##              Prevalence : 0.3709          
##          Detection Rate : 0.3070          
##    Detection Prevalence : 0.3186          
##       Balanced Accuracy : 0.9046          
##                                           
##        'Positive' Class : 1               
## 
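As a spot check on the printed metrics, the F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.9635 × 0.8276) / (0.9635 + 0.8276) ≈ 0.8904, matching the value reported above.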
confusionMatrix(pred_rf_2, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 522  69
##          1  19 250
##                                           
##                Accuracy : 0.8977          
##                  95% CI : (0.8755, 0.9171)
##     No Information Rate : 0.6291          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7735          
##                                           
##  Mcnemar's Test P-Value : 1.757e-07       
##                                           
##             Sensitivity : 0.7837          
##             Specificity : 0.9649          
##          Pos Pred Value : 0.9294          
##          Neg Pred Value : 0.8832          
##               Precision : 0.9294          
##                  Recall : 0.7837          
##                      F1 : 0.8503          
##              Prevalence : 0.3709          
##          Detection Rate : 0.2907          
##    Detection Prevalence : 0.3128          
##       Balanced Accuracy : 0.8743          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(pred_rf_3, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 525  76
##          1  16 243
##                                           
##                Accuracy : 0.893           
##                  95% CI : (0.8704, 0.9129)
##     No Information Rate : 0.6291          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7616          
##                                           
##  Mcnemar's Test P-Value : 7.691e-10       
##                                           
##             Sensitivity : 0.7618          
##             Specificity : 0.9704          
##          Pos Pred Value : 0.9382          
##          Neg Pred Value : 0.8735          
##               Precision : 0.9382          
##                  Recall : 0.7618          
##                      F1 : 0.8408          
##              Prevalence : 0.3709          
##          Detection Rate : 0.2826          
##    Detection Prevalence : 0.3012          
##       Balanced Accuracy : 0.8661          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(pred_rf_4, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 525  66
##          1  16 253
##                                          
##                Accuracy : 0.9047         
##                  95% CI : (0.883, 0.9234)
##     No Information Rate : 0.6291         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7889         
##                                          
##  Mcnemar's Test P-Value : 6.262e-08      
##                                          
##             Sensitivity : 0.7931         
##             Specificity : 0.9704         
##          Pos Pred Value : 0.9405         
##          Neg Pred Value : 0.8883         
##               Precision : 0.9405         
##                  Recall : 0.7931         
##                      F1 : 0.8605         
##              Prevalence : 0.3709         
##          Detection Rate : 0.2942         
##    Detection Prevalence : 0.3128         
##       Balanced Accuracy : 0.8818         
##                                          
##        'Positive' Class : 1              
## 
confusionMatrix(pred_rf_5, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 527  70
##          1  14 249
##                                           
##                Accuracy : 0.9023          
##                  95% CI : (0.8805, 0.9213)
##     No Information Rate : 0.6291          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7829          
##                                           
##  Mcnemar's Test P-Value : 1.961e-09       
##                                           
##             Sensitivity : 0.7806          
##             Specificity : 0.9741          
##          Pos Pred Value : 0.9468          
##          Neg Pred Value : 0.8827          
##               Precision : 0.9468          
##                  Recall : 0.7806          
##                      F1 : 0.8557          
##              Prevalence : 0.3709          
##          Detection Rate : 0.2895          
##    Detection Prevalence : 0.3058          
##       Balanced Accuracy : 0.8773          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(pred_rf_6, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 523  70
##          1  18 249
##                                           
##                Accuracy : 0.8977          
##                  95% CI : (0.8755, 0.9171)
##     No Information Rate : 0.6291          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7732          
##                                           
##  Mcnemar's Test P-Value : 5.43e-08        
##                                           
##             Sensitivity : 0.7806          
##             Specificity : 0.9667          
##          Pos Pred Value : 0.9326          
##          Neg Pred Value : 0.8820          
##               Precision : 0.9326          
##                  Recall : 0.7806          
##                      F1 : 0.8498          
##              Prevalence : 0.3709          
##          Detection Rate : 0.2895          
##    Detection Prevalence : 0.3105          
##       Balanced Accuracy : 0.8736          
##                                           
##        'Positive' Class : 1               
## 
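Rather than reading the six F1 values off the printed output and retyping them by hand, they can be pulled straight from the object that caret's confusionMatrix() returns; a small sketch, assuming the pred_rf_* factors defined above:

# Extract F1 from each model's confusion matrix programmatically
preds <- list(rf_1 = pred_rf_1, rf_2 = pred_rf_2, rf_3 = pred_rf_3,
              rf_4 = pred_rf_4, rf_5 = pred_rf_5, rf_6 = pred_rf_6)
f1 <- sapply(preds, function(p) {
  confusionMatrix(p, actual, mode = "everything", positive = "1")$byClass["F1"]
})
round(f1, 4)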

6.6.2 Graph subset and full model accuracy

### Graph F-scores, find best-fitting model
## Model labels
model <- c('rf_1', 'rf_2', 'rf_3', 'rf_4', 'rf_5', 'rf_6')
## F1 values as printed in the confusion matrices above
f_score <- c(0.8904, 0.8503, 0.8408, 0.8605, 0.8557, 0.8498)
## Create dataframe
f_score_data <- data.frame(model, f_score)
### Figure: Plot f-scores
## Create color gradient function for graph
colfunc <- colorRampPalette(c("#2f5269", "#7da2c0"))
## Plot values
ggplot(f_score_data, aes(x = reorder(model, -f_score), 
                         y = f_score, fill = f_score)) +
  geom_bar(stat='identity') + 
  labs(x = 'Model Type',
       y = 'F-Score',
       title = 'Model Fit') +
  geom_text(aes(x = model, y = 0.8, label = f_score),
            hjust=0.5, vjust=-0.5, size = 5, colour = 'white') +
  scale_fill_gradientn(colours=colfunc(4))+
  theme_bw() +
  theme(legend.position="bottom") +
  theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
  theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
  theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())

Figure 25. F-scores of the full and subset random forest models on the Titanic training dataset. Each bar is one model; model fit (F-score) is also shown by the intensity of color, with better-fitting models in brighter blue.

Stage 7: Prediction

“The goal is to turn data into information, and information into insight.” - Carly Fiorina

We will now take our best-fitting model and apply it to the test dataset. This lets us predict survival for the unlabeled passengers so that we can submit our results to the competition. We will create a two-column solution file containing PassengerId and Survived and export it as a CSV.

# Create predictions
predict(rf_model_1, test)
##  892  893  894  895  896  897  898  899  900  901  902  903  904  905  906  907 
##    0    0    0    0    0    0    0    0    1    0    0    0    1    0    1    1 
##  908  909  910  911  912  913  914  915  916  917  918  919  920  921  922  923 
##    0    0    0    0    0    1    1    0    1    0    1    0    0    0    0    0 
##  924  925  926  927  928  929  930  931  932  933  934  935  936  937  938  939 
##    0    0    0    0    1    0    0    1    0    0    0    1    1    0    0    0 
##  940  941  942  943  944  945  946  947  948  949  950  951  952  953  954  955 
##    1    1    0    0    1    1    0    0    0    0    0    1    0    0    0    1 
##  956  957  958  959  960  961  962  963  964  965  966  967  968  969  970  971 
##    1    1    1    0    0    1    1    0    0    0    1    0    0    1    0    1 
##  972  973  974  975  976  977  978  979  980  981  982  983  984  985  986  987 
##    1    0    0    0    0    0    1    0    1    1    0    0    1    0    0    0 
##  988  989  990  991  992  993  994  995  996  997  998  999 1000 1001 1002 1003 
##    1    0    1    0    1    0    0    0    0    0    0    0    0    0    0    0 
## 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 
##    1    1    1    0    0    1    0    1    1    0    1    0    0    0    0    0 
## 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0 
## 1036 1037 1038 1039 1040 1041 1042 1043 <NA> 1045 1046 1047 1048 1049 1050 1051 
##    0    0    0    0    0    0    1    0 <NA>    1    0    0    1    1    0    1 
## 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 
##    1    1    1    0    0    1    0    0    1    0    0    0    0    0    0    1 
## 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 
##    1    0    1    1    0    0    1    0    1    0    1    0    0    0    0    0 
## 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 
##    1    0    1    0    1    0    0    0    0    1    1    1    0    0    0    0 
## 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 
##    1    0    0    0    0    1    0    0    0    0    1    0    1    0    1    0 
## 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 
##    1    0    0    0    0    0    0    1    0    0    0    0    0    0    1    1 
## 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 
##    1    1    0    0    0    0    1    0    1    0    1    0    0    0    0    0 
## 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 
##    0    0    1    0    0    0    1    1    0    0    0    0    1    0    0    0 
## 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 
##    1    0    0    1    0    0    0    0    0    1    1    0    1    0    0    0 
## 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 
##    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0 
## 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 
##    1    1    0    1    0    0    0    0    0    0    1    1    0    0    0    0 
## 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 
##    0    0    0    0    1    0    1    0    0    0    1    0    0    0    0    0 
## 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 
##    0    0    0    1    0    0    0    1    1    1    0    0    0    1    1    0 
## 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 
##    0    0    0    0    1    0    0    1    0    1    1    0    1    0    0    1 
## 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 
##    1    0    0    1    0    0    1    1    0    0    0    0    0    0    0    0 
## 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 
##    0    1    0    0    0    0    0    1    1    0    0    1    0    1    0    0 
## 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 
##    1    0    1    0    0    0    0    0    0    1    0    1    0    0    1    0 
## 1308 1309 
##    0    1 
## Levels: 0 1
# Add predictions as a column to the test dataset (modelr::add_predictions)
test1 <- add_predictions(test, rf_model_1, var = "pred", type = NULL)
# Rename predictions column as Survived, keeping with the data source
test1$Survived <- test1$pred
# Subset dataset to PassengerId and Survived
test_results <- test1[,(names(test1) %in% c("PassengerId","Survived"))]
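# The printed predictions above include one <NA> (passenger 1044, whose Fare
# is missing in the raw test data, so the model cannot score that row).
# Kaggle rejects submissions containing missing values, so as an assumed
# safeguard, fill any NA prediction with the majority class ("0", did not survive):
test_results$Survived[is.na(test_results$Survived)] <- "0"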
# Export dataset, and submit!
write.csv(test_results, file = '...rf_1_solution.csv', row.names = F)
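Before uploading, a quick sanity check on the solution file is cheap insurance; a sketch, assuming Kaggle's 418-row test set (passenger IDs 892 to 1309, as printed above):

# Sanity check: 418 rows, both required columns, no missing predictions
stopifnot(nrow(test_results) == 418,
          all(c("PassengerId", "Survived") %in% names(test_results)),
          !anyNA(test_results$Survived))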