Titanic Machine Learning Model - Hope Owens, Jan 1, 2022
library(leaflet)   # interactive maps
library(geosphere) # great-circle paths
content <- "Titanic sinking location"
# Great-circle route from Cherbourg (49.6337 N, 1.6221 W) to the sinking site;
# gcIntermediate() takes (lon, lat) pairs, so the Cherbourg longitude is negative
gcIntermediate(c(-1.6221, 49.6337), c(-49.5649, 41.4332),
n = 90,
addStartEnd = TRUE,
sp = TRUE) %>%
leaflet() %>% setView(lng = -49.5649, lat = 41.4332, zoom = 4) %>% addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
addPolylines() %>%
addMarkers(lng = -49.5649, lat = 41.4332) %>%
addPopups(-49.5649, 41.4332, content,
options = popupOptions(closeButton = FALSE))
The sinking of the Titanic is one of the most widely known disasters of modern times.
Shortly before midnight on April 14th, 1912, the Titanic was making her maiden voyage to New York City when she struck an iceberg in the North Atlantic. The "unsinkable" ship (the largest built of her time) sank within two and a half hours. Due to its isolated location, freezing water temperatures, a poorly trained crew, and insufficient lifeboats, the sinking resulted in a tremendous loss of life: 1,502 of the 2,224 passengers and crew died. The Titanic was a pivotal moment in travel safety, spurring many new regulations that are still in force today.
As seen in the 1997 Titanic feature film, there was a degree of luck involved in surviving; however, some passengers had a significantly greater advantage than others.
In this challenge, I built a predictive model to understand trends in which people were more likely to survive the Titanic, using passenger data (e.g., name, age, gender, socio-economic class, whether there was room for two people on the door, etc.).
Key objective
Identify trends in survivorship for passengers of the Titanic using machine learning techniques.
The data source used for my study is the Titanic survivorship dataset, which contains records for 1,309 of the real Titanic passengers. Each row represents one person, and the columns describe different attributes of that person, including whether they survived, their age, their passenger class, their sex, and the fare they paid. The data represents a subset of the total passengers on the Titanic.
The data source consists of two CSV files: a training set (train.csv, 891 passengers) and a test set (test.csv, 418 passengers). Each set is intended to reflect a similar distribution of passenger attributes, with the larger training set used to train machine learning models and the smaller test set used to measure model accuracy. The data is considered wide, since each row is one subject: every passenger has a unique ID, with their attributes listed in separate columns.
Due to the small size of the Titanic training set, I was able to sort and filter the tables and create Pivot Tables in Excel to verify the attributes and observations of the training dataset.
| Variable | Description |
|---|---|
| survival | Survival (0 = No; 1 = Yes) |
| class | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) |
| name | Name |
| sex | Sex |
| age | Age |
| sibsp | Number of Siblings/Spouses Aboard |
| parch | Number of Parents/Children Aboard |
| ticket | Ticket Number |
| fare | Passenger Fare |
| cabin | Cabin |
| embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
Task 1: Choose data- For this analysis objective, I am going to work with the train and test datasets to compute descriptive statistics and run machine learning models.
Task 2: Know your audience- I am performing this analysis for the Kaggle competition and the Kaggle community. Kagglers have a working understanding of data analytics and data science, so I need to tailor my results to be useful to them.
Task 3: Aims to address objective- To answer this analytics objective, I decided the analysis should focus on these variables: survived, sex, age, class, title, family size class (combined sibsp and parch), child/adult, fare, and embarked.
Task 1: Data source validation- Verifying the metadata of the dataset confirmed it as open-source. The owner has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. We can copy, modify, and distribute the work without asking permission.
With information about only 1,309 passengers of the 2,224 total passengers and crew, the dataset is subject to potential sampling bias. It is not fully representative of the entire population, so our model should be treated as an approximation of survivorship trends across the complete Titanic passenger list.
Task 2: Data exploration- I needed to prepare my workspace and load the packages I would use in my analysis. I imported the files into R and made sure they were in a usable format, then used the summary() function on each dataset to find general trends and irregularities in the data. In terms of R syntax, I had no problem with the original column names or data layout, so I kept them the same.
Given the amount of data, the need to visualize and model it, and the plan to share my results through a markdown report, I am using R to run my analysis.
#load packages used
require(car) # package to run ANOVA
require(dplyr) # package to manipulate data
require(ggplot2) # package to graph values
require(ggthemes) # package to supplement visualization
require(magrittr) # package to use pipe %>% function
require(pastecs) # package for descriptive statistics
require(PerformanceAnalytics) # package to create correlation matrix
require(plyr) # package to work with ddply
require(reshape2) # package to create correlation heat map
require(scales) # package for visualization
require(tidyverse) # package to clean data
#model packages-
require(aod) # R program, logistic regression
require(caret) # Decision tree classifier, bagging, confusion matrix
require(ipred) # package for bagging classification
require(mice) # package for replacing missing values in R
require(modelr) # package to apply prediction model
require(randomForest) # package- aggregating tree, for regression
require(rpart.plot) # package to plot regression trees
require(rpart) # package to create regression models
require(xgboost) # package linear model, tree learning algorithm
### Import data
train <- read.csv('C:/Users/hoper/OneDrive/Documents/Cycling case study/train.csv', stringsAsFactors = F)
# Read the test set with the same string handling so bind_rows() gets matching column types
test <- read.csv('C:/Users/hoper/OneDrive/Documents/Cycling case study/test.csv', header = TRUE, stringsAsFactors = F)
titanic <- bind_rows(train, test) # bind training & test data
Task 1: Clean and process data- Before I could perform analysis, I needed my values to be in a usable format. I knew I would be interested in both graphing variables as categorical factors and modeling data as numeric factors, so I created both a numeric and a categorical version of the dataset. I filled in missing age values and created subclasses for title, family size class, and child/adult to be used later in the analysis.
With a single row (passenger) serving as the unit of replication, I wanted each row to list all future classification variables in separate columns so I could group them later on. I additionally combined the test and train datasets for a larger sample size in my descriptive statistics.
### Data exploration
# Find basic summary statistics and column names
str(titanic)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(titanic)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin Embarked
## Length:1309 Min. : 0.000 Length:1309 Length:1309
## Class :character 1st Qu.: 7.896 Class :character Class :character
## Mode :character Median : 14.454 Mode :character Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
# Count missing (NA) values per column
titanic_na <- colSums(is.na(titanic))
titanic_na
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 0 0
# Missing age values: Multiple Imputation by Chained Equations (MICE),
# suited to randomly distributed missing values
mice_mod <- titanic[,(names(titanic) %in% c("Survived", "Pclass", "Sex",
"Age", "SibSp", "Parch",
"Fare", "Embarked"))]
# Run mice on the predictor subset; only the imputed Age values are kept below
mice_mod <- mice(mice_mod)
##
## iter imp variable
## 1 1 Survived Age Fare
## 1 2 Survived Age Fare
## 1 3 Survived Age Fare
## 1 4 Survived Age Fare
## 1 5 Survived Age Fare
## 2 1 Survived Age Fare
## 2 2 Survived Age Fare
## 2 3 Survived Age Fare
## 2 4 Survived Age Fare
## 2 5 Survived Age Fare
## 3 1 Survived Age Fare
## 3 2 Survived Age Fare
## 3 3 Survived Age Fare
## 3 4 Survived Age Fare
## 3 5 Survived Age Fare
## 4 1 Survived Age Fare
## 4 2 Survived Age Fare
## 4 3 Survived Age Fare
## 4 4 Survived Age Fare
## 4 5 Survived Age Fare
## 5 1 Survived Age Fare
## 5 2 Survived Age Fare
## 5 3 Survived Age Fare
## 5 4 Survived Age Fare
## 5 5 Survived Age Fare
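One reproducibility note: mice() draws random imputations, so repeated runs produce slightly different Age values. The call above could be made repeatable with its seed argument; a minimal sketch (the seed value is arbitrary):
# Reproducible alternative (not run here): fix the imputation seed
# mice_mod <- mice(mice_mod, m = 5, seed = 129)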
mice_output <- complete(mice_mod)
# Plot age distributions
par(mfrow=c(1,2))
hist(titanic$Age, freq=F, main='Age: original values',
col='#7da2c0', ylim=c(0,0.04))
hist(mice_output$Age, freq=F, main='Age: missing values filled',
col='#7da2c0', ylim=c(0,0.04))
# Replace Age variable from the mice model.
titanic$Age <- mice_output$Age
# Show new number of missing Age values
sum(is.na(titanic$Age))
## [1] 0
# Create child age classification
titanic$Child[titanic$Age < 18] <- 'Child'
titanic$Child[titanic$Age >= 18] <- 'Adult'
# Extract title from names
# Names follow the pattern "Last, Title. First", so removing everything
# through ", " and everything from "." onward leaves only the title
titanic$Title <- gsub('(.*, )|(\\..*)', '', titanic$Name)
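As a quick check of the pattern, here is the substitution applied to one name from the dataset:
# Worked example: strip everything through ", " and from "." onward
gsub('(.*, )|(\\..*)', '', "Braund, Mr. Owen Harris")  # returns "Mr"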
# Find counts of different titles
table(titanic$Title)
##
## Capt Col Don Dona Dr Jonkheer
## 1 4 1 1 8 1
## Lady Major Master Miss Mlle Mme
## 1 2 61 260 2 1
## Mr Mrs Ms Rev Sir the Countess
## 757 197 2 8 1 1
# Group titles that each appear fewer than 10 times as 'Rare'
rare <- c('Capt', 'Col', 'Don', 'Dona', 'Dr', 'Jonkheer', 'Lady',
'Major', 'Mlle', 'Mme', 'Ms', 'Rev', 'Sir',
'the Countess')
titanic$Title[titanic$Title %in% rare] <- 'Rare'
# Create variable for family size, then classify it:
# single (1), small (2), medium (3-4), large (>4)
titanic$family_size <- titanic$SibSp + titanic$Parch + 1
titanic$family_class[titanic$family_size == 1] <- 'single'
titanic$family_class[titanic$family_size == 2] <- 'small'
titanic$family_class[titanic$family_size > 2 & titanic$family_size <= 4] <- 'medium'
titanic$family_class[titanic$family_size > 4] <- 'large'
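A quick sanity check confirms that every passenger received a family class and none fell through the boundary conditions:
# Verify that no passenger was left without a family size class
table(titanic$family_class, useNA = "ifany")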
# Subset data
# Drop variables: name, ticket, cabin
titanic_subset <- titanic[,!(names(titanic) %in% c("Name","Ticket","Cabin"))]
### Create table types
# numeric: predictor variables as integers
titanic_numeric <- titanic_subset
# categorical: predictor variables as labels
titanic_categorical <- titanic_subset
# na.omit drops the 418 test rows (their Survived is NA), leaving the
# labeled training passengers for the descriptive statistics below
titanic_categorical <- na.omit(titanic_categorical)
################## numeric dataset
# Convert to integer: Sex; male = 0, female = 1
titanic_numeric$Sex <- factor(titanic_numeric$Sex,
levels=c("male","female"),
labels=c(0,1))
titanic_numeric$Sex <- as.integer(as.character(titanic_numeric$Sex))
# Convert to integer: Embarked; Q = 1, C = 2, S = 3
titanic_numeric$Embarked <- factor(titanic_numeric$Embarked,
levels=c("Q","C","S"),
labels=c(1,2,3))
titanic_numeric$Embarked <- as.integer(as.character(titanic_numeric$Embarked))
# Convert to integer: Title; Master = 1, Miss = 2, Mr = 3, Mrs = 4, Rare = 5
titanic_numeric$Title <- factor(titanic_numeric$Title,
levels=c("Master", "Miss", "Mr", "Mrs", "Rare"),
labels=c(1,2,3,4,5))
titanic_numeric$Title <- as.integer(as.character(titanic_numeric$Title))
# Convert to integer: Family class; Single = 1, Small = 2, Medium = 3,
# Large = 4
titanic_numeric$family_class <- factor(titanic_numeric$family_class,
levels=c("single", "small", "medium", "large"),
labels=c(1,2,3,4))
titanic_numeric$family_class <- as.integer(as.character(titanic_numeric$family_class))
# Convert to integer: Child; Child = 1, Adult = 2
titanic_numeric$Child <- factor(titanic_numeric$Child,
levels=c("Child", "Adult"),
labels=c(1,2))
titanic_numeric$Child <- as.integer(as.character(titanic_numeric$Child))
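The five factor-to-integer conversions above all follow one pattern, so they could be collapsed into a small helper; a sketch (to_int is a hypothetical name, not part of the analysis):
# Hypothetical helper: map a character column to integer codes
to_int <- function(x, levels, codes = seq_along(levels)) {
  codes[match(x, levels)]
}
# equivalent to the Sex conversion above:
# titanic_numeric$Sex <- to_int(titanic_subset$Sex, c("male", "female"), codes = c(0L, 1L))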
################# categorical dataset
titanic_categorical$Pclass <- mapvalues(titanic_categorical$Pclass,
from = c(1, 2, 3),
to = c("First", "Second",
"Third"))
titanic_categorical$SibSp <- mapvalues(titanic_categorical$SibSp,
from = c(0, 1, 2, 3, 4, 5, 6, 7, 8),
to = c("Zero", "One", "Two",
"Three", "Four", "Five",
"Six", "Seven", "Eight"))
titanic_categorical$Parch <- mapvalues(titanic_categorical$Parch,
from = c(0, 1, 2, 3, 4, 5, 6),
to = c("Zero", "One", "Two",
"Three", "Four", "Five",
"Six"))
titanic_categorical <- mutate(titanic_categorical,
Survived = if_else(Survived== 0, "No",
"Yes"))
titanic_categorical$family_size <- mapvalues(titanic_categorical$family_size,
from = c(0, 1, 2, 3, 4, 5, 6, 7, 8),
to = c("Zero", "One", "Two", "Three",
"Four", "Five", "Six", "Seven",
"Eight"))
# Drop the ID and collinear variables before building the correlation matrix
titanic_correlation <- titanic_numeric[,!(names(titanic_numeric) %in%
c("PassengerId", "SibSp",
"Parch", "Embarked"))]
# Remove missing values
titanic_correlation <- na.omit(titanic_correlation)
The titanic_numeric dataset stores predictor variables as numeric codes:
| Variable | Description | Class |
|---|---|---|
| survival | 0 = No; 1 = Yes | integer |
| class | 1 = 1st; 2 = 2nd; 3 = 3rd | integer |
| sex | 0 = male; 1 = female | integer |
| age | Age | numeric |
| sibsp | Number of Siblings/Spouses Aboard | integer |
| parch | Number of Parents/Children Aboard | integer |
| fare | Passenger Fare | numeric |
| embarked | Q = Queenstown = 1; Cherbourg = 2; Southampton = 3 | integer |
| title | Master = 1, Miss = 2, Mr = 3, Mrs = 4, Rare = 5 | integer |
| family_size | Passenger family size | integer |
| family_class | single = 1, small = 2, medium = 3, large = 4 | integer |
| Child | Child = 1, Adult = 2 | integer |
The titanic_categorical dataset stores predictor variables as character labels:
| Variable | Description | Class |
|---|---|---|
| survival | Survival (Yes/No) | character |
| class | Passenger Class (First, Second, Third) | character |
| sex | Male/Female | character |
| age | Age | numeric |
| sibsp | Number of Siblings/Spouses Aboard, zero - eight | character |
| parch | Number of Parents/Children Aboard, zero - six | character |
| fare | Passenger Fare | numeric |
| embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) | character |
| title | Master, Miss, Mr, Mrs, Rare | character |
| family_size | Size of family on the Titanic, zero - eight | character |
| family_class | Classification of family size- single, small, medium, large | character |
| Child | Child/Adult | character |
### Correlation overview
# Correlation chart
chart.Correlation(titanic_correlation)
Figure 1: Correlation chart of the numeric Titanic variables, showing pairwise scatter plots, variable distributions, and correlation coefficients.
# Correlation matrix
cormat <- round(cor(titanic_correlation),3)
melted_cormat <- melt(cormat)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()
Figure 2: Correlation heat map of the numeric Titanic variables, with the strength of each pairwise correlation indicated by tile color.
titanic_sex <- ddply(titanic_categorical, c("Sex", "Survived"), summarise,
N = length(!is.na(Sex)))
ggplot(titanic_sex, aes(x=Sex, y=N, fill = Survived), color="Survived") +
geom_col() +
labs(title = "Survivorship of the Titanic by gender",
x="Gender",
y="Total survivorship",
fill="Survived") +
scale_fill_manual(values = c("Yes" = "#457696","No" = "#7da2c0"))+
theme_minimal() +
theme(legend.position="bottom") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
Figure 3: Survivorship of the Titanic sinking by gender (data subset), with passenger survivorship and death indicated by the dark and light bars respectively.
ggplot(data=titanic_categorical)+
geom_bar(mapping=aes(x = factor(Sex), fill = factor(Sex))) +
labs( x = "Gender",
y = "Number of passengers",
title ="Titanic passenger gender distribution",
fill = "Gender") +
scale_fill_manual(values = c("male" = "#457696","female" = "#7da2c0")) +
theme_minimal() +
theme(legend.position="bottom") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
Figure 4: Distribution of Titanic passengers by sex (passenger subset), with women indicated by the light bar and men indicated by the dark bar.
######### ANOVA: survivorship/ sex #########
anova_sex <- aov(Survived~as.factor(Sex), data = titanic_numeric)
Anova(anova_sex, type="III")
titanic_class <- ddply(titanic_categorical, c("Pclass", "Survived"), summarise,
N = length(!is.na(Pclass)))
g3 <- titanic_class %>%
mutate(Pclass = fct_relevel(Pclass, "First","Second","Third")) %>%
ggplot(aes(x=Pclass, y=N, fill = Survived), color="Survived") +
geom_col() +
labs(title = "Survivorship of the Titanic by class",
x="Passenger class",
y="Total survivorship",
fill="Survived") +
scale_fill_manual(values = c("Yes" = "#457696","No" = "#7da2c0"))+
theme_minimal() +
theme(legend.position="bottom") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
g3 + coord_flip()
Figure 5: Survivorship of the Titanic sinking by passenger class (training subset, n = 891), with passenger survivorship and death indicated by the dark and light bars respectively.
g4 <- ggplot(data=titanic_categorical)+
geom_bar(mapping=aes(x = factor(Pclass), fill = factor(Pclass))) +
labs( x = "Passenger Class", y = "Number of Passengers",
title ="Titanic passenger class distribution",
fill = "Pclass") +
scale_fill_manual(values = c("First" = "#36556f",
"Second" = "#476f91",
"Third" = "#5c89af")) +
theme_minimal() +
theme(legend.position="bottom") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
g4 + coord_flip()
Figure 6: Distribution of Titanic passengers by class (training subset, n = 891), with the lightest bars for third class, mid-tone bars for second class, and the darkest bars for first class.
######### ANOVA: survivorship/ class #########
anova_class <- aov(Survived~as.factor(Pclass), data = titanic_numeric)
Anova(anova_class, type="III")
titanic_family <- titanic_categorical %>%
group_by(family_class) %>%
tally()
# Convert counts to proportions of the passengers in the subset
titanic_family$percent <- titanic_family$n / sum(titanic_family$n)
titanic_family %>%
ggplot(aes(x = "", y = n, fill = family_class)) +
geom_bar(stat = "identity", width = 1, col = "black") +
coord_polar("y", start=0)+
theme_minimal()+
theme(axis.title.x= element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
scale_fill_manual(values = c("#92bde0", "#7da2c0", "#457696", "#5a84a3")) +
geom_label(aes(label = family_class),
color = "Black",
position = position_stack(vjust = 0.5),
show.legend = FALSE) +
labs(title="Proportional family size of Titanic passengers") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(),) +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = 'white', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = 'white', size = 12, hjust = 0.5))
Figure 7: Proportional distribution of family size on the Titanic (training subset, n = 891). The darkest color indicates single passengers, followed by small families (2 people), medium families (3-4 people), and large families (>4 people).
titanic_family2 <- ddply(titanic_categorical, c("family_class", "Survived"), summarise,
N = length(!is.na(family_class)))
g9 <- titanic_family2 %>%
mutate(family_class = fct_relevel(family_class, "single","small","medium",
"large")) %>%
ggplot(aes(x=family_class, y=N, fill = Survived), color="Survived") +
geom_col() +
labs(title = "Survivorship of the Titanic by family size",
x="Family size class aboard the Titanic",
y="Total survivorship",
fill="Survived") +
scale_fill_manual(values = c("Yes" = "#457696","No" = "#7da2c0"))+
theme_minimal() +
theme(legend.position="bottom") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
g9 + coord_flip()
Figure 8: Survivorship of the Titanic by family size class (training subset, n = 891), with passenger survivorship and death indicated by the dark and light bars respectively.
######### ANOVA: survivorship/ family size #########
anova_f_size <- aov(Survived~as.factor(family_size), data = titanic_numeric)
Anova(anova_f_size, type="III")
# Create age dataset
titanic_age<-titanic_categorical
# Group and tally passengers by age
titanic_age.2<-data.frame(titanic_age) %>%
group_by(Age, Survived) %>%
tally()
# Subset data- survived vs. died
titanic_age_survived <- data.frame(titanic_age.2) %>%
group_by(Age, Survived) %>%
filter(Survived == "Yes")
titanic_age_died <- data.frame(titanic_age.2) %>%
group_by(Age, Survived) %>%
filter(Survived == "No")
### Figure: Survivorship: Age
ggplot() +
geom_histogram(data = titanic_age_died, aes(x = Age, y = n),
stat = "identity", fill = "#bc5090",
alpha = 0.7,
width = 0.6) +
geom_histogram(data = titanic_age_survived,
aes(x = Age, y = n),
stat = "identity",
fill = "#003f5c",
alpha = 0.7,
width = 0.6) +
geom_smooth(data = titanic_age.2, aes(x = Age, y = n, color=Survived), se=FALSE, size = 1.2) +
labs(title = "Survivorship of the Titanic by Age",x="Age",y="Count", fill="Survived") +
scale_fill_manual(values = c("Yes" = "#003f5c","No" = "#bc5090")) +
scale_color_manual(values=c("Yes" = "#003f5c","No" = "#bc5090"))+
theme_minimal() +
theme(legend.position="bottom") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
xlim(0, 82) +
ylim(0, 20) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
Figure 9: Survivorship of the Titanic by age, as a histogram (training subset, n = 891). Red bars indicate death, blue bars indicate survival, and the lines show smoothed trends for each group.
### Figure: Survivorship: Age/Gender
ggplot(titanic_age, aes(Age, fill = factor(Survived))) +
geom_histogram() +
scale_fill_manual(values = c("Yes" = "#003f5c","No" = "#bc5090")) +
theme_minimal() +
theme(legend.position="bottom") +
labs( x = "Age",
y = "Frequency",
title ="Survivorship of the Titanic by Sex and Age",
fill = "Survived") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())+
facet_grid(~Sex) +
theme_few()
Figure 10: Survivorship of the Titanic by age and sex (training subset, n = 891), with survivorship indicated by the dark blue bars and death by the red bars.
######### ANOVA: survivorship/ age #########
anova_age <- aov(Survived~as.factor(Age), data = titanic_numeric)
Anova(anova_age, type="III")
# Drop the two passengers (IDs 62 and 830) whose Embarked value is missing
titanic_graph10 <- titanic_categorical %>%
filter(PassengerId != 62 & PassengerId != 830)
ggplot(titanic_graph10, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
geom_col() +
labs( x = "Port", y = "Fare (£)",
title ="Titanic fare price by class distribution and embarking port",
fill = "Pclass") +
scale_fill_manual(values = c("First" = "#36556f",
"Second" = "#476f91",
"Third" = "#5c89af")) +
theme_minimal() +
theme(legend.position="bottom") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
Figure 11: Total fare paid (£) by passenger class and embarking port (training subset, n = 891). Class is indicated by the bar colors, with third and first class having the lightest and darkest bars respectively.
ggplot(titanic_categorical,
aes(x = Fare)) +
geom_density(fill = '#476f91', alpha=0.6) +
labs( x = "Fare (£)", y = "Proportion",
title ="Titanic fare price by class distribution and embarking port",
fill = "Pclass") +
scale_x_continuous(labels=dollar_format()) +
theme_few()+
xlim(0,200)+
facet_grid(~Pclass)+
theme_bw() +
theme(legend.position="bottom") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank())
Figure 12: Distribution of fare prices (£) by passenger class on the Titanic (training subset, n = 891).
######### ANOVA: Passenger class/ fare + embarking port #########
anova_fare <- lm(Pclass~factor(Fare), data = titanic_numeric)
Anova(anova_fare, type="III")
anova_embarked <- lm(Pclass~factor(Embarked), data = titanic_numeric)
Anova(anova_embarked, type="III")
ggplot(titanic_graph10, aes(x=Title, fill = factor(Pclass))) +
geom_bar() +
theme_bw() +
theme(legend.position="bottom") +
labs( x = "Title",
y = "Frequency",
title ="Embarked port by title and class",
fill = "Class") +
scale_fill_manual(values = c("First" = "#36556f",
"Second" = "#476f91",
"Third" = "#5c89af")) +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank())+
facet_grid(~Embarked)
Figure 13: Frequency of passenger title and class by embarking port on the Titanic (training subset, n = 891). Class is indicated by the bar colors, with third and first class having the lightest and darkest bars respectively.
ggplot(titanic_categorical, aes(Title, fill = factor(Survived))) +
geom_bar() +
scale_fill_manual(values = c("Yes" = "#457696","No" = "#7da2c0")) +
theme(legend.position="bottom") +
theme_bw() +
labs( x = "Age",
y = "Frequency",
title ="Survivorship of the Titanic by title and class",
fill = "Survived") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank())+
facet_grid(~Pclass)
Figure 14: Frequency of survivorship by title and class aboard the Titanic (training subset, n = 891), with passenger survivorship and death indicated by the dark and light bars respectively.
anova_title <- lm(Survived~factor(Title), data = titanic_numeric)
Anova(anova_title, type="III")
# Convert predictor variables to factors
factor_vars <- c('PassengerId','Pclass','Sex','Embarked',
'Title','family_size','Child')
titanic_numeric[factor_vars] <- lapply(titanic_numeric[factor_vars], function(x) as.factor(x))
# Drop the two passengers (IDs 62 and 830) whose Embarked value is missing
embark_fare <- titanic_numeric %>%
filter(PassengerId != 62 & PassengerId != 830)
First we are going to create a few fast regression trees to see what rough "branches" of classification exist within our variables. Splits in a regression tree indicate the variables with the greatest classification effect within the dataset, so we can read the branches to understand which variables are likely to affect Titanic survivorship. In a regression tree, the model searches over values of the inputs to find the most likely predictor variables for a specific response (survival), then splits the dataset so that the overall sum of squared errors is minimized: \({\texttt{minimize} \bigg\{ SSE = \sum_{i \in R_1}(y_i - c_1)^2 + \sum_{i \in R_2}(y_i - c_2)^2 \bigg\}\tag{1}}\) Regression trees are built in a top-down, greedy fashion, meaning that earlier splits are not affected by later classification. Their advantages are that they are fast, they handle missing values automatically, and they can model complex non-linear "jagged" responses, so they work well with a multi-factor system like Titanic survivorship.
However, due to the high variance that can exist with this method of classification, single regression trees can have poor predictive accuracy. Regression trees work best for exploratory analysis and in combination with other modelling techniques.
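To make equation (1) concrete, the toy function below scans candidate split points on a single numeric predictor and returns the cut that minimizes the two-region SSE. This only illustrates the split criterion; rpart performs this search (plus pruning and much more) internally, and best_split is a hypothetical name, not part of the analysis.
# Toy sketch of the SSE-minimizing split from equation (1)
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]  # drop the max so both regions stay non-empty
  sse <- sapply(cuts, function(s) {
    left  <- y[x <= s]
    right <- y[x >  s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]  # the split point with the lowest total SSE
}
# e.g. best_split(train$Fare, train$Survived) returns the single fare cut
# that best separates survivors from non-survivors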
# Set a random seed
set.seed(129)
# Create regression tree models with different data splitting parameters
fit <- rpart(Survived ~ Pclass + Sex + SibSp + Parch +
Fare + Embarked + Title + Child + family_size, data = titanic_numeric)
fit2 <- rpart(Survived ~ Pclass + Sex + SibSp + Parch +
Fare + Embarked + Title + Child + family_size, data = titanic_numeric,
parms = list(prior = c(.65,.35), split = "information"))
fit3 <- rpart(Survived ~ Pclass + Sex + SibSp + Parch +
Fare + Embarked + Title + Child + family_size, data = titanic_numeric,
control = rpart.control(cp = 0.05))
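Beyond the cp value passed to fit3, rpart also reports cross-validated error for every subtree, which gives a principled pruning point. A short sketch using standard rpart helpers (this pruned tree is illustrative and is not used in the models below):
# Inspect the complexity-parameter table, then prune at the cp value with
# the lowest cross-validated error (xerror)
printcp(fit)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit, cp = best_cp)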
par(mfrow = c(1,2), xpd = NA) # otherwise on some devices the text is clipped
# Plot trees
rpart.plot(fit,
box.palette = "GnBu", # color scheme
branch.lty = 3, # dotted branch lines
shadow.col = "gray", # shadows under the node boxes
nn = TRUE) # display the node numbers
Figure 15. Regression tree classification of variables in the Titanic survivorship dataset, using rpart's default splitting parameters.
rpart.plot(fit2,
box.palette = "GnBu", # color scheme
branch.lty = 3, # dotted branch lines
shadow.col = "gray", # shadows under the node boxes
nn = TRUE) # display the node numbers
Figure 16. Regression tree classification of variables in the Titanic survivorship dataset, with class priors of 0.65/0.35 and information-gain splitting.
rpart.plot(fit3,
box.palette = "GnBu", # color scheme
branch.lty = 3, # dotted branch lines
shadow.col = "gray", # shadows under the node boxes
nn = TRUE) # display the node numbers
Figure 17. Regression tree classification of variables in the Titanic survivorship dataset, with the complexity parameter raised to cp = 0.05.
I am going to use four different model types to explore prediction in our training data. The training portion of the dataset includes the survivorship class, so we can measure each model's accuracy in predicting passenger life or death.
The models we are going to use are:
the Random Forest model: An ensemble learning method; random forest models create multiple regression trees from bootstrapped subsets of the data and randomly select a subset of variables for each tree's nodes. The forest then makes a "majority wins" prediction across its trees, which explains the most variation within the dataset (see the toy sketch after this list). Through bootstrapping and a large number of trees, the error of individual trees is reduced.
the Logistic Regression model: Logistic regression is similar to linear regression, which models a fit line through a spread of data; however, logistic regression models the probability of a binomial outcome. This suits the binomial output of survivorship, with 0 = no, 1 = yes.
the Decision Tree Classifier model: Similar to regression trees, predictor variables become nodes that split the dataset to minimize the sum of squared errors (SSE). The more nodes the model has, usually the more accurate the decision tree will be.
the Bagging Classifier model: Similar to the random forest; however, in random forest models only a random subset of features is considered when splitting each node, while in bagging models all features are considered for splitting a node.
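As a toy illustration of the "majority wins" step shared by the random forest and bagging models above, the sketch below aggregates votes from five hypothetical trees; the vote matrix is invented purely for illustration.
# Toy majority vote: rows are passengers, columns are 0/1 votes from 5 trees
votes <- matrix(c(1, 1, 0, 1, 0,
                  0, 0, 0, 1, 0), nrow = 2, byrow = TRUE)
# A passenger is predicted to survive when more than half the trees vote 1
as.integer(rowMeans(votes) > 0.5)  # returns 1 0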
# Split the data back into a train set and a test set
train <- titanic_numeric[1:891,]
test <- titanic_numeric[892:1309,]
# Drop any remaining incomplete training rows
train <- na.omit(train)
# Set a random seed
set.seed(754)
# (1a) Create random forest model
rf_model <- randomForest(factor(Survived) ~ Pclass + Age + Sex + SibSp + Parch +
Fare + Title + Child + family_size,
data = train)
## Use the model to predict survivorship in the train dataset, where the
## true survivorship values are known
predict_rf <- predict(rf_model, train)
## predict() returns a factor, so coercing gives the level codes 1/2 rather
## than the original 0/1 survivorship values; these are recoded back below
predict_rf <- as.integer(predict_rf)
# Add predicted survivorship vector values to train dataset
train$model_rf <- predict_rf
## Convert values to binomial 0/1 from 1/2
train$model_rf[train$model_rf == 1] <- 0
train$model_rf[train$model_rf == 2] <- 1
###### (2a) LogisticRegression
mylogit <- glm(Survived ~ Pclass + Age + Sex + SibSp + Parch +
Fare + Title + Child + family_size, data = train, family = "binomial")
# Note: without type = "response", predict() returns log-odds, so this 0.5
# cutoff sits on the link scale (log-odds > 0.5 is a probability of ~0.62);
# predict(mylogit, train, type = "response") would threshold probabilities at 0.5
glm <- predict(mylogit, train)
predict_glm <- as.numeric(glm > 0.5)
train$model_glm <- predict_glm
###### (3a) DecisionTreeClassifier
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(Survived ~ Pclass +Age + Sex + SibSp + Parch +
Fare + Title + Child + family_size, data = train, method = "rpart",
parms = list(split = "information"),
trControl=trctrl,
tuneLength = 10)
dtree_fit <- predict(dtree_fit, train)
predict_dtree_fit<-as.numeric((dtree_fit)>0.5)
train$model_dtree_fit <- predict_dtree_fit
###### (4a) BaggingClassifier
fit <- bagging(Survived ~ Pclass +Age + Sex + SibSp + Parch +
Fare + Title + Child + family_size, data = train, coob = T, nbagg = 100)
bagging <- predict(fit, train)
predict_bagging <- as.numeric((bagging)>0.5)
train$model_bagging <- predict_bagging
Our training dataset now has both the model predictions and the known survivorship values as separate columns. Using these, we can statistically test the accuracy of each model's predictions. Our next goal is to find the most accurate model and use it to predict survivorship in the test dataset, where the values are unknown.
Model accuracy can be measured through a confusion matrix using these four metrics:
true positive values (accurate survival)
false positives (inaccurate survival)
true negatives (accurate death)
false negatives (inaccurate death)
We are also going to find the F-score of our models, which provides a numeric ranking of model fit: \({\texttt{f-score} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}\tag{2}}\)
Models with higher f-scores, which range from 0 to 1, predict survivorship on the Titanic more accurately.
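Equation (2) written as a small R helper computed from the confusion-matrix counts (a sketch; f1_score is a hypothetical name, and caret's confusionMatrix() below reports the same F1 directly):
# F-score from true positive, false positive, and false negative counts
f1_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * (precision * recall) / (precision + recall)
}
# e.g. f1_score(tp = 216, fp = 40, fn = 103) gives ~0.7513, matching the
# glm confusion matrix below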
# confusion_matrix
#define vectors of actual values and predicted values
actual <- factor(train$Survived)
pred_glm <- factor(train$model_glm)
pred_dtree_fit <- factor(train$model_dtree_fit)
pred_bagging <- factor(train$model_bagging)
pred_rf <- factor(train$model_rf)
# Create confusion matrix and calculate metrics related to confusion matrix
# Look up F-Score
confusionMatrix(pred_glm, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 501 103
## 1 40 216
##
## Accuracy : 0.8337
## 95% CI : (0.8071, 0.858)
## No Information Rate : 0.6291
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6287
##
## Mcnemar's Test P-Value : 2.164e-07
##
## Sensitivity : 0.6771
## Specificity : 0.9261
## Pos Pred Value : 0.8437
## Neg Pred Value : 0.8295
## Precision : 0.8438
## Recall : 0.6771
## F1 : 0.7513
## Prevalence : 0.3709
## Detection Rate : 0.2512
## Detection Prevalence : 0.2977
## Balanced Accuracy : 0.8016
##
## 'Positive' Class : 1
##
confusionMatrix(pred_dtree_fit, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 493 88
## 1 48 231
##
## Accuracy : 0.8419
## 95% CI : (0.8157, 0.8656)
## No Information Rate : 0.6291
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6522
##
## Mcnemar's Test P-Value : 0.0008251
##
## Sensitivity : 0.7241
## Specificity : 0.9113
## Pos Pred Value : 0.8280
## Neg Pred Value : 0.8485
## Precision : 0.8280
## Recall : 0.7241
## F1 : 0.7726
## Prevalence : 0.3709
## Detection Rate : 0.2686
## Detection Prevalence : 0.3244
## Balanced Accuracy : 0.8177
##
## 'Positive' Class : 1
##
confusionMatrix(pred_bagging, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 513 83
## 1 28 236
##
## Accuracy : 0.8709
## 95% CI : (0.8467, 0.8926)
## No Information Rate : 0.6291
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7133
##
## Mcnemar's Test P-Value : 2.968e-07
##
## Sensitivity : 0.7398
## Specificity : 0.9482
## Pos Pred Value : 0.8939
## Neg Pred Value : 0.8607
## Precision : 0.8939
## Recall : 0.7398
## F1 : 0.8096
## Prevalence : 0.3709
## Detection Rate : 0.2744
## Detection Prevalence : 0.3070
## Balanced Accuracy : 0.8440
##
## 'Positive' Class : 1
##
confusionMatrix(pred_rf, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 531 50
## 1 10 269
##
## Accuracy : 0.9302
## 95% CI : (0.9111, 0.9463)
## No Information Rate : 0.6291
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8466
##
## Mcnemar's Test P-Value : 4.782e-07
##
## Sensitivity : 0.8433
## Specificity : 0.9815
## Pos Pred Value : 0.9642
## Neg Pred Value : 0.9139
## Precision : 0.9642
## Recall : 0.8433
## F1 : 0.8997
## Prevalence : 0.3709
## Detection Rate : 0.3128
## Detection Prevalence : 0.3244
## Balanced Accuracy : 0.9124
##
## 'Positive' Class : 1
##
# Create confusion matrices for graphing (data = prediction, reference = actual,
# matching the argument order used above)
cm_glm <- confusionMatrix(factor(train$model_glm), factor(train$Survived), dnn = c("Prediction", "Reference"))
cm_dtree_fit <- confusionMatrix(factor(train$model_dtree_fit), factor(train$Survived), dnn = c("Prediction", "Reference"))
cm_bagging <- confusionMatrix(factor(train$model_bagging), factor(train$Survived), dnn = c("Prediction", "Reference"))
cm_rf <- confusionMatrix(factor(train$model_rf), factor(train$Survived), dnn = c("Prediction", "Reference"))
# Set table as a dataframe for graph
plt_glm <- as.data.frame(cm_glm$table)
plt_dtree_fit <- as.data.frame(cm_dtree_fit$table)
plt_bagging <- as.data.frame(cm_bagging$table)
plt_rf <- as.data.frame(cm_rf$table)
## Set factors for prediction
plt_glm$Prediction <- factor(plt_glm$Prediction, levels=rev(levels(plt_glm$Prediction)))
plt_dtree_fit$Prediction <- factor(plt_dtree_fit$Prediction, levels=rev(levels(plt_dtree_fit$Prediction)))
plt_bagging$Prediction <- factor(plt_bagging$Prediction, levels=rev(levels(plt_bagging$Prediction)))
plt_rf$Prediction <- factor(plt_rf$Prediction, levels=rev(levels(plt_rf$Prediction)))
### Plot correlation matrix
# GLM model
ggplot(plt_glm, aes(Prediction, Reference, fill = Freq)) +
geom_tile() + geom_text(aes(label = Freq)) +
scale_fill_gradient(low = "white", high = "#009194") +
labs(x = "Prediction", y = "Reference") +
scale_x_discrete(labels = c("0" = "Died", "1" = "Survived")) +
scale_y_discrete(labels = c("0" = "Died", "1" = "Survived"))
Figure 18. Confusion matrix for the logistic regression model. Cell frequency is indicated by the intensity of color, with greater frequency shown by the darker color.
# DTree Fit
ggplot(plt_dtree_fit, aes(Prediction, Reference, fill = Freq)) +
geom_tile() + geom_text(aes(label = Freq)) +
scale_fill_gradient(low = "white", high = "#009194") +
labs(x = "Prediction", y = "Reference") +
scale_x_discrete(labels = c("0" = "Died", "1" = "Survived")) +
scale_y_discrete(labels = c("0" = "Died", "1" = "Survived"))
Figure 19. Confusion matrix for the decision tree classifier model. Cell frequency is indicated by the intensity of color, with greater frequency shown by the darker color.
# Bagging
ggplot(plt_bagging, aes(Prediction, Reference, fill = Freq)) +
geom_tile() + geom_text(aes(label = Freq)) +
scale_fill_gradient(low = "white", high = "#009194") +
labs(x = "Prediction", y = "Reference") +
scale_x_discrete(labels = c("0" = "Died", "1" = "Survived")) +
scale_y_discrete(labels = c("0" = "Died", "1" = "Survived"))
Figure 20. Confusion matrix for the bagging classifier model. Cell frequency is indicated by the intensity of color, with greater frequency shown by the darker color.
# Random Forest
ggplot(plt_rf, aes(Prediction, Reference, fill = Freq)) +
geom_tile() + geom_text(aes(label = Freq)) +
scale_fill_gradient(low = "white", high = "#009194") +
labs(x = "Prediction", y = "Reference") +
scale_x_discrete(labels = c("0" = "Died", "1" = "Survived")) +
scale_y_discrete(labels = c("0" = "Died", "1" = "Survived"))
Figure 21. Confusion matrix for the random forest model. Cell frequency is indicated by the intensity of color, with greater frequency shown by the darker color.
### Graph F-scores, find best fitting model
# Set model type as predictor variable
model <- c('glm', 'dtree', 'bagging', 'rf')
# F1 values taken from the confusion matrices above
f_score <- c(0.7513, 0.7726, 0.8096, 0.8997)
# Create dataframe
f_score_data <- data.frame(model, f_score)
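Rather than typing the F1 values by hand, they could also be pulled straight from the caret objects computed above; a sketch:
# Extract F1 directly from each confusionMatrix object
f_score <- sapply(list(pred_glm, pred_dtree_fit, pred_bagging, pred_rf),
                  function(p) confusionMatrix(p, actual, mode = "everything",
                                              positive = "1")$byClass["F1"])
f_score_data <- data.frame(model, f_score)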
### Figure: Plot f-scores
# Create color gradient function for graph
colfunc <- colorRampPalette(c("#2f5269", "#7da2c0"))
# Plot values
ggplot(f_score_data, aes(x = reorder(model, f_score),
y = f_score, fill = f_score)) +
geom_bar(stat='identity') +
labs(x = 'Model Type',
y = 'F-Score',
title = 'Model Fit') +
geom_text(aes(x = model, y = 0.7, label = f_score),
hjust=0.5, vjust=-0.5, size = 5, colour = 'white') +
scale_fill_gradientn(colours=colfunc(4))+
theme_bw() +
theme(legend.position="bottom") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank())
Figure 22. F-Score of machine learning models used on the Titanic training dataset. Bars indicate machine learning models used. Accuracy of the model is shown by the intensity of color, with more accurate models having a brighter color.
The random forest model had the highest f-score, making it the most accurate model on the training data. We will now investigate this model further to understand its error and the importance of different variables in making predictions.
# Show model error
plot(rf_model, ylim=c(0,0.36))
legend('topright', colnames(rf_model$err.rate), col=1:3, fill=1:3)
Figure 23. Error rates of rf_model across trees on the Titanic training dataset (overall out-of-bag error plus the per-class error rates named in the legend).
# Get importance
importance <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance),
Importance = round(importance[ ,'MeanDecreaseGini'],2))
# Create a rank variable based on importance
rankImportance <- varImportance %>%
mutate(Rank = paste0('#',dense_rank(desc(Importance))))
# Create color gradient function for graph
colfunc <- colorRampPalette(c("#2f5269", "#7da2c0"))
### Figure: relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, -Importance),
y = Importance, fill = Importance)) +
geom_bar(stat='identity') +
labs(x = 'Predictor Variables',
y = 'Importance (mean decrease in Gini)',
title = 'Variable impact on Titanic survivorship') +
geom_text(aes(x = Variables, y = 0.9, label = Rank),
hjust=0.5, vjust=-0.1, size = 5, colour = 'white') +
scale_fill_gradientn(colours=colfunc(4))+
theme_bw() +
theme(legend.position="bottom") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank())
Figure 24. Variable importance for predicting survivorship on the Titanic, measured as mean decrease in Gini. Importance is indicated by the intensity of color, with greater impact shown by the brighter blue.
The most influential variables in predicting survival are title (e.g., the Countess vs. Mr), fare paid, sex, and age. Interestingly, when I ran descriptive statistics, passenger survival was not as cut and dried as "first-class passengers plus women and children only." The results make a lot of sense, and are kind of sad (!), with title and fare paid being the two factors most likely to determine survival.
We are now going to experiment with the variables used in our random forest model to see if we can get a better fit. I am re-fitting the model with different permutations of the significant variables, and will then compare the models by f-score.
### Narrow down rf_model for more accurate predictions
#set.seed(754)
# original model
rf_model_1 <- randomForest(factor(Survived) ~ Pclass + Age + Sex + SibSp + Parch +
Fare + Title + Child + family_size,
data = train)
predict_rf_1<-predict(rf_model_1, train)
predict_rf_1 <- as.integer(predict_rf_1)
train$model_rf_1 <- predict_rf_1
train$model_rf_1[train$model_rf_1 == 1] <- 0
train$model_rf_1[train$model_rf_1 == 2] <- 1
# subset 1
set.seed(754)
rf_model_2 <- randomForest(factor(Survived) ~ Pclass + Age + Sex + SibSp + Parch +
Fare + Title + family_size,
data = train)
predict_rf_2 <- predict(rf_model_2, train)
predict_rf_2 <- as.integer(predict_rf_2)
train$model_rf_2 <- predict_rf_2
train$model_rf_2[train$model_rf_2 == 1] <- 0
train$model_rf_2[train$model_rf_2 == 2] <- 1
# subset 2
set.seed(754)
rf_model_3 <- randomForest(factor(Survived) ~ Pclass + Age + Sex + SibSp +
Fare + Title + Child + family_size,
data = train)
predict_rf_3<-predict(rf_model_3, train)
predict_rf_3 <- as.integer(predict_rf_3)
train$model_rf_3 <- predict_rf_3
train$model_rf_3[train$model_rf_3 == 1] <- 0
train$model_rf_3[train$model_rf_3 == 2] <- 1
# subset 3
set.seed(754)
rf_model_4 <- randomForest(factor(Survived) ~ Pclass + Age + Sex + Parch +
Fare + Title + family_size,
data = train)
predict_rf_4<-predict(rf_model_4, train)
predict_rf_4 <- as.integer(predict_rf_4)
train$model_rf_4 <- predict_rf_4
train$model_rf_4[train$model_rf_4 == 1] <- 0
train$model_rf_4[train$model_rf_4 == 2] <- 1
# subset 4
set.seed(754)
rf_model_5 <- randomForest(factor(Survived) ~ Pclass + Age + Sex +
Fare + Title + family_size,
data = train)
predict_rf_5 <-predict(rf_model_5, train)
predict_rf_5 <- as.integer(predict_rf_5)
train$model_rf_5 <- predict_rf_5
train$model_rf_5[train$model_rf_5 == 1] <- 0
train$model_rf_5[train$model_rf_5 == 2] <- 1
# subset 5
set.seed(754)
rf_model_6 <- randomForest(factor(Survived) ~ Pclass + Age + Sex +
Fare + Title,
data = train)
predict_rf_6 <- predict(rf_model_6, train)
predict_rf_6 <- as.integer(predict_rf_6)
train$model_rf_6 <- predict_rf_6
train$model_rf_6[train$model_rf_6 == 1] <- 0
train$model_rf_6[train$model_rf_6 == 2] <- 1
#define vectors of actual values and predicted values
actual <- factor(train$Survived)
pred_rf_1 <- factor(train$model_rf_1)
pred_rf_2 <- factor(train$model_rf_2)
pred_rf_3 <- factor(train$model_rf_3)
pred_rf_4 <- factor(train$model_rf_4)
pred_rf_5 <- factor(train$model_rf_5)
pred_rf_6 <- factor(train$model_rf_6)
# Create confusion matrix and calculate metrics related to confusion matrix
# Look up F-Score
confusionMatrix(pred_rf_1, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 531 55
## 1 10 264
##
## Accuracy : 0.9244
## 95% CI : (0.9047, 0.9412)
## No Information Rate : 0.6291
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8332
##
## Mcnemar's Test P-Value : 4.828e-08
##
## Sensitivity : 0.8276
## Specificity : 0.9815
## Pos Pred Value : 0.9635
## Neg Pred Value : 0.9061
## Precision : 0.9635
## Recall : 0.8276
## F1 : 0.8904
## Prevalence : 0.3709
## Detection Rate : 0.3070
## Detection Prevalence : 0.3186
## Balanced Accuracy : 0.9046
##
## 'Positive' Class : 1
##
confusionMatrix(pred_rf_2, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 522 69
## 1 19 250
##
## Accuracy : 0.8977
## 95% CI : (0.8755, 0.9171)
## No Information Rate : 0.6291
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7735
##
## Mcnemar's Test P-Value : 1.757e-07
##
## Sensitivity : 0.7837
## Specificity : 0.9649
## Pos Pred Value : 0.9294
## Neg Pred Value : 0.8832
## Precision : 0.9294
## Recall : 0.7837
## F1 : 0.8503
## Prevalence : 0.3709
## Detection Rate : 0.2907
## Detection Prevalence : 0.3128
## Balanced Accuracy : 0.8743
##
## 'Positive' Class : 1
##
confusionMatrix(pred_rf_3, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 525 76
## 1 16 243
##
## Accuracy : 0.893
## 95% CI : (0.8704, 0.9129)
## No Information Rate : 0.6291
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7616
##
## Mcnemar's Test P-Value : 7.691e-10
##
## Sensitivity : 0.7618
## Specificity : 0.9704
## Pos Pred Value : 0.9382
## Neg Pred Value : 0.8735
## Precision : 0.9382
## Recall : 0.7618
## F1 : 0.8408
## Prevalence : 0.3709
## Detection Rate : 0.2826
## Detection Prevalence : 0.3012
## Balanced Accuracy : 0.8661
##
## 'Positive' Class : 1
##
confusionMatrix(pred_rf_4, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 525 66
## 1 16 253
##
## Accuracy : 0.9047
## 95% CI : (0.883, 0.9234)
## No Information Rate : 0.6291
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7889
##
## Mcnemar's Test P-Value : 6.262e-08
##
## Sensitivity : 0.7931
## Specificity : 0.9704
## Pos Pred Value : 0.9405
## Neg Pred Value : 0.8883
## Precision : 0.9405
## Recall : 0.7931
## F1 : 0.8605
## Prevalence : 0.3709
## Detection Rate : 0.2942
## Detection Prevalence : 0.3128
## Balanced Accuracy : 0.8818
##
## 'Positive' Class : 1
##
confusionMatrix(pred_rf_5, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 527 70
## 1 14 249
##
## Accuracy : 0.9023
## 95% CI : (0.8805, 0.9213)
## No Information Rate : 0.6291
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7829
##
## Mcnemar's Test P-Value : 1.961e-09
##
## Sensitivity : 0.7806
## Specificity : 0.9741
## Pos Pred Value : 0.9468
## Neg Pred Value : 0.8827
## Precision : 0.9468
## Recall : 0.7806
## F1 : 0.8557
## Prevalence : 0.3709
## Detection Rate : 0.2895
## Detection Prevalence : 0.3058
## Balanced Accuracy : 0.8773
##
## 'Positive' Class : 1
##
confusionMatrix(pred_rf_6, actual, mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 523 70
## 1 18 249
##
## Accuracy : 0.8977
## 95% CI : (0.8755, 0.9171)
## No Information Rate : 0.6291
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7732
##
## Mcnemar's Test P-Value : 5.43e-08
##
## Sensitivity : 0.7806
## Specificity : 0.9667
## Pos Pred Value : 0.9326
## Neg Pred Value : 0.8820
## Precision : 0.9326
## Recall : 0.7806
## F1 : 0.8498
## Prevalence : 0.3709
## Detection Rate : 0.2895
## Detection Prevalence : 0.3105
## Balanced Accuracy : 0.8736
##
## 'Positive' Class : 1
##
### Graph F-scores, find best fitting model
## Set model type- predictor variables
model <- c('rf_1', 'rf_2', 'rf_3', 'rf_4', 'rf_5', 'rf_6')
## F1 values taken from the confusion matrices above
f_score <- c(0.8904, 0.8503, 0.8408, 0.8605, 0.8557, 0.8498)
## Create dataframe
f_score_data <- data.frame(model, f_score)
### Figure: Plot f-scores
## Create color gradient function for graph
colfunc <- colorRampPalette(c("#2f5269", "#7da2c0"))
## Plot values
ggplot(f_score_data, aes(x = reorder(model, -f_score),
y = f_score, fill = f_score)) +
geom_bar(stat='identity') +
labs(x = 'Model Type',
y = 'F-Score',
title = 'Model Fit') +
geom_text(aes(x = model, y = 0.8, label = f_score),
hjust=0.5, vjust=-0.5, size = 5, colour = 'white') +
scale_fill_gradientn(colours=colfunc(4))+
theme_bw() +
theme(legend.position="bottom") +
theme(legend.text = element_text(color = 'black', size = 12, hjust = 0.5)) +
theme(plot.title = element_text(color = '#554C6C', size = 15, hjust = 0.5)) +
theme(axis.title.x = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.title.y = element_text(color = '#554C6C', size = 12, hjust = 0.5)) +
theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank())
Figure 25. F-Score of machine learning models used on the Titanic training dataset. Bars indicate machine learning models used. Accuracy of the model is shown by the intensity of color, with more accurate models having a brighter color.
We are now going to take our most effective model and use it on the test dataset. This lets us predict survival for the unknown values so we can submit our results to the competition. We will create a two-column solution output with PassengerId and Survived, and export the dataset to our file location.
# Create predictions
predict(rf_model_1, test)
## 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907
## 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1
## 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923
## 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0
## 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939
## 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0
## 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955
## 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1
## 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971
## 1 1 1 0 0 1 1 0 0 0 1 0 0 1 0 1
## 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987
## 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0
## 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003
## 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
## 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019
## 1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0
## 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035
## 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## 1036 1037 1038 1039 1040 1041 1042 1043 <NA> 1045 1046 1047 1048 1049 1050 1051
## 0 0 0 0 0 0 1 0 <NA> 1 0 0 1 1 0 1
## 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067
## 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1
## 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083
## 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0
## 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099
## 1 0 1 0 1 0 0 0 0 1 1 1 0 0 0 0
## 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115
## 1 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0
## 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131
## 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1
## 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147
## 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0
## 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163
## 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0
## 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179
## 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0
## 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195
## 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211
## 1 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0
## 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227
## 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0
## 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243
## 0 0 0 1 0 0 0 1 1 1 0 0 0 1 1 0
## 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259
## 0 0 0 0 1 0 0 1 0 1 1 0 1 0 0 1
## 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275
## 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0
## 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291
## 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0
## 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307
## 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0
## 1308 1309
## 0 1
## Levels: 0 1
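Note the <NA> at passenger 1044 in the predictions above: that passenger's Fare is missing (the single Fare NA seen in the earlier summary), so the model cannot score the row. A minimal patch before building the submission, assuming median imputation is acceptable here:
# Passenger 1044 has a missing Fare, which produced the <NA> prediction above;
# fill it with the training-set median so every test row can be scored
test$Fare[is.na(test$Fare)] <- median(train$Fare, na.rm = TRUE)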
# Add predictions as a column to the test dataset
test1 <- add_predictions(test, rf_model_1, var = "pred", type = NULL)
# Rename the predictions column Survived, in keeping with the data source
test1$Survived <- test1$pred
# Subset the dataset to PassengerId and Survived
test_results <- test1[,(names(test1) %in% c("PassengerId","Survived"))]
# Export the dataset, and submit!
write.csv(test_results, file = '...rf_1_solution.csv', row.names = F)