Introduction

The Titanic: Machine Learning from Disaster dataset is a widely known compettion among the data science community that has been hosted by Kaggle since September 2012. As of 11/19/2016, there were 5,778 teams that had submitted answers to the competition. In this report, I will give a walkthrough of how I generated the test data I submitted that scored an accuracy of 82% and put me in the top 3% of teams on Kaggle.

Overview

In this exercise, I used a random forest algorithm to generate predictions on who would survive the famous 1912 sinking of the RMS Titanic passenger liner given various attributes about the passenger such as their class (1st, 2nd, 3rd), sex, age, name, fellow travelers, and more. The data required some cleaning, and I performed feature engineering to create several variables that I used in the random forest algorithm, including predictions I generated on the sparse cabin data which improved my model’s accuracy.
Before beginning, I downloaded the test and train datasets from Kaggle, which are found at https://www.kaggle.com/c/titanic/.

train_raw       <- read.csv("train.csv", na.strings="")
test_raw        <- read.csv("test.csv", na.strings="")

Pre-process data

After loading the data, I created a new column, Survived on the test data. Naturally, it lacks the survival information as part of the competition. I filled it with NA values. Then, I bound the test and train datasets and stored them in a new dataframe called res_raw and looked at its dimensions and its structure.

test_raw$Survived <- NA
res_raw <- rbind(train_raw, test_raw)
dim(res_raw)
## [1] 1309   12
str(res_raw)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 186 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

Next, I coerced the Ticket and Name variables in res_raw to be characters, as our random forest algorithm cannot handle variables with more than 53 factors. Then, I threw out the PassengerId column and stored the resulting dataframe in an R object called res. I then viewed the summmary of res, including its missing values.

res_raw$Ticket <- as.character(res_raw$Ticket)
res_raw$Name <- as.character(res_raw$Name)
names(res_raw)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"
## Throw out PassengerId
res <- res_raw[,-1]
summary(res)
##     Survived          Pclass          Name               Sex     
##  Min.   :0.0000   Min.   :1.000   Length:1309        female:466  
##  1st Qu.:0.0000   1st Qu.:2.000   Class :character   male  :843  
##  Median :0.0000   Median :3.000   Mode  :character               
##  Mean   :0.3838   Mean   :2.295                                  
##  3rd Qu.:1.0000   3rd Qu.:3.000                                  
##  Max.   :1.0000   Max.   :3.000                                  
##  NA's   :418                                                     
##       Age            SibSp            Parch          Ticket         
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.000   Length:1309       
##  1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000   Class :character  
##  Median :28.00   Median :0.0000   Median :0.000   Mode  :character  
##  Mean   :29.88   Mean   :0.4989   Mean   :0.385                     
##  3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000                     
##  Max.   :80.00   Max.   :8.0000   Max.   :9.000                     
##  NA's   :263                                                        
##       Fare                     Cabin      Embarked  
##  Min.   :  0.000   C23 C25 C27    :   6   C   :270  
##  1st Qu.:  7.896   B57 B59 B63 B66:   5   Q   :123  
##  Median : 14.454   G6             :   5   S   :914  
##  Mean   : 33.295   B96 B98        :   4   NA's:  2  
##  3rd Qu.: 31.275   C22 C26        :   4             
##  Max.   :512.329   (Other)        : 271             
##  NA's   :1         NA's           :1014
summary(is.na(res))
##   Survived         Pclass           Name            Sex         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:891       FALSE:1309      FALSE:1309      FALSE:1309     
##  TRUE :418       NA's :0         NA's :0         NA's :0        
##  NA's :0                                                        
##     Age            SibSp           Parch           Ticket       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:1046      FALSE:1309      FALSE:1309      FALSE:1309     
##  TRUE :263       NA's :0         NA's :0         NA's :0        
##  NA's :0                                                        
##     Fare           Cabin          Embarked      
##  Mode :logical   Mode :logical   Mode :logical  
##  FALSE:1308      FALSE:295       FALSE:1307     
##  TRUE :1         TRUE :1014      TRUE :2        
##  NA's :0         NA's :0         NA's :0

There were quite a few missing values that I dealt with in the following section.
First, I imputed the missing values in Embarked with the mode, ‘S’.

table(res$Embarked)
## 
##   C   Q   S 
## 270 123 914
res$Embarked[is.na(res$Embarked)] <- 'S'
## Check if it worked
which(is.na(res$Embarked))
## integer(0)

Great. Next, I imputed the median Fare for that particular Pclass using the ave() function for the missing values in res$Fare.

res$Fare[is.na(res$Fare)] <- with(res, ave(Fare, Pclass,
                        FUN = function(x) median(x, na.rm=TRUE)))[is.na(res$Fare)]
## Check if it worked 
which(is.na(res$Fare))
## integer(0)

Before imputing missing values on the Age column, I decided to do some feature engineering. I first created a variable called Title I extracted from the Name column. Thank you to DataCamp’s tutorial for this idea, and for Kaggle user tnikaggle for providing code that was cleaner than my original code for this step. To perform this step, I used functions from the stringr package.

library(stringr)
res$Title <- NA
res$Title <- str_sub(res$Name, str_locate(res$Name, ",")[, 1] + 2, str_locate(res$Name, "\\.")[,1]-1)
## Check if it worked
table(res$Title)
## 
##         Capt          Col          Don         Dona           Dr 
##            1            4            1            1            8 
##     Jonkheer         Lady        Major       Master         Miss 
##            1            1            2           61          260 
##         Mlle          Mme           Mr          Mrs           Ms 
##            2            1          757          197            2 
##          Rev          Sir the Countess 
##            8            1            1
which(is.na(res$Title))
## integer(0)
## Convert to factor
res$Title <- factor(res$Title)

Next, I created a variable called LastName. Similarly to the Title variable, the following code extrated only the surname of each person traveling aboard the Titanic.

res$LastName <- NA
reg.ex1 <- ".*," 
res$LastName <- str_extract(res$Name, reg.ex1)
res$LastName <- gsub(",", "", res$LastName)
## Check if it worked 
head(table(res$LastName))
## 
##      Abbing      Abbott    Abelseth     Abelson Abrahamsson     Abrahim 
##           1           3           2           2           1           1
which(is.na(res$LastName)) 
## integer(0)

Then, I created two new variables, TktPre and TktNum that take the first half of the Ticket variable string and the second half, respectively. I found this suggestion from Kaggle user Stephen McInerney in a forum on Kaggle. For tickets that did not have a charater string before the ticket number, I assigned them a dummy character, ’_’. For the several rows where Ticket == “LINE”, I assigned a dummy TktNum of 0.

res$TktPre <- NA; res$TktNum <- NA
reg.ex2 <- ".*\\s|^[A-Z].*(?![0-9])$" 
res$TktPre <- str_extract(res$Ticket, reg.ex2)
res$TktPre <- gsub("\\.|\\s|/", "", res$TktPre)
res$TktPre[is.na(res$TktPre)] <- "_"
## Convert to factor
res$TktPre <- factor(res$TktPre)
## Check if it worked 
table(res$TktPre)
## 
##         _        A2        A4        A5       AQ3       AQ4        AS 
##       957         1        10        28         1         1         1 
##         C        CA   CASOTON        Fa        FC       FCC      LINE 
##         8        68         1         1         3         9         4 
##        LP        PC        PP       PPP        SC      SCA3      SCA4 
##         1        92         4         2         2         1         2 
##      SCAH SCAHBasle      SCOW   SCParis   SCPARIS       SOC       SOP 
##         4         1         1         5        14         8         1 
##      SOPP   SOTONO2   SOTONOQ        SP    STONO2    STONOQ      SWPP 
##         7         3        24         1        21         1         2 
##        WC       WEP 
##        15         4
which(is.na(res$TktPre))
## integer(0)
reg.ex3 <- "\\s[0-9]{1,}|^(?![A-Z]).*"
res$TktNum <- str_extract(res$Ticket, reg.ex3)
res$TktNum[is.na(res$TktNum)] <- "0"
res$TktNum <- as.numeric(res$TktNum)
## Check if it worked
str(res$TktNum)
##  num [1:1309] 21171 17599 3101282 113803 373450 ...
which(is.na(res$TktNum))
## integer(0)

Next, I created a simple column called FamilySize, again thanks to the motivation from DataCamp’s tutorial. This column is constructred by adding together the SibSp and Parch traveling with a person, plus one (for themselves) to equal total family size on board. Then, I discretized FamilySize and created 3 levels of family size: “self”, “small”, and “large”.

res$FamilySize <- res$SibSp + res$Parch + 1

res$relFamilySize <- NA
res$relFamilySize <- ifelse(res$FamilySize == 1, "self", 
                            ifelse(res$FamilySize <= 4, "small", "large")) 
res$relFamilySize <- factor(res$relFamilySize)
## Check if it worked
table(res$relFamilySize)
## 
## large  self small 
##    82   790   437
which(is.na(res$relFamilySize))
## integer(0)

After that, I created another variable called FamilySurvived. My motivation was the following table:

##                  Survival by LastName
## LastName          0 1
##   Abbing          1 0
##   Abbott          1 1
##   Abelseth        0 0
##   Abelson         1 1
##   Abrahamsson     0 0
##   Abrahim         0 0
##   Adahl           1 0
##   Adams           1 0
##   Ahlin           1 0
##   Aks             0 1
##   Albimona        0 1
##   Aldworth        0 0
##   Alexander       1 0
##   Alhomaki        1 0
##   Ali             2 0
##   Allen           1 1
##   Allison         2 1
##   Allum           1 0
##   Andersen        0 0
##   Andersen-Jensen 0 1
##   Anderson        0 1
##   Andersson       7 2
##   Andreasson      1 0
##   Andrew          1 0
##   Andrews         1 1
##   Angheloff       0 0
##   Angle           0 1
##   Appleton        0 1
##   Arnold-Franchi  2 0
##   Aronsson        0 0

At only 30 rows, the table is just a glimpse, but it became apparent that in many cases people of the same surname (a heuristic for families) survived or died together.
I used the dplyr() package to create a tibble called familyAgg, where I grouped res by LastName and summarized the Survived column for each group. Since 1=Survived and 0=Did Not Survive, the sum was the total number of survivors per group. Then, I iterated over res$LastName and assigned the total number of survivors per last name for each row into the new FamilySurvived variable. Please note that this was not a perfect feature, as a few apparently unrelated travelers shared a last name on the ship.

res$Survived <- as.integer(res$Survived)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
familyAgg <- res %>% group_by(LastName) %>% summarise(n=sum(Survived, na.rm=TRUE))
familyDf <- data.frame(familyAgg)
rownames(familyDf) <- familyDf[,1]
res$FamilySurvived <- NA 

for(i in 1:length(res$LastName)){
        surname <- res$LastName[i]
        res$FamilySurvived[i] <- familyDf[surname,2] 
}
## Check if it worked
summary(res$FamilySurvived)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5806  1.0000  4.0000
which(is.na(res$FamilySurvived))
## integer(0)

Here, I optionally took the FamilySurvived variable a step further and calculated the proportion of family members we know survived, because it might be useful for visualization. I got values greater than 1 here, because again, this variable was based on another imperfect variable where people who shared a last name but were unrelated were grouped together. I dealt with this issue in the next step, where I created two new logical variables, SibSpLive and ParchLive. I checked to make sure that a person whose FamilySurvived size was > 0 had at least one SibSp or Parch traveling with them, respectively. Therefore, these variables more accurately reflected whether we already knew a person’s family member survived the sinking.

res$PropSurvived <- res$FamilySurvived/res$FamilySize

res$SibSpLive <- ifelse(res$FamilySurvived>0 & res$SibSp!= 0, T, F)
## Check if it worked
table(res$SibSpLive)
## 
## FALSE  TRUE 
##  1063   246
which(is.na(res$SibSpLive))
## integer(0)
res$ParchLive <- ifelse(res$FamilySurvived>0 & res$Parch!=0, T, F)
## Check if it worked
table(res$ParchLive)
## 
## FALSE  TRUE 
##  1110   199
which(is.na(res$ParchLive))
## integer(0)

Next, I returned to Age and its missing values. I used the corrgram package to construct a correlogram to visualize which numeric variables are most correlated to Age.

table(is.na(res$Age))
## 
## FALSE  TRUE 
##  1046   263
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.3.2
corrgram(res)

From the map, I saw that the intersection of Age and Pclass was shaded the darkest, therefore signaling that Age and Pclass were fairly related. I used the rcorr() function from the Hmisc package to test its strength. I also tested rcorr() on several other non-numeric variables in res, and found that Title was the only variable that came close to the correlation between Age and Pclass.

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
rcorr(res$Age, res$Pclass)
##       x     y
## x  1.00 -0.41
## y -0.41  1.00
## 
## n
##      x    y
## x 1046 1046
## y 1046 1309
## 
## P
##   x  y 
## x     0
## y  0
rcorr(res$Age, res$Title)
##      x    y
## x 1.00 0.31
## y 0.31 1.00
## 
## n
##      x    y
## x 1046 1046
## y 1046 1309
## 
## P
##   x  y 
## x     0
## y  0

Next, I used the ave() function to impute the median age per passenger class and title for each row with a missing age value. One value at row 980 did not accept an imputed age from the functionn, so I manually assigned it using a less complex model. Finally, I discretized the Age variable into four categories: “young child”, “child”, “young adult”, and “adult”, and stored them in a new variable called relAge.

res$Age[is.na(res$Age)] <- with(res, ave(Age, Pclass, Title, 
                        FUN = function(x) median(x, na.rm=TRUE)))[is.na(res$Age)]
## Check if it worked 
summary(res$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.17   21.00   26.00   29.27   36.62   80.00       1
which(is.na(res$Age))
## [1] 980
res[980,] ## Pclass==3
##     Survived Pclass                    Name    Sex Age SibSp Parch Ticket
## 980       NA      3 O'Donoghue, Ms. Bridget female  NA     0     0 364856
##     Fare Cabin Embarked Title   LastName TktPre TktNum FamilySize
## 980 7.75  <NA>        Q    Ms O'Donoghue      _ 364856          1
##     relFamilySize FamilySurvived PropSurvived SibSpLive ParchLive
## 980          self              0            0     FALSE     FALSE
## For some reason, we have a NA for Age at row 980. Let's give it the median value.
require(dplyr)
tbl <- res %>% group_by(Pclass) %>% summarise(median(Age, na.rm=TRUE))
res$Age[980] <- tbl[3,2]
## Convert to numeric
res$Age <- as.numeric(res$Age)

res$relAge <- NA
res$relAge <- ifelse(res$Age < 7, "young child", 
                        ifelse(res$Age < 18, "child", 
                        ifelse(res$Age < 41, "young_adult", "adult")))
## Check if it worked
table(res$relAge)
## 
##       adult       child young child young_adult 
##         262          92          70         885
which(is.na(res$relAge))
## integer(0)
## Convert to factor
res$relAge <- factor(res$relAge) 

Then, really wanting to use the LastName variable, but restricted by the random forest algorithm’s limits, I created a new variable called SurnameL using only the first letter of a family’s last name. Then, I did the same with Cabin, storing the first letter of each cabin in a new variable called CabinL. Looking at the Titanic deckplans, I saw that cabins were named according to floor of the ship. I thought this could be useful.

res$SurnameL <- NA
res$SurnameL <- factor(substr(res$LastName, start=1, stop=1))
## Check if it worked
table(res$SurnameL)
## 
##   A   B   C   d   D   E   F   G   H   I   J   K   L   M   N   O   P   Q 
##  75 103 111   7  72  17  47  54  92   9  39  49  68 104  39  34  66   3 
##   R   S   T   U   v   V   W   Y   Z 
##  60 125  40   1   4  17  61   7   5
which(is.na(res$SurnameL))
## integer(0)
res$CabinL <- NA
res$CabinL <- factor(substr(res$Cabin, start=1, stop=1))
## Check if it worked
table(res$CabinL)
## 
##  A  B  C  D  E  F  G  T 
## 22 65 94 46 41 21  5  1
table(is.na(res$CabinL))
## 
## FALSE  TRUE 
##   295  1014

It should be noted that, at this point I dealt with extremely sparse data. The Cabin and CabinL variables had 1014 missing values and only 295 filled values each. By proceeding with these variable, I was at a great risk of overfitting my random forest model. Nonetheless, I continued because I had reason to believe I could improve my model with cabin information. Beore I could use the CabinL variable, however, I needed to impute the missing values. Using the Titanic deckplan, I assigned each cabin letter its relative floor: Cabins beginning with ‘A’ were on the uppermost part of the ship, hence the value ‘1’. ‘G’ cabins were the lowest on the ship and received the value 7. For the single cabin letter == ‘T’, which I could not find in the diagram, I assigned it floor 0. The passenger was traveling in 1st class, and 1st class occupied the upper decks according to the diagram. These floor numbers were stored in a new variable called CabinFloor. As an integer, I figured I would have better success at estimating CabinFloor than I would using unordered factor levels (i.e. letters).

res$CabinFloor <- NA
res$CabinFloor <- ifelse(res$CabinL=="A", 1, 
                         ifelse(res$CabinL=="B", 2, 
                        ifelse(res$CabinL=="C", 3, 
                        ifelse(res$CabinL=="D", 4, 
                        ifelse(res$CabinL=="E", 5,
                        ifelse(res$CabinL=="F", 6, 
                        ifelse(res$CabinL=="G", 7, 
                        ifelse(res$CabinL=="T", 0, NA))))))))
res$CabinFloor <- as.integer(res$CabinFloor)

I used a corrgram() to determine which numeric variables were most correlated with CabinFloor. I found that Pclass had the boldest shading at the intersection with CabinFloor. Other variables did not come close to this correlation, so I imputed missing CabinFloor values using the mean Pclass and the ave() function.

corrgram(res)

rcorr(res$CabinFloor, res$Pclass)
##      x    y
## x 1.00 0.62
## y 0.62 1.00
## 
## n
##     x    y
## x 295  295
## y 295 1309
## 
## P
##   x  y 
## x     0
## y  0
res$CabinFloor[is.na(res$CabinFloor)] <- with(res, ave(CabinFloor, Pclass,
                        FUN = function(x) mean(x, na.rm=TRUE)))[is.na(res$CabinFloor)]
## Check if it worked
summary(res$CabinFloor)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.000   6.125   5.176   6.125   7.000
which(is.na(res$CabinFloor))
## integer(0)

Based on the nature of my previous model, I got numeric values rather than integers for my CabinFloor variable. Instead of rounding these values, I proceeded by discretizing CabinFloor into “upper half” and “lower half” values. I built a new column called relCabinFloor that did this.

res$relCabinFloor <- NA
res$relCabinFloor <- ifelse(res$CabinFloor < 4, "upper half", "lower half")
res$relCabinFloor <- factor(res$relCabinFloor)
## Check if it worked
table(res$relCabinFloor)
## 
## lower half upper half 
##       1060        249
which(is.na(res$relCabinFloor))
## integer(0)

Next, I re-split the data back into training and testing sets. I removed variables I did not plan to use in my random forest algorithm.

names(res)
##  [1] "Survived"       "Pclass"         "Name"           "Sex"           
##  [5] "Age"            "SibSp"          "Parch"          "Ticket"        
##  [9] "Fare"           "Cabin"          "Embarked"       "Title"         
## [13] "LastName"       "TktPre"         "TktNum"         "FamilySize"    
## [17] "relFamilySize"  "FamilySurvived" "PropSurvived"   "SibSpLive"     
## [21] "ParchLive"      "relAge"         "SurnameL"       "CabinL"        
## [25] "CabinFloor"     "relCabinFloor"
## Remove Name (3), Ticket (8), Cabin (10), LastName (13), FamilySurvived (18), 
## PropSurvived (19), CabinL (24), CabinFloor (25)
train <- res[1:891, c(-3, -8, -10, -13, -18, -19, -24, -25)] 
test <- res[892:1309, c(-3, -8, -10, -13, -18, -19, -24, -25)] 

Finally, I finished pre-processing the data and I was ready to begin exploring it.

Explore the data

I explored the different variables in res to see which ones were more likely to affect my random forest algorithm. I used a mixture of the base graphics and ggplot2 package for the visualizations.

names(train)
##  [1] "Survived"      "Pclass"        "Sex"           "Age"          
##  [5] "SibSp"         "Parch"         "Fare"          "Embarked"     
##  [9] "Title"         "TktPre"        "TktNum"        "FamilySize"   
## [13] "relFamilySize" "SibSpLive"     "ParchLive"     "relAge"       
## [17] "SurnameL"      "relCabinFloor"
library(ggplot2)

Passengers in 3rd class were much more likely to die, while passengers in 1st class were more likely to survive.

Females were much more likely to survive than their male counterparts.

Older passengers were less likely to survive.

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 0.95
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1.05
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 2.3403e-29
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 0.95
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 1.05
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 2.3403e-29
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 1

Passengers with more siblings, spouses, parents, and children on board were less likely to survive. However, for family sizes between 1 and approximately 4, there was an increase in passenger survival.

In agreement with the previous plot, individuals traveling alone and large families (>4) were much less likely to survive, whereas small families (<=4) had a higher chance of survival.

Passengers who paid more for their fare were more likely to survive.

Passengers who embarked at port “S” were less likely to survive, but then again, “S” was the mode for these data.

##               Survival by Title
## Title            0   1
##   Capt           1   0
##   Col            1   1
##   Don            1   0
##   Dona           0   0
##   Dr             4   3
##   Jonkheer       1   0
##   Lady           0   1
##   Major          1   1
##   Master        17  23
##   Miss          55 127
##   Mlle           0   2
##   Mme            0   1
##   Mr           436  81
##   Mrs           26  99
##   Ms             0   1
##   Rev            6   0
##   Sir            0   1
##   the Countess   0   1

This table shows that those with titles such as “Miss” and “Mrs” were more likely to survive, while those with titles such as “Mr” and “Rev” were more likely to die.

##            Survival by TktPre
## TktPre        0   1
##   _         407 254
##   A2          0   0
##   A4          7   0
##   A5         19   2
##   AQ3         0   0
##   AQ4         0   0
##   AS          1   0
##   C           3   2
##   CA         27  14
##   CASOTON     1   0
##   Fa          1   0
##   FC          1   0
##   FCC         1   4
##   LINE        3   1
##   LP          0   0
##   PC         21  39
##   PP          1   2
##   PPP         1   1
##   SC          0   1
##   SCA3        0   0
##   SCA4        1   0
##   SCAH        1   1
##   SCAHBasle   0   1
##   SCOW        1   0
##   SCParis     2   2
##   SCPARIS     4   3
##   SOC         5   1
##   SOP         1   0
##   SOPP        3   0
##   SOTONO2     2   0
##   SOTONOQ    13   2
##   SP          1   0
##   STONO2     10   8
##   STONOQ      0   0
##   SWPP        0   2
##   WC          9   1
##   WEP         2   1

In some cases, those who had a similar ticket prefix survived together, although this assumption is not very robust.

##       Survival by TktNum
## TktNum 0 1
##   0    3 1
##   2    7 5
##   3    2 0
##   541  0 1
##   693  1 0
##   695  1 0
##   751  1 1
##   752  0 1
##   851  1 0
##   1166 1 0
##   1585 1 0
##   1601 2 5

It is difficult to draw any conclusions from the data about survival versus ticket suffix.

Passengers whose sibling or spouse survived were much more likely to survive than those who did not have a surviving sibling or spouse.

Passengers whose parent or child survived were also much more likely to survive than those who did not have a surviving parent or child.

##         Survival by SurnameL
## SurnameL  0  1
##        A 32 19
##        B 37 35
##        C 42 27
##        d  2  2
##        D 19 20
##        E  9  3
##        F 19 12
##        G 31 10
##        H 37 32
##        I  4  2
##        J 19 11
##        K 19  9
##        L 30 18
##        M 41 33
##        N 14 15
##        O 17  6
##        P 34 13
##        Q  0  2
##        R 28 13
##        S 57 29
##        T 14 16
##        U  1  0
##        v  2  0
##        V 13  0
##        W 20 13
##        Y  5  2
##        Z  3  0

It is difficult to draw any conclusions from the data about survival versus first letter of family name.

Passengers with a cabin in the lower half of the ship were less likely to survive than passengers in the upper half.
Finally, I looked at the corrgram() with respect to Survived and tested a few variables’ correlation with Survived.

corrgram(res)

rcorr(res$Survied, res$Title)
##   y
## y 1
## 
## n= 1309 
## 
## 
## P
##   y
## y
rcorr(res$Survived, res$Pclass)
##       x     y
## x  1.00 -0.34
## y -0.34  1.00
## 
## n
##     x    y
## x 891  891
## y 891 1309
## 
## P
##   x  y 
## x     0
## y  0
rcorr(res$Survived, res$Fare)
##      x    y
## x 1.00 0.26
## y 0.26 1.00
## 
## n
##     x    y
## x 891  891
## y 891 1309
## 
## P
##   x  y 
## x     0
## y  0
rcorr(res$Survived, res$SibSpLive)
##      x    y
## x 1.00 0.36
## y 0.36 1.00
## 
## n
##     x    y
## x 891  891
## y 891 1309
## 
## P
##   x  y 
## x     0
## y  0
rcorr(res$Survived, res$ParchLive)
##      x    y
## x 1.00 0.35
## y 0.35 1.00
## 
## n
##     x    y
## x 891  891
## y 891 1309
## 
## P
##   x  y 
## x     0
## y  0

Build a model

Finally, I created a random forest algorithm using the randomForest package in R. For my formula, I used every variable in train except for Embarked and relAge.

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:Hmisc':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
set.seed(0)
rf <- randomForest(factor(Survived)~ Pclass+Sex+Age+SibSp+Parch+Fare+Title+ 
                           ParchLive+SibSpLive+relCabinFloor+TktPre+TktNum+  
                           FamilySize+SurnameL+relFamilySize, data=train, ntree=401)
importance(rf)
##               MeanDecreaseGini
## Pclass               19.725888
## Sex                  47.085066
## Age                  26.017395
## SibSp                 7.713462
## Parch                 5.809513
## Fare                 33.640425
## Title                66.757325
## ParchLive            17.928442
## SibSpLive            17.732570
## relCabinFloor         3.768058
## TktPre               15.105764
## TktNum               34.862783
## FamilySize           11.842743
## SurnameL             47.475756
## relFamilySize         9.713617
print(rf)
## 
## Call:
##  randomForest(formula = factor(Survived) ~ Pclass + Sex + Age +      SibSp + Parch + Fare + Title + ParchLive + SibSpLive + relCabinFloor +      TktPre + TktNum + FamilySize + SurnameL + relFamilySize,      data = train, ntree = 401) 
##                Type of random forest: classification
##                      Number of trees: 401
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 15.49%
## Confusion matrix:
##     0   1 class.error
## 0 526  23  0.04189435
## 1 115 227  0.33625731

Results

The solution to the test set was readily available on the web. I found it at http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv. After cleaning up the data, I used the answer to validate my predictions on my local machine. I also used the solution to construct an ROC curve and calculate area under the curve (AUC). I stored the test solution in an R object called test_soln.

Then, I tested my model’s prediction.

out.rf <- predict(rf, newdata=test)
mean(out.rf == test_soln$Survived)
## [1] 0.8110048

I liked this accuracy, so I assigned my predicted survival to my test dataframe. Then, I created an ROC plot with my model using the ROCR package.

test$Survived <- out.rf 
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
p <- predict(rf, newdata=test, type="class")
pr <- prediction(as.numeric(p), as.numeric(test_soln$Survived))
prf <- performance(pr, measure="tpr", x.measure="fpr")
plot(prf)

Finally, I calculated the AUC.

auc <- performance(pr, measure="auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7661392

I submitted the final prediction as a dataframe to Kaggle.

my_soln <- data.frame(PassengerId = test_raw$PassengerId, Survived=test$Survived)
write.csv(my_soln, "my_titanic_soln.csv", row.names=FALSE)

Conclusion

For some reason, my code run in knitr produces a slightly lower accuracy than what I got in my RStudio terminal. Using this code on my local machine, with the seed also set to 0, gave me a slightly higher accuracy. In the future, I will look more closely at why this might have occurred. For now, find the raw code that earned my high score on Kaggle at https://github.com/kairstenfay/kairstenfay.github.io/blob/master/_posts/R-projects/titanic/titanic_winner_.R.