Introduction

This is an attempt to solve the Kaggle machine learning problem titled “Titanic: Machine Learning from Disaster.” By analyzing how various factors affected a passenger’s probability of surviving the sinking, I will attempt to build a model that predicts whether a given passenger survived.

I will be using some R packages to assist my analysis. Download them from CRAN if you do not have them, and then load them from your library.

  library(ggplot2)
  suppressMessages(library(dplyr))
  library(tidyr)

Cleaning Train Data

Download data from Kaggle

Download the training data and the test data from Kaggle into your working directory, and load it into R. Take a look at the variable names provided.

train <- read.csv("train.csv", header=TRUE, stringsAsFactors=FALSE, na.strings="")
train <- tbl_df(train)
test <- read.csv("test.csv", header=TRUE, stringsAsFactors=FALSE, na.strings="")
names(train)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"

The most important attribute of the data is the Survived outcome. First, I would like to know whether more people died or survived.

#Count of 1 values in the Survived column, representing survivors
sum(train$Survived)
## [1] 342
#Count of 0 values in the Survived column, representing deaths
sum(!train$Survived)
## [1] 549
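
Equivalently, base R’s table function gives both counts in a single call:

table(train$Survived)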

The training data contains more deaths than survivors.

Variable descriptions (from Kaggle)

The problem description on the Kaggle website includes a description of each variable. The description is as follows: “VARIABLE DESCRIPTIONS:
survival Survival
(0 = No; 1 = Yes)
pclass Passenger Class
(1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation
(C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used
for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations."

Transform variables

To begin cleaning the data, I converted some of the categorical data into factors for more meaningful plotting. I performed this operation on data that were discrete, but were represented numerically in the training data.

train$Survived <- factor(train$Survived, c(0,1), c("No", "Yes"))
train$Pclass <- factor(train$Pclass, c(1,2,3), c("First", "Second", "Third"))

I will also convert some of the other variables into more useful forms. For example, the passengers’ names could provide insight into their probability of survival. Most passengers’ names contain one of the following titles: “Master”, “Mr.”, “Mrs.”, “Miss”, or “Dr.” The title is the informative part of the name, so I will keep it and discard the rest.
First, I will use grep with fixed (literal) matching to find the rows containing each title, and then replace those passengers’ full names with just the title.

master <- grep("Master.", train$Name, fixed=TRUE)
miss <- grep("Miss.", train$Name, fixed=TRUE)
mrs <- grep("Mrs.", train$Name, fixed=TRUE)
mr <- grep("Mr.", train$Name, fixed=TRUE)
dr <- grep("Dr.", train$Name, fixed=TRUE)
#Replace each matched full name with its title
train$Name[master] <- "Master"
train$Name[miss] <- "Miss"
train$Name[mrs] <- "Mrs"
train$Name[mr] <- "Mr"
train$Name[dr] <- "Dr"
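
As a quick check on how many names these five titles actually capture, the Name column can be tabulated; names that contain none of the titles keep their original full form:

#Names not matching one of the five titles remain unchanged
sort(table(train$Name), decreasing=TRUE)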

I decided to use a variable in my predictive model only if a significant number of observations had values for that variable. The following code counts the missing values of each variable in the train data. It builds a two-column matrix, with each row holding a variable name and the number of missing values that variable had in the train data.

x <- length(names(train))
y <- cbind(names(train),1:x)
for(i in 1:x){
  y[i,2] <- sum(is.na(train[,i]))
}
y
##       [,1]          [,2] 
##  [1,] "PassengerId" "0"  
##  [2,] "Survived"    "0"  
##  [3,] "Pclass"      "0"  
##  [4,] "Name"        "0"  
##  [5,] "Sex"         "0"  
##  [6,] "Age"         "177"
##  [7,] "SibSp"       "0"  
##  [8,] "Parch"       "0"  
##  [9,] "Ticket"      "0"  
## [10,] "Fare"        "0"  
## [11,] "Cabin"       "687"
## [12,] "Embarked"    "2"
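
For reference, the same counts can be obtained in one line of base R:

colSums(is.na(train))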

Based on these results, I will not use a passenger’s cabin number in my analysis, since Cabin is missing 687 of 891 values. I will also set aside the Ticket and Embarked variables.

Imputing Missing Values

The Age variable is missing 177 values. I will estimate each missing value using the mean age of the corresponding name title (Mr, Mrs, Miss, Master, or Dr).

  meanByTitle <- summarize(group_by(train, Name), MeanAge = mean(Age, na.rm=TRUE))
  for(i in 1:nrow(train)){
    if(is.na(train$Age[i])){
      train$Age[i] <- meanByTitle$MeanAge[meanByTitle$Name == train$Name[i]]
    }
  }
#Check to make sure there are no more missing age values
sum(is.na(train$Age))
## [1] 0
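
For reference, the same imputation can be written as a grouped mutate in dplyr; a sketch equivalent to the loop above (not run here):

train <- train %>%
  group_by(Name) %>%
  mutate(Age = ifelse(is.na(Age), mean(Age, na.rm=TRUE), Age)) %>%
  ungroup()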

Data Exploration

To get a general sense of how the data are distributed, I made some plots of the numerical variables.

agePlot <- qplot(x=Age, data=train, geom="density")
agePlot <- agePlot + labs(title="Age Density in Training Data")
print(agePlot)

print("")
## [1] ""
farePlot <- qplot(x=Fare, data=train, geom="density")
farePlot <- farePlot + labs(title="Fare Density in Training Data")
print(farePlot)

A plot of survival by Sex

survivalPlot <- qplot(x=Sex, data=train) + geom_bar(aes(fill=Survived))
survivalPlot <- survivalPlot + labs(title="Survival by Gender")
survivalPlot

From this plot, it is clear that women were far more likely to survive than men.
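
To put numbers on this claim, survival rates by sex can be cross-tabulated:

#Row proportions: survival rate within each sex
prop.table(table(train$Sex, train$Survived), margin = 1)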
A plot of survival by Age

agePlot <- qplot(x=Age, data=train, binwidth=5) + aes(fill=Survived)
agePlot <- agePlot + labs(title="Survival by Age")
agePlot

A plot of survival by Age with respect to Sex

agePlot <- qplot(x=Age, data=train, binwidth=5, facets=.~Sex) + aes(fill=Survived)
agePlot <- agePlot + labs(title="Survival by Age with respect to Sex")
agePlot

A plot of survival by Pclass (passenger class)

classPlot <- qplot(x=Pclass, data=train) + geom_bar(aes(fill=Survived))
classPlot <- classPlot + labs(title="Survival by Pclass")
classPlot

A plot of survival by Pclass with respect to Sex

classPlot <- qplot(x=Pclass, data=train, facets= .~Sex) + geom_bar(aes(fill=Survived))
classPlot <- classPlot + labs(title="Survival by Pclass with respect to Sex")
classPlot

A plot of survival by Fare

farePlot <- qplot(x=Fare, data=train, binwidth=50) + aes(fill=Survived)
farePlot

A plot of survival by Fare with respect to Sex

farePlot <- qplot(x=Fare, data=train, binwidth=50, facets=.~Sex) + aes(fill=Survived)
farePlot

Age, Pclass, Fare, and Sex clearly affect a passenger’s chance of survival and should be used in the model. However, Sex is by far the best indicator of one’s chance of survival.

Creating a Model

Converting categorical data to numerical values

Now that I have a general sense of how the variables affect a passenger’s survival probability, I will fit a model to the data. First, I will relabel all the categorical factors with numeric codes. (glm handles factor predictors directly, so this step mainly keeps the coefficient names compact.)

train$Survived <- factor(train$Survived, c("No", "Yes"), c(0,1))
train$Pclass <- factor(train$Pclass, c("First", "Second", "Third"), c(1,2,3))
train$Sex <- factor(train$Sex, c("male", "female"), c(1,2))

Using the generalized linear model (glm) function, I will fit a model for survival based on Sex, Pclass, Age, and Fare, including a Sex-by-Pclass interaction. I use the quasibinomial family, for which AIC is not defined (hence the AIC: NA in the output below).

myModel <- glm(Survived ~ Sex + Pclass + Age + Fare + Sex*Pclass, 
               family = quasibinomial, data = train)
myModel
## 
## Call:  glm(formula = Survived ~ Sex + Pclass + Age + Fare + Sex * Pclass, 
##     family = quasibinomial, data = train)
## 
## Coefficients:
##  (Intercept)          Sex2       Pclass2       Pclass3           Age  
##    1.0588297     3.9217644    -1.5832273    -1.8510200    -0.0405333  
##         Fare  Sex2:Pclass2  Sex2:Pclass3  
##   -0.0003594     0.3325401    -2.2079007  
## 
## Degrees of Freedom: 890 Total (i.e. Null);  883 Residual
## Null Deviance:       1187 
## Residual Deviance: 769.2     AIC: NA
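
Before touching the test set, a rough in-sample accuracy check is possible. This is only a sketch and will be optimistic relative to the Kaggle score, since the model is evaluated on the data it was fit to; it also assumes Survived still carries the 0/1 labels assigned above:

#Predicted probabilities on the training data, thresholded at 0.5
train.pred <- as.numeric(predict(myModel, type="response") > 0.5)
#Fraction of training passengers classified correctly
mean(train.pred == as.numeric(as.character(train$Survived)))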

Cleaning Test Data

The test data must be cleaned in the same way the train data was.

test <- tbl_df(test)
test$Sex <- factor(test$Sex, c("male", "female"), c(1,2))
test$Pclass <- factor(test$Pclass)
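#Note: Pclass in the raw test data is numeric 1/2/3, so factoring it here
#produces levels "1", "2", "3", matching the relabeled train$Pclass above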
master <- grep("Master.", test$Name, fixed=TRUE)
miss <- grep("Miss.", test$Name, fixed=TRUE)
mrs <- grep("Mrs.", test$Name, fixed=TRUE)
mr <- grep("Mr.", test$Name, fixed=TRUE)
dr <- grep("Dr.", test$Name, fixed=TRUE)
#Replace each matched full name with its title
test$Name[master] <- "Master"
test$Name[miss] <- "Miss"
test$Name[mrs] <- "Mrs"
test$Name[mr] <- "Mr"
test$Name[dr] <- "Dr"
meanByTitle <- summarize(group_by(test, Name), MeanAge = mean(Age, na.rm=TRUE))
for(i in 1:nrow(test)){
  if(is.na(test$Age[i])){
    test$Age[i] <- meanByTitle$MeanAge[meanByTitle$Name == test$Name[i]]
  }
  #If the passenger does not have one of the five titles,
  #fall back to the mean age of all passengers
  if(is.na(test$Age[i])){
    test$Age[i] <- mean(test$Age, na.rm=TRUE)
  }
}
#Check to make sure there are no more missing age values
sum(is.na(test$Age))
## [1] 0

Now, use the model on the test data.

p.hats <- predict.glm(myModel, newdata = test, type = "response")
head(p.hats)
##          1          2          3          4          5          6 
## 0.10034462 0.27172162 0.04561019 0.13127853 0.50638114 0.20375784
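#At least one test passenger is missing a Fare value, so predict() returns
#NA for that row; fill any NA with the mean predicted probability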
p.hats[is.na(p.hats)] <- mean(p.hats, na.rm=TRUE)
head(p.hats)
##          1          2          3          4          5          6 
## 0.10034462 0.27172162 0.04561019 0.13127853 0.50638114 0.20375784

The model gives us continuous values from 0 to 1, but we need discrete values representing survival or death. Using 0.5 as a cutoff, assign each predicted value to either 0 (did not survive) or 1 (survived).

#Convert predicted probabilities to 0/1 using the 0.5 cutoff
survival <- as.numeric(p.hats > 0.5)
answer <- cbind(test$PassengerId, survival)
dim(answer)
## [1] 418   2
colnames(answer) <- c("PassengerId", "Survived")

Since my exploratory analysis showed that gender was the strongest indicator of a passenger’s survival, I will plot survival by Sex for the training data and the predicted test data to compare.

test <- cbind(test, survival)
train$Sex <- factor(train$Sex, c(1,2), c("male","female"))
test$Sex <- factor(test$Sex, c(1,2), c("male","female"))
train$Survived <- factor(train$Survived, c(0,1), c("no","yes"))
test$survival <- factor(test$survival, c(0,1), c("no","yes"))
trainPlot <- qplot(x=Sex, data=train) + geom_bar(aes(fill=Survived))
trainPlot <- trainPlot + labs(title="Survival by Gender: Training Data")
trainPlot

testPlot <- qplot(x=Sex, data=test) + 
  geom_bar(aes(fill=survival))
testPlot <- testPlot + labs(title="Survival by Gender: Predicted Data")
testPlot

Compared with the training data, my predictive model produced a much higher percentage of male passengers dying, and a slightly lower percentage of female passengers dying.

Results

Write the answer matrix to a CSV file for submission to Kaggle.

write.csv(answer, file = "kaggle.csv", row.names = FALSE)

This method got a score of 0.77512 on Kaggle, putting me in position 1480/2146 on the leaderboard.

Next Steps

I found that adding the Parch (parents/children) and SibSp (siblings/spouses) variables to my generalized linear model actually made my accuracy go down slightly, according to Kaggle’s scoring. I will need to analyze the data more and see if I can find a way to use the familial data to help my results; one idea is sketched below.
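
One common approach, which I have not tested here, is to collapse SibSp and Parch into a single family-size feature before refitting the model; a minimal hypothetical sketch:

#Hypothetical feature: total family members aboard, including the passenger
train$FamilySize <- train$SibSp + train$Parch + 1
#The model could then be refit with this feature added, e.g.
#glm(Survived ~ Sex*Pclass + Age + Fare + FamilySize,
#    family = quasibinomial, data = train)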
My next steps will be to explore my test predictions further, see how they differ from the training data, and use what I learn to build a better generalized linear model. In this analysis I used the glm function as a black box; once I have a better understanding of the math behind it, I hope to be able to tune it and get a better result.