Introduction

In this document, we predict the survival of passengers aboard the Titanic from their compartment class, gender, age, family status, port of embarkation, and the fare they paid for the trip.

I wonder how the algorithms will handle ticket price, which depends not only on compartment class but also on the port of embarkation.
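As a first look at that question, here is a minimal sketch that groups fare by class and port. It uses only the six rows printed by head(train) later in this document, so the name `first_rows` is my own and the result says nothing reliable about the full data set.

```r
# First six rows of the training data, copied from the head(train) output
# shown later in this document (lower-cased column names).
# `first_rows` is an illustrative name, not part of the original analysis.
first_rows <- data.frame(
  pclass   = c(3, 1, 3, 1, 3, 3),
  fare     = c(7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583),
  embarked = c("S", "C", "S", "S", "S", "Q"),
  stringsAsFactors = FALSE
)

# Median fare for each (class, port) combination present in these rows.
fare_by_group <- aggregate(fare ~ pclass + embarked, data = first_rows, FUN = median)
print(fare_by_group)
```

On the full data the same call would presumably be run with `data = train`. With one or two passengers per group, as here, the medians are only a hint of how much fare can vary within a class.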

Data Preprocessing

# load all data sets from local CSV files
# train - training data set
# test - test data set
# gendermodel - survival labels for the test passengers (Kaggle's gender-based benchmark predictions, not observed outcomes)
train <- read.csv("C:\\Users\\Windows\\Dropbox\\AllStuff\\Titanic_Kaggle\\Data\\train.csv") 
gendermodel <- read.csv("C:\\Users\\Windows\\Dropbox\\AllStuff\\Titanic_Kaggle\\Data\\gendermodel.csv")
test <- read.csv("C:\\Users\\Windows\\Dropbox\\AllStuff\\Titanic_Kaggle\\Data\\test.csv")

# what do the data sets look like?
str(train)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

str(gendermodel)

## 'data.frame':    418 obs. of  2 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Survived   : int  0 1 0 0 1 0 1 0 1 0 ...

str(test)

## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
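The first ten values printed above suggest that gendermodel marks every female passenger in test as a survivor and every male passenger as not. Here is a minimal sketch of that check, using only the ten values shown in the str() output; the variable names are my own.

```r
# First ten values copied from the str() output above.
# Sex codes follow the factor levels shown there: 1 = "female", 2 = "male".
sex_codes <- c(2, 1, 2, 2, 1, 2, 1, 2, 1, 2)
test_sex  <- factor(sex_codes, levels = c(1, 2), labels = c("female", "male"))
gendermodel_survived <- c(0, 1, 0, 0, 1, 0, 1, 0, 1, 0)

# Rule under test: predict survival exactly when the passenger is female.
rule_prediction <- as.integer(test_sex == "female")
matches_rule <- all(rule_prediction == gendermodel_survived)
print(matches_rule)
```

On the full data, something like `all(gendermodel$Survived == as.integer(test$Sex == "female"))` should answer the same question, assuming both data frames list passengers in the same order.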

# change column names to lower case
names(train) <- tolower(names(train))
names(gendermodel) <- tolower(names(gendermodel))
names(test) <- tolower(names(test))

# Passenger name, ticket and cabin are removed. They might matter for an individual passenger, but they are unlikely to show a pattern that generalizes to survival.
train <- train[, !(names(train) %in%  c("name", "ticket", "cabin"))]
test <- test[, !(names(test) %in% c("name", "ticket", "cabin"))]

head(train)

##   passengerid survived pclass    sex age sibsp parch    fare embarked
## 1           1        0      3   male  22     1     0  7.2500        S
## 2           2        1      1 female  38     1     0 71.2833        C
## 3           3        1      3 female  26     0     0  7.9250        S
## 4           4        1      1 female  35     1     0 53.1000        S
## 5           5        0      3   male  35     0     0  8.0500        S
## 6           6        0      3   male  NA     0     0  8.4583        Q

head(test)

##   passengerid pclass    sex  age sibsp parch    fare embarked
## 1         892      3   male 34.5     0     0  7.8292        Q
## 2         893      3 female 47.0     1     0  7.0000        S
## 3         894      2   male 62.0     0     0  9.6875        Q
## 4         895      3   male 27.0     0     0  8.6625        S
## 5         896      3 female 22.0     1     1 12.2875        S
## 6         897      3   male 14.0     0     0  9.2250        S
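Row 6 of train has a missing age, and str(train) listed an empty string as one of the levels of Embarked, so it seems worth counting gaps before modelling. Below is a minimal sketch on the six rows printed by head(train); `first_rows`, `na_counts` and `blank_counts` are names I made up for illustration.

```r
# First six rows of train, copied from the head(train) output above.
first_rows <- data.frame(
  passengerid = 1:6,
  survived    = c(0, 1, 1, 1, 0, 0),
  pclass      = c(3, 1, 3, 1, 3, 3),
  sex         = c("male", "female", "female", "female", "male", "male"),
  age         = c(22, 38, 26, 35, 35, NA),
  sibsp       = c(1, 1, 0, 1, 0, 0),
  parch       = c(0, 0, 0, 0, 0, 0),
  fare        = c(7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583),
  embarked    = c("S", "C", "S", "S", "S", "Q"),
  stringsAsFactors = FALSE
)

# Count NA entries per column.
na_counts <- colSums(is.na(first_rows))

# Count empty-string entries per column; blanks are not NA, so they need
# their own check.
blank_counts <- sapply(first_rows, function(col) sum(as.character(col) == "", na.rm = TRUE))

print(na_counts)
print(blank_counts)
```

The same two calls should work unchanged on train and test. How to fill or drop the gaps is a separate decision that this sketch does not make.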

Exploratory data analysis

I am interested in a simple, black-and-white picture of what the data say:
* survivability based on compartment class
* survivability based on gender
* survivability based on age
* survivability based on family status
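The comparisons listed above can be sketched as group-wise survival rates. This is only an illustration on the six rows printed by head(train). The `has_family` flag (any sibling, spouse, parent or child aboard) is my own assumption about what family status could mean, and age is left out because it would first need binning.

```r
# Columns copied from the head(train) output; six rows only.
first_rows <- data.frame(
  survived = c(0, 1, 1, 1, 0, 0),
  pclass   = c(3, 1, 3, 1, 3, 3),
  sex      = c("male", "female", "female", "female", "male", "male"),
  sibsp    = c(1, 1, 0, 1, 0, 0),
  parch    = c(0, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical definition of family status: travelling with any relative.
first_rows$has_family <- (first_rows$sibsp + first_rows$parch) > 0

# Survival rate = mean of the 0/1 survived column within each group.
rate_by_class  <- aggregate(survived ~ pclass, data = first_rows, FUN = mean)
rate_by_sex    <- aggregate(survived ~ sex, data = first_rows, FUN = mean)
rate_by_family <- aggregate(survived ~ has_family, data = first_rows, FUN = mean)

print(rate_by_class)
print(rate_by_sex)
print(rate_by_family)
```

Six rows are far too few to read anything into these rates. The point is only the shape of the calculation, which should carry over to the full train data frame.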

Titanic disaster - predicting survival using ML

Faiyaz Hasan

August 30, 2016
