Predicting Titanic Survival Rate

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. This project investigates the likelihood of survival on titanic by employing Machine Learning tools to predict which passengers survived the tragedy.

During this exercise I will perform the following:
  1. Load and Check Data
  2. Perform Exploratory Data Analysis
  3. Combine Train and Test data
  4. Impute Missing Values
  5. Perform Feature Engineering
  6. Make Prediction

1. Load and Check Data

# Load and check the training data
train <- read.csv('Kaggle_Titanic_Train.csv', stringsAsFactors=FALSE)
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(train)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 

Observation- As we can see, the training dataset has 891 observations and 12 variables. Looking at the structure and summary of our dataset gives us some sense of the data.

2. Explore the Data

2.1 Let’s look at how many people survived

# Number of people survived
table(train$Survived)
## 
##   0   1 
## 549 342
# Proportion of people survived
prop.table(table(train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

Observation- Looking at the numbers, we can see that only 38% passengers survived.

2.1.1 Survival Plot
library(ggplot2)
ggplot(train, aes(factor(Survived), fill=factor(Survived))) + geom_bar() +
  scale_y_continuous(limits=c(0,600)) + 
  scale_x_discrete(labels=c('Dead', 'Survived')) +
  scale_fill_discrete(guide=FALSE) +
  labs(x='Survival', y='No of Passengers', title='Survival Rate')

2.2 Survival by Sex

# Number of people survived by sex
table(train$Sex, train$Survived)
##         
##            0   1
##   female  81 233
##   male   468 109
# Proportion of people survived by sex
prop.table(table(train$Sex, train$Survived),1)
##         
##                  0         1
##   female 0.2579618 0.7420382
##   male   0.8110919 0.1889081

Observation- We can see that of the passengers that survived, 74% were females while only 19% were males.

2.2.1 Plot of survival by Sex
ggplot(train, aes(factor(Survived))) + geom_bar(aes(fill=factor(Sex))) +
  scale_y_continuous(limits=c(0,600), breaks=seq(0,600,100)) +
  scale_x_discrete(labels = c('Dead', 'Survived')) +
  scale_fill_discrete(name="Sex") +
  labs(x='Survived', y='No. of Passengers', title='Survival by Sex')

2.3 Survival by Passenger Class

# Number of people survived by Passenger Class
table(train$Pclass, train$Survived)
##    
##       0   1
##   1  80 136
##   2  97  87
##   3 372 119
# Proportion of people survived by Passenger Class
prop.table(table(train$Pclass, train$Survived),1)
##    
##             0         1
##   1 0.3703704 0.6296296
##   2 0.5271739 0.4728261
##   3 0.7576375 0.2423625

Observation- We can see that of the passengers that survived 1. 63% passengers that travelled in Class 1 survived 2. Only 24% passengers in class 3 survived

2.3.1 Plot of Survival by Passenger Class
ggplot(train, aes(factor(Survived))) + geom_bar(aes(fill=factor(Pclass))) +
  scale_y_continuous(limits=c(0,600), breaks=seq(0,600,100))+
  scale_x_discrete(labels = c('Dead', 'Survived')) +
  scale_fill_discrete(name="Passenger\nClass") +
  labs(x='Survived', y='No. of Passengers', title='Survival by Passenger Class')

2.4 Survival by Sex and Passenger Class

# Number of people survived by Sex and Passenger Class
aggregate(Survived~Sex+Pclass, train, FUN = sum)
##      Sex Pclass Survived
## 1 female      1       91
## 2   male      1       45
## 3 female      2       70
## 4   male      2       17
## 5 female      3       72
## 6   male      3       47

Note- This gives the total number of male/female passengers that survived in each class.

# Proportion of people survived by Sex and Passenger Class
aggregate(Survived~Sex+Pclass, train, FUN=function(x){sum(x)/length(x)})
##      Sex Pclass  Survived
## 1 female      1 0.9680851
## 2   male      1 0.3688525
## 3 female      2 0.9210526
## 4   male      2 0.1574074
## 5 female      3 0.5000000
## 6   male      3 0.1354467

Observation- 1. Atleast 92% women in class 1 and 2 survived. Only 50% women in class 3 survived 2. 37% men in class 1 survived while only 13% men in class 3 survived

2.5 Let’s visualize relationship between Age and Survival for each Passenger Class

# Lets look at relationship b/w age and survival for each class
ggplot(train, aes(Age)) + 
  geom_histogram(aes(fill=factor(Survived))) +
  facet_grid(Pclass~.) + 
  scale_fill_discrete(name="Survival", labels=c('Dead', 'Survived')) +
  labs(title = 'Survival by Age and Class')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.6 Let’s visualize relationship between Age, Passenger Class and Sex as survival factors

# Lets look at relation b/w age, class and sex as survival factors
ggplot(train, aes(Age, factor(Pclass), col=factor(Survived))) + 
  geom_jitter(position=position_jitter(height=0.2)) + 
  facet_grid(Sex~.) + 
  scale_color_discrete(name="Survival", labels=c('Dead', 'Survived')) +
  labs(y='Passenger Class', title='Survival by Age, Sex and Class')

2.7 Let’s visualize relationship between Age, Fare and Survival for each Passenger Class

# Lets look at relation b/w age, fare and survival for each class
ggplot(train, aes(Age, Fare, col=factor(Pclass))) +
  geom_point(aes(shape=factor(Survived))) + 
  facet_grid(Sex~.) +
  scale_color_discrete(name="Passenger\nClass") +
  scale_shape_discrete(name="Survival", labels=c('Dead', 'Survived'))

3. Combine Train and Test data

We will combine train and test datasets for feature engineering. Let’s load the test data and take a look at the structure of the test data.

# Load and check test data
test <- read.csv('Kaggle_Titanic_Test.csv', stringsAsFactors = FALSE)
str(test)
## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...

Observation- As we can see that the test data does not have variable ‘Survived’ as we are trying to predict survival. In order for us to combine the 2 datasets for feature engineering, we will create a new variable called ‘Survived’ in the test set and assign a value of 0 (dead).

# Create 'Survived' variable in test set
test$Survived <- 0

# Combine the datasets
full <- rbind(train, test)

# View the structure of combined dataset
str(full)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : num  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Observation- We can see that the combined dataset has 1309 observations of 12 variables.

4. Missing Value Imputation

Lets look at the summary of our dataset to see which variables have missing values.

summary(full)
##   PassengerId      Survived          Pclass          Name          
##  Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309       
##  1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median : 655   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   : 655   Mean   :0.2613   Mean   :2.295                     
##  3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :1309   Max.   :1.0000   Max.   :3.000                     
##                                                                    
##      Sex                 Age            SibSp            Parch      
##  Length:1309        Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
##  Mode  :character   Median :28.00   Median :0.0000   Median :0.000  
##                     Mean   :29.88   Mean   :0.4989   Mean   :0.385  
##                     3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##                     Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##                     NA's   :263                                     
##     Ticket               Fare            Cabin          
##  Length:1309        Min.   :  0.000   Length:1309       
##  Class :character   1st Qu.:  7.896   Class :character  
##  Mode  :character   Median : 14.454   Mode  :character  
##                     Mean   : 33.295                     
##                     3rd Qu.: 31.275                     
##                     Max.   :512.329                     
##                     NA's   :1                           
##    Embarked        
##  Length:1309       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Observation- We see that Age has 263 missing values and Fare has 1 missing value. Let’s start with Fare variable.

4.1 Impute Missing Values - Fare

4.1.1 - Lets look at the row with missing value in Fare variable
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
filter(full, is.na(full$Fare))
##   PassengerId Survived Pclass               Name  Sex  Age SibSp Parch
## 1        1044        0      3 Storey, Mr. Thomas male 60.5     0     0
##   Ticket Fare Cabin Embarked
## 1   3701   NA              S

Observation- We can see that the passenger travelled in 3rd class and embarked from ‘S’

4.1.2 - Lets filter rows with Embarked=S and Pclass=3
Fare_est <- filter(full, full$Embarked=='S' & full$Pclass==3)
4.1.3 - Lets find the median fare of this filtered data.
median(Fare_est$Fare, na.rm=TRUE)
## [1] 8.05

Observation- We can see that the median fare for passenger who travelled in 3rd class and embarked from ‘S’ is 8.05. Lets assign this value to the missing fare.

4.1.4 - Lets assign the value to passenger with ID 1004 and check if there are any missing values left.
full$Fare[full$PassengerId==1044] <- 8.05
any(is.na(full$Fare))
## [1] FALSE

4.2 Impute Missing Values - Embarked

4.2.1 - Lets convert ‘Embarked’ variable to a factor and check if Embarked has any missingness.
full$Embarked <- as.factor(full$Embarked)
filter(full, full$Embarked=='')
##   PassengerId Survived Pclass                                      Name
## 1          62        1      1                       Icard, Miss. Amelie
## 2         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn)
##      Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 1 female  38     0     0 113572   80   B28         
## 2 female  62     0     0 113572   80   B28

Observation- Passengers with ID 62 and 830 have missing Embarked value. Both passengers travelled in 1st class.

4.2.2 - Lets filter rows with Pclass=1
Embark_est <- filter(full, full$Pclass==1)
4.2.3 - Now let’s find where most of the 1st class passengers embarked from
plot(Embark_est$Embarked)

Observation- Most passengers embarked from port ‘S’

4.2.4 - Let’s assign ‘S’ to the missing ports for passengers with ID 62 and 830
full$Embarked[full$PassengerId %in% c(62, 830)] <- 'S'
4.2.5 - Now lets check is Embarked has any missing values left
filter(full, full$Embarked=='')
##  [1] PassengerId Survived    Pclass      Name        Sex        
##  [6] Age         SibSp       Parch       Ticket      Fare       
## [11] Cabin       Embarked   
## <0 rows> (or 0-length row.names)

4.3 Impute Missing Values - Age

Now that we have removed misiing values from Fare and Embarked, we can use MICE package for imputation. If we had not replaced missing values in Fare and Embarked, using these 2 variables would generate an error. Not using these 2 variables however would cause MICE to impute values for Fare and Embarked as well.

So unless we want to use MICE for all varibales that have missing values, we either need to fix missing values for variables we dont want MICE to replace or create a subset of variables that can be used by MICE for imputation.

4.3.1 - Let’s take a quick look at missing values in Age variable
hist(full$Age)

sum(is.na(full$Age))
## [1] 263

Observation- We can see that Age varies between 0-80 and we have 263 missing values in Age.

4.3.2 - Let’s Impute Age using MICE
library(mice)
## Loading required package: Rcpp
## mice 2.25 2015-11-09
set.seed(101)
vars.for.imputation <- full[,!names(full) %in% c('PassengerId','Name','Survived', 'Pclass', 'Sex', 'SibSp', 'Parch','Ticket','Cabin','Fare')]
imputed <- complete(mice(vars.for.imputation))
4.3.3 - Let’s look at the frequency distribution of original and imputed age to ensure everything looks good.
par(mfrow=c(1,2))
hist(full$Age, freq=F, main='Age: Original', col='darkgreen', ylim=c(0,0.04))
hist(imputed$Age, freq=F, main='Age: Imputed', col='lightgreen', ylim=c(0,0.04))

4.3.4 - Things look good, so let’s replace our age vector in the original data with the output from the mice model.
full$Age <- imputed$Age
sum(is.na(full$Age))
## [1] 0

5. Feature Engineering

5.1 New Variable - Family Size

5.1.1 - Let’s make a family size variable based on number of siblings/spouse(s) and number of children/parents.
full$Fsize <- full$SibSp + full$Parch + 1
5.1.2 - Let’s visualize the relationship between Family Size and Survival. We will only consider the training data (891 rows) for the plot.
ggplot(full[1:891,], aes(Fsize)) + 
  geom_bar(aes(fill=factor(Survived), stat='count', position='dodge')) +
  scale_x_continuous(breaks = c(1:11)) + 
  scale_y_continuous(breaks=seq(0,600,50)) + 
  labs(x='Family Size', title='Survival by Family Size') +
  scale_fill_discrete(name="Survival", labels=c('Dead', 'Survived'))
## Warning: Ignoring unknown aesthetics: stat, position

Observation- We can see that the survival rate for those with a family size of 1 or those with family sizes above 4 is really low.

5.2 New Variable - Discretized Family Size

Perhaps we can collapse the Family Size variable into three levels which will be helpful since there are comparatively fewer large families. Let’s create a discretized family size variable.

full$FsizeD[full$Fsize==1] <- 'Single'
full$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- 'Small'
full$FsizeD[full$Fsize > 4] <- 'Large'
5.2.1 - Let’s visualize survival by discretized family size
ggplot(full[1:891,], aes(factor(Survived))) + geom_bar(aes(fill=factor(FsizeD))) +
  scale_x_discrete(labels = c('Dead','Survived')) + 
  scale_y_continuous(breaks = seq(0,600,100)) +
  labs(x='Survival', title='Survival by Family Size') +
  scale_fill_discrete(name="Family Size") +


mosaicplot(table(full[1:891,]$FsizeD, full[1:891,]$Survived), main='Family Size by Survival', shade=TRUE)

Observation- The mosaic plot shows that there’s a survival penalty among singletons and large families, but a benefit for passengers in small families.

5.3 New Variable - Title

Let’s grab the title from passenger name. We will look for a pattern in any string preceding a ‘comma and space’ OR any string that follows a period and replace it with blank space to get the title

full$Title <- gsub('(.*, )|(\\..*)', '', full$Name)
table(full$Sex, full$Title)
##         
##          Capt Col Don Dona  Dr Jonkheer Lady Major Master Miss Mlle Mme
##   female    0   0   0    1   1        0    1     0      0  260    2   1
##   male      1   4   1    0   7        1    0     2     61    0    0   0
##         
##           Mr Mrs  Ms Rev Sir the Countess
##   female   0 197   2   0   0            1
##   male   757   0   0   8   1            0
5.3.1 - Let’s combine and reassign titles
# Combine titles with low counts to rare
rare_title <- c('Dona', 'Lady', 'the Countess','Capt', 'Col', 'Don', 
                'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')

full$Title[full$Title == 'Ms'] <- 'Miss'
full$Title[full$Title=='Mlle'] <- 'Miss'
full$Title[full$Title=='Mme'] <- 'Mrs'
full$Title[full$Title %in% rare_title] <- 'Rare Title'

table(full$Sex, full$Title)
##         
##          Master Miss  Mr Mrs Rare Title
##   female      0  264   0 198          4
##   male       61    0 757   0         25
5.3.2 - Let’s visualize titles to see which classes did various titles travelled in.
full$Title <- as.factor(full$Title)

# Plot Title by Sex and Passenger Class
ggplot(full, aes(Title)) + geom_bar(aes(fill=factor(Pclass))) +
  scale_y_continuous(breaks = seq(0,800,100)) + 
  labs(title='Title by Passenger Class') +
  scale_fill_discrete(name="Passenger\nClass")

5.3.3 Let’s visualize Age by title, sex and passenger class
ggplot(full, aes(Age, Title, col=factor(Sex))) +
  geom_jitter(position=position_jitter(height=0.2)) + facet_grid(Pclass~.) +
  labs(y='Title', title='Age by Title, Sex and Class') +
  scale_color_discrete(name="Sex")

Observation- Looking at the plot we can see that some passengers in class 3 with age > 18 are titled as Master.

5.3.4 - Let’s find out who these people are and replace their title accordingly
# Filter rows where age > 18 and title is Master
young <- filter(full, full$Title=='Master' & full$Age > 18)

# Look at the Name, age, Class and Sex of this subset of data
young[,c('Name','Age','Pclass','Sex')]
##                                                Name Age Pclass  Sex
## 1                          Moubarek, Master. Gerios  30      3 male
## 2                        Sage, Master. Thomas Henry  19      3 male
## 3                     Lefebre, Master. Henry Forbes  30      3 male
## 4 Moubarek, Master. Halim Gonios ("William George")  30      3 male
## 5         Johnston, Master. William Arthur Willie""  29      3 male
## 6               van Billiard, Master. James William  27      3 male
# Since all passengers are male, change their title to Mr
full$Title[young$Title=='Master'] <- 'Mr'

# Check to confirm
filter(full, full$Title=='Master' & full$Age > 18)
##  [1] PassengerId Survived    Pclass      Name        Sex        
##  [6] Age         SibSp       Parch       Ticket      Fare       
## [11] Cabin       Embarked    Fsize       FsizeD      Title      
## <0 rows> (or 0-length row.names)

5.4 New Variable - Child/Adult

5.4.1 - Let’s create a new variable called ‘Child’ and see if children had a higher survival rate in the training data.
full$Child[full$Age < 18] <- 'Child'
full$Child[full$Age >= 18] <- 'Adult'
table(full[1:891,]$Child, full[1:891,]$Survived)
##        
##           0   1
##   Adult 478 276
##   Child  71  66

Observation- Being a child didn’t real help much when it comes to survivalon Titanic.

5.5 New Variable - Mother

5.5.1 - Let’s create a new variable ‘Mother’ and see if mothers were more likely to survive. We will assume a mother as someone who is 1) female, 2) is over 18, 3) has more than 0 children, and 4) does not have the title ‘Miss’
full$Mother <- 'Not Mother'
full$Mother[full$Sex=='female' & full$Age>18 & full$Parch>0 & full$Title != 'Miss'] <- 'Mother'
table(full[1:891,]$Mother, full[1:891,]$Survived)
##             
##                0   1
##   Mother      23  50
##   Not Mother 526 292

Observation- Being a child didn’t real help much when it comes to survivalon Titanic.

6. Prediction

6.1 Split the data

6.1.1 - Now that we are ready to predict survival on Titanic, lets split our data back to training and test sets. Before we do so however lets convert some of our variables to factor and then split the data.
# Convert variables to factor
factor_vars <- c('Pclass', 'Sex', 'FsizeD', 'Child', 'Mother')
full[factor_vars] <- lapply(full[factor_vars], function(x) as.factor(x))
str(full)
## 'data.frame':    1309 obs. of  17 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : num  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 23 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
##  $ Fsize      : num  2 2 1 2 1 1 1 5 3 2 ...
##  $ FsizeD     : Factor w/ 3 levels "Large","Single",..: 3 3 2 3 2 2 2 1 3 3 ...
##  $ Title      : Factor w/ 5 levels "Master","Miss",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Child      : Factor w/ 2 levels "Adult","Child": 1 1 1 1 1 1 1 2 1 2 ...
##  $ Mother     : Factor w/ 2 levels "Mother","Not Mother": 2 2 2 2 2 2 2 2 1 2 ...
# Split the data back to training and test set
train <- full[1:891,]
test <- full[892:1309,]

6.2 Build the models

6.2.1 - We’ll try some linear and non-linear algorithms:
  1. Linear method: Logistic Regression
  2. Non-Linear methods: SVM, kNN
  3. Trees and Rules: CART
  4. Ensembles of Trees: C5.0, Random Forest

We will use 10-fold cross validation with 3 repeats. Since this is a binary classification problem, we will use Accuracy and Kappa metrics for evaluating performance of the models.

Some algorithms perform a whole lot better with some basic data preprocessing. So we will center and scale the data in some algorithms.

library(caret)
## Loading required package: lattice
trainControl <- trainControl(method='repeatedcv', number=10, repeats=3)

# Logistic Regression model
set.seed(101)
fit.glm <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + 
                  Fare + Embarked + Title + FsizeD + Child + Mother,
                  data = train, method = 'glm', metric = 'Accuracy',
                  trControl = trainControl)

# KNN
set.seed(101)
fit.knn <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + 
                  Fare + Embarked + Title + FsizeD + Child + Mother,
                  data = train, method = 'knn', metric = 'Accuracy',
                  preProc=c("center", "scale"), trControl = trainControl)

# SVM Radial
set.seed(101)
fit.svmRadial <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp +
                  Parch + Fare + Embarked + Title + FsizeD + Child + Mother,
                  data = train, method = 'svmRadial', metric = 'Accuracy',
                  preProc=c("center", "scale"), trControl = trainControl,
                  fit=FALSE)
## Loading required package: kernlab
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
# CART model
set.seed(101)
fit.cart <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + 
                  Fare + Embarked + Title + FsizeD + Child + Mother,
                  data = train, method = 'rpart', metric = 'Accuracy',
                  trControl = trainControl)
## Loading required package: rpart
# Random Forest model
set.seed(101)
fit.rf <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + 
                  Fare + Embarked + Title + FsizeD + Child + Mother,
                  data = train, method = 'ranger', metric = 'Accuracy',
                  trControl = trainControl)
## Loading required package: e1071
## Loading required package: ranger
# C 5.0 model
set.seed(101)
fit.c50 <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + 
                  Fare + Embarked + Title + FsizeD + Child + Mother,
                  data = train, method = 'C5.0', metric = 'Accuracy',
                  trControl = trainControl)
## Loading required package: C50
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
# Let's compare the results of these algorithms:

results <- resamples(list(logistic=fit.glm, knn=fit.knn,
    svm=fit.svmRadial, cart=fit.cart, c50=fit.c50, rf=fit.rf))

# Table comparison
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: logistic, knn, svm, cart, c50, rf 
## Number of resamples: 30 
## 
## Accuracy 
##            Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## logistic 0.7222  0.7865 0.8045 0.8093  0.8427 0.8989    0
## knn      0.7111  0.7759 0.8090 0.8003  0.8202 0.8764    0
## svm      0.7444  0.8017 0.8268 0.8268  0.8499 0.8989    0
## cart     0.7500  0.7865 0.8090 0.8118  0.8399 0.8652    0
## c50      0.7753  0.8090 0.8315 0.8264  0.8427 0.8876    0
## rf       0.7444  0.7911 0.8202 0.8186  0.8539 0.8652    0
## 
## Kappa 
##            Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## logistic 0.4186  0.5252 0.5862 0.5890  0.6621 0.7822    0
## knn      0.3727  0.5196 0.5954 0.5697  0.6266 0.7338    0
## svm      0.4595  0.5783 0.6230 0.6272  0.6719 0.7822    0
## cart     0.4224  0.5333 0.5843 0.5856  0.6486 0.7010    0
## c50      0.5075  0.5811 0.6389 0.6247  0.6592 0.7566    0
## rf       0.4480  0.5543 0.6170 0.6087  0.6780 0.7176    0
# Let's plot the results of these algorithms:
bwplot(results)

dotplot(results)

Observation- Looking at the results and plots, we can say that accuracy of SVM model is better than other models. Lets use svm model to make final predictions.

6.3 Make the Prediction

6.3.1 - We’ll use the SVM model to make final predictions. Since we have used some preprocessing (center, scale) for our training data in SVM model, we will apply those to the test data as well before making final predictions.
# Apply preprocessing to test data
test$Survived <- as.factor(test$Survived)
preprocessParams <- preProcess(test, method=c('center', 'scale'))
testData <- predict(preprocessParams, test)

# Make final predictions
prediction <- predict(fit.svmRadial, testData)

# Generate predictions
confusionMatrix(testData$Survived, prediction)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 261 157
##          1   0   0
##                                         
##                Accuracy : 0.6244        
##                  95% CI : (0.576, 0.671)
##     No Information Rate : 0.6244        
##     P-Value [Acc > NIR] : 0.5218        
##                                         
##                   Kappa : 0             
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 1.0000        
##             Specificity : 0.0000        
##          Pos Pred Value : 0.6244        
##          Neg Pred Value :    NaN        
##              Prevalence : 0.6244        
##          Detection Rate : 0.6244        
##    Detection Prevalence : 1.0000        
##       Balanced Accuracy : 0.5000        
##                                         
##        'Positive' Class : 0             
## 
6.3.2 - Export predictions to a csv file.
# Create a data frame with two columns: PassengerId & Survived where Survived contains my predictions.

submit <- data.frame(PassengerId = test$PassengerId, Survived = prediction)

# Write the solution to a csv file 
write.csv(submit, file = "firstSVM_titanic.csv", row.names = FALSE)