A Titanic Journey

May 01, 2016

1 - Introduction

This project is a predictive data analysis practice based on an entry level Kaggle competition, which asks participants to predict the fate of the passengers based on their social class, name, gender, age, ticket fare and so on.

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 (68 %) passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Variable Descriptions

Variable	Desctiption
survival	Survival (0 = No; 1 = Yes)
pclass	Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name	Name
sex	Sex
age	Age
sibsp	Number of Siblings/Spouses Aboard
parch	Number of Parents/Children Aboard
ticket	Ticket Number
fare	Passenger Fare
cabin	Cabin
embarked	Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Special Notes

Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1). If the Age is Estimated, it is in the form of xx.5.

With respect to the family relation variables (i.e. sibsp and parch), some relations were ignored.
The following are the definitions used for sibsp and parch.

Name	Definition
Sibling	Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse	Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent	Mother or Father of Passenger Aboard Titanic
Child	Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.

2 - Exploratory Analysis

## Warning: package 'survival' was built under R version 3.2.5

After loading in the data, take a quick look at the structure of the data set.

## quick look of the data set
glimpse(data.comb)

Observations: 1,309
Variables: 12
$ PassengerId (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
$ Survived    (int) 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
$ Pclass      (int) 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
$ Name        (fctr) Braund, Mr. Owen Harris, Cumings, Mrs. John Bradl...
$ Sex         (fctr) male, female, female, female, male, male, male, m...
$ Age         (dbl) 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
$ SibSp       (int) 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
$ Parch       (int) 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
$ Ticket      (fctr) A/5 21171, PC 17599, STON/O2. 3101282, 113803, 37...
$ Fare        (dbl) 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
$ Cabin       (fctr) NA, C85, NA, C123, NA, NA, E46, NA, NA, NA, G6, C...
$ Embarked    (fctr) S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q...

From the summary above, the first thing that caught my eye is there are NAs, AKA missing values. Before I can make any statistical model, I need to deal with them first.

The table below shows the number and percentage of NAs in each variable.

	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Number and Percentage of NAs in Each Variable
No. of NAs	0	0	0	263	0	0	0	1	1014	2
% of NAs	0 %	0 %	0 %	20 %	0 %	0 %	0 %	0 %	77 %	0 %

From the table, we can see that there are 263 NAs (roughly 20%) in Age variable, 1 in Fare, 1014 (roughly 77%) in Cabin, and 2 in Embarked.

In this case, we can definitely find out a way to fill in the missing values in Age since it is only 20%, but for Cabin, with 77% of missing values, it would be impractical to fill in all of them. So the Cabin variable is dropped for now.

Before dealing with the NAs, let’s look at the distribution of the variables first.

2.1 - Categorical Variables Distribution

From the three plots above, we can conclude that:

If a passenger is from upper class, he has a much higher chance to survive than middle or lower class.
Female passengers have a better chance to survive than male, which is as expected.
It seems like passengers from Cherbourg has a higher chance to survive than the other two ports. Cannot think of any particular reason for this. More background investigation may be needed.

2.2 - Numerical Variables Distribution

What I can get from the plots above:

The first two plots are related to the family size. We can see that the less people a passenger was traveling with, the higher chance to survive.
The distribution of Age variable seems quite similar for different passenger fate. But we know that when the Titanic sank, “women and children first” rule was carried out. This is a very importtant information for the project, so I will keep Age variable in the model anyway.
The Boxplot for Fare suggests that survived people paid more than the deceased. However by looking at the histogram, we notice that people paid as high as $500 for the ticket. So I suspect this kind of ticket is group ticket. Family members or friends are likely to buy tickets like these.

Based on the three points above, I’ll create two new features, Family Size and Group Ticket, in part 3 Feature Engineering.

2.3 - Dealing with NAs (Missing Values)

Now it’s time to eliminate the missing values.

2.3.1 - Dealing with NA’s in `Age` Variable

We’ve seen the distribution of the Age variable, it looks normal. Actually by looking at the summary statistics, we know that the mean age is 29.88, and the median age is 28.

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.	NA’s
Summary Statistics of ‘Age’ Variable
0.17	21	28	29.88	39	80	263

The mean and median are very close. So the easiest way would be using the median value (28) to replace all the NAs. But it might not be the best option here.

By exploring the relationship between Age and other variables, we may find some more hints. For example, let’s look at the relationship between age and socio-economic status as shown below.

Hover over the boxplot to observe the distribution of Age for each social class. It’s not hard to find that the median age is increasing along with the social class. So we can use this important information to fill in the NAs in Age variable. If a person’s age is missing, look at his social class first, then use the corresponding median age of that class to replace the NA. It would be much more accurate than just using the overall median.

But in this particular data set, there is actually a better way to deal with the NAs in Age. Let’s take a look at some of the passengers’ names first.

## study 'Name' variable
head(data_comb$Name)

[1] "Braund, Mr. Owen Harris"                            
[2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
[3] "Heikkinen, Miss. Laina"                             
[4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
[5] "Allen, Mr. William Henry"                           
[6] "Moran, Mr. James"

All the passengers’ names follow the same pattern: first name + title + last name. And a person’s title can be very useful. It would probably suggest the person’s gender, age, and even social class. So not only for the purpose of imputing NAs, but also as part of the feature engineering, I decided to extract the titles from passengers’ names, and make a new variable called Title to store them in the data frame.

## create 'Title' variable
data_comb$Title <- sapply(data_comb$Name, function(x) {temp = strsplit(x, split = "[,.]")
                                                       temp[[1]][2]})
data_comb$Title <- sub(" ", "", data_comb$Title)

After doing so, let’s take a closer look at the titles. The table below shows the number of passengers with each title, including the number of missing values in Age, and it also gives us the mean and median age for each title.

	N	Missing	Mean	Median
Capt	1	0	70	70
Col	4	0	54	54.5
Don	1	0	40	40
Dona	1	0	39	39
Dr	7	1	43.57	49
Jonkheer	1	0	38	38
Lady	1	0	48	48
Major	2	0	48.5	48.5
Master	53	8	5.48	4
Miss	210	50	21.77	22
Mlle	2	0	24	24
Mme	1	0	24	24
Mr	581	176	32.25	29
Mrs	170	27	36.99	35.5
Ms	1	1	28	28
Rev	8	0	41.25	41.5
Sir	1	0	49	49
the Countess	1	0	33	33
ALL	1046	263	29.88	28

Now let’s use the median age from each title to replace the missing values in that title gruop if there is at least one NA.

## replace missing age values with median age of the corresponding title
title.na <- c("Dr", "Master", "Miss", "Mr", "Mrs", "Ms")

medAge_byTitle <- data_comb %>%
        group_by(Title) %>%
        summarise(medAge = median(Age, na.rm = TRUE))

for (ttl in title.na) {
        data_comb[data_comb$Title == ttl & is.na(data_comb$Age), "Age"] =
                medAge_byTitle[medAge_byTitle$Title == ttl, "medAge"]
}

Now if we look at the summary statistics of Age variable again, we will find there is no NA left.

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
Summary Statistics of ‘Age’ Variable w/o NAs
0.17	22	29	29.43	35.5	80

2.3.2 - Dealing with Empty Cells in `Embarked` Variable

C	Q	S	NA’s
Summary Statistics of ‘Embarked’ Variable
270	123	914	2

We can see that there are only 2 missing values in the Embarked variable. We can simply fill in ‘S’ for those two empty cells since most of the passengers are embarked from port S (Southampton). This is called mode in statistics.

## replace empty cells with mode which is "S"
data_comb$Embarked[is.na(data_comb$Embarked)] = "S"

Look again, no NA left.

C	Q	S
Summary Statistics of ‘Embarked’ Variable w/o NAs
270	123	916

2.3.3 - Dealing with `Fare` Variable

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.	NA’s
Summary Statistics of ‘Fare’ Variable
0	7.896	14.45	33.3	31.28	512.3	1

The Fare variable is a little bit tricky. Although there is only 1 missing value, we can see from the summary statistics that there are some 0 fares in the data set. At first I thought these would be the ticket for baby passengers, they are just too young to be charged. But digging into the details, it turned out differently.

## check the zero values in 'fare'
head(filter(select(data_comb, -Survived, -Fate), Fare == 0 | is.na(Fare)))

  PassengerId Pclass                            Name  Sex Age SibSp Parch Ticket Fare Embarked Title
1         180  Lower             Leonard, Mr. Lionel male  36     0     0   LINE    0        S    Mr
2         264  Upper           Harrison, Mr. William male  40     0     0 112059    0        S    Mr
3         272  Lower    Tornquist, Mr. William Henry male  25     0     0   LINE    0        S    Mr
4         278 Middle     Parkes, Mr. Francis "Frank" male  29     0     0 239853    0        S    Mr
5         303  Lower Johnson, Mr. William Cahoone Jr male  19     0     0   LINE    0        S    Mr
6         414 Middle  Cunningham, Mr. Alfred Fleming male  29     0     0 239853    0        S    Mr

We can clearly see that these people with 0 fares are not babies. So I guess it is some kind of error. I will treat the zeroes the same as NAs.

Looking at the distribution of the Fare variable by passengers’ social class, we can see the median fare are $ 61.66, $ 15.14 and $ 8.13 for upper, middle and lower class respectively.

Now, let’s use these values to replace the missing values in Fare for each passenger class.

## replace the only missing value of 'Fare' and the 0 fare values
medFare_byPclass <- data_comb %>%
        filter(Fare != 0) %>%
        group_by(Pclass) %>%
        summarise(medFare = median(Fare, na.rm = TRUE))

for (cls in medFare_byPclass$Pclass) {
        data_comb[data_comb$Pclass == cls & (data_comb$Fare == 0 | is.na(data_comb$Fare)), "Fare"] =
                medFare_byPclass[medFare_byPclass$Pclass == cls, "medFare"]
}

After the imputation, there is no NA left in Fare.

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
Summary Statistics of ‘Fare’ Variable w/o NAs
3.171	7.925	14.5	33.7	31.39	512.3

3 - Feature Engineering

Now, I will create some new features beased on the exploratory analysis.

Totally four new variables are created: Title, famSize, group, and wom_chd. I will explain each one of them step by step.

3.1 - Title Group

From part 2.3.1, we can clearly see that for some titles, there are only a few people on the ship. We need to merge the similar ones into groups.

Let’s see what the distribution of Age would look like across all these titles. It may give us some idea of how to create groups for titles.

From the plot above we can see that:

For females, “Mlle”, “Mme”, “the Countess”, “Dona”, “Lady”, and “Dr” are all upper class titles; “Ms” only appears in middle or lower class, and its age is closer to “Mrs”.
For males, “Jonkheer”, “Don”, “Rev”, “Major”, “Dr”, “Sir”, “Col”, and “Capt” are either middle or upper class.

So I will put “Ms” in the same group with “Mrs”, and all the other title mentioned above will be in a new title group called “Noble” assuming that they all share similar social class and fate.

## group titles
data_comb$Title <- as.character(data_comb$Title)
data_comb[data_comb$Title == "Ms", "Title"] = "Mrs"
data_comb[data_comb$Title %in% c("Mlle", "Mme", "the Countess", "Dona", "Lady", "Jonkheer", "Don", "Rev", "Major", "Dr", "Sir", "Col", "Capt"), "Title"] = "Noble"

Now the number of passengers with each new title is more representative. And the age distribution within each title group is much more distinguishable.

Master	Miss	Mr	Mrs	Noble
Number of Passengers with Each Title
61	260	757	199	32

3.2 - Family Size

This new feature is relatively easy to build. A passenger’s family size equals to the number of siblings/spouses plus number of parents/children plus one which himself.

## create 'famSize' variable
data_comb$famSize <- data_comb$SibSp + data_comb$Parch + 1

From the distribution of passengers’ family size, we can observe that the median family size for those who survived is 2 which will be used in the next part of my feature engineering.

3.3 - Group Ticket

The Ticket variable is quite interesting. The total number of passengers is 1309, however there are only 929 unique tickets.

This indicates that a lot of people shared the same ticket. I will call it Group Ticket in this project, and in this part, I will biuld a new variable named group to represent these feature.

From last part of my feature engineering, I have already mentioned that the median family size for the survived is 2. So in this new variable, I will calculate the number of passengers with each ticket, and if that number is greater than 2, I will assign group a value “Yes”, and “No” vice versa.

## calculate the number of passengers with each ticket, and filter out those greater than 2
groupTKT <- data_comb %>%
        group_by(Ticket) %>%
        summarise(N = n()) %>%
        filter(N > 2) %>%
        arrange(desc(N))

## create 'group' variable
data_comb$group = "No"
data_comb$group[which(data_comb$Ticket %in% groupTKT$Ticket)] = "Yes"
data_comb$group <- as.factor(data_comb$group)

It turns out about one quarter of the passengers match my group criteria.

	No	Yes
Group Ticket
No. of Passengers	977	332
% of Passengers	74.64 %	25.36 %

3.4 - Women and Children

Since we have already known that women and children first rule was carried out in the incident, I will create a feature directly related to women and children. So if a passenger is either female or less than 18 years old, I will assign wom_chd a value “Yes”, and “No” vice versa.

## create 'wom_chd' variable
data_comb$wom_chd = "No"
data_comb$wom_chd[which(data_comb$Sex == "female" | data_comb$Age < 18)] = "Yes"
data_comb$wom_chd <- as.factor(data_comb$wom_chd)

There are nearly 40% passengers are either women or children. But we know the total survival rate is only roughly 30%. We will see how well this feature can help us predict the fate of the passengers in part 4 Modeling.

	No	Yes
Women and Children
No. of Passengers	753	556
% of Passengers	57.52 %	42.48 %

At last, let’s take a look at the relationship of passengers’ fate and the two new variables I just created.

We can see that passengers taking a group ticket, or being women or children, have a higher chance to survive.

4. Modeling

After exploring the patterns and creating new features, now I will build statistical models to predict the fate of the passengers in the test data set.

Three machine learning methods are used in this project:

I will exaplain each one of them respectively. But before diving into the modeling process, I split the training data set into two parts: a new “training”" set and a new “testing” set.

## set seed and split the training set
set.seed(123)
trainIndex <- createDataPartition(data_train$Survived, p = 0.8, list = FALSE)
training <- data_train[trainIndex, ]
testing <- data_train[-trainIndex, ]

So, with each method, the model is built on the data set named “traning”, and then be used to predict the fate of the passengers in the “testing” data set. Since we already know the fate in the “testing” data set, we can evalute the performance of the model, and improve the result before it is applied to the true test set and submission to Kaggle.

4.1 Simple Logistic Regression

The first model I built is a simple Logistic Regression that uses the original variables only. I want to know how well it would be without my feature engineering.

## simple logistic regression
LR_model.train <- glm(Fate ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                      data = training, family = binomial(link = "logit"))

The ANOVA table is shown below. We can see that the NULL Deviance is 950.8649 and the Residual Deviance is 627.8483. Pclass, Sex, Age and SibSp are all significant variables.

	Df	Deviance	Resid. Df	Resid. Dev	Pr(>Chi)
Analysis of Deviance Table
NULL			713	950.8649
Pclass	2	95.7096	711	855.1554	0
Sex	1	197.4841	710	657.6713	0
Age	1	13.4235	709	644.2478	2e-04
SibSp	1	12.8114	708	631.4364	3e-04
Parch	1	0.4308	707	631.0056	0.5116
Fare	1	0.7293	706	630.2763	0.3931
Embarked	2	2.428	704	627.8483	0.297

Let’s use it to predict the “testing” data set.

## prediction
LR_pred.test <- predict(LR_model.train, testing, type = "response")
testing$Fate_LR <- ifelse(LR_pred.test <= 0.5, "Deceased", "Survived")

confusionMatrix(testing$Fate_LR, testing$Fate, positive = "Survived")

Confusion Matrix and Statistics

          Reference
Prediction Deceased Survived
  Deceased       89       15
  Survived       20       53
                                          
               Accuracy : 0.8023          
                 95% CI : (0.7359, 0.8582)
    No Information Rate : 0.6158          
    P-Value [Acc > NIR] : 7.432e-08       
                                          
                  Kappa : 0.5878          
 Mcnemar's Test P-Value : 0.499           
                                          
            Sensitivity : 0.7794          
            Specificity : 0.8165          
         Pos Pred Value : 0.7260          
         Neg Pred Value : 0.8558          
             Prevalence : 0.3842          
         Detection Rate : 0.2994          
   Detection Prevalence : 0.4124          
      Balanced Accuracy : 0.7980          
                                          
       'Positive' Class : Survived

From the Confusion Matrix, we can see the overall Accuracy is 80.23%, Sensitivity is 77.94% and Specificity is 81.65%. Specificity is lower, but as a first model it is not so bad actually.

Now I will add in all the new features I created in part 3 to enhance the prediction.

## add in new features to the model
LR_model.trainPlus <- glm(Fate ~ Pclass + Sex + Age + Fare + Embarked + ## original features
                                 famSize + Title + group + wom_chd,     ## new features
                          data = training, family = binomial(link = "logit"))

Look at the ANOVA table again, the Residual Deviance is decreased to 575.514. And among all the new features I created, only wom_chd is insignificant. This is probably because most of the gender and age information can be captured by Sex, Age and Title, in which case, wom_chd could become secondary.

	Df	Deviance	Resid. Df	Resid. Dev	Pr(>Chi)
Analysis of Deviance Table
NULL			713	950.8649
Pclass	2	95.7096	711	855.1554	0
Sex	1	197.4841	710	657.6713	0
Age	1	13.4235	709	644.2478	2e-04
Fare	1	0.0015	708	644.2463	0.9693
Embarked	2	4.7335	706	639.5129	0.0938
famSize	1	10.3762	705	629.1367	0.0013
Title	4	44.7831	701	584.3535	0
group	1	8.8245	700	575.529	0.003
wom_chd	1	0.015	699	575.514	0.9024

Let’s use this model to predict the “tesing” data set again.

## prediction with new features
LR_pred.testPlus <- predict(LR_model.trainPlus, testing, type = "response")
testing$Fate_LRPlus <- ifelse(LR_pred.testPlus <= 0.5, "Deceased", "Survived")

confusionMatrix(testing$Fate_LRPlus, testing$Fate, positive = "Survived")

Confusion Matrix and Statistics

          Reference
Prediction Deceased Survived
  Deceased       89       13
  Survived       20       55
                                          
               Accuracy : 0.8136          
                 95% CI : (0.7483, 0.8681)
    No Information Rate : 0.6158          
    P-Value [Acc > NIR] : 1.058e-08       
                                          
                  Kappa : 0.6135          
 Mcnemar's Test P-Value : 0.2963          
                                          
            Sensitivity : 0.8088          
            Specificity : 0.8165          
         Pos Pred Value : 0.7333          
         Neg Pred Value : 0.8725          
             Prevalence : 0.3842          
         Detection Rate : 0.3107          
   Detection Prevalence : 0.4237          
      Balanced Accuracy : 0.8127          
                                          
       'Positive' Class : Survived

We can see that the overall Accuracy is increased to 81.36%. Specificity is the same, but Sensitivity is increased to 80.88%. This means my feature engineering works well. With all the new features, I successfully predicted more True Positives (people that survived are correctly predicted as survived) than my last model.

From the model comparison below, it is obvious that the second model is better than the first one.

	Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
Logistic Regression Models Comparison
1	704	627.8483
2	699	575.514	5	52.3343	0

After applying the better model to the true test set and submitting the prediction results to Kaggle, it scored

0.79426/1.00000.

4.2 Logistic Regression with Cross Validation

Cross Validation is a technique being used widely in data science. The purpose of doing so is to prevent overfitting and find the best possible model for the given data set.

In this project, I use caret package to perform cross validation for the Logistic Regression model.

## set up cross validation
cv.ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE,
                        savePredictions = TRUE, summaryFunction = twoClassSummary)
## logistic regression with cross validation
LR_model.traincv <- train(Fate ~ Pclass + Sex + Age + Fare + Embarked +
                                 famSize + Title + group + wom_chd,
                          data = training, method = "glm", family = "binomial",
                          metric = "ROC", trControl = cv.ctrl)

summary(LR_model.traincv)


Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4018  -0.5480  -0.3858   0.5277   2.4914  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.926e+01  5.025e+02   0.038 0.969426    
PclassMiddle -1.262e+00  3.591e-01  -3.514 0.000441 ***
PclassLower  -2.357e+00  3.460e-01  -6.812 9.60e-12 ***
Sexmale      -1.464e+01  5.025e+02  -0.029 0.976752    
Age          -1.844e-02  1.105e-02  -1.669 0.095178 .  
Fare         -6.766e-04  2.753e-03  -0.246 0.805828    
EmbarkedQ    -4.807e-02  4.448e-01  -0.108 0.913938    
EmbarkedS    -3.543e-01  2.834e-01  -1.250 0.211237    
famSize      -5.942e-01  1.137e-01  -5.226 1.73e-07 ***
TitleMiss    -1.483e+01  5.025e+02  -0.030 0.976454    
TitleMr      -3.309e+00  9.651e-01  -3.429 0.000607 ***
TitleMrs     -1.414e+01  5.025e+02  -0.028 0.977559    
TitleNoble   -3.205e+00  1.146e+00  -2.796 0.005177 ** 
groupYes      1.098e+00  3.705e-01   2.964 0.003032 ** 
wom_chdYes   -1.017e-01  8.371e-01  -0.121 0.903321    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 575.51  on 699  degrees of freedom
AIC: 605.51

Number of Fisher Scoring iterations: 13

From the model summary, we can see that the Residual Deviance is the same as my best model so far. The prediction results are the same as well.

## prediction with cross validation
LR_pred.testcv <- predict(LR_model.traincv, testing)

confusionMatrix(LR_pred.testcv, testing$Fate, positive = "Survived")

Confusion Matrix and Statistics

          Reference
Prediction Deceased Survived
  Deceased       89       13
  Survived       20       55
                                          
               Accuracy : 0.8136          
                 95% CI : (0.7483, 0.8681)
    No Information Rate : 0.6158          
    P-Value [Acc > NIR] : 1.058e-08       
                                          
                  Kappa : 0.6135          
 Mcnemar's Test P-Value : 0.2963          
                                          
            Sensitivity : 0.8088          
            Specificity : 0.8165          
         Pos Pred Value : 0.7333          
         Neg Pred Value : 0.8725          
             Prevalence : 0.3842          
         Detection Rate : 0.3107          
   Detection Prevalence : 0.4237          
      Balanced Accuracy : 0.8127          
                                          
       'Positive' Class : Survived

Since cross validation didn’t improve my original model, I decide to use a different kind of algorithm called Support Vector Machine, AKA SVM.

4.3 Support Vector Machine

The svm function from e1071 package can build SVM models in R.

## svm
SVM_model.train <- svm(Fate ~ Pclass + Sex + Age + Fare + Embarked +
                               famSize + Title + group + wom_chd,
                       data = training, probability = TRUE)

Use it to predict directly. We can see that the overall Accuracy is inreased to 83.62%, Sensitivity is the same as my Logistic Regression model, but Specificity is now increased to 85.32%.

## prediction with svm
SVM_pred.test <- predict(SVM_model.train, testing)

confusionMatrix(SVM_pred.test, testing$Fate, positive = "Survived")

Confusion Matrix and Statistics

          Reference
Prediction Deceased Survived
  Deceased       93       13
  Survived       16       55
                                          
               Accuracy : 0.8362          
                 95% CI : (0.7732, 0.8874)
    No Information Rate : 0.6158          
    P-Value [Acc > NIR] : 1.38e-10        
                                          
                  Kappa : 0.6566          
 Mcnemar's Test P-Value : 0.7103          
                                          
            Sensitivity : 0.8088          
            Specificity : 0.8532          
         Pos Pred Value : 0.7746          
         Neg Pred Value : 0.8774          
             Prevalence : 0.3842          
         Detection Rate : 0.3107          
   Detection Prevalence : 0.4011          
      Balanced Accuracy : 0.8310          
                                          
       'Positive' Class : Survived

Just by looking at the numbers, it is a huge improvement to my Logistic Regression models. Submitting to Kaggle, this model scored:

0.78947/1.00000

Unfortunately, the score didn’t increase. So next, I want to tune the model a little bit using the same Cross Validation technique, and see if I can improve the score.

A good thing about the e1071 package is that it also provides the tune function for tuning svm models.

## tuning
SVM_model.traincv <- tune(svm, Fate ~ Pclass + Sex + Age + Fare + Embarked +
                                      famSize + Title + group + wom_chd,
                          data = training, probability = TRUE, kernel = "radial",
                          ranges = list(cost = c(0.1, 1, 10, 100, 1000),
                                        gamma = c(0.5, 1, 2, 3, 4)))
## get the best model
SVM_model.traincv$best.model


Call:
best.tune(method = svm, train.x = Fate ~ Pclass + Sex + Age + Fare + Embarked + famSize + Title + group + wom_chd, data = training, 
    ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.5, 1, 2, 3, 4)), probability = TRUE, kernel = "radial")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.5 

Number of Support Vectors:  366

We can see that the best svm model is the one with cost 1 and gamma 0.5. Use the best model to predict again, the overall Accuracy is 83.05%, decreased, and Sensitivity is decreased to 76.47%, but Specificity is increased to 87.16%.

## prediction with best svm model
SVM_pred.testcv <- predict(SVM_model.traincv$best.model, testing)

confusionMatrix(SVM_pred.testcv, testing$Fate, positive = "Survived")

Confusion Matrix and Statistics

          Reference
Prediction Deceased Survived
  Deceased       95       16
  Survived       14       52
                                         
               Accuracy : 0.8305         
                 95% CI : (0.767, 0.8826)
    No Information Rate : 0.6158         
    P-Value [Acc > NIR] : 4.325e-10      
                                         
                  Kappa : 0.6398         
 Mcnemar's Test P-Value : 0.8551         
                                         
            Sensitivity : 0.7647         
            Specificity : 0.8716         
         Pos Pred Value : 0.7879         
         Neg Pred Value : 0.8559         
             Prevalence : 0.3842         
         Detection Rate : 0.2938         
   Detection Prevalence : 0.3729         
      Balanced Accuracy : 0.8181         
                                         
       'Positive' Class : Survived

Sensitivity is decreased a lot which is not a good sign. Submitting to Kaggle, this model only scored:

0.77990/1.00000

It becomes even lower… well, since cross validation didn’t improve my SVM model either, I decide to try something else.

4.4 Random Forest

Random Forest is my favorite machine learning algorithm so far. I like to use the randomForest function from the randomForest package.

## random forest
RF_model.train <- randomForest(Fate ~ Pclass + Sex + Age + Fare + Embarked +
                                      famSize + Title + group + wom_chd,
                               data = training, importance = TRUE, ntree = 2000)

The plot below shows the importance of the variables judged by the mean decrease accuracy. A variable is considered the most important if the accuracy of the model without it decreased the most compared to the full model.

Predict using this model, we get the same result as my best SVM model.

## prediction with random forest
RF_pred.test <- predict(RF_model.train, testing)

confusionMatrix(RF_pred.test, testing$Fate, positive = "Survived")

Confusion Matrix and Statistics

          Reference
Prediction Deceased Survived
  Deceased       95       16
  Survived       14       52
                                         
               Accuracy : 0.8305         
                 95% CI : (0.767, 0.8826)
    No Information Rate : 0.6158         
    P-Value [Acc > NIR] : 4.325e-10      
                                         
                  Kappa : 0.6398         
 Mcnemar's Test P-Value : 0.8551         
                                         
            Sensitivity : 0.7647         
            Specificity : 0.8716         
         Pos Pred Value : 0.7879         
         Neg Pred Value : 0.8559         
             Prevalence : 0.3842         
         Detection Rate : 0.2938         
   Detection Prevalence : 0.3729         
      Balanced Accuracy : 0.8181         
                                         
       'Positive' Class : Survived

However, after submitting to Kaggle, this model only scored:

0.77512/1.00000

which is very frustrating… but the bright side is that you cannot go any lower when you’ve hit the bottom already, right? ^_^

4.5 Condition Inference Trees

A good thing about Random Forest is that it is an ensemble algorithm based on Decision Tree, and cforest is a function from the party package which implements the random forest and bagging ensemble algorithms utilizing Conditional Inference Trees as base learners. They make their decisions in slightly different ways, using a statistical test rather than a purity measure, but the basic construction of each tree is fairly similar.

## conditional inference trees
CF_model.train <- cforest(Fate ~ Pclass + Sex + Age + Fare + Embarked +
                                 famSize + Title + group + wom_chd,
                          data = training, controls = cforest_unbiased(ntree = 2000, mtry = 3))

The prediction results show an overall Accuracy 83.62%, Sensitivity is 77.94% and Specificity is 87.16%.

## prediction with conditional inference trees
CF_pred.test <- predict(CF_model.train, testing, OOB = TRUE)

confusionMatrix(CF_pred.test, testing$Fate, positive = "Survived")

Confusion Matrix and Statistics

          Reference
Prediction Deceased Survived
  Deceased       95       15
  Survived       14       53
                                          
               Accuracy : 0.8362          
                 95% CI : (0.7732, 0.8874)
    No Information Rate : 0.6158          
    P-Value [Acc > NIR] : 1.38e-10        
                                          
                  Kappa : 0.6528          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.7794          
            Specificity : 0.8716          
         Pos Pred Value : 0.7910          
         Neg Pred Value : 0.8636          
             Prevalence : 0.3842          
         Detection Rate : 0.2994          
   Detection Prevalence : 0.3785          
      Balanced Accuracy : 0.8255          
                                          
       'Positive' Class : Survived

Submiiting to Kaggle, I got my best score:

0.80383/1.00000

5. Model Comparison

When the response variable is binary, Receiver Operating Characteristic, AKA ROC curve, and the Area Under the Curve (AUC) are commonly used to evaluate the model performance.

A ROC curve is composed of the ture positive rate (Sensitivity) vs. false positive rate (1 - Specificity) for each probablity threshold ranged from 0 to 1. The diagolnal line indicates the result of random guessing, for example, toss a coin. So with a binary response variable, you can never do worse than random guessing which has a AUC of 0.5.

The ROC curve and AUC of the models from part 4 are drawn below. From the comparison, we can clearly see that even though there are some differences among these algorithms, the AUC are actually very close. Random Forest has the highest AUC value, while SVM is almost the same.

From the comparison plot below, we can see that the algorithm with highest AUC is Random Forest, however its Kaggle score is the lowest; and with the highest Kaggle score, Conditional Inference Trees has the second smallest value of AUC.

This truely means something. Actually, if there is a golden rule for machine learning algorithms, it would be generalization.

6. Summary and Citation

In this project, I practiced:

exploratory data analysis with dplyr, ggplot2, and plotly packages
modeling with caret, e1071, randomForest, and party packages
model comparison with ROCR package for ROC curve and AUC
reporting with knitr and htmlTable packages

Citation:

Titanic: Getting Started With R by Trevor Stephens

Titanic Survival Prediction: One Approach to Deriving a Model by Curt Wehrley