1. Describe the Problem and Objectives

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, I need to build predictive models that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

2. Describe the Data

Survival: 0 = No, 1 = Yes
Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Sex: Sex of the passenger (female/male)
Age: Age in years (if the age is fractional, it is less than 1, if the age is estimated, it is in the form of xx.5)
SibSp: # of siblings (brother/sister/stepbrother/stepsister) or spouses (husband/wife) aboard
ParCh: # of parents (mother/father) or children (daughter/son/stepdaughter/stepson) aboard
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

setwd("/Users/shanshantong/Desktop/Skills/1.Tools/R/6.Datasets")
titanic.train <- read.csv("titanic.train.csv")
titanic.test <- read.csv("titanic.test.csv")

dim(titanic.train) # 891 rows, 12 columns
dim(titanic.test) # 418 rows, 11 columns

Use str() function to display the structure of the training and test datasets, and summary() function to summarize the datasets.

str(titanic.train)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

str(titanic.test)

## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...

summary(titanic.train) # summarize the data

##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186

summary(titanic.test) # display the structure of the data

##   PassengerId         Pclass     
##  Min.   : 892.0   Min.   :1.000  
##  1st Qu.: 996.2   1st Qu.:1.000  
##  Median :1100.5   Median :3.000  
##  Mean   :1100.5   Mean   :2.266  
##  3rd Qu.:1204.8   3rd Qu.:3.000  
##  Max.   :1309.0   Max.   :3.000  
##                                  
##                                         Name         Sex     
##  Abbott, Master. Eugene Joseph            :  1   female:152  
##  Abelseth, Miss. Karen Marie              :  1   male  :266  
##  Abelseth, Mr. Olaus Jorgensen            :  1               
##  Abrahamsson, Mr. Abraham August Johannes :  1               
##  Abrahim, Mrs. Joseph (Sophie Halaut Easu):  1               
##  Aks, Master. Philip Frank                :  1               
##  (Other)                                  :412               
##       Age            SibSp            Parch             Ticket   
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   PC 17608:  5  
##  1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.0000   113503  :  4  
##  Median :27.00   Median :0.0000   Median :0.0000   CA. 2343:  4  
##  Mean   :30.27   Mean   :0.4474   Mean   :0.3923   16966   :  3  
##  3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.0000   220845  :  3  
##  Max.   :76.00   Max.   :8.0000   Max.   :9.0000   347077  :  3  
##  NA's   :86                                        (Other) :396  
##       Fare                     Cabin     Embarked
##  Min.   :  0.000                  :327   C:102   
##  1st Qu.:  7.896   B57 B59 B63 B66:  3   Q: 46   
##  Median : 14.454   A34            :  2   S:270   
##  Mean   : 35.627   B45            :  2           
##  3rd Qu.: 31.500   C101           :  2           
##  Max.   :512.329   C116           :  2           
##  NA's   :1         (Other)        : 80

Now, I merged the independent variables of training and test sets into a full data set, which is convenient for me to manipulate the data on both sets.

titanic.full <- rbind(titanic.train[,-c(1,2)], titanic.test[,-1])
summary(titanic.full)

##      Pclass                                    Name          Sex     
##  Min.   :1.000   Connolly, Miss. Kate            :   2   female:466  
##  1st Qu.:2.000   Kelly, Mr. James                :   2   male  :843  
##  Median :3.000   Abbing, Mr. Anthony             :   1               
##  Mean   :2.295   Abbott, Mr. Rossmore Edward     :   1               
##  3rd Qu.:3.000   Abbott, Mrs. Stanton (Rosa Hunt):   1               
##  Max.   :3.000   Abelson, Mr. Samuel             :   1               
##                  (Other)                         :1301               
##       Age            SibSp            Parch            Ticket    
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.000   CA. 2343:  11  
##  1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000   1601    :   8  
##  Median :28.00   Median :0.0000   Median :0.000   CA 2144 :   8  
##  Mean   :29.88   Mean   :0.4989   Mean   :0.385   3101295 :   7  
##  3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000   347077  :   7  
##  Max.   :80.00   Max.   :8.0000   Max.   :9.000   347082  :   7  
##  NA's   :263                                      (Other) :1261  
##       Fare                     Cabin      Embarked
##  Min.   :  0.000                  :1014    :  2   
##  1st Qu.:  7.896   C23 C25 C27    :   6   C:270   
##  Median : 14.454   B57 B59 B63 B66:   5   Q:123   
##  Mean   : 33.295   G6             :   5   S:914   
##  3rd Qu.: 31.275   B96 B98        :   4           
##  Max.   :512.329   C22 C26        :   4           
##  NA's   :1         (Other)        : 271

3. Manipulate the data

In this section, I will utilize the idea of feature engineering, basically try to create additional relevent features from the existing raw features in the data, and to increase the predictive power of the learning algorithms.

3.1 Factor Pclass

Based on the description of the Data in Part2, Pclass is an ordered factor variable, therefore, I will transform it from a numeric variable to a categorical variable.

titanic.full$Pclass <- factor(titanic.full$Pclass, 
                                 levels = c("3", "2", "1"),
                                 labels = c("3rd", "2nd", "1st"),
                                 ordered = T)

table(titanic.full$Pclass)

## 
## 3rd 2nd 1st 
## 709 277 323

3.2 Bin the Age

The idea of binning the Age variable is that I believe children and the old people are more likely to be saved, resulting higher survival rate.

library(dplyr)
titanic.full <- titanic.full %>%
  mutate(Aged = case_when(
    Age < 13 ~ "Child",
    between(Age, 13, 18) ~ "Teenager",
    between(Age, 19, 50) ~ "Adult",
    Age > 50 ~ "Old"
  ))

3.3 Add Titles

We could not get any useful information directly from variable Name, however, we could extract the Title from Name to support further analysis.

titanic.full$Title <-  gsub("^.*, (.*?)\\..*$", "\\1", titanic.full$Name)
table(titanic.full$Title)

## 
##         Capt          Col          Don         Dona           Dr 
##            1            4            1            1            8 
##     Jonkheer         Lady        Major       Master         Miss 
##            1            1            2           61          260 
##         Mlle          Mme           Mr          Mrs           Ms 
##            2            1          757          197            2 
##          Rev          Sir the Countess 
##            8            1            1

Based on the result above, the most frequent variables are “Master”, “Miss”, “Mr”, and “Mrs”. Therefore, I will explore the observations with rare titles and check if I could group them into the 4 most frequent titles.

titanic.full %>%
  filter(!Title %in% c("Master", "Miss", "Mr", "Mrs")) %>%
  select(Pclass, Title, Sex, Age, SibSp, Parch, Fare, Embarked) %>%
  arrange(Sex)

##    Pclass        Title    Sex Age SibSp Parch     Fare Embarked
## 1     1st          Mme female  24     0     0  69.3000        C
## 2     2nd           Ms female  28     0     0  13.0000        S
## 3     1st         Lady female  48     1     0  39.6000        C
## 4     1st         Mlle female  24     0     0  69.3000        C
## 5     1st         Mlle female  24     0     0  49.5042        C
## 6     1st the Countess female  33     0     0  86.5000        S
## 7     1st           Dr female  49     0     0  25.9292        S
## 8     3rd           Ms female  NA     0     0   7.7500        Q
## 9     1st         Dona female  39     0     0 108.9000        C
## 10    1st          Don   male  40     0     0  27.7208        C
## 11    2nd          Rev   male  42     0     0  13.0000        S
## 12    2nd          Rev   male  51     0     0  12.5250        S
## 13    1st           Dr   male  44     2     0  90.0000        Q
## 14    2nd          Rev   male  54     1     0  26.0000        S
## 15    2nd           Dr   male  54     0     0  14.0000        S
## 16    2nd           Dr   male  23     0     0  10.5000        S
## 17    1st        Major   male  52     0     0  30.5000        S
## 18    1st        Major   male  45     0     0  26.5500        S
## 19    1st          Sir   male  49     1     0  56.9292        C
## 20    2nd          Rev   male  57     0     0  12.3500        Q
## 21    1st           Dr   male  32     0     0  30.5000        C
## 22    1st          Col   male  56     0     0  35.5000        C
## 23    1st           Dr   male  50     2     0 133.6500        S
## 24    1st          Col   male  60     0     0  26.5500        S
## 25    1st         Capt   male  70     1     1  71.0000        S
## 26    1st           Dr   male  NA     0     0  39.6000        C
## 27    1st     Jonkheer   male  38     0     0   0.0000        S
## 28    2nd          Rev   male  28     0     1  33.0000        S
## 29    2nd          Rev   male  27     0     0  13.0000        S
## 30    1st          Col   male  53     0     0  28.5000        C
## 31    2nd          Rev   male  30     1     1  26.0000        S
## 32    2nd          Rev   male  41     0     0  13.0000        S
## 33    1st          Col   male  47     1     0 227.5250        C
## 34    1st           Dr   male  53     1     1  81.8583        S

library(forcats)
titanic.full <- titanic.full %>%
  mutate(Title = fct_collapse(Title,
                              "Miss" = c("Miss", "Ms", "Lady", "Mlle", "Mme"),
                              "Mr" = c("Mr", "Sir"),
                              "Mrs" = "Mrs",
                              "Master" = "Master",
                              )) %>%
  mutate(Title = fct_other(Title, keep = c("Miss", "Mr", "Mrs", "Master"), other_level = "rare title"))

table(titanic.full$Title)

## 
##       Miss     Master         Mr        Mrs rare title 
##        266         61        758        197         27

3.4 Add Family Size

Since siblings, spouse, parents, children all express “family size” information, I add a new categorical variable “FamilySize” to display this information.

titanic.full <- titanic.full %>%
  mutate(FamilySize = SibSp + Parch + 1,) %>%
  mutate(FamilySized = case_when(
    FamilySize == 1 ~ "single",
    between(FamilySize, 2, 4) ~ "small",
    FamilySize >= 5 ~ "large"
  )) %>%
  mutate(FamilySized = fct_relevel(FamilySized, "single", "small", "large"))

table(titanic.full$FamilySize)

## 
##   1   2   3   4   5   6   7   8  11 
## 790 235 159  43  22  25  16   8  11

table(titanic.full$FamilySized)

## 
## single  small  large 
##    790    437     82

3.5 Add Ticket Type

For Ticket variable, it’s similary as Name variable, meaning that we could not directly extract useful information from it. However, I found that sometimes the ticket number only shows once, while other times, ticket numbers shows multiple times. Therefore, I will add a new categorical variable named TicketType that shows the frequency of the Ticket number.

ticket.unique <- rep(0, nrow(titanic.full))
tickets <- unique(titanic.full$Ticket) # list each unique the Ticket number 1 time

for (i in 1:length(tickets)) {
  current.ticket <- tickets[i]
  party.indexes <- which(titanic.full$Ticket == current.ticket)
  for (k in 1:length(party.indexes)) {
    ticket.unique[party.indexes[k]] <- length(party.indexes)
  }
}

titanic.full <- titanic.full %>%
  mutate(TicketGroup = ticket.unique,) %>%
  mutate(TicketTyped = case_when(
    TicketGroup == 1 ~ "single",
    between(TicketGroup, 2, 4) ~ "small",
    TicketGroup >= 5 ~ "large"
  )) %>%
  mutate(TicketTyped = fct_relevel(TicketTyped, "single", "small", "large"))

table(titanic.full$TicketGroup)

## 
##   1   2   3   4   5   6   7   8  11 
## 713 264 147  64  35  24  35  16  11

table(titanic.full$TicketTyped)

## 
## single  small  large 
##    713    475    121

3.6 Deal with variables with missing values

First, check number of missing values/empty values for each predictor

nrow(titanic.full[!complete.cases(titanic.full),]) # 264 incomplete cases

## [1] 267

# There are some empty values, so first transform them to NA so as to count NAs
titanic.full[titanic.full[1:nrow(titanic.full), ] == ""] <- NA
library(VIM)
aggr(titanic.full, plot = F)

## 
##  Missings in variables:
##  Variable Count
##       Age   263
##      Fare     1
##     Cabin  1014
##  Embarked     2
##      Aged   266

To sum up: * Age has 263 missing values * Fare has 1 missing values * Cabin has 1014 missing values * Embarked has 2 missing values

3.5.1 Fill in Age

Since Age has so many missing values, I will use multiple imputation method in mice package to fill in the missing values.

# check observations with missing values of Embarked or Fare
library(mice)
set.seed(1234)
mice.mod <- mice(titanic.full[, !names(titanic.full) %in% c("Name", "SibSp", "Parch", "Ticket", "Cabin")], method = "pmm")

## 
##  iter imp variable
##   1   1  Age  Fare  Embarked
##   1   2  Age  Fare  Embarked
##   1   3  Age  Fare  Embarked
##   1   4  Age  Fare  Embarked
##   1   5  Age  Fare  Embarked
##   2   1  Age  Fare  Embarked
##   2   2  Age  Fare  Embarked
##   2   3  Age  Fare  Embarked
##   2   4  Age  Fare  Embarked
##   2   5  Age  Fare  Embarked
##   3   1  Age  Fare  Embarked
##   3   2  Age  Fare  Embarked
##   3   3  Age  Fare  Embarked
##   3   4  Age  Fare  Embarked
##   3   5  Age  Fare  Embarked
##   4   1  Age  Fare  Embarked
##   4   2  Age  Fare  Embarked
##   4   3  Age  Fare  Embarked
##   4   4  Age  Fare  Embarked
##   4   5  Age  Fare  Embarked
##   5   1  Age  Fare  Embarked
##   5   2  Age  Fare  Embarked
##   5   3  Age  Fare  Embarked
##   5   4  Age  Fare  Embarked
##   5   5  Age  Fare  Embarked

## Warning: Number of logged events: 51

mice.output <- complete(mice.mod)

Now, check the distribution of Age before and after the imputation. The two distributions are highly similar to each other, therefore, it’s an valid method to utilize.

par(mfrow = c(1,2))
hist(titanic.full$Age, freq = F, main = 'Age: Original Data', 
     col='skyblue', ylim=c(0,0.04))
hist(mice.output$Age, freq = F, main = 'Age: MICE Output', 
     col='lightblue', ylim=c(0, 0.04))

par(mfrow=c(1,1))

titanic.full$Age <- mice.output$Age

3.5.2 Fill in Fare

First, figure out the observation with missing value.

titanic.full %>%
  filter(is.na(Fare))

##   Pclass               Name  Sex  Age SibSp Parch Ticket Fare Cabin
## 1    3rd Storey, Mr. Thomas male 60.5     0     0   3701   NA  <NA>
##   Embarked Aged Title FamilySize FamilySized TicketGroup TicketTyped
## 1        S  Old    Mr          1      single           1      single

This passenger embarked on S, and belongs to the 3rd Pclass. Next, I summarised the data with the median Fare value sliced by Pclass and Embarked, the result is as below:

titanic.full %>%
  group_by(Pclass, Embarked) %>%
  summarise(median = median(Fare, na.rm = T))

## # A tibble: 10 x 3
## # Groups:   Pclass [3]
##    Pclass Embarked median
##    <ord>  <fct>     <dbl>
##  1 3rd    C          7.90
##  2 3rd    Q          7.75
##  3 3rd    S          8.05
##  4 2nd    C         15.3 
##  5 2nd    Q         12.4 
##  6 2nd    S         15.4 
##  7 1st    C         76.7 
##  8 1st    Q         90   
##  9 1st    S         52   
## 10 1st    <NA>      80

# the median Fare price for passengers with Embarked == S, Pclass = 3rd is 8.05

Then I made a density plot of passengers with Pclass = 3rd and Embarked = S and fill in the missing value with the median Fare price of this group of passengers.

# visualization
library(ggplot2)
titanic.full %>%
  filter(Pclass == "3rd", Embarked == "S") %>%
  ggplot(aes(x = Fare)) + 
  geom_density(fill = "skyblue", alpha = 0.4) +
  geom_vline(aes(xintercept = median(Fare, na.rm =  T)), 
             col = "red", linetype = 2, size = 0.5)

# fill in  the missing value using median
titanic.full$Fare[is.na(titanic.full$Fare)] <- 8.05

3.5.3 Drop Cabin

As we know, Cabin has 1014 missing values, meaning that more than 75% of the information was missed. Therefore, I simply deleted this variable.

titanic.full$Cabin <- NULL

3.5.4 Fill in Embarked

First, figure out the observation with missing value.

titanic.full %>%
  filter(is.na(Embarked))

##   Pclass                                      Name    Sex Age SibSp Parch
## 1    1st                       Icard, Miss. Amelie female  38     0     0
## 2    1st Stone, Mrs. George Nelson (Martha Evelyn) female  62     0     0
##   Ticket Fare Embarked  Aged Title FamilySize FamilySized TicketGroup
## 1 113572   80     <NA> Adult  Miss          1      single           2
## 2 113572   80     <NA>   Old   Mrs          1      single           2
##   TicketTyped
## 1       small
## 2       small

For these two observations, both of them were in the 1st class and their Fare prices were both 80.

Then, show the table of median Fare values sliced by Pclass and Embarked again:

titanic.full %>%
  group_by(Pclass, Embarked) %>%
  summarise(median = median(Fare, na.rm = T))

## # A tibble: 10 x 3
## # Groups:   Pclass [3]
##    Pclass Embarked median
##    <ord>  <fct>     <dbl>
##  1 3rd    C          7.90
##  2 3rd    Q          7.75
##  3 3rd    S          8.05
##  4 2nd    C         15.3 
##  5 2nd    Q         12.4 
##  6 2nd    S         15.4 
##  7 1st    C         76.7 
##  8 1st    Q         90   
##  9 1st    S         52   
## 10 1st    <NA>      80

Based on the result, it’s not hard to find that the median Fare price of passengers Embarked on C and were in the 1st class is 76.7, which is quite similar to 80. Therefore, it’s safe to fill in the missing value of Embarked information as “C” as below.

titanic.full$Embarked[which(is.na(titanic.full$Embarked))] <- "C"
summary(titanic.full$Embarked) # double check

##       C   Q   S 
##   0 272 123 914

# there's an unused level, therefore, drop it as below:

titanic.full$Embarked <- droplevels(titanic.full$Embarked)

Finally, delete useless variables and look at the structure and the summary information of the full data again.

titanic.full <- titanic.full %>%
  select(Pclass, Title, Sex, Age, Aged, 
         SibSp, Parch, FamilySize, FamilySized,
         TicketGroup, TicketTyped, Fare, Embarked)

str(titanic.full)

## 'data.frame':    1309 obs. of  13 variables:
##  $ Pclass     : Ord.factor w/ 3 levels "3rd"<"2nd"<"1st": 1 3 1 3 1 1 3 1 1 2 ...
##  $ Title      : Factor w/ 5 levels "Miss","Master",..: 3 4 1 4 3 3 3 2 4 4 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 50 54 2 27 14 ...
##  $ Aged       : chr  "Adult" "Adult" "Adult" "Adult" ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ FamilySize : num  2 2 1 2 1 1 1 5 3 2 ...
##  $ FamilySized: Factor w/ 3 levels "single","small",..: 2 2 1 2 1 1 1 3 2 2 ...
##  $ TicketGroup: num  1 2 1 2 1 1 2 5 3 2 ...
##  $ TicketTyped: Factor w/ 3 levels "single","small",..: 1 2 1 2 1 1 2 3 2 2 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

summary(titanic.full)

##  Pclass           Title         Sex           Age       
##  3rd:709   Miss      :266   female:466   Min.   : 0.17  
##  2nd:277   Master    : 61   male  :843   1st Qu.:21.00  
##  1st:323   Mr        :758                Median :29.00  
##            Mrs       :197                Mean   :30.29  
##            rare title: 27                3rd Qu.:39.00  
##                                          Max.   :80.00  
##      Aged               SibSp            Parch         FamilySize    
##  Length:1309        Min.   :0.0000   Min.   :0.000   Min.   : 1.000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.: 1.000  
##  Mode  :character   Median :0.0000   Median :0.000   Median : 1.000  
##                     Mean   :0.4989   Mean   :0.385   Mean   : 1.884  
##                     3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.: 2.000  
##                     Max.   :8.0000   Max.   :9.000   Max.   :11.000  
##  FamilySized   TicketGroup     TicketTyped       Fare         Embarked
##  single:790   Min.   : 1.000   single:713   Min.   :  0.000   C:272   
##  small :437   1st Qu.: 1.000   small :475   1st Qu.:  7.896   Q:123   
##  large : 82   Median : 1.000   large :121   Median : 14.454   S:914   
##               Mean   : 2.102                Mean   : 33.276           
##               3rd Qu.: 3.000                3rd Qu.: 31.275           
##               Max.   :11.000                Max.   :512.329

4. Perform Exploratory Data Analysis

Before performing EDA, I need to create the cleaned train and test data sets and perform EDA on the train dataset, because we only have the Survived information of the train dataset.

titanic.train.clean <- cbind(titanic.train$Survived, titanic.full[1:891,])
names(titanic.train.clean)[1] <- "Survived"
titanic.train.clean$Survived <- as.factor(titanic.train.clean$Survived)

titanic.test.clean <- titanic.full[892:1309,]

4.1 Effects of Independent Variables on the Dependent Variable

4.1.1 Pclass

titanic.train.clean %>%
  ggplot(aes(x = Pclass, y = ..count.., fill = Survived)) +
  geom_bar(stat = "count", position = "dodge") +
  labs(x = "Ticket Class",
       y = "Number of Passengers",
       title = "Effect of Ticket Class on Survival Rate") +
  theme_classic() + scale_fill_brewer(palette = "Dark2") +
  geom_text(stat = "count", 
            aes(label = ..count..),
            position=position_dodge(width = 1), 
            vjust=-0.5)

Based on the visualization, more than 50% the passengers in the 1st class survived, about 50% of the passengers in the 2nd class survived, while most of the passenger in the 3rd class did not survive.

Next, calculate WOE and IV as below:

library(InformationValue)
WOETable(X = titanic.train.clean$Pclass, Y = titanic.train.clean$Survived)

##   CAT GOODS BADS TOTAL     PCT_G     PCT_B        WOE         IV
## 1 3rd   119  372   491 0.3479532 0.6775956 -0.6664827 0.21970095
## 2 2nd    87   97   184 0.2543860 0.1766849  0.3644848 0.02832087
## 3 1st   136   80   216 0.3976608 0.1457195  1.0039160 0.25292792

IV(X = titanic.train.clean$Pclass, Y = titanic.train.clean$Survived)

## [1] 0.5009497
## attr(,"howgood")
## [1] "Highly Predictive"

According to a 0.5 IV value, and the “Highly Predictive” result, we could regard Pclass as one of the features that are used to model the data.

4.1.2 Title

titanic.train.clean %>%
  ggplot(aes(x = Title, y = ..count.., fill = Survived)) +
  geom_bar(stat = "count", position = "stack") +
  labs(x = "Titles",
       y = "Number of Passengers",
       title = "Effect of Titles on Survival Rate") +
  theme_classic() + scale_fill_brewer(palette = "Dark2") +
  geom_text(stat = "count", 
            aes(label = ..count..),
            position = position_stack(vjust = 0.5))

According to the plot, most of the passengers with the title of “Mr” did not survive, while passengers with the title of “Miss” and “Mrs” have higher survival rate.

Next, calculate WOE and IV as below:

WOETable(X = titanic.train.clean$Title, Y = titanic.train.clean$Survived)

##          CAT GOODS BADS TOTAL      PCT_G      PCT_B        WOE          IV
## 1       Miss   132   55   187 0.38596491 0.10018215  1.3487564 0.385451342
## 2     Master    23   17    40 0.06725146 0.03096539  0.7755686 0.028142336
## 3         Mr    82  436   518 0.23976608 0.79417122 -1.1976353 0.663975160
## 4        Mrs    99   26   125 0.28947368 0.04735883  1.8103110 0.438303180
## 5 rare title     6   15    21 0.01754386 0.02732240 -0.4430030 0.004331925

IV(X = titanic.train.clean$Title, Y = titanic.train.clean$Survived)

## [1] 1.520204
## attr(,"howgood")
## [1] "Highly Predictive"

The IV value is as high as 1.52, therefore, we could regard Title as one of the features that are used to model the data.

4.1.2 Sex

titanic.train.clean %>%
  ggplot(aes(x = Sex, y = ..count.., fill = Survived)) +
  geom_bar(stat = "count", position = "dodge") +
  labs(x = "Sex",
       y = "Number of Passengers",
       title = "Effect of Sex on Survival Rate") +
  theme_classic() + scale_fill_brewer(palette = "Dark2") +
  geom_text(stat = "count", 
            aes(label = ..count..),
            position = position_dodge(width = 1), vjust = -0.5)

According to the plot, female passengers are much morel likely to survive than male passengers.

Next, calculate WOE and IV as below:

WOETable(X = titanic.train.clean$Sex, Y = titanic.train.clean$Survived)

##      CAT GOODS BADS TOTAL     PCT_G    PCT_B        WOE        IV
## 1 female   233   81   314 0.6812865 0.147541  1.5298770 0.8165651
## 2   male   109  468   577 0.3187135 0.852459 -0.9838327 0.5251163

IV(X = titanic.train.clean$Sex, Y = titanic.train.clean$Survived)

## [1] 1.341681
## attr(,"howgood")
## [1] "Highly Predictive"

The IV value is as high as 1.34, therefore, we could regard Sex as one of the features that are used to model the data.

4.1.3 Age or Aged?

First, perform WOE and IV analysis on both Age and Aged variable to decide which one to use.

WOETable(X = as.factor(titanic.train.clean$Age), Y = titanic.train.clean$Survived)
# since the table is too big, I will not show it here.

IV(X = as.factor(titanic.train.clean$Age), Y = titanic.train.clean$Survived)

## [1] 0.2999792
## attr(,"howgood")
## [1] "Highly Predictive"

WOETable(X = titanic.train.clean$Aged, Y = titanic.train.clean$Survived)

## [1] GOODS BADS  TOTAL PCT_G PCT_B WOE   IV   
## <0 rows> (or 0-length row.names)

IV(X = titanic.train.clean$Aged, Y = titanic.train.clean$Survived)

## [1] 0
## attr(,"howgood")
## [1] "Not Predictive"

Based on the result, I will use the Age variable rather than Aged. Below is the visualization,

titanic.train.clean %>%
  ggplot(aes(x = Age, color = Survived)) +
  geom_line(aes(label=..count..), stat = 'bin') +
  labs(x = "Age",
       y = "Number of Passengers",
       title = "Effect of Age on Survival Rate") +
  theme_classic() + scale_color_brewer(palette = "Dark2")

From the plot, we could conclude that yound passengers have higher survival rate than old passengers.

4.1.4 SibSp & Parch, FamilySize, or FamilySized?

Again, utilize WOE and IV to select the variable first.

WOETable(X = as.factor(titanic.train.clean$SibSp), Y = titanic.train.clean$Survived)

##   CAT GOODS BADS TOTAL       PCT_G       PCT_B        WOE          IV
## 1   0   210  398   608 0.593220339 0.724954463 -0.2005429 0.026418349
## 2   1   112   97   209 0.316384181 0.176684882  0.5825894 0.081387334
## 3   2    13   15    28 0.036723164 0.027322404  0.2957007 0.002779811
## 4   3     4   12    16 0.011299435 0.021857923 -0.6598108 0.006966604
## 5   4     3   15    18 0.008474576 0.027322404 -1.1706364 0.022063953
## 6   5     5    5     5 0.014124294 0.009107468  0.4388015 0.002201391
## 7   8     7    7     7 0.019774011 0.012750455  0.4388015 0.003081947

IV(X = as.factor(titanic.train.clean$SibSp), Y = titanic.train.clean$Survived)

## [1] 0.1448994
## attr(,"howgood")
## [1] "Highly Predictive"

WOETable(X = as.factor(titanic.train.clean$Parch), Y = titanic.train.clean$Survived)

##   CAT GOODS BADS TOTAL       PCT_G       PCT_B        WOE          IV
## 1   0   233  445   678 0.671469741 0.810564663 -0.1882622 0.026186312
## 2   1    65   53   118 0.187319885 0.096539162  0.6628690 0.060175728
## 3   2    40   40    80 0.115273775 0.072859745  0.4587737 0.019458440
## 4   3     3    2     5 0.008645533 0.003642987  0.8642388 0.004323394
## 5   4     4    4     4 0.011527378 0.007285974  0.4587737 0.001945844
## 6   5     1    4     5 0.002881844 0.007285974 -0.9275207 0.004084922
## 7   6     1    1     1 0.002881844 0.001821494  0.4587737 0.000486461

IV(X = as.factor(titanic.train.clean$Parch), Y = titanic.train.clean$Survived)

## [1] 0.1166611
## attr(,"howgood")
## [1] "Highly Predictive"

WOETable(X = as.factor(titanic.train.clean$FamilySize), Y = titanic.train.clean$Survived)

##   CAT GOODS BADS TOTAL       PCT_G      PCT_B        WOE           IV
## 1   1   163  374   537 0.459154930 0.68123862 -0.3945249 0.0876175539
## 2   2    89   72   161 0.250704225 0.13114754  0.6479509 0.0774668616
## 3   3    59   43   102 0.166197183 0.07832423  0.7523180 0.0661084057
## 4   4    21    8    29 0.059154930 0.01457195  1.4010615 0.0624634998
## 5   5     3   12    15 0.008450704 0.02185792 -0.9503137 0.0127410643
## 6   6     3   19    22 0.008450704 0.03460838 -1.4098460 0.0368782940
## 7   7     4    8    12 0.011267606 0.01457195 -0.2571665 0.0008497665
## 8   8     6    6     6 0.016901408 0.01092896  0.4359807 0.0026038712
## 9  11     7    7     7 0.019718310 0.01275046  0.4359807 0.0030378497

IV(X = as.factor(titanic.train.clean$FamilySize), Y = titanic.train.clean$Survived)

## [1] 0.3497672
## attr(,"howgood")
## [1] "Highly Predictive"

WOETable(X = titanic.train.clean$FamilySized, Y = titanic.train.clean$Survived)

##      CAT GOODS BADS TOTAL      PCT_G      PCT_B        WOE         IV
## 1 single   163  374   537 0.47660819 0.68123862 -0.3572179 0.07309765
## 2  small   169  123   292 0.49415205 0.22404372  0.7910021 0.21365625
## 3  large    10   52    62 0.02923977 0.09471767 -1.1753709 0.07696082

IV(X = titanic.train.clean$FamilySized, Y = titanic.train.clean$Survived)

## [1] 0.3637147
## attr(,"howgood")
## [1] "Highly Predictive"

Based on the highest IV criteria, I will select FamilySized here.

titanic.train.clean %>%
  ggplot(aes(x = FamilySized, y = ..count.., fill = Survived)) +
  geom_bar(stat = "count", position = "dodge") +
  labs(x = "Family Size",
       y = "Number of Passengers",
       title = "Effect of Family Size on Survival Rate") +
  theme_classic() + scale_fill_brewer(palette = "Dark2") +
  geom_text(stat = "count", 
            aes(label = ..count..),
            position = position_dodge(width = 1), vjust = -0.5)

According to the plot, passengers with small family are the group with the highest survival rate.

4.1.5 TicketGroup or TicketTyped?

As before, apply WOE and IV to select the variable first.

WOETable(X = as.factor(titanic.train.clean$TicketGroup), Y = titanic.train.clean$Survived)

##   CAT GOODS BADS TOTAL      PCT_G      PCT_B         WOE           IV
## 1   1   130  351   481 0.37249284 0.63934426 -0.54022525 1.441599e-01
## 2   2    93   88   181 0.26647564 0.16029144  0.50828920 5.397228e-02
## 3   3    66   35   101 0.18911175 0.06375228  1.08733320 1.363075e-01
## 4   4    32   12    44 0.09169054 0.02185792  1.43385577 1.001299e-01
## 5   5     7   14    21 0.02005731 0.02550091 -0.24012066 1.307122e-03
## 6   6     4   15    19 0.01146132 0.02732240 -0.86872932 1.377899e-02
## 7   7     5   19    24 0.01432665 0.03460838 -0.88197455 1.788797e-02
## 8   8     5    8    13 0.01432665 0.01457195 -0.01697711 4.164509e-06
## 9  11     7    7     7 0.02005731 0.01275046  0.45302652 3.310197e-03

IV(X = as.factor(titanic.train.clean$TicketGroup), Y = titanic.train.clean$Survived)

## [1] 0.470858
## attr(,"howgood")
## [1] "Highly Predictive"

WOETable(X = titanic.train.clean$TicketTyped, Y = titanic.train.clean$Survived)

##      CAT GOODS BADS TOTAL      PCT_G     PCT_B        WOE         IV
## 1 single   130  351   481 0.38011696 0.6393443 -0.5199641 0.13478888
## 2  small   191  135   326 0.55847953 0.2459016  0.8202864 0.25640338
## 3  large    21   63    84 0.06140351 0.1147541 -0.6253246 0.03336144

IV(X = titanic.train.clean$TicketTyped, Y = titanic.train.clean$Survived)

## [1] 0.4245537
## attr(,"howgood")
## [1] "Highly Predictive"

Based on the highest IV criteria, I will select TicketGroup here.

titanic.train.clean %>%
  ggplot(aes(x = TicketGroup, y = ..count.., fill = Survived)) +
  geom_bar(stat = "count", position = "dodge") +
  labs(x = "Ticket Group",
       y = "Number of Passengers",
       title = "Effect of Ticket Type on Survival Rate") +
  theme_classic() + scale_fill_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = 1:11) +
  geom_text(stat = "count", 
            aes(label = ..count..),
            position = position_dodge(width = 1), vjust = -0.5)

Based on the result, passengers with ticket group of 2,3,4 have the highest survival rate.

4.1.6 Fare

titanic.train.clean %>%
  ggplot(aes(x = Fare, color = Survived)) +
  geom_line(aes(label=..count..), stat = 'bin') +
  labs(x = "Fare Price",
       y = "Number of Passengers",
       title = "Effect of Fare Price on Survival Rate") +
  theme_classic() + scale_color_brewer(palette = "Dark2")

Accordingly, when the price is below 200 dollars, the higher the fare price is, the higher the survival rate it, when the price is higher than 200 dollars, there isn’t significant difference.

4.1.7 Embarked

titanic.train.clean %>%
  ggplot(aes(x = Embarked, y = ..count.., fill = Survived)) +
  geom_bar(stat = "count", position = "dodge") +
  labs(x = "Embarked Place",
       y = "Number of Passengers",
       title = "Effect of Embarked Place on Survival Rate") +
  theme_classic() + scale_fill_brewer(palette = "Dark2") +
  geom_text(stat = "count", 
            aes(label = ..count..),
            position = position_dodge(width = 1), vjust = -0.5)

Based on the plot, the survival rate for passengers embarked on C is the highest, while the survival rate for passengers embarked on S is the lowest.

WOETable(X = titanic.train.clean$Embarked, Y = titanic.train.clean$Survived)

##   CAT GOODS BADS TOTAL     PCT_G     PCT_B         WOE           IV
## 1   C    95   75   170 0.2777778 0.1366120  0.70967648 1.001820e-01
## 2   Q    30   47    77 0.0877193 0.0856102  0.02433748 5.133014e-05
## 3   S   217  427   644 0.6345029 0.7777778 -0.20359896 2.917061e-02

IV(X = titanic.train.clean$Embarked, Y = titanic.train.clean$Survived)

## [1] 0.129404
## attr(,"howgood")
## [1] "Highly Predictive"

Accordingly, Embarked could be regarded as one of the features used to predict the survival because of the high IV value.

To sum up, I will use Pclass, Title, Sex, Age, FamilySized, TicketGroup, Fare, and Embarked as predictors.

titanic.train.final <- titanic.train.clean %>%
  select(Survived, Pclass, Title, Sex, Age, FamilySized, 
         TicketGroup, Fare, Embarked)

titanic.test.final <- titanic.test.clean %>%
  select(Pclass, Title, Sex, Age, FamilySized, 
         TicketGroup, Fare, Embarked)

All later steps will using titanic.train.final and titanic.test.final datasets.

4.2 Correlations between Numeric Independent Variables

library(corrplot)
titanic.train.final %>%
  select(Age, TicketGroup, Fare) %>%
  cor() %>%
  corrplot.mixed()

5. Build Classification Models

Here, I will utilize 6 classification methods, which are Stepwise Logistic Regression Model, the k-Nearest-Neighbors (KNN), Naive Bayes Classifier, Classification and Regression Trees (CART), Random Forest Model, and Ensembles (Bagging and Boosting).

5.1 Stepwise Logistic Regression Model

titanic.null <- glm(Survived ~ 1, data = titanic.train.final, family = 'binomial')
titanic.full <- glm(Survived ~ ., data = titanic.train.final, family = 'binomial')
titanic.step <- step(titanic.null, scope = list(lower = titanic.null, upper = titanic.full),
                     direction = "both")

## Start:  AIC=1188.66
## Survived ~ 1
## 
##               Df Deviance     AIC
## + Title        4   886.64  896.64
## + Sex          1   917.80  921.80
## + Pclass       2  1083.11 1089.11
## + FamilySized  2  1111.56 1117.56
## + Fare         1  1117.57 1121.57
## + Embarked     2  1159.30 1165.30
## + Age          1  1176.89 1180.89
## + TicketGroup  1  1182.96 1186.96
## <none>            1186.66 1188.66
## 
## Step:  AIC=896.64
## Survived ~ Title
## 
##               Df Deviance     AIC
## + Pclass       2   782.74  796.74
## + FamilySized  2   813.72  827.72
## + Fare         1   855.86  867.86
## + Embarked     2   865.42  879.42
## + TicketGroup  1   873.60  885.60
## + Sex          1   881.07  893.07
## <none>             886.64  896.64
## + Age          1   885.39  897.39
## - Title        4  1186.66 1188.66
## 
## Step:  AIC=796.74
## Survived ~ Title + Pclass
## 
##               Df Deviance     AIC
## + FamilySized  2   729.80  747.80
## + TicketGroup  1   765.23  781.23
## + Embarked     2   774.44  792.44
## + Age          1   777.38  793.38
## + Sex          1   778.19  794.19
## <none>             782.74  796.74
## + Fare         1   782.73  798.73
## - Pclass       2   886.64  896.64
## - Title        4  1083.11 1089.11
## 
## Step:  AIC=747.8
## Survived ~ Title + Pclass + FamilySized
## 
##               Df Deviance     AIC
## + Age          1   719.92  739.92
## + Sex          1   725.46  745.46
## + Fare         1   725.58  745.58
## + TicketGroup  1   727.24  747.24
## <none>             729.80  747.80
## + Embarked     2   726.66  748.66
## - FamilySized  2   782.74  796.74
## - Pclass       2   813.72  827.72
## - Title        4  1039.15 1049.15
## 
## Step:  AIC=739.92
## Survived ~ Title + Pclass + FamilySized + Age
## 
##               Df Deviance     AIC
## + Sex          1   716.08  738.08
## + Fare         1   716.41  738.41
## <none>             719.92  739.92
## + TicketGroup  1   718.59  740.59
## + Embarked     2   716.85  740.85
## - Age          1   729.80  747.80
## - FamilySized  2   777.38  793.38
## - Pclass       2   813.60  829.60
## - Title        4   993.79 1005.79
## 
## Step:  AIC=738.08
## Survived ~ Title + Pclass + FamilySized + Age + Sex
## 
##               Df Deviance    AIC
## + Fare         1   712.64 736.64
## <none>             716.08 738.08
## + Embarked     2   712.60 738.60
## + TicketGroup  1   714.91 738.91
## - Sex          1   719.92 739.92
## - Age          1   725.46 745.46
## - Title        4   765.13 779.13
## - FamilySized  2   773.13 791.13
## - Pclass       2   807.83 825.83
## 
## Step:  AIC=736.64
## Survived ~ Title + Pclass + FamilySized + Age + Sex + Fare
## 
##               Df Deviance    AIC
## <none>             712.64 736.64
## + Embarked     2   710.05 738.05
## - Fare         1   716.08 738.08
## - Sex          1   716.41 738.41
## + TicketGroup  1   712.53 738.53
## - Age          1   721.34 743.34
## - Pclass       2   758.30 778.30
## - Title        4   763.32 779.32
## - FamilySized  2   773.12 793.12

summary(titanic.step)

## 
## Call:
## glm(formula = Survived ~ Title + Pclass + FamilySized + Age + 
##     Sex + Fare, family = "binomial", data = titanic.train.final)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7387  -0.5224  -0.3774   0.5415   2.5146  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        2.271461   0.344879   6.586 4.51e-11 ***
## TitleMaster       15.829109 617.587831   0.026  0.97955    
## TitleMr           12.279537 617.587655   0.020  0.98414    
## TitleMrs           0.678671   0.364734   1.861  0.06278 .  
## Titlerare title   12.018373 617.587399   0.019  0.98447    
## Pclass.L           1.488218   0.222807   6.679 2.40e-11 ***
## Pclass.Q           0.140950   0.198604   0.710  0.47789    
## FamilySizedsmall  -0.378234   0.242526  -1.560  0.11886    
## FamilySizedlarge  -3.268710   0.509555  -6.415 1.41e-10 ***
## Age               -0.025656   0.008868  -2.893  0.00381 ** 
## Sexmale          -15.223227 617.587597  -0.025  0.98033    
## Fare               0.004527   0.002655   1.705  0.08818 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  712.64  on 879  degrees of freedom
## AIC: 736.64
## 
## Number of Fisher Scoring iterations: 13

Have a look at the classification table of the stepwise logistic regression model:

library(QuantPsyc)
ClassLog(titanic.step, titanic.train.final$Survived, cut = .5)

## $rawtab
##        resp
##           0   1
##   FALSE 483  85
##   TRUE   66 257
## 
## $classtab
##        resp
##                 0         1
##   FALSE 0.8797814 0.2485380
##   TRUE  0.1202186 0.7514620
## 
## $overall
## [1] 0.8305275
## 
## $mcFadden
## [1] 0.399458

Plot the ROC curve:

library(InformationValue)
predicted <- plogis(predict(titanic.step, titanic.train.final))
plotROC(titanic.train.final$Survived, predicted)

Check the optimal cutoff value, and use it to build the classification table again:

(optCutOff <- optimalCutoff(titanic.train.final$Survived, predicted))

## [1] 0.6499997

ClassLog(titanic.step, titanic.train.final$Survived, cut = .65)

## $rawtab
##        resp
##           0   1
##   FALSE 510 105
##   TRUE   39 237
## 
## $classtab
##        resp
##                  0          1
##   FALSE 0.92896175 0.30701754
##   TRUE  0.07103825 0.69298246
## 
## $overall
## [1] 0.8383838
## 
## $mcFadden
## [1] 0.399458

Predict on the test data:

pred.logistic <- predict(titanic.step, titanic.test.final, type = "response")
pred.logistic <- ifelse(pred.logistic > 0.65, 1, 0)

5.2 Random Forest Model

set.seed(1)
library(randomForest)
titanic.rf <- randomForest(Survived ~ ., data = titanic.train.final)
library(caret)
confusionMatrix(titanic.rf$predicted, titanic.train.final$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 497  93
##          1  52 249
##                                           
##                Accuracy : 0.8373          
##                  95% CI : (0.8114, 0.8609)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.648           
##                                           
##  Mcnemar's Test P-Value : 0.0008943       
##                                           
##             Sensitivity : 0.9053          
##             Specificity : 0.7281          
##          Pos Pred Value : 0.8424          
##          Neg Pred Value : 0.8272          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5578          
##    Detection Prevalence : 0.6622          
##       Balanced Accuracy : 0.8167          
##                                           
##        'Positive' Class : 0               
##

Predict on the test data:

pred.rf <- predict(titanic.rf, titanic.test.final, type = "response")

5.3 Ensembles (Bagging and Boosting)

Bagging:

set.seed(1)
library(adabag)
titanic.bag <- bagging(Survived ~ ., data = titanic.train.final)
confusionMatrix(as.factor(titanic.bag$class), titanic.train.final$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 512  97
##          1  37 245
##                                           
##                Accuracy : 0.8496          
##                  95% CI : (0.8244, 0.8725)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6712          
##                                           
##  Mcnemar's Test P-Value : 3.454e-07       
##                                           
##             Sensitivity : 0.9326          
##             Specificity : 0.7164          
##          Pos Pred Value : 0.8407          
##          Neg Pred Value : 0.8688          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5746          
##    Detection Prevalence : 0.6835          
##       Balanced Accuracy : 0.8245          
##                                           
##        'Positive' Class : 0               
##

Predict on the test data:

pred.bag <- predict(titanic.bag, titanic.test.final, type = "response")

Boosting:

set.seed(1)
library(adabag)
titanic.boost <- boosting(Survived ~ ., data = titanic.train.final)
confusionMatrix(as.factor(titanic.boost$class), titanic.train.final$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 543  14
##          1   6 328
##                                           
##                Accuracy : 0.9776          
##                  95% CI : (0.9655, 0.9862)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9523          
##                                           
##  Mcnemar's Test P-Value : 0.1175          
##                                           
##             Sensitivity : 0.9891          
##             Specificity : 0.9591          
##          Pos Pred Value : 0.9749          
##          Neg Pred Value : 0.9820          
##              Prevalence : 0.6162          
##          Detection Rate : 0.6094          
##    Detection Prevalence : 0.6251          
##       Balanced Accuracy : 0.9741          
##                                           
##        'Positive' Class : 0               
##

Predict on the test data:

pred.boost <- predict(titanic.boost, titanic.test.final, type = "response")

6. Prediction Accuracy

write files into csv and upload the prediction result to Kaggle:

titanic.test$pred.logistic <- pred.logistic
titanic.test$pred.rf <- pred.rf
titanic.test$pred.bag <- pred.bag$class
titanic.test$pred.boost <- pred.boost$class

setwd("/Users/shanshantong/Desktop/Skills/1.Tools/R/5.Projects/2.Titanic")

write.csv(titanic.test[,c(1,12)], file = "pred_logistic.csv", row.names = F)
write.csv(titanic.test[,c(1,13)], file = "pred_rf.csv", row.names = F)
write.csv(titanic.test[,c(1,14)], file = "pred_bag.csv", row.names = F)
write.csv(titanic.test[,c(1,15)], file = "pred_boost.csv", row.names = F)

Kaggle Score with Logistic Regression - 0.7751 Accuracy with Random Forest - 0.7895 Accuracy with Bagging Model - 0.8134 Accuracy with Boosting Model - 0.7416

7. Summary

The Bagging model generates the highest accuracy with Kaggle score of 0.8134. Besides, for this specific task, there are still many other algorithms that could be utilized to make prediction but I will not discuss more details here.

Titanic: Machine Learning from Disaster

Shanshan Tong

3/24/2020

1. Describe the Problem and Objectives

2. Describe the Data

3. Manipulate the data

3.1 Factor Pclass

3.2 Bin the Age

3.3 Add Titles

3.4 Add Family Size

3.5 Add Ticket Type

3.6 Deal with variables with missing values

3.5.1 Fill in Age

3.5.2 Fill in Fare

3.5.3 Drop Cabin

3.5.4 Fill in Embarked

4. Perform Exploratory Data Analysis

4.1 Effects of Independent Variables on the Dependent Variable

4.1.1 Pclass

4.1.2 Title

4.1.2 Sex

4.1.3 Age or Aged?

4.1.4 SibSp & Parch, FamilySize, or FamilySized?

4.1.5 TicketGroup or TicketTyped?

4.1.6 Fare

4.1.7 Embarked

4.2 Correlations between Numeric Independent Variables

5. Build Classification Models

5.1 Stepwise Logistic Regression Model

5.2 Random Forest Model

5.3 Ensembles (Bagging and Boosting)

6. Prediction Accuracy

7. Summary