Titanic - Machine Learning from Disaster

Alok Pratap Singh

25 June, 2021

Titanic

library(tictoc)
tic()
start_time <- Sys.time()              # Capture the current time

The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

Overview

The data has been split into two groups:

training set (train.csv) test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.

Variable Notes

pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

sibsp: The dataset defines family relations in this way…

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

To download the data: https://www.kaggle.com/c/titanic/data

library(readr)
options(readr.num_columns = FALSE)
read_csv("C:/Users/Asus/Documents/R Pubs/Titanic/train.csv") -> traindf   # Reading Training Data set
read_csv("C:/Users/Asus/Documents/R Pubs/Titanic/test.csv") -> testdf   # Reading Testing Data set   
read_csv("C:/Users/Asus/Documents/R Pubs/Titanic/gender_submission.csv") -> submission  # Submission Template

Loading required Libraries

library(dplyr, warn.conflicts = F)       # Data manipulation
library(Amelia)   # Missing frequencies
library(mice)
library(caret)
library(ggplot2)
theme_set(theme_bw())
library(ggthemes)
library(corrplot)
library(ggcorrplot)
library(ggpubr)

traindf %>% head()

## # A tibble: 6 x 12
##   PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1           1        0      3 Braund~ male     22     1     0 A/5 2~  7.25 <NA> 
## 2           2        1      1 Cuming~ fema~    38     1     0 PC 17~ 71.3  C85  
## 3           3        1      3 Heikki~ fema~    26     0     0 STON/~  7.92 <NA> 
## 4           4        1      1 Futrel~ fema~    35     1     0 113803 53.1  C123 
## 5           5        0      3 Allen,~ male     35     0     0 373450  8.05 <NA> 
## 6           6        0      3 Moran,~ male     NA     0     0 330877  8.46 <NA> 
## # ... with 1 more variable: Embarked <chr>

testdf$Survived <- rep(NA, nrow(testdf))
rbind(traindf, testdf) -> full_data
as_tibble(full_data) -> full_data
full_data

## # A tibble: 1,309 x 12
##    PassengerId Survived Pclass Name   Sex     Age SibSp Parch Ticket  Fare Cabin
##          <dbl>    <dbl>  <dbl> <chr>  <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
##  1           1        0      3 Braun~ male     22     1     0 A/5 2~  7.25 <NA> 
##  2           2        1      1 Cumin~ fema~    38     1     0 PC 17~ 71.3  C85  
##  3           3        1      3 Heikk~ fema~    26     0     0 STON/~  7.92 <NA> 
##  4           4        1      1 Futre~ fema~    35     1     0 113803 53.1  C123 
##  5           5        0      3 Allen~ male     35     0     0 373450  8.05 <NA> 
##  6           6        0      3 Moran~ male     NA     0     0 330877  8.46 <NA> 
##  7           7        0      1 McCar~ male     54     0     0 17463  51.9  E46  
##  8           8        0      3 Palss~ male      2     3     1 349909 21.1  <NA> 
##  9           9        1      3 Johns~ fema~    27     0     2 347742 11.1  <NA> 
## 10          10        1      2 Nasse~ fema~    14     1     0 237736 30.1  <NA> 
## # ... with 1,299 more rows, and 1 more variable: Embarked <chr>

Missing Values

missmap(full_data, 
        col= c("yellow", "steelblue"))     # form "Amelia" package

mice::md.pattern(full_data, rotate.names = T)

##     PassengerId Pclass Name Sex SibSp Parch Ticket Fare Embarked Age Survived
## 183           1      1    1   1     1     1      1    1        1   1        1
## 529           1      1    1   1     1     1      1    1        1   1        1
## 87            1      1    1   1     1     1      1    1        1   1        0
## 244           1      1    1   1     1     1      1    1        1   1        0
## 19            1      1    1   1     1     1      1    1        1   0        1
## 158           1      1    1   1     1     1      1    1        1   0        1
## 4             1      1    1   1     1     1      1    1        1   0        0
## 82            1      1    1   1     1     1      1    1        1   0        0
## 2             1      1    1   1     1     1      1    1        0   1        1
## 1             1      1    1   1     1     1      1    0        1   1        0
##               0      0    0   0     0     0      0    1        2 263      418
##     Cabin     
## 183     1    0
## 529     0    1
## 87      1    1
## 244     0    2
## 19      1    1
## 158     0    2
## 4       1    2
## 82      0    3
## 2       1    1
## 1       0    3
##      1014 1698

full_data %>% select(-c(PassengerId, Name, Ticket, Cabin)) -> full_data
colSums(is.na(full_data))

## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##      418        0        0      263        0        0        1        2

par(lwd= 2)
hist(full_data$Fare,
     col= "salmon2",
     xlab= "Fare", breaks = 25,
     main= "Histogram of Fare")

hist(full_data$Age,
     lwd= 2,
     border = "darkblue",
     col= "skyblue", xlab= "Age",
     main= "Histogram of Age")

Fare is negatively skewed, need transformation (log/ or factors)

Data Wrangling

Types of Variables

sapply(full_data, class) %>% data.frame() %>% `colnames<-`("Types")

##              Types
## Survived   numeric
## Pclass     numeric
## Sex      character
## Age        numeric
## SibSp      numeric
## Parch      numeric
## Fare       numeric
## Embarked character

Transforming Survived to Factor, Sex to Factor.

full_data$Survived <- factor(full_data$Survived,
                        levels = c(0,1),
                        labels = c("No 0", "Yes 1"))
full_data$Sex <- factor(full_data$Sex,
                   levels = c("female", "male"),
                   labels = c("female 0", "male 1"))

Creating a new variable FSize (Family Size), and removing SibSp and Parch.

full_data$FSize <- full_data$SibSp + full_data$Parch + 1
full_data %>% select(-c(SibSp, Parch)) -> full_data

Converting FSize to Factor.

table(full_data$FSize)

## 
##   1   2   3   4   5   6   7   8  11 
## 790 235 159  43  22  25  16   8  11

cut(full_data$FSize,
    breaks = c(1,3,6,11),
    labels = c("Low 1", "Medium 2", "Large 3"),
    include.lowest = T) -> full_data$FSize
table(full_data$FSize)

## 
##    Low 1 Medium 2  Large 3 
##     1184       90       35

colSums(is.na(full_data))

## Survived   Pclass      Sex      Age     Fare Embarked    FSize 
##      418        0        0      263        1        2        0

glimpse(full_data)

## Rows: 1,309
## Columns: 7
## $ Survived <fct> No 0, Yes 1, Yes 1, Yes 1, No 0, No 0, No 0, No 0, Yes 1, Yes~
## $ Pclass   <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2~
## $ Sex      <fct> male 1, female 0, female 0, female 0, male 1, male 1, male 1,~
## $ Age      <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,~
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21~
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S", "~
## $ FSize    <fct> Low 1, Low 1, Low 1, Low 1, Low 1, Low 1, Low 1, Medium 2, Lo~

Pclass as Factor

full_data$Pclass <- factor(full_data$Pclass,
                      levels= 1:3,
                      labels = c("High 1", "Med 2", "Low 3"))
table(full_data$Pclass)

## 
## High 1  Med 2  Low 3 
##    323    277    709

Analyzing Fare

par(mfrow= c(2,2))
boxplot(full_data$Fare,
        main= "Boxplot Fare")
boxplot(full_data$Fare,
        main= "Boxplot Fare Without Outliers", 
        outline = F, col= "steelblue")
hist(full_data$Fare, 
     main = "Histogram", col= "palegreen4")

hist(full_data$Fare,
     xlim = c(0,100),
     breaks = 150,
     main= "Histogram Fare < 100", col= "yellow")
text(x= 0, y= 130,
     labels= "xx\nxx", col= "red")
text(x= 60, y= 300,
     labels= "Note:\nSome entry of\nFare values are ZEROs")

Investigating Fare

full_data[full_data$Fare== 0,]

## # A tibble: 18 x 7
##    Survived Pclass Sex      Age  Fare Embarked FSize
##    <fct>    <fct>  <fct>  <dbl> <dbl> <chr>    <fct>
##  1 No 0     Low 3  male 1    36     0 S        Low 1
##  2 No 0     High 1 male 1    40     0 S        Low 1
##  3 Yes 1    Low 3  male 1    25     0 S        Low 1
##  4 No 0     Med 2  male 1    NA     0 S        Low 1
##  5 No 0     Low 3  male 1    19     0 S        Low 1
##  6 No 0     Med 2  male 1    NA     0 S        Low 1
##  7 No 0     Med 2  male 1    NA     0 S        Low 1
##  8 No 0     Med 2  male 1    NA     0 S        Low 1
##  9 No 0     Low 3  male 1    49     0 S        Low 1
## 10 No 0     High 1 male 1    NA     0 S        Low 1
## 11 No 0     Med 2  male 1    NA     0 S        Low 1
## 12 No 0     Med 2  male 1    NA     0 S        Low 1
## 13 No 0     High 1 male 1    39     0 S        Low 1
## 14 No 0     High 1 male 1    NA     0 S        Low 1
## 15 No 0     High 1 male 1    38     0 S        Low 1
## 16 <NA>     <NA>   <NA>      NA    NA <NA>     <NA> 
## 17 <NA>     High 1 male 1    NA     0 S        Low 1
## 18 <NA>     High 1 male 1    49     0 S        Low 1

All Zero fare entry are Embarked from “S”, Low family size, belongs form all categories of Pclass

Replacing Fare zeros with median value based on specific categories.

full_data %>% na.omit()-> df
tapply(df$Fare, df$Pclass, median)

##  High 1   Med 2   Low 3 
## 67.9500 15.0458  8.0500

tapply(df$Fare, df$Embarked, median)

##       C       Q       S 
## 36.2521  7.7500 13.0000

Deciding to fill the 0’s in Fare with Median Fare value of Pclass

full_data %>%
    mutate(Fare=
               ifelse(full_data$Fare==0 & full_data$Pclass== "High 1", 67.96,
                        ifelse(full_data$Fare==0 & full_data$Pclass== "Med 2", 15.0458,
                               ifelse(full_data$Fare==0 & full_data$Pclass== "Low 3", 8.05, full_data$Fare)))) -> full_data

full_data[is.na(full_data$Fare),] # Only one Fare is missing

## # A tibble: 1 x 7
##   Survived Pclass Sex      Age  Fare Embarked FSize
##   <fct>    <fct>  <fct>  <dbl> <dbl> <chr>    <fct>
## 1 <NA>     Low 3  male 1  60.5    NA S        Low 1

full_data$Fare <- ifelse(is.na(full_data$Fare), 8.0500, full_data$Fare)   # replacing with the median Fare value of "Low 3" Pclass

Working with Embarked variable

colSums(is.na(full_data))

## Survived   Pclass      Sex      Age     Fare Embarked    FSize 
##      418        0        0      263        0        2        0

full_data[is.na(full_data$Embarked),]

## # A tibble: 2 x 7
##   Survived Pclass Sex        Age  Fare Embarked FSize
##   <fct>    <fct>  <fct>    <dbl> <dbl> <chr>    <fct>
## 1 Yes 1    High 1 female 0    38    80 <NA>     Low 1
## 2 Yes 1    High 1 female 0    62    80 <NA>     Low 1

full_data %>% filter(Pclass== "High 1") %>% 
    group_by(Embarked) %>% summarise(count= n())

## # A tibble: 4 x 2
##   Embarked count
##   <chr>    <int>
## 1 C          141
## 2 Q            3
## 3 S          177
## 4 <NA>         2

full_data %>% 
    group_by(Embarked) %>%  
    summarise(count= n())

## # A tibble: 4 x 2
##   Embarked count
##   <chr>    <int>
## 1 C          270
## 2 Q          123
## 3 S          914
## 4 <NA>         2

full_data$Embarked <- coalesce(full_data$Embarked, "S")
full_data %>% 
    group_by(Embarked) %>% 
    summarise(median(Fare))

## # A tibble: 3 x 2
##   Embarked `median(Fare)`
##   <chr>             <dbl>
## 1 C                 28.5 
## 2 Q                  7.75
## 3 S                 13.9

full_data$Embarked <- factor(full_data$Embarked, 
                        levels = c("Q", "S", "C"), 
                        labels = c("Q 1", "S 2", "C 3"))

full_data$Age <- coalesce(full_data$Age, 
                     median(full_data$Age, na.rm = T))
colSums(is.na(full_data))

## Survived   Pclass      Sex      Age     Fare Embarked    FSize 
##      418        0        0        0        0        0        0

Correlation Plots

full_data %>% 
    na.omit() %>%  
    mutate_all(as.numeric) %>% 
    mutate(Male= as.numeric(Sex==2), 
           Female= as.numeric(Sex== 1)) %>% 
    select(-Sex)  %>% cor() %>%
  corrplot(method="pie") # library "corr plot"

full_data %>% 
    na.omit() %>%  
    mutate_all(as.numeric) %>% 
    mutate(Male= as.numeric(Sex==2), 
           Female= as.numeric(Sex== 1)) %>% 
    select(-Sex) %>% 
    cor() %>% 
    ggcorrplot(method = 
                   "circle", 
               type = "upper", 
               ggtheme = theme_foundation(), 
               legend.title = "Correlaiton\nCoefficient", 
               outline.color = "black")

full_data %>% head()

## # A tibble: 6 x 7
##   Survived Pclass Sex        Age  Fare Embarked FSize
##   <fct>    <fct>  <fct>    <dbl> <dbl> <fct>    <fct>
## 1 No 0     Low 3  male 1      22  7.25 S 2      Low 1
## 2 Yes 1    High 1 female 0    38 71.3  C 3      Low 1
## 3 Yes 1    Low 3  female 0    26  7.92 S 2      Low 1
## 4 Yes 1    High 1 female 0    35 53.1  S 2      Low 1
## 5 No 0     Low 3  male 1      35  8.05 S 2      Low 1
## 6 No 0     Low 3  male 1      28  8.46 Q 1      Low 1

Converting Fare into Categories

quantile(full_data$Fare)

##       0%      25%      50%      75%     100% 
##   3.1708   7.9250  14.5000  31.3875 512.3292

full_data$Fare <- cut(full_data$Fare, 
                 breaks = quantile(full_data$Fare), 
                 labels = c("Less 1", "Ecomomic 2", "High 3", "V-High 4"), 
                 include.lowest = T)

full_data %>% head()

## # A tibble: 6 x 7
##   Survived Pclass Sex        Age Fare       Embarked FSize
##   <fct>    <fct>  <fct>    <dbl> <fct>      <fct>    <fct>
## 1 No 0     Low 3  male 1      22 Less 1     S 2      Low 1
## 2 Yes 1    High 1 female 0    38 V-High 4   C 3      Low 1
## 3 Yes 1    Low 3  female 0    26 Less 1     S 2      Low 1
## 4 Yes 1    High 1 female 0    35 V-High 4   S 2      Low 1
## 5 No 0     Low 3  male 1      35 Ecomomic 2 S 2      Low 1
## 6 No 0     Low 3  male 1      28 Ecomomic 2 Q 1      Low 1

Fare Vs Survival

df <- full_data %>% na.omit()
mosaicplot(table(full_data$Fare, full_data$Survived), 
           col= c("red", "steelblue"), 
           main= "Fare Vs Survival", 
           border = F, 
           xlab= "Fare", 
           ylab= "Survived")

chesq.test

chisq.test(table(df$Fare, df$Survived))

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$Fare, df$Survived)
## X-squared = 72.765, df = 3, p-value = 1.092e-15

Gender Vs Survival

mosaicplot(table(df$Sex, df$Survived), 
           col= c("pink", "steelblue"), 
           main= "Gender Vs Survival", 
           ylab= "Survied", xlab= "Gender", 
           border = F)

chisq.test

chisq.test(table(df$Sex, df$Survived))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(df$Sex, df$Survived)
## X-squared = 260.72, df = 1, p-value < 2.2e-16

df$Age <- cut(df$Age, 
              breaks = c(round(quantile(df$Age))), 
              include.lowest = T)

Age Vs Survival

mosaicplot(table(df$Age, df$Survived), 
           col= c("red", "yellow"), 
           main= "Age Vs Survival", 
           xlab= "Age", 
           ylab= "Survival")

chisq.test

chisq.test(table(df$Age, df$Survived))

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$Age, df$Survived)
## X-squared = 6.8243, df = 3, p-value = 0.07772

Age and Survival have no significant relation

Final full_data

full_data %>% head()

## # A tibble: 6 x 7
##   Survived Pclass Sex        Age Fare       Embarked FSize
##   <fct>    <fct>  <fct>    <dbl> <fct>      <fct>    <fct>
## 1 No 0     Low 3  male 1      22 Less 1     S 2      Low 1
## 2 Yes 1    High 1 female 0    38 V-High 4   C 3      Low 1
## 3 Yes 1    Low 3  female 0    26 Less 1     S 2      Low 1
## 4 Yes 1    High 1 female 0    35 V-High 4   S 2      Low 1
## 5 No 0     Low 3  male 1      35 Ecomomic 2 S 2      Low 1
## 6 No 0     Low 3  male 1      28 Ecomomic 2 Q 1      Low 1

Train and Test Data

train_df <- full_data[1:nrow(traindf),]
train_df %>% na.omit() -> train_df
test_df <- full_data[(nrow(traindf)+1):nrow(full_data), 2:7]

rm(full_data)
rm(df)
rm(traindf)
rm(testdf)

Splitting for Validation Set

set.seed(20)
caTools::sample.split(train_df$Survived, .8) -> index
train_df[index==F,] -> valid_df
train_df[index==T,] -> train_df

Time of Execution

toc()

## 7.95 sec elapsed

tic()

Training Model

Setting Parameters

m= "Accuracy"
c= trainControl(method= "cv", number= 18)

LDA

set.seed(20)
fit_lda <- train(Survived~., 
                    data= train_df, 
                    method= "lda", 
                    metric= m, 
                 trControl= c)

KNN

set.seed(20)
fit_knn <- train(Survived~., 
                 data= train_df, 
                 method= "lda",
                 metric= m,
                 trControl= c)

Supervised Vector Machine (Radial)

set.seed(20)
fit_svmradial <- train(Survived~., 
                       data= train_df, 
                       method= "svmRadial", 
                       metric= m, trControl= c)

Supervised Vector Machine (Linear)

set.seed(20)
fit_svmlinear <- train(Survived~., 
                       data= train_df, 
                       method= "svmLinear", 
                       metric= m, trControl= c)

CART

set.seed(20)
fit_rpart <- train(Survived~., 
                   data= train_df, 
                   method= "rpart", 
                   metric= m, trControl= c)

Random Forest Model

set.seed(20)
fit_rf <- train(Survived~.,
                data= train_df, 
                method= "rf", 
                metric= m, 
                trControl= c)

results <- resamples(list(LDA= fit_lda,
                          KNN= fit_knn,
                          cart= fit_rpart,
                          svmLinear= fit_svmlinear,
                          svmRadial= fit_svmradial,
                          RF= fit_rf))

dotplot(results)

summary(results) %>% print(digits= 2)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: LDA, KNN, cart, svmLinear, svmRadial, RF 
## Number of resamples: 18 
## 
## Accuracy 
##                Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LDA       0.6666667 0.7515244 0.8000000 0.7884233 0.8461538 0.8750000    0
## KNN       0.6666667 0.7515244 0.8000000 0.7884233 0.8461538 0.8750000    0
## cart      0.6829268 0.7451923 0.8024390 0.7983253 0.8490385 0.9230769    0
## svmLinear 0.6666667 0.7548077 0.8024390 0.7939485 0.8397436 0.8750000    0
## svmRadial 0.7435897 0.7961538 0.8102564 0.8149295 0.8461538 0.8750000    0
## RF        0.7179487 0.8051282 0.8250000 0.8276058 0.8490385 0.9250000    0
## 
## Kappa 
##                Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LDA       0.2488889 0.4621944 0.5694398 0.5408761 0.6644737 0.7422680    0
## KNN       0.2488889 0.4621944 0.5694398 0.5408761 0.6644737 0.7422680    0
## cart      0.2586926 0.4277174 0.5465343 0.5448427 0.6609526 0.8354430    0
## svmLinear 0.2488889 0.4589286 0.5819480 0.5506147 0.6520556 0.7422680    0
## svmRadial 0.4144144 0.5291150 0.5952381 0.5856635 0.6578947 0.7368421    0
## RF        0.4115226 0.5851816 0.6199152 0.6243739 0.6822917 0.8378378    0

Evaluation of the Accuracy using Confusion matrix

confusionMatrix(predict(fit_lda, valid_df), valid_df$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No 0 Yes 1
##      No 0    96    20
##      Yes 1   14    48
##                                           
##                Accuracy : 0.809           
##                  95% CI : (0.7434, 0.8639)
##     No Information Rate : 0.618           
##     P-Value [Acc > NIR] : 2.973e-08       
##                                           
##                   Kappa : 0.5885          
##                                           
##  Mcnemar's Test P-Value : 0.3912          
##                                           
##             Sensitivity : 0.8727          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.8276          
##          Neg Pred Value : 0.7742          
##              Prevalence : 0.6180          
##          Detection Rate : 0.5393          
##    Detection Prevalence : 0.6517          
##       Balanced Accuracy : 0.7893          
##                                           
##        'Positive' Class : No 0            
##

confusionMatrix(predict(fit_knn , valid_df), valid_df$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No 0 Yes 1
##      No 0    96    20
##      Yes 1   14    48
##                                           
##                Accuracy : 0.809           
##                  95% CI : (0.7434, 0.8639)
##     No Information Rate : 0.618           
##     P-Value [Acc > NIR] : 2.973e-08       
##                                           
##                   Kappa : 0.5885          
##                                           
##  Mcnemar's Test P-Value : 0.3912          
##                                           
##             Sensitivity : 0.8727          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.8276          
##          Neg Pred Value : 0.7742          
##              Prevalence : 0.6180          
##          Detection Rate : 0.5393          
##    Detection Prevalence : 0.6517          
##       Balanced Accuracy : 0.7893          
##                                           
##        'Positive' Class : No 0            
##

confusionMatrix(predict(fit_rpart, valid_df), valid_df$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No 0 Yes 1
##      No 0   104    27
##      Yes 1    6    41
##                                           
##                Accuracy : 0.8146          
##                  95% CI : (0.7496, 0.8688)
##     No Information Rate : 0.618           
##     P-Value [Acc > NIR] : 1.104e-08       
##                                           
##                   Kappa : 0.5828          
##                                           
##  Mcnemar's Test P-Value : 0.0004985       
##                                           
##             Sensitivity : 0.9455          
##             Specificity : 0.6029          
##          Pos Pred Value : 0.7939          
##          Neg Pred Value : 0.8723          
##              Prevalence : 0.6180          
##          Detection Rate : 0.5843          
##    Detection Prevalence : 0.7360          
##       Balanced Accuracy : 0.7742          
##                                           
##        'Positive' Class : No 0            
##

confusionMatrix(predict(fit_svmlinear, valid_df), valid_df$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No 0 Yes 1
##      No 0    95    21
##      Yes 1   15    47
##                                           
##                Accuracy : 0.7978          
##                  95% CI : (0.7312, 0.8541)
##     No Information Rate : 0.618           
##     P-Value [Acc > NIR] : 1.939e-07       
##                                           
##                   Kappa : 0.5643          
##                                           
##  Mcnemar's Test P-Value : 0.4047          
##                                           
##             Sensitivity : 0.8636          
##             Specificity : 0.6912          
##          Pos Pred Value : 0.8190          
##          Neg Pred Value : 0.7581          
##              Prevalence : 0.6180          
##          Detection Rate : 0.5337          
##    Detection Prevalence : 0.6517          
##       Balanced Accuracy : 0.7774          
##                                           
##        'Positive' Class : No 0            
##

confusionMatrix(predict(fit_svmradial, valid_df), valid_df$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No 0 Yes 1
##      No 0   103    25
##      Yes 1    7    43
##                                           
##                Accuracy : 0.8202          
##                  95% CI : (0.7558, 0.8737)
##     No Information Rate : 0.618           
##     P-Value [Acc > NIR] : 3.958e-09       
##                                           
##                   Kappa : 0.599           
##                                           
##  Mcnemar's Test P-Value : 0.002654        
##                                           
##             Sensitivity : 0.9364          
##             Specificity : 0.6324          
##          Pos Pred Value : 0.8047          
##          Neg Pred Value : 0.8600          
##              Prevalence : 0.6180          
##          Detection Rate : 0.5787          
##    Detection Prevalence : 0.7191          
##       Balanced Accuracy : 0.7844          
##                                           
##        'Positive' Class : No 0            
##

confusionMatrix(predict(fit_rf, valid_df), valid_df$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No 0 Yes 1
##      No 0    99    16
##      Yes 1   11    52
##                                          
##                Accuracy : 0.8483         
##                  95% CI : (0.787, 0.8976)
##     No Information Rate : 0.618          
##     P-Value [Acc > NIR] : 1.317e-11      
##                                          
##                   Kappa : 0.6742         
##                                          
##  Mcnemar's Test P-Value : 0.4414         
##                                          
##             Sensitivity : 0.9000         
##             Specificity : 0.7647         
##          Pos Pred Value : 0.8609         
##          Neg Pred Value : 0.8254         
##              Prevalence : 0.6180         
##          Detection Rate : 0.5562         
##    Detection Prevalence : 0.6461         
##       Balanced Accuracy : 0.8324         
##                                          
##        'Positive' Class : No 0           
##

rf and svmradial have best results

Time of Execution

toc()

## 29.18 sec elapsed

tic()

Final training on whole data (train_df + valid_df) for more accuracy

train_df <- rbind(train_df, valid_df)

Training Model

Setting Parameters

m= "Accuracy"
c= trainControl(method= "cv", number= 18)

LDA

set.seed(20)
fit_lda <- train(Survived~., 
                    data= train_df, 
                    method= "lda", 
                    metric= m, 
                 trControl= c)

KNN

set.seed(20)
fit_knn <- train(Survived~., 
                 data= train_df, 
                 method= "lda",
                 metric= m,
                 trControl= c)

Supervised Vector Machine (Radial)

set.seed(20)
fit_svmradial <- train(Survived~., 
                       data= train_df, 
                       method= "svmRadial", 
                       metric= m, trControl= c)

Supervised Vector Machine (Linear)

set.seed(20)
fit_svmlinear <- train(Survived~., 
                       data= train_df, 
                       method= "svmLinear", 
                       metric= m, trControl= c)

CART

set.seed(20)
fit_rpart <- train(Survived~., 
                   data= train_df, 
                   method= "rpart", 
                   metric= m, trControl= c)

Random Forest Model

set.seed(20)
fit_rf <- train(Survived~.,
                data= train_df, 
                method= "rf", 
                metric= m, 
                trControl= c)

results <- resamples(list(LDA= fit_lda,
                          KNN= fit_knn,
                          cart= fit_rpart,
                          svmLinear= fit_svmlinear,
                          svmRadial= fit_svmradial,
                          RF= fit_rf))

dotplot(results)

summary(results) %>% print(digits= 2)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: LDA, KNN, cart, svmLinear, svmRadial, RF 
## Number of resamples: 18 
## 
## Accuracy 
##                Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## LDA       0.6938776 0.7437755 0.7777551 0.7931973 0.8400000 0.90    0
## KNN       0.6938776 0.7437755 0.7777551 0.7931973 0.8400000 0.90    0
## cart      0.7142857 0.7755102 0.7959184 0.8033333 0.8391837 0.90    0
## svmLinear 0.7200000 0.7551020 0.7879592 0.7943311 0.8325510 0.90    0
## svmRadial 0.7346939 0.7806122 0.8163265 0.8168027 0.8600000 0.90    0
## RF        0.7200000 0.8163265 0.8283673 0.8336508 0.8600000 0.94    0
## 
## Kappa 
##                Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LDA       0.3360434 0.4674658 0.5207002 0.5552678 0.6586760 0.7810858    0
## KNN       0.3360434 0.4674658 0.5207002 0.5552678 0.6586760 0.7810858    0
## cart      0.3740876 0.4877221 0.5340167 0.5572735 0.6450972 0.7763864    0
## svmLinear 0.3805310 0.4480546 0.5490691 0.5542787 0.6435939 0.7810858    0
## svmRadial 0.4186047 0.5183790 0.5935484 0.5961001 0.6869410 0.7810858    0
## RF        0.3670886 0.5898862 0.6221991 0.6364404 0.6998285 0.8739496    0

Time of Execution

toc()

## 33.17 sec elapsed

submission_rf= 
    data.frame(
        PassengerId= submission$PassengerId, 
        Survived= factor(predict(fit_rf, test_df), 
                         levels = c("No 0", "Yes 1"), 
                         labels= c(0,1)))

submission_svmradial= 
    data.frame(
        PassengerId= submission$PassengerId, 
        Survived= factor(predict(fit_svmradial, test_df), 
                         levels = c("No 0", "Yes 1"), 
                         labels= c(0,1)))

submission_linear= 
    data.frame(
        PassengerId= submission$PassengerId, 
        Survived= factor(predict(fit_svmlinear, 
                                 test_df), 
                         levels = c("No 0", "Yes 1"), 
                         labels= c(0,1)))

submission_rpart=
  data.frame(
    PassengerId= submission$PassengerId,
    Survived= factor(predict(fit_rpart,
                             test_df),
                     levels= c("No 0", "Yes 1"),
                     labels= c(0,1))
  )

Printing top 20 results

print(submission_svmradial %>% head(20))

##    PassengerId Survived
## 1          892        0
## 2          893        1
## 3          894        0
## 4          895        0
## 5          896        0
## 6          897        0
## 7          898        1
## 8          899        0
## 9          900        1
## 10         901        0
## 11         902        0
## 12         903        0
## 13         904        1
## 14         905        0
## 15         906        1
## 16         907        1
## 17         908        0
## 18         909        0
## 19         910        1
## 20         911        1

readr::write_csv(submission_svmradial, "subradial1.csv")
readr::write_csv(submission_rf, "subrf1.csv")
readr::write_csv(submission_linear, "sublinear.csv")
readr::write_csv(submission_rpart, "submrpart1.csv")

Kaggle Result

kableExtra::kable(data.frame(method= c("rf", "svmradial"),
           accuracy= c("76%", "78%")))

method	accuracy
rf	76%
svmradial	78%

end_time <- Sys.time()

Total execution Time

end_time- start_time

## Time difference of 1.180538 mins

END

Regards

Please visit my profile

Alok Pratap Singh (Research Scholar)

Linkedin (Open in New TAB)

Department of Psychology

apsingh@alluniv.ac.in

University of Allahabad

Without data you’re just another person with an opinion