This is my first attempt at a Kaggle script. I chose to work with the Titanic dataset after spending some time poking around on the site and looking at scripts by other Kagglers for inspiration. I will also focus on some illustrative data visualizations along the way. I’ll then use machine learning algorithms such as Logistic Regression, Random Forest, Support Vector Machines (SVM), Linear Discriminant Analysis and K Nearest Neighbours to build models for predicting survival on the Titanic. I am new to machine learning and hoping to learn a lot, so feedback is very welcome!
There are four parts to my script, as follows:
1) Feature engineering
2) Missing value imputation
3) Exploratory Data Analysis
4) Building Models for Prediction
The goal of this exercise is to build and train the best machine learning model for predicting whether a passenger survived, based on the features supplied to the model. We will evaluate the candidate models on their accuracy.
Loading The Necessary Libraries
library(ggplot2)
library(caret)
library(randomForest)
library(e1071)
library(grid)
library(gridExtra)
library(mice)
library(dplyr)
library(pscl)
library(knncat)
Loading the training and testing datasets
train <- read.csv("C:/Data Science/R/Exercises/Titanic/trainingdataset.csv",na.strings = c("N/A","DIV/0!",""), stringsAsFactors = FALSE)
test <- read.csv("C:/Data Science/R/Exercises/Titanic/testingdataset.csv",na.strings = c("N/A","DIV/0!",""),stringsAsFactors = FALSE)
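# Combine train and test so that feature engineering and imputation treat both consistently;
# they are split apart again before model building.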
training <- bind_rows(train,test)
Understanding the basic properties of the combined dataset
summary(training)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin
## Length:1309 Min. : 0.000 Length:1309
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.454 Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
## Embarked
## Length:1309
## Class :character
## Mode :character
##
##
##
##
dim(training)
## [1] 1309 12
str(training)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
It can be seen that there are some missing values and empty strings in the dataset. We will convert the empty strings to NA and then tabulate the NA values.
training[training==""] <- NA
a <- apply(training,2,is.na)
summary(a)
## PassengerId Survived Pclass Name
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1309 FALSE:891 FALSE:1309 FALSE:1309
## NA's :0 TRUE :418 NA's :0 NA's :0
## NA's :0
## Sex Age SibSp Parch
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1309 FALSE:1046 FALSE:1309 FALSE:1309
## NA's :0 TRUE :263 NA's :0 NA's :0
## NA's :0
## Ticket Fare Cabin Embarked
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1309 FALSE:1308 FALSE:295 FALSE:1307
## NA's :0 TRUE :1 TRUE :1014 TRUE :2
## NA's :0 NA's :0 NA's :0
apply(a,2,sum)
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 1014 2
It can be seen that the Age, Cabin and Embarked variables have missing values, with Cabin missing by far the most. These missing values can be imputed using the ‘Multivariate Imputation by Chained Equations’ (mice) package.
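Before imputing, a quick way to see which variables tend to be missing together is mice’s md.pattern() (a quick sketch; mice is already loaded above):
md.pattern(training)   # tabulates the distinct missing-data patterns across variables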
Feature Engineering
“Name” column: the dataset contains a “Name” column with each passenger’s full name. The individual names themselves may not be useful for prediction, but the title within each name (e.g. Mr., Mrs.) can give better insight into the passenger profile and can be used for model building.
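# The regex deletes everything up to and including the comma-space and everything from the "." onward, leaving only the title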
training$Salutation <- gsub('(.*, )|(\\..*)', '',training$Name)
table(training$Sex,training$Salutation)
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme
## female 0 0 0 1 1 0 1 0 0 260 2 1
## male 1 4 1 0 7 1 0 2 61 0 0 0
##
## Mr Mrs Ms Rev Sir the Countess
## female 0 197 2 0 0 1
## male 757 0 0 8 1 0
It can be seen that only ‘Mr’, ‘Mrs’, ‘Miss’ and ‘Master’ are prominent titles. We can group equivalent titles together and label the rarer ones as “Misc”, as follows:
misc <- c("Capt","Col","Don","Dr","Jonkheer","Lady","Major","Rev","Sir","the Countess","Dona")
training$Salutation[training$Salutation == "Mlle"] <- "Miss"
training$Salutation[training$Salutation == "Mme"] <- "Miss"
training$Salutation[training$Salutation %in% misc] <- "Misc"
table(training$Sex,training$Salutation)
##
## Master Misc Miss Mr Mrs Ms
## female 0 4 263 0 197 2
## male 61 25 0 757 0 0
Separating the surname from the full name in the dataset
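# Split each name on "," or "." and keep the first piece, i.e. the surname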
training$Surname <- sapply(training$Name,function(x) strsplit(x, split = '[,.]')[[1]][1])
s <- nlevels(factor(training$Surname))
paste('We have', s, 'unique surnames in the training dataset amongst', nrow(training), 'passengers.')
## [1] "We have 875 unique surnames in the training dataset amongst 1309 passengers."
Determining the family size and creating a Family variable
training$FamilySize <- training$SibSp + training$Parch + 1
training$Family <- paste(training$Surname, '-', training$FamilySize)
Creating a new variable to identify each passenger's deck, which can be derived from the first character of the cabin number.
training$Deck <- substr(training$Cabin,1,1)
paste("Titanic has", nlevels(factor(training$Deck)),"decks on the ship.")
## [1] "Titanic has 8 decks on the ship."
With new features engineered from the existing columns, we can now predict and fill in the missing values. Before imputation, the character variables are converted to factors, whose levels give the imputation model more to work with than raw character strings.
trainingreengineered <- training[,c("Survived","Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","Salutation","Deck","FamilySize","Family")]
factorvars <- c("Survived","Pclass","Sex","Embarked","Salutation","Deck","Family")
trainingreengineered[factorvars] <- lapply(trainingreengineered[factorvars], as.factor)
str(trainingreengineered)
## 'data.frame': 1309 obs. of 12 variables:
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ Salutation: Factor w/ 6 levels "Master","Misc",..: 4 5 3 5 4 4 4 1 5 5 ...
## $ Deck : Factor w/ 8 levels "A","B","C","D",..: NA 3 NA 3 NA NA 5 NA NA NA ...
## $ FamilySize: num 2 2 1 2 1 1 1 5 3 2 ...
## $ Family : Factor w/ 875 levels "Abbing - ","Abbott - ",..: 101 183 335 273 16 544 506 614 388 565 ...
Missing Value Imputation
This is done with “random forest” as the imputation method, since it handles prediction involving categorical variables well. The new, complete dataset is stored as “imputedtrainingreengineered”.
set.seed(6)
imp <- mice(trainingreengineered, method = "rf", m = 5)
##
## iter imp variable
## 1 1 Survived Age Fare Embarked Deck
## 1 2 Survived Age Fare Embarked Deck
## 1 3 Survived Age Fare Embarked Deck
## 1 4 Survived Age Fare Embarked Deck
## 1 5 Survived Age Fare Embarked Deck
## 2 1 Survived Age Fare Embarked Deck
## 2 2 Survived Age Fare Embarked Deck
## 2 3 Survived Age Fare Embarked Deck
## 2 4 Survived Age Fare Embarked Deck
## 2 5 Survived Age Fare Embarked Deck
## 3 1 Survived Age Fare Embarked Deck
## 3 2 Survived Age Fare Embarked Deck
## 3 3 Survived Age Fare Embarked Deck
## 3 4 Survived Age Fare Embarked Deck
## 3 5 Survived Age Fare Embarked Deck
## 4 1 Survived Age Fare Embarked Deck
## 4 2 Survived Age Fare Embarked Deck
## 4 3 Survived Age Fare Embarked Deck
## 4 4 Survived Age Fare Embarked Deck
## 4 5 Survived Age Fare Embarked Deck
## 5 1 Survived Age Fare Embarked Deck
## 5 2 Survived Age Fare Embarked Deck
## 5 3 Survived Age Fare Embarked Deck
## 5 4 Survived Age Fare Embarked Deck
## 5 5 Survived Age Fare Embarked Deck
imputedtrainingreengineered <- complete(imp)
summary(imp)
## Multiply imputed data set
## Call:
## mice(data = trainingreengineered, m = 5, method = "rf")
## Number of multiple imputations: 5
## Missing cells per column:
## Survived Pclass Sex Age SibSp Parch
## 418 0 0 263 0 0
## Fare Embarked Salutation Deck FamilySize Family
## 1 2 0 1014 0 0
## Imputation methods:
## Survived Pclass Sex Age SibSp Parch
## "rf" "rf" "rf" "rf" "rf" "rf"
## Fare Embarked Salutation Deck FamilySize Family
## "rf" "rf" "rf" "rf" "rf" "rf"
## VisitSequence:
## Survived Age Fare Embarked Deck
## 1 4 7 8 10
## PredictorMatrix:
## Survived Pclass Sex Age SibSp Parch Fare Embarked Salutation
## Survived 0 1 1 1 1 1 1 1 1
## Pclass 0 0 0 0 0 0 0 0 0
## Sex 0 0 0 0 0 0 0 0 0
## Age 1 1 1 0 1 1 1 1 1
## SibSp 0 0 0 0 0 0 0 0 0
## Parch 0 0 0 0 0 0 0 0 0
## Fare 1 1 1 1 1 1 0 1 1
## Embarked 1 1 1 1 1 1 1 0 1
## Salutation 0 0 0 0 0 0 0 0 0
## Deck 1 1 1 1 1 1 1 1 1
## FamilySize 0 0 0 0 0 0 0 0 0
## Family 0 0 0 0 0 0 0 0 0
## Deck FamilySize Family
## Survived 1 1 1
## Pclass 0 0 0
## Sex 0 0 0
## Age 1 1 1
## SibSp 0 0 0
## Parch 0 0 0
## Fare 1 1 1
## Embarked 1 1 1
## Salutation 0 0 0
## Deck 0 1 1
## FamilySize 0 0 0
## Family 0 0 0
## Random generator seed value: NA
Verifying the quality of the imputation by checking whether any missing values remain:
apply(apply(imputedtrainingreengineered,2,is.na),2,sum)
## Survived Pclass Sex Age SibSp Parch
## 0 0 0 0 0 0
## Fare Embarked Salutation Deck FamilySize Family
## 0 0 0 0 0 0
We can compare the distributions before and after imputation for any of the variables, say ‘Fare’:
par(mfrow=c(1,2))
hist(trainingreengineered$Fare, main = "Before Imputation", col = "violet")
hist(imputedtrainingreengineered$Fare, main = "Post Imputation", col = "blue")

As the distribution pattern remains essentially unchanged before and after imputation, the imputation quality looks good.
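For a more direct check, mice also provides a lattice densityplot() method for mids objects that overlays the observed and imputed distributions of each imputed numeric variable (a quick sketch):
densityplot(imp)   # observed vs imputed densities for Age, Fare, etc.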
Exploratory Data Analysis
Let us try to establish the relationship between family size and survival.
ggplot(imputedtrainingreengineered,aes(x=FamilySize, fill = factor(Survived))) + geom_bar(stat = "count", position = "dodge") + scale_x_continuous(breaks = c(1:11)) + labs( x= "Familysize")

Single passengers are by far the largest group, but proportionally more of them perished than survived; the survival counts look best for families of two to four.
Further, we can group family size into ‘Single’, ‘Small’ and ‘Large’, and plot a new graph based on this grouping.
imputedtrainingreengineered$FamilySizeC[imputedtrainingreengineered$FamilySize == 1] <- 'Single'
imputedtrainingreengineered$FamilySizeC[imputedtrainingreengineered$FamilySize > 1 & imputedtrainingreengineered$FamilySize < 5] <- 'Small'
imputedtrainingreengineered$FamilySizeC[imputedtrainingreengineered$FamilySize > 4] <- 'Large'
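Equivalently, the three bins can be produced in one line with cut(); a sketch using the same boundaries (1 = Single, 2-4 = Small, 5 and above = Large):
imputedtrainingreengineered$FamilySizeC <- cut(imputedtrainingreengineered$FamilySize, breaks = c(0, 1, 4, Inf), labels = c('Single', 'Small', 'Large'))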
mosaicplot(table(imputedtrainingreengineered$FamilySizeC,imputedtrainingreengineered$Survived), main = "Survival by Family Size", shade = TRUE)

It can easily be inferred that small families have a higher survival rate than both singles and large families.
Now, we’ll look at the relationship between Age & Survived for each Sex.
ggplot(imputedtrainingreengineered,aes(x = Age, fill = factor(Survived))) + geom_histogram()+ facet_grid(.~Sex)

table(imputedtrainingreengineered$Sex,imputedtrainingreengineered$Survived)
##
## 0 1
## female 133 333
## male 676 167
The histograms show broadly similar age distributions for survivors and non-survivors within each sex, though the peaks differ. More strikingly, the table shows that the female survival rate (about 71%) is far higher than the male survival rate (about 20%).
Let’s now split passengers into Child and Adult.
imputedtrainingreengineered$Child[imputedtrainingreengineered$Age < 18] <- 'Child'
imputedtrainingreengineered$Child[imputedtrainingreengineered$Age>= 18] <- 'Adult'
table(imputedtrainingreengineered$Child,imputedtrainingreengineered$Survived)
##
## 0 1
## Adult 714 412
## Child 95 88
ggplot(imputedtrainingreengineered,aes(x = Age, fill = factor(Survived))) + geom_bar(stat = "count")+ facet_grid(.~Child)

From the table and graph, about 48% of the children on board survived (88 of 183), compared with about 37% of the adults (412 of 1126).
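These survival proportions can be read off directly with prop.table() applied to the same table (each row then sums to 1):
round(prop.table(table(imputedtrainingreengineered$Child, imputedtrainingreengineered$Survived), margin = 1), 2)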
Let’s further identify mothers among the adults. We categorize ‘Mother’ versus ‘Not Mother’: a mother is an adult female (Age > 18) with at least one child aboard (Parch > 0) and a salutation other than ‘Miss’. Since ‘Child’ is already defined, the remaining adult males could be identified by tabulating Sex against Child.
imputedtrainingreengineered$Mother <- 'Not Mother'
imputedtrainingreengineered$Mother[imputedtrainingreengineered$Sex == 'female'& imputedtrainingreengineered$Age > 18 & imputedtrainingreengineered$Parch > 0 & imputedtrainingreengineered$Salutation != 'Miss'] <- 'Mother'
table(imputedtrainingreengineered$Mother,imputedtrainingreengineered$Survived)
##
## 0 1
## Mother 24 61
## Not Mother 785 439
It can be summarized from the above table that about 72% of the mothers on board survived (61 of 85), versus about 36% of everyone else (439 of 1224).
Converting the character variables into factor variables
factorvars2 <- c("Child","Mother")
imputedtrainingreengineered[factorvars2] <- lapply(imputedtrainingreengineered[factorvars2], as.factor)
variables <- c("Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","Salutation","Deck","FamilySize","Child","Mother","Survived")
imputedtrainingreengineered <- imputedtrainingreengineered[,variables]
str(imputedtrainingreengineered)
## 'data.frame': 1309 obs. of 13 variables:
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## ..- attr(*, "contrasts")= num [1:3, 1:2] 0 1 0 0 0 1
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "1" "2" "3"
## .. .. ..$ : chr "2" "3"
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## ..- attr(*, "contrasts")= num [1:2, 1] 0 1
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "female" "male"
## .. .. ..$ : chr "2"
## $ Age : num 22 38 26 35 35 19 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## ..- attr(*, "contrasts")= num [1:3, 1:2] 0 1 0 0 0 1
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "C" "Q" "S"
## .. .. ..$ : chr "2" "3"
## $ Salutation: Factor w/ 6 levels "Master","Misc",..: 4 5 3 5 4 4 4 1 5 5 ...
## ..- attr(*, "contrasts")= num [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "Master" "Misc" "Miss" "Mr" ...
## .. .. ..$ : chr "2" "3" "4" "5" ...
## $ Deck : Factor w/ 8 levels "A","B","C","D",..: 6 3 4 3 4 6 5 5 5 2 ...
## ..- attr(*, "contrasts")= num [1:8, 1:7] 0 1 0 0 0 0 0 0 0 0 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "A" "B" "C" "D" ...
## .. .. ..$ : chr "2" "3" "4" "5" ...
## $ FamilySize: num 2 2 1 2 1 1 1 5 3 2 ...
## $ Child : Factor w/ 2 levels "Adult","Child": 1 1 1 1 1 1 1 2 1 2 ...
## $ Mother : Factor w/ 2 levels "Mother","Not Mother": 2 2 2 2 2 2 2 2 1 2 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## ..- attr(*, "contrasts")= num [1:2, 1] 0 1
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "0" "1"
## .. .. ..$ : chr "2"
Let’s see the survival rate with respect to passenger class.
p1 <- ggplot(imputedtrainingreengineered,aes(x = Pclass,fill = factor(Survived)))+geom_bar(stat = "count", position = "stack")
p2 <- ggplot(imputedtrainingreengineered,aes(x = Pclass,fill = factor(Survived)))+geom_bar(position = "fill")+labs(y = "Proportion")
grid.arrange(p1,p2,ncol=2)

It can be seen from the above graphs that about 58% of 1st class passengers, 40% of 2nd class passengers and 27% of 3rd class passengers survived.
Survival with respect to port of embarkation
par(mfrow = c(1,2))
qplot(Embarked,data = imputedtrainingreengineered,geom = "density",fill = factor(Survived), alpha = I(0.5),xlab = "Embarked", ylab = "Density", main = "Kernel Density Plot for Port of Embarkment And Survival")

ggplot(imputedtrainingreengineered,aes(x = Embarked, fill = factor(Survived))) + geom_bar(position = "fill")+labs(y="Proportion")

The kernel density plot shows that “C” is the only port where the survival rate exceeds the non-survival rate; at ports “Q” and “S” the reverse holds. From the proportions, about 51% of passengers who embarked at “C” survived, versus roughly 40% at “Q” and 34% at “S”.
Let us now analyze the salutations of survivors across the decks.
ggplot(imputedtrainingreengineered,aes(Survived, fill= factor(Survived))) + geom_bar(stat = "count") + facet_grid(Deck ~Salutation)

It can be seen from the above graph that, in almost every deck, passengers with the salutation ‘Miss’ had the highest survival counts while those with ‘Mr’ had the highest non-survival counts.
Now let us analyze the average and highest fare paid in each passenger class.
tapply(imputedtrainingreengineered$Fare,imputedtrainingreengineered$Pclass,mean)
## 1 2 3
## 87.50899 21.17920 13.29506
qplot(Fare,Pclass,data = imputedtrainingreengineered, geom = c("point","smooth"), method = "lm", formula = y~x, col = Sex, main = "Regression of Fare on Passenger Class By Each Sex", ylab = "Passenger class", xlab = "Fare")

Among the many conclusions that can be drawn from the regression, the most obvious is that the highest fares in classes 1 and 3 were paid by females, and that those fares were substantially above the mean for the respective class.
Now let us analyze the average and highest fare paid on each deck.
tapply(imputedtrainingreengineered$Fare,imputedtrainingreengineered$Deck,mean)
## A B C D E F G T
## 25.64440 55.14814 50.70856 26.95982 21.10704 13.91100 14.37612 19.79834
qplot(Deck,Fare, data = imputedtrainingreengineered, geom = c("boxplot"), fill = Sex, main = "Fare Per Deck",xlab = "", ylab = "Fare")

It can be deduced from the table and chart that decks “B” and “C” saw the highest fares, and that in almost every deck except “T” the average fare paid by females exceeds that paid by males.
Conditioning plot of Age versus Salutation, conditioned on survival status
coplot(Age~Salutation|Survived,data = imputedtrainingreengineered, panel = panel.smooth,xlab = "Salutations",ylab = "Age",columns = 2)

It can be inferred from the above graph that the average age of survivors with the salutations “Mr” and “Miss” is slightly higher than that of non-survivors. Beyond that, average ages differ little between survivors and non-survivors across salutations.
Let’s plot a violin graph of passenger class and age.
ggplot(imputedtrainingreengineered, aes(x=Pclass,y=Age,fill = factor(Survived))) + geom_violin()

The graph shows that passengers above 60 years of age had a very slim chance of survival in 2nd and 3rd class compared with 1st class.
Statistical Inference
Using a t-test, we will check the null hypothesis (H0) for Fare versus survival.
# H0 : There is no difference between the two population means.
# H1 : There is difference between the two population means.
t.test(Fare~Survived,data = imputedtrainingreengineered)
##
## Welch Two Sample t-test
##
## data: Fare by Survived
## t = -8.1006, df = 646.65, p-value = 2.73e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -33.23113 -20.26365
## sample estimates:
## mean in group 0 mean in group 1
## 23.05924 49.80663
As the p-value < 0.05, we reject H0: there is a difference between the two population means. The average fare paid by survivors ($49.81) is about $27 higher than that of non-survivors ($23.06).
Using a t-test, we will check the null hypothesis (H0) for Age versus survival.
# H0 : There is no difference between the two population means.
# H1 : There is difference between the two population means.
t.test(Age~Survived,data = imputedtrainingreengineered)
##
## Welch Two Sample t-test
##
## data: Age by Survived
## t = 0.038893, df = 966.54, p-value = 0.969
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.567218 1.630595
## sample estimates:
## mean in group 0 mean in group 1
## 29.34837 29.31668
As the p-value (0.969) is far greater than 0.05, we fail to reject H0: there is no evidence of a difference in mean age between the two groups. The sample means bear this out; at 29.35 versus 29.32 years, they are virtually identical.
Using a chi-square test, we will check the null hypothesis (H0) that passenger class and survival are independent.
# H0 : Passenger Class and Survivals are independent.
# H1 : Passenger Class and Survivals are not independent.
chisq.test(imputedtrainingreengineered$Pclass,imputedtrainingreengineered$Survived)
##
## Pearson's Chi-squared test
##
## data: imputedtrainingreengineered$Pclass and imputedtrainingreengineered$Survived
## X-squared = 117.28, df = 2, p-value < 2.2e-16
As the p-value < 0.05, we reject H0: passenger class and survival are not independent. This conclusion matches our earlier exploratory analysis.
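For reference, the expected counts under independence are available from the test object; comparing them with the observed table shows where the dependence arises (a sketch):
chisq.test(imputedtrainingreengineered$Pclass, imputedtrainingreengineered$Survived)$expected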
Building The Models
Let us first separate the training and testing data once again.
trainingdataset <- imputedtrainingreengineered[c(1:891),]
testingdataset <- imputedtrainingreengineered[c(892:1309),]
MODEL 1 : Logistic Regression
model1 <- glm(Survived ~ ., family = binomial(link = 'logit'), data = trainingdataset)
anova(model1, test = 'Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Survived
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 890 1186.66
## Pclass 2 103.547 888 1083.11 < 2.2e-16 ***
## Sex 1 256.220 887 826.89 < 2.2e-16 ***
## Age 1 15.417 886 811.47 8.622e-05 ***
## SibSp 1 13.556 885 797.92 0.0002315 ***
## Parch 1 0.417 884 797.50 0.5181887
## Fare 1 1.520 883 795.98 0.2176495
## Embarked 2 3.814 881 792.16 0.1485020
## Salutation 5 62.405 876 729.76 3.866e-12 ***
## Deck 7 13.106 869 716.65 0.0695566 .
## FamilySize 0 0.000 869 716.65
## Child 1 0.711 868 715.94 0.3990952
## Mother 1 0.114 867 715.83 0.7361542
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference between the null deviance and the residual deviance shows how our model performs against the null model (a model with only the intercept); the wider this gap, the better. Reading the table, we can see the drop in deviance as each variable is added in turn. Adding Pclass, Sex and Age reduces the residual deviance substantially, as does Salutation, and SibSp also has a low p-value. The other variables improve the model less: a large p-value indicates that the model without that variable explains roughly the same amount of variation. Note that FamilySize contributes zero degrees of freedom because it is a linear combination of SibSp and Parch, which are already in the model. Ultimately we want to see a significant drop in deviance and in the AIC.
While no exact equivalent to the R2 of linear regression exists, the McFadden R2 index can be used to assess the model fit.
pR2(model1)
## llh llhNull G2 McFadden r2ML
## -357.9136485 -593.3275684 470.8278399 0.3967689 0.4104680
## r2CU
## 0.5576976
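As a cross-check, McFadden’s R2 can be computed by hand as one minus the ratio of residual to null deviance, which should reproduce pR2’s value of about 0.397:
1 - model1$deviance / model1$null.deviance   # McFadden pseudo R-squared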
Assessing the predictive ability of model 1
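The difference ratios and the ensemble below use fit1, the model 1 predictions on the test set. A minimal sketch to produce them, assuming the conventional 0.5 probability cutoff:
fit1 <- predict(model1, newdata = testingdataset, type = 'response')   # predicted survival probabilities
fit1 <- as.factor(ifelse(fit1 > 0.5, 1, 0))                            # threshold at 0.5 to get class labels
confusionMatrix(fit1, testingdataset$Survived)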
MODEL 2 : Random Forest
model2 <- randomForest(Survived ~ ., data = trainingdataset, importance = TRUE)
model2
##
## Call:
## randomForest(formula = Survived ~ ., data = trainingdataset, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 17.28%
## Confusion matrix:
## 0 1 class.error
## 0 495 54 0.09836066
## 1 100 242 0.29239766
The random forest model shows an out-of-bag (OOB) error estimate of about 17%.
The importance of each variable in the classification is shown below; the Fare variable has the highest importance.
varImpPlot(model2)

Assessing the predictive ability of model 2
fit2 <- predict(model2,newdata = testingdataset)
confusionMatrix(fit2, testingdataset$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 208 63
## 1 52 95
##
## Accuracy : 0.7249
## 95% CI : (0.6794, 0.7672)
## No Information Rate : 0.622
## P-Value [Acc > NIR] : 5.966e-06
##
## Kappa : 0.4068
## Mcnemar's Test P-Value : 0.3511
##
## Sensitivity : 0.8000
## Specificity : 0.6013
## Pos Pred Value : 0.7675
## Neg Pred Value : 0.6463
## Prevalence : 0.6220
## Detection Rate : 0.4976
## Detection Prevalence : 0.6483
## Balanced Accuracy : 0.7006
##
## 'Positive' Class : 0
##
a <- confusionMatrix(fit2, testingdataset$Survived)$overall[1]
print(paste('Accuracy of Random Forest Model = ', a))
## [1] "Accuracy of Random Forest Model = 0.72488038277512"
MODEL 3 : Support Vector Machine
model3 <- svm(Survived ~ ., data = trainingdataset)
model3
##
## Call:
## svm(formula = Survived ~ ., data = trainingdataset)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.04
##
## Number of Support Vectors: 418
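The cost and gamma parameters were left at their defaults; e1071’s tune.svm() can grid-search them and may squeeze out a little more accuracy (a sketch with an illustrative grid):
tuned <- tune.svm(Survived ~ ., data = trainingdataset, gamma = 10^(-3:-1), cost = 10^(0:2))
summary(tuned)   # best gamma/cost combination by cross-validated error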
Assessing the predictive ability of model 3
fit3 <- predict(model3, newdata = testingdataset)
confusionMatrix(fit3, testingdataset$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 202 54
## 1 58 104
##
## Accuracy : 0.7321
## 95% CI : (0.6869, 0.7739)
## No Information Rate : 0.622
## P-Value [Acc > NIR] : 1.317e-06
##
## Kappa : 0.433
## Mcnemar's Test P-Value : 0.7768
##
## Sensitivity : 0.7769
## Specificity : 0.6582
## Pos Pred Value : 0.7891
## Neg Pred Value : 0.6420
## Prevalence : 0.6220
## Detection Rate : 0.4833
## Detection Prevalence : 0.6124
## Balanced Accuracy : 0.7176
##
## 'Positive' Class : 0
##
b <- confusionMatrix(fit3, testingdataset$Survived)$overall[1]
print(paste('Accuracy of Support Vector Machine Model = ', b))
## [1] "Accuracy of Support Vector Machine Model = 0.732057416267943"
MODEL 4 : Linear Discriminant Analysis Method
model4 <- train(Survived ~ .,data = trainingdataset, method = "lda")
model4
## Linear Discriminant Analysis
##
## 891 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 891, 891, 891, 891, 891, 891, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8173444 0.6098066
##
##
Assessing the predictive ability of model 4
fit4 <- predict(model4, newdata = testingdataset)
confusionMatrix(fit4, testingdataset$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 202 56
## 1 58 102
##
## Accuracy : 0.7273
## 95% CI : (0.6819, 0.7694)
## No Information Rate : 0.622
## P-Value [Acc > NIR] : 3.648e-06
##
## Kappa : 0.4214
## Mcnemar's Test P-Value : 0.9254
##
## Sensitivity : 0.7769
## Specificity : 0.6456
## Pos Pred Value : 0.7829
## Neg Pred Value : 0.6375
## Prevalence : 0.6220
## Detection Rate : 0.4833
## Detection Prevalence : 0.6172
## Balanced Accuracy : 0.7112
##
## 'Positive' Class : 0
##
l <- confusionMatrix(fit4, testingdataset$Survived)$overall[1]
print(paste('Accuracy of Linear Discriminant Analysis Model = ', l))
## [1] "Accuracy of Linear Discriminant Analysis Model = 0.727272727272727"
MODEL 5 : K Nearest Neighbour Method
model5 <- knncat(trainingdataset, classcol = 13)
model5
## Training set misclass rate: 19.75%
fit5 <- predict(model5, trainingdataset, testingdataset, train.classcol = 13, newdata.classcol = 13)
table(fit5,testingdataset$Survived)
##
## fit5 0 1
## 0 206 60
## 1 54 98
misclassificationerror <- mean( fit5 != testingdataset$Survived)
print(paste('Accuracy of KNN Model = ', 1-misclassificationerror))
## [1] "Accuracy of KNN Model = 0.727272727272727"
Revised Solution and Analysis for Better Prediction
We will use the technique of ‘Ensembling’ to improve our prediction accuracy: the predictions of the logistic regression, random forest and SVM models are combined as a weighted average, which will hopefully raise the accuracy. First, let’s see how much the individual models’ predictions differ from one another.
cat('Difference ratio between logistic regression and random forest:', sum(fit1!=fit2)/nrow(testingdataset))
## Difference ratio between logistic regression and random forest: 0.07894737
cat('Difference ratio between logistic regression and SVM:', sum(fit1!=fit3)/nrow(testingdataset))
## Difference ratio between logistic regression and SVM: 0.05263158
cat('Difference ratio between random forest and SVM:', sum(fit2!=fit3)/nrow(testingdataset))
## Difference ratio between random forest and SVM: 0.05023923
cat('Difference ratio between logistic regression and LDA:', sum(fit1!=fit4)/nrow(testingdataset))
## Difference ratio between logistic regression and LDA: 0.0430622
cat('Difference ratio between logistic regression and KNN:', sum(fit1!=fit5)/nrow(testingdataset))
## Difference ratio between logistic regression and KNN: 0.06698565
cat('Difference ratio between random forest and LDA:', sum(fit2!=fit4)/nrow(testingdataset))
## Difference ratio between random forest and LDA: 0.05502392
cat('Difference ratio between random forest and KNN:', sum(fit2!=fit5)/nrow(testingdataset))
## Difference ratio between random forest and KNN: 0.05980861
cat('Difference ratio between SVM and LDA:', sum(fit3!=fit4)/nrow(testingdataset))
## Difference ratio between SVM and LDA: 0.009569378
cat('Difference ratio between SVM and KNN:', sum(fit3!=fit5)/nrow(testingdataset))
## Difference ratio between SVM and KNN: 0.02870813
cat('Difference ratio between LDA and KNN:', sum(fit4!=fit5)/nrow(testingdataset))
## Difference ratio between LDA and KNN: 0.03349282
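# Weighted vote: 20% logistic regression + 40% random forest + 40% SVM.
# as.numeric() turns the factor levels into 1/2, so subtract 1 to recover 0/1 before averaging.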
ensemble <- 0.2*(as.numeric(fit1)-1) + 0.4*(as.numeric(fit2)-1) + 0.4*(as.numeric(fit3)-1)
ensemble <- round(ensemble)
confusionMatrix(factor(ensemble, levels = c(0, 1)), testingdataset$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 211 63
## 1 49 95
##
## Accuracy : 0.7321
## 95% CI : (0.6869, 0.7739)
## No Information Rate : 0.622
## P-Value [Acc > NIR] : 1.317e-06
##
## Kappa : 0.4201
## Mcnemar's Test P-Value : 0.2193
##
## Sensitivity : 0.8115
## Specificity : 0.6013
## Pos Pred Value : 0.7701
## Neg Pred Value : 0.6597
## Prevalence : 0.6220
## Detection Rate : 0.5048
## Detection Prevalence : 0.6555
## Balanced Accuracy : 0.7064
##
## 'Positive' Class : 0
##
c <- confusionMatrix(factor(ensemble, levels = c(0, 1)), testingdataset$Survived)$overall[1]
print(paste('Accuracy of Ensembled Model = ', c))
## [1] "Accuracy of Ensembled Model = 0.732057416267943"
Conclusion
The accuracy of the ensembled (weighted-average) model matches the SVM’s and is the highest among the models above, at about 73.2%. We will therefore select the ensemble for prediction.