This is my first attempt at a Kaggle script. I chose to work with the Titanic dataset after spending some time poking around on the site and looking at scripts by other Kagglers for inspiration. I will also focus on some illustrative data visualizations along the way. I’ll then use machine learning algorithms such as Logistic Regression, Random Forest, Support Vector Machines (SVM), Linear Discriminant Analysis and K Nearest Neighbours to build models for predicting survival on the Titanic. I am new to machine learning and hoping to learn a lot, so feedback is very welcome!
There are four parts to my script, as follows:
1) Feature engineering
2) Missing value imputation
3) Exploratory Data Analysis
4) Building Models for Prediction
The goal of this exercise is to build and train the best machine learning model for predicting whether a passenger survived, based on the features supplied to the model. We will evaluate the candidate models on their accuracy.
Loading The Necessary Libraries
library(ggplot2)
library(caret)
library(randomForest)
library(e1071)
library(grid)
library(gridExtra)
library(mice)
library(dplyr)
library(pscl)
library(knncat)
Loading the training and testing datasets
train <- read.csv("C:/Data Science/R/Exercises/Titanic/trainingdataset.csv",na.strings = c("N/A","DIV/0!",""), stringsAsFactors = FALSE)
test <- read.csv("C:/Data Science/R/Exercises/Titanic/testingdataset.csv",na.strings = c("N/A","DIV/0!",""),stringsAsFactors = FALSE)
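# Combine train and test so that feature engineering and imputation treat both consistently;
# they are split apart again before model building.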
training <- bind_rows(train,test)
Understanding the basic properties of the combined dataset
summary(training)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin
## Length:1309 Min. : 0.000 Length:1309
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.454 Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
## Embarked
## Length:1309
## Class :character
## Mode :character
##
##
##
##
dim(training)
## [1] 1309 12
str(training)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
It can be seen that there are some missing values and empty strings in the dataset. We will convert the empty strings to NA and then tabulate the NA values.
training[training==""] <- NA
a <- apply(training,2,is.na)
summary(a)
## PassengerId Survived Pclass Name
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1309 FALSE:891 FALSE:1309 FALSE:1309
## NA's :0 TRUE :418 NA's :0 NA's :0
## NA's :0
## Sex Age SibSp Parch
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1309 FALSE:1046 FALSE:1309 FALSE:1309
## NA's :0 TRUE :263 NA's :0 NA's :0
## NA's :0
## Ticket Fare Cabin Embarked
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1309 FALSE:1308 FALSE:295 FALSE:1307
## NA's :0 TRUE :1 TRUE :1014 TRUE :2
## NA's :0 NA's :0 NA's :0
apply(a,2,sum)
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 1014 2
It can be seen that the Age, Cabin and Embarked variables have missing values, with Cabin missing by far the most. These missing values can be imputed using the ‘Multivariate Imputation by Chained Equations’ (mice) package.
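Before imputing, a quick way to see which variables tend to be missing together is mice’s md.pattern() (a quick sketch; mice is already loaded above):
md.pattern(training)   # tabulates the distinct missing-data patterns across variables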
Feature Engineering
“Name” column: the dataset contains a “Name” column with each passenger’s full name. The individual names themselves may not be useful for prediction, but the title within each name (e.g. Mr., Mrs.) can give better insight into the passenger profile and can be used for model building.
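# The regex deletes everything up to and including the comma-space and everything from the "." onward, leaving only the title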
training$Salutation <- gsub('(.*, )|(\\..*)', '',training$Name)
table(training$Sex,training$Salutation)
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme
## female 0 0 0 1 1 0 1 0 0 260 2 1
## male 1 4 1 0 7 1 0 2 61 0 0 0
##
## Mr Mrs Ms Rev Sir the Countess
## female 0 197 2 0 0 1
## male 757 0 0 8 1 0
It can be seen that only ‘Mr’, ‘Mrs’, ‘Miss’ and ‘Master’ are prominent titles. We can group equivalent titles together and label the rarer ones as “Misc”, as follows:
misc <- c("Capt","Col","Don","Dr","Jonkheer","Lady","Major","Rev","Sir","the Countess","Dona")
training$Salutation[training$Salutation == "Mlle"] <- "Miss"
training$Salutation[training$Salutation == "Mme"] <- "Miss"
training$Salutation[training$Salutation %in% misc] <- "Misc"
table(training$Sex,training$Salutation)
##
## Master Misc Miss Mr Mrs Ms
## female 0 4 263 0 197 2
## male 61 25 0 757 0 0
Separating the surname from the full name in the dataset
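# Split each name on "," or "." and keep the first piece, i.e. the surname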
training$Surname <- sapply(training$Name,function(x) strsplit(x, split = '[,.]')[[1]][1])
s <- nlevels(factor(training$Surname))
paste('We have', s, 'unique surnames in the training dataset amongst', nrow(training), 'passengers.')
## [1] "We have 875 unique surnames in the training dataset amongst 1309 passengers."
Determining the family size and creating a Family variable
training$FamilySize <- training$SibSp + training$Parch + 1
training$Family <- paste(training$Surname, '-', training$FamilySize)
Creating a new variable to identify each passenger's deck, which can be derived from the first character of the cabin number.
training$Deck <- substr(training$Cabin,1,1)
paste("Titanic has", nlevels(factor(training$Deck)),"decks on the ship.")
## [1] "Titanic has 8 decks on the ship."
With new features engineered from the existing columns, we can now predict and fill in the missing values. Before imputation, the character variables are converted to factors, whose levels give the imputation model more to work with than raw character strings.
trainingreengineered <- training[,c("Survived","Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","Salutation","Deck","FamilySize","Family")]
factorvars <- c("Survived","Pclass","Sex","Embarked","Salutation","Deck","Family")
trainingreengineered[factorvars] <- lapply(trainingreengineered[factorvars], as.factor)
str(trainingreengineered)
## 'data.frame': 1309 obs. of 12 variables:
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ Salutation: Factor w/ 6 levels "Master","Misc",..: 4 5 3 5 4 4 4 1 5 5 ...
## $ Deck : Factor w/ 8 levels "A","B","C","D",..: NA 3 NA 3 NA NA 5 NA NA NA ...
## $ FamilySize: num 2 2 1 2 1 1 1 5 3 2 ...
## $ Family : Factor w/ 875 levels "Abbing - ","Abbott - ",..: 101 183 335 273 16 544 506 614 388 565 ...
Missing Value Imputation
This is done with “random forest” as the imputation method, since it handles prediction involving categorical variables well. The new, complete dataset is stored as “imputedtrainingreengineered”.
set.seed(6)
imp <- mice(trainingreengineered, method = "rf", m = 5)
##
## iter imp variable
## 1 1 Survived Age Fare Embarked Deck
## 1 2 Survived Age Fare Embarked Deck
## 1 3 Survived Age Fare Embarked Deck
## 1 4 Survived Age Fare Embarked Deck
## 1 5 Survived Age Fare Embarked Deck
## 2 1 Survived Age Fare Embarked Deck
## 2 2 Survived Age Fare Embarked Deck
## 2 3 Survived Age Fare Embarked Deck
## 2 4 Survived Age Fare Embarked Deck
## 2 5 Survived Age Fare Embarked Deck
## 3 1 Survived Age Fare Embarked Deck
## 3 2 Survived Age Fare Embarked Deck
## 3 3 Survived Age Fare Embarked Deck
## 3 4 Survived Age Fare Embarked Deck
## 3 5 Survived Age Fare Embarked Deck
## 4 1 Survived Age Fare Embarked Deck
## 4 2 Survived Age Fare Embarked Deck
## 4 3 Survived Age Fare Embarked Deck
## 4 4 Survived Age Fare Embarked Deck
## 4 5 Survived Age Fare Embarked Deck
## 5 1 Survived Age Fare Embarked Deck
## 5 2 Survived Age Fare Embarked Deck
## 5 3 Survived Age Fare Embarked Deck
## 5 4 Survived Age Fare Embarked Deck
## 5 5 Survived Age Fare Embarked Deck
imputedtrainingreengineered <- complete(imp)
summary(imp)
## Multiply imputed data set
## Call:
## mice(data = trainingreengineered, m = 5, method = "rf")
## Number of multiple imputations: 5
## Missing cells per column:
## Survived Pclass Sex Age SibSp Parch
## 418 0 0 263 0 0
## Fare Embarked Salutation Deck FamilySize Family
## 1 2 0 1014 0 0
## Imputation methods:
## Survived Pclass Sex Age SibSp Parch
## "rf" "rf" "rf" "rf" "rf" "rf"
## Fare Embarked Salutation Deck FamilySize Family
## "rf" "rf" "rf" "rf" "rf" "rf"
## VisitSequence:
## Survived Age Fare Embarked Deck
## 1 4 7 8 10
## PredictorMatrix:
## Survived Pclass Sex Age SibSp Parch Fare Embarked Salutation
## Survived 0 1 1 1 1 1 1 1 1
## Pclass 0 0 0 0 0 0 0 0 0
## Sex 0 0 0 0 0 0 0 0 0
## Age 1 1 1 0 1 1 1 1 1
## SibSp 0 0 0 0 0 0 0 0 0
## Parch 0 0 0 0 0 0 0 0 0
## Fare 1 1 1 1 1 1 0 1 1
## Embarked 1 1 1 1 1 1 1 0 1
## Salutation 0 0 0 0 0 0 0 0 0
## Deck 1 1 1 1 1 1 1 1 1
## FamilySize 0 0 0 0 0 0 0 0 0
## Family 0 0 0 0 0 0 0 0 0
## Deck FamilySize Family
## Survived 1 1 1
## Pclass 0 0 0
## Sex 0 0 0
## Age 1 1 1
## SibSp 0 0 0
## Parch 0 0 0
## Fare 1 1 1
## Embarked 1 1 1
## Salutation 0 0 0
## Deck 0 1 1
## FamilySize 0 0 0
## Family 0 0 0
## Random generator seed value: NA
Verifying the quality of the imputation by checking whether any missing values remain:
apply(apply(imputedtrainingreengineered,2,is.na),2,sum)
## Survived Pclass Sex Age SibSp Parch
## 0 0 0 0 0 0
## Fare Embarked Salutation Deck FamilySize Family
## 0 0 0 0 0 0
We can compare the distributions before and after imputation for any of the variables, say ‘Fare’:
par(mfrow=c(1,2))
hist(trainingreengineered$Fare, main = "Before Imputation", col = "violet")
hist(imputedtrainingreengineered$Fare, main = "Post Imputation", col = "blue")

As the distribution pattern remains essentially unchanged before and after imputation, the imputation quality looks good.
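For a more direct check, mice also provides a lattice densityplot() method for mids objects that overlays the observed and imputed distributions of each imputed numeric variable (a quick sketch):
densityplot(imp)   # observed vs imputed densities for Age, Fare, etc.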
Exploratory Data Analysis
Let us try to establish the relationship between family size and survival.
ggplot(imputedtrainingreengineered,aes(x=FamilySize, fill = factor(Survived))) + geom_bar(stat = "count", position = "dodge") + scale_x_continuous(breaks = c(1:11)) + labs( x= "Familysize")

Single passengers are by far the largest group, but proportionally more of them perished than survived; the survival counts look best for families of two to four.
Further, we can group family size into ‘Single’, ‘Small’ and ‘Large’, and plot a new graph based on this grouping.
imputedtrainingreengineered$FamilySizeC[imputedtrainingreengineered$FamilySize == 1] <- 'Single'
imputedtrainingreengineered$FamilySizeC[imputedtrainingreengineered$FamilySize > 1 & imputedtrainingreengineered$FamilySize < 5] <- 'Small'
imputedtrainingreengineered$FamilySizeC[imputedtrainingreengineered$FamilySize > 4] <- 'Large'
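Equivalently, the three bins can be produced in one line with cut(); a sketch using the same boundaries (1 = Single, 2-4 = Small, 5 and above = Large):
imputedtrainingreengineered$FamilySizeC <- cut(imputedtrainingreengineered$FamilySize, breaks = c(0, 1, 4, Inf), labels = c('Single', 'Small', 'Large'))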
mosaicplot(table(imputedtrainingreengineered$FamilySizeC,imputedtrainingreengineered$Survived), main = "Survival by Family Size", shade = TRUE)

It can easily be inferred that small families have a higher survival rate than both singles and large families.
Now, we’ll look at the relationship between Age & Survived for each Sex.
ggplot(imputedtrainingreengineered,aes(x = Age, fill = factor(Survived))) + geom_histogram()+ facet_grid(.~Sex)

table(imputedtrainingreengineered$Sex,imputedtrainingreengineered$Survived)
##
## 0 1
## female 133 333
## male 676 167
The histograms show broadly similar age distributions for survivors and non-survivors within each sex, though the peaks differ. More strikingly, the table shows that the female survival rate (about 71%) is far higher than the male survival rate (about 20%).
Let’s now split passengers into Child and Adult.
imputedtrainingreengineered$Child[imputedtrainingreengineered$Age < 18] <- 'Child'
imputedtrainingreengineered$Child[imputedtrainingreengineered$Age>= 18] <- 'Adult'
table(imputedtrainingreengineered$Child,imputedtrainingreengineered$Survived)
##
## 0 1
## Adult 714 412
## Child 95 88
ggplot(imputedtrainingreengineered,aes(x = Age, fill = factor(Survived))) + geom_bar(stat = "count")+ facet_grid(.~Child)

From the table and graph, about 48% of the children on board survived (88 of 183), compared with about 37% of the adults (412 of 1126).
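These survival proportions can be read off directly with prop.table() applied to the same table (each row then sums to 1):
round(prop.table(table(imputedtrainingreengineered$Child, imputedtrainingreengineered$Survived), margin = 1), 2)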
Let’s further identify mothers among the adults. We categorize ‘Mother’ versus ‘Not Mother’: a mother is an adult female (Age > 18) with at least one child aboard (Parch > 0) and a salutation other than ‘Miss’. Since ‘Child’ is already defined, the remaining adult males could be identified by tabulating Sex against Child.
imputedtrainingreengineered$Mother <- 'Not Mother'
imputedtrainingreengineered$Mother[imputedtrainingreengineered$Sex == 'female'& imputedtrainingreengineered$Age > 18 & imputedtrainingreengineered$Parch > 0 & imputedtrainingreengineered$Salutation != 'Miss'] <- 'Mother'
table(imputedtrainingreengineered$Mother,imputedtrainingreengineered$Survived)
##
## 0 1
## Mother 24 61
## Not Mother 785 439
It can be summarized from the above table that about 72% of the mothers on board survived (61 of 85), versus about 36% of everyone else (439 of 1224).
Converting the character variables into factor variables
factorvars2 <- c("Child","Mother")
imputedtrainingreengineered[factorvars2] <- lapply(imputedtrainingreengineered[factorvars2], as.factor)
variables <- c("Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","Salutation","Deck","FamilySize","Child","Mother","Survived")
imputedtrainingreengineered <- imputedtrainingreengineered[,variables]
str(imputedtrainingreengineered)
## 'data.frame': 1309 obs. of 13 variables:
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## ..- attr(*, "contrasts")= num [1:3, 1:2] 0 1 0 0 0 1
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "1" "2" "3"
## .. .. ..$ : chr "2" "3"
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## ..- attr(*, "contrasts")= num [1:2, 1] 0 1
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "female" "male"
## .. .. ..$ : chr "2"
## $ Age : num 22 38 26 35 35 19 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## ..- attr(*, "contrasts")= num [1:3, 1:2] 0 1 0 0 0 1
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "C" "Q" "S"
## .. .. ..$ : chr "2" "3"
## $ Salutation: Factor w/ 6 levels "Master","Misc",..: 4 5 3 5 4 4 4 1 5 5 ...
## ..- attr(*, "contrasts")= num [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "Master" "Misc" "Miss" "Mr" ...
## .. .. ..$ : chr "2" "3" "4" "5" ...
## $ Deck : Factor w/ 8 levels "A","B","C","D",..: 6 3 4 3 4 6 5 5 5 2 ...
## ..- attr(*, "contrasts")= num [1:8, 1:7] 0 1 0 0 0 0 0 0 0 0 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "A" "B" "C" "D" ...
## .. .. ..$ : chr "2" "3" "4" "5" ...
## $ FamilySize: num 2 2 1 2 1 1 1 5 3 2 ...
## $ Child : Factor w/ 2 levels "Adult","Child": 1 1 1 1 1 1 1 2 1 2 ...
## $ Mother : Factor w/ 2 levels "Mother","Not Mother": 2 2 2 2 2 2 2 2 1 2 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## ..- attr(*, "contrasts")= num [1:2, 1] 0 1
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "0" "1"
## .. .. ..$ : chr "2"
Let’s see the survival rate with respect to passenger class.
p1 <- ggplot(imputedtrainingreengineered,aes(x = Pclass,fill = factor(Survived)))+geom_bar(stat = "count", position = "stack")
p2 <- ggplot(imputedtrainingreengineered,aes(x = Pclass,fill = factor(Survived)))+geom_bar(position = "fill")+labs(y = "Proportion")
grid.arrange(p1,p2,ncol=2)

It can be seen from the above graphs that about 58% of 1st class passengers, 40% of 2nd class passengers and 27% of 3rd class passengers survived.
Survival with respect to port of embarkation
par(mfrow = c(1,2))
qplot(Embarked,data = imputedtrainingreengineered,geom = "density",fill = factor(Survived), alpha = I(0.5),xlab = "Embarked", ylab = "Density", main = "Kernel Density Plot for Port of Embarkment And Survival")

ggplot(imputedtrainingreengineered,aes(x = Embarked, fill = factor(Survived))) + geom_bar(position = "fill")+labs(y="Proportion")

The kernel density plot shows that “C” is the only port where the survival rate exceeds the non-survival rate; at ports “Q” and “S” the reverse holds. From the proportions, about 51% of passengers who embarked at “C” survived, versus roughly 40% at “Q” and 34% at “S”.
Let us now analyze the salutations of survivors across the decks.
ggplot(imputedtrainingreengineered,aes(Survived, fill= factor(Survived))) + geom_bar(stat = "count") + facet_grid(Deck ~Salutation)

It can be seen from the above graph that, in almost every deck, passengers with the salutation ‘Miss’ had the highest survival counts while those with ‘Mr’ had the highest non-survival counts.
Now let us analyze the average and highest fare paid in each passenger class.
tapply(imputedtrainingreengineered$Fare,imputedtrainingreengineered$Pclass,mean)
## 1 2 3
## 87.50899 21.17920 13.29506
qplot(Fare,Pclass,data = imputedtrainingreengineered, geom = c("point","smooth"), method = "lm", formula = y~x, col = Sex, main = "Regression of Fare on Passenger Class By Each Sex", ylab = "Passenger class", xlab = "Fare")

Among the many conclusions that can be drawn from the regression, the most obvious is that the highest fares in classes 1 and 3 were paid by females, and that those fares were substantially above the mean for the respective class.
Now let us analyze the average and highest fare paid on each deck.
tapply(imputedtrainingreengineered$Fare,imputedtrainingreengineered$Deck,mean)
## A B C D E F G T
## 25.64440 55.14814 50.70856 26.95982 21.10704 13.91100 14.37612 19.79834
qplot(Deck,Fare, data = imputedtrainingreengineered, geom = c("boxplot"), fill = Sex, main = "Fare Per Deck",xlab = "", ylab = "Fare")

It can be deduced from the table and chart that decks “B” and “C” saw the highest fares, and that in almost every deck except “T” the average fare paid by females exceeds that paid by males.
Conditioning plot of Age versus Salutation, conditioned on survival status
coplot(Age~Salutation|Survived,data = imputedtrainingreengineered, panel = panel.smooth,xlab = "Salutations",ylab = "Age",columns = 2)

It can be inferred from the above graph that the average age of survivors with the salutations “Mr” and “Miss” is slightly higher than that of non-survivors. Beyond that, average ages differ little between survivors and non-survivors across salutations.
Let’s plot a violin graph of passenger class and age.
ggplot(imputedtrainingreengineered, aes(x=Pclass,y=Age,fill = factor(Survived))) + geom_violin()

The graph shows that passengers above 60 years of age had a very slim chance of survival in 2nd and 3rd class compared with 1st class.
Statistical Inference
Using a t-test, we will check the null hypothesis (H0) for Fare versus survival.
# H0 : There is no difference between the two population means.
# H1 : There is difference between the two population means.
t.test(Fare~Survived,data = imputedtrainingreengineered)
##
## Welch Two Sample t-test
##
## data: Fare by Survived
## t = -8.1006, df = 646.65, p-value = 2.73e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -33.23113 -20.26365
## sample estimates:
## mean in group 0 mean in group 1
## 23.05924 49.80663
As the p-value < 0.05, we reject H0: there is a difference between the two population means. The average fare paid by survivors ($49.81) is about $27 higher than that of non-survivors ($23.06).
Using a t-test, we will check the null hypothesis (H0) for Age versus survival.
# H0 : There is no difference between the two population means.
# H1 : There is difference between the two population means.
t.test(Age~Survived,data = imputedtrainingreengineered)
##
## Welch Two Sample t-test
##
## data: Age by Survived
## t = 0.038893, df = 966.54, p-value = 0.969
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.567218 1.630595
## sample estimates:
## mean in group 0 mean in group 1
## 29.34837 29.31668
As the p-value (0.969) is far greater than 0.05, we fail to reject H0: there is no evidence of a difference in mean age between the two groups. The sample means bear this out; at 29.35 versus 29.32 years, they are virtually identical.
Using a chi-square test, we will check the null hypothesis (H0) that passenger class and survival are independent.
# H0 : Passenger Class and Survivals are independent.
# H1 : Passenger Class and Survivals are not independent.
chisq.test(imputedtrainingreengineered$Pclass,imputedtrainingreengineered$Survived)
##
## Pearson's Chi-squared test
##
## data: imputedtrainingreengineered$Pclass and imputedtrainingreengineered$Survived
## X-squared = 117.28, df = 2, p-value < 2.2e-16
As the p-value < 0.05, we reject H0: passenger class and survival are not independent. This conclusion matches our earlier exploratory analysis.
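For reference, the expected counts under independence are available from the test object; comparing them with the observed table shows where the dependence arises (a sketch):
chisq.test(imputedtrainingreengineered$Pclass, imputedtrainingreengineered$Survived)$expected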
Building The Models
Let us first separate the training and testing data once again.
trainingdataset <- imputedtrainingreengineered[c(1:891),]
testingdataset <- imputedtrainingreengineered[c(892:1309),]
MODEL 1 : Logistic Regression
model1 <- glm(Survived ~ ., family = binomial(link = 'logit'), data = trainingdataset)
anova(model1, test = 'Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Survived
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 890 1186.66
## Pclass 2 103.547 888 1083.11 < 2.2e-16 ***
## Sex 1 256.220 887 826.89 < 2.2e-16 ***
## Age 1 15.417 886 811.47 8.622e-05 ***
## SibSp 1 13.556 885 797.92 0.0002315 ***
## Parch 1 0.417 884 797.50 0.5181887
## Fare 1 1.520 883 795.98 0.2176495
## Embarked 2 3.814 881 792.16 0.1485020
## Salutation 5 62.405 876 729.76 3.866e-12 ***
## Deck 7 13.106 869 716.65 0.0695566 .
## FamilySize 0 0.000 869 716.65
## Child 1 0.711 868 715.94 0.3990952
## Mother 1 0.114 867 715.83 0.7361542
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference between the null deviance and the residual deviance shows how our model performs against the null model (a model with only the intercept); the wider this gap, the better. Reading the table, we can see the drop in deviance as each variable is added in turn. Adding Pclass, Sex and Age reduces the residual deviance substantially, as does Salutation, and SibSp also has a low p-value. The other variables improve the model less: a large p-value indicates that the model without that variable explains roughly the same amount of variation. Note that FamilySize contributes zero degrees of freedom because it is a linear combination of SibSp and Parch, which are already in the model. Ultimately we want to see a significant drop in deviance and in the AIC.
While no exact equivalent to the R2 of linear regression exists, the McFadden R2 index can be used to assess the model fit.
pR2(model1)
## llh llhNull G2 McFadden r2ML
## -357.9136485 -593.3275684 470.8278399 0.3967689 0.4104680
## r2CU
## 0.5576976
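As a cross-check, McFadden’s R2 can be computed by hand as one minus the ratio of residual to null deviance, which should reproduce pR2’s value of about 0.397:
1 - model1$deviance / model1$null.deviance   # McFadden pseudo R-squared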
Assessing the predictive ability of model 1
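The difference ratios and the ensemble below use fit1, the model 1 predictions on the test set. A minimal sketch to produce them, assuming the conventional 0.5 probability cutoff:
fit1 <- predict(model1, newdata = testingdataset, type = 'response')   # predicted survival probabilities
fit1 <- as.factor(ifelse(fit1 > 0.5, 1, 0))                            # threshold at 0.5 to get class labels
confusionMatrix(fit1, testingdataset$Survived)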
MODEL 2 : Random Forest
model2 <- randomForest(Survived ~ ., data = trainingdataset, importance = TRUE)
model2
##
## Call:
## randomForest(formula = Survived ~ ., data = trainingdataset, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 17.28%
## Confusion matrix:
## 0 1 class.error
## 0 495 54 0.09836066
## 1 100 242 0.29239766
The random forest model shows an out-of-bag (OOB) error estimate of about 17%.
The importance of each variable in the classification is shown below; the Fare variable has the highest importance.
varImpPlot(model2)

Assessing the predictive ability of model 2
fit2 <- predict(model2,newdata = testingdataset)
confusionMatrix(fit2, testingdataset$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 208 63
## 1 52 95
##
## Accuracy : 0.7249
## 95% CI : (0.6794, 0.7672)
## No Information Rate : 0.622
## P-Value [Acc > NIR] : 5.966e-06
##
## Kappa : 0.4068
## Mcnemar's Test P-Value : 0.3511
##
## Sensitivity : 0.8000
## Specificity : 0.6013
## Pos Pred Value : 0.7675
## Neg Pred Value : 0.6463
## Prevalence : 0.6220
## Detection Rate : 0.4976
## Detection Prevalence : 0.6483
## Balanced Accuracy : 0.7006
##
## 'Positive' Class : 0
##
a <- confusionMatrix(fit2, testingdataset$Survived)$overall[1]
print(paste('Accuracy of Random Forest Model = ', a))
## [1] "Accuracy of Random Forest Model = 0.72488038277512"
MODEL 3 : Support Vector Machine
model3 <- svm(Survived ~ ., data = trainingdataset)
model3
##
## Call:
## svm(formula = Survived ~ ., data = trainingdataset)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.04
##
## Number of Support Vectors: 418
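The cost and gamma parameters were left at their defaults; e1071’s tune.svm() can grid-search them and may squeeze out a little more accuracy (a sketch with an illustrative grid):
tuned <- tune.svm(Survived ~ ., data = trainingdataset, gamma = 10^(-3:-1), cost = 10^(0:2))
summary(tuned)   # best gamma/cost combination by cross-validated error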
Assessing the predictive ability of model 3
fit3 <- predict(model3, newdata = testingdataset)
confusionMatrix(fit3, testingdataset$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 202 54
## 1 58 104
##
## Accuracy : 0.7321
## 95% CI : (0.6869, 0.7739)
## No Information Rate : 0.622
## P-Value [Acc > NIR] : 1.317e-06
##
## Kappa : 0.433
## Mcnemar's Test P-Value : 0.7768
##
## Sensitivity : 0.7769
## Specificity : 0.6582
## Pos Pred Value : 0.7891
## Neg Pred Value : 0.6420
## Prevalence : 0.6220
## Detection Rate : 0.4833
## Detection Prevalence : 0.6124
## Balanced Accuracy : 0.7176
##
## 'Positive' Class : 0
##
b <- confusionMatrix(fit3, testingdataset$Survived)$overall[1]
print(paste('Accuracy of Support Vector Machine Model = ', b))
## [1] "Accuracy of Support Vector Machine Model = 0.732057416267943"
MODEL 4 : Linear Discriminant Analysis Method
model4 <- train(Survived ~ .,data = trainingdataset, method = "lda")
model4
## Linear Discriminant Analysis
##
## 891 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 891, 891, 891, 891, 891, 891, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8173444 0.6098066
##
##
Assessing the predictive ability of model 4
fit4 <- predict(model4, newdata = testingdataset)
confusionMatrix(fit4, testingdataset$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 202 56
## 1 58 102
##
## Accuracy : 0.7273
## 95% CI : (0.6819, 0.7694)
## No Information Rate : 0.622
## P-Value [Acc > NIR] : 3.648e-06
##
## Kappa : 0.4214
## Mcnemar's Test P-Value : 0.9254
##
## Sensitivity : 0.7769
## Specificity : 0.6456
## Pos Pred Value : 0.7829
## Neg Pred Value : 0.6375
## Prevalence : 0.6220
## Detection Rate : 0.4833
## Detection Prevalence : 0.6172
## Balanced Accuracy : 0.7112
##
## 'Positive' Class : 0
##
l <- confusionMatrix(fit4, testingdataset$Survived)$overall[1]
print(paste('Accuracy of Linear Discriminant Analysis Model = ', l))
## [1] "Accuracy of Linear Discriminant Analysis Model = 0.727272727272727"
MODEL 5 : K Nearest Neighbour Method
model5 <- knncat(trainingdataset, classcol = 13)
model5
## Training set misclass rate: 19.75%
fit5 <- predict(model5, trainingdataset, testingdataset, train.classcol = 13, newdata.classcol = 13)
table(fit5,testingdataset$Survived)
##
## fit5 0 1
## 0 206 60
## 1 54 98
misclassificationerror <- mean( fit5 != testingdataset$Survived)
print(paste('Accuracy of KNN Model = ', 1-misclassificationerror))
## [1] "Accuracy of KNN Model = 0.727272727272727"
Revised Solution and Analysis for Better Prediction
We will use the technique of ‘Ensembling’ to improve our prediction accuracy: the predictions of the logistic regression, random forest and SVM models are combined as a weighted average, which will hopefully raise the accuracy. First, let’s see how much the individual models’ predictions differ from one another.
cat('Difference ratio between logistic regression and random forest:', sum(fit1!=fit2)/nrow(testingdataset))
## Difference ratio between logistic regression and random forest: 0.07894737
cat('Difference ratio between logistic regression and SVM:', sum(fit1!=fit3)/nrow(testingdataset))
## Difference ratio between logistic regression and SVM: 0.05263158
cat('Difference ratio between random forest and SVM:', sum(fit2!=fit3)/nrow(testingdataset))
## Difference ratio between random forest and SVM: 0.05023923
cat('Difference ratio between logistic regression and LDA:', sum(fit1!=fit4)/nrow(testingdataset))
## Difference ratio between logistic regression and LDA: 0.0430622
cat('Difference ratio between logistic regression and KNN:', sum(fit1!=fit5)/nrow(testingdataset))
## Difference ratio between logistic regression and KNN: 0.06698565
cat('Difference ratio between random forest and LDA:', sum(fit2!=fit4)/nrow(testingdataset))
## Difference ratio between random forest and LDA: 0.05502392
cat('Difference ratio between random forest and KNN:', sum(fit2!=fit5)/nrow(testingdataset))
## Difference ratio between random forest and KNN: 0.05980861
cat('Difference ratio between SVM and LDA:', sum(fit3!=fit4)/nrow(testingdataset))
## Difference ratio between SVM and LDA: 0.009569378
cat('Difference ratio between SVM and KNN:', sum(fit3!=fit5)/nrow(testingdataset))
## Difference ratio between SVM and KNN: 0.02870813
cat('Difference ratio between LDA and KNN:', sum(fit4!=fit5)/nrow(testingdataset))
## Difference ratio between LDA and KNN: 0.03349282
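# Weighted vote: 20% logistic regression + 40% random forest + 40% SVM.
# as.numeric() turns the factor levels into 1/2, so subtract 1 to recover 0/1 before averaging.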
ensemble <- 0.2*(as.numeric(fit1)-1) + 0.4*(as.numeric(fit2)-1) + 0.4*(as.numeric(fit3)-1)
ensemble <- round(ensemble)
confusionMatrix(factor(ensemble, levels = c(0, 1)), testingdataset$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 211 63
## 1 49 95
##
## Accuracy : 0.7321
## 95% CI : (0.6869, 0.7739)
## No Information Rate : 0.622
## P-Value [Acc > NIR] : 1.317e-06
##
## Kappa : 0.4201
## Mcnemar's Test P-Value : 0.2193
##
## Sensitivity : 0.8115
## Specificity : 0.6013
## Pos Pred Value : 0.7701
## Neg Pred Value : 0.6597
## Prevalence : 0.6220
## Detection Rate : 0.5048
## Detection Prevalence : 0.6555
## Balanced Accuracy : 0.7064
##
## 'Positive' Class : 0
##
c <- confusionMatrix(factor(ensemble, levels = c(0, 1)), testingdataset$Survived)$overall[1]
print(paste('Accuracy of Ensembled Model = ', c))
## [1] "Accuracy of Ensembled Model = 0.732057416267943"
Conclusion
The accuracy of the ensembled (weighted-average) model matches the SVM’s and is the highest among the models above, at about 73.2%. We will therefore select the ensemble for prediction.