This analysis features the prediction of Survival of Passengers in Titanic based on the data given by kaggle for “Titanic: Machine Learning from Disaster” competition.
library(data.table, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(randomForest, quietly = TRUE)
## Warning: package 'randomForest' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(vcd, quietly = TRUE)
## Warning: package 'vcd' was built under R version 3.2.3
require(Hmisc, quietly = TRUE)
## Warning: package 'Hmisc' was built under R version 3.2.3
## Warning: package 'Formula' was built under R version 3.2.3
##
## Attaching package: 'Hmisc'
##
## The following object is masked from 'package:randomForest':
##
## combine
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(party, quietly = TRUE)
## Warning: package 'party' was built under R version 3.2.3
## Warning: package 'mvtnorm' was built under R version 3.2.3
## Warning: package 'modeltools' was built under R version 3.2.3
## Warning: package 'strucchange' was built under R version 3.2.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Warning: package 'sandwich' was built under R version 3.2.3
require(Amelia)
## Loading required package: Amelia
## Warning: package 'Amelia' was built under R version 3.2.3
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2015 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
The data size is pretty small and any package (data.table/base) can be used for reading the files. The files are read such that empty data is replaced by NA.
train <- fread("train.csv",na.strings = c("NA",""))
test <- fread("test.csv",na.strings = c("NA",""))
By using the Amelia package, we can get the sense of the percentage of data missing in the train data set. based on that we can proceed to impute the data.
missmap(train,col=c("yellow", "black"))
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used
sum(is.na(train$Age))/nrow(train) #NApercent_age <-
## [1] 0.1986532
sum(is.na(train$Cabin))/nrow(train) # NApercent_cabin <-
## [1] 0.7710438
It can be observed 80% of the cabin data and 20% of the Age data is missing. So we need to impute the Age missing values. Before that we can check the various factors influencing the survival through plots. This also give us insight to pick our variables for imputation. The following factors are first converted to factor variables.
train <- train[,Survived:= as.factor(Survived)] #survive factor
train <- train[,Sex:= as.factor(Sex)]
train <- train[,Pclass:= as.factor(Pclass)]
train <- train[,Embarked:= as.factor(Embarked)]
test <- test[,Sex:= as.factor(Sex)]
test <- test[,Pclass:= as.factor(Pclass)]
test <- test[,Embarked:= as.factor(Embarked)]
save(train,test,file = "rawdata.RData")
str(train)
## Classes 'data.table' and 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
str(test)
## Classes 'data.table' and 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr NA NA NA NA ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## - attr(*, ".internal.selfref")=<externalptr>
ggplot(data = train,aes(Survived)) + geom_bar(fill = "black") + ggtitle("Passengers Survived/Dead")
ggplot(data = train,aes(Pclass)) + geom_bar(fill = "black") + ggtitle("Passengers Class")
ggplot(data = train,aes(Age)) + geom_histogram(fill = "black") + ggtitle("Age of Passengers")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(data = train,aes(Sex)) + geom_histogram(fill = "black") + ggtitle("Gender of Passengers")
ggplot(data = train,aes(SibSp)) + geom_histogram(fill = "black") + ggtitle("Sibings + Spouse")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(data = train,aes(Parch)) + geom_histogram(fill = "black") + ggtitle("Parents and Kids")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(data = train,aes(Fare)) + geom_histogram(fill = "black") + ggtitle("Fare")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(data = train,aes(Embarked)) + geom_bar(fill = "black") + ggtitle("Place of Embarkment")
boxplot(train$Age ~ train$Survived, main="Passenger Fate by Age",
xlab="Survived", ylab="Age")
“0” indicates passenger as Dead and 1 as “Survived”. It can be observed that more passengers died than survived. The number of people travelling by Third class is way more compared to other classes. Male passengers completely outnumbered the female passengers. Most of the passengers embarked at Southampton. Most of the passengers are middle aged.
mosaicplot(train$Pclass ~ train$Survived, main = "Passenger fate by class",color=TRUE)
mosaicplot(train$Sex ~ train$Survived, main = "Passenger fate by Gender",color=TRUE)
mosaicplot(train$Embarked ~ train$Survived, main = "Passenger fate by Embarkment",color=TRUE)
Mosaic plots are used to understand the effect of variables on Survival.It can be observed that more third class passengers perished and also the male passengers. It is known that females and children were first allowed to get on to the life boats. The embarkment place does not have much influence on the survival.
summary(train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
It was observed that 20% of the Age data are NA’s which is huge. There are many ways of imputing and the simplest way is by using the impute function of the hmisc package. But if we replace all the NA’s by the mean of the age, it may not be very accurate since the age groups are varying and the data set is small. So we can find the variables that influence the age.
boxplot(train$Age~train$Pclass, main="Passenger Class by Age",
xlab="Passenger class", ylab="Age")
It can be seen that the average age of the First class is more than the other two classes. It is quiet innate that it took some years for the people to become rich.
Next variable is Name. The way of addresing can in some way determine the age of the person like “Master” for a child and “Mr” for an adult. The following function is used to extract the title from the Name variable . Source
getTitle <- function(d){
d$Title <- sapply(d$Name,FUN = function(x){
strsplit(x,split = '[,.]')[[1]][2]
})
return(d$Title)
}
train$Title <- getTitle(train)
test$Title <- getTitle(test)
unique(train$Title)
## [1] " Mr" " Mrs" " Miss" " Master"
## [5] " Don" " Rev" " Dr" " Mme"
## [9] " Ms" " Major" " Lady" " Sir"
## [13] " Mlle" " Col" " Capt" " the Countess"
## [17] " Jonkheer"
By uisng bystas function in the hmisc package, we can know how many Titles have missing NA’s of age and use these titles only to impute the age.
bystats(train$Age,train$Title)
##
## Mean of train$Age by
##
## N Missing Mean
## Capt 1 0 70.000000
## Col 2 0 58.000000
## Don 1 0 40.000000
## Dr 6 1 42.000000
## Jonkheer 1 0 38.000000
## Lady 1 0 48.000000
## Major 2 0 48.500000
## Master 36 4 4.574167
## Miss 146 36 21.773973
## Mlle 2 0 24.000000
## Mme 1 0 24.000000
## Mr 398 119 32.368090
## Mrs 108 17 35.898148
## Ms 1 0 28.000000
## Rev 6 0 43.166667
## Sir 1 0 49.000000
## the Countess 1 0 33.000000
## ALL 714 177 29.699118
bystats(test$Age, test$Title)
##
## Mean of test$Age by
##
## N Missing Mean
## Col 2 0 50.000000
## Dona 1 0 39.000000
## Dr 1 0 53.000000
## Master 17 4 7.406471
## Miss 64 14 21.774844
## Mr 183 57 32.000000
## Mrs 62 10 38.903226
## Ms 0 1 NA
## Rev 2 0 35.500000
## ALL 332 86 30.272590
Thus, only “Dr”,“Master”,“Miss”,“Mr”,“Mrs” have unknown ages and are used to impute the age by replacing the NA with the mean of the age of the corresponding Title.
title.na_age <- c("Dr","Master","Miss","Mr","Mrs")
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
train$Title <- trim(train$Title)
test$Title <- trim(test$Title)
imputeMedian <- function(impute.var, filter.var, var.levels) {
for (v in var.levels) {
impute.var[ which( filter.var == v)] <- impute(impute.var[
which( filter.var == v)])
}
return (impute.var)
}
#train[,"Age"] <-
train$Age <- imputeMedian(train$Age,train$Title,title.na_age)
title.na_age_test <- c("Master","Miss","Mr","Mrs","Ms")
test$Age <- as.integer(imputeMedian(test$Age, test$Title,
title.na_age_test))
test$Age[which(is.na(test$Age))] <- as.integer(20)
summary(train$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 21.00 30.00 29.39 35.00 80.00
summary(test$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 22.00 28.00 29.55 36.00 76.00
Before proceeding to predict, we can further reduce the titles such that both the test and train datasets have same levels of titles.
test$Title[which(test$Title == "Dona" )] <- "Don"
train$Title[which(train$Title == "Mme" )] <- "Miss"
train$Title[which(train$Title == "Major" )] <- "Mr"
train$Title[which(train$Title == "Lady" )] <- "Mrs"
train$Title[which(train$Title == "Sir" )] <- "Rev"
train$Title[which(train$Title == "Mlle" )] <- "Miss"
train$Title[which(train$Title == "Capt" )] <- "Col"
train$Title[which(train$Title == "the Countess" )] <- "Mrs"
train$Title[which(train$Title == "Jonkheer" )] <- "Mr"
Fare has been give zero in few places. This can be converted to Na and replaced by the imputeMedian function by using class a sthe base for calculating the mean.
train$Fare[which(train$Fare == 0)]<- NA
train$Fare <- imputeMedian(train$Fare,train$Pclass,as.numeric(levels(train$Pclass)))
test$Fare[which(test$Fare == 0)]<- NA
test$Fare <- imputeMedian(test$Fare,test$Pclass,as.numeric(levels(test$Pclass)))
summary(train$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.012 7.925 14.500 32.690 31.280 512.300
train$Embarked[which(is.na(train$Embarked))] <- 'S'
Embarked coloumn has very few NA’s and can be replaced by the highest (Southampton).
summary(train$Embarked)
## C Q S
## 168 77 646
test$Embarked[which(is.na(test$Embarked))] <- 'S'
summary(test$Embarked)
## C Q S
## 102 46 270
Before prediction, let the required variables be grouped and converted to the required class.
train <- train[,Title := as.factor(Title)]
test <- test[,Title := as.factor(Title)]
str(train)
## Classes 'data.table' and 'data.frame': 891 obs. of 13 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 30 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ Title : Factor w/ 9 levels "Col","Don","Dr",..: 6 7 5 7 6 6 6 4 7 7 ...
## - attr(*, ".internal.selfref")=<externalptr>
str(test)
## Classes 'data.table' and 'data.frame': 418 obs. of 12 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : int 34 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr NA NA NA NA ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
## $ Title : Factor w/ 9 levels "Col","Don","Dr",..: 6 7 6 6 7 6 5 6 7 6 ...
## - attr(*, ".internal.selfref")=<externalptr>
getFeatures <- function(data){
features <- c("Pclass",
"Age",
"Sex",
"Parch",
"SibSp",
"Fare",
"Embarked",
"Title")
f <- data[,features, with = FALSE]
return(f)
}
The getfeatures function is used to extract the required features from the datasets. Random Forest function is used for the prediction fit. The importance = TRUE argument enables us to inspect the impotance of the difference variables via varImpPlot.
rf <- randomForest(getFeatures(train),as.factor(train$Survived), ntree = 1000, importance = TRUE)
#str(rf)
varImpPlot(rf)
The accuracy tests helps to see how worse the model performs without each variable, so a high decrease in accuracy would be expected for very predictive variables. The Gini one digs into the mathematics behind decision trees, but essentially measures how pure the nodes are at the end of the tree. Again it tests to see the result if each variable is taken out and a high score means the variable was important. source
The Passengerclass, Fare and the Title can be seen as very influential parameters.The random forest fit isused to predict the survival and the out put is written in the format required for the submission.
sub <- data.table(PassengerId = test$PassengerId)
sub$Survived <- predict(rf,getFeatures(test))
write.csv(sub,file = "rf.csv",row.names=FALSE)