The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. This project investigates the likelihood of survival on titanic by employing Machine Learning tools to predict which passengers survived the tragedy.
# Load and check the training data
train <- read.csv('Kaggle_Titanic_Train.csv', stringsAsFactors=FALSE)
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
Observation- As we can see, the training dataset has 891 observations and 12 variables. Looking at the structure and summary of our dataset gives us some sense of the data.
# Number of people survived
table(train$Survived)
##
## 0 1
## 549 342
# Proportion of people survived
prop.table(table(train$Survived))
##
## 0 1
## 0.6161616 0.3838384
Observation- Looking at the numbers, we can see that only 38% passengers survived.
library(ggplot2)
ggplot(train, aes(factor(Survived), fill=factor(Survived))) + geom_bar() +
scale_y_continuous(limits=c(0,600)) +
scale_x_discrete(labels=c('Dead', 'Survived')) +
scale_fill_discrete(guide=FALSE) +
labs(x='Survival', y='No of Passengers', title='Survival Rate')
# Number of people survived by sex
table(train$Sex, train$Survived)
##
## 0 1
## female 81 233
## male 468 109
# Proportion of people survived by sex
prop.table(table(train$Sex, train$Survived),1)
##
## 0 1
## female 0.2579618 0.7420382
## male 0.8110919 0.1889081
Observation- We can see that of the passengers that survived, 74% were females while only 19% were males.
ggplot(train, aes(factor(Survived))) + geom_bar(aes(fill=factor(Sex))) +
scale_y_continuous(limits=c(0,600), breaks=seq(0,600,100)) +
scale_x_discrete(labels = c('Dead', 'Survived')) +
scale_fill_discrete(name="Sex") +
labs(x='Survived', y='No. of Passengers', title='Survival by Sex')
# Number of people survived by Passenger Class
table(train$Pclass, train$Survived)
##
## 0 1
## 1 80 136
## 2 97 87
## 3 372 119
# Proportion of people survived by Passenger Class
prop.table(table(train$Pclass, train$Survived),1)
##
## 0 1
## 1 0.3703704 0.6296296
## 2 0.5271739 0.4728261
## 3 0.7576375 0.2423625
Observation- We can see that of the passengers that survived 1. 63% passengers that travelled in Class 1 survived 2. Only 24% passengers in class 3 survived
ggplot(train, aes(factor(Survived))) + geom_bar(aes(fill=factor(Pclass))) +
scale_y_continuous(limits=c(0,600), breaks=seq(0,600,100))+
scale_x_discrete(labels = c('Dead', 'Survived')) +
scale_fill_discrete(name="Passenger\nClass") +
labs(x='Survived', y='No. of Passengers', title='Survival by Passenger Class')
# Number of people survived by Sex and Passenger Class
aggregate(Survived~Sex+Pclass, train, FUN = sum)
## Sex Pclass Survived
## 1 female 1 91
## 2 male 1 45
## 3 female 2 70
## 4 male 2 17
## 5 female 3 72
## 6 male 3 47
Note- This gives the total number of male/female passengers that survived in each class.
# Proportion of people survived by Sex and Passenger Class
aggregate(Survived~Sex+Pclass, train, FUN=function(x){sum(x)/length(x)})
## Sex Pclass Survived
## 1 female 1 0.9680851
## 2 male 1 0.3688525
## 3 female 2 0.9210526
## 4 male 2 0.1574074
## 5 female 3 0.5000000
## 6 male 3 0.1354467
Observation- 1. Atleast 92% women in class 1 and 2 survived. Only 50% women in class 3 survived 2. 37% men in class 1 survived while only 13% men in class 3 survived
# Lets look at relationship b/w age and survival for each class
ggplot(train, aes(Age)) +
geom_histogram(aes(fill=factor(Survived))) +
facet_grid(Pclass~.) +
scale_fill_discrete(name="Survival", labels=c('Dead', 'Survived')) +
labs(title = 'Survival by Age and Class')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Lets look at relation b/w age, class and sex as survival factors
ggplot(train, aes(Age, factor(Pclass), col=factor(Survived))) +
geom_jitter(position=position_jitter(height=0.2)) +
facet_grid(Sex~.) +
scale_color_discrete(name="Survival", labels=c('Dead', 'Survived')) +
labs(y='Passenger Class', title='Survival by Age, Sex and Class')
# Lets look at relation b/w age, fare and survival for each class
ggplot(train, aes(Age, Fare, col=factor(Pclass))) +
geom_point(aes(shape=factor(Survived))) +
facet_grid(Sex~.) +
scale_color_discrete(name="Passenger\nClass") +
scale_shape_discrete(name="Survival", labels=c('Dead', 'Survived'))
We will combine train and test datasets for feature engineering. Let’s load the test data and take a look at the structure of the test data.
# Load and check test data
test <- read.csv('Kaggle_Titanic_Test.csv', stringsAsFactors = FALSE)
str(test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
Observation- As we can see that the test data does not have variable ‘Survived’ as we are trying to predict survival. In order for us to combine the 2 datasets for feature engineering, we will create a new variable called ‘Survived’ in the test set and assign a value of 0 (dead).
# Create 'Survived' variable in test set
test$Survived <- 0
# Combine the datasets
full <- rbind(train, test)
# View the structure of combined dataset
str(full)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Observation- We can see that the combined dataset has 1309 observations of 12 variables.
Lets look at the summary of our dataset to see which variables have missing values.
summary(full)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.2613 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin
## Length:1309 Min. : 0.000 Length:1309
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.454 Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
## Embarked
## Length:1309
## Class :character
## Mode :character
##
##
##
##
Observation- We see that Age has 263 missing values and Fare has 1 missing value. Let’s start with Fare variable.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
filter(full, is.na(full$Fare))
## PassengerId Survived Pclass Name Sex Age SibSp Parch
## 1 1044 0 3 Storey, Mr. Thomas male 60.5 0 0
## Ticket Fare Cabin Embarked
## 1 3701 NA S
Observation- We can see that the passenger travelled in 3rd class and embarked from ‘S’
Fare_est <- filter(full, full$Embarked=='S' & full$Pclass==3)
median(Fare_est$Fare, na.rm=TRUE)
## [1] 8.05
Observation- We can see that the median fare for passenger who travelled in 3rd class and embarked from ‘S’ is 8.05. Lets assign this value to the missing fare.
full$Fare[full$PassengerId==1044] <- 8.05
any(is.na(full$Fare))
## [1] FALSE
full$Embarked <- as.factor(full$Embarked)
filter(full, full$Embarked=='')
## PassengerId Survived Pclass Name
## 1 62 1 1 Icard, Miss. Amelie
## 2 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 1 female 38 0 0 113572 80 B28
## 2 female 62 0 0 113572 80 B28
Observation- Passengers with ID 62 and 830 have missing Embarked value. Both passengers travelled in 1st class.
Embark_est <- filter(full, full$Pclass==1)
plot(Embark_est$Embarked)
Observation- Most passengers embarked from port ‘S’
full$Embarked[full$PassengerId %in% c(62, 830)] <- 'S'
filter(full, full$Embarked=='')
## [1] PassengerId Survived Pclass Name Sex
## [6] Age SibSp Parch Ticket Fare
## [11] Cabin Embarked
## <0 rows> (or 0-length row.names)
Now that we have removed misiing values from Fare and Embarked, we can use MICE package for imputation. If we had not replaced missing values in Fare and Embarked, using these 2 variables would generate an error. Not using these 2 variables however would cause MICE to impute values for Fare and Embarked as well.
So unless we want to use MICE for all varibales that have missing values, we either need to fix missing values for variables we dont want MICE to replace or create a subset of variables that can be used by MICE for imputation.
hist(full$Age)
sum(is.na(full$Age))
## [1] 263
Observation- We can see that Age varies between 0-80 and we have 263 missing values in Age.
library(mice)
## Loading required package: Rcpp
## mice 2.25 2015-11-09
set.seed(101)
vars.for.imputation <- full[,!names(full) %in% c('PassengerId','Name','Survived', 'Pclass', 'Sex', 'SibSp', 'Parch','Ticket','Cabin','Fare')]
imputed <- complete(mice(vars.for.imputation))
par(mfrow=c(1,2))
hist(full$Age, freq=F, main='Age: Original', col='darkgreen', ylim=c(0,0.04))
hist(imputed$Age, freq=F, main='Age: Imputed', col='lightgreen', ylim=c(0,0.04))
full$Age <- imputed$Age
sum(is.na(full$Age))
## [1] 0
full$Fsize <- full$SibSp + full$Parch + 1
ggplot(full[1:891,], aes(Fsize)) +
geom_bar(aes(fill=factor(Survived), stat='count', position='dodge')) +
scale_x_continuous(breaks = c(1:11)) +
scale_y_continuous(breaks=seq(0,600,50)) +
labs(x='Family Size', title='Survival by Family Size') +
scale_fill_discrete(name="Survival", labels=c('Dead', 'Survived'))
## Warning: Ignoring unknown aesthetics: stat, position
Observation- We can see that the survival rate for those with a family size of 1 or those with family sizes above 4 is really low.
Perhaps we can collapse the Family Size variable into three levels which will be helpful since there are comparatively fewer large families. Let’s create a discretized family size variable.
full$FsizeD[full$Fsize==1] <- 'Single'
full$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- 'Small'
full$FsizeD[full$Fsize > 4] <- 'Large'
ggplot(full[1:891,], aes(factor(Survived))) + geom_bar(aes(fill=factor(FsizeD))) +
scale_x_discrete(labels = c('Dead','Survived')) +
scale_y_continuous(breaks = seq(0,600,100)) +
labs(x='Survival', title='Survival by Family Size') +
scale_fill_discrete(name="Family Size") +
mosaicplot(table(full[1:891,]$FsizeD, full[1:891,]$Survived), main='Family Size by Survival', shade=TRUE)
Observation- The mosaic plot shows that there’s a survival penalty among singletons and large families, but a benefit for passengers in small families.
Let’s grab the title from passenger name. We will look for a pattern in any string preceding a ‘comma and space’ OR any string that follows a period and replace it with blank space to get the title
full$Title <- gsub('(.*, )|(\\..*)', '', full$Name)
table(full$Sex, full$Title)
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme
## female 0 0 0 1 1 0 1 0 0 260 2 1
## male 1 4 1 0 7 1 0 2 61 0 0 0
##
## Mr Mrs Ms Rev Sir the Countess
## female 0 197 2 0 0 1
## male 757 0 0 8 1 0
# Combine titles with low counts to rare
rare_title <- c('Dona', 'Lady', 'the Countess','Capt', 'Col', 'Don',
'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')
full$Title[full$Title == 'Ms'] <- 'Miss'
full$Title[full$Title=='Mlle'] <- 'Miss'
full$Title[full$Title=='Mme'] <- 'Mrs'
full$Title[full$Title %in% rare_title] <- 'Rare Title'
table(full$Sex, full$Title)
##
## Master Miss Mr Mrs Rare Title
## female 0 264 0 198 4
## male 61 0 757 0 25
full$Title <- as.factor(full$Title)
# Plot Title by Sex and Passenger Class
ggplot(full, aes(Title)) + geom_bar(aes(fill=factor(Pclass))) +
scale_y_continuous(breaks = seq(0,800,100)) +
labs(title='Title by Passenger Class') +
scale_fill_discrete(name="Passenger\nClass")
ggplot(full, aes(Age, Title, col=factor(Sex))) +
geom_jitter(position=position_jitter(height=0.2)) + facet_grid(Pclass~.) +
labs(y='Title', title='Age by Title, Sex and Class') +
scale_color_discrete(name="Sex")
Observation- Looking at the plot we can see that some passengers in class 3 with age > 18 are titled as Master.
# Filter rows where age > 18 and title is Master
young <- filter(full, full$Title=='Master' & full$Age > 18)
# Look at the Name, age, Class and Sex of this subset of data
young[,c('Name','Age','Pclass','Sex')]
## Name Age Pclass Sex
## 1 Moubarek, Master. Gerios 30 3 male
## 2 Sage, Master. Thomas Henry 19 3 male
## 3 Lefebre, Master. Henry Forbes 30 3 male
## 4 Moubarek, Master. Halim Gonios ("William George") 30 3 male
## 5 Johnston, Master. William Arthur Willie"" 29 3 male
## 6 van Billiard, Master. James William 27 3 male
# Since all passengers are male, change their title to Mr
full$Title[young$Title=='Master'] <- 'Mr'
# Check to confirm
filter(full, full$Title=='Master' & full$Age > 18)
## [1] PassengerId Survived Pclass Name Sex
## [6] Age SibSp Parch Ticket Fare
## [11] Cabin Embarked Fsize FsizeD Title
## <0 rows> (or 0-length row.names)
full$Child[full$Age < 18] <- 'Child'
full$Child[full$Age >= 18] <- 'Adult'
table(full[1:891,]$Child, full[1:891,]$Survived)
##
## 0 1
## Adult 478 276
## Child 71 66
Observation- Being a child didn’t real help much when it comes to survivalon Titanic.
full$Mother <- 'Not Mother'
full$Mother[full$Sex=='female' & full$Age>18 & full$Parch>0 & full$Title != 'Miss'] <- 'Mother'
table(full[1:891,]$Mother, full[1:891,]$Survived)
##
## 0 1
## Mother 23 50
## Not Mother 526 292
Observation- Being a child didn’t real help much when it comes to survivalon Titanic.
# Convert variables to factor
factor_vars <- c('Pclass', 'Sex', 'FsizeD', 'Child', 'Mother')
full[factor_vars] <- lapply(full[factor_vars], function(x) as.factor(x))
str(full)
## 'data.frame': 1309 obs. of 17 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 23 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
## $ Fsize : num 2 2 1 2 1 1 1 5 3 2 ...
## $ FsizeD : Factor w/ 3 levels "Large","Single",..: 3 3 2 3 2 2 2 1 3 3 ...
## $ Title : Factor w/ 5 levels "Master","Miss",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Child : Factor w/ 2 levels "Adult","Child": 1 1 1 1 1 1 1 2 1 2 ...
## $ Mother : Factor w/ 2 levels "Mother","Not Mother": 2 2 2 2 2 2 2 2 1 2 ...
# Split the data back to training and test set
train <- full[1:891,]
test <- full[892:1309,]
We will use 10-fold cross validation with 3 repeats. Since this is a binary classification problem, we will use Accuracy and Kappa metrics for evaluating performance of the models.
Some algorithms perform a whole lot better with some basic data preprocessing. So we will center and scale the data in some algorithms.
library(caret)
## Loading required package: lattice
trainControl <- trainControl(method='repeatedcv', number=10, repeats=3)
# Logistic Regression model
set.seed(101)
fit.glm <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Title + FsizeD + Child + Mother,
data = train, method = 'glm', metric = 'Accuracy',
trControl = trainControl)
# KNN
set.seed(101)
fit.knn <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Title + FsizeD + Child + Mother,
data = train, method = 'knn', metric = 'Accuracy',
preProc=c("center", "scale"), trControl = trainControl)
# SVM Radial
set.seed(101)
fit.svmRadial <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp +
Parch + Fare + Embarked + Title + FsizeD + Child + Mother,
data = train, method = 'svmRadial', metric = 'Accuracy',
preProc=c("center", "scale"), trControl = trainControl,
fit=FALSE)
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
# CART model
set.seed(101)
fit.cart <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Title + FsizeD + Child + Mother,
data = train, method = 'rpart', metric = 'Accuracy',
trControl = trainControl)
## Loading required package: rpart
# Random Forest model
set.seed(101)
fit.rf <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Title + FsizeD + Child + Mother,
data = train, method = 'ranger', metric = 'Accuracy',
trControl = trainControl)
## Loading required package: e1071
## Loading required package: ranger
# C 5.0 model
set.seed(101)
fit.c50 <- train(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
Fare + Embarked + Title + FsizeD + Child + Mother,
data = train, method = 'C5.0', metric = 'Accuracy',
trControl = trainControl)
## Loading required package: C50
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
# Let's compare the results of these algorithms:
results <- resamples(list(logistic=fit.glm, knn=fit.knn,
svm=fit.svmRadial, cart=fit.cart, c50=fit.c50, rf=fit.rf))
# Table comparison
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: logistic, knn, svm, cart, c50, rf
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## logistic 0.7222 0.7865 0.8045 0.8093 0.8427 0.8989 0
## knn 0.7111 0.7759 0.8090 0.8003 0.8202 0.8764 0
## svm 0.7444 0.8017 0.8268 0.8268 0.8499 0.8989 0
## cart 0.7500 0.7865 0.8090 0.8118 0.8399 0.8652 0
## c50 0.7753 0.8090 0.8315 0.8264 0.8427 0.8876 0
## rf 0.7444 0.7911 0.8202 0.8186 0.8539 0.8652 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## logistic 0.4186 0.5252 0.5862 0.5890 0.6621 0.7822 0
## knn 0.3727 0.5196 0.5954 0.5697 0.6266 0.7338 0
## svm 0.4595 0.5783 0.6230 0.6272 0.6719 0.7822 0
## cart 0.4224 0.5333 0.5843 0.5856 0.6486 0.7010 0
## c50 0.5075 0.5811 0.6389 0.6247 0.6592 0.7566 0
## rf 0.4480 0.5543 0.6170 0.6087 0.6780 0.7176 0
# Let's plot the results of these algorithms:
bwplot(results)
dotplot(results)
Observation- Looking at the results and plots, we can say that accuracy of SVM model is better than other models. Lets use svm model to make final predictions.
# Apply preprocessing to test data
test$Survived <- as.factor(test$Survived)
preprocessParams <- preProcess(test, method=c('center', 'scale'))
testData <- predict(preprocessParams, test)
# Make final predictions
prediction <- predict(fit.svmRadial, testData)
# Generate predictions
confusionMatrix(testData$Survived, prediction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 261 157
## 1 0 0
##
## Accuracy : 0.6244
## 95% CI : (0.576, 0.671)
## No Information Rate : 0.6244
## P-Value [Acc > NIR] : 0.5218
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.6244
## Neg Pred Value : NaN
## Prevalence : 0.6244
## Detection Rate : 0.6244
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
# Create a data frame with two columns: PassengerId & Survived where Survived contains my predictions.
submit <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
# Write the solution to a csv file
write.csv(submit, file = "firstSVM_titanic.csv", row.names = FALSE)