The objective is to build a model that predicts whether a passenger on the Titanic survived.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caTools)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)
# Import raw data
rawdata <- read.csv("~/Downloads/train.csv")
Part of this task might include determining which variables you want to use as features.
To clean the data, let's drop the variables that we think are irrelevant.
# Dataframe to train model on
newdata <- select(rawdata, Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
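The data frame above contains the response variable (Survived, in the first column) together with the feature variables. The response is separated into its own vector further down, once the data has been split into training and test sets.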
# set seed and split sample
set.seed(123)
tttt <- sample.split(newdata$Survived, SplitRatio = 0.7)
# split newdata
train <- newdata[tttt,]
test <- newdata[!tttt,]
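A quick way to sanity-check a split like this is to compare the class proportions in the full, training, and test sets. The caTools documentation describes sample.split as preserving the relative ratios of the labels. The sketch below only imitates that idea in base R so it runs without extra packages: the data frame `toy` is made up, and `stratified_split` is a hypothetical helper, not part of caTools.

```r
# Toy data standing in for the Titanic frame (all values are made up).
set.seed(123)
toy <- data.frame(
  Survived = rep(c(0, 1), times = c(60, 40)),
  Fare     = runif(100, 5, 100)
)

# Hypothetical helper: within each class, mark roughly `ratio` of the rows
# as training rows, so the label proportions carry over to both subsets.
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx    <- which(y == cls)
    n_take <- round(length(idx) * ratio)
    in_train[idx[sample.int(length(idx), n_take)]] <- TRUE
  }
  in_train
}

mask      <- stratified_split(toy$Survived, 0.7)
toy_train <- toy[mask, ]
toy_test  <- toy[!mask, ]

# Class proportions in the full, training, and test sets should agree.
prop.table(table(toy$Survived))
prop.table(table(toy_train$Survived))
prop.table(table(toy_test$Survived))
```

The same three prop.table calls can be applied to newdata, train, and test to check the real split.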
To list every model type that caret's train function can build, run:
> names(getModelInfo())
response <- train[,1]
training_features <- train[,-1]
response <- as.factor(response)
mod1 <- train(training_features, response, method = "rpart")
# Make predictions on test set
predictions1 <- predict(mod1, test)
table(predictions1, test$Survived)
##
## predictions1 0 1
## 0 146 41
## 1 19 62
# Accuracy for mod1 (rpart)
(146+62)/(146+41+19+62)
## [1] 0.7761194
Model 1 has an accuracy of 77.6% on the test set.
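The accuracy can also be computed from the confusion matrix itself rather than by retyping the counts. This is a minimal sketch that rebuilds the Model 1 table by hand; the object names `cm` and `accuracy` are illustrative only.

```r
# Confusion matrix for Model 1, copied from the table above.
# Rows are predictions, columns are the actual Survived values.
cm <- matrix(
  c(146, 19, 41, 62),  # filled column by column
  nrow = 2,
  dimnames = list(predicted = c("0", "1"), actual = c("0", "1"))
)

# Correct predictions sit on the diagonal, so accuracy is
# the diagonal total divided by the grand total.
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)  # 0.7761
```

With the real objects, passing the result of table(predictions1, test$Survived) in place of `cm` should give the same figure, and it avoids transcription mistakes when the counts change between runs.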
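The second model calls train without a method argument, so caret falls back to its default method, a random forest ("rf"). Before fitting it, rows containing NA values need to be removed from both the training and test sets. The complete.cases() function does this.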
cc <- complete.cases(train)
train2 <- train[cc,]
cc <- complete.cases(test)
test2 <- test[cc,]
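To make the effect of complete.cases() concrete, here is a small sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from train.csv). It also counts the NA values per column, which shows which variables are responsible for the dropped rows.

```r
# Made-up data with missing values in two different columns.
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Age      = c(22, NA, 35, 54),
  Fare     = c(7.25, 71.28, NA, 51.86)
)

# complete.cases() returns TRUE only for rows with no NA in any column.
keep <- complete.cases(toy)
keep  # TRUE FALSE FALSE TRUE

# Keep only the fully observed rows.
toy_clean <- toy[keep, ]

# Number of missing values per column.
colSums(is.na(toy))
```

Comparing nrow(train) with nrow(train2), and nrow(test) with nrow(test2), shows how many rows are lost to this filter.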
train2_response <- as.factor(train2$Survived)
train2_features <- train2[,-1]
mod2 <- train(train2_features, train2_response)
predictions2 <- predict(mod2, test2)
table(predictions2, test2$Survived)
##
## predictions2 0 1
## 0 113 32
## 1 13 56
# Accuracy
(113+56)/(113+32+13+56)
## [1] 0.7897196
Model 2 has an accuracy of 79.0% on the 214 test rows without missing values.