The objective is to build a model that predicts whether a passenger on the Titanic survived.
library(caTools)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)
# First import the raw data (Import Dataset in your environment)
# Then copy the read.csv() call from the console and paste it into an R chunk
raw_data <- read.csv("~/Desktop/train.csv")
Part of this task might include determining which variables you want to use as features.
To clean the data, let’s remove the variables that we think are irrelevant.
How to remove variables:
- Indexing: to keep the 2nd, 3rd, 5th, 6th, 8th, and 10th columns, use newdata <- raw_data[, c(2, 3, 5, 6, 8, 10)]
- The select() function from the dplyr library: newdata <- select(raw_data, …), where … stands for the names of the desired variables
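Both approaches can be tried on a small stand-in data frame. The sketch below assumes column names in the style of the standard Kaggle Titanic train.csv (an assumption; check names(raw_data) on your own file), and the toy values are made up for illustration. The dplyr branch only runs if dplyr is installed.

```r
# Toy stand-in for raw_data; column names are assumed, values are invented.
toy <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  Pclass      = c(3, 1, 3),
  Name        = c("A", "B", "C"),
  Sex         = c("male", "female", "female"),
  Age         = c(22, 38, NA)
)

# Select by position: keep the 2nd, 3rd, 5th and 6th columns
by_index <- toy[, c(2, 3, 5, 6)]

# Select by name: same columns, but robust to column reordering
by_name <- toy[, c("Survived", "Pclass", "Sex", "Age")]

# Check that both approaches return the same data frame
identical(by_index, by_name)

# The dplyr equivalent, run only if dplyr is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  by_select <- dplyr::select(toy, Survived, Pclass, Sex, Age)
  print(names(by_select))
}
```

Selecting by name is usually the safer choice, since positional indices silently pick the wrong columns if the file layout changes.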
# dataframe to train model on
newdata <- raw_data[,c(2, 3, 5, 6, 7, 8, 10, 12)]
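set.seed(123)  # fix the random seed so the split is reproducible
# sample.split() returns a logical vector: TRUE marks rows for the training set (70%)
newer_data <- sample.split(newdata$Survived, SplitRatio = 0.7)
train <- newdata[newer_data,]
test <- newdata[!newer_data,]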
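sample.split() from caTools returns a logical vector and, according to its documentation, preserves the ratio of the labels in both parts. The sketch below is a base-R stand-in for that idea, not the caTools implementation; the helper name stratified_mask and the toy labels are hypothetical.

```r
# Hypothetical helper: a base-R stand-in for a stratified split.
# Returns a logical vector; TRUE marks rows assigned to the training set.
stratified_mask <- function(y, ratio) {
  mask <- logical(length(y))
  for (label in unique(y)) {
    idx <- which(y == label)               # rows carrying this label
    n_train <- round(ratio * length(idx))  # how many of them go to training
    # draw positions within idx, which avoids sample()'s scalar shortcut
    chosen <- idx[sample.int(length(idx), n_train)]
    mask[chosen] <- TRUE
  }
  mask
}

set.seed(123)
y <- rep(c(0, 1), times = c(60, 40))  # made-up labels: 60% zeros, 40% ones
mask <- stratified_mask(y, ratio = 0.7)

train_y <- y[mask]
test_y  <- y[!mask]

# Class proportions in each part
prop.table(table(train_y))
prop.table(table(test_y))
```

Splitting within each class keeps the share of survivors roughly equal in the training and test sets, which makes the test accuracy a fairer estimate.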
Use names(getModelInfo()) to list all the model types that can be built with the train() function.
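# the response (Survived) is the first column; the remaining columns are features
response <- train[,1]
training_features <- train[,-1]
# convert to a factor so caret fits a classification model
response <- as.factor(response)
mod1 <- train(training_features, response, method = "rpart")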
# make predictions on the test set
predictions1 <- predict(mod1, test)
table(predictions1, test$Survived)
##
## predictions1 0 1
## 0 146 41
## 1 19 62
accuracy1 <- (146+62)/(146+62+41+19)
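The accuracy can also be computed directly from the confusion table rather than by retyping the counts. The sketch below rebuilds the table from the output printed above (in the real workflow it would be the result of table(predictions1, test$Survived)); the names conf_tab and accuracy are illustrative.

```r
# Rebuild the printed confusion table: rows = predictions, columns = truth.
# matrix() fills column by column, so the first column is (146, 19).
conf_tab <- matrix(c(146, 19, 41, 62), nrow = 2,
                   dimnames = list(predicted = c("0", "1"),
                                   actual    = c("0", "1")))

# Correct predictions sit on the diagonal; accuracy = correct / total
accuracy <- sum(diag(conf_tab)) / sum(conf_tab)
accuracy
```

Computing the value from the table avoids transcription mistakes when the model is refit and the counts change.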
First we need to drop the rows that contain NA values. Do this with the complete.cases() function.
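As a small illustration of what complete.cases() returns, the sketch below applies it to a made-up data frame (toy values, not the Titanic data).

```r
# Toy data frame with missing values in two different columns
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Age      = c(22, NA, 26, 35),
  Fare     = c(7.25, 71.28, NA, 8.05)
)

# Logical vector: TRUE for rows that have no NA in any column
keep <- complete.cases(toy)
keep

# Keep only the complete rows
toy_clean <- toy[keep, ]
nrow(toy_clean)
```

Note that a row is dropped if any single column is missing, so the filtered data can be noticeably smaller than the original.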
cctrain <- complete.cases(train)
cctest <- complete.cases(test)
train2 <- train[cctrain,]
test2 <- test[cctest,]
train2_response <- as.factor(train2$Survived)
train2_features <- train2[,-1]
# no method is specified, so train() uses its default, a random forest ("rf")
mod2 <- train(train2_features, train2_response)
predictions2 <- predict(mod2, test2)
table(predictions2, test2$Survived)
##
## predictions2 0 1
## 0 113 32
## 1 13 56
(113+56)/(113+32+13+56)
## [1] 0.7897196