Overview

The objective is to build a model that predicts whether a given passenger on the Titanic survived.

Load Libraries

library(caTools)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)

Import and Understand Data

# First import the raw data (use Import Dataset in your environment pane)
# Then copy the read.csv() call from the console and paste it into an R chunk

raw_data <- read.csv("~/Desktop/train.csv")

Transform/Clean/Reprocess Data

Part of this task might include determining which variables you want to use as features.

To clean the data, let’s remove the variables that we think are irrelevant.

How to remove variables:

-Indexing: to keep the 2nd, 3rd, 5th, 6th, 8th, and 10th columns, use newdata <- raw_data[, c(2, 3, 5, 6, 8, 10)]

-The select() function from the dplyr library: newdata <- select(raw_data, …), where … stands for the names of the desired variables

# keep the response (Survived, column 2) and the chosen feature columns
newdata <- raw_data[,c(2, 3, 5, 6, 7, 8, 10, 12)]
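
The two selection approaches described above can be compared on a small example. This is only a sketch: the toy data frame and its values are made up, and its column names are an assumption about how train.csv is laid out, since the file itself is not shown here. Selecting by name is often easier to read than selecting by position.

```r
# Toy stand-in for raw_data; the column names are an assumption
# about the layout of train.csv, and the values are made up.
toy <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  Pclass      = c(3, 1, 3),
  Name        = c("A", "B", "C"),
  Sex         = c("male", "female", "female")
)

# Select columns by position, as in the indexing approach
by_index <- toy[, c(2, 3, 5)]

# Select the same columns by name (base R)
by_name <- toy[, c("Survived", "Pclass", "Sex")]

# Check that both approaches give the same data frame
identical(by_index, by_name)

# dplyr equivalent, run only if dplyr is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  by_select <- dplyr::select(toy, Survived, Pclass, Sex)
  print(names(by_select))
}
```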

Split data into training set (70%) and test set (30%)

set.seed(123)
newer_data <- sample.split(newdata$Survived, SplitRatio = 0.7)

train <- newdata[newer_data,]
test <- newdata[!newer_data,]
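
It may help to see what sample.split() returns. It gives back a logical vector (TRUE for rows assigned to the training set) and splits within each label, so the class proportions stay roughly the same in both parts. The sketch below uses made-up labels rather than the Titanic data.

```r
library(caTools)

# Made-up labels: 60 zeros and 40 ones
labels <- rep(c(0, 1), times = c(60, 40))

set.seed(123)
in_train <- sample.split(labels, SplitRatio = 0.7)

# Share of rows assigned to the training set
mean(in_train)

# Class proportions in the training part and in the test part
prop.table(table(labels[in_train]))
prop.table(table(labels[!in_train]))
```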

Use names(getModelInfo()) to list all the model types that the train function can build.
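
As a brief sketch of how that list can be used, the code below checks that the two methods tried in this document are available and looks up the tuning parameters caret associates with rpart. The variable name model_names is introduced here only for illustration.

```r
library(caret)

# All method names that train() accepts
model_names <- names(getModelInfo())
length(model_names)

# Check that the two methods used in this document are in the list
c("rpart", "rf") %in% model_names

# Tuning parameters that train() searches over for rpart
modelLookup("rpart")
```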

Train/Build Model on Training Set Data

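# Survived is the first column of train; the rest are features
response <- train[,1]
training_features <- train[,-1]

# convert the response to a factor so that train() fits a classifier
response <- as.factor(response)
mod1 <- train(training_features, response, method = "rpart")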

Test model on testing set and get accuracy

# make predictions on test set

predictions1 <- predict(mod1, test)
table(predictions1, test$Survived)
##             
## predictions1   0   1
##            0 146  41
##            1  19  62
# correct predictions (diagonal of the table) divided by all test rows
accuracy1 <- (146+62)/(146+62+41+19)
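
Typing the counts by hand is error-prone, so one option is to compute the accuracy directly from the table. The sketch below rebuilds the confusion table printed above as a matrix (the names conf_mat and accuracy are introduced here for illustration) and divides the diagonal by the total. In the actual workflow the same two steps could be applied to the object returned by table().

```r
# Confusion table copied from the output above:
# rows are predictions, columns are the true Survived values
conf_mat <- matrix(
  c(146, 19, 41, 62),
  nrow = 2,
  dimnames = list(predicted = c("0", "1"), actual = c("0", "1"))
)

# Accuracy = correctly classified rows / all rows
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy  # 208 / 268, about 0.776
```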

Attempt Model 2 with random forests

First we need to remove the rows that contain missing (NA) values. Do this with the complete.cases() function.
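
As a small illustration on made-up data (not the Titanic file), complete.cases() returns one logical value per row, TRUE when the row has no NA in any column, and that vector can be used to subset the data frame:

```r
# Toy data frame with missing values in two different rows
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Age      = c(22, NA, 35, 54),
  Fare     = c(7.25, 71.28, NA, 51.86)
)

# TRUE for rows without any NA
keep <- complete.cases(toy)
keep

# Keep only the complete rows
toy_complete <- toy[keep, ]
nrow(toy_complete)
anyNA(toy_complete)
```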

cctrain <- complete.cases(train)
cctest <- complete.cases(test)
train2 <- train[cctrain,]
test2 <- test[cctest,]

train2_response <- as.factor(train2$Survived)
train2_features <- train2[,-1]

# method = "rf" (random forest) is train()'s default, written out here
mod2 <- train(train2_features, train2_response, method = "rf")

predictions2 <- predict(mod2, test2)
table(predictions2, test2$Survived)
##             
## predictions2   0   1
##            0 113  32
##            1  13  56
(113+56)/(113+32+13+56)
## [1] 0.7897196