Overview

The objective is to build a model that predicts whether a passenger on the Titanic survived.

Load libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caTools)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)

Import and Understand Data

# Import raw data

rawdata <- read.csv("~/Downloads/train.csv")
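Before cleaning, it helps to look at column types, missing values, and how balanced the outcome is. The sketch below uses a small made-up data frame (toy is a hypothetical stand-in with invented values, not the real file) so it runs without train.csv; the same calls can be pointed at rawdata.

```r
# Toy stand-in for rawdata: hypothetical values, only a few of the real columns
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 0),
  Pclass   = c(3, 1, 3, 1, 3),
  Sex      = c("male", "female", "female", "male", "male"),
  Age      = c(22, 38, NA, 54, NA),
  Fare     = c(7.25, 71.28, 7.92, 51.86, 8.05)
)

str(toy)       # column types and a preview of the values
summary(toy)   # ranges and quartiles; NA counts for numeric columns

# Missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Share of each outcome (0 = died, 1 = survived)
class_balance <- prop.table(table(toy$Survived))
class_balance
```

Columns with many NA values (Age in this toy example) matter later, because some model types cannot be fit on rows with missing values.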

Transform/Clean/Preprocess Data

Part of this task might include determining which variables you want to use as features.

To clean the data, let's remove the variables that we think are irrelevant.

# Dataframe to train model on
newdata <- select(rawdata, Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)

The data frame above contains the response variable (Survived) in its first column, followed by the feature variables. The response is separated from the features just before training.

Split data into training and testing sets

# set seed and split sample
set.seed(123)
tttt <- sample.split(newdata$Survived, SplitRatio = 0.7)

# split newdata 
train <- newdata[tttt,]
test <- newdata[!tttt,]
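A quick way to sanity-check a split like this is to compare the class balance in the two partitions. The sketch below does that on a made-up response vector (y is hypothetical, 60 zeros and 40 ones), assuming caTools is installed.

```r
library(caTools)

# Toy response vector (hypothetical): 60 non-survivors, 40 survivors
y <- c(rep(0, 60), rep(1, 40))

set.seed(123)
# sample.split returns a logical vector; TRUE marks a training row
in_train <- sample.split(y, SplitRatio = 0.7)

# Share of each class in the two partitions
train_balance <- prop.table(table(y[in_train]))
test_balance  <- prop.table(table(y[!in_train]))
train_balance
test_balance
```

If the two tables show similar proportions, the test set is a fair sample of the outcome the model was trained on.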

Use the line:
> names(getModelInfo())

to list all the model types that caret's train function can build.
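As a small sketch of how that list can be used (assuming caret is installed), the code below checks that the two methods used in this document are available and looks up which tuning parameters train will search over for rpart.

```r
library(caret)

# All method codes known to this caret installation
available <- names(getModelInfo())
length(available)

# Check that the methods used in this document are listed
"rpart" %in% available
"rf" %in% available

# Tuning parameters that train() varies for a given method
rpart_params <- modelLookup("rpart")
rpart_params
```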

Train/Build Model on Training Data

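# Separate the response (column 1, Survived) from the features
response <- train[,1]
training_features <- train[,-1]

# A factor response makes caret treat this as classification
response <- as.factor(response)

# Fit a single decision tree (rpart)
mod1 <- train(training_features, response, method = "rpart")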

Test model on Testing Set and get accuracy

# Make predictions on test set
predictions1 <- predict(mod1, test)
table(predictions1, test$Survived)
##             
## predictions1   0   1
##            0 146  41
##            1  19  62
# Accuracy for mod1 (rpart)
(146+62)/(146+41+19+62)
## [1] 0.7761194

Model 1 has accuracy of 77.6%.
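Accuracy alone can hide how the errors are distributed. As a rough sketch, the code below re-enters the confusion matrix printed above by hand (cm is just a name chosen here) and derives accuracy, sensitivity, specificity, and the accuracy of always guessing the majority class.

```r
# Confusion matrix for mod1, re-entered from the table above
# (rows = predicted, columns = actual)
cm <- matrix(c(146, 19, 41, 62), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # correct predictions / all predictions
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # survivors correctly identified
specificity <- cm["0", "0"] / sum(cm[, "0"])  # non-survivors correctly identified
baseline    <- max(colSums(cm)) / sum(cm)     # always predict the majority class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline), 3)
```

The baseline works out to about 61.6%, so the tree improves on always predicting "did not survive", and it is noticeably better at identifying non-survivors than survivors.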

Model 2 with random forests

First, we need to remove rows with missing (NA) values. We can do this with the complete.cases() function.

cc <- complete.cases(train)
train2 <- train[cc,]
cc <- complete.cases(test)
test2 <- test[cc,]
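To see what complete.cases() does, here is a minimal sketch on a made-up data frame (toy and its values are hypothetical). It also counts how many rows are dropped, which is worth checking on the real data because the filtered test set is smaller than the original one.

```r
# Toy data frame with missing values in two different columns
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Age      = c(22, NA, 26, 35),
  Fare     = c(7.25, 71.28, NA, 8.05)
)

# TRUE only for rows that have no NA in any column
keep <- complete.cases(toy)
keep

toy_clean <- toy[keep, ]   # rows 1 and 4 survive the filter
n_dropped <- sum(!keep)    # number of rows removed
n_dropped
```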

train2_response <- as.factor(train2$Survived)
train2_features <- train2[,-1]

# method = "rf" (random forest) is caret's default; stated explicitly for clarity
mod2 <- train(train2_features, train2_response, method = "rf")
predictions2 <- predict(mod2, test2)
table(predictions2, test2$Survived)
##             
## predictions2   0   1
##            0 113  32
##            1  13  56
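# Accuracy for mod2 (random forest), using the counts in the table above
(113+56)/(113+32+13+56)
## [1] 0.7897196

Model 2 has accuracy of 79.0% on the 214 test rows that have no missing values.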