First we load the data:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
data <- read.csv("train.csv")
datest <- read.csv("test.csv")
We fit a linear model that predicts Age from the number of siblings and spouses aboard (SibSp), and use it to fill in the missing Age values in both the training and the testing set:
fit <- lm(Age ~ SibSp, data = data)
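# Rows with missing Age in the training set: intercept + slope * SibSp
nohay <- which(is.na(data$Age))
data$Age[nohay] <- coef(fit)[[1]] + data$SibSp[nohay] * coef(fit)[[2]]
# Same imputation for the testing set, reusing the model fitted on the training data
nohay <- which(is.na(datest$Age))
datest$Age[nohay] <- coef(fit)[[1]] + datest$SibSp[nohay] * coef(fit)[[2]]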
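As a minimal sketch of the same imputation idea, the following uses a small made-up data frame (`toy` and its values are hypothetical, not taken from train.csv) and checks that the manual intercept-plus-slope computation agrees with `predict()`:

```r
# Hypothetical toy data standing in for the training set
toy <- data.frame(
  Age   = c(22, 38, NA, 35, NA, 54, 2, 27),
  SibSp = c(1, 1, 0, 1, 2, 0, 3, 0)
)

# lm() drops rows with a missing response by default
fit_toy <- lm(Age ~ SibSp, data = toy)

missing <- which(is.na(toy$Age))

# Manual computation: intercept + slope * SibSp
manual <- coef(fit_toy)[[1]] + toy$SibSp[missing] * coef(fit_toy)[[2]]

# The same values obtained through predict()
via_predict <- unname(predict(fit_toy, newdata = toy[missing, ]))

# Fill the gaps in a copy of the data
imputed <- toy
imputed$Age[missing] <- via_predict
```

Using `predict()` avoids indexing the coefficients by position, which matters if more predictors are added to the formula later.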
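Now we fit a random forest model on the training data and save it to disk. Since Survived is read as a numeric 0/1 column, caret treats this as a regression problem, which is why the predictions are thresholded at 0.5 later:
modRF <- train(Survived ~ Pclass + Sex + Age + Fare + Embarked + SibSp + Parch,
               data = data, method = "rf")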
save(modRF, file = ".modRF")
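We now compute the confusion matrix on the same training data, just to get a rough idea of the fit (the estimate is optimistic, since the model has already seen these rows). Note that the call below passes the observed labels first, so in the output the rows labelled "Prediction" hold the observed values and the columns labelled "Reference" hold the model's predictions: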
pred <- predict(modRF, data)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
pred <- as.integer(pred > 0.5)
conf <- confusionMatrix(data$Survived, pred)
conf
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0 531  18
##          1  81 261
##
## Accuracy : 0.889
## 95% CI : (0.866, 0.909)
## No Information Rate : 0.687
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.757
## Mcnemar's Test P-Value : 4.63e-10
##
## Sensitivity : 0.868
## Specificity : 0.935
## Pos Pred Value : 0.967
## Neg Pred Value : 0.763
## Prevalence : 0.687
## Detection Rate : 0.596
## Detection Prevalence : 0.616
## Balanced Accuracy : 0.902
##
## 'Positive' Class : 0
##
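To make the reported statistics easier to read, here is a short sketch that recomputes a few of them by hand from the four counts printed above. The variable names are illustrative only; the row and column labels are kept exactly as printed, with class 0 as the 'Positive' class.

```r
# Counts copied from the confusion matrix printed above
# (matrix() fills column by column)
cm <- matrix(c(531, 81, 18, 261), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Share of cases on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Class 0 is the positive class, so sensitivity uses the "0" column
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Mean of sensitivity and specificity
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced_accuracy), 3)
```

The rounded values match the Accuracy, Sensitivity, Specificity and Balanced Accuracy lines of the report.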
One of the passengers in the testing set has no Fare value, so we fill it with the mean fare of the other passengers:
datest$Fare[is.na(datest$Fare)] <- mean(datest$Fare, na.rm = TRUE)
We now apply the model to the testing set and write the result to a CSV file for submission:
pred <- predict(modRF, datest)
pred <- as.integer(pred > 0.5)
result <- data.frame(datest$PassengerId, pred)
names(result)<-c("PassengerId","Survived")
write.csv(result, "prediction.csv",row.names=FALSE)
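The thresholding and export steps can be sketched on made-up values as follows. The IDs and predicted values here are hypothetical, and the file is written to a temporary location rather than prediction.csv:

```r
# Hypothetical passenger IDs and numeric predictions in [0, 1]
toy_ids  <- c(892, 893, 894, 895)
toy_pred <- c(0.12, 0.50, 0.73, 0.91)

# Strict inequality: a prediction of exactly 0.5 is mapped to 0
toy_class <- as.integer(toy_pred > 0.5)

# Name the columns directly when building the data frame
toy_result <- data.frame(PassengerId = toy_ids, Survived = toy_class)

# Write without row names, then read back to check the layout
out_file <- tempfile(fileext = ".csv")
write.csv(toy_result, out_file, row.names = FALSE)
round_trip <- read.csv(out_file)
```

Reading the file back is a cheap way to confirm that it has exactly the two expected columns before submitting.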