Almost everyone is familiar with the sinking of the Titanic. When the great ship sank in the Atlantic, killing a majority of its passengers, it stunned the world. This report explores a dataset containing information on the Titanic's passengers, some of whom survived and some of whom passed away. In the end, a machine learning model will be built from this information to predict, given a different set of passengers, who would survive.
#loading the library packages and dataset
library(dplyr)
library(ggplot2)
library(caret)
library(gridExtra)
library(VIM)
titanic <- read.csv("train.csv")
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
head(titanic)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
summary(titanic)
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
The dataset contains records for 891 passengers. Each record includes the passenger's ID, class on the ship, age, sex, name, ticket number, cabin number, fare, port of embarkation, and counts of family members aboard (siblings/spouses in SibSp and parents/children in Parch). Some of these variables can be removed, which will happen in the next section.
#Creating Factor Variables out of Survived and Pclass columns
titanic$Survived <- factor(titanic$Survived, labels = c("Passed", "Survived"))
titanic$Pclass <- factor(titanic$Pclass, labels = c("First", "Second", "Third"))
#Subsetting and viewing dataset
titanic_sub <- titanic %>%
select(-PassengerId, -Name, -Ticket, -Cabin, -SibSp, -Parch)
glimpse(titanic_sub)
## Observations: 891
## Variables: 6
## $ Survived <fctr> Passed, Survived, Survived, Survived, Passed, Passed...
## $ Pclass <fctr> Third, First, Third, First, Third, Third, First, Thi...
## $ Sex <fctr> male, female, female, female, male, male, male, male...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39,...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51....
## $ Embarked <fctr> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S...
#Finding total NA values in Age and Fare columns
sum(is.na(titanic_sub$Age))
## [1] 177
sum(is.na(titanic_sub$Fare))
## [1] 0
#Imputing NA values with kNN. The kNN function appends logical *_imp indicator columns, which are removed below
titanic_sub <- kNN(titanic_sub)
titanic_sub <- titanic_sub %>%
select(-(Survived_imp:Embarked_imp))
glimpse(titanic_sub)
## Observations: 891
## Variables: 6
## $ Survived <fctr> Passed, Survived, Survived, Survived, Passed, Passed...
## $ Pclass <fctr> Third, First, Third, First, Third, Third, First, Thi...
## $ Sex <fctr> male, female, female, female, male, male, male, male...
## $ Age <dbl> 22.0, 38.0, 26.0, 35.0, 35.0, 32.0, 54.0, 2.0, 27.0, ...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51....
## $ Embarked <fctr> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S...
Initially, the Survived and Pclass variables are stored as numeric. However, they needed to be converted to factor variables, and to make the dataset clearer to read, labels were added to their levels: the 0 and 1 in the Survived column became Passed and Survived respectively, and the 1, 2, and 3 in the Pclass column became First, Second, and Third. The variables PassengerId, Name, Ticket, and Cabin were removed because they are essentially unique to each passenger and will not be useful in the analysis; the family-count variables SibSp and Parch were dropped as well. Also, there are 177 NA values in the Age variable, which would cause problems when building a machine learning model. For this analysis, a suitable replacement for these NA values is kNN imputation, which replaces each missing value with an estimate based on the most similar passengers.
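To illustrate how kNN imputation behaves, the sketch below runs VIM's kNN function on a small made-up data frame (the toy data and the choice of k = 2 are assumptions for demonstration; the analysis above uses the VIM default of k = 5).
#Illustrative sketch: kNN imputation on a hypothetical toy data frame
toy <- data.frame(Age = c(22, NA, 35, 40, 29),
                  Fare = c(7.25, 71.28, 7.93, 53.10, 8.05))
toy <- kNN(toy, variable = "Age", k = 2) #k = 2 because the toy data is tiny
#kNN() appends a logical Age_imp column flagging imputed rows,
#which is dropped just as in the main analysis
toy <- toy %>%
  select(-Age_imp)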
#Creating Tables to See Who Survived and Who Didn't
table(titanic$Survived, titanic$Sex)
##
## female male
## Passed 81 468
## Survived 233 109
table(titanic$Survived, titanic$Pclass)
##
## First Second Third
## Passed 80 97 372
## Survived 136 87 119
table(titanic$Survived, titanic$Embarked)
##
## C Q S
## Passed 0 75 47 427
## Survived 2 93 30 217
These tables show that the majority of survivors were female, while most of those who passed were male, traveled in third class, or embarked at Southampton. Many survivors also boarded at Southampton, simply because that port is where most passengers embarked.
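These counts are easier to compare as rates. A short supplementary sketch (not part of the original output) converts the same tables to column proportions, so each entry reads as the share of that group who passed or survived.
#Converting counts to column-wise proportions (rates within each group)
prop.table(table(titanic$Survived, titanic$Sex), margin = 2)
prop.table(table(titanic$Survived, titanic$Pclass), margin = 2)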
#Creating Scatter Plots of Fare vs Age
fa_class <- ggplot(titanic_sub, aes(x= Age, y = Fare, col = Pclass)) +
geom_point() +
labs(title = "Fare vs Age (Class)")
fa_survival <- ggplot(titanic_sub, aes(x=Age, y = Fare, col = Survived)) +
geom_point() +
labs(title = "Fare vs Age (Survival)")
grid.arrange(fa_class, fa_survival, ncol = 1, nrow = 2)
#Creating Histograms of Age
age_s_hist <- ggplot(titanic_sub, aes(x=Age, fill = Survived)) +
geom_histogram() +
labs(title = "Age Histogram by Survival")
age_c_hist <- ggplot(titanic_sub, aes(x=Age, fill = Pclass)) +
geom_histogram() +
labs(title = "Age Histogram by Class")
grid.arrange(age_s_hist, age_c_hist, ncol = 1, nrow = 2)
#Creating histograms of Fare
fare_s_hist <- ggplot(titanic_sub, aes(x=Fare, fill = Survived)) +
geom_histogram() +
labs(title = "Fare Histogram by Survival")
fare_c_hist <- ggplot(titanic_sub, aes(x=Fare, fill = Pclass)) +
geom_histogram() +
labs(title = "Fare Histogram by Class")
grid.arrange(fare_c_hist, fare_s_hist, ncol = 1, nrow = 2)
These graphs show little apparent relationship between the Age and Fare variables. They do show, however, that those who survived often paid a higher fare. In addition, the Age Histogram by Survival shows that older passengers passed away at a higher proportion than younger ones; the highest survival rates occur at the low end of the age range, suggesting that children survived at a higher rate. Most of the Titanic's passengers were between the ages of 20 and 40. The Fare Histogram by Class shows that those who paid the lowest fares died at the highest rates. With the Exploratory Analysis concluded, the Machine Learning portion of the paper can begin.
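The child-survival pattern read off the histograms can also be checked numerically. The sketch below compares survival rates for children and adults; the Age < 18 cutoff is an assumption made for illustration.
#Comparing survival rates for children (Age < 18) and adults
titanic_sub %>%
  mutate(Child = Age < 18) %>%
  group_by(Child) %>%
  summarize(SurvivalRate = mean(Survived == "Survived"))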
set.seed(366284)
#Partitioning the data and subsetting data into a Train Set and a Test Set
inTrain <- createDataPartition(y = titanic_sub$Survived, p = 0.7, list=FALSE)
train <- titanic_sub[inTrain, ]
test <- titanic_sub[-inTrain, ]
During this step the dataset is split into a train set and a test set. The train set contains 70% of the data, selected by random sampling stratified on the Survived variable (the default behavior of createDataPartition); the test set contains the remaining 30%. The set.seed function was called to ensure the results of this paper are reproducible.
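Because createDataPartition samples within the levels of the outcome, the survival rate should be almost identical in both partitions. A quick check, added here as a sketch, would be:
#Verifying that the class balance is preserved in both partitions
prop.table(table(train$Survived))
prop.table(table(test$Survived))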
#Creating Random Forest Model
model_rf <- train(Survived ~ ., train,
                  method = "ranger",
                  preProcess = c("center", "scale"),
                  weights = train$Fare,
                  tuneLength = 7,
                  trControl = trainControl(method = "cv", number = 10))
model_rf
## Random Forest
##
## 625 samples
## 5 predictor
## 2 classes: 'Passed', 'Survived'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 562, 563, 562, 563, 563, 563, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.8160522 0.5869918
## 2 extratrees 0.8143881 0.5825573
## 3 gini 0.8240143 0.6099831
## 3 extratrees 0.8159754 0.5844420
## 4 gini 0.8271889 0.6254364
## 4 extratrees 0.8143625 0.5833607
## 5 gini 0.8320789 0.6396720
## 5 extratrees 0.8127240 0.5889617
## 6 gini 0.8304659 0.6365018
## 6 extratrees 0.8191244 0.6122004
## 7 gini 0.8336662 0.6422804
## 7 extratrees 0.8288018 0.6345698
## 8 gini 0.8304916 0.6365859
## 8 extratrees 0.8271633 0.6302253
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 7 and splitrule = gini.
This table shows the results of training the Random Forest Model. The algorithm used accuracy to select the optimal model, choosing the largest value. The final values for the model were mtry = 7 and splitrule = gini, for a cross-validated accuracy of about 83.37%.
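caret also provides a plot method for train objects. A one-line sketch to visualize how cross-validated accuracy varies across the mtry and splitrule tuning grid:
#Plotting cross-validated accuracy across the tuning grid
ggplot(model_rf)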
#Testing Accuracy of Model
predictions_rf <- predict(model_rf, test)
confusionMatrix(predictions_rf, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Passed Survived
## Passed 148 25
## Survived 16 77
##
## Accuracy : 0.8459
## 95% CI : (0.7968, 0.8871)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6685
## Mcnemar's Test P-Value : 0.2115
##
## Sensitivity : 0.9024
## Specificity : 0.7549
## Pos Pred Value : 0.8555
## Neg Pred Value : 0.8280
## Prevalence : 0.6165
## Detection Rate : 0.5564
## Detection Prevalence : 0.6504
## Balanced Accuracy : 0.8287
##
## 'Positive' Class : Passed
##
The model predicted with 84.59% accuracy when its predictions for the test set were compared with the actual values. This is in line with the cross-validated estimate of about 83.37%. Although this result is satisfactory, it is beneficial to test other models. The Sensitivity, or True Positive Rate, was 90.24%, and the Specificity, or True Negative Rate, was 75.49%. Thus, the model was better at correctly predicting who passed than at predicting who survived.
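Both rates come directly from the confusion matrix above; as a worked check of caret's arithmetic:
#Sensitivity: correctly predicted "Passed" out of all who actually passed
148 / (148 + 16) #0.9024
#Specificity: correctly predicted "Survived" out of all who actually survived
77 / (77 + 25) #0.7549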
#Creating Ada Model
model_ada <- train(Survived ~ ., train,
                   method = "ada",
                   weights = train$Fare,
                   trControl = trainControl(method = "cv", number = 10))
model_ada
## Boosted Classification Trees
##
## 625 samples
## 5 predictor
## 2 classes: 'Passed', 'Survived'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 563, 563, 563, 563, 562, 562, ...
## Resampling results across tuning parameters:
##
## maxdepth iter Accuracy Kappa
## 1 50 0.7869688 0.5415727
## 1 100 0.7869688 0.5408629
## 1 150 0.7885817 0.5444646
## 2 50 0.7903226 0.5438846
## 2 100 0.7919099 0.5487324
## 2 150 0.7966718 0.5588023
## 3 50 0.8238607 0.6146299
## 3 100 0.8142601 0.5998318
## 3 150 0.8126216 0.5968771
##
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 50, maxdepth = 3 and nu
## = 0.1.
For this model, accuracy was again used to select the optimal configuration, choosing the largest value. The final values used were iter = 50, maxdepth = 3, and nu = 0.1, for a cross-validated accuracy of about 82.39%. When tested, this model should yield slightly less accurate results than the Random Forest Model.
#Testing Ada Model
predictions_ada <- predict(model_ada, test)
confusionMatrix(predictions_ada, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Passed Survived
## Passed 152 33
## Survived 12 69
##
## Accuracy : 0.8308
## 95% CI : (0.7803, 0.8738)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 2.211e-14
##
## Kappa : 0.6277
## Mcnemar's Test P-Value : 0.002869
##
## Sensitivity : 0.9268
## Specificity : 0.6765
## Pos Pred Value : 0.8216
## Neg Pred Value : 0.8519
## Prevalence : 0.6165
## Detection Rate : 0.5714
## Detection Prevalence : 0.6955
## Balanced Accuracy : 0.8016
##
## 'Positive' Class : Passed
##
This model predicted with about 83.08% accuracy. Although its predictions were slightly more successful than the 82.39% cross-validated estimate, the accuracy was still close to what was expected. In addition, the Sensitivity, or True Positive Rate, was 92.68%, and the Specificity, or True Negative Rate, was 67.65%. This Ada Model, like the Random Forest Model, is better at predicting who would pass away on the Titanic than at predicting who would survive.
Since the Random Forest Model both estimated a higher cross-validated accuracy than the Ada Model and actually predicted with better accuracy when tested, it is the model that will be used on the outside data. Although several trends improved a passenger's odds, the most likely outcome for any given passenger was death. Women survived at a higher rate, but many still died, and the same is true for those who paid higher fares or were young. With that in mind, we will test how this model performs when given data from another dataset.
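caret's resamples function offers a more formal side-by-side comparison of the two models' cross-validation results. The sketch below is an addition to the original workflow and assumes both train objects retained their resampling results (the caret default).
#Comparing the cross-validated accuracy of both models fold by fold
model_comparison <- resamples(list(RandomForest = model_rf, Ada = model_ada))
summary(model_comparison)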
#Loading Data Set
#Turning Pclass to Factor Variable
#Imputing NA Values
final_test <- read.csv("test.csv", header = TRUE)
final_test$Pclass <- factor(final_test$Pclass,
labels = c("First", "Second", "Third"))
final_test <- kNN(final_test)
final_test <- final_test %>%
select(-(PassengerId_imp:Embarked_imp))
dim(final_test)
## [1] 418 11
Here the dataset to be used for predictions is loaded and the Pclass variable is converted to a factor. In addition, kNN imputation is applied to all missing values in this test dataset; imputation was mostly needed in the Age column, plus for one passenger whose Fare was missing. Finally, the dimensions of the dataset are shown: there will be 418 predictions.
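A quick sanity check, added here as a sketch, confirms that no missing values remain before predicting:
#Confirming that imputation removed all NA values from the test data
colSums(is.na(final_test))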
#Making Predictions
#Adding Survived Column
#Selecting Columns for Submission
#Changing Survived Column back to Factors of 0 and 1
#Writing CSV file for submission. Removing row names
predictions_rf <- predict(model_rf, final_test)
final_test$Survived <- predictions_rf
submission_rf <- final_test[, c("PassengerId", "Survived")]
submission_rf$Survived <- factor(submission_rf$Survived, labels = c(0, 1))
write.csv(submission_rf, "titanic_rf_predictions.csv", row.names = FALSE)
length(predictions_rf)
## [1] 418
There were 418 predictions made and stored in the variable predictions_rf. For submission to the Kaggle competition, the predictions were attached as a Survived column, the PassengerId and Survived columns were selected, the factor labels were converted back to 0 and 1, and the result was saved as a CSV file.
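As a final check (not part of the original workflow), the structure of the submission data frame can be inspected to confirm the two-column format Kaggle expects:
#Verifying the submission contains only PassengerId and Survived
str(submission_rf)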