This is the Data Science Competition Project Titanic: Machine Learning from Disaster hosted by Kaggle. The goal of this project is to predict the survivability of Titanic Passengers. The datasets are provided by Kaggle and contains both train and test datasets. The table below shows the descriptions of the variables in the datasets.
| Variable | Description |
|---|---|
| Survived | Indicates whether the Passenger Survived or Died (1 = Survived, 0 = Died) |
| Name | Name of the Passenger |
| Pclass | Ticket Class (1 = 1st class, 2 = 2nd class, 3 = 3rd class) |
| Sex | Gender of the Passenger (Male/Female) |
| Age | Age of the Passenger |
| SibSp | Number of Siblings/Spouse aboard with the Passenger |
| Parch | Number of Parents/Children aboard with the Passenger |
| Ticket | Passenger’s Ticket Number |
| Fare | Passenger’s Ticket Price |
| Cabin | Passenger’s Cabin Number |
| Embarked | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
This notebook is divided into a number of activities including the preparation of data, data cleaning, data transformation, Exploratory Data Analysis and finally the Prediction. We first check if there are any missing values and correct them if necessary. After the data has been cleaned and structured, we then do multiple visualizations and analysis to understand more about the data. We also select some importance features for building the prediction model that can be used to predict survival rate. This project only use the Random Forest as the algorithm for the model.
We first load the given train and test datasets. We also set all the string variables as non factors and identify the NA values.
titanicTrain <- read.csv("/Users/adrianromano/R/titanic/train.csv", stringsAsFactors = FALSE, na.strings = c("NA", ""))
titanicTest <- read.csv("/Users/adrianromano/R/titanic/test.csv", stringsAsFactors = FALSE, na.strings = c("NA", ""))
Note: We decide to combine both the datasets into one so that we only clean one dataset instead of inputting the code twice for two datasets.
Before we merge the dataset, we have to make sure both the datasets have the same exact variables. We check them as follows:
names(titanicTrain)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
names(titanicTest)
## [1] "PassengerId" "Pclass" "Name" "Sex" "Age"
## [6] "SibSp" "Parch" "Ticket" "Fare" "Cabin"
## [11] "Embarked"
The test dataset is missing the Survived variable so we create the variable ourselves consisting of NA values.
titanicTest$Survived <- "NA"
Now the two datasets have the same exact variables. We have to do one last thing before we combine the datasets, that is to label which datasets is the train and the test set. This is needed so that we can later separate them easily after the cleaning process is done. We do so by adding another variable Set to both datasets containing characters of either train or test.
titanicTrain$Set <- "Train"
titanicTest$Set <- "Test"
Now that it is done, we can now merge the datasets into one.
titanicFull <- rbind(titanicTrain, titanicTest)
Let’s see the structure and dimension of the combined dataset.
str(titanicFull)
## 'data.frame': 1309 obs. of 13 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : chr "0" "1" "1" "1" ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
## $ Set : chr "Train" "Train" "Train" "Train" ...
dim(titanicFull)
## [1] 1309 13
The data contains 1309 observations and 13 Variables.
We are done preparing the data! Now let’s move on to the cleaning process.
Now is the part where we clean and transform the data for easier analysis and prediction. We first check if there are any missing values present in the dataset:
colSums(is.na(titanicFull))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 1014 2
## Set
## 0
Looks like the variables Fare and Embarked have some missing values. The Age and Cabin variables however have a lot of missing values! (263 and 1014 respectively).
The Embarked variable contains two missing values. Since the missing values are just a few and not that concerning, we just decide to replace them with the value of the mode (highest amount of count). We first check the Embarked distribution as shown:
table(titanicFull$Embarked)
##
## C Q S
## 270 123 914
Looks like the highest amount of count comes from port “S” with 914 total passengers. So we will use this as the mode value and replace the missing Embarked value as “S”.
titanicFull[is.na(titanicFull$Embarked), "Embarked"] <- "S"
The Fare variable contains only one missing value. We can easily replace it with the median of the fares as follows:
fareMedian <- median(titanicFull$Fare, na.rm = TRUE)
titanicFull[is.na(titanicFull$Fare), "Fare"] <- fareMedian
Note: One thing that we can also do is to transform the Fare variable to categorical variables to see if it would increase prediction accuracy later on.
We divide Fare to three category (Low Fares, Medium Fares, High Fares) as shown:
titanicFull$FareGroup <- cut(titanicFull$Fare, breaks = 3, labels = c("Low", "Medium", "High"))
table(titanicFull$FareGroup)
##
## Low Medium High
## 1271 34 4
That is all we needed to do for Fare. Next let’s look at the Cabin variable:
head(titanicFull$Cabin)
## [1] NA "C85" NA "C123" NA NA
Note: Since the missing values in Cabin variable is really high and we also have no way of knowing the cabin informations that are missing, we decide to just ignore and drop this feature completely.
For Age variable, we can also do the same as we did on Fare by replacing the missing values with the median. However, the missing values for Age are a lot higher so a simple median computation alone may not provide an accurate prediction. We decide to use another approach by using a regression model to predict the values. We will do this later on in the section and explore the other variables first.
Those are the only variables that need to be cleaned. Now we move on to see if there are any variables that can be transformed to help with easier analysis and prediction.
Let’s now look at the Name Variable:
head(titanicFull$Name)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
length(unique(titanicFull$Name))
## [1] 1307
There are so many different names (1307 total), it will be hard to visualize it for the report.
Note: I notice that every name has a title such as “Mr”, “Mrs”, “Miss”" etc. So we decide to extract the titles to another variable Title to tone down the distribution.
titanicFull$Title <- gsub("^.*, (.*?)\\..*$", "\\1", titanicFull$Name)
table(titanicFull$Sex, titanicFull$Title)
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme
## female 0 0 0 1 1 0 1 0 0 260 2 1
## male 1 4 1 0 7 1 0 2 61 0 0 0
##
## Mr Mrs Ms Rev Sir the Countess
## female 0 197 2 0 0 1
## male 757 0 0 8 1 0
Looking from the result above, it seems that “Mr”, “Mrs”, “Master” and “Miss” have lot of counts so we decide to keep those. “Mlle” and “Ms” are the same as “Miss” so we include them together with “Miss”. “Mme” is the same as “Mrs” so we also include them together. All the other titles only have very few counts, so we combine them all as one into a new name “Other” containing all those other titles.
titanicFull$Title[titanicFull$Title %in% c("Mlle", "Ms")] <- "Miss"
titanicFull$Title[titanicFull$Title == "Mme"] <- "Mrs"
titanicFull$Title[titanicFull$Title %in% c("Capt", "Col", "Don", "Dona", "Dr", "Jonkheer", "Lady", "Major", "Rev", "Sir", "the Countess")] <- "Other"
table(titanicFull$Sex, titanicFull$Title)
##
## Master Miss Mr Mrs Other
## female 0 264 0 198 4
## male 61 0 757 0 25
Now this looks a lot better! I can see this feature as a part of the visualization and prediction later.
Next we look at both SibSp and Parch variables.
These two variables contain the number of family members on board with the passengers, it may be possible to combine them into one variable called FamilySize for simplicity.
titanicFull$FamilySize <- titanicFull$SibSp + titanicFull$Parch + 1
Note: The passenger is counted towards the size of the family thats why we added 1 representing the passenger themselves.
One thing we can also do is to divide the Family Size distribution into 3 groups (Alone, Small Family Size, Large Family Size).
titanicFull$FamilyGroup[titanicFull$FamilySize == 1] <- "Alone"
titanicFull$FamilyGroup[titanicFull$FamilySize > 1 & titanicFull$FamilySize < 5] <- "Small"
titanicFull$FamilyGroup[titanicFull$FamilySize > 4] <- "Large"
Next, let’s now look at the Ticket variable
head(titanicFull$Ticket)
## [1] "A/5 21171" "PC 17599" "STON/O2. 3101282"
## [4] "113803" "373450" "330877"
length(unique(titanicFull$Ticket))
## [1] 929
Note: It looks like this variable is not relevant to predict Survival and there are too many unique values so we decide to just ignore and drop this feature completely.
Now we can finally look at Age variable. Before we do the regression model, we need to see if the data has any outliers by looking at the boxplot distribution.
boxplot(titanicFull$Age, col = "bisque", main = "Boxplot of Age")
Note: We can see that there are numerous outliers present, they need to be removed since these outliers can make the prediction less accurate.
Here we filter the outliers out by taking only the values until the maximum value:
outlierFilter <- titanicFull$Age < boxplot.stats(titanicFull$Age)$stats[5]
Now we can make the regression model using the filtered outlier data. The predicted results are used to replace the missing values of Age.
ageModel <- lm(Age ~ Pclass + Sex + Fare + FamilyGroup + Embarked + Title, data = titanicFull[outlierFilter,])
newdata <- titanicFull[is.na(titanicFull$Age), c("Pclass", "Sex", "Fare", "FamilyGroup", "Embarked", "Title")]
titanicFull[is.na(titanicFull$Age), "Age"] <- predict(ageModel, newdata)
One thing that we can also do is to divide the Age variable into groups of either Child or Adult. We make Child as the passengers with age lower than 18 years old and Adult as passengers with age 18 or higher. We store them in a new variable named AgeGroup.
titanicFull$AgeGroup[titanicFull$Age < 18] <- "Child"
titanicFull$AgeGroup[titanicFull$Age >= 18] <- "Adult"
I think that is done for this section! Lets do one final check of the columns to make sure they have no more missing values.
colSums(is.na(titanicFull))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 1014 0
## Set FareGroup Title FamilySize FamilyGroup AgeGroup
## 0 0 0 0 0 0
All cleaned except for Cabin variable that we decided earlier to ignore and drop so we are good!
Before we split the dataset back to the train and test datasets, we convert some variables into factors so that they can be visualized later in the next section.
Note: The Survived variable will be converted later in the Train set after we split the dataset since the combined dataset still has NA values present from the Test set that we created before.
titanicFull$Pclass <- as.factor(titanicFull$Pclass)
titanicFull$Sex <- as.factor(titanicFull$Sex)
titanicFull$AgeGroup <- as.factor(titanicFull$AgeGroup)
titanicFull$FareGroup <- as.factor(titanicFull$FareGroup)
titanicFull$Embarked <- as.factor(titanicFull$Embarked)
titanicFull$Title <- as.factor(titanicFull$Title)
titanicFull$FamilySize <- as.factor(titanicFull$FamilySize)
titanicFull$FamilyGroup <- as.factor(titanicFull$FamilyGroup)
Now we split them back to the original train and test datasets along with setting the Survived variable to factors.
titanicTrain <- titanicFull[titanicFull$Set == "Train", ]
titanicTest <- titanicFull[titanicFull$Set == "Test", ]
titanicTrain$Survived <- as.factor(titanicTrain$Survived)
Let’s do a final check of the train set structure before moving on to the next section.
str(titanicTrain)
## 'data.frame': 891 obs. of 18 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ Set : chr "Train" "Train" "Train" "Train" ...
## $ FareGroup : Factor w/ 3 levels "Low","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Title : Factor w/ 5 levels "Master","Miss",..: 3 4 2 4 3 3 3 1 4 4 ...
## $ FamilySize : Factor w/ 9 levels "1","2","3","4",..: 2 2 1 2 1 1 1 5 3 2 ...
## $ FamilyGroup: Factor w/ 3 levels "Alone","Large",..: 3 3 1 3 1 1 1 2 3 3 ...
## $ AgeGroup : Factor w/ 2 levels "Adult","Child": 1 1 1 1 1 1 1 2 1 2 ...
Now that the data is already cleaned and structured, we can move on to the analysis and visualizations.
As usual, we first load the necessary packages:
library(dplyr)
library(pander)
library(ggplot2)
We then see the distribution of Survival:
ggplot(titanicTrain, aes(x = Survived)) +
geom_bar(stat = "count", aes(fill = Survived), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Number of Survivors") +
geom_label(stat = "count", aes(label = ..count..)) +
theme_classic()
Comments: We can see that there are more passengers that died than survived (549 people died and only 342 people survived).
Which are the passengers that survived? Let’s find out by exploring more of the data.
ggplot(titanicTrain, aes(Age)) +
geom_histogram() +
labs(x = "Age", y = "Number of Passengers", title = "Number of Passengers based on Age") +
theme_bw()
Comment: There seem to be a lot of passengers of around age 30 on board shown by the highest peak in the histogram.
Lets see the Survival comparison for age:
ggplot(titanicTrain) +
geom_freqpoly(mapping = aes(x = Age, color = Survived), binwidth = 1) +
labs(x = "Age", y = "Number of Passengers", title = "Survival Comparison by Age") +
theme_bw()
Comment: It seems that a lot of the middle aged passengers died while the younger age passengers of around age 1 or 2 are more likely to survive.
We can probably see it more clearly by looking at the survival comparison between Child and Adults using the AgeGroup feature we created from previous section.
Note: On previous section we assigned Child as passengers of age below 18 and Adults as passengers of age 18 and above.
Let’s see the count comparison between Child and Adult on board.
agegroupCount <- titanicTrain %>%
group_by(AgeGroup) %>%
count(AgeGroup) %>%
select(AgeGroup, Passengers = n)
pandoc.table(agegroupCount)
##
## -----------------------
## AgeGroup Passengers
## ---------- ------------
## Adult 763
##
## Child 128
## -----------------------
ggplot(agegroupCount, aes(x = AgeGroup, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = AgeGroup), col = "black", alpha = 0.8) +
labs(x = "", y = "Number of Passengers", title = "Number of Child vs Adult on board") +
theme_classic()
Comment: The difference is pretty big! 763 total Adults are on board while there are only 128 Children on board.
Now we see the survival comparison between Child and Adult.
agegroupSurvived <- titanicTrain %>%
group_by(AgeGroup) %>%
count(Survived) %>%
filter(Survived == 1) %>%
select(AgeGroup, Passengers = n)
pandoc.table(agegroupSurvived)
##
## -----------------------
## AgeGroup Passengers
## ---------- ------------
## Adult 278
##
## Child 64
## -----------------------
agegroupSurvived <- titanicTrain %>%
group_by(AgeGroup) %>%
count(Survived) %>%
select(AgeGroup, Survived, Passengers = n)
ggplot(agegroupSurvived, aes(x = AgeGroup, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = AgeGroup), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Survival of Child vs Adult on board (0 = Died, 1 = Survived)") +
facet_wrap(~Survived) +
theme_bw()
Summary:
It seems pretty expected that Child survival rate is higher than Adult as usually the safety of childrens are prioritized especially if the other family members are present.
ggplot(titanicTrain, aes(Fare)) +
geom_density() +
labs(x = "Fares", y = "Number of Passengers", title = "Number of Passengers based on Fares") +
theme_bw()
ggplot(titanicTrain) +
geom_freqpoly(mapping = aes(x = Fare, color = Survived), binwidth = 0.05) +
scale_x_log10() +
labs(x = "Fares", y = "Number of Passengers", title = "Number of Survival based on Fares") +
theme_bw()
Comment: The plot shows that majority of the passengers bought a low fare ticket to board the ship. It also looks like the higher the fare amount, passengers tend to have higher chance to survive.
We can see the relationship more clearly by looking at the FareGroup distribution:
faregroupCount <- titanicTrain %>%
group_by(FareGroup) %>%
count(FareGroup) %>%
select(FareGroup, Passengers = n)
pandoc.table(faregroupCount)
##
## ------------------------
## FareGroup Passengers
## ----------- ------------
## Low 871
##
## Medium 17
##
## High 3
## ------------------------
ggplot(faregroupCount, aes(x = FareGroup, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = FareGroup), col = "black", alpha = 0.8) +
labs(x = "", y = "Number of Passengers", title = "Number of Passengers based on Fare levels") +
theme_classic()
Comment: So again, it is shown that majority of passengers have low fare tickets. This matches with the previous findings.
Now see the relationship of FareGroup with Survival:
faregroupSurvived <- titanicTrain %>%
group_by(FareGroup) %>%
count(Survived) %>%
filter(Survived == 1) %>%
select(FareGroup, Passengers = n)
pandoc.table(faregroupSurvived)
##
## ------------------------
## FareGroup Passengers
## ----------- ------------
## Low 328
##
## Medium 11
##
## High 3
## ------------------------
faregroupSurvived <- titanicTrain %>%
group_by(FareGroup) %>%
count(Survived) %>%
select(FareGroup, Survived, Passengers = n)
ggplot(faregroupSurvived, aes(x = FareGroup, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = FareGroup), col = "black", alpha = 0.8) +
labs(x = "", y = "", title = "Number of Survivors based on Fare Levels (0 = Died, 1 = Survived)") +
facet_wrap(~Survived) +
theme_bw()
Summary:
It looks like the passengers that paid higher fare tickets have a higher chance of surviving. It does make sense since having a higher Fare ticket means they are probably the rich people such as the nobility that usually are prioritized more to be saved.
classCount <- titanicTrain %>%
group_by(Pclass) %>%
count(Pclass) %>%
select(Pclass, Passengers = n)
pandoc.table(classCount)
##
## ---------------------
## Pclass Passengers
## -------- ------------
## 1 216
##
## 2 184
##
## 3 491
## ---------------------
ggplot(classCount, aes(x = Pclass, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = Pclass), col = "black", alpha = 0.8) +
labs(x = "Passenger Class", y = "Number of Passengers", title = "Number of Passengers in each passenger class") +
theme_classic()
Comment: It is shown that most of the passengers are in the 3rd class while the other two classes have similar counts.
Let’s now see the classes relationship with survival rate:
classSurvived <- titanicTrain %>%
group_by(Pclass) %>%
count(Survived) %>%
filter(Survived == 1) %>%
select(Pclass, Passengers = n)
pandoc.table(classSurvived)
##
## ---------------------
## Pclass Passengers
## -------- ------------
## 1 136
##
## 2 87
##
## 3 119
## ---------------------
classSurvived <- titanicTrain %>%
group_by(Pclass) %>%
count(Survived) %>%
select(Pclass, Survived, Passengers = n)
ggplot(classSurvived, aes(x = Pclass, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = Pclass), col = "black", alpha = 0.8) +
labs(x = "", y = "Number of Passengers", title = "Number of Survivors based on their passenger class (0 = Died, 1 = Survived)") +
facet_wrap(~Survived) +
theme_bw()
Summary:
It is highly possible that the 1st class passengers includes a majority of nobility and other rich passengers that pays more to be in 1st class. So it does make sense that the result shows the highest survival rate from the 1st class passengers.
sexCount <- titanicTrain %>%
group_by(Sex) %>%
count(Sex) %>%
select(Gender = Sex, Passengers = n)
pandoc.table(sexCount)
##
## ---------------------
## Gender Passengers
## -------- ------------
## female 314
##
## male 577
## ---------------------
ggplot(sexCount, aes(x = Gender, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = Gender), col = "black", alpha = 0.8) +
labs(x = "Gender", y = "Number of Passengers", title = "Number of Males vs Females on board") +
theme_classic()
Comment: There are more male passengers on board compared to females. Male passengers are almost twice the count of female passengers.
Now let’s see the relationship of Gender with survival rate:
sexSurvived <- titanicTrain %>%
group_by(Sex) %>%
count(Survived) %>%
filter(Survived == 1) %>%
select(Gender = Sex, Passengers = n)
pandoc.table(sexSurvived)
##
## ---------------------
## Gender Passengers
## -------- ------------
## female 233
##
## male 109
## ---------------------
sexSurvived <- titanicTrain %>%
group_by(Sex) %>%
count(Survived) %>%
select(Gender = Sex, Survived, Passengers = n)
ggplot(sexSurvived, aes(x = Gender, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = Gender), col = "black", alpha = 0.8) +
labs(x = "", y = "Number of Passengers", title = "Males vs Females Survival Comparison (0 = Died, 1 = Survived)") +
facet_wrap(~Survived) +
theme_bw()
Summary:
Females seem to be highly prioritized to be saved as shown by the significant difference of survival rate compared to males. It is not a surprise since women and children are usually saved first before the men in this society.
embarkedCount <- titanicTrain %>%
group_by(Embarked) %>%
count(Embarked) %>%
select(Port = Embarked, Passengers = n)
pandoc.table(embarkedCount)
##
## -------------------
## Port Passengers
## ------ ------------
## C 168
##
## Q 77
##
## S 646
## -------------------
ggplot(embarkedCount, aes(x = Port, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = Port), col = "black", alpha = 0.8) +
labs(x = "", y = "Number of Passengers", title = "Number of Passengers from each Port of Embarkation") +
theme_classic()
Comment: The analysis shows a majority of the passengers embarked from port S (Southampton).
Now lets see the relationship between the Ports of Embarkation with Survival:
embarkedSurvived <- titanicTrain %>%
group_by(Embarked) %>%
count(Survived) %>%
filter(Survived == 1) %>%
select(Port = Embarked, Passengers = n)
pandoc.table(embarkedSurvived)
##
## -------------------
## Port Passengers
## ------ ------------
## C 93
##
## Q 30
##
## S 219
## -------------------
embarkedSurvived <- titanicTrain %>%
group_by(Embarked) %>%
count(Survived) %>%
select(Port = Embarked, Survived, Passengers = n)
ggplot(embarkedSurvived, aes(x = Port, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = Port), col = "black", alpha = 0.8) +
labs(x = "", y = "Number of Passengers", title = "Number of Survivors based on the Port of Embarkation (0 = Died, 1 = Survived)") +
facet_wrap(~Survived) +
theme_bw()
Summary:
It seems that the highest survival rate exists in passengers who embarked from port C (Cherbourq). The other two ports (S and Q) have similar survival rate with one another.
titleCount <- titanicTrain %>%
group_by(Title) %>%
count(Title) %>%
select(Title, Passengers = n)
pandoc.table(titleCount)
##
## ---------------------
## Title Passengers
## -------- ------------
## Master 40
##
## Miss 185
##
## Mr 517
##
## Mrs 126
##
## Other 23
## ---------------------
ggplot(titleCount, aes(x = Title, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = Title), col = "black", alpha = 0.8) +
labs(x = "Title", y = "Number of Passengers", title = "Number of Passengers based on the Title of their name") +
theme_classic()
Comment: The result shows a large number of passengers having the title “Mr” on their name. It does match with earlier analysis that there are a lot of Males that boarded the ship.
Now lets see the relationship of Title with Survival
titleSurvived <- titanicTrain %>%
group_by(Title) %>%
count(Survived) %>%
filter(Survived == 1) %>%
select(Title, Passengers = n)
pandoc.table(titleSurvived)
##
## ---------------------
## Title Passengers
## -------- ------------
## Master 23
##
## Miss 130
##
## Mr 81
##
## Mrs 100
##
## Other 8
## ---------------------
titleSurvived <- titanicTrain %>%
group_by(Title) %>%
count(Survived) %>%
select(Title, Survived, Passengers = n)
ggplot(titleSurvived, aes(x = Title, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = Title), col = "black", alpha = 0.8) +
labs(x = "", y = "Number of Passengers", title = "Number of Survivors based on the Title of their name (0 = Died, 1 = Survived)") +
facet_wrap(~Survived) +
theme_bw()
Summary:
These findings are also expected since we know from earlier analysis that Females have higher chance to survive compared to males. So it is expected that the female titles such as “Miss” and “Mrs” have a high survival rate while the male counterparts have low survival rates.
famsizeCount <- titanicTrain %>%
group_by(FamilySize) %>%
count(FamilySize) %>%
select(FamilySize, Passengers = n)
pandoc.table(famsizeCount)
##
## -------------------------
## FamilySize Passengers
## ------------ ------------
## 1 537
##
## 2 161
##
## 3 102
##
## 4 29
##
## 5 15
##
## 6 22
##
## 7 12
##
## 8 6
##
## 11 7
## -------------------------
ggplot(famsizeCount, aes(x = FamilySize, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = FamilySize), col = "black", alpha = 0.8) +
labs(x = "Family Size", y = "Number of Passengers", title = "Number of Passengers based on their Family Size") +
theme_classic()
Comment: There are a lot of passengers with no family members that boarded the ship (537 total). The count trend seem to be decreasing the more family size a passenger have other than family size of 6 where it increases a little.
We now look at the relationship between FamilySize with Survival
famsizeSurvived <- titanicTrain %>%
group_by(FamilySize) %>%
count(Survived) %>%
filter(Survived == 1) %>%
select(FamilySize, Passengers = n)
pandoc.table(famsizeSurvived)
##
## -------------------------
## FamilySize Passengers
## ------------ ------------
## 1 163
##
## 2 89
##
## 3 59
##
## 4 21
##
## 5 3
##
## 6 3
##
## 7 4
## -------------------------
famsizeSurvived <- titanicTrain %>%
group_by(FamilySize) %>%
count(Survived) %>%
select(FamilySize, Survived, Passengers = n)
ggplot(famsizeSurvived, aes(x = FamilySize, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = FamilySize), col = "black", alpha = 0.8) +
labs(x = "", y = "Number of Passengers", title = "Number of Survivors based on their Family Size (0 = Died, 1 = Survived)") +
facet_wrap(~Survived) +
theme_bw()
Comment: The result seems to show a decreasing survival rate the more family members are on board with the passengers. None of the passengers with family size of more than 7 survived.
We can look at the Familygroup variable that we created before to understand the relationship more:
famgroupCount <- titanicTrain %>%
group_by(FamilyGroup) %>%
count(FamilyGroup) %>%
select(FamilyGroup, Passengers = n)
pandoc.table(famgroupCount)
##
## --------------------------
## FamilyGroup Passengers
## ------------- ------------
## Alone 537
##
## Large 62
##
## Small 292
## --------------------------
ggplot(famgroupCount, aes(x = FamilyGroup, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = FamilyGroup), col = "black", alpha = 0.8) +
labs(x = "", y = "Number of Passengers", title = "Number of Passengers based on their Family Group") +
theme_classic()
Comment: Just like before, the result shows that most passengers boarded the ship without any family members.
Lets see the relationship with Survival
famgroupSurvived <- titanicTrain %>%
group_by(FamilyGroup) %>%
count(Survived) %>%
filter(Survived == 1) %>%
select(FamilyGroup, Passengers = n)
pandoc.table(famgroupSurvived)
##
## --------------------------
## FamilyGroup Passengers
## ------------- ------------
## Alone 163
##
## Large 10
##
## Small 169
## --------------------------
famgroupSurvived <- titanicTrain %>%
group_by(FamilyGroup) %>%
count(Survived) %>%
select(FamilyGroup, Survived, Passengers = n)
ggplot(famgroupSurvived, aes(x = FamilyGroup, y = Passengers)) +
geom_bar(stat = "identity", aes(fill = FamilyGroup), col = "black", alpha = 0.8) +
labs(x = "", y = "Number of Passengers", title = "Number of Survivors based on their Family Group (0 = Died, 1 = Survived)") +
facet_wrap(~Survived) +
theme_bw()
Summary:
The result shows that small family group of size 2 - 4 has the highest chance of surviving. The chance of survival is really low for the passengers that has a large family size members of more than 4. Passengers that boarded alone also do not have a high survival rate.
We finished exploring the data! Now we should have a good understanding of how the data looks like and can move on to building the prediction model.
As usual, we first load the necessary packages:
library(caret)
library(randomForest)
Note: As mentioned in the introduction, we only use Random Forest model for the prediction.
We then further split the data to 70% training sets and 30% test sets as follows:
set.seed(1995)
inTrain <- createDataPartition(titanicTrain$Survived, p = 0.7, list = FALSE)
training <- titanicTrain[inTrain, ]
testing <- titanicTrain[-inTrain, ]
Here we build the model and see the accuracy of the model:
set.seed(1995)
rfModel <- randomForest(Survived ~ Pclass + Sex + AgeGroup + Fare + Embarked + Title + FamilyGroup, data = training)
rfPred <- predict(rfModel, newdata = testing)
rfCM <- confusionMatrix(rfPred, testing$Survived)
rfCM$table
## Reference
## Prediction 0 1
## 0 146 33
## 1 18 69
accuracy <- rfCM$overall[1]
accuracy
## Accuracy
## 0.8082707
The random forest model yields a solid 80.8% accuracy. Not bad for our first try!
It is done! We can now use this model to make the prediction on the original test data and store the result to a csv file for the submission to Kaggle:
prediction <- predict(rfModel, newdata = titanicTest)
titanicPrediction <- data.frame(PassengerId = titanicTest$PassengerId, Survived = prediction)
write.csv(titanicPrediction, file = "TitanicPrediction.csv", row.names = FALSE)