RMS Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912 after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of modern history’s deadliest peacetime commercial marine disasters. More than 100 years after the sinking, the Titanic still appears in many different contexts. Here we look for insight by visualizing the disaster data and comparing the characteristics of the passengers who survived with those who died.
library(dplyr) # for data wrangling
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graphs
library(inspectdf) # for EDA
library(tidymodels) # to build tidy models
library(caret) # to pre-process data
library(yardstick)

The source of the data is https://www.kaggle.com/c/titanic

Titanic_Survival <- read.csv("train (1).csv")

The data contain the following information:

survival : Survival (0 = No, 1 = Yes)
pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
sex : Male or Female
Age : Age in years
sibsp : number of siblings / spouses aboard the Titanic
parch : number of parents / children aboard the Titanic
ticket : Ticket number
fare : Passenger fare
cabin : Cabin number
embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

rmarkdown::paged_table(Titanic_Survival)

Before we build the visualizations, we need to check whether the data contain NA values.
colSums(is.na(Titanic_Survival))
#> PassengerId    Survived      Pclass        Name         Sex         Age
#>           0           0           0           0           0         177
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked
#>           0           0           0           0           0           0
sum(is.na(Titanic_Survival$Age))/nrow(Titanic_Survival)*100
#> [1] 19.86532
The Age column is missing 177 values, almost 20% of the dataset. Next we check for blank (empty string) values in the data.
sapply(Titanic_Survival, function(x){sum(x=='')})
#> PassengerId    Survived      Pclass        Name         Sex         Age
#>           0           0           0           0           0          NA
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked
#>           0           0           0           0         687           2
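The two checks above (NA values and blank strings) can be combined into a single pass. The following is a minimal sketch in base R; `count_missing` is a hypothetical helper name, and the small `toy` data frame is made up for illustration and only stands in for `Titanic_Survival`.

```r
# Hypothetical helper: count values that are either NA or an empty string.
count_missing <- function(df) {
  sapply(df, function(x) sum(is.na(x) | x == ""))
}

# Made-up toy data standing in for Titanic_Survival (assumption, not real rows)
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

missing_counts <- count_missing(toy)
print(missing_counts)
```

Because `is.na(x)` is checked first, a column containing NA values no longer produces NA for the whole sum, which is what happened to Age in the blank-string check.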
The Cabin variable has many blank values (687). We will remove this variable, along with the two observations where Embarked is blank.
The observations with a missing Age are dropped as well, since they make up less than 20% of the data. Imputing the missing values with a suitable technique would likely be better than removing the observations; we leave that as a recommendation for further work. We also remove the variables Name, PassengerId, and Ticket, although it might be interesting to analyze the passenger names against individual survival in a later analysis.
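As a rough illustration of the imputation alternative mentioned above, the sketch below fills missing ages with the median age. Median imputation is only one simple option and is not what this analysis does; the `toy` data frame and the `Age_imputed` column are made up for illustration.

```r
# Made-up ages with two missing values (assumption, not the real data)
toy <- data.frame(Age = c(22, NA, 26, 35, NA, 54))

# Median of the observed ages (NA values are ignored)
age_median <- median(toy$Age, na.rm = TRUE)

# Replace missing ages with the median; observed ages stay unchanged
toy$Age_imputed <- ifelse(is.na(toy$Age), age_median, toy$Age)

print(toy)
```

Keeping the imputed values in a separate column makes it easy to compare the plots with and without imputation.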
Titanic_new <- Titanic_Survival %>%
  dplyr::select(-c(Cabin, PassengerId, Name, Ticket)) %>%
  filter(!is.na(Age)) %>% # drop rows with missing Age
  filter(Embarked != '') # drop rows with blank Embarked
glimpse(Titanic_new)
#> Rows: 712
#> Columns: 8
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1...
#> $ Pclass <int> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3...
#> $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal...
#> $ Age <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, ...
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0...
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0...
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750...
#> $ Embarked <chr> "S", "C", "S", "S", "S", "S", "S", "S", "C", "S", "S", "S"...
colSums(is.na(Titanic_new))
#> Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked
#>        0        0        0        0        0        0        0        0
We need to convert the data to the proper types: the variables Sex and Embarked should be factors.
Titanic_clean <- Titanic_new %>%
  mutate(Sex = as.factor(Sex),
         Embarked = as.factor(Embarked))
glimpse(Titanic_clean)
#> Rows: 712
#> Columns: 8
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1...
#> $ Pclass <int> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3...
#> $ Sex <fct> male, female, female, female, male, male, male, female, fe...
#> $ Age <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, ...
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0...
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0...
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750...
#> $ Embarked <fct> S, C, S, S, S, S, S, S, C, S, S, S, S, S, S, Q, S, S, S, Q...
titanic_survived <- Titanic_clean %>%
filter(Survived == 1)
titanic_not_survived <- Titanic_clean %>%
filter(Survived == 0)
hist(titanic_survived$Age, breaks=20, col = "Green", xlab = "Age", main = "Distribution of Survived Passengers")
hist(titanic_not_survived$Age, breaks=20, col = "Red", xlab = "Age", main = "Distribution of Non-Survived Passengers")
Many young passengers survived the sinking. Saving young passengers may have been given priority during the evacuation.
Titanic_clean %>% group_by(Sex) %>% summarise(avgAge = mean(Age), stdev = sd(Age))
ggplot(data=Titanic_clean, aes(x=Age, fill=Sex)) + geom_density(alpha=0.5)
The male age distribution extends further into the range from 60 to 80, and the fraction of passengers in that range is higher for males than for females.
prop.table(table(Titanic_clean$Survived, Titanic_clean$Sex))
#>
#>         female       male
#>   0 0.08988764 0.50561798
#>   1 0.27387640 0.13061798
Male passengers were unlikely to survive the tragedy: men who died account for more than 50% of all passengers in the data.
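The table above gives each cell as a share of all passengers. To read off the survival rate within each sex, the proportions can be normalized per column instead. The sketch below is a minimal base R example; the counts are reconstructed from the proportions above multiplied by the 712 rows, and the object names are made up for illustration.

```r
# Counts reconstructed from the proportions above (proportion * 712 rows)
counts <- matrix(
  c(64, 195,   # female: died, survived
    360, 93),  # male:   died, survived
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# margin = 2 normalizes each column, giving the survival rate within each sex
rate_by_sex <- prop.table(counts, margin = 2)
print(round(rate_by_sex, 3))
```

Read this way, roughly 75% of the females in the cleaned data survived, against roughly 21% of the males.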
hist(titanic_survived$Pclass, col="green", main="Distribution of Survived Passenger Based on Class", xlab = "Class")
hist(titanic_not_survived$Pclass, col="red", main="Distribution of Died Passenger Based on Class", xlab = "Class")
Most of the passengers who died were in class 3, the lowest class, while first-class passengers made up the largest share of the survivors.
posn.j <- position_jitter(0.3, 1)
ggplot(Titanic_clean,aes(x=factor(Pclass),y=Age,col=factor(Sex)))+
geom_jitter(size=3,alpha=0.5,position=posn.j)+
facet_grid(". ~ Survived")
Among the class 3 passengers who died, males dominate, while the panel of survivors shows that females had a higher chance of living.
ggplot(data = Titanic_clean, aes( x = SibSp + Parch, fill = as.factor(Survived) ) ) +
geom_bar(position = 'dodge')
Having 1 to 3 family members on board was associated with a higher chance of survival. One plausible explanation is that small groups could organize better and find space on a lifeboat.
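The bar chart shows counts, so groups of very different sizes are hard to compare. A survival rate per family size makes the comparison more direct. The sketch below uses base R on a made-up `toy` data frame (not the real data); `FamilySize` is a hypothetical column name for `SibSp + Parch`.

```r
# Made-up rows for illustration only (assumption, not the Kaggle data)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0, 0, 1),
  SibSp    = c(0, 1, 1, 0, 2, 0, 4, 0),
  Parch    = c(0, 0, 1, 0, 0, 0, 2, 0)
)

# Number of family members aboard, as used on the x-axis of the bar chart
toy$FamilySize <- toy$SibSp + toy$Parch

# Mean of a 0/1 variable per group is the survival rate of that group
rate_by_family <- tapply(toy$Survived, toy$FamilySize, mean)
print(rate_by_family)
```

On the real data, rates for large families rest on few passengers, so they should be read with caution.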
ggplot(data = Titanic_clean , aes(x = as.factor(Pclass), y = Fare, colour = Sex)) +
geom_boxplot() + #Boxplot
scale_y_log10()
First-class passengers paid more than the other classes while, interestingly, males paid less than females in the same class.
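The boxplot comparison can be backed by a small numeric summary, such as the median fare per class and sex. The sketch below uses base R `aggregate` on a made-up `toy` data frame; the fares are invented for illustration and are not taken from the dataset.

```r
# Invented fares for illustration only (assumption, not the Kaggle data)
toy <- data.frame(
  Pclass = c(1, 1, 1, 1, 3, 3, 3, 3),
  Sex    = c("female", "female", "male", "male",
             "female", "female", "male", "male"),
  Fare   = c(80, 100, 50, 70, 10, 14, 8, 8)
)

# Median fare for every combination of class and sex
median_fare <- aggregate(Fare ~ Pclass + Sex, data = toy, FUN = median)
print(median_fare)
```

The median is used rather than the mean because fares are strongly skewed, which is also why the plot uses a log scale.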
posn.t <- position_jitter(0.2, 1)
ggplot(Titanic_clean,aes(x=factor(Embarked),y=Age,col=factor(Sex)))+
geom_jitter(size=3,alpha=0.5,position=posn.t)+
facet_grid(". ~ Survived")
Most passengers embarked at Southampton, while the fewest came from Queenstown. The females who embarked at Cherbourg mostly survived the tragedy, while many males who embarked at Southampton died in the disaster.
Data visualization is an excellent way to inform the reader about what is in the data, since telling a story with graphics is more potent than presenting a block of numbers. From the plots above, we see that several factors affected the passengers' chances of survival. Female passengers had a higher chance of surviving than males, and the age distributions suggest that younger passengers were given priority for the lifeboats. We also saw that first-class passengers tended to have a better chance of surviving than the other two classes. Travelling with a small family group (1 to 3 members) was associated with better survival as well. Most passengers embarked at Southampton, while the females who embarked at Cherbourg mostly survived. It is interesting that male passengers paid less than females in the same class; further analysis might reveal why.