Brief History of the Titanic Disaster
The sinking of the Titanic is one of the most infamous shipwrecks in history.
Late on Sunday, April 14, 1912, during her maiden voyage, the RMS Titanic, widely considered unsinkable, struck an iceberg and sank in the early hours of April 15. The Titanic’s distress signals were heard by a nearby ship. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.
Federal law soon required all large ocean-going vessels to be equipped with wireless for safety reasons. David Sarnoff noted that the Titanic disaster brought radio to the front.
Preprocessing of Passenger Data
The data comes from Kaggle. The goal of this project is to explore train.csv, which will also support the next project's analysis of what sorts of people were likely to survive.
# Load CSV file
titanic_data <- read.csv("dataInputs/train.csv")
str(titanic_data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
Check missing values (empty or NA)
# Check missing values using melt() from the "reshape" library
library(reshape)
missing_data <- melt(apply(titanic_data[, -2], 2, function(x) sum(is.na(x) | x=="")))
cbind(row.names(missing_data)[missing_data$value>0], missing_data[missing_data$value>0,])
## [,1] [,2]
## [1,] "Age" "177"
## [2,] "Cabin" "687"
## [3,] "Embarked" "2"
- Missing values in Age will be replaced with 0 and later categorised as “Uncategorised”.
- Cabin is missing in about 77% of rows (687 of 891). We won’t fix this variable; it is dropped during clean-up.
- Missing values in Embarked will be replaced with the most common value.
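The missing-value check above can also be sketched in base R, without the reshape dependency, and extended to report the share of missing entries per column. This is a minimal sketch on a made-up toy data frame; `toy` and the helper `count_missing()` are hypothetical names that do not appear in the original analysis.

```r
# Toy data frame standing in for train.csv (values are made up for illustration)
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count entries that are NA or empty strings in each column
count_missing <- function(df) {
  sapply(df, function(x) sum(is.na(x) | x == ""))
}

missing_n   <- count_missing(toy)
missing_pct <- round(100 * missing_n / nrow(toy), 1)

# Keep only the columns that have at least one missing entry
data.frame(missing = missing_n, percent = missing_pct)[missing_n > 0, ]
```

Applied to the counts reported above, the same arithmetic gives 687 / 891, or roughly 77%, for Cabin.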
Clean-up data
titanic_data$Surviveddetailed[titanic_data$Survived == 0] <- "No"
titanic_data$Surviveddetailed[titanic_data$Survived == 1] <- "Yes"
titanic_data$Sex[titanic_data$Sex == "male"] <- "Male"
titanic_data$Sex[titanic_data$Sex == "female"] <- "Female"
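# Convert column types with dplyr, then drop PassengerId (column 1) and Cabin (column 11)
library(dplyr)
titanic_data <- titanic_data %>%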
mutate(
Age = as.integer(Age),
Embarked = as.factor(Embarked),
Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch),
Surviveddetailed = as.factor(Surviveddetailed)) %>%
select(-c(1,11))
Update missing data
# Update missing Age data to 0
titanic_data$Age[which(is.na(titanic_data$Age) | titanic_data$Age=="")] <- 0
# Check the most common value of Embarked (it is S)
table(titanic_data$Embarked)
##
## C Q S
## 2 168 77 644
# Update missing Embarked data
titanic_data$Embarked[which(is.na(titanic_data$Embarked) | titanic_data$Embarked=="")] <- 'S'
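The two steps above (find the most common value, then assign it to the missing entries) can be folded into one reusable function. The sketch below is an assumption-laden illustration: `impute_mode()` is a hypothetical helper and `embarked_toy` is made-up data, not the real Embarked column.

```r
# Hypothetical helper: replace NA or empty entries with the most frequent value
impute_mode <- function(x) {
  x <- as.character(x)
  is_missing <- is.na(x) | x == ""
  # which.max() picks the first value in case of ties
  mode_value <- names(which.max(table(x[!is_missing])))
  x[is_missing] <- mode_value
  factor(x)
}

# Toy vector (made up for illustration)
embarked_toy <- c("S", "C", "", "S", NA, "Q", "S")
impute_mode(embarked_toy)
```

Because the result is rebuilt with factor(), the empty-string level disappears instead of lingering as an unused level.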
Create new values for “Age Category” and “Title of Passenger”
# Create a new column to categorise Age. Missing ages (stored as 0) become "Uncategorised";
# ages below 1 were truncated to 0 by as.integer() earlier and land in the same category
titanic_data$Age_Category[titanic_data$Age < 1] <- "Uncategorised"
titanic_data$Age_Category[titanic_data$Age > 0 & titanic_data$Age <=14] <- "Children"
titanic_data$Age_Category[titanic_data$Age >=15 & titanic_data$Age <=24] <- "Youth"
titanic_data$Age_Category[titanic_data$Age >=25 & titanic_data$Age <=64] <- "Adults"
titanic_data$Age_Category[titanic_data$Age >= 65] <- "Seniors"
titanic_data$Age_Category <- as.factor(titanic_data$Age_Category)
levels(titanic_data$Age_Category)
## [1] "Adults" "Children" "Seniors" "Uncategorised"
## [5] "Youth"
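The five assignments above can be expressed with a single call to cut(), which bins a numeric vector by break points. This is a sketch under the assumption that ages are whole numbers, as they are after as.integer(); `categorise_age()` and `ages_toy` are hypothetical names introduced for illustration.

```r
# Hypothetical helper: bin integer ages into the categories used in this report
categorise_age <- function(age) {
  cut(age,
      breaks = c(-Inf, 0, 14, 24, 64, Inf),   # each interval is (lower, upper]
      labels = c("Uncategorised", "Children", "Youth", "Adults", "Seniors"))
}

ages_toy <- c(0, 2, 14, 15, 24, 25, 64, 65, 80)   # made-up example ages
data.frame(age = ages_toy, category = categorise_age(ages_toy))
```

One difference from the code above: cut() keeps the levels in the order of the labels rather than alphabetically, which may be the more natural order for plots.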
# Create a new column holding the title of each passenger (the text between ", " and ".")
titanic_data$Titles <- regmatches(as.character(titanic_data$Name),regexpr("\\,[A-Za-z ]{1,20}\\.", as.character(titanic_data$Name)))
titanic_data$Titles <- unlist(lapply(titanic_data$Titles,FUN=function(x) substr(x, 3, nchar(x)-1)))
# Group rare titles under "Others"
titanic_data$Titles <- gsub("(Dr|Rev|Col|Major|the Countess|Sir|Jonkheer|Lady|Capt|Don)", "Others", titanic_data$Titles)
titanic_data$Titles <- gsub("(Ms|Mlle)", "Miss", titanic_data$Titles)
titanic_data$Titles[titanic_data$Titles == "Mme"] <- "Mrs"
titanic_data$Titles[titanic_data$Titles == "Othersl"] <- "Others"
titanic_data$Titles[titanic_data$Titles == "the Others"] <- "Others"
titanic_data$Titles <- as.factor(titanic_data$Titles)
levels(titanic_data$Titles)
## [1] "Master" "Miss" "Mr" "Mrs" "Others"
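As an alternative to pattern-based replacement, titles can be extracted with one sub() call and then grouped by exact match, which avoids partial hits inside longer titles. The sketch below is illustrative only: the first two names come from the data shown earlier, the remaining names are made up, and `names_toy`, `synonyms` and `common` are hypothetical names.

```r
# Names in the "Surname, Title. Given names" layout of the Name column
names_toy <- c("Braund, Mr. Owen Harris",
               "Heikkinen, Miss. Laina",
               "Smith, Col. John",                 # made up
               "Doe, the Countess. of Example",    # made up
               "Roe, Mlle. Anne")                  # made up

# Capture the text between ", " and the first "."
titles_toy <- sub("^[^,]*, ([^.]+)\\..*$", "\\1", names_toy)

# Map spelling variants onto the common titles by exact match
synonyms <- c(Ms = "Miss", Mlle = "Miss", Mme = "Mrs")
hit <- titles_toy %in% names(synonyms)
titles_toy[hit] <- synonyms[titles_toy[hit]]

# Everything that is still not a common title becomes "Others"
common <- c("Master", "Miss", "Mr", "Mrs")
titles_toy[!titles_toy %in% common] <- "Others"
titles_toy
```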
# Update embarked location
switch.location <- function(x){
y <- switch(as.character(x),
"C" = "Cherbourg",
"Q" = "Queenstown",
"S" = "Southampton")
return(y)
}
titanic_data$Embarked <- as.factor(sapply(titanic_data$Embarked, FUN = switch.location))
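The same recoding can be sketched with a named character vector used as a lookup table, which is vectorised and needs no sapply() call. `port_names`, `codes_toy` and `ports_toy` are hypothetical names, and the sample codes are made up.

```r
# Lookup table from port code to port name (same mapping as switch.location)
port_names <- c(C = "Cherbourg", Q = "Queenstown", S = "Southampton")

codes_toy <- factor(c("S", "C", "S", "Q"))   # made-up sample of Embarked codes

# Index by the character value, not by the factor's internal integer codes
ports_toy <- factor(unname(port_names[as.character(codes_toy)]))
ports_toy
```

The as.character() call matters: indexing a vector with a factor directly would use the factor's integer codes instead of its labels.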
Ensure the data is as wanted
# Check N/A, NULL or Empty values
colSums(is.na(titanic_data))
## Survived Pclass Name Sex
## 0 0 0 0
## Age SibSp Parch Ticket
## 0 0 0 0
## Fare Embarked Surviveddetailed Age_Category
## 0 0 0 0
## Titles
## 0
# Check structure of each column
str(titanic_data)
## 'data.frame': 891 obs. of 13 variables:
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 0 54 2 27 14 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked : Factor w/ 3 levels "Cherbourg","Queenstown",..: 3 1 3 3 3 2 3 3 3 1 ...
## $ Surviveddetailed: Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
## $ Age_Category : Factor w/ 5 levels "Adults","Children",..: 5 1 1 1 1 4 1 2 1 2 ...
## $ Titles : Factor w/ 5 levels "Master","Miss",..: 3 4 2 4 3 3 3 1 4 4 ...
Passenger Information at a Glance
Data information
Columns after preprocessing:
- Survived = Survival (0 = No; 1 = Yes)
- Pclass = Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Name = Name
- Sex = Sex
- Age = Age in whole years (0 = missing)
- SibSp = Number of Siblings/Spouses Aboard
- Parch = Number of Parents/Children Aboard
- Ticket = Ticket Number
- Fare = Passenger Fare
- Embarked = Port of Embarkation (Cherbourg, Queenstown or Southampton)
- Surviveddetailed = Survival as a label (No or Yes)
- Age_Category = Age Category of Passenger
- Titles = Title of Passenger
# Number of passengers
nrow(titanic_data)
## [1] 891
# Summary of the passenger data
summary(titanic_data)
## Survived Pclass Name Sex Age SibSp
## 0:549 1:216 Length:891 Female:314 Min. : 0.00 0:608
## 1:342 2:184 Class :character Male :577 1st Qu.: 6.00 1:209
## 3:491 Mode :character Median :24.00 2: 28
## Mean :23.78 3: 16
## 3rd Qu.:35.00 4: 18
## Max. :80.00 5: 5
## 8: 7
## Parch Ticket Fare Embarked Surviveddetailed
## 0:678 Length:891 Min. : 0.00 Cherbourg :168 No :549
## 1:118 Class :character 1st Qu.: 7.91 Queenstown : 77 Yes:342
## 2: 80 Mode :character Median : 14.45 Southampton:646
## 3: 5 Mean : 32.20
## 4: 4 3rd Qu.: 31.00
## 5: 5 Max. :512.33
## 6: 1
## Age_Category Titles
## Adults :425 Master: 40
## Children : 71 Miss :185
## Seniors : 11 Mr :517
## Uncategorised:184 Mrs :126
## Youth :200 Others: 23
##
##
# Take a glimpse of the data
head(titanic_data, n=1)
## Survived Pclass Name Sex Age SibSp Parch Ticket Fare
## 1 0 3 Braund, Mr. Owen Harris Male 22 1 0 A/5 21171 7.25
## Embarked Surviveddetailed Age_Category Titles
## 1 Southampton No Youth Mr
# Show a comparison between male and female passengers using a pie chart (PieChart() from lessR)
library(lessR)
passenger_sex <- data.frame(sex = titanic_data$Sex)
PieChart(sex, hole = 0,
values = "%",
data = passenger_sex,
fill = c("#B9F3FC", "#93C6E7"),
color = "black",
values_size=getOption("10"),
main = "Gender Comparison"
)
There were 891 passengers (577 males and 314 females) in this data; 342 of
them survived the disaster and 549 did not. The data gives other insights
too. For instance, the first passenger listed is Mr. Owen Harris Braund,
who was 22 years old when he died on the Titanic. About 65% of the
passengers were male and 35% were female.
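The percentages quoted here follow directly from the counts in the summary() output. As a quick check, prop.table() turns counts into shares; the vector names below are hypothetical, the numbers are the ones reported above.

```r
# Counts taken from the summary() output above
sex_counts      <- c(Male = 577, Female = 314)
survival_counts <- c(No = 549, Yes = 342)

# prop.table() converts counts to proportions that sum to 1
sex_share      <- round(100 * prop.table(sex_counts), 1)
survival_share <- round(100 * prop.table(survival_counts), 1)

sex_share        # Male 64.8, Female 35.2
survival_share   # No 61.6, Yes 38.4
```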
Passengers data statistics and numbers
Age distribution
# Remove age = 0
hist_age <- titanic_data[titanic_data$Age != 0, ]
# Use histogram to see the Age distribution on passengers data
hist(hist_age$Age,
breaks=30,
col = "#93C6E7",
main = "Age Distribution of Titanic Passengers",
xlab = "Age Range",
ylab = "Freq")
# Use a box plot to see the outliers in the passenger ages
boxplot(titanic_data$Age,
col = "#93C6E7",
main = "Age of Passengers",
xlab = NULL,
ylab = "Age")
# Central tendency of age
median(titanic_data$Age)
## [1] 24
The histogram and box plot show that many of the passengers on the Titanic were in the age range of 20-35 years. The “Age” data also has outliers and a right-skewed distribution.
Given the skew, we use the median rather than the mean to measure central tendency, which gives 24 years. Note that median() here runs on the full Age column, including the 184 placeholder zeros, so the median of the known ages alone would be higher.
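To illustrate how placeholder zeros and a single large value affect the two measures, here is a small sketch on made-up ages. `age_toy` is a hypothetical vector and none of its values come from the Titanic data.

```r
# Made-up ages; 0 marks a missing age, as in the preprocessing above
age_toy <- c(0, 0, 0, 19, 22, 28, 30, 35, 41, 80)

known <- age_toy[age_toy != 0]

median(age_toy)   # placeholders included: 25
median(known)     # placeholders excluded: 30
mean(known)       # about 36.4, pulled up by the single value of 80
```

The median barely reacts to the outlier of 80, which is why it suits skewed data, but it does shift when placeholder zeros are left in.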
Numbers of passengers
# Number of passengers based on age category
table(titanic_data$Age_Category)
##
## Adults Children Seniors Uncategorised Youth
## 425 71 11 184 200
# Number of passengers based on passengers class
table(titanic_data$Pclass)
##
## 1 2 3
## 216 184 491
# Number of passengers based on the port passengers embarked
table(titanic_data$Embarked)
##
## Cherbourg Queenstown Southampton
## 168 77 646
# Number of passengers based on titles
table(titanic_data$Titles)
##
## Master Miss Mr Mrs Others
## 40 185 517 126 23
From the data above, we can see:
- Most passengers were Adults (425), followed by Youth (200), Uncategorised (184), Children (71) and Seniors (11). “Uncategorised” means the age is recorded as 0: either it was missing (177 passengers) or it was below 1 and truncated to 0 by as.integer() (the remaining 7).
- Most passengers travelled in 3rd class (491). Perhaps surprisingly, more travelled in 1st class (216) than in 2nd (184).
- About 73% of passengers (646 of 891) embarked at Southampton, followed by Cherbourg (168) and Queenstown (77).
- Lastly, there were 517 passengers titled Mr, 126 Mrs (married women), 185 Miss (unmarried women), 40 Master and 23 with other titles. Master is a title for an underage male: it is used for a person under 18, and Mr once he turns 18 and enters adulthood. “Others” groups the small number of remaining titles.
Details of survivors
Based on Age
# Histogram of age split by survival (ggplot2); Age 0 marks a missing age
library(ggplot2)
plot_age <- ggplot(data = titanic_data, aes(x = Age, fill = Survived)) +
geom_histogram() +
scale_fill_manual(values = c("#93C6E7", "#B9F3FC")) +
labs(title = NULL,
y = "Passenger Count",
x = "Passenger Age",
fill = "Survived?",
caption = "Source : Kaggle") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
plot.subtitle = element_text(hjust = 0.5, face = "italic"),
legend.position = "bottom",
plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
axis.title = element_text(face = "bold.italic", size = rel(0.85)))
plot_age
Based on passenger class
plot_pclass <- ggplot(data = titanic_data, aes(x = Pclass, fill = Surviveddetailed)) +
geom_bar(width = 0.4) +
scale_fill_manual(values = c("#93C6E7", "#B9F3FC")) +
stat_count(aes(label = after_stat(count)), geom = "text", position = position_stack(vjust = 0.5), show.legend = FALSE) +
labs(title = "Survivor Number by Passenger Class",
y = "Count",
x = "Passenger Class",
fill = "Survived?",
caption = "Source : Kaggle") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
plot.subtitle = element_text(hjust = 0.5, face = "italic"),
legend.position = "bottom",
plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
axis.title = element_text(face = "bold.italic", size = rel(0.85)))
plot_pclass
First class had the most survivors (136 people), followed by 3rd class (119 people) and 2nd class (87 people).
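Raw survivor counts depend on how many passengers each class carried, so it can help to look at the survival rate as well. The sketch below only reuses the counts reported above; `passengers`, `survivors` and `survival_rate` are hypothetical names.

```r
# Passengers per class from table(titanic_data$Pclass); survivors from the text above
passengers <- c("1" = 216, "2" = 184, "3" = 491)
survivors  <- c("1" = 136, "2" = 87,  "3" = 119)

# Share of each class that survived, in percent
survival_rate <- round(100 * survivors / passengers, 1)
survival_rate   # 1st 63.0, 2nd 47.3, 3rd 24.2
```

By rate, 3rd class comes last even though it has the second-highest survivor count.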
Based on embarked port
plot_embarked <- ggplot(data = titanic_data, aes(x = Embarked, fill = Surviveddetailed)) +
geom_bar(width = 0.4) +
scale_fill_manual(values = c("#93C6E7", "#B9F3FC")) +
stat_count(aes(label = after_stat(count)), geom = "text", position = position_stack(vjust = 0.5), show.legend = FALSE) +
labs(title = "Survivor Number by Embarked Port",
y = "Count",
x = "Embarked Port",
fill = "Survived?",
caption = "Source : Kaggle") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
plot.subtitle = element_text(hjust = 0.5, face = "italic"),
legend.position = "bottom",
plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
axis.title = element_text(face = "bold.italic", size = rel(0.85)))
plot_embarked
Southampton had the most survivors (219 people), followed by Cherbourg (93 people) and Queenstown (30 people).
Based on sex
plot_gender <- ggplot(data = titanic_data, aes(x = Sex, fill = Surviveddetailed)) +
geom_bar(width = 0.4) +
scale_fill_manual(values = c("#93C6E7", "#B9F3FC")) +
stat_count(aes(label = after_stat(count)), geom = "text", position = position_stack(vjust = 0.5), show.legend = FALSE) +
labs(title = "Survivor Number by Gender",
y = "Count",
x = "Gender",
fill = "Survived?",
caption = "Source : Kaggle") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
plot.subtitle = element_text(hjust = 0.5, face = "italic"),
legend.position = "bottom",
plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
axis.title = element_text(face = "bold.italic", size = rel(0.85)))
plot_gender
Most of the survivors were female (233 people); only 109 male passengers survived.
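The counts behind a chart like this can be read from a two-way table, and prop.table() with margin = 1 turns each row into percentages. This is a sketch on a made-up toy data frame; `toy`, `counts` and `rates` are hypothetical names.

```r
# Made-up rows for illustration, not the real passenger data
toy <- data.frame(
  Sex      = c("Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male"),
  Survived = c("Yes", "Yes", "No", "No", "No", "Yes", "No", "No")
)

# Two-way table of counts: rows are Sex, columns are Survived
counts <- table(toy$Sex, toy$Survived)

# margin = 1 makes each row sum to 100%
rates <- round(100 * prop.table(counts, margin = 1), 1)
rates
```

With the real data the equivalent call would be table(titanic_data$Sex, titanic_data$Surviveddetailed); from the counts above, 233 of 314 women (about 74%) and 109 of 577 men (about 19%) survived.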
Based on age category
plot_agecat <- ggplot(data = titanic_data, aes(x = Age_Category, fill = Surviveddetailed)) +
geom_bar(width = 0.4) +
scale_fill_manual(values = c("#93C6E7", "#B9F3FC")) +
stat_count(aes(label = after_stat(count)), geom = "text", position = position_stack(vjust = 0.5), show.legend = FALSE) +
labs(title = "Survivor Number by Age Category",
y = "Count",
x = "Age Category",
fill = "Survived?",
caption = "Source : Kaggle") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
plot.subtitle = element_text(hjust = 0.5, face = "italic"),
legend.position = "bottom",
plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
axis.title = element_text(face = "bold.italic", size = rel(0.85)))
plot_agecat
Most of the survivors were adults (171 people); seniors had the fewest, with only 1 survivor.
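The four bar charts above differ only in the grouping column, the title and the x-axis label, so the shared code could be wrapped in one function. The sketch below assumes ggplot2 3.3.0 or later (for after_stat()); `plot_survivors()` is a hypothetical helper and `toy` is made-up data.

```r
library(ggplot2)

# Hypothetical helper: stacked bar chart of survival for one grouping column
plot_survivors <- function(data, column, title, xlab) {
  ggplot(data, aes(x = .data[[column]], fill = .data[["Surviveddetailed"]])) +
    geom_bar(width = 0.4) +
    scale_fill_manual(values = c("#93C6E7", "#B9F3FC")) +
    # Label each stacked segment with its count
    stat_count(aes(label = after_stat(count)), geom = "text",
               position = position_stack(vjust = 0.5), show.legend = FALSE) +
    labs(title = title, x = xlab, y = "Count",
         fill = "Survived?", caption = "Source : Kaggle") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face = "bold"),
          legend.position = "bottom",
          plot.caption = element_text(hjust = 0.5, size = rel(0.8)),
          axis.title = element_text(face = "bold.italic", size = rel(0.85)))
}

# Made-up rows for illustration
toy <- data.frame(
  Sex = factor(c("Female", "Female", "Male", "Male", "Male")),
  Surviveddetailed = factor(c("Yes", "No", "No", "No", "Yes"))
)
p <- plot_survivors(toy, "Sex", "Survivor Number by Gender", "Gender")
p
```

With the real data, a call such as plot_survivors(titanic_data, "Pclass", "Survivor Number by Passenger Class", "Passenger Class") would stand in for one of the blocks above.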
Conclusion
Here are the main insights from this subset of the Titanic passenger data:
- There were 891 passengers (577 males and 314 females) in this data; only 342 of them survived the disaster. About 65% of the passengers were male and 35% female, yet most of the survivors were female: 233 women survived against only 109 men.
- The median age in the data is 24 years (computed with missing ages stored as 0).
- Most passengers were Adults (425 people). Adults also had the most survivors (171 people), while Seniors had the fewest with only 1 survivor.
- Most passengers travelled in 3rd class, but 1st class had the most survivors (136 people), followed by 3rd class (119 people) and 2nd class (87 people).
- About 73% of passengers embarked at Southampton, followed by Cherbourg and Queenstown. Southampton also had the most survivors (219 people), followed by Cherbourg (93 people) and Queenstown (30 people).
References
- https://www.noaa.gov/gc-international-section/rms-titanic-history-and-significance
- https://www.kaggle.com/competitions/titanic/overview/description
- https://www.kaggle.com/code/alexisbcook/titanic-tutorial/notebook
- https://www.statcan.gc.ca/en/concepts/definitions/age2
- https://stackoverflow.com/questions/68996121/categorize-age-in-r