This project analyzes the Titanic dataset to understand survival patterns among passengers, focusing on passenger class, gender, age, fare, and embarkation point. Statistical summaries and visualizations are used to show how each of these variables relates to the chance of survival.
library(dplyr)
library(ggplot2)
titanic <- read.csv("Titanic.csv", stringsAsFactors = FALSE)
head(titanic)
## Unnamed..0 survived pclass sex age sibsp parch fare embarked class
## 1 0 0 3 male 22 1 0 7.2500 S Third
## 2 1 1 1 female 38 1 0 71.2833 C First
## 3 2 1 3 female 26 0 0 7.9250 S Third
## 4 3 1 1 female 35 1 0 53.1000 S First
## 5 4 0 3 male 35 0 0 8.0500 S Third
## 6 5 0 3 male NA 0 0 8.4583 Q Third
## who adult_male deck embark_town alive alone
## 1 man True Southampton no False
## 2 woman False C Cherbourg yes False
## 3 woman False Southampton yes True
## 4 woman False C Southampton yes False
## 5 man True Southampton no True
## 6 man True Queenstown no True
tc <- titanic
tc[tc == ""] <- NA
str(tc)
## 'data.frame': 1000 obs. of 16 variables:
## $ Unnamed..0 : int 0 1 2 3 4 5 6 7 8 9 ...
## $ survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ sex : chr "male" "female" "female" "female" ...
## $ age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ sibsp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ embarked : chr "S" "C" "S" "S" ...
## $ class : chr "Third" "First" "Third" "First" ...
## $ who : chr "man" "woman" "woman" "woman" ...
## $ adult_male : chr "True" "False" "False" "False" ...
## $ deck : chr NA "C" NA "C" ...
## $ embark_town: chr "Southampton" "Cherbourg" "Southampton" "Southampton" ...
## $ alive : chr "no" "yes" "yes" "yes" ...
## $ alone : chr "False" "False" "True" "False" ...
summary(tc)
## Unnamed..0 survived pclass sex
## Min. : 0.0 Min. :0.000 Min. :1.000 Length:1000
## 1st Qu.:249.8 1st Qu.:0.000 1st Qu.:2.000 Class :character
## Median :499.5 Median :0.000 Median :3.000 Mode :character
## Mean :499.5 Mean :0.392 Mean :2.315
## 3rd Qu.:749.2 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :999.0 Max. :1.000 Max. :3.000
##
## age sibsp parch fare
## Min. : 0.42 Min. :0.000 Min. :0.00 Min. : 0.000
## 1st Qu.:20.00 1st Qu.:0.000 1st Qu.:0.00 1st Qu.: 7.896
## Median :28.00 Median :0.000 Median :0.00 Median : 14.068
## Mean :29.59 Mean :0.518 Mean :0.38 Mean : 31.708
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.00 3rd Qu.: 30.696
## Max. :80.00 Max. :8.000 Max. :6.00 Max. :512.329
## NA's :197
## embarked class who adult_male
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## deck embark_town alive alone
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
dim(tc)
## [1] 1000 16
colSums(is.na(tc))
## Unnamed..0 survived pclass sex age sibsp
## 0 0 0 0 197 0
## parch fare embarked class who adult_male
## 0 0 2 0 0 0
## deck embark_town alive alone
## 769 2 0 0
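The blank strings could also be converted to NA while the file is read, so that every later query works on the same cleaned data. The sketch below assumes that empty fields are the only missing-value marker; the inline CSV text is a hypothetical three-row stand-in for Titanic.csv, not real rows from it.

```r
# Hypothetical stand-in for Titanic.csv; only the blank cells matter here.
csv_text <- "survived,age,embarked,deck
0,22,S,
1,38,C,C
1,,,"

# na.strings makes read.csv treat empty fields as NA while reading.
sample_df <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                      na.strings = c("", "NA"))

# Count and percentage of missing values per column.
missing_count <- colSums(is.na(sample_df))
missing_share <- round(100 * colMeans(is.na(sample_df)), 1)
print(missing_count)
print(missing_share)
```

Passing the same `na.strings` argument when reading the real file would remove the need for a separate cleaned copy.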
titanic %>% count(pclass)
## pclass n
## 1 1 241
## 2 2 203
## 3 3 556
Third class accounts for 556 of the 1,000 passengers (55.6%), more
than first class (241) and second class (203) combined.
Lower-class passengers therefore made up the majority of the
population on board.
titanic %>% count(sex)
## sex n
## 1 female 352
## 2 male 648
Male passengers (648, or 64.8%) outnumbered female passengers
(352, or 35.2%) by almost two to one.
This imbalance matters when reading the survival counts later: raw
counts favor the larger group, so rates are the fairer comparison.
titanic %>%
  summarise(
    Min_Age = min(age, na.rm = TRUE),
    Max_Age = max(age, na.rm = TRUE),
    Avg_Age = mean(age, na.rm = TRUE)
  )
## Min_Age Max_Age Avg_Age
## 1 0.42 80 29.58851
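Ages range from 0.42 years (about five months) to 80 years, so the
passengers span infants to the elderly.
The mean age of 29.6 is computed over the 803 passengers with a
recorded age (197 values are missing) and suggests that the typical
passenger was a young adult.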
titanic %>% count(survived)
## survived n
## 1 0 608
## 2 1 392
Of the 1,000 passengers, 608 (60.8%) died and 392 (39.2%) survived.
Fewer than two in five passengers survived, which reflects the scale
of the disaster.
titanic %>%
  group_by(pclass, survived) %>%
  summarise(count = n(), .groups = "drop")
## # A tibble: 6 × 3
## pclass survived count
## <int> <int> <int>
## 1 1 0 89
## 2 1 1 152
## 3 2 0 103
## 4 2 1 100
## 5 3 0 416
## 6 3 1 140
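First class had both the most survivors (152) and the highest
survival rate: 152 of 241 (63.1%), compared with 100 of 203 (49.3%)
in second class and 140 of 556 (25.2%) in third class.
This is consistent with higher-class passengers having better access
to safety measures, although the counts alone do not establish the
cause.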
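Because the classes differ in size, a survival rate per class is easier to compare than the raw counts. The sketch below rebuilds one row per passenger from the counts printed above and takes the mean of `survived` within each class; the names `class_counts`, `passengers`, and `rate_by_class` are illustrative and not part of the original analysis.

```r
# Counts copied from the pclass x survived table above.
class_counts <- data.frame(
  pclass   = c(1, 1, 2, 2, 3, 3),
  survived = c(0, 1, 0, 1, 0, 1),
  count    = c(89, 152, 103, 100, 416, 140)
)

# Expand to one row per passenger so that mean(survived) is the survival rate.
row_index  <- rep(seq_len(nrow(class_counts)), class_counts$count)
passengers <- class_counts[row_index, c("pclass", "survived")]

# Survival rate per class, in percent.
rate_by_class <- tapply(passengers$survived, passengers$pclass, mean)
print(round(100 * rate_by_class, 1))  # 1st: 63.1, 2nd: 49.3, 3rd: 25.2
```

On the full data frame the same idea would be `tapply(titanic$survived, titanic$pclass, mean)`, since `survived` is coded 0/1.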
titanic %>%
  group_by(sex, survived) %>%
  summarise(count = n(), .groups = "drop")
## # A tibble: 4 × 3
## sex survived count
## <chr> <int> <int>
## 1 female 0 90
## 2 female 1 262
## 3 male 0 518
## 4 male 1 130
Females had a much higher survival rate than males: 262 of 352 women
survived (74.4%), compared with 130 of 648 men (20.1%).
This is consistent with women being given priority during rescue
operations.
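The rates can be read directly from a two-way table of the counts. The sketch below enters the four counts from the output above as a matrix and uses `prop.table()` with `margin = 1` to get the share of each outcome within each sex; the object names are illustrative.

```r
# Counts copied from the sex x survived table above.
sex_survival <- matrix(
  c(90, 262,
    518, 130),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("female", "male"), survived = c("0", "1"))
)

# margin = 1 divides each row by its row total, giving within-sex shares.
row_share <- prop.table(sex_survival, margin = 1)
print(round(100 * row_share, 1))
```

Each row of the result sums to 100, so the `1` column is the survival rate for that sex.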
titanic_age <- titanic %>%
  mutate(Age_Group = case_when(
    age < 18 ~ "Child",
    age >= 18 & age < 60 ~ "Adult",
    age >= 60 ~ "Senior",
    TRUE ~ "Unknown"
  ))
titanic_age %>% count(Age_Group)
## Age_Group n
## 1 Adult 643
## 2 Child 130
## 3 Senior 30
## 4 Unknown 197
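Most passengers with a recorded age are adults (643), with far fewer
children (130) and seniors (30).
The 197 passengers in the “Unknown” group are those whose age is
missing, so they cannot be assigned to an age group.
Among passengers of known age, the majority were working-age
individuals.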
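The same grouping can be sketched in base R with `cut()`, which makes the interval boundaries explicit. This is an alternative to the `case_when()` call, not the method used in the analysis; the `ages` vector is a hypothetical sample chosen to hit each boundary, and the sketch assumes ages are never negative.

```r
# Hypothetical ages chosen to sit on and around each boundary.
ages <- c(0.42, 17, 18, 59, 60, 80, NA)

# right = FALSE gives the intervals [0, 18), [18, 60), [60, Inf).
age_group <- as.character(
  cut(ages, breaks = c(0, 18, 60, Inf), right = FALSE,
      labels = c("Child", "Adult", "Senior"))
)

# cut() returns NA for a missing age; label it explicitly.
age_group[is.na(age_group)] <- "Unknown"
print(table(age_group))
```

With `right = FALSE`, a passenger aged exactly 18 is an adult and one aged exactly 60 is a senior, matching the conditions in the original code.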
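titanic_rule <- titanic %>%
  mutate(Category = case_when(
    sex == "female" ~ "Female",      # all females, whatever their age
    age < 18        ~ "Child",       # males under 18 (females matched above)
    TRUE            ~ "Adult Male"   # remaining males, including missing ages
  ))
titanic_rule %>%
  group_by(Category, survived) %>%
  summarise(Count = n(), .groups = "drop")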
## # A tibble: 6 × 3
## Category survived Count
## <chr> <int> <int>
## 1 Adult Male 0 473
## 2 Adult Male 1 106
## 3 Child 0 45
## 4 Child 1 24
## 5 Female 0 90
## 6 Female 1 262
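Females (262 of 352, 74.4%) and boys under 18 (24 of 69, 34.8%) both
survived at higher rates than adult males (106 of 579, 18.3%).
Note that “Child” here contains only male children, because females
of every age are matched first, and that males with a missing age
fall into “Adult Male”.
The pattern is consistent with the “women and children first” rescue
policy, with a much larger advantage for women than for boys.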
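A comparison with a missing age, such as `NA < 18`, evaluates to NA rather than TRUE or FALSE, so males of unknown age never match the child condition and end up in the catch-all branch. The sketch below is one possible way to keep them separate, written in base R with a hypothetical helper `categorise()` and made-up inputs; it is not part of the original analysis.

```r
# Hypothetical helper: same rule as above, but with missing ages kept apart.
categorise <- function(sex, age) {
  ifelse(sex == "female", "Female",
    ifelse(is.na(age), "Male, age unknown",
      ifelse(age < 18, "Male child", "Adult male")))
}

# Made-up inputs covering each branch.
sex <- c("female", "male", "male", "male")
age <- c(NA, 10, 40, NA)

category <- categorise(sex, age)
print(category)
```

Checking `is.na(age)` before `age < 18` is what keeps the unknown ages out of the adult group.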
titanic %>%
  group_by(survived) %>%
  summarise(Avg_Fare = mean(fare, na.rm = TRUE))
## # A tibble: 2 × 2
## survived Avg_Fare
## <int> <dbl>
## 1 0 21.8
## 2 1 47.0
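Survivors paid a mean fare of 47.0, more than twice the 21.8 paid by
those who died.
Fare largely tracks passenger class, so this probably reflects the
same class advantage seen earlier rather than a separate effect:
wealthier passengers may have had access to safer locations or
resources.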
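The summary output shows that fare is strongly right-skewed (median 14.07, mean 31.71, maximum 512.33), so a handful of very expensive tickets can pull the group means upward. Comparing medians alongside means is one way to check this. The sketch below uses only the six rows printed by `head(titanic)`, so its numbers illustrate the technique and say nothing about the full dataset.

```r
# The six rows shown by head(titanic): survival flag and fare.
first_six <- data.frame(
  survived = c(0, 1, 1, 1, 0, 0),
  fare     = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)
)

# Mean and median fare per survival group.
fare_mean   <- aggregate(fare ~ survived, data = first_six, FUN = mean)
fare_median <- aggregate(fare ~ survived, data = first_six, FUN = median)
print(fare_mean)
print(fare_median)
```

Replacing `first_six` with the full data frame would give the medians to set beside the means reported above.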
titanic %>%
  group_by(embarked, survived) %>%
  summarise(count = n(), .groups = "drop")
## # A tibble: 7 × 3
## embarked survived count
## <chr> <int> <int>
## 1 "" 1 2
## 2 "C" 0 80
## 3 "C" 1 107
## 4 "Q" 0 51
## 5 "Q" 1 33
## 6 "S" 0 477
## 7 "S" 1 250
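Survival rates differed by embarkation port: 107 of 187 passengers
from Cherbourg (C) survived (57.2%), compared with 33 of 84 from
Queenstown (Q) (39.3%) and 250 of 727 from Southampton (S) (34.4%).
The blank row is the two passengers with no recorded port, both of
whom survived; it appears because this query uses titanic rather
than the cleaned tc.
The differences suggest that passenger composition, such as class
distribution, varied across ports.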
titanic %>%
  group_by(pclass, sex, survived) %>%
  summarise(count = n(), .groups = "drop")
## # A tibble: 12 × 4
## pclass sex survived count
## <int> <chr> <int> <int>
## 1 1 female 0 4
## 2 1 female 1 100
## 3 1 male 0 85
## 4 1 male 1 52
## 5 2 female 0 6
## 6 2 female 1 78
## 7 2 male 0 97
## 8 2 male 1 22
## 9 3 female 0 80
## 10 3 female 1 84
## 11 3 male 0 336
## 12 3 male 1 56
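3rd class male passengers had the lowest survival rate: 56 of 392
(14.3%), followed by 2nd class males at 22 of 119 (18.5%).
At the other extreme, 100 of 104 1st class females survived (96.2%).
3rd class males were the most vulnerable group during the disaster.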
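With three variables, a three-way table keeps the rate calculation compact. The sketch below enters the twelve counts from the output above, builds the table with `xtabs()`, and divides the survivor slice by the sum of both slices; the object names are illustrative.

```r
# Counts copied from the pclass x sex x survived table above.
group_counts <- data.frame(
  pclass   = rep(1:3, each = 4),
  sex      = rep(c("female", "female", "male", "male"), times = 3),
  survived = rep(c(0, 1), times = 6),
  count    = c(4, 100, 85, 52, 6, 78, 97, 22, 80, 84, 336, 56)
)

# Three-way contingency table: pclass x sex x survived.
tab <- xtabs(count ~ pclass + sex + survived, data = group_counts)

# Survival rate per class and sex: survivors divided by group size.
rate <- tab[, , "1"] / (tab[, , "0"] + tab[, , "1"])
print(round(100 * rate, 1))
```

The result is a class-by-sex grid of survival percentages, which makes the gap between 1st class females and 3rd class males easy to see.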
titanic %>%
  filter(survived == 1) %>%
  arrange(desc(fare)) %>%
  head(10)
## Unnamed..0 survived pclass sex age sibsp parch fare embarked class
## 1 258 1 1 female 35 0 0 512.3292 C First
## 2 679 1 1 male 36 0 1 512.3292 C First
## 3 737 1 1 male 35 0 0 512.3292 C First
## 4 936 1 1 female 15 0 0 286.6448 S First
## 5 88 1 1 female 23 3 2 263.0000 S First
## 6 341 1 1 female 24 3 2 263.0000 S First
## 7 311 1 1 female 18 2 2 262.3750 C First
## 8 742 1 1 female 21 2 2 262.3750 C First
## 9 299 1 1 female 50 0 1 247.5208 C First
## 10 380 1 1 female 42 0 0 227.5250 C First
## who adult_male deck embark_town alive alone
## 1 woman False Cherbourg yes True
## 2 man True B Cherbourg yes False
## 3 man True B Cherbourg yes True
## 4 child False C Southampton yes True
## 5 woman False C Southampton yes False
## 6 woman False C Southampton yes False
## 7 woman False B Cherbourg yes False
## 8 woman False B Cherbourg yes False
## 9 woman False B Cherbourg yes False
## 10 woman False Cherbourg yes True
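All ten of the highest-paying survivors travelled in first class,
and eight of the ten were female.
This is consistent with the link between fare, class, and survival
seen earlier, although a list restricted to survivors cannot by
itself demonstrate a survival advantage.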
ggplot(titanic, aes(x = factor(pclass))) +
  geom_bar(fill = "skyblue") +
  labs(title = "Passenger Count by Class",
       x = "Passenger Class",
       y = "Count")
The bar graph shows that 3rd class had the most passengers (556).
2nd class had the fewest (203), slightly below 1st class (241).
This uneven distribution reflects the socio-economic mix of
travelers, with more than half in the cheapest class.
Crowding in the lower classes may have slowed evacuation, although
this data cannot confirm it.
ggplot(titanic, aes(x = sex, fill = factor(survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Gender",
       x = "Gender",
       y = "Count",
       fill = "Survived")
Females show a much higher survival count than males (262 vs 130),
despite being the smaller group.
The two genders show opposite patterns: among females, survivors
outnumber non-survivors (262 vs 90), while among males the reverse
holds (130 vs 518).
This supports the idea that women were prioritized during rescue.
Male passengers account for 518 of the 608 deaths.
ggplot(titanic, aes(x = factor(pclass), fill = factor(survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Class",
       x = "Class",
       y = "Count",
       fill = "Survived")
1st class passengers had the highest survival rate (152 of 241,
63.1%).
3rd class passengers had the lowest (140 of 556, 25.2%), and it is
the only class where deaths far outnumber survivors.
This shows a clear link between socio-economic status and
survival.
Access to lifeboats and cabin location likely played an important
role, although the data cannot confirm this directly.
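A dodged bar chart of counts makes rates hard to read, because the classes differ in size. One alternative is a chart in which every bar is scaled to 100%, so the survivor share can be compared directly. The sketch below is a possible variant rather than the plot used in the report; it uses the class counts from the earlier table, and the names `class_counts` and `share_plot` are illustrative.

```r
library(ggplot2)

# Counts copied from the pclass x survived table earlier in the report.
class_counts <- data.frame(
  pclass   = c(1, 1, 2, 2, 3, 3),
  survived = c(0, 1, 0, 1, 0, 1),
  count    = c(89, 152, 103, 100, 416, 140)
)

# position = "fill" rescales each class to a total of 1, showing shares.
share_plot <- ggplot(class_counts,
                     aes(x = factor(pclass), y = count,
                         fill = factor(survived))) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Survival Share by Class",
       x = "Class",
       y = "Share of passengers",
       fill = "Survived")
print(share_plot)
```

On the row-level data, `geom_bar(position = "fill")` would produce the same kind of chart without pre-computed counts.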
ggplot(titanic, aes(x = embarked, fill = factor(survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Embarkation Port",
       x = "Port",
       y = "Count",
       fill = "Survived")
Survival differs noticeably across embarkation ports.
Cherbourg (C) is the only port where survivors outnumber
non-survivors (107 vs 80), while Southampton (S) has by far the most
passengers and the most deaths (477).
The unlabeled bar is the two passengers with no recorded port.
These differences may be due to passenger class distribution, so
boarding location is associated with survival rather than shown to
cause it.
ggplot(titanic_age, aes(x = Age_Group, fill = factor(survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Age Group",
       x = "Age Group",
       y = "Count",
       fill = "Survived")
Children show a relatively higher share of survivors than
adults.
Adults had the highest number of deaths, which is expected because
they are by far the largest group.
The “Unknown” bars are the 197 passengers with a missing age.
The pattern is consistent with priority being given to children
during rescue, so age appears to be related to survival chances.
ggplot(titanic, aes(x = factor(pclass), fill = sex)) +
  geom_bar(position = "dodge") +
  labs(title = "Gender Distribution by Class",
       x = "Class",
       y = "Count")
Males outnumber females in every passenger class, but the size of
the gap varies.
3rd class has far more male passengers (392) than female (164).
1st class is more balanced (137 male, 104 female).
This indicates differences in travel patterns across classes.
titanic_family <- titanic %>%
  mutate(Family_Size = sibsp + parch + 1)
ggplot(titanic_family, aes(x = Family_Size)) +
  geom_bar(fill = "purple") +
  labs(title = "Family Size Distribution",
       x = "Family Size",
       y = "Count")
Family size counts the passenger plus any siblings, spouse, parents,
or children on board, so a value of 1 means travelling alone.
Most passengers traveled alone or with small families.
Large family groups were relatively rare.
Family size may have influenced survival chances, but this plot
shows only the distribution and does not test that.
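The dataset already carries an `alone` column, which gives a quick consistency check on the derived variable: a family size of 1 should coincide with `alone` being "True". The sketch below runs that check on the six rows printed by `head(titanic)` only; whether it holds for all 1,000 rows is not verified here.

```r
# sibsp, parch and alone for the six rows shown by head(titanic).
first_six <- data.frame(
  sibsp = c(1, 1, 0, 1, 0, 0),
  parch = c(0, 0, 0, 0, 0, 0),
  alone = c("False", "False", "True", "False", "True", "True"),
  stringsAsFactors = FALSE
)

# Same derivation as in the report: the passenger plus relatives on board.
first_six$Family_Size <- first_six$sibsp + first_six$parch + 1

# as.logical() accepts the strings "True" and "False".
first_six$alone_flag <- as.logical(first_six$alone)

# A passenger is alone exactly when the family size is 1.
consistent <- (first_six$Family_Size == 1) == first_six$alone_flag
print(first_six)
print(all(consistent))
```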
titanic_age %>%
  group_by(Age_Group, survived) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Age_Group) %>%
  # Share of each outcome within its age group, in percent to match the axis label
  mutate(percent = 100 * n / sum(n)) %>%
  ggplot(aes(x = Age_Group, y = percent, fill = factor(survived))) +
  geom_col(position = "dodge") +
  labs(title = "Survival Percentage by Age Group",
       x = "Age Group",
       y = "Percentage",
       fill = "Survived")
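Children have the highest survival percentage among all age
groups.
Adults show lower survival percentages.
This is consistent with children being given priority during rescue.
Age is therefore associated with survival probability, although the
197 passengers with no recorded age limit how firmly this can be
stated.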
ggplot(titanic, aes(x = sex, fill = factor(survived))) +
  geom_bar(position = "dodge") +
  facet_wrap(~pclass) +
  labs(title = "Class vs Gender vs Survival",
       x = "Gender",
       y = "Count",
       fill = "Survived")
Females had higher survival rates than males in every class: 96.2%
vs 38.0% in 1st class, 92.9% vs 18.5% in 2nd, and 51.2% vs 14.3% in
3rd.
3rd class males were the most affected group, with 336 of 392 dying.
Class mattered within each gender as well: 3rd class females
survived at about half the rate of 1st class females.
This combined view shows that survival chances were unequal across
both class and gender.