Introduction

This project analyzes the Titanic dataset to understand survival patterns among passengers. It focuses on factors such as passenger class, gender, age, fare, and embarkation point. The objective is to extract meaningful insights using statistical summaries and visualizations. This analysis helps in understanding how different variables influenced survival chances.

Load Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3

Load Dataset

titanic <- read.csv("Titanic.csv", stringsAsFactors = FALSE)
head(titanic)
##   Unnamed..0 survived pclass    sex age sibsp parch    fare embarked class
## 1          0        0      3   male  22     1     0  7.2500        S Third
## 2          1        1      1 female  38     1     0 71.2833        C First
## 3          2        1      3 female  26     0     0  7.9250        S Third
## 4          3        1      1 female  35     1     0 53.1000        S First
## 5          4        0      3   male  35     0     0  8.0500        S Third
## 6          5        0      3   male  NA     0     0  8.4583        Q Third
##     who adult_male deck embark_town alive alone
## 1   man       True      Southampton    no False
## 2 woman      False    C   Cherbourg   yes False
## 3 woman      False      Southampton   yes  True
## 4 woman      False    C Southampton   yes False
## 5   man       True      Southampton    no  True
## 6   man       True       Queenstown    no  True

Data Cleaning & Overview

# Work on a copy and recode empty strings as missing values (NA)
tc <- titanic
tc[tc == ""] <- NA

str(tc)
## 'data.frame':    1000 obs. of  16 variables:
##  $ Unnamed..0 : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ sex        : chr  "male" "female" "female" "female" ...
##  $ age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ sibsp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ embarked   : chr  "S" "C" "S" "S" ...
##  $ class      : chr  "Third" "First" "Third" "First" ...
##  $ who        : chr  "man" "woman" "woman" "woman" ...
##  $ adult_male : chr  "True" "False" "False" "False" ...
##  $ deck       : chr  NA "C" NA "C" ...
##  $ embark_town: chr  "Southampton" "Cherbourg" "Southampton" "Southampton" ...
##  $ alive      : chr  "no" "yes" "yes" "yes" ...
##  $ alone      : chr  "False" "False" "True" "False" ...
summary(tc)
##    Unnamed..0       survived         pclass          sex           
##  Min.   :  0.0   Min.   :0.000   Min.   :1.000   Length:1000       
##  1st Qu.:249.8   1st Qu.:0.000   1st Qu.:2.000   Class :character  
##  Median :499.5   Median :0.000   Median :3.000   Mode  :character  
##  Mean   :499.5   Mean   :0.392   Mean   :2.315                     
##  3rd Qu.:749.2   3rd Qu.:1.000   3rd Qu.:3.000                     
##  Max.   :999.0   Max.   :1.000   Max.   :3.000                     
##                                                                    
##       age            sibsp           parch           fare        
##  Min.   : 0.42   Min.   :0.000   Min.   :0.00   Min.   :  0.000  
##  1st Qu.:20.00   1st Qu.:0.000   1st Qu.:0.00   1st Qu.:  7.896  
##  Median :28.00   Median :0.000   Median :0.00   Median : 14.068  
##  Mean   :29.59   Mean   :0.518   Mean   :0.38   Mean   : 31.708  
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.00   3rd Qu.: 30.696  
##  Max.   :80.00   Max.   :8.000   Max.   :6.00   Max.   :512.329  
##  NA's   :197                                                     
##    embarked            class               who             adult_male       
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      deck           embark_town           alive              alone          
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
## 
dim(tc)
## [1] 1000   16
colSums(is.na(tc))
##  Unnamed..0    survived      pclass         sex         age       sibsp 
##           0           0           0           0         197           0 
##       parch        fare    embarked       class         who  adult_male 
##           0           0           2           0           0           0 
##        deck embark_town       alive       alone 
##         769           2           0           0
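
The cleaning step matters for any grouped count that touches a text column. As a minimal sketch on a made-up five-row data frame (the names toy and toy_clean and their values are illustrative assumptions, not rows from Titanic.csv), the following shows how blank strings form their own group until they are recoded to NA:

```r
library(dplyr)

# Hypothetical toy data: two rows have a blank embarkation port
toy <- data.frame(
  embarked = c("S", "C", "", "Q", ""),
  survived = c(0, 1, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Same recoding as used for tc: empty strings become NA
toy_clean <- toy
toy_clean[toy_clean == ""] <- NA

# Before cleaning, "" is counted as a regular category
toy %>% count(embarked)

# After cleaning, the same rows are reported as NA
toy_clean %>% count(embarked)

blank_count <- sum(toy$embarked == "")
na_count <- sum(is.na(toy_clean$embarked))
```

In this report the cleaned copy is tc, so any query that starts from titanic still sees the blank strings.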

Scenario Based Analysis


Question 1: How many passengers belonged to each passenger class on the Titanic?

titanic %>% count(pclass)
##   pclass   n
## 1      1 241
## 2      2 203
## 3      3 556

Interpretation:

3rd class was the largest group, with 556 of the 1,000 passengers (55.6%), compared with 241 in 1st class and 203 in 2nd class.
3rd class passengers therefore made up more than half of the people in this dataset.


Question 2: What is the distribution of male and female passengers?

titanic %>% count(sex)
##      sex   n
## 1 female 352
## 2   male 648

Interpretation:

Male passengers outnumbered female passengers by almost two to one (648 versus 352, or 64.8% versus 35.2%).
Because of this imbalance, survival counts by gender should be read alongside survival rates later in the analysis.


Question 3: What are the minimum, maximum, and average ages of passengers?

titanic %>%
  summarise(
    Min_Age = min(age, na.rm = TRUE),
    Max_Age = max(age, na.rm = TRUE),
    Avg_Age = mean(age, na.rm = TRUE)
  )
##   Min_Age Max_Age  Avg_Age
## 1    0.42      80 29.58851

Interpretation:

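Ages ranged from 0.42 years (an infant of about five months) to 80 years, showing a wide age distribution.
The mean age of about 29.6 years is computed over the 803 passengers with a recorded age (197 ages are missing) and indicates that the typical passenger was a young adult.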


Question 4: How many passengers survived and how many did not survive?

titanic %>% count(survived)
##   survived   n
## 1        0 608
## 2        1 392

Interpretation:

More passengers died than survived: 608 (60.8%) did not survive, while 392 (39.2%) did.
This reflects the tragic nature of the Titanic disaster.


Question 5: Which passenger class had the highest survival count?

titanic %>%
  group_by(pclass, survived) %>%
  summarise(count = n(), .groups = "drop")
## # A tibble: 6 × 3
##   pclass survived count
##    <int>    <int> <int>
## 1      1        0    89
## 2      1        1   152
## 3      2        0   103
## 4      2        1   100
## 5      3        0   416
## 6      3        1   140

Interpretation:

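1st class had the highest survival count (152), followed by 3rd class (140) and 2nd class (100).
Counts alone understate the gap because the classes differ in size: 63.1% of 1st class passengers survived (152 of 241), against 49.3% in 2nd class (100 of 203) and 25.2% in 3rd class (140 of 556).
This is consistent with higher-class passengers having had better access to safety measures, although the data alone cannot establish the cause.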
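
Because the classes differ in size, counts are easier to compare once they are converted to rates. The following is a minimal sketch that rebuilds the six-row table above by hand (the names class_counts and class_rates are illustrative) and derives the share of survivors in each class:

```r
library(dplyr)

# Counts copied from the class-by-survival table above
class_counts <- data.frame(
  pclass   = c(1, 1, 2, 2, 3, 3),
  survived = c(0, 1, 0, 1, 0, 1),
  count    = c(89, 152, 103, 100, 416, 140)
)

# Survival rate = survivors in the class / all passengers in the class
class_rates <- class_counts %>%
  group_by(pclass) %>%
  summarise(
    total     = sum(count),
    survivors = sum(count[survived == 1]),
    rate      = survivors / total
  )

class_rates
```

On the full data frame the same rates should follow from titanic %>% group_by(pclass) %>% summarise(rate = mean(survived)), since survived is coded as 0 and 1.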


Question 6: How does survival differ between male and female passengers?

titanic %>%
  group_by(sex, survived) %>%
  summarise(count = n(), .groups = "drop")
## # A tibble: 4 × 3
##   sex    survived count
##   <chr>     <int> <int>
## 1 female        0    90
## 2 female        1   262
## 3 male          0   518
## 4 male          1   130

Interpretation:

Females had a much higher survival rate than males: 262 of 352 women survived (74.4%), compared with 130 of 648 men (20.1%).
This is consistent with priority having been given to women during the evacuation.


Question 7: How are passengers distributed across different age groups?

titanic_age <- titanic %>%
  mutate(Age_Group = case_when(
    age < 18 ~ "Child",
    age >= 18 & age < 60 ~ "Adult",
    age >= 60 ~ "Senior",
    TRUE ~ "Unknown"
  ))

titanic_age %>% count(Age_Group)
##   Age_Group   n
## 1     Adult 643
## 2     Child 130
## 3    Senior  30
## 4   Unknown 197

Interpretation:

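Most passengers with a recorded age fall under the adult category (643), with far fewer children (130) and senior citizens (30).
The 197 passengers in the Unknown group have no recorded age, so about 80% of passengers with a known age (643 of 803) were working-age individuals between 18 and 59.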
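
The same grouping can also be written with base R's cut(), which some readers find easier to audit than a chain of conditions. This is a sketch of an alternative, not the method used in this report; it uses the first ten ages shown in the str() output above, and the names ages, groups, and group_counts are illustrative:

```r
# First ten ages from the str() output above (one is missing)
ages <- c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)

# right = FALSE makes the intervals [0, 18), [18, 60), [60, Inf),
# which matches the age < 18 / age < 60 / age >= 60 rules
groups <- cut(
  ages,
  breaks = c(0, 18, 60, Inf),
  labels = c("Child", "Adult", "Senior"),
  right  = FALSE
)

# cut() returns NA for missing ages; label them explicitly
groups <- as.character(groups)
groups[is.na(groups)] <- "Unknown"

group_counts <- table(groups)
group_counts
```

Either approach needs an explicit rule for missing ages; otherwise those passengers silently drop out of the grouped counts.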


Question 8: Did women and children have higher survival chances?

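# Rules are checked in order: all females first, then males under 18.
# Males with a missing age fall through to "Adult Male".
titanic_rule <- titanic %>%
  mutate(Category = case_when(
    sex == "female" ~ "Female",
    age < 18 ~ "Child",
    TRUE ~ "Adult Male"
  ))

titanic_rule %>%
  group_by(Category, survived) %>%
  summarise(Count = n(), .groups = "drop")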
## # A tibble: 6 × 3
##   Category   survived Count
##   <chr>         <int> <int>
## 1 Adult Male        0   473
## 2 Adult Male        1   106
## 3 Child             0    45
## 4 Child             1    24
## 5 Female            0    90
## 6 Female            1   262

Interpretation:

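Females had by far the highest survival rate (262 of 352, 74.4%), and male children survived more often (24 of 69, 34.8%) than adult males (106 of 579, 18.3%).
Because the rules are applied in order, the Child category contains only boys under 18 (girls are counted as Female), and males with a missing age are counted as Adult Male.
This pattern is consistent with the “women and children first” practice reported for the evacuation.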
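
How case_when() treats a missing age is easy to overlook. As a minimal sketch on four made-up passengers (the toy data and the Category_Strict column are hypothetical additions for illustration), the following compares the rule used above with a variant that gives missing ages their own label:

```r
library(dplyr)

# Hypothetical passengers: the last male has no recorded age
toy <- data.frame(
  sex = c("female", "male", "male", "male"),
  age = c(30, 10, 40, NA),
  stringsAsFactors = FALSE
)

labelled <- toy %>%
  mutate(
    # Rule as used above: an NA comparison counts as "no match",
    # so the male with a missing age ends up as "Adult Male"
    Category = case_when(
      sex == "female" ~ "Female",
      age < 18 ~ "Child",
      TRUE ~ "Adult Male"
    ),
    # Variant: check for a missing age before comparing it
    Category_Strict = case_when(
      sex == "female" ~ "Female",
      is.na(age) ~ "Male, Age Unknown",
      age < 18 ~ "Male Child",
      TRUE ~ "Adult Male"
    )
  )

labelled
```

Whether males of unknown age belong with adult males is a judgment call; the variant only makes that choice visible in the output.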


Question 9: Did passengers who paid higher fares survive more often?

titanic %>%
  group_by(survived) %>%
  summarise(Avg_Fare = mean(fare, na.rm = TRUE))
## # A tibble: 2 × 2
##   survived Avg_Fare
##      <int>    <dbl>
## 1        0     21.8
## 2        1     47.0

Interpretation:

Survivors paid a higher average fare (about 47.0) than non-survivors (about 21.8).
This is consistent with wealthier passengers having had access to safer locations or resources, although fare is closely tied to passenger class.
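
The fare distribution is strongly skewed (the summary above shows a median of 14.07 against a maximum of 512.33), so a mean can be pulled upward by a few very expensive tickets. The following sketch reports the median next to the mean, using only the six rows printed by head(titanic); the names first_rows and fare_summary are illustrative, and six rows are far too few to draw conclusions from:

```r
library(dplyr)

# survived and fare for the six rows shown by head(titanic)
first_rows <- data.frame(
  survived = c(0, 1, 1, 1, 0, 0),
  fare     = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)
)

# The median is less sensitive to a few very large fares than the mean
fare_summary <- first_rows %>%
  group_by(survived) %>%
  summarise(
    Avg_Fare    = mean(fare),
    Median_Fare = median(fare)
  )

fare_summary
```

Replacing first_rows with titanic would apply the same comparison to the full dataset.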


Question 10: How does survival vary by embarkation port?

titanic %>%
  group_by(embarked, survived) %>%
  summarise(count = n(), .groups = "drop")
## # A tibble: 7 × 3
##   embarked survived count
##   <chr>       <int> <int>
## 1 ""              1     2
## 2 "C"             0    80
## 3 "C"             1   107
## 4 "Q"             0    51
## 5 "Q"             1    33
## 6 "S"             0   477
## 7 "S"             1   250

Interpretation:

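Survival rates differed by embarkation port: 57.2% for Cherbourg (107 of 187), 39.3% for Queenstown (33 of 84), and 34.4% for Southampton (250 of 727).
The blank category holds the 2 passengers with no recorded port; it appears because this query uses titanic rather than the cleaned copy tc.
The differences may reflect passenger composition or class distribution at each port rather than the port itself.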
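
The per-port rates can be read off a row-normalized contingency table. As a sketch, the block below rebuilds the count table above by hand and uses base R's xtabs() and prop.table(); the names port_counts, known_ports, port_table, and port_rates are illustrative, and dropping the two passengers with a blank port is an assumption made for this example:

```r
# Counts copied from the port-by-survival table above
port_counts <- data.frame(
  embarked = c("", "C", "C", "Q", "Q", "S", "S"),
  survived = c(1, 0, 1, 0, 1, 0, 1),
  count    = c(2, 80, 107, 51, 33, 477, 250),
  stringsAsFactors = FALSE
)

# Assumption: leave out the two passengers with no recorded port
known_ports <- port_counts[port_counts$embarked != "", ]

# Cross-tabulate the counts, then divide each row by its row total
port_table <- xtabs(count ~ embarked + survived, data = known_ports)
port_rates <- prop.table(port_table, margin = 1)

round(port_rates, 3)
```

Each row of port_rates sums to 1, so the column labelled 1 is the survival rate for that port.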


Question 11: Which combination of class and gender had the lowest survival rate?

titanic %>%
  group_by(pclass, sex, survived) %>%
  summarise(count = n(), .groups = "drop")
## # A tibble: 12 × 4
##    pclass sex    survived count
##     <int> <chr>     <int> <int>
##  1      1 female        0     4
##  2      1 female        1   100
##  3      1 male          0    85
##  4      1 male          1    52
##  5      2 female        0     6
##  6      2 female        1    78
##  7      2 male          0    97
##  8      2 male          1    22
##  9      3 female        0    80
## 10      3 female        1    84
## 11      3 male          0   336
## 12      3 male          1    56

Interpretation:

3rd class male passengers had the lowest survival rate: 56 of 392 survived (14.3%), slightly below 2nd class males (22 of 119, 18.5%).
By this measure they were the most vulnerable group during the disaster.
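
The question asks about a rate, while the table above lists counts. A minimal sketch of the missing step, using the twelve counts above entered by hand (the names combo_counts, combo_rates, and lowest are illustrative):

```r
library(dplyr)

# Counts copied from the class-by-gender-by-survival table above
combo_counts <- data.frame(
  pclass   = rep(c(1, 2, 3), each = 4),
  sex      = rep(c("female", "female", "male", "male"), times = 3),
  survived = rep(c(0, 1), times = 6),
  count    = c(4, 100, 85, 52, 6, 78, 97, 22, 80, 84, 336, 56),
  stringsAsFactors = FALSE
)

# Survival rate within each class and gender combination
combo_rates <- combo_counts %>%
  group_by(pclass, sex) %>%
  summarise(rate = sum(count[survived == 1]) / sum(count), .groups = "drop") %>%
  arrange(rate)

# The combination with the smallest survival rate
lowest <- combo_rates %>% slice_min(rate, n = 1)

combo_rates
```

Sorting by rate also shows how close the second-lowest group is, which a single minimum would hide.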


Question 12: Among survivors, who paid the highest fares?

titanic %>%
  filter(survived == 1) %>%
  arrange(desc(fare)) %>%
  head(10)
##    Unnamed..0 survived pclass    sex age sibsp parch     fare embarked class
## 1         258        1      1 female  35     0     0 512.3292        C First
## 2         679        1      1   male  36     0     1 512.3292        C First
## 3         737        1      1   male  35     0     0 512.3292        C First
## 4         936        1      1 female  15     0     0 286.6448        S First
## 5          88        1      1 female  23     3     2 263.0000        S First
## 6         341        1      1 female  24     3     2 263.0000        S First
## 7         311        1      1 female  18     2     2 262.3750        C First
## 8         742        1      1 female  21     2     2 262.3750        C First
## 9         299        1      1 female  50     0     1 247.5208        C First
## 10        380        1      1 female  42     0     0 227.5250        C First
##      who adult_male deck embark_town alive alone
## 1  woman      False        Cherbourg   yes  True
## 2    man       True    B   Cherbourg   yes False
## 3    man       True    B   Cherbourg   yes  True
## 4  child      False    C Southampton   yes  True
## 5  woman      False    C Southampton   yes False
## 6  woman      False    C Southampton   yes False
## 7  woman      False    B   Cherbourg   yes False
## 8  woman      False    B   Cherbourg   yes False
## 9  woman      False    B   Cherbourg   yes False
## 10 woman      False        Cherbourg   yes  True

Interpretation:

All ten of the highest-paying survivors traveled in 1st class, and eight of the ten were female.
This is consistent with the relationship between fare, class, and survival seen earlier.


Question 13: What is the distribution of passengers across classes? (Graph)

ggplot(titanic, aes(x = factor(pclass))) +
  geom_bar(fill = "skyblue") +
  labs(title = "Passenger Count by Class",
       x = "Passenger Class",
       y = "Count")

Interpretation:

The bar graph clearly shows that 3rd class had the highest number of passengers (556).
2nd class had the fewest passengers (203), slightly fewer than 1st class (241).
This uneven distribution indicates socio-economic diversity among travelers.
Crowding in lower classes may have affected evacuation efficiency.


Question 14: How does survival differ by gender? (Graph)

ggplot(titanic, aes(x = sex, fill = factor(survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Gender",
       x = "Gender",
       y = "Count",
       fill = "Survived")

Interpretation:

Females show a much higher survival count than males (262 versus 130), despite being the smaller group.
The gap between survival and non-survival is clearly visible for both genders, in opposite directions.
This is consistent with women having been prioritized during rescue.
Male passengers accounted for most of the deaths (518 of 608).


Question 15: How does survival vary by passenger class? (Graph)

ggplot(titanic, aes(x = factor(pclass), fill = factor(survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Class",
       x = "Class",
       y = "Count",
       fill = "Survived")

Interpretation:

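1st class passengers had the highest survival rate (63.1%).
3rd class passengers had the lowest survival rate (25.2%).
Because the bars show counts, these rates come from comparing the two bars within each class, not from bar heights across classes.
This shows a clear link between socio-economic status and survival.
Access to lifeboats and location may have played an important role.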
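
A dodged bar chart of counts makes rates hard to judge when the groups differ in size. One common alternative, sketched here with the class-by-survival counts from Question 5 entered by hand, is to stack the bars with position = "fill" so that each class is rescaled to 100%. The names class_counts and p are illustrative, and the axis labels are assumptions rather than part of the original charts:

```r
library(ggplot2)

# Counts copied from the class-by-survival table in Question 5
class_counts <- data.frame(
  pclass   = c(1, 1, 2, 2, 3, 3),
  survived = c(0, 1, 0, 1, 0, 1),
  count    = c(89, 152, 103, 100, 416, 140)
)

# position = "fill" rescales every class to a total of 1,
# so bar segments show shares instead of counts
p <- ggplot(class_counts,
            aes(x = factor(pclass), y = count, fill = factor(survived))) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Survival Share by Class",
       x = "Class",
       y = "Share of Passengers",
       fill = "Survived")

p
```

With the full data frame, geom_bar(position = "fill") on titanic should give the same picture without the hand-entered counts.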


Question 16: How does survival vary by embarkation port? (Graph)

ggplot(titanic, aes(x = embarked, fill = factor(survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Embarkation Port",
       x = "Port",
       y = "Count",
       fill = "Survived")

Interpretation:

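Survival rates differ noticeably across embarkation ports.
Cherbourg (C) shows the best survival outcome (57.2%), while Southampton (S) shows the lowest (34.4%).
The unlabeled bar represents the 2 passengers with no recorded port.
This may be due to differences in passenger class distribution.
It suggests that boarding location was associated with survival, not necessarily that it caused the difference.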


Question 17: How does survival vary by age group? (Graph)

ggplot(titanic_age, aes(x = Age_Group, fill = factor(survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Age Group",
       x = "Age Group",
       y = "Count",
       fill = "Survived")

Interpretation:

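Children show relatively higher survival rates compared to adults.
Adults had the highest number of deaths, partly because they are by far the largest group.
This is consistent with priority given to children during rescue.
Age appears to have played a role in survival chances, although 197 passengers have no recorded age.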


Question 18: How is gender distributed across passenger classes? (Graph)

ggplot(titanic, aes(x = factor(pclass), fill = sex)) +
  geom_bar(position = "dodge") +
  labs(title = "Gender Distribution by Class",
       x = "Class",
       y = "Count")

Interpretation:

Gender distribution varies across different passenger classes.
3rd class has by far the most male passengers (392 males versus 164 females).
1st class shows a more balanced gender distribution (137 males versus 104 females).
This indicates differences in travel patterns across classes.


Question 19: What is the distribution of family size among passengers? (Graph)

titanic_family <- titanic %>%
  mutate(Family_Size = sibsp + parch + 1)

ggplot(titanic_family, aes(x = Family_Size)) +
  geom_bar(fill = "purple") +
  labs(title = "Family Size Distribution",
       x = "Family Size",
       y = "Count")

Interpretation:

Most passengers traveled alone or with small families.
Large family groups were relatively rare.
This suggests that individual or small group travel was more common.
Family size may have influenced survival chances.
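
Family_Size counts the passenger plus siblings, spouses, parents, and children aboard, so a value of 1 should mean the passenger traveled alone. As a quick consistency sketch, the block below checks that reading against the dataset's own alone column for the six rows printed by head(titanic); the names first_rows, checked, Is_Alone, and Matches are illustrative:

```r
library(dplyr)

# sibsp, parch, and alone for the six rows shown by head(titanic)
first_rows <- data.frame(
  sibsp = c(1, 1, 0, 1, 0, 0),
  parch = c(0, 0, 0, 0, 0, 0),
  alone = c("False", "False", "True", "False", "True", "True"),
  stringsAsFactors = FALSE
)

checked <- first_rows %>%
  mutate(
    Family_Size = sibsp + parch + 1,          # the passenger counts as 1
    Is_Alone    = Family_Size == 1,           # derived flag
    Matches     = Is_Alone == (alone == "True")  # alone is stored as text
  )

checked
```

The derived flag agrees with alone on these six rows; running the same check on the full data frame would confirm whether that holds throughout.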


Question 20: What is the survival percentage across age groups? (Graph)

titanic_age %>%
  group_by(Age_Group, survived) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Age_Group) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  ggplot(aes(x = Age_Group, y = percent, fill = factor(survived))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Survival Percentage by Age Group",
       x = "Age Group",
       y = "Percentage",
       fill = "Survived")

Interpretation:

Children have the highest survival percentage among all age groups.
Adults show lower survival percentages.
This highlights the importance of rescue priority.
Age clearly influenced survival probability.


Question 21: How do class, gender, and survival relate? (Facet Graph)

ggplot(titanic, aes(x = sex, fill = factor(survived))) +
  geom_bar(position = "dodge") +
  facet_wrap(~pclass) +
  labs(title = "Class vs Gender vs Survival",
       x = "Gender",
       y = "Count",
       fill = "Survived")

Interpretation:

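In every class, females had a higher survival rate than males (96.2% versus 38.0% in 1st class, 92.9% versus 18.5% in 2nd class, and 51.2% versus 14.3% in 3rd class).
3rd class males had the lowest survival rate and the most deaths (336).
Survival patterns vary significantly across class and gender.
This combined analysis clearly shows inequality in survival chances.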