CA2

Introduction

This project analyzes the Titanic dataset to understand survival patterns among passengers. It focuses on factors such as passenger class, gender, age, fare, and embarkation point. The objective is to extract meaningful insights using statistical summaries and visualizations. This analysis helps in understanding how different variables influenced survival chances.

Load Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Load Dataset

titanic <- read.csv("Titanic.csv", stringsAsFactors = FALSE)
head(titanic)
##   X survived pclass    sex age sibsp parch    fare embarked class   who
## 1 0        0      3   male  22     1     0  7.2500        S Third   man
## 2 1        1      1 female  38     1     0 71.2833        C First woman
## 3 2        1      3 female  26     0     0  7.9250        S Third woman
## 4 3        1      1 female  35     1     0 53.1000        S First woman
## 5 4        0      3   male  35     0     0  8.0500        S Third   man
## 6 5        0      3   male  NA     0     0  8.4583        Q Third   man
##   adult_male deck embark_town alive alone
## 1       TRUE      Southampton    no FALSE
## 2      FALSE    C   Cherbourg   yes FALSE
## 3      FALSE      Southampton   yes  TRUE
## 4      FALSE    C Southampton   yes FALSE
## 5       TRUE      Southampton    no  TRUE
## 6       TRUE       Queenstown    no  TRUE

Data Cleaning & Overview

tc <- titanic
tc[tc == ""] <- NA

str(tc)
## 'data.frame':    891 obs. of  16 variables:
##  $ X          : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ sex        : chr  "male" "female" "female" "female" ...
##  $ age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ sibsp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ embarked   : chr  "S" "C" "S" "S" ...
##  $ class      : chr  "Third" "First" "Third" "First" ...
##  $ who        : chr  "man" "woman" "woman" "woman" ...
##  $ adult_male : logi  TRUE FALSE FALSE FALSE TRUE TRUE ...
##  $ deck       : chr  NA "C" NA "C" ...
##  $ embark_town: chr  "Southampton" "Cherbourg" "Southampton" "Southampton" ...
##  $ alive      : chr  "no" "yes" "yes" "yes" ...
##  $ alone      : logi  FALSE FALSE TRUE FALSE TRUE TRUE ...
summary(tc)
##        X            survived          pclass          sex           
##  Min.   :  0.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:222.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :445.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :445.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:667.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :890.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##       age            sibsp           parch             fare       
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.91  
##  Median :28.00   Median :0.000   Median :0.0000   Median : 14.45  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816   Mean   : 32.20  
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.: 31.00  
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000   Max.   :512.33  
##  NA's   :177                                                      
##    embarked            class               who            adult_male     
##  Length:891         Length:891         Length:891         Mode :logical  
##  Class :character   Class :character   Class :character   FALSE:354      
##  Mode  :character   Mode  :character   Mode  :character   TRUE :537      
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##      deck           embark_town           alive             alone        
##  Length:891         Length:891         Length:891         Mode :logical  
##  Class :character   Class :character   Class :character   FALSE:354      
##  Mode  :character   Mode  :character   Mode  :character   TRUE :537      
##                                                                          
##                                                                          
##                                                                          
## 
dim(tc)
## [1] 891  16
colSums(is.na(tc))
##           X    survived      pclass         sex         age       sibsp 
##           0           0           0           0         177           0 
##       parch        fare    embarked       class         who  adult_male 
##           0           0           2           0           0           0 
##        deck embark_town       alive       alone 
##         688           2           0           0
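
The missing-value counts are easier to compare as shares of the 891 rows. A minimal sketch, re-entering the non-zero counts from the colSums(is.na(tc)) output by hand rather than reading Titanic.csv (the names na_counts and na_pct are illustrative):

```r
# Non-zero NA counts copied from the colSums(is.na(tc)) output above
na_counts <- c(age = 177, embarked = 2, deck = 688, embark_town = 2)
n_rows <- 891  # number of rows reported by dim(tc)

# Share of rows missing per column, as a percentage rounded to one decimal
na_pct <- round(100 * na_counts / n_rows, 1)
print(na_pct)
```

On these figures deck is missing for roughly three quarters of passengers and age for about one in five, which is worth keeping in mind for any age-based result.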

Scenario Based Analysis


Question 1: How many passengers belonged to each passenger class on the Titanic?

titanic %>% count(pclass)
##   pclass   n
## 1      1 216
## 2      2 184
## 3      3 491

Interpretation:

Of the 891 passengers, 491 (about 55%) travelled in 3rd class, compared with 216 in 1st class and 184 in 2nd class.
Lower-class passengers therefore made up the majority of the population onboard.


Question 2: What is the distribution of male and female passengers?

titanic %>% count(sex)
##      sex   n
## 1 female 314
## 2   male 577

Interpretation:

Male passengers outnumbered female passengers by 577 to 314, so roughly 65% of those onboard were male.
Because of this imbalance, survival counts by gender should be read against the size of each group.


Question 3: What are the minimum, maximum, and average ages of passengers?

titanic %>%
summarise(
Min_Age = min(age, na.rm = TRUE),
Max_Age = max(age, na.rm = TRUE),
Avg_Age = mean(age, na.rm = TRUE)
)
##   Min_Age Max_Age  Avg_Age
## 1    0.42      80 29.69912

Interpretation:

Ages ranged from 0.42 years (an infant) to 80 years, showing a wide age distribution.
The mean age of about 29.7 years is based on the 714 passengers with a recorded age; the 177 missing values are excluded by na.rm = TRUE.


Question 4: How many passengers survived and how many did not survive?

titanic %>% count(survived)
##   survived   n
## 1        0 549
## 2        1 342

Interpretation:

549 passengers (about 62%) did not survive, while 342 (about 38%) did.
This reflects the scale of the loss of life in the Titanic disaster.


Question 5: Which passenger class had the highest survival count?

titanic %>%
group_by(pclass, survived) %>%
summarise(count = n(), .groups = "drop")
## # A tibble: 6 × 3
##   pclass survived count
##    <int>    <int> <int>
## 1      1        0    80
## 2      1        1   136
## 3      2        0    97
## 4      2        1    87
## 5      3        0   372
## 6      3        1   119

Interpretation:

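1st class had the highest survival count (136), ahead of 3rd class (119) and 2nd class (87).
As a share of each class, about 63% of 1st class passengers survived, compared with about 47% of 2nd class and about 24% of 3rd class.
This suggests that higher-class passengers had better access to safety measures.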
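
Counts alone do not account for how large each class was. A minimal sketch that turns the counts printed above into per-class survival rates, re-entering the six counts by hand instead of reading Titanic.csv (class_counts, tab and rates are illustrative names):

```r
# Counts copied from the group_by(pclass, survived) output above
class_counts <- data.frame(
  pclass   = c(1, 1, 2, 2, 3, 3),
  survived = c(0, 1, 0, 1, 0, 1),
  count    = c(80, 136, 97, 87, 372, 119)
)

# Cross-tabulate into a class x survived table of counts
tab <- xtabs(count ~ pclass + survived, data = class_counts)

# Row-wise proportions: the share of each class that died ("0") or survived ("1")
rates <- prop.table(tab, margin = 1)

# Survival rate per class, as a percentage
print(round(100 * rates[, "1"], 1))
```

Using margin = 1 normalises within each class, so a small class with many survivors is not hidden behind a large class with a similar count.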


Question 6: How does survival differ between male and female passengers?

titanic %>%
group_by(sex, survived) %>%
summarise(count = n(), .groups = "drop")
## # A tibble: 4 × 3
##   sex    survived count
##   <chr>     <int> <int>
## 1 female        0    81
## 2 female        1   233
## 3 male          0   468
## 4 male          1   109

Interpretation:

About 74% of female passengers survived (233 of 314), compared with about 19% of male passengers (109 of 577).
This is consistent with women being given priority during rescue operations.
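
To check that a gap of this size is unlikely to be chance, the counts above can be passed to a two-sample test of proportions. This is only a sketch: it re-enters the printed counts by hand and assumes passengers are independent observations, which family groups travelling together do not strictly satisfy.

```r
# Survivors and group sizes by sex, taken from the output above
survivors <- c(female = 233, male = 109)
totals    <- c(female = 233 + 81, male = 109 + 468)

# Survival rate per sex
print(round(survivors / totals, 3))

# Two-sample test of equal proportions (prop.test is part of base R's stats package)
result <- prop.test(survivors, totals)
print(result$p.value)
```

A very small p-value indicates that the difference in rates is hard to explain by chance alone; it does not by itself explain why the difference arose.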


Question 7: How are passengers distributed across different age groups?

titanic_age <- titanic %>%
  mutate(Age_Group = case_when(
    age < 18 ~ "Child",
    age >= 18 & age < 60 ~ "Adult",
    age >= 60 ~ "Senior",
    TRUE ~ "Unknown"
  ))

titanic_age %>% count(Age_Group)
##   Age_Group   n
## 1     Adult 575
## 2     Child 113
## 3    Senior  26
## 4   Unknown 177

Interpretation:

Adults aged 18 to 59 form the largest group (575 passengers), followed by children under 18 (113) and seniors aged 60 or over (26).
The 177 passengers with no recorded age are labelled Unknown; among those with a known age, most were of working age.
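
The same age bands can also be built with base R's cut(), which makes the boundaries explicit. A minimal sketch on a few illustrative ages (0.42 and 80 are the minimum and maximum reported earlier; the other values are chosen only to hit each boundary):

```r
# Illustrative ages, including the boundary values 18 and 60 and one missing age
ages <- c(0.42, 17, 18, 59, 60, 80, NA)

# right = FALSE gives the intervals [0, 18), [18, 60) and [60, Inf)
age_group <- as.character(cut(
  ages,
  breaks = c(0, 18, 60, Inf),
  labels = c("Child", "Adult", "Senior"),
  right  = FALSE
))

# cut() returns NA for a missing age; relabel it to match the "Unknown" group
age_group[is.na(age_group)] <- "Unknown"
print(age_group)
```

With right = FALSE a passenger aged exactly 18 is an Adult and one aged exactly 60 is a Senior, matching the case_when() rules above.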


Question 8: Did women and children have higher survival chances?

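titanic_rule <- titanic %>%
  mutate(Category = case_when(
    sex == "female" ~ "Female",      # all females, including girls under 18
    age < 18        ~ "Child",       # reached only by males, so boys under 18
    TRUE            ~ "Adult Male"   # remaining males, including those with a missing age
  ))

titanic_rule %>%
  group_by(Category, survived) %>%
  summarise(Count = n(), .groups = "drop")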
## # A tibble: 6 × 3
##   Category   survived Count
##   <chr>         <int> <int>
## 1 Adult Male        0   433
## 2 Adult Male        1    86
## 3 Child             0    35
## 4 Child             1    23
## 5 Female            0    81
## 6 Female            1   233

Interpretation:

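Female passengers (233 of 314, about 74%) and boys under 18 (23 of 58, about 40%) survived at higher rates than the Adult Male group (86 of 519, about 17%).
Because the female rule is checked first, the Child category contains only boys, and males with a missing age are counted as Adult Male.
This pattern is consistent with the “women and children first” rescue policy.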
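
Note that case_when() uses the first matching rule and treats an NA condition as no match, so in the code above girls under 18 are labelled Female and males with a missing age fall through to Adult Male. One way to keep these cases apart, sketched in base R on hypothetical rows (the rows and the category labels are illustrative, not taken from Titanic.csv):

```r
# Hypothetical rows covering each case: adult woman, girl, boy, adult man, unknown age
toy <- data.frame(
  sex = c("female", "female", "male", "male", "male"),
  age = c(30, 8, 10, 40, NA)
)

# Check for a missing age first, then age, then sex,
# so every row lands in exactly one clearly defined group
toy$Category <- with(toy, ifelse(
  is.na(age), "Unknown Age",
  ifelse(age < 18, "Child",
         ifelse(sex == "female", "Adult Female", "Adult Male"))
))

print(toy)
```

Ordering the rules this way puts girls and boys in the same Child group and stops passengers of unknown age from inflating the Adult Male count.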


Question 9: Did passengers who paid higher fares survive more often?

titanic %>%
group_by(survived) %>%
summarise(Avg_Fare = mean(fare, na.rm = TRUE))
## # A tibble: 2 × 2
##   survived Avg_Fare
##      <int>    <dbl>
## 1        0     22.1
## 2        1     48.4

Interpretation:

Survivors paid a higher average fare (about 48.4) than non-survivors (about 22.1).
This suggests that wealthier passengers may have had access to safer locations or resources, although fare is closely tied to passenger class.


Question 10: How does survival vary by embarkation port?

titanic %>%
group_by(embarked, survived) %>%
summarise(count = n(), .groups = "drop")
## # A tibble: 7 × 3
##   embarked survived count
##   <chr>       <int> <int>
## 1 ""              1     2
## 2 "C"             0    75
## 3 "C"             1    93
## 4 "Q"             0    47
## 5 "Q"             1    30
## 6 "S"             0   427
## 7 "S"             1   217

Interpretation:

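Passengers who boarded at Cherbourg (C) survived at the highest rate (93 of 168, about 55%), followed by Queenstown (Q; 30 of 77, about 39%) and Southampton (S; 217 of 644, about 34%).
Two survivors appear with a blank port because this query uses titanic rather than the cleaned tc data frame.
These differences may reflect a different mix of passenger classes at each port rather than an effect of the port itself.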
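
The port comparison can be made concrete with survival rates and a chi-squared test of independence. A minimal sketch, re-entering the counts printed above by hand and leaving out the two passengers with a blank port (port_tab and port_test are illustrative names):

```r
# Rows are ports, columns are survived = 0 / 1, copied from the output above
port_tab <- matrix(
  c( 75,  93,
     47,  30,
    427, 217),
  nrow = 3, byrow = TRUE,
  dimnames = list(port = c("C", "Q", "S"), survived = c("0", "1"))
)

# Survival rate per port, as a percentage
print(round(100 * prop.table(port_tab, margin = 1)[, "1"], 1))

# Chi-squared test of independence between port and survival
port_test <- chisq.test(port_tab)
print(port_test$p.value)
```

The test only measures association. It cannot separate the port from the class mix of the passengers who boarded there.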


Question 11: Which combination of class and gender had the lowest survival rate?

titanic %>%
group_by(pclass, sex, survived) %>%
summarise(count = n(), .groups = "drop")
## # A tibble: 12 × 4
##    pclass sex    survived count
##     <int> <chr>     <int> <int>
##  1      1 female        0     3
##  2      1 female        1    91
##  3      1 male          0    77
##  4      1 male          1    45
##  5      2 female        0     6
##  6      2 female        1    70
##  7      2 male          0    91
##  8      2 male          1    17
##  9      3 female        0    72
## 10      3 female        1    72
## 11      3 male          0   300
## 12      3 male          1    47

Interpretation:

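3rd class male passengers had the lowest survival rate (47 of 347, about 14%), followed closely by 2nd class males (17 of 108, about 16%).
They were the most vulnerable group during the disaster; by contrast, about 97% of 1st class females survived (91 of 94).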
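
Since the query returns counts, identifying the lowest rate takes one more step. A minimal sketch that derives the survival rate for each class and gender combination from the twelve counts above, entered by hand (the data frame and column names are illustrative):

```r
# Counts copied from the group_by(pclass, sex, survived) output above
counts <- data.frame(
  pclass   = rep(1:3, each = 4),
  sex      = rep(rep(c("female", "male"), each = 2), times = 3),
  survived = rep(c(0, 1), times = 6),
  count    = c(3, 91, 77, 45, 6, 70, 91, 17, 72, 72, 300, 47)
)

# Total passengers and survivors per class/sex group
totals    <- aggregate(count ~ pclass + sex, data = counts, FUN = sum)
survivors <- aggregate(count ~ pclass + sex,
                       data = counts[counts$survived == 1, ], FUN = sum)

# Join on the group keys and compute the survival rate in percent
rates <- merge(totals, survivors, by = c("pclass", "sex"),
               suffixes = c("_total", "_survived"))
rates$rate <- round(100 * rates$count_survived / rates$count_total, 1)

# Sort ascending so the group with the lowest rate comes first
rates <- rates[order(rates$rate), ]
print(rates)
```

Sorting by rate rather than by count keeps a large group such as 3rd class males from looking better or worse purely because of its size.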


Question 12: Among survivors, who paid the highest fares?

titanic %>%
filter(survived == 1) %>%
arrange(desc(fare)) %>%
head(10)
##      X survived pclass    sex age sibsp parch     fare embarked class   who
## 1  258        1      1 female  35     0     0 512.3292        C First woman
## 2  679        1      1   male  36     0     1 512.3292        C First   man
## 3  737        1      1   male  35     0     0 512.3292        C First   man
## 4   88        1      1 female  23     3     2 263.0000        S First woman
## 5  341        1      1 female  24     3     2 263.0000        S First woman
## 6  311        1      1 female  18     2     2 262.3750        C First woman
## 7  742        1      1 female  21     2     2 262.3750        C First woman
## 8  299        1      1 female  50     0     1 247.5208        C First woman
## 9  380        1      1 female  42     0     0 227.5250        C First woman
## 10 700        1      1 female  18     1     0 227.5250        C First woman
##    adult_male deck embark_town alive alone
## 1       FALSE        Cherbourg   yes  TRUE
## 2        TRUE    B   Cherbourg   yes FALSE
## 3        TRUE    B   Cherbourg   yes  TRUE
## 4       FALSE    C Southampton   yes FALSE
## 5       FALSE    C Southampton   yes FALSE
## 6       FALSE    B   Cherbourg   yes FALSE
## 7       FALSE    B   Cherbourg   yes FALSE
## 8       FALSE    B   Cherbourg   yes FALSE
## 9       FALSE        Cherbourg   yes  TRUE
## 10      FALSE    C   Cherbourg   yes FALSE

Interpretation:

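All ten of the highest-paying survivors travelled in 1st class; eight of them were women and eight boarded at Cherbourg.
Because the list contains survivors only, it illustrates the link between fare and survival seen earlier rather than proving it.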


Question 13: What is the distribution of passengers across classes? (Graph)

ggplot(titanic, aes(x = factor(pclass))) +
geom_bar(fill = "skyblue") +
labs(title = "Passenger Count by Class",
x = "Passenger Class",
y = "Count")

Interpretation:

The bar graph clearly shows that 3rd class had the highest number of passengers (491).
2nd class had the fewest passengers (184), slightly below 1st class (216).
This uneven distribution indicates socio-economic diversity among travelers.
Crowding in lower classes may have affected evacuation efficiency.


Question 14: How does survival differ by gender? (Graph)

ggplot(titanic, aes(x = sex, fill = factor(survived))) +
geom_bar(position = "dodge") +
labs(title = "Survival by Gender",
x = "Gender",
y = "Count",
fill = "Survived")

Interpretation:

Females show a much higher survival count than males (233 versus 109), despite being the smaller group.
Among males, non-survivors far outnumber survivors (468 versus 109), while among females the pattern is reversed.
This supports the idea that women were prioritized during rescue.
Male passengers accounted for most of the deaths.


Question 15: How does survival vary by passenger class? (Graph)

ggplot(titanic, aes(x = factor(pclass), fill = factor(survived))) +
geom_bar(position = "dodge") +
labs(title = "Survival by Class",
x = "Class",
y = "Count",
fill = "Survived")

Interpretation:

1st class is the only class in which survivors outnumber non-survivors, giving it the highest survival rate.
3rd class has by far the most non-survivors and the lowest survival rate.
This shows a clear link between socio-economic status and survival.
Access to lifeboats and cabin location may have played an important role.


Question 16: How does survival vary by embarkation port? (Graph)

ggplot(titanic, aes(x = embarked, fill = factor(survived))) +
geom_bar(position = "dodge") +
labs(title = "Survival by Embarkation Port",
x = "Port",
y = "Count",
fill = "Survived")

Interpretation:

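Survival rates differ noticeably across embarkation ports.
Cherbourg is the only port where survivors outnumber non-survivors, while Southampton has the largest number of deaths.
This may be due to differences in passenger class distribution.
Boarding location is therefore associated with survival, though not necessarily a cause of it.
The unlabelled bar represents the two passengers whose port is blank in titanic.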


Question 17: How does survival vary by age group? (Graph)

ggplot(titanic_age, aes(x = Age_Group, fill = factor(survived))) +
geom_bar(position = "dodge") +
labs(title = "Survival by Age Group",
x = "Age Group",
y = "Count",
fill = "Survived")

Interpretation:

Children show relatively higher survival rates compared to adults.
Adults had the highest number of deaths, partly because they are the largest group.
This is consistent with priority being given to children during rescue.
Age appears to have played a role in survival chances.


Question 18: How is gender distributed across passenger classes? (Graph)

ggplot(titanic, aes(x = factor(pclass), fill = sex)) +
geom_bar(position = "dodge") +
labs(title = "Gender Distribution by Class",
x = "Class",
y = "Count")

Interpretation:

Gender distribution varies across different passenger classes.
3rd class has far more male than female passengers (347 versus 144).
1st class shows a more balanced gender distribution (122 male versus 94 female).
This indicates differences in travel patterns across classes.


Question 19: What is the distribution of family size among passengers? (Graph)

titanic_family <- titanic %>%
mutate(Family_Size = sibsp + parch + 1)

ggplot(titanic_family, aes(x = Family_Size)) +
geom_bar(fill = "purple") +
labs(title = "Family Size Distribution",
x = "Family Size",
y = "Count")

Interpretation:

Most passengers traveled alone or with small families.
Large family groups were relatively rare.
This suggests that individual or small group travel was more common.
Whether family size influenced survival chances cannot be read from this graph and would need a separate comparison.


Question 20: What is the survival percentage across age groups? (Graph)

# count() returns an ungrouped result, which avoids the summarise() regrouping message
titanic_age %>%
  count(Age_Group, survived) %>%
  group_by(Age_Group) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x = Age_Group, y = percent, fill = factor(survived))) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Survival Percentage by Age Group",
       x = "Age Group",
       y = "Percentage",
       fill = "Survived")

Interpretation:

Children have the highest survival percentage among all age groups.
Adults show lower survival percentages.
This is consistent with children being given priority during rescue.
Age is clearly associated with survival probability.


Question 21: How do class, gender, and survival relate? (Facet Graph)

ggplot(titanic, aes(x = sex, fill = factor(survived))) +
geom_bar(position = "dodge") +
facet_wrap(~pclass) +
labs(title = "Class vs Gender vs Survival",
x = "Gender",
y = "Count",
fill = "Survived")

Interpretation:

Females had higher survival rates than males in every class.
3rd class males were the most affected group, with 300 of 347 not surviving.
3rd class females fared worst among women, with only half surviving (72 of 144).
This combined analysis shows that survival chances depended on both class and gender.


CA3

Introduction

This project continues the analysis of the Titanic dataset with 30 new analytical questions. The focus of this assessment is on advanced visualization techniques including histograms, boxplots, scatter plots, correlation analysis, and line charts. These visualizations help uncover deeper patterns related to passenger survival, demographics, and socio-economic factors. The objective is to extract meaningful insights by applying multiple statistical and graphical methods to the same dataset.

Load Libraries

library(dplyr)
library(ggplot2)
library(tidyr)
if (!require("corrplot", quietly = TRUE)) install.packages("corrplot")
## corrplot 0.95 loaded
library(corrplot)
if (!require("moments", quietly = TRUE)) install.packages("moments")
library(moments)

Load Dataset

titanic <- read.csv("Titanic.csv", stringsAsFactors = FALSE)

Prepare Helper Variables

titanic_age <- titanic %>%
  mutate(Age_Group = case_when(
    age < 18  ~ "Child",
    age >= 18 & age < 60 ~ "Adult",
    age >= 60 ~ "Senior",
    TRUE ~ "Unknown"
  ))

titanic_family <- titanic %>%
  mutate(Family_Size = sibsp + parch + 1)

Scenario Based Analysis


Question 1: What is the age distribution of Titanic passengers?

ggplot(titanic, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Age Distribution of Titanic Passengers",
       x = "Age",
       y = "Frequency")
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).

Interpretation:

The histogram shows that the majority of passengers were between 20 and 40 years old.
There is a small peak for young children under 10 years of age.
The distribution is slightly right-skewed, indicating fewer elderly passengers.
Most travelers were working-age adults.


Question 2: What is the fare distribution among all passengers?

ggplot(titanic, aes(x = fare)) +
  geom_histogram(binwidth = 10, fill = "coral", color = "black", alpha = 0.7) +
  labs(title = "Fare Distribution of Titanic Passengers",
       x = "Fare (£)",
       y = "Frequency")

Interpretation:

The fare distribution is heavily right-skewed, with most passengers paying low fares.
A small number of passengers paid very high fares, creating a long right tail.
This reflects the wide economic gap between different passenger classes.
Lower-fare passengers dominated the overall composition of the Titanic.


Question 3: How are passenger ages distributed across different classes?

ggplot(titanic, aes(x = age, fill = factor(pclass))) +
  geom_histogram(binwidth = 5, color = "black", alpha = 0.7) +
  facet_wrap(~pclass) +
  labs(title = "Age Distribution by Passenger Class",
       x = "Age",
       y = "Frequency",
       fill = "Class")
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).

Interpretation:

All three passenger classes show a peak concentration in the 20–40 age range.
1st class passengers tend to be slightly older on average compared to other classes.
3rd class has the highest proportion of young adults and children.
Age distribution varies meaningfully across the three passenger classes.


Question 4: What is the family size distribution among passengers?

ggplot(titanic_family, aes(x = Family_Size)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "black", alpha = 0.8) +
  labs(title = "Family Size Distribution of Titanic Passengers",
       x = "Family Size",
       y = "Frequency")

Interpretation:

The majority of passengers traveled alone, reflected by the dominant peak at family size 1.
Two-person families were the second most common group onboard.
Very large families of 7 or more members were extremely rare.
Solo travel was the most dominant travel pattern aboard the Titanic.


Question 5: What is the age distribution with a density curve, and is it skewed?

ggplot(titanic, aes(x = age, y = after_stat(density))) +
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1.5, adjust = 1.5) +
  labs(title = "Age Distribution with Density Curve",
       x = "Age",
       y = "Density")
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_density()`).

skewness(titanic$age, na.rm = TRUE)
## [1] 0.3882899

Interpretation:

The density curve overlaid on the histogram confirms a slightly right-skewed age distribution.
The skewness of about 0.39 is greater than 0, indicating a mild positive skew.
This means most passengers were concentrated at younger ages, with a longer tail of older travelers.
The distribution departs slightly from a perfectly symmetric normal bell curve.
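
For readers who want to see what the skewness figure measures, the sketch below computes a moment-based skewness by hand: the third central moment divided by the second central moment raised to the power 1.5. It is assumed, not verified here, that this is the definition skewness() in the moments package uses; the function name moment_skewness and the toy vectors are hypothetical.

```r
# Moment-based skewness using population moments (dividing by n)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]           # drop missing values, like na.rm = TRUE
  centered <- x - mean(x)
  m2 <- mean(centered^2)      # second central moment (variance with divisor n)
  m3 <- mean(centered^3)      # third central moment
  m3 / m2^1.5
}

# Hypothetical toy vectors, not the Titanic ages
symmetric    <- c(1, 2, 3, 4, 5)
right_tailed <- c(1, 2, 2, 3, 3, 3, 20, NA)

print(moment_skewness(symmetric))     # symmetric data gives 0
print(moment_skewness(right_tailed))  # a long right tail gives a positive value
```

A positive value means the tail on the high side is longer, which for age corresponds to a small number of much older passengers.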


Question 6: How does age vary across passenger classes?

titanic$pclass <- as.factor(titanic$pclass)

ggplot(titanic, aes(x = pclass, y = age, fill = pclass)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16) +
  labs(title = "Age Distribution by Passenger Class",
       x = "Passenger Class",
       y = "Age") +
  theme_minimal()
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Interpretation:

1st class passengers have a higher median age compared to 2nd and 3rd class passengers.
3rd class passengers are generally younger, with a lower median age.
Red points beyond the whiskers mark the few passengers who were much older than the rest of their class.
Socio-economic class and passenger age appear to be related.
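
The red points follow the usual boxplot rule: geom_boxplot() draws as outliers the values that lie more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule on a hypothetical vector (iqr_fences and toy_ages are illustrative names, and the values are not Titanic ages):

```r
# Lower and upper fences at 1.5 * IQR beyond the first and third quartiles
iqr_fences <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# Hypothetical values with one clearly extreme entry and one missing value
toy_ages <- c(1:10, 50, NA)
fences <- iqr_fences(toy_ages)
print(fences)

# Values outside the fences would be drawn as individual outlier points
outliers <- toy_ages[!is.na(toy_ages) &
                       (toy_ages < fences["lower"] | toy_ages > fences["upper"])]
print(outliers)
```

Applied per class, for example with tapply(titanic$age, titanic$pclass, iqr_fences), this would show the age beyond which a passenger counts as an outlier in each class.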


Question 7: How does fare differ between survivors and non-survivors?

ggplot(titanic, aes(x = factor(survived), y = fare, fill = factor(survived))) +
  geom_boxplot(outlier.colour = "darkred", outlier.shape = 16) +
  labs(title = "Fare Distribution by Survival Status",
       x = "Survived (0 = No, 1 = Yes)",
       y = "Fare (£)",
       fill = "Survived") +
  theme_minimal()

Interpretation:

Survivors paid noticeably higher median fares than non-survivors.
The fare distribution for survivors has a wider spread and more high-value outliers.
This suggests that wealthier passengers had a higher likelihood of survival.
Fare appears to be a strong indicator of both socio-economic status and survival advantage.


Question 8: How does age distribution differ by gender?

ggplot(titanic, aes(x = sex, y = age, fill = sex)) +
  geom_boxplot(position = position_dodge(width = 0.8),
               outlier.colour = "red", alpha = 0.8) +
  labs(title = "Age Distribution by Gender",
       x = "Gender",
       y = "Age") +
  theme_minimal()
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Interpretation:

Male and female passengers show similar median age values onboard.
Female passengers display slightly less age variability than male passengers.
Both genders contain outliers, indicating a few very elderly individuals.
Gender alone does not strongly differentiate the age distribution of passengers.


Question 9: How does fare vary by embarkation port?

titanic_emb <- titanic %>% filter(!is.na(embarked) & embarked != "")

ggplot(titanic_emb, aes(x = embarked, y = fare, fill = embarked)) +
  geom_boxplot(outlier.colour = "blue", outlier.shape = 16) +
  labs(title = "Fare Distribution by Embarkation Port",
       x = "Port (C = Cherbourg, Q = Queenstown, S = Southampton)",
       y = "Fare (£)") +
  theme_minimal()

Interpretation:

Cherbourg passengers paid the highest median fares, suggesting more 1st class travelers boarded there.
Queenstown passengers had the lowest median fares, indicating mostly lower-class travelers.
Southampton shows a wide fare range, reflecting passengers from all classes.
Embarkation port is a meaningful indicator of passenger socio-economic background.


Question 10: How does fare distribution compare across classes and survival status?

ggplot(titanic, aes(x = factor(survived), y = fare, fill = factor(survived))) +
  geom_boxplot(alpha = 0.8) +
  facet_wrap(~pclass) +
  labs(title = "Fare Distribution by Class and Survival Status",
       x = "Survived (0 = No, 1 = Yes)",
       y = "Fare (£)",
       fill = "Survived") +
  theme_minimal()

Interpretation:

Within each class, survivors tend to have higher median fares than non-survivors, but the size of the gap varies.
1st class survivors paid considerably higher fares, showing a strong wealth-survival link.
3rd class shows little difference in fare between survivors and non-survivors.
The survival-fare relationship is most pronounced among 1st class passengers.


Question 11: Is there a relationship between age and fare paid?

ggplot(titanic, aes(x = age, y = fare)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  labs(title = "Scatter Plot: Age vs Fare",
       x = "Age",
       y = "Fare (£)")
## Warning: Removed 177 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation:

The scatter plot shows no strong linear relationship between passenger age and fare paid.
High fares are distributed across passengers of various age groups.
A few extreme fare outliers are visible, mostly for middle-aged passengers.
Age alone is not a strong predictor of the fare a passenger paid.
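
A correlation coefficient puts a number on what the scatter plot suggests. The sketch below uses only the six rows printed by head(titanic), so the resulting values say nothing about the full dataset; it is meant to show the calls and the handling of the missing age (first_rows is an illustrative name).

```r
# Age and fare for the first six passengers, copied from the head(titanic) output
first_rows <- data.frame(
  age  = c(22, 38, 26, 35, 35, NA),
  fare = c(7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583)
)

# use = "complete.obs" drops the row with the missing age before computing
pearson  <- cor(first_rows$age, first_rows$fare, use = "complete.obs")

# Spearman's rank correlation is less sensitive to extreme fares
spearman <- cor(first_rows$age, first_rows$fare,
                use = "complete.obs", method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

On the full data the same calls would be cor(titanic$age, titanic$fare, use = "complete.obs"); given the heavy right skew of fare, the Spearman version is probably the safer summary.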


Question 12: How does the age-fare relationship differ between survivors and non-survivors?

ggplot(titanic, aes(x = age, y = fare, color = factor(survived))) +
  geom_point(size = 2, alpha = 0.6) +
  labs(title = "Age vs Fare by Survival Status",
       x = "Age",
       y = "Fare (£)",
       color = "Survived") +
  theme_minimal()
## Warning: Removed 177 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation:

Survivors are more concentrated in the higher fare range across all age groups.
Non-survivors are densely clustered in the low-fare and younger-age region.
A few older survivors paid extremely high fares, suggesting 1st class membership.
Considering both age and fare together provides a stronger survival signal.


Question 13: How does the age-fare relationship vary across passenger classes?

ggplot(titanic, aes(x = age, y = fare, color = pclass)) +
  geom_point(size = 2, alpha = 0.6) +
  facet_wrap(~pclass) +
  labs(title = "Age vs Fare by Passenger Class",
       x = "Age",
       y = "Fare (£)",
       color = "Class") +
  theme_minimal()
## Warning: Removed 177 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation:

1st class passengers have a wide range of fares spread across all ages.
3rd class passengers are tightly clustered at low fare values regardless of age.
2nd class shows moderate fares with limited variation across age groups.
Passenger class clearly defines the fare structure observed within each age group.


Question 14: How does family size relate to fare paid by passengers?

ggplot(titanic_family, aes(x = Family_Size, y = fare, color = factor(survived))) +
  geom_point(size = 2, alpha = 0.6) +
  labs(title = "Family Size vs Fare",
       x = "Family Size",
       y = "Fare (£)",
       color = "Survived") +
  theme_minimal()

Interpretation:

Solo travelers span the full range of fare amounts from very low to very high.
Medium-sized families of 2 to 4 members tend to cluster in the lower fare range.
Some small-family survivors paid relatively high fares, indicating higher class.
Family size alone does not show a clear positive relationship with fare paid.


Question 15: What is the relationship between age and fare across gender groups with a regression line?

ggplot(titanic, aes(x = age, y = fare, color = sex)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Age vs Fare by Gender with Regression Line",
       x = "Age",
       y = "Fare (£)",
       color = "Gender") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 177 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation:

The regression lines for both genders show a very slight upward trend with age.
Female passengers show a marginally higher association with fares at older ages.
The overall slope is nearly flat, confirming a weak correlation between age and fare.
Gender interacts only mildly with age when predicting the fare paid.


Question 16: What is the Pearson correlation between age and fare?

titanic_clean <- titanic %>% filter(!is.na(age) & !is.na(fare))
cor_result <- cor.test(titanic_clean$age, titanic_clean$fare, method = "pearson")
cat("Pearson Correlation Coefficient (Age vs Fare):", cor_result$estimate, "\n")
## Pearson Correlation Coefficient (Age vs Fare): 0.09606669
cat("p-value:", cor_result$p.value, "\n")
## p-value: 0.01021628

Interpretation:

The Pearson correlation coefficient between age and fare is about 0.10, which is close to zero.
This confirms a very weak positive linear relationship between passenger age and fare paid.
The p-value of about 0.010 is below 0.05, so the relationship is statistically significant despite being weak.
Age alone cannot reliably predict how much a passenger paid for their ticket.
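
The difference between a statistically significant correlation and a strong one can be checked with a little arithmetic. The sketch below starts from the reported coefficient and assumes n = 714 complete age-fare pairs (891 rows minus the 177 rows dropped in the earlier plot warnings; this count is inferred, not printed by the correlation test). The variable names are illustrative.

```r
# Reported Pearson r for age vs fare (taken from the output above)
r <- 0.09606669
# Assumed number of complete age-fare pairs: 891 rows - 177 with missing age
n <- 714

# Share of the variance in fare that is linearly explained by age
r_squared <- r^2

# t statistic and two-sided p-value for H0: true correlation = 0
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

cat("R-squared:", round(r_squared, 4), "\n")
cat("t:", round(t_stat, 3), " p:", signif(p_value, 3), "\n")
```

Under these assumptions r squared is roughly 0.009, so age accounts for less than 1% of the variation in fare even though the p-value falls below 0.05.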


Question 17: What is the correlation matrix of key numeric variables?

titanic_numeric <- titanic %>%
  mutate(pclass_num = as.numeric(as.character(pclass))) %>%
  select(survived, pclass_num, age, sibsp, parch, fare) %>%
  na.omit()

cor_matrix <- cor(titanic_numeric)
print(round(cor_matrix, 2))
##            survived pclass_num   age sibsp parch  fare
## survived       1.00      -0.36 -0.08 -0.02  0.09  0.27
## pclass_num    -0.36       1.00 -0.37  0.07  0.03 -0.55
## age           -0.08      -0.37  1.00 -0.31 -0.19  0.10
## sibsp         -0.02       0.07 -0.31  1.00  0.38  0.14
## parch          0.09       0.03 -0.19  0.38  1.00  0.21
## fare           0.27      -0.55  0.10  0.14  0.21  1.00

Interpretation:

The correlation matrix summarises pairwise linear relationships among the key numeric variables, computed on the rows with no missing values.
Fare and survival show a positive correlation (0.27), indicating that passengers who paid more survived more often.
Passenger class and fare have the strongest correlation in the matrix (-0.55); it is negative because a higher class corresponds to a lower class number.
Sibsp and parch show a moderate positive correlation (0.38), as both relate to family travel.
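
Because `na.omit()` drops every row that has any missing value, the matrix above is computed only on passengers with a known age. That is why its sibsp-parch entry (0.38) differs from the full-data value reported in Question 18 (0.41). The sketch below contrasts the two ways of handling missing values on a small made-up data frame; `toy` and its values are purely illustrative and are not taken from the dataset.

```r
# Illustrative data only: two rows have a missing age
toy <- data.frame(
  age   = c(22, 38, NA, 35, NA, 54, 2, 27),
  sibsp = c(1, 1, 0, 1, 0, 0, 3, 0),
  parch = c(0, 0, 0, 0, 0, 0, 1, 2)
)

# Listwise deletion: rows with any NA are removed before correlating
cc <- cor(na.omit(toy))

# Pairwise deletion: each pair of columns uses all rows where both are present
pw <- cor(toy, use = "pairwise.complete.obs")

cat("Rows kept by na.omit():", nrow(na.omit(toy)), "of", nrow(toy), "\n")
cat("sibsp-parch, listwise:", round(cc["sibsp", "parch"], 2), "\n")
cat("sibsp-parch, pairwise:", round(pw["sibsp", "parch"], 2), "\n")
```

Listwise deletion keeps every entry of the matrix on the same set of passengers, at the cost of discarding rows; pairwise deletion uses more data per entry but mixes different subsets within one matrix.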


Question 18: How strongly are number of siblings/spouses and parents/children correlated?

cor_sibsp_parch <- cor.test(titanic$sibsp, titanic$parch, method = "pearson")
cat("Pearson Correlation (sibsp vs parch):", cor_sibsp_parch$estimate, "\n")
## Pearson Correlation (sibsp vs parch): 0.4148377
cat("p-value:", cor_sibsp_parch$p.value, "\n")
## p-value: 2.241824e-38

Interpretation:

There is a moderate positive correlation (about 0.41) between siblings/spouses and parents/children counts.
Passengers traveling with family tend to have higher values for both sibsp and parch.
This suggests that passengers who traveled with one kind of relative often traveled with the other kind as well.
The very small p-value (about 2.2e-38) shows that the correlation is statistically significant, which is a separate matter from how strong it is.


Question 19: What is the correlation between fare and survival status?

cor_fare_survived <- cor.test(titanic$fare, titanic$survived, method = "pearson")
cat("Pearson Correlation (Fare vs Survived):", cor_fare_survived$estimate, "\n")
## Pearson Correlation (Fare vs Survived): 0.2573065
cat("p-value:", cor_fare_survived$p.value, "\n")
## p-value: 6.120189e-15

Interpretation:

There is a positive correlation of about 0.26 between fare paid and survival, consistent with earlier findings.
Higher-fare passengers were more likely to survive the disaster, although the correlation is only weak to moderate.
The very small p-value (about 6.1e-15) shows that the relationship is statistically significant.
Fare is closely tied to passenger class, so this result points to a measurable association between economic privilege and survival rather than proof of a direct cause.


Question 20: How does the correlation matrix look visually for all numeric variables?

corrplot(cor_matrix,
         method = "color",
         addCoef.col = "black",
         number.cex = 0.7,
         col = colorRampPalette(c("red", "white", "blue"))(200),
         tl.col = "black",
         tl.srt = 45,
         mar = c(0, 0, 2, 0),
         title = "Titanic - Numeric Variable Correlation Matrix")

Interpretation:

The visual correlation matrix clearly highlights positive and negative relationships.
Blue cells indicate positive correlations while red cells indicate negative correlations.
Fare and passenger class have the strongest visible negative correlation in the matrix.
Survival shows positive blue shading with fare and negative red shading with class.


Question 21: How does average fare change across passenger classes?

fare_by_class <- titanic %>%
  mutate(pclass_num = as.numeric(as.character(pclass))) %>%
  group_by(pclass_num) %>%
  summarise(Avg_Fare = mean(fare, na.rm = TRUE))

ggplot(fare_by_class, aes(x = pclass_num, y = Avg_Fare)) +
  geom_line(color = "blue", linewidth = 1.2) +
  geom_point(color = "red", size = 3) +
  labs(title = "Average Fare by Passenger Class",
       x = "Passenger Class",
       y = "Average Fare (£)") +
  theme_minimal()

Interpretation:

Average fare drops sharply from 1st class to 2nd and 3rd class passengers.
1st class passengers paid the highest average fares by a very large margin.
The decline from class 2 to class 3 is smaller but still clearly visible.
This line chart confirms a clear hierarchy of spending power across classes.


Question 22: How does survival rate vary across age groups?

survival_by_age <- titanic_age %>%
  filter(Age_Group != "Unknown") %>%
  group_by(Age_Group) %>%
  summarise(Survival_Rate = mean(survived, na.rm = TRUE))

survival_by_age$Age_Group <- factor(survival_by_age$Age_Group,
                                     levels = c("Child", "Adult", "Senior"))

ggplot(survival_by_age, aes(x = Age_Group, y = Survival_Rate, group = 1)) +
  geom_line(color = "darkgreen", linewidth = 1.2) +
  geom_point(color = "orange", size = 3) +
  labs(title = "Survival Rate by Age Group",
       x = "Age Group",
       y = "Survival Rate") +
  theme_minimal()

Interpretation:

Children have the highest survival rate among all three age groups.
Survival rates decrease progressively from children to adults to seniors.
This trend is consistent with priority being given to younger passengers during rescue operations.
Senior passengers faced the lowest survival chances of any age group.


Question 23: How does average fare trend across age groups?

fare_by_age <- titanic_age %>%
  filter(Age_Group != "Unknown") %>%
  group_by(Age_Group) %>%
  summarise(Avg_Fare = mean(fare, na.rm = TRUE))

fare_by_age$Age_Group <- factor(fare_by_age$Age_Group,
                                 levels = c("Child", "Adult", "Senior"))

ggplot(fare_by_age, aes(x = Age_Group, y = Avg_Fare, group = 1)) +
  geom_line(color = "purple", linewidth = 1.2) +
  geom_point(color = "red", size = 3) +
  labs(title = "Average Fare by Age Group",
       x = "Age Group",
       y = "Average Fare (£)") +
  theme_minimal()

Interpretation:

Senior passengers paid the highest average fares among all age groups.
Children and adults paid comparatively lower average fares.
This suggests that older passengers may have been more likely to travel in 1st class.
Age group provides a useful lens for understanding fare patterns across the dataset.


Question 24: How does the average age of passengers vary across embarkation towns?

age_by_town <- titanic %>%
  filter(!is.na(embark_town) & embark_town != "" & !is.na(age)) %>%
  group_by(embark_town) %>%
  summarise(Avg_Age = mean(age, na.rm = TRUE))

ggplot(age_by_town, aes(x = embark_town, y = Avg_Age, group = 1)) +
  geom_line(color = "brown", linewidth = 1.2) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Average Passenger Age by Embarkation Town",
       x = "Embarkation Town",
       y = "Average Age (Years)") +
  theme_minimal()

Interpretation:

Average passenger age varies somewhat across the three embarkation towns.
Cherbourg passengers tend to be older on average than those from the other ports.
Queenstown shows the youngest average passenger age among the three towns.
These differences are small, however, and the one-way ANOVA in Question 27 finds that they are not statistically significant (p = 0.529).


Question 25: How do average fares compare between male and female passengers across classes?

fare_gender_class <- titanic %>%
  mutate(pclass_num = as.numeric(as.character(pclass))) %>%
  group_by(pclass_num, sex) %>%
  summarise(Avg_Fare = mean(fare, na.rm = TRUE), .groups = "drop")

ggplot(fare_gender_class, aes(x = pclass_num, y = Avg_Fare,
                               color = sex, group = sex)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  labs(title = "Average Fare by Class and Gender",
       x = "Passenger Class",
       y = "Average Fare (£)",
       color = "Gender") +
  theme_minimal()

Interpretation:

Female passengers paid higher average fares than males across all passenger classes.
The gap between male and female fares is most pronounced in 1st class.
Both gender lines show a sharp declining trend from 1st to 3rd class.
This multi-line chart clearly reveals the combined influence of gender and class on fare.


Question 26: Does fare differ significantly across passenger classes? (One-Way ANOVA)

anova_fare <- aov(fare ~ pclass, data = titanic)
summary(anova_fare)
##              Df  Sum Sq Mean Sq F value Pr(>F)    
## pclass        2  776030  388015   242.3 <2e-16 ***
## Residuals   888 1421769    1601                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation:

The one-way ANOVA tests whether mean fare differs significantly across the three passenger classes.
The p-value is below 2e-16, far under 0.05, so the differences in mean fare between classes are statistically significant.
This confirms that passenger class is a strong determinant of how much a ticket cost.
The F-statistic of 242.3 is the ratio of between-group variance to within-group variance, and a value this large supports rejecting the null hypothesis of equal mean fares.
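
The F-statistic can be rebuilt by hand from the sums of squares printed in the ANOVA table. The sketch below uses only the values reported above; the eta-squared line is an extra effect-size measure that the report itself does not compute, and the variable names are illustrative.

```r
# Values copied from the ANOVA table above
ss_between <- 776030    # Sum Sq for pclass
df_between <- 2
ss_within  <- 1421769   # Sum Sq for Residuals
df_within  <- 888

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_between / ms_within

# Eta-squared (not in the report): share of fare variance attributable to class
eta_sq <- ss_between / (ss_between + ss_within)

cat("F statistic:", round(f_stat, 1), "\n")
cat("Eta-squared:", round(eta_sq, 3), "\n")
```

The recomputed F agrees with the printed 242.3, and an eta-squared of about 0.35 indicates that class accounts for roughly a third of the variation in fare.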


Question 27: Does age differ significantly across embarkation ports? (One-Way ANOVA)

titanic_emb_age <- titanic %>%
  filter(!is.na(embarked) & embarked != "" & !is.na(age))

anova_age_emb <- aov(age ~ embarked, data = titanic_emb_age)
summary(anova_age_emb)
##              Df Sum Sq Mean Sq F value Pr(>F)
## embarked      2    268   133.9   0.637  0.529
## Residuals   709 149074   210.3

Interpretation:

The ANOVA tests whether passengers from different embarkation ports (C, Q, S) had significantly different average ages.
With F = 0.637 and p = 0.529, the result is not significant, so there is no evidence that the port of boarding is associated with the age profile of passengers.
The small differences in average age seen in Question 24 therefore do not support the idea that Cherbourg attracted notably older travelers.
Because the p-value is well above 0.05, the age differences across ports are compatible with random variation rather than a true effect.


Question 28: Can fare predict passenger survival? (Simple Linear Regression)

model_slr <- lm(survived ~ fare, data = titanic)
summary(model_slr)
## 
## Call:
## lm(formula = survived ~ fare, data = titanic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9653 -0.3391 -0.3222  0.6044  0.6973 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.3026994  0.0187849  16.114  < 2e-16 ***
## fare        0.0025195  0.0003174   7.939 6.12e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4705 on 889 degrees of freedom
## Multiple R-squared:  0.06621,    Adjusted R-squared:  0.06516 
## F-statistic: 63.03 on 1 and 889 DF,  p-value: 6.12e-15
ggplot(titanic, aes(x = fare, y = survived)) +
  geom_point(alpha = 0.3, color = "steelblue", size = 1.5) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Simple Linear Regression: Fare vs Survival",
       x = "Fare (£)",
       y = "Survived (0 = No, 1 = Yes)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Interpretation:

The simple linear regression model uses fare as the sole predictor of survival.
The fare coefficient is positive (about 0.0025 per £1) and significant (p = 6.12e-15), so higher-paying passengers had a higher predicted chance of survival.
The R-squared value of 0.066 means that fare alone explains about 6.6% of the variation in survival.
While the relationship is statistically significant, fare alone explains only a small portion of survival variance.
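
The p-value for fare in this model (6.12e-15) matches the correlation test in Question 19, which is expected: with a single predictor, the slope test and the Pearson correlation test are the same test, and R-squared equals r squared. The sketch below checks this on `mtcars`, a dataset that ships with R and is used here only as a stand-in, with its 0/1 column `am` playing the role of `survived` and `mpg` the role of `fare`.

```r
# Stand-in data: mtcars is built into R; am is a 0/1 variable like survived
ct  <- cor.test(mtcars$mpg, mtcars$am, method = "pearson")
fit <- summary(lm(am ~ mpg, data = mtcars))

p_cor   <- ct$p.value                            # p-value of the correlation test
p_slope <- fit$coefficients["mpg", "Pr(>|t|)"]   # p-value of the regression slope
r_sq    <- unname(ct$estimate)^2                 # squared Pearson correlation

# Both comparisons should print TRUE
print(all.equal(p_cor, p_slope))
print(all.equal(r_sq, fit$r.squared))
```

The regression therefore adds an interpretable slope and intercept but no new evidence beyond the correlation already reported.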


Question 29: How do age, fare, and class together predict survival? (Multiple Linear Regression)

titanic_mlr <- titanic %>%
  filter(!is.na(age) & !is.na(fare)) %>%
  mutate(pclass_num = as.numeric(as.character(pclass)))

model_mlr <- lm(survived ~ age + fare + pclass_num, data = titanic_mlr)
summary(model_mlr)
## 
## Call:
## lm(formula = survived ~ age + fare + pclass_num, data = titanic_mlr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9932 -0.3159 -0.1721  0.4141  1.0578 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.1625550  0.0883058  13.165  < 2e-16 ***
## age         -0.0079608  0.0012472  -6.383 3.14e-10 ***
## fare         0.0005808  0.0003822   1.520    0.129    
## pclass_num  -0.2414791  0.0258451  -9.343  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4451 on 710 degrees of freedom
## Multiple R-squared:  0.1831, Adjusted R-squared:  0.1796 
## F-statistic: 53.04 on 3 and 710 DF,  p-value: < 2.2e-16
results_mlr <- data.frame(
  Actual    = titanic_mlr$survived,
  Predicted = predict(model_mlr, newdata = titanic_mlr)
)

ggplot(results_mlr, aes(x = Actual, y = Predicted)) +
  geom_point(color = "darkgreen", size = 2, alpha = 0.4) +
  geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1.2) +
  labs(title = "Multiple Linear Regression: Actual vs Predicted Survival",
       x = "Actual Survival (0 = No, 1 = Yes)",
       y = "Predicted Survival Score") +
  theme_minimal()

Interpretation:

The multiple linear regression model combines age, fare, and passenger class to predict survival.
R-squared rises to 0.183 from 0.066 for fare alone, although the two models are fitted on different numbers of rows because this one drops passengers with missing age.
A higher class number (3rd class) and older age both significantly lower predicted survival, while fare is no longer significant (p = 0.129) once class is in the model.
In the Actual vs Predicted plot the points form two vertical bands at 0 and 1 because actual survival is binary; predictions nearer to a passenger's actual value indicate a better fit.
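
A linear model fitted to a 0/1 outcome can produce predictions outside the 0 to 1 range; the maximum residual of 1.0578 above implies at least one fitted value below 0. Logistic regression is a common alternative that keeps predictions between 0 and 1. It is not the method used in this report, and the sketch below only illustrates the difference on the built-in `mtcars` data, with `am` standing in for `survived`.

```r
# Stand-in data: mtcars is built into R; am is a 0/1 outcome like survived
lpm   <- lm(am ~ mpg, data = mtcars)                      # linear model on a binary outcome
logit <- glm(am ~ mpg, data = mtcars, family = binomial)  # logistic regression

p_lm  <- predict(lpm)                        # fitted values, unbounded
p_glm <- predict(logit, type = "response")   # fitted probabilities

cat("Linear model range:  ", round(range(p_lm), 3), "\n")
cat("Logistic model range:", round(range(p_glm), 3), "\n")
```

On this stand-in data the linear fit strays below 0 and above 1, while the logistic fit stays inside the interval, so its predictions can be read directly as probabilities.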


Question 30: Does age have a non-linear relationship with fare? (Polynomial Regression)

titanic_poly <- titanic %>% filter(!is.na(age) & !is.na(fare))

model_poly <- lm(fare ~ poly(age, 2), data = titanic_poly)
summary(model_poly)
## 
## Call:
## lm(formula = fare ~ poly(age, 2), data = titanic_poly)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46.96 -24.09 -17.90   1.59 476.42 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     34.695      1.974  17.578   <2e-16 ***
## poly(age, 2)1  135.747     52.740   2.574   0.0103 *  
## poly(age, 2)2   24.169     52.740   0.458   0.6469    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.74 on 711 degrees of freedom
## Multiple R-squared:  0.009521,   Adjusted R-squared:  0.006735 
## F-statistic: 3.417 on 2 and 711 DF,  p-value: 0.03334
ggplot(titanic_poly, aes(x = age, y = fare)) +
  geom_point(color = "blue", size = 2, alpha = 0.4) +
  stat_smooth(method = "lm",
              formula = y ~ x + I(x^2),
              color = "red",
              linewidth = 1.5,
              se = TRUE) +
  labs(title = "Polynomial Regression: Age vs Fare (Degree 2)",
       x = "Age",
       y = "Fare (£)") +
  theme_minimal()

Interpretation:

The polynomial regression of degree 2 allows for a curved relationship between age and fare.
The quadratic term is not statistically significant (p = 0.647), so there is no evidence that the age-fare relationship is non-linear.
The model explains less than 1% of the variation in fare (R-squared = 0.0095), so the data give little support to the idea that middle-aged passengers paid higher fares than the very young or very old.
Polynomial regression is more flexible than simple linear regression when a relationship is curved, but here the extra flexibility adds little.
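
Whether a quadratic term is worth keeping can also be judged by comparing the linear and quadratic fits as nested models. The sketch below does this on the built-in `mtcars` data as a stand-in (`mpg` against `hp`, not the Titanic variables); the model names are illustrative. When exactly one term is added, the F-test from `anova()` gives the same p-value as the t-test on that term.

```r
# Stand-in data: mtcars is built into R
m_linear <- lm(mpg ~ hp, data = mtcars)           # straight-line fit
m_quad   <- lm(mpg ~ poly(hp, 2), data = mtcars)  # adds a quadratic term

# F-test: does the quadratic model fit significantly better than the linear one?
cmp     <- anova(m_linear, m_quad)
p_ftest <- cmp[["Pr(>F)"]][2]

# t-test on the degree-2 coefficient of the quadratic model
p_quad <- coef(summary(m_quad))[3, "Pr(>|t|)"]

# Should print TRUE: both tests answer the same question
print(all.equal(p_ftest, p_quad))
```

Applied to the age-fare models above, the same comparison would be expected to echo the non-significant quadratic coefficient in the summary output.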