Introduction

This project continues the analysis of the Titanic dataset with 30 new analytical questions. The assessment focuses on visualization techniques (histograms, boxplots, scatter plots, and line charts) together with correlation analysis, one-way ANOVA, and linear and polynomial regression. These methods help uncover deeper patterns related to passenger survival, demographics, and socio-economic factors. The objective is to extract meaningful insights by applying multiple statistical and graphical methods to the same dataset.

Load Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.5.3
if (!require("corrplot", quietly = TRUE)) install.packages("corrplot")
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
library(corrplot)
if (!require("moments", quietly = TRUE)) install.packages("moments")
library(moments)

Load Dataset

titanic <- read.csv("Titanic.csv", stringsAsFactors = FALSE)
head(titanic)
##   Unnamed..0 survived pclass    sex age sibsp parch    fare embarked class
## 1          0        0      3   male  22     1     0  7.2500        S Third
## 2          1        1      1 female  38     1     0 71.2833        C First
## 3          2        1      3 female  26     0     0  7.9250        S Third
## 4          3        1      1 female  35     1     0 53.1000        S First
## 5          4        0      3   male  35     0     0  8.0500        S Third
## 6          5        0      3   male  NA     0     0  8.4583        Q Third
##     who adult_male deck embark_town alive alone
## 1   man       True      Southampton    no False
## 2 woman      False    C   Cherbourg   yes False
## 3 woman      False      Southampton   yes  True
## 4 woman      False    C Southampton   yes False
## 5   man       True      Southampton    no  True
## 6   man       True       Queenstown    no  True

Data Cleaning & Overview

tc <- titanic
tc[tc == ""] <- NA

str(tc)
## 'data.frame':    1000 obs. of  16 variables:
##  $ Unnamed..0 : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ sex        : chr  "male" "female" "female" "female" ...
##  $ age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ sibsp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ embarked   : chr  "S" "C" "S" "S" ...
##  $ class      : chr  "Third" "First" "Third" "First" ...
##  $ who        : chr  "man" "woman" "woman" "woman" ...
##  $ adult_male : chr  "True" "False" "False" "False" ...
##  $ deck       : chr  NA "C" NA "C" ...
##  $ embark_town: chr  "Southampton" "Cherbourg" "Southampton" "Southampton" ...
##  $ alive      : chr  "no" "yes" "yes" "yes" ...
##  $ alone      : chr  "False" "False" "True" "False" ...
summary(tc)
##    Unnamed..0       survived         pclass          sex           
##  Min.   :  0.0   Min.   :0.000   Min.   :1.000   Length:1000       
##  1st Qu.:249.8   1st Qu.:0.000   1st Qu.:2.000   Class :character  
##  Median :499.5   Median :0.000   Median :3.000   Mode  :character  
##  Mean   :499.5   Mean   :0.392   Mean   :2.315                     
##  3rd Qu.:749.2   3rd Qu.:1.000   3rd Qu.:3.000                     
##  Max.   :999.0   Max.   :1.000   Max.   :3.000                     
##                                                                    
##       age            sibsp           parch           fare        
##  Min.   : 0.42   Min.   :0.000   Min.   :0.00   Min.   :  0.000  
##  1st Qu.:20.00   1st Qu.:0.000   1st Qu.:0.00   1st Qu.:  7.896  
##  Median :28.00   Median :0.000   Median :0.00   Median : 14.068  
##  Mean   :29.59   Mean   :0.518   Mean   :0.38   Mean   : 31.708  
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.00   3rd Qu.: 30.696  
##  Max.   :80.00   Max.   :8.000   Max.   :6.00   Max.   :512.329  
##  NA's   :197                                                     
##    embarked            class               who             adult_male       
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      deck           embark_town           alive              alone          
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
## 
dim(tc)
## [1] 1000   16
colSums(is.na(tc))
##  Unnamed..0    survived      pclass         sex         age       sibsp 
##           0           0           0           0         197           0 
##       parch        fare    embarked       class         who  adult_male 
##           0           0           2           0           0           0 
##        deck embark_town       alive       alone 
##         769           2           0           0
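
The missing-value counts above are easier to compare as percentages of the 1000 rows. A minimal sketch, with the counts copied by hand from the `colSums(is.na(tc))` output rather than recomputed from the file:

```r
# Counts copied from the colSums(is.na(tc)) output; n_rows from dim(tc).
n_rows <- 1000
na_counts <- c(age = 197, embarked = 2, deck = 769, embark_town = 2)

# Share of rows with a missing value, in percent.
na_percent <- round(100 * na_counts / n_rows, 1)
print(na_percent)
```

Age is missing in roughly a fifth of the rows, which is why the age-based plots below each report 197 removed rows.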

Prepare Helper Variables

titanic_age <- titanic %>%
  mutate(Age_Group = case_when(
    age < 18  ~ "Child",
    age >= 18 & age < 60 ~ "Adult",
    age >= 60 ~ "Senior",
    TRUE ~ "Unknown"
  ))

titanic_family <- titanic %>%
  mutate(Family_Size = sibsp + parch + 1)

Scenario Based Analysis


Question 1: What is the age distribution of Titanic passengers?

ggplot(titanic, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Age Distribution of Titanic Passengers",
       x = "Age",
       y = "Frequency")
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_bin()`).

Interpretation:

The histogram shows that the majority of passengers were between 20 and 40 years old.
There is a small peak for young children under 10 years of age.
The distribution is slightly right-skewed, indicating fewer elderly passengers.
Most travelers were working-age adults.


Question 2: What is the fare distribution among all passengers?

ggplot(titanic, aes(x = fare)) +
  geom_histogram(binwidth = 10, fill = "coral", color = "black", alpha = 0.7) +
  labs(title = "Fare Distribution of Titanic Passengers",
       x = "Fare (£)",
       y = "Frequency")

Interpretation:

The fare distribution is heavily right-skewed, with most passengers paying low fares.
A small number of passengers paid very high fares, creating a long right tail.
This reflects the wide economic gap between different passenger classes.
Lower-fare passengers dominated the overall composition of the Titanic.


Question 3: How are passenger ages distributed across different classes?

ggplot(titanic, aes(x = age, fill = factor(pclass))) +
  geom_histogram(binwidth = 5, color = "black", alpha = 0.7) +
  facet_wrap(~pclass) +
  labs(title = "Age Distribution by Passenger Class",
       x = "Age",
       y = "Frequency",
       fill = "Class")
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_bin()`).

Interpretation:

All three passenger classes show a peak concentration in the 20–40 age range.
1st class passengers tend to be slightly older on average compared to other classes.
3rd class has the highest proportion of young adults and children.
Age distribution varies meaningfully across the three passenger classes.


Question 4: What is the family size distribution among passengers?

ggplot(titanic_family, aes(x = Family_Size)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "black", alpha = 0.8) +
  labs(title = "Family Size Distribution of Titanic Passengers",
       x = "Family Size",
       y = "Frequency")

Interpretation:

The majority of passengers traveled alone, reflected by the dominant peak at family size 1.
Two-person families were the second most common group onboard.
Very large families of 7 or more members were extremely rare.
Solo travel was the most dominant travel pattern aboard the Titanic.


Question 5: What is the age distribution with a density curve, and is it skewed?

ggplot(titanic, aes(x = age, y = after_stat(density))) +
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1.5, adjust = 1.5) +
  labs(title = "Age Distribution with Density Curve",
       x = "Age",
       y = "Density")
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_density()`).

skewness(titanic$age, na.rm = TRUE)
## [1] 0.39133

Interpretation:

The density curve overlaid on the histogram shows a slightly right-skewed age distribution.
The skewness of about 0.39 is positive but small, indicating a mild right skew.
The longer tail is on the older side: most passengers were young adults, with progressively fewer older travelers.
The distribution departs only slightly from a symmetric bell shape.
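
For reference, the moment-based skewness can be written out in base R. This is a sketch on a small made-up vector (`ages_toy` is hypothetical, not the Titanic data), assuming the usual definition: the third central moment divided by the second central moment raised to the power 3/2, which is the definition `moments::skewness()` is understood to use.

```r
# Hypothetical toy ages with a longer right tail; NA mimics a missing age.
ages_toy <- c(2, 18, 21, 22, 24, 25, 28, 30, 36, 45, 62, NA)

skew_manual <- function(x) {
  x  <- x[!is.na(x)]            # same effect as na.rm = TRUE
  d  <- x - mean(x)
  m2 <- mean(d^2)               # second central moment (divides by n)
  m3 <- mean(d^3)               # third central moment
  m3 / m2^(3 / 2)
}

skew_toy <- skew_manual(ages_toy)
print(skew_toy)   # positive here because the tail extends to the right
```

A perfectly symmetric sample gives 0, so the sign tells you which side the longer tail is on and the magnitude how pronounced it is.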


Question 6: How does age vary across passenger classes?

titanic$pclass <- as.factor(titanic$pclass)

ggplot(titanic, aes(x = pclass, y = age, fill = pclass)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16) +
  labs(title = "Age Distribution by Passenger Class",
       x = "Passenger Class",
       y = "Age") +
  theme_minimal()
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Interpretation:

1st class passengers have a higher median age compared to 2nd and 3rd class passengers.
3rd class passengers are generally younger, with a lower median age.
Outliers are visible in all classes, representing a few very elderly passengers.
Socio-economic class and passenger age appear to be meaningfully related.
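
The red points in the boxplot are values that fall beyond the whiskers. A sketch of that rule on made-up ages (`ages_toy` is hypothetical), assuming the `geom_boxplot()` default of whiskers reaching at most 1.5 × IQR beyond the quartiles:

```r
# Hypothetical toy ages, not taken from the Titanic file.
ages_toy <- c(2, 20, 22, 25, 28, 30, 35, 38, 40, 80)

q1  <- quantile(ages_toy, 0.25, names = FALSE)
q3  <- quantile(ages_toy, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: anything outside them is drawn as an individual outlier point.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- ages_toy[ages_toy < lower_fence | ages_toy > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

Because the fences depend on each group's own quartiles, the same age can be an outlier in one class and an ordinary value in another.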


Question 7: How does fare differ between survivors and non-survivors?

ggplot(titanic, aes(x = factor(survived), y = fare, fill = factor(survived))) +
  geom_boxplot(outlier.colour = "darkred", outlier.shape = 16) +
  labs(title = "Fare Distribution by Survival Status",
       x = "Survived (0 = No, 1 = Yes)",
       y = "Fare (£)",
       fill = "Survived") +
  theme_minimal()

Interpretation:

Survivors paid significantly higher median fares compared to non-survivors.
The fare distribution for survivors has a wider spread and more high-value outliers.
This suggests that wealthier passengers had a higher likelihood of survival.
Fare is a strong indicator of both socio-economic status and survival advantage.


Question 8: How does age distribution differ by gender?

ggplot(titanic, aes(x = sex, y = age, fill = sex)) +
  geom_boxplot(position = position_dodge(width = 0.8),
               outlier.colour = "red", alpha = 0.8) +
  labs(title = "Age Distribution by Gender",
       x = "Gender",
       y = "Age") +
  theme_minimal()
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Interpretation:

Male and female passengers show similar median age values onboard.
Female passengers display slightly less age variability than male passengers.
Both genders contain outliers, indicating a few very elderly individuals.
Gender alone does not strongly differentiate the age distribution of passengers.


Question 9: How does fare vary by embarkation port?

titanic_emb <- titanic %>% filter(!is.na(embarked) & embarked != "")

ggplot(titanic_emb, aes(x = embarked, y = fare, fill = embarked)) +
  geom_boxplot(outlier.colour = "blue", outlier.shape = 16) +
  labs(title = "Fare Distribution by Embarkation Port",
       x = "Port (C = Cherbourg, Q = Queenstown, S = Southampton)",
       y = "Fare (£)") +
  theme_minimal()

Interpretation:

Cherbourg passengers paid the highest median fares, suggesting more 1st class travelers boarded there.
Queenstown passengers had the lowest median fares, indicating mostly lower-class travelers.
Southampton shows a wide fare range, reflecting passengers from all classes.
Embarkation port is a meaningful indicator of passenger socio-economic background.


Question 10: How does fare distribution compare across classes and survival status?

ggplot(titanic, aes(x = factor(survived), y = fare, fill = factor(survived))) +
  geom_boxplot(alpha = 0.8) +
  facet_wrap(~pclass) +
  labs(title = "Fare Distribution by Class and Survival Status",
       x = "Survived (0 = No, 1 = Yes)",
       y = "Fare (£)",
       fill = "Survived") +
  theme_minimal()

Interpretation:

Survivors tend to have higher median fares than non-survivors, although the size of the gap varies by class.
1st class survivors paid considerably higher fares, showing a strong wealth-survival link.
3rd class shows little difference in fare between survivors and non-survivors.
The survival-fare relationship is most pronounced among 1st class passengers.


Question 11: Is there a relationship between age and fare paid?

ggplot(titanic, aes(x = age, y = fare)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  labs(title = "Scatter Plot: Age vs Fare",
       x = "Age",
       y = "Fare (£)")
## Warning: Removed 197 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation:

The scatter plot shows no strong linear relationship between passenger age and fare paid.
High fares are distributed across passengers of various age groups.
A few extreme fare outliers are visible, mostly for middle-aged passengers.
Age alone is not a strong predictor of the fare a passenger paid.


Question 12: How does the age-fare relationship differ between survivors and non-survivors?

ggplot(titanic, aes(x = age, y = fare, color = factor(survived))) +
  geom_point(size = 2, alpha = 0.6) +
  labs(title = "Age vs Fare by Survival Status",
       x = "Age",
       y = "Fare (£)",
       color = "Survived") +
  theme_minimal()
## Warning: Removed 197 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation:

Survivors are more concentrated in the higher fare range across all age groups.
Non-survivors are densely clustered in the low-fare and younger-age region.
A few older survivors paid extremely high fares, suggesting 1st class membership.
Considering both age and fare together provides a stronger survival signal.


Question 13: How does the age-fare relationship vary across passenger classes?

ggplot(titanic, aes(x = age, y = fare, color = pclass)) +
  geom_point(size = 2, alpha = 0.6) +
  facet_wrap(~pclass) +
  labs(title = "Age vs Fare by Passenger Class",
       x = "Age",
       y = "Fare (£)",
       color = "Class") +
  theme_minimal()
## Warning: Removed 197 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation:

1st class passengers have a wide range of fares spread across all ages.
3rd class passengers are tightly clustered at low fare values regardless of age.
2nd class shows moderate fares with limited variation across age groups.
Passenger class clearly defines the fare structure observed within each age group.


Question 14: How does family size relate to fare paid by passengers?

ggplot(titanic_family, aes(x = Family_Size, y = fare, color = factor(survived))) +
  geom_point(size = 2, alpha = 0.6) +
  labs(title = "Family Size vs Fare",
       x = "Family Size",
       y = "Fare (£)",
       color = "Survived") +
  theme_minimal()

Interpretation:

Solo travelers span the full range of fare amounts from very low to very high.
Medium-sized families of 2 to 4 members tend to cluster in the lower fare range.
Some small-family survivors paid relatively high fares, indicating higher class.
Family size alone does not show a clear positive relationship with fare paid.


Question 15: What is the relationship between age and fare across gender groups with a regression line?

ggplot(titanic, aes(x = age, y = fare, color = sex)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Age vs Fare by Gender with Regression Line",
       x = "Age",
       y = "Fare (£)",
       color = "Gender") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 197 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation:

The regression lines for both genders show a very slight upward trend with age.
Female passengers show a marginally higher association with fares at older ages.
The overall slope is nearly flat, confirming a weak correlation between age and fare.
Gender interacts only mildly with age when predicting the fare paid.


Question 16: What is the Pearson correlation between age and fare?

titanic_clean <- titanic %>% filter(!is.na(age) & !is.na(fare))
cor_result <- cor.test(titanic_clean$age, titanic_clean$fare, method = "pearson")
cat("Pearson Correlation Coefficient (Age vs Fare):", cor_result$estimate, "\n")
## Pearson Correlation Coefficient (Age vs Fare): 0.108268
cat("p-value:", cor_result$p.value, "\n")
## p-value: 0.00212434

Interpretation:

The Pearson correlation coefficient between age and fare is about 0.11, which is close to zero.
This indicates a very weak positive linear relationship between passenger age and fare paid.
The p-value of about 0.002 is below 0.05, so the relationship is statistically significant despite being weak.
Age accounts for only about 1% of the variation in fare (0.11² ≈ 0.012), so it cannot reliably predict how much a passenger paid for their ticket.
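
The p-value can be reproduced from the coefficient and the sample size alone. A short sketch, assuming `cor.test()` uses the standard t-test for a Pearson correlation; the values are copied from the output above, and the sample size is the 1000 rows minus the 197 missing ages:

```r
# Values copied from the cor.test() output above; fare has no missing values.
r <- 0.108268
n <- 1000 - 197

# Test statistic for H0: true correlation is zero, with n - 2 degrees of freedom.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, p = p_value))
```

With about 800 observations even a correlation near 0.1 produces a small p-value, which is why statistical significance and practical strength need to be read separately.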


Question 17: What is the correlation matrix of key numeric variables?

titanic_numeric <- titanic %>%
  mutate(pclass_num = as.numeric(as.character(pclass))) %>%
  select(survived, pclass_num, age, sibsp, parch, fare) %>%
  na.omit()

cor_matrix <- cor(titanic_numeric)
print(round(cor_matrix, 2))
##            survived pclass_num   age sibsp parch  fare
## survived       1.00      -0.35 -0.05  0.00  0.12  0.26
## pclass_num    -0.35       1.00 -0.36  0.04  0.03 -0.56
## age           -0.05      -0.36  1.00 -0.25 -0.17  0.11
## sibsp          0.00       0.04 -0.25  1.00  0.33  0.14
## parch          0.12       0.03 -0.17  0.33  1.00  0.19
## fare           0.26      -0.56  0.11  0.14  0.19  1.00

Interpretation:

The correlation matrix summarizes the pairwise linear relationships among the key numeric variables, using only rows with no missing values.
Fare and survival show a positive correlation (0.26), indicating that passengers who paid more survived more often.
Passenger class and fare have the strongest correlation in the matrix (-0.56); it is negative because higher classes have lower class numbers.
Sibsp and parch show a moderate positive correlation (0.33), as both measure family travel.


Question 18: How strongly are number of siblings/spouses and parents/children correlated?

cor_sibsp_parch <- cor.test(titanic$sibsp, titanic$parch, method = "pearson")
cat("Pearson Correlation (sibsp vs parch):", cor_sibsp_parch$estimate, "\n")
## Pearson Correlation (sibsp vs parch): 0.3684433
cat("p-value:", cor_sibsp_parch$p.value, "\n")
## p-value: 1.644771e-33

Interpretation:

There is a moderate positive correlation (about 0.37) between the siblings/spouses and parents/children counts.
Passengers with a higher sibsp value tend to have a higher parch value as well.
This suggests that passengers traveling with family were often accompanied by more than one kind of relative.
The very small p-value (about 1.6e-33) shows that the correlation is statistically significant.
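
The coefficient here (about 0.37) differs from the 0.33 shown for the same pair in the Question 17 matrix. That matrix was computed after `na.omit()` dropped every row with a missing age, whereas this test uses all rows. A sketch of the effect on made-up data (`toy` is hypothetical):

```r
# Hypothetical toy data; NA in age mimics the missing ages in the real file.
toy <- data.frame(
  age   = c(22, NA, 30, NA, 40, 50),
  sibsp = c(1, 3, 0, 4, 1, 0),
  parch = c(0, 2, 0, 1, 1, 0)
)

# All rows: the situation in cor.test(titanic$sibsp, titanic$parch).
r_all <- cor(toy$sibsp, toy$parch)

# Complete rows only: what cor() sees after na.omit() on the selected columns.
r_complete <- cor(na.omit(toy))["sibsp", "parch"]

print(c(all_rows = r_all, complete_rows = r_complete))
```

Neither number is wrong; they describe different subsets, so it helps to state which rows a reported correlation is based on.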


Question 19: What is the correlation between fare and survival status?

cor_fare_survived <- cor.test(titanic$fare, titanic$survived, method = "pearson")
cat("Pearson Correlation (Fare vs Survived):", cor_fare_survived$estimate, "\n")
## Pearson Correlation (Fare vs Survived): 0.2511494
cat("p-value:", cor_fare_survived$p.value, "\n")
## p-value: 7.526333e-16

Interpretation:

There is a modest positive correlation (about 0.25) between fare paid and survival, consistent with the earlier boxplots.
Passengers who paid higher fares were more likely to survive the disaster.
The p-value (about 7.5e-16) shows that this relationship is statistically significant.
Fare is a proxy for economic status, so the result is consistent with wealthier passengers having had a measurable survival advantage.
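
Because survival is coded 0/1, this Pearson coefficient is directly tied to the simple regression of survival on fare: squaring it gives that regression's R-squared. A short check using the value printed above:

```r
# Correlation copied from the cor.test() output above.
r_fare_survived <- 0.2511494

# Share of variance in survival that is linearly associated with fare.
r_squared <- r_fare_survived^2
print(round(r_squared, 5))
```

The result agrees with the Multiple R-squared of 0.06308 reported for `lm(survived ~ fare)` in Question 28, so fare alone accounts for roughly 6% of the variation in survival.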


Question 20: How does the correlation matrix look visually for all numeric variables?

corrplot(cor_matrix,
         method = "color",
         addCoef.col = "black",
         number.cex = 0.7,
         col = colorRampPalette(c("red", "white", "blue"))(200),
         tl.col = "black",
         tl.srt = 45,
         mar = c(0, 0, 2, 0),
         title = "Titanic - Numeric Variable Correlation Matrix")

Interpretation:

The visual correlation matrix clearly highlights positive and negative relationships.
Blue cells indicate positive correlations while red cells indicate negative correlations.
Fare and passenger class have the strongest visible negative correlation in the matrix.
Survival shows positive blue shading with fare and negative red shading with class.


Question 21: How does average fare change across passenger classes?

fare_by_class <- titanic %>%
  mutate(pclass_num = as.numeric(as.character(pclass))) %>%
  group_by(pclass_num) %>%
  summarise(Avg_Fare = mean(fare, na.rm = TRUE))

ggplot(fare_by_class, aes(x = pclass_num, y = Avg_Fare)) +
  geom_line(color = "blue", linewidth = 1.2) +
  geom_point(color = "red", size = 3) +
  labs(title = "Average Fare by Passenger Class",
       x = "Passenger Class",
       y = "Average Fare (£)") +
  theme_minimal()

Interpretation:

Average fare drops sharply from 1st class to 2nd and 3rd class passengers.
1st class passengers paid the highest average fares by a very large margin.
The decline from class 2 to class 3 is smaller but still clearly visible.
This line chart confirms a clear hierarchy of spending power across classes.


Question 22: How does survival rate vary across age groups?

survival_by_age <- titanic_age %>%
  filter(Age_Group != "Unknown") %>%
  group_by(Age_Group) %>%
  summarise(Survival_Rate = mean(survived, na.rm = TRUE))

survival_by_age$Age_Group <- factor(survival_by_age$Age_Group,
                                     levels = c("Child", "Adult", "Senior"))

ggplot(survival_by_age, aes(x = Age_Group, y = Survival_Rate, group = 1)) +
  geom_line(color = "darkgreen", linewidth = 1.2) +
  geom_point(color = "orange", size = 3) +
  labs(title = "Survival Rate by Age Group",
       x = "Age Group",
       y = "Survival Rate") +
  theme_minimal()

Interpretation:

Children have the highest survival rate among all three age groups.
Survival rates decrease progressively from children to adults to seniors.
This trend is consistent with priority being given to younger passengers during the evacuation.
Senior passengers faced the lowest survival chances of any age group.


Question 23: How does average fare trend across age groups?

fare_by_age <- titanic_age %>%
  filter(Age_Group != "Unknown") %>%
  group_by(Age_Group) %>%
  summarise(Avg_Fare = mean(fare, na.rm = TRUE))

fare_by_age$Age_Group <- factor(fare_by_age$Age_Group,
                                 levels = c("Child", "Adult", "Senior"))

ggplot(fare_by_age, aes(x = Age_Group, y = Avg_Fare, group = 1)) +
  geom_line(color = "purple", linewidth = 1.2) +
  geom_point(color = "red", size = 3) +
  labs(title = "Average Fare by Age Group",
       x = "Age Group",
       y = "Average Fare (£)") +
  theme_minimal()

Interpretation:

Senior passengers paid the highest average fares among all age groups.
Children and adults paid comparatively lower average fares.
This suggests that older passengers may have been more likely to travel in 1st class.
Age group provides a useful lens for understanding fare patterns across the dataset.


Question 24: How does the average age of passengers vary across embarkation towns?

age_by_town <- titanic %>%
  filter(!is.na(embark_town) & embark_town != "" & !is.na(age)) %>%
  group_by(embark_town) %>%
  summarise(Avg_Age = mean(age, na.rm = TRUE))

ggplot(age_by_town, aes(x = embark_town, y = Avg_Age, group = 1)) +
  geom_line(color = "brown", linewidth = 1.2) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Average Passenger Age by Embarkation Town",
       x = "Embarkation Town",
       y = "Average Age (Years)") +
  theme_minimal()

Interpretation:

Average passenger age differs somewhat across the three embarkation towns.
Cherbourg passengers tend to be older on average than those from the other ports.
Queenstown shows the youngest average passenger age among the three towns.
These are descriptive differences in group means only; the one-way ANOVA in Question 27 (p = 0.208) finds that they are not statistically significant.


Question 25: How do average fares compare between male and female passengers across classes?

fare_gender_class <- titanic %>%
  mutate(pclass_num = as.numeric(as.character(pclass))) %>%
  group_by(pclass_num, sex) %>%
  summarise(Avg_Fare = mean(fare, na.rm = TRUE), .groups = "drop")

ggplot(fare_gender_class, aes(x = pclass_num, y = Avg_Fare,
                               color = sex, group = sex)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  labs(title = "Average Fare by Class and Gender",
       x = "Passenger Class",
       y = "Average Fare (£)",
       color = "Gender") +
  theme_minimal()

Interpretation:

Female passengers paid higher average fares than males across all passenger classes.
The gap between male and female fares is most pronounced in 1st class.
Both gender lines show a sharp declining trend from 1st to 3rd class.
This multi-line chart clearly reveals the combined influence of gender and class on fare.


Question 26: Does fare differ significantly across passenger classes? (One-Way ANOVA)

anova_fare <- aov(fare ~ pclass, data = titanic)
summary(anova_fare)
##              Df  Sum Sq Mean Sq F value Pr(>F)    
## pclass        2  863977  431989   280.3 <2e-16 ***
## Residuals   997 1536329    1541                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation:

The one-way ANOVA tests whether mean fare differs across the three passenger classes.
The p-value is below 2e-16, far smaller than 0.05, so the differences in mean fare between classes are statistically significant.
This confirms that passenger class is a strong determinant of how much a ticket cost.
The F-statistic of 280.3 is the ratio of the between-group mean square to the within-group mean square, and a value this large supports rejecting the null hypothesis of equal means.
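
The F value can be recomputed from the sums of squares in the table, which also yields an effect size. A sketch using the printed (rounded) numbers; eta-squared is an addition here and is not part of the original output:

```r
# Sums of squares and degrees of freedom copied from summary(anova_fare).
ss_between <- 863977;  df_between <- 2
ss_within  <- 1536329; df_within  <- 997

# F is the ratio of the two mean squares.
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# Eta-squared: share of total fare variation that lies between classes.
eta_sq <- ss_between / (ss_between + ss_within)

print(round(c(F = f_stat, eta_squared = eta_sq), 3))
```

Roughly 36% of the variation in fare lies between classes, so the effect is large in practical terms as well as statistically significant.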


Question 27: Does age differ significantly across embarkation ports? (One-Way ANOVA)

titanic_emb_age <- titanic %>%
  filter(!is.na(embarked) & embarked != "" & !is.na(age))

anova_age_emb <- aov(age ~ embarked, data = titanic_emb_age)
summary(anova_age_emb)
##              Df Sum Sq Mean Sq F value Pr(>F)
## embarked      2    655   327.5   1.572  0.208
## Residuals   798 166267   208.4

Interpretation:

The ANOVA tests whether passengers from different embarkation ports (C, Q, S) had different mean ages.
The p-value of 0.208 is above 0.05, so there is no statistically significant evidence that mean age differs by port.
The differences in average age seen in Question 24 may therefore be due to random variation rather than a true effect.
This result does not support the age part of the earlier observation that Cherbourg attracted older, wealthier travelers; the fare difference between ports is a separate question.
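
The p-value in the table is the upper-tail probability of the F distribution at the observed F value. A sketch using the printed (rounded) numbers, with eta-squared added as an effect size that is not part of the original output:

```r
# F value and degrees of freedom copied from summary(anova_age_emb).
f_stat <- 1.572
df1 <- 2
df2 <- 798

# Upper-tail probability of the F distribution gives the ANOVA p-value.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
print(round(p_value, 3))

# Share of age variation between ports, from the reported sums of squares.
eta_sq <- 655 / (655 + 166267)
print(round(eta_sq, 4))
```

Less than half a percent of the variation in age lies between ports, which is why the test does not reach significance even with about 800 passengers.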


Question 28: Can fare predict passenger survival? (Simple Linear Regression)

model_slr <- lm(survived ~ fare, data = titanic)
summary(model_slr)
## 
## Call:
## lm(formula = survived ~ fare, data = titanic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9708 -0.3477 -0.3320  0.6088  0.6874 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.3126476  0.0178177  17.547  < 2e-16 ***
## fare        0.0025026  0.0003053   8.197 7.53e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.473 on 998 degrees of freedom
## Multiple R-squared:  0.06308,    Adjusted R-squared:  0.06214 
## F-statistic: 67.19 on 1 and 998 DF,  p-value: 7.526e-16
ggplot(titanic, aes(x = fare, y = survived)) +
  geom_point(alpha = 0.3, color = "steelblue", size = 1.5) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Simple Linear Regression: Fare vs Survival",
       x = "Fare (£)",
       y = "Survived (0 = No, 1 = Yes)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Interpretation:

The simple linear regression model uses fare as the sole predictor of survival.
The fare coefficient is positive (about 0.0025 per £1) and significant (p ≈ 7.5e-16), so passengers who paid more had a higher predicted probability of survival.
The R-squared value of 0.063 means that fare alone explains about 6% of the variation in survival.
The relationship is statistically significant, but because survival is a 0/1 outcome a straight-line fit is only a rough approximation, and fare alone explains a small portion of survival variance.
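
One limitation of fitting a straight line to a 0/1 outcome is that predictions are not confined to the 0 to 1 range. The first part of this sketch uses the coefficients printed above; the second part shows the shape of a logistic regression call on simulated data (`toy` is hypothetical and is not the Titanic file), offered as one common alternative rather than as the method used in this report.

```r
# Coefficients copied from the summary(model_slr) output above.
intercept <- 0.3126476
slope     <- 0.0025026

# Predicted "survival" at the maximum fare reported by summary(tc).
pred_at_max <- intercept + slope * 512.329
print(pred_at_max)            # exceeds 1, so it cannot be read as a probability

# Fare at which the fitted line crosses 1.
fare_at_one <- (1 - intercept) / slope
print(fare_at_one)

# Hypothetical simulated data to illustrate glm() with a binomial family.
set.seed(1)
toy <- data.frame(fare = runif(200, 0, 300))
toy$survived <- rbinom(200, 1, plogis(-1 + 0.01 * toy$fare))

model_logit <- glm(survived ~ fare, data = toy, family = binomial)
p_hat <- predict(model_logit, type = "response")
print(range(p_hat))           # fitted probabilities stay inside [0, 1]
```

For most passengers, whose fares are far below the crossing point, the linear fit still gives a usable approximation; the problem shows up only at the extreme fares.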


Question 29: How do age, fare, and class together predict survival? (Multiple Linear Regression)

titanic_mlr <- titanic %>%
  filter(!is.na(age) & !is.na(fare)) %>%
  mutate(pclass_num = as.numeric(as.character(pclass)))

model_mlr <- lm(survived ~ age + fare + pclass_num, data = titanic_mlr)
summary(model_mlr)
## 
## Call:
## lm(formula = survived ~ age + fare + pclass_num, data = titanic_mlr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9517 -0.3178 -0.1969  0.4246  0.9903 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.0897827  0.0843540  12.919  < 2e-16 ***
## age         -0.0065789  0.0011934  -5.513 4.77e-08 ***
## fare         0.0006538  0.0003719   1.758   0.0791 .  
## pclass_num  -0.2239574  0.0247375  -9.053  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4522 on 799 degrees of freedom
## Multiple R-squared:  0.1609, Adjusted R-squared:  0.1577 
## F-statistic: 51.06 on 3 and 799 DF,  p-value: < 2.2e-16
results_mlr <- data.frame(
  Actual    = titanic_mlr$survived,
  Predicted = predict(model_mlr, newdata = titanic_mlr)
)

ggplot(results_mlr, aes(x = Actual, y = Predicted)) +
  geom_point(color = "darkgreen", size = 2, alpha = 0.4) +
  geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1.2) +
  labs(title = "Multiple Linear Regression: Actual vs Predicted Survival",
       x = "Actual Survival (0 = No, 1 = Yes)",
       y = "Predicted Survival Score") +
  theme_minimal()

Interpretation:

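The multiple linear regression model combines age, fare, and passenger class to predict survival.
With all three predictors the R-squared rises to about 0.16, compared with 0.063 for fare alone, although the two models are fitted to different numbers of rows (803 versus 1000).
A higher class number (3rd class) and a higher age both lower predicted survival, and both effects are significant; the fare coefficient is positive but not significant at the 5% level (p = 0.079) once class is in the model.
The Actual vs Predicted plot shows how well the model estimates survival: because actual survival is only 0 or 1, the points form two vertical bands, and predictions near 0 in the left band and near 1 in the right band indicate a good fit.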
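
The fare coefficient in this model has a p-value of 0.0791, and a confidence interval makes the same point in the coefficient's own units. A sketch using the estimate and standard error printed above; on the fitted object, `confint(model_mlr)` would be the direct route.

```r
# Estimate, standard error, and residual df copied from summary(model_mlr).
est_fare <- 0.0006538
se_fare  <- 0.0003719
df_resid <- 799

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit  <- qt(0.975, df = df_resid)
ci_fare <- est_fare + c(-1, 1) * t_crit * se_fare
print(signif(ci_fare, 3))
```

The interval includes zero, so once age and class are accounted for, the data do not establish the direction of the fare effect at the 5% level. This is plausible given the correlation of -0.56 between fare and class in Question 17: the two predictors carry overlapping information.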


Question 30: Does age have a non-linear relationship with fare? (Polynomial Regression)

titanic_poly <- titanic %>% filter(!is.na(age) & !is.na(fare))

model_poly <- lm(fare ~ poly(age, 2), data = titanic_poly)
summary(model_poly)
## 
## Call:
## lm(formula = fare ~ poly(age, 2), data = titanic_poly)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50.61 -23.36 -17.94   1.53 477.07 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     34.108      1.837  18.569  < 2e-16 ***
## poly(age, 2)1  160.387     52.049   3.081  0.00213 ** 
## poly(age, 2)2   38.318     52.049   0.736  0.46183    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.05 on 800 degrees of freedom
## Multiple R-squared:  0.01239,    Adjusted R-squared:  0.009922 
## F-statistic: 5.019 on 2 and 800 DF,  p-value: 0.006823
ggplot(titanic_poly, aes(x = age, y = fare)) +
  geom_point(color = "blue", size = 2, alpha = 0.4) +
  stat_smooth(method = "lm",
              formula = y ~ x + I(x^2),
              color = "red",
              linewidth = 1.5,
              se = TRUE) +
  labs(title = "Polynomial Regression: Age vs Fare (Degree 2)",
       x = "Age",
       y = "Fare (£)") +
  theme_minimal()

Interpretation:

The degree-2 polynomial regression allows for a curved relationship between age and fare.
The quadratic term is not statistically significant (p = 0.46), so there is no evidence that the age-fare relationship is non-linear; only the linear term is significant (p = 0.002).
The model explains just over 1% of the variation in fare (R-squared = 0.012), and the fit gives no support for middle-aged passengers paying more than the very young or very old.
Polynomial regression is more flexible than simple linear regression when a relationship is curved, but that flexibility is not needed here.
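
The model above uses `poly(age, 2)`, which builds orthogonal polynomial terms, while the plot's `stat_smooth()` formula uses raw terms (`x + I(x^2)`). The two describe the same fitted curve. A sketch on simulated data (`toy` is hypothetical, not the Titanic file):

```r
# Hypothetical simulated data with a purely linear age-fare relationship.
set.seed(42)
toy <- data.frame(age = runif(60, 1, 80))
toy$fare <- 20 + 0.3 * toy$age + rnorm(60, sd = 15)

# Orthogonal polynomial basis, as in model_poly.
fit_orth <- lm(fare ~ poly(age, 2), data = toy)
# Raw polynomial terms, as in the stat_smooth() formula.
fit_raw  <- lm(fare ~ age + I(age^2), data = toy)

# Same fitted curve, different coefficient scales.
same_fit <- isTRUE(all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw))))
print(same_fit)
print(coef(fit_orth))
print(coef(fit_raw))
```

The coefficient tables differ because the basis differs, but the test on the highest-order term is the same in both parameterizations, so the p-value for the quadratic term applies to the curvature either way.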