This project continues the analysis of the Titanic dataset with 25 new analytical questions. The focus of this assessment is on advanced visualization techniques including histograms, boxplots, scatter plots, correlation analysis, and line charts. These visualizations help uncover deeper patterns related to passenger survival, demographics, and socio-economic factors. The objective is to extract meaningful insights by applying multiple statistical and graphical methods to the same dataset.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.5.3
if (!require("corrplot", quietly = TRUE)) install.packages("corrplot")
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
library(corrplot)
if (!require("moments", quietly = TRUE)) install.packages("moments")
library(moments)
titanic <- read.csv("Titanic.csv", stringsAsFactors = FALSE)
head(titanic)
## Unnamed..0 survived pclass sex age sibsp parch fare embarked class
## 1 0 0 3 male 22 1 0 7.2500 S Third
## 2 1 1 1 female 38 1 0 71.2833 C First
## 3 2 1 3 female 26 0 0 7.9250 S Third
## 4 3 1 1 female 35 1 0 53.1000 S First
## 5 4 0 3 male 35 0 0 8.0500 S Third
## 6 5 0 3 male NA 0 0 8.4583 Q Third
## who adult_male deck embark_town alive alone
## 1 man True Southampton no False
## 2 woman False C Cherbourg yes False
## 3 woman False Southampton yes True
## 4 woman False C Southampton yes False
## 5 man True Southampton no True
## 6 man True Queenstown no True
tc <- titanic
tc[tc == ""] <- NA
str(tc)
## 'data.frame': 1000 obs. of 16 variables:
## $ Unnamed..0 : int 0 1 2 3 4 5 6 7 8 9 ...
## $ survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ sex : chr "male" "female" "female" "female" ...
## $ age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ sibsp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ embarked : chr "S" "C" "S" "S" ...
## $ class : chr "Third" "First" "Third" "First" ...
## $ who : chr "man" "woman" "woman" "woman" ...
## $ adult_male : chr "True" "False" "False" "False" ...
## $ deck : chr NA "C" NA "C" ...
## $ embark_town: chr "Southampton" "Cherbourg" "Southampton" "Southampton" ...
## $ alive : chr "no" "yes" "yes" "yes" ...
## $ alone : chr "False" "False" "True" "False" ...
summary(tc)
## Unnamed..0 survived pclass sex
## Min. : 0.0 Min. :0.000 Min. :1.000 Length:1000
## 1st Qu.:249.8 1st Qu.:0.000 1st Qu.:2.000 Class :character
## Median :499.5 Median :0.000 Median :3.000 Mode :character
## Mean :499.5 Mean :0.392 Mean :2.315
## 3rd Qu.:749.2 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :999.0 Max. :1.000 Max. :3.000
##
## age sibsp parch fare
## Min. : 0.42 Min. :0.000 Min. :0.00 Min. : 0.000
## 1st Qu.:20.00 1st Qu.:0.000 1st Qu.:0.00 1st Qu.: 7.896
## Median :28.00 Median :0.000 Median :0.00 Median : 14.068
## Mean :29.59 Mean :0.518 Mean :0.38 Mean : 31.708
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.00 3rd Qu.: 30.696
## Max. :80.00 Max. :8.000 Max. :6.00 Max. :512.329
## NA's :197
## embarked class who adult_male
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## deck embark_town alive alone
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
dim(tc)
## [1] 1000 16
colSums(is.na(tc))
## Unnamed..0 survived pclass sex age sibsp
## 0 0 0 0 197 0
## parch fare embarked class who adult_male
## 0 0 2 0 0 0
## deck embark_town alive alone
## 769 2 0 0
titanic_age <- titanic %>%
mutate(Age_Group = case_when(
age < 18 ~ "Child",
age >= 18 & age < 60 ~ "Adult",
age >= 60 ~ "Senior",
TRUE ~ "Unknown"
))
titanic_family <- titanic %>%
mutate(Family_Size = sibsp + parch + 1)
ggplot(titanic, aes(x = age)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black", alpha = 0.7) +
labs(title = "Age Distribution of Titanic Passengers",
x = "Age",
y = "Frequency")
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_bin()`).
The histogram shows that the majority of passengers were between 20
and 40 years old.
There is a small peak for young children under 10 years of age.
The distribution is slightly right-skewed, indicating fewer elderly
passengers.
Most travelers were working-age adults.
ggplot(titanic, aes(x = fare)) +
geom_histogram(binwidth = 10, fill = "coral", color = "black", alpha = 0.7) +
labs(title = "Fare Distribution of Titanic Passengers",
x = "Fare (£)",
y = "Frequency")
The fare distribution is heavily right-skewed, with most passengers
paying low fares.
A small number of passengers paid very high fares, creating a long right
tail.
This reflects the wide economic gap between different passenger
classes.
Lower-fare passengers dominated the overall composition of the
Titanic.
ggplot(titanic, aes(x = age, fill = factor(pclass))) +
geom_histogram(binwidth = 5, color = "black", alpha = 0.7) +
facet_wrap(~pclass) +
labs(title = "Age Distribution by Passenger Class",
x = "Age",
y = "Frequency",
fill = "Class")
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_bin()`).
All three passenger classes show a peak concentration in the 20–40
age range.
1st class passengers tend to be slightly older on average compared to
other classes.
3rd class has the highest proportion of young adults and children.
Age distribution varies meaningfully across the three passenger
classes.
ggplot(titanic_family, aes(x = Family_Size)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "black", alpha = 0.8) +
labs(title = "Family Size Distribution of Titanic Passengers",
x = "Family Size",
y = "Frequency")
The majority of passengers traveled alone, reflected by the dominant
peak at family size 1.
Two-person families were the second most common group onboard.
Very large families of 7 or more members were extremely rare.
Solo travel was the most dominant travel pattern aboard the Titanic.
ggplot(titanic, aes(x = age, y = after_stat(density))) +
geom_histogram(binwidth = 5, fill = "lightgreen", color = "black", alpha = 0.7) +
geom_density(color = "red", linewidth = 1.5, adjust = 1.5) +
labs(title = "Age Distribution with Density Curve",
x = "Age",
y = "Density")
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_density()`).
skewness(titanic$age, na.rm = TRUE)
## [1] 0.39133
The density curve overlaid on the histogram confirms a slightly
right-skewed age distribution.
The skewness value of about 0.39 is greater than 0, indicating a mild positive skew in the
data.
This means more passengers were younger, with fewer older
travelers.
The distribution departs slightly from a perfect normal bell curve
shape.
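To make the skewness figure less of a black box, here is a minimal base R sketch of a moment-based skewness. It assumes the usual definition m3 / m2^(3/2) with divide-by-n moments, which is believed to be what `moments::skewness()` computes. The function name `skew_moment` and the toy vectors are hypothetical and are not part of the original analysis.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2 (divide-by-n moments).
# Assumption: this mirrors the definition used by moments::skewness().
skew_moment <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m3 <- mean(d^3)   # third central moment
  m3 / m2^(3 / 2)
}

# Hypothetical toy vectors, not taken from the Titanic data
symmetric_x <- c(1, 2, 3, 4, 5)
right_tail  <- c(1, 2, 2, 3, 3, 3, 20, NA)

skew_symmetric <- skew_moment(symmetric_x)  # symmetric data gives zero
skew_right     <- skew_moment(right_tail)   # a long right tail gives a positive value
```

A value near zero indicates a roughly symmetric distribution, so the reported 0.39 for age sits at the mild end of positive skew.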
titanic$pclass <- as.factor(titanic$pclass)
ggplot(titanic, aes(x = pclass, y = age, fill = pclass)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16) +
labs(title = "Age Distribution by Passenger Class",
x = "Passenger Class",
y = "Age") +
theme_minimal()
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
1st class passengers have a higher median age compared to 2nd and 3rd
class passengers.
3rd class passengers are generally younger, with a lower median
age.
Outliers are visible in all classes, representing a few very elderly
passengers.
Socio-economic class and passenger age appear to be meaningfully
related.
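The red outlier points in the boxplot follow the common 1.5 × IQR rule. The sketch below applies that rule by hand in base R to a hypothetical vector of ages, not to the Titanic data. It uses `quantile()` for the quartiles, whereas `geom_boxplot()` works from hinges, so the fences can differ slightly on real data.

```r
# 1.5 * IQR rule for flagging boxplot outliers (base R only).
# Hypothetical ages, not taken from the Titanic data.
ages <- c(2, 18, 22, 25, 28, 30, 33, 36, 40, 45, 80)

q1  <- quantile(ages, 0.25, names = FALSE)  # first quartile
q3  <- quantile(ages, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values beyond either fence are drawn as individual outlier points
outliers <- ages[ages < lower_fence | ages > upper_fence]
```

In this toy vector only the age of 80 falls outside the fences, which matches the pattern of a few very elderly passengers being flagged.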
ggplot(titanic, aes(x = factor(survived), y = fare, fill = factor(survived))) +
geom_boxplot(outlier.colour = "darkred", outlier.shape = 16) +
labs(title = "Fare Distribution by Survival Status",
x = "Survived (0 = No, 1 = Yes)",
y = "Fare (£)",
fill = "Survived") +
theme_minimal()
Survivors paid noticeably higher median fares than
non-survivors.
The fare distribution for survivors has a wider spread and more
high-value outliers.
This suggests that wealthier passengers had a higher likelihood of
survival.
Fare is a useful proxy for socio-economic status and is associated with a
survival advantage.
ggplot(titanic, aes(x = sex, y = age, fill = sex)) +
geom_boxplot(position = position_dodge(width = 0.8),
outlier.colour = "red", alpha = 0.8) +
labs(title = "Age Distribution by Gender",
x = "Gender",
y = "Age") +
theme_minimal()
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Male and female passengers show similar median age values
onboard.
Female passengers display slightly less age variability than male
passengers.
Both genders contain outliers, indicating a few very elderly
individuals.
Gender alone does not strongly differentiate the age distribution of
passengers.
titanic_emb <- titanic %>% filter(!is.na(embarked) & embarked != "")
ggplot(titanic_emb, aes(x = embarked, y = fare, fill = embarked)) +
geom_boxplot(outlier.colour = "blue", outlier.shape = 16) +
labs(title = "Fare Distribution by Embarkation Port",
x = "Port (C = Cherbourg, Q = Queenstown, S = Southampton)",
y = "Fare (£)") +
theme_minimal()
Cherbourg passengers paid the highest median fares, suggesting more
1st class travelers boarded there.
Queenstown passengers had the lowest median fares, indicating mostly
lower-class travelers.
Southampton shows a wide fare range, reflecting passengers from all
classes.
Embarkation port is a meaningful indicator of passenger socio-economic
background.
ggplot(titanic, aes(x = factor(survived), y = fare, fill = factor(survived))) +
geom_boxplot(alpha = 0.8) +
facet_wrap(~pclass) +
labs(title = "Fare Distribution by Class and Survival Status",
x = "Survived (0 = No, 1 = Yes)",
y = "Fare (£)",
fill = "Survived") +
theme_minimal()
In all classes, survivors tend to have higher median fares than
non-survivors.
1st class survivors paid considerably higher fares, showing a strong
wealth-survival link.
3rd class shows little difference in fare between survivors and
non-survivors.
The survival-fare relationship is most pronounced among 1st class
passengers.
ggplot(titanic, aes(x = age, y = fare)) +
geom_point(color = "steelblue", alpha = 0.6) +
labs(title = "Scatter Plot: Age vs Fare",
x = "Age",
y = "Fare (£)")
## Warning: Removed 197 rows containing missing values or values outside the scale range
## (`geom_point()`).
The scatter plot shows no strong linear relationship between
passenger age and fare paid.
High fares are distributed across passengers of various age
groups.
A few extreme fare outliers are visible, mostly for middle-aged
passengers.
Age alone is not a strong predictor of the fare a passenger paid.
ggplot(titanic, aes(x = age, y = fare, color = factor(survived))) +
geom_point(size = 2, alpha = 0.6) +
labs(title = "Age vs Fare by Survival Status",
x = "Age",
y = "Fare (£)",
color = "Survived") +
theme_minimal()
## Warning: Removed 197 rows containing missing values or values outside the scale range
## (`geom_point()`).
Survivors are more concentrated in the higher fare range across all
age groups.
Non-survivors are densely clustered in the low-fare and younger-age
region.
A few older survivors paid extremely high fares, suggesting 1st class
membership.
Considering both age and fare together provides a stronger survival
signal.
ggplot(titanic, aes(x = age, y = fare, color = pclass)) +
geom_point(size = 2, alpha = 0.6) +
facet_wrap(~pclass) +
labs(title = "Age vs Fare by Passenger Class",
x = "Age",
y = "Fare (£)",
color = "Class") +
theme_minimal()
## Warning: Removed 197 rows containing missing values or values outside the scale range
## (`geom_point()`).
1st class passengers have a wide range of fares spread across all
ages.
3rd class passengers are tightly clustered at low fare values regardless
of age.
2nd class shows moderate fares with limited variation across age
groups.
Passenger class clearly defines the fare structure observed within each
age group.
ggplot(titanic_family, aes(x = Family_Size, y = fare, color = factor(survived))) +
geom_point(size = 2, alpha = 0.6) +
labs(title = "Family Size vs Fare",
x = "Family Size",
y = "Fare (£)",
color = "Survived") +
theme_minimal()
Solo travelers span the full range of fare amounts from very low to
very high.
Medium-sized families of 2 to 4 members tend to cluster in the lower
fare range.
Some small-family survivors paid relatively high fares, indicating
higher class.
Family size alone does not show a clear positive relationship with fare
paid.
ggplot(titanic, aes(x = age, y = fare, color = sex)) +
geom_point(size = 2, alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Age vs Fare by Gender with Regression Line",
x = "Age",
y = "Fare (£)",
color = "Gender") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 197 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 197 rows containing missing values or values outside the scale range
## (`geom_point()`).
The regression lines for both genders show a very slight upward trend
with age.
Female passengers show a marginally higher association with fares at
older ages.
The overall slope is nearly flat, confirming a weak correlation between
age and fare.
Gender interacts only mildly with age when predicting the fare paid.
titanic_clean <- titanic %>% filter(!is.na(age) & !is.na(fare))
cor_result <- cor.test(titanic_clean$age, titanic_clean$fare, method = "pearson")
cat("Pearson Correlation Coefficient (Age vs Fare):", cor_result$estimate, "\n")
## Pearson Correlation Coefficient (Age vs Fare): 0.108268
cat("p-value:", cor_result$p.value, "\n")
## p-value: 0.00212434
The Pearson correlation coefficient between age and fare is about 0.11, which is close to
zero.
This indicates a very weak positive linear relationship between passenger age and
fare paid.
The p-value of about 0.002 is below 0.05, so the relationship is statistically
significant even though it is weak.
Age explains only about 1% of the variation in fare (0.11 squared), so it cannot
reliably predict how much a passenger paid for their ticket.
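As a rough check on how a correlation this small can still be significant, the sketch below recomputes the test statistic from the reported coefficient using the standard formula t = r·sqrt(n − 2) / sqrt(1 − r²). The sample size of 803 is an assumption inferred from 1000 rows minus 197 missing ages.

```r
# r is copied from the cor.test() output; n is an assumption inferred
# from 1000 rows minus 197 rows with missing age.
r <- 0.108268
n <- 1000 - 197

# t statistic and two-sided p-value for H0: true correlation is zero
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Share of fare variance linearly associated with age
r_squared <- r^2
```

With roughly 800 observations even a small r produces a t statistic of about 3, which is why the result is significant while the effect stays small.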
titanic_numeric <- titanic %>%
mutate(pclass_num = as.numeric(as.character(pclass))) %>%
select(survived, pclass_num, age, sibsp, parch, fare) %>%
na.omit()
cor_matrix <- cor(titanic_numeric)
print(round(cor_matrix, 2))
## survived pclass_num age sibsp parch fare
## survived 1.00 -0.35 -0.05 0.00 0.12 0.26
## pclass_num -0.35 1.00 -0.36 0.04 0.03 -0.56
## age -0.05 -0.36 1.00 -0.25 -0.17 0.11
## sibsp 0.00 0.04 -0.25 1.00 0.33 0.14
## parch 0.12 0.03 -0.17 0.33 1.00 0.19
## fare 0.26 -0.56 0.11 0.14 0.19 1.00
The correlation matrix reveals relationships among all key numeric
variables.
Fare and survival show a positive correlation (0.26), indicating that passengers
who paid more survived more often.
Passenger class and fare have the strongest correlation in the matrix (-0.56);
it is negative because a higher class corresponds to a lower class number.
Sibsp and parch show a moderate positive correlation (0.33), both being related
to family travel.
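One way to read a correlation matrix systematically is to list each variable pair once and sort by absolute strength. The sketch below does this for the rounded values printed in the matrix above; the object names `cor_printed`, `pairs_df`, and `top_pairs` are hypothetical.

```r
# Rounded correlations copied from the printed matrix
vars <- c("survived", "pclass_num", "age", "sibsp", "parch", "fare")
cor_printed <- matrix(c(
   1.00, -0.35, -0.05,  0.00,  0.12,  0.26,
  -0.35,  1.00, -0.36,  0.04,  0.03, -0.56,
  -0.05, -0.36,  1.00, -0.25, -0.17,  0.11,
   0.00,  0.04, -0.25,  1.00,  0.33,  0.14,
   0.12,  0.03, -0.17,  0.33,  1.00,  0.19,
   0.26, -0.56,  0.11,  0.14,  0.19,  1.00
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle), then sort by absolute strength
idx <- which(upper.tri(cor_printed), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = cor_printed[idx],
  row.names = NULL,
  stringsAsFactors = FALSE
)
pairs_df  <- pairs_df[order(-abs(pairs_df$r)), ]
top_pairs <- head(pairs_df, 3)
```

Sorting this way puts the class and fare pair first, followed by class with age and class with survival.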
cor_fare_survived <- cor.test(titanic$fare, titanic$survived, method = "pearson")
cat("Pearson Correlation (Fare vs Survived):", cor_fare_survived$estimate, "\n")
## Pearson Correlation (Fare vs Survived): 0.2511494
cat("p-value:", cor_fare_survived$p.value, "\n")
## p-value: 7.526333e-16
There is a positive correlation between fare paid and survival,
confirming earlier findings.
Higher-fare passengers were significantly more likely to survive the
disaster.
The p-value confirms this is a statistically significant and reliable
relationship.
Economic privilege played a tangible and measurable role in determining
survival outcomes.
corrplot(cor_matrix,
method = "color",
addCoef.col = "black",
number.cex = 0.7,
col = colorRampPalette(c("red", "white", "blue"))(200),
tl.col = "black",
tl.srt = 45,
mar = c(0, 0, 2, 0),
title = "Titanic - Numeric Variable Correlation Matrix")
The visual correlation matrix clearly highlights positive and
negative relationships.
Blue cells indicate positive correlations while red cells indicate
negative correlations.
Fare and passenger class have the strongest visible negative correlation
in the matrix.
Survival shows positive blue shading with fare and negative red shading
with class.
fare_by_class <- titanic %>%
mutate(pclass_num = as.numeric(as.character(pclass))) %>%
group_by(pclass_num) %>%
summarise(Avg_Fare = mean(fare, na.rm = TRUE))
ggplot(fare_by_class, aes(x = pclass_num, y = Avg_Fare)) +
geom_line(color = "blue", linewidth = 1.2) +
geom_point(color = "red", size = 3) +
labs(title = "Average Fare by Passenger Class",
x = "Passenger Class",
y = "Average Fare (£)") +
theme_minimal()
Average fare drops sharply from 1st class to 2nd and 3rd class
passengers.
1st class passengers paid the highest average fares by a very large
margin.
The decline from class 2 to class 3 is smaller but still clearly
visible.
This line chart confirms a clear hierarchy of spending power across
classes.
survival_by_age <- titanic_age %>%
filter(Age_Group != "Unknown") %>%
group_by(Age_Group) %>%
summarise(Survival_Rate = mean(survived, na.rm = TRUE))
survival_by_age$Age_Group <- factor(survival_by_age$Age_Group,
levels = c("Child", "Adult", "Senior"))
ggplot(survival_by_age, aes(x = Age_Group, y = Survival_Rate, group = 1)) +
geom_line(color = "darkgreen", linewidth = 1.2) +
geom_point(color = "orange", size = 3) +
labs(title = "Survival Rate by Age Group",
x = "Age Group",
y = "Survival Rate") +
theme_minimal()
Children have the highest survival rate among all three age
groups.
Survival rates decrease progressively from children to adults to
seniors.
This trend is consistent with priority being given to younger passengers during
the evacuation.
Senior passengers faced the lowest survival chances of any age
group.
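The survival rate per age group is simply the mean of the 0/1 `survived` column within each group. The base R sketch below illustrates this on a hypothetical toy data frame, not on rows from Titanic.csv, using `cut()` as an alternative to `case_when()` with the same 18 and 60 cut-offs.

```r
# Hypothetical toy passengers, not rows from Titanic.csv
toy <- data.frame(
  age      = c(5, 10, 30, 40, 50, 65, 70, NA),
  survived = c(1, 1, 0, 1, 0, 0, 0, 1)
)

# right = FALSE closes intervals on the left: below 18, 18 to under 60, 60 and over
toy$Age_Group <- cut(toy$age,
                     breaks = c(-Inf, 18, 60, Inf),
                     labels = c("Child", "Adult", "Senior"),
                     right = FALSE)

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
# Rows with unknown age get Age_Group = NA and are left out by tapply().
rates <- tapply(toy$survived, toy$Age_Group, mean)
```

Because `cut()` returns an ordered set of factor levels, the groups already appear in Child, Adult, Senior order without a separate `factor()` call.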
fare_by_age <- titanic_age %>%
filter(Age_Group != "Unknown") %>%
group_by(Age_Group) %>%
summarise(Avg_Fare = mean(fare, na.rm = TRUE))
fare_by_age$Age_Group <- factor(fare_by_age$Age_Group,
levels = c("Child", "Adult", "Senior"))
ggplot(fare_by_age, aes(x = Age_Group, y = Avg_Fare, group = 1)) +
geom_line(color = "purple", linewidth = 1.2) +
geom_point(color = "red", size = 3) +
labs(title = "Average Fare by Age Group",
x = "Age Group",
y = "Average Fare (£)") +
theme_minimal()
Senior passengers paid the highest average fares among all age
groups.
Children and adults paid comparatively lower average fares.
This suggests that older passengers may have been more likely to travel
in 1st class.
Age group provides a useful lens for understanding fare patterns across
the dataset.
age_by_town <- titanic %>%
filter(!is.na(embark_town) & embark_town != "" & !is.na(age)) %>%
group_by(embark_town) %>%
summarise(Avg_Age = mean(age, na.rm = TRUE))
ggplot(age_by_town, aes(x = embark_town, y = Avg_Age, group = 1)) +
geom_line(color = "brown", linewidth = 1.2) +
geom_point(color = "blue", size = 3) +
labs(title = "Average Passenger Age by Embarkation Town",
x = "Embarkation Town",
y = "Average Age (Years)") +
theme_minimal()
Average passenger age differs only modestly across the three embarkation
towns.
Cherbourg passengers tend to be older on average than those from the
other ports.
Queenstown shows the youngest average passenger age among the three
towns.
Any differences in the age profile of boarding passengers by embarkation town
are small relative to the spread of ages within each town.
fare_gender_class <- titanic %>%
mutate(pclass_num = as.numeric(as.character(pclass))) %>%
group_by(pclass_num, sex) %>%
summarise(Avg_Fare = mean(fare, na.rm = TRUE), .groups = "drop")
ggplot(fare_gender_class, aes(x = pclass_num, y = Avg_Fare,
color = sex, group = sex)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
labs(title = "Average Fare by Class and Gender",
x = "Passenger Class",
y = "Average Fare (£)",
color = "Gender") +
theme_minimal()
Female passengers paid higher average fares than males across all
passenger classes.
The gap between male and female fares is most pronounced in 1st
class.
Both gender lines show a sharp declining trend from 1st to 3rd
class.
This multi-line chart clearly reveals the combined influence of gender
and class on fare.
anova_fare <- aov(fare ~ pclass, data = titanic)
summary(anova_fare)
## Df Sum Sq Mean Sq F value Pr(>F)
## pclass 2 863977 431989 280.3 <2e-16 ***
## Residuals 997 1536329 1541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The one-way ANOVA tests whether mean fare differs significantly
across the three passenger classes.
The p-value is below 2e-16, far under the 0.05 threshold, so the
differences in mean fare between classes are statistically significant.
This confirms that passenger class is a strong determinant of how much a
ticket cost.
The F-statistic of 280.3 is the ratio of between-group variance to
within-group variance, and a value this large supports rejecting the null
hypothesis of equal mean fares.
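The F-statistic can be reproduced by hand from the sums of squares and degrees of freedom in the ANOVA table. The sketch below does that arithmetic and also computes eta squared, an effect-size summary that is an addition here and not part of the original output.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table
ss_between <- 863977;  df_between <- 2
ss_within  <- 1536329; df_within  <- 997

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean square
f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total fare variance associated with class
# (an extra effect-size summary, not shown in the original output)
eta_sq <- ss_between / (ss_between + ss_within)
```

The ratio comes out at about 280, matching the table, and eta squared suggests that class accounts for roughly a third of the variance in fare.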
titanic_emb_age <- titanic %>%
filter(!is.na(embarked) & embarked != "" & !is.na(age))
anova_age_emb <- aov(age ~ embarked, data = titanic_emb_age)
summary(anova_age_emb)
## Df Sum Sq Mean Sq F value Pr(>F)
## embarked 2 655 327.5 1.572 0.208
## Residuals 798 166267 208.4
The ANOVA tests whether passengers from different embarkation ports
(C, Q, S) had significantly different average ages.
The p-value of 0.208 is above 0.05, so the test does not show a significant
association between the port of boarding and the age profile of passengers.
The earlier impression that Cherbourg attracted older travelers is therefore
not supported statistically, even though its passengers did pay higher fares.
The age differences across ports may be
due to random variation rather than a true effect.
model_slr <- lm(survived ~ fare, data = titanic)
summary(model_slr)
##
## Call:
## lm(formula = survived ~ fare, data = titanic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9708 -0.3477 -0.3320 0.6088 0.6874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3126476 0.0178177 17.547 < 2e-16 ***
## fare 0.0025026 0.0003053 8.197 7.53e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.473 on 998 degrees of freedom
## Multiple R-squared: 0.06308, Adjusted R-squared: 0.06214
## F-statistic: 67.19 on 1 and 998 DF, p-value: 7.526e-16
ggplot(titanic, aes(x = fare, y = survived)) +
geom_point(alpha = 0.3, color = "steelblue", size = 1.5) +
geom_smooth(method = "lm", color = "red", se = TRUE) +
labs(title = "Simple Linear Regression: Fare vs Survival",
x = "Fare (£)",
y = "Survived (0 = No, 1 = Yes)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The simple linear regression model uses fare as the sole predictor of
survival; because survived is coded 0/1, fitted values can be read as estimated survival probabilities.
The fare coefficient is positive (about 0.0025 per £1) and significant (p ≈ 7.5e-16), confirming that
higher-paying passengers had better survival odds.
The R-squared value of 0.063 means that about 6% of the variation in
survival is explained by fare alone.
While the relationship is statistically significant, fare alone explains
only a small portion of survival variance.
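To see what the fitted line implies in practice, the sketch below plugs a few fares into the reported coefficients. The fares are the minimum, quartiles, and maximum from the `summary(tc)` output; the function name `predict_survival` is hypothetical.

```r
# Coefficients copied from the summary(model_slr) output
b0 <- 0.3126476   # intercept: fitted survival at a fare of 0
b1 <- 0.0025026   # change in fitted survival per £1 of fare

predict_survival <- function(fare) b0 + b1 * fare

# Min, 1st quartile, median, 3rd quartile, and max fare from summary(tc)
fares       <- c(0, 7.896, 14.068, 30.696, 512.329)
fitted_vals <- predict_survival(fares)

# With a single predictor, R-squared equals the squared Pearson correlation
r_squared <- 0.2511494^2
```

The fitted value at the maximum fare exceeds 1, which is not a valid probability. This is a known limitation of fitting a straight line to a 0/1 outcome; a logistic regression via `glm(..., family = binomial)` is the usual alternative, though it is not used in this analysis.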
titanic_mlr <- titanic %>%
filter(!is.na(age) & !is.na(fare)) %>%
mutate(pclass_num = as.numeric(as.character(pclass)))
model_mlr <- lm(survived ~ age + fare + pclass_num, data = titanic_mlr)
summary(model_mlr)
##
## Call:
## lm(formula = survived ~ age + fare + pclass_num, data = titanic_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9517 -0.3178 -0.1969 0.4246 0.9903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0897827 0.0843540 12.919 < 2e-16 ***
## age -0.0065789 0.0011934 -5.513 4.77e-08 ***
## fare 0.0006538 0.0003719 1.758 0.0791 .
## pclass_num -0.2239574 0.0247375 -9.053 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4522 on 799 degrees of freedom
## Multiple R-squared: 0.1609, Adjusted R-squared: 0.1577
## F-statistic: 51.06 on 3 and 799 DF, p-value: < 2.2e-16
results_mlr <- data.frame(
Actual = titanic_mlr$survived,
Predicted = predict(model_mlr, newdata = titanic_mlr)
)
ggplot(results_mlr, aes(x = Actual, y = Predicted)) +
geom_point(color = "darkgreen", size = 2, alpha = 0.4) +
geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1.2) +
labs(title = "Multiple Linear Regression: Actual vs Predicted Survival",
x = "Actual Survival (0 = No, 1 = Yes)",
y = "Predicted Survival Score") +
theme_minimal()
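The multiple linear regression model combines age, fare, and
passenger class to predict survival.
R-squared rises to about 0.16, compared with 0.06 for fare alone, although the
two models are fitted on different samples (803 rows with known age versus all 1000).
A higher class number (3rd class) and greater age both significantly lower predicted survival,
while the fare coefficient is positive but no longer significant at the 0.05 level (p = 0.079) once class is included.
The Actual vs Predicted plot shows how well the model estimates survival:
points close to the red diagonal line, meaning predictions near 0 for non-survivors
and near 1 for survivors, indicate accurate predictions.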
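The fitted equation can be applied by hand to see how class dominates the prediction. The sketch below uses the reported coefficients on two hypothetical passengers of the same age; the function name `predict_mlr` and the passenger values are illustrative assumptions.

```r
# Coefficients copied from the summary(model_mlr) output
coefs <- c(intercept  = 1.0897827,
           age        = -0.0065789,
           fare       = 0.0006538,
           pclass_num = -0.2239574)

# Linear predictor: intercept plus each coefficient times its variable
predict_mlr <- function(age, fare, pclass_num) {
  coefs[["intercept"]] +
    coefs[["age"]] * age +
    coefs[["fare"]] * fare +
    coefs[["pclass_num"]] * pclass_num
}

# Two hypothetical 30-year-old passengers
first_class <- predict_mlr(age = 30, fare = 50, pclass_num = 1)
third_class <- predict_mlr(age = 30, fare = 8,  pclass_num = 3)

# Part of the gap that comes from the class term alone (two class steps)
class_gap <- -2 * coefs[["pclass_num"]]
```

Most of the difference between the two predictions comes from the class term, while the £42 fare difference contributes less than 0.03.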
titanic_poly <- titanic %>% filter(!is.na(age) & !is.na(fare))
model_poly <- lm(fare ~ poly(age, 2), data = titanic_poly)
summary(model_poly)
##
## Call:
## lm(formula = fare ~ poly(age, 2), data = titanic_poly)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.61 -23.36 -17.94 1.53 477.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.108 1.837 18.569 < 2e-16 ***
## poly(age, 2)1 160.387 52.049 3.081 0.00213 **
## poly(age, 2)2 38.318 52.049 0.736 0.46183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.05 on 800 degrees of freedom
## Multiple R-squared: 0.01239, Adjusted R-squared: 0.009922
## F-statistic: 5.019 on 2 and 800 DF, p-value: 0.006823
ggplot(titanic_poly, aes(x = age, y = fare)) +
geom_point(color = "blue", size = 2, alpha = 0.4) +
stat_smooth(method = "lm",
formula = y ~ x + I(x^2),
color = "red",
linewidth = 1.5,
se = TRUE) +
labs(title = "Polynomial Regression: Age vs Fare (Degree 2)",
x = "Age",
y = "Fare (£)") +
theme_minimal()
The polynomial regression of degree 2 tests for a curved
relationship between age and fare.
The quadratic term is not statistically significant (p = 0.46), so there is no
evidence that the age-fare relationship is non-linear.
The model explains only about 1% of the variation in fare (R-squared = 0.012),
so any curvature in the fitted line should not be over-interpreted.
Polynomial regression is more flexible than simple linear regression
when the relationship between variables is curved rather than
straight, but that flexibility adds little here.
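Two points about the degree-2 fit can be checked on simulated data. First, `poly(x, 2)` and `x + I(x^2)` describe the same curve, so the model and the plot agree. Second, the quadratic term can be tested with a nested-model F-test. The data below are simulated with a truly linear relationship and are not the Titanic data; all object names are hypothetical.

```r
set.seed(42)  # simulated data, not the Titanic data
x <- runif(200, min = 1, max = 80)
y <- 20 + 0.3 * x + rnorm(200, sd = 15)   # truly linear relationship plus noise

fit_linear <- lm(y ~ x)
fit_orth   <- lm(y ~ poly(x, 2))          # orthogonal polynomial basis
fit_raw    <- lm(y ~ x + I(x^2))          # raw polynomial basis

# Both degree-2 parameterisations give the same fitted values
same_fit <- isTRUE(all.equal(unname(fitted(fit_orth)),
                             unname(fitted(fit_raw))))

# F-test for whether the quadratic term improves on the straight line
comparison  <- anova(fit_linear, fit_orth)
p_quadratic <- comparison[["Pr(>F)"]][2]

# For one added term, this matches the t-test on the quadratic coefficient
p_from_summary <- summary(fit_orth)$coefficients[3, 4]
```

The orthogonal basis changes the individual coefficient values but not the fitted curve, which is why the p-value on the second `poly()` term can be read directly as a test for curvature.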