DATA 712 HW#9
Titanic Data Analysis
Introduction
The sinking of the Titanic remains one of the best-known disasters in history. In April 1912, during its maiden voyage, the ship struck an iceberg and sank, killing 1502 of the 2224 passengers and crew on board. There were too few lifeboats for everyone, so not all passengers had a chance of survival.
Although chance played a role, emerging patterns suggest that survival was not entirely random. Factors such as gender, social class, and ticket fare appeared to significantly influence a passenger’s likelihood of survival.
In this analysis, I examine these survival patterns using the Titanic dataset from Kaggle’s “Titanic: Machine Learning from Disaster” and address missing data through multiple imputation, following the methods outlined by Acock (2005), Honaker et al. (2011), and King et al. (2001).
Loading the Data
library(readxl)
library(dplyr)
library(ggplot2)
library(MASS)
library(clarify)
library(texreg)
library(tidyr)
library(purrr)
library(betareg)
library(modelsummary)
library(Amelia)
library(tibble)
set.seed(123123)
knitr::opts_knit$set(root.dir = "/Users/ruthiemaurer/Desktop/DATA 712")
setwd("/Users/ruthiemaurer/Desktop/DATA 712")
## # A tibble: 891 × 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 0 3 Braun… male 22 1 0 A/5 2… 7.25 <NA>
## 2 2 1 1 Cumin… fema… 38 1 0 PC 17… 71.3 C85
## 3 3 1 3 Heikk… fema… 26 0 0 STON/… 7.92 <NA>
## 4 4 1 1 Futre… fema… 35 1 0 113803 53.1 C123
## 5 5 0 3 Allen… male 35 0 0 373450 8.05 <NA>
## 6 6 0 3 Moran… male NA 0 0 330877 8.46 <NA>
## 7 7 0 1 McCar… male 54 0 0 17463 51.9 E46
## 8 8 0 3 Palss… male 2 3 1 349909 21.1 <NA>
## 9 9 1 3 Johns… fema… 27 0 2 347742 11.1 <NA>
## 10 10 1 2 Nasse… fema… 14 1 0 237736 30.1 <NA>
## # ℹ 881 more rows
## # ℹ 1 more variable: Embarked <chr>
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
## [1] 891 12
Although the dataset contains 891 observations, closer inspection reveals missing values, particularly for Age (177 missing values, about 19.9% of cases).
Historically, analysts handled “missingness” through listwise deletion, which here would shrink the dataset to 712 observations, a loss of 20.09%. However, Acock (2005) and King et al. (2001) caution that listwise deletion can introduce bias and reduce statistical power unnecessarily. For that reason, this analysis uses multiple imputation via the Amelia package to make use of all available information.
Cabin was dropped due to extreme “missingness” (~80%) and its limited relevance to the analysis.
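The difference between missingness in a single variable and the number of rows lost to listwise deletion can be checked directly. The following is a minimal base R sketch on a made-up data frame (`toy` and its values are hypothetical, not the Titanic data); it counts missing values per column and the rows that would survive listwise deletion.

```r
# Hypothetical toy data standing in for the real dataset (values are made up).
toy <- data.frame(
  Age      = c(22, NA, 26, 35, NA),
  Fare     = c(7.25, 71.28, 7.92, 53.10, 8.05),
  Embarked = c("S", "C", NA, "S", "Q"),
  stringsAsFactors = FALSE
)

# Missing values per column, as counts and as a percentage of rows.
na_count <- colSums(is.na(toy))
na_share <- round(100 * na_count / nrow(toy), 2)

# Rows that listwise deletion would keep (no missing value in any column).
n_complete <- sum(complete.cases(toy))

print(data.frame(variable = names(na_count), na_count, na_share, row.names = NULL))
print(n_complete)
```

Applied to the real data, the same two quantities would be expected to differ slightly, because a row is dropped when any variable is missing, not only Age.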
# Drop incomplete rows first
complete_data <- DATA %>% drop_na()
# Now Calculate counts
n_original <- nrow(DATA)
n_complete <- nrow(complete_data)
n_dropped <- n_original - n_complete
perc_dropped <- round((n_dropped / n_original) * 100, 2)
obs_summary <- tibble(
  `Stage` = c("Original Data", "Complete Cases Only", "Dropped Observations", "Percent Dropped"),
  `Count` = c(n_original, n_complete, n_dropped, paste0(perc_dropped, "%"))
)
modelsummary::datasummary_df(obs_summary, title = "Observation Summary Before Imputation")
| Stage | Count |
|---|---|
| Original Data | 891 |
| Complete Cases Only | 712 |
| Dropped Observations | 179 |
| Percent Dropped | 20.09% |
Fare and Survival Analysis Before Imputation
# Analysis by Sex and Pclass
fare_sex_pclass <- complete_data %>%
  group_by(Sex, Pclass) %>%
  summarise(Average_Fare = mean(Fare))
survival_sex_pclass <- complete_data %>%
  group_by(Sex, Pclass) %>%
  summarise(Average_Survival = mean(Survived))
fare_sex <- complete_data %>%
  group_by(Sex) %>%
  summarise(Average_Fare = mean(Fare))
survival_sex <- complete_data %>%
  group_by(Sex) %>%
  summarise(Average_Survival = mean(Survived))
survival_pclass <- complete_data %>%
  group_by(Pclass) %>%
  summarise(Average_Survival = mean(Survived))
fare_sex
## # A tibble: 2 × 2
## Sex Average_Fare
## <chr> <dbl>
## 1 female 47.3
## 2 male 27.3
## # A tibble: 2 × 2
## Sex Average_Survival
## <chr> <dbl>
## 1 female 0.753
## 2 male 0.205
## # A tibble: 3 × 2
## Pclass Average_Survival
## <dbl> <dbl>
## 1 1 0.652
## 2 2 0.480
## 3 3 0.239
Average fare for females: $47.33
Average fare for males: $27.27
Female survival rate: 75.3%
Male survival rate: 20.5%
First class survival: 65.2%
Second class survival: 48.0%
Third class survival: 23.9%
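These results indicate that gender and class were strongly associated with survival: women survived at more than three times the rate of men, and first-class passengers at nearly three times the rate of third-class passengers.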
Regression Analysis (Before Imputation)
complete_model <- lm(Survived ~ Pclass + Sex + Fare, data = complete_data)
modelsummary(complete_model, title = "Regression Results: Before Imputation")
| | (1) |
|---|---|
| (Intercept) | 1.064 |
| | (0.059) |
| Pclass | -0.156 |
| | (0.021) |
| Sexmale | -0.501 |
| | (0.031) |
| Fare | 0.000 |
| | (0.000) |
| Num.Obs. | 712 |
| R2 | 0.366 |
| R2 Adj. | 0.364 |
| AIC | 692.2 |
| BIC | 715.0 |
| Log.Lik. | -341.093 |
| F | 136.457 |
| RMSE | 0.39 |
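Regression model findings (complete cases only):
- Being male decreased the predicted survival probability by 0.501 (about 50 percentage points), holding class and fare constant.
- Higher class (lower Pclass number) significantly improved survival chances: each one-step move down in class (for example, from first to second) lowered the predicted survival probability by 0.156.
- Fare had a very small positive effect; its coefficient rounds to 0.000.
Model performance:
R² = 0.366, AIC = 692.2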
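Because Survived is coded 0/1 and the model is fitted with lm(), it is a linear probability model, so coefficients read as changes in the probability of survival. A small sketch on made-up data (`toy` is hypothetical) illustrates this: with a single binary predictor, the coefficient equals the difference in group survival rates. In the model above, which also includes Pclass and Fare, the Sexmale coefficient is the analogous difference after adjusting for those variables.

```r
# Hypothetical toy data: 4 women (3 survive) and 4 men (1 survives).
toy <- data.frame(
  survived = c(1, 1, 1, 0, 1, 0, 0, 0),
  sex      = c("female", "female", "female", "female",
               "male", "male", "male", "male")
)

# Linear probability model with one binary predictor.
fit <- lm(survived ~ sex, data = toy)

# Difference in raw survival rates, male minus female.
gap <- mean(toy$survived[toy$sex == "male"]) -
  mean(toy$survived[toy$sex == "female"])

print(coef(fit))  # the "sexmale" coefficient matches `gap`
print(gap)
```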
Multiple Imputation Using Amelia
amelia_output <- amelia(
  DATA,
  m = 5,
  idvars = c("PassengerId", "Name", "Ticket"),
  noms = c("Sex", "Embarked")
)
## -- Imputation 1 --
##
## 1 2 3 4
##
## -- Imputation 2 --
##
## 1 2 3 4 5 6 7
##
## -- Imputation 3 --
##
## 1 2 3 4 5 6 7
##
## -- Imputation 4 --
##
## 1 2 3 4
##
## -- Imputation 5 --
##
## 1 2 3
##
## Amelia output with 5 imputed datasets.
## Return code: 1
## Message: Normal EM convergence.
##
## Chain Lengths:
## --------------
## Imputation 1: 4
## Imputation 2: 7
## Imputation 3: 7
## Imputation 4: 4
## Imputation 5: 3
## [1] "imputations" "m" "missMatrix" "overvalues" "theta"
## [6] "mu" "covMatrices" "code" "message" "iterHist"
## [11] "arguments" "orig.vars"
Following the approach described by Honaker et al. (2011), multiple imputation was performed using Amelia, which implements an Expectation-Maximization with Bootstrapping (EMB) algorithm. Multiple imputation assumes the data are Missing at Random (MAR) (Honaker and King 2010).
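Amelia stores the completed datasets in the `imputations` element listed in the output above. The sketch below shows one way to inspect them on simulated data; the data frame `toy`, its variables, and the missingness pattern are assumptions made for illustration, and `p2s = 0` only silences the progress output.

```r
library(Amelia)

set.seed(1)

# Simulated toy data (hypothetical): y depends linearly on x.
n <- 200
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
toy <- data.frame(x = x, y = y)

# Knock out 40 values of y completely at random.
toy$y[sample(n, 40)] <- NA

# Run the EMB algorithm to create m = 5 completed datasets.
out <- amelia(toy, m = 5, p2s = 0)

# Each element of out$imputations is a full data frame with no NAs left.
n_imputations <- length(out$imputations)
remaining_na  <- sapply(out$imputations, function(d) sum(is.na(d)))

print(n_imputations)
print(remaining_na)
```

Each completed dataset contains different plausible values for the missing cells, which is what lets later analysis reflect the uncertainty about them.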
Regression Analysis (After Imputation)
titanic_model_imputed <- lm(Survived ~ Pclass + Sex + Fare, data = imputed_data)
modelsummary(titanic_model_imputed, title = "Regression Results: After Imputation")
| | (1) |
|---|---|
| (Intercept) | 1.058 |
| | (0.054) |
| Pclass | -0.151 |
| | (0.019) |
| Sexmale | -0.514 |
| | (0.028) |
| Fare | 0.000 |
| | (0.000) |
| Num.Obs. | 891 |
| R2 | 0.368 |
| R2 Adj. | 0.366 |
| AIC | 845.0 |
| BIC | 869.0 |
| Log.Lik. | -417.513 |
| F | 172.185 |
| RMSE | 0.39 |
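Regression model findings (after imputation):
- Being male decreased the predicted survival probability by 0.514 (about 51 percentage points), holding class and fare constant.
- Higher class still strongly predicted survival: each one-step move down in class lowered the predicted survival probability by 0.151.
- Fare remained a small positive factor; its coefficient rounds to 0.000.
Model performance:
R² = 0.368, AIC = 845.0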
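The regression above is fitted to a single data frame. When several imputed datasets are available, a common practice is to fit the model on each and combine the results with Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and the pooled variance is the average within-dataset variance plus (1 + 1/m) times the between-dataset variance. The base R sketch below illustrates the arithmetic; `imputed_list` is a hypothetical stand-in built from simulated data, not the output of the analysis here. Amelia also provides `mi.meld()` for this combining step.

```r
set.seed(42)

# Simulated data (hypothetical): y depends linearly on x.
n <- 100
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)

# Stand-in for m = 5 imputed datasets: the first 20 values of x are treated
# as missing and filled with a different random draw in each copy.
miss <- 1:20
imputed_list <- lapply(1:5, function(i) {
  d <- data.frame(x = x, y = y)
  d$x[miss] <- rnorm(length(miss))  # crude placeholder for an imputation draw
  d
})

# Fit the same model on every completed dataset.
fits <- lapply(imputed_list, function(d) lm(y ~ x, data = d))
est  <- sapply(fits, function(f) unname(coef(f)["x"]))
se   <- sapply(fits, function(f) summary(f)$coefficients["x", "Std. Error"])

# Rubin's rules.
m          <- length(fits)
pooled_est <- mean(est)            # pooled point estimate
within     <- mean(se^2)           # average within-imputation variance
between    <- var(est)             # between-imputation variance
pooled_se  <- sqrt(within + (1 + 1 / m) * between)

print(c(estimate = pooled_est, std_error = pooled_se))
```

The pooled standard error is never smaller than the average within-dataset standard error, since the between-imputation term adds the uncertainty due to the missing values.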
Comparing Models: Before vs After Imputation
# Comparison of the complete-case and imputed-data models
modelsummary(
  list(
    "Complete Case Model" = complete_model,
    "Imputed Data Model" = titanic_model_imputed
  ),
  title = "Comparison of Regression Models: Complete Cases vs Imputed Data"
)
| | Complete Case Model | Imputed Data Model |
|---|---|---|
| (Intercept) | 1.064 | 1.058 |
| | (0.059) | (0.054) |
| Pclass | -0.156 | -0.151 |
| | (0.021) | (0.019) |
| Sexmale | -0.501 | -0.514 |
| | (0.031) | (0.028) |
| Fare | 0.000 | 0.000 |
| | (0.000) | (0.000) |
| Num.Obs. | 712 | 891 |
| R2 | 0.366 | 0.368 |
| R2 Adj. | 0.364 | 0.366 |
| AIC | 692.2 | 845.0 |
| BIC | 715.0 | 869.0 |
| Log.Lik. | -341.093 | -417.513 |
| F | 136.457 | 172.185 |
| RMSE | 0.39 | 0.39 |
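Both models produced similar results, but the imputed model:
- Preserved the full sample size (891 observations)
- Reduced standard errors (for example, from 0.031 to 0.028 for Sexmale)
- Produced a marginally higher R² (0.368 vs 0.366); AIC and BIC are not comparable here because the two models are fitted to different numbers of observations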
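The size of the drop in standard errors can be sanity-checked with simple arithmetic. As a rule of thumb, and assuming the residual variance and the spread of the predictors are similar in both samples, standard errors scale with 1/sqrt(n). The sketch below compares that expected ratio with the ratios implied by the table above; the standard errors are copied from the table and are rounded to three decimals, so the comparison is only approximate.

```r
# Sample sizes from the comparison table.
n_complete <- 712
n_full     <- 891

# Expected ratio of standard errors if they scale with 1 / sqrt(n).
expected_ratio <- sqrt(n_complete / n_full)

# Standard errors as reported (rounded) in the comparison table.
se_complete <- c(Pclass = 0.021, Sexmale = 0.031)
se_full     <- c(Pclass = 0.019, Sexmale = 0.028)

# Observed ratio, full-sample model over complete-case model.
observed_ratio <- se_full / se_complete

print(round(expected_ratio, 3))
print(round(observed_ratio, 3))
```

Both ratios come out at roughly 0.9, which suggests the gain in precision is about what the larger sample alone would be expected to produce.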
Conclusion
Handling missing data is crucial for valid and unbiased statistical analysis. Without multiple imputation, I would have left out approximately 20% of the dataset, potentially biasing the results and reducing precision. Using Amelia keeps observations with missing values in the analysis, leading to more reliable regression results.
Throughout the analysis, the findings confirm that gender and passenger class were critical predictors of survival aboard the Titanic. Comparing the results before and after multiple imputation shows that these relationships remain consistent, and that using Amelia adds precision, slightly improves model fit, and retains the full dataset, making the results more reliable.
References
Acock, Alan C. 2005. “Working With Missing Values.” Journal of Marriage and Family 67 (4): 1012–28. https://doi.org/10.1111/j.1741-3737.2005.00191.x.
Honaker, James, and Gary King. 2010. “What to Do about Missing Values in Time-Series Cross-Section Data.” American Journal of Political Science 54 (2): 561–81. https://doi.org/10.1111/j.1540-5907.2010.00447.x.
Honaker, James, Gary King, and Matthew Blackwell. 2011. “Amelia II: A Program for Missing Data.” Journal of Statistical Software 45 (7). https://doi.org/10.18637/jss.v045.i07.
King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. 2001. “Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation.” American Political Science Review 95 (1): 49–69. https://doi.org/10.1017/S0003055401000235.
Titanic - Machine Learning from Disaster. n.d. https://kaggle.com/titanic.