DATA 712 HW#9

Titanic Data Analysis

Introduction

The sinking of the Titanic remains one of the best-known disasters in history. In April 1912, during its maiden voyage, the ship struck an iceberg and sank, killing 1502 of the 2224 passengers and crew on board. There were not enough lifeboats for everyone, so not all of those on board had a chance of survival.

Although chance played a role, survival was not entirely random. Gender, social class, and ticket fare all appear to have influenced a passenger’s likelihood of survival.

In this analysis, I examine these survival patterns using the Titanic dataset from Kaggle’s “Titanic: Machine Learning from Disaster” competition, and I address missing data through multiple imputation, following the methods outlined by Acock (2005), Honaker et al. (2011), and King et al. (2001).

Loading the Data

library(readxl)
library(dplyr)
library(ggplot2)
library(MASS)
library(clarify)
library(texreg)
library(tidyr)
library(purrr)
library(betareg)
library(modelsummary)
library(Amelia)
library(tibble)
set.seed(123123)
knitr::opts_knit$set(root.dir = "/Users/ruthiemaurer/Desktop/DATA 712")
setwd("/Users/ruthiemaurer/Desktop/DATA 712")
DATA <- read_xlsx("/Users/ruthiemaurer/Desktop/DATA 712/titanic_data.xlsx", col_names = TRUE)
print(DATA)
## # A tibble: 891 × 12
##    PassengerId Survived Pclass Name   Sex     Age SibSp Parch Ticket  Fare Cabin
##          <dbl>    <dbl>  <dbl> <chr>  <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
##  1           1        0      3 Braun… male     22     1     0 A/5 2…  7.25 <NA> 
##  2           2        1      1 Cumin… fema…    38     1     0 PC 17… 71.3  C85  
##  3           3        1      3 Heikk… fema…    26     0     0 STON/…  7.92 <NA> 
##  4           4        1      1 Futre… fema…    35     1     0 113803 53.1  C123 
##  5           5        0      3 Allen… male     35     0     0 373450  8.05 <NA> 
##  6           6        0      3 Moran… male     NA     0     0 330877  8.46 <NA> 
##  7           7        0      1 McCar… male     54     0     0 17463  51.9  E46  
##  8           8        0      3 Palss… male      2     3     1 349909 21.1  <NA> 
##  9           9        1      3 Johns… fema…    27     0     2 347742 11.1  <NA> 
## 10          10        1      2 Nasse… fema…    14     1     0 237736 30.1  <NA> 
## # ℹ 881 more rows
## # ℹ 1 more variable: Embarked <chr>
summary(DATA)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
dim(DATA)
## [1] 891  12

Although the dataset contains 891 observations, the summary shows missing values, most notably in Age, which is missing for 177 passengers (about 19.9% of cases).

Historically, analysts handled “missingness” through listwise deletion. Once Cabin is excluded (see below), listwise deletion would shrink the dataset to 712 observations, dropping 179 rows (20.09%).

However, Acock (2005) and King et al. (2001) caution that listwise deletion can introduce bias and reduce statistical power unnecessarily. For that reason, I use multiple imputation via the Amelia package to make use of all available information.

DATA <- DATA %>% dplyr::select(-Cabin)

Cabin was dropped because most of its values are missing (roughly 77%) and it has limited relevance to the analysis.
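The per-variable missingness and the effect of listwise deletion can be checked directly. The sketch below uses a small made-up data frame (toy is hypothetical, not the Titanic data) and base R only; its column names simply mirror the ones discussed here.

```r
# Hypothetical toy data standing in for the Titanic columns discussed above.
toy <- data.frame(
  Age      = c(22, NA, 26, 35, NA),
  Cabin    = c(NA, "C85", NA, NA, NA),
  Embarked = c("S", "C", NA, "S", "Q"),
  stringsAsFactors = FALSE
)

# Share of missing values per column, as a percentage.
pct_missing <- round(colMeans(is.na(toy)) * 100, 1)

# Rows that would survive listwise deletion once Cabin is excluded.
keep_cols  <- setdiff(names(toy), "Cabin")
n_complete <- sum(complete.cases(toy[, keep_cols]))

print(pct_missing)
print(n_complete)
```

Applied to DATA instead of toy, the same two calls should give the per-column shares and the complete-case count for the real dataset.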

# Drop incomplete rows first
complete_data <- DATA %>% drop_na()

# Now Calculate counts
n_original <- nrow(DATA)
n_complete <- nrow(complete_data)
n_dropped <- n_original - n_complete
perc_dropped <- round((n_dropped / n_original) * 100, 2)

obs_summary <- tibble(
  `Stage` = c("Original Data", "Complete Cases Only", "Dropped Observations", "Percent Dropped"),
  `Count` = c(n_original, n_complete, n_dropped, paste0(perc_dropped, "%"))
)

modelsummary::datasummary_df(obs_summary, title = "Observation Summary Before Imputation")
Observation Summary Before Imputation

Stage                  Count
Original Data          891
Complete Cases Only    712
Dropped Observations   179
Percent Dropped        20.09%

Fare and Survival Analysis Before Imputation

# Average fare and survival rate by Sex and Pclass jointly
# (computed for reference; not printed below)
fare_sex_pclass <- complete_data %>%
  group_by(Sex, Pclass) %>%
  summarise(Average_Fare = mean(Fare), .groups = "drop")

survival_sex_pclass <- complete_data %>%
  group_by(Sex, Pclass) %>%
  summarise(Average_Survival = mean(Survived), .groups = "drop")

# Average fare and survival rate by Sex
fare_sex <- complete_data %>%
  group_by(Sex) %>%
  summarise(Average_Fare = mean(Fare))

survival_sex <- complete_data %>%
  group_by(Sex) %>%
  summarise(Average_Survival = mean(Survived))

# Survival rate by Pclass
survival_pclass <- complete_data %>%
  group_by(Pclass) %>%
  summarise(Average_Survival = mean(Survived))

fare_sex
## # A tibble: 2 × 2
##   Sex    Average_Fare
##   <chr>         <dbl>
## 1 female         47.3
## 2 male           27.3
survival_sex
## # A tibble: 2 × 2
##   Sex    Average_Survival
##   <chr>             <dbl>
## 1 female            0.753
## 2 male              0.205
survival_pclass
## # A tibble: 3 × 2
##   Pclass Average_Survival
##    <dbl>            <dbl>
## 1      1            0.652
## 2      2            0.480
## 3      3            0.239
  • Average fare for females: £47.33
  • Average fare for males: £27.27
  • Female survival rate: 75.3%
  • Male survival rate: 20.5%
  • First class survival: 65.2%
  • Second class survival: 48.0%
  • Third class survival: 23.9%

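These descriptive results suggest that gender and class played major roles in survival: among the complete cases, women survived at more than three times the rate of men, and first-class passengers at more than twice the rate of third-class passengers.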
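Because Survived is coded 0/1, the mean of the column within a group is that group's survival rate, and the same idea extends to a two-way breakdown by sex and class. The sketch below illustrates this in base R on a made-up eight-row data frame (toy is hypothetical, so the rates are not the Titanic figures).

```r
# Hypothetical toy data; Survived is 0/1, so a group mean is a survival rate.
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "female", "male", "male", "female"),
  Pclass   = c(1, 3, 1, 3, 1, 3, 3, 3),
  Survived = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Survival rate for every Sex x Pclass cell, returned as a matrix.
rates <- tapply(toy$Survived, list(Sex = toy$Sex, Pclass = toy$Pclass), mean)

print(round(rates, 2))
```

A table like this makes it easier to see whether the gender gap holds within each class or is driven by one class alone.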

Regression Analysis (Before Imputation)

complete_model <- lm(Survived ~ Pclass + Sex + Fare, data = complete_data)

modelsummary(complete_model, title = "Regression Results: Before Imputation")
Regression Results: Before Imputation

              (1)
(Intercept)   1.064
              (0.059)
Pclass        -0.156
              (0.021)
Sexmale       -0.501
              (0.031)
Fare          0.000
              (0.000)
Num.Obs.      712
R2            0.366
R2 Adj.       0.364
AIC           692.2
BIC           715.0
Log.Lik.      -341.093
F             136.457
RMSE          0.39

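Regression model findings (complete cases only):

  • Being male lowered the predicted probability of survival by 0.501 (about 50 percentage points), holding class and fare constant.
  • Each one-step move down in class (a one-unit increase in Pclass) lowered the predicted probability of survival by 0.156.
  • The Fare coefficient rounds to 0.000, so fare adds almost nothing once class and sex are accounted for.

Because Survived is coded 0/1 and the model is fit with lm, this is a linear probability model, so the coefficients are changes in probability rather than in odds.

Model performance:

R² = 0.366, AIC = 692.2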
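Since the outcome is 0/1 and the model is fit with lm, fitted values can be read as predicted survival probabilities. The sketch below works through two cases by hand using the rounded coefficients from the complete-case table. The helper predict_survival is a hypothetical name, and Fare is left out on the assumption that its contribution is negligible, because its coefficient rounds to 0.000.

```r
# Coefficients copied from the complete-case table (rounded to three decimals).
b <- c(intercept = 1.064, pclass = -0.156, male = -0.501)

# Predicted survival probability; Fare is left out because its coefficient
# rounds to 0.000 in the table (an approximation, not the fitted value).
predict_survival <- function(pclass, male) {
  unname(b["intercept"] + b["pclass"] * pclass + b["male"] * male)
}

first_class_woman <- predict_survival(pclass = 1, male = 0)  # 1.064 - 0.156
third_class_man   <- predict_survival(pclass = 3, male = 1)  # 1.064 - 0.468 - 0.501

print(c(first_class_woman, third_class_man))
```

These come to roughly 0.91 and 0.10. A linear model does not force predictions to stay between 0 and 1, although the combinations of sex and class here all do.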

Multiple Imputation Using Amelia

DATA <- DATA %>%
  mutate(across(where(is.character), as.factor))
amelia_output <- amelia(
  DATA,
  m = 5,
  idvars = c("PassengerId", "Name", "Ticket"),  
  noms = c("Sex", "Embarked")  
)
## -- Imputation 1 --
## 
##   1  2  3  4
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6  7
## 
## -- Imputation 4 --
## 
##   1  2  3  4
## 
## -- Imputation 5 --
## 
##   1  2  3
imputed_data <- amelia_output$imputations[[1]]
amelia_output
## 
## Amelia output with 5 imputed datasets.
## Return code:  1 
## Message:  Normal EM convergence. 
## 
## Chain Lengths:
## --------------
## Imputation 1:  4
## Imputation 2:  7
## Imputation 3:  7
## Imputation 4:  4
## Imputation 5:  3
names(amelia_output)
##  [1] "imputations" "m"           "missMatrix"  "overvalues"  "theta"      
##  [6] "mu"          "covMatrices" "code"        "message"     "iterHist"   
## [11] "arguments"   "orig.vars"
# View() opens an interactive viewer, so only call it in an interactive session
if (interactive()) View(amelia_output$imputations$imp1)

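Following the approach described by Honaker et al. (2011), multiple imputation was performed using Amelia, which implements an expectation-maximization with bootstrapping (EMB) algorithm. The method assumes that the data are missing at random (MAR) (Honaker and King 2010). Amelia produced five imputed datasets; the regression below uses only the first of them (amelia_output$imputations[[1]]), whereas a full multiple-imputation analysis would fit the model to all five and pool the estimates.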
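Multiple imputation is usually completed by fitting the same model to every imputed dataset and combining the results with Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and its variance is the average within-dataset variance plus (1 + 1/m) times the between-dataset variance. The sketch below is a minimal base-R illustration, not the analysis run in this report. The function pool_lm and the simulated list toy_imputations are hypothetical names, and the toy data only mimic a list of imputed data frames.

```r
# Pool a linear model across m imputed data sets using Rubin's rules.
pool_lm <- function(imputations, formula) {
  fits <- lapply(imputations, function(d) lm(formula, data = d))
  est  <- sapply(fits, coef)                        # terms x m matrix of estimates
  vars <- sapply(fits, function(f) diag(vcov(f)))   # terms x m matrix of variances
  m    <- length(fits)

  q_bar   <- rowMeans(est)                 # pooled point estimates
  within  <- rowMeans(vars)                # average within-imputation variance
  between <- apply(est, 1, var)            # between-imputation variance
  total   <- within + (1 + 1 / m) * between

  data.frame(term = names(q_bar), estimate = q_bar,
             std.error = sqrt(total), row.names = NULL)
}

# Hypothetical stand-in for a list of five imputed data frames.
set.seed(1)
n <- 50
base <- data.frame(x = rnorm(n))
base$y <- 1 + 2 * base$x + rnorm(n)
toy_imputations <- lapply(1:5, function(i) {
  d <- base
  d$x[1:10] <- d$x[1:10] + rnorm(10, sd = 0.3)  # pretend these 10 values were imputed
  d
})

pooled <- pool_lm(toy_imputations, y ~ x)
print(pooled)
```

With the objects in this report, the analogous call would presumably be pool_lm(amelia_output$imputations, Survived ~ Pclass + Sex + Fare), since amelia_output$imputations is a list of data frames.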

Regression Analysis (After Imputation)

titanic_model_imputed <- lm(Survived ~ Pclass + Sex + Fare, data = imputed_data)

modelsummary(titanic_model_imputed, title = "Regression Results: After Imputation")
Regression Results: After Imputation

              (1)
(Intercept)   1.058
              (0.054)
Pclass        -0.151
              (0.019)
Sexmale       -0.514
              (0.028)
Fare          0.000
              (0.000)
Num.Obs.      891
R2            0.368
R2 Adj.       0.366
AIC           845.0
BIC           869.0
Log.Lik.      -417.513
F             172.185
RMSE          0.39

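Regression model findings (after imputation):

  • Being male lowered the predicted probability of survival by 0.514 (about 51 percentage points), holding class and fare constant.
  • Class still strongly predicted survival: each one-unit increase in Pclass lowered the predicted probability by 0.151.
  • The Fare coefficient again rounds to 0.000.

Model performance:

R² = 0.368, AIC = 845.0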

Comparing Models: Before vs After Imputation

# Comparison of the complete-case and imputed-data models
modelsummary(
  list(
    "Complete Case Model" = complete_model,
    "Imputed Data Model" = titanic_model_imputed
  ),
  title = "Comparison of Regression Models: Complete Cases vs Imputed Data"
)
Comparison of Regression Models: Complete Cases vs Imputed Data

              Complete Case Model   Imputed Data Model
(Intercept)   1.064                 1.058
              (0.059)               (0.054)
Pclass        -0.156                -0.151
              (0.021)               (0.019)
Sexmale       -0.501                -0.514
              (0.031)               (0.028)
Fare          0.000                 0.000
              (0.000)               (0.000)
Num.Obs.      712                   891
R2            0.366                 0.368
R2 Adj.       0.364                 0.366
AIC           692.2                 845.0
BIC           715.0                 869.0
Log.Lik.      -341.093              -417.513
F             136.457               172.185
RMSE          0.39                  0.39

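Both models produced similar results, but the imputed model:

  • Preserved the full sample size (891 observations rather than 712)
  • Reduced standard errors slightly (for example, 0.028 versus 0.031 for Sexmale)
  • Produced a marginally higher R² (0.368 versus 0.366)

AIC and BIC are not comparable across the two models because they are fit to different numbers of observations. Note also that Survived, Pclass, Sex, and Fare have no missing values, so the differences reflect the 179 rows that listwise deletion discarded rather than the imputed values themselves.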
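The drop in standard errors can be sanity-checked against sample size alone. As a rough rule of thumb (an assumption here, not something the models guarantee), standard errors scale with 1/sqrt(n), so moving from 712 to 891 observations should shrink them by about sqrt(712/891). The sketch below applies that factor to the complete-case standard errors from the comparison table.

```r
n_complete <- 712
n_full     <- 891

# Under a rough 1/sqrt(n) rule, standard errors should shrink by this factor.
shrink <- sqrt(n_complete / n_full)

# Standard errors copied from the complete-case column of the comparison table.
se_complete <- c(Intercept = 0.059, Pclass = 0.021, Sexmale = 0.031)
se_expected <- round(se_complete * shrink, 3)

print(round(shrink, 3))
print(se_expected)
```

The imputed-data column reports 0.054, 0.019, and 0.028, which is close to what the larger sample alone would suggest.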

Conclusion

Handling missing data carefully matters for valid statistical analysis. Listwise deletion would have discarded approximately 20% of the dataset, risking bias and reducing precision. Using Amelia keeps those observations in the analysis.

Across both analyses, gender and passenger class were strong predictors of survival aboard the Titanic. These relationships are consistent before and after multiple imputation, and the imputed-data model retains the full dataset, yields slightly smaller standard errors, and fits marginally better.

References

Acock, Alan C. 2005. “Working With Missing Values.” Journal of Marriage and Family 67 (4): 1012–28. https://doi.org/10.1111/j.1741-3737.2005.00191.x.

Honaker, James, and Gary King. 2010. “What to Do about Missing Values in Time-Series Cross-Section Data.” American Journal of Political Science 54 (2): 561–81. https://doi.org/10.1111/j.1540-5907.2010.00447.x.

Honaker, James, Gary King, and Matthew Blackwell. 2011. “Amelia II: A Program for Missing Data.” Journal of Statistical Software 45 (7). https://doi.org/10.18637/jss.v045.i07.

King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. 2001. “Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation.” American Political Science Review 95 (1): 49–69. https://doi.org/10.1017/S0003055401000235.

Titanic - Machine Learning from Disaster. n.d. https://kaggle.com/titanic.