2025-02-14

INTRODUCTION

I chose to look at one of R's built-in datasets, Titanic. It records the survival of the people aboard the Titanic and, once converted to a data frame, contains five variables: Class (1st, 2nd, 3rd, and Crew), Sex (Male and Female), Age (Child and Adult), Survived (No or Yes), and Freq (the number of people with each combination of the other four variables).

The Titanic was a famously “unsinkable” ship that struck an iceberg late on the night of April 14th, 1912, and sank in the early hours of April 15th, claiming the lives of over 1500 individuals on board.

Can we statistically confirm that survival on the Titanic was influenced by passenger characteristics (variables)? If yes, is there a single variable that affected survival rates more than others?

LOADING IN

Before getting to work, we must convert our Titanic dataset into a data frame, since it is originally structured as a four-way contingency table.

data("Titanic")
titanicdf <- as.data.frame(Titanic)
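
As a quick check on the conversion (a minimal sketch, not part of the original analysis), the data frame should have one row per combination of Class, Sex, Age, and Survived, with Freq holding the count for that combination:

```r
data("Titanic")
titanicdf <- as.data.frame(Titanic)

# One row per combination of the four categorical variables: 4 * 2 * 2 * 2 = 32
dim(titanicdf)

# The frequencies add up to the total number of people on board
sum(titanicdf$Freq)

# Peek at the first few rows
head(titanicdf)
```

Because each row is a group rather than a single person, every later summary has to sum or weight by Freq instead of counting rows.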

We should run some exploratory analysis on our dataset to get a feel for each variable. First, let's take a look at survival rates among all passengers, regardless of characteristics.

SURVIVAL DISTRIBUTION

##   Survived Freq
## 1       No 1490
## 2      Yes  711
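
The code behind this table is not shown on the slide. One call that produces a table of this shape, assuming the `titanicdf` data frame created earlier, is sketched here (the name `survival_totals` is only illustrative):

```r
data("Titanic")
titanicdf <- as.data.frame(Titanic)

# Sum the frequencies over Class, Sex, and Age, leaving one row per outcome
survival_totals <- aggregate(Freq ~ Survived, data = titanicdf, sum)
survival_totals

# Overall share of people who did and did not survive
round(survival_totals$Freq / sum(survival_totals$Freq), 3)
```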

SURVIVAL BY CHARACTERISTICS

Based on the previous slide, we can see that out of 2201 people on board (passengers and crew), only 711 (about 32%) survived.

Next, we should take a look at the overall stats of passengers by the variables Class, Sex, and Age to see if there are any large trends we can see before performing some statistical procedures.

SURVIVAL BY CLASS

For these summary statistics, I also wanted to include proportions, to make it easy to tell whether survival differed much between the Class groups: 1st, 2nd, 3rd, and crew.

# Total number of people in each class
csum <- aggregate(Freq ~ Class, data = titanicdf, sum)
colnames(csum)[2] <- "Total_Count"
# Counts by class and outcome (computed once, reused for both subsets)
by_class <- aggregate(Freq ~ Class + Survived, data = titanicdf, sum)
survived_class <- subset(by_class, Survived == "Yes")[, c("Class", "Freq")]
colnames(survived_class)[2] <- "Survived_Count"
death_class <- subset(by_class, Survived == "No")[, c("Class", "Freq")]
colnames(death_class)[2] <- "Death_Count"
# Combine the counts into one table with one row per class
csum <- merge(csum, survived_class, by = "Class")
csum <- merge(csum, death_class, by = "Class")
# Proportions within each class, plus each class's share of everyone on board
csum$Survival_Proportion <- round(csum$Survived_Count / csum$Total_Count, 3)
csum$Death_Proportion <- round(csum$Death_Count / csum$Total_Count, 3)
csum$Total_Proportion <- round(csum$Total_Count / sum(csum$Total_Count), 3)
csum
##   Class Total_Count Survived_Count Death_Count Survival_Proportion
## 1   1st         325            203         122               0.625
## 2   2nd         285            118         167               0.414
## 3   3rd         706            178         528               0.252
## 4  Crew         885            212         673               0.240
##   Death_Proportion Total_Proportion
## 1            0.375            0.148
## 2            0.586            0.129
## 3            0.748            0.321
## 4            0.760            0.402

SURVIVAL BY CLASS (cont.)

This code calculates the total number of people, survivors, and deaths for each class, and computes the proportions of survival and death within each class. It also calculates each class's share of everyone on board, so we can clearly see how large each group is.


SURVIVAL BY CLASS (cont.)

Based on our table results, we can see that 1st class had the highest survival rate and the crew had the lowest (with 3rd class close behind). We can visualize these results as well:
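
The plotting code is not included on the slide. A minimal sketch of one way to draw such a chart with base R graphics follows; the choice of `barplot()` and the object names are assumptions, not necessarily what was used for the original figure.

```r
data("Titanic")
titanicdf <- as.data.frame(Titanic)

# Cross-tabulate class by outcome, then convert counts to within-class proportions
class_tab <- xtabs(Freq ~ Class + Survived, data = titanicdf)
class_prop <- prop.table(class_tab, margin = 1)

# Bar chart of the survival proportion for each class
barplot(class_prop[, "Yes"],
        ylim = c(0, 1),
        xlab = "Class",
        ylab = "Proportion survived",
        main = "Survival Rate by Class")
```

Swapping `Class` for `Sex` or `Age` in the `xtabs()` formula gives the corresponding chart for the other two variables.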

SURVIVAL BY GENDER

I used the exact same table methods for survival by gender, again including all of the related proportions:

##      Sex Total_Count Survived_Count Death_Count Survival_Proportion
## 1 Female         470            344         126               0.732
## 2   Male        1731            367        1364               0.212
##   Death_Proportion Total_Proportion
## 1            0.268            0.214
## 2            0.788            0.786
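
Since the same table is built for Class, Sex, and Age, the steps can be wrapped in one function. This is a sketch only: `survival_summary()` is a hypothetical helper, not a function from the original analysis, and it uses `xtabs()` rather than `aggregate()` and `merge()`.

```r
data("Titanic")
titanicdf <- as.data.frame(Titanic)

# Hypothetical helper: summarise survival for any one grouping variable
survival_summary <- function(df, group) {
  # Two-way table of the grouping variable by outcome, weighted by Freq
  tab <- xtabs(as.formula(paste("Freq ~", group, "+ Survived")), data = df)
  out <- data.frame(
    Group          = rownames(tab),
    Total_Count    = as.vector(rowSums(tab)),
    Survived_Count = as.vector(tab[, "Yes"]),
    Death_Count    = as.vector(tab[, "No"])
  )
  out$Survival_Proportion <- round(out$Survived_Count / out$Total_Count, 3)
  out$Death_Proportion    <- round(out$Death_Count / out$Total_Count, 3)
  out$Total_Proportion    <- round(out$Total_Count / sum(out$Total_Count), 3)
  names(out)[1] <- group
  out
}

survival_summary(titanicdf, "Sex")
survival_summary(titanicdf, "Age")
```

Rows come out in factor-level order (Male before Female, Child before Adult), so the row order can differ from the merged tables shown on these slides even though the numbers are the same.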

Based on our gender table, we can see that women were more likely to survive than men. However, an important observation to make is the total proportion of men and women on board: there were far more men than women.

## [1] 268.2979

That is, there were approximately 268% more men than women on board (1731 versus 470, or roughly 3.7 times as many).
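
The arithmetic behind that figure can be sketched as follows, using the Total_Count values from the gender table (the variable names are illustrative):

```r
men   <- 1731  # Total_Count for Male in the gender table
women <- 470   # Total_Count for Female in the gender table

# Percentage by which the number of men exceeds the number of women
(men - women) / women * 100

# Equivalent view: how many men there were per woman
men / women
```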

SURVIVAL BY GENDER (cont.)

Here is our plot of survival by gender:

SURVIVAL BY GENDER - PRIORITIZE WOMEN AND CHILDREN

It is widely known that the evacuation of the Titanic was supposed to prioritize women and children, and the survival rate of women appears consistent with that. However, since there were far fewer women than men on board, it could also have been “easier” for women to evacuate simply because there were fewer of them. Just a fact to ponder!

SURVIVAL BY AGE GROUP

I performed the same sort of analysis on age group (adult and child):

##     Age Total_Count Survived_Count Death_Count Survival_Proportion
## 1 Adult        2092            654        1438               0.313
## 2 Child         109             57          52               0.523
##   Death_Proportion Total_Proportion
## 1            0.687             0.95
## 2            0.477             0.05

Only about half of the children on board survived. As with men vs. women, the groups are very unequal in size: adults made up about 95% of the people on board. Still, children had a higher survival rate than adults (52.3% vs. 31.3%).

SURVIVAL BY AGE GROUP (cont.)

Here is the corresponding visualization:

STATISTICAL ANALYSIS - CHI-SQUARE TEST FOR INDEPENDENCE

Based on our previous analysis, we suspect that the characteristics of individuals affected survival rates: specifically, women, children, and members of first class had the highest survival rates. How can we test this statistically?

I decided to go with the Chi-Square test for independence, which tests whether there is a statistically significant relationship between two categorical variables. In other words, we are testing whether survival depends on class, gender, and age, one variable at a time.

STATISTICAL ANALYSIS - CHI-SQUARE TEST FOR INDEPENDENCE (cont.)

Some important aspects of the chi-square test:

  • X-squared: measures how far the observed counts are from the counts we would expect if survival were unrelated to the variable. The larger the value (relative to the degrees of freedom), the stronger the evidence against our null hypothesis.

  • df (degrees of freedom): the number of independent comparisons, which is (rows - 1) × (columns - 1) for a contingency table

  • p-value (probability value): the probability of seeing a deviation at least this large if the null hypothesis were true. If p-value < 0.05, we reject our null hypothesis.

  • NULL HYPOTHESIS: survival is independent of the variable being tested (class, gender, or age)
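
To make these quantities concrete, here is a sketch that computes them by hand for the Class table, where no continuity correction is applied. The object names are illustrative; only base R functions are used.

```r
data("Titanic")
titanicdf <- as.data.frame(Titanic)

observed <- xtabs(Freq ~ Class + Survived, data = titanicdf)

# Expected counts if survival were independent of class:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's X-squared: sum of (observed - expected)^2 / expected over all cells
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x_squared, df = dof, lower.tail = FALSE)

c(X_squared = x_squared, df = dof, p_value = p_value)
```

The hand-computed statistic should agree with `chisq.test(observed)`. For 2 x 2 tables such as Sex or Age by Survived, `chisq.test()` applies Yates' continuity correction by default, so its statistic will be slightly smaller than this uncorrected formula gives.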

STATISTICAL ANALYSIS - MOSAIC PLOTS

Mosaic plots are visual representations of the contingency tables behind chi-square tests. In this case, they will show how survival is distributed across our different characteristics. Some important aspects of mosaic plots:

  • Box size represents the number of people in each group

  • Colors are important: blue = more people in that box than expected under independence, red = fewer than expected, and grey = close to the expected value, with no big deviation between expected and actual.
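
The shading is based on each cell's Pearson residual. As a sketch, the residuals can be inspected directly from `chisq.test()`; my understanding is that the default shading in the plotting package tints cells whose residual is beyond roughly ±2 and colors them fully beyond roughly ±4, but treat those cutoffs as an assumption and check the package documentation.

```r
data("Titanic")
titanicdf <- as.data.frame(Titanic)

sex_tab <- xtabs(Freq ~ Sex + Survived, data = titanicdf)

# Pearson residuals: (observed - expected) / sqrt(expected), one per cell
pearson_res <- chisq.test(sex_tab)$residuals
round(pearson_res, 2)

# Positive residual = more people in that cell than independence predicts
pearson_res["Female", "Yes"] > 0
pearson_res["Male", "Yes"] < 0
```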

CHI-SQUARE - GENDER

survtble <- xtabs(Freq ~ Sex + Survived, data = titanicdf)
chisex <- chisq.test(survtble)
chisex
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  survtble
## X-squared = 454.5, df = 1, p-value < 2.2e-16

Our X-squared value is high and our p-value is extremely low, meaning we reject our null hypothesis, and therefore can conclude that the relationship between gender and survival is statistically significant.

CHI-SQUARE - GENDER VISUALIZATION

library(vcd)  # mosaic() with shading comes from the vcd package
mosaic(~ Sex + Survived, data = titanicdf, shade = TRUE, legend = TRUE,
       main = "Survival by Gender")

CHI-SQUARE - CLASS

clastbl <- xtabs(Freq ~ Class + Survived, data = titanicdf)
chiclass <- chisq.test(clastbl)
chiclass
## 
##  Pearson's Chi-squared test
## 
## data:  clastbl
## X-squared = 190.4, df = 3, p-value < 2.2e-16

Since our X-squared value is high and our p-value is < 0.05, we reject our null hypothesis and conclude that the relationship between survival and class is statistically significant.

CHI-SQUARE - CLASS VISUALIZATION

mosaic(~ Class + Survived, data = titanicdf, shade = TRUE, legend = TRUE,
       main = "Survival by Class")

CHI-SQUARE - AGE

agetbl <- xtabs(Freq ~ Age + Survived, data = titanicdf)
chiage <- chisq.test(agetbl)
chiage
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  agetbl
## X-squared = 20.005, df = 1, p-value = 7.725e-06

Since our X-squared value is high and our p-value is < 0.05, we reject our null hypothesis and conclude that the relationship between survival and age is statistically significant. However, the X-squared value is much smaller than those for class and sex, so the evidence for an age effect, while still strong, is less overwhelming.

CHI-SQUARE - AGE VISUALIZATION

mosaic(~ Age + Survived, data = titanicdf, shade = TRUE, legend = TRUE,
       main = "Survival by Age Group (Chi-Square Visualization)")

LOGISTIC REGRESSION MODEL

To conclude our statistical analysis, I decided to use a logistic regression model to predict the probability of survival.

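\[ \log\left(\frac{P(\text{Survived})}{1 - P(\text{Survived})}\right) = \beta_0 + \beta_1(\text{Class}) + \beta_2(\text{Sex}) + \beta_3(\text{Age}) \]

  • P(Survived): probability of survival

  • 1 - P(Survived): probability of dying

  • log(P(Survived) / (1 - P(Survived))): log-odds of survival

  • β0: intercept, the log-odds of survival for the reference group (a male child in 1st class)

  • β1: class, effect of class on survival (one coefficient per class, each relative to 1st class)

  • β2: sex, effect of gender on survival (female relative to male)

  • β3: age, effect of age group on survival (adult relative to child)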

REGRESSION MODEL

## 
## Call:
## glm(formula = Survived ~ Class + Sex + Age, family = binomial, 
##     data = titanicdf, weights = Freq)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.6853     0.2730   2.510   0.0121 *  
## Class2nd     -1.0181     0.1960  -5.194 2.05e-07 ***
## Class3rd     -1.7778     0.1716 -10.362  < 2e-16 ***
## ClassCrew    -0.8577     0.1573  -5.451 5.00e-08 ***
## SexFemale     2.4201     0.1404  17.236  < 2e-16 ***
## AgeAdult     -1.0615     0.2440  -4.350 1.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2769.5  on 23  degrees of freedom
## Residual deviance: 2210.1  on 18  degrees of freedom
## AIC: 2222.1
## 
## Number of Fisher Scoring iterations: 5
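
The fitting code is hidden on the slide, but the Call line in the output gives the formula, family, data, and weights. A sketch of the corresponding call, with the object named `model` to match the `exp(coef(model))` call used in the regression analysis:

```r
data("Titanic")
titanicdf <- as.data.frame(Titanic)

# Each row is a group of people, so Freq is supplied as a weight
model <- glm(Survived ~ Class + Sex + Age,
             family = binomial,
             data = titanicdf,
             weights = Freq)
summary(model)
```

Because the factor levels of Survived are No then Yes, the model describes the probability of "Yes", and the first level of each predictor (1st, Male, Child) acts as the reference group.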

REGRESSION ANALYSIS

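  • 1st class had the highest probability of survival: every other class has a negative coefficient relative to 1st class

  • Women had a much higher chance of survival than men: about 11 times the odds of survival, holding class and age constant

  • Children had a higher chance of survival than adults: adults had about 0.35 times the odds of survival of children, holding class and sex constant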

exp(coef(model))
## (Intercept)    Class2nd    Class3rd   ClassCrew   SexFemale    AgeAdult 
##   1.9844057   0.3612825   0.1690159   0.4241466  11.2465380   0.3459219

PREDICTED SURVIVAL PROBABILITY - CLASS AND GENDER
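
The code for this slide is not shown. One way to obtain predicted probabilities for each class and gender combination is sketched below; holding Age at "Adult" and the name `grid` are assumptions made for illustration.

```r
data("Titanic")
titanicdf <- as.data.frame(Titanic)
model <- glm(Survived ~ Class + Sex + Age, family = binomial,
             data = titanicdf, weights = Freq)

# Every Class x Sex combination, for adults (assumption: Age held at "Adult")
grid <- expand.grid(Class = levels(titanicdf$Class),
                    Sex   = levels(titanicdf$Sex),
                    Age   = "Adult")

# type = "response" returns probabilities rather than log-odds
grid$Predicted_Survival <- round(predict(model, newdata = grid,
                                         type = "response"), 3)
grid
```

Each predicted value is the inverse logit of the summed coefficients for that row, so the same numbers can be reproduced by hand with `plogis()`.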

CONCLUSION

Based on both our exploratory and statistical analyses, we can conclude that survival on the Titanic was dependent on passenger characteristics: class, sex, and age.

The variables that affected survival the most were class and sex; age had a smaller, though still statistically significant, effect.