1 Intro: Understanding risk factors for Titanic survival

The goal of this work is to explore risk factors and create an analysis for survival likelihood of an individual passenger on the ship Titanic which sunk in 1912. The sinking of the Titanic was one of the deadliest, most infamous shipwrecks in history. The Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Many factors are involved in understanding who survived this shipwreck and why. In addition to sheer luck, some groups of people such as women, children, and the upper-class were more likely to survive than others. There are many systemic and social factors that contributed to being or becoming “upper-class” such as race and education. The sinking of the Titanic is a prime example of how privilege can compound and continue to give benefit beyond expectation.

1.1 Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
## Warning: package 'psych' was built under R version 4.4.2
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(RWeka) 
## Warning: package 'RWeka' was built under R version 4.4.2
library(dplyr)
library(stringr)
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(ggplot2)
library(broom)

1.2 Import CSV files

The recommended Titanic CSV file does not include any indicators of race/ethnicity. I believed this be a potential factor in survival so I found another CSV that contains country to help explore if country of origin (as a shorthand for race) had an impact. This author acknowledges that it is an incomplete shorthand, but was the most readily available information. Race itself may be an understudied component of this data set.
I hoped to join the country column back into the recommended ‘titanic.csv’ data set.

1.2.1 Attempt to join datasets

titanic <- read.csv("titanic.csv", stringsAsFactors = FALSE)
titanicwnation <- read.csv("titanicwnation.csv", stringsAsFactors = FALSE)
# Clean the name columns in both datasets
titanic <- titanic %>%
  mutate(Name_clean = str_to_lower(str_trim(Name)))

titanicwnation <- titanicwnation %>%
  mutate(name_clean = str_to_lower(str_trim(name)))

# Join using cleaned name columns
titanic <- titanic %>%
  left_join(titanicwnation %>% select(name_clean, country), by = c("Name_clean" = "name_clean"))
# Count NAs in titanic$Name
sum(is.na(titanic$country))
## [1] 435
# Count NAs in titanicwnation$name
sum(is.na(titanicwnation$country))
## [1] 81

1.2.2 Determined ‘titanicwnation’ to be the more complete dataset

Because of differences in the formatting of the Names column, the join was lackluster. As seen above, the join resulted in significantly more null values in the ‘country’ column than existed in the ‘titanicwnation’ data set. ‘titanicwnation’ also has more entries. For these reasons I have elected to forgo ‘titanic.csv’ in favor of ‘titanicwnation.csv’. I am changing the name of ‘titanic’ to ‘titanic_old’ and ‘titanicwnation’ to simply ‘titanic’

# Rename 'titanic' to 'titanic_old'
titanic_old <- titanic

# Rename 'titanicwnation' to 'titanic'
titanic <- titanicwnation

1.3 Variable Descriptions:

 VARIABLE DESCRIPTIONS:
 survived        Survival (Yes,No)
 class          Passenger Class(1 = 1st; 2 = 2nd; 3 = 3rd) (class is a proxy for socio-economic status (SES)
                     1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower)(Additionally the addtional dataset selected later includes staff, and where they worked)
 name            Name
 gender          Gender
 age             Age (Age is in Years; Fractional if Age less than One (1) If the Age is Estimated, it is in the form xx.5)
 sibsp           Number of Siblings/Spouses Aboard
 parch           Number of Parents/Children Aboard
 ticket          Ticket Number
 fare            Passenger Fare
 embarked        Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
 country         Country of origin (ADDED VIA JOIN)


2 Explore the structure of data:

2.1 Examine the overall data frame

titanic %>% str()
## 'data.frame':    2207 obs. of  12 variables:
##  $ name      : chr  "Abbing, Mr. Anthony" "Abbott, Mr. Eugene Joseph" "Abbott, Mr. Rossmore Edward" "Abbott, Mrs. Rhoda Mary 'Rosa'" ...
##  $ gender    : chr  "male" "male" "male" "female" ...
##  $ age       : num  42 13 16 39 16 25 30 28 27 20 ...
##  $ class     : chr  "3rd" "3rd" "3rd" "3rd" ...
##  $ embarked  : chr  "S" "S" "S" "S" ...
##  $ country   : chr  "United States" "United States" "United States" "England" ...
##  $ ticketno  : int  5547 2673 2673 2673 348125 348122 3381 3381 2699 3101284 ...
##  $ fare      : num  7.11 20.05 20.05 20.05 7.13 ...
##  $ sibsp     : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ parch     : int  0 2 1 1 0 0 0 0 0 0 ...
##  $ survived  : chr  "no" "no" "no" "yes" ...
##  $ name_clean: chr  "abbing, mr. anthony" "abbott, mr. eugene joseph" "abbott, mr. rossmore edward" "abbott, mrs. rhoda mary 'rosa'" ...

2.2 Show the head and tail rows of a data frame

titanic %>% head()
##                             name gender age class embarked       country
## 1            Abbing, Mr. Anthony   male  42   3rd        S United States
## 2      Abbott, Mr. Eugene Joseph   male  13   3rd        S United States
## 3    Abbott, Mr. Rossmore Edward   male  16   3rd        S United States
## 4 Abbott, Mrs. Rhoda Mary 'Rosa' female  39   3rd        S       England
## 5    Abelseth, Miss. Karen Marie female  16   3rd        S        Norway
## 6  Abelseth, Mr. Olaus Jørgensen   male  25   3rd        S United States
##   ticketno  fare sibsp parch survived                     name_clean
## 1     5547  7.11     0     0       no            abbing, mr. anthony
## 2     2673 20.05     0     2       no      abbott, mr. eugene joseph
## 3     2673 20.05     1     1       no    abbott, mr. rossmore edward
## 4     2673 20.05     1     1      yes abbott, mrs. rhoda mary 'rosa'
## 5   348125  7.13     0     0      yes    abelseth, miss. karen marie
## 6   348122  7.13     0     0      yes  abelseth, mr. olaus jørgensen
titanic %>% tail()
##                          name gender age            class embarked country
## 2202  Wyeth, Mr. James Robert   male  26 engineering crew        S England
## 2203         Wynn, Mr. Walter   male  41        deck crew        B England
## 2204      Yearsley, Mr. Harry   male  40 victualling crew        S England
## 2205 Young, Mr. Francis James   male  32 engineering crew        S England
## 2206      Zanetti, Sig. Minio   male  20 restaurant staff        S England
## 2207       Zarracchi, Sig. L.   male  26 restaurant staff        S England
##      ticketno fare sibsp parch survived               name_clean
## 2202       NA   NA    NA    NA       no  wyeth, mr. james robert
## 2203       NA   NA    NA    NA      yes         wynn, mr. walter
## 2204       NA   NA    NA    NA      yes      yearsley, mr. harry
## 2205       NA   NA    NA    NA       no young, mr. francis james
## 2206       NA   NA    NA    NA       no      zanetti, sig. minio
## 2207       NA   NA    NA    NA       no       zarracchi, sig. l.

2.3 Summary

titanic %>% summary()
##      name              gender               age             class          
##  Length:2207        Length:2207        Min.   : 0.1667   Length:2207       
##  Class :character   Class :character   1st Qu.:22.0000   Class :character  
##  Mode  :character   Mode  :character   Median :29.0000   Mode  :character  
##                                        Mean   :30.4444                     
##                                        3rd Qu.:38.0000                     
##                                        Max.   :74.0000                     
##                                        NA's   :2                           
##    embarked           country             ticketno            fare        
##  Length:2207        Length:2207        Min.   :      2   Min.   :  3.030  
##  Class :character   Class :character   1st Qu.:  14262   1st Qu.:  7.181  
##  Mode  :character   Mode  :character   Median : 111427   Median : 14.090  
##                                        Mean   : 284216   Mean   : 33.405  
##                                        3rd Qu.: 347077   3rd Qu.: 31.061  
##                                        Max.   :3101317   Max.   :512.061  
##                                        NA's   :891       NA's   :916      
##      sibsp            parch          survived          name_clean       
##  Min.   :0.0000   Min.   :0.0000   Length:2207        Length:2207       
##  1st Qu.:0.0000   1st Qu.:0.0000   Class :character   Class :character  
##  Median :0.0000   Median :0.0000   Mode  :character   Mode  :character  
##  Mean   :0.4996   Mean   :0.3856                                        
##  3rd Qu.:1.0000   3rd Qu.:0.0000                                        
##  Max.   :8.0000   Max.   :9.0000                                        
##  NA's   :900      NA's   :900

3 Data transformation/cleaning

3.1 Remove columns

dropped_col_df <- titanic %>% select(name_clean)
titanic <- titanic %>% select(-name_clean)

#Check that column was dropped successfully:
titanic %>% str()
## 'data.frame':    2207 obs. of  11 variables:
##  $ name    : chr  "Abbing, Mr. Anthony" "Abbott, Mr. Eugene Joseph" "Abbott, Mr. Rossmore Edward" "Abbott, Mrs. Rhoda Mary 'Rosa'" ...
##  $ gender  : chr  "male" "male" "male" "female" ...
##  $ age     : num  42 13 16 39 16 25 30 28 27 20 ...
##  $ class   : chr  "3rd" "3rd" "3rd" "3rd" ...
##  $ embarked: chr  "S" "S" "S" "S" ...
##  $ country : chr  "United States" "United States" "United States" "England" ...
##  $ ticketno: int  5547 2673 2673 2673 348125 348122 3381 3381 2699 3101284 ...
##  $ fare    : num  7.11 20.05 20.05 20.05 7.13 ...
##  $ sibsp   : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ parch   : int  0 2 1 1 0 0 0 0 0 0 ...
##  $ survived: chr  "no" "no" "no" "yes" ...

3.2 Factor categorical columns

titanic$survived <- factor(titanic$survived)
titanic$gender <- factor(titanic$gender)
titanic$class <- factor(titanic$class)
titanic$embarked <- factor(titanic$embarked)


#Check to make sure factorization was successful

titanic %>% str()
## 'data.frame':    2207 obs. of  11 variables:
##  $ name    : chr  "Abbing, Mr. Anthony" "Abbott, Mr. Eugene Joseph" "Abbott, Mr. Rossmore Edward" "Abbott, Mrs. Rhoda Mary 'Rosa'" ...
##  $ gender  : Factor w/ 2 levels "female","male": 2 2 2 1 1 2 2 1 2 2 ...
##  $ age     : num  42 13 16 39 16 25 30 28 27 20 ...
##  $ class   : Factor w/ 7 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 2 2 3 3 ...
##  $ embarked: Factor w/ 4 levels "B","C","Q","S": 4 4 4 4 4 4 2 2 2 4 ...
##  $ country : chr  "United States" "United States" "United States" "England" ...
##  $ ticketno: int  5547 2673 2673 2673 348125 348122 3381 3381 2699 3101284 ...
##  $ fare    : num  7.11 20.05 20.05 20.05 7.13 ...
##  $ sibsp   : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ parch   : int  0 2 1 1 0 0 0 0 0 0 ...
##  $ survived: Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 2 2 ...

3.3 Find NAs

titanic %>% summarize(across(everything(), ~ sum(is.na(.))))
##   name gender age class embarked country ticketno fare sibsp parch survived
## 1    0      0   2     0        0      81      891  916   900   900        0

3.4 Drop rows with null values

dropped_titanic <- titanic %>% drop_na()
nrow(dropped_titanic) 
## [1] 1213

4 Understanding numeric variables

#gender, age
titanic %>% select(gender,age) %>% summary()
##     gender          age         
##  female: 489   Min.   : 0.1667  
##  male  :1718   1st Qu.:22.0000  
##                Median :29.0000  
##                Mean   :30.4444  
##                3rd Qu.:38.0000  
##                Max.   :74.0000  
##                NA's   :2
#gender, age, fare
titanic %>% select(gender,age,fare) %>% summary()
##     gender          age               fare        
##  female: 489   Min.   : 0.1667   Min.   :  3.030  
##  male  :1718   1st Qu.:22.0000   1st Qu.:  7.181  
##                Median :29.0000   Median : 14.090  
##                Mean   :30.4444   Mean   : 33.405  
##                3rd Qu.:38.0000   3rd Qu.: 31.061  
##                Max.   :74.0000   Max.   :512.061  
##                NA's   :2         NA's   :916

Insights:

4.1 Quantile exploration

# Quantile ex. fare
titanic %>% pull(fare) %>% quantile(., seq(from = 0, to = 1, by = 0.20), na.rm = TRUE)
##       0%      20%      40%      60%      80%     100% 
##   3.0305   7.1706  11.0208  22.0702  42.0800 512.0607
titanic %>% pull(fare) %>% quantile(., seq(from = 0, to = 1, by = 0.10), na.rm = TRUE)
##       0%      10%      20%      30%      40%      50%      60%      70% 
##   3.0305   7.1408   7.1706   8.0100  11.0208  14.0902  22.0702  27.1405 
##      80%      90%     100% 
##  42.0800  78.0504 512.0607
# Quantile ex. age
titanic %>% pull(age) %>% quantile(., seq(from = 0, to = 1, by = 0.20), na.rm = TRUE)
##         0%        20%        40%        60%        80%       100% 
##  0.1666667 21.0000000 26.0000000 32.0000000 40.0000000 74.0000000
titanic %>% pull(age) %>% quantile(., seq(from = 0, to = 1, by = 0.10), na.rm = TRUE)
##         0%        10%        20%        30%        40%        50%        60% 
##  0.1666667 18.0000000 21.0000000 24.0000000 26.0000000 29.0000000 32.0000000 
##        70%        80%        90%       100% 
## 36.0000000 40.0000000 46.0000000 74.0000000

Insights:

  • Quantile analysis of fares indicates that the majority (80%) of fares were $42 or less. The max fare, $512, is somewhat of an outlier.

  • Quantile analysis of ages indicates that the majority (80%) of ages were 36-40 years or less.

4.2 Boxplot of fare

#ggplot
dropped_titanic %>% 
 ggplot(aes(x= fare)) + 
 geom_boxplot() +
 ggtitle('boxplot of Fare')

4.3 Histogram of fare

# histogram of a numeric variable, fare:

hist(dropped_titanic$fare, main = "Histogram of Fare in the titanic data set",
     xlab = "Fare")

Insights:

  • Both of these reaffirm the prior insight that fare is grouped largely between $40 and $100 with a tendency towards $40

4.4 Histogram of age

dropped_titanic %>% ggplot() +
  geom_histogram(aes(x=age),binwidth = 20) +
  ggtitle('Histogram of Age in the titanic data set')

Insights:

  • Reaffirms the insight of a skew towards middle age; helps to establish child passengers as more common than elderly passengers.

4.5 Variance and standard deviation for fare

#base R
var(dropped_titanic$fare)
## [1] 2324.664
sd(dropped_titanic$fare)
## [1] 48.21477

Insights:

  • Shows wide variance in fare pricing.

4.6 Variance and standard deviation for age

dropped_titanic %>% pull(age) %>% var()
## [1] 196.6936
dropped_titanic %>% pull(age) %>% sd()
## [1] 14.02475

Insights:

  • High variance in age, but less severe than fare.

5 Explore relationship of multiple variables

# cor,  boxplot, 2D scatter plot - plot, 3D scatter plot


# ggplot
dropped_titanic %>% ggplot() +
  geom_point(aes(x=age,y=fare))

cov(dropped_titanic[,c("fare","age")]) 
##           fare      age
## fare 2324.6644 126.9543
## age   126.9543 196.6936
var(dropped_titanic[,c("fare","age")])
##           fare      age
## fare 2324.6644 126.9543
## age   126.9543 196.6936

Insights:

5.1 pairs.panels

# Generate 2D scatter plots and correlation coefficients

dropped_titanic %>% select(where(is.numeric)) %>% pairs.panels()

Insights: Pairs tells us much we already know-

  • Fare tends to increase slightly with age, but not strongly.

  • Passengers with more siblings/spouses (SibSp) tend to have more parents/children (Parch) — makes sense for family groups.

  • Ticket number has weak or no meaningful correlation with other variables — just a random identifier.

dropped_col_df <- titanic %>% select(ticketno)
titanic <- titanic %>% select(-ticketno)

5.2 Rate of survival by country

# Count of survived by country
titanic %>%
  group_by(country, survived) %>%
  summarise(count = n(), .groups = "drop")
## # A tibble: 79 × 3
##    country   survived count
##    <chr>     <fct>    <int>
##  1 Argentina no           7
##  2 Australia no           5
##  3 Austria   no           2
##  4 Austria   yes          1
##  5 Belgium   no          17
##  6 Belgium   yes          5
##  7 Bosnia    no           4
##  8 Bulgaria  no          19
##  9 Canada    no          19
## 10 Canada    yes         15
## # ℹ 69 more rows
titanic %>%
  group_by(country) %>%
  summarise(
    total = n(),
    survived = sum(survived == "yes"),  # or survived == 1 if it's numeric
    survival_ratio = survived / total,
    .groups = "drop"
  )
## # A tibble: 49 × 4
##    country         total survived survival_ratio
##    <chr>           <int>    <int>          <dbl>
##  1 Argentina           7        0          0    
##  2 Australia           5        0          0    
##  3 Austria             3        1          0.333
##  4 Belgium            22        5          0.227
##  5 Bosnia              4        0          0    
##  6 Bulgaria           19        0          0    
##  7 Canada             34       15          0.441
##  8 Channel Islands    17        8          0.471
##  9 China/Hong Kong     1        1          1    
## 10 Croatia            11        0          0    
## # ℹ 39 more rows

Insights from country are interesting but may not be large enough samples to be significant. For example, Italy is showing a 100% survival rate– because the one Italian person who exists in the data survived. This data also sometimes fails to show a full portrait of an individual’s life. For example, Mr. Joseph Philippe Lemercier Laroche– the only verified Black passenger on the Titanic, is listed as French because he lived in France when the ticket was purchased. Mr. Laroche was born in Haiti, and married a French white woman named Juliette. The two had been living in France with their two daughters and travelling to Haiti so their youngest could be born in his father’s homeland. It is believed he was only granted passage on the Titanic because the family had booked passage with another liner and transferred their tickets to the Titanic. Mr. Laroche’s family survived and he did not. One account tells of Mr.Laroche being barred from attempts to hand his baby daughter to his wife when she was in the escape boat. The account states the the white boatman barring Mr.Laroche didn’t realize the parentage of the child and appeared to be stopping Mr.Laroche because of his Blackness. Mr. Laroche has been mistreated by the boatmen and is now mistreated by the creators of the datasets who either leave out ancestry entirely or who erase the full context of his story.

5.2.1 Change country for Mr.Laroche from France to Haiti

dropped_titanic <- dropped_titanic %>%
  mutate(country = ifelse(name == "Laroche, Mr. Joseph Philippe Lemercier", "Haiti", country))

5.3 Barplots of class and gender

#tidyveryse plot of class
titanic %>% ggplot() +
  geom_bar(aes(x=class)) +
  ggtitle("Barplot of class")

#tidyveryse plot of gender
titanic %>% ggplot() +
  geom_bar(aes(x=gender)) +
  ggtitle("Barplot of gender")

Insights:

  • There were significantly more men aboard the Titanic than women. This is important to note for later data that shows survival rates of the sexes.

  • The highest single passenger class is 3rd – this class has the cheapest tickets and some sources indicate that some workers (especially workers of color) had to pay their own 3rd class fare in order to work the ship. This inequity is relevant to the low survival rates for 3rd class passengers.

6 Exploring Categorical grouped by Categorical (Factor by Factor)

titanic %>% select(survived,class) %>% table()
##         class
## survived 1st 2nd 3rd deck crew engineering crew restaurant staff
##      no  123 166 528        23              253               66
##      yes 201 118 181        43               71                3
##         class
## survived victualling crew
##      no               337
##      yes               94
titanic %>% select(survived,class) %>% table() %>% prop.table() %>% round(2)
##         class
## survived  1st  2nd  3rd deck crew engineering crew restaurant staff
##      no  0.06 0.08 0.24      0.01             0.11             0.03
##      yes 0.09 0.05 0.08      0.02             0.03             0.00
##         class
## survived victualling crew
##      no              0.15
##      yes             0.04
titanic %>% select(survived,class) %>% table() %>% prop.table() %>% round(2) * 100
##         class
## survived 1st 2nd 3rd deck crew engineering crew restaurant staff
##      no    6   8  24         1               11                3
##      yes   9   5   8         2                3                0
##         class
## survived victualling crew
##      no                15
##      yes                4

Insights:

6.1 Barplots of gender and class grouped by Survived

#tidyveryse barplot for passenger class by survived
titanic %>% ggplot() +
  geom_bar(aes(x=class,fill=survived),position="dodge") +
  ggtitle("Barplot of Class by Survived")

#tidyverse barplot for gender by survived
titanic %>% ggplot() +
  geom_bar(aes(x=gender,fill=survived),position="dodge") +
  ggtitle("Barplot of Gender by Survived")

titanic %>% ggplot() +
  geom_bar(aes(x=gender,fill= class),position="dodge") +
  ggtitle("Barplot of Gender by class")

summary_data <- dropped_titanic %>%
  group_by(class, gender) %>%
  summarize(survival_rate = mean(survived == "yes"))
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.
ggplot(summary_data, aes(x = class, y = survival_rate, fill = gender)) +
  geom_col(position = "dodge") +
  labs(title = "Survival Rates by Class and Gender",
       y = "Survival Rate", x = "Passenger Class") +
  scale_fill_brewer(palette = "Pastel1", name = "Gender") +
  theme_classic()

Insights:

  • Reinforces that 1st class had high survival

  • Visually clarifies that victualing crew had high survival among crew members

  • Shows strong correlation with gender (Female) and survival. Similar numbers of men and women survived (slightly more women) and as mentioned earlier, men significantly outnumbered women

  • Age was not much of a factor in survival

7 Exploring numeric variables by factors

7.1 Boxplot numeric values by fare, age, and all factors

# survived by age
dropped_titanic %>% 
  ggplot() +
  geom_boxplot(aes(x=survived,y= age)) +
  ggtitle('Survived by age')

dropped_titanic %>% 
  ggplot() +
  geom_boxplot(aes(x=survived,y= fare)) +
  ggtitle('Survived by Fare')

ggplot(dropped_titanic, aes(x = survived, y = age, fill = gender)) +
  geom_boxplot(outlier.shape = 21) +
  facet_wrap(~ class) +
  labs(title = "Age, Gender, and Survival by Class", y = "Age", x = "Survived") +
  scale_fill_manual(values = c("goldenrod", "lightblue")) +
  theme_minimal()

Insights:

  • Survival is somewhat tied to fare but the large number of 3rd class tickets sold and the variance in fare cost even among 1st class tickets muddy the plot slightly. What is clear is that those who paid the very most were likely to survive, and those who paid the least were likely to perish.

  • Age does not seem to be much of a factor, given that the spread is similar between survival and non-survival.

7.2 Aggregation of survived by fare, as chart showing min, max, quartiles and median

aggregate(fare~survived, summary, data = dropped_titanic)
##   survived fare.Min. fare.1st Qu. fare.Median fare.Mean fare.3rd Qu. fare.Max.
## 1       no   4.00000      7.17080    12.03080  23.74167     26.00000 263.00000
## 2      yes   3.03050     10.33525    26.00000  45.95062     56.18070 512.06070
#tidyverse

dropped_titanic %>%
  group_by(survived) %>%
  summarize(
    min = min(fare),
    q1 = quantile(fare, 0.25),
    median = median(fare),
    mean = mean(fare),
    q3 = quantile(fare, 0.75),
    max = max(fare)
  )
## # A tibble: 2 × 7
##   survived   min    q1 median  mean    q3   max
##   <fct>    <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 no        4     7.17   12.0  23.7  26    263 
## 2 yes       3.03 10.3    26    46.0  56.2  512.

7.3 Scatter plot of numeric values and factor values

plot(dropped_titanic$age,dropped_titanic$fare, col=dropped_titanic$survived, pch = as.numeric((dropped_titanic$survived)))

dropped_titanic %>% ggplot() + geom_point(aes(x=age,y=fare,color=survived))

Insights from another viewing style of these factors reinforces:

  • Lower fare = somewhat lower chance of survival. The weight of this is mitigated by fare variance and an abundance of lower fare tickets sold.

7.4 Basic regression model

# Fit a logistic regression model
model <- glm(survived ~ class + gender + age + fare, 
             data = dropped_titanic, 
             family = binomial) #logistic regression

# View model summary
summary(model)
## 
## Call:
## glm(formula = survived ~ class + gender + age + fare, family = binomial, 
##     data = dropped_titanic)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.963470   0.373026  10.625  < 2e-16 ***
## class2nd    -1.450138   0.249754  -5.806 6.39e-09 ***
## class3rd    -2.599696   0.255797 -10.163  < 2e-16 ***
## gendermale  -2.587138   0.159321 -16.239  < 2e-16 ***
## age         -0.039171   0.006168  -6.351 2.14e-10 ***
## fare        -0.002324   0.001893  -1.228     0.22    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1619.6  on 1212  degrees of freedom
## Residual deviance: 1119.3  on 1207  degrees of freedom
## AIC: 1131.3
## 
## Number of Fisher Scoring iterations: 4

7.4.1 Confustion Matrix for Reg Model

# Predict probabilities
predicted_probs <- predict(model, type = "response")



# Predict classes
predicted_classes <- ifelse(predicted_probs > 0.5, "yes", "no") %>% factor()

conf_mat <- confusionMatrix(predicted_classes, dropped_titanic$survived)
conf_mat$table
##           Reference
## Prediction  no yes
##        no  628 136
##        yes 115 334

7.4.2 Visualization of model

tidy(model) %>%
  ggplot(aes(x = term, y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2) +
  coord_flip() +
  ggtitle("Logistic Regression Coefficients")

Insights:

  • 1st class female most likely to survive

  • Men and 3rd class least

  • Age and fare have little impact/are too dispersed to impact coefficients

8 Conclusion:

In the above model, the intercept represents the baseline log-odds of survival for a female passenger in 1st class with an average fare and age. All other coefficients reflect how the odds change relative to this reference profile. The model has selected 1st-class female passengers as the most likely to survive. The groups least likely are 3rd class passengers and men. It is important to note that women had higher survival odds, likely because of a “women and children first” mentality. This mentality was likely not applied evenly, as seen in section 24. 3rd class women are still more likely to survive than even 1st class men, but were significantly less likely to survive than 1st class women. These results have been consistent across the analytic methods and have troubling implications about wealth inequity. I believe that some of the causes for a high death rate among 3rd class passengers relates to their location in the ship, as lower class cabins were in less desirable, deeper parts of the Titanic, which would have been difficult to escape. This could also contribute to understanding the high deaths among engineers above all other workers. Additionally, their duties may have required them to stay as long as possible. I believe there could be social factors at play, like a conscious or unconscious bias towards wealthier passengers. That theory is supported by the fact that the only people of color on board (Mr.Laroche, Chinese workmen, and more) were all in 2nd class or lower. This analysis is just a short overview of an emergency situation which had many factors for survival. It would take a more in-depth cultural analysis to confirm theories on why some survived and others didn’t. That said, it is clear to me that wealth, gender, and race had a major role in determining who survived that fateful night the Titanic was overcome.