Final Project: Analysis on Titanic Dataset

Import packages

# Load the packages
library(conflicted)
library(readxl)
library(here)

## here() starts at /Users/varad/Documents/Academics/Year 2024/DACSS 601

library(scales)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2

library(ggplot2)
library(ggmosaic)
conflicts_prefer(dplyr::filter)

## [conflicted] Will prefer dplyr::filter over any other package.

Introduction

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. First, the protocol during an emergency is to prioritize the safety of women and children. Second, there is a stark difference between the survival rates of first-class (wealthier) and third-class (poorer) passengers.

In this project, we walk through how these claims can be analysed and validated. The R language [1] is a powerful tool for analysing and understanding the information present in the data. We use RStudio [2] for the analysis and RPubs [3] to host the project. The document is structured as follows: reading and describing the data, cleaning, descriptive statistics, further descriptive analysis, research questions, and a critical reflection.

Read the data

titanic_data <- here("titanic3.xls") %>%
      read_excel()

## Warning: Coercing text to numeric in M1306 / R1306C13: '328'

titanic_data

## # A tibble: 1,309 × 14
##    pclass survived name     sex      age sibsp parch ticket  fare cabin embarked
##     <dbl>    <dbl> <chr>    <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr> <chr>   
##  1      1        1 Allen, … fema… 29         0     0 24160  211.  B5    S       
##  2      1        1 Allison… male   0.917     1     2 113781 152.  C22 … S       
##  3      1        0 Allison… fema…  2         1     2 113781 152.  C22 … S       
##  4      1        0 Allison… male  30         1     2 113781 152.  C22 … S       
##  5      1        0 Allison… fema… 25         1     2 113781 152.  C22 … S       
##  6      1        1 Anderso… male  48         0     0 19952   26.6 E12   S       
##  7      1        1 Andrews… fema… 63         1     0 13502   78.0 D7    S       
##  8      1        0 Andrews… male  39         0     0 112050   0   A36   S       
##  9      1        1 Appleto… fema… 53         2     0 11769   51.5 C101  S       
## 10      1        0 Artagav… male  71         0     0 PC 17…  49.5 <NA>  C       
## # ℹ 1,299 more rows
## # ℹ 3 more variables: boat <chr>, body <dbl>, home.dest <chr>

Narrative

The Titanic dataset describes the passengers on the maiden voyage of the RMS Titanic, a British passenger liner that sank in the North Atlantic Ocean in April 1912 after hitting an iceberg while sailing from Southampton to New York City. The dataset provides a window into the lives of those onboard, a diverse group of passengers from different socio-economic backgrounds spread across three passenger classes.

Variables

pclass: Passenger class (1st, 2nd, 3rd) - This categorical variable divides passengers into three classes (1st, 2nd, and 3rd), reflecting the socio-economic stratification of the early 20th century.
survived: Survival status (1 = Yes, 0 = No) - A binary categorical variable indicating survival (1) or non-survival (0) of the passengers.
name: Name of the passenger - Textual data providing the names of the passengers, allowing for individual identification and historical research.
sex: Gender - A categorical variable recording the gender of passengers.
age: Age in years - A numerical variable detailing the age of each passenger, giving insights into the age distribution onboard.
sibsp: Number of siblings/spouses aboard - This numerical variable counts the number of siblings or spouses that a passenger had aboard the Titanic.
parch: Number of parents/children aboard - Similar to sibsp, this numerical variable tallies the number of parents or children a passenger had on the ship.
ticket: Ticket number - A combination of text and numeric data representing each passenger’s ticket number.
fare: Passenger fare - A numerical variable showing how much each passenger paid, potentially indicating their financial status.
cabin: Cabin number - Textual data providing the cabin number for passengers, which can be linked to their class and location on the ship.
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) - A categorical variable indicating the port where passengers boarded the Titanic, with codes for Cherbourg, Queenstown, and Southampton.
boat: Lifeboat (if survived) - For survivors, this text/numeric variable identifies the lifeboat they were on, giving insight into the rescue process.
body: Body identification number (if did not survive) - For those who did not survive, this numeric variable provides a body identification number, if available.
home.dest: Home/Destination - Textual data about the passengers’ home or intended destination, offering a glimpse into their personal journeys and backgrounds.

Data collection

The data used here comes from the Kaggle dataset “The Complete Titanic Dataset” [4], which is commonly used for predicting the survival of a passenger. The dataset was compiled from the passenger information available on Wikipedia [5], which in turn draws on two sources: Encyclopedia Titanica [6] and an article published in the New York Times on 19 April 1912 that listed the survivors of the tragedy. One thing we noticed is that the totals reported for the disaster are much larger than the dataset: the introduction cites 2224 passengers and crew, while only 1,309 entries are recorded above. A large part of this gap arises because the dataset lists passengers only and does not include the crew; any remaining difference reflects passengers whose records are unknown or incomplete. As a result, the ratio of survivors to non-survivors in this dataset differs noticeably from the ratio for everyone on board.
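The mismatch in survival ratios can be checked with a little arithmetic. The sketch below uses only counts that appear in this document: the 1502 deaths out of 2224 people cited in the introduction, and the 500 survivors among the 1,309 dataset entries reported in the descriptive statistics. The variable names are illustrative.

```r
# Counts taken from the introduction (everyone on board)
total_onboard <- 2224
total_deaths  <- 1502

# Counts taken from the dataset (passengers only)
dataset_rows      <- 1309
dataset_survivors <- 500

# Share of survivors under each set of figures
overall_rate <- (total_onboard - total_deaths) / total_onboard
dataset_rate <- dataset_survivors / dataset_rows

round(c(overall = overall_rate, dataset = dataset_rate), 3)
```

The dataset's survival share is several percentage points higher than the overall one, which is worth keeping in mind when reading the rates reported later.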

Regardless, the study still remains relevant with a lot of interesting questions that can be answered. We next move on to cleaning the dataset.

Cleaning data

# Cleaning the data
titanic_clean <- titanic_data %>%
  # Convert any factor columns to character
  # (a no-op here, since read_excel() does not create factor columns)
  mutate(across(where(is.factor), as.character))

# Missing values in 'age' were previously filled with the overall median:
#   mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age))
# This step was removed after incorporating feedback from the first assignment.
# Going forward, we can either impute age within groups (for example, the mean
# age for each ticket type) or simply remove the rows with a missing age.

titanic_clean

## # A tibble: 1,309 × 14
##    pclass survived name     sex      age sibsp parch ticket  fare cabin embarked
##     <dbl>    <dbl> <chr>    <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr> <chr>   
##  1      1        1 Allen, … fema… 29         0     0 24160  211.  B5    S       
##  2      1        1 Allison… male   0.917     1     2 113781 152.  C22 … S       
##  3      1        0 Allison… fema…  2         1     2 113781 152.  C22 … S       
##  4      1        0 Allison… male  30         1     2 113781 152.  C22 … S       
##  5      1        0 Allison… fema… 25         1     2 113781 152.  C22 … S       
##  6      1        1 Anderso… male  48         0     0 19952   26.6 E12   S       
##  7      1        1 Andrews… fema… 63         1     0 13502   78.0 D7    S       
##  8      1        0 Andrews… male  39         0     0 112050   0   A36   S       
##  9      1        1 Appleto… fema… 53         2     0 11769   51.5 C101  S       
## 10      1        0 Artagav… male  71         0     0 PC 17…  49.5 <NA>  C       
## # ℹ 1,299 more rows
## # ℹ 3 more variables: boat <chr>, body <dbl>, home.dest <chr>

str(titanic_clean)

## tibble [1,309 × 14] (S3: tbl_df/tbl/data.frame)
##  $ pclass   : num [1:1309] 1 1 1 1 1 1 1 1 1 1 ...
##  $ survived : num [1:1309] 1 1 0 0 0 1 1 0 1 0 ...
##  $ name     : chr [1:1309] "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
##  $ sex      : chr [1:1309] "female" "male" "female" "male" ...
##  $ age      : num [1:1309] 29 0.917 2 30 25 ...
##  $ sibsp    : num [1:1309] 0 1 1 1 1 0 1 0 2 0 ...
##  $ parch    : num [1:1309] 0 2 2 2 2 0 0 0 0 0 ...
##  $ ticket   : chr [1:1309] "24160" "113781" "113781" "113781" ...
##  $ fare     : num [1:1309] 211 152 152 152 152 ...
##  $ cabin    : chr [1:1309] "B5" "C22 C26" "C22 C26" "C22 C26" ...
##  $ embarked : chr [1:1309] "S" "S" "S" "S" ...
##  $ boat     : chr [1:1309] "2" "11" NA NA ...
##  $ body     : num [1:1309] NA NA NA 135 NA NA NA NA NA 22 ...
##  $ home.dest: chr [1:1309] "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
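
The two options mentioned in the cleaning comments can be sketched as follows. This is only an illustration on a small made-up data frame, not the cleaning actually applied in this project. Grouping by pclass as a stand-in for "ticket type" and using the median instead of the mean are both assumptions.

```r
library(dplyr)

# Hypothetical toy data standing in for titanic_clean (values are made up)
toy <- data.frame(
  pclass = c(1, 1, 1, 3, 3, 3),
  age    = c(30, NA, 50, 20, 22, NA)
)

# Option 1: fill a missing age with the median age of the same passenger class
toy_imputed <- toy %>%
  group_by(pclass) %>%
  mutate(age = if_else(is.na(age), median(age, na.rm = TRUE), age)) %>%
  ungroup()

# Option 2: drop the rows with a missing age
toy_complete <- toy %>%
  dplyr::filter(!is.na(age))

toy_imputed
toy_complete
```

Group-wise imputation keeps all rows but shrinks the spread of age within each class, whereas dropping rows keeps the observed ages untouched at the cost of a smaller sample.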

Descriptive Statistics

# statistics
summary_statistics <- titanic_clean %>%
  summarise(
    MeanAge = mean(age, na.rm = TRUE),
    MedianAge = median(age, na.rm = TRUE),
    SdAge = sd(age, na.rm = TRUE),
    SurvivedCount = sum(survived == 1, na.rm = TRUE),
    NotSurvivedCount = sum(survived == 0, na.rm = TRUE),
  )
summary_statistics

## # A tibble: 1 × 5
##   MeanAge MedianAge SdAge SurvivedCount NotSurvivedCount
##     <dbl>     <dbl> <dbl>         <int>            <int>
## 1    29.9        28  14.4           500              809

# Survival rates by passenger class
titanic_clean %>%
  group_by(pclass) %>%
  summarise(SurvivalRate = mean(survived == 1)) %>%
  ggplot(aes(x = factor(pclass), y = SurvivalRate)) + 
  geom_bar(stat = "identity", fill = "skyblue") + # Set all bars to the same color
  labs(title = "Survival Rates by Passenger Class", x = "Passenger Class", y = "Survival Rate") +
  scale_y_continuous(labels = label_percent(), limits = c(0, 0.85)) +
  theme_minimal()

The graph shows that first-class passengers had the highest survival rate, followed by second class and then third class, which had the lowest. This suggests an association between passenger class and survival: passengers in the more expensive classes (first class being the highest, despite having the lowest number) had a better chance of survival.

# Age distribution among survivors and non-survivors
titanic_clean %>%
  ggplot(aes(x = age, fill = as.factor(survived))) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.6) +
  labs(title = "Age Distribution of Survivors and Non-Survivors", x = "Age", y = "Count") +
  scale_fill_brewer(palette = "Set1", name = "Survived", labels = c("No", "Yes"))

## Warning: Removed 263 rows containing non-finite values (`stat_bin()`).

The plot overlays the counts of passengers who survived and who did not survive across ages; the 263 passengers with a missing age are excluded. It appears that the largest number of survivors falls within the younger age groups, indicating that age may have been a factor in survival. Moreover, there is a noticeable peak in non-survivors in the adult age range, suggesting higher mortality among middle-aged passengers. Statistical testing is needed before drawing any further conclusions.

# Survival rates by gender
titanic_clean %>%
  group_by(sex) %>%
  summarise(SurvivalRate = mean(survived == 1)) %>%
  ggplot(aes(x = sex, y = SurvivalRate)) +
  geom_bar(stat = "identity", fill = "coral") +
  labs(title = "Survival Rates by Gender", x = "Gender", y = "Survival Rate") +
  theme_minimal()

The bar graph depicts a significantly higher survival rate for females compared to males. This suggests that gender was a major factor in survival, with females being more likely to survive than males.

# Fare distribution
titanic_clean %>%
  ggplot(aes(x = fare)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Fares", x = "Fare", y = "Count")

## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

The histogram illustrates that the majority of fares are clustered at the lower end of the fare range, with a sharp decline as fares increase. This indicates that lower fares were more common, and very high fares were quite rare.

Further descriptive analysis

Fare Distribution by Embarkation Port

Question: How does the fare distribution vary among different embarkation ports?

This can reveal if passengers embarking from certain ports tended to pay more or less, possibly indicating economic disparities based on embarkation points.

if("embarked" %in% names(titanic_clean)) {
  # Check initial state of 'embarked'
  print(paste("Initial rows:", nrow(titanic_clean)))
  print(table(titanic_clean$embarked))

  # Filter out rows where 'embarked' is NA
  titanic_clean <- titanic_clean %>%
    filter(!is.na(embarked))

  # Confirm state after filtering
  print(paste("Rows after NA removal:", nrow(titanic_clean)))
  
  if(nrow(titanic_clean) > 0) {
    # Summarize data again
    summary_count <- titanic_clean %>%
      group_by(embarked) %>%
      summarise(Count = n())
    
    # Check summary_count
    if(nrow(summary_count) == 0) {
      print("No data in 'embarked' after NA removal.")
    } else {
      print(summary_count)
      
      # Relabel and plot if data is appropriate
      titanic_clean$embarked <- factor(titanic_clean$embarked, 
                                       levels = c("C", "Q", "S"),
                                       labels = c("Cherbourg", "Queenstown", "Southampton"))
      
      ggplot(titanic_clean, aes(x = fare, fill = embarked)) + 
        geom_histogram(binwidth = 10, color = "black", alpha = 0.7) +
        facet_wrap(~embarked, scales = 'fixed') +
        theme_minimal() +
        labs(title = "Fare Distribution by Embarkation Port", x = "Fare", y = "Count", fill = "Embarkation Port")
    }
  } else {
    print("The dataset is empty after removing NA values.")
  }
} else {
  print("Error: 'embarked' column not found in the dataset.")
}

## [1] "Initial rows: 1309"
## 
##   C   Q   S 
## 270 123 914 
## [1] "Rows after NA removal: 1307"
## # A tibble: 3 × 2
##   embarked Count
##   <chr>    <int>
## 1 C          270
## 2 Q          123
## 3 S          914

## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

The majority of passengers embarking from port S (Southampton) paid lower fares compared to those from port C (Cherbourg). Port Q (Queenstown) shows a very tight range of fares, predominantly on the lower end, suggesting that passengers from this port were likely to be in the lower passenger classes. The presence of some higher fares at port C indicates that this port had a relatively higher proportion of first or second-class passengers.

Passenger Class Proportion by Age Group

Question: How is the age of passengers distributed across different passenger classes?

This can show if certain age groups were more prevalent in a particular class, providing insights into the socio-economic demographics of the passengers.

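titanic_clean$AgeGroup <- cut(titanic_clean$age, breaks = c(0, 12, 18, 60, Inf), labels = c("Child", "Teenager", "Adult", "Senior"))
# Passengers with a missing age have no age group and are left out of the plot
titanic_clean %>%
    filter(!is.na(AgeGroup)) %>%
    ggplot(aes(x = factor(pclass), fill = AgeGroup)) +
    geom_bar(position = "fill") + # "fill" rescales each class to proportions
    scale_y_continuous(labels = label_percent()) +
    theme_minimal() +
    labs(title = "Passenger Class Proportion by Age Group", x = "Passenger Class", y = "Proportion", fill = "Age Group")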

The visualization suggests that the first-class passengers were predominantly adults, with a relatively small proportion of children, teenagers, and seniors. The second and third classes show a more diverse age distribution, with the third class having a higher proportion of children and teenagers. This could reflect economic factors, where families and younger individuals were more likely to travel in the lower classes.
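
The within-class proportions discussed above can also be tabulated directly, which avoids reading them off bar heights. The following is a minimal sketch on made-up ages (the toy data frame is hypothetical); it reuses the same cut() breaks as the analysis and relies on base R's prop.table() with margin = 1 so that each class row sums to 1.

```r
# Hypothetical toy data (made-up values), standing in for titanic_clean
toy <- data.frame(
  pclass = c(1, 1, 1, 1, 3, 3, 3, 3),
  age    = c(35, 45, 62, 8, 4, 16, 25, 30)
)

# Same age bands as in the analysis above
toy$AgeGroup <- cut(toy$age, breaks = c(0, 12, 18, 60, Inf),
                    labels = c("Child", "Teenager", "Adult", "Senior"))

# Counts per class and age group, then row-wise proportions
counts <- table(toy$pclass, toy$AgeGroup)
props  <- prop.table(counts, margin = 1)

round(props, 2)
```

Comparing rows of this table is a like-for-like comparison even when the classes differ greatly in size, which raw counts do not provide.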

Cabin Location by Survival Status

Question: Does the cabin location (deck) have any correlation with survival rates?

This can highlight if certain decks were more prone to having survivors or casualties, possibly due to their location on the ship.

titanic_clean$deck <- substr(titanic_clean$cabin, 1, 1)
titanic_cabin <- titanic_clean[!is.na(titanic_clean$deck) & !is.na(titanic_clean$survived), ]
deck_survived_table <- table(titanic_cabin$deck, titanic_cabin$survived)
mosaicplot(
        deck_survived_table, 
        main = "Mosaic Plot of Survival Status Across Decks", 
        xlab = "Deck", 
        ylab = "Survival Status", 
        color = TRUE
)

The plot illustrates that survival rates varied noticeably across deck locations among passengers with a recorded cabin. For instance, decks B, C, D, and E show a larger proportion of survivors (indicated by the size of the blocks), while decks A and G, as well as the deck denoted by T, had relatively fewer survivors. This might be due to the location of these decks on the ship and their accessibility to lifeboats, although no formal test has been run here.
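
Because a mosaic plot can make a deck with very few passengers look as informative as a crowded one, it helps to tabulate the group size next to each survival rate. The sketch below does this with base R on a made-up data frame (the cabin values and outcomes are hypothetical), using the same substr() step as above to extract the deck letter.

```r
# Hypothetical toy data (made-up values), standing in for titanic_clean
toy <- data.frame(
  cabin    = c("B5", "B20", "C22 C26", "C85", "C101", "T", NA, "E12"),
  survived = c(1, 1, 0, 1, 0, 0, 0, 1)
)

# Deck letter is the first character of the cabin; drop passengers without a cabin
toy$deck <- substr(toy$cabin, 1, 1)
toy <- toy[!is.na(toy$deck), ]

# Group size and survival rate per deck
n_by_deck    <- tapply(toy$survived, toy$deck, length)
rate_by_deck <- tapply(toy$survived, toy$deck, mean)

deck_summary <- data.frame(
  deck          = names(n_by_deck),
  n             = as.vector(n_by_deck),
  survival_rate = as.vector(rate_by_deck)
)
deck_summary
```

A rate computed from one or two passengers, as for deck T in this toy example, should be read with much more caution than a rate based on dozens.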

From these visualizations, we can infer that socio-economic status (indicated by fare and passenger class), age distribution, and cabin location (deck) played roles in the survival patterns observed on the Titanic. These factors are intertwined with the historical context of the event, where first-class passengers had better access to emergency resources, and certain decks were either more or less advantageous for evacuation.

Lastly, let’s examine how the average fare and survival rate vary by passenger class (pclass) and embarkation port (embarked).

# Grouping by passenger class and embarkation port to find average fare and survival rate
grouped_stats <- titanic_clean %>%
  group_by(pclass, embarked) %>%
  summarise(
    Average_Fare = mean(fare, na.rm = TRUE),
    Survival_Rate = mean(as.numeric(survived), na.rm = TRUE)
  )

## `summarise()` has grouped output by 'pclass'. You can override using the
## `.groups` argument.

# View the results
grouped_stats

## # A tibble: 9 × 4
## # Groups:   pclass [3]
##   pclass embarked    Average_Fare Survival_Rate
##    <dbl> <fct>              <dbl>         <dbl>
## 1      1 Cherbourg          107.          0.688
## 2      1 Queenstown          90           0.667
## 3      1 Southampton         72.1         0.559
## 4      2 Cherbourg           23.3         0.571
## 5      2 Queenstown          11.7         0.286
## 6      2 Southampton         21.2         0.417
## 7      3 Cherbourg           11.0         0.366
## 8      3 Queenstown          10.4         0.354
## 9      3 Southampton         14.4         0.210

ggplot(grouped_stats, aes(x = factor(pclass), y = Average_Fare, fill = embarked)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.7), width = 0.6) + # Use dodged bars
  theme_minimal() +
  scale_y_continuous(limits = c(0, max(grouped_stats$Average_Fare) * 1.1)) + # Rescale y-axis, extending it slightly above the highest average fare
  labs(title = "Average Fare by Passenger Class and Embarkation Port",
       x = "Passenger Class",
       y = "Average Fare",
       fill = "Embarkation Port")

# Bubble plot with color gradient based on survival rate, without size mapping
ggplot(grouped_stats, aes(x = factor(pclass), y = factor(embarked), color = Survival_Rate)) +
  geom_point(alpha = 0.6, size = 4) + # Use a fixed size for all points, adjust transparency
  scale_color_gradient(low = "blue", high = "red") + # Color gradient from blue (low) to red (high)
  theme_minimal() +
  labs(title = "Survival Rate by Passenger Class and Embarkation Port",
       x = "Passenger Class",
       y = "Embarkation Port",
       color = "Survival Rate")

First-Class Passengers (Pclass 1):
- Passengers embarking from Cherbourg (C) paid the highest average fare ($106.85) among all first-class passengers and had a survival rate of approximately 68.79%.
- Those embarking from Queenstown (Q) had a lower average fare ($90.00) and a slightly lower survival rate of 66.67%.
- Passengers embarking from Southampton (S) paid the least on average ($72.15) among first-class passengers and had a survival rate of about 55.93%.
- The two first-class passengers with a missing embarkation port (average fare of $80.00, both survivors) were removed in the filtering step above, so no NA group appears in this table.

Second-Class Passengers (Pclass 2):
- The average fares are much lower than those of the first class, with Cherbourg passengers paying around $23.30 and having a survival rate of approximately 57.14%.
- Queenstown’s second-class passengers had the lowest average fare ($11.74) and the lowest survival rate of 28.57%.
- Those from Southampton had an average fare of $21.21 and a survival rate of 41.74%.

Third-Class Passengers (Pclass 3):
- Cherbourg’s third-class passengers paid an average fare of $11.02 and had a survival rate of 36.63%, which is higher than the survival rates for third-class passengers from other ports.
- Queenstown passengers had a very close average fare ($10.39) and a slightly lower survival rate of 35.39%.
- Southampton’s third-class passengers had a higher average fare ($14.44) compared to the other two ports but the lowest survival rate of 21.01%.

General Observations:
- The average fare decreases as we move from first-class to third-class passengers, as expected.
- First-class passengers generally had higher survival rates compared to second and third-class passengers, which is consistent with historical accounts suggesting that higher-class passengers had better access to lifeboats.
- Among all groups, Cherbourg’s first-class passengers paid the highest average fare and also had the highest survival rate, which might suggest a correlation between fare and survival.
- The two passengers with missing embarkation data were excluded before grouping, so they do not affect the group averages shown here.

These observations suggest that both passenger class and point of embarkation were associated with survival on the Titanic, with higher fares (likely linked to higher classes) going together with higher survival rates. Survival rates also vary between embarkation points within the same class, although some of these groups may be small and the differences have not been formally tested.

Research directions

Let’s answer some of the remaining research questions.

Did age and sex significantly affect the survival chances?

While the raw data visualizations such as the survival rates by gender barplot and age distribution by survivors histogram provide clear insights, we utilize logistic regression and a Chi-Squared test for a more rigorous statistical analysis. These methods allow us to quantify the strength of these relationships and test the significance of our observations.

The logistic regression model helps in understanding how age and sex together impact the odds of survival. It provides a way to control for one variable while assessing the effect of another.

# Logistic regression model with Age and Sex as predictors for Survival
survival_model <- glm(survived ~ age + sex, family = binomial(link = "logit"), data = titanic_clean)

# Summary of the model to check the significance of predictors
summary(survival_model)

## 
## Call:
## glm(formula = survived ~ age + sex, family = binomial(link = "logit"), 
##     data = titanic_clean)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.237032   0.192077   6.440 1.19e-10 ***
## age         -0.004563   0.005221  -0.874    0.382    
## sexmale     -2.453018   0.152409 -16.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1411.0  on 1043  degrees of freedom
## Residual deviance: 1100.1  on 1041  degrees of freedom
##   (263 observations deleted due to missingness)
## AIC: 1106.1
## 
## Number of Fisher Scoring iterations: 4

While logistic regression gives us an insight into the relationship between survival and our predictors, the Chi-Squared test helps us understand if the observed association between sex and survival is statistically significant.

# Create a contingency table of survival by sex
table_sex_survival <- table(titanic_clean$survived, titanic_clean$sex)

# Perform the Chi-Squared test
chisq.test(table_sex_survival)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_sex_survival
## X-squared = 361.36, df = 1, p-value < 2.2e-16

Based on the output of the logistic regression analysis and the Chi-squared test, we can draw the following conclusions:

Logistic Regression Conclusion:
- The coefficient for age is -0.004563, and it is not statistically significant (p-value = 0.382), which implies that age, entered as a linear term, has little detectable effect on the likelihood of survival when controlling for sex.
- The coefficient for sexmale is -2.453018, which is statistically significant (p-value < 2e-16). The negative sign indicates that the odds of surviving were lower for males than for females, holding age constant.
- The intercept (1.237032) represents the log odds of survival for the baseline group, a female passenger at age 0. It is positive and statistically significant, meaning the predicted survival probability for this baseline group is above 50%.

Chi-Squared Test Conclusion:
- The result of the Chi-squared test (X-squared = 361.36, df = 1, p-value < 2.2e-16) confirms a statistically significant association between sex and survival, reinforcing the logistic regression findings.

Overall Conclusion:
- Sex was a significant predictor of survival on the Titanic, with females having much higher odds of surviving than males.
- Age, as a continuous variable, does not seem to have a significant impact on survival when analyzed in conjunction with sex, so its effect is less clear-cut.
- The relationship between age and survival might be more complex and could potentially be better understood by considering age groups or including interaction terms in the model.

These analytical approaches highlight the importance of statistical testing in confirming and quantifying observations, offering a more nuanced understanding of the factors influencing survival on the Titanic.
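
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below takes the sexmale estimate and standard error directly from the model summary above and computes the odds ratio with an approximate 95% Wald interval. The Wald approach is an assumption made here for simplicity; other interval methods exist.

```r
# Estimate and standard error for sexmale, copied from the summary output above
est <- -2.453018
se  <- 0.152409

# Odds ratio: odds of survival for males relative to females, holding age constant
odds_ratio <- exp(est)

# Approximate 95% Wald confidence interval on the odds-ratio scale
ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

round(c(odds_ratio = odds_ratio, lower = ci[1], upper = ci[2]), 3)
```

An odds ratio well below 1 means the odds of survival for males were only a small fraction of those for females of the same age.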

What was the impact of having family onboard (siblings/spouses, parents/children) on survival?

We categorize passengers into two groups: those with family onboard and those without. We then compare their survival rates.

titanic_clean$Family_Onboard <- with(titanic_clean, sibsp + parch > 0)

# Create a summarized dataset for plotting
survival_by_family <- titanic_clean %>%
  group_by(Family_Onboard) %>%
  summarise(Survival_Rate = mean(as.numeric(survived)))

# Convert Family_Onboard to a more descriptive factor
survival_by_family$Family_Onboard <- factor(survival_by_family$Family_Onboard, labels = c("Alone", "With Family"))

# Bar plot for Survival Rate by Family Onboard
ggplot(survival_by_family, aes(x = Family_Onboard, y = Survival_Rate, fill = Family_Onboard)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("violet", "violet")) +
  theme_minimal() +
  labs(title = "Survival Rate by Family Onboard",
       x = "Family Onboard",
       y = "Survival Rate") +
  theme(legend.position = "none") +
  ylim(0, 0.6) # Set y-axis to start at 0

Having observed the survival trends visually, we now turn to logistic regression to assess the impact quantitatively.

# Logistic regression model with Family Onboard as a predictor
family_survival_model <- glm(survived ~ Family_Onboard, family = binomial(link = "logit"), data = titanic_clean)

# Summary of the model to check the significance of having family onboard
summary(family_survival_model)

## 
## Call:
## glm(formula = survived ~ Family_Onboard, family = binomial(link = "logit"), 
##     data = titanic_clean)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.84367    0.07768 -10.861  < 2e-16 ***
## Family_OnboardTRUE  0.85524    0.11722   7.296 2.97e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1737.2  on 1306  degrees of freedom
## Residual deviance: 1683.2  on 1305  degrees of freedom
## AIC: 1687.2
## 
## Number of Fisher Scoring iterations: 4

Based on the output from the logistic regression model, here are the conclusions we can draw regarding the impact of having family onboard on the chances of survival:

Logistic Regression Conclusion:
- The intercept, which represents the log odds of survival for passengers without family onboard, is negative (-0.84367) and significant, which corresponds to a survival probability below 50% for passengers who were alone.
- The coefficient for Family_OnboardTRUE is 0.85524 and is statistically significant (p-value = 2.97e-13). This positive coefficient means that passengers with family onboard had higher odds of survival than those who were alone.
- The significance of the Family_Onboard variable implies that having family members onboard was associated with a higher likelihood of survival during the Titanic disaster.

Overall Conclusion:
- Having family onboard was associated with significantly higher chances of survival on the Titanic.
- This finding may reflect social and behavioral dynamics during the evacuation, such as families being prioritized for rescue or family members being more motivated to seek and aid each other’s survival.
- The coefficient is highly significant, but Family_Onboard is the only predictor in the model, so part of the association may reflect differences in sex, age, or passenger class between the two groups.

These conclusions suggest that during the Titanic disaster, social bonds, as indicated by the presence of family members, could have played a crucial role in survival outcomes.
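
To make the log-odds coefficients more tangible, they can be converted to predicted survival probabilities with the inverse logit function, plogis() in base R. The sketch below copies the two estimates from the model summary above; the variable names are illustrative.

```r
# Coefficients copied from the summary of family_survival_model above
intercept   <- -0.84367  # log odds of survival when travelling alone
family_coef <-  0.85524  # change in log odds when family is onboard

# Inverse logit turns log odds into probabilities
p_alone  <- plogis(intercept)
p_family <- plogis(intercept + family_coef)

# Odds ratio for having family onboard versus travelling alone
odds_ratio <- exp(family_coef)

round(c(alone = p_alone, with_family = p_family, odds_ratio = odds_ratio), 3)
```

Since the model has a single binary predictor, these two probabilities are simply the survival rates of the two groups that the bar plot shows.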

Critical Reflection

In concluding the analysis of the Titanic dataset, several key insights emerge, alongside areas that warrant further exploration and the limitations inherent in the dataset. Here’s a summary:

Key Learnings:

Socio-Economic Factors: The data analysis reinforced the notion that socio-economic status, represented by passenger class and fare, had a significant impact on survival rates. Higher-class passengers generally had better survival odds.
Gender and Age: Gender emerged as a significant factor in survival, with women having higher survival rates. However, age as a singular factor did not show a strong correlation with survival when analyzed alongside gender.
Family Ties: The presence of family members onboard appeared to positively influence survival chances.
Embarkation Points: Different embarkation points showed variations in survival rates and average fares, suggesting geographical or socio-economic disparities.
Statistical Rigor: Using logistic regression and Chi-Squared tests added statistical rigor to the analysis, quantifying the strength of relationships between variables like age, sex, and family onboard.

Future Research Directions:

Interactions Between Variables: Including multivariate plots or adding facets to existing plots can help to show the relationship between more than two variables.
Cabin Location Analysis: Further examination of cabin locations and their relation to survival rates could provide insights into the ship’s evacuation dynamics.
Survivorship Bias: The visualizations cannot account for survivorship bias, where the data available is conditioned on the outcome of survival, potentially skewing the analysis.
Missing Data Handling: Addressing missing data with advanced imputation techniques could provide a more accurate picture and should be clearly communicated in the visualization’s narrative.
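
As one possible starting point for the interaction analysis suggested above, the sketch below fits a logistic model with an age-by-sex interaction and compares it with the additive model. The data are simulated with made-up parameters purely so the example runs on its own; none of the numbers describe the Titanic, and the object names are hypothetical.

```r
set.seed(601)

# Simulated stand-in data (made-up data-generating process, not Titanic data)
n <- 400
sim <- data.frame(
  age = runif(n, 1, 70),
  sex = sample(c("female", "male"), n, replace = TRUE)
)
is_male  <- sim$sex == "male"
log_odds <- 1.2 - 0.01 * sim$age - 2.4 * is_male + 0.02 * sim$age * is_male
sim$survived <- rbinom(n, 1, plogis(log_odds))

# Additive model, as used in the analysis above
fit_main <- glm(survived ~ age + sex, family = binomial, data = sim)

# Model that lets the effect of age differ by sex
fit_int <- glm(survived ~ age * sex, family = binomial, data = sim)

# Likelihood-ratio test of whether the interaction improves the fit
anova(fit_main, fit_int, test = "Chisq")
```

On the real data, the same two calls with data = titanic_clean would show whether the age effect differs between men and women.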

Limitations:

Incomplete Data: The dataset does not encompass all passengers and crew, leading to potential sample bias. In particular, the missing records of certain passengers could skew the results.
Missing Values: In the Titanic dataset, there are typically missing values for age and cabin, which could affect the accuracy of the visualizations. The handling of these missing values (e.g., imputation or deletion) can significantly influence the results.
Lack of Contextual Data: The dataset lacks certain contextual information, such as the specifics of the evacuation process, which could provide deeper insights into survival patterns.
Potential Overgeneralization: The analysis might overgeneralize certain trends due to the limitations of the data and the methods used.

Conclusion:

The project provided valuable insights into the survival patterns of the Titanic disaster, emphasizing the role of socio-economic status, gender, and family ties. However, the limitations of the dataset and the need for further research to explore more complex interactions and address biases are evident. Future research could build upon these findings, incorporating more sophisticated statistical techniques and potentially additional data sources to gain a deeper understanding of the factors that influenced survival on the Titanic.

Bibliography

[1] R: What is R? (n.d.). https://www.r-project.org/about.html
[2] RStudio Desktop - Posit. (2024, January 11). Posit. https://posit.co/download/rstudio-desktop/
[3] RPubs. (n.d.). https://rpubs.com/
[4] The complete Titanic Dataset. (2020, January 4). Kaggle. https://www.kaggle.com/datasets/vinicius150987/titanic3
[5] Wikipedia contributors. (2024, January 30). Passengers of the Titanic. Wikipedia. https://en.wikipedia.org/wiki/Passengers_of_the_Titanic
[6] Titanic passenger list. (2021, March 14). Encyclopedia Titanica. https://www.encyclopedia-titanica.org/titanic-passenger-list/
[7] Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. O’Reilly Media.
[8] R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
[9] Baumer, B. S., Kaplan, D. T., & Horton, N. J. (2017). Modern data science with R. CRC Press.