Import packages

# Load the packages
library(readxl)
library(here)
## here() starts at /Users/varad/Documents/Academics/Year 2024/DACSS 601
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggmosaic)

Read the dataset

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. There weren’t enough lifeboats for everyone onboard, and 1502 of the 2224 passengers and crew died.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

titanic_data <- here("titanic3.xls") %>%
      read_excel()
## Warning: Coercing text to numeric in M1306 / R1306C13: '328'
titanic_data
## # A tibble: 1,309 × 14
##    pclass survived name     sex      age sibsp parch ticket  fare cabin embarked
##     <dbl>    <dbl> <chr>    <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr> <chr>   
##  1      1        1 Allen, … fema… 29         0     0 24160  211.  B5    S       
##  2      1        1 Allison… male   0.917     1     2 113781 152.  C22 … S       
##  3      1        0 Allison… fema…  2         1     2 113781 152.  C22 … S       
##  4      1        0 Allison… male  30         1     2 113781 152.  C22 … S       
##  5      1        0 Allison… fema… 25         1     2 113781 152.  C22 … S       
##  6      1        1 Anderso… male  48         0     0 19952   26.6 E12   S       
##  7      1        1 Andrews… fema… 63         1     0 13502   78.0 D7    S       
##  8      1        0 Andrews… male  39         0     0 112050   0   A36   S       
##  9      1        1 Appleto… fema… 53         2     0 11769   51.5 C101  S       
## 10      1        0 Artagav… male  71         0     0 PC 17…  49.5 <NA>  C       
## # ℹ 1,299 more rows
## # ℹ 3 more variables: boat <chr>, body <dbl>, home.dest <chr>

Cleaning data

# Cleaning the data
titanic_clean <- titanic_data %>%
  # Convert any factor columns to character
  # (read_excel() returns none here, so the data are unchanged)
  mutate(across(where(is.factor), as.character))

  # Missing ages were previously filled with the overall median:
  # mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age))
  # This was removed after incorporating feedback from the first assignment.
  # Going forward, we can either impute age with the mean age of each ticket type
  # or simply remove the rows with missing ages.
titanic_clean
## # A tibble: 1,309 × 14
##    pclass survived name     sex      age sibsp parch ticket  fare cabin embarked
##     <dbl>    <dbl> <chr>    <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr> <chr>   
##  1      1        1 Allen, … fema… 29         0     0 24160  211.  B5    S       
##  2      1        1 Allison… male   0.917     1     2 113781 152.  C22 … S       
##  3      1        0 Allison… fema…  2         1     2 113781 152.  C22 … S       
##  4      1        0 Allison… male  30         1     2 113781 152.  C22 … S       
##  5      1        0 Allison… fema… 25         1     2 113781 152.  C22 … S       
##  6      1        1 Anderso… male  48         0     0 19952   26.6 E12   S       
##  7      1        1 Andrews… fema… 63         1     0 13502   78.0 D7    S       
##  8      1        0 Andrews… male  39         0     0 112050   0   A36   S       
##  9      1        1 Appleto… fema… 53         2     0 11769   51.5 C101  S       
## 10      1        0 Artagav… male  71         0     0 PC 17…  49.5 <NA>  C       
## # ℹ 1,299 more rows
## # ℹ 3 more variables: boat <chr>, body <dbl>, home.dest <chr>
str(titanic_clean)
## tibble [1,309 × 14] (S3: tbl_df/tbl/data.frame)
##  $ pclass   : num [1:1309] 1 1 1 1 1 1 1 1 1 1 ...
##  $ survived : num [1:1309] 1 1 0 0 0 1 1 0 1 0 ...
##  $ name     : chr [1:1309] "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
##  $ sex      : chr [1:1309] "female" "male" "female" "male" ...
##  $ age      : num [1:1309] 29 0.917 2 30 25 ...
##  $ sibsp    : num [1:1309] 0 1 1 1 1 0 1 0 2 0 ...
##  $ parch    : num [1:1309] 0 2 2 2 2 0 0 0 0 0 ...
##  $ ticket   : chr [1:1309] "24160" "113781" "113781" "113781" ...
##  $ fare     : num [1:1309] 211 152 152 152 152 ...
##  $ cabin    : chr [1:1309] "B5" "C22 C26" "C22 C26" "C22 C26" ...
##  $ embarked : chr [1:1309] "S" "S" "S" "S" ...
##  $ boat     : chr [1:1309] "2" "11" NA NA ...
##  $ body     : num [1:1309] NA NA NA 135 NA NA NA NA NA 22 ...
##  $ home.dest: chr [1:1309] "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
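
The two options noted in the cleaning comments (group-wise imputation, or dropping the missing ages) could be sketched roughly as follows. This is only an illustration on a made-up six-row data frame, not the Titanic data; grouping by passenger class and filling with the median are assumptions chosen for the example, and the names `toy`, `toy_imputed`, and `toy_complete` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for titanic_clean (values are invented)
toy <- data.frame(
  pclass = c(1, 1, 1, 3, 3, 3),
  age    = c(40, NA, 50, 20, 22, NA)
)

# Option 1: fill each missing age with the median age of the same class
toy_imputed <- toy %>%
  group_by(pclass) %>%
  mutate(age = if_else(is.na(age), median(age, na.rm = TRUE), age)) %>%
  ungroup()

# Option 2: drop the rows with a missing age
toy_complete <- toy %>%
  filter(!is.na(age))

toy_imputed
toy_complete
```

Group-wise imputation keeps every row but makes the age distribution look narrower than it really is, while dropping rows keeps only observed ages at the cost of a smaller sample.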

Narrative

The Titanic dataset contains information about passengers on the maiden voyage of the RMS Titanic, a British passenger liner that sank in the North Atlantic Ocean in April 1912 after hitting an iceberg on her way from Southampton to New York City. The dataset provides a window into the lives of those onboard: a diverse group of passengers from different socio-economic backgrounds, reflected in the three passenger classes.

Variables

  • pclass: Passenger class (1st, 2nd, 3rd) - This categorical variable divides passengers into three classes (1st, 2nd, and 3rd), reflecting the socio-economic stratification of the early 20th century.

  • survived: Survival status (1 = Yes, 0 = No) - A binary categorical variable indicating survival (1) or non-survival (0) of the passengers.

  • name: Name of the passenger - Textual data providing the names of the passengers, allowing for individual identification and historical research.

  • sex: Gender - A categorical variable recording the gender of passengers.

  • age: Age in years - A numerical variable detailing the age of each passenger, giving insights into the age distribution onboard.

  • sibsp: Number of siblings/spouses aboard - This numerical variable counts the number of siblings or spouses that a passenger had aboard the Titanic.

  • parch: Number of parents/children aboard - Similar to sibsp, this numerical variable tallies the number of parents or children a passenger had on the ship.

  • ticket: Ticket number - A combination of text and numeric data representing each passenger’s ticket number.

  • fare: Passenger fare - A numerical variable showing how much each passenger paid, potentially indicating their financial status.

  • cabin: Cabin number - Textual data providing the cabin number for passengers, which can be linked to their class and location on the ship.

  • embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) - A categorical variable indicating the port where passengers boarded the Titanic, with codes for Cherbourg, Queenstown, and Southampton.

  • boat: Lifeboat (if survived) - For survivors, this text/numeric variable identifies the lifeboat they were on, giving insight into the rescue process.

  • body: Body identification number (if did not survive) - For those who did not survive, this numeric variable provides a body identification number, if available.

  • home.dest: Home/Destination - Textual data about the passengers’ home or intended destination, offering a glimpse into their personal journeys and backgrounds.

Descriptive Statistics

# Summary statistics: age (mean, median, SD) and survival counts
summary_statistics <- titanic_clean %>%
  summarise(
    MeanAge = mean(age, na.rm = TRUE),
    MedianAge = median(age, na.rm = TRUE),
    SdAge = sd(age, na.rm = TRUE),
    SurvivedCount = sum(survived == 1, na.rm = TRUE),
    NotSurvivedCount = sum(survived == 0, na.rm = TRUE),
  )
summary_statistics
## # A tibble: 1 × 5
##   MeanAge MedianAge SdAge SurvivedCount NotSurvivedCount
##     <dbl>     <dbl> <dbl>         <int>            <int>
## 1    29.9        28  14.4           500              809
# Survival rates by passenger class
titanic_clean %>%
  group_by(pclass) %>%
  summarise(SurvivalRate = mean(survived == 1)) %>%
  ggplot(aes(x = factor(pclass), y = SurvivalRate, fill = factor(pclass))) +
  geom_bar(stat = "identity") +
  labs(title = "Survival Rates by Passenger Class", x = "Passenger Class", y = "Survival Rate") +
  scale_fill_brewer(palette = "Set1")

# Age distribution among survivors and non-survivors
titanic_clean %>%
  ggplot(aes(x = age, fill = as.factor(survived))) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.6) +
  labs(title = "Age Distribution of Survivors and Non-Survivors", x = "Age", y = "Count") +
  scale_fill_brewer(palette = "Set1", name = "Survived", labels = c("No", "Yes"))
## Warning: Removed 263 rows containing non-finite values (`stat_bin()`).

# Survival rates by gender
titanic_clean %>%
  group_by(sex) %>%
  summarise(SurvivalRate = mean(survived == 1)) %>%
  ggplot(aes(x = sex, y = SurvivalRate, fill = sex)) +
  geom_bar(stat = "identity") +
  labs(title = "Survival Rates by Gender", x = "Gender", y = "Survival Rate") +
  scale_fill_brewer(palette = "Pastel1")

# Fare distribution
titanic_clean %>%
  ggplot(aes(x = fare)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Fares", x = "Fare", y = "Count")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

Further descriptive analysis

Fare Distribution by Embarkation Port

Question: How does the fare distribution vary among different embarkation ports?

This can reveal if passengers embarking from certain ports tended to pay more or less, possibly indicating economic disparities based on embarkation points.

# Check that the 'embarked' column exists, then drop passengers with no recorded port.
# Only 'embarked' is checked: na.omit() on the whole table would drop nearly every
# row, because 'boat' and 'body' are rarely both recorded.
if ("embarked" %in% names(titanic_clean)) {
  titanic_embarked <- titanic_clean %>%
    filter(!is.na(embarked))

  # Fare Distribution by Embarkation Port
  ggplot(titanic_embarked, aes(x = fare, fill = factor(embarked))) + 
      geom_histogram(binwidth = 10, color = "black", alpha = 0.7) +
      facet_wrap(~embarked, scales = 'free_x') +
      theme_minimal() +
      labs(title = "Fare Distribution by Embarkation Port", x = "Fare", y = "Count", fill = "Embarkation Port")
} else {
  stop("'embarked' column not found in the dataset.")
}
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

The majority of passengers embarking from port S (Southampton) paid lower fares compared to those from port C (Cherbourg). Port Q (Queenstown) shows a very tight range of fares, predominantly on the lower end, suggesting that passengers from this port were likely to be in the lower passenger classes. The presence of some higher fares at port C indicates that this port had a relatively higher proportion of first or second-class passengers.

Passenger Class Proportion by Age Group

Question: How is the age of passengers distributed across different passenger classes?

This can show if certain age groups were more prevalent in a particular class, providing insights into the socio-economic demographics of the passengers.

titanic_clean$AgeGroup <- cut(titanic_clean$age, breaks = c(0, 12, 18, 60, Inf), labels = c("Child", "Teenager", "Adult", "Senior"))
ggplot(titanic_clean, aes(x = factor(pclass), fill = AgeGroup)) + 
    geom_bar(position = "fill") +
    theme_minimal() +
    labs(title = "Passenger Class Proportion by Age Group", x = "Passenger Class", y = "Proportion", fill = "Age Group")

The visualization suggests that the first-class passengers were predominantly adults, with a relatively small proportion of children, teenagers, and seniors. The second and third classes show a more diverse age distribution, with the third class having a higher proportion of children and teenagers. This could reflect economic factors, where families and younger individuals were more likely to travel in the lower classes.
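
To make the age-group boundaries explicit, the short check below applies the same breaks and labels to a few illustrative ages. It relies on the default behaviour of `cut()`, which builds right-closed intervals, so an age of exactly 12 falls in "Child" and an age of exactly 18 falls in "Teenager". The ages are invented for the example, and the names `example_ages` and `age_groups` are hypothetical.

```r
# Same breaks and labels as in the analysis above
breaks <- c(0, 12, 18, 60, Inf)
labels <- c("Child", "Teenager", "Adult", "Senior")

# Illustrative ages chosen to sit on or near the bin edges, plus one missing value
example_ages <- c(0.5, 12, 13, 18, 60, 61, NA)
age_groups <- cut(example_ages, breaks = breaks, labels = labels)

data.frame(age = example_ages, AgeGroup = age_groups)
```

Passengers with a missing age get a missing age group, so they appear as a separate NA segment in each bar of the proportion plot.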

Cabin Location by Survival Status

Question: Does the cabin location (deck) have any correlation with survival rates?

This can highlight if certain decks were more prone to having survivors or casualties, possibly due to their location on the ship.

titanic_clean$deck <- substr(titanic_clean$cabin, 1, 1)
titanic_cabin <- titanic_clean[!is.na(titanic_clean$deck) & !is.na(titanic_clean$survived), ]
deck_survived_table <- table(titanic_cabin$deck, titanic_cabin$survived)
mosaicplot(
        deck_survived_table, 
        main = "Mosaic Plot of Survival Status Across Decks", 
        xlab = "Deck", 
        ylab = "Survival Status", 
        color = TRUE
)

The plot suggests that survival rates differed across decks. Within each deck's column, the height of the survivor tile shows the proportion who survived, and the column width reflects how many passengers had a recorded cabin on that deck. Decks B, C, D, and E show a larger proportion of survivors, while decks A and G, as well as the deck denoted by T, had relatively fewer survivors. This might be due to the location of these decks on the ship and their accessibility to lifeboats. No formal test was run here, and the plot covers only passengers with a recorded cabin.
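
The proportions that the mosaic plot shows graphically can also be read off as numbers. The sketch below does this with `prop.table()` on a small set of invented cabin and survival values; the vectors `cabin` and `survived` here are hypothetical stand-ins, not the dataset columns.

```r
# Hypothetical cabin and survival values (invented for illustration)
cabin    <- c("B5", "C22 C26", "C85", NA, "E12", "B20", NA, "C101")
survived <- c(1, 0, 1, 0, 1, 1, 0, 0)

# The first character of the cabin string is taken as the deck letter
deck <- substr(cabin, 1, 1)

# table() drops missing decks by default, mirroring the filtering step above
deck_table <- table(deck, survived)

# Row-wise proportions: share of non-survivors (0) and survivors (1) per deck
deck_props <- prop.table(deck_table, margin = 1)
round(deck_props, 2)
```

Printing `deck_table` next to `deck_props` helps separate decks with genuinely different survival rates from decks that only have a handful of passengers.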

From these visualizations, we can infer that socio-economic status (indicated by fare and passenger class), age distribution, and cabin location (deck) played roles in the survival patterns observed on the Titanic. These factors are intertwined with the historical context of the event, where first-class passengers had better access to emergency resources, and certain decks were either more or less advantageous for evacuation.

Lastly, let’s examine how the average fare and survival rate vary by passenger class (pclass) and embarkation port (embarked).

# Grouping by passenger class and embarkation port to find average fare and survival rate
grouped_stats <- titanic_clean %>%
  group_by(pclass, embarked) %>%
  summarise(
    Average_Fare = mean(fare, na.rm = TRUE),
    Survival_Rate = mean(as.numeric(survived), na.rm = TRUE)
  )
## `summarise()` has grouped output by 'pclass'. You can override using the
## `.groups` argument.
# View the results
grouped_stats
## # A tibble: 10 × 4
## # Groups:   pclass [3]
##    pclass embarked Average_Fare Survival_Rate
##     <dbl> <chr>           <dbl>         <dbl>
##  1      1 C               107.          0.688
##  2      1 Q                90           0.667
##  3      1 S                72.1         0.559
##  4      1 <NA>             80           1    
##  5      2 C                23.3         0.571
##  6      2 Q                11.7         0.286
##  7      2 S                21.2         0.417
##  8      3 C                11.0         0.366
##  9      3 Q                10.4         0.354
## 10      3 S                14.4         0.210
# Line plot of average fare by passenger class, colored by embarkation port
ggplot(grouped_stats, aes(x = factor(pclass), y = Average_Fare, group = embarked, color = embarked)) +
  geom_line() +
  geom_point(size = 3) +
  theme_minimal() +
  labs(title = "Average Fare by Passenger Class and Embarkation Port",
       x = "Passenger Class",
       y = "Average Fare",
       color = "Embarkation Port")

# Bubble plot with color gradient based on survival rate
ggplot(grouped_stats, aes(x = factor(pclass), y = factor(embarked), size = Survival_Rate, color = Survival_Rate)) +
  geom_point(alpha = 0.6) + # Set transparency to see overlapping points
  scale_size_continuous(range = c(3, 10)) + # Adjust the size range for better visibility
  scale_color_gradient(low = "blue", high = "red") + # Use a color gradient from blue to red
  theme_minimal() +
  labs(title = "Survival Rate by Passenger Class and Embarkation Port",
       x = "Passenger Class",
       y = "Embarkation Port",
       size = "Survival Rate",
       color = "Survival Rate")

  1. First-Class Passengers (Pclass 1):
    • Passengers embarking from Cherbourg (C) paid the highest average fare ($106.85) among all first-class passengers and had a survival rate of approximately 68.79%.
    • Those embarking from Queenstown (Q) had a lower average fare ($90.00) and a slightly lower survival rate of 66.67%.
    • Passengers embarking from Southampton (S) paid the least on average ($72.15) among first-class passengers and had a survival rate of about 55.93%.
    • There is an entry with NA for embarkation, which has an average fare of $80.00 and a survival rate of 100%. This could be due to a small sample size or missing data, and it’s an outlier compared to other ports.
  2. Second-Class Passengers (Pclass 2):
    • The average fares are significantly lower than those of the first class, with Cherbourg passengers paying around $23.30 and having a survival rate of approximately 57.14%.
    • Queenstown’s second-class passengers had the lowest average fare ($11.74) and the lowest survival rate of 28.57%.
    • Those from Southampton had an average fare of $21.21 and a survival rate of 41.74%.
  3. Third-Class Passengers (Pclass 3):
    • Cherbourg’s third-class passengers paid an average fare of $11.02 and had a survival rate of 36.63%, which is higher than the survival rates for third-class passengers from other ports.
    • Queenstown passengers had a very close average fare ($10.39) but a lower survival rate of 35.39%.
    • Southampton’s third-class passengers had a higher average fare ($14.44) compared to the other two ports but the lowest survival rate of 21.01%.

General Observations:

  • The average fare decreases from first class to third class, as expected.
  • First-class passengers generally had higher survival rates than second- and third-class passengers, which is consistent with historical accounts suggesting that higher-class passengers had better access to lifeboats.
  • Among the groups with a known port, Cherbourg’s first-class passengers paid the highest average fare and also had the highest survival rate, which might suggest a correlation between fare and survival.
  • The survival rate for passengers with NA embarkation data is an anomaly and should be investigated further; it could potentially skew the overall analysis due to its perfect survival rate.
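
One way to investigate an anomalous group such as the NA embarkation row is to report the group size next to each rate, since a 100% rate in a very small group carries little information. The sketch below shows the idea on invented data; `toy` and `toy_stats` are hypothetical names, and the values are not taken from the Titanic dataset.

```r
library(dplyr)

# Hypothetical toy data (invented values), including missing embarkation ports
toy <- data.frame(
  pclass   = c(1, 1, 1, 1, 3, 3, 3),
  embarked = c("C", "C", NA, NA, "S", "S", "S"),
  survived = c(1, 0, 1, 1, 0, 0, 1)
)

# Report the group size next to the rate so that tiny groups are easy to spot
toy_stats <- toy %>%
  group_by(pclass, embarked) %>%
  summarise(
    n = n(),
    Survival_Rate = mean(survived),
    .groups = "drop"
  )
toy_stats
```

The same `n = n()` line could be added to the `summarise()` call that builds `grouped_stats`.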

These results suggest that both passenger class and port of embarkation were associated with survival on the Titanic, with higher fares (likely linked to higher classes) going together with higher survival rates. Survival rates also differ between ports within each class, but these are descriptive comparisons: they do not show that the port had an effect independent of other factors such as sex or age.

Research directions

Let’s answer some of the remaining research questions.

Did age and sex significantly affect the survival chances?

We will use logistic regression to assess the impact of age and sex on the odds of survival. Logistic regression is appropriate for binary outcome variables such as survival status (here, 1 denotes survival and 0 denotes not surviving).

# Logistic regression model with Age and Sex as predictors for Survival
survival_model <- glm(survived ~ age + sex, family = binomial(link = "logit"), data = titanic_clean)

# Summary of the model to check the significance of predictors
summary(survival_model)
## 
## Call:
## glm(formula = survived ~ age + sex, family = binomial(link = "logit"), 
##     data = titanic_clean)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.235414   0.192032   6.433 1.25e-10 ***
## age         -0.004254   0.005207  -0.817    0.414    
## sexmale     -2.460689   0.152315 -16.155  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1414.6  on 1045  degrees of freedom
## Residual deviance: 1101.3  on 1043  degrees of freedom
##   (263 observations deleted due to missingness)
## AIC: 1107.3
## 
## Number of Fisher Scoring iterations: 4

To test whether survival is independent of sex, we can use a Chi-squared test of independence.

# Create a contingency table of survival by sex
table_sex_survival <- table(titanic_clean$survived, titanic_clean$sex)

# Perform the Chi-Squared test
chisq.test(table_sex_survival)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_sex_survival
## X-squared = 363.62, df = 1, p-value < 2.2e-16

Based on the output of the logistic regression analysis and the Chi-squared test, we can draw the following conclusions:

Logistic Regression Conclusion:

  • The coefficient for age is -0.004254 and is not statistically significant (p-value = 0.414), so there is no evidence of a linear effect of age on the log odds of survival once sex is controlled for.
  • The coefficient for sexmale is -2.460689 and is statistically significant (p-value < 2e-16). Holding age constant, the odds of survival for males are much lower than for females.
  • The intercept (1.235414) is the log odds of survival for the baseline group, females at age 0. It is positive and statistically significant, meaning the model puts the survival probability of that baseline group above 50%. The comparison between females and males comes from the sexmale coefficient, not from the intercept.
  • The model was fitted on the 1,046 passengers with a recorded age; 263 observations were dropped because age was missing.
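
Log-odds coefficients are easier to interpret once exponentiated into odds ratios. The sketch below does this for the estimates printed in the model summary above; the numbers are copied by hand from that output, and `coefs` and `odds_ratios` are hypothetical names.

```r
# Estimates copied from the summary(survival_model) output above
coefs <- c("(Intercept)" = 1.235414, age = -0.004254, sexmale = -2.460689)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(coefs[c("age", "sexmale")])
round(odds_ratios, 3)

# Percentage change in the odds of survival per one-unit increase in each predictor
round((odds_ratios - 1) * 100, 1)
```

The odds ratio for sexmale comes out at roughly 0.085, meaning the odds of survival for males were under a tenth of those for females of the same age. With the fitted model in memory, `exp(coef(survival_model))` gives the same quantities directly.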

Chi-Squared Test Conclusion:

  • The result of the Chi-squared test (X-squared = 363.62, df = 1, p-value < 2.2e-16) indicates a highly statistically significant association between sex and survival. Together with the survival rates by gender plotted earlier, this shows that females had a higher survival rate than males.
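
A p-value shows that an association exists but not how strong it is. One common effect-size measure for a contingency table is Cramér's V, sketched below from the statistic and sample size reported above. This is an approximation: the reported X-squared includes Yates' continuity correction, so the value is slightly smaller than it would be with the uncorrected statistic. The variable names are hypothetical.

```r
# Values taken from the chisq.test() output and the dataset size reported above
x_squared <- 363.62
n         <- 1309   # rows in titanic_clean
k         <- 2      # smaller dimension of the 2 x 2 table

# Cramér's V: sqrt(X^2 / (n * (k - 1))); ranges from 0 (no association) to 1
cramers_v <- sqrt(x_squared / (n * (k - 1)))
round(cramers_v, 3)
```

A value a little above 0.5 is usually read as a strong association for a 2 x 2 table.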

Overall Conclusion:

  • Sex is a significant predictor of survival on the Titanic, with females having much higher odds of surviving than males.
  • Age, entered as a continuous variable, does not have a significant effect on survival once sex is in the model, so its role is less clear-cut.

These conclusions are consistent with historical accounts that highlight that women and children were given priority for lifeboats. Although age is not significant in the logistic regression, it’s possible that a more nuanced approach, such as categorizing age into groups or considering interaction effects, might reveal a different pattern.
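
The interaction idea mentioned above could look roughly like the sketch below. It is fitted on simulated data, not on the Titanic dataset: the sample size, the seed, and the effect sizes used to generate `sim` are all invented for illustration, so the output says nothing about the real passengers.

```r
set.seed(601)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in for titanic_clean; the effect sizes below are invented
n   <- 500
sim <- data.frame(
  age = runif(n, min = 1, max = 70),
  sex = factor(sample(c("female", "male"), n, replace = TRUE))
)
is_male  <- sim$sex == "male"
log_odds <- 1.2 - 2.4 * is_male - 0.02 * sim$age * is_male
sim$survived <- rbinom(n, size = 1, prob = plogis(log_odds))

# age * sex expands to age + sex + age:sex, so the age slope may differ by sex
interaction_model <- glm(survived ~ age * sex,
                         family = binomial(link = "logit"), data = sim)

# Likelihood-ratio test of the interaction against the additive model
additive_model <- glm(survived ~ age + sex,
                      family = binomial(link = "logit"), data = sim)
anova(additive_model, interaction_model, test = "Chisq")
```

On the real data, the same two formulas could be fitted with `data = titanic_clean` to check whether the effect of age differs between males and females.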

What was the impact of having family onboard (siblings/spouses, parents/children) on survival?

We’ll use logistic regression to assess the impact of having family onboard on the odds of survival. We’ll create a binary variable that indicates whether the passenger had family onboard.

# Flag passengers who had at least one sibling, spouse, parent, or child aboard
titanic_clean$Family_Onboard <- with(titanic_clean, sibsp + parch > 0)

# Logistic regression model with Family Onboard as a predictor
family_survival_model <- glm(survived ~ Family_Onboard, family = binomial(link = "logit"), data = titanic_clean)

# Summary of the model to check the significance of having family onboard
summary(family_survival_model)
## 
## Call:
## glm(formula = survived ~ Family_Onboard, family = binomial(link = "logit"), 
##     data = titanic_clean)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.83527    0.07745 -10.784  < 2e-16 ***
## Family_OnboardTRUE  0.84683    0.11707   7.233 4.71e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1741  on 1308  degrees of freedom
## Residual deviance: 1688  on 1307  degrees of freedom
## AIC: 1692
## 
## Number of Fisher Scoring iterations: 4
# Create a summarized dataset for plotting
survival_by_family <- titanic_clean %>%
  group_by(Family_Onboard) %>%
  summarise(Survival_Rate = mean(as.numeric(survived)))

# Point plot for Survival Rate by Family Onboard
ggplot(survival_by_family, aes(x = factor(Family_Onboard), y = Survival_Rate, color = factor(Family_Onboard))) +
  geom_point(size = 5) +
  scale_color_manual(values = c("red", "green"), labels = c("Alone", "With Family")) +
  theme_minimal() +
  labs(title = "Survival Rate by Family Onboard",
       x = "Family Onboard",
       y = "Survival Rate",
       color = "Family Onboard") +
  theme(legend.position = "bottom")

Based on the output from the logistic regression model, here are the conclusions we can draw regarding the impact of having family onboard on the chances of survival:

Logistic Regression Conclusion:

  • The intercept (-0.83527) is the log odds of survival for passengers without family onboard. It is negative and statistically significant, meaning that passengers travelling alone had a survival probability below 50%.
  • The coefficient for Family_OnboardTRUE is 0.84683 and is statistically significant (p-value = 4.71e-13). This positive coefficient means that passengers with family onboard had higher odds of survival than those who were alone.
  • Family_Onboard is the only predictor in this model, so the estimate is unadjusted: it does not control for sex, age, or passenger class.
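
The coefficients can be turned into survival probabilities with the inverse logit. The sketch below uses the two estimates printed in the model summary above, copied by hand; the variable names are hypothetical.

```r
# Estimates copied from the summary(family_survival_model) output above
intercept   <- -0.83527
family_coef <-  0.84683

# plogis() is the inverse logit: it maps log odds to a probability
p_alone    <- plogis(intercept)
p_family   <- plogis(intercept + family_coef)
odds_ratio <- exp(family_coef)

round(c(alone = p_alone, with_family = p_family, odds_ratio = odds_ratio), 3)
```

This gives a predicted survival probability of about 30% for passengers travelling alone and about 50% for those with family onboard, which should match the group means plotted above because the model has a single binary predictor.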

Overall Conclusion:

  • Having family onboard was associated with a significantly higher chance of survival on the Titanic.
  • This finding may reflect social and behavioral dynamics during the evacuation, such as families being prioritized for rescue or family members being more motivated to seek and aid each other’s survival.
  • The association is highly statistically significant, but the small drop in deviance (from 1741 to 1688) indicates that family presence alone explains only a small part of the variation in survival.

These conclusions suggest that during the Titanic disaster, social bonds, as indicated by the presence of family members, could have played a crucial role in survival outcomes.

Limitations

When creating visualizations, it’s essential to consider their limitations and the potential questions that might remain unanswered. Here are some limitations related to the visualizations we discussed:

Limitations and Unanswered Questions:

  1. Sample Bias: The dataset contains 1,309 records, whereas 2,224 passengers and crew were aboard, so it does not cover everyone on the ship; the missing records could bias the visualizations and conclusions.

  2. Missing Data: Age is missing for 263 of the 1,309 records, and cabin is also frequently missing, which could affect the accuracy of the visualizations. The handling of these missing values (e.g., imputation or deletion) can significantly influence the results.

  3. Overlooked Interactions: The visualizations may not fully capture interactions between variables. For example, the combined effect of class, sex, and age may be more nuanced than can be conveyed in a single plot.

  4. Survivorship Bias: The visualizations cannot account for survivorship bias, where the data available is conditioned on the outcome of survival, potentially skewing the analysis.
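
To put numbers on the missing-data limitation, the count and share of missing values per column can be tabulated before any plotting or modelling. The sketch below does this on a small invented data frame; `toy`, `missing_count`, and `missing_share` are hypothetical names.

```r
# Hypothetical toy data frame (invented values) with gaps in age and cabin
toy <- data.frame(
  age      = c(29, NA, 2, NA, 25),
  cabin    = c("B5", NA, NA, NA, "C22"),
  survived = c(1, 1, 0, 0, 0)
)

# Number and share of missing values in each column
missing_count <- colSums(is.na(toy))
missing_share <- round(missing_count / nrow(toy), 2)
data.frame(missing_count, missing_share)
```

Running the same two lines on `titanic_clean` would show which variables are complete enough to support the plots and models in this report.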

Improvements for the Final Project:

  1. Interactive Elements: Interactive visualizations that allow users to explore the data themselves (e.g., with tools like Shiny in R) can be more engaging and informative.

  2. Multivariate Analysis: Including multivariate plots or adding facets to existing plots can help to show the relationship between more than two variables.

  3. Data Imputation: Addressing missing data with advanced imputation techniques could provide a more accurate picture and should be clearly communicated in the visualization’s narrative.