# Load the packages
library(readxl)
library(here)
## here() starts at /Users/varad/Documents/Academics/Year 2024/DACSS 601
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggmosaic)
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
titanic_data <- here("titanic3.xls") %>%
read_excel()
## Warning: Coercing text to numeric in M1306 / R1306C13: '328'
titanic_data
## # A tibble: 1,309 × 14
## pclass survived name sex age sibsp parch ticket fare cabin embarked
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 1 Allen, … fema… 29 0 0 24160 211. B5 S
## 2 1 1 Allison… male 0.917 1 2 113781 152. C22 … S
## 3 1 0 Allison… fema… 2 1 2 113781 152. C22 … S
## 4 1 0 Allison… male 30 1 2 113781 152. C22 … S
## 5 1 0 Allison… fema… 25 1 2 113781 152. C22 … S
## 6 1 1 Anderso… male 48 0 0 19952 26.6 E12 S
## 7 1 1 Andrews… fema… 63 1 0 13502 78.0 D7 S
## 8 1 0 Andrews… male 39 0 0 112050 0 A36 S
## 9 1 1 Appleto… fema… 53 2 0 11769 51.5 C101 S
## 10 1 0 Artagav… male 71 0 0 PC 17… 49.5 <NA> C
## # ℹ 1,299 more rows
## # ℹ 3 more variables: boat <chr>, body <dbl>, home.dest <chr>
# Cleaning the data
titanic_clean <- titanic_data %>%
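# Convert any factor columns to character. read_excel() does not create factors,
# so this is only a safeguard; across() replaces the superseded mutate_if().
mutate(across(where(is.factor), as.character))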
# Handle missing values (example: fill NA in 'age' with median)
# mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age))
# Removed after incorporating feedback from the first assignment.
# Moving ahead, we can either identify the ticket type and then take the age mean using the mean of that ticket type,
# or, we can simply remove the NAs.
titanic_clean
## # A tibble: 1,309 × 14
## pclass survived name sex age sibsp parch ticket fare cabin embarked
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 1 Allen, … fema… 29 0 0 24160 211. B5 S
## 2 1 1 Allison… male 0.917 1 2 113781 152. C22 … S
## 3 1 0 Allison… fema… 2 1 2 113781 152. C22 … S
## 4 1 0 Allison… male 30 1 2 113781 152. C22 … S
## 5 1 0 Allison… fema… 25 1 2 113781 152. C22 … S
## 6 1 1 Anderso… male 48 0 0 19952 26.6 E12 S
## 7 1 1 Andrews… fema… 63 1 0 13502 78.0 D7 S
## 8 1 0 Andrews… male 39 0 0 112050 0 A36 S
## 9 1 1 Appleto… fema… 53 2 0 11769 51.5 C101 S
## 10 1 0 Artagav… male 71 0 0 PC 17… 49.5 <NA> C
## # ℹ 1,299 more rows
## # ℹ 3 more variables: boat <chr>, body <dbl>, home.dest <chr>
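The code comments above mention two options for missing ages: filling them from a group-level average or dropping them. A minimal sketch of both follows, on toy data with hypothetical values. It assumes passenger class (`pclass`) as a stand-in for "ticket type" and uses the median rather than the mean; neither choice is confirmed by the analysis.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from titanic3.xls)
toy <- data.frame(
  pclass = c(1, 1, 1, 3, 3, 3),
  age    = c(40, NA, 50, 20, NA, 30)
)

# Option 1: replace each missing age with the median age of the same class
toy_imputed <- toy %>%
  group_by(pclass) %>%
  mutate(age = if_else(is.na(age), median(age, na.rm = TRUE), age)) %>%
  ungroup()

# Option 2: drop rows with a missing age
toy_complete <- toy %>%
  filter(!is.na(age))
```

Imputation keeps all rows but shrinks the spread of age within each class, while deletion keeps the observed spread but loses rows, so either choice is worth reporting alongside the results.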
str(titanic_clean)
## tibble [1,309 × 14] (S3: tbl_df/tbl/data.frame)
## $ pclass : num [1:1309] 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : num [1:1309] 1 1 0 0 0 1 1 0 1 0 ...
## $ name : chr [1:1309] "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
## $ sex : chr [1:1309] "female" "male" "female" "male" ...
## $ age : num [1:1309] 29 0.917 2 30 25 ...
## $ sibsp : num [1:1309] 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : num [1:1309] 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : chr [1:1309] "24160" "113781" "113781" "113781" ...
## $ fare : num [1:1309] 211 152 152 152 152 ...
## $ cabin : chr [1:1309] "B5" "C22 C26" "C22 C26" "C22 C26" ...
## $ embarked : chr [1:1309] "S" "S" "S" "S" ...
## $ boat : chr [1:1309] "2" "11" NA NA ...
## $ body : num [1:1309] NA NA NA 135 NA NA NA NA NA 22 ...
## $ home.dest: chr [1:1309] "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
The Titanic dataset describes passengers on the maiden voyage of the RMS Titanic, a British passenger liner that sank in the North Atlantic Ocean in April 1912 after hitting an iceberg en route from Southampton to New York City. It provides a window into the lives of those onboard, a diverse group of passengers from different socio-economic backgrounds spread across the three passenger classes.
pclass: Passenger class (1st, 2nd, 3rd) - This categorical variable divides passengers into three classes (1st, 2nd, and 3rd), reflecting the socio-economic stratification of the early 20th century.
survived: Survival status (1 = Yes, 0 = No) - A binary categorical variable indicating survival (1) or non-survival (0) of the passengers.
name: Name of the passenger - Textual data providing the names of the passengers, allowing for individual identification and historical research.
sex: Gender - A categorical variable recording the gender of passengers.
age: Age in years - A numerical variable detailing the age of each passenger, giving insights into the age distribution onboard.
sibsp: Number of siblings/spouses aboard - This numerical variable counts the number of siblings or spouses that a passenger had aboard the Titanic.
parch: Number of parents/children aboard - Similar to sibsp, this numerical variable tallies the number of parents or children a passenger had on the ship.
ticket: Ticket number - A combination of text and numeric data representing each passenger’s ticket number.
fare: Passenger fare - A numerical variable showing how much each passenger paid, potentially indicating their financial status.
cabin: Cabin number - Textual data providing the cabin number for passengers, which can be linked to their class and location on the ship.
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) - A categorical variable indicating the port where passengers boarded the Titanic, with codes for Cherbourg, Queenstown, and Southampton.
boat: Lifeboat (if survived) - For survivors, this text/numeric variable identifies the lifeboat they were on, giving insight into the rescue process.
body: Body identification number (if did not survive) - For those who did not survive, this numeric variable provides a body identification number, if available.
home.dest: Home/Destination - Textual data about the passengers’ home or intended destination, offering a glimpse into their personal journeys and backgrounds.
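Several of the variables described above are categorical but are stored as numbers or one-letter codes. As a sketch only, they could be turned into labeled factors as shown below. The toy rows are hypothetical, and the label text is an assumption based on the descriptions above.

```r
# Toy rows using the same coding as the dataset (values are hypothetical)
toy <- data.frame(
  pclass   = c(1, 2, 3, 3),
  survived = c(1, 0, 0, 1),
  embarked = c("S", "C", "Q", NA)
)

# Ordered factor for class, plain factors for survival and port
toy$pclass <- factor(toy$pclass, levels = c(1, 2, 3),
                     labels = c("1st", "2nd", "3rd"), ordered = TRUE)
toy$survived <- factor(toy$survived, levels = c(0, 1),
                       labels = c("No", "Yes"))
toy$embarked <- factor(toy$embarked, levels = c("C", "Q", "S"),
                       labels = c("Cherbourg", "Queenstown", "Southampton"))

str(toy)
```

The later code in this document compares `survived` with 1 and averages it, so a conversion like this would be safer in a separate copy of the data used only for tables and plot labels.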
# statistics
summary_statistics <- titanic_clean %>%
summarise(
MeanAge = mean(age, na.rm = TRUE),
MedianAge = median(age, na.rm = TRUE),
SdAge = sd(age, na.rm = TRUE),
SurvivedCount = sum(survived == 1, na.rm = TRUE),
NotSurvivedCount = sum(survived == 0, na.rm = TRUE),
)
summary_statistics
## # A tibble: 1 × 5
## MeanAge MedianAge SdAge SurvivedCount NotSurvivedCount
## <dbl> <dbl> <dbl> <int> <int>
## 1 29.9 28 14.4 500 809
# Survival rates by passenger class
titanic_clean %>%
group_by(pclass) %>%
summarise(SurvivalRate = mean(survived == 1)) %>%
ggplot(aes(x = factor(pclass), y = SurvivalRate, fill = factor(pclass))) +
geom_bar(stat = "identity") +
labs(title = "Survival Rates by Passenger Class", x = "Passenger Class", y = "Survival Rate") +
scale_fill_brewer(palette = "Set1")
# Age distribution among survivors and non-survivors
titanic_clean %>%
ggplot(aes(x = age, fill = as.factor(survived))) +
geom_histogram(bins = 30, position = "identity", alpha = 0.6) +
labs(title = "Age Distribution of Survivors and Non-Survivors", x = "Age", y = "Count") +
scale_fill_brewer(palette = "Set1", name = "Survived", labels = c("No", "Yes"))
## Warning: Removed 263 rows containing non-finite values (`stat_bin()`).
# Survival rates by sex
titanic_clean %>%
group_by(sex) %>%
summarise(SurvivalRate = mean(survived == 1)) %>%
ggplot(aes(x = sex, y = SurvivalRate, fill = sex)) +
geom_bar(stat = "identity") +
labs(title = "Survival Rates by Gender", x = "Gender", y = "Survival Rate") +
scale_fill_brewer(palette = "Pastel1")
# Fare distribution
titanic_clean %>%
ggplot(aes(x = fare)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Distribution of Fares", x = "Fare", y = "Count")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
Question: How does the fare distribution vary among different embarkation ports?
This can reveal if passengers embarking from certain ports tended to pay more or less, possibly indicating economic disparities based on embarkation points.
# Check that the 'embarked' column exists before plotting
if ("embarked" %in% names(titanic_clean)) {
# Fare distribution by embarkation port. No blanket na.omit() is applied:
# it would drop every row with any NA and would overwrite titanic_data.
# ggplot drops the row with a missing fare itself and reports it in a warning.
ggplot(titanic_clean, aes(x = fare, fill = factor(embarked))) +
geom_histogram(binwidth = 10, color = "black", alpha = 0.7) +
facet_wrap(~embarked, scales = 'free_x') +
theme_minimal() +
labs(title = "Fare Distribution by Embarkation Port", x = "Fare", y = "Count", fill = "Embarkation Port")
} else {
stop("'embarked' column not found in the dataset.")
}
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
The majority of passengers embarking from port S (Southampton) paid
lower fares compared to those from port C (Cherbourg). Port Q
(Queenstown) shows a very tight range of fares, predominantly on the
lower end, suggesting that passengers from this port were likely to be
in the lower passenger classes. The presence of some higher fares at
port C indicates that this port had a relatively higher proportion of
first or second-class passengers.
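Reading fare ranges off faceted histograms is approximate, so a numeric summary per port can back up the interpretation. The sketch below uses toy fares (hypothetical values) and the column names `embarked` and `fare` from the dataset.

```r
library(dplyr)

# Toy fares per port (hypothetical values)
toy <- data.frame(
  embarked = c("C", "C", "C", "Q", "Q", "S", "S", "S"),
  fare     = c(30, 80, 250, 7, 8, 8, 13, 26)
)

# Count, median, interquartile range, and maximum fare for each port
fare_by_port <- toy %>%
  group_by(embarked) %>%
  summarise(
    n           = n(),
    median_fare = median(fare, na.rm = TRUE),
    iqr_fare    = IQR(fare, na.rm = TRUE),
    max_fare    = max(fare, na.rm = TRUE)
  )

fare_by_port
```

The median and interquartile range are used here because fares are right-skewed, so a few very high fares would pull a mean upward.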
Question: How is the age of passengers distributed across different passenger classes?
This can show if certain age groups were more prevalent in a particular class, providing insights into the socio-economic demographics of the passengers.
titanic_clean$AgeGroup <- cut(titanic_clean$age, breaks = c(0, 12, 18, 60, Inf), labels = c("Child", "Teenager", "Adult", "Senior"))
ggplot(titanic_clean, aes(x = factor(pclass), fill = AgeGroup)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(title = "Age Group Proportion by Passenger Class", x = "Passenger Class", y = "Proportion", fill = "Age Group")
The visualization suggests that the first-class passengers were
predominantly adults, with a relatively small proportion of children,
teenagers, and seniors. The second and third classes show a more diverse
age distribution, with the third class having a higher proportion of
children and teenagers. This could reflect economic factors, where
families and younger individuals were more likely to travel in the lower
classes.
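The proportions shown in a stacked bar chart can also be computed as a table, which makes the comparison between classes explicit. This is a sketch on toy ages (hypothetical values) that reuses the same `cut()` breaks and labels as the code above.

```r
library(dplyr)

# Toy ages in two classes (hypothetical values)
toy <- data.frame(
  pclass = c(1, 1, 1, 1, 3, 3, 3, 3),
  age    = c(35, 42, 61, 50, 4, 16, 25, 30)
)

# Share of each age group within each passenger class
age_props <- toy %>%
  mutate(AgeGroup = cut(age, breaks = c(0, 12, 18, 60, Inf),
                        labels = c("Child", "Teenager", "Adult", "Senior"))) %>%
  count(pclass, AgeGroup) %>%
  group_by(pclass) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

age_props
```

On the real data, passengers with a missing age would appear as an `NA` age group, so it may be worth deciding whether those rows belong in the denominator.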
Question: Does the cabin location (deck) have any correlation with survival rates?
This can highlight if certain decks were more prone to having survivors or casualties, possibly due to their location on the ship.
titanic_clean$deck <- substr(titanic_clean$cabin, 1, 1)
titanic_cabin <- titanic_clean[!is.na(titanic_clean$deck) & !is.na(titanic_clean$survived), ]
deck_survived_table <- table(titanic_cabin$deck, titanic_cabin$survived)
mosaicplot(
deck_survived_table,
main = "Mosaic Plot of Survival Status Across Decks",
xlab = "Deck",
ylab = "Survival Status",
color = TRUE
)
The plot illustrates that survival rates varied noticeably across deck locations, although no formal test was run on this table. For instance, decks B, C, D, and E show a larger proportion of survivors (indicated by the relative size of the survivor block within each deck), while decks A and G, as well as the deck denoted by T, had relatively fewer survivors. This might be due to the location of these decks on the ship and their accessibility to lifeboats. The plot covers only passengers with a recorded cabin, so it describes a subset of the dataset.
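Block sizes in a mosaic plot are hard to compare by eye, especially for decks with few passengers. The sketch below computes the survival proportion and the passenger count per deck from a contingency table, on toy data with hypothetical values.

```r
# Toy deck and survival data (hypothetical values)
toy <- data.frame(
  deck     = c("A", "A", "B", "B", "B", "B", "G", "G"),
  survived = c(0, 1, 1, 1, 1, 0, 0, 0)
)

# Contingency table: rows are decks, columns are survival status (0/1)
deck_table <- table(Deck = toy$deck, Survived = toy$survived)

# Row-wise proportions give the survival rate within each deck
deck_rates <- prop.table(deck_table, margin = 1)

# Row totals show how many passengers each rate is based on
deck_n <- rowSums(deck_table)

round(deck_rates, 2)
deck_n
```

Showing the counts next to the rates helps flag decks where a rate rests on only a handful of passengers.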
From these visualizations, we can infer that socio-economic status (indicated by fare and passenger class), age distribution, and cabin location (deck) played roles in the survival patterns observed on the Titanic. These factors are intertwined with the historical context of the event, where first-class passengers had better access to emergency resources, and certain decks were either more or less advantageous for evacuation.
Lastly, let’s examine how the average fare and survival rate vary by passenger class (pclass) and embarkation port (embarked).
# Grouping by passenger class and embarkation port to find average fare and survival rate
grouped_stats <- titanic_clean %>%
group_by(pclass, embarked) %>%
summarise(
Average_Fare = mean(fare, na.rm = TRUE),
Survival_Rate = mean(as.numeric(survived), na.rm = TRUE)
)
## `summarise()` has grouped output by 'pclass'. You can override using the
## `.groups` argument.
# View the results
grouped_stats
## # A tibble: 10 × 4
## # Groups: pclass [3]
## pclass embarked Average_Fare Survival_Rate
## <dbl> <chr> <dbl> <dbl>
## 1 1 C 107. 0.688
## 2 1 Q 90 0.667
## 3 1 S 72.1 0.559
## 4 1 <NA> 80 1
## 5 2 C 23.3 0.571
## 6 2 Q 11.7 0.286
## 7 2 S 21.2 0.417
## 8 3 C 11.0 0.366
## 9 3 Q 10.4 0.354
## 10 3 S 14.4 0.210
# Line plot of average fare by passenger class, colored by embarkation port
ggplot(grouped_stats, aes(x = factor(pclass), y = Average_Fare, group = embarked, color = embarked)) +
geom_line() +
geom_point(size = 3) +
theme_minimal() +
labs(title = "Average Fare by Passenger Class and Embarkation Port",
x = "Passenger Class",
y = "Average Fare",
color = "Embarkation Port")
# Bubble plot with color gradient based on survival rate
ggplot(grouped_stats, aes(x = factor(pclass), y = factor(embarked), size = Survival_Rate, color = Survival_Rate)) +
geom_point(alpha = 0.6) + # Set transparency to see overlapping points
scale_size_continuous(range = c(3, 10)) + # Adjust the size range for better visibility
scale_color_gradient(low = "blue", high = "red") + # Use a color gradient from blue to red
theme_minimal() +
labs(title = "Survival Rate by Passenger Class and Embarkation Port",
x = "Passenger Class",
y = "Embarkation Port",
size = "Survival Rate",
color = "Survival Rate")
One first-class group has NA for embarkation, with an average fare of 80.00 and a survival rate of 100%. This could be due to a small sample size or missing data, and it is an outlier compared to the other ports.
General Observations:
- The average fare decreases as we move from first-class to third-class passengers, as expected.
- First-class passengers generally had higher survival rates than second and third-class passengers, which is consistent with historical accounts suggesting that higher-class passengers had better access to lifeboats.
- Among groups with a known port, Cherbourg’s first-class passengers paid the highest average fare (about 107) and also had the highest survival rate (0.688), which might suggest a correlation between fare and survival.
- The survival rate for passengers with NA embarkation data is an anomaly and should be investigated further; it could skew the group-level comparisons due to its perfect survival rate.
These observations suggest that both passenger class and point of embarkation were associated with survival on the Titanic, with higher fares (closely linked to higher classes) going together with higher survival rates. Survival rates also vary by embarkation point within each class, although group averages alone cannot show that the port had an effect independent of other factors such as sex or age.
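A group with a 100% survival rate is easier to judge when the group size is shown next to it. The sketch below adds a count column to the same kind of grouped summary, on toy data with hypothetical values. The threshold of 3 for a "small" group is an arbitrary choice for illustration.

```r
library(dplyr)

# Toy data with a small group whose port is missing (hypothetical values)
toy <- data.frame(
  pclass   = c(1, 1, 1, 1, 1, 1, 3, 3, 3, 3),
  embarked = c("C", "C", "C", "S", NA, NA, "S", "S", "S", "Q"),
  fare     = c(100, 120, 110, 70, 80, 80, 8, 9, 7, 7),
  survived = c(1, 1, 0, 0, 1, 1, 0, 0, 1, 0)
)

# Same grouped summary as above, plus the number of passengers per group
grouped_with_n <- toy %>%
  group_by(pclass, embarked) %>%
  summarise(
    n             = n(),
    Average_Fare  = mean(fare, na.rm = TRUE),
    Survival_Rate = mean(survived, na.rm = TRUE),
    .groups = "drop"
  )

# Groups too small for their rates to be read as reliable estimates
small_groups <- grouped_with_n %>%
  filter(n < 3)

small_groups
```

Setting `.groups = "drop"` also removes the grouping message that `summarise()` prints when the result is still grouped.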
Let’s answer some of the remaining research questions.
We will use logistic regression to assess the impact of age and sex on the odds of survival. The logistic regression model is appropriate for binary outcome variables like survival status (here 1 denotes survival and 0 denotes not surviving).
# Logistic regression model with Age and Sex as predictors for Survival
survival_model <- glm(survived ~ age + sex, family = binomial(link = "logit"), data = titanic_clean)
# Summary of the model to check the significance of predictors
summary(survival_model)
##
## Call:
## glm(formula = survived ~ age + sex, family = binomial(link = "logit"),
## data = titanic_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.235414 0.192032 6.433 1.25e-10 ***
## age -0.004254 0.005207 -0.817 0.414
## sexmale -2.460689 0.152315 -16.155 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1414.6 on 1045 degrees of freedom
## Residual deviance: 1101.3 on 1043 degrees of freedom
## (263 observations deleted due to missingness)
## AIC: 1107.3
##
## Number of Fisher Scoring iterations: 4
To determine if the survival is independent of sex, we can use the Chi-Squared test.
# Create a contingency table of survival by sex
table_sex_survival <- table(titanic_clean$survived, titanic_clean$sex)
# Perform the Chi-Squared test
chisq.test(table_sex_survival)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_sex_survival
## X-squared = 363.62, df = 1, p-value < 2.2e-16
Based on the output of the logistic regression analysis and the Chi-squared test, we can draw the following conclusions:
Logistic Regression Conclusion:
- The coefficient for age is -0.004254 and is not statistically significant (p-value = 0.414), which implies that age, entered as a linear term, has no detectable effect on the odds of survival when controlling for sex.
- The coefficient for sexmale is -2.460689, which is statistically significant (p-value < 2e-16). Being male multiplies the odds of survival by about exp(-2.46) ≈ 0.085 relative to being female, holding age constant.
- The intercept (1.235414) is the log odds of survival for the baseline group, a female passenger at age 0. It is positive and statistically significant, which means the model puts the survival probability for this baseline above 50%. The comparison between females and males comes from the sexmale coefficient, not from the intercept.
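Log-odds coefficients are easier to interpret after conversion to odds ratios and probabilities. The sketch below does this for the coefficients printed in the model summary above. The values are copied by hand, and the variable names are chosen here for illustration.

```r
# Coefficients copied from the summary(survival_model) output above
log_odds <- c(intercept = 1.235414, age = -0.004254, sexmale = -2.460689)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(log_odds[c("age", "sexmale")])

# The intercept is the log odds for the reference case (female, age 0);
# plogis() is the inverse logit and converts log odds to a probability
baseline_prob <- plogis(log_odds[["intercept"]])

round(odds_ratios, 3)
round(baseline_prob, 3)
```

With the fitted model object available, `exp(coef(survival_model))` gives the same odds ratios without copying numbers by hand.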
Chi-Squared Test Conclusion:
- The result of the Chi-squared test (X-squared = 363.62, df = 1, p-value < 2.2e-16) indicates a highly statistically significant association between sex and survival. The test itself does not give a direction; the logistic regression coefficient and the survival rates by sex plotted earlier show that females survived at a higher rate than males.
Overall Conclusion:
- Sex is a significant predictor of survival on the Titanic, with females having much higher odds of surviving than males.
- Age, as a continuous variable, does not have a significant effect on survival when analyzed together with sex.
- The relationship between age and survival might be more complex and could be better understood by considering age groups or including interaction terms in the model.
These conclusions are consistent with historical accounts that highlight that women and children were given priority for lifeboats. Although age is not significant in the logistic regression, it’s possible that a more nuanced approach, such as categorizing age into groups or considering interaction effects, might reveal a different pattern.
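The two refinements mentioned above, age groups and an interaction term, can be sketched as follows. The data here are simulated under a made-up process so that the code runs on its own; the coefficients it produces say nothing about the Titanic.

```r
set.seed(1)

# Simulated passengers (hypothetical data-generating process)
n <- 400
sim <- data.frame(
  age = runif(n, 1, 70),
  sex = sample(c("female", "male"), n, replace = TRUE)
)
lin_pred <- 1 - 2 * (sim$sex == "male") - 0.01 * sim$age
sim$survived <- rbinom(n, 1, plogis(lin_pred))

# Same age bands as used earlier in this document
sim$age_group <- cut(sim$age, breaks = c(0, 12, 18, 60, Inf),
                     labels = c("Child", "Teenager", "Adult", "Senior"))

# Main-effects model, interaction model, and age-group model
main_model        <- glm(survived ~ age + sex, family = binomial, data = sim)
interaction_model <- glm(survived ~ age * sex, family = binomial, data = sim)
group_model       <- glm(survived ~ age_group + sex, family = binomial, data = sim)

# Likelihood-ratio test: does the age-by-sex interaction improve the fit?
comparison <- anova(main_model, interaction_model, test = "Chisq")
comparison
```

On the real data, the same three formulas could be fitted to `titanic_clean`, keeping in mind that rows with a missing age are dropped by `glm()`.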
We’ll use logistic regression to assess the impact of having family onboard on the odds of survival. We’ll create a binary variable that indicates whether the passenger had family onboard.
# Flag passengers who had at least one sibling, spouse, parent, or child aboard
titanic_clean$Family_Onboard <- with(titanic_clean, sibsp + parch > 0)
# Logistic regression model with Family Onboard as a predictor
family_survival_model <- glm(survived ~ Family_Onboard, family = binomial(link = "logit"), data = titanic_clean)
# Summary of the model to check the significance of having family onboard
summary(family_survival_model)
##
## Call:
## glm(formula = survived ~ Family_Onboard, family = binomial(link = "logit"),
## data = titanic_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.83527 0.07745 -10.784 < 2e-16 ***
## Family_OnboardTRUE 0.84683 0.11707 7.233 4.71e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1741 on 1308 degrees of freedom
## Residual deviance: 1688 on 1307 degrees of freedom
## AIC: 1692
##
## Number of Fisher Scoring iterations: 4
# Create a summarized dataset for plotting
survival_by_family <- titanic_clean %>%
group_by(Family_Onboard) %>%
summarise(Survival_Rate = mean(as.numeric(survived)))
# Point plot for Survival Rate by Family Onboard
ggplot(survival_by_family, aes(x = factor(Family_Onboard), y = Survival_Rate, color = factor(Family_Onboard))) +
geom_point(size = 5) +
scale_color_manual(values = c("red", "green"), labels = c("Alone", "With Family")) +
theme_minimal() +
labs(title = "Survival Rate by Family Onboard",
x = "Family Onboard",
y = "Survival Rate",
color = "Family Onboard") +
theme(legend.position = "bottom")
Based on the output from the logistic regression model, here are the conclusions we can draw regarding the impact of having family onboard on the chances of survival:
Logistic Regression Conclusion:
- The intercept, the log odds of survival for passengers without family onboard, is negative (-0.83527) and statistically significant, which corresponds to a survival probability of about 30% for passengers travelling alone.
- The coefficient for Family_OnboardTRUE is 0.84683 and is statistically significant (p-value = 4.71e-13). This positive coefficient means that passengers with family onboard had higher odds of survival than those who were alone, by a factor of about exp(0.847) ≈ 2.3. Because the model has no other predictors, this estimate is not adjusted for sex, age, or class.
- The significance of the Family_Onboard variable implies that having family members onboard was associated with a higher likelihood of survival during the Titanic disaster.
Overall Conclusion:
- Passengers with family onboard survived at a significantly higher rate than those travelling alone.
- This finding may reflect social and behavioral dynamics during the evacuation, such as families being prioritized for rescue or family members being more motivated to seek and aid each other’s survival.
- The association is statistically strong, but the model does not control for sex, age, or passenger class, so part of it may reflect who tended to travel with family rather than an effect of family itself.
These conclusions suggest that during the Titanic disaster, social bonds, as indicated by the presence of family members, could have played a crucial role in survival outcomes.
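The coefficients of the family model can be translated into survival probabilities for the two groups. The sketch below uses the values printed in the model summary above, copied by hand; the variable names are chosen here for illustration.

```r
# Coefficients copied from the summary(family_survival_model) output above
intercept  <- -0.83527   # log odds of survival when travelling alone
family_eff <-  0.84683   # change in log odds when family is onboard

# plogis() is the inverse logit: it maps log odds to a probability
p_alone  <- plogis(intercept)
p_family <- plogis(intercept + family_eff)

# Odds ratio for having family onboard versus travelling alone
odds_ratio <- exp(family_eff)

round(c(p_alone = p_alone, p_family = p_family, odds_ratio = odds_ratio), 3)
```

Because the model has a single binary predictor, these fitted probabilities match the observed survival proportions in the two groups, which is what the point plot above displays.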
When creating visualizations, it’s essential to consider their limitations and the questions they leave unanswered. Here are some limitations of the visualizations above, followed by possible improvements:
Sample Bias: The dataset has 1,309 rows, fewer than the 2,224 passengers and crew mentioned in the introduction, so it does not cover everyone aboard, and the missing records could bias the visualizations and conclusions.
Missing Data: Age is missing for 263 of the 1,309 rows, and cabin also has missing values, which could affect the accuracy of the visualizations. The handling of these missing values (e.g., imputation or deletion) can significantly influence the results.
Overlooked Interactions: The visualizations may not fully capture interactions between variables. For example, the combined effect of class, sex, and age may be more nuanced than can be conveyed in a single plot.
Survivorship Bias: The visualizations cannot account for survivorship bias, where the data available is conditioned on the outcome of survival, potentially skewing the analysis.
Interactive Elements: Interactive visualizations that allow users to explore the data themselves (e.g., with tools like Shiny in R) can be more engaging and informative.
Multivariate Analysis: Including multivariate plots or adding facets to existing plots can help to show the relationship between more than two variables.
Data Imputation: Addressing missing data with advanced imputation techniques could provide a more accurate picture and should be clearly communicated in the visualization’s narrative.
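As an illustration of the multivariate suggestion, the sketch below computes survival rates by class and sex and shows them in a faceted bar chart. It uses toy data with hypothetical values; the column names follow the dataset.

```r
library(dplyr)
library(ggplot2)

# Toy data (hypothetical values)
toy <- data.frame(
  pclass   = c(1, 1, 1, 1, 3, 3, 3, 3),
  sex      = c("female", "female", "male", "male",
               "female", "female", "male", "male"),
  survived = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Survival rate and group size for each class and sex combination
rates <- toy %>%
  group_by(pclass, sex) %>%
  summarise(n = n(), SurvivalRate = mean(survived), .groups = "drop")

# One panel per sex, one bar per passenger class
p <- ggplot(rates, aes(x = factor(pclass), y = SurvivalRate, fill = sex)) +
  geom_col() +
  facet_wrap(~sex) +
  theme_minimal() +
  labs(title = "Survival Rate by Passenger Class and Sex (toy data)",
       x = "Passenger Class", y = "Survival Rate", fill = "Sex")

p
```

Faceting by a third variable keeps each panel simple while still letting the reader compare the class pattern across groups.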