# Load the packages
library(conflicted)
library(readxl)
library(here)
## here() starts at /Users/varad/Documents/Academics/Year 2024/DACSS 601
library(scales)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
library(ggplot2)
library(ggmosaic)
conflicts_prefer(dplyr::filter)
## [conflicted] Will prefer dplyr::filter over any other package.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, some groups of people seem to have been more likely to survive than others. First, the protocol during an emergency is to prioritize the safety of women and children. Second, there is a stark difference between the survival rates of First class (rich) and Third class (poor) passengers.
In this project, we walk through how to analyse and validate these claims. The R language [1] is a powerful tool for analysing and understanding the information present in data. We use RStudio [2] for the subsequent analysis and RPubs [3] to host the project. The document is structured as shown in the table of contents above.
titanic_data <- here("titanic3.xls") %>%
read_excel()
## Warning: Coercing text to numeric in M1306 / R1306C13: '328'
titanic_data
## # A tibble: 1,309 × 14
## pclass survived name sex age sibsp parch ticket fare cabin embarked
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 1 Allen, … fema… 29 0 0 24160 211. B5 S
## 2 1 1 Allison… male 0.917 1 2 113781 152. C22 … S
## 3 1 0 Allison… fema… 2 1 2 113781 152. C22 … S
## 4 1 0 Allison… male 30 1 2 113781 152. C22 … S
## 5 1 0 Allison… fema… 25 1 2 113781 152. C22 … S
## 6 1 1 Anderso… male 48 0 0 19952 26.6 E12 S
## 7 1 1 Andrews… fema… 63 1 0 13502 78.0 D7 S
## 8 1 0 Andrews… male 39 0 0 112050 0 A36 S
## 9 1 1 Appleto… fema… 53 2 0 11769 51.5 C101 S
## 10 1 0 Artagav… male 71 0 0 PC 17… 49.5 <NA> C
## # ℹ 1,299 more rows
## # ℹ 3 more variables: boat <chr>, body <dbl>, home.dest <chr>
The Titanic dataset contains information about passengers on the maiden voyage of the RMS Titanic, a British passenger liner that sank in the North Atlantic Ocean in April 1912 after hitting an iceberg on her way from Southampton to New York City. The dataset provides a window into the lives of those onboard, a diverse group of passengers from different socio-economic backgrounds, divided among three passenger classes.
pclass: Passenger class (1st, 2nd, 3rd) - This categorical variable divides passengers into three classes (1st, 2nd, and 3rd), reflecting the socio-economic stratification of the early 20th century.
survived: Survival status (1 = Yes, 0 = No) - A binary categorical variable indicating survival (1) or non-survival (0) of the passengers.
name: Name of the passenger - Textual data providing the names of the passengers, allowing for individual identification and historical research.
sex: Gender - A categorical variable recording the gender of passengers.
age: Age in years - A numerical variable detailing the age of each passenger, giving insights into the age distribution onboard.
sibsp: Number of siblings/spouses aboard - This numerical variable counts the number of siblings or spouses that a passenger had aboard the Titanic.
parch: Number of parents/children aboard - Similar to sibsp, this numerical variable tallies the number of parents or children a passenger had on the ship.
ticket: Ticket number - A combination of text and numeric data representing each passenger’s ticket number.
fare: Passenger fare - A numerical variable showing how much each passenger paid, potentially indicating their financial status.
cabin: Cabin number - Textual data providing the cabin number for passengers, which can be linked to their class and location on the ship.
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) - A categorical variable indicating the port where passengers boarded the Titanic, with codes for Cherbourg, Queenstown, and Southampton.
boat: Lifeboat (if survived) - For survivors, this text/numeric variable identifies the lifeboat they were on, giving insight into the rescue process.
body: Body identification number (if did not survive) - For those who did not survive, this numeric variable provides a body identification number, if available.
home.dest: Home/Destination - Textual data about the passengers’ home or intended destination, offering a glimpse into their personal journeys and backgrounds.
The data used here comes from the Kaggle challenge for predicting the survival of a passenger from the complete Titanic dataset [4]. The dataset was compiled from passenger information available on Wikipedia [5], which in turn drew on two sources: Encyclopedia [6] and an article published in the New York Times on 19 April 1912 that listed the survivors of the Titanic tragedy. One thing we noticed is that a total of 2240 passengers boarded the Titanic, with only 705 surviving, whereas only 1,309 entries are recorded above. This is because the known passenger list contains roughly 1,300 names; the remaining passengers are either unknown or survivors who declined to make their information public. As a result, the ratio of survivors to non-survivors in this dataset is noticeably skewed relative to the actual ratio.
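To make the skew concrete, the following minimal sketch compares the two survival rates. It only uses figures quoted in this document: the 2240 boarded and 705 survived mentioned above, and the 1,309 rows and 500 survivors that the summary statistics report for this dataset. The variable names are illustrative.

```r
# Figures quoted in the text (overall voyage)
boarded_total  <- 2240
survived_total <- 705

# Figures reported for this dataset (row count and SurvivedCount)
dataset_rows      <- 1309
dataset_survivors <- 500

actual_rate  <- survived_total / boarded_total
dataset_rate <- dataset_survivors / dataset_rows

# Survival rate overall vs. in the dataset, rounded for display
round(c(actual = actual_rate, dataset = dataset_rate), 3)
```

Under these figures, roughly 31% of those who boarded survived, while about 38% of the passengers recorded in the dataset did.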
Regardless, the study still remains relevant with a lot of interesting questions that can be answered. We next move on to cleaning the dataset.
# Cleaning the data
titanic_clean <- titanic_data %>%
# Convert factors to characters
mutate_if(is.factor, as.character)
# Handle missing values (example: fill NA in 'age' with median)
# mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age))
# Removed after incorporating feedback from the first assignment.
# Moving ahead, we can either identify the ticket type and then take the age mean using the mean of that ticket type,
# or, we can simply remove the NAs.
titanic_clean
## # A tibble: 1,309 × 14
## pclass survived name sex age sibsp parch ticket fare cabin embarked
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 1 Allen, … fema… 29 0 0 24160 211. B5 S
## 2 1 1 Allison… male 0.917 1 2 113781 152. C22 … S
## 3 1 0 Allison… fema… 2 1 2 113781 152. C22 … S
## 4 1 0 Allison… male 30 1 2 113781 152. C22 … S
## 5 1 0 Allison… fema… 25 1 2 113781 152. C22 … S
## 6 1 1 Anderso… male 48 0 0 19952 26.6 E12 S
## 7 1 1 Andrews… fema… 63 1 0 13502 78.0 D7 S
## 8 1 0 Andrews… male 39 0 0 112050 0 A36 S
## 9 1 1 Appleto… fema… 53 2 0 11769 51.5 C101 S
## 10 1 0 Artagav… male 71 0 0 PC 17… 49.5 <NA> C
## # ℹ 1,299 more rows
## # ℹ 3 more variables: boat <chr>, body <dbl>, home.dest <chr>
str(titanic_clean)
## tibble [1,309 × 14] (S3: tbl_df/tbl/data.frame)
## $ pclass : num [1:1309] 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : num [1:1309] 1 1 0 0 0 1 1 0 1 0 ...
## $ name : chr [1:1309] "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
## $ sex : chr [1:1309] "female" "male" "female" "male" ...
## $ age : num [1:1309] 29 0.917 2 30 25 ...
## $ sibsp : num [1:1309] 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : num [1:1309] 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : chr [1:1309] "24160" "113781" "113781" "113781" ...
## $ fare : num [1:1309] 211 152 152 152 152 ...
## $ cabin : chr [1:1309] "B5" "C22 C26" "C22 C26" "C22 C26" ...
## $ embarked : chr [1:1309] "S" "S" "S" "S" ...
## $ boat : chr [1:1309] "2" "11" NA NA ...
## $ body : num [1:1309] NA NA NA 135 NA NA NA NA NA 22 ...
## $ home.dest: chr [1:1309] "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
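The comments in the cleaning step mention two ways of handling missing ages: filling them from a group-level statistic, or removing the rows. The sketch below shows both on a small made-up data frame. It uses `pclass` as a stand-in for the grouping variable (the comments mention ticket type, which would first need to be derived), and the median rather than the mean; both choices are assumptions for illustration only.

```r
library(dplyr)

# Toy data standing in for titanic_clean (values are made up)
toy <- data.frame(
  pclass = c(1, 1, 1, 3, 3, 3),
  age    = c(40, NA, 50, 20, 30, NA)
)

# Option 1: fill each missing age with the median age of the same class
toy_imputed <- toy %>%
  group_by(pclass) %>%
  mutate(age = if_else(is.na(age), median(age, na.rm = TRUE), age)) %>%
  ungroup()

# Option 2: drop the rows with a missing age
toy_dropped <- toy %>%
  dplyr::filter(!is.na(age))

toy_imputed
toy_dropped
```

Group-wise imputation keeps every row but narrows the age distribution within each group, while dropping rows keeps the observed distribution at the cost of sample size.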
# statistics
summary_statistics <- titanic_clean %>%
summarise(
MeanAge = mean(age, na.rm = TRUE),
MedianAge = median(age, na.rm = TRUE),
SdAge = sd(age, na.rm = TRUE),
SurvivedCount = sum(survived == 1, na.rm = TRUE),
NotSurvivedCount = sum(survived == 0, na.rm = TRUE),
)
summary_statistics
## # A tibble: 1 × 5
## MeanAge MedianAge SdAge SurvivedCount NotSurvivedCount
## <dbl> <dbl> <dbl> <int> <int>
## 1 29.9 28 14.4 500 809
# Survival rates by passenger class
titanic_clean %>%
group_by(pclass) %>%
summarise(SurvivalRate = mean(survived == 1)) %>%
ggplot(aes(x = factor(pclass), y = SurvivalRate)) +
geom_bar(stat = "identity", fill = "skyblue") + # Set all bars to the same color
labs(title = "Survival Rates by Passenger Class", x = "Passenger Class", y = "Survival Rate") +
scale_y_continuous(labels = label_percent(), limits = c(0, 0.85)) +
theme_minimal()
The graph shows that the highest survival rate is associated with the 1st passenger class, followed by the 2nd and then the 3rd class, which has the lowest survival rate. This suggests an association between passenger class and survival likelihood: passengers in the more expensive classes (1st being the highest class, despite having the lowest number) had a better chance of survival.
# Age distribution among survivors and non-survivors
titanic_clean %>%
ggplot(aes(x = age, fill = as.factor(survived))) +
geom_histogram(bins = 30, position = "identity", alpha = 0.6) +
labs(title = "Age Distribution of Survivors and Non-Survivors", x = "Age", y = "Count") +
scale_fill_brewer(palette = "Set1", name = "Survived", labels = c("No", "Yes"))
## Warning: Removed 263 rows containing non-finite values (`stat_bin()`).
The plot overlays the counts of passengers who survived and did not survive across ages. The largest number of survivors appears to fall within the younger age groups, indicating that age may have been a factor in survival. There is also a noticeable peak in non-survivors in the adult age range, suggesting higher mortality among middle-aged passengers. Statistical testing is needed before drawing any further conclusions.
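One possible form of such a test is a Wilcoxon rank-sum test comparing the ages of survivors and non-survivors. The sketch below runs it on simulated data so that it is self-contained; the simulated ages are made up, and with the real data the same call would take `titanic_clean` in place of `toy`. This is one option among several (a t-test or a logistic regression would be alternatives), not necessarily the approach used later in this project.

```r
set.seed(1)

# Simulated stand-in for the real data (ages are made up for illustration)
toy <- data.frame(
  survived = factor(rep(c(0, 1), each = 50)),
  age      = c(runif(50, min = 1, max = 70), runif(50, min = 1, max = 65))
)

# Rank-based comparison of the age distributions of the two groups;
# rows with a missing age would be dropped by the formula interface
age_test <- wilcox.test(age ~ survived, data = toy)

age_test$p.value
```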
# Gender proportion amongst survivors
titanic_clean %>%
group_by(sex) %>%
summarise(SurvivalRate = mean(survived == 1)) %>%
ggplot(aes(x = sex, y = SurvivalRate)) +
geom_bar(stat = "identity", fill = "coral") +
labs(title = "Survival Rates by Gender", x = "Gender", y = "Survival Rate") +
theme_minimal()
The bar graph depicts a significantly higher survival rate for females compared to males. This suggests that gender was a major factor in survival, with females being more likely to survive than males.
# Fare distribution
titanic_clean %>%
ggplot(aes(x = fare)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Distribution of Fares", x = "Fare", y = "Count")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
The histogram illustrates that the majority of fares are clustered at the lower end of the fare range, with a sharp decline as fares increase. This indicates that lower fares were more common, and very high fares were quite rare.
Question: How does the fare distribution vary among different embarkation ports?
This can reveal if passengers embarking from certain ports tended to pay more or less, possibly indicating economic disparities based on embarkation points.
if("embarked" %in% names(titanic_clean)) {
# Check initial state of 'embarked'
print(paste("Initial rows:", nrow(titanic_clean)))
print(table(titanic_clean$embarked))
# Filter out rows where 'embarked' is NA
titanic_clean <- titanic_clean %>%
filter(!is.na(embarked))
# Confirm state after filtering
print(paste("Rows after NA removal:", nrow(titanic_clean)))
if(nrow(titanic_clean) > 0) {
# Summarize data again
summary_count <- titanic_clean %>%
group_by(embarked) %>%
summarise(Count = n())
# Check summary_count
if(nrow(summary_count) == 0) {
print("No data in 'embarked' after NA removal.")
} else {
print(summary_count)
# Relabel and plot if data is appropriate
titanic_clean$embarked <- factor(titanic_clean$embarked,
levels = c("C", "Q", "S"),
labels = c("Cherbourg", "Queenstown", "Southampton"))
ggplot(titanic_clean, aes(x = fare, fill = embarked)) +
geom_histogram(binwidth = 10, color = "black", alpha = 0.7) +
facet_wrap(~embarked, scales = 'fixed') +
theme_minimal() +
labs(title = "Fare Distribution by Embarkation Port", x = "Fare", y = "Count", fill = "Embarkation Port")
}
} else {
print("The dataset is empty after removing NA values.")
}
} else {
print("Error: 'Embarked' column not found in the dataset.")
}
## [1] "Initial rows: 1309"
##
## C Q S
## 270 123 914
## [1] "Rows after NA removal: 1307"
## # A tibble: 3 × 2
## embarked Count
## <chr> <int>
## 1 C 270
## 2 Q 123
## 3 S 914
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
The majority of passengers embarking from port S (Southampton) paid lower fares compared to those from port C (Cherbourg). Port Q (Queenstown) shows a very tight range of fares, predominantly on the lower end, suggesting that passengers from this port were likely to be in the lower passenger classes. The presence of some higher fares at port C indicates that this port had a relatively higher proportion of first or second-class passengers.
Question: How is the age of passengers distributed across different passenger classes?
This can show if certain age groups were more prevalent in a particular class, providing insights into the socio-economic demographics of the passengers.
titanic_clean$AgeGroup <- cut(titanic_clean$age, breaks = c(0, 12, 18, 60, Inf), labels = c("Child", "Teenager", "Adult", "Senior"))
ggplot(titanic_clean, aes(x = factor(pclass), fill = AgeGroup)) +
geom_bar(position = "dodge") +
theme_minimal() +
labs(title = "Passenger Count by Class and Age Group", x = "Passenger Class", y = "Count", fill = "Age Group") # geom_bar() plots counts here, not proportions
The visualization suggests that the first-class passengers were predominantly adults, with a relatively small proportion of children, teenagers, and seniors. The second and third classes show a more diverse age distribution, with the third class having a higher proportion of children and teenagers. This could reflect economic factors, where families and younger individuals were more likely to travel in the lower classes.
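Because the dodged bar chart shows counts, statements about proportions are easier to check with an explicit within-class calculation. The sketch below computes the share of each age group inside each class on a small made-up data frame; the column names mirror those used above, but the values are illustrative only.

```r
library(dplyr)

# Toy data standing in for titanic_clean (values are made up)
toy <- data.frame(
  pclass   = c(1, 1, 1, 1, 3, 3, 3, 3),
  AgeGroup = c("Adult", "Adult", "Adult", "Child",
               "Adult", "Adult", "Child", "Child")
)

# Count passengers per class and age group, then divide by the class total
age_props <- toy %>%
  count(pclass, AgeGroup) %>%
  group_by(pclass) %>%
  mutate(Proportion = n / sum(n)) %>%
  ungroup()

age_props
```

In a plot, `geom_bar(position = "fill")` gives the same within-class shares as stacked bars that each sum to 1.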
Question: Does the cabin location (deck) have any correlation with survival rates?
This can highlight if certain decks were more prone to having survivors or casualties, possibly due to their location on the ship.
titanic_clean$deck <- substr(titanic_clean$cabin, 1, 1)
titanic_cabin <- titanic_clean[!is.na(titanic_clean$deck) & !is.na(titanic_clean$survived), ]
deck_survived_table <- table(titanic_cabin$deck, titanic_cabin$survived)
mosaicplot(
deck_survived_table,
main = "Mosaic Plot of Survival Status Across Decks",
xlab = "Deck",
ylab = "Survival Status",
color = TRUE
)
The plot illustrates that survival rates varied noticeably across deck locations. For instance, decks B, C, D, and E show a larger proportion of survivors (indicated by the size of the blocks), while decks A and G, as well as the deck denoted by T, had a relatively smaller proportion of survivors. This might be due to the location of these decks on the ship and their accessibility to lifeboats.
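The block sizes in a mosaic plot can be hard to compare by eye, so it can help to print the row-wise proportions of the same contingency table. The sketch below does this with `prop.table()` on a small made-up table; with the real data, `deck_survived_table` would take the place of `deck_tab`.

```r
# Toy deck and survival vectors (values are made up for illustration)
toy_deck <- c("A", "A", "B", "B", "B", "C", "C", "C", "C")
toy_surv <- c(0, 1, 1, 1, 0, 1, 0, 1, 1)

deck_tab <- table(Deck = toy_deck, Survived = toy_surv)

# margin = 1 normalises each row, giving the survival share within each deck
deck_rates <- prop.table(deck_tab, margin = 1)

round(deck_rates, 2)
```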
From these visualizations, we can infer that socio-economic status (indicated by fare and passenger class), age distribution, and cabin location (deck) played roles in the survival patterns observed on the Titanic. These factors are intertwined with the historical context of the event, where first-class passengers had better access to emergency resources, and certain decks were either more or less advantageous for evacuation.
Lastly, let’s examine how the average fare and survival rate vary by passenger class (pclass) and embarkation port (embarked).
# Grouping by passenger class and embarkation port to find average fare and survival rate
grouped_stats <- titanic_clean %>%
group_by(pclass, embarked) %>%
summarise(
Average_Fare = mean(fare, na.rm = TRUE),
Survival_Rate = mean(as.numeric(survived), na.rm = TRUE)
)
## `summarise()` has grouped output by 'pclass'. You can override using the
## `.groups` argument.
# View the results
grouped_stats
## # A tibble: 9 × 4
## # Groups: pclass [3]
## pclass embarked Average_Fare Survival_Rate
## <dbl> <fct> <dbl> <dbl>
## 1 1 Cherbourg 107. 0.688
## 2 1 Queenstown 90 0.667
## 3 1 Southampton 72.1 0.559
## 4 2 Cherbourg 23.3 0.571
## 5 2 Queenstown 11.7 0.286
## 6 2 Southampton 21.2 0.417
## 7 3 Cherbourg 11.0 0.366
## 8 3 Queenstown 10.4 0.354
## 9 3 Southampton 14.4 0.210
ggplot(grouped_stats, aes(x = factor(pclass), y = Average_Fare, fill = embarked)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.7), width = 0.6) + # Use dodged bars
theme_minimal() +
scale_y_continuous(limits = c(0, max(grouped_stats$Average_Fare) * 1.1)) + # Rescale y-axis, extending it slightly above the highest average fare
labs(title = "Average Fare by Passenger Class and Embarkation Port",
x = "Passenger Class",
y = "Average Fare",
fill = "Embarkation Port")
# Bubble plot with color gradient based on survival rate, without size mapping
ggplot(grouped_stats, aes(x = factor(pclass), y = factor(embarked), color = Survival_Rate)) +
geom_point(alpha = 0.6, size = 4) + # Use a fixed size for all points, adjust transparency
scale_color_gradient(low = "blue", high = "red") + # Color gradient from blue (low) to red (high)
theme_minimal() +
labs(title = "Survival Rate by Passenger Class and Embarkation Port",
x = "Passenger Class",
y = "Embarkation Port",
color = "Survival Rate")
The two passengers with NA for embarkation were removed in the fare-by-port step above (1,309 rows became 1,307), so they do not appear in this table. Before removal, this group had an average fare of $80.00 and a survival rate of 100%. This could be due to the small sample size or missing data, and it is an outlier compared to the other ports.
General Observations:
- The average fare decreases as we move from first-class to third-class passengers, as expected.
- First-class passengers generally had higher survival rates than second and third-class passengers, which is consistent with historical accounts suggesting that higher-class passengers had better access to lifeboats.
- Among all groups, Cherbourg’s first-class passengers paid the highest average fare and also had the highest survival rate, which might suggest a correlation between fare and survival.
- The perfect survival rate for passengers with NA embarkation data is an anomaly and should be investigated further; if those rows were kept, they could skew the comparison across ports.
These conclusions suggest that both passenger class and point of embarkation were influential factors in survival on the Titanic, with higher fares (likely linked to higher classes) correlating with higher survival rates. However, the embarkation point also played a role independently of class, as seen by varying survival rates within classes for different embarkation points.
Let’s answer some of the remaining research questions.
While the raw data visualizations such as the survival rates by gender barplot and age distribution by survivors histogram provide clear insights, we utilize logistic regression and a Chi-Squared test for a more rigorous statistical analysis. These methods allow us to quantify the strength of these relationships and test the significance of our observations.
The logistic regression model helps in understanding how age and sex together impact the odds of survival. It provides a way to control for one variable while assessing the effect of another.
# Logistic regression model with Age and Sex as predictors for Survival
survival_model <- glm(survived ~ age + sex, family = binomial(link = "logit"), data = titanic_clean)
# Summary of the model to check the significance of predictors
summary(survival_model)
##
## Call:
## glm(formula = survived ~ age + sex, family = binomial(link = "logit"),
## data = titanic_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.237032 0.192077 6.440 1.19e-10 ***
## age -0.004563 0.005221 -0.874 0.382
## sexmale -2.453018 0.152409 -16.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1411.0 on 1043 degrees of freedom
## Residual deviance: 1100.1 on 1041 degrees of freedom
## (263 observations deleted due to missingness)
## AIC: 1106.1
##
## Number of Fisher Scoring iterations: 4
While logistic regression gives us an insight into the relationship between survival and our predictors, the Chi-Squared test helps us understand if the observed association between sex and survival is statistically significant.
# Create a contingency table of survival by sex
table_sex_survival <- table(titanic_clean$survived, titanic_clean$sex)
# Perform the Chi-Squared test
chisq.test(table_sex_survival)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_sex_survival
## X-squared = 361.36, df = 1, p-value < 2.2e-16
Based on the output of the logistic regression analysis and the Chi-squared test, we can draw the following conclusions:
Logistic Regression Conclusion:
- The coefficient for age is -0.004563, and it is not statistically significant (p-value = 0.382), which implies that age alone may not have a strong effect on the likelihood of survival when controlling for sex.
- The coefficient for sexmale is -2.453018, which is statistically significant (p-value < 2e-16). This indicates that being male significantly decreases the odds of survival on the Titanic when controlling for age. The negative sign means that the odds of surviving are lower for males than for females, holding age constant.
- The intercept (1.237032), which represents the log odds of survival for the baseline group (a female passenger at age 0), is positive and statistically significant, meaning that the predicted survival probability for that baseline group is above 50%.
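Log-odds coefficients are easier to interpret once converted to odds ratios and predicted probabilities. The sketch below does this by hand using the estimates printed in the model summary above; the age of 30 is an arbitrary value chosen for illustration, and the variable names are hypothetical.

```r
# Coefficient estimates copied from summary(survival_model) above
b_intercept <- 1.237032
b_age       <- -0.004563
b_male      <- -2.453018

# Odds ratio for males relative to females, holding age constant
or_male <- exp(b_male)

# Predicted survival probabilities at an illustrative age of 30;
# plogis() is the inverse logit, 1 / (1 + exp(-x))
p_female_30 <- plogis(b_intercept + b_age * 30)
p_male_30   <- plogis(b_intercept + b_age * 30 + b_male)

round(c(or_male = or_male, p_female_30 = p_female_30, p_male_30 = p_male_30), 3)
```

Under this model, the odds of survival for a male passenger are roughly 0.09 times those of a female passenger of the same age, which corresponds to predicted probabilities of about 75% versus 21% at age 30.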
Chi-Squared Test Conclusion:
- The result of the Chi-squared test (X-squared = 361.36, df = 1, p-value < 2.2e-16) confirms a statistically significant association between sex and survival, reinforcing the logistic regression findings.
Overall Conclusion:
- Sex appears to be a significant predictor of survival on the Titanic, with females having much higher odds of surviving than males.
- Age, as a continuous variable, does not seem to have a significant impact on survival when analyzed in conjunction with sex.
- The relationship between age and survival might be more complex and could potentially be better understood by considering age groups or including interaction terms in the model.
- From the provided statistical analyses, we can conclude that sex was a significant factor in survival aboard the Titanic, while the effect of age was not as clear-cut.
These analytical approaches highlight the importance of statistical testing in confirming and quantifying observations, offering a more nuanced understanding of the factors influencing survival on the Titanic.
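As a sketch of the interaction-term idea mentioned in the conclusions, the formula `survived ~ age * sex` fits separate age slopes for each sex. The example below runs on simulated data so that it is self-contained; the simulated effect sizes are made up, and with the real data `titanic_clean` would replace `toy`. It illustrates the model form only and is not a result from this project.

```r
set.seed(42)
n <- 400

# Simulated stand-in for the real data (all values are made up)
toy <- data.frame(
  age = runif(n, min = 1, max = 70),
  sex = sample(c("female", "male"), n, replace = TRUE)
)

# Made-up survival mechanism in which age matters only for males
log_odds <- ifelse(toy$sex == "male", 0.5 - 0.05 * toy$age, 1.0)
toy$survived <- rbinom(n, size = 1, prob = plogis(log_odds))

# age * sex expands to age + sex + age:sex
interaction_model <- glm(survived ~ age * sex,
                         family = binomial(link = "logit"),
                         data = toy)

coef(summary(interaction_model))
```

The `age:sexmale` row estimates how much the age slope for males differs from the age slope for females.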
We categorize passengers into two groups: those with family onboard and those without. We then compare their survival rates.
titanic_clean$Family_Onboard <- with(titanic_clean, sibsp + parch > 0)
# Create a summarized dataset for plotting
survival_by_family <- titanic_clean %>%
group_by(Family_Onboard) %>%
summarise(Survival_Rate = mean(as.numeric(survived)))
# Convert Family_Onboard to a more descriptive factor
survival_by_family$Family_Onboard <- factor(survival_by_family$Family_Onboard, labels = c("Alone", "With Family"))
# Bar plot for Survival Rate by Family Onboard
ggplot(survival_by_family, aes(x = Family_Onboard, y = Survival_Rate, fill = Family_Onboard)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("violet", "violet")) +
theme_minimal() +
labs(title = "Survival Rate by Family Onboard",
x = "Family Onboard",
y = "Survival Rate") +
theme(legend.position = "none") +
ylim(0, 0.6) # Set y-axis to start at 0
Having observed the survival trends visually, we now turn to logistic regression to assess the impact quantitatively.
# Logistic regression model with Family Onboard as a predictor for Survival
family_survival_model <- glm(survived ~ Family_Onboard, family = binomial(link = "logit"), data = titanic_clean)
# Summary of the model to check the significance of having family onboard
summary(family_survival_model)
##
## Call:
## glm(formula = survived ~ Family_Onboard, family = binomial(link = "logit"),
## data = titanic_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.84367 0.07768 -10.861 < 2e-16 ***
## Family_OnboardTRUE 0.85524 0.11722 7.296 2.97e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1737.2 on 1306 degrees of freedom
## Residual deviance: 1683.2 on 1305 degrees of freedom
## AIC: 1687.2
##
## Number of Fisher Scoring iterations: 4
Based on the output from the logistic regression model, here are the conclusions we can draw regarding the impact of having family onboard on the chances of survival:
Logistic Regression Conclusion:
- The intercept (representing the log odds of survival for passengers without family onboard) is significant and negative (-0.84367), indicating that passengers who were alone had a survival probability below 50%.
- The coefficient for Family_OnboardTRUE is 0.85524 and is statistically significant (p-value = 2.97e-13). This positive coefficient suggests that passengers with family onboard had higher odds of survival than those who were alone.
- The significance of the Family_Onboard variable implies that having family members onboard was associated with a higher likelihood of survival during the Titanic disaster.
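The coefficients can be translated into survival probabilities and an odds ratio by hand. The sketch below uses the estimates printed in the model summary above; the variable names are hypothetical.

```r
# Coefficient estimates copied from summary(family_survival_model) above
b_intercept <- -0.84367
b_family    <- 0.85524

# plogis() is the inverse logit, 1 / (1 + exp(-x))
p_alone  <- plogis(b_intercept)             # passengers travelling alone
p_family <- plogis(b_intercept + b_family)  # passengers with family onboard

# Odds of survival with family relative to travelling alone
or_family <- exp(b_family)

round(c(p_alone = p_alone, p_family = p_family, or_family = or_family), 3)
```

With a single binary predictor, these fitted probabilities equal the group survival rates shown in the bar plot: roughly 30% for passengers travelling alone and 50% for those with family, an odds ratio of about 2.35.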
Overall Conclusion:
- Having family onboard was associated with significantly higher chances of survival on the Titanic.
- This finding may reflect social and behavioral dynamics during the evacuation, such as families being prioritized for rescue or family members being more motivated to seek out and help each other.
- The association is highly statistically significant, but this model does not control for other factors such as sex or passenger class, so it should not be read as a causal effect.
These conclusions suggest that during the Titanic disaster, social bonds, as indicated by the presence of family members, could have played a crucial role in survival outcomes.
In concluding the analysis of the Titanic dataset, several key insights emerge, alongside areas that warrant further exploration and the limitations inherent in the dataset. Here’s a summary:
The project provided valuable insights into the survival patterns of the Titanic disaster, emphasizing the role of socio-economic status, gender, and family ties. However, the limitations of the dataset and the need for further research to explore more complex interactions and address biases are evident. Future research could build upon these findings, incorporating more sophisticated statistical techniques and potentially additional data sources to gain a deeper understanding of the factors that influenced survival on the Titanic.