#1 Download the data
install.packages(c("Amelia", "dplyr", "ggplot2", "stargazer"))
data.frame <- read.csv("train.csv", na.strings = "")
(c): Titanic is one of my favorite movies, and I have to admit that it sparked my fascination with the theme of love, even if the film's portrayal is romanticized and unrealistic. Beyond the love story, the Titanic disaster is surrounded by intriguing narratives and mysteries. For example, did some wealthy nobles really give up their chances to escape? Did men prioritize saving women and children? Did passengers in lower fare classes have a lower survival rate? Is there a relationship between sex and survival rate? These questions have always piqued my curiosity, which is why I find exploring this data both personally and intellectually engaging.
#2 Describe the structure of the data
#First, I want to create a missingness map to identify which variables have the most 'NA' values.
library(Amelia)
missmap(data.frame, col=c("black", "grey"))
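# The missingness map can be complemented with per-column counts of 'NA' values. The sketch below is only an illustration: 'toy' is a hypothetical data frame with made-up values standing in for the Kaggle file, and only base R functions are used.

```r
# Toy data frame (made-up values, hypothetical name) standing in for train.csv
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1),
  Age      = c(22, NA, 26, 35, NA),
  Cabin    = c(NA, "C85", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number and share of missing values in each column
na_count <- colSums(is.na(toy))
na_share <- round(colMeans(is.na(toy)), 2)
print(data.frame(na_count, na_share))
```

# Applied to the real data, the same two calls would show which columns lose the most rows under na.omit().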
# I’ve decided to drop the 'Cabin' column since it contains too many missing values. Additionally, I will drop 'PassengerId', 'Name', 'Embarked', and 'Ticket' because these variables are unlikely to affect survival. My next step is to use dplyr to select the remaining columns for analysis.
library(dplyr)
data.frame = select(data.frame, Survived, Pclass, Age, Sex, SibSp, Parch, Fare)
#Next, I want to drop the 'NA' values.
data.frame = na.omit(data.frame)
#Then, I want to check the structure of the data.
str(data.frame)
There are 7 variables:
Survived: Categorical
Pclass: Categorical
Age: Numerical
Sex: Categorical
SibSp: Numerical (number of siblings/spouses aboard)
Parch: Numerical (number of parents/children aboard)
Fare: Numerical
Number of observations: 714
#3 Data Collection
The Titanic dataset in Base R only contains 5 variables and 32 rows, which is insufficient for my analysis. As a result, I searched for Titanic-related data on Google and ultimately found a comprehensive dataset on Kaggle. This dataset is public and was used in a Machine Learning competition. Unfortunately, Kaggle does not disclose the source of the data or its sampling strategy. However, I speculate that the data was likely collected by a committee during the investigation following the shipwreck. It’s important to note that the data may not be entirely accurate, as there could be discrepancies in the reported numbers of passengers, survivors, and missing persons.
#4 Questions
For each question, the outcome variable is Survived (0 = No, 1 = Yes), and the explanatory variables are as follows:
Is the survival rate of females higher than that of males? Explanatory variable: sex (male, female)
Do third-class passengers have the lowest survival rate? Explanatory variable: pclass (1st, 2nd, 3rd class)
Do first-class passengers have the highest survival rate? Explanatory variable: pclass (1st, 2nd, 3rd class)
Which age group has the highest survival rate? Explanatory variable: age
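# The age-group question needs Age turned into categories first. A minimal sketch of one way to do that with cut() and aggregate() follows; 'toy' is a hypothetical data frame with made-up values, and the break points and labels are arbitrary assumptions, not groups defined anywhere in this analysis.

```r
# Toy data (made-up values); column names follow the Kaggle file
toy <- data.frame(
  Survived = c(1, 0, 1, 0, 0, 1, 0, 1),
  Age      = c(4, 9, 19, 25, 33, 41, 58, 70)
)

# Bin ages into left-closed intervals: [0,12), [12,18), [18,40), [40,60), [60,Inf)
toy$AgeGroup <- cut(toy$Age,
                    breaks = c(0, 12, 18, 40, 60, Inf),
                    labels = c("child", "teen", "adult", "middle-aged", "senior"),
                    right = FALSE)

# The mean of a 0/1 variable is the share of 1s, i.e. the survival rate.
# Groups with no passengers (here "teen") are dropped by aggregate().
rate_by_age <- aggregate(Survived ~ AgeGroup, data = toy, FUN = mean)
print(rate_by_age)
```

# With so few rows per group the toy rates mean nothing; on the real data the group sizes should be reported next to the rates.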
#5 Summarize the data
library(stargazer)
stargazer(data.frame, type = "text", title = "Data Summary")
library(ggplot2)
#Density distribution
ggplot(data.frame, aes(x = Age)) + geom_density(fill='lightblue') + labs(title = "Density distribution of age")
The density plot of passenger age is right-skewed. The highest density occurs around the early 20s, suggesting that a large share of the passengers were young adults. The right tail of the curve is longer, indicating that there were fewer older passengers. There also seems to be a small peak around ages 0–5, likely representing young children onboard.
#Scatter plot
ggplot(data.frame, aes(x = Fare, y = Age)) + geom_point()
#6 Research on the question
# passenger counts by survival status and sex
library(ggplot2)
ggplot(data.frame, aes(x = factor(Survived), fill = Sex)) + theme_bw() + geom_bar() + labs(x = "Survived (0 = No, 1 = Yes)", y = "passenger count", title = "Titanic survival counts by Sex")
Sandra L. Takis published an article titled "Titanic: A Statistical Exploration" in 1999, which explored questions similar to mine, such as “Was the survival rate related to passenger class?” and “Was the survival rate related to gender?” When creating a bar plot, Takis put the survival rate (percentage) on the Y-axis, which I find more intuitive than the raw passenger counts that I initially used.
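# A rate-based version of the bar plot could look like the sketch below. It is not Takis's code or figure; 'toy' is a hypothetical data frame with made-up values, and position = "fill" is just one ggplot2 option for showing proportions instead of counts.

```r
library(ggplot2)

# Toy data (made-up values); column names follow the Kaggle file
toy <- data.frame(
  Survived = c(1, 1, 1, 0, 0, 0, 0, 1),
  Sex      = c("female", "female", "female", "female",
               "male", "male", "male", "male")
)

# Survival rate per sex: the mean of a 0/1 variable within each group
rates <- tapply(toy$Survived, toy$Sex, mean)
print(rates)

# position = "fill" rescales every bar to height 1, so the y-axis shows proportions
p <- ggplot(toy, aes(x = Sex, fill = factor(Survived))) +
  theme_bw() +
  geom_bar(position = "fill") +
  labs(y = "proportion of passengers", fill = "Survived",
       title = "Survival rate by Sex (toy data)")
print(p)
```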
Regarding the relationship between gender and survival rate, my conclusion aligns with Takis’s: females had a higher survival rate. Additionally, Takis separated children from adult males and females, finding that children had a lower survival rate than females but a higher rate than males. Takis’s data also indicated a higher survival rate for first-class passengers. After reading Takis’s research, I realized that the proportion of women in each class may influence that class’s overall survival rate. For instance, while the data shows that first-class passengers had a higher survival rate, it’s difficult to determine whether this is partly due to a larger proportion of women in first class. In other words, the gender distribution within each class could be a confounding variable in the relationship between class and survival rate. This is an area I plan to focus on in my future research.
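# One way to probe this possible confounding is to compare survival rates within each class-by-sex cell, alongside the share of women in each class. The sketch below uses the built-in Titanic table from Base R rather than the Kaggle file, so the column names differ (Class, Sex, Age, Survived, Freq) and the crew appears as a fourth class; treat it as an illustration of the approach, not as results for this analysis.

```r
# Built-in contingency table from the datasets package (not the Kaggle file)
tab <- as.data.frame(Titanic)

# Collapse over Age: counts by class, sex, and survival
counts <- xtabs(Freq ~ Class + Sex + Survived, data = tab)

# Survival rate within each class-by-sex cell
rates <- prop.table(counts, margin = c(1, 2))[, , "Yes"]
print(round(rates, 2))

# Share of women within each class
female_share <- prop.table(margin.table(counts, c(1, 2)), margin = 1)[, "Female"]
print(round(female_share, 2))
```

# If the class differences persist within each sex, the gender mix alone cannot explain them. The same comparison could be run on the Kaggle columns Pclass, Sex, and Survived.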