The Titanic was a large transatlantic passenger ship that departed from Southampton on April 10, 1912, and sank in the North Atlantic on April 15 after striking an iceberg. Of the 2,224 passengers and crew on board, about 1,500 lost their lives, making it one of the most tragic maritime disasters in history.
Titanic Dataset
In data science, the Titanic passenger data has become a classic introductory dataset. It contains information such as age, gender, passenger class, fare, and survival status. Because these variables are easy to understand and relate to a real-world event, the dataset is widely used to teach fundamental concepts in data analysis and classification.
The dataset gained particular popularity through Kaggle’s early competition “Titanic: Machine Learning from Disaster,” where participants build predictive models to estimate whether a passenger survived. Today it remains a common benchmark for learning data preprocessing, exploratory analysis, and basic machine learning techniques.
Variable Descriptions
| Variable | Description |
|---|---|
| Survived | Survival indicator (0 = did not survive, 1 = survived). |
| SiblingsSpouses | Number of siblings or spouses traveling with the passenger. |
| ParentsChildren | Number of parents or children traveling with the passenger. |
| Fare | Ticket fare paid by the passenger. |
Using the provided titanic.csv file, answer the 10 questions below.
For each question, write your R code immediately under the question inside an R code chunk ({r}), and make sure that your code runs without errors and produces the requested output.
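library(readr)
library(tidyverse)
library(gt)  # provides gt(), tab_header(), and fmt_number() used in the tables below
titanic <- read_csv("titanic.csv")  # load the provided data file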
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ purrr 1.1.0
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 887 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Sex
dbl (6): Survived, Pclass, Age, SiblingsSpouses, ParentsChildren, Fare
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Question 1: Compute the summary statistics (mean, median, min, max, sd) of the Age variable in the dataset.
titanic %>%
  summarise(
    mean_Age = mean(Age, na.rm = TRUE),
    median_Age = median(Age, na.rm = TRUE),
    min_Age = min(Age, na.rm = TRUE),
    max_Age = max(Age, na.rm = TRUE),
    sd_Age = sd(Age, na.rm = TRUE)
  ) %>%
  gt() %>%
  tab_header(
    title = "Summary of Age Dataset",
    subtitle = "Source: Titanic Dataset"
  )
Summary of Age Dataset
Source: Titanic Dataset

| mean_Age | median_Age | min_Age | max_Age | sd_Age |
|---|---|---|---|---|
| 29.47144 | 28 | 0.42 | 80 | 14.12191 |
Question 2: Group the data by Sex and calculate the survival rate for each group.
titanic %>%
  group_by(Sex) %>%
  summarise(survival_rate = mean(Survived)) %>%
  gt() %>%
  tab_header(
    title = "Survival Rate by Sex",
    subtitle = "Source: Titanic Dataset"
  ) %>%
  fmt_number(columns = survival_rate, decimals = 2)
Survival Rate by Sex
Source: Titanic Dataset

| Sex | survival_rate |
|---|---|
| female | 0.74 |
| male | 0.19 |
Question 3: Compute the average age for each Pclass, and arrange the table in ascending order of average age.
titanic %>%
  group_by(Pclass) %>%
  summarise(average_age = mean(Age, na.rm = TRUE)) %>%
  arrange(average_age) %>%
  gt() %>%
  tab_header(
    title = "Average Age for Pclass",
    subtitle = "Source: Titanic Dataset"
  ) %>%
  fmt_number(columns = average_age, decimals = 2)
Average Age for Pclass
Source: Titanic Dataset

| Pclass | average_age |
|---|---|
| 3 | 25.19 |
| 2 | 29.87 |
| 1 | 38.79 |
Question 4: Draw three histograms using facet_wrap or facet_grid to show the distribution of fares across the passenger classes (i.e., adjust the bin size or binwidth and the x breaks accordingly).
titanic %>%
  ggplot(aes(x = Fare)) +
  geom_histogram(binwidth = 25, color = "white", boundary = 0) +
  facet_wrap(~Pclass) +
  labs(
    title = "Distribution of fares between different passenger classes",
    subtitle = "Titanic Dataset",
    x = "Fare",
    y = "Count"  # a histogram's y-axis shows the number of passengers per bin, not a price
  )
Question 5: Show the same distribution with a boxplot (without faceting).
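titanic %>%
  # Pclass is numeric in the data, so convert it to a factor for both x and fill
  # to get one discrete color per class instead of a continuous gradient.
  ggplot(aes(x = factor(Pclass), y = Fare, fill = factor(Pclass))) +
  geom_boxplot() +
  labs(
    title = "Distribution of fares between different passenger classes",
    subtitle = "Titanic Dataset",
    x = "Passenger Class",
    y = "Fare",
    fill = "Passenger Class"
  )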
Question 6: Filter the dataset to include only survivors and identify both the youngest and the oldest surviving passenger.
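One possible way to approach this question is sketched below. It is only an illustration under stated assumptions: the helper name `survivor_age_extremes` and the `extreme` column are hypothetical, the toy rows are made-up values used just to show the shape of the result, and ties on age are broken by keeping a single row. It assumes the data has the `Survived` and `Age` columns listed in the column specification above.

```r
library(dplyr)

# Hypothetical helper (an assumption, not part of the original report):
# keep survivors with a known age, then return the youngest and oldest rows.
survivor_age_extremes <- function(df) {
  survivors <- df %>%
    filter(Survived == 1, !is.na(Age))

  bind_rows(
    survivors %>%
      slice_min(Age, n = 1, with_ties = FALSE) %>%
      mutate(extreme = "youngest"),
    survivors %>%
      slice_max(Age, n = 1, with_ties = FALSE) %>%
      mutate(extreme = "oldest")
  )
}

# Toy rows with made-up values, only to demonstrate the helper.
toy <- tibble(
  Survived = c(1, 0, 1, 1),
  Sex = c("female", "male", "male", "female"),
  Age = c(30, 70, 2, 55)
)

result <- survivor_age_extremes(toy)
print(result)
```

With the real data loaded as `titanic`, the same helper could be called as `survivor_age_extremes(titanic)`. Setting `with_ties = TRUE` instead would keep every passenger who shares the minimum or maximum age.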
labs(
  title = "Age Vs Fare",
  subtitle = "Titanic Dataset",
  x = "Age",
  y = "Fare"
)
<ggplot2::labels> List of 4
$ x : chr "Age"
$ y : chr "Fare"
$ title : chr "Age Vs Fare"
$ subtitle: chr "Titanic Dataset"
Question 8: Create a new variable FamilySize = SiblingsSpouses + ParentsChildren, and create a boxplot showing FamilySize for each Pclass.
titanic %>%
  # Add 1 so that FamilySize also counts the passenger themselves; without it,
  # a passenger traveling alone would have a family size of 0.
  mutate(FamilySize = SiblingsSpouses + ParentsChildren + 1) %>%
  ggplot(aes(x = as.factor(Pclass), y = FamilySize)) +
  geom_boxplot() +
  labs(
    title = "FamilySize & Pclass",
    subtitle = "Titanic Dataset",
    x = "Pclass",
    y = "FamilySize"
  )
Question 9: Calculate mean and median family size for each Pclass
titanic %>%
  # Add 1 so that FamilySize also counts the passenger themselves; without it,
  # a passenger traveling alone would have a family size of 0.
  mutate(FamilySize = SiblingsSpouses + ParentsChildren + 1) %>%
  group_by(Pclass) %>%
  summarise(
    mean_FamilySize = mean(FamilySize),
    median_FamilySize = median(FamilySize)
  )