This document works through an assessment for the course Data Science: Data Visualization by Harvard University on edX.
Context: The Titanic was a British ocean liner that struck an iceberg and sank on its maiden voyage in 1912 from the United Kingdom to New York. More than 1,500 of the estimated 2,224 passengers and crew died in the accident, making this one of the largest maritime disasters ever outside of war. The ship carried a wide range of passengers of all ages and both genders, from luxury travelers in first class to immigrants in the lower classes. However, not all passengers were equally likely to survive the accident. We use real data about a selection of 891 passengers to learn who was on the Titanic and which passengers were more likely to survive.
Loading the required packages
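The package list is not shown in this copy of the report. A minimal sketch, assuming the analysis relies on the tidyverse (for dplyr and ggplot2) and on the titanic package (for the titanic_train data referenced later):

```r
# Assumed packages: tidyverse for the %>% pipelines and plots,
# titanic for the titanic_train data set
library(tidyverse)
library(titanic)
```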
Getting and setting up the required data
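The setup code is not shown here either. The sketch below is one way to arrive at a data frame with the seven columns and the column types listed in the variable overview further down; the exact steps the report used are an assumption.

```r
library(tidyverse)
library(titanic)

# Assumed setup: keep the seven columns used in this report and
# store the categorical ones as factors
titanic <- titanic_train %>%
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) %>%
  mutate(Survived = factor(Survived),
         Pclass = factor(Pclass),
         Sex = factor(Sex))
```

Storing Survived, Pclass and Sex as factors lets ggplot2 treat them as discrete groups when they are mapped to an axis or a fill colour.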
Inspecting the variable types. See ?titanic_train for information on the variables.
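The listing that follows has the layout produced by glimpse(). A sketch of a call that would print it, with the data preparation repeated as an assumption so the chunk runs on its own:

```r
library(tidyverse)
library(titanic)

# Assumed setup: seven columns, three of them converted to factors
titanic <- titanic_train %>%
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) %>%
  mutate(Survived = factor(Survived),
         Pclass = factor(Pclass),
         Sex = factor(Sex))

# One line per column: name, type and the first few values
glimpse(titanic)
```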
## Rows: 891
## Columns: 7
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1...
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3...
## $ Sex <fct> male, female, female, female, male, male, male, male, fema...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ...
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0...
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,...
Exploring the demographics of Titanic passengers
Overall age distribution of Titanic passengers
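The plot itself is not reproduced in this copy. As a rough sketch, a histogram of Age would show the overall distribution; the plot type, the 5-year bin width and the name age_plot are assumptions made for illustration.

```r
library(tidyverse)
library(titanic)

# Assumed setup, as in the data preparation step
titanic <- titanic_train %>%
  mutate(Survived = factor(Survived), Pclass = factor(Pclass), Sex = factor(Sex))

# Histogram of passenger ages; rows without a recorded age are dropped first
age_plot <- titanic %>%
  filter(!is.na(Age)) %>%
  ggplot(aes(x = Age)) +
  geom_histogram(binwidth = 5, colour = "white")  # bin width is an assumption

age_plot
```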
Counts of male vs female passengers
Sex | n |
---|---|
female | 314 |
male | 577 |
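A table like the one above can be produced with count(). The sketch below uses titanic_train directly so that it runs without the setup step; the name sex_counts is introduced here for illustration.

```r
library(tidyverse)
library(titanic)

# Number of passengers of each sex
sex_counts <- titanic_train %>%
  count(Sex)

knitr::kable(sex_counts)
```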
Now, checking the claim “The proportion of females under age 17 was higher than the proportion of males under age 17.”
# Share of passengers under 17 within each sex (rows with missing Age are dropped)
titanic %>%
  filter(!is.na(Age)) %>%
  mutate(under_17 = if_else(Age < 17, 1, 0)) %>%
  group_by(Sex) %>%
  summarise(mean(under_17)) %>%
  knitr::kable()
Sex | mean(under_17) |
---|---|
female | 0.1877395 |
male | 0.1125828 |
The claim is correct: about 18.8% of females with a recorded age were under 17, compared with about 11.3% of males.
Checking the second claim “The proportion of males age 18-35 was higher than the proportion of females age 18-35.”
# Share of passengers aged 18 to 35 (inclusive) within each sex
titanic %>%
  filter(!is.na(Age)) %>%
  mutate(Eighteen_thirtyfive = if_else(Age >= 18 & Age <= 35, 1, 0)) %>%
  group_by(Sex) %>%
  summarise(mean(Eighteen_thirtyfive)) %>%
  knitr::kable()
Sex | mean(Eighteen_thirtyfive) |
---|---|
female | 0.5095785 |
male | 0.5540839 |
This claim is also correct: about 55.4% of males with a recorded age were 18 to 35, compared with about 51.0% of females.
Survival by sex of passengers in the Titanic
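The plot for this section is not shown in this copy. One way to compare survival between the sexes is a dodged bar plot of Sex filled by Survived; the choice of plot and the name sex_plot are assumptions.

```r
library(tidyverse)
library(titanic)

# Assumed setup, as in the data preparation step
titanic <- titanic_train %>%
  mutate(Survived = factor(Survived), Pclass = factor(Pclass), Sex = factor(Sex))

# Side-by-side bars: 0 = did not survive, 1 = survived
sex_plot <- titanic %>%
  ggplot(aes(x = Sex, fill = Survived)) +
  geom_bar(position = position_dodge())

sex_plot
```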
Survival by age
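This plot is also missing from this copy. A possible sketch is a density plot of Age filled by Survived, scaled to counts so that the two groups keep their relative sizes; the plot type, the alpha value and the name age_survival_plot are assumptions.

```r
library(tidyverse)
library(titanic)

# Assumed setup, as in the data preparation step
titanic <- titanic_train %>%
  mutate(Survived = factor(Survived), Pclass = factor(Pclass), Sex = factor(Sex))

# after_stat(count) scales each density by the number of passengers in the group
age_survival_plot <- titanic %>%
  filter(!is.na(Age)) %>%
  ggplot(aes(x = Age, y = after_stat(count), fill = Survived)) +
  geom_density(alpha = 0.2)

age_survival_plot
```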
Survival by fare
# Fares of 0 are dropped because they cannot be shown on a log scale
titanic %>%
  filter(Fare != 0) %>%
  ggplot(aes(x = Survived, y = Fare)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.2, width = 0.02) +
  scale_y_log10()
Survival by passenger class
We’ll create three bar plots, each answering a different question.
Bar plots of passenger class filled by Survival with counts on the y-axis
Bar plots of passenger class filled by Survival with proportion on the y-axis
Proportion bar plot
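The three plots are not shown in this copy, so the following is only a sketch. The first two follow the descriptions above. The third is read here as survival on the x-axis filled by passenger class, which is an assumption because the heading does not say which variable goes where. The names p_count, p_prop and p_by_survival are introduced for illustration.

```r
library(tidyverse)
library(titanic)

# Assumed setup, as in the data preparation step
titanic <- titanic_train %>%
  mutate(Survived = factor(Survived), Pclass = factor(Pclass), Sex = factor(Sex))

# 1. Passenger class filled by survival, counts on the y-axis
p_count <- titanic %>%
  ggplot(aes(x = Pclass, fill = Survived)) +
  geom_bar()

# 2. Same plot with each bar rescaled to 1, so the y-axis shows proportions
p_prop <- titanic %>%
  ggplot(aes(x = Pclass, fill = Survived)) +
  geom_bar(position = "fill") +
  ylab("Proportion")

# 3. Assumed reading: survival on the x-axis, filled by passenger class
p_by_survival <- titanic %>%
  ggplot(aes(x = Survived, fill = Pclass)) +
  geom_bar(position = "fill") +
  ylab("Proportion")

p_count
p_prop
p_by_survival
```

With position = "fill" every bar has the same height, which makes survival rates comparable across classes even though the classes differ in size.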