Titanic
The content above in lines 1 through 17 is called the YAML header. It sets the document title and specifies the output format used when you render the file. Rendering options include HTML, Word, and PDF. Here, we will render the document into an HTML file with folded code that wraps. Note, too, that we include a table of contents (toc) and apply global options to suppress warnings and messages.
The text in the white sections is where you can insert narrative and other useful information. The backslash at the end of a line is a hard line break: it starts a new line; otherwise the text would run together.
The gray section below is called a code chunk. This is where you will place R script.
All code chunks start and end with three backticks (the key just below the ESC key on a Windows keyboard).
The first line needs the {r} to indicate that this code chunk is using R script. (Quarto will run with other code types such as Python and you can include different code chunks in the same document. However, for this course, we will only use R throughout.)
You can run all of the commands just in this code chunk by clicking on the green play arrow in the upper right corner of the chunk. Go ahead and run the code below. Note, you may need to install the titanic package first using the Packages tab in the lower right quadrant of RStudio.
For this exercise, we will use the dataset that is stored in the titanic package. Note that the data are already split into training and test datasets. We want to use all records, so the first thing we will do is concatenate the datasets (bind_rows) into one.
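To see what the concatenation does when the two inputs do not share all columns, here is a minimal sketch. The data frames `train_toy` and `test_toy` are hypothetical stand-ins with made-up values; they only mimic the fact that titanic_test has no Survived column.

```r
library(dplyr)

# Toy stand-ins for titanic_train and titanic_test (hypothetical values)
train_toy <- data.frame(PassengerId = 1:2, Survived = c(0L, 1L), Age = c(22, 38))
test_toy  <- data.frame(PassengerId = 3:4, Age = c(34.5, 47))  # no Survived column

# bind_rows() stacks the rows and matches columns by name;
# a column missing from one input is filled with NA for those rows
combined_toy <- bind_rows(train_toy, test_toy)
print(combined_toy)
```

The same mechanism would explain why Survived is NA for the rows that come from the test dataset once the real data are combined.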
# combine test and train datasets
library(titanic)
library(dplyr)
titanic <- bind_rows(titanic_train, titanic_test)
Let’s understand the structure of the dataset as well as view a few of the first and last observations.
# view data and metadata
str(titanic) # look at the structure of the dataset
'data.frame': 1309 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
head(titanic, 3) # look at the first 3 rows of the dataset
  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
tail(titanic, 3) # look at the last 3 rows of the dataset
     PassengerId Survived Pclass                         Name  Sex  Age SibSp
1307        1307       NA      3 Saether, Mr. Simon Sivertsen male 38.5     0
1308        1308       NA      3          Ware, Mr. Frederick male   NA     0
1309        1309       NA      3     Peter, Master. Michael J male   NA     1
     Parch             Ticket    Fare Cabin Embarked
1307     0 SOTON/O.Q. 3101262  7.2500              S
1308     0             359309  8.0500              S
1309     1               2668 22.3583              C
Note that important variables such as \(Survived\) and \(Age\) have missing values.
summary(titanic) # look at the descriptive statistics for the numeric variables
  PassengerId      Survived          Pclass          Name
 Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309
 1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character
 Median : 655   Median :0.0000   Median :3.000   Mode  :character
 Mean   : 655   Mean   :0.3838   Mean   :2.295
 3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000
 Max.   :1309   Max.   :1.0000   Max.   :3.000
                NA's   :418
     Sex                 Age            SibSp            Parch
 Length:1309        Min.   : 0.17   Min.   :0.0000   Min.   :0.000
 Class :character   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000
 Mode  :character   Median :28.00   Median :0.0000   Median :0.000
                    Mean   :29.88   Mean   :0.4989   Mean   :0.385
                    3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000
                    Max.   :80.00   Max.   :8.0000   Max.   :9.000
                    NA's   :263
    Ticket               Fare            Cabin             Embarked
 Length:1309        Min.   :  0.000   Length:1309        Length:1309
 Class :character   1st Qu.:  7.896   Class :character   Class :character
 Mode  :character   Median : 14.454   Mode  :character   Mode  :character
                    Mean   : 33.295
                    3rd Qu.: 31.275
                    Max.   :512.329
                    NA's   :1
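The NA counts that summary() reports can also be computed directly, which is handy when you only want the missing-value tally. This is a minimal base R sketch on a hypothetical data frame `toy` with made-up values, not the Titanic data itself.

```r
# Toy data frame with a few missing values (hypothetical values)
toy <- data.frame(
  Survived = c(0, 1, NA, NA),
  Age      = c(22, NA, 26, 35),
  Fare     = c(7.25, 71.28, 7.92, 53.10)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Applied to the combined dataset, `colSums(is.na(titanic))` should reproduce the NA counts shown in the summary output (418 for Survived, 263 for Age, and 1 for Fare).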
John Jacob Astor, one of the wealthiest people in the world, was aboard the Titanic. Let’s see if he survived.
astor <- titanic[grepl("Astor", titanic$Name), ]
print(astor %>% select(Name, Survived))
                                                  Name Survived
701  Astor, Mrs. John Jacob (Madeleine Talmadge Force)        1
1094                            Astor, Col. John Jacob       NA
The NA does not mean that Col. Astor’s fate is unknown. His record comes from titanic_test, which has no Survived column, so bind_rows() filled the value with NA. (Historically, he died in the sinking.) His wife, however, did survive.
# Create a new variable that holds the average age. na.rm = TRUE tells R to skip missing values.
xage <- round(mean(titanic$Age, na.rm = TRUE), digits = 2)
xage
[1] 29.88
Rather than hardcode the value for the age, we will use inline code to print it in the narrative. This is useful when the variable you create is dynamic, such as when you frequently update a dataset with new data. The result looks like this:
- The average age of all passengers on board the Titanic was 29.88.
- You will see the answer when you render the document.
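As a minimal sketch of what inline code does: in the .qmd source, an inline R expression is written as a backtick, the letter r, a space, the expression, and a closing backtick. The names `ages_toy` and `xage_toy` below are hypothetical stand-ins with made-up ages.

```r
# Hypothetical ages standing in for titanic$Age
ages_toy <- c(22, 38, 26, NA, 35)
xage_toy <- round(mean(ages_toy, na.rm = TRUE), digits = 2)

# In the .qmd narrative you would write:  ... was `r xage_toy`.
# At render time the expression is replaced by its value, roughly like this:
sentence <- paste0("The average age of the passengers was ", xage_toy, ".")
print(sentence)
```

Because the number is computed at render time, the sentence stays correct when the underlying data change.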
Let’s run just a few more bits of analysis. Specifically, let’s see if we can answer the following questions:
1. How many passengers are male and how many female?
2. How many passengers survived and how many died?
3. How many females survived? How many died?
4. What percent of females survived?
5. What percent of survivors were male?
6. What percent of passengers were females who perished?
7. What is the distribution of passengers by fare class and embark location?
The following table shows the number of female and male passengers aboard the Titanic.
table(titanic$Sex)
female male
466 843
However, it may be nicer to show the answer as inline text vs code output:
females <- nrow(titanic[titanic$Sex == "female", ])
males <- nrow(titanic[titanic$Sex == "male", ])
The following table shows the number of known survivors and non-survivors from the Titanic.
Note that survival status is missing (NA) for many passengers; these are labeled “Unsure”. All 418 such records come from titanic_test, which does not include the Survived column.
# convert the variable to a factor
titanic$Survived.f <- as.factor(ifelse(is.na(titanic$Survived), "Unsure", # if the value is NA, then unsure
                                ifelse(titanic$Survived == 0, "Did Not Survive",
                                       "Survived")))
table1 <- as.data.frame(table(titanic$Survived.f))
table1
             Var1 Freq
1 Did Not Survive  549
2        Survived  342
3          Unsure  418
Of the passengers in the dataset, 549 did not survive, 342 survived, and the survival status of 418 is not recorded.
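The recoding step can be easier to follow on a handful of values. This is a minimal sketch on a hypothetical vector `survived_toy`; it uses the same nested ifelse() idea, and additionally sets the factor levels explicitly, which is an optional choice rather than something the analysis requires.

```r
# Hypothetical Survived codes: 0 = died, 1 = survived, NA = not recorded
survived_toy <- c(0, 1, NA, 1, 0, NA)

# The outer ifelse() handles NA first; the inner one separates 0 from 1
status_toy <- ifelse(is.na(survived_toy), "Unsure",
                     ifelse(survived_toy == 0, "Did Not Survive", "Survived"))

# factor() with explicit levels fixes the display order in tables and plots
status_toy <- factor(status_toy, levels = c("Did Not Survive", "Survived", "Unsure"))
print(table(status_toy))
```

Checking for NA first matters: a comparison such as `NA == 0` returns NA, so testing the value before the missing-value check would leave those entries unlabeled.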
The following table shows the number of survivors and non-survivors by gender.
table(titanic$Sex, titanic$Survived.f)
         Did Not Survive Survived Unsure
  female              81      233    152
  male               468      109    266
# use this to create the values for inline code
female_survived <- nrow(titanic[titanic$Sex == "female" & titanic$Survived.f == "Survived", ])
female_nonsurvived <- nrow(titanic[titanic$Sex == "female" & titanic$Survived.f == "Did Not Survive", ])
Records show that among the females on board the Titanic, 233 survived and 81 did not survive.
While the prior table showed counts, this table shows row percentages: the share of each gender that falls into each survival status.
# create contingency table
library(summarytools)
ctable(titanic$Sex, titanic$Survived.f, prop = "r")
Cross-Tabulation, Row Proportions
Sex * Survived.f
Data Frame: titanic
-------- ------------ ----------------- ------------- ------------- ---------------
           Survived.f   Did Not Survive      Survived        Unsure           Total
     Sex
  female                     81 (17.4%)   233 (50.0%)   152 (32.6%)    466 (100.0%)
    male                    468 (55.5%)   109 (12.9%)   266 (31.6%)    843 (100.0%)
   Total                    549 (41.9%)   342 (26.1%)   418 (31.9%)   1309 (100.0%)
-------- ------------ ----------------- ------------- ------------- ---------------
# compute the values by hand so you can use them in inline code
total_females <- sum(titanic$Sex == "female") # count females
percentage_female_survived <- round((female_survived / total_females) * 100, digits=1)
percentage_female_nonsurvived <- round((female_nonsurvived / total_females) * 100, digits=1)
Among females on board the Titanic, 50% survived and 17.4% did not survive; the survival status of the remaining 32.6% is not recorded.
The following table shows the distribution of survivors by gender.
library(summarytools)
ctable(titanic$Sex, titanic$Survived.f, prop = "r")
Cross-Tabulation, Row Proportions
Sex * Survived.f
Data Frame: titanic
-------- ------------ ----------------- ------------- ------------- ---------------
           Survived.f   Did Not Survive      Survived        Unsure           Total
     Sex
  female                     81 (17.4%)   233 (50.0%)   152 (32.6%)    466 (100.0%)
    male                    468 (55.5%)   109 (12.9%)   266 (31.6%)    843 (100.0%)
   Total                    549 (41.9%)   342 (26.1%)   418 (31.9%)   1309 (100.0%)
-------- ------------ ----------------- ------------- ------------- ---------------
male_survived <- nrow(titanic[titanic$Sex == "male" & titanic$Survived.f == "Survived", ])
total_males <- sum(titanic$Sex == "male")
percentage_males_survived <- round((male_survived / total_males) * 100, digits=1)
percentage_males_survived
[1] 12.9
Among males aboard the Titanic, 12.9% survived.
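The share of males who survived and the share of survivors who were male are different quantities, and they come from different margins of the same table. The sketch below uses base R's prop.table() on the counts reported in the Sex by Survived.f table; the object names `counts`, `row_pct`, and `col_pct` are illustrative.

```r
# Counts copied from the Sex by Survived.f table in this document
counts <- matrix(c(81, 233, 152,
                   468, 109, 266),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Sex = c("female", "male"),
                                 Status = c("Did Not Survive", "Survived", "Unsure")))

# margin = 1: each row sums to 100 (share of each sex with a given status)
row_pct <- round(prop.table(counts, margin = 1) * 100, 1)

# margin = 2: each column sums to 100 (share of each status that is female or male)
col_pct <- round(prop.table(counts, margin = 2) * 100, 1)

print(row_pct)
print(col_pct)
```

Read this way, the question "What percent of survivors were male?" corresponds to the column percentage (109 of 342 known survivors, about 31.9%), while 12.9% is the row percentage: the share of all males recorded as having survived.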
# share of all passengers who were females that did not survive
total_titanic <- nrow(titanic)
percentage_female_nonsurvived <- round((female_nonsurvived / total_titanic) * 100, digits=1)
percentage_female_nonsurvived
[1] 6.2
Among all passengers on the Titanic, 6.2% were females who perished.
Note that passengers embarked at one of three locations (in order).
S: Southampton, England
C: Cherbourg, France
Q: Queenstown, Ireland
# fix missing labels and convert to a factor
titanic$Embarked <- ifelse(titanic$Embarked == "", "Unknown", titanic$Embarked)
titanic$Embarked.f <- as.factor(titanic$Embarked)
# cross-tabulate fare class (rows) by embarkation port (columns)
table(titanic$Pclass, titanic$Embarked)
      C   Q   S Unknown
  1 141   3 177       2
  2  28   7 242       0
  3 101 113 495       0
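Row and column totals make a two-way table like this easier to read. A minimal sketch, using the counts from the table in this document and base R's addmargins(); the object names `embark` and `with_totals` are illustrative.

```r
# Counts copied from the Pclass by Embarked table in this document
embark <- matrix(c(141,   3, 177, 2,
                    28,   7, 242, 0,
                   101, 113, 495, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Pclass = c("1", "2", "3"),
                                 Embarked = c("C", "Q", "S", "Unknown")))

# addmargins() appends row and column totals, labelled "Sum"
with_totals <- addmargins(embark)
print(with_totals)
```

On the real data, the same call can be applied directly to the result of table().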
Let’s create three plots using ggplot2. We will learn more about the capabilities of ggplot2 over this course. We will practice a few here.
The first two charts are similar: one shows counts, and the other shows the same values as percentages.
library(ggplot2)
ggplot(titanic, aes(x = Sex, fill = Survived.f)) +
  geom_bar(position = "dodge") +
  geom_text(aes(label = after_stat(count)), stat = "count", position = position_dodge(width = 0.9), vjust = -0.25) + # data labels
  labs(x = "Gender", y = "Count", fill = "Survival Status",
       title = "Number of Titanic Passengers by Survival and Gender",
       subtitle = "The largest group was men who did not survive")
This chart is similar to the one above, but its data labels show the percent of total passengers rather than raw counts.
Note that we first calculate the percentage using dplyr then we use that data to create the chart. We also add data labels.
# summarize the data to create the percentages
pip <- titanic %>%
  group_by(Sex, Survived.f) %>%
  summarise(Count = n(), .groups = 'drop') %>%
  mutate(Total = sum(Count), Percentage = (Count / Total) * 100)
pip
# A tibble: 6 × 5
  Sex    Survived.f      Count Total Percentage
  <chr>  <fct>           <int> <int>      <dbl>
1 female Did Not Survive    81  1309       6.19
2 female Survived          233  1309      17.8
3 female Unsure            152  1309      11.6
4 male   Did Not Survive   468  1309      35.8
5 male   Survived          109  1309       8.33
6 male   Unsure            266  1309      20.3
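The percentage step and the label formatting used for the chart can be checked by hand. This minimal base R sketch starts from the six counts in the summary table and formats them with sprintf(); the names `counts`, `pct`, and `labels` are illustrative.

```r
# Counts copied from the summary table in this document
counts <- c(81, 233, 152, 468, 109, 266)

# Each group's share of all passengers, on a 0-100 scale
pct <- counts / sum(counts) * 100

# "%.0f" rounds to zero decimals; "%%" prints a literal percent sign
labels <- sprintf("%.0f%%", pct)
print(labels)
```

Because the denominator is the grand total, the six percentages sum to 100 across the whole table rather than within each gender.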
# Create the dodged bar chart
ggplot(pip, aes(x = Sex, y = Percentage, fill = Survived.f)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
  geom_text(aes(label = sprintf("%.0f%%", Percentage), # sprintf rounds the percentage to 0 decimals
                group = Survived.f),
            position = position_dodge(width = 0.9), vjust = -0.25) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Percentage is already on a 0-100 scale
  labs(x = "Gender", y = " ", fill = "Survival Status",
       title = "Percentage of Total Titanic Passengers by Survival and Gender",
       subtitle = "Nearly 36% of all passengers were men who did not survive") +
  theme(axis.text.y = element_blank()) # Suppress y-axis labels
Create a bar chart that shows the average age of passengers by fare class: 1st Class, 2nd Class, or 3rd Class.
# Create a data frame with the average ages
library(dplyr)
mean_ages <- titanic %>%
  group_by(Pclass) %>%
  summarise(MeanAge = round(mean(Age, na.rm=TRUE), 0))
# Create the bar chart
library(ggplot2)
ggplot(data=mean_ages, aes(x=Pclass, y=MeanAge)) +
  geom_bar(stat="identity", fill="lightblue", position="dodge") +
  geom_text(aes(label=round(MeanAge, 1)), vjust=-0.5) +
  ylim(0,45) +
  labs(x = "Passenger Class", y="Average Age",
       title = "Average Age by Passenger Class",
       subtitle = "First class passengers were older, on average.",
       caption = "Source: titanic dataset")
Here we created a plot that shows the average age by fare class. First class Titanic passengers were, on average, older than those in second and third class.
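For comparison, the grouped mean can also be computed in base R. This is a minimal sketch on a hypothetical data frame `toy` with made-up ages; it is an alternative to the group_by() and summarise() steps, not the method used for the chart.

```r
# Hypothetical passengers: two per class, one with a missing age
toy <- data.frame(Pclass = c(1, 1, 2, 2, 3, 3),
                  Age    = c(40, 36, 30, NA, 22, 26))

# The formula interface drops rows with missing Age by default,
# then applies mean() within each Pclass
mean_ages_toy <- aggregate(Age ~ Pclass, data = toy, FUN = mean)
print(mean_ages_toy)
```

The result is an ordinary data frame with one row per class, so it can be passed to ggplot() in the same way as `mean_ages`.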
### Last section
… the intent, the chart, and the interpretation.
library(ggplot2)
ggplot(titanic, aes(x = Pclass, fill = Survived.f)) +
  geom_bar(position = "dodge") +
  geom_text(aes(label = after_stat(count)), stat = "count", position = position_dodge(width = 0.9), vjust = -0.25) + # data labels
  labs(x = "Passenger Class", y = "Count", fill = "Survival Status",
       title = "Number of Titanic Passengers by Survival and Class",
       subtitle = "The largest group that did not survive traveled in third class")
This chart shows a distinct pattern of survival across passenger classes. “Survived” is the largest category in first class, with 136 people. In contrast, third class has by far the highest number of casualties, 372, more than first and second class combined. Comparing first and third class reveals a notable pattern: although third class carried far more passengers, first class had more survivors (136 versus 119). This suggests that first-class passengers had a better chance of surviving, although the chart alone cannot show why. Finally, second-class passengers are split almost evenly across the three statuses (97 did not survive, 87 survived, and 93 unsure).
# summarize the data to create the percentages
pip <- titanic %>%
  group_by(Pclass, Survived.f) %>%
  summarise(Count = n(), .groups = 'drop') %>%
  mutate(Total = sum(Count), Percentage = (Count / Total) * 100)
pip
# A tibble: 9 × 5
  Pclass Survived.f      Count Total Percentage
   <int> <fct>           <int> <int>      <dbl>
1      1 Did Not Survive    80  1309       6.11
2      1 Survived          136  1309      10.4
3      1 Unsure            107  1309       8.17
4      2 Did Not Survive    97  1309       7.41
5      2 Survived           87  1309       6.65
6      2 Unsure             93  1309       7.10
7      3 Did Not Survive   372  1309      28.4
8      3 Survived          119  1309       9.09
9      3 Unsure            218  1309      16.7
# Create the dodged bar chart
ggplot(pip, aes(x = Pclass, y = Percentage, fill = Survived.f)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
  geom_text(aes(label = sprintf("%.0f%%", Percentage), # sprintf rounds the percentage to 0 decimals
                group = Survived.f),
            position = position_dodge(width = 0.9), vjust = -0.25) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Percentage is already on a 0-100 scale
  labs(x = "Passenger Travel Class", y = " ", fill = "Survival Status",
       title = "Percentage of Total Titanic Passengers by Survival and Class",
       subtitle = "About 28% of all passengers traveled in third class and did not survive") +
  theme(axis.text.y = element_blank())
Looking at percentages adds nuance. About 28% of all passengers were third-class passengers who did not survive, the largest group on the chart, and third class also has the largest share with an “Unsure” status (about 17%). First class, despite carrying fewer passengers than third class, accounts for the largest share of survivors (about 10% of all passengers, versus 9% for third class and 7% for second). Because these are percentages of all passengers rather than rates within each class, they are consistent with third class having been the most dangerous way to travel, but they do not measure that risk directly.
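To compare the classes on equal footing, a within-class survival rate among passengers with a known outcome can be computed from the counts in the summary table. This is a minimal sketch under the assumption that the “Unsure” records are simply excluded; the data frame `known` and its column names are illustrative.

```r
# Counts copied from the Pclass by Survived.f summary in this document
known <- data.frame(
  Pclass   = c(1, 2, 3),
  Died     = c(80, 97, 372),
  Survived = c(136, 87, 119)
)

# Survival rate within each class, excluding passengers with "Unsure" status
known$SurvivalRate <- round(known$Survived / (known$Died + known$Survived) * 100, 1)
print(known)
```

Under that assumption, roughly 63% of first-class, 47% of second-class, and 24% of third-class passengers with a recorded outcome survived.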
## END