suppressPackageStartupMessages(library(tidyverse))
# Increases font size for all ggplot2 plots
theme_set(theme_gray(base_size=18))
# List of colors for customizing plots
colors <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728",
            "#9467bd", "#8c564b", "#e377c2", "#7f7f7f",
            "#bcbd22", "#17becf")
# Opens an interactive file picker; select the Titanic CSV file
titanic <- read.csv(file.choose())
head(titanic, 10)
##    survived pclass    sex age sibsp parch    fare embarked  class   who
## 1         0      3   male  22     1     0  7.2500        S  Third   man
## 2         1      1 female  38     1     0 71.2833        C  First woman
## 3         1      3 female  26     0     0  7.9250        S  Third woman
## 4         1      1 female  35     1     0 53.1000        S  First woman
## 5         0      3   male  35     0     0  8.0500        S  Third   man
## 6         0      3   male  NA     0     0  8.4583        Q  Third   man
## 7         0      1   male  54     0     0 51.8625        S  First   man
## 8         0      3   male   2     3     1 21.0750        S  Third child
## 9         1      3 female  27     0     2 11.1333        S  Third woman
## 10        1      2 female  14     1     0 30.0708        C Second child
##    adult_male deck embark_town alive alone
## 1        TRUE      Southampton    no FALSE
## 2       FALSE    C   Cherbourg   yes FALSE
## 3       FALSE      Southampton   yes  TRUE
## 4       FALSE    C Southampton   yes FALSE
## 5        TRUE      Southampton    no  TRUE
## 6        TRUE       Queenstown    no  TRUE
## 7        TRUE    E Southampton    no  TRUE
## 8       FALSE      Southampton    no FALSE
## 9       FALSE      Southampton   yes FALSE
## 10      FALSE        Cherbourg   yes FALSE
# Subset the titanic dataset to include first class passengers who embarked in Southampton
# Using base R
firstSouth <- titanic[titanic$class == "First" & titanic$embarked == "S",]
# Subset the titanic dataset to include second or third class passengers
# Using base R
secondThird <- titanic[titanic$pclass == 2 | titanic$pclass == 3,]
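For comparison, the same two subsets can also be written with dplyr's filter(). The sketch below uses a small made-up data frame (`toy`, with hypothetical values) so that it runs without the CSV file; on the real data, `titanic` would take its place.

```r
library(dplyr)

# Hypothetical toy data (not taken from the Titanic file)
toy <- data.frame(
  pclass   = c(1, 1, 2, 3),
  class    = c("First", "First", "Second", "Third"),
  embarked = c("S", "C", "S", "Q")
)

# First class passengers who embarked in Southampton;
# conditions separated by commas are combined with AND
firstSouthToy <- toy %>% filter(class == "First", embarked == "S")

# Second or third class passengers; %in% replaces the chained | comparisons
secondThirdToy <- toy %>% filter(pclass %in% c(2, 3))
```

One difference worth knowing: filter() drops rows where the condition evaluates to NA, whereas base R logical indexing returns a row of NA values for each of them.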
firstSouth %>%
  group_by(class, sex) %>%
  summarize(n=n(), .groups="drop_last") %>%
  spread(sex, n)
## # A tibble: 1 × 3
## # Groups:   class [1]
##   class female  male
##   <chr>  <int> <int>
## 1 First     48    79
secondThird %>%
  group_by(pclass, alive) %>%
  summarize(n=n(), .groups="drop_last") %>%
  spread(alive, n)
## # A tibble: 2 × 3
## # Groups:   pclass [2]
##   pclass    no   yes
##    <int> <int> <int>
## 1      2    97    87
## 2      3   372   119
Both pipelines above use the dplyr and tidyr packages to turn a subset of the Titanic data into a small table of counts.
Each pipeline applies the same three steps:
group_by(class, sex), or group_by(pclass, alive) for the second subset: groups the rows by two variables, so that every later step works on each combination of the two, such as first class and female.
summarize(n = n(), .groups = "drop_last"): collapses each group to a single row whose new column n holds the number of observations in that group. The .groups = "drop_last" argument removes only the last grouping variable (sex or alive), so the result stays grouped by class or pclass, as the "Groups:" line in each output shows.
spread(sex, n), or spread(alive, n): reshapes the counts into wide format. The values of the key column become column names and n supplies the cell values, so each row now represents one passenger class rather than one combination of class and sex.
The result is a table with one row per passenger class and one count column per category: female and male for firstSouth, no and yes for secondThird. For example, the second output shows that 97 second class passengers died and 87 survived, against 372 and 119 in third class. This layout makes the groups easy to compare at a glance and is convenient for further analysis.
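The same three steps can be tried on a small hand-made data frame. This is a minimal sketch: the `toy` data frame and its values are made up for illustration and are not taken from the Titanic file, and pivot_wider() is included only as the newer tidyr alternative to spread().

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not taken from the Titanic file)
toy <- data.frame(
  pclass = c(2, 2, 2, 3, 3, 3, 3),
  alive  = c("no", "yes", "yes", "no", "no", "no", "yes")
)

# Count rows per (pclass, alive) pair; "drop_last" keeps the pclass grouping
counts <- toy %>%
  group_by(pclass, alive) %>%
  summarize(n = n(), .groups = "drop_last")

# Only the last grouping variable (alive) was dropped
group_vars(counts)

# Wide table: one row per pclass, one column per value of alive
wide <- counts %>% spread(alive, n)

# pivot_wider() performs the same reshaping with named arguments
wide2 <- counts %>% pivot_wider(names_from = alive, values_from = n)
```

Printing `wide` and `wide2` side by side is a quick way to check that both reshaping functions give the same counts.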
# Create a bar chart for the first class passengers who embarked in Southampton grouped by sex
firstSouth %>% ggplot(aes(x=class)) +
  geom_bar(aes(fill=sex), position="dodge") +
  scale_fill_manual(values=colors) +
  labs(x="Class", y="Count", fill="Sex")
# Create a bar chart for the second and third class passengers grouped by survival status
secondThird %>% ggplot(aes(x=class)) +
  geom_bar(aes(fill=alive)) +
  scale_fill_manual(values=colors) +
  labs(x="Class", y="Count", fill="Alive")