This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.
The goal is to learn R by doing, not just by listening.
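Before we begin, make sure you’ve installed the required packages:
- `tidyverse` for data manipulation and plotting
- `titanic` for practice data

You only need to install these once. If you already did it during Homework 1, you’re good to go. The code below also loads `e1071`, which provides the `skewness()` function, so install it as well if you haven’t yet.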
# Load required packages
library(tidyverse)
library(titanic)
library(e1071)
df_titanic <- as_tibble(titanic_train)
In this practice, we will explore how to generate summary statistics using pipe operators in `tidyverse`. Let’s begin with a simple example: reporting the class of every variable in `df_titanic`:
df_titanic %>%
summarise(across(everything(), class))
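The same `across()` pattern extends to other statistics. Below is a minimal sketch on a made-up tibble (`toy` and its columns are hypothetical, not part of the Titanic data). It assumes a reasonably recent version of dplyr and shows how `where()` restricts the computation to columns of one type, and how a lambda passes extra arguments such as `na.rm = TRUE`.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  id    = 1:4,
  score = c(2.5, NA, 4.0, 3.5),
  label = c("a", "b", "b", NA)
)

# One statistic per column: the class of every column
toy %>%
  summarise(across(everything(), class))

# Only numeric columns; the lambda forwards na.rm = TRUE to mean()
toy_means <- toy %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
toy_means
```

Selecting columns with `where()` keeps the chain robust when a dataset mixes numeric and character variables, as `df_titanic` does.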
Use a single chain of pipe operators to generate each of the following summaries from `df_titanic`:
(a) The skewness of `Fare`. Your output should look like this:

| skew_fare |
|---|
| 4.77121 |
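(b) The number of distinct values of each character variable. Your output should look like this:

| Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|
| 891 | 2 | 681 | 148 | 4 |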
(c) The number of missing values (`NA`) in each variable. Your output should look like this:

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 177 | 0 | 0 | 0 | 0 | 0 | 0 |
(d) Replace the missing values of `Age` with the median, then compute the mean for all numeric variables. Your output should look like this:

| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 446 | 0.3838384 | 2.308642 | 29.36158 | 0.5230079 | 0.3815937 | 32.20421 |
(e) Remove observations with `Embarked == ""` (the port of embarkation is unknown), then compute survival rates by `Sex` and `Embarked`. Format the answer as a table with rows corresponding to ports of embarkation and columns to sexes. Your output should look like this:

| Embarked | Female | Male |
|---|---|---|
| C | 0.8767123 | 0.30526316 |
| Q | 0.7500000 | 0.07317073 |
| S | 0.6896552 | 0.17460317 |
### Answers
# Part (a)
df_titanic %>%
  summarise(skew_fare = skewness(Fare))

# Part (b)
# where() is required: passing a bare predicate to across() is deprecated
df_titanic %>%
  summarise(across(where(is.character), n_distinct))

# Part (c)
df_titanic %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Part (d)
df_titanic %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  summarise(across(where(is.numeric), mean))

# Part (e)
df_titanic %>%
  filter(Embarked != "") %>%
  # Capitalise the labels so that the columns are named Female and Male
  mutate(Sex = str_to_title(Sex)) %>%
  group_by(Sex, Embarked) %>%
  summarise(survival_rate = mean(Survived), .groups = "drop") %>%
  pivot_wider(names_from = Sex, values_from = survival_rate)
Pick a dataset of your choice. Compute the following summary statistics:
- Numeric variables: mean, median, standard deviation, skewness
- Character variables: number of unique values, number of missing values, fraction of the most prevalent value

💡 Hints:
- Use `across(where(is.character), ...)` and `across(where(is.numeric), ...)` to apply functions to selected variables.
- Use `reframe()` instead of `summarise()` when your function returns more than one value per column.
- Functions you’ll need: `n_distinct()`, `table()`, and custom logic for missing values and most frequent values.
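To illustrate the `reframe()` hint, here is a minimal sketch on a made-up tibble (`toy` is hypothetical). It assumes dplyr 1.1.0 or later, where `reframe()` was introduced. `range()` returns two values per column, so the result has two rows, which we then label by hand.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  x = c(1, 2, 3, 4, 100),
  y = c(10, 20, 30, 40, 50)
)

# range() returns c(min, max), so reframe() yields two rows per column
toy_ranges <- toy %>%
  reframe(across(everything(), range)) %>%
  mutate(statistic = c("min", "max")) %>%
  relocate(statistic)
toy_ranges
```

The manual `statistic` column relies on knowing the order in which the function returns its values, so it is worth double-checking that order whenever you change the function.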
## ANSWER
# First Method
# This is for character variables; a similar approach works for numeric variables
char_summary <- function(x) {
  # Given a character vector x, returns a vector of three statistics:
  #   number of unique values (NA, if present, counts as one value),
  #   number of missing entries,
  #   fraction of the most common entry
  no_unique <- n_distinct(x)
  no_missing <- sum(is.na(x))
  # table() ignores NA; max() gives the highest count
  frac_most_common <- max(table(x)) / length(x)
  c(no_unique, no_missing, frac_most_common)
}

df_titanic %>%
  reframe(across(where(is.character), char_summary)) %>%
  mutate(statistic = c("Number of Unique Values",
                       "Number of Missing Values",
                       "Fraction of Most Common Value")) %>%
  relocate(statistic)
## ANSWER
# Second Method
# This is for numeric variables; a similar approach works for character variables
df_titanic %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    mean = mean(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    skew = skewness(value, na.rm = TRUE),
    .groups = "drop"
  )
In this exercise, we’ll simulate binomial samples and compute empirical success rates to assess how much variation can arise due to randomness alone.
🔹 Example
Suppose the theoretical success probability is \(p=0.37\). We generate a binomial sample with `size = 50` and repeat this experiment `n = 10000` times. Note the clash of notation: in R, `n` is the number of samples (experiments) to generate, while the \(n\) used on Wikipedia corresponds to `size` in R. Each time, we compute the empirical success rate:
df_binom_experiment <- rbinom(n = 10000, size = 50, prob = 0.37) %>%
  enframe() %>%
  count(value) %>%
  mutate(empirical_success_rate = value / 50, freq = n / sum(n))
df_binom_experiment %>% head()
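To make the `n` versus `size` distinction concrete, here is a small sketch with only five experiments (the seed and the name `draws` are arbitrary choices for illustration). Each element of the result is the number of successes out of 50 trials in one experiment.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# n = 5 experiments, each consisting of size = 50 trials
draws <- rbinom(n = 5, size = 50, prob = 0.37)

length(draws)                   # 5: one success count per experiment
all(draws >= 0 & draws <= 50)   # every count lies between 0 and size
draws / 50                      # empirical success rates
```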
We can now visualize the distribution of empirical success rates (note that we use `geom_col` rather than `geom_histogram` since this is a discrete random variable):
df_binom_experiment %>%
  ggplot(aes(x = empirical_success_rate, y = freq)) +
  geom_col(fill = "grey90", colour = "black") +
  labs(x = "Empirical Success Rate", y = "Relative Frequency")
Let’s now compute the probability that the empirical success rate is at or below 0.2, even though the theoretical probability is 0.37:
df_binom_experiment %>%
  filter(empirical_success_rate <= 0.2) %>%
  pull(freq) %>%
  sum()
## [1] 0.0071
As expected, this probability is quite small, but not zero.
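The simulated fraction can be checked against the exact binomial probability. An empirical success rate of at most 0.2 with `size = 50` means at most 10 successes, so the exact value is the binomial CDF at 10, which base R provides as `pbinom()`. The sketch below reruns a simulation with an arbitrary seed; the names `exact_prob`, `sim`, and `sim_prob` are introduced here for illustration.

```r
# Exact probability: P(X <= 10) for X ~ Binomial(50, 0.37)
exact_prob <- pbinom(10, size = 50, prob = 0.37)

# Simulated counterpart, as in the example above
set.seed(123)  # arbitrary seed, only for reproducibility
sim <- rbinom(n = 10000, size = 50, prob = 0.37)
sim_prob <- mean(sim <= 10)

c(exact = exact_prob, simulated = sim_prob)
```

The two numbers should agree up to simulation noise, which shrinks as the number of experiments `n` grows.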
🚢 Titanic Survival Rates
Let’s examine the overall survival rate on the Titanic:
survival_rate <- df_titanic %>% pull(Survived) %>% mean()
survival_rate
## [1] 0.3838384
Now compare this with survival rates by passenger class:
df_titanic %>%
  group_by(Pclass) %>%
  summarise(
    count = n(),
    survival_rate = mean(Survived)
  )
You’ll see that first-class passengers had a much higher survival rate compared to third-class passengers.
But is this difference due to random chance?
Let’s simulate binomial experiments to find out.
Simulate 100,000 samples from a binomial distribution with:
- `size = class_count`: the number of passengers in the class (e.g., 1st class = 216, 3rd class = 491)
- `prob = survival_rate`: the overall Titanic survival rate

Then compute the fraction of simulated samples where:
- (a) the empirical survival rate is as high as or higher than the observed survival rate of 1st class passengers, or
- (b) the empirical survival rate is as low as or lower than that of 3rd class passengers.
# ANSWER
set.seed(42) # For reproducibility

# (a) First class passengers: 136 of 216 survived (rate 0.6296296)
df_titanic_experiment_class_1 <- rbinom(n = 100000, size = 216, prob = survival_rate) %>%
  enframe() %>%
  count(value) %>%
  mutate(empirical_success_rate = value / 216, freq = n / sum(n))

cat("For 1st class passengers it doesn't really occur in our experiments:\n")
df_titanic_experiment_class_1 %>%
  # Compare counts rather than rounded rates to avoid floating-point mismatches
  filter(value >= 136) %>%
  pull(freq) %>%
  sum()

# (b) Third class passengers: 119 of 491 survived (rate 0.2423625)
df_titanic_experiment_class_3 <- rbinom(n = 100000, size = 491, prob = survival_rate) %>%
  enframe() %>%
  count(value) %>%
  mutate(empirical_success_rate = value / 491, freq = n / sum(n))

cat("For 3rd class passengers it doesn't occur either:\n")
df_titanic_experiment_class_3 %>%
  # A rounded threshold of 0.2423625 would wrongly exclude value == 119
  filter(value <= 119) %>%
  pull(freq) %>%
  sum()
## For 1st class passengers it doesn't really occur in our experiments:
## [1] 0
## For 3rd class passengers it doesn't occur either:
## [1] 0
ANSWER No, the differences in survival rates between passenger classes do not seem to be due to random chance: none of the 100,000 simulated samples was as extreme as the observed rates.
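A simulated fraction of exactly 0 only tells us that the probability is smaller than the simulation can resolve. As a complementary sketch, the exact binomial tail probabilities can be computed with `pbinom()`. The survivor counts below (342 of 891 overall, 136 of 216 in 1st class, 119 of 491 in 3rd class) are derived from the rates and class sizes reported above; the variable names are introduced here for illustration.

```r
# Overall survival rate: 342 survivors out of 891 passengers (0.3838384)
p_overall <- 342 / 891

# 1st class: P(X >= 136) for X ~ Binomial(216, p_overall)
p_first <- pbinom(136 - 1, size = 216, prob = p_overall, lower.tail = FALSE)

# 3rd class: P(X <= 119) for X ~ Binomial(491, p_overall)
p_third <- pbinom(119, size = 491, prob = p_overall)

c(first_class = p_first, third_class = p_third)
```

Both probabilities are positive but far below 1 in 100,000, which is consistent with never observing such samples in the simulation.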
To sample from a normal distribution in R, we use the `rnorm()` function. For example, the following generates 10 observations from a normal distribution with mean \(\mu = 42\) and standard deviation \(\sigma = 3.5\):
rnorm(10, mean = 42, sd = 3.5)
## [1] 40.95650 46.53173 45.16880 43.06968 38.90828 37.93005 36.14160 41.65145
## [9] 42.64892 47.84968
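With only 10 observations the sample mean and standard deviation can be noticeably off from \(\mu\) and \(\sigma\). A quick sketch with a much larger sample (the sample size, seed, and name `x` are arbitrary choices for illustration) shows the sample statistics settling near the parameters passed to `rnorm()`.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# A large sample from the same normal distribution
x <- rnorm(100000, mean = 42, sd = 3.5)

# Sample statistics should be close to mu = 42 and sigma = 3.5
c(sample_mean = mean(x), sample_sd = sd(x))
```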
Generate a random sample from a normal distribution with the same size, mean, and standard deviation as the `Fare` variable in `df_titanic`. Then plot histograms of both the original `Fare` values and the simulated sample on the same plot using different colours.
Use `geom_histogram(position = "dodge")` to be able to visually compare the histograms of the empirical data and the simulated data.
Does the distribution of `Fare` appear similar to a normal distribution?
mean_fare <- df_titanic %>% pull(Fare) %>% mean()
sd_fare <- df_titanic %>% pull(Fare) %>% sd()

# Normal sample with the same size, mean, and standard deviation as Fare
df_normal <- tibble(x = rnorm(n = nrow(df_titanic),
                              mean = mean_fare,
                              sd = sd_fare)) %>%
  mutate(type = "Random Experiment")

df_titanic %>%
  select(Fare) %>%
  rename(x = Fare) %>%
  mutate(type = "Titanic") %>%
  bind_rows(df_normal) %>%
  ggplot(aes(x = x, fill = type)) +
  # Set bins explicitly to avoid the default-bins message
  geom_histogram(bins = 30, color = "black", position = "dodge") +
  labs(x = "Fare", y = "Count", fill = "Data Type",
       title = "Comparison of Titanic Fare Distribution and Normal Sample")
ANSWER No, these two distributions do not appear to be similar at all. The empirical distribution of `Fare` is highly right-skewed.
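The visual impression can be backed by a number: a normal sample has skewness near 0, while right-skewed data has clearly positive skewness. The sketch below uses a hand-written moment-based estimator (`sample_skewness()` is a hypothetical helper, not the `e1071` function, whose default variant differs slightly) and an exponential sample as a stand-in for right-skewed data; none of the simulated values come from the Titanic dataset.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(7)  # arbitrary seed, only for reproducibility

# Symmetric data: a normal sample
normal_sample <- rnorm(5000, mean = 32, sd = 50)

# Stand-in for right-skewed data: an exponential sample
skewed_sample <- rexp(5000, rate = 1 / 32)

skew_values <- c(
  normal = sample_skewness(normal_sample),
  right_skewed = sample_skewness(skewed_sample)
)
skew_values
```

Applied to `Fare`, the `skewness()` call from Part (a) plays the same role: its value of about 4.77 is far from the near-zero skewness expected of normal data.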