Info

Objective

This in-class notebook is designed to complement the lecture. You’ll practice what you just learned, avoid falling asleep mid-slide, and get instant feedback - both from Fedor and your fellow classmates. You’re encouraged to experiment, ask questions, and correct your answers as we go.

The goal is to learn R by doing, not just by listening.

Your Task

  • Attempt each question yourself before checking the answer or asking for help.
  • Use lecture notes and the example code provided.
  • Update your answers after Fedor’s explanations.
  • Feel free to work with your neighbor if you get stuck — but make sure you understand the final answer!

Initial Setup

Before we begin, make sure you’ve installed the required packages:

  • tidyverse for data manipulation and plotting
  • titanic for practice data
  • e1071 for the skewness() function

You only need to install these once. If you already did it during Homework 1, you’re good to go.

# Load required packages
library(tidyverse)
library(titanic)
library(e1071)

df_titanic <- as_tibble(titanic_train)

Summary statistics

In this practice, we will explore how to generate summary statistics using pipe operators in tidyverse. Let’s begin with a simple example: reporting the class of every variable in df_titanic:

df_titanic %>%
  summarise(across(everything(), class))

Question 1

Use a single chain of pipe operators to generate each of the following summaries from df_titanic:

  (a) Skewness of Fare. Your output should look like this:
skew_fare
  4.77121
  (b) Number of unique values for each character variable. Your output should look like this:
Name  Sex  Ticket  Cabin  Embarked
 891    2     681    148         4
  (c) Number of missing values for each variable. Your output should look like this:
PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
          0         0       0     0    0  177      0      0       0     0      0         0
  (d) Replace missing values in Age with the median, then compute the mean for all numeric variables. Your output should look like this:
PassengerId   Survived    Pclass       Age      SibSp      Parch      Fare
        446  0.3838384  2.308642  29.36158  0.5230079  0.3815937  32.20421
  (e) Remove rows where Embarked == "" (the port of embarkation is unknown), then compute survival rates by Sex and Embarked. Format the answer as a table with rows corresponding to ports of embarkation and columns to sexes. Your output should look like this:
Embarked     Female        Male
C         0.8767123  0.30526316
Q         0.7500000  0.07317073
S         0.6896552  0.17460317
### Answers
# Part (a)
df_titanic %>%
  summarise(skew_fare = skewness(Fare))

# Part (b)
df_titanic %>%
  summarise(across(where(is.character), n_distinct))  # wrap predicates in where(); bare predicates are deprecated

# Part (c)
df_titanic %>%
  summarise(across(everything(), ~ sum(is.na(.)))) 

# Part (d)
df_titanic %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  summarise(across(where(is.numeric), mean))  # wrap predicates in where(); bare predicates are deprecated

# Part (e)
df_titanic %>%
  filter(Embarked != "") %>%
  group_by(Sex, Embarked) %>%
  summarise(survival_rate = mean(Survived), .groups = "drop") %>%  # drop grouping before reshaping
  pivot_wider(names_from = Sex, values_from = survival_rate)

Question 2

Pick a dataset of your choice. Compute the following summary statistics:

  • Mean, median, standard deviation, skewness (numeric variables)

  • Number of unique values, number of missing values, fraction of most prevalent values (character variables)

💡 Hints:

  • Use across(where(is.character), ...) and across(where(is.numeric), ...) to apply functions to selected variables.

  • Use reframe() instead of summarise() when your function returns a vector.

  • Functions you’ll need: n_distinct(), table(), and custom logic for missing values and most frequent values.
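
The reframe() hint can be illustrated on a toy table. This is only a sketch: the tibble toy and the helper two_stats() are made up for illustration and are not part of the exercise data. Because two_stats() returns two values per column, summarise() would not accept it, while reframe() returns one row per value.

```r
library(dplyr)

# Toy data, invented for this illustration only
toy <- tibble(g = c("a", "a", "b", NA), h = c("x", "y", "y", "y"))

# Hypothetical helper: returns a vector of two statistics per column
two_stats <- function(x) {
  c(n_distinct(x),   # number of unique values (NA counts as a value)
    sum(is.na(x)))   # number of missing entries
}

# reframe() allows each column summary to have more than one row
out <- toy %>%
  reframe(across(where(is.character), two_stats))

out
```

The result has one row per statistic, which is why the answer to this question adds a statistic column naming each row.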

## ANSWER
# First Method 
# This is for character variables; a similar approach works for numeric variables

char_summary <- function(x) {
  # Given that x is a character vector,
  # Computes a vector of three statistics:
  # Number of unique values
  # Number of missing entries
  # Fraction of the most common entry
  no_unique <- n_distinct(x)
  no_missing <- sum(is.na(x))
  tab_x <- table(x)
  frac_most_common <- max(tab_x) / length(x)  # count of the most common entry over all entries
  c(no_unique, no_missing, frac_most_common)
}


df_titanic %>%
  reframe(across(where(is.character), char_summary)) %>%
  mutate(statistic = c("Number of Unique Values", 
                       "Number of Missing Values", 
                       "Fraction of Most Common Value")) %>%
  relocate(statistic)
## ANSWER
# Second Method
# This is for numeric variables; a similar approach works for character variables
df_titanic %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    mean = mean(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    skew = skewness(value, na.rm = TRUE),
    .groups = "drop"
  )

Sampling

Binomial distribution

In this exercise, we’ll simulate binomial samples and compute empirical success rates to assess how much variation can arise due to randomness alone.

🔹 Example

Suppose the theoretical success probability is \(p=0.37\). We generate a binomial sample with size = 50 trials and repeat this experiment n = 10000 times. (Beware a notation clash: in rbinom(), n is the number of experiments to generate, whereas the \(n\) used on Wikipedia for the binomial distribution is the number of trials, which R calls size.) Each time, we compute the empirical success rate:

df_binom_experiment <- rbinom(n = 10000, size = 50, prob = 0.37) %>%
  enframe() %>%
  count(value) %>% 
  mutate(empirical_success_rate = value / 50, freq = n / sum(n))

df_binom_experiment %>% head()
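
To make the n versus size distinction concrete, here is a minimal sketch with only five experiments. The seed is arbitrary and only makes the draw reproducible.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# n = 5 experiments, each consisting of size = 50 trials
draws <- rbinom(n = 5, size = 50, prob = 0.37)

draws        # five success counts, each between 0 and 50
draws / 50   # the corresponding five empirical success rates
```

The output has n elements, and each element is a count out of size trials.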

We can now visualize the distribution of empirical success rates (note that we use geom_col rather than geom_histogram since this is a discrete random variable):

df_binom_experiment %>%
  ggplot(aes(x = empirical_success_rate, y = freq)) +
  geom_col(fill = "grey90", colour = "black") +
  labs(x = "Empirical Success Rate", y = "Relative Frequency")

Let’s now estimate the probability that the empirical success rate is 0.2 or lower, even though the theoretical probability is 0.37:

df_binom_experiment %>%
  filter(empirical_success_rate <= 0.2) %>%
  pull(freq) %>% 
  sum()
## [1] 0.0071

As expected, this probability is quite small, but not zero.
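
As a cross-check on the simulation, the same tail probability can be computed exactly. With size = 50, an empirical success rate of at most 0.2 means at most 10 successes, and pbinom() returns \(P(X \le q)\). A minimal sketch:

```r
# Exact P(X <= 10) for X ~ Binomial(size = 50, prob = 0.37)
exact_tail <- pbinom(10, size = 50, prob = 0.37)
exact_tail
```

The simulated frequency should be close to this value, and it gets closer as the number of experiments grows.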

🚢 Titanic Survival Rates

Let’s examine the overall survival rate on the Titanic:

survival_rate <- df_titanic %>% pull(Survived) %>% mean()
survival_rate
## [1] 0.3838384

Now compare this with survival rates by passenger class:

df_titanic %>%
  group_by(Pclass) %>%
  summarise(
    count = n(),
    survival_rate = mean(Survived)
  )

You’ll see that first-class passengers had a much higher survival rate than third-class passengers (about 0.63 versus 0.24).

But is this difference due to random chance?

Let’s simulate binomial experiments to find out.

Question 3

Simulate 100,000 samples from a binomial distribution with:

  • size = class_count — number of passengers in the class (e.g., 1st class = 216, 3rd class = 491)

  • prob = survival_rate — overall Titanic survival rate

Then compute the fraction of simulated samples where:

  (a) The empirical survival rate is as high or higher than the observed survival rate of 1st class passengers, or

  (b) The empirical survival rate is as low or lower than that of 3rd class passengers.

# ANSWER 
set.seed(42) # For reproducibility

# (a) First class passengers:
df_titanic_experiment_class_1 <- rbinom(n = 100000, size = 216, prob = survival_rate) %>%
  enframe() %>%
  count(value) %>% 
  mutate(empirical_success_rate = value / 216, freq = n / sum(n))

cat("For 1st class passengers it doesn't really occur in our experiments:\n")
df_titanic_experiment_class_1 %>% 
  filter(empirical_success_rate >= 136 / 216) %>%  # observed 1st class rate (0.6296296) as an exact ratio
  pull(freq) %>%
  sum()


# (b) Third class passengers:
df_titanic_experiment_class_3 <- rbinom(n = 100000, size = 491, prob = survival_rate) %>%
  enframe() %>%
  count(value) %>% 
  mutate(empirical_success_rate = value / 491, freq = n / sum(n))

cat("For 3rd class passengers it doesn't occur either:\n")
df_titanic_experiment_class_3 %>% 
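  filter(empirical_success_rate <= 119 / 491) %>%  # observed 3rd class rate; the rounded 0.2423625 is slightly below 119/491 and would exclude the observed count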
  pull(freq) %>%
  sum()
## For 1st class passengers it doesn't really occur in our experiments:
## [1] 0
## For 3rd class passengers it doesn't occur either:
## [1] 0

ANSWER No. None of the 100,000 simulated samples was as extreme as the observed rates, so the different survival rates of Titanic passengers in different classes do not seem to have occurred due to random chance.
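
A simulated fraction of 0 only tells us the probability is far below 1 in 100,000. As a sketch, the exact tail probabilities can be obtained with pbinom(). The counts 136 and 119 are assumptions derived from the rates above (0.6296296 × 216 and 0.2423625 × 491), and the overall rate is copied from the earlier output.

```r
survival_rate <- 0.3838384  # overall survival rate, copied from the output above

# (a) P(X >= 136) for 216 first-class passengers: upper tail, so start from 135
p_first <- pbinom(135, size = 216, prob = survival_rate, lower.tail = FALSE)

# (b) P(X <= 119) for 491 third-class passengers: lower tail
p_third <- pbinom(119, size = 491, prob = survival_rate)

c(first_class = p_first, third_class = p_third)
```

Both probabilities are positive but far too small to show up in 100,000 simulated samples, which is consistent with the simulated fractions of 0.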

Normal distribution

To sample from a normal distribution in R, we use the rnorm() function. For example, the following generates 10 observations from a normal distribution with mean \(\mu = 42\) and standard deviation \(\sigma = 3.5\):

rnorm(10, mean = 42, sd = 3.5)
##  [1] 40.95650 46.53173 45.16880 43.06968 38.90828 37.93005 36.14160 41.65145
##  [9] 42.64892 47.84968
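
With only 10 observations the sample mean and standard deviation can be noticeably off. A minimal sketch with a larger sample shows them settling near the requested values. The seed and the sample size of 10,000 are arbitrary choices for illustration.

```r
set.seed(123)  # arbitrary seed so the draw is reproducible

x <- rnorm(10000, mean = 42, sd = 3.5)

sample_mean <- mean(x)  # should be close to 42
sample_sd <- sd(x)      # should be close to 3.5
c(mean = sample_mean, sd = sample_sd)
```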

Question 4

Generate a random sample from a normal distribution with the same size, mean, and standard deviation as the Fare variable in df_titanic. Then plot histograms of both the original Fare values and the simulated sample on the same plot using different colours.

Use geom_histogram(position = "dodge") to be able to visually compare the histograms of the empirical data and the simulated data.

Does the distribution of Fare appear similar to a normal distribution?

mean_fare <- df_titanic %>% pull(Fare) %>% mean()
sd_fare <- df_titanic %>% pull(Fare) %>% sd()
df_normal <- tibble(x = rnorm(n = nrow(df_titanic),
                              mean = mean_fare,
                              sd = sd_fare)) %>%
  mutate(type = "Random Experiment")

df_titanic %>%
  select(Fare) %>%
  rename(x = Fare) %>%
  mutate(type = "Titanic") %>%
  bind_rows(df_normal) %>%
  ggplot(aes(x = x, fill = type)) + 
  geom_histogram(bins = 30, color = "black", position = "dodge") +  # bins set explicitly (30 is the default)
  labs(x = "Fare", y = "Count", fill = "Data Type",
       title = "Comparison of Titanic Fare Distribution and Normal Sample")

ANSWER No, these two distributions do not appear to be similar at all. The empirical distribution of Fare is highly right-skewed.
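
The visual impression can be backed by a number. The sketch below computes the skewness of a simulated normal sample using a hypothetical helper skew(), a simple moment-based formula that may differ slightly from the default in e1071::skewness(). The mean and standard deviation are illustrative placeholders, and 891 matches the number of distinct names reported in Question 1.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical helper: moment-based sample skewness
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Illustrative parameters; a normal sample has skewness near 0 for any mean and sd
z <- rnorm(891, mean = 30, sd = 50)
skew_normal <- skew(z)
skew_normal
```

A value near 0 for the normal sample contrasts sharply with the skewness of 4.77121 found for Fare in Question 1.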

Model Answers:

https://rpubs.com/fduzhin/mh3511_cw_4_answers