Info

Material Covered

Lectures 4–7:

  • Summary statistics
  • Sampling from theoretical distributions
  • Goodness-of-fit tests: QQ-plots, chi-squared, Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling
  • Statistical tests: t-test, U-test, ANOVA, Kruskal-Wallis test

Objective:

Solve as many questions as you can. Write your answers in the provided spaces. Knit your submission to PDF (or knit to HTML and print to PDF), then upload it to NTULearn.

If some of your R code doesn’t run, you may comment it out.

Mode

This is a restricted open-book quiz. You may use R manuals and course materials, but internet search and AI tools are not allowed.

Time limit

You have 30 minutes to complete the quiz.

# Load required packages
library(tidyverse)
library(janitor)
library(broom)

###### REPLACE 42 here with the numeric part of your matric number
set.seed(42)

Question 1

The penguins dataset ships with base R (version 4.5.0 or later) in the datasets package, so it is available without loading any extra package. Here is a random sample:

penguins %>%
  sample_n(5)

Write a single pipeline (using the pipe operator %>%) that computes the following summary statistics by species:

Your output should follow this format:

species     median_bill   mean_mass   female_frac
Adelie
Gentoo
Chinstrap

### ANSWER
penguins %>%
  group_by(species) %>%
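  summarise(
    median_bill = median(bill_len, na.rm = TRUE),       # median bill length
    mean_mass = mean(body_mass, na.rm = TRUE),           # mean body mass
    female_frac = mean(sex == "female", na.rm = TRUE),   # share of females among penguins of known sex
    number_of_islands = n_distinct(island)               # distinct islands per species
  )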
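
The female_frac line relies on R treating TRUE as 1 and FALSE as 0, so the mean of a logical vector is a proportion. A minimal sketch on a toy vector (invented for illustration, not taken from penguins) shows the idea:

```r
# Toy vector, invented purely for illustration (not penguins data)
sex <- c("female", "male", NA, "female", "male", "female")

# The comparison yields TRUE / FALSE / NA
is_female <- sex == "female"

# na.rm = TRUE drops the NA before averaging:
# 3 females out of 5 non-missing entries
female_frac <- mean(is_female, na.rm = TRUE)
print(female_frac)
```

With na.rm = TRUE the denominator excludes individuals whose sex is unknown, so the result is the fraction of females among those with a recorded sex.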

Question 2

After returning home from Singapore, you are taking a driving theory test consisting of 30 multiple-choice questions.

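Estimate your chance of passing the test using simulation. The steps below model 20 easy questions, each answered correctly with probability 0.7, and 10 hard questions, each answered correctly with probability 0.4; passing requires at least 20 correct answers in total: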

  1. Generate 10,000 samples from a binomial distribution \(X \sim \mathrm{Binom}(20, 0.7)\) for the number of correct easy questions.

  2. Generate 10,000 samples from a binomial distribution \(Y \sim \mathrm{Binom}(10, 0.4)\) for the number of correct hard questions.

  3. Estimate the empirical probability \(P(X + Y \ge 20)\).

## ANSWER HERE
x <- rbinom(10000, 20, 0.7)
y <- rbinom(10000, 10, 0.4)
mean(x + y >= 20)
## [1] 0.2822
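
As a sanity check on the simulation, the same probability can be computed exactly by summing the joint probabilities of all (x, y) pairs with x + y >= 20. This is only a sketch: it assumes X and Y are independent, which the separate rbinom() calls above imply, and the variable names are illustrative.

```r
# Exact P(X + Y >= 20) for independent X ~ Binom(20, 0.7), Y ~ Binom(10, 0.4)
x_vals <- 0:20
y_vals <- 0:10

# Joint probability table under independence: rows index x, columns index y
joint <- outer(dbinom(x_vals, 20, 0.7), dbinom(y_vals, 10, 0.4))

# Logical mask of outcomes where the total score reaches the pass mark
passes <- outer(x_vals, y_vals, "+") >= 20

p_exact <- sum(joint[passes])
print(p_exact)
```

The simulated estimate should land close to p_exact, with a Monte Carlo error on the order of sqrt(p(1 - p) / 10000).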

Question 3

Generate a QQ-plot of the variable cyl from the mpg dataset. Based on the plot, is it appropriate to use a t-test to compare the sample means of cyl across different groups in mpg?

ANSWER

### ANSWER
mpg %>% ggplot(aes(sample = cyl)) + 
  geom_qq() + 
  geom_qq_line()

The QQ-plot shows that the points do not follow a straight line but instead form a stepped pattern, because cyl is discrete: it takes only four distinct values (4, 5, 6 and 8). Such a variable cannot follow a normal distribution.

Therefore, a t-test may not be appropriate, especially when the groups are small or have unequal variances. A non-parametric test (the Wilcoxon rank-sum U-test for two groups, or Kruskal–Wallis for more than two) is a safer choice.
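
The visual impression from the QQ-plot can be backed up numerically. The sketch below tabulates the distinct values of cyl and runs a Shapiro-Wilk test, one of the normality checks covered in the lectures; it is a complement to the plot, not a replacement for it.

```r
library(ggplot2)  # provides the mpg dataset

# Distinct values of cyl and how often each occurs
cyl_counts <- table(mpg$cyl)
print(cyl_counts)

# Shapiro-Wilk test: the null hypothesis is that cyl is normally distributed
sw <- shapiro.test(mpg$cyl)
print(sw$p.value)
```

A small p-value here rejects normality, which agrees with the stepped pattern in the plot.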

Question 4

The diamonds dataset is available by default when you load tidyverse. Here is a random sample:

diamonds %>%
  sample_n(5)

A jewelry trader in Antwerp runs the following R command:

diamonds %>%
  tabyl(color, cut) %>%
  chisq.test()
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 310.32, df = 24, p-value < 2.2e-16

Explain:

ANSWER

The objective is to test whether the categorical variables color and cut are independent. Since the \(p\)-value of the chi-squared test is extremely small, we reject the null hypothesis of independence: there is some association between color and cut.
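
To make the test less of a black box, the Pearson statistic can be rebuilt by hand from observed and expected counts. This is a sketch using base R's table() rather than tabyl(); the variable names are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Observed contingency table of color (rows) by cut (columns)
observed <- table(diamonds$color, diamonds$cut)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic and its degrees of freedom
x2 <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x2, df = dof, lower.tail = FALSE)
print(c(x2 = x2, dof = dof, p_value = p_value))
```

The statistic and degrees of freedom should match the X-squared and df values reported by chisq.test() above.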

Question 5

The following code computes the empirical mean flipper length of female Gentoo penguins:

penguins %>%
  drop_na() %>%
  filter(species == "Gentoo" & sex == "female") %>%
  pull(flipper_len) %>%
  mean()
## [1] 212.7069

Compute and report the 99% confidence interval for the population mean flipper length of female Gentoo penguins.

# ANSWER
penguins %>%
  drop_na() %>%
  filter(species == "Gentoo" & sex == "female") %>%
  pull(flipper_len) %>%
  t.test(conf.level = 0.99) %>%
  tidy() %>%
  select(starts_with("conf"))
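
The interval returned by t.test() is the usual t-based interval: sample mean plus or minus the critical t value times the standard error. The sketch below rebuilds it by hand. Both the helper mean_ci() and the simulated vector are hypothetical, introduced only for illustration; the data are not the penguin measurements.

```r
# Hypothetical helper (not part of the quiz): t-based CI for a population mean
mean_ci <- function(x, conf_level = 0.99) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)                                # standard error of the mean
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)   # two-sided critical value
  c(conf.low = mean(x) - t_crit * se, conf.high = mean(x) + t_crit * se)
}

# Simulated placeholder data, NOT the penguin measurements
set.seed(1)
x <- rnorm(50, mean = 200, sd = 5)

ci_manual <- mean_ci(x, conf_level = 0.99)
ci_ttest <- t.test(x, conf.level = 0.99)$conf.int

print(ci_manual)
print(as.numeric(ci_ttest))
```

Applying mean_ci() to the vector produced by pull(flipper_len) should reproduce the conf.low and conf.high columns from the pipeline above.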

Question 6

The diamonds dataset is available by default when you load tidyverse.

A jeweler wants to know whether diamond color is associated with cut quality. Specifically, they ask:

“Do diamonds of different colors tend to receive different cut grades?”

Use the diamonds dataset to investigate this question. The cut variable is ordinal (“Fair” < “Good” < “Very Good” < “Premium” < “Ideal”), and color has 7 categories (D to J).

For your convenience, here is a helper function to recode cut to numeric values:

recode_cut_to_numeric <- function(x) {
  x %>% recode(
      "Fair" = 1,
      "Good" = 2,
      "Very Good" = 3,
      "Premium" = 4,
      "Ideal" = 5
    )
}

diamonds %>% 
  mutate(numeric_cut = recode_cut_to_numeric(cut)) %>%
  tabyl(cut, numeric_cut)

Your task is to choose an appropriate statistical test to compare the distribution of cut across different color groups.

ANSWER

Since cut is ordinal and there are more than two color groups (seven), we choose the Kruskal-Wallis test:

diamonds %>% 
  mutate(numeric_cut = recode_cut_to_numeric(cut)) %>%
  kruskal.test(numeric_cut ~ color, .)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  numeric_cut by color
## Kruskal-Wallis chi-squared = 160.2, df = 6, p-value < 2.2e-16

Since the p-value is extremely small, we reject the null hypothesis that cut has the same distribution in every color group: at least one color differs significantly from the others. A violin plot can help show where the difference lies. The color with the largest fraction of ideal cut is “G”, while “J” has the smallest; the difference is subtle but statistically significant:

diamonds %>% 
  mutate(numeric_cut = recode_cut_to_numeric(cut)) %>%
  ggplot(aes(x = color, y = numeric_cut)) + geom_violin()
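
Because the violin plot of a five-level variable is hard to read, the statement about “G” and “J” can also be checked with a small summary table. This is a sketch; the name ideal_by_color is illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Share of "Ideal" cuts within each color, sorted from largest to smallest
ideal_by_color <- diamonds %>%
  group_by(color) %>%
  summarise(ideal_frac = mean(cut == "Ideal"), n = n()) %>%
  arrange(desc(ideal_frac))

print(ideal_by_color)
```

The first and last rows of the table identify the colors with the largest and smallest fraction of ideal cuts.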