MH3511 – Quiz 2

Info

Material Covered

Lectures 4–7:

Summary statistics
Sampling from theoretical distributions
Goodness-of-fit tests: QQ-plots, chi-squared, Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling
Statistical tests: t-test, U-test, ANOVA, Kruskal-Wallis

Objective:

Solve as many questions as you can. Write your answers in the provided spaces. Knit your submission to PDF (or knit to HTML and print to PDF), then upload it to NTULearn.

If some of your R code doesn’t run, you may comment it out.

Mode

This is a restricted open-book quiz. You may use R manuals and course materials, but internet search and AI tools are not allowed.

Time limit

You have 30 minutes to complete the quiz.

# Load required packages
library(tidyverse)
library(janitor)
library(broom)

###### REPLACE 42 here with the numeric part of your matric number
set.seed(42)

Question 1

The penguins dataset is included in base R (datasets package, R 4.5.0 or later), so it is available without loading any extra package. Here is a random sample:

penguins %>%
  sample_n(5)

Write a single pipeline (using the pipe operator %>%) that computes the following summary statistics by species:

Median bill length (bill_len variable)
Mean body mass (body_mass variable)
Proportion of female specimens (sex variable)
Number of distinct islands that the species is found on (island variable)

Your output should follow this format:

species	median_bill	mean_mass	female_frac	number_of_islands
Adelie	…	…	…	…
Gentoo	…	…	…	…
Chinstrap	…	…	…	…

### ANSWER
penguins %>%
  group_by(species) %>%
  summarise(
    median_bill = median(bill_len, na.rm = TRUE),
    mean_mass = mean(body_mass, na.rm = TRUE),
    female_frac = mean(sex == "female", na.rm = TRUE),
    number_of_islands = n_distinct(island)
  )
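The female_frac line relies on the fact that the mean of a logical vector is the proportion of TRUE values. A minimal sketch on a made-up vector (the values below are hypothetical, not taken from penguins) shows how na.rm = TRUE restricts the proportion to specimens with a recorded sex:

```r
# Toy vector (hypothetical values) with one missing entry
sex <- c("female", "male", NA, "female")

# Comparison yields a logical vector; NA stays NA
is_female <- sex == "female"

# Mean of a logical vector = share of TRUE among the non-missing entries
# (here 2 of the 3 recorded values)
female_frac <- mean(is_female, na.rm = TRUE)
female_frac
```

Without na.rm = TRUE the result would be NA for any species that has at least one specimen of unknown sex.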

Question 2

After returning home from Singapore, you are taking a driving theory test consisting of 30 multiple-choice questions.

20 of the questions are easy, and you have a 70% chance of answering each one correctly.
The remaining 10 questions are hard, and you have a 40% chance of answering each one correctly.
The passing score is at least 20 correct answers out of 30.

Estimate your chance of passing the test using simulation:

Generate 10,000 samples from a binomial distribution \(X \sim \mathrm{Binom}(20, 0.7)\) for the number of correct easy questions.
Generate 10,000 samples from a binomial distribution \(Y \sim \mathrm{Binom}(10, 0.4)\) for the number of correct hard questions.
Estimate the empirical probability \(P(X + Y \ge 20)\).

## ANSWER HERE
x <- rbinom(10000, 20, 0.7)
y <- rbinom(10000, 10, 0.4)
mean(x + y >= 20)

## [1] 0.2822
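As a cross-check on the simulation (not required by the question), the same probability can be computed exactly by conditioning on the number of correct easy answers: \(P(X + Y \ge 20) = \sum_{x=0}^{20} P(X = x)\, P(Y \ge 20 - x)\). The sketch below assumes, as the question does, that X and Y are independent; the variable names are illustrative.

```r
# Possible numbers of correct easy answers
x_values <- 0:20

# P(X = x) for X ~ Binom(20, 0.7)
p_x <- dbinom(x_values, size = 20, prob = 0.7)

# P(Y >= 20 - x) = P(Y > 19 - x) for Y ~ Binom(10, 0.4)
p_y_enough <- pbinom(19 - x_values, size = 10, prob = 0.4, lower.tail = FALSE)

# Law of total probability
p_exact <- sum(p_x * p_y_enough)
p_exact
```

The simulated estimate should land close to this value, with the gap shrinking as the number of samples grows.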

Question 3

Generate a QQ-plot of the variable cyl from the mpg dataset. Based on the plot, is it appropriate to use a t-test to compare the sample means of cyl across different groups in mpg?

ANSWER

### ANSWER
mpg %>% ggplot(aes(sample = cyl)) + 
  geom_qq() + 
  geom_qq_line()

The QQ-plot shows that the points do not follow a straight line but instead form a stepped pattern. This indicates that the variable cyl is discrete and does not follow a normal distribution.

Therefore, a t-test may not be appropriate, especially when group sizes are small or group variances are unequal. A non-parametric test (the Wilcoxon rank-sum test for two groups, or the Kruskal–Wallis test for more than two) would be more suitable in such cases.
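The visual impression can be backed up with a formal normality test. The following is a minimal sketch, not part of the required answer, that tabulates the distinct values of cyl and applies the Shapiro-Wilk test from the list of covered material; it assumes ggplot2 is installed, since that package provides mpg.

```r
library(ggplot2)  # provides the mpg dataset

# Only a handful of distinct values, which explains the steps in the QQ-plot
table(mpg$cyl)

# Shapiro-Wilk test; H0: the data come from a normal distribution
sw <- shapiro.test(mpg$cyl)
sw
```

A small p-value here would agree with the QQ-plot: the normality assumption is not tenable for cyl.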

Question 4

The diamonds dataset is available by default when you load tidyverse. Here is a random sample:

diamonds %>%
  sample_n(5)

A jewelry trader in Antwerp runs the following R command:

diamonds %>%
  tabyl(color, cut) %>%
  chisq.test()

## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 310.32, df = 24, p-value < 2.2e-16

Explain:

What is the objective of this analysis?
What does the result of the chisq.test() tell us in this context?

ANSWER

The objective is to test whether the categorical variables color and cut are independent. The null hypothesis is that they are independent. Since the \(p\)-value of the chi-squared test is extremely small, we reject the null hypothesis and conclude that color and cut are not independent: there is some association between them.
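A significant result says that an association exists, not that it is strong. As a hedged illustration that goes beyond what the question asks, one common effect-size measure is Cramér's V, \(V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}\). The sketch below plugs in the statistic reported above; the sample size of 53,940 rows is an assumption about the diamonds dataset shipped with ggplot2 and is not stated in the quiz.

```r
# Statistic copied from the chisq.test() output above
chi_sq <- 310.32

# Assumption: number of rows in ggplot2's diamonds dataset
n_obs <- 53940

# Table dimensions: 7 colors by 5 cuts
n_rows <- 7
n_cols <- 5

# Degrees of freedom, matching df = 24 in the output
df <- (n_rows - 1) * (n_cols - 1)

# Cramér's V: ranges from 0 (no association) to 1 (perfect association)
cramers_v <- sqrt(chi_sq / (n_obs * (min(n_rows, n_cols) - 1)))
cramers_v
```

With a sample this large, even a weak association produces a tiny p-value, so the effect size is worth reporting alongside the test.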

Question 5

The following code computes the empirical mean flipper length of female Gentoo penguins:

penguins %>%
  drop_na() %>%
  filter(species == "Gentoo" & sex == "female") %>%
  pull(flipper_len) %>%
  mean()

## [1] 212.7069

Compute and report the 99% confidence interval for the population mean flipper length of female Gentoo penguins.

# ANSWER
penguins %>%
  drop_na() %>%
  filter(species == "Gentoo" & sex == "female") %>%
  pull(flipper_len) %>%
  t.test(conf.level = 0.99) %>%
  tidy() %>%
  select(starts_with("conf"))
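For reference, the interval returned by t.test() is \(\bar{x} \pm t_{n-1,\,0.995}\, s / \sqrt{n}\), where \(s\) is the sample standard deviation. The sketch below reproduces it by hand; mean_ci is a hypothetical helper and the input values are made up for illustration, not real penguin measurements.

```r
# Hypothetical helper (not part of the quiz): two-sided t-based CI for a mean
mean_ci <- function(x, conf_level = 0.99) {
  x <- x[!is.na(x)]
  n <- length(x)
  alpha <- 1 - conf_level
  # Critical value of the t distribution with n - 1 degrees of freedom
  t_crit <- qt(1 - alpha / 2, df = n - 1)
  half_width <- t_crit * sd(x) / sqrt(n)
  c(conf.low = mean(x) - half_width, conf.high = mean(x) + half_width)
}

# Made-up values for illustration only
toy_lengths <- c(210, 215, 208, 213, 217, 211, 214, 209)
ci <- mean_ci(toy_lengths, conf_level = 0.99)
ci
```

Applied to the same vector, t.test(toy_lengths, conf.level = 0.99)$conf.int should give the same two endpoints.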

Question 6

The diamonds dataset is available by default when you load tidyverse.

A jeweler wants to know whether diamond color is associated with cut quality. Specifically, they ask:

“Do diamonds of different colors tend to receive different cut grades?”

Use the diamonds dataset to investigate this question. The cut variable is ordinal (“Fair” < “Good” < “Very Good” < “Premium” < “Ideal”), and color has 7 categories (D to J).

For your convenience, here is a helper function to recode cut to numeric values:

recode_cut_to_numeric <- function(x) {
  x %>% recode(
      "Fair" = 1,
      "Good" = 2,
      "Very Good" = 3,
      "Premium" = 4,
      "Ideal" = 5
    )
}

diamonds %>% 
  mutate(numeric_cut = recode_cut_to_numeric(cut)) %>%
  tabyl(cut, numeric_cut)
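When a variable is stored as a factor whose levels are already in the intended order, as.integer() returns the level positions directly, which gives the same 1 to 5 coding as the helper. The sketch below uses a toy factor (hypothetical values) with the level order stated in the question rather than the diamonds data itself.

```r
# Level order as given in the question
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Toy ordered factor (hypothetical values)
toy_cut <- factor(
  c("Ideal", "Fair", "Premium", "Good", "Very Good", "Ideal"),
  levels = cut_levels,
  ordered = TRUE
)

# Integer codes follow the level order: Fair = 1, ..., Ideal = 5
toy_numeric_cut <- as.integer(toy_cut)
toy_numeric_cut
```

This shortcut only works if the levels are in the right order, so checking levels() first is a sensible precaution.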

Your task is to choose an appropriate statistical test to compare the distribution of cut across different color groups.

Justify your choice of test.
Report and briefly interpret the result.

ANSWER

Since cut is ordinal and there are more than two color groups (seven), we choose the Kruskal-Wallis test:

diamonds %>% 
  mutate(numeric_cut = recode_cut_to_numeric(cut)) %>%
  kruskal.test(numeric_cut ~ color, .)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  numeric_cut by color
## Kruskal-Wallis chi-squared = 160.2, df = 6, p-value < 2.2e-16

Since the p-value is extremely small, we reject the null hypothesis that cut has the same distribution in every color group: at least one group differs from the others. A violin plot can help show where the differences lie. Color “G” has the largest fraction of Ideal cuts and “J” the smallest; the difference is subtle, though statistically significant:

diamonds %>% 
  mutate(numeric_cut = recode_cut_to_numeric(cut)) %>%
  ggplot(aes(x = color, y = numeric_cut)) + geom_violin()

MH3511 – Quiz 2

Model Answers.

Fedor Duzhin

2025-07-14