Lectures 4–7:
Solve as many questions as you can. Write your answers in the provided spaces. Knit your submission to PDF or to HTML and print to PDF, and upload to NTULearn.
If some of your R code doesn’t run, you may comment it out.
This is a restricted open-book quiz. You may use R manuals and course materials, but internet search and AI tools are not allowed.
You have 30 minutes to complete the quiz.
# Load required packages
library(tidyverse)
library(janitor)
library(broom)
###### REPLACE 42 here with the numeric part of your matric number
set.seed(42)
The penguins dataset is available by default when you load tidyverse.
Here is a random sample:
penguins %>%
sample_n(5)
Write a single pipeline (using the pipe operator %>%) that computes the following summary statistics by species:
- median bill length (bill_len variable)
- mean body mass (body_mass variable)
- fraction of females (sex variable)
- number of distinct islands (island variable)
Your output should follow this format:
species | median_bill | mean_mass | female_frac |
---|---|---|---|
Adelie | … | … | … |
Gentoo | … | … | … |
Chinstrap | … | … | … |
### ANSWER
penguins %>%
group_by(species) %>%
summarise(
median_bill = median(bill_len, na.rm = TRUE),
mean_mass = mean(body_mass, na.rm = TRUE),
female_frac = mean(sex == "female", na.rm = TRUE),
number_of_islands = n_distinct(island)
)
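The female_frac line relies on the fact that the mean of a logical vector is the proportion of TRUE values. A minimal sketch on a made-up vector (toy_sex is hypothetical and not taken from penguins):

```r
# Hypothetical toy data, only to illustrate the idiom; not from penguins
toy_sex <- c("female", "male", NA, "female")

# The comparison yields TRUE/FALSE, and NA where the value is missing
is_female <- toy_sex == "female"

# na.rm = TRUE drops the NA, so the denominator is the 3 known values
female_frac <- mean(is_female, na.rm = TRUE)
female_frac
```

With na.rm = TRUE, the fraction is taken among individuals with known sex only.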
After returning home from Singapore, you are taking a driving theory test consisting of 30 multiple-choice questions.
Estimate your chance of passing the test using simulation:
Generate 10,000 samples from a binomial distribution \(X \sim \mathrm{Binom}(20, 0.7)\) for the number of correct easy questions.
Generate 10,000 samples from a binomial distribution \(Y \sim \mathrm{Binom}(10, 0.4)\) for the number of correct hard questions.
Estimate the empirical probability \(P(X + Y \ge 20)\).
## ANSWER HERE
x <- rbinom(10000, 20, 0.7)
y <- rbinom(10000, 10, 0.4)
mean(x + y >= 20)
## [1] 0.2822
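Assuming X and Y are independent, as the simulation does, the same probability can also be computed exactly by conditioning on Y. This sketch is only a cross-check on the simulated value, not part of the required answer; the names y_vals, p_x_enough and exact_prob are illustrative.

```r
# Possible numbers of correct hard questions
y_vals <- 0:10

# P(X >= 20 - y) for X ~ Binom(20, 0.7).
# pbinom(q, lower.tail = FALSE) gives P(X > q), so use q = 19 - y.
p_x_enough <- pbinom(19 - y_vals, size = 20, prob = 0.7, lower.tail = FALSE)

# Weight each case by P(Y = y) and sum over y
exact_prob <- sum(dbinom(y_vals, size = 10, prob = 0.4) * p_x_enough)
exact_prob
```

The simulated estimate should be close to this value, with the gap shrinking as the number of samples grows.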
Generate a QQ-plot of the variable cyl from the mpg dataset. Based on the plot, is it appropriate to use a t-test to compare the sample means of cyl across different groups in mpg?
ANSWER
### ANSWER
mpg %>% ggplot(aes(sample = cyl)) +
geom_qq() +
geom_qq_line()
The QQ-plot shows that the points do not follow a straight line but instead form a stepped pattern. This indicates that the variable cyl is discrete and does not follow a normal distribution.
Therefore, using a t-test may not be appropriate, especially for small sample sizes or unequal group variances. A non-parametric test (e.g., Wilcoxon or Kruskal–Wallis) might be more suitable in such cases.
The diamonds dataset is available by default when you load tidyverse.
Here is a random sample:
diamonds %>%
sample_n(5)
A jewelry trader in Antwerp runs the following R command:
diamonds %>%
tabyl(color, cut) %>%
chisq.test()
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 310.32, df = 24, p-value < 2.2e-16
Explain the objective of this test and what the result of chisq.test() tells us in this context.
ANSWER
The objective is to test whether the categorical variables color and cut are independent. Since the \(p\)-value of the chi-squared test is extremely small, we reject the null hypothesis of independence and conclude that there is some association between color and cut.
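To make the test statistic concrete, the sketch below computes Pearson's X-squared by hand for a small made-up 2 x 3 table and compares it with chisq.test(). The counts in toy_counts are hypothetical and unrelated to diamonds.

```r
# Hypothetical counts: 2 rows x 3 columns, purely illustrative
toy_counts <- matrix(c(30, 20, 10,
                       20, 30, 40),
                     nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(toy_counts), colSums(toy_counts)) / sum(toy_counts)

# Pearson statistic, degrees of freedom (r - 1) * (c - 1), and p-value
x_squared <- sum((toy_counts - expected)^2 / expected)
df <- (nrow(toy_counts) - 1) * (ncol(toy_counts) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

# Compare with the built-in test
builtin <- chisq.test(toy_counts)
c(by_hand = x_squared, builtin = unname(builtin$statistic))
```

The same rule gives df = (7 - 1) * (5 - 1) = 24 for the 7 colors and 5 cut grades, matching the output reported for diamonds.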
The following code computes the empirical mean flipper length of female Gentoo penguins:
penguins %>%
drop_na() %>%
filter(species == "Gentoo" & sex == "female") %>%
pull(flipper_len) %>%
mean()
## [1] 212.7069
Compute and report the 99% confidence interval for the population mean flipper length of female Gentoo penguins.
# ANSWER
penguins %>%
drop_na() %>%
filter(species == "Gentoo" & sex == "female") %>%
pull(flipper_len) %>%
t.test(conf.level = 0.99) %>%
tidy() %>%
select(starts_with("conf"))
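The interval returned by t.test() is the standard t-based interval: the sample mean plus or minus a t critical value times the standard error. A minimal sketch of that formula follows; mean_ci() is a hypothetical helper, and the simulated vector x only stands in for the flipper lengths.

```r
# Hypothetical helper: t-based confidence interval for a population mean
mean_ci <- function(x, conf_level = 0.99) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)
  # Critical value from the t distribution with n - 1 degrees of freedom
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  c(conf.low = mean(x) - t_crit * se, conf.high = mean(x) + t_crit * se)
}

# Simulated stand-in data (not the penguins measurements)
set.seed(1)
x <- rnorm(50, mean = 212, sd = 4)

mean_ci(x, conf_level = 0.99)
t.test(x, conf.level = 0.99)$conf.int  # should agree with the line above
```

Raising conf_level widens the interval, because the critical value grows while the standard error stays the same.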
The diamonds dataset is available by default when you load tidyverse.
A jeweler wants to know whether diamond color is associated with cut quality. Specifically, they ask:
“Do diamonds of different colors tend to receive different cut grades?”
Use the diamonds dataset to investigate this question.
The cut variable is ordinal (“Fair” < “Good” < “Very Good” < “Premium” < “Ideal”), and color has 7 categories (D to J).
For your convenience, here is a helper function to recode cut to numeric values:
recode_cut_to_numeric <- function(x) {
x %>% recode(
"Fair" = 1,
"Good" = 2,
"Very Good" = 3,
"Premium" = 4,
"Ideal" = 5
)
}
diamonds %>%
mutate(numeric_cut = recode_cut_to_numeric(cut)) %>%
tabyl(cut, numeric_cut)
Your task is to choose an appropriate statistical test to compare the distribution of cut across different color groups.
ANSWER
Since cut is ordinal and there are more than two color groups, we choose the Kruskal-Wallis test:
diamonds %>%
mutate(numeric_cut = recode_cut_to_numeric(cut)) %>%
kruskal.test(numeric_cut ~ color, .)
##
## Kruskal-Wallis rank sum test
##
## data: numeric_cut by color
## Kruskal-Wallis chi-squared = 160.2, df = 6, p-value < 2.2e-16
Since the p-value is extremely small, we conclude that at least one color group has a distribution of cut that differs significantly from the others. In principle, a violin plot can help to identify which one. The color with the largest fraction of Ideal cuts is “G”, while “J” has the smallest; the difference is subtle, though statistically significant:
diamonds %>%
mutate(numeric_cut = recode_cut_to_numeric(cut)) %>%
ggplot(aes(x = color, y = numeric_cut)) + geom_violin()
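The statement about “G” and “J” can also be checked numerically by tabulating the share of each cut grade within each color. A minimal sketch in base R, assuming ggplot2 is installed so that ggplot2::diamonds is available; cut_by_color and ideal_frac are illustrative names.

```r
# Row-wise proportions: each color's row sums to 1
cut_by_color <- prop.table(
  table(ggplot2::diamonds$color, ggplot2::diamonds$cut),
  margin = 1
)

# Share of Ideal cuts per color, sorted from largest to smallest
ideal_frac <- sort(cut_by_color[, "Ideal"], decreasing = TRUE)
round(ideal_frac, 3)
```

The first and last names in ideal_frac identify the colors with the largest and smallest share of Ideal cuts.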