This examination paper contains FIVE (5) questions.
Answer ALL questions. The marks for each question are indicated at its beginning.
Answer each question in the space provided in this notebook. You may insert new R code chunks as appropriate.
This is an OPEN BOOK examination. You are allowed to use any reference materials, such as course materials, your notes, and tidyverse cheatsheets. All reference materials should be printed in advance and referred to from hard copies. During the examination, only RStudio should be on your screen. Online searches and generative AI tools are strictly prohibited.
Click Run All and knit this document to ensure that all required libraries are installed and all R code runs correctly. If any code fails, seek assistance immediately.
Once you have confirmed that the R code runs and the notebook knits successfully, disable WiFi and remain offline for the rest of the examination. This applies to all devices in your possession.
You are required to bring your own laptop, fully charged, to the examination.
Do not view the exam questions until instructed to begin.
At the end of the exam, knit your notebook to PDF. Alternatively, you may knit to HTML, open it in a browser, and print it to PDF.
Ensure that all R code runs (comment out any code that doesn’t). All outputs and plots must be visible in the final PDF.
At the end of the exam, you will be required to enable WiFi again and submit the resulting PDF as your final answer script to NTULearn.
# Load required packages
library(tidyverse)
library(zoo)
library(janitor)
library(psych)
library(GGally)
library(stargazer)
theme_set(theme_minimal())
# Replace the number 73 below with the numeric part
# of your matriculation number or your passport number:
set.seed(73)
Part (a)
Write a single R command to compute the following sum:
\[ \sum_{k=1}^{30}\log(3k+2) = \log 5 + \log 8 + \log{11} + \cdots + \log{92} \]
# ANSWER HERE
### FIRST METHOD
sum(log(3 * (1:30) + 2))
### SECOND METHOD
seq(5, 92, 3) %>% log %>% sum
## [1] 110.0047
## [1] 110.0047
Part (b)
Given a numeric vector x, write a single R command that counts how many entries in x lie between \(-1\) and \(1\), inclusive. Your command should work for any numeric vector x, not just the one generated below:
x <- rnorm(20)
x
#### ANSWER BELOW
### FIRST METHOD
sum(x >= -1 & x <= 1)
### SECOND METHOD
sum((x >= -1) * (x <= 1))
### THIRD METHOD
tibble(x) %>% filter(x <= 1 & x >= -1) %>% nrow()
## [1] -0.1450469 0.2913580 0.0937974 -0.1273679 -0.8468310 0.1411659
## [7] 1.5374285 2.7388478 1.5511972 0.4615500 -0.8099335 0.9053432
## [13] 1.3198469 0.5033341 -0.5218265 -1.3072612 1.0914046 -0.1601435
## [19] 0.8653942 -0.8711010
## [1] 14
## [1] 14
## [1] 14
Part (c)
Write a single R command that creates a numeric vector x of length n whose k-th entry is \(k^2\) when \(k\) is odd and \(-k\) when \(k\) is even.
For example, when \(n = 6\), the vector should be:
\(x = (1, -2, 9, -4, 25, -6)\)
Your code should work for any n, not just n = 20.
n <- 20
### ANSWER BELOW
### FIRST METHOD
((1:n) %% 2) * (1:n)^2 - ((0:(n-1)) %% 2) * (1:n)
### SECOND METHOD
ifelse((1:n) %% 2 == 0, -(1:n), (1:n)^2)
### THIRD METHOD
1:n %>% map_vec(~ifelse(. %% 2, .^2, -.))
## [1] 1 -2 9 -4 25 -6 49 -8 81 -10 121 -12 169 -14 225 -16 289 -18 361
## [20] -20
## [1] 1 -2 9 -4 25 -6 49 -8 81 -10 121 -12 169 -14 225 -16 289 -18 361
## [20] -20
## [1] 1 -2 9 -4 25 -6 49 -8 81 -10 121 -12 169 -14 225 -16 289 -18 361
## [20] -20
Part (d)
Using the built-in mtcars dataset, write a single R command (or a single pipeline using the %>% operator) that computes the mean horsepower (hp) of cars with a 4-cylinder engine (cyl == 4).
### ANSWER BELOW
### FIRST METHOD
mtcars %>%
filter(cyl == 4) %>%
pull(hp) %>%
mean()
### SECOND METHOD
mean(mtcars$hp[mtcars$cyl == 4])
## [1] 82.63636
## [1] 82.63636
This question uses the billboard dataset, which is available by default when you load the tidyverse package.
head(billboard)
For reference, here is the number of missing values across all weeks:
billboard %>%
summarise(across(starts_with("wk"), ~ sum(is.na(.))))
Part (a)
Plot the total number of songs that entered the Billboard charts each month as a line chart. The \(x\)-axis should represent the month, and the \(y\)-axis the number of songs.
### ANSWER HERE
billboard %>%
mutate(month = as.yearmon(date.entered)) %>%
count(month) %>%
ggplot(aes(x = month, y = n)) +
geom_line()
Part (b)
Write a single pipeline (using the %>% operator) to compute the following summary statistics by artist:
- The number of distinct tracks they have on the Billboard charts.
- The number of days between their earliest and latest entries.
Arrange the results in descending order of number of tracks, and display the top 10 artists.
### ANSWER HERE
billboard %>%
group_by(artist) %>%
summarise(
number_of_tracks = n_distinct(track),
duration = max(date.entered) - min(date.entered)
) %>%
arrange(-number_of_tracks) %>%
head(10)
Part (c)
Write a single pipeline (using the %>% operator) that computes the following summary statistics by track:
- The number of weeks the track spent on the charts.
- The median chart position.
- The number of weeks the track spent in the top 5 positions.
Then, in the same pipeline, arrange the tracks in descending order of the number of top-5 weeks and keep only the tracks with at least 10 such weeks.
### ANSWER HERE
billboard %>%
pivot_longer(cols = starts_with("wk"),
names_to = "week",
values_to = "position") %>%
group_by(track) %>%
drop_na() %>%
summarise(
number_of_weeks = n(),
median_position = median(position),
top_weeks = sum(position %in% 1:5)
) %>%
arrange(-top_weeks) %>%
filter(top_weeks >= 10)
Part (d)
Write a single pipeline (using the %>% operator) that produces a line chart for all tracks by artists who have at least 4 tracks on the Billboard charts (this should include three artists).
Use the following aesthetic mappings:
- x: week number (from 1 to 76)
- y: 101 - position (so higher lines indicate higher chart positions)
- colour: track
- linetype: artist
### ANSWER HERE
billboard %>%
add_count(artist) %>%
filter(n >= 4) %>%
pivot_longer(cols = starts_with("wk"),
names_to = "week",
values_to = "position") %>%
mutate(week = parse_number(week)) %>%
drop_na() %>%
ggplot(aes(x = week, y = 101 - position,
colour = track, linetype = artist)) +
geom_line()
In this question, we explore a random variable \(S\) defined as the sum of two independent components:
- a standard log-normal random variable, and
- a binomial random variable with size = 10 and prob = 0.3.
Below is an example of generating 20 observations from the distribution of \(S\):
x <- rlnorm(20) + rbinom(20, size = 10, prob = 0.3)
x
## [1] 7.352565 4.035512 4.387158 4.689155 7.143987 6.989220 4.469038
## [8] 2.413253 8.206732 5.260310 7.131627 5.630918 4.393899 6.554807
## [15] 8.791990 4.366791 5.503114 4.011534 11.516606 2.921527
Part (a)
Generate a sample of 10,000 observations of \(S\) (do not print the full sample), and estimate the empirical probability \(P(S < 1)\).
### ANSWER HERE
x <- rlnorm(10000) + rbinom(10000, size = 10, prob = 0.3)
mean(x < 1)
## [1] 0.0142
Part (b)
Plot a histogram of the 10,000 observations of \(S\) generated in part (a).
### ANSWER HERE
tibble(x) %>%
ggplot(aes(x = x)) +
geom_histogram(fill = "grey90", colour = "black")
Part (c)
Write a custom R function sample_median_s() that takes an argument sample_size, generates a random sample of that size from the distribution of \(S\), and returns the sample median.
### ANSWER HERE
### CORRECT ANSWER:
sample_median_s <- function(sample_size) {
x <- rlnorm(sample_size) +
rbinom(sample_size, size = 10, prob = 0.3)
median(x)
}
sample_median_s(73)
## [1] 4.234446
Part (d)
Simulate the median of a sample of 5 independent observations from \(S\), repeated multiple times. Plot the density estimate (geom_density()) of these medians.
Note: Depending on your computer’s performance, you can run this simulation between 500 and 20,000 times.
### ANSWER HERE
sample_median_s2 <- function(trial, sample_size = 5) {
x <- rlnorm(sample_size) +
rbinom(sample_size, size = 10, prob = 0.3)
tibble(trial, med = median(x))
}
1:20000 %>%
map_df(sample_median_s2) %>%
ggplot(aes(x = med)) +
geom_density(fill = "burlywood", color = "black")
In this question, we will use the UCBAdmissions dataset, which is available in base R. This dataset is stored as a 3-dimensional array. To work with it using tidyverse tools, we first convert it into a data frame (tibble) as follows:
UCBAdmissions %>%
as_tibble()
Part (a)
A statistician at the California Department of Education runs the following R code:
UCBAdmissions %>%
as_tibble() %>%
group_by(Gender) %>%
summarise(number_of_students = sum(n)) %>%
pull(number_of_students) %>%
chisq.test(p = c(0.5, 0.5))
##
## Chi-squared test for given probabilities
##
## data: .
## X-squared = 161.89, df = 1, p-value < 2.2e-16
Explain: what hypothesis is the statistician testing, and what does the result of chisq.test() tell us in this context?
ANSWER HERE:
The statistician is testing whether the number of male and female applicants to UC Berkeley is equally distributed. The null hypothesis is that the probability of a student being male or female is 0.5. Since the p-value is extremely small, the null hypothesis is rejected, indicating that there is a statistically significant difference in the number of male and female applicants.
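As an optional check (not part of the required answer), the reported X-squared value can be reproduced by hand from the observed gender totals and the expected counts under a 50/50 split:

# Optional check: recompute the goodness-of-fit statistic by hand
observed <- UCBAdmissions %>%
  as_tibble() %>%
  group_by(Gender) %>%
  summarise(number_of_students = sum(n)) %>%
  pull(number_of_students)
expected <- sum(observed) * c(0.5, 0.5)  # equal split under the null
sum((observed - expected)^2 / expected)  # matches X-squared = 161.89 above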
Part (b)
Next, the statistician runs the following code:
UCBAdmissions %>%
as_tibble() %>%
group_by(Admit, Dept) %>%
summarise(number_of_students = sum(n), .groups = "drop") %>%
pivot_wider(names_from = Dept, values_from = number_of_students) %>%
column_to_rownames("Admit") %>%
as.matrix() %>%
chisq.test()
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 778.91, df = 5, p-value < 2.2e-16
Explain: what hypothesis is the statistician testing, and what does the result of chisq.test() tell us in this context?
ANSWER HERE:
The statistician is testing whether admission decisions (Admit) are independent of the department (Dept). The null hypothesis is that admission rates are the same across all departments. Since the p-value is extremely small, the null hypothesis is rejected. This means there is strong evidence that admission rates vary significantly by department.
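As an optional illustration (not part of the required answer), the department-level admission rates behind this result can be computed directly:

# Optional illustration: admission rate by department
UCBAdmissions %>%
  as_tibble() %>%
  group_by(Dept) %>%
  summarise(admission_rate = sum(n[Admit == "Admitted"]) / sum(n))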
Part (c)
Finally, the statistician runs the following code:
UCBAdmissions %>%
as_tibble() %>%
ggplot(aes(sample = n)) +
geom_qq() +
geom_qq_line() +
facet_grid(Admit ~ Gender)
Explain: what is the statistician checking with these plots, and what conclusion can be drawn?
ANSWER HERE:
The statistician is visually assessing, using Q-Q plots, whether the number of students (by admission status and gender, across departments) follows a normal distribution. The plots do not show any strong deviations from the reference line, suggesting approximate normality. However, each panel contains only a handful of points (one per department), so this visual check has little power and only large deviations would be noticeable. The visual evidence is not strong enough to reject normality.
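As a quick aside supporting the small-sample caveat, each facet of the Q-Q plot contains only one point per department:

# Optional aside: number of points in each facet (one per department)
UCBAdmissions %>%
  as_tibble() %>%
  group_by(Admit, Gender) %>%
  summarise(points_per_panel = n(), .groups = "drop")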
Part (d)
Write a single pipeline (using the %>% operator) that:
- computes, for each department and gender, the admission rate
\[ \frac{\text{Number of Admitted Students}}{\text{Number of Admitted Students} + \text{Number of Rejected Students}} \]
- performs a t-test comparing the admission rates of female and male applicants.
Give your interpretation of the test results.
### ANSWER HERE
UCBAdmissions %>%
as_tibble() %>%
pivot_wider(names_from = Admit, values_from = n) %>%
mutate(admission_rate = Admitted / (Admitted + Rejected)) %>%
t.test(admission_rate ~ Gender, data = .)
##
## Welch Two Sample t-test
##
## data: admission_rate by Gender
## t = 0.24772, df = 9.3979, p-value = 0.8097
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -0.2906653 0.3626712
## sample estimates:
## mean in group Female mean in group Male
## 0.4172692 0.3812662
Interpretation:
Since the p-value is large, we do not reject the null hypothesis. There is no statistically significant evidence that the average admission rates differ between male and female students across departments.
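As an optional illustration of what the t-test compares, the department-level admission rates can be placed side by side by gender:

# Optional illustration: department-level admission rates by gender
UCBAdmissions %>%
  as_tibble() %>%
  pivot_wider(names_from = Admit, values_from = n) %>%
  mutate(admission_rate = Admitted / (Admitted + Rejected)) %>%
  select(Dept, Gender, admission_rate) %>%
  pivot_wider(names_from = Gender, values_from = admission_rate)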
In this question, we will use the diamonds dataset, which is available in tidyverse.
diamonds %>% head()
Note that cut is an ordinal variable (ordered factor):
diamonds %>% pull(cut) %>% unique()
## [1] Ideal Premium Good Very Good Fair
## Levels: Fair < Good < Very Good < Premium < Ideal
If we want to disregard the ordering of the categories, we can convert cut to a character vector:
diamonds %>% pull(cut) %>% as.character() %>% unique() %>% sort()
## [1] "Fair" "Good" "Ideal" "Premium" "Very Good"
Part (a)
A bench jeweler in Amsterdam runs the following R command to see if table correlates with depth:
cor(diamonds$table, diamonds$depth)
## [1] -0.2957785
The correlation is negative, but he is unsure whether it is statistically significant. Find the \(p\)-value for the correlation coefficient between table and depth, and explain what it means, i.e., whether the negative correlation is statistically significant.
### ANSWER HERE
cor.test(diamonds$table, diamonds$depth)
##
## Pearson's product-moment correlation
##
## data: diamonds$table and diamonds$depth
## t = -71.911, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3034601 -0.2880585
## sample estimates:
## cor
## -0.2957785
Interpretation: Since the \(p\)-value is very small, the negative correlation is statistically significant.
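As an optional check, the t statistic reported by cor.test() can be reproduced from the correlation itself using \(t = r\sqrt{n-2}/\sqrt{1-r^2}\):

# Optional check: recompute the t statistic from r and n
r <- cor(diamonds$table, diamonds$depth)
n <- nrow(diamonds)
r * sqrt(n - 2) / sqrt(1 - r^2)  # approximately -71.9, matching cor.test()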
Part (b)
Produce a correlation diagram, i.e., a plot of pairwise correlations between all numeric variables in diamonds:
### ANSWER HERE
diamonds %>% ggcorr()
Part (c)
Fit a linear regression model that explains price using x, y, and z as independent variables. Print the table of coefficients and their significance levels:
### ANSWER HERE
diamonds %>%
lm(price ~ x + y + z, .) %>%
stargazer(type = "text")
##
## =================================================
## Dependent variable:
## -----------------------------
## price
## -------------------------------------------------
## x 2,790.321***
## (40.985)
##
## y 218.788***
## (31.564)
##
## z 225.910***
## (47.574)
##
## Constant -14,113.050***
## (41.763)
##
## -------------------------------------------------
## Observations 53,940
## R2 0.783
## Adjusted R2 0.783
## Residual Std. Error 1,860.421 (df = 53936)
## F Statistic 64,698.040*** (df = 3; 53936)
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Part (d)
The bench jeweler runs the following R command:
diamonds %>%
mutate(cut = as.character(cut)) %>%
lm(price ~ cut, .) %>%
stargazer(type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## price
## -----------------------------------------------
## cutGood -429.893***
## (113.849)
##
## cutIdeal -901.216***
## (102.412)
##
## cutPremium 225.500**
## (104.395)
##
## cutVery Good -376.998***
## (105.164)
##
## Constant 4,358.758***
## (98.788)
##
## -----------------------------------------------
## Observations 53,940
## R2 0.013
## Adjusted R2 0.013
## Residual Std. Error 3,963.847 (df = 53935)
## F Statistic 175.689*** (df = 4; 53935)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
To his surprise, he finds that the mean prices for Good, Ideal, and Very Good cuts are much lower than that for Fair cut, and that these differences are statistically significant.
Explain: is there anything wrong with his analysis, and how can this counterintuitive result be explained?
ANSWER HERE: His analysis is actually correct. The most likely explanation is that diamonds of Fair cut simply tend to be larger. This is confirmed by checking the mean prices and mean weights in carats for all cuts:
diamonds %>%
group_by(cut) %>%
summarise(
n_obs = n(),
mean_price = mean(price),
mean_carat = mean(carat)
)
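If the size explanation is right, controlling for carat should remove the apparent advantage of the Fair cut. A possible follow-up sketch (not required by the question, and not a claim about the exact coefficients) refits the model with carat included:

# Optional follow-up: refit the price model controlling for carat
diamonds %>%
  mutate(cut = as.character(cut)) %>%
  lm(price ~ carat + cut, data = .) %>%
  stargazer(type = "text")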