This examination paper contains FIVE (5) questions.
Answer ALL questions. The marks for each question are indicated at its beginning.
Answer each question in the space provided in this notebook. You may insert new R code chunks as appropriate.
This is an OPEN BOOK examination. You are allowed to use any reference materials, such as course materials, your notes, and tidyverse cheatsheets. All reference materials should be printed in advance and referred to from hard copies. During the examination, only RStudio should be on your screen. Online searches and generative AI tools are strictly prohibited.
Click Run All and knit this document to ensure that all required libraries are installed and all R code runs correctly. If any code fails, seek assistance immediately.
Once you have confirmed that the R code runs and the notebook knits successfully, disable WiFi and remain offline for the rest of the examination. This applies to all devices in your possession.
You are required to bring your own laptop, fully charged, to the examination.
Do not view the exam questions until instructed to begin.
At the end of the exam, knit your notebook to PDF. Alternatively, you may knit to HTML, open it in a browser, and print it to PDF.
Ensure that all R code runs (comment out any code that doesn’t). All outputs and plots must be visible in the final PDF.
At the end of the exam, you will be required to enable WiFi again and submit the resulting PDF as your final answer script to NTULearn.
# Load required packages
library(tidyverse)
library(zoo)
library(janitor)
library(psych)
library(GGally)
library(stargazer)
theme_set(theme_minimal())
# Replace the number 73 below with the numeric part
# of your matriculation number or your passport number:
set.seed(73)
Part (a)
Write a single R command to compute the following sum:
\[ \sum_{k=1}^{30}\log(3k+2) = \log 5 + \log 8 + \log{11} + \cdots + \log{92} \]
# ANSWER HERE
### FIRST METHOD
sum(log(3 * (1:30) + 2))
### SECOND METHOD
seq(5, 92, 3) %>% log %>% sum
## [1] 110.0047
## [1] 110.0047
Part (b)
Given a numeric vector x, write a single R command that counts how many entries in x lie between \(-1\) and \(1\), inclusive. Your command should work for any numeric vector x, not just the one generated below:
x <- rnorm(20)
x
#### ANSWER BELOW
### FIRST METHOD
sum(x >= -1 & x <= 1)
### SECOND METHOD
sum((x >= -1) * (x <= 1))
### THIRD METHOD
tibble(x) %>% filter(x <= 1 & x >= -1) %>% nrow()
## [1] -0.1450469 0.2913580 0.0937974 -0.1273679 -0.8468310 0.1411659
## [7] 1.5374285 2.7388478 1.5511972 0.4615500 -0.8099335 0.9053432
## [13] 1.3198469 0.5033341 -0.5218265 -1.3072612 1.0914046 -0.1601435
## [19] 0.8653942 -0.8711010
## [1] 14
## [1] 14
## [1] 14
Part (c)
Write a single R command that creates a numeric vector x of length n whose k-th entry is \(k^2\) when \(k\) is odd and \(-k\) when \(k\) is even.
For example, when \(n = 6\), the vector should be:
\(x = (1, -2, 9, -4, 25, -6)\)
Your code should work for any n, not just n = 20.
n <- 20
### ANSWER BELOW
### FIRST METHOD
((1:n) %% 2) * (1:n)^2 - ((0:(n-1)) %% 2) * (1:n)
### SECOND METHOD
ifelse((1:n) %% 2 == 0, -(1:n), (1:n)^2)
### THIRD METHOD
1:n %>% map_vec(~ifelse(. %% 2, .^2, -.))
## [1] 1 -2 9 -4 25 -6 49 -8 81 -10 121 -12 169 -14 225 -16 289 -18 361
## [20] -20
## [1] 1 -2 9 -4 25 -6 49 -8 81 -10 121 -12 169 -14 225 -16 289 -18 361
## [20] -20
## [1] 1 -2 9 -4 25 -6 49 -8 81 -10 121 -12 169 -14 225 -16 289 -18 361
## [20] -20
Part (d)
Using the built-in mtcars dataset, write a single R command (or a single pipeline using the %>% operator) that computes the mean horsepower (hp) of cars with a 4-cylinder engine (cyl == 4).
### ANSWER BELOW
### FIRST METHOD
mtcars %>%
filter(cyl == 4) %>%
pull(hp) %>%
mean()
### SECOND METHOD
mean(mtcars$hp[mtcars$cyl == 4])
## [1] 82.63636
## [1] 82.63636
This question uses the billboard dataset, which is available by default when you load the tidyverse package.
head(billboard)
For reference, here is the number of missing values across all weeks:
billboard %>%
summarise(across(starts_with("wk"), ~ sum(is.na(.))))
Part (a)
Plot the total number of songs that entered the Billboard charts each month as a line chart. The \(x\)-axis should represent the month, and the \(y\)-axis the number of songs.
### ANSWER HERE
billboard %>%
mutate(month = as.yearmon(date.entered)) %>%
count(month) %>%
ggplot(aes(x = month, y = n)) +
geom_line()
Part (b)
Write a single pipeline (using the %>% operator) to compute the following summary statistics by artist:
- The number of distinct tracks they have on the Billboard charts.
- The number of days between their earliest and latest entries.
Arrange the results in descending order of number of tracks, and display the top 10 artists.
### ANSWER HERE
billboard %>%
group_by(artist) %>%
summarise(
number_of_tracks = n_distinct(track),
duration = max(date.entered) - min(date.entered)
) %>%
arrange(-number_of_tracks) %>%
head(10)
Part (c)
Write a single pipeline (using the %>% operator) that computes the following summary statistics by track:
- The number of weeks the track spent on the charts.
- The median chart position.
- The number of weeks the track spent in the top 5 positions.
Then, in the same pipeline, arrange the tracks in descending order of the number of top-5 weeks and keep only the tracks with at least 10 such weeks.
### ANSWER HERE
billboard %>%
pivot_longer(cols = starts_with("wk"),
names_to = "week",
values_to = "position") %>%
group_by(track) %>%
drop_na() %>%
summarise(
number_of_weeks = n(),
median_position = median(position),
top_weeks = sum(position %in% 1:5)
) %>%
arrange(-top_weeks) %>%
filter(top_weeks >= 10)
Part (d)
Write a single pipeline (using the %>% operator) that produces a line chart for all tracks by artists who have at least 4 tracks on the Billboard charts (this should include three artists).
Use the following aesthetic mappings:
- x: week number (from 1 to 76)
- y: 101 - position (so higher lines indicate higher chart positions)
- colour: track
- linetype: artist
### ANSWER HERE
billboard %>%
add_count(artist) %>%
filter(n >= 4) %>%
pivot_longer(cols = starts_with("wk"),
names_to = "week",
values_to = "position") %>%
mutate(week = parse_number(week)) %>%
drop_na() %>%
ggplot(aes(x = week, y = 101 - position,
colour = track, linetype = artist)) +
geom_line()
In this question, we explore a random variable \(S\) defined as the sum of two independent components:
- a standard log-normal random variable, and
- a binomial random variable with size = 10 and prob = 0.3.
Below is an example of generating 20 observations from the distribution of \(S\):
x <- rlnorm(20) + rbinom(20, size = 10, prob = 0.3)
x
## [1] 7.352565 4.035512 4.387158 4.689155 7.143987 6.989220 4.469038
## [8] 2.413253 8.206732 5.260310 7.131627 5.630918 4.393899 6.554807
## [15] 8.791990 4.366791 5.503114 4.011534 11.516606 2.921527
Part (a)
Generate a sample of 10,000 observations of \(S\) (do not print the full sample), and estimate the empirical probability \(P(S < 1)\).
### ANSWER HERE
x <- rlnorm(10000) + rbinom(10000, size = 10, prob = 0.3)
mean(x < 1)
## [1] 0.0142
Part (b)
Plot a histogram of the 10,000 observations of \(S\) generated in part (a).
### ANSWER HERE
tibble(x) %>%
ggplot(aes(x = x)) +
geom_histogram(fill = "grey90", colour = "black")
Part (c)
Write a custom R function sample_median_s() that takes an argument sample_size, generates a random sample of that size from the distribution of \(S\), and returns the sample median.
### ANSWER HERE
### CORRECT ANSWER:
sample_median_s <- function(sample_size) {
x <- rlnorm(sample_size) +
rbinom(sample_size, size = 10, prob = 0.3)
median(x)
}
sample_median_s(73)
## [1] 4.234446
Part (d)
Simulate the median of a sample of 5 independent observations from \(S\), repeated multiple times. Plot the density estimate (geom_density()) of these medians.
Note: Depending on your computer’s performance, you can run this simulation between 500 and 20,000 times.
### ANSWER HERE
sample_median_s2 <- function(trial, sample_size = 5) {
x <- rlnorm(sample_size) +
rbinom(sample_size, size = 10, prob = 0.3)
tibble(trial, med = median(x))
}
1:20000 %>%
map_df(sample_median_s2) %>%
ggplot(aes(x = med)) +
geom_density(fill = "burlywood", color = "black")
In this question, we will use the UCBAdmissions dataset, which is available in base R. This dataset is stored as a 3-dimensional array. To work with it using tidyverse tools, we first convert it into a data frame (tibble) as follows:
UCBAdmissions %>%
as_tibble()
Part (a)
A statistician at the California Department of Education runs the following R code:
UCBAdmissions %>%
as_tibble() %>%
group_by(Gender) %>%
summarise(number_of_students = sum(n)) %>%
pull(number_of_students) %>%
chisq.test(p = c(0.5, 0.5))
##
## Chi-squared test for given probabilities
##
## data: .
## X-squared = 161.89, df = 1, p-value < 2.2e-16
Explain: what hypothesis is the statistician testing, and what does the result of chisq.test() tell us in this context?
ANSWER HERE:
The statistician is testing whether the number of male and female applicants to UC Berkeley is equally distributed. The null hypothesis is that the probability of a student being male or female is 0.5. Since the p-value is extremely small, the null hypothesis is rejected, indicating that there is a statistically significant difference in the number of male and female applicants.
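As an optional check (not part of the required answer), the reported X-squared value can be reproduced by hand from the observed gender totals and the expected counts under a 50/50 split:

# Optional check: recompute the goodness-of-fit statistic by hand
observed <- UCBAdmissions %>%
  as_tibble() %>%
  group_by(Gender) %>%
  summarise(number_of_students = sum(n)) %>%
  pull(number_of_students)
expected <- sum(observed) * c(0.5, 0.5)  # equal split under the null
sum((observed - expected)^2 / expected)  # matches X-squared = 161.89 above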
Part (b)
Next, the statistician runs the following code:
UCBAdmissions %>%
as_tibble() %>%
group_by(Admit, Dept) %>%
summarise(number_of_students = sum(n), .groups = "drop") %>%
pivot_wider(names_from = Dept, values_from = number_of_students) %>%
column_to_rownames("Admit") %>%
as.matrix() %>%
chisq.test()
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 778.91, df = 5, p-value < 2.2e-16
Explain: what hypothesis is the statistician testing, and what does the result of chisq.test() tell us in this context?
ANSWER HERE:
The statistician is testing whether admission decisions (Admit) are independent of the department (Dept). The null hypothesis is that admission rates are the same across all departments. Since the p-value is extremely small, the null hypothesis is rejected. This means there is strong evidence that admission rates vary significantly by department.
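As an optional illustration (not part of the required answer), the department-level admission rates behind this result can be computed directly:

# Optional illustration: admission rate by department
UCBAdmissions %>%
  as_tibble() %>%
  group_by(Dept) %>%
  summarise(admission_rate = sum(n[Admit == "Admitted"]) / sum(n))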
Part (c)
Finally, the statistician runs the following code:
UCBAdmissions %>%
as_tibble() %>%
ggplot(aes(sample = n)) +
geom_qq() +
geom_qq_line() +
facet_grid(Admit ~ Gender)
Explain: what is the statistician checking with these plots, and what conclusion can be drawn?
ANSWER HERE:
The statistician is visually assessing, using Q-Q plots, whether the number of students (by admission status and gender, across departments) follows a normal distribution. The plots do not show any strong deviations from the reference line, suggesting approximate normality. However, each panel contains only a handful of points (one per department), so this visual check has little power and only large deviations would be noticeable. The visual evidence is not strong enough to reject normality.
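As a quick aside supporting the small-sample caveat, each facet of the Q-Q plot contains only one point per department:

# Optional aside: number of points in each facet (one per department)
UCBAdmissions %>%
  as_tibble() %>%
  group_by(Admit, Gender) %>%
  summarise(points_per_panel = n(), .groups = "drop")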
Part (d)
Write a single pipeline (using the %>% operator) that:
- computes, for each department and gender, the admission rate
\[ \frac{\text{Number of Admitted Students}}{\text{Number of Admitted Students} + \text{Number of Rejected Students}} \]
- performs a t-test comparing the admission rates of female and male applicants.
Give your interpretation of the test results.
### ANSWER HERE
UCBAdmissions %>%
as_tibble() %>%
pivot_wider(names_from = Admit, values_from = n) %>%
mutate(admission_rate = Admitted / (Admitted + Rejected)) %>%
t.test(admission_rate ~ Gender, data = .)
##
## Welch Two Sample t-test
##
## data: admission_rate by Gender
## t = 0.24772, df = 9.3979, p-value = 0.8097
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -0.2906653 0.3626712
## sample estimates:
## mean in group Female mean in group Male
## 0.4172692 0.3812662
Interpretation:
Since the p-value is large, we do not reject the null hypothesis. There is no statistically significant evidence that the average admission rates differ between male and female students across departments.
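As an optional illustration of what the t-test compares, the department-level admission rates can be placed side by side by gender:

# Optional illustration: department-level admission rates by gender
UCBAdmissions %>%
  as_tibble() %>%
  pivot_wider(names_from = Admit, values_from = n) %>%
  mutate(admission_rate = Admitted / (Admitted + Rejected)) %>%
  select(Dept, Gender, admission_rate) %>%
  pivot_wider(names_from = Gender, values_from = admission_rate)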
In this question, we will use the diamonds dataset, which is available in tidyverse.
diamonds %>% head()
Note that cut is an ordinal variable (ordered factor):
diamonds %>% pull(cut) %>% unique()
## [1] Ideal Premium Good Very Good Fair
## Levels: Fair < Good < Very Good < Premium < Ideal
If we want to disregard the ordering of the categories, we can convert cut to a character vector:
diamonds %>% pull(cut) %>% as.character() %>% unique() %>% sort()
## [1] "Fair" "Good" "Ideal" "Premium" "Very Good"
Part (a)
A bench jeweler in Amsterdam runs the following R command to see if table correlates with depth:
cor(diamonds$table, diamonds$depth)
## [1] -0.2957785
The correlation is negative, but he is unsure whether it is statistically significant. Find the \(p\)-value for the correlation coefficient between table and depth, and explain what it means, i.e., whether the negative correlation is statistically significant.
### ANSWER HERE
cor.test(diamonds$table, diamonds$depth)
##
## Pearson's product-moment correlation
##
## data: diamonds$table and diamonds$depth
## t = -71.911, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3034601 -0.2880585
## sample estimates:
## cor
## -0.2957785
Interpretation: Since the \(p\)-value is very small, the negative correlation is statistically significant.
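As an optional check, the t statistic reported by cor.test() can be reproduced from the correlation itself using \(t = r\sqrt{n-2}/\sqrt{1-r^2}\):

# Optional check: recompute the t statistic from r and n
r <- cor(diamonds$table, diamonds$depth)
n <- nrow(diamonds)
r * sqrt(n - 2) / sqrt(1 - r^2)  # approximately -71.9, matching cor.test()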
Part (b)
Produce a correlation diagram, i.e., a plot of pairwise correlations between all numeric variables in diamonds:
### ANSWER HERE
diamonds %>% ggcorr()
Part (c)
Fit a linear regression model that explains price using x, y, and z as independent variables. Print the table of coefficients and their significance levels:
### ANSWER HERE
diamonds %>%
lm(price ~ x + y + z, .) %>%
stargazer(type = "text")
##
## =================================================
## Dependent variable:
## -----------------------------
## price
## -------------------------------------------------
## x 2,790.321***
## (40.985)
##
## y 218.788***
## (31.564)
##
## z 225.910***
## (47.574)
##
## Constant -14,113.050***
## (41.763)
##
## -------------------------------------------------
## Observations 53,940
## R2 0.783
## Adjusted R2 0.783
## Residual Std. Error 1,860.421 (df = 53936)
## F Statistic 64,698.040*** (df = 3; 53936)
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Part (d)
The bench jeweler runs the following R command:
diamonds %>%
mutate(cut = as.character(cut)) %>%
lm(price ~ cut, .) %>%
stargazer(type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## price
## -----------------------------------------------
## cutGood -429.893***
## (113.849)
##
## cutIdeal -901.216***
## (102.412)
##
## cutPremium 225.500**
## (104.395)
##
## cutVery Good -376.998***
## (105.164)
##
## Constant 4,358.758***
## (98.788)
##
## -----------------------------------------------
## Observations 53,940
## R2 0.013
## Adjusted R2 0.013
## Residual Std. Error 3,963.847 (df = 53935)
## F Statistic 175.689*** (df = 4; 53935)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
To his surprise, he finds that the mean prices for Good, Ideal, and Very Good cuts are much lower than that for Fair cut, and that these differences are statistically significant.
Explain: is there anything wrong with his analysis, and how can this counterintuitive result be explained?
ANSWER HERE: His analysis is actually correct. The most likely explanation is that diamonds of Fair cut simply tend to be larger. This is confirmed by checking the mean prices and mean weights in carats for all cuts:
diamonds %>%
group_by(cut) %>%
summarise(
n_obs = n(),
mean_price = mean(price),
mean_carat = mean(carat)
)
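If the size explanation is right, controlling for carat should remove the apparent advantage of the Fair cut. A possible follow-up sketch (not required by the question, and not a claim about the exact coefficients) refits the model with carat included:

# Optional follow-up: refit the price model controlling for carat
diamonds %>%
  mutate(cut = as.character(cut)) %>%
  lm(price ~ carat + cut, data = .) %>%
  stargazer(type = "text")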