MATH1062/MATH1005 - Statistics Assignment 2

Author

Student

library(tidyverse)

Instructions

There are two sections to this assignment, each with multiple parts, covering the material discussed in lectures, exercises and tutorials up to the end of Week 9.
You should only use R functions which have been used in lectures, exercises or tutorials. This includes base R code or the Tidyverse library (ggplot, dplyr). This should be sufficient to complete any task in this assignment. If you use functions which have not been used in lectures or tutorials, you need to provide written justification of why this was necessary, otherwise marks will be deducted.
Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Do NOT add code blocks. Only work in the space provided.
All answers are either typed into code blocks under the comment # Your code here: ... or typed after the prompt: “Your written answer:“ Start your written answer in the next line after the prompt, like this:

Your written answer:

Start your answer here…..
Submission: Upon completion, you must render this worksheet (using Render in R Studio) into an html file and submit the html file. Your html file MUST contain all the R code you have written in the worksheet.

1. Old Faithful Geyser 🌋

The Old Faithful Geyser is a cone geyser in Yellowstone National Park in Wyoming, USA. It is a highly predictable geothermal feature and has erupted every 44 minutes to two hours since 2000. Read more about this natural wonder here.

You’re part of a research team which is studying the mean waiting time between eruptions at Old Faithful. Your team has collected a sample of waiting times between eruptions stored in the variable eruption_sample.

Your task in this first question is to test whether the mean waiting time between eruptions differs from 70.89 minutes at the \(\alpha = 0.05\) significance level.

# Run this code cell, do not change the data
eruption_sample <- c(81, 51, 63, 54, 62, 58, 51, 77, 51, 94, 82, 67, 53, 49, 54, 56, 81, 75, 78, 54)

Compute the mean of eruption_sample and save it in a variable named sample_mean.

# Your code here
sample_mean <- mean(eruption_sample)
sample_mean

[1] 64.55

Use two (different) appropriate graphical summaries to check assumptions and decide on an appropriate test of means. In your written answer, give a brief description of your graphical summaries and how you use them as justification for your choice of test.

# Your code here
library(ggplot2)
library(tibble)

# Histogram to check shape and outliers
ggplot(tibble(eruption_sample), aes(x = eruption_sample)) +
  geom_histogram(binwidth = 5, colour = 'black', fill = 'lightblue') +
  labs(
    title = 'Histogram of Waiting Times Between Eruptions',
    x = 'Waiting time (minutes)',
    y = 'Frequency'
  )

# QQ plot to check normality
ggplot(tibble(eruption_sample), aes(sample = eruption_sample)) +
  stat_qq() +
  stat_qq_line(colour = 'red') +
  labs(
    title = 'QQ Plot of Waiting Times Between Eruptions',
    x = 'Theoretical quantiles',
    y = 'Sample quantiles'
  )

Your written answer here:

Two graphs were used to check if a t-test was suitable.

The histogram shows a roughly bell-shaped distribution with no clear outliers, suggesting the data are approximately normal.

The QQ plot confirms this, as most points follow the straight line closely.

A QQ plot is preferred over a scatter plot because it directly compares the data to a normal distribution, making it better for assessing normality.

Since the data appear roughly normal and the population standard deviation is unknown, a two-tailed one-sample t-test is appropriate.

Compute and print the test statistic.

# Your code here
n <- length(eruption_sample)
xbar <- mean(eruption_sample)
s <- sd(eruption_sample)
mu0 <- 70.89

t_stat <- (xbar - mu0) / (s / sqrt(n))
t_stat

[1] -2.074584

Compute the p-value for the test. (Optional: plot the test distribution)

# Your code here
df <- n - 1
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
p_value

[1] 0.05185399

# Your code here (if you want a separate code cell for p-value from test-distribution plot)
curve(dt(x, df = df), from = -4, to = 4,
      main = 't-distribution with rejection regions',
      xlab = 't', ylab = 'Density')
abline(v = c(-abs(t_stat), abs(t_stat)), col = 'red', lty = 2)

Compute a 95% confidence interval for the mean waiting time between eruptions.

# Your code here
alpha <- 0.05
df <- n - 1
t_crit <- qt(1 - alpha/2, df = df)
se <- s / sqrt(n)

ci_lower <- xbar - t_crit * se
ci_upper <- xbar + t_crit * se
c(ci_lower, ci_upper)

[1] 58.15365 70.94635

State the conclusion of the test based on both the \(p\)-value and the confidence interval.

Your written answer here:

The p-value is greater than 0.05, so there is not enough evidence to reject the null hypothesis. The 95% confidence interval also includes 70.89 minutes, giving the same conclusion. Therefore, the mean waiting time between eruptions is not significantly different from 70.89 minutes at the 5% significance level.

2. Calculus and Stats in action! 🧪

A new medication is being tested to understand how quickly it leaves the bloodstream. After a single dose, researchers record the concentration of the drug in the blood at regular intervals. Run the code in the following cell to load the data from this experiment into the data frame called medication.

# Run this cell to load the data, DO NOT CHANGE THIS CELL.
set.seed(1)
Time <- seq(0, 10, length.out = 50)
true_conc <- 100 * exp(-0.4 * Time)        
noise <- rlnorm(length(Time), meanlog = 0, sdlog = 0.3) 
Concentration <- true_conc * noise 
Concentration <- ifelse(Concentration < 0.5, 0.5, Concentration)

# Create data frame
medication <- tibble(Time, Concentration)

Make a scatter plot of the variables Time vs. Concentration from the medication data frame. Ensure your plot has a title.

# Your code here
ggplot(medication, aes(x = Time, y = Concentration)) +
  geom_point() +
  labs(
    title = 'Concentration of Drug over time',
    x = 'Time',
    y = 'Concentration'
  )

This data is an example of exponential decay. A mathematical model which fits this data has the form

\[ C(t) = C_0 e^{-kt} \]

where \(t\) denotes Time, \(C_0\) is the initial concentration and \(k > 0\) is the rate of decay. It is a general solution to the differential equation:

\[ \frac{dC}{dt} = -k C(t) \]

Note: the above comment is just to demonstrate the connection between the two topics, and is not relevant to doing this question (so you don’t need to be doing Calculus to do this question).

In this part of the assignment, we will use our knowledge of linear regression to solve for the values of \(C_0\) and \(k\) which fit this data.

Run the following code cell to create a new variable called LogConcentration whose values are the natural logarithm of the values in the Concentration variable. (Note that the log() function in R is the natural log and exp() is the exponential function)

# Run this cell
medication <- medication %>% 
  mutate(LogConcentration = log(Concentration))

Make a scatter plot of Time vs. LogConcentration. Describe the overall pattern you observe in the scatter plot. Consider the direction, form, and strength of the relationship.

# Your code here
ggplot(medication, aes(x = Time, y = LogConcentration)) +
  geom_point() +
  labs(
    title = 'Scatter plot of Time vs Log(Concentration)',
    x = 'Time (hours)',
    y = 'Log(Concentration)'
  )

Your written answer here:

The scatter plot shows a strong, negative, linear relationship between Time and Log(Concentration). As Time increases, Log(Concentration) decreases in a nearly straight line, meaning the relationship is linear and very strong. This suggests that a linear model will fit the data well.

Make a linear model of LogConcentration (dependent variable) by Time (independent variable) and display the coefficients of the model.

# Your code here
lm_fit <- lm(LogConcentration ~ Time, data = medication)
summary(lm_fit)


Call:
lm(formula = LogConcentration ~ Time, data = medication)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.69672 -0.14086  0.00614  0.18661  0.44439 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.63993    0.07022   66.08   <2e-16 ***
Time        -0.40093    0.01210  -33.13   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.252 on 48 degrees of freedom
Multiple R-squared:  0.9581,    Adjusted R-squared:  0.9572 
F-statistic:  1098 on 1 and 48 DF,  p-value: < 2.2e-16

Use the linear model solve for \(C_0\) and \(k\) in the equation (explain your working):

\[ C(t) = C_0 e^{-kt} \]

(Hint: Take the logarithm of both sides of the above equation).

# Your code here
b0 <- coef(lm_fit)[1]  # intercept
b1 <- coef(lm_fit)[2]  # slope

C0_hat <- exp(b0)
k_hat <- -b1

C0_hat

(Intercept) 
   103.5375

k_hat

     Time 
0.4009258

Your written answer here:

After taking the natural log of the equation, it becomes log(C) = log(C₀) – k × Time. Here, the intercept equals log(C₀) and the slope equals –k. So, C₀ is found by taking e to the power of the intercept, and k is the negative of the slope. This gives a model showing that drug concentration decreases exponentially over time.