Instruction

There are several questions. Each question may contain multiple tasks. To receive a full mark in this part, you should correctly solve all tasks, justify your solution in the space provided in case necessary, and add appropriate labels to your graphical summaries.

Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.

Format: All assignment tasks have either a field for writing embedded R code, an answer field marked by the prompt Answer to Task x.x, or both. You should enter your solution either as embedded R code or as text after the prompt Answer to Task x.x.

Submission: Upon completion, you must render this worksheet (using Knit in R Studio) into an html file and submit the html file. Make sure the file extension “html” is in lower case. Your html file MUST contain all the R code you have written in the worksheet.

Import data

In this assignment, we will use daily weather and climate data collected by the Bureau of Meteorology at Canterbury Racecourse AWS {station 066194}. The data are collected daily in March, 2023.

We will use the following variables.

X3pm.Temperature (daily temperature measured at 3 pm)
Maximum.temperature (maximum daily temperature)
X9am.Temperature (daily temperature measured at 9 am)
Minimum.temperature (minimum daily temperature)
Rainfall (total daily rainfall)

The temperature data are measured in Celsius and the rainfall data are measured in millimeter (mm). Please beware that the variable names are case sensitive.

Download the data file CanterburyMarch2023.csv in your data folder within your MATH1062 folder. This R Markdown file (Assignment2Worksheet.Rmd) should also be saved under your MATH1062 folder.

Then import the csv file into a variable called data:

### write your code here. Here is a sample solution 
data = read.csv("CanterburyMarch2023.csv", header = T)
### the following displays the dimenison of the data
dim(data)

## [1] 31 21

names(data)

##  [1] "Date"                              "Minimum.temperature"              
##  [3] "Maximum.temperature"               "Rainfall"                         
##  [5] "Evaporation"                       "Sunshine..hours."                 
##  [7] "Direction.of.maximum.wind.gust"    "Speed.of.maximum.wind.gust..km.h."
##  [9] "Time.of.maximum.wind.gust"         "X9am.Temperature"                 
## [11] "X9am.relative.humidity...."        "X9am.cloud.amount..oktas."        
## [13] "X9am.wind.direction"               "X9am.wind.speed..km.h."           
## [15] "X9am.MSL.pressure..hPa."           "X3pm.Temperature"                 
## [17] "X3pm.relative.humidity...."        "X3pm.cloud.amount..oktas."        
## [19] "X3pm.wind.direction"               "X3pm.wind.speed..km.h."           
## [21] "X3pm.MSL.pressure..hPa."

If you save the data file and the worksheet correctly, you should be able to load the data file and see its dimension and variable names.

Task: How many observations are there? How many variables are there?

Answer: This is the sample solution. There are 31 observations and 21 variables.

====START OF ASSIGNMENT QUESTIONS====

1 Linear model

There are four tasks in this question.

`Task 1.1`

Produce a scatter plot for the the daily temperature observed at 3 pm (X3pm.Temperature) and the observed daily maximum temperature (Maximum.temperature), with X3pm.Temperature ($X$) on the horizontal axis and Maximum.temperature ($Y$) on the vertical axis.

Produce another scatter plot for the daily temperature observed at 9 am (X9am.Temperature) and the observed daily minimum temperature (Minimum.temperature), with X9am.Temperature ($X$) on the horizontal axis and Minimum.temperature ($Y$) on the vertical axis.

Place two plots side-by-side and correctly label them. Comment on and compare these associations based on the scatter plot. Write your comment after the Answer to Task 1.1 prompt provided below.

### Code for Task 1.1. Write your code here
###
T3pm = data$X3pm.Temperature
maxT = data$Maximum.temperature
T9am = data$X9am.Temperature
minT = data$Minimum.temperature
plot1 = plot(T3pm, maxT, main = "Association between the daily temperature observed at 3pm and daily maximum temperature", xlab = "Daily temperature observed at 3pm", ylab = "Daily maximum temperature")

plot2 = plot(T9am, minT, main = "Association between the daily temperature observed at 9am and daily minimum temperature", xlab = "Daily temperature observed at 9am", ylab = "Daily minimum temperature")

Answer to Task 1.1: write your answer here. - For the scatter plot showing the association between the daily temperature observed at 3pm and the observed daily maximum temperature. The plot shows a positive association, meaning when daily temperature observed at 3pm, the daily maximum temperature tends to increase as well. The points also seem to cluster around a line, which indicate a strong association between the variables. - For the second scatter plot, there seems to be no association between the variables because the points seem randomly scattered with no pattern detected.

`Task 1.2`

After rounding to two decimal places, the rounded sample SDs of X9am.Temperature and Minimum.temperature are $SD_X=1.94$ and $SD_Y=2.56$, respectively. The sample means of X9am.Temperature and Minimum.temperature are $\bar X = 21.43$ and $\bar Y = 16.89$, respectively. The rounded correlation coefficient between X9am.Temperature and Minimum.temperature is $r=0.55$.

Derive the intercept and the slope of the regression line (round to four decimal places) for predicting daily minimum temperature given the temperature observed at 9 am. Write your answer after the Answer to Task 1.2 prompt provided below.

### Below are rounded sample means, sample SDs, and r
###
Xbar = 21.43 # sample mean of X
Ybar = 16.89 # sample mean of Y

SDX = 1.94 # sample mean of X
SDY = 2.56 # sample mean of Y
r = 0.55 # correlation coefficient

### Code for Task 1.2. you can use R as a calculator, write your code here
###
slope2 = r*(SDY/SDX)
slope2

## [1] 0.7257732

intercept2 = Ybar - slope2*Xbar
intercept2

## [1] 1.33668

Answer to Task 1.2: Rounding to FOUR decimal places, the slope is (0.7258) and the intercept is (1.3367).

`Task 1.3`

Using the function lm(), build a linear model for predicting daily maximum temperature (Maximum.temperature) given the temperature observed at 3 pm (X3pm.Temperature).
Produce a scatter plot for X3pm.Temperature and Maximum.temperature.
Plot the resulting regression line on top of the scatter plot using abline().
Predict the value of daily maximum temperature given a value of $X=33$ for the temperature observed at 3 pm.
Use the function points(X, Y, col="red", cex=3, pch=19) to plot the predict value $Y$ (together with the predictor $X$), where the options col="red", cex=3, pch=19 specify the color, the marker size, and the mark type, respectively.

### Code for Task 1.3. Write your code here
###
#linear model 
line = lm(maxT ~ T3pm)
line

## 
## Call:
## lm(formula = maxT ~ T3pm)
## 
## Coefficients:
## (Intercept)         T3pm  
##       3.123        0.946

#scatter plot
plot(T3pm, maxT, main = "Association between the daily temperature observed at 3pm and daily maximum temperature", xlab = "Daily temperature observed at 3pm", ylab = "Daily maximum temperature")
abline(line, col = "green")
points(T3pm, maxT, col="red", cex=3, pch=19)

#predict value when X=33
Y = 3.1229 + 0.9460*33
Y

## [1] 34.3409

`Task 1.4`

Produce the residual plot of the above linear model. Comment on if the regression line is a good fit. Write your comment after the Answer to Task 1.4 prompt provided below.

### Code for Task 1.3. Write your code here
###
line = lm(maxT ~ T3pm)
plot(T3pm, line$residuals, ylab = "Residuals", xlab = "Daily temperature observed at 3pm")
abline(h=0, col ="purple")

Answer to Task 1.4: write your answer here. Since the residuals are randomly scattered around the horizontal line, with no clear pattern, this suggests that the linear regression model is a good fit.

2 Question 2

In lecture 12 we used the data in Rainfall to conduct inference concerning the proportion of days in March with rain. The Bureau of Meteorology, on the other hand, instead considers the slightly different proportion $p=$ days in March with at least 1mm of rain.

`Task 2.1`

Determine xbar, the proportion of days in March 2023 with at least 1mm rain.

### Code for Task 2.1. Write your code here
xbar = mean(data$Rainfall >= 1)

`Task 2.2`

Using the binom.confint() function provided by the binom package, determine a 99% Wilson confidence interval for the proportion of days in March with at least 1mm rain.

### Code for Task 2.2. Write your code here
sum(data$Rainfall >=1)

## [1] 9

require(binom)

## Loading required package: binom

binom.confint(9,31, conf.level = 0.99, method = "wilson")

##   method x  n      mean     lower     upper
## 1 wilson 9 31 0.2903226 0.1331495 0.5214264

`Task 2.3`

Perform a “sanity check”, and verify that the endpoints of your Wilson confidence interval in the previous task are such that the observed proportion xbar is right on the edge of a 99% prediction interval (using a normal approximation). Your answer should have two things:

R code which prints appropriate output;
one or two explanatory sentences.

### Code for Task 2.3. Write your code here
qnorm(0.995)

## [1] 2.575829

#upper end check 
0.5214 - 2.576*sqrt(0.5214*0.4786/31)

## [1] 0.2902803

#lower end check 
0.1331 + 2.576*sqrt(0.1331*0.8669/31)

## [1] 0.2902587

Write some explanatory sentences here As per the sanity check, we can see that that observed value xbar is “right on the edge”of the end points 0.5214 and 0.1331. We first work out the critical z value for the 99% CI which is the 99.5th percentile of the standard normal distribution. We then calculate the upper and lower bound using the formula.

`Task 2.4`

If we were to model the (random) sample proportion as the average of a random sample of size 31 taken with replacement from the following box:

\[\fbox{$\fbox{0}\ \fbox{0}\ \fbox{0}\ \fbox{0}\ \fbox{1}$}\]

what would its expected value and standard error be?

## Code for Task 2.4. Uncomment and complete the code below to print the answers:

`Task 2.5`

Is the observed proportion xbar significantly greater than 20%? Answer this question by computing the P-value associated with an appropriate formal hypothesis test. Your answer should have three things: 1. some R code which computes and prints the P-value; 2. a paragraph explaining why this code is appropriate; 3. a paragraph providing an interpretation of the P-value and a conclusion.

### Code for Task 2.5. Write your code here which computes and prints the P-value
xbar = mean(data$Rainfall >= 1)
xbar

## [1] 0.2903226

z = 2*sqrt(31)*(xbar-0.2)
z

## [1] 1.00579

p_value = pnorm(z,lower.tail = F)
p_value

## [1] 0.1572584

Paragraph explaining code: We calculate the z score corresponding to the xbar value. Then we calculate the p value, while ignoring the lower tail because we focus on values greater than 20% (in the upper tail)

Paragraph giving interpretation and conclusion: We use formal hypothesis testing. Greater than 20% means that more than 20% of the days in March rain with at least 1mm Null hypothesis: H0: p = 0.2 Alternative hypothesis: p > 0.2 We use one-sided t test The p value is lower than 0.2, then we can reject the null hypothesis and accept the alternative hypothesis that the observed proportion xbar is significantly greater than 20%

3 Question 3

An alternative method for constructing a confidence interval for a proportion $p$ based on a sample proportion $\bar x = s/n$ (which is presented in many textbooks) estimates the standard error using $\sqrt{\frac{\bar x(1-\bar x)}n}$ and then uses \[ \bar x \pm 1.96 \sqrt{\frac{\bar x(1-\bar x)}n}\,.\]

This is the so-called “asymptotic” interval.

However, recent research has shown that this confidence interval method has poor coverage properties, in that for certain “unfortunate” values of $n$ and the true $p$, the probability that the random interval $\bar X \pm 1.96 \sqrt{\frac{\bar X(1-\bar X)}n}$ covers $p$ can be substantially less than 0.95 (so we do not recommend this method).

`Task 3.1`

Perform a simulation to demonstrate this, using $p=0.25$ and $n=31$. Your answer should have two things:

some code which performs all necessary calculations and prints relevant output
one or two sentences explaining what the output means, in terms of coverage probability.

Hint: you may find it useful to adapt code from slide 26, lecture 12. Also, to ensure you get the same random numbers each time you Knit, you should use set.seed().

# write code for Task 3.1 here
# uncomment the next two lines and replace '...' after 'SID =' with your student number:
SID = 510009603
set.seed(SID)
p = 0.25
n = 31
over.est = 0
under.est = 0
for (i in 1:1000) {
    samp = sample(c(0, 1), prob = c(1 - p, p), replace = T, size = n)
    s = sum(samp)
    w = binom.confint(s, n, method = "wilson")
    over.est[i] = w$lower > p
    under.est[i] = w$upper < p
}
num.over.est = sum(over.est)
num.under.est = sum(under.est)
num.covering = 1000 - num.over.est - num.under.est
cbind(num.under.est, num.covering, num.over.est)

##      num.under.est num.covering num.over.est
## [1,]            35          936           29

Write your explanatory sentences for Task 3.1 here: We can see that close to 95% of the time, the interval covers the “true” value of p = 0.25. However, the time it underestimate the value is fairly high at 35 times.

`Task 3.2`

Perform a similar simulation, using the same values of $n$ and $p$, but instead use the 95% Wilson interval, showing that it performs better than the method used in Task 3.1. Your answer should have the same two things as in the previous task.

# write code for Task 3.2 here
SID = 510009603
set.seed(SID)
p = 0.25
n = 31
over.est = 0
under.est = 0
for (i in 1:1000) {
    samp = sample(c(0, 1), prob = c(1 - p, p), replace = T, size = n)
    s = sum(samp)
    w = binom.confint(s, n, method = "wilson")
    over.est[i] = w$lower > p
    under.est[i] = w$upper < p
}
num.over.est = sum(over.est)
num.under.est = sum(under.est)
num.covering = 1000 - num.over.est - num.under.est
cbind(num.under.est, num.covering, num.over.est)

##      num.under.est num.covering num.over.est
## [1,]            35          936           29

Write your explanatory sentences for Task 3.2 here: the 95% confidence interval is easier to interpret, and we can also see that the interval contain the true p = 0.25. Can be compared directly to other results of other studies.

====END OF THE WORKSHEET====

MATH1062 (Part B) / MATH1005 Assignment 2

© University of Sydney MATH1062 (Statistics Part) / MATH1005

06 May 2024