There are several questions. Each question may contain multiple tasks. To receive a full mark in this part, you should correctly solve all tasks, justify your solution in the space provided in case necessary, and add appropriate labels to your graphical summaries.
Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.
Format: All assignment tasks have either a field for
writing embedded R code, an answer field marked by the prompt
Answer to Task x.x, or both. You should
enter your solution either as embedded R code or as text after the
prompt Answer to Task x.x.
Submission: Upon completion, you must render this
worksheet (using Knit in R Studio) into an html file and
submit the html file. Make sure the file extension “html” is in lower
case. Your html file MUST contain all the R code you
have written in the worksheet.
In this assignment, we will use daily weather and climate data collected by the Bureau of Meteorology at Canterbury Racecourse AWS {station 066194}. The data are collected daily in March, 2023.
We will use the following variables.
X3pm.Temperature (daily temperature measured at
3 pm)Maximum.temperature (maximum daily
temperature)X9am.Temperature (daily temperature measured at
9 am)Minimum.temperature (minimum daily
temperature)Rainfall (total daily rainfall)The temperature data are measured in Celsius and the rainfall data are measured in millimeter (mm). Please beware that the variable names are case sensitive.
Download the data file CanterburyMarch2023.csv in your
data folder within your MATH1062 folder. This
R Markdown file (Assignment2Worksheet.Rmd) should also be
saved under your MATH1062 folder.
Then import the csv file into a variable called
data:
### write your code here. Here is a sample solution
data = read.csv("CanterburyMarch2023.csv", header = T)
### the following displays the dimenison of the data
dim(data)
## [1] 31 21
names(data)
## [1] "Date" "Minimum.temperature"
## [3] "Maximum.temperature" "Rainfall"
## [5] "Evaporation" "Sunshine..hours."
## [7] "Direction.of.maximum.wind.gust" "Speed.of.maximum.wind.gust..km.h."
## [9] "Time.of.maximum.wind.gust" "X9am.Temperature"
## [11] "X9am.relative.humidity...." "X9am.cloud.amount..oktas."
## [13] "X9am.wind.direction" "X9am.wind.speed..km.h."
## [15] "X9am.MSL.pressure..hPa." "X3pm.Temperature"
## [17] "X3pm.relative.humidity...." "X3pm.cloud.amount..oktas."
## [19] "X3pm.wind.direction" "X3pm.wind.speed..km.h."
## [21] "X3pm.MSL.pressure..hPa."
If you save the data file and the worksheet correctly, you should be able to load the data file and see its dimension and variable names.
Task: How many observations are there?
How many variables are there?
Answer: This is the sample solution.
There are 31 observations and 21 variables.
====START OF ASSIGNMENT QUESTIONS====
There are four tasks in this question.
Task 1.1Produce a scatter plot for the the daily temperature observed at 3 pm
(X3pm.Temperature) and the observed daily maximum
temperature (Maximum.temperature), with
X3pm.Temperature (\(X\))
on the horizontal axis and Maximum.temperature (\(Y\)) on the vertical axis.
Produce another scatter plot for the daily temperature observed at 9
am (X9am.Temperature) and the observed daily minimum
temperature (Minimum.temperature), with
X9am.Temperature (\(X\))
on the horizontal axis and Minimum.temperature (\(Y\)) on the vertical axis.
Place two plots side-by-side and correctly label them.
Comment on and compare these associations based on the scatter
plot. Write your comment after the
Answer to Task 1.1 prompt provided
below.
### Code for Task 1.1. Write your code here
###
T3pm = data$X3pm.Temperature
maxT = data$Maximum.temperature
T9am = data$X9am.Temperature
minT = data$Minimum.temperature
plot1 = plot(T3pm, maxT, main = "Association between the daily temperature observed at 3pm and daily maximum temperature", xlab = "Daily temperature observed at 3pm", ylab = "Daily maximum temperature")
plot2 = plot(T9am, minT, main = "Association between the daily temperature observed at 9am and daily minimum temperature", xlab = "Daily temperature observed at 9am", ylab = "Daily minimum temperature")
Answer to Task 1.1:
write your answer here. - For the scatter plot showing the
association between the daily temperature observed at 3pm and the
observed daily maximum temperature. The plot shows a positive
association, meaning when daily temperature observed at 3pm, the daily
maximum temperature tends to increase as well. The points also seem to
cluster around a line, which indicate a strong association between the
variables. - For the second scatter plot, there seems to be no
association between the variables because the points seem randomly
scattered with no pattern detected.
Task 1.2After rounding to two decimal places, the rounded sample SDs of
X9am.Temperature and Minimum.temperature are
\(SD_X=1.94\) and \(SD_Y=2.56\), respectively. The sample means
of X9am.Temperature and Minimum.temperature
are \(\bar X = 21.43\) and \(\bar Y = 16.89\), respectively. The rounded
correlation coefficient between X9am.Temperature and
Minimum.temperature is \(r=0.55\).
Derive the intercept and the slope
of the regression line (round to four decimal places)
for predicting daily minimum temperature given the temperature observed
at 9 am. Write your answer after the
Answer to Task 1.2 prompt provided
below.
### Below are rounded sample means, sample SDs, and r
###
Xbar = 21.43 # sample mean of X
Ybar = 16.89 # sample mean of Y
SDX = 1.94 # sample mean of X
SDY = 2.56 # sample mean of Y
r = 0.55 # correlation coefficient
### Code for Task 1.2. you can use R as a calculator, write your code here
###
slope2 = r*(SDY/SDX)
slope2
## [1] 0.7257732
intercept2 = Ybar - slope2*Xbar
intercept2
## [1] 1.33668
Answer to Task 1.2: Rounding to
FOUR decimal places, the slope is (0.7258) and the
intercept is (1.3367).
Task 1.3Using the function lm(), build a linear model for
predicting daily maximum temperature (Maximum.temperature)
given the temperature observed at 3 pm
(X3pm.Temperature).
Produce a scatter plot for X3pm.Temperature and
Maximum.temperature.
Plot the resulting regression line on top of the scatter plot
using abline().
Predict the value of daily maximum temperature given a value of \(X=33\) for the temperature observed at 3 pm.
Use the function
points(X, Y, col="red", cex=3, pch=19) to plot the predict
value \(Y\) (together with the
predictor \(X\)), where the options
col="red", cex=3, pch=19 specify the color, the marker
size, and the mark type, respectively.
### Code for Task 1.3. Write your code here
###
#linear model
line = lm(maxT ~ T3pm)
line
##
## Call:
## lm(formula = maxT ~ T3pm)
##
## Coefficients:
## (Intercept) T3pm
## 3.123 0.946
#scatter plot
plot(T3pm, maxT, main = "Association between the daily temperature observed at 3pm and daily maximum temperature", xlab = "Daily temperature observed at 3pm", ylab = "Daily maximum temperature")
abline(line, col = "green")
points(T3pm, maxT, col="red", cex=3, pch=19)
#predict value when X=33
Y = 3.1229 + 0.9460*33
Y
## [1] 34.3409
Task 1.4Produce the residual plot of the above linear model. Comment on if
the regression line is a good fit. Write your comment after the
Answer to Task 1.4 prompt provided
below.
### Code for Task 1.3. Write your code here
###
line = lm(maxT ~ T3pm)
plot(T3pm, line$residuals, ylab = "Residuals", xlab = "Daily temperature observed at 3pm")
abline(h=0, col ="purple")
Answer to Task 1.4:
write your answer here. Since the residuals are randomly
scattered around the horizontal line, with no clear pattern, this
suggests that the linear regression model is a good fit.
In lecture 12 we used the data in Rainfall to conduct
inference concerning the proportion of days in March with rain. The
Bureau of Meteorology, on the other hand, instead considers the slightly
different proportion \(p=\) days in
March with at least 1mm of rain.
Task 2.1Determine xbar, the proportion of days in March 2023
with at least 1mm rain.
### Code for Task 2.1. Write your code here
xbar = mean(data$Rainfall >= 1)
Task 2.2Using the binom.confint() function provided by the
binom package, determine a 99% Wilson confidence interval
for the proportion of days in March with at least 1mm rain.
### Code for Task 2.2. Write your code here
sum(data$Rainfall >=1)
## [1] 9
require(binom)
## Loading required package: binom
binom.confint(9,31, conf.level = 0.99, method = "wilson")
## method x n mean lower upper
## 1 wilson 9 31 0.2903226 0.1331495 0.5214264
Task 2.3Perform a “sanity check”, and verify that the endpoints of your
Wilson confidence interval in the previous task are such that the
observed proportion xbar is right on the edge of a
99% prediction interval (using a normal approximation). Your answer
should have two things:
### Code for Task 2.3. Write your code here
qnorm(0.995)
## [1] 2.575829
#upper end check
0.5214 - 2.576*sqrt(0.5214*0.4786/31)
## [1] 0.2902803
#lower end check
0.1331 + 2.576*sqrt(0.1331*0.8669/31)
## [1] 0.2902587
Write some explanatory sentences here
As per the sanity check, we can see that that observed value xbar is
“right on the edge”of the end points 0.5214 and 0.1331. We first work
out the critical z value for the 99% CI which is the 99.5th percentile
of the standard normal distribution. We then calculate the upper and
lower bound using the formula.
Task 2.4If we were to model the (random) sample proportion as the average of a random sample of size 31 taken with replacement from the following box:
\[\fbox{$\fbox{0}\ \fbox{0}\ \fbox{0}\ \fbox{0}\ \fbox{1}$}\]
what would its expected value and standard error be?
## Code for Task 2.4. Uncomment and complete the code below to print the answers:
Task 2.5 Is the observed proportion xbar significantly
greater than 20%? Answer this question by computing the P-value
associated with an appropriate formal hypothesis test. Your answer
should have three things: 1. some R code which computes and prints the
P-value; 2. a paragraph explaining why this code is appropriate; 3. a
paragraph providing an interpretation of the P-value and a
conclusion.
### Code for Task 2.5. Write your code here which computes and prints the P-value
xbar = mean(data$Rainfall >= 1)
xbar
## [1] 0.2903226
z = 2*sqrt(31)*(xbar-0.2)
z
## [1] 1.00579
p_value = pnorm(z,lower.tail = F)
p_value
## [1] 0.1572584
Paragraph explaining code: We calculate
the z score corresponding to the xbar value. Then we calculate the p
value, while ignoring the lower tail because we focus on values greater
than 20% (in the upper tail)
Paragraph giving interpretation and conclusion:
We use formal hypothesis testing. Greater than 20% means that more than
20% of the days in March rain with at least 1mm Null hypothesis: H0: p =
0.2 Alternative hypothesis: p > 0.2 We use one-sided t test The p
value is lower than 0.2, then we can reject the null hypothesis and
accept the alternative hypothesis that the observed proportion xbar is
significantly greater than 20%
An alternative method for constructing a confidence interval for a proportion \(p\) based on a sample proportion \(\bar x = s/n\) (which is presented in many textbooks) estimates the standard error using \(\sqrt{\frac{\bar x(1-\bar x)}n}\) and then uses \[ \bar x \pm 1.96 \sqrt{\frac{\bar x(1-\bar x)}n}\,.\]
This is the so-called “asymptotic” interval.
However, recent research has shown that this confidence interval method has poor coverage properties, in that for certain “unfortunate” values of \(n\) and the true \(p\), the probability that the random interval \(\bar X \pm 1.96 \sqrt{\frac{\bar X(1-\bar X)}n}\) covers \(p\) can be substantially less than 0.95 (so we do not recommend this method).
Task 3.1Perform a simulation to demonstrate this, using \(p=0.25\) and \(n=31\). Your answer should have two things:
Hint: you may find it useful to adapt code from
slide 26, lecture 12. Also, to ensure you get the same random numbers
each time you Knit, you should use
set.seed().
# write code for Task 3.1 here
# uncomment the next two lines and replace '...' after 'SID =' with your student number:
SID = 510009603
set.seed(SID)
p = 0.25
n = 31
over.est = 0
under.est = 0
for (i in 1:1000) {
samp = sample(c(0, 1), prob = c(1 - p, p), replace = T, size = n)
s = sum(samp)
w = binom.confint(s, n, method = "wilson")
over.est[i] = w$lower > p
under.est[i] = w$upper < p
}
num.over.est = sum(over.est)
num.under.est = sum(under.est)
num.covering = 1000 - num.over.est - num.under.est
cbind(num.under.est, num.covering, num.over.est)
## num.under.est num.covering num.over.est
## [1,] 35 936 29
Write your explanatory sentences for Task 3.1 here:
We can see that close to 95% of the time, the interval covers the “true”
value of p = 0.25. However, the time it underestimate the value is
fairly high at 35 times.
Task 3.2Perform a similar simulation, using the same values of \(n\) and \(p\), but instead use the 95% Wilson
interval, showing that it performs better than the method used in
Task 3.1. Your answer should have the same
two things as in the previous task.
# write code for Task 3.2 here
SID = 510009603
set.seed(SID)
p = 0.25
n = 31
over.est = 0
under.est = 0
for (i in 1:1000) {
samp = sample(c(0, 1), prob = c(1 - p, p), replace = T, size = n)
s = sum(samp)
w = binom.confint(s, n, method = "wilson")
over.est[i] = w$lower > p
under.est[i] = w$upper < p
}
num.over.est = sum(over.est)
num.under.est = sum(under.est)
num.covering = 1000 - num.over.est - num.under.est
cbind(num.under.est, num.covering, num.over.est)
## num.under.est num.covering num.over.est
## [1,] 35 936 29
Write your explanatory sentences for Task 3.2 here:
the 95% confidence interval is easier to interpret, and we can also see
that the interval contain the true p = 0.25. Can be compared directly to
other results of other studies.
====END OF THE WORKSHEET====