There are several questions. Each question may contain multiple tasks. To receive full marks in this part, you should correctly solve all tasks, justify your solution in the space provided where necessary, and add appropriate labels to your graphical summaries.
Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.
Format: All assignment tasks have either a field for
writing embedded R code, an answer field marked by the prompt
Answer to Task x.x, or both. You should
enter your solution either as embedded R code or as text after the
prompt Answer to Task x.x.
Submission: Upon completion, you must render this
worksheet (using Knit in R Studio) into an html file and
submit the html file. Make sure the file extension “html” is in lower
case. Your html file MUST contain all the R code you
have written in the worksheet.
In this assignment, we will use simulated climate data based on real data from the Bureau of Meteorology at Canterbury Racecourse AWS (station 066194) collected in 2023. The simulated data contains several different daily measurements throughout Autumn (March-May).
We will use the following variables.
X3pm.temperature (daily temperature measured at 3 pm)
Maximum.temperature (maximum daily temperature)
X9am.temperature (daily temperature measured at 9 am)
Minimum.temperature (minimum daily temperature)
Speed.of.maximum.wind.gust (Speed of maximum wind gust)
The temperature data are measured in Celsius and the speed of wind data are measured in km/h. Please beware that the variable names are case sensitive.
Download the data file AutumnCleaned.csv into your
data folder within your MATH1062 folder. This
R Markdown file (Assignment2Worksheet.Rmd) should also be
saved under your MATH1062 folder.
Then import the csv file into a variable called
data:
options(repos = c(CRAN = "https://cloud.r-project.org"))
### write your code here. Here is a sample solution
data = read.csv("data/AutumnCleaned.csv", header = TRUE)
### the following displays the dimension of the data
dim(data)
## [1] 92 16
names(data)
## [1] "X" "Minimum.temperature"
## [3] "Maximum.temperature" "Rainfall"
## [5] "Evaporation" "Sunshine..hours."
## [7] "Direction.of.maximum.wind.gust" "Speed.of.maximum.wind.gust"
## [9] "X9am.temperature" "X9am.relative.humidity"
## [11] "X9am.wind.direction" "X9am.wind.speed"
## [13] "X3pm.temperature" "X3pm.relative.humidity"
## [15] "X3pm.wind.direction" "X3pm.wind.speed"
If you save the data file and the worksheet correctly, you should be able to load the data file and see its dimension and variable names.
Task: How many observations are there?
How many variables are there?
Answer: This is the sample solution.
There are 92 observations and 16 variables.
====START OF ASSIGNMENT QUESTIONS====
Write your SID here: 540716399.
There are four tasks in this question.
Task 1.1
Produce a scatter plot for the daily temperature observed at 3 pm
(X3pm.temperature) and the observed daily maximum
temperature (Maximum.temperature), with
X3pm.temperature (\(X\))
on the horizontal axis and Maximum.temperature (\(Y\)) on the vertical axis.
Produce another scatter plot for the daily temperature observed at 9
am (X9am.temperature) and the observed daily minimum
temperature (Minimum.temperature), with
X9am.temperature (\(X\))
on the horizontal axis and Minimum.temperature (\(Y\)) on the vertical axis.
Place two plots side-by-side and correctly label them.
Comment on and compare these associations based on the scatter
plot. Write your comment after the
Answer to Task 1.1 prompt provided
below.
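### Code for Task 1.1. Write your code here
temp3pm = data$X3pm.temperature
tempMax = data$Maximum.temperature
temp9am = data$X9am.temperature
tempMin = data$Minimum.temperature
### place the two plots side by side, as the task requires
par(mfrow = c(1, 2))
plot(temp3pm, tempMax, xlab = "3pm Temperature", ylab = "Maximum Temperature", main = "3pm Temperature Vs Maximum Temperature")
plot(temp9am, tempMin, xlab = "9am Temperature", ylab = "Minimum Temperature", main = "9am Temperature Vs Minimum Temperature")
### restore the default single-panel layout for later plots
par(mfrow = c(1, 1))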
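Answer to Task 1.1: Both plots show a
positive, roughly linear association. However, the 3 pm versus maximum
temperature pair is the stronger of the two: its correlation is about 0.95
(the square root of the 0.8958 computed in Task 1.5), so its points cluster
more tightly around a line than the 9 am versus minimum temperature pair,
for which \(r=0.82\) is given in Task 1.2.
Task 1.2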
After rounding to two decimal places, the rounded sample SDs of
X9am.temperature and Minimum.temperature are
\(SD_X=1.98\) and \(SD_Y=1.44\), respectively. The sample means
of X9am.temperature and Minimum.temperature
are \(\bar X = 21.22\) and \(\bar Y = 16.73\), respectively. The rounded
correlation coefficient between X9am.temperature and
Minimum.temperature is \(r=0.82\).
Derive the intercept and the slope
of the regression line (round to four decimal places)
for predicting daily minimum temperature given the temperature observed
at 9 am. Write your answer after the
Answer to Task 1.2 prompt provided
below.
### Below are rounded sample means, sample SDs, and r
###
Xbar = 21.22 # sample mean of X
Ybar = 16.73 # sample mean of Y
SDX = 1.98 # sample SD of X
SDY = 1.44 # sample SD of Y
r = 0.82 # corr coeff
### Code for Task 1.2. you can use R as a calculator, write your code here
slope=r*(SDY/SDX)
intercept=Ybar-slope*Xbar
slope
## [1] 0.5963636
intercept
## [1] 4.075164
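As a hedged cross-check (not part of the worksheet), the formulas slope = r * SD_Y / SD_X and intercept = Ybar - slope * Xbar can be compared with what lm() returns. The data below are simulated purely for illustration; the names sim_x, sim_y and fit_sim are hypothetical and are not the assignment data.

```r
# Illustrative only: simulated data, not the AutumnCleaned.csv measurements.
set.seed(1)
sim_x <- rnorm(92, mean = 21, sd = 2)
sim_y <- 4 + 0.6 * sim_x + rnorm(92, sd = 1)

# Slope and intercept from summary statistics
r_sim <- cor(sim_x, sim_y)
slope_formula <- r_sim * sd(sim_y) / sd(sim_x)
intercept_formula <- mean(sim_y) - slope_formula * mean(sim_x)

# Slope and intercept from least squares
fit_sim <- lm(sim_y ~ sim_x)

# The two approaches should agree up to floating-point error
all.equal(unname(coef(fit_sim)), c(intercept_formula, slope_formula))
```

The agreement does not depend on whether the SDs use n or n - 1 in the denominator, because the factor cancels in the ratio SD_Y / SD_X.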
Answer to Task 1.2: Slope ≈ 0.5964,
intercept ≈ 4.0752
Task 1.3
Using the function lm(), build a linear model for
predicting daily maximum temperature (Maximum.temperature)
given the temperature observed at 3 pm (X3pm.temperature).
Produce a scatter plot for X3pm.temperature and
Maximum.temperature. Plot the resulting regression line on
top of the scatter plot using abline(). Predict the value
of daily maximum temperature given a value of \(X=33\) for the temperature observed at 3
pm. Use the function points(X, Y, col="red", cex=3, pch=19)
to plot the predicted value \(Y\)
(together with the predictor \(X\)),
where the options col="red", cex=3, pch=19 specify the
color, the marker size, and the marker type, respectively.
### Code for Task 1.3. Write your code here
x = data$X3pm.temperature
y = data$Maximum.temperature
plot(x,y, xlab = "3pm Temperature", ylab = "Maximum Temperature", main = "3pm Temperature Vs Maximum Temperature", ylim=c(20,40))
l = lm (y~x)
abline(l,col = "blue", lwd=2)
intercept13=coef(l)[1]
slope13=coef(l)[2]
yvalue=slope13*33+intercept13
points(33, yvalue, col="red", cex=3, pch=19)
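The prediction above is computed by hand from the coefficients. A possible alternative, sketched here under the assumption that the model is fitted with a data frame and a formula, is predict() with a newdata argument. The data are simulated for illustration, and the names temp_3pm, temp_max and sim_data are hypothetical.

```r
# Illustrative only: simulated stand-ins for the 3 pm and maximum temperatures.
set.seed(2)
temp_3pm <- runif(92, min = 18, max = 34)
temp_max <- 1.5 + 1.0 * temp_3pm + rnorm(92, sd = 0.8)
sim_data <- data.frame(temp_3pm = temp_3pm, temp_max = temp_max)

# Fitting with a data frame lets predict() match the predictor by name
fit <- lm(temp_max ~ temp_3pm, data = sim_data)

# Prediction at X = 33 via predict()
pred_predict <- predict(fit, newdata = data.frame(temp_3pm = 33))

# Same prediction by hand from the coefficients
pred_manual <- coef(fit)[1] + coef(fit)[2] * 33

all.equal(unname(pred_predict), unname(pred_manual))
```

The column name in newdata must match the predictor name in the formula; otherwise predict() cannot find the value to predict at.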
Task 1.4
Produce the residual plot of the linear model built in Task 1.3.
Comment on whether the regression line is a good fit. Write your comment
after the Answer to Task 1.4 prompt
provided below.
### Code for Task 1.4. Write your code here
plot(x, l$residuals,xlab="3pm Temperature", ylab = "Residual", main = "Residual Plot")
Answer to Task 1.4:
Task 1.5
Compared to the baseline prediction, what percentage of variation in
the response variable Maximum.temperature can be explained
by the linear regression model fitted in Task 1.3? Write your answer
after the Answer to Task 1.5 prompt
provided below. Round your answer to two decimal places.
### Code for Task 1.5. Write your code here
cor(x,y)^2*100
## [1] 89.57635
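A hedged sketch of why squaring the correlation answers this question: for simple linear regression, r squared equals 1 - SSE/SST, the share of variation around the baseline (mean-only) prediction that the regression line removes. The data below are simulated for illustration and the variable names are hypothetical.

```r
# Illustrative only: simulated data, not the assignment data.
set.seed(3)
sim_x <- runif(92, min = 18, max = 34)
sim_y <- 2 + sim_x + rnorm(92, sd = 1)
fit <- lm(sim_y ~ sim_x)

sse <- sum(residuals(fit)^2)          # variation left after the regression
sst <- sum((sim_y - mean(sim_y))^2)   # variation around the baseline (the mean)

r2_from_cor <- cor(sim_x, sim_y)^2
r2_from_ss  <- 1 - sse / sst
r2_from_lm  <- summary(fit)$r.squared

# All three values should coincide
c(r2_from_cor, r2_from_ss, r2_from_lm)
```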
Answer to Task 1.5: 89.58%
Question 2
Using historical data from the early 1900s, statisticians at the Bureau of Meteorology calculated that 20% of days in Autumn had max wind speeds exceeding 40 km/h. We want to test whether recent Autumn data is consistent with the hypothesis that “20% of days during Autumn exceed 40 km/h”.
Task 2.1
Determine xbar, the observed sample proportion
of days in Autumn 2023 that (strictly) exceed a speed of 40km/h.
### Code for Task 2.1. Write your code here
WIND=data$"Speed.of.maximum.wind.gust"
xbar=mean(WIND>40)
xbar
## [1] 0.3152174
Task 2.2
Calculate the expectation and standard error for the sample proportion assuming the hypothesis is true.
### Code for Task 2.2. Write your code here
E=0.2
SE=sqrt((E-E^2)/length(WIND))
SE
## [1] 0.04170288
Task 2.3
Calculate the 99% prediction interval that can be used to test whether the data is consistent with the above hypothesis.
### Code for Task 2.3. Write your code here
value99 = qnorm(0.995)
Lowbound = E-value99*SE
Uppbound = E+value99*SE
Lowbound
## [1] 0.09258049
Uppbound
## [1] 0.3074195
Task 2.4
Determine a 99% (Wilson) confidence interval for the unknown proportion of days in Autumn that exceed a max wind speed of 40 km/h.
### Code for Task 2.4. Write your code here
### install binom only if it is missing, then load it
if (!requireNamespace("binom", quietly = TRUE)) install.packages("binom")
library(binom)
wilsons = binom.confint(x = length(WIND[WIND > 40]),n = length(WIND),conf.level = 0.99,method = "wilson")
wilsons
## method x n mean lower upper
## 1 wilson 29 92 0.3152174 0.2065089 0.4487855
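For readers without the binom package, the same interval can be sketched in base R from the Wilson score formula. The counts x = 29 and n = 92 are taken from the binom.confint() output above; the variable names are illustrative.

```r
# Counts taken from the binom.confint() output above
x_exceed <- 29
n_days <- 92
conf_level <- 0.99

p_hat <- x_exceed / n_days
z <- qnorm(1 - (1 - conf_level) / 2)

# Wilson score interval: centre and half-width
denom <- 1 + z^2 / n_days
centre <- (p_hat + z^2 / (2 * n_days)) / denom
half_width <- z * sqrt(p_hat * (1 - p_hat) / n_days + z^2 / (4 * n_days^2)) / denom

wilson_lower <- centre - half_width
wilson_upper <- centre + half_width
c(wilson_lower, wilson_upper)
```

The two endpoints should agree with the lower and upper columns reported by binom.confint() above.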
Task 2.5
Perform a “sanity check”, and verify that the endpoints of your
Wilson confidence interval in the previous task are such that the
observed proportion xbar is right on the edge of a
99% prediction interval. Your answer should have two things:
### Code for Task 2.5. Write your code here
p1 = wilsons$lower
p2 = wilsons$upper
p1SE=sqrt(p1*(1-p1)/length(WIND))
p2SE=sqrt(p2*(1-p2)/length(WIND))
lower1 = p1-p1SE*qnorm(0.995)
lower2 = p1+p1SE*qnorm(0.995)
upper1 = p2-p2SE*qnorm(0.995)
upper2 = p2+p2SE*qnorm(0.995)
cat("Lower Interval:[",lower1, ",", lower2, "]")
## Lower Interval:[ 0.09780036 , 0.3152174 ]
cat("\n", "Upper Interval:[",upper1, ",", upper2, "]")
##
## Upper Interval:[ 0.3152174 , 0.5823537 ]
Answer to Task 2.5: xbar is equal to
both the upper limit of the lower interval and the lower limit of the
upper interval, meaning that at each Wilson endpoint xbar sits right on
the edge of the 99% prediction interval.
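The check can also be run in the other direction. As a sketch, and assuming x = 29 exceedances out of n = 92 days as reported earlier, the Wilson endpoints are the values of p at which xbar lies exactly qnorm(0.995) standard errors from p, so they can be recovered numerically with uniroot(). The function name edge_gap is hypothetical.

```r
x_exceed <- 29
n_days <- 92
xbar_obs <- x_exceed / n_days
z99 <- qnorm(0.995)

# Zero exactly when xbar_obs sits z99 standard errors away from p
edge_gap <- function(p) (xbar_obs - p)^2 - z99^2 * p * (1 - p) / n_days

# One root lies below xbar_obs and one above it
lower_root <- uniroot(edge_gap, interval = c(1e-8, xbar_obs), tol = 1e-10)$root
upper_root <- uniroot(edge_gap, interval = c(xbar_obs, 1 - 1e-8), tol = 1e-10)$root
c(lower_root, upper_root)
```

The two roots should match the Wilson lower and upper limits, which is the same statement as the sanity check above read backwards.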
Task 2.6
Compute the p-value for the hypothesis.
### Code for Task 2.6. Write your code here
z=(xbar-E)/SE
p=2*pnorm(abs(z),0,1, lower.tail=F)
p
## [1] 0.005730506
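As an optional cross-check, base R's prop.test() with the continuity correction switched off gives the same two-sided p-value, because its chi-squared statistic is the square of the z statistic. This sketch assumes the counts x = 29 and n = 92 reported earlier; the variable names are illustrative.

```r
x_exceed <- 29
n_days <- 92
p0 <- 0.2

# Manual two-sided z-test, as in the worksheet
z_stat <- (x_exceed / n_days - p0) / sqrt(p0 * (1 - p0) / n_days)
p_manual <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)

# correct = FALSE turns off the continuity correction so the test matches the z-test
pt <- prop.test(x_exceed, n_days, p = p0, correct = FALSE)

c(p_manual, pt$p.value)
```

Leaving correct at its default of TRUE would apply Yates' continuity correction and give a slightly larger p-value, so the two numbers would no longer agree.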
Task 2.7
What is the conclusion of your hypothesis test at the 1% significance
level? Is the observed proportion xbar
significantly different from 20%? What assumptions do
we need about our data in order to make our hypothesis test valid? Your
answer should have three things:
Answer to Task 2.7:
Task 3.1
What is the smallest prediction interval at which the data is
consistent with the hypothesis? What percentage does this prediction
interval correspond to? What do you notice about this interval? Write
your answer after the Answer to Task 3.1
prompt provided below.
### Code for Task 3.1. Write your code here
z = (xbar-E)/SE
upper=E+z*SE
lower=E-z*SE
upper
## [1] 0.3152174
lower
## [1] 0.08478261
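2*pnorm(z)-1
## [1] 0.9942695
Answer to Task 3.1: The smallest
prediction interval that is consistent with the hypothesis is
[0.08478261, 0.3152174], whose upper endpoint is exactly xbar. Its
two-sided coverage is 2*pnorm(z) - 1 = 0.9943, so it is a 99.43%
prediction interval, which is 1 minus the p-value from Task 2.6. It is
wider than the 99% prediction interval, which agrees with the data being
inconsistent with the hypothesis at the 99% level.
Task 3.2
Suppose we only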
had access to \(n\) days from Autumn
2023, where \(n < 92\). If we assume
that xbar would remain the same as in Task 2.1 regardless
of the value of \(n\), what is the
smallest value \(n\) such that we would
reject the hypothesis at the 5% significance level (that is, with 95% confidence)?
Hint: you may need to go through some trial and error. Your answer
should have two things:
### Code for Task 3.2. Write your code here
z32 = 1.96
expect = 0.2
n = (expect*(1-expect)/((xbar-expect)^2/z32^2))
n
## [1] 46.30161
Answer to Task 3.2: The smallest value
of n at which the hypothesis would be rejected at the 5% significance
level is n = 47. The calculation gives n > 46.30, and n must be a whole
number of days, so we round up.
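The trial-and-error route suggested in the hint can be sketched as a direct search over n. This assumes, as the task states, that xbar stays fixed at 29/92, and it uses qnorm(0.975) rather than the rounded 1.96; the variable names are illustrative.

```r
xbar_obs <- 29 / 92          # assumed fixed for every n, as the task states
p0 <- 0.2
z_crit <- qnorm(0.975)       # two-sided 5% critical value

n_candidates <- 1:91         # the task restricts n to values below 92
z_by_n <- (xbar_obs - p0) / sqrt(p0 * (1 - p0) / n_candidates)

# Smallest n whose z statistic falls outside the 95% prediction interval
n_min <- min(n_candidates[abs(z_by_n) > z_crit])
n_min
```

The search should land on the same value as rounding the closed-form result 46.30 up to the next integer.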