Instruction

There are several questions. Each question may contain multiple tasks. To receive a full mark in this part, you should correctly solve all tasks, justify your solution in the space provided in case necessary, and add appropriate labels to your graphical summaries.

Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Only work in the space provided.

Format: All assignment tasks have either a field for writing embedded R code, an answer field marked by the prompt Answer to Task x.x, or both. You should enter your solution either as embedded R code or as text after the prompt Answer to Task x.x.

Submission: Upon completion, you must render this worksheet (using Knit in R Studio) into an html file and submit the html file. Make sure the file extension “html” is in lower case. Your html file MUST contain all the R code you have written in the worksheet.

Import data

In this assignment, we will use simulated climate data based on real data from the Bureau of Meteorology at Canterbury Racecourse AWS (station 066194) collected in 2023. The simulated data contains several different daily measurements throughout Autumn (March-May).

We will use the following variables.

  • X3pm.temperature (daily temperature measured at 3 pm)
  • Maximum.temperature (maximum daily temperature)
  • X9am.temperature (daily temperature measured at 9 am)
  • Minimum.temperature (minimum daily temperature)
  • Speed.of.maximum.wind.gust (Speed of maximum wind gust)

The temperature data are measured in degrees Celsius and the wind speed data are measured in km/h. Note that the variable names are case sensitive.

Download the data file AutumnCleaned.csv into the data folder inside your MATH1062 folder. This R Markdown file (Assignment2Worksheet.Rmd) should also be saved under your MATH1062 folder.

Then import the csv file into a variable called data:

options(repos = c(CRAN = "https://cloud.r-project.org"))
### write your code here. Here is a sample solution 
data = read.csv("data/AutumnCleaned.csv", header = TRUE)
### the following displays the dimension of the data
dim(data)
## [1] 92 16
names(data)
##  [1] "X"                              "Minimum.temperature"           
##  [3] "Maximum.temperature"            "Rainfall"                      
##  [5] "Evaporation"                    "Sunshine..hours."              
##  [7] "Direction.of.maximum.wind.gust" "Speed.of.maximum.wind.gust"    
##  [9] "X9am.temperature"               "X9am.relative.humidity"        
## [11] "X9am.wind.direction"            "X9am.wind.speed"               
## [13] "X3pm.temperature"               "X3pm.relative.humidity"        
## [15] "X3pm.wind.direction"            "X3pm.wind.speed"

If you save the data file and the worksheet correctly, you should be able to load the data file and see its dimension and variable names.

Task: How many observations are there? How many variables are there?

Answer: This is the sample solution. There are 92 observations and 16 variables.
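
Because the variable names are case sensitive, it can help to confirm that the five variables used in this assignment are present before starting. The sketch below is only an illustration: the data frame `toy` is a hypothetical stand-in for the imported data, with made-up values, so that the example runs on its own.

```r
# Hypothetical stand-in for the imported data frame (made-up values).
# With the real data, replace `toy` by the object returned by read.csv().
toy <- data.frame(
  X3pm.temperature = c(24.1, 26.3),
  Maximum.temperature = c(25.0, 27.2),
  X9am.temperature = c(19.5, 21.0),
  Minimum.temperature = c(15.2, 16.8),
  Speed.of.maximum.wind.gust = c(31, 44)
)

# The five variables named in the assignment text.
required <- c(
  "X3pm.temperature", "Maximum.temperature", "X9am.temperature",
  "Minimum.temperature", "Speed.of.maximum.wind.gust"
)

# setdiff() compares names exactly, so a difference in case counts as missing.
missing_cols <- setdiff(required, names(toy))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
print(dim(toy))
```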

====START OF ASSIGNMENT QUESTIONS====

Write your SID here: 540716399.

Question 1

There are four tasks in this question.

Task 1.1

Produce a scatter plot for the daily temperature observed at 3 pm (X3pm.temperature) and the observed daily maximum temperature (Maximum.temperature), with X3pm.temperature (\(X\)) on the horizontal axis and Maximum.temperature (\(Y\)) on the vertical axis.

Produce another scatter plot for the daily temperature observed at 9 am (X9am.temperature) and the observed daily minimum temperature (Minimum.temperature), with X9am.temperature (\(X\)) on the horizontal axis and Minimum.temperature (\(Y\)) on the vertical axis.

Place the two plots side-by-side and label them correctly. Comment on and compare these associations based on the scatter plots. Write your comment after the Answer to Task 1.1 prompt provided below.

### Code for Task 1.1. Write your code here
temp3pm = data$X3pm.temperature
tmax = data$Maximum.temperature
temp9am = data$X9am.temperature
tmin = data$Minimum.temperature
### place the two plots side-by-side, then restore the previous layout
old_par = par(mfrow = c(1, 2))
plot(temp3pm, tmax, xlab = "3 pm temperature (°C)", ylab = "Maximum temperature (°C)", main = "3 pm vs maximum temperature")
plot(temp9am, tmin, xlab = "9 am temperature (°C)", ylab = "Minimum temperature (°C)", main = "9 am vs minimum temperature")
par(old_par)

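Answer to Task 1.1: Both scatter plots show a positive, roughly linear association. The 3 pm and maximum temperatures are clustered more tightly around a line (r ≈ 0.95, the square root of the 0.8958 found in Task 1.5) than the 9 am and minimum temperatures (r = 0.82, given in Task 1.2), so the first association is the stronger of the two.

Task 1.2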

After rounding to two decimal places, the rounded sample SDs of X9am.temperature and Minimum.temperature are \(SD_X=1.98\) and \(SD_Y=1.44\), respectively. The sample means of X9am.temperature and Minimum.temperature are \(\bar X = 21.22\) and \(\bar Y = 16.73\), respectively. The rounded correlation coefficient between X9am.temperature and Minimum.temperature is \(r=0.82\).

Derive the intercept and the slope of the regression line (round to four decimal places) for predicting daily minimum temperature given the temperature observed at 9 am. Write your answer after the Answer to Task 1.2 prompt provided below.

### Below are rounded sample means, sample SDs, and r
###
Xbar = 21.22 # sample mean of X
Ybar = 16.73 # sample mean of Y

SDX = 1.98 # sample SD of X
SDY = 1.44 # sample SD of Y
r = 0.82 # corr coeff

### Code for Task 1.2. you can use R as a calculator, write your code here
slope=r*(SDY/SDX)
intercept=Ybar-slope*Xbar
slope
## [1] 0.5963636
intercept
## [1] 4.075164
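
The formulas used above (slope = r × SD_Y / SD_X, intercept = Ybar - slope × Xbar) give the same line as lm(). The following is a minimal sketch of that identity on hypothetical toy data; the vectors `x` and `y` are made up for illustration and are not the assignment data.

```r
# Hypothetical toy data, used only to illustrate the identity.
x <- c(18.2, 19.5, 20.1, 21.4, 22.0, 23.3, 24.8)
y <- c(14.9, 15.1, 16.4, 16.2, 17.5, 17.9, 19.0)

# Regression coefficients from summary statistics.
slope_stats <- cor(x, y) * sd(y) / sd(x)
intercept_stats <- mean(y) - slope_stats * mean(x)

# The same coefficients from a least-squares fit.
fit <- lm(y ~ x)

print(c(intercept_stats, slope_stats))
print(unname(coef(fit)))
```

With rounded inputs, as in this task, the result can differ slightly from what lm() returns on the full data.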

Answer to Task 1.2: slope ≈ 0.5964, intercept ≈ 4.0752

Task 1.3

Using the function lm(), build a linear model for predicting daily maximum temperature (Maximum.temperature) given the temperature observed at 3 pm (X3pm.temperature). Produce a scatter plot for X3pm.temperature and Maximum.temperature. Plot the resulting regression line on top of the scatter plot using abline(). Predict the value of daily maximum temperature given a value of \(X=33\) for the temperature observed at 3 pm. Use the function points(X, Y, col="red", cex=3, pch=19) to plot the predicted value \(Y\) (together with the predictor \(X\)), where the options col="red", cex=3, pch=19 specify the color, the marker size, and the marker type, respectively.

### Code for Task 1.3. Write your code here
x = data$X3pm.temperature
y = data$Maximum.temperature
plot(x,y, xlab = "3pm Temperature", ylab = "Maximum Temperature", main = "3pm Temperature Vs Maximum Temperature", ylim=c(20,40))
l = lm (y~x)
abline(l,col = "blue", lwd=2)
intercept13=coef(l)[1]
slope13=coef(l)[2]
yvalue=slope13*33+intercept13
points(33, yvalue, col="red", cex=3, pch=19)
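
The manual calculation of the predicted value (slope × 33 + intercept) can also be done with predict(). The sketch below uses a hypothetical toy data frame; the column names `temp3pm` and `tmax` and their values are assumptions made for illustration. If 33 lies outside the observed range of the predictor, the prediction is an extrapolation and should be treated with caution.

```r
# Hypothetical toy data (made-up values and column names).
toy <- data.frame(
  temp3pm = c(22.5, 24.0, 25.8, 27.1, 29.4, 31.0),
  tmax = c(23.4, 25.1, 26.5, 28.3, 30.2, 32.1)
)

fit <- lm(tmax ~ temp3pm, data = toy)

# Manual prediction from the fitted coefficients.
manual <- unname(coef(fit)[1] + coef(fit)[2] * 33)

# The same prediction via predict(); newdata must use the predictor's name.
pred <- unname(predict(fit, newdata = data.frame(temp3pm = 33)))

print(c(manual, pred))
```

Note that predict() needs the model to be fitted with a formula and a data argument (as here) so that newdata can be matched by column name.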

Task 1.4

Produce the residual plot of the linear model built in Task 1.3. Comment on whether the regression line is a good fit. Write your comment after the Answer to Task 1.4 prompt provided below.

### Code for Task 1.4. Write your code here
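plot(x, residuals(l), xlab = "3pm Temperature", ylab = "Residual", main = "Residual Plot")
### horizontal reference line at zero, to judge whether residuals scatter evenly around it
abline(h = 0, lty = 2)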

Answer to Task 1.4:

Task 1.5

Compared to the baseline prediction, what percentage of variation in the response variable Maximum.temperature can be explained by the linear regression model fitted in Task 1.3? Write your answer after the Answer to Task 1.5 prompt provided below. Round your answer to two decimal places.

### Code for Task 1.5. Write your code here
cor(x,y)^2*100
## [1] 89.57635
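
The squared correlation used above equals the proportion by which the regression line reduces the squared prediction error relative to the baseline of always predicting the mean of Y. The following sketch checks this on hypothetical toy data (the vectors `x` and `y` are made up for illustration).

```r
# Hypothetical toy data, for illustration only.
x <- c(22.5, 24.0, 25.8, 27.1, 29.4, 31.0, 32.2)
y <- c(23.9, 24.6, 27.0, 27.8, 30.9, 31.5, 33.8)

fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)   # squared error of the regression line
tss <- sum((y - mean(y))^2)    # squared error of the baseline (mean of y)

r2_var <- 1 - rss / tss        # proportion of variation explained
r2_cor <- cor(x, y)^2          # squared correlation coefficient
r2_lm <- summary(fit)$r.squared

print(c(r2_var, r2_cor, r2_lm) * 100)
```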

Answer to Task 1.5: 89.58%

Question 2

Using historical data from the early 1900s, statisticians at the Bureau of Meteorology calculated that 20% of days in Autumn had maximum wind gust speeds exceeding 40km/h. We want to test whether recent Autumn data is consistent with this hypothesis that “20% of days during Autumn exceed 40km/h”.

Task 2.1

Determine xbar, the observed sample proportion of days in Autumn 2023 that (strictly) exceed a speed of 40km/h.

### Code for Task 2.1. Write your code here
WIND=data$"Speed.of.maximum.wind.gust"
xbar=mean(WIND>40)
xbar
## [1] 0.3152174

Task 2.2

Calculate the expectation and standard error for the sample proportion assuming the hypothesis is true.

### Code for Task 2.2. Write your code here
E=0.2
SE=sqrt((E-E^2)/length(WIND))
SE
## [1] 0.04170288
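
The standard error formula sqrt(p(1 - p)/n) can be checked by simulation. The sketch below is an illustration under the assumption that the 92 days behave like independent draws with success probability 0.2; the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

p0 <- 0.2     # hypothesised proportion
n <- 92       # number of days in the sample

se_formula <- sqrt(p0 * (1 - p0) / n)

# Simulate many samples of n days and record each sample proportion.
props <- rbinom(100000, size = n, prob = p0) / n

print(se_formula)    # standard error from the formula
print(mean(props))   # should be close to the expectation p0
print(sd(props))     # should be close to se_formula
```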

Task 2.3

Calculate the 99% prediction interval that can be used to test whether the data is consistent with the above hypothesis.

### Code for Task 2.3. Write your code here
value99 = qnorm(0.995)
Lowbound = E-value99*SE
Uppbound = E+value99*SE
Lowbound
## [1] 0.09258049
Uppbound
## [1] 0.3074195
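
To use the prediction interval as a test, compare the observed proportion with its endpoints. The sketch below repeats the calculation in a self-contained form, taking the counts 29 out of 92 from the output shown in Task 2.4; the variable names are illustrative.

```r
p0 <- 0.2      # hypothesised proportion
n <- 92        # number of days
x_obs <- 29    # days with gusts above 40 km/h (from the Task 2.4 output)

se <- sqrt(p0 * (1 - p0) / n)
z99 <- qnorm(0.995)   # two-sided 99% multiplier

interval <- c(p0 - z99 * se, p0 + z99 * se)
p_hat <- x_obs / n

# TRUE means the data are consistent with the hypothesis at this level.
inside <- p_hat >= interval[1] && p_hat <= interval[2]

print(interval)
print(inside)
```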

Task 2.4

Determine a 99% (Wilson) confidence interval for the unknown proportion of days in Autumn that exceed a max wind speed of 40km/h.

### Code for Task 2.4. Write your code here
install.packages("binom")
## 
## The downloaded binary packages are in
##  /var/folders/z9/q0mzcvxs23j48gm4k7zxyrpm0000gn/T//RtmpYOyj62/downloaded_packages
require(binom)
## Loading required package: binom
wilsons = binom.confint(x = length(WIND[WIND > 40]),n = length(WIND),conf.level = 0.99,method = "wilson")
wilsons
##   method  x  n      mean     lower     upper
## 1 wilson 29 92 0.3152174 0.2065089 0.4487855
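
The Wilson interval can also be computed without the binom package, using the standard closed-form expression for the score interval. The function name `wilson_interval` below is hypothetical and introduced only for this sketch.

```r
# Closed-form Wilson (score) interval for a proportion.
# `wilson_interval` is a helper defined here for illustration.
wilson_interval <- function(x, n, conf.level = 0.99) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p_hat <- x / n
  denom <- 1 + z^2 / n
  centre <- (p_hat + z^2 / (2 * n)) / denom
  half <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / denom
  c(lower = centre - half, upper = centre + half)
}

ci <- wilson_interval(29, 92)
print(ci)
```

The endpoints are the two values of p for which the observed proportion sits exactly z standard errors away from p, where the standard error is evaluated at p itself rather than at the observed proportion.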

Task 2.5

Perform a “sanity check”, and verify that the endpoints of your Wilson confidence interval in the previous task are such that the observed proportion xbar is right on the edge of a 99% prediction interval. Your answer should have two things:

  1. R code which prints appropriate output;
  2. one or two explanatory sentences (strictly no more than two).
### Code for Task 2.5. Write your code here
p1 = wilsons$lower
p2 = wilsons$upper
p1SE=sqrt(p1*(1-p1)/length(WIND))
p2SE=sqrt(p2*(1-p2)/length(WIND))
lower1 = p1-p1SE*qnorm(0.995)
lower2 = p1+p1SE*qnorm(0.995)
upper1 = p2-p2SE*qnorm(0.995)
upper2 = p2+p2SE*qnorm(0.995)
cat("Lower Interval:[",lower1, ",", lower2, "]")
## Lower Interval:[ 0.09780036 , 0.3152174 ]
cat("\n", "Upper Interval:[",upper1, ",", upper2, "]")
## 
##  Upper Interval:[ 0.3152174 , 0.5823537 ]

Answer to Task 2.5: xbar equals both the upper limit of the 99% prediction interval built at the lower Wilson endpoint and the lower limit of the one built at the upper Wilson endpoint, so xbar sits right on the edge of each 99% prediction interval.

Task 2.6

Compute the p-value for the hypothesis.

### Code for Task 2.6. Write your code here
z=(xbar-E)/SE
p=2*pnorm(abs(z),0,1, lower.tail=FALSE)
p
## [1] 0.005730506
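
As a cross-check, the same two-sided p-value can be obtained from prop.test() in base R with the continuity correction switched off, since its chi-squared statistic is then the square of the z statistic. This is a sketch using the counts 29 out of 92 from the Task 2.4 output.

```r
x_obs <- 29
n <- 92
p0 <- 0.2

# Manual z-test for a proportion under the hypothesised value.
z <- (x_obs / n - p0) / sqrt(p0 * (1 - p0) / n)
p_manual <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Same test via prop.test(), without continuity correction.
res <- prop.test(x = x_obs, n = n, p = p0,
                 alternative = "two.sided", correct = FALSE)

print(c(p_manual, res$p.value))
```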

Task 2.7

What is the conclusion of your hypothesis test at the 1% significance level? Is the observed proportion xbar significantly different from 20%? What assumptions do we need about our data in order to make our hypothesis test valid? Your answer should have three things:

  1. At most one sentence stating the conclusion of your hypothesis test.
  2. Three reasons for your conclusion. At most three sentences.
  3. At most two sentences explaining what assumptions we used during our hypothesis test.

Answer to Task 2.7:

  1. We reject the hypothesis at the 1% significance level: the observed proportion xbar = 0.3152 is significantly different from 20%.
  2. First, xbar = 0.3152 lies outside the 99% prediction interval [0.0926, 0.3074] from Task 2.3. Second, the hypothesised proportion 0.2 lies outside the 99% Wilson confidence interval, whose lower limit is 0.2065. Third, the p-value of about 0.0057 is below the 1% significance level.
  3. We assumed that n = 92 is large enough for the central limit theorem to apply to the sample proportion. We also assumed that the daily wind observations are independent of each other and come from the same distribution, like draws made with replacement.

Question 3

Task 3.1

What is the smallest prediction interval at which the data is consistent with the hypothesis? What percentage does this prediction interval correspond to? What do you notice about this interval? Write your answer after the Answer to Task 3.1 prompt provided below.

### Code for Task 3.1. Write your code here
z = (xbar-E)/SE
upper=E+z*SE
lower=E-z*SE
upper
## [1] 0.3152174
lower
## [1] 0.08478261
pnorm(z)
## [1] 0.9971347
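
Note that pnorm(z) is the one-sided area below z, while a symmetric prediction interval E ± z × SE covers the central area 2 × pnorm(z) - 1. The sketch below computes that two-sided level and compares it with one minus the two-sided p-value, using the counts 29 out of 92 from the Task 2.4 output.

```r
p0 <- 0.2
n <- 92
p_hat <- 29 / n

se <- sqrt(p0 * (1 - p0) / n)
z <- (p_hat - p0) / se

# Coverage of the symmetric interval p0 +/- z * se.
level <- 2 * pnorm(z) - 1

# Two-sided p-value, as in Task 2.6.
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(level)
print(1 - p_value)
```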

Answer to Task 3.1: The smallest prediction interval consistent with the hypothesis is the one with xbar exactly on its edge, [0.08478261, 0.3152174]. Its two-sided level is 2 * pnorm(z) - 1 ≈ 99.43%, rather than pnorm(z) ≈ 99.71%, which is only the one-sided area below z. This level is greater than 99% and equals one minus the p-value from Task 2.6 (1 - 0.0057 ≈ 0.9943).

Task 3.2

Suppose we only had access to \(n\) days from Autumn 2023, where \(n < 92\). If we assume that xbar would remain the same as in Task 2.1 regardless of the value of \(n\), what is the smallest value \(n\) such that we would reject the hypothesis at the 5% significance level (that is, with 95% confidence)? Hint: you may need to go through some trial and error. Your answer should have two things:

  1. R code which prints appropriate output;
  2. one or two explanatory sentences (strictly no more than two).
### Code for Task 3.2. Write your code here
z32 = 1.96
expect = 0.2
n = (expect*(1-expect)/((xbar-expect)^2/z32^2))
n
## [1] 46.30161

Answer to Task 3.2: The smallest value of n at which the hypothesis would be rejected at the 5% significance level is n = 47. Rejection requires n > 46.30, and n must be a whole number of days.
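
The trial-and-error approach suggested in the hint can be sketched as a direct search over every candidate n, as a check on the algebraic solution. This assumes, as the task does, that the observed proportion stays at 29/92 for every n, and it uses the same rounded critical value 1.96 as the code above.

```r
p0 <- 0.2          # hypothesised proportion
p_hat <- 29 / 92   # observed proportion, held fixed as the task assumes
z_crit <- 1.96     # rounded two-sided 5% critical value

n_grid <- 1:91     # candidate sample sizes, n < 92

# z statistic for each candidate n.
z_n <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n_grid)

# Smallest n at which the hypothesis is rejected.
n_min <- min(n_grid[abs(z_n) > z_crit])
print(n_min)
```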