Homework 1 | STAT 330

Data and Description

Energy can be produced from wind using windmills. Choosing a site for a wind farm (i.e. the location of the windmills), however, can be a multi-million dollar gamble. If wind is inadequate at the site, then the energy produced over the lifetime of the wind farm can be much less than the cost of building the operation. Hence, accurate prediction of wind speed at a candidate site can be an important component in the decision to build or not to build. Since energy produced varies as the square of the wind speed, even small errors in prediction can have serious consequences.

One possible solution to help predict wind speed at a candidate site is to use wind speed at a nearby reference site. A reference site is a nearby location where the wind speed is already being monitored and should, theoretically, be similar to the candidate site. Using information from the reference site will allow windmill companies to know the wind speed at the candidate site without going through a costly data collection period, if the reference site is a good predictor.

The Windmill data set contains measurements of wind speed (in meters per second, m/s) at a candidate site (CSpd, column 1) and at an accompanying reference site (RSpd, column 2) for 1,116 areas. Download the Windmill.txt file from Learning Suite, and put it in the same folder as this R Markdown file.

0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.

1. Briefly explain why simple linear regression is an appropriate tool to use in this situation.

Simple linear regression is appropriate here because both variables are quantitative, there is a single explanatory variable, and the goal is to determine whether wind speed at the reference site is a good predictor of wind speed at the candidate site. The fitted model lets us describe the relationship between the two speeds and predict the speed at the candidate site from the speed at the reference site.

2. Read in the data set, and call the tibble “wind”. Print a summary of the data and make sure the data makes sense.

library(tidyverse)  # assumed setup: provides read_table() (readr) and ggplot() (ggplot2)
wind = read_table("C:/Users/samlo/OneDrive/Documents/winter 2023/STAT 330 Statistical Modeling 2/windmill_data.txt")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   CSpd = col_double(),
##   RSpd = col_double()
## )

print(head(wind,n=10))

## # A tibble: 10 × 2
##     CSpd  RSpd
##    <dbl> <dbl>
##  1   6.9  5.97
##  2   7.1  7.22
##  3   7.8  7.94
##  4   6.9  6.02
##  5   5.5  6.16
##  6   3.1  1.77
##  7   6.8  4.65
##  8  11.4 10.9 
##  9  12.9 11.7 
## 10  13.5 13.0

summary(wind)

##       CSpd             RSpd        
##  Min.   : 0.400   Min.   : 0.2221  
##  1st Qu.: 6.100   1st Qu.: 4.7769  
##  Median : 8.800   Median : 7.5477  
##  Mean   : 9.019   Mean   : 7.7773  
##  3rd Qu.:11.500   3rd Qu.:10.2096  
##  Max.   :22.400   Max.   :21.6015

3. What is the outcome variable in this situation? (Think about which variable makes the most sense to be the response.)

The candidate site speed, CSpd.

4. What is the explanatory variable in this situation?

The reference site speed, RSpd.

5. Create a scatterplot of the data with variables on the appropriate axes. Make the plot square. Add descriptive axis labels with appropriate units. Save the plot to a variable and print the plot.

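scatter = ggplot(data = wind, aes(x = RSpd, y = CSpd)) +  # refer to columns by name inside aes()
  geom_point() +
  labs(x = "Reference site speed (m/s)", y = "Candidate site speed (m/s)") +
  coord_fixed() +  # one unit on the x-axis is drawn the same length as one unit on the y-axis
  theme_bw()
print(scatter)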

6. Briefly describe the relationship between RSpd and CSpd. (Hint: you should use 3 key words in a complete sentence that includes referencing the variables.)

There is a strong, positive, linear relationship between RSpd and CSpd. The points are more spread out around the trend at larger wind speeds.

7. Calculate the correlation coefficient for the two variables (you may use a built-in R function). Print the result.

r = cor(wind$RSpd,wind$CSpd)
print(r)

## [1] 0.7555948

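8. Briefly interpret the number you calculated for the correlation coefficient (what is the direction and strength of the correlation?).

The correlation of about 0.756 is positive and fairly strong: higher wind speeds at the reference site tend to go with higher wind speeds at the candidate site, and the points cluster reasonably closely around a line.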

9. Mathematically write out the theoretical/general simple linear regression model for this data set (using parameters (\(\beta\)s), not estimates, and not using matrix notation). Clearly explain which part of the model is deterministic and which part is random. Do not use “x” and “y” in your model - use variable names that are fairly descriptive.

\(\text{Candidate}_i = \beta_0 + \beta_1 \text{Reference}_i + \epsilon_i\), where \(\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)\)

The deterministic part is \(\beta_0 + \beta_1 \text{Reference}_i\), the mean candidate site speed for a given reference site speed. The random part is \(\epsilon_i\), the deviation of an individual observation from that mean.
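To make the split between the deterministic and random parts concrete, the following sketch simulates data from a model of this form and refits it. The parameter values, sample size, and variable names are illustrative assumptions, not estimates from the Windmill data.

```r
# Sketch: simulate from Candidate_i = beta0 + beta1 * Reference_i + eps_i.
# All numeric values below are assumed for illustration only.
set.seed(330)
n     <- 200
beta0 <- 3      # assumed intercept
beta1 <- 0.75   # assumed slope
sigma <- 2.5    # assumed error standard deviation

reference     <- runif(n, min = 0, max = 20)     # simulated reference speeds (m/s)
deterministic <- beta0 + beta1 * reference       # mean response, fixed given reference
random        <- rnorm(n, mean = 0, sd = sigma)  # iid N(0, sigma^2) errors
candidate     <- deterministic + random          # simulated candidate speeds (m/s)

# Refitting should recover values near the assumed beta0 and beta1
sim_fit <- lm(candidate ~ reference)
print(coef(sim_fit))
```

Rerunning with a different seed changes only the random part; the deterministic part stays the same for the same reference speeds.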

10. Add the OLS regression line to the scatterplot you created in question 5. Print the result. You can remove the standard error band with the option se = FALSE.

scatter + geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

11. (a) Apply linear regression to the data. (b) Print out a summary of the results from the lm function. (c) Save the residuals and fitted values to the wind tibble. (d) Print the first few rows of the wind tibble.

wind_lm = lm(CSpd~RSpd, data = wind)
summary(wind_lm)

## 
## Call:
## lm(formula = CSpd ~ RSpd, data = wind)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7877 -1.5864 -0.1994  1.4403  9.1738 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.14123    0.16958   18.52   <2e-16 ***
## RSpd         0.75573    0.01963   38.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.466 on 1114 degrees of freedom
## Multiple R-squared:  0.5709, Adjusted R-squared:  0.5705 
## F-statistic:  1482 on 1 and 1114 DF,  p-value: < 2.2e-16

wind$residuals = wind_lm$residuals
wind$fitted = wind_lm$fitted.values
head(wind)  # (d) print the first few rows, now including residuals and fitted values
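As a quick consistency check, in simple linear regression the Multiple R-squared reported by summary() equals the square of the correlation coefficient between the two variables. Using the value of r computed earlier:

\[
R^2 = r^2 = (0.7555948)^2 \approx 0.5709
\]

This agrees with the Multiple R-squared in the output, so roughly 57% of the variability in candidate site speed is accounted for by reference site speed.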

12. Briefly explain the rationale behind the ordinary least-squares model fit (how does OLS choose the parameter estimates?).

OLS chooses the intercept and slope that minimize the sum of squared residuals, that is, the squared vertical distances between the observed points and the line. The distances are squared so that positive and negative residuals do not cancel each other out, and so that large misses are penalized more heavily than small ones.
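Written out, the criterion described above is the following, where \(\bar{C}\) and \(\bar{R}\) are shorthand introduced here for the sample means of the candidate and reference speeds:

\[
(\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left(\text{Candidate}_i - \beta_0 - \beta_1 \text{Reference}_i\right)^2
\]

Setting the partial derivatives to zero gives the familiar closed-form solution:

\[
\hat\beta_1 = \frac{\sum_{i=1}^{n} (\text{Reference}_i - \bar{R})(\text{Candidate}_i - \bar{C})}{\sum_{i=1}^{n} (\text{Reference}_i - \bar{R})^2},
\qquad
\hat\beta_0 = \bar{C} - \hat\beta_1 \bar{R}
\]

Plugging in the means from the data summary and the slope from lm gives \(9.019 - 0.75573 \times 7.7773 \approx 3.141\), which matches the reported intercept up to rounding.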

13. Mathematically write out the fitted simple linear regression model for this data set using the coefficients you found above (do not use parameters/\(\beta\)s and do not use matrix notation). Do not use “x” and “y” in your model - use variable names that are fairly descriptive.

\(\widehat{\text{Candidate}}_i = 3.141 + 0.756\,\text{Reference}_i\)

14. Interpret the coefficient for the slope.

For each 1 m/s increase in wind speed at the reference site, we estimate that the average wind speed at the candidate site increases by 0.756 m/s.

15. Interpret the coefficient for the intercept.

When the wind speed at the reference site is 0 m/s, we estimate that the average wind speed at the candidate site is 3.141 m/s.

16. What is the average wind speed at the candidate site (CSpd) when the wind speed at the reference site (RSpd) is 12 m/s? Show your code, and print the result.

wind_lm$coefficients[1] + wind_lm$coefficients[2]*12

## (Intercept) 
##    12.21003
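The same kind of calculation can also be done with predict(), which avoids indexing the coefficients by position. The sketch below uses simulated data (the data frame `sim` and its parameter values are assumptions for illustration, not the Windmill data) to show that the two approaches agree.

```r
# Sketch: compare a by-hand prediction with predict() on simulated data.
set.seed(1)
sim <- data.frame(RSpd = runif(100, min = 0, max = 20))
sim$CSpd <- 3 + 0.75 * sim$RSpd + rnorm(100, sd = 2.5)  # assumed parameter values
sim_lm <- lm(CSpd ~ RSpd, data = sim)

# Intercept + slope * 12, computed from the coefficient vector
by_hand <- unname(coef(sim_lm)[1] + coef(sim_lm)[2] * 12)

# Same prediction via predict(); newdata must use the predictor's column name
by_predict <- unname(predict(sim_lm, newdata = data.frame(RSpd = 12)))

print(c(by_hand = by_hand, by_predict = by_predict))
```

With the real fit, the equivalent call would be predict(wind_lm, newdata = data.frame(RSpd = 12)).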

17. Briefly explain why it would be wrong to answer this question: What is the average wind speed at the candidate site (CSpd) when the wind speed at the reference site (RSpd) is 25 m/s?

This would be extrapolation: 25 m/s is above the largest reference site speed in the data (about 21.6 m/s), so we have no observations to tell us whether the linear relationship continues to hold at that speed. The fitted line can be extended to 25 m/s, but the resulting prediction is not supported by the data and could be badly wrong.
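One way to guard against accidental extrapolation is to check a new value against the observed range before predicting. The helper `in_observed_range` below is hypothetical (it is not part of the assignment or of any package), and the range is copied from the Min. and Max. of RSpd in the summary output.

```r
# Hypothetical helper: is a new reference speed inside the observed range?
in_observed_range <- function(new_x, observed_x) {
  new_x >= min(observed_x) & new_x <= max(observed_x)
}

# Min. and Max. of RSpd from summary(wind), in m/s
observed_range <- c(0.2221, 21.6015)

in_observed_range(12, observed_range)  # TRUE: 12 m/s is within the data
in_observed_range(25, observed_range)  # FALSE: 25 m/s would be extrapolation
```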

18. Calculate the estimate of \(\sigma^2\), the average squared variability of the residuals around the line. Show your code, and print the result.

sum(wind$residuals ^ 2) / wind_lm$df.residual

## [1] 6.082312
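In symbols, with \(e_i\) denoting the residuals, the calculation above divides the sum of squared residuals by the residual degrees of freedom, \(n - 2 = 1116 - 2 = 1114\), because two coefficients were estimated:

\[
\hat\sigma^2 = \frac{1}{n - 2} \sum_{i=1}^{n} e_i^2 \approx 6.082,
\qquad
\hat\sigma = \sqrt{6.082312} \approx 2.466
\]

The square root agrees with the Residual standard error of 2.466 reported by summary().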

19. Create the design matrix and store it in a variable. Print the first few rows of the design matrix.

x = cbind(rep(1, length(wind$RSpd)), wind$RSpd)
print(head(x,n=3))

##      [,1]   [,2]
## [1,]    1 5.9666
## [2,]    1 7.2176
## [3,]    1 7.9405

20. Obtain, and print, the parameter estimates for this data set (found above using lm) using matrix multiplication. You should use the following in your computations: t() [transpose], solve() [inverse], %*% [matrix multiplication].

x_t_x_inv = solve(t(x)%*%x)
beta = x_t_x_inv%*%t(x)%*%wind$CSpd
print(beta)

##           [,1]
## [1,] 3.1412324
## [2,] 0.7557333
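The computation above is the normal-equations solution \(\hat\beta = (X^\top X)^{-1} X^\top y\). As a sketch on simulated data (the variable names and parameter values are assumptions for illustration), the same estimates can be obtained by solving the linear system directly with solve(A, b), which avoids forming the inverse explicitly, and then compared against lm.

```r
# Sketch: normal equations versus lm() on simulated data.
set.seed(2)
n <- 50
reference <- runif(n, min = 0, max = 20)
candidate <- 3 + 0.75 * reference + rnorm(n, sd = 2.5)  # assumed parameter values

# Design matrix: a column of ones for the intercept, then the predictor
X <- cbind(1, reference)

# Solve (X'X) b = X'y; crossprod(X) is t(X) %*% X, crossprod(X, y) is t(X) %*% y
beta_hat <- solve(crossprod(X), crossprod(X, candidate))

lm_coef <- coef(lm(candidate ~ reference))
print(cbind(matrix = drop(beta_hat), lm = lm_coef))
```

For this assignment the explicit solve(t(x) %*% x) form is what the question asks for; the solve(A, b) form is a common alternative in practice.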

21. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.

I learned how the OLS fit works: it finds the line that minimizes the sum of squared vertical distances from the data points to the line. The sign of the slope tells us whether the relationship is positive or negative, and the correlation coefficient and R-squared tell us how strong the linear relationship is. I also saw that the estimates from lm can be reproduced with matrix algebra.

22. Briefly summarize what you learned from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.

The purpose of this analysis was to determine whether wind speed measurements from a nearby reference site can be used to estimate wind speed at a candidate site for a wind farm. Estimating these values from existing measurements is much cheaper than running a new data collection period at the candidate site. The analysis shows a fairly strong relationship between the two sites: when wind speed is higher at the reference site, it tends to be higher at the candidate site as well. For every additional 1 m/s of wind at the reference site, we expect about 0.76 m/s more wind at the candidate site on average. The reference site readings therefore give a useful estimate of what to expect at the candidate site, although individual estimates can still be off by a few m/s.