Homework 1 | STAT 330

Simple Linear Regression

Author

Noah Champagne

Data and Description

Energy can be produced from wind using windmills. Choosing a site for a wind farm (i.e., the location of the windmills), however, can be a multimillion-dollar gamble. If wind is inadequate at the site, then the energy produced over the lifetime of the wind farm can be much less than the cost of building the operation. Hence, accurate prediction of wind speed at a candidate site can be an important component in the decision to build or not to build. Because the power available in wind is proportional to the cube of the wind speed, even small errors in wind speed prediction can have a large impact on projected energy output.

A potential approach for predicting wind speed at a candidate site is to use wind speed data from a nearby reference site. A reference site is a nearby location where the wind speed is already being monitored and should, theoretically, be similar to the candidate site. If the reference sites turn out to be good predictors of the candidate sites, leveraging the information from the reference site would allow windmill companies to estimate the wind speed at the candidate sites without going through a costly and lengthy data collection period.

The Windmill data set contains measurements of wind speed (in meters per second, m/s) at candidate sites (CSpd, column 1) and at corresponding reference sites (RSpd, column 2) for 1,116 sites. Download the Windmill.txt file from Canvas (which can be found in Files -> Data Sets), and put it in the same folder as this Quarto file.

0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.

1. Briefly explain why simple linear regression could be a useful tool for this problem.

The goal of this assignment is to determine whether the wind speed at a reference site is a good predictor of the wind speed at a candidate site. Since we have one quantitative explanatory variable (RSpd) and one quantitative response variable (CSpd), simple linear regression is a natural tool: it lets us check whether the two are linearly related, quantify that relationship, and predict CSpd from RSpd.

2. Read in the data set into an object named “wind”. Display the first 6 rows of the data.

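# Load the tidyverse (readr for read_table, plus ggplot2 and dplyr used below)
library(tidyverse)

# Read the whitespace-delimited wind speed data
wind = read_table("Windmill.txt")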


── Column specification ────────────────────────────────────────────────────────
cols(
  CSpd = col_double(),
  RSpd = col_double()
)

head(wind, 6)

# A tibble: 6 × 2
   CSpd  RSpd
  <dbl> <dbl>
1   6.9  5.97
2   7.1  7.22
3   7.8  7.94
4   6.9  6.02
5   5.5  6.16
6   3.1  1.77

3. What is the response/outcome variable in this situation? (Think about which variable makes the most sense to be the response.)

The response variable is CSpd, since we want to know whether the reference site is a good predictor of the candidate site.

4. What is the explanatory variable in this situation?

RSpd.

5. Create a scatterplot of the data with variables on the appropriate axes. Add descriptive axis labels with appropriate units. Display the plot.

# Scatterplot with the explanatory variable (RSpd) on the x-axis

plot = ggplot(wind, mapping = aes(x = RSpd, y = CSpd)) +
  geom_point() + 
  labs(
    x = "Reference Site Wind Speed (m/s)",
    y = "Candidate Site Wind Speed (m/s)",
    title = "Wind Speed at Reference vs Candidate Sites"
  )

plot

6. Briefly describe the relationship between RSpd and CSpd in terms of its linearity, direction, and strength.

RSpd and CSpd appear to have a relatively strong positive linear correlation.

7. Calculate the correlation coefficient for the two variables (you may use a built-in R function). Display the result.

# Correlation between reference-site and candidate-site wind speeds

cor(wind$RSpd, wind$CSpd)

[1] 0.7555948

8. What does the calculated correlation coefficient tell you about the direction and strength of the linear association between these two variables?

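The correlation coefficient is about 0.76. Because it is positive and fairly close to 1, it tells us that there is a fairly strong positive linear association: higher wind speeds at the reference site tend to go with higher wind speeds at the candidate site.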
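
For reference, the sample correlation that `cor()` computes by default (Pearson's) can be written in this data set's notation as follows, where the bars denote sample means over the \(n\) observations:

\[r = \frac{\sum_{i=1}^{n} (\text{RSpd}_i - \overline{\text{RSpd}})(\text{CSpd}_i - \overline{\text{CSpd}})}{\sqrt{\sum_{i=1}^{n} (\text{RSpd}_i - \overline{\text{RSpd}})^2}\,\sqrt{\sum_{i=1}^{n} (\text{CSpd}_i - \overline{\text{CSpd}})^2}}\]

The formula is symmetric in the two variables, so swapping the arguments to `cor()` would give the same value. Squaring the reported value gives \(0.7555948^2 \approx 0.571\), which in simple linear regression equals \(R^2\): roughly 57% of the variability in CSpd is accounted for by its linear relationship with RSpd.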

9. The equation below shows the generic theoretical simple linear regression model. Update the equation to use variable names that are meaningful in context of this dataset (i.e., rename “x” and “y”).

\[\begin{align} \text{CSpd}_i &= \beta_0 + \beta_1 \times \text{RSpd}_i + \epsilon_i \\ \epsilon_i &\stackrel{iid}{\sim} N(0, \sigma^2) \end{align}\]

10. Add the OLS regression line to the scatterplot you created in 5. Display the result. (If you use `ggplot` with `geom_smooth`, you can remove the standard error band with the option `se = FALSE`.)

# Overlay the OLS regression line on the scatterplot from question 5

plot + 
  geom_smooth(method = "lm", 
              color = "red", se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

11. (a) Apply linear regression to the data. (b) Display a summary of the results from the `lm` function. (c) Plot the model residuals against fitted values in a well-labeled scatterplot.

# Fit the simple linear regression of CSpd on RSpd

model = lm(CSpd ~ RSpd, data = wind)
model


Call:
lm(formula = CSpd ~ RSpd, data = wind)

Coefficients:
(Intercept)         RSpd  
     3.1412       0.7557

# Attach the residuals and fitted values to the data for plotting
wind = wind %>% mutate(
  residuals = resid(model),
  fitted.values = fitted(model)
)


ggplot(wind, aes(x = fitted.values, y = residuals)) +
  geom_point() +
  labs(
    x = "Fitted Values",
    y = "Residuals",
    title = "Residuals vs Fitted Values"
  )
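
Part (b) refers to the fuller report that `summary()` produces for an `lm` fit (standard errors, t statistics, residual standard error, and R-squared). Below is a minimal sketch of how those pieces can be pulled out. It uses simulated stand-in data, so the data frame `sim`, its true coefficients (3 and 0.75), and the noise level are assumptions made for illustration, not the Windmill measurements.

```r
# Simulated stand-in data (hypothetical values, not the Windmill measurements)
set.seed(330)
sim <- data.frame(RSpd = runif(200, min = 0, max = 20))
sim$CSpd <- 3 + 0.75 * sim$RSpd + rnorm(200, mean = 0, sd = 2.5)

# Fit the same kind of model and store its summary
fit <- lm(CSpd ~ RSpd, data = sim)
fit_summary <- summary(fit)

# Coefficient table: estimate, standard error, t value, p-value
fit_summary$coefficients

# Residual standard error and R-squared
fit_summary$sigma
fit_summary$r.squared
```

With the real data, the same calls on `model` would report these quantities for the Windmill fit.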

12. Update the fitted simple linear regression model shown below with the coefficients you obtained from your linear model (i.e., update the \(\hat{\beta}\)s with specific numeric values). Round to two decimal places.

\[\widehat{\text{Cand Speed}}_i = \hat{\beta}_0 + \hat{\beta}_1 \times \text{Ref Speed}_i.\]

\[\widehat{\text{Cand Speed}}_i = 3.14 + 0.76 \times \text{Ref Speed}_i.\]

13. Interpret the coefficient for the slope.

For every 1 m/s increase in the reference site’s wind speed, the candidate site’s wind speed is expected to increase by 0.76 m/s on average.
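
The slope is closely tied to the correlation from question 7. As a short check, using the unrounded slope reported in question 19 and writing \(s\) for a sample standard deviation (notation introduced here, not in the assignment):

\[\hat{\beta}_1 = r \times \frac{s_{\text{CSpd}}}{s_{\text{RSpd}}} \quad\Longrightarrow\quad \frac{s_{\text{CSpd}}}{s_{\text{RSpd}}} = \frac{\hat{\beta}_1}{r} = \frac{0.7557333}{0.7555948} \approx 1.0002\]

This suggests the two sites' wind speeds have nearly the same spread, which is why the slope and the correlation are almost identical here. In general they differ whenever the two standard deviations differ.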

14. Interpret the coefficient for the intercept.

When the reference site wind speed is 0 m/s, the predicted wind speed at the candidate site is 3.14 m/s.

15. What is the estimated average wind speed at the candidate site (CSpd) when the wind speed at the reference site (RSpd) is 12 m/s? Show your code, and display the result.

# Predict the average CSpd when RSpd = 12 m/s
new_data = data.frame(RSpd = 12)
predict(model, newdata = new_data)

       1 
12.21003
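
The same number can be reproduced by hand. The first line below uses the unrounded coefficients reported in question 19; the second uses the two-decimal coefficients from question 12, to show how much rounding alone shifts the answer:

\[\begin{align} \widehat{\text{Cand Speed}} &= 3.1412324 + 0.7557333 \times 12 = 3.1412324 + 9.0687996 \approx 12.21 \\ \widehat{\text{Cand Speed}} &\approx 3.14 + 0.76 \times 12 = 3.14 + 9.12 = 12.26 \end{align}\]

The two results differ by about 0.05 m/s, so it is safer to let `predict()` work from the stored coefficients rather than from rounded values.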

16. Briefly explain why it would be inadvisable to answer this question using your model: What is the estimated average wind speed at the candidate site (CSpd) when the wind speed at the reference site (RSpd) is 25 m/s?

If we look at the original scatterplot, there isn’t any data past about 21–22 m/s for the reference site wind speed. Therefore, using the model to estimate the candidate site wind speed when the reference site wind speed is 25 m/s would be extrapolating and give an unreliable estimate.
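
One way to make this check explicit is to compare a new predictor value against the observed range before predicting. The sketch below is illustrative only: the helper `within_observed_range()` and the vector `rspd_observed` are hypothetical names, and the values are made up to span roughly the range visible in the scatterplot rather than taken from the Windmill data.

```r
# Hypothetical reference-site speeds spanning roughly the range seen in the
# scatterplot (illustrative values, not the Windmill data)
rspd_observed <- seq(0.5, 21.5, by = 0.5)

# TRUE when a new predictor value lies inside the observed range
within_observed_range <- function(x_new, x_observed) {
  x_new >= min(x_observed) & x_new <= max(x_observed)
}

within_observed_range(12, rspd_observed)  # inside the range: interpolation
within_observed_range(25, rspd_observed)  # outside the range: extrapolation
```

With the real data, `range(wind$RSpd)` would give the exact limits to compare against.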

17. Calculate the (unbiased) estimate of \(\sigma^2\). Show your code, and display the result.

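# Residual standard error from the fitted model
sigma_hat = sigma(model)
# Squaring it gives the unbiased estimate of sigma^2, i.e. SSE / (n - 2)
sigma2 = sigma_hat^2
sigma2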

[1] 6.082312
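
The estimate is unbiased because the sum of squared residuals is divided by \(n - 2\) rather than \(n\), since two coefficients were estimated. Below is a small sketch confirming that `sigma()` squared matches that hand calculation. It runs on simulated stand-in data (the data frame `sim` and its parameters are assumptions for illustration), not on the Windmill file.

```r
# Simulated stand-in data (hypothetical values, not the Windmill measurements)
set.seed(330)
n <- 200
sim <- data.frame(RSpd = runif(n, min = 0, max = 20))
sim$CSpd <- 3 + 0.75 * sim$RSpd + rnorm(n, mean = 0, sd = 2.5)
fit <- lm(CSpd ~ RSpd, data = sim)

# Sum of squared residuals divided by the residual degrees of freedom
sse <- sum(resid(fit)^2)
sigma2_by_hand <- sse / (n - 2)

# Built-in route used in the answer above
sigma2_builtin <- sigma(fit)^2

all.equal(sigma2_by_hand, sigma2_builtin)
```

For the Windmill fit, the corresponding divisor would be 1,116 - 2 = 1,114.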

18. Create the design matrix for your model and store it in a variable. Display the first few rows of the design matrix.

X = model.matrix(model)
head(X)

  (Intercept)   RSpd
1           1 5.9666
2           1 7.2176
3           1 7.9405
4           1 6.0174
5           1 6.1646
6           1 1.7687

19. Obtain the parameter estimates for this same model using matrix multiplication. You should use the following in your computations: t() [transpose], solve() [inverse], and %*% [matrix multiplication]. Display the result.

y = wind$CSpd
result = solve(t(X) %*% X) %*% t(X) %*% y
result

                 [,1]
(Intercept) 3.1412324
RSpd        0.7557333
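
The calculation above is the least squares formula \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). As an aside, the same estimates can be obtained without forming the inverse explicitly, by handing the system \(X^\top X \beta = X^\top y\) to `solve()` with two arguments. The sketch below compares both routes with `lm()` on simulated stand-in data. The variables `rspd` and `cspd` and their parameters are assumptions for illustration, and this is an alternative to the approach the question asks for, not a replacement.

```r
# Simulated stand-in data (hypothetical values, not the Windmill measurements)
set.seed(330)
n <- 200
rspd <- runif(n, min = 0, max = 20)
cspd <- 3 + 0.75 * rspd + rnorm(n, mean = 0, sd = 2.5)

# Design matrix: a column of ones for the intercept plus the predictor
X <- cbind(Intercept = 1, RSpd = rspd)
y <- cspd

# Route 1: explicit inverse, as in the answer above
beta_inverse <- solve(t(X) %*% X) %*% t(X) %*% y

# Route 2: solve the linear system directly; crossprod(X) is t(X) %*% X
beta_solve <- solve(crossprod(X), crossprod(X, y))

# Reference: coefficients from lm()
beta_lm <- coef(lm(cspd ~ rspd))

all.equal(as.vector(beta_inverse), as.vector(beta_solve))
all.equal(as.vector(beta_solve), unname(beta_lm))
```

Solving the system directly is generally preferred numerically, although for a two-column design matrix like this one the difference is negligible.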

20. Briefly summarize what you learned, personally, from this analysis about statistics, the model fitting process, R, etc.

This analysis was pretty fun; I enjoyed learning how to apply simple linear regression to a data set. My biggest takeaway was the new things I learned in R. After looking up a few things, it was really satisfying to see that I could get the same intercept and slope coefficients using linear algebra as with the linear model. I didn’t know how to do matrix inverses, matrix multiplication, or use the solve() function before.

21. Briefly summarize what you learned about the scientific question from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.

The purpose of analyzing this data was to see if we could predict the wind speed at a candidate site based on measurements taken at a reference site. This is important because if we choose a candidate site with low wind speed, we could lose millions of dollars.

In our analysis, we found that there is a strong relationship between the reference site and candidate site wind speeds. This means that we can reasonably predict what a candidate site’s wind speed will be based on measurements taken at a nearby reference site, which will help us determine the best places to build wind farms.