Homework 1 | STAT 330

Simple Linear Regression

Author

< Spencer Hamilton >

Data and Description

Energy can be produced from wind using windmills. Choosing a site for a wind farm (i.e. the location of the windmills), however, can be a multi-million dollar gamble. If wind is inadequate at the site, then the energy produced over the lifetime of the wind farm can be much less than the cost of building the operation. Hence, accurate prediction of wind speed at a candidate site can be an important component in the decision to build or not to build. Since energy produced varies as the square of the wind speed, even small errors in prediction can have serious consequences.

One possible solution to help predict wind speed at a candidate site is to use wind speed at a nearby reference site. A reference site is a nearby location where the wind speed is already being monitored and should, theoretically, be similar to the candidate site. Using information from the reference site will allow windmill companies to estimate the wind speed at the candidate site without going through a costly data collection period, if the reference site is a good predictor.

The Windmill data set contains measurements of wind speed (in meters per second m/s) at a candidate site (CSpd) (column 1) and at an accompanying reference site (RSpd) (column 2) for 1,116 areas. Download the Windmill.txt file from Canvas (which can be found in Files -> Data Sets), and put it in the same folder as this quarto file.

0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.

1. Briefly explain why simple linear regression could be a useful tool for this problem.

Simple linear regression could help quantify how wind speed at the candidate site relates to wind speed at the reference site. If the relationship is strong and roughly linear, we can predict candidate-site wind speed from reference-site measurements that already exist, which helps us decide whether building a wind farm at the candidate site is a sound investment.

2. Read in the data set, and call the tibble “wind”. Print a summary of the data and make sure the data makes sense.

wind <- read.table('Windmill.txt', header = TRUE)
summary(wind)

      CSpd             RSpd        
 Min.   : 0.400   Min.   : 0.2221  
 1st Qu.: 6.100   1st Qu.: 4.7769  
 Median : 8.800   Median : 7.5477  
 Mean   : 9.019   Mean   : 7.7773  
 3rd Qu.:11.500   3rd Qu.:10.2096  
 Max.   :22.400   Max.   :21.6015

3. What is the outcome variable in this situation? (Think about which variable makes the most sense to be the response.)

The response variable is CSpd (candidate-site wind speed), because it is the quantity we want to predict from RSpd (reference-site wind speed).

4. What is the explanatory variable in this situation?

The explanatory variable must then be RSpd, since it is what we use to predict CSpd. In other words, we measured the reference site’s wind speed many times in order to predict the wind speed at the candidate site.

5. Create a scatterplot of the data with variables on the appropriate axes. Add descriptive axis labels with appropriate units. Print the plot.

library(ggplot2)

# Build the scatterplot once, store it for reuse in question 10, then print it.
# Note: the axis-label arguments to labs() are lowercase `x` and `y`.
saved_plot <- ggplot(data = wind) +
  geom_point(mapping = aes(x = RSpd, y = CSpd)) +
  labs(title = "Scatterplot of RSpd and CSpd",
       x = "Reference site wind speed (m/s)",
       y = "Candidate site wind speed (m/s)")
saved_plot

6. Briefly describe the relationship between RSpd and CSpd. (Hint: you should use 3 key words in a complete sentence that includes referencing the variables.)

There is a fairly strong, positive, roughly linear relationship between RSpd and CSpd: as reference-site wind speed increases, candidate-site wind speed tends to increase as well.

7. Calculate the correlation coefficient for the two variables (you may use a built-in R function). Print the result.

r <- cor(wind$RSpd, wind$CSpd)
print(r)

8. Briefly interpret the number you calculated for the correlation coefficient (what is the direction and strength of the correlation?).

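The correlation between RSpd and CSpd is positive and fairly strong (roughly 0.76, the square root of the R-squared of 0.5709 reported in the regression summary for question 11), so higher reference-site wind speeds tend to go with higher candidate-site wind speeds.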
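As a rough illustration of what `cor()` computes, the sketch below builds the Pearson correlation from its definition and compares it with the built-in function. It uses synthetic data as a stand-in for the Windmill file (the simulated values and the names `r_manual`, `r_builtin`, and `r_squared` are assumptions made for this example, not part of the assignment).

```r
# Synthetic stand-in for the Windmill data (values are made up for illustration)
set.seed(330)
RSpd <- runif(200, min = 0, max = 20)
CSpd <- 3 + 0.75 * RSpd + rnorm(200, sd = 2.5)

# Pearson correlation from its definition:
# sum of cross-deviations divided by the square root of the product
# of the two sums of squared deviations
r_manual <- sum((RSpd - mean(RSpd)) * (CSpd - mean(CSpd))) /
  sqrt(sum((RSpd - mean(RSpd))^2) * sum((CSpd - mean(CSpd))^2))

# Built-in version, plus R-squared from a simple linear regression
r_builtin <- cor(RSpd, CSpd)
r_squared <- summary(lm(CSpd ~ RSpd))$r.squared

print(c(manual = r_manual, builtin = r_builtin, sqrt_R2 = sqrt(r_squared)))
```

In simple linear regression the squared correlation equals R-squared, which is why the three printed values should agree when the slope is positive.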

9. Mathematically write out the theoretical simple linear regression model for this data set (using parameters (\(\beta\)s), not estimates, and not using matrix notation). Your answer should include your assumptions on the error term. Do not use “x” and “y” in your model - use variable names that are fairly descriptive.

< your response here. Note that you can write math in R markdown by surrounding the math in dollar signs. For example:

\(\beta_0\) (intercept - beta with zero subscript)

\(\times\) (multiplication symbol)

\(\text{Weight}_i\) (i subscript on variable name not italicized)

\(\epsilon_i\) (error term with i subscript) >

\(\text{CSpd}_i = \beta_0 + \beta_1 \times \text{RSpd}_i + \epsilon_i\), where \(\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)\)

10. Add the OLS regression line to the scatterplot you created in 5. Print the result. (If you use `ggplot` with `geom_smooth`, you can remove the standard error line with the option `se = FALSE`).

# <your code here>

saved_plot +
  geom_smooth(mapping = aes(x = RSpd, y = CSpd),
              method = "lm",
              se = FALSE)

`geom_smooth()` using formula = 'y ~ x'

11. (a) Apply linear regression to the data. (b) Print out a summary of the results from the `lm` function. (c) Save the residuals and fitted values to the `wind` tibble. (d) Print the first few rows of the `wind` tibble.

#a
lm_wind <- lm(CSpd ~ RSpd, data = wind)
#b
summary(lm_wind)


Call:
lm(formula = CSpd ~ RSpd, data = wind)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7877 -1.5864 -0.1994  1.4403  9.1738 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.14123    0.16958   18.52   <2e-16 ***
RSpd         0.75573    0.01963   38.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.466 on 1114 degrees of freedom
Multiple R-squared:  0.5709,    Adjusted R-squared:  0.5705 
F-statistic:  1482 on 1 and 1114 DF,  p-value: < 2.2e-16

#c
wind$residuals <- residuals(lm_wind)
wind$fitted_values <- fitted(lm_wind)

#d
print(head(wind))

  CSpd   RSpd  residuals fitted_values
1  6.9 5.9666 -0.7503908      7.650391
2  7.1 7.2176 -1.4958132      8.595813
3  7.8 7.9405 -1.3421328      9.142133
4  6.9 6.0174 -0.7887821      7.688782
5  5.5 6.1646 -2.3000260      7.800026
6  3.1 1.7687 -1.3778979      4.477898

12. Briefly explain the rationale behind minimizing squared error loss in order to obtain parameter estimates.

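Our goal is to find the best-fitting line, so we need to define what "best" means. We could simply connect the dots, but that would chase noise rather than capture the general trend in the data. Instead, we choose the line that makes the residuals as small as possible overall. Squaring the residuals keeps positive and negative errors from cancelling, penalizes large misses more heavily than small ones, and produces a smooth function of the parameters. That last property lets us use calculus to find the minimizing intercept and slope in closed form, which is computationally efficient.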
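The idea can be sketched in code: write the sum of squared errors as a function of the intercept and slope, then compare the calculus-based closed-form minimizer with a generic numerical optimizer. The data here are simulated, and the names `sse`, `slope_hat`, `intercept_hat`, and `numeric_fit` are hypothetical helpers introduced only for this illustration.

```r
# Synthetic stand-in for the Windmill data (values are made up for illustration)
set.seed(330)
RSpd <- runif(200, min = 0, max = 20)
CSpd <- 3 + 0.75 * RSpd + rnorm(200, sd = 2.5)

# Sum of squared errors for a candidate (intercept, slope) pair
sse <- function(beta) sum((CSpd - beta[1] - beta[2] * RSpd)^2)

# Closed-form minimizers, obtained by setting the derivatives of SSE to zero
slope_hat <- sum((RSpd - mean(RSpd)) * (CSpd - mean(CSpd))) /
  sum((RSpd - mean(RSpd))^2)
intercept_hat <- mean(CSpd) - slope_hat * mean(RSpd)

# Numerical minimization of the same loss, for comparison
numeric_fit <- optim(par = c(0, 0), fn = sse, method = "BFGS")

print(c(intercept = intercept_hat, slope = slope_hat))  # closed form
print(numeric_fit$par)                                   # numerical search
print(coef(lm(CSpd ~ RSpd)))                             # lm() for reference
```

All three approaches should land on essentially the same line; the closed form gets there without any iterative search.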

13. Mathematically write out the fitted simple linear regression model for this data set using the coefficients you found above (do not use parameters/\(\beta\)s and do not use matrix notation). Do not use “x” and “y” in your model - use variable names that are fairly descriptive.

\(\widehat{\text{CSpd}}_i = 3.141 + 0.756 \times \text{RSpd}_i\)

14. Interpret the coefficient for the slope.

For each 1 m/s increase in reference-site wind speed (RSpd), the estimated average candidate-site wind speed (CSpd) increases by about 0.756 m/s.

15. Interpret the coefficient for the intercept.

When the reference-site wind speed (RSpd) is 0 m/s, the estimated average candidate-site wind speed (CSpd) is about 3.14 m/s.

16. What is the estimated average wind speed at the candidate site (CSpd) when the wind speed at the reference site (RSpd) is 12 m/s? Show your code, and print the result.

wind_coefs <- coef(lm_wind)
speed_at_12 <- wind_coefs[1] + wind_coefs[2] * 12
print(speed_at_12)

(Intercept) 
   12.21003
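An alternative to combining the coefficients by hand is `predict()` with a `newdata` data frame. The sketch below, on simulated data, shows that the two approaches agree; `sim_wind` and `sim_fit` are stand-ins assumed for this example and are not objects from the analysis above.

```r
# Synthetic stand-in for the Windmill data (values are made up for illustration)
set.seed(330)
sim_wind <- data.frame(RSpd = runif(200, min = 0, max = 20))
sim_wind$CSpd <- 3 + 0.75 * sim_wind$RSpd + rnorm(200, sd = 2.5)
sim_fit <- lm(CSpd ~ RSpd, data = sim_wind)

# Prediction at RSpd = 12 m/s, built from the coefficients;
# unname() drops the "(Intercept)" label inherited from coef()
by_hand <- unname(coef(sim_fit)[1] + coef(sim_fit)[2] * 12)

# Same prediction via predict(); the column name must match the model's predictor
by_predict <- unname(predict(sim_fit, newdata = data.frame(RSpd = 12)))

print(c(by_hand = by_hand, by_predict = by_predict))
```

Using `predict()` scales more easily to several new values at once, since `newdata` can hold any number of rows.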

17. Briefly explain why it would be risky to answer this question: What is the estimated average wind speed at the candidate site (CSpd) when the wind speed at the reference site (RSpd) is 25 m/s?

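It would be risky because we would be extrapolating: the largest reference-site wind speed in the data is about 21.6 m/s, so 25 m/s lies outside the observed range, and we have no evidence that the linear relationship continues to hold there.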

18. Calculate the (unbiased) estimate of \(\sigma^2\), the average squared variability of the residuals around the line. Show your code, and print the result.

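# sigma() returns the residual standard error, i.e. the estimate of sigma.
# Squaring it (about 6.08 here) gives the unbiased estimate of sigma^2.
wind_sig <- sigma(lm_wind)
print(wind_sig)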

[1] 2.466234
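Because the question asks for \(\sigma^2\) rather than \(\sigma\), it may help to see the variance estimate computed directly. The sketch below, on simulated data, divides the residual sum of squares by the residual degrees of freedom (n minus 2 for simple linear regression) and checks it against the square of `sigma()`. The objects `sim_wind`, `sim_fit`, and `sigma2_hat` are assumptions made for this example.

```r
# Synthetic stand-in for the Windmill data (values are made up for illustration)
set.seed(330)
sim_wind <- data.frame(RSpd = runif(200, min = 0, max = 20))
sim_wind$CSpd <- 3 + 0.75 * sim_wind$RSpd + rnorm(200, sd = 2.5)
sim_fit <- lm(CSpd ~ RSpd, data = sim_wind)

# Unbiased variance estimate: residual sum of squares over (n - 2)
sigma2_hat <- sum(residuals(sim_fit)^2) / df.residual(sim_fit)

# sigma() reports the square root of that quantity
print(c(sigma2_hat = sigma2_hat, sigma_squared = sigma(sim_fit)^2))
```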

19. Create the design matrix and store it in a variable. Print the first few rows of the design matrix.

design_mat <- model.matrix(lm_wind)
print(head(design_mat))

  (Intercept)   RSpd
1           1 5.9666
2           1 7.2176
3           1 7.9405
4           1 6.0174
5           1 6.1646
6           1 1.7687

20. Obtain, and print, the parameter estimates for this data set (found above using `lm`) using matrix multiplication. You should use the following in your computations: t() [transpose], solve() [inverse], %*% [matrix multiplication].

# <your code here>
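One way this computation could look is sketched below on simulated data, using the normal-equations formula \(\hat{\beta} = (X^\top X)^{-1} X^\top y\) with `t()`, `solve()`, and `%*%`. This is an illustrative sketch rather than the submitted answer; `sim_wind`, `sim_fit`, `X`, `y`, and `beta_hat` are names assumed for the example. With the real data, the design matrix from question 19 and `wind$CSpd` would take the place of `X` and `y`.

```r
# Synthetic stand-in for the Windmill data (values are made up for illustration)
set.seed(330)
sim_wind <- data.frame(RSpd = runif(200, min = 0, max = 20))
sim_wind$CSpd <- 3 + 0.75 * sim_wind$RSpd + rnorm(200, sd = 2.5)
sim_fit <- lm(CSpd ~ RSpd, data = sim_wind)

# Design matrix (column of ones plus the predictor) and response vector
X <- model.matrix(sim_fit)
y <- sim_wind$CSpd

# OLS estimates from the normal equations: (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

print(beta_hat)
print(coef(sim_fit))  # should match the matrix result
```

Explicitly inverting \(X^\top X\) follows the hint in the question; for larger or ill-conditioned problems, `solve(t(X) %*% X, t(X) %*% y)` avoids forming the inverse.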

21. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.

< your response here >

22. Briefly summarize what you learned from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.