Homework 1 | STAT 330

Simple Linear Regression

Author

Nate Pennock

Data and Description

Energy can be produced from wind using windmills. Choosing a site for a wind farm (i.e. the location of the windmills), however, can be a multi-million dollar gamble. If wind is inadequate at the site, then the energy produced over the lifetime of the wind farm can be much less than the cost of building the operation. Hence, accurate prediction of wind speed at a candidate site can be an important component in the decision to build or not to build. Because the energy produced is proportional to the square of the wind speed, even small errors in wind speed prediction can lead to significant impacts.

A potential approach for predicting wind speed at a candidate site is to use wind speed data from a nearby reference site. A reference site is a nearby location where the wind speed is already being monitored and should, theoretically, be similar to the candidate site. If the reference sites turn out to be good predictors of the candidate sites, leveraging the information from the reference site would allow windmill companies to estimate the wind speed at the candidate sites without going through a costly and lengthy data collection period.

The Windmill data set contains measurements of wind speed (in meters per second, m/s) at candidate sites (CSpd, column 1) and at corresponding reference sites (RSpd, column 2) for 1,116 sites. Download the Windmill.txt file from Canvas (which can be found in Files -> Data Sets), and put it in the same folder as this Quarto file.

0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.

1. Briefly explain why simple linear regression could be a useful tool for this problem.

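Simple linear regression models the wind speed at a candidate site as a linear function of the wind speed at its reference site. If the relationship is roughly linear, a company can use the readily available reference site measurement to estimate the candidate site's wind speed, and gauge how far off that estimate might be, without a costly and lengthy data collection period.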

2. Read in the data set into an object named “wind”. Display the first 6 rows of the data.

library(tidyverse)  # provides read_table() (readr) and ggplot() (ggplot2)
Wind <- read_table("Windmill.txt")

── Column specification ────────────────────────────────────────────────────────
cols(
  CSpd = col_double(),
  RSpd = col_double()
)
head(Wind, 6)
# A tibble: 6 × 2
   CSpd  RSpd
  <dbl> <dbl>
1   6.9  5.97
2   7.1  7.22
3   7.8  7.94
4   6.9  6.02
5   5.5  6.16
6   3.1  1.77

3. What is the response/outcome variable in this situation? (Think about which variable makes the most sense to be the response.)

The wind speed at the candidate site (CSpd), in m/s.

4. What is the explanatory variable in this situation?

The wind speed at the reference site (RSpd), in m/s.

5. Create a scatter plot of the data with variables on the appropriate axes. Add descriptive axis labels with appropriate units. Display the plot.

ggplot(Wind, aes(x = RSpd, y = CSpd)) +
  geom_point() +
  labs(x = "Reference site wind speed (m/s)",
       y = "Candidate site wind speed (m/s)")

6. Briefly describe the relationship between RSpd and CSpd in terms of its linearity, direction, and strength.

The relationship between RSpd and CSpd appears to be linear, positive, and moderately strong.

7. Calculate the correlation coefficient for the two variables (you may use a built-in R function). Display the result.

cor(Wind$RSpd, Wind$CSpd)
[1] 0.7555948
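
As a rough sketch of what `cor()` computes, the Pearson correlation can be reproduced from its definition. The example below uses only the first six rows of the Windmill data shown earlier (the vectors `cspd` and `rspd` are names introduced here for illustration), so its value will differ from the full-data result of 0.7555948.

```r
# First six rows of the Windmill data, copied from the output shown earlier.
cspd <- c(6.9, 7.1, 7.8, 6.9, 5.5, 3.1)
rspd <- c(5.97, 7.22, 7.94, 6.02, 6.16, 1.77)

# Pearson correlation from its definition:
# sum of cross-products divided by the square root of the
# product of the two sums of squares.
sxy <- sum((rspd - mean(rspd)) * (cspd - mean(cspd)))
sxx <- sum((rspd - mean(rspd))^2)
syy <- sum((cspd - mean(cspd))^2)
r_manual <- sxy / sqrt(sxx * syy)

# Compare with the built-in function (equal up to floating-point error).
r_builtin <- cor(rspd, cspd)
all.equal(r_manual, r_builtin)
```

On the full data set, the same steps applied to `Wind$RSpd` and `Wind$CSpd` should agree with the `cor()` call above.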

8. What does the calculated correlation coefficient tell you about the direction and strength of the linear association between these two variables?

A correlation of about 0.76 indicates a fairly strong, positive linear association between the two variables: as RSpd increases, CSpd tends to increase as well.

9. The equation below shows the generic theoretical simple linear regression model. Update the equation to use variable names that are meaningful in context of this dataset (i.e., rename “x” and “y”).

Wind speed at candidate site = intercept + slope × wind speed at reference site + error, where the error is the difference between the site's observed wind speed and the mean wind speed given by the line.

\[\begin{align} \text{Cand Speed}_i &= \beta_0 + \beta_1 \times \text{Ref Speed}_i + \epsilon_i \\ \epsilon_i &\stackrel{iid}{\sim} N(0, \sigma^2) \end{align} \] where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and the errors \(\epsilon_i\) are independently and identically distributed, following a Normal distribution with mean 0 and constant variance \(\sigma^2\).

10. Add the OLS regression line to the scatter plot you created in 5. Display the result. (If you use ggplot with geom_smooth, You can remove the standard error line with the option se = FALSE).

ggplot(Wind, aes(x = RSpd, y = CSpd)) +
  geom_point() +
  labs(x = "Reference site wind speed (m/s)",
       y = "Candidate site wind speed (m/s)") +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

11. (a) Apply linear regression to the data. (b) Display a summary of the results from the lm function. (c) Plot the model fitted values (x-axis) against the residuals (y-axis) in a well-labeled scatter plot.

# Fit the simple linear regression of CSpd on RSpd
model1 <- lm(CSpd ~ RSpd, data = Wind)
summary(model1) 

Call:
lm(formula = CSpd ~ RSpd, data = Wind)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7877 -1.5864 -0.1994  1.4403  9.1738 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.14123    0.16958   18.52   <2e-16 ***
RSpd         0.75573    0.01963   38.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.466 on 1114 degrees of freedom
Multiple R-squared:  0.5709,    Adjusted R-squared:  0.5705 
F-statistic:  1482 on 1 and 1114 DF,  p-value: < 2.2e-16
# Add the residuals and fitted values as new columns
Wind$residuals <- model1$residuals
Wind$fits <- model1$fitted.values

# Scatter plot of residuals against fitted values
ggplot(Wind, aes(x = fits, y = residuals)) +
  geom_point() +
  labs(x = "Fitted candidate site wind speed (m/s)",
       y = "Residual (m/s)") +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
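
As a quick consistency check on the summary output: in simple linear regression, the Multiple R-squared equals the square of the correlation coefficient between the two variables. Using the correlation from question 7:

\[R^2 = r^2 = (0.7555948)^2 \approx 0.5709\]

This matches the Multiple R-squared of 0.5709 reported by `summary(model1)`, meaning that about 57% of the variability in candidate site wind speed is accounted for by its linear relationship with reference site wind speed.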

12. Update the fitted simple linear regression model shown below with the coefficients you obtained from your linear model (i.e., update the \(\hat{\beta}\)s with specific numeric values). Round to two decimal places.

\[\widehat{\text{Cand Speed}}_i = 3.14 + 0.76 \times \text{Ref Speed}_i\]
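
For intuition about where these coefficients come from, here is a minimal sketch of the ordinary least squares formulas computed by hand and compared with `lm()`. It uses only the first six rows of the data shown earlier (the names `cspd`, `rspd`, `b0`, and `b1` are introduced here for illustration), so the estimates will not equal 3.14 and 0.76.

```r
# First six rows of the Windmill data, copied from the output shown earlier.
cspd <- c(6.9, 7.1, 7.8, 6.9, 5.5, 3.1)
rspd <- c(5.97, 7.22, 7.94, 6.02, 6.16, 1.77)

# OLS slope: sum of cross-products over the sum of squares of x.
b1 <- sum((rspd - mean(rspd)) * (cspd - mean(cspd))) /
  sum((rspd - mean(rspd))^2)

# OLS intercept: the fitted line passes through (mean(x), mean(y)).
b0 <- mean(cspd) - b1 * mean(rspd)

# lm() should return the same two numbers (up to floating-point error).
fit <- lm(cspd ~ rspd)
all.equal(unname(coef(fit)), c(b0, b1))

# Coefficients rounded to two decimal places, as in the fitted equation.
round(coef(fit), 2)
```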

13. Interpret the coefficient for the slope.

For each 1 m/s increase in wind speed at the reference site, the wind speed at the candidate site increases by 0.76 m/s on average.

14. Interpret the coefficient for the intercept.

When the wind speed at the reference site is 0 m/s, the wind speed at the candidate site is 3.14 m/s on average.

15. What is the estimated average wind speed at the candidate site (CSpd) when the wind speed at the reference site (RSpd) is 12 m/s? Show your code, and display the result.

3.14+0.76 * 12 
[1] 12.26
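
The calculation above plugs in the rounded coefficients. An alternative is to let `predict()` use the unrounded estimates stored in the model, which on the full data would be `predict(model1, newdata = data.frame(RSpd = 12))` and may differ slightly from 12.26. The sketch below illustrates the idea on a toy fit to the first six rows of the data (the names `toy`, `fit`, and `new_site` are hypothetical), predicting at 7 m/s so that the toy example stays inside its own data range.

```r
# Toy data: first six rows of the Windmill data shown earlier.
toy <- data.frame(
  CSpd = c(6.9, 7.1, 7.8, 6.9, 5.5, 3.1),
  RSpd = c(5.97, 7.22, 7.94, 6.02, 6.16, 1.77)
)
fit <- lm(CSpd ~ RSpd, data = toy)

# predict() needs a data frame whose column name matches the predictor.
new_site <- data.frame(RSpd = 7)
pred_predict <- predict(fit, newdata = new_site)

# The same prediction built directly from the unrounded coefficients.
pred_manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["RSpd"]] * 7

all.equal(unname(pred_predict), pred_manual)
```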

16. Briefly explain why it would be inadvisable to answer this question using your model: What is the estimated average wind speed at the candidate site (CSpd) when the wind speed at the reference site (RSpd) is 25 m/s?

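Because 25 m/s lies outside the range of reference site wind speeds in our data, answering this question would require extrapolating: we have no evidence that the linear relationship continues to hold at such speeds.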
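
One way to guard against extrapolation is to compare a new predictor value with the observed range before predicting. The helper `is_extrapolation()` below is a hypothetical function written for this sketch, and the six reference speeds are only the first rows of the data shown earlier, so their range is narrower than that of the full data set. On the full data, the relevant check would use `range(Wind$RSpd)`.

```r
# First six reference site wind speeds from the output shown earlier.
rspd <- c(5.97, 7.22, 7.94, 6.02, 6.16, 1.77)

# Hypothetical helper: TRUE when a new value falls outside the
# observed range of the predictor.
is_extrapolation <- function(x_new, x_obs) {
  x_new < min(x_obs) | x_new > max(x_obs)
}

# 7 m/s is inside the observed range of these six values; 25 m/s is not.
is_extrapolation(c(7, 25), rspd)
```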

17. Calculate the (unbiased) estimate of \(\sigma^2\). Show your code, and display the result.

sum(model1$residuals^2)/model1$df.residual 
[1] 6.082312
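
This estimate is the sum of squared residuals divided by the residual degrees of freedom, n - 2, and its square root is the residual standard error that `summary()` reports (here, the square root of 6.082312 is about 2.466, matching the summary in question 11). A minimal sketch of that relationship on a toy fit to the first six rows of the data (the names `toy`, `fit`, `sse`, and `sigma2_hat` are introduced here for illustration):

```r
# Toy data: first six rows of the Windmill data shown earlier.
toy <- data.frame(
  CSpd = c(6.9, 7.1, 7.8, 6.9, 5.5, 3.1),
  RSpd = c(5.97, 7.22, 7.94, 6.02, 6.16, 1.77)
)
fit <- lm(CSpd ~ RSpd, data = toy)

# Unbiased variance estimate: SSE / (n - 2).
sse <- sum(residuals(fit)^2)
sigma2_hat <- sse / df.residual(fit)

# Its square root is the residual standard error from summary().
all.equal(sqrt(sigma2_hat), summary(fit)$sigma)
```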

18. Briefly summarize what you learned, personally, from this analysis about statistics, the model fitting process, R, etc.

I am starting to see how linear models can be a good starting place for solving a problem. I am also learning what the different notations mean, and I am starting to recognize patterns and terms.

19. Briefly summarize what you learned about the scientific question from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main takeaways.

The purpose of this analysis is to see how well wind speeds at reference sites predict wind speeds at candidate sites. If they predict well, someone considering building windmills in a particular area can use readily available reference site data to judge, with some confidence, whether the site is a sound investment. We found that higher wind speeds at a reference site generally go along with higher wind speeds at the candidate site, so reference sites give a useful estimate. The estimates are not exact, though: they are typically off by about 2.5 m/s, and that uncertainty should be factored into any investment decision.