Energy can be produced from wind using windmills. Choosing a site for a wind farm (i.e. the location of the windmills), however, can be a multi-million dollar gamble. If wind is inadequate at the site, then the energy produced over the lifetime of the wind farm can be much less than the cost of building the operation. Hence, accurate prediction of wind speed at a candidate site can be an important component in the decision to build or not to build. Since energy produced varies as the square of the wind speed, even small errors in prediction can have serious consequences.
One possible solution to help predict wind speed at a candidate site is to use wind speed at a nearby reference site. A reference site is a nearby location where the wind speed is already being monitored and should, theoretically, be similar to the candidate site. Using information from the reference site will allow windmill companies to know the wind speed at the candidate site without going through a costly data collection period, if the reference site is a good predictor.
The Windmill data set contains measurements of wind speed (in meters per second, m/s) at a candidate site (CSpd, column 1) and at an accompanying reference site (RSpd, column 2) for 1,116 areas. Download the Windmill.txt file from Learning Suite, and put it in the same folder as this R Markdown file.
use in this situation.
Simple linear regression is appropriate here because we have one quantitative response (wind speed at the candidate site) and one quantitative explanatory variable (wind speed at the reference site). We want to find out whether speeds at the reference site are linearly related to speeds at the candidate site, and whether the model we obtain lets us predict speeds at the candidate site.
data and make sure the data makes sense.
library(tidyverse)  # provides read_table() and ggplot()
wind = read_table("C:/Users/samlo/OneDrive/Documents/winter 2023/STAT 330 Statistical Modeling 2/windmill_data.txt")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## CSpd = col_double(),
## RSpd = col_double()
## )
print(head(wind,n=10))
## # A tibble: 10 × 2
## CSpd RSpd
## <dbl> <dbl>
## 1 6.9 5.97
## 2 7.1 7.22
## 3 7.8 7.94
## 4 6.9 6.02
## 5 5.5 6.16
## 6 3.1 1.77
## 7 6.8 4.65
## 8 11.4 10.9
## 9 12.9 11.7
## 10 13.5 13.0
summary(wind)
## CSpd RSpd
## Min. : 0.400 Min. : 0.2221
## 1st Qu.: 6.100 1st Qu.: 4.7769
## Median : 8.800 Median : 7.5477
## Mean : 9.019 Mean : 7.7773
## 3rd Qu.:11.500 3rd Qu.:10.2096
## Max. :22.400 Max. :21.6015
variable makes the most sense to be the response.)
Response variable: the candidate site speed, CSpd.
Explanatory variable: the reference site speed, RSpd.
Make the plot square. Add descriptive axis labels with appropriate units. Save the plot to a variable and print the plot.
# Refer to columns by name inside aes(); ggplot looks them up in `data`
scatter = ggplot(data = wind, aes(x = RSpd, y = CSpd)) +
  geom_point() +
  labs(x = "Reference site speed (m/s)", y = "Candidate site speed (m/s)") +
  coord_fixed() +
  theme_bw()
print(scatter)
should use 3 key words in a complete sentence that includes referencing the variables.)
The relationship between reference site speed (RSpd) and candidate site speed (CSpd) is linear, positive, and moderately strong, with somewhat more variability at higher wind speeds.
a built-in R function). Print the result.
r = cor(wind$RSpd,wind$CSpd)
print(r)
## [1] 0.7555948
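To show what cor() computes, here is a minimal sketch of the sample correlation formula written out by hand. The vectors `ref` and `cand` are hypothetical toy values, not rows of the Windmill data, and `r_manual` is just an illustrative name.

```r
# Hypothetical toy speeds (m/s); not values from the Windmill data
ref  = c(2.1, 4.0, 5.5, 7.3, 9.8, 12.4)
cand = c(4.2, 5.1, 8.0, 8.4, 11.9, 12.7)

# Sample correlation: sum of cross-products of deviations, divided by the
# square root of the product of the two sums of squared deviations
r_manual = sum((ref - mean(ref)) * (cand - mean(cand))) /
  sqrt(sum((ref - mean(ref))^2) * sum((cand - mean(cand))^2))

print(r_manual)
print(all.equal(r_manual, cor(ref, cand)))  # compare with the built-in function
```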
coefficient (what is the direction and strength of the correlation?).
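With r ≈ 0.756, this is a positive, moderately strong linear correlation: higher speeds at the reference site tend to go with higher speeds at the candidate site.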
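As a rough check on the strength of the relationship, recall that in simple linear regression the squared correlation equals the proportion of variability in the response accounted for by the fitted line (the Multiple R-squared that summary() reports for an lm fit). A worked calculation using the value of r printed above:

\[
r^2 = (0.7555948)^2 \approx 0.5709
\]

So roughly 57% of the variability in candidate site speed is accounted for by its linear relationship with reference site speed.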
regression model for this data set (using parameters (\(\beta\)s), not estimates, and not using matrix notation). Clearly explain which part of the model is deterministic and which part is random. Do not use “x” and “y” in your model - use variable names that are fairly descriptive.
\(Candidate_i\) = \(\beta_0\) + \(\beta_1\)\(Reference_i\) + \(\epsilon_i\)
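The deterministic part is the mean function \(\beta_0 + \beta_1 Reference_i\): the parameters are fixed (though unknown) and the reference site speed is treated as known. The random part is the error term \(\epsilon_i\), where \(\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)\).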
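One way to see the two parts of the model is to simulate from it. The sketch below is only an illustration: the parameter values `beta0`, `beta1`, and `sigma`, the seed, and the uniform range of reference speeds are all assumptions chosen for the example, not estimates from the Windmill data.

```r
set.seed(330)  # arbitrary seed so the sketch is reproducible

# Hypothetical parameter values chosen only for illustration
beta0 = 3
beta1 = 0.75
sigma = 2.5

reference = runif(200, min = 0, max = 20)      # assumed reference speeds (m/s)
mean_part = beta0 + beta1 * reference          # deterministic part
epsilon   = rnorm(200, mean = 0, sd = sigma)   # random part
candidate = mean_part + epsilon                # simulated response

# Refitting the model should give estimates near beta0 and beta1
sim_lm = lm(candidate ~ reference)
print(coef(sim_lm))
```

Rerunning with a different seed changes `epsilon` and therefore `candidate`, while `mean_part` stays the same for the same reference speeds.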
the result. You can remove the standard error line with the option
se = FALSE.
scatter + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
results from the lm function. (c) Save the residuals and
fitted values to the wind tibble. (d) Print the first few
rows of the wind tibble.
wind_lm = lm(CSpd~RSpd, data = wind)
summary(wind_lm)
##
## Call:
## lm(formula = CSpd ~ RSpd, data = wind)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7877 -1.5864 -0.1994 1.4403 9.1738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.14123 0.16958 18.52 <2e-16 ***
## RSpd 0.75573 0.01963 38.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.466 on 1114 degrees of freedom
## Multiple R-squared: 0.5709, Adjusted R-squared: 0.5705
## F-statistic: 1482 on 1 and 1114 DF, p-value: < 2.2e-16
wind$residuals = wind_lm$residuals
wind$fitted = wind_lm$fitted.values
fit (how does OLS choose the parameter estimates?).
The OLS fit chooses the intercept and slope that minimize the sum of the squared vertical distances (residuals) between the observed points and the line. The distances are squared so that positive and negative residuals do not cancel each other out and larger misses are penalized more heavily.
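The minimization described above has a closed-form solution in simple linear regression. The sketch below computes it on hypothetical toy data (the vectors `reference` and `candidate` are made up for illustration) and checks that nudging the line away from the OLS estimates increases the sum of squares. The helper `sse()` is an illustrative name, not part of any package.

```r
# Hypothetical toy speeds (m/s), used only to illustrate the calculation
reference = c(2.1, 4.0, 5.5, 7.3, 9.8, 12.4)
candidate = c(4.2, 5.1, 8.0, 8.4, 11.9, 12.7)

# Sum of squared vertical distances from the points to the line b0 + b1 * x
sse = function(b0, b1) sum((candidate - (b0 + b1 * reference))^2)

# Closed-form OLS estimates for simple linear regression
b1_hat = sum((reference - mean(reference)) * (candidate - mean(candidate))) /
  sum((reference - mean(reference))^2)
b0_hat = mean(candidate) - b1_hat * mean(reference)

print(c(b0_hat, b1_hat))
print(coef(lm(candidate ~ reference)))  # should agree with the line above

# Moving the intercept or slope away from the estimates increases the sum
print(sse(b0_hat, b1_hat) < sse(b0_hat + 0.5, b1_hat))
print(sse(b0_hat, b1_hat) < sse(b0_hat, b1_hat - 0.1))
```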
this data set using the coefficients you found above (do not use parameters/\(\beta\)s and do not use matrix notation). Do not use “x” and “y” in your model - use variable names that are fairly descriptive.
\(\widehat{Candidate}_i\) = 3.141 + 0.756\(Reference_i\)
For each 1 m/s increase in wind speed at the reference site, we estimate that the average wind speed at the candidate site increases by 0.756 m/s.
When the wind speed at the reference site is 0 m/s, the estimated average wind speed at the candidate site is 3.141 m/s.
wind speed at the reference site (RSpd) is 12 m/s? Show your code, and print the result.
wind_lm$coefficients[1] + wind_lm$coefficients[2]*12
## (Intercept)
## 12.21003
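The same kind of prediction can also be obtained with predict(), which avoids indexing the coefficients by position. The sketch below uses a hypothetical toy data frame `toy` (made-up values with the same column names as the Windmill data) so that it runs on its own; with the real fit, the call would take the fitted lm object in place of `toy_lm`.

```r
# Hypothetical toy data with the same column names as the Windmill data
toy = data.frame(
  RSpd = c(2.1, 4.0, 5.5, 7.3, 9.8, 12.4),
  CSpd = c(4.2, 5.1, 8.0, 8.4, 11.9, 12.7)
)
toy_lm = lm(CSpd ~ RSpd, data = toy)

# Prediction at RSpd = 12 m/s, computed two ways
by_hand    = unname(coef(toy_lm)[1] + coef(toy_lm)[2] * 12)
by_predict = unname(predict(toy_lm, newdata = data.frame(RSpd = 12)))

print(c(by_hand, by_predict))  # the two values should match
```

The column name in `newdata` must match the explanatory variable name used in the model formula.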
the average wind speed at the candidate site (CSpd) when the wind speed at the reference site (RSpd) is 25 m/s?
No, not reliably. A reference site speed of 25 m/s is above the largest reference speed in the data (about 21.6 m/s), so this prediction would be an extrapolation. The fitted line describes the relationship only over the range of speeds we observed, and we have no data to confirm that the same linear trend continues at higher speeds.
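A quick way to guard against extrapolation is to compare a proposed reference speed with the observed range before predicting. In the sketch below the range endpoints are copied from the summary(wind) output earlier in this report; the helper name `in_observed_range` is hypothetical.

```r
# Observed range of RSpd, copied from the summary(wind) output (m/s)
rspd_min = 0.2221
rspd_max = 21.6015

# Hypothetical helper: TRUE when a reference speed lies inside the observed range
in_observed_range = function(speed) speed >= rspd_min & speed <= rspd_max

print(in_observed_range(12))  # TRUE: prediction is an interpolation
print(in_observed_range(25))  # FALSE: prediction would be an extrapolation
```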
of the residuals around the line. Show your code, and print the result.
sum(wind$residuals ^ 2) / wind_lm$df.residual
## [1] 6.082312
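This quantity is the square of the residual standard error: the square root of 6.082312 is about 2.466, which matches the value reported in the lm summary. As a sketch of an equivalent shortcut, sigma() from the stats package returns the residual standard error of a fitted model directly. The toy data frame below is hypothetical and only makes the example runnable on its own.

```r
# Hypothetical toy data with the same column names as the Windmill data
toy = data.frame(
  RSpd = c(2.1, 4.0, 5.5, 7.3, 9.8, 12.4),
  CSpd = c(4.2, 5.1, 8.0, 8.4, 11.9, 12.7)
)
toy_lm = lm(CSpd ~ RSpd, data = toy)

# Estimated error variance: residual sum of squares over residual df
by_hand  = sum(residuals(toy_lm)^2) / df.residual(toy_lm)

# sigma() gives the residual standard error, so square it for the variance
built_in = sigma(toy_lm)^2

print(all.equal(by_hand, built_in))
```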
few rows of the design matrix.
x = cbind(rep(1, length(wind$RSpd)), wind$RSpd)
print(head(x,n=3))
## [,1] [,2]
## [1,] 1 5.9666
## [2,] 1 7.2176
## [3,] 1 7.9405
above using lm) using matrix multiplication. You should
use the following in your computations: t() [transpose], solve()
[inverse], %*% [matrix multiplication].
x_t_x_inv = solve(t(x)%*%x)
beta = x_t_x_inv%*%t(x)%*%wind$CSpd
print(beta)
## [,1]
## [1,] 3.1412324
## [2,] 0.7557333
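The explicit inverse above follows the formula directly. As an alternative sketch (a different technique from the one the question asks for), the normal equations can be solved without forming the inverse, using crossprod() and the two-argument form of solve(); model.matrix() builds the design matrix from a formula. The toy data frame is hypothetical and stands in for the Windmill data.

```r
# Hypothetical toy data with the same column names as the Windmill data
toy = data.frame(
  RSpd = c(2.1, 4.0, 5.5, 7.3, 9.8, 12.4),
  CSpd = c(4.2, 5.1, 8.0, 8.4, 11.9, 12.7)
)

X = model.matrix(~ RSpd, data = toy)  # column of 1s plus the RSpd column
y = toy$CSpd

# Solve (X'X) b = X'y directly; crossprod(X) is t(X) %*% X
beta_hat = solve(crossprod(X), crossprod(X, y))

print(beta_hat)
print(coef(lm(CSpd ~ RSpd, data = toy)))  # should agree with beta_hat
```

Solving the linear system is generally more numerically stable than computing the inverse and then multiplying, although for a small, well-conditioned problem like this one the two approaches give the same estimates.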
about the statistics, model fitting process, etc.
I learned how the OLS fit works. It finds the line that minimizes the sum of the squared vertical distances between the data points and the line. The sign of the fitted slope tells us whether the relationship is positive or negative, and the correlation coefficient tells us how strong the linear relationship is.
to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
The purpose of this analysis was to find out whether wind speeds already being recorded at a nearby reference site can be used to estimate wind speeds at the candidate site, which is far cheaper than collecting new data at the candidate site. We found a fairly strong relationship between the two sites: when the wind is faster at the reference site, it tends to be faster at the candidate site as well. Our model turns a reference site reading into an estimated candidate site speed. For example, a reference site speed of 12 m/s corresponds to an estimated average candidate site speed of about 12.2 m/s. These estimates are not exact, and they should only be trusted for reference site speeds within the range we observed (up to about 21.6 m/s).