For this week’s discussion, I am using the “faithful” data set. This data set records the duration of each eruption of Old Faithful and the waiting time between eruptions, both in minutes. The dependent variable is eruptions (the eruption duration), and the independent variable is waiting (the waiting time).
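\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i, \qquad \hat y_i = b_0 + b_1x_i \\ \text{where} \quad y_i = \text{The observed eruption duration for the } i^{th} \text{ observation} \\ x_i = \text{The waiting time for the } i^{th} \text{ observation} \\ \hat y_i = \text{The predicted eruption duration for the } i^{th} \text{ observation} \\ \beta_0, \ \beta_1 = \text{The true intercept and slope, estimated by } b_0 \text{ and } b_1 \\ \epsilon_i = \text{The unobserved error term of the } i^{th} \text{ observation} \\ e_i = \text{The residual of the } i^{th} \text{ observation, following the formula:} \\ \\ e_i = y_i - \hat y_i \\ \]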
plot(eruptions ~ waiting, data = faithful)
mydata <- faithful
mylm <- lm(mydata$eruptions ~ mydata$waiting) # regress eruption duration (y) on waiting time (x)
abline(mylm, col = 'red', lwd = 2)
summary(mylm)
##
## Call:
## lm(formula = mydata$eruptions ~ mydata$waiting)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.29917 -0.37689 0.03508 0.34909 1.19329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
## mydata$waiting 0.075628 0.002219 34.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
## F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
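Here we can see that the slope is 0.075628 and the intercept is -1.874016. That means that for every additional minute of waiting, we predict that the duration of the eruption will increase by 0.075628 minutes, or about 4.54 seconds. The intercept is the predicted duration at a waiting time of zero, which lies outside the observed waiting times, so the negative value has no physical interpretation.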
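To make the interpretation of the coefficients concrete, here is a minimal sketch that predicts the eruption duration for one waiting time. The 80-minute value and the names `fit`, `new_obs`, `pred_min`, `by_hand`, and `slope_sec` are illustrative assumptions, not part of the original analysis. The model is refit with the `data` argument because `predict()` needs the variable names to match the new data frame.

```r
# Refit the same regression using the data argument (needed for newdata to work)
fit <- lm(eruptions ~ waiting, data = faithful)

# Hypothetical new observation: 80 minutes of waiting (illustrative value)
new_obs <- data.frame(waiting = 80)
pred_min <- unname(predict(fit, newdata = new_obs))

# The same prediction computed by hand: intercept + slope * waiting
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["waiting"] * 80)

# Slope converted from minutes to seconds per extra minute of waiting
slope_sec <- unname(b["waiting"]) * 60

pred_min   # predicted eruption duration in minutes
slope_sec  # predicted increase in seconds
```

With the coefficients reported above, the prediction comes out at about 4.18 minutes, and both ways of computing it agree.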
To find the slope of the least squares line, we can use the following formula:
\[ b_1 = \frac{\text{cov}(x, y)}{\text{var}(x)} \]
cov_xy <- cov(mydata$waiting, mydata$eruptions) # sample covariance of x and y
var_x <- var(mydata$waiting) # sample variance of x
(beta1 <- cov_xy / var_x) # slope calculated by cov / var
## [1] 0.07562795
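An equivalent way to write the same slope uses the correlation coefficient, since cov(x, y) = r · sd(x) · sd(y). The sketch below checks that both forms agree on this data; the variable names `x`, `y`, `slope_cov`, and `slope_cor` are chosen here for illustration.

```r
x <- faithful$waiting
y <- faithful$eruptions

# Slope as covariance over variance (the formula used above)
slope_cov <- cov(x, y) / var(x)

# Slope as correlation times the ratio of standard deviations
slope_cor <- cor(x, y) * sd(y) / sd(x)

# Both expressions should match up to floating-point error
all.equal(slope_cov, slope_cor)
```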
To find the intercept parameter, we can use the following formula:
\[ b_0 = \bar y - b_1 \bar x \]
ybar <- mean(mydata$eruptions)
xbar <- mean(mydata$waiting)
# intercept from the sample means and the slope computed earlier
(intercept <- ybar - beta1 * xbar)
## [1] -1.874016
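As a quick check, the hand-computed slope and intercept can be compared with the coefficients that `lm()` reports. This is a minimal sketch; the names `fit`, `b0`, `b1`, and `manual` are illustrative.

```r
x <- faithful$waiting
y <- faithful$eruptions

# Coefficients from the closed-form formulas
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)
manual <- c(b0, b1)

# Coefficients from lm(), with names dropped for comparison
fit <- lm(eruptions ~ waiting, data = faithful)
from_lm <- unname(coef(fit))

# TRUE when the two agree up to floating-point error
all.equal(manual, from_lm)
```

The intercept formula also implies that the least squares line passes through the point of means, (x̄, ȳ).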
The Gauss-Markov assumptions are a set of conditions under which the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE), meaning that it has the smallest variance among all linear unbiased estimators. Ordinary Least Squares regression is the method used in linear regression to find the line that minimizes the sum of squared residuals. OLS always returns that least squares line, but for its coefficients to be unbiased with minimum variance, certain assumptions need to hold for the underlying data:
The relationship between the dependent and independent variables must be linear.
The data must have been randomly sampled.
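The error terms have a mean of zero, conditional on the independent variables.
The error terms should have constant variance (homoscedasticity).
There should be no autocorrelation in the error terms.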
There should be no perfect multicollinearity.
Error terms, also known as epsilon (ϵ), are the unobserved deviations of the observations from the true regression line. Residuals are their observed counterparts: the leftover variation in the data after accounting for the model fit. Autocorrelation means that the error terms are correlated with one another, so the assumption requires that they are not. Perfect multicollinearity is when independent variables have an exact linear relationship between them; if that is the case, then we can’t separate the effect of one independent variable on the dependent variable from the effect of the others. With a single independent variable, as in this model, this only requires that the variable is not constant.
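Some of these assumptions can be examined informally through the residuals. The sketch below is one rough way to do so, not a formal test. It computes the Durbin-Watson statistic by hand, which assumes that the rows of `faithful` are in time order; the names `fit`, `e`, `mean_resid`, `dw`, and `lag1` are illustrative.

```r
fit <- lm(eruptions ~ waiting, data = faithful)
e <- residuals(fit)

# With an intercept in the model, OLS residuals average to zero by construction,
# so this confirms the fit rather than testing the zero-mean assumption itself
mean_resid <- mean(e)

# Durbin-Watson statistic computed by hand; it lies between 0 and 4,
# and values near 2 suggest little first-order autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)

# Correlation between each residual and the one before it
lag1 <- cor(e[-1], e[-length(e)])

c(mean_resid = mean_resid, dw = dw, lag1 = lag1)
```

For linearity and constant variance, `plot(fit, which = 1)` draws the residuals against the fitted values; curvature or a funnel shape in that plot would suggest a problem.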