Column definitions come from the following paper: https://ia601406.us.archive.org/15/items/in.ernet.dli.2015.225800/2015.225800.Statistical-Methods_text.pdf
head(df, 3)
## abrasion_loss hardness tensile_strength
## 1 372 45 162
## 2 206 55 233
## 3 175 61 232
mdl <- lm(abrasion_loss ~ tensile_strength + hardness, data=df)
summary(mdl); plot(mdl, which = 1)
##
## Call:
## lm(formula = abrasion_loss ~ tensile_strength + hardness, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.385 -14.608 3.816 19.755 65.981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 885.1611 61.7516 14.334 3.84e-14 ***
## tensile_strength -1.3743 0.1943 -7.073 1.32e-07 ***
## hardness -6.5708 0.5832 -11.267 1.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.49 on 27 degrees of freedom
## Multiple R-squared: 0.8402, Adjusted R-squared: 0.8284
## F-statistic: 71 on 2 and 27 DF, p-value: 1.767e-11
\[ \theta = \mathbb{E}(Y | \text{tensile_strength} = 200, \text{hardness} = 70) \]
The model is \(y = X\beta + \epsilon\), and we estimate \(\theta\) with \(\hat{\theta}\), the fitted value from \(\hat{y} = X\hat{\beta}\) at those predictor values.
# Least squares estimate:
predict(mdl, newdata =
  data.frame(
    tensile_strength = c(200),
    hardness = c(70))
)
## 1
## 150.3407
Recall that, with normally distributed errors: \(\hat{\beta} \sim N(\beta,\sigma^2(X^{T}X)^{-1})\)
Let \(\vec{x}_o = (1 \ 200 \ 70)^{T}\)
Therefore : \(\vec{x}_o^{T}\hat{\beta} \sim N(\theta, \vec{x}_o^{T} \sigma^2(X^{T}X)^{-1}\vec{x}_o)\)
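The point estimate and its standard error can be reproduced directly from the formula above. The sketch below is hedged: it assumes that `MASS::Rubber` (columns `loss`, `hard`, `tens`) holds the same 30 observations as `df`, and it rebuilds `df` and `mdl` only so that the block runs on its own. The names `x_o`, `theta_hat` and `se_theta` are introduced here for illustration.

```r
# Assumption: MASS::Rubber is the same data set as `df` in this document,
# just with shorter column names (loss, hard, tens).
library(MASS)
df <- data.frame(
  abrasion_loss    = Rubber$loss,
  hardness         = Rubber$hard,
  tensile_strength = Rubber$tens
)
mdl <- lm(abrasion_loss ~ tensile_strength + hardness, data = df)

# x_o in the same order as coef(mdl): intercept, tensile_strength, hardness
x_o <- c(1, 200, 70)

# Point estimate: x_o' beta-hat
theta_hat <- sum(x_o * coef(mdl))

# vcov(mdl) is sigma-hat^2 (X'X)^{-1}, so this is the estimated
# standard deviation of x_o' beta-hat
se_theta <- sqrt(drop(t(x_o) %*% vcov(mdl) %*% x_o))

# Cross-check against predict()
p <- predict(mdl,
             newdata = data.frame(tensile_strength = 200, hardness = 70),
             se.fit = TRUE)
c(by_hand = theta_hat, predict = unname(p$fit))
c(by_hand = se_theta,  predict = unname(p$se.fit))
```

If the assumption about the data holds, the first comparison should agree with the `predict()` value printed earlier.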
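In practice \(\sigma^2\) is unknown, so we replace it with the estimate \(\hat{\sigma}^2 = SS_{\text{Err}}/\text{df} = \frac{\sum_i (y_i - \hat{y}_i)^2}{\text{df}}\), where \(\text{df} = n - (p + 1)\) and \(p\) is the number of predictors (here \(30 - 3 = 27\), matching the residual degrees of freedom in the summary output). Because the variance is estimated rather than known, the standardized estimate follows a t-distribution with \(n - (p + 1)\) degrees of freedom instead of a standard normal.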
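As a quick check of the variance estimate, the residual sum of squares can be divided by the error degrees of freedom by hand and compared with what `lm` reports. This is a sketch under the assumption that `MASS::Rubber` is the same data as `df`; the variable names `df_err`, `ss_err` and `sigma2_hat` are illustrative.

```r
# Assumption: MASS::Rubber is the same data set as `df` in this document.
library(MASS)
df <- data.frame(
  abrasion_loss    = Rubber$loss,
  hardness         = Rubber$hard,
  tensile_strength = Rubber$tens
)
mdl <- lm(abrasion_loss ~ tensile_strength + hardness, data = df)

n      <- nrow(df)
p      <- 2                      # number of predictors (excluding intercept)
df_err <- n - (p + 1)            # error degrees of freedom

ss_err     <- sum(residuals(mdl)^2)  # SS_Err = sum of squared residuals
sigma2_hat <- ss_err / df_err        # estimate of sigma^2

# Compare with the values stored in the fitted model
c(df_by_hand = df_err, df_lm = df.residual(mdl))
c(sigma_by_hand = sqrt(sigma2_hat), sigma_lm = summary(mdl)$sigma)
```

The square root of `sigma2_hat` is the "Residual standard error" line of `summary(mdl)`.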
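All this to say: \(\theta\) itself is a fixed, unknown constant, and it is the estimate \(\hat{\theta} = \vec{x}_o^{T}\hat{\beta}\) that is random. With \(\sigma^2\) known, \(\frac{\hat{\theta} - \theta}{\sigma\sqrt{\vec{x}_o^{T}(X^{T}X)^{-1}\vec{x}_o}} \sim N(0, 1)\); with \(\hat{\sigma}\) substituted for \(\sigma\), the same ratio follows a t-distribution with \(n - (p + 1)\) degrees of freedom. The t-distribution tends towards the standard normal as the degrees of freedom grow, so the two agree for large \(n\).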
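The t-distribution result turns into a confidence interval for \(\theta\) of the form \(\hat{\theta} \pm t_{1-\alpha/2,\, n-(p+1)} \cdot \hat{\sigma}\sqrt{\vec{x}_o^{T}(X^{T}X)^{-1}\vec{x}_o}\). The sketch below builds that interval by hand and compares it with `predict(..., interval = "confidence")`. It assumes that `MASS::Rubber` is the same data as `df`; the 95% level is chosen only for illustration.

```r
# Assumption: MASS::Rubber is the same data set as `df` in this document.
library(MASS)
df <- data.frame(
  abrasion_loss    = Rubber$loss,
  hardness         = Rubber$hard,
  tensile_strength = Rubber$tens
)
mdl <- lm(abrasion_loss ~ tensile_strength + hardness, data = df)

x_o       <- c(1, 200, 70)          # intercept, tensile_strength, hardness
theta_hat <- sum(x_o * coef(mdl))   # x_o' beta-hat
se_theta  <- sqrt(drop(t(x_o) %*% vcov(mdl) %*% x_o))

alpha  <- 0.05                      # illustrative choice: 95% interval
t_crit <- qt(1 - alpha / 2, df = df.residual(mdl))

# Interval by hand: estimate +/- t quantile * standard error
ci_hand <- theta_hat + c(-1, 1) * t_crit * se_theta

# Same interval from predict()
ci_pred <- predict(mdl,
                   newdata = data.frame(tensile_strength = 200, hardness = 70),
                   interval = "confidence", level = 1 - alpha)
ci_hand
ci_pred
```

Note that this is an interval for the mean response \(\theta\), not for a single new observation; `interval = "prediction"` would give the wider interval for the latter.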