Simple linear regression is a statistical method used to summarize the relationship between two quantitative variables, a predictor \(x\) and a response \(y\).
(We model how \(y\) changes as \(x\) changes.)
The simple linear regression model is:
\[
y = \beta_0 + \beta_1\cdot x + \varepsilon
\]
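where \(\beta_0\) is the intercept (the mean of \(y\) when \(x = 0\)), \(\beta_1\) is the slope (the change in the mean of \(y\) per one-unit increase in \(x\)), and \(\varepsilon\) is a random error term.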
Once we fit the model to data, we compute quantities that help evaluate the fit:
These quantities are used to:
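Residuals measure how well the model fits the data: each residual \(y_i - \hat{y}_i\) is the gap between an observed response and its fitted value \(\hat{y}_i = b_0 + b_1 x_i\). Define the Sum of Squared Errors (SSE):
\[ \text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
A common scaled summary is the Mean Square Error (MSE), which divides the SSE by its \(n-2\) degrees of freedom:
\[ \text{MSE} = \frac{\text{SSE}}{n-2} = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]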
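As a minimal sketch of these two formulas, the R code below computes the Sum of Squared Errors and the Mean Square Error by hand. It assumes data simulated with the same recipe as this section's example; the names `res`, `sse`, and `mse` are illustrative choices, not part of any library.

```r
# Simulated data (assumption: same recipe as the example data set in this section)
set.seed(123)
x = seq(1, 10, length.out = 100)
y = 3 + 2 * x + rnorm(100, 0, 2)
model = lm(y ~ x)

n   = length(y)
res = y - fitted(model)   # residuals: observed minus fitted values
sse = sum(res^2)          # Sum of Squared Errors
mse = sse / (n - 2)       # Mean Square Error on n - 2 degrees of freedom

sse
mse
sqrt(mse)                 # the "Residual standard error" that summary(model) reports
```

The square root of the MSE is the value R labels "Residual standard error", and the `n - 2` divisor is the "degrees of freedom" printed next to it.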
The R code below simulates an example data set and prints its first rows:
set.seed(123)
x = seq(1, 10, length.out = 100)
y = 3 + 2*x + rnorm(100, 0, 2)
df = data.frame(x, y)
head(df)
         x        y
1 1.000000 3.879049
2 1.090909 4.721463
3 1.181818 8.481053
4 1.272727 5.686471
5 1.363636 5.985848
6 1.454545 9.339221
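library(ggplot2)

# Fit the simple linear regression of y on x
model = lm(y ~ x, data = df)

# Scatterplot with the fitted line and its confidence band
p_reg = ggplot(df, aes(x = x, y = y)) +
  geom_point(color = "#477054", size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "#2B4539") +
  labs(title = "Linear Regression Fit",
       x = "Predictor (x)",
       y = "Response (y)") +
  theme_minimal()
p_reg

# Print the coefficient table and fit statistics
summary(model)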
Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-4.9071 -1.1047 -0.0692  1.2970  4.1897

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.8770     0.4248   6.773  9.4e-10 ***
x             2.0552     0.0697  29.487  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.829 on 98 degrees of freedom
Multiple R-squared: 0.8987, Adjusted R-squared: 0.8977
F-statistic: 869.5 on 1 and 98 DF, p-value: < 2.2e-16
To test whether \(x\) is associated with \(y\), we test the slope \(\beta_1\).
Hypotheses:
\[ H_0: \beta_1 = 0 \qquad\text{vs}\qquad H_a: \beta_1 \ne 0 \]
Test Statistic:
\[ t = \frac{b_1}{SE_{b_1}} \]
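The test statistic can be reproduced by hand from a fitted model. The sketch below assumes data simulated with the same recipe as this section's example. The names `t_stat`, `p_value`, and `se_b1_hand` are illustrative, and the standard-error line relies on the standard identity \(SE_{b_1} = \sqrt{\text{MSE} / \sum_i (x_i - \bar{x})^2}\), which this section does not state.

```r
# Simulated data (assumption: same recipe as the example data set in this section)
set.seed(123)
x = seq(1, 10, length.out = 100)
y = 3 + 2 * x + rnorm(100, 0, 2)
model = lm(y ~ x)

# Coefficient table: columns Estimate, Std. Error, t value, Pr(>|t|)
coefs = coef(summary(model))
b1    = coefs["x", "Estimate"]
se_b1 = coefs["x", "Std. Error"]

# Standard error of the slope computed from the MSE
mse        = summary(model)$sigma^2
se_b1_hand = sqrt(mse / sum((x - mean(x))^2))

# t statistic and two-sided p-value on n - 2 degrees of freedom
t_stat  = b1 / se_b1
df_res  = df.residual(model)
p_value = 2 * pt(abs(t_stat), df = df_res, lower.tail = FALSE)

t_stat
p_value
```

Both values should agree with the `t value` and `Pr(>|t|)` entries in the `x` row of `summary(model)`.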
A \((1 - \alpha)\times 100\%\) confidence interval for \(\beta_1\) is:
\[ b_1 \pm t_{\alpha/2, \, n-2} \cdot SE_{b_1} \]
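Where \(b_1\) is the estimated slope, \(SE_{b_1}\) is its standard error, and \(t_{\alpha/2, \, n-2}\) is the upper \(\alpha/2\) critical value of the \(t\) distribution with \(n-2\) degrees of freedom.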
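The interval formula can be checked against R's built-in `confint()`. This is a sketch assuming data simulated with the same recipe as this section's example and \(\alpha = 0.05\); the names `alpha`, `t_crit`, and `ci` are illustrative.

```r
# Simulated data (assumption: same recipe as the example data set in this section)
set.seed(123)
x = seq(1, 10, length.out = 100)
y = 3 + 2 * x + rnorm(100, 0, 2)
model = lm(y ~ x)

coefs = coef(summary(model))
b1    = coefs["x", "Estimate"]
se_b1 = coefs["x", "Std. Error"]

alpha  = 0.05                                         # for a 95% interval
t_crit = qt(1 - alpha / 2, df = df.residual(model))   # critical value t_{alpha/2, n-2}

# b1 plus or minus the critical value times the standard error
ci = c(lower = b1 - t_crit * se_b1,
       upper = b1 + t_crit * se_b1)
ci

# Built-in equivalent for comparison
confint(model, "x", level = 1 - alpha)
```

If the interval excludes 0, the two-sided test of \(H_0: \beta_1 = 0\) rejects at level \(\alpha\), so the interval and the t test lead to the same conclusion.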
library(plotly)

# 3D scatterplot of (x, y, residual), colored by residual size
p3d = plot_ly(data = df,
              x = ~x,
              y = ~y,
              z = ~residuals(model),
              type = "scatter3d",
              mode = "markers",
              marker = list(size = 4, colorbar = list(title = "Residual")),
              color = residuals(model)) %>%
  layout(scene = list(
           zaxis = list(title = "Residual")),
         title = "3D Residuals (x, y, residual)")
p3d