Simple Linear Regression

What is it?

Simple linear regression is a statistical method used to summarize the relationship between two quantitative variables, a predictor \(x\) and a response \(y\).

(We model how \(y\) changes as \(x\) changes.)

The Model

The simple linear regression model is:

\[ y = \beta_0 + \beta_1\cdot x + \varepsilon \]
where:

\(\beta_0\): intercept - the expected value of \(y\) when \(x = 0\)
\(\beta_1\): slope - the expected change in \(y\) per one-unit increase in \(x\)
\(\varepsilon\): random error, assumed to satisfy \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\)

Predicted Values & Residuals

Once we fit the model to data, we compute quantities that help evaluate the fit:

\(\hat{y}_i\) : predicted/fitted value for observation \(i\) (computed from the estimated model: \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) )
\(e_i = y_i - \hat{y}_i\) : residual for observation \(i\) (difference between observed and predicted value)

These quantities are used to:

Assess model fit
Calculate MSE, RMSE, and other summary statistics
Generate residual plots to check assumptions
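As a minimal sketch of these definitions, the code below fits a line and computes the fitted values and residuals by hand. The data are simulated for illustration only (a true intercept of 3, slope of 2, and normal noise are assumptions of this sketch), and the names y_hat and e are hypothetical.

```r
# Simulated data, assumed here purely for illustration
set.seed(123)
x = seq(1, 10, length.out = 100)
y = 3 + 2 * x + rnorm(100, 0, 2)
df = data.frame(x, y)

# Fit the simple linear regression of y on x
model = lm(y ~ x, data = df)

# Estimated intercept and slope
b = coef(model)

# Fitted values computed directly from the estimated line
y_hat = b[["(Intercept)"]] + b[["x"]] * df$x

# Residuals: observed value minus fitted value
e = df$y - y_hat

# Both should agree with R's built-in extractors
print(all.equal(unname(fitted(model)), y_hat))
print(all.equal(unname(resid(model)), e))
```

In practice fitted(model) and resid(model) return the same quantities without the manual arithmetic.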

Residuals & MSE

Residuals measure how well the model fits the data. Define the Sum of Squared Errors (SSE):

\[ SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

A common scaled summary is the Mean Square Error (MSE), which divides by \(n - 2\) because two parameters (\(\beta_0\) and \(\beta_1\)) are estimated:

\[ \text{MSE} = \frac{SSE}{n-2} = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
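The two formulas can be checked numerically. The sketch below assumes simulated data (intercept 3, slope 2, normal noise) and uses the hypothetical names sse, mse, and rmse; it also compares the square root of the MSE with the residual standard error that R reports through sigma().

```r
# Simulated data, assumed here purely for illustration
set.seed(123)
x = seq(1, 10, length.out = 100)
y = 3 + 2 * x + rnorm(100, 0, 2)
df = data.frame(x, y)

model = lm(y ~ x, data = df)
n = nrow(df)

# Sum of squared residuals
sse = sum(resid(model)^2)

# Mean square error: SSE divided by the residual degrees of freedom
mse = sse / (n - 2)

# Its square root is on the same scale as y
rmse = sqrt(mse)

# sigma() returns the residual standard error, which should equal rmse
print(all.equal(rmse, sigma(model)))
```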

Example Dataset

The R code below simulates an example data set from the model \(y = 3 + 2x + \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, 2^2)\), then prints its first rows:

set.seed(123)
x = seq(1, 10, length.out = 100)
y = 3 + 2*x + rnorm(100, 0, 2)
df = data.frame(x, y)
head(df)

         x        y
1 1.000000 3.879049
2 1.090909 4.721463
3 1.181818 8.481053
4 1.272727 5.686471
5 1.363636 5.985848
6 1.454545 9.339221

Scatterplot (ggplot)

Regression Line - Code

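library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()

# Fit the simple linear regression of y on x
model = lm(y ~ x, data = df)

# Scatterplot with the least-squares line and its 95% confidence band
p_reg = ggplot(df, aes(x = x, y = y)) +
  geom_point(color = "#477054", size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "#2B4539") +
  labs(title = "Linear Regression Fit",
       x = "Predictor (x)",
       y = "Response (y)") +
  theme_minimal()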

Regression Line - Plot

Model Output in R

Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9071 -1.1047 -0.0692  1.2970  4.1897 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.8770     0.4248   6.773  9.4e-10 ***
x             2.0552     0.0697  29.487  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.829 on 98 degrees of freedom
Multiple R-squared:  0.8987,    Adjusted R-squared:  0.8977 
F-statistic: 869.5 on 1 and 98 DF,  p-value: < 2.2e-16
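The numbers in this printout can also be pulled out programmatically instead of being read off the screen. The sketch below assumes the same kind of simulated data as the example and uses the hypothetical names s and coefs.

```r
# Simulated data, assumed here purely for illustration
set.seed(123)
x = seq(1, 10, length.out = 100)
y = 3 + 2 * x + rnorm(100, 0, 2)
df = data.frame(x, y)

model = lm(y ~ x, data = df)
s = summary(model)

# Coefficient table as a matrix:
# columns are Estimate, Std. Error, t value, Pr(>|t|)
coefs = coef(s)
print(coefs)

# Individual pieces of the summary
print(coefs["x", "Estimate"])    # estimated slope
print(coefs["x", "Std. Error"])  # standard error of the slope
print(s$sigma)                   # residual standard error
print(s$df[2])                   # residual degrees of freedom
print(s$r.squared)               # multiple R-squared
```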

Inference for the Slope

To test whether \(x\) is linearly associated with \(y\), we test the slope \(\beta_1\).

Hypotheses:

\[ H_0: \beta_1 = 0 \qquad\text{vs}\qquad H_a: \beta_1 \ne 0 \]

Test Statistic:

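\[ t = \frac{b_1}{SE_{b_1}} \]

Here \(b_1 = \hat{\beta}_1\) is the estimated slope and \(SE_{b_1}\) is its standard error. Under \(H_0\), \(t\) follows a t-distribution with \(n - 2\) degrees of freedom. For the model output above, \(t = 2.0552 / 0.0697 \approx 29.49\).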
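A minimal sketch of this test in R, assuming simulated data like the example and using the hypothetical names b1, se_b1, t_stat, and p_value. It computes the statistic and its two-sided p-value by hand and compares them with the values reported by summary().

```r
# Simulated data, assumed here purely for illustration
set.seed(123)
x = seq(1, 10, length.out = 100)
y = 3 + 2 * x + rnorm(100, 0, 2)
df = data.frame(x, y)

model = lm(y ~ x, data = df)
coefs = coef(summary(model))

# Slope estimate and its standard error
b1 = coefs["x", "Estimate"]
se_b1 = coefs["x", "Std. Error"]

# Test statistic for H0: beta_1 = 0
t_stat = b1 / se_b1

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom
p_value = 2 * pt(abs(t_stat), df = df.residual(model), lower.tail = FALSE)

# Both should match the "t value" and "Pr(>|t|)" columns of the summary
print(all.equal(t_stat, coefs["x", "t value"]))
print(all.equal(p_value, coefs["x", "Pr(>|t|)"]))
```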

Confidence Interval for the Slope

A \((1 - \alpha)\times 100\%\) confidence interval for \(\beta_1\) is:

\[ b_1 \pm t_{\alpha/2, \, n-2} \cdot SE_{b_1} \]

Where:

\(b_1\) is the sample slope estimate (the \(\hat{\beta}_1\) from the fitted model)
\(SE_{b_1}\) is the standard error of the slope
\(t_{\alpha/2, \, n-2}\) is the critical t-value with \(n - 2\) degrees of freedom
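The interval can be assembled piece by piece and checked against R's confint(). This is a sketch under the assumption of simulated data like the example and \(\alpha = 0.05\); the names t_crit and ci_manual are hypothetical.

```r
# Simulated data, assumed here purely for illustration
set.seed(123)
x = seq(1, 10, length.out = 100)
y = 3 + 2 * x + rnorm(100, 0, 2)
df = data.frame(x, y)

model = lm(y ~ x, data = df)
coefs = coef(summary(model))

b1 = unname(coefs["x", "Estimate"])
se_b1 = unname(coefs["x", "Std. Error"])

# Critical t-value for a 95% interval with n - 2 degrees of freedom
alpha = 0.05
t_crit = qt(1 - alpha / 2, df = df.residual(model))

# Interval built from the formula: estimate +/- critical value * SE
ci_manual = c(b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(ci_manual)

# Built-in equivalent for comparison
ci_builtin = confint(model, "x", level = 1 - alpha)
print(ci_builtin)
```

If the interval excludes 0, the two-sided test of \(H_0: \beta_1 = 0\) at level \(\alpha\) rejects, so the interval and the t-test lead to the same conclusion.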

Residuals: Histogram (ggplot)

3D Residuals - Plotly (Code)

library(plotly)  # provides plot_ly(), layout(), and the %>% pipe

# 3D scatter of (x, y, residual); points are colored by residual value
p3d = plot_ly(data = df,
              x = ~x,
              y = ~y,
              z = ~model$residuals,
              type = "scatter3d",
              mode = "markers",
              marker = list(size = 4, colorbar = list(title = "Residual")),
              color = model$residuals) %>%
  layout(scene = list(
    zaxis = list(title = "Residual")),
    title = "3D Residuals (x, y, residual)")

3D Plotly Visualization

Conclusion

Simple linear regression models the relationship between a predictor and response with interpretable parameters (intercept and slope).
Residuals and MSE help assess fit quality.
Inference on the slope uses a t-statistic and t-based confidence intervals.
The workflow: (1) visualize, (2) fit model, (3) inspect output, (4) perform inference.