Simple linear regression models the relationship between two numeric variables: a predictor \(x\) and a response \(y\).
The idea is to draw a straight line through the data that best explains how \(y\) changes as \(x\) changes.
A common use case: predicting a student’s exam score based on hours studied.
\[y = \beta_0 + \beta_1 x + \varepsilon\]
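Where \(y\) is the response, \(x\) is the predictor, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is the random error term.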
We estimate \(\beta_0\) and \(\beta_1\) using Ordinary Least Squares (OLS), which chooses the values that minimize the sum of squared differences between observed and predicted values:
\[\min_{\beta_0, \beta_1} \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\]
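To make the objective concrete, here is a minimal R sketch that writes the sum of squared errors as a function and hands it to a general-purpose optimizer. The data values and the names `sse` and `fit_numeric` are invented for illustration; they are not the `df` used in this example.

```r
# Hypothetical data, invented for illustration (not the article's df)
hours  <- c(1, 2, 3, 4, 5, 6)
scores <- c(52, 58, 67, 71, 80, 84)

# Sum of squared errors for a candidate intercept b[1] and slope b[2]
sse <- function(b) sum((scores - b[1] - b[2] * hours)^2)

# Minimize numerically, starting from intercept = 0 and slope = 0
fit_numeric <- optim(par = c(0, 0), fn = sse, method = "BFGS")

fit_numeric$par    # approximate intercept and slope
fit_numeric$value  # sum of squared errors at the minimum
```

In practice `lm()` solves this problem exactly, so a numerical optimizer is only useful here as a way to see what "least squares" means.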
The formulas for the estimates are:
\[\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
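The closed-form estimates can be checked by hand. The sketch below applies the two formulas to a small made-up data set (the values and the names `beta0_hat` and `beta1_hat` are assumptions for illustration) and compares the result with `lm()`.

```r
# Hypothetical data, invented for illustration (not the article's df)
hours  <- c(1, 2, 3, 4, 5, 6)
scores <- c(52, 58, 67, 71, 80, 84)

x_bar <- mean(hours)
y_bar <- mean(scores)

# Slope: co-variation of x and y divided by the variation of x
beta1_hat <- sum((hours - x_bar) * (scores - y_bar)) / sum((hours - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

c(intercept = beta0_hat, slope = beta1_hat)

# The same estimates from lm(), for comparison
coef(lm(scores ~ hours))
```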
library(ggplot2)

# Fit a simple linear regression
# (df is assumed to be a data frame with numeric columns hours and scores)
model <- lm(scores ~ hours, data = df)

# View the results
summary(model)

# Plot with regression line
ggplot(df, aes(x = hours, y = scores)) +
  geom_point(color = "#8C1D40", size = 3) +
  geom_smooth(method = "lm", se = TRUE, color = "steelblue") +
  labs(title = "Regression Line: Hours vs Score",
       x = "Hours Studied", y = "Exam Score") +
  theme_minimal()
##
## Call:
## lm(formula = scores ~ hours, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.4400 -0.8317 -0.3194  1.0280  3.3825
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  46.1388     1.4872   31.02 1.27e-09 ***
## hours         6.5859     0.3518   18.72 6.84e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.943 on 8 degrees of freedom
## Multiple R-squared: 0.9777, Adjusted R-squared: 0.9749
## F-statistic: 350.5 on 1 and 8 DF, p-value: 6.84e-08
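The numbers in a summary like this can also be pulled out programmatically and used for prediction. The sketch below is illustrative only: `df_demo`, `model_demo`, and `new_students` are hypothetical names and made-up values, not the data behind the output above.

```r
# Hypothetical data, invented for illustration (not the article's df)
df_demo <- data.frame(
  hours  = c(1, 2, 3, 4, 5, 6),
  scores = c(52, 58, 67, 71, 80, 84)
)

model_demo <- lm(scores ~ hours, data = df_demo)

coef(model_demo)                    # intercept and slope estimates
summary(model_demo)$r.squared       # Multiple R-squared
confint(model_demo, level = 0.95)   # 95% confidence intervals for both coefficients

# Predicted scores, with 95% prediction intervals, for new values of hours
new_students <- data.frame(hours = c(2.5, 7))
preds <- predict(model_demo, newdata = new_students, interval = "prediction")
preds
```

Note that 7 hours lies outside the range of the made-up data, so that prediction is an extrapolation and should be treated with more caution than the one at 2.5 hours.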
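Good regression requires checking a few things, typically that the relationship is roughly linear, that the residuals are independent with roughly constant variance, and that they are approximately normally distributed.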
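Two common visual checks can be sketched with base R graphics: a residuals-versus-fitted plot and a normal Q-Q plot of the residuals. The data and the names `df_demo` and `model_demo` are hypothetical, chosen only to keep the example self-contained.

```r
# Hypothetical data, invented for illustration (not the article's df)
df_demo <- data.frame(
  hours  = c(1, 2, 3, 4, 5, 6),
  scores = c(52, 58, 67, 71, 80, 84)
)

model_demo <- lm(scores ~ hours, data = df_demo)

res         <- residuals(model_demo)
fitted_vals <- fitted(model_demo)

# Residuals vs fitted: a curve suggests non-linearity,
# a funnel shape suggests non-constant variance
plot(fitted_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(res)
qqline(res)
```

With only a handful of observations, these plots are a rough guide rather than a formal test.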
In our example, hours studied is a strong predictor of exam score: the model explains about 98% of the variance (\(R^2 = 0.9777\)), and each additional hour of study is associated with roughly 6.6 more points (no surprise there).