2025-09-12

What problem does it solve?

The goal of simple linear regression is to find the single best straight line that describes the relationship between two quantitative variables.

Our examples use 16 randomly selected diamonds from the diamonds dataset in ggplot2.

Mathematical Model

The relationship is modeled using the equation of a straight line with an added term for random error. \[ Y_i=\beta_0 + \beta_1X_i + \epsilon_i \]

  • \(Y_i\) is the dependent variable for the \(i\)-th observation.
  • \(X_i\) is the independent variable for the \(i\)-th observation.
  • \(\beta_0\) is the y-intercept, the value of \(Y\) when \(X=0\).
  • \(\beta_1\) is the slope, the expected change in \(Y\) for a one-unit increase in \(X\).
  • \(\epsilon_i\) is the error term accounting for unexplained variation.
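
To make the model concrete, here is a minimal R sketch that simulates data from this equation. The values of \(\beta_0\), \(\beta_1\), the error standard deviation, and the sample size are illustrative choices, not estimates from the diamonds data.

# Simulate n observations from Y_i = beta_0 + beta_1 * X_i + eps_i
# (all parameter values are illustrative)
set.seed(1)
n      <- 16
beta_0 <- 2                             # true intercept
beta_1 <- 3                             # true slope
x      <- runif(n, min = 0, max = 10)   # independent variable
eps    <- rnorm(n, mean = 0, sd = 1.5)  # random error
y      <- beta_0 + beta_1 * x + eps     # dependent variable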

How is the “Best” Line Determined?

Principle of Least Squares

The “best” line is the one that minimizes the total prediction error. The error for each observation is the vertical distance from the observed data point to the line, and we measure the total error by summing the squares of these errors, the Sum of Squared Errors (SSE). We find the estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) that minimize this function. \[ SSE=\sum_{i=1}^n\left(y_i - \hat{y_i}\right)^2 = \sum_{i=1}^n\left(y_i - \left(\hat{\beta_0} + \hat{\beta_1}x_i\right)\right)^2 \] Any other line would have a larger total squared error. This method is called Ordinary Least Squares (OLS).
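
As a sketch of how the minimization works in practice, the SSE can be computed in R for any candidate line, and the minimizing estimates have a closed form. Here x and y are assumed to hold the sample's carat and price values (e.g., plot_data$carat and plot_data$price from the R section below).

# SSE for a candidate line with intercept b0 and slope b1
sse <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)

# Closed-form OLS estimates that minimize the SSE
ols <- function(x, y) {
  b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
  b0 <- mean(y) - b1 * mean(x)
  c(intercept = b0, slope = b1)
}
# ols(x, y) should match coef(lm(y ~ x))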

Visualizing the Error

The bottom of the bowl

We can visualize the SSE for every possible combination of slope and intercept as a three-dimensional surface. Because the SSE is a quadratic function of \(\hat{\beta_0}\) and \(\hat{\beta_1}\), this surface is a bowl, and the minimum point at the bottom of the bowl gives us our best OLS estimates for \(\hat{\beta_0}\) and \(\hat{\beta_1}\).
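
One way to see the bowl is to evaluate the SSE over a grid of candidate intercepts and slopes. The grid ranges below are illustrative, chosen to bracket the estimates reported in the next section, and x and y again hold the sampled carat and price values.

# Evaluate the SSE over a grid of (intercept, slope) candidates
b0_grid <- seq(-3000, 1000, length.out = 100)
b1_grid <- seq(3000, 8000, length.out = 100)
sse_surface <- outer(b0_grid, b1_grid,
                     Vectorize(function(b0, b1) sum((y - (b0 + b1 * x))^2)))

# The lowest point on the surface is the OLS solution
idx <- which(sse_surface == min(sse_surface), arr.ind = TRUE)
c(intercept = b0_grid[idx[1]], slope = b1_grid[idx[2]])
# persp(b0_grid, b1_grid, sse_surface) draws the bowl in base R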

The Linear Model

Our calculated linear model:
  • \(\hat{\beta_0} = -1060.85\)
  • \(\hat{\beta_1} = 5587.97\)

Equation for the “best” fit line (the fitted line predicts \(\hat{y}\), so the error term \(\epsilon_i\) does not appear): \[ \begin{align*} \hat{y} &= \hat{\beta_0} + \hat{\beta_1}x \\ \hat{y} &= -1060.85 + 5587.97x \end{align*} \]

Best-Fit Line Visualized

The Model’s Prediction

Now that we’ve determined the best estimates for the intercept and slope, we can draw our regression line on the scatter plot. This line gives the model’s predicted value of the dependent variable for any value of the independent variable.
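
For example, plugging a hypothetical 0.5-carat diamond into the fitted equation gives the model’s predicted price: \[ \hat{y} = -1060.85 + 5587.97 \times 0.5 \approx 1733.14 \]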

What it Looks Like in R

From concept to implementation in R

Implementing linear regression in R is straightforward using the lm() function. There are many plotting packages; this example uses ggplot2, along with label_dollar() from the scales package for the axis labels.

library(ggplot2)  # plotting; also provides the diamonds dataset
library(scales)   # label_dollar() for currency axis labels

# Format the y axis as dollars
scale_y_dollar <- scale_y_continuous(labels = label_dollar())

# Randomly select ceiling(log2(53940)) = 16 rows from diamonds
sample_row_indices <- sample(nrow(diamonds),
                             ceiling(log(nrow(diamonds), base = 2)))
plot_data <- diamonds[sample_row_indices, ]

# Fit the simple linear regression: price as a function of carat
model <- lm(price ~ carat, data = plot_data)

# Scatter plot of the sample with the OLS best-fit line overlaid
ggplot(plot_data, aes(x = carat, y = price)) +
  geom_point(alpha = 0.75, size = 3, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, aes(color = "Best-fit Line")) +
  labs(title = "Price (USD) vs. Carat (weight)",
       x = "Carat (weight)",
       y = "Price (USD)",
       color = "") +
  scale_y_dollar +
  theme_minimal()
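
To read a prediction off the model rather than the plot, the fitted object can be passed to predict(); the 0.5-carat input below is just an example value.

# Predicted price for a hypothetical 0.5-carat diamond
predict(model, newdata = data.frame(carat = 0.5))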

How good is our prediction?

Looking at the residuals

The residuals are the prediction errors, \(y_i - \hat{y_i}\). These give us information about our model’s performance: a good model has residuals scattered around zero, both positive and negative, with no discernible pattern.

We want to see a cloud of points without a regular shape. If the cloud takes on a shape, such as a curve, a funnel, or a trend, the model may be flawed or the relationship may not be linear.
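
As a sketch, a residual plot can be built in the same style as the scatter plot above; fitted() and resid() extract the fitted values and residuals from model.

# Residuals vs. fitted values; a good fit shows a patternless cloud around zero
diagnostic_data <- data.frame(fitted   = fitted(model),
                              residual = resid(model))
ggplot(diagnostic_data, aes(x = fitted, y = residual)) +
  geom_point(alpha = 0.75, size = 3, color = "blue") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs. Fitted Values",
       x = "Fitted price (USD)",
       y = "Residual (USD)") +
  theme_minimal()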

Conclusion and Interpretation

In Summary

  • Simple Linear Regression models the relationship between two variables with a straight line.
  • The “best” fit line is found by minimizing the sum of squared errors (OLS).
  • We can visualize the fit with a scatter plot and diagnose issues with a residual plot.

In our example, the slope of 5587.97 means that, on average, each additional carat of weight is associated with a $5,587.97 increase in price.
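
Both numbers come straight from the fitted object; a quick way to check:

coef(model)           # named vector with (Intercept) and carat
coef(model)["carat"]  # the slope: average price change per additional carat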