The goal of simple linear regression is to find the single best straight line that describes the relationship between two quantitative variables.
Our examples will use 16 randomly selected diamonds from the ggplot2 data set diamonds.
2025-09-12
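The relationship is modeled using the equation of a straight line with an added term for random error. \[ Y_i=\beta_0 + \beta_1X_i + \epsilon_i \] Here \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon_i\) is the random error for observation \(i\).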
The “best” line is the one that minimizes the total prediction error. An error is the vertical distance from an observed data point to the line. We square each error and add the squares to get the Sum of Squared Errors (SSE), then find the estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) that minimize it. \[ SSE=\sum_{i=1}^n\left(y_i - \hat{y_i}\right)^2 = \sum_{i=1}^n\left(y_i - \left(\hat{\beta_0} + \hat{\beta_1}x_i\right)\right)^2 \] Any other line would have a larger SSE. This method is called Ordinary Least Squares (OLS).
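The minimization above has a well-known closed-form solution for simple linear regression, which the following sketch compares against lm(). The vectors x and y are toy values invented for illustration (they are not the diamonds sample), and sse() is a hypothetical helper name.

```r
# Toy data, invented for illustration (not the diamonds sample).
x <- c(0.3, 0.5, 0.7, 1.0, 1.2, 1.5)
y <- c(500, 1500, 2600, 4800, 6000, 7400)

# Sum of squared errors for a candidate intercept b0 and slope b1.
sse <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# Closed-form OLS estimates for simple linear regression.
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

# lm() should agree with the hand computation.
fit <- lm(y ~ x)
print(all.equal(unname(coef(fit)), c(b0_hat, b1_hat)))

# Nudging either coefficient away from the estimates increases the SSE.
print(sse(b0_hat, b1_hat, x, y) < sse(b0_hat + 50, b1_hat, x, y))
print(sse(b0_hat, b1_hat, x, y) < sse(b0_hat, b1_hat + 50, x, y))
```

The last two comparisons only probe two nearby lines. The general claim that every other line has a larger SSE follows from the SSE being a strictly convex function of the two coefficients whenever the x values are not all equal.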
We can visualize the SSE for every possible combination of slope and intercept in a three-dimensional graph. The minimum point on this surface gives us our best OLS estimates for \(\hat{\beta_0}\) and \(\hat{\beta_1}\).
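One way to get a feel for this surface is to evaluate the SSE on a grid of candidate intercepts and slopes and look for the lowest point. The sketch below does that on toy data invented for illustration (not the diamonds sample). The grid ranges and the 25-unit step are arbitrary assumptions chosen to bracket the fitted values.

```r
# Toy data, invented for illustration (not the diamonds sample).
x <- c(0.3, 0.5, 0.7, 1.0, 1.2, 1.5)
y <- c(500, 1500, 2600, 4800, 6000, 7400)

# Candidate intercepts and slopes (ranges and step size are assumptions).
b0_values <- seq(-3000, 0, by = 25)
b1_values <- seq(4000, 8000, by = 25)
grid <- expand.grid(b0 = b0_values, b1 = b1_values)

# SSE at every (intercept, slope) pair on the grid.
grid$sse <- mapply(
  function(b0, b1) sum((y - (b0 + b1 * x))^2),
  grid$b0, grid$b1
)

# The lowest grid point approximates the OLS estimates.
best <- grid[which.min(grid$sse), ]
print(best)

# Exact OLS estimates for comparison.
fit <- lm(y ~ x)
print(coef(fit))
```

The grid minimum can only be as precise as the grid spacing, so it lands near, not exactly on, the lm() coefficients. Reshaping grid$sse with matrix(grid$sse, nrow = length(b0_values)) gives a matrix that base R's persp() can draw as a three-dimensional surface.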
Equation for the “best” fit line (the fitted line contains only the estimated intercept and slope, with no error term): \[ \begin{align*} \hat{y}_i &= \hat{\beta_0} + \hat{\beta_1}x_i \\ \hat{y} &= -1060.85 + 5587.97x \end{align*} \]
Now that we’ve determined the best estimates for intercept and slope, we can draw our regression line on our scatter plot. This represents how our model predicts the dependent variable given any value of our independent variable.
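Predicting from a fitted model can be sketched with predict(). The data frame toy and the new carat values are invented for illustration and are not the diamonds sample, so the numbers it produces will differ from the estimates reported in this article.

```r
# Toy data, invented for illustration (not the diamonds sample).
toy <- data.frame(
  carat = c(0.3, 0.5, 0.7, 1.0, 1.2, 1.5),
  price = c(500, 1500, 2600, 4800, 6000, 7400)
)

# Fit price as a linear function of carat.
fit <- lm(price ~ carat, data = toy)

# Predict price for carat values that were not in the data.
new_stones <- data.frame(carat = c(0.4, 0.9))
predicted <- predict(fit, newdata = new_stones)
print(predicted)

# Each prediction is just intercept + slope * carat.
by_hand <- coef(fit)[1] + coef(fit)[2] * new_stones$carat
print(all.equal(unname(predicted), unname(by_hand)))
```

The column name in newdata has to match the predictor name used in the model formula, here carat.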
Implementing linear regression in R is straightforward using the lm() function. There are many plotting packages; this example uses ggplot2.
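library(ggplot2)  # diamonds data set and plotting functions
library(scales)   # label_dollar()

# Reusable y-axis scale that formats prices as dollars
scale_y_dollar <- scale_y_continuous(labels = label_dollar())

# Draw ceiling(log2(n)) rows at random; for the 53,940-row diamonds data this is 16
sample_row_indices <- sample(nrow(diamonds), ceiling(log(nrow(diamonds), base = 2)))
plot_data <- diamonds[sample_row_indices, ]

# Fit price as a linear function of carat
model <- lm(price ~ carat, data = plot_data)

# Scatter plot with the fitted line overlaid
ggplot(plot_data, aes(x = carat, y = price)) +
  geom_point(alpha = 0.75, size = 3, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, aes(color = "Best-fit Line")) +
  labs(title = "Price (USD) vs. Carat (weight)", x = "Carat (weight)", y = "Price (USD)", color = "") +
  scale_y_dollar +
  theme_minimal()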
The residuals are the prediction errors, \(y_i - \hat{y_i}\). They give us information about our model’s performance. A good model should have residuals scattered around zero, both positive and negative, with no discernible pattern.
We want to see a cloud of points without a regular shape. If this cloud starts taking a shape, the model might be flawed, or the relationship may not be linear.
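A residual check along these lines can be sketched in base R. The data frame toy is invented for illustration (it is not the diamonds sample), and the plot uses base graphics instead of ggplot2 only to keep the example short.

```r
# Toy data, invented for illustration (not the diamonds sample).
toy <- data.frame(
  carat = c(0.3, 0.5, 0.7, 1.0, 1.2, 1.5),
  price = c(500, 1500, 2600, 4800, 6000, 7400)
)
fit <- lm(price ~ carat, data = toy)

# Residuals are observed minus fitted values.
res <- resid(fit)
print(all.equal(unname(res), toy$price - unname(fitted(fit))))

# With an intercept in the model, OLS residuals sum to (numerically) zero.
print(sum(res))

# Residuals against fitted values: look for a shapeless cloud around zero.
plot(fitted(fit), res, xlab = "Fitted price", ylab = "Residual")
abline(h = 0, lty = 2)
```

A curve, funnel, or other regular shape in this plot would be the kind of pattern that suggests a flawed model or a relationship that is not linear.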
In our example, a slope of 5587.97 means that, on average, each one-carat increase in weight is associated with a 5587.97 USD increase in price.
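As a worked check of this interpretation, the fitted line reported earlier (intercept -1060.85, slope 5587.97) can be evaluated at two weights. The carat values 0.5 and 1.0 are chosen here purely for illustration and are not taken from the sample.

\[ \begin{align*} \hat{y}(0.5) &= -1060.85 + 5587.97(0.5) = 1733.135 \\ \hat{y}(1.0) &= -1060.85 + 5587.97(1.0) = 4527.12 \\ \hat{y}(1.0) - \hat{y}(0.5) &= 2793.985 = 5587.97 \times 0.5 \end{align*} \]

A half-carat increase moves the predicted price by half the slope, and the intercept cancels out of the difference, which is why the slope alone describes the change per carat.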