Simple Linear Regression

2026-03-07

What is Simple Linear Regression?

Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables:

One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

Simple linear regression helps make predictions and understand relationships between one independent variable and one dependent variable.

For example, you might want to know how a tree’s girth (independent variable) relates to its volume (dependent variable). By collecting data and fitting a simple linear regression model, you could quantify this relationship and predict how a tree’s volume changes as its girth changes.

Example of Simple Linear Regression

The regression line here illustrates a positive linear relationship between a tree’s girth and its volume: trees with a larger girth generally have a larger volume.

Regression Model Equation

The regression model equation is:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

where:

\(\beta_0\) is the y-intercept
\(\beta_1\) is the slope of the regression line
\(\epsilon\) is the random error term
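
As a minimal sketch of how this model might be fitted in practice, the code below uses R's built-in `trees` dataset, with `Girth` as x and `Volume` as y. The object names `model`, `beta_hat`, and `new_tree`, and the girth of 12 inches, are illustrative assumptions rather than anything prescribed by the model.

```r
# Fit y = beta_0 + beta_1 * x + error, with Volume as y and Girth as x
model <- lm(Volume ~ Girth, data = trees)

# Estimated intercept (beta_0) and slope (beta_1)
beta_hat <- coef(model)
print(beta_hat)

# Predicted volume for a hypothetical tree with a girth of 12 inches
new_tree <- data.frame(Girth = 12)
predicted_volume <- predict(model, newdata = new_tree)
print(predicted_volume)
```

The fitted coefficients are estimates of \(\beta_0\) and \(\beta_1\). The error term \(\epsilon\) is not estimated directly; it shows up as the gap between each observed value and the fitted line.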

Least Squares Equation

The least squares method finds the regression line that minimizes the sum of squared residuals. Essentially, it chooses the line that keeps the prediction errors as small as possible overall:

\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

where:

\(y_i\) is the observed value
\(\hat{y}_i\) is the predicted value
\(y_i - \hat{y}_i\) is the residual
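
For simple linear regression, this minimization has a well-known closed-form solution, where \(\bar{x}\) and \(\bar{y}\) denote the sample means of x and y:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

The sketch below computes these estimates by hand for R's built-in `trees` dataset (an assumed example) and compares them with `lm()`. The names `slope`, `intercept`, and `ssr` are illustrative.

```r
x <- trees$Girth
y <- trees$Volume

# Closed-form least squares estimates
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# Sum of squared residuals for any candidate line b0 + b1 * x
ssr <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# The same estimates from lm(), for comparison
fit <- lm(Volume ~ Girth, data = trees)
print(c(intercept = intercept, slope = slope))
print(coef(fit))

# Nudging the slope away from the least squares value increases the sum
print(ssr(intercept, slope))
print(ssr(intercept, slope + 0.1))
```

Any other choice of intercept or slope gives a larger sum of squared residuals, which is what makes this line the least squares line.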

What is the Best Fitting Line?

The best fitting line, also called the line of best fit, is the straight line that best represents the relationship between two variables in a dataset.

It is used to:

Make predictions about the data
Show the overall trend in the data

Here, the line slopes slightly downward, indicating a weak negative correlation between length and width in this dataset.

Residuals

A residual is the difference between an observed value and the value predicted by the regression model:

\[ \text{Residual} = y_i - \hat{y}_i \]

Residuals measure how far each data point lies from the regression line, measured vertically.

If the residual is positive, the point lies above the line
If the residual is negative, the point lies below the line

Smaller residuals indicate a better model fit: the regression line passes closer to the observed points and represents the general trend of the data more accurately.
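
To make the definition concrete, the following sketch computes the residuals for a model fitted to R's built-in `trees` dataset (an assumed example; the names `fit` and `res` are illustrative) and counts how many points lie above and below the line.

```r
fit <- lm(Volume ~ Girth, data = trees)

# Residual = observed value minus predicted (fitted) value
res <- trees$Volume - fitted(fit)

# resid() returns the same values directly
print(all.equal(unname(res), unname(resid(fit))))

# Points above the line have positive residuals, points below have negative ones
print(c(above = sum(res > 0), below = sum(res < 0)))

# With an intercept in the model, least squares residuals sum to (numerically) zero
print(sum(res))

# Sum of squared residuals: a single-number summary of how far the points are from the line
print(sum(res^2))
```

Because positive and negative residuals cancel out, their plain sum is not a useful measure of fit, which is one reason the residuals are squared before being added up.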

Example of Residuals

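library(ggplot2)
library(plotly)

# Fit the model so each tree's fitted (predicted) volume is available
model <- lm(Volume ~ Girth, data = trees)
trees_fit <- transform(trees, Fitted = fitted(model))

fig <- ggplot(trees_fit, aes(x = Girth, y = Volume)) +
  # Each vertical segment is one residual: observed minus fitted volume
  geom_segment(aes(xend = Girth, yend = Fitted), color = "grey60") +
  geom_point(color = "darkblue") +
  # Least squares regression line
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  labs(
    title = "Residuals in Linear Regression",
    x = "Girth (inches)",
    y = "Volume (cubic feet)"
  ) +
  theme_minimal()

# Convert to an interactive plotly widget; it is displayed when evaluated
# at the top level (in the console or in a knitted document)
ggplotly(fig)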