2026-03-08

Libraries

library(plotly)
library(ggplot2)
library(dplyr)
set.seed(12)

What is Simple Linear Regression?

In statistics, we use Simple Linear Regression to model the relationship between two numeric variables.

  • Independent Variable (X): This is the variable we use to predict.

  • Dependent Variable (Y): This is the variable we want to predict.

We try to find the straight line that best fits the data points.

The Regression Equation

We can write the relationship as a mathematical formula:

\[Y = \beta_0 + \beta_1 X + \epsilon\]

Where:

  • \(Y\) is the observed value of the dependent variable.
  • \(\beta_0\) is the intercept (the value of \(Y\) on the line when \(X = 0\), where the line crosses the Y-axis).
  • \(\beta_1\) is the slope (how much \(Y\) changes on average when \(X\) goes up by 1).
  • \(\epsilon\) is the error or “noise” in the data.

How to calculate the Slope?

To find the best line, we need to calculate the slope \(\beta_1\). The formula is:

\[\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\]

This slope, together with the matching intercept, minimizes the sum of the squared vertical distances between the actual data points and our line. This method is called Ordinary Least Squares (OLS).
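As a quick check of the formula, here is a minimal sketch that computes the slope by hand on the built-in trees data and compares it with lm(). The intercept line uses the standard OLS result \(\beta_0 = \bar{y} - \beta_1 \bar{x}\), which is not shown on the slide, and the variable names (beta0, beta1, manual, from_lm) are only illustrative.

```r
# Compute the OLS slope and intercept by hand for the built-in trees data
x <- trees$Girth
y <- trees$Volume

# Slope: sum of cross-deviations divided by the sum of squared deviations of x
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept (standard OLS result): the line passes through (mean(x), mean(y))
beta0 <- mean(y) - beta1 * mean(x)

# Compare with the estimates from lm(); the two should agree
manual  <- c(beta0, beta1)
from_lm <- unname(coef(lm(Volume ~ Girth, data = trees)))
all.equal(manual, from_lm)
```

If the two results agree, all.equal() returns TRUE.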

Example: Tree Girth and Volume

We will use the trees dataset in R, which has measurements of 31 black cherry trees. We want to see if the Girth of a tree (in inches) can predict its Volume (in cubic feet).

Here is the R code to create a simple model:

# Create the linear model
model <- lm(Volume ~ Girth, data = trees)

# Show the coefficient table from the model summary
summary(model)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -36.943459   3.365145 -10.97827 7.621449e-12
## Girth         5.065856   0.247377  20.47829 8.644334e-19
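To show how the fitted coefficients are used, here is a small sketch that predicts Volume for two girth values with predict(). The values 10 and 15 inches are made-up inputs chosen inside the range of the data, and new_trees and pred are illustrative names.

```r
# Refit the model so this chunk runs on its own
model <- lm(Volume ~ Girth, data = trees)

# Hypothetical new trees: girth values picked only for illustration
new_trees <- data.frame(Girth = c(10, 15))

# Predicted volume = intercept + slope * Girth
pred <- predict(model, newdata = new_trees)
pred
```

With the coefficients above, the predictions come out to about 13.7 and 39.0 cubic feet.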

Visualizing with ggplot2 (Part 1)

This plot shows the raw data points of Tree Girth versus Volume.

Visualizing with ggplot2 (Part 2)

Now we add the Regression Line to the plot to see the trend.
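The code for this slide is not shown, so the following is only a sketch of one way to draw it. It assumes geom_smooth() with method = "lm"; the red color, se = FALSE, and the title are my own choices, not necessarily what the original plot used.

```r
library(ggplot2)

# Scatter plot of the trees data with a fitted straight line on top
p_line <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(color = "darkblue") +
  # method = "lm" fits a linear model; se = FALSE hides the confidence band
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  theme_minimal() +
  labs(title = "Tree Data with Regression Line", x = "Girth of Tree",
       y = "Volume")
p_line
```

Because no formula is given, geom_smooth() prints a message that it is using y ~ x.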

## `geom_smooth()` using formula = 'y ~ x'

Interactive Plot with Plotly

Below is an interactive plot using plotly. You can hover over the points to see the values.
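The code for the interactive plot is also not shown. A minimal sketch with plot_ly() could look like this; the title and axis labels are assumptions, and p_int is an illustrative name.

```r
library(plotly)

# Interactive scatter plot; hovering over a point shows its Girth and Volume
p_int <- plot_ly(data = trees, x = ~Girth, y = ~Volume,
                 type = "scatter", mode = "markers")

# Add a title and axis labels
p_int <- layout(p_int, title = "Tree Girth vs Volume",
                xaxis = list(title = "Girth"),
                yaxis = list(title = "Volume"))
p_int
```

Another option is to build the plot with ggplot2 first and pass it to ggplotly().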

R Code: Creating the Scatter Plot

This is the code I used to visualize the tree data using the ggplot2 library.

ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(color = "darkblue") +
  theme_minimal() +
  labs(title = "Scatter plot of Tree Data", x = "Girth of Tree (inches)",
       y = "Volume (cubic feet)")

Conclusion

Simple Linear Regression is a simple but very useful tool.

  • It helps us summarize the relationship between variables.

  • It helps us predict Y for new values of X.

  • It is the first step for many advanced data science methods.