2026-03-08

Introduction

Simple linear regression is a statistical method used to model the relationship between two variables.

We use one variable, \(x\), to predict another variable, \(y\).

Example: predicting stopping distance from speed using the built-in cars dataset in R.

Why It Matters

Simple linear regression helps us:

- understand trends in data
- measure how strongly two variables are related
- make predictions

It is commonly used in business, biology, engineering, and data science.

Regression Model

The simple linear regression model is:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Where:

- \(y\) is the response variable
- \(x\) is the predictor variable
- \(\beta_0\) is the intercept
- \(\beta_1\) is the slope
- \(\epsilon\) is the random error

Least Squares Idea

The regression line is chosen to minimize the sum of squared residuals:

\[ \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

This is called the least squares criterion.

A residual is the signed vertical difference between an observed value and its predicted value, \(e_i = y_i - \hat{y}_i\).
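For the simple linear case, the least squares criterion has a standard closed-form solution: \(\hat{\beta}_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). The sketch below applies these formulas to the cars data and checks them against `lm()`. The names `b0`, `b1`, `residuals_manual`, and `sse` are illustrative and do not come from the original slides.

```r
# Closed-form least squares estimates for simple linear regression,
# checked against lm(). Uses the built-in cars dataset.
x <- cars$speed
y <- cars$dist

# Slope: sum of cross-deviations divided by sum of squared x-deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: makes the line pass through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

# Fitted values, residuals, and the quantity that least squares minimizes
y_hat <- b0 + b1 * x
residuals_manual <- y - y_hat
sse <- sum(residuals_manual^2)

# Compare with R's built-in fit; TRUE means the estimates agree
fit <- lm(dist ~ speed, data = cars)
all.equal(unname(coef(fit)), c(b0, b1))
```

Because the model includes an intercept, the residuals from this fit sum to zero up to floating-point error.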

Dataset Example

We will use the built-in cars dataset.

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

The variables are:

- speed: speed of the car, in miles per hour
- dist: stopping distance, in feet

We want to see whether stopping distance tends to increase as speed increases.
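The code that produced the listing above is not shown; a minimal sketch of how such a listing and a quick overview of the data could be obtained in base R:

```r
# First six rows of the built-in dataset
head(cars)

# Structure: number of observations and the type of each variable
str(cars)

# Five-number summary and mean for speed and dist
summary(cars)
```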

Scatterplot with Regression Line

Scatterplot Interpretation

This plot shows a positive relationship between speed and stopping distance.

As speed increases, stopping distance generally increases as well.
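The strength of this relationship can be quantified. As a sketch, the Pearson correlation and the model's R-squared can be computed as follows; in simple linear regression the squared correlation equals R-squared. The names `r` and `fit` are illustrative.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
r

# R-squared: share of the variance in dist explained by the linear fit
fit <- lm(dist ~ speed, data = cars)
summary(fit)$r.squared

# For one predictor, the squared correlation matches R-squared
all.equal(r^2, summary(fit)$r.squared)
```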

Residual Visualization

Residual Plot Interpretation

This residual plot helps us check whether the linear model is reasonable.

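If the residuals are scattered randomly around zero with no visible pattern, a linear model is a reasonable choice. A curved trend or a funnel shape suggests that the relationship is not linear or that the error variance is not constant.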
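The code behind the residual plot is not shown in these slides. One way to draw a residuals-versus-fitted plot with ggplot2 is sketched below; the data frame `diag_df` and the plot labels are assumptions made for illustration.

```r
library(ggplot2)

fit <- lm(dist ~ speed, data = cars)

# Collect fitted values and residuals in a data frame for plotting
diag_df <- data.frame(
  fitted = fitted(fit),
  residual = resid(fit)
)

# Residuals against fitted values, with a dashed reference line at zero
p_resid <- ggplot(diag_df, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Residuals vs Fitted Values",
    x = "Fitted stopping distance",
    y = "Residual"
  )

p_resid
```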

Interactive Plotly Plot

Plotly Interpretation

This interactive plot lets the viewer hover over points and inspect the data more closely.

Interactive visualizations make it easier to explore individual observations.
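The plotly code is not included in the slides, so the following is only a sketch of how a comparable interactive plot might be built with `plot_ly()`. The trace names and axis titles are assumptions.

```r
library(plotly)

fit <- lm(dist ~ speed, data = cars)

# Observed points plus the fitted regression line; hovering shows each value
fig <- plot_ly(cars, x = ~speed, y = ~dist) %>%
  add_markers(name = "Observed") %>%
  add_lines(y = ~fitted(fit), name = "Fitted line") %>%
  layout(
    title = "Stopping Distance vs Speed",
    xaxis = list(title = "Speed"),
    yaxis = list(title = "Stopping distance")
  )

fig
```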

Model Summary

Term         Estimate  Std_Error  t_value  p_value
(Intercept)   -17.579      6.758   -2.601   0.0123
speed           3.932      0.416    9.464  <0.0001

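The slope tells us how much stopping distance is expected to change for a one-unit increase in speed. Here, each additional 1 mph of speed is associated with about 3.93 ft more stopping distance, on average.

The intercept is the predicted stopping distance at a speed of zero. Its negative value has no physical meaning, because a speed of zero lies outside the range of the observed data.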

A positive slope suggests that faster cars generally need more distance to stop.
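A table like the one in this section can be pulled from the fitted model object. The sketch below extracts the coefficient matrix from `summary()` and adds 95% confidence intervals with `confint()`; the name `coef_table` is illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Coefficient matrix: estimate, standard error, t value, and p value
coef_table <- summary(fit)$coefficients
round(coef_table, 3)

# 95% confidence intervals for the intercept and the slope
confint(fit, level = 0.95)
```

Rounding to a fixed number of decimals shows very small p-values as zero, so they are usually better reported as an inequality or in scientific notation.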

R Code Example

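library(ggplot2)

# Fit stopping distance as a linear function of speed
model <- lm(dist ~ speed, data = cars)

# Scatterplot with the fitted least squares line and its confidence band
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Regression Line for Cars Data")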
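A fitted model can also be used to make predictions. As a sketch, `predict()` returns the predicted stopping distance and a 95% prediction interval for new speeds; the speeds 10, 15, and 20 are arbitrary illustrative values within the observed range.

```r
model <- lm(dist ~ speed, data = cars)

# New speeds to predict for (illustrative values, not from the slides)
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions with 95% prediction intervals (columns fit, lwr, upr)
pred <- predict(model, newdata = new_speeds, interval = "prediction", level = 0.95)
cbind(new_speeds, pred)
```

A prediction interval describes the uncertainty for a single new observation, so it is wider than the confidence interval for the mean response at the same speed.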

Conclusion

Simple linear regression is a foundational statistical tool for modeling relationships between variables.

In this example, we saw that stopping distance tends to increase as speed increases.

Static plots from ggplot2 and interactive plots from plotly make these results easier to communicate.