2025-11-15

Introduction

Simple linear regression is a fundamental statistical tool used to describe, explain, and predict how one numeric variable changes with another.

It comes from the work of Carl Friedrich Gauss and Adrien-Marie Legendre, who developed the method of least squares in the early 1800s. Today, it is widely used in:

  • Data science and machine learning

  • Business analytics

  • Scientific and engineering applications

In this presentation, we will:

  • Define the simple linear regression model

  • See common use cases in data science

  • Fit a model in RStudio using the built-in mtcars data set

  • Visualize the results with ggplot2 and plotly

1. Mathematical Model (1/2)

We model the relationship between a response Y and a predictor X using a straight line plus random noise:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i=1,\dots,n \]

where:

  • \(\beta_0\) is the intercept (expected value of Y when X=0)

  • \(\beta_1\) is the slope (average change in Y for a one-unit increase in X)

  • \(\varepsilon_i\) is a random error term with mean 0.

We assume:

  • The relationship between X and Y is approximately linear

  • The errors \(\varepsilon_i\) are independent, have constant variance, and are approximately normal
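To make the model and its assumptions concrete, here is a minimal simulation sketch. The values of beta0, beta1, sigma, and n are hypothetical and chosen only for illustration; the data are generated exactly as the model describes and then fitted with lm().

```r
# Simulate data from the simple linear regression model and fit it.
# beta0, beta1, sigma, and n are hypothetical values chosen for illustration.
set.seed(42)
n     <- 200
beta0 <- 2
beta1 <- 0.5
sigma <- 1

x   <- runif(n, min = 0, max = 10)     # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)  # independent normal errors, constant variance
y   <- beta0 + beta1 * x + eps         # response = straight line + noise

sim_fit <- lm(y ~ x)
coef(sim_fit)  # estimates should land close to beta0 and beta1
```

Because the simulated errors satisfy every assumption listed above, the fitted coefficients should differ from the true values only by sampling noise.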

Mathematical Model (2/2)

We estimate the parameters \(\beta_0\) and \(\beta_1\) using the least squares method. This method chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the sum of squared residuals.
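Written out, the least squares criterion can be stated as minimizing

\[ S(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2. \]

Setting the partial derivatives of \(S\) with respect to \(\beta_0\) and \(\beta_1\) to zero gives the two normal equations:

\[ \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0, \qquad \sum_{i=1}^n x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0. \]

Solving this pair of equations yields the closed-form estimators that follow.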

The slope estimator is:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})} {\sum_{i=1}^n (x_i - \bar{x})^2}. \]

The intercept estimator is:

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \]

For any new value \(x_0\), the predicted value of \(Y\) is

\[ \hat{y}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0. \]
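As a quick check of these formulas, the sketch below computes the estimates by hand on the built-in mtcars data (wt as x, mpg as y) and compares them with lm(). The new value x0 = 3 is a hypothetical choice for illustration.

```r
# Least squares estimates computed directly from the formulas
x <- mtcars$wt
y <- mtcars$mpg

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Prediction at a hypothetical new weight x0 = 3 (i.e. 3000 lbs)
x0     <- 3
y_hat0 <- b0 + b1 * x0

c(intercept = b0, slope = b1, prediction = y_hat0)

# The same coefficients from R's built-in fitting function
coef(lm(mpg ~ wt, data = mtcars))
```

The hand-computed values should agree with lm() up to floating-point rounding.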

2. Common Use Cases in Data Science

Simple linear regression is used to:

  • Predict a numeric output (e.g., price, energy usage, fuel consumption)

  • Understand how a predictor affects a response (effect size, direction)

  • Create baseline models before more complex machine learning methods

  • Explore relationships between variables during EDA (Exploratory Data Analysis)

Examples:

  • How does car weight affect fuel efficiency?

  • How does temperature affect electricity consumption?

  • How does study time affect exam score?

3. Visual Example (1/2)

Here we use the built-in mtcars data set. We study the relationship between the predictor X (wt, the car weight in 1000 lbs) and the response Y (mpg, the fuel efficiency in miles per gallon).

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "steelblue", size = 3) +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +  # fitted line with confidence band
  theme_minimal() +
  labs(
    title = "Fuel Efficiency vs Car Weight (mtcars)",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon (mpg)"
  )

Visual Example (2/2)

Visual Example Colored by Cylinders

We can add more information by coloring points by the number of cylinders. Because color is mapped to the cylinder group, a separate regression line is fitted for each group:

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +  # one least squares line per cylinder group
  theme_minimal() +
  labs(
    title = "MPG vs Weight, Colored by Cylinders",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon (mpg)",
    color = "Cylinders"
  )
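The per-group lines in this plot correspond to separate simple regressions. As a rough sketch, the coefficients behind each line can be obtained by splitting the data by cylinder count; the names by_cyl and coefs_by_cyl are introduced here only for illustration.

```r
# Fit mpg ~ wt separately within each cylinder group (4, 6, 8)
by_cyl <- split(mtcars, mtcars$cyl)

coefs_by_cyl <- sapply(by_cyl, function(d) coef(lm(mpg ~ wt, data = d)))

# Rows: intercept and slope; columns: one per cylinder group
round(coefs_by_cyl, 3)
```

Comparing the slopes across columns shows how the weight effect differs between groups.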

4. Fitting the Model in R (1/3)

We now fit a simple linear regression model, with mpg as the response and wt as the predictor.

# Fit a simple linear regression model
mod <- lm(mpg ~ wt, data = mtcars)
# Show a summary of the model:
summary(mod)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
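Once the model is fitted, it can be used for prediction. The sketch below refits the same model so it runs on its own; the weights in new_cars are hypothetical values chosen for illustration.

```r
# Refit the model so this example is self-contained
mod <- lm(mpg ~ wt, data = mtcars)

# Estimated intercept and slope
coef(mod)

# Predicted mpg for three hypothetical car weights (in 1000 lbs)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))
preds <- predict(mod, newdata = new_cars)
preds

# The same predictions with 95% confidence intervals for the mean response
predict(mod, newdata = new_cars, interval = "confidence", level = 0.95)
```

For a 3000 lb car (wt = 3), the prediction is 37.2851 - 5.3445 × 3, or about 21.25 mpg.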

Fitting the Model in R (2/3)

The summary() output shows:

  • Estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\)

  • Standard errors and t-tests for each coefficient

  • The R-squared value (goodness of fit)

  • The p-value for the overall regression
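Each of these quantities can also be pulled out of the summary object programmatically. A minimal sketch, in which the names s, f, and p_overall are chosen here for illustration:

```r
mod <- lm(mpg ~ wt, data = mtcars)
s   <- summary(mod)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
s$coefficients

# Goodness of fit
s$r.squared
s$adj.r.squared

# Overall F-test: the p-value is not stored directly, so compute it
f <- s$fstatistic  # named vector: value, numdf, dendf
p_overall <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
p_overall
```

With a single predictor, the overall F-test and the t-test for the slope give the same p-value.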

Fitting the Model in R (3/3)

We can also explore the data interactively in 3D using plotly.

library(plotly)  # also provides the %>% pipe
plot_ly(
  data = mtcars, x = ~wt, y = ~mpg, z = ~hp, type = "scatter3d", mode = "markers",
  marker = list(size = 5, color = ~hp, colorscale = "Viridis")
) %>% layout(
  title = "3D View: Weight, MPG, and Horsepower (mtcars)",
  scene = list(
    xaxis = list(title = "Weight (1000 lbs)"), yaxis = list(title = "MPG"),
    zaxis = list(title = "Horsepower")
  )
)

5. Inference: Confidence Interval and Hypothesis Test (1/2)

In practice, we often want to know whether the slope \(\beta_1\) is significantly different from 0.

We test the hypotheses:

\[ H_0 : \beta_1 = 0 \quad \text{vs.} \quad H_1 : \beta_1 \neq 0. \]

The test statistic is

\[ t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}, \]

which, under \(H_0\), follows (approximately) a \(t\)-distribution with \(n - 2\) degrees of freedom if the model assumptions hold.
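As an illustration, the test statistic and its two-sided p-value can be reproduced by hand for the mtcars model. This is a sketch that reads the slope estimate and its standard error from the summary table; the variable names are chosen here for illustration.

```r
mod   <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(mod)$coefficients

b1_hat <- coefs["wt", "Estimate"]
se_b1  <- coefs["wt", "Std. Error"]
df     <- df.residual(mod)  # n - 2 = 30 for mtcars

# t statistic for H0: beta_1 = 0
t_stat <- (b1_hat - 0) / se_b1

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

c(t = t_stat, p = p_value)
```

These values should match the "t value" and "Pr(>|t|)" columns for wt in the summary() output.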

Inference: Confidence Interval and Hypothesis Test (2/2)

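A small p-value (for example, \(p < 0.05\)) is evidence against \(H_0\) and suggests a statistically significant linear relationship between \(X\) and \(Y\). Significance alone does not say how large the effect is; for that, look at the slope estimate and its confidence interval.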

We can also compute a confidence interval for the slope:

\[ \hat{\beta}_1 \pm t_{1-\alpha/2,\,n-2} \times SE(\hat{\beta}_1), \]

where \(t_{1-\alpha/2,\,n-2}\) is the critical value from the \(t\)-distribution with \(n - 2\) degrees of freedom.
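The interval can be computed directly from this formula or with R's confint(). A minimal sketch for the mtcars model, assuming a 95% level (alpha = 0.05):

```r
mod   <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(mod)$coefficients
alpha <- 0.05  # assumed level for illustration: 95% confidence

b1_hat <- coefs["wt", "Estimate"]
se_b1  <- coefs["wt", "Std. Error"]

# Critical value t_{1 - alpha/2, n - 2}
t_crit <- qt(1 - alpha / 2, df = df.residual(mod))

# Interval from the formula
ci_manual <- c(lower = b1_hat - t_crit * se_b1,
               upper = b1_hat + t_crit * se_b1)
ci_manual

# Built-in equivalent
confint(mod, "wt", level = 1 - alpha)
```

Both approaches should give the same interval. Because it lies entirely below zero, it is consistent with rejecting \(H_0\) at the 5% level.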

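Note: SE (standard error) measures how precise the estimated slope is:
a small SE means a precise estimate; a large SE means an uncertain estimate.

\[ SE(\hat{\beta}_1)= \sqrt{ \frac{\hat{\sigma}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} }, \qquad \hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2, \]

where \(\hat{\sigma}\) is the residual standard error, which estimates the unknown error standard deviation \(\sigma\).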
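In practice the error variance is unknown, so it is estimated from the residuals as the residual sum of squares divided by \(n - 2\). The sketch below computes the standard error of the slope this way for the mtcars model; the names sigma_hat and se_b1 are chosen here for illustration.

```r
mod <- lm(mpg ~ wt, data = mtcars)
x   <- mtcars$wt
n   <- nrow(mtcars)

# Residual standard error: sqrt(RSS / (n - 2))
sigma_hat <- sqrt(sum(residuals(mod)^2) / (n - 2))

# Standard error of the slope
se_b1 <- sigma_hat / sqrt(sum((x - mean(x))^2))

c(sigma_hat = sigma_hat, se_b1 = se_b1)
```

The two values should match the "Residual standard error" and the "Std. Error" for wt reported by summary(mod).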

6. References

Gauss, Carl Friedrich. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium. Hamburg, 1809. https://doi.org/10.3931/e-rara-4022

James, Gareth, et al. An Introduction to Statistical Learning. Springer, 2021. https://hastie.su.domains/ISLR2/ISLRv2_corrected_June_2023.pdf.download.html

Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer, 2016. https://ggplot2.tidyverse.org/

R Core Team. R: A Language and Environment for Statistical Computing. 2024. https://www.r-project.org/help.html