Simple Linear Regression

Overview

  • Simple linear regression models the relationship between two quantitative variables with a straight line
  • One variable (the predictor) is used to explain or predict the other (the response)
  • The method is widely used across fields, including science and engineering

Real-world idea

  • Suppose we want to understand how studying affects exam scores
  • We collect data on study hours and exam performance
  • We then try to model this relationship mathematically

Example dataset (code)

hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(50, 55, 65, 70, 75, 78, 85, 90)

data <- data.frame(hours = hours, scores = scores)

Example dataset (values, as (hours, score) pairs)

  • (1, 50)
  • (2, 55)
  • (3, 65)
  • (4, 70)
  • (5, 75)
  • (6, 78)
  • (7, 85)
  • (8, 90)

Visualizing the data

Interactive visualization

Observations

  • The data shows a clear upward trend
  • As study hours increase, exam scores also increase
  • This suggests a positive linear relationship
  • Linear regression helps quantify this pattern

The regression model

In simple linear regression, we model the relationship as:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

  • \(Y\) is the response (exam score)
  • \(X\) is the predictor (study hours)
  • \(\beta_0\) is the intercept
  • \(\beta_1\) is the slope
  • \(\epsilon\) represents random error

Understanding the equation

  • The intercept \(\beta_0\) is the predicted value of \(Y\) when \(X = 0\)
  • The slope \(\beta_1\) tells us how much \(Y\) changes, on average, for each one-unit increase in \(X\)
  • The goal is to find the best-fitting line through the data

Fitting the model in R

model <- lm(scores ~ hours, data = data)
summary(model)

  • The lm() function fits a linear regression model
  • It estimates the intercept and slope of the least squares line relating study hours to exam scores
  • The detailed output is hidden to keep the slide clean
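Since the full summary is not shown, a compact way to inspect the fit is to pull out only the estimates. This is a minimal sketch that rebuilds the example data; the names estimates, b0, b1 and r_squared are illustrative assumptions.

```r
# Rebuild the example data so the block runs on its own
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(50, 55, 65, 70, 75, 78, 85, 90)
data <- data.frame(hours = hours, scores = scores)

model <- lm(scores ~ hours, data = data)

# coef() returns a named vector with entries "(Intercept)" and "hours"
estimates <- coef(model)
b0 <- estimates[["(Intercept)"]]  # estimated intercept
b1 <- estimates[["hours"]]        # estimated slope
print(round(c(intercept = b0, slope = b1), 2))

# Proportion of the variance in scores explained by the fitted line
r_squared <- summary(model)$r.squared
print(round(r_squared, 3))
```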

Fitted regression line
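A plot of this kind can be sketched in base R as follows. This is an assumption about what such a figure might look like, not a reproduction of the original slide; the axis labels, title and colors are illustrative choices.

```r
# Rebuild the example data and fit the model
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(50, 55, 65, 70, 75, 78, 85, 90)
model <- lm(scores ~ hours)

# Scatter plot of the observed data
plot(hours, scores,
     xlab = "Study hours", ylab = "Exam score",
     main = "Exam score vs. study hours", pch = 19)

# abline() accepts a fitted lm object and draws its regression line
abline(model, col = "blue", lwd = 2)
```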

Residual concept

A residual is the difference between an observed value \(y_i\) and the model's fitted value \(\hat{y}_i\):

\[ e_i = y_i - \hat{y}_i \]

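  • If the residuals are small relative to the spread of the scores, the model fits well
  • Residuals scattered randomly around zero, with no visible pattern, suggest that a straight line is appropriate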
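The residual formula can be checked directly in R. The sketch below computes the residuals by hand from the fitted values; the names fitted_scores and res are illustrative assumptions.

```r
# Rebuild the example data and fit the model
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(50, 55, 65, 70, 75, 78, 85, 90)
model <- lm(scores ~ hours)

# Fitted values (y-hat) for each observation
fitted_scores <- fitted(model)

# Residuals by hand: observed minus fitted, e_i = y_i - y_hat_i
res <- scores - fitted_scores
print(round(res, 2))

# For a model with an intercept, least squares residuals sum to zero
# (up to floating-point rounding)
print(sum(res))
```

The same values are returned by residuals(model).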

Residual plot
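A residual plot of this kind can be sketched in base R by plotting residuals against fitted values. The labels and styling here are illustrative assumptions rather than the original figure.

```r
# Rebuild the example data and fit the model
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(50, 55, 65, 70, 75, 78, 85, 90)
model <- lm(scores ~ hours)

# Residuals against fitted values; a good linear fit shows no clear pattern
plot(fitted(model), residuals(model),
     xlab = "Fitted score", ylab = "Residual",
     main = "Residuals vs. fitted values", pch = 19)

# Horizontal reference line at zero
abline(h = 0, lty = 2)
```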

Interpreting the slope

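  • The slope is the estimated average change in score per additional hour studied (about 5.6 points per hour for this dataset)
  • A positive slope means that more study time is associated with higher scores
  • The fitted line can be used to predict scores for new values of study hours, most reliably within the observed range of 1 to 8 hours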
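Prediction for new values can be sketched with predict(). The study hours in new_data are hypothetical values chosen inside the observed range; they are not part of the original dataset.

```r
# Rebuild the example data and fit the model
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(50, 55, 65, 70, 75, 78, 85, 90)
data <- data.frame(hours = hours, scores = scores)
model <- lm(scores ~ hours, data = data)

# Hypothetical new study hours; the column name must match the predictor
new_data <- data.frame(hours = c(4.5, 6.5))

# Predicted exam scores from the fitted line
predicted <- predict(model, newdata = new_data)
print(round(predicted, 1))
```

Predictions far outside the observed hours would extrapolate the line beyond the data and deserve more caution.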

Least squares method

The least squares line is the one whose intercept and slope minimize the sum of squared residuals:

\[ \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

  • This criterion defines what "best fit" means here: no other straight line gives a smaller sum for this data
  • Squaring penalizes large errors more heavily than small ones
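For simple linear regression the minimizing slope and intercept have a closed form: the slope is \(\sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and the intercept is \(\bar{y} - \hat{\beta}_1 \bar{x}\). The sketch below computes both by hand and compares the squared error with a slightly different line; the helper sse() and the variable names are illustrative assumptions.

```r
# Example data
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
scores <- c(50, 55, 65, 70, 75, 78, 85, 90)

x_bar <- mean(hours)
y_bar <- mean(scores)

# Closed-form least squares estimates
b1 <- sum((hours - x_bar) * (scores - y_bar)) / sum((hours - x_bar)^2)
b0 <- y_bar - b1 * x_bar
print(c(intercept = b0, slope = b1))

# Sum of squared errors for any candidate line (hypothetical helper)
sse <- function(intercept, slope) {
  sum((scores - (intercept + slope * hours))^2)
}

# The least squares line has a smaller SSE than a perturbed line
print(sse(b0, b1))
print(sse(b0, b1 + 0.5))
```

The hand-computed estimates should agree with coef(lm(scores ~ hours)).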

Key takeaways

  • Linear regression is simple but powerful
  • It models relationships and supports prediction
  • It is easy to implement using R
  • Visualization helps interpret results clearly

End

Thank you for viewing this presentation.