2025-01-28

Introduction

What is Simple Linear Regression?

  • A statistical method to model the relationship between two variables.
  • Dependent variable (\(y\)): The outcome or response variable.
  • Independent variable (\(x\)): The predictor or explanatory variable.
  • Goal: Predict \(y\) based on \(x\).

The Regression Equation

Equation of Simple Linear Regression

\[ y = \beta_0 + \beta_1x + \epsilon \]

  • \(y\): Observed value of the dependent variable.
  • \(\beta_0\): Intercept (expected value of \(y\) when \(x = 0\)).
  • \(\beta_1\): Slope (expected change in \(y\) for a one-unit increase in \(x\)).
  • \(\epsilon\): Error term (captures variability not explained by \(x\)).
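
To separate the model from the line fitted to a sample, a common notation (an assumption here, since the slides do not use hats) marks estimated quantities with a hat:

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x, \qquad e = y - \hat{y} \]

Here \(\hat{y}\) is the predicted value for a given \(x\), and the residual \(e\) is the sample counterpart of the error term \(\epsilon\).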

Key Assumptions

Assumptions of Simple Linear Regression

  1. Linearity: The relationship between \(x\) and \(y\) is linear.
  2. Independence: Observations are independent.
  3. Homoscedasticity: Variance of residuals is constant across all values of \(x\).
  4. Normality: Residuals are normally distributed.
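
The assumptions above can be checked informally once a model has been fitted. The following is a minimal sketch in base R, using simulated data; the data frame `study` and its columns `hours` and `scores` are hypothetical names, not taken from the slides.

```r
# Hypothetical data for illustration only (not from the slides).
set.seed(42)
hours  <- seq(1, 10, length.out = 30)
scores <- 40 + 5 * hours + rnorm(30, mean = 0, sd = 3)
study  <- data.frame(hours = hours, scores = scores)

fit <- lm(scores ~ hours, data = study)
res <- resid(fit)

# Linearity and homoscedasticity: residuals vs. fitted values should show
# no curve and a roughly constant vertical spread.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points on a normal Q-Q plot should fall close to the line.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test as a rough numerical check of normality;
# a small p-value suggests the residuals depart from normality.
normality_check <- shapiro.test(res)
print(normality_check$p.value)
```

Independence is usually judged from how the data were collected rather than from a plot, so it is not checked here.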

Steps to Perform Regression

Steps to Conduct Simple Linear Regression

  1. Collect Data: Paired observations of \(x\) and \(y\).
  2. Visualize Data: Create a scatterplot to check linearity.
  3. Fit the Model: Estimate \(\beta_0\) and \(\beta_1\).
  4. Evaluate the Model:
    • R-squared (\(R^2\)): Proportion of variance in \(y\) explained by \(x\).
    • Residual analysis.
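
The fitting step is most often done by ordinary least squares. The slides do not name the method, so treat this as the usual default rather than a stated choice. With sample means \(\bar{x}\) and \(\bar{y}\), and fitted values \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\), the estimates and \(R^2\) are:

\[ \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]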

Example

Example Data: Hours Studied vs. Test Scores

Hours Studied (\(x\))    Test Score (\(y\))
2                        50
4                        60
6                        70
8                        80

Regression Line:

\[ y = 40 + 5x \]

  • \(40\): Intercept (test score when \(x = 0\)).
  • \(5\): Slope (increase in score for each additional hour studied).
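
The regression line for this example can be reproduced in R. This is a minimal sketch: the four data points come from the table, while the names `study`, `hours`, and `score` are assumptions made for illustration.

```r
# Data from the "Hours Studied vs. Test Scores" table.
study <- data.frame(
  hours = c(2, 4, 6, 8),
  score = c(50, 60, 70, 80)
)

# Fit the simple linear regression score = b0 + b1 * hours.
fit <- lm(score ~ hours, data = study)
print(coef(fit))

# Same estimates by hand from the least-squares formulas:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
b1 <- cov(study$hours, study$score) / var(study$hours)
b0 <- mean(study$score) - b1 * mean(study$hours)

# Predicted score for a hypothetical student who studies 5 hours.
pred_5 <- predict(fit, newdata = data.frame(hours = 5))
print(pred_5)
```

Because these four points lie exactly on a line, the fit returns an intercept of 40 and a slope of 5 with residuals of zero. Real data will almost always show some scatter around the line.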

Model Evaluation

Evaluating the Regression Model

  • R-squared (\(R^2\)): Proportion of variance in \(y\) explained by \(x\); it ranges from 0 to 1, and values near 1 indicate a closer fit.
  • Residuals: Differences between observed and predicted \(y\) values.

Visualizations:

  • Residual Plot: Check for patterns in residuals.
  • Scatterplot: Show the data points and regression line.
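
Both evaluation measures can be computed by hand and compared with what R reports. The sketch below uses made-up scores with some scatter; the data and the names `study`, `hours`, and `score` are hypothetical, not from the slides.

```r
# Hypothetical scores with some scatter (not from the slides).
study <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(47, 49, 57, 58, 66, 71, 73, 82)
)

fit <- lm(score ~ hours, data = study)

# Residuals: observed minus predicted values.
res <- study$score - fitted(fit)

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res <- sum(res^2)
ss_tot <- sum((study$score - mean(study$score))^2)
r_squared <- 1 - ss_res / ss_tot

print(r_squared)
print(summary(fit)$r.squared)  # should agree with the manual value
```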

ggplot Regression Line

Scatterplot with Regression Line

[Figure: scatterplot of the data with the fitted regression line, drawn with `geom_smooth()` using formula `y ~ x`]
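
A plot of this kind can be sketched with ggplot2. The code that produced the original figure is not shown, so the data, the names `study`, `hours`, and `score`, and the axis labels below are all assumptions.

```r
library(ggplot2)

# Hypothetical data for illustration only (not from the slides).
study <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(47, 49, 57, 58, 66, 71, 73, 82)
)

# Scatterplot with the least-squares line. Passing formula = y ~ x
# explicitly avoids the informational message from geom_smooth().
p <- ggplot(study, aes(x = hours, y = score)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Hours studied", y = "Test score")

print(p)
```

Setting `se = FALSE` hides the confidence band; leaving it at the default draws a shaded band around the line.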

ggplot Distribution

Residuals Distribution
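
One way to look at the distribution of residuals is a histogram. Whether the original slide used a histogram or a density plot is unknown, so this is only a sketch with hypothetical data and names (`study`, `hours`, `score`, `res_df`).

```r
library(ggplot2)

# Hypothetical data for illustration only (not from the slides).
study <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(47, 49, 57, 58, 66, 71, 73, 82)
)

fit <- lm(score ~ hours, data = study)
res_df <- data.frame(residual = resid(fit))

# Histogram of residuals; a roughly symmetric, bell-like shape centered
# near zero is consistent with the normality assumption.
p_res <- ggplot(res_df, aes(x = residual)) +
  geom_histogram(bins = 5) +
  labs(x = "Residual", y = "Count")

print(p_res)
```

With only a handful of observations a histogram is a coarse guide, so it is best read alongside a residual plot.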

Code for Linear Regression Model

## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##      1      2      3      4 
##  2.482 -2.139 -3.169  2.826 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   40.816      4.645   8.788   0.0127 *
## x              5.407      0.848   6.376   0.0237 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.792 on 2 degrees of freedom
## Multiple R-squared:  0.9531, Adjusted R-squared:  0.9297 
## F-statistic: 40.65 on 1 and 2 DF,  p-value: 0.02373
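
Several numbers in this output can be cross-checked by hand, which helps when reading it. Each t value is the estimate divided by its standard error, and in a model with one predictor the F-statistic equals the squared t value of the slope:

\[ t_{\text{slope}} = \frac{5.407}{0.848} \approx 6.376, \qquad F = t_{\text{slope}}^2 \approx 6.376^2 \approx 40.65 \]

The adjusted \(R^2\) follows from \(R^2\) with \(n = 4\) observations (2 residual degrees of freedom plus 2 estimated coefficients), up to rounding of the printed values:

\[ R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - 2} = 1 - (1 - 0.9531) \cdot \frac{3}{2} \approx 0.9297 \]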

Summary

  • Simple linear regression models the relationship between two variables using a straight line.
  • The regression equation predicts \(y\) based on \(x\).

Key Takeaways

  • Ensure assumptions are met for reliable results.
  • Evaluate model performance with R-squared and residual analysis.

Thank You! - Questions? Feedback?