2025-11-09

Introduction to Simple Linear Regression

Simple Linear Regression is a statistical method used to model the relationship between two continuous variables:

  • Dependent variable (Y): The outcome we want to predict
  • Independent variable (X): The predictor used to explain Y

Key Applications:

  • Predicting sales based on advertising spend
  • Estimating student performance based on study hours
  • Forecasting temperature based on elevation

The Linear Regression Model

The mathematical model for simple linear regression is:

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

where:

  • \(Y_i\) = observed value of the dependent variable for observation \(i\)
  • \(X_i\) = observed value of the independent variable for observation \(i\)
  • \(\beta_0\) = y-intercept (population parameter)
  • \(\beta_1\) = slope (population parameter)
  • \(\epsilon_i\) = random error term, assumed independent with \(\epsilon_i \sim N(0, \sigma^2)\)

Estimating the Regression Line

The Ordinary Least Squares (OLS) method estimates the regression coefficients by minimizing the sum of squared residuals (SSE):

\[\text{SSE} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2\]

The estimated regression equation is:

\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\]

where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the least squares estimates.
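
Minimizing the SSE has a well-known closed-form solution:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]

The sketch below computes these estimates by hand and compares them with `lm()`. The data are simulated with an arbitrary seed, which is an assumption added here for reproducibility, so the numbers will not match the results reported elsewhere in these slides.

```r
# Simulated data; the seed is an assumption and is not from the slides
set.seed(1)
x <- runif(50, 1, 10)
y <- 50 + 4.5 * x + rnorm(50, 0, 8)

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
manual <- c(intercept = b0, slope = b1)

# The same estimates from lm(), for comparison
fit <- lm(y ~ x)
print(manual)
print(coef(fit))
```

The two printed vectors should agree up to floating-point error.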

Example: Study Hours vs Exam Scores

Let’s examine the relationship between hours studied and exam scores for 50 simulated students.

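# Simulate 50 students: hours drawn uniformly from 1 to 10,
# scores from a true line (intercept 50, slope 4.5) plus normal noise (SD 8).
# No seed is set, so each run draws different data.
hours_studied <- runif(50, 1, 10)
exam_scores <- 50 + 4.5 * hours_studied + rnorm(50, 0, 8)
study_data <- data.frame(hours = hours_studied, score = exam_scores)

# Fit the simple linear regression of score on hours
model <- lm(score ~ hours, data = study_data)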

Summary Statistics:

  • Number of observations: 50
  • Mean study hours: 5.68
  • Mean exam score: 76.01
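
Summary statistics like these can be computed directly from the data frame. This is a minimal sketch that rebuilds `study_data` with an assumed seed, so the means it prints will differ from the values listed above.

```r
# Rebuild the example data; the seed is an assumption, not from the slides
set.seed(1)
hours_studied <- runif(50, 1, 10)
exam_scores <- 50 + 4.5 * hours_studied + rnorm(50, 0, 8)
study_data <- data.frame(hours = hours_studied, score = exam_scores)

# Sample size and means of both variables
n_obs <- nrow(study_data)
mean_hours <- mean(study_data$hours)
mean_score <- mean(study_data$score)

cat("Number of observations:", n_obs, "\n")
cat("Mean study hours:", round(mean_hours, 2), "\n")
cat("Mean exam score:", round(mean_score, 2), "\n")
```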

Visualization: Scatter Plot with Regression Line


R Code: Creating the Regression Plot

library(ggplot2)

# Reuse study_data from the example above; regenerating it here without
# a seed would draw a new sample that no longer matches the fitted model.

# Create scatter plot with regression line
ggplot(study_data, aes(x = hours, y = score)) +
  geom_point(color = "#8C1D40", size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, 
              color = "#FFC627", fill = "#FFC627", alpha = 0.2) +
  labs(title = "Study Hours vs Exam Scores",
       x = "Hours Studied", y = "Exam Score") +
  theme_minimal(base_size = 14)

Regression Results

Estimated Regression Equation:

\[\hat{Y} = 48.522 + 4.839 X\]

Interpretation:

  • Intercept (\(\hat{\beta}_0 = 48.522\)): Predicted exam score at 0 hours of study. This is an extrapolation, since the simulated hours range from 1 to 10
  • Slope (\(\hat{\beta}_1 = 4.839\)): Each additional hour studied is associated with an average increase of 4.839 points in the predicted exam score
  • R² = 0.748: 74.8% of the variance in exam scores is explained by study hours
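
The quantities interpreted above can be pulled from a fitted `lm` object. The sketch below refits the model on data simulated with an assumed seed, so its estimates will differ from 48.522, 4.839, and 0.748. The 6-hour student is a hypothetical input chosen for illustration.

```r
# Rebuild the example data; the seed is an assumption, not from the slides
set.seed(1)
study_data <- data.frame(hours = runif(50, 1, 10))
study_data$score <- 50 + 4.5 * study_data$hours + rnorm(50, 0, 8)
model <- lm(score ~ hours, data = study_data)

# Estimated intercept and slope (named "(Intercept)" and "hours")
est <- coef(model)

# R-squared from the model summary
r2 <- summary(model)$r.squared

# In simple linear regression, R-squared equals the squared correlation
r2_check <- cor(study_data$hours, study_data$score)^2

# Predicted score for a hypothetical student who studies 6 hours
pred6 <- predict(model, newdata = data.frame(hours = 6))

print(est)
print(c(r_squared = r2, squared_correlation = r2_check))
print(pred6)
```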

Residual Analysis

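Residuals are the differences between observed and fitted values, \(e_i = Y_i - \hat{Y}_i\). A good residual plot shows random scatter around zero, with no curvature or funnel shape.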
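
One way to draw such a plot is sketched below with base R graphics. The data are simulated with an assumed seed, and the styling is illustrative rather than the deck's own plot.

```r
# Rebuild the example data; the seed is an assumption, not from the slides
set.seed(1)
study_data <- data.frame(hours = runif(50, 1, 10))
study_data$score <- 50 + 4.5 * study_data$hours + rnorm(50, 0, 8)
model <- lm(score ~ hours, data = study_data)

# Residuals and fitted values from the model
res <- resid(model)
fit_vals <- fitted(model)

# Residuals vs fitted values, with a horizontal reference line at zero
plot(fit_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
```

With an intercept in the model, OLS residuals sum to zero and are uncorrelated with the predictor by construction, so the plot is used to look for shape (curvature, changing spread) rather than for an overall shift.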

Assumptions of Linear Regression

For valid inference, we assume:

  1. Linearity: The relationship between X and Y is linear
  2. Independence: Observations are independent of each other
  3. Homoscedasticity: Constant variance of errors (\(\text{Var}(\epsilon_i) = \sigma^2\))
  4. Normality: Errors are normally distributed (\(\epsilon_i \sim N(0, \sigma^2)\))

Checking Assumptions:

  • Use residual plots to check linearity and homoscedasticity
  • Use Q-Q plots to check normality
  • Consider the data collection process for independence
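
A normal Q-Q plot of the residuals can be sketched with base R as follows. The data are simulated with an assumed seed, and using standardized residuals is a common convention chosen here, not something the slides specify.

```r
# Rebuild the example data; the seed is an assumption, not from the slides
set.seed(1)
study_data <- data.frame(hours = runif(50, 1, 10))
study_data$score <- 50 + 4.5 * study_data$hours + rnorm(50, 0, 8)
model <- lm(score ~ hours, data = study_data)

# Standardized residuals put the points on a common scale
std_res <- rstandard(model)

# Points close to the reference line are consistent with normal errors
qqnorm(std_res, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(std_res, lty = 2)
```

Systematic departures from the line, such as heavy tails or an S-shape, suggest the normality assumption may not hold.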

Key Takeaways

Simple Linear Regression Summary:

  • Powerful tool for understanding relationships between two variables
  • Under the linearity, independence, and constant-variance assumptions, OLS gives the best linear unbiased estimates (Gauss-Markov theorem)
  • R² measures the proportion of variance explained by the model
  • Always check assumptions before drawing conclusions
  • Residual analysis is crucial for model validation