What is Simple Linear Regression?

Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous variables:

  • Predictor variable (X): The independent variable
  • Response variable (Y): The dependent variable

The goal is to find the best-fitting straight line through the data points.

Real-world applications: Sales forecasting, risk assessment, trend analysis

The Mathematical Foundation

The simple linear regression model is expressed as:

\[Y = \beta_0 + \beta_1 X + \epsilon\]

Where:

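  • \(\beta_0\) is the y-intercept (expected value of Y when X = 0)
  • \(\beta_1\) is the slope (expected change in Y for a one-unit increase in X)
  • \(\epsilon\) is the error term (random variation not explained by X)

The estimated regression line, where \(b_0\) and \(b_1\) are the sample estimates of \(\beta_0\) and \(\beta_1\): \[\hat{Y} = b_0 + b_1 X\]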

Estimating Parameters: Least Squares Method

The least squares method chooses \(b_0\) and \(b_1\) to minimize the sum of squared residuals (SSE):

\[SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\]

The formulas for the coefficients are:

\[b_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

\[b_0 = \bar{Y} - b_1\bar{X}\]
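
As a quick check, the two coefficient formulas can be evaluated by hand and compared with `lm()`. This is a minimal sketch on a small made-up data set (the vectors `x` and `y` are hypothetical, not the study-hours data used later):

```r
# Hypothetical toy data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared X deviations
b1 <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the line through the point (x_bar, y_bar)
b0 <- y_bar - b1 * x_bar

# Compare the hand-computed values with lm()
fit <- lm(y ~ x)
c(b0 = b0, b1 = b1)
coef(fit)
```

For this toy data both approaches give an intercept of 2.2 and a slope of 0.6.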

Example: Study Hours vs. Test Scores

R Code for Linear Regression

# Build the data frame. study_hours and test_scores are assumed to be
# numeric vectors of equal length defined earlier.
df <- data.frame(
  Hours = study_hours,
  Score = test_scores
)

# Fit linear regression model
model <- lm(Score ~ Hours, data = df)

# Display coefficient estimates and their standard errors
summary(model)$coefficients[, 1:2]
##              Estimate Std. Error
## (Intercept) 59.849751  1.7786319
## Hours        3.478075  0.2563344

# Predict scores for 5, 7, and 9 hours of study
new_hours <- data.frame(Hours = c(5, 7, 9))
predictions <- predict(model, newdata = new_hours)
round(predictions, 2)
##     1     2     3 
## 77.24 84.20 91.15
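
The predictions are simply the estimated line evaluated at the new values of Hours. A small sketch that plugs the reported coefficient estimates into \(\hat{Y} = b_0 + b_1 X\) directly (the variable names here are illustrative, and the coefficients are copied from the output above rather than taken from a fitted model):

```r
# Coefficient estimates copied from the model output above
b0 <- 59.849751  # (Intercept)
b1 <- 3.478075   # Hours

# Evaluate the estimated regression line at 5, 7, and 9 hours
hours <- c(5, 7, 9)
manual_predictions <- b0 + b1 * hours
round(manual_predictions, 2)
## [1] 77.24 84.20 91.15
```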

Model Diagnostics: Residual Analysis

Good residual plots show:

  • Random scatter around zero
  • No patterns or trends
  • Constant variance
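
One way to draw such a plot in base R is sketched below. The data are simulated purely for illustration (the seed, sample size, and true coefficients are arbitrary assumptions), so the model assumptions hold by construction:

```r
# Simulated data, for illustration only
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(50, sd = 1)

fit <- lm(y ~ x)

# Residuals against fitted values; a healthy plot shows an
# unstructured band of points centered on the dashed zero line
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)
```

A curve in this plot would suggest a non-linear relationship, and a funnel shape would suggest non-constant variance.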

Interactive 3D Visualization: Multiple Regression

Measuring Model Performance

Coefficient of Determination (\(R^2\))

\[R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}\]

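  • Represents the proportion of the variance in Y explained by the model
  • Ranges from 0 to 1 for a least squares fit with an intercept; higher values mean more variance explained, but a high \(R^2\) alone does not show that the model is appropriate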
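
The \(R^2\) formula can be checked by hand against the value reported by `summary()`. A minimal sketch on hypothetical toy data (the vectors `x` and `y` are made up for illustration):

```r
# Hypothetical toy data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)

sse <- sum((y - fitted(fit))^2)  # unexplained variation
sst <- sum((y - mean(y))^2)      # total variation around the mean
r2  <- 1 - sse / sst

# Hand-computed value next to the one reported by summary()
c(by_hand = r2, from_summary = summary(fit)$r.squared)
```

Here SSE = 2.4 and SST = 6, so \(R^2\) = 0.6. In simple linear regression this also equals the squared correlation, `cor(x, y)^2`.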

Standard Error of Estimate

\[S_e = \sqrt{\frac{\sum(Y_i - \hat{Y}_i)^2}{n-2}}\]

  • Estimates the standard deviation of the errors: the typical distance of data points from the regression line, in the units of Y
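
A short sketch of the same calculation in R, again on hypothetical toy data. `summary()` reports this quantity as the "residual standard error", stored in its `sigma` element:

```r
# Hypothetical toy data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)
n   <- length(y)

# Sum of squared residuals divided by n - 2 degrees of freedom
# (two parameters, b0 and b1, were estimated)
sse <- sum(resid(fit)^2)
s_e <- sqrt(sse / (n - 2))

c(by_hand = s_e, from_summary = summary(fit)$sigma)
```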

Hypothesis Testing in Regression

Testing whether the slope differs significantly from zero:

Null Hypothesis: \(H_0: \beta_1 = 0\) (no linear relationship)
Alternative Hypothesis: \(H_a: \beta_1 \neq 0\) (linear relationship exists)

Test statistic: \[t = \frac{b_1}{SE(b_1)}\]

Under \(H_0\), this statistic follows a t-distribution with \(n-2\) degrees of freedom.

Decision rule: Reject \(H_0\) if \(|t| > t_{\alpha/2, n-2}\) or if p-value < \(\alpha\)
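
The test can be reproduced step by step and compared with the `summary()` output. This sketch uses hypothetical toy data and assumes \(\alpha = 0.05\). It also relies on the standard formula \(SE(b_1) = S_e / \sqrt{\sum(X_i - \bar{X})^2}\), which is not derived in these slides:

```r
# Hypothetical toy data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)
n   <- length(y)
b1  <- coef(fit)[["x"]]

# Standard error of the slope
s_e   <- sqrt(sum(resid(fit)^2) / (n - 2))
se_b1 <- s_e / sqrt(sum((x - mean(x))^2))

# Test statistic, two-sided p-value, and critical value (alpha = 0.05)
t_stat  <- b1 / se_b1
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
t_crit  <- qt(1 - 0.05 / 2, df = n - 2)

c(t = t_stat, p = p_value, critical = t_crit)

# The same t value and p-value as reported by summary()
summary(fit)$coefficients["x", c("t value", "Pr(>|t|)")]

# Decision rule: TRUE means reject H0
abs(t_stat) > t_crit
```

With only five points, this toy example does not reject \(H_0\) at the 5% level, even though the fitted slope is positive.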

Key Assumptions and Limitations

Assumptions:

  1. Linearity: Relationship between X and Y is linear
  2. Independence: Observations are independent
  3. Normality: Errors are normally distributed (assessed through the residuals)
  4. Homoscedasticity: Errors have constant variance across all values of X
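
The normality assumption is commonly checked with a normal Q-Q plot of the residuals, optionally backed by a formal test. The sketch below is one possible approach on simulated data (seed and coefficients are arbitrary assumptions); the Shapiro-Wilk test is a choice made here, not something these slides prescribe:

```r
# Simulated data, for illustration only
set.seed(42)
x <- runif(40, min = 0, max = 10)
y <- 1 + 0.5 * x + rnorm(40, sd = 1)

fit <- lm(y ~ x)
res <- resid(fit)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value
```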

Limitations:

  • Only captures linear relationships
  • Sensitive to outliers
  • Cannot establish causation on its own
  • Extrapolation beyond data range is risky
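
The first limitation is easy to demonstrate. In this constructed example (hypothetical data chosen for illustration), Y is completely determined by X, yet the fitted line is flat because the relationship is quadratic rather than linear:

```r
# Constructed example: a perfect but non-linear relationship
x <- -5:5
y <- x^2

fit <- lm(y ~ x)

# The slope is zero (up to floating-point error): the line misses the pattern
coef(fit)

# R-squared is also zero (up to floating-point error),
# although Y depends entirely on X
summary(fit)$r.squared
```

A scatter plot of the data or a residual plot would reveal the curvature immediately, which is why diagnostics matter before interpreting the coefficients.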

Summary

Simple linear regression is a powerful tool for:

  • Predicting outcomes based on a single predictor
  • Understanding relationships between variables
  • Making data-driven decisions

Remember: “All models are wrong, but some are useful” (George Box)

The key is knowing when and how to apply regression appropriately!

Applications in Data Science:

  • Customer behavior prediction
  • Financial forecasting
  • Quality control in manufacturing
  • Healthcare outcome analysis