2025-11-07

Introduction to Linear Regression

Simple linear regression models the relationship between two variables:

  • Response variable (Y): The outcome we want to predict
  • Predictor variable (X): The variable used to make predictions

Common Applications:

  • Predicting house prices from square footage
  • Estimating sales from advertising spend
  • Forecasting scores from study hours

The Mathematical Model

The simple linear regression model is:

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

Where:

  • \(Y_i\) = response for observation \(i\)
  • \(X_i\) = predictor value for observation \(i\)
  • \(\beta_0\) = intercept (expected value of Y when X = 0)
  • \(\beta_1\) = slope (expected change in Y per one-unit increase in X)
  • \(\epsilon_i \sim N(0, \sigma^2)\) = random error

Least Squares Estimation

We minimize the sum of squared residuals:

\[SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\]

Setting the partial derivatives of SSE with respect to \(\beta_0\) and \(\beta_1\) to zero gives the closed-form estimates:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]
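
These closed-form estimates are easy to verify numerically. A minimal sketch (the x and y values below are toy data, assumed purely for illustration) computes them directly and checks them against lm():

# Verify the closed-form estimates on small toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)        # hand-computed estimates
coef(lm(y ~ x))  # lm() should return the same values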

Example: Study Hours vs Exam Scores

# Create sample data: a known linear model (intercept 50, slope 4.5) plus noise
set.seed(123)
hours <- seq(1, 10, by = 0.5)
scores <- 50 + 4.5 * hours + rnorm(length(hours), mean = 0, sd = 5)
data <- data.frame(hours, scores)

# Fit the simple linear regression of scores on hours
model <- lm(scores ~ hours, data = data)

Scatter Plot with Regression Line
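
A minimal base-R sketch of this plot, using the data and model from the example above:

# Scatter plot of the data with the fitted least-squares line
plot(data$hours, data$scores,
     xlab = "Study hours", ylab = "Exam score",
     main = "Exam Scores vs Study Hours")
abline(model, col = "blue", lwd = 2)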

Residual Analysis
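
Two standard residual views, sketched in base R: residuals against fitted values (to check linearity and constant variance) and a normal Q-Q plot of the residuals (to check normality).

# Residuals vs fitted values, and a normal Q-Q plot of the residuals
par(mfrow = c(1, 2))
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
qqnorm(resid(model))
qqline(resid(model))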

Interactive 3D Visualization
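
One way to build an interactive 3D view of the fit, assuming the plotly package; the choice of hours, scores, and residuals as the three axes is illustrative, not taken from the original figure:

# Sketch of an interactive 3D scatter (plotly assumed; axes chosen for illustration)
library(plotly)
data$residual <- resid(model)
plot_ly(data, x = ~hours, y = ~scores, z = ~residual,
        type = "scatter3d", mode = "markers")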

Model Output and Code

# Model summary
summary(model)
## 
## Call:
## lm(formula = scores ~ hours, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3853  -3.2743  -0.1576   2.4458   8.3029 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51.3065     2.6142   19.63 4.07e-13 ***
## hours         4.4206     0.4255   10.39 8.82e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.079 on 17 degrees of freedom
## Multiple R-squared:  0.8639, Adjusted R-squared:  0.8559 
## F-statistic: 107.9 on 1 and 17 DF,  p-value: 8.816e-09

Model Interpretation

# Extract coefficients
coefs <- coef(model)
r2 <- summary(model)$r.squared

cat("Intercept:", round(coefs[1], 2), "\n")
## Intercept: 51.31
cat("Slope:", round(coefs[2], 2), "\n")
## Slope: 4.42
cat("R-squared:", round(r2, 4), "\n")
## R-squared: 0.8639
cat("\nInterpretation: ", round(r2 * 100, 1),
    "% of variance explained\n", sep = "")
## 
## Interpretation: 86.4% of variance explained
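
The slope says each additional study hour is associated with roughly 4.4 more exam points. The fitted model can also produce predictions with intervals; a sketch, using the model above, for the expected score after 7 hours of study:

# Predicted mean score at 7 study hours, with a 95% confidence interval
predict(model, newdata = data.frame(hours = 7), interval = "confidence")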

Key Assumptions

Four key assumptions must hold:

  1. Linearity: Relationship between X and Y is linear

  2. Independence: Observations are independent

  3. Homoscedasticity: Constant variance of errors

  4. Normality: Errors are normally distributed

Check these using: scatter plots, residual plots, Q-Q plots, and statistical tests.
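
In R, plot(model) produces four standard diagnostic plots covering linearity, constant variance, normality, and influential points; independence usually has to be judged from how the data were collected. A Shapiro-Wilk test adds a formal normality check. A sketch, using the model above:

# Built-in diagnostics: residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(model)

# Formal test of residual normality
shapiro.test(residuals(model))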

Hypothesis Testing

Test for a linear relationship:

\[H_0: \beta_1 = 0 \text{ (no relationship)}\] \[H_a: \beta_1 \neq 0 \text{ (relationship exists)}\]

Test statistic:

\[t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \sim t_{n-2}\]

Reject \(H_0\) when the p-value falls below the chosen significance level (conventionally 0.05). In the example above, \(p \approx 8.8 \times 10^{-9}\), so we reject \(H_0\) and conclude that study hours are linearly associated with exam scores.
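
The same test statistic can be recomputed by hand from the fitted model ("hours" is the predictor name from the example above):

# Recompute the slope's t statistic and two-sided p-value
se_b1  <- summary(model)$coefficients["hours", "Std. Error"]
t_stat <- unname(coef(model)["hours"] / se_b1)
p_val  <- 2 * pt(abs(t_stat), df = df.residual(model), lower.tail = FALSE)
c(t = t_stat, p = p_val)  # matches the "t value" and "Pr(>|t|)" in summary(model)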

Applications

Real-world uses:

  • Economics: GDP predictions
  • Healthcare: Disease risk modeling
  • Marketing: Sales forecasting
  • Engineering: Material testing
  • Climate: Temperature trends

Advantages:

  • Simple and interpretable
  • Fast computation
  • Foundation for advanced methods

Conclusions

Key Points:

✓ Linear regression models Y as a linear function of X

✓ Least squares gives the best linear unbiased estimates (under the model assumptions)

✓ Check assumptions before making inferences

✓ Widely applicable across many fields

✓ Forms basis for more complex models

Thank You!

Questions?

Statistical modeling with R