September 22, 2024

OVERVIEW

  • Predicts a response variable using multiple explanatory variables
  • Extends simple linear regression to include multiple predictors
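  • Models a (possible) linear relationship between the response and the explanatory variables
  • Finds the best-fitting linear surface (a line when there is one predictor) by minimizing the sum of squared errors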
  • Used for forecasting and data analysis across various disciplines

EQUATION

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon \]

  • \(y\) = response variable
  • \(\beta_0\) = intercept
  • \(\beta_1, \beta_2, \dots, \beta_n\) = regression coefficients
  • \(x_1, x_2, \dots, x_n\) = explanatory variables
  • \(\epsilon\) = error
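To make the equation concrete, here is a minimal sketch that fits it to R's built-in swiss data with two explanatory variables. The choice of Agriculture and Education as \(x_1\) and \(x_2\), and the names fit, b, y_hat and eps_hat, are assumptions made for illustration.

```r
# Fit y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + epsilon
# with y = Fertility, x_1 = Agriculture, x_2 = Education (illustrative choice).
fit <- lm(Fertility ~ Agriculture + Education, data = swiss)
b <- coef(fit)  # named vector: (Intercept), Agriculture, Education

# Rebuild the fitted values directly from the equation (without the error term).
y_hat <- b["(Intercept)"] +
  b["Agriculture"] * swiss$Agriculture +
  b["Education"] * swiss$Education

# The residuals are the sample estimates of the error term epsilon.
eps_hat <- swiss$Fertility - y_hat

print(b)
# Check that the hand-built values agree with lm()'s fitted values.
print(all.equal(unname(y_hat), unname(fitted(fit))))
```

Each coefficient is read as the expected change in the response for a one-unit change in that predictor, holding the other predictor fixed.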

REGRESSION COEFFICIENTS

Regression coefficients are calculated using the normal equation:

\[ \boldsymbol{\beta} = (X^T X)^{-1} X^T \boldsymbol{y} \]

Where:

  • \(\boldsymbol{\beta}\) = vector of regression coefficients
  • \(X\) = matrix of explanatory variables, with a leading column of ones for the intercept \(\beta_0\)
  • \(\boldsymbol{y}\) = vector of observed values for the response variable
  • \(X^T\) = transpose of the matrix \(X\)
  • \((X^T X)^{-1}\) = inverse of the matrix \(X^T X\)
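The normal equation can be checked by hand in a few lines. The sketch below assumes the built-in swiss data with Agriculture and Education as predictors (an illustrative choice) and compares the result with lm(). Note that lm() itself uses a QR decomposition rather than the normal equation, so the two should agree only up to rounding.

```r
# Design matrix: a leading column of ones (intercept) plus the predictors.
X <- cbind(Intercept   = 1,
           Agriculture = swiss$Agriculture,
           Education   = swiss$Education)
y <- swiss$Fertility

# Normal equation: beta = (X^T X)^{-1} X^T y.
# solve(A, b) solves A %*% beta = b without forming the inverse explicitly.
beta <- solve(t(X) %*% X, t(X) %*% y)

# Cross-check against lm().
fit <- lm(Fertility ~ Agriculture + Education, data = swiss)
print(cbind(normal_equation = drop(beta), lm = unname(coef(fit))))
```

Solving the linear system instead of calling solve(t(X) %*% X) and multiplying is a small design choice: it is the same formula, but it is usually more stable numerically.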

THE SWISS DATASET

  • 47 French-speaking Swiss provinces (c. 1888): fertility and socioeconomic indicators
  • Variables: Fertility, Agriculture, Examination, Education, Catholic, Infant.Mortality
  • Source: Built-in dataset in R (datasets package)
  • Use: Common teaching example for regression analysis, including multiple linear regression
  • Application: Models the relationship between socioeconomic factors and fertility
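A quick way to get oriented is to inspect the dataset before modelling. This is a minimal sketch; the variable names n_obs, var_names and key_vars are illustrative only.

```r
# swiss ships with base R (datasets package), so no download is needed.
n_obs     <- nrow(swiss)   # number of provinces
var_names <- names(swiss)  # column names

str(swiss)                 # structure: one numeric column per indicator

# Summary statistics for the variables used in this deck.
key_vars <- c("Fertility", "Agriculture", "Education")
print(summary(swiss[, key_vars]))
```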

CODE FOR FERTILITY VS. AGRICULTURE 2D SCATTERPLOT

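library(ggplot2)

# Scatterplot of Fertility against Agriculture with a least-squares line
ggplot(swiss, aes(x = Agriculture, y = Fertility)) +
  geom_point(color = "blue") +
  # formula = y ~ x is the default for method = "lm"; stating it avoids a message
  geom_smooth(method = "lm", formula = y ~ x, color = "red", se = FALSE) +
  labs(title = "Fertility vs. Agriculture",
       x = "Agriculture (% of Men Employed in Agriculture)",
       y = "Fertility Rate") +
  theme_minimal() +
  # Center the plot title
  theme(plot.title = element_text(hjust = 0.5)) +
  # Fix both axes to 0-100 (all swiss values fall inside this range)
  xlim(0, 100) +
  ylim(0, 100)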

FERTILITY VS. AGRICULTURE

FERTILITY VS. EDUCATION
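A sketch of how this plot could be produced, mirroring the Fertility vs. Agriculture code with Education swapped in. The object name p_edu and the axis label wording are assumptions, not taken from the slides.

```r
library(ggplot2)

# Same layout as the Agriculture plot, with Education on the x-axis.
p_edu <- ggplot(swiss, aes(x = Education, y = Fertility)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, color = "red", se = FALSE) +
  labs(title = "Fertility vs. Education",
       x = "Education (% of Draftees Educated Beyond Primary School)",
       y = "Fertility Rate") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlim(0, 100) +
  ylim(0, 100)

print(p_edu)
```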

FERTILITY VS. AGRICULTURE & EDUCATION
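With two predictors the fitted model is a plane rather than a line. The sketch below fits that plane and draws it with base R's persp(). The grid resolution, viewing angles and the names fit2, agri, educ, grid and plane are illustrative assumptions, and the slide's own figure may have been produced differently.

```r
# Fit Fertility on both predictors at once.
fit2 <- lm(Fertility ~ Agriculture + Education, data = swiss)

# Grid of predictor values spanning the observed ranges.
agri <- seq(min(swiss$Agriculture), max(swiss$Agriculture), length.out = 25)
educ <- seq(min(swiss$Education), max(swiss$Education), length.out = 25)
grid <- expand.grid(Agriculture = agri, Education = educ)

# Predicted fertility at each grid point. expand.grid varies Agriculture
# fastest, so filling column-wise gives rows = agri and columns = educ.
plane <- matrix(predict(fit2, newdata = grid),
                nrow = length(agri), ncol = length(educ))

# Draw the fitted regression plane.
persp(agri, educ, plane, theta = 30, phi = 20,
      xlab = "Agriculture", ylab = "Education", zlab = "Fertility",
      main = "Fertility vs. Agriculture & Education")
```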

REFERENCES