2025-03-17

Introduction

This presentation includes:

-Investigated the mathematical underpinnings of multiple linear regression. -Fitting a regression model using simulated data and producing visualizations to comprehend variable relationships and evaluate model fit - Interpreted our regression analysis’s findings.

Numerous disciplines, including economics, biology, the social sciences, and engineering, can benefit from the strong framework that multiple linear regression offers for comprehending the relationships between several predictors and a response variable.

Mathematical Formulation

The multiple linear regression model is given by:

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \epsilon_i \]

Where:

  • \(y_i\) is the response variable for observation \(i\)
  • \(x_{i1}, x_{i2}, \ldots, x_{ip}\) are predictor variables
  • \(\beta_0, \beta_1, \ldots, \beta_p\) are regression coefficients
  • \(\epsilon_i\) is the error term, where \(\epsilon_i \sim N(0, \sigma^2)\)

Matrix Formulation

In matrix form, the multiple regression model is:

\[ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]

The least squares estimator for \(\boldsymbol{\beta}\) is:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \]

Data Simulation and Model Fitting

# Load necessary libraries
library(ggplot2)
library(plotly)
library(dplyr)

set.seed(123)

n <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
beta0 <- 5; beta1 <- 2; beta2 <- -1
epsilon <- rnorm(n, mean = 0, sd = 2)
y <- beta0 + beta1 * x1 + beta2 * x2 + epsilon
data <- data.frame(x1, x2, y)

model <- lm(y ~ x1 + x2, data = data)

data$predicted <- predict(model, data)
data$residuals <- residuals(model)

Model Summary

summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7988 -1.3643 -0.2172  1.1499  6.7325 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.71841    0.57234   8.244 8.11e-13 ***
## x1           2.01945    0.06913  29.211  < 2e-16 ***
## x2          -1.00593    0.07469 -13.467  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.953 on 97 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9181 
## F-statistic: 555.9 on 2 and 97 DF,  p-value: < 2.2e-16

Relationship Between Variables - ggplot (1)

# Create scatter plot with regression line
ggplot(data, aes(x = x1, y = y)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
  labs(title = "Relationship between x1 and y",
       x = "Predictor (x1)",
       y = "Response (y)") +
  theme_minimal() +
  theme(plot.title = element_text(size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))

Residual Diagnostics - ggplot (2)

# Create residual plot
ggplot(data, aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted values",
       y = "Residuals") +
  theme_minimal() +
  theme(plot.title = element_text(size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))

R Code for 3D Visualization

# Create a grid for the regression plane
x1_seq <- seq(min(data$x1), max(data$x1), length.out = 30)
x2_seq <- seq(min(data$x2), max(data$x2), length.out = 30)
grid <- expand.grid(x1 = x1_seq, x2 = x2_seq)
grid$y <- predict(model, newdata = grid)

plot_ly() %>%
  add_markers(data = data, x = ~x1, y = ~x2, z = ~y, 
              marker = list(color = 'black', size = 3, opacity = 0.7)) %>%
  add_surface(x = ~x1_seq, y = ~x2_seq, 
              z = matrix(grid$y, nrow = 30, ncol = 30), 
              opacity = 0.7) %>%
  layout(title = "3D Regression Surface",
         scene = list(
           xaxis = list(title = "x1"),
           yaxis = list(title = "x2"),
           zaxis = list(title = "y")
         ))

3D Visualization with plotly

Additional ggplot Visualization - QQ Plot

ggplot(data, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "blue") +
  labs(title = "Normal Q-Q Plot of Residuals",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))

Interpretation of Results

Our model results show:

  • The intercept (β₀) is approximately 4.72, which represents the expected value of y when both x1 and x2 are zero.
  • The coefficient for x1 (β₁) is approximately 2.02, indicating that for each unit increase in x1, y increases by this amount (holding x2 constant).
  • The coefficient for x2 (β₂) is approximately -1.01, indicating that for each unit increase in x2, y decreases by this amount (holding x1 constant).
  • R-squared value of 0.92 indicates that 92% of the variation in y is explained by our model.

Conclusion

Here, we have:

  • Covered multiple linear regression and its mathematical foundation
  • Create simulated data and fit a regression model
  • Created visualizations to understand variable relationships and assess model fit
  • Interpreted our regression analysis findings

Multiple linear regression provides a solid paradigm for understanding the relationship between multiple predictors and a response variable, with extensive applications in various fields including economics, biology, social sciences, and engineering.