Agenda
- Overview of Multiple Linear Regression
- Model Formulation with Mathematical Notation
- Simulation and Model Fitting
- Visualization using ggplot2 and Plotly
- Interpretation and Conclusion
2025-06-08
Introduction
Welcome to my presentation on multiple linear regression. It covers the theoretical foundations of the method, simulates a dataset, fits a regression model in R, and visualizes the results using ggplot2 and plotly.
Theory - Model Formulation
In multiple linear regression, the relationship between the response variable \(y\) and the predictors \(x_1, x_2, \ldots, x_p\) is modeled as:
\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i \]
where \(\epsilon_i \sim N(0, \sigma^2)\).
Estimation of Parameters
The parameters are estimated by minimizing the sum of squared errors (SSE):
\[ S(\beta) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \right)^2 \]
The solution to this minimization is provided by the normal equations:
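\[ (X^T X) \, \hat{\beta} = X^T y \]
where \(X\) is the \(n \times (p + 1)\) design matrix: a first column of ones for the intercept, followed by one column per predictor. When \(X^T X\) is invertible, the solution is \(\hat{\beta} = (X^T X)^{-1} X^T y\).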
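As a minimal sketch of how the normal equations can be solved numerically, the R snippet below builds the design matrix by hand and solves the linear system directly. It repeats the simulation settings used later in this presentation so that it runs on its own; the names `X`, `beta_hat`, and `fit` are illustrative choices, not part of the original slides. `lm()` itself relies on a QR decomposition rather than forming \(X^T X\) explicitly, so this is for illustration, not a recommended fitting routine.

```r
# Simulate the same data as in the model-fitting slide
set.seed(123)
n  <- 100
x1 <- rnorm(n, mean = 10, sd = 2)
x2 <- rnorm(n, mean = 5, sd = 1)
epsilon <- rnorm(n, mean = 0, sd = 2)
y  <- 3 + 1.5 * x1 - 2 * x2 + epsilon

# Design matrix: a column of ones for the intercept, then the predictors
X <- cbind(1, x1, x2)

# Solve (X^T X) beta = X^T y without explicitly inverting X^T X
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare with the coefficients reported by lm()
fit <- lm(y ~ x1 + x2)
print(cbind(normal_equations = as.vector(beta_hat), lm = unname(coef(fit))))
```

The two columns should agree up to floating-point error, since both approaches minimize the same sum of squared errors.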
Simulation & Model Fitting in R
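This code simulates a dataset from the model \(y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \epsilon\) with true values \(\beta_0 = 3\), \(\beta_1 = 1.5\), \(\beta_2 = -2\), and \(\sigma = 2\), then fits a linear regression to the simulated data.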
# Load necessary packages and suppress startup messages
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(plotly))

# Simulate data
set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 10, sd = 2)
x2 <- rnorm(n, mean = 5, sd = 1)
epsilon <- rnorm(n, mean = 0, sd = 2)
y <- 3 + 1.5 * x1 - 2 * x2 + epsilon

# Create dataframe and fit model
data <- data.frame(x1, x2, y)
model <- lm(y ~ x1 + x2, data = data)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.7460 -1.3215 -0.2489  1.2427  4.1597
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   4.3637     1.4889   2.931  0.00422 **
## x1            1.3668     0.1049  13.034  < 2e-16 ***
## x2           -1.9524     0.1980  -9.861 2.68e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.903 on 97 degrees of freedom
## Multiple R-squared:  0.7431, Adjusted R-squared:  0.7378
## F-statistic: 140.3 on 2 and 97 DF,  p-value: < 2.2e-16
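Because the data are simulated, the estimates can be checked against the known true coefficients (3, 1.5, and -2). The sketch below is one way to do that, using `confint()` to obtain 95% confidence intervals; the `truth` vector and the `comparison` data frame are names introduced here for illustration. Judging from the standard errors in the summary, each interval should contain its true value.

```r
# Refit the model from the simulation above so this block runs on its own
set.seed(123)
n  <- 100
x1 <- rnorm(n, mean = 10, sd = 2)
x2 <- rnorm(n, mean = 5, sd = 1)
epsilon <- rnorm(n, mean = 0, sd = 2)
y  <- 3 + 1.5 * x1 - 2 * x2 + epsilon
data  <- data.frame(x1, x2, y)
model <- lm(y ~ x1 + x2, data = data)

# True coefficients used in the simulation (hypothetical helper vector)
truth <- c("(Intercept)" = 3, x1 = 1.5, x2 = -2)

# 95% confidence intervals for the fitted coefficients
ci <- confint(model, level = 0.95)

# Side-by-side comparison of truth, estimate, and interval
comparison <- data.frame(
  truth    = truth,
  estimate = coef(model),
  lower    = ci[, 1],
  upper    = ci[, 2]
)
comparison$covered <- comparison$truth >= comparison$lower &
  comparison$truth <= comparison$upper
print(comparison)
```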
Visualization with ggplot2 (Scatter & Fitted Line)
Here we plot \(y\) against \(x_1\) while holding \(x_2\) constant at its median value. The red line represents the fitted regression relationship.
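The plot described above could be produced along the following lines. This is a sketch, not the original slide code: the prediction grid `grid`, the column `y_hat`, and the plot labels are assumptions made for illustration.

```r
library(ggplot2)

# Recreate the simulated data and fitted model
set.seed(123)
n  <- 100
x1 <- rnorm(n, mean = 10, sd = 2)
x2 <- rnorm(n, mean = 5, sd = 1)
epsilon <- rnorm(n, mean = 0, sd = 2)
y  <- 3 + 1.5 * x1 - 2 * x2 + epsilon
data  <- data.frame(x1, x2, y)
model <- lm(y ~ x1 + x2, data = data)

# Prediction grid: vary x1 over its observed range, fix x2 at its median
grid <- data.frame(
  x1 = seq(min(data$x1), max(data$x1), length.out = 100),
  x2 = median(data$x2)
)
grid$y_hat <- predict(model, newdata = grid)

# Scatter of the observed data with the fitted line in red
p <- ggplot(data, aes(x = x1, y = y)) +
  geom_point(alpha = 0.6) +
  geom_line(data = grid, aes(x = x1, y = y_hat), color = "red") +
  labs(
    title = "y vs. x1 with x2 held at its median",
    x = "x1",
    y = "y"
  )
p
```

Because \(x_2\) is fixed, the red line has slope equal to the estimated coefficient of \(x_1\); the scatter around it also reflects variation in \(x_2\), not only the error term.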
Diagnostic Plot Using ggplot2
This slide presents a residuals vs. fitted values plot, a common diagnostic tool for assessing heteroscedasticity or non-linearity in the model.
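A residuals vs. fitted plot of this kind might be built as follows. The data frame `diag_df` and the added loess smoother are illustrative choices and may differ from the original slide.

```r
library(ggplot2)

# Recreate the simulated data and fitted model
set.seed(123)
n  <- 100
x1 <- rnorm(n, mean = 10, sd = 2)
x2 <- rnorm(n, mean = 5, sd = 1)
epsilon <- rnorm(n, mean = 0, sd = 2)
y  <- 3 + 1.5 * x1 - 2 * x2 + epsilon
data  <- data.frame(x1, x2, y)
model <- lm(y ~ x1 + x2, data = data)

# Collect fitted values and residuals for plotting
diag_df <- data.frame(
  fitted   = fitted(model),
  residual = resid(model)
)

# Residuals vs. fitted values, with a zero reference line and a smoother
p_resid <- ggplot(diag_df, aes(x = fitted, y = residual)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(
    title = "Residuals vs. fitted values",
    x = "Fitted values",
    y = "Residuals"
  )
p_resid
```

A roughly horizontal smoother near zero with an even vertical spread is what one would hope to see; a funnel shape would suggest heteroscedasticity, and a curved smoother would suggest non-linearity.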
Interactive 3D Plot with Plotly
This interactive 3D plot visualizes the original data points (in blue) along with the regression plane derived from the fitted model. The axes correspond to \(x_1, x_2,\) and \(y\).
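One way to build such a plot is sketched below. The grid resolution (30 points per axis), the names `x1_seq`, `x2_seq`, and `z_plane`, and the styling options are assumptions; the original slide code may differ.

```r
suppressPackageStartupMessages(library(plotly))

# Recreate the simulated data and fitted model
set.seed(123)
n  <- 100
x1 <- rnorm(n, mean = 10, sd = 2)
x2 <- rnorm(n, mean = 5, sd = 1)
epsilon <- rnorm(n, mean = 0, sd = 2)
y  <- 3 + 1.5 * x1 - 2 * x2 + epsilon
data  <- data.frame(x1, x2, y)
model <- lm(y ~ x1 + x2, data = data)

# Grid over the observed ranges of both predictors
x1_seq <- seq(min(data$x1), max(data$x1), length.out = 30)
x2_seq <- seq(min(data$x2), max(data$x2), length.out = 30)

# Surface heights: rows follow the y-axis (x2), columns follow the x-axis (x1)
z_plane <- outer(x2_seq, x1_seq, function(b, a) {
  predict(model, newdata = data.frame(x1 = a, x2 = b))
})

# Observed points in blue plus the fitted regression plane
fig <- plot_ly() %>%
  add_markers(
    data = data, x = ~x1, y = ~x2, z = ~y,
    marker = list(color = "blue", size = 3),
    name = "Observed data"
  ) %>%
  add_surface(
    x = x1_seq, y = x2_seq, z = z_plane,
    opacity = 0.5, showscale = FALSE,
    name = "Fitted plane", inherit = FALSE
  ) %>%
  layout(scene = list(
    xaxis = list(title = "x1"),
    yaxis = list(title = "x2"),
    zaxis = list(title = "y")
  ))
fig
```

The surface matrix is oriented with \(x_2\) along its rows because plotly's surface trace reads rows of `z` against the y-axis and columns against the x-axis.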
Conclusion
This presentation covered:
- Exploring the mathematical formulation of multiple linear regression.
- Simulating a dataset and fitting a regression model in R.
- Visualizing the model’s behavior using both ggplot2 and an interactive plotly 3D plot.