March 16, 2025

Introduction to Linear Regression

Linear regression is one of the most fundamental machine learning algorithms, serving as a building block for more complex models.

Key Concepts: - Predicts a continuous output variable based on one or more input variables - Establishes a linear relationship between inputs and outputs - Can be extended to model non-linear relationships (polynomial regression) - Provides interpretable coefficients that quantify relationships

Applications: Economics, finance, social sciences, medicine, engineering, and nearly every field that works with quantitative data.

Mathematical Foundation

For simple linear regression with one predictor variable:

\[y = \beta_0 + \beta_1x + \varepsilon\]

Where: - \(y\) is the dependent variable we want to predict - \(x\) is the independent variable (predictor) - \(\beta_0\) is the y-intercept - \(\beta_1\) is the slope - \(\varepsilon\) is the error term

The goal is to find values for \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals:

\[\text{minimize} \sum_{i=1}^n (y_i - (\beta_0 + \beta_1x_i))^2\]

Linear Regression: Step 1 - Data Generation

# Generate synthetic data with a clear linear trend
set.seed(42)
n <- 100
x <- seq(1, 10, length.out = n)
y <- 2 + 3 * x + rnorm(n, mean = 0, sd = 2)  # y = 2 + 3x + noise
data <- data.frame(x = x, y = y)

# Preview data
head(data, 3)
#>          x        y
#> 1 1.000000 7.741917
#> 2 1.090909 4.143331
#> 3 1.181818 6.271711

Linear Regression: Step 2 - Model Fitting

# Fit linear regression model
model <- lm(y ~ x, data = data)

# Extract model coefficients and R-squared
coef_intercept <- round(coef(model)[1], 2)
coef_slope <- round(coef(model)[2], 2)
r_squared <- round(summary(model)$r.squared, 3)

# Display model summary 
summary(model)$coefficients
#>             Estimate Std. Error   t value     Pr(>|t|)
#> (Intercept) 2.152293 0.48601673  4.428435 2.470501e-05
#> x           2.984134 0.07975383 37.416811 7.750931e-60

Linear Regression: Step 3 - Prediction

# Create predictions for plotting
grid <- data.frame(x = seq(min(x), max(x), length.out = 100))
predictions <- predict(model, newdata = grid, interval = "confidence")
grid$predicted <- predictions[,1]
grid$conf_low <- predictions[,2]
grid$conf_high <- predictions[,3]

# Preview predictions
head(grid, 3)
#>          x predicted conf_low conf_high
#> 1 1.000000  5.136427 4.311964  5.960891
#> 2 1.090909  5.407712 4.595646  6.219779
#> 3 1.181818  5.678997 4.879261  6.478734

Linear Regression: Step 4 - Visualization

# Create visualization
basic_plot <- ggplot(data, aes(x = x, y = y)) +
  # Add confidence interval and data points
  geom_ribbon(data = grid, aes(x = x, y = predicted, ymin = conf_low, ymax = conf_high), 
              fill = "#BED3F4", alpha = 0.5) +
  geom_point(color = "#3498DB", size = 3, alpha = 0.7) +
  geom_line(data = grid, aes(x = x, y = predicted), color = "#E74C3C", size = 1.2) +
  
  # Add annotations
  annotate("text", x = min(x) + 1, y = max(y) - 2, 
           label = paste0("y = ", coef_intercept, " + ", coef_slope, "x"), 
           color = "#E74C3C", size = 5, hjust = 0) +
  annotate("text", x = min(x) + 1, y = max(y) - 4, 
           label = paste0("R² = ", r_squared), color = "#2C3E50", size = 5, hjust = 0) +
  
  # Labels and styling
  labs(title = "Simple Linear Regression", 
       subtitle = "Relationship between X and Y with confidence interval",
       x = "X Variable", y = "Y Variable") +
  theme_minimal()

Simple Linear Regression Visualization

# Display the linear regression plot
basic_plot

Residuals Analysis: Step 1 - Calculations

# Calculate residuals
data$predicted <- predict(model, newdata = data)
data$residuals <- data$y - data$predicted

# Summary statistics of residuals
summary(data$residuals)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -6.0389 -1.3237  0.1618  0.0000  1.3054  4.4527

# Shapiro-Wilk test for normality
shapiro.test(data$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  data$residuals
#> W = 0.98033, p-value = 0.141

Residuals Analysis: Step 2 - Residuals Plot

# Create residual plot
residual_plot <- ggplot(data, aes(x = x, y = residuals)) +
  # Reference line and points
  geom_hline(yintercept = 0, linetype = "dashed", color = "#95A5A6", size = 0.8) +
  geom_point(aes(color = abs(residuals)), size = 3, alpha = 0.8) +
  geom_segment(aes(xend = x, yend = 0, color = abs(residuals)), alpha = 0.5) +
  
  # Styling
  scale_color_gradient(low = "#3498DB", high = "#E74C3C", name = "Residual\nMagnitude") +
  labs(title = "Residual Plot", 
       subtitle = "Visualizing the errors in our linear model",
       x = "X Variable", y = "Residual Value") +
  theme_minimal() +
  theme(legend.position = "right")

Residuals Analysis: Step 3 - Distribution

# Create distribution of residuals plot
hist_plot <- ggplot(data, aes(x = residuals)) +
  # Histogram and density curves
  geom_histogram(aes(y = ..density..), fill = "#3498DB", 
                 color = "#2C3E50", alpha = 0.7, bins = 20) +
  geom_density(color = "#E74C3C", size = 1) +
  
  # Normal distribution overlay
  stat_function(fun = dnorm, 
                args = list(mean = mean(data$residuals), 
                            sd = sd(data$residuals)),
                color = "#2ECC71", size = 1, linetype = "dashed") +
  annotate("text", x = max(data$residuals) - 1, y = 0.2, 
           label = "Normal Distribution", color = "#2ECC71", size = 4, hjust = 1) +
  
  # Labels
  labs(title = "Distribution of Residuals",
       subtitle = "Checking for normality in residuals",
       x = "Residual Value", y = "Density") +
  theme_minimal()

Residuals Visualization

# Display the residuals visualizations side by side
residuals_combined <- gridExtra::grid.arrange(residual_plot, hist_plot, ncol = 2)

residuals_combined
#> TableGrob (1 x 2) "arrange": 2 grobs
#>   z     cells    name           grob
#> 1 1 (1-1,1-1) arrange gtable[layout]
#> 2 2 (1-1,2-2) arrange gtable[layout]

Multiple Regression: Step 1 - Data Setup

# Generate synthetic data for multiple regression
set.seed(123)
n <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y <- 5 + 2 * x1 - 1.5 * x2 + rnorm(n, 0, 3)
data_3d <- data.frame(x1 = x1, x2 = x2, y = y)

# Fit multiple regression model
model_3d <- lm(y ~ x1 + x2, data = data_3d)
coefs <- round(coef(model_3d), 2)

# Model summary
summary(model_3d)$coefficients
#>              Estimate Std. Error    t value     Pr(>|t|)
#> (Intercept)  4.577619  0.8585113   5.332042 6.334532e-07
#> x1           2.029178  0.1036995  19.567858 1.896421e-35
#> x2          -1.508893  0.1120422 -13.467188 6.205237e-24

Multiple Regression: Step 2 - First Plot

# First view: X1 vs Y with X2 as color
plot_x1 <- ggplot(data_3d, aes(x = x1, y = y, color = x2)) +
  # Points and trend line
  geom_point(size = 3, alpha = 0.7) +
  scale_color_viridis_c(name = "X2 Value") +
  geom_smooth(method = "lm", se = FALSE, color = "#E74C3C", size = 1.2) +
  
  # Equation annotation
  annotate("text", x = min(x1) + 1, y = max(y) - 2, 
           label = paste0("y = ", coefs[1], " + ", coefs[2], "x₁ ", 
                        ifelse(coefs[3] >= 0, "+ ", "- "), abs(coefs[3]), "x₂"), 
           color = "#E74C3C", size = 4, hjust = 0) +
  
  # Labels
  labs(title = "Multiple Regression: X₁ vs Y",
       subtitle = "Points colored by X₂ value",
       x = "X₁ Variable", y = "Y Variable") +
  theme_minimal()

Multiple Regression: Step 3 - Second Plot

# Second view: X2 vs Y with X1 as color
plot_x2 <- ggplot(data_3d, aes(x = x2, y = y, color = x1)) +
  # Points and trend line
  geom_point(size = 3, alpha = 0.7) +
  scale_color_viridis_c(name = "X1 Value", option = "plasma") +
  geom_smooth(method = "lm", se = FALSE, color = "#E74C3C", size = 1.2) +
  
  # Labels
  labs(title = "Multiple Regression: X₂ vs Y",
       subtitle = "Points colored by X₁ value",
       x = "X₂ Variable", y = "Y Variable") +
  theme_minimal()

Multiple Regression Visualization

# Display both multiple regression visualizations
multiple_regression_plots <- gridExtra::grid.arrange(plot_x1, plot_x2, ncol = 2)

multiple_regression_plots
#> TableGrob (1 x 2) "arrange": 2 grobs
#>   z     cells    name           grob
#> 1 1 (1-1,1-1) arrange gtable[layout]
#> 2 2 (1-1,2-2) arrange gtable[layout]

Polynomial Regression: Step 1 - Data Generation

# Generate synthetic data with quadratic relationship
set.seed(222)
n <- 120
x_poly <- seq(-5, 5, length.out = n)
y_poly <- 2 + 0.5 * x_poly - 0.5 * x_poly^2 + rnorm(n, 0, 2)
data_poly <- data.frame(x = x_poly, y = y_poly)

# Fit polynomial models of different degrees
model_linear <- lm(y ~ x, data = data_poly)
model_quadratic <- lm(y ~ x + I(x^2), data = data_poly)
model_cubic <- lm(y ~ x + I(x^2) + I(x^3), data = data_poly)

Polynomial Regression: Step 2 - Model Comparison

# Get R-squared values
r2_linear <- round(summary(model_linear)$r.squared, 3)
r2_quadratic <- round(summary(model_quadratic)$r.squared, 3)
r2_cubic <- round(summary(model_cubic)$r.squared, 3)

# Create predictions for plotting
grid_poly <- data.frame(x = seq(min(x_poly), max(x_poly), length.out = 100))
grid_poly$linear <- predict(model_linear, newdata = grid_poly)
grid_poly$quadratic <- predict(model_quadratic, newdata = grid_poly)
grid_poly$cubic <- predict(model_cubic, newdata = grid_poly)

# Compare models
data.frame(
  Model = c("Linear", "Quadratic", "Cubic"),
  R_squared = c(r2_linear, r2_quadratic, r2_cubic)
)
#>       Model R_squared
#> 1    Linear     0.098
#> 2 Quadratic     0.824
#> 3     Cubic     0.825

Polynomial Regression: Step 3 - Visualization

# Create beautiful visualization
poly_plot <- ggplot(data_poly, aes(x = x, y = y)) +
  # Data points and regression lines
  geom_point(color = "#3498DB", size = 2.5, alpha = 0.7) +
  geom_line(data = grid_poly, aes(x = x, y = linear, color = "Linear"), size = 1) +
  geom_line(data = grid_poly, aes(x = x, y = quadratic, color = "Quadratic"), size = 1) +
  geom_line(data = grid_poly, aes(x = x, y = cubic, color = "Cubic"), size = 1) +
  
  # Colors and annotations
  scale_color_manual(name = "Model Type",
  values = c("Linear" = "#E74C3C", "Quadratic" = "#2ECC71","Cubic" = "#9B59B6")) +
  annotate("text", x = -4, y = max(y_poly) - 1, 
  label = paste0("Linear R² = ", r2_linear), color = "#E74C3C", size = 4, hjust = 0) +
  annotate("text", x = -4, y = max(y_poly) - 2, 
  label = paste0("Quadratic R² = ", r2_quadratic), color = "#2ECC71",size = 4,hjust=0)+
  annotate("text", x = -4, y = max(y_poly) - 3, 
           label = paste0("Cubic R² = ", r2_cubic), color = "#9B59B6", size = 4, 
           hjust = 0) +
  
  # Labels and styling
  labs(title = "Polynomial Regression Models",
       subtitle = "Comparing linear, quadratic, and cubic fits",
       x = "X Variable", y = "Y Variable") +
  theme_minimal() +
  theme(legend.position = "bottom")

Polynomial Regression Visualization

# Display the polynomial regression plot
poly_plot

Boston Housing Dataset: Step 1 - Data Preparation

# Load the Boston Housing dataset
library(MASS)
data(Boston)

# Extract and rename variables of interest
housing_data <- Boston[, c("rm", "lstat", "medv")]
names(housing_data) <- c("rooms", "lower_status", "value")

# Explore data
summary(housing_data)
#>      rooms        lower_status       value      
#>  Min.   :3.561   Min.   : 1.73   Min.   : 5.00  
#>  1st Qu.:5.886   1st Qu.: 6.95   1st Qu.:17.02  
#>  Median :6.208   Median :11.36   Median :21.20  
#>  Mean   :6.285   Mean   :12.65   Mean   :22.53  
#>  3rd Qu.:6.623   3rd Qu.:16.95   3rd Qu.:25.00  
#>  Max.   :8.780   Max.   :37.97   Max.   :50.00

# Fit linear regression model
housing_model <- lm(value ~ rooms, data = housing_data)
coefs_housing <- round(coef(housing_model), 2)
r2_housing <- round(summary(housing_model)$r.squared, 3)

Boston Housing Dataset: Step 2 - Visualization

# Create visualization
housing_plot <- ggplot(housing_data, aes(x = rooms, y = value)) +
  # Points and trend line
  geom_point(aes(color = lower_status), size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", color = "#E74C3C", size = 1.2) +
  scale_color_viridis_c(name = "% Lower Status\nPopulation", option = "plasma") +
  
  # Add equation and R-squared
  annotate("text", x = min(housing_data$rooms) + 0.5, y = max(housing_data$value) - 2, 
           label = paste0("value = ", coefs_housing[1], " + ", coefs_housing[2], " × rooms"), 
           color = "#E74C3C", size = 4, hjust = 0) +
  annotate("text", x = min(housing_data$rooms) + 0.5, y = max(housing_data$value) - 4, 
           label = paste0("R² = ", r2_housing), color = "#2C3E50", size = 4, hjust = 0) +
  
  # Labels and styling
  labs(title = "Boston Housing Dataset",
       subtitle = "Relationship between number of rooms and housing value",
       x = "Average Number of Rooms", y = "Median Housing Value ($1,000s)") +
  theme_minimal()

Boston Housing Visualization {.smaller}z

# Display the Boston housing plot
housing_plot

Interpreting Linear Regression

Linear regression provides interpretable coefficients that quantify the relationship between variables.

For simple linear regression \(y = \beta_0 + \beta_1x + \varepsilon\):

  • \(\beta_0\) (intercept): The expected value of \(y\) when \(x = 0\)
  • \(\beta_1\) (slope): The expected change in \(y\) for a one-unit increase in \(x\)

For multiple regression \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p + \varepsilon\):

  • Each \(\beta_i\) represents the expected change in \(y\) for a one-unit increase in \(x_i\), holding all other variables constant

Model evaluation metrics: - \(R^2\) (coefficient of determination): Proportion of variance explained by the model (0 to 1) - Adjusted \(R^2\): Adjusts for the number of predictors - RMSE (Root Mean Square Error): Average magnitude of residuals

Assumptions and Diagnostics

For linear regression to be valid, several assumptions should be checked through diagnostic plots.

Key assumptions:

  1. Linearity: The relationship between X and Y is linear
    • Check with scatter plots and residual plots
  2. Independence: Observations are independent of each other
    • Check for auto-correlation in time series data
  3. Homoscedasticity: Residuals have constant variance across all levels of X
    • Check residual plot for funnel patterns
  4. Normality of residuals: Residuals are normally distributed
    • Check with histogram, Q-Q plot
  5. No multicollinearity: For multiple regression, predictors aren’t highly correlated
    • Check variance inflation factor (VIF)

Applications and Extensions

Linear regression is the foundation for many advanced techniques in statistics and machine learning.

Common applications: - Economics: Predicting GDP growth based on economic indicators - Finance: Modeling stock returns based on market factors - Medicine: Predicting patient outcomes based on treatment variables - Marketing: Estimating sales based on advertising expenditure - Social sciences: Analyzing relationships between sociological factors

Extensions and variations: - Ridge Regression: Adds L2 regularization to handle multicollinearity - Lasso Regression: Adds L1 regularization for feature selection - Elastic Net: Combines L1 and L2 regularization - Quantile Regression: Models different percentiles of the response - Generalized Linear Models: Extends to non-normally distributed responses

References

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.

  • Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models (5th ed.). McGraw-Hill/Irwin.

  • Fox, J. (2016). Applied Regression Analysis and Generalized Linear Models (3rd ed.). SAGE Publications.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.

  • Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis (5th ed.). Wiley.