Introduction

Regression analysis is one of the fundamental tools in statistical modeling. This report compares two regression techniques: Least Squares Estimation (LSE) and Least Absolute Deviation (LAD) using the ‘diamonds’ dataset from ‘ggplot2’ package.While LSE minimizes the squared residuals, LAD minimizes the absolute residuals, making it more robust to outliers.

Loading Required Packages

We begin by loading the necessary libraries. “ggplot2” provides the diamonds dataset, while “quantreg” enables LAD regression.

library(quantreg)   # For LAD (quantile) regression
## Warning: package 'quantreg' was built under R version 4.4.3
## Loading required package: SparseM
## Warning: package 'SparseM' was built under R version 4.4.3
library(ggplot2)    # For dataset and visualization

Data Overview

The diamonds dataset contains information on the prices and attributes of around 54,000 diamonds. We will focus on the variables “carat” (weight of the diamond) and “price”.

# Load the dataset
dat = diamonds

# Number of rows and preview
nrow(dat)
## [1] 53940
head(dat)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Scatter Plot of Carat vs Price

We begin by visualizing the relationship between carat and price.

plot(dat$carat, dat$price,
     xlab = "Carat",
     ylab = "Price",
     main = "Carat vs Price of Diamonds",
     pch = 19)

As expected, diamond price increases with carat weight.

Fitting a Least Squares Regression Model

We fit a linear regression model using LSE, which estimates the parameters by minimizing the sum of squared residuals.

# Fit LSE model
lse_model = lm(price ~ carat, data = dat)

# Extract coefficients
lse_coeff = coef(lse_model)
lse_intercept = lse_coeff[1]
lse_slope = lse_coeff[2]

# Display coefficients
cat("LSE Intercept:", lse_intercept, "\n")
## LSE Intercept: -2256.361
cat("LSE Slope:", lse_slope, "\n")
## LSE Slope: 7756.426

We can now plot the regression line obtained using LSE.

plot(dat$carat, dat$price,
     xlab = "Carat", ylab = "Price",
     main = "LSE Regression Line",
     pch = 19)
abline(lse_model, col = "red", lwd = 2)

Fitting a Least Absolute Deviations (LAD) Regression Model

LAD regression minimizes the sum of absolute residuals, offering more robustness to outliers.

# Fit LAD model (median regression)
lad_model = rq(price ~ carat, tau = 0.5, data = dat)

# Extract coefficients
lad_coeff = coef(lad_model)
lad_intercept = lad_coeff[1]
lad_slope = lad_coeff[2]

# Display coefficients
cat("LAD Intercept:", lad_intercept, "\n")
## LAD Intercept: -1665.225
cat("LAD Slope:", lad_slope, "\n")
## LAD Slope: 6692.5

Plotting the LAD regression line:

plot(dat$carat, dat$price,
     xlab = "Carat", ylab = "Price",
     main = "LAD Regression Line",
     pch = 19)
abline(lad_model, col = "green", lwd = 2)

Visual Comparison of LSE and LAD Regression Lines

Here, we plot both regression lines on the same graph to visually compare them.

plot(dat$carat, dat$price,
     xlab = "Carat",
     ylab = "Price",
     main = "Comparison: LSE vs LAD Regression",
     pch = 19)
abline(lse_model, col = "red", lwd = 2)     # LSE line
abline(lad_model, col = "green", lwd = 2)   # LAD line
legend("topleft", legend = c("LSE", "LAD"), col = c("red", "green"), lwd = 2)

## Comparison of Residuals

We now compare the performance of LSE and LAD using residual analysis.

Sum of Squared Residuals (SSR) and Absolute Residuals (SAR)

# Compute residuals
lse_residual = residuals(lse_model)
lad_residual = residuals(lad_model)
# Compute Sum of Squared Residuals (SSR)
lse_ssr = sum((lse_residual)^2)
lad_ssr = sum((lad_residual)^2)

# Compute Sum of Absolute Residuals (SAR)
lse_sar = sum(abs(lse_residual))
lad_sar = sum(abs(lad_residual))

# Display the results
cat("LSE SSR:", lse_ssr, "\n")
## LSE SSR: 129345695398
cat("LAD SSR:", lad_ssr, "\n")
## LAD SSR: 146649329824
cat("LSE SAR:", lse_sar, "\n")
## LSE SAR: 54342568
cat("LAD SAR:", lad_sar, "\n")
## LAD SAR: 51303095

As expected, LSE has a lower SSR (since it minimizes that quantity), while LAD has a lower SAR, indicating better performance in minimizing absolute errors.

par(mfrow = c(1, 2))

# LSE Residuals
plot(fitted(lse_model), lse_residual,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "LSE Residuals", pch = 19, col = "red")
abline(h = 0, col = "black", lty = 2)

# LAD Residuals
plot(fitted(lad_model), lad_residual,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "LAD Residuals", pch = 19, col = "green")
abline(h = 0, col = "black", lty = 2)

par(mfrow = c(1,1))  # Reset

Conclusion