Regression analysis is one of the fundamental tools in statistical modeling. This report compares two regression techniques: Least Squares Estimation (LSE) and Least Absolute Deviation (LAD) using the ‘diamonds’ dataset from ‘ggplot2’ package.While LSE minimizes the squared residuals, LAD minimizes the absolute residuals, making it more robust to outliers.
We begin by loading the necessary libraries. “ggplot2” provides the diamonds dataset, while “quantreg” enables LAD regression.
library(quantreg) # For LAD (quantile) regression
## Warning: package 'quantreg' was built under R version 4.4.3
## Loading required package: SparseM
## Warning: package 'SparseM' was built under R version 4.4.3
library(ggplot2) # For dataset and visualization
The diamonds dataset contains information on the prices and attributes of around 54,000 diamonds. We will focus on the variables “carat” (weight of the diamond) and “price”.
# Load the dataset
dat = diamonds
# Number of rows and preview
nrow(dat)
## [1] 53940
head(dat)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
We begin by visualizing the relationship between carat and price.
plot(dat$carat, dat$price,
xlab = "Carat",
ylab = "Price",
main = "Carat vs Price of Diamonds",
pch = 19)
As expected, diamond price increases with carat weight.
We fit a linear regression model using LSE, which estimates the parameters by minimizing the sum of squared residuals.
# Fit LSE model
lse_model = lm(price ~ carat, data = dat)
# Extract coefficients
lse_coeff = coef(lse_model)
lse_intercept = lse_coeff[1]
lse_slope = lse_coeff[2]
# Display coefficients
cat("LSE Intercept:", lse_intercept, "\n")
## LSE Intercept: -2256.361
cat("LSE Slope:", lse_slope, "\n")
## LSE Slope: 7756.426
We can now plot the regression line obtained using LSE.
plot(dat$carat, dat$price,
xlab = "Carat", ylab = "Price",
main = "LSE Regression Line",
pch = 19)
abline(lse_model, col = "red", lwd = 2)
LAD regression minimizes the sum of absolute residuals, offering more robustness to outliers.
# Fit LAD model (median regression)
lad_model = rq(price ~ carat, tau = 0.5, data = dat)
# Extract coefficients
lad_coeff = coef(lad_model)
lad_intercept = lad_coeff[1]
lad_slope = lad_coeff[2]
# Display coefficients
cat("LAD Intercept:", lad_intercept, "\n")
## LAD Intercept: -1665.225
cat("LAD Slope:", lad_slope, "\n")
## LAD Slope: 6692.5
Plotting the LAD regression line:
plot(dat$carat, dat$price,
xlab = "Carat", ylab = "Price",
main = "LAD Regression Line",
pch = 19)
abline(lad_model, col = "green", lwd = 2)
Here, we plot both regression lines on the same graph to visually compare them.
plot(dat$carat, dat$price,
xlab = "Carat",
ylab = "Price",
main = "Comparison: LSE vs LAD Regression",
pch = 19)
abline(lse_model, col = "red", lwd = 2) # LSE line
abline(lad_model, col = "green", lwd = 2) # LAD line
legend("topleft", legend = c("LSE", "LAD"), col = c("red", "green"), lwd = 2)
## Comparison of Residuals
We now compare the performance of LSE and LAD using residual analysis.
# Compute residuals
lse_residual = residuals(lse_model)
lad_residual = residuals(lad_model)
# Compute Sum of Squared Residuals (SSR)
lse_ssr = sum((lse_residual)^2)
lad_ssr = sum((lad_residual)^2)
# Compute Sum of Absolute Residuals (SAR)
lse_sar = sum(abs(lse_residual))
lad_sar = sum(abs(lad_residual))
# Display the results
cat("LSE SSR:", lse_ssr, "\n")
## LSE SSR: 129345695398
cat("LAD SSR:", lad_ssr, "\n")
## LAD SSR: 146649329824
cat("LSE SAR:", lse_sar, "\n")
## LSE SAR: 54342568
cat("LAD SAR:", lad_sar, "\n")
## LAD SAR: 51303095
As expected, LSE has a lower SSR (since it minimizes that quantity), while LAD has a lower SAR, indicating better performance in minimizing absolute errors.
par(mfrow = c(1, 2))
# LSE Residuals
plot(fitted(lse_model), lse_residual,
xlab = "Fitted Values", ylab = "Residuals",
main = "LSE Residuals", pch = 19, col = "red")
abline(h = 0, col = "black", lty = 2)
# LAD Residuals
plot(fitted(lad_model), lad_residual,
xlab = "Fitted Values", ylab = "Residuals",
main = "LAD Residuals", pch = 19, col = "green")
abline(h = 0, col = "black", lty = 2)
par(mfrow = c(1,1)) # Reset
LSE minimizes the squared residuals, making it sensitive to outliers.
LAD minimizes absolute residuals, providing more robustness to outliers.
From the residual analysis, LAD has lower sum of absolute residuals compared to LSE.
LAD is more appropriate when data may contain outliers.