WLS: A method for handling heteroskedasticity by applying weights to the observations. In general, the WLS line should provide a better fit in the presence of significant heteroskedasticity, but how much it differs from the OLS line will depend on the specifics of your data.
GLS: A broader method that can handle a variety of error structures, including heteroskedasticity and correlated errors.
WLS is a specific form of GLS where the model is weighted to handle heteroskedasticity.
In a regression model, heteroskedasticity occurs when the variance of the errors (residuals) is not constant across all levels of the independent variable(s). This violates one of the key assumptions of Ordinary Least Squares (OLS) regression, which assumes homoskedasticity (constant variance of errors). Heteroskedasticity can lead to inefficient estimates and incorrect inferences, such as biased standard errors, which affects the reliability of hypothesis tests and confidence intervals.
WLS is an extension of OLS that addresses heteroskedasticity by assigning different weights to different observations.
Identifying Heteroskedasticity: First, you need to detect whether heteroskedasticity is present in your model. This can be done using statistical tests like the Breusch-Pagan test or White test, or by plotting the residuals against the predicted values.
Estimating Weights: Once heteroskedasticity is identified, you estimate the variance of the errors for each observation. These variances are used to calculate the weights. Typically, the weight for each observation is the inverse of its estimated variance (Feasible WLS or FWLS).
Two major ways -
Direct estimation: If you have prior knowledge or a model for how the variance changes with the independent variables, you can use that model to estimate the variances. Rare.
Fitted Values from an Auxiliary Regression: Often, a regression model is fit where the squared residuals from the original OLS regression are regressed on the independent variables or other predictors. The predicted values from this auxiliary regression can be used as estimates of the variance.
Applying Weights: In WLS, the regression model is fitted using these weights (typically the inverse of the estimated variances of the residuals), minimizing the weighted sum of squared residuals instead of the unweighted sum of squared residuals as in OLS. This effectively downweights the influence of observations with higher variance.
\[ w_i = \dfrac{1}{\hat{\sigma_i^2}} \]
where:
\(w_i\) is the weight for observation `i`,
\(\hat{\sigma}_i^2\) is the estimated variance of the error term for observation `i`.
New Model (Apply Weights in the Regression Model): The resulting model has corrected for heteroskedasticity, leading to more efficient estimates and more reliable standard errors, making hypothesis testing more accurate.
\(\text{Minimize } \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2\)
where:
\(y_i\) is the observed value of the dependent variable,
\(\hat{y}_i\)​ is the predicted value from the regression model,
\(w_i\)​ is the weight for
observation i\
.
# Load necessary libraries
library(ggplot2)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
# Set seed for reproducibility
set.seed(42)
# Generate income data
income <- seq(2000, 10000, by=1000)
# Generate rent data with a significant increase in variance (severe heteroskedasticity)
rent <- income * 0.3 + rnorm(length(income), mean=0, sd=income/5) # High variance increases with income
# Add a few outliers
# Outliers at high income levels
outliers_income <- c(8000, 9000, 10000)
outliers_rent <- c(4500, 5000, 5500) # Outlier rents
# Append outliers to the data
data <- data.frame(income = income, rent = rent)
outliers_data <- data.frame(income = outliers_income, rent = outliers_rent)
data <- rbind(data, outliers_data)
# Fit OLS model
ols_model <- lm(rent ~ income, data = data)
# Calculate residuals
data$residuals <- residuals(ols_model)
# Auxiliary regression to estimate variance
data$residuals_squared <- data$residuals^2
aux_model <- lm(residuals_squared ~ income, data = data)
# Predicted variances (fitted values from the auxiliary regression)
data$predicted_variance <- fitted(aux_model) # These represent the predicted variances
# Ensure that predicted variances are non-negative
data$predicted_variance <- pmax(data$predicted_variance, 1e-6) # Small positive value to avoid division by zero
# Calculate weights for WLS
data$weights <- 1 / data$predicted_variance
# Fit WLS model
wls_model <- lm(rent ~ income, data = data, weights = data$weights)
# Add fitted values to the data frame
data$ols_fitted <- fitted(ols_model)
data$wls_fitted <- fitted(wls_model)
# Plot the data with ggplot2
ggplot(data, aes(x = income)) +
geom_point(aes(y = rent), color = "black", size = 2, alpha = 0.7) + # Original data and outliers
geom_line(aes(y = ols_fitted), color = "blue", size = 1) + # OLS fitted values
geom_line(aes(y = wls_fitted), color = "red", size = 1) + # WLS fitted values
labs(title = "OLS vs WLS with Severe Heteroskedasticity",
x = "Income",
y = "Rent") +
theme_minimal() +
theme(legend.position = "top") +
scale_y_continuous(labels = scales::label_number(suffix = "", scale = 1)) # Avoid scientific notation
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Example 2
remove(list = ls()) # Clear the workspace
cat("\014") # This will clear the R console
options(scipen = 999) # Higher values reduce scientific notation
cat("Setup complete. Starting analysis...\n")
## Setup complete. Starting analysis...
# Sample data
set.seed(42)
income <- seq(from = 30000, to = 300000, by=1000)
summary(income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30000 97500 165000 165000 232500 300000
# Generate rent data with increasing variability (fanning out pattern)
rent <- income * 0.45 + rnorm(n = length(income),
mean=30000,
sd=income/10
)
# Create a data frame
data <- data.frame(income, rent)
# Inspect the data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
glimpse(data)
## Rows: 271
## Columns: 2
## $ income <dbl> 30000, 31000, 32000, 33000, 34000, 35000, 36000, 37000, 38000, …
## $ rent <dbl> 47612.88, 42199.44, 45562.01, 46938.45, 46674.51, 45378.56, 516…
ggplot(data = data, mapping = aes(x = income, y = rent)) + geom_point()
# Add a few outliers
# Outliers at high income levels
#outliers_income <- c(250000, 260000, 270000)
#outliers_rent <- c(3500000, 3600000, 3700000) # Outlier rents
#outliers_data <- data.frame(income = outliers_income, rent = outliers_rent)
#data <- rbind(data, outliers_data)
ggplot(data = data, mapping = aes(x = income, y = rent)) + geom_point()
# Fit OLS model
ols_model <- lm(rent ~ income, data = data)
# Summarize the OLS model
summary(ols_model)
##
## Call:
## lm(formula = rent ~ income, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78987 -8876 293 8546 57725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30089.08121 2472.84858 12.17 <0.0000000000000002 ***
## income 0.44476 0.01354 32.84 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17440 on 269 degrees of freedom
## Multiple R-squared: 0.8004, Adjusted R-squared: 0.7997
## F-statistic: 1079 on 1 and 269 DF, p-value: < 0.00000000000000022
# Calculate residuals
data$residuals <- residuals(ols_model)
# Inspect the residuals
plot(x = data$income,
y = data$residuals,
main="Residuals vs Income",
xlab="Income",
ylab="Residuals"
)
- Heteroskedasticty is a problem.
# Auxiliary regression to estimate variance
data$residuals_squared <- data$residuals^2
aux_model <- lm(residuals_squared ~ income, data = data)
summary(aux_model)
##
## Call:
## lm(formula = residuals_squared ~ income, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -711690910 -286513702 -38525957 116960696 5519489832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -216088278.7 77686594.9 -2.782 0.00579 **
## income 3139.3 425.4 7.379 0.00000000000199 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 547900000 on 269 degrees of freedom
## Multiple R-squared: 0.1683, Adjusted R-squared: 0.1653
## F-statistic: 54.45 on 1 and 269 DF, p-value: 0.000000000001987
# Predicted values
data$predicted_values <- fitted(aux_model)
# Inspect the predicted values
plot(x = data$income,
y = data$predicted_values,
main="Predicted Values vs Income",
xlab="Income",
ylab="Predicted Values"
)
# Calculate weights
data$weights <- 1 / data$predicted_values^2
# Inspect weights
head(data$weights)
## [1] 0.00000000000000006728772 0.00000000000000007089189
## [3] 0.00000000000000007479358 0.00000000000000007902648
## [5] 0.00000000000000008362915 0.00000000000000008864596
# Fit WLS model
wls_model <- lm(rent ~ income, data = data, weights = weights)
# Summarize the WLS model
summary(wls_model)
##
## Call:
## lm(formula = rent ~ income, data = data, weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.0060700 -0.0000751 -0.0000313 0.0000187 0.0013118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21625.70534 5322.84509 4.063 0.00006367145289 ***
## income 0.56497 0.07707 7.331 0.00000000000269 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0004031 on 269 degrees of freedom
## Multiple R-squared: 0.1665, Adjusted R-squared: 0.1634
## F-statistic: 53.74 on 1 and 269 DF, p-value: 0.000000000002687
# Compare OLS and WLS coefficients
coef(ols_model)
## (Intercept) income
## 30089.0812117 0.4447647
coef(wls_model)
## (Intercept) income
## 21625.7053405 0.5649709
# Compare fitted values
# Plot original data with points
plot(data$income, data$rent, main="OLS vs WLS", xlab="Income", ylab="Rent", pch=19)
# Add OLS fitted values with a thinner line
lines(data$income, fitted(ols_model), col="blue", lwd=1) # lwd=1 for thinner line
# Add WLS fitted values with a thinner line
lines(data$income, fitted(wls_model), col="red", lwd=1) # lwd=1 for thinner line
# Add a legend to the plot
legend("topleft", legend=c("Data", "OLS", "WLS"), col=c("black", "blue", "red"), pch=c(19, NA, NA), lwd=c(NA, 1, 1))
# Weighted OLS
ols_weighted <- lm(rent ~ income, data = data, weights = weights)
# Create a table with custom column labels
stargazer(
ols_model, wls_model, ols_weighted, # Models to include
type = "text", # Output type (text, html, latex, etc.)
column.labels = c("OLS Model", "WLS Model", "OLS Weighted"), # Custom labels
title = "Regression Model Comparison",# Optional title
summary = TRUE # Show summary statistics
)
##
## Regression Model Comparison
## ========================================================================
## Dependent variable:
## -----------------------------------------
## rent
## OLS Model WLS Model OLS Weighted
## (1) (2) (3)
## ------------------------------------------------------------------------
## income 0.445*** 0.565*** 0.565***
## (0.014) (0.077) (0.077)
##
## Constant 30,089.080*** 21,625.700*** 21,625.700***
## (2,472.849) (5,322.845) (5,322.845)
##
## ------------------------------------------------------------------------
## Observations 271 271 271
## R2 0.800 0.167 0.167
## Adjusted R2 0.800 0.163 0.163
## Residual Std. Error (df = 269) 17,439.840 0.0004 0.0004
## F Statistic (df = 1; 269) 1,078.689*** 53.740*** 53.740***
## ========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
# ?stargazer
# stargazer(ols_model, wls_model, ols_weighted, type = "text")