1 Background: WLS is a specific form of GLS

  • WLS: A method for handling heteroskedasticity by applying weights to the observations. In general, the WLS line should provide a better fit in the presence of significant heteroskedasticity, but how much it differs from the OLS line will depend on the specifics of your data.

  • GLS: A broader method that can handle a variety of error structures, including heteroskedasticity and correlated errors.

  • WLS is a specific form of GLS where the model is weighted to handle heteroskedasticity.

    • In WLS, the weights are typically the inverse of the estimated variances. This is effectively a transformation of the model to handle heteroskedasticity or non-constant variance of errors in regression models, which is a specific type of issue that GLS can address more generally.

2 Heteroskedasticity:

In a regression model, heteroskedasticity occurs when the variance of the errors (residuals) is not constant across all levels of the independent variable(s). This violates one of the key assumptions of Ordinary Least Squares (OLS) regression, which assumes homoskedasticity (constant variance of errors). Heteroskedasticity can lead to inefficient estimates and incorrect inferences, such as biased standard errors, which affects the reliability of hypothesis tests and confidence intervals.

2.0.1 Weighted Least Squares (WLS):

WLS is an extension of OLS that addresses heteroskedasticity by assigning different weights to different observations.

  • The key idea is to give less weight to observations with higher variance (larger errors) and more weight to observations with lower variance (smaller errors).

2.0.2 How it works:

  1. Identifying Heteroskedasticity: First, you need to detect whether heteroskedasticity is present in your model. This can be done using statistical tests like the Breusch-Pagan test or White test, or by plotting the residuals against the predicted values.

  2. Estimating Weights: Once heteroskedasticity is identified, you estimate the variance of the errors for each observation. These variances are used to calculate the weights. Typically, the weight for each observation is the inverse of its estimated variance (Feasible WLS or FWLS).

    Two major ways -

    • Direct estimation: If you have prior knowledge or a model for how the variance changes with the independent variables, you can use that model to estimate the variances. Rare.

    • Fitted Values from an Auxiliary Regression: Often, a regression model is fit where the squared residuals from the original OLS regression are regressed on the independent variables or other predictors. The predicted values from this auxiliary regression can be used as estimates of the variance.

  3. Applying Weights: In WLS, the regression model is fitted using these weights (typically the inverse of the estimated variances of the residuals), minimizing the weighted sum of squared residuals instead of the unweighted sum of squared residuals as in OLS. This effectively downweights the influence of observations with higher variance.

    \[ w_i = \dfrac{1}{\hat{\sigma_i^2}} \]

    where:

    • \(w_i\) is the weight for observation `i`,

    • \(\hat{\sigma}_i^2\) is the estimated variance of the error term for observation `i`.

  4. New Model (Apply Weights in the Regression Model): The resulting model has corrected for heteroskedasticity, leading to more efficient estimates and more reliable standard errors, making hypothesis testing more accurate.

  • After calculating the weights, you apply them to the regression model. The WLS regression minimizes the weighted sum of squared residuals:

\(\text{Minimize } \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2\)

where:

  • \(y_i\) is the observed value of the dependent variable,

  • \(\hat{y}_i\)​ is the predicted value from the regression model,

  • \(w_i\)​ is the weight for observation i\.

# Load necessary libraries
library(ggplot2)
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
# Set seed for reproducibility
set.seed(42)

# Generate income data
income <- seq(2000, 10000, by=1000)

# Generate rent data with a significant increase in variance (severe heteroskedasticity)
rent <- income * 0.3 + rnorm(length(income), mean=0, sd=income/5)  # High variance increases with income

# Add a few outliers
# Outliers at high income levels
outliers_income <- c(8000, 9000, 10000)
outliers_rent <- c(4500, 5000, 5500)  # Outlier rents

# Append outliers to the data
data <- data.frame(income = income, rent = rent)
outliers_data <- data.frame(income = outliers_income, rent = outliers_rent)
data <- rbind(data, outliers_data)

# Fit OLS model
ols_model <- lm(rent ~ income, data = data)

# Calculate residuals
data$residuals <- residuals(ols_model)

# Auxiliary regression to estimate variance
data$residuals_squared <- data$residuals^2
aux_model <- lm(residuals_squared ~ income, data = data)

# Predicted variances (fitted values from the auxiliary regression)
data$predicted_variance <- fitted(aux_model)  # These represent the predicted variances

# Ensure that predicted variances are non-negative
data$predicted_variance <- pmax(data$predicted_variance, 1e-6)  # Small positive value to avoid division by zero

# Calculate weights for WLS
data$weights <- 1 / data$predicted_variance

# Fit WLS model
wls_model <- lm(rent ~ income, data = data, weights = data$weights)

# Add fitted values to the data frame
data$ols_fitted <- fitted(ols_model)
data$wls_fitted <- fitted(wls_model)

# Plot the data with ggplot2
ggplot(data, aes(x = income)) +
  geom_point(aes(y = rent), color = "black", size = 2, alpha = 0.7) +  # Original data and outliers
  geom_line(aes(y = ols_fitted), color = "blue", size = 1) +  # OLS fitted values
  geom_line(aes(y = wls_fitted), color = "red", size = 1) +   # WLS fitted values
  labs(title = "OLS vs WLS with Severe Heteroskedasticity",
       x = "Income",
       y = "Rent") +
  theme_minimal() +
  theme(legend.position = "top") +
  scale_y_continuous(labels = scales::label_number(suffix = "", scale = 1))  # Avoid scientific notation
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Example 2

remove(list = ls()) # Clear the workspace
cat("\014")  # This will clear the R console
options(scipen = 999)  # Higher values reduce scientific notation

cat("Setup complete. Starting analysis...\n")
## Setup complete. Starting analysis...
# Sample data
set.seed(42)

income <- seq(from = 30000, to = 300000, by=1000)
summary(income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30000   97500  165000  165000  232500  300000
# Generate rent data with increasing variability (fanning out pattern)
rent <- income * 0.45 + rnorm(n = length(income), 
                              mean=30000, 
                              sd=income/10
                              )

# Create a data frame
data <- data.frame(income, rent)

# Inspect the data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
glimpse(data)
## Rows: 271
## Columns: 2
## $ income <dbl> 30000, 31000, 32000, 33000, 34000, 35000, 36000, 37000, 38000, …
## $ rent   <dbl> 47612.88, 42199.44, 45562.01, 46938.45, 46674.51, 45378.56, 516…
ggplot(data = data, mapping = aes(x = income, y = rent)) + geom_point()

# Add a few outliers
# Outliers at high income levels
#outliers_income <- c(250000, 260000, 270000)
#outliers_rent <- c(3500000, 3600000, 3700000)  # Outlier rents
#outliers_data <- data.frame(income = outliers_income, rent = outliers_rent)
#data <- rbind(data, outliers_data)

ggplot(data = data, mapping = aes(x = income, y = rent)) + geom_point()

# Fit OLS model
ols_model <- lm(rent ~ income, data = data)

# Summarize the OLS model
summary(ols_model)
## 
## Call:
## lm(formula = rent ~ income, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78987  -8876    293   8546  57725 
## 
## Coefficients:
##                Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept) 30089.08121  2472.84858   12.17 <0.0000000000000002 ***
## income          0.44476     0.01354   32.84 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17440 on 269 degrees of freedom
## Multiple R-squared:  0.8004, Adjusted R-squared:  0.7997 
## F-statistic:  1079 on 1 and 269 DF,  p-value: < 0.00000000000000022
# Calculate residuals
data$residuals <- residuals(ols_model)

# Inspect the residuals
plot(x = data$income, 
     y = data$residuals, 
     main="Residuals vs Income", 
     xlab="Income", 
     ylab="Residuals"
     )

- Heteroskedasticty is a problem.

# Auxiliary regression to estimate variance
data$residuals_squared <- data$residuals^2
aux_model <- lm(residuals_squared ~ income, data = data)

summary(aux_model)
## 
## Call:
## lm(formula = residuals_squared ~ income, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -711690910 -286513702  -38525957  116960696 5519489832 
## 
## Coefficients:
##                 Estimate   Std. Error t value         Pr(>|t|)    
## (Intercept) -216088278.7   77686594.9  -2.782          0.00579 ** 
## income            3139.3        425.4   7.379 0.00000000000199 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 547900000 on 269 degrees of freedom
## Multiple R-squared:  0.1683, Adjusted R-squared:  0.1653 
## F-statistic: 54.45 on 1 and 269 DF,  p-value: 0.000000000001987
# Predicted values
data$predicted_values <- fitted(aux_model)

# Inspect the predicted values
plot(x = data$income, 
     y = data$predicted_values, 
     main="Predicted Values vs Income",
     xlab="Income",
     ylab="Predicted Values"
     )

# Calculate weights
data$weights <- 1 / data$predicted_values^2


# Inspect weights
head(data$weights)
## [1] 0.00000000000000006728772 0.00000000000000007089189
## [3] 0.00000000000000007479358 0.00000000000000007902648
## [5] 0.00000000000000008362915 0.00000000000000008864596
# Fit WLS model
wls_model <- lm(rent ~ income, data = data, weights = weights)

# Summarize the WLS model
summary(wls_model)
## 
## Call:
## lm(formula = rent ~ income, data = data, weights = weights)
## 
## Weighted Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0060700 -0.0000751 -0.0000313  0.0000187  0.0013118 
## 
## Coefficients:
##                Estimate  Std. Error t value         Pr(>|t|)    
## (Intercept) 21625.70534  5322.84509   4.063 0.00006367145289 ***
## income          0.56497     0.07707   7.331 0.00000000000269 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0004031 on 269 degrees of freedom
## Multiple R-squared:  0.1665, Adjusted R-squared:  0.1634 
## F-statistic: 53.74 on 1 and 269 DF,  p-value: 0.000000000002687
# Compare OLS and WLS coefficients
coef(ols_model)
##   (Intercept)        income 
## 30089.0812117     0.4447647
coef(wls_model)
##   (Intercept)        income 
## 21625.7053405     0.5649709
# Compare fitted values


# Plot original data with points
plot(data$income, data$rent, main="OLS vs WLS", xlab="Income", ylab="Rent", pch=19)

# Add OLS fitted values with a thinner line
lines(data$income, fitted(ols_model), col="blue", lwd=1)  # lwd=1 for thinner line

# Add WLS fitted values with a thinner line
lines(data$income, fitted(wls_model), col="red", lwd=1)  # lwd=1 for thinner line

# Add a legend to the plot
legend("topleft", legend=c("Data", "OLS", "WLS"), col=c("black", "blue", "red"), pch=c(19, NA, NA), lwd=c(NA, 1, 1))

# Weighted OLS
ols_weighted <- lm(rent ~ income, data = data, weights = weights)
# Create a table with custom column labels
stargazer(
  ols_model, wls_model, ols_weighted,   # Models to include
  type = "text",                        # Output type (text, html, latex, etc.)
  column.labels = c("OLS Model", "WLS Model", "OLS Weighted"),  # Custom labels
  title = "Regression Model Comparison",# Optional title
  summary = TRUE                        # Show summary statistics
)
## 
## Regression Model Comparison
## ========================================================================
##                                           Dependent variable:           
##                                -----------------------------------------
##                                                  rent                   
##                                  OLS Model     WLS Model   OLS Weighted 
##                                     (1)           (2)           (3)     
## ------------------------------------------------------------------------
## income                           0.445***      0.565***      0.565***   
##                                   (0.014)       (0.077)       (0.077)   
##                                                                         
## Constant                       30,089.080*** 21,625.700*** 21,625.700***
##                                 (2,472.849)   (5,322.845)   (5,322.845) 
##                                                                         
## ------------------------------------------------------------------------
## Observations                        271           271           271     
## R2                                 0.800         0.167         0.167    
## Adjusted R2                        0.800         0.163         0.163    
## Residual Std. Error (df = 269)  17,439.840      0.0004        0.0004    
## F Statistic (df = 1; 269)      1,078.689***    53.740***     53.740***  
## ========================================================================
## Note:                                        *p<0.1; **p<0.05; ***p<0.01
# ?stargazer
# stargazer(ols_model, wls_model, ols_weighted, type = "text")