For an ordinary least squares regression we assume that variance of the error term are constant, when this assumption is violated we can use a weighted least square regression method.

This note is mainly from STAT 501 of the Pennsylvania State University.

Obtain the data set Galton.

galton<-read.table("https://online.stat.psu.edu/onlinecourses/sites/stat501/files/data/galton.txt", header = T, sep = "", dec = ".")

Run unweighted least square regression

lm_uw<-lm(Progeny~Parent,data=galton)
summary(lm_uw)
## 
## Call:
## lm(formula = Progeny ~ Parent, data = galton)
## 
## Residuals:
##          1          2          3          4          5          6          7 
##  0.0014714  0.0016714 -0.0032286 -0.0008286 -0.0014286  0.0010714  0.0012714 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.127029   0.006993  18.164 9.29e-06 ***
## Parent      0.210000   0.038614   5.438  0.00285 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002043 on 5 degrees of freedom
## Multiple R-squared:  0.8554, Adjusted R-squared:  0.8265 
## F-statistic: 29.58 on 1 and 5 DF,  p-value: 0.002852

Run weighted least square regression with known weights, the weights are \(\frac{1}{SD^2}\), here \(SD\)s are from the Galton dataset, we know these weights.

lm_w<-lm(Progeny~Parent,weight=1/SD^2, data=galton)
summary(lm_w)
## 
## Call:
## lm(formula = Progeny ~ Parent, data = galton, weights = 1/SD^2)
## 
## Weighted Residuals:
##        1        2        3        4        5        6        7 
##  0.08187  0.09162 -0.16753 -0.04067 -0.08950  0.06071  0.06328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.127964   0.006811  18.787 7.87e-06 ***
## Parent      0.204801   0.038155   5.368  0.00302 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.11 on 5 degrees of freedom
## Multiple R-squared:  0.8521, Adjusted R-squared:  0.8225 
## F-statistic: 28.81 on 1 and 5 DF,  p-value: 0.003021

Suppose we don’t have weights from the data set, we need to estimate these weights from the data. Here are some rules to estimate the weights from the data. I copied directly from the STAT 501 class note.

  1. If a residual plot against a predictor exhibits a megaphone shape, then regress the absolute values of the residuals against that predictor. The resulting fitted values of this regression are estimates of \(\sigma_i\).(And remember \(w_i=\frac{1}{\sigma_i^2}\)).

  2. If a residual plot against the fitted values exhibits a megaphone shape, then regress the absolute values of the residuals against the fitted values. The resulting fitted values of this regression are estimates of \(\sigma_i\).

  3. If a residual plot of the squared residuals against a predictor exhibits an upward trend, then regress the squared residuals against that predictor. The resulting fitted values of this regression are estimates of \(\sigma_i^2\)

  4. If a residual plot of the squared residuals against the fitted values exhibits an upward trend, then regress the squared residuals against the fitted values. The resulting fitted values of this regression are estimates of \(\sigma_i^2\)

We plot residual agains fitted value (rule 2)

library(ggplot2)
library(dplyr)

data.frame(y = rstandard(lm_uw),
           x = lm_uw$fitted.values) %>%
  ggplot(aes(x = x, y = y)) + 
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dotted") +
  labs(title = "Standardized Residuals vs Fitted Values Plot")

Suppose we think the plot has a megaphone shape (rule 2), therefore we will use the regression of the absolute values of the residuals against the fitted values (from the unweighted regression, i.e the regression \(|e_i|\sim \hat{y_i}\) to estimate the \(\sigma_i\))

The following R code will estimate the \(\sigma_i\) using the regression \(|e_i|\sim \hat{y_i}\)

lm_res<-lm(abs(lm_uw$residuals) ~ lm_uw$fitted.values)  #absolute value of residuals regress on fitted value from the unweighted regresson

sigma<-(lm_res$fitted.values) # sigma is estimated from the fitted values 

res.weight<-1/sigma^2  #weights

Run weighted least square regression with weights estimated from the data (by rule 2)

lm_w_res<-lm(Progeny~Parent,weight=res.weight, data=galton)
summary(lm_w_res)
## 
## Call:
## lm(formula = Progeny ~ Parent, data = galton, weights = res.weight)
## 
## Weighted Residuals:
##       1       2       3       4       5       6       7 
##  1.0376  1.1242 -1.7833 -0.5103 -1.0949  0.5719  0.6549 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.130158   0.006318  20.603 4.99e-06 ***
## Parent      0.192455   0.036216   5.314  0.00316 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.244 on 5 degrees of freedom
## Multiple R-squared:  0.8496, Adjusted R-squared:  0.8195 
## F-statistic: 28.24 on 1 and 5 DF,  p-value: 0.003155

Compare the results from the three methods (1. unweighted, 2. weighting using known weights, 3.weighting using weights estimated from the data by linear regression)

plot(Progeny ~ Parent, data = galton)
abline(lm_uw,col="black",lty=1)
abline(lm_w, col="blue",lty=2)

abline(lm_w_res,col='red',lty=3)

legend(0.15, 0.172, legend=c("unweighted", "known weights",'weights from data'),
       col=c("black", "blue","red"), lty=1:3, cex=0.8)

We see that weighted least square regressions produce smaller coefficients than unweighted regression for this data.

References

1.https://online.stat.psu.edu/stat501/lesson/13/13.1

2.https://rpubs.com/mpfoley73/500818