Heteroscedasticity exists when the error terms are related to the predictors. For example, heteroscedasticity will be present in a model of income ~ age
because income is more variable for adults than for teenagers.
Heteroscedasticity is a problem because statistical tests of significance assume the modelling errors are uncorrelated and uniform.
Weighted least squares corrects the non-constant variance by weighting each observation by the reciprocal of its estimated variance. Observations with small estimated variances are weighted higher than observations with large estimated variances. Modify the ordinary least squares model \(\hat{\beta} = (X'X)^{-1}X'Y\) to include a diagonal matrix of weights \(W\) where \(w_{ii}\) is the reciprocal of the \(i^{th}\) observation.
\[\hat{\beta} = (X'WX)^{-1}X'WY\]
The weighted least squares equation collapses to the ordinarly least squares equation if the weights are constant. The trick with weighted least squares is the estimation of \(W\). If the variances of the observations or their functional form are somehow known, you can use that. More likely, you need to estimate \(W\) from residual plots.
Dataset cal
(ca_learning_new.txt) records the cost of the computer time to complete a computer assisted learning exercise for n = 12 subjects.
id
: subject identifiernum.responses
: number of responses in lessoncost
: computer time costlibrary(readr)
cal <- read_tsv(file = "./Data/ca_learning_new.txt")
head(cal)
## # A tibble: 6 x 3
## id num.responses cost
## <dbl> <dbl> <dbl>
## 1 1 16 77
## 2 2 14 70
## 3 3 22 85
## 4 4 10 50
## 5 5 14 62
## 6 6 17 70
The scatterplot shows a simple linear relationship cost ~ num.responses
. But there may be a slight megaphone pattern in the residuals.
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
ggplot(cal, aes(y = cost, x = num.responses)) +
geom_point() +
geom_smooth(method = lm, se = FALSE,
color = "black",
size = 0.5,
linetype = "dashed") +
labs(title = "Scatterplot of cost ~ num.reponses")
Fit the model cost ~ num.responses
.
cal.lm <- lm(cost ~ num.responses, data = cal)
summary.lm(cal.lm)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.472689 5.5161841 3.530101 5.445888e-03
## num.responses 3.268908 0.3650515 8.954649 4.330241e-06
paste("R2: ", round(summary.lm(cal.lm)$r.squared, 3))
## [1] "R2: 0.889"
The residuals vs fitted values plot exhibits a megaphone shape.
library(dplyr)
data.frame(y = rstandard(cal.lm),
x = cal.lm$fitted.values) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dotted") +
labs(title = "Standardized Residuals vs Fitted Values Plot")
Create weights by extracting the fitted values of a regression of \(\epsilon \sim y\). The new slope parameter estimator (num.responses = 3.421, se = 0.370) is about the same as in the original model (num.responses = 3.269, se = 0.365).
cal.weights <- 1 / lm(abs(cal.lm$residuals) ~ cal.lm$fitted.values)$fitted.values^2
cal.lmw <- lm(cost ~ num.responses,
data = cal,
weights = cal.weights)
summary.lm(cal.lmw)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.300637 4.827736 3.583592 4.981868e-03
## num.responses 3.421106 0.370310 9.238492 3.268919e-06
paste("R2: ", round(summary.lm(cal.lmw)$r.squared, 3))
## [1] "R2: 0.895"
Now the studentized residuals are constant.
data.frame(y = rstandard(cal.lmw),
x = cal.lm$fitted.values) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dotted") +
labs(title = "Standardized Residuals vs Fitted Values Plot")
The regression estimates have not changed much from the ordinary least squares method. The following plot shows both the OLS fitted line (black) and WLS fitted line (red). The WLS fit is more heavily weighted on the middle points where the variance is lower.
ggplot(data = cal, aes(y = cost, x = num.responses)) +
geom_point() +
geom_smooth(method = lm, se = FALSE,
color = "black",
size = 0.5,
linetype = "dashed") +
geom_smooth(method = lm, se = FALSE,
aes(weight = cal.weights),
color = "red",
size = 0.5,
linetype = "dashed") +
labs(title = "Scatterplot of cost ~ num.reponses")