One of the central assumptions of the Classical Linear Regression Model (CLRM) is homoscedasticity: \[\text{Var}(u_i \mid X) = \sigma^2\]
When this assumption is violated, we have heteroscedasticity: \[\text{Var}(u_i \mid X) = \sigma_i^2\] that is, the error variance differs across observations.
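As a quick illustration (simulated data, not from the examples below), the following sketch generates errors whose standard deviation grows with the regressor and plots the OLS residuals, which fan out as x increases:

library(ggplot2)

# Illustrative simulation: Var(u_i | x_i) = (0.5 * x_i)^2 differs across observations
set.seed(123)
n <- 200
x <- runif(n, 1, 10)
u <- rnorm(n, mean = 0, sd = 0.5 * x)
y <- 2 + 1.5 * x + u

sim <- data.frame(x = x, residuo = resid(lm(y ~ x)))
ggplot(sim, aes(x = x, y = residuo)) +
  geom_point(alpha = 0.6) +
  labs(title = "Simulated heteroscedasticity: residual spread increases with x")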
This problem is common in cross-sectional data, especially when the units of observation differ widely in scale (for example, households with very different income levels or firms of very different sizes).
For this reason, ignoring heteroscedasticity compromises statistical inference: the OLS point estimates remain unbiased, but the usual standard errors and the t and F tests are no longer valid.
Under homoscedasticity:
\[ \text{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1} \]
Under heteroscedasticity: \[ \text{Var}(\hat{\beta}) = (X'X)^{-1} X'\Omega X (X'X)^{-1} \]
where \(\Omega\) is the variance-covariance matrix of the errors, diagonal with elements \(\sigma_i^2\), that is: \[ \Omega = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix} \]
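Since \(\Omega\) is unknown, White's heteroskedasticity-consistent (HC0) estimator replaces each \(\sigma_i^2\) with the squared OLS residual \(\hat{u}_i^2\). A minimal sketch, using the wage1 model estimated below and comparing with sandwich::vcovHC():

library(wooldridge)
library(sandwich)

m <- lm(wage ~ educ + exper + tenure, data = wage1)
X <- model.matrix(m)
u <- resid(m)

XtX_inv <- solve(crossprod(X))           # (X'X)^{-1}
meat    <- crossprod(X, X * u^2)         # X' diag(u_i^2) X
V_hc0   <- XtX_inv %*% meat %*% XtX_inv  # sandwich form

sqrt(diag(V_hc0))                        # robust standard errors
all.equal(V_hc0, vcovHC(m, type = "HC0"), check.attributes = FALSE)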
library(wooldridge)
library(ggplot2)

data(wage1)
modelo <- lm(wage ~ educ + exper + tenure, data = wage1)

residuos <- resid(modelo)
fitted_vals <- fitted(modelo)
df <- data.frame(Fitted = fitted_vals, Residuals_sq = residuos^2)

ggplot(df, aes(x = Fitted, y = Residuals_sq)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
  labs(title = "Squared Residuals vs. Fitted Values",
       x = "Fitted Values",
       y = "Squared Residuals") +
  geom_smooth(method = "loess", se = FALSE, color = "red", linewidth = 1) +
  theme_minimal()
Hypotheses: \[ H_0: \text{the error is homoscedastic} \quad \text{vs.} \quad H_1: \text{the error is heteroscedastic} \]
Test statistic: \[ \chi^2 = n R^2 \sim \chi^2_q \]
where \(R^2\) comes from the auxiliary regression of the squared residuals on the regressors and \(q\) is the number of regressors (the statistic is also computed by hand below, after the bptest() output).
A practical example:
- If we model consumption as a function of income, we want to know whether the error is more variable at higher income levels.
- Ignoring this can produce misleading confidence intervals for the effect of income.
Example with the wage1 data (wooldridge package):
suppressPackageStartupMessages(library(lmtest))
suppressPackageStartupMessages(library(wooldridge))
model <- lm(wage ~ educ + exper + tenure, data = wage1)
bptest(model) # Breusch-Pagan test
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 43.096, df = 3, p-value = 2.349e-09
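The studentized BP statistic reported above is simply \(nR^2\) from the auxiliary regression of the squared residuals on the regressors; a minimal check by hand (it should reproduce BP = 43.096):

u2  <- resid(model)^2
aux <- lm(u2 ~ educ + exper + tenure, data = wage1)  # auxiliary regression

BP <- nobs(model) * summary(aux)$r.squared           # n * R^2
BP
pchisq(BP, df = 3, lower.tail = FALSE)               # p-value with q = 3 degrees of freedom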
A generalization of the BP test that includes squared terms and interactions in the auxiliary regression: \[ u_i^2 = \alpha_0 + \alpha_1 X_i + \alpha_2 X_i^2 + \alpha_3 X_i X_j + \varepsilon_i \]
Intuitively, it checks for heteroscedasticity of essentially any functional form.
Advantage: it does not require specifying the exact form of the variance. Disadvantage: the auxiliary regression is sensitive to multicollinearity. A sketch in R follows below.
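A minimal sketch of this test (White's test) in R, using the varformula argument of bptest(): the first call is the "special case" based on the fitted values (as in Wooldridge), the second includes all squares and cross-products of the regressors:

# Special-case White test: auxiliary regression on fitted values and their squares
bptest(model, ~ fitted(model) + I(fitted(model)^2), data = wage1)

# Full White test: regressors, their squares and pairwise interactions
bptest(model, ~ educ * exper + educ * tenure + exper * tenure +
         I(educ^2) + I(exper^2) + I(tenure^2), data = wage1)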
library(sandwich)
coeftest(model, vcov. = vcovHC) # HC = Heteroskedasticity-Consistent
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.872735 0.819016 -3.5075 0.0004913 ***
## educ 0.598965 0.061860 9.6826 < 2.2e-16 ***
## exper 0.022340 0.010663 2.0950 0.0366507 *
## tenure 0.169269 0.029854 5.6698 2.368e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
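The call above passes vcovHC, whose default for lm objects is the HC3 variant; to use, for example, the HC1 correction (the one applied by Stata's robust option), specify the type explicitly:

coeftest(model, vcov. = vcovHC(model, type = "HC1"))  # HC1 = HC0 rescaled by n/(n - k)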
If the error variance is proportional to \(x_i^2\), i.e. \(\text{Var}(u_i \mid x_i) = \sigma^2 x_i^2\), dividing every term by \(x_i\) yields a homoscedastic error: \[ \frac{y_i}{x_i} = \frac{\beta_0}{x_i} + \beta_1 + \frac{u_i}{x_i} \]
library(gujarati)
data(Table11_3)

# apply() with MARGIN = 2 coerces every column to numeric
Table11_3 <- data.frame(apply(Table11_3, MARGIN = 2, as.numeric))

modelo_ols <- lm(Y ~ X, data = Table11_3)
modelo_wls <- lm(Y ~ X, weights = (1/X^2), data = Table11_3)  # weights = 1/x_i^2
summary(modelo_ols)
##
## Call:
## lm(formula = Y ~ X, data = Table11_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.9808 -4.8798 0.8808 6.1226 20.2636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.29031 5.23139 1.776 0.0866 .
## X 0.63778 0.02862 22.287 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.183 on 28 degrees of freedom
## Multiple R-squared: 0.9466, Adjusted R-squared: 0.9447
## F-statistic: 496.7 on 1 and 28 DF, p-value: < 2.2e-16
summary(modelo_wls)
##
## Call:
## lm(formula = Y ~ X, data = Table11_3, weights = (1/X^2))
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.084660 -0.044245 0.005397 0.039712 0.087252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.27903 3.78273 2.717 0.0112 *
## X 0.63187 0.02666 23.703 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05209 on 28 degrees of freedom
## Multiple R-squared: 0.9525, Adjusted R-squared: 0.9508
## F-statistic: 561.9 on 1 and 28 DF, p-value: < 2.2e-16
More generally, if \(\text{Var}(u_i \mid \mathbf{x}_i) = \sigma^2 h_i\) with \(h_i > 0\) known (or estimable), dividing every term by \(\sqrt{h_i}\) restores homoscedasticity: \[\begin{align*} \frac{y_i}{\sqrt{h_i}} &= \frac{\beta_0}{\sqrt{h_i}} + \frac{\beta_1 x_{i1}}{\sqrt{h_i}} + \cdots + \frac{\beta_k x_{ik}}{\sqrt{h_i}} + \frac{u_i}{\sqrt{h_i}} \end{align*}\]
library(wooldridge)
data(wage1)
# assume Var(u | educ) proportional to educ^2; the 0.1 avoids division by zero when educ = 0
wage1$weights <- 1/(wage1$educ + .1)^2
modelo_wls <- lm(wage ~ educ + exper + tenure, data = wage1, weights = weights)
summary(modelo_wls)
##
## Call:
## lm(formula = wage ~ educ + exper + tenure, data = wage1, weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.19096 -0.16042 -0.06233 0.09837 1.29358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.236232 0.197772 16.363 <2e-16 ***
## educ 0.158986 0.012453 12.767 <2e-16 ***
## exper -0.008323 0.004753 -1.751 0.0805 .
## tenure 0.110774 0.009423 11.756 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2544 on 522 degrees of freedom
## Multiple R-squared: 0.7461, Adjusted R-squared: 0.7446
## F-statistic: 511.2 on 3 and 522 DF, p-value: < 2.2e-16
Steps (following Wooldridge): estimate the original model by OLS and save the residuals; regress \(\log(\hat{u}^2)\) on the regressors; take \(\hat{h}_i = \exp(\text{fitted value}_i)\) as the estimated variance function; re-estimate the model by WLS with weights \(1/\hat{h}_i\).
u_hat <- resid(lm(wage ~ educ + exper + tenure, data = wage1))  # (1) OLS residuals
lhu2 <- log(u_hat^2)                                            # (2) log of squared residuals
modelo_h <- lm(lhu2 ~ educ + exper + tenure, data = wage1)      #     regressed on the regressors
hhat <- exp(fitted(modelo_h))                                   # (3) estimated variance function
modelo_fgls <- lm(wage ~ educ + exper + tenure, data = wage1, weights = 1/hhat)  # (4) WLS with 1/h_hat
summary(modelo_fgls)
##
## Call:
## lm(formula = wage ~ educ + exper + tenure, data = wage1, weights = 1/hhat)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -4.5426 -1.3984 -0.4874 0.9856 12.5386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.59554 0.41407 1.438 0.15096
## educ 0.29689 0.03161 9.392 < 2e-16 ***
## exper 0.03186 0.00930 3.426 0.00066 ***
## tenure 0.15022 0.02435 6.169 1.38e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.147 on 522 degrees of freedom
## Multiple R-squared: 0.2101, Adjusted R-squared: 0.2055
## F-statistic: 46.27 on 3 and 522 DF, p-value: < 2.2e-16
The hprice1 dataset contains information on house sale prices at a single point in time, together with variables that affect those prices. The main variables are: price (house price, in thousands of dollars), assess (assessed value, in thousands of dollars), bdrms (number of bedrooms), lotsize (lot size, in square feet), sqrft (floor area, in square feet), colonial (= 1 if the house is colonial style) and the log transforms lprice, lassess, llotsize and lsqrft.
Use the hprice1 data from the wooldridge package to:
- explore the data and the relationship between price and its determinants;
- estimate the model price ~ lotsize + sqrft + bdrms;
- check for heteroscedasticity graphically and with the Breusch-Pagan test;
- compare the usual OLS standard errors with heteroscedasticity-robust (HC) standard errors;
- re-estimate the model by WLS/FGLS and compare the results.
suppressMessages(library(dplyr))
suppressMessages(library(skimr))
data(hprice1)
skimr::skim(hprice1)
| Name | hprice1 |
|---|---|
| Number of rows | 88 |
| Number of columns | 10 |
| Column type frequency: | |
| numeric | 10 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| price | 0 | 1 | 293.55 | 102.71 | 111.00 | 230.00 | 265.50 | 326.25 | 725.00 | ▃▇▂▁▁ |
| assess | 0 | 1 | 315.74 | 95.31 | 198.70 | 253.90 | 290.20 | 352.12 | 708.60 | ▇▃▁▁▁ |
| bdrms | 0 | 1 | 3.57 | 0.84 | 2.00 | 3.00 | 3.00 | 4.00 | 7.00 | ▇▆▁▁▁ |
| lotsize | 0 | 1 | 9019.86 | 10174.15 | 1000.00 | 5732.75 | 6430.00 | 8583.25 | 92681.00 | ▇▁▁▁▁ |
| sqrft | 0 | 1 | 2013.69 | 577.19 | 1171.00 | 1660.50 | 1845.00 | 2227.00 | 3880.00 | ▆▇▃▁▁ |
| colonial | 0 | 1 | 0.69 | 0.46 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
| lprice | 0 | 1 | 5.63 | 0.30 | 4.71 | 5.44 | 5.58 | 5.79 | 6.59 | ▁▅▇▂▁ |
| lassess | 0 | 1 | 5.72 | 0.26 | 5.29 | 5.54 | 5.67 | 5.86 | 6.56 | ▅▇▃▂▁ |
| llotsize | 0 | 1 | 8.91 | 0.54 | 6.91 | 8.65 | 8.77 | 9.06 | 11.44 | ▁▆▇▁▁ |
| lsqrft | 0 | 1 | 7.57 | 0.26 | 7.07 | 7.41 | 7.52 | 7.71 | 8.26 | ▂▇▅▂▁ |
ggplot(hprice1, aes(x = lotsize, y = price)) +
  geom_point() + geom_smooth(method = "lm") +
  labs(title = "Price vs. Lot Size")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(hprice1, aes(x = sqrft, y = price)) +
  geom_point() + geom_smooth(method = "lm") +
  labs(title = "Price vs. Floor Area")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(hprice1, aes(x = factor(bdrms), y = price)) +
  geom_boxplot() +
  labs(title = "Price vs. Number of Bedrooms")
modelo <- lm(price ~ lotsize + sqrft + bdrms, data = hprice1)
summary(modelo)
##
## Call:
## lm(formula = price ~ lotsize + sqrft + bdrms, data = hprice1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.026 -38.530 -6.555 32.323 209.376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.177e+01 2.948e+01 -0.739 0.46221
## lotsize 2.068e-03 6.421e-04 3.220 0.00182 **
## sqrft 1.228e-01 1.324e-02 9.275 1.66e-14 ***
## bdrms 1.385e+01 9.010e+00 1.537 0.12795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.83 on 84 degrees of freedom
## Multiple R-squared: 0.6724, Adjusted R-squared: 0.6607
## F-statistic: 57.46 on 3 and 84 DF, p-value: < 2.2e-16
residuos2 <- resid(modelo)^2
valores_ajustados <- fitted(modelo)

ggplot(data.frame(res2 = residuos2, fit = valores_ajustados),
       aes(x = fit, y = res2)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Squared Residuals vs. Fitted Values",
       x = "Fitted Values",
       y = "Squared Residuals")
## `geom_smooth()` using formula = 'y ~ x'
bptest(modelo)
##
## studentized Breusch-Pagan test
##
## data: modelo
## BP = 14.092, df = 3, p-value = 0.002782
The table below compares the OLS estimates with their usual and heteroscedasticity-robust (HC) standard errors and p-values:
| term | estimate | se_ols | p_ols | se_robust | p_robust |
|---|---|---|---|---|---|
| (Intercept) | -21.7703 | 29.4750 | 0.4622 | 41.0327 | 0.5971 |
| lotsize | 0.0021 | 0.0006 | 0.0018 | 0.0071 | 0.7731 |
| sqrft | 0.1228 | 0.0132 | 0.0000 | 0.0407 | 0.0034 |
| bdrms | 13.8525 | 9.0101 | 0.1279 | 11.5618 | 0.2342 |
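The code that generated this table is not shown; one possible sketch, assuming the broom package (in addition to dplyr, sandwich and lmtest, already loaded), pairs the usual OLS results with the robust ones from coeftest() (here using vcovHC's default variant):

library(broom)

ols_tab    <- tidy(modelo)                                    # usual OLS estimates and SEs
robust_tab <- tidy(coeftest(modelo, vcov. = vcovHC(modelo)))  # robust (HC) SEs and p-values

ols_tab %>%
  select(term, estimate, se_ols = std.error, p_ols = p.value) %>%
  left_join(robust_tab %>%
              select(term, se_robust = std.error, p_robust = p.value),
            by = "term")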
u_hat <- resid(modelo)                                          # OLS residuals
lhu2 <- log(u_hat^2)                                            # log of squared residuals
modelo_h <- lm(lhu2 ~ lotsize + sqrft + bdrms, data = hprice1)  # variance regression
hhat <- exp(fitted(modelo_h))                                   # estimated variance function
modelo_wls <- lm(price ~ lotsize + sqrft + bdrms, data = hprice1, weights = 1/hhat)
summary(modelo_wls)
##
## Call:
## lm(formula = price ~ lotsize + sqrft + bdrms, data = hprice1,
## weights = 1/hhat)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -4.9267 -1.1313 -0.2846 1.1836 9.8270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.911604 30.823535 1.489 0.14010
## lotsize 0.004135 0.001426 2.901 0.00475 **
## sqrft 0.092462 0.014866 6.220 1.86e-08 ***
## bdrms 6.175451 8.893592 0.694 0.48937
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 84 degrees of freedom
## Multiple R-squared: 0.4675, Adjusted R-squared: 0.4485
## F-statistic: 24.58 on 3 and 84 DF, p-value: 1.633e-11
| Predictors | Estimates (OLS with robust errors) | Statistic | p | Estimates (WLS via FGLS) | Statistic | p |
|---|---|---|---|---|---|---|
| (Intercept) | -21.7703 | -0.7386 | 0.462 | 45.9116 | 1.4895 | 0.140 |
| lotsize | 0.0021 | 3.2201 | 0.002 | 0.0041 | 2.9010 | 0.005 |
| sqrft | 0.1228 | 9.2751 | <0.001 | 0.0925 | 6.2197 | <0.001 |
| bdrms | 13.8525 | 1.5374 | 0.128 | 6.1755 | 0.6944 | 0.489 |
| Observations | 88 | | | 88 | | |
| R² / adjusted R² | 0.672 / 0.661 | | | 0.468 / 0.449 | | |