É fornecido um conjunto de dados, que pode ser baixado em http://www.de.ufpe.br/~cribari/educacao.dat. O objetivo é fornecer uma análise exploratória dos dados, ajustar modelos de regressão linear supondo homoscedasticidade e heteroscedasticidade e plotar gráficos dos resíduos.
Segue o banco de dados:
# Trabalho de Análise Estatística 2
# conjunto de dados: educacao e renda per capita nos EUA
educacao = read.table("http://www.de.ufpe.br/~cribari/educacao.dat")
educacao = na.exclude(educacao) # excluir a observação faltante do estado de Wisconsin
attach(educacao)
educacao
## Expenditure Income
## Alabama 275 6247
## Alaska 821 10851
## Arizona 339 7374
## Arkansas 275 6183
## California 387 8850
## Colorado 452 8001
## Connecticut 531 8914
## Delaware 424 8604
## Florida 316 7505
## Georgia 265 6700
## Hawaii 403 8380
## Idaho 304 6813
## Illinois 437 8745
## Indiana 345 7696
## Iowa 431 7873
## Kansas 355 8001
## Kentucky 260 6615
## Louisiana 316 6640
## Maine 327 6333
## Maryland 427 8306
## Massachusetts 427 8063
## Michigan 466 8442
## Minnesota 477 7847
## Mississippi 259 5736
## Missouri 274 7342
## Montana 433 7051
## Nebraska 294 7391
## Nevada 359 9032
## New Hampshire 279 7277
## New Jersey 423 8818
## New Mexico 388 6505
## New York 447 8267
## North Carolina 335 6607
## North Dakota 311 7478
## Ohio 322 7812
## Oklahoma 320 6951
## Oregon 397 7839
## Pennsylvania 412 7733
## Rhode Island 342 7526
## South Carolina 315 6242
## South Dakota 321 6841
## Tennessee 268 6489
## Texas 315 7697
## Utah 417 6622
## Vermont 353 6541
## Virginia 356 7624
## Washington 415 8450
## Washington DC 428 10022
## West Virginia 320 6456
## Wyoming 500 9096
with(educacao,c("Expenditure","Income"))
## [1] "Expenditure" "Income"
Income = Income/10^4
Segue os boxplots e histogramas das variáveis Expenditure (dependente) e Income(independente).
## Expenditure Income
## Min. :259.0 Min. : 5736
## 1st Qu.:315.2 1st Qu.: 6655
## Median :354.0 Median : 7575
## Mean :373.3 Mean : 7609
## 3rd Qu.:426.2 3rd Qu.: 8296
## Max. :821.0 Max. :10851
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Segue o código para ajustar modelos de mínimos quadrados ordinários:
modelo1 <- lm(Expenditure ~ Income)
summary(modelo1)
##
## Call:
## lm(formula = Expenditure ~ Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.390 -42.146 -6.162 30.630 224.210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -151.27 64.12 -2.359 0.0224 *
## Income 689.39 83.50 8.256 9.05e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61.41 on 48 degrees of freedom
## Multiple R-squared: 0.5868, Adjusted R-squared: 0.5782
## F-statistic: 68.16 on 1 and 48 DF, p-value: 9.055e-11
ggplot(educacao,aes(x=Income/10000,y=Expenditure))+stat_summary(fun.data=mean_cl_normal) +
geom_smooth(method='lm',formula=y~x)
Outro modelo solicitado:
x3 = Income^2
modelo2 <- lm(Expenditure ~ Income + x3)
summary(modelo2)
##
## Call:
## lm(formula = Expenditure ~ Income + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -160.709 -36.896 -4.551 37.290 109.729
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 832.9 327.3 2.545 0.01428 *
## Income -1834.2 829.0 -2.213 0.03182 *
## x3 1587.0 519.1 3.057 0.00368 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.68 on 47 degrees of freedom
## Multiple R-squared: 0.6553, Adjusted R-squared: 0.6407
## F-statistic: 44.68 on 2 and 47 DF, p-value: 1.345e-11
Agora ajustaremos o modelo de MQP e plotaremos no mesmo gráfico com o MQO.
##
## Call:
## lm(formula = Expenditure ~ Income, weights = 1/Expenditure^2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.25120 -0.09262 0.01823 0.11806 0.34667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.15 55.59 -1.118 0.269
## Income 551.59 75.46 7.310 2.47e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1412 on 48 degrees of freedom
## Multiple R-squared: 0.5268, Adjusted R-squared: 0.5169
## F-statistic: 53.43 on 1 and 48 DF, p-value: 2.467e-09
Gráfico de \(\hat{e}_t\) versus \(x_t\) (Income):
Gráfico de \(\hat{e}_t\) versus \(\hat{y}_t\):
Gráfico de \(\hat{e}_t\) versus \(\hat{y}_t\):
Gráfico de \(\hat{e}_t\) versus \(x_t\):
# modelo log(y) = beta_1 + beta_2*log(x) + erro
modelolog <- lm(log(Expenditure) ~ log(Income))
summary(modelolog)
##
## Call:
## lm(formula = log(Expenditure) ~ log(Income))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24954 -0.11179 -0.01208 0.09235 0.35579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.25186 0.04812 129.917 < 2e-16 ***
## log(Income) 1.25962 0.15405 8.177 1.19e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1456 on 48 degrees of freedom
## Multiple R-squared: 0.5821, Adjusted R-squared: 0.5734
## F-statistic: 66.86 on 1 and 48 DF, p-value: 1.193e-10
plot(log(Income), log(Expenditure))
abline(modelolog, col="blue")
Executaremos três testes de homoscedasticidade em cada um dos três modelos. Os testes são Goldfeld-Quandt, Breusch-Pagan e Koenker.
suppressPackageStartupMessages(library(lmtest))
gqtest(modelo1, fraction=length(Income)/3) # modelo MQO
##
## Goldfeld-Quandt test
##
## data: modelo1
## GQ = 0.59559, df1 = 15, df2 = 14, p-value = 0.8347
gqtest(modelo2, fraction=length(Income)/3) # modelo y_t = beta1 + beta2*x + beta3*x^2 + erro
##
## Goldfeld-Quandt test
##
## data: modelo2
## GQ = 1.25, df1 = 14, df2 = 13, p-value = 0.3466
gqtest(modelolog, fraction=length(Income)/3) # modelo log(y) = beta1 + beta2*log(x) + erro
##
## Goldfeld-Quandt test
##
## data: modelolog
## GQ = 1.1786, df1 = 15, df2 = 14, p-value = 0.3818
bptest(modelo1, studentize = FALSE) # modelo MQO
##
## Breusch-Pagan test
##
## data: modelo1
## BP = 24.962, df = 1, p-value = 5.849e-07
bptest(modelo2, studentize = FALSE) # modelo y_t = beta1 + beta2*x + beta3*x^2 + erro
##
## Breusch-Pagan test
##
## data: modelo2
## BP = 18.903, df = 2, p-value = 7.855e-05
bptest(modelolog, studentize = FALSE) # modelo log(y) = beta1 + beta2*log(x) + erro
##
## Breusch-Pagan test
##
## data: modelolog
## BP = 1.8639, df = 1, p-value = 0.1722
bptest(modelo1, studentize = TRUE) # modelo MQO
##
## studentized Breusch-Pagan test
##
## data: modelo1
## BP = 11.729, df = 1, p-value = 0.0006153
bptest(modelo2, studentize = TRUE) # modelo y_t = beta1 + beta2*x + beta3*x^2 + erro
##
## studentized Breusch-Pagan test
##
## data: modelo2
## BP = 15.834, df = 2, p-value = 0.0003645
bptest(modelolog, studentize = TRUE) # modelo log(y) = beta1 + beta2*log(x) + erro
##
## studentized Breusch-Pagan test
##
## data: modelolog
## BP = 2.2754, df = 1, p-value = 0.1314