Estudo: Renda per capita e gastos com educação nos estados norte-americanos

É fornecido um conjunto de dados, que pode ser baixado em http://www.de.ufpe.br/~cribari/educacao.dat. O objetivo é fornecer uma análise exploratória dos dados, ajustar modelos de regressão linear supondo homoscedasticidade e heteroscedasticidade e plotar gráficos dos resíduos.

Segue o banco de dados:

#  Trabalho de Análise Estatística 2

#  conjunto de dados: educacao e renda per capita nos EUA
educacao = read.table("http://www.de.ufpe.br/~cribari/educacao.dat")
educacao = na.exclude(educacao)  # excluir a observação faltante do estado de Wisconsin
attach(educacao)
educacao
##                Expenditure Income
## Alabama                275   6247
## Alaska                 821  10851
## Arizona                339   7374
## Arkansas               275   6183
## California             387   8850
## Colorado               452   8001
## Connecticut            531   8914
## Delaware               424   8604
## Florida                316   7505
## Georgia                265   6700
## Hawaii                 403   8380
## Idaho                  304   6813
## Illinois               437   8745
## Indiana                345   7696
## Iowa                   431   7873
## Kansas                 355   8001
## Kentucky               260   6615
## Louisiana              316   6640
## Maine                  327   6333
## Maryland               427   8306
## Massachusetts          427   8063
## Michigan               466   8442
## Minnesota              477   7847
## Mississippi            259   5736
## Missouri               274   7342
## Montana                433   7051
## Nebraska               294   7391
## Nevada                 359   9032
## New Hampshire          279   7277
## New Jersey             423   8818
## New Mexico             388   6505
## New York               447   8267
## North Carolina         335   6607
## North Dakota           311   7478
## Ohio                   322   7812
## Oklahoma               320   6951
## Oregon                 397   7839
## Pennsylvania           412   7733
## Rhode Island           342   7526
## South Carolina         315   6242
## South Dakota           321   6841
## Tennessee              268   6489
## Texas                  315   7697
## Utah                   417   6622
## Vermont                353   6541
## Virginia               356   7624
## Washington             415   8450
## Washington DC          428  10022
## West Virginia          320   6456
## Wyoming                500   9096
with(educacao,c("Expenditure","Income"))
## [1] "Expenditure" "Income"
Income = Income/10^4

Análise Exploratória

Segue os boxplots e histogramas das variáveis Expenditure (dependente) e Income(independente).

##   Expenditure        Income     
##  Min.   :259.0   Min.   : 5736  
##  1st Qu.:315.2   1st Qu.: 6655  
##  Median :354.0   Median : 7575  
##  Mean   :373.3   Mean   : 7609  
##  3rd Qu.:426.2   3rd Qu.: 8296  
##  Max.   :821.0   Max.   :10851

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Mínimos Quadrados Ordinários (MQO)

Segue o código para ajustar modelos de mínimos quadrados ordinários:

modelo1 <- lm(Expenditure ~ Income)
summary(modelo1)
## 
## Call:
## lm(formula = Expenditure ~ Income)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -112.390  -42.146   -6.162   30.630  224.210 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -151.27      64.12  -2.359   0.0224 *  
## Income        689.39      83.50   8.256 9.05e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.41 on 48 degrees of freedom
## Multiple R-squared:  0.5868, Adjusted R-squared:  0.5782 
## F-statistic: 68.16 on 1 and 48 DF,  p-value: 9.055e-11
ggplot(educacao,aes(x=Income/10000,y=Expenditure))+stat_summary(fun.data=mean_cl_normal) + 
  geom_smooth(method='lm',formula=y~x)

Outro modelo solicitado:

x3 = Income^2
modelo2 <- lm(Expenditure ~ Income + x3)
summary(modelo2)
## 
## Call:
## lm(formula = Expenditure ~ Income + x3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -160.709  -36.896   -4.551   37.290  109.729 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    832.9      327.3   2.545  0.01428 * 
## Income       -1834.2      829.0  -2.213  0.03182 * 
## x3            1587.0      519.1   3.057  0.00368 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.68 on 47 degrees of freedom
## Multiple R-squared:  0.6553, Adjusted R-squared:  0.6407 
## F-statistic: 44.68 on 2 and 47 DF,  p-value: 1.345e-11

Mínimos Quadrados Ponderados (MQP)

Agora ajustaremos o modelo de MQP e plotaremos no mesmo gráfico com o MQO.

## 
## Call:
## lm(formula = Expenditure ~ Income, weights = 1/Expenditure^2)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25120 -0.09262  0.01823  0.11806  0.34667 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -62.15      55.59  -1.118    0.269    
## Income        551.59      75.46   7.310 2.47e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1412 on 48 degrees of freedom
## Multiple R-squared:  0.5268, Adjusted R-squared:  0.5169 
## F-statistic: 53.43 on 1 and 48 DF,  p-value: 2.467e-09

Resíduos

Resíduo MQO:

Gráfico de \(\hat{e}_t\) versus \(x_t\) (Income):

Gráfico de \(\hat{e}_t\) versus \(\hat{y}_t\):

Resíduo MQP:

Gráfico de \(\hat{e}_t\) versus \(\hat{y}_t\):

Gráfico de \(\hat{e}_t\) versus \(x_t\):

Outro modelo:

#  modelo log(y) = beta_1 + beta_2*log(x) + erro
modelolog <- lm(log(Expenditure) ~ log(Income)) 
summary(modelolog)
## 
## Call:
## lm(formula = log(Expenditure) ~ log(Income))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.24954 -0.11179 -0.01208  0.09235  0.35579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.25186    0.04812 129.917  < 2e-16 ***
## log(Income)  1.25962    0.15405   8.177 1.19e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1456 on 48 degrees of freedom
## Multiple R-squared:  0.5821, Adjusted R-squared:  0.5734 
## F-statistic: 66.86 on 1 and 48 DF,  p-value: 1.193e-10
plot(log(Income), log(Expenditure))
abline(modelolog, col="blue")

Testes para homoscedasticidade:

Executaremos três testes de homoscedasticidade em cada um dos três modelos. Os testes são Goldfeld-Quandt, Breusch-Pagan e Koenker.

Goldfeld-Quandt:

suppressPackageStartupMessages(library(lmtest))
gqtest(modelo1, fraction=length(Income)/3)  # modelo MQO
## 
##  Goldfeld-Quandt test
## 
## data:  modelo1
## GQ = 0.59559, df1 = 15, df2 = 14, p-value = 0.8347
gqtest(modelo2, fraction=length(Income)/3)  # modelo y_t = beta1 + beta2*x + beta3*x^2 + erro
## 
##  Goldfeld-Quandt test
## 
## data:  modelo2
## GQ = 1.25, df1 = 14, df2 = 13, p-value = 0.3466
gqtest(modelolog, fraction=length(Income)/3) # modelo log(y) = beta1 + beta2*log(x) + erro
## 
##  Goldfeld-Quandt test
## 
## data:  modelolog
## GQ = 1.1786, df1 = 15, df2 = 14, p-value = 0.3818

Breusch-Pagan

bptest(modelo1, studentize = FALSE)  # modelo MQO
## 
##  Breusch-Pagan test
## 
## data:  modelo1
## BP = 24.962, df = 1, p-value = 5.849e-07
bptest(modelo2, studentize = FALSE)  # modelo y_t = beta1 + beta2*x + beta3*x^2 + erro
## 
##  Breusch-Pagan test
## 
## data:  modelo2
## BP = 18.903, df = 2, p-value = 7.855e-05
bptest(modelolog, studentize = FALSE) # modelo log(y) = beta1 + beta2*log(x) + erro
## 
##  Breusch-Pagan test
## 
## data:  modelolog
## BP = 1.8639, df = 1, p-value = 0.1722

Koenker

bptest(modelo1, studentize = TRUE)  # modelo MQO
## 
##  studentized Breusch-Pagan test
## 
## data:  modelo1
## BP = 11.729, df = 1, p-value = 0.0006153
bptest(modelo2, studentize = TRUE)  # modelo y_t = beta1 + beta2*x + beta3*x^2 + erro
## 
##  studentized Breusch-Pagan test
## 
## data:  modelo2
## BP = 15.834, df = 2, p-value = 0.0003645
bptest(modelolog, studentize = TRUE) # modelo log(y) = beta1 + beta2*log(x) + erro
## 
##  studentized Breusch-Pagan test
## 
## data:  modelolog
## BP = 2.2754, df = 1, p-value = 0.1314