We are used to selection criteria based on p-values, but there are several other methods that can indicate, among the variety of models we can fit, which one best explains the data, and thus guide the choice of the most adequate model.
The AIC (Akaike Information Criterion) selects the model that minimizes the variance of the residuals while penalizing the excess of parameters:
\[ AIC_p = -2\log(L_p) + 2[(p+1) + 1] \]
where \(L_p\) is the maximized likelihood of the model with \(p\) predictors, and \((p+1)+1\) counts the \(p+1\) regression coefficients plus the error variance.
In R we have:
modelo <- lm(cars$speed ~ cars$dist)
summary(modelo)  # p-values of the t tests
##
## Call:
## lm(formula = cars$speed ~ cars$dist)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.5293 -2.1550  0.3615  2.4377  6.4179
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## cars$dist    0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
AIC(modelo) #AIC
## [1] 260.7755
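As a quick check (our own sketch, not part of the original analysis), AIC(modelo) can be reproduced by hand from the log-likelihood, with \((p+1)+1 = 3\) parameters for this model:
p <- 1  # one predictor (dist)
-2 * as.numeric(logLik(modelo)) + 2 * ((p + 1) + 1)  # matches AIC(modelo) above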
The BIC (Bayesian Information Criterion) is a variation of the AIC built on the same idea, but its penalty grows with the sample size \(n\):
\[ BIC_p = -2\log(L_p) + [(p+1) + 1]\log(n) \]
In R, reusing the model fitted above, we have:
BIC(modelo) #BIC
## [1] 266.5115
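These criteria are most useful for comparing candidate models, with lower values indicating the preferred model. As an illustrative sketch (the quadratic competitor below is hypothetical, not part of the original analysis):
m1 <- lm(speed ~ dist, data = cars)
m2 <- lm(speed ~ dist + I(dist^2), data = cars)  # hypothetical competitor
AIC(m1, m2)  # lower AIC indicates the preferred model
BIC(m1, m2)  # same reading for BIC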
Another route is k-fold cross-validation, which estimates the prediction error of the model on data not used in the fit. The cvLm() function from the cvTools package runs 5-fold cross-validation on the model fitted above:
library(cvTools)  # provides cvLm()
cvLm(modelo)
## 5-fold CV results:
## CV
## 6.323379
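To make explicit what the 5-fold procedure does, here is a minimal hand-rolled sketch (our own illustration; the seed, fold assignment, and RMSE cost are assumptions, so the value will not match cvLm() exactly):
set.seed(1)  # hypothetical seed, for reproducibility only
k <- 5
folds <- sample(rep(1:k, length.out = nrow(cars)))  # random fold labels
erro <- numeric(k)
for (i in 1:k) {
  ajuste <- lm(speed ~ dist, data = cars[folds != i, ])      # fit on k-1 folds
  pred_i <- predict(ajuste, newdata = cars[folds == i, ])    # predict held-out fold
  erro[i] <- sqrt(mean((cars$speed[folds == i] - pred_i)^2)) # fold RMSE
}
mean(erro)  # cross-validated prediction error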
An alternative to cross-validation is the holdout approach: we split the data into two sets, a training set used to fit the model and a test set used to evaluate its predictions. In R we have:
# First, randomly assign each observation to group 1 (train) or 2 (test), with probabilities 0.7 and 0.3
amostra <- sample(1:2, nrow(cars), replace = TRUE, prob = c(0.7, 0.3)); amostra
## [1] 1 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 1 2 1 2 1 1 2
## [36] 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1
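A side note (our addition, not in the original): sample() draws a new split on every run, so all results below change from run to run; calling set.seed() beforehand fixes the split:
set.seed(123)  # hypothetical seed; call it before sample() to make the split reproducible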
# The training set:
treino <- cars[amostra == 1, ]
# The test set:
teste <- cars[amostra == 2, ]
modelo <- lm(speed ~ dist, data = treino)
summary(modelo)
##
## Call:
## lm(formula = speed ~ dist, data = treino)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -7.095 -2.486  0.347  2.863  6.665
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.16232    1.09028   7.486 1.09e-08 ***
## dist         0.16166    0.02157   7.493 1.07e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.33 on 34 degrees of freedom
## Multiple R-squared: 0.6229, Adjusted R-squared: 0.6118
## F-statistic: 56.15 on 1 and 34 DF, p-value: 1.07e-08
pred <- predict(modelo, newdata = teste)  # predictions for the test set
cbind(predicted = pred, observed = teste$speed, error = pred - teste$speed)
## (one row per test observation: the predicted speed, the observed speed, and their difference; the exact values depend on the random split)
plot(pred - teste$speed)  # prediction errors on the test set
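As a brief follow-up (our addition), the prediction errors can be summarized in a single number, the test-set RMSE:
sqrt(mean((pred - teste$speed)^2))  # root mean squared prediction error on the test set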