Leandro Valter
leandro.vvalter@gmail.com
Universidade Estadual da Paraíba
Centro de Ciência e Tecnologia
Departamento de Estatística
Medições diárias da qualidade do ar em Nova York, de maio a setembro de 1973.
Leituras diárias dos seguintes valores de qualidade do ar de 1º de maio de 1973 (terça-feira) até 30 de setembro de 1973.
Os dados foram obtidos do Departamento de Conservação do Estado de Nova York (dados de ozônio) e do Serviço Nacional de Meteorologia (dados meteorológicos).
O objetivo da análise foi modelar a incidência radiação solar em função das demais variáveis.
library(corrplot)
library(ggplot2)
library(hnp)
data("airquality")
head(airquality);dim(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## [1] 153 6
air=airquality[complete.cases(airquality),]
dim(air)
## [1] 111 6
head(air)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
attach(air)
air$Month <- factor(air$Month, levels=5:9, labels=month.abb[5:9], ordered=TRUE)
air$Day <- factor(air$Day, levels=c(1:31), ordered=TRUE)
#round(fBasics::basicStats(air[1:4]),3)
| Variável | N | Média | Mínimo | Máximo | D.P | Mediana | Cs | Ck | Cv% |
|---|---|---|---|---|---|---|---|---|---|
| Radiação Solar | 111 | 184,8 | 7,0 | 334,0 | 91,1 | 207,0 | -0,5 | -0,9 | 49,46 |
| Ozônio | 111 | 42,1 | 1,0 | 168,0 | 33,3 | 31,0 | 1,2 | 1,1 | 49,32 |
| Vento | 111 | 9,9 | 2,3 | 20,7 | 3,5 | 9,7 | 0,4 | 0,2 | 35,79 |
| Temperatura | 111 | 77,8 | 57,0 | 97,0 | 9,5 | 79,0 | -0,2 | -0,7 | 12,25 |
Radiação Solar
ggplot(data=air) + geom_histogram(mapping=aes(Solar.R))+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(Month, Solar.R, data=air, geom="boxplot", color=Month)+theme_bw()
Ozônio
ggplot(data=air) + geom_histogram(mapping=aes(Ozone))+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(Month, Ozone, data=air, geom="boxplot", color=Month)+theme_bw()
Vento
ggplot(data=air) + geom_histogram(mapping=aes(Wind))+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(Month, Wind, data=air, geom="boxplot", color=Month)+theme_bw()
Temperatura
ggplot(data=air) + geom_histogram(mapping=aes(Temp))+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(Month, Temp, data=air, geom="boxplot", color=Month)+theme_bw()
plot(air[1:4], pch=16, col="blue", main="Matriz Scatterplot",panel = panel.smooth)
corrplot(cor(air[1:4]), method = "number")
mod1 = lm(Solar.R ~ Ozone + Wind + Temp + Month, data=air)
summary(mod1)
##
## Call:
## lm(formula = Solar.R ~ Ozone + Wind + Temp + Month, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -171.724 -57.983 2.083 70.904 168.219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -129.0415 121.9027 -1.059 0.2923
## Ozone 0.8643 0.3917 2.206 0.0296 *
## Wind 4.0413 2.9336 1.378 0.1713
## Temp 3.0478 1.5480 1.969 0.0517 .
## Month.L -42.1320 20.8969 -2.016 0.0464 *
## Month.Q 27.8612 22.4353 1.242 0.2171
## Month.C 11.2063 23.0923 0.485 0.6285
## Month^4 26.7106 20.3539 1.312 0.1923
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.31 on 103 degrees of freedom
## Multiple R-squared: 0.199, Adjusted R-squared: 0.1446
## F-statistic: 3.656 on 7 and 103 DF, p-value: 0.001452
shapiro.test(mod1$residuals)
##
## Shapiro-Wilk normality test
##
## data: mod1$residuals
## W = 0.97664, p-value = 0.04841
par(mfrow=c(2,2))
plot(mod1, pch=16)
# modelo aditivo completo
ajuste_lm_completo <- lm(Solar.R ~ Ozone + Wind + Temp + Month, data = air)
# modelo forward
#step(ajuste_lm_completo, direction = "forward")
step(ajuste_lm_completo, direction = "backward")
## Start: AIC=992.14
## Solar.R ~ Ozone + Wind + Temp + Month
##
## Df Sum of Sq RSS AIC
## - Month 4 53544 785604 991.98
## <none> 732060 992.14
## - Wind 1 13488 745549 992.17
## - Temp 1 27552 759612 994.24
## - Ozone 1 34602 766663 995.27
##
## Step: AIC=991.98
## Solar.R ~ Ozone + Wind + Temp
##
## Df Sum of Sq RSS AIC
## - Temp 1 6593 792198 990.91
## - Wind 1 12857 798461 991.78
## <none> 785604 991.98
## - Ozone 1 48871 834475 996.68
##
## Step: AIC=990.91
## Solar.R ~ Ozone + Wind
##
## Df Sum of Sq RSS AIC
## - Wind 1 10862 803060 990.42
## <none> 792198 990.91
## - Ozone 1 106980 899178 1002.97
##
## Step: AIC=990.42
## Solar.R ~ Ozone
##
## Df Sum of Sq RSS AIC
## <none> 803060 990.42
## - Ozone 1 110902 913962 1002.78
##
## Call:
## lm(formula = Solar.R ~ Ozone, data = air)
##
## Coefficients:
## (Intercept) Ozone
## 144.6306 0.9542
mod2 = lm(Solar.R ~ Ozone, data=air)
summary(mod2)
##
## Call:
## lm(formula = Solar.R ~ Ozone, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -153.577 -65.349 -0.555 68.110 176.011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 144.6306 13.1749 10.98 < 2e-16 ***
## Ozone 0.9542 0.2459 3.88 0.000179 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 85.83 on 109 degrees of freedom
## Multiple R-squared: 0.1213, Adjusted R-squared: 0.1133
## F-statistic: 15.05 on 1 and 109 DF, p-value: 0.0001793
hnp(mod2)
## Gaussian model (lm object)
Verificando os resíduos do modelo 2
par(mfrow=c(2,2))
plot(mod2, pch=16)
Teste de normalidades dos resíduos
shapiro.test(mod2$residuals)
##
## Shapiro-Wilk normality test
##
## data: mod2$residuals
## W = 0.96849, p-value = 0.009956
Como a variável resposta é contínua e assimétrica serão testadas as distribuições Gamma e Normal Inversa para modelagem.
glmod1 <- glm(Solar.R ~ Ozone + Wind + Temp + Month, data = air, family = Gamma(link="inverse"))
summary(glmod1)
##
## Call:
## glm(formula = Solar.R ~ Ozone + Wind + Temp + Month, family = Gamma(link = "inverse"),
## data = air)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.04525 -0.42664 -0.03418 0.37084 0.81366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.515e-02 3.952e-03 3.833 0.000218 ***
## Ozone -2.116e-05 1.196e-05 -1.770 0.079741 .
## Wind -1.164e-04 9.798e-05 -1.188 0.237584
## Temp -9.555e-05 4.922e-05 -1.941 0.054967 .
## Month.L 1.278e-03 7.201e-04 1.775 0.078900 .
## Month.Q -8.229e-04 7.048e-04 -1.167 0.245707
## Month.C -2.412e-04 7.880e-04 -0.306 0.760131
## Month^4 -7.919e-04 6.580e-04 -1.204 0.231527
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.2741131)
##
## Null deviance: 53.366 on 110 degrees of freedom
## Residual deviance: 48.237 on 103 degrees of freedom
## AIC: 1353.8
##
## Number of Fisher Scoring iterations: 5
glmod2 <- glm(Solar.R ~ Ozone + Temp + Month , data = air, family = Gamma(link="inverse"))
summary(glmod2)
##
## Call:
## glm(formula = Solar.R ~ Ozone + Temp + Month, family = Gamma(link = "inverse"),
## data = air)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.08298 -0.40628 -0.01818 0.33652 0.79035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.330e-02 3.697e-03 3.598 0.000492 ***
## Ozone -1.449e-05 1.062e-05 -1.364 0.175397
## Temp -9.104e-05 4.999e-05 -1.821 0.071447 .
## Month.L 1.425e-03 7.193e-04 1.982 0.050152 .
## Month.Q -7.310e-04 7.031e-04 -1.040 0.300898
## Month.C -3.591e-04 7.913e-04 -0.454 0.650949
## Month^4 -6.310e-04 6.474e-04 -0.975 0.331964
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.2748499)
##
## Null deviance: 53.366 on 110 degrees of freedom
## Residual deviance: 48.615 on 104 degrees of freedom
## AIC: 1352.7
##
## Number of Fisher Scoring iterations: 5
glmod3 <- glm(Solar.R ~ Ozone, data = air, family = Gamma(link="inverse"))
summary(glmod3)
##
## Call:
## glm(formula = Solar.R ~ Ozone, family = Gamma(link = "inverse"),
## data = air)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.09373 -0.39415 0.02583 0.34742 0.82250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.489e-03 4.351e-04 14.913 < 2e-16 ***
## Ozone -2.257e-05 6.341e-06 -3.559 0.000553 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.2645698)
##
## Null deviance: 53.366 on 110 degrees of freedom
## Residual deviance: 50.545 on 109 degrees of freedom
## AIC: 1347.4
##
## Number of Fisher Scoring iterations: 5
ajuste = c('glmod1', 'glmod2','glmod3')
aic = c(AIC(glmod1), AIC(glmod2),AIC(glmod3))
deviance = c(deviance(glmod1), deviance(glmod2),deviance(glmod3))
verossimilhanca = c(logLik(glmod1),logLik(glmod2),logLik(glmod3))
data.frame(ajuste, aic, verossimilhanca,deviance)
## ajuste aic verossimilhanca deviance
## 1 glmod1 1353.802 -667.9010 48.23678
## 2 glmod2 1352.730 -668.3651 48.61480
## 3 glmod3 1347.368 -670.6842 50.54520
# Envelope
par(mai = c(1.2, 1.2, 0.5, 0.1))
plot(meD[, 1], meD[, 2], pch = 20, ylim = range(meD[, -1]),
cex.axis = 1.2, cex.lab = 1.2, xlab = "Quantis N(0,1)",
ylab = "")
lines(meD[, 1], meD[, 3])
lines(meD[, 1], meD[, 4], lty = 2)
lines(meD[, 1], meD[, 5])
glmod1 <- glm(Solar.R ~ Ozone + Wind + Temp + Month , data = air,
family= inverse.gaussian)
summary(glmod1)
##
## Call:
## glm(formula = Solar.R ~ Ozone + Wind + Temp + Month, family = inverse.gaussian,
## data = air)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.36029 -0.03053 -0.00232 0.02737 0.05825
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.344e-04 4.381e-05 3.067 0.00276 **
## Ozone -2.044e-07 1.321e-07 -1.547 0.12481
## Wind -1.200e-06 1.122e-06 -1.070 0.28722
## Temp -1.035e-06 5.455e-07 -1.898 0.06050 .
## Month.L 1.324e-05 8.149e-06 1.624 0.10739
## Month.Q -8.344e-06 7.619e-06 -1.095 0.27597
## Month.C -1.948e-06 8.845e-06 -0.220 0.82613
## Month^4 -8.374e-06 7.324e-06 -1.143 0.25553
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for inverse.gaussian family taken to be 0.001729752)
##
## Null deviance: 0.74572 on 110 degrees of freedom
## Residual deviance: 0.71874 on 103 degrees of freedom
## AIC: 1431.6
##
## Number of Fisher Scoring iterations: 6
glmod2 <- glm(Solar.R ~ Ozone + Temp , data = air,
family= inverse.gaussian)
summary(glmod2)
##
## Call:
## glm(formula = Solar.R ~ Ozone + Temp, family = inverse.gaussian,
## data = air)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.36179 -0.02881 -0.00035 0.02515 0.06444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.542e-05 3.125e-05 2.413 0.0175 *
## Ozone -1.523e-07 9.292e-08 -1.639 0.1042
## Temp -4.809e-07 4.197e-07 -1.146 0.2545
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for inverse.gaussian family taken to be 0.001690236)
##
## Null deviance: 0.74572 on 110 degrees of freedom
## Residual deviance: 0.72940 on 108 degrees of freedom
## AIC: 1423.2
##
## Number of Fisher Scoring iterations: 6
glmod3 <- glm(Solar.R ~ Ozone , data = air,
family= inverse.gaussian(link="inverse"))
summary(glmod3)
##
## Call:
## glm(formula = Solar.R ~ Ozone, family = inverse.gaussian(link = "inverse"),
## data = air)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.36142 -0.03018 -0.00072 0.02555 0.06052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.676e-03 4.628e-04 14.425 < 2e-16 ***
## Ozone -2.649e-05 7.458e-06 -3.552 0.000566 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for inverse.gaussian family taken to be 0.001647603)
##
## Null deviance: 0.74572 on 110 degrees of freedom
## Residual deviance: 0.72906 on 109 degrees of freedom
## AIC: 1421.1
##
## Number of Fisher Scoring iterations: 2
shapiro.test(glmod3$residuals)
##
## Shapiro-Wilk normality test
##
## data: glmod3$residuals
## W = 0.98024, p-value = 0.09897
ajuste = c('glmod1', 'glmod2','glmod3')
aic = c(AIC(glmod1), AIC(glmod2),AIC(glmod3))
deviance = c(deviance(glmod1), deviance(glmod2),deviance(glmod3))
verossimilhanca = c(logLik(glmod1),logLik(glmod2),logLik(glmod3))
data.frame(ajuste, aic, verossimilhanca,deviance)
## ajuste aic verossimilhanca deviance
## 1 glmod1 1431.560 -706.7800 0.7187393
## 2 glmod2 1423.195 -707.5973 0.7294022
## 3 glmod3 1421.142 -707.5709 0.7290561
hnp(glmod3$residuals, sim = 99,resid.type ='deviance',how.many.out=T ,conf = 0.95,scale = T)
## Half-normal plot with simulated envelope generated assuming the residuals are
## normally distributed under the null hypothesis.
## Estimated mean: -3.04843e-18
## Estimated variance: 9.939771e-06
## Total points: 111
## Points out of envelope: 0 ( 0 %)
pred.ozonio <- data.frame(Ozone= c(20, 42, 170))
predict(glmod3, interval = 'prediction', newdata = pred.ozonio, type = 'response')
## 1 2 3
## 162.6898 179.7297 460.1239
Portanto, o número esperado de radiação solar para New York com o valor 20 (ozônio) é igual a 167.24; para o valor 31 este número passa para 179.13, e no teceiro valor o número esperado de radiação solar é de 473.36.
Sendo assim, o modelo de Normal Inversa é uma alternativa para modelagem desses dados de assimétricos e contínuos. Utilizar os Modelos Lineares Generalizados foram melhores que o modelo linear clássico.
“Essencialmente, todos os modelos estão errados, mas alguns são úteis.” - George E. P. Box