Leandro Valter
leandro.vvalter@gmail.com
Universidade Estadual da Paraíba
Centro de Ciência e Tecnologia
Departamento de Estatística

Dados

Medições diárias da qualidade do ar em Nova York, de maio a setembro de 1973.

Leituras diárias dos seguintes valores de qualidade do ar de 1º de maio de 1973 (terça-feira) até 30 de setembro de 1973.

  • Ozone: o ozônio médio em partes por bilhão de 1300 a 1500 horas na Ilha Roosevelt;
  • Solar.R:Radiação solar em Langleys na banda de frequência 4000-7700 Angstroms das 08:00 às 12:00 horas no Central Park;
  • Wind: Velocidade média do vento em milhas por hora às 07:00 e 1000 horas no Aeroporto LaGuardia; -Temp: temperatura máxima diária em graus Fahrenheit no aeroporto de La Guardia.

Os dados foram obtidos do Departamento de Conservação do Estado de Nova York (dados de ozônio) e do Serviço Nacional de Meteorologia (dados meteorológicos).

O objetivo da análise foi modelar a incidência radiação solar em função das demais variáveis.

Pacotes

library(corrplot) 
library(ggplot2)
library(hnp)

Banco de dados

data("airquality")
head(airquality);dim(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
## [1] 153   6

Removendo os NA’s

air=airquality[complete.cases(airquality),]
dim(air)
## [1] 111   6
head(air)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 7    23     299  8.6   65     5   7
## 8    19      99 13.8   59     5   8
attach(air)

Fatores

air$Month <- factor(air$Month, levels=5:9, labels=month.abb[5:9], ordered=TRUE)
air$Day <- factor(air$Day, levels=c(1:31), ordered=TRUE)

Análise descritiva

#round(fBasics::basicStats(air[1:4]),3)
Variável N Média Mínimo Máximo D.P Mediana Cs Ck Cv%
Radiação Solar 111 184,8 7,0 334,0 91,1 207,0 -0,5 -0,9 49,46
Ozônio 111 42,1 1,0 168,0 33,3 31,0 1,2 1,1 49,32
Vento 111 9,9 2,3 20,7 3,5 9,7 0,4 0,2 35,79
Temperatura 111 77,8 57,0 97,0 9,5 79,0 -0,2 -0,7 12,25

Histogramas e Box-Plots

Radiação Solar

ggplot(data=air) + geom_histogram(mapping=aes(Solar.R))+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(Month, Solar.R, data=air, geom="boxplot", color=Month)+theme_bw()

Ozônio

ggplot(data=air) + geom_histogram(mapping=aes(Ozone))+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(Month, Ozone, data=air, geom="boxplot", color=Month)+theme_bw()

Vento

ggplot(data=air) + geom_histogram(mapping=aes(Wind))+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(Month, Wind, data=air, geom="boxplot", color=Month)+theme_bw()

Temperatura

ggplot(data=air) + geom_histogram(mapping=aes(Temp))+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(Month, Temp, data=air, geom="boxplot", color=Month)+theme_bw()

Correlação

plot(air[1:4], pch=16, col="blue", main="Matriz Scatterplot",panel = panel.smooth)

corrplot(cor(air[1:4]), method = "number")

Modelo Linear Clássico

mod1 = lm(Solar.R ~ Ozone + Wind + Temp + Month, data=air)
summary(mod1)
## 
## Call:
## lm(formula = Solar.R ~ Ozone + Wind + Temp + Month, data = air)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -171.724  -57.983    2.083   70.904  168.219 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -129.0415   121.9027  -1.059   0.2923  
## Ozone          0.8643     0.3917   2.206   0.0296 *
## Wind           4.0413     2.9336   1.378   0.1713  
## Temp           3.0478     1.5480   1.969   0.0517 .
## Month.L      -42.1320    20.8969  -2.016   0.0464 *
## Month.Q       27.8612    22.4353   1.242   0.2171  
## Month.C       11.2063    23.0923   0.485   0.6285  
## Month^4       26.7106    20.3539   1.312   0.1923  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 84.31 on 103 degrees of freedom
## Multiple R-squared:  0.199,  Adjusted R-squared:  0.1446 
## F-statistic: 3.656 on 7 and 103 DF,  p-value: 0.001452
shapiro.test(mod1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mod1$residuals
## W = 0.97664, p-value = 0.04841
par(mfrow=c(2,2))
plot(mod1, pch=16)

Seleção de variáveis

# modelo aditivo completo
ajuste_lm_completo <- lm(Solar.R ~ Ozone + Wind + Temp + Month, data = air)
# modelo forward
#step(ajuste_lm_completo, direction = "forward")
step(ajuste_lm_completo, direction = "backward")
## Start:  AIC=992.14
## Solar.R ~ Ozone + Wind + Temp + Month
## 
##         Df Sum of Sq    RSS    AIC
## - Month  4     53544 785604 991.98
## <none>               732060 992.14
## - Wind   1     13488 745549 992.17
## - Temp   1     27552 759612 994.24
## - Ozone  1     34602 766663 995.27
## 
## Step:  AIC=991.98
## Solar.R ~ Ozone + Wind + Temp
## 
##         Df Sum of Sq    RSS    AIC
## - Temp   1      6593 792198 990.91
## - Wind   1     12857 798461 991.78
## <none>               785604 991.98
## - Ozone  1     48871 834475 996.68
## 
## Step:  AIC=990.91
## Solar.R ~ Ozone + Wind
## 
##         Df Sum of Sq    RSS     AIC
## - Wind   1     10862 803060  990.42
## <none>               792198  990.91
## - Ozone  1    106980 899178 1002.97
## 
## Step:  AIC=990.42
## Solar.R ~ Ozone
## 
##         Df Sum of Sq    RSS     AIC
## <none>               803060  990.42
## - Ozone  1    110902 913962 1002.78
## 
## Call:
## lm(formula = Solar.R ~ Ozone, data = air)
## 
## Coefficients:
## (Intercept)        Ozone  
##    144.6306       0.9542
mod2 = lm(Solar.R ~ Ozone, data=air)
summary(mod2)
## 
## Call:
## lm(formula = Solar.R ~ Ozone, data = air)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -153.577  -65.349   -0.555   68.110  176.011 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 144.6306    13.1749   10.98  < 2e-16 ***
## Ozone         0.9542     0.2459    3.88 0.000179 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 85.83 on 109 degrees of freedom
## Multiple R-squared:  0.1213, Adjusted R-squared:  0.1133 
## F-statistic: 15.05 on 1 and 109 DF,  p-value: 0.0001793
hnp(mod2)
## Gaussian model (lm object)

Verificando os resíduos do modelo 2

par(mfrow=c(2,2))
plot(mod2, pch=16)

Teste de normalidades dos resíduos

shapiro.test(mod2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mod2$residuals
## W = 0.96849, p-value = 0.009956

Modelo Linear Generalizado

Como a variável resposta é contínua e assimétrica serão testadas as distribuições Gamma e Normal Inversa para modelagem.

Gama

glmod1 <- glm(Solar.R ~ Ozone + Wind + Temp + Month, data = air, family = Gamma(link="inverse"))
summary(glmod1)
## 
## Call:
## glm(formula = Solar.R ~ Ozone + Wind + Temp + Month, family = Gamma(link = "inverse"), 
##     data = air)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.04525  -0.42664  -0.03418   0.37084   0.81366  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.515e-02  3.952e-03   3.833 0.000218 ***
## Ozone       -2.116e-05  1.196e-05  -1.770 0.079741 .  
## Wind        -1.164e-04  9.798e-05  -1.188 0.237584    
## Temp        -9.555e-05  4.922e-05  -1.941 0.054967 .  
## Month.L      1.278e-03  7.201e-04   1.775 0.078900 .  
## Month.Q     -8.229e-04  7.048e-04  -1.167 0.245707    
## Month.C     -2.412e-04  7.880e-04  -0.306 0.760131    
## Month^4     -7.919e-04  6.580e-04  -1.204 0.231527    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.2741131)
## 
##     Null deviance: 53.366  on 110  degrees of freedom
## Residual deviance: 48.237  on 103  degrees of freedom
## AIC: 1353.8
## 
## Number of Fisher Scoring iterations: 5
glmod2 <- glm(Solar.R ~ Ozone + Temp + Month , data = air, family = Gamma(link="inverse"))
summary(glmod2)
## 
## Call:
## glm(formula = Solar.R ~ Ozone + Temp + Month, family = Gamma(link = "inverse"), 
##     data = air)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.08298  -0.40628  -0.01818   0.33652   0.79035  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.330e-02  3.697e-03   3.598 0.000492 ***
## Ozone       -1.449e-05  1.062e-05  -1.364 0.175397    
## Temp        -9.104e-05  4.999e-05  -1.821 0.071447 .  
## Month.L      1.425e-03  7.193e-04   1.982 0.050152 .  
## Month.Q     -7.310e-04  7.031e-04  -1.040 0.300898    
## Month.C     -3.591e-04  7.913e-04  -0.454 0.650949    
## Month^4     -6.310e-04  6.474e-04  -0.975 0.331964    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.2748499)
## 
##     Null deviance: 53.366  on 110  degrees of freedom
## Residual deviance: 48.615  on 104  degrees of freedom
## AIC: 1352.7
## 
## Number of Fisher Scoring iterations: 5
glmod3 <- glm(Solar.R ~ Ozone, data = air, family = Gamma(link="inverse"))
summary(glmod3)
## 
## Call:
## glm(formula = Solar.R ~ Ozone, family = Gamma(link = "inverse"), 
##     data = air)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.09373  -0.39415   0.02583   0.34742   0.82250  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.489e-03  4.351e-04  14.913  < 2e-16 ***
## Ozone       -2.257e-05  6.341e-06  -3.559 0.000553 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.2645698)
## 
##     Null deviance: 53.366  on 110  degrees of freedom
## Residual deviance: 50.545  on 109  degrees of freedom
## AIC: 1347.4
## 
## Number of Fisher Scoring iterations: 5
ajuste = c('glmod1', 'glmod2','glmod3')
aic    = c(AIC(glmod1), AIC(glmod2),AIC(glmod3))
deviance    = c(deviance(glmod1), deviance(glmod2),deviance(glmod3))
verossimilhanca = c(logLik(glmod1),logLik(glmod2),logLik(glmod3))
data.frame(ajuste, aic, verossimilhanca,deviance)
##   ajuste      aic verossimilhanca deviance
## 1 glmod1 1353.802       -667.9010 48.23678
## 2 glmod2 1352.730       -668.3651 48.61480
## 3 glmod3 1347.368       -670.6842 50.54520
# Envelope
par(mai = c(1.2, 1.2, 0.5, 0.1))
plot(meD[, 1], meD[, 2], pch = 20, ylim = range(meD[, -1]),
cex.axis = 1.2, cex.lab = 1.2, xlab = "Quantis N(0,1)",
ylab = "")
lines(meD[, 1], meD[, 3])
lines(meD[, 1], meD[, 4], lty = 2)
lines(meD[, 1], meD[, 5])

Normal Inversa

glmod1 <- glm(Solar.R ~ Ozone + Wind + Temp + Month , data = air, 
              family= inverse.gaussian)
summary(glmod1)
## 
## Call:
## glm(formula = Solar.R ~ Ozone + Wind + Temp + Month, family = inverse.gaussian, 
##     data = air)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.36029  -0.03053  -0.00232   0.02737   0.05825  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  1.344e-04  4.381e-05   3.067  0.00276 **
## Ozone       -2.044e-07  1.321e-07  -1.547  0.12481   
## Wind        -1.200e-06  1.122e-06  -1.070  0.28722   
## Temp        -1.035e-06  5.455e-07  -1.898  0.06050 . 
## Month.L      1.324e-05  8.149e-06   1.624  0.10739   
## Month.Q     -8.344e-06  7.619e-06  -1.095  0.27597   
## Month.C     -1.948e-06  8.845e-06  -0.220  0.82613   
## Month^4     -8.374e-06  7.324e-06  -1.143  0.25553   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for inverse.gaussian family taken to be 0.001729752)
## 
##     Null deviance: 0.74572  on 110  degrees of freedom
## Residual deviance: 0.71874  on 103  degrees of freedom
## AIC: 1431.6
## 
## Number of Fisher Scoring iterations: 6
glmod2 <- glm(Solar.R ~ Ozone + Temp , data = air, 
              family= inverse.gaussian)
summary(glmod2)
## 
## Call:
## glm(formula = Solar.R ~ Ozone + Temp, family = inverse.gaussian, 
##     data = air)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.36179  -0.02881  -0.00035   0.02515   0.06444  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  7.542e-05  3.125e-05   2.413   0.0175 *
## Ozone       -1.523e-07  9.292e-08  -1.639   0.1042  
## Temp        -4.809e-07  4.197e-07  -1.146   0.2545  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for inverse.gaussian family taken to be 0.001690236)
## 
##     Null deviance: 0.74572  on 110  degrees of freedom
## Residual deviance: 0.72940  on 108  degrees of freedom
## AIC: 1423.2
## 
## Number of Fisher Scoring iterations: 6
glmod3 <- glm(Solar.R ~ Ozone , data = air, 
              family= inverse.gaussian(link="inverse"))
summary(glmod3)
## 
## Call:
## glm(formula = Solar.R ~ Ozone, family = inverse.gaussian(link = "inverse"), 
##     data = air)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.36142  -0.03018  -0.00072   0.02555   0.06052  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.676e-03  4.628e-04  14.425  < 2e-16 ***
## Ozone       -2.649e-05  7.458e-06  -3.552 0.000566 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for inverse.gaussian family taken to be 0.001647603)
## 
##     Null deviance: 0.74572  on 110  degrees of freedom
## Residual deviance: 0.72906  on 109  degrees of freedom
## AIC: 1421.1
## 
## Number of Fisher Scoring iterations: 2
shapiro.test(glmod3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  glmod3$residuals
## W = 0.98024, p-value = 0.09897
ajuste = c('glmod1', 'glmod2','glmod3')
aic    = c(AIC(glmod1), AIC(glmod2),AIC(glmod3))
deviance    = c(deviance(glmod1), deviance(glmod2),deviance(glmod3))
verossimilhanca = c(logLik(glmod1),logLik(glmod2),logLik(glmod3))
data.frame(ajuste, aic, verossimilhanca,deviance)
##   ajuste      aic verossimilhanca  deviance
## 1 glmod1 1431.560       -706.7800 0.7187393
## 2 glmod2 1423.195       -707.5973 0.7294022
## 3 glmod3 1421.142       -707.5709 0.7290561
hnp(glmod3$residuals, sim = 99,resid.type ='deviance',how.many.out=T ,conf = 0.95,scale = T)
## Half-normal plot with simulated envelope generated assuming the residuals are 
##         normally distributed under the null hypothesis. 
## Estimated mean: -3.04843e-18 
## Estimated variance: 9.939771e-06

## Total points: 111 
## Points out of envelope: 0 ( 0 %)

Predição

pred.ozonio <- data.frame(Ozone=  c(20, 42, 170))
predict(glmod3, interval = 'prediction', newdata = pred.ozonio, type = 'response')
##        1        2        3 
## 162.6898 179.7297 460.1239

Portanto, o número esperado de radiação solar para New York com o valor 20 (ozônio) é igual a 167.24; para o valor 31 este número passa para 179.13, e no teceiro valor o número esperado de radiação solar é de 473.36.

Conclusões

Sendo assim, o modelo de Normal Inversa é uma alternativa para modelagem desses dados de assimétricos e contínuos. Utilizar os Modelos Lineares Generalizados foram melhores que o modelo linear clássico.

“Essencialmente, todos os modelos estão errados, mas alguns são úteis.” - George E. P. Box