Data 621 - Blog 2

Understanding the LM, GLM, and GLS models.

Taken from: https://www.youtube.com/watch?v=P-WYkSZp9lY&feature=youtu.be

Definitions:

  • lm - linear regression, Normal Errors, constant variance
  • glm - Generalized Linear Models, non-normal errors, non-constant variance
  • gls - Generalized Least Squares model, non-normal errors, non-constant variance with correlated errors spatial temporal patterns or trends

Data exploration

For this analysis we will be using the Air Quality dataset built into R libraries. To start, we need to understand what we are dealing with.

airquality <- airquality
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
plot(Ozone~Wind, airquality)

There is pattern in the dataset and may be some non-linearity. We won’t account for that currently.

Linear Model

First, we will build a standard linear model.

model1 <- lm(Ozone~Wind, airquality)
plot(model1)

coefficients(model1)
## (Intercept)        Wind 
##   96.872895   -5.550923

There is obvious issues with the model. Lets go ahead and predict for wind speeds of 19 and 20 mph.

Ozone1=coef(model1)[1]+coef(model1)[2]*19
Ozone2=coef(model1)[1]+coef(model1)[2]*20

Ozone1
## (Intercept) 
##    -8.59464
Ozone2
## (Intercept) 
##   -14.14556

There is something wrong. For a wind speed of 19 and 20, we are going negative and .
does not jive with the plot we saw earlier. Seeing it’s not what we want, we will need to make the model non-linear in some way.

plot(Ozone~Wind, airquality)

GLM

We will start with the Generalized Linear Model using Poisson.

model2=glm(Ozone~Wind,airquality, family = poisson)
coef(model2)
## (Intercept)        Wind 
##   5.0795877  -0.1488753

Poisson allows the model to react differently when it comes to the parameters. For the LM model you have \(y=a+bx+\varepsilon\), two parameters, A & B plus errors. GLM does something different it takes \(\log { \quad y } =a+bx+\varepsilon\) or differently it takes \(y={ e }^{ a+bx+\varepsilon }\). Rewritten one more time for clarity; \(y={ e }^{ a }*{ e }^{ bx }+\varepsilon\)

GLM is a multiplicative model verses the addition model of standard linear models while still being linear in the parameters being used. The Exponent values \(a+bx\) are the same as the linear model; called the linear predictor. To convert the model back to the original model you need to take the log of the linear predictor.

Lets see what it means:

# to do the same predictor as we did earlier, we need to expentiate the expression. e to the power of the expression.
Ozone1.glm=exp(coef(model2)[1]+coef(model2)[2]*19)
Ozone2.glm=exp(coef(model2)[1]+coef(model2)[2]*20)

Ozone1.glm
## (Intercept) 
##    9.496811
Ozone2.glm
## (Intercept) 
##    8.183179
plot(Ozone~Wind, airquality)

The model seems closer to reasonable, the Ozone level at wind speed of 19 or 20 MPH is just above zero.

The next thing that is interesting, what happens when we divide Ozone2 by Ozone1?
Does it match anything we have already?

Ozone2.glm/Ozone1.glm
## (Intercept) 
##   0.8616765
exp(coef(model2)[2]) #exp(-0.1488753)
##      Wind 
## 0.8616765

It matches the exponent value of the GLM model. This is very important, the slope of the GLM fit by Poisson gives is the proportional change in ozone concentration if we change wind speed by one unit (in this case taking it from 19 to 20). Going from 19 to 20 mph the Ozone concentration goes down by a factor of \({ e }^{ 0.1488... }\)

In the GLM model, the individual slope gives an estimate of the multiplicative change in the response variable for a one unit change in the corresponding explanatory variable.

eg…the slope \(b = -.014\)

For a one unit change in wind speed, Ozone concentration decreases by \({ e }^{ 0.1488... }\)fold \({ e }^{ -b }\)

GLS

Last, we look at the Generalized Least Squares model. Fitting the model as we have done before [gls(Ozone~Wind, airquality)] will generate an error message as we have missing data.
The GLS model does not like missing values. If you look at a summary of Ozone, there are 37 Na’s.
Additionally, we also have a month and day column which we can modify to get a date to add to the model. Since this data is from 1973, we will make a date column using that year, the month and day columns.

library(nlme)
## Warning: package 'nlme' was built under R version 3.6.3
library(lattice)
summary(airquality$Ozone)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   18.00   31.50   42.13   63.25  168.00      37
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
#we need to clean the data if we want to use it in the same format as above.
model3 <- gls(Ozone~Wind, airquality,na.action = na.exclude)

#creating the date cxolumn using paste
airquality$Date = as.Date(paste(1973,airquality$Month, airquality$Day, sep="-"))

#Checking the plot is now successful to see Months from the date vs Ozone levels
xyplot(Ozone~Date, airquality)

There appears to be a temporal structure in the data we had not accounted for in the earlier models.
If we incorporate the temporal data into the model we see a change in the output, sort of.

model4 <- gls(Ozone~Wind*Date, airquality,na.action = na.exclude)
plot(ACF(model4))

The plot does not produce anything because we need to do further scrubbing of the data and ensure we are only seeing complete cases.

air2=subset(airquality,complete.cases(Ozone))
model5 <- gls(Ozone~Wind*Date, air2)
plot(ACF(model5, form = ~Date),alpha = 0.05)

There is pattern in the data, however we do not know if there is significance in the auto-correlation without alpha lines. There is indication of some days in the beginning where there is marginally significant positive auto-correlation. The first one is always set to 1, we can discount it for the purpose here. There is some temporal pattern or rather we have proved there is some pattern. We need to account for that.

model6 <- update(model5, correlation=corAR1())
library(MuMIn)
## Warning: package 'MuMIn' was built under R version 3.6.3
AICc(model5,model6)
##        df    AICc
## model5  5 1099.40
## model6  6 1095.21

We can see the AICc goes down slightly in Model 6 indicating it is the more stable of the models.

summary(model6)
## Generalized least squares fit by REML
##   Model: Ozone ~ Wind * Date 
##   Data: air2 
##        AIC     BIC    logLik
##   1094.439 1110.75 -541.2197
## 
## Correlation Structure: AR(1)
##  Formula: ~1 
##  Parameter estimate(s):
##       Phi 
## 0.2811007 
## 
## Coefficients:
##                  Value Std.Error   t-value p-value
## (Intercept) -310.82785 224.46119 -1.384773  0.1689
## Wind          26.34403  18.92699  1.391876  0.1667
## Date           0.30485   0.17265  1.765721  0.0802
## Wind:Date     -0.02370   0.01462 -1.621035  0.1078
## 
##  Correlation: 
##           (Intr) Wind   Date  
## Wind      -0.909              
## Date      -0.999  0.910       
## Wind:Date  0.907 -0.999 -0.909
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.5165120 -0.7094215 -0.1900599  0.6459231  3.3907429 
## 
## Residual standard error: 26.69287 
## Degrees of freedom: 116 total; 112 residual

Because of the interaction term (Wind:Date) there is a strong relationship between all values.
Even seeing though it, there is a very strong relationship between Wind and Date.