Linear regression is used to predict the value of an outcome variable Y based on one or more input predictor variables X. The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that, we can use this formula to estimate the value of the response Y, when only the predictors (Xs) values are known.
The general mathematical equation for a linear regression is
y = ax + b
plot(Ozone~Wind, airquality)
model1 <- lm(Ozone~Wind, airquality)
par(mfrow = c(2, 2))
plot(model1)
coef(model1)
## (Intercept) Wind
## 96.872895 -5.550923
Oz1 <- coef(model1)[1] + coef(model1)[2] * 19
Oz2 <- coef(model1)[1] + coef(model1)[2] * 20
Oz1
## (Intercept)
## -8.59464
Oz2
## (Intercept)
## -14.14556
Generalized linear models provide a common approach to a broad range of response modeling problems. Normal, Poisson, and binomial responses are the most commonly used, but other distributions can be used as well. Apart from specifying the response, GLMs also need a link function to be set which allows further flexibility in the modeling. The GLM can be fitted using a common procedure and a mechanism for hypothesis testing is available. Diagnostics using deviance residuals provide a way to check that chosen models are adequate.
model2 <- glm(Ozone~Wind, airquality, family = poisson)
coef(model2)
## (Intercept) Wind
## 5.0795877 -0.1488753
Oz1.glm <- exp(coef(model2)[1] + coef(model2)[2] * 19)
Oz2.glm <- exp(coef(model2)[1] + coef(model2)[2] * 20)
Oz2.glm / Oz1.glm
## (Intercept)
## 0.8616765
exp(coef(model2)[2])
## Wind
## 0.8616765
The generalized least squares (GLS) estimator of the coefficients of a linear regression is a generalization of the ordinary least squares (OLS) estimator. It is used to deal with situations in which the OLS estimator is not BLUE (best linear unbiased estimator) because one of the main assumptions of the Gauss-Markov theorem, namely that of homoskedasticity and absence of serial correlation, is violated. In such situations, provided that the other assumptions of the Gauss-Markov theorem are satisfied, the GLS estimator is BLUE.
library(nlme)
#model3 <- gls(Ozone~Wind, airquality)
# Check for na's
summary(airquality$Ozone)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 18.00 31.50 42.13 63.25 168.00 37
# suppress na's
model3 <- gls(Ozone~Wind, airquality, na.action = na.exclude)
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
airquality$Date <- as.Date(paste(1973, airquality$Month, airquality$Day, sep = "-"))
library(lattice)
xyplot(Ozone~Date, airquality)
#model4 <- gls(Ozone~Wind*Date, airquality, na.action = na.exclude)
#plot(ACF(model4))
air2=subset(airquality, complete.cases(Ozone))
model5 <- gls(Ozone~Wind*Date, air2)
plot(ACF(model5, form=~Date), alpha=0.05) # alpha plots significant line
# Update correlation
model6 <- update(model5, correlation=corAR1())
library(MuMIn)
# Compare model
AICc(model5, model6)
## df AICc
## model5 5 1099.40
## model6 6 1095.21
summary(model6)
## Generalized least squares fit by REML
## Model: Ozone ~ Wind * Date
## Data: air2
## AIC BIC logLik
## 1094.439 1110.75 -541.2197
##
## Correlation Structure: AR(1)
## Formula: ~1
## Parameter estimate(s):
## Phi
## 0.2811007
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -310.82785 224.46119 -1.384773 0.1689
## Wind 26.34403 18.92699 1.391876 0.1667
## Date 0.30485 0.17265 1.765721 0.0802
## Wind:Date -0.02370 0.01462 -1.621035 0.1078
##
## Correlation:
## (Intr) Wind Date
## Wind -0.909
## Date -0.999 0.910
## Wind:Date 0.907 -0.999 -0.909
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.5165120 -0.7094215 -0.1900599 0.6459231 3.3907429
##
## Residual standard error: 26.69287
## Degrees of freedom: 116 total; 112 residual