Let's use a data set from The R Book by Michael Crawley. It's a dataset recording temperature, radiation, wind speed, and ozone concentration.
library(RCurl)  # getURL() comes from the RCurl package
ozone = read.table(text=getURL("https://raw.githubusercontent.com/OscarFHC/NRE538_GSRA/master/Labs/NRE538_MultipleReg/ozone.data.txt"), sep="", header=TRUE, comment.char="#")
head(ozone)
## rad temp wind ozone
## 1 190 67 7.4 41
## 2 118 72 8.0 36
## 3 149 74 12.6 12
## 4 313 62 11.5 18
## 5 299 65 8.6 23
## 6 99 59 13.8 19
Last time, we dealt with only one explanatory variable (X). If we want to know how the response variable (Y) can be explained by multiple explanatory variables, we have to use multiple regression.
The first step is to decide which explanatory variables should be included in the model.
The most important thing is to know your system and decide what should be included. This is where exploratory data analysis is important.
pairs() function
The pairs() function allows you to have a glimpse of the relationships among the variables.
pairs(ozone, lower.panel=panel.lm, upper.panel=panel.cor)
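Note that panel.lm and panel.cor are not part of base R; they are custom panel functions, presumably defined earlier in the course material. A minimal sketch of what they might look like (the originals may differ):
# Hypothetical panel functions for pairs(); not the original definitions.
panel.cor = function(x, y, digits=2, ...) {
  op = par(usr=c(0, 1, 0, 1))        # remap this panel to the unit square
  on.exit(par(op))                   # restore graphics settings on exit
  r = cor(x, y, use="complete.obs")  # pairwise correlation coefficient
  text(0.5, 0.5, format(r, digits=digits), cex=1.5)
}
panel.lm = function(x, y, ...) {
  points(x, y, ...)                  # scatter plot of the pair
  abline(lm(y ~ x), col="red")       # add the least-squares fit line
}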
cor() function
Correlation coefficients can also be calculated to help you understand the linear dependence between each pair of variables.
cor(ozone, use="na.or.complete")
## rad temp wind ozone
## rad 1.0000000 0.2940876 -0.1273656 0.3483417
## temp 0.2940876 1.0000000 -0.4971459 0.6985414
## wind -0.1273656 -0.4971459 1.0000000 -0.6129508
## ozone 0.3483417 0.6985414 -0.6129508 1.0000000
In general, we don't want the explanatory variables in the model to be highly correlated. Otherwise the estimates of the regression coefficients of the explanatory variables can be badly distorted (and their standard errors inflated).
Take a look at the following simple example, which demonstrates the impact of having highly correlated explanatory variables.
First we use a known \(Y\) to generate two \(X\) variables, so that we know what the true relationships between \(X\) and \(Y\) are and what the regression coefficients should be.
Specifically, I have 201 \(Y\) values evenly spaced from 0 to 20. I generated \(X_1\) from the relationship \(Y = 3X_1 + 2 + noise\), so that the regression coefficient of \(Y\) on \(X_1\) should be around 3. Similarly, I generated \(X_2\) from a different relationship, \(Y = 2X_2 - 3 + noise\), so that the regression coefficient of \(Y\) on \(X_2\) should be around 2.
Y = seq(from=0, to=20, by=0.1)
X1 = (Y-2)/3 + runif(length(Y), min=0.5, max=2)
X2 = (Y+3)/2 + runif(length(Y), min=0.5, max=2)
df = as.data.frame(cbind(Y, X1, X2))
summary(lm(Y~X1, data=df))
##
## Call:
## lm(formula = Y ~ X1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.39519 -1.08491 0.01601 1.20660 2.44106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.18398 0.20589 -5.751 3.31e-08 ***
## X1 2.85737 0.04695 60.865 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.317 on 199 degrees of freedom
## Multiple R-squared: 0.949, Adjusted R-squared: 0.9488
## F-statistic: 3705 on 1 and 199 DF, p-value: < 2.2e-16
summary(lm(Y~X2, data=df))
##
## Call:
## lm(formula = Y ~ X2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.58291 -0.78257 -0.07204 0.74966 1.53370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.20809 0.17917 -29.07 <2e-16 ***
## X2 1.96220 0.02163 90.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8961 on 199 degrees of freedom
## Multiple R-squared: 0.9764, Adjusted R-squared: 0.9763
## F-statistic: 8229 on 1 and 199 DF, p-value: < 2.2e-16
So far so good. But if we put both \(X_1\) and \(X_2\) in the same linear regression model, the regression coefficients for both \(X_1\) and \(X_2\) are way off the true values. This is because \(X_1\) and \(X_2\) are highly correlated, which we can see from the scatter plot and the correlation coefficient between \(X_1\) and \(X_2\) below.
summary(lm(Y~X1+X2, data=df))
##
## Call:
## lm(formula = Y ~ X1 + X2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55388 -0.52965 0.05102 0.50888 1.48759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.15743 0.18542 -22.422 <2e-16 ***
## X1 0.92591 0.09743 9.503 <2e-16 ***
## X2 1.35905 0.06596 20.603 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7444 on 198 degrees of freedom
## Multiple R-squared: 0.9838, Adjusted R-squared: 0.9836
## F-statistic: 6006 on 2 and 198 DF, p-value: < 2.2e-16
plot(X1~X2)
cor(X1, X2)
## [1] 0.9621754
Exercise 1
Manipulate the above code to reduce the correlation between \(X_1\) and \(X_2\) and recalculate the regression coefficients. Explain what you find.
Hint: increase the random noise in \(X_1\) and \(X_2\) so that the two are less correlated.
In the worst case, if two explanatory variables are perfectly correlated, the regression coefficients cannot be estimated. Take the following contrived case for example…
y=seq(from=1, to=20, by=0.1)
y=runif(length(y), min=3, max=5)
x1=seq(from=5, to=24, by=0.1)
x2=x1*2+5
df = data.frame(y, x1, x2)
pairs(df)
cor(df)
## y x1 x2
## y 1.00000000 -0.05941648 -0.05941648
## x1 -0.05941648 1.00000000 1.00000000
## x2 -0.05941648 1.00000000 1.00000000
summary(lm(y~x1+x2, data=df))
##
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10909 -0.49517 -0.00991 0.51193 0.92663
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.174062 0.115058 36.278 <2e-16 ***
## x1 -0.006069 0.007417 -0.818 0.414
## x2 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5652 on 189 degrees of freedom
## Multiple R-squared: 0.00353, Adjusted R-squared: -0.001742
## F-statistic: 0.6696 on 1 and 189 DF, p-value: 0.4142
We can see that when x1 and x2 are perfectly correlated, the regression coefficient of one of them cannot be estimated. This is a rather extreme and simple example that you might never encounter in your entire life, but it is worth knowing that this is due to the problem of singularity in matrix algebra: the design matrix is not of full rank.
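One quick way to see the singularity (a sketch, not part of the original analysis) is to check the rank of the design matrix:
# The design matrix for y ~ x1 + x2 has 3 columns but only rank 2,
# because x2 is an exact linear function of x1.
X = model.matrix(y ~ x1 + x2, data=df)
ncol(X)      # 3 columns: intercept, x1, x2
qr(X)$rank   # 2, so one coefficient cannot be estimated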
Another, more formal way to diagnose the magnitude of collinearity among explanatory variables is the variance inflation factor (VIF).
\[VIF_i = \frac{1}{1 - {R_i}^2}\ ,\] where \({R_i}^2\) is the \(R^2\) of the linear model with \(X_i\) as the response variable (i.e. on the left side of the equation) and the other \(X\)s as the explanatory variables (on the right-hand side of the equation).
To the best of my knowledge, there is no built-in function in base R that calculates VIF automatically (add-on packages such as car do provide one), so we'll have to write our own… In fact, it's not hard to write. For example, the VIF for radiation can be calculated as follows.
vif.rad = 1/(1 - summary(lm(rad~temp+wind+ozone, data=ozone))$r.squared)
vif.rad
## [1] 1.163376
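To get the VIF for every variable at once, here is a small helper; vif.all is a made-up name, not something from the original text or from a package.
# Compute the VIF of every column of a data frame by regressing it on
# all of the remaining columns.
vif.all = function(dat) {
  sapply(names(dat), function(v) {
    r2 = summary(lm(reformulate(setdiff(names(dat), v), response=v), data=dat))$r.squared
    1/(1 - r2)
  })
}
vif.all(ozone)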
Generally we want the VIF to be smaller than 5 or 10 (some say 20). However, neither VIF nor correlation coefficients nor plots give strict rules for deciding whether a correlated variable should be discarded from the model. They only alert you to the potential problems that could result from collinearity.
After deciding which explanatory variables to include in the model, the next step is to compare the candidate models. A common way to compare two models is the \(F\)-test. The idea of the \(F\)-test is to compare the ability of the models to explain the variance of the response variable. To do so, the \(F\)-statistic is calculated by the following formula,
\[F = \frac {(\frac{RSS_1-RSS_2}{p_2-p_1})}{(\frac{RSS_2}{n-p_2})}\ ,\]
where \(RSS_i\) is the residual sum of squares of \(model_i\), \(p_i\) is the number of parameters in \(model_i\), and \(n\) is the sample size; \(n-p_i\) is therefore the degrees of freedom of \(model_i\). The p-value of this \(F\)-statistic under the \(F\) distribution (with \(p_2-p_1\) and \(n-p_2\) degrees of freedom) is the probability of seeing such an improvement if model 2 explained no more variance than model 1.
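To make the formula concrete, here is a small sketch of how the statistic could be computed for any two nested lm fits; f.compare is a hypothetical helper, not from the original text.
# F statistic for two nested lm fits, fit1 (smaller) nested in fit2 (larger).
# For an lm object, deviance() returns the residual sum of squares and
# df.residual() returns n - p.
f.compare = function(fit1, fit2) {
  rss1 = deviance(fit1); rss2 = deviance(fit2)
  df1 = df.residual(fit1); df2 = df.residual(fit2)
  Fstat = ((rss1 - rss2)/(df1 - df2)) / (rss2/df2)
  c(F = Fstat, p.value = pf(Fstat, df1 - df2, df2, lower.tail = FALSE))
}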
Let’s compare the following two models with the Ozone data set.
\[Model_1: Temp = \beta_0 + \beta_1 * Wind\]
\[Model_2: Temp = \beta_0' + \beta_1' * Wind + \beta_2' * Radiation\]
Here \(Model_1\) is nested in \(Model_2\), because if we set \(\beta_2'\) to zero (i.e. let Radiation have no effect on Temperature), \(Model_2\) reduces to \(Model_1\).
mod1 = lm(temp~wind, data=ozone)
summary(mod1)
##
## Call:
## lm(formula = temp ~ wind, data = ozone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.111 -5.645 1.013 6.255 18.889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.0226 2.3479 38.767 < 2e-16 ***
## wind -1.3311 0.2225 -5.982 2.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.307 on 109 degrees of freedom
## Multiple R-squared: 0.2472, Adjusted R-squared: 0.2402
## F-statistic: 35.78 on 1 and 109 DF, p-value: 2.851e-08
mod2 = lm(temp~wind+rad, data=ozone)
summary(mod2)
##
## Call:
## lm(formula = temp ~ wind + rad, data = ozone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.2684 -5.0216 0.5859 5.2503 18.4622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.695296 2.924996 29.298 < 2e-16 ***
## wind -1.251135 0.217133 -5.762 7.95e-08 ***
## rad 0.024525 0.008478 2.893 0.00462 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.039 on 108 degrees of freedom
## Multiple R-squared: 0.3013, Adjusted R-squared: 0.2883
## F-statistic: 23.28 on 2 and 108 DF, p-value: 3.91e-09
Before we compare them, one thing is worth pointing out: the regression coefficient of wind changes a little between the two models. This is because wind and radiation are not perfectly independent; for the same reason, the coefficient of radiation in mod2 differs slightly from what you would get by regressing temperature on radiation alone.
Let's compare the two models. Since we know the statistic follows an \(F\) distribution, we use the anova() function to perform the \(F\)-test.
anova(mod1, mod2)
## Analysis of Variance Table
##
## Model 1: temp ~ wind
## Model 2: temp ~ wind + rad
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 109 7521.1
## 2 108 6980.3 1 540.79 8.3672 0.004621 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the output table, we see that the \(RSS\) of mod2 is 6980.314 and the \(RSS\) of mod1 is 7521.107, giving an \(F\) statistic of 8.367. The interpretation is that mod2 explains a significantly greater amount of variance than mod1, because the probability of seeing such an improvement if radiation had no effect is only 0.004621.
Exercise 2
Challenge yourself by calculating all of the above values manually.
Compare a third model (mod3), with ozone as a third explanatory variable, to mod1 and mod2 above. Does it explain more variance? Does it give you reasonable estimates of the regression coefficients?
Important: the \(F\)-test can only be used to compare “nested” models!
There are at least two other ways to compare different models.
Likelihood ratio test. This is very useful, especially when comparing non-linear models. However, like the \(F\)-test, the models need to be nested for a valid likelihood ratio test. Basically, it calculates the maximum likelihood of each model and takes the ratio of the two; minus two times the log of this ratio approximately follows a \(\chi^2\) distribution. By finding the probability of observing that statistic under the \(\chi^2\) distribution, we can tell which model performs better. Performance here is not the amount of variance of the response variable explained, but the likelihood of the model generating the observed data. The model with the higher likelihood is considered the “better” model.
AIC value comparison. AIC is an information criterion based on the maximum likelihood of the model. AIC is especially useful when comparing non-nested models. AIC cannot tell us how probable one model is compared to another; instead, it only tells us which model loses less information while taking the number of parameters into account. Generally, the smaller the AIC, the less information is lost (the better the model performs). A rule of thumb is that an AIC difference greater than 10 can be considered a meaningful difference, although there are no strict rules, especially when the AIC difference is around 4-7. A short sketch of the likelihood ratio test and the AIC calculation for mod1 and mod2 appears below.
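A minimal sketch of both ideas, using mod1 and mod2 fitted above (the \(\chi^2\) p-value is only an asymptotic approximation):
# Likelihood ratio test for the nested fits mod1 and mod2:
# -2 * log(likelihood ratio) = 2 * (logLik(mod2) - logLik(mod1)),
# compared against a chi-square distribution with df equal to the
# difference in the number of estimated parameters.
LR = 2 * (as.numeric(logLik(mod2)) - as.numeric(logLik(mod1)))
df.diff = attr(logLik(mod2), "df") - attr(logLik(mod1), "df")
pchisq(LR, df=df.diff, lower.tail=FALSE)

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters);
# the built-in AIC() computes this directly.
AIC(mod1, mod2)
-2 * as.numeric(logLik(mod2)) + 2 * attr(logLik(mod2), "df")  # same as AIC(mod2)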
Many other model comparison/selection methods are available. However, they are beyond the scope of this class.
Exercise 3
Compare the AIC of the three models above mod1, mod2, and mod3. Do you have different conclusion in terms of which model performs better?
Parameters should be used with parsimony in modeling: any parameter that does not contribute significantly to the model (e.g. does not increase the \(R^2\) in an important way) should be eliminated.
Let's say we decide to use mod2 as the final model to explain the data. How do we interpret the regression coefficients?
The regression coefficient of the Wind variable is the effect of Wind on Temperature when the effect of Radiation is controlled for.
This coefficient can also be obtained from the following linear regression model.
\[Temp_{res|rad} = \beta_0 + \beta_1 Wind_{res|rad}\ ,\]
where \(Temp_{res|rad}\) are the residuals of temperature regressed on radiation and \(Wind_{res|rad}\) are the residuals of wind regressed on radiation. From the formula, we see that the effect of wind is estimated after the influence of radiation has been taken into account.
mod.r = lm(temp~rad, data=ozone)
mod.wr = lm(wind~rad, data=ozone)
summary(lm(residuals(mod.r)~residuals(mod.wr)))
##
## Call:
## lm(formula = residuals(mod.r) ~ residuals(mod.wr))
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.2684 -5.0216 0.5859 5.2503 18.4622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.096e-15 7.596e-01 0.000 1
## residuals(mod.wr) -1.251e+00 2.161e-01 -5.789 6.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.002 on 109 degrees of freedom
## Multiple R-squared: 0.2351, Adjusted R-squared: 0.2281
## F-statistic: 33.51 on 1 and 109 DF, p-value: 6.919e-08
We see the estimated effect is -1.251135, identical to the one estimated from the multiple regression (-1.251135). This coefficient is the partial regression coefficient.
Note that this is different from the following model.
\[Temp_{res|rad} = \beta_0' + \beta_1' Wind\ ,\]
The above model does not take the influences of Radiation on Wind into account.
summary(lm(residuals(mod.r)~wind, data=ozone))
##
## Call:
## lm(formula = residuals(mod.r) ~ wind, data = ozone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.952 -5.161 0.812 5.346 18.354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.2330 2.2676 5.395 4.03e-07 ***
## wind -1.2308 0.2149 -5.727 9.14e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.022 on 109 degrees of freedom
## Multiple R-squared: 0.2313, Adjusted R-squared: 0.2243
## F-statistic: 32.8 on 1 and 109 DF, p-value: 9.143e-08