Let’s use a data set from The R Book by Michael Crawley. It is a dataset recording temperature, radiation, wind speed, and ozone concentration.

library(RCurl)  # getURL() comes from the RCurl package
ozone = read.table(text=getURL("https://raw.githubusercontent.com/OscarFHC/NRE538_GSRA/master/Labs/NRE538_MultipleReg/ozone.data.txt"), sep="", header=T, comment.char="#")
head(ozone)
##   rad temp wind ozone
## 1 190   67  7.4    41
## 2 118   72  8.0    36
## 3 149   74 12.6    12
## 4 313   62 11.5    18
## 5 299   65  8.6    23
## 6  99   59 13.8    19

1 Recap

  • Correlation describes whether there is “interdependence” between two variables.
  • Regression finds the best model relating the response variable (Y) to the explanatory variables, so that we can estimate the model parameters for making inferences or forecasts.

2 Multiple Regression

Last time, we dealt with only one explanatory variable (X). If we want to know how the response variable (Y) can be explained by multiple explanatory variables, we have to use multiple regression.

2.1 Collinearity

The first step is to decide which explanatory variables should be included in the model.
The most important thing is to know your system and decide what belongs in it. This is where exploratory data analysis is important.

2.1.1 The useful pairs() function

The pairs() function gives you a quick glimpse of the relationships among the variables.

pairs(ozone, lower.panel=panel.lm, upper.panel=panel.cor)
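
Note that panel.lm and panel.cor are not built-in R functions; they are custom panel functions assumed to be defined earlier in the course materials. A minimal sketch of what they might look like (my own guess, not necessarily the original definitions):

panel.lm = function(x, y, ...){
  points(x, y, ...)                      # scatter plot for this pair of variables
  abline(lm(y ~ x), col = "red")         # add a fitted regression line
}
panel.cor = function(x, y, ...){
  op = par(usr = c(0, 1, 0, 1))          # switch to a unit coordinate system for the text
  on.exit(par(op))
  r = cor(x, y, use = "complete.obs")    # pairwise correlation coefficient
  text(0.5, 0.5, format(r, digits = 2), cex = 1.5)
}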

2.1.2 cor() function

Correlation coefficients can also be calculated to help you understand the linear dependence between each pair of variables.

cor(ozone, use="na.or.complete")
##              rad       temp       wind      ozone
## rad    1.0000000  0.2940876 -0.1273656  0.3483417
## temp   0.2940876  1.0000000 -0.4971459  0.6985414
## wind  -0.1273656 -0.4971459  1.0000000 -0.6129508
## ozone  0.3483417  0.6985414 -0.6129508  1.0000000

In general, we don’t want the explanatory variables in the model to be highly correlated. Otherwise the estimates of the regression coefficients of the explanatory variables will be biased.

Take a look at the following simple example, which demonstrates the impact of having highly correlated explanatory variables.
First we use a known \(Y\) to generate two \(X\) variables, so that we know the true relationships between \(X\) and \(Y\) and what the regression coefficients should be.
Specifically, I have 201 \(Y\) values evenly spaced from 0 to 20. I generated \(X_1\) from the relationship \(Y = 3X_1 + 2 + noise\), so that the regression coefficient of \(Y\) on \(X_1\) should be around 3. Similarly, I generated \(X_2\) from a different relationship, \(Y = 2X_2 - 3 + noise\), so that the regression coefficient of \(Y\) on \(X_2\) should be around 2.

Y = seq(from=0, to=20, by=0.1)
X1 = (Y-2)/3 + runif(length(Y), min=0.5, max=2)
X2 = (Y+3)/2 + runif(length(Y), min=0.5, max=2)
df = as.data.frame(cbind(Y, X1, X2))
summary(lm(Y~X1, data=df))
## 
## Call:
## lm(formula = Y ~ X1, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.39519 -1.08491  0.01601  1.20660  2.44106 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.18398    0.20589  -5.751 3.31e-08 ***
## X1           2.85737    0.04695  60.865  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.317 on 199 degrees of freedom
## Multiple R-squared:  0.949,  Adjusted R-squared:  0.9488 
## F-statistic:  3705 on 1 and 199 DF,  p-value: < 2.2e-16
summary(lm(Y~X2, data=df))
## 
## Call:
## lm(formula = Y ~ X2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.58291 -0.78257 -0.07204  0.74966  1.53370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.20809    0.17917  -29.07   <2e-16 ***
## X2           1.96220    0.02163   90.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8961 on 199 degrees of freedom
## Multiple R-squared:  0.9764, Adjusted R-squared:  0.9763 
## F-statistic:  8229 on 1 and 199 DF,  p-value: < 2.2e-16

So far so good. But if we include both \(X_1\) and \(X_2\) in the linear regression model, the regression coefficients for both \(X_1\) and \(X_2\) are way off their true values. This is because \(X_1\) and \(X_2\) are highly correlated, as we can see from the scatter plot and the correlation coefficient between \(X_1\) and \(X_2\) below.

summary(lm(Y~X1+X2, data=df))
## 
## Call:
## lm(formula = Y ~ X1 + X2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55388 -0.52965  0.05102  0.50888  1.48759 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.15743    0.18542 -22.422   <2e-16 ***
## X1           0.92591    0.09743   9.503   <2e-16 ***
## X2           1.35905    0.06596  20.603   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7444 on 198 degrees of freedom
## Multiple R-squared:  0.9838, Adjusted R-squared:  0.9836 
## F-statistic:  6006 on 2 and 198 DF,  p-value: < 2.2e-16
plot(X1~X2)

cor(X1, X2)
## [1] 0.9621754

Exercise 1

Manipulate the above code to reduce the correlation between \(X_1\) and \(X_2\) and recalculate the regression coefficients. Explain what you find.

Hint: increase the random noise added to \(X_1\) and \(X_2\) so that the two become less correlated (one possible tweak is sketched below).
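
For example, one possible tweak (the wider max value is just an illustration) is to enlarge the noise range and then check the correlation and coefficients again:

X1 = (Y - 2)/3 + runif(length(Y), min = 0.5, max = 10)   # much noisier than max = 2
X2 = (Y + 3)/2 + runif(length(Y), min = 0.5, max = 10)
cor(X1, X2)                                              # should be noticeably lower than before
summary(lm(Y ~ X1 + X2))$coefficients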


In the worst case, if two explanatory variables are perfectly correlated, the regression coefficients cannot all be estimated. Take the following contrived case for example…

y=seq(from=1, to=20, by=0.1)
y=runif(length(y), min=3, max=5)
x1=seq(from=5, to=24, by=0.1)
x2=x1*2+5
df = data.frame(y, x1, x2)
pairs(df)

cor(df)
##              y          x1          x2
## y   1.00000000 -0.05941648 -0.05941648
## x1 -0.05941648  1.00000000  1.00000000
## x2 -0.05941648  1.00000000  1.00000000
summary(lm(y~x1+x2, dat=df))
## 
## Call:
## lm(formula = y ~ x1 + x2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10909 -0.49517 -0.00991  0.51193  0.92663 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.174062   0.115058  36.278   <2e-16 ***
## x1          -0.006069   0.007417  -0.818    0.414    
## x2                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5652 on 189 degrees of freedom
## Multiple R-squared:  0.00353,    Adjusted R-squared:  -0.001742 
## F-statistic: 0.6696 on 1 and 189 DF,  p-value: 0.4142

We can see that when x1 and x2 are perfectly correlated, the regression coefficient of one of them cannot be estimated. This is a rather extreme and simple example that you might never encounter in your entire life, but it is worth knowing that this is due to the problem of singularity in matrix algebra: the design matrix does not have full rank, so the normal equations have no unique solution.
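
We can check this rank deficiency directly from the design matrix of the model above:

X = model.matrix(y ~ x1 + x2, data = df)   # columns: intercept, x1, x2
ncol(X)                                    # 3 columns
qr(X)$rank                                 # rank 2, because x2 = 2*x1 + 5 exactly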

A more formal way to diagnose the magnitude of collinearity among explanatory variables is the variance inflation factor (VIF).

\[VIF_i = \frac{1}{1 - {R_i}^2}\ ,\] where \({R_i}^2\) is the \(R^2\) of the linear model with \(X_i\) as the response variable (i.e. on the left side of the equation) and the other \(X\)s as the explanatory variables (on the right-hand side).
To the best of my knowledge, there is no built-in function in base R that calculates VIF automatically, so we’ll have to write our own… In fact, it’s not hard to write. For example, the VIF for radiation can be calculated as follows.

vif.rad = 1/(1 - summary(lm(rad~temp+wind+ozone, dat=ozone))$r.squared)
vif.rad
## [1] 1.163376
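
A small helper, sketched below under the illustrative name vif.all, loops over every column of a data frame and computes its VIF against the remaining columns. (The vif() function in the contributed car package does the same job.)

vif.all = function(dat){
  sapply(names(dat), function(v){
    fmla = reformulate(setdiff(names(dat), v), response = v)  # e.g. rad ~ temp + wind + ozone
    r2 = summary(lm(fmla, data = dat))$r.squared
    1 / (1 - r2)                                              # VIF_i = 1 / (1 - R_i^2)
  })
}
vif.all(ozone)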

Generally we want the VIF to be smaller than 5 or 10 (some say 20). However, neither VIF nor correlation coefficients nor plots provide strict rules for deciding whether a correlated variable should be discarded from the model. They only alert you to the potential bias that could result from collinearity.

2.2 Model comparison

2.2.1 \(F\)-test

After deciding which explanatory variables to include, the next step is to compare candidate models. A common way to compare models is the \(F\)-test. The idea of the \(F\)-test is to compare the ability of the models to explain the variance of the response variable. To do so, the \(F\)-statistic is calculated by the following formula,

\[F = \frac {(\frac{RSS_1-RSS_2}{p_2-p_1})}{(\frac{RSS_2}{n-p_2})}\ ,\]

where \(RSS_i\) is the residual sum of squares of \(model_i\), \(p_i\) is the number of parameters in \(model_i\), and \(n\) is the sample size. Basically, \(n-p_i\) is the degrees of freedom of \(model_i\). The probability of observing an \(F\)-statistic at least this large under the \(F\) distribution (with degrees of freedom \(p_2-p_1\) and \(n-p_2\)) tells us whether model 2 explains significantly more variance than model 1: a small probability means the improvement is unlikely to be due to chance alone.
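
This calculation can also be done by hand from two nested lm fits. The sketch below uses the placeholder names fit1 and fit2 (fit1 nested in fit2) and simply translates the formula above; the anova() function used later does the same thing for us.

F.compare = function(fit1, fit2){
  rss1 = deviance(fit1); rss2 = deviance(fit2)        # RSS_1 and RSS_2
  df1 = df.residual(fit1); df2 = df.residual(fit2)    # n - p_1 and n - p_2
  Fval = ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
  c(F = Fval, p = pf(Fval, df1 - df2, df2, lower.tail = FALSE))
}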

Let’s compare the following two models with the Ozone data set.

\[Model_1: Temp = \beta_0 + \beta_1 * Wind\]

\[Model_2: Temp = \beta_0' + \beta_1' * Wind + \beta_2' * Radiation\]

Here \(Model_1\) is nested in \(Model_2\), because if we constrain \(\beta_2'\) to zero (i.e. Radiation has no effect on Temperature), \(Model_2\) reduces to \(Model_1\).

mod1 = lm(temp~wind, data=ozone)
summary(mod1)
## 
## Call:
## lm(formula = temp ~ wind, data = ozone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.111  -5.645   1.013   6.255  18.889 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  91.0226     2.3479  38.767  < 2e-16 ***
## wind         -1.3311     0.2225  -5.982 2.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.307 on 109 degrees of freedom
## Multiple R-squared:  0.2472, Adjusted R-squared:  0.2402 
## F-statistic: 35.78 on 1 and 109 DF,  p-value: 2.851e-08
mod2 = lm(temp~wind+rad, data=ozone)
summary(mod2)
## 
## Call:
## lm(formula = temp ~ wind + rad, data = ozone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2684  -5.0216   0.5859   5.2503  18.4622 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 85.695296   2.924996  29.298  < 2e-16 ***
## wind        -1.251135   0.217133  -5.762 7.95e-08 ***
## rad          0.024525   0.008478   2.893  0.00462 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.039 on 108 degrees of freedom
## Multiple R-squared:  0.3013, Adjusted R-squared:  0.2883 
## F-statistic: 23.28 on 2 and 108 DF,  p-value: 3.91e-09

Before the comparison, one thing is worth pointing out. The regression coefficient of wind changes a little between the two models. This is because wind and radiation are not perfectly independent. For the same reason, the coefficient of radiation in mod2 differs slightly from what you would get by regressing temperature on radiation alone.

Let’s compare the two models. Since the test statistic follows an \(F\) distribution, we can use the anova() function to perform the \(F\)-test.

anova(mod1, mod2)
## Analysis of Variance Table
## 
## Model 1: temp ~ wind
## Model 2: temp ~ wind + rad
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    109 7521.1                                
## 2    108 6980.3  1    540.79 8.3672 0.004621 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output table, we see that the \(RSS\) of mod2 is 6980.314 and the \(RSS\) of mod1 is 7521.107. The \(F\) statistic is 8.367. The interpretation of these results is that mod2 explains a significantly greater amount of variance than mod1, because the probability of seeing this much improvement by chance alone is only 0.004621.


Exercise 2

  1. Challenge yourself by calculating all of the above values manually.

  2. Compare a third model (mod3), with the ozone variable as a third explanatory variable, to mod1 and mod2 above. Does it explain more variance? Does it give you reasonable estimates of the regression coefficients?


Important: the \(F\)-test can only be used to compare “nested” models!

2.2.2 Other methods of model comparison/selection

There are at least two other ways to compare different models.

  1. Likelihood ratio test. This is very useful, especially when comparing non-linear models. However, like the \(F\)-test, the models need to be nested for a valid likelihood ratio test. Basically, it calculates the maximum likelihood of the two models and takes the ratio (twice the difference in log-likelihoods), which follows a \(\chi^2\) distribution. By finding the probability of observing this ratio under the \(\chi^2\) distribution, we can tell which model performs better. Performance in this sense is not the amount of variance of the response variable being explained, but the likelihood of the model generating the observed data. The model with the higher likelihood is considered the “better” model (see the sketch after this list).

  2. AIC value comparison. AIC is an information criterion based on the maximum likelihood of the model. AIC is especially useful when comparing non-nested models. AIC cannot tell us how probable one model is compared to another. Instead, it only tells us which model would lose less information while taking the number of parameters into account. Generally, the smaller the AIC, the less information is lost (the better-performing model). A rule of thumb is that an AIC difference greater than 10 can be considered a meaningful difference, although there are no strict rules, especially when the AIC difference is 4-7.
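
A minimal sketch of both comparisons, using the placeholder names m1 and m2 for two fitted lm objects (m1 nested in m2 for the likelihood ratio test):

lrt = function(m1, m2){
  stat = as.numeric(2 * (logLik(m2) - logLik(m1)))       # likelihood ratio statistic
  df = attr(logLik(m2), "df") - attr(logLik(m1), "df")   # difference in number of parameters
  c(statistic = stat, p.value = pchisq(stat, df, lower.tail = FALSE))
}
# AIC comparison works for non-nested models as well; the smaller AIC is preferred.
# AIC(m1, m2)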

Many other model comparison/selection methods are available. However, they are beyond the scope of this class.


Exercise 3

Compare the AIC of the three models above (mod1, mod2, and mod3). Do you reach a different conclusion about which model performs better?


2.3 Ockham’s razor

Parameters should be used with parsimony in modeling: any parameter that does not contribute significantly to the model (e.g. does not increase the \(R^2\) in an important way) should be eliminated.

2.4 Partial regression coefficient

Let’s say we decide to use mod2 as the final model to explain the data. How do we interpret the regression coefficients?

The regression coefficient of the Wind variable is the effect of Wind on Temperature when the effect of Radiation is controlled for.

This coefficient can also be calculated from the following linear regression model,

\[Temp_{res|rad} = \beta_0 + \beta_1 Wind_{res|rad}\ ,\]

where \(Temp_{res|rad}\) is the residuals of temperature regressed on radiation and \(Wind_{res|rad}\) is the residuals of wind regressed on radiation. From the formula, we see that the effect of wind is estimated after taking the influence of radiation into account.

mod.r = lm(temp~rad, data=ozone)
mod.wr = lm(wind~rad, data=ozone)
summary(lm(residuals(mod.r)~residuals(mod.wr)))
## 
## Call:
## lm(formula = residuals(mod.r) ~ residuals(mod.wr))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2684  -5.0216   0.5859   5.2503  18.4622 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.096e-15  7.596e-01   0.000        1    
## residuals(mod.wr) -1.251e+00  2.161e-01  -5.789 6.92e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.002 on 109 degrees of freedom
## Multiple R-squared:  0.2351, Adjusted R-squared:  0.2281 
## F-statistic: 33.51 on 1 and 109 DF,  p-value: 6.919e-08

We see that the estimated effect is -1.251135, identical to the one estimated from the multiple regression (-1.251135). This coefficient is the partial regression coefficient.

Note that this is different from the following model.

\[Temp_{res|rad} = \beta_0' + \beta_1' Wind\ ,\]

The above model does not take the influence of Radiation on Wind into account.

summary(lm(residuals(mod.r)~wind, data=ozone))
## 
## Call:
## lm(formula = residuals(mod.r) ~ wind, data = ozone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.952  -5.161   0.812   5.346  18.354 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.2330     2.2676   5.395 4.03e-07 ***
## wind         -1.2308     0.2149  -5.727 9.14e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.022 on 109 degrees of freedom
## Multiple R-squared:  0.2313, Adjusted R-squared:  0.2243 
## F-statistic:  32.8 on 1 and 109 DF,  p-value: 9.143e-08