Collinearity, Tests of Nested Models, Categorical Explanatory Variables

SDS 291

November 5, 2019

Announcements

Outline for Today

Visualizing Parallel Slopes

NYC <- read.csv("http://bit.ly/nycData")

Add geom_abline() layers to a qplot() to visualize your model in the data space. How would we add color to the points to differentiate East from West? (One possibility is sketched after the code below.)

library(ggplot2)

qplot(data = NYC, x = Food, y = Price, geom = "jitter", alpha = I(0.25)) +
  geom_abline(intercept = -17.430, slope = 2.875) +
  geom_abline(intercept = -17.430 + 1.459, slope = 2.875, linetype = 5)
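
One way to answer the color question, assuming the East variable in the NYC data is a 0/1 indicator (1 = east of Fifth Avenue). This is a sketch, not the original code:

# Color the points by East/West; the two geom_abline()s are the fitted parallel lines
qplot(data = NYC, x = Food, y = Price, geom = "jitter",
      alpha = I(0.25), color = factor(East)) +
  geom_abline(intercept = -17.430, slope = 2.875) +
  geom_abline(intercept = -17.430 + 1.459, slope = 2.875, linetype = 5)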

Interactions - Different Slope Model

Consider the case where \(X_1\) is quantitative, but \(X_2\) is an indicator variable that can only be 0 or 1 (e.g., \(isWoman\)). Then,

\[\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2 + \hat{\beta_3}X_1X_2 \]

So then,
- For men, \(X_2 = 0\), so \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}(0) + \hat{\beta_3}X_1(0)\), and \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1\) is the fitted regression line for men,
- For women, \(X_2 = 1\), so \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}(1) + \hat{\beta_3}X_1(1)\), and \(\hat{Y} = (\hat{\beta_0} + \hat{\beta_2}) + (\hat{\beta_1}+ \hat{\beta_3})X_1\) is the fitted regression line for women

RailsTrails Example from homework

library(Stat2Data)
data("RailsTrails")
qplot(y = Adj2007, x = Distance, color = GarageGroup, data = RailsTrails) +
  geom_smooth(method = "lm", se = FALSE)

int_mod <- lm(Adj2007 ~ Distance*GarageGroup, data = RailsTrails)
summary(int_mod)
## 
## Call:
## lm(formula = Adj2007 ~ Distance * GarageGroup, data = RailsTrails)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -162.46  -51.65  -17.22   30.04  425.76 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              359.083     21.295  16.862  < 2e-16 ***
## Distance                 -46.302     13.391  -3.458 0.000802 ***
## GarageGroupyes            48.862     28.108   1.738 0.085222 .  
## Distance:GarageGroupyes   -9.878     19.366  -0.510 0.611125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90.96 on 100 degrees of freedom
## Multiple R-squared:  0.2712, Adjusted R-squared:  0.2494 
## F-statistic: 12.41 on 3 and 100 DF,  p-value: 5.785e-07
parallel_mod <- lm(Adj2007 ~ Distance + GarageGroup, data = RailsTrails)
summary(parallel_mod)
## 
## Call:
## lm(formula = Adj2007 ~ Distance + GarageGroup, data = RailsTrails)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -167.88  -51.55  -21.88   36.79  427.49 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     365.103     17.661  20.673   <2e-16 ***
## Distance        -51.025      9.638  -5.294    7e-07 ***
## GarageGroupyes   37.892     18.032   2.101   0.0381 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90.62 on 101 degrees of freedom
## Multiple R-squared:  0.2693, Adjusted R-squared:  0.2549 
## F-statistic: 18.62 on 2 and 101 DF,  p-value: 1.311e-07
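
Anticipating the nested F-test covered later in these notes, we can ask whether the interaction model improves significantly on the parallel slopes model, using the two models already fit above:

# Compare the nested (parallel slopes) model to the full (interaction) model
anova(parallel_mod, int_mod)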

Three-Category Categorical Variables

Three parallel slopes model

three_mod <- lm(Adj2007 ~ Distance + BedGroup, data = RailsTrails)
summary(three_mod)
## 
## Call:
## lm(formula = Adj2007 ~ Distance + BedGroup, data = RailsTrails)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -130.01  -52.78  -17.42   23.13  407.90 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      322.786     24.913  12.957  < 2e-16 ***
## Distance         -45.655      9.513  -4.799 5.58e-06 ***
## BedGroup3 beds    44.514     24.873   1.790 0.076540 .  
## BedGroup4+ beds   96.447     26.616   3.624 0.000459 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.99 on 100 degrees of freedom
## Multiple R-squared:  0.3333, Adjusted R-squared:  0.3133 
## F-statistic: 16.67 on 3 and 100 DF,  p-value: 7.417e-09
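
To draw the three fitted parallel lines in the data space, we can reuse the geom_abline() approach from the NYC example, plugging in the coefficients above (the baseline intercept belongs to the reference BedGroup level):

# Three parallel fitted lines built from the summary() coefficients above
qplot(y = Adj2007, x = Distance, color = BedGroup, data = RailsTrails) +
  geom_abline(intercept = 322.786, slope = -45.655) +            # reference group
  geom_abline(intercept = 322.786 + 44.514, slope = -45.655) +   # 3 beds
  geom_abline(intercept = 322.786 + 96.447, slope = -45.655)     # 4+ beds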

Three different slopes model

qplot(y = Adj2007, x = Distance, color = BedGroup, data = RailsTrails) +
  geom_smooth(method = "lm", se = FALSE)

three_mod <- lm(Adj2007 ~ Distance*BedGroup, data = RailsTrails)
summary(three_mod)
## 
## Call:
## lm(formula = Adj2007 ~ Distance * BedGroup, data = RailsTrails)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -132.12  -54.10  -17.78   24.02  407.23 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                311.76      32.78   9.510  1.4e-15 ***
## Distance                   -37.02      19.07  -1.941  0.05514 .  
## BedGroup3 beds              58.22      38.92   1.496  0.13789    
## BedGroup4+ beds            111.50      39.33   2.835  0.00557 ** 
## Distance:BedGroup3 beds    -10.67      23.07  -0.462  0.64482    
## Distance:BedGroup4+ beds   -14.00      28.71  -0.488  0.62683    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.75 on 98 degrees of freedom
## Multiple R-squared:  0.3353, Adjusted R-squared:  0.3014 
## F-statistic: 9.888 on 5 and 98 DF,  p-value: 1.122e-07
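
The fitted slope for each group comes from adding the relevant interaction estimate to the Distance coefficient. Here three_mod is the interaction model just fit; a quick sketch of the arithmetic using its coefficient names:

# Group-specific slopes implied by the interaction model
coef(three_mod)["Distance"]                                                # reference: -37.02
coef(three_mod)["Distance"] + coef(three_mod)["Distance:BedGroup3 beds"]   # -37.02 - 10.67 = -47.69
coef(three_mod)["Distance"] + coef(three_mod)["Distance:BedGroup4+ beds"]  # -37.02 - 14.00 = -51.02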

Multicollinearity

What is it?

Multicollinearity occurs when two or more explanatory variables in a regression model are highly correlated with one another.

Why is it a problem?

Intuition & an Example

Modeling Weight as a function of Height (\(X_1\)) and Inseam (\(X_2\)): \[weight=\beta_0+\beta_1 Height+ \beta_2 Inseam + \epsilon\]

Effects of Multicollinearity

If predictors are highly correlated among themselves:

- coefficient estimates become unstable, changing sharply when a correlated predictor is added or dropped,
- standard errors of those coefficients inflate, so individual t-tests can be non-significant even when the overall F-test is highly significant,
- individual coefficients become hard to interpret as the effect of one predictor “holding the others constant.”

Example of Height and Inseam

We want to model Weight as a function of Height, Inseam, and Age

For the purposes of this example we simulate data, which we call rdata, with the following correlation matrix.
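
The notes don’t show the simulation code; a minimal sketch of one way to generate such data, assuming a multivariate normal draw via MASS::mvrnorm (the seed and the use of standardized variables are assumptions):

# Hypothetical reconstruction: draw 100 observations from a multivariate
# normal whose correlation matrix matches the one printed below.
library(MASS)
Sigma <- matrix(c(1.000, 0.789, 0.807, 0.500,
                  0.789, 1.000, 0.956, 0.377,
                  0.807, 0.956, 1.000, 0.370,
                  0.500, 0.377, 0.370, 1.000),
                nrow = 4, byrow = TRUE)
set.seed(291)  # assumed seed; the notes' seed is unknown
rdata <- as.data.frame(mvrnorm(n = 100, mu = rep(0, 4), Sigma = Sigma))
names(rdata) <- c("weight", "height", "inseam", "age")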

Correlation Coefficients

# Correlation coefficient matrix
cor(rdata) 
##           weight    height    inseam       age
## weight 1.0000000 0.7885873 0.8073210 0.5002686
## height 0.7885873 1.0000000 0.9559604 0.3773555
## inseam 0.8073210 0.9559604 1.0000000 0.3702648
## age    0.5002686 0.3773555 0.3702648 1.0000000
library(GGally)
ggpairs(rdata)

Model 1: Height and Age

m1 <- lm(weight ~ height + age, rdata)
summary(m1)
## 
## Call:
## lm(formula = weight ~ height + age, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22772 -0.42874 -0.00997  0.32386  1.59323 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.06948    0.05993   1.159   0.2491    
## height       0.69118    0.06226  11.101   <2e-16 ***
## age          0.23715    0.06322   3.751   0.0003 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5978 on 97 degrees of freedom
## Multiple R-squared:  0.6698, Adjusted R-squared:  0.663 
## F-statistic: 98.37 on 2 and 97 DF,  p-value: < 2.2e-16

Model 2: Inseam and Age

m2 <- lm(weight ~ inseam + age, rdata)
summary(m2)
## 
## Call:
## lm(formula = weight ~ inseam + age, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06698 -0.39223 -0.04525  0.37086  1.58523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.08771    0.05733   1.530 0.129282    
## inseam       0.71201    0.05925  12.017  < 2e-16 ***
## age          0.23413    0.06020   3.889 0.000184 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.571 on 97 degrees of freedom
## Multiple R-squared:  0.6987, Adjusted R-squared:  0.6925 
## F-statistic: 112.5 on 2 and 97 DF,  p-value: < 2.2e-16

Model 3: Height, Inseam and Age

m3 <- lm(weight ~ height + inseam + age, data = rdata)
summary(m3)
## 
## Call:
## lm(formula = weight ~ height + inseam + age, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08853 -0.39992 -0.01961  0.35937  1.58092 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.08505    0.05761   1.476 0.143118    
## height       0.13131    0.18885   0.695 0.488537    
## inseam       0.58788    0.18815   3.125 0.002355 ** 
## age          0.23052    0.06058   3.805 0.000249 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5725 on 96 degrees of freedom
## Multiple R-squared:  0.7003, Adjusted R-squared:  0.6909 
## F-statistic: 74.76 on 3 and 96 DF,  p-value: < 2.2e-16
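
Notice what collinearity did here: compared with Models 1 and 2, the standard errors on height and inseam have roughly tripled (from about 0.06 to about 0.19), height’s coefficient is no longer distinguishable from zero, yet R-squared barely changed. The age coefficient, nearly uncorrelated with the other predictors, is stable across all three models.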

True collinearity

In the most extreme case, one variable is an exact function of another:

Let’s assume we’re in a world where everyone’s inseam is the same fixed proportion of their height: inseam = 0.34 * height.
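
The construction code isn’t shown in the original; a hedged sketch of how such perfectly collinear data could be built (the seed, means, and noise level are assumptions, chosen only to roughly mimic the output below):

# Hypothetical reconstruction of weightdata: inseam_true is an exact
# linear function of height_true, so the two are perfectly collinear.
set.seed(291)                                        # assumed seed
height_true <- rnorm(100, mean = 60, sd = 3)         # assumed distribution
inseam_true <- 0.34 * height_true                    # exact function of height
weight_true <- 3 * height_true + rnorm(100, sd = 1)  # slope near 3, as in the fit below
weightdata  <- data.frame(weight_true, height_true, inseam_true)
cor(weightdata)                                      # yields a matrix like the one below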

##             weight_true height_true inseam_true
## weight_true   1.0000000   0.9982694   0.9982694
## height_true   0.9982694   1.0000000   1.0000000
## inseam_true   0.9982694   1.0000000   1.0000000

## 
## Call:
## lm(formula = weight_true ~ height_true + inseam_true, data = weightdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5346 -0.7053  0.1223  0.7281  1.9631 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.73503    1.08534  -0.677      0.5    
## height_true  3.01134    0.01792 168.050   <2e-16 ***
## inseam_true       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.994 on 98 degrees of freedom
## Multiple R-squared:  0.9965, Adjusted R-squared:  0.9965 
## F-statistic: 2.824e+04 on 1 and 98 DF,  p-value: < 2.2e-16

We constructed that example to be exactly collinear (hence the "_true" suffix on the variable names), although in reality this is unlikely: two people with the same height can have different leg lengths.

How do we detect multicollinearity?

Original example (not perfect collinearity): What is the correlation between height and inseam?

##   ht_inseam_corr
## 1      0.9559604
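
The code that produced this data frame isn’t shown; one hypothetical way to reproduce it:

# One way to compute the printed value (hypothetical; the original code isn't shown)
data.frame(ht_inseam_corr = cor(rdata$height, rdata$inseam))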

Exploring the VIF - calculating it manually

The variance inflation factor for predictor \(j\) is \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) is the \(R^2\) from regressing predictor \(j\) on the other predictors.

vif0 <- lm(height ~ inseam + age, data = rdata)
summary(vif0)
## 
## Call:
## lm(formula = height ~ inseam + age, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59321 -0.23306 -0.01279  0.18597  0.75928 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02024    0.03090   0.655    0.514    
## inseam       0.94533    0.03194  29.595   <2e-16 ***
## age          0.02753    0.03245   0.848    0.398    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3078 on 97 degrees of freedom
## Multiple R-squared:  0.9145, Adjusted R-squared:  0.9127 
## F-statistic: 518.7 on 2 and 97 DF,  p-value: < 2.2e-16
1/(1-summary(vif0)$r.squared)
## [1] 11.69519
vif1 <- lm(inseam ~ height + age, data = rdata)
summary(vif1)
## 
## Call:
## lm(formula = inseam ~ height + age, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81922 -0.16569  0.01617  0.19982  0.55612 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.02648    0.03097  -0.855    0.395    
## height       0.95236    0.03218  29.595   <2e-16 ***
## age          0.01129    0.03267   0.345    0.730    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.309 on 97 degrees of freedom
## Multiple R-squared:  0.914,  Adjusted R-squared:  0.9122 
## F-statistic: 515.2 on 2 and 97 DF,  p-value: < 2.2e-16
1/(1-summary(vif1)$r.squared)
## [1] 11.62334
vif2 <- lm(age ~ inseam + height, data = rdata)
summary(vif2)
## 
## Call:
## lm(formula = age ~ inseam + height, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.79279 -0.54459  0.00692  0.55481  2.15850 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.06224    0.09634   0.646    0.520
## inseam       0.10887    0.31514   0.345    0.730
## height       0.26751    0.31534   0.848    0.398
## 
## Residual standard error: 0.9596 on 97 degrees of freedom
## Multiple R-squared:  0.1435, Adjusted R-squared:  0.1258 
## F-statistic: 8.123 on 2 and 97 DF,  p-value: 0.0005476
1/(1-summary(vif2)$r.squared)
## [1] 1.167476

We can use the vif() function from the car package to compute these automatically from a model that includes all three explanatory variables.

# Model 3: all three predictors
m3 <- lm(weight ~ height + inseam + age, data = rdata)

library(car)
vif(m3)
##    height    inseam       age 
## 11.695190 11.623342  1.167476
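
A common rule of thumb is that a VIF above 5 (some texts use 10) signals problematic collinearity. Here height and inseam both have VIFs near 11.7, while age sits close to 1, the value we would see with no collinearity at all.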

What can you do about collinearity?

  1. Drop some of the predictors

  2. Combine some of the predictors (a simple version is sketched after this list)

  3. Pay less attention to the individual coefficients and tests (depends on your goal…)
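
As an illustration of option 2, a minimal sketch (the combined predictor size is a hypothetical construction, not from the original notes):

# Option 2 sketch: replace two collinear predictors with one combination.
# An average is one simple choice; "size" is a hypothetical name.
rdata$size <- (rdata$height + rdata$inseam) / 2
m4 <- lm(weight ~ size + age, data = rdata)
summary(m4)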

Some other things you might see (Google is a black hole…)

Collinearity Summary

Comparing Models: Nested F-Tests

Tests to Compare Two Regression Lines

\(Y\) = Active pulse
\(X_1\) = Resting pulse
\(X_2\) = Female (vs. male; a 1/0 indicator)

\[ActivePulse=\beta_0 + \beta_1 RestingPulse + \beta_2 Female + \beta_3 RestingPulse \times Female + \epsilon\]

How do we test whether these models are significantly different?

For females (\(X_2 = 1\)):

\[ActivePulse_{Females} = \beta_0 + \beta_1 RestingPulse + \beta_2(1) + \beta_3(1) RestingPulse + \epsilon = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) RestingPulse + \epsilon\]


How do we test \(\beta_2\) and \(\beta_3\) together?

Nested Models

Nested F-Test

Basic idea:

  1. Find how much “extra” variability is explained by the “new” terms being tested.

  2. Divide by the number of new terms to get a Mean Square for the new part of the model.

  3. Divide this Mean Square by the MSE for the “full” model to get a test statistic.

  4. Compare the test statistic (t.s.) to an F-distribution.

\[ t.s. = \frac {\frac {SSModel_{Full} - SSModel_{Nested}} {k_{Full} - k_{Nested}}} {\frac {SSE_{Full}} {n - k_{Full} - 1}}\]

Formally, the hypotheses concern the coefficients \(\beta_i\) on all predictors that are included in the full model but not in the nested model:

\(H_0:\) \(\beta_i = 0\) for every such \(\beta_i\)

\(H_A:\) \(\beta_i \neq 0\) for at least one such \(\beta_i\)

Conceptually:

\(H_0:\) The nested (or smaller) model is all we need.

\(H_A:\) We need the full model.

Nested F-Test Example: Calculating the nested F test statistic manually

Let’s calculate the Nested F Test Statistic for the Pulse Example:

library(Stat2Data)
data("Pulse")

fullmodel <- lm(Active ~ Rest + Sex + Rest*Sex, data = Pulse)  # Rest*Sex alone expands to the same model
nestedmodel <- lm(Active ~ Rest, data = Pulse)

Full Model: ANOVA Table

anova(fullmodel)
## Analysis of Variance Table
## 
## Response: Active
##            Df Sum Sq Mean Sq  F value Pr(>F)    
## Rest        1  29868 29867.9 132.6550 <2e-16 ***
## Sex         1    504   503.7   2.2373 0.1361    
## Rest:Sex    1    114   113.5   0.5043 0.4784    
## Residuals 228  51335   225.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Nested Model: ANOVA Table

anova(nestedmodel)
## Analysis of Variance Table
## 
## Response: Active
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Rest        1  29868 29867.9  132.23 < 2.2e-16 ***
## Residuals 230  51953   225.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What are the values of the following: \(SSModel_{Full} - SSModel_{Nested}\), \(k_{Full} - k_{Nested}\), \(SSE_{Full}\), and \(n - k_{Full} - 1\)? (A worked version follows below.)
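
A sketch of the arithmetic, using the (rounded) sums of squares printed in the two ANOVA tables above:

# Manual nested F statistic from the ANOVA tables (rounded values)
SSE_nested <- 51953                  # residual SS, nested model (Rest only)
SSE_full   <- 51335                  # residual SS, full model
extra_SS   <- SSE_nested - SSE_full  # ~617: extra variability explained by the Sex terms
MS_new     <- extra_SS / 2           # two new terms: Sex and Rest:Sex
MSE_full   <- SSE_full / 228         # residual df of the full model
MS_new / MSE_full                    # ~1.37, matching anova() below
pf(MS_new / MSE_full, df1 = 2, df2 = 228, lower.tail = FALSE)  # p ~ 0.26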

Nested F-Test Example: Calculating the nested F test statistic automatically

anova(nestedmodel,fullmodel)
## Analysis of Variance Table
## 
## Model 1: Active ~ Rest
## Model 2: Active ~ Rest + Sex + Rest * Sex
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    230 51953                           
## 2    228 51335  2    617.27 1.3708  0.256
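
With \(F = 1.37\) on 2 and 228 df and \(p = 0.256\), we fail to reject \(H_0\): the nested model with Rest alone is all we need, and there is no evidence that the regression lines for females and males differ.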