Collinearity, Tests of Nested Models, Categorical Explanatory Variables

SDS 291

November 5, 2019

Announcements

Outline for Today

Visualizing Parallel Slopes

NYC <- read.csv("http://bit.ly/nycData")

Add geom_abline() layers to a qplot() to visualize your model in the data space. How would we add color to the points to differentiate East from West? (One possibility is sketched after the code below.)

library(ggplot2)

qplot(data = NYC, x = Food, y = Price, geom = "jitter", alpha = I(0.25)) +
  geom_abline(intercept = -17.430, slope = 2.875) +
  geom_abline(intercept = -17.430 + 1.459, slope = 2.875, linetype = 5)
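
One way to answer the color question, assuming the East variable in the NYC data is a 0/1 indicator (1 = east of Fifth Avenue). This is a sketch, not the original code:

# Color the points by East/West; the two geom_abline()s are the fitted parallel lines
qplot(data = NYC, x = Food, y = Price, geom = "jitter",
      alpha = I(0.25), color = factor(East)) +
  geom_abline(intercept = -17.430, slope = 2.875) +
  geom_abline(intercept = -17.430 + 1.459, slope = 2.875, linetype = 5)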

Interactions - Different Slope Model

Consider the case where \(X_1\) is quantitative, but \(X_2\) is an indicator variable that can only be 0 or 1 (e.g., \(isWoman\)). Then,

\[\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2 + \hat{\beta_3}X_1X_2 \]

So then,
- For men, \(X_2 = 0\), so \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}(0) + \hat{\beta_3}X_1(0)\), and \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1\) is the fitted regression line for men,
- For women, \(X_2 = 1\), so \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}(1) + \hat{\beta_3}X_1(1)\), and \(\hat{Y} = (\hat{\beta_0} + \hat{\beta_2}) + (\hat{\beta_1}+ \hat{\beta_3})X_1\) is the fitted regression line for women

RailsTrails Example from homework

library(Stat2Data)
data("RailsTrails")
qplot(y = Adj2007, x = Distance, color = GarageGroup, data = RailsTrails) +
  geom_smooth(method = "lm", se = FALSE)

int_mod <- lm(Adj2007 ~ Distance*GarageGroup, data = RailsTrails)
summary(int_mod)
## 
## Call:
## lm(formula = Adj2007 ~ Distance * GarageGroup, data = RailsTrails)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -162.46  -51.65  -17.22   30.04  425.76 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              359.083     21.295  16.862  < 2e-16 ***
## Distance                 -46.302     13.391  -3.458 0.000802 ***
## GarageGroupyes            48.862     28.108   1.738 0.085222 .  
## Distance:GarageGroupyes   -9.878     19.366  -0.510 0.611125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90.96 on 100 degrees of freedom
## Multiple R-squared:  0.2712, Adjusted R-squared:  0.2494 
## F-statistic: 12.41 on 3 and 100 DF,  p-value: 5.785e-07
parallel_mod <- lm(Adj2007 ~ Distance + GarageGroup, data = RailsTrails)
summary(parallel_mod)
## 
## Call:
## lm(formula = Adj2007 ~ Distance + GarageGroup, data = RailsTrails)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -167.88  -51.55  -21.88   36.79  427.49 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     365.103     17.661  20.673   <2e-16 ***
## Distance        -51.025      9.638  -5.294    7e-07 ***
## GarageGroupyes   37.892     18.032   2.101   0.0381 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90.62 on 101 degrees of freedom
## Multiple R-squared:  0.2693, Adjusted R-squared:  0.2549 
## F-statistic: 18.62 on 2 and 101 DF,  p-value: 1.311e-07
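
Anticipating the nested F-test covered later in these notes, we can ask whether the interaction model improves significantly on the parallel slopes model, using the two models already fit above:

# Compare the nested (parallel slopes) model to the full (interaction) model
anova(parallel_mod, int_mod)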

Three-Category Categorical Variables

Three parallel slopes model

three_mod <- lm(Adj2007 ~ Distance + BedGroup, data = RailsTrails)
summary(three_mod)
## 
## Call:
## lm(formula = Adj2007 ~ Distance + BedGroup, data = RailsTrails)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -130.01  -52.78  -17.42   23.13  407.90 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      322.786     24.913  12.957  < 2e-16 ***
## Distance         -45.655      9.513  -4.799 5.58e-06 ***
## BedGroup3 beds    44.514     24.873   1.790 0.076540 .  
## BedGroup4+ beds   96.447     26.616   3.624 0.000459 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86.99 on 100 degrees of freedom
## Multiple R-squared:  0.3333, Adjusted R-squared:  0.3133 
## F-statistic: 16.67 on 3 and 100 DF,  p-value: 7.417e-09
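
To draw the three fitted parallel lines in the data space, we can reuse the geom_abline() approach from the NYC example, plugging in the coefficients above (the baseline intercept belongs to the reference BedGroup level):

# Three parallel fitted lines built from the summary() coefficients above
qplot(y = Adj2007, x = Distance, color = BedGroup, data = RailsTrails) +
  geom_abline(intercept = 322.786, slope = -45.655) +            # reference group
  geom_abline(intercept = 322.786 + 44.514, slope = -45.655) +   # 3 beds
  geom_abline(intercept = 322.786 + 96.447, slope = -45.655)     # 4+ beds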

Three different slopes model

qplot(y = Adj2007, x = Distance, color = BedGroup, data = RailsTrails) +
  geom_smooth(method = "lm", se = FALSE)

three_mod <- lm(Adj2007 ~ Distance*BedGroup, data = RailsTrails)
summary(three_mod)
## 
## Call:
## lm(formula = Adj2007 ~ Distance * BedGroup, data = RailsTrails)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -132.12  -54.10  -17.78   24.02  407.23 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                311.76      32.78   9.510  1.4e-15 ***
## Distance                   -37.02      19.07  -1.941  0.05514 .  
## BedGroup3 beds              58.22      38.92   1.496  0.13789    
## BedGroup4+ beds            111.50      39.33   2.835  0.00557 ** 
## Distance:BedGroup3 beds    -10.67      23.07  -0.462  0.64482    
## Distance:BedGroup4+ beds   -14.00      28.71  -0.488  0.62683    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.75 on 98 degrees of freedom
## Multiple R-squared:  0.3353, Adjusted R-squared:  0.3014 
## F-statistic: 9.888 on 5 and 98 DF,  p-value: 1.122e-07
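
The fitted slope for each group comes from adding the relevant interaction estimate to the Distance coefficient. Here three_mod is the interaction model just fit; a quick sketch of the arithmetic using its coefficient names:

# Group-specific slopes implied by the interaction model
coef(three_mod)["Distance"]                                                # reference: -37.02
coef(three_mod)["Distance"] + coef(three_mod)["Distance:BedGroup3 beds"]   # -37.02 - 10.67 = -47.69
coef(three_mod)["Distance"] + coef(three_mod)["Distance:BedGroup4+ beds"]  # -37.02 - 14.00 = -51.02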

Multicollinearity

What is it?

Multicollinearity occurs when two or more explanatory variables in a regression model are highly correlated with one another.

Why is it a problem?

Intuition & an Example

Modeling Weight as a function of Height (\(X_1\)) and Inseam (\(X_2\)): \[weight=\beta_0+\beta_1 Height+ \beta_2 Inseam + \epsilon\]

Effects of Multicollinearity

If predictors are highly correlated among themselves:

- coefficient estimates become unstable, changing sharply when a correlated predictor is added or dropped,
- standard errors of those coefficients inflate, so individual t-tests can be non-significant even when the overall F-test is highly significant,
- individual coefficients become hard to interpret as the effect of one predictor “holding the others constant.”

Example of Height and Inseam

We want to model Weight as a function of Height, Inseam, and Age

For the purposes of this example we simulate data, which we call rdata, with the following correlation matrix.
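
The notes don’t show the simulation code; a minimal sketch of one way to generate such data, assuming a multivariate normal draw via MASS::mvrnorm (the seed and the use of standardized variables are assumptions):

# Hypothetical reconstruction: draw 100 observations from a multivariate
# normal whose correlation matrix matches the one printed below.
library(MASS)
Sigma <- matrix(c(1.000, 0.789, 0.807, 0.500,
                  0.789, 1.000, 0.956, 0.377,
                  0.807, 0.956, 1.000, 0.370,
                  0.500, 0.377, 0.370, 1.000),
                nrow = 4, byrow = TRUE)
set.seed(291)  # assumed seed; the notes' seed is unknown
rdata <- as.data.frame(mvrnorm(n = 100, mu = rep(0, 4), Sigma = Sigma))
names(rdata) <- c("weight", "height", "inseam", "age")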

Correlation Coefficients

# Correlation coefficient matrix
cor(rdata) 
##           weight    height    inseam       age
## weight 1.0000000 0.7885873 0.8073210 0.5002686
## height 0.7885873 1.0000000 0.9559604 0.3773555
## inseam 0.8073210 0.9559604 1.0000000 0.3702648
## age    0.5002686 0.3773555 0.3702648 1.0000000
library(GGally)
ggpairs(rdata)

Model 1: Height and Age

m1 <- lm(weight ~ height + age, rdata)
summary(m1)
## 
## Call:
## lm(formula = weight ~ height + age, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22772 -0.42874 -0.00997  0.32386  1.59323 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.06948    0.05993   1.159   0.2491    
## height       0.69118    0.06226  11.101   <2e-16 ***
## age          0.23715    0.06322   3.751   0.0003 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5978 on 97 degrees of freedom
## Multiple R-squared:  0.6698, Adjusted R-squared:  0.663 
## F-statistic: 98.37 on 2 and 97 DF,  p-value: < 2.2e-16

Model 2: Inseam and Age

m2 <- lm(weight ~ inseam + age, rdata)
summary(m2)
## 
## Call:
## lm(formula = weight ~ inseam + age, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06698 -0.39223 -0.04525  0.37086  1.58523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.08771    0.05733   1.530 0.129282    
## inseam       0.71201    0.05925  12.017  < 2e-16 ***
## age          0.23413    0.06020   3.889 0.000184 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.571 on 97 degrees of freedom
## Multiple R-squared:  0.6987, Adjusted R-squared:  0.6925 
## F-statistic: 112.5 on 2 and 97 DF,  p-value: < 2.2e-16

Model 3: Height, Inseam and Age

m3 <- lm(weight ~ height + inseam + age, data = rdata)
summary(m3)
## 
## Call:
## lm(formula = weight ~ height + inseam + age, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08853 -0.39992 -0.01961  0.35937  1.58092 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.08505    0.05761   1.476 0.143118    
## height       0.13131    0.18885   0.695 0.488537    
## inseam       0.58788    0.18815   3.125 0.002355 ** 
## age          0.23052    0.06058   3.805 0.000249 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5725 on 96 degrees of freedom
## Multiple R-squared:  0.7003, Adjusted R-squared:  0.6909 
## F-statistic: 74.76 on 3 and 96 DF,  p-value: < 2.2e-16
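
Notice what collinearity did here: compared with Models 1 and 2, the standard errors on height and inseam have roughly tripled (from about 0.06 to about 0.19), height’s coefficient is no longer distinguishable from zero, yet R-squared barely changed. The age coefficient, nearly uncorrelated with the other predictors, is stable across all three models.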

True collinearity

In the most extreme case, one variable is an exact function of another:

Let’s assume we’re in a world where everyone’s inseam is the same fixed proportion of their height: inseam = 0.34 * height.
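
The construction code isn’t shown in the original; a hedged sketch of how such perfectly collinear data could be built (the seed, means, and noise level are assumptions, chosen only to roughly mimic the output below):

# Hypothetical reconstruction of weightdata: inseam_true is an exact
# linear function of height_true, so the two are perfectly collinear.
set.seed(291)                                        # assumed seed
height_true <- rnorm(100, mean = 60, sd = 3)         # assumed distribution
inseam_true <- 0.34 * height_true                    # exact function of height
weight_true <- 3 * height_true + rnorm(100, sd = 1)  # slope near 3, as in the fit below
weightdata  <- data.frame(weight_true, height_true, inseam_true)
cor(weightdata)                                      # yields a matrix like the one below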

##             weight_true height_true inseam_true
## weight_true   1.0000000   0.9982694   0.9982694
## height_true   0.9982694   1.0000000   1.0000000
## inseam_true   0.9982694   1.0000000   1.0000000

## 
## Call:
## lm(formula = weight_true ~ height_true + inseam_true, data = weightdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5346 -0.7053  0.1223  0.7281  1.9631 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.73503    1.08534  -0.677      0.5    
## height_true  3.01134    0.01792 168.050   <2e-16 ***
## inseam_true       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.994 on 98 degrees of freedom
## Multiple R-squared:  0.9965, Adjusted R-squared:  0.9965 
## F-statistic: 2.824e+04 on 1 and 98 DF,  p-value: < 2.2e-16

We constructed that example to be exactly collinear (hence the "_true" suffix on the variable names), although in reality this is unlikely: two people with the same height can have different leg lengths.

How do we detect multicollinearity?

Original example (not perfect collinearity): What is the correlation between height and inseam?

##   ht_inseam_corr
## 1      0.9559604
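
The code that produced this data frame isn’t shown; one hypothetical way to reproduce it:

# One way to compute the printed value (hypothetical; the original code isn't shown)
data.frame(ht_inseam_corr = cor(rdata$height, rdata$inseam))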

Exploring the VIF - calculating it manually

The variance inflation factor for predictor \(j\) is \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) is the \(R^2\) from regressing predictor \(j\) on the other predictors.

vif0 <- lm(height ~ inseam + age, data = rdata)
summary(vif0)
## 
## Call:
## lm(formula = height ~ inseam + age, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59321 -0.23306 -0.01279  0.18597  0.75928 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02024    0.03090   0.655    0.514    
## inseam       0.94533    0.03194  29.595   <2e-16 ***
## age          0.02753    0.03245   0.848    0.398    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3078 on 97 degrees of freedom
## Multiple R-squared:  0.9145, Adjusted R-squared:  0.9127 
## F-statistic: 518.7 on 2 and 97 DF,  p-value: < 2.2e-16
1/(1-summary(vif0)$r.squared)
## [1] 11.69519
vif1 <- lm(inseam ~ height + age, data = rdata)
summary(vif1)
## 
## Call:
## lm(formula = inseam ~ height + age, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81922 -0.16569  0.01617  0.19982  0.55612 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.02648    0.03097  -0.855    0.395    
## height       0.95236    0.03218  29.595   <2e-16 ***
## age          0.01129    0.03267   0.345    0.730    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.309 on 97 degrees of freedom
## Multiple R-squared:  0.914,  Adjusted R-squared:  0.9122 
## F-statistic: 515.2 on 2 and 97 DF,  p-value: < 2.2e-16
1/(1-summary(vif1)$r.squared)
## [1] 11.62334
vif2 <- lm(age ~ inseam + height, data = rdata)
summary(vif2)
## 
## Call:
## lm(formula = age ~ inseam + height, data = rdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.79279 -0.54459  0.00692  0.55481  2.15850 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.06224    0.09634   0.646    0.520
## inseam       0.10887    0.31514   0.345    0.730
## height       0.26751    0.31534   0.848    0.398
## 
## Residual standard error: 0.9596 on 97 degrees of freedom
## Multiple R-squared:  0.1435, Adjusted R-squared:  0.1258 
## F-statistic: 8.123 on 2 and 97 DF,  p-value: 0.0005476
1/(1-summary(vif2)$r.squared)
## [1] 1.167476

We can use the vif() function from the car package to compute these automatically from a model that includes all three explanatory variables.

# Model 3: all three predictors
m3 <- lm(weight ~ height + inseam + age, data = rdata)

library(car)
vif(m3)
##    height    inseam       age 
## 11.695190 11.623342  1.167476
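
A common rule of thumb is that a VIF above 5 (some texts use 10) signals problematic collinearity. Here height and inseam both have VIFs near 11.7, while age sits close to 1, the value we would see with no collinearity at all.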

What can you do about collinearity?

  1. Drop some of the predictors

  2. Combine some of the predictors (a simple version is sketched after this list)

  3. Pay less attention to the individual coefficients and tests (depends on your goal…)
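
As an illustration of option 2, a minimal sketch (the combined predictor size is a hypothetical construction, not from the original notes):

# Option 2 sketch: replace two collinear predictors with one combination.
# An average is one simple choice; "size" is a hypothetical name.
rdata$size <- (rdata$height + rdata$inseam) / 2
m4 <- lm(weight ~ size + age, data = rdata)
summary(m4)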

Some other things you might see (Google is a black hole…)

Collinearity Summary

Comparing Models: Nested F-Tests

Tests to Compare Two Regression Lines

\(Y\) = Active pulse
\(X_1\) = Resting pulse
\(X_2\) = Female (vs. male; a 1/0 indicator)

\[ActivePulse=\beta_0 + \beta_1 RestingPulse + \beta_2 Female + \beta_3 RestingPulse \times Female + \epsilon\]

How do we test whether these models are significantly different?

For females (\(X_2 = 1\)):

\[ActivePulse_{Females} = \beta_0 + \beta_1 RestingPulse + \beta_2(1) + \beta_3(1) RestingPulse + \epsilon = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) RestingPulse + \epsilon\]


How do we test \(\beta_2\) and \(\beta_3\) together?

Nested Models

Nested F-Test

Basic idea:

  1. Find how much “extra” variability is explained by the “new” terms being tested.

  2. Divide by the number of new terms to get a Mean Square for the new part of the model.

  3. Divide this Mean Square by the MSE for the “full” model to get a test statistic.

  4. Compare the test statistic (t.s.) to an F-distribution.

\[ t.s. = \frac {\frac {SSModel_{Full} - SSModel_{Nested}} {k_{Full} - k_{Nested}}} {\frac {SSE_{Full}} {n - k_{Full} - 1}}\]

Formally, the hypotheses concern the coefficients \(\beta_i\) on all predictors that are included in the full model but not in the nested model:

\(H_0:\) \(\beta_i = 0\) for every such \(\beta_i\)

\(H_A:\) \(\beta_i \neq 0\) for at least one such \(\beta_i\)

Conceptually:

\(H_0:\) The nested (or smaller) model is all we need.

\(H_A:\) We need the full model.

Nested F-Test Example: Calculating the nested F test statistic manually

Let’s calculate the Nested F Test Statistic for the Pulse Example:

library(Stat2Data)
data("Pulse")

fullmodel <- lm(Active ~ Rest + Sex + Rest*Sex, data = Pulse)  # Rest*Sex alone expands to the same model
nestedmodel <- lm(Active ~ Rest, data = Pulse)

Full Model: ANOVA Table

anova(fullmodel)
## Analysis of Variance Table
## 
## Response: Active
##            Df Sum Sq Mean Sq  F value Pr(>F)    
## Rest        1  29868 29867.9 132.6550 <2e-16 ***
## Sex         1    504   503.7   2.2373 0.1361    
## Rest:Sex    1    114   113.5   0.5043 0.4784    
## Residuals 228  51335   225.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Nested Model: ANOVA Table

anova(nestedmodel)
## Analysis of Variance Table
## 
## Response: Active
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Rest        1  29868 29867.9  132.23 < 2.2e-16 ***
## Residuals 230  51953   225.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What are the values of the following: \(SSModel_{Full} - SSModel_{Nested}\), \(k_{Full} - k_{Nested}\), \(SSE_{Full}\), and \(n - k_{Full} - 1\)? (A worked version follows below.)
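
A sketch of the arithmetic, using the (rounded) sums of squares printed in the two ANOVA tables above:

# Manual nested F statistic from the ANOVA tables (rounded values)
SSE_nested <- 51953                  # residual SS, nested model (Rest only)
SSE_full   <- 51335                  # residual SS, full model
extra_SS   <- SSE_nested - SSE_full  # ~617: extra variability explained by the Sex terms
MS_new     <- extra_SS / 2           # two new terms: Sex and Rest:Sex
MSE_full   <- SSE_full / 228         # residual df of the full model
MS_new / MSE_full                    # ~1.37, matching anova() below
pf(MS_new / MSE_full, df1 = 2, df2 = 228, lower.tail = FALSE)  # p ~ 0.26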

Nested F-Test Example: Calculating the nested F test statistic automatically

anova(nestedmodel,fullmodel)
## Analysis of Variance Table
## 
## Model 1: Active ~ Rest
## Model 2: Active ~ Rest + Sex + Rest * Sex
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    230 51953                           
## 2    228 51335  2    617.27 1.3708  0.256
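
With \(F = 1.37\) on 2 and 228 df and \(p = 0.256\), we fail to reject \(H_0\): the nested model with Rest alone is all we need, and there is no evidence that the regression lines for females and males differ.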