SDS 291
November 5, 2019
HW 6 assigned: 3.32, 3.44, 3.45 (c only), 3.47, 3.48 (e only), 3.56
Office hours 1:30 - 3:00pm today
Interactions
Multicollinearity
Add geom_abline() layers to a qplot() to visualize your model in the data space. How would we add color to the points to differentiate East from West?
library(ggplot2)
qplot(data = NYC, x = Food, y = Price, geom = "jitter", alpha = 0.25) +
geom_abline(intercept = -17.430, slope = 2.875) +
geom_abline(intercept = -17.430 + 1.459, slope = 2.875, linetype = 5)
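One way to answer the color question (a sketch, assuming the NYC data carry the 0/1 East indicator used in the model above):

# color the points by location; factor() makes East discrete rather than continuous
qplot(data = NYC, x = Food, y = Price, color = factor(East), geom = "jitter") +
geom_abline(intercept = -17.430, slope = 2.875) +
geom_abline(intercept = -17.430 + 1.459, slope = 2.875, linetype = 5)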
Consider the case where \(X_1\) is quantitative, but \(X_2\) is an indicator variable that can only be 0 or 1 (e.g. \(isWoman\)). Then,
\[\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2 + \hat{\beta_3}X_1X_2 \]
So then,
- For men, \(X_2 = 0\), so \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}(0) + \hat{\beta_3}X_1(0)\), and \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1\) is the fitted regression line for men.
- For women, \(X_2 = 1\), so \(\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}(1) + \hat{\beta_3}X_1(1)\), and \(\hat{Y} = (\hat{\beta_0} + \hat{\beta_2}) + (\hat{\beta_1} + \hat{\beta_3})X_1\) is the fitted regression line for women.
qplot(y = Adj2007, x = Distance, color = GarageGroup, data = RailsTrails) +
geom_smooth(method = "lm", se = 0)
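Judging from the Call: line, the summary below comes from fitting the interaction model:

summary(lm(Adj2007 ~ Distance * GarageGroup, data = RailsTrails))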
## Call:
## lm(formula = Adj2007 ~ Distance * GarageGroup, data = RailsTrails)
##
## Residuals:
## Min 1Q Median 3Q Max
## -162.46 -51.65 -17.22 30.04 425.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 359.083 21.295 16.862 < 2e-16 ***
## Distance -46.302 13.391 -3.458 0.000802 ***
## GarageGroupyes 48.862 28.108 1.738 0.085222 .
## Distance:GarageGroupyes -9.878 19.366 -0.510 0.611125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90.96 on 100 degrees of freedom
## Multiple R-squared: 0.2712, Adjusted R-squared: 0.2494
## F-statistic: 12.41 on 3 and 100 DF, p-value: 5.785e-07
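For comparison, the parallel-slopes (no-interaction) model; the call is reconstructed from the Call: line below:

summary(lm(Adj2007 ~ Distance + GarageGroup, data = RailsTrails))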
##
## Call:
## lm(formula = Adj2007 ~ Distance + GarageGroup, data = RailsTrails)
##
## Residuals:
## Min 1Q Median 3Q Max
## -167.88 -51.55 -21.88 36.79 427.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 365.103 17.661 20.673 <2e-16 ***
## Distance -51.025 9.638 -5.294 7e-07 ***
## GarageGroupyes 37.892 18.032 2.101 0.0381 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90.62 on 101 degrees of freedom
## Multiple R-squared: 0.2693, Adjusted R-squared: 0.2549
## F-statistic: 18.62 on 2 and 101 DF, p-value: 1.311e-07
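The same comparison using the number of bedrooms; again the call matches the Call: line below:

summary(lm(Adj2007 ~ Distance + BedGroup, data = RailsTrails))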
##
## Call:
## lm(formula = Adj2007 ~ Distance + BedGroup, data = RailsTrails)
##
## Residuals:
## Min 1Q Median 3Q Max
## -130.01 -52.78 -17.42 23.13 407.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 322.786 24.913 12.957 < 2e-16 ***
## Distance -45.655 9.513 -4.799 5.58e-06 ***
## BedGroup3 beds 44.514 24.873 1.790 0.076540 .
## BedGroup4+ beds 96.447 26.616 3.624 0.000459 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86.99 on 100 degrees of freedom
## Multiple R-squared: 0.3333, Adjusted R-squared: 0.3133
## F-statistic: 16.67 on 3 and 100 DF, p-value: 7.417e-09
qplot(y = Adj2007, x = Distance, color = BedGroup, data = RailsTrails) +
geom_smooth(method = "lm", se = 0)
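And the interaction model for bedrooms, reconstructed from its Call: line:

summary(lm(Adj2007 ~ Distance * BedGroup, data = RailsTrails))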
## Call:
## lm(formula = Adj2007 ~ Distance * BedGroup, data = RailsTrails)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.12 -54.10 -17.78 24.02 407.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 311.76 32.78 9.510 1.4e-15 ***
## Distance -37.02 19.07 -1.941 0.05514 .
## BedGroup3 beds 58.22 38.92 1.496 0.13789
## BedGroup4+ beds 111.50 39.33 2.835 0.00557 **
## Distance:BedGroup3 beds -10.67 23.07 -0.462 0.64482
## Distance:BedGroup4+ beds -14.00 28.71 -0.488 0.62683
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87.75 on 98 degrees of freedom
## Multiple R-squared: 0.3353, Adjusted R-squared: 0.3014
## F-statistic: 9.888 on 5 and 98 DF, p-value: 1.122e-07
Multicollinearity: What is it? Why is it a problem?
Modeling Weight as a function of Height (\(X_1\)) and Inseam (\(X_2\)): \[weight=\beta_0+\beta_1 Height+ \beta_2 Inseam + \epsilon\]
Effects of Multicollinearity
If predictors are highly correlated among themselves, their estimated coefficients become unstable and their standard errors inflate, so predictors that are strong on their own can look insignificant in the joint model.
We want to model Weight as a function of Height, Inseam, and Age
We simulate data, which we call rdata, for the purposes of the example, with the following correlation matrix:
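A sketch of one way such data could be generated (the original simulation code isn't shown; the seed and exact mechanics here are assumptions):

# draw 100 observations from a multivariate normal with the target correlations
library(MASS)
Sigma <- matrix(c(1.000, 0.789, 0.807, 0.500,
                  0.789, 1.000, 0.956, 0.377,
                  0.807, 0.956, 1.000, 0.370,
                  0.500, 0.377, 0.370, 1.000), nrow = 4)
set.seed(291)  # arbitrary seed
rdata <- as.data.frame(mvrnorm(n = 100, mu = rep(0, 4), Sigma = Sigma))
names(rdata) <- c("weight", "height", "inseam", "age")
cor(rdata)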
## weight height inseam age
## weight 1.0000000 0.7885873 0.8073210 0.5002686
## height 0.7885873 1.0000000 0.9559604 0.3773555
## inseam 0.8073210 0.9559604 1.0000000 0.3702648
## age 0.5002686 0.3773555 0.3702648 1.0000000
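The three summaries that follow were produced by calls matching their Call: lines:

summary(lm(weight ~ height + age, data = rdata))
summary(lm(weight ~ inseam + age, data = rdata))
summary(lm(weight ~ height + inseam + age, data = rdata))

Notice how the standard error on height jumps from about 0.06 in the first model to about 0.19 once inseam joins it in the third model, and height is no longer significant.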
##
## Call:
## lm(formula = weight ~ height + age, data = rdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22772 -0.42874 -0.00997 0.32386 1.59323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.06948 0.05993 1.159 0.2491
## height 0.69118 0.06226 11.101 <2e-16 ***
## age 0.23715 0.06322 3.751 0.0003 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5978 on 97 degrees of freedom
## Multiple R-squared: 0.6698, Adjusted R-squared: 0.663
## F-statistic: 98.37 on 2 and 97 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = weight ~ inseam + age, data = rdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06698 -0.39223 -0.04525 0.37086 1.58523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.08771 0.05733 1.530 0.129282
## inseam 0.71201 0.05925 12.017 < 2e-16 ***
## age 0.23413 0.06020 3.889 0.000184 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.571 on 97 degrees of freedom
## Multiple R-squared: 0.6987, Adjusted R-squared: 0.6925
## F-statistic: 112.5 on 2 and 97 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = weight ~ height + inseam + age, data = rdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08853 -0.39992 -0.01961 0.35937 1.58092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.08505 0.05761 1.476 0.143118
## height 0.13131 0.18885 0.695 0.488537
## inseam 0.58788 0.18815 3.125 0.002355 **
## age 0.23052 0.06058 3.805 0.000249 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5725 on 96 degrees of freedom
## Multiple R-squared: 0.7003, Adjusted R-squared: 0.6909
## F-statistic: 74.76 on 3 and 96 DF, p-value: < 2.2e-16
In the most extreme case, one variable is an exact function of another.
Let's assume we're in a world where everyone's inseam is the same fixed fraction of their height: inseam = 0.34 * height.
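A hypothetical construction of such data (the 0.34 ratio comes from the text; the scale of the heights, the weight relationship, and the seed are invented for illustration):

set.seed(291)
height_true <- rnorm(100, mean = 60, sd = 5)   # assumed scale for heights
inseam_true <- 0.34 * height_true              # exactly collinear with height
weight_true <- 3 * height_true + rnorm(100)    # assumed relationship plus noise
weightdata <- data.frame(weight_true, height_true, inseam_true)
cor(weightdata)
summary(lm(weight_true ~ height_true + inseam_true, data = weightdata))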
## weight_true height_true inseam_true
## weight_true 1.0000000 0.9982694 0.9982694
## height_true 0.9982694 1.0000000 1.0000000
## inseam_true 0.9982694 1.0000000 1.0000000
##
## Call:
## lm(formula = weight_true ~ height_true + inseam_true, data = weightdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5346 -0.7053 0.1223 0.7281 1.9631
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.73503 1.08534 -0.677 0.5
## height_true 3.01134 0.01792 168.050 <2e-16 ***
## inseam_true NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.994 on 98 degrees of freedom
## Multiple R-squared: 0.9965, Adjusted R-squared: 0.9965
## F-statistic: 2.824e+04 on 1 and 98 DF, p-value: < 2.2e-16
We constructed that example to be exactly collinear (the variables carry the suffix "_true" because they are perfectly collinear), which is unlikely in practice: two people with the same height can have different leg lengths.
Original example (not perfect collinearity): What is the correlation between height and inseam?
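One way to pull out that correlation (an assumed call; the original isn't shown):

data.frame(ht_inseam_corr = cor(rdata$height, rdata$inseam))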
## ht_inseam_corr
## 1 0.9559604
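To see where the VIFs come from, regress each explanatory variable on the other two. The first summary below comes from a call like

summary(lm(height ~ inseam + age, data = rdata))

with analogous calls for inseam and age producing the two summaries after it.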
##
## Call:
## lm(formula = height ~ inseam + age, data = rdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59321 -0.23306 -0.01279 0.18597 0.75928
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02024 0.03090 0.655 0.514
## inseam 0.94533 0.03194 29.595 <2e-16 ***
## age 0.02753 0.03245 0.848 0.398
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3078 on 97 degrees of freedom
## Multiple R-squared: 0.9145, Adjusted R-squared: 0.9127
## F-statistic: 518.7 on 2 and 97 DF, p-value: < 2.2e-16
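The number printed after each summary is that variable's VIF, 1/(1 - R^2) from the corresponding regression. For height:

1 / (1 - summary(lm(height ~ inseam + age, data = rdata))$r.squared)  # 1/(1 - 0.9145)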
## [1] 11.69519
##
## Call:
## lm(formula = inseam ~ height + age, data = rdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81922 -0.16569 0.01617 0.19982 0.55612
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.02648 0.03097 -0.855 0.395
## height 0.95236 0.03218 29.595 <2e-16 ***
## age 0.01129 0.03267 0.345 0.730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.309 on 97 degrees of freedom
## Multiple R-squared: 0.914, Adjusted R-squared: 0.9122
## F-statistic: 515.2 on 2 and 97 DF, p-value: < 2.2e-16
## [1] 11.62334
##
## Call:
## lm(formula = age ~ inseam + height, data = rdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.79279 -0.54459 0.00692 0.55481 2.15850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.06224 0.09634 0.646 0.520
## inseam 0.10887 0.31514 0.345 0.730
## height 0.26751 0.31534 0.848 0.398
##
## Residual standard error: 0.9596 on 97 degrees of freedom
## Multiple R-squared: 0.1435, Adjusted R-squared: 0.1258
## F-statistic: 8.123 on 2 and 97 DF, p-value: 0.0005476
## [1] 1.167476
The vif() function from the car package computes these automatically from a model that includes all three explanatory variables.
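For example:

library(car)
vif(lm(weight ~ height + inseam + age, data = rdata))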
## height inseam age
## 11.695190 11.623342 1.167476
Drop some of the predictors
Combine some of the predictors
Pay less attention to the individual coefficients and tests (depends on your goal…)
Be cautious about your interpretation of multiple regression models with correlated predictors
\(Y\) = Active Pulse
\(X_1\) = Resting Pulse
\(X_2\) = Female (vs. Male; a 1/0 indicator)
\[ActivePulse=\beta_0 + \beta_1 RestingPulse + \beta_2 Female + \beta_3 RestingPulse \times Female + \epsilon\]
What is the test of different intercept?
What is the test of different slope?
What is the fitted equation for Males? \[ActivePulse_{Males}=\beta_0 + \beta_1 RestingPulse + \epsilon\]
What is the fitted equation for Females?
\[ActivePulse_{Females} = \beta_0 + \beta_1 RestingPulse + \beta_2(1) + \beta_3(1)RestingPulse + \epsilon = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)RestingPulse + \epsilon\]
How do we test \(\beta_2\) and \(\beta_3\) together?
Basic idea:
Find how much “extra” variability is explained by the “new” terms being tested.
Divide by the number of new terms to get a Mean Square for the new part of the model.
Divide this Mean Square by the MSE for the “full” model to get a test statistic.
Compare the test statistic (t.s.) to an F-distribution.
\[ t.s. = \frac{(SSModel_{Full} - SSModel_{Nested}) / (k_{Full} - k_{Nested})}{SSE_{Full} / (n - k_{Full} - 1)} \]
Formally, the test concerns every coefficient \(\beta_i\) on a predictor that is included in the full model but not the nested model:
\(H_0:\) \(\beta_i = 0\) for all such \(\beta_i\)
\(H_A:\) \(\beta_i \neq 0\) for at least one such \(\beta_i\)
Conceptually:
\(H_0:\) The nested (or smaller) model is all we need.
\(H_A:\) We need the full model.
Let’s calculate the Nested F Test Statistic for the Pulse Example:
library(Stat2Data)
data("Pulse")
fullmodel <- lm(Active ~ Rest + Sex + Rest * Sex, data = Pulse)
nestedmodel <- lm(Active ~ Rest, data = Pulse)

Full Model: ANOVA Table
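The table comes from calling anova() on the full model:

anova(fullmodel)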
## Analysis of Variance Table
##
## Response: Active
## Df Sum Sq Mean Sq F value Pr(>F)
## Rest 1 29868 29867.9 132.6550 <2e-16 ***
## Sex 1 504 503.7 2.2373 0.1361
## Rest:Sex 1 114 113.5 0.5043 0.4784
## Residuals 228 51335 225.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Nested Model: ANOVA Table
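And on the nested model:

anova(nestedmodel)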
## Analysis of Variance Table
##
## Response: Active
## Df Sum Sq Mean Sq F value Pr(>F)
## Rest 1 29868 29867.9 132.23 < 2.2e-16 ***
## Residuals 230 51953 225.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What are the values of each piece of the nested F statistic?
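R's anova() carries out the nested F test directly when given both models:

anova(nestedmodel, fullmodel)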
## Analysis of Variance Table
##
## Model 1: Active ~ Rest
## Model 2: Active ~ Rest + Sex + Rest * Sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 230 51953
## 2 228 51335 2 617.27 1.3708 0.256
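Reading the pieces off the two ANOVA tables: the residual sum of squares drops by 51953 - 51335 (about 617) when we add the two Sex terms, spread over 230 - 228 = 2 degrees of freedom, and the full model's MSE is 225.2, so

((51953 - 51335) / (230 - 228)) / (51335 / 228)  # about 1.37, matching the F above

With a p-value of 0.256, we fail to reject \(H_0\): the nested model (Rest alone) is all we need.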