Effect plots and model selection in R
Model Selection
- Given a set of predictors, we might only be interested in knowing which variables are “predictive” or significantly associated with the outcome
- As such, we can choose to discard variables that are not of theoretical interest or relevance which sometimes improves precision of model estimates
- There are generally two approaches to model selection
- Fitting all possible models specifying various subsets of predictors and ranking them based on some type of criterion like Adjusted R squared or Bayesian Information Criteria (BIC)
- Sometimes however, we might be starting out with too many predictors and it’s not feasible to explore all possible subsets. So we might use forward, backward or stepwise model selection specifying a p-value of 0.05, for instance, as the basis for retaining a variable in a model
Data Description
- The data for this exercise comes from a survey.
- A group of social scientists is interested in knowing whether the maximum speed at which one has driven depends upon gender and/or height.
- This data set contains the following variables - speed, gender, height - for 1302 individuals.
- There are 863 females and 439 males in the data set.
Data Preview
head(speed)
## speed gender height
## 1 85 female 69
## 2 40 male 71
## 3 87 female 64
## 4 110 female 60
## 5 110 male 70
## 6 120 female 61
tail(speed)
## speed gender height
## 1320 97 female 65.5
## 1321 97 female 63.0
## 1322 100 female 66.0
## 1323 90 female 63.0
## 1324 90 male 69.0
## 1325 60 female 62.0
Data Structure
str(speed)
## 'data.frame': 1302 obs. of 3 variables:
## $ speed : int 85 40 87 110 110 120 90 90 80 95 ...
## $ gender: chr "female" "male" "female" "female" ...
## $ height: num 69 71 64 60 70 61 65 65 61 69 ...
## - attr(*, "na.action")= 'omit' Named int [1:23] 42 52 53 130 225 308 317 452 466 493 ...
## ..- attr(*, "names")= chr [1:23] "42" "52" "53" "130" ...
Data Summary
summary(speed)
## speed gender height
## Min. : 0.00 Length:1302 Min. :52.00
## 1st Qu.: 80.00 Class :character 1st Qu.:63.00
## Median : 90.00 Mode :character Median :66.00
## Mean : 90.75 Mean :66.14
## 3rd Qu.:100.00 3rd Qu.:69.00
## Max. :185.00 Max. :82.00
Model specifications
- We can specify many different models during the model selection process:
# Intercept-only model (returns the mean of the outcome variable)
mod0 = lm(speed ~ 1, data=speed)
# speed = gender
mod1 = lm(speed ~ gender, data=speed)
# Main effects model: speed = gender + height
mod2 = lm(speed ~ gender + height,data=speed)
# Full model (main effects of gender, height and the interaction effect)
mod3 = lm(speed ~ gender + height + gender*height, data=speed)
# Equivalent to full model: speed = gender + height + gender*height
mod3a = lm(speed~gender*height,data=speed)
# Interaction only model: speed = gender * height
mod4 = lm(speed~gender:height,data=speed)
Library effects
- Translating regression model results to more intuitive marginal effects helps make results more meaningful and interpretable
effects library produces graphical and tabular effect displays, e.g., of interactions, for various statistical models with linear predictors
library(effects)
Main effects model (speed = gender + height)
Summary
summary(mod2)
##
## Call:
## lm(formula = speed ~ gender + height, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.234 -7.717 2.283 11.313 81.857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.6755 12.1534 2.030 0.042525 *
## gendermale 5.7276 1.6142 3.548 0.000402 ***
## height 0.9699 0.1885 5.145 3.09e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.66 on 1299 degrees of freedom
## Multiple R-squared: 0.07122, Adjusted R-squared: 0.06979
## F-statistic: 49.8 on 2 and 1299 DF, p-value: < 2.2e-16
- Note: female is the base reference category, coded as 0, and male is coded as 1
- Based on the results, males have a max speed that is on average 5.7 speed-units faster than females (the base reference group, coded 0)
Fitted values
- The model object
mod2 is a list that stores the following: coefficients, residuals, effects, rank, fitted.values, assign, qr, df.residual, contrasts, xlevels, call, terms, model
head(mod2$fitted.values) #preview of fitted values
## 1 2 3 4 5 6
## 91.59651 99.26389 86.74716 82.86769 98.29402 83.83755
Effects plots
- Plot the main effects of gender and height
plot(allEffects(mod2))

Full model (speed = gender + height + gender * height)
Summary
summary(mod3)
##
## Call:
## lm(formula = speed ~ gender + height + gender * height, data = speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.372 -7.495 2.444 11.533 79.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.7801 15.8630 2.949 0.00325 **
## gendermale -50.0170 25.8116 -1.938 0.05287 .
## height 0.6264 0.2462 2.544 0.01108 *
## gendermale:height 0.8265 0.3820 2.164 0.03065 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.63 on 1298 degrees of freedom
## Multiple R-squared: 0.07456, Adjusted R-squared: 0.07242
## F-statistic: 34.86 on 3 and 1298 DF, p-value: < 2.2e-16
Effects plots
- Notice that confidence intervals are included in both plots below
plot(allEffects(mod3))

- The effect of height on max speed varies at different level of gender. For males, the effect of height on speed is greater than it is for females
- The results suggest that gender moderates the relationship between height and max speed
Model selection
leaps() performs an exhaustive search for the best subsets of the variables in x for predicting y in linear regression, using an efficient branch-and-bound algorithm. It is a compatibility wrapper for regsubsets does the same thing better.
library(leaps)
- Run the model selection procedure on the full model
allsubs = regsubsets(speed~gender+height+gender*height, data=speed)
summary(allsubs)
## Subset selection object
## Call: regsubsets.formula(speed ~ gender + height + gender * height,
## data = speed)
## 3 Variables (and intercept)
## Forced in Forced out
## gendermale FALSE FALSE
## height FALSE FALSE
## gendermale:height FALSE FALSE
## 1 subsets of each size up to 3
## Selection Algorithm: exhaustive
## gendermale height gendermale:height
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" "*"
## 3 ( 1 ) "*" "*" "*"
- The function
summary() reports the best set of variables for each model size.
- From the output above, an asterisk specifies that a given variable is included in the corresponding model.
- For example, the best 2-variables model contains
height and gender*height i.e. speed ~ height + gender*height.
- The best three-variable model is
speed ~ gender + height + gender*height.
- The
summary() function also returns Adjusted R2, Mallow’s Cp and BIC
res.sum <- summary(allsubs)
data.frame(Adj.R2 = which.max(res.sum$adjr2),BIC = which.min(res.sum$bic))
## Adj.R2 BIC
## 1 3 2
- Adjusted R2 suggests that the best model is the one with all the 3 predictor variables.
- Using BIC criteria, the preferable model is the one with 2 variables.
Model selection plots
- We can also use plots to convey this insight.
- The model with two variables - height and gender*height (interaction term) - has the lowest BIC value.
par(mfrow=c(1,2))
plot(allsubs, scale='adjr', main = "Model Selection Criteria: Adj.R2")
plot(allsubs, scale='bic', main = "Model Selection Criteria: BIC")

\(~\)
Replicating results in Stata
- Let us bring this file into Stata and see how we can replicate the results. When read in, the file has the following dimensions:
(3 vars, 1302 obs)
- There are 863 females and 439 males in the data set
. tab gender
gender | Freq. Percent Cum.
------------+-----------------------------------
0 | 863 66.28 66.28
1 | 439 33.72 100.00
------------+-----------------------------------
Total | 1,302 100.00
. sum
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
speed | 1302 90.75115 22.45394 0 185
gender | 1302 .3371736 .4729264 0 1
height | 1302 66.13717 4.049847 52 82
A note on Stata notation:
- A variable can be prefixed with i. to specify indicators for each level (category) of the variable.
- An # is used between two variables to create an interaction–indicators for each combination of the categories of the variables.
- ## can be used to specify a full factorial of the variables—main effects for each variable and an interaction.
- If we want to interact a continuous variable with a factor variable, we prefix the continuous variable with c.
Main effects model (speed = gender + height)
regress speed i.gender height
Source | SS df MS Number of obs = 1302
-------------+------------------------------ F( 2, 1299) = 49.80
Model | 46715.7154 2 23357.8577 Prob > F = 0.0000
Residual | 609221.658 1299 468.992808 R-squared = 0.0712
-------------+------------------------------ Adj R-squared = 0.0698
Total | 655937.373 1301 504.17938 Root MSE = 21.656
------------------------------------------------------------------------------
speed | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.gender | 5.727642 1.614234 3.55 0.000 2.56085 8.894434
height | .9698694 .1885044 5.15 0.000 .600063 1.339676
_cons | 24.67552 12.15337 2.03 0.043 .8331418 48.5179
------------------------------------------------------------------------------
Fitted values
- We can obtain linear predictions based on the model using the
predict command
predict fitted_values
(option xb assumed; fitted values)
- Note that this replicates what was produced earlier on when we ran
head(mod2$fitted.values) in R
list fitted_values in 1/6
| fitted~s |
|----------|
1. | 91.59651 |
2. | 99.26389 |
3. | 86.74716 |
4. | 82.86768 |
5. | 98.29402 |
|----------|
6. | 83.83755 |
+----------+
Main effects of gender
- Based on the results, males (coded 1) are 5.7 speed-units faster than females (the base reference group, coded 0). The predicted values above are produced using the value of
height at its mean, 66.13717.
margins gender
Predictive margins Number of obs = 1302
Model VCE : OLS
Expression : Linear prediction, predict()
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
0 | 88.81994 .8102143 109.63 0.000 87.23195 90.40793
1 | 94.54758 1.226792 77.07 0.000 92.14312 96.95205
------------------------------------------------------------------------------
- To see how the values above are derived, below are the manual computations:
// Predicted max speed for females
display _b[_cons] + _b[0.gender] + _b[height]*66.13717
88.819939
// Predicted max speed for males
display _b[_cons] + _b[1.gender] + _b[height]*66.13717
94.547581
marginsplot, xdimension(gender)

Main effects of height
- Obtain linear predictions, based on the model, at each value of height
- These values are used to produce the effect plots
margins, at(height=(52(1)82)) vsquish
Predictive margins Number of obs = 1302
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : height = 52
2._at : height = 53
3._at : height = 54
4._at : height = 55
5._at : height = 56
6._at : height = 57
7._at : height = 58
8._at : height = 59
9._at : height = 60
10._at : height = 61
11._at : height = 62
12._at : height = 63
13._at : height = 64
14._at : height = 65
15._at : height = 66
16._at : height = 67
17._at : height = 68
18._at : height = 69
19._at : height = 70
20._at : height = 71
21._at : height = 72
22._at : height = 73
23._at : height = 74
24._at : height = 75
25._at : height = 76
26._at : height = 77
27._at : height = 78
28._at : height = 79
29._at : height = 80
30._at : height = 81
31._at : height = 82
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | 77.03994 2.731667 28.20 0.000 71.68597 82.39391
2 | 78.00981 2.548105 30.61 0.000 73.01561 83.004
3 | 78.97968 2.365321 33.39 0.000 74.34373 83.61562
4 | 79.94955 2.18351 36.62 0.000 75.66995 84.22915
5 | 80.91942 2.002937 40.40 0.000 76.99373 84.8451
6 | 81.88929 1.823969 44.90 0.000 78.31437 85.4642
7 | 82.85916 1.64713 50.31 0.000 79.63084 86.08747
8 | 83.82903 1.473187 56.90 0.000 80.94163 86.71642
9 | 84.7989 1.3033 65.06 0.000 82.24447 87.35332
10 | 85.76876 1.139284 75.28 0.000 83.53581 88.00172
11 | 86.73863 .9840809 88.14 0.000 84.80987 88.6674
12 | 87.7085 .842573 104.10 0.000 86.05709 89.35992
13 | 88.67837 .7228492 122.68 0.000 87.26161 90.09513
14 | 89.64824 .6373074 140.67 0.000 88.39914 90.89734
15 | 90.61811 .6007314 150.85 0.000 89.4407 91.79552
16 | 91.58798 .6218227 147.29 0.000 90.36923 92.80673
17 | 92.55785 .6953535 133.11 0.000 91.19498 93.92072
18 | 93.52772 .8071167 115.88 0.000 91.9458 95.10964
19 | 94.49759 .9436239 100.14 0.000 92.64812 96.34706
20 | 95.46746 1.095665 87.13 0.000 93.31999 97.61492
21 | 96.43733 1.25762 76.68 0.000 93.97244 98.90222
22 | 97.4072 1.426113 68.30 0.000 94.61207 100.2023
23 | 98.37707 1.599081 61.52 0.000 95.24293 101.5112
24 | 99.34694 1.775215 55.96 0.000 95.86758 102.8263
25 | 100.3168 1.953659 51.35 0.000 96.48771 104.1459
26 | 101.2867 2.133834 47.47 0.000 97.10444 105.4689
27 | 102.2565 2.315335 44.16 0.000 97.71857 106.7945
28 | 103.2264 2.497874 41.33 0.000 98.33067 108.1222
29 | 104.1963 2.681239 38.86 0.000 98.94115 109.4514
30 | 105.1662 2.865271 36.70 0.000 99.55033 110.782
31 | 106.136 3.049849 34.80 0.000 100.1584 112.1136
------------------------------------------------------------------------------
- Produce effects plot for height
marginsplot, recast(line) recastci(rarea) ciopts(color(*.6))
\(~\)
Full effects model (speed = gender + height + gender * height)
regress speed c.height##i.gender
Source | SS df MS Number of obs = 1302
-------------+------------------------------ F( 3, 1298) = 34.86
Model | 48905.5484 3 16301.8495 Prob > F = 0.0000
Residual | 607031.825 1298 467.667045 R-squared = 0.0746
-------------+------------------------------ Adj R-squared = 0.0724
Total | 655937.373 1301 504.17938 Root MSE = 21.626
------------------------------------------------------------------------------
speed | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
height | .6263848 .2462315 2.54 0.011 .1433295 1.10944
1.gender | -50.01699 25.81158 -1.94 0.053 -100.654 .6199822
|
gender#|
c.height |
1 | .8265217 .3819594 2.16 0.031 .0771964 1.575847
|
_cons | 46.78009 15.86304 2.95 0.003 15.66007 77.9001
------------------------------------------------------------------------------
Margins
- Produce expected values of max speed at each of these values of height , by gender
- These values are used to produce the effect plots
margins gender, at(height=(52(1)82)) vsquish
Adjusted predictions Number of obs = 1302
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : height = 52
2._at : height = 53
3._at : height = 54
4._at : height = 55
5._at : height = 56
6._at : height = 57
7._at : height = 58
8._at : height = 59
9._at : height = 60
10._at : height = 61
11._at : height = 62
12._at : height = 63
13._at : height = 64
14._at : height = 65
15._at : height = 66
16._at : height = 67
17._at : height = 68
18._at : height = 69
19._at : height = 70
20._at : height = 71
21._at : height = 72
22._at : height = 73
23._at : height = 74
24._at : height = 75
25._at : height = 76
26._at : height = 77
27._at : height = 78
28._at : height = 79
29._at : height = 80
30._at : height = 81
31._at : height = 82
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at#gender |
1 0 | 79.3521 3.129721 25.35 0.000 73.21796 85.48624
1 1 | 72.31423 5.254051 13.76 0.000 62.01648 82.61198
2 0 | 79.97848 2.890978 27.66 0.000 74.31227 85.6447
2 1 | 73.76714 4.968073 14.85 0.000 64.02989 83.50438
3 0 | 80.60487 2.653604 30.38 0.000 75.4039 85.80584
3 1 | 75.22005 4.682839 16.06 0.000 66.04185 84.39824
4 0 | 81.23125 2.418002 33.59 0.000 76.49206 85.97045
4 1 | 76.67295 4.398492 17.43 0.000 68.05207 85.29384
5 0 | 81.85764 2.184745 37.47 0.000 77.57562 86.13966
5 1 | 78.12586 4.115218 18.98 0.000 70.06018 86.19154
6 0 | 82.48402 1.954674 42.20 0.000 78.65293 86.31511
6 1 | 79.57877 3.833252 20.76 0.000 72.06573 87.0918
7 0 | 83.11041 1.72906 48.07 0.000 79.72151 86.4993
7 1 | 81.03167 3.552908 22.81 0.000 74.0681 87.99524
8 0 | 83.73679 1.509904 55.46 0.000 80.77744 86.69615
8 1 | 82.48458 3.274602 25.19 0.000 76.06648 88.90268
9 0 | 84.36318 1.300472 64.87 0.000 81.8143 86.91206
9 1 | 83.93749 2.9989 27.99 0.000 78.05975 89.81522
10 0 | 84.98956 1.106303 76.82 0.000 82.82125 87.15788
10 1 | 85.39039 2.726593 31.32 0.000 80.04637 90.73442
11 0 | 85.61595 .9369336 91.38 0.000 83.77959 87.4523
11 1 | 86.8433 2.45881 35.32 0.000 82.02412 91.66248
12 0 | 86.24233 .8081107 106.72 0.000 84.65846 87.8262
12 1 | 88.2962 2.197205 40.19 0.000 83.98976 92.60265
13 0 | 86.86872 .7412836 117.19 0.000 85.41583 88.32161
13 1 | 89.74911 1.944274 46.16 0.000 85.93841 93.55982
14 0 | 87.4951 .7531398 116.17 0.000 86.01897 88.97123
14 1 | 91.20202 1.703882 53.53 0.000 87.86247 94.54157
15 0 | 88.12149 .8403558 104.86 0.000 86.47442 89.76855
15 1 | 92.65492 1.482145 62.51 0.000 89.74997 95.55988
16 0 | 88.74787 .9830747 90.28 0.000 86.82108 90.67466
16 1 | 94.10783 1.288729 73.02 0.000 91.58197 96.63369
17 0 | 89.37426 1.161006 76.98 0.000 87.09873 91.64978
17 1 | 95.56074 1.138163 83.96 0.000 93.32998 97.7915
18 0 | 90.00064 1.360401 66.16 0.000 87.3343 92.66698
18 1 | 97.01364 1.049063 92.48 0.000 94.95752 99.06977
19 0 | 90.62703 1.573121 57.61 0.000 87.54376 93.71029
19 1 | 98.46655 1.03739 94.92 0.000 96.4333 100.4998
20 0 | 91.25341 1.794432 50.85 0.000 87.73639 94.77043
20 1 | 99.91946 1.105599 90.38 0.000 97.75252 102.0864
21 0 | 91.87979 2.021516 45.45 0.000 87.9177 95.84189
21 1 | 101.3724 1.240583 81.71 0.000 98.94087 103.8039
22 0 | 92.50618 2.252626 41.07 0.000 88.09111 96.92125
22 1 | 102.8253 1.42347 72.24 0.000 100.0353 105.6152
23 0 | 93.13256 2.48664 37.45 0.000 88.25884 98.00629
23 1 | 104.2782 1.638297 63.65 0.000 101.0672 107.4892
24 0 | 93.75895 2.722811 34.43 0.000 88.42234 99.09556
24 1 | 105.7311 1.874111 56.42 0.000 102.0579 109.4043
25 0 | 94.38533 2.960621 31.88 0.000 88.58262 100.188
25 1 | 107.184 2.123933 50.46 0.000 103.0212 111.3468
26 0 | 95.01172 3.199705 29.69 0.000 88.74041 101.283
26 1 | 108.6369 2.383363 45.58 0.000 103.9656 113.3082
27 0 | 95.6381 3.439798 27.80 0.000 88.89622 102.38
27 1 | 110.0898 2.649579 41.55 0.000 104.8967 115.2829
28 0 | 96.26449 3.680702 26.15 0.000 89.05045 103.4785
28 1 | 111.5427 2.920728 38.19 0.000 105.8182 117.2672
29 0 | 96.89087 3.922268 24.70 0.000 89.20337 104.5784
29 1 | 112.9956 3.195552 35.36 0.000 106.7324 119.2588
30 0 | 97.51726 4.16438 23.42 0.000 89.35522 105.6793
30 1 | 114.4485 3.473181 32.95 0.000 107.6412 121.2558
31 0 | 98.14364 4.406949 22.27 0.000 89.50618 106.7811
31 1 | 115.9014 3.752991 30.88 0.000 108.5457 123.2572
------------------------------------------------------------------------------
marginsplot, bydimension(gender) recast(line) recastci(rarea) ciopts(color(*.6))

Model Selection
- Stepwise backward model selection procedure is used here
- Specify a significance level of 0.05 for retaining a variable in the model
- Begins with full model and removes
gender, keeping height and interaction
- This aligns with results in R where the model without gender dropped had the lowest BIC
gen interaction = gender*height
stepwise, pr(.05): regress speed gender height interaction
begin with full model
p = 0.0529 >= 0.0500 removing gender
Source | SS df MS Number of obs = 1302
-------------+------------------------------ F( 2, 1299) = 50.30
Model | 47149.474 2 23574.737 Prob > F = 0.0000
Residual | 608787.899 1299 468.658891 R-squared = 0.0719
-------------+------------------------------ Adj R-squared = 0.0705
Total | 655937.373 1301 504.17938 Root MSE = 21.649
------------------------------------------------------------------------------
speed | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
interaction | .0878156 .0238789 3.68 0.000 .0409702 .1346611
height | .9193059 .1945755 4.72 0.000 .5375893 1.301023
_cons | 27.8888 12.52701 2.23 0.026 3.313413 52.46418
------------------------------------------------------------------------------