Effect plots and model selection in R

Model Selection

  • Given a set of predictors, we might only be interested in knowing which variables are “predictive” or significantly associated with the outcome
  • As such, we can choose to discard variables that are not of theoretical interest or relevance which sometimes improves precision of model estimates
  • There are generally two approaches to model selection
    • Fitting all possible models specifying various subsets of predictors and ranking them based on some type of criterion like Adjusted R squared or Bayesian Information Criteria (BIC)
    • Sometimes however, we might be starting out with too many predictors and it’s not feasible to explore all possible subsets. So we might use forward, backward or stepwise model selection specifying a p-value of 0.05, for instance, as the basis for retaining a variable in a model

Data Description

  • The data for this exercise comes from a survey.
  • A group of social scientists is interested in knowing whether the maximum speed at which one has driven depends upon gender and/or height.
  • This data set contains the following variables - speed, gender, height - for 1302 individuals.
  • There are 863 females and 439 males in the data set.

Data Preview

head(speed) 
##   speed gender height
## 1    85 female     69
## 2    40   male     71
## 3    87 female     64
## 4   110 female     60
## 5   110   male     70
## 6   120 female     61
tail(speed)
##      speed gender height
## 1320    97 female   65.5
## 1321    97 female   63.0
## 1322   100 female   66.0
## 1323    90 female   63.0
## 1324    90   male   69.0
## 1325    60 female   62.0

Data Structure

str(speed)
## 'data.frame':    1302 obs. of  3 variables:
##  $ speed : int  85 40 87 110 110 120 90 90 80 95 ...
##  $ gender: chr  "female" "male" "female" "female" ...
##  $ height: num  69 71 64 60 70 61 65 65 61 69 ...
##  - attr(*, "na.action")= 'omit' Named int [1:23] 42 52 53 130 225 308 317 452 466 493 ...
##   ..- attr(*, "names")= chr [1:23] "42" "52" "53" "130" ...

Data Summary

summary(speed)
##      speed           gender              height     
##  Min.   :  0.00   Length:1302        Min.   :52.00  
##  1st Qu.: 80.00   Class :character   1st Qu.:63.00  
##  Median : 90.00   Mode  :character   Median :66.00  
##  Mean   : 90.75                      Mean   :66.14  
##  3rd Qu.:100.00                      3rd Qu.:69.00  
##  Max.   :185.00                      Max.   :82.00

Model specifications

  • We can specify many different models during the model selection process:
# Intercept-only model (returns the mean of the outcome variable)
mod0 = lm(speed ~ 1, data=speed)

# speed = gender
mod1 = lm(speed ~ gender, data=speed)

# Main effects model: speed = gender + height
mod2 = lm(speed ~ gender + height,data=speed)

# Full model (main effects of gender, height and the interaction effect)
mod3 = lm(speed ~ gender + height + gender*height, data=speed)  

# Equivalent to full model: speed = gender + height + gender*height
mod3a = lm(speed~gender*height,data=speed)  

# Interaction only model: speed = gender * height
mod4 = lm(speed~gender:height,data=speed)

Library effects

  • Translating regression model results to more intuitive marginal effects helps make results more meaningful and interpretable
  • effects library produces graphical and tabular effect displays, e.g., of interactions, for various statistical models with linear predictors
library(effects)

Main effects model (speed = gender + height)

Summary
summary(mod2)
## 
## Call:
## lm(formula = speed ~ gender + height, data = speed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.234   -7.717    2.283   11.313   81.857 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24.6755    12.1534   2.030 0.042525 *  
## gendermale    5.7276     1.6142   3.548 0.000402 ***
## height        0.9699     0.1885   5.145 3.09e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.66 on 1299 degrees of freedom
## Multiple R-squared:  0.07122,    Adjusted R-squared:  0.06979 
## F-statistic:  49.8 on 2 and 1299 DF,  p-value: < 2.2e-16
  • Note: female is the base reference category, coded as 0, and male is coded as 1
  • Based on the results, males have a max speed that is on average 5.7 speed-units faster than females (the base reference group, coded 0)
Fitted values
  • The model object mod2 is a list that stores the following: coefficients, residuals, effects, rank, fitted.values, assign, qr, df.residual, contrasts, xlevels, call, terms, model
head(mod2$fitted.values) #preview of fitted values
##        1        2        3        4        5        6 
## 91.59651 99.26389 86.74716 82.86769 98.29402 83.83755
Effects plots
  • Plot the main effects of gender and height
plot(allEffects(mod2))

Full model (speed = gender + height + gender * height)

Summary
summary(mod3)
## 
## Call:
## lm(formula = speed ~ gender + height + gender * height, data = speed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -101.372   -7.495    2.444   11.533   79.269 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        46.7801    15.8630   2.949  0.00325 **
## gendermale        -50.0170    25.8116  -1.938  0.05287 . 
## height              0.6264     0.2462   2.544  0.01108 * 
## gendermale:height   0.8265     0.3820   2.164  0.03065 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.63 on 1298 degrees of freedom
## Multiple R-squared:  0.07456,    Adjusted R-squared:  0.07242 
## F-statistic: 34.86 on 3 and 1298 DF,  p-value: < 2.2e-16
Effects plots
  • Notice that confidence intervals are included in both plots below
plot(allEffects(mod3))

  • The effect of height on max speed varies at different level of gender. For males, the effect of height on speed is greater than it is for females
  • The results suggest that gender moderates the relationship between height and max speed

Model selection

  • leaps() performs an exhaustive search for the best subsets of the variables in x for predicting y in linear regression, using an efficient branch-and-bound algorithm. It is a compatibility wrapper for regsubsets does the same thing better.
library(leaps)
  • Run the model selection procedure on the full model
allsubs = regsubsets(speed~gender+height+gender*height, data=speed) 
summary(allsubs) 
## Subset selection object
## Call: regsubsets.formula(speed ~ gender + height + gender * height, 
##     data = speed)
## 3 Variables  (and intercept)
##                   Forced in Forced out
## gendermale            FALSE      FALSE
## height                FALSE      FALSE
## gendermale:height     FALSE      FALSE
## 1 subsets of each size up to 3
## Selection Algorithm: exhaustive
##          gendermale height gendermale:height
## 1  ( 1 ) " "        "*"    " "              
## 2  ( 1 ) " "        "*"    "*"              
## 3  ( 1 ) "*"        "*"    "*"
  • The function summary() reports the best set of variables for each model size.
  • From the output above, an asterisk specifies that a given variable is included in the corresponding model.
  • For example, the best 2-variables model contains height and gender*height i.e. speed ~ height + gender*height.
  • The best three-variable model is speed ~ gender + height + gender*height.
  • The summary() function also returns Adjusted R2, Mallow’s Cp and BIC
res.sum <- summary(allsubs)
data.frame(Adj.R2 = which.max(res.sum$adjr2),BIC = which.min(res.sum$bic))
##   Adj.R2 BIC
## 1      3   2
  • Adjusted R2 suggests that the best model is the one with all the 3 predictor variables.
  • Using BIC criteria, the preferable model is the one with 2 variables.

Model selection plots

  • We can also use plots to convey this insight.
  • The model with two variables - height and gender*height (interaction term) - has the lowest BIC value.
par(mfrow=c(1,2))
plot(allsubs, scale='adjr', main = "Model Selection Criteria: Adj.R2")
plot(allsubs, scale='bic', main = "Model Selection Criteria: BIC")

\(~\)

Replicating results in Stata

  • Let us bring this file into Stata and see how we can replicate the results. When read in, the file has the following dimensions:
(3 vars, 1302 obs)
  • There are 863 females and 439 males in the data set
 . tab gender 

     gender |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        863       66.28       66.28
          1 |        439       33.72      100.00
------------+-----------------------------------
      Total |      1,302      100.00
  • Quick summary
 . sum 

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       speed |      1302    90.75115    22.45394          0        185
      gender |      1302    .3371736    .4729264          0          1
      height |      1302    66.13717    4.049847         52         82

A note on Stata notation:

  • A variable can be prefixed with i. to specify indicators for each level (category) of the variable.
  • An # is used between two variables to create an interaction–indicators for each combination of the categories of the variables.
  • ## can be used to specify a full factorial of the variables—main effects for each variable and an interaction.
  • If we want to interact a continuous variable with a factor variable, we prefix the continuous variable with c.

Main effects model (speed = gender + height)

regress speed i.gender height 
      Source |       SS       df       MS              Number of obs =    1302
-------------+------------------------------           F(  2,  1299) =   49.80
       Model |  46715.7154     2  23357.8577           Prob > F      =  0.0000
    Residual |  609221.658  1299  468.992808           R-squared     =  0.0712
-------------+------------------------------           Adj R-squared =  0.0698
       Total |  655937.373  1301   504.17938           Root MSE      =  21.656

------------------------------------------------------------------------------
       speed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.gender |   5.727642   1.614234     3.55   0.000      2.56085    8.894434
      height |   .9698694   .1885044     5.15   0.000      .600063    1.339676
       _cons |   24.67552   12.15337     2.03   0.043     .8331418     48.5179
------------------------------------------------------------------------------

Fitted values

  • We can obtain linear predictions based on the model using the predict command
predict fitted_values
(option xb assumed; fitted values)
  • Note that this replicates what was produced earlier on when we ran head(mod2$fitted.values) in R
list fitted_values in 1/6 
     | fitted~s |
     |----------|
  1. | 91.59651 |
  2. | 99.26389 |
  3. | 86.74716 |
  4. | 82.86768 |
  5. | 98.29402 |
     |----------|
  6. | 83.83755 |
     +----------+

Main effects of gender

  • Based on the results, males (coded 1) are 5.7 speed-units faster than females (the base reference group, coded 0). The predicted values above are produced using the value of height at its mean, 66.13717.
margins gender
Predictive margins                                Number of obs   =       1302
Model VCE    : OLS

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |
          0  |   88.81994   .8102143   109.63   0.000     87.23195    90.40793
          1  |   94.54758   1.226792    77.07   0.000     92.14312    96.95205
------------------------------------------------------------------------------
  • To see how the values above are derived, below are the manual computations:
// Predicted max speed for females
display _b[_cons] + _b[0.gender] + _b[height]*66.13717
88.819939
// Predicted max speed for males
display _b[_cons] + _b[1.gender] + _b[height]*66.13717
94.547581
  • Effects plots
marginsplot, xdimension(gender)

Main effects of height

  • Obtain linear predictions, based on the model, at each value of height
  • These values are used to produce the effect plots
margins, at(height=(52(1)82)) vsquish
Predictive margins                                Number of obs   =       1302
Model VCE    : OLS

Expression   : Linear prediction, predict()
1._at        : height          =          52
2._at        : height          =          53
3._at        : height          =          54
4._at        : height          =          55
5._at        : height          =          56
6._at        : height          =          57
7._at        : height          =          58
8._at        : height          =          59
9._at        : height          =          60
10._at       : height          =          61
11._at       : height          =          62
12._at       : height          =          63
13._at       : height          =          64
14._at       : height          =          65
15._at       : height          =          66
16._at       : height          =          67
17._at       : height          =          68
18._at       : height          =          69
19._at       : height          =          70
20._at       : height          =          71
21._at       : height          =          72
22._at       : height          =          73
23._at       : height          =          74
24._at       : height          =          75
25._at       : height          =          76
26._at       : height          =          77
27._at       : height          =          78
28._at       : height          =          79
29._at       : height          =          80
30._at       : height          =          81
31._at       : height          =          82

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   77.03994   2.731667    28.20   0.000     71.68597    82.39391
          2  |   78.00981   2.548105    30.61   0.000     73.01561      83.004
          3  |   78.97968   2.365321    33.39   0.000     74.34373    83.61562
          4  |   79.94955    2.18351    36.62   0.000     75.66995    84.22915
          5  |   80.91942   2.002937    40.40   0.000     76.99373     84.8451
          6  |   81.88929   1.823969    44.90   0.000     78.31437     85.4642
          7  |   82.85916    1.64713    50.31   0.000     79.63084    86.08747
          8  |   83.82903   1.473187    56.90   0.000     80.94163    86.71642
          9  |    84.7989     1.3033    65.06   0.000     82.24447    87.35332
         10  |   85.76876   1.139284    75.28   0.000     83.53581    88.00172
         11  |   86.73863   .9840809    88.14   0.000     84.80987     88.6674
         12  |    87.7085    .842573   104.10   0.000     86.05709    89.35992
         13  |   88.67837   .7228492   122.68   0.000     87.26161    90.09513
         14  |   89.64824   .6373074   140.67   0.000     88.39914    90.89734
         15  |   90.61811   .6007314   150.85   0.000      89.4407    91.79552
         16  |   91.58798   .6218227   147.29   0.000     90.36923    92.80673
         17  |   92.55785   .6953535   133.11   0.000     91.19498    93.92072
         18  |   93.52772   .8071167   115.88   0.000      91.9458    95.10964
         19  |   94.49759   .9436239   100.14   0.000     92.64812    96.34706
         20  |   95.46746   1.095665    87.13   0.000     93.31999    97.61492
         21  |   96.43733    1.25762    76.68   0.000     93.97244    98.90222
         22  |    97.4072   1.426113    68.30   0.000     94.61207    100.2023
         23  |   98.37707   1.599081    61.52   0.000     95.24293    101.5112
         24  |   99.34694   1.775215    55.96   0.000     95.86758    102.8263
         25  |   100.3168   1.953659    51.35   0.000     96.48771    104.1459
         26  |   101.2867   2.133834    47.47   0.000     97.10444    105.4689
         27  |   102.2565   2.315335    44.16   0.000     97.71857    106.7945
         28  |   103.2264   2.497874    41.33   0.000     98.33067    108.1222
         29  |   104.1963   2.681239    38.86   0.000     98.94115    109.4514
         30  |   105.1662   2.865271    36.70   0.000     99.55033     110.782
         31  |    106.136   3.049849    34.80   0.000     100.1584    112.1136
------------------------------------------------------------------------------
  • Produce effects plot for height
marginsplot, recast(line) recastci(rarea) ciopts(color(*.6))

\(~\)

Full effects model (speed = gender + height + gender * height)

regress speed c.height##i.gender 
      Source |       SS       df       MS              Number of obs =    1302
-------------+------------------------------           F(  3,  1298) =   34.86
       Model |  48905.5484     3  16301.8495           Prob > F      =  0.0000
    Residual |  607031.825  1298  467.667045           R-squared     =  0.0746
-------------+------------------------------           Adj R-squared =  0.0724
       Total |  655937.373  1301   504.17938           Root MSE      =  21.626

------------------------------------------------------------------------------
       speed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .6263848   .2462315     2.54   0.011     .1433295     1.10944
    1.gender |  -50.01699   25.81158    -1.94   0.053     -100.654    .6199822
             |
      gender#|
    c.height |
          1  |   .8265217   .3819594     2.16   0.031     .0771964    1.575847
             |
       _cons |   46.78009   15.86304     2.95   0.003     15.66007     77.9001
------------------------------------------------------------------------------

Margins

  • Produce expected values of max speed at each of these values of height , by gender
  • These values are used to produce the effect plots
margins gender, at(height=(52(1)82)) vsquish 
Adjusted predictions                              Number of obs   =       1302
Model VCE    : OLS

Expression   : Linear prediction, predict()
1._at        : height          =          52
2._at        : height          =          53
3._at        : height          =          54
4._at        : height          =          55
5._at        : height          =          56
6._at        : height          =          57
7._at        : height          =          58
8._at        : height          =          59
9._at        : height          =          60
10._at       : height          =          61
11._at       : height          =          62
12._at       : height          =          63
13._at       : height          =          64
14._at       : height          =          65
15._at       : height          =          66
16._at       : height          =          67
17._at       : height          =          68
18._at       : height          =          69
19._at       : height          =          70
20._at       : height          =          71
21._at       : height          =          72
22._at       : height          =          73
23._at       : height          =          74
24._at       : height          =          75
25._at       : height          =          76
26._at       : height          =          77
27._at       : height          =          78
28._at       : height          =          79
29._at       : height          =          80
30._at       : height          =          81
31._at       : height          =          82

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _at#gender |
        1 0  |    79.3521   3.129721    25.35   0.000     73.21796    85.48624
        1 1  |   72.31423   5.254051    13.76   0.000     62.01648    82.61198
        2 0  |   79.97848   2.890978    27.66   0.000     74.31227     85.6447
        2 1  |   73.76714   4.968073    14.85   0.000     64.02989    83.50438
        3 0  |   80.60487   2.653604    30.38   0.000      75.4039    85.80584
        3 1  |   75.22005   4.682839    16.06   0.000     66.04185    84.39824
        4 0  |   81.23125   2.418002    33.59   0.000     76.49206    85.97045
        4 1  |   76.67295   4.398492    17.43   0.000     68.05207    85.29384
        5 0  |   81.85764   2.184745    37.47   0.000     77.57562    86.13966
        5 1  |   78.12586   4.115218    18.98   0.000     70.06018    86.19154
        6 0  |   82.48402   1.954674    42.20   0.000     78.65293    86.31511
        6 1  |   79.57877   3.833252    20.76   0.000     72.06573     87.0918
        7 0  |   83.11041    1.72906    48.07   0.000     79.72151     86.4993
        7 1  |   81.03167   3.552908    22.81   0.000      74.0681    87.99524
        8 0  |   83.73679   1.509904    55.46   0.000     80.77744    86.69615
        8 1  |   82.48458   3.274602    25.19   0.000     76.06648    88.90268
        9 0  |   84.36318   1.300472    64.87   0.000      81.8143    86.91206
        9 1  |   83.93749     2.9989    27.99   0.000     78.05975    89.81522
       10 0  |   84.98956   1.106303    76.82   0.000     82.82125    87.15788
       10 1  |   85.39039   2.726593    31.32   0.000     80.04637    90.73442
       11 0  |   85.61595   .9369336    91.38   0.000     83.77959     87.4523
       11 1  |    86.8433    2.45881    35.32   0.000     82.02412    91.66248
       12 0  |   86.24233   .8081107   106.72   0.000     84.65846     87.8262
       12 1  |    88.2962   2.197205    40.19   0.000     83.98976    92.60265
       13 0  |   86.86872   .7412836   117.19   0.000     85.41583    88.32161
       13 1  |   89.74911   1.944274    46.16   0.000     85.93841    93.55982
       14 0  |    87.4951   .7531398   116.17   0.000     86.01897    88.97123
       14 1  |   91.20202   1.703882    53.53   0.000     87.86247    94.54157
       15 0  |   88.12149   .8403558   104.86   0.000     86.47442    89.76855
       15 1  |   92.65492   1.482145    62.51   0.000     89.74997    95.55988
       16 0  |   88.74787   .9830747    90.28   0.000     86.82108    90.67466
       16 1  |   94.10783   1.288729    73.02   0.000     91.58197    96.63369
       17 0  |   89.37426   1.161006    76.98   0.000     87.09873    91.64978
       17 1  |   95.56074   1.138163    83.96   0.000     93.32998     97.7915
       18 0  |   90.00064   1.360401    66.16   0.000      87.3343    92.66698
       18 1  |   97.01364   1.049063    92.48   0.000     94.95752    99.06977
       19 0  |   90.62703   1.573121    57.61   0.000     87.54376    93.71029
       19 1  |   98.46655    1.03739    94.92   0.000      96.4333    100.4998
       20 0  |   91.25341   1.794432    50.85   0.000     87.73639    94.77043
       20 1  |   99.91946   1.105599    90.38   0.000     97.75252    102.0864
       21 0  |   91.87979   2.021516    45.45   0.000      87.9177    95.84189
       21 1  |   101.3724   1.240583    81.71   0.000     98.94087    103.8039
       22 0  |   92.50618   2.252626    41.07   0.000     88.09111    96.92125
       22 1  |   102.8253    1.42347    72.24   0.000     100.0353    105.6152
       23 0  |   93.13256    2.48664    37.45   0.000     88.25884    98.00629
       23 1  |   104.2782   1.638297    63.65   0.000     101.0672    107.4892
       24 0  |   93.75895   2.722811    34.43   0.000     88.42234    99.09556
       24 1  |   105.7311   1.874111    56.42   0.000     102.0579    109.4043
       25 0  |   94.38533   2.960621    31.88   0.000     88.58262     100.188
       25 1  |    107.184   2.123933    50.46   0.000     103.0212    111.3468
       26 0  |   95.01172   3.199705    29.69   0.000     88.74041     101.283
       26 1  |   108.6369   2.383363    45.58   0.000     103.9656    113.3082
       27 0  |    95.6381   3.439798    27.80   0.000     88.89622      102.38
       27 1  |   110.0898   2.649579    41.55   0.000     104.8967    115.2829
       28 0  |   96.26449   3.680702    26.15   0.000     89.05045    103.4785
       28 1  |   111.5427   2.920728    38.19   0.000     105.8182    117.2672
       29 0  |   96.89087   3.922268    24.70   0.000     89.20337    104.5784
       29 1  |   112.9956   3.195552    35.36   0.000     106.7324    119.2588
       30 0  |   97.51726    4.16438    23.42   0.000     89.35522    105.6793
       30 1  |   114.4485   3.473181    32.95   0.000     107.6412    121.2558
       31 0  |   98.14364   4.406949    22.27   0.000     89.50618    106.7811
       31 1  |   115.9014   3.752991    30.88   0.000     108.5457    123.2572
------------------------------------------------------------------------------
  • Effects plots
marginsplot, bydimension(gender) recast(line) recastci(rarea) ciopts(color(*.6)) 

Model Selection

  • Stepwise backward model selection procedure is used here
  • Specify a significance level of 0.05 for retaining a variable in the model
  • Begins with full model and removes gender, keeping height and interaction
  • This aligns with results in R where the model without gender dropped had the lowest BIC
gen interaction = gender*height
stepwise, pr(.05): regress speed gender height interaction 
                      begin with full model
p = 0.0529 >= 0.0500  removing gender

      Source |       SS       df       MS              Number of obs =    1302
-------------+------------------------------           F(  2,  1299) =   50.30
       Model |   47149.474     2   23574.737           Prob > F      =  0.0000
    Residual |  608787.899  1299  468.658891           R-squared     =  0.0719
-------------+------------------------------           Adj R-squared =  0.0705
       Total |  655937.373  1301   504.17938           Root MSE      =  21.649

------------------------------------------------------------------------------
       speed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 interaction |   .0878156   .0238789     3.68   0.000     .0409702    .1346611
      height |   .9193059   .1945755     4.72   0.000     .5375893    1.301023
       _cons |    27.8888   12.52701     2.23   0.026     3.313413    52.46418
------------------------------------------------------------------------------