BUA 345 - Lecture 17

More about Model Selection

Author

Penelope Pooler Eisenbies

Published

March 17, 2025

Housekeeping

Upcoming Dates

HW 7 is available and is due on Wednesday, 3/19.

  • Demo videos were posted last week.
  • HW 8 (Parts 1 and 2) is posted and is due on Wednesday, 3/26.

    • Part 1 of HW 8 can be completed after today’s lecture.

    • Part 2 of HW 8 pertains to Thursday’s lecture on Logistic Regression

  • Quiz 2 will be on 4/1/2025 in the classroom.

    • Date has changed and syllabus has been updated.

    • Practice Questions will be posted this weekend.

More Housekeeping

This Week’s Plan

  • Today (Tuesday 3/18)

    • Quick review of model selection concepts

    • Best Subsets method

    • Measures of model fit

  • Thursday 3/20

    • Logistic Regression of binary response data

In-class Polling (Session ID: bua345s25)

Lecture 17 In-class Exercises - Q1


Review Question from Week 8 and HW 7.

If two predictor variables (X variables) in a model have a correlation of 0.85, what do you conclude?

Review of Animals Data

  • Question: What factors affect a mammal’s sleep duration?

  • Animals Data Notes:

    • Population was limited to animals under 1000 pounds (two elephant species excluded).

    • Natural log (LN) transformed variables were added to original data.

    • Observations with missing values are removed below

    • Working dataset has 49 observations (49 different species)

Animals Data

First 10 Rows of Animals Data
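| Species | TotalSleep | BodyWt | LNBodyWt | BrainWt | LNBrainWt | LifeSpan | LNLifeSpan | Gestation | PredF | ExposF | DangrF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Africangiantpouchedrat | 8.3 | 1.00 | 0.00 | 6.6 | 1.89 | 4.5 | 1.50 | 42 | 3 | 1 | 3 |
| Americanopossum | 19.4 | 1.70 | 0.53 | 6.3 | 1.84 | 5.0 | 1.61 | 12 | 2 | 1 | 1 |
| ArcticFox | 12.5 | 3.39 | 1.22 | 44.5 | 3.80 | 14.0 | 2.64 | 60 | 1 | 1 | 1 |
| Baboon | 9.8 | 10.55 | 2.36 | 179.5 | 5.19 | 27.0 | 3.30 | 180 | 4 | 4 | 4 |
| Bigbrownbat | 19.7 | 0.02 | -3.77 | 0.3 | -1.20 | 19.0 | 2.94 | 35 | 1 | 1 | 1 |
| Braziliantapir | 6.2 | 160.00 | 5.08 | 169.0 | 5.13 | 30.4 | 3.41 | 392 | 4 | 5 | 4 |
| Cat | 14.5 | 3.30 | 1.19 | 25.6 | 3.24 | 28.0 | 3.33 | 63 | 1 | 2 | 1 |
| Chimpanzee | 9.7 | 52.16 | 3.95 | 440.0 | 6.09 | 50.0 | 3.91 | 230 | 1 | 1 | 1 |
| Chinchilla | 12.5 | 0.43 | -0.86 | 6.4 | 1.86 | 7.0 | 1.95 | 112 | 5 | 4 | 4 |
| Cow | 3.9 | 465.00 | 6.14 | 423.0 | 6.05 | 30.0 | 3.40 | 281 | 5 | 5 | 5 |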

Animals Data Dictionary - Description of Variables

Intuitively, there is likely to be redundancy between Predation, Exposure, and Danger.

| Variable | Type | Description |
|---|---|---|
| Species | Nominal | Name of Species |
| TotalSleep | Quantitative | Total Sleep |
| BodyWt | Quantitative | Average Body Weight in kilograms |
| LNBodyWt | Quantitative | Natural Log of Body Weight |
| BrainWt | Quantitative | Average Brain Weight in grams |
| LNBrainWt | Quantitative | Natural Log of Brain Weight |
| LifeSpan | Quantitative | Maximum Life Span in years |
| LNLifeSpan | Quantitative | Natural Log of Life Span |
| Gestation | Quantitative | Gestation Time in days |
| PredF | Ordinal | Predation Index (1=least likely to be prey) |
| ExposF | Ordinal | Sleep Exposure Index (1=least exposed) |
| DangrF | Ordinal | Overall Danger Index (1=least danger from other animals) |

Multicollinearity Concerns in Animals Dataset

  • LNBodyWt and LNBrainWt (R = 0.95):

    • These two predictors cannot both be in the final model.
  • LNBrainWt and LNLifeSpan (R = 0.79):

    • These two predictors ideally should not both be in the final model.
  • Predation (PredF) and Danger (DangrF) (R = 0.95):

    • These two predictors cannot both be in the final model.
  • Exposure (ExposF) and Danger (DangrF) (R = 0.78):

    • These two predictors ideally should not both be in the final model.
  • NOTE: Students should know the commands for creating a correlation matrix with rounded values.

    • See HW 7 and next two slides
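
The screening described above can also be done programmatically. The following is a minimal sketch, not part of the course code: it rebuilds the predictor portion of the quantitative correlation matrix by hand from the rounded values reported in this lecture, and it uses a hypothetical helper, flag_pairs(), to list predictor pairs whose absolute correlation is at or above a cutoff. The cutoff of 0.8 is an assumption taken from the rule of thumb used in the model selection steps later in this lecture.

```r
# Rounded correlations copied from the quantitative correlation matrix in this lecture
vars <- c("LNBodyWt", "LNBrainWt", "LNLifeSpan")
cor_mat <- matrix(c(1.00, 0.95, 0.71,
                    0.95, 1.00, 0.79,
                    0.71, 0.79, 1.00),
                  nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical helper: list each predictor pair whose |r| meets the cutoff.
# upper.tri() keeps each pair once and skips the diagonal.
flag_pairs <- function(cm, cutoff = 0.8) {
  idx <- which(abs(cm) >= cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r    = cm[idx],
             stringsAsFactors = FALSE)
}

flagged <- flag_pairs(cor_mat)
print(flagged)
```

With a real dataset, the hand-built matrix would be replaced by the output of cor() on the predictor columns. Pairs just under the cutoff (such as 0.79) are not flagged, so they still need a judgment call.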

Correlation Matrix of Quantitative Animal Variables

Code
```{r reminder of multicollinear quant term, echo=T}
animals <- animals |> filter(!is.na(LifeSpan) & !is.na(Gestation)) # exclude missing values
animals |> select(TotalSleep, LNBodyWt, LNBrainWt, LNLifeSpan) |> cor() |> round(2) |> kable() |> kable_styling(full_width = F)
```
| | TotalSleep | LNBodyWt | LNBrainWt | LNLifeSpan |
|---|---|---|---|---|
| TotalSleep | 1.00 | -0.56 | -0.57 | -0.37 |
| LNBodyWt | -0.56 | 1.00 | 0.95 | 0.71 |
| LNBrainWt | -0.57 | 0.95 | 1.00 | 0.79 |
| LNLifeSpan | -0.37 | 0.71 | 0.79 | 1.00 |

Correlation Matrix of Ordinal Variables

Code
```{r reminder of multicollinear ordinal terms, echo=T}
animals_ordinal |> cor() |> round(2) |> kable() |> kable_styling(full_width = F)
```
| | Predation | Exposure | Danger |
|---|---|---|---|
| Predation | 1.00 | 0.66 | 0.95 |
| Exposure | 0.66 | 1.00 | 0.78 |
| Danger | 0.95 | 0.78 | 1.00 |

Backward Elimination - Animals Data Final Model

                         Model Summary                          
---------------------------------------------------------------
R                       0.857       RMSE                 2.329 
R-Squared               0.734       MSE                  5.423 
Adj. R-Squared          0.655       Coef. Var           25.223 
Pred R-Squared          0.547       AIC                247.894 
MAE                     1.857       SBC                272.488 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                              ANOVA                                
------------------------------------------------------------------
               Sum of                                             
              Squares        DF    Mean Square      F        Sig. 
------------------------------------------------------------------
Regression    734.163        11         66.742    9.294    0.0000 
Residual      265.708        37          7.181                    
Total         999.871        48                                   
------------------------------------------------------------------

                                      Parameter Estimates                                       
-----------------------------------------------------------------------------------------------
            model      Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
-----------------------------------------------------------------------------------------------
      (Intercept)     6.751         3.305                  2.043    0.048      0.054    13.448 
         LNBodyWt    -0.698         0.244       -0.442    -2.859    0.007     -1.192    -0.203 
       LNLifeSpan     2.855         1.133        0.591     2.519    0.016      0.559     5.151 
        Gestation    -0.020         0.006       -0.447    -3.285    0.002     -0.032    -0.008 
           PredF2    13.998         4.132        0.041     3.388    0.002      5.626    22.369 
           PredF3    11.883         5.514       -0.494     2.155    0.038      0.711    23.056 
           PredF4     2.654         4.102        0.021     0.647    0.522     -5.658    10.966 
           PredF5    -0.782         4.262       -0.316    -0.183    0.855     -9.418     7.855 
LNLifeSpan:PredF2    -5.367         1.478       -0.471    -3.632    0.001     -8.361    -2.373 
LNLifeSpan:PredF3    -7.390         3.141       -0.588    -2.352    0.024    -13.755    -1.025 
LNLifeSpan:PredF4    -0.941         1.356       -0.083    -0.694    0.492     -3.689     1.807 
LNLifeSpan:PredF5    -1.043         1.446       -0.091    -0.721    0.475     -3.973     1.887 
-----------------------------------------------------------------------------------------------

Model Selection Methods

  • Recall that in Multiple Linear Regression (MLR) the goal is to choose the simplest model that still fits accurately, i.e., the ‘BEST’ set of independent variables.

  • How do we decide which variables should be in our model?

  • There are many methods:

  • We’ve discussed Backward Elimination, which can also be done manually in any software (not recommended).

  • Backward Elimination starts with all potential terms (including potential interaction terms) in the model and removes the least significant term at each step.

    • This is referred to as starting with a full or saturated model.
  • Forward Selection: By default, this procedure starts with an empty model and adds the most significant term at each step until there are no more useful terms to add.

    • Forward selection also needs to know what terms are in the full model.
  • Stepwise Selection: By default, this procedure starts with an empty model and then adds or removes a term at each step.
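
To make the forward selection logic concrete, here is a minimal sketch written in base R rather than with the olsrr functions used in this course. It is an illustration under stated assumptions: it uses the built-in mtcars data instead of the Animals data, a small hypothetical set of candidate terms, and an assumed entry threshold of 0.1. At each step it uses add1() to test every term not yet in the model and adds the one with the smallest p-value, stopping when no candidate meets the threshold.

```r
# Forward selection by p-value on built-in data (illustration only)
all_terms  <- c("wt", "hp", "qsec", "drat")   # hypothetical candidate terms
full_scope <- reformulate(all_terms)          # ~ wt + hp + qsec + drat
current    <- lm(mpg ~ 1, data = mtcars)      # start with the empty model
p_enter    <- 0.1                             # assumed entry threshold
added      <- character(0)                    # terms added so far, in order

while (length(added) < length(all_terms)) {
  # F-test for adding each term that is in the scope but not yet in the model
  cand <- add1(current, scope = full_scope, test = "F")
  cand <- cand[rownames(cand) != "<none>", , drop = FALSE]
  best <- which.min(cand[["Pr(>F)"]])
  if (cand[["Pr(>F)"]][best] > p_enter) break  # no useful term left to add
  term    <- rownames(cand)[best]
  current <- update(current, as.formula(paste(". ~ . +", term)))
  added   <- c(added, term)
}

added             # order in which terms entered
formula(current)  # final forward selection model
```

Backward Elimination reverses this loop: it starts from the full model and removes the least significant term at each step. The base R counterpart of add1() for that direction is drop1().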

Comments about Model Selection Methods

  • Common Practice: Try multiple methods to develop preliminary final model and then tweak as needed.

  • Steps for model selection using multiple methods are similar to the steps for Backward Elimination (Week 8 Lectures)

  • Not all steps are ALWAYS required. It depends on how complex the data are.

  • In the following example, we only need to do part of Step 1 plus Steps 2, 3, and 6.

    • For Step 1, we only need to examine correlations.

    • In this case, Step 7 will be apparent.

    • We can add model estimates to data for future interpretation (Step 8)

Lecture 17 In-class Exercises - Q2

Which model selection method is characterized by starting with NO (0) terms in the model and then adding terms one at a time until none of the remaining terms is significant?

A Backward Elimination

B Stepwise Selection

C Forward Selection

D Adjusted \(R^2\)

Steps for Model Selection Using Multiple Methods

  1. Examine Matrix of Scatterplots and histograms and determine if any transformations are needed to linearize relationships between continuous predictors and response variable.

    • Also look at correlation matrix to check if there are pairs of variables to be concerned about.

  2. Create a ‘saturated’ model with all potential predictor variables and interaction terms (Subjective!).

  3. Use Backward Elimination, Forward Selection, and Stepwise Selection to find preliminary candidate models. (These are automated procedures!)

    • Carefully examine results to see where these candidate models agree and disagree.

Steps for Model Selection Using Multiple Methods Cont’d

  4. Examine predictors in preliminary candidate models to confirm they are not too highly correlated with each other.

    • If two predictor variables in any model have a correlation of 0.8 or greater, drop one of them.

  5. Rerun model selection methods, if a candidate model is substantially changed (not always needed).

  6. Compare model fit statistics from final candidate model from all three methods.

  7. Decide on final candidate and make final modifications, if needed.

  8. Interpret final model and use for estimation.

Forward Selection of Animals Data

Full Model:

Code
```{r animals full model, echo=T}
# full model (subjective)
animals_full <- lm(TotalSleep ~ LNBodyWt + LNBrainWt + 
                     LNLifeSpan + Gestation + 
                     PredF + ExposF + DangrF + 
                     LNBodyWt*Gestation + LNLifeSpan*PredF + 
                     LNLifeSpan*ExposF + LNLifeSpan*DangrF, data=animals)
```

Forward Model Selection

Code
```{r animals model forward selection and output, echo=T}
(animals_FS <- ols_step_forward_p(animals_full, p_val = 0.1, progress = F))
```

                                Stepwise Summary                                 
-------------------------------------------------------------------------------
Step    Variable              AIC        SBC       SBIC        R2       Adj. R2 
-------------------------------------------------------------------------------
 0      Base Model          290.830    294.614    147.782    0.00000    0.00000 
 1      Gestation           268.856    274.532    123.817    0.38693    0.37388 
 2      DangrF              251.692    264.935     98.670    0.63316    0.59050 
 3      LNBrainWt           248.061    263.196     93.052    0.67298    0.62626 
 4      PredF               241.628    264.330     78.645    0.75641    0.69231 
 5      LNLifeSpan          233.996    258.589     69.041    0.79989    0.74039 
 6      LNLifeSpan:PredF    228.314    260.475     55.409    0.84864    0.77984 
 7      LNBodyWt            228.450    262.503     53.568    0.85429    0.78143 
 8      ExposF              229.245    270.865     46.411    0.87421    0.78437 
-------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                          
---------------------------------------------------------------
R                       0.935       RMSE                 1.602 
R-Squared               0.874       MSE                  2.567 
Adj. R-Squared          0.784       Coef. Var           19.948 
Pred R-Squared           -Inf       AIC                229.245 
MAE                     1.168       SBC                270.865 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                              ANOVA                                
------------------------------------------------------------------
               Sum of                                             
              Squares        DF    Mean Square      F        Sig. 
------------------------------------------------------------------
Regression    874.101        20         43.705     9.73    0.0000 
Residual      125.769        28          4.492                    
Total         999.871        48                                   
------------------------------------------------------------------

                                      Parameter Estimates                                        
------------------------------------------------------------------------------------------------
            model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
------------------------------------------------------------------------------------------------
      (Intercept)      6.184         2.897                  2.135    0.042      0.250    12.118 
        Gestation     -0.020         0.007       -0.454    -2.789    0.009     -0.035    -0.005 
          DangrF2     -6.733         1.877       -0.641    -3.587    0.001    -10.578    -2.888 
          DangrF3     -8.462         3.579       -0.655    -2.364    0.025    -15.793    -1.130 
          DangrF4     -8.780         4.650       -0.718    -1.888    0.069    -18.305     0.745 
          DangrF5    -20.146         6.095       -1.561    -3.305    0.003    -32.632    -7.661 
        LNBrainWt     -0.180         0.684       -0.092    -0.264    0.794     -1.582     1.221 
           PredF2     14.954         3.672        1.462     4.072    0.000      7.431    22.477 
           PredF3     16.956         5.583        1.230     3.037    0.005      5.520    28.393 
           PredF4     11.583         5.230        0.897     2.215    0.035      0.871    22.295 
           PredF5      0.598         6.292        0.055     0.095    0.925    -12.290    13.486 
       LNLifeSpan      3.218         0.937        0.666     3.433    0.002      1.298     5.138 
         LNBodyWt     -0.803         0.511       -0.508    -1.572    0.127     -1.848     0.243 
          ExposF2     -0.082         1.180       -0.008    -0.070    0.945     -2.499     2.335 
          ExposF3      0.481         1.723        0.029     0.279    0.782     -3.049     4.011 
          ExposF4      3.183         1.854        0.213     1.716    0.097     -0.615     6.981 
          ExposF5      4.951         4.042        0.405     1.225    0.231     -3.328    13.231 
PredF2:LNLifeSpan     -3.401         1.455       -0.810    -2.337    0.027     -6.381    -0.420 
PredF3:LNLifeSpan     -5.334         4.249       -0.603    -1.255    0.220    -14.037     3.370 
PredF4:LNLifeSpan     -1.707         1.767       -0.373    -0.966    0.342     -5.327     1.913 
PredF5:LNLifeSpan      3.238         2.070        0.856     1.565    0.129     -1.002     7.478 
------------------------------------------------------------------------------------------------

Final Forward (and Stepwise) Selection Model

  • Drop DangrF due to multicollinearity with PredF

  • Drop LNBrainWt due to multicollinearity with LNBodyWt

  • Leave in ExposF(?) and compare to Backward Elimination Model

  • Stepwise Selection arrived at same model as Forward Selection.

                         Model Summary                          
---------------------------------------------------------------
R                       0.882       RMSE                 2.131 
R-Squared               0.777       MSE                  4.543 
Adj. R-Squared          0.676       Coef. Var           24.445 
Pred R-Squared          0.407       AIC                247.220 
MAE                     1.729       SBC                279.381 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                              ANOVA                                
------------------------------------------------------------------
               Sum of                                             
              Squares        DF    Mean Square      F        Sig. 
------------------------------------------------------------------
Regression    777.272        15         51.818    7.682    0.0000 
Residual      222.599        33          6.745                    
Total         999.871        48                                   
------------------------------------------------------------------

                                      Parameter Estimates                                       
-----------------------------------------------------------------------------------------------
            model      Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
-----------------------------------------------------------------------------------------------
      (Intercept)     7.151         3.217                  2.223    0.033      0.605    13.696 
         LNBodyWt    -0.796         0.251       -0.504    -3.167    0.003     -1.307    -0.285 
       LNLifeSpan     2.604         1.109        0.539     2.349    0.025      0.348     4.860 
        Gestation    -0.014         0.007       -0.313    -2.102    0.043     -0.027     0.000 
          ExposF2    -2.416         1.272       -0.223    -1.899    0.066     -5.004     0.173 
          ExposF3    -1.237         1.905       -0.075    -0.649    0.521     -5.113     2.640 
          ExposF4     1.096         2.009        0.073     0.545    0.589     -2.991     5.182 
          ExposF5    -2.379         2.864       -0.195    -0.831    0.412     -8.206     3.448 
           PredF2    12.917         4.095        0.173     3.154    0.003      4.585    21.249 
           PredF3    14.428         5.566       -0.597     2.592    0.014      3.103    25.752 
           PredF4     1.813         4.012       -0.019     0.452    0.654     -6.349     9.974 
           PredF5    -1.068         4.590       -0.206    -0.233    0.817    -10.407     8.270 
LNLifeSpan:PredF2    -4.405         1.530       -0.387    -2.880    0.007     -7.518    -1.293 
LNLifeSpan:PredF3    -8.959         3.193       -0.712    -2.806    0.008    -15.454    -2.463 
LNLifeSpan:PredF4    -0.814         1.418       -0.072    -0.574    0.570     -3.699     2.071 
LNLifeSpan:PredF5    -0.458         1.795       -0.040    -0.255    0.800     -4.110     3.195 
-----------------------------------------------------------------------------------------------

Comparing Model Results

  • Comparison Measures:

    • Adj. \(R^2\): Higher value indicates better model fit

    • C(p): Lower value indicates better model fit (also referred to as Mallows’ C(p)).

    • AIC: Lower value indicates better model fit (Akaike Information Criterion).

    • RMSE: Lower value indicates better model fit (Root Mean Square Error).

  • Decision is debatable but it seems worthwhile to include ExposF (Exposure).

  • Same data and models are covered in HW 8 - Part 1.

| Method | Adjusted_R2 | Mallows_Cp | AIC | RMSE |
|---|---|---|---|---|
| Backward Elimination | 0.655 | 23.527 | 247.894 | 2.329 |
| Forward/Stepwise Selection | 0.676 | 23.654 | 247.220 | 2.131 |
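
As a check on the comparison table, the Adjusted \(R^2\) and RMSE values can be reproduced from the ANOVA tables shown earlier for the two final models. The sketch below is only a worked calculation: the sums of squares and degrees of freedom are copied from that output, and the helper names adj_r2() and rmse_n() are hypothetical. Note the assumption that the reported RMSE divides the residual sum of squares by n = 49 rather than by the residual degrees of freedom. That convention is inferred from the fact that it matches the printed values.

```r
sst      <- 999.871   # total sum of squares (same for both models)
df_total <- 48        # n - 1, with n = 49 species
n        <- 49

# Adjusted R-squared from residual SS and residual df
adj_r2 <- function(sse, df_resid) 1 - (sse / df_resid) / (sst / df_total)

# RMSE using n in the denominator (assumption that matches the printed output)
rmse_n <- function(sse) sqrt(sse / n)

# Residual SS and residual df copied from the two ANOVA tables
be <- c(Adj_R2 = adj_r2(265.708, 37), RMSE = rmse_n(265.708))  # Backward Elimination
fs <- c(Adj_R2 = adj_r2(222.599, 33), RMSE = rmse_n(222.599))  # Forward/Stepwise

round(rbind(Backward = be, Forward_Stepwise = fs), 3)
```

The forward/stepwise model uses four more degrees of freedom (for ExposF), yet its Adjusted \(R^2\) is still higher, which is the basis for keeping Exposure in the model.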

Model Validation

  • How good is our model?

  • There are many ways to examine model fit.

  • Here are two straightforward ways:

    • Check correlation between observed and estimated values
    • Plot a scatterplot of observed and estimated values
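
Both checks take only a few lines of R. The sketch below is an illustration on the built-in mtcars data, not the Animals model, and the model formula is an arbitrary assumption. It also shows why the R in the plot title matches the R in the Model Summary: for a least-squares model with an intercept, the correlation between observed and estimated values equals the square root of \(R^2\).

```r
# Illustrative model on built-in data (not the Animals model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

observed  <- mtcars$mpg
estimated <- fitted(fit)

# Check 1: correlation between observed and estimated values
r_obs_est <- cor(observed, estimated)
round(r_obs_est, 2)

# This correlation equals sqrt(R-squared) for a least-squares fit with an intercept
all.equal(r_obs_est, sqrt(summary(fit)$r.squared))

# Check 2: scatterplot of observed vs. estimated values with a 45-degree reference line
plot(estimated, observed,
     xlab = "Estimated mpg", ylab = "Observed mpg",
     main = paste0("Model Validation Plot (R = ", round(r_obs_est, 2), ")"))
abline(a = 0, b = 1, lty = 2)
```

Points close to the dashed line indicate observations the model estimates well.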

Model Validation Plot (R = 0.88)

Wine Data - Model Selection Example

Can we determine what factors affect wine quality even if we KNOW NOTHING about wine cultivation and chemistry?

Maybe!

  • Since we have no prior knowledge, we start with a straightforward full model with all available predictors and no interactions.

    • In practice, a consultant would be working with a wine expert to carefully determine a saturated model that includes all possible interactions.

Import Wine Data

Notice that all variables are numeric (<dbl> stands for double, i.e. a decimal value).

Code
```{r import and examine data, echo=T}
wine <- read_csv("data/wine.csv", show_col_types = F) 
head(wine) |> kable() |> kable_styling(full_width = F)
```
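| Wine_Quality | Fixed_Acidity | Volatile_Acidity | Citric_Acidity | Residual_Sugar | Chlorides | Free_Sulphur_Dioxide | Total_Sulphur_Dioxide | Ph | Sulfate | Alcohol |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 9.3 | 0.48 | 0.29 | 2.1 | 0.127 | 6 | 16 | 3.22 | 0.72 | 11.2 |
| 6 | 9.1 | 0.22 | 0.24 | 2.1 | 0.078 | 1 | 28 | 3.41 | 0.87 | 10.3 |
| 7 | 7.9 | 0.34 | 0.36 | 1.9 | 0.065 | 5 | 10 | 3.27 | 0.54 | 11.2 |
| 5 | 7.2 | 1.00 | 0.00 | 3.0 | 0.102 | 7 | 16 | 3.43 | 0.46 | 10.0 |
| 7 | 11.9 | 0.43 | 0.66 | 3.1 | 0.109 | 10 | 23 | 3.15 | 0.85 | 10.4 |
| 5 | 7.2 | 0.49 | 0.24 | 2.2 | 0.070 | 5 | 36 | 3.33 | 0.48 | 9.4 |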

Examine Correlation Matrix for Multicollinearity

                      Wine_Quality Fixed_Acidity Volatile_Acidity
Wine_Quality                  1.00          0.11            -0.39
Fixed_Acidity                 0.11          1.00            -0.23
Volatile_Acidity             -0.39         -0.23             1.00
Citric_Acidity                0.22          0.68            -0.52
Residual_Sugar                0.04          0.20            -0.01
Chlorides                    -0.10          0.12             0.04
Free_Sulphur_Dioxide          0.01         -0.18            -0.05
Total_Sulphur_Dioxide        -0.08         -0.13             0.05
Ph                           -0.06         -0.70             0.19
Sulfate                       0.21          0.19            -0.24
Alcohol                       0.45         -0.08            -0.17
                      Citric_Acidity Residual_Sugar Chlorides
Wine_Quality                    0.22           0.04     -0.10
Fixed_Acidity                   0.68           0.20      0.12
Volatile_Acidity               -0.52          -0.01      0.04
Citric_Acidity                  1.00           0.16      0.21
Residual_Sugar                  0.16           1.00      0.05
Chlorides                       0.21           0.05      1.00
Free_Sulphur_Dioxide           -0.07           0.18     -0.04
Total_Sulphur_Dioxide           0.06           0.18      0.00
Ph                             -0.55          -0.14     -0.26
Sulfate                         0.27          -0.01      0.35
Alcohol                         0.10           0.07     -0.21
                      Free_Sulphur_Dioxide Total_Sulphur_Dioxide    Ph Sulfate
Wine_Quality                          0.01                 -0.08 -0.06    0.21
Fixed_Acidity                        -0.18                 -0.13 -0.70    0.19
Volatile_Acidity                     -0.05                  0.05  0.19   -0.24
Citric_Acidity                       -0.07                  0.06 -0.55    0.27
Residual_Sugar                        0.18                  0.18 -0.14   -0.01
Chlorides                            -0.04                  0.00 -0.26    0.35
Free_Sulphur_Dioxide                  1.00                  0.65  0.08    0.00
Total_Sulphur_Dioxide                 0.65                  1.00 -0.07    0.08
Ph                                    0.08                 -0.07  1.00   -0.24
Sulfate                               0.00                  0.08 -0.24    1.00
Alcohol                              -0.03                 -0.08  0.21    0.05
                      Alcohol
Wine_Quality             0.45
Fixed_Acidity           -0.08
Volatile_Acidity        -0.17
Citric_Acidity           0.10
Residual_Sugar           0.07
Chlorides               -0.21
Free_Sulphur_Dioxide    -0.03
Total_Sulphur_Dioxide   -0.08
Ph                       0.21
Sulfate                  0.05
Alcohol                  1.00
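Strongest correlations between predictors: 0.68 (Fixed_Acidity and Citric_Acidity) and -0.70 (Fixed_Acidity and Ph). No pair reaches 0.8 in absolute value, so no predictor has to be dropped for multicollinearity before model selection.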

Model Selection

  • We specify a full model using an easy shortcut:

    • If all variables are included, you can use . instead of listing them all.

    • This model specification is also used in HW 7.

  • Then we run three model selection procedures:

    • Backward Elimination (BE)
    • Forward Selection (FS)
    • Stepwise Selection (SS)
Code
```{r specify full model, echo=T}
wine_full <- lm(Wine_Quality ~ ., data = wine)                 # specify full model
wine_BE <- ols_step_backward_p(wine_full, progress=F)          # backward elimination  
wine_FS <- ols_step_forward_p(wine_full, progress=F)           # forward selection
wine_SS <- ols_step_both_p(wine_full, progress=F)              # stepwise selection
```
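
The `.` shortcut in the full model above can be verified on a small example. The sketch below is an illustration on a subset of the built-in mtcars data (the column choice is an arbitrary assumption): it fits the same model once with `.` and once with every predictor listed, and confirms that the coefficients agree.

```r
# Keep only the response and the predictors we want, so that '.' means "all other columns"
dat <- mtcars[, c("mpg", "wt", "hp", "qsec")]

fit_dot  <- lm(mpg ~ ., data = dat)               # shortcut specification
fit_long <- lm(mpg ~ wt + hp + qsec, data = dat)  # predictors listed explicitly

# Both specifications produce the same fitted model
all.equal(coef(fit_dot), coef(fit_long))
```

Because `.` picks up every column other than the response, any identifier or helper columns need to be removed from the data before using it.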

Comparing Model Results

  • Look at the LAST step for each method to determine which method results in the best fit.

  • Comparison Measures:

    • Adj. \(R^2\): Higher value indicates better model fit

    • C(p): Lower value indicates better model fit (also referred to as Mallows’ C(p)).

    • AIC: Lower value indicates better model fit (Akaike Information Criterion).

    • RMSE: Lower value indicates better model fit (Root Mean Square Error).

  • By comparing these measures and accounting for our understanding of these procedures, we can determine that TWO of these methods arrived at the same model.
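
The four comparison measures can also be computed directly from fitted lm objects. The sketch below is a minimal illustration, not the olsrr output used in this course: it uses built-in mtcars data, two arbitrary candidate models, and a hypothetical helper named fit_stats(). Mallows' C(p) is computed with the standard textbook formula \(C_p = SSE_p / MSE_{full} - (n - 2p)\), where p counts the coefficients including the intercept, and RMSE is assumed to use n in the denominator.

```r
# Illustrative full model and two candidate models on built-in data
full  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
cand1 <- lm(mpg ~ wt + hp,   data = mtcars)
cand2 <- lm(mpg ~ wt + qsec, data = mtcars)

# Hypothetical helper: collect the four comparison measures for one model
fit_stats <- function(model, full_model) {
  n        <- nobs(model)
  p        <- length(coef(model))       # coefficients, intercept included
  sse      <- sum(resid(model)^2)       # residual sum of squares
  mse_full <- sum(resid(full_model)^2) / df.residual(full_model)
  c(Adj_R2 = summary(model)$adj.r.squared,
    Cp     = sse / mse_full - (n - 2 * p),
    AIC    = AIC(model),
    RMSE   = sqrt(sse / n))             # assumption: n in the denominator
}

comparison <- rbind(cand1 = fit_stats(cand1, full),
                    cand2 = fit_stats(cand2, full))
round(comparison, 3)
```

Two models with identical values on every measure contain the same terms, which is one way to spot when two selection methods arrived at the same model.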

Lecture 17 In-class Exercises - Q3

Session ID: bua345s25

Which two model selection methods arrived at the same model for the wine data?

  • On the next few slides I will show pairs of stepwise summaries so you can compare them.

Backward Elimination and Forward Selection

Backward Elimination

Forward Selection

Backward Elimination and Stepwise Selection

Backward Elimination

Stepwise Selection

Forward Selection and Stepwise Selection

Forward Selection

Stepwise Selection

Wine Model Validation Plot (R = 0.58)

Best Subsets

  • Another model selection method is ‘Best Subsets’

    • Output shows ‘Best’ one variable model, ‘Best’ two variable model, ‘Best’ three variable model, etc.
  • Each ‘Best’ model is determined by multiple Fit Statistics.

  • This method then examines which of these candidates is the overall best by comparing their fit statistics.

  • If we are fortunate, the optimal choice from Best Subsets matches a model above.

    • In this case (and HW 8) we are fortunate.
  • NOTE: ols_step_best_subset command is VERY slow. You do not need to rerun it. Output is provided.
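
The idea behind Best Subsets can be sketched by brute force in base R. This is an illustration only and is not how ols_step_best_subset is implemented: it uses built-in mtcars data with four hypothetical candidate predictors, fits every subset of each size, keeps the subset with the highest \(R^2\) at each size, and then compares those ‘Best’ models by Adjusted \(R^2\) and AIC. The number of subsets doubles with each additional predictor, which helps explain why the real command is slow.

```r
predictors <- c("wt", "hp", "qsec", "drat")   # hypothetical candidate predictors

# For each model size k, fit all subsets of k predictors and keep the best one
best_by_size <- lapply(seq_along(predictors), function(k) {
  subsets <- combn(predictors, k, simplify = FALSE)
  stats <- lapply(subsets, function(vars) {
    fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
    data.frame(size   = k,
               terms  = paste(vars, collapse = " + "),
               r2     = summary(fit)$r.squared,
               adj_r2 = summary(fit)$adj.r.squared,
               aic    = AIC(fit),
               stringsAsFactors = FALSE)
  })
  stats <- do.call(rbind, stats)
  stats[which.max(stats$r2), ]                # 'Best' model of this size
})

results <- do.call(rbind, best_by_size)
rownames(results) <- NULL
results

# Overall choice among the 'Best' models, here by Adjusted R-squared
best_overall <- results[which.max(results$adj_r2), ]
best_overall
```

Within one size all models have the same number of terms, so \(R^2\) is a fair way to rank them. Across sizes a measure that penalizes extra terms, such as Adjusted \(R^2\), C(p), or AIC, is needed.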

Some of the Best Subsets Plots

Reading Best Subsets Output

Tabular Output

  • Bottom table shows which model performs best, based on all of the fit statistics.

    • For example, if model 3 (Three variable model) was best, it would have the HIGHEST Adjusted \(R^2\), Lowest C(p), and Lowest AIC.

      • We can see from bottom table that Model 3 is not the best.
    • Model 7 IS the best because it does have the HIGHEST Adjusted \(R^2\), Lowest C(p), and Lowest AIC.

  • Top table lists the variables in each of the ‘Best’ models.

Wine Best Subset Output

Preview of HW 8 - Part 1

  • Review model comparisons for Animal Data from first part of lecture.

  • Compare the optimal best subset model (Model 7) to the model found by both Backward Elimination and Forward Selection.

  • The goal is to determine to what extent they agree.

    • Spoiler: They are in complete agreement which indicates that we have consensus on the model for these data.

Reminder of Upcoming Dates

  • Today’s Lecture (3/18) is the third and final lecture on model and variable selection.

  • HW 7 is due tomorrow, Wed., 3/19.

  • HW 8 is now posted and is due Wednesday, 3/26

    • Part 1 pertains to Lectures 15-17

    • Part 2 pertains to Lecture 18

  • Quiz 2 is on Tuesday, April 1st, in the classroom

    • Practice Questions will be posted this weekend.

Key Points from this Week

  • Regression modeling can be overwhelming because of all of the possible options.

    • Automating part of the variable selection process is helpful.

    • Trying different methods and comparing results is strongly recommended.

    • Results from Automated processes are preliminary models that can (and should) be tinkered with.

    • Once we have a final model we can add regression estimates and residuals to the dataset.

    • Methods Covered: Backward Elimination, Forward Selection, Stepwise Selection, Best Subsets

      • Compare results from multiple methods
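
Adding regression estimates and residuals to the dataset takes two assignments once a final model is chosen. The sketch below is a minimal base R illustration on the built-in mtcars data. The model and the new column names (est_mpg, resid_mpg) are hypothetical and stand in for a final model from this lecture.

```r
cars <- mtcars                              # work on a copy of the built-in data
final_model <- lm(mpg ~ wt + hp, data = cars)

# Add model estimates and residuals as new columns
cars$est_mpg   <- fitted(final_model)       # estimated response for each row
cars$resid_mpg <- resid(final_model)        # observed minus estimated

head(cars[, c("mpg", "est_mpg", "resid_mpg")])
```

Each observed value equals its estimate plus its residual, so sorting by the residual column shows which observations the model fits least well.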

To submit an Engagement Question or Comment about material from Lecture 17: Submit it by midnight today (day of lecture).