BUA 345 - Lectures 15

Introduction to Model Selection

Author

Penelope Pooler Eisenbies

Published

March 4, 2025

Housekeeping

HW 6 was due 3/5/2025 - 2 day grace period

Demo videos were posted on Sunday morning

HW 7 will be posted on Thursday (3/6) is due on Wednesday, 3/19.

Quiz 2 will be on 4/1/2025 - Date has changed and syllabus has been updated.

Today’s plan

Review of \(R^2\) and Adjusted \(R^2\)
- Selecting a model based on Adjusted \(R^2\)
Explanation of Quantitative Interactions
Building a full model
Backward Elimination for Model Selection

In-class Polling (Session ID: bua345s25)

Lecture 15 In-class Exercises - Q1

Recall the Actors and Athletes data that we examined in Lecture 14.

Code

```{r celeb_prof regression, echo=T}
# import and examine celeb profession dataset
celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types=F) 
# formatted regression output - saved and printed to screen
(celeb_interaction_ols<- ols_regress(Earnings ~ Age + Profession + Age*Profession,
                                     data=celeb_prof, iterm = T))
```

                        Model Summary                          
--------------------------------------------------------------
R                       0.987       RMSE                2.640 
R-Squared               0.974       MSE                 6.968 
Adj. R-Squared          0.967       Coef. Var           6.058 
Pred R-Squared          0.951       AIC                86.467 
MAE                     2.265       SBC                90.330 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares        DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression    4131.949         3       1377.316    148.246    0.0000 
Residual       111.489        12          9.291                      
Total         4243.437        15                                     
---------------------------------------------------------------------

                                         Parameter Estimates                                           
------------------------------------------------------------------------------------------------------
                model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
------------------------------------------------------------------------------------------------------
          (Intercept)    -50.297         9.054                  -5.555    0.000    -70.023    -30.571 
                  Age      1.824         0.179        0.983     10.170    0.000      1.433      2.215 
    ProfessionAthlete    227.218        12.389        0.263     18.340    0.000    200.224    254.212 
Age:ProfessionAthlete     -5.063         0.293       -1.487    -17.294    0.000     -5.701     -4.425 
------------------------------------------------------------------------------------------------------

Lecture 15 In-class Exercises - Q1

Session ID: bua345s25

Abridged Output

Review Question What is the slope for the linear model for Athletes?

Round answer to two decimal places.

Hint: To answer this question, you combine two terms:

the baseline slope term for Age: 1.824
the difference in slope Athlete: -5.063
Slope for Athletes = baseline + difference = ____

Review of Regression Terms \(R^2\) and Adjusted \(R^2\)

R is the correlation coefficient, \(R_{XY}\)
\(R^2\) is \(R_{XY}^2\)
\(R^2\) is also called coefficient of determination
Meaning of \(R^2\) in SLR: Proportion of variability in y explained by X
Adjusted \(R^2\) adjusts \(R^2\)> for number of explanatory (X) variables in model.
- Meaning of Adjusted \(R^2\) in MLR is a little less specific but similar to \(R^2\)

Example of\(R^2\) Interpretation

Import and Examine Insurance Data

Code

```{r import and examine insure_L15 data, echo=T}
insure <- read_csv("data/insure_L15.csv", show_col_types=F) # import
insure <- insure |> # create log transformed variable
  mutate(ln_Charges = log(Charges)) 
head(insure) |> kable()
```

Charges	Age	BMI	Children	ln_Charges
16884.924	19	27.900	0	9.734176
1725.552	18	33.770	1	7.453303
4449.462	28	33.000	3	8.400538
21984.471	33	22.705	0	9.998092
3866.855	32	28.880	0	8.260197
3756.622	31	25.740	0	8.231275

Examine Histograms of Charges and ln_Charges

Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

SLR model - Predictor (X) Variable: Age

                         Model Summary                           
----------------------------------------------------------------
R                       0.528       RMSE                  0.781 
R-Squared               0.279       MSE                   0.610 
Adj. R-Squared          0.278       Coef. Var             8.587 
Pred R-Squared          0.276       AIC                3140.619 
MAE                     0.630       SBC                3156.215 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                 ANOVA                                  
-----------------------------------------------------------------------
                Sum of                                                 
               Squares          DF    Mean Square       F         Sig. 
-----------------------------------------------------------------------
Regression     314.960           1        314.960    515.977    0.0000 
Residual       815.514        1336          0.610                      
Total         1130.474        1337                                     
-----------------------------------------------------------------------

                                 Parameter Estimates                                   
--------------------------------------------------------------------------------------
      model     Beta    Std. Error    Std. Beta       t        Sig     lower    upper 
--------------------------------------------------------------------------------------
(Intercept)    7.744         0.063                 122.272    0.000    7.620    7.868 
        Age    0.035         0.002        0.528     22.715    0.000    0.032    0.038 
--------------------------------------------------------------------------------------

More about \(R^2\) and How It’s Calculated

R = 0.528 which indicates a moderate correlation between Age and ln_Charges (Natural Log of Insurance Charges)
\(R^2\) = 0.279 which means that approximately 28% of the variability in ln_charges (Natural Log of Insurance Charges) is explained by Age.

More about \(R^2\) and How It’s Calculated

\(R^2\) can also be calculated from the Sum of Squares output:
- \(SS_{TOT}\) (Total. Sum of Squares): 1130.474 (Total variability in Y)
- \(SS_{REG}\) (Regression. Sum of Squares): 314.960 (Variability in Y explained by model)
- \(SS_{RES}\) (Residual Sum of Squares): 815.514 (Variability in Y NOT explained by model)
- \(R^2\) = \(SS_{REG}\) / \(SS_{TOT}\) = 314.96/1130.474 = 0.279

MLR with Quantitative Interaction Term

In Lecture 14, categorical terms and interactions had a simple interpretation:
- Each category has a unique SLR model:
  - The intercepts for different categories may or may not be different from baseline category(check P-values)
  - The slopes for different categories may or may not be different from baseline category (check P-values)

There are other kinds of interaction terms.
The first one we will discuss is an interaction between two QUANTITATIVE variables.

Example MLR with Quantitative Interaction Term

One POSSIBLE model for these data (there are many):

Code

```{r insurance model with quantitative interaction, echo=T}
# save and print mlr model output
(insure_mlr_quant1 <- ols_regress(ln_Charges ~ Age + BMI + Children + Age*Children, data=insure))
insure_model1 <- insure_mlr_quant1$model # save model parameters to use in calculations
```

                         Model Summary                           
----------------------------------------------------------------
R                       0.555       RMSE                  0.765 
R-Squared               0.307       MSE                   0.585 
Adj. R-Squared          0.305       Coef. Var             8.423 
Pred R-Squared          0.302       AIC                3091.957 
MAE                     0.626       SBC                3123.150 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                 ANOVA                                  
-----------------------------------------------------------------------
                Sum of                                                 
               Squares          DF    Mean Square       F         Sig. 
-----------------------------------------------------------------------
Regression     347.605           4         86.901    147.968    0.0000 
Residual       782.869        1333          0.587                      
Total         1130.474        1337                                     
-----------------------------------------------------------------------

                                   Parameter Estimates                                    
-----------------------------------------------------------------------------------------
       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
-----------------------------------------------------------------------------------------
 (Intercept)     7.195         0.127                 56.858    0.000     6.947     7.443 
         Age     0.037         0.002        0.496    20.009    0.000     0.033     0.040 
         BMI     0.012         0.003        0.078     3.383    0.001     0.005     0.018 
    Children     0.251         0.055        0.139     4.563    0.000     0.143     0.358 
Age:Children    -0.004         0.001       -0.066    -2.782    0.005    -0.006    -0.001 
-----------------------------------------------------------------------------------------

Interpreting Quantitative Interactions

Two CORRECT Interpretation(s) of this interaction:
1. The effect of age on insurance charges differs depending on how many children you have.
2. The effect of number of children on insurance charges differs depending your age.
Which interpretation the analyst emphasizes depends on the question being addressed.

Two Questions about Evaluating Interaction Terms:
- How do we decide if ANY interaction term should stay in the model?
- How do we attain estimates from a model with a qunatitative interaction?
  
  Example: If a person is 48, has a BMI of 26 and has 3 children, what is the estimate of their insurance changes in dollars (NOT the LN of their charges)?

Lecture 15 In-class Exercises - Q2

Session ID: bua345s25

Based on the R MLR output shown, is the interaction between Age and Number of Children useful in explaining differences in Insurance Charges?

Abridged Output

Lecture 15 In-class Exercises - Q3

Session ID: bua345s25

Using this model, what is estimated insurance charge for 45 year old with a BMI of 26 and 2 children? Round to closest whole dollar.

Calculation can be done in R or by hand.
- Age = 45
- BMI = 26
- Children = 2
- Age*Children = 45*2 = 90
On the next slide I demonstrate how to do this in R using the saved model.

Lecture 15 In-class Exercises - Q3

Code

```{r create a dataset with 1 new observation, echo=T}
Age <- 45      # specify values using variable names in model
BMI <- 26
Children <- 2
                         
(new_obs <- tibble(Age, BMI, Children)) # new_obs is 1 row dataset

(new_obs <- new_obs |>   # add regression estimate 
  mutate(est_ln_Charges = lm(insure_model1) |> predict(new_obs)))

#(new_obs <- new_obs |>  # back-transform estimate
# mutate(est_Charges = ____(_____)) 
```

# A tibble: 1 × 3
    Age   BMI Children
  <dbl> <dbl>    <dbl>
1    45    26        2
# A tibble: 1 × 4
    Age   BMI Children est_ln_Charges
  <dbl> <dbl>    <dbl>          <dbl>
1    45    26        2           9.31

Lecture 15 In-class Exercises - Q4

In the previous model, all included terms appear to be useful to the model. Is the interaction between Age and BMI also useful to the model?

Examine the model output to answer this question.

Code

```{r insure model with two quant interactions, echo=T}
# save and print mlr model output
(insure_mlr_quant2 <- ols_regress(ln_Charges ~ Age + BMI + Children + 
                                    Age*Children + Age*BMI, data=insure))
```

                         Model Summary                           
----------------------------------------------------------------
R                       0.555       RMSE                  0.764 
R-Squared               0.309       MSE                   0.584 
Adj. R-Squared          0.306       Coef. Var             8.419 
Pred R-Squared          0.302       AIC                3091.922 
MAE                     0.626       SBC                3128.315 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                 ANOVA                                  
-----------------------------------------------------------------------
                Sum of                                                 
               Squares          DF    Mean Square       F         Sig. 
-----------------------------------------------------------------------
Regression     348.795           5         69.759    118.871    0.0000 
Residual       781.679        1332          0.587                      
Total         1130.474        1337                                     
-----------------------------------------------------------------------

                                   Parameter Estimates                                    
-----------------------------------------------------------------------------------------
       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
-----------------------------------------------------------------------------------------
 (Intercept)     6.785         0.315                 21.567    0.000     6.168     7.402 
         Age     0.047         0.008        0.498     6.088    0.000     0.032     0.062 
         BMI     0.025         0.010        0.076     2.504    0.012     0.005     0.045 
    Children     0.249         0.055        0.139     4.540    0.000     0.142     0.357 
Age:Children    -0.004         0.001       -0.065    -2.750    0.006    -0.006    -0.001 
     Age:BMI     0.000         0.000       -0.033    -1.424    0.155    -0.001     0.000 
-----------------------------------------------------------------------------------------

Goodness of Fit - Adjusted \(R^2\)

Previous slides show two possible models for these data. There are 63 possible models with these X variables and all two way interactions.
Today we will discuss Adjusted \(R^2\) as one option to compare different models (We will cover other model comparison measures soon).
- Adjusted \(R^2\) adjusts \(R^2\) DOWNWARD by adding a penalty for additional predictor variables.
  - \(R^2\) (unadjusted) should NOT be used to compare MLR models.
  - Adding predictors will always increase \(R^2\), even if predictors are not useful.
  - Instead we adjust: We penalize model \(R^2\) for each additional variable added.
  - Adjusted \(R^2\) only increases if model fit improvement exceeds penalty for adding terms.

More about Goodness of Fit - Adjusted \(R^2\)

P-values for each term and change in Adjusted \(R^2\) often agree (but not always)
As P, number of predictors increases, the penalty increases.
Adjusted \(R^2 = 1 - \frac{(1-R^2)(n-1)}{n-P-1}\)
Students are not required to memorize this equation but you should understand what it is doing.

All Possible Models Sorted by Number of X variables

\(R^2\) ALWAYS increases as number of X variables increases.

Adjusted \(R^2\) ONLY increases if X variable is useful to model.

No. of Predictors	Predictors	\(R^2\)	Adjusted \(R^2\)
1	Age	0.2786	0.2781
1	Children	0.0260	0.0253
1	BMI	0.0176	0.0169
2	Age Children	0.2979	0.2969
2	Age BMI	0.2843	0.2832
3	Age BMI Children	0.3035	0.3019
4	Age BMI Children Age:Children	0.3075	0.3054
4	Age BMI Children Age:BMI	0.3046	0.3025
4	Age BMI Children BMI:Children	0.3036	0.3015

All Possible Models Sorted by Adj. \(R^2\)

\(R^2\) ALWAYS increases as number of X variables increases.

Adjusted \(R^2\) ONLY increases if X variable is useful to model.

No. of Predictors	Predictors	\(R^2\)	Adjusted \(R^2\)
4	Age BMI Children Age:Children	0.3075	0.3054
4	Age BMI Children Age:BMI	0.3046	0.3025
3	Age BMI Children	0.3035	0.3019
4	Age BMI Children BMI:Children	0.3036	0.3015
2	Age Children	0.2979	0.2969
2	Age BMI	0.2843	0.2832
1	Age	0.2786	0.2781
1	Children	0.0260	0.0253
1	BMI	0.0176	0.0169

Introduction to Model Selection

AKA Variable Selection

Adjusted \(R^2\) is good for comparing a few models.
In this case we knew that only 9 of the 63 possible models were reasonable.
If there are many possible reasonable models, we automate part of the selection process.
In MLR, the goal is to choose the simplest most accurate model, i.e. the ‘BEST’ set of independent variables
- How do we decide which variables should be in our model?
- There are many methods:
- A popular method, Backward Elimination, can also be done manually in any software:
  - Start with all potential terms (including potential interaction terms) in the model and removes the least significant term one at time

Next Topics in Model Selection

Looking ahead, we’ll also cover:
- Foreward Selection
- Stepwise Selection
- ‘All Possible’ models - compared using additional measures
Common Practice: Try multiple methods to develop preliminary final model and then tweak as needed.

Steps for Backward Elimination

Examine Matrix of Scatterplots and histograms and determine if any transformations are needed to linearize relationships between continuous predictors and response variable.
- Optional at this stage: Also examine correlation matrix to determine if some pairs of variables will be a concern
- New term - Multicollinearity: If two predictors (X variables) in model have a correlation of 0.8 or higher, they can not both stay in the model because they are multicollinear and cause the model to be unstable.
Create a ‘saturated’ model with all potential predictor variables and interaction terms
- This is subjective.
- Be as transparent as possible in your how you decide on your full model.
Use ‘Backward Elimination’ to pare model down to a preliminary model

Steps for Backward Elimination

Examine predictors in preliminary model to confirm they are not too highly correlated with each other.
- If two predictor variables have a correlation of 0.8 or greater, drop one of them (see above)
If model was modified in step 4, rerun model through Backward Elimination (not always needed).
Interpret final model.

Plan for Thursday and HW 7

In HW 7, you will examine the correlation matrix and then do simple versions of steps 3 and 6 of the model selection process.
Thursday, we will look at a couple of interesting models selection examples.

Example 1: Animals Data

Question: What factors affect a mammal’s sleep duration?**
Animals Data Notes:
- Population was limited to animals under 1000 pounds (two elephant species excluded).
- Natural log (LN) transformed variables were added to original data.
- Observations with missing values are removed below
- Working dataset has 49 observations (49 different species)

Preview of Lecture 16 Animals Data

Species	TotalSleep	BodyWt	LNBodyWt	BrainWt	LNBrainWt	LifeSpan	LNLifeSpan	Gestation	Predation	Exposure	Danger
Africangiantpouchedrat	8.3	1.00	0.00	6.6	1.89	4.5	1.50	42	3	1	3
Americanopossum	19.4	1.70	0.53	6.3	1.84	5.0	1.61	12	2	1	1
ArcticFox	12.5	3.39	1.22	44.5	3.80	14.0	2.64	60	1	1	1
Baboon	9.8	10.55	2.36	179.5	5.19	27.0	3.30	180	4	4	4
Bigbrownbat	19.7	0.02	-3.77	0.3	-1.20	19.0	2.94	35	1	1	1
Braziliantapir	6.2	160.00	5.08	169.0	5.13	30.4	3.41	392	4	5	4

Animals Data Dictionary - Description of Variables

Variable	Type	Description
Species	Nominal	Name of Species
TotalSleep	Quantitative	Total Sleep
BodyWt	Quantitative	Average Body Weight in kilograms
LNBodyWt	Quantitative	Natural Log of Body Weight
BrainWt	Quantitative	Average Brain Weight in grams
LNBrainWt	Quantitative	Natural Log of Brain Weight
LifeSpan	Quantitative	Maximum Life Span in years
LNLifeSpan	Quantitative	Natural Log of Life Span
Gestation	Quantitative	Gestation Time in days
Predation	Ordinal	Predation Index (1=least likely to be prey)
Exposure	Ordinal	Sleep Exposure Index (1=least exposed)
Danger	Ordinal	Overall Danger Index (1=least danger from other animals)

Key Points from Today

Regression modeling can be overwhelming
- Automating part of the variable selection process is helpful.
- Today we introduced Backward Elimination
- Thursday we will look at a couple other model selection methods.
- Try different methods and compare results.
- Results from automated processes are preliminary.
HW 6 due on Wed. 3/5 (Grace Period extended until 3/7).
HW 7 will be posted by 3/7 and is due on Wed. 3/19.
Date of Quiz 2 has been changed to Tuesday, 4/1.

To submit an Engagement Question or Comment about material from Lecture 15: Submit it by midnight today (day of lecture).