Lectures 15 and 16 - Introduction to Model Selection

Penelope Pooler Eisenbies
BUA 345

2024-03-06

Housekeeping

Upcoming Dates

HW 6 is due Wednesday (3/6)
HW 7 will be posted on Thursday (3/7) is due on Wednesday, 3/20.
Quiz 2 is Thursday, March 28th
- There will be an asynchronous option.
Thursday’s Lecture (3/7) will include In-class Exercises using the Animals Data and your HW 7 data to help you make progress.

More Housekeeping

This Week’s Plan 📋

Review of \(R^2\) and Adjusted \(R^2\)
- Selecting a model based on Adjusted \(R^2\)
Explanation of Quantitative Interactions
Building a full model
Backward Elimination for Model Selection

💥 Lecture 14 In-class Exercises - Q1 - Review 💥

Recall the Actors and Athletes data that we examined in Lecture 14.

# import and examine celeb profession dataset
celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types=F) 

# formatted regression output - saved and printed to screen
(celeb_interaction_ols<- ols_regress(Earnings ~ Age + Profession + Age*Profession,
                                     data=celeb_prof, iterm = T))

Abridged Output

💥 Lecture 15 In-class Exercises - Q1 - Review 💥

Session ID: bua345s24

Abridged Output

Review Question What is the slope for the linear model for Athletes?

Round answer to two decimal places.

Hint: To answer this question, you combine two terms:

the baseline slope term for Age: 1.824
the difference in slope Athlete: -5.063
Slope for Athletes = baseline + difference = ____

Review of Regression Terms \(R^2\) and Adjusted \(R^2\)

R is the correlation coefficient, \(R_{XY}\)
\(R^2\) is \(R_{XY}^2\)
\(R^2\) is also called coefficient of determination
Meaning of \(R^2\) in SLR: Proportion of variability in y explained by X
Adjusted \(R^2\) adjusts \(R^2\)> for number of explanatory (X) variables in model.
- Meaning of Adjusted \(R^2\) in MLR is a little less specific but similar to \(R^2\)

Example of\(R^2\) Interpretation

Import and Examine Insurance Data

insure <- read_csv("data/insure_L15.csv", show_col_types=F) # import
insure <- insure |> # create log transformed variable
  mutate(ln_Charges = log(Charges)) |> glimpse(width=60)

Rows: 1,338
Columns: 5
$ Charges    <dbl> 16884.924, 1725.552, 4449.462, 21984.47…
$ Age        <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60,…
$ BMI        <dbl> 27.900, 33.770, 33.000, 22.705, 28.880,…
$ Children   <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, …
$ ln_Charges <dbl> 9.734176, 7.453302, 8.400538, 9.998092,…

Examine Histograms of Charges and ln_Charges

SLR model - Predictor (X) Variable: Age

# save and print slr model
(insure_slr <- ols_regress(ln_Charges ~ Age, data=insure))

Abridged Output

More about \(R^2\) and How It’s Calculated

R = 0.528 which indicates a moderate correlation between Age and ln_Charges (Natural Log of Insurance Charges)
\(R^2\) = 0.279 which means that approximately 28% of the variability in ln_charges (Natural Log of Insurance Charges) is explained by Age.

More about \(R^2\) and How It’s Calculated

\(R^2\) can also be calculated from the Sum of Squares output:
- \(SS_{TOT}\) (Total. Sum of Squares): 1130.474 (Total variability in Y)
- \(SS_{REG}\) (Regression. Sum of Squares): 314.960 (Variability in Y explained by model)
- \(SS_{RES}\) (Residual Sum of Squares): 815.514 (Variability in Y NOT explained by model)
- \(R^2\) = \(SS_{REG}\) / \(SS_{TOT}\) = 314.96/1130.474 = 0.279

MLR with Quantitative Interaction Term

In Lecture 14, categorical terms and interactions had a simple interpretation:
- Each category has a unique SLR model:
  - The intercepts for different categories may or may not be different from baseline category(check P-values)
  - The slopes for different categories may or may not be different from baseline category (check P-values)

There are other kinds of interaction terms.
The first one we will discuss is an interaction between two QUANTITATIVE variables.

Example MLR with Quantitative Interaction Term

One POSSIBLE model for these data (there are many):

# save and print mlr model output
(insure_mlr_quant1 <- ols_regress(ln_Charges ~ Age + BMI + Children + Age*Children, data=insure))

# save model parameters to use in calculations
insure_model1 <- insure_mlr_quant1$model

Abridged Output

Interpreting Quantitative Interactions

Two CORRECT Interpretation(s) of this interaction:
1. The effect of age on insurance charges differs depending on how many children you have.
2. The effect of number of children on insurance charges differs depending your age.
Which interpretation the analyst emphasizes depends on the question being addressed.

Two Questions about Evaluating Interaction Terms:
- How do we decide if ANY interaction term should stay in the model?
- How do we attain estimates from a model with a qunatitative interaction?
  
  Example: If a person is 48, has a BMI of 26 and has 3 children, what is the estimate of their insurance changes in dollars (NOT the LN of their charges)?

💥 Lecture 15 In-class Exercises - Q2💥

Session ID: bua345s24

Based on the R MLR output shown, is the interaction between Age and Number of Children useful in explaining differences in Insurance Charges?

Abridged Output

💥 Lecture 15 In-class Exercises - Q3💥

Session ID: bua345s24

Using this model, what is estimated insurance charge for 45 year old with a BMI of 26 and 2 children? Round to closest whole dollar.

Calculation can be done in R or by hand.
- Age = 45
- BMI = 26
- Children = 2
- Age*Children = 45*2 = 90
On the next slide I demonstrate how to do this in R using the saved model.

💥 Lecture 15 In-class Exercises - Q3💥

Age <- 45      # specify values using variable names in model
BMI <- 26
Children <- 2
                         # new_obs is 1 row dataset
(new_obs <- tibble(Age, BMI, Children))

# A tibble: 1 × 3
    Age   BMI Children
  <dbl> <dbl>    <dbl>
1    45    26        2

(new_obs <- new_obs |>   # add regression estimate 
  mutate(est_ln_Charges = lm(insure_model1) |> predict(new_obs)))

# A tibble: 1 × 4
    Age   BMI Children est_ln_Charges
  <dbl> <dbl>    <dbl>          <dbl>
1    45    26        2           9.31

#(new_obs <- new_obs |>  # back-transform estimate
# mutate(est_Charges = ____(_____))

💥 Lecture 15 In-class Exercises - Q4 💥

In the above model, all included terms appear to be useful to the model. Is the interaction between Age and BMI also useful to the model?

Examine the model output to answer this question.

# save and print mlr model output
(insure_mlr_quant2 <- ols_regress(ln_Charges ~ Age + BMI + Children + 
                                    Age*Children + Age*BMI, data=insure))

Abridged output

Goodness of Fit - Adjusted \(R^2\)

Previous slides show two possible models for these data. There are 63 possible models with these X variables and all two way interactions.
Today we will discuss Adjusted \(R^2\) as one option to compare different models (We will cover other model comparison measures soon).
- Adjusted \(R^2\) adjusts \(R^2\) DOWNWARD by adding a penalty for additional predictor variables.
  - \(R^2\) (unadjusted) should NOT be used to compare MLR models.
  - Adding predictors will always increase \(R^2\), even if predictors are not useful.
  - Instead we adjust: We penalize model \(R^2\) for each additional variable added.
  - Adjusted \(R^2\) only increases if model fit improvement exceeds penalty for adding terms.

More about Goodness of Fit - Adjusted \(R^2\)

P-values for each term and change in Adjusted \(R^2\) often agree (but not always)
As P, number of predictors increases, the penalty increases.
Adjusted \(R^2 = 1 - \frac{(1-R^2)(n-1)}{n-P-1}\)
Students are not required to memorize this equation but you should understand what it is doing.

All Possible Models Sorted by Number of X variables

\(R^2\) ALWAYS increases as number of X variables increases.

Adjusted \(R^2\) ONLY increases if X variable is useful to model.

No. of Predictors	Predictors	\(R^2\)	Adjusted \(R^2\)
1	Age	0.2786	0.2781
1	Children	0.0260	0.0253
1	BMI	0.0176	0.0169
2	Age Children	0.2979	0.2969
2	Age BMI	0.2843	0.2832
3	Age BMI Children	0.3035	0.3019
4	Age BMI Children Age:Children	0.3075	0.3054
4	Age BMI Children Age:BMI	0.3046	0.3025
4	Age BMI Children BMI:Children	0.3036	0.3015

All Possible Models Sorted by Adj. \(R^2\)

\(R^2\) ALWAYS increases as number of X variables increases.

Adjusted \(R^2\) ONLY increases if X variable is useful to model.

No. of Predictors	Predictors	\(R^2\)	Adjusted \(R^2\)
4	Age BMI Children Age:Children	0.3075	0.3054
4	Age BMI Children Age:BMI	0.3046	0.3025
3	Age BMI Children	0.3035	0.3019
4	Age BMI Children BMI:Children	0.3036	0.3015
2	Age Children	0.2979	0.2969
2	Age BMI	0.2843	0.2832
1	Age	0.2786	0.2781
1	Children	0.0260	0.0253
1	BMI	0.0176	0.0169

Intro to Model Selection (AKA Variable Selection)

Adjusted \(R^2\) is good for comparing a few models.
In this case we new that only 9 of the 63 possible models were reasonable.
If there are many possible reasonable models, we automate part of the selection process.
In MLR, the goal is to choose the simplest most accurate model, i.e. the ‘BEST’ set of independent variables
- How do we decide which variables should be in our model?
- There are many methods:
- A popular method, Backward Elimination, can also be done manually in any software:
  - Start with all potential terms (including potential interaction terms) in the model and removes the least significant term one at time

Next Topics in Model Selection

Looking ahead, we’ll also cover:
- Foreward Selection
- Stepwise Selection
- ‘All Possible’ models - compared using additional measures
Common Practice: Try multiple methods to develop preliminary final model and then tweak as needed.

Steps for Backward Elimination

Examine Matrix of Scatterplots and histograms and determine if any transformations are needed to linearize relationships between continuous predictors and response variable.
- Optional at this stage: Also examine correlation matrix to determine if some pairs of variables will be a concern
- New term - Multicollinearity: If two predictors (X variables) in model have a correlation of 0.8 or higher, they can not both stay in the model because they are multicollinear and cause the model to be unstable.
Create a ‘saturated’ model with all potential predictor variables and interaction terms

This is subjective.
Be as transparent as possible in your how you decide on your full model.

Use ‘Backward Elimination’ to pare model down to a preliminary model

Steps for Backward Elimination

Examine predictors in preliminary model to confirm they are not too highly correlated with each other.
- If two predictor variables have a correlation of 0.8 or greater, drop one of them (see above)
If model was modified in step 4, rerun model through Backward Elimination (not always needed).
Interpret final model.

Plan for This Week and HW 7

In HW 7, you will examine the correlation matrix and then do simple versions of steps 3 and 6 of the model selection process.
This week, we look at couple of interesting models selection examples.

Example 1: Animals Data

Question: What factors affect a mammal’s sleep duration?**
Animals Data Notes:
- Population was limited to animals under 1000 pounds (two elephant species excluded).
- Natural log (LN) transformed variables were added to original data.
- Observations with missing values are removed below
- Working dataset has 49 observations (49 different species)

Animals Data

# import and examine data
animals <- read_csv("data/animals.csv", show_col_types=F) |>
  filter(!is.na(LifeSpan) & !is.na(Gestation)) 

animals |> glimpse(width=60)

Rows: 49
Columns: 12
$ Species    <chr> "Africangiantpouchedrat", "Americanopos…
$ TotalSleep <dbl> 8.3, 19.4, 12.5, 9.8, 19.7, 6.2, 14.5, …
$ BodyWt     <dbl> 1.00, 1.70, 3.39, 10.55, 0.02, 160.00, …
$ LNBodyWt   <dbl> 0.00, 0.53, 1.22, 2.36, -3.77, 5.08, 1.…
$ BrainWt    <dbl> 6.6, 6.3, 44.5, 179.5, 0.3, 169.0, 25.6…
$ LNBrainWt  <dbl> 1.89, 1.84, 3.80, 5.19, -1.20, 5.13, 3.…
$ LifeSpan   <dbl> 4.5, 5.0, 14.0, 27.0, 19.0, 30.4, 28.0,…
$ LNLifeSpan <dbl> 1.50, 1.61, 2.64, 3.30, 2.94, 3.41, 3.3…
$ Gestation  <dbl> 42, 12, 60, 180, 35, 392, 63, 230, 112,…
$ Predation  <dbl> 3, 2, 1, 4, 1, 4, 1, 1, 5, 5, 5, 1, 2, …
$ Exposure   <dbl> 1, 1, 1, 4, 1, 5, 2, 1, 4, 5, 5, 1, 2, …
$ Danger     <dbl> 3, 1, 1, 4, 1, 4, 1, 1, 4, 5, 5, 1, 2, …

Animals Data Dictionary - Description of Variables

Intuitvely, there is likely to be redundancy between Predation, Exposure, and Danger.

# A tibble: 12 × 3
   Variable   Type         Description                                          
   <chr>      <chr>        <chr>                                                
 1 Species    Nominal      Name of Species                                      
 2 TotalSleep Quantitative Total Sleep                                          
 3 BodyWt     Quantitative Average Body Weight in kilograms                     
 4 LNBodyWt   Quantitative Natural Log of Body Weight                           
 5 BrainWt    Quantitative Average Brain Weight in grams                        
 6 LNBrainWt  Quantitative Natural Log of Brain Weight                          
 7 LifeSpan   Quantitative Maximum Life Span in years                           
 8 LNLifeSpan Quantitative Natural Log of Life Span                             
 9 Gestation  Quantitative Gestation Time in days                               
10 Predation  Ordinal      Predation Index (1=least likely to be prey)          
11 Exposure   Ordinal      Sleep Exposure Index (1=least exposed)               
12 Danger     Ordinal      Overall Danger Index (1=least danger from other anim…

💥 Lecture 16 In-class Exercises - Q1 💥

Session ID: bua345s24

Which two ordinal categorical predictor variables appear to be multicollinear, i.e., highly correlated?

           TotalSleep Predation Exposure Danger
TotalSleep       1.00     -0.48    -0.63  -0.63
Predation       -0.48      1.00     0.66   0.95
Exposure        -0.63      0.66     1.00   0.78
Danger          -0.63      0.95     0.78   1.00

Scatterplot Matrix - Visual Representation of Correlations

pairs(animal_mat3)

Backward Elimination

Data examination and transformations completed
Create a full ‘saturated’ model with all potential predictor variables and interaction terms (This is subjective).

# convert ordinal variables to factors
animals <- animals |>       
  mutate(PredF = factor(Predation), 
         ExposF = factor(Exposure), 
         DangrF=factor(Danger))

# full model (subjective)
animals_full <- lm(TotalSleep ~ LNBodyWt + LNBrainWt + 
                     LNLifeSpan + Gestation + 
                     PredF + ExposF + DangrF + 
                     LNBodyWt*Gestation + LNLifeSpan*PredF + 
                     LNLifeSpan*ExposF + LNLifeSpan*DangrF, data=animals)

Backward Elimination Continued

Use ‘Backward Elimination’ to pare full model down to a preliminary model.

We cast a wide net by specifying that terms will remain in model if p-value < 0.1.

(animals_BE <- ols_step_backward_p(animals_full, p_val = 0.1, progress = F))


                                 Stepwise Summary                                 
--------------------------------------------------------------------------------
Step    Variable                AIC        SBC       SBIC       R2       Adj. R2 
--------------------------------------------------------------------------------
 0      Full Model            240.461    299.107    39.729    0.89048    0.73714 
 1      LNLifeSpan:DangrF     232.882    283.961    40.124    0.88953    0.76946 
 2      LNBrainWt             231.276    280.464    40.494    0.88864    0.77728 
 3      LNLifeSpan:ExposF     229.366    270.986    46.531    0.87390    0.78383 
 4      LNBodyWt:Gestation    227.366    267.095    46.512    0.87390    0.79129 
 5      ExposF                227.508    259.669    54.605    0.85111    0.78343 
--------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                          
---------------------------------------------------------------
R                       0.923       RMSE                 1.743 
R-Squared               0.851       MSE                  4.511 
Adj. R-Squared          0.783       Coef. Var           19.991 
Pred R-Squared          0.660       AIC                227.508 
MAE                     1.283       SBC                259.669 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                
-------------------------------------------------------------------
               Sum of                                              
              Squares        DF    Mean Square      F         Sig. 
-------------------------------------------------------------------
Regression    851.000        15         56.733    12.576    0.0000 
Residual      148.871        33          4.511                     
Total         999.871        48                                    
-------------------------------------------------------------------

                                      Parameter Estimates                                        
------------------------------------------------------------------------------------------------
            model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
------------------------------------------------------------------------------------------------
      (Intercept)      6.324         2.635                  2.401    0.022      0.964    11.684 
         LNBodyWt     -0.813         0.202       -0.515    -4.027    0.000     -1.224    -0.402 
       LNLifeSpan      3.009         0.909        0.622     3.311    0.002      1.160     4.858 
        Gestation     -0.019         0.005       -0.424    -3.736    0.001     -0.029    -0.009 
           PredF2     14.639         3.291        1.431     4.448    0.000      7.944    21.335 
           PredF3     17.053         5.383        1.237     3.168    0.003      6.101    28.005 
           PredF4     11.414         4.830        0.884     2.363    0.024      1.587    21.241 
           PredF5      0.722         6.052        0.067     0.119    0.906    -11.592    13.035 
          DangrF2     -6.810         1.746       -0.648    -3.900    0.000    -10.363    -3.258 
          DangrF3     -8.701         3.444       -0.674    -2.527    0.016    -15.708    -1.695 
          DangrF4     -7.957         4.344       -0.651    -1.832    0.076    -16.794     0.881 
          DangrF5    -16.325         4.456       -1.265    -3.664    0.001    -25.390    -7.259 
LNLifeSpan:PredF2     -3.334         1.299       -0.794    -2.567    0.015     -5.976    -0.692 
LNLifeSpan:PredF3     -5.444         3.940       -0.615    -1.382    0.176    -13.459     2.571 
LNLifeSpan:PredF4     -1.160         1.537       -0.253    -0.755    0.456     -4.286     1.967 
LNLifeSpan:PredF5      3.357         1.868        0.887     1.797    0.081     -0.443     7.157 
------------------------------------------------------------------------------------------------

Backward Elimination - Preliminary Model

Note that each category of each factor variable is shown making model look more complex than it is.

Backward Elimination - Next Steps

Examine predictors in preliminary model to confirm they are not too highly correlated with each other.
- If correlation for two variables, \(R_{XY} \geq 0.8\), then one variable should be excluded.
- Variables in preliminary model: : LNBodyWt, LNLifeSpan, Gestation, PredF, DangrF, LNLifeSpan*PredF

Recall that PredF (Predation) and DangrF (Danger) are highly correlated.
PredF is included in an interaction term so exclude DangrF.

Backward Elimination - Next Steps - Continued

If model was modified in Step 4, rerun model through Backward Elimination (not always needed).
Interpret final model.

Adjusted \(R^2\) = 0.655
Model (next slide) looks complicated, but each animal is in only one Predation Category.
Baseline Predation Category = 1

# specify final model
(animals_final <- ols_regress(TotalSleep ~ LNBodyWt + LNLifeSpan + Gestation + 
                               PredF + LNLifeSpan*PredF, data=animals))
# save coefficients
animals_model <- animals_final$model

Backwards Elimination - Animal Data Final Model

# specify final model
(animals_final <- ols_regress(TotalSleep ~ LNBodyWt + LNLifeSpan + Gestation + 
                               PredF + LNLifeSpan*PredF, data=animals))

animals_model <- animals_final$model # save coefficients

                         Model Summary                          
---------------------------------------------------------------
R                       0.857       RMSE                 2.329 
R-Squared               0.734       MSE                  7.181 
Adj. R-Squared          0.655       Coef. Var           25.223 
Pred R-Squared          0.547       AIC                247.894 
MAE                     1.857       SBC                272.488 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                              ANOVA                                
------------------------------------------------------------------
               Sum of                                             
              Squares        DF    Mean Square      F        Sig. 
------------------------------------------------------------------
Regression    734.163        11         66.742    9.294    0.0000 
Residual      265.708        37          7.181                    
Total         999.871        48                                   
------------------------------------------------------------------

                                      Parameter Estimates                                       
-----------------------------------------------------------------------------------------------
            model      Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
-----------------------------------------------------------------------------------------------
      (Intercept)     6.751         3.305                  2.043    0.048      0.054    13.448 
         LNBodyWt    -0.698         0.244       -0.442    -2.859    0.007     -1.192    -0.203 
       LNLifeSpan     2.855         1.133        0.591     2.519    0.016      0.559     5.151 
        Gestation    -0.020         0.006       -0.447    -3.285    0.002     -0.032    -0.008 
           PredF2    13.998         4.132        0.041     3.388    0.002      5.626    22.369 
           PredF3    11.883         5.514       -0.494     2.155    0.038      0.711    23.056 
           PredF4     2.654         4.102        0.021     0.647    0.522     -5.658    10.966 
           PredF5    -0.782         4.262       -0.316    -0.183    0.855     -9.418     7.855 
LNLifeSpan:PredF2    -5.367         1.478       -0.471    -3.632    0.001     -8.361    -2.373 
LNLifeSpan:PredF3    -7.390         3.141       -0.588    -2.352    0.024    -13.755    -1.025 
LNLifeSpan:PredF4    -0.941         1.356       -0.083    -0.694    0.492     -3.689     1.807 
LNLifeSpan:PredF5    -1.043         1.446       -0.091    -0.721    0.475     -3.973     1.887 
-----------------------------------------------------------------------------------------------

Using Model to Find Estimates

Exporting Model and Data to Excel

This model can be used to find model estimates and residuals for all animals.
We will ALSO do these calculations in an Excel Spreadsheet to clarify each model component in estimate.
Below we export the data for three species to examine how the model works

animals_model_data <- animals |>             # create new dataset with model variables only
  select(Species, TotalSleep, LNBodyWt, LNLifeSpan, Gestation, PredF)

three_species <- animals_model_data |>       # create mini dataset with three species
  filter(Species %in% c("Baboon", "Donkey", "ArcticFox")) |>
  write_csv("ThreeSpecies.csv")

Species	TotalSleep	LNBodyWt	LNLifeSpan	Gestation	PredF
ArcticFox	12.5	1.22	2.64	60	1
Baboon	9.8	2.36	3.30	180	4
Donkey	3.1	5.23	3.69	365	5

Using a Model to Find Estimates

Model coefficients for calculations can be extracted and exported to Excel.
Below We create a two column dataset listing each model component and it’s beta coefficient.
That dataset is exported as a .csv file for an in-class exercise.

# examine and export model betas for worksheet
beta <- animals_final$betas
model_term <- names(beta)
animal_betas <- tibble(model_term, beta) |> 
    write_csv("animal_betas.csv")

model_term	beta
(Intercept)	6.7512
LNBodyWt	-0.6976
LNLifeSpan	2.8550
Gestation	-0.0198
PredF2	13.9979
PredF3	11.8834
PredF4	2.6536
PredF5	-0.7817
LNLifeSpan:PredF2	-5.3668
LNLifeSpan:PredF3	-7.3900
LNLifeSpan:PredF4	-0.9409
LNLifeSpan:PredF5	-1.0427

💥 Lecture 16 In-class Exercises - Q2 and Q3 💥

Session ID: bua345s24

Use the provided .csv file worksheet to answer these questions:

Question 2. What is the regression estimate of total sleep for ‘Donkey’?

Question 3. What is the regression estimate of total sleep for ‘Artic Fox’ (ArticFox)?

At Home Practice:

Complete the worksheet for ‘Baboon’ at home.
At least one question on Quiz 2 will include an Excel Worksheet like this where you have to correctly do the calculation using the model and x values from the data.
You can use R, but code to add estimates to dataset will not be provided.
This exercise is about understanding the model estimation process.

Using Model to Find Estimates in R

Model estimates can be calculated in R.
Excel Worksheet is used to demonstrate how those estimates are calculated.
You will calculate an estimate using a complex on Quiz 2.

Species	TotalSleep	Est_TotalSleep	Resid	LNBodyWt	LNLifeSpan	Gestation	PredF
Africangiantpouchedrat	8.3	11.00	-2.70	0.00	1.50	42	3
Americanopossum	19.4	16.10	3.30	0.53	1.61	12	2
ArcticFox	12.5	12.25	0.25	1.22	2.64	60	1
Baboon	9.8	10.51	-0.71	2.36	3.30	180	4

Model Validation

How good is our model?
There are many ways to examine model fit.
Here are two straightforward ways:
- Check correlation between observed and estimated values
- Plot a scatterplot of observed and estimated values

Model Validation Plot (R = 0.86)

HW 7 Demo - Questions 1 - 2

Students are provided with an R project to complete HW 7.
Read instructions which correspond to Blackboard HW Assignment 7.
Run Code Block 1 (setup) and Code Block 2 (import and examine ames dataset).
Code Black 3 (examine correlation matrix of X variables) is incomplete.
- Remove # symbols before incomplete R code.
- Replace blanks (____) with correct commands to calculate correlation matrix with values rounded to 2 decimal places.
- Run line or whole code block to view correlation matrix which is large.
- In same Code Block, remove # from the two lines of code at the bottom.
  - These two lines find the largest positive and negative correlations in the matrix.
Answer Questions 1 - 2 based on the correlation matrix and min/max output.

HW 7 Demo - Questions 3 - 6

Run Code Block 4 (specify full model and do backward elim) to:
- Create the full model with all variables and no interactions.
- Run the Backward Elimination.
Answer questions 3 - 6 based on the Backward Elimination model output

HW 7 Demo - Questions 7 - 11

Run Code Block 5 (save final model) to save the final model as final_ames_model
Complete the code in Code Block 6 (import new data and add predictions) and run code to add model estimates and residuals to new small dataset of two new houses.
- It is helpful to run the lines in this code block one at a time.
- Run the first command that begins new_houses <- read_csv(... to import a new small datset with 2 observations.
- Run the command that begins (new_houses <- new_houses |> mutate(Est_Price... to add Est_Price, the regression estimates to this dataset.

HW 7 Demo - Questions 7 - 11 Continued

Remove # before the following three lines to complete them:
- #(new_houses <- new_houses |>
- # mutate(Resid = ____ - ____ |> round()) |>
- # relocate(Est_Price, Resid, .after=Price))
In the line with the blanks you are calculating residuals as
- Price minus Estimated Price (Resid = Price - Est_Price)
- The next line relocates Est_Price and Resid in the left side of the dataset, after Price.
Answer Questions 7 - 11 based on this output.

Model Selection Methods

Recall that in Multiple Linear Regression (MLR) the goal is to choose the simplest most accurate model, i.e. the ‘BEST’ set of independent variables
How do we decide which variables should be in our model?
There are many methods:
We’ve discussed Backward Elimination which can also be done manually in any software (not recommended).

Description of Other Model Selection Methods

Backward Elimination starts with all potential terms (including potential interaction terms) in the model and removes the least significant term for each step.
- This is referred to as starting with a full or saturated model.
Forward Selection: By default, this procedure starts with an empty model and adds the most significant term at each step until there are no more useful terms to add.
- Forward selection also needs to know what terms are in the full model.
Stepwise Selection: By default, this procedure starts with an empty model and then adds or removes a term for each step.
Common Practice: Try multiple methods to develop preliminary final model and then tweak as needed.

Notes about Model Selection using Multiple Methods

The steps for other methods are similar to the steps for Backward Elimination.
Not all steps are ALWAYS required. It depends on how complex the data are.
In the following example, we only need to do part of Step 1 plus Steps 2, 3, and 6.
- For Step 1, we only need to examine correlations.
- In this case, Step 7 will be apparent.
- We can add model estimates to data for future interpretation (Step 8)

Steps for Model Selection Using Multiple Methods

Examine Matrix of Scatterplots and histograms and determine if any transformations are needed to linearize relationships between continuous predictors and response variable.

Also look at correlation matrix to check if there are pairs of variables to be concerned about.

Create a ‘saturated’ model with all potential predictor variables and interaction terms (Subjective!).
Use Backward Elimination, Forward Selection, and Stepwise Selection to find preliminary candidate models. (These are automated procedures!)

Carefully examine results to see where these candidate models agree and disagree.

Steps for Model Selection Using Multiple Methods Continued

Examine predictors in preliminary candidate models to confirm they are not too highly correlated with each other.

If two predictor variables in any model have a correlation of 0.8 or greater, drop one of them.

Rerun model selection methods, if a candidate model is substantially changed (not always needed).
Compare model fit statistics from final candidate model from all three methods.
Decide on final candidate and make final modifications, if needed.
Interpret final model.

Wine Data - Model Selection Example

Can we determine what factors affect wine quality even if we KNOW NOTHING about wine cultivation and chemistry?

Maybe!

Since we have no prior knowledge, we start with a straightforward full model with all available predictors and no interactions.
- In practice, a consultant would be working with a wine expert to carefully determine a saturated model that includes all possible interactions.

Import Wine Data

Notice that all variables are numeric (<dbl> stands for decimal value).

wine <- read_csv("data/wine.csv", show_col_types = F) |>
  glimpse(width=60)

Rows: 605
Columns: 11
$ Wine_Quality          <dbl> 5, 6, 7, 5, 7, 5, 5, 5, 5, 5…
$ Fixed_Acidity         <dbl> 9.3, 9.1, 7.9, 7.2, 11.9, 7.…
$ Volatile_Acidity      <dbl> 0.48, 0.22, 0.34, 1.00, 0.43…
$ Citric_Acidity        <dbl> 0.29, 0.24, 0.36, 0.00, 0.66…
$ Residual_Sugar        <dbl> 2.1, 2.1, 1.9, 3.0, 3.1, 2.2…
$ Chlorides             <dbl> 0.127, 0.078, 0.065, 0.102, …
$ Free_Sulphur_Dioxide  <dbl> 6, 1, 5, 7, 10, 5, 5, 48, 27…
$ Total_Sulphur_Dioxide <dbl> 16, 28, 10, 16, 23, 36, 21, …
$ Ph                    <dbl> 3.22, 3.41, 3.27, 3.43, 3.15…
$ Sulfate               <dbl> 0.72, 0.87, 0.54, 0.46, 0.85…
$ Alcohol               <dbl> 11.2, 10.3, 11.2, 10.0, 10.4…

Examine Correlation matrix for MultiCollinearity

(cor_wine <- wine |> cor() |> round(2)) #  correlation matrix

                      Wine_Quality Fixed_Acidity Volatile_Acidity
Wine_Quality                  1.00          0.11            -0.39
Fixed_Acidity                 0.11          1.00            -0.23
Volatile_Acidity             -0.39         -0.23             1.00
Citric_Acidity                0.22          0.68            -0.52
Residual_Sugar                0.04          0.20            -0.01
Chlorides                    -0.10          0.12             0.04
Free_Sulphur_Dioxide          0.01         -0.18            -0.05
Total_Sulphur_Dioxide        -0.08         -0.13             0.05
Ph                           -0.06         -0.70             0.19
Sulfate                       0.21          0.19            -0.24
Alcohol                       0.45         -0.08            -0.17
                      Citric_Acidity Residual_Sugar Chlorides
Wine_Quality                    0.22           0.04     -0.10
Fixed_Acidity                   0.68           0.20      0.12
Volatile_Acidity               -0.52          -0.01      0.04
Citric_Acidity                  1.00           0.16      0.21
Residual_Sugar                  0.16           1.00      0.05
Chlorides                       0.21           0.05      1.00
Free_Sulphur_Dioxide           -0.07           0.18     -0.04
Total_Sulphur_Dioxide           0.06           0.18      0.00
Ph                             -0.55          -0.14     -0.26
Sulfate                         0.27          -0.01      0.35
Alcohol                         0.10           0.07     -0.21
                      Free_Sulphur_Dioxide Total_Sulphur_Dioxide    Ph Sulfate
Wine_Quality                          0.01                 -0.08 -0.06    0.21
Fixed_Acidity                        -0.18                 -0.13 -0.70    0.19
Volatile_Acidity                     -0.05                  0.05  0.19   -0.24
Citric_Acidity                       -0.07                  0.06 -0.55    0.27
Residual_Sugar                        0.18                  0.18 -0.14   -0.01
Chlorides                            -0.04                  0.00 -0.26    0.35
Free_Sulphur_Dioxide                  1.00                  0.65  0.08    0.00
Total_Sulphur_Dioxide                 0.65                  1.00 -0.07    0.08
Ph                                    0.08                 -0.07  1.00   -0.24
Sulfate                               0.00                  0.08 -0.24    1.00
Alcohol                              -0.03                 -0.08  0.21    0.05
                      Alcohol
Wine_Quality             0.45
Fixed_Acidity           -0.08
Volatile_Acidity        -0.17
Citric_Acidity           0.10
Residual_Sugar           0.07
Chlorides               -0.21
Free_Sulphur_Dioxide    -0.03
Total_Sulphur_Dioxide   -0.08
Ph                       0.21
Sulfate                  0.05
Alcohol                  1.00

max(cor_wine[cor_wine < 1])

[1] 0.68

min(cor_wine)

[1] -0.7

Model Selection

We specify a full model using an easy shortcut:
- If all variables are included, you can use . instead of listing them all.
- This model specification is also used in HW 7.
The we do three model selection procedures:
- Backward Elimination (BE)
- Forward Selection (FS)
- Stepwise Selection (SS)

wine_full <- lm(Wine_Quality ~ ., data = wine)                 # specify full model

wine_BE <- ols_step_backward_p(wine_full, progress=F)          # backward elimination  

wine_FS <- ols_step_forward_p(wine_full, progress=F)           # forward selection

wine_SS <- ols_step_both_p(wine_full, progress=F)              # stepwise selection

Comparing Model Results

Look at the LAST step for each method to determine which method results in the best fit.
Comparison Measures:
- Adj. \(R^2\): Higher value indicates better model fit
- C(p): Lower value indicates better model fit (Also referred to as Mallow’s C(p)).
- AIC: Lower value indicates better model fit (Akaike Information Criteria).
- RMSE: Lower value indicates better model fit (Root mean Square Error).
By comparing these measures and accounting for our understanding of these procedures, we can determine that TWO of these methods arrived at the same model.

💥 Lecture 16 In-class Exercises - Q4 - Review 💥

Session ID: bua345s23

Which two model selection methods arrived at the same model for the wine data?

On the next few slides I will show pairs of stepwise summaries so you can compare them.

Backwards Elimination and Forward Selection

Backward Elimination Forward Selection

Backwards Elimination and Stepwise Selection

Backward Elimination

Stepwise Selection

Forward Selection and Stepwise Selection

Forward Selection Stepwise Selection

Wine Model Validation Plot (R = 0.58)

Key Points from This Week

Regression modeling can be overwhelming
- Automating part of the variable selection process is helpful.
- Try different methods and compare results.
- Results from automated processes are preliminary.
- Model estimates and residuals can be added to dataset.
  - Demonstrated for both datasets and in HW 7.
HW 6 due on Wed. 3/6 (Grace Period extended until 3/8).
HW 7 is posted and is due on Wed. 3/20

To submit an Engagement Question or Comment about material from Today’s Lecture: Submit by midnight today (day of lecture). Click on Link next to the ❓ under today’s lecture.