2024-03-06
HW 6 is due Wednesday (3/6)
HW 7 will be posted on Thursday (3/7) and is due on Wednesday (3/20).
Quiz 2 is Thursday, March 28th
Thursday’s Lecture (3/7) will include In-class Exercises using the Animals Data and your HW 7 data to help you make progress.
Review of \(R^2\) and Adjusted \(R^2\)
Explanation of Quantitative Interactions
Building a full model
Backward Elimination for Model Selection
Recall the Actors and Athletes data that we examined in Lecture 14.
Session ID: bua345s24
Review Question: What is the slope for the linear model for Athletes?
Round answer to two decimal places.
Hint: To answer this question, you combine two terms:
the baseline slope term for Age: 1.824
the difference in slope for Athlete: -5.063
Slope for Athletes = baseline + difference = ____
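The hint can be checked with a couple of lines of R. This is only a sketch: the object names are illustrative and are not part of the lecture code.

```r
# Coefficients quoted in the hint
baseline_slope <- 1.824    # slope term for Age (baseline category)
slope_diff     <- -5.063   # difference in slope for Athletes

# Slope for Athletes = baseline + difference
athlete_slope <- baseline_slope + slope_diff
round(athlete_slope, 2)
```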
R is the correlation coefficient, \(R_{XY}\)
\(R^2\) is \(R_{XY}^2\)
\(R^2\) is also called the coefficient of determination.
Meaning of \(R^2\) in SLR: Proportion of variability in Y explained by X
Adjusted \(R^2\) adjusts \(R^2\) for the number of explanatory (X) variables in the model.
Import and Examine Insurance Data
insure <- read_csv("data/insure_L15.csv", show_col_types = FALSE) # import
insure <- insure |> # create log transformed variable
  mutate(ln_Charges = log(Charges)) |> glimpse(width = 60)
Rows: 1,338
Columns: 5
$ Charges <dbl> 16884.924, 1725.552, 4449.462, 21984.47…
$ Age <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60,…
$ BMI <dbl> 27.900, 33.770, 33.000, 22.705, 28.880,…
$ Children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, …
$ ln_Charges <dbl> 9.734176, 7.453302, 8.400538, 9.998092,…
R = 0.528, which indicates a moderate correlation between Age and ln_Charges (Natural Log of Insurance Charges).
\(R^2\) = 0.279, which means that approximately 28% of the variability in ln_Charges (Natural Log of Insurance Charges) is explained by Age.
\(R^2\) can also be calculated from the Sum of Squares output:
\(SS_{TOT}\) (Total Sum of Squares): 1130.474 (Total variability in Y)
\(SS_{REG}\) (Regression Sum of Squares): 314.960 (Variability in Y explained by model)
\(SS_{RES}\) (Residual Sum of Squares): 815.514 (Variability in Y NOT explained by model)
\(R^2\) = \(SS_{REG}\) / \(SS_{TOT}\) = 314.96/1130.474 = 0.279
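The sum-of-squares arithmetic above can be reproduced in R. The sketch below uses only the three values reported on this slide; the object names are illustrative.

```r
# Sums of squares reported above
ss_tot <- 1130.474   # total variability in Y
ss_reg <- 314.960    # variability in Y explained by the model
ss_res <- 815.514    # variability in Y NOT explained by the model

# The explained and unexplained parts add up to the total
all.equal(ss_reg + ss_res, ss_tot)

# R-squared computed two equivalent ways
r2_from_reg <- ss_reg / ss_tot
r2_from_res <- 1 - ss_res / ss_tot
round(c(r2_from_reg, r2_from_res), 3)

# In SLR, the size of the correlation R is the square root of R-squared
round(sqrt(r2_from_reg), 3)
```

Both versions give 0.279, and the square root gives 0.528, matching the correlation between Age and ln_Charges reported earlier.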
In Lecture 14, categorical terms and interactions had a simple interpretation:
Each category has a unique SLR model:
The intercepts for different categories may or may not be different from baseline category (check P-values)
The slopes for different categories may or may not be different from baseline category (check P-values)
There are other kinds of interaction terms.
The first one we will discuss is an interaction between two QUANTITATIVE variables.
One POSSIBLE model for these data (there are many):
Two CORRECT interpretations of this interaction:
The effect of age on insurance charges differs depending on how many children you have.
The effect of number of children on insurance charges differs depending on your age.
Which interpretation the analyst emphasizes depends on the question being addressed.
Two Questions about Evaluating Interaction Terms:
How do we decide if ANY interaction term should stay in the model?
How do we obtain estimates from a model with a quantitative interaction?
Example: If a person is 48, has a BMI of 26, and has 3 children, what is the estimate of their insurance charges in dollars (NOT the LN of their charges)?
Based on the R MLR output shown, is the interaction between Age and Number of Children useful in explaining differences in Insurance Charges?
Using this model, what is the estimated insurance charge for a 45-year-old with a BMI of 26 and 2 children? Round to the closest whole dollar.
Calculation can be done in R or by hand.
Age = 45
BMI = 26
Children = 2
Age*Children = 45*2 = 90
On the next slide I demonstrate how to do this in R using the saved model.
Age <- 45 # specify values using variable names in model
BMI <- 26
Children <- 2
# new_obs is 1 row dataset
(new_obs <- tibble(Age, BMI, Children))
# A tibble: 1 × 3
Age BMI Children
<dbl> <dbl> <dbl>
1 45 26 2
(new_obs <- new_obs |> # add regression estimate
  mutate(est_ln_Charges = lm(insure_model1) |> predict(new_obs)))
# A tibble: 1 × 4
Age BMI Children est_ln_Charges
<dbl> <dbl> <dbl> <dbl>
1 45 26 2 9.31
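The estimate above is on the natural-log scale, but the question asks for dollars. A minimal sketch of the back-transformation follows. It starts from the displayed value 9.31, which is rounded, so the dollar figure is only approximate. In practice `exp()` would be applied to the unrounded `est_ln_Charges` column.

```r
# Estimate on the natural-log scale, as displayed in the output above (rounded)
est_ln_charges <- 9.31

# exp() reverses log(), returning the estimate to dollars
est_charges <- exp(est_ln_charges)
round(est_charges)
```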
In the above model, all included terms appear to be useful to the model. Is the interaction between Age and BMI also useful to the model?
Examine the model output to answer this question.
Previous slides show two possible models for these data. With three X variables (Age, BMI, Children) and their three two-way interactions, there are six candidate terms, so there are \(2^6 - 1 = 63\) possible models.
Today we will discuss Adjusted \(R^2\) as one option to compare different models (We will cover other model comparison measures soon).
Adjusted \(R^2\) adjusts \(R^2\) DOWNWARD by adding a penalty for additional predictor variables.
\(R^2\) (unadjusted) should NOT be used to compare MLR models.
Adding predictors will never decrease \(R^2\) (and will almost always increase it), even if the predictors are not useful.
Instead we adjust: We penalize model \(R^2\) for each additional variable added.
Adjusted \(R^2\) only increases if model fit improvement exceeds penalty for adding terms.
P-values for each term and change in Adjusted \(R^2\) often agree (but not always)
As P, the number of predictors, increases, the penalty increases.
Adjusted \(R^2 = 1 - \frac{(1-R^2)(n-1)}{n-P-1}\)
Students are not required to memorize this equation but you should understand what it is doing.
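To see what the equation is doing, it can be written as a small R function and applied to the unadjusted \(R^2\) values for the insurance models (n = 1,338 rows). This is a sketch: the function name `adj_r2` is made up for illustration, and the inputs are the rounded \(R^2\) values from the lecture tables.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1338  # rows in the insurance data

adj_age       <- adj_r2(0.2786, n, p = 1)  # Age only
adj_main      <- adj_r2(0.3035, n, p = 3)  # Age, BMI, Children
adj_age_child <- adj_r2(0.3075, n, p = 4)  # adds Age:Children
adj_bmi_child <- adj_r2(0.3036, n, p = 4)  # adds BMI:Children instead

round(c(adj_age, adj_main, adj_age_child, adj_bmi_child), 4)
# 0.2781 0.3019 0.3054 0.3015
```

Adding Age:Children raises Adjusted \(R^2\) above the three-predictor model. Adding BMI:Children lowers it, even though its unadjusted \(R^2\) is slightly higher, because the tiny improvement in fit does not exceed the penalty.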
\(R^2\) ALWAYS increases as number of X variables increases.
Adjusted \(R^2\) ONLY increases if X variable is useful to model.
No. of Predictors | Predictors | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
1 | Age | 0.2786 | 0.2781 |
1 | Children | 0.0260 | 0.0253 |
1 | BMI | 0.0176 | 0.0169 |
2 | Age Children | 0.2979 | 0.2969 |
2 | Age BMI | 0.2843 | 0.2832 |
3 | Age BMI Children | 0.3035 | 0.3019 |
4 | Age BMI Children Age:Children | 0.3075 | 0.3054 |
4 | Age BMI Children Age:BMI | 0.3046 | 0.3025 |
4 | Age BMI Children BMI:Children | 0.3036 | 0.3015 |
The same models, sorted from highest to lowest Adjusted \(R^2\):
No. of Predictors | Predictors | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
4 | Age BMI Children Age:Children | 0.3075 | 0.3054 |
4 | Age BMI Children Age:BMI | 0.3046 | 0.3025 |
3 | Age BMI Children | 0.3035 | 0.3019 |
4 | Age BMI Children BMI:Children | 0.3036 | 0.3015 |
2 | Age Children | 0.2979 | 0.2969 |
2 | Age BMI | 0.2843 | 0.2832 |
1 | Age | 0.2786 | 0.2781 |
1 | Children | 0.0260 | 0.0253 |
1 | BMI | 0.0176 | 0.0169 |
Adjusted \(R^2\) is good for comparing a few models.
In this case we knew that only 9 of the 63 possible models were reasonable.
If there are many possible reasonable models, we automate part of the selection process.
In MLR, the goal is to choose the simplest most accurate model, i.e. the ‘BEST’ set of independent variables
How do we decide which variables should be in our model?
There are many methods:
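A popular method, Backward Elimination, can also be done manually in any software: start with all potential terms in the model and remove the least significant term at each step.
Looking ahead, we’ll also cover Forward Selection and Stepwise Selection.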
Common Practice: Try multiple methods to develop preliminary final model and then tweak as needed.
1. Examine Matrix of Scatterplots and histograms and determine if any transformations are needed to linearize relationships between continuous predictors and the response variable.
Optional at this stage: Also examine the correlation matrix to determine if some pairs of variables will be a concern.
New term - Multicollinearity: If two predictors (X variables) in a model have a correlation of 0.8 or higher, they cannot both stay in the model because they are multicollinear and cause the model to be unstable.
2. Create a ‘saturated’ model with all potential predictor variables and interaction terms.
This is subjective.
Be as transparent as possible in how you decide on your full model.
3. Use Backward Elimination to find a preliminary model.
4. Examine predictors in preliminary model to confirm they are not too highly correlated with each other.
5. If model was modified in Step 4, rerun model through Backward Elimination (not always needed).
6. Interpret final model.
In HW 7, you will examine the correlation matrix and then do simple versions of steps 3 and 6 of the model selection process.
This week, we look at a couple of interesting model selection examples.
Example 1: Animals Data
Question: What factors affect a mammal’s sleep duration?
Animals Data Notes:
Population was limited to animals under 1000 pounds (two elephant species excluded).
Natural log (LN) transformed variables were added to original data.
Observations with missing values are removed below
Working dataset has 49 observations (49 different species)
# import and examine data
animals <- read_csv("data/animals.csv", show_col_types = FALSE) |>
  filter(!is.na(LifeSpan) & !is.na(Gestation))
animals |> glimpse(width=60)
Rows: 49
Columns: 12
$ Species <chr> "Africangiantpouchedrat", "Americanopos…
$ TotalSleep <dbl> 8.3, 19.4, 12.5, 9.8, 19.7, 6.2, 14.5, …
$ BodyWt <dbl> 1.00, 1.70, 3.39, 10.55, 0.02, 160.00, …
$ LNBodyWt <dbl> 0.00, 0.53, 1.22, 2.36, -3.77, 5.08, 1.…
$ BrainWt <dbl> 6.6, 6.3, 44.5, 179.5, 0.3, 169.0, 25.6…
$ LNBrainWt <dbl> 1.89, 1.84, 3.80, 5.19, -1.20, 5.13, 3.…
$ LifeSpan <dbl> 4.5, 5.0, 14.0, 27.0, 19.0, 30.4, 28.0,…
$ LNLifeSpan <dbl> 1.50, 1.61, 2.64, 3.30, 2.94, 3.41, 3.3…
$ Gestation <dbl> 42, 12, 60, 180, 35, 392, 63, 230, 112,…
$ Predation <dbl> 3, 2, 1, 4, 1, 4, 1, 1, 5, 5, 5, 1, 2, …
$ Exposure <dbl> 1, 1, 1, 4, 1, 5, 2, 1, 4, 5, 5, 1, 2, …
$ Danger <dbl> 3, 1, 1, 4, 1, 4, 1, 1, 4, 5, 5, 1, 2, …
Intuitively, there is likely to be redundancy between Predation, Exposure, and Danger.
# A tibble: 12 × 3
Variable Type Description
<chr> <chr> <chr>
1 Species Nominal Name of Species
2 TotalSleep Quantitative Total Sleep
3 BodyWt Quantitative Average Body Weight in kilograms
4 LNBodyWt Quantitative Natural Log of Body Weight
5 BrainWt Quantitative Average Brain Weight in grams
6 LNBrainWt Quantitative Natural Log of Brain Weight
7 LifeSpan Quantitative Maximum Life Span in years
8 LNLifeSpan Quantitative Natural Log of Life Span
9 Gestation Quantitative Gestation Time in days
10 Predation Ordinal Predation Index (1=least likely to be prey)
11 Exposure Ordinal Sleep Exposure Index (1=least exposed)
12 Danger Ordinal Overall Danger Index (1=least danger from other anim…
Which two ordinal categorical predictor variables appear to be multicollinear, i.e., highly correlated?
TotalSleep Predation Exposure Danger
TotalSleep 1.00 -0.48 -0.63 -0.63
Predation -0.48 1.00 0.66 0.95
Exposure -0.63 0.66 1.00 0.78
Danger -0.63 0.95 0.78 1.00
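With only a few predictors, the matrix above can be scanned by eye, but the 0.8 rule can also be applied programmatically. The sketch below re-enters the predictor-to-predictor correlations from the matrix above by hand (TotalSleep is the response, so it is left out). The object names are illustrative and not from the lecture code.

```r
# Correlations among the three ordinal predictors, copied from the matrix above
vars <- c("Predation", "Exposure", "Danger")
cor_mat <- matrix(c(1.00, 0.66, 0.95,
                    0.66, 1.00, 0.78,
                    0.95, 0.78, 1.00),
                  nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# Look only above the diagonal so each pair is reported once
idx <- which(abs(cor_mat) >= 0.8 & upper.tri(cor_mat), arr.ind = TRUE)

# One row per pair that meets the multicollinearity threshold
high_pairs <- data.frame(var1 = rownames(cor_mat)[idx[, "row"]],
                         var2 = colnames(cor_mat)[idx[, "col"]],
                         r    = cor_mat[idx])
high_pairs
```

Using `abs()` means a strong negative correlation would be flagged as well, which matters for larger matrices such as the one in HW 7.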
Data examination and transformations completed
Create a full ‘saturated’ model with all potential predictor variables and interaction terms (This is subjective).
# convert ordinal variables to factors
animals <- animals |>
  mutate(PredF = factor(Predation),
         ExposF = factor(Exposure),
         DangrF = factor(Danger))
# full model (subjective)
animals_full <- lm(TotalSleep ~ LNBodyWt + LNBrainWt +
                     LNLifeSpan + Gestation +
                     PredF + ExposF + DangrF +
                     LNBodyWt*Gestation + LNLifeSpan*PredF +
                     LNLifeSpan*ExposF + LNLifeSpan*DangrF, data = animals)
Stepwise Summary
--------------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
--------------------------------------------------------------------------------
0 Full Model 240.461 299.107 39.729 0.89048 0.73714
1 LNLifeSpan:DangrF 232.882 283.961 40.124 0.88953 0.76946
2 LNBrainWt 231.276 280.464 40.494 0.88864 0.77728
3 LNLifeSpan:ExposF 229.366 270.986 46.531 0.87390 0.78383
4 LNBodyWt:Gestation 227.366 267.095 46.512 0.87390 0.79129
5 ExposF 227.508 259.669 54.605 0.85111 0.78343
--------------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
---------------------------------------------------------------
R 0.923 RMSE 1.743
R-Squared 0.851 MSE 4.511
Adj. R-Squared 0.783 Coef. Var 19.991
Pred R-Squared 0.660 AIC 227.508
MAE 1.283 SBC 259.669
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
-------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
-------------------------------------------------------------------
Regression 851.000 15 56.733 12.576 0.0000
Residual 148.871 33 4.511
Total 999.871 48
-------------------------------------------------------------------
Parameter Estimates
------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
------------------------------------------------------------------------------------------------
(Intercept) 6.324 2.635 2.401 0.022 0.964 11.684
LNBodyWt -0.813 0.202 -0.515 -4.027 0.000 -1.224 -0.402
LNLifeSpan 3.009 0.909 0.622 3.311 0.002 1.160 4.858
Gestation -0.019 0.005 -0.424 -3.736 0.001 -0.029 -0.009
PredF2 14.639 3.291 1.431 4.448 0.000 7.944 21.335
PredF3 17.053 5.383 1.237 3.168 0.003 6.101 28.005
PredF4 11.414 4.830 0.884 2.363 0.024 1.587 21.241
PredF5 0.722 6.052 0.067 0.119 0.906 -11.592 13.035
DangrF2 -6.810 1.746 -0.648 -3.900 0.000 -10.363 -3.258
DangrF3 -8.701 3.444 -0.674 -2.527 0.016 -15.708 -1.695
DangrF4 -7.957 4.344 -0.651 -1.832 0.076 -16.794 0.881
DangrF5 -16.325 4.456 -1.265 -3.664 0.001 -25.390 -7.259
LNLifeSpan:PredF2 -3.334 1.299 -0.794 -2.567 0.015 -5.976 -0.692
LNLifeSpan:PredF3 -5.444 3.940 -0.615 -1.382 0.176 -13.459 2.571
LNLifeSpan:PredF4 -1.160 1.537 -0.253 -0.755 0.456 -4.286 1.967
LNLifeSpan:PredF5 3.357 1.868 0.887 1.797 0.081 -0.443 7.157
------------------------------------------------------------------------------------------------
Examine predictors in preliminary model to confirm they are not too highly correlated with each other.
If correlation for two variables, \(R_{XY} \geq 0.8\), then one variable should be excluded.
Variables in preliminary model: LNBodyWt, LNLifeSpan, Gestation, PredF, DangrF, LNLifeSpan*PredF
Recall that PredF (Predation) and DangrF (Danger) are highly correlated.
PredF is included in an interaction term so exclude DangrF.
If model was modified in Step 4, rerun model through Backward Elimination (not always needed).
Interpret final model.
# specify final model
(animals_final <- ols_regress(TotalSleep ~ LNBodyWt + LNLifeSpan + Gestation +
                                PredF + LNLifeSpan*PredF, data = animals))
animals_model <- animals_final$model # save coefficients
Model Summary
---------------------------------------------------------------
R 0.857 RMSE 2.329
R-Squared 0.734 MSE 7.181
Adj. R-Squared 0.655 Coef. Var 25.223
Pred R-Squared 0.547 AIC 247.894
MAE 1.857 SBC 272.488
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
------------------------------------------------------------------
Regression 734.163 11 66.742 9.294 0.0000
Residual 265.708 37 7.181
Total 999.871 48
------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------
(Intercept) 6.751 3.305 2.043 0.048 0.054 13.448
LNBodyWt -0.698 0.244 -0.442 -2.859 0.007 -1.192 -0.203
LNLifeSpan 2.855 1.133 0.591 2.519 0.016 0.559 5.151
Gestation -0.020 0.006 -0.447 -3.285 0.002 -0.032 -0.008
PredF2 13.998 4.132 0.041 3.388 0.002 5.626 22.369
PredF3 11.883 5.514 -0.494 2.155 0.038 0.711 23.056
PredF4 2.654 4.102 0.021 0.647 0.522 -5.658 10.966
PredF5 -0.782 4.262 -0.316 -0.183 0.855 -9.418 7.855
LNLifeSpan:PredF2 -5.367 1.478 -0.471 -3.632 0.001 -8.361 -2.373
LNLifeSpan:PredF3 -7.390 3.141 -0.588 -2.352 0.024 -13.755 -1.025
LNLifeSpan:PredF4 -0.941 1.356 -0.083 -0.694 0.492 -3.689 1.807
LNLifeSpan:PredF5 -1.043 1.446 -0.091 -0.721 0.475 -3.973 1.887
-----------------------------------------------------------------------------------------------
Exporting Model and Data to Excel
This model can be used to find model estimates and residuals for all animals.
We will ALSO do these calculations in an Excel Spreadsheet to clarify each model component in the estimate.
Below we export the data for three species to examine how the model works.
animals_model_data <- animals |> # create new dataset with model variables only
  select(Species, TotalSleep, LNBodyWt, LNLifeSpan, Gestation, PredF)
three_species <- animals_model_data |> # create mini dataset with three species
  filter(Species %in% c("Baboon", "Donkey", "ArcticFox")) |>
  write_csv("ThreeSpecies.csv")
Species | TotalSleep | LNBodyWt | LNLifeSpan | Gestation | PredF |
---|---|---|---|---|---|
ArcticFox | 12.5 | 1.22 | 2.64 | 60 | 1 |
Baboon | 9.8 | 2.36 | 3.30 | 180 | 4 |
Donkey | 3.1 | 5.23 | 3.69 | 365 | 5 |
Model coefficients for calculations can be extracted and exported to Excel.
Below, we create a two-column dataset listing each model component and its beta coefficient.
That dataset is exported as a .csv file for an in-class exercise.
model_term | beta |
---|---|
(Intercept) | 6.7512 |
LNBodyWt | -0.6976 |
LNLifeSpan | 2.8550 |
Gestation | -0.0198 |
PredF2 | 13.9979 |
PredF3 | 11.8834 |
PredF4 | 2.6536 |
PredF5 | -0.7817 |
LNLifeSpan:PredF2 | -5.3668 |
LNLifeSpan:PredF3 | -7.3900 |
LNLifeSpan:PredF4 | -0.9409 |
LNLifeSpan:PredF5 | -1.0427 |
Use the provided .csv file worksheet to answer these questions:
Question 2. What is the regression estimate of total sleep for ‘Donkey’?
Question 3. What is the regression estimate of total sleep for ‘Arctic Fox’ (ArcticFox)?
Complete the worksheet for ‘Baboon’ at home.
At least one question on Quiz 2 will include an Excel Worksheet like this where you have to correctly do the calculation using the model and x values from the data.
You can use R, but code to add estimates to dataset will not be provided.
This exercise is about understanding the model estimation process.
Model estimates can be calculated in R.
Excel Worksheet is used to demonstrate how those estimates are calculated.
You will calculate an estimate using a complex model on Quiz 2.
Species | TotalSleep | Est_TotalSleep | Resid | LNBodyWt | LNLifeSpan | Gestation | PredF |
---|---|---|---|---|---|---|---|
Africangiantpouchedrat | 8.3 | 11.00 | -2.70 | 0.00 | 1.50 | 42 | 3 |
Americanopossum | 19.4 | 16.10 | 3.30 | 0.53 | 1.61 | 12 | 2 |
ArcticFox | 12.5 | 12.25 | 0.25 | 1.22 | 2.64 | 60 | 1 |
Baboon | 9.8 | 10.51 | -0.71 | 2.36 | 3.30 | 180 | 4 |
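The ArcticFox and Baboon rows of the table above can be reproduced by hand from the exported beta coefficients, which mirrors what the Excel worksheet does. The sketch below is illustrative only: the helper `estimate_sleep()` is a made-up name, and the coefficients are the four-decimal values from the exported table. Predation level 1 is the baseline, so it adds no PredF term and no interaction term.

```r
# Beta coefficients as exported (rounded to four decimals)
beta <- c("(Intercept)" = 6.7512, "LNBodyWt" = -0.6976,
          "LNLifeSpan" = 2.8550, "Gestation" = -0.0198,
          "PredF2" = 13.9979, "PredF3" = 11.8834,
          "PredF4" = 2.6536, "PredF5" = -0.7817,
          "LNLifeSpan:PredF2" = -5.3668, "LNLifeSpan:PredF3" = -7.3900,
          "LNLifeSpan:PredF4" = -0.9409, "LNLifeSpan:PredF5" = -1.0427)

# Hypothetical helper: regression estimate of TotalSleep for one species
estimate_sleep <- function(ln_body_wt, ln_life_span, gestation, pred_level) {
  est <- beta[["(Intercept)"]] +
    beta[["LNBodyWt"]] * ln_body_wt +
    beta[["LNLifeSpan"]] * ln_life_span +
    beta[["Gestation"]] * gestation
  if (pred_level > 1) {
    # non-baseline category: shift the intercept AND the LNLifeSpan slope
    est <- est + beta[[paste0("PredF", pred_level)]] +
      beta[[paste0("LNLifeSpan:PredF", pred_level)]] * ln_life_span
  }
  est
}

est_arctic_fox <- estimate_sleep(1.22, 2.64, 60, pred_level = 1)
est_baboon     <- estimate_sleep(2.36, 3.30, 180, pred_level = 4)

round(c(est_arctic_fox, est_baboon), 2)                # 12.25 10.51
round(c(12.5 - est_arctic_fox, 9.8 - est_baboon), 2)   # residuals: 0.25 -0.71
```

The residual is the observed TotalSleep minus the estimate, so a negative residual means the model estimated more sleep than was observed.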
How good is our model?
There are many ways to examine model fit.
Here are two straightforward ways:
Students are provided with an R project to complete HW 7.
Read instructions which correspond to Blackboard HW Assignment 7.
Run Code Block 1 (setup) and Code Block 2 (import and examine ames dataset).
Code Block 3 (examine correlation matrix of X variables) is incomplete.
Remove # symbols before incomplete R code.
Replace blanks (____) with correct commands to calculate correlation matrix with values rounded to 2 decimal places.
Run line or whole code block to view the correlation matrix, which is large.
In same Code Block, remove # from the two lines of code at the bottom.
Answer Questions 1 - 2 based on the correlation matrix and min/max output.
Run Code Block 4 (specify full model and do backward elim) to:
Create the full model with all variables and no interactions.
Run the Backward Elimination.
Answer Questions 3 - 6 based on the Backward Elimination model output.
Run Code Block 5 (save final model) to save the final model as final_ames_model.
Complete the code in Code Block 6 (import new data and add predictions) and run code to add model estimates and residuals to new small dataset of two new houses.
It is helpful to run the lines in this code block one at a time.
Run the first command that begins new_houses <- read_csv(... to import a new small dataset with 2 observations.
Run the command that begins (new_houses <- new_houses |> mutate(Est_Price... to add Est_Price, the regression estimates, to this dataset.
Remove # before the following three lines to complete them:
#(new_houses <- new_houses |>
# mutate(Resid = ____ - ____ |> round()) |>
# relocate(Est_Price, Resid, .after=Price))
In the line with the blanks you are calculating residuals as Price minus Estimated Price (Resid = Price - Est_Price).
The next line relocates Est_Price and Resid to the left side of the dataset, after Price.
Answer Questions 7 - 11 based on this output.
Recall that in Multiple Linear Regression (MLR) the goal is to choose the simplest most accurate model, i.e. the ‘BEST’ set of independent variables
How do we decide which variables should be in our model?
There are many methods:
We’ve discussed Backward Elimination which can also be done manually in any software (not recommended).
Backward Elimination starts with all potential terms (including potential interaction terms) in the model and removes the least significant term at each step.
Forward Selection: By default, this procedure starts with an empty model and adds the most significant term at each step until there are no more useful terms to add.
Stepwise Selection: By default, this procedure starts with an empty model and then adds or removes a term at each step.
Common Practice: Try multiple methods to develop preliminary final model and then tweak as needed.
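To make the three procedures concrete, here is a minimal sketch using base R's `step()` function. Two assumptions to note: `step()` adds or drops terms based on AIC rather than P-values, so it is a stand-in for, not a copy of, the procedures behind the lecture output; and the built-in `mtcars` data with an arbitrary set of four predictors is used in place of the course datasets.

```r
# Built-in mtcars data stands in for the course datasets (assumption)
full_model  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
null_model  <- lm(mpg ~ 1, data = mtcars)      # empty (intercept-only) model
upper_scope <- ~ wt + hp + qsec + drat         # largest model allowed

# Backward Elimination: start full, drop a term while AIC improves
backward_fit <- step(full_model, direction = "backward", trace = 0)

# Forward Selection: start empty, add a term while AIC improves
forward_fit <- step(null_model,
                    scope = list(lower = ~ 1, upper = upper_scope),
                    direction = "forward", trace = 0)

# Stepwise Selection: start empty, add OR drop a term at each step
stepwise_fit <- step(null_model,
                     scope = list(lower = ~ 1, upper = upper_scope),
                     direction = "both", trace = 0)

# Compare the preliminary model from each method
fits <- list(backward = backward_fit, forward = forward_fit,
             stepwise = stepwise_fit)
data.frame(
  method = names(fits),
  terms  = sapply(fits, function(fit)
    paste(attr(terms(fit), "term.labels"), collapse = " + ")),
  adj_r2 = sapply(fits, function(fit) summary(fit)$adj.r.squared),
  aic    = sapply(fits, AIC),
  row.names = NULL
)
```

The methods may or may not land on the same set of terms, which is why the results are treated as preliminary and compared before choosing a final model.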
The steps for other methods are similar to the steps for Backward Elimination.
Not all steps are ALWAYS required. It depends on how complex the data are.
In the following example, we only need to do part of Step 1 plus Steps 2, 3, and 6.
For Step 1, we only need to examine correlations.
In this case, Step 7 will be apparent.
We can add model estimates to data for future interpretation (Step 8)
Create a ‘saturated’ model with all potential predictor variables and interaction terms (Subjective!).
Use Backward Elimination, Forward Selection, and Stepwise Selection to find preliminary candidate models. (These are automated procedures!)
Rerun model selection methods, if a candidate model is substantially changed (not always needed).
Compare model fit statistics from final candidate model from all three methods.
Decide on final candidate and make final modifications, if needed.
Interpret final model.
Can we determine what factors affect wine quality even if we KNOW NOTHING about wine cultivation and chemistry?
Maybe!
Since we have no prior knowledge, we start with a straightforward full model with all available predictors and no interactions.
Notice that all variables are numeric (<dbl> stands for decimal value).
Rows: 605
Columns: 11
$ Wine_Quality <dbl> 5, 6, 7, 5, 7, 5, 5, 5, 5, 5…
$ Fixed_Acidity <dbl> 9.3, 9.1, 7.9, 7.2, 11.9, 7.…
$ Volatile_Acidity <dbl> 0.48, 0.22, 0.34, 1.00, 0.43…
$ Citric_Acidity <dbl> 0.29, 0.24, 0.36, 0.00, 0.66…
$ Residual_Sugar <dbl> 2.1, 2.1, 1.9, 3.0, 3.1, 2.2…
$ Chlorides <dbl> 0.127, 0.078, 0.065, 0.102, …
$ Free_Sulphur_Dioxide <dbl> 6, 1, 5, 7, 10, 5, 5, 48, 27…
$ Total_Sulphur_Dioxide <dbl> 16, 28, 10, 16, 23, 36, 21, …
$ Ph <dbl> 3.22, 3.41, 3.27, 3.43, 3.15…
$ Sulfate <dbl> 0.72, 0.87, 0.54, 0.46, 0.85…
$ Alcohol <dbl> 11.2, 10.3, 11.2, 10.0, 10.4…
Wine_Quality Fixed_Acidity Volatile_Acidity
Wine_Quality 1.00 0.11 -0.39
Fixed_Acidity 0.11 1.00 -0.23
Volatile_Acidity -0.39 -0.23 1.00
Citric_Acidity 0.22 0.68 -0.52
Residual_Sugar 0.04 0.20 -0.01
Chlorides -0.10 0.12 0.04
Free_Sulphur_Dioxide 0.01 -0.18 -0.05
Total_Sulphur_Dioxide -0.08 -0.13 0.05
Ph -0.06 -0.70 0.19
Sulfate 0.21 0.19 -0.24
Alcohol 0.45 -0.08 -0.17
Citric_Acidity Residual_Sugar Chlorides
Wine_Quality 0.22 0.04 -0.10
Fixed_Acidity 0.68 0.20 0.12
Volatile_Acidity -0.52 -0.01 0.04
Citric_Acidity 1.00 0.16 0.21
Residual_Sugar 0.16 1.00 0.05
Chlorides 0.21 0.05 1.00
Free_Sulphur_Dioxide -0.07 0.18 -0.04
Total_Sulphur_Dioxide 0.06 0.18 0.00
Ph -0.55 -0.14 -0.26
Sulfate 0.27 -0.01 0.35
Alcohol 0.10 0.07 -0.21
Free_Sulphur_Dioxide Total_Sulphur_Dioxide Ph Sulfate
Wine_Quality 0.01 -0.08 -0.06 0.21
Fixed_Acidity -0.18 -0.13 -0.70 0.19
Volatile_Acidity -0.05 0.05 0.19 -0.24
Citric_Acidity -0.07 0.06 -0.55 0.27
Residual_Sugar 0.18 0.18 -0.14 -0.01
Chlorides -0.04 0.00 -0.26 0.35
Free_Sulphur_Dioxide 1.00 0.65 0.08 0.00
Total_Sulphur_Dioxide 0.65 1.00 -0.07 0.08
Ph 0.08 -0.07 1.00 -0.24
Sulfate 0.00 0.08 -0.24 1.00
Alcohol -0.03 -0.08 0.21 0.05
Alcohol
Wine_Quality 0.45
Fixed_Acidity -0.08
Volatile_Acidity -0.17
Citric_Acidity 0.10
Residual_Sugar 0.07
Chlorides -0.21
Free_Sulphur_Dioxide -0.03
Total_Sulphur_Dioxide -0.08
Ph 0.21
Sulfate 0.05
Alcohol 1.00
[1] 0.68
[1] -0.7
We specify a full model using an easy shortcut:
If all variables are included, you can use . instead of listing them all.
This model specification is also used in HW 7.
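The `.` shortcut can be checked on a small example. The sketch below uses a few columns of the built-in `mtcars` data as a stand-in for the wine data (an assumption made only for illustration) and confirms that both specifications fit the same model.

```r
# Keep the response (mpg) and three predictors only
cars_small <- mtcars[, c("mpg", "wt", "hp", "qsec")]

# "." means "every other column in the data"
fit_dot    <- lm(mpg ~ ., data = cars_small)

# Same model with the predictors listed explicitly
fit_listed <- lm(mpg ~ wt + hp + qsec, data = cars_small)

# TRUE if the two sets of coefficients match
all.equal(coef(fit_dot), coef(fit_listed))
```

Because `.` picks up every remaining column, any variable that should not be a predictor (an ID or name column, for example) has to be dropped from the data first.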
Then we do three model selection procedures: Backward Elimination, Forward Selection, and Stepwise Selection.
Look at the LAST step for each method to determine which method results in the best fit.
Comparison Measures:
Adj. \(R^2\): Higher value indicates better model fit
C(p): Lower value indicates better model fit (Also referred to as Mallows's C(p)).
AIC: Lower value indicates better model fit (Akaike Information Criterion).
RMSE: Lower value indicates better model fit (Root Mean Square Error).
By comparing these measures and accounting for our understanding of these procedures, we can determine that TWO of these methods arrived at the same model.
Session ID: bua345s23
Which two model selection methods arrived at the same model for the wine data?
Backward Elimination and Forward Selection
Backward Elimination and Stepwise Selection
Forward Selection and Stepwise Selection
Regression modeling can be overwhelming
Automating part of the variable selection process is helpful.
Try different methods and compare results.
Results from automated processes are preliminary.
Model estimates and residuals can be added to dataset.
HW 6 due on Wed. 3/6 (Grace Period extended until 3/8).
HW 7 is posted and is due on Wed. 3/20
To submit an Engagement Question or Comment about material from Today’s Lecture: Submit by midnight today (day of lecture). Click on Link next to the ❓ under today’s lecture.