Multiple Linear Regression Variable Selection Methods
Author
Spencer Hamilton
Data and Description
For this assignment, we are revisiting the data set used in Homework 4. I think it would be very beneficial for you to review your Homework 4 before starting this one.
Measuring body fat is not simple. One method requires submerging the body underwater in a tank and measuring the increase in water level. A simpler method for estimating body fat would be preferred. In order to develop such a method, researchers recorded age (years), weight (pounds), height (inches), and three body circumference measurements (around the neck, chest, and abdomen, all in centimeters) for 252 men. Each man's percentage of body fat was accurately estimated by an underwater weighing technique (the variable brozek is the percentage of body fat). The hope is to be able to use this data to create a model that will accurately predict body fat percentage using just the basic variables recorded, without having to use the tank submerging method.
The data can be found in the BodyFat data set on Canvas. Download BodyFat.txt, and put it in the same folder as this R Markdown file.
0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.
0b. Make sure to set your seed since some of the functions randomly split your data (use set.seed in the setup code chunk above)!
1. Read in the data set, and call the data frame “bodyfat_orig”. Print a summary of the data and make sure the data makes sense. Remove the “row” column (which contains row numbers) from the data set. Make sure the class of “bodyfat_orig” is a data.frame only.
# your code here
bodyfat_orig <- read.table('BodyFat.txt', header = TRUE)

# remove the row-number column
bodyfat_orig <- subset(bodyfat_orig, select = -row)
summary(bodyfat_orig)
brozek age weight height
Min. : 0.00 Min. :22.00 Min. :118.5 Min. :64.00
1st Qu.:12.80 1st Qu.:35.50 1st Qu.:158.8 1st Qu.:68.38
Median :19.00 Median :43.00 Median :176.2 Median :70.00
Mean :18.89 Mean :44.86 Mean :178.8 Mean :70.32
3rd Qu.:24.55 3rd Qu.:54.00 3rd Qu.:196.9 3rd Qu.:72.25
Max. :45.10 Max. :81.00 Max. :363.1 Max. :77.75
neck chest abdom
Min. :31.10 Min. : 79.3 Min. : 69.40
1st Qu.:36.40 1st Qu.: 94.3 1st Qu.: 84.55
Median :38.00 Median : 99.6 Median : 90.90
Mean :37.98 Mean :100.8 Mean : 92.49
3rd Qu.:39.40 3rd Qu.:105.3 3rd Qu.: 99.20
Max. :51.20 Max. :136.2 Max. :148.10
2. Refer back to your Homework 4. In that assignment, you fit a multiple linear regression model to the body fat measurements using all the predictor variables provided in the data set. For each of the multiple linear regression assumptions listed below, state if they were met or not met.
The X’s vs Y are linear: I believe this assumption is met: the residual vs. predictor plots look random, and the line in the residuals vs. fitted values plot is flat. A formal check would make this more convincing.
The residuals are normally distributed: No, see histogram for visual proof.
The residuals are homoscedastic: No, see the scale-location plot.
There are no influential points: No, there is an influential point (39).
No multicollinearity: No; weight, chest, and abdom have variance inflation factors above the recommended cutoff.
3. There is one clear influential point in the data set. Create a new variable called “bodyfat” that contains the bodyfat_orig data set with the influential point removed. Use the bodyfat data set (not the bodyfat_orig data set) throughout the rest of the assignment.
# helper functions for graphing Cook's distance contours
cd_cont_pos <- function(leverage, level, model) {
  sqrt(level * length(coef(model)) * (1 - leverage) / leverage)
}
cd_cont_neg <- function(leverage, level, model) {
  -cd_cont_pos(leverage, level, model)
}

# linear model and Cook's distances
bodyfat_orig_lm <- lm(brozek ~ ., data = bodyfat_orig)
cooksd <- cooks.distance(bodyfat_orig_lm)

# residuals vs leverage plot with Cook's distance contours
cd_threshold <- 0.5
autoplot(bodyfat_orig_lm, which = 5) +
  stat_function(fun = cd_cont_pos,
                args = list(level = cd_threshold, model = bodyfat_orig_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  stat_function(fun = cd_cont_neg,
                args = list(level = cd_threshold, model = bodyfat_orig_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  scale_y_continuous(limits = c(-4, 4))
# find the influential observation indices
influential_obs_indices <- which(cooksd > cd_threshold)

# remove influential observations from the data set
bodyfat <- bodyfat_orig[-influential_obs_indices, ]

# print the influential observations that were removed
influential_observations_removed <- bodyfat_orig[influential_obs_indices, ]
print(influential_observations_removed)
You should have discovered, from Homework 4, that there is a multicollinearity problem. The goal of this assignment is to continue this analysis by identifying variables to potentially remove from the model to resolve the multicollinearity issues.
4. Briefly explain why multicollinearity is a problem for multiple linear regression.
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated with each other. This is a problem for multiple linear regression because it inflates the standard errors of the coefficient estimates, which reduces their precision.
Multicollinearity can also make the estimated coefficients sensitive to small changes in the data, leading to unstable and unreliable coefficient estimates and a model that is difficult to work with.
It also complicates the process of variable selection, because it becomes challenging to distinguish the unique contribution of each variable added to the model.
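To see the standard error inflation concretely, here is a small simulated sketch (hypothetical data, not the body fat data): the same slope on x1 is estimated much less precisely when a near-copy of x1 is also in the model.

# simulated illustration of multicollinearity inflating standard errors
# (hypothetical variables; not the body fat data)
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2_indep <- rnorm(n)                 # unrelated to x1
x2_coll  <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1
y1 <- 2 * x1 + x2_indep + rnorm(n)
y2 <- 2 * x1 + x2_coll  + rnorm(n)

# standard error of the x1 coefficient in each model
summary(lm(y1 ~ x1 + x2_indep))$coefficients["x1", "Std. Error"]
summary(lm(y2 ~ x1 + x2_coll))$coefficients["x1", "Std. Error"]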
5. Briefly explain the similarities and differences between the following variable selection methods: best subset, forward, backward, and sequential replacement. Do not just copy the algorithms from the class notes - use your own words to explain what these methods are doing.
Best Subset Selection fits every possible combination of predictors and selects the model with the best performance according to some criterion, usually adjusted R-squared, AIC, or BIC. It is a brute-force search (see the cost calculation below).
Forward Selection starts with an empty model and adds one predictor variable at a time, based on the criterion, until adding more no longer improves the model. It is less computationally expensive, but it does not guarantee finding the best subset of variables.
Backward Elimination starts with the full model and progressively removes variables until no further improvement is seen. It may not arrive at the same model as Forward Selection because it follows a different path and can get stuck in a local optimum, unlike the exhaustive search of Best Subset Selection.
Sequential Replacement is a combination of forward and backward: at each step it either adds or removes a predictor variable, until the best model according to the criterion is reached. Its computational workload is in the middle of the road.
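To make the computational cost concrete for this data set, a quick back-of-the-envelope calculation (p = 6 candidate predictors):

p <- 6           # number of candidate predictors in the body fat data
2^p              # models best subset must consider: 64
p * (p + 1) / 2  # upper bound on models forward (or backward) selection fits: 21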
6. Briefly explain how shrinkage methods work (bias-variance tradeoff).
Shrinkage methods, such as ridge regression and LASSO, work by adding a penalty term to the least squares criterion. This penalty term shrinks large coefficients in our standard linear model toward zero.
Bias: Shrinkage methods add bias by penalizing the magnitude of the coefficients. In exchange, the variance of the model drops, which can lead to better performance, especially with high-dimensional data sets or multicollinearity.
Variance: By shrinking the coefficients, LASSO and ridge reduce the variance of the model. Shrinkage methods tend to produce more stable and generalizable models, because they help prevent overfitting.
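In symbols, a generic sketch of the penalized least squares objective, where P(β) is the penalty and λ ≥ 0 controls its strength (λ = 0 recovers ordinary least squares):

$$\hat{\beta} = \underset{\beta}{\arg\min}\; \sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 + \lambda \, P(\beta)$$

Ridge uses $P(\beta) = \sum_j \beta_j^2$ and LASSO uses $P(\beta) = \sum_j |\beta_j|$; larger $\lambda$ means more bias but less variance.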
7. Briefly explain the similarities/differences between ridge regression and LASSO.
They both add a penalty, but they do it differently. Ridge regression adds a penalty term proportional to the sum of the squared coefficients (the L2 penalty), which shrinks coefficients toward zero but never exactly to zero. In contrast, LASSO (Least Absolute Shrinkage and Selection Operator) penalizes the sum of the absolute coefficients (the L1 penalty), which can set some coefficients exactly to zero. This property makes LASSO well suited for feature selection, as it keeps only the most important predictors.
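A minimal sketch of the difference using the built-in mtcars data, so it runs standalone (the variables and the penalty value s = 1 are arbitrary choices for illustration, and glmnet is assumed installed): at the same lambda, ridge keeps every coefficient nonzero while LASSO typically zeros some out.

library(glmnet)

x <- as.matrix(mtcars[, c("cyl", "disp", "hp", "wt")])
y <- mtcars$mpg

ridge_fit <- glmnet(x, y, alpha = 0)  # L2 penalty
lasso_fit <- glmnet(x, y, alpha = 1)  # L1 penalty

# compare the two sets of coefficients at the same penalty strength
data.frame(term  = rownames(coef(ridge_fit, s = 1)),
           ridge = round(as.vector(coef(ridge_fit, s = 1)), 3),
           lasso = round(as.vector(coef(lasso_fit, s = 1)), 3))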
8. When using the bestglm function in R for the stepwise methods, the response variable must be the last column in the data set for the bestglm function to work. Switch the order of the columns in the data set so that brozek is last.
# your code here
# move the brozek column to the last position so bestglm will work
bodyfat <- bodyfat[, c(setdiff(names(bodyfat), "brozek"), "brozek")]
summary(bodyfat)
age weight height neck
Min. :22.00 Min. :118.5 Min. :64.00 Min. :31.10
1st Qu.:35.25 1st Qu.:158.5 1st Qu.:68.31 1st Qu.:36.40
Median :43.00 Median :176.1 Median :70.00 Median :37.95
Mean :44.86 Mean :178.1 Mean :70.31 Mean :37.93
3rd Qu.:54.00 3rd Qu.:196.8 3rd Qu.:72.25 3rd Qu.:39.40
Max. :81.00 Max. :262.8 Max. :77.75 Max. :43.90
chest abdom brozek
Min. : 79.30 Min. : 69.40 Min. : 0.00
1st Qu.: 94.25 1st Qu.: 84.53 1st Qu.:12.80
Median : 99.60 Median : 90.90 Median :19.00
Mean :100.63 Mean : 92.27 Mean :18.83
3rd Qu.:105.30 3rd Qu.: 99.17 3rd Qu.:24.50
Max. :128.30 Max. :126.20 Max. :45.10
9. Apply the best subsets variable selection procedure to this data set using the bestglm function. Try it using AIC and BIC. Output a summary of the “best” model for each metric.
best_subsets_bic <- bestglm(Xy = bodyfat, family = gaussian, IC = "BIC", method = "exhaustive")
best_subsets_bic$BestModel
Call:
lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
drop = FALSE], y = y))
Coefficients:
(Intercept) weight abdom
-42.8704 -0.1221 0.9043
best_subsets_aic <- bestglm(Xy = bodyfat, family = gaussian, IC = "AIC", method = "exhaustive")
best_subsets_aic$BestModel
Call:
lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
drop = FALSE], y = y))
Coefficients:
(Intercept) height neck chest abdom
6.8946 -0.4467 -0.4848 -0.1438 0.8259
10. Apply the forward selection procedure to this data set using the step() function in R. Try it using AIC and BIC (remember: in order to do BIC with the step() function you need to change the default value of k to be log(n) where n is the number of rows in the dataset!). Output a summary of the “best” models in each case.
# your code here
# Use the code we used in class
base_mod <- lm(brozek ~ 1, data = bodyfat)  # intercept-only model (null model, or base model)
full_mod <- lm(brozek ~ ., data = bodyfat)  # all predictors in model (besides response)

# Forward stepwise, AIC ----------------------------------------------
step_forw_aic <- step(base_mod,  # starting model
                      direction = "forward",
                      k = 2,  # AIC (the multiplier of p in the penalty term)
                      scope = list(lower = base_mod, upper = full_mod))
## You can also run forward selection using bestglm, but
## you can't see the path this way, to my knowledge.
forw_AIC_2 <- bestglm(bodyfat, IC = "AIC", method = "forward")

# now BIC
cat("\n\n\n\n Now for BIC")
Now for BIC
base_mod <- lm(brozek ~ 1, data = bodyfat)  # intercept-only model (null model, or base model)
full_mod <- lm(brozek ~ ., data = bodyfat)  # all predictors in model (besides response)

# Forward stepwise, BIC ----------------------------------------------
step_forw_bic <- step(base_mod,  # starting model
                      direction = "forward",
                      k = log(nrow(bodyfat)),  # BIC (the multiplier of p in the penalty term)
                      scope = list(lower = base_mod, upper = full_mod))
## Forward selection using bestglm, BIC
forw_BIC_2 <- bestglm(bodyfat, IC = "BIC", method = "forward")
11. Apply the backward selection procedure to this data set using the step() function in R. Try it using AIC and BIC (remember: in order to do BIC with the step() function you need to change the default value of k to be log(n) where n is the number of rows in the dataset!). Output a summary of the “best” models in each case.
# your code here
# Use the code we used in class
base_mod <- lm(brozek ~ 1, data = bodyfat)  # intercept-only model (null model, or base model)
full_mod <- lm(brozek ~ ., data = bodyfat)  # all predictors in model (besides response)

# Backward stepwise, BIC ---------------------------------------------
step_back_bic <- step(full_mod,  # starting model
                      direction = "backward",
                      k = log(nrow(bodyfat)),  # BIC (the multiplier of p in the penalty term)
                      scope = list(lower = base_mod, upper = full_mod))
## Backward selection using bestglm, BIC
back_bic_2 <- bestglm(bodyfat, IC = "BIC", method = "backward")

# now for AIC
cat("\n\n\n\n now with AIC backwards\n\n")
now with AIC backwards
# Backward stepwise, AIC ---------------------------------------------
step_back_aic <- step(full_mod,  # starting model
                      direction = "backward",
                      k = 2,  # AIC (the multiplier of p in the penalty term)
                      scope = list(lower = base_mod, upper = full_mod))
## Backward selection using bestglm, AIC
back_aic_2 <- bestglm(bodyfat, IC = "AIC", method = "backward")
12. Apply the sequential replacement selection procedure to this data set using the step() function. You may choose which metric you would like to use (either AIC or BIC). Try initializing it from the full model and the intercept only model. Output a summary of the single “best” model for the metric you chose.
# your code here
both <- bestglm(bodyfat, method = "seqrep")  # bestglm defaults to IC = "BIC"
summary(both$BestModel)
Call:
lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
drop = FALSE], y = y))
Residuals:
Min 1Q Median 3Q Max
-10.0656 -2.9685 -0.1073 3.0258 9.9220
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -42.87040 2.45735 -17.446 < 2e-16 ***
weight -0.12206 0.01966 -6.207 2.26e-09 ***
abdom 0.90426 0.05221 17.321 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.074 on 247 degrees of freedom
Multiple R-squared: 0.7212, Adjusted R-squared: 0.7189
F-statistic: 319.4 on 2 and 247 DF, p-value: < 2.2e-16
13. Apply LASSO to this data set using the MSE metric. Output the coefficient values corresponding to the 1 standard error rule (do not output any plots).
# your code here
bodyfat_x <- as.matrix(bodyfat[, 1:6])  # predictors
bodyfat_y <- bodyfat[, 7]               # response (brozek)

# use cross-validation to pick the "best" (based on MSE) lambda
set.seed(50)
bodyfat_lasso_cv <- cv.glmnet(x = bodyfat_x,  # automatically includes a column of ones for the intercept
                              y = bodyfat_y,
                              type.measure = "mse",
                              alpha = 1)  # 1 is code for "LASSO"

# plot (log) lambda vs MSE
p1 <- autoplot(bodyfat_lasso_cv, label = FALSE) +
  theme_bw() +
  theme(aspect.ratio = 1)

# lambda.min: value of lambda that gives minimum mean cross-validated error
bodyfat_lasso_cv$lambda.min
[1] 0.6312929
# lambda.1se: value of lambda within 1 standard error of the minimum
# cross-validated error
bodyfat_lasso_cv$lambda.1se
[1] 1.210763
coef(bodyfat_lasso_cv, s = "lambda.min")
7 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 4.81285338
age 0.04819341
weight 0.03464651
height -0.48395234
neck -0.38752860
chest 0.05516167
abdom 0.52953525
coef(bodyfat_lasso_cv, s = "lambda.1se")
7 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 1.89654409
age 0.05482955
weight 0.04079229
height -0.47219783
neck -0.27269081
chest 0.10999933
abdom 0.43008565
14. Apply Elastic Net to this data set using the MSE metric. Output the coefficient values corresponding to the 1 standard error rule (do not output any plots).
# your code here
set.seed(50)
bodyfat_elastic_cv <- cv.glmnet(x = bodyfat_x,  # automatically includes a column of ones for the intercept FYI
                                y = bodyfat_y,
                                type.measure = "mse",
                                alpha = 0.5)  # 0.5 fits the elastic net: half LASSO, half ridge

# plot (log) lambda vs MSE
p3 <- autoplot(bodyfat_elastic_cv, label = FALSE) +
  theme_bw() +
  theme(aspect.ratio = 1)

# lambda.min: value of lambda that gives minimum mean cross-validated error
bodyfat_elastic_cv$lambda.min
[1] 0.03276441
# lambda.1se: value of lambda within 1 standard error of the minimum
# cross-validated error
bodyfat_elastic_cv$lambda.1se
[1] 0.5860404
coef(bodyfat_elastic_cv, s = "lambda.min")
7 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -0.30691450
age 0.02122208
weight -0.01068811
height -0.37934042
neck -0.45669044
chest -0.09411263
abdom 0.79713819
coef(bodyfat_elastic_cv, s = "lambda.1se")
7 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -10.58193343
age 0.02146324
weight .
height -0.35221851
neck -0.01698972
chest .
abdom 0.58369704
15. Fill in the table below with “X”s (like the one at the end of the Module 5 course notes: a row for each variable, a column for each variable selection method, an “X” in a cell means the variable was included for that variable selection method). For the best subset, forward, backward, and sequential replacement columns, use either AIC or BIC. In other words, only report the models for either AIC or BIC, not both, but make sure you’re consistent in the table.
I will use BIC
Variable | Best Subset | Forward | Backward | Sequential Replacement | LASSO | Elastic Net
---------|-------------|---------|----------|------------------------|-------|------------
age      |             |         |          |                        | x     | x
weight   | x           | x       |          | x                      | x     |
height   |             |         | x        |                        | x     | x
neck     |             |         | x        |                        | x     | x
chest    |             |         |          |                        | x     |
abdom    | x           | x       | x        | x                      | x     | x
16. Now that you have seen the various results from the different methods, pick a subset of variables that you will include in the model. Which variables do you choose to include in the model? Why?
I choose to include weight and abdom. These are the predictor variables that three of the selection methods agree on, so I will go with the crowd.
17. Create the multiple linear regression model with the variables you listed in the previous question (alternatively, you can call the best model using $BestModel). Print a summary of the results. Save the residuals from this model to the bodyfat dataframe.
# your code here
best_subsets_bic$BestModel

# fit the chosen model and save its residuals to the bodyfat data frame
final_lm <- lm(brozek ~ weight + abdom, data = bodyfat)
bodyfat$residuals <- final_lm$residuals
Call:
lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
drop = FALSE], y = y))
Coefficients:
(Intercept) weight abdom
-42.8704 -0.1221 0.9043
Now that you have chosen a model, the next several questions ask you to check some of the model assumptions. For each assumption, (1) perform appropriate diagnostics to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that. Note: you can copy (then modify) a lot of your code from Homework 4 to answer these questions.
18. The Xs vs Y are linear (use the residual by predictor plots and the partial regression plots)
# your code here
# residual vs predictor plots
plot(bodyfat$weight, bodyfat$residuals,
     xlab = "Weight", ylab = "Residuals",
     main = "Residuals vs. Weight")
plot(bodyfat$abdom, bodyfat$residuals,
     xlab = "Abdom", ylab = "Residuals",
     main = "Residuals vs. Abdom")
# the same residual vs predictor plots, drawn with ggplot
resid_vs_weight <- ggplot(data = bodyfat) +
  geom_point(mapping = aes(x = weight, y = residuals)) +
  theme(aspect.ratio = 1)
resid_vs_abdom <- ggplot(data = bodyfat) +
  geom_point(mapping = aes(x = abdom, y = residuals)) +
  theme(aspect.ratio = 1)
resid_vs_weight
resid_vs_abdom
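The plots above are residual vs. predictor plots; the partial regression (added-variable) plots the question asks for can be drawn with the car package. A minimal sketch, assuming car is installed (as in Homework 4):

# partial regression (added-variable) plots for the final model
library(car)
avPlots(final_lm)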
# residuals vs fitted values for the final model (fit in question 17)
autoplot(final_lm, which = 1, ncol = 1, nrow = 1)
Yes, I believe the linearity assumption is met. The residual vs. predictor plots look random, and the line in the residuals vs. fitted values plot is nearly flat except at the far end, just like before.
19. The errors are normally distributed (use a histogram, qq plot, and shapiro wilk test)
# your code here
# histogram of errors (from the in-class analysis #2)
ggplot(data = bodyfat) +
  geom_histogram(aes(x = residuals, y = after_stat(density)), binwidth = 1.2) +
  stat_function(fun = dnorm, color = "red", linewidth = 2,
                args = list(mean = mean(bodyfat$residuals),
                            sd = sd(bodyfat$residuals)))
# create a Q-Q plot
qqnorm(bodyfat$residuals, main = "Q-Q Plot of Residuals")
qqline(bodyfat$residuals)
# Shapiro-Wilk test (also in-class #2)
shapiro.test(final_lm$residuals)
Shapiro-Wilk normality test
data: final_lm$residuals
W = 0.99153, p-value = 0.1592
Wow, this is a much more normal-looking histogram. It’s still not perfect in the center, but it is so much better! In addition, the Q-Q plot looks like it did last time, and we still fail to reject the null hypothesis of normality for the Shapiro-Wilk test (p = 0.159).
20. The errors have equal/constant variance across all values of X (use the scale - location plot)
# your code here
autoplot(final_lm, which = 3, nrow = 1, ncol = 1)
The scale-location plot is softened compared to the original. It still wiggles around fitted values of 25, but it is less severe.
21. The model describes all observations (i.e., there are no influential points) (use the Cook’s distance > 0.5 plot).
# your code here
# Cook's distance (taken from mod_4_in_class_analysis)
cd_cont_pos <- function(leverage, level, model) {
  sqrt(level * length(coef(model)) * (1 - leverage) / leverage)
}
cd_cont_neg <- function(leverage, level, model) {
  -cd_cont_pos(leverage, level, model)
}
cd_threshold <- 0.5
autoplot(final_lm, which = 5) +
  stat_function(fun = cd_cont_pos,
                args = list(level = cd_threshold, model = final_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  stat_function(fun = cd_cont_neg,
                args = list(level = cd_threshold, model = final_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  scale_y_continuous(limits = c(-4, 4))
There are no influential points, though that is partly because we already removed the one influential observation earlier. The assumption is met.
22. No multicollinearity (use the scatterplot matrix, correlation matrix, and variance inflation factors).
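The diagnostics chunk is not shown above; here is a minimal sketch of what it might look like (pairs and cor are base R; vif comes from the car package, as in Homework 4):

# scatterplot matrix and correlation matrix of the six predictors
pairs(bodyfat[, 1:6])
cor(bodyfat[, 1:6])

# variance inflation factors for the final model
library(car)
vif(final_lm)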
There is no sign of multicollinearity in the final model: the variance inflation factors are all under 5.
23. Given the results from your model assumption checking, what would you do next to continue this analysis?
To continue the analysis, I would begin interpreting the coefficients we have. I am satisfied that weight and abdom are the best variables to use, and we can now use this model to predict the actual body fat of men like those in our sample.
24. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.
I learned that forward, backward, and the other variable selection procedures really can mix up what results we get. I think the best approach is to try all of them; with a data set this small it isn’t that difficult, and it gives us the peace of mind that we really picked the best variables for our model.
25. Briefly summarize what you learned from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
We have this data set to try to predict the actual body fat of a person without going through the hassle of dunking them underwater for the usual measurement process. Instead, we are happy to take a few measurements and make a good prediction. It will be much cheaper and, if done correctly, still very accurate. From our analysis we learned that to make a great predictive model, we just need to measure people’s weight and their abdomen size. From these two, we can make a great guess. Using all of the variables can confuse the model, so we will leave them out when we actually predict. Of course, it was important to measure everything we could in our sample, so that we could find out which measurements matter most.