Homework 5



Multiple Linear Regression Variable Selection Methods


Author


Spencer Hamilton

Data and Description

For this assignment, we are revisiting the data set used in Homework 4. I think it would be very beneficial for you to review your Homework 4 before starting this one.

Measuring body fat is not simple. One method requires submerging the body underwater in a tank and measuring the increase in water level. A simpler method for estimating body fat would be preferred. In order to develop such a method, researchers recorded age (years), weight (pounds), height (inches), and three body circumference measurements (around the neck, chest, and abdominal (all in centimeters)) for 252 men. Each mans’ percentage of body fat was accurately estimated by an underwater weighing technique (the variable brozek is the percentage of body fat). The hope is to be able to use this data to create a model that will accurately predict body fat percentage, by using just the basic variables recorded, without having to use the tank submerging method.

The data can be found in the BodyFat data set on Canvas. Download BodyFat.txt, and put it in the same folder as this R Markdown file.

0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.

0b. Make sure to set your seed since some of the functions randomly split your data (use set.seed in the setup code chunk above)!

1. Read in the data set, and call the data frame “bodyfat_orig”. Print a summary of the data and make sure the data makes sense. Remove the “row” column (which contains row numbers) from the data set. Make sure the class of “bodyfat_orig” is a data.frame only.

# your code here
bodyfat_orig <- read.table('bodyfat.txt', header = TRUE)
#summary(bodyfat)
bodyfat_orig <- subset(bodyfat_orig, select = -row)
summary(bodyfat_orig)
     brozek           age            weight          height     
 Min.   : 0.00   Min.   :22.00   Min.   :118.5   Min.   :64.00  
 1st Qu.:12.80   1st Qu.:35.50   1st Qu.:158.8   1st Qu.:68.38  
 Median :19.00   Median :43.00   Median :176.2   Median :70.00  
 Mean   :18.89   Mean   :44.86   Mean   :178.8   Mean   :70.32  
 3rd Qu.:24.55   3rd Qu.:54.00   3rd Qu.:196.9   3rd Qu.:72.25  
 Max.   :45.10   Max.   :81.00   Max.   :363.1   Max.   :77.75  
      neck           chest           abdom       
 Min.   :31.10   Min.   : 79.3   Min.   : 69.40  
 1st Qu.:36.40   1st Qu.: 94.3   1st Qu.: 84.55  
 Median :38.00   Median : 99.6   Median : 90.90  
 Mean   :37.98   Mean   :100.8   Mean   : 92.49  
 3rd Qu.:39.40   3rd Qu.:105.3   3rd Qu.: 99.20  
 Max.   :51.20   Max.   :136.2   Max.   :148.10  

2. Refer back to your Homework 4. In that assignment, you fit a multiple linear regression model to the body fat measurements using all the predictor variables provided in the data set. For each of the multiple linear regression assumptions listed below, state if they were met or not met.

  1. The X’s vs Y are linear: Honestly, I think it could be, with very random looking residual vs predictor plots and a flat line in the residuals vs fitted values plot… perhaps a test would be able to prove it though.
  2. The residuals are normally distributed: No, see histogram for visual proof.
  3. The residuals are homoscedastic: No, see the scale-location plot.
  4. There are no influential points: No, there is an influential point (39).
  5. No multicollinearity: No, Weight, chest, and abdomn are past the recommended limit for multicollinearity.

3. There is one clear influential point in the data set. Create a new variable called “bodyfat” that contains the bodyfat_orig data set with the influential point removed. Use the bodyfat data set (not the bodyfat_orig data set) throughout the rest of the assignment.

#code for graphing... to validate
cd_cont_pos <- function(leverage, level, model) {sqrt(level*length(coef(model))*(1-leverage)/leverage)}
cd_cont_neg <- function(leverage, level, model) {-cd_cont_pos(leverage, level, model)}
#linear model and cook's
bodyfat_orig_lm <- lm(bodyfat_orig$brozek ~ ., bodyfat_orig)
cooksd <- cooks.distance(bodyfat_orig_lm)
#pregraph
cd_threshold <- 0.5
autoplot(bodyfat_orig_lm, which = 5) +
  stat_function(fun = cd_cont_pos,
                args = list(level = cd_threshold, model = bodyfat_orig_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  stat_function(fun = cd_cont_neg,
                args = list(level = cd_threshold, model = bodyfat_orig_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  scale_y_continuous(limits = c(-4, 4))

# Find the influential observation indices
influential_obs_indices <- which(cooksd > cd_threshold)

# Remove influential observations from the dataset
bodyfat <- bodyfat_orig[-influential_obs_indices, ]

# Print out the influential observations that were removed
influential_observations_removed <- bodyfat_orig[influential_obs_indices, ]
print(influential_observations_removed)
   brozek age weight height neck chest abdom
39   33.8  46 363.15  72.25 51.2 136.2 148.1
#post lm
bodyfat_lm <- lm(bodyfat$brozek ~ ., bodyfat)
cooksd <- cooks.distance(bodyfat_lm)
#pregraph
#post graph
autoplot(bodyfat_lm, which = 5) +
  stat_function(fun = cd_cont_pos,
                args = list(level = cd_threshold, model = bodyfat_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  stat_function(fun = cd_cont_neg,
                args = list(level = cd_threshold, model = bodyfat_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  scale_y_continuous(limits = c(-4, 4))

You should have discovered, from Homework 4, that there is a multicollinearity problem. The goal of this assignment is to continue this analysis by identifying variables to potentially remove from the model to resolve the multicollinearity issues.

4. Briefly explain why multicollinearity is a problem for multiple linear regression.

Multicollinearity is when two or more predictor variables in a regression model are highly correlated with themselves. This is a problem for our regression model because it will inflate our standard errors. This will shrink the precision of the estimates.

Multicollinearity can make the estimated coefficients sensitive to small changes in the data, leading to unstable and unreliable coefficient estimates. It really makes our model sad, and difficult to work with.

It also complicates the process of variable selection because it becomes challenging to distinguish the unique contribution of each variable added to the model.

5. Briefly explain the similarities and differences between the following variable selection methods: best subset, forward, backward, and sequential replacement. Do not just copy the algorithms from the class notes - use your own words to explain what these methods are doing.

Best Subset Selection is all about fitting all possible combinations, and selecting the model with the best preformance. This is usually adjusted R-squared, AIC or BIC. This is a brute force selection.

Forward Selection starts with an empty model then adds one predictor variable at a time, based on some criteria, untill adding more will no longer make any improvement in the model. This is less computationally expensive, but may not guarantee the best subset of variables.

Backward Elimination is a full model, then progressively cuts variables untill no improvement is seen in the model. It may not converge with the Foreward Selection because it has a different model, and can only find a local minima, unlike the brute force search of Best Subset Selection.

Sequential Replacement is the combination of foreward and backwards. It will take either a foreward or backward step (adding or removing a predictor variable), until the best model is made, according to some criteria.It is in the middle of the road for computational work load.

6. Briefly explain how shrinkage methods work (bias-variance tradeoff).

Shrinkage methods, such as Ridge Regression and Lasso Regression, work by adding a penalty term. This penalty term shrinks large coefficients in our standard linear model.

Bias: Shrinkage methods add bias by penalizing the magnitude of the coefficients. This bias shrinks the variance of the model, leading to better performance, especially with high-dimensional datasets or multicollinearity.

Variance: By shrinking the coefficients, LASSO and Ridge will reduce the variance of the model. Shrinkage methods tend to produce more stable and generalizable models, because they help stop overfitting.

7. Briefly explain the similarities/differences between ridge regression and LASSO.

They both add a penalty, but they do it differently. Ridge regression will add a penalty term proportional to the square of the coefficients (L2 penalty). In contrast, LASSO (Least Absolute Shrinkage and Selection Operator) selects prediction variables by setting some coefficients exactly to zero. This property makes LASSO good for feature selection as it will only take the most important predictors.

8. When using the bestglm function in R for the stepwise methods, the response variable must be the last column in the data set for the bestglm function to work. Switch the order of the columns in the data set so that brozek is last.

# your code here
#new_df <- df[, c(setdiff(names(df), column_to_move), column_to_move)]

# Assuming "bodyfat" is your dataframe

# Identify the index of the "brozek" column
brozek_index <- which(names(bodyfat) == "brozek")

# Create a new dataframe with the "brozek" column moved to the last position
new_bodyfat <- bodyfat[, c(1:(brozek_index-1), (brozek_index+1):ncol(bodyfat), brozek_index)]

# Assuming "bodyfat" is your dataframe and you want to keep all columns except "age" and "weight"
bodyfat <- subset(bodyfat, select = -c(brozek))


# Rename the last column back to "brozek"
names(new_bodyfat)[ncol(new_bodyfat)] <- "brozek"

# If you want to replace the original dataframe with the new one:
bodyfat <- new_bodyfat

bodyfat <- subset(bodyfat, select = -c(brozek))

summary(bodyfat)
      age            weight          height           neck      
 Min.   :22.00   Min.   :118.5   Min.   :64.00   Min.   :31.10  
 1st Qu.:35.25   1st Qu.:158.5   1st Qu.:68.31   1st Qu.:36.40  
 Median :43.00   Median :176.1   Median :70.00   Median :37.95  
 Mean   :44.86   Mean   :178.1   Mean   :70.31   Mean   :37.93  
 3rd Qu.:54.00   3rd Qu.:196.8   3rd Qu.:72.25   3rd Qu.:39.40  
 Max.   :81.00   Max.   :262.8   Max.   :77.75   Max.   :43.90  
     chest            abdom            brozek     
 Min.   : 79.30   Min.   : 69.40   Min.   : 0.00  
 1st Qu.: 94.25   1st Qu.: 84.53   1st Qu.:12.80  
 Median : 99.60   Median : 90.90   Median :19.00  
 Mean   :100.63   Mean   : 92.27   Mean   :18.83  
 3rd Qu.:105.30   3rd Qu.: 99.17   3rd Qu.:24.50  
 Max.   :128.30   Max.   :126.20   Max.   :45.10  

9. Apply the best subsets variable selection procedure to this data set using the bestglm function. Try it using AIC and BIC. Output a summary of the “best” model for each metric.

best_subsets_bic <- bestglm(Xy= bodyfat, family = gaussian, IC="BIC", method="exhaustive")

best_subsets_bic$BestModel

Call:
lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
    drop = FALSE], y = y))

Coefficients:
(Intercept)       weight        abdom  
   -42.8704      -0.1221       0.9043  
best_subsets_aic <- bestglm(Xy= bodyfat, family = gaussian, IC="AIC", method="exhaustive")

best_subsets_aic$BestModel

Call:
lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
    drop = FALSE], y = y))

Coefficients:
(Intercept)       height         neck        chest        abdom  
     6.8946      -0.4467      -0.4848      -0.1438       0.8259  

10. Apply the forward selection procedure to this data set using the step() function in R. Try it using AIC and BIC (remember: in order to do BIC with the step() function you need to change the default value of k to be log(n) where n is the number of rows in the dataset!). Output a summary of the “best” models in each case.

# your code here

# Use the code we used in class
base_mod <- lm(brozek ~ 1, data = bodyfat) # Intercept only model (null model, or base model)
full_mod <- lm(brozek ~ ., data = bodyfat) # All predictors in model (besides response)

# Forward stepwise ---------------------------------------------------

step_forw_aic <- step(base_mod, # starting model
                      direction = "forward", 
                      k = 2, # AIC (the multiplier of p in the penalty term)
                      scope=list(lower= base_mod, upper= full_mod))
Start:  AIC=1020.58
brozek ~ 1

         Df Sum of Sq     RSS     AIC
+ abdom   1    9963.3  4739.1  739.54
+ chest   1    7151.7  7550.7  855.98
+ weight  1    5623.0  9079.4  902.07
+ neck    1    3381.7 11320.7  957.23
+ age     1    1233.3 13469.1 1000.67
<none>                14702.4 1020.58
+ height  1       7.4 14694.9 1022.45

Step:  AIC=739.54
brozek ~ abdom

         Df Sum of Sq    RSS    AIC
+ weight  1    639.47 4099.6 705.30
+ height  1    507.59 4231.5 713.21
+ neck    1    404.24 4334.9 719.25
+ chest   1    218.23 4520.9 729.75
+ age     1    131.08 4608.0 734.52
<none>                4739.1 739.54

Step:  AIC=705.3
brozek ~ abdom + weight

         Df Sum of Sq    RSS    AIC
+ neck    1    67.516 4032.1 703.15
+ height  1    35.157 4064.5 705.15
<none>                4099.6 705.30
+ chest   1     8.955 4090.7 706.75
+ age     1     0.816 4098.8 707.25

Step:  AIC=703.15
brozek ~ abdom + weight + neck

         Df Sum of Sq    RSS    AIC
+ height  1    54.314 3977.8 701.76
<none>                4032.1 703.15
+ chest   1     2.544 4029.6 704.99
+ age     1     0.898 4031.2 705.09

Step:  AIC=701.76
brozek ~ abdom + weight + neck + height

        Df Sum of Sq    RSS    AIC
<none>               3977.8 701.76
+ chest  1   18.5707 3959.2 702.59
+ age    1    3.4374 3974.4 703.54
## You can also run forward selection using bestglm, but 
## you can't see the path with this way, to my knowledge.
forw_AIC_2 <- bestglm(bodyfat,
                      IC = "AIC",
                      method = "forward")


#now BIC
cat("\n\n\n\n Now for BIC")




 Now for BIC
base_mod <- lm(brozek ~ 1, data = bodyfat) # Intercept only model (null model, or base model)
full_mod <- lm(brozek ~ ., data = bodyfat) # All predictors in model (besides response)

# Forward stepwise ---------------------------------------------------

step_forw_bic <- step(base_mod, # starting model
                      direction = "forward", 
                      k = log(nrow(bodyfat)), # AIC (the multiplier of p in the penalty term)
                      scope=list(lower= base_mod, upper= full_mod))
Start:  AIC=1024.1
brozek ~ 1

         Df Sum of Sq     RSS     AIC
+ abdom   1    9963.3  4739.1  746.58
+ chest   1    7151.7  7550.7  863.03
+ weight  1    5623.0  9079.4  909.12
+ neck    1    3381.7 11320.7  964.27
+ age     1    1233.3 13469.1 1007.72
<none>                14702.4 1024.10
+ height  1       7.4 14694.9 1029.49

Step:  AIC=746.58
brozek ~ abdom

         Df Sum of Sq    RSS    AIC
+ weight  1    639.47 4099.6 715.86
+ height  1    507.59 4231.5 723.78
+ neck    1    404.24 4334.9 729.81
+ chest   1    218.23 4520.9 740.31
+ age     1    131.08 4608.0 745.09
<none>                4739.1 746.58

Step:  AIC=715.86
brozek ~ abdom + weight

         Df Sum of Sq    RSS    AIC
<none>                4099.6 715.86
+ neck    1    67.516 4032.1 717.23
+ height  1    35.157 4064.5 719.23
+ chest   1     8.955 4090.7 720.84
+ age     1     0.816 4098.8 721.33
## You can also run forward selection using bestglm, but 
## you can't see the path with this way, to my knowledge.
forw_BIC_2 <- bestglm(bodyfat,
                      IC = "BIC",
                      method = "forward")

11. Apply the backward selection procedure to this data set using the step() function in R. Try it using AIC and BIC (remember: in order to do BIC with the step() function you need to change the default value of k to be log(n) where n is the number of rows in the dataset!). Output a summary of the “best” models in each case.

# your code here
# Use the code we used in class
base_mod <- lm(brozek ~ 1, data = bodyfat) # Intercept only model (null model, or base model)
full_mod <- lm(brozek ~ ., data = bodyfat) # All predictors in model (besides response)

# back stepwise ---------------------------------------------------

step_bac_biac <- step(full_mod, # starting model
                      direction = "backward", 
                      k = log(nrow(bodyfat)), # BIC (the multiplier of p in the penalty term)
                      scope=list(lower= base_mod, upper= full_mod))
Start:  AIC=728.75
brozek ~ age + weight + height + neck + chest + abdom

         Df Sum of Sq    RSS    AIC
- weight  1      4.92 3956.5 723.54
- age     1      7.65 3959.2 723.72
- chest   1     22.78 3974.4 724.67
- height  1     76.57 4028.2 728.03
- neck    1     82.68 4034.3 728.41
<none>                3951.6 728.75
- abdom   1   1880.38 5832.0 820.54

Step:  AIC=723.54
brozek ~ age + height + neck + chest + abdom

         Df Sum of Sq    RSS    AIC
- age     1     19.38 3975.9 719.24
- chest   1     46.45 4003.0 720.94
<none>                3956.5 723.54
- neck    1    120.06 4076.6 725.49
- height  1    232.19 4188.7 732.28
- abdom   1   2836.60 6793.1 853.16

Step:  AIC=719.24
brozek ~ height + neck + chest + abdom

         Df Sum of Sq    RSS    AIC
- chest   1     50.41 4026.3 716.87
<none>                3975.9 719.24
- neck    1    117.75 4093.6 721.02
- height  1    299.02 4274.9 731.85
- abdom   1   3018.43 6994.3 854.93

Step:  AIC=716.87
brozek ~ height + neck + abdom

         Df Sum of Sq     RSS    AIC
<none>                 4026.3 716.87
- neck    1     205.2  4231.5 723.78
- height  1     308.6  4334.9 729.81
- abdom   1    6749.2 10775.5 957.46
## You can also run forward selection using bestglm, but 
## you can't see the path with this way, to my knowledge.
back_bic_2 <- bestglm(bodyfat,
                      IC = "BIC",
                      method = "backward")


#now for AIC
cat("\n\n\n\n now with AIC backwards\n\n" )




 now with AIC backwards
# back stepwise ---------------------------------------------------

step_bac_aic <- step(full_mod, # starting model
                      direction = "backward", 
                      k = 2, # BIC (the multiplier of p in the penalty term)
                      scope=list(lower= base_mod, upper= full_mod))
Start:  AIC=704.1
brozek ~ age + weight + height + neck + chest + abdom

         Df Sum of Sq    RSS    AIC
- weight  1      4.92 3956.5 702.41
- age     1      7.65 3959.2 702.59
- chest   1     22.78 3974.4 703.54
<none>                3951.6 704.10
- height  1     76.57 4028.2 706.90
- neck    1     82.68 4034.3 707.28
- abdom   1   1880.38 5832.0 799.41

Step:  AIC=702.41
brozek ~ age + height + neck + chest + abdom

         Df Sum of Sq    RSS    AIC
- age     1     19.38 3975.9 701.64
<none>                3956.5 702.41
- chest   1     46.45 4003.0 703.33
- neck    1    120.06 4076.6 707.89
- height  1    232.19 4188.7 714.67
- abdom   1   2836.60 6793.1 835.55

Step:  AIC=701.64
brozek ~ height + neck + chest + abdom

         Df Sum of Sq    RSS    AIC
<none>                3975.9 701.64
- chest   1     50.41 4026.3 702.79
- neck    1    117.75 4093.6 706.93
- height  1    299.02 4274.9 717.76
- abdom   1   3018.43 6994.3 840.85
## You can also run forward selection using bestglm, but 
## you can't see the path with this way, to my knowledge.
back_bic_aic_2 <- bestglm(bodyfat,
                      IC = "BIC",
                      method = "backward")

12. Apply the sequential replacement selection procedure to this data set using the step() function. You may choose which metric you would like to use (either AIC or BIC). Try initializing it from the full model and the intercept only model. Output a summary of the single “best” model for the metric you chose.

# your code here
both <- bestglm(bodyfat, method="seqrep")
summary(both$BestModel)

Call:
lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
    drop = FALSE], y = y))

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0656  -2.9685  -0.1073   3.0258   9.9220 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -42.87040    2.45735 -17.446  < 2e-16 ***
weight       -0.12206    0.01966  -6.207 2.26e-09 ***
abdom         0.90426    0.05221  17.321  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.074 on 247 degrees of freedom
Multiple R-squared:  0.7212,    Adjusted R-squared:  0.7189 
F-statistic: 319.4 on 2 and 247 DF,  p-value: < 2.2e-16

13. Apply LASSO to this data set using the MSE metric. Output the coefficient values corresponding to the 1 standard error rule (do not output any plots).

# your code here
bodyfat_x <- as.matrix(bodyfat[, 1:6]) # 
bodyfat_y <- bodyfat[, 7] # response

# use cross validation to pick the "best" (based on MSE) lambda
set.seed(50)
bodyfat_ridge_cv <- cv.glmnet(x = bodyfat_x, # automatically includes a column 
                          y = bodyfat_y, 
                          type.measure = "mse", 
                          alpha = 0)  # 0 is code for "ridge regression"

# plot (log) lambda vs MSE
p1 <- autoplot(bodyfat_ridge_cv, label = FALSE) +
  theme_bw() +
  theme(aspect.ratio = 1)

# lambda.min: value of lambda that gives minimum mean cross-validated error
bodyfat_ridge_cv$lambda.min
[1] 0.6312929
# lambda.1se: value of lambda within 1 standard error of the minimum 
# cross-validated error
bodyfat_ridge_cv$lambda.1se
[1] 1.210763
coef(bodyfat_ridge_cv, s = "lambda.min")
7 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept)  4.81285338
age          0.04819341
weight       0.03464651
height      -0.48395234
neck        -0.38752860
chest        0.05516167
abdom        0.52953525
coef(bodyfat_ridge_cv, s = "lambda.1se")
7 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept)  1.89654409
age          0.05482955
weight       0.04079229
height      -0.47219783
neck        -0.27269081
chest        0.10999933
abdom        0.43008565

14. Apply Elastic Net to this data set using the MSE metric. Output the coefficient values corresponding to the 1 standard error rule (do not output any plots).

# your code here
set.seed(50)
bodyfat_elastic_cv <- cv.glmnet(x = bodyfat_x, # automatically includes a column of ones for the intercept FYI
                          y = bodyfat_y, 
                          type.measure = "mse", 
                          alpha = .5)  # .5 - this fits the elastic net - half lasso half ridge

# plot (log) lambda vs MSE
p3 <- autoplot(bodyfat_elastic_cv, label = FALSE) +
  theme_bw() +
  theme(aspect.ratio = 1)

# lambda.min: value of lambda that gives minimum mean cross-validated error
bodyfat_elastic_cv$lambda.min
[1] 0.03276441
# lambda.1se: value of lambda within 1 standard error of the minimum 
# cross-validated error
bodyfat_elastic_cv$lambda.1se
[1] 0.5860404
coef(bodyfat_elastic_cv, s = "lambda.min")
7 x 1 sparse Matrix of class "dgCMatrix"
                     s1
(Intercept) -0.30691450
age          0.02122208
weight      -0.01068811
height      -0.37934042
neck        -0.45669044
chest       -0.09411263
abdom        0.79713819
coef(bodyfat_elastic_cv, s = "lambda.1se")
7 x 1 sparse Matrix of class "dgCMatrix"
                      s1
(Intercept) -10.58193343
age           0.02146324
weight        .         
height       -0.35221851
neck         -0.01698972
chest         .         
abdom         0.58369704

15. Fill in the table below with “X”s (like the one at the end of the Module 5 course notes: a row for each variable, a column for each variable selection method, an “X” in a cell means the variable was included for that variable selection method). For the best subset, forward, backward, and sequential replacement columns, use either AIC or BIC. In other words, only report the models for either AIC or BIC, not both, but make sure you’re consistent in the table.

I will use BIC

Variable Best Subset Forward Backward Sequential Replacement LASSO Elastic Net
age x x
weight x x x x
height x x x
neck x x x
chest x
abdom x x x x x x

16. Now that you have seen the various results from the different methods, pick a subset of variables that you will include in the model. Which variables do you choose to include in the model? Why?

I choose to include weight and abdom. These are the predictor variables which three of the selection methods agree on. Thus, I will go with the crowd.

17. Create the multiple linear regression model with the variables you listed in the previous question (alternatively, you can call the best model using $BestModel). Print a summary of the results. Save the residuals from this model to the bodyfat dataframe.

# your code here
best_subsets_bic$BestModel

Call:
lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
    drop = FALSE], y = y))

Coefficients:
(Intercept)       weight        abdom  
   -42.8704      -0.1221       0.9043  
bodyfat$residuals <- best_subsets_bic$BestModel$residuals
bodyfat$fitted_values <- best_subsets_bic$BestModel$fitted.values

Now that you have chosen a model, the next several questions ask you to check some of the model assumptions. For each assumption, (1) perform appropriate diagnostics to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that. Note: you can copy (then modify) a lot of your code from Homework 4 to answer these questions.

18. The Xs vs Y are linear (use the residual by predictor plots and the partial regression plots)

# your code here
#residual vs predictor plots

plot(bodyfat$weight, bodyfat$residuals,
     xlab = "weight Variable",
     ylab = "Residuals",
     main = "Residuals vs. weight")

plot(bodyfat$abdom, bodyfat$residuals,
     xlab = "abdom Variable",
     ylab = "Residuals",
     main = "Residuals vs. abdom")

#partial regression plots

resid_vs_weight <- ggplot(data = bodyfat) +
  geom_point(mapping = aes(x = weight, y = residuals)) +
  theme(aspect.ratio = 1)

resid_vs_abdom <- ggplot(data = bodyfat) +
  geom_point(mapping = aes(x = abdom, y = residuals)) +
  theme(aspect.ratio = 1)

# put plots in 2 rows & 3 columns using the patchwork package
resid_vs_weight

resid_vs_abdom

#new bodyfat_lm with just weight and abdom

final_lm <- lm(brozek ~ weight + abdom, data = bodyfat)
#REsiduals vs fitted
autoplot(final_lm, which = 1, ncol = 1, nrow = 1)

Yes, I believe there is a linear trend. The residual graphs look very random, and the residuals vs fitted graph is almost straight, except at the end… just like before.

19. The errors are normally distributed (use a histogram, qq plot, and shapiro wilk test)

# your code here
# histogram of errors
# The code below produces a basic histogram (from the in-class analysis #2)

ggplot(data = bodyfat) +
  geom_histogram(aes(x = residuals, y = after_stat(density)), 
                 binwidth = 1.2) +
  stat_function(fun = dnorm, color = "red", linewidth = 2,
                args = list(mean = mean(bodyfat$residuals), 
                            sd = sd(bodyfat$residuals))) 

# Create a Q-Q plot
qqnorm(bodyfat$residuals, main = "Q-Q Plot of Residuals")
qqline(bodyfat$residuals)

#Shapiro Test (also in-class #2)
shapiro.test(final_lm$residuals)

    Shapiro-Wilk normality test

data:  final_lm$residuals
W = 0.99153, p-value = 0.1592

Wow, this is a much more normal looking histogram. It’s still not perfect in the center, but it is so much better! In addition, the QQ plot looks like it did last time, and we still fail to reject the null hypothesis of normality for the wilks-shapiro test.

20. The errors have equal/constant variance across all values of X (use the scale - location plot)

# your code here
autoplot(final_lm, which = 3, nrow = 1, ncol = 1)

The scale-location plot is softened from the origional. It still squiggles at around 25, but it is less severe.

21. The model describes all observations (i.e., there are no influential points) (use the cooks distance > 0.5 plot).

# your code here
# Cook's Distance
# taken from mod_4_in_class_analysis

cd_cont_pos <- function(leverage, level, model) {sqrt(level*length(coef(model))*(1-leverage)/leverage)}
cd_cont_neg <- function(leverage, level, model) {-cd_cont_pos(leverage, level, model)}

cd_threshold <- 0.5
autoplot(final_lm, which = 5) +
  stat_function(fun = cd_cont_pos,
                args = list(level = cd_threshold, model = final_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  stat_function(fun = cd_cont_neg,
                args = list(level = cd_threshold, model = final_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  scale_y_continuous(limits = c(-4, 4))

There are no influential points, but that’s because we already removed them… not sure how impressive this actually is, but it is at least passable.

22. No multicollinearity (use the scatterplot matrix, correlation matrix, and variance inflation factors).

# your code here
final_data <- subset(bodyfat, select = c(abdom, weight, residuals, fitted_values))


final_data |> 
select(-residuals, -fitted_values) |> 
  pairs(pch = 19, lower.panel = NULL)

corrplot(cor(final_data |> 
               select(-fitted_values, -residuals)),
         type = "upper")

super_vifs <- vif(final_lm)
super_vifs
  weight    abdom 
4.242705 4.242705 
max(super_vifs)
[1] 4.242705
mean(super_vifs)
[1] 4.242705

There is no sign of Multicollerity, with values all under 5.

23. Given the results from your model assumption checking, what would you do next to continue this analysis?

To continue my analysis, honestly, I would begin interpreting the coefficients we have. I am satisfied that using just weight and abdom are the best variables. We can begin using them to create our ultimate model for how to best model the actual bodyfat of a male in our sample.

24. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.

I learned that moving forward and backwards and using other types of variable selection process’s really can mix up what results we get. Of course, I think that the best is just to try all of them, especially with this small of a data set, I think it isnt that difficult, and give me us the peace of mind that we really picked the best variables for our model.

25. Briefly summarize what you learned from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.

We have this data set to try and predict the actual body fat of a person, without going through the hastle of dunking them underwater in the complex brozek process. Instead, we are happy to take a few measurements, and make a good prediction. It will be much cheaper, and if done correctly, still very accurate. So, from our analysis we learned that to make a great model for predicting, we just need to measure people’s weight and their abdomen size. From these two, we can make a great guess. Using all of the variables can confuse the model, so we will just leave them out when we actually predict. Of course, it was important to measure everything we could in our sample, so that we could find out what were the most important things to measure on a person.