Homework 4



Multiple Linear Regression


Author


Spencer Hamilton

Data and Description

Measuring body fat is not simple. One method requires submerging the body underwater in a tank and measuring the increase in water level. A simpler method for estimating body fat would be preferred. In order to develop such a method, researchers recorded age (years), weight (pounds), height (inches), and three body circumference measurements (around the neck, chest, and abdomen, all in centimeters) for 252 men. Each man’s percentage of body fat was accurately estimated by an underwater weighing technique (the variable “brozek” is the percentage of body fat). The hope is to be able to use these data to create a model that will accurately predict body fat percentage using just the basic variables recorded, without having to use the tank submerging method.

The data can be found in the BodyFat data set on Canvas. Download “BodyFat.txt”, and put it in the same folder as this R Markdown file.

0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.

1. Read in the data set, and call the data frame “bodyfat”. Print a summary of the data. Remove the “row” column (which contains row numbers) from the data set.

bodyfat <- read.table("BodyFat.txt", header = TRUE)
bodyfat <- subset(bodyfat, select = -row)  # drop the row-number column
summary(bodyfat)
     brozek           age            weight          height     
 Min.   : 0.00   Min.   :22.00   Min.   :118.5   Min.   :64.00  
 1st Qu.:12.80   1st Qu.:35.50   1st Qu.:158.8   1st Qu.:68.38  
 Median :19.00   Median :43.00   Median :176.2   Median :70.00  
 Mean   :18.89   Mean   :44.86   Mean   :178.8   Mean   :70.32  
 3rd Qu.:24.55   3rd Qu.:54.00   3rd Qu.:196.9   3rd Qu.:72.25  
 Max.   :45.10   Max.   :81.00   Max.   :363.1   Max.   :77.75  
      neck           chest           abdom       
 Min.   :31.10   Min.   : 79.3   Min.   : 69.40  
 1st Qu.:36.40   1st Qu.: 94.3   1st Qu.: 84.55  
 Median :38.00   Median : 99.6   Median : 90.90  
 Mean   :37.98   Mean   :100.8   Mean   : 92.49  
 3rd Qu.:39.40   3rd Qu.:105.3   3rd Qu.: 99.20  
 Max.   :51.20   Max.   :136.2   Max.   :148.10  

2. Create and print a scatterplot matrix of the data.

plot(bodyfat)

pairs(bodyfat, pch = 19, lower.panel = NULL) # this omits the duplicated half

3. Based on the scatterplot matrix, briefly explain which variables you think will be helpful for predicting brozek and which variables you think will not be helpful at predicting brozek. Explain how the scatterplot helped determine your answers.

The first row of the scatterplot matrix tells most of the story, since it plots brozek (the measured body fat percentage) against each explanatory variable. Weight, chest, and abdom show reasonably clear linear relationships with brozek, so I expect them to be helpful predictors. Age, height, and neck have much messier scatterplots with no obvious relationship, so I expect them to be less helpful. I essentially looked for the panels whose point clouds most resemble a straight line.

4. Create and print a correlation matrix (numeric or color- and shape-coded).

round(cor(bodyfat), 2)
       brozek   age weight height neck chest abdom
brozek   1.00  0.29   0.61  -0.02 0.49  0.70  0.81
age      0.29  1.00  -0.01  -0.24 0.11  0.17  0.23
weight   0.61 -0.01   1.00   0.49 0.83  0.89  0.89
height  -0.02 -0.24   0.49   1.00 0.33  0.24  0.20
neck     0.49  0.11   0.83   0.33 1.00  0.78  0.75
chest    0.70  0.17   0.89   0.24 0.78  1.00  0.92
abdom    0.81  0.23   0.89   0.20 0.75  0.92  1.00
corrplot(cor(bodyfat), type = "upper")

5. Based on the scatterplot matrix and the correlation matrix, are there any pairs of variables that you suspect will cause a problem in terms of multicollinearity? If so, which ones?

It looks like weight will cause multicollinearity problems with chest, abdom, and neck. Those pairs all show large, dark blue circles in the correlation plot above, and their pairwise correlations are high (roughly 0.75 to 0.92).
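As a quick numeric check of that suspicion, the pairwise correlations among just those predictors can be pulled out on their own:

# correlations among only the predictors suspected of causing multicollinearity
round(cor(bodyfat[, c("weight", "neck", "chest", "abdom")]), 2)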

6. Fit a multiple linear regression model to the data (no transformations). Print a summary of the results. Save the residuals to the bodyfat data frame.

body_lm <- lm(brozek ~ ., bodyfat)
summary(body_lm)

Call:
lm(formula = brozek ~ ., data = bodyfat)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.5811  -3.0358   0.0668   2.8879  10.2828 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.010e+01  1.413e+01  -1.423   0.1561    
age          5.010e-03  2.540e-02   0.197   0.8438    
weight      -8.733e-02  3.879e-02  -2.251   0.0253 *  
height      -1.400e-01  1.520e-01  -0.921   0.3579    
neck        -4.421e-01  2.019e-01  -2.189   0.0295 *  
chest        4.844e-04  9.107e-02   0.005   0.9958    
abdom        8.754e-01  7.924e-02  11.048   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.124 on 244 degrees of freedom
Multiple R-squared:  0.722, Adjusted R-squared:  0.7151 
F-statistic: 105.6 on 6 and 244 DF,  p-value: < 2.2e-16
bodyfat$residuals <- residuals(body_lm)
bodyfat$fitted_values <- fitted(body_lm)

7. Briefly comment on the “significance” of the variables: were you surprised by the results? Are there any variables that are significant that you think shouldn’t be? Are there any variables that are not significant that you think should be?

I was mostly not surprised by the results. We expected age and height not to be significant, because you can be lean or heavy at any age, and tall people can carry extra fat too. Weight, neck, and abdom came back significant in predicting brozek. The one surprise is chest, which looked promising in the scatterplot matrix but is not significant here, probably because it overlaps so much with weight and abdom.

8. Briefly comment on the sign (+/-) of the coefficients for the variables. Are there any variables where the sign is the opposite of what you expected?

Yes, the negative sign on the neck coefficient was unexpected, and so was the negative sign on weight. Both coefficients are small in magnitude (on the order of 1e-01 and 1e-02), so the estimated effects are only slightly negative.

9. Mathematically write out the fitted multiple linear regression model for this data set using the coefficients you found above (do not use betas). Do not use “X” and “Y” in your model - use variable names that are descriptive.

\(\widehat{\text{brozek}} = -20.10 + 0.005010 \times \text{age} - 0.08733 \times \text{weight} - 0.1400 \times \text{height} - 0.4421 \times \text{neck} + 0.0004844 \times \text{chest} + 0.8754 \times \text{abdom}\)

10. Assuming the model assumptions are all met, how would you interpret the coefficient for Weight?

For each additional pound of weight, the predicted brozek (the measured body fat percentage) decreases by 0.08733, holding all other variables constant.
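A small sketch of this interpretation (the measurement values below are made up purely for illustration): predicting for two hypothetical men who are identical except for a one-pound difference in weight changes the prediction by exactly the weight coefficient.

# Two hypothetical men, identical except for a one-pound difference in weight
# (all values are made up for illustration only)
two_men <- data.frame(age = 45, weight = c(180, 181), height = 70,
                      neck = 38, chest = 100, abdom = 92)
diff(predict(body_lm, newdata = two_men))  # about -0.087, i.e., the weight coefficient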

11. Briefly explain what it means to “hold all else constant” when interpreting a coefficient.

It means we are comparing people who are identical in every other respect, so the coefficient isolates the effect of changing just that one variable. Since a single person cannot literally change only one attribute (their height, for example), we instead think of comparing individuals in the population who share the same values for all of the other variables.

12. Briefly explain what the F-test indicates, as given in the model output from question 6.

The F-test indicates whether the regression model as a whole is useful: the null hypothesis is that all of the slope coefficients are zero, and the alternative is that at least one is non-zero. Our output of “F-statistic: 105.6 on 6 and 244 DF, p-value: < 2.2e-16” means we reject the null hypothesis and conclude that at least one predictor has a non-zero coefficient, i.e., the model fits the data better than a model with no predictors.
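To make explicit what the F-test is comparing, a quick sketch: fit an intercept-only model and compare it to the full model with anova(), which should reproduce the same F-statistic of about 105.6.

# The overall F-test compares the full model to an intercept-only (no predictors) model
null_lm <- lm(brozek ~ 1, data = bodyfat)
anova(null_lm, body_lm)  # same F-statistic and p-value as in the summary output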

13. Briefly interpret the adjusted R-squared, as reported in the model output from question 6.

The adjusted R-squared measures how much of the variability in the response the predictors explain, with a penalty for the number of predictors in the model. In our case, about 71.51% of the variability in brozek is explained by the model, after adjusting for the six predictors.
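For reference, the adjusted R-squared differs from the ordinary R-squared only by that penalty; a quick sketch of the relationship:

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- summary(body_lm)$r.squared
n  <- length(residuals(body_lm))               # number of men used in the fit
1 - (1 - r2) * (n - 1) / df.residual(body_lm)  # should match the reported 0.7151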

Questions 14-17 involve using diagnostics to determine if the primary linear regression assumptions are met. For each assumption, (1) perform appropriate diagnostics to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that.

14. The X’s vs Y are linear (use the residual vs. predictor plots, partial regression plots, and the residual vs. fitted values plot).

#residual vs predictor plots

plot(bodyfat$age, bodyfat$residuals,
     xlab = "age Variable",
     ylab = "Residuals",
     main = "Residuals vs. age")

plot(bodyfat$weight, bodyfat$residuals,
     xlab = "weight Variable",
     ylab = "Residuals",
     main = "Residuals vs. weight")

plot(bodyfat$neck, bodyfat$residuals,
     xlab = "neck Variable",
     ylab = "Residuals",
     main = "Residuals vs. neck")

plot(bodyfat$chest, bodyfat$residuals,
     xlab = "chest Variable",
     ylab = "Residuals",
     main = "Residuals vs. chest")

plot(bodyfat$abdom, bodyfat$residuals,
     xlab = "abdom Variable",
     ylab = "Residuals",
     main = "Residuals vs. abdom")

plot(bodyfat$height, bodyfat$residuals,
     xlab = "height Variable",
     ylab = "Residuals",
     main = "Residuals vs. height")

# residual vs. predictor plots again, this time with ggplot2 so they can be
# arranged in one grid with patchwork (true partial regression plots follow below)
resid_vs_age <- ggplot(data = bodyfat) +
  geom_point(mapping = aes(x = age, y = residuals)) +
  theme(aspect.ratio = 1)

resid_vs_weight <- ggplot(data = bodyfat) +
  geom_point(mapping = aes(x = weight, y = residuals)) +
  theme(aspect.ratio = 1)

resid_vs_height <- ggplot(data = bodyfat) +
  geom_point(mapping = aes(x = height, y = residuals)) +
  theme(aspect.ratio = 1)

resid_vs_neck <- ggplot(data = bodyfat) +
  geom_point(mapping = aes(x = neck, y = residuals)) +
  theme(aspect.ratio = 1)

resid_vs_chest <- ggplot(data = bodyfat) +
  geom_point(mapping = aes(x = chest, y = residuals)) +
  theme(aspect.ratio = 1)

resid_vs_abdom <- ggplot(data = bodyfat) +
  geom_point(mapping = aes(x = abdom, y = residuals)) +
  theme(aspect.ratio = 1)

# put plots in 2 rows & 3 columns using the patchwork package
(resid_vs_age | resid_vs_weight | resid_vs_height) /
  (resid_vs_neck | resid_vs_chest | resid_vs_abdom)
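
# Partial regression (added-variable) plots; the ggplot panels above are
# residual vs. predictor plots. This assumes the car package is loaded
# (vif() from car is used later in this assignment).
avPlots(body_lm)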

# residual vs fitted values
# your code here
autoplot(body_lm, which = 1, ncol = 1, nrow = 1)

From this last graph (residuals vs. fitted values), most of the points scatter evenly around zero, with only a slight downward shift at the largest fitted values, so overall I feel fairly confident about linearity. The partial regression plots are also just a random-looking smattering of points, which is what we want, and the same goes for the residual vs. predictor plots.

15. The errors are independent (I will answer this one for you, no need to modify the response. No points for this question).

Since we do not know if the 252 men were randomly sampled from a population, we do not know if the residuals are independent or not. We will assume that they are independent for this analysis.

16. The errors are normally distributed (use a histogram, qq-plot, and Shapiro-Wilk test).

# histogram of errors
# The code below produces a basic histogram (from the in-class analysis #2)

ggplot(data = bodyfat) +
  geom_histogram(aes(x = residuals, y = after_stat(density)), 
                 binwidth = 1.2) +
  stat_function(fun = dnorm, color = "red", linewidth = 2,
                args = list(mean = mean(bodyfat$residuals), 
                            sd = sd(bodyfat$residuals))) 

# qq-plot
# your code here

# Create a Q-Q plot
qqnorm(bodyfat$residuals, main = "Q-Q Plot of Residuals")
qqline(bodyfat$residuals)

# Shapiro-Wilk test
# your code here

#Shapiro Test (also in-class #2)
shapiro.test(body_lm$residuals)

    Shapiro-Wilk normality test

data:  body_lm$residuals
W = 0.99416, p-value = 0.4432

The Shapiro-Wilk test fails to reject the null hypothesis of normality (p = 0.4432), so it provides no evidence against normally distributed errors. In contrast, the Q-Q plot lines up well in the middle but twists away from the line at both tails, and the histogram does not look especially normal either. Weighing the visual diagnostics more heavily, I would say the errors are not normal.

17. The errors have equal/constant variance across all values of X (check Scale-location plot)

# your code here
autoplot(body_lm, which = 3, nrow = 1, ncol = 1)


The scale-location plot is not flat; the trend bends rather than staying roughly horizontal, which suggests the errors do not have constant variance across all values of X.
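As an optional numeric follow-up (not required by the question), a Breusch-Pagan test can back up the visual impression; this sketch assumes the lmtest package is installed:

# Breusch-Pagan test for non-constant variance (assumes the lmtest package is installed)
library(lmtest)
bptest(body_lm)  # a small p-value would support the visual evidence of non-constant variance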

18. Check for influential points using Cook’s distance. What would you recommend doing given your findings? (You can also reference plots you created in previous questions.)

# Cook's Distance
# taken from mod_4_in_class_analysis

cd_cont_pos <- function(leverage, level, model) {sqrt(level*length(coef(model))*(1-leverage)/leverage)}
cd_cont_neg <- function(leverage, level, model) {-cd_cont_pos(leverage, level, model)}

cd_threshold <- 0.5
autoplot(body_lm, which = 5) +
  stat_function(fun = cd_cont_pos,
                args = list(level = cd_threshold, model = body_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  stat_function(fun = cd_cont_neg,
                args = list(level = cd_threshold, model = body_lm),
                xlim = c(0, 0.6), lty = 2, colour = "red") +
  scale_y_continuous(limits = c(-4, 4))

It looks like observation 39 is an influential point, with more leverage than the recommended guideline and a comparatively large Cook’s distance. I would recommend checking that observation for data-entry errors and refitting the model with and without it to see how much the results change before deciding whether to keep it.
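To flag influential observations numerically rather than only visually, the Cook’s distances can be pulled straight from the model (the 4/n cutoff below is just one common rule of thumb):

# Cook's distance for every observation
cooks_d <- cooks.distance(body_lm)
which(cooks_d > 4 / length(cooks_d))   # observations above a common 4/n cutoff
sort(cooks_d, decreasing = TRUE)[1:3]  # the few largest values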

19. Check for multicollinearity. For this (tacit) model assumption, compute the variance inflation factors (VIFs) and compare the VIFs to your response in question 5. Is there agreement? Is this assumption met (recall: the rule of thumb is that each VIF should be less than 10)?

# your code here
bodyfat |>
  select(-residuals, -fitted_values) |>
  pairs(pch = 19, lower.panel = NULL)

corrplot(cor(bodyfat |> 
               select(-fitted_values, -residuals)),
         type = "upper")

super_vifs <- vif(body_lm)
super_vifs
      age    weight    height      neck     chest     abdom 
 1.510958 19.128777  2.306983  3.545106  8.604128 10.684888 
max(super_vifs)
[1] 19.12878
mean(super_vifs)
[1] 7.63014

From the calculated VIFs, we can see that weight and abdom exceed the rule-of-thumb cutoff of 10. This is in general agreement with problem 5, where we were worried about multicollinearity among weight, abdom, and chest. Chest turns out to be acceptable (VIF of about 8.6), but weight and abdom show too much multicollinearity, so this assumption is not met.
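For context on where these numbers come from, each VIF equals 1/(1 - R²) from regressing that predictor on the other predictors; a quick sketch for weight, which should reproduce the value of about 19.1:

# VIF for weight by hand: regress weight on the other predictors
weight_r2 <- summary(lm(weight ~ age + height + neck + chest + abdom,
                        data = bodyfat))$r.squared
1 / (1 - weight_r2)  # about 19.1, matching vif(body_lm)["weight"]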

Note: your next homework assignment will use this same data set, and you will be asked to fix the assumptions that were broken.

20. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.

Personally, I learned about VIFs. I had been uncomfortable calculating them myself and interpreting their meaning, but now I am much more comfortable talking about them and using them to learn more about my data.

21. Briefly summarize what you learned from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.

(1) The purpose of this data set and analysis was to see whether we could estimate body fat in a simpler way than the underwater weighing technique, by using measurements that are much easier to take: age, weight, height, and the circumference of the neck, chest, and abdomen. (2) We learned that we should not use all of these measurements together to predict body fat, because several of them capture very similar information and end up getting in each other’s way. It would probably be best to change which measurements we use. The main take-away is that we will need to do more analysis to decide the best combination of simple measurements for estimating a person’s true body fat and to build the best model. This first model is reasonably good, but we want to keep refining it to make sure it is sound.