KITADA

Lab Activity #8

Multiple Linear Regression Analysis

Objectives:

I. There is no activity this week.

II. Multiple Linear Regression Examples

Example 1: The Body Fat example

Researchers were interested in what characteristics of a person may explain their body fat (measured as a percentage). Factors researchers thought might be significant predictors of percent body fat were age, height, weight, and various body measurements (such as neck, abdomen, knee, and ankle circumference, just to name a few). In this example, in addition to age, height, and weight as explanatory variables, we’ll use just one of the body measurements (chest circumference) as our last explanatory variable. A full data set (not available for your use) contains these variables and other body measurements. The BODYFAT data set on Canvas contains the following variables:

str(BODYFAT)
## 'data.frame':    251 obs. of  5 variables:
##  $ Fat   : num  12.3 6.1 25.3 10.4 28.7 20.9 19.2 12.4 4.1 11.7 ...
##  $ Age   : int  23 22 22 26 24 24 26 25 25 23 ...
##  $ Weight: num  154 173 154 185 184 ...
##  $ Height: num  67.8 72.2 66.2 72.2 71.2 ...
##  $ Chest : num  93.1 93.6 95.8 101.8 97.3 ...

Use the BODYFAT data set to perform a multiple linear regression analysis to determine which (if any) of Age, Height, Weight, and/or Chest help explain (i.e., “predict”) percent body fat. If any are considered significant predictors of percent body fat, explain their effect on percent body fat.

Step 1: representative sample

1. Even though we do not know how these data were collected, give some arguments both for and against saying the sample is representative of the population of all adults.

(This may not be discussed in lab due to time, but you should always think about why a sample may or may not be representative of a population of interest.)

Step 2: assessing correlation among the explanatory variables

pairs(BODYFAT)

[Figure: scatterplot matrix of Fat, Age, Weight, Height, and Chest, from pairs(BODYFAT)]

cor(BODYFAT)
##                Fat         Age      Weight      Height     Chest
## Fat     1.00000000  0.29351155  0.61084711 -0.02338427 0.7029179
## Age     0.29351155  1.00000000 -0.01251673 -0.24534866 0.1767572
## Weight  0.61084711 -0.01251673  1.00000000  0.48884997 0.8940936
## Height -0.02338427 -0.24534866  0.48884997  1.00000000 0.2278039
## Chest   0.70291786  0.17675715  0.89409364  0.22780393 1.0000000

2. Based on both the scatterplot matrix and the correlation matrix, do you feel any of the explanatory variables are “highly correlated” with another explanatory variable? If so, which are “highly correlated”? Why?

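Weight and Chest appear to be highly correlated (r = 0.894). The scatterplot matrix shows a strong, positive, linear relationship between them. No other pair of explanatory variables has a correlation above 0.5 in absolute value.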
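The same check can be sketched in R directly from the correlation matrix. The snippet below retypes the rounded correlations among the explanatory variables from the cor(BODYFAT) output (so it runs without the data set) and lists every pair whose absolute correlation exceeds a cutoff. The cutoff of 0.8 is an assumption for illustration, not a rule from this lab.

```r
# Correlations among the explanatory variables, copied (rounded) from cor(BODYFAT)
vars <- c("Age", "Weight", "Height", "Chest")
r <- matrix(c( 1.0000, -0.0125, -0.2453, 0.1768,
              -0.0125,  1.0000,  0.4888, 0.8941,
              -0.2453,  0.4888,  1.0000, 0.2278,
               0.1768,  0.8941,  0.2278, 1.0000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

cutoff <- 0.8  # assumed threshold for "highly correlated"

# Look only above the diagonal so each pair is reported once
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(r)[idx[, "row"]],
                         var2 = colnames(r)[idx[, "col"]],
                         r    = r[idx])
high_pairs
```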

3. If two explanatory variables are highly correlated with each other, we should remove one of them. Which one should be removed is up to you, but often a strategy of running simple linear regressions (response variable versus each of the explanatory variables) and/or running a multiple regression with only the explanatory variables that are “highly correlated” as the predictors can help decide which one to remove. If needed, decide which highly correlated variable(s) should be removed.

mod_weight<-with(BODYFAT, lm(Fat~Weight))
summary(mod_weight)
## 
## Call:
## lm(formula = Fat ~ Weight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7382  -4.7052   0.0973   4.9305  21.4419 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.88891    2.57914   -4.61 6.45e-06 ***
## Weight        0.17327    0.01423   12.17  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.616 on 249 degrees of freedom
## Multiple R-squared:  0.3731, Adjusted R-squared:  0.3706 
## F-statistic: 148.2 on 1 and 249 DF,  p-value: < 2.2e-16
#Multiple R-squared:  0.3731

mod_chest<-with(BODYFAT, lm(Fat~Chest))
summary(mod_chest)
## 
## Call:
## lm(formula = Fat ~ Chest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.0630  -4.1219  -0.2985   3.8041  15.2106 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -50.91381    4.50507  -11.30   <2e-16 ***
## Chest         0.69452    0.04454   15.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.944 on 249 degrees of freedom
## Multiple R-squared:  0.4941, Adjusted R-squared:  0.4921 
## F-statistic: 243.2 on 1 and 249 DF,  p-value: < 2.2e-16
#Multiple R-squared:  0.4941

## LET'S KEEP CHEST AND REMOVE WEIGHT: CHEST ALONE EXPLAINS MORE OF THE VARIATION IN FAT (R-SQUARED 0.4941 VS 0.3731)

Note: if a variable is removed at this point, the rest of the analysis is performed without that variable.
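For a simple linear regression, R-squared equals the squared correlation between the response and the explanatory variable, so the two simple regressions can be cross-checked from the correlation matrix alone. A minimal sketch using the correlations with Fat printed by cor(BODYFAT):

```r
# Correlations with Fat, copied from the cor(BODYFAT) output
r_fat <- c(Weight = 0.61084711, Chest = 0.70291786)

# In simple linear regression, R-squared is the squared correlation
r_squared <- r_fat^2
round(r_squared, 4)

# The variable that explains more of the variation in Fat on its own
keep <- names(which.max(r_squared))
keep
```

The rounded values agree with the "Multiple R-squared" lines of the two summaries (0.3731 for Weight and 0.4941 for Chest).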

Step 3: checking for outliers

In addition to using the scatterplot matrix to determine if there are any outliers, a residual plot versus predicted values can tell us if there are any outliers. If there are, a residual plot versus each explanatory variable can help in identifying the outliers.

Additional plots to identify outliers.

4. After obtaining a residual plot versus the predicted values, do you feel there are any outliers? If so, obtain a residual plot versus each explanatory variable to identify the outlier(s).

### FIT MOD LOOK FOR OUTLIERS ###
full_mod<-with(BODYFAT, lm(Fat~Age+Height+Chest))
plot(fitted(full_mod),resid(full_mod),
     xlab="Fitted",
     ylab="Residual")
abline(h=0, lwd=2, lty=2, col="blue")

[Figure: residuals versus fitted values for full_mod (model without Weight)]

## NO APPARENT OUTLIERS

### IF KEEP WEIGHT ###
wfull_mod<-with(BODYFAT, lm(Fat~Age+Weight+Height+Chest))
plot(fitted(wfull_mod),resid(wfull_mod),
     xlab="Fitted",
     ylab="Residual")
abline(h=0, lwd=2, lty=2, col="blue")

[Figure: residuals versus fitted values for wfull_mod (model with Weight)]

## LOOKS LIKE THERE IS AN OUTLIER IN THE LOWER RIGHT

## CHECK AGAINST EACH EXPLANATORY (WITHOUT WEIGHT)
par(mfrow=c(2,2))
with(BODYFAT, plot(Age,resid(full_mod),
     xlab="Age",
     ylab="Residual"))
abline(h=0, lwd=2, lty=2, col="blue")

with(BODYFAT, plot(Height,resid(full_mod),
                   xlab="Height",
                   ylab="Residual"))
abline(h=0, lwd=2, lty=2, col="blue")

with(BODYFAT, plot(Chest,resid(full_mod),
                   xlab="Chest",
                   ylab="Residual"))
abline(h=0, lwd=2, lty=2, col="blue")

[Figure: residuals of full_mod versus Age, Height, and Chest]
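Reading outliers off a plot is subjective, so a numeric screen can help. The sketch below is not run on BODYFAT; it uses simulated data (the variables x and y, the planted outlier in row 10, and the cutoff of 3 are all assumptions for illustration) and flags observations whose standardized residual from rstandard() is large in absolute value.

```r
set.seed(1)

# Simulated data, NOT the BODYFAT data
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[10] <- y[10] + 15          # plant one obvious outlier in row 10

sim_mod <- lm(y ~ x)

# Standardized residuals; |value| > 3 is a common rule of thumb (assumed here)
std_res <- rstandard(sim_mod)
flagged <- which(abs(std_res) > 3)
flagged
```

With the lab's models, the same two lines could be applied to full_mod or wfull_mod in place of sim_mod.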

Step 4: assess the linearity, constant variation, and normality conditions

Plots that are necessary to assess some of these conditions. Obtain the following plots:

5. Using appropriate graphical displays, are all of the explanatory variables linearly related to percent body fat? Support your answer.

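Yes. In the scatterplot matrix, none of Age, Height, or Chest appears to have a curved relationship with Fat. Chest shows the clearest linear relationship (r = 0.70); the relationships with Age (r = 0.29) and Height (r = -0.02) are much weaker.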

6. Using an appropriate graphical display, is the “constant variation” condition satisfied? Support your answer.

Possibly not. In the residual plot versus the fitted values, the spread of the residuals appears to decrease as the fitted value increases, which suggests the variation may not be constant.

7. Using an appropriate graphical display, are the residuals normally distributed? Support your answer.

### WITHOUT WEIGHT
qqnorm(resid(full_mod))
qqline(resid(full_mod))

[Figure: normal Q-Q plot of the residuals of full_mod]

### WITH WEIGHT
qqnorm(resid(wfull_mod))
qqline(resid(wfull_mod))

[Figure: normal Q-Q plot of the residuals of wfull_mod]

8. If one or more of these conditions is not met, what should be done? Is that necessary in this example?

If there were evidence of curvature, we might need to perform a transformation; however, I don't think it's needed in this example.

Steps 5 through 7: the analysis

Once the best-fitting model is obtained (i.e. best satisfies the conditions in step 4), obtain the output from the multiple regression analysis. Use the output to answer the following questions:

### WITHOUT WEIGHT
summary(full_mod)
## 
## Call:
## lm(formula = Fat ~ Age + Height + Chest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1373  -4.5157  -0.5461   3.6722  14.4081 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.28787   10.42073  -2.043 0.042130 *  
## Age           0.08529    0.03019   2.825 0.005121 ** 
## Height       -0.49338    0.14742  -3.347 0.000945 ***
## Chest         0.70678    0.04497  15.716  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.664 on 247 degrees of freedom
## Multiple R-squared:  0.5443, Adjusted R-squared:  0.5388 
## F-statistic: 98.35 on 3 and 247 DF,  p-value: < 2.2e-16
anova(full_mod)
## Analysis of Variance Table
## 
## Response: Fat
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## Age         1 1498.1  1498.1  46.6980 6.426e-11 ***
## Height      1   43.8    43.8   1.3639     0.244    
## Chest       1 7923.7  7923.7 246.9991 < 2.2e-16 ***
## Residuals 247 7923.7    32.1                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
### WITH WEIGHT
summary(wfull_mod)
## 
## Call:
## lm(formula = Fat ~ Age + Weight + Height + Chest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5870  -4.1563  -0.2578   3.7957  13.6203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.52307   15.60591   1.315 0.189707    
## Age          0.11408    0.03062   3.726 0.000241 ***
## Weight       0.12893    0.03646   3.536 0.000485 ***
## Height      -0.88763    0.18219  -4.872 1.98e-06 ***
## Chest        0.32545    0.11645   2.795 0.005601 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.536 on 246 degrees of freedom
## Multiple R-squared:  0.5664, Adjusted R-squared:  0.5593 
## F-statistic: 80.33 on 4 and 246 DF,  p-value: < 2.2e-16
anova(wfull_mod)
## Analysis of Variance Table
## 
## Response: Fat
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## Age         1 1498.1  1498.1  48.8734 2.557e-11 ***
## Weight      1 6567.8  6567.8 214.2709 < 2.2e-16 ***
## Height      1 1543.5  1543.5  50.3561 1.364e-11 ***
## Chest       1  239.4   239.4   7.8114  0.005601 ** 
## Residuals 246 7540.4    30.7                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9. Perform an F-test.

a. State the null and alternative hypotheses in words and notation.

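\( H_0: \beta_1=\beta_2=\beta_3=0 \) (none of Age, Height, or Chest helps predict percent body fat)

\( H_A \): At least one \( \beta_i \neq 0 \) (at least one of Age, Height, or Chest helps predict percent body fat)

For the model that keeps Weight, the null hypothesis also includes \( \beta_4=0 \).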

b. Give the F-statistic with degrees of freedom and the p-value.

### WITHOUT WEIGHT
SSM<-1498.1+43.8+7923.7
SSM
## [1] 9465.6
df_M<-3
MSM<-SSM/df_M
MSM
## [1] 3155.2
SSE<-7923.7 
df_E<-247
MSE<-SSE/df_E
MSE
## [1] 32.07976
F_test<-MSM/MSE
F_test
## [1] 98.35486
pf(F_test,df_M, df_E, lower.tail=FALSE)
## [1] 6.482778e-42
### WITH WEIGHT
SSM<-1498.1+6567.8+1543.5+239.4
SSM
## [1] 9848.8
df_M<-4
MSM<-SSM/df_M
MSM
## [1] 2462.2
SSE<-7540.4
df_E<-246
MSE<-SSE/df_E
MSE
## [1] 30.65203
F_test<-MSM/MSE
F_test
## [1] 80.32746
pf(F_test,df_M, df_E, lower.tail=FALSE)
## [1] 1.638011e-43
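
The F-statistic can also be recovered from R-squared alone, which gives a quick cross-check on the hand calculation: \( F = \frac{R^2/p}{(1-R^2)/(n-p-1)} \), where p is the number of explanatory variables and n = 251 is the number of observations. In the sketch below the helper name f_from_r2 is hypothetical, and because the R-squared values are rounded to four digits the results agree with the output only to about two decimal places.

```r
# F-statistic from R-squared, p predictors, and n observations
f_from_r2 <- function(r2, p, n) {
  (r2 / p) / ((1 - r2) / (n - p - 1))
}

n <- 251  # observations in BODYFAT, from str(BODYFAT)

f_without <- f_from_r2(0.5443, p = 3, n = n)  # model without Weight
f_with    <- f_from_r2(0.5664, p = 4, n = n)  # model with Weight
c(without = f_without, with = f_with)

# Upper-tail p-value, same pf() call as in the hand calculation
pf(f_without, 3, n - 3 - 1, lower.tail = FALSE)
```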

c. State a conclusion in the context of the problem.

There is convincing evidence that at least one of the explanatory variables Age, Height, or Chest is a significant predictor of percent body fat (F = 98.35 on 3 and 247 degrees of freedom, p-value < 0.0001). Therefore, we reject the null hypothesis.

d. Is it necessary to continue with the analysis? Why or why not?

Yes, we should continue because the F-test does not tell us which variable(s) are significant predictors of percent body fat.

10. Perform a t-test on each explanatory variable.

a. What are the null and alternative hypotheses for each t-test?

\( H_0: \beta_i=0 \)

\( H_A: \beta_i \neq 0 \)

b. Give the t-statistics (with degrees of freedom) and p-value for each t-test.

### WITHOUT WEIGHT
summary(full_mod)
## 
## Call:
## lm(formula = Fat ~ Age + Height + Chest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1373  -4.5157  -0.5461   3.6722  14.4081 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.28787   10.42073  -2.043 0.042130 *  
## Age           0.08529    0.03019   2.825 0.005121 ** 
## Height       -0.49338    0.14742  -3.347 0.000945 ***
## Chest         0.70678    0.04497  15.716  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.664 on 247 degrees of freedom
## Multiple R-squared:  0.5443, Adjusted R-squared:  0.5388 
## F-statistic: 98.35 on 3 and 247 DF,  p-value: < 2.2e-16
### WITH WEIGHT
summary(wfull_mod)
## 
## Call:
## lm(formula = Fat ~ Age + Weight + Height + Chest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5870  -4.1563  -0.2578   3.7957  13.6203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.52307   15.60591   1.315 0.189707    
## Age          0.11408    0.03062   3.726 0.000241 ***
## Weight       0.12893    0.03646   3.536 0.000485 ***
## Height      -0.88763    0.18219  -4.872 1.98e-06 ***
## Chest        0.32545    0.11645   2.795 0.005601 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.536 on 246 degrees of freedom
## Multiple R-squared:  0.5664, Adjusted R-squared:  0.5593 
## F-statistic: 80.33 on 4 and 246 DF,  p-value: < 2.2e-16
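
Each t value in the coefficient table is the estimate divided by its standard error, and the p-value is the two-sided tail area of a t distribution with the residual degrees of freedom. A minimal sketch for Age in the model without Weight, using the rounded values from the summary output:

```r
# Values copied from summary(full_mod) for Age
estimate <- 0.08529
std_err  <- 0.03019
df_resid <- 247          # residual degrees of freedom

t_stat  <- estimate / std_err
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

c(t = t_stat, p = p_value)
```

Up to rounding, this should match the Age row of the table (t value 2.825, p-value 0.005121).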

c. For Age only, state a conclusion in the context of the problem.

There is convincing evidence that Age is a significant predictor of percent body fat after accounting for Height and Chest (t = 2.825 on 247 degrees of freedom, p-value = 0.005121). Thus, we reject the null hypothesis.

d. In a backwards selection process, would any of the explanatory variables drop out? Why or why not? If so, which one would drop out first? Why?

### WITHOUT WEIGHT
summary(step(full_mod))
## Start:  AIC=874.49
## Fat ~ Age + Height + Chest
## 
##          Df Sum of Sq     RSS     AIC
## <none>                 7923.7  874.49
## - Age     1     255.9  8179.6  880.47
## - Height  1     359.3  8283.0  883.62
## - Chest   1    7923.7 15847.4 1046.47
## 
## Call:
## lm(formula = Fat ~ Age + Height + Chest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1373  -4.5157  -0.5461   3.6722  14.4081 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.28787   10.42073  -2.043 0.042130 *  
## Age           0.08529    0.03019   2.825 0.005121 ** 
## Height       -0.49338    0.14742  -3.347 0.000945 ***
## Chest         0.70678    0.04497  15.716  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.664 on 247 degrees of freedom
## Multiple R-squared:  0.5443, Adjusted R-squared:  0.5388 
## F-statistic: 98.35 on 3 and 247 DF,  p-value: < 2.2e-16
### WITH WEIGHT
summary(step(wfull_mod))
## Start:  AIC=864.05
## Fat ~ Age + Weight + Height + Chest
## 
##          Df Sum of Sq    RSS    AIC
## <none>                7540.4 864.05
## - Chest   1    239.44 7779.8 869.89
## - Weight  1    383.33 7923.7 874.49
## - Age     1    425.51 7965.9 875.83
## - Height  1    727.54 8267.9 885.17
## 
## Call:
## lm(formula = Fat ~ Age + Weight + Height + Chest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5870  -4.1563  -0.2578   3.7957  13.6203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.52307   15.60591   1.315 0.189707    
## Age          0.11408    0.03062   3.726 0.000241 ***
## Weight       0.12893    0.03646   3.536 0.000485 ***
## Height      -0.88763    0.18219  -4.872 1.98e-06 ***
## Chest        0.32545    0.11645   2.795 0.005601 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.536 on 246 degrees of freedom
## Multiple R-squared:  0.5664, Adjusted R-squared:  0.5593 
## F-statistic: 80.33 on 4 and 246 DF,  p-value: < 2.2e-16
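
The AIC values that step() prints for a linear model can be reproduced as n * log(RSS / n) + 2 * k, where k counts the estimated coefficients including the intercept (this differs from the value returned by AIC() by a constant). The sketch below recomputes a few of the printed values from the RSS column; the helper name aic_lm is hypothetical.

```r
# AIC as reported by step() for lm fits: n * log(RSS / n) + 2 * k
aic_lm <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 251

aic_without <- aic_lm(7923.7, n, k = 4)  # intercept + Age + Height + Chest
aic_with    <- aic_lm(7540.4, n, k = 5)  # adds Weight
round(c(without = aic_without, with = aic_with), 2)

# Dropping Chest from the model with Weight: RSS rises to 7779.8 with k = 4
aic_drop_chest <- aic_lm(7779.8, n, k = 4)
round(aic_drop_chest, 2)
```

Backward selection removes a variable only if doing so lowers the AIC. In both tables every candidate removal has a higher AIC than the <none> row, so no variable drops out of either model.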

11. Using the model after performing a backwards selection process, answer the following questions:

a. Write the least-squares regression equation. Define the terms in the equation.

WITHOUT WEIGHT:

FAT=-21.28787+0.08529*AGE-0.49338*HEIGHT+0.70678*CHEST

WITH WEIGHT:

FAT=20.52307+0.11408*AGE+0.12893*WEIGHT-0.88763*HEIGHT+0.32545*CHEST

b. Interpret the coefficient of Age in the context of the problem. (Be able to interpret the other coefficients as well.)

WITHOUT WEIGHT:

With Height and Chest held constant in the model, each additional year of age is associated with an increase of about 0.085 percentage points in predicted percent body fat.

WITH WEIGHT:

With Weight, Height, and Chest held constant in the model, each additional year of age is associated with an increase of about 0.114 percentage points in predicted percent body fat.
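
Holding the other variables constant can be made concrete by comparing two hypothetical people who differ only in age. In the sketch below, the function name predict_fat and the two example people are assumptions; the coefficients are those of the model without Weight.

```r
# Fitted equation for the model without Weight (coefficients from summary(full_mod))
predict_fat <- function(age, height, chest) {
  -21.28787 + 0.08529 * age - 0.49338 * height + 0.70678 * chest
}

# Two hypothetical people: same height and chest, one year apart in age
fat_30 <- predict_fat(age = 30, height = 72, chest = 105)
fat_31 <- predict_fat(age = 31, height = 72, chest = 105)

# The difference equals the Age coefficient
fat_31 - fat_30
```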

c. Predict percent body fat for a 30 year old who is 72 inches tall, weighs 180 pounds, and has a chest circumference of 105 cm. (If one or more of these variables is not in the final model, ignore its value.) Use R to obtain this predicted value. In addition, obtain and interpret a 95% prediction interval for this person.

## WITHOUT WEIGHT
predict.lm(full_mod, newdata=data.frame(Age=30, Height=72, Chest=105),
           interval="prediction", level=0.95)
##       fit      lwr      upr
## 1 19.9587 8.737822 31.17957
## WITH WEIGHT
predict.lm(wfull_mod, newdata=data.frame(Age=30, Height=72, Weight=180, Chest=105),
           interval="prediction", level=0.95)
##        fit      lwr      upr
## 1 17.41599 6.356406 28.47558
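
The point prediction (the fit column) is the regression equation evaluated at the new values, so it can be checked by hand. The sketch below uses the rounded coefficients from the summary output, so small differences in the third decimal place are expected. The prediction interval itself needs the fitted model and is not reproduced here.

```r
# New person from the question
new <- c(Age = 30, Weight = 180, Height = 72, Chest = 105)

# Coefficients copied from the summary() output of each model
b_without <- c(Intercept = -21.28787, Age = 0.08529, Height = -0.49338, Chest = 0.70678)
b_with    <- c(Intercept = 20.52307, Age = 0.11408, Weight = 0.12893,
               Height = -0.88763, Chest = 0.32545)

# Intercept plus the sum of coefficient * value for each variable in the model
fit_without <- b_without[["Intercept"]] + sum(b_without[-1] * new[names(b_without)[-1]])
fit_with    <- b_with[["Intercept"]]    + sum(b_with[-1]    * new[names(b_with)[-1]])

c(without = fit_without, with = fit_with)
```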

d. What percent of the variation in percent body fat is explained by this regression model?

## WITHOUT WEIGHT 
#Multiple R-squared:  0.5443, so about 54.4% of the variation in percent body fat is explained

## WITH WEIGHT
#Multiple R-squared:  0.5664, so about 56.6% of the variation in percent body fat is explained
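
R-squared is the model sum of squares divided by the total sum of squares, and adjusted R-squared additionally penalizes for the number of explanatory variables. A sketch that recomputes both from the sums of squares in the ANOVA tables (the helper name r2_from_ss is hypothetical):

```r
# R-squared and adjusted R-squared from the model and error sums of squares
r2_from_ss <- function(ssm, sse, df_m, df_e) {
  sst <- ssm + sse
  c(r2     = ssm / sst,
    adj_r2 = 1 - (sse / df_e) / (sst / (df_m + df_e)))
}

without_w <- r2_from_ss(ssm = 9465.6, sse = 7923.7, df_m = 3, df_e = 247)
with_w    <- r2_from_ss(ssm = 9848.8, sse = 7540.4, df_m = 4, df_e = 246)
round(rbind(without_w, with_w), 4)
```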

e. What is the estimate of \( \sigma \), the standard deviation of the residuals?

## WITHOUT WEIGHT 
anova(full_mod)
## Analysis of Variance Table
## 
## Response: Fat
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## Age         1 1498.1  1498.1  46.6980 6.426e-11 ***
## Height      1   43.8    43.8   1.3639     0.244    
## Chest       1 7923.7  7923.7 246.9991 < 2.2e-16 ***
## Residuals 247 7923.7    32.1                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sqrt(32.1)
## [1] 5.665686
## WITH WEIGHT
anova(wfull_mod)
## Analysis of Variance Table
## 
## Response: Fat
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## Age         1 1498.1  1498.1  48.8734 2.557e-11 ***
## Weight      1 6567.8  6567.8 214.2709 < 2.2e-16 ***
## Height      1 1543.5  1543.5  50.3561 1.364e-11 ***
## Chest       1  239.4   239.4   7.8114  0.005601 ** 
## Residuals 246 7540.4    30.7                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sqrt(30.7)
## [1] 5.540758
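
Taking the square root of the rounded mean square (32.1 or 30.7) introduces a small rounding error. Using the unrounded SSE divided by its degrees of freedom instead reproduces the "Residual standard error" line of the summary output. A minimal sketch from the ANOVA table values:

```r
# Residual standard error = sqrt(SSE / residual df)
sigma_without <- sqrt(7923.7 / 247)  # model without Weight
sigma_with    <- sqrt(7540.4 / 246)  # model with Weight

round(c(without = sigma_without, with = sigma_with), 3)
```

With the fitted models available, sigma(full_mod) and sigma(wfull_mod) should return the same estimates.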