Setup

library(moderndive) 
library(Stat2Data) 
data("Cereal")

Question 1 - Out of Breath!

Consider the following regression table, which describes a multiple regression model of active heart rates (after climbing several flights of stairs) based on resting heart rates (i.e., Rest, measured in beats per minute), weight (i.e., Wgt, measured in pounds), and amount of Exercise (i.e., Exc, measured in hours per week).

- Symbolic fitted model:

\[ \hat{y} = \hat\beta_0 + (\hat\beta_1 \cdot x_1)+ (\hat\beta_2 \cdot x_2) + (\hat\beta_3 \cdot x_3) \] or \[ \widehat{heartrate} = \hat\beta_0 + (\hat\beta_1 \cdot Rest)+ (\hat\beta_2 \cdot Wgt) + (\hat\beta_3 \cdot Exc) \]

a) Test the hypotheses that β2 = 0 versus β2 ≠ 0 and interpret the result in the context of this problem. You may assume that the conditions for a linear model are satisfied for these data.

HYPOTHESES

  • Null hypothesis (H0): \(\beta_2 = 0\), i.e., weight (Wgt) is not a useful predictor of active heart rate once resting heart rate and exercise are in the model.

  • Alternative hypothesis (H1): \(\beta_2 \ne 0\), i.e., weight (Wgt) is a useful predictor of active heart rate once resting heart rate and exercise are in the model.

INTERPRETATIONS: T-STATISTIC AND P-VALUE

  • The p-value for the Wgt coefficient is 0.282, which is greater than the 5% significance level (0.05). Thus, we fail to reject the null hypothesis. There is not enough evidence to conclude that weight is a significant predictor of active heart rate once resting heart rate and exercise are in the model.

  • Additionally, the t-statistic (1.63) is relatively small in magnitude, which is likewise consistent with a Wgt coefficient that is not significantly different from 0.

CONCLUSIONS

  • We fail to reject the null hypothesis. Therefore, there is insufficient evidence that weight (\(\beta_2\)) contributes to predicting active heart rate in this model.
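
For reference, the t-statistic reported in a regression table for a single coefficient is the estimate divided by its standard error, and it is compared against a t distribution with \(n - k - 1\) degrees of freedom. The symbols \(SE_{\hat\beta_2}\), \(n\) (number of students), and \(k = 3\) (number of predictors) are notation introduced for this sketch; their values would come from the regression table:

\[ t = \frac{\hat\beta_2 - 0}{SE_{\hat\beta_2}}, \qquad \text{p-value} = 2 \cdot P(T_{n-k-1} \ge |t|) \]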

b) Interpret the value of the coefficient for the Exc variable in this context.

  • Holding resting heart rate and weight constant, the coefficient of Exc (-1.08) suggests that each additional hour of exercise per week is associated with a decrease of 1.08 beats per minute in predicted active heart rate.
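
One way to see where this interpretation comes from is to compare the fitted values for two students with the same Rest and Wgt whose weekly exercise differs by one hour. Using the symbols from the fitted model above (the notation \(\hat{y}(x_3)\) for the fitted value at exercise level \(x_3\) is an assumption of this sketch), every term except the Exc term cancels:

\[ \hat{y}(x_3 + 1) - \hat{y}(x_3) = \hat\beta_3 \cdot (x_3 + 1) - \hat\beta_3 \cdot x_3 = \hat\beta_3 = -1.08 \]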

c) What active pulse rate would this model predict for a 200-pound student who exercises 7 hours per week and has a resting pulse rate of 76 beats per minute?

11.84 + (1.11 * 76) + (0.03 * 200) + (-1.08 * 7)
## [1] 94.64
  • The model predicts an active pulse rate of 94.64 beats per minute.
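
The same prediction can be sketched with a named coefficient vector, which keeps each coefficient paired with its predictor. The object names `coefs`, `student`, and `predicted_active` are made up for this illustration; the coefficient values are the rounded ones used in the calculation above.

```r
# Rounded coefficients as used in the calculation above
coefs <- c(Intercept = 11.84, Rest = 1.11, Wgt = 0.03, Exc = -1.08)

# Predictor values for the student; the leading 1 multiplies the intercept
student <- c(Intercept = 1, Rest = 76, Wgt = 200, Exc = 7)

# Check that both vectors list the terms in the same order before multiplying
stopifnot(identical(names(coefs), names(student)))

# Predicted active pulse rate (beats per minute)
predicted_active <- sum(coefs * student)
predicted_active
```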

Question 2 - Return to Breakfast Island

In a previous problem set, you modeled the number of calories in a serving of breakfast cereal as a function of the number of grams of sugar in the cereal. In this problem, you’ll revisit the breakfast cereal data, this time with a multiple regression model that predicts the number of calories in a serving of breakfast cereal as a function of the grams of sugar and the grams of fiber per serving.

cereal_model_2 <- lm(formula = Calories ~ Sugar + Fiber, data = Cereal) 
cereal_model_2
## 
## Call:
## lm(formula = Calories ~ Sugar + Fiber, data = Cereal)
## 
## Coefficients:
## (Intercept)        Sugar        Fiber  
##     109.308        1.005       -3.744
  • The fitted model: \[\hat{y} = 109.3 + 1.005 \cdot sugar - 3.744 \cdot fiber \]

a) Report the ANOVA table for this model, and use it to calculate the R^2 value for this model (unadjusted), and interpret the value in the context of this model.

# ANOVA table
anova(cereal_model_2)
##           Df   Sum Sq   Mean Sq  F value    Pr(>F)
## Sugar      1 4567.222 4567.2222 19.21640 0.0001119
## Fiber      1 4782.973 4782.9735 20.12416 0.0000831
## Residuals 33 7843.214  237.6732       NA        NA
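# Calculating R^2 = SSModel / SSTotal
(4567.2 + 4783.0) / (4567.2 + 4783.0 + 7843.2)
## [1] 0.543825
  • R-squared formula = \(SS_{Model} / SS_{Total}\), where \(SS_{Model} = 4567.2 + 4783.0 = 9350.2\) (Sugar plus Fiber) and \(SS_{Total} = SS_{Model} + SS_{Error} = 9350.2 + 7843.2 = 17193.4\)

  • Hence, R-squared = 9350.2/17193.4 = 0.5438

  • Thus, 54.4% of the variation in cereal calories can be explained by the regression model based on sugar and fiber.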
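
As a cross-check on a hand calculation like this, R^2 can be computed from the sums of squares in the ANOVA table: the Sugar and Fiber rows together give SSModel, and adding the Residuals row gives SSTotal. The sketch below types in the Sum Sq column from the table above; the names `sum_sq`, `ss_model`, `ss_total`, and `r_squared` are illustrative.

```r
# Sum Sq column copied from the ANOVA table above
sum_sq <- c(Sugar = 4567.222, Fiber = 4782.973, Residuals = 7843.214)

# SSModel is the variability explained by the predictors together
ss_model <- sum(sum_sq[c("Sugar", "Fiber")])

# SSTotal is explained plus unexplained variability
ss_total <- sum(sum_sq)

# Unadjusted R^2 = SSModel / SSTotal
r_squared <- ss_model / ss_total
r_squared
```

With the fitted model available, `summary(cereal_model_2)$r.squared` should report the same quantity.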

b) Calculate the standard error of this multiple regression model.

# Calculating the SE:
 237.7^(0.5)
## [1] 15.41752
  • Standard Error formula = \(\sqrt{MS_{Error}}\) => \(\sqrt{237.7}\)

  • Hence, SE = 15.417 calories
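
The regression standard error can also be obtained from the residual sum of squares and its degrees of freedom, which avoids rounding the mean square first. This is a small sketch using the Residuals row of the ANOVA table; `ss_error`, `df_error`, `ms_error`, and `reg_se` are illustrative names.

```r
# Residuals row of the ANOVA table above
ss_error <- 7843.214
df_error <- 33

# Mean square error, then its square root
ms_error <- ss_error / df_error
reg_se <- sqrt(ms_error)
reg_se
```

If the fitted model object is in the workspace, `sigma(cereal_model_2)` is another way to extract this value.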

c) What information does the F statistic for a multiple regression model provide?

  1. The F statistic tests whether the predictors in the model, taken together, explain a significant amount of the variability in the outcome variable. Thus, it tests the overall effectiveness of a multiple regression model. The hypotheses it tests are:
  • Null Hypothesis: \[\beta_1 = \beta_2 = \dots = \beta_k = 0\]
  • Alternate Hypothesis: \[\beta_i \ne 0 \text{ for at least one } i\]
  2. When the null hypothesis is true, both MSModel and MSError estimate \(σ^2\), and the value of F should be close to 1. However, if at least one coefficient is nonzero, then MSModel tends to exceed \(σ^2\), and F is expected to be greater than 1.
  • Thus, the larger the F value, the more evidence against the null hypothesis. A larger F statistic is stronger evidence that at least one explanatory variable in the model has a relationship with the outcome.
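
In symbols, with \(k\) predictors and \(n\) observations (notation assumed here rather than taken from the problem), the statistic is the ratio of the two mean squares:

\[ F = \frac{MS_{Model}}{MS_{Error}} = \frac{SS_{Model} / k}{SS_{Error} / (n - k - 1)} \]

For the cereal model, \(k = 2\) and the ANOVA table gives \(n - k - 1 = 33\) residual degrees of freedom, so F is compared against an F distribution with 2 and 33 degrees of freedom.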

d) Use the following command to compute the p-value for this model’s F statistic (replace the F placeholder with the appropriate F value, replace the x and y placeholders with the appropriate degrees of freedom). What does this p-value tell you about the variables in this model?

# Calculate F-statistic:
MS_Model <- (4567.2 + 4783.0)/2
F_stat <- MS_Model/237.7
pf(F_stat, df1 = 2, df2 = 33, lower.tail = FALSE)
## [1] 2.377482e-06
  • Thus, the p value of the F statistic = 2.377482e-06

  • The p-value indicates strong evidence against the null hypothesis, which is rejected even at the 1% significance level (since 2.377482e-06 < 0.01). The explanatory variables in this model (Sugar and Fiber), taken together, explain a significant amount of the variability in the outcome, i.e., at least one of them has a significant relationship with Calories.

Question 3 - Breakfast strikes back

Just as you did in the previous problem set, fit the simple linear regression model of total calories as a function of grams of sugar using the Cereal data set.

# Simple linear regression model: 
cereal_model  <- lm(formula = Calories ~ Sugar, data = Cereal)
cereal_model
## 
## Call:
## lm(formula = Calories ~ Sugar, data = Cereal)
## 
## Coefficients:
## (Intercept)        Sugar  
##      87.428        2.481

Now, compare the estimated effect of sugar content in the simple linear regression model to the estimated effect of sugar content in the multiple regression model from Question 2. Briefly explain how the estimated effect of sugar content has changed between the two models.

  1. Simple Linear regression model: \(\hat{calories} = 87.428 + 2.481 \cdot sugar\)
  2. Multiple regression model: \(\hat{calories} = 109.3 + 1.005 \cdot sugar - 3.744 \cdot fiber\)
  • In the simple linear regression model, as the amount of sugar increases by 1 gram, estimated calories increase by 2.481 calories.

  • On the other hand, the multiple regression model estimates that as the amount of sugar increases by 1 gram, estimated calories increase by 1.005 calories, holding the other explanatory variable (i.e., fiber) constant.

  • The estimated effect of sugar is positive in both models. However, in the multiple regression model, when fiber is also taken into account, the magnitude of sugar’s effect decreases from 2.481 to 1.005. This implies that sugar and fiber are correlated in these data, so the simple model’s sugar coefficient partly reflects differences in fiber (for instance, higher-sugar cereals may tend to contain less fiber). Additionally, each additional gram of fiber is associated with a decrease of 3.744 in estimated calories, holding sugar constant. Thus, accounting for fiber gives a clearer picture of sugar’s own relationship with calories.
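
The size of the change in the sugar coefficient can be connected to fiber through a standard least-squares identity: when both models are fit to the same observations, the simple-regression slope equals the multiple-regression slope plus the fiber coefficient times the slope from regressing Fiber on Sugar. The sketch below backs out that implied slope from the rounded coefficients printed above; the variable names are illustrative, and the result is only as precise as the rounding.

```r
# Coefficients printed above
sugar_simple   <- 2.481   # Calories ~ Sugar
sugar_multiple <- 1.005   # Calories ~ Sugar + Fiber
fiber_multiple <- -3.744  # Calories ~ Sugar + Fiber

# Identity: sugar_simple = sugar_multiple + fiber_multiple * slope(Fiber ~ Sugar)
implied_fiber_on_sugar <- (sugar_simple - sugar_multiple) / fiber_multiple
implied_fiber_on_sugar
```

The implied slope is negative, which suggests that cereals with more sugar tend to have less fiber in this sample; fitting `lm(Fiber ~ Sugar, data = Cereal)` directly would be the way to confirm it.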