Problem 1. Before answering the following questions, take a few minutes to look through the components of the built-in R data set “trees,” which contains data on the girth (diameter), height, and volume of n = 31 felled black cherry trees. The predictors are “Girth” and “Height” and the response is “Volume.” Then, against a significance level of α = 0.05, complete the following.
a) Using the “ggpairs()” function, generate a scatterplot array which shows the association and correlation values between the predictors and between each predictor and the response. Are there indications of a correlation between the response and each predictor? Explain your reasoning.
library(GGally)  # provides ggpairs()
ggpairs(trees)
The scatterplot matrix shows positive associations between Volume and both predictors. Volume is very strongly correlated with Girth (r = 0.967) and moderately correlated with Height (r = 0.598); the two predictors are themselves moderately correlated (r = 0.519). Both Volume panels trend upward, tightly for Girth and with considerably more scatter for Height, so there are indications of a positive correlation between the response and each predictor, with Girth showing by far the stronger relationship.
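Since the problem sets α = 0.05, the visual impression can be backed by a formal test of each correlation. The following is a minimal sketch, assuming a two-sided Pearson test is the intended check; the names `girth_test`, `height_test`, and `cor_summary` are introduced here for illustration only.

```r
# Test H0: rho = 0 for each predictor against Volume (Pearson, two-sided).
girth_test  <- cor.test(trees$Girth,  trees$Volume)
height_test <- cor.test(trees$Height, trees$Volume)

# Collect the estimates and p-values in one small table.
cor_summary <- data.frame(
  predictor = c("Girth", "Height"),
  r         = unname(c(girth_test$estimate, height_test$estimate)),
  p_value   = c(girth_test$p.value, height_test$p.value)
)

# Flag correlations that are significant at alpha = 0.05.
cor_summary$significant <- cor_summary$p_value < 0.05
print(cor_summary)
```

A p-value below 0.05 in this table supports the claim that the corresponding population correlation differs from zero.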
b) Generate the multiple linear regression model between the predictors and the response. What are your beta coefficients? Should both predictors be included in the model? Explain your reasoning.
tree_model <- lm(Volume ~ Girth + Height, data = trees)
summary(tree_model)
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.4065 -2.6493 -0.2876  2.2003  8.4847
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Girth         4.7082     0.2643  17.816  < 2e-16 ***
## Height        0.3393     0.1302   2.607   0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Model: Predicted Volume = -57.9877 + 4.7082(Girth) + 0.3393(Height)
The coefficient for Girth, β₁, indicates that, holding height constant, each 1-inch increase in girth corresponds to an increase of approximately 4.71 cubic feet in predicted volume. The coefficient for Height, β₂, indicates that, holding girth constant, each 1-foot increase in height corresponds to an increase of about 0.34 cubic feet in predicted volume. The intercept, β₀ = -57.9877, is the predicted volume for a tree with zero girth and zero height; that point lies far outside the observed data, so β₀ serves only to position the fitted plane and has no practical meaning in this context.
Both predictors are statistically significant at α = 0.05 (Girth p < 2e-16, Height p = 0.0145), and the model explains a very high proportion of the variation in tree volume (R² = 0.948, adjusted R² = 0.9442). Therefore, both Girth and Height should be included in the model, as they meaningfully contribute to predicting the volume of cherry trees.
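The same conclusion can be read from confidence intervals for the coefficients. The sketch below refits the part (b) model so it runs on its own and uses `confint()`; the object names `ci` and `excludes_zero` are illustrative and not part of the original answer.

```r
# Refit the additive model so the snippet is self-contained.
tree_model <- lm(Volume ~ Girth + Height, data = trees)

# 95% confidence intervals for the intercept and both slopes.
ci <- confint(tree_model, level = 0.95)
print(ci)

# A coefficient is significant at alpha = 0.05 exactly when its
# 95% interval does not contain zero.
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
print(excludes_zero)
```

An interval that excludes zero for both Girth and Height agrees with the t-tests in the summary output.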
c) Add an interaction term between the predictors “Girth” and “Height” to your fitted regression model from part (b). Discuss the differences between the model summary from part (b) and the model summary with the added interaction term. Do these differences indicate an interaction term should be included in the model? Explain your reasoning.
tree_model_interaction <- lm(Volume ~ Girth * Height, data = trees)
summary(tree_model_interaction)
##
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.5821 -1.0673  0.3026  1.5641  4.6649
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  69.39632   23.83575   2.911  0.00713 **
## Girth        -5.85585    1.92134  -3.048  0.00511 **
## Height       -1.29708    0.30984  -4.186  0.00027 ***
## Girth:Height  0.13465    0.02438   5.524 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
When adding the interaction term between Girth and Height, the fitted model becomes:
Predicted Volume = 69.40 − 5.86(Girth) − 1.30(Height) + 0.135(Girth×Height)
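All terms, including the interaction, are statistically significant at α = 0.05. Compared with the model from part (b), the residual standard error decreased from 3.88 to 2.71 and R² increased from 0.948 to 0.976. Because R² can never decrease when a term is added, the rise in adjusted R² from 0.9442 to 0.9728 is the more telling comparison, and it also indicates a better fit. The Girth and Height coefficients changed sign (from 4.71 to −5.86 and from 0.34 to −1.30), but with the interaction present each one describes the slope only when the other predictor equals zero, so they can no longer be read as stand-alone effects. The significant interaction term indicates that the effect of Girth on Volume depends on the tree’s Height (and vice versa) rather than being constant. Therefore, the interaction term should be included, as it meaningfully improves model fit and captures the combined effect of Girth and Height on Volume.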
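Two quick checks can make this comparison concrete. The sketch below, which refits both models so it runs on its own, uses `anova()` for a partial F-test of the nested models and then computes the slope of Volume with respect to Girth at a few heights. The heights 65, 75, and 85 ft are illustrative values chosen inside the observed range, and `f_test`, `b`, and `girth_slope` are names introduced here.

```r
# Refit both models so the snippet is self-contained.
tree_model             <- lm(Volume ~ Girth + Height, data = trees)
tree_model_interaction <- lm(Volume ~ Girth * Height, data = trees)

# Partial F-test: does adding Girth:Height significantly reduce
# the residual sum of squares?
f_test <- anova(tree_model, tree_model_interaction)
print(f_test)

# With an interaction, the Girth slope depends on Height:
#   slope = b_Girth + b_Girth:Height * Height
b <- coef(tree_model_interaction)
heights <- c(65, 75, 85)  # illustrative heights (ft), assumed for this sketch
girth_slope <- unname(b["Girth"] + b["Girth:Height"] * heights)
print(data.frame(Height = heights, girth_slope = girth_slope))
```

Although the Girth coefficient alone is negative, the combined slope is positive at every height in this range and grows with Height, which is what the interaction term expresses.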
d) Check the fit of the models in parts (b) and (c) using the appropriate residual plots. Based on these plots and your comparison between the models in part (c), which model would you recommend to predict the volume of cherry trees and why?
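library(ggplot2)  # needed for ggplot(), geom_qq(), and the other plotting layers
# Raw and studentized residuals for the model without the interaction
tree_resid <- resid(tree_model)
tree_stud <- rstudent(tree_model)
# Raw and studentized residuals for the model with the interaction
tree_resid_inter <- resid(tree_model_interaction)
tree_stud_inter <- rstudent(tree_model_interaction)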
ggplot(data = NULL, aes(sample = tree_stud)) +
  geom_qq(color = "steelblue", shape = 22, size = 2) +
  geom_qq_line(color = "grey25", linewidth = 1) +
  labs(title = "Normality Plot of Studentized Residuals (No Interaction)",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles")
ggplot(data = NULL, aes(sample = tree_stud_inter)) +
  geom_qq(color = "darkgreen", shape = 22, size = 2) +
  geom_qq_line(color = "grey25", linewidth = 1) +
  labs(title = "Normality Plot of Studentized Residuals (With Interaction)",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles")
ggplot(data = NULL, aes(x = fitted(tree_model), y = tree_resid)) +
  geom_point(color = "grey28", shape = 22, size = 2) +
  geom_hline(yintercept = 0, linewidth = 1, color = "steelblue") +
  labs(title = "Residuals vs Fitted (No Interaction)",
       x = "Fitted Values",
       y = "Residuals")
ggplot(data = NULL, aes(x = fitted(tree_model_interaction), y = tree_resid_inter)) +
  geom_point(color = "grey28", shape = 22, size = 2) +
  geom_hline(yintercept = 0, linewidth = 1, color = "darkgreen") +
  labs(title = "Residuals vs Fitted (With Interaction)",
       x = "Fitted Values",
       y = "Residuals")
Both models show roughly linear QQ plots of studentized residuals, indicating that the normality assumption is reasonably satisfied. However, the Residuals vs Fitted plot for the model without the interaction shows slight curvature, which suggests unmodeled nonlinearity, and a wider spread of residuals. In contrast, the residuals from the model with the interaction term are more evenly scattered around zero and span a smaller range, indicating a better fit and more consistent variance. Considering both the improved fit from part (c) (lower residual standard error, higher R² and adjusted R²) and the cleaner residual patterns, the model including the Girth × Height interaction is recommended for predicting the volume of cherry trees.
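Reading residual plots is subjective, so a numeric companion may help. The sketch below is an informal supplement, not part of the original analysis: it runs a Shapiro-Wilk test on the studentized residuals and fits an auxiliary regression of the residuals on the fitted values and their square as a rough check for curvature. The helper `check_residuals()` and the column names are hypothetical, and the p-values should be treated as rough guides rather than formal tests.

```r
# Refit both models so the snippet is self-contained.
tree_model             <- lm(Volume ~ Girth + Height, data = trees)
tree_model_interaction <- lm(Volume ~ Girth * Height, data = trees)

# Informal numeric companions to the two plot types:
#   shapiro_p   - Shapiro-Wilk p-value for the studentized residuals (QQ plot)
#   curvature_p - p-value of the quadratic term when residuals are regressed
#                 on the fitted values (bend in the residuals-vs-fitted plot)
check_residuals <- function(model) {
  d <- data.frame(res = resid(model), fit = fitted(model))
  aux <- lm(res ~ fit + I(fit^2), data = d)
  c(
    shapiro_p   = shapiro.test(rstudent(model))$p.value,
    curvature_p = summary(aux)$coefficients["I(fit^2)", "Pr(>|t|)"]
  )
}

diagnostics <- rbind(
  no_interaction   = check_residuals(tree_model),
  with_interaction = check_residuals(tree_model_interaction)
)
print(round(diagnostics, 4))
```

A small `curvature_p` points to a systematic bend in the residuals, while a large `shapiro_p` is consistent with approximately normal residuals.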
e) Generate and interpret the prediction interval for the expected volume, using your recommended model from part (d), if a cherry tree has a girth of 18” and a height of 70’.
new_tree <- data.frame(Girth = 18, Height = 70)
predict(tree_model_interaction, newdata = new_tree, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 42.85975 36.03922 49.68029
Using the regression model with the Girth × Height interaction, the predicted volume for a cherry tree with a girth of 18 inches and a height of 70 feet is approximately 42.86 cubic feet. The 95% prediction interval runs from 36.04 to 49.68 cubic feet, meaning we can be 95% confident that the actual volume of an individual tree with these dimensions falls within this range. The interval accounts for both the uncertainty in the estimated regression surface and the random variation of an individual tree around it, which makes it appropriate for the volume of a specific tree rather than the average volume of all such trees.
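To see the difference between predicting one tree and estimating the mean, the prediction interval can be set beside the confidence interval for the same inputs. This is a minimal sketch that refits the interaction model so it runs on its own; `pi_95`, `ci_95`, and `intervals` are names introduced for illustration.

```r
# Refit the interaction model so the snippet is self-contained.
tree_model_interaction <- lm(Volume ~ Girth * Height, data = trees)
new_tree <- data.frame(Girth = 18, Height = 70)

# Interval for the volume of one new tree with these dimensions.
pi_95 <- predict(tree_model_interaction, newdata = new_tree,
                 interval = "prediction", level = 0.95)

# Interval for the mean volume of all trees with these dimensions.
ci_95 <- predict(tree_model_interaction, newdata = new_tree,
                 interval = "confidence", level = 0.95)

# Same point estimate; the prediction interval is the wider of the two.
intervals <- rbind(prediction = pi_95[1, ], confidence = ci_95[1, ])
print(intervals)
```

Both intervals share the same fitted value, but the prediction interval is wider because it adds the residual variation of an individual tree to the uncertainty in the estimated mean.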