Using R, build a multiple regression model for data that interests
you. Include in this model at least one quadratic term, one dichotomous
term, and one dichotomous vs. quantitative interaction term. Interpret
all coefficients. Conduct residual analysis. Was the linear model
appropriate? Why or why not?
For this multiple regression model I chose the trees dataset that ships
with R (in the built-in datasets package).
# Load the packages needed
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
## Warning: package 'broom' was built under R version 4.2.2
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.2.2
The trees dataset contains the Girth (in inches), Height (in feet) and
Volume (in cubic feet) of 31 felled black cherry trees.
data("trees")
head(trees)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
Multiple Regression
I start by plotting the dataset:
pairs(trees, gap=0.5)
(Scatterplot matrix of Girth, Height and Volume.)
I first fit a simple linear model of Volume against
Girth:
m1 = lm(Volume~Girth,data=trees)
summary(m1)
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## Girth 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
For the Multiple Regression I included Height as an additional
variable:
m2 = lm(Volume~Girth+Height,data=trees)
summary(m2)
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## Girth 4.7082 0.2643 17.816 < 2e-16 ***
## Height 0.3393 0.1302 2.607 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Note that the R^2 has improved, yet the Height term is less
significant than the other two parameters.
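Whether Height earns its place can also be judged with a nested-model
F-test comparing the two fits above (a quick check; output omitted here):
# Partial F-test for adding Height to the Girth-only model
anova(m1, m2)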
All Possible Regression
I decided to first try all possible regressions, fitting each candidate
combination of terms. Next I include the interaction term between Girth
and Height:
m3 = lm(Volume~Girth*Height,data=trees)
summary(m3)
##
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5821 -1.0673 0.3026 1.5641 4.6649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.39632 23.83575 2.911 0.00713 **
## Girth -5.85585 1.92134 -3.048 0.00511 **
## Height -1.29708 0.30984 -4.186 0.00027 ***
## Girth:Height 0.13465 0.02438 5.524 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
All terms are highly significant. Note that Height is more significant
than in the previous model, despite the introduction of an additional
parameter.
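The power relationship suggested by the data can be examined with a
log-log model. This is the model m4 whose full summary appears in the AIC
section below; here I fit it and inspect the confidence intervals for its
coefficients (output not reproduced here):
m4 = lm(log(Volume)~log(Girth)+log(Height),data=trees)
# Confidence intervals for the fitted exponents of Girth and Height
confint(m4)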
Looking at the confidence intervals for the parameters reveals that
the estimated power of Girth is around 2 and that of Height is around 1.
This makes a lot of sense, given the well-known dimensional relationship
between Volume, Girth and Height: the volume of a cylinder (or a cone)
scales with the square of its diameter times its height.
I'll now add the interaction term to the log-log model.
m5 = lm(log(Volume)~log(Girth)*log(Height),data=trees)
summary(m5)
##
## Call:
## lm(formula = log(Volume) ~ log(Girth) * log(Height), data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.165941 -0.048613 0.006384 0.062204 0.132295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6869 7.6996 -0.479 0.636
## log(Girth) 0.7942 3.0910 0.257 0.799
## log(Height) 0.4377 1.7788 0.246 0.808
## log(Girth):log(Height) 0.2740 0.7124 0.385 0.704
##
## Residual standard error: 0.08265 on 27 degrees of freedom
## Multiple R-squared: 0.9778, Adjusted R-squared: 0.9753
## F-statistic: 396.4 on 3 and 27 DF, p-value: < 2.2e-16
The R^2 value has increased (of course, since all we have done is add
another parameter), but interestingly none of the four terms is
individually significant. This does not mean the terms are useless;
rather, no single term is vital on its own because the predictors carry
overlapping information (they are highly collinear). So we will revert
back to the previous model.
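The overlap can be seen directly by inspecting the correlations among the
predictor columns of m5 (a quick check; output omitted here):
# Correlations among log(Girth), log(Height) and their product;
# near-perfect correlations are what inflate the standard errors in m5
X = model.matrix(m5)[, -1]   # drop the intercept column
cor(X)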
Since it is reasonable to expect the power of Girth to be 2 and that of
Height to be 1, we now fix those exponents and estimate only the one
remaining parameter (the multiplicative constant).
m6 = lm(log(Volume)-log((Girth^2)*Height)~1,data=trees)
summary(m6)
##
## Call:
## lm(formula = log(Volume) - log((Girth^2) * Height) ~ 1, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.168446 -0.047355 -0.003518 0.066308 0.136467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.16917 0.01421 -434.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0791 on 30 degrees of freedom
Note that there is no R^2 (as only the intercept was included in the
model), and that the residual standard error is not comparable with the
previous models because the response variable has changed.
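On the original scale m6 corresponds to Volume = exp(-6.169) * Girth^2 *
Height, so one way to put its fit on a footing comparable with the
Volume-scale models is to back-transform and measure the error on the
Volume scale (a sketch):
# Back-transform the intercept-only log model to predicted volumes
pred6 = exp(coef(m6)[1]) * trees$Girth^2 * trees$Height
sqrt(mean((trees$Volume - pred6)^2))   # root-mean-square error on the Volume scale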
We can alternatively construct a model with Volume itself as the
response, so that the error term is additive rather than multiplicative.
m7 = lm(Volume~0+I(Girth^2):Height,data=trees)
summary(m7)
##
## Call:
## lm(formula = Volume ~ 0 + I(Girth^2):Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6696 -1.0832 -0.3341 1.6045 4.2944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## I(Girth^2):Height 2.108e-03 2.722e-05 77.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.455 on 30 degrees of freedom
## Multiple R-squared: 0.995, Adjusted R-squared: 0.9949
## F-statistic: 5996 on 1 and 30 DF, p-value: < 2.2e-16
Note that the parameter estimates for the last two models are slightly
different; this is due to the difference in the error model
(multiplicative log-Normal errors for m6 versus additive Normal errors
for m7).
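The difference is easy to check by back-transforming the m6 intercept
(the numbers come from the summaries above):
exp(coef(m6))   # multiplicative-error estimate: exp(-6.16917), roughly 0.00209
coef(m7)        # additive-error estimate: roughly 0.00211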
Model Selection
Of the last two models, the one with the log-Normal error model (m6)
would seem to have the more Normal residuals. This can be inspected by
looking at diagnostic plots and by using shapiro.test():
par(mfrow=c(2,2))
plot(m6)
## hat values (leverages) are all = 0.03225806
## and there are no factor predictors; no plot no. 5
(Diagnostic plots for m6.)
par(mfrow=c(2,2))
plot(m7)
(Diagnostic plots for m7.)
shapiro.test(residuals(m6))
##
## Shapiro-Wilk normality test
##
## data: residuals(m6)
## W = 0.97013, p-value = 0.5225
shapiro.test(residuals(m7))
##
## Shapiro-Wilk normality test
##
## data: residuals(m7)
## W = 0.95846, p-value = 0.2655
AIC
The Akaike Information Criterion (AIC) can help to make decisions
regarding which model is the most appropriate; it is asymptotically
optimal for selecting the model with the least mean squared error, under
the assumption that the "true model" is not in the candidate set. I will
calculate the AIC for each of the above models:
summary(m1)
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## Girth 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
AIC(m1)
## [1] 181.6447
summary(m2)
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## Girth 4.7082 0.2643 17.816 < 2e-16 ***
## Height 0.3393 0.1302 2.607 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
AIC(m2)
## [1] 176.91
summary(m3)
##
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5821 -1.0673 0.3026 1.5641 4.6649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.39632 23.83575 2.911 0.00713 **
## Girth -5.85585 1.92134 -3.048 0.00511 **
## Height -1.29708 0.30984 -4.186 0.00027 ***
## Girth:Height 0.13465 0.02438 5.524 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
AIC(m3)
## [1] 155.4692
summary(m4)
##
## Call:
## lm(formula = log(Volume) ~ log(Girth) + log(Height), data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.168561 -0.048488 0.002431 0.063637 0.129223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.63162 0.79979 -8.292 5.06e-09 ***
## log(Girth) 1.98265 0.07501 26.432 < 2e-16 ***
## log(Height) 1.11712 0.20444 5.464 7.81e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08139 on 28 degrees of freedom
## Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
## F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16
AIC(m4)
## [1] -62.71125
summary(m5)
##
## Call:
## lm(formula = log(Volume) ~ log(Girth) * log(Height), data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.165941 -0.048613 0.006384 0.062204 0.132295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6869 7.6996 -0.479 0.636
## log(Girth) 0.7942 3.0910 0.257 0.799
## log(Height) 0.4377 1.7788 0.246 0.808
## log(Girth):log(Height) 0.2740 0.7124 0.385 0.704
##
## Residual standard error: 0.08265 on 27 degrees of freedom
## Multiple R-squared: 0.9778, Adjusted R-squared: 0.9753
## F-statistic: 396.4 on 3 and 27 DF, p-value: < 2.2e-16
AIC(m5)
## [1] -60.88061
summary(m6)
##
## Call:
## lm(formula = log(Volume) - log((Girth^2) * Height) ~ 1, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.168446 -0.047355 -0.003518 0.066308 0.136467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.16917 0.01421 -434.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0791 on 30 degrees of freedom
AIC(m6)
## [1] -66.34198
summary(m7)
##
## Call:
## lm(formula = Volume ~ 0 + I(Girth^2):Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6696 -1.0832 -0.3341 1.6045 4.2944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## I(Girth^2):Height 2.108e-03 2.722e-05 77.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.455 on 30 degrees of freedom
## Multiple R-squared: 0.995, Adjusted R-squared: 0.9949
## F-statistic: 5996 on 1 and 30 DF, p-value: < 2.2e-16
AIC(m7)
## [1] 146.6447
The AIC can help differentiate between similar models, but it cannot be
used to decide between models that have different response variables, so
the log-scale models (m4, m5, m6) should not be compared directly against
the Volume-scale models.
The AIC is defined as 2K - 2(log-likelihood), where K is the number of
estimated parameters. Lower AIC values indicate a better-fitting model,
and a model whose AIC is lower by more than about 2 (a delta-AIC greater
than 2) is conventionally considered meaningfully better than the model
it is being compared with.
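As a sanity check on the formula, the AIC of any fit above can be
recomputed by hand from its log-likelihood, for example:
# AIC = 2k - 2*log-likelihood, where k counts every estimated parameter
# (regression coefficients plus the residual variance)
k = attr(logLik(m2), "df")
2*k - 2*as.numeric(logLik(m2))   # matches AIC(m2) = 176.91 reported above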
If I understand the AIC correctly, model m6 (AIC = -66.34) is the
preferred fit among the models that share the log(Volume) response: it
has the lowest AIC, and its delta-AIC relative to m4 (-62.71) and m5
(-60.88) is greater than 2.
Backward Elimination Procedure
To continue developing the model, we apply the backward elimination
procedure: identify the predictor with the largest p-value (here Height,
with p = 0.0145 in m2) and remove it to see the effect.
m8 <- update(m2, .~. - Height, data = trees)
summary(m8)
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## Girth 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
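Instead of dropping terms by hand, R can automate backward elimination; a
minimal sketch using step(), which removes terms by AIC rather than by
p-values:
# Automated backward elimination starting from the two-predictor model;
# a term is dropped only if removing it lowers the AIC
step(m2, direction = "backward")
With these data step() keeps both Girth and Height, since dropping Height
would raise the AIC from 176.91 to 181.64 (the AIC() values computed
above).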
Model Fit
With the Simple Regression, I constructed a simple linear model for
Volume using Girth as the independent variable. Using Multiple
Regression I tried various models, including some that had multiple
predictor variables and/or involved log-log transformations to explore
power relationships.
I will now fit the implied power-law model, Volume = beta0 * Girth^beta1
* Height^beta2, directly by nonlinear least squares using nls():
volume = trees$Volume
height = trees$Height
girth = trees$Girth
m9 = nls(volume~beta0*girth^beta1*height^beta2,start=list(beta0=1,beta1=2,beta2=1))
summary(m9)
##
## Formula: volume ~ beta0 * girth^beta1 * height^beta2
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## beta0 0.001449 0.001367 1.060 0.298264
## beta1 1.996921 0.082077 24.330 < 2e-16 ***
## beta2 1.087647 0.242159 4.491 0.000111 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.533 on 28 degrees of freedom
##
## Number of iterations to convergence: 5
## Achieved convergence tolerance: 8.214e-07
With a residual standard error of 2.533 on 28 degrees of freedom, this
model has a lower residual standard error than the linear models m1
(4.252) and m2 (3.882) on the same Volume scale, and is comparable to the
interaction model m3 (2.709) and the fixed-power model m7 (2.455), so it
fits the data well. The estimated exponents (about 2 for Girth and 1 for
Height) again agree with the dimensional relationship between volume,
girth and height.
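The same residual checks used for the earlier models can be applied to
the nonlinear fit (a brief sketch; output and plot omitted):
# Normality of the nls residuals, plus a fitted-versus-residual plot
shapiro.test(residuals(m9))
plot(fitted(m9), residuals(m9), xlab = "Fitted volume", ylab = "Residual")
abline(h = 0, lty = 2)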