Now that we understand the basics of Multiple Linear Regression (MLR), we will begin to analyze the significance and accuracy of the model's parameter estimates, namely, our Beta coefficients.
Let’s take a peek at the trees dataset, which includes the girth (in), height (ft), and volume (ft^3) of a sample of 31 black cherry trees.
attach(trees)
head(trees)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
summary(trees)
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
Using the trees dataset, we will try to model a tree’s volume using its girth and height.
Volume.Mod <-lm(Volume~Girth+Height)
Volume.Mod
##
## Call:
## lm(formula = Volume ~ Girth + Height)
##
## Coefficients:
## (Intercept) Girth Height
## -57.9877 4.7082 0.3393
We learn from the model that for every 1 in increase in girth, predicted volume increases by 4.7082 ft^3 when height is held constant, and for every 1 ft increase in height, predicted volume increases by 0.3393 ft^3 when girth is held constant. I would plot this for you, but 3-D plotting is a skill I haven’t mastered in R.
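If the "held constant" wording feels abstract, here is a minimal sketch that checks it numerically. It refits the same model with `data = trees` instead of relying on `attach()`, and the names `fit` and `two_trees` (along with the two made-up trees) are my own assumptions for illustration.

```r
# Sketch: check the "held constant" reading of the Girth coefficient.
# Same model as Volume.Mod, but passing the data explicitly.
fit <- lm(Volume ~ Girth + Height, data = trees)

# Two hypothetical trees with the same height, one inch apart in girth
two_trees <- data.frame(Girth = c(12, 13), Height = c(75, 75))
pred <- predict(fit, newdata = two_trees)

# The gap between the two predictions equals the Girth coefficient
girth_gap <- unname(diff(pred))
girth_gap
unname(coef(fit)["Girth"])
```

Any other pair of trees that share a height and differ by one inch of girth should give the same gap, since the model is linear in both predictors.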
We’re now interested in knowing whether both of our predictor variables, girth and height, are truly helpful in predicting the volume of a tree. We can perform a hypothesis test on the Beta coefficients to test their significance in the model.
If a beta equals 0, then a change in that predictor doesn’t change our response variable, which makes it an insignificant predictor. This will be our null hypothesis.
\[ H_0: B_j = 0 \] \[ H_a: B_j \neq 0 \] If we fail to reject our null hypothesis, we have no evidence that the predictor helps once the other predictor is in the model, so we can drop it and predict our response variable about as accurately. We’ll assume an alpha level of 0.05.
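For the hypothesis test, our test stat will follow the t-distribution with n-(k+1) degrees of freedom, where n is the number of observations, k is the number of predictors, and k+1 is the number of Beta coefficients (including the intercept). For our model, that is 31-(2+1) = 28 degrees of freedom.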
\[ t_{n-(k+1)} = \frac{b_j - B_j}{S_{b_j}} \] Here, we assume that:
\[ b_j \sim N(B_j, SD_{b_j}) \] \[ B_j = 0 \] And \[ S_{b_j} = \text{Standard Error} \]
So, if we run the hypothesis test…
summary(Volume.Mod)
##
## Call:
## lm(formula = Volume ~ Girth + Height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## Girth 4.7082 0.2643 17.816 < 2e-16 ***
## Height 0.3393 0.1302 2.607 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
For the girth predictor, we see that our test stat with df=28 was 17.816, which produced a very small p-value (<2e-16). This tells us that if the null hypothesis were true, we would pretty much never get an estimate as far from zero as 4.7082 for girth. From this, we reject the null hypothesis and conclude that girth is a significant predictor of black cherry tree volume in this model.
Additionally, when we look at the height predictor, we see that our test stat is 2.607, which gives us a p-value of 0.0145. This tells us that if the null hypothesis were true, we would see a height coefficient at least as far from zero as 0.3393 in about 1.45% of samples. For our test, we reject the null since 0.0145<0.05, telling us that height is also a significant predictor of volume in this model.
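The t values and p-values in that table can be rebuilt by hand from the estimates and standard errors. The sketch below does that under the null value of 0 for each beta. The names `fit`, `coef_table`, `t_stat`, and `p_val` are mine, and the model is refit with `data = trees` so the block runs on its own.

```r
# Sketch: reproduce the t tests from summary() by hand.
fit <- lm(Volume ~ Girth + Height, data = trees)

# Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients
b  <- coef_table[, "Estimate"]
se <- coef_table[, "Std. Error"]

# Degrees of freedom: n - (k + 1) = 31 - 3 = 28
df <- nrow(trees) - length(b)

# Test statistic under H0: B_j = 0
t_stat <- (b - 0) / se

# Two-sided p-value: area in both tails of the t distribution
p_val <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

cbind(t_stat, p_val)
```

The two columns should line up with the "t value" and "Pr(>|t|)" columns printed by `summary(Volume.Mod)`.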
If we are instead interested in the accuracy of our estimates of the predictor coefficients, we can compute confidence intervals for the beta coefficients. We’ll stick with a 95% confidence level because 95% is a pretty cute number.
confint(Volume.Mod, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -75.68226247 -40.2930554
## Girth 4.16683899 5.2494820
## Height 0.07264863 0.6058538
Okay, so let’s dive into this R output. The code gives us a 2-sided confidence interval for the intercept and for each predictor, with alpha/2 (in this case 0.025) in each tail.
For the girth beta, our 95% confidence interval is [4.1668, 5.2495]. Since we are dealing with MLR, our interpretation of this interval is slightly different. We can now say that we are 95% confident that the true value of the beta coefficient for girth is between 4.1668 and 5.2495 when we hold the height constant.
The same can be said for the height beta: we are 95% confident that the true beta for height is between 0.0726 and 0.6059 when we hold girth constant (meaning that if we repeated this sampling ad nauseam, about 95% of the intervals built this way would contain the true beta coefficient).
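For anyone curious where those endpoints come from, here is a rough sketch of the same intervals built by hand as estimate ± critical t value × standard error. The names `fit`, `t_crit`, and `manual_ci` are my own, and the model is refit with `data = trees` so the block stands alone.

```r
# Sketch: build the 95% confidence intervals for the betas by hand.
fit <- lm(Volume ~ Girth + Height, data = trees)

b  <- coef(fit)                # point estimates
se <- sqrt(diag(vcov(fit)))    # standard errors of the estimates

# Critical value with 0.025 in each tail and 28 degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

manual_ci <- cbind(lower = b - t_crit * se,
                   upper = b + t_crit * se)
manual_ci
```

The rows should agree with `confint(Volume.Mod, level = 0.95)` up to rounding.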
We can also extend our study of confidence intervals for predicting the mean of the response variable to MLR models.
To do this, we select values for our predictor variables, then predict the mean value of our response variable given these values. We’ll pretend that I’m a tree and use my approximate values: Girth: 12, Height: 72, alpha: 0.05.
Volume.Mean<-data.frame(Girth=12, Height=72)
predict(Volume.Mod, Volume.Mean, interval="confidence", level=0.95)
## fit lwr upr
## 1 22.93636 21.23781 24.6349
AND THERE YOU HAVE IT! Our 95% confidence interval for the mean volume when girth is 12in and height is 72ft is [21.2378, 24.6349] cubic feet. Thus, we are 95% confident that the true mean volume of black cherry trees with a girth of 12in and a height of 72ft falls in this range.
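Under the hood, this interval is just the fitted value ± a critical t value × the standard error of the fitted mean. The sketch below asks `predict()` for that standard error with `se.fit = TRUE` and rebuilds the interval. The names `fit`, `new_tree`, and `manual_mean_ci` are assumptions of mine.

```r
# Sketch: rebuild the confidence interval for the mean response by hand.
fit <- lm(Volume ~ Girth + Height, data = trees)
new_tree <- data.frame(Girth = 12, Height = 72)

# se.fit = TRUE also returns the standard error of the fitted mean
# and the residual degrees of freedom
p <- predict(fit, new_tree, se.fit = TRUE)

t_crit <- qt(0.975, df = p$df)

manual_mean_ci <- c(fit = unname(p$fit),
                    lwr = unname(p$fit - t_crit * p$se.fit),
                    upr = unname(p$fit + t_crit * p$se.fit))
manual_mean_ci
```

The three numbers should match the `fit`, `lwr`, and `upr` columns from `predict(..., interval="confidence")`.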
If we’re really feeling ecstatic about statistics, we can also predict a single value of tree volume based on a measure of girth and height. Why, you might ask? Pretend you’re George Washington about to chop down a cherry tree for firewood to heat the White House in 1778. There are 2 trees that Mrs. Washington says you can chop down, but you only have time for one (the whole sharpening the axe thing takes too long…or was that Lincoln?). So, you have to decide which tree is going to provide more volume (and consequently more heat) based on each tree’s girth and height.
Now that we understand our motivation for an individual response value, let’s solve. Say our good friend Georgey measures the two trees and reports the following [Girth, Height]: Tree 1: [14, 65], Tree 2: [11, 80].
tree1 <- data.frame(Girth=14, Height= 65)
tree2 <- data.frame(Girth=11, Height= 80)
predict(Volume.Mod, tree1, interval="prediction", level=0.95)
## fit lwr upr
## 1 29.97792 21.30197 38.65387
predict(Volume.Mod, tree2, interval="prediction", level=0.95)
## fit lwr upr
## 1 20.94221 12.62153 29.26288
Mr. W could run the above prediction intervals and be 95% confident that the volume of tree1 would be between [21.3020, 38.6539] cubic feet while being 95% confident that the volume of tree2 would be between [12.6215, 29.2629] cubic feet. The intervals overlap, but tree1 has the larger predicted volume (29.98 vs. 20.94 cubic feet). Mr. W is a smart man (except when it comes to cherry trees…), so we can assume that he would chop down tree1.
We notice that the intervals for a single value of the response variable are roughly 5x wider than the confidence interval for the mean response that we computed earlier (about 17 cubic feet wide versus about 3.4). This is due to the additional variation that comes with predicting a single value rather than an averaged value: an individual tree also scatters around the mean. This extra variation shows up in the SE and widens the prediction interval compared to the confidence interval.
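To see where the extra width comes from, here is a small sketch for tree1. It assumes the usual formula for an unweighted linear model, where the prediction SE adds the residual variance to the squared SE of the fitted mean. The names `fit`, `se_mean`, `se_pred`, `manual_pi`, `conf_int`, and `pred_int` are mine.

```r
# Sketch: why the prediction interval is wider than the confidence interval.
fit <- lm(Volume ~ Girth + Height, data = trees)
tree1 <- data.frame(Girth = 14, Height = 65)

p <- predict(fit, tree1, se.fit = TRUE)
s <- p$residual.scale      # residual standard error (3.882 in the summary)

se_mean <- unname(p$se.fit)           # SE for the mean response
se_pred <- sqrt(se_mean^2 + s^2)      # SE for a single new tree

t_crit <- qt(0.975, df = p$df)
manual_pi <- unname(c(p$fit - t_crit * se_pred, p$fit + t_crit * se_pred))

# Compare against the built-in intervals at the same predictor values
conf_int <- predict(fit, tree1, interval = "confidence", level = 0.95)
pred_int <- predict(fit, tree1, interval = "prediction", level = 0.95)

manual_pi
c(conf_width = unname(conf_int[1, "upr"] - conf_int[1, "lwr"]),
  pred_width = unname(pred_int[1, "upr"] - pred_int[1, "lwr"]))
```

Because `s^2` is added under the square root, the prediction SE can never be smaller than the residual standard error, no matter how precisely the mean is estimated.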