Linear regression with multiple predictors

Model Selection

In this example we take data in which we have measured the total volume, diameter and height at a fixed distance of 1.5 m from the ground of 20 trees. We want to see which combination of diameter, height or both is best for predicting volume of wood in trees. If either or both of these can do so with reasonable precision then that will give us a useful way to estimate the tree volume which that be precise enough and much easier than actually measuring the volume.

For this exercise you will need to start a new script and to put the dataset tree_vol.csv into your data folder within your RStuff folder. The data file can be found along with all the other data files on the R tab of the Honours Project Moodle site.

You also need to designate your RStuff folder as what RStudio recognises as a Project.

Load Packages

library(tidyverse) # general workhorse. We always load this.
library(here) # this is good for finding your data file
library (MASS) # we use this here to help us find the best model

Reading the data

For this next code to work, you need to have made your RStuff folder into a Project.

filepath <- here("data","tree_vol.csv")
trees <- read_csv(filepath)
glimpse (trees)

## Rows: 20
## Columns: 3
## $ diameter <dbl> 0.6917, 0.7167, 0.7333, 0.8750, 0.8917, 0.9000, 0.9167, 0.916…
## $ height   <dbl> 70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75, 7…
## $ volume   <dbl> 10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 2…

Plot the data

A useful exploratory plot we can do here is a so-called pairs plot, in which every numerical variable is plotted against every other one.

pairs(trees)

From this, does it look as though diameter and height might be useful predictors of volume? Does it look as though either one might be a good predictor of the other? If the answer to the first question is Yes (Which it is. Why?) and if the answer to the second question is No (Which it also is. Why?) then it could be that we can use both diameter and height together to predict volume. Let’s see how that works out.

Fitting the model (ie estimating the parameters \(\beta_j\) of the model):

First, we will suppose that a linear model using both diameter and height as predictors of volume is appropriate. This seems reasonable given the pairs plots above. Our task is to find the coefficients \(\beta_0\), \(\beta_1\) and \(\beta_2\) in the equation

\[\text{volume}_i= \beta_0 + \beta_1\times \text{diameter}_i + \beta_2 \times \text{height}_i + e_i\]

Here, the \(e_i\) are the error terms, the part of the variation of volume that is not explained by height or diameter.

We create a linear model object which implements this model using the lm() function.

model_full <- lm (volume ~ diameter + height, data = trees)

then we inspect this model using the summary() function.

summary (model_full)

## 
## Call:
## lm(formula = volume ~ diameter + height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2224 -1.3994 -0.0429  0.8784  5.7161 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35.8077     5.6488  -6.339 7.41e-06 ***
## diameter     34.0993     3.9788   8.570 1.41e-07 ***
## height        0.3204     0.0754   4.249 0.000541 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.121 on 17 degrees of freedom
## Multiple R-squared:  0.8837, Adjusted R-squared:   0.87 
## F-statistic: 64.58 on 2 and 17 DF,  p-value: 1.143e-08

In this model, we have used diameter and height as possible predictors of tree volume. From the output of the linear model, we find that if diameter increases by one while height is held constant, then the volume increases by 34.1, while if diameter is held constant and height is increased by 1, then volume increases by 0.32. This is the meaning of the numbers in the Estimate column. Those numbers are the values \(\beta_1\) and \(\beta_2\) in our equation above. The (Intercept) number -35.8 is the parameter \(\beta_0\) in that equation.

Using these values for \(\beta_0\), \(\beta_1\) and \(\beta_2\) we can predict tree volume for any given values of diameter and height.

For example, if a tree’s height were 75 m and it’s diameter at 1.5 m from the ground were 0.9 m, then we would predict that the volume of wood in the tree were:

\[\begin{align*} \text{volume}_i&= \beta_0 + \beta_1\times \text{diameter}_i + \beta_2 \times \text{height}_i\\ &=-35.8 + 34.1 \times 0.9 + 0.32\times 75\\ &=18.89\text{ m}^3 \end{align*}\]

Can we simplify the model?

But we might ask, could we just as well predict volume using only height or diameter instead of both. Do we need both? Is our model more complicated than it need be? Are we spending time measuring one of those variables when we didn’t need to?

Often, when we have several candidate predictors for a model, some of them do not actually contribute much extra explanatory power and we are better off excluding them. For example, if you used left-foot size as a predictor of peoples’ heights, would it help much, do you think, if you also knew the size of their right foot? Almost certainly not. You would have wasted time and effort in measuring the length of that foot if you already knew the length of the other.

There are several ways to determine which of a range of predictors is best to use. In one method you try all possible combinations of the predictors, fit the model in each case and for each find the so-called Akaiki Information Criterion (AIC). This number tells you how much of the ‘truth’ is lost by fitting a particular model. Hence the model with the lowest AIC is the best one.

Here, we have four possible models: one where diameter alone is used as a predictor, one where height alone is used, and one where both are used - we have already tried this one. In addition we have a ‘nothing’ model in which we suppose that volume is constant and depends neither on diameter nor on height. In that model, we predict every tree to have the mean volume of our sample.

We could manually create models for each possible combination of predictors and then see which has the lowest AIC, and here that would not be too bad since there are only four. But in general, where we might have many predictors, this would rapidly become impractical as the number of possible combinations of them would be very large.

Fortunately, there is a function in the MASS package that does all this for us: stepAIC(). So let’s use that:

This is how it goes: we give it the full model we have already found that uses all the parameters, and then let it use that to find for us the combination of parameters that gives a model with the lowest AIC

# Stepwise regression model: given the full model, find the lowest AIC model
model_best <- stepAIC(model_full, direction = "both", trace = FALSE) # finds the AIC for all possible models
print(model_best$anova) # tells us which parameters are retained in the model with lowest AIC, and its AIC value.

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## volume ~ diameter + height
## 
## Final Model:
## volume ~ diameter + height
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                         17   76.45058 32.81825

# Parameters of the best model
summary(model_best)

## 
## Call:
## lm(formula = volume ~ diameter + height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2224 -1.3994 -0.0429  0.8784  5.7161 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35.8077     5.6488  -6.339 7.41e-06 ***
## diameter     34.0993     3.9788   8.570 1.41e-07 ***
## height        0.3204     0.0754   4.249 0.000541 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.121 on 17 degrees of freedom
## Multiple R-squared:  0.8837, Adjusted R-squared:   0.87 
## F-statistic: 64.58 on 2 and 17 DF,  p-value: 1.143e-08

This tells us that the combination of predictors that gives a model with the lowest AIC is in this case the one that uses both diameter and height. The one we tried in the first place! This is nice since it ties in with what we kind of thought from looking at the pairs plot. Diameter and height did not appear to be correlated. Which means they are independent of each other (the fancy word for this in this context is ‘orthogonal’). Which means they both bring additional information to the model. So it should not be surprising that the best model is one that uses both.

The whole script

# load packages
library(tidyverse)
library(here) 
library(MASS)

# load data
filepath <- here("data","tree_vol.csv")
trees <- read_csv(filepath)
glimpse (trees)

# pairs plot
pairs(trees)

# fit a linear model using all our predictors
model_full <- lm (volume ~ diameter + height, data = trees)

# look at whether the predictors are significant
summary (model_full)

## Find the model with the lowest AIC - this will be the best one

# First, fit the full model using all predictors (actually, we have just done this)
model_full <- lm(volume~., data = trees) # volume ~ . means 'use all the predictors' -saves us typing them all
summary(model_full)

# Stepwise regression model
model_best <- stepAIC(model_full, direction = "both", trace = FALSE) # finds the AIC for all possible models
model_best$anova # tells us which one is best

# Parameters of the best model
summary(model_best)