Welcome to Session IV of the TRIA-Net Introduction to R Workshop. Today we will be introducing the basic statistical modelling tools available in R.
This session will look at regression models (a good introduction can be found in Montgomery, Peck, and Vining (2015)). Other kinds of models (e.g. mechanistic models, machine learning models) can also be fit in R, but we won’t have time to cover them.
To return to the list of sessions at the rpubs website, click here:
Let’s start by looking at another one of R’s pre-loaded datasets: trees is a set of diameter-at-breast-height (DBH), height, and timber volume measurements from 31 black cherry trees (T. A. Ryan et al. 1976).
We’ll take a look at the first few values in the data frame using head, then make a pair of scatterplots showing volume against each of height and DBH (which is called ‘Girth’ in the data frame):
# peek at the dataset, count the number of observations (rows)
n = nrow(trees)
head(trees)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
# divide the plot window into two panels
par(mfrow=c(1,2))
# make a palette to colour each point
treesColours = topo.colors(n)
# draw scatterplots
plot(Volume ~ Girth, data=trees, pch=19, col=treesColours)
plot(Volume ~ Height, data=trees, pch=19, col=treesColours)
The rows in this data frame have already been ordered by increasing girth, which is why the colour scale matches it. It seems like girth is a pretty good predictor of timber volume and that the relationship is almost linear.
It is quite easy to fit a simple linear model in R. The function to do this is called lm. It understands the same formula notation that we use with plots. Let’s fit a regression of volume on girth:
# fit the linear model
lm(Volume ~ Girth, data=trees)
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Coefficients:
## (Intercept) Girth
## -36.943 5.066
The output of this function displays the least squares estimates of \(\beta_0\) (‘Intercept’) and \(\beta_1\) (‘Girth’) in the model:
\[ v_i = \beta_0 + \beta_1 g_i + \epsilon_i \]
where \((v_i, g_i)\) are the volume & girth pairs for each of the \(i=1,2,\dots,31\) trees, and \(\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)\) are the normally distributed errors.
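Incidentally, the simple linear regression estimates have a closed form: the slope is the sample covariance of girth and volume divided by the sample variance of girth, and the intercept follows from the means. A quick check (nothing here is specific to lm, just the standard least squares identities):
# least squares slope and intercept, computed by hand
slope = with(trees, cov(Girth, Volume) / var(Girth))
intercept = with(trees, mean(Volume) - slope * mean(Girth))
c(intercept, slope)   # should match the coefficients printed above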
Usually we want more than just the coefficient estimates. Calls to lm actually return a model object containing tons of information in the form of an R list. The info can be summarized like this:
# fit the linear model
girthlm = lm(Volume ~ Girth, data=trees)
# examine the model object
summary(girthlm)
##
## Call:
## lm(formula = Volume ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## Girth 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
The summary function works on all kinds of R objects (e.g. try using it on the trees data frame). In this case we used it on the model object girthlm and it reports details about the fit:
coefficient estimates are supplemented with standard errors, t-values and p-values
\(\sigma\) is estimated (the residual standard error)
\(R^2\) is reported (ordinary and adjusted to penalize model complexity).
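If you want to use these quantities in further calculations, you can pull them out of the summary object directly; a quick sketch:
# extract pieces of the fit programmatically
s = summary(girthlm)
coef(s)            # the coefficient table (estimates, std. errors, t, p)
s$sigma            # the residual standard error
s$r.squared        # ordinary R-squared
s$adj.r.squared    # adjusted R-squared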
Sometimes you might want to look under the hood and see how R is actually doing its calculations. Try asking R to show the contents of girthlm by running str(girthlm). The second item in the list is called “residuals”, so for example I can print out my model residuals like this:
girthlm[[2]]
## 1 2 3 4 5 6
## 5.1968508 3.6770939 2.5639226 0.1519667 1.5387954 1.9322098
## 7 8 9 10 11 12
## -3.1809615 -0.5809615 3.3124528 0.1058672 3.8992815 0.1926959
## 13 14 15 16 17 18
## 0.5926959 -1.0270610 -4.7468179 -6.2060887 5.3939113 -3.0324313
## 19 20 21 22 23 24
## -6.7587739 -8.0653595 0.5214692 -3.2917021 -0.2114590 -5.8102436
## 25 26 27 28 29 30
## -3.0300006 4.7041430 3.9909717 4.5646292 -2.7419565 -3.2419565
## 31
## 9.5868168
“Least squares” refers to finding the model that minimizes the sum of the squares of all these numbers. R has done this automatically, then used the residuals to estimate things like \(\sigma\). But R will not automatically check whether the residuals look normal (an assumption of simple linear regression). This is an important part of model adequacy checking, and we’ll come back to it later.
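As an aside, extracting list components by position (girthlm[[2]]) is fragile if the object layout ever changes; extracting by name is clearer. Here’s a quick sketch verifying that the residual standard error reported by summary really is computed from these residuals:
r = resid(girthlm)                      # same values as girthlm$residuals
sum(r^2)                                # the residual sum of squares that lm minimized
sqrt(sum(r^2) / df.residual(girthlm))   # should match the 4.252 reported by summary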
Now we’ll try including both covariates in the linear model. Whenever we have two main effects we usually want to try including their interaction, so we’ll do that too. The formula notation in R is intuitive. If \(h_i\) is the height of the \(i^{th}\) tree, the multiple linear regression model is:
\[ v_i = \beta_0 + \beta_1 g_i + \beta_2 h_i + \beta_3 g_i h_i + \epsilon_i \]
and the corresponding R formula syntax is 'Volume ~ Girth + Height + Girth:Height'. Try fitting this model using an lm call, then look up the \(R^2\) value by calling summary.
##
## Call:
## lm(formula = Volume ~ Girth + Height + Girth:Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5821 -1.0673 0.3026 1.5641 4.6649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.39632 23.83575 2.911 0.00713 **
## Girth -5.85585 1.92134 -3.048 0.00511 **
## Height -1.29708 0.30984 -4.186 0.00027 ***
## Girth:Height 0.13465 0.02438 5.524 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
Click here to see my code. The fit is quite good here, and we see that all the effects are significant.
I usually like to see a plot of the model’s predicted values after the fit. We could do this by using the mathematical formula for the model and plugging in the fitted coefficients, but R’s predict function does it automatically for us. It takes in a model object and a data frame containing covariate values to predict over and generates the corresponding response values.
I’ll put girth on the x-axis and make a plot of the model prediction line for the average-height tree:
# reorder the dataset by height so the colour scale matches it
treesbyheight = trees[order(trees$Height),]
# define a set of girths and a height to predict over
girthset = pretty(trees$Girth, n=1000)
heightaverage = mean(trees$Height)
# plot volume against girth (source data)
plot(Volume ~ Girth, data=treesbyheight, pch=19, col=treesColours)
title(main=paste0('height = ', heightaverage), line=-2)
# predict volumes based on height and girths
testdata = data.frame(Girth=girthset, Height=heightaverage)
y = predict( girthHeightlm, newdata=testdata)
# add prediction line from the linear model
lines(y ~ girthset)
I’ve used three functions here that may be new to you:
order takes in a numerical vector and returns an indexing that puts the values in increasing order. We use this to reorder the rows of the trees data frame by height (matching height with point colour).
pretty generates a sequence of roughly n+1 evenly-spaced, rounded values covering the range of the input vector (in this case the range of girths in the data).
paste0 stitches together characters and variable values into a single string. I used it to construct a plot title containing the value of the height variable (toy examples of all three follow below).
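If any of these functions are unfamiliar, here are toy calls to try at the console:
order(c(30, 10, 20))       # returns 2 3 1: the positions that sort the vector
pretty(c(0, 9.7), n=4)     # rounded, evenly-spaced values covering the range (e.g. 0 2 4 6 8 10)
paste0('height = ', 76)    # the single string "height = 76"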
Since we have an interaction, the slope of this line will change as we vary height. As an exercise, try altering the above code to plot the linear model prediction line for a set of 4 different height values. You can also try colouring the lines according to their height value:
Click here to see the code I used to make this plot.
R has a convenient shorthand in its formula notation: the asterisk symbol (e.g. “factor1*factor2*factor3”) instructs R to include all the main effects as well as all their interaction terms. So to fit the model above, we could have done something like this:
summary(lm(Volume ~ Girth*Height, data=trees))
##
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5821 -1.0673 0.3026 1.5641 4.6649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.39632 23.83575 2.911 0.00713 **
## Girth -5.85585 1.92134 -3.048 0.00511 **
## Height -1.29708 0.30984 -4.186 0.00027 ***
## Girth:Height 0.13465 0.02438 5.524 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
Notice that R has included an intercept, both main effects, and the interaction term automatically.
Now let’s try adding a higher-order term to demonstrate simple model selection in R. For example, we can try a quadratic in girth. Write some code that adds a new column with squared girth to the trees data frame. Then use lm to fit the model with girth, squared girth, and height, plus all interactions, using the “*” shorthand:
##
## Call:
## lm(formula = Volume ~ Girth * sqGirth * Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9041 -1.1770 -0.0617 1.4964 4.1283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.429e+02 5.040e+02 -0.482 0.634
## Girth 4.281e+01 1.120e+02 0.382 0.706
## sqGirth -2.107e+00 8.169e+00 -0.258 0.799
## Height 4.432e+00 7.037e+00 0.630 0.535
## Girth:sqGirth 1.843e-02 1.980e-01 0.093 0.927
## Girth:Height -8.481e-01 1.522e+00 -0.557 0.583
## sqGirth:Height 5.134e-02 1.071e-01 0.479 0.636
## Girth:sqGirth:Height -7.759e-04 2.480e-03 -0.313 0.757
##
## Residual standard error: 2.706 on 23 degrees of freedom
## Multiple R-squared: 0.9792, Adjusted R-squared: 0.9729
## F-statistic: 154.8 on 7 and 23 DF, p-value: < 2.2e-16
Click here to see my code.
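As an aside, you don’t strictly need to add a column to the data frame: wrapping the arithmetic in I() inside the formula fits the same model, since I() tells R to treat ^ as exponentiation rather than formula syntax. A sketch:
# same model, without modifying trees
summary(lm(Volume ~ Girth * I(Girth^2) * Height, data=trees))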
Before we go further, a note about the way R displays numbers: The default behaviour is to print very large/small numbers in scientific notation (for example “e+09” indicates a number is given in billions, and “e-03” in thousandths). If you’d rather see numbers printed in their entirety, you can indicate this via one of R’s global options. The value of option “scipen” indicates to R how much we hate scientific notation at this time, and is by default set to 0. Let’s turn it up:
options(scipen=4)
Try calling summary again for the model you just fitted. I think it’s easier to read now. If you liked it better before, go ahead and reset scipen back down to zero. To see a list of other global options and their current values, type options().
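Here is a tiny demonstration of what scipen changes:
options(scipen=0)    # back to the default
print(0.00000748)    # displays as 7.48e-06
options(scipen=4)    # penalize scientific notation again
print(0.00000748)    # displays as 0.00000748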
Back to model selection. Looking at the summary call above, we have quite a few terms now. We will often need to ask whether dropping one or more of these terms produces a “better” model. Automated stepwise selection based on significance (p-values) used to be a very popular method for this, but it has fallen out of favour for a number of reasons (see here for a discussion). There are quite a few options for defining what makes a “better” model, including information criteria like AIC and BIC, as well as likelihood values and adjusted likelihoods. How model selection should be done is a controversial topic, so think carefully about which method is right for your research question.
If we were simply comparing one model to another with one of the terms dropped, we could use an F-test to gauge how important that missing term was. This is a type of likelihood ratio test. We can use the update function to make single-term additions/deletions to a model. Then we can compare the nested models using the anova function.
# remove the 3-way interaction
smallercherrylm = update(bigcherrylm, ~.-Girth:sqGirth:Height)
# do an F-test comparing with the full model
anova(bigcherrylm, smallercherrylm, test="F")
## Analysis of Variance Table
##
## Model 1: Volume ~ Girth * sqGirth * Height
## Model 2: Volume ~ Girth + sqGirth + Height + Girth:sqGirth + Girth:Height +
## sqGirth:Height
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 168.46
## 2 24 169.18 -1 -0.71711 0.0979 0.7572
The second argument to the update function has a weird looking syntax. The ~. part tells R to start with everything in the old formula, then the -Girth:sqGirth:Height part tells R to drop the Girth:sqGirth:Height term (the 3-way interaction). Then we run the ANOVA (specifying that we are doing an F-test). The non-significance tells us that we can drop the 3-way interaction.
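A related convenience is drop1, which performs all the admissible single-term deletions at once and reports the same F-tests (here only the 3-way interaction qualifies, since a lower-order term can’t be dropped while its interactions remain):
# F-tests for every term that can be dropped from the full model
drop1(bigcherrylm, test="F")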
What about comparing models that aren’t nested? One of the most popular techniques is to use AIC, which incorporates model complexity and likelihood into one number. Lower is “better”.
Suppose our expert opinion led us to the following three models. We can get their AIC values using the AIC function, whose arguments are the (already fitted) model objects. Try fitting the following three models, then find which one has the lowest AIC:
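The three formulas aren’t printed here, so the sketch below is my reconstruction: the first two are consistent with the df and AIC values shown (and smallercherrylm2 matches the summary further down), while smallercherrylm3 is a guess:
# three candidate models of decreasing complexity
smallercherrylm1 = lm(Volume ~ Girth * Height, data=trees)
smallercherrylm2 = lm(Volume ~ sqGirth * Height, data=trees)
smallercherrylm3 = lm(Volume ~ sqGirth, data=trees)   # a guess at the third model
AIC(smallercherrylm1, smallercherrylm2, smallercherrylm3)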
## df AIC
## smallercherrylm1 5 155.4692
## smallercherrylm2 5 152.3765
## smallercherrylm3 3 167.2284
Click here to see my code.
In this case we would pick the model with squared girth, height, and their interaction, since it has the smallest AIC. Note that comparing AICs is different from hypothesis testing, and it may lead us to a model with non-significant terms (which is OK).
summary(smallercherrylm2)
##
## Call:
## lm(formula = Volume ~ sqGirth * Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9548 -1.0501 -0.1482 1.7188 4.2102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.0512340 11.9831440 -0.171 0.8654
## sqGirth -0.0043818 0.0705431 -0.062 0.9509
## Height 0.0267203 0.1565299 0.171 0.8657
## sqGirth:Height 0.0021616 0.0008789 2.459 0.0206 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.577 on 27 degrees of freedom
## Multiple R-squared: 0.9779, Adjusted R-squared: 0.9754
## F-statistic: 398 on 3 and 27 DF, p-value: < 2.2e-16
Here is a plot of predicted values from this model:
For practice, try writing your own code to make a plot like this.
You can see the quadratic term has allowed the prediction line to taper off as the girth gets small. Does it make sense to see this in the data? Think of how you might estimate timber volume using basal area and height…
The regressions we just saw are relatively simple models. In reality we often use generalized linear models (GLMs). These allow you to specify different error structures for your data, and they relate the response to the linear combination of covariates through a link function (instead of the response simply being that linear combination).
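As a minimal illustration of the link-function idea, here is a logistic regression sketch using the built-in mtcars data, modelling the probability that a car has a straight engine (vs = 1) as a function of its weight (this example is mine, not part of the workshop data):
# logistic regression: a logit link relates P(vs = 1) to the linear predictor
summary(glm(vs ~ wt, data=mtcars, family=binomial))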
Below we’ll work through two fuller examples: a nonlinear regression (fit by nonlinear least squares rather than lm), and a mixed effects model (where we change the error structure).
This data comes from The R Book by Michael J. Crawley. Here we want to fit a model relating the length of a deer’s jaw bone to its age.
age = c(0, 5.112, 1.32, 35.24, 1.632931134, 2.297635098, 3.322124914, 4.043794352, 5.266018429, 6.075060388,
7.23632978, 8.291007148, 9.040888738, 10.88103197, 11.09874529, 12.0930015, 13.99817691, 14.31094267,
15.95442024, 16.07602302, 17.06269954, 18.62960626, 19.33089704, 20.54623897, 21.32942114, 22.26439841,
23.02877926, 24.11888438, 25.31485841, 26.61297791, 27.57406734, 28.13049503, 29.7512319, 30.12324693,
31.98634565, 32.70730577, 33.72661622, 34.99304042, 35.88213403, 36.82950948, 37.11789381, 38.1534875,
39.43963142, 40.25185461, 41.69374363, 42.64910056, 43.76463209, 44.61572132, 45.85532071, 46.0153324,
47.70015577, 48.96434866, 49.32955686, 50.60409652)
bone = c(0, 20.22, 11.1113, 140.65, 26.15218052, 10.001, 55.85634033, 46.06754123, 67.24173642, 59.63458205,
60.1091383, 64.31392532, 84.92215527, 92.78716478, 98.35159407, 62.22, 85.77306851, 106.6696317,
97.90825381, 105.9671829, 112.2599592, 102.4866278, 114.7882368, 112.5545882, 81.11, 123.4,
116.0762696, 114.74270117, 105.5158263, 101.3862317, 97.77144264, 118.3887866, 96.94697319,
113.912142, 99.48090649, 116.2306841, 106.018061, 122.9677733, 116.8611543, 122.329053,
99.07154936, 116.1656593, 123.1043578, 118.5664472, 112.556484, 99.5036247, 118.6991783,
105.2162345, 108.5122835, 135, 112.429545, 101.6761133, 142, 91.2)
plot(age, bone, pch=21, col="purple", bg="green")
# Fit a nonlinear least squares regression with three parameters
model = nls(bone ~ a - b*exp(-c*age), start=list(a=120, b=110, c=0.064))
print(summary(model))
##
## Formula: bone ~ a - b * exp(-c * age)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 115.25308 2.91393 39.55 < 2e-16 ***
## b 118.68818 7.89260 15.04 < 2e-16 ***
## c 0.12351 0.01711 7.22 2.44e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.21 on 51 degrees of freedom
##
## Number of iterations to convergence: 5
## Achieved convergence tolerance: 0.000002392
# All the parameters appear to be significantly different from zero at P < 0.001. However, this does
# not necessarily mean that all the parameters need to be retained in the model. In this case
# a = 115.25308 with standard error 2.91393 is clearly not significantly different from b = 118.68818
# with standard error 7.89260. Try fitting a simpler two-parameter model.
# Fit a nonlinear least squares regression with two parameters
model2 = nls(bone ~ a*(1-exp(-c*age)), start=list(a=120, c=0.064))
print(anova(model,model2))
## Analysis of Variance Table
##
## Model 1: bone ~ a - b * exp(-c * age)
## Model 2: bone ~ a * (1 - exp(-c * age))
## Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
## 1 51 8897.5
## 2 52 8929.3 -1 -31.85 0.1826 0.671
# Model simplification was justified (p=0.671) so we accept the two-parameter version.
print(summary(model2))
##
## Formula: bone ~ a * (1 - exp(-c * age))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 115.58091 2.84367 40.645 < 2e-16 ***
## c 0.11882 0.01233 9.635 3.69e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.1 on 52 degrees of freedom
##
## Number of iterations to convergence: 5
## Achieved convergence tolerance: 0.000001366
# overlay the fitted curve on the scatterplot of bone length vs age
av = seq(0, 50, 0.1)
bv = predict(model2, list(age=av))
lines(av, bv)
Let’s revisit the example data from last session, on average Loblolly pine height with respect to age (Kung 1986).
head(Loblolly)
## height age Seed
## 1 4.51 3 301
## 15 10.89 5 301
## 29 28.72 10 301
## 43 41.74 15 301
## 57 52.70 20 301
## 71 60.92 25 301
str(Loblolly)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 84 obs. of 3 variables:
## $ height: num 4.51 10.89 28.72 41.74 52.7 ...
## $ age : num 3 5 10 15 20 25 3 5 10 15 ...
## $ Seed : Ord.factor w/ 14 levels "329"<"327"<"325"<..: 10 10 10 10 10 10 13 13 13 13 ...
## - attr(*, "formula")=Class 'formula' length 3 height ~ age | Seed
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Age of tree"
## ..$ y: chr "Height of tree"
## - attr(*, "units")=List of 2
## ..$ x: chr "(yr)"
## ..$ y: chr "(ft)"
n = nrow(Loblolly)
nseeds = length(unique(Loblolly$Seed))
Recall that we had growth curves from 14 different seed sources. We’ll go through an example of modelling the growth increments as a function of age using all the data for ages 5 to 20.
We can compute the differences between successive height measurements (growth increments) and add them to the data frame; R’s diff function computes such lagged differences, and the code below does the equivalent with a shifted copy of the height vector. On your own time it would be a good exercise to try making your own version of this code. For now you can just run my version below:
# compute height differences and divide by age increment
growths = (c(Loblolly$height[2:n],0) - Loblolly$height) / rep(5, nseeds)
# add them to a copy of the Loblolly data frame, then remove the height differences for ages 3,25
newLob = Loblolly
newLob$growth = growths
newLob = newLob[newLob$age %in% c(5,10,15,20),]
# make a quick plot
plot(growth ~ age, data=newLob)
Now suppose that we have reason to believe the age-growth relationship is the same for each seed source, but we know that, independent of age, there is some variability between sources. We could use random intercepts to get a much-reduced model for the age-growth relationship while accounting for this nuisance variability.
Let \(g_{ij}\) be the growth for the \(j^{th}\) measurement on the \(i^{th}\) seed source, and \(a_{ij}\) the corresponding age. Then the model is:
\[ g_{ij} = \beta_0 + \beta_1 a_{ij} + \tau_i + \epsilon_{ij} \]
The \(\epsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma^2)\) represent normally distributed measurement errors. What makes this a mixed effects model is the \(\tau_i\) term, which allows the intercept to vary randomly between seed sources. We assume this variation is also normal: \(\tau_{i} \overset{\text{iid}}{\sim} N(0, \sigma_{\tau}^2)\).
To fit this in R, we use the nlme package. Let’s install that now:
install.packages('nlme')
We can now fit the model using the lme function from this package. The formula notation is the same for fixed effects, but random effects are specified in another argument. Let’s see an example:
library(nlme)
## Warning: package 'nlme' was built under R version 3.2.3
loblme = lme(growth ~ age, random = list( ~ 1 | Seed), data=newLob)
summary(loblme)
## Linear mixed-effects model fit by REML
## Data: newLob
## AIC BIC logLik
## -7.551775 0.4041615 7.775887
##
## Random effects:
## Formula: ~1 | Seed
## (Intercept) Residual
## StdDev: 0.000005744349 0.1883694
##
## Fixed effects: growth ~ age
## Value Std.Error DF t-value p-value
## (Intercept) 3.875500 0.06165835 41 62.85443 0
## age -0.109703 0.00450289 41 -24.36277 0
## Correlation:
## (Intr)
## age -0.913
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.7736540 -0.4570807 -0.1297603 0.3977747 2.3412209
##
## Number of Observations: 56
## Number of Groups: 14
In the output, you can see that we have a section for random effects and another for fixed effects. In the random effects section, the ‘(Intercept)’ standard deviation is the estimate of \(\sigma_\tau\) (essentially zero here), and the ‘Residual’ standard deviation is the estimate of \(\sigma\). Can you find our estimates of \(\beta_0\) and \(\beta_1\)?
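You can also extract these pieces directly from the fitted object using nlme’s accessor functions; a quick sketch:
fixef(loblme)     # fixed-effect estimates (beta0 and beta1)
VarCorr(loblme)   # variance components (sigma-tau and sigma)
ranef(loblme)     # the 14 estimated random intercepts, one per seed source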
Here’s a plot of the resulting predicted values:
# plot the growth data, coloured in row order
plot(growth ~ age, data=newLob, pch=NA)
treesColours = topo.colors(n)
points(growth ~ age, data=newLob, pch=18, col=treesColours)
# predict growth at each age for one seed source, then draw the prediction line
testdata = data.frame(age=c(5,10,15,20), Seed='301')
y = predict(loblme, newdata=testdata)
lines(y ~ testdata$age)
Recall the jaw bone length versus age data, for which we constructed a nonlinear model. To check model adequacy we test whether the residuals are normally distributed and whether the variance remains constant across ages. We first plot the residuals of the model against age.
#Check the model by looking at the residuals
model2res = resid(model2)
plot(age, model2res, ylab="Residuals", xlab="Age", main="Jaw Bone Length")
abline(0,0)
# Here we can see that we have a horizontal band pattern for the residuals. Thus the assumption of equal variances is likely satisfied.
Now, we check to see if the residuals are normally distributed by using the R function qqnorm()
# A quantile plot, used for checking normality
qqnorm(model2res)
qqline(model2res)
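If you’d like a formal test to complement the plot, the Shapiro-Wilk test is built into R (a large p-value means no evidence against normality):
# formal normality test on the residuals
shapiro.test(model2res)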
The html for this site was created using an R package called R Markdown. It incorporates plain text, LaTeX, R code chunks, and R output, and automatically builds it all into a very nice-looking and readable document (Baumer et al. 2014). The learning curve is not steep at all. We used it to make this website, but you can also generate PDF reports, so it’s handy for making everything reproducible and for sharing results with your colleagues.
It’s hard to cover all the basics of a programming language in this short time, and I’m sure we’ve missed some things. Here are some links to find out more:
If you’d like to go through another intro tutorial, I’d recommend the swirl package. It runs in R, and automatically walks you through interactive examples.
Ben Bolker has written a very nice book on developing ecological models in R, which you can read for free online. He also has posted a variety of worked examples on rpubs.
Google is very useful when learning to write code. If you are getting errors or are not sure how to do something specific in R, chances are someone else has been in the same situation and asked for help online. You’ll often see the Q&A forums on Stack Exchange turn up in search results, and they are an excellent resource.
Here is the code I used to fit the regression of Volume on Girth and Height in the multiple linear regression section:
# fit the linear model
girthHeightlm = lm(Volume ~ Girth + Height + Girth:Height, data=trees)
# examine the model object
summary(girthHeightlm)
##
## Call:
## lm(formula = Volume ~ Girth + Height + Girth:Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5821 -1.0673 0.3026 1.5641 4.6649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.39632 23.83575 2.911 0.00713 **
## Girth -5.85585 1.92134 -3.048 0.00511 **
## Height -1.29708 0.30984 -4.186 0.00027 ***
## Girth:Height 0.13465 0.02438 5.524 0.00000748 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
Here is the code I used to create the 4-panel plot in the multiple linear regression section:
# define a set of heights to predict over by picking indices
idxheight = c(5,12,19,26)
# set up graphical parameters
par(mfrow=c(2,2))
par(mar=c(4,4,1,1))
# make one plot for each height
for (idx in idxheight)
{
# identify height value and corresponding colour
height = treesbyheight$Height[idx]
heightcolour = treesColours[idx]
# plot volume against girth (source data)
plot(Volume ~ Girth, data=treesbyheight, pch=19, col=treesColours)
title(main=paste0('height = ', height), line=-1)
# predict volumes based on height and girths
testdata = data.frame(Girth=girthset, Height=height)
y = predict( girthHeightlm, newdata=testdata)
# add prediction line from the linear model
lines(y ~ girthset, col=heightcolour)
}
Here is the code I used to add a quadratic term to the volume regression in the model selection section:
# create the new quadratic term, add it to the data frame
trees$sqGirth = trees$Girth^2
# fit the model and print info
bigcherrylm = lm(Volume ~ Girth*sqGirth*Height, data=trees)
summary(bigcherrylm)
My code for the 4-panel plot with a quadratic girth term, from the multiple models section:
idxheight = c(5,12,19,26)
# set up graphical parameters
par(mfrow=c(2,2))
par(mar=c(4,4,1,1))
# make one plot for each height
for (idx in idxheight)
{
# identify height value and corresponding colour
height = treesbyheight$Height[idx]
heightcolour = treesColours[idx]
# plot volume against girth (source data)
plot(Volume ~ Girth, data=treesbyheight, pch=19, col=treesColours)
title(main=paste0('height = ', height), line=-1)
# predict volumes based on height and girths
testdata = data.frame(sqGirth=girthset^2, Height=height)
y = predict(smallercherrylm2, newdata=testdata)
# add prediction line from the linear model
lines(y ~ girthset, col=heightcolour)
}
Baumer, Ben, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi, and Nicholas J Horton. 2014. “R Markdown: Integrating a Reproducible Analysis Tool into Introductory Statistics.” ArXiv Preprint ArXiv:1402.1894.
Kung, FH. 1986. “Fitting Logistic Growth Curve with Predetermined Carrying Capacity.” In ASA Proceedings of the Statistical Computing Section, 340–43.
Montgomery, Douglas C, Elizabeth A Peck, and G Geoffrey Vining. 2015. Introduction to Linear Regression Analysis. John Wiley & Sons.
Ryan, Thomas A, Brian L Joiner, Barbara F Ryan, and others. 1976. Minitab Student Handbook. Duxbury Press.