Multiple linear regression

Similar to ANOVA, linear regression models can handle multiple explanatory variables. When we have two or more predictors we speak of multiple linear regression. With two predictors, we can still picture the relationship with the response variable in a 3-dimensional space. With more predictors, however, we are looking at a multi-dimensional space, which is hard for us to imagine.
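In its general form, a multiple linear regression model with \(p\) predictors can be written as: \[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i \] where \(\varepsilon_i\) denotes the normally distributed residual error.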
Let’s start with an example with two predictors using the longley macroeconomic data (shipped with base R in the datasets package). We will use the plotly package to produce a 3D figure of this data; for more information on how to handle the 3D image in the plotting window see https://plot.ly/r/reference/.

NOTE that currently 3D figures cannot be saved into a Word document from RMarkdown, so you can only try out the 3D image rendering by running the code live (remove the ‘#’ symbol where I muted the code starting with plot_ly(...) below)!

## Load packages and data
library(car)
data(longley)

## Data exploration
?longley
head(longley)
str(longley)
## 'data.frame':    16 obs. of  7 variables:
##  $ GNP.deflator: num  83 88.5 88.2 89.5 96.2 ...
##  $ GNP         : num  234 259 258 285 329 ...
##  $ Unemployed  : num  236 232 368 335 210 ...
##  $ Armed.Forces: num  159 146 162 165 310 ...
##  $ Population  : num  108 109 110 111 112 ...
##  $ Year        : int  1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 ...
##  $ Employed    : num  60.3 61.1 60.2 61.2 63.2 ...
summary(longley)
##   GNP.deflator         GNP          Unemployed     Armed.Forces  
##  Min.   : 83.00   Min.   :234.3   Min.   :187.0   Min.   :145.6  
##  1st Qu.: 94.53   1st Qu.:317.9   1st Qu.:234.8   1st Qu.:229.8  
##  Median :100.60   Median :381.4   Median :314.4   Median :271.8  
##  Mean   :101.68   Mean   :387.7   Mean   :319.3   Mean   :260.7  
##  3rd Qu.:111.25   3rd Qu.:454.1   3rd Qu.:384.2   3rd Qu.:306.1  
##  Max.   :116.90   Max.   :554.9   Max.   :480.6   Max.   :359.4  
##    Population         Year         Employed    
##  Min.   :107.6   Min.   :1947   Min.   :60.17  
##  1st Qu.:111.8   1st Qu.:1951   1st Qu.:62.71  
##  Median :116.8   Median :1954   Median :65.50  
##  Mean   :117.4   Mean   :1954   Mean   :65.32  
##  3rd Qu.:122.3   3rd Qu.:1958   3rd Qu.:68.29  
##  Max.   :130.1   Max.   :1962   Max.   :70.55
## Explore the data in 2D, then create a 3-D plot (not important for the exam but fun!)
library(plotly)
plot(GNP ~ Population, data = longley)

plot(GNP ~ Unemployed, data = longley)

## Funnily enough, in this type of 3D plot, the response is specified as the z-variable.
## Remove the '#' symbol below to run the code in live mode. RMarkdown is not capable of including 3D images.
plot_ly(data = longley, x = ~ Unemployed, y = ~ Population, z = ~ GNP, type = "scatter3d", mode = "markers")


And here are the corresponding multiple regression models without and with an interaction term:

## Without interaction
m1 <- lm(GNP ~ Unemployed + Population, data = longley)
summary(m1)
## 
## Call:
## lm(formula = GNP ~ Unemployed + Population, data = longley)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.468  -5.059   1.479   5.515  11.956 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.392e+03  4.603e+01 -30.244 1.96e-13 ***
## Unemployed  -1.533e-01  3.337e-02  -4.593 0.000504 ***
## Population   1.558e+01  4.483e-01  34.743 3.29e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.781 on 13 degrees of freedom
## Multiple R-squared:  0.9932, Adjusted R-squared:  0.9922 
## F-statistic: 954.4 on 2 and 13 DF,  p-value: 7.88e-15
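Written out, the fitted regression plane of the model without interaction is (coefficients rounded from the summary output): \[ \widehat{GNP} = -1392 - 0.153 \cdot Unemployed + 15.58 \cdot Population \]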
## With interaction
m2 <- lm(GNP ~ Unemployed * Population, data = longley)
summary(m2)
## 
## Call:
## lm(formula = GNP ~ Unemployed * Population, data = longley)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8533  -4.1051   0.2683   3.5188  10.3752 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.716e+03  1.843e+02  -9.310 7.71e-07 ***
## Unemployed             7.737e-01  5.144e-01   1.504   0.1584    
## Population             1.837e+01  1.600e+00  11.479 7.92e-08 ***
## Unemployed:Population -7.908e-03  4.380e-03  -1.805   0.0962 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.105 on 12 degrees of freedom
## Multiple R-squared:  0.9947, Adjusted R-squared:  0.9934 
## F-statistic: 747.9 on 3 and 12 DF,  p-value: 6.631e-14
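With the interaction term included, the model allows the effect of one predictor to depend on the value of the other (here the interaction is only marginally significant, P ≈ 0.096). Differentiating the fitted equation with respect to Unemployed makes this explicit (coefficients rounded from the summary output): \[ \frac{\partial \, \widehat{GNP}}{\partial \, Unemployed} = 0.774 - 0.0079 \cdot Population \] Within the observed Population range (roughly 108 to 130), this slope works out negative throughout.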
## Model diagnostic plots
layout(matrix(1:4, nrow = 2, byrow = T))
par(mar = c(4, 4, 1.5, 1.5), mgp = c(2.3, 0.8, 0))
plot(m2)


No gross violation of the variance homogeneity or normality assumptions. Only the year 1962 sticks out, something to look at a bit more closely perhaps if these were your own data.

We can add the regression plane to our 3D plot, but first we have to compute predictions on a fine grid of the two predictors and then reshape those into a matrix (not required in the exam).

## Create new Unemployed and Population values to predict from
axis_x <- seq(min(longley$Unemployed), max(longley$Unemployed), length.out = 100)
axis_y <- seq(min(longley$Population), max(longley$Population), length.out = 100)

## Model predictions for the regression plane
lm_surface <- expand.grid(Unemployed = axis_x, Population = axis_y)

lm_surface$fit <- predict(m2, newdata = lm_surface)

library(reshape2)
lm_surface <- acast(lm_surface, Unemployed ~ Population, value.var = "fit") # create the surface matrix

## Remove the '#' symbols below to run the code in live mode. RMarkdown is not capable of including 3D images.
plot_ly(data = longley, x = ~ Unemployed, y = ~ Population, z = ~ GNP, type = "scatter3d", mode = "markers") %>% 
add_surface(z = lm_surface, x = axis_x, y = axis_y, opacity = 0.7)


Multicollinearity in multiple regression models

One frequent problem with multiple regression models is a phenomenon called multicollinearity. Multicollinearity occurs when two or more of the continuous predictor variables are highly correlated, i.e. one variable can be linearly predicted from the others with high accuracy. Multicollinearity produces unstable parameter estimates and inflates the variance, and thus the standard errors, of the coefficients (inflated standard errors), resulting in unreliable P-values for the affected coefficients. In practice, this often happens when two or more predictors in a multiple regression model carry largely redundant information. For example, ‘relative humidity’ and ‘temperature’ are often highly correlated and can hence give rise to multicollinearity issues in a regression model. One cannot always tell from the model summary whether multicollinearity is present, but there are two main ways of detecting it:

  1. graphically, using scatterplot matrices (pairs plot), where all explanatory variables are plotted against each other

  2. by using variance inflation factors (VIF) (vif function in R-package car). VIF values larger than 10 (or 5) suggest multicollinearity caused by the respective variable.

VIFs are calculated for all predictors in a regression model and give a measure of how much the variance of an estimated regression coefficient is inflated compared to when the predictor variables are not linearly related. So, a predictor with a VIF of 10 suggests that the variance around its model coefficient is 10 times larger than one would expect if there were no correlation with the other predictors, i.e. in a model without multicollinearity.
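Formally, the VIF of predictor \(j\) is based on the \(R^2\) obtained when regressing this predictor on all remaining predictors in the model: \[ VIF_j = \frac{1}{1 - R_j^2} \]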

These tools can help you decide which explanatory variables to retain in the model (i.e. one would drop the variable(s) with large VIF values). Here are both tools in action:

m3 <- lm(GNP ~ Unemployed + Employed + Armed.Forces + Population, data = longley)
summary(m3)
## 
## Call:
## lm(formula = GNP ~ Unemployed + Employed + Armed.Forces + Population, 
##     data = longley)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7850 -4.3097 -0.6786  3.3328  8.8336 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.339e+03  3.512e+01 -38.137 4.87e-13 ***
## Unemployed    3.430e-04  4.055e-02   0.008  0.99340    
## Employed      9.321e+00  2.397e+00   3.888  0.00253 ** 
## Armed.Forces  8.245e-02  2.955e-02   2.790  0.01759 *  
## Population    9.338e+00  1.504e+00   6.211 6.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.683 on 11 degrees of freedom
## Multiple R-squared:  0.9976, Adjusted R-squared:  0.9967 
## F-statistic:  1145 on 4 and 11 DF,  p-value: 2.513e-14
## Scatterplot matrix
pairs(longley)

## Add correlation information to the pairs plot.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}
head(longley) # check the column positions before excluding the first two columns below
pairs(longley[, -c(1, 2)], lower.panel = panel.smooth, upper.panel = panel.cor)

## Variance inflation factors
library(car)
vif(m3)
##   Unemployed     Employed Armed.Forces   Population 
##     6.668482    32.923691     1.964550    50.814189
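To make this definition concrete, we can reproduce the VIF of Population by hand by regressing it on the remaining predictors and plugging the resulting \(R^2\) into the formula above:

## Manual VIF for Population: regress it on the other predictors
r2 <- summary(lm(Population ~ Unemployed + Employed + Armed.Forces, data = longley))$r.squared
1/(1 - r2) # matches the vif(m3) value for Population (~50.8)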
## Remove population from the model
m4 <- lm(GNP ~ Unemployed + Employed + Armed.Forces, data = longley)
summary(m4)
## 
## Call:
## lm(formula = GNP ~ Unemployed + Employed + Armed.Forces, data = longley)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.2811  -4.7494   0.1353   5.2067  22.6691 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.265e+03  6.708e+01 -18.854 2.78e-10 ***
## Unemployed    2.142e-01  4.352e-02   4.922 0.000353 ***
## Employed      2.369e+01  1.281e+00  18.485 3.49e-10 ***
## Armed.Forces  1.420e-01  5.681e-02   2.500 0.027892 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.55 on 12 degrees of freedom
## Multiple R-squared:  0.9892, Adjusted R-squared:  0.9865 
## F-statistic: 366.3 on 3 and 12 DF,  p-value: 4.638e-12
vif(m4)
##   Unemployed     Employed Armed.Forces 
##     1.859355     2.277019     1.757381

Both the graphical assessment and the VIFs suggest that Population is the main culprit behind the multicollinearity issue. And indeed, removing Population from the model gets rid of the multicollinearity. However, if we are particularly interested in Population, we may instead try to remove one of the other predictors with a large VIF value (here Employed) to see if that alleviates the problem.

## What if we are specifically interested in 'population' though...
m5 <- lm(GNP ~ Unemployed + Population + Armed.Forces, data = longley)
summary(m5)
## 
## Call:
## lm(formula = GNP ~ Unemployed + Population + Armed.Forces, data = longley)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.525  -6.989   1.574   5.657  13.434 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.352e+03  5.160e+01 -26.195 5.86e-12 ***
## Unemployed   -1.142e-01  4.109e-02  -2.780   0.0167 *  
## Population    1.498e+01  5.834e-01  25.677 7.42e-12 ***
## Armed.Forces  6.480e-02  4.308e-02   1.504   0.1584    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.384 on 12 degrees of freedom
## Multiple R-squared:  0.9943, Adjusted R-squared:  0.9929 
## F-statistic: 698.8 on 3 and 12 DF,  p-value: 9.943e-14
vif(m5)
##   Unemployed   Population Armed.Forces 
##     3.146686     3.514335     1.918225

As we can gather from the VIF values < 5, this approach is also valid, allowing us to retain Population in the model.


Analysis of Covariance

If we have a continuous response variable and a mix of continuous and categorical predictors (at least one of each kind), we use a so-called analysis of covariance (ANCOVA) model. In R this type of model can be formulated with the familiar lm function. I simulated a dataset containing the growth rate (mm day\(^{-1}\)) of three different fungal species (growth) as a function of fungicide concentration (conc, in mol L\(^{-1}\)).
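The simulation code itself is not shown; a minimal sketch of how such a dataset could be generated (with illustrative coefficients that roughly match the model fits below, not the original simulation) might look like this:

## Minimal sketch of a possible simulation (illustrative values only)
set.seed(42) # arbitrary seed
conc <- rep(rep(seq(10, 0, by = -2), 3), 3) # 3 replicate series of 6 concentrations per species
species <- factor(rep(c("A", "B", "C"), each = 18))
ints <- c(A = 230, B = 123, C = 156) # species-specific intercepts
slps <- c(A = -18.9, B = -10.5, C = -13.1) # species-specific slopes
growth <- as.numeric(ints[as.character(species)] + slps[as.character(species)] * conc + rnorm(54, sd = 14))
dat <- data.frame(conc, growth, species)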

First, we plot the data and fit a model without interaction term, get the model predictions and add those as regression lines to the plot.

str(dat)
## 'data.frame':    54 obs. of  3 variables:
##  $ conc   : num  10 8 6 4 2 0 10 8 6 4 ...
##  $ growth : num  37.3 76.2 118.9 166.8 158.6 ...
##  $ species: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
plot(growth ~ conc, data = dat, pch = 21, bg = c("black", "white", "grey")[dat$species]) # use 3 different colours to identify the fungal species
legend(x = 8, y = 270, legend = c("A", "B", "C"), pch = 21, pt.bg = c("black", "white", "grey"), lty = c(1, 2, 3), title = "Species", bty = "n")

## ANCOVA without interaction
m1 <- lm(growth ~ conc + species, data = dat)
summary(m1)
## 
## Call:
## lm(formula = growth ~ conc + species, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.194 -10.800  -0.018  12.022  58.072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 206.9652     5.7620  35.919  < 2e-16 ***
## conc        -14.1804     0.7439 -19.063  < 2e-16 ***
## speciesB    -65.8031     6.2236 -10.573 2.38e-14 ***
## speciesC    -45.9073     6.2236  -7.376 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.67 on 50 degrees of freedom
## Multiple R-squared:  0.9058, Adjusted R-squared:  0.9002 
## F-statistic: 160.3 on 3 and 50 DF,  p-value: < 2.2e-16
summary.aov(m1)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## conc         1 126683  126683  363.40  < 2e-16 ***
## species      2  41000   20500   58.81 7.36e-14 ***
## Residuals   50  17430     349                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Create a new data frame containing a fine grid of predictor values (here conc values) to predict from
xv <- with(dat, seq(min(conc), max(conc), length.out = 100)) # fine grid of conc values spanning the original range
newdat <- expand.grid(conc = xv, species = levels(dat$species))
newdat$fits <- predict(m1, newdata = newdat)

## Add the regression lines (model predictions) to the plot
for (i in 1:3){
lines(fits ~ conc, data = newdat[newdat$species == levels(newdat$species)[i], ], lty = c(1, 2, 3)[i])  
}
An ANCOVA model without interaction term allows the intercept to vary with species but imposes a common slope.


Ok, here is the interpretation of the summary(m1) output, which can be verified by comparing the coefficient values with the figure: in general, the intercept is the estimated response value (y-value) when the predictor value (x-value) is zero. In the model summary output, the (Intercept) term represents the intercept of species A, i.e. the growth of species A in the absence of the fungicide, which is about 207 mm day\(^{-1}\). The conc coefficient gives the common slope for all species, which is roughly -14, suggesting that fungal growth decreases by about 14 mm day\(^{-1}\) with a one-unit increase in fungicide concentration. The coefficient speciesB tells us that the intercept of species B is around 66 units lower than the intercept of species A. Likewise, the coefficient speciesC indicates that for this species the intercept is 46 units lower than for species A. This model contains only the main effects of fungicide concentration and species identity but no interaction term. As a result, the model assumes a common slope among the three fungal species.
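We can reconstruct the species-specific intercepts directly from the coefficient vector of m1:

## Species-specific intercepts from the m1 coefficients
b <- coef(m1)
b["(Intercept)"] # intercept of species A (~207)
b["(Intercept)"] + b["speciesB"] # intercept of species B (~141)
b["(Intercept)"] + b["speciesC"] # intercept of species C (~161)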

Is this a valid assumption? Not necessarily.

Theoretically, it could well be that all species share a similar slope, but more often than not this is not the case in reality. Therefore, we allow for separate slopes in our model by incorporating a fungicide \(\times\) species interaction. An interaction between a continuous and a categorical predictor means that the slope is allowed to vary with each level of the categorical predictor variable. In our example, a statistically significant interaction would translate into: the effect of the fungicide on fungal growth rate varies across species.
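In terms of dummy variables (with \(I_B\) and \(I_C\) indicating membership in species B and C, respectively), the interaction model fitted below reads: \[ growth_i = \beta_0 + \beta_1 \, conc_i + \beta_2 \, I_B + \beta_3 \, I_C + \beta_4 \, conc_i \cdot I_B + \beta_5 \, conc_i \cdot I_C + \varepsilon_i \]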

## ANCOVA with interaction term
m2 <- lm(growth ~ conc * species, data = dat)

summary.aov(m2) # Overall significances
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## conc          1 126683  126683  629.90  < 2e-16 ***
## species       2  41000   20500  101.93  < 2e-16 ***
## conc:species  2   7777    3888   19.33 6.94e-07 ***
## Residuals    48   9654     201                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m2)
## 
## Call:
## lm(formula = growth ~ conc * species, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.131  -5.939  -0.960   6.238  34.447 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    230.5909     5.9258  38.913  < 2e-16 ***
## conc           -18.9056     0.9786 -19.319  < 2e-16 ***
## speciesB      -107.8949     8.3804 -12.875  < 2e-16 ***
## speciesC       -74.6925     8.3804  -8.913 9.53e-12 ***
## conc:speciesB    8.4184     1.3840   6.083 1.87e-07 ***
## conc:speciesC    5.7570     1.3840   4.160 0.000131 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.18 on 48 degrees of freedom
## Multiple R-squared:  0.9479, Adjusted R-squared:  0.9424 
## F-statistic: 174.5 on 5 and 48 DF,  p-value: < 2.2e-16
## Model predictions
newdat$fits2 <- predict(m2, newdata = newdat)

## Add predictions to the plot (regression lines)
cols <- c("black", "white", "grey")
plot(growth ~ conc, data = dat, pch = 21, bg = cols[dat$species]) # use 3 different colours to identify the fungal species
legend(x = 8, y = 270, legend = c("A", "B", "C"), pch = 21, pt.bg = cols, lty = c(1, 2, 3), title = "Species", bty = "n")

## Add the regression lines (model predictions) to the plot
for (i in 1:3) {
lines(fits2 ~ conc, data = newdat[newdat$species == levels(newdat$species)[i], ], lty = c(1, 2, 3)[i]) 
}
An ANCOVA model with interaction term allows the intercept AND the slope to vary with species.


Adding the model predictions (regression lines) to the plot nicely visualises the interaction: at least one of the three species shows a steeper slope than the remaining two.

As usual, the summary.aov command gives us an overall significance for the model terms. In case of a significant interaction term, our job is done. In a paper, thesis or report, one would state the overall significance of the interaction term along the lines of: ‘The efficacy of the tested fungicide varied with fungal species, as evidenced by a significant fungicide \(\times\) species interaction (\(F_{2,48}\) = 19.33, P < 0.001).’

If the interaction were not significant, we could safely remove it and simplify the model.

Let’s move on to the summary output, which contains more detailed information but is somewhat harder to interpret.


Here, the (Intercept) is again the intercept for species A, but now the conc coefficient indicates the slope for species A only. The next two coefficients (speciesB and speciesC) give the deviation of the intercepts of the remaining two species from that of species A. Now comes the tricky bit: the coefficient conc:speciesB gives the difference in slopes between species A and species B. Because the linear relationship we see in our plot is decreasing for each of the fungal species, we expect negative slopes, and thus it is clear that the conc:speciesB and conc:speciesC coefficients are not the actual slope values. The slope for species B is calculated by adding the estimate of the conc:speciesB interaction term to the slope of species A (conc), and the same approach is used to obtain the slope for species C: \[ slope_B = -18.9056 + 8.4184 = -10.4872 \] \[ slope_C = -18.9056 + 5.7570 = -13.1486 \]
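Rather than typing the numbers in by hand, we can pull them straight from the coefficient vector:

## Species-specific slopes from the m2 coefficients
b2 <- coef(m2)
b2["conc"] # slope of species A (-18.9056)
b2["conc"] + b2["conc:speciesB"] # slope of species B (-10.4872)
b2["conc"] + b2["conc:speciesC"] # slope of species C (-13.1486)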

As a sanity check, we can extract the regression equation for each of the fungal species and use it to redraw the regression lines. This is basically what the predict function is doing. Hence, if our calculations are correct, the resulting lines should fall right onto the lines we have already plotted using the model predictions. To do so, we also need to add the intercept (species A) or reconstruct it first (species B and C), but that is a piece of cake.


## Predictions for species A: y = intercept + slope x predictor
spec.a <- 230.5909 - 18.9056 * unique(dat$conc) # we use unique() because in our data we have three sets of conc values (one for each species)
lines(x = unique(dat$conc), y = spec.a, col = "blue")

## Predictions for species B (look up coefficients in the summary(m2) output)
spec.b <- (230.5909 - 107.8949) + (-18.9056 + 8.4184) * unique(dat$conc) 
lines(unique(dat$conc), spec.b, col = "blue")

## Predictions for species C (compare with summary(m2) output)
spec.c <- (230.5909 - 74.6925) + (-18.9056 + 5.7570) * unique(dat$conc) 
lines(unique(dat$conc), spec.c, col = "blue")
Our predictions, calculated by hand, perfectly overlay (blue lines) the model predictions derived by the predict function.


The overall significance obtained with the summary.aov(m2) command just tells us that there is a statistically significant difference in slopes, but not whether all slopes differ significantly or just some. The summary(m2) output gives us a bit more information: from it we can conclude that the slopes of species B and C are significantly different from the slope of species A, but we do not know whether the slopes of B and C differ significantly from each other. To test this, we have to follow up with a post-hoc analysis performing pairwise comparisons of the slopes (similar to what we did in the ANOVA case, only here we are comparing slopes instead of group means). For this purpose, we use the emtrends function in the emmeans package like this:

## Post-hoc test comparing the slopes
# Obtain slopes for each fungus ...
library(emmeans)
slopes <- emtrends(m2, specs = "species", var = "conc") 
slopes # compare to the slopes we have calculated by hand above ;-)
##  species conc.trend        SE df  lower.CL   upper.CL
##  A        -18.90555 0.9786178 48 -20.87319 -16.937908
##  B        -10.48718 0.9786178 48 -12.45483  -8.519539
##  C        -13.14852 0.9786178 48 -15.11616 -11.180875
## 
## Confidence level used: 0.95
pairs(slopes)
##  contrast  estimate       SE df t.ratio p.value
##  A - B    -8.418368 1.383975 48  -6.083  <.0001
##  A - C    -5.757033 1.383975 48  -4.160  0.0004
##  B - C     2.661336 1.383975 48   1.923  0.1432
## 
## P value adjustment: tukey method for comparing a family of 3 estimates
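Note that the contrast estimates are simply the differences between the fitted slopes, e.g. \[ slope_A - slope_B = -18.9056 - (-10.4872) = -8.4184 \] \[ slope_B - slope_C = -10.4872 - (-13.1486) = 2.6614 \] which match the estimate column above.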

The pairwise comparisons obtained from the post-hoc test confirm our visual impression: the slope of species A differs significantly from the slopes of species B and C, but the latter two are not significantly different from each other. In a paper or thesis, you could report the small table showing the pairwise comparisons or add a compact annotation to the figure like this:

## Replot with statistical annotation
plot(growth ~ conc, data = dat, pch = 21, bg = cols[dat$species], las = 1, ann = F) # use 3 different colours to identify the fungal species
legend(x = 5, y = 315,  horiz = T, legend = c("A", "B", "C"), pch = 21, pt.bg = cols, lty = c(1, 2, 3), bty = "n", xpd = NA, xjust = 0.5, text.width = 0.8, seg.len = 2.3)

## Axis labels
mtext(expression(paste("Growth (mm d"^-1, ")")), side = 2, line = 2.7)
mtext(expression(paste("Fungicide concentration (mol L"^-1, ")")), side = 1, line = 2.4)

for (i in 1:3) {
lines(fits2 ~ conc, data = newdat[newdat$species == levels(newdat$species)[i], ], lty = c(1, 2, 3)[i]) 
}
text(x = 7, y = c(255, 235, 215), c("A \u2013 B ***", "A \u2013 C ***", expression(paste("B \u2013 C"^" ns"))), adj = c(0, 1)) # \u2013 is the unicode escape for an en dash. We use 'ns' to indicate 'not significant'.
Figure 1 Growth of three fungal species as a function of fungicide concentration. Lines indicate predictions from an analysis of covariance model. The small inset table indicates the results of a multiple comparison procedure on the slopes (Tukey contrasts, \(\alpha\) = 0.05). *** P < 0.001, ns = not significant.