1 Growing Carnations: the Data

A flower grower decided to investigate the effects of watering and the amount of shade on the number of saleable carnation blooms produced in his nursery.

He designed his experiment to have three levels of watering (water: once, twice or three times a week) and four levels of shade (shade: none, 1/4, 1/2 and fully shaded). To conduct this experiment, he needed to spread the carnation plots across three different beds. In case these beds differed in fertility or other important features, he decided to use these beds as blocks (bed).

After four weeks, he counted the number of blooms in each plot and used the square root\(^{*}\) of the counts as the response variable, sqblooms. He then fitted an LM with three categorical explanatory variables (EVs): bed, water and shade.

\(^{*}\)Note for the curious: the square root transformation is commonly used as a ‘variance-stabilising transformation’ for count data — see this discussion on CrossValidated.

1.1 Setting up for the analysis

The data for this experiment are in file blooms.csv. In my analysis, I’ve loaded the data into data frame Blooms.
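For example, a minimal sketch (assuming blooms.csv is in your working directory):

Blooms <- read.csv("blooms.csv")
str(Blooms)  # quick check of what was read in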

If you run summary (or skim or some other summary tool) on the data, you’ll note that the categorical EVs are again coded by numbers, thus:

  • bed: 1, 2, 3;
  • water: 1, 2, 3 for once, twice, three times a week;
  • shade: 1, 2, 3, 4 for no shade, 1/4, 1/2 and fully shaded.

So you’ll need to explicitly declare them as factors. As usual, we’ll be using sum contrasts rather than the default treatment contrasts — so don’t forget to set up the contrasts as well.
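One way to set this up (a sketch; declare the contrasts however you have done in previous practicals):

Blooms$bed   <- factor(Blooms$bed)
Blooms$water <- factor(Blooms$water)
Blooms$shade <- factor(Blooms$shade)

# use sum contrasts for all (unordered) factors from here on:
options(contrasts = c("contr.sum", "contr.poly"))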

2 Orthogonality Check

First, I want you to assess whether the experimental design that the flower grower has come up with is orthogonal. You could assess orthogonality from tables of the number of replicates in each block / treatment combination, but this can get quite tedious. Instead, we will use ANOVA results to assess orthogonality.
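For completeness, such replicate tables are easy enough to produce; it is reading and cross-checking all of them that gets tedious:

xtabs(~ bed + water, data = Blooms)
xtabs(~ bed + shade, data = Blooms)
xtabs(~ water + shade, data = Blooms)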

I want you to use two different ways of checking for orthogonality by ANOVA.

2.1 Using just the anova function in base R

Remember that the anova function of R gives sequential SSQ.

For this assessment of orthogonality, you need to fit (at least) two versions of the model with the same full set of EVs, but changing their order in the model equation. You could do this like so:

m.1 <- lm(sqblooms ~ bed + water + shade, data = Blooms)
m.1a <- lm(sqblooms ~ water + shade + bed, data = Blooms)

# or, more simply and elegantly, you could `update` the model `m.1`:
m.1a <- update(m.1, . ~ water + shade + bed)

I have switched the order from bed+water+shade to water+shade+bed — you could switch it to something else, since with three EVs, there are six possible orders.

Then, you need to compare the sequential ANOVA tables from the two versions of the model.
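In code, the comparison is simply (using the model names from above):

anova(m.1)   # sequential SSQ: bed, then water, then shade
anova(m.1a)  # sequential SSQ: water, then shade, then bed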

TASK: Describe your assessment of orthogonality based on this comparison of ANOVA results.

2.2 Using Anova from the car package

The Anova function from the car package makes it easy to get adjusted SSQ for all EVs in one single ANOVA table — this is also known as a “Type II” ANOVA. However, Anova provides only the adjusted SSQ, not the sequential SSQ.

In this orthogonality check, you need to compare the sequential SSQ from anova with the adjusted SSQ from the Anova function on the same model (e.g., m.1).
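A minimal sketch of this comparison (note the capital ‘A’ in car’s Anova versus the lower-case ‘a’ in base R’s anova):

library(car)

anova(m.1)  # sequential SSQ
Anova(m.1)  # adjusted ("Type II") SSQ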

TASK: Describe your assessment of orthogonality based on this comparison of ANOVA results.

MCQ1: Which one of the following ANOVA results is diagnostic of orthogonality (i.e., the statement holds true only if orthogonality holds)?

[_] All EVs have the same degrees of freedom.
[_] The sequential SSQ match the residual SSQ for all EVs.
[_] The order of the EVs does not affect their adjusted SSQ.
[_] The order of the EVs does not affect their sequential SSQ.
[_] The order of the EVs does not affect the residual SSQ.

MCQ2: What is your assessment of the orthogonality of the explanatory variables (EVs) in the Blooms dataset?

[_] The EVs are fully orthogonal.
[_] The EVs are slightly non-orthogonal.
[_] The EVs are strongly non-orthogonal.

3 Was Blocking Worthwhile?

The flower grower finds blocking a hassle when it comes to designing and analysing this sort of experiment. He wonders whether it has been worth his while treating the beds as blocks, or whether future experiments could simply be fully randomised across beds.

3.1 A first answer…

MCQ3: Based solely on the ANOVA results from the model with all three EVs (sqblooms ~ bed + water + shade), what would your advice be with regard to blocking for bed in the future?

The decision to abandon blocking for bed and switch to a fully randomised design in the future would be…

[_] …of no further consequence, because the design is orthogonal.
[_] …ill-advised, because the \(P\)-value for bed is very small.
[_] …justified, because water and shade both have small \(P\)-values.

3.2 A deeper look…

To assess whether blocking for bed was worthwhile, the flower grower repeated the analysis of the data without using bed as a blocking factor. The logic of this is that we can pretend that blocking did not happen and see how much — if any — extra information we got from doing a blocked design.

You should now do the same — compare the results from two models:

  • the full model, which matches the actual experimental design and therefore analyses the experiment as a randomised block design with bed as the blocking factor; and
  • a model that analyses the same experiment as a fully randomised design by omitting the blocking factor (bed) from the analysis.

To fully appraise the potential advantages of blocking, you need to compare both the ANOVA results (use Anova for easy adjusted SSQ) and the parameter estimates and their associated standard errors (use summary).
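A sketch of how this might look (m.2 is just my name for the no-blocking model):

m.2 <- update(m.1, . ~ . - bed)  # drop the blocking factor

Anova(m.1)    # adjusted SSQ, with blocking
Anova(m.2)    # adjusted SSQ, without blocking

summary(m.1)  # coefficients and SEs, with blocking
summary(m.2)  # coefficients and SEs, without blocking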

TASK: Give your assessment of whether blocking was worthwhile in this experiment, and describe the insights that dropping bed from the model has provided regarding the merit of blocking.

MCQ4: Which one of the following statements applies after removing the blocking factor bed from the model?

[_] The residual Mean Squares decreased.
[_] The residual DoF decreased.
[_] The \(F\)-ratios for the two treatments increased.
[_] The values of the estimated model coefficients changed.
[_] The model coefficients were estimated with less precision.

4 The Invasion of the Flower Thief

The flower grower then discovered that a visitor had picked carnations from three of his experimental plots. He decided that because the final bloom counts for these plots were inaccurate, he would exclude them from his analysis. So he produced a new, smaller subset of the data (provided in the file blooms-2.csv) from which the data from the picked plots have been removed.

TASK: Use the ANOVA-based approach to reassess orthogonality in this subset of data (no need to do both versions of the ANOVA orthogonality check — go with the second one that uses Anova). Also comment on whether this new analysis fundamentally alters your conclusions.
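A sketch of the steps (again assuming the file is in your working directory; remember to re-declare the factors in the new data frame):

Blooms2 <- read.csv("blooms-2.csv")
Blooms2$bed   <- factor(Blooms2$bed)
Blooms2$water <- factor(Blooms2$water)
Blooms2$shade <- factor(Blooms2$shade)

m.3 <- lm(sqblooms ~ bed + water + shade, data = Blooms2)
anova(m.3)  # sequential SSQ
Anova(m.3)  # adjusted SSQ: do the two still agree?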

MCQ5: How has the removal of some observations from the Blooms dataset affected the analysis?

[_] The design is no longer fully orthogonal.
[_] The evidence for an effect of shade is no longer compelling.
[_] The argument for the importance of blocking is no longer compelling.
[_] The results of the analysis become fundamentally misleading.

5 Bonus: Plot the Model

This exercise would be really useful if you have time to do it:

TASK: Use the estimates of the coefficients from the full model (sqblooms ~ bed + water + shade) to create plots that illustrate how the expected number of blooms varies

  • with watering and
  • with the level of shade.

This part of the practical relates to two important aspects of working with LMs:

  1. It will consolidate your understanding of ‘model parametrisation’ — that is, how the model describes the data and how the coefficients relate to the data.
  2. It highlights that LMs are about much more than just \(P\)-values — they are about estimation of effects. The effect estimates can be shown as plots — this is different from plotting the data. Such visual model ‘explication’ is often much more helpful than just showing a table of coefficients, particularly as models get more complex.

This requires some R craft in addition to understanding what the model coefficients mean. To help you with the R craft, I’ll walk you through how I do this for water, and you can then do the same for shade.

5.1 Know your coefficients

We get the coefficients from the model fit like this:

cf <- coef(m.1)
cf  # print to see what is what
## (Intercept)        bed1        bed2      water1      water2      shade1 
##  4.02902778  0.06197222  0.38047222 -0.41102778  0.37313889  0.09652778 
##      shade2      shade3 
##  0.29341667 -0.11913889

Now we need to remember what the model coefficients actually mean, and how we get from the coefficients to the means that our model has estimated for the different groups.

For plotting how watering levels affect the expected bloom count, we want the estimates to be kind of “independent of the amount of shade” and “regardless of the bed” — but of course shade does affect bloom count, and the beds differ too, so what to do? The most useful predictions of the bloom counts are for an average amount of shade, in a ‘typical’ average bed.

NOTE: Now it pays off that we have used sum contrasts for all EVs: Remember that with sum contrasts, all estimated effects are relative to the grand mean \(\mu\).

Therefore, instead of working out the answer to “What is the expected bloom count in bed 2 if watered twice a week in full shade?” (which would require coefficients for all EVs), here we can simply drop the other EVs from the calculation:

  • mean bloom in ‘watered once’ group = \(\mu + \alpha_1\), i.e., simply not including the coefficients for bed and shade; likewise,
  • mean bloom in ‘watered twice’ group = \(\mu + \alpha_2\), and
  • mean bloom in ‘watered thrice’ group = \(\mu - \alpha_1 - \alpha_2\).

Can you see why this would be trickier if we had used treatment contrasts?

5.2 The R craft side of things

To get the values of the individual coefficients out of cf you have two options. Either use numerical indexing, or use the associated names. E.g., to get the value of the mean in the ‘watered once’ group:

cf[1] + cf[4]
## (Intercept) 
##       3.618
cf["(Intercept)"] + cf["water1"]
## (Intercept) 
##       3.618

Numerical indexing makes for compact but unreadable code, and it is easy to mis-count the index and thus mis-calculate. I’d use the names.

For subsequent plotting by ggplot, it is useful to collect and organise your calculations into a new data frame, like this:

water.pred <- data.frame(
  water = factor(c("once", "twice", "thrice"),
                 # the next bit is just to ensure order along x-axis when plotting
                 levels = c("once", "twice", "thrice")),
  sqbloom = c(
    cf["(Intercept)"] + cf["water1"],             
    cf["(Intercept)"] + cf["water2"],
    cf["(Intercept)"] - cf["water1"] - cf["water2"]
  )
)

Normally, I would not be seen dead using bar plots, but we’ll make an exception here…

library(ggplot2)

# I'm saving the beginnings of the plot in `p` and adding the `geom` later,
# so that we can reuse the plot with different plot styles:
p <- ggplot(water.pred, aes(x=water, y=sqbloom)) + 
  labs(x="Weekly watering regime", y="bloom count (square root)")

p + geom_col()  # this is the ggplot2 function for bar charts

p + geom_point(size=4) + geom_line(group=1)

5.3 Over to you!

TASKS:

  1. Replicate my steps and create the plot for water.
  2. Create the corresponding plot for shade.