A flower grower decided to investigate the effects of watering and the amount of shade on the number of saleable carnation blooms produced in his nursery.
He designed his experiment to have three levels of watering
(water: once, twice or three times a week) and four levels
of shade (shade: none, 1/4, 1/2 and fully shaded). To
conduct this experiment, he needed to grow the carnation plots in three
different beds. In case these beds differed in fertility or other
important features, he decided to use these beds as blocks
(bed).
After four weeks, he analysed the data by counting the number of
blooms, and using the square root\(^{*)}\) as the response variable,
sqblooms. He then fitted a LM with three categorical
explanatory variables (EVs): bed, water and
shade.
\(^{*\ }\)Note for the curious: the square root transformation is commonly used as a ‘variance-stabilising transformation’ for count data — see this discussion on cossvalidated.
The data for this experiment are in file blooms.csv. In
my analysis, I’ve loaded the data into data frame
Blooms.
If you run summary (or skim or some other
summary tool) on the data, you’ll note that the categorical EVs are
again coded by numbers, thus:
bed: 1, 2, 3;water: 1, 2, 3 for once, twice, three times a
week;shade: 1, 2, 3, 4 for no shade, 1/4, 1/2 and fully
shaded;So you’ll need to explicitly declare them as factors. As usual, we’ll be using sum contrasts rather than the default treatment contrasts — so don’t forget to set up the contrasts as well.
First, I want you to assess whether the experimental design that the farmer has come up with is orthogonal. You could assess orthogonality from tables of the number of replicates in each block / treatment combination, but this can get quite tedious. Instead, we will use ANOVA results to assess orthogonality.
I want you to use two different ways of checking for orthogonality by ANOVA.
anova function in base R.Remember that the anova function of R gives
sequential SSQ.
For this assessment of orthogonality, you need to fit (at least) two version of the model with the same full set of EVs, but changing their order in the model equation. You could do this like so:
m.1 <- lm(sqblooms ~ bed + water + shade, data = Blooms)
m.1a <- lm(sqblooms ~ water + shade + bed, data = Blooms)
# or, simpler and more elegantly, you could `update` the model `m.1`:
m.1a <- update(m.1, . ~ water + shade + bed)I have switched the order from bed+water+shade to
water+shade+bed — you could switch it to something else,
since with three EVs, there are six possible orders.
Then, you need to compare the sequential ANOVA tables from the two versions of the model.
TASK: Describe your assessment of orthogonality based on this comparison of ANOVA results.
Anova from the car packageThe Anova function from the car package
makes it easy to get adjusted SSQ for all EVs in one single ANOVA table
— this is also known as a “Type II” ANOVA. However, Anova
provides only the adjusted SSQ, not the sequential SSQ.
In this orthogonality check, you need to compare the sequential SSQ
from anova with the adjusted SSQ from the
Anova function on the same model (e.g.,
m.1).
TASK: Describe your assessment of orthogonality based on this comparison of ANOVA results.
MCQ1: Which one of the following ANOVA results is diagnostic of orthogonality (i.e., the statement holds true only if orthogonality holds)?
[_] All EVs have the same degrees of freedom.
[_] The sequential SSQ match the residual SSQ for all EVs.
[_] The order of the EVs does not affect their adjusted SSQ.
[_] The order of the EVs does not affect their sequential SSQ.
[_] The order of the EVs does not affect the residual SSQ.
MCQ2: What is your assessment of the orthogonality
of the explanatory variables (EVs) in the Blooms
dataset?
[_] The EVs are fully orthogonal.
[_] The EVs are slightly non-orthogonal.
[_] The EVs are strongly non-orthogonal.
The flower grower finds blocking a hassle when it comes to designing and analysing this sort of experiments. He wonders whether it has been worth his while treating the beds as blocks, or whether future experiments could simply be fully randomised across beds.
MCQ3: Based solely on the ANOVA results from the
model with all three EVs (sqblooms ~ bed + water + shade),
what would your advice be with regard to blocking for bed
in the future?
The decision to abandon blocking for bed and switch to a
fully randomised design in the future would be…
[_] …of no further consequence, because the design is orthogonal.
[_] …ill-advised, because the \(P\)-value forbedis very small.
[_] …justified, becausewaterandshadeboth have small \(P\)-values.
To assess whether blocking for bed was worthwhile, the farmer
repeated the analysis of the data without using bed as a
blocking factor. The logic of this is that we can pretend that blocking
did not happen and see how much — if any — extra information we got from
doing a blocked design.
You should now do the same — compare the results from two models:
bed as the blocking factor; andbed) from the
analysis.To fully appraise the potential advantages of blocking, you need to
compare both the ANOVA results (use Anova for easy adjusted
SSQ) and the parameter estimates and their associated
standard errors (use summary).
TASK: Give your assessment of whether blocking was
worthwhile in this experiment, and describe the insights that dropping
bed from the model has provided regarding the merit of
blocking.
MCQ4: Which one of the following statements applies
after removing the blocking factor bed from the model?
[_] The residual Mean Squares decreased.
[_] The residual DoF decreased.
[_] The \(F\)-ratios for the two treatments increased.
[_] The values of the estimated model coefficients changed.
[_] The model coefficients were estimated with less precision.
The flower grower then discovered that a visitor had picked
carnations from three of his experimental plots. He decided that because
the final bloom counts for these plots were inaccurate, he would exclude
them from his analysis. So he produced a new, smaller subset of the data
(provided in the file blooms-2.csv) from which the data
from the picked plots have been removed.
TASK: Use the ANOVA-based approach to reassess
orthogonality in this subset of data (no need to do both versions of the
ANOVA orthogonality check — go with the second one that uses
Anova). Also comment on whether this new analysis
fundamentally alters your conclusions.
MCQ5: How has the removal of some observations from
the Blooms dataset affected the analysis?
[_] The design is no longer fully orthogonal.
[_] The evidence for an effect ofshadeis no longer compelling.
[_] The argument for the importance of blocking is no longer compelling.
[_] The results of the analysis become fundamentally misleading.
This exercise would be really useful if you have time to do it:
TASK: Use the estimates of the coefficients from the
full model (sqblooms ~ bed + water + shade) to create plots
that illustrate how the expected number of blooms varies
This part of the practical relates to two important aspects of working with LMs:
This requires some R craft in addition to understanding what the
model coefficients mean. To help you with the R craft, I’ll walk you
through how I do this for water, and you then do the same
for shade also.
We get the coefficients from the model fit like this:
## (Intercept) bed1 bed2 water1 water2 shade1
## 4.02902778 0.06197222 0.38047222 -0.41102778 0.37313889 0.09652778
## shade2 shade3
## 0.29341667 -0.11913889
Now we need to remember what the model coefficients actually mean, and how we get from the coefficients to the means that our model has estimated for the different groups.
For plotting how watering levels affect the expected bloom count, we want the estimates to be kind of “independent of the amount of shade” and “regardless of the bed” — but of course shade does affect bloom count, and the beds differ too, so what to do? The most useful prediction of the bloom counts are for an average amount of watering, and in a ‘typical’ average bed.
NOTE: Now it pays off that we have used sum contrasts for all EVs: Remember that with sum contrasts, all estimated effects are relative to the grand mean \(\mu\).
Therefore, instead of working out the answer to “What is the expected bloom count in bed 2 if watered twice a week in full shade?” (which would require coefficients for all EVs), here we can simply drop the other EVs from the calculation:
bed and shade; likewise,Can you see why this would be trickier if we had used treatment contrasts?
To get the values of the individual coefficients out of
cf you have two options. Either use numerical indexing, or
use the associated names. E.g., to get the value of the mean in the
‘watered once’ group:
## (Intercept)
## 3.618
## (Intercept)
## 3.618
Numerical indexing makes for compact but unreadable code, and it is easy to mis-count the index and thus mis-calculate. I’d use the names.
For subsequent plotting by ggplot, it is useful to
collect and organise your calculations into a new data frame, like
this:
water.pred <- data.frame(
water = factor(c("once", "twice", "thrice"),
# the next bit is just to ensure order along x-axis when plotting
levels = c("once", "twice", "thrice")),
sqbloom = c(
cf["(Intercept)"] + cf["water1"],
cf["(Intercept)"] + cf["water2"],
cf["(Intercept)"] - cf["water1"] - cf["water2"]
)
)Normally, I would not be seen dead using bar plots, but we’ll make an exception here…
library(ggplot2)
# I'm saving the beginnings of the plot in `p` and add the `geometry` later,
# so that we can reuse the plot using different plot styles:
p <- ggplot(water.pred, aes(x=water, y=sqbloom)) +
labs(x="Weekly watering regime", y="bloom count (square root)")
p + geom_col() # this is ggplot2 function for `bar charts`TASKS:
water.shade.