I’ve again come up with a fictive scenario that has the same structure as what you do in the practical.
Research question: A study was conducted into the effect of tree blessing by village priests or shamans on the yield of plum trees. Fifty plum trees were sampled at random from a ten thousand hectare area in Northern England. For each tree, the following variables were recorded:
plums: The weight of plums harvested from a plum tree, in \(\mathrm{kg}\).
tree.h: The height of the tree, in \(\mathrm m\).
blessed: A categorical variable, coded as
1 if the tree has NOT been blessed by the village priest / shaman,
2 if it has been blessed.
var: A categorical variable indicating the plum variety as
1 for David,
2 for Royal, and
3 for Supreme.
2 Consider the study aims and design
NoteReminder: what to consider
From the information in the previous section, make sure that you can answer the following questions:
What is the primary research question, i.e., the aim of the study?
What is the dependent variable (DV)?
What is the explanatory variable (EV) of primary interest?
Why is it sensible that the other EVs were also recorded?
NoteBB TEST Q1
Select all variables to which the term “confounding” can sensibly be applied in the context of this study.
TipInterpretation
The primary research question is whether blessing trees affects their yield.
The DV is plum yield (plums).
The EV of primary interest is tree-blessing (blessed). Finding out whether plum yield depends on the variety or the height of the tree was not the aim of the study.
However, we would very much expect the yield to differ between varieties, and also to depend on the size (height) of the tree: variety and tree height and variety are expected to confound any effect that tree blessing may have. For this reason, variety and height were recorded for all trees. In the analysis, we can then control for the confounding effects of variety and height.
plums tree.h blessed var
Min. : 9.661 Min. :0.05187 Min. :1.00 Min. :1.00
1st Qu.:27.871 1st Qu.:2.67376 1st Qu.:1.00 1st Qu.:1.00
Median :34.588 Median :3.83299 Median :2.00 Median :2.00
Mean :33.877 Mean :3.49108 Mean :1.66 Mean :1.96
3rd Qu.:41.152 3rd Qu.:4.35685 3rd Qu.:2.00 3rd Qu.:3.00
Max. :45.737 Max. :5.00000 Max. :2.00 Max. :3.00
Like in the data that you have, numbers are used to indicate the categories (factor levels) of the categorical variables (factors) – here, plum variety var (1, 2, 3) and shaman blessing blessed (1, 2).
So we need to first declare var and blessed as factors, then apply our contrast settings. I’m doing this all at once:
Plums <-within(Plums,{ blessed <-factor(blessed); contrasts(blessed) <- contr.sum var <-factor(var); contrasts(var) <- contr.sum})summary(Plums) # not really needed: sanity check
plums tree.h blessed var
Min. : 9.661 Min. :0.05187 1:17 1:18
1st Qu.:27.871 1st Qu.:2.67376 2:33 2:16
Median :34.588 Median :3.83299 3:16
Mean :33.877 Mean :3.49108
3rd Qu.:41.152 3rd Qu.:4.35685
Max. :45.737 Max. :5.00000
OK, blessed and var are now shown as factors.
3.2 Look at the data
NoteTASK
For a first peek, I want you to just quickly check graphically whether there is a relationship between the DV (plum yield) and the continuous EV (tree height), and if so, what direction the effect takes and whether it looks strong or weak. For this plot, we will ignore the categorical EVs (variety and blessing status). Remember that such a ‘univariate’ analysis can potentially be misleading and needs to be followed up by a multivariate analysis (which you will carry out below).
1. How strong is the evidence that the plum yield depends upon whether a tree has been blessed? In what direction is the effect? — The evidence that the yield of a plum tree depends on whether it has been blessed is weak (\(P=0.0884\)): we have no strong evidence against the Null, i.e., that the true effect isn’t zero.
For effect directions, we need to look at the estimated coefficients. Recall that with sum contrasts, the coefficients are the deviations of the group means from the grand mean. The coefficient blessed1 (which is for un-blessed trees) is negative (-0.509), so the direction of the effect estimate is for blessed trees to have slightly higher yields. But we should not make a statement about the effect of tree-blessing on yield, because (given the ANOVA result) we have little evidence that there is an effect – and if there is an effect, it may in fact be in the other direction… we just can’t tell. Therefore, the correct way to report the result is as a statement about the sample rather than about the (general) effect of blessing trees:
“In our sample of trees, plum yields were on average higher for trees that had been blessed.”
2. How strong is the evidence that plum variety affects yield? Which varieties are associated with the highest and lowest yield? — The evidence that plum variety affects yield is very strong (\(P=1.92\times 10^{-7}\)). See below for how to work out what varieties have the highest / lowest yields.
3. What is the estimated effect of an additional metre of height on a tree’s yield? — This one is super easy – no calculation needed! You just need to recall that, for continuous EVs, the coefficient gives the estimated change in the mean of DV per unit change in the EV. So the coefficient for tree.h already gives the estimated mean change in plums per 1 m change in tree height: plum yields increase by 6.75 kg for each additional meter tree-height.
NoteBB TEST Q3
Which of these statements most accurately describes the evidence that the yield of a plum tree depends upon whether it has been blessed?
TipInterpretation
There is some evidence for an effect \((0.1 > P \geq 0.01)\).
NoteBB TEST Q4
Which variety is associated with the highest and lowest plum yield, respectively?
For this, we need to look at the coefficients (effect estimates). Let’s print them again, and remind ourselves how the varieties were coded:
# var = 1: David# var = 2: Royal# var = 3: Supreme
Important
Recall that we are using sum contrasts. Therefore, the coefficients give the deviations from the mean.
TipInterpretation
The effect estimate for Royal (var2) is larger than for David (var1) and both are positive, which means that the estimate for Supreme, which is not directly provided in the output, must be negative since all three must sum to zero. So Royal has the highest yield, and Supreme the lowest. If this doesn’t make sense, see the “Contrasts deep-dive” box below.
4 Make Predictions!
Assume we want to predict the plum yield in two scenarios:
for an un-blessed tree, 4 m high, of variety ‘Supreme’;
for a blessed tree, also 4 m high, but of variety ‘David’.
NoteTASK
Calculate the predicted plum yield for the two above scenarios.
Instead of using R like a pocket calculator, we save the estimates of the model coefficients into a vector variable – e.g., call it b (for ‘betas’):
b <-coef(my.m1) # store the coefficientsb # print them, to see what they look like
Scenario 1: Un-blessed tree, 4 m high, of variety ‘Supreme’:
b["(Intercept)"] +# intercept b["tree.h"]*4+# tree is 4 m high-b["var1"] -b["var2"] +# variety is 'Supreme' (think "neither David nor Royal") b["blessed1"] # un-blessed
(Intercept)
34.11938
# Or, using numerical indexing instead: compact, but can be confusingb[1] + b[2]*4+ (-b[3]-b[4]) + b[5]
(Intercept)
34.11938
Scenario 2: Blessed tree, also 4 m high, but of variety ‘David’:
How do we know whether to use var1 or var2 and how do we get to the + and - signs for the factor levels?
The first thing to note is that, under the bonnet, a LM treats categorical variables (factors) the same as continuous variables – this is why we say, ‘all LMs are just regressions.’ So let’s start with reminding ourselves how it works for our continuous variable, tree.h: we multiply the effect estimate (\(\beta_i\)) with the value of the variable (\(x_i\)):
b["tree.h"] *4
For a factor, we also multiply the effect estimate \(\beta_i\) with ‘the value of the variable’, but because we cannot multiply \(\beta_i\) with “David”, we use the contrast coding for “David” instead. Here is how the contrasts look like for our plum varieties. Don’t confuse the contrast coding in the model with the silly numeric coding of the varieties in the data!
Variety
in the data
contrast
in the model
David
1
(1, 0)
\(x_1=1, x_2=0\)
Royal
2
(0, 1)
\(x_1=0, x_2=1\)
Supreme
3
(-1, -1)
\(x_1=-1, x_2=-1\)
So, variety is internally represented by two variables (\(x_1, x_2\)) – see why a three-level factor needs two DoF? And the effect of variety is modelled / represented as
\[\beta_1\times x_1+\beta_2\times x_2\] See how this is the same as for continuous variables, in that “we multiply the effect estimate (\(\beta_i\)) with the value of the variable (\(x_i\))”? For variety “David,” \(x_1=1, x_2=0\), and we get
b["var1"]*1+ b["var2"]*0# which is of course justb["var1"]
For variety “Supreme,” \(x_1=-1, x_2=-1\) (which I like to think of as “neither David nor Royal”), so:
b["var1"]*-1+ b["var2"]*-1# which is the same as-(b["var1"] + b["var2"])
4.1 Meet the R predict function
NoteBB TEST Q5
What is your prediction for plum yield for a 4 m tree of variety ‘Supreme’ that hasn’t been blessed? Use the signif() function to round your results to 4 “significant digits” (that’s 4 digits, regardless of where the decimal point is, e.g. 0.1234, 1.234, 12.34…).
pred.dat <-data.frame(blessed =factor(1), # 1 is 'not blessed'tree.h =4, # tree is 4 m highvar =factor(3) # 3 is 'Supreme' )signif(predict(my.m1, newdata = pred.dat), 4)
1
34.12
So that’s 34.12 kg of plums.
NoteTASK
Modify the code to get predicted yields for both of the above scenarios in one step – by this, I mean, in a single call of predict(m, newdata = pred.dat).
To get both predictions in one step, we simply make a data frame that has both scenarios in it. Put more technically, it has two rows. To make the rows, we need to put two values into each EV — for this, the two values need to be inside a c( ).
pred.dat <-data.frame(blessed =factor(c(1, 2)), # not blessed, blessedtree.h =c(4, 4), # both are 4 m var =factor(c(3, 1)) # 3 is 'Supreme', 1 is 'David' )signif(predict(my.m1, newdata = pred.dat), 4)
1 2
34.12 38.03
4.2 SSQ Again!
NoteTASK
Compare the unadjusted and adjusted sums of squares for [blessing, in this demo example] and interpret your findings — what does this indicate from a technical statistical perspective, and what does this translate into in this study?
Important
Recall that to get ‘fully unadjusted’ SSQ for a variable, we need to put it first in the model formula.
my.m2 <-lm(plums ~ blessed + var + tree.h, Plums)anova(my.m1)
Analysis of Variance Table
Response: plums
Df Sum Sq Mean Sq F value Pr(>F)
tree.h 1 3684.9 3684.9 1036.4364 < 2.2e-16 ***
var 2 167.2 83.6 23.5104 1.023e-07 ***
blessed 1 10.8 10.8 3.0327 0.08844 .
Residuals 45 160.0 3.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(my.m2)
Analysis of Variance Table
Response: plums
Df Sum Sq Mean Sq F value Pr(>F)
blessed 1 377.91 377.91 106.29 1.982e-13 ***
var 2 917.42 458.71 129.02 < 2.2e-16 ***
tree.h 1 2567.49 2567.49 722.15 < 2.2e-16 ***
Residuals 45 159.99 3.56
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TipInterpretation
The unadjusted SSQ for blessing are much larger than the adjusted SSQ. So, much of the variation in plum yield that is seemingly attributable to blessing trees is in fact explained by differences in tree height and variety (the blessed trees in our sample must on average have been bigger and/or of a more productive variety). If we did not adjust for the effects of tree height and variety, we’d wrongly conclude that our data provide very strong evidence that blessing has an effect. The correct analysis provides very little evidence that blessing has any effect.
NoteBB TEST Q6
Which of the following statements best describes your findings regarding the unadjusted and adjusted SSQ for blessed?
TipInterpretation
Adjusting for variety and tree height reduces the SSQ of blessed.
Much of the apparent effect of tree-blesing is therefore attributable to differences in tree variety and height.
5 Confidence Intervals (Optional)
TASK: Give a 95% confidence interval for the effect of an additional metre of tree height on plum yield. For this, you will need to remind yourself how to calculate confidence intervals from standard errors (e.g., Grafen & Hails, p.14–15).
In this, \(\bar{y}\) is the estimate of the mean, and \(\frac{s}{\sqrt n}\) is the standard error (SE) of that estimate. So, to get a CI from mean \(\pm\) SE, the one extra thing we need is the critical \(t\)-value \((t_\mathrm{crit})\).
You get the mean and SE from the model summary. You could simply write down the values. However, to avoid having to write down the values, I extract and save the table of coefficients and SEs from the model summary using coef, like this:
s <-summary(my.m1)cf <-coef(s) ; signif(cf, 3) # save for later ; and print.
So all we now need is \(t_\mathrm{crit}\). R has a built-in function qt to calculate the quantiles of the \(t\)-distribution. The degrees of freedom in this case are the residual degrees of freedom for our model, which is 45. For a 95% CI, we need to cover the \(t_{45}\)-distribution between the 2.5 percentile and the 97.5 percentile; this gives the following range:
qt(c(0.025, 0.975), df =45) # 2.5% and 97.5% percentiles
[1] -2.014103 2.014103
The \(t\)-distribution is symmetric around zero, and so is our CI… this is why we need only one \(t_\mathrm{crit}\) value rather than two, and is where the \(\pm\) in the equation comes from. So, for a 95% CI, \(t_\mathrm{crit}\) is the 97.5 percentile:
t.crit <-qt(0.975, df =45) # crit. t-value for 95% CI
Putting this together, we get
\[\mathrm{CI} = 6.75 \pm 0.251 \times 2.01\]
We could do this by hand, or — much better — do the entire calculation with full precision in R (this is why I saved the coefficients table, rather than writing down the numbers):
NOTE: This code cleverly exploits a particular R behaviour: if we feed a vector of two values (the c(-t.crit, t.crit) part of the code) into an expression that otherwise has only a single value, R will expand the whole expression to return a vector with the two values.
The confidence interval for the change in plum yield per additional meter tree height is \([6.24, 7.25]\).