I’ve come up with a fictive scenario that has the same structure as what you do in the practical.
Research question: Does blessing of plum trees by the village priest increase the yield?
To answer this, we want to control for the size of the tree (bigger trees wil bear more plums), and for differences in yield between different plum varieties (David, Royal and Supreme).
Outcome: Weight of plums harvested from a plum tree, in kg (plums)
Predictors:
Height of tree, in m (tree.h);
Plum variety (var): David = 1; Royal = 2, Supreme = 3;
Tree blessed by the village priest: “Yes” = 1, “No” = 2.
Consider the study aims and design
NoteBB TEST Q1
Select all variables to which the term “confounding” can sensibly be applied in the context of this study.
TipInterpretation
Our research question is whether blessing trees affects the yield, so for us, tree height and variety are confounding variables.
plums tree.h blessed var
Min. : 9.661 Min. :0.05187 Min. :1.00 Min. :1.00
1st Qu.:27.871 1st Qu.:2.67376 1st Qu.:1.00 1st Qu.:1.00
Median :34.588 Median :3.83299 Median :2.00 Median :2.00
Mean :33.877 Mean :3.49108 Mean :1.66 Mean :1.96
3rd Qu.:41.152 3rd Qu.:4.35685 3rd Qu.:2.00 3rd Qu.:3.00
Max. :45.737 Max. :5.00000 Max. :2.00 Max. :3.00
Like in the data that you have, numbers are used to indicate the categories (factor levels) of the categorical variables (factors) – here, plum variety var (1, 2, 3) and shaman blessing blessed (1, 2).
So we need to first declare var and blessed as factors, then apply our contrast settings. I’m doing this all at once:
Plums <-within(Plums,{ blessed <-factor(blessed); contrasts(blessed) <- contr.sum var <-factor(var); contrasts(var) <- contr.sum})summary(Plums) # not really needed: sanity check
plums tree.h blessed var
Min. : 9.661 Min. :0.05187 1:17 1:18
1st Qu.:27.871 1st Qu.:2.67376 2:33 2:16
Median :34.588 Median :3.83299 3:16
Mean :33.877 Mean :3.49108
3rd Qu.:41.152 3rd Qu.:4.35685
Max. :45.737 Max. :5.00000
OK, blessed and var are now shown as factors.
Look at the data
NoteTASK
For a first peek, I want you to just quickly check graphically whether there is a relationship between [the DV] and [the continuous EV], and if so, what direction the effect takes and whether it looks strong or weak. For this plot, we will ignore [the factor EVs]. Remember that such a ‘univariate’ analysis can potentially be misleading and needs to be followed up by a multivariate analysis (which you will carry out below).
There is very little evidence that the yield of a plum tree depends on whether it has been blessed – we have no good evidence against the Null, i.e., that the true effect isn’t zero (\(P=0.0884\)). The direction of the effect estimate is for blessed trees to have higher yield, since the coefficient blessed1 (which is for un-blessed trees) is negative (-0.509). In our sample, the mean for blessed trees is higher than for un-blessed trees (after adjusting for the effects of tree height and variety).
The evidence that plum variety affects yield is very strong (\(P=1.02\times 10^{-7}\)). See below for how to work out what varieties have the highest / lowest yields.
A additional meter of tree height means 6.75 kg more plums on the tree (effect estimate for tree.h is already in “per meter” unit, since our tree height data are in meters).
NoteBB TEST Q4
Which variety is associated with the highest and lowest plum yield, respectively?
For this, we need to look at the coefficients (effect estimates). Let’s print them again, and remind ourselves how the varieties were coded:
# var = 1: David# var = 2: Royal# var = 3: Supreme
Important
Recall that we are using sum contrasts. Therefore, the coefficients give the deviations from the mean.
TipInterpretation
The effect estimate for Royal (var2) is larger than for David (var1) and both are positive, which means that the estimate for Supreme must be negative (since all three must sum to zero). So Royal has the highest yield, and Supreme the lowest.
Make Predictions!
Assume we want to predict the plum yield in two scenarios:
for an un-blessed tree, 4 m high, of variety ‘Supreme’;
for a blessed tree, also 4 m high, but of variety ‘David’.
NoteTASK
Calculate the predicted plum yield for the two above scenarios.
Instead of using R like a pocket calculator, we do this:
b <-coef(my.m1) # store the coefficients ("betas" -- hence b)b # print them again, for reference
b["(Intercept)"] +# intercept b["tree.h"]*4+# tree is 4 m high-b["var1"] -b["var2"] +# variety is 'Supreme' (think "neither David nor Royal") b["blessed1"] # un-blessed
Blessed tree, also 4 m high, but of variety ‘David’:
b[1] + b[2]*4+ b[3] - b[5]
(Intercept)
38.03209
Meet the R predict function
NoteBB TEST Q5
What is your prediction for plum yield for a 4 m tree of variety ‘Supreme’ that hasn’t been blessed? Use the signif() function to round your results to 4 “significant digits” (that’s 4 digits, regardless of where the decimal point is, e.g. 0.1234, 1.234, 12.34…).
pred.dat <-data.frame(blessed =factor(1), # 1 is 'not blessed'tree.h =4, # tree is 4 m highvar =factor(3) # 3 is 'Supreme' )signif(predict(my.m1, newdata = pred.dat), 4)
1
34.12
So that’s 34.12 kg of plums.
NoteTASK
Modify the code to get predicted yields for both of the above scenarios in one step – by this, I mean, in a single call of predict(m, newdata = pred.dat).
If we want two predictions,
un-blessed tree, 4 m high, variety ‘Supreme’
blessed tree, also 4 m high, but variety ‘David’
pred.dat <-data.frame(blessed =factor(c(1, 2)), # not blessed, blessedtree.h =c(4, 4), # both are 4 m var =factor(c(3, 1)) # 3 is 'Supreme', 1 is 'David' )signif(predict(my.m1, newdata = pred.dat), 4)
1 2
34.12 38.03
SSQ Again!
NoteTASK
Compare the unadjusted and adjusted sums of squares for [blessing, in this demo example] and interpret your findings — what does this indicate from a technical statistical perspective, and what does this translate into in this study?
Important
Recall that to get ‘fully unadjusted’ SSQ for a variable, we need to put it first in the model formula.
my.m2 <-lm(plums ~ blessed + var + tree.h, Plums)anova(my.m1)
Analysis of Variance Table
Response: plums
Df Sum Sq Mean Sq F value Pr(>F)
tree.h 1 3684.9 3684.9 1036.4364 < 2.2e-16 ***
var 2 167.2 83.6 23.5104 1.023e-07 ***
blessed 1 10.8 10.8 3.0327 0.08844 .
Residuals 45 160.0 3.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(my.m2)
Analysis of Variance Table
Response: plums
Df Sum Sq Mean Sq F value Pr(>F)
blessed 1 377.91 377.91 106.29 1.982e-13 ***
var 2 917.42 458.71 129.02 < 2.2e-16 ***
tree.h 1 2567.49 2567.49 722.15 < 2.2e-16 ***
Residuals 45 159.99 3.56
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TipInterpretation
The unadjusted SSQ for blessing are much larger than the adjusted SSQ. So, much of the variation in plum yield that is seemingly attributable to blessing trees is in fact explained by differences in tree height and variety (the blessed trees must on average have been bigger and/or of a more productive variety). If we did not adjust for the effects of tree height and variety, we’d wrongly conclude that our data provide very strong evidence that blessing has an effect. The correct analysis provides very little evidence that blessing has any effect.
NoteBB TEST Q6
Which of the following statements best describes your findings regarding the unadjusted and adjusted SSQ for blessed?
TipInterpretation
Adjusting for variety and tree height reduces the SSQ of blessed.
Much of the apparent effect of tree-blesing is therefore attributable to differences in tree variety and height.
Confidence Intervals (Optional)
TASK: Give a 95% confidence interval for the effect of an additional metre of tree height on plum yield. For this, you will need to remind yourself how to calculate confidence intervals from standard errors (e.g., Grafen & Hails, p.14–15).
In this, \(\bar{y}\) is the estimate of the mean, and \(\frac{s}{\sqrt n}\) is the standard error (SE) of that estimate. So, to get a CI from mean \(\pm\) SE, the one extra thing we need is the critical \(t\)-value \((t_\mathrm{crit})\).
You get the mean and SE from the model summary. You could simply write down the values. However, to avoid having to write down the values, I extract and save the table of coefficients and SEs from the model summary using coef, like this:
s <-summary(my.m1)cf <-coef(s) ; signif(cf, 3) # save for later ; and print.
So all we now need is \(t_\mathrm{crit}\). R has a built-in function qt to calculate the quantiles of the \(t\)-distribution. The degrees of freedom in this case are the residual degrees of freedom for our model, which is 45. For a 95% CI, we need to cover the \(t_{45}\)-distribution between the 2.5 percentile and the 97.5 percentile; this gives the following range:
qt(c(0.025, 0.975), df =45) # 2.5% and 97.5% percentiles
[1] -2.014103 2.014103
The \(t\)-distribution is symmetric around zero, and so is our CI… this is why we need only one \(t_\mathrm{crit}\) value rather than two, and is where the \(\pm\) in the equation comes from. So, for a 95% CI, \(t_\mathrm{crit}\) is the 97.5 percentile:
t.crit <-qt(0.975, df =45) # crit. t-value for 95% CI
Putting this together, we get
\[\mathrm{CI} = 6.75 \pm 0.251 \times 2.01\]
We could do this by hand, or — much better — do the entire calculation with full precision in R (this is why I saved the coefficients table, rather than writing down the numbers):
NOTE: This code cleverly exploits a particular R behaviour: if we feed a vector of two values (the c(-t.crit, t.crit) part of the code) into an expression that otherwise has only a single value, R will expand the whole expression to return a vector with the two values.
The confidence interval for the change in plum yield per additional meter tree height is \([6.24, 7.25]\).