W4 Practical Support

Blessed Plum Trees

Author

Swidbert R. Ott

1 Study Design & Data

I’ve again come up with a fictive scenario that has the same structure as what you do in the practical.

Research question: A study was conducted into the effect of tree blessing by village priests or shamans on the yield of plum trees. Fifty plum trees were sampled at random from a ten thousand hectare area in Northern England. For each tree, the following variables were recorded:

  • plums: The weight of plums harvested from a plum tree, in \(\mathrm{kg}\).
  • tree.h: The height of the tree, in \(\mathrm m\).
  • blessed: A categorical variable, coded as
    • 1 if the tree has NOT been blessed by the village priest / shaman,
    • 2 if it has been blessed.
  • var: A categorical variable indicating the plum variety as
    • 1 for David,
    • 2 for Royal, and
    • 3 for Supreme.

2 Consider the study aims and design

From the information in the previous section, make sure that you can answer the following questions:

  • What is the primary research question, i.e., the aim of the study?
  • What is the dependent variable (DV)?
  • What is the explanatory variable (EV) of primary interest?
  • Why is it sensible that the other EVs were also recorded?

Select all variables to which the term “confounding” can sensibly be applied in the context of this study.

  • The primary research question is whether blessing trees affects their yield.
  • The DV is plum yield (plums).
  • The EV of primary interest is tree-blessing (blessed). Finding out whether plum yield depends on the variety or the height of the tree was not the aim of the study.
  • However, we would very much expect the yield to differ between varieties, and also to depend on the size (height) of the tree: variety and tree height and variety are expected to confound any effect that tree blessing may have. For this reason, variety and height were recorded for all trees. In the analysis, we can then control for the confounding effects of variety and height.

3 LM Analysis & Interpretation

3.1 Set up the data

Plums <- read.csv("w4-classdemo.csv")
summary(Plums)
     plums            tree.h           blessed          var      
 Min.   : 9.661   Min.   :0.05187   Min.   :1.00   Min.   :1.00  
 1st Qu.:27.871   1st Qu.:2.67376   1st Qu.:1.00   1st Qu.:1.00  
 Median :34.588   Median :3.83299   Median :2.00   Median :2.00  
 Mean   :33.877   Mean   :3.49108   Mean   :1.66   Mean   :1.96  
 3rd Qu.:41.152   3rd Qu.:4.35685   3rd Qu.:2.00   3rd Qu.:3.00  
 Max.   :45.737   Max.   :5.00000   Max.   :2.00   Max.   :3.00  

Like in the data that you have, numbers are used to indicate the categories (factor levels) of the categorical variables (factors) – here, plum variety var (1, 2, 3) and shaman blessing blessed (1, 2).

So we need to first declare var and blessed as factors, then apply our contrast settings. I’m doing this all at once:

Plums <- within(Plums,{
  blessed <- factor(blessed);  contrasts(blessed) <- contr.sum                 
  var <- factor(var);  contrasts(var) <- contr.sum
})

summary(Plums)  # not really needed: sanity check
     plums            tree.h        blessed var   
 Min.   : 9.661   Min.   :0.05187   1:17    1:18  
 1st Qu.:27.871   1st Qu.:2.67376   2:33    2:16  
 Median :34.588   Median :3.83299           3:16  
 Mean   :33.877   Mean   :3.49108                 
 3rd Qu.:41.152   3rd Qu.:4.35685                 
 Max.   :45.737   Max.   :5.00000                 

OK, blessed and var are now shown as factors.

3.2 Look at the data

For a first peek, I want you to just quickly check graphically whether there is a relationship between the DV (plum yield) and the continuous EV (tree height), and if so, what direction the effect takes and whether it looks strong or weak. For this plot, we will ignore the categorical EVs (variety and blessing status). Remember that such a ‘univariate’ analysis can potentially be misleading and needs to be followed up by a multivariate analysis (which you will carry out below).

library(ggplot2)
ggplot(Plums, aes(x=tree.h, y=plums)) + geom_point()

What statement best describes the apparent relationship between tree height and plum yield, based on a simple graphical assessment?

In my example, it is evident that plum yield increases strongly with tree height.

3.3 Fit the LM and interpret the results

Fit the LM that is most appropriate for answering this research question.

Interpret the results of your LM analysis to briefly answer the following questions:

  1. How strong is the evidence that the plum yield depends upon whether a tree has been blessed? In what direction is the effect?
  2. How strong is the evidence that plum variety affects yield? Which varieties are associated with the highest and lowest yield?
  3. What is the estimated effect of an additional metre of height on a tree’s yield?
my.m1 <- lm(plums ~ tree.h + var + blessed, Plums)

I would use Anova() from the car package to get an adjusted SSQ (Type II) ANOVA table.

library(car)
Anova(my.m1)
Anova Table (Type II tests)

Response: plums
           Sum Sq Df  F value    Pr(>F)    
tree.h    2567.49  1 722.1540 < 2.2e-16 ***
var        158.12  2  22.2377 1.923e-07 ***
blessed     10.78  1   3.0327   0.08844 .  
Residuals  159.99 45                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For some aspects of the task, we also need the model estimates (coefficients):

coef(my.m1)
(Intercept)      tree.h        var1        var2    blessed1 
 10.1411201   6.7480000   0.3903321   2.1147729  -0.5086342 

1. How strong is the evidence that the plum yield depends upon whether a tree has been blessed? In what direction is the effect? — The evidence that the yield of a plum tree depends on whether it has been blessed is weak (\(P=0.0884\)): we have no strong evidence against the Null, i.e., that the true effect isn’t zero.

For effect directions, we need to look at the estimated coefficients. Recall that with sum contrasts, the coefficients are the deviations of the group means from the grand mean. The coefficient blessed1 (which is for un-blessed trees) is negative (-0.509), so the direction of the effect estimate is for blessed trees to have slightly higher yields. But we should not make a statement about the effect of tree-blessing on yield, because (given the ANOVA result) we have little evidence that there is an effect – and if there is an effect, it may in fact be in the other direction… we just can’t tell. Therefore, the correct way to report the result is as a statement about the sample rather than about the (general) effect of blessing trees:

“In our sample of trees, plum yields were on average higher for trees that had been blessed.”

2. How strong is the evidence that plum variety affects yield? Which varieties are associated with the highest and lowest yield? — The evidence that plum variety affects yield is very strong (\(P=1.92\times 10^{-7}\)). See below for how to work out what varieties have the highest / lowest yields.

3. What is the estimated effect of an additional metre of height on a tree’s yield? — This one is super easy – no calculation needed! You just need to recall that, for continuous EVs, the coefficient gives the estimated change in the mean of DV per unit change in the EV. So the coefficient for tree.h already gives the estimated mean change in plums per 1 m change in tree height: plum yields increase by 6.75 kg for each additional meter tree-height.

Which of these statements most accurately describes the evidence that the yield of a plum tree depends upon whether it has been blessed?

There is some evidence for an effect \((0.1 > P \geq 0.01)\).

Which variety is associated with the highest and lowest plum yield, respectively?

For this, we need to look at the coefficients (effect estimates). Let’s print them again, and remind ourselves how the varieties were coded:

coef(my.m1)   # just printing them again
(Intercept)      tree.h        var1        var2    blessed1 
 10.1411201   6.7480000   0.3903321   2.1147729  -0.5086342 
# var = 1: David
# var = 2: Royal
# var = 3: Supreme
Important

Recall that we are using sum contrasts. Therefore, the coefficients give the deviations from the mean.

The effect estimate for Royal (var2) is larger than for David (var1) and both are positive, which means that the estimate for Supreme, which is not directly provided in the output, must be negative since all three must sum to zero. So Royal has the highest yield, and Supreme the lowest. If this doesn’t make sense, see the “Contrasts deep-dive” box below.

4 Make Predictions!

Assume we want to predict the plum yield in two scenarios:

  1. for an un-blessed tree, 4 m high, of variety ‘Supreme’;
  2. for a blessed tree, also 4 m high, but of variety ‘David’.

Calculate the predicted plum yield for the two above scenarios.

Instead of using R like a pocket calculator, we save the estimates of the model coefficients into a vector variable – e.g., call it b (for ‘betas’):

b <- coef(my.m1)  # store the coefficients
b                 # print them, to see what they look like
(Intercept)      tree.h        var1        var2    blessed1 
 10.1411201   6.7480000   0.3903321   2.1147729  -0.5086342 

Scenario 1: Un-blessed tree, 4 m high, of variety ‘Supreme’:

b["(Intercept)"] +         # intercept 
   b["tree.h"]*4 +         # tree is 4 m high
  -b["var1"] -b["var2"] +  # variety is 'Supreme' (think "neither David nor Royal")
   b["blessed1"]           # un-blessed
(Intercept) 
   34.11938 
# Or, using numerical indexing instead: compact, but can be confusing
b[1] + b[2]*4 + (-b[3]-b[4]) + b[5]
(Intercept) 
   34.11938 

Scenario 2: Blessed tree, also 4 m high, but of variety ‘David’:

b["(Intercept)"] + b["tree.h"]*4 + b["var1"] - b["blessed1"]
(Intercept) 
   38.03209 

How do we know whether to use var1 or var2 and how do we get to the + and - signs for the factor levels?

The first thing to note is that, under the bonnet, a LM treats categorical variables (factors) the same as continuous variables – this is why we say, ‘all LMs are just regressions.’ So let’s start with reminding ourselves how it works for our continuous variable, tree.h: we multiply the effect estimate (\(\beta_i\)) with the value of the variable (\(x_i\)):

b["tree.h"] * 4

For a factor, we also multiply the effect estimate \(\beta_i\) with ‘the value of the variable’, but because we cannot multiply \(\beta_i\) with “David”, we use the contrast coding for “David” instead. Here is how the contrasts look like for our plum varieties. Don’t confuse the contrast coding in the model with the silly numeric coding of the varieties in the data!

Variety in the data contrast in the model
David 1 (1, 0) \(x_1=1, x_2=0\)
Royal 2 (0, 1) \(x_1=0, x_2=1\)
Supreme 3 (-1, -1) \(x_1=-1, x_2=-1\)

So, variety is internally represented by two variables (\(x_1, x_2\)) – see why a three-level factor needs two DoF? And the effect of variety is modelled / represented as

\[\beta_1\times x_1+\beta_2\times x_2\] See how this is the same as for continuous variables, in that “we multiply the effect estimate (\(\beta_i\)) with the value of the variable (\(x_i\))”? For variety “David,” \(x_1=1, x_2=0\), and we get

b["var1"]*1 + b["var2"]*0   # which is of course just
b["var1"]

For variety “Supreme,” \(x_1=-1, x_2=-1\) (which I like to think of as “neither David nor Royal”), so:

b["var1"]*-1 + b["var2"]*-1  # which is the same as
-(b["var1"] + b["var2"])

4.1 Meet the R predict function

What is your prediction for plum yield for a 4 m tree of variety ‘Supreme’ that hasn’t been blessed? Use the signif() function to round your results to 4 “significant digits” (that’s 4 digits, regardless of where the decimal point is, e.g. 0.1234, 1.234, 12.34…).

pred.dat <- data.frame(
  blessed = factor(1),  # 1 is 'not blessed'
  tree.h = 4,           # tree is 4 m high
  var = factor(3)       # 3 is 'Supreme'
  )

signif(predict(my.m1, newdata = pred.dat), 4)
    1 
34.12 

So that’s 34.12 kg of plums.

Modify the code to get predicted yields for both of the above scenarios in one step – by this, I mean, in a single call of predict(m, newdata = pred.dat).

To get both predictions in one step, we simply make a data frame that has both scenarios in it. Put more technically, it has two rows. To make the rows, we need to put two values into each EV — for this, the two values need to be inside a c( ).

pred.dat <- data.frame(
  blessed = factor(c(1, 2)),  # not blessed, blessed
  tree.h = c(4, 4),           # both are 4 m 
  var = factor(c(3, 1))       # 3 is 'Supreme', 1 is 'David'
  )

signif(predict(my.m1, newdata = pred.dat), 4)
    1     2 
34.12 38.03 

4.2 SSQ Again!

Compare the unadjusted and adjusted sums of squares for [blessing, in this demo example] and interpret your findings — what does this indicate from a technical statistical perspective, and what does this translate into in this study?

Important

Recall that to get ‘fully unadjusted’ SSQ for a variable, we need to put it first in the model formula.

my.m2 <- lm(plums ~ blessed + var + tree.h, Plums)
anova(my.m1)
Analysis of Variance Table

Response: plums
          Df Sum Sq Mean Sq   F value    Pr(>F)    
tree.h     1 3684.9  3684.9 1036.4364 < 2.2e-16 ***
var        2  167.2    83.6   23.5104 1.023e-07 ***
blessed    1   10.8    10.8    3.0327   0.08844 .  
Residuals 45  160.0     3.6                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(my.m2)
Analysis of Variance Table

Response: plums
          Df  Sum Sq Mean Sq F value    Pr(>F)    
blessed    1  377.91  377.91  106.29 1.982e-13 ***
var        2  917.42  458.71  129.02 < 2.2e-16 ***
tree.h     1 2567.49 2567.49  722.15 < 2.2e-16 ***
Residuals 45  159.99    3.56                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The unadjusted SSQ for blessing are much larger than the adjusted SSQ. So, much of the variation in plum yield that is seemingly attributable to blessing trees is in fact explained by differences in tree height and variety (the blessed trees in our sample must on average have been bigger and/or of a more productive variety). If we did not adjust for the effects of tree height and variety, we’d wrongly conclude that our data provide very strong evidence that blessing has an effect. The correct analysis provides very little evidence that blessing has any effect.

Which of the following statements best describes your findings regarding the unadjusted and adjusted SSQ for blessed?

  • Adjusting for variety and tree height reduces the SSQ of blessed.
  • Much of the apparent effect of tree-blesing is therefore attributable to differences in tree variety and height.

5 Confidence Intervals (Optional)

TASK: Give a 95% confidence interval for the effect of an additional metre of tree height on plum yield. For this, you will need to remind yourself how to calculate confidence intervals from standard errors (e.g., Grafen & Hails, p.14–15).

Recall (e.g., Grafen & Hall, p.14–15) that

\[\mathrm{CI} = \bar{y} \pm\frac{s}{\sqrt n} t_\mathrm{crit}\]

In this, \(\bar{y}\) is the estimate of the mean, and \(\frac{s}{\sqrt n}\) is the standard error (SE) of that estimate. So, to get a CI from mean \(\pm\) SE, the one extra thing we need is the critical \(t\)-value \((t_\mathrm{crit})\).

You get the mean and SE from the model summary. You could simply write down the values. However, to avoid having to write down the values, I extract and save the table of coefficients and SEs from the model summary using coef, like this:

s <- summary(my.m1)
cf <- coef(s) ;  signif(cf, 3)  # save for later ;  and print.
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   10.100      0.904   11.20 1.28e-14
tree.h         6.750      0.251   26.90 2.36e-29
var1           0.390      0.378    1.03 3.07e-01
var2           2.110      0.411    5.14 5.74e-06
blessed1      -0.509      0.292   -1.74 8.84e-02

The means estimate and SE for alt is

cf["tree.h", c("Estimate", "Std. Error")]
  Estimate Std. Error 
 6.7480000  0.2511078 

So all we now need is \(t_\mathrm{crit}\). R has a built-in function qt to calculate the quantiles of the \(t\)-distribution. The degrees of freedom in this case are the residual degrees of freedom for our model, which is 45. For a 95% CI, we need to cover the \(t_{45}\)-distribution between the 2.5 percentile and the 97.5 percentile; this gives the following range:

qt(c(0.025, 0.975), df = 45)  # 2.5% and 97.5% percentiles
[1] -2.014103  2.014103

The \(t\)-distribution is symmetric around zero, and so is our CI… this is why we need only one \(t_\mathrm{crit}\) value rather than two, and is where the \(\pm\) in the equation comes from. So, for a 95% CI, \(t_\mathrm{crit}\) is the 97.5 percentile:

t.crit <- qt(0.975, df = 45)  # crit. t-value for 95% CI

Putting this together, we get

\[\mathrm{CI} = 6.75 \pm 0.251 \times 2.01\]

We could do this by hand, or — much better — do the entire calculation with full precision in R (this is why I saved the coefficients table, rather than writing down the numbers):

CI <- cf["tree.h", "Estimate"] + cf["tree.h", "Std. Error"] * c(-t.crit, t.crit)
signif(CI, 3)
[1] 6.24 7.25

NOTE: This code cleverly exploits a particular R behaviour: if we feed a vector of two values (the c(-t.crit, t.crit) part of the code) into an expression that otherwise has only a single value, R will expand the whole expression to return a vector with the two values.

The confidence interval for the change in plum yield per additional meter tree height is \([6.24, 7.25]\).