W4 Practical Support

Author

Swidbert R. Ott

Scenario: blessing plum trees

I’ve come up with a fictive scenario that has the same structure as what you do in the practical.

Research question: Does blessing of plum trees by the village priest increase the yield?

To answer this, we want to control for the size of the tree (bigger trees wil bear more plums), and for differences in yield between different plum varieties (David, Royal and Supreme).

Outcome: Weight of plums harvested from a plum tree, in kg (plums)

Predictors:

  • Height of tree, in m (tree.h);
  • Plum variety (var): David = 1; Royal = 2, Supreme = 3;
  • Tree blessed by the village priest: “Yes” = 1, “No” = 2.

Consider the study aims and design

Select all variables to which the term “confounding” can sensibly be applied in the context of this study.

Our research question is whether blessing trees affects the yield, so for us, tree height and variety are confounding variables.

LM Analysis & Interpretation

Set up the data

Plums <- read.csv("w4-classdemo.csv")
summary(Plums)
     plums            tree.h           blessed          var      
 Min.   : 9.661   Min.   :0.05187   Min.   :1.00   Min.   :1.00  
 1st Qu.:27.871   1st Qu.:2.67376   1st Qu.:1.00   1st Qu.:1.00  
 Median :34.588   Median :3.83299   Median :2.00   Median :2.00  
 Mean   :33.877   Mean   :3.49108   Mean   :1.66   Mean   :1.96  
 3rd Qu.:41.152   3rd Qu.:4.35685   3rd Qu.:2.00   3rd Qu.:3.00  
 Max.   :45.737   Max.   :5.00000   Max.   :2.00   Max.   :3.00  

Like in the data that you have, numbers are used to indicate the categories (factor levels) of the categorical variables (factors) – here, plum variety var (1, 2, 3) and shaman blessing blessed (1, 2).

So we need to first declare var and blessed as factors, then apply our contrast settings. I’m doing this all at once:

Plums <- within(Plums,{
  blessed <- factor(blessed);  contrasts(blessed) <- contr.sum                 
  var <- factor(var);  contrasts(var) <- contr.sum
})

summary(Plums)  # not really needed: sanity check
     plums            tree.h        blessed var   
 Min.   : 9.661   Min.   :0.05187   1:17    1:18  
 1st Qu.:27.871   1st Qu.:2.67376   2:33    2:16  
 Median :34.588   Median :3.83299           3:16  
 Mean   :33.877   Mean   :3.49108                 
 3rd Qu.:41.152   3rd Qu.:4.35685                 
 Max.   :45.737   Max.   :5.00000                 

OK, blessed and var are now shown as factors.

Look at the data

For a first peek, I want you to just quickly check graphically whether there is a relationship between [the DV] and [the continuous EV], and if so, what direction the effect takes and whether it looks strong or weak. For this plot, we will ignore [the factor EVs]. Remember that such a ‘univariate’ analysis can potentially be misleading and needs to be followed up by a multivariate analysis (which you will carry out below).

library(ggplot2)
ggplot(Plums, aes(x=tree.h, y=plums)) + geom_point()

What statement best describes the apparent relationship between tree height and plum yield, based on a simple graphical assessment?

In my example, it is evident that plum yield increases strongly with tree height.

Fit the LM and interpret the results

Fit the LM that is most appropriate for answering this research question.

Interpret the results of your LM analysis to briefly answer the following questions:

  1. How strong is the evidence that the plum yield depends upon whether a tree has been blessed? In what direction is the effect?
  2. How strong is the evidence that plum variety affects yield? Which varieties are associated with the highest and lowest yield?
  3. What is the estimated effect of an additional metre of height on a tree’s yield?

We can do a Type I ANOVA if we write the model with the EVs in the right order.

my.m1 <- lm(plums ~ tree.h + var + blessed, Plums)
anova(my.m1)
Analysis of Variance Table

Response: plums
          Df Sum Sq Mean Sq   F value    Pr(>F)    
tree.h     1 3684.9  3684.9 1036.4364 < 2.2e-16 ***
var        2  167.2    83.6   23.5104 1.023e-07 ***
blessed    1   10.8    10.8    3.0327   0.08844 .  
Residuals 45  160.0     3.6                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For some aspects of the task, we also need the model estimates (coefficients):

coef(my.m1)
(Intercept)      tree.h        var1        var2    blessed1 
 10.1411201   6.7480000   0.3903321   2.1147729  -0.5086342 
  1. There is very little evidence that the yield of a plum tree depends on whether it has been blessed – we have no good evidence against the Null, i.e., that the true effect isn’t zero (\(P=0.0884\)). The direction of the effect estimate is for blessed trees to have higher yield, since the coefficient blessed1 (which is for un-blessed trees) is negative (-0.509). In our sample, the mean for blessed trees is higher than for un-blessed trees (after adjusting for the effects of tree height and variety).
  2. The evidence that plum variety affects yield is very strong (\(P=1.02\times 10^{-7}\)). See below for how to work out what varieties have the highest / lowest yields.
  3. A additional meter of tree height means 6.75 kg more plums on the tree (effect estimate for tree.h is already in “per meter” unit, since our tree height data are in meters).

Which variety is associated with the highest and lowest plum yield, respectively?

For this, we need to look at the coefficients (effect estimates). Let’s print them again, and remind ourselves how the varieties were coded:

coef(my.m1)   # just printing them again
(Intercept)      tree.h        var1        var2    blessed1 
 10.1411201   6.7480000   0.3903321   2.1147729  -0.5086342 
# var = 1: David
# var = 2: Royal
# var = 3: Supreme
Important

Recall that we are using sum contrasts. Therefore, the coefficients give the deviations from the mean.

The effect estimate for Royal (var2) is larger than for David (var1) and both are positive, which means that the estimate for Supreme must be negative (since all three must sum to zero). So Royal has the highest yield, and Supreme the lowest.

Make Predictions!

Assume we want to predict the plum yield in two scenarios:

  1. for an un-blessed tree, 4 m high, of variety ‘Supreme’;
  2. for a blessed tree, also 4 m high, but of variety ‘David’.

Calculate the predicted plum yield for the two above scenarios.

Instead of using R like a pocket calculator, we do this:

b <- coef(my.m1)  # store the coefficients ("betas" -- hence b)
b                 # print them again, for reference
(Intercept)      tree.h        var1        var2    blessed1 
 10.1411201   6.7480000   0.3903321   2.1147729  -0.5086342 

Un-blessed tree, 4 m high, of variety ‘Supreme’:

b["(Intercept)"] +         # intercept 
   b["tree.h"]*4 +         # tree is 4 m high
  -b["var1"] -b["var2"] +  # variety is 'Supreme' (think "neither David nor Royal")
   b["blessed1"]           # un-blessed
(Intercept) 
   34.11938 
# Or, using numerical indexing instead,
b[1] + b[2]*4 + (-b[3]-b[4]) + b[5]
(Intercept) 
   34.11938 

Blessed tree, also 4 m high, but of variety ‘David’:

b[1] + b[2]*4 + b[3] - b[5]
(Intercept) 
   38.03209 

Meet the R predict function

What is your prediction for plum yield for a 4 m tree of variety ‘Supreme’ that hasn’t been blessed? Use the signif() function to round your results to 4 “significant digits” (that’s 4 digits, regardless of where the decimal point is, e.g. 0.1234, 1.234, 12.34…).

pred.dat <- data.frame(
  blessed = factor(1),  # 1 is 'not blessed'
  tree.h = 4,           # tree is 4 m high
  var = factor(3)       # 3 is 'Supreme'
  )

signif(predict(my.m1, newdata = pred.dat), 4)
    1 
34.12 

So that’s 34.12 kg of plums.

Modify the code to get predicted yields for both of the above scenarios in one step – by this, I mean, in a single call of predict(m, newdata = pred.dat).

If we want two predictions,

  1. un-blessed tree, 4 m high, variety ‘Supreme’
  2. blessed tree, also 4 m high, but variety ‘David’
pred.dat <- data.frame(
  blessed = factor(c(1, 2)),  # not blessed, blessed
  tree.h = c(4, 4),           # both are 4 m 
  var = factor(c(3, 1))       # 3 is 'Supreme', 1 is 'David'
  )

signif(predict(my.m1, newdata = pred.dat), 4)
    1     2 
34.12 38.03 

SSQ Again!

Compare the unadjusted and adjusted sums of squares for [blessing, in this demo example] and interpret your findings — what does this indicate from a technical statistical perspective, and what does this translate into in this study?

Important

Recall that to get ‘fully unadjusted’ SSQ for a variable, we need to put it first in the model formula.

my.m2 <- lm(plums ~ blessed + var + tree.h, Plums)
anova(my.m1)
Analysis of Variance Table

Response: plums
          Df Sum Sq Mean Sq   F value    Pr(>F)    
tree.h     1 3684.9  3684.9 1036.4364 < 2.2e-16 ***
var        2  167.2    83.6   23.5104 1.023e-07 ***
blessed    1   10.8    10.8    3.0327   0.08844 .  
Residuals 45  160.0     3.6                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(my.m2)
Analysis of Variance Table

Response: plums
          Df  Sum Sq Mean Sq F value    Pr(>F)    
blessed    1  377.91  377.91  106.29 1.982e-13 ***
var        2  917.42  458.71  129.02 < 2.2e-16 ***
tree.h     1 2567.49 2567.49  722.15 < 2.2e-16 ***
Residuals 45  159.99    3.56                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The unadjusted SSQ for blessing are much larger than the adjusted SSQ. So, much of the variation in plum yield that is seemingly attributable to blessing trees is in fact explained by differences in tree height and variety (the blessed trees must on average have been bigger and/or of a more productive variety). If we did not adjust for the effects of tree height and variety, we’d wrongly conclude that our data provide very strong evidence that blessing has an effect. The correct analysis provides very little evidence that blessing has any effect.

Which of the following statements best describes your findings regarding the unadjusted and adjusted SSQ for blessed?

  • Adjusting for variety and tree height reduces the SSQ of blessed.
  • Much of the apparent effect of tree-blesing is therefore attributable to differences in tree variety and height.

Confidence Intervals (Optional)

TASK: Give a 95% confidence interval for the effect of an additional metre of tree height on plum yield. For this, you will need to remind yourself how to calculate confidence intervals from standard errors (e.g., Grafen & Hails, p.14–15).

Recall (e.g., Grafen & Hall, p.14–15) that

\[\mathrm{CI} = \bar{y} \pm\frac{s}{\sqrt n} t_\mathrm{crit}\]

In this, \(\bar{y}\) is the estimate of the mean, and \(\frac{s}{\sqrt n}\) is the standard error (SE) of that estimate. So, to get a CI from mean \(\pm\) SE, the one extra thing we need is the critical \(t\)-value \((t_\mathrm{crit})\).

You get the mean and SE from the model summary. You could simply write down the values. However, to avoid having to write down the values, I extract and save the table of coefficients and SEs from the model summary using coef, like this:

s <- summary(my.m1)
cf <- coef(s) ;  signif(cf, 3)  # save for later ;  and print.
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   10.100      0.904   11.20 1.28e-14
tree.h         6.750      0.251   26.90 2.36e-29
var1           0.390      0.378    1.03 3.07e-01
var2           2.110      0.411    5.14 5.74e-06
blessed1      -0.509      0.292   -1.74 8.84e-02

The means estimate and SE for alt is

cf["tree.h", c("Estimate", "Std. Error")]
  Estimate Std. Error 
 6.7480000  0.2511078 

So all we now need is \(t_\mathrm{crit}\). R has a built-in function qt to calculate the quantiles of the \(t\)-distribution. The degrees of freedom in this case are the residual degrees of freedom for our model, which is 45. For a 95% CI, we need to cover the \(t_{45}\)-distribution between the 2.5 percentile and the 97.5 percentile; this gives the following range:

qt(c(0.025, 0.975), df = 45)  # 2.5% and 97.5% percentiles
[1] -2.014103  2.014103

The \(t\)-distribution is symmetric around zero, and so is our CI… this is why we need only one \(t_\mathrm{crit}\) value rather than two, and is where the \(\pm\) in the equation comes from. So, for a 95% CI, \(t_\mathrm{crit}\) is the 97.5 percentile:

t.crit <- qt(0.975, df = 45)  # crit. t-value for 95% CI

Putting this together, we get

\[\mathrm{CI} = 6.75 \pm 0.251 \times 2.01\]

We could do this by hand, or — much better — do the entire calculation with full precision in R (this is why I saved the coefficients table, rather than writing down the numbers):

CI <- cf["tree.h", "Estimate"] + cf["tree.h", "Std. Error"] * c(-t.crit, t.crit)
signif(CI, 3)
[1] 6.24 7.25

NOTE: This code cleverly exploits a particular R behaviour: if we feed a vector of two values (the c(-t.crit, t.crit) part of the code) into an expression that otherwise has only a single value, R will expand the whole expression to return a vector with the two values.

The confidence interval for the change in plum yield per additional meter tree height is \([6.24, 7.25]\).