1 Study Design & Data

An ecological study was conducted into the effect of conservation on the biomass of vegetation supported by an area of land. Fifty plots of land, each one hectare, were sampled at random from a ten thousand hectare area in Northern England. For each plot, the following variables were recorded:

  • biomass: An estimate of the biomass of vegetation (in \(\mathrm{kg}/\mathrm m^2\)).
  • alt: The mean altitude of the plot (in \(\mathrm m\) above sea level).
  • cons: A categorical variable, which was coded as
    • 1 if the plot was part of a conservation area, and
    • 2 otherwise
  • soil: A categorical variable crudely classifying soil type as
    • 1 for chalk,
    • 2 for clay,
    • 3 for loam.

The data are stored in Conservation.csv.

1.1 Consider the study aims and design

A common beginner’s mistake is to collect a bunch of data and start analysing away, without proper consideration of what the questions are. From the information in the previous section, make sure that you can answer the following questions (you don’t need to include them in your submission):

  • What is the primary research question, i.e., the aim of the study?
  • What is the dependent variable (DV)?
  • What is the explanatory variable (EV) of primary interest?
  • Why is it sensible that the other EVs were also recorded?

MCQ1: Which one of the following statements best explains why the researchers recorded altitude and soil type of the study areas:

[_] The researchers were primarily asking how soil type and altitude affect biomass.
[_] The researchers were primarily asking how soil type affects biomass, but expected a confounding effect of altitude on biomass.
[_] The researchers were primarily asking how conservation and non-conservation areas differ in altitude and soil type.
[_] The researchers were primarily asking how conservation affects biomass, but wished to control for the confounding effects of soil type and altitude.

2 LM Analysis & Interpretation

2.1 Set up the data

The first step in the analysis is, somewhat unsurprisingly, fitting a LM… but before we get to this, you need to load the data and set them up. This should be old hat by now…

  • Remember to first check how the categorical variables are coded, and set them up correctly if needed.
  • Use sum contrasts for categorical variables.
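As a rough sketch of these set-up steps, here is one way to turn integer-coded categories into factors and give them sum contrasts. The data frame `toy` is made up so that the snippet runs on its own (its values are not those of the practical's dataset); in the practical you would start from `read.csv("Conservation.csv")` instead.

```r
# Synthetic stand-in for the practical's data: same column names, made-up values.
set.seed(1)
toy <- data.frame(
  biomass = rnorm(12, mean = 2, sd = 0.3),
  alt     = runif(12, 50, 400),
  cons    = rep(1:2, times = 6),   # integer codes, as they would arrive from a CSV
  soil    = rep(1:3, times = 4)
)

str(toy)  # cons and soil show up as 'int', i.e. R would treat them as numbers

# Convert the integer codes to factors so that R treats them as categories.
toy$cons <- factor(toy$cons)
toy$soil <- factor(toy$soil)

# Attach sum contrasts to each factor.
contrasts(toy$cons) <- contr.sum(nlevels(toy$cons))
contrasts(toy$soil) <- contr.sum(nlevels(toy$soil))

contrasts(toy$soil)  # each column of the contrast matrix sums to zero
```

An alternative is `options(contrasts = c("contr.sum", "contr.poly"))`, which changes the default for every unordered factor in the session rather than for one variable at a time.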

2.2 Look at the data

Before you proceed to fitting a LM, you should always try and plot the data. Since you have two categorical EVs and a continuous EV, this would require some thought.

TASK: For a first peek, however, I want you to just quickly check graphically whether there is a relationship between biomass and altitude, and if so, what direction the effect takes and whether it looks strong or weak. For this plot, we will ignore conservation area status and soil type. Remember that such a ‘univariate’ analysis can potentially be misleading and needs to be followed up by a multivariate analysis (which you will carry out below).

MCQ2: Which statement best describes the apparent relationship between altitude and biomass, based on a simple graphical assessment?

[_] Biomass increases strongly with altitude.
[_] Biomass appears to increase slightly with altitude.
[_] No relationship between biomass and altitude is apparent.
[_] Biomass appears to decrease slightly with altitude.
[_] Biomass decreases strongly with altitude.

2.3 Fit the LM and interpret the results

TASK: Fit the LM that is most appropriate for answering this research question.

Interpret the results of your LM analysis to briefly answer the following questions:

  1. How strong is the evidence that the biomass of vegetation depends upon being in a conservation area? In what direction is the effect?
  2. How strong is the evidence that soil type affects biomass? Which soil types are associated with the highest and lowest biomass value?
  3. What is the estimated effect of an additional metre of altitude on biomass?

MCQ3: Which of these statements most accurately describes the evidence that the vegetation biomass of an area depends upon its conservation status?

[_] The evidence is weak \((P=0.0950)\).
[_] The evidence is weak \((P=0.0884)\).
[_] The evidence is strong \((P<0.05)\).
[_] The evidence is overwhelming \((P=1.982\times 10^{-13})\).

3 Make Predictions!

Now, we’ll use our model to predict biomass for two different scenarios:

  • for a plot with a mean altitude of 200 m, with clay soil and in a conservation area;
  • for a plot with a mean altitude of 300 m, with loam soil and not in a conservation area.

For this, you obviously need to do calculations with the estimated model coefficients. It is OK if you do this by using R as a “pocket calculator” (or, for that matter, using a real pocket calculator). However, I suggest you try to do this the smart way, which is

  • storing the values of all model coefficients in a single vector variable; and
  • then using indexing to access the values of the different coefficients as needed.
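The two steps above can be sketched on a toy model. Everything here (the data frame `toy`, the variables `x`, `g`, `y` and the model `fit`) is made up for illustration and is not the practical's dataset. The sketch assumes sum contrasts, under which the coefficient for a two-level factor is added for the first level and subtracted for the last.

```r
# Made-up data and model, for illustration only (not the practical's dataset).
set.seed(2)
toy <- data.frame(x = 1:20, g = factor(rep(c("a", "b"), each = 10)))
toy$y <- 1 + 0.5 * toy$x + ifelse(toy$g == "a", 0.3, -0.3) + rnorm(20, sd = 0.2)
contrasts(toy$g) <- contr.sum(2)

fit <- lm(y ~ g + x, data = toy)

b <- coef(fit)   # named numeric vector holding all coefficients
names(b)         # "(Intercept)" "g1" "x"

# Prediction at x = 5 for level "a": with sum contrasts, g1 is added ...
pred_a <- b["(Intercept)"] + b["g1"] + b["x"] * 5
# ... and for the last level ("b") it is subtracted.
pred_b <- b["(Intercept)"] - b["g1"] + b["x"] * 5

unname(c(pred_a, pred_b))
```

Indexing by name, as in `b["g1"]`, is a little more robust than indexing by position, because it still picks the right coefficient if the order of terms in the model formula changes.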

TASK: Calculate the predicted biomass for the two above scenarios.

3.1 Meet the R predict function

There is actually an even better (easier) way of making model predictions: R provides the predict function just for this purpose. Where predict really comes into its own is when you want to make predictions on a large number of new ‘cases’ where the outcome is unknown (for example because it is in the future).

The way predict works is that you pass it the values that you want to base the predictions on as a data frame. If we only want to make a single prediction, this data frame will have only one row… (but it could have arbitrarily many):

pred.dat <- data.frame(
  cons=factor(1),  # 1 means 'is conservation area'
  alt=200,         # altitude of 200 m
  soil=factor(2)   # 2 means 'clay'
  )

# now pass the data frame as `newdata` to `predict`
predict(m, newdata = pred.dat)

TASK: Run this yourself. Does the result from predict match your own calculation?

TASK: Modify the code to get predicted biomasses for both of the above scenarios in one step, that is, in a single call of predict(m, newdata = pred.dat).
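To illustrate the general pattern of predicting several cases at once, here is a minimal sketch on a toy model. The names `toy`, `fit`, `new_cases`, `x`, `g` and `y` are made up, and the data are simulated rather than taken from the practical. The key point is that each row of the `newdata` data frame is one case, and each column name must match a variable used in the model.

```r
# Made-up data and model, for illustration only (not the practical's dataset).
set.seed(3)
toy <- data.frame(
  x = runif(30, 0, 10),
  g = factor(rep(c("a", "b", "c"), times = 10))
)
toy$y <- 2 + 0.4 * toy$x + rnorm(30, sd = 0.5)
fit <- lm(y ~ g + x, data = toy)

# One row per case to predict; factor values must be levels the model has seen.
new_cases <- data.frame(
  g = factor(c("a", "c", "b")),
  x = c(2.5, 7, 4)
)

preds <- predict(fit, newdata = new_cases)
cbind(new_cases, predicted_y = preds)   # predictions alongside their inputs
```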

MCQ4: This is not really an MCQ; you’ll need to enter your predictions for biomass in Scenario 1 (conservation area on clay soil, 200 m) and Scenario 2 (non-conservation area on loam soil, 300 m) on Blackboard. Round your results to 4 “significant digits” (that’s 4 digits, regardless of where the decimal point is, e.g., 0.1234, 1.234, 12.34…)

4 SSQ Again!

Now, let’s investigate whether there is any discrepancy between the sequential and adjusted sums of squares (SSQ) for cons. Bear in mind that orthogonality always involves multiple EVs (it is about relationships between variables). Nevertheless, here we focus on cons because it is the EV of interest for answering our primary research question.

TASK: Compare the sequential and adjusted sums of squares for cons and interpret your findings — what does this indicate from a technical statistical perspective, and what does this translate into in this study?
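One possible way to obtain both kinds of SSQ in base R is sketched below on simulated data. This is an assumption about tooling, not necessarily the route taught in the course: `anova()` gives sequential SSQ in the order the terms appear in the formula, while `drop1()` gives the SSQ for each term adjusted for all the others. The toy variables `g`, `x` and `y` are made up and deliberately non-orthogonal.

```r
# Made-up, deliberately non-orthogonal data (not the practical's dataset):
# group "a" tends to have larger x, so g and x share information.
set.seed(4)
g <- factor(rep(c("a", "b"), each = 20))
x <- rnorm(40, mean = ifelse(g == "a", 6, 4))
y <- 1 + 0.8 * x + rnorm(40, sd = 0.5)   # y is simulated to depend on x only
toy <- data.frame(y, x, g)

fit <- lm(y ~ g + x, data = toy)

seq_tab <- anova(fit)              # sequential SSQ: g is fitted first, ignoring x
adj_tab <- drop1(fit, test = "F")  # adjusted SSQ: each term given all the others

seq_ss_g <- seq_tab["g", "Sum Sq"]
adj_ss_g <- adj_tab["g", "Sum of Sq"]
c(sequential = seq_ss_g, adjusted = adj_ss_g)
```

In this simulation the sequential SSQ for `g` should come out much larger than the adjusted one, because `g` only appears to matter through its association with `x`. With orthogonal EVs the two values would coincide.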

MCQ5: Which of the following statements best describes your findings regarding the sequential and adjusted SSQ for cons?

Adjusting for soil type and altitude…

[_] …increases the SSQ from 0.027 to 0.945. Accounting for soil type and altitude therefore strengthens the evidence for the effect of conservation.
[_] …reduces the SSQ from 0.945 to 0.027. Much of the apparent effect of conservation is therefore attributable to differences in soil type and altitude.
[_] …increases the SSQ from 0.025 to 0.718. Accounting for soil type and altitude therefore strengthens the evidence for the effect of conservation.
[_] …reduces the SSQ from 0.718 to 0.025. Much of the apparent effect of conservation is therefore attributable to differences in soil type and altitude.

5 Confidence Intervals (Optional)

If you find that you don’t have time to do this as part of the practical, you can come back to it for revision! There are no MCQs / marks associated with it (you do not need to submit anything relating to this part of the exercise on Blackboard).

Above, you have worked out the estimated effect of one additional metre of altitude on biomass. This is all well and good, but how confident can we be in this estimate?

TASK: Give a 95% confidence interval for the effect of an additional metre of altitude on biomass. For this, you will need to remind yourself how to calculate confidence intervals from standard errors (e.g., Grafen & Hails, p.14–15).
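As a reminder of the mechanics, a 95% confidence interval for a coefficient is the estimate plus or minus the critical t value (on the residual degrees of freedom) times the standard error. The sketch below does this on a toy regression; the data and the names `toy`, `fit`, `x` and `y` are made up and unrelated to the practical's dataset. The built-in `confint()` is used only as a cross-check.

```r
# Made-up data and model, for illustration only (not the practical's dataset).
set.seed(5)
toy <- data.frame(x = runif(25, 0, 10))
toy$y <- 3 + 0.7 * toy$x + rnorm(25, sd = 1)
fit <- lm(y ~ x, data = toy)

coefs <- coef(summary(fit))          # matrix of estimates, SEs, t and p values
est <- coefs["x", "Estimate"]
se  <- coefs["x", "Std. Error"]

# Two-sided 95% critical value of t on the residual degrees of freedom.
t_crit <- qt(0.975, df = df.residual(fit))

ci_by_hand <- c(lower = est - t_crit * se, upper = est + t_crit * se)
ci_by_hand

confint(fit, "x", level = 0.95)      # built-in equivalent, for checking
```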

6 Extra Exercise (Revision)

This provides an additional example, much along the same lines as the conservation example. If you have time, do this in the practical, but there are no MCQs / marks associated with it (you do not need to submit anything relating to this extra exercise on Blackboard). If you find that you don’t have time to do this as part of the practical, you can come back to it for revision!

6.1 The Study: Determinants of the Grade Point Average (GPA)

The academic performance of some students in the USA is evaluated as a Grade Point Average (gpa) each year. Faculty are concerned to admit good students, and assess students via tests that are broken down into verbal skills (verbal) and mathematical skills (math). A hundred students from each of two years (year) had their marks analysed, to investigate whether verbal or maths skills were more important in determining a student’s gpa.

The data are in file Grades.csv.

6.2 Modelling and Interpretation

How good is the evidence that math, verbal or year predict gpa? In what direction is the effect for each of these variables?

6.3 Predictions from the Model

  • What gpa would you expect from a student in the first year whose verbal score was 700 and mathematical score was 600?

  • What about a student in the second year whose verbal score was 600 and maths score was 700?

As in the previous “biomass” exercise, do the calculation “by hand” using the estimated coefficients from the model summary, and then check your result by using the predict function in R.