1 Study Design & Data
An ecological study was conducted into the effect of conservation on the biomass of vegetation supported by an area of land. Fifty plots of land, each one hectare, were sampled at random from a ten thousand hectare area in Northern England. For each plot, the following variables were recorded:
- biomass: An estimate of the biomass of vegetation (in \(\mathrm{kg}/\mathrm m^2\)).
- alt: The mean altitude of the plot (in \(\mathrm m\) above sea level).
- cons: A categorical variable, which was coded as
  - 1 if the plot was part of a conservation area, and
  - 2 otherwise.
- soil: A categorical variable crudely classifying soil type as
  - 1 for chalk,
  - 2 for clay,
  - 3 for loam.
You need to download your personal version of the data: open the W4 Test on Blackboard, Question Q1 has the link to your copy of conservation.csv.
1.1 Consider the study aims and design
A common beginner’s mistake is to collect a bunch of data and start analysing away, without proper consideration of what the questions are. From the information in the previous section, make sure that you can answer the following questions:
- What is the primary research question, i.e., the aim of the study?
- What is the dependent variable (DV)?
- What is the explanatory variable (EV) of primary interest?
- Why is it sensible that the other EVs were also recorded?
BB TEST Q1 [5 marks]: Select all variables to which the term “confounding” can sensibly be applied in the context of this study.
2 LM Analysis & Interpretation
2.1 Set up the data
The first step in the analysis is, somewhat unsurprisingly, fitting a LM… but before we get to this, you need to load the data and set them up. This should be a bit old hat by now…
- Remember to first check how the categorical variables are coded, and set them up correctly if needed.
- Use sum contrasts for categorical variables.
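As a reminder of what this set-up can look like, here is a minimal sketch. It uses a tiny made-up data frame as a stand-in for your personal conservation.csv (the numbers are invented purely for illustration), and it assumes that the categorical variables arrive as plain numeric codes, which you should check for your own file.

```r
# Toy stand-in for conservation.csv: these values are invented for illustration.
# With your own file you would start from something like
#   dat <- read.csv("conservation.csv")
dat <- data.frame(
  biomass = c(2.1, 1.8, 2.6, 1.5, 2.9, 1.7),
  alt     = c(120, 250, 90, 310, 60, 280),
  cons    = c(1, 2, 1, 2, 1, 2),
  soil    = c(1, 2, 3, 1, 2, 3)
)

str(dat)  # cons and soil show up as numbers here, not as factors

# Convert the numeric codes into factors so that R treats them as categories
dat$cons <- factor(dat$cons)
dat$soil <- factor(dat$soil)

# Use sum contrasts for unordered factors (polynomial contrasts for ordered ones)
options(contrasts = c("contr.sum", "contr.poly"))

contrasts(dat$soil)  # check the coding: the last level is coded -1 in every column
```

Setting the contrasts via options() affects every model fitted afterwards in the same R session; passing a contrasts argument to lm() is an alternative if you prefer to keep the change local.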
2.2 Look at the data
Before you proceed to fitting a LM, you should always try and plot the data. Since you have two categorical EVs and a continuous EV, this would require some thought.
TASK: For a first peek, however, I want you to just quickly check graphically whether there is a relationship between biomass and altitude, and if so, what direction the effect takes and whether it looks strong or weak. For this plot, we will ignore conservation area status and soil type. Remember that such a ‘univariate’ analysis can potentially be misleading and needs to be followed up by a multivariate analysis (which you will carry out below).
BB TEST Q2 [5 marks]: What statement best describes the apparent relationship between altitude and biomass, based on a simple graphical assessment?
2.3 Fit the LM and interpret the results
TASK: Fit the LM that is most appropriate for answering this research question.
Interpret the results of your LM analysis to briefly answer the following questions:
- How strong is the evidence that the biomass of vegetation depends upon being in a conservation area? In what direction is the effect?
- How strong is the evidence that soil type affects biomass? Which soil types are associated with the highest and lowest biomass value?
- What is the estimated effect of an additional metre of altitude on biomass?
BB TEST Q3 [10 marks]: Which of these statements most accurately describes the evidence that the vegetation biomass of an area depends upon its conservation status?
BB TEST Q4 [10 marks]: Which soil type is associated with the highest and lowest biomass of vegetation, respectively? You need to select all correct answers to receive marks (no partial credit).
3 Make Predictions!
Now, we’ll use our model to predict biomass for two different scenarios:
- a plot with a mean altitude of 200 m in a conservation area with clay soil;
- a plot with a mean altitude of 300 m, with loam soil and not in a conservation area.
For this, you obviously need to do calculations with the estimated model coefficients. It is OK if you do this by using R as a “pocket calculator” (or, for that matter, using a real pocket calculator). However, I suggest you try to do this the smart way, which is
- storing the values of all model coefficients into a single vector variable; and
- then using indexing to access the values of the different coefficients as needed.
TASK: Calculate the predicted biomass for the two above scenarios.
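The "smart way" described above might look roughly like the following sketch. It is run on simulated toy data (not the real conservation data), and it assumes an additive model with sum contrasts; the object names dat, m and b are just illustrative choices. With sum contrasts, the effect of the last level of a factor is minus the sum of the coefficients for the other levels.

```r
# Simulated toy data, invented for illustration only
set.seed(1)
n <- 30
dat <- data.frame(
  alt  = round(runif(n, 50, 400)),
  cons = factor(rep(1:2, length.out = n)),
  soil = factor(rep(1:3, length.out = n))
)
dat$biomass <- 3 - 0.004 * dat$alt +
  ifelse(dat$cons == 1, 0.3, -0.3) + rnorm(n, sd = 0.2)

# Additive LM with sum contrasts for both factors (an assumed model form)
m <- lm(biomass ~ alt + cons + soil, data = dat,
        contrasts = list(cons = "contr.sum", soil = "contr.sum"))

# Store all coefficients in one named vector
b <- coef(m)
names(b)  # "(Intercept)" "alt" "cons1" "soil1" "soil2"

# Scenario 1: 200 m, conservation area (cons = 1), clay (soil = 2)
pred1 <- unname(b["(Intercept)"] + 200 * b["alt"] + b["cons1"] + b["soil2"])

# Scenario 2: 300 m, not a conservation area (cons = 2), loam (soil = 3);
# the last level of each factor gets minus the sum of the other coefficients
pred2 <- unname(b["(Intercept)"] + 300 * b["alt"] - b["cons1"] -
                  (b["soil1"] + b["soil2"]))

c(scenario1 = pred1, scenario2 = pred2)
```

Indexing by name rather than by position makes the calculation robust to the order in which the terms appear in the model formula.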
3.1 Meet the R predict function
There is actually an even better (easier) way of making model predictions: R provides the predict function just for this purpose. Where predict really comes into its own is when you want to make predictions on a large number of new ‘cases’ where the outcome is unknown (for example because it is in the future).
The way predict works is that you pass it the values that you want to base the predictions on as a data frame. If we only want to make a single prediction, this data frame will have only one row… (but it could have arbitrarily many):
pred.dat <- data.frame(
  cons = factor(1),  # 1 means 'is conservation area'
  alt  = 200,        # altitude of 200 m
  soil = factor(2)   # 2 means 'clay'
)
# now pass the data frame as `newdata` to `predict`
predict(m, newdata = pred.dat)
TASK: Run this yourself. Does the result from predict match your own calculation?
TASK: Modify the code to get predicted biomasses for both of the above scenarios in one step! — by this, I mean, in a single call of predict(m, newdata = pred.dat).
BB TEST Q5 [10 marks]: What is your prediction for biomass in Scenario 2 (non-conservation area on loam soil, 300 m)? Use the signif() function to round your results to 4 “significant digits” (that’s 4 digits, regardless of where the decimal point is, e.g. 0.1234, 1.234, 12.34…).
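In case signif() is new to you, here is a quick illustration on arbitrary numbers (these are not model results). Note how it keeps 4 significant digits wherever the decimal point sits, unlike round(), which counts decimal places.

```r
# signif() keeps a fixed number of significant digits
x <- c(0.123456, 12.3456, 1234.56)
s <- signif(x, 4)
s            # 0.1235, 12.35 and 1235

# round() instead keeps a fixed number of decimal places
r <- round(x, 4)
r
```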
4 SSQ Again!
Now, let’s investigate whether there is any discrepancy between the unadjusted and adjusted sums of squares (SSQ) for cons. Bear in mind that orthogonality always involves multiple EVs (it is about relationships between variables). Nevertheless, we are here focussing on cons because it is the EV of interest for answering our primary research question.
TASK: Compare the unadjusted and adjusted sums of squares for cons and interpret your findings — what does this indicate from a technical statistical perspective, and what does this translate into in this study?
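One possible way to obtain the two quantities is sketched below on simulated toy data. The sketch assumes that "unadjusted" means the SSQ for cons when it is the only EV in the model, and that "adjusted" means the SSQ for cons when it is fitted last, after the other EVs; it relies on anova() reporting sequential sums of squares for an lm. The toy data are deliberately built so that altitude differs between conservation and non-conservation plots, which is an invented feature and says nothing about your real data.

```r
# Simulated toy data, invented for illustration only
set.seed(42)
n <- 40
cons <- factor(rep(1:2, each = n / 2))
# Made-up non-orthogonality: conservation plots tend to lie at higher altitude
alt <- round(ifelse(cons == 1, 300, 150) + rnorm(n, sd = 60))
soil <- factor(rep(1:3, length.out = n))
biomass <- 3 - 0.004 * alt + ifelse(cons == 1, 0.3, 0) + rnorm(n, sd = 0.2)
dat <- data.frame(biomass, alt, cons, soil)

# Unadjusted SSQ: cons as the only EV
unadj <- anova(lm(biomass ~ cons, data = dat))["cons", "Sum Sq"]

# Adjusted SSQ: cons entered last, i.e. after alt and soil
adj <- anova(lm(biomass ~ alt + soil + cons, data = dat))["cons", "Sum Sq"]

c(unadjusted = unadj, adjusted = adj)
```

If the EVs were orthogonal, the two values would coincide; a clear difference indicates that cons shares information with the other EVs.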
BB TEST Q6 [10 marks]: Which of the following statements best describes your findings regarding the unadjusted and adjusted SSQ for cons? You need to select all correct answers to receive marks (no partial credit).
5 Confidence Intervals (Optional)
If you find that you don’t have time to do this as part of the practical, you can come back to it for revision! — there are no MCQs / marks associated with it (you do not need to submit anything relating to this part of the practical).
Above, you have worked out the estimated effect of one additional metre of altitude on biomass. This is all well and good, but how confident can we be in this estimate?
TASK: Give a 95% confidence interval for the effect of an additional metre of altitude on biomass. For this, you will need to remind yourself how to calculate confidence intervals from standard errors (e.g., Grafen & Hails, p.14–15).
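As a sketch of the calculation, a 95% confidence interval can be built as estimate ± critical t value × standard error, with the t value taken at the residual degrees of freedom of the model. The example below uses simulated toy data and an assumed additive model, so the numbers are not those of the conservation study; the built-in confint() function is shown as a cross-check.

```r
# Simulated toy data, invented for illustration only
set.seed(7)
n <- 30
dat <- data.frame(
  alt  = round(runif(n, 50, 400)),
  cons = factor(rep(1:2, length.out = n)),
  soil = factor(rep(1:3, length.out = n))
)
dat$biomass <- 3 - 0.004 * dat$alt + rnorm(n, sd = 0.2)

m <- lm(biomass ~ alt + cons + soil, data = dat)  # assumed model form

# Estimate and standard error for alt from the coefficient table
tab <- coef(summary(m))
est <- tab["alt", "Estimate"]
se  <- tab["alt", "Std. Error"]

# Critical t value for a two-sided 95% interval
tcrit <- qt(0.975, df = df.residual(m))

ci <- c(lower = est - tcrit * se, upper = est + tcrit * se)
ci

# Cross-check with the built-in function
confint(m, "alt", level = 0.95)
```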
6 Extra Exercise (Revision)
This provides an additional example, much along the same lines as the conservation example. If you have time, do this in the practical — but there are no MCQs / marks associated with it (you do not need to submit anything relating to this extra exercise). If you find that you don’t have time to do this as part of the practical, you can come back to it for revision!
6.1 The Study: Determinants of the Grade Point Average (GPA)
The academic performance of some students in the USA is evaluated as a Grade Point Average (gpa) each year. Faculty are concerned to admit good students, and assess students via tests that are broken down into verbal skills (verbal) and mathematical skills (math). A hundred students from each of two years (year) had their marks analysed, to investigate whether verbal or maths skills were more important in determining a student’s gpa.
The data are in file Grades.csv.
6.2 Modelling and Interpretation
How good is the evidence that math, verbal or year predict gpa? In what direction is the effect for each of these variables?
6.3 Predictions from the Model
What gpa would you expect from a student in the first year whose verbal score was 700 and mathematical score was 600? What about a student in the second year whose verbal score was 600 and maths score was 700?
As in the previous “biomass” exercise, do the calculation “by hand” using the estimated coefficients from the model summary, and then check your result by using the predict function in R.