An ecological study was conducted into the effect of conservation on the biomass of vegetation supported by an area of land. Fifty plots of land, each one hectare, were sampled at random from a ten thousand hectare area in Northern England. For each plot, the following variables were recorded:
biomass: An estimate of the biomass of vegetation (in \(\mathrm{kg}/\mathrm{m}^2\)).
alt: The mean altitude of the plot (in \(\mathrm{m}\) above sea level).
cons: A categorical variable, which was coded as
soil: A categorical variable crudely classifying soil type as
The data are stored in Conservation.csv.
A common beginner’s mistake is to collect a bunch of data and start analysing away, without proper consideration of what the questions are. From the information in the previous section, make sure that you can answer the following questions (you don’t need to include them in your submission):
MCQ1: Which one of the following statements best explains why the researchers recorded altitude and soil type of the study areas:
[_] The researchers were primarily asking how soil type and altitude affect biomass.
[_] The researchers were primarily asking how soil type affects biomass, but expected a confounding effect of altitude on biomass.
[_] The researchers were primarily asking how conservation status areas differ in altitude and soil type.
[_] The researchers were primarily asking how conservation affects biomass, but wished to control for the confounding effects of soil type and altitude.
The first step in the analysis is, somewhat unsurprisingly, fitting an LM… but before we get to this, you need to load the data and set them up. This should be old hat by now…
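As a reminder of what "setting up" usually involves, here is a minimal sketch. It reads a tiny made-up stand-in instead of Conservation.csv so that it runs on its own; the values, the 0/1 and 1/2/3 codes, and the name `dat` are illustrative assumptions, not the real data or its coding.

```r
# Made-up stand-in rows with the same column names as Conservation.csv.
# The numbers and the category codes are assumptions for illustration only.
csv.text <- "biomass,alt,cons,soil
1.8,120,1,1
2.1,250,0,2
1.5,310,1,3
2.4,90,0,1
1.9,400,1,2
2.0,180,0,3"

# For the real data this would be: dat <- read.csv("Conservation.csv")
dat <- read.csv(text = csv.text)

# Integer codes are read in as numbers, so the categorical EVs
# have to be declared as factors explicitly
dat$cons <- factor(dat$cons)
dat$soil <- factor(dat$soil)

str(dat)  # check the column types before modelling
```

If the factor conversion is skipped, R would treat the category codes as continuous numbers and fit a single slope for them, which is rarely what is wanted.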
Before you proceed to fitting an LM, you should always try to plot the data. Since you have two categorical EVs and a continuous EV, this requires some thought.
TASK: For a first peek, however, I want you to just quickly check graphically whether there is a relationship between biomass and altitude, and if so, what direction the effect takes and whether it looks strong or weak. For this plot, we will ignore conservation area status and soil type. Remember that such a ‘univariate’ analysis can potentially be misleading and needs to be followed up by a multivariate analysis (which you will carry out below).
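A quick scatterplot is usually enough for this kind of first look. The sketch below uses randomly generated numbers, not the course data, so the pattern it shows says nothing about the real answer; the data frame name `toy` and the value ranges are assumptions.

```r
set.seed(1)  # reproducible made-up data, NOT the course data
toy <- data.frame(
  alt     = runif(50, min = 0, max = 500),  # hypothetical altitudes (m)
  biomass = runif(50, min = 0.5, max = 3)   # hypothetical biomass (kg/m^2)
)

# Scatterplot of the response against the continuous EV
plot(biomass ~ alt, data = toy,
     xlab = "Altitude (m above sea level)",
     ylab = "Biomass (kg per square metre)")

# A smoother makes direction and strength easier to judge by eye
lines(lowess(toy$alt, toy$biomass), col = "red")
```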
MCQ2: What statement best describes the apparent relationship between altitude and biomass, based on a simple graphical assessment?
[_] Biomass increases strongly with altitude.
[_] Biomass appears to increase slightly with altitude.
[_] No relationship between biomass and altitude is apparent.
[_] Biomass appears to decrease slightly with altitude.
[_] Biomass decreases strongly with altitude.
TASK: Fit the LM that is most appropriate for answering this research question.
Interpret the results of your LM analysis to briefly answer the following questions:
MCQ3: Which of these statements most accurately describes the evidence that the vegetation biomass of an area depends upon its conservation status?
[_] The evidence is weak \((P=0.0950)\).
[_] The evidence is weak \((P=0.0884)\).
[_] The evidence is strong \((P<0.05)\).
[_] The evidence is overwhelming \((P=1.982\times 10^{-13})\).
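Now, we’ll use our model to predict biomass for two different scenarios:
Scenario 1: a conservation area on clay soil at an altitude of 200 m.
Scenario 2: a non-conservation area on loam soil at an altitude of 300 m.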
For this, you obviously need to do calculations with the estimated model coefficients. It is OK if you do this by using R as a “pocket calculator” (or, for that matter, using a real pocket calculator). However, I suggest you try to do this the smart way, which is
TASK: Calculate the predicted biomass for the two above scenarios.
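One possible way to avoid retyping numbers, offered as a sketch rather than as the intended approach, is to pull the estimates out of the fitted model with `coef()` and combine them by name. The example uses R's built-in `mtcars` data (one continuous and two categorical EVs) in place of the course data; `d`, `fit`, `b` and `by.hand` are hypothetical names.

```r
# Built-in example data, used here instead of the course data
d <- mtcars
d$am  <- factor(d$am)   # two-level categorical EV (levels "0", "1")
d$cyl <- factor(d$cyl)  # three-level categorical EV (levels "4", "6", "8")

fit <- lm(mpg ~ am + cyl + wt, data = d)
b <- coef(fit)  # named vector: (Intercept), am1, cyl6, cyl8, wt
b

# Example case: am = 1, cyl = 6, wt = 3.
# Each dummy variable is 1 for the level that applies and 0 otherwise;
# the reference levels (am = 0, cyl = 4) are absorbed into the intercept.
by.hand <- b["(Intercept)"] + b["am1"] * 1 + b["cyl6"] * 1 +
  b["cyl8"] * 0 + b["wt"] * 3
unname(by.hand)
```

Referring to coefficients by name rather than by position keeps the calculation readable and guards against picking up the wrong row of the summary table.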
predict function
There is actually an even better (easier) way of making model
predictions: R provides the predict function just for this
purpose. Where predict really comes into its own is when
you want to make predictions on a large number of new ‘cases’ where the
outcome is unknown (for example because it is in the future).
The way predict works is that you pass it the values
that you want to base the predictions on as a data frame. If we
only want to make a single prediction, this data frame will have only
one row… (but it could have arbitrarily many):
pred.dat <- data.frame(
  cons = factor(1),  # 1 means 'is conservation area'
  alt  = 200,        # altitude of 200 m
  soil = factor(2)   # 1 means 'clay'
)
# now pass the data frame as `newdata` to `predict`
predict(m, newdata = pred.dat)
TASK: Run this yourself. Does the result from
predict match your own calculation?
TASK: Modify the code to get predicted biomasses for
both of the above scenarios in one step! — by this, I mean, in a single
call of predict(m, newdata = pred.dat).
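The general pattern for several predictions at once is one row per case in `newdata`. The sketch below shows it on R's built-in `mtcars` data rather than on the course model; `d`, `fit` and `new.cases` are hypothetical names.

```r
d <- mtcars
d$am  <- factor(d$am)
d$cyl <- factor(d$cyl)
fit <- lm(mpg ~ am + cyl + wt, data = d)

# One row per case; the factor columns use the same level labels
# as the data the model was fitted to
new.cases <- data.frame(
  am  = factor(c(1, 0), levels = levels(d$am)),
  cyl = factor(c(6, 8), levels = levels(d$cyl)),
  wt  = c(3, 3.5)
)

predict(fit, newdata = new.cases)  # one prediction per row
```

Each column is simply a vector with one entry per case, so adding a case means adding one element to every vector.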
MCQ4: Not really an MCQ: you’ll need to enter your predictions for biomass in Scenario 1 (conservation area on clay soil, 200 m) and Scenario 2 (non-conservation area on loam soil, 300 m) on Blackboard. Round your results to 4 “significant digits” (that’s 4 digits, regardless of where the decimal point is, e.g. 0.1234, 1.234, 12.34…)
Now, let’s investigate whether there is any discrepancy between the
sequential and adjusted sums of squares (SSQ) for cons.
Bear in mind that orthogonality always involves multiple EVs (it is
about relationships between variables). Nevertheless, we are here
focussing on cons because it is the EV of interest for
answering our primary research question.
TASK: Compare the sequential and adjusted sums of
squares for cons and interpret your findings — what does
this indicate from a technical statistical perspective, and what does
this translate into in this study?
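In R, `anova()` on an `lm` fit reports sequential sums of squares, so the order of terms in the formula matters. A minimal sketch of how the two kinds of SSQ can be obtained for one term, shown on the built-in `mtcars` data rather than the course data (`d`, `fit`, `seq.tab` and `adj.tab` are hypothetical names):

```r
d <- mtcars
d$am  <- factor(d$am)
d$cyl <- factor(d$cyl)

# Sequential SSQ: each term is adjusted only for the terms listed before it,
# so with am first its SSQ ignores cyl and wt
fit <- lm(mpg ~ am + cyl + wt, data = d)
seq.tab <- anova(fit)

# Adjusted SSQ: fit am last, so it is adjusted for cyl and wt
adj.tab <- anova(lm(mpg ~ cyl + wt + am, data = d))

seq.tab["am", "Sum Sq"]
adj.tab["am", "Sum Sq"]

# drop1() reports the adjusted SSQ for every term in a single table
drop1(fit, test = "F")
```

If the EVs were orthogonal, the two values for a term would be identical; a difference between them indicates that the term shares information with the other EVs.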
MCQ5: Which of the following statements best
describes your findings regarding the sequential and adjusted SSQ for
cons?
Adjusting for soil type and altitude…
[_] …increases the SSQ from 0.027 to 0.945. Accounting for soil type and altitude therefore strengthens the evidence for the effect of conservation.
[_] …reduces the SSQ from 0.945 to 0.027. Much of the apparent effect of conservation is therefore attributable to differences in soil type and altitude.
[_] …increases the SSQ from 0.025 to 0.718. Accounting for soil type and altitude therefore strengthens the evidence for the effect of conservation.
[_] …reduces the SSQ from 0.718 to 0.025. Much of the apparent effect of conservation is therefore attributable to differences in soil type and altitude.
If you find that you don’t have time to do this as part of the practical, you can come back to it for revision! There are no MCQs or marks associated with it (you do not need to submit anything relating to this part of the exercise on Blackboard).
Above, you have worked out the estimated effect of one additional metre of altitude on biomass. This is all well and good, but how confident can we be in this estimate?
TASK: Give a 95% confidence interval for the effect of an additional metre of altitude on biomass. For this, you will need to remind yourself how to calculate confidence intervals from standard errors (e.g., Grafen & Hails, p.14–15).
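The usual recipe is estimate ± critical t value × standard error, with the t value taken on the residual degrees of freedom. A minimal sketch on the built-in `mtcars` data (not the course data; `fit`, `est`, `se`, `t.crit` and `ci.by.hand` are hypothetical names), checked against R's `confint()`:

```r
fit <- lm(mpg ~ factor(am) + factor(cyl) + wt, data = mtcars)

# Estimate and standard error for the continuous EV, from the summary table
est <- coef(summary(fit))["wt", "Estimate"]
se  <- coef(summary(fit))["wt", "Std. Error"]

# Critical t value for a 95% interval on the residual degrees of freedom
t.crit <- qt(0.975, df = df.residual(fit))

ci.by.hand <- c(lower = est - t.crit * se, upper = est + t.crit * se)
ci.by.hand

confint(fit, "wt", level = 0.95)  # built-in equivalent for comparison
```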
This provides an additional example, much along the same lines as the conservation example. If you have time, do this in the practical, but there are no MCQs or marks associated with it (you do not need to submit anything relating to this extra exercise on Blackboard). If you find that you don’t have time to do this as part of the practical, you can come back to it for revision!
The academic performance of some students in the USA is evaluated as
a Grade Point Average (gpa) each year. Faculty are
concerned to admit good students, and assess students via tests that are
broken down into verbal skills (verbal) and mathematical
skills (math). A hundred students from each of two years
(year) had their marks analysed, to investigate whether
verbal or maths skills were more important in determining a student’s
gpa.
The data are in file Grades.csv.
How good is the evidence that math, verbal
or year predict gpa? In what
direction is the effect for each of these variables?
What gpa would you expect from a student in the
first year whose verbal score was 700 and mathematical score was
600?
What about a student in the second year whose verbal score was 600 and maths score was 700?
As in the previous “biomass” exercise, do the calculation “by
hand” using the estimated coefficients from the model
summary, and then check your result by using the
predict function in R.