Quick warm-up

You want to find out about differences in blood pressure (bp) between Black British, White British and Asian British patients. You have a large sample of data collected in GP surgeries.

  • What is your outcome / dependent variable?
  • What is your predictor / explanatory variable (EV)?
  • What other EVs would you want to include in your model, …
    • as covariates (continuous)?
    • as factors (categorical)?
  • Scribble down your lm().

What did you come up with?

‘Univariate’ LM: bp ~ ethnicity

Possible other EVs to include?

I can think of age, body weight, sex, income…

  • which of these would be covariates, i.e., continuous predictors?
  • which would be factors?

lm(bp ~ age + body.wt + sex + ethnicity)
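
To try the scribbled model out, here is a minimal sketch on simulated data. The data frame gp, the sample size and all effect sizes are made up for illustration; only the formula comes from the slide. In this simulation ethnicity has no true effect on bp.

```r
set.seed(1)
n <- 300

# Hypothetical GP-surgery data: two covariates and two factors
gp <- data.frame(
  age       = round(runif(n, 20, 80)),
  body.wt   = rnorm(n, mean = 78, sd = 12),
  sex       = factor(sample(c("F", "M"), n, replace = TRUE)),
  ethnicity = factor(sample(c("Asian British", "Black British", "White British"),
                            n, replace = TRUE))
)

# Made-up outcome: bp depends on age and body weight only
gp$bp <- 95 + 0.4 * gp$age + 0.2 * gp$body.wt + rnorm(n, sd = 10)

warmup <- lm(bp ~ age + body.wt + sex + ethnicity, data = gp)
warmup_aov <- anova(warmup)
print(warmup_aov, signif.stars = FALSE)
```

In the Df column, each covariate uses 1 degree of freedom, while a factor uses one fewer than its number of levels.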

Data Doodling #1

Get a feel for “how data may pan out”:

  • You are interested in the difference in blood pressure (bp) between two groups (A, B).
  • You have recorded bp in A and B, together with age since bp tends to go up with age.

Sketch a scatter plot where a naive t-test would indicate that mean bp is lower in B, but the difference is entirely explained by a group difference in mean age.
Hint: sketch the regression lines, not just data points!

What I mean

Something like this, but with more data points obviously, and a clear pattern.

Data Doodling #1

Different mean bp in A and B is entirely down to different mean age in A and B.
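
The same doodle as a simulation, in case you prefer code to pencil. This is only a sketch: group sizes, mean ages, the slope of 0.8 and the noise level are all assumptions of mine. By construction, bp depends on age alone and B is simply younger.

```r
set.seed(2)
n <- 100  # patients per group (assumed)

doodle1 <- data.frame(group = factor(rep(c("A", "B"), each = n)))
doodle1$age <- c(rnorm(n, mean = 60, sd = 5),   # group A is older
                 rnorm(n, mean = 45, sd = 5))   # group B is younger
doodle1$bp  <- 90 + 0.8 * doodle1$age + rnorm(2 * n, sd = 5)  # no group term

# Naive comparison that ignores age
naive_t <- t.test(bp ~ group, data = doodle1)
naive_diff <- unname(naive_t$estimate[2] - naive_t$estimate[1])  # mean B - mean A

# Group difference after adjusting for age
ancova <- lm(bp ~ age + group, data = doodle1)
adj_diff <- unname(coef(ancova)["groupB"])

c(naive = naive_diff, adjusted = adj_diff)
```

The naive difference should be clearly negative, whereas the age-adjusted group coefficient should hover around zero.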

Data Doodling #2

  • You are interested in the effect of exercise (h per week) on blood pressure (bp).
  • Your data come from two groups of people (A and B).

Now sketch a scatter plot where a naive y ~ x regression would come out positive (slope > 0), obscuring the beneficial effect of exercise in lowering bp.

Data Doodling #2

Both groups show the same negative relationship between exercise and bp. However, mean bp is higher in A even though they exercise more.
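
A sketch of this second doodle with simulated data. All numbers are invented: group A exercises 4 to 8 h per week, group B 0 to 4 h, and every hour of exercise lowers bp by 2 units in both groups.

```r
set.seed(3)
n <- 100  # people per group (assumed)

doodle2 <- data.frame(group = factor(rep(c("A", "B"), each = n)))
doodle2$exercise <- c(runif(n, 4, 8),   # group A exercises more
                      runif(n, 0, 4))   # group B exercises less
baseline <- ifelse(doodle2$group == "A", 160, 130)  # A starts from a higher bp
doodle2$bp <- baseline - 2 * doodle2$exercise + rnorm(2 * n, sd = 3)

naive_fit    <- lm(bp ~ exercise, data = doodle2)          # ignores group
adjusted_fit <- lm(bp ~ group + exercise, data = doodle2)  # accounts for group

c(naive    = unname(coef(naive_fit)["exercise"]),
  adjusted = unname(coef(adjusted_fit)["exercise"]))
```

The naive slope comes out positive because it mostly picks up the difference between the groups; once group is in the model, the slope for exercise turns negative.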

Write the model

If you can only use anova() from base R, what’s the right model for answering each of the two research questions?

  1. Do groups A and B differ in blood pressure?
  2. Does exercise affect blood pressure?

Answers (anova() tests terms sequentially, so the term of interest goes last):

  1. bp ~ age + group
  2. bp ~ group + exercise
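
Term order matters here because anova() on an lm uses sequential sums of squares: each term is credited only with what the terms before it have left unexplained. A small sketch on simulated data (group sizes and effect sizes are my assumptions; bp depends on age only):

```r
set.seed(4)
n <- 100  # patients per group (assumed)

d <- data.frame(group = factor(rep(c("A", "B"), each = n)))
d$age <- c(rnorm(n, mean = 60, sd = 5), rnorm(n, mean = 45, sd = 5))
d$bp  <- 90 + 0.8 * d$age + rnorm(2 * n, sd = 5)

age_first   <- anova(lm(bp ~ age + group, data = d))  # group tested after age
group_first <- anova(lm(bp ~ group + age, data = d))  # group tested first

# Sum of squares credited to group under each ordering
c(group_after_age = age_first["group", "Sum Sq"],
  group_first     = group_first["group", "Sum Sq"])
```

Both fits have the same residuals; only the split of the explained variation between age and group changes.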

Why does it matter?

Enzyme activity (made-up data)

Measuring yield for two substrates at six temperatures.

  • DV: product after 10 min (yield).
  • EV1: substrate (subst);
    fructose (F) or glucose (G)
  • EV2: reaction temperature (temp);
    16°C, 19°C, 22°C, 25°C, 28°C and 31°C.

subst is a factor,
but what about temp?

Covariate or Factor?

model_1 <- lm(yield ~ temp + subst, Enz)          # like this?
model_2 <- lm(yield ~ factor(temp) + subst, Enz)  # or rather like this?

What is…

  • the advantage of temp as covariate?
  • problematic about temp as covariate?

Write down the degrees of freedom (Df) used up by temp

  • in model_1 with temp as covariate;
  • in model_2 with temp as factor.

Let’s see

Make sense of changes in all values…

print( anova(model_1), signif.stars=FALSE)
Analysis of Variance Table

Response: yield
           Df  Sum Sq Mean Sq F value    Pr(>F)
temp        1 1006.66 1006.66 135.702 < 2.2e-16
subst       1  132.00  132.00  17.794 4.879e-05
Residuals 117  867.93    7.42                  

print( anova(model_2), signif.stars=FALSE)
Analysis of Variance Table

Response: yield
              Df  Sum Sq Mean Sq F value    Pr(>F)
factor(temp)   5 1423.85 284.771  71.392 < 2.2e-16
subst          1  132.00 132.001  33.093 7.625e-08
Residuals    113  450.74   3.989                  
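
One way to make sense of the changes is to redo the arithmetic by hand from the two printed tables. The sketch below uses only numbers shown above; the variable names are mine. If Enz were at hand, anova(model_1, model_2) should report the same F for the comparison of the two models.

```r
# Residual Sum Sq and Df, copied from the two tables
rss_1 <- 867.93; df_1 <- 117   # model_1: temp as covariate
rss_2 <- 450.74; df_2 <- 113   # model_2: temp as factor

# Residual mean squares (the denominators of the F values)
ms_1 <- rss_1 / df_1
ms_2 <- rss_2 / df_2

# Same Sum Sq for subst in both tables, but a smaller denominator in model_2
f_subst_1 <- 132.00 / ms_1
f_subst_2 <- 132.00 / ms_2

# Extra-sum-of-squares F test: are the 4 extra Df for temp worth it?
f_extra <- ((rss_1 - rss_2) / (df_1 - df_2)) / ms_2
p_extra <- pf(f_extra, df_1 - df_2, df_2, lower.tail = FALSE)

round(c(f_subst_1 = f_subst_1, f_subst_2 = f_subst_2, f_extra = f_extra), 2)
```

The Sum Sq for subst is the same in both tables; its F value nearly doubles only because the residual mean square shrinks.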

Which model fits / predicts better?

Lines: fit / predictions from model_1 (temp as covariate).

Crosshairs: fit / predictions from model_2 (temp as factor) for subst = glucose.

But model_2 can’t predict at 24°C…
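
A sketch of what that looks like in practice. Enz itself is not reproduced here, so the code simulates a stand-in (Enz_sim) with the same design: 2 substrates, 6 temperatures and an assumed 10 replicates each. The curved temperature response and the noise level are made up.

```r
set.seed(5)

# Stand-in for Enz: 10 replicates x 6 temperatures x 2 substrates = 120 rows
Enz_sim <- expand.grid(rep   = 1:10,
                       temp  = c(16, 19, 22, 25, 28, 31),
                       subst = c("F", "G"))
Enz_sim$subst <- factor(Enz_sim$subst)

# Made-up yield with a curved temperature response
Enz_sim$yield <- 30 - 0.1 * (Enz_sim$temp - 26)^2 +
  2 * (Enz_sim$subst == "G") + rnorm(nrow(Enz_sim), sd = 2)

sim_1 <- lm(yield ~ temp + subst, data = Enz_sim)          # temp as covariate
sim_2 <- lm(yield ~ factor(temp) + subst, data = Enz_sim)  # temp as factor

# A temperature that was never measured
new_obs <- data.frame(temp  = 24,
                      subst = factor("G", levels = levels(Enz_sim$subst)))

pred_1 <- predict(sim_1, newdata = new_obs)  # interpolates along the line

# With temp as factor, 24 is an unknown level; capture the error message
pred_2 <- tryCatch(predict(sim_2, newdata = new_obs),
                   error = function(e) conditionMessage(e))

pred_1
pred_2
```

With temp as a covariate, predict() returns a number for any temperature; with temp as a factor, it stops with an error because 24 is not one of the six levels the model has seen.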