A professor is interested in the pattern that average scores on the three exams that they give in their course follow. The professor gives three exams throughout the term: midterm 1, midterm 2, and a final exam. The professor hopes that scores on the exams follow a linear trend such that average scores improve on every successive exam. However, it’s also possible that the pattern of average scores could follow a quadratic trend such that scores could initially improve from midterm 1 to midterm 2, but then worsen on the final exam since the final exam tends to be more difficult than either midterm.
For this scenario, we’re going to test whether a linear or a
quadratic trend significantly fits the pattern in the data. The data can
be imported from the examscores.csv file on Canvas.
data <- import("examscores.csv")
head(data)
## ID exam score
## 1 1 1 75
## 2 2 1 40
## 3 3 1 82
## 4 4 1 60
## 5 5 1 52
## 6 6 1 72
The independent variable, exam, should be a factor. The dependent variable, score, should be integer or numeric.
IV: Exam (categorical predictor)
DV: Score on the exam
First, check each variable’s measure type.
str(data)
## 'data.frame': 30 obs. of 3 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ exam : int 1 1 1 1 1 1 1 1 1 1 ...
## $ score: int 75 40 82 60 52 72 21 74 79 55 ...
We need to convert ID and exam into
factors.
Note: Since exam was imported as an
integer variable, when converting it to a factor, we
need to specify the labels in the order that they are numerically
specified (not in alphabetical order like we would
need to do if the variable was imported as a character
variable).
data <- data %>%
mutate(ID = as.factor(ID),
exam = factor(exam, labels = c("Midterm 1", "Midterm 2", "Final Exam")))
levels(data$ID)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
levels(data$exam)
## [1] "Midterm 1" "Midterm 2" "Final Exam"
Let’s produce a descriptive statistics table showing the mean and standard deviation for scores across each level of exam:
Table
descriptives <- data %>%
group_by(exam) %>%
summarize(mean = mean(score),
sd = sd(score))
descriptives %>%
knitr::kable(digits = 2, col.names = c("Exam", "Mean", "SD"), caption = "Descriptive Statistics for Scores Across Exams Given", format.args = list(nsmall = 2))
| Exam | Mean | SD |
|---|---|---|
| Midterm 1 | 61.00 | 19.47 |
| Midterm 2 | 85.30 | 15.76 |
| Final Exam | 67.60 | 19.99 |
Second, let’s make a graph displaying the mean score across each exam given.
Graph
ggplot(descriptives, aes(x = exam, y = mean, fill = exam)) +
geom_bar(stat = "identity") +
ggtitle("Average Scores Across Exams Given") +
labs(x = "Exam",
y = "Mean Score") +
theme(plot.title = element_text(hjust = 0.5))
Question: What type of pattern do the average exam scores appear to follow?
The pattern appears to most closely resemble a quadratic relationship (scores in groups 1 and 3 are lower than scores in group 2).
For a categorical predictor with 3 levels, we can construct m-1 = 3-1 = 2 contrast codes.
When you are constructing two contrast codes, the set of codes we can use to test the significance of a linear and a quadratic trend in the data are:
Linear trend: -1/2, 0, 1/2
Quadratic trend: -1/3, 2/3, -1/3
Let’s specify these for our exam predictor:
Normally, we would specify the contrast codes like this (lines below are intentionally commented out):
# linearCC <- c(-1/2,0,1/2)
# quadraticCC <- c(-1/3,2/3,-1/3)
# contrasts(data$exam) <- cbind(linearCC, quadraticCC)
However, if we contrast code the predictor this way, when we use the
anova() function to interpret the model output, it will
only report an F-statistic for the overall model.
If we want the anova() function to provide an
F-statistic for each contrast code, we can use the
method below for assigning contrast codes.
This alternative method of contrast coding creates the contrast codes as variables in the original data set to be used as predictors.
It’s important, when using this method of contrast coding, to examine the data set first so you are aware how many participants are in each level of the categorical predictor.
Remember the contrast codes we want to create are:
Linear trend: -1/2, 0, 1/2
Quadratic trend: -1/3, 2/3, -1/3
# In the data, the first 10 scores are from the Midterm 1 condition, the second 10 scores are from the Midterm 2 condition, and the third 10 scores are from the Final Exam condition
data$linearCC <- c(rep(-1/2,10),rep(0,10),rep(1/2,10))
data$quadraticCC <- c(rep(-1/3,10),rep(2/3,10),rep(-1/3,10))
head(data)
## ID exam score linearCC quadraticCC
## 1 1 Midterm 1 75 -0.5 -0.3333333
## 2 2 Midterm 1 40 -0.5 -0.3333333
## 3 3 Midterm 1 82 -0.5 -0.3333333
## 4 4 Midterm 1 60 -0.5 -0.3333333
## 5 5 Midterm 1 52 -0.5 -0.3333333
## 6 6 Midterm 1 72 -0.5 -0.3333333
Note: Since we created each of these contrast codes
as their own variables and did not apply them to the
contrasts() assigned to exam, we have to use the variables
we created as the predictors in our model below.
Let’s predict score from the linear & quadratic
contrast codes for exam.
lmer() when fitting the model.model <- lmer(score ~ linearCC + quadraticCC + (1|ID), data = data)
lm() to fit the model, like
you see below:model_between <- lm(score ~ linearCC + quadraticCC, data = data)
residuals <- model %>%
residuals() %>%
as.data.frame() # store the residuals as a data frame
head(residuals) # check first few residuals
## .
## 1 -3.51234393
## 2 -6.46572565
## 3 -0.07307929
## 4 4.79428755
## 5 4.56316471
## 6 -1.98049892
colnames(residuals) <- "resid" # give it a column name
ggplot(data = residuals, aes(x = resid)) +
geom_density(fill = "purple") +
stat_function(linetype = 2,
fun = dnorm,
args = list(mean = mean(residuals$resid),
sd = sd(residuals$resid)))
Remember that when working with a within-subjects factor, the independence assumption is that each participant’s errors within each condition are independent of the other participants’ errors.
data$resid <- residuals$resid # add the residuals to the original data so we can play them by ID
ggplot(data = data, aes(x = ID, y = resid)) +
geom_point() +
facet_wrap(~exam)
We can check the sphericity assumption using Mauchly’s Test of Sphericity.
sphericity_test <- anova_test(data = data, dv = score, wid = ID, within = exam)
sphericity_test$`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 exam 0.978 0.913
Let’s examine the output using anova().
# Fitting the model
anova(model)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## linearCC 217.8 217.8 1 18 7.7652 0.01218 *
## quadraticCC 2940.0 2940.0 1 18 104.8198 6.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question: Is there a significant linear relationship between number of servings and nap lengths?
Yes, the linear trend is significant, F(1, 18) = 7.77, p = .012.
Question: Is there a significant quadratic relationship between number of servings and nap lengths?
Yes, the quadratic trend is significant, F(1, 18) = 104.82, p < .001.
Now, let’s examine the output using summary():
summary(model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: score ~ linearCC + quadraticCC + (1 | ID)
## Data: data
##
## REML criterion at convergence: 205.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.41723 -0.60327 0.01082 0.60144 1.39003
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 314.27 17.728
## Residual 28.05 5.296
## Number of obs: 30, groups: ID, 10
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 71.300 5.689 9.000 12.534 5.31e-07 ***
## linearCC 6.600 2.368 18.000 2.787 0.0122 *
## quadraticCC 21.000 2.051 18.000 10.238 6.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) linrCC
## linearCC 0.000
## quadraticCC 0.000 0.000
Question: Interpret the meaning of each of the parameter estimates.
b0: 71.30, average of the group means
descriptives
## # A tibble: 3 × 3
## exam mean sd
## <fct> <dbl> <dbl>
## 1 Midterm 1 61 19.5
## 2 Midterm 2 85.3 15.8
## 3 Final Exam 67.6 20.0
(61 + 85.3 + 67.6)/3
## [1] 71.3
b1: 6.60, average on the final exam minus average on midterm 1
67.6 - 61
## [1] 6.6
b2: 21.00, average on midterm 2 minus average across both midterm 1 and the final exam
85.3 - ((61+67.6)/2)
## [1] 21
Using a polynomial regression analysis, we examined whether there was a significant linear and/or quadratic trend in the average grades received by students on three different exams given across the course of one term.
There was a significant linear trend in students’ exam scores, F(1, 18) = 7.77, p = .012. This linear trend was due to final exam scores (M = 67.60, SD = 20.00) being significantly higher than scores on midterm 1 (M = 61.00, SD = 19.50), b = 6.60, t(18) = 2.79, p = .012.
There was also a significant quadratic trend in students’ exam scores, F(1, 18) = 104.82, p < .001. This was due to scores on midterm 2 (M = 85.30, SD = 15.80) being significantly higher than the average score across both midterm 1 (M = 61.00, SD = 19.50) and the final exam (M = 67.60, SD = 20.00), b = 21.00, t(18) = 10.24, p < .001.
You can find the set of orthogonal polynomial contrast codes that can be used for a categorical predictor with 4 or 5 levels on page 197 of your model comparison textbook. They’re also below for your reference:
Contrast Codes for a Categorical Predictor with 4 Ordinal Levels:
Linear -3 -1 1 3 Quadratic 1 -1 -1 1 Cubic -1 3 -3 1
Contrast Codes for a Categorical Predictor with 5 Ordinal Levels:
Linear -2 -1 0 1 2 Quadratic 2 -1 -2 -1 2 Cubic -1 2 0 -2 1 Quartic 1 -4 6 -4 1
Lecture examples
df <- import("forgiveness.xlsx")
str(df)
## 'data.frame': 18 obs. of 2 variables:
## $ age_group : num 1 1 1 1 1 1 2 2 2 2 ...
## $ forgiveness: num 4 3 5 6 3 3 9 10 8 9 ...
df <- df %>%
mutate(age_group = factor(age_group, labels = c("Under 18", "18-35", "35 and older")))
descriptives <- df %>%
group_by(age_group) %>%
summarize(mean = mean(forgiveness))
descriptives
## # A tibble: 3 × 2
## age_group mean
## <fct> <dbl>
## 1 Under 18 4
## 2 18-35 9
## 3 35 and older 3
ggplot(descriptives, aes(x = age_group, y = mean, fill = age_group)) +
geom_bar(stat = "identity") +
labs(x = "Age Group",
y = "Average Support") +
theme(plot.title = element_text(hjust = 0.5))
df$linearCC <- c(rep(-1,6),rep(0,6),rep(1,6))
df$quadCC <- c(rep(-1,6),rep(2,6),rep(-1,6))
model2 <- lm(forgiveness ~ linearCC + quadCC, data = df)
anova(model2)
## Analysis of Variance Table
##
## Response: forgiveness
## Df Sum Sq Mean Sq F value Pr(>F)
## linearCC 1 3 3.000 1.3235 0.268
## quadCC 1 121 121.000 53.3824 2.583e-06 ***
## Residuals 15 34 2.267
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1