For today’s lab, we’ll be using the data set called
extraversion.csv on Canvas.
Example: A researcher is interested in testing whether personality traits are characteristics that are stable across situations or whether the environment shapes people’s expression of personality traits. To test this research question, the researcher measures the number of times the same 16 participants express extraverted characteristics (e.g., initiating conversations with others, being enthusiastically talkative, etc.) in three different environments: at home, at school, and at a party. The researcher wants to know whether environment, overall, predicts how extraverted people’s behaviors are.
data <- import("extraversion.csv")
First, take a look at the data.
head(data)
## ID home school party
## 1 1 3 4 5
## 2 2 6 6 9
## 3 3 2 3 4
## 4 4 5 5 8
## 5 5 3 5 6
## 6 6 2 4 6
The data is currently in wide format because the participants’ scores on the DV (extraversion) are spread out across multiple columns. However, we want the data to be in long format where there is a separate column for the IV (environment: home, school, party) and the DV (extraversion).
Note that this means each participant’s ID number will occur three
times in the ID column. In other words, we’ll have a unique
row for every observation.
We can change wide data to long
data using the pivot_longer() function from the
tidyverse. To use this function, we need to provide four
arguments:
data_long <- pivot_longer(data,
cols = c("home","school","party"),
names_to = "environment",
values_to = "extraversion")
# Notice how the layout of the data changed
head(data_long)
## # A tibble: 6 × 3
## ID environment extraversion
## <int> <chr> <int>
## 1 1 home 3
## 2 1 school 4
## 3 1 party 5
## 4 2 home 6
## 5 2 school 6
## 6 2 party 9
Next, check each variable’s measure type:
str(data_long)
## tibble [48 × 3] (S3: tbl_df/tbl/data.frame)
## $ ID : int [1:48] 1 1 1 2 2 2 3 3 3 4 ...
## $ environment : chr [1:48] "home" "school" "party" "home" ...
## $ extraversion: int [1:48] 3 4 5 6 6 9 2 3 4 5 ...
ID & environment should be transformed
into factorsdata_long <- data_long %>%
mutate(ID = as.factor(ID),
environment = factor(environment, labels = c("home", "party", "school")))
And the data is already clean! (But with real-world data, you should inspect the data for any data entry errors that need to be fixed).
Let’s start with descriptively examining the M and SD on
extraversion across the three levels of
environment.
descriptives <- data_long %>%
group_by(environment) %>%
summarize(M = mean(extraversion, na.rm = TRUE),
SD = sd(extraversion, na.rm = TRUE))
# Making the table look a little nicer
descriptives %>%
knitr::kable(digits = 2, col.names = c("Environment", "Mean", "SD"), caption = "Descriptive Statistics for Extraversion Across Each Environment", format.args = list(nsmall = 2))
| Environment | Mean | SD |
|---|---|---|
| home | 3.25 | 1.61 |
| party | 6.38 | 1.75 |
| school | 4.69 | 1.20 |
# And we can plot the descriptives
ggplot(descriptives, aes(x = environment, y = M, fill = environment)) +
geom_bar(stat = "identity") +
ggtitle("Average Extraversion Across Environments") +
labs(x = "Environment",
y = "Mean Extraversion") +
theme(plot.title = element_text(hjust = 0.5))
Recall that the researcher wanted to know whether environment, overall, predicts extraversion scores. This means, when coding the categorical predictor, we are testing whether the full set of codes used to capture the categorical predictor make a significant improvement to the model compared to a model without the set full set of codes.
For environment, which has 3 levels, we need 3-1 = 2 contrast codes. Since we’re not interested in making a particular mean comparison, we can specify any set of codes that follow the rules of contrast codes.
EnvCode1 <- c(2/3, -1/3, -1/3)
EnvCode2 <- c(0, 1/2, -1/2)
contrasts(data_long$environment) <- cbind(EnvCode1, EnvCode2)
contrasts(data_long$environment)
## EnvCode1 EnvCode2
## home 0.6666667 0.0
## party -0.3333333 0.5
## school -0.3333333 -0.5
The model corresponding to the null hypothesis, Model C, is:
\[Model C: Extraversion_i = \beta_0 + \epsilon_i\]
The model corresponding to the alternative hypothesis, Model A, is: \[Model A: Extraversion_i = \beta_0 + \beta_1*EnvCode1 + \beta_2*EnvCode2 + \epsilon_i\]
The null hypothesis is that both \(\beta_1\) and \(\beta_2\) will be equal to zero.
\[H_0: \beta_1\ = \beta_2\ = 0\] The alternative hypothesis is that at least one of these model parameters is not equal to zero.
Next, let’s fit the model predicting extraversion from
environment using the custom contrast codes we just
created.
Since extraversion is a within-subjects factor, we
cannot simply use lm to fit the model. We need to fit the
model using lmer from the lme4 package, which
allows us to tell the model that subjects provided multiple scores.
(1|ID) to the
predictor side of the formula.model <- lmer(extraversion ~ environment + (1|ID), data = data_long)
By examining the distribution of the residuals.
residuals <- model %>%
residuals() %>%
as.data.frame() # store the residuals as a data frame
head(residuals) # check first few residuals
## .
## 1 0.37574264
## 2 -0.06175736
## 3 -0.74925736
## 4 0.94041992
## 5 -0.49708008
## 6 0.81541992
colnames(residuals) <- "resid" # give it a column name
ggplot(data = residuals, aes(x = resid)) +
geom_density(fill = "purple") +
stat_function(linetype = 2,
fun = dnorm,
args = list(mean = mean(residuals$resid),
sd = sd(residuals$resid)))
The independence assumption assumes that each participant’s
residuals are independent of the other participants’ residuals. The
best way to ensure independence is by collecting a random sample of
participants during data collection. Similarly to what we’ve done
before, though, we can examine whether the model’s residuals are
associated with a variable that they should not be, like
ID.
We want to examine whether there’s a relationship ID and
each participant’s residuals within each of the three environment
conditions.
data_long$resid <- residuals$resid # add the residuals to the original data so we can play them by ID
ggplot(data = data_long, aes(x = ID, y = resid)) +
geom_point() +
facet_wrap(~environment)
The sphericity assumption is that the variances of the differences between scores in all combinations of related groups are equal. We can check whether we’ve met, or violated, the sphericity assumption using Mauchly’s Test of Sphericity.
To run the test, we’ll use the anova_test() function
from the `rstatix`` package. This function requires the following
arguments:
sphericity_test <- anova_test(data = data_long, dv = extraversion, wid = ID, within = environment)
sphericity_test$`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 environment 0.997 0.979
To examine whether we have violated the sphericity assumption, look at the results for Mauchly’s Test for Sphericity. Like most statistical tests of our assumptions, we want to see a non-significant p-value, which would indicate we have not violated the assumption.
In this case, Mauchly’s Test for Sphericity is not significant, W = 0.997, p = .979.
Good - this indicates that the variance of the difference scores for each pair of related conditions are not significantly different!
If we had violated the sphericity assumption, we should report one of the sphericity-corrected tests (the greenhouse-geisser or huynh-feldt corrections).
sphericity_test$`Sphericity Corrections`
## Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF] p[HF]
## 1 environment 0.997 1.99, 29.91 3.21e-09 * 1.15 2.3, 34.49 3.05e-09
## p[HF]<.05
## 1 *
Since we want to know whether environment overall was a
significant predictor of extraversion, let’s pass the model to the
anova() function to get the overall F-statistic.
anova(model)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## environment 78.292 39.146 2 30 40.438 3.05e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question: Was environment, overall, a significant predictor of extraversion scores?
Yes, F(2, 30) = 40.44, p < .001.
We can see in our output that SSR = 78.292. Let’s
store it in an object called SSR.
SSR <- anova(model)$'Sum Sq'[1]
SSR
## [1] 78.29167
We don’t have SSE(C), though, which we need in order to calculate PRE. We can calculate SSE(C) by running model C and calculating SSE_C “by hand”.
model_C <- lm(extraversion ~ 1, data = data_long)
SSE_C <- sum(residuals(model_C)^2) # SSE(C) = sum of the squared errors for Model C
SSE_C
## [1] 184.4792
Then, we can calculate PRE = SSR/SSE(C).
PRE <- SSR/SSE_C
PRE
## [1] 0.424393
Model A, which includes environment as a predictor, accounts for 42% more variance in extraversion scores than Model C.
When the overall effect of a categorical predictor is significant, that indicates there is a significant difference between at least two conditions of the categorical predictor, but the results don’t tell us which two conditions are significantly different.
To know which conditions are significantly different from one another, we can construct contrast codes that test a comparison between conditions that’s of theoretical interest. Then, we can look at the significance of each individual predictor in the model.
Recall that we used the following contrast codes to code
environment:
EnvCode1 compares the mean extraversion of participants at
home to the mean extraversion of participants both at a
party and at school.
EnvCode2 compares the mean extraversion of participants at a
party versus at school.
To test the significance of each individual predictor in the model,
examine the summary() output:
summary(model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: extraversion ~ environment + (1 | ID)
## Data: data_long
##
## REML criterion at convergence: 159.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.96919 -0.51118 -0.04147 0.47791 2.01256
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 1.3917 1.1797
## Residual 0.9681 0.9839
## Number of obs: 48, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 4.7708 0.3273 15.0000 14.575 2.91e-10 ***
## environmentEnvCode1 -2.2813 0.3013 30.0000 -7.572 1.92e-08 ***
## environmentEnvCode2 1.6875 0.3479 30.0000 4.851 3.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) envEC1
## envrnmntEC1 0.000
## envrnmntEC2 0.000 0.000
Question: Was there a significant difference in extraversion scores when participants were at home versus not at home?
Yes, b1 = -2.28, t(30) = 4.85, p < .001.
Question: Was there a significant difference in extraversion scores when participants were at a party versus at school?
Yes, b2 = 1.69, t(30) = 4.85, p < .001.
Note: If you wanted to test another comparison (for instance, home versus school), you would need to construct a new set of contrast codes that specify that comparison.
It is generally preferred to do planned comparisons, which are theoretically motivated comparisons between groups that the researcher states they will analyze prior to conducting their analysis.
However, post-hoc comparisons can also be used to explore whether there are unexpected significant differences between conditions. Post-hoc comparisons typically make every comparison possible between every pair of conditions. Because this results in many significance tests being conducted, a correction to the p-value must be made to prevent inflating our chances of making a Type I error.
means <- emmeans(model, ~environment)
pairs(means, adjust = "tukey")
## contrast estimate SE df t.ratio p.value
## home - party -3.12 0.348 30 -8.983 <.0001
## home - school -1.44 0.348 30 -4.132 0.0008
## party - school 1.69 0.348 30 4.851 0.0001
##
## Degrees-of-freedom method: kenward-roger
## P value adjustment: tukey method for comparing a family of 3 estimates
confint(model)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 0.7305375 1.793845
## .sigma 0.7595578 1.243891
## (Intercept) 4.1104248 5.431242
## environmentEnvCode1 -2.8705466 -1.691953
## environmentEnvCode2 1.0070390 2.367961
In this study, we examined whether extraversion scores varied depending on the environment people were measured in (at home, at school, or at a party). We found that there was a significant effect of environment on demonstration of extraverted characteristics, F(2, 30) = 40.44, p < .001. Furthermore, environment accounted for 42% (PRE = 0.42) of the variance in extraversion scores.
Specifically, when participants were at home (M = 3.25, SD = 1.61), they exhibited significantly fewer extraverted behaviors than the average extraversion levels demonstrated across the party (M = 6.38, SD = 1.75) and school (M = 4.69, SD = 1.20) conditions, t(30) = -7.57, p < .001, b1 = -2.28, 95%CI[-2.87, -1.69].
Additionally, participants exhibited significantly more extraverted behaviors when they were at a party compared to when they were at school, t(30) = 4.85, p < .001, b2 = 1.69, 95%CI[1.01, 2.37].