The primary empirical method in many branches of Psychology is the controlled experiment. The two primary properties of an experiment are that the researcher manipulates the conditions under which the outcome is measured, and that the context is controlled as strictly as possible. Under these conditions, the data can usually be analyzed appropriately with t-tests or ANOVA. In contrast to experimental psychology, observational studies play a significant role in more applied fields of research. This brings several complications with respect to the assumptions of experimental research:
First, it is frequently the case that the “independent” variable is not truly manipulated, but only measured. For example, many experiments dealing with aesthetic evaluations of web designs use screenshots of existing websites as their stimuli (Tuch, Bargas-Avila, & Opwis, 2010). With collected stimuli the researcher does not have the same amount of control as would be the case with designed stimuli. The problem arises that the property of interest, say symmetry of visual layout, may, for whatever reasons, be confounded with another property, say, font size. Other examples are age and gender, which are often regarded as independent variables, but cannot be manipulated in practice.
Second, it is often the case that multiple, correlated variables have an impact on the outcome. Often there are categorical predictors (e.g., an academic degree), as well as metric predictors (e.g., computer experience in years). Such variables, if disregarded, can contribute a fair amount of noise to the data. But, most of all, they reflect the situation of real use, and therefore are crucial.
Third, the effect of one variable often depends on another, which is called interaction. Interaction effects, if ignored, can severely bias or mask effects, and limit the interpretability of findings. For example, if the “conditional” variable is a property of website users (e.g. the effect of the manipulated variable depends on age), the result can become strongly dependent on the composition of the sample. These situations are beyond what one can reasonably handle with simple statistical procedures, such as t-test, ANOVA or linear regression.
In this chapter, we will first revisit the “usual suspects”, the most frequently used methods of ANOVA and linear regression, and see how these analyses can be done in R. As it turns out, both methods share a common underlying logic (ANOVA is a special case of linear regression), and it is a small step to combine them in the framework of General Linear Models (GLM).
One of the most basic statistical routines is analysis of variance (ANOVA), which compares groups on a metric outcome. In the following example, the research question is
Do two web designs A and B differ in performance?
This question could arise, for instance, during an overhaul of a municipal website. With the emergence of e-government, this website had grown wildly over a decade. With the current navigation structure, users have increasing difficulty finding the information they need. A prototype of a new web design is developed and tested with 200 users. Every user is given the same five tasks, but sees only one of the two designs. The outcome measure is the time it takes to find the information asked for in the task (Time on Task, ToT). Below we see part of the data frame and histograms of ToT for designs A (old design) and B (new design) separately.
# look at data:
data_Browsing_AB %>%
filter(row_number() <= 10) %>%
select(1:6) %>%
kable()
| Subj | education | gender | age | Design | ToT |
|---|---|---|---|---|---|
| 23 | degree | m | 28.41143 | A | 62.30250 |
| 24 | degree | m | 28.41143 | B | 71.14089 |
| 45 | no degree | f | 38.93923 | A | 165.00313 |
| 46 | no degree | f | 38.93923 | B | 176.26674 |
| 67 | degree | f | 76.07241 | A | 105.24776 |
| 68 | degree | f | 76.07241 | B | 112.53900 |
| 89 | no degree | f | 39.65661 | A | 150.06767 |
| 90 | no degree | f | 39.65661 | B | 113.33963 |
| 111 | degree | m | 22.77343 | A | 159.78351 |
| 112 | degree | m | 22.77343 | B | 71.04381 |
# plot data
data_Browsing_AB %>%
ggplot(aes(x = ToT)) +
geom_histogram(fill = 1, alpha = .5) +
facet_grid(Design~.)
Now we perform the ANOVA. In R, this is a two-step procedure:

1. The aov command lets you specify a simple formula to express the dependency between a factor (here: Design) and the outcome variable (here: ToT). Note that the formula resembles a regression equation. The results of the ANOVA are stored in an object (ANOVA_Design).
2. The summary command outputs a comprehensive summary of the results of the ANOVA.
# ANOVA analysis
ANOVA_Design <-
data_Browsing_AB %>%
aov(ToT ~ Design, data = .)
# summarize results
summary(ANOVA_Design)
## Df Sum Sq Mean Sq F value Pr(>F)
## Design 1 4558 4558 2.83 0.0941 .
## Residuals 198 318858 1610
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, overall, the new design does not seem to have a general effect on Time on Task.
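With only two groups, the same comparison can also be run as a classical t-test. As a quick sketch: assuming equal variances in both groups, the pooled two-sample t-test gives exactly the same p-value as the one-way ANOVA above (F = t²).
# two-sample t-test on the same data; with pooled variances this is
# equivalent to the one-way ANOVA above
t.test(ToT ~ Design, data = data_Browsing_AB, var.equal = TRUE)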
Now, let us look at a slightly more complicated situation, where Time on Task (ToT) is predicted by two factors: education and Design. First, we plot the data with the ggplot system.
data_Browsing_AB %>%
ggplot(aes(x = education, y = ToT, fill = Design)) +
geom_boxplot()
# display group means
data_Browsing_AB %>% aggregate( ToT~education+Design, . , mean )
## education Design ToT
## 1 degree A 91.80865
## 2 no degree A 141.78354
## 3 degree B 83.54079
## 4 no degree B 130.95649
It seems that having a degree predicts lower times on task. We can run the two-factorial ANOVA by specifying the predictors education and Design in the model formula and storing the analysis results in an object. We print the results using the summary command.
ANOVA_education_design <-
data_Browsing_AB %>%
aov(ToT ~ education + Design, data = .)
summary(ANOVA_education_design)
## Df Sum Sq Mean Sq F value Pr(>F)
## education 1 118562 118562 116.610 <2e-16 ***
## Design 1 4558 4558 4.483 0.0355 *
## Residuals 197 200297 1017
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this table we can see that Design has an effect on Time on Task, once we control for the effects of education. However, the ANOVA table does not give us any information on the actual direction of the effect.
It is not a coincidence that the model specification for an ANOVA in R resembles a regression equation. In fact, ANOVA is nothing more than a special case of linear regression. To see this, instead of asking for a summary, ask for the model coefficients of the two-factorial ANOVA.
# print coefficients
coef(ANOVA_education_design)
## (Intercept) educationno degree DesignB
## 92.448450 48.695291 -9.547452
#print group means
data_Browsing_AB %>% aggregate( ToT~education+Design, . , mean )
## education Design ToT
## 1 degree A 91.80865
## 2 no degree A 141.78354
## 3 degree B 83.54079
## 4 no degree B 130.95649
Look at these coefficients: what do you think they mean?
As you can see, these coefficients do not reproduce the group means exactly. This is because we have fitted a model with two main effects only, and such a model rarely fits the data perfectly; it is a so-called non-saturated model. To see whether the group means can be predicted significantly better by a model that also contains the interaction, we run a full factorial model (a saturated model with both main effects and their interaction).
ANOVA_fullfact <-
data_Browsing_AB %>%
aov(ToT ~ education + Design + education:Design, data = .)
summary(ANOVA_fullfact)
## Df Sum Sq Mean Sq F value Pr(>F)
## education 1 118562 118562 116.066 <2e-16 ***
## Design 1 4558 4558 4.462 0.0359 *
## education:Design 1 82 82 0.080 0.7774
## Residuals 196 200215 1022
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(ANOVA_fullfact)
## (Intercept) educationno degree
## 91.808655 49.974882
## DesignB educationno degree:DesignB
## -8.267861 -2.559182
With the function coef() we print the coefficients of the full factorial model. When we now use these coefficients we can reconstruct the four group means we saw earlier perfectly:
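A quick sketch of that arithmetic, using the coefficients stored in ANOVA_fullfact (the helper object b is introduced here only for readability):
b <- coef(ANOVA_fullfact)
b[1]                       # degree, Design A:     91.81
b[1] + b[2]                # no degree, Design A: 141.78
b[1] + b[3]                # degree, Design B:     83.54
b[1] + b[2] + b[3] + b[4]  # no degree, Design B: 130.96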
It is as if the ANOVA has used dummy coding for the effects of having no degree and of Design B. And that is exactly what is happening: R (and other statistical packages such as SPSS) first makes a choice about how to code categorical variables into quantitative variables, and then applies linear regression.
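If you want to inspect the dummy coding that R constructs behind the scenes, model.matrix() returns the recoded predictor columns. A small sketch, peeking at the first few rows only:
# inspect the 0/1 dummy variables R builds from the two factors
model.matrix(ToT ~ education + Design, data = data_Browsing_AB) %>%
head()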
But first, let’s have a look at a simple linear regression with a metric predictor: age. Let’s plot the relationship between age and ToT. We ask ggplot to overlay a smoothing line that lets us assess the overall trend at a glance.
data_Browsing_AB %>%
ggplot(aes(x = age, y = ToT)) +
geom_point()+
geom_smooth()
Overall, ToT seems to increase with age.
Now, to do a linear regression, we use the lm command in much the same way as the aov command.
# linear regression with a metric predictor
LM_age <-
data_Browsing_AB %>%
lm(ToT ~ 1 + age,
data = .)
summary(LM_age)
##
## Call:
## lm(formula = ToT ~ 1 + age, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.737 -28.738 -2.309 27.947 89.035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.8466 7.8210 9.058 < 2e-16 ***
## age 0.8397 0.1500 5.597 7.19e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.55 on 198 degrees of freedom
## Multiple R-squared: 0.1366, Adjusted R-squared: 0.1323
## F-statistic: 31.33 on 1 and 198 DF, p-value: 7.193e-08
The ‘1’ in the formula stands for the ‘intercept’ or ‘grand mean’. So, the linear model suggests that there is a significant linear relationship between age and ToT. To assess how well the model fits the data, it is recommended to check the distribution of residuals. With lm, we can obtain the residuals by extracting the predicted values and subtracting them from the observed values.
The predict command is called on the fitted model object LM_age, which reflects the fact that residual analysis is something you can only do after estimation. Below, we also estimate the so-called null model, which characterizes the data by just one grand mean (without the effect of age).
LM_null <-
data_Browsing_AB %>%
lm(ToT ~ 1, data = .)
data_Browsing_AB %>%
mutate( resid_0 = ToT - predict(LM_null),   # residual = observed - predicted
resid_age = ToT - predict(LM_age) ) %>%
gather(model, residuals, resid_0:resid_age) %>%
ggplot(aes(x = residuals)) +
geom_histogram() +
facet_grid(model~.)
We see that with increasing model complexity (from no predictor to one predictor) the variance of the residuals decreases somewhat; in other words, age explains part of the variance. The regression output above shows that the proportion of variance explained by age is about 14 percent (R² = .137) and statistically significant.
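The reported R-squared can also be recovered directly from the residuals, which makes the link between “explained variance” and residual variance explicit (a small sketch):
# proportion of variance explained: 1 - residual variance / total variance
1 - var(residuals(LM_age)) / var(data_Browsing_AB$ToT)   # ~ .137, the R-squared above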
Next, we carry out a linear regression analysis with ToT as the dependent variable and the factors education and Design as qualitative predictors, and also include their interaction (the product of the dummy variables). For this we again use the function lm.
LM_fullfact <-
data_Browsing_AB %>%
lm( ToT ~ 1 + education + Design + education:Design, data = .)
summary(LM_fullfact)
##
## Call:
## lm(formula = ToT ~ 1 + education + Design + education:Design,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.389 -22.628 -0.038 23.251 80.299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.809 4.520 20.312 < 2e-16 ***
## educationno degree 49.975 6.392 7.818 3.22e-13 ***
## DesignB -8.268 6.392 -1.293 0.197
## educationno degree:DesignB -2.559 9.040 -0.283 0.777
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.96 on 196 degrees of freedom
## Multiple R-squared: 0.3809, Adjusted R-squared: 0.3715
## F-statistic: 40.2 on 3 and 196 DF, p-value: < 2.2e-16
We then print the regression coefficients of both models and see that the lm and ANOVA coefficients are exactly the same:
coef(LM_fullfact)
## (Intercept) educationno degree
## 91.808655 49.974882
## DesignB educationno degree:DesignB
## -8.267861 -2.559182
coef(ANOVA_fullfact)
## (Intercept) educationno degree
## 91.808655 49.974882
## DesignB educationno degree:DesignB
## -8.267861 -2.559182
This is no surprise. Actually, the aov function in R is a ‘wrapper’ for the lm function: when you use aov, R internally makes lm do the work. It is only different in how it displays results: as regression coefficients or mean sums of squares.
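This close relationship is visible in the fitted object itself; a quick check (just an illustration):
# an aov fit also inherits from the lm class
class(ANOVA_fullfact)   # "aov" "lm"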
In other words, what an analysis of variance actually does (or is equivalent to) is recoding the categorical factors into quantitative variables and applying linear regression. In R and many other statistical packages, the standard approach is dummy coding (1/0), as we have seen here. Realizing this, it is now easy to see that the lm function is much more general than the aov function: we can use both categorical and metric predictors in one general framework. If a predictor is numeric, it is analyzed as quantitative (with 1 degree of freedom for a main effect); if it is a factor, it is analyzed as a set of dummy-coded variables (with k − 1 degrees of freedom for a factor with k observed levels).
So if we want to estimate a model with age (quantitative) and education (qualitative), we simply use the original factor variable education.
Let us estimate three models with an increasing number of predictors: first only age (M1), then adding education (M2), then adding gender (M3), and plot the residuals of the three models:
M1 <-
data_Browsing_AB %>%
lm( ToT ~ 1 + age,
data = .)
M2 <-
data_Browsing_AB %>%
lm( ToT ~ 1 + age + education,
data = .)
M3 <-
data_Browsing_AB %>%
lm( ToT ~ 1 + age + education + gender,
data = .)
data_Browsing_AB %>%
mutate( resid_M1 = ToT - predict(M1),   # residual = observed - predicted
resid_M2 = ToT - predict(M2),
resid_M3 = ToT - predict(M3) ) %>%
gather(model, residuals, c(resid_M1, resid_M2,resid_M3)) %>%
ggplot(aes(x = residuals)) +
geom_histogram() +
facet_grid(model~.)
coef(M3)
## (Intercept) age educationno degree
## 48.1706948 0.8137757 48.1292140
## genderm
## -0.2343352
So we see that adding education to the model results in smaller residual variance, whereas adding gender does not lead to a clear change in the distribution of the residuals. Be careful, though, when mixing quantitative and qualitative variables: always check the degrees of freedom in the output to make sure a categorical variable is not analysed as a quantitative variable by mistake (this is especially important for variables with more than two levels).
But linear models can do more than just add main effects: you can mix any number of factors and covariates and look at their interactions. Let us see what this does in our small example, where we look at the relationship between ToT, age and education:
data_Browsing_AB %>%
ggplot(aes(x = age,
y = ToT,
col = education)) +
geom_point() +
geom_smooth(method = "lm", se = F)
The estimated regression line for age appears steeper for people without a degree than for people with a degree. Here we call lm once again, combining all predictors with ‘+’, but now also adding an age-by-education interaction effect.
M4 <-
data_Browsing_AB %>%
lm( ToT ~ 1 + age + education + gender + age:education,
data = .)
summary(M4)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.7074992 8.2611533 7.8327440 2.998737e-13
## age 0.4594517 0.1543047 2.9775613 3.274006e-03
## educationno degree 10.9934401 11.7672066 0.9342438 3.513336e-01
## genderm 1.0525993 4.0220810 0.2617051 7.938250e-01
## age:educationno degree 0.7595516 0.2263455 3.3557183 9.514209e-04
We see that the coefficient for age:educationno degree is positive: the effect of age is estimated to be 0.76 larger for people without a degree than for people with a degree.
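Spelled out with the coefficients of M4, the two age slopes implied by the model are (a quick sketch, reusing the coefficient names printed above):
b <- coef(M4)
b["age"]                                # age slope with a degree:    ~0.46
b["age"] + b["age:educationno degree"]  # age slope without a degree: ~1.22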
If you want the full factorial model with age, education, gender and Design, with all main effects and all possible two-way, three-way and four-way interaction effects, it can be specified as:
M_ridiculous <-
data_Browsing_AB %>%
lm( ToT ~ 1 + ( age + education + gender + Design)^4,
data = .)
summary(M_ridiculous)$coefficients [,c(1,4)]
## Estimate Pr(>|t|)
## (Intercept) 93.9001070 1.870167e-07
## age 0.0525118 8.785377e-01
## educationno degree -15.0197860 5.224265e-01
## genderm -2.5760946 9.081278e-01
## DesignB -68.5707862 5.694932e-03
## age:educationno degree 1.1445826 1.156574e-02
## age:genderm -0.1155961 7.896880e-01
## age:DesignB 1.1619846 1.764558e-02
## educationno degree:genderm 46.2295763 1.565939e-01
## educationno degree:DesignB 39.4811746 2.351377e-01
## genderm:DesignB 21.3415639 4.992897e-01
## age:educationno degree:genderm -0.6400597 3.116495e-01
## age:educationno degree:DesignB -0.7782987 2.216082e-01
## age:genderm:DesignB -0.2971787 6.278587e-01
## educationno degree:genderm:DesignB -58.9175425 2.015040e-01
## age:educationno degree:genderm:DesignB 1.0678506 2.328924e-01
Of course, it is a bit ridiculous here to estimate a model with 15 effects (16 coefficients, including the intercept) on only 200 participants. In addition, we have to be very careful interpreting the significance of a k-way interaction effect if there are interaction effects of order higher than k in the model (why?). Here we see that the four-way effect is not significant, so we may want to drop it from the model. We can then rerun the analysis with only two- and three-way interaction effects.
M_lessridiculous <-
data_Browsing_AB %>%
lm( ToT ~ 1 + (age + education + gender + Design)^3,
data = .)
kable(summary(M_lessridiculous)$coefficients)
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 86.3734250 | 16.1688182 | 5.3419752 | 0.0000003 |
| age | 0.2104618 | 0.3171120 | 0.6636830 | 0.5077195 |
| educationno degree | -1.7121244 | 20.6567655 | -0.0828844 | 0.9340330 |
| genderm | 9.5722892 | 19.8707894 | 0.4817267 | 0.6305695 |
| DesignB | -53.5174222 | 21.0614356 | -2.5410149 | 0.0118742 |
| age:educationno degree | 0.8744501 | 0.3883014 | 2.2519878 | 0.0254985 |
| age:genderm | -0.3668520 | 0.3788936 | -0.9682190 | 0.3341991 |
| age:DesignB | 0.8460845 | 0.4076856 | 2.0753358 | 0.0393388 |
| educationno degree:genderm | 20.3151378 | 24.2663373 | 0.8371736 | 0.4035751 |
| educationno degree:DesignB | 12.8658513 | 24.6085125 | 0.5228212 | 0.6017243 |
| genderm:DesignB | -2.9552037 | 24.1488853 | -0.1223743 | 0.9027354 |
| age:educationno degree:genderm | -0.1061344 | 0.4466181 | -0.2376401 | 0.8124233 |
| age:educationno degree:DesignB | -0.2380336 | 0.4465867 | -0.5330065 | 0.5946687 |
| age:genderm:DesignB | 0.2053332 | 0.4458445 | 0.4605488 | 0.6456632 |
| educationno degree:genderm:DesignB | -7.0886655 | 15.4250521 | -0.4595554 | 0.6463750 |
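As a cross-check on dropping the four-way term, we can also compare the two nested models directly; anova() on two nested lm fits gives the corresponding F-test (with a single dropped term, this agrees with the t-test of that coefficient in the larger model). A minimal sketch:
# does the four-way interaction improve on the three-way model?
anova(M_lessridiculous, M_ridiculous)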
Since all three-way interactions are not significant, we limit ourselves to two-way interactions:
M_reasonable <-
data_Browsing_AB %>%
lm( ToT ~ 1 + (age + education + gender + Design)^2,
data = .)
kable(summary(M_reasonable)$coefficients)
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 84.0830023 | 12.1986269 | 6.8928251 | 0.0000000 |
| age | 0.2393769 | 0.2287171 | 1.0466071 | 0.2966175 |
| educationno degree | 8.7321025 | 12.7812981 | 0.6831937 | 0.4953211 |
| genderm | 8.5360261 | 12.5649195 | 0.6793538 | 0.4977448 |
| DesignB | -51.9289037 | 12.6079874 | -4.1187306 | 0.0000569 |
| age:educationno degree | 0.7017360 | 0.2214434 | 3.1689178 | 0.0017848 |
| age:genderm | -0.3141304 | 0.2210754 | -1.4209196 | 0.1569879 |
| age:DesignB | 0.8510494 | 0.2139649 | 3.9775186 | 0.0000991 |
| educationno degree:genderm | 11.6194996 | 7.6486306 | 1.5191608 | 0.1303931 |
| educationno degree:DesignB | -2.7319742 | 7.6065678 | -0.3591599 | 0.7198764 |
| genderm:DesignB | 3.9470652 | 7.6133305 | 0.5184413 | 0.6047568 |
On the basis of the model coefficients we can then infer that, controlling for all other effects in the model, the effect of age on ToT is larger (i.e., more positive or less negative) for participants who worked with the Design B website than for those who worked with the Design A website. Or, put differently, the difference in ToT between Design B and Design A (B minus A) was more positive (or less negative) for older participants than for younger participants. Let us visualize this pattern as predicted by the model with only two-way interactions, and compare it with the observed data pattern.
First the raw, observed data, with fitted regression lines for four different subgroups:
data_Browsing_AB %>%
ggplot(aes(x = age,
y = ToT,
shape = education, col = Design)) +
geom_point(size = 2) +   # fixed point size is set outside aes()
geom_smooth(method = "lm", se = FALSE, size = 1)
Now, for every participant, we plot the predicted ToT based on the model with main effects and two-way interaction effects:
data_Browsing_AB %>%
mutate( pred_Mfull2 = predict(M_reasonable)) %>%
ggplot(aes(x = age,
y = pred_Mfull2,
shape = education, col = Design)) +
geom_point(size = 2)   # fixed point size is set outside aes()
So, with the effects of education and gender (and their interactions) also in the model, the relationship between ToT, age, education and Design is, according to this model, rather complex. For those with and without a degree, and for males and females alike, the effect of Design is larger for young participants than for old participants. Also, for both designs, the overall effect of age on ToT is larger in participants without a degree than in participants with a degree.
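We can also put rough numbers on this interaction by computing the model-implied difference between Design B and Design A at a few ages from the coefficients of M_reasonable. This is only a sketch for the reference group (female participants with a degree); the remaining interaction terms with Design shift the line slightly for the other groups.
b <- coef(M_reasonable)
# predicted ToT difference (Design B minus Design A) for the reference group
design_effect <- function(age) unname(b["DesignB"] + b["age:DesignB"] * age)
design_effect(c(25, 40, 60))   # roughly -31, -18 and -1 seconds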
Concluding, based on the model we can say that Design B leads to shorter search times mainly in young users (roughly age < 40); in older users, the effect seems negligible.
Note that an interpretation such as this is hard to make based on the statistical output alone. Generally, interaction plots are great for understanding statistical interactions.
Concluding remarks:
When using lm, the intercept (the ‘1’ in the formula) is included by default, so writing it explicitly is optional; the same holds for aov. When interactions are specified with the ‘*’ operator, the corresponding lower-order terms are included automatically: when a k-way crossing is specified this way, all lower-order interactions and main effects of the same variables are included as well. Thus, the formula y ~ a*b*c gives you a full factorial analysis with all main effects, all two-way interaction effects, the three-way effect, and the intercept, and is therefore equivalent to the formula y ~ 1 + a + b + c + a:b + a:c + b:c + a:b:c. The ‘:’ operator, by contrast, adds only the specified interaction term itself, which is why we wrote the main effects out explicitly in the models above.
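If you are ever unsure how R expands a formula, the terms() function lists the individual terms it will generate; a quick check (using the variables from our data):
# list the terms R derives from the crossing operator
attr(terms(ToT ~ education * Design), "term.labels")
# returns: "education" "Design" "education:Design"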
Variables specified as a factor are analysed as categorical predictors, even when their levels are labelled with numbers; conversely, a categorical variable stored as a numeric column is silently treated as metric. So be careful when a variable has more than two levels. Changing a numeric variable into a factor can be done with as.factor(variable_name). Always check the degrees of freedom in your output.
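A minimal, hypothetical illustration (the data frame d below is made up for this purpose): the same numeric group code analysed as metric versus as a factor, visible in the degrees of freedom.
# hypothetical toy data: a condition coded 1/2/3, stored as numbers
d <- data.frame(condition = rep(1:3, times = 20), y = rnorm(60))
summary(aov(y ~ condition, data = d))             # 1 df: treated as metric
summary(aov(y ~ as.factor(condition), data = d))  # 2 df: treated as categorical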
Further reading:
Dalgaard, P. (2002). Introductory Statistics with R. Springer.