
1 Experimental and quasi-experimental designs

The primary empirical method in many branches of psychology is the controlled experiment. The two defining properties of an experiment are that the researcher manipulates the conditions under which the outcome is measured, and that the context is controlled as strictly as possible. Under these conditions, the data are more or less appropriately analyzed using t-tests or ANOVA. In contrast to experimental psychology, observational studies play a significant role in more applied fields of research. This brings several complications with respect to the assumptions of experimental research:

First, it is frequently the case that the “independent” variable is not truly manipulated, but only measured. For example, many experiments dealing with aesthetic evaluations of web designs use screenshots of existing websites as their stimuli (Tuch, Bargas-Avila, & Opwis, 2010). With collected stimuli the researcher does not have the same degree of control as with purpose-designed stimuli. The problem arises that the property of interest, say symmetry of the visual layout, may, for whatever reason, be confounded with another property, say font size. Other examples are age and gender, which are often treated as independent variables, but cannot be manipulated in practice.

Second, it is often the case that multiple, correlated variables have an impact on the outcome. Often there are categorical predictors (e.g., an academic degree), as well as metric predictors (e.g., computer experience in years). Such variables, if disregarded, can contribute a fair amount of noise to the data. But, most of all, they reflect the situation of real use, and therefore are crucial.

Third, the effect of one variable often depends on another, which is called an interaction. Interaction effects, if ignored, can severely bias or mask effects, and limit the interpretability of findings. For example, if the “conditional” variable is a property of website users (e.g., the effect of the manipulated variable depends on age), the result can become strongly dependent on the composition of the sample. These situations are beyond what one can reasonably handle with simple statistical procedures such as t-tests, ANOVA, or simple linear regression.

In this chapter, we will first revisit the “usual suspects”, the most frequently used methods of ANOVA and linear regression. We will see how these analyses can be done in R. As it turns out, both methods share a common underlying logic (ANOVA is a special case of linear regression), and it is a small step to combine them in the framework of General Linear Models (GLM).

2 One-way ANOVA

One of the most basic statistical routines is analysis of variance (ANOVA), which compares groups on a metric outcome. In the following example, the research question is

Do two web designs A and B differ in performance?

This question could arise, for instance, during an overhaul of a municipal website. With the emergence of e-government, this website had grown wildly over a decade. With the current navigation structure, users have increasing difficulty finding the information they need. The prototype of a novel web design is developed and tested with 200 users. Every user is given the same five tasks, but sees only one of the two designs. The outcome measure is the time it takes to find the information that is asked for as part of the task (Time on Task, ToT). Below we see part of the data frame and histograms of ToT for designs A (old design) and B (new design) separately.

# look at data:
data_Browsing_AB %>% 
  filter(row_number() <= 10) %>% 
  select(1:6) %>%
  kable()
Subj  education  gender       age  Design        ToT
  23  degree     m       28.41143  A        62.30250
  24  degree     m       28.41143  B        71.14089
  45  no degree  f       38.93923  A       165.00313
  46  no degree  f       38.93923  B       176.26674
  67  degree     f       76.07241  A       105.24776
  68  degree     f       76.07241  B       112.53900
  89  no degree  f       39.65661  A       150.06767
  90  no degree  f       39.65661  B       113.33963
 111  degree     m       22.77343  A       159.78351
 112  degree     m       22.77343  B        71.04381
# plot data
data_Browsing_AB %>%
  ggplot(aes(x = ToT)) +
  geom_histogram(fill = 1, alpha = .5) +
  facet_grid(Design ~ .)

Now, we perform the ANOVA. In R, this is a two-step procedure: 1. The aov command lets you specify a simple formula to express the dependency between a factor (here: Design) and the outcome variable (here: ToT). Note that the formula resembles a regression equation. The results of the ANOVA are stored in an object (ANOVA_Design). 2. The summary command outputs a comprehensive summary of the results of the ANOVA.

# ANOVA analysis
ANOVA_Design <-
  data_Browsing_AB %>% 
  aov(ToT ~ Design, data = .)

# summarize results
summary(ANOVA_Design) 
##              Df Sum Sq Mean Sq F value Pr(>F)  
## Design        1   4558    4558    2.83 0.0941 .
## Residuals   198 318858    1610                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So, judged on its own, the new design does not seem to have a significant overall effect on Time on Task (p = .094).
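By the way, since Design has only two levels, this one-way ANOVA is equivalent to a classical two-sample t-test with pooled variances. As a quick check (same data, no new variables needed):

# equivalent two-sample t-test; the p-value matches the ANOVA table above
t.test(ToT ~ Design, data = data_Browsing_AB, var.equal = TRUE)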

3 Two-factorial ANOVA

Now, let us look at a slightly more complicated situation, where Time on Task (ToT) is predicted by 2 factors: education and design. First, with the ggplot system, we plot the data.

data_Browsing_AB %>%
  ggplot(aes(x = education, y = ToT, fill = Design)) +
  geom_boxplot()
# display group means
data_Browsing_AB %>% aggregate(ToT ~ education + Design, data = ., FUN = mean)
##   education Design       ToT
## 1    degree      A  91.80865
## 2 no degree      A 141.78354
## 3    degree      B  83.54079
## 4 no degree      B 130.95649

It seems that having a degree predicts lower times on task. We can run the two-factorial ANOVA by specifying the predictors education and Design in the model formula and storing the results. We print the results using the summary command.

ANOVA_education_design <- 
  data_Browsing_AB %>% 
  aov(ToT ~ education + Design,  data = .)

summary(ANOVA_education_design) 
##              Df Sum Sq Mean Sq F value Pr(>F)    
## education     1 118562  118562 116.610 <2e-16 ***
## Design        1   4558    4558   4.483 0.0355 *  
## Residuals   197 200297    1017                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this table we can see that Design has an effect on Time on Task, once we control for the effects of education. However, the ANOVA table does not give us any information on the actual direction of the effect.
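To find out the direction, a common follow-up on the fitted aov object is TukeyHSD, which reports the estimated pairwise differences for each factor (here simply B minus A, and no degree minus degree) together with confidence intervals:

# pairwise differences per factor, with 95% confidence intervals
TukeyHSD(ANOVA_education_design)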

4 ANOVA = REGRESSION

It is not a coincidence that the model specification for an ANOVA in R resembles a regression equation. In fact, ANOVA is nothing more than a special case of linear regression. To see this, instead of asking for a summary, ask for the model coefficients of the two-factorial ANOVA.

# print coefficients
coef(ANOVA_education_design) 
##        (Intercept) educationno degree            DesignB 
##          92.448450          48.695291          -9.547452
# print group means
data_Browsing_AB %>% aggregate(ToT ~ education + Design, data = ., FUN = mean)
##   education Design       ToT
## 1    degree      A  91.80865
## 2 no degree      A 141.78354
## 3    degree      B  83.54079
## 4 no degree      B 130.95649

Look at these coefficients: what do you think they mean?

5 Full-factorial ANOVA

As you can see, the coefficients do not exactly reproduce the group means. This is because we have fitted a model with two main effects only, and such a model rarely fits the data exactly; it is a so-called non-saturated model. To see whether the group means can be predicted significantly better, we run a full-factorial model: a saturated model with both main effects and their interaction.

ANOVA_fullfact <- 
  data_Browsing_AB %>% 
  aov(ToT ~ education + Design + education:Design,  data = .)

summary(ANOVA_fullfact) 
##                   Df Sum Sq Mean Sq F value Pr(>F)    
## education          1 118562  118562 116.066 <2e-16 ***
## Design             1   4558    4558   4.462 0.0359 *  
## education:Design   1     82      82   0.080 0.7774    
## Residuals        196 200215    1022                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(ANOVA_fullfact)
##                (Intercept)         educationno degree 
##                  91.808655                  49.974882 
##                    DesignB educationno degree:DesignB 
##                  -8.267861                  -2.559182

With the function coef() we print the coefficients of the full-factorial model. Using these coefficients we can now reconstruct the four group means we saw earlier perfectly:
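For example, adding up the relevant coefficients (using the values printed above) returns exactly the four cell means from the aggregate table:

# reconstructing the four group means from the full-factorial coefficients
b <- coef(ANOVA_fullfact)
b[1]                        # degree,    Design A:  91.81
b[1] + b[2]                 # no degree, Design A: 141.78
b[1] + b[3]                 # degree,    Design B:  83.54
b[1] + b[2] + b[3] + b[4]   # no degree, Design B: 130.96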

It is as if the ANOVA has used dummy coding for the effects of having no degree and of Design B. And that is exactly what is happening: R (and other statistical packages such as SPSS) first chooses how to code the categorical variables as quantitative variables, and then applies linear regression.
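You can inspect the coding R uses under the hood with model.matrix; as a quick sketch, the first rows of the design matrix for the two-main-effects model look like this:

# the design matrix: education and Design recoded as 0/1 dummy variables
data_Browsing_AB %>%
  model.matrix(~ education + Design, data = .) %>%
  head()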

6 Linear Regression

But first, let’s have a look at a simple linear regression with a metric predictor: age. Let’s plot the relationship between age and ToT. We ask ggplot to overlay a smoothing line that lets us assess the overall trend at a glance.

data_Browsing_AB %>%
  ggplot(aes(x = age, y = ToT)) +
  geom_point()+
  geom_smooth()

Overall, ToT seems to increase with age.

Now, to do a linear regression, we use the lm command in much the same way as the aov command.

# linear regression parameter estimation
LM_age <-
  data_Browsing_AB %>% 
  lm(ToT ~ 1 + age, data = .) 

summary(LM_age)
## 
## Call:
## lm(formula = ToT ~ 1 + age, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.737 -28.738  -2.309  27.947  89.035 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  70.8466     7.8210   9.058  < 2e-16 ***
## age           0.8397     0.1500   5.597 7.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.55 on 198 degrees of freedom
## Multiple R-squared:  0.1366, Adjusted R-squared:  0.1323 
## F-statistic: 31.33 on 1 and 198 DF,  p-value: 7.193e-08

The ‘1’ in the formula stands for the ‘intercept’ or ‘grand mean’. So, the linear model suggests that there is a significant linear relationship between age and ToT. To assess how well the model fits the data, it is recommended to check the distribution of the residuals. With lm, we can extract the residuals by computing the predicted values and taking the difference from the observed values.
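As a side note, the fitted values and residuals are also available directly from the fitted model object; below we compute the same quantities by hand (with the opposite sign).

# built-in extractor functions on the fitted lm object
head(predict(LM_age))     # fitted (predicted) values
head(residuals(LM_age))   # observed minus fitted values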

The predict command is called on the output object of LM_age, which reflects the fact that residual analysis is something you can only do after estimation. Below, we also estimate the so-called null model, which characterizes the data by just one grand mean (without the effect of age).

LM_null <-
  data_Browsing_AB %>% 
  lm(ToT ~ 1, data = .)


data_Browsing_AB %>%
  mutate(resid_0   = predict(LM_null) - ToT,
         resid_age = predict(LM_age) - ToT) %>%
  gather(model, residuals, resid_0:resid_age) %>% 
  ggplot(aes(x = residuals)) +
  geom_histogram() +
  facet_grid(model ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We see that with increasing model complexity (from no predictor to one predictor), the variance of the residuals decreases slightly; in other words, age explains part of the variance. The regression output shows that the proportion of variance explained by age is about 14 percent (multiple R² = .137) and statistically significant.
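The same conclusion follows from a formal comparison of the null model and the age model; the F value of this test is the one reported in the regression summary above:

# incremental F-test: does adding age improve on the grand-mean (null) model?
anova(LM_null, LM_age)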

Next, we carry out a linear regression analysis with ToT as the dependent variable and the factors education and Design as qualitative predictors, and we also include their interaction (the product of the two dummy variables). For that we use the function lm.

LM_fullfact <- 
  data_Browsing_AB %>%
  lm(ToT ~ 1 + education + Design + education:Design, data = .)

summary(LM_fullfact)
## 
## Call:
## lm(formula = ToT ~ 1 + education + Design + education:Design, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.389 -22.628  -0.038  23.251  80.299 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  91.809      4.520  20.312  < 2e-16 ***
## educationno degree           49.975      6.392   7.818 3.22e-13 ***
## DesignB                      -8.268      6.392  -1.293    0.197    
## educationno degree:DesignB   -2.559      9.040  -0.283    0.777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.96 on 196 degrees of freedom
## Multiple R-squared:  0.3809, Adjusted R-squared:  0.3715 
## F-statistic:  40.2 on 3 and 196 DF,  p-value: < 2.2e-16

We then print the regression coefficients of both models and see that the LM and ANOVA coefficients are exactly the same:

coef(LM_fullfact)
##                (Intercept)         educationno degree 
##                  91.808655                  49.974882 
##                    DesignB educationno degree:DesignB 
##                  -8.267861                  -2.559182
coef(ANOVA_fullfact)
##                (Intercept)         educationno degree 
##                  91.808655                  49.974882 
##                    DesignB educationno degree:DesignB 
##                  -8.267861                  -2.559182

This is no surprise. Actually, the aov function in R is a ‘wrapper’ for the lm function: when you use aov, R internally makes lm do the work. It is only different in how it displays results: as regression coefficients or mean sums of squares.
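A quick way to convince yourself, using only the two objects we already fitted:

# an aov object is also an lm object, and both views are available
class(ANOVA_fullfact)        # "aov" "lm"
anova(LM_fullfact)           # ANOVA-style table from the lm fit
summary.lm(ANOVA_fullfact)   # regression-style table from the aov fit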

In other words, what an analysis of variance actually does (or is equivalent to) is recoding categorical factors into quantitative variables and applying linear regression. In R and many other statistical packages, the standard approach is dummy coding (1/0), as we've seen here. Realizing this, it is easy to see that the lm function is much more general than the aov function: we can use both categorical and metric predictors in one general framework. If a predictor is numeric, it is analyzed as quantitative (with 1 degree of freedom for a main effect); if it is a factor, it is analyzed as a dummy-coded variable (with k − 1 degrees of freedom for a factor with k observed levels).

So if we want to estimate a model with age (quantitative) and education (qualitative), we simply use the original factor variable education.

7 Combining qualitative and quantitative predictors

Let us estimate three models with an increasing number of predictors: first only age (M1), then adding education (M2), then adding gender (M3), and finally plot the residuals of the three models:

M1 <- 
  data_Browsing_AB %>%
  lm(ToT ~ 1 + age, data = .)

M2 <- 
  data_Browsing_AB %>%
  lm(ToT ~ 1 + age + education, data = .)

M3 <- 
  data_Browsing_AB %>%
  lm(ToT ~ 1 + age + education + gender, data = .)

data_Browsing_AB %>%
  mutate(resid_M1 = predict(M1) - ToT,
         resid_M2 = predict(M2) - ToT,
         resid_M3 = predict(M3) - ToT) %>%
  gather(model, residuals, c(resid_M1, resid_M2, resid_M3)) %>% 
  ggplot(aes(x = residuals)) +
  geom_histogram() +
  facet_grid(model ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
coef(M3)
##        (Intercept)                age educationno degree 
##         48.1706948          0.8137757         48.1292140 
##            genderm 
##         -0.2343352

So we see that adding education to the model results in smaller residual variance, but adding gender does not lead to a clear change in the distribution of the residuals. Be careful, though, when mixing quantitative and qualitative variables: always check the degrees of freedom to make sure a categorical variable is not analysed as a quantitative variable by mistake (important when it has more than 2 levels).
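A quick way to check is str(): numeric columns will be treated as quantitative, factor columns get dummy coding. Subj, for instance, is printed as a number above; if we ever wanted to use it as a grouping factor, it would have to be converted first. A small sketch (nothing in the models above changes):

# inspect how lm would treat each column: num -> quantitative, Factor -> dummies
str(data_Browsing_AB)

# convert a numerically coded grouping variable before using it as a predictor
data_Browsing_AB %>%
  mutate(Subj = factor(Subj)) %>%
  str()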

8 Higher-order interaction effects

But, linear models can do more than just adding main effects: you can arbitrarily mix any number of factors and covariates and look at interactions. Let’s see what this does to our small example, where we look at the relationship between ToT, age and education:

data_Browsing_AB %>%
  ggplot(aes(x = age, y = ToT, col = education)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

The estimated regression line for age seems steeper for people without a degree than for people with a degree. Here we call lm once again, combining all predictors with ‘+’, but now also adding an age by education interaction effect.

M4 <- 
  data_Browsing_AB %>%
  lm(ToT ~ 1 + age + education + gender + age:education, data = .)

summary(M4)$coefficients
##                          Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)            64.7074992  8.2611533 7.8327440 2.998737e-13
## age                     0.4594517  0.1543047 2.9775613 3.274006e-03
## educationno degree     10.9934401 11.7672066 0.9342438 3.513336e-01
## genderm                 1.0525993  4.0220810 0.2617051 7.938250e-01
## age:educationno degree  0.7595516  0.2263455 3.3557183 9.514209e-04

We see that the coefficient for age:educationno degree is positive (0.76), so the slope of age is estimated to be 0.76 larger for people without a degree than for people with a degree.
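In other words, the implied age slopes under M4 are (values taken from the output above):

# age slope for participants with vs. without a degree under model M4
b4 <- coef(M4)
b4["age"]                                  # with a degree: about 0.46
b4["age"] + b4["age:educationno degree"]   # without a degree: about 1.22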

If you want the full-factorial model with age, education, gender and Design, with all main effects and all possible two-way, three-way and four-way interaction effects, this can be specified as:

M_ridiculous <- 
  data_Browsing_AB %>%
  lm(ToT ~ 1 + (age + education + gender + Design)^4, data = .)

summary(M_ridiculous)$coefficients[, c(1, 4)]
##                                           Estimate     Pr(>|t|)
## (Intercept)                             93.9001070 1.870167e-07
## age                                      0.0525118 8.785377e-01
## educationno degree                     -15.0197860 5.224265e-01
## genderm                                 -2.5760946 9.081278e-01
## DesignB                                -68.5707862 5.694932e-03
## age:educationno degree                   1.1445826 1.156574e-02
## age:genderm                             -0.1155961 7.896880e-01
## age:DesignB                              1.1619846 1.764558e-02
## educationno degree:genderm              46.2295763 1.565939e-01
## educationno degree:DesignB              39.4811746 2.351377e-01
## genderm:DesignB                         21.3415639 4.992897e-01
## age:educationno degree:genderm          -0.6400597 3.116495e-01
## age:educationno degree:DesignB          -0.7782987 2.216082e-01
## age:genderm:DesignB                     -0.2971787 6.278587e-01
## educationno degree:genderm:DesignB     -58.9175425 2.015040e-01
## age:educationno degree:genderm:DesignB   1.0678506 2.328924e-01

Of course, it is a bit ridiculous to estimate a model with 16 coefficients (an intercept plus 15 effects) from only 200 participants. In addition, we have to be very careful interpreting the significance of a k-way interaction effect when interaction effects of an order higher than k are in the model (why?). Here we see that the four-way effect is not significant, so we may want to drop it from the model. We can then rerun the analysis with only two- and three-way interaction effects.

M_lessridiculous <- 
  data_Browsing_AB %>%
  lm(ToT ~ 1 + (age + education + gender + Design)^3, data = .)

kable(summary(M_lessridiculous)$coefficients)
                                       Estimate  Std. Error     t value   Pr(>|t|)
(Intercept)                          86.3734250  16.1688182   5.3419752  0.0000003
age                                   0.2104618   0.3171120   0.6636830  0.5077195
educationno degree                   -1.7121244  20.6567655  -0.0828844  0.9340330
genderm                               9.5722892  19.8707894   0.4817267  0.6305695
DesignB                             -53.5174222  21.0614356  -2.5410149  0.0118742
age:educationno degree                0.8744501   0.3883014   2.2519878  0.0254985
age:genderm                          -0.3668520   0.3788936  -0.9682190  0.3341991
age:DesignB                           0.8460845   0.4076856   2.0753358  0.0393388
educationno degree:genderm           20.3151378  24.2663373   0.8371736  0.4035751
educationno degree:DesignB           12.8658513  24.6085125   0.5228212  0.6017243
genderm:DesignB                      -2.9552037  24.1488853  -0.1223743  0.9027354
age:educationno degree:genderm       -0.1061344   0.4466181  -0.2376401  0.8124233
age:educationno degree:DesignB       -0.2380336   0.4465867  -0.5330065  0.5946687
age:genderm:DesignB                   0.2053332   0.4458445   0.4605488  0.6456632
educationno degree:genderm:DesignB   -7.0886655  15.4250521  -0.4595554  0.6463750

Since all three-way interactions are not significant, we limit ourselves to two-way interactions:

M_reasonable <- 
  data_Browsing_AB %>%
  lm(ToT ~ 1 + (age + education + gender + Design)^2, data = .)

kable(summary(M_reasonable)$coefficients)
                              Estimate  Std. Error     t value   Pr(>|t|)
(Intercept)                 84.0830023  12.1986269   6.8928251  0.0000000
age                          0.2393769   0.2287171   1.0466071  0.2966175
educationno degree           8.7321025  12.7812981   0.6831937  0.4953211
genderm                      8.5360261  12.5649195   0.6793538  0.4977448
DesignB                    -51.9289037  12.6079874  -4.1187306  0.0000569
age:educationno degree       0.7017360   0.2214434   3.1689178  0.0017848
age:genderm                 -0.3141304   0.2210754  -1.4209196  0.1569879
age:DesignB                  0.8510494   0.2139649   3.9775186  0.0000991
educationno degree:genderm  11.6194996   7.6486306   1.5191608  0.1303931
educationno degree:DesignB  -2.7319742   7.6065678  -0.3591599  0.7198764
genderm:DesignB              3.9470652   7.6133305   0.5184413  0.6047568
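As a sketch, the step from the saturated interaction structure down to two-way interactions can also be tested formally with nested model comparisons, using the three models we have already fitted:

# incremental F-tests: two-way only vs. adding three-way vs. adding the four-way
anova(M_reasonable, M_lessridiculous, M_ridiculous)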

On the basis of the model coefficients we can then infer that, controlling for all other effects in the model, the effect of age on ToT is larger (i.e., more positive or less negative) for participants who worked with Design B than for those who worked with Design A. Or, put differently, the difference in ToT between Design A and Design B was more positive (or less negative) for older participants than for younger participants. Let's visualize this pattern as predicted by the model with only two-way interactions, and compare it with the observed data pattern.

First the raw, observed data, with fitted regression lines for four different subgroups:

data_Browsing_AB %>%
  ggplot(aes(x = age, y = ToT, shape = education, col = Design)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, size = 1)

Now for every participant, the predicted ToT, based on the model with main effects and two-way interaction effects:

data_Browsing_AB %>%
  mutate(pred_Mfull2 = predict(M_reasonable)) %>%
  ggplot(aes(x = age, y = pred_Mfull2, shape = education, col = Design)) +
  geom_point(size = 3)

So, with the effects of education and gender also in the model, together with the two-way interactions, the relationship between ToT, age, education and Design is, according to this model, a rather complex one. For people with and without a degree, and for males and females alike, the effect of Design is larger for young participants than for old participants. Also, in both designs the overall effect of age on ToT is larger for participants without a degree than for participants with a degree.

Concluding, based on the model we can say that design B leads to shorter search times in mainly young users (age < 40). In older users, the effect seems negligible.

Note that an interpretation such as this is hard to make based on the statistical output alone. Generally, interaction plots are great for understanding statistical interactions.
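One way to build such an interaction plot from the fitted model is to predict over a grid of ages for each combination of the factors. A minimal sketch (pred_grid is a name introduced here; the age range and step size are arbitrary choices):

# predicted ToT across age for each Design, split by education and gender
pred_grid <- expand.grid(age       = seq(20, 80, by = 5),
                         education = c("degree", "no degree"),
                         gender    = c("f", "m"),
                         Design    = c("A", "B"))
pred_grid$ToT_pred <- predict(M_reasonable, newdata = pred_grid)

pred_grid %>%
  ggplot(aes(x = age, y = ToT_pred, col = Design, linetype = education)) +
  geom_line() +
  facet_grid(gender ~ .)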

Concluding remarks:

Further reading: