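After the experiment seemed to fail with users recruited from Facebook, I tried to diagnose the situation. Given that the stimuli looked as expected during pretesting (details below) but not in the experiment, there were three possibilities: (1) the pretest analysis itself was off, (2) the Facebook-recruited population behaves differently from the AMT group used in pretesting, or (3) the within-subjects design of the pretest made the manipulation look stronger than it is between subjects.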
I decided to take these sequentially. First I checked the pretest analysis:
a0 <- aov(stereotypicality ~ gender*condition, data=pre)
summary(a0)
##                  Df Sum Sq Mean Sq F value Pr(>F)
## gender            1   0.02    0.02   0.018  0.893
## condition         1 236.02  236.02 178.586 <2e-16 ***
## gender:condition  1   2.80    2.80   2.119  0.151
## Residuals        56  74.01    1.32
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
a1 <- aov(masculinity ~ gender*condition, data=pre)
summary(a1)
##                  Df Sum Sq Mean Sq F value  Pr(>F)
## gender            1   0.41    0.41   0.494   0.485
## condition         1  56.07   56.07  67.925 3.1e-11 ***
## gender:condition  1   1.04    1.04   1.255   0.267
## Residuals        56  46.22    0.83
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the pretest results, condition had a clear and strong effect: the stereotypical condition was rated as significantly more stereotypical (F(1, 56) = 178.6, p < 2e-16) and more masculine (F(1, 56) = 67.9, p = 3.1e-11) than the non-stereotypical one. Phew; according to pretesting, everything looks fine.
(Notably, though, the pre-test was within-subjects, so we would expect the eventual between-subjects survey to show weaker results.)
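Because each pretest rater saw both versions, a model with a per-rater error stratum would respect that pairing, which the plain `aov` calls above do not. The following is only a sketch on simulated toy data: the `participant` column is a hypothetical identifier (the real ID column in `pre` is not shown here), the number of raters is made up, and the simulated ratings are not the pretest results.

```r
# Toy data only: 30 hypothetical raters, each rating both page versions.
set.seed(1)
n <- 30
toy <- data.frame(
  participant = factor(rep(seq_len(n), each = 2)),
  gender      = factor(rep(rep(c("female", "male"), length.out = n), each = 2)),
  condition   = factor(rep(c("non-stereotypical", "stereotypical"), times = n))
)

# Simulated ratings: a shared per-rater offset plus a shift for the
# stereotypical condition. These numbers are invented for illustration.
rater_offset <- rnorm(n)[as.integer(toy$participant)]
toy$stereotypicality <- 3 + rater_offset +
  2 * (toy$condition == "stereotypical") +
  rnorm(nrow(toy), sd = 0.5)

# condition varies within participant; gender varies between participants.
# Error(participant / condition) tests each effect against the matching stratum.
fit_within <- aov(
  stereotypicality ~ gender * condition + Error(participant / condition),
  data = toy
)
summary(fit_within)
```

With the real data, the same formula would apply once `participant` is replaced by whatever column identifies each rater.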
If the population recruited from Facebook was behaving differently from the AMT group that was used in pretesting, that would explain our initial results. I decided to run the study with AMT participants instead of FB-recruited participants, beginning only with female participants, since that’s the main population where we expect to see the difference.
In the FB recruiting, I had collected responses from 100 participants of both genders, so I recruited 50 female participants, ages 18-25, from AMT. In the AMT participants’ ratings of the webpages, the manipulation is more apparent than it was in the FB sample (though weaker than in the pretest, as expected).
(Note: Strangely, the stereotypical site isn’t being rated as much more stereotypical than the other, but it is being rated as more masculine… I would expect those two to correlate strongly, and this same population pre-tested the site as very stereotypical, so that’s a bit unusual. This may be because both sites have the same content, and that’s making both seem moderately Computer Science-y?)
a0 <- aov(stereotypical_all ~ condition, data=ambe)
summary(a0)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   3.25   3.246   1.804  0.184
## Residuals   59 106.16   1.799
a1 <- aov(gender_web_all ~ condition, data=ambe)
summary(a1)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   3.24   3.236   4.056 0.0486 *
## Residuals   59  47.07   0.798
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
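To put a rough number on how much weaker the manipulation check is in the AMT survey than in the pretest, one option is partial eta squared, computed as SS_effect / (SS_effect + SS_residual). The sketch below uses only the sums of squares printed in the ANOVA tables above; the helper name `partial_eta_sq` is my own, and the pretest values come from a model that ignores the within-subjects pairing, so treat the comparison as approximate.

```r
# Partial eta squared from an effect's sum of squares and the residual SS.
partial_eta_sq <- function(ss_effect, ss_resid) {
  ss_effect / (ss_effect + ss_resid)
}

# Sums of squares for `condition`, copied from the tables above.
effect_sizes <- c(
  pretest_stereotypicality = partial_eta_sq(236.02, 74.01),
  pretest_masculinity      = partial_eta_sq(56.07, 46.22),
  amt_stereotypical_all    = partial_eta_sq(3.246, 106.16),
  amt_gender_web_all       = partial_eta_sq(3.236, 47.07)
)

round(effect_sizes, 3)
```

This gives roughly 0.76 and 0.55 for the pretest ratings versus roughly 0.03 and 0.06 for the AMT survey ratings.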
Among these female-only participants, those shown the stereotypical webpage are marginally less likely to want to enroll in the course (F(1, 59) = 3.82, p = 0.055, just short of the 0.05 threshold). Feeling of belonging (p = 0.159) and anticipated success in the class (p = 0.205) do not reach significance.
Aside from these main DVs, participants in the stereotypical condition report significantly more gender anxiety (F(1, 59) = 5.30, p = 0.025). Several other variables show weaker, non-significant differences: how fun the course seems (p = 0.177), self-confidence (p = 0.205), and long-term interest in studying CS (p = 0.12).
a0 <- aov(enroll_intent_all ~ condition, data=ambe)
summary(a0)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1  10.23  10.233   3.822 0.0553 .
## Residuals   59 157.96   2.677
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(enroll_intent_all ~ condition, main="Survey: Enroll Intent by Condition (AMT)", data=ambe)
a1 <- aov(am_be_all ~ condition, data=ambe)
summary(a1)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   5.34   5.337    2.03  0.159
## Residuals   59 155.10   2.629
plot(am_be_all ~ condition, main="Survey: Ambient Belonging by Condition (AMT)", data=ambe)
a2 <- aov(anticip_success_all ~ condition, data=ambe)
summary(a2)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   2.46   2.458   1.642  0.205
## Residuals   59  88.35   1.497
plot(anticip_success_all ~ condition, main="Survey: Anticipated Success by Condition (AMT)", data=ambe)
a3 <- aov(fun_all ~ condition, data=ambe)
summary(a3)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   4.49   4.491   1.863  0.177
## Residuals   59 142.25   2.411
a4 <- aov(cs_talent_all ~ condition, data=ambe)
summary(a4)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   0.24  0.2445   0.217  0.643
## Residuals   59  66.45  1.1262
a5 <- aov(self_conf_all ~ condition, data=ambe)
summary(a5)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   4.23   4.227   1.646  0.205
## Residuals   59 151.54   2.568
a6 <- aov(long_term_all ~ condition, data=ambe)
summary(a6)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   7.94   7.945   2.494   0.12
## Residuals   59 187.92   3.185
a7 <- aov(stereotypical_all ~ condition, data=ambe)
summary(a7)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   3.25   3.246   1.804  0.184
## Residuals   59 106.16   1.799
a8 <- aov(gender_web_all ~ condition, data=ambe)
summary(a8)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   3.24   3.236   4.056 0.0486 *
## Residuals   59  47.07   0.798
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
a9 <- aov(gender_anxi_all ~ condition, data=ambe)
summary(a9)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1  15.08  15.082     5.3 0.0249 *
## Residuals   59 167.89   2.846
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
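Since eight outcome variables are each tested separately here, it may be worth seeing how the p-values look after a multiple-comparison adjustment. The sketch below applies a Holm correction with base R's `p.adjust` to the p-values printed above (the two manipulation-check ratings are left out). Whether to correct at all, and treating these eight as one family, are my assumptions rather than part of the original analysis plan.

```r
# Unadjusted p-values for `condition`, copied from the ANOVA tables above.
p_raw <- c(
  enroll_intent_all   = 0.0553,
  am_be_all           = 0.159,
  anticip_success_all = 0.205,
  fun_all             = 0.177,
  cs_talent_all       = 0.643,
  self_conf_all       = 0.205,
  long_term_all       = 0.12,
  gender_anxi_all     = 0.0249
)

# Holm's step-down adjustment controls the family-wise error rate.
p_holm <- p.adjust(p_raw, method = "holm")

round(sort(p_holm), 3)
```

Under this adjustment the smallest value (gender anxiety) becomes 8 × 0.0249 ≈ 0.199, so the individual tests above are best read as exploratory until more participants are in.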
I haven’t run the study with a within-subjects design, but I want to address that possibility here. I think it plays a huge role: showing both interfaces would make the manipulation much stronger. However, that setup seems less experimentally valid to me, so I’m sticking with between-subjects.
There are a couple of directions we can go from here:
1. Run more female participants. The effect of a user interface, which is much less immersive than a physical classroom, is weaker than that of the physical classroom and also weaker than that of the within-subjects design of the virtual classroom study, so it might make sense to run some more participants. We have ~55 female participants now, and could go for ~60 or ~70.
2. Run male participants. I’ve left them out until now, since we expect that they should not display these same trends. If we wanted to more closely match the original study, though, we could run more. At the moment I’ve only run 7 males through the experiment (before deciding only to look at female participants), and already just that number lessens every single effect according to the ANOVAs above. Collecting males would allow us to report the two-way ANOVAs as done in the virtual classrooms paper. On the other hand, that population is less critical for our main argument, which is that online interfaces impact users psychologically in measurable ways that have concrete downstream effects on behavior.
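To get a rough sense of how many more female participants might be needed, one option is a power calculation based on the observed enroll-intent effect. The sketch below converts the sums of squares from the `enroll_intent_all` table into Cohen's f, then into a standardized mean difference (d = 2f, which assumes two equal-sized conditions), and passes that to base R's `power.t.test`. Effect sizes estimated from a small sample are noisy and tend to be optimistic, so the result is a ballpark figure only; the 80% power target is my assumption.

```r
# Sums of squares for enroll_intent_all ~ condition, from the table above.
ss_condition <- 10.233
ss_resid     <- 157.96

# Cohen's f from the ratio of effect to residual sums of squares.
cohens_f <- sqrt(ss_condition / ss_resid)

# For two equal-sized groups, the standardized mean difference is d = 2f.
cohens_d <- 2 * cohens_f

# Solve for the per-condition sample size at 80% power, two-sided alpha 0.05.
needed <- power.t.test(delta = cohens_d, sd = 1,
                       sig.level = 0.05, power = 0.80)
n_per_condition <- ceiling(needed$n)
n_per_condition
```

Under these assumptions the answer is on the order of 60 participants per condition, noticeably more than the ~60 or ~70 total considered above.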