Context

After the experiment seemed to fail with users recruited from Facebook, I tried to diagnose the situation. Given that the stimuli looked as expected during pretesting (details below) but not in the experiment, there were three possibilities:

  1. Maybe I didn’t analyze the pretest correctly, and the stimuli had been flawed to begin with.
  2. Maybe the population I ran the survey on was substantially different from the pretest population. The pretest was done on MTurk with 18- to 25-year-olds, whereas the study participants were recruited from Facebook (a group with which I had already had a lot of quality problems).
  3. Maybe it was a weakness of the experimental design. The second paper (Cheryan 2011), which used virtual classroom environments, was within-subjects (which to me seems less face-valid but strengthens the manipulation). When I discussed our experiment with the original study's author, she suggested within-subjects as a way to strengthen the design.

I decided to take these sequentially. First I checked the pretest analysis:

Pretest

# Pretest: perceived stereotypicality of the site, by rater gender and condition
a0 <- aov(stereotypicality ~ gender*condition, data=pre)
summary(a0)
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## gender            1   0.02    0.02   0.018  0.893    
## condition         1 236.02  236.02 178.586 <2e-16 ***
## gender:condition  1   2.80    2.80   2.119  0.151    
## Residuals        56  74.01    1.32                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Pretest: perceived masculinity of the site, by rater gender and condition
a1 <- aov(masculinity ~ gender*condition, data=pre)
summary(a1)
##                  Df Sum Sq Mean Sq F value  Pr(>F)    
## gender            1   0.41    0.41   0.494   0.485    
## condition         1  56.07   56.07  67.925 3.1e-11 ***
## gender:condition  1   1.04    1.04   1.255   0.267    
## Residuals        56  46.22    0.83                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the pretest results, condition had a clear, strong effect: the stereotypical site was rated as significantly more stereotypical and more masculine than the non-stereotypical one. Phew; according to pretesting, the stimuli look fine.
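
To eyeball the direction and size of those differences, the condition means can be pulled out directly; a quick sketch over the same pre data frame:

# Mean pretest ratings of stereotypicality and masculinity, by condition
aggregate(cbind(stereotypicality, masculinity) ~ condition, data=pre, FUN=mean)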

(Notably, though, the pretest was within-subjects, so we would expect the eventual between-subjects survey to show weaker results.)

Using the AMT Population

Manipulation Check

If the population recruited from Facebook behaved differently from the AMT group used in pretesting, that would explain our initial results. I decided to rerun the study with AMT participants instead of FB-recruited participants, starting with female participants only, since that's the population where we primarily expect to see the difference.

In the FB recruiting, I had collected responses from 100 participants of both genders, so I recruited 50 female participants, ages 18-25, from AMT. Looking at the AMT participants' ratings of the webpages, the manipulation is more apparent than it was in the FB sample (though weaker than in the pretest, as expected).
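
For reference, how the AMT sample splits across the two conditions can be checked with a one-liner on the ambe data frame used in the analyses below:

# Number of AMT participants assigned to each condition
table(ambe$condition)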

(Note: strangely, the stereotypical site isn't rated as much more stereotypical than the other, but it is rated as more masculine. I would expect those two ratings to correlate strongly, and this same population rated the site as very stereotypical in the pretest, so that's a bit unusual. It may be that because both sites share the same content, both come across as moderately Computer Science-y.)
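
As a quick check on that intuition, the two ratings can be correlated directly; this is a one-liner over the same ambe columns used in the ANOVAs below:

# Do stereotypicality and perceived-masculinity ratings of the site move together?
cor.test(ambe$stereotypical_all, ambe$gender_web_all)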

# Manipulation check on the AMT sample: stereotypicality rating by condition
a0 <- aov(stereotypical_all ~ condition, data=ambe)
summary(a0)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   3.25   3.246   1.804  0.184
## Residuals   59 106.16   1.799
# Manipulation check on the AMT sample: perceived masculinity of the site by condition
a1 <- aov(gender_web_all ~ condition, data=ambe)
summary(a1)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## condition    1   3.24   3.236   4.056 0.0486 *
## Residuals   59  47.07   0.798                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results

We find that, among these female-only participants, intent to enroll in the course is lower in the stereotypical-webpage condition, falling just short of conventional significance (p = .055), with feeling of belonging and anticipated success in the class showing weaker, non-significant trends in the same direction.

Aside from these main DVs, participants in the stereotypical condition also report significantly more gender anxiety (p = .025), and most of the other variables trend the same way without reaching significance: how fun the course seems, self-confidence, and long-term interest in studying CS.

# Main DV: intent to enroll in the course, by condition
a0 <- aov(enroll_intent_all ~ condition, data=ambe)
summary(a0)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## condition    1  10.23  10.233   3.822 0.0553 .
## Residuals   59 157.96   2.677                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(enroll_intent_all ~ condition, main="Survey: Enroll Intent by Condition (AMT)", data=ambe)

a1 <- aov(am_be_all ~ condition, data=ambe)
summary(a1)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   5.34   5.337    2.03  0.159
## Residuals   59 155.10   2.629
plot(am_be_all ~ condition, main="Survey: Ambient Belonging by Condition (AMT)", data=ambe)

a2 <- aov(anticip_success_all ~ condition, data=ambe)
summary(a2)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   2.46   2.458   1.642  0.205
## Residuals   59  88.35   1.497
plot(anticip_success_all ~ condition, main="Survey: Anticipated Success by Condition (AMT)", data=ambe)

a3 <- aov(fun_all ~ condition, data=ambe)
summary(a3)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   4.49   4.491   1.863  0.177
## Residuals   59 142.25   2.411
a4 <- aov(cs_talent_all ~ condition, data=ambe)
summary(a4)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   0.24  0.2445   0.217  0.643
## Residuals   59  66.45  1.1262
a5 <- aov(self_conf_all ~ condition, data=ambe)
summary(a5)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   4.23   4.227   1.646  0.205
## Residuals   59 151.54   2.568
a6 <- aov(long_term_all ~ condition, data=ambe)
summary(a6)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   7.94   7.945   2.494   0.12
## Residuals   59 187.92   3.185
a7 <- aov(stereotypical_all ~ condition, data=ambe)
summary(a7)
##             Df Sum Sq Mean Sq F value Pr(>F)
## condition    1   3.25   3.246   1.804  0.184
## Residuals   59 106.16   1.799
a8 <- aov(gender_web_all ~ condition, data=ambe)
summary(a8)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## condition    1   3.24   3.236   4.056 0.0486 *
## Residuals   59  47.07   0.798                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
a9 <- aov(gender_anxi_all ~ condition, data=ambe)
summary(a9)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## condition    1  15.08  15.082     5.3 0.0249 *
## Residuals   59 167.89   2.846                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
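
For a compact view of all these DVs at once, the condition p-values can be pulled out in one pass; this simply re-runs the one-way ANOVAs above:

# Re-run each one-way ANOVA and extract the p-value for the condition effect
dvs <- c("enroll_intent_all", "am_be_all", "anticip_success_all", "fun_all",
         "cs_talent_all", "self_conf_all", "long_term_all",
         "stereotypical_all", "gender_web_all", "gender_anxi_all")
sapply(dvs, function(dv)
  summary(aov(reformulate("condition", dv), data=ambe))[[1]][["Pr(>F)"]][1])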

A Note on Experimental Design

I haven't run the study with the within-subjects design, but I want to address it here: I think this choice plays a huge role, since showing participants both interfaces would make the manipulation much stronger. However, that setup seems less experimentally valid to me (seeing both versions makes the manipulation fairly obvious to participants), so I'm sticking with between-subjects.
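
For reference, the within-subjects version would have each participant rate both sites, and the analysis would be a paired comparison rather than the one-way ANOVAs above. A minimal sketch, using a hypothetical data frame ws with columns enroll_intent_stereo and enroll_intent_neutral holding each participant's two ratings:

# Hypothetical within-subjects comparison: each participant sees both versions,
# so their two enrollment-intent ratings are compared directly (paired test)
t.test(ws$enroll_intent_stereo, ws$enroll_intent_neutral, paired=TRUE)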

Next Steps

There are a couple of directions we can go from here:

  1. Run more female participants. A user interface is much less immersive than a physical classroom, and our design is between-subjects rather than the within-subjects design of the virtual-classroom study, so the effect should be weaker; it may make sense to run more participants. We have ~55 female participants now, and could go for ~60 or ~70.
  2. Run male participants. I've left them out until now, since we expect that they should not display these same trends. If we wanted to match the original study more closely, though, we could run more. At the moment I've only run 7 males through the experiment (before deciding to look only at female participants), and adding even that small number weakens every single effect in the ANOVAs above. Collecting male participants would, however, let us report the two-way ANOVAs as in the virtual-classrooms paper (a sketch follows below). On the other hand, that population is less critical for our main argument, which is that online interfaces affect users psychologically in measurable ways that have concrete downstream effects on behavior.
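
If we do collect a fuller male sample, the two-way analysis would mirror the pretest models above, crossing participant gender with condition; the prediction of interest is the gender-by-condition interaction. A minimal sketch, assuming the combined AMT dataset keeps a participant-gender column the way the pretest data (pre) did:

# Two-way ANOVA in the style of the virtual-classrooms paper: main effects of
# participant gender and condition, plus their interaction (the key prediction)
a10 <- aov(enroll_intent_all ~ gender*condition, data=ambe)
summary(a10)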