1 Introduction

In their 2015 APS paper, Sofer et al hypothesized that face typicality affects trustworthiness judgements. Typical faces of varying attractiveness were created from the experiment’s sampled population. Subjects, female students from Hebrew University, were asked to rate these faces on trustworthiness or attractiveness over a 3-week period. They show that for a continuum of faces that vary on a typicality-attractiveness dimension, trustworthiness judgements peak around the typical face. In this replication, we will replicate figure 2 from the original paper.

2 Methods

2.1 Power Analysis

Using the effect size \(f^2 = 0.15\) with \(\alpha = 0.05\), we did a post-hoc power analysis on the original sample size of 24 participants. To get 80% power, we need 43 participants in each condition, while 90% requires 55 participants and 95% power requires 70% participants.

2.2 Planned Sample

The planned sample would be US-based mTurkers. We will not be able to replicate the exclusion criteria and restrict the locale to US participants of all genders and all ages. We will collect light demographic information to analyze this difference.

2.3 Materials

The composite images and questionnaire used in the study is available at https://osf.io/8gwnb/.

2.3.1 Composite images

The stimuli is a series of faces, bookended by a typical face and an attractive composite face with 9 transformations between them (with increments of 10% in the difference of shape and reflectance) resulting in 11 faces. “The typical face was developed by a digital averaging process of 92 faces that were representative of the experiment’s sampled population. Participants whose images were used varied in age from 23 to 31 years old.” “The attractive composite face resulted from digitally averaging the 12 most attractive female faces in Winston, O’Doherty, Kilner, Perrett and Dolan’s (2007) face set.”

2.3.2 Conditions

Between-subjects: 2 conditions, either judging Attractiveness or Trustworthiness.
Within-subjects: 11 stimuli, each judged 3 times per subject.

2.4 Procedure

Participants were randomly assigned into two groups. The first group judged trustworthiness, the other attractiveness.

They first read the instructions for their condition and give their consent:

By answering the following questions, you are participating in a study being performed by the Stanford Department of Psychology. If you have questions about this research, please contact us at stanfordpysch254@gmail.com. You must be at least 18 years old to participate. Your participation in this research is voluntary. You may decline further participation, at any time, without adverse consequences. Your anonymity is assured; the researchers who have requested your participation will not receive any personal information about you. Note however that we have recently been made aware that your public Amazon.com profile can be accessed via your worker ID if you do not choose to opt out. If you would like to opt out of this feature, you may follow instructions available here.

Next, they read the following text (Introduction screen 1):

For Attractive condition

Many times we meet strangers for the first time and form our first impression based on their faces. There are faces that look attractive and there are faces that look unattractive.

In this experiment different faces will be presented to you and you will be required to judge their attractiveness level. Judgment will be done using a 1 (Definitely not attractive) to 9 (Definitely attractive) scale.

There are no right or wrong answers and you are required to answer as soon as you can, after the face is presented.

For Trustworthiness condition

In this experiment you will be asked to record your first impression of faces. Many times we meet strangers for the first time and form our first impression based on their faces. There are faces that look trustworthy and there are faces that look untrustworthy.

In this experiment different faces will be presented to you and you will be required to judge their trustworthiness level. Judgment will be done using a 1 (Definitely not trustworthy) to 9 (Definitely trustworthy) scale.

There are no right or wrong answers and you are required to answer as soon as you can, after the face is presented.

Followed by (Introduction screen 2): > Please read the question again in every new screen and mark your answer for each face before clicking “Next”. You will not be able to go back to previous screens.

On each screen, a single face is shown and a Likert-scale from 1 through 9 is shown for either trustworthiness or attractiveness. Each participant is shown each face three times, randomized.

Finally, a debriefing questionnaire (opt-out) is shown: 1. What is your age? 2. What is your gender identity? 3. What is your highest level of education completed? 4. What is your race? Please check all that apply. 5. What in your opinion was the goal of the experiment? 6. If you have any comment about the experiment, please write it here

Separately, we will record the browser’s user agent and window size programmatically.

3 Analysis Plan

We would like to replicate Figure 2 from the original paper.

The judgements used for each face are averaged per participant, using the method from Cousineau 2005.

To mirror the analysis from the original study, we will do the following analysis:

Regression analysis where we predict the judgements using DFT (distance from typical face), DFT-squared and judgement type (trustworthiness = 1,attractiveness = 0), and their interactions.
Repeat measures ANOVA with DFT as a repeated measure and judgement type as a between-subjects factor.

4 Differences from Original Study

The first major difference is migrating the experiment to Mechanical Turk. We will not be able to replicate the original sample demographic which was rather specific. This might affect our results since the typical face was composed to reflect that demographic.

5 Results

(from a small pilot study)

5.1 Data preparation

Load the data.

path <- "~/code/aucProject/data/"
files <- dir(paste0(path,"sandbox-results/"), pattern = "*.json")
d.raw <- data.frame()
d.balance <- data.frame()

for (f in files) {
  jf <- paste0(path, "sandbox-results/", f)
  jd <- fromJSON(paste(readLines(jf), collapse=""))
  answers <- jd$answers$data
  race <- paste(answers$race, collapse="+")
  id <- data.frame(workerid = jd$WorkerId,
                   race = race,
                   age = answers$age,
                   face_dft = as.numeric(answers$face),
                   face_rating = as.numeric(answers$rating),
                   exp_type = answers$type,
                   elapsed_ms = answers$elapsed_ms,
                   num_errors = answers$num_errors,
                   gender = answers$gender,
                   education = answers$education)
  d.raw <- bind_rows(d.raw, id)
  
  id <- data.frame(workerid = jd$WorkerId,
                   race = race,
                   age = answers$age,
                   exp_type = answers$type,
                   gender = answers$gender,
                   education = answers$education)
  d.balance <- bind_rows(d.balance, id)
}

message("Number of participants: ", length(unique(d.raw$workerid)))

## Number of participants: 5

Simple balance checks:

# Balance by gender:
table(d.balance$gender)

## 
## female 
##      5

# Balance by type:
table(d.balance$exp_type)

## 
##  attractive trustworthy 
##           2           3

# Balance by Face DFT's presented:
table(d.raw$face_dft)

## 
##   0  10  20  30  40  50  60  70  80  90 100 
##  15  15  15  15  15  15  15  15  15  15  15

# Balance by age:
table(d.raw$age)

## 
## 20-29 30-39 60-69 
##    33    66    66

5.1.1 Initial data plots

Quick check to see distribution of results per participant.

ggplot(d.raw, aes(x=face_dft, y=face_rating, color=factor(workerid))) +
  geom_jitter(stat="identity", alpha=0.5) + 
  facet_wrap(~exp_type)

Check errors and reaction times

d.raw %>%
  filter(elapsed_ms < 10000) %>%
  ggplot(aes(x=elapsed_ms)) +
    geom_histogram(bins=20) +
    facet_wrap(~num_errors)

Identify outliers

ggplot(d.raw, aes(x=exp_type, y=elapsed_ms)) +
  geom_boxplot()

ggplot(d.raw, aes(x=factor(num_errors), y=elapsed_ms)) +
  geom_boxplot()

ggplot(d.raw, aes(x=factor(face_dft), face_rating)) +
  geom_boxplot() +
  facet_wrap(~exp_type)

5.2 Confirmatory analysis

Compute within-subjects means, and errors across participants to generate Figure 2 from the paper.

ms <- d.raw %>%
  group_by(exp_type, face_dft, workerid) %>%
  summarise(mean_rating = mean(face_rating))

cis <- ms %>%
  group_by(exp_type, face_dft) %>%
  summarise(ci=ci95(mean_rating), mean_rating=mean(mean_rating))

ggplot(cis, aes(x=factor(face_dft), y=mean_rating, color=exp_type, group=exp_type)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymax=mean_rating+ci, ymin=mean_rating-ci),
                width=.25) +
  xlab('DFT %') +
  ylab('Mean Judgment')

The authors also did a multiple regression analysis which predicted the judgments using DFT, DFT-squared, experiment type and their interactions (all predictors centered).

mean_dft <- mean(ms$face_dft)
mean_face_rating <- mean(ms$mean_rating)
centered <- ms %>%
  mutate(face_rating = mean_rating - mean_face_rating,
         dft = face_dft - mean_dft)
pred <- lm(face_rating ~ exp_type + face_dft + I(face_dft ** 2) + face_dft * exp_type + I(face_dft**2) * exp_type,
           centered)
summary(pred)

## 
## Call:
## lm(formula = face_rating ~ exp_type + face_dft + I(face_dft^2) + 
##     face_dft * exp_type + I(face_dft^2) * exp_type, data = centered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4716 -0.5039  0.0979  0.5400  1.4582 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                       -2.4248252  0.4316785  -5.617 1.91e-06
## exp_typetrustworthy                4.0780886  0.6104856   6.680 6.68e-08
## face_dft                           0.1414530  0.0200842   7.043 2.15e-08
## I(face_dft^2)                     -0.0012782  0.0001934  -6.608 8.39e-08
## exp_typetrustworthy:face_dft      -0.1750796  0.0284034  -6.164 3.39e-07
## exp_typetrustworthy:I(face_dft^2)  0.0012364  0.0002736   4.520 5.88e-05
##                                      
## (Intercept)                       ***
## exp_typetrustworthy               ***
## face_dft                          ***
## I(face_dft^2)                     ***
## exp_typetrustworthy:face_dft      ***
## exp_typetrustworthy:I(face_dft^2) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8013 on 38 degrees of freedom
## Multiple R-squared:  0.7269, Adjusted R-squared:  0.6909 
## F-statistic: 20.22 on 5 and 38 DF,  p-value: 8.545e-10

Finally, the authors computed a by-participant ANOVA with DFT as a repeated measure and experiment type as a between-subjects factor. > Not sure if I got the formula right for this…

anova <- aov(mean_rating ~ exp_type + face_dft + Error(workerid / face_dft),
             ms)
summary(anova)

## 
## Error: workerid
##          Df Sum Sq Mean Sq
## exp_type  1 0.2227  0.2227
## 
## Error: workerid:face_dft
##          Df Sum Sq Mean Sq
## face_dft  1 0.7433  0.7433
## 
## Error: Within
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## exp_type   1   2.69   2.691   1.857    0.181    
## face_dft   1  29.14  29.138  20.100 6.29e-05 ***
## Residuals 39  56.54   1.450                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.3 Exploratory analyses

Verify DV’s that differ from original study

5.3.1 Effect of gender

ms.gender <- d.raw %>%
  group_by(gender, exp_type, face_dft, workerid) %>%
  summarise(mean = mean(face_rating)) %>%
  group_by(gender, exp_type, face_dft) %>%
  summarise(ci=ci95(mean), mean=mean(mean))

ggplot(ms.gender, aes(x=factor(face_dft), y=mean, color=exp_type, group=exp_type)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymax=mean+ci, ymin=mean-ci),
                width=.25) +
  xlab('DFT %') +
  ylab('Mean Judgment') +
  facet_wrap(~gender)

5.3.2 Effect of age

ms.age <- d.raw %>%
  group_by(age, exp_type, face_dft, workerid) %>%
  summarise(mean = mean(face_rating)) %>%
  group_by(age, exp_type, face_dft) %>%
  summarise(ci=ci95(mean), mean=mean(mean))

ggplot(ms.age, aes(x=factor(face_dft), y=mean, color=exp_type, group=exp_type)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymax=mean+ci, ymin=mean-ci),
                width=.25) +
  xlab('DFT %') +
  ylab('Mean Judgment') +
  facet_wrap(~age)

5.3.3 Effect of ethnicity

ms.race <- d.raw %>%
  group_by(race, exp_type, face_dft, workerid) %>%
  summarise(mean = mean(face_rating)) %>%
  group_by(race, exp_type, face_dft) %>%
  summarise(ci=ci95(mean), mean=mean(mean))

ggplot(ms.race, aes(x=factor(face_dft), y=mean, color=exp_type, group=exp_type)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymax=mean+ci, ymin=mean-ci),
                width=.25) +
  xlab('DFT %') +
  ylab('Mean Judgment') +
  facet_wrap(~race)

Replication of "What is Typical is Good: The Influence of Face Typicality on Perceived Trustworthiness"

by Carmel Sofer, Ron Dotsch, Daniel H J Wigboldus & Alexander Todorov (2015, Psychological Science)

Carolyn Au

2016-03-02