Abstract / Rough Outline

Many of us have experienced a sense of “familiarity” for faces we have never seen before, a phenomenon tied to both visual perception and personal experience (Lyon, 1996). Can face images be intrinsically familiar? If so, can familiarity be measured consistently? We obtained three measures of familiarity for 100 hyper-realistic, GAN-generated face images and examined the correspondence between experiments, groups of participants, and individual responses.

In our first experiment, we measured visual familiarity with a task that captures both memorability (i.e., accurate recognition of something previously seen, measured via recognition hit rate; Bainbridge et al., 2016) and familiarity (i.e., false recognition of something we have not actually seen, measured via false alarm rate). However, measurements of familiarity in this task are likely more conservative than our actual experiences of familiarity: we may know we have never seen something before, and yet it still feels familiar.

In addition to this inferred measure, our second experiment captured familiarity using a forced-choice task. Here, participants chose the “more familiar” face from pairs of faces, randomly paired without repeats. The resulting score for each face across all participants serves as its generalized familiarity score.

Our third study aimed to capture the subtleties of familiarity for individual faces, without the time and task constraints of the previous tasks. Participants rated faces on a sliding scale between “Not at all familiar” and “Extremely familiar”.

For each experiment, we computed Kendall’s rank correlations (tau) between image rankings (by familiarity score) for 100 random split halves of participants. The average over the 100 correlations serves as a population-level reliability metric. For all experiments, we found widespread variability in image rankings between groups of individuals (Exp. 1 mean tau = .02, Exp. 2 mean tau = .01, Exp. 3 mean tau = .04). We also assessed the consistency of familiarity scores for individual participants using leave-one-out cross-validation by participant: for each held-out participant, the remaining participants’ scores for each image were used to predict that participant’s responses. We found that some participants are more consistent with population responses than others, and this varied as a function of experiment. Finally, we computed Kendall’s tau between overall image rankings for each pair of experiments, and found no significant agreement in familiarity rankings between experiments.

Our results indicate that “familiarity” is likely not a consistently measurable property of face images, but rather a subjective rating of individual experience. Approximating measurements of familiarity with memory, forced-choice, or explicit rating tasks yields inconsistent definitions, and therefore inconsistent measurements, of what “familiar” is.

TO DO:

  • Finalize participant count -> doubling seems reasonable for all three! I’m worried about being underpowered…

    • settled on ~96 for memorability
    • add double ratings/forced-choice?
  • Analyses by strategy in rating study

  • Confirm method for removing bad data in memorability experiment

  • Ask Emily about bad data issues

DONE:

  • ranked correlation for “most familiar” faces between people and groups

    • for false alarms and hit rates
  • ranked correlation across the experiments

  • make plot of ordered face list overall for each experiment

Across Experiments (ranked image comparisons)

##         tau 
## -0.09333333
##         tau 
## -0.08323232
##         tau 
## -0.01616162
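
For reference, one way pairwise cross-experiment values like the three above could be produced is sketched below. This is only an illustration on simulated scores: the data frames `mem`, `fc`, and `rating` and their score columns are hypothetical stand-ins for per-image summaries from each experiment, and the key step is merging on `image_code` before correlating.

```r
# Simulated per-image scores for three experiments (hypothetical stand-ins)
set.seed(2)
codes  <- sample(1:600, 100)
mem    <- data.frame(image_code = codes,         fa_r        = runif(100))
fc     <- data.frame(image_code = sample(codes), choice_rate = runif(100))
rating <- data.frame(image_code = sample(codes), mean_rating = runif(100, 0, 100))

# Align the three experiments on image_code so each row is one image
scores <- Reduce(function(a, b) merge(a, b, by = "image_code"),
                 list(mem, fc, rating))

# Pairwise Kendall's tau between the three per-image scores
tau <- cor(scores[, c("fa_r", "choice_rate", "mean_rating")], method = "kendall")
round(tau, 3)
```

The off-diagonal entries of `tau` are the three between-experiment correlations; `cor.test(..., method = "kendall")` on any pair of columns would add a p-value.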


Memorability & Familiarity

Face images can be more or less intrinsically “memorable” (Bainbridge et al., 2016) – can they be intrinsically familiar as well, possessing some property that is consistently deemed “familiar” across observers?

We were interested in whether images with salient familiarity signals (based on false alarm rates in the memorability task) would correspond to other, more explicit measures of familiarity. We expect consistent memorability and familiarity scores for each image across individuals.

Been looking at data quality issues. I’m thinking about removing people whose false alarm or miss rates are more than two standard deviations above or below the mean. How to deal with people that don’t have any false alarms?
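
A minimal sketch of what that exclusion rule could look like, on simulated data. The toy data frame `toy` and the helpers `rate()` and `flag_outliers()` are hypothetical; only the `perf_str` labels and the two-SD cutoff come from the notes above. People with no false alarms simply get a rate of 0 here and are flagged only if 0 falls outside the cutoff, which may or may not be the right treatment.

```r
# Toy per-trial data (hypothetical): 20 participants, 50 trials each
set.seed(1)
toy <- data.frame(
  id = rep(1:20, each = 50),
  perf_str = sample(c("hit", "miss", "FA", "CR"), 1000, replace = TRUE,
                    prob = c(.3, .2, .1, .4)),
  stringsAsFactors = FALSE
)
# Participant 20 responds "seen" to everything: only hits and false alarms
toy$perf_str[toy$id == 20] <- sample(c("hit", "FA"), 50, replace = TRUE)

# Per-participant hit and false alarm rates
rate <- function(x, yes, no) sum(x == yes) / (sum(x == yes) + sum(x == no))
by_id <- data.frame(
  id    = sort(unique(toy$id)),
  hit_r = as.numeric(tapply(toy$perf_str, toy$id, rate, yes = "hit", no = "miss")),
  fa_r  = as.numeric(tapply(toy$perf_str, toy$id, rate, yes = "FA",  no = "CR"))
)

# Flag anyone more than k SDs from the group mean on either rate
flag_outliers <- function(x, k = 2) abs(x - mean(x)) > k * sd(x)
by_id$exclude <- flag_outliers(by_id$hit_r) | flag_outliers(by_id$fa_r)
by_id[by_id$exclude, ]
```

Note that the mean and SD here include the outlying participants themselves, so a median/MAD version might be more robust with a small sample.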

Split Half Ranked Correlations (“Reliability”)

# make lists for FA and hit correlations
fa_cor_list <- list()
hit_cor_list <- list()

for (i in 1:100){
  set.seed(i)
  shuffled_ids <- d$id %>% unique() %>% sample(size = length(unique(d$id)))
  end <- length(shuffled_ids) 
  split <- length(shuffled_ids)/2
  
  ids_1 <- data.frame(id = shuffled_ids[1:split])
  ids_2 <- data.frame(id = shuffled_ids[(split+1):end])
  
  half_1 <- d[d$id %in% ids_1$id,]
  half_2 <- d[d$id %in% ids_2$id,]
  
  
  list_1 <- half_1 %>%
    group_by(image_code) %>%
    summarise(hit_r = sum(perf_str == "hit")/(sum(perf_str == "hit") + sum(perf_str == "miss")),
              fa_r = sum(perf_str == "FA") / (sum(perf_str == "FA") + sum(perf_str == "CR")))
  
  list_2  <- half_2 %>%
    group_by(image_code) %>%
    summarise(hit_r = sum(perf_str == "hit")/(sum(perf_str == "hit") + sum(perf_str == "miss")),
              fa_r = sum(perf_str == "FA") / (sum(perf_str == "FA") + sum(perf_str == "CR")))

    # sort by false alarm rate, then by name
  list_1 <- list_1[
    with(list_1, order(fa_r, image_code, decreasing = T)),]
  
  list_2 <- list_2[
    with(list_2, order(fa_r, image_code, decreasing = T)),]
  
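  # get ranked correlation for FA sorted list
  # (note: this correlates the image codes found at each rank position in the
  # two halves, so the estimate depends on the numeric codes themselves)
  fa_cor_list[i] <- cor.test(x = list_1$image_code, y = list_2$image_code, method = 'kendall')$estimate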

  
  # sort by hit rate, then by name
  list_1 <- list_1[
    with(list_1, order(hit_r, image_code, decreasing = T)),]
  
  list_2 <- list_2[
    with(list_2, order(hit_r, image_code, decreasing = T)),]
  
  hit_cor_list[i] <- cor.test(x = list_1$image_code, y = list_2$image_code, method = 'kendall')$estimate
  
  
}

# fa_cor_list %>% unlist() %>% summary()
fa_cor_list %>% unlist() %>% as.data.frame() %>%
  summarise(average = mean(.),
            min = min(.),
            max = max(.),
            sd = sd(.))
##      average        min       max        sd
## 1 0.05455758 -0.1527273 0.3789899 0.1019444
hit_cor_list %>% unlist() %>% as.data.frame() %>%
  summarise(average = mean(.),
            min = min(.),
            max = max(.),
            sd = sd(.))
##       average        min       max         sd
## 1 -0.01309899 -0.2173737 0.1793939 0.07164825
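
The loop above compares the image codes sitting at each rank position in the two sorted lists. An alternative that might be worth checking against is to align the two halves on `image_code` and correlate the per-image scores directly. The sketch below does this on simulated data; `toy`, `score`, and `split_half_tau()` are hypothetical names, and the simulated images carry a real shared signal so the expected result is a clearly positive tau.

```r
# Simulated ratings: 40 participants x 100 images with a shared per-image signal
set.seed(3)
n_id <- 40; n_img <- 100
true_fam <- rnorm(n_img)
toy <- expand.grid(id = 1:n_id, image_code = 1:n_img)
toy$score <- true_fam[toy$image_code] + rnorm(nrow(toy), sd = 1)

split_half_tau <- function(dat, n_splits = 100) {
  ids <- unique(dat$id)
  sapply(seq_len(n_splits), function(i) {
    half    <- sample(ids, floor(length(ids) / 2))
    in_half <- dat$id %in% half
    # mean score per image within each half
    s1 <- tapply(dat$score[in_half],  dat$image_code[in_half],  mean)
    s2 <- tapply(dat$score[!in_half], dat$image_code[!in_half], mean)
    # align on image, not on rank position
    common <- intersect(names(s1), names(s2))
    cor(as.numeric(s1[common]), as.numeric(s2[common]), method = "kendall")
  })
}

taus <- split_half_tau(toy)
summary(taus)
```

Running the same function on data with the scores shuffled across images would give a feel for what tau looks like when there is no shared signal at all.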

Check for uniform distribution of FAs

# Histogram of false alarm rates (count is # of images with that false alarm rate)
fa_rates <- d %>%
  group_by(image_code) %>%
     summarise(hit_r = sum(perf_str == "hit")/(sum(perf_str == "hit") + sum(perf_str == "miss")),
              fa_r = sum(perf_str == "FA") / (sum(perf_str == "FA") + sum(perf_str == "CR")),
              seen_count = n())

ggplot(fa_rates, aes(x=fa_r)) + 
 geom_histogram(aes(y = after_stat(density)), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#FF6666") 

# Check out the distributions of other ratings by image!
counts_img <- d %>% 
  group_by(image) %>%
  summarise(fa_count = sum(perf_str == "FA"),
            hit_count = sum(perf_str == "hit"),
            cr_count = sum(perf_str == "CR"),
            miss_count = sum(perf_str == "miss"),
            seen = n()) 


rates_plot <- melt(counts_img[,-6])
a <- ggplot(rates_plot, aes(x=value, color = variable)) + 
 geom_histogram(aes(y = after_stat(density)), colour="black", fill="white")+
 geom_density(alpha=.2, fill="grey") 
a

# histogram of false alarms, testing for uniformity...
fa_hist <- hist(fa_rates$fa_r, breaks = 100)

uniform.test(fa_hist)
## 
##  Chi-squared test for given probabilities
## 
## data:  hist.output$counts
## X-squared = 949.92, df = 135, p-value < 2.2e-16

Logistic Regression, Leave-One-Out by ID (“Consistency”)

# idea is for each person, for each image, use the responses from everyone else for that image to predict test response
d <- read.csv("../memorability/memorability_all_observations.csv")

variance <- list()
estimates <- list()
p_values <- list()
summary_list <- list()
j <- 1
for (id in unique(d$id)){
    curr_id <- id
    train <- d %>% filter(!id == curr_id) %>%
      group_by(image_code) %>%
      summarise(hit_r = sum(perf_str == "hit")/(sum(perf_str == "hit") + sum(perf_str == "miss")),
                fa_r = sum(perf_str == "FA") / (sum(perf_str == "FA") + sum(perf_str == "CR")))
    train[is.na(train)] <- 0
    
    
    test <- d %>% filter(id == curr_id) %>%
      group_by(image_code) %>%
      summarise(hit_r = sum(perf_str == "hit")/(sum(perf_str == "hit") + sum(perf_str == "miss")),
                fa_r = sum(perf_str == "FA") / (sum(perf_str == "FA") + sum(perf_str == "CR")))
    test[is.na(test)] <- 0
    
    dt <- data.frame(train_fa = train$fa_r, test_fa = test$fa_r)
    
    model_fa <- glm(test_fa ~ train_fa, data = dt, family = binomial())
    temp <- model_fa %>% summary()
    
    summary_list[[j]] <- model_fa  # [[ ]] stores the whole fitted model in slot j
    
    variance[j] <- (temp$coefficients[2,2]^2 / 100)
    estimates[j] <- temp$coefficients[2,1]
    p_values[j] <- temp$coefficients[2,4]
    
    j <- j + 1
}


# weight by inverse variance
unlist(estimates) %>% summary()
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -28.94699   0.00003   4.73588   5.50709   9.03137  35.12306
avg_b1_weighted <- unlist(estimates)[-14] * (1/unlist(variance)[-14])
avg_b1_weighted %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -29.04    0.00   14.36   12.13   21.04   46.75
unlist(p_values)[-14] %>% summary()
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0006893 0.1323214 0.4150028 0.4334702 0.7445569 1.0000000
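
Side note on the weighting: multiplying each estimate by 1/variance and summarising the products is not itself a weighted average. A standard fixed-effect inverse-variance pooled estimate divides by the sum of the weights, as in this small sketch. The slopes and standard errors below are made-up numbers for illustration, not values from the models above.

```r
# Made-up per-participant slopes and their standard errors
b  <- c(4.7, 9.0, -1.2, 6.3)
se <- c(2.0, 4.5, 3.0, 1.5)

w         <- 1 / se^2               # inverse-variance weights
pooled    <- sum(w * b) / sum(w)    # weighted mean slope
pooled_se <- sqrt(1 / sum(w))       # standard error of the pooled slope

c(pooled = pooled, pooled_se = pooled_se)
```

The pooled value always lies between the smallest and largest individual slope, which is a quick sanity check on the weighting.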

Next steps:

  • Monte Carlo simulations to generate null distribution regression coefficients
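
A rough sketch of how that null distribution might be generated for one held-out participant, using a permutation approach (shuffling the participant's responses across images and refitting). This is one possible reading of "Monte Carlo simulations", not a settled plan; `train_fa` and `test_fa` are simulated here and `slope()` is a hypothetical helper.

```r
# Simulated inputs for a single held-out participant
set.seed(4)
n_img    <- 100
train_fa <- runif(n_img, 0, 0.4)        # leave-one-out FA rate per image
test_fa  <- rbinom(n_img, 1, train_fa)  # held-out participant's 0/1 responses

# Logistic slope of the participant's responses on the group FA rate
slope <- function(y, x) coef(glm(y ~ x, family = binomial()))[["x"]]

obs  <- slope(test_fa, train_fa)
# Null: break the image-response link by permuting responses, then refit
null <- replicate(500, slope(sample(test_fa), train_fa))

# Two-sided permutation p-value (with the +1 correction)
p_perm <- (sum(abs(null) >= abs(obs)) + 1) / (length(null) + 1)
c(observed = obs, p_perm = p_perm)
```

Repeating this per participant would give a permutation p-value for each slope that does not lean on the glm standard errors.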

Two Alternative Forced Choice

In another sample, we captured image familiarity somewhat more explicitly, using a two-alternative forced-choice task in which participants chose the “more familiar” face from randomly paired GAN face images.

Split Half Ranked Correlations (“Reliability”)

fam_fc_cor <- list()
for (i in 1:100){
  set.seed(i+100)
  shuffled_ids <- d$subject_id %>% unique() %>% sample(size = length(unique(d$subject_id)))
  end <- length(shuffled_ids) 
  split <- length(shuffled_ids)/2
  
  ids_1 <- data.frame(id = shuffled_ids[1:split])
  ids_2 <- data.frame(id = shuffled_ids[(split+1):end])
  
  half_1 <- d[d$subject_id %in% ids_1$id,]
  half_2 <- d[d$subject_id %in% ids_2$id,]
  
  seen_1 <- (table(half_1$left) + table(half_1$right)) %>% as.data.frame()
  seen_2 <- (table(half_2$left) + table(half_2$right)) %>% as.data.frame()
  
  # get rates based on number of times seen

  list_1 <- half_1$chosen %>%
    table() %>% as.data.frame()
  
  list_2 <- half_2$chosen %>%
    table() %>% as.data.frame()
  
  list_1$rate <- list_1$Freq / seen_1$Freq
  list_2$rate <- list_2$Freq / seen_2$Freq
  
  list_1$image_code <- gsub("[^0-9]", "", list_1$.) %>% as.numeric()
  list_2$image_code <- gsub("[^0-9]", "", list_2$.) %>% as.numeric()
  
  list_1 <- list_1[order(list_1$rate, list_1$image_code, decreasing = TRUE),]
  list_2 <- list_2[order(list_2$rate, list_2$image_code, decreasing = TRUE),]
  
  fam_fc_cor[i] <- cor.test(x = list_1$image_code, y = list_2$image_code, method = 'kendall')$estimate
}  

fam_fc_cor %>% unlist() %>% as.data.frame() %>%
  summarise(average = mean(.),
            min = min(.),
            max = max(.),
            sd = sd(.))
##      average        min       max        sd
## 1 0.01010505 -0.1284848 0.2113131 0.0638566
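
One thing to watch in the rate calculation above: dividing one `table()` by another assumes both tables list the same images in the same order, which breaks if some image is never chosen in a half. A sketch of a version that fixes the image levels up front is below; the toy trials and the name `lv` are hypothetical, with file names only mimicking the pattern in the data.

```r
# Toy 2AFC trials (hypothetical): 200 random pairs drawn from 10 images
set.seed(5)
imgs  <- sprintf("resources/face_%d_resized.jpg", 1:10)
pairs <- t(replicate(200, sample(imgs, 2)))
toy   <- data.frame(left = pairs[, 1], right = pairs[, 2],
                    stringsAsFactors = FALSE)
toy$chosen <- ifelse(runif(200) < 0.5, toy$left, toy$right)

# Shared set of levels so every table has one cell per image, in the same order
lv     <- sort(unique(c(toy$left, toy$right)))
shown  <- table(factor(toy$left, lv)) + table(factor(toy$right, lv))
chosen <- table(factor(toy$chosen, lv))  # never-chosen images keep a 0 count

rate <- as.numeric(chosen) / as.numeric(shown)
names(rate) <- lv
sort(rate, decreasing = TRUE)
```

Because the rates are named by image, two halves can then be matched by name rather than by position.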

Logistic Regression, Leave-One-Out by ID (“Consistency”)

# idea is for each person, for each image, use the responses from everyone else for that image to predict what will be said for that image
# d <- d[!grepl("mickey", d$chosen),]

estimates <- list()
p_values <- list()
j <- 1
for (id in unique(d$subject_id)){
  curr_id <- id
  train <- d %>% filter(!subject_id == curr_id) %>%
    group_by(image_code) %>% 
    select(subject_id,left, right, chosen, image_code)
  
  seen <- (table(train$left) + table(train$right)) %>% data.frame()  # parentheses: sum the tables first, then convert
  chosen <- train$image_code %>% table() %>% as.data.frame()
  train_rate <- chosen$Freq / seen$Freq

  rm(seen)
  rm(chosen)
  
  test <- d %>% filter(subject_id == curr_id) %>%
    group_by(image_code) %>%
    select(subject_id,left, right, chosen, image_code)
  
  test_all <- data.frame(image = unique(d$image_code), response = numeric(100))
  test_all[which(test_all$image %in% test$image_code),]$response <- 1
    
  dt <- data.frame(train_rate = train_rate, test_chosen = test_all$response)
  
  # note: no family argument, so glm() fits a Gaussian (linear) model here, not a logistic one
  model <- glm(test_chosen ~ train_rate, data = dt)
  temp <- model %>% summary()
  
  estimates[j] <- temp$coefficients[2,1]
  p_values[j] <- temp$coefficients[2,4]
  j <- j + 1
}

estimates %>% unlist() %>% summary()
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.85137 -0.25283  0.04692  0.01249  0.34144  0.69152
p_values %>% unlist() %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04365 0.31079 0.51567 0.52470 0.77868 0.96710

Next steps:

“by item consistency”


Explicit Ratings & Written Responses

Image ratings

In our third experiment we obtained a hyper-explicit measure of familiarity. Participants rated 100 “unfamiliar” GAN face images on a sliding scale between “Not at all familiar” and “Extremely familiar” (N = 43).

## Warning: Missing column names filled in: 'X1' [1]
## # A tibble: 6 x 5
##      X1 subject_id image                          image_code response
##   <dbl> <chr>      <chr>                               <dbl>    <dbl>
## 1     1 0l70vzbs   resources/face_55_resized.jpg          55       58
## 2     2 0l70vzbs   resources/face_291_resized.jpg        291       60
## 3     3 0l70vzbs   resources/face_359_resized.jpg        359       39
## 4     4 0l70vzbs   resources/face_560_resized.jpg        560       73
## 5     5 0l70vzbs   resources/face_228_resized.jpg        228       67
## 6     6 0l70vzbs   resources/face_428_resized.jpg        428       68

Split Half Ranked Correlations - “Reliability”

d <- d %>% filter(!grepl("mickey", d$image))

fam_rating_cor <- list()
for (i in 1:100){
  set.seed(i+100)
  shuffled_ids <- d$subject_id %>% unique() %>% sample(size = length(unique(d$subject_id)))
  end <- length(shuffled_ids) 
  split <- round(length(shuffled_ids)/2)
  
  ids_1 <- data.frame(id = shuffled_ids[1:split])
  ids_2 <- data.frame(id = shuffled_ids[(split+1):end])
  
  half_1 <- d[d$subject_id %in% ids_1$id,]
  half_2 <- d[d$subject_id %in% ids_2$id,]
  
  list_1 <- half_1 %>%
    group_by(image_code) %>%
    summarise(response = mean(response))

  list_2 <- half_2 %>%
    group_by(image_code) %>%
    summarise(response = mean(response))

  list_1 <- list_1[order(list_1$response, list_1$image_code, decreasing = TRUE),]
  list_2 <- list_2[order(list_2$response, list_2$image_code, decreasing = TRUE),]
  
  fam_rating_cor[i] <- cor.test(x = list_1$image_code, y = list_2$image_code, method = 'kendall')$estimate
}  

fam_rating_cor %>% unlist() %>% as.data.frame() %>%
  summarise(average = mean(.),
            min = min(.),
            max = max(.),
            sd = sd(.))
##      average        min       max         sd
## 1 0.03682828 -0.1090909 0.1858586 0.06040734

Linear Regression, Leave-One-Out by ID (“Consistency”)

estimates <- list()
p_values <- list()
j <- 1

for (id in unique(d$subject_id)){
  curr_id <- id
  
  train <- d %>% filter(!subject_id == curr_id) %>%
    group_by(image_code) %>%
    summarise(response = mean(response))
  
  test <- d %>% filter(subject_id == curr_id) %>%
    group_by(image_code) %>%
    summarise(response = mean(response))
  
  dt <- data.frame(train_resp = train$response, test_resp = test$response)
  
  model <- glm(test_resp ~ train_resp, data = dt)
  temp <- model %>% summary()
  
  estimates[j] <- temp$coefficients[2,1]
  p_values[j] <- temp$coefficients[2,4]
  j <- j + 1
}

estimates %>% unlist() %>% summary()
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.44646  0.04338  0.33259  0.67433  1.16969  3.97376
p_values %>% unlist() %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01283 0.24081 0.28721 0.48350 0.98280

Analyses by individual:

Analyses by image:

Unsurprisingly, the three images with the highest mean responses are the three Mickey Mouse catch trials.

# do Cronbach's alpha

d$image_code <- as.numeric(as.factor(d$image)) # dummy code as suggested

d %>%
  select(response, image_code) %>%
  psych::alpha(check.keys = T)
## Warning in psych::alpha(., check.keys = T): Some items were negatively correlated with total scale and were automatically reversed.
##  This is indicated by a negative sign for the variable name.
## 
## Reliability analysis   
## Call: psych::alpha(x = ., check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r   S/N   ase mean sd median_r
##      0.075     0.076   0.039     0.039 0.082 0.029   53 21    0.039
## 
##  lower alpha upper     95% confidence boundaries
## 0.02 0.08 0.13 
## 
##  Reliability if an item is dropped:
##            raw_alpha std.alpha G6(smc) average_r   S/N alpha se var.r med.r
## response-      0.041     0.039  0.0015     0.039 0.041       NA     0 0.039
## image_code     0.038     0.039  0.0015     0.039 0.041       NA     0 0.039
## 
##  Item statistics 
##               n raw.r std.r r.cor r.drop mean sd
## response-  4000  0.74  0.72  0.14  0.039   55 30
## image_code 4000  0.71  0.72  0.14  0.039   50 29

Not at all reliable! How do I compute an alpha without this dummy coding?

That makes sense because this measure is experience-based…
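
On the dummy-coding question: one possible approach is to reshape the ratings to wide format and compute alpha on that matrix instead of on the long `response` / `image_code` columns. The sketch below treats images as cases (rows) and participants as "items" (columns), which is an assumption about what reliability should mean here, and computes alpha by hand from the usual formula. The data are simulated noise and `cronbach_alpha()` is a hypothetical helper.

```r
# Simulated long-format ratings: 40 participants x 100 images, pure noise
set.seed(6)
n_id <- 40; n_img <- 100
long <- expand.grid(subject_id = 1:n_id, image_code = 1:n_img)
long$response <- round(runif(nrow(long), 0, 100))

# Wide matrix: one row per image, one column per participant
wide <- tapply(long$response, list(long$image_code, long$subject_id), mean)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)
cronbach_alpha <- function(m) {
  k         <- ncol(m)
  item_var  <- apply(m, 2, var)
  total_var <- var(rowSums(m))
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

cronbach_alpha(wide)
```

With noise-only ratings alpha should hover near zero, and it equals 1 when every participant gives identical ratings. The same wide matrix could presumably be passed to `psych::alpha()` for the fuller output, provided every participant rated every image.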

Next steps:

  • ranked correlations within people
    • within/out of 95% of average rating range?…then do simulations?
  • make “response profile” plots

Strategy Statements

Beyond determining whether different measurements of familiarity are consistent with one another, we were also interested in capturing the experience of familiarity. To this end, our third experiment included a survey, in which participants rated their agreement with statements about different strategies for rating familiarity (e.g. “I was likely to rate a face as familiar when it looked like a celebrity”; “I used specific face features when rating familiarity”). ~data visualization to come~

Next steps:

  • clean strings for analysis
  • response patterns for those who use whole/face parts & celeb/personal

Describing Familiarity… in words!

In addition, participants provided a short written response to the question: “How would you describe ‘familiarity’?”. The results indicate varying working definitions of familiarity, reflected by diverse use of the familiarity scale.

Many adhere to a more traditional definition:

  • “whether they resembled someone i’d seen before in real life or on this survey”
  • “When something new looks like something you’ve seen before”
  • “If they look somewhat like someone that I know”

To some, it can be a property of the face or image, related to memory… sometimes conflated with “memorability”:

  • “the kind of face i’ve seen at least once or twice before”
  • “fame and the face shapes”
  • “a feature or identifying composition that is stored in one’s memory”

Some describe it as a “knowing”:

  • “Familiarity is a measure of the level to which you know something”
  • “Being well experienced with/in something”

To others, a feeling of closeness:

  • “something that brings me comfort in knowing”
  • “familiarity is a feeling of comfort and closeness and a friendly relationship”
  • “A feeling you have of recognition or closeness or knowing about someone”

Some illustrative combinations of these things:

  • “Familiarity is defined as knowledge of someone or something, or to a feeling of comfort and closeness with someone or something. When you have heard of a brand of computer, this is an example of a familiarity with the computer.”
  • “Familiarity is when I see a face that reminds me of someone I personally know or someone famous such as a movie actor, musician, etc.”
And our problematically ambiguous favorite: “good”

Extra notes:

For those with more rigid definitions of familiarity, consistently low ratings were given to all of the unfamiliar GAN faces, while consistently high ratings were given to the Mickey Mouse catch trials.

Getting Turker response of concrete vs. abstract definitions based on stuff in the written responses


Compost Pile!

We computed ranked correlations by “familiarity” score across all three experiments… Our results indicate that face images can have measurable, visually “familiar” properties that roughly correspond to increasingly more explicit measures of familiarity, suggesting some agreement between perceptual familiarity, and the individual experience of familiarity.

is it consistent and is it distinct from memorability? don’t overpromise, lead with reasons why we would choose three tasks

Face images can be intrinsically “memorable”– a measurable visual property independent from individual experience (Bainbridge et al., 2013).

Many of us have experienced a sense of “familiarity” for faces we have never seen before (e.g. at the extreme, pictures of celebrity look-alikes). Is the phenomenon of face familiarity one of individual experience and memory (a sense of “knowing” or “feeling”), or a consequence of intrinsic properties of the face image itself? We obtained three measures of familiarity for 100 hyper-realistic, GAN-generated face images, and examined whether these face images were consistently rated as familiar across participants, measurements, and conceptual definitions.

To establish whether, and which, images in our dataset have salient, “familiar”, visual properties, we employed a memorability task paradigm that captured both visual memorability scores for 103 images (100 “unfamiliar” GAN faces, 3 images of Mickey Mouse), as well as inferred measurements of image familiarity (Bainbridge et al., 2016). We expected highly “familiar” images identified in this experiment to be consistently highly-scored under other measurements of familiarity.

In another sample, we captured image familiarity somewhat more explicitly, using a two-alternative, forced-choice task. Here, participants chose the “more familiar” face from pairs of two GAN face images, randomly paired without repeats. For each trial, images either get a score of 1 (familiar) or 0 (not familiar). The resulting score for each image across all participants serves as a measure of an image’s generalized familiarity score.

In our third experiment we obtained explicit measures of familiarity. Participants rated 100 “unfamiliar” GAN face images on a sliding scale between “Not at all familiar” and “Extremely familiar”. This is intended to capture the subtleties of “familiarity” for a given face, in comparison to the time and task pressures of the forced choice and memorability tasks.

Beyond determining whether our three measurements of familiarity are consistent with one another, we were also interested in capturing the experience of familiarity. To this end, our third experiment included a survey, in which participants rated their agreement with statements about different strategies for rating familiarity. Participants also provided a short written response to the question: “How would you describe ‘familiarity’?”.

Our results indicate that familiarity is likely not an intrinsic visual feature of face images, but rather a subjective rating of individual experience. Approximating measurements of familiarity with memory, forced-choice, or explicit rating tasks yields inconsistent definitions, and therefore inconsistent measurements, of what “familiar” is.