Many of us have experienced a sense of “familiarity” for faces we have never seen before, a phenomenon tied to both visual perception and personal experience (Lyon, 1996). Can face images be intrinsically familiar? If so, can familiarity be measured consistently? We obtained three measures of familiarity for 100 hyper-realistic, GAN-generated face images and examined the correspondence between experiments, groups of participants, and individual responses.
We first measure visual familiarity with a task that captures both memorability (i.e. accurate recognition of something previously seen; measured via recognition hit rate, Bainbridge et al., 2016) and familiarity (i.e. false recognition of something we have not actually seen; measured via false alarm rate). However, measurements of familiarity in this task are likely more conservative than our actual experiences of familiarity: we may know we have never seen something before, and yet, it still feels familiar.
In addition to this inferred measure, our second experiment captures familiarity using a forced-choice task. Here, participants chose the “more familiar” face from pairs of two faces, randomly paired without repeats. The resulting score for each face across all participants serves as its generalized familiarity score.
Our third study aimed to capture the subtleties of familiarity for individual faces, without the time and task constraints of the previous tasks. Participants rated faces on a sliding scale between “Not at all familiar” and “Extremely familiar”.
For each experiment, we computed Spearman’s rank correlations between image rankings (by familiarity score) for 100 randomly sampled halves of participants. The average over these 100 correlations serves as a population-level reliability metric. For all experiments, we found widespread variability in image rankings between groups of individuals (Exp.1 mean rho = .02, Exp.2 mean rho = .01, Exp.3 mean rho = .05). We also calculated the consistency of familiarity scores for individual participants, using leave-one-out cross-validation by participant to predict familiarity scores for each image. We found that some participants are more consistent with population responses than others, and that this varied as a function of experiment. Finally, we computed Spearman’s correlations between the overall image rankings from all three experiments, and found no significant agreement in familiarity rankings between experiments.
Our results indicate that “familiarity” is likely not a consistently measurable property of face images, but rather a subjective rating of individual experience. Approximating measurements of familiarity with memory, forced-choice, or explicit rating tasks yields inconsistent definitions, and therefore inconsistent measurements, of what “familiar” is.
Finalize participant count -> doubling seems reasonable for all three! I’m worried about being underpowered…
Analyses by strategy in rating study
Confirm method for removing bad data in memorability experiment
Ask Emily about bad data issues
ranked correlation for “most familiar” faces between people and groups
ranked correlation across the experiments
make plot of ordered face list overall for each experiment
## tau
## -0.09333333
## tau
## -0.08323232
## tau
## -0.01616162
Face images can be more or less intrinsically “memorable” (Bainbridge et al., 2016) – can they be intrinsically familiar as well, possessing some property that is consistently deemed “familiar” across observers?
We were interested in whether images with salient familiarity signals (based on false alarm rates in the memorability task) would correspond to other, more explicit measures of familiarity. We expect consistent memorability and familiarity scores for each image across individuals.
Been looking at data quality issues. I’m thinking about removing people whose false alarm or miss rates are more than two standard deviations above or below the mean. How to deal with people who don’t have any false alarms?
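A rough sketch of what that exclusion rule could look like, assuming the same `id` and `perf_str` columns used in the code below. The helper `flag_outliers`, its cutoff `k`, and the toy data are hypothetical and only illustrate the idea; this is not a settled exclusion procedure.

```r
# Hypothetical helper: flag participants whose hit rate or false-alarm rate
# lies more than `k` standard deviations from the sample mean.
flag_outliers <- function(d, k = 2) {
  ids <- unique(d$id)
  rates <- data.frame(
    id = ids,
    hit_r = sapply(ids, function(i) {
      x <- d$perf_str[d$id == i]
      sum(x == "hit") / sum(x %in% c("hit", "miss"))
    }),
    fa_r = sapply(ids, function(i) {
      x <- d$perf_str[d$id == i]
      sum(x == "FA") / sum(x %in% c("FA", "CR"))
    })
  )
  z <- function(v) (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
  rates$exclude <- abs(z(rates$hit_r)) > k | abs(z(rates$fa_r)) > k
  # Undefined rates (e.g. no foil trials at all) are left unflagged here
  rates$exclude[is.na(rates$exclude)] <- FALSE
  rates
}

# Toy data (made up): 10 target and 10 foil trials per participant
make_trials <- function(id, hits, fas) {
  data.frame(
    id = id,
    perf_str = c(rep("hit", hits), rep("miss", 10 - hits),
                 rep("FA", fas), rep("CR", 10 - fas))
  )
}
toy <- do.call(rbind, Map(make_trials, 1:6,
                          c(7, 8, 6, 7, 8, 6),
                          c(1, 1, 1, 1, 1, 10)))
flagged <- flag_outliers(toy)
```

Under this sketch, a participant with zero false alarms simply gets `fa_r = 0`, which is a valid rate and enters the mean and SD like any other value. The rate is only undefined when someone has no foil trials at all.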
# make lists to hold the split-half correlations for false alarms (FA) and hits
fa_cor_list <- list()
hit_cor_list <- list()
for (i in 1:100){
set.seed(i)
shuffled_ids <- d$id %>% unique() %>% sample(size = length(unique(d$id)))
end <- length(shuffled_ids)
split <- floor(length(shuffled_ids)/2) # integer split point, so an odd number of participants does not drop anyone
ids_1 <- data.frame(id = shuffled_ids[1:split])
ids_2 <- data.frame(id = shuffled_ids[(split+1):end])
half_1 <- d[d$id %in% ids_1$id,]
half_2 <- d[d$id %in% ids_2$id,]
list_1 <- half_1 %>%
group_by(image_code) %>%
summarise(hit_r = sum(perf_str == "hit")/(sum(perf_str == "hit") + sum(perf_str == "miss")),
fa_r = sum(perf_str == "FA") / (sum(perf_str == "FA") + sum(perf_str == "CR")))
list_2 <- half_2 %>%
group_by(image_code) %>%
summarise(hit_r = sum(perf_str == "hit")/(sum(perf_str == "hit") + sum(perf_str == "miss")),
fa_r = sum(perf_str == "FA") / (sum(perf_str == "FA") + sum(perf_str == "CR")))
# sort by false alarm rate, then by name
list_1 <- list_1[
with(list_1, order(fa_r, image_code, decreasing = T)),]
list_2 <- list_2[
with(list_2, order(fa_r, image_code, decreasing = T)),]
# get ranked correlation for FA sorted list
fa_cor_list[i] <- cor.test(x = list_1$image_code, y = list_2$image_code, method = 'kendall')$estimate
# sort by hit rate, then by name
list_1 <- list_1[
with(list_1, order(hit_r, image_code, decreasing = T)),]
list_2 <- list_2[
with(list_2, order(hit_r, image_code, decreasing = T)),]
hit_cor_list[i] <- cor.test(x = list_1$image_code, y = list_2$image_code, method = 'kendall')$estimate
}
# fa_cor_list %>% unlist() %>% summary()
fa_cor_list %>% unlist() %>% as.data.frame() %>%
summarise(average = mean(.),
min = min(.),
max = max(.),
sd = sd(.))
## average min max sd
## 1 0.05455758 -0.1527273 0.3789899 0.1019444
hit_cor_list %>% unlist() %>% as.data.frame() %>%
summarise(average = mean(.),
min = min(.),
max = max(.),
sd = sd(.))
## average min max sd
## 1 -0.01309899 -0.2173737 0.1793939 0.07164825
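One caveat about the split-half loop above: it sorts each half by rate and then correlates the image codes sitting at each rank position, so the resulting tau depends partly on the arbitrary numeric image labels. A possible alternative, sketched here under assumptions, is to correlate the per-image rates themselves after matching images across halves. The helper `split_half_tau` and the toy data are hypothetical; only `id`, `image_code`, and `perf_str` come from the data used above.

```r
# Hypothetical helper: split-half Kendall's tau computed directly on
# per-image false-alarm rates, matched by image code.
split_half_tau <- function(d, n_splits = 100) {
  ids <- unique(d$id)
  fa_rate <- function(x) {
    foil <- x[x$perf_str %in% c("FA", "CR"), ]
    tapply(foil$perf_str == "FA", foil$image_code, mean)
  }
  sapply(seq_len(n_splits), function(i) {
    set.seed(i)
    first <- sample(ids, floor(length(ids) / 2))
    r1 <- fa_rate(d[d$id %in% first, ])
    r2 <- fa_rate(d[!d$id %in% first, ])
    # keep only images that appeared as foils in both halves
    common <- intersect(names(r1), names(r2))
    cor(as.vector(r1[common]), as.vector(r2[common]),
        method = "kendall", use = "complete.obs")
  })
}

# Toy data (made up): 20 participants, 10 foil images, with false alarms
# becoming more common as the image code increases
toy <- expand.grid(id = 1:20, image_code = 1:10)
toy$perf_str <- ifelse(toy$id %% 10 < toy$image_code, "FA", "CR")
taus <- split_half_tau(toy)
```

The same pattern would apply to hit rates by swapping in the `"hit"` and `"miss"` labels. Whether this changes the conclusions for the real data is an open question until it is rerun.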
# Histogram of false alarm rates (count is # of images with that false alarm rate)
fa_rates <- d %>%
group_by(image_code) %>%
summarise(hit_r = sum(perf_str == "hit")/(sum(perf_str == "hit") + sum(perf_str == "miss")),
fa_r = sum(perf_str == "FA") / (sum(perf_str == "FA") + sum(perf_str == "CR")),
seen_count = n())
ggplot(fa_rates, aes(x=fa_r)) +
geom_histogram(aes(y=after_stat(density)), colour="black", fill="white")+ # after_stat() replaces the deprecated ..density..
geom_density(alpha=.2, fill="#FF6666")
# Check out the distributions of other ratings by image!
counts_img <- d %>%
group_by(image) %>%
summarise(fa_count = sum(perf_str == "FA"),
hit_count = sum(perf_str == "hit"),
cr_count = sum(perf_str == "CR"),
miss_count = sum(perf_str == "miss"),
seen = n())
rates_plot <- melt(counts_img[,-6])
a <- ggplot(rates_plot, aes(x=value, color = variable)) +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="grey")
a
# histogram of false alarms, testing for uniformity...
fa_hist <- hist(fa_rates$fa_r, breaks = 100)
uniform.test(fa_hist)
##
## Chi-squared test for given probabilities
##
## data: hist.output$counts
## X-squared = 949.92, df = 135, p-value < 2.2e-16
# idea is for each person, for each image, use the responses from everyone else for that image to predict test response
d <- read.csv("../memorability/memorability_all_observations.csv")
variance <- list()
estimates <- list()
p_values <- list()
summary_list <- list()
j <- 1
for (id in unique(d$id)){
curr_id <- id
train <- d %>% filter(!id == curr_id) %>%
group_by(image_code) %>%
summarise(hit_r = sum(perf_str == "hit")/(sum(perf_str == "hit") + sum(perf_str == "miss")),
fa_r = sum(perf_str == "FA") / (sum(perf_str == "FA") + sum(perf_str == "CR")))
train[is.na(train)] <- 0
test <- d %>% filter(id == curr_id) %>%
group_by(image_code) %>%
summarise(hit_r = sum(perf_str == "hit")/(sum(perf_str == "hit") + sum(perf_str == "miss")),
fa_r = sum(perf_str == "FA") / (sum(perf_str == "FA") + sum(perf_str == "CR")))
test[is.na(test)] <- 0
dt <- data.frame(train_fa = train$fa_r, test_fa = test$fa_r)
model_fa <- glm(test_fa ~ train_fa, data = dt, family = binomial())
temp <- model_fa %>% summary()
summary_list[[j]] <- model_fa # [[ ]] stores the whole model object as one list element
variance[j] <- (temp$coefficients[2,2]^2 / 100)
estimates[j] <- temp$coefficients[2,1]
p_values[j] <- temp$coefficients[2,4]
j <- j + 1
}
# weight by inverse variance
unlist(estimates) %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -28.94699 0.00003 4.73588 5.50709 9.03137 35.12306
avg_b1_weighted <- unlist(estimates)[-14] * (1/unlist(variance)[-14])
avg_b1_weighted %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -29.04 0.00 14.36 12.13 21.04 46.75
unlist(p_values)[-14] %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0006893 0.1323214 0.4150028 0.4334702 0.7445569 1.0000000
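Regarding the "weight by inverse variance" step: multiplying each estimate by `1/variance` gives weighted values but not yet a pooled average. A minimal sketch of the usual inverse-variance weighted mean follows, assuming the variances are squared standard errors of the slopes. The helper name `ivw_mean` and the example numbers are hypothetical.

```r
# Hypothetical helper: inverse-variance weighted mean of per-participant slopes.
# `est` holds the slope estimates, `var` their squared standard errors.
ivw_mean <- function(est, var) {
  w <- 1 / var
  c(estimate = sum(w * est) / sum(w),  # pooled slope
    se = sqrt(1 / sum(w)))             # standard error of the pooled slope
}

# Made-up example: two equally precise estimates pool to their plain mean
pooled <- ivw_mean(est = c(1, 3), var = c(1, 1))
```

With the objects above this could be called as `ivw_mean(unlist(estimates)[-14], unlist(variance)[-14])`, keeping in mind that `variance` is currently stored as the squared standard error divided by 100, so the resulting standard error would need rescaling.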
Next steps:
In another sample, we captured image familiarity somewhat more explicitly, using a two-alternative, forced-choice task where participants chose the “more familiar” face from pairs of two, randomly paired, GAN face images.
fam_fc_cor <- list()
for (i in 1:100){
set.seed(i+100)
shuffled_ids <- d$subject_id %>% unique() %>% sample(size = length(unique(d$subject_id)))
end <- length(shuffled_ids)
split <- floor(length(shuffled_ids)/2) # integer split point, so an odd number of participants does not drop anyone
ids_1 <- data.frame(id = shuffled_ids[1:split])
ids_2 <- data.frame(id = shuffled_ids[(split+1):end])
half_1 <- d[d$subject_id %in% ids_1$id,]
half_2 <- d[d$subject_id %in% ids_2$id,]
seen_1 <- (table(half_1$left) + table(half_1$right)) %>% as.data.frame()
seen_2 <- (table(half_2$left) + table(half_2$right)) %>% as.data.frame()
# get rates based on number of times seen
list_1 <- half_1$chosen %>%
table() %>% as.data.frame()
list_2 <- half_2$chosen %>%
table() %>% as.data.frame()
list_1$rate <- list_1$Freq / seen_1$Freq
list_2$rate <- list_2$Freq / seen_2$Freq
list_1$image_code <- gsub("[^0-9]", "", list_1$.) %>% as.numeric()
list_2$image_code <- gsub("[^0-9]", "", list_2$.) %>% as.numeric()
list_1 <- list_1[order(list_1$rate, list_1$image_code, decreasing = TRUE),]
list_2 <- list_2[order(list_2$rate, list_2$image_code, decreasing = TRUE),]
fam_fc_cor[i] <- cor.test(x = list_1$image_code, y = list_2$image_code, method = 'kendall')$estimate
}
fam_fc_cor %>% unlist() %>% as.data.frame() %>%
summarise(average = mean(.),
min = min(.),
max = max(.),
sd = sd(.))
## average min max sd
## 1 0.01010505 -0.1284848 0.2113131 0.0638566
# idea is for each person, for each image, use the responses from everyone else for that image to predict what will be said for that image
# d <- d[!grepl("mickey", d$chosen),]
estimates <- list()
p_values <- list()
j <- 1
for (id in unique(d$subject_id)){
curr_id <- id
train <- d %>% filter(!subject_id == curr_id) %>%
group_by(image_code) %>%
select(subject_id,left, right, chosen, image_code)
seen <- (table(train$left) + table(train$right)) %>% data.frame() # parentheses needed: %>% binds tighter than +
chosen <- train$image_code %>% table() %>% as.data.frame()
train_rate <- chosen$Freq / seen$Freq
rm(seen)
rm(chosen)
test <- d %>% filter(subject_id == curr_id) %>%
group_by(image_code) %>%
select(subject_id,left, right, chosen, image_code)
test_all <- data.frame(image = unique(d$image_code), response = numeric(100))
test_all[which(test_all$image %in% test$image_code),]$response <- 1
dt <- data.frame(train_rate = train_rate, test_chosen = test_all$response)
model <- glm(test_chosen ~ train_rate, data = dt)
temp <- model %>% summary()
estimates[j] <- temp$coefficients[2,1]
p_values[j] <- temp$coefficients[2,4]
j <- j + 1
}
estimates %>% unlist() %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.85137 -0.25283 0.04692 0.01249 0.34144 0.69152
p_values %>% unlist() %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04365 0.31079 0.51567 0.52470 0.77868 0.96710
Next steps:
“by item consistency”
In our third experiment we obtained a hyper-explicit measure of familiarity. Participants rated 100 “unfamiliar” GAN face images on a sliding scale between “Not at all familiar” and “Extremely familiar” (N = 43).
## Warning: Missing column names filled in: 'X1' [1]
## # A tibble: 6 x 5
## X1 subject_id image image_code response
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 0l70vzbs resources/face_55_resized.jpg 55 58
## 2 2 0l70vzbs resources/face_291_resized.jpg 291 60
## 3 3 0l70vzbs resources/face_359_resized.jpg 359 39
## 4 4 0l70vzbs resources/face_560_resized.jpg 560 73
## 5 5 0l70vzbs resources/face_228_resized.jpg 228 67
## 6 6 0l70vzbs resources/face_428_resized.jpg 428 68
d <- d %>% filter(!grepl("mickey", d$image))
fam_rating_cor <- list()
for (i in 1:100){
set.seed(i+100)
shuffled_ids <- d$subject_id %>% unique() %>% sample(size = length(unique(d$subject_id)))
end <- length(shuffled_ids)
split <- round(length(shuffled_ids)/2)
ids_1 <- data.frame(id = shuffled_ids[1:split])
ids_2 <- data.frame(id = shuffled_ids[(split+1):end])
half_1 <- d[d$subject_id %in% ids_1$id,]
half_2 <- d[d$subject_id %in% ids_2$id,]
list_1 <- half_1 %>%
group_by(image_code) %>%
summarise(response = mean(response))
list_2 <- half_2 %>%
group_by(image_code) %>%
summarise(response = mean(response))
list_1 <- list_1[order(list_1$response, list_1$image_code, decreasing = TRUE),]
list_2 <- list_2[order(list_2$response, list_2$image_code, decreasing = TRUE),]
fam_rating_cor[i] <- cor.test(x = list_1$image_code, y = list_2$image_code, method = 'kendall')$estimate
}
fam_rating_cor %>% unlist() %>% as.data.frame() %>%
summarise(average = mean(.),
min = min(.),
max = max(.),
sd = sd(.))
## average min max sd
## 1 0.03682828 -0.1090909 0.1858586 0.06040734
estimates <- list()
p_values <- list()
j <- 1
for (id in unique(d$subject_id)){
curr_id <- id
train <- d %>% filter(!subject_id == curr_id) %>%
group_by(image_code) %>%
summarise(response = mean(response))
test <- d %>% filter(subject_id == curr_id) %>%
group_by(image_code) %>%
summarise(response = mean(response))
dt <- data.frame(train_resp = train$response, test_resp = test$response)
model <- glm(test_resp ~ train_resp, data = dt)
temp <- model %>% summary()
estimates[j] <- temp$coefficients[2,1]
p_values[j] <- temp$coefficients[2,4]
j <- j + 1
}
estimates %>% unlist() %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.44646 0.04338 0.33259 0.67433 1.16969 3.97376
p_values %>% unlist() %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01283 0.24081 0.28721 0.48350 0.98280
Unsurprisingly, the three faces with high mean responses are the three Mickey Mouse catch trials.
# do Cronbach's alpha
d$image_code <- as.numeric(as.factor(d$image)) # dummy code as suggested
d %>%
select(response, image_code) %>%
psych::alpha(check.keys = T)
## Warning in psych::alpha(., check.keys = T): Some items were negatively correlated with total scale and were automatically reversed.
## This is indicated by a negative sign for the variable name.
##
## Reliability analysis
## Call: psych::alpha(x = ., check.keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.075 0.076 0.039 0.039 0.082 0.029 53 21 0.039
##
## lower alpha upper 95% confidence boundaries
## 0.02 0.08 0.13
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## response- 0.041 0.039 0.0015 0.039 0.041 NA 0 0.039
## image_code 0.038 0.039 0.0015 0.039 0.041 NA 0 0.039
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## response- 4000 0.74 0.72 0.14 0.039 55 30
## image_code 4000 0.71 0.72 0.14 0.039 50 29
Not at all reliable! How do I compute an alpha without this dummy coding?
That makes sense because this measure is experience-based…
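On the question of computing alpha without the dummy coding: one option, sketched here as an assumption rather than a recommendation, is to reshape the long data into an image-by-participant matrix and treat each participant as an "item", so that alpha reflects how consistently raters order the images. The helper `cronbach_alpha` applies the standard formula in base R, and the toy data are made up. The same wide matrix could presumably be passed to `psych::alpha()` instead.

```r
# Hypothetical helper: Cronbach's alpha for a matrix whose rows are images
# and whose columns are participants (raters treated as "items").
cronbach_alpha <- function(m) {
  m <- m[stats::complete.cases(m), , drop = FALSE]  # images rated by everyone
  k <- ncol(m)
  (k / (k - 1)) * (1 - sum(apply(m, 2, var)) / var(rowSums(m)))
}

# Toy long-format data (made up): three raters in perfect agreement
toy <- expand.grid(image_code = 1:5, subject_id = c("a", "b", "c"))
toy$response <- toy$image_code * 10

# Long -> wide: one row per image, one column per participant
wide <- tapply(toy$response, list(toy$image_code, toy$subject_id), mean)
alpha_toy <- cronbach_alpha(wide)
```

Perfect agreement between raters gives an alpha of 1 in this toy case. Whether raters or images should play the role of "items" depends on which kind of consistency is of interest, so that choice is worth checking before interpreting the number.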
Next steps:
Beyond determining whether different measurements of familiarity are consistent with one another, we were also interested in capturing the experience of familiarity. To this end, our third experiment included a survey, in which participants rated their agreement with statements about different strategies for rating familiarity (e.g. “I was likely to rate a face as familiar when it looked like a celebrity”; “I used specific face features when rating familiarity”). ~data visualization to come~
Next steps:
In addition, participants provided a short written response to the question: “How would you describe ‘familiarity’?”. The results indicate varying working definitions of familiarity, reflected by diverse use of the familiarity scale.
Many adhere to a more traditional definition:
To some, it can be a property of the face or image, related to memory… sometimes conflated with “memorability”:
Some describe it as a “knowing”:
To others, a feeling of closeness:
Some illustrative combinations of these things:
For those with more rigid definitions of familiarity, consistently low ratings were given to all of the unfamiliar GAN faces, while consistently high ratings were given to the Mickey Mouse catch trials.
Getting Turker response of concrete vs. abstract definitions based on stuff in the written responses
We computed ranked correlations by “familiarity” score across all three experiments… Our results indicate that face images can have measurable, visually “familiar” properties that roughly correspond to increasingly more explicit measures of familiarity, suggesting some agreement between perceptual familiarity, and the individual experience of familiarity.
is it consistent and is it distinct from memorability? don’t overpromise, lead with reasons why we would choose three tasks
Face images can be intrinsically “memorable”– a measurable visual property independent from individual experience (Bainbridge et al., 2013).
Many of us have experienced a sense of “familiarity” for faces we have never seen before (e.g. at the extreme, pictures of celebrity look-alikes). Is the phenomenon of face familiarity one of individual experience and memory (a sense of “knowing” or “feeling”), or a consequence of intrinsic properties of the face image itself? We obtained three measures of familiarity for 100 hyper-realistic, GAN-generated face images and examined whether these face images were consistently rated as familiar across participants, measurements, and conceptual definitions.
To establish whether, and which, images in our dataset have salient, “familiar”, visual properties, we employed a memorability task paradigm that captured both visual memorability scores for 103 images (100 “unfamiliar” GAN faces, 3 images of Mickey Mouse), as well as inferred measurements of image familiarity (Bainbridge et al., 2016). We expected highly “familiar” images identified in this experiment to be consistently highly-scored under other measurements of familiarity.
In another sample, we captured image familiarity somewhat more explicitly, using a two-alternative, forced-choice task. Here, participants chose the “more familiar” face from pairs of two GAN face images, randomly paired without repeats. For each trial, images either get a score of 1 (familiar) or 0 (not familiar). The resulting score for each image across all participants serves as a measure of an image’s generalized familiarity score.
In our third experiment we obtained explicit measures of familiarity. Participants rated 100 “unfamiliar” GAN face images on a sliding scale between “Not at all familiar” and “Extremely familiar”. This is intended to capture the subtleties of “familiarity” for a given face, in comparison to the time and task pressures of the forced choice and memorability tasks.
Beyond determining whether our three measurements of familiarity are consistent with one another, we were also interested in capturing the experience of familiarity. To this end, our third experiment included a survey, in which participants rated their agreement with statements about different strategies for rating familiarity. Participants also provided a short written response to the question: “How would you describe ‘familiarity’?”.
Our results indicate that familiarity is likely not an intrinsic visual feature of face images, but rather a subjective rating of individual experience. Approximating measurements of familiarity with memory, forced-choice, or explicit rating tasks yields inconsistent definitions, and therefore inconsistent measurements, of what “familiar” is.