My train of thought on an index of isolation that takes into account all the data points.
Main motivation: To not have to pick an arbitrary threshold where we stop consider distances based on their ranks. This is nitpicky but maybe a bit cleaner.
I’m basing this on a dataframe that Eleonora sent me.
Let’s get it into a different shape, keeping only the 5 nearest prototypes for a given image idx
nms <- df_raw %>% #info to be added back in a second
select('X1', 'topname', 'domain') %>%
rename(idx = 'X1')
df <- df_raw %>%
select(-c('img_link','index','index_domain','topname', 'domain')) %>%
rename(idx = 'X1') %>%
pivot_longer(!idx, names_to = 'prototypes', values_to = 'dist') %>%
group_by(idx) %>%
slice_min(dist,n=5) %>%
left_join(nms)
paged_table(df, options = list(rows.print = 15))
As a quick impression, I get the feeling that
idx 0, the second closest prototype is pretty far from the first; and the third is super close to the second. This is the kind of information we would be missing; or superflously passing to the model; depending on the rank on which we cut.A useful index should tell us something about which prototypes are comparatively close to the image, and then not care much about anything further away. A priori, it’s not clear how sensitive it should be, but here are some steps that may be sensible:
How large should \(\gamma\) be? Let’s plot some options out
#Not the most beautiful code, but I'm trying to be very explicit here
x <- seq(from=-0,to=-0.99,by=-0.01) #some possible outcomes of d(i,p) - d(i,p'), from 0 to -0.99
#5 different options for gamma
y1 <- exp(x)**1
y5 <- exp(x)**5
y10 <- exp(x)**10
y20 <- exp(x)**20
y50 <- exp(x)**50
dd <- data.frame(x = rep(x,5), #put it all in a dataframe for ease of plotting
y = c(y1,y5,y10,y20,y50),
gamma = as.factor(c(rep('1',length(x)), rep('5', length(x)), rep('10', length(x)), rep('20', length(x)), rep('50', length(x)) ))
)
dd %>% ggplot(aes(x = x, y=y, col=gamma)) + geom_point() + theme_minimal()
I don’t think we should overthink this, but \(\gamma\) in the range of \([20,50]\) looks better than in \([1,10]\).
The new index of isolation may then be:
\[idx(i) = (\sum_{p'} \text{exp}(d(i,p) - d(i,p'))^{33}) - 1\]
Most of the summation terms will add (almost) nothing. We only care about what’s really close to \(i\), relative to \(d(i,p)\)
get_idx <- function(df_in,gma){
df_out <- df_in %>%
select(-c('img_link','index','index_domain','topname', 'domain')) %>%
rename(idx = 'X1') %>%
pivot_longer(!idx, names_to = 'prototypes', values_to = 'dist') %>%
group_by(idx) %>%
mutate(closest_dist = min(dist)) %>%
mutate(isolation = sum(exp(closest_dist - dist)**gma) - 1)
}
df_isol <- df_raw %>% get_idx(33)
Another look at the same data from above (tail and head only, since we’re maxing out on rows-printed otherwise), now with the new index of isolation, ordering from most to least isolated.
ex <- df %>% left_join(select(df_isol, -closest_dist)) %>% arrange(isolation)
ex %>% head(n=1000) %>% paged_table(options = list(rows.print = 15))
)
So the image with ID \(22181\) is a pretty prototypical dugout but pretty far off from everything else. Let’s see:
By contrast, the image with ID \(19873\) is a boat, but maybe a bench? or a train? It’s very un-isolated… and likely very odd. Let’s look at it as well: