Some thoughts on an index of isolation

true
June 1, 2021

Abstract

My train of thought on an index of isolation that takes into account all the data points.

Main motivation: To not have to pick an arbitrary threshold where we stop consider distances based on their ranks. This is nitpicky but maybe a bit cleaner.

Data

I’m basing this on a dataframe that Eleonora sent me.

Let’s get it into a different shape, keeping only the 5 nearest prototypes for a given image idx

nms <- df_raw %>%  #info to be added back in a second 
       select('X1', 'topname', 'domain') %>%
       rename(idx = 'X1')

df <- df_raw %>%
    select(-c('img_link','index','index_domain','topname', 'domain')) %>%
    rename(idx = 'X1') %>%
    pivot_longer(!idx, names_to = 'prototypes', values_to = 'dist') %>%
    group_by(idx) %>%
    slice_min(dist,n=5) %>%
    left_join(nms)
paged_table(df, options = list(rows.print = 15))

As a quick impression, I get the feeling that

A useful index?

A useful index should tell us something about which prototypes are comparatively close to the image, and then not care much about anything further away. A priori, it’s not clear how sensitive it should be, but here are some steps that may be sensible:

  1. Take the distance of image \(i\) to its closest prototype \(p\), \(d(i,p)\), as a reference point. This may be \(0.01\) or \(0.8\). That’s OK. What we care about is how this distance compares to everything else that follows. Thus,
  2. for all other prototypes \(p'\), calculate how close they are to \(i\) relative to \(d(i,p)\). A concrete proposal: \(d(i,p) - d(i,p')\). Since the second term is as large or larger than the first, this number is negative. The closer it is to \(0\) –the less distance it has to \(i\) relative to the distance \(i\) has to \(p\)– the more we should care about it.
  3. Transform \(d(i,p) - d(i,p')\) to reflect the fact that we only care if it’s small. A concrete proposal: \(exp(d(i,p) - d(i,p'))^\gamma\), \(\gamma \geq 1\). The bigger \(\gamma\), the less we care about larger differences to \(d(i,p)\).

How large should \(\gamma\) be? Let’s plot some options out

#Not the most beautiful code, but I'm trying to be very explicit here
x <- seq(from=-0,to=-0.99,by=-0.01) #some possible outcomes of d(i,p) - d(i,p'), from 0 to -0.99

#5 different options for gamma
y1 <- exp(x)**1
y5 <- exp(x)**5
y10 <- exp(x)**10
y20 <- exp(x)**20
y50 <- exp(x)**50

dd <- data.frame(x = rep(x,5), #put it all in a dataframe for ease of plotting
         y = c(y1,y5,y10,y20,y50),
         gamma = as.factor(c(rep('1',length(x)), rep('5', length(x)), rep('10', length(x)), rep('20', length(x)), rep('50', length(x)) ))
         )

dd %>% ggplot(aes(x = x, y=y, col=gamma)) + geom_point() + theme_minimal()

I don’t think we should overthink this, but \(\gamma\) in the range of \([20,50]\) looks better than in \([1,10]\).

The new index of isolation may then be:

\[idx(i) = (\sum_{p'} \text{exp}(d(i,p) - d(i,p'))^{33}) - 1\]

Most of the summation terms will add (almost) nothing. We only care about what’s really close to \(i\), relative to \(d(i,p)\)

Implementation

get_idx <- function(df_in,gma){
  df_out <- df_in %>%
    select(-c('img_link','index','index_domain','topname', 'domain')) %>%
    rename(idx = 'X1') %>%
    pivot_longer(!idx, names_to = 'prototypes', values_to = 'dist') %>%
    group_by(idx) %>%
    mutate(closest_dist = min(dist)) %>%
    mutate(isolation = sum(exp(closest_dist - dist)**gma) - 1)
}

df_isol <- df_raw %>% get_idx(33)

Another look at the same data from above (tail and head only, since we’re maxing out on rows-printed otherwise), now with the new index of isolation, ordering from most to least isolated.

ex <- df %>% left_join(select(df_isol, -closest_dist)) %>% arrange(isolation)

ex %>% head(n=1000) %>% paged_table(options = list(rows.print = 15))
ex %>% tail(n=1000) %>% paged_table(options = list(rows.print = 15))

)

So the image with ID \(22181\) is a pretty prototypical dugout but pretty far off from everything else. Let’s see:

By contrast, the image with ID \(19873\) is a boat, but maybe a bench? or a train? It’s very un-isolated… and likely very odd. Let’s look at it as well: