Abstract

My train of thought on an index of isolation that takes into account all the data points.

Main motivation: To not have to pick an arbitrary threshold where we stop consider distances based on their ranks. This is nitpicky but maybe a bit cleaner.

Data

I’m basing this on a dataframe that Eleonora sent me.

Let’s get it into a different shape, keeping only the 5 nearest prototypes for a given image idx

nms <- df_raw %>%  #info to be added back in a second 
       select('X1', 'topname', 'domain') %>%
       rename(idx = 'X1')

df <- df_raw %>%
    select(-c('img_link','index','index_domain','topname', 'domain')) %>%
    rename(idx = 'X1') %>%
    pivot_longer(!idx, names_to = 'prototypes', values_to = 'dist') %>%
    group_by(idx) %>%
    slice_min(dist,n=5) %>%
    left_join(nms)
paged_table(df, options = list(rows.print = 15))

As a quick impression, I get the feeling that

Generalizing quite a bit, approx. the first three closest prototypes may matter the most. It seems like whatever is ranked 4th or 5th is often already pretty far off. This accords with our intuitions. It’s also why using distance to first and distance to second closest prototype is generally pretty informative, as shown by Eleonora’s regressions
However, sometimes the second ranked prototype is very close; sometimes the third one is as well; but sometimes neither is. For example, for the image with idx 0, the second closest prototype is pretty far from the first; and the third is super close to the second. This is the kind of information we would be missing; or superflously passing to the model; depending on the rank on which we cut.

A useful index?

A useful index should tell us something about which prototypes are comparatively close to the image, and then not care much about anything further away. A priori, it’s not clear how sensitive it should be, but here are some steps that may be sensible:

Take the distance of image \(i\) to its closest prototype \(p\), \(d(i,p)\), as a reference point. This may be \(0.01\) or \(0.8\). That’s OK. What we care about is how this distance compares to everything else that follows. Thus,
for all other prototypes \(p'\), calculate how close they are to \(i\) relative to \(d(i,p)\). A concrete proposal: \(d(i,p) - d(i,p')\). Since the second term is as large or larger than the first, this number is negative. The closer it is to \(0\) –the less distance it has to \(i\) relative to the distance \(i\) has to \(p\)– the more we should care about it.
Transform \(d(i,p) - d(i,p')\) to reflect the fact that we only care if it’s small. A concrete proposal: \(exp(d(i,p) - d(i,p'))^\gamma\), \(\gamma \geq 1\). The bigger \(\gamma\), the less we care about larger differences to \(d(i,p)\).

How large should \(\gamma\) be? Let’s plot some options out

#Not the most beautiful code, but I'm trying to be very explicit here
x <- seq(from=-0,to=-0.99,by=-0.01) #some possible outcomes of d(i,p) - d(i,p'), from 0 to -0.99

#5 different options for gamma
y1 <- exp(x)**1
y5 <- exp(x)**5
y10 <- exp(x)**10
y20 <- exp(x)**20
y50 <- exp(x)**50

dd <- data.frame(x = rep(x,5), #put it all in a dataframe for ease of plotting
         y = c(y1,y5,y10,y20,y50),
         gamma = as.factor(c(rep('1',length(x)), rep('5', length(x)), rep('10', length(x)), rep('20', length(x)), rep('50', length(x)) ))
         )

dd %>% ggplot(aes(x = x, y=y, col=gamma)) + geom_point() + theme_minimal()

I don’t think we should overthink this, but \(\gamma\) in the range of \([20,50]\) looks better than in \([1,10]\).

The new index of isolation may then be:

\[idx(i) = (\sum_{p'} \text{exp}(d(i,p) - d(i,p'))^{33}) - 1\]

Most of the summation terms will add (almost) nothing. We only care about what’s really close to \(i\), relative to \(d(i,p)\)

Implementation

get_idx <- function(df_in,gma){
  df_out <- df_in %>%
    select(-c('img_link','index','index_domain','topname', 'domain')) %>%
    rename(idx = 'X1') %>%
    pivot_longer(!idx, names_to = 'prototypes', values_to = 'dist') %>%
    group_by(idx) %>%
    mutate(closest_dist = min(dist)) %>%
    mutate(isolation = sum(exp(closest_dist - dist)**gma) - 1)
}

df_isol <- df_raw %>% get_idx(33)

Another look at the same data from above (tail and head only, since we’re maxing out on rows-printed otherwise), now with the new index of isolation, ordering from most to least isolated.

Higher isolation means you’re less isolated… so I guess that needs to be turned around if we end up using this and don’t want our heads to explode

ex <- df %>% left_join(select(df_isol, -closest_dist)) %>% arrange(isolation)

ex %>% head(n=1000) %>% paged_table(options = list(rows.print = 15))

ex %>% tail(n=1000) %>% paged_table(options = list(rows.print = 15))

)

So the image with ID \(22181\) is a pretty prototypical dugout but pretty far off from everything else. Let’s see:

By contrast, the image with ID \(19873\) is a boat, but maybe a bench? or a train? It’s very un-isolated… and likely very odd. Let’s look at it as well:

Some thoughts on an index of isolation

Abstract

Data

A useful index?

Implementation