Learning the rules of image transformation

Benjamin Haley 2014
text - code - both

How a computer can learn the rules of rotation, reflection, scaling, translation, and many other transformations that images can undergo.

We recognize images despite transformations.

As your eyes move across this sentence, the image hitting your retina is constantly changing. Yet you hardly notice. One reason you do not is because your brain recognizes letters regardless of their position in your field of view.

Consider the following image. Look first at the blue dot and then at the red. Notice that the number ‘2’ between them is recognizable regardless of your focus. This, despite the fact that the image is falling on a completely different set of neurons.

Images go through many such transformations. They reverse, rotate, scale, translate and distort in many ways we have no words for. That’s not to mention all the changes in lighting that can occur. Through all of this they remain recognizable.

The number of transformations that can happen to an image is infinite, but that does not mean that all transformations are possible or probable. Many never occur in the real world and our brain cannot recognize the images after these improbable transformations.

But computers are bad at learning the rules of image transformation.

The rules of transformation are important for anyone who wants to teach a computer how to process images. The algorithms that are best at image recognition learn a representation of the world that considers many shifts of focus. These are called translations as illustrated above.

However, these algorithms do not learn that images can be translated, the way they learn to recognize digits. Instead the laws of translation are programmed into the algorithm by the researcher.

Would it be possible to have computers learn about translation without telling them explicitly? What about the myriad other transformations that are possible?

I propose that these transformations can be learned.

I propose they can, and submit the following experiment as evidence. First, I show that we can learn that flipping an image upside down is a valid transformation, but randomly rearranging the pixels is not. Then I show that examples each of the aforementioned transformations can be discovered. Finally, I wax poetic about the future of this kind of work.

note: I can’t claim that this work is unique. I just hope that it is interesting.

# Setup
# 1. Load in a few libraries.
# 2. Define some helpful functions.
# 3. Load a sample of MNIST data.
# 4. Convert it to a useful format.

# Libraries
library(ggplot2)
library(reshape2)
library(plyr)
library(dplyr)
library(RCurl)
library(xtable)
library(gtools)

# Functions to map pixel number to x, y coordinates (and back)
y <- function(pixel) -floor(pixel / 28) + 28
x <- function(pixel) pixel %% 28
pixel <- function(x, y) (28 - y) * 28 + x

# Compress all columns to one
concatenate_columns <- function(df) do.call("paste", df)

# Convert data 
# from one image per row to one pixel per row
elongate <- function(d=data){
  d <- d[,order(names(d))] %>%
    mutate(pattern = concatenate_columns(d),
           id = 1:nrow(d))
  d <- melt(d, 
            id.vars=c('id', 'pattern'), 
            variable.name='pixel', 
            value.name='intensity')
  d <- d %>%
    mutate(pixel = as.numeric(as.character(pixel)))
  d
}

# A minimalist theme for plotting
theme_nothing = function(...) theme(
  line=element_blank(),
  text=element_blank(),
  title=element_blank(),
  rect=element_blank(),
  legend.position="none")

# Show images of handwritten digits
show <- function(g) {
  ggplot(g, aes(x(pixel), y(pixel), alpha=intensity)) + 
    geom_tile() + 
    facet_wrap(~ pattern + id, ncol=20) + 
    geom_tile() + 
    scale_alpha_continuous(range=c(0.0, 1.0)) +
    theme_nothing()
}
# Highlight a special region in red
show_of_interest <- function(g) {
  g <- g %>%
    mutate(of_interest = as.numeric(of_interest))
  ggplot(g, aes(x(pixel), y(pixel), alpha=intensity + 0.5*of_interest)) + 
    geom_tile() + 
    facet_wrap(~ pattern + id, ncol=20) + 
    geom_tile(aes(fill=of_interest)) + 
    scale_alpha_continuous(range=c(0.0, 1.0)) +
    scale_fill_continuous(low="black", high="red") + 
    theme_nothing()
}

# Return a dataframe with the frequency and count
# of each unique value of x
tabulate_frequencies <- function(x) {
  data.frame(value=x) %>% 
    group_by(value) %>% 
    summarize(count = length(value),
              frequency = count/length(x))
}

# Determine the log likeliood of the patterns
# in y given the patterns in x using the multinomial distribution [1].
# 
# [1]: https://en.wikipedia.org/wiki/Multinomial_distribution
loglikelihood <- function(x, y){
  x <- tabulate_frequencies(x)
  y <- tabulate_frequencies(y)
  n <- sum(y$count)

  d <- merge(x %>% select(-count), 
             y %>% select(-frequency))
  
  d <- d %>%
    mutate(loglikelihood= log(frequency) * count - lfactorial(count)) %>%
    summarize(loglikelihood = sum(loglikelihood))
  
  as.numeric(d) + lfactorial(n)
}

# Switch pixels around
reorder <- function(data, new_order) {
  names(data) <- names(data)[new_order]
  data <- data[,as.character(sort(as.numeric(names(data))))]
  data
}

# Data
# A sample from MNIST, of handwritten digits.  Originally from Yann
# LeCunn's site [1], but actually sampled from Joseph Redmon's handy
# csv version [2].
# 
# [1]: http://yann.lecun.com/exdb/MNIST/
# [2]: http://www.pjreddie.com/projects/MNIST-in-csv/
data <- "https://raw.githubusercontent.com/benjaminhaley/transformation/master/mnist_medium.csv"
data <- read.csv(textConnection(getURL(data)))

# Convert to binary
# Originally pixel intensity is 0-256, make it on or off
data <- data.frame(data > 100) + 0
names(data) <- c(0:783)

# Make some smaller datasets
# for use when we only need a little data
tiny <- head(data, 3)
small <- head(data, 100)
medium <- head(data, 1000)

What data are we using?

I will use MNIST data, a handy collection of handwritten digits.

show(elongate(tiny))

plot of chunk unnamed-chunk-2

Where to focus?

We will focus on a small region of data. Specifically, the three vertical pixels highlighted in each digit below.

focus <- sort(pixel(x=13, y=13:15))
show_of_interest(elongate(tiny) %>%
                   mutate(of_interest = pixel %in% focus))

plot of chunk unnamed-chunk-3

What patterns are common?

Next we look at the pixel patterns across many images.

show(elongate(small[,focus]))

plot of chunk unnamed-chunk-4

We see that certain patterns are more common than others. For example, there are many cases where one of the three pixels is blank but only one case where this is the middle pixel.

Clearly the patterns observed are not random ones.

Upside down, the patterns have similar frequencies.

Then we flip the pixels upside down and look at the patterns.

show(elongate(reorder(small[,focus], c(3, 2, 1))))

plot of chunk unnamed-chunk-5

Notice that the patterns have roughly the same frequency as before. Flipping upside down does not substantially change the image.

But switching the first two pixels produces very different frequencies.

Finally, we make an improbable change, switching the first two pixels, while keeping the third in place.

show(elongate(reorder(small[,focus], c(1, 3, 2))))

plot of chunk unnamed-chunk-6

The patterns have a very different frequency than the prior cases. For example, the pattern where two filled in pixels surround a blank pixel is common, where it was only observed once in the previous two examples.

What’s going on?

We are seeing the difference between a probable image transformation, flipping upside down, and an improbable one, only switching the first two pixels. Objects in the real world flip upside down regularly. But, unless we are stretching taffy or entertaining contortionists, it is unusual to see the middle of something switch places with its top.

After a probable transformation, the image retains the same patterns as any other image. After an improbable transformation the image contains improbable patterns.

Can we quantify the effect?

We can estimate the likelihood of each reordering given the frequencies observed in the original order. For example, if a pattern, off-on-on, occurred 20% of the time in the original image then it will most likely occur 20% of the time in a valid rearrangement of the image. Concretely we use the multinomial distribution:

\(\frac{n!}{x_1!\cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}\)

Where \(n\) is the total number of samples, \(x_i\) is the number of samples that have some particular value, \(p_i\) is the frequency of some particular value, and \(k\) is the number of unique values.

orders <- permutations(3, 3, 1:3)
g <- ldply(1:nrow(orders), function(o) {
  order <- orders[o,]
  data.frame( 
    loglikelihood = loglikelihood(concatenate_columns(medium[,focus]), 
                                  concatenate_columns(reorder(medium[,focus], order))),
    order = paste(order, collapse=' ')
  )
})

print(xtable(g %>% select(order, loglikelihood), align=c('center', 'right', 'right')), 
      type="html",
      include.rownames = FALSE,
      html.table.attributes = getOption("xtable.html.table.attributes", "border=0"))

order	loglikelihood
1 2 3	-19.92
1 3 2	-558.85
2 1 3	-596.12
2 3 1	-597.38
3 1 2	-583.60
3 2 1	-45.92

We see quantitative evidence that reinforces our visual proof and our intuition. The reverse order, “3 2 1”, is more similar to the original order, “1 2 3”, than any other possible transformation.

Let us consider a wider frame of reference.

Up until now, we have considered a very limited set of transformations, each possible order of three pixels. Now let’s focus on a wider region. We will continue to use the three pixels as before, but we will compare them to a wider field, the 12 surrounding pixels, highlighted below.

# Define a zone of interest
# A range of pixels where we will consider every set of three points
zone = expand.grid(
  x=12:14,
  y=12:16) %>%
  mutate(pixel=pixel(x, y))
zone <- sort(zone$pixel)

# Show the region of interest
g <- elongate(tiny) %>%
  mutate(of_interest = 
           pixel %in% zone +
           pixel %in% focus)
show_of_interest(g)

plot of chunk unnamed-chunk-8

What transformations are likely?

Now let us consider each set of three pixels from this wider field in each of their possible orders. As before, we will use the multinomial distribution to determine how similar each set and order is to our original three pixels.

# Define all sets of three points within the zone
# in all possible orders
points = expand.grid(
  p1 = zone,
  p2 = zone,
  p3 = zone
) %>% 
  filter(p1 != p2, p2 != p3, p1 != p3)  

# For each order of each set of three points, 
# determine the likelihood of the order, compared to the baseline
points <- ldply(1:nrow(points), function(row) {
  loglikelihood = with(points[row,],
                       loglikelihood(concatenate_columns(data[,focus]),
                                     concatenate_columns(data[,c(p1, p2, p3)])))
  cbind(points[row,], 
        loglikelihood=loglikelihood, 
        id=paste(sort(as.numeric(points[row,])), collapse=' '))
})

# Rank the likelihood results
points <- points %>%
  mutate(rank=rank(-loglikelihood))

# Show the highest ranked results
ggplot(points %>% filter(rank <= 70)) +
  geom_segment(aes(x=11, y=17, xend=15, yend=17), size=0.1) +
  geom_tile(aes(x(p1), y(p1)), fill='black') +
  geom_tile(aes(x(p2), y(p2)), fill='black') +
  geom_tile(aes(x(p3), y(p3)), fill='black') +
  geom_point(aes(x(p2), y(p2)), color='red', alpha=0.9) +
  facet_wrap(~ rank, nrow=7) +
  theme_nothing()

plot of chunk unnamed-chunk-9

Here we see the sets of pixels most like our original three. The most likely set (our original three itself) occupies the top left corner and the subsequent images show other sets in order of descending likelihood. For clarity, a red dot has been put on the ‘middle’ pixel, where ‘middle’ is defined as the middle pixel in the original image.

First, note that the middle pixel always stays in the middle through all of the likely transformations. This is because likely transformations tend to maintain order (except that they may flip entirely as illustrated before).

Next, notice that we see examples of each of the likely transformations that we already know to exist.

Rotations are common, vertical and horizontal patterns appearing nearly equally often.
Reversals are similarly common, though they cannot be seen as the ends are indistinguishable.
Translations are ubiquitous as can be clearly seen by how the pixels shift within the region on
focus from left to right, top to bottom.
Distortions are common, very often we see not-quite straight lines, ones bent or stretched.
Scaling occurs, though it is rare, only occasionally do we see pixels more than one unit apart. I haven’t determined why scaling is rare in this image. Obviously in the real world scaling is very
common as you can see if you press your face against the screen.

Through all of these transformations, the basic pattern holds, the middle pixel in the original image remains the middle through each likely transformation. We see no examples of taffy stretching contortionism.

What transformations are unlikely?

Next lets look at those transformations ranked least likely. This will assure you that I have not hoodwinked you by divining patterns in the results that would be seen regardless of their order.

points <- points %>%
  mutate(reverse_rank = max(rank) + 1 - rank)

# Show the lowest ranked results
ggplot(points %>% filter(reverse_rank <= 70)) +
  geom_segment(aes(x=11, y=17, xend=15, yend=17), size=0.1) +
  geom_tile(aes(x(p1), y(p1)), fill='black') +
  geom_tile(aes(x(p2), y(p2)), fill='black') +
  geom_tile(aes(x(p3), y(p3)), fill='black') +
  geom_point(aes(x(p2), y(p2)), color='red', alpha=0.9) +
  facet_wrap(~ reverse_rank, nrow=7) +
  theme_nothing()

plot of chunk unnamed-chunk-10

Here we see the misfits, the unlikely patterns. Like before, except the top left is occupied by the least likely transformation.

The thing to notice here is the breakdown of our basic pattern. What was the middle, marked in red, is no longer in the middle. Instead we see the two tails abutting one another and the red middle cast aside. It stands to reason that these transformations are as unlikely as taffy and contortionists.

Conclusion

We understand that images can transform, but computers, generally, are blind to this fact. Here, I have given a simple visual demonstration of how those rules of transformation can be discovered by a computer using real world datasets with little external expertise.

Certainly, huge strides in computational efficiency would be necessary to make this a practical approach to image recognition problems, and it may well be that we can describe the rules of image transformation so well that a computer need never discover them on its own.

However, it is important to realize that such discovery is possible. And this does show practical promise for several reasons:

Not all transformations are easy to describe. Even this simple inquiry uncovered many likely image transformations which are difficult to describe formally. Likely images were stretched and bent in ways that are quite familiar to the eye, but difficult to describe using the geometric transformations we learned in high school. We can only explicitly teach computers those things that we can describe, for the rest, computers, like us, must learn on their own.
Not all datasets are so well understood. I have focused on image and image transformation, a subject both intuitive and well studied. But there are many datasets for which we have much less intuition and understanding. Think for instance of weather systems for which we have only recently developed sophisticated datasets and modeling tools or genetic sequences which are still 95% mysterious to us. In these less-familiar domains we might find that an algorithm which discovers the rules of transformation can quickly outpace the experts who attempt the same.
At some point new rules must be discovered. While we may be able to impart our computers with all the benefit of our expertise in the same way that we impart it upon children, at some point, they must reach the boundaries of what is known. If computers are to join us in the exploration of brave new domains of thought, they must become more adept thinkers. Discovering the rules of transformation as I have illustrated here is part of what makes us intelligent. It is a necessary step to producing truly intelligent machines.