Benjamin Haley 2014
How a computer can learn the rules of rotation, reflection, scaling, translation, and many other transformations that images can undergo.
As your eyes move across this sentence, the image hitting your retina is constantly changing. Yet you hardly notice. One reason is that your brain recognizes letters regardless of their position in your field of view.
Consider the following image. Look first at the blue dot and then at the red. Notice that the number ‘2’ between them is recognizable regardless of where you focus, even though its image falls on a completely different set of neurons.
Images go through many such transformations. They reverse, rotate, scale, translate and distort in many ways we have no words for. That’s not to mention all the changes in lighting that can occur. Through all of this they remain recognizable.
The number of transformations that can happen to an image is infinite, but that does not mean that every transformation is possible or probable. Many never occur in the real world, and our brains cannot recognize images that have undergone these improbable transformations.
The rules of transformation are important for anyone who wants to teach a computer how to process images. The algorithms that are best at image recognition learn a representation of the world that accounts for many shifts of focus; such shifts are called translations, as illustrated above.
However, these algorithms do not learn that images can be translated the way they learn to recognize digits. Instead, the laws of translation are programmed into the algorithm by the researcher.
Would it be possible to have computers learn about translation without telling them explicitly? What about the myriad other transformations that are possible?
I propose they can, and submit the following experiment as evidence. First, I show that we can learn that flipping an image upside down is a valid transformation, but randomly rearranging its pixels is not. Then I show that examples of each of the aforementioned transformations can be discovered. Finally, I wax poetic about the future of this kind of work.
note: I can’t claim that this work is unique. I just hope that it is interesting.
# Setup
# 1. Load in a few libraries.
# 2. Define some helpful functions.
# 3. Load a sample of MNIST data.
# 4. Convert it to a useful format.
# Libraries
library(ggplot2)
library(reshape2)
library(plyr)
library(dplyr)
library(RCurl)
library(xtable)
library(gtools)
# Functions to map pixel number to x, y coordinates (and back)
y <- function(pixel) -floor(pixel / 28) + 28
x <- function(pixel) pixel %% 28
pixel <- function(x, y) (28 - y) * 28 + x
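As a quick sanity check, the two mappings should invert one another for every pixel index:
# pixel() should recover the index that x() and y() decompose
stopifnot(all(pixel(x(0:783), y(0:783)) == 0:783))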
# Compress all columns to one
concatenate_columns <- function(df) do.call("paste", df)
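For example, with made-up values, each row of a two-column frame collapses into one space-separated pattern string:
# Two binary columns become one pattern string per row
stopifnot(identical(concatenate_columns(data.frame(a=c(0, 1), b=c(1, 1))),
                    c("0 1", "1 1")))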
# Convert data
# from one image per row to one pixel per row
elongate <- function(d=data){
  d <- d[, order(names(d))] %>%
    mutate(pattern = concatenate_columns(d),
           id = 1:nrow(d))
  d <- melt(d,
            id.vars=c('id', 'pattern'),
            variable.name='pixel',
            value.name='intensity')
  d <- d %>%
    mutate(pixel = as.numeric(as.character(pixel)))
  d
}
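To illustrate with a toy frame (two made-up 4-pixel ‘images’ rather than real MNIST digits), elongate turns one row per image into one row per pixel:
# A 2-image, 4-pixel frame elongates to 2 * 4 = 8 pixel rows
toy_images <- data.frame(matrix(c(0, 1, 1, 0,
                                  1, 1, 0, 0), nrow=2, byrow=TRUE))
names(toy_images) <- c(0:3)
stopifnot(nrow(elongate(toy_images)) == 8)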
# A minimalist theme for plotting
theme_nothing <- function(...) theme(
  line=element_blank(),
  text=element_blank(),
  title=element_blank(),
  rect=element_blank(),
  legend.position="none")
# Show images of handwritten digits
show <- function(g) {
  ggplot(g, aes(x(pixel), y(pixel), alpha=intensity)) +
    geom_tile() +
    facet_wrap(~ pattern + id, ncol=20) +
    scale_alpha_continuous(range=c(0.0, 1.0)) +
    theme_nothing()
}
# Highlight a special region in red
show_of_interest <- function(g) {
  g <- g %>%
    mutate(of_interest = as.numeric(of_interest))
  ggplot(g, aes(x(pixel), y(pixel), alpha=intensity + 0.5*of_interest)) +
    geom_tile() +
    facet_wrap(~ pattern + id, ncol=20) +
    geom_tile(aes(fill=of_interest)) +
    scale_alpha_continuous(range=c(0.0, 1.0)) +
    scale_fill_continuous(low="black", high="red") +
    theme_nothing()
}
# Return a dataframe with the frequency and count
# of each unique value of x
tabulate_frequencies <- function(x) {
  data.frame(value=x) %>%
    group_by(value) %>%
    summarize(count = length(value),
              frequency = count/length(x))
}
# Determine the log likelihood of the patterns
# in y given the patterns in x using the multinomial distribution [1].
#
# [1]: https://en.wikipedia.org/wiki/Multinomial_distribution
loglikelihood <- function(x, y){
  x <- tabulate_frequencies(x)
  y <- tabulate_frequencies(y)
  n <- sum(y$count)
  d <- merge(x %>% select(-count),
             y %>% select(-frequency))
  d <- d %>%
    mutate(loglikelihood = log(frequency) * count - lfactorial(count)) %>%
    summarize(loglikelihood = sum(loglikelihood))
  as.numeric(d) + lfactorial(n)
}
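As a quick cross-check on made-up patterns: when a sample is compared against itself, the result should match R's built-in multinomial density, dmultinom():
# loglikelihood() should agree with stats::dmultinom on a self-comparison
obs <- c('a', 'a', 'b', 'c')
f <- tabulate_frequencies(obs)
stopifnot(isTRUE(all.equal(loglikelihood(obs, obs),
                           dmultinom(f$count, prob=f$frequency, log=TRUE))))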
# Switch pixels around
reorder <- function(data, new_order) {
  names(data) <- names(data)[new_order]
  data <- data[, as.character(sort(as.numeric(names(data))))]
  data
}
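For instance, with a toy three-column frame, the order c(3, 2, 1) reverses the columns:
# Reversing a toy three-pixel frame
toy <- data.frame(10, 20, 30)
names(toy) <- c(1, 2, 3)
reorder(toy, c(3, 2, 1))
#    1  2  3
# 1 30 20 10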
# Data
# A sample from MNIST, a collection of handwritten digits. Originally from Yann
# LeCun's site [1], but actually sampled from Joseph Redmon's handy
# csv version [2].
#
# [1]: http://yann.lecun.com/exdb/MNIST/
# [2]: http://www.pjreddie.com/projects/MNIST-in-csv/
data <- "https://raw.githubusercontent.com/benjaminhaley/transformation/master/mnist_medium.csv"
data <- read.csv(textConnection(getURL(data)))
# Convert to binary
# Originally pixel intensity is 0-255; make it on or off
data <- data.frame(data > 100) + 0
names(data) <- c(0:783)
# Make some smaller datasets
# for use when we only need a little data
tiny <- head(data, 3)
small <- head(data, 100)
medium <- head(data, 1000)
I will use MNIST data, a handy collection of handwritten digits.
show(elongate(tiny))
We will focus on a small region of the data. Specifically, the three vertically adjacent pixels highlighted in each digit below.
focus <- sort(pixel(x=13, y=13:15))
show_of_interest(elongate(tiny) %>%
mutate(of_interest = pixel %in% focus))
Next we look at the pixel patterns across many images.
# Select columns by pixel name; numeric indexing would be off by one
# because column k holds the pixel named k - 1
show(elongate(small[, as.character(focus)]))
We see that certain patterns are more common than others. For example, there are many cases in which one of the three pixels is blank, but only one case in which the blank pixel is the middle one.
Clearly the patterns observed are not random ones.
Then we flip the pixels upside down and look at the patterns.
show(elongate(reorder(small[, as.character(focus)], c(3, 2, 1))))
Notice that the patterns have roughly the same frequency as before. Flipping upside down does not substantially change the image.
Finally, we make an improbable change, switching the bottom two pixels while keeping the top in place.
show(elongate(reorder(small[, as.character(focus)], c(1, 3, 2))))
The patterns have very different frequencies than in the prior cases. For example, the pattern in which two filled-in pixels surround a blank pixel is common here, whereas it was observed only once in the previous two examples.
We are seeing the difference between a probable image transformation, flipping upside down, and an improbable one, switching only the bottom two pixels. Objects in the real world flip upside down regularly. But, unless we are stretching taffy or entertaining contortionists, it is unusual to see the middle of something switch places with its bottom.
After a probable transformation, the image retains the same patterns as any other image. After an improbable transformation the image contains improbable patterns.
We can estimate the likelihood of each reordering given the frequencies observed in the original order. For example, if a pattern, off-on-on, occurred 20% of the time in the original ordering, then it should occur roughly 20% of the time in any valid rearrangement of the image. Concretely, we use the multinomial distribution:
\(\frac{n!}{x_1!\cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}\)
Where \(n\) is the total number of samples, \(x_i\) is the number of samples that have some particular value, \(p_i\) is the frequency of some particular value, and \(k\) is the number of unique values.
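To make this concrete with made-up numbers: suppose there are only two patterns, with baseline frequencies 0.2 and 0.8, and a candidate reordering shows them 3 and 7 times across n = 10 samples. Then the log likelihood is:
# log(10!/(3! 7!)) + 3*log(0.2) + 7*log(0.8)
dmultinom(c(3, 7), prob=c(0.2, 0.8), log=TRUE)
# [1] -1.6028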
orders <- permutations(3, 3, 1:3)
g <- ldply(1:nrow(orders), function(o) {
  order <- orders[o,]
  data.frame(
    loglikelihood = loglikelihood(
      concatenate_columns(medium[, as.character(focus)]),
      concatenate_columns(reorder(medium[, as.character(focus)], order))),
    order = paste(order, collapse=' ')
  )
})
print(xtable(g %>% select(order, loglikelihood), align=c('c', 'r', 'r')),
      type="html",
      include.rownames = FALSE,
      html.table.attributes = getOption("xtable.html.table.attributes", "border=0"))
| order | loglikelihood |
|---|---|
| 1 2 3 | -19.92 |
| 1 3 2 | -558.85 |
| 2 1 3 | -596.12 |
| 2 3 1 | -597.38 |
| 3 1 2 | -583.60 |
| 3 2 1 | -45.92 |
We see quantitative evidence that reinforces our visual demonstration and our intuition. The reverse order, “3 2 1”, is more similar to the original order, “1 2 3”, than any other possible transformation.
Up until now, we have considered a very limited set of transformations: each possible order of three pixels. Now let’s widen our scope. We will continue to use the same three pixels as before, but we will compare them to a wider field, the 12 surrounding pixels, highlighted below.
# Define a zone of interest
# A range of pixels where we will consider every set of three points
zone <- expand.grid(
  x=12:14,
  y=12:16) %>%
  mutate(pixel=pixel(x, y))
zone <- sort(zone$pixel)
# Show the region of interest
g <- elongate(tiny) %>%
  mutate(of_interest =
           pixel %in% zone +
           pixel %in% focus)
show_of_interest(g)
Now let us consider each set of three pixels from this wider field in each of their possible orders. As before, we will use the multinomial distribution to determine how similar each set and order is to our original three pixels.
# Define all sets of three points within the zone
# in all possible orders
points <- expand.grid(
  p1 = zone,
  p2 = zone,
  p3 = zone
) %>%
  filter(p1 != p2, p2 != p3, p1 != p3)
# For each order of each set of three points,
# determine the likelihood of the order, compared to the baseline
points <- ldply(1:nrow(points), function(row) {
  # Index columns by pixel name to match the names assigned above
  loglikelihood <- with(points[row,],
    loglikelihood(concatenate_columns(data[, as.character(focus)]),
                  concatenate_columns(data[, as.character(c(p1, p2, p3))])))
  cbind(points[row,],
        loglikelihood=loglikelihood,
        id=paste(sort(as.numeric(points[row,])), collapse=' '))
})
# Rank the likelihood results
points <- points %>%
  mutate(rank=rank(-loglikelihood))
# Show the highest ranked results
ggplot(points %>% filter(rank <= 70)) +
  geom_segment(aes(x=11, y=17, xend=15, yend=17), size=0.1) +
  geom_tile(aes(x(p1), y(p1)), fill='black') +
  geom_tile(aes(x(p2), y(p2)), fill='black') +
  geom_tile(aes(x(p3), y(p3)), fill='black') +
  geom_point(aes(x(p2), y(p2)), color='red', alpha=0.9) +
  facet_wrap(~ rank, nrow=7) +
  theme_nothing()
Here we see the sets of pixels most like our original three. The most likely set (our original three itself) occupies the top left corner, and the subsequent images show other sets in descending order of likelihood. For clarity, a red dot marks the ‘middle’ pixel, meaning the pixel that plays the role the middle pixel played in the original ordering.
First, note that the middle pixel always stays in the middle through all of the likely transformations. This is because likely transformations tend to maintain order (except that they may flip entirely as illustrated before).
Next, notice that we see examples of each of the likely transformations that we already know to exist.
Through all of these transformations the basic pattern holds: the middle pixel in the original image remains the middle through each likely transformation. We see no examples of taffy-stretching contortionism.
Next, let’s look at the transformations ranked least likely. This should assure you that I have not hoodwinked you by divining patterns in the results that would be seen regardless of their order.
points <- points %>%
  mutate(reverse_rank = max(rank) + 1 - rank)
# Show the lowest ranked results
ggplot(points %>% filter(reverse_rank <= 70)) +
  geom_segment(aes(x=11, y=17, xend=15, yend=17), size=0.1) +
  geom_tile(aes(x(p1), y(p1)), fill='black') +
  geom_tile(aes(x(p2), y(p2)), fill='black') +
  geom_tile(aes(x(p3), y(p3)), fill='black') +
  geom_point(aes(x(p2), y(p2)), color='red', alpha=0.9) +
  facet_wrap(~ reverse_rank, nrow=7) +
  theme_nothing()
Here we see the misfits, the unlikely patterns, laid out as before except that the top left is now occupied by the least likely transformation.
The thing to notice here is the breakdown of our basic pattern. What was the middle, marked in red, is no longer in the middle. Instead we see the two tails abutting one another and the red middle cast aside. It stands to reason that these transformations are as unlikely as taffy stretching and contortionism.
We understand that images can transform, but computers are generally blind to this fact. Here, I have given a simple visual demonstration of how a computer can discover those rules of transformation from real-world datasets with little expert guidance.
Certainly, huge strides in computational efficiency would be necessary to make this a practical approach to image recognition problems, and it may well be that we can describe the rules of image transformation so well that a computer need never discover them on its own.
However, it is important to realize that such discovery is possible. And this does show practical promise for several reasons: