The Wide World of Physics.

I've been thinking more than usual lately about spatially representing the texts in the ArXiv. So in this post, I want to do two things: give a quick overview of the geography of the ArXiv, which may be interesting to some people, and dump a lot of code onto the Internet.

I've been doing most of my work lately in RStudio using Yihui Xie's excellent Knitr package. The idea is to combine code with text to allow, at once, literate programming and reproducible research.

That's produced a lot of works-in-progress which are a funny agglomeration of code and analysis. I thought it might be interesting to put one online.

# This defines some functions to pull data from the Bookworm into R. Not
# public yet, sorry.
source("../RBookworm.R")
require(plyr)
require(lubridate)
require(RCurl)
require(knitr)
opts_chunk$set(progress = TRUE, fig.width = 10, warning = FALSE, 
    cache = TRUE)

Mapping the ArXiv

When Paul Ginsparg shared the ArXiv text-base with the Cultural Observatory, I was most excited about the 'e-mail submitter' field. Every e-mail address has a domain; it should be possible to trace that domain to a place in the real world. Since ArXiv submitters are mostly scientists, that place is usually one of a broad range of educational institutions from around the world. (Or, increasingly, Mountain View, California: but that's an issue addressed a bit below.)

The core task is geo-coding every web domain that scientists have submitted from. This turns out to be somewhat hard, but doable with a combination of local DNS lookups (using the system's host command) and the useful public geolocation API at freegeoip.net. (The only problem is that they limit calls to 1,000 per hour.)

The first step is to use the Bookworm API to pull the e-mail domains. We previously parsed these down to either the last two parts of the e-mail address (e.g., 'uconn.edu') or, if the second-to-last part is 'ernet,' 'ac,' or 'edu,' the last three parts (which lets us keep, e.g., 'jcu.edu.au' and 'ox.ac.uk'). There are roughly 7,500 distinct e-mail domains in the ArXiv.
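For concreteness, here's a minimal sketch of that trimming rule in R. The real parsing happened before the texts went into the Bookworm, so trimDomain is a hypothetical helper for illustration, not the actual ingest code.

# A hypothetical helper illustrating the domain-trimming rule described
# above; the real parsing happens when the Bookworm is built.
trimDomain = function(address) {
    parts = strsplit(gsub(".*@", "", address), "\\.")[[1]]
    n = length(parts)
    # Keep three parts when the second-to-last is 'ac', 'edu', or 'ernet';
    # otherwise keep the last two.
    keep = if (n > 2 && parts[n - 1] %in% c("ac", "edu", "ernet")) 3 else 2
    paste(tail(parts, keep), collapse = ".")
}
trimDomain("someone@physics.ox.ac.uk")  # 'ox.ac.uk'
trimDomain("someone@phys.uconn.edu")  # 'uconn.edu'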

query = list(groups = list("mld"), search_limits = list(word = list()), 
    database = "arxiv", counttype = "Raw_Counts")
countries = topn(query, n = 1e+06, min = 5)
## Loading required package: rjson
length(countries)
## [1] 7505

The most common are the following: gmail.com, mpg.de, infn.it, cern.ch, harvard.edu, mit.edu, cam.ac.uk, caltech.edu, princeton.edu, berkeley.edu, stanford.edu, u-tokyo.ac.jp, jussieu.fr, kyoto-u.ac.jp, ernet.in.
Gmail is number one, but the rest should be good, traceable institutions. (Except for ernet.in; that will be fixed in the next go-round.)

Now I define a function to pull the location: a local DNS lookup with host, followed by a request to freegeoip.net.

getLOC = function(location) {
    # Start with an all-NA row so that failed lookups still return something.
    data = structure(list(city = NA, region_code = NA, region_name = NA, metrocode = NA, 
        zipcode = NA, longitude = NA, latitude = NA, country_code = NA, ip = NA, 
        country_name = NA), .Names = c("city", "region_code", "region_name", 
        "metrocode", "zipcode", "longitude", "latitude", "country_code", "ip", 
        "country_name"))
    # Look the domain up locally with `host`.
    ip = system(paste0("host ", location), intern = T)
    # The mail server is usually better than the first line, it turns out, so
    # long as the mail isn't handled by Google. That's what the next few lines
    # check for. Some extensions here might improve the accuracy in other
    # similar cases.
    mailServers = ip[grep("is handled by \\d+", ip)]
    if (sum(c(grepl("google", location), grepl("gmail", location))) == 0) {
        mailServers = mailServers[grep("GOOGLE", mailServers, invert = T)]
        mailServers = mailServers[grep("google", mailServers, invert = T)]
        mailServers = mailServers[grep("gmail", mailServers, invert = T)]
        mailServers = mailServers[grep("psmtp.com", mailServers, invert = T)]
    }
    if (length(mailServers) > 0) {
        # Resolve the first mail server that survived the filtering above.
        ip = system(paste0("host ", gsub(".* is handled by \\d+ ", "", mailServers[1])), 
            intern = T)
    }

    ip = gsub(".* address ", "", ip[1])
    # Then pull the geodatabase information for the IP address.
    require(rjson)
    try(assign("data", fromJSON(getURL(paste0("http://freegeoip.net/json/", 
        ip)))), silent = T)

    returnt = data.frame(data, stringsAsFactors = F)
    returnt$location = location
    returnt
}

That returns results like the following, for “carleton.edu”:
Northfield, MN, Minnesota, 613, 55057, -93.2007, 44.4614, US, 137.22.198.33, United States, carleton.edu
While those results look impressively precise, they're often for the home city, not the institution itself: compare Carleton to St. Olaf:
Northfield, MN, Minnesota, 613, 55057, -93.2007, 44.4614, US, 130.71.96.8, United States, stolaf.edu
That means a lot of submitters appear to be e-mailing from the center of a city like Moscow or Beijing, which is also occasionally used as a stand-in for the whole country.

In any case: the next step is to keep a list of the IP addresses on disk.

addips = function(locations, ipTable = "locations.txt", atAtime = 10) {
    ips = read.table(ipTable, stringsAsFactors = F, header = T)
    toget = locations[!locations %in% ips$location]
    toget = toget[1:min(length(toget), atAtime)]
    new = ldply(toget, getLOC)
    new = new[!is.na(new$location), ]
    ips = rbind(ips, new)
    write.table(x = ips, file = ipTable)
    ips
}
# I ran this function several times by hand to avoid exceeding freegeoip's
# limit of 1000 queries an hour.
# locs = addips(locations = countries, atAtime = 500)
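If you'd rather not babysit it, a loop along these lines would do the same thing automatically. It's only a sketch, on the assumption that about 500 lookups every 35 minutes stays under the hourly cap; geocodeAll is not something I actually ran.

# A sketch of automating the repeated calls; I ran addips by hand instead.
# Assumes ~500 lookups every 35 minutes stays under the 1000-per-hour cap.
geocodeAll = function(locations, batchSize = 500) {
    repeat {
        done = read.table("locations.txt", stringsAsFactors = F, header = T)
        remaining = locations[!locations %in% done$location]
        if (length(remaining) == 0) break
        addips(locations = locations, atAtime = batchSize)
        Sys.sleep(60 * 35)  # wait about 35 minutes between batches
    }
    read.table("locations.txt", stringsAsFactors = F, header = T)
}
# locs = geocodeAll(countries)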

Once we've made that list, we can just do a sanity check on the locations by putting them on a map.

locs = read.table("locations.txt", stringsAsFactors = F, header = T)
library(maps)
library(plyr)
library(ggplot2)
usamap <- map_data("world")  # despite the name, this is the whole world map
locs$long = as.numeric(locs$longitude)
locs$lat = as.numeric(locs$latitude)
ggplot(locs, aes(y = lat, x = long)) + annotation_map(map = usamap, 
    aes(id = group), colour = "grey50", fill = "grey50") + geom_point(alpha = 0.05, 
    color = "red", size = 2.5)

[plot of chunk unnamed-chunk-2: world map with a red point for each geocoded e-mail domain]

This is basically good news: it looks quite a bit like we'd expect a geography of science to look: heaviest on the eastern seaboard and in California in the Americas, in the old EEC in Europe, and in Japan and South Korea.

Of course, that's just the locations of e-mail servers: there's no sense of which universities put out the most research. We can correct that with a quick API call to the Bookworm, and get some maps that show the density of words produced at each location. As points, that looks like this:

fullCountsQuery = list(groups = list("mld"), search_limits = list(word = list()), 
    database = "arxiv", counttype = "Raw_Counts")
fullcounts = webQuery(fullCountsQuery)
fullcounts = merge(fullcounts, locs, by.x = "mld", by.y = "location")
# Drop Gmail submissions, which geocode to Mountain View rather than to a
# real institution.
fullcounts = fullcounts[fullcounts$city != "Mountain View", ]
locationcounts = ddply(fullcounts, .(lat, long), function(location) {
    data.frame(count = sum(location$count), lat = location$lat[1], long = location$long[1])
})

ggplot(locationcounts, aes(y = lat, x = long)) + annotation_map(map = usamap, 
    aes(id = group), colour = "grey50", fill = "grey50") + geom_point(alpha = 0.2, 
    aes(size = count), color = "red") + scale_size_continuous(trans = "sqrt")

[plot of chunk unnamed-chunk-3: world map with points sized by total word count at each location]

To show density, it makes sense to bin these together somehow:

us_map = map_data("usa")
bins = 75
ggplot(locationcounts, aes(y = lat, x = long)) + annotation_map(map = usamap, 
    aes(id = group), colour = "grey50", fill = "grey50") + stat_summary_hex(aes(z = count), 
    fun = sum, bins = bins) + scale_fill_gradientn(colours = terrain.colors(10), 
    trans = "sqrt")

[plot of chunk hexPlot: hex-binned word counts, world map]


ggplot(locationcounts, aes(y = lat, x = long)) + annotation_map(map = us_map, 
    aes(id = group), colour = "grey50", fill = "grey50") + stat_summary_hex(aes(z = count), 
    fun = sum, bins = 50) + xlim(-125, -69) + ylim(25, 50) + scale_fill_gradientn(colours = terrain.colors(10), 
    trans = "sqrt")

[plot of chunk hexPlot: hex-binned word counts, continental United States]

That's fairly clear, but it's also too reminiscent of the Nintendo era. If we want a more contemporary visual vernacular, we can use a grey-on-black palette and some transparency to get at the same data.

I tend to think that this color scheme is heavily overused nowadays; for most cases, I think black-on-white works better. But it has one big advantage: it lets you use a three-color transition from red to white to blue to indicate intensity. (Using red-to-black-to-blue on a white background doesn't work nearly as well.) That's going to be important to me in a little bit, so I'm going with the Zeitgeist.
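In ggplot2 terms, that ramp is just a diverging color scale with a white midpoint; the geoPlot function further down uses exactly this kind of scale, so the toy example below (built on a made-up data frame, not ArXiv data) is only there to show the idea.

# A toy illustration of the red-to-white-to-blue ramp; the real maps below
# apply the same scale_color_gradient2 call against a black background.
library(ggplot2)
demo = data.frame(x = 1:11, y = 0, z = seq(-1, 1, length.out = 11))
ggplot(demo, aes(x, y, color = z)) + geom_point(size = 10) + 
    scale_color_gradient2(midpoint = 0, low = "red", mid = "white", high = "blue")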

# I'm doing geom_polygon instead of geom_map because my ggplot2 seems not
# to be quite up to date: if you're using 0.9.1 or higher, I think
# annotation_map should work and be _much_ faster.
ggplot(locationcounts, aes(y = lat, x = long)) + geom_polygon(data = us_map, 
    aes(group = group)) + geom_point(aes(size = count), alpha = 0.33, 
    color = "red") + coord_map(xlim = c(-125, -67), ylim = c(25, 50), projection = "albers", 
    lat0 = 45.5, lat1 = 29.5) + scale_size_continuous(trans = "sqrt") + opts(panel.background = theme_rect(fill = "black", 
    colour = NA), panel.grid.major = theme_line(colour = "black"), panel.grid.minor = theme_line(colour = "black"), 
    legend.position = "None", axis.ticks = theme_blank(), axis.text.y = theme_blank(), 
    axis.text.x = theme_blank()) + ylab("") + xlab("")

[plot of chunk PlotArticles: continental US on a black background, points sized by word count]


ggplot(locationcounts, aes(y = lat, x = long)) + geom_polygon(data = usamap, 
    aes(group = group)) + geom_point(aes(size = count), alpha = 0.33, 
    color = "red") + coord_map(xlim = c(-180, 180), ylim = c(-60, 71), projection = "mollweide") + 
    scale_size_continuous(trans = "sqrt") + opts(panel.background = theme_rect(fill = "black", 
    colour = NA), panel.grid.major = theme_line(colour = "black"), panel.grid.minor = theme_line(colour = "black"), 
    legend.position = "None", axis.ticks = theme_blank(), axis.text.y = theme_blank(), 
    axis.text.x = theme_blank()) + ylab("") + xlab("")

[plot of chunk PlotArticles: world map, Mollweide projection, black background, points sized by word count]

Science in Motion

This is kind of pretty. But the real power of this comes from all the other metadata in our database. Should we want to look at the history of how these counts change over time, for example, it's easy to build a movie out of each successive month. That nicely illustrates, among other things, China's entrance into the world physics community over the last 20 years.


# I know that Yihui Xie has also made an animation package; for various
# reasons around cropping, etc., I find it easier to just save the files to
# disk and run my own ffmpeg scripts on them.

monthlyCounts = webQuery(list(groups = list("mld", "month"), search_limits = list(word = list()), 
    database = "arxiv", counttype = "Raw_Counts"))
monthlyCounts$date = as.Date(monthlyCounts$month, origin = as.Date("0-1-1"))
yearcounts = merge(monthlyCounts, locs, by.x = "mld", by.y = "location")

yearcounts = yearcounts[!is.na(yearcounts$date), ]


MonthlyCounts = ddply(yearcounts, .(lat, long, month(date), year(date)), 
    function(location) {
        data.frame(count = sum(location$count, na.rm = T), lat = location$lat[1], 
            long = location$long[1], year = year(location$date[1]), month = month(location$date[1]))
    })

legend = data.frame(long = 0, lat = seq(-5, -45, length.out = 5), 
    count = seq(sqrt(1e+05), sqrt(max(MonthlyCounts$count, na.rm = T)), length.out = 5)^2)
legend$labels = paste0(round(legend$count/1000), "K words")
legend$labels[legend$count > 1e+06] = paste0(round(legend$count[legend$count > 
    1e+06]/1e+06), "M words")
plots = dlply(MonthlyCounts, .(year, month), function(monthframe) {
    plot = ggplot(monthframe, aes(y = lat, x = long)) + geom_polygon(data = usamap, 
        aes(group = group)) + geom_point(aes(size = count), alpha = 0.33, 
        color = "red") + coord_map(xlim = c(-180, 180), ylim = c(-60, 71), projection = "mollweide") + 
        opts(panel.background = theme_rect(fill = "black", colour = NA), panel.grid.major = theme_line(colour = "black"), 
            panel.grid.minor = theme_line(colour = "black"), legend.position = "None", 
            axis.ticks = theme_blank(), axis.text.y = theme_blank(), axis.text.x = theme_blank()) + 
        ylab("") + xlab("") + scale_size_continuous(trans = "sqrt", range = c(0, 
        7), limits = c(0, max(MonthlyCounts$count, na.rm = T))) + geom_point(data = legend, 
        aes(size = count), color = "red", alpha = 0.75) + geom_text(data = legend, 
        adj = 1.2, color = "red", aes(label = labels), size = 2.5) + annotate(geom = "text", 
        label = paste(monthframe$month[1], monthframe$year[1], sep = "-"), size = 4.5, 
        x = -30, y = 30, color = "red")
    ggsave(plot, file = paste0("maps/", monthframe$year[1], "-", monthframe$month[1], 
        ".png"), width = 9, height = 6)
})
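Once the frames are saved, stitching them into a movie is a one-liner. My actual ffmpeg script isn't in this post, so the call below is only a sketch; it also assumes the files in maps/ glob in chronological order, which the plain year-month names above don't strictly guarantee.

# A sketch of turning the saved frames into a movie; this is not the exact
# ffmpeg invocation I use, and the output filename is made up.
system(paste("ffmpeg -framerate 6 -pattern_type glob -i 'maps/*.png'", 
    "-pix_fmt yuv420p arxiv-geography.mp4"))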

There are also dozens of over-placed locations where the geocoding is only specific to the country or the city (Moscow and Tomsk are some of the worst in this regard, I think). That could be solved with jitter. The point (54, -2) also collects a lot of locations, including many European Yahoo domains; it's presumably a country-level default for the United Kingdom, but I'm not sure what's up with that.
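A quick, if crude, way to spot those over-placed points is to count how many distinct domains collapse onto identical coordinates; this is just a diagnostic sketch run against the locs table built above.

# Count how many distinct domains share exactly the same coordinates; large
# counts flag country- or city-level defaults like the point at (54, -2).
library(plyr)
collisions = ddply(locs, .(latitude, longitude), summarise, domains = length(location))
head(collisions[order(-collisions$domains), ], 10)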

Plotting Words Geographically

Where this really gets exciting, though, is that this sort of thing can tell us about usage patterns.

Here, for example, I define a function that shows how heavily one word is used at each of these geographic locations compared to some other word. The basic idea is pretty much the same as before.

geoRatio = function(words) {
    query = list(groups = list("mld"), counttype = "Raw_Counts", search_limits = list(word = words[[1]]), 
        database = "arxiv")
    a = webQuery(query)
    query[["search_limits"]][["word"]] = words[[2]]
    b = webQuery(query)
    groups = merge(a, b, by = unlist(query[["groups"]]))
    if (identical(words[[2]], list())) {
        words[[2]] = "All Words"
    }
    names(groups) = c("mld", "a", "b")
    groups$total = groups$a + groups$b
    groups$ratio = groups$a/groups$b
    plottable = merge(groups, locs, by.x = "mld", by.y = "location")
    plottable$scale = as.vector(scale(log(plottable$ratio)))
    geoPlot(plottable, aesthetic = aes(size = total, color = scale)) + annotate("text", 
        x = 70, y = -10, label = paste(words[[1]], collapse = "/"), color = "blue", 
        size = 10) + annotate("text", x = 70, y = -20, label = paste(words[[2]], 
        collapse = "/"), color = "red", size = 10)
}

geoPlot = function(data, aesthetic, mapname = "world", ...) {
    legend = data.frame(long = 0, lat = seq(0, -45, length.out = 6), count = seq(sqrt(0), 
        sqrt(max(data[, as.character(aesthetic$size)], na.rm = T)), length.out = 6)^2)[2:6, 
        ]
    legend$labels = paste0(round(legend$count), " words")
    legend$labels[legend$count > 1000] = paste0(round(legend$count[legend$count > 
        1000]/1000), "K words")
    legend$labels[legend$count > 1e+06] = paste0(round(legend$count[legend$count > 
        1e+06]/1e+06), "M words")
    myMap = map_data(mapname)
    coreplot = ggplot(data, aes(y = lat, x = long)) + geom_polygon(data = myMap, 
        aes(group = group)) + geom_point(aesthetic, alpha = 0.33) + coord_map(xlim = c(-180, 
        180), ylim = c(-60, 71), projection = "mollweide") + opts(panel.background = theme_rect(fill = "black", 
        colour = NA), panel.grid.major = theme_line(colour = "black"), panel.grid.minor = theme_line(colour = "black"), 
        legend.position = "None", axis.ticks = theme_blank(), axis.text.y = theme_blank(), 
        axis.text.x = theme_blank()) + ylab("") + xlab("") + scale_size_continuous(trans = "sqrt", 
        range = c(0, 17), limits = c(0, max(data[, as.character(aesthetic$size)], 
            na.rm = T))) + scale_color_gradient2(midpoint = 0, low = "red", 
        mid = "white", high = "blue") + geom_point(data = legend, aes(size = count), 
        color = "white", alpha = 0.75) + geom_text(data = legend, adj = 1, color = "white", 
        aes(label = labels, x = long - 8), size = 4.5)
    coreplot
}

Running this gives us Ngram plots keyed to the ratio of the two searches: blue means relatively heavy usage of the first word, white an average amount, and red relatively light usage (or, equivalently, heavy usage of the second word).

This can show absolute levels, but those are frequently not very clear: as for instance here, where a number of the large centers (Stanford, CERN, etc.) shade towards talking about the Higgs more than many smaller universities scattered around the world.

geoRatio(list(list("Higgs"), list()))

[plot of chunk unnamed-chunk-5: relative usage of "Higgs" against all words]

It really shines at binary comparisons, though. We can see British and American usage of "while" and "whilst", the geographical spheres of influence of universities and cities, and other similar features. Note that the sizes here correspond to the total usage of both of the words shown: so the point at Yale is big because it uses "Yale" a lot and "Oxford" a little, and the point at Oxford is big because it uses "Oxford" a lot and "Yale" a little. I think this makes more sense than scaling by the total number of words in the corpus.

geoRatio(list(list("while"), list("whilst")))

[plot of chunk unnamed-chunk-6: "while" vs. "whilst"]

geoRatio(list(list("Harvard"), list("Stanford")))

[plot of chunk unnamed-chunk-6: "Harvard" vs. "Stanford"]

geoRatio(list(list("Oxford"), list("Yale")))

[plot of chunk unnamed-chunk-6: "Oxford" vs. "Yale"]

geoRatio(list(list("Tokyo"), list("London")))

[plot of chunk unnamed-chunk-6: "Tokyo" vs. "London"]

I don't have the subject-matter expertise to come up with anything really interesting using these charts on the ArXiv: but geo-positioning is a general feature of texts, so I should be able to throw these same tools at some other datasets I know better.