This is an example of using the XML library and some simple string processing to extract the usernames from the comments of a Metafilter post, then produce a quick barplot showing how often each username appears.
First, we need to set up the environment: load the XML library and define the Metafilter thread we want to load, by number.
library(XML)
pg = "http://www.metafilter.com/"
thread = 135636
url = paste0(pg, thread, "/")
Now we need to extract the elements of the HTML page that contain the usernames. I used Selector Gadget to generate an XPath string that selects these lines. Unfortunately I couldn't extract the usernames on their own, because the XPath for that couldn't be interpreted by the XML library. So after extracting the whole lines, I do some simple string processing to isolate the usernames.
xp = "//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"comments\", \" \" ))]//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"smallcopy\", \" \" ))]"
doc = xmlInternalTreeParse(url, isHTML = TRUE)
## Tag barlow invalid
## ID cse-search-box already defined
src = unlist(xpathApply(doc, xp, xmlValue))
So now we have a vector containing the metadata lines under each comment:
head(src, 3)
## [1] "posted by Artw at 3:24 PM on January 14 [1 favorite] "
## [2] "posted by cjorgensen at 3:25 PM on January 14 [7 favorites] "
## [3] "posted by ook at 3:26 PM on January 14 [8 favorites] "
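The extraction step can be tried offline on a small fragment before pointing it at a live thread. The sketch below is only an illustration: the HTML is made up (the real Metafilter markup is more involved), the names `html_demo`, `doc_demo`, `xp_demo` and `lines_demo` are hypothetical, and the XPath is a shortened form of the same "class contains" test used above.

```r
library(XML)

# Made-up fragment: two comments plus a sidebar element that should be ignored
html_demo = paste0(
  '<html><body>',
  '<div class="comments">Nice post.',
  '<span class="smallcopy">posted by alice at 3:24 PM on January 14</span></div>',
  '<div class="comments best">Agreed.',
  '<span class="smallcopy">posted by bob at 3:25 PM on January 14</span></div>',
  '<div class="sidebar"><span class="smallcopy">not a comment</span></div>',
  '</body></html>')

# asText = TRUE parses the string itself instead of treating it as a file or URL
doc_demo = htmlParse(html_demo, asText = TRUE)

# Any element whose class list includes "smallcopy", nested inside any element
# whose class list includes "comments"
xp_demo = paste0(
  "//*[contains(concat(' ', @class, ' '), ' comments ')]",
  "//*[contains(concat(' ', @class, ' '), ' smallcopy ')]")

lines_demo = xpathSApply(doc_demo, xp_demo, xmlValue)
lines_demo
```

Padding the class attribute with spaces before testing it is what lets the second comment, whose class is "comments best", still match, while the sidebar line is left out.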
So let's chop out the usernames. First, we can get rid of the 10 characters of “posted by ” before the name by keeping everything from character 11 onwards.
kk = substr(src, 11, nchar(src))
Now let's make a simple function that can be applied over our vector to chop out everything after the username. The first string that occurs after the username is “ at ”, so let's see where that's located in each string, and keep only what's before it. Of course, usernames that contain “ at ” themselves will be truncated, but let's not get hung up about that.
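namefun = function(instring) {
    # position where the first exact " at " starts (-1 if it is absent)
    ll = regexpr(" at ", instring, fixed = TRUE)[1]
    # keep everything up to that position, including the separating space
    mm = substr(instring, 1, ll)
    mm
}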
nvec = sapply(kk, FUN = namefun)
names(nvec) = NULL
Now, after we get rid of the pesky vector names by setting them to NULL, we have a vector containing just the usernames (each still followed by the space that separated it from “at”)!
head(nvec, 3)
## [1] "Artw " "cjorgensen " "ook "
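The two-step chop can also be written as a single vectorised call. This is a sketch of an alternative, not the method used in this post: it assumes every line follows the "posted by NAME at H:MM AM/PM on ..." layout seen in the three sample lines, and the names `src_demo`, `pattern` and `users` are made up for the illustration. The fourth input line is invented to show a username that itself contains “ at ”.

```r
src_demo = c(
  "posted by Artw at 3:24 PM on January 14 [1 favorite] ",
  "posted by cjorgensen at 3:25 PM on January 14 [7 favorites] ",
  "posted by ook at 3:26 PM on January 14 [8 favorites] ",
  "posted by cat at home at 3:27 PM on January 14 ")  # invented username

# Capture the shortest run of text that is followed by " at " plus a clock time.
# Requiring the time after " at " keeps an " at " inside a username from
# cutting the name short.
pattern = "^posted by (.*?) at [0-9]{1,2}:[0-9]{2} [AP]M on .*$"

# sub() is vectorised, so no sapply() or names() clean-up is needed
users = sub(pattern, "\\1", src_demo, perl = TRUE)
users
```

Because the capture group stops before the space in “ at ”, the names come back without a trailing space. Lines that do not match the assumed layout are returned unchanged by `sub()`, so they would need checking separately.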
And here is the plot. A horizontal barplot draws its bars from the bottom up, so the first entry ends up at the bottom. We take our vector of usernames, summarize it with the table function (which orders the names alphabetically), and reverse the result so that names earlier in the alphabet come out at the top. We also make the left margin wider to fit long usernames.
ntable = rev(table(nvec))
par(mai = c(0.6, 3, 0.2, 0.3), cex = 0.75)
barplot(ntable, horiz = TRUE, col = "blue", las = 1, main = thread)
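The plot is ordered alphabetically. If the aim is to see the most frequent commenters at a glance, one option is to sort the table by count instead. The sketch below uses a small made-up vector (`demo_names` and `counts` are hypothetical names); an ascending sort is used because a horizontal barplot draws from the bottom up, which puts the largest count at the top.

```r
# Made-up usernames standing in for the extracted vector
demo_names = c("ook", "Artw", "ook", "cjorgensen", "ook", "Artw")

# table() counts each name; sort() orders the counts from smallest to largest
counts = sort(table(demo_names))

# Same plotting settings as before: wide left margin for long usernames
par(mai = c(0.6, 3, 0.2, 0.3), cex = 0.75)
barplot(counts, horiz = TRUE, col = "blue", las = 1, main = "comments per user")
```

For a long thread, `tail(counts, 20)` would keep only the 20 most frequent names before plotting.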