This R Markdown document was written to analyse the jobs and events published on AustralianBioinformatics.net.
The Australian Bioinformatics Network aims to connect people, resources and opportunities to increase the benefits Australian bioinformatics can deliver, and online communication of information is a big part of that.
Part of the motivation behind this document is to ensure the ABN Team and the ABN's stakeholders (including its funders: CSIRO, EMBL Australia and Bioplatforms Australia) know how things are going with online communications.
While the analytics of the SquareSpace 5 content management system behind AustralianBioinformatics.net describe the traffic the site receives, they don't summarise the content that has been drafted. Thus, we need a little extra analysis.
…oh, and this is also a great excuse to show how R and R Markdown can be used!
…speaking of which, my sincere thanks go to Dr Neil Saunders (CSIRO) for pointing out how the XML package can be used to do all this.
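Here's the function that we use to scrape pages from AustralianBioinformatics.net (thanks Neil!):
library(XML)  # htmlTreeParse(), xpathSApply(), xmlGetAttr(), xmlValue()

scrape <- function(page, domain = "australianbioinformatics.net") {
  # Pasting "http:/" with sep = "/" yields e.g. http://australianbioinformatics.net/past-jobs-index
  url <- paste("http:/", domain, page, sep = "/")
  doc <- htmlTreeParse(url, useInternalNodes = TRUE)
  # For each archive link, return its href followed by its visible text
  extract <- function(x) c(xmlGetAttr(x, "href"), xmlValue(x))
  hyperlinks <- xpathSApply(
    doc,
    "//div[@class='journal-archive-set']//li/a",
    extract)
  # hyperlinks is a two-row matrix: row 1 holds the hrefs, row 2 the link text
  data.frame(link = hyperlinks[1, ], text = hyperlinks[2, ], stringsAsFactors = FALSE)
}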
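To see what the XPath expression selects without touching the network, here is a minimal sketch that runs the same extraction on an inline HTML string. The markup, links and titles below are made up for illustration and only assume that archive links sit in list items inside a `journal-archive-set` div; `xmlGetAttr` is used so that only the `href` attribute is picked up.

```r
library(XML)

# Hypothetical stand-in for one archive page (not the real site markup)
html <- "<html><body>
<div class='journal-archive-set'><ul>
<li><a href='/events/2014/10/11/abic-2014.html'>ABIC 2014</a></li>
<li><a href='/jobs/2014/5/1/postdoc.html'>Postdoc</a></li>
</ul></div>
<div class='sidebar'><ul><li><a href='/about'>About</a></li></ul></div>
</body></html>"

doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# Return the href and the visible text of one <a> node
extract <- function(x) c(xmlGetAttr(x, "href"), xmlValue(x))

# Only links inside the archive div match; the sidebar link is ignored
hyperlinks <- xpathSApply(doc, "//div[@class='journal-archive-set']//li/a", extract)
links <- data.frame(link = hyperlinks[1, ], text = hyperlinks[2, ],
                    stringsAsFactors = FALSE)
print(links)
```

Because every call to `extract` returns a vector of length two, `xpathSApply` simplifies the results into a two-row matrix, which is why the rows can be pulled apart with `hyperlinks[1, ]` and `hyperlinks[2, ]`.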
Now we use it to scrape the pages containing events and jobs, and build a data frame containing everything, plus a factor recording the type of each link (job or event):
events <- rbind(scrape("upcoming-events-index"), scrape("past-events-index"), scrape("training-index"))
jobs <- rbind(scrape("open-jobs-index"), scrape("past-jobs-index"))
raw <- rbind(events, jobs)
raw$type <- factor(rep(c("event", "job"), c(nrow(events), nrow(jobs))))
rm(events, jobs)
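The labelling trick relies on `rep()` accepting a vector of repeat counts, so each label is repeated once per row of its source data frame. A small sketch with made-up links (the values are placeholders, not scraped data):

```r
# Toy stand-ins for the scraped data frames
events <- data.frame(link = c("/e1", "/e2", "/e3"), stringsAsFactors = FALSE)
jobs   <- data.frame(link = c("/j1", "/j2"), stringsAsFactors = FALSE)

raw <- rbind(events, jobs)

# "event" repeated nrow(events) times, then "job" repeated nrow(jobs) times;
# this lines up with raw only because rbind() stacked events first
raw$type <- factor(rep(c("event", "job"), c(nrow(events), nrow(jobs))))
print(raw)
```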
The SquareSpace 5 content management system behind AustralianBioinformatics.net has a setting to ensure that links to blog entries are prefaced by the date of the entry, e.g., /events/2014/10/11/abic-2014-australian-bioinformatics-conference.html. Unfortunately, this got turned off by accident for a period of time and some of the links are missing date information as a result.
This next bit of code reads in and applies a file of “fixes” for the 30 posts that were published while post dates were not being recorded. Notice the check to make sure that the merge hasn't accidentally dropped or added any records.
fixes <- read.csv("data/ABN Advertising 20140602.csv", stringsAsFactors=FALSE)
ads <- merge(raw, fixes, all=TRUE)
stopifnot(nrow(ads)==nrow(raw))
as.is <- is.na(ads$fixed)
ads$fixed[as.is] <- ads$link[as.is]
rm(as.is, fixes)
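As a rough illustration of the merge-and-fall-back step, the sketch below uses toy data. It assumes the fixes file shares a `link` column with the scraped data and carries the corrected path in a `fixed` column; the real CSV may contain other columns, and all values here are invented.

```r
raw <- data.frame(
  link = c("/events/2014/10/11/a.html", "/events/b.html", "/jobs/c.html"),
  text = c("A", "B", "C"),
  stringsAsFactors = FALSE)

# Hypothetical fixes: undated link -> dated link
fixes <- data.frame(
  link  = c("/events/b.html", "/jobs/c.html"),
  fixed = c("/events/2014/3/2/b.html", "/jobs/2014/4/5/c.html"),
  stringsAsFactors = FALSE)

# merge() joins on the shared column ("link"); all = TRUE keeps unmatched rows
ads <- merge(raw, fixes, all = TRUE)

# A fix whose link matches nothing in raw would add a row and trip this check
stopifnot(nrow(ads) == nrow(raw))

# Rows without a fix keep their original link
unfixed <- is.na(ads$fixed)
ads$fixed[unfixed] <- ads$link[unfixed]
print(ads)
```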
Now we split the fixed link into separate fields using the “/” character as the separator. We then restrict our attention to the ads that have six fields (an empty leading field, the section, the year, month and day, and the page name).
fields <- strsplit(ads$fixed, "/")
ads <- subset(ads, sapply(fields, length)==6)
Now that we are working with links that we know have six fields, we convert the list of fields to a data frame, pull out the ones we're interested in, convert them to dates, and tack them onto our ads data frame:
library(lubridate)  # ymd() here, today() in the plots further down

fields <- data.frame(t(sapply(strsplit(ads$fixed, "/"), c)), stringsAsFactors = FALSE)
# X1 is empty and X2 is the section; X3, X4 and X5 hold year, month and day
ads$date <- ymd(paste(fields$X3, fields$X4, fields$X5, sep = "/"))
rm(fields)
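To make the field counting concrete, here is a small sketch on two links: the dated example quoted above and a made-up undated one. It swaps in base R's `as.Date` for `ymd` purely to keep the example free of package dependencies.

```r
fixed <- c(
  "/events/2014/10/11/abic-2014-australian-bioinformatics-conference.html",
  "/events/undated-post.html")  # hypothetical undated link

fields <- strsplit(fixed, "/")

# The leading "/" produces an empty first field, so a dated link has six fields
n <- sapply(fields, length)

# Keep only the dated links, then bind their fields into a character matrix
dated <- fixed[n == 6]
parts <- do.call(rbind, strsplit(dated, "/"))

# Columns 3, 4 and 5 are year, month and day
dates <- as.Date(paste(parts[, 3], parts[, 4], parts[, 5], sep = "-"))
print(dates)
```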
Let's try to categorise our events and jobs based on some keywords:
library(plyr)  # ddply()

patterns <- read.csv("data/patterns.csv", stringsAsFactors = FALSE)
# Collapse the keywords for each (type, class) pair into a single regular expression
regexps <- unique(
  ddply(patterns,
        .(type, class),
        function(x) data.frame(
          type = x$type,
          regexp = paste(x$pattern, collapse = "|"),
          class = x$class,
          stringsAsFactors = FALSE)
  )
)
# Everything starts as "other"; a match on the right type overwrites it
ads$subtype <- rep("other", nrow(ads))
for (i in seq_len(nrow(regexps))) {
  matches <- grepl(regexps$regexp[i], ads$text, ignore.case = TRUE)
  type <- ads$type == regexps$type[i]
  ads$subtype[matches & type] <- regexps$class[i]
}
rm(i, matches, type)
#View(subset(ads, subtype=="other", select=c("type","subtype", "text")), title="result")
#table(ads$subtype, ads$type)
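The keyword matching can be sketched on toy data as follows. The `type`, `pattern` and `class` column names come from the code above, but the keywords, classes and ad titles are invented, and the regular expressions are built with base R rather than `ddply` to keep the example self-contained.

```r
patterns <- data.frame(
  type    = c("event", "event", "job"),
  pattern = c("workshop", "course", "postdoc"),
  class   = c("training", "training", "research"),
  stringsAsFactors = FALSE)

# One row per (type, class), with that group's keywords joined by "|"
regexps <- unique(patterns[c("type", "class")])
regexps$regexp <- mapply(
  function(t, k) paste(patterns$pattern[patterns$type == t & patterns$class == k],
                       collapse = "|"),
  regexps$type, regexps$class, USE.NAMES = FALSE)

ads <- data.frame(
  type = c("event", "event", "job", "job"),
  text = c("RNA-seq Workshop", "ABIC 2014", "Postdoc in genomics", "Workshop coordinator"),
  stringsAsFactors = FALSE)

ads$subtype <- "other"
for (i in seq_len(nrow(regexps))) {
  matches   <- grepl(regexps$regexp[i], ads$text, ignore.case = TRUE)
  same_type <- ads$type == regexps$type[i]
  # A keyword only counts when the ad is of the type the pattern was written for
  ads$subtype[matches & same_type] <- regexps$class[i]
}
print(ads)
```

Note that "Workshop coordinator" stays in "other": it matches an event keyword but is a job, so the type check filters it out.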
Let's plot the cumulative number of events and jobs/opportunities advertised on the site:
library(ggplot2)

# Within each type, sort the dates and number them 1, 2, 3, ...
type.df <-
  ddply(ads,
        .(type),
        function(x) data.frame(
          type = x$type,
          date = sort(x$date),
          cumulative = seq_len(nrow(x))
        )
  )
ggplot(data = type.df, aes(x = date, y = cumulative, group = type)) +
  geom_line(aes(colour = type)) +
  geom_vline(xintercept = as.numeric(as.POSIXct(today())), linetype = 4)
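The cumulative count is simply each ad's rank by date within its type. A dependency-free sketch of the same idea, using base R's `ave` in place of `ddply` and invented dates:

```r
ads <- data.frame(
  type = c("job", "event", "event", "job", "event"),
  date = as.Date(c("2014-03-01", "2014-01-15", "2014-02-20",
                   "2014-01-10", "2014-01-05")),
  stringsAsFactors = FALSE)

# Order by type, then by date, so each group's rows are chronological
ads <- ads[order(ads$type, ads$date), ]

# Number the rows 1, 2, 3, ... separately within each type
ads$cumulative <- ave(seq_along(ads$date), ads$type, FUN = seq_along)
print(ads)
```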
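Now let's break that down for jobs:
# Group by type as well as subtype: "other" occurs for both events and jobs,
# and each group must hold a single type for its labels to line up with the sorted dates
subtype.df <-
  ddply(ads,
        .(type, subtype),
        function(x) data.frame(
          type = x$type,
          subtype = x$subtype,
          date = sort(x$date),
          cumulative = seq_len(nrow(x))
        )
  )
ggplot(data = subset(subtype.df, type == "job"), aes(x = date, y = cumulative, group = subtype)) +
  geom_line(aes(colour = subtype)) +
  geom_vline(xintercept = as.numeric(as.POSIXct(today())), linetype = 4) +
  facet_grid(~type)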
And for events:
ggplot(data = subset(subtype.df, type == "event"), aes(x = date, y = cumulative, group = subtype)) +
  geom_line(aes(colour = subtype)) +
  geom_vline(xintercept = as.numeric(as.POSIXct(today())), linetype = 4) +
  facet_grid(~type)
My next challenge is to explore this with the googleVis package's gvisAnnotationChart.