Australian Bioinformatics Network website analysis

David Lovell

Preamble

Purpose of this document

This R Markdown document was written to analyse the jobs and events published on AustralianBioinformatics.net.

Motivation for this document

The Australian Bioinformatics Network aims to connect people, resources and opportunities to increase the benefits Australian bioinformatics can deliver, and online communication of information is a big part of that.

Part of the motivation behind this document is to ensure the ABN Team and the ABN's stakeholders (including its funders: CSIRO, EMBL Australia and Bioplatforms Australia) know how things are going with online communications.

While the analytics of the SquareSpace 5 content management system behind AustralianBioinformatics.net describe the traffic the site receives, they don't summarise the content that has been drafted. Thus, we need a little extra analysis.

…oh, and this is also a great excuse to show how R and R Markdown can be used!

Acknowledgements

…speaking of which, my sincere thanks go to Dr Neil Saunders (CSIRO) for pointing out how the XML package can be used to do all this.

Reading the data (a.k.a. “ScreenScraping”)

Here's the function that we use to scrape pages from AustralianBioinformatics.net (thanks Neil!)

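library(XML)  # htmlTreeParse(), xpathSApply(), xmlGetAttr(), xmlValue()

scrape <- function(page, domain="australianbioinformatics.net"){
  # Build e.g. http://australianbioinformatics.net/upcoming-events-index
  url        <- paste("http:/", domain, page, sep="/")
  doc        <- htmlTreeParse(url, useInternalNodes = TRUE)
  # For each <a> node, return its href attribute and its link text
  extract    <- function(x) c(xmlGetAttr(x, "href"), xmlValue(x))
  hyperlinks <- xpathSApply(
    doc, 
    "//div[@class='journal-archive-set']//li/a",
    extract)
  # hyperlinks is a 2 x n matrix: row 1 = hrefs, row 2 = link text
  data.frame(link=hyperlinks[1,], text=hyperlinks[2,], stringsAsFactors=FALSE)
}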
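
To see what the XPath query and the extract step hand back without touching the live site, here is a minimal sketch that runs the same steps on an inline HTML string. The markup and the second link are made up for illustration (I'm assuming the index pages look roughly like this), and xmlGetAttr() is used to pull out the href.

```r
library(XML)

# Hypothetical stand-in for an index page; the real markup may differ.
html <- '
<div class="journal-archive-set">
  <ul>
    <li><a href="/events/2014/10/11/abic-2014-australian-bioinformatics-conference.html">ABIC 2014</a></li>
    <li><a href="/jobs/2014/5/1/example-postdoc.html">Example postdoc</a></li>
  </ul>
</div>'

doc     <- htmlParse(html, asText = TRUE)
extract <- function(x) c(xmlGetAttr(x, "href"), xmlValue(x))
links   <- xpathSApply(doc, "//div[@class='journal-archive-set']//li/a", extract)

# Each node yields a length-2 vector, so the results simplify to a 2 x n
# character matrix: row 1 holds the hrefs, row 2 the link text.
toy <- data.frame(link = links[1, ], text = links[2, ], stringsAsFactors = FALSE)
print(toy)
```

Only anchors inside list items within the journal-archive-set div are matched, so navigation links elsewhere on the page are ignored.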

Now we use it to scrape the pages containing events and jobs, and create a dataframe containing everything, plus a factor to describe the type of link (job or event).

events   <- rbind(scrape("upcoming-events-index"), scrape("past-events-index"), scrape("training-index"))
jobs     <- rbind(scrape("open-jobs-index"), scrape("past-jobs-index"))
raw      <- rbind(events, jobs)
raw$type <- factor(rep(c("event", "job"), c(nrow(events), nrow(jobs))))
rm(events, jobs)

Fixing the data

The SquareSpace 5 content management system behind AustralianBioinformatics.net has a setting to ensure that links to blog entries are prefaced by the date of the entry, e.g., /events/2014/10/11/abic-2014-australian-bioinformatics-conference.html. Unfortunately, this got turned off by accident for a period of time and some of the links are missing date information as a result.

This next bit of code reads in and applies a file of “fixes” for the 30 posts that occurred when post dates were not being recorded. Notice the check to make sure that the merge hasn't accidentally dropped or added any records.

fixes <- read.csv("data/ABN Advertising 20140602.csv", stringsAsFactors=FALSE)
ads   <- merge(raw, fixes, all=TRUE)
stopifnot(nrow(ads)==nrow(raw))
as.is <- is.na(ads$fixed)
ads$fixed[as.is] <- ads$link[as.is]
rm(as.is, fixes)
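
Here is a toy version of that merge-and-check step, as a sketch only. The data are invented, and I'm assuming the fixes file shares a link column with the scraped data and carries the corrected link in fixed.

```r
# Toy stand-ins for the scraped data and the fixes file (hypothetical values).
raw <- data.frame(
  link = c("/events/2014/10/11/a.html", "/events/b.html", "/jobs/c.html"),
  text = c("A", "B", "C"),
  stringsAsFactors = FALSE)
fixes <- data.frame(
  link  = c("/events/b.html", "/jobs/c.html"),
  fixed = c("/events/2013/7/1/b.html", "/jobs/2013/8/15/c.html"),
  stringsAsFactors = FALSE)

ads <- merge(raw, fixes, all = TRUE)   # joins on the shared column, link
stopifnot(nrow(ads) == nrow(raw))      # trips if a fix matches no scraped link

unfixed <- is.na(ads$fixed)            # rows with no entry in the fixes file
ads$fixed[unfixed] <- ads$link[unfixed]
print(ads[, c("link", "fixed")])
```

Because all=TRUE keeps unmatched rows from both sides, a misspelt link in the fixes file would add a row instead of silently vanishing, and the row-count check would catch it.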

Now we split the fixed link into separate fields using the “/” character as the separator. We then restrict our attention to the ads that have six fields.

fields <- strsplit(ads$fixed, "/")
ads    <- subset(ads, sapply(fields, length)==6)

Now that we are working with links that we know have six fields, we convert the list of fields to a dataframe, pull out the ones we're interested in, convert them to dates, and tack them onto our ads dataframe.

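library(lubridate)  # ymd(), today()
# Each fixed link splits into: "", section, year, month, day, slug
fields <- data.frame(t(sapply(strsplit(ads$fixed, "/"),c)), stringsAsFactors=FALSE)
ads$date  <- ymd(paste(fields$X3, fields$X4, fields$X5, sep="/"))
rm(fields)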
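
As a worked example, here is how the sample link from earlier breaks into six fields and becomes a date. This sketch uses base R's as.Date() in place of ymd() purely so that it runs without extra packages.

```r
link   <- "/events/2014/10/11/abic-2014-australian-bioinformatics-conference.html"
fields <- strsplit(link, "/")[[1]]

# The leading "/" produces an empty first field, so the six fields are:
# "", "events", "2014", "10", "11", "abic-2014-...html"
print(fields)

# Fields 3 to 5 are year, month and day
date <- as.Date(paste(fields[3], fields[4], fields[5], sep = "/"),
                format = "%Y/%m/%d")
print(date)
```

A link that lost its date prefix has fewer than six fields, which is why those rows are filtered out unless a fix restores the date.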

Classifying the data

Let's try to categorise our events and jobs based on some keywords:

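library(plyr)  # ddply()
patterns <- read.csv("data/patterns.csv", stringsAsFactors=FALSE)
# Collapse the patterns for each (type, class) pair into one "a|b|c" regexp;
# unique() removes the repeated rows this produces within each pair
regexps  <- unique(
  ddply(patterns,
        .(type,class),
        function(x) data.frame(
          type=x$type, 
          regexp=paste(x$pattern, collapse="|"), 
          class=x$class, 
          stringsAsFactors=FALSE)
        )
  )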

ads$subtype <- rep("other", nrow(ads))
for(i in seq_len(nrow(regexps))){
  matches <- grepl(regexps$regexp[i], ads$text, ignore.case=TRUE)
  type    <- ads$type==regexps$type[i]
  ads$subtype[matches&type] <- regexps$class[i]
}
rm(i, matches, type)
#View(subset(ads, subtype=="other", select=c("type","subtype", "text")), title="result")
#table(ads$subtype, ads$type)
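
To make the keyword matching concrete, here is a small sketch with invented patterns and ad titles. The columns type, class and pattern mirror the ones the code above reads from patterns.csv, but every value is hypothetical, and base R's aggregate() stands in for ddply() so the example has no package dependencies.

```r
# Hypothetical keyword table: two patterns map to the same job class.
patterns <- data.frame(
  type    = c("job", "job", "event"),
  class   = c("postdoc", "postdoc", "workshop"),
  pattern = c("postdoc", "research fellow", "workshop"),
  stringsAsFactors = FALSE)

# One alternation regexp per (type, class) pair, e.g. "postdoc|research fellow"
regexps <- aggregate(pattern ~ type + class, data = patterns,
                     FUN = paste, collapse = "|")

# Hypothetical ad titles
ads <- data.frame(
  type = factor(c("job", "job", "event", "event")),
  text = c("Postdoc in genomics", "Senior Research Fellow",
           "RNA-seq Workshop", "Annual symposium"),
  stringsAsFactors = FALSE)

ads$subtype <- "other"
for (i in seq_len(nrow(regexps))) {
  matches <- grepl(regexps$pattern[i], ads$text, ignore.case = TRUE)
  same    <- ads$type == regexps$type[i]
  ads$subtype[matches & same] <- regexps$class[i]
}
print(ads)
```

Anything that matches no pattern for its type keeps the default label "other", and when several classes match the same title, the last one in the loop wins.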

Visualising the data

Let's plot the cumulative number of events and jobs/opportunities advertised on the site:

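# One row per ad, ordered by date within each type, with a running count
type.df <- 
  ddply(ads,
        .(type),
        function(x) data.frame(
          type=x$type,
          date=sort(x$date),
          cumulative=seq_len(nrow(x))
          )
        )

library(ggplot2)
# The dot-dash vertical line marks today's date
ggplot(data=type.df, aes(x=date,y=cumulative, group=type)) + 
  geom_line(aes(colour=type)) +
  geom_vline(xintercept=as.numeric(as.POSIXct(today())), linetype=4)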

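[Figure: cumulative number of ads (y) against date (x), one line per type (event, job); a dot-dash vertical line marks today's date]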
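
The cumulative counts behind this plot come from sorting the dates and numbering them in order. A minimal sketch with three made-up dates:

```r
# Hypothetical posting dates, deliberately out of order
dates <- as.Date(c("2014-03-01", "2013-11-20", "2014-01-15"))

# After sorting, the i-th date is the i-th item posted, so its position in
# the sorted vector is the running total up to and including that date.
counts <- data.frame(date = sort(dates), cumulative = seq_along(dates))
print(counts)
```

This only stays correct if every other column in the same data frame is constant within the group, because sort() reorders the dates on their own.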

Now let's break that down for jobs:


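# Group by type as well as subtype: a subtype such as "other" occurs in both
# jobs and events, and sorting the dates separately would otherwise pair
# dates with the wrong type and mix the two running counts
subtype.df <- 
  ddply(ads,
        .(type, subtype),
        function(x) data.frame(
          type=x$type,
          subtype=x$subtype,
          date=sort(x$date),
          cumulative=seq_len(nrow(x))
          )
        )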

ggplot(data=subset(subtype.df, type=="job"), aes(x=date,y=cumulative, group=subtype)) + 
  geom_line(aes(colour=subtype)) +
  geom_vline(xintercept = as.numeric(as.POSIXct(today())), linetype=4) +
  facet_grid(~type)

[Figure: cumulative number of job ads (y) against date (x), one line per subtype; a dot-dash vertical line marks today's date]

And now for events:

ggplot(data=subset(subtype.df, type=="event"), aes(x=date,y=cumulative, group=subtype)) + 
  geom_line(aes(colour=subtype)) +
  geom_vline(xintercept = as.numeric(as.POSIXct(today())), linetype=4) +
  facet_grid(~type)

[Figure: cumulative number of event ads (y) against date (x), one line per subtype; a dot-dash vertical line marks today's date]

My next challenge is to explore this with the googleVis package's gvisAnnotationChart.