docker run -d -P selenium/standalone-chrome
Welcome to the Congruent Connections faculty data scraper and DA 5020 Final Project for Stephen Holsenbeck. This page provides the in-depth documentation and write-up for how each modular function in the scraper works. In the navigation menu above, the Shiny Apps dropdown menu will take you to the Shiny App deployments on shinyapps.io. The Shiny Apps allow you to browse professors in the data set via wordclouds, or to search the data set by your interests and return the information of professors who match them. This data is searchable and downloadable.
Note: The code and apps were tested on Windows 10 Home running the Chrome browser maximized to the screen, and the page coding is optimized for viewing in a maximized browser window on a desktop (it uses Bootstrap fluid formatting, so it can be viewed on smaller screens, though not optimally).
The idea for a web scraper was inspired by a prospective philosophy PhD candidate who was lamenting the inevitable task of blindly broadcasting their application to suitable programs to increase their chances of acceptance. Noting that many ambitious individuals are relegated to this method of applying to programs for all types of intellectual pursuits, I attempted to improve upon the process by applying the data analytics skills acquired over the previous semester to develop a web scraper and Shiny UI that allow prospective students (for any higher-ed degree) to find the professors who match their interests, so that they can communicate with them intelligently, allocate their energy accordingly, and apply to their departments knowing that they share common interests. With this app, prospective students can allocate their limited energies more efficiently, bolstering the likelihood of their acceptance and future success in the department(s) to which they apply.
Each function resides alongside its documentation in the accordion of the corresponding name below. The Shiny App documentation follows, and finally the appendix of unused code blocks is presented, though it is undocumented.
The remoteDriver command creates the remote driver instance that drives the hidden Chrome Selenium browser. RSelenium is especially useful when it is necessary to obtain CSS properties of page elements that are either computed on page load or resolved from JS and CSS scripts, such as container dimensions, text size, etc. This feature proves useful in the findDiv function for identifying the container div by a process that distills divs by their dimensions. RSelenium can also change hard-coded attributes of specific elements on a page so that they can be passed to rvest, where they can be easily identified and scraped via the attribute RSelenium assigned.
# For Help: vignette('RSelenium-docker', package = 'RSelenium')
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 32768L, browserName = "chrome")
remDr$open(silent = T)
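As a brief illustration of that second use (a sketch only, using the McGill profile page that also appears in the appendix; the data-scrape attribute name is invented for the example), the driver can read a resolved CSS property and stamp a temporary attribute that rvest can then select on:
# assumes remDr is the open remote driver created above
remDr$navigate("https://www.mcgill.ca/philosophy/people/faculty/al-saji")
wE <- remDr$findElement(using = "css", "#page-title")
wE$getElementValueOfCssProperty("font-size")  # the computed value, e.g. "30px"
# tag the element so rvest can locate it by an attribute we control
remDr$executeScript("arguments[0].setAttribute('data-scrape', 'title');", list(wE))
read_html(remDr$getPageSource()[[1]]) %>% html_node("[data-scrape='title']") %>% html_text()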
PhantomJS is a headless browser that is slightly faster than Chrome. I attempted to use PhantomJS to speed up the time-expensive scraping operations, but found that it calculates dimensions quite differently from Chrome, rendering the element-dimension distillation methods previously coded against the Chrome remote driver useless. For this reason, Chrome was used for all RSelenium functions herein.
# docker run -d -p 8910:8910 wernight/phantomjs phantomjs --webdriver=8910
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 8910L, browserName = "phantomjs")
remDr$open(silent = T)
This df is the list of all input schools and URLs of the philosophy faculty pages used in the project.
dfSch <- data.frame(Name = c("UBC", "NYU", "McGill", "UToronto", "Duke", "PennState",
"Northwestern", "UNCCH", "Harvard", "Yale", "BostonC", "Emory", "BostonU", "USC",
"UCSD"), URL = c("http://philosophy.ubc.ca/people/core-faculty/", "https://as.nyu.edu/content/nyu-as/as/departments/philosophy/directory/faculty.html",
"https://www.mcgill.ca/philosophy/people/faculty", "http://philosophy.utoronto.ca/directory-category/main-faculty/",
"https://philosophy.duke.edu/people/faculty", "http://philosophy.la.psu.edu/directory/graduate-faculty",
"http://www.philosophy.northwestern.edu/people/continuing-faculty/", "https://philosophy.unc.edu/people-page/faculty/",
"https://philosophy.fas.harvard.edu/people-terms/department-faculty", "https://philosophy.yale.edu/people/faculty",
"https://www.bc.edu/offices/stserv/academic/univcat/faculty/phil.html", "http://philosophy.emory.edu/home/people/faculty/index.html",
"http://www.bu.edu/philo/people/faculty/", "http://dornsife.usc.edu/cf/phil/phil_faculty_roster.cfm",
"https://philosophy.ucsd.edu/people/faculty.html"), stringsAsFactors = F)
This is a mode function, found via a Stack Overflow post, that is used in the findBios function to identify the element within the main containing div that repeats most frequently, singling out the element that encapsulates the listings of faculty profiles (as this is the most frequently recurring element within the main div on all of the pages).
Mode <- function(x, na.rm = FALSE) {
if (na.rm) {
x = x[!is.na(x)]
}
ux <- unique(x)
return(ux[which.max(tabulate(match(x, ux)))])
}
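For example, given a vector of (hypothetical) element classes:
# the class that recurs most often among the candidate elements wins
Mode(c("views-row", "views-row", "bio-card", NA, "views-row"), na.rm = TRUE)
#> [1] "views-row"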
#----------------Heading Date--------------------#
Using easily recognizable commenting like this makes it much easier to find your way around what can become byzantine code. If you haven't done so already, make a global snippet for this purpose triggered by something like ##. You can also use programs like AutoHotkey to automate the insertion of a date/time stamp to keep track of revision and creation dates.
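A small helper in that spirit (a sketch; an actual ## snippet would live in your editor's snippet settings rather than in R code):
# prints a dated section header ready to paste into a script
sectionHeader <- function(label) {
    cat(sprintf("#----------------%s %s--------------------#\n",
        label, format(Sys.time(), "%Y-%m-%d %H%M")))
}
sectionHeader("Create xPaths Table")  # e.g. '#----------------Create xPaths Table 2017-12-08 1929--------------------#'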
/html/body/*[self::h1 or self::h2 or self::h3]/text()
which could then be evaluated for whether it contained "faculty". XPath 2.0 includes a regex-capable matches() function that would allow the search to return only those nodes matching the [Ff]aculty regex, but rvest does not appear to support XPath 2.0, and neither do the Chrome dev tools (unless I made a syntax error in testing the XPath regex that I wasn't able to catch based on the documentation and SO posts I read).
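In XPath 1.0, a common if clumsy workaround, sketched here rather than used in the function below, is to pair translate() with contains() to approximate a case-insensitive match:
# lowercase the heading text with translate(), then test for 'faculty'
xp <- paste0("//*[self::h1 or self::h2 or self::h3]",
    "[contains(translate(text(), 'FACULTY', 'faculty'), 'faculty')]")
# read_html(dfSch$URL[1]) %>% html_nodes(xpath = xp)  # headings mentioning Faculty/faculty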
findDiv <- function(url) {
pg <- read_html(url)
lidiv <- pg %>% html_nodes(xpath = "//div")
# Parse li, articles, td, and div and find the mode
candiv <- vector("character")
for (i in seq_along(lidiv)) {
candiv[i] <- ifelse(lidiv[[i]] %>% html_node(css = "h1") %>% html_text() %>%
str_detect(".*[Ff]aculty.*") == T, lidiv[[i]] %>% html_node(css = "h1") %>%
html_text() %>% str_detect(".*[Ff]aculty.*"), NA)
if (is.na(candiv[i])) {
candiv[i] <- ifelse(lidiv[[i]] %>% html_node(css = "h2") %>% html_text() %>%
str_detect(".*[Ff]aculty.*") == T, lidiv[[i]] %>% html_node(css = "h2") %>%
html_text() %>% str_detect(".*[Ff]aculty.*"), NA)
# can likely combine all h1-3 tags into a single html_node call
}
}
lidiv <- lidiv[!is.na(candiv)]
candiv <- candiv[!is.na(candiv)]
    if (is_empty(candiv)) {
        return(NA)  # no headings containing 'faculty' were found on the page
    }
candiv <- data.frame(attr = 1:length(candiv), v = 1:length(candiv), stringsAsFactors = F)
for (i in seq_along(lidiv)) {
candiv$attr[i] <- "class"
candiv$v[i] <- lidiv[[i]] %>% html_attr("class")
if (is.na(candiv[, 2][i])) {
candiv$attr[i] <- "id"
candiv$v[i] <- lidiv[[i]] %>% html_attr("id")
}
}
# find the div classes or Ids with a heading that contains faculty
dfdiv <- data.frame(attr = candiv$attr, v = candiv$v, h = rep(NA, length(candiv$attr)),
w = rep(NA, length(candiv$attr)), aSize = rep(NA, length(candiv$attr)), stringsAsFactors = F) #df to store filtering characteristics
remDr$navigate(url)
remDr$setImplicitWaitTimeout(milliseconds = 4000)
wS <- remDr$getWindowSize() %>% str_extract("[0-9]+") %>% as.numeric()
for (i in seq_along(dfdiv$attr)) {
div <- tryCatch({
remDr$findElement(using = "xpath", paste("//*[@", candiv$attr[i], "='",
candiv$v[[i]], "']", sep = ""))
}, error = function(err) {
return(NA)
})
if (!is.na(div)) {
# get the width of the div (the largest and smallest width are likely not the div
# with the teachers)
dfdiv$h[i] <- div$getElementSize()$height
dfdiv$w[i] <- div$getElementSize()$width
# Get the size of the first link (the largest is likely the professors)
dfdiv$aSize[i] <- tryCatch({
div <- remDr$findElement(using = "xpath", paste("//*[@", candiv$attr[i],
"='", candiv$v[[i]], "']", sep = ""))
a <- div$findChildElement(using = "xpath", "//a")
a$getElementValueOfCssProperty("font-size") %>% unlist() %>% str_extract("\\d\\d?") %>%
as.numeric()
}, error = function(err) {
return(NA)
})
}
}
# filter methods
dfs <- list()
dfs[[1]] <- dfdiv %>% distinct()
dfs[[2]] <- dfs[[1]] %>% filter(aSize == max(aSize, na.rm = T))
dfs[[3]] <- dfs[[2]] %>% filter(w > wS[2] * 0.3 & w < wS[2] & h > 0.3 * wS[1])
dfs[[4]] <- dfs[[3]] %>% filter(str_detect(v, "(?:[Cc]ontainer)|(?:[Cc]ontent)|(?:[Cc]olumn)|(?:[Mm]ain)"))
dfs[[5]] <- dfs[[4]] %>% filter(w > wS[2] * 0.5 & w < wS[2] * 0.975)
dfsrows <- data.frame(n = 1:5, stringsAsFactors = F)
for (i in seq_along(dfs)) {
dfsrows$rows[i] <- nrow(dfs[[i]])
}
dfsrows <- dfsrows %>% filter(rows != 0)
dfsrows <- dfsrows %>% filter(rows == min(rows)) %>% arrange(rows)
dfnum <- dfsrows$n[1]
out <- dfs[[dfnum]]
if (nrow(out) > 1) {
out <- out %>% filter(h == min(out$h))
}
return(out)
}
As one might imagine, this function, with 15 schools, takes a good amount of time to execute. The save and load functions are useful for saving the data from a successfully completed run so that it doesn't have to be rerun when the project is reloaded.
# Find main divs given a df with names and URLs
findMaindivs <- function(df) {
canDivs <- rep(list(list()), length(df$Name))
names(canDivs) <- df$Name
for (i in seq_along(canDivs)) {
canDivs[[i]][[1]] <- df$URL[[i]]
}
for (i in seq_along(df$URL)) {
print(df$URL[[i]])
canDivs[[i]][[2]] <- findDiv(df$URL[[i]])
}
return(canDivs)
}
# canDivs <- findMaindivs(dfSch) save(tDivs,file='tDivs.Rdata')
# load('~/Northeastern/Git/da5020project/canDivs.RData')
# load('~/Northeastern/Git/da5020project/tDivs.RData')
The findBios function collects the repeating elements (div, li, article, td) inside each page's main container div. The cl slot is for getting the class of these repeating elements, and the p slot is for capturing all of the p-tag-wrapped text on the page. This capture of p tags was coded in later, when I found that it was exceedingly difficult to capture the brief bio information individually on each of the pages, and when it was noted that Boston College only had paragraph tags and no further information for their faculty.
# dfDiv <- canDivs[[1]]
findBios <- function(dfDiv) {
if (!is.na(dfDiv[[2]]) == T) {
attr <- as.character(dfDiv[[2]]$attr[[1]]) #get the attr type
v <- as.character(dfDiv[[2]]$v[[1]]) #get the attr name
htm <- read_html(dfDiv[[1]]) #get the URL
el <- vector("list", 6)
names(el) <- c("div", "li", "article", "td", "cl", "p") #instantiate & label list
# get divs,lis,articles, tables, and p tags in main container
el$div <- htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']", sep = "")) %>%
html_nodes(css = "div")
el$li <- htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']", sep = "")) %>%
html_nodes(css = "li")
el$article <- htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']",
sep = "")) %>% html_nodes(css = "article")
if (length(htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']",
sep = "")) %>% html_node(css = "table")) > 0) {
el$td <- htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']",
sep = "")) %>% html_node(css = "table") %>% html_nodes("td") %>%
html_text() %>% gsub("\\\n|\\\t|\\\r", "", .) %>% gsub("([a-z])([A-Z])",
"\\1 \\2", .) %>% gsub("\\s{3,}", "", .)
el$td <- el$td[nchar(el$td) > 7]
}
# el tag types, filter elements with both img and a tags
for (i in 1:4) {
tags <- el[[i]]
n <- i
if (length(tags) > 0) {
fil <- vector("logical")
for (i in seq_along(tags)) {
ch <- tags[[i]] %>% html_children()
img <- ch %>% str_detect("\\<img")
a <- ch %>% str_detect("\\<a")
c <- which(intersect(img, a))
fil[i] <- ifelse(length(c) >= 1, T, F)
}
}
if (sum(fil, na.rm = T) > 0) {
el[[n]] <- subset(tags, fil)
}
}
# Create a vector with the classes of those nodes containing both img and a
vClass <- vector("character")
for (i in seq_along(1:4)) {
tags <- el[[i]]
n <- i + 4
if (length(tags) > 0) {
for (i in seq_along(tags)) {
vcl <- vector("character")
vcl <- sapply(tags, html_attr, "class")
vcl <- vcl[!is.na(vcl)]
}
vClass <- append(vClass, vcl, after = length(vClass))
}
}
# store the classes in a DF in the 5th position in the list
el[[5]] <- data.frame(v = vClass, stringsAsFactors = F) %>% group_by(v) %>%
summarize(cnt = n()) %>% filter(cnt > 5) %>% filter(cnt == Mode(cnt)) %>%
unique()
if (nrow(el[[5]]) == 0) {
el$p <- htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']",
sep = "")) %>% html_text()
}
} else {
el <- NA
}
return(el)
}
The findAllBios function iterates the findBios function over each of the candidate divs in the output list from findMaindivs.
findAllBios <- function(canDivs) {
cDivs <- canDivs
for (i in seq_along(cDivs)) {
n <- i
cBios <- findBios(cDivs[[i]])
cDivs[[n]][[3]] <- cBios
}
return(cDivs)
}
# canDiv <- tDivs[[2]] test <- getBio(canDiv)
getBio <- function(canDiv) {
cDiv <- canDiv
cDiv[[4]] <- list()
url <- cDiv[[1]]
remDr$navigate(url)
if (length(cDiv[[3]]$cl$v) != 0) {
v <- cDiv[[3]]$cl$v
if (length(v) > 1) {
df <- data.frame(v = v, h = rep(NA, length(v)), stringsAsFactors = F)
for (i in seq_along(v)) {
cl <- v[i]
wE <- remDr$findElement("xpath", paste("//*[@class='", cl, "']",
sep = ""))
df$h[i] <- wE$getElementSize()$height
}
df <- df %>% filter(h == max(df$h))
v <- as.character(df$v[nrow(df)])
}
htm <- read_html(url)
bios <- htm %>% html_nodes(xpath = paste("//*[@class='", v, "']", sep = ""))
profs <- vector("list", length(bios))
names <- vector("character")
for (i in seq_along(bios)) {
n <- i
            a <- bios[[i]] %>% html_nodes(css = "a")  # need to ensure the first a is actually the prof name
if (length(a) > 1) {
asort <- vector("logical", length(a))
for (i in seq_along(a)) {
asort[i] <- str_detect(a[[i]], "\\<img|\\@") == F & str_detect(a[[i]],
"href") == T & nchar(a[[i]] %>% html_text()) > 1
}
a <- subset(a, asort)
}
name <- ifelse(nchar(a %>% html_text()) != 0, a %>% html_text(), NA)
if (length(name) != 0) {
if (!is.na(name) & length(name) > 1) {
name <- name[1]
}
profs[[n]]$Name <- name
print(profs[[n]]$Name)
fsnm <- str_match_all(profs[[n]]$Name, "\\w+.?\\w+.?")[[1]][, 1]
plt <- paste(fsnm[1], fsnm[2], sep = " ")
names[n] <- ifelse(length(plt) != 0, plt, NA)
profs[[n]]$href <- tryCatch({
profs[[n]]$href <- remDr$findElement("partial link text", profs[[n]]$Name)$getElementAttribute("href")
}, error = function(err) {
return(NA)
})
if (is.na(profs[[n]]$href)) {
profs[[n]]$href <- tryCatch({
profs[[n]]$href <- remDr$findElement("partial link text", plt)$getElementAttribute("href")
}, error = function(err) {
return(NA)
})
}
print(profs[[n]]$href)
profs[[n]]$info <- bios[[n]] %>% html_text()
}
}
} else {
print("Using RSelenium")
remDr$navigate(url)
tda <- unique(cDiv[[3]][["td"]])
profs <- vector("list", length(tda))
names <- vector("character")
for (i in seq_along(tda)) {
wE <- remDr$findElement("partial link text", str_match(tda[[i]], "\\w+\\b\\s\\w+\\b")[1,
])
profs[[i]]$Name <- wE$getElementText() %>% unlist()
print(profs[[i]]$Name)
names[i] <- profs[[i]]$Name
profs[[i]]$href <- wE$getElementAttribute("href") %>% unlist()
print(profs[[i]]$href)
profs[[i]]$info <- tda[[i]]
}
}
names(profs) <- names
cDiv[[4]] <- profs
return(cDiv)
}
The iterative, pluralized version of the getBio function, for iterating over the full list of URLs.
getAllBios <- function(canDivs) {
cDivs <- canDivs
for (i in seq_along(cDivs)) {
n <- i
theDiv <- cDivs[[i]]
print(theDiv[[1]])
if (!is.na(theDiv[[3]])) {
theDiv <- getBio(theDiv)
cDivs[[n]] <- theDiv
}
}
return(cDivs)
}
Label the information contained within the top level of the School list in the candidate div (now tDivs) list.
for (i in seq_along(tDivs)) {
names(tDivs[[i]]) <- c("URL", "divdf", "IndexData", "ProfData")
}
# canDivsLi <- nDivs[[1]] i <- 13 test <- DtlEmail(nDivs[[3]])
DtlEmail <- function(canDivsLi) {
cDiv <- canDivsLi
if (is.na(cDiv[[2]])) {
return(cDiv)
}
for (i in seq_along(cDiv[[4]])) {
l <- i
nm <- names(cDiv[[4]])[i] %>% str_extract("\\w+.?\\w+.?\\b")
ln <- names(cDiv[[4]])[i] %>% str_extract("\\w+.?\\w+.?\\b$")
rnm <- paste("[", tolower(substr(nm, 1, 1)), toupper(substr(nm, 1, 1)), "]",
substr(nm, 2, 30), sep = "")
rln <- paste("[", tolower(substr(ln, 1, 1)), toupper(substr(ln, 1, 1)), "]",
substr(ln, 2, 30), sep = "")
url <- cDiv[[4]][[i]]$href[[1]] #url to detail page
if (is.null(url)) {
next
}
print(url)
purl <- parse_url(url)
purlp <- purl$path %>% str_match("([\\_+A-Za-z\\d\\%\\&\\#]+)\\/?$") %>%
unlist()
purlp <- purlp[2]
surl <- purl$hostname %>% str_extract("[a-zA-Z0-9-_%]+\\.[a-zA-Z-_%]+$")
if (length(purl$scheme) > 0 & length(purl$hostname) > 0)
{
htm <- tryCatch({
read_html(url)
}, error = function(err) {
return(NA)
})
if (any(is.na(htm))) {
next
}
a <- htm %>% str_extract_all("\\b[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}\\b") %>%
unlist()
if (length(a) == 0) {
a <- NULL
} else if (is.na(a)) {
a <- NULL
}
if (length(a) >= 1) {
estr <- ifelse(any(str_detect(a, rnm), str_detect(a, rln), str_detect(a,
purlp)), T, F)
if (estr == F) {
a <- NULL
}
}
if (length(a) < 1) {
print("Using RSelenium")
tryCatch({
remDr$navigate(url)
}, error = function(err) {
next
})
a <- vector("character")
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "@")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", nm)
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(nm))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", ln)
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
                aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(ln))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(surl))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "Email")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "E-mail")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
a <- a %>% str_extract_all("\\b[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}\\b") %>%
unlist() %>% gsub("\\%40", "@", .)
a <- unique(a)
if (!is.na(a %>% any(str_detect(., rnm)))) {
a <- subset(a, str_detect(a, rnm))
}
if (!is.na(a %>% any(str_detect(., rln)))) {
a <- subset(a, str_detect(a, rln))
}
if (!is.na(a %>% any(str_detect(., purlp)))) {
a <- subset(a, str_detect(a, purlp))
}
a <- gsub("mailto:", "", a)
print(a)
cDiv[[4]][[l]]$email <- a
} #if statement
} #for loop
return(cDiv)
}
The getDtls function uses xml_parents() to parse the DOM tree and get all of the parents of the node containing the professor's name, identifies whether they have an id or class containing content, main, or container, and makes a subset of those parents that do.
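A minimal standalone sketch of that idea (nmN stands for the node matched on the professor's name, as in getDtls below):
# walk up from the matched node and keep ancestors whose id names a content area
parents <- xml_parents(nmN)
keep <- parents %>% html_attr("id", default = "") %>% str_detect("(?:[Cc]ontent)|(?:[Mm]ain)|(?:[Cc]ontainer)")
containers <- subset(parents, keep)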
# canDivsLi <- nDivs[[2]] i <- 4
# test <- DtlText(canDivsLi)
getDtls <- function(canDivsLi) {
cDiv <- canDivsLi
if (is.na(cDiv[[2]])) {
return(cDiv)
}
for (i in seq_along(cDiv[["ProfData"]])) {
l <- i
        nm <- names(cDiv[["ProfData"]])[i] %>% str_extract("\\w+\\.?\\w+\\.?\\b")  #prof first name
        ln <- names(cDiv[["ProfData"]])[i] %>% str_extract("\\w+\\.?\\w+\\.?\\b$")  #prof last name
rnm <- paste("[", tolower(substr(nm, 1, 1)), toupper(substr(nm, 1, 1)), "]",
substr(nm, 2, 30), sep = "")
rln <- paste("[", tolower(substr(ln, 1, 1)), toupper(substr(ln, 1, 1)), "]",
substr(ln, 2, 30), sep = "")
xp <- paste("//*[contains(text(),'", nm, "')][not(self::script)][not(self::title)]",
sep = "")
url <- cDiv[["ProfData"]][[i]]$href[[1]] #url to detail page
allTTags <- c("h1", "h2", "h3", "h4", "h5", "h6", "p", "span", "a", "li",
"td", "strong", "em", "article")
if (length(url) == 0) {
next
}
if (is.null(url) | is.na(url)) {
next
}
print(url)
purl <- parse_url(url)
purlp <- purl$path %>% str_match("([\\_+A-Za-z\\d\\%\\&\\#]+)\\/?(?:\\.html)?$") %>%
unlist()
purlp <- purlp[2]
surl <- purl$hostname %>% str_extract("[a-zA-Z0-9-_%]+\\.[a-zA-Z-_%]+$")
if (length(purl$scheme) > 0 & length(purl$hostname) > 0)
{
htm <- tryCatch({
read_html(url)
}, error = function(err) {
return(NA)
})
if (any(is.na(htm))) {
next
}
nmN <- htm %>% html_nodes(xpath = xp)
if (length(nmN) < 1) {
next
}
cNodes <- vector("list")
if (any(tf <- xml_parents(nmN) %>% html_attr("id", default = F) %>%
str_detect("(?:[Cc]ontent)|(?:[Mm]ain)|(?:[Cc]ontainer)") %>% unlist())) {
cNodes[[1]] <- subset(xml_parents(nmN), tf)
}
if (any(tf <- xml_parents(nmN) %>% html_attr("class", default = F) %>%
str_detect("(?:[Cc]ontent)|(?:[Mm]ain)|(?:[Cc]ontainer)") %>% unlist())) {
cNodes[[2]] <- subset(xml_parents(nmN), tf)
}
if (any(tf <- xml_siblings(nmN) %>% html_attr("id", default = F) %>%
str_detect("(?:[Cc]ontent)|(?:[Mm]ain)|(?:[Cc]ontainer)") %>% unlist())) {
cNodes[[3]] <- subset(xml_siblings(nmN), tf)
}
if (any(tf <- xml_siblings(nmN) %>% html_attr("class", default = F) %>%
str_detect("(?:[Cc]ontent)|(?:[Mm]ain)|(?:[Cc]ontainer)") %>% unlist())) {
cNodes[[4]] <- subset(xml_siblings(nmN), tf)
}
#----------------Create xPaths Table 2017-12-08 1929--------------------#
xps <- sapply(cNodes, function(x) {
if (!is.null(x)) {
xml_path(x)
}
}, simplify = T) %>% unlist()
dfcN <- data.frame(Path = xps, stringsAsFactors = F) %>% mutate(nchar = nchar(Path)) %>%
arrange(desc(nchar))
allNodes <- list()
for (i in seq_along(dfcN$Path)) {
allNodes[[i]] <- htm %>% html_node(xpath = dfcN$Path[i]) %>% html_nodes("*")
nodeNames <- allNodes[[i]] %>% html_nodes("*") %>% html_name() %>%
unlist() %>% unique()
dfcN$Nodes[i] <- length(intersect(nodeNames, allTTags))
aN <- vector()
for (p in seq_along(allNodes[[i]])) {
aN[p] <- ifelse(length(allNodes[[i]][[p]] %>% html_attrs() %>%
str_detect("nav|navigation|body")) > 0, allNodes[[i]][[p]] %>%
html_attrs() %>% str_detect("nav|navigation|body"), F)
}
dfcN$Fil[i] <- any(aN)
}
dfcN <- dfcN %>% filter(Nodes > 2 & Fil != T) %>% arrange(desc(Nodes))
n <- 0
dfNodes <- data.frame(Order = rep(NA, 2), stringsAsFactors = F)
while (nrow(dfNodes) < 3) {
n <- n + 1
print(dfcN$Path[n])
                allNodes <- htm %>% html_node(xpath = dfcN$Path[n]) %>% html_nodes("*")
#----------------Create Detail DF 2017-12-08 2026--------------------#
dfNodes <- data.frame(Order = rep(NA, length(allNodes)), Dup = rep(NA,
length(allNodes)), stringsAsFactors = F)
for (i in seq_along(allNodes)) {
dfNodes$Order[i] <- i
dfNodes$Tag[i] <- allNodes[[i]] %>% html_name()
dfNodes$Text[i] <- allNodes[[i]] %>% html_text() %>% gsub("\\\n|\\\t|\\{|\\}",
"", .) %>% gsub("^\\s+", "", .) %>% gsub("\\s+$", "", .) %>%
gsub("\\s{2,}", " ", .) %>% gsub("[][!#$%()*,.:;<=>@^_`|~.{}\\\\/]",
"", .) %>% gsub("([a-z])([A-Z])", "\\1 \\2", .)
}
dfNodes <- dfNodes %>% mutate(nchar = nchar(Text)) %>% filter(nchar >
3) %>% select(-nchar)
dfNodes <- dfNodes[!duplicated(dfNodes$Text), ]
for (i in 1:length(dfNodes$Text)) {
Text <- vector()
notiRow <- seq(1:length(dfNodes$Text))[-i]
for (m in notiRow) {
Text[m] <- substr(dfNodes$Text[m], 1, 200)
}
g <- vector("logical")
g[1] <- any(grepl(substr(dfNodes$Text[i], 1, 200), Text))
dfNodes$Dup[i] <- any(g)
}
dfNodes <- dfNodes %>% filter(Dup == F)
if (n == length(dfcN$Path)) {
break
}
}
print(substr(dfNodes$Text[2], 1, 100))
cDiv[["ProfData"]][[l]]$Detail <- dfNodes
#----------------DtlEmail 2017-12-09 1015--------------------#
a <- htm %>% str_extract_all("\\b[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}\\b") %>%
unlist()
if (length(a) == 0) {
a <- NULL
} else if (is.na(a)) {
a <- NULL
}
if (length(a) >= 1) {
estr <- ifelse(any(str_detect(a, rnm), str_detect(a, rln), str_detect(a,
purlp)), T, F)
if (estr == F) {
a <- NULL
}
}
if (length(a) < 1) {
print("Using RSelenium")
tryCatch({
remDr$navigate(url)
}, error = function(err) {
next
})
a <- vector("character")
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "@")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", nm)
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(nm))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", ln)
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
                aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(ln))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(surl))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "Email")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "E-mail")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
a <- a %>% str_extract_all("\\b[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}\\b") %>%
unlist() %>% gsub("\\%40", "@", .)
a <- unique(a)
if (!is.na(a %>% any(str_detect(., rnm)))) {
a <- subset(a, str_detect(a, rnm))
}
if (!is.na(a %>% any(str_detect(., rln)))) {
a <- subset(a, str_detect(a, rln))
}
if (!is.na(a %>% any(str_detect(., purlp)))) {
a <- subset(a, str_detect(a, purlp))
}
a <- gsub("mailto:", "", a)
print(a)
cDiv[["ProfData"]][[l]]$email <- a
#----------------End DtlEmail--------------------#
# if nmNodes>0
} #if length(url)>0
} #For I in seq_along candivs
return(cDiv)
}
The iterative version of getDtls that runs over the entire list of schools.
getAllDtl <- function(canDivs) {
for (i in seq_along(canDivs)) {
cDiv <- canDivs[[i]]
canDivs[[i]] <- getDtls(cDiv)
}
return(canDivs)
}
# save(nDivs,file='Data.Rdata') nDivs <- getAllDtl(nDivs)
The cleanData function simply removes from the list all of the content that was involved in extracting the data, leaving just the data relevant to the professors. The commented lines where values are assigned NULL handle particular exceptions in the data that repeatedly caused the scraper to error.
cleanData <- function(cDivs) {
for (i in seq_along(cDivs)) {
n <- i
cDivs[[n]] <- cDivs[[n]][-2]
cDivs[[n]][[2]] <- subset(cDivs[[n]][[2]], sapply(cDivs[[n]][[2]], length,
simplify = T) != 0)
cDivs[[n]][[2]] <- subset(cDivs[[n]][[2]], sapply(cDivs[[n]][[2]], class,
simplify = T) != c("xml_nodeset"))
cDivs[[n]][[2]] <- subset(cDivs[[n]][[2]], sapply(cDivs[[n]][[2]], is_tibble,
simplify = T) != T)
for (i in 1:2) {
cDivs[[n]][[2]] <- subset(cDivs[[n]][[2]], names(cDivs[[n]][[2]]) !=
c("div", "cl"))
}
}
l <- length(cDivs) - 1
for (i in 1:l) {
names(cDivs[[i]]) <- c("URL", "IndexData", "ProfData")
}
    return(cDivs)
}
# nDivs[[3]][['IndexData']]$p <- NULL nDivs[[4]][['IndexData']]$p <- NULL
# nDivs[[10]][['IndexData']]$p <- NULL save(nDivs,file='Data.Rdata')
The master function, bundling each of the modular functions into a single scraping function that takes a data frame with a column of school names and a column of URLs to faculty pages. Note: this is as yet untested. Given the amount of supervision required to work around unforeseen errors while extracting the original batch of data over the course of ~50 hrs, I haven't had the opportunity to test it on another data set.
schoolScrape <- function(dfSch) {
df <- dfSch
tDivs <- findMaindivs(df)
tDivs <- findAllBios(tDivs)
tDivs <- getAllBios(tDivs)
for (i in seq_along(tDivs)) {
names(tDivs[[i]]) <- c("URL", "divdf", "IndexData", "ProfData")
}
tDivs <- getAllDtl(tDivs)
tDivs <- cleanData(tDivs)
return(tDivs)
}
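In principle (untested, per the note above), running it on a new set of pages would look like this, with hypothetical school names and URLs:
newSch <- data.frame(Name = c("SchoolA", "SchoolB"), URL = c("http://philosophy.schoola.edu/people/faculty/",
    "http://www.schoolb.edu/philosophy/faculty.html"), stringsAsFactors = F)
# newDivs <- schoolScrape(newSch)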
As is likely evident after reading this documentation, developing this data scraper was no small task! Despite the numerous hours that went into it, there is a lot more to do to refine the efficiency of the code and to account for exceptions. In the future, after the suggested improvements are implemented, I'd like to try the scraper on another data set to see how it does. If using this scraper on another list of schools and/or departments is of interest to whoever is reading this, contact me and we can see about doing so!
This Shiny application allows a prospective student to browse professors at the schools that were scraped and see wordclouds built from their profile information. The data is cleaned using the tm and SnowballC packages, and wordcloud2 is used to make a cloud of the text. The words Interests, Publications, and Education are removed from the text because they are repetitive, non-descriptive (for the purposes here) words found on every faculty profile.
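As a toy illustration of the cleaning pipeline used in the server function below (the example text is invented):
library("tm")
library("wordcloud2")
txt <- "Philosophy of mind, consciousness, and the philosophy of perception"
corp <- Corpus(VectorSource(removePunctuation(txt)))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))
m <- as.matrix(TermDocumentMatrix(corp))
d <- data.frame(word = rownames(m), freq = rowSums(m))
wordcloud2(data = d)  # 'philosophy' (freq 2) renders largest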
library("shiny")
library("tm")
library("SnowballC")
library("RColorBrewer")
library("wordcloud2")
library("tidyverse")
File <- "app.R"
Files <- list.files(path = file.path("~"), recursive = T, include.dirs = T)
Path.file <- names(unlist(sapply(Files, grep, pattern = File))[1])
Dir.wd <- dirname(Path.file)
Dir.wd <- gsub("[a-zA-Z0-9_\\.]+\\.[a-zA-Z0-9_\\.]+$", "", Dir.wd)
load(paste("~/", Dir.wd, "/Data.Rdata", sep = ""))
dfSch <- data.frame(Name = c("UBC", "NYU", "McGill", "UToronto", "Duke", "PennState",
"Northwestern", "UNCCH", "Harvard", "Yale", "BostonC", "Emory", "BostonU", "USC",
"UCSD"), URL = c("http://philosophy.ubc.ca/people/core-faculty/", "https://as.nyu.edu/content/nyu-as/as/departments/philosophy/directory/faculty.html",
"https://www.mcgill.ca/philosophy/people/faculty", "http://philosophy.utoronto.ca/directory-category/main-faculty/",
"https://philosophy.duke.edu/people/faculty", "http://philosophy.la.psu.edu/directory/graduate-faculty",
"http://www.philosophy.northwestern.edu/people/continuing-faculty/", "https://philosophy.unc.edu/people-page/faculty/",
"https://philosophy.fas.harvard.edu/people-terms/department-faculty", "https://philosophy.yale.edu/people/faculty",
"https://www.bc.edu/offices/stserv/academic/univcat/faculty/phil.html", "http://philosophy.emory.edu/home/people/faculty/index.html",
"http://www.bu.edu/philo/people/faculty/", "http://dornsife.usc.edu/cf/phil/phil_faculty_roster.cfm",
"https://philosophy.ucsd.edu/people/faculty.html"), stringsAsFactors = F)
ui <- fluidPage(theme = "bootstrap.min.css", titlePanel("Browse Professors"), fluidRow(column(3,
selectInput("sch", label = "School:", choices = dfSch$Name, selected = "UBC")),
column(3, offset = 1, uiOutput("prof"))), fluidRow(column(12, sliderInput("size",
"Size of wordcloud", min = 0.1, max = 4, step = 0.1, value = 0.5, round = FALSE,
ticks = TRUE))), fluidRow(column(12, mainPanel(wordcloud2Output("wordcloud2")))))
server <- function(input, output) {
output$prof <- renderUI({
profs <- names(nDivs[[input$sch]][["ProfData"]])
selectInput("prof", label = "Professors:", choices = profs, selected = 1)
})
#----------------WordCloud 2017-12-10 0844--------------------#
output$wordcloud2 <- renderWordcloud2({
# wordcloud2(demoFreqC, size=input$size)
inText <- removeWords(nDivs[[input$sch]][["ProfData"]][[input$prof]][["Detail"]]$Text,
c("[Ii]nterests", "[Pp]ublications", "[Ee]ducation")) %>% str_replace("([a-z])([A-Z])",
"\\1 \\2") %>% gsub("[^[:alnum:]///' ]", "", .) %>% removePunctuation() %>%
VectorSource() %>% Corpus() %>% tm_map(., content_transformer(tolower)) %>%
tm_map(removeNumbers) %>% tm_map(removeWords, stopwords("english")) %>%
tm_map(stripWhitespace)
dfT <- TermDocumentMatrix(inText)
m <- as.matrix(dfT)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
        wordcloud2(data = d, color = "random-light", shape = "circle", size = input$size)
})
}
shinyApp(ui, server)
The second Shiny application allows a prospective student to enter their interests and search the scraped faculty data for professors whose profiles match them. Matching professors are returned with their school, name, email, and profile link in a searchable table that can be copied, printed, or downloaded.
library("tidyverse")
library("shiny")
library("stringr")
library("DT")
library("plyr")
library("dplyr")
ui <- fluidPage(theme = "bootstrap.min.css",
titlePanel("Find Matches for your Interests"),
fluidRow(
    column(6,textAreaInput("interests", "Your Interests", value = "", cols = 1, rows = 10, placeholder = "Enter your Interests here, spaced evenly between words, with individual entries separated by a comma", resize = "both")
),
column(6,
p("Check the boxes below to begin sorting by your interests"),
checkboxGroupInput("ints", "Interest selection", choices = character(0), selected = character(0),
inline = T, width = "100%")
)
),
fluidRow(
column(12,
tags$em("Note: the Interests column in the output is a value correspondng to the query the professor matched with"),
tags$br(),
tags$em("IE: If your interests were 'Consciousness,Mind' then Consciousness is match 1 and Mind is match 2"),
tags$br(),
tags$em("A professor matching: "),
tags$ol(tags$li("Consciousness will have a value 1"),
tags$li("Mind will have a value of 2"),
tags$li("Consciousness & Mind will have a value of 1,2"),
tags$li("And so on.."))
)
),
fluidRow(
column(12,
DT::dataTableOutput('table',height="1200px")
)
)
)
server <- function(input, output,session) {
File <- "Search.R"
Files <- list.files(path=file.path("~"),recursive=T,include.dirs=T)
Path.file <- names(unlist(sapply(Files,grep,pattern=File))[1])
Dir.wd <- dirname(Path.file)
Dir.wd <- gsub("[a-zA-Z0-9_\\.]+\\.[a-zA-Z0-9_\\.]+$","",Dir.wd)
load(url("https://drive.google.com/file/d/13lyCD0ztQKSHzWcPqBXT9ysUraRDMU87/view?usp=sharing"))
selected <- reactive({
selected <- as.numeric(input$ints)
return(selected)
})
observe({
ints <- input$interests %>% str_split(",") %>% unlist() %>% gsub("^\\s?([a-zA-Z])","\\U\\1",.,perl=T) %>% gsub("\\s([a-zA-Z])"," \\U\\1",.,perl=T)
cV <- seq(1,length(ints))
# Can use character(0) to remove all choices
if (is.null(ints))
ints <- character(0)
# Can also set the label and select items
updateCheckboxGroupInput(session, "ints",
choiceNames = ints, choiceValues = cV,
selected = character(0)
)
})
profTable <- reactive({
regints <- reactive({input$interests %>% str_split(",") %>% unlist() %>% gsub("^\\s?([a-zA-Z])","[\\U\\1\\L\\1]",.,perl=T) %>% gsub("\\s([a-zA-Z])","\\\\\\s\\?[\\U\\1\\L\\1]",.,perl=T)})
regints <- regints()
selected <- reactive({
selected <- as.numeric(input$ints)
return(selected)
})
selected <- selected()
vSchool <- vector()
vProf <- vector()
vHref <- vector()
vEmail <- vector()
mItems <- vector("list")
mi <- vector()
miList <- list()
for(i in selected()){
search <- regints[i]
for(n in seq_along(nDivs)){
mItems <- lapply(nDivs[[n]]$ProfData,function(ch) grep(search, ch))
TF <- sapply(mItems, function(x) length(x) > 0)
if(any(TF)==F){next}
mItems <- subset(nDivs[[n]]$ProfData,TF)
if(length(mItems)==0){next}
for(m in seq_along(mItems)){
vSchool <- append(vSchool,names(nDivs)[n],length(vSchool))
if(length(mItems[[m]]$Name)>1){Name <- paste(mItems[[m]]$Name[1],mItems[[m]]$Name[2],sep="")}else{Name <- mItems[[m]]$Name}
if(length(mItems[[m]]$Name)<1){Name <- NA}
vProf <- append(vProf,Name,length(vProf))
if(length(mItems[[m]]$email)>1){email <- paste(mItems[[m]]$email[1],mItems[[m]]$email[2],sep="")}else{email <- mItems[[m]]$email}
if(length(mItems[[m]]$email)<1){email <- NA}
vEmail <- append(vEmail,email,length(vEmail))
if(length(unlist(mItems[[m]]$href))>1){href <- paste(unlist(mItems[[m]]$href)[1],unlist(mItems[[m]]$href)[2],sep="")}else{href <- unlist(mItems[[m]]$href)}
          if(length(unlist(mItems[[m]]$href))<1){href <- NA}
vHref <- append(vHref,href,length(vHref))
mi <- append(mi,i,length(mi))
}
}
miList[[i]] <- mi
}
miOut <- as.data.frame(sapply(miList, '[', seq(max(lengths(miList)))))
print(miOut)
df <- data.frame(School=vSchool,Professor=vProf,Email=vEmail,Profile=vHref,stringsAsFactors = F) %>% cbind(miOut) %>% group_by(School,Professor,Email,Profile) %>% unite(Interests, -School,-Professor,-Email,-Profile) %>% select(Interests,everything())
df <- aggregate(df[1], df[-1],
FUN = function(X) paste(unique(X), collapse=", "))
df$Interests <- df$Interests %>% gsub("[NA\\_0-9]+?\\_(\\d)","\\1",.,perl=T)
return(df)
})
output$table <- DT::renderDataTable({DT::datatable(profTable(),extensions = 'Buttons', options = list(pageLength = 10,
dom = 'Bfrtip',
buttons =
list('copy', 'print', list(
extend = 'collection',
buttons = c('csv', 'excel', 'pdf'),
text = 'Download'
))))
})
}
shinyApp(ui,server)
# nDivs[['NYU']] <- getDtls(nDivs[['NYU']]) Testing Candidate Finder
#----------------If cNodes 2017-12-08 1831--------------------#
# print(length(cNodes)) n <- 1 tags <- cNodes[[n]] %>% html_nodes('*') %>%
# html_name() %>% unique() %>% unlist()
# while(length(intersect(tags,allTTags))<3&n<=length(cNodes)){ n <- n+1 tags <-
# cNodes[[n]] %>% html_nodes('*') %>% html_name() %>% unique() %>% unlist()
# print(paste('while:',n,sep='')) } #move up the parent tree, or across the
# sibling tree til the appropriate container is foundrespectively
# print(cNodes[[n]] %>% html_attrs())
#----------------Single out Text Nodes 2017-12-08 2027--------------------#
# tryCatch({remDr$navigate(url)},error = function(err) { next}) for(i in
# seq_along(dfcN$Path)){ dfcN$h[i] <-
# remDr$findElement('xpath',dfcN$Path[i])$getElementSize()$height dfcN$w[i] <-
# remDr$findElement('xpath',dfcN$Path[i])$getElementSize()$width } dfcN <- dfcN
# %>% arrange(desc(h),desc(w))
system.time(tDivs <- getAllBios(canDivs))
save(nDivs, file = "Data.Rdata")
system.time(profs2 <- getBio(canDivs[[2]]))
bioDiv15 <- findBios(canDivs[[15]])
canDivs <- findAllBios(canDivs)
dfs <- list()
dfs[[1]] <- findDiv(dfSch$URL[9])
canDivs <- list()
canDivs[[2]]
nDivs$NYU$ProfData[["Ned Block"]]$Detail
#'C:\Program Files\MongoDB\Server\3.4\bin\mongod.exe' --dbpath C:\Users\Stephen\Documents\Northeastern\Git\da5020\HWK12\mongodb
library("rmongodb")
library("mongolite")
cc <- mongo.create()
mongo.is.connected(cc)
mongo.get.databases(cc)
dbBr <- mongo(collection = "Brief", db = "cc")
dbBe <- mongo(collection = "Bio", db = "cc")
dbCV <- mongo(collection = "CV", db = "cc")
dbP <- mongo(collection = "Personal", db = "cc")
dbDiv <- mongo(collection = "canDivs", db = "cc")
# pg <- read_html(remDr$getPageSource()[[1]])
remDr$navigate("https://www.mcgill.ca/philosophy/people/faculty/al-saji")
wE <- remDr$findElement("xpath", "//*[@id='page-title']/parent::div")
wE$getElementAttribute("id")
gettext <- function(elems) {
text <- ifelse(length(elems$getElementText()) != 0, elems$getElementText(), NA)
return(text)
}
a[[112]]$getElementAttribute("href")
a[[112]]$getElementSize()$width
div[[112]]$getElementText()
a[[12]]$getElementValueOfCssProperty("font-size")
a[[112]]$getElementValueOfCssProperty("font-size")
a[[112]]$findElements(using = "xpath", "//div")
webElement
library("microbenchmark")
canDivs2 <- canDivs
microbenchmark(canDivsa <- scrURLs(canDivs2), times = 1)
scrURLs <- function(canDivs) {
li <- canDivs
for (i in seq_along(li)) {
# Check to see if the dataframe exists for the URL
if (length(li[[i]]) > 1) {
# canDivs <- canDivs[sapply(canDivs, is.null)]
wmin <- min(li[[i]][[2]]$w, na.rm = T)
hmin <- min(li[[i]][[2]]$h, na.rm = T)
# If the DF has more than 1 row, filter it
if (nrow(li[[i]][[2]]) > 1) {
li[[i]][[2]] <- li[[i]][[2]] %>% filter(w == wmin & h == hmin)
}
url <- li[[i]][[1]]
print(url)
attr <- as.character(li[[i]][[2]]$attr[[1]])
v <- as.character(li[[i]][[2]]$v[[1]])
remDr$navigate(url)
htm <- remDr$getPageSource()[[1]]
a <- htm %>% html_node("xpath", paste("//*[@", attr, "='", v, "']", sep = "")) %>%
html_nodes("css", "a")
dfa <- data.frame(Name = 1:length(a), dURL = 1:length(a), stringsAsFactors = F) #Find the Name and URL of the Professor
for (i in seq_along(a)) {
dfa$Name[[i]] <- a[[i]] %>% html_text()
dfa$dURL[[i]] <- a[[i]] %>% html_attr("href")
}
li[[i]][[2]][[2]] <- dfa
print(names(li[[i]]))
}
}
return(li)
}
children <- unique(bios %>% html_children() %>% html_name())
profsinfo <- setNames(data.frame(matrix(ncol = length(children), nrow = length(bios))),
children)
for (i in seq_along(children)) {
profsinfo[n, i] <- (bios[[n]] %>% html_children() %>% html_text())[i]
}
vals <- profsinfo$div[1] %>% str_match_all("[a-z0-9]([A-Z][-a-z]+)")
profsinfo$div <- profsinfo$div %>% str_replace_all("(?<=[a-z0-9])(?=[A-Z][-a-z]{2,4})",
"|")
profsinfo <- profsinfo %>% separate(div, sep = "|", into = vals[[1]][, 2], extra = "merge")
rowProfs <- which(dfa$fontSize == max(unlist(dfa$fontSize)))
AProf <- a[rowProfs]
len <- 1:length(rowProfs)
dfProfs <- data.frame(fullName = len, aDetail = len, stringsAsFactors = F)
for (i in seq_along(rowProfs)) {
dfProfs$fullName[i] <- ifelse(length(AProf[[i]]$getElementText() != 0), AProf[[i]]$getElementText(),
NA)
dfProfs$aDetail[i] <- ifelse(length(AProf[[i]]$getElementAttribute("href")) !=
0, AProf[[i]]$getElementAttribute("href"), NA)
}
pubs <- list()
for (i in seq_along(dfProfs$aDetail)) {
remDr$navigate(dfProfs$aDetail[[i]])
pubclass <- read_html(unlist(remDr$getPageSource())) %>% str_extract_all("class\\=\\\".*[Pp]ublications.*") %>%
str_match_all("class=\\\\\"([A-Za-z0-9-_\\s]+publications[A-Za-z0-9-_\\s]+)\\\\\"") %>%
str_match("(?:[a-zA-Z0-9-_]+)?[Pp]ublications(?:[a-zA-Z0-9-_]+)?") %>% as.character()
if (!is.na(pubclass)) {
pubs[[i]] <- remDr$findElement(using = "css", paste(".", pubclass, sep = ""))$getElementText() %>%
unlist() %>% strsplit("\\n") %>% unlist()
pubs[[i]] <- pubs[[i]][-1]
} else {
pubs[[i]] <- NA
}
}
int <- list()
for (i in seq_along(dfProfs$aDetail)) {
remDr$navigate(dfProfs$aDetail[[i]])
pg <- read_html(unlist(remDr$getPageSource()))
int[[i]] <- ifelse(!is.na(pg %>% html_nodes(xpath = "//p") %>% html_text() %>%
subset(., nchar(.) > 1)), pg %>% html_nodes(xpath = "//p") %>% html_text() %>%
subset(., nchar(.) > 1), NA)
}
inttag <- pg %>% str_match("\\<([a-z]+)\\>(.*[Ii]nterests?.*)\\<\\/[a-z]+\\>")
p <- remDr$findElement(using = "xpath", paste("//", inttag[, 2], "['", inttag[, 3],
"']", sep = ""))$findChildElements(using = "xpath", "//p")
remDr$navigate(dfProfs$aDetail[[i]])
#----------------Find Main element by Prof Name 2017-12-07--------------------#
# [self::h1 or self::h2 or self::h3]/parent::div
xp <- paste("//*[contains(text(),'", nm, "')]", sep = "")
htm %>% html_node(xpath = xp)
pdiv <- remDr$findElements("xpath", xp)
fs <- vector("integer")
for (i in seq_along(pdiv)) {
fs[i] <- pdiv[[i]]$getElementValueOfCssProperty("font-size") %>% unlist() %>%
str_extract("\\d+") %>% as.integer()
}
wEDtlTitle <- pdiv[[match(max(fs), fs)]]
#----------------Find Main element by Prof Name 2017-12-07--------------------#
div <- remDr$findElement("xpath", "//*[self::h1 or self::h2 or self::h3][contains(text(),'Alia')]/parent::div")
div$getElementAttribute("id")
sorth <- vector("character")
for (i in seq_along(div)) {
sorth[i] <- div[[i]]$getElementText()
}
wbnds <- c(remDr$getWindowSize()$width * 0.4, remDr$getWindowSize()$width * 0.9)
hbnds <- c(remDr$getWindowSize()$height * 0.2, remDr$getWindowSize()$height * 0.9)
dfdiv <- data.frame(h = rep(NA, length(div)))
for (i in seq_along(div)) {
dfdiv$h[[i]] <- div[[i]]$getElementSize()$height
dfdiv$w[[i]] <- div[[i]]$getElementSize()$width
dfdiv$c[[i]] <- ifelse(div[[i]]$getElementText() %>% unlist() %>% str_detect("(?:PhD?)|(?:[Pp]ublications)") ==
T, div[[i]]$getElementAttribute("class") %>% unlist(), NA)
}
cdivs <- dfdiv %>% filter(., between(h, hbnds[1], hbnds[2]) & between(w, wbnds[1],
wbnds[2]) & nchar(c) > 1)
content <- remDr$findElement(using = "xpath", paste("//div[@class='", cdivs$c[1],
"']", sep = ""))$getElementText() %>% unlist()