docker run -d -P selenium/standalone-chrome
Welcome to the Congruent Connections faculty data scraper and DA 5020 Final Project for Stephen Holsenbeck. This page provides the in-depth documentation and write-up for how each modular function in the scraper works. In the navigation menu above, the Shiny Apps dropdown menu will take you to the Shiny App deployments on shinyapps.io. The Shiny Apps allow you to browse professors in the data set via wordclouds, or to search the data set by your interests and return the information of professors who match them. This data is searchable and downloadable.
Note: The code and apps were tested on Windows 10 Home running the Chrome browser maximized to the screen, and the page coding is optimized for viewing in a maximized browser window on a desktop (it uses Bootstrap fluid formatting, so it can be viewed on smaller screens, though not optimally).
The idea for a web scraper was inspired by a prospective philosophy PhD candidate who was lamenting the inevitable task of blindly broadcasting their application to suitable programs to increase their chances of acceptance. Noting that many ambitious individuals are relegated to this method of applying to programs for all types of intellectual pursuits, I attempted to improve upon the process by applying the data analytics skills acquired over the previous semester to develop a web scraper and Shiny UI that allow prospective students (for any higher-ed degree) to find the professors who match their interests, so that they can communicate with them intelligently, allocate their energy accordingly, and apply to their departments knowing that they share common interests. With this app, prospective students can allocate their limited energies more efficiently, bolstering the likelihood of their acceptance and future success in the department(s) to which they apply.
Each function resides alongside its documentation in the accordion of the corresponding name below. The Shiny App documentation follows, and finally the appendix of unused code blocks is presented, though it is undocumented.
The remoteDriver command creates the remote driver instance that drives the hidden Chrome Selenium browser. RSelenium is especially useful when it is necessary to obtain CSS properties of page elements that are either computed on page load or resolved from JS and CSS scripts, such as container dimensions, text size, etc. This feature proves useful in the findDiv function for identifying the container div by a process that distills divs by their dimensions. RSelenium can also change hard-coded attributes of specific elements on a page so that they can be passed to rvest, where they can be easily identified and scraped via the attribute RSelenium assigned.
# For Help: vignette('RSelenium-docker', package = 'RSelenium')
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 32768L, browserName = "chrome")
remDr$open(silent = T)
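As a brief illustration of that second use (a sketch only, using the McGill profile page that also appears in the appendix; the data-scrape attribute name is invented for the example), the driver can read a resolved CSS property and stamp a temporary attribute that rvest can then select on:
# assumes remDr is the open remote driver created above
remDr$navigate("https://www.mcgill.ca/philosophy/people/faculty/al-saji")
wE <- remDr$findElement(using = "css", "#page-title")
wE$getElementValueOfCssProperty("font-size")  # the computed value, e.g. "30px"
# tag the element so rvest can locate it by an attribute we control
remDr$executeScript("arguments[0].setAttribute('data-scrape', 'title');", list(wE))
read_html(remDr$getPageSource()[[1]]) %>% html_node("[data-scrape='title']") %>% html_text()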
PhantomJS is a headless browser that is slightly faster than Chrome. I attempted to use PhantomJS to speed up the time-expensive scraping operations, but found that it calculates dimensions quite differently from Chrome, rendering the element-dimension distillation methods previously coded against the Chrome remote driver useless. For this reason, Chrome was used for all RSelenium functions herein.
# docker run -d -p 8910:8910 wernight/phantomjs phantomjs --webdriver=8910
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 8910L, browserName = "phantomjs")
remDr$open(silent = T)
This df is the list of all input schools and URLs of the philosophy faculty pages used in the project.
dfSch <- data.frame(Name = c("UBC", "NYU", "McGill", "UToronto", "Duke", "PennState",
"Northwestern", "UNCCH", "Harvard", "Yale", "BostonC", "Emory", "BostonU", "USC",
"UCSD"), URL = c("http://philosophy.ubc.ca/people/core-faculty/", "https://as.nyu.edu/content/nyu-as/as/departments/philosophy/directory/faculty.html",
"https://www.mcgill.ca/philosophy/people/faculty", "http://philosophy.utoronto.ca/directory-category/main-faculty/",
"https://philosophy.duke.edu/people/faculty", "http://philosophy.la.psu.edu/directory/graduate-faculty",
"http://www.philosophy.northwestern.edu/people/continuing-faculty/", "https://philosophy.unc.edu/people-page/faculty/",
"https://philosophy.fas.harvard.edu/people-terms/department-faculty", "https://philosophy.yale.edu/people/faculty",
"https://www.bc.edu/offices/stserv/academic/univcat/faculty/phil.html", "http://philosophy.emory.edu/home/people/faculty/index.html",
"http://www.bu.edu/philo/people/faculty/", "http://dornsife.usc.edu/cf/phil/phil_faculty_roster.cfm",
"https://philosophy.ucsd.edu/people/faculty.html"), stringsAsFactors = F)
This is a mode function, found via a Stack Overflow post, that is used in the findBios function to identify the element within the main containing div that repeats most frequently, singling out the element that encapsulates the listings of faculty profiles (as this is the most frequently recurring element within the main div on all of the pages).
Mode <- function(x, na.rm = FALSE) {
if (na.rm) {
x = x[!is.na(x)]
}
ux <- unique(x)
return(ux[which.max(tabulate(match(x, ux)))])
}
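For example, given a vector of (hypothetical) element classes:
# the class that recurs most often among the candidate elements wins
Mode(c("views-row", "views-row", "bio-card", NA, "views-row"), na.rm = TRUE)
#> [1] "views-row"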
#----------------Heading Date--------------------#
Using easily recognizable commenting like this makes it much easier to find your way around what can become byzantine code. If you haven't done so already, make a global snippet for this purpose triggered by something like ##. You can also use programs like AutoHotkey to automate the insertion of a date/time stamp to keep track of revision and creation dates.
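A small helper in that spirit (a sketch; an actual ## snippet would live in your editor's snippet settings rather than in R code):
# prints a dated section header ready to paste into a script
sectionHeader <- function(label) {
    cat(sprintf("#----------------%s %s--------------------#\n",
        label, format(Sys.time(), "%Y-%m-%d %H%M")))
}
sectionHeader("Create xPaths Table")  # e.g. '#----------------Create xPaths Table 2017-12-08 1929--------------------#'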
/html/body/*[self::h1 or self::h2 or self::h3]/text()
which could then be evaluated for whether it contained "faculty". XPath 2.0 includes a regex-capable matches() function that would allow the search to return only those nodes matching the [Ff]aculty regex, but rvest does not appear to support XPath 2.0, and neither do the Chrome dev tools (unless I made a syntax error in testing the XPath regex that I wasn't able to catch based on the documentation and SO posts I read).
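In XPath 1.0, a common if clumsy workaround, sketched here rather than used in the function below, is to pair translate() with contains() to approximate a case-insensitive match:
# lowercase the heading text with translate(), then test for 'faculty'
xp <- paste0("//*[self::h1 or self::h2 or self::h3]",
    "[contains(translate(text(), 'FACULTY', 'faculty'), 'faculty')]")
# read_html(dfSch$URL[1]) %>% html_nodes(xpath = xp)  # headings mentioning Faculty/faculty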
findDiv <- function(url) {
pg <- read_html(url)
lidiv <- pg %>% html_nodes(xpath = "//div")
# Parse li, articles, td, and div and find the mode
candiv <- vector("character")
for (i in seq_along(lidiv)) {
candiv[i] <- ifelse(lidiv[[i]] %>% html_node(css = "h1") %>% html_text() %>%
str_detect(".*[Ff]aculty.*") == T, lidiv[[i]] %>% html_node(css = "h1") %>%
html_text() %>% str_detect(".*[Ff]aculty.*"), NA)
if (is.na(candiv[i])) {
candiv[i] <- ifelse(lidiv[[i]] %>% html_node(css = "h2") %>% html_text() %>%
str_detect(".*[Ff]aculty.*") == T, lidiv[[i]] %>% html_node(css = "h2") %>%
html_text() %>% str_detect(".*[Ff]aculty.*"), NA)
# can likely combine all h1-3 tags into a single html_node call
}
}
lidiv <- lidiv[!is.na(candiv)]
candiv <- candiv[!is.na(candiv)]
    if (is_empty(candiv)) {
        return(NA)  # no headings containing 'faculty' were found on the page
    }
candiv <- data.frame(attr = 1:length(candiv), v = 1:length(candiv), stringsAsFactors = F)
for (i in seq_along(lidiv)) {
candiv$attr[i] <- "class"
candiv$v[i] <- lidiv[[i]] %>% html_attr("class")
if (is.na(candiv[, 2][i])) {
candiv$attr[i] <- "id"
candiv$v[i] <- lidiv[[i]] %>% html_attr("id")
}
}
# find the div classes or Ids with a heading that contains faculty
dfdiv <- data.frame(attr = candiv$attr, v = candiv$v, h = rep(NA, length(candiv$attr)),
w = rep(NA, length(candiv$attr)), aSize = rep(NA, length(candiv$attr)), stringsAsFactors = F) #df to store filtering characteristics
remDr$navigate(url)
remDr$setImplicitWaitTimeout(milliseconds = 4000)
wS <- remDr$getWindowSize() %>% str_extract("[0-9]+") %>% as.numeric()
for (i in seq_along(dfdiv$attr)) {
div <- tryCatch({
remDr$findElement(using = "xpath", paste("//*[@", candiv$attr[i], "='",
candiv$v[[i]], "']", sep = ""))
}, error = function(err) {
return(NA)
})
if (!is.na(div)) {
# get the width of the div (the largest and smallest width are likely not the div
# with the teachers)
dfdiv$h[i] <- div$getElementSize()$height
dfdiv$w[i] <- div$getElementSize()$width
# Get the size of the first link (the largest is likely the professors)
dfdiv$aSize[i] <- tryCatch({
div <- remDr$findElement(using = "xpath", paste("//*[@", candiv$attr[i],
"='", candiv$v[[i]], "']", sep = ""))
a <- div$findChildElement(using = "xpath", "//a")
a$getElementValueOfCssProperty("font-size") %>% unlist() %>% str_extract("\\d\\d?") %>%
as.numeric()
}, error = function(err) {
return(NA)
})
}
}
# filter methods
dfs <- list()
dfs[[1]] <- dfdiv %>% distinct()
dfs[[2]] <- dfs[[1]] %>% filter(aSize == max(aSize, na.rm = T))
dfs[[3]] <- dfs[[2]] %>% filter(w > wS[2] * 0.3 & w < wS[2] & h > 0.3 * wS[1])
dfs[[4]] <- dfs[[3]] %>% filter(str_detect(v, "(?:[Cc]ontainer)|(?:[Cc]ontent)|(?:[Cc]olumn)|(?:[Mm]ain)"))
dfs[[5]] <- dfs[[4]] %>% filter(w > wS[2] * 0.5 & w < wS[2] * 0.975)
dfsrows <- data.frame(n = 1:5, stringsAsFactors = F)
for (i in seq_along(dfs)) {
dfsrows$rows[i] <- nrow(dfs[[i]])
}
dfsrows <- dfsrows %>% filter(rows != 0)
dfsrows <- dfsrows %>% filter(rows == min(rows)) %>% arrange(rows)
dfnum <- dfsrows$n[1]
out <- dfs[[dfnum]]
if (nrow(out) > 1) {
out <- out %>% filter(h == min(out$h))
}
return(out)
}
As one might imagine, this function, with 15 schools, takes a good amount of time to execute. The save and load functions are useful for saving the data from a successfully completed run so that it doesn't have to be rerun when the project is reloaded.
# Find main divs given a df with names and URLs
findMaindivs <- function(df) {
canDivs <- rep(list(list()), length(df$Name))
names(canDivs) <- df$Name
for (i in seq_along(canDivs)) {
canDivs[[i]][[1]] <- df$URL[[i]]
}
for (i in seq_along(df$URL)) {
print(df$URL[[i]])
canDivs[[i]][[2]] <- findDiv(df$URL[[i]])
}
return(canDivs)
}
# canDivs <- findMaindivs(dfSch) save(tDivs,file='tDivs.Rdata')
# load('~/Northeastern/Git/da5020project/canDivs.RData')
# load('~/Northeastern/Git/da5020project/tDivs.RData')
The findBios function collects the repeating elements (div, li, article, td) inside each page's main container div. The cl slot is for getting the class of these repeating elements, and the p slot is for capturing all of the p-tag-wrapped text on the page. This capture of p tags was coded in later, when I found that it was exceedingly difficult to capture the brief bio information individually on each of the pages, and when it was noted that Boston College only had paragraph tags and no further information for their faculty.
# dfDiv <- canDivs[[1]]
findBios <- function(dfDiv) {
if (!is.na(dfDiv[[2]]) == T) {
attr <- as.character(dfDiv[[2]]$attr[[1]]) #get the attr type
v <- as.character(dfDiv[[2]]$v[[1]]) #get the attr name
htm <- read_html(dfDiv[[1]]) #get the URL
el <- vector("list", 6)
names(el) <- c("div", "li", "article", "td", "cl", "p") #instantiate & label list
# get divs,lis,articles, tables, and p tags in main container
el$div <- htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']", sep = "")) %>%
html_nodes(css = "div")
el$li <- htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']", sep = "")) %>%
html_nodes(css = "li")
el$article <- htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']",
sep = "")) %>% html_nodes(css = "article")
if (length(htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']",
sep = "")) %>% html_node(css = "table")) > 0) {
el$td <- htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']",
sep = "")) %>% html_node(css = "table") %>% html_nodes("td") %>%
html_text() %>% gsub("\\\n|\\\t|\\\r", "", .) %>% gsub("([a-z])([A-Z])",
"\\1 \\2", .) %>% gsub("\\s{3,}", "", .)
el$td <- el$td[nchar(el$td) > 7]
}
# el tag types, filter elements with both img and a tags
for (i in 1:4) {
tags <- el[[i]]
n <- i
if (length(tags) > 0) {
fil <- vector("logical")
for (i in seq_along(tags)) {
ch <- tags[[i]] %>% html_children()
img <- ch %>% str_detect("\\<img")
a <- ch %>% str_detect("\\<a")
c <- which(intersect(img, a))
fil[i] <- ifelse(length(c) >= 1, T, F)
}
}
if (sum(fil, na.rm = T) > 0) {
el[[n]] <- subset(tags, fil)
}
}
# Create a vector with the classes of those nodes containing both img and a
vClass <- vector("character")
for (i in seq_along(1:4)) {
tags <- el[[i]]
n <- i + 4
if (length(tags) > 0) {
for (i in seq_along(tags)) {
vcl <- vector("character")
vcl <- sapply(tags, html_attr, "class")
vcl <- vcl[!is.na(vcl)]
}
vClass <- append(vClass, vcl, after = length(vClass))
}
}
# store the classes in a DF in the 5th position in the list
el[[5]] <- data.frame(v = vClass, stringsAsFactors = F) %>% group_by(v) %>%
summarize(cnt = n()) %>% filter(cnt > 5) %>% filter(cnt == Mode(cnt)) %>%
unique()
if (nrow(el[[5]]) == 0) {
el$p <- htm %>% html_node(xpath = paste("//*[@", attr, "='", v, "']",
sep = "")) %>% html_text()
}
} else {
el <- NA
}
return(el)
}
The findAllBios function iterates the findBios function over each of the candidate divs in the output list from findMaindivs.
findAllBios <- function(canDivs) {
cDivs <- canDivs
for (i in seq_along(cDivs)) {
n <- i
cBios <- findBios(cDivs[[i]])
cDivs[[n]][[3]] <- cBios
}
return(cDivs)
}
# canDiv <- tDivs[[2]] test <- getBio(canDiv)
getBio <- function(canDiv) {
cDiv <- canDiv
cDiv[[4]] <- list()
url <- cDiv[[1]]
remDr$navigate(url)
if (length(cDiv[[3]]$cl$v) != 0) {
v <- cDiv[[3]]$cl$v
if (length(v) > 1) {
df <- data.frame(v = v, h = rep(NA, length(v)), stringsAsFactors = F)
for (i in seq_along(v)) {
cl <- v[i]
wE <- remDr$findElement("xpath", paste("//*[@class='", cl, "']",
sep = ""))
df$h[i] <- wE$getElementSize()$height
}
df <- df %>% filter(h == max(df$h))
v <- as.character(df$v[nrow(df)])
}
htm <- read_html(url)
bios <- htm %>% html_nodes(xpath = paste("//*[@class='", v, "']", sep = ""))
profs <- vector("list", length(bios))
names <- vector("character")
for (i in seq_along(bios)) {
n <- i
            a <- bios[[i]] %>% html_nodes(css = "a")  # need to ensure the first a is actually the prof name
if (length(a) > 1) {
asort <- vector("logical", length(a))
for (i in seq_along(a)) {
asort[i] <- str_detect(a[[i]], "\\<img|\\@") == F & str_detect(a[[i]],
"href") == T & nchar(a[[i]] %>% html_text()) > 1
}
a <- subset(a, asort)
}
name <- ifelse(nchar(a %>% html_text()) != 0, a %>% html_text(), NA)
if (length(name) != 0) {
if (!is.na(name) & length(name) > 1) {
name <- name[1]
}
profs[[n]]$Name <- name
print(profs[[n]]$Name)
fsnm <- str_match_all(profs[[n]]$Name, "\\w+.?\\w+.?")[[1]][, 1]
plt <- paste(fsnm[1], fsnm[2], sep = " ")
names[n] <- ifelse(length(plt) != 0, plt, NA)
profs[[n]]$href <- tryCatch({
profs[[n]]$href <- remDr$findElement("partial link text", profs[[n]]$Name)$getElementAttribute("href")
}, error = function(err) {
return(NA)
})
if (is.na(profs[[n]]$href)) {
profs[[n]]$href <- tryCatch({
profs[[n]]$href <- remDr$findElement("partial link text", plt)$getElementAttribute("href")
}, error = function(err) {
return(NA)
})
}
print(profs[[n]]$href)
profs[[n]]$info <- bios[[n]] %>% html_text()
}
}
} else {
print("Using RSelenium")
remDr$navigate(url)
tda <- unique(cDiv[[3]][["td"]])
profs <- vector("list", length(tda))
names <- vector("character")
for (i in seq_along(tda)) {
wE <- remDr$findElement("partial link text", str_match(tda[[i]], "\\w+\\b\\s\\w+\\b")[1,
])
profs[[i]]$Name <- wE$getElementText() %>% unlist()
print(profs[[i]]$Name)
names[i] <- profs[[i]]$Name
profs[[i]]$href <- wE$getElementAttribute("href") %>% unlist()
print(profs[[i]]$href)
profs[[i]]$info <- tda[[i]]
}
}
names(profs) <- names
cDiv[[4]] <- profs
return(cDiv)
}
The iterative, pluralized version of the getBio function, for iterating over the full list of URLs.
getAllBios <- function(canDivs) {
cDivs <- canDivs
for (i in seq_along(cDivs)) {
n <- i
theDiv <- cDivs[[i]]
print(theDiv[[1]])
if (!is.na(theDiv[[3]])) {
theDiv <- getBio(theDiv)
cDivs[[n]] <- theDiv
}
}
return(cDivs)
}
Label the information contained within the top level of the School list in the candidate div (now tDivs) list.
for (i in seq_along(tDivs)) {
names(tDivs[[i]]) <- c("URL", "divdf", "IndexData", "ProfData")
}
# canDivsLi <- nDivs[[1]] i <- 13 test <- DtlEmail(nDivs[[3]])
DtlEmail <- function(canDivsLi) {
cDiv <- canDivsLi
if (is.na(cDiv[[2]])) {
return(cDiv)
}
for (i in seq_along(cDiv[[4]])) {
l <- i
nm <- names(cDiv[[4]])[i] %>% str_extract("\\w+.?\\w+.?\\b")
ln <- names(cDiv[[4]])[i] %>% str_extract("\\w+.?\\w+.?\\b$")
rnm <- paste("[", tolower(substr(nm, 1, 1)), toupper(substr(nm, 1, 1)), "]",
substr(nm, 2, 30), sep = "")
rln <- paste("[", tolower(substr(ln, 1, 1)), toupper(substr(ln, 1, 1)), "]",
substr(ln, 2, 30), sep = "")
url <- cDiv[[4]][[i]]$href[[1]] #url to detail page
if (is.null(url)) {
next
}
print(url)
purl <- parse_url(url)
purlp <- purl$path %>% str_match("([\\_+A-Za-z\\d\\%\\&\\#]+)\\/?$") %>%
unlist()
purlp <- purlp[2]
surl <- purl$hostname %>% str_extract("[a-zA-Z0-9-_%]+\\.[a-zA-Z-_%]+$")
if (length(purl$scheme) > 0 & length(purl$hostname) > 0)
{
htm <- tryCatch({
read_html(url)
}, error = function(err) {
return(NA)
})
if (any(is.na(htm))) {
next
}
a <- htm %>% str_extract_all("\\b[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}\\b") %>%
unlist()
if (length(a) == 0) {
a <- NULL
} else if (is.na(a)) {
a <- NULL
}
if (length(a) >= 1) {
estr <- ifelse(any(str_detect(a, rnm), str_detect(a, rln), str_detect(a,
purlp)), T, F)
if (estr == F) {
a <- NULL
}
}
if (length(a) < 1) {
print("Using RSelenium")
tryCatch({
remDr$navigate(url)
}, error = function(err) {
next
})
a <- vector("character")
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "@")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", nm)
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(nm))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", ln)
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
                aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(ln))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(surl))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "Email")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "E-mail")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a <- append(a, aWe[[i]]$getElementAttribute("href") %>% unlist(),
length(a))
}
}
print(a)
}
a <- a %>% str_extract_all("\\b[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}\\b") %>%
unlist() %>% gsub("\\%40", "@", .)
a <- unique(a)
if (!is.na(a %>% any(str_detect(., rnm)))) {
a <- subset(a, str_detect(a, rnm))
}
if (!is.na(a %>% any(str_detect(., rln)))) {
a <- subset(a, str_detect(a, rln))
}
if (!is.na(a %>% any(str_detect(., purlp)))) {
a <- subset(a, str_detect(a, purlp))
}
a <- gsub("mailto:", "", a)
print(a)
cDiv[[4]][[l]]$email <- a
} #if statement
} #for loop
return(cDiv)
}
The getDtls function uses xml_parents() to parse the DOM tree and get all of the parents of the node containing the professor's name, identifies whether they have an id or class containing content, main, or container, and makes a subset of those parents that do.
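A minimal standalone sketch of that idea (nmN stands for the node matched on the professor's name, as in getDtls below):
# walk up from the matched node and keep ancestors whose id names a content area
parents <- xml_parents(nmN)
keep <- parents %>% html_attr("id", default = "") %>% str_detect("(?:[Cc]ontent)|(?:[Mm]ain)|(?:[Cc]ontainer)")
containers <- subset(parents, keep)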
# canDivsLi <- nDivs[[2]] i <- 4
# test <- DtlText(canDivsLi)
getDtls <- function(canDivsLi) {
cDiv <- canDivsLi
if (is.na(cDiv[[2]])) {
return(cDiv)
}
for (i in seq_along(cDiv[["ProfData"]])) {
l <- i
        nm <- names(cDiv[["ProfData"]])[i] %>% str_extract("\\w+\\.?\\w+\\.?\\b")  #prof first name
        ln <- names(cDiv[["ProfData"]])[i] %>% str_extract("\\w+\\.?\\w+\\.?\\b$")  #prof last name
rnm <- paste("[", tolower(substr(nm, 1, 1)), toupper(substr(nm, 1, 1)), "]",
substr(nm, 2, 30), sep = "")
rln <- paste("[", tolower(substr(ln, 1, 1)), toupper(substr(ln, 1, 1)), "]",
substr(ln, 2, 30), sep = "")
xp <- paste("//*[contains(text(),'", nm, "')][not(self::script)][not(self::title)]",
sep = "")
url <- cDiv[["ProfData"]][[i]]$href[[1]] #url to detail page
allTTags <- c("h1", "h2", "h3", "h4", "h5", "h6", "p", "span", "a", "li",
"td", "strong", "em", "article")
if (length(url) == 0) {
next
}
if (is.null(url) | is.na(url)) {
next
}
print(url)
purl <- parse_url(url)
purlp <- purl$path %>% str_match("([\\_+A-Za-z\\d\\%\\&\\#]+)\\/?(?:\\.html)?$") %>%
unlist()
purlp <- purlp[2]
surl <- purl$hostname %>% str_extract("[a-zA-Z0-9-_%]+\\.[a-zA-Z-_%]+$")
if (length(purl$scheme) > 0 & length(purl$hostname) > 0)
{
htm <- tryCatch({
read_html(url)
}, error = function(err) {
return(NA)
})
if (any(is.na(htm))) {
next
}
nmN <- htm %>% html_nodes(xpath = xp)
if (length(nmN) < 1) {
next
}
cNodes <- vector("list")
if (any(tf <- xml_parents(nmN) %>% html_attr("id", default = F) %>%
str_detect("(?:[Cc]ontent)|(?:[Mm]ain)|(?:[Cc]ontainer)") %>% unlist())) {
cNodes[[1]] <- subset(xml_parents(nmN), tf)
}
if (any(tf <- xml_parents(nmN) %>% html_attr("class", default = F) %>%
str_detect("(?:[Cc]ontent)|(?:[Mm]ain)|(?:[Cc]ontainer)") %>% unlist())) {
cNodes[[2]] <- subset(xml_parents(nmN), tf)
}
if (any(tf <- xml_siblings(nmN) %>% html_attr("id", default = F) %>%
str_detect("(?:[Cc]ontent)|(?:[Mm]ain)|(?:[Cc]ontainer)") %>% unlist())) {
cNodes[[3]] <- subset(xml_siblings(nmN), tf)
}
if (any(tf <- xml_siblings(nmN) %>% html_attr("class", default = F) %>%
str_detect("(?:[Cc]ontent)|(?:[Mm]ain)|(?:[Cc]ontainer)") %>% unlist())) {
cNodes[[4]] <- subset(xml_siblings(nmN), tf)
}
#----------------Create xPaths Table 2017-12-08 1929--------------------#
xps <- sapply(cNodes, function(x) {
if (!is.null(x)) {
xml_path(x)
}
}, simplify = T) %>% unlist()
dfcN <- data.frame(Path = xps, stringsAsFactors = F) %>% mutate(nchar = nchar(Path)) %>%
arrange(desc(nchar))
allNodes <- list()
for (i in seq_along(dfcN$Path)) {
allNodes[[i]] <- htm %>% html_node(xpath = dfcN$Path[i]) %>% html_nodes("*")
nodeNames <- allNodes[[i]] %>% html_nodes("*") %>% html_name() %>%
unlist() %>% unique()
dfcN$Nodes[i] <- length(intersect(nodeNames, allTTags))
aN <- vector()
for (p in seq_along(allNodes[[i]])) {
aN[p] <- ifelse(length(allNodes[[i]][[p]] %>% html_attrs() %>%
str_detect("nav|navigation|body")) > 0, allNodes[[i]][[p]] %>%
html_attrs() %>% str_detect("nav|navigation|body"), F)
}
dfcN$Fil[i] <- any(aN)
}
dfcN <- dfcN %>% filter(Nodes > 2 & Fil != T) %>% arrange(desc(Nodes))
n <- 0
dfNodes <- data.frame(Order = rep(NA, 2), stringsAsFactors = F)
while (nrow(dfNodes) < 3) {
n <- n + 1
print(dfcN$Path[n])
                allNodes <- htm %>% html_node(xpath = dfcN$Path[n]) %>% html_nodes("*")
#----------------Create Detail DF 2017-12-08 2026--------------------#
dfNodes <- data.frame(Order = rep(NA, length(allNodes)), Dup = rep(NA,
length(allNodes)), stringsAsFactors = F)
for (i in seq_along(allNodes)) {
dfNodes$Order[i] <- i
dfNodes$Tag[i] <- allNodes[[i]] %>% html_name()
dfNodes$Text[i] <- allNodes[[i]] %>% html_text() %>% gsub("\\\n|\\\t|\\{|\\}",
"", .) %>% gsub("^\\s+", "", .) %>% gsub("\\s+$", "", .) %>%
gsub("\\s{2,}", " ", .) %>% gsub("[][!#$%()*,.:;<=>@^_`|~.{}\\\\/]",
"", .) %>% gsub("([a-z])([A-Z])", "\\1 \\2", .)
}
dfNodes <- dfNodes %>% mutate(nchar = nchar(Text)) %>% filter(nchar >
3) %>% select(-nchar)
dfNodes <- dfNodes[!duplicated(dfNodes$Text), ]
for (i in 1:length(dfNodes$Text)) {
Text <- vector()
notiRow <- seq(1:length(dfNodes$Text))[-i]
for (m in notiRow) {
Text[m] <- substr(dfNodes$Text[m], 1, 200)
}
g <- vector("logical")
g[1] <- any(grepl(substr(dfNodes$Text[i], 1, 200), Text))
dfNodes$Dup[i] <- any(g)
}
dfNodes <- dfNodes %>% filter(Dup == F)
if (n == length(dfcN$Path)) {
break
}
}
print(substr(dfNodes$Text[2], 1, 100))
cDiv[["ProfData"]][[l]]$Detail <- dfNodes
#----------------DtlEmail 2017-12-09 1015--------------------#
a <- htm %>% str_extract_all("\\b[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}\\b") %>%
unlist()
if (length(a) == 0) {
a <- NULL
} else if (is.na(a)) {
a <- NULL
}
if (length(a) >= 1) {
estr <- ifelse(any(str_detect(a, rnm), str_detect(a, rln), str_detect(a,
purlp)), T, F)
if (estr == F) {
a <- NULL
}
}
if (length(a) < 1) {
print("Using RSelenium")
tryCatch({
remDr$navigate(url)
}, error = function(err) {
next
})
a <- vector("character")
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "@")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", nm)
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(nm))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", ln)
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
                aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(ln))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", tolower(surl))
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "Email")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
if (length(a) < 1) {
aWe <- tryCatch({
aWe <- remDr$findElements(using = "partial link text", "E-mail")
}, error = function(err) {
return(a)
})
if (length(aWe) >= 1) {
for (i in seq_along(aWe)) {
a[i] <- aWe[[i]]$getElementAttribute("href") %>% unlist()
}
}
print(a)
}
a <- a %>% str_extract_all("\\b[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*[A-Za-z0-9\\.\\_\\%+-]*@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}\\b") %>%
unlist() %>% gsub("\\%40", "@", .)
a <- unique(a)
if (!is.na(a %>% any(str_detect(., rnm)))) {
a <- subset(a, str_detect(a, rnm))
}
if (!is.na(a %>% any(str_detect(., rln)))) {
a <- subset(a, str_detect(a, rln))
}
if (!is.na(a %>% any(str_detect(., purlp)))) {
a <- subset(a, str_detect(a, purlp))
}
a <- gsub("mailto:", "", a)
print(a)
cDiv[["ProfData"]][[l]]$email <- a
#----------------End DtlEmail--------------------#
# if nmNodes>0
} #if length(url)>0
} #For I in seq_along candivs
return(cDiv)
}
The iterative version of getDtls that runs over the entire list of schools.
getAllDtl <- function(canDivs) {
for (i in seq_along(canDivs)) {
cDiv <- canDivs[[i]]
canDivs[[i]] <- getDtls(cDiv)
}
return(canDivs)
}
# save(nDivs,file='Data.Rdata') nDivs <- getAllDtl(nDivs)
The cleanData function simply removes from the list all of the content that was involved in extracting the data, leaving just the data relevant to the professors. The commented lines where values are assigned NULL handle particular exceptions in the data that repeatedly caused the scraper to error.
cleanData <- function(cDivs) {
for (i in seq_along(cDivs)) {
n <- i
cDivs[[n]] <- cDivs[[n]][-2]
cDivs[[n]][[2]] <- subset(cDivs[[n]][[2]], sapply(cDivs[[n]][[2]], length,
simplify = T) != 0)
cDivs[[n]][[2]] <- subset(cDivs[[n]][[2]], sapply(cDivs[[n]][[2]], class,
simplify = T) != c("xml_nodeset"))
cDivs[[n]][[2]] <- subset(cDivs[[n]][[2]], sapply(cDivs[[n]][[2]], is_tibble,
simplify = T) != T)
for (i in 1:2) {
cDivs[[n]][[2]] <- subset(cDivs[[n]][[2]], names(cDivs[[n]][[2]]) !=
c("div", "cl"))
}
}
l <- length(cDivs) - 1
for (i in 1:l) {
names(cDivs[[i]]) <- c("URL", "IndexData", "ProfData")
}
    return(cDivs)
}
# nDivs[[3]][['IndexData']]$p <- NULL nDivs[[4]][['IndexData']]$p <- NULL
# nDivs[[10]][['IndexData']]$p <- NULL save(nDivs,file='Data.Rdata')
The master function, bundling each of the modular functions into a single scraping function that takes a data frame with a column of school names and a column of URLs to faculty pages. Note: this is as yet untested. Given the amount of supervision required to work around unforeseen errors while extracting the original batch of data over the course of ~50 hrs, I haven't had the opportunity to test it on another data set.
schoolScrape <- function(dfSch) {
df <- dfSch
tDivs <- findMaindivs(df)
tDivs <- findAllBios(tDivs)
tDivs <- getAllBios(tDivs)
for (i in seq_along(tDivs)) {
names(tDivs[[i]]) <- c("URL", "divdf", "IndexData", "ProfData")
}
tDivs <- getAllDtl(tDivs)
tDivs <- cleanData(tDivs)
return(tDivs)
}
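In principle (untested, per the note above), running it on a new set of pages would look like this, with hypothetical school names and URLs:
newSch <- data.frame(Name = c("SchoolA", "SchoolB"), URL = c("http://philosophy.schoola.edu/people/faculty/",
    "http://www.schoolb.edu/philosophy/faculty.html"), stringsAsFactors = F)
# newDivs <- schoolScrape(newSch)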
As is likely evident after reading this documentation, developing this data scraper was no small task! Despite the numerous hours that went into it, there is a lot more to do to refine the efficiency of the code and to account for exceptions. In the future, after the suggested improvements are implemented, I'd like to try the scraper on another data set to see how it does. If using this scraper on another list of schools and/or departments is of interest to whoever is reading this, contact me and we can see about doing so!
This Shiny application allows a prospective student to browse professors at the schools that were scraped and see wordclouds built from their profile information. The data is cleaned using the tm and SnowballC packages, and wordcloud2 is used to make a cloud of the text. The words Interests, Publications, and Education are removed from the text because they are repetitive, non-descriptive (for the purposes here) words found on every faculty profile.
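As a toy illustration of the cleaning pipeline used in the server function below (the example text is invented):
library("tm")
library("wordcloud2")
txt <- "Philosophy of mind, consciousness, and the philosophy of perception"
corp <- Corpus(VectorSource(removePunctuation(txt)))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))
m <- as.matrix(TermDocumentMatrix(corp))
d <- data.frame(word = rownames(m), freq = rowSums(m))
wordcloud2(data = d)  # 'philosophy' (freq 2) renders largest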
library("shiny")
library("tm")
library("SnowballC")
library("RColorBrewer")
library("wordcloud2")
library("tidyverse")
File <- "app.R"
Files <- list.files(path = file.path("~"), recursive = T, include.dirs = T)
Path.file <- names(unlist(sapply(Files, grep, pattern = File))[1])
Dir.wd <- dirname(Path.file)
Dir.wd <- gsub("[a-zA-Z0-9_\\.]+\\.[a-zA-Z0-9_\\.]+$", "", Dir.wd)
load(paste("~/", Dir.wd, "/Data.Rdata", sep = ""))
dfSch <- data.frame(Name = c("UBC", "NYU", "McGill", "UToronto", "Duke", "PennState",
"Northwestern", "UNCCH", "Harvard", "Yale", "BostonC", "Emory", "BostonU", "USC",
"UCSD"), URL = c("http://philosophy.ubc.ca/people/core-faculty/", "https://as.nyu.edu/content/nyu-as/as/departments/philosophy/directory/faculty.html",
"https://www.mcgill.ca/philosophy/people/faculty", "http://philosophy.utoronto.ca/directory-category/main-faculty/",
"https://philosophy.duke.edu/people/faculty", "http://philosophy.la.psu.edu/directory/graduate-faculty",
"http://www.philosophy.northwestern.edu/people/continuing-faculty/", "https://philosophy.unc.edu/people-page/faculty/",
"https://philosophy.fas.harvard.edu/people-terms/department-faculty", "https://philosophy.yale.edu/people/faculty",
"https://www.bc.edu/offices/stserv/academic/univcat/faculty/phil.html", "http://philosophy.emory.edu/home/people/faculty/index.html",
"http://www.bu.edu/philo/people/faculty/", "http://dornsife.usc.edu/cf/phil/phil_faculty_roster.cfm",
"https://philosophy.ucsd.edu/people/faculty.html"), stringsAsFactors = F)
ui <- fluidPage(theme = "bootstrap.min.css", titlePanel("Browse Professors"), fluidRow(column(3,
selectInput("sch", label = "School:", choices = dfSch$Name, selected = "UBC")),
column(3, offset = 1, uiOutput("prof"))), fluidRow(column(12, sliderInput("size",
"Size of wordcloud", min = 0.1, max = 4, step = 0.1, value = 0.5, round = FALSE,
ticks = TRUE))), fluidRow(column(12, mainPanel(wordcloud2Output("wordcloud2")))))
server <- function(input, output) {
output$prof <- renderUI({
profs <- names(nDivs[[input$sch]][["ProfData"]])
selectInput("prof", label = "Professors:", choices = profs, selected = 1)
})
#----------------WordCloud 2017-12-10 0844--------------------#
output$wordcloud2 <- renderWordcloud2({
# wordcloud2(demoFreqC, size=input$size)
inText <- removeWords(nDivs[[input$sch]][["ProfData"]][[input$prof]][["Detail"]]$Text,
c("[Ii]nterests", "[Pp]ublications", "[Ee]ducation")) %>% str_replace("([a-z])([A-Z])",
"\\1 \\2") %>% gsub("[^[:alnum:]///' ]", "", .) %>% removePunctuation() %>%
VectorSource() %>% Corpus() %>% tm_map(., content_transformer(tolower)) %>%
tm_map(removeNumbers) %>% tm_map(removeWords, stopwords("english")) %>%
tm_map(stripWhitespace)
dfT <- TermDocumentMatrix(inText)
m <- as.matrix(dfT)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
        wordcloud2(data = d, color = "random-light", shape = "circle", size = input$size)
})
}
shinyApp(ui, server)
The second Shiny application allows a prospective student to enter their interests and search the scraped faculty data for professors whose profiles match them. Matching professors are returned with their school, name, email, and profile link in a searchable table that can be copied, printed, or downloaded.
library("tidyverse")
library("shiny")
library("stringr")
library("DT")
library("plyr")
library("dplyr")
ui <- fluidPage(theme = "bootstrap.min.css",
titlePanel("Find Matches for your Interests"),
fluidRow(
    column(6,textAreaInput("interests", "Your Interests", value = "", cols = 1, rows = 10, placeholder = "Enter your Interests here, spaced evenly between words, with individual entries separated by a comma", resize = "both")
),
column(6,
p("Check the boxes below to begin sorting by your interests"),
checkboxGroupInput("ints", "Interest selection", choices = character(0), selected = character(0),
inline = T, width = "100%")
)
),
fluidRow(
column(12,
tags$em("Note: the Interests column in the output is a value correspondng to the query the professor matched with"),
tags$br(),
tags$em("IE: If your interests were 'Consciousness,Mind' then Consciousness is match 1 and Mind is match 2"),
tags$br(),
tags$em("A professor matching: "),
tags$ol(tags$li("Consciousness will have a value 1"),
tags$li("Mind will have a value of 2"),
tags$li("Consciousness & Mind will have a value of 1,2"),
tags$li("And so on.."))
)
),
fluidRow(
column(12,
DT::dataTableOutput('table',height="1200px")
)
)
)
server <- function(input, output,session) {
File <- "Search.R"
Files <- list.files(path=file.path("~"),recursive=T,include.dirs=T)
Path.file <- names(unlist(sapply(Files,grep,pattern=File))[1])
Dir.wd <- dirname(Path.file)
Dir.wd <- gsub("[a-zA-Z0-9_\\.]+\\.[a-zA-Z0-9_\\.]+$","",Dir.wd)
load(url("https://drive.google.com/file/d/13lyCD0ztQKSHzWcPqBXT9ysUraRDMU87/view?usp=sharing"))
selected <- reactive({
selected <- as.numeric(input$ints)
return(selected)
})
observe({
ints <- input$interests %>% str_split(",") %>% unlist() %>% gsub("^\\s?([a-zA-Z])","\\U\\1",.,perl=T) %>% gsub("\\s([a-zA-Z])"," \\U\\1",.,perl=T)
cV <- seq(1,length(ints))
# Can use character(0) to remove all choices
if (is.null(ints))
ints <- character(0)
# Can also set the label and select items
updateCheckboxGroupInput(session, "ints",
choiceNames = ints, choiceValues = cV,
selected = character(0)
)
})
profTable <- reactive({
regints <- reactive({input$interests %>% str_split(",") %>% unlist() %>% gsub("^\\s?([a-zA-Z])","[\\U\\1\\L\\1]",.,perl=T) %>% gsub("\\s([a-zA-Z])","\\\\\\s\\?[\\U\\1\\L\\1]",.,perl=T)})
regints <- regints()
selected <- reactive({
selected <- as.numeric(input$ints)
return(selected)
})
selected <- selected()
vSchool <- vector()
vProf <- vector()
vHref <- vector()
vEmail <- vector()
mItems <- vector("list")
mi <- vector()
miList <- list()
for(i in selected()){
search <- regints[i]
for(n in seq_along(nDivs)){
mItems <- lapply(nDivs[[n]]$ProfData,function(ch) grep(search, ch))
TF <- sapply(mItems, function(x) length(x) > 0)
if(any(TF)==F){next}
mItems <- subset(nDivs[[n]]$ProfData,TF)
if(length(mItems)==0){next}
for(m in seq_along(mItems)){
vSchool <- append(vSchool,names(nDivs)[n],length(vSchool))
if(length(mItems[[m]]$Name)>1){Name <- paste(mItems[[m]]$Name[1],mItems[[m]]$Name[2],sep="")}else{Name <- mItems[[m]]$Name}
if(length(mItems[[m]]$Name)<1){Name <- NA}
vProf <- append(vProf,Name,length(vProf))
if(length(mItems[[m]]$email)>1){email <- paste(mItems[[m]]$email[1],mItems[[m]]$email[2],sep="")}else{email <- mItems[[m]]$email}
if(length(mItems[[m]]$email)<1){email <- NA}
vEmail <- append(vEmail,email,length(vEmail))
if(length(unlist(mItems[[m]]$href))>1){href <- paste(unlist(mItems[[m]]$href)[1],unlist(mItems[[m]]$href)[2],sep="")}else{href <- unlist(mItems[[m]]$href)}
          if(length(unlist(mItems[[m]]$href))<1){href <- NA}
vHref <- append(vHref,href,length(vHref))
mi <- append(mi,i,length(mi))
}
}
miList[[i]] <- mi
}
miOut <- as.data.frame(sapply(miList, '[', seq(max(lengths(miList)))))
print(miOut)
df <- data.frame(School=vSchool,Professor=vProf,Email=vEmail,Profile=vHref,stringsAsFactors = F) %>% cbind(miOut) %>% group_by(School,Professor,Email,Profile) %>% unite(Interests, -School,-Professor,-Email,-Profile) %>% select(Interests,everything())
df <- aggregate(df[1], df[-1],
FUN = function(X) paste(unique(X), collapse=", "))
df$Interests <- df$Interests %>% gsub("[NA\\_0-9]+?\\_(\\d)","\\1",.,perl=T)
return(df)
})
output$table <- DT::renderDataTable({DT::datatable(profTable(),extensions = 'Buttons', options = list(pageLength = 10,
dom = 'Bfrtip',
buttons =
list('copy', 'print', list(
extend = 'collection',
buttons = c('csv', 'excel', 'pdf'),
text = 'Download'
))))
})
}
shinyApp(ui,server)
# nDivs[['NYU']] <- getDtls(nDivs[['NYU']]) Testing Candidate Finder
#----------------If cNodes 2017-12-08 1831--------------------#
# print(length(cNodes)) n <- 1 tags <- cNodes[[n]] %>% html_nodes('*') %>%
# html_name() %>% unique() %>% unlist()
# while(length(intersect(tags,allTTags))<3&n<=length(cNodes)){ n <- n+1 tags <-
# cNodes[[n]] %>% html_nodes('*') %>% html_name() %>% unique() %>% unlist()
# print(paste('while:',n,sep='')) } #move up the parent tree, or across the
# sibling tree til the appropriate container is foundrespectively
# print(cNodes[[n]] %>% html_attrs())
#----------------Single out Text Nodes 2017-12-08 2027--------------------#
# tryCatch({remDr$navigate(url)},error = function(err) { next}) for(i in
# seq_along(dfcN$Path)){ dfcN$h[i] <-
# remDr$findElement('xpath',dfcN$Path[i])$getElementSize()$height dfcN$w[i] <-
# remDr$findElement('xpath',dfcN$Path[i])$getElementSize()$width } dfcN <- dfcN
# %>% arrange(desc(h),desc(w))
system.time(tDivs <- getAllBios(canDivs))
save(nDivs, file = "Data.Rdata")
system.time(profs2 <- getBio(canDivs[[2]]))
bioDiv15 <- findBios(canDivs[[15]])
canDivs <- findAllBios(canDivs)
dfs <- list()
dfs[[1]] <- findDiv(dfSch$URL[9])
canDivs <- list()
canDivs[[2]]
nDivs$NYU$ProfData[["Ned Block"]]$Detail
#'C:\Program Files\MongoDB\Server\3.4\bin\mongod.exe' --dbpath C:\Users\Stephen\Documents\Northeastern\Git\da5020\HWK12\mongodb
library("rmongodb")
library("mongolite")
cc <- mongo.create()
mongo.is.connected(cc)
mongo.get.databases(cc)
dbBr <- mongo(collection = "Brief", db = "cc")
dbBe <- mongo(collection = "Bio", db = "cc")
dbCV <- mongo(collection = "CV", db = "cc")
dbP <- mongo(collection = "Personal", db = "cc")
dbDiv <- mongo(collection = "canDivs", db = "cc")
# pg <- read_html(remDr$getPageSource()[[1]])
remDr$navigate("https://www.mcgill.ca/philosophy/people/faculty/al-saji")
wE <- remDr$findElement("xpath", "//*[@id='page-title']/parent::div")
wE$getElementAttribute("id")
gettext <- function(elems) {
text <- ifelse(length(elems$getElementText()) != 0, elems$getElementText(), NA)
return(text)
}
a[[112]]$getElementAttribute("href")
a[[112]]$getElementSize()$width
div[[112]]$getElementText()
a[[12]]$getElementValueOfCssProperty("font-size")
a[[112]]$getElementValueOfCssProperty("font-size")
a[[112]]$findElements(using = "xpath", "//div")
webElement
library("microbenchmark")
canDivs2 <- canDivs
microbenchmark(canDivsa <- scrURLs(canDivs2), times = 1)
scrURLs <- function(canDivs) {
li <- canDivs
for (i in seq_along(li)) {
# Check to see if the dataframe exists for the URL
if (length(li[[i]]) > 1) {
# canDivs <- canDivs[sapply(canDivs, is.null)]
wmin <- min(li[[i]][[2]]$w, na.rm = T)
hmin <- min(li[[i]][[2]]$h, na.rm = T)
# If the DF has more than 1 row, filter it
if (nrow(li[[i]][[2]]) > 1) {
li[[i]][[2]] <- li[[i]][[2]] %>% filter(w == wmin & h == hmin)
}
url <- li[[i]][[1]]
print(url)
attr <- as.character(li[[i]][[2]]$attr[[1]])
v <- as.character(li[[i]][[2]]$v[[1]])
remDr$navigate(url)
htm <- remDr$getPageSource()[[1]]
a <- htm %>% html_node("xpath", paste("//*[@", attr, "='", v, "']", sep = "")) %>%
html_nodes("css", "a")
dfa <- data.frame(Name = 1:length(a), dURL = 1:length(a), stringsAsFactors = F) #Find the Name and URL of the Professor
for (i in seq_along(a)) {
dfa$Name[[i]] <- a[[i]] %>% html_text()
dfa$dURL[[i]] <- a[[i]] %>% html_attr("href")
}
li[[i]][[2]][[2]] <- dfa
print(names(li[[i]]))
}
}
return(li)
}
children <- unique(bios %>% html_children() %>% html_name())
profsinfo <- setNames(data.frame(matrix(ncol = length(children), nrow = length(bios))),
children)
for (i in seq_along(children)) {
profsinfo[n, i] <- (bios[[n]] %>% html_children() %>% html_text())[i]
}
vals <- profsinfo$div[1] %>% str_match_all("[a-z0-9]([A-Z][-a-z]+)")
profsinfo$div <- profsinfo$div %>% str_replace_all("(?<=[a-z0-9])(?=[A-Z][-a-z]{2,4})",
"|")
profsinfo <- profsinfo %>% separate(div, sep = "|", into = vals[[1]][, 2], extra = "merge")
rowProfs <- which(dfa$fontSize == max(unlist(dfa$fontSize)))
AProf <- a[rowProfs]
len <- 1:length(rowProfs)
dfProfs <- data.frame(fullName = len, aDetail = len, stringsAsFactors = F)
for (i in seq_along(rowProfs)) {
dfProfs$fullName[i] <- ifelse(length(AProf[[i]]$getElementText() != 0), AProf[[i]]$getElementText(),
NA)
dfProfs$aDetail[i] <- ifelse(length(AProf[[i]]$getElementAttribute("href")) !=
0, AProf[[i]]$getElementAttribute("href"), NA)
}
pubs <- list()
for (i in seq_along(dfProfs$aDetail)) {
remDr$navigate(dfProfs$aDetail[[i]])
pubclass <- read_html(unlist(remDr$getPageSource())) %>% str_extract_all("class\\=\\\".*[Pp]ublications.*") %>%
str_match_all("class=\\\\\"([A-Za-z0-9-_\\s]+publications[A-Za-z0-9-_\\s]+)\\\\\"") %>%
str_match("(?:[a-zA-Z0-9-_]+)?[Pp]ublications(?:[a-zA-Z0-9-_]+)?") %>% as.character()
if (!is.na(pubclass)) {
pubs[[i]] <- remDr$findElement(using = "css", paste(".", pubclass, sep = ""))$getElementText() %>%
unlist() %>% strsplit("\\n") %>% unlist()
pubs[[i]] <- pubs[[i]][-1]
} else {
pubs[[i]] <- NA
}
}
int <- list()
for (i in seq_along(dfProfs$aDetail)) {
remDr$navigate(dfProfs$aDetail[[i]])
pg <- read_html(unlist(remDr$getPageSource()))
int[[i]] <- ifelse(!is.na(pg %>% html_nodes(xpath = "//p") %>% html_text() %>%
subset(., nchar(.) > 1)), pg %>% html_nodes(xpath = "//p") %>% html_text() %>%
subset(., nchar(.) > 1), NA)
}
inttag <- pg %>% str_match("\\<([a-z]+)\\>(.*[Ii]nterests?.*)\\<\\/[a-z]+\\>")
p <- remDr$findElement(using = "xpath", paste("//", inttag[, 2], "['", inttag[, 3],
"']", sep = ""))$findChildElements(using = "xpath", "//p")
remDr$navigate(dfProfs$aDetail[[i]])
#----------------Find Main element by Prof Name 2017-12-07--------------------#
# [self::h1 or self::h2 or self::h3]/parent::div
xp <- paste("//*[contains(text(),'", nm, "')]", sep = "")
htm %>% html_node(xpath = xp)
pdiv <- remDr$findElements("xpath", xp)
fs <- vector("integer")
for (i in seq_along(pdiv)) {
fs[i] <- pdiv[[i]]$getElementValueOfCssProperty("font-size") %>% unlist() %>%
str_extract("\\d+") %>% as.integer()
}
wEDtlTitle <- pdiv[[match(max(fs), fs)]]
#----------------Find Main element by Prof Name 2017-12-07--------------------#
div <- remDr$findElement("xpath", "//*[self::h1 or self::h2 or self::h3][contains(text(),'Alia')]/parent::div")
div$getElementAttribute("id")
sorth <- vector("character")
for (i in seq_along(div)) {
sorth[i] <- div[[i]]$getElementText()
}
wbnds <- c(remDr$getWindowSize()$width * 0.4, remDr$getWindowSize()$width * 0.9)
hbnds <- c(remDr$getWindowSize()$height * 0.2, remDr$getWindowSize()$height * 0.9)
dfdiv <- data.frame(h = rep(NA, length(div)))
for (i in seq_along(div)) {
dfdiv$h[[i]] <- div[[i]]$getElementSize()$height
dfdiv$w[[i]] <- div[[i]]$getElementSize()$width
dfdiv$c[[i]] <- ifelse(div[[i]]$getElementText() %>% unlist() %>% str_detect("(?:PhD?)|(?:[Pp]ublications)") ==
T, div[[i]]$getElementAttribute("class") %>% unlist(), NA)
}
cdivs <- dfdiv %>% filter(., between(h, hbnds[1], hbnds[2]) & between(w, wbnds[1],
wbnds[2]) & nchar(c) > 1)
content <- remDr$findElement(using = "xpath", paste("//div[@class='", cdivs$c[1],
"']", sep = ""))$getElementText() %>% unlist()