Source file ⇒ 2017-lec25.Rmd
HTTP (Hypertext Transfer Protocol) allows for communication between a client and a host via request/response messages. At the heart of web communication are request messages, which are sent via Uniform Resource Locators (URLs).
R has very basic, limited support for numerous protocols (e.g. HTTPS, FTP, FTPS, etc.).
“RCurl” provides steroids for R to handle these protocols. Most importantly for us, it lets us fetch pages served over HTTPS so that we can parse them.
There are two high-level functions in RCurl we are concerned with:
Function | Description |
---|---|
getURLContent() | fetches the content of a URL |
getForm() | submits a Web form via the GET method |
library(XML)
library(RCurl)
library(dplyr) # provides the %>% pipe and the data frame verbs used later
URL <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype="
doc <- URL %>% # doc is a parsed HTML document
  getURLContent() %>%
  htmlParse()
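Alternatively, split the URL into its form parameters with getFormParams() and submit them with getForm():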
URL <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype="
URL %>% getFormParams()
## searchterms searchlocation newsearch sorttype
## "Data+Scientist" "" "true" ""
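baseURL <- "https://www.cybercoders.com/search/"
# pass the decoded value ("Data Scientist", with a space); getForm() escapes it when building the URL
doc <- getForm(baseURL, searchterms = "Data Scientist", searchlocation = "", newsearch = "true", sorttype = "") %>%
  htmlParse()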
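To see what happens to form parameters when they become part of a URL, here is a minimal base-R sketch. `build_query` is a hypothetical helper written for this illustration; it is not an RCurl function and only approximates the escaping that getForm() performs.

```r
# Hypothetical helper (not part of RCurl): build a GET URL from named
# parameters, percent-encoding each value with base R's URLencode().
build_query <- function(base, ...) {
  params <- c(...)  # named character vector of form fields
  encoded <- vapply(params, URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

url_space <- build_query("https://www.cybercoders.com/search/",
                         searchterms = "Data Scientist", searchlocation = "")
url_plus <- build_query("https://www.cybercoders.com/search/",
                        searchterms = "Data+Scientist", searchlocation = "")
print(url_space)  # the space is encoded as %20
print(url_plus)   # the literal plus sign is encoded as %2B
```

In a URL typed into a browser, `+` often stands for a space, so a value copied out of a query string is usually best turned back into a plain space before it is passed as a parameter.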
We will look at the salaries offered for Data Science jobs listed at CyberCoders.
Type “Data Scientist” into the Job title, Keyword search box.
This will take you to https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype=
Our goal is to scrape all of the Data Scientist wage info from this website and put it in a data frame with variables low and high for the low and high ends of the salary range.
Notice this is a secure website (https). This requires us to use the function getURLContent from the RCurl package.
Explore some of the job listings
URL <- "https://www.cybercoders.com/data-scientist-job-321761"
doc <- URL %>%
getURLContent() %>%
htmlParse()
info <- doc %>%
xpathSApply( '//div[@class="wage"]', xmlValue)
info
## [1] " Full-time $140k - $225k"
Note: You can equivalently write: info <- doc %>% getNodeSet( '//div[@class="wage"]') %>% sapply(xmlValue)
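The same XPath query can be tried offline. The sketch below runs it against a small made-up HTML fragment (an assumption for illustration, not the real CyberCoders markup) so the selection logic can be checked without touching the network.

```r
library(XML)

# A made-up HTML fragment that mimics the structure of a listing page.
html <- '<html><body>
<div class="job"><div class="wage">Full-time $140k - $225k</div></div>
<div class="job"><div class="wage">Compensation Unspecified</div></div>
<div class="job"><div class="title">Data Scientist</div></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Select every <div> whose class attribute is exactly "wage" and
# return the text inside each one.
wages <- xpathSApply(doc, '//div[@class="wage"]', xmlValue)
wages <- trimws(wages)  # drop any stray leading/trailing whitespace
print(wages)
```

Only the two `wage` divs are returned; the `title` div is ignored because its class does not match.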
Do example 1a
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-25-collection/
Let's write a function that, given the parsed HTML doc, returns all the salaries on the page:
cy.readPost =
function(doc)
{
  info <- doc %>% xpathSApply( '//div[@class="wage"]', xmlValue)
  info
}
"https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype=" %>%
  getURLContent() %>%
  htmlParse() %>%
  cy.readPost()
## [1] "Full-time Compensation Unspecified"
## [2] "Full-time $95k - $120k"
## [3] "Full-time Compensation Unspecified"
## [4] "Full-time $150k - $200k"
## [5] "Full-time $150k - $200k"
## [6] "Full-time $150k - $200k"
## [7] "Full-time Compensation Unspecified"
## [8] "Full-time $90k - $120k"
## [9] "Full-time Compensation Unspecified"
## [10] "Full-time $100k - $130k"
## [11] "Full-time $140k - $225k"
## [12] "Full-time $100k - $150k"
## [13] "Full-time $120k - $175k"
## [14] "Full-time $100k - $175k"
## [15] "Full-time Compensation Unspecified"
## [16] "Compensation Unspecified"
## [17] "Contract $45.00-$60.00"
## [18] "Compensation Unspecified"
## [19] "Full-time $120k - $160k"
## [20] "Full-time $110k - $160k"
Next we need to get the salaries off of page 2, etc.
We examine the source code of https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype= and figure out how to get the salaries off of page 2.
Near the bottom of the source code we find:
<li class="lnk-next pager-item "><a class="get-search-results next" rel="next" href="./?page=2&searchterms=Data%20Scientist&=searchlocation">»</a></li>
This has multiple attributes, which doesn’t work with xpathSApply and xmlAttrs as we discussed last time:
link <- xpathSApply(doc, "//a[@rel='next']/@href", xmlAttrs)
"https://www.cybercoders.com/search/")
You get an error: Error in parse(text=x,srcfile=src)
So you have to use getNodeSet() instead:
URL <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype="
txt <- getURLContent(URL)
doc <- htmlParse(txt)
link <- doc %>% getNodeSet( "//a[@rel='next']/@href")
baseURL <- "https://www.cybercoders.com/search/"
paste(baseURL,as.character(link[[1]]),sep="")
## [1] "https://www.cybercoders.com/search/./?page=2&searchterms=Data%20Scientist&searchlocation=&newsearch=true&sorttype="
We make this into a function
cy.getNextPageLink =
function(doc)
{
  baseURL = "https://www.cybercoders.com/search/"
  link = getNodeSet(doc, "//a[@rel='next']/@href")
  if(length(link) == 0) # if there is no link then length(link) will be zero
    return(character())
  paste(baseURL,as.character(link[[1]]),sep="")
}
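Both branches of this logic can be checked offline. The sketch below uses two made-up HTML fragments (assumptions for illustration), one with a "next" link and one without; `next_link` is a hypothetical variant of the function above that parses the HTML string itself.

```r
library(XML)

# Made-up fragments: one page with a "next" link and one without.
with_next <- '<ul><li class="lnk-next pager-item"><a class="get-search-results next" rel="next" href="./?page=2&amp;searchterms=Data%20Scientist">next</a></li></ul>'
last_page <- '<ul><li class="pager-item"><a href="./?page=1">1</a></li></ul>'

# Hypothetical helper: parse the HTML text, look for the href of the
# <a rel="next"> element, and turn it into an absolute URL.
next_link <- function(html, baseURL = "https://www.cybercoders.com/search/") {
  doc <- htmlParse(html, asText = TRUE)
  link <- getNodeSet(doc, "//a[@rel='next']/@href")
  if (length(link) == 0)  # no "next" link: signal the last page
    return(character())
  paste0(baseURL, as.character(link[[1]]))
}

print(next_link(with_next))
print(length(next_link(last_page)))
```

Returning a zero-length character vector on the last page gives the caller a simple stopping test, `length(result) == 0`.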
Now we can write a function cyberCoders (see below) that will give us the salaries from every page.
cy.readPost =
function(doc)
{
  info <- doc %>% xpathSApply( '//div[@class="wage"]', xmlValue)
  info
}
cy.getNextPageLink =
function(doc)
{
  baseURL = "https://www.cybercoders.com/search/"
  link = getNodeSet(doc, "//a[@rel='next']/@href")
  if(length(link) == 0) # if there is no link then length(link) will be zero
    return(character())
  paste(baseURL,as.character(link[[1]]),sep="")
}
cyberCoders =
function(query)
{
  # submit the search form and parse the first page of results
  txt = getForm("https://www.cybercoders.com/search/",
                searchterms = query, searchlocation = "",
                newsearch = "true", sorttype = "")
  doc = htmlParse(txt)
  posts = c()
  while(TRUE) {
    # collect the wage strings on the current page
    posts = c(posts, cy.readPost(doc))
    # follow the "next" link; stop when there is none
    nextPage = cy.getNextPageLink(doc)
    if(length(nextPage) == 0)
      break
    nextPage = getURLContent(nextPage)
    doc = htmlParse(nextPage, asText = TRUE)
  }
  posts
}
dataSciPosts = cyberCoders("Data Scientist")
head(dataSciPosts)
## [1] "Full-time Compensation Unspecified"
## [2] "Full-time $95k - $120k"
## [3] "Full-time Compensation Unspecified"
## [4] "Full-time $150k - $200k"
## [5] "Full-time $150k - $200k"
## [6] "Full-time $150k - $200k"
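The loop in cyberCoders follows a general pattern: read the current page, look for a next link, and stop when there is none. The sketch below shows that pattern on a small in-memory list that stands in for the website. The data and the helper names (`read_posts`, `next_page`, `collect_posts`) are made up for illustration.

```r
# Stand-in for the website: each "page" holds its posts and the name of
# the next page (NULL on the last page). All of this data is made up.
pages <- list(
  p1 = list(posts = c("Full-time $95k - $120k", "Compensation Unspecified"),
            nextPage = "p2"),
  p2 = list(posts = c("Full-time $150k - $200k"), nextPage = "p3"),
  p3 = list(posts = c("Contract $45.00-$60.00"), nextPage = NULL)
)

# Hypothetical helpers playing the roles of cy.readPost() and
# cy.getNextPageLink().
read_posts <- function(page) page$posts
next_page <- function(page) {
  if (is.null(page$nextPage)) character() else page$nextPage
}

collect_posts <- function(start) {
  page <- pages[[start]]
  posts <- c()
  while (TRUE) {
    posts <- c(posts, read_posts(page))  # accumulate this page's posts
    nxt <- next_page(page)
    if (length(nxt) == 0)                # no next page: we are done
      break
    page <- pages[[nxt]]                 # move on to the next page
  }
  posts
}

all_posts <- collect_posts("p1")
print(all_posts)
```

Because the stopping test sits after the read, the last page is still processed before the loop exits.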
Do example 1b
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-25-collection/
Finally, let's convert our vector of salaries into a data frame.
#dataSciPosts = as.character(cyberCoders("Data Scientist"))
#head(dataSciPosts,20)
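# extractMatches() is not a dplyr function; we assume here that it is loaded from the DataComputing package
salaries <- as.data.frame(dataSciPosts) %>%
  extractMatches(pattern="([[:digit:]]+)k - \\$([[:digit:]]+)", dataSciPosts, low=1, high=2)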
head(salaries,20)
## dataSciPosts low high
## 1 Full-time Compensation Unspecified <NA> <NA>
## 2 Full-time $95k - $120k 95 120
## 3 Full-time Compensation Unspecified <NA> <NA>
## 4 Full-time $150k - $200k 150 200
## 5 Full-time $150k - $200k 150 200
## 6 Full-time $150k - $200k 150 200
## 7 Full-time Compensation Unspecified <NA> <NA>
## 8 Full-time $90k - $120k 90 120
## 9 Full-time Compensation Unspecified <NA> <NA>
## 10 Full-time $100k - $130k 100 130
## 11 Full-time $140k - $225k 140 225
## 12 Full-time $100k - $150k 100 150
## 13 Full-time $120k - $175k 120 175
## 14 Full-time $100k - $175k 100 175
## 15 Full-time Compensation Unspecified <NA> <NA>
## 16 Compensation Unspecified <NA> <NA>
## 17 Contract $45.00-$60.00 <NA> <NA>
## 18 Compensation Unspecified <NA> <NA>
## 19 Full-time $120k - $160k 120 160
## 20 Full-time $110k - $160k 110 160
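For readers who want to see what the pattern captures without extractMatches(), here is a rough base-R equivalent using regexec() and regmatches(). It is a sketch on three sample strings taken from the output above, not a drop-in replacement, and `pick` is a hypothetical helper. It returns numbers directly rather than the strings shown in the table.

```r
posts <- c("Full-time $95k - $120k",
           "Full-time Compensation Unspecified",
           "Contract $45.00-$60.00")

# Same pattern as above: digits, then "k - $", then digits.
pattern <- "([[:digit:]]+)k - \\$([[:digit:]]+)"
matches <- regmatches(posts, regexec(pattern, posts))

# Each match is c(full match, group 1, group 2); non-matches are empty.
pick <- function(m, i) if (length(m) == 3) as.numeric(m[i + 1]) else NA_real_

salaries <- data.frame(
  post = posts,
  low  = vapply(matches, pick, numeric(1), i = 1),
  high = vapply(matches, pick, numeric(1), i = 2),
  stringsAsFactors = FALSE
)
print(salaries)
```

The hourly contract rate has no `k - $` in it, so it does not match and ends up as NA, just like the unspecified entries.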
salaries <- salaries %>%
select(low,high) %>%
filter(!is.na(low)) %>%
filter(high!=0)
head(salaries)
## low high
## 1 95 120
## 2 150 200
## 3 150 200
## 4 150 200
## 5 90 120
## 6 100 130
str(salaries)
## 'data.frame': 64 obs. of 2 variables:
## $ low : Factor w/ 15 levels "100","110","120",..: 15 6 6 6 14 1 5 1 3 1 ...
## $ high: Factor w/ 14 levels "120","130","140",..: 1 7 7 7 1 2 8 4 6 6 ...
salaries$low
## [1] 95 150 150 150 90 100 140 100 120 100 120 110 100 100 160 110 100
## [18] 90 100 150 130 175 150 150 160 150 110 160 160 150 70 120 75 110
## [35] 70 100 100 130 130 150 150 140 100 100 100 100 100 100 200 160 250
## [52] 200 250 175 175 130 150 110 110 70 120 120 140 60
## Levels: 100 110 120 130 140 150 160 175 200 250 60 70 75 90 95
salaries$low %>% as.character()
## [1] "95" "150" "150" "150" "90" "100" "140" "100" "120" "100" "120"
## [12] "110" "100" "100" "160" "110" "100" "90" "100" "150" "130" "175"
## [23] "150" "150" "160" "150" "110" "160" "160" "150" "70" "120" "75"
## [34] "110" "70" "100" "100" "130" "130" "150" "150" "140" "100" "100"
## [45] "100" "100" "100" "100" "200" "160" "250" "200" "250" "175" "175"
## [56] "130" "150" "110" "110" "70" "120" "120" "140" "60"
salaries <- salaries %>% mutate(low=as.numeric(as.character(low)),high=as.numeric(as.character(high)))
head(salaries)
## low high
## 1 95 120
## 2 150 200
## 3 150 200
## 4 150 200
## 5 90 120
## 6 100 130
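The double conversion as.numeric(as.character(...)) matters because str() showed that low and high are factors. Calling as.numeric() directly on a factor returns its internal level codes, not the numbers printed on screen. A small sketch with made-up values:

```r
low <- factor(c("95", "150", "90"))
print(levels(low))  # levels are sorted as text, not as numbers

codes  <- as.numeric(low)                # internal level codes
values <- as.numeric(as.character(low))  # the numbers we actually want
print(codes)   # 3 1 2
print(values)  # 95 150 90
```

Taking the mean of the codes would silently give a meaningless answer, which is why the conversion goes through as.character() first.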
salaries %>% summarize(low=mean(low),high=mean(high))
## low high
## 1 128.0469 199.7656