Source file ⇒ 2017-lec25.Rmd

Announcements:

  1. The rubric and an example final project report will be available on b-courses later today.
  2. The sign-up sheet for oral and poster presentations will be up on the b-courses discussion board tomorrow at 10am. One member of each group should give the group name and a list of all group members.

Today

  1. Tentative schedule
  2. The RCurl package
  3. Case study: Exploring salary of data science jobs with web scraping and text mining

0. Tentative schedule

1. RCurl

HTTP (Hypertext Transfer Protocol) allows communication between a client and a host via request/response messages. At the heart of web communication is the request message, which is addressed to a resource by its Uniform Resource Locator (URL).
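
As a rough illustration of what a request message carries, the sketch below splits a shortened version of the search URL used later in these notes into its parts and prints a minimal HTTP/1.1 GET request as plain text. It uses base R only, the variable names are illustrative, and a real client such as RCurl would send additional headers.

```r
# Split a URL into scheme, host, and path-plus-query using base R.
url <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist"
parts <- regmatches(url, regexec("^(https?)://([^/]+)(/.*)$", url))[[1]]
scheme <- parts[2]
host   <- parts[3]
target <- parts[4]

# A minimal HTTP/1.1 GET request: a request line, a Host header,
# and an empty line that ends the header section.
request <- c(sprintf("GET %s HTTP/1.1", target),
             sprintf("Host: %s", host),
             "")
cat(request, sep = "\r\n")
```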

R has only basic, limited support for numerous protocols (e.g. HTTPS, FTP, FTPS).

The “RCurl” package gives R much stronger support for these protocols. Most importantly for us, it lets us fetch pages served over HTTPS so that we can parse them.

There are two high level functions in RCurl we are concerned with:

| Function        | Description                           |
|-----------------|---------------------------------------|
| getURLContent() | fetches the content of a URL          |
| getForm()       | submits a Web form via the GET method |

library(XML)
library(RCurl)
library(dplyr)   # provides the %>% pipe used throughout
URL <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype="
doc <- URL %>%   # doc is a parsed HTML document
  getURLContent() %>%
  htmlParse()

or

URL <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype="
URL %>% getFormParams()
##      searchterms   searchlocation        newsearch         sorttype 
## "Data+Scientist"               ""           "true"               ""
baseURL <- "https://www.cybercoders.com/search/"
# getForm() escapes the values itself, so pass the plain search string with a space
doc <- getForm(baseURL, searchterms = "Data Scientist", searchlocation = "", newsearch = "true", sorttype = "") %>% 
  htmlParse()
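
To make the link between the URL and the form parameters concrete, here is a rough base-R sketch of decoding a query string into name/value pairs and encoding it back. The helpers `parse_query()` and `build_query()` are hypothetical names made up for this illustration; they are not RCurl functions and are not meant to reproduce RCurl's implementation.

```r
# Hypothetical helper: split the query part of a URL into a named character vector.
parse_query <- function(url) {
  query <- sub("^[^?]*\\?", "", url)            # drop everything up to the "?"
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]
  keys  <- sub("=.*$", "", pairs)
  vals  <- sub("^[^=]*=?", "", pairs)
  vals  <- gsub("+", " ", vals, fixed = TRUE)   # "+" stands for a space in a query string
  vals  <- vapply(vals, utils::URLdecode, character(1), USE.NAMES = FALSE)
  setNames(vals, keys)
}

# Hypothetical helper: percent-encode the values and join them back into a URL.
build_query <- function(base, params) {
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste(names(params), encoded, sep = "=", collapse = "&"))
}

url <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype="
params  <- parse_query(url)
rebuilt <- build_query("https://www.cybercoders.com/search/", params)
params
rebuilt
```

Because encoding happens when the URL is built, the value handed to the encoder should be the decoded string ("Data Scientist"), not the already-encoded form.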

2. Exploring salary of data science jobs with web scraping and text mining

We will look at the salary offered for Data Science jobs listed at CyberCoders

https://www.cybercoders.com/

Type in “Data Scientist” in the Job title, Keyword search box.

This will take you to https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype=

Our goal is to scrape all of the Data Scientist wage info from this website and put it in a data frame with variables low and high for the low and high ends of the salary range.

Notice this is a secure website (https). This requires us to use the function getURLContent from the RCurl package.

Explore some of the job listings

URL <- "https://www.cybercoders.com/data-scientist-job-321761"
doc <-  URL %>% 
  getURLContent() %>%
  htmlParse()   
info <- doc %>%
  xpathSApply( '//div[@class="wage"]', xmlValue)
info
## [1] " Full-time $140k - $225k"

Note: You can equivalently write: info <- doc %>% getNodeSet( '//div[@class="wage"]') %>% sapply(xmlValue)
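
Job listings change over time, so the XPath step can also be tried offline. The sketch below parses a small hand-written HTML string (made up for this illustration, not copied from the site) and extracts the text of every div with class "wage".

```r
library(XML)

# Toy HTML standing in for a job listing page (invented for this example).
html <- '<html><body>
  <div class="job"><div class="wage"> Full-time $140k - $225k</div></div>
  <div class="job"><div class="wage">Compensation Unspecified</div></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Select every <div class="wage"> and pull out its text content.
wages <- xpathSApply(doc, '//div[@class="wage"]', xmlValue)
wages <- trimws(wages)   # drop the leading space seen in the real output
wages
```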

In Class exercise

Do example 1a

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-25-collection/

Let's write a function that, given the parsed HTML doc, returns all the salaries on the page:

cy.readPost = 
function(doc)
{
  info <- doc %>% xpathSApply( '//div[@class="wage"]', xmlValue)
  info
}
"https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype=" %>%
getURLContent() %>% 
htmlParse() %>%
cy.readPost()
##  [1] "Full-time Compensation Unspecified"
##  [2] "Full-time $95k - $120k"            
##  [3] "Full-time Compensation Unspecified"
##  [4] "Full-time $150k - $200k"           
##  [5] "Full-time $150k - $200k"           
##  [6] "Full-time $150k - $200k"           
##  [7] "Full-time Compensation Unspecified"
##  [8] "Full-time $90k - $120k"            
##  [9] "Full-time Compensation Unspecified"
## [10] "Full-time $100k - $130k"           
## [11] "Full-time $140k - $225k"           
## [12] "Full-time $100k - $150k"           
## [13] "Full-time $120k - $175k"           
## [14] "Full-time $100k - $175k"           
## [15] "Full-time Compensation Unspecified"
## [16] "Compensation Unspecified"          
## [17] "Contract $45.00-$60.00"            
## [18] "Compensation Unspecified"          
## [19] "Full-time $120k - $160k"           
## [20] "Full-time $110k - $160k"

Next we need to get the salaries from page 2 and the pages after it.

We examine the source code of https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype= and figure out how to get the salaries from page 2.

Near the bottom of the source code we find:

 <li class="lnk-next pager-item "><a class="get-search-results next" rel="next" href="./?page=2&searchterms=Data%20Scientist&=searchlocation">»</a></li>

This node has multiple attributes, which does not work with xpathSApply and xmlAttrs as we discussed last time:

link <- xpathSApply(doc, "//a[@rel='next']/@href", xmlAttrs)
                     "https://www.cybercoders.com/search/")

You get an error: Error in parse(text=x,srcfile=src)

So you have to use getNodeSet()

URL <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype="
txt <-  getURLContent(URL)
doc <- htmlParse(txt)

link <- doc %>% getNodeSet( "//a[@rel='next']/@href")
baseURL <-  "https://www.cybercoders.com/search/"
paste(baseURL,as.character(link[[1]]),sep="")
## [1] "https://www.cybercoders.com/search/./?page=2&searchterms=Data%20Scientist&searchlocation=&newsearch=true&sorttype="

We make this into a function

cy.getNextPageLink =
function(doc)
{
  baseURL = "https://www.cybercoders.com/search/"
  link = getNodeSet(doc, "//a[@rel='next']/@href")
  if(length(link) == 0)    # if there is no link then length(link) will be zero
     return(character())
  paste(baseURL,as.character(link[[1]]),sep="")
}
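
The same idea can be checked without a network connection. The sketch below is a variant, not the function above: it selects the a element and reads its href with xmlGetAttr() instead of selecting the attribute directly in the XPath. The helper name `next_page_link()` and the two toy pages are invented for this illustration.

```r
library(XML)

# Hypothetical helper: absolute URL of the "next" page, or character() on the last page.
next_page_link <- function(doc, baseURL) {
  href <- xpathSApply(doc, "//a[@rel='next']", xmlGetAttr, "href")
  if (length(href) == 0)     # no matching <a> node: we are on the last page
    return(character())
  paste0(baseURL, href[[1]])
}

# Toy pages: one with a "next" link and one without.
page1 <- htmlParse('<html><body><ul><li class="lnk-next"><a rel="next" href="./?page=2">next</a></li></ul></body></html>',
                   asText = TRUE)
lastPage <- htmlParse('<html><body><p>No more results</p></body></html>', asText = TRUE)

base <- "https://www.cybercoders.com/search/"
next_page_link(page1, base)
next_page_link(lastPage, base)
```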

Now we can write a function cyberCoders() (see below) that collects the salaries from every page of results.

cy.readPost = 
function(doc)
{
  info <- doc %>% xpathSApply( '//div[@class="wage"]', xmlValue)
  info
}
cy.getNextPageLink =
function(doc)
{
  baseURL = "https://www.cybercoders.com/search/"
  link = getNodeSet(doc, "//a[@rel='next']/@href")
  if(length(link) == 0)    # if there is no link then length(link) will be zero
     return(character())
  paste(baseURL,as.character(link[[1]]),sep="")
}
cyberCoders =
function(query)
{
   txt = getForm("https://www.cybercoders.com/search/",
                  searchterms = query,  searchlocation = "",
                  newsearch = "true",  sorttype = "")
   doc = htmlParse(txt)

   posts = c()
   while(TRUE) {
       posts = c(posts, cy.readPost(doc))
       nextPage = cy.getNextPageLink(doc)
       if(length(nextPage) == 0)
          break

       nextPage = getURLContent(nextPage)
       doc = htmlParse(nextPage, asText = TRUE)
   }
   posts
}

dataSciPosts = cyberCoders("Data Scientist")
head(dataSciPosts)
## [1] "Full-time Compensation Unspecified"
## [2] "Full-time $95k - $120k"            
## [3] "Full-time Compensation Unspecified"
## [4] "Full-time $150k - $200k"           
## [5] "Full-time $150k - $200k"           
## [6] "Full-time $150k - $200k"
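
The loop in cyberCoders() follows a common pattern: read the current page, look for a next link, and stop when there is none. The sketch below shows only that control flow, using an in-memory list in place of the website. The names `pages`, `fetch_page()`, and `collect_posts()` are invented for this illustration.

```r
# Fake "site": each page holds its posts and the key of the next page (NULL on the last page).
pages <- list(
  p1 = list(posts = c("Full-time $95k - $120k", "Compensation Unspecified"), nextPage = "p2"),
  p2 = list(posts = c("Full-time $150k - $200k"), nextPage = "p3"),
  p3 = list(posts = c("Contract $45.00-$60.00"), nextPage = NULL)
)

# Stand-in for getURLContent() followed by htmlParse().
fetch_page <- function(key) pages[[key]]

collect_posts <- function(startKey) {
  posts <- character()
  page <- fetch_page(startKey)
  while (TRUE) {
    posts <- c(posts, page$posts)      # accumulate this page's posts
    if (is.null(page$nextPage))        # no next link: stop
      break
    page <- fetch_page(page$nextPage)  # otherwise follow the link
  }
  posts
}

allPosts <- collect_posts("p1")
allPosts
```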

In Class exercise

Do example 1b

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-25-collection/

Finally, let's convert our vector of salaries into a data frame.

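library(dplyr)          # select(), filter(), mutate() and the %>% pipe
library(DataComputing)  # assumed source of extractMatches()
#dataSciPosts = as.character(cyberCoders("Data Scientist"))
#head(dataSciPosts,20)

# Capture group 1 is the low end of the range and group 2 the high end (in $k)
salaries <- as.data.frame(dataSciPosts) %>% 
  extractMatches(pattern="([[:digit:]]+)k - \\$([[:digit:]]+)", dataSciPosts, low=1, high=2)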
head(salaries,20)
##                          dataSciPosts  low high
## 1  Full-time Compensation Unspecified <NA> <NA>
## 2              Full-time $95k - $120k   95  120
## 3  Full-time Compensation Unspecified <NA> <NA>
## 4             Full-time $150k - $200k  150  200
## 5             Full-time $150k - $200k  150  200
## 6             Full-time $150k - $200k  150  200
## 7  Full-time Compensation Unspecified <NA> <NA>
## 8              Full-time $90k - $120k   90  120
## 9  Full-time Compensation Unspecified <NA> <NA>
## 10            Full-time $100k - $130k  100  130
## 11            Full-time $140k - $225k  140  225
## 12            Full-time $100k - $150k  100  150
## 13            Full-time $120k - $175k  120  175
## 14            Full-time $100k - $175k  100  175
## 15 Full-time Compensation Unspecified <NA> <NA>
## 16           Compensation Unspecified <NA> <NA>
## 17             Contract $45.00-$60.00 <NA> <NA>
## 18           Compensation Unspecified <NA> <NA>
## 19            Full-time $120k - $160k  120  160
## 20            Full-time $110k - $160k  110  160
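
To see what the pattern captures, here is a rough base-R equivalent using regexec() and regmatches() on three sample strings taken from the output above. It is a sketch of the same regular expression, not a description of how extractMatches() works internally, and the helper `get_group()` is a made-up name.

```r
posts <- c("Full-time $95k - $120k",
           "Compensation Unspecified",
           "Contract $45.00-$60.00")
pattern <- "([[:digit:]]+)k - \\$([[:digit:]]+)"

# One element per post: the full match followed by the captured groups,
# or character(0) when the pattern does not match.
m <- regmatches(posts, regexec(pattern, posts))

# Hypothetical helper: i-th capture group, or NA when there was no match.
get_group <- function(x, i) if (length(x) == 0) NA_character_ else x[i + 1]

salaries_sketch <- data.frame(
  post = posts,
  low  = as.numeric(vapply(m, get_group, character(1), i = 1)),
  high = as.numeric(vapply(m, get_group, character(1), i = 2)),
  stringsAsFactors = FALSE
)
salaries_sketch
```

Only the first string has the "$95k - $120k" shape, so the unspecified and hourly-rate posts end up as NA, which is why they are filtered out in the next step.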
salaries <- salaries %>% 
  select(low,high) %>% 
  filter(!is.na(low)) %>% 
  filter(high!=0) 
head(salaries)
##   low high
## 1  95  120
## 2 150  200
## 3 150  200
## 4 150  200
## 5  90  120
## 6 100  130
str(salaries)
## 'data.frame':    64 obs. of  2 variables:
##  $ low : Factor w/ 15 levels "100","110","120",..: 15 6 6 6 14 1 5 1 3 1 ...
##  $ high: Factor w/ 14 levels "120","130","140",..: 1 7 7 7 1 2 8 4 6 6 ...
salaries$low 
##  [1] 95  150 150 150 90  100 140 100 120 100 120 110 100 100 160 110 100
## [18] 90  100 150 130 175 150 150 160 150 110 160 160 150 70  120 75  110
## [35] 70  100 100 130 130 150 150 140 100 100 100 100 100 100 200 160 250
## [52] 200 250 175 175 130 150 110 110 70  120 120 140 60 
## Levels: 100 110 120 130 140 150 160 175 200 250 60 70 75 90 95
salaries$low %>% as.character()
##  [1] "95"  "150" "150" "150" "90"  "100" "140" "100" "120" "100" "120"
## [12] "110" "100" "100" "160" "110" "100" "90"  "100" "150" "130" "175"
## [23] "150" "150" "160" "150" "110" "160" "160" "150" "70"  "120" "75" 
## [34] "110" "70"  "100" "100" "130" "130" "150" "150" "140" "100" "100"
## [45] "100" "100" "100" "100" "200" "160" "250" "200" "250" "175" "175"
## [56] "130" "150" "110" "110" "70"  "120" "120" "140" "60"
salaries <- salaries %>% mutate(low=as.numeric(as.character(low)),high=as.numeric(as.character(high)))
head(salaries)
##   low high
## 1  95  120
## 2 150  200
## 3 150  200
## 4 150  200
## 5  90  120
## 6 100  130
salaries %>% summarize(low=mean(low),high=mean(high))
##        low     high
## 1 128.0469 199.7656
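
The conversion through as.character() matters because low and high were stored as factors. A small sketch with made-up values shows what goes wrong if as.numeric() is applied to a factor directly:

```r
# A factor stores integer level codes; the level labels are sorted as strings.
low <- factor(c("95", "150", "90"))
levels(low)                               # "150" "90" "95"

codes  <- as.numeric(low)                 # 3 1 2: level codes, not salaries
values <- as.numeric(as.character(low))   # 95 150 90: the salaries themselves
```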