2017-lec25

Source file ⇒ 2017-lec25.Rmd

Announcements:

The rubric and an example final project report will be available on b-courses later today.
The sign-up sheet for oral and poster presentations will be posted on the b-courses discussion board tomorrow at 10am. One member of each group should give the group name and a list of all group members.

Today

Tentative schedule
The RCurl package
Case study: Exploring salary of data science jobs with web scraping and text mining

0. Tentative schedule

Final lecture is next Tuesday. (Databases, Statistical Modeling still to come)
Review for final in lab this week and next
RRR week meeting T,Th (presentations)
Final exam is Monday May 8

1. RCurl

HTTP (Hypertext Transfer Protocol) allows for communication between a client and a host via request/response messages. At the heart of web communication is the request message, which is sent via a Uniform Resource Locator (URL).

R has only basic, limited support for protocols such as HTTPS, FTP, and FTPS.

The RCurl package gives R much stronger support for these protocols. Most important for us, it lets us fetch pages served over HTTPS so that we can parse them.

There are two high level functions in RCurl we are concerned with:

Function	Description
getURLContent()	fetches the content of a URL
getForm()	submits a Web form via the GET method

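library(XML)     # htmlParse(), xpathSApply(), getNodeSet()
library(RCurl)   # getURLContent(), getForm(), getFormParams()
library(dplyr)   # %>% pipe; select(), filter(), mutate() used later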
URL <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype="
doc <- URL %>%  #doc is a parsed HTML document
  getURLContent()%>%
  htmlParse()

URL <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype="
URL %>% getFormParams()

##      searchterms   searchlocation        newsearch         sorttype 
## "Data+Scientist"               ""           "true"               ""
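To see what getFormParams() is doing, here is a minimal base-R sketch that splits a query string into name/value pairs. The helper parse_query() is hypothetical (it is not part of RCurl) and assumes the URL contains a "?" followed by name=value pairs separated by "&".

```r
# Hypothetical helper, not part of RCurl: split a URL's query string
# into a named character vector, similar in spirit to getFormParams().
parse_query <- function(url) {
  query <- sub("^[^?]*\\?", "", url)                # drop everything up to the first "?"
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]  # one "name=value" string per parameter
  keys  <- sub("=.*$", "", pairs)                   # text before the first "="
  vals  <- sub("^[^=]*=?", "", pairs)               # text after the first "=" (may be empty)
  setNames(vals, keys)
}

url <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype="
params <- parse_query(url)
names(params)
params[["searchterms"]]
```

The names of the result are the form fields, and the values are what was typed into the form, which is exactly what we hand to getForm() next.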

baseURL <- "https://www.cybercoders.com/search/"
# getForm() URL-encodes the values itself, so pass the plain search term
doc <- getForm(baseURL, searchterms = "Data Scientist",  searchlocation = "", newsearch = "true",  sorttype = "") %>% 
  htmlParse()

2. Exploring salary of data science jobs with web scraping and text mining

We will look at the salary offered for Data Science jobs listed at CyberCoders

https://www.cybercoders.com/

Type “Data Scientist” into the “Job title, Keyword” search box.

This will take you to https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype=

Our goal is to scrape all of the Data Scientist wage info from this website and put it in a data frame with variables low and high for low and high end of salary.

Notice this is a secure website (https). This requires us to use the function getURLContent from the RCurl package.

Explore some of the job listings

URL <- "https://www.cybercoders.com/data-scientist-job-321761"
doc <-  URL %>% 
  getURLContent() %>%
  htmlParse()   
info <- doc %>%
  xpathSApply( '//div[@class="wage"]', xmlValue)
info

## [1] " Full-time $140k - $225k"

Note: You can equivalently write: info <- doc %>% getNodeSet( '//div[@class="wage"]') %>% sapply(xmlValue)
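The equivalence in the note can be checked without touching the website. The sketch below parses a small made-up HTML fragment (the markup is an assumption that only mimics the wage div of a listing) and runs both forms on it.

```r
library(XML)

# Made-up HTML fragment that mimics the structure of a job listing page
html <- '<html><body>
  <div class="wage">Full-time $95k - $120k</div>
  <div class="other">not a wage</div>
  <div class="wage">Compensation Unspecified</div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Form 1: apply xmlValue to every node matched by the XPath expression
a <- xpathSApply(doc, '//div[@class="wage"]', xmlValue)

# Form 2: collect the matching nodes first, then apply xmlValue
b <- sapply(getNodeSet(doc, '//div[@class="wage"]'), xmlValue)

a
all(a == b)
```

Both forms skip the div whose class is not "wage" and return the text of the two wage nodes.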

In Class exercise

Do example 1a

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-25-collection/

Let's write a function that, given the parsed HTML doc, returns all the salaries on the page:

cy.readPost = 
function(doc)
{
  info <- doc %>% xpathSApply( '//div[@class="wage"]', xmlValue)
  info
}

"https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype=" %>%
getURLContent() %>% 
htmlParse() %>%
cy.readPost()

##  [1] "Full-time Compensation Unspecified"
##  [2] "Full-time $95k - $120k"            
##  [3] "Full-time Compensation Unspecified"
##  [4] "Full-time $150k - $200k"           
##  [5] "Full-time $150k - $200k"           
##  [6] "Full-time $150k - $200k"           
##  [7] "Full-time Compensation Unspecified"
##  [8] "Full-time $90k - $120k"            
##  [9] "Full-time Compensation Unspecified"
## [10] "Full-time $100k - $130k"           
## [11] "Full-time $140k - $225k"           
## [12] "Full-time $100k - $150k"           
## [13] "Full-time $120k - $175k"           
## [14] "Full-time $100k - $175k"           
## [15] "Full-time Compensation Unspecified"
## [16] "Compensation Unspecified"          
## [17] "Contract $45.00-$60.00"            
## [18] "Compensation Unspecified"          
## [19] "Full-time $120k - $160k"           
## [20] "Full-time $110k - $160k"

Next we need to get the salaries off of page 2 etc.

We examine the source code of https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype= to figure out how to get the salaries from page 2.

Near the bottom of the source code we find:

 <li class="lnk-next pager-item "><a class="get-search-results next" rel="next" href="./?page=2&searchterms=Data%20Scientist&=searchlocation">»</a></li>

This element has multiple attributes, which, as we discussed last time, does not work with xpathSApply() and xmlAttrs:

link <- xpathSApply(doc, "//a[@rel='next']/@href", xmlAttrs)   # fails

This produces the error: Error in parse(text=x,srcfile=src)

So you have to use getNodeSet()

URL <- "https://www.cybercoders.com/search/?searchterms=Data+Scientist&searchlocation=&newsearch=true&sorttype="
txt <-  getURLContent(URL)
doc <- htmlParse(txt)

link <- doc %>% getNodeSet( "//a[@rel='next']/@href")
baseURL <-  "https://www.cybercoders.com/search/"
paste(baseURL,as.character(link[[1]]),sep="")

## [1] "https://www.cybercoders.com/search/./?page=2&searchterms=Data%20Scientist&searchlocation=&newsearch=true&sorttype="

We make this into a function

cy.getNextPageLink =
function(doc)
{
  baseURL = "https://www.cybercoders.com/search/"
  link = getNodeSet(doc, "//a[@rel='next']/@href")
  if(length(link) == 0)    # if there is no link then length(link) will be zero
     return(character())
  paste(baseURL,as.character(link[[1]]),sep="")
}

Now we can write a function cyberCoders() (see below) that collects the salaries from every page of results.

cy.readPost = 
function(doc)
{
  info <- doc %>% xpathSApply( '//div[@class="wage"]', xmlValue)
  info
}

cy.getNextPageLink =
function(doc)
{
  baseURL = "https://www.cybercoders.com/search/"
  link = getNodeSet(doc, "//a[@rel='next']/@href")
  if(length(link) == 0)    # if there is no link then length(link) will be zero
     return(character())
  paste(baseURL,as.character(link[[1]]),sep="")
}

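cyberCoders =
function(query)
{
   # submit the search form; getForm() returns the HTML of the first results page
   txt = getForm("https://www.cybercoders.com/search/",
                  searchterms = query,  searchlocation = "",
                  newsearch = "true",  sorttype = "")
   doc = htmlParse(txt, asText = TRUE)

   posts = character()
   while(TRUE) {
       posts = c(posts, cy.readPost(doc))     # salaries on the current page
       nextPage = cy.getNextPageLink(doc)     # URL of the next page, if any
       if(length(nextPage) == 0)              # no "next" link: last page reached
          break

       nextPage = getURLContent(nextPage)     # download the next page
       doc = htmlParse(nextPage, asText = TRUE)
   }
   posts
}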

dataSciPosts = cyberCoders("Data Scientist")
head(dataSciPosts)

## [1] "Full-time Compensation Unspecified"
## [2] "Full-time $95k - $120k"            
## [3] "Full-time Compensation Unspecified"
## [4] "Full-time $150k - $200k"           
## [5] "Full-time $150k - $200k"           
## [6] "Full-time $150k - $200k"
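The loop in cyberCoders() follows the rel="next" link until a page has none. As a rough offline sketch of that pattern, the code below replaces the website with two made-up pages held in a list. The names pages, fetch() and scrape_all() are hypothetical, and fetch() stands in for getURLContent().

```r
library(XML)

# Two made-up "pages" held in memory; the list names play the role of URLs
pages <- list(
  page1 = '<html><body><div class="wage">Full-time $95k - $120k</div>
           <a rel="next" href="page2">next</a></body></html>',
  page2 = '<html><body><div class="wage">Full-time $150k - $200k</div>
           </body></html>'
)

# Stand-in for getURLContent(): look the page up instead of downloading it
fetch <- function(url) pages[[url]]

scrape_all <- function(start) {
  posts <- character()
  url <- start
  while (length(url) == 1) {
    doc   <- htmlParse(fetch(url), asText = TRUE)
    posts <- c(posts, xpathSApply(doc, '//div[@class="wage"]', xmlValue))
    link  <- getNodeSet(doc, "//a[@rel='next']/@href")
    # no rel="next" link means this was the last page, so the loop stops
    url   <- if (length(link) == 0) character() else as.character(link[[1]])
  }
  posts
}

res <- scrape_all("page1")
res
```

The second page has no next link, so the loop ends after collecting one salary string from each page.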

In Class exercise

Do example 1b

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-25-collection/

Finally, let's convert our vector of salaries into a data frame.

#dataSciPosts = as.character(cyberCoders("Data Scientist"))
#head(dataSciPosts,20)

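# extractMatches() is not base R or dplyr; it is assumed here to come from the DataComputing package
salaries <-  as.data.frame(dataSciPosts) %>% 
  extractMatches(pattern="([[:digit:]]+)k - \\$([[:digit:]]+)", dataSciPosts, low=1, high=2)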
head(salaries,20)

##                          dataSciPosts  low high
## 1  Full-time Compensation Unspecified <NA> <NA>
## 2              Full-time $95k - $120k   95  120
## 3  Full-time Compensation Unspecified <NA> <NA>
## 4             Full-time $150k - $200k  150  200
## 5             Full-time $150k - $200k  150  200
## 6             Full-time $150k - $200k  150  200
## 7  Full-time Compensation Unspecified <NA> <NA>
## 8              Full-time $90k - $120k   90  120
## 9  Full-time Compensation Unspecified <NA> <NA>
## 10            Full-time $100k - $130k  100  130
## 11            Full-time $140k - $225k  140  225
## 12            Full-time $100k - $150k  100  150
## 13            Full-time $120k - $175k  120  175
## 14            Full-time $100k - $175k  100  175
## 15 Full-time Compensation Unspecified <NA> <NA>
## 16           Compensation Unspecified <NA> <NA>
## 17             Contract $45.00-$60.00 <NA> <NA>
## 18           Compensation Unspecified <NA> <NA>
## 19            Full-time $120k - $160k  120  160
## 20            Full-time $110k - $160k  110  160
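The same regular expression can be tried with base R alone, which makes it easier to see what the two capture groups pick up. This is only a sketch on three sample strings taken from the output above; the helper pick() is hypothetical.

```r
posts <- c("Full-time $95k - $120k",
           "Full-time Compensation Unspecified",
           "Contract $45.00-$60.00")

# Group 1: digits before "k - $"; group 2: digits right after the "$"
pattern <- "([[:digit:]]+)k - \\$([[:digit:]]+)"
m <- regmatches(posts, regexec(pattern, posts))

# Each element of m is c(full match, group 1, group 2),
# or an empty vector when the pattern did not match
pick <- function(x, i) if (length(x) == 0) NA_real_ else as.numeric(x[i])

salaries_demo <- data.frame(
  post = posts,
  low  = sapply(m, pick, i = 2),
  high = sapply(m, pick, i = 3)
)
salaries_demo
```

Only the first string matches. The contract posting quotes an hourly rate with no "k", so it ends up as NA just like the unspecified one.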

salaries <- salaries %>% 
  select(low,high) %>% 
  filter(!is.na(low)) %>% 
  filter(high!=0) 
head(salaries)

##   low high
## 1  95  120
## 2 150  200
## 3 150  200
## 4 150  200
## 5  90  120
## 6 100  130

str(salaries)

## 'data.frame':    64 obs. of  2 variables:
##  $ low : Factor w/ 15 levels "100","110","120",..: 15 6 6 6 14 1 5 1 3 1 ...
##  $ high: Factor w/ 14 levels "120","130","140",..: 1 7 7 7 1 2 8 4 6 6 ...

salaries$low

##  [1] 95  150 150 150 90  100 140 100 120 100 120 110 100 100 160 110 100
## [18] 90  100 150 130 175 150 150 160 150 110 160 160 150 70  120 75  110
## [35] 70  100 100 130 130 150 150 140 100 100 100 100 100 100 200 160 250
## [52] 200 250 175 175 130 150 110 110 70  120 120 140 60 
## Levels: 100 110 120 130 140 150 160 175 200 250 60 70 75 90 95

salaries$low %>% as.character()

##  [1] "95"  "150" "150" "150" "90"  "100" "140" "100" "120" "100" "120"
## [12] "110" "100" "100" "160" "110" "100" "90"  "100" "150" "130" "175"
## [23] "150" "150" "160" "150" "110" "160" "160" "150" "70"  "120" "75" 
## [34] "110" "70"  "100" "100" "130" "130" "150" "150" "140" "100" "100"
## [45] "100" "100" "100" "100" "200" "160" "250" "200" "250" "175" "175"
## [56] "130" "150" "110" "110" "70"  "120" "120" "140" "60"

salaries <- salaries %>% 
  mutate(low = as.numeric(as.character(low)),
         high = as.numeric(as.character(high)))
head(salaries)

##   low high
## 1  95  120
## 2 150  200
## 3 150  200
## 4 150  200
## 5  90  120
## 6 100  130
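The detour through as.character() matters because low and high are factors. Calling as.numeric() directly on a factor returns the internal level codes, not the numbers printed on screen. A small sketch with made-up values:

```r
f <- factor(c("95", "150", "150", "90"))
levels(f)                     # sorted as text: "150" "90" "95"

as.numeric(f)                 # level codes: 3 1 1 2
as.numeric(as.character(f))   # the values themselves: 95 150 150 90
```

Since R 4.0.0, data.frame() and as.data.frame() no longer turn strings into factors by default, so on a recent R version the columns may already be character and a plain as.numeric() would be enough.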

salaries %>% summarize(low=mean(low),high=mean(high))

##        low     high
## 1 128.0469 199.7656