This is a project for your entire assigned group to work on together, since being able to work effectively on a virtual team is a key “soft skill” for data scientists. Please note especially the requirement about making a presentation during our first meetup after the project is due.

W. Edwards Deming said, “In God we trust, all others must bring data.” Please use data to answer the question, “Which are the most valued data science skills?” Consider your work as an exploration; there is not necessarily a “right answer.”

Data Mining and Web Scraping

To see which skill sets are important to a data scientist, we looked at three references to get a better idea of which tools employers expect data scientists to use.

The first reference (https://www.crowdflower.com/what-skills-should-data-scientists-have-in-2016/) is an online article listing the skills its authors considered most important:

SQL, Hadoop, Python, Java, R, Hive, MapReduce, NoSQL, Pig, SAS, C, Oracle, Teradata, SPSS, MATLAB, Perl, MySQL, PostgreSQL, Ruby, HTML, and Stata.

The second reference is an academic paper, “Identifying Data Science and Analytics Competencies Based on Industry Demand” by Ashraf Shirani of San Jose State University, published in Issues in Information Systems, Volume 17, Issue 4, pp. 137-144, 2016. This paper covers not only hard skills but soft skills as well.

Shirani, Ashraf. “Identifying Data Science and Analytics Competencies Based on Industry Demand.” Issues in Information Systems 17.4 (2016): 137-144.

Our third and final reference is another journal article, “An Empirical Analysis of Requirements for Data Scientists Using Online Job Postings” by John Yohahn Kim and Choong Kwon Lee, published in the International Journal of Software Engineering and Its Applications, Vol. 10, No. 4 (2016), pp. 161-172. It lists skill sets similar to those in the first two references.

Kim, John Yohahn, and Choong Kwon Lee. “An Empirical Analysis of Requirements for Data Scientists Using Online Job Postings.” International Journal of Software Engineering and Its Applications 10.4 (2016): 161-172.

With this information in hand, we scraped three job sites, Monster.com, Kaggle.com, and Kdnuggets.com, for the term “Data Scientist” in the New York City area. We created a vector of the tools deemed important by the references above and counted the number of times each term appeared in the scraped postings, using that count as a proxy for the tool’s importance. (For example, if SQL showed up 100 times and C++ showed up 10 times, SQL would be, by proxy, more important than C++.)
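As a toy illustration of the counting idea, here is a minimal sketch on a made-up fragment of a job description (the posting text below is invented purely for the example):

library(stringr)

# Made-up fragment of a job posting, lower-cased the same way the real scrape will be
posting <- "experience with sql and python required; sql experience preferred over c++"

# str_count() tallies how often each pattern appears; higher counts act as our proxy for importance
str_count(posting, c("sql", "python", "c\\+\\+"))
# returns 2 1 1, so sql ranks highest in this toy example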

First, we must load up the libraries:

library(RCurl)
library(XML)
library(stringr)
library(tidyr)
library(dplyr)
library(rvest)

A vector data.science.skills was created and populated from the information above.

data.science.skills <- c("sql", "hadoop", "python", "java", "r", "hive", "mapreduce", "nosql", "pig", "sas", "c", "oracle", "teradata", "spss", "matlab", "perl", "mysql", "postgresql", "ruby", "html", "stata", "experience", "business", "team", "skills", "analytical", "data mining", "problem solving", "understanding", "algorithms", "statistics", "teams", "modeling", "performance", "agile", "big data", "communication skills", "computer science", "mathematics", "solving", "people", "spark", "distributed", "software development", "group", "ideas", "interpersonal", "apache", "yarn", "scala", "mongodb", "cassandra", "hbase", "c++", "c#", "leadership")

Monster.com

Monster.com was the first website we scraped. From the main page, we searched for “Data Scientist” and then scraped the results pages (pages 1 through 10).

To do this, we used the rvest package in conjunction with dplyr and SelectorGadget. rvest, created by Hadley Wickham, provides a seamless way to extract (scrape) information from the web: it takes a CSS selector and returns the node containing the information we need. Finding the right selector is where SelectorGadget comes in; it lets us highlight the text we want on a page and reports the corresponding “node” in the HTML/XML document.
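In its simplest form, the workflow is: read the page, hand rvest the CSS selector that SelectorGadget reports, and pull out the text or attribute you need. A minimal sketch of that pattern, reusing the ".jobTitle span" selector found for Monster below (here html_text() should return the visible job titles rather than the link URLs used later):

library(rvest)

# Read the search-results page once
page <- read_html("https://www.monster.com/jobs/search/?q=data-scientist")

# html_nodes() takes the CSS selector reported by SelectorGadget;
# html_text() returns the visible text inside those nodes
page %>% html_nodes(".jobTitle span") %>% html_text()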

In the following step, we will go through all of the job links that contain the job descriptions. To do this, we need to look for the “nodes” that contain their URLs.

# Page 1 of Monster's search

# Read the HTML file
html <- read_html("https://www.monster.com/jobs/search/?q=data-scientist")

# Using SelectorGadget, we found that the links are in "h2 , .jobTitle span".
html.result <- html %>% html_nodes("h2 , .jobTitle span") %>% html_nodes("a") %>% html_attr("href")
monster.links <- html.result

# First 6 results of the url scraping.
head(monster.links)
## [1] "http://job-openings.monster.com/monster/1617280d-1e04-4766-8115-e083ea4175bc?mescoid=1500152001001&jobPosition=1"                                          
## [2] "http://job-openings.monster.com/monster/77c3e4ed-1cce-482e-b46f-526696ea30d2?mescoid=1500152001001&jobPosition=2"                                          
## [3] "http://jobview.monster.com/healthcare-analyst-data-scientist-san-antonio-tx-job-san-antonio-tx-us-181972505.aspx?mescoid=1300087001001&jobPosition=3"      
## [4] "http://jobview.monster.com/sr-data-scientist-machine-learning-data-mining-san-mateo-job-san-mateo-ca-us-181910857.aspx?mescoid=1500152001001&jobPosition=3"
## [5] "http://jobview.monster.com/data-scientist-with-machine-learning-experience-job-bridgewater-nj-us-181303479.aspx?mescoid=1500152001001&jobPosition=3"       
## [6] "http://jobview.monster.com/software-engineer-data-scientist-job-redwood-city-ca-us-181199991.aspx?mescoid=1500127001001&jobPosition=3"
# Now to scrape multiple pages.
# Here, we will scrape from pages 2 thru 10
# Below is the reference I had used to help me scrape multiple pages.
# http://stackoverflow.com/questions/7774991/how-do-i-scrape-multiple-pages-with-xml-and-readhtmltable

# First, we notice that with each page number, only the very end of the URL address changes. 
# A vector of URL links is created.

page_numbers <- 2:10
urls <- paste(
  "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=", 
  page_numbers, 
  sep = ""
)

head(urls)
## [1] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=2"
## [2] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=3"
## [3] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=4"
## [4] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=5"
## [5] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=6"
## [6] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=7"
# Now run a for loop and scrape all the urls from pages 2 thru 10.
#for (i in seq_along(urls)){
#  temp.html <- read_html(urls[i])
#  temp.html.result <- temp.html %>% html_nodes("h2") %>% html_nodes("a") %>% html_attr("href")
#  monster.links <- c(monster.links, temp.html.result)
#}

For some reason, running this for loop crashes the R Markdown render, so it is left commented out here. When the same loop was copied into a plain R script, it worked perfectly well. I suspect the crash happens because monster.com has since blocked me, most likely for overloading their servers with scraping requests. Before being blocked, I did manage to grab the full list of links, so I saved the monster.links data set manually and load it back into R here for reproducibility. (The monster.links file is only accessible through the .Rmd file and will not show up on RPubs, since that chunk is run with echo=FALSE.)
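The save/load step itself is hidden by echo=FALSE; a minimal sketch of what it might look like, assuming the links were written to a local file (the file name monster_links.rds is an assumption, not the actual file used):

# Hypothetical save/load for reproducibility; the real file name and location may differ
# saveRDS(monster.links, "monster_links.rds")   # run once, before access was blocked
monster.links <- readRDS("monster_links.rds")   # reload the saved links inside the .Rmd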

head(monster.links)
## [1] "http://job-openings.monster.com/monster/1617280d-1e04-4766-8115-e083ea4175bc?mescoid=1500152001001&jobPosition=1"                                          
## [2] "http://job-openings.monster.com/monster/77c3e4ed-1cce-482e-b46f-526696ea30d2?mescoid=1500152001001&jobPosition=2"                                          
## [3] "http://jobview.monster.com/healthcare-analyst-data-scientist-san-antonio-tx-job-san-antonio-tx-us-181972505.aspx?mescoid=1300087001001&jobPosition=3"      
## [4] "http://jobview.monster.com/sr-data-scientist-machine-learning-data-mining-san-mateo-job-san-mateo-ca-us-181910857.aspx?mescoid=1500152001001&jobPosition=3"
## [5] "http://jobview.monster.com/data-scientist-with-machine-learning-experience-job-bridgewater-nj-us-181303479.aspx?mescoid=1500152001001&jobPosition=3"       
## [6] "http://jobview.monster.com/software-engineer-data-scientist-job-redwood-city-ca-us-181199991.aspx?mescoid=1500127001001&jobPosition=3"
paste("The total amount of links in monster.com: ", length(monster.links))
## [1] "The total amount of links in monster.com:  225"

As you can see, we have 225 links to job postings. We want to tally the number of times each term (each data science tool or skill) appears, using stringr. First, we convert data.science.skills to regular expressions.

# Will need regex language to extract out the data scientist skill sets
data.science.skills.regex <- NA
for (i in 1:length(data.science.skills)){
  temp.regex <- paste("\\s",data.science.skills[i],"\\s", sep = "")
  data.science.skills.regex <- c(data.science.skills.regex, temp.regex)
}

# Remove the NA from the 1st value
data.science.skills.regex <- data.science.skills.regex[!is.na(data.science.skills.regex)]
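Since paste() is vectorized over its arguments, the same regular expressions could also be built without a loop; an equivalent sketch:

# Equivalent to the loop above: pad every skill with \s in one vectorized call,
# producing patterns such as "\\ssql\\s", "\\shadoop\\s", ...
data.science.skills.regex <- paste("\\s", data.science.skills, "\\s", sep = "")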

Now if we look at monster.links, we notice that the links follow two distinct URL patterns:

1. http://job-openings.monster.com/monster/…
2. http://jobview.monster.com/

When you look at the page source for each of these, it follows a consistent pattern that lets us extract the information we need.

For pattern 1, http://job-openings.monster.com/monster/…:

#job.openings.monster <- unlist(str_extract_all(monster.links, "http://job-openings.monster.com/monster/.*"))

#sample1 <- data.frame()
#for (i in 1:length(job.openings.monster)){
#  a <- read_html(job.openings.monster[i])
#  b <- html_node(a, ".jobview-section")
#  # To make sure we capture all of the data, we will use all lower case letters for stringr
#  c <- tolower(html_text(b))
#  # Now to get rid of the \r and \n
#  c <- gsub("[\r\n]", " ", c)
#  
#  # Now to extract the frequency of each skillset from each website
#  d <- unlist(str_extract_all(c, data.science.skills.regex))
#  e <- as.data.frame(table(d))
#  sample1 <- rbind(sample1, e)
#}

Again, this is all commented out: when this code runs, the R Markdown render returns an error because monster.com is blocking any further scraping (an HTTP 403 error).
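One way to keep the render from failing on blocked pages would be to wrap each request in tryCatch() and simply skip any URL that errors out. A minimal sketch of that idea (not the approach actually used here; it reuses monster.links and data.science.skills.regex defined above):

# Sketch: skip any URL that errors (e.g. an HTTP 403) instead of stopping the loop
job.openings.monster <- unlist(str_extract_all(monster.links, "http://job-openings.monster.com/monster/.*"))
sample1 <- data.frame()
for (i in 1:length(job.openings.monster)){
  page <- tryCatch(read_html(job.openings.monster[i]), error = function(e) NULL)
  if (is.null(page)) next                    # request was blocked or failed; skip it
  text <- tolower(html_text(html_node(page, ".jobview-section")))
  text <- gsub("[\r\n]", " ", text)
  found <- unlist(str_extract_all(text, data.science.skills.regex))
  sample1 <- rbind(sample1, as.data.frame(table(found)))
}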

For pattern 2, http://jobview.monster.com/:

jobview.monster <- unlist(str_extract_all(monster.links, "http://jobview.monster.com/.*"))

sample2 <- data.frame()
for (i in 1:length(jobview.monster)){
  a <- read_html(jobview.monster[i])
  b <- html_node(a, "font")
  # To make sure we capture all of the data, we will use all lower case letters for stringr
  c <- tolower(html_text(b))
  # Now to get rid of the \r and \n
  c <- gsub("[\r\n]", " ", c)
  
  # Now to extract the frequency of each skillset from each website
  d <- unlist(str_extract_all(c, data.science.skills.regex))
  e <- as.data.frame(table(d))
  sample2 <- rbind(sample2, e)
}

# Now to tidy up the data so that we can use it for data analysis
tally <- sample2 %>% group_by(d) %>% summarise(Sum = sum(Freq)) %>% arrange(desc(Sum))
tally
## # A tibble: 7 × 2
##                 d   Sum
##            <fctr> <int>
## 1          ideas      2
## 2         people      1
## 3          nosql      1
## 4         skills      1
## 5          spark      1
## 6     statistics      1
## 7  understanding      1

Unfortunately, this data set is not large enough. Given this limited data, we will scrape two other data science sites for job postings.

Kaggle.com

We will look at the jobs section on kaggle.com. This time, I will take extra care not to get banned from websites for scraping by introducing Sys.sleep(), a function that pauses execution and deliberately slows down the scraping loop so that we do not overburden the servers.

# Read the html file from kaggle.com/jobs
kaggle.url <- read_html("https://www.kaggle.com/jobs")

# Looking at the page source, the links are broken up into two different nodes
kaggle.links <- kaggle.url %>% html_nodes("#main > div > div > div:nth-child(3)") %>% html_nodes("a") %>% html_attr("href")
kaggle.links1 <- kaggle.url %>% html_nodes("#main > div > div > div:nth-child(2)") %>% html_nodes("a") %>% html_attr("href")

# Now combine the two links files together into one large links file.
kaggle.links <- c(kaggle.links, kaggle.links1)

# Now we add in the "https://www.kaggle.com/"
kaggle.link.num <-1:length(kaggle.links)
kaggle.urls <- paste("https://www.kaggle.com",kaggle.links[kaggle.link.num], sep = "")
head(kaggle.urls)
## [1] "https://www.kaggle.com/jobs/17707/abb-b-v-business-developer-datacentres-rotterdam-netherlands"             
## [2] "https://www.kaggle.com/jobs/17706/vonq-data-scientist-rotterdam-or-amsterdam-wherever-you-please"           
## [3] "https://www.kaggle.com/jobs/17704/rakuten-inc-data-scientist-for-travel-service-tokyo-japan"                
## [4] "https://www.kaggle.com/jobs/17703/institute-for-defense-analyses-research-associate-test-science-alexandria"
## [5] "https://www.kaggle.com/jobs/17702/institute-for-defense-analyses-research-analyst-statistician-alexandria"  
## [6] "https://www.kaggle.com/jobs/17701/laureate-education-data-scientist-columbia-md"
# Now to start counting
kaggle.words <- data.frame()
for (i in 1:length(kaggle.urls)){
  Sys.sleep(runif(1, 2.0, 5.0))
  a <- read_html(kaggle.urls[i])
  b <- html_node(a, ".jobs-board-post-content")
  # To make sure we capture all of the data, we will use all lower case letters for stringr
  c <- tolower(html_text(b))
  # Now to get rid of the \r and \n
  c <- gsub("[\r\n]", " ", c)
  
  # Now to extract the frequency of each skillset from each website
  d <- unlist(str_extract_all(c, data.science.skills.regex))
  e <- as.data.frame(table(d))
  kaggle.words <- rbind(kaggle.words, e)
}

head(kaggle.words)
##              d Freq
## 1    business     3
## 2  experience     4
## 3       group     2
## 4        team     2
## 5       agile     1
## 6  algorithms     1

Kdnuggets.com

Now, to really solidify our data, let’s scrape Kdnuggets.com and then combine everything into one large data set. We will use the same scraping technique as we did for Kaggle.

kdnuggets.url <- read_html("http://www.kdnuggets.com/jobs/index.html")
kdnuggets.link <- kdnuggets.url %>% html_nodes("b a") %>% html_attr("href")
#Now to get rid of the links that are not relevant to our search
kdnuggets.link <- unlist(str_extract_all(kdnuggets.link, "http://www.kdnuggets.com/jobs/.*"))

kdnuggets.words <- data.frame()
for (i in 1:length(kdnuggets.link)){
  Sys.sleep(runif(1, 2.0, 5.0))
  a <- read_html(kdnuggets.link[i])
  b <- html_node(a, "#post-")
  # To make sure we capture all of the data, we will use all lower case letters for stringr
  c <- tolower(html_text(b))
  # Now to get rid of the \r and \n
  c <- gsub("[\r\n]", " ", c)
  
  # Now to extract the frequency of each skillset from each website
  d <- unlist(str_extract_all(c, data.science.skills.regex))
  e <- as.data.frame(table(d))
  kdnuggets.words <- rbind(kdnuggets.words, e)
}

Now that we have successfully scraped all of our data, we tidy it up and group the counts by term.

# Will add kaggle.words and kdnuggets.words to tally
tally2 <- kaggle.words %>% group_by(d) %>% summarise(Sum = sum(Freq)) %>% arrange(desc(Sum))
tally3 <- kdnuggets.words %>% group_by(d) %>% summarise(Sum = sum(Freq)) %>% arrange(desc(Sum))

total.tally <- rbind(tally, tally2, tally3)
total.tally <- total.tally %>% group_by(d) %>% summarise(Sum = sum(Sum)) %>% arrange(desc(Sum))
total.tally
## # A tibble: 40 × 2
##                     d   Sum
##                <fctr> <int>
## 1         experience    188
## 2               team     68
## 3           business     66
## 4             skills     29
## 5         algorithms     18
## 6   computer science     16
## 7              group     14
## 8              teams     14
## 9         analytical     14
## 10       distributed     12
## # ... with 30 more rows

Uploading the data set to our online MySQL database

Let’s create a connection using the RMySQL package and send total.tally to our MySQL server.

library(RMySQL)
## Loading required package: DBI
con <- dbConnect(RMySQL::MySQL(), dbname = "DataSkills", user = "admin", password = "CUNYsps2017", host = "cuny-607-project-db.ce9yg7qxcc7l.us-east-2.rds.amazonaws.com", port = 3306)
dbWriteTable(con, value = total.tally, name = "DataSkillset", append = TRUE)
## [1] TRUE
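As a quick check, we can read the table back from the server and then close the connection; a short sketch using standard DBI functions:

# Read the table back to confirm the upload, then disconnect
head(dbReadTable(con, "DataSkillset"))
dbDisconnect(con)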