W. Edwards Deming said, “In God we trust, all others must bring data.” Please use data to answer the question, “Which are the most valued data science skills?” Consider your work as an exploration; there is not necessarily a “right answer.”
I chose to use JavaScript (more specifically, Node.js) to perform this part of the analysis, since JS is well suited to working with the web. We needed to perform roughly 5000 HTTP GET requests, handle SQL accesses to store the data, and also filter the incoming data, so it made sense to do all of this in a language built for the web.
For our code, we needed two libraries: Axios and MySQL. Axios simplifies asynchronous, Promise-based HTTP GET requests; MySQL provides access to the database that we created.
var mysql = require('mysql');
var axios = require("axios");
The following code sets up the basic MySQL connection, which we later use to access the database. The password is left off for confidentiality reasons.
var db = mysql.createConnection({
host : 'cuny-607-project-db.ce9yg7qxcc7l.us-east-2.rds.amazonaws.com',
user : 'admin',
password : '**********',
database : '**********'
});
db.connect();
To see which skill sets are important to a data scientist, we looked at three references to get a better idea of which tools would serve data scientists.
The first reference (https://www.crowdflower.com/what-skills-should-data-scientists-have-in-2016/) was an internet article listing the tools its authors thought were important:
SQL, Hadoop, Python, Java, R, Hive, Mapreduce, NoSQL, Pig, SAS, C, Oracle, Teradata, SPSS, Matlab, Perl, MySQL, PostgreSQL, Ruby, HTML, and Stata.
The second reference was an academic paper titled “Identifying Data Science and Analytics Competencies Based on Industry Demand” by Ashraf Shirani of San Jose State University, published in Issues in Information Systems, Volume 17, Issue IV, pp. 137-144, 2016. This paper noted not only the hard skills but the soft skills as well.
Shirani, Ashraf. “Identifying Data Science and Analytics Competencies Based on Industry Demand.” Issues in Information Systems 17.4 (2016): 137-144.
Our third and final reference was another academic journal article, “An Empirical Analysis of Requirements for Data Scientists Using Online Job Postings” by John Yohahn Kim and Choong Kwon Lee, in the International Journal of Software Engineering and Its Applications, Vol. 10, No. 4 (2016), pp. 161-172. Again, it lists skill sets similar to the first two references.
Kim, John Yohahn, and Choong Kwon Lee. “An Empirical Analysis of Requirements for Data Scientists Using Online Job Postings.” International Journal of Software Engineering and Its Applications 10.4 (2016): 161-172.
With this information in hand, we scraped three different job sites (Monster.com, Kaggle.com, and Kdnuggets.com), searching for the term “Data Scientist” in the New York City area. We created a vector listing the tools deemed important by the references above, and used the number of times each term appeared in the scraped postings as a proxy for the importance of that tool. (For example, if SQL showed up 100 times and C++ showed up 10 times, SQL would be, by proxy, more important than C++.)
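As a minimal illustration of this counting idea (the job-description text and the resulting counts here are made up for the example, not taken from our scrape), the tallying amounts to something like:

library(stringr)  # also loaded with the other libraries below

# Hypothetical snippet of a job posting, used only to illustrate the counting proxy
description <- tolower("We need strong SQL skills, some Python, and more SQL experience.")
skills <- c("sql", "python", "hadoop")
sapply(skills, function(s) str_count(description, s))
# sql appears twice, python once, hadoop not at all, so by this proxy SQL > Python > Hadoop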
Next, we must load the libraries we need:
library(RCurl)
library(XML)
library(stringr)
library(knitr)
library(tidyr)
library(dplyr)
library(rvest)
library(RMySQL)
library(ggplot2)
library(reshape2)
library(psych)
library(data.table)
A vector data.science.skills was created and populated from the information above.
data.science.skills <- c("sql", "hadoop", "python", "java", "r", "hive", "mapreduce", "nosql", "pig", "sas", "c", "oracle", "teradata", "spss", "matlab", "perl", "mysql", "postgresql", "ruby", "html", "stata", "experience", "business", "team", "skills", "analytical", "data mining", "problem solving", "understanding", "algorithms", "statistics", "teams", "modeling", "performance", "agile", "big data", "communication skills", "computer science", "mathematics", "solving", "people", "spark", "distributed", "software development", "group", "ideas", "interpersonal", "apache", "yarn", "scala", "mongodb", "cassandra", "hbase", "c++", "c#", "leadership")
Monster.com was the first website we scraped for the information. From the main page, we typed in “Data Scientist” and then scraped the results pages (Pages 1 through 10).
To get this done, we used the rvest package in conjunction with dplyr and SelectorGadget. rvest is a package created by Hadley Wickham that provides a more seamless way to extract (and scrape) information from the web. What makes the package so powerful is that it uses CSS selectors to pick out the nodes that contain the information we need. SelectorGadget makes this practical: it lets us highlight the text we want on a page and gives us the corresponding “node” in the HTML/XML file.
In the following step, we are going to go through all of the job links that contain the job descriptions. To do this, we need to look for the “nodes” that contain their URLs.
# Page 1 of Monster's search
# Read the HTML file
html <- read_html("https://www.monster.com/jobs/search/?q=data-scientist")
# By using Selector Gadget, I had found that the links are in h2 , .jobTitle span.
html.result <- html %>% html_nodes("h2 , .jobTitle span") %>% html_nodes("a") %>% html_attr("href")
monster.links <- html.result
# First 6 results of the url scraping.
head(monster.links)
## [1] "http://job-openings.monster.com/monster/1617280d-1e04-4766-8115-e083ea4175bc?mescoid=1500152001001&jobPosition=1"
## [2] "http://job-openings.monster.com/monster/77c3e4ed-1cce-482e-b46f-526696ea30d2?mescoid=1500152001001&jobPosition=2"
## [3] "http://jobview.monster.com/sr-data-scientist-machine-learning-data-mining-san-mateo-job-san-mateo-ca-us-182144270.aspx?mescoid=1500152001001&jobPosition=3"
## [4] "http://jobview.monster.com/healthcare-analyst-data-scientist-san-antonio-tx-job-san-antonio-tx-us-181972505.aspx?mescoid=1300087001001&jobPosition=3"
## [5] "http://jobview.monster.com/data-scientist-with-machine-learning-experience-job-bridgewater-nj-us-181303479.aspx?mescoid=1500152001001&jobPosition=3"
## [6] "http://jobview.monster.com/software-engineer-data-scientist-job-redwood-city-ca-us-181199991.aspx?mescoid=1500127001001&jobPosition=3"
# Now to scrape multiple pages.
# Here, we will scrape from pages 2 thru 10
# Below is the reference I had used to help me scrape multiple pages.
# http://stackoverflow.com/questions/7774991/how-do-i-scrape-multiple-pages-with-xml-and-readhtmltable
# First, we notice that with each page number, only the very end of the URL address changes.
# A vector of url links are created.
page_numbers <- 2:10
urls <- paste(
"http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=",
page_numbers,
sep = "")
head(urls)
## [1] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=2"
## [2] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=3"
## [3] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=4"
## [4] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=5"
## [5] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=6"
## [6] "http://www.monster.com/jobs/search/?q=data-scientist&where=new-york-city&page=7"
# Now run a for loop and scrape all the urls from pages 2 thru 10.
#for (i in 2:10){
# temp.html <- read_html(urls[i])
# temp.html.result <- temp.html %>% html_nodes("h2") %>% html_nodes("a") %>% html_attr("href")
# monster.links <- c(monster.links, temp.html.result)
#}
I noticed that running this for loop crashed my R Markdown, so it is left commented out; when I copied it over to my script, it worked perfectly well. I suspect it is crashing now because I have been “booted” off monster.com, likely for overloading their servers with scraping. I did manage to grab the list of links before being booted off, so I manually saved the monster.links data set and loaded it into R here for reproducibility. (The monster.links file is only accessible through the .rmd file and will not show up on RPubs, given r echo=FALSE.)
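A simple way to make that kind of snapshot reproducible (the file name below is just an illustration, not the actual file referenced in the .rmd) is to serialize the scraped vector once and reload it on later knits:

# Hypothetical snapshot workflow for the scraped links
saveRDS(monster.links, "monster_links.rds")    # run once, right after a successful scrape
monster.links <- readRDS("monster_links.rds")  # run on later knits instead of re-scraping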
head(monster.links)
## [1] "http://job-openings.monster.com/monster/1617280d-1e04-4766-8115-e083ea4175bc?mescoid=1500152001001&jobPosition=1"
## [2] "http://job-openings.monster.com/monster/77c3e4ed-1cce-482e-b46f-526696ea30d2?mescoid=1500152001001&jobPosition=2"
## [3] "http://jobview.monster.com/healthcare-analyst-data-scientist-san-antonio-tx-job-san-antonio-tx-us-181972505.aspx?mescoid=1300087001001&jobPosition=3"
## [4] "http://jobview.monster.com/sr-data-scientist-machine-learning-data-mining-san-mateo-job-san-mateo-ca-us-181910857.aspx?mescoid=1500152001001&jobPosition=3"
## [5] "http://jobview.monster.com/data-scientist-with-machine-learning-experience-job-bridgewater-nj-us-181303479.aspx?mescoid=1500152001001&jobPosition=3"
## [6] "http://jobview.monster.com/software-engineer-data-scientist-job-redwood-city-ca-us-181199991.aspx?mescoid=1500127001001&jobPosition=3"
paste("The total amount of links in monster.com: ", length(monster.links))
## [1] "The total amount of links in monster.com: 225"
As you can see, we have 225 links to job postings. We wanted to tally the total number of occurrences of each term (data science tool set) using stringr. First, we had to convert data.science.skills to regular expressions.
# Will need regex language to extract out the data scientist skill sets
data.science.skills.regex <- NA
for (i in 1:length(data.science.skills)){
temp.regex <- paste("\\s",data.science.skills[i],"\\s", sep = "")
data.science.skills.regex <- c(data.science.skills.regex, temp.regex)
}
# Remove the NA from the 1st value
data.science.skills.regex <- data.science.skills.regex[!is.na(data.science.skills.regex)]
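As an aside, an equivalent, loop-free way to build the same vector (paste0() is vectorized, so no placeholder NA is needed):

# Equivalent vectorized construction of the skill regexes
data.science.skills.regex <- paste0("\\s", data.science.skills, "\\s")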
Now if we look at monster.links, we notice that there are two distinct URL patterns:
1. http://job-openings.monster.com/monster/…
2. http://jobview.monster.com/…
When you look at each site's page source, it follows a consistent pattern that lets us extract the information we need.
For 1. http://job-openings.monster.com/monster/…
#job.openings.monster <- unlist(str_extract_all(monster.links, "http://job-openings.monster.com/monster/.*"))
#sample1 <- data.frame()
#for (i in 1:length(job.openings.monster)){
# a <- read_html(job.openings.monster[i])
# b <- html_node(a, ".jobview-section")
# # To make sure we capture all of the data, we will use all lower case letters for stringr
# c <- tolower(html_text(b))
# # Now to get rid of the \r and \n
# c <- gsub("[\r\n]", " ", c)
#
# # Now to extract the frequency of each skillset from each website
# d <- unlist(str_extract_all(samp.x, data.science.skills.regex))
# e <- as.data.frame(table(d))
# sample1 <- rbind(sample1, e)
#}
Again, this is all commented out: when I run this code, the R Markdown returns an error because Monster.com is now preventing us from scraping the site (HTTP 403 error).
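When a site starts refusing requests like this, one defensive pattern (a sketch of what we could have done, not the code we actually ran) is to wrap each request in tryCatch() so a single blocked or dead link does not abort the whole loop:

# Sketch: return NULL for links that error out (403, 404, timeouts) instead of stopping
safe_read_html <- function(url) {
  tryCatch(read_html(url), error = function(e) NULL)
}
page <- safe_read_html(monster.links[1])
if (is.null(page)) message("Skipping a link that could not be read")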
Now for 2. http://jobview.monster.com/…:
jobview.monster <- unlist(str_extract_all(monster.links, "http://jobview.monster.com/.*"))
sample2 <- data.frame()
#for (i in 1:length(jobview.monster)){
# a <- read_html(jobview.monster[i])
# b <- html_node(a, "font")
# # To make sure we capture all of the data, we will use all lower case letters for stringr
# c <- tolower(html_text(b))
# # Now to get rid of the \r and \n
# c <- gsub("[\r\n]", " ", c)
#
# # Now to extract the frequency of each skillset from each website
# d <- unlist(str_extract_all(c, data.science.skills.regex))
# e <- as.data.frame(table(d))
# sample2 <- rbind(sample2, e)
#}
# Now to tidy up the data so that we can use it for data analysis
#tally <- sample2 %>% group_by(d) %>% summarise(Sum = sum(Freq)) %>% arrange(desc(Sum))
#tally
# Initially, this for loop worked, but by the time we started to knit this project the sites and job postings had changed and some of the pages now return 404 errors, which breaks this chunk. Given that we had already scraped the data and uploaded it to MySQL, the loop is left in comments to show the work.
Unfortunately, this data set is not large enough, so we will scrape two other data science sites to look for jobs.
We will look at the jobs section of kaggle.com. This time we will take extra care not to get ourselves banned for scraping, by introducing Sys.sleep, a function that purposely slows down the scraping process so as not to overburden the servers and cause trouble.
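As a small sketch (the helper name is ours, not part of the code below), the delay and the request can be bundled together so every call is automatically throttled:

# Sketch: a throttled reader that pauses 2-5 seconds before each request
polite_read_html <- function(url, min_wait = 2, max_wait = 5) {
  Sys.sleep(runif(1, min_wait, max_wait))  # random pause so requests are not fired in bursts
  read_html(url)
}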
# Read the html file from kaggle.com/jobs
kaggle.url <- read_html("https://www.kaggle.com/jobs")
# Looking at the page source, the links are broken up into two different nodes
kaggle.links <- kaggle.url %>% html_nodes("#main > div > div > div:nth-child(3)") %>% html_nodes("a") %>% html_attr("href")
kaggle.links1 <- kaggle.url %>% html_nodes("#main > div > div > div:nth-child(2)") %>% html_nodes("a") %>% html_attr("href")
# Now combine the two links files together into one large links file.
kaggle.links <- c(kaggle.links, kaggle.links1)
# Now we add in the "https://www.kaggle.com/"
kaggle.link.num <-1:length(kaggle.links)
kaggle.urls <- paste("https://www.kaggle.com",kaggle.links[kaggle.link.num], sep = "")
head(kaggle.urls)
## [1] "https://www.kaggle.com/jobs/17712/sureid-vice-president-of-data-science-machine-learning-portland-metro-area"
## [2] "https://www.kaggle.com/jobs/17711/poki-lead-data-scientist-amsterdam-the-netherlands"
## [3] "https://www.kaggle.com/jobs/17708/j-b-hunt-transport-data-scientist-lowell-ar"
## [4] "https://www.kaggle.com/jobs/17707/abb-b-v-business-developer-datacentres-rotterdam-netherlands"
## [5] "https://www.kaggle.com/jobs/17706/vonq-data-scientist-rotterdam-or-amsterdam-wherever-you-please"
## [6] "https://www.kaggle.com/jobs/17704/rakuten-inc-data-scientist-for-travel-service-tokyo-japan"
# Now to start counting
kaggle.words <- data.frame()
for (i in 1:length(kaggle.urls)){
Sys.sleep(runif(1, 2.0, 5.0))
a <- read_html(kaggle.urls[i])
b <- html_node(a, ".jobs-board-post-content")
# To make sure we capture all of the data, we will use all lower case letters for stringr
c <- tolower(html_text(b))
# Now to get rid of the \r and \n
c <- gsub("[\r\n]", " ", c)
# Now to extract the frequency of each skillset from each website
d <- unlist(str_extract_all(c, data.science.skills.regex))
e <- as.data.frame(table(d))
kaggle.words <- rbind(kaggle.words, e)
}
head(kaggle.words)
## d Freq
## 1 algorithms 2
## 2 big data 1
## 3 distributed 1
## 4 experience 7
## 5 hadoop 1
## 6 leadership 1
Now to really solidify our data. Let’s scrape Kdnuggets.com and then combine all of the data into one large data set. We’ll be using similar techniques to scrape as we did for kaggle.
kdnuggets.url <- read_html("http://www.kdnuggets.com/jobs/index.html")
kdnuggets.link <- kdnuggets.url %>% html_nodes("b a") %>% html_attr("href")
#Now to get rid of the links that are not relevant to our search
kdnuggets.link <- unlist(str_extract_all(kdnuggets.link, "http://www.kdnuggets.com/jobs/.*"))
kdnuggets.words <- data.frame()
for (i in 1:length(kdnuggets.link)){
Sys.sleep(runif(1, 2.0, 5.0))
a <- read_html(kdnuggets.link[i])
b <- html_node(a, "#post-")
# To make sure we capture all of the data, we will use all lower case letters for stringr
c <- tolower(html_text(b))
# Now to get rid of the \r and \n
c <- gsub("[\r\n]", " ", c)
# Now to extract the frequency of each skillset from each website
d <- unlist(str_extract_all(c, data.science.skills.regex))
e <- as.data.frame(table(d))
kdnuggets.words <- rbind(kdnuggets.words, e)
}
Now that we have successfully scraped all of our data, we will tidy it up and group it by term.
# Will add kaggle.words and kdnuggets.words to tally
tally2 <- kaggle.words %>% group_by(d) %>% summarise(Sum = sum(Freq)) %>% arrange(desc(Sum))
tally3 <- kdnuggets.words %>% group_by(d) %>% summarise(Sum = sum(Freq)) %>% arrange(desc(Sum))
# total.tally <- rbind(tally, tally2, tally3)  # there is no 'tally' object (the Monster loop is commented out), so instead:
# write total.tally as a combination of tally2 and tally3
total.tally <- rbind(tally2, tally3)
total.tally <- total.tally %>% group_by(d) %>% summarise(Sum = sum(Sum)) %>% arrange(desc(Sum))
total.tally
## # A tibble: 37 × 2
## d Sum
## <fctr> <int>
## 1 experience 152
## 2 business 62
## 3 team 56
## 4 skills 24
## 5 algorithms 16
## 6 teams 16
## 7 computer science 14
## 8 leadership 12
## 9 analytical 12
## 10 group 12
## # ... with 27 more rows
Let’s create a connection using the RMySQL package and send the total.tally to our MySQL server.
con <- dbConnect(RMySQL::MySQL(), dbname = "DataSkills", user = "admin", password = "CUNYsps2017", host = "cuny-607-project-db.ce9yg7qxcc7l.us-east-2.rds.amazonaws.com", port = 3306)
dbWriteTable(con, value = total.tally, name = "DataSkillset", append = TRUE)
## [1] TRUE
**Collect the info of job posts related to Data Science from indeed.com**
We start by getting XML files from the Application Programming Interface (API) for indeed.com.
For security purposes, part of the code is not shown. The code looks like this:
indeedapi.1 <- getURL("http://api.indeed.com/ads/apisearch?publisher=#####&format=xml&q=data+analytics&l=&sort=relevance&radius&st=&jt=fulltime&start=0&limit=25&fromage=&filter=&latlong=1&co=us&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2", .opts=curlOptions(followlocation = TRUE))
# parse xml files
xml.1 <- xmlParse(indeedapi.1)
xml.2 <- xmlParse(indeedapi.2)
xml.3 <- xmlParse(indeedapi.3)
xml.4 <- xmlParse(indeedapi.4)
xml.5 <- xmlParse(indeedapi.5)
xml.6 <- xmlParse(indeedapi.6)
xml.7 <- xmlParse(indeedapi.7)
xml.8 <- xmlParse(indeedapi.8)
# extract the job titles from the xml files
jobtitle.extr.1 <- xpathApply(xml.1, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.2 <- xpathApply(xml.2, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.3 <- xpathApply(xml.3, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.4 <- xpathApply(xml.4, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.5 <- xpathApply(xml.5, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.6 <- xpathApply(xml.6, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.7 <- xpathApply(xml.7, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.8 <- xpathApply(xml.8, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitles <- unlist(cbind(jobtitle.extr.1,jobtitle.extr.2,jobtitle.extr.3,jobtitle.extr.4,jobtitle.extr.5,jobtitle.extr.6,jobtitle.extr.7,jobtitle.extr.8))
# create a dataframe for job title
jobtitle.df <- as.data.frame(t(jobtitles))
jobtitle.df <- gather(jobtitle.df,"No","jobtitle",1:200)
## Warning: attributes are not identical across measure variables; they will
## be dropped
jobtitle.df$No <- str_replace(jobtitle.df$No, "V","")
str(jobtitle.df)
## 'data.frame': 200 obs. of 2 variables:
## $ No : chr "1" "2" "3" "4" ...
## $ jobtitle: chr "Junior Data Scientist" "Data Analyst (Remote)" "Data Scientist" "Junior Data Scientist" ...
head(jobtitle.df)
## No jobtitle
## 1 1 Junior Data Scientist
## 2 2 Data Analyst (Remote)
## 3 3 Data Scientist
## 4 4 Junior Data Scientist
## 5 5 Analytics Data Engineer
## 6 6 Big Data Developer
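As an aside, the eight nearly identical xpathApply() calls above (and in the chunks that follow) could be collapsed with lapply(); a sketch of the equivalent pattern, assuming the parsed documents are first collected into a list:

# Sketch: iterate over the parsed XML documents instead of repeating the call eight times
xml.list <- list(xml.1, xml.2, xml.3, xml.4, xml.5, xml.6, xml.7, xml.8)
jobtitles <- unlist(lapply(xml.list, function(doc) {
  xpathApply(doc, path = "//response/results/result/jobtitle", fun = xmlValue)
}))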
**Continue to extract other info from the XML files**
# extract the name of companies from the xml files
company.extr.1 <- xpathApply(xml.1, path="//response/results/result/company", fun=xmlValue)
company.extr.2 <- xpathApply(xml.2, path="//response/results/result/company", fun=xmlValue)
company.extr.3 <- xpathApply(xml.3, path="//response/results/result/company", fun=xmlValue)
company.extr.4 <- xpathApply(xml.4, path="//response/results/result/company", fun=xmlValue)
company.extr.5 <- xpathApply(xml.5, path="//response/results/result/company", fun=xmlValue)
company.extr.6 <- xpathApply(xml.6, path="//response/results/result/company", fun=xmlValue)
company.extr.7 <- xpathApply(xml.7, path="//response/results/result/company", fun=xmlValue)
company.extr.8 <- xpathApply(xml.8, path="//response/results/result/company", fun=xmlValue)
companies <- unlist(cbind(company.extr.1,company.extr.2,company.extr.3,company.extr.4,company.extr.5,company.extr.6,company.extr.7,company.extr.8))
companies <- str_replace(companies,'\"',"")
# create a dataframe for name of companies
company.df <- as.data.frame(t(companies))
colnames(company.df) <- seq(length(companies))
company.df <- gather(company.df,"No","company",1:length(companies))
## Warning: attributes are not identical across measure variables; they will
## be dropped
# extract the cities from the xml files
city.extr.1 <- xpathApply(xml.1, path="//response/results/result/city", fun=xmlValue)
city.extr.2 <- xpathApply(xml.2, path="//response/results/result/city", fun=xmlValue)
city.extr.3 <- xpathApply(xml.3, path="//response/results/result/city", fun=xmlValue)
city.extr.4 <- xpathApply(xml.4, path="//response/results/result/city", fun=xmlValue)
city.extr.5 <- xpathApply(xml.5, path="//response/results/result/city", fun=xmlValue)
city.extr.6 <- xpathApply(xml.6, path="//response/results/result/city", fun=xmlValue)
city.extr.7 <- xpathApply(xml.7, path="//response/results/result/city", fun=xmlValue)
city.extr.8 <- xpathApply(xml.8, path="//response/results/result/city", fun=xmlValue)
cities <- unlist(cbind(city.extr.1,city.extr.2,city.extr.3,city.extr.4,city.extr.5,city.extr.6,city.extr.7,city.extr.8))
# create a dataframe for cities
city.df <- as.data.frame(t(cities))
colnames(city.df) <- seq(length(cities))
city.df <- gather(city.df,"No","city",1:length(cities))
## Warning: attributes are not identical across measure variables; they will
## be dropped
# extract the states from the xml files
state.extr.1 <- xpathApply(xml.1, path="//response/results/result/state", fun=xmlValue)
state.extr.2 <- xpathApply(xml.2, path="//response/results/result/state", fun=xmlValue)
state.extr.3 <- xpathApply(xml.3, path="//response/results/result/state", fun=xmlValue)
state.extr.4 <- xpathApply(xml.4, path="//response/results/result/state", fun=xmlValue)
state.extr.5 <- xpathApply(xml.5, path="//response/results/result/state", fun=xmlValue)
state.extr.6 <- xpathApply(xml.6, path="//response/results/result/state", fun=xmlValue)
state.extr.7 <- xpathApply(xml.7, path="//response/results/result/state", fun=xmlValue)
state.extr.8 <- xpathApply(xml.8, path="//response/results/result/state", fun=xmlValue)
states <- unlist(cbind(state.extr.1,state.extr.2,state.extr.3,state.extr.4,state.extr.5,state.extr.6,state.extr.7,state.extr.8))
# create a dataframe for states
state.df <- as.data.frame(t(states))
colnames(state.df) <- seq(length(states))
state.df <- gather(state.df,"No","state",1:length(states))
## Warning: attributes are not identical across measure variables; they will
## be dropped
# extract the dates from the xml files
date.extr.1 <- xpathApply(xml.1, path="//response/results/result/date", fun=xmlValue)
date.extr.2 <- xpathApply(xml.2, path="//response/results/result/date", fun=xmlValue)
date.extr.3 <- xpathApply(xml.3, path="//response/results/result/date", fun=xmlValue)
date.extr.4 <- xpathApply(xml.4, path="//response/results/result/date", fun=xmlValue)
date.extr.5 <- xpathApply(xml.5, path="//response/results/result/date", fun=xmlValue)
date.extr.6 <- xpathApply(xml.6, path="//response/results/result/date", fun=xmlValue)
date.extr.7 <- xpathApply(xml.7, path="//response/results/result/date", fun=xmlValue)
date.extr.8 <- xpathApply(xml.8, path="//response/results/result/date", fun=xmlValue)
dates <- unlist(cbind(date.extr.1,date.extr.2,date.extr.3,date.extr.4,date.extr.5,date.extr.6,date.extr.7,date.extr.8))
# create a dataframe for dates
dates.df <- as.data.frame (t(dates))
colnames(dates.df) <- seq(dates)
dates.df <- gather(dates.df,"No","date",1:length(dates))
## Warning: attributes are not identical across measure variables; they will
## be dropped
# extract the latitude from the xml files
latitude.extr <- xpathApply(xml.1, path="//response/results/result/latitude", fun=xmlValue)
# extract the longtitude from the xml files
longitude.extr <- xpathApply(xml.1, path="//response/results/result/longitude", fun=xmlValue)
# extract the job post links from the xml files
urls1 <- xpathApply(xml.1, path="//response/results/result/url", fun=xmlValue)
urls2 <- xpathApply(xml.2, path="//response/results/result/url", fun=xmlValue)
urls3 <- xpathApply(xml.3, path="//response/results/result/url", fun=xmlValue)
urls4 <- xpathApply(xml.4, path="//response/results/result/url", fun=xmlValue)
urls5 <- xpathApply(xml.5, path="//response/results/result/url", fun=xmlValue)
urls6 <- xpathApply(xml.6, path="//response/results/result/url", fun=xmlValue)
urls7 <- xpathApply(xml.7, path="//response/results/result/url", fun=xmlValue)
urls8 <- xpathApply(xml.8, path="//response/results/result/url", fun=xmlValue)
urls <- unlist(cbind (urls1, urls2, urls3, urls4, urls5, urls6, urls7, urls8))
head(urls)
## [1] "http://www.indeed.com/viewjob?jk=6f70efa6a5094fe6&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gVS3hxLA4lEt5O0z1LMWfYO2atdgjwIFM-1cLvzpgERY&indpubnum=2912078065868574&atk=1bc6pcv2v50h3e0s"
## [2] "http://www.indeed.com/viewjob?jk=c1e5b5dc6d9d2fb4&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gVS3hxLA4lEt5O0z1LMWfYO2atdgjwIFM-1cLvzpgERY&indpubnum=2912078065868574&atk=1bc6pcv2v50h3e0s"
## [3] "http://www.indeed.com/viewjob?jk=8a96b4d5da7a3202&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gVS3hxLA4lEt5O0z1LMWfYO2atdgjwIFM-1cLvzpgERY&indpubnum=2912078065868574&atk=1bc6pcv2v50h3e0s"
## [4] "http://www.indeed.com/viewjob?jk=dab54a15d1621055&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gVS3hxLA4lEt5O0z1LMWfYO2atdgjwIFM-1cLvzpgERY&indpubnum=2912078065868574&atk=1bc6pcv2v50h3e0s"
## [5] "http://www.indeed.com/viewjob?jk=a4d7f685b22cd517&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gVS3hxLA4lEt5O0z1LMWfYO2atdgjwIFM-1cLvzpgERY&indpubnum=2912078065868574&atk=1bc6pcv2v50h3e0s"
## [6] "http://www.indeed.com/viewjob?jk=bcb9b7e45f896cea&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gVS3hxLA4lEt5O0z1LMWfYO2atdgjwIFM-1cLvzpgERY&indpubnum=2912078065868574&atk=1bc6pcv2v50h3e0s"
**Extract the job description from each link**
# use a loop to extract the job description from each link
# for (i in 1:length(urls)){
# html.job <- getURL(urls, .opts=curlOptions(followlocation = TRUE))
# parsed.html.job <- htmlParse(html.job)
# job.extr <- xpathApply(parsed.html.job, "//span[@id='job_summary']", fun=xmlValue)
# }
# Because indeed.com limits access, the for loop above is shown only as comments so that the file can still be knit. The extracted job descriptions were exported as a CSV and are manually loaded back in for the following steps.
job.extr <- fread("https://raw.githubusercontent.com/YunMai-SPS/DA607-homework/master/jobextr.csv", header = T, sep = ',')
# create a dataframe for job descriptions
job.extr <- unlist(job.extr)
job.df <- as.data.frame(t(job.extr))
colnames(job.df) <- seq(job.extr)
job.df <- gather(job.df,"No","job_summary",1:length(job.extr))
## Warning: attributes are not identical across measure variables; they will
## be dropped
# combine all info into one dataframe
indeedjob <- inner_join(dates.df, jobtitle.df, by = "No")
indeedjob <- inner_join(indeedjob,company.df, by = "No")
indeedjob <- inner_join(indeedjob,city.df, by = "No")
indeedjob <- inner_join(indeedjob,state.df, by = "No")
indeedjob <- inner_join(indeedjob,job.df, by = "No")
kable(indeedjob[2,])
| | No | date | jobtitle | company | city | state | job_summary |
|---|---|---|---|---|---|---|---|
| 2 | 2 | Tue, 21 Mar 2017 02:39:39 GMT | Data Analyst (Remote) | First San Francisco Partners | Remote | | BitVoyant produces business and network intelligence, enabling companies to intimately understand their interactions and risks in real-time and at Internet scale. We're growing our team to create, enrich, and mine this intelligence for insights that matter to our customers. |
As a Data Scientist (Big Data), you will apply cutting-edge statistical and mathematical methods to the collection, correlation and analysis of large structured and unstructured data sets from a broad, ever-evolving set of sources. The successful applicant will apply industry knowledge, contextual understanding, and innovative insights into the development of novel analytical techniques and tools in response to urgent, complex questions. The ideal candidate will:
Responsibilities Provide valuable insights into new data sets quickly Mine large amounts of data for insights with both automated and ad-hoc queries Communicate complex findings and ideas in plain language that is friendly to business and operational audiences Problem solve - turn requirements into solutions Collaborate with team members with varying technical and non-technical backgrounds towards a shared goal Rapidly research and iterate on analysis methods by finding elegant, reliable and understandable techniques to build analytics that matter Create technical assessments and develop custom approaches in response to time-critical urgent customer needs Employ sophisticated analytics programs, machine learning, and statistical methods to prepare data for use in predictive and prescriptive modeling Examine new and more convincing methods for data reporting, visualization & presentation Develop new approaches to apply large-scale computing technology to solve customer problems Be involved in all stages of analytic development from idea formation through prototyping, automation, and productizing Maintain up-to-date knowledge of technology standards, industry trends, emerging techniques, and analytic best practices Keep informed of evolution and impact of adversary tactics and strategies Ensure analytical issues are quickly resolved and help implement strategies and solutions to reduce the likelihood of reoccurrence
Requirements - Our ideal candidate will meet most but not necessarily all of the following requirements:
US Citizenship or authorization to work in the US
Possess TS/SCI security clearance
Bachelor's degree or higher in Computer Science, Physics, Mathematics, Statistics or equivalent
Possess solid understanding of Cyber Security terminology, concepts and technology
Possess solid understanding of Internet communication protocol
Must have at least 1 to 2 years demonstrated experience with Hadoop, Elasticsearch NoSQL, HDFS Python JSON data formats, Parquet files Structured and Unstructured data Supervised and Unsupervised Machine Learning techniques
Desired skills Big Data Platforms and tools (e.g. Cloudera, Hortonworks, MapR, Hadoop, Pig, Hive, etc.) Impala, and SQL - for queries NoSQL databases (ex. HBase, MongoDB, ArangoDB, Neo4J) - for behavioral analysis C/C++/C#, Scala, PySpark, Java, R, R Shiny, or other comparable programming language Data formats - ex. JSON, flat files, Parquet, ORC files, Avro Extract-Transform-Load (ETL) processes Statistical data analysis packages (SPSS, R, etc.) Data visualization tools (ex. Tableau, Qlik, IPython, etc.) Comfortable with rapid prototyping to solve immediate problems Super user level expertise in Linux / Unix, Mac, and Windows operating systems. Additional Machine Learning techniques, ex. Auto Encoders, Deep Learning, Hierarchical Clustering, Outlier and Anomaly Detection, etc.
Ability to process, filter, and establish the utility of data through various analytics techniques
Demonstrated creativity, foresight, and mature judgment in answering complex and difficult analytical questions
Strong written and verbal communication skills
# save the job info from indeed.com API as a csv file
write.csv(indeedjob, file = "indeedjob_scrape.csv")
For security purposes, part of the code is not shown. The code looks like this:
indeedjob_utf8 <- read.table(file="tmp.txt", encoding="utf8")
con <- dbConnect(RMySQL::MySQL(), dbname = "datascientist_skill", user = "root", password = "####", port = 3306, client.flag = CLIENT_MULTI_STATEMENTS)
dbWriteTable(con, name = "Indeedjob", value = indeedjob_utf8, overwrite = TRUE)
For security purposes, part of the code is not shown. The code looks like this:
con <- dbConnect(RMySQL::MySQL(), dbname = "DataSkills", user = "admin", password = "####", host = "cuny-607-project-db.ce9yg7qxcc7l.us-east-2.rds.amazonaws.com", port = 3306)
dbWriteTable(con, name = "indeedjob", value = indeedjob_utf8, overwrite = TRUE)
We chose to use the Microsoft Cognitive Services (MCS) Academic Knowledge API to calculate correlation scores between the job titles and the skills we scraped. The MCS Academic Knowledge API calculates the cosine similarity between two strings. Cosine similarity is essentially the cosine of the angle between two non-zero vectors; the API applies this concept to strings, finding similarities between them based on related works of text. Microsoft does not provide much more detail than this, but the results yield some interesting information, which we go into in our analysis section.
The API returns a score between -1 and +1, where higher positive scores indicate stronger correlation between the two strings and negative scores indicate less correlation.
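For intuition, here is a toy cosine-similarity calculation on two small term-frequency vectors (an illustration of the concept only; it is not how MCS computes its scores, which Microsoft does not document in detail):

# Toy example: cosine similarity of two made-up term-frequency vectors
cosine_similarity <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
v1 <- c(data = 2, scientist = 1, python = 1)
v2 <- c(data = 1, scientist = 1, python = 0)
cosine_similarity(v1, v2)  # roughly 0.87: the two "documents" share most of their terms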
We used Axios to perform our requests to the Microsoft Cognitive Services API. The code below sets up an Axios instance to be used later on. The subscription key is left off for confidentiality reasons.
var mcs = axios.create({
baseURL: 'https://westus.api.cognitive.microsoft.com/academic/v1.0',
headers: {'Ocp-Apim-Subscription-Key': '********'}
});
We defined three functions to perform this part of the project.
This is a simple sleep function that waits the allotted number of milliseconds before proceeding to the next bit of code. We needed this because MCS only allows 1000 calls per minute, and we needed to perform roughly 5000 overall. To make sure MCS didn't kick us out for making too many calls, we introduced a small delay in our code using the function below.
The code itself is pretty simple: a Promise is returned whose timer resolves after the given number of milliseconds have passed.
var sleep = (ms) => {
return new Promise(resolve => setTimeout(resolve, ms));
}
The following function gets all the jobs in the database and ensures that there aren't any duplicates by looping through the results and inserting each title into a Set.
var jobs = new Set ();
let getJobs = new Promise ((resolve, reject) => {
db.query('SELECT * FROM DataSkills.indeedjob;', function (error, results, fields) {
if (error) reject(error);
results.map((result) => {
if (!jobs.has(result['jobtitle'].toLowerCase())) jobs.add(result['jobtitle'].toLowerCase());
});
resolve(jobs);
})
});
Similar to the jobs function, the skills function performs the analogous task of reading in the skills table and removing duplicate entries.
var skills = new Set ();
let getSkills = new Promise ((resolve, reject) => {
db.query('SELECT * FROM DataSkills.DataSkillset;', function (error, results, fields) {
if (error) reject(error);
results.map((skill) => {
str = skill['d'].replace(/\s/g, '').toLowerCase();
if (!skills.has(str)) skills.add(str);
})
resolve(results);
});
});
The script below ties together all of the previous pieces and adds the resulting data to the database.
We start by getting the skills from the database, and then the jobs. The .then() calls ensure that each of those steps fully completes before the next step runs. Once we have the skills and jobs, we run a nested loop (via .map(), which is essentially a tidier way of writing a for loop) to get the correlation score for each of the 40 skills against each of the 123 jobs. A small delay is introduced using the sleep() function we defined earlier so that we don't hit the API's rate limit. Once we have the score from MCS, we run a db.query to insert the data into the database.
var errors = new Set(); // collects (skill, job) pairs whose requests fail (used below)
getSkills.then(()=> {
getJobs.then(()=> {
jobs = Array.from (jobs);
skills = Array.from(skills);
jobs.map((job) => {
skills.map((skill) => {
sleep(500).then(() => {
mcs.get('/similarity?s1='+encodeURIComponent(job)+"&s2="+encodeURIComponent(skill))
.then ((resp, err) => {
if (err) {
if (!errors.has([skill, job])) errors.add([skill, job])
}
console.log (job,",", skill, ",", resp.data);
db.query('INSERT INTO `DataSkills`.`Correlation Scores` VALUES (NULL, \'' + job + '\', \''+ skill + '\','+ resp.data + ')', function (error, results, fields) {
if (error) console.log(error);
})
});
});
})
})
})
});
Now that there are correlation scores for the data jobs and their skillsets (both hard and soft), let’s pull the data from MySQL, tidy up the data and perform some data analysis.
Let’s create a connection with the MySQL online database, and download the files into R.
# Creating a connection between R and MySQL
con <- dbConnect(RMySQL::MySQL(), dbname = "DataSkills", user = "admin", password = "CUNYsps2017", host = "cuny-607-project-db.ce9yg7qxcc7l.us-east-2.rds.amazonaws.com")
dbListTables(con)
## [1] "Correlation Scores" "Correlation Scores 2" "DataSkillset"
## [4] "Scores" "indeedjob"
There are five tables in the database “DataSkills”. The tables DataSkillset and indeedjob hold the scraped data: DataSkillset comes from Monster, Kaggle, and Kdnuggets, and indeedjob comes from Indeed. This information was then fed into Microsoft Cognitive Services: the job titles from indeedjob were correlated with the skill sets from DataSkillset to see whether there was any negative or positive correlation. Scores range from -1 (complete negative correlation) to +1 (complete positive correlation). The Scores and Correlation Scores 2 tables are derived later in the analysis.
Let's pull up the DataSkillset and Correlation Scores tables.
# Using queries to download the data sets
correlate.query <- "SELECT * FROM `Correlation Scores`;"
correlate.table <- dbGetQuery(con, correlate.query)
dataskill.query <- "SELECT * FROM DataSkillset;"
dataskill <- dbGetQuery(con, dataskill.query)
# Show the first 10 listings in the data table.
head(correlate.table, 10)
## ID job skill score
## 1 1 data scientist, business strategy - seattle, wa python 0.14926
## 2 2 data integration and reporting analyst agile 0.34817
## 3 3 data scientist, business strategy - seattle, wa leadership 0.40843
## 4 4 data integration and reporting analyst ruby -0.09936
## 5 5 data integration and reporting analyst statistics 0.43853
## 6 6 data scientist, business strategy - seattle, wa ruby 0.04905
## 7 7 data integration and reporting analyst performance 0.34586
## 8 8 data analyst/reporting writer group 0.08052
## 9 9 data integration and reporting analyst modeling 0.31776
## 10 10 data integration and reporting analyst matlab 0.15621
paste("There are", length(correlate.table$ID), "rows in this table.")
## [1] "There are 4372 rows in this table."
This data was scraped to see which soft and/or hard skills came up most frequently in our search of job listings on Monster.com, Kaggle.com, and Kdnuggets.com; it tallies how many times each skill set was mentioned during the scrape.
The data is nearly tidy. However, we will tidy it up a bit further so that there are no duplicates in the data set.
dataskill1 <- dataskill %>% group_by(d) %>% summarise(Count = sum(Sum)) %>% arrange(desc(Count))
colnames(dataskill1) <- c("Skill", "Count")
head(dataskill1, 10)
## # A tibble: 10 × 2
## Skill Count
## <chr> <dbl>
## 1 experience 1240
## 2 business 540
## 3 team 456
## 4 skills 185
## 5 analytical 172
## 6 group 108
## 7 python 104
## 8 algorithms 100
## 9 r 96
## 10 teams 92
Both soft skills and hard skills are very important to the data scientist. We will now take a closer look at each group by subsetting the data and analyzing it.
soft <- c("experience", "business", "team", "analytical", "skills", "group", "understanding", "people", "communication skills", "teams", "solving", "leadership", "interpersonal", "problem solving", "ideas")
hard <- c("python", "r", "sql", "algorithms", "hadoop", "computer science", "bigdata", "statistics", "datamining", "java", "spark", "agile", "mathematics", "apache", "hive", "matlab", "oracle", "ruby", "sas", "scala", "nosql")
# Initially, I tried to subset the data with the soft vector using dplyr, but it returned no data.
# The reason is that there is a space in front of every word in dataskill1$Skill, an artifact of the \\s padding in the regular expressions used for matching.
# The next step removes those spaces with a regular expression.
skills <- dataskill1$Skill
skills <- gsub("\\s", "", skills)
dataskill1$Skill <- skills
# Now, we can subset the soft and hard skill sets
soft.skills <- dataskill1 %>% filter(Skill %in% soft)
hard.skills <- dataskill1 %>% filter(Skill %in% hard)
soft.skills
## # A tibble: 14 × 2
## Skill Count
## <chr> <dbl>
## 1 experience 1240
## 2 business 540
## 3 team 456
## 4 skills 185
## 5 analytical 172
## 6 group 108
## 7 teams 92
## 8 people 89
## 9 understanding 77
## 10 leadership 64
## 11 solving 60
## 12 experience 12
## 13 interpersonal 4
## 14 ideas 2
hard.skills
## # A tibble: 20 × 2
## Skill Count
## <chr> <dbl>
## 1 python 104
## 2 algorithms 100
## 3 r 96
## 4 sql 72
## 5 bigdata 60
## 6 hadoop 52
## 7 datamining 48
## 8 java 48
## 9 agile 44
## 10 statistics 41
## 11 mathematics 36
## 12 spark 29
## 13 apache 12
## 14 hive 12
## 15 ruby 12
## 16 sas 12
## 17 scala 12
## 18 matlab 4
## 19 oracle 4
## 20 nosql 1
ss <- ggplot(soft.skills, aes(x=Skill, y=Count)) + geom_point(aes(color=Skill, size=Count)) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Soft Skills")
ss
hs <- ggplot(hard.skills, aes(x=Skill, y=Count)) + geom_point(aes(color=Skill, size=Count)) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Hard Skills")
hs
Looking at the data, it appears that experience is by far one of the most important qualifications that an employer is looking for. Regarding hard skill sets, R and Python remain very popular.
At first glance, the data appears to contain a lot of interesting information. (For example, there is a very slight negative correlation between Ruby and “data integration and reporting analyst”.) However, to make more sense of this data, let's start by tidying it up.
# Because there is so much variation in the 'jobs' column, we will group by 'skill'.
correlate.skills <- correlate.table %>% group_by(skill) %>% arrange(skill)
head(correlate.skills, 10)
## Source: local data frame [10 x 4]
## Groups: skill [1]
##
## ID job skill score
## <int> <chr> <chr> <dbl>
## 1 2 data integration and reporting analyst agile 0.34817
## 2 18 data scientist, business strategy - seattle, wa agile 0.27510
## 3 70 data reporting analyst agile 0.24886
## 4 93 analytics & insights coordinator agile 0.37262
## 5 119 statistical analyst - data analytics and reporting agile 0.33180
## 6 132 data analyst/reporting writer agile 0.23017
## 7 160 data analyst, performance, analysis & reporting agile 0.27345
## 8 256 data and reporting analyst-magellan mrx agile 0.07033
## 9 282 aco reporting & data analyst agile 0.31328
## 10 311 data scientist - staff agile 0.24106
Now, as you can see, all the jobs are listed together, grouped by skill. For agile, there tends to be an overall positive correlation, meaning that the agile methodology is likely to be associated with roles such as data integration and reporting analyst, statistical analyst, and so on.
Let’s take this data frame and isolate just the agile group.
correlate.agile <- correlate.table %>% group_by(skill) %>% arrange(skill) %>% filter(skill == "agile")
correlate.agile
## Source: local data frame [111 x 4]
## Groups: skill [1]
##
## ID job skill score
## <int> <chr> <chr> <dbl>
## 1 2 data integration and reporting analyst agile 0.34817
## 2 18 data scientist, business strategy - seattle, wa agile 0.27510
## 3 70 data reporting analyst agile 0.24886
## 4 93 analytics & insights coordinator agile 0.37262
## 5 119 statistical analyst - data analytics and reporting agile 0.33180
## 6 132 data analyst/reporting writer agile 0.23017
## 7 160 data analyst, performance, analysis & reporting agile 0.27345
## 8 256 data and reporting analyst-magellan mrx agile 0.07033
## 9 282 aco reporting & data analyst agile 0.31328
## 10 311 data scientist - staff agile 0.24106
## # ... with 101 more rows
There are 111 data points after tidying and filtering the data. With the psych package, we can take this opportunity to find basic statistics about this subset of data, i.e. mean, median, standard deviation, etc. We'll also take this opportunity to display the distribution graphically with a boxplot.
describe(correlate.agile$score)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 111 0.31 0.11 0.32 0.32 0.12 0.05 0.5 0.45 -0.39 -0.77
## se
## X1 0.01
boxplot(correlate.agile$score, xlab = "Agile", ylab = "Correlation Score", main = "Correlation Score for Agile Skillset", col = "lightgreen")
So the mean of this subset is 0.31 and the median is 0.32, with data points ranging from 0.05 to 0.50. According to the boxplot, there is one outlier in this dataset. While not the strongest positive correlation, there is some positive correlation noted here.
Now, let's compare multiple skills against each other and see how well they correlate. We take all of the different skill sets and compare them to each other graphically to get a better sense of how important these skill sets are.
mult.skills <- ggplot(data = correlate.skills, aes(x=skill, y=score)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Correlation Score for Skill Sets for Data Scientists/Analysts", x = "Skill Sets", y = "Correlation Score")
mult.skills
# Now with some color!
mult.skills1 <- ggplot(data = correlate.skills, aes(x=skill, y=score)) + geom_boxplot(aes(fill = skill)) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Correlation Score for Skill Sets for Data Scientists/Analysts", x = "Skill Sets", y = "Correlation Score")
mult.skills1
# This time, we will use a different visualization to look at the data.
mult.skills2 <- ggplot(data = correlate.skills, aes(x=skill, y=score)) + geom_bin2d() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Correlation Score for Skill Sets for Data Scientists/Analysts", x = "Skill Sets", y = "Correlation Score")
mult.skills2
Interestingly, R and Ruby had negative correlations. I am not sure how to make sense of this; it may be due to the way we collected the information and performed the correlation, since in the first data set analysis R and Python were the two most popular tools for data scientists.
Most of the other data skills, such as understanding, team, teams, and people, had moderately to highly positive correlation scores.
For this analysis, after researching various ggplot2 options for representing the correlation data, the following plot was chosen because it seemed to give the best picture of the data.
The forest plots make it easier to see exactly what the data show: the y axis lists the job-skill pairs for data scientist job titles, and the x axis shows the correlation scores.
# Using mysql
# 1. a new table was created for the min and max scores of each skill and called "scores". Code is below.
#Create table Scores
#select skill, min(score) AS Min, max(score) AS Max from `Correlation Scores` group by skill;
#Create table `Correlation Scores 2`
#select ID, job, `Correlation Scores`.skill, score, Min, Max from `Correlation Scores`
#Left join Scores
#On `Correlation Scores`.skill = Scores.skill;
# 2. Another table was created that joined the "Scores" table with the original "Correlation Scores" table; this table was named "Correlation Scores 2".
# Using the RMySQL dbGetQuery function, a query (subset) of jobs titled "data scientist" with selected hard skills was imported from the database into an R object.
# The job and skill columns were combined into one "Job-Skill" column, and the score, Min, and Max values were rounded to 2 decimal places for the aesthetics of the visual.
data1<- as.data.frame(dbGetQuery(con,
"select ID, concat(job, ' ', '-', ' ', skill) AS `Job-Skill`, round(score,2) AS 'score', round(Min,2) AS 'Min', round(Max,2) AS 'Max'
from `Correlation Scores 2`
where job like 'data scientist' and
skill in('sql', 'hadoop', 'python', 'java', 'r', 'hive', 'mapreduce', 'nosql', 'pig', 'sas', 'c', 'oracle', 'teradata', 'spss', 'matlab', 'perl', 'mysql', 'postgresql', 'ruby', 'html', 'stata',
'agile','spark', 'java','apache', 'yarn', 'scala', 'mongodb', 'cassandra', 'hbase', 'c++', 'c#', 'data mining', 'statistics', 'modeling','computer science', 'mathematics', 'software development')"))
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 3 imported as
## numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 4 imported as
## numeric
# reverses the factor level ordering for labels after coord_flip()
data1$`Job-Skill` <- factor(data1$`Job-Skill`, levels=rev(data1$`Job-Skill`))
fp1 <- ggplot(data=data1, aes(x=`Job-Skill`, y=score, ymin=Min, ymax=Max)) +
  geom_pointrange(fill='green', color='green', shape=21, lwd=1, size=2) +
  geom_text(aes(label=score), hjust=0.5, vjust=2, size=3) + # adds the score labels
  geom_hline(yintercept=0, lty=2) + # add a dotted line at score = 0 (vertical after the flip)
  coord_flip() + # flip coordinates (puts the Job-Skill labels on the y axis)
  xlab("Job-Skill") + ylab("Correlation Score") +
  theme_grey() # use the default grey theme
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`size`)
print(fp1)
From the plot above we see that mathematics and statistics have the highest correlation scores. Among the software tools, Oracle has the highest correlation and, surprisingly, R has a negative correlation.
data2<- as.data.frame(dbGetQuery(con,
"select ID, concat(job, ' ', '-', ' ', skill) AS `Job-Skill`, round(score,2) AS 'score', round(Min,2) AS 'Min', round(Max,2) AS 'Max'
from `Correlation Scores 2`
where job like 'data scientist' and
skill in('experience', 'business', 'team', 'skills', 'analytical', 'problem solving', 'understanding', 'algorithms', 'performance', 'big data', 'communication skills', 'solving', 'people', 'distributed', 'group', 'ideas', 'interpersonal',
'leadership')"))
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 3 imported as
## numeric
## Warning in .local(conn, statement, ...): Decimal MySQL column 4 imported as
## numeric
# reverses the factor level ordering for labels after coord_flip()
data2$`Job-Skill` <- factor(data2$`Job-Skill`, levels=rev(data2$`Job-Skill`))
fp2 <- ggplot(data=data2, aes(x=`Job-Skill`, y=score, ymin=Min, ymax=Max)) +
  geom_pointrange(fill='green', color='green', shape=21, lwd=1, size=2) +
  geom_text(aes(label=score), hjust=0.5, vjust=2, size=3) + # adds the score labels
  geom_hline(yintercept=0, lty=2) + # add a dotted line at score = 0 (vertical after the flip)
  coord_flip() + # flip coordinates (puts the Job-Skill labels on the y axis)
  xlab("Job-Skill") + ylab("Correlation Score") +
  theme_grey() # use the default grey theme
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`size`)
print(fp2)
The plot above shows that soft skills are just as important as hard skills. We see that ideas (creativity) has the highest correlation, with team skills and experience also showing high correlations.