You will have to determine what data to collect, where the data can be found, and how to load it. We decided to collect information on Data Science job posts from employment-related search engines such as indeed.com by scraping the job descriptions; text mining can then be done on that data. The posting time, location, and employer are also collected, since these can serve as indexes or variables in the analysis down the road.
The data that you decide to collect should reside in a relational database, in a set of normalized tables. The selected job information will be stored in a local MySQL database and in a remote MySQL database shared by the project group.
You should perform any needed tidying, transformations, and exploratory data analysis in R. Tidying is done at this step; data transformation and exploratory analysis will be done in part 2, the text-mining step.
library(RCurl)
## Loading required package: bitops
library(XML)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
library(stringr)
## Warning: package 'stringr' was built under R version 3.3.3
library(knitr)
library(RMySQL)
## Loading required package: DBI
** Collect info on Data Science job posts from indeed.com **
Start by getting XML files from the Application Programming Interface (API) for indeed.com.
For security reasons, part of the code is not shown (the publisher key is masked). The code looks like this:
indeedapi.1 <- getURL("http://api.indeed.com/ads/apisearch?publisher=#####&format=xml&q=data+analytics&l=&sort=relevance&radius&st=&jt=fulltime&start=0&limit=25&fromage=&filter=&latlong=1&co=us&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2", .opts=curlOptions(followlocation = TRUE))
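The remaining seven requests (indeedapi.2 through indeedapi.8) are hidden the same way. A minimal sketch of how they could be produced, assuming the hidden calls differ only in the start parameter (paging through 8 x 25 = 200 results):
# hypothetical paging loop; publisher=##### stands in for the masked key
base.url <- paste0("http://api.indeed.com/ads/apisearch?publisher=#####",
                   "&format=xml&q=data+analytics&l=&sort=relevance&jt=fulltime",
                   "&limit=25&latlong=1&co=us&userip=1.2.3.4",
                   "&useragent=Mozilla/%2F4.0%28Firefox%29&v=2")
indeedapi <- lapply(seq(0, 175, by = 25), function(s)
  getURL(paste0(base.url, "&start=", s),
         .opts = curlOptions(followlocation = TRUE)))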
# parse xml files
xml.1 <- xmlParse(indeedapi.1)
xml.2 <- xmlParse(indeedapi.2)
xml.3 <- xmlParse(indeedapi.3)
xml.4 <- xmlParse(indeedapi.4)
xml.5 <- xmlParse(indeedapi.5)
xml.6 <- xmlParse(indeedapi.6)
xml.7 <- xmlParse(indeedapi.7)
xml.8 <- xmlParse(indeedapi.8)
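All eight pages are parsed the same way; if the responses were collected into a list as sketched above, a hypothetical one-line equivalent would be:
# xmls <- lapply(indeedapi, xmlParse)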
# extract the job titles from the xml files
jobtitle.extr.1 <- xpathApply(xml.1, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.2 <- xpathApply(xml.2, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.3 <- xpathApply(xml.3, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.4 <- xpathApply(xml.4, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.5 <- xpathApply(xml.5, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.6 <- xpathApply(xml.6, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.7 <- xpathApply(xml.7, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitle.extr.8 <- xpathApply(xml.8, path="//response/results/result/jobtitle", fun=xmlValue)
jobtitles <- unlist(cbind(jobtitle.extr.1,jobtitle.extr.2,jobtitle.extr.3,jobtitle.extr.4,jobtitle.extr.5,jobtitle.extr.6,jobtitle.extr.7,jobtitle.extr.8))
# create a dataframe for job title
jobtitle.df <- as.data.frame(t(jobtitles))
jobtitle.df <- gather(jobtitle.df,"No","jobtitle",1:200)
## Warning: attributes are not identical across measure variables; they will
## be dropped
jobtitle.df$No <- str_replace(jobtitle.df$No, "V","")
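As a hedged aside (not part of the original pipeline), the transpose-and-gather idiom can be avoided by building the same two-column data frame directly, which also sidesteps the attribute warning:
# hypothetical direct construction of the same data frame
jobtitle.df <- data.frame(No = as.character(seq_along(jobtitles)),
                          jobtitle = jobtitles,
                          stringsAsFactors = FALSE)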
str(jobtitle.df)
## 'data.frame': 200 obs. of 2 variables:
## $ No : chr "1" "2" "3" "4" ...
## $ jobtitle: chr "Junior Data Scientist" "Data Analyst (Remote)" "Data Scientist" "Junior Data Scientist" ...
head(jobtitle.df)
## No jobtitle
## 1 1 Junior Data Scientist
## 2 2 Data Analyst (Remote)
## 3 3 Data Scientist
## 4 4 Junior Data Scientist
## 5 5 Analytics Data Engineer
## 6 6 Big Data Developer
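The same extract, transpose, and gather pattern repeats below for company, city, state, and date. A hedged helper (hypothetical, not used in the original) could collapse each of those blocks to a single call:
extract_field <- function(xml.list, field) {
  # pull one field from every parsed page and number the results
  xpath <- paste0("//response/results/result/", field)
  values <- unlist(lapply(xml.list, xpathApply, path = xpath, fun = xmlValue))
  df <- data.frame(No = as.character(seq_along(values)), values,
                   stringsAsFactors = FALSE)
  names(df)[2] <- field
  df
}
# e.g. company.df <- extract_field(list(xml.1, xml.2, xml.3, xml.4,
#                                       xml.5, xml.6, xml.7, xml.8), "company")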
** Continue to extract other info from the XML files **
# extract the name of companies from the xml files
company.extr.1 <- xpathApply(xml.1, path="//response/results/result/company", fun=xmlValue)
company.extr.2 <- xpathApply(xml.2, path="//response/results/result/company", fun=xmlValue)
company.extr.3 <- xpathApply(xml.3, path="//response/results/result/company", fun=xmlValue)
company.extr.4 <- xpathApply(xml.4, path="//response/results/result/company", fun=xmlValue)
company.extr.5 <- xpathApply(xml.5, path="//response/results/result/company", fun=xmlValue)
company.extr.6 <- xpathApply(xml.6, path="//response/results/result/company", fun=xmlValue)
company.extr.7 <- xpathApply(xml.7, path="//response/results/result/company", fun=xmlValue)
company.extr.8 <- xpathApply(xml.8, path="//response/results/result/company", fun=xmlValue)
companies <- unlist(cbind(company.extr.1,company.extr.2,company.extr.3,company.extr.4,company.extr.5,company.extr.6,company.extr.7,company.extr.8))
companies <- str_replace(companies, '\"', "")
# create a dataframe for name of companies
company.df <- as.data.frame(t(companies))
colnames(company.df) <- seq(length(companies))
company.df <- gather(company.df,"No","company",1:length(companies))
## Warning: attributes are not identical across measure variables; they will
## be dropped
# extract the cities from the xml files
city.extr.1 <- xpathApply(xml.1, path="//response/results/result/city", fun=xmlValue)
city.extr.2 <- xpathApply(xml.2, path="//response/results/result/city", fun=xmlValue)
city.extr.3 <- xpathApply(xml.3, path="//response/results/result/city", fun=xmlValue)
city.extr.4 <- xpathApply(xml.4, path="//response/results/result/city", fun=xmlValue)
city.extr.5 <- xpathApply(xml.5, path="//response/results/result/city", fun=xmlValue)
city.extr.6 <- xpathApply(xml.6, path="//response/results/result/city", fun=xmlValue)
city.extr.7 <- xpathApply(xml.7, path="//response/results/result/city", fun=xmlValue)
city.extr.8 <- xpathApply(xml.8, path="//response/results/result/city", fun=xmlValue)
cities <- unlist(cbind(city.extr.1,city.extr.2,city.extr.3,city.extr.4,city.extr.5,city.extr.6,city.extr.7,city.extr.8))
# create a dataframe for cities
city.df <- as.data.frame(t(cities))
colnames(city.df) <- seq(length(cities))
city.df <- gather(city.df,"No","city",1:length(cities))
## Warning: attributes are not identical across measure variables; they will
## be dropped
# extract the states from the xml files
state.extr.1 <- xpathApply(xml.1, path="//response/results/result/state", fun=xmlValue)
state.extr.2 <- xpathApply(xml.2, path="//response/results/result/state", fun=xmlValue)
state.extr.3 <- xpathApply(xml.3, path="//response/results/result/state", fun=xmlValue)
state.extr.4 <- xpathApply(xml.4, path="//response/results/result/state", fun=xmlValue)
state.extr.5 <- xpathApply(xml.5, path="//response/results/result/state", fun=xmlValue)
state.extr.6 <- xpathApply(xml.6, path="//response/results/result/state", fun=xmlValue)
state.extr.7 <- xpathApply(xml.7, path="//response/results/result/state", fun=xmlValue)
state.extr.8 <- xpathApply(xml.8, path="//response/results/result/state", fun=xmlValue)
states <- unlist(cbind(state.extr.1,state.extr.2,state.extr.3,state.extr.4,state.extr.5,state.extr.6,state.extr.7,state.extr.8))
# create a dataframe for states
state.df <- as.data.frame(t(states))
colnames(state.df) <- seq(length(states))
state.df <- gather(state.df,"No","state",1:length(states))
## Warning: attributes are not identical across measure variables; they will
## be dropped
# extract the dates from the xml files
date.extr.1 <- xpathApply(xml.1, path="//response/results/result/date", fun=xmlValue)
date.extr.2 <- xpathApply(xml.2, path="//response/results/result/date", fun=xmlValue)
date.extr.3 <- xpathApply(xml.3, path="//response/results/result/date", fun=xmlValue)
date.extr.4 <- xpathApply(xml.4, path="//response/results/result/date", fun=xmlValue)
date.extr.5 <- xpathApply(xml.5, path="//response/results/result/date", fun=xmlValue)
date.extr.6 <- xpathApply(xml.6, path="//response/results/result/date", fun=xmlValue)
date.extr.7 <- xpathApply(xml.7, path="//response/results/result/date", fun=xmlValue)
date.extr.8 <- xpathApply(xml.8, path="//response/results/result/date", fun=xmlValue)
dates <- unlist(cbind(date.extr.1,date.extr.2,date.extr.3,date.extr.4,date.extr.5,date.extr.6,date.extr.7,date.extr.8))
# create a dataframe for dates
dates.df <- as.data.frame(t(dates))
colnames(dates.df) <- seq(length(dates))
dates.df <- gather(dates.df,"No","date",1:length(dates))
## Warning: attributes are not identical across measure variables; they will
## be dropped
# extract the latitude from the xml files
latitude.extr <- xpathApply(xml.1, path="//response/results/result/latitude", fun=xmlValue)
# extract the longtitude from the xml files
longitude.extr <- xpathApply(xml.1, path="//response/results/result/longitude", fun=xmlValue)
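# note: latitude and longitude are pulled from the first page (xml.1) only;
# the remaining seven pages would need the same treatment as the fields above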
# extract the job post links from the xml files
urls1 <- xpathApply(xml.1, path="//response/results/result/url", fun=xmlValue)
urls2 <- xpathApply(xml.2, path="//response/results/result/url", fun=xmlValue)
urls3 <- xpathApply(xml.3, path="//response/results/result/url", fun=xmlValue)
urls4 <- xpathApply(xml.4, path="//response/results/result/url", fun=xmlValue)
urls5 <- xpathApply(xml.5, path="//response/results/result/url", fun=xmlValue)
urls6 <- xpathApply(xml.6, path="//response/results/result/url", fun=xmlValue)
urls7 <- xpathApply(xml.7, path="//response/results/result/url", fun=xmlValue)
urls8 <- xpathApply(xml.8, path="//response/results/result/url", fun=xmlValue)
urls <- unlist(cbind (urls1, urls2, urls3, urls4, urls5, urls6, urls7, urls8))
head(urls)
## [1] "http://www.indeed.com/viewjob?jk=6f70efa6a5094fe6&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gTUfMSer-dilirEORRfTUiIAPEYcCt81Hyg31C60-sSI&indpubnum=2912078065868574&atk=1bc6m7rms53p1erp"
## [2] "http://www.indeed.com/viewjob?jk=c1e5b5dc6d9d2fb4&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gTUfMSer-dilirEORRfTUiIAPEYcCt81Hyg31C60-sSI&indpubnum=2912078065868574&atk=1bc6m7rms53p1erp"
## [3] "http://www.indeed.com/viewjob?jk=8a96b4d5da7a3202&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gTUfMSer-dilirEORRfTUiIAPEYcCt81Hyg31C60-sSI&indpubnum=2912078065868574&atk=1bc6m7rms53p1erp"
## [4] "http://www.indeed.com/viewjob?jk=dab54a15d1621055&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gTUfMSer-dilirEORRfTUiIAPEYcCt81Hyg31C60-sSI&indpubnum=2912078065868574&atk=1bc6m7rms53p1erp"
## [5] "http://www.indeed.com/viewjob?jk=a4d7f685b22cd517&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gTUfMSer-dilirEORRfTUiIAPEYcCt81Hyg31C60-sSI&indpubnum=2912078065868574&atk=1bc6m7rms53p1erp"
## [6] "http://www.indeed.com/viewjob?jk=bcb9b7e45f896cea&qd=xOkpjcvVY_jvLsMZYWTnRwupS8G_KlPWeBQfvaqeTgen-DcKOYwx1lF5XAicJT0gTUfMSer-dilirEORRfTUiIAPEYcCt81Hyg31C60-sSI&indpubnum=2912078065868574&atk=1bc6m7rms53p1erp"
** Extract the job description from each link **
# use a loop to extract the job description from each link
# job.extr <- vector("list", length(urls))
# for (i in seq_along(urls)){
#   html.job <- getURL(urls[i], .opts=curlOptions(followlocation = TRUE))
#   parsed.html.job <- htmlParse(html.job)
#   job.extr[[i]] <- xpathApply(parsed.html.job, "//span[@id='job_summary']", fun=xmlValue)
# }
# Because indeed.com limits access, the loop above is left commented out so that the file can still be knit. The scraped job descriptions were exported as a vector and are manually loaded back in for the following steps.
library(data.table)
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
job.extr <- fread("https://raw.githubusercontent.com/YunMai-SPS/DA607-homework/master/jobextr.csv", header = T, sep = ',')
# create a dataframe for job descriptions
job.extr <- unlist(job.extr)
job.df <- as.data.frame(t(job.extr))
colnames(job.df) <- seq(length(job.extr))
job.df <- gather(job.df,"No","job_summary",1:length(job.extr))
## Warning: attributes are not identical across measure variables; they will
## be dropped
# combine all info into one data frame
indeedjob <- inner_join(dates.df, jobtitle.df, by = "No")
indeedjob <- inner_join(indeedjob,company.df, by = "No")
indeedjob <- inner_join(indeedjob,city.df, by = "No")
indeedjob <- inner_join(indeedjob,state.df, by = "No")
indeedjob <- inner_join(indeedjob,job.df, by = "No")
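The chained joins above could be collapsed into one step; a hedged equivalent (not in the original):
# indeedjob <- Reduce(function(x, y) inner_join(x, y, by = "No"),
#                     list(dates.df, jobtitle.df, company.df,
#                          city.df, state.df, job.df))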
kable(indeedjob[2,])
|   | No | date | jobtitle | company | city | state | job_summary |
|---|----|------|----------|---------|------|-------|-------------|
| 2 | 2 | Tue, 21 Mar 2017 02:39:39 GMT | Data Analyst (Remote) | First San Francisco Partners | Remote |  | (full text below) |

The job_summary cell of this row reads:

BitVoyant produces business and network intelligence, enabling companies to intimately understand their interactions and risks in real-time and at Internet scale. We're growing our team to create, enrich, and mine this intelligence for insights that matter to our customers.

As a Data Scientist (Big Data), you will apply cutting-edge statistical and mathematical methods to the collection, correlation and analysis of large structured and unstructured data sets from a broad, ever-evolving set of sources. The successful applicant will apply industry knowledge, contextual understanding, and innovative insights into the development of novel analytical techniques and tools in response to urgent, complex questions. The ideal candidate will:

Responsibilities

- Provide valuable insights into new data sets quickly
- Mine large amounts of data for insights with both automated and ad-hoc queries
- Communicate complex findings and ideas in plain language that is friendly to business and operational audiences
- Problem solve – turn requirements into solutions
- Collaborate with team members with varying technical and non-technical backgrounds towards a shared goal
- Rapidly research and iterate on analysis methods by finding elegant, reliable and understandable techniques to build analytics that matter
- Create technical assessments and develop custom approaches in response to time-critical urgent customer needs
- Employ sophisticated analytics programs, machine learning, and statistical methods to prepare data for use in predictive and prescriptive modeling
- Examine new and more convincing methods for data reporting, visualization & presentation
- Develop new approaches to apply large-scale computing technology to solve customer problems
- Be involved in all stages of analytic development from idea formation through prototyping, automation, and productizing
- Maintain up-to-date knowledge of technology standards, industry trends, emerging techniques, and analytic best practices
- Keep informed of evolution and impact of adversary tactics and strategies
- Ensure analytical issues are quickly resolved and help implement strategies and solutions to reduce the likelihood of reoccurrence

Requirements - Our ideal candidate will meet most but not necessarily all of the following requirements:

- US Citizenship or authorization to work in the US
- Possess TS/SCI security clearance
- Bachelor's degree or higher in Computer Science, Physics, Mathematics, Statistics or equivalent
- Possess solid understanding of Cyber Security terminology, concepts and technology
- Possess solid understanding of Internet communication protocol
- Must have at least 1 to 2 years demonstrated experience with Hadoop, Elasticsearch, NoSQL, HDFS, Python, JSON data formats, Parquet files, structured and unstructured data, supervised and unsupervised machine learning techniques

Desired skills

- Big Data Platforms and tools (e.g. Cloudera, Hortonworks, MapR, Hadoop, Pig, Hive, etc.), Impala, and SQL – for queries
- NoSQL databases (ex. HBase, MongoDB, ArangoDB, Neo4J) – for behavioral analysis
- C/C++/C#, Scala, PySpark, Java, R, R Shiny, or other comparable programming language
- Data formats – ex. JSON, flat files, Parquet, ORC files, Avro
- Extract-Transform-Load (ETL) processes
- Statistical data analysis packages (SPSS, R, etc.)
- Data visualization tools (ex. Tableau, Qlik, IPython, etc.)
- Comfortable with rapid prototyping to solve immediate problems
- Super user level expertise in Linux / Unix, Mac, and Windows operating systems
- Additional Machine Learning techniques, ex. Auto Encoders, Deep Learning, Hierarchical Clustering, Outlier and Anomaly Detection, etc.
- Ability to process, filter, and establish the utility of data through various analytics techniques
- Demonstrated creativity, foresight, and mature judgment in answering complex and difficult analytical questions
- Strong written and verbal communication skills
# save the job info from the indeed.com API as a csv file
write.csv(indeedjob, file = "indeedjob_scrape.csv")
The code chunk is not shown for security reasons. The code is as follows:
indeedjob_utf8 <- read.table(file="tmp.txt", encoding="utf8")
con <- dbConnect(RMySQL::MySQL(), dbname = "datascientist_skill", user = "root", password = "####", port = 3306, client.flag = CLIENT_MULTI_STATEMENTS)
dbWriteTable(con, name = "Indeedjob", value = indeedjob_utf8, overwrite = TRUE)
## [1] TRUE
The code chunk is not shown for security reasons. The code is as follows:
con <- dbConnect(RMySQL::MySQL(), dbname = "DataSkills", user = "admin", password = "####", host = "cuny-607-project-db.ce9yg7qxcc7l.us-east-2.rds.amazonaws.com", port = 3306)
dbWriteTable(con, name = "indeedjob", value = indeedjob_utf8, overwrite = TRUE)
## [1] TRUE
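A hedged housekeeping step (not shown in the original) would be to confirm the table exists and release each connection when done:
# stopifnot(dbExistsTable(con, "indeedjob"))
# dbDisconnect(con)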