Data607_web_scraping

We use data from indeed to build a file with many job postings. Each step is turned on and off separately so we don’t pull too much data from indeed.

Instructions for use:
1.Find all tags starting with `"C:/Users/dawig/Desktop`" and change to fit your local file location desired.
2.Turn on Selector section. For each group of 10 jobs, select a start for first number to pull(find ‘start=’) Change file10 to desired file number.
3. For each file created in step 1, you can run the grabber and processor. Change file10 on line 3 of the grabber for each file you made in step 2.

Selector for 10 postings

#Get a page with 10 results of search for data scientist listing
data<-unlist(readLines('https://www.indeed.com/jobs?q=data+scientist&l=White+Plains%2C+NY&start=295'))
write(data,file = "C:/Users/dawig/Desktop/web_scrapings/file10")

Grabber for 10 postings

#Use listing of 10 to pull jk codes that indeed uses to fetch jobs.
library(stringr)
#Instead of automating this step, we are choosing files with 10 tags individually--to not take too much from indeed.
new.data<- readLines("C:/Users/dawig/Desktop/web_scrapings/file10")
first.job<-grep("jobmap",new.data)
first.job
new.data[first.job[2]]
job.code<-matrix(nrow=10)
for (i in 2:11){
job.code[i-1]<-unlist(str_extract(new.data[first.job[i]],"[:alnum:]{16}"))
               }
#Use job codes to pull data files with individual jobs
for (i in 1:10)     {
temp.code<-job.code[i]
url.single.job.posting<-paste("https://www.indeed.com/viewjob?jk=",temp.code,collapse="")
url.single.job.posting<-gsub(" ", "", url.single.job.posting, fixed = TRUE)
url.single.job.posting
single.job.posting<-readLines(url.single.job.posting)
write.to.file<-"C:/Users/dawig/Desktop/web_scrapings/jobfile"
job.file.writer<-paste(write.to.file,i)
job.file.writer<-gsub(" ", "", job.file.writer, fixed = TRUE)
write(single.job.posting,file = job.file.writer)
                     }

Processor for 10 postings-appends to end of cleanjobfiles

#Get names for job files to pull
job.file.opener<-"C:/Users/dawig/Desktop/web_scrapings/jobfile"
job.file.matrix<-matrix(nrow=10)
for(i in 1:10){
  job.file.matrix[i]<-paste(job.file.opener,i)
              }
job.file.matrix<-gsub(" ", "",job.file.matrix, fixed = TRUE)

#Pull job files, clean them, append them to a single file
for(i in 1:10)     {
look.at.job.file<- readLines(job.file.matrix[i])
job.beginning<-grep("job_summary",look.at.job.file)
job.ending<-grep("result-link-bar", look.at.job.file)
small.job.posting<-(look.at.job.file[job.beginning[1]:job.ending[1]])
write(small.job.posting,file = "C:/Users/dawig/Desktop/web_scrapings/cleanjobfiles",append=TRUE)
                    }

Jobs are presented in a single file. Each is tagged with a beginning and ending of the style: `<span id="job_summary" class="summary"><div><b>Data Engineer</b></div><div>`
`<div class="result-link-bar-container result-link-bar-viewjob">`

Data607_web_scraping

Dan Wigodsky

March 21, 2018

We use data from indeed to build a file with many job postings. Each step is turned on and off separately so we don’t pull too much data from indeed.

Selector for 10 postings

Grabber for 10 postings

Processor for 10 postings-appends to end of cleanjobfiles

Jobs are presented in a single file. Each is tagged with a beginning and ending of the style: `<span id="job_summary" class="summary"><div><b>Data Engineer</b></div><div>`
`<div class="result-link-bar-container result-link-bar-viewjob">`

Data607_web_scraping

Dan Wigodsky

March 21, 2018

We use data from indeed to build a file with many job postings. Each step is turned on and off separately so we don’t pull too much data from indeed.

Selector for 10 postings

Grabber for 10 postings

Processor for 10 postings-appends to end of cleanjobfiles

Jobs are presented in a single file. Each is tagged with a beginning and ending of the style: <span id="job_summary" class="summary"><div><b>Data Engineer</b></div><div> <div class="result-link-bar-container result-link-bar-viewjob">

Jobs are presented in a single file. Each is tagged with a beginning and ending of the style: `<span id="job_summary" class="summary"><div><b>Data Engineer</b></div><div>`
`<div class="result-link-bar-container result-link-bar-viewjob">`