For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks/questions to be completed/answered are highlighted in larger bold font and numbered according to their section.
Web scraping, also called web harvesting or web data extraction, is data scraping used for extracting data from websites.
The technique mostly focuses on transforming unstructured data on the web (HTML format) into structured data (a database or spreadsheet).
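To make that transformation concrete, here is a minimal, self-contained sketch (not part of the assignment; the HTML snippet and CSS selectors are made up for illustration) that parses two hard-coded job listings into a tibble with rvest:
library(rvest)
library(tibble)
# Hypothetical two-listing page standing in for real scraped HTML
snippet <- read_html('<div class="job"><span class="title">Analyst</span><span class="city">Detroit</span></div>
<div class="job"><span class="title">Engineer</span><span class="city">Lansing</span></div>')
# Pull the text of each CSS-selected node into a character vector,
# then line the vectors up as columns of a tibble
tibble(title = snippet %>% html_nodes(".job .title") %>% html_text(),
       city  = snippet %>% html_nodes(".job .city") %>% html_text())
# -> a 2 x 2 tibble with columns title and city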
A fictitious London-based training company, WeTrainYou, wants to open a local training facility in California. It is looking for a city where ample Salesforce* development jobs are available.
Its goal is to train engineers and place them in full-time and part-time jobs. WeTrainYou has hired you to determine where it should set up the business.
Case Study: https://software.intel.com/en-us/articles/using-visualization-to-tell-a-compelling-data-story
We are going to use the tidyverse, a collection of R packages designed for data science.
## Loading required package: tidyverse
## -- Attaching packages ----------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts -------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: rvest
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
# This function creates a URL for a job search on www.dice.com
create_url <- function(website, title, location, radius, page){
  url <- paste0(website, "?q=", title, "&l=", location, "&radius=", radius)
  url <- paste0(url, "&startPage=", page, "&jobs")
  return(url)
}
# This function uses the unstructured data from the HTML page to create a data frame
# containing only the data needed for analysis
create_tibble <- function(html){
  # Job titles: the leading "." keeps the XPath relative to each result card,
  # so we do not accidentally match anchors elsewhere in the document
  search_title <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes(xpath = ".//a/@title") %>%
    html_text()
  # Two-letter state for each posting
  search_region <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=addressRegion]") %>%
    html_text()
  # ZIP code for each posting
  search_zipcode <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=postalCode]") %>%
    html_text()
  # City: strip the trailing ", <state>" so only the city name remains
  search_address <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=streetAddress]") %>%
    html_text() %>%
    str_replace(pattern = paste0(", ", search_region), "")
  # Hiring company name
  search_company <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="hiringOrganization"]') %>%
    html_nodes("[itemprop=name]") %>%
    html_text()
  df <- tibble(title   = search_title,
               company = search_company,
               city    = search_address,
               state   = search_region,
               zipcode = search_zipcode)
  return(df)
}
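Because tibble() requires all columns to have the same length, a quick sanity check (my own suggestion, not part of the assignment) is to confirm the selectors return matching counts on a single page before looping; check_lengths below is a hypothetical helper:
# All counts must be equal, or tibble() will raise a column-length error
check_lengths <- function(html){
  cards <- html_nodes(html, ".complete-serp-result-div")
  c(titles   = length(html_nodes(cards, xpath = ".//a/@title")),
    regions  = length(html_nodes(cards, "[itemprop=addressRegion]")),
    zipcodes = length(html_nodes(cards, "[itemprop=postalCode]")))
}
# Usage: check_lengths(read_html(url))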
site = "https://www.dice.com/jobs"
job = "Data+Analyst"
region = "MI"
miles = 30
pag = 1
#url <- create_url(website = site, title = JOB_TITLE, location = STATE_TWO_LETTERS, radius = NUM_MILES, page = PAG_NUM)
url <- create_url(website = site, title = job, location = region, radius = miles, page = pag)
#url
url
## [1] "https://www.dice.com/jobs?q=Data+Analyst&l=MI&radius=30&startPage=1&jobs"
knitr::include_graphics('imgs/dataanalyst.png')
#site = JOB_WEBSITE
#job = "Data+Analyst"
#region = STATE_TWO_LETTERS
#miles = MILES_NUM
#pag = PAG_NUM
num_pages = 29
# COMMENT: Loop over the max number of pages for the job search
for (i in 1:num_pages) {
  # TODO: Create a url for the job search
  #url <- create_url()
  url <- create_url(website = site, title = job, location = region, radius = miles, page = i)
  # COMMENT: Read the created URL and collect the html code
  web_html <- read_html(url)
  # COMMENT: If statement to create the first dataframe
  if(i == 1) {
    # COMMENT: Create a tibble dataframe extracting information from the html code
    job_data <- create_tibble(html = web_html)
  }else{
    # COMMENT: Add the new observations to the first dataframe
    df <- create_tibble(html = web_html)
    job_data <- bind_rows(job_data, df)
  }
  # COMMENT: Pause for a second between requests so we do not overload the server
  Sys.sleep(1.0)
}
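As an aside, long scraping runs can fail partway through if a single page times out. A more defensive variant of the loop, sketched here but not run, assumes create_url() and create_tibble() behave as defined above and skips failed pages instead of aborting:
# tryCatch() returns NULL for any page that fails to download or parse
job_data_safe <- NULL
for (i in 1:num_pages) {
  page_df <- tryCatch({
    page_url <- create_url(website = site, title = job, location = region,
                           radius = miles, page = i)
    create_tibble(html = read_html(page_url))
  }, error = function(e) NULL)
  # bind_rows() ignores NULL inputs, so failed pages are simply dropped
  if (!is.null(page_df)) job_data_safe <- bind_rows(job_data_safe, page_df)
  Sys.sleep(1.0)   # still pause between requests
}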
summary(job_data)
## title company city
## Length:551 Length:551 Length:551
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## state zipcode
## Length:551 Length:551
## Class :character Class :character
## Mode :character Mode :character
head(job_data)
## # A tibble: 6 x 5
## title company city state zipcode
## <chr> <chr> <chr> <chr> <chr>
## 1 Hadoop Administrator / Archit~ Systems Technology~ Dearbo~ MI 48126
## 2 C++ developer with exp. in si~ Systems Technology~ Warren MI 48090
## 3 Project Manager - Ann Arbor, ~ Intellisoft Techno~ Ann Ar~ MI 48103
## 4 Adobe Experience Manager (AEM~ Systems Technology~ Dearbo~ MI 48126
## 5 Adobe Experience Manager (AEM~ Systems Technology~ Dearbo~ MI 48126
## 6 VCS STORAGE ADMINISTRATOR MUS~ Tayback Staffing Auburn~ MI 48321
write_csv(x = job_data, path = "data/vice_mi.csv" )
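An optional round-trip check (my addition, not required by the assignment) is to re-import the CSV and confirm its dimensions; forcing zipcode to character keeps ZIP codes from being read back as numbers:
check <- read_csv("data/vice_mi.csv", col_types = cols(zipcode = col_character()))
dim(check)   # expect 551 rows and 5 columns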
knitr::include_graphics('imgs/vice_inspection.png')
knitr::include_graphics('imgs/geomap_vice.png')
## Most of the locations are clustered in one area near the lower part of the Thumb, with a few others scattered across the state.
knitr::include_graphics('imgs/treedata.png')
## Most of the jobs are in Lansing and Grand Rapids, which makes sense because they are big cities. Auburn Hills has the fewest jobs.
knitr::include_graphics('imgs/vicebar.png')
## The bigger cities have more jobs available. This makes sense, since more people live in big cities.
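For reference, the per-city counts behind a bar chart like this can be reproduced directly with dplyr (a small sketch, assuming job_data is the tibble built above):
job_data %>%
  count(city, sort = TRUE)   # postings per city, largest first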
knitr::include_graphics('imgs/vicedash.png')
## All of the views agree, showing that the same cities have the largest number of jobs available.
To complete the last task, follow the directions below. Make sure to take screenshots and attach pictures of the results obtained and of any questions asked.
knitr::include_graphics('imgs/watson1.png')
## Which ZIP codes have the highest values depends on the title, city, and company being examined.
knitr::include_graphics('imgs/watson2.png')
## This is like a word cloud that shows which cities have the most job offerings.
knitr::include_graphics('imgs/watson3.png')
## This shows the types of jobs, and the number of each, in certain cities. Many of the jobs are with Systems Technology and digital technology companies. Bigger cities have more types of job offerings.