For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.
This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (a database or spreadsheet).
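As a small illustration of that transformation, the sketch below parses a made-up HTML snippet with rvest and pulls the text out of selected nodes; the snippet and the ".title" CSS class are placeholders invented for this example, not taken from the job site used later.
# A minimal sketch: turn a literal HTML string (unstructured) into a character
# vector (structured). The snippet and the ".title" class are made up for
# illustration only.
library(rvest)
page <- read_html('<div class="job"><span class="title">Data Analyst</span></div>
                   <div class="job"><span class="title">Project Manager</span></div>')
page %>%
  html_nodes(".title") %>%   # select nodes with a CSS selector
  html_text()                # extract their text: "Data Analyst" "Project Manager"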
A fictitious London-based training company, WeTrainYou, wants to start a local training facility in California. It is looking for a city where ample Salesforce* development jobs are available.
Its goal is to train engineers and fill full-time and part-time positions. WeTrainYou has hired you to determine where it should set up the business.
Case Study: https://software.intel.com/en-us/articles/using-visualization-to-tell-a-compelling-data-story
We are going to use tidyverse, a collection of R packages designed for data science.
## Loading required package: tidyverse
## ── Attaching packages ────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Loading required package: rvest
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
# This function creates a URL for a job search on the www.dice.com website
create_url <- function(website, title, location, radius, page){
  url <- paste0(website, "?q=", title, "&l=", location, "&radius=", radius)
  url <- paste0(url, "&startPage=", page, "&jobs")
  return(url)
}
# This function uses the unstructured data from the HTML page to create a data frame
# with only the data that is needed for analysis
create_tibble <- function(html){
  # Job titles from the search-result links
  search_title <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes(xpath = "//a/@title") %>%
    html_text()
  # State (region) of each posting
  search_region <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=addressRegion]") %>%
    html_text()
  # Zip code of each posting
  search_zipcode <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=postalCode]") %>%
    html_text()
  # City: the street address with the trailing ", <state>" removed
  search_address <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=streetAddress]") %>%
    html_text() %>%
    str_replace(pattern = paste0(", ", search_region), "")
  # Hiring company name
  search_company <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="hiringOrganization"]') %>%
    html_nodes("[itemprop=name]") %>%
    html_text()
  df <- tibble(title   = search_title,
               company = search_company,
               city    = search_address,
               state   = search_region,
               zipcode = search_zipcode)
  return(df)
}
#site = JOB_WEBSITE
#job = "Data+Analyst"
#region = STATE_TWO_LETTERS
#miles = MILES_NUM
#pag = PAG_NUM
#url <- create_url(website = JOB_WEBSITE, title = JOB_TITLE, location = STATE_TWO_LETTERS, radius = NUM_MILES, page = PAG_NUM)
#url
site = "https://www.dice.com/jobs"
job = "Project+Manager"
region = "OH"
miles = 30
page = 1
url = create_url(website = site, title = job, location = region, radius = miles, page = page)
url
## [1] "https://www.dice.com/jobs?q=Project+Manager&l=OH&radius=30&startPage=1&jobs"
knitr::include_graphics("imgs/screenshot1.png")
#site = JOB_WEBSITE
#job = "Project+Manager"
#region = "OH"
#miles = 30
num_pages = 9
# COMMENT: Loop over the max number of pages for the job search
for (i in 1:num_pages) {
  # COMMENT: Create the URL for page i of the job search
  url <- create_url(website = site, title = job, location = region, radius = miles, page = i)
  # COMMENT: Read the created URL and collect the HTML code
  web_html <- read_html(url)
  # COMMENT: If statement to create the first data frame
  if (i == 1) {
    # COMMENT: Create a tibble extracting information from the HTML code
    job_data <- create_tibble(html = web_html)
  } else {
    # COMMENT: Append the new observations to the first data frame
    df <- create_tibble(html = web_html)
    job_data <- bind_rows(job_data, df)
  }
  # COMMENT: Wait a moment before moving to the next page
  Sys.sleep(1.0)
}
summary(job_data)
## title company city
## Length:270 Length:270 Length:270
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## state zipcode
## Length:270 Length:270
## Class :character Class :character
## Mode :character Mode :character
head(job_data)
## # A tibble: 6 x 5
## title company city state zipcode
## <chr> <chr> <chr> <chr> <chr>
## 1 PMP Certified Project Manager Cincinnati Bell Tec… Cincin… OH 45242
## 2 Senior Project Manager Cincinnati Bell Tec… Cincin… OH 45212
## 3 IT Project Manager II Medical Mutual of O… Clevel… OH 44101
## 4 IT Project Managers Swagelok Clevel… OH 44139
## 5 IT Project Manager IntegrateDelivery I… Brecks… OH 44141
## 6 Sr. Project Manager ICC Columb… OH 43231
write_csv(x = job_data, path = "data/job_data-OH.csv")
knitr::include_graphics("imgs/screenshot2.png")
knitr::include_graphics("imgs/screenshot4.png")
This map shows the locations of “Project Manager” jobs in Ohio. Since the bubbles are all the same size, it implies that there are roughly the same number of jobs in each location.
knitr::include_graphics("imgs/screenshot3.png")
This tree map uses size to show the number of jobs in each city in Ohio. It shows that the highest concentration of jobs is in Columbus, followed by Cincinnati and Cleveland.
knitr::include_graphics("imgs/screenshot5.png")
Similar to the tree map above, this bar plot shows that the cumulative number of jobs is highest in Columbus and that Cincinnati and Cleveland have the same cumulative number of jobs.
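For comparison, the same jobs-per-city counts behind this bar plot can be reproduced in R from the job_data tibble built during the scraping step; the sketch below assumes job_data is still in the session.
# A sketch of the cumulative-jobs-per-city counts, computed directly from the
# job_data tibble created above (column names as defined in create_tibble()).
library(tidyverse)
job_data %>%
  count(city, sort = TRUE) %>%                  # number of postings per city
  ggplot(aes(x = reorder(city, n), y = n)) +    # order the bars by count
  geom_col() +
  coord_flip() +
  labs(x = "City", y = "Number of job postings")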
knitr::include_graphics("imgs/screenshot6.png")
The dashboard is a nice way to see three different ways the data can be visually represented at once. While they all lead to the same conclusions, it is helpful to be able to look at the three visuals side by side.
To complete the last task, follow the directions found below. Make sure to take screenshots and attach them for the results obtained and any questions asked.
knitr::include_graphics("imgs/screenshot7.png")
Similar to the bar plot created above in Tableau, this bar chart also shows how many jobs are in each city, which once again indicates that Columbus has the highest cumulative number of jobs. This makes sense considering that Columbus is both the capital and the largest city in Ohio.
knitr::include_graphics("imgs/screenshot8.png")
This graphic demonstrates the various factors that drive zipcode, including company, city, and title. Company is the strongest predictor of zipcode at 77%, followed by city at 60% and then title at 47%.
knitr::include_graphics("imgs/screenshot9.png")
This final graphic shows the number of companies with job openings in each city. The cities with larger bubbles have a greater number of companies with openings. I found this graphic interesting because it looks at the number of companies in each city instead of just the number of jobs. For example, the bar chart above showed that Dublin has a significantly lower number of jobs than Cincinnati, yet this graphic shows the same size bubble for both cities. Although Cincinnati may have more jobs than Dublin, only three companies in each city are offering them, so the three companies in Cincinnati must have more openings each than the companies in Dublin. I liked this graphic because it presented the information in a somewhat different way that I had not thought about before.
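For reference, the distinct-companies-per-city view in this bubble chart can also be checked directly in R; the sketch below (assuming the job_data tibble from the scraping step) separates the number of hiring companies from the number of postings.
# A sketch comparing distinct hiring companies with total postings per city,
# computed from the job_data tibble created above.
library(tidyverse)
job_data %>%
  group_by(city) %>%
  summarise(companies = n_distinct(company),   # unique hiring companies per city
            postings  = n()) %>%               # total postings per city, for comparison
  arrange(desc(companies))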