For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.
Web scraping, web harvesting, or web data extraction is data scraping used to extract data from websites.
This technique mostly focuses on transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet).
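As a minimal sketch of that idea (using the rvest package introduced below, and a made-up HTML fragment rather than a real page), we can parse raw HTML and pull selected pieces into a structured table:
library(rvest)
library(tibble)
# A toy page: two job listings as raw (unstructured) HTML
page <- read_html('
  <div class="job"><span class="title">Data Analyst</span>
                   <span class="city">Chicago</span></div>
  <div class="job"><span class="title">Data Engineer</span>
                   <span class="city">Lisle</span></div>')
# Pull the pieces we care about into a structured tibble
tibble(
  title = page %>% html_nodes(".job .title") %>% html_text(),
  city  = page %>% html_nodes(".job .city")  %>% html_text()
)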
A fictitious London-based training company, WeTrainYou, wants to start a local training facility in California. It is looking for a city where ample Salesforce* development jobs are available.
Its goal is to train engineers to fill full-time and part-time jobs. WeTrainYou has hired you to determine where they should set up the business.
Case Study: https://software.intel.com/en-us/articles/using-visualization-to-tell-a-compelling-data-story
We are going to use the tidyverse, a collection of R packages designed for data science.
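The package-attaching code is not echoed below; judging from the "Loading required package" messages, it was presumably something like:
require(tidyverse)  # core data-science packages: ggplot2, dplyr, readr, stringr, ...
require(rvest)      # web-scraping helpers: read_html(), html_nodes(), html_text()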
## Loading required package: tidyverse
## -- Attaching packages -------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ----------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: rvest
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
# This function creates a URL for a job search on www.dice.com
create_url <- function(website, title, location, radius, page){
  url <- paste0(website, "?q=", title, "&l=", location, "&radius=", radius)
  url <- paste0(url, "&startPage=", page, "&jobs")
  return(url)
}
# This function uses the unstructured data from the HTML page to create a data frame
# with only the data that is needed for analysis
create_tibble <- function(html){
  # Job titles: the title attribute of the result links
  search_title <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes(xpath = "//a/@title") %>%
    html_text()
  # State / region of each posting
  search_region <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=addressRegion]") %>%
    html_text()
  # Zip code of each posting
  search_zipcode <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=postalCode]") %>%
    html_text()
  # City: the street-address field with the trailing ", <state>" stripped off
  search_address <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=streetAddress]") %>%
    html_text() %>%
    str_replace(pattern = paste0(", ", search_region), "")
  # Hiring company name
  search_company <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="hiringOrganization"]') %>%
    html_nodes("[itemprop=name]") %>%
    html_text()
  # Combine the extracted vectors into a single tibble
  df <- tibble(title = search_title,
               company = search_company,
               city = search_address,
               state = search_region,
               zipcode = search_zipcode)
  return(df)
}
site = "https://www.dice.com/jobs"
job = "Data+Analyst"
region = "IL"
miles = 20
pag = 1
url <- create_url(website = site, title = job, location = region, radius = miles, page = pag)
url
## [1] "https://www.dice.com/jobs?q=Data+Analyst&l=IL&radius=20&startPage=1&jobs"
knitr::include_graphics("imgs/ScrapePG1.PNG")
#site = JOB_WEBSITE
#job = "Data+Analyst"
#region = "IL"
#miles = 20
num_pages = 55
# COMMENT: Loop over the max number of pages for the job search
for (i in 1:num_pages) {
  # TODO: Create a URL for the job search
  url <- create_url(website = site, title = job, location = region, radius = miles, page = i)
  # COMMENT: Read the created URL and collect the html code
  web_html <- read_html(url)
  # COMMENT: If statement to create the first dataframe
  if(i == 1) {
    # COMMENT: Creates a tibble dataframe by extracting information from the html code.
    # Only the first page creates the table; every later page is appended to it.
    job_data <- create_tibble(html = web_html)
  } else {
    # COMMENT: We add the new observations to the first dataframe
    df <- create_tibble(html = web_html)
    job_data <- bind_rows(job_data, df)
  }
  # COMMENT: Wait briefly before requesting the next page
  Sys.sleep(1.0)
}
head(job_data)
## # A tibble: 6 x 5
## title company city state zipcode
## <chr> <chr> <chr> <chr> <chr>
## 1 Data Analyst-Bus Intelligence State Farm Insuran~ Bloomin~ IL 61701
## 2 Senior Data Analyst AAIS Lisle IL 60532
## 3 Data Analyst Air Force Civilian~ Bellevi~ IL 62225
## 4 DATA ANALYST Enterprise Infioni~ Chicago IL 60601
## 5 Data Analyst LaSalle Network Chicago IL 60603
## 6 Microsoft Data Analyst Request Technology~ Oak Bro~ IL 60523
summary(job_data)
## title company city
## Length:1620 Length:1620 Length:1620
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## state zipcode
## Length:1620 Length:1620
## Class :character Class :character
## Mode :character Mode :character
write_csv(x = job_data, path = "data/job_data.csv")
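As a quick sanity check (a sketch assuming the file path used above), the exported CSV can be read back and compared against the in-memory table:
# Re-read the exported file and confirm it round-trips
job_data_check <- read_csv("data/job_data.csv")
dim(job_data_check)                                # expect 1620 rows, 5 columns
identical(names(job_data_check), names(job_data))  # expect TRUE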
knitr::include_graphics("imgs/inspection.PNG")
knitr::include_graphics("imgs/bubbles.PNG")
There are far more jobs in Chicago than in the other cities, which makes the other bubbles much smaller than Chicago's. I zoomed in to see the differences in bubble size; there were a few outliers with a single job elsewhere in the larger IL area. This makes it hard to read the exact number of jobs from the bubbles.
knitr::include_graphics("imgs/TreepMap.PNG")
Chicago is such a major outlier that it is hard to see a difference in the colors of the cities with fewer jobs. The tree map does make it easier to see which cities are comparable to each other, since their names appear in the squares that designate their sizes.
knitr::include_graphics("imgs/bar.PNG")
This chart displays the number of jobs most clearly. You can tell that Chicago has the most jobs and that cities near Chicago tend to have the next largest number of jobs; however, those counts are significantly smaller. If I were seeking a job, I would definitely look in Chicago rather than in the other cities.
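For comparison, a rough ggplot2 version of this kind of bar chart could be built from the scraped data directly (a sketch, not the chart shown in the screenshot):
# Bar chart of the ten cities with the most postings
job_data %>%
  count(city, sort = TRUE) %>%
  top_n(10, n) %>%
  ggplot(aes(x = reorder(city, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "City", y = "Number of job postings")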
knitr::include_graphics("imgs/dashboard.PNG")
The dashboard is nice because you can compare all three views easily to gain the maximum information. You can see the proportional differences in size clearly in the tree map and the exact quantities in the bar chart.
To complete the last task, follow the directions found below. Make sure to take screenshots of the results obtained and of any questions asked, and attach them.
knitr::include_graphics("imgs/watsoninsight1.PNG")
Companies with the most job postings in IL are Request Technology LLC, CyberCoders, Robert Half Technology, Deloitte, and US Tech Solutions Inc.
knitr::include_graphics("imgs/watsoninsight2.PNG")
Most companies have one job posting.
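These insights can be cross-checked against the scraped data directly; a rough sketch with dplyr:
# Companies with the most postings in the scraped IL data
job_data %>%
  count(company, sort = TRUE) %>%
  head(5)
# Number of companies with exactly one posting
job_data %>%
  count(company) %>%
  filter(n == 1) %>%
  nrow()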
knitr::include_graphics("imgs/watsoninsight3.PNG")
There are the most jobs for more advanced coding positions, and there are also a lot of analyst postings. The data analyst scrape did not actually pull only data analyst jobs; in fact, it pulled the most network engineer jobs.
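One way to check how mixed the results are is to count the raw titles in the scraped data (a sketch):
# Most common job titles returned by the "Data Analyst" search
job_data %>%
  count(title, sort = TRUE) %>%
  head(10)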