For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks/questions to be completed/answered are highlighted in a larger bold font and numbered according to their section.
Web scraping, web harvesting, or web data extraction is data scraping used to extract data from websites.
The technique mostly focuses on transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet).
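As a minimal sketch of that idea (using the rvest package we load below, run against a hypothetical page), scraping amounts to reading a page's HTML and pulling fields out of it with selectors:

library(rvest)
# Hypothetical example: read a page and extract the text of every <h2>
# heading into a character vector -- unstructured HTML in, structured data out.
page <- read_html("https://example.com")
page %>%
  html_nodes("h2") %>%
  html_text()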
A fictitious London-based training company, WeTrainYou, wants to start a local training facility in California. It is looking for a city where ample Salesforce* development jobs are available.
Its goal is to train engineers and fulfill full-time and part-time jobs. WeTrainYou has hired you to determine where they should set up the business.
Case Study: https://software.intel.com/en-us/articles/using-visualization-to-tell-a-compelling-data-story
We are going to use the tidyverse, a collection of R packages designed for data science.
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Loading required package: rvest
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
# This function creates a URL for a job search on www.dice.com
create_url <- function(website, title, location, radius, page){
  url <- paste0(website, "?q=", title, "&l=", location, "&radius=", radius)
  url <- paste0(url, "&startPage=", page, "&jobs")
  return(url)
}
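Note that create_url() assumes its query terms are already URL-safe (hence the "+" standing in for the space in "Data+Analyst" below). A hedged variant that accepts plain strings could percent-encode them first with utils::URLencode(); create_url_safe below is only an illustrative sketch, not part of the assignment code:

# Sketch: the same URL builder, but percent-encode the raw query terms first.
# URLencode(..., reserved = TRUE) turns "Data Analyst" into "Data%20Analyst".
create_url_safe <- function(website, title, location, radius, page){
  url <- paste0(website,
                "?q=", utils::URLencode(title, reserved = TRUE),
                "&l=", utils::URLencode(location, reserved = TRUE),
                "&radius=", radius)
  paste0(url, "&startPage=", page, "&jobs")
}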
# This function uses the unstructured data from the HTML page to build a
# data frame containing only the fields needed for the analysis
create_tibble <- function(html){
  # Job title: taken from the title attribute of each result link
  search_title <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes(xpath = "//a/@title") %>%
    html_text()
  # State abbreviation for each listing
  search_region <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=addressRegion]") %>%
    html_text()
  # Postal code for each listing
  search_zipcode <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=postalCode]") %>%
    html_text()
  # City: drop the trailing ", <state>" from the street address
  search_address <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=streetAddress]") %>%
    html_text() %>%
    str_replace(pattern = paste0(", ", search_region), replacement = "")
  # Hiring company name
  search_company <- html %>%
    html_nodes(".complete-serp-result-div") %>%
    html_nodes('[itemprop="hiringOrganization"]') %>%
    html_nodes("[itemprop=name]") %>%
    html_text()
  df <- tibble(title = search_title,
               company = search_company,
               city = search_address,
               state = search_region,
               zipcode = search_zipcode)
  return(df)
}
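One caveat: create_tibble() assumes every search result carries all five fields. If a listing lacks, say, a postal code, the extracted vectors end up with different lengths and the tibble() call errors out. A more defensive sketch (create_tibble_safe is hypothetical, reusing the same selectors) extracts per listing with html_node(), which returns NA text for absent matches, so the rows stay aligned:

# Sketch: per-listing extraction keeps the five columns the same length
# even when a field is missing from a particular listing.
create_tibble_safe <- function(html){
  listings <- html %>% html_nodes(".complete-serp-result-div")
  tibble(
    title   = listings %>% html_node("a") %>% html_attr("title"),
    company = listings %>%
      html_node('[itemprop="hiringOrganization"] [itemprop="name"]') %>%
      html_text(),
    city    = listings %>%
      html_node('[itemprop="address"] [itemprop="streetAddress"]') %>%
      html_text() %>%
      str_replace(",\\s*[A-Z]{2}$", ""),  # drop the trailing ", MN"-style region
    state   = listings %>%
      html_node('[itemprop="address"] [itemprop="addressRegion"]') %>%
      html_text(),
    zipcode = listings %>%
      html_node('[itemprop="address"] [itemprop="postalCode"]') %>%
      html_text()
  )
}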
site = "https://www.dice.com/jobs"
job = "Data+Analyst"
region = "MN"
miles = 30
pag = 1
url <- create_url(website = site, title = job, location = region, radius = miles, page = pag)
url
## [1] "https://www.dice.com/jobs?q=Data+Analyst&l=MN&radius=30&startPage=1&jobs"
knitr::include_graphics('Screen Shot 2018-03-18 at 10.22.08 AM.png')
site = "https://www.dice.com/jobs"
job = "Data+Analyst"
region = "MN"
miles = 30
num_pages = 23
# Loop over the pages returned by the job search
for (i in 1:num_pages) {
  # Build the URL for the current results page
  url <- create_url(website = site, title = job, location = region,
                    radius = miles, page = i)
  # Fetch the HTML for that page
  web_html <- read_html(url)
  if (i == 1) {
    # First page: create the data frame
    job_data <- create_tibble(html = web_html)
  } else {
    # Subsequent pages: append the new observations
    df <- create_tibble(html = web_html)
    job_data <- bind_rows(job_data, df)
  }
  # Pause for a second before requesting the next page
  Sys.sleep(1.0)
}
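As an aside, the same scrape can be expressed without the if/else accumulator using purrr (loaded with the tidyverse); this is just an equivalent sketch of the loop above, not a change to the method:

# Sketch: map over the page numbers, scrape each page, and row-bind the
# resulting tibbles in one expression.
job_data <- map_dfr(1:num_pages, function(i){
  url <- create_url(website = site, title = job, location = region,
                    radius = miles, page = i)
  Sys.sleep(1.0)  # pause between requests
  create_tibble(html = read_html(url))
})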
head(job_data)
## # A tibble: 6 x 5
## title company city state zipcode
## <chr> <chr> <chr> <chr> <chr>
## 1 Data Analyst with Vlookup Javen Technolo… Minneap… MN 55412
## 2 Data Analyst / Business Analyst Kforce Inc. Minneap… MN 55479
## 3 HL7 Data Quality Analyst Prime Solution… Minneap… MN 55402
## 4 Data Analytics Consultant Contra… iTech Solutions Eden Pr… MN 55344
## 5 HL7 Data Acquisition & Reporting… UnitedHealth G… Minneto… MN 55305
## 6 HL7 Data Quality Analyst - UHC -… UnitedHealth G… Minneto… MN 55305
summary(job_data)
## title company city
## Length:663 Length:663 Length:663
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## state zipcode
## Length:663 Length:663
## Class :character Class :character
## Mode :character Mode :character
write_csv(x = job_data, path = "data/job_data.csv")
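Before moving to Tableau, it can be worth sanity-checking the aggregation the bubble map will encode; a quick dplyr equivalent (assuming job_data as built above):

# Count the job listings per city, largest first -- the same quantity the
# bubble sizes represent in the map below.
job_data %>%
  count(city, sort = TRUE)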
knitr::include_graphics('Screen Shot 2018-03-19 at 1.36.58 PM.png')
### 2B) Using Tableau geolocation features, map cities using bubbles where the size of the bubble is the cumulative number of job listings in that city. Note any interesting patterns and add a screenshot of your visualization.
knitr::include_graphics('Screen Shot 2018-03-18 at 4.50.48 PM.png')
The cities with job postings are all clustered in the same region (around Minneapolis), and the four largest circles most likely correspond to Minneapolis and the suburbs surrounding it.
knitr::include_graphics('Screen Shot 2018-03-18 at 4.57.15 PM.png')
Minneapolis has the most job openings, with Eden Prairie second. This makes sense because Minneapolis is a large city, and most of the cities with many openings are suburbs surrounding it. "Saint Paul" and "St. Paul" are the same city spelled two ways, so their combined openings would probably exceed Eden Prairie's.
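A hedged cleanup sketch for that spelling inconsistency (assuming "Saint Paul" and "St. Paul" are the only variants) would recode the city names before re-exporting for Tableau:

# Normalize the two spellings so the combined count lands on one city.
job_data <- job_data %>%
  mutate(city = str_replace(city, "^Saint Paul$", "St. Paul"))
write_csv(x = job_data, path = "data/job_data.csv")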
knitr::include_graphics('Screen Shot 2018-03-19 at 2.18.46 PM.png')
knitr::include_graphics('Screen Shot 2018-03-19 at 2.20.22 PM.png')
Minneapolis has the highest number of job openings, followed by Eden Prairie. Many cities have very few openings (only one).

### 2E) Create a dashboard to display the three plots above. Use half of the dashboard to display the map with the location of the cities. On the bottom of the dashboard, place the other two charts and add titles to each chart. Note any interesting patterns and add a screenshot of your visualization.
knitr::include_graphics('Screen Shot 2018-03-19 at 3.13.06 PM.png')
To complete the last task, follow the directions below. Make sure to take screenshots and attach pictures of the results obtained and of any questions asked.
knitr::include_graphics('Screen Shot 2018-03-18 at 3.54.50 PM.png')
This insight shows the breakdown by city: Minneapolis has the largest number of job postings and Eden Prairie the second largest. The larger the square, the more job postings.
knitr::include_graphics('Screen Shot 2018-03-18 at 4.08.11 PM.png')
This insight shows the companies with the most job postings; UnitedHealth Group appears most often.

### 3C) Watson Analytics Insights: describe your findings.
knitr::include_graphics('Screen Shot 2018-03-18 at 4.12.54 PM.png')