Notebook Instructions


About

  • Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

  • This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (a database or spreadsheet).

  • A fictitious London-based training company, WeTrainYou, wants to start a local training facility in California. It is looking for a city where ample Salesforce* development jobs are available.

  • Its goal is to train engineers and fulfill full-time and part-time jobs. WeTrainYou has hired you to determine where they should set up the business.

  • Case Study: https://software.intel.com/en-us/articles/using-visualization-to-tell-a-compelling-data-story

Load Packages in R/RStudio

We are going to use tidyverse, a collection of R packages designed for data science, along with rvest for web scraping.
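A minimal loading chunk that would produce the messages below, assuming require() is used (library() would load the same packages but without the "Loading required package" lines):

# Load the packages used throughout the notebook
require(tidyverse)   # dplyr, ggplot2, readr, stringr, tibble, ...
require(rvest)       # HTML scraping: read_html(), html_nodes(), html_text(); also loads xml2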

## Loading required package: tidyverse
## -- Attaching packages ----------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts -------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: rvest
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding

Web Scraping Functions

# This function creates a search URL for the www.dice.com website; the data is extracted from the returned pages later
create_url <- function(website, title, location, radius, page){
  url <- paste0(website, "?q=", title, "&l=", location, "&radius=", radius)
  url <- paste0(url, "&startPage=", page, "&jobs")
  return(url)
}
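For illustration, with hypothetical argument values (the real search parameters are kept as placeholders further below), the helper simply pastes the pieces into a single query string:

# Hypothetical values, shown only to illustrate the URL format the function builds
create_url(website = "https://www.dice.com/jobs", title = "Data+Analyst",
           location = "MI", radius = 30, page = 1)
# "https://www.dice.com/jobs?q=Data+Analyst&l=MI&radius=30&startPage=1&jobs"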

# This function uses the unstructured data from the html page to create a dataframe
# with only the data that is needed for analysis
create_tibble <- function(html){
  
  search_title <- html %>% 
    html_nodes(".complete-serp-result-div") %>%
    html_nodes(xpath = "//a/@title") %>%
    html_text()
  
  search_region <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=addressRegion]") %>%
    html_text()
  
  search_zipcode <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=postalCode]") %>%
    html_text()
  
  search_address <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=streetAddress]") %>%
    html_text() %>% 
    str_replace(pattern = paste0(", ",search_region), "")
  
  search_company <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="hiringOrganization"]') %>%
    html_nodes("[itemprop=name]") %>%
    html_text()
  
    df <- tibble(title = search_title,
                 company = search_company,
                 city = search_address, 
                 state = search_region,
                 zipcode = search_zipcode)

    return(df)
}
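The selectors above first narrow each search result to its itemprop="address" block and then pull out the individual fields. As an illustrative sketch only (the real dice.com markup may differ), here is how the attribute selectors behave on a tiny inline snippet, using the rvest and xml2 packages already loaded above:

# Illustrative snippet: not the real dice.com HTML
snippet <- xml2::read_html('<div class="complete-serp-result-div">
  <span itemprop="address">
    <span itemprop="streetAddress">Ann Arbor, MI</span>
    <span itemprop="addressRegion">MI</span>
    <span itemprop="postalCode">48103</span>
  </span>
</div>')

# Narrow to the address block, then extract the postal code text
snippet %>%
  html_nodes('[itemprop="address"]') %>%
  html_nodes("[itemprop=postalCode]") %>%
  html_text()
# "48103"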

Task 1: Data Collection - Web Scraping


1B) From the target website, determine the number of pages for the given search. Create a variable “num_pages” equal to the max number of pages for the job search. Create a variable “url” using the create_url() function with the same parameters as the previous search; for the page number, use “i”, as we are looping over all the pages.

  • Commands: create_url(website = JOBSITE , title = JOBTITLE, location = STATE, radius = NUM_MILES, page = PAG_NUM)
#site = JOB_WEBSITE
#job = "Data+Analyst"
#region = STATE_TWO_LETTERS
#miles = MILES_NUM
#pag = PAG_NUM
num_pages = 29
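The maximum number of pages above was read off the site manually. A hedged sketch of how it could be scraped from the first results page instead, assuming the variables commented out above (site, job, region, miles) are defined and assuming a hypothetical ".pagination-page-count" node (the real dice.com markup may differ):

# Hypothetical: scrape the total page count from a pagination element on page 1.
# ".pagination-page-count" is an assumed selector -- inspect the live page for the real one.
first_page <- read_html(create_url(website = site, title = job,
                                   location = region, radius = miles, page = 1))
num_pages <- first_page %>%
  html_nodes(".pagination-page-count") %>%
  html_text() %>%
  parse_number()   # e.g. "29" -> 29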

# COMMENT: Loop over the max number of pages for the job search
for (i in 1:num_pages) {

  # COMMENT: Create a url for page i of the job search
  url <- create_url(website = site, title = job, location = region, radius = miles, page = i)

  # COMMENT: Read the created URL and collect the html code
  web_html <- read_html(url)

  # COMMENT: If statement to create the first dataframe
  if (i == 1) {

    # COMMENT: Creates a tibble dataframe extracting information from the html code
    job_data <- create_tibble(html = web_html)

  } else {

    # COMMENT: Append the new observations to the first dataframe
    df <- create_tibble(html = web_html)
    job_data <- bind_rows(job_data, df)
  }

  # COMMENT: Pause briefly before moving to the next page
  Sys.sleep(1.0)
}

1C) Make sure that the data was collected correctly by using the functions to inspect and summarize the data. Describe the summary statistics and note any significant observations.

  • Dataframe: job_data
  • Commands: head() summary()
summary(job_data)
##     title             company              city          
##  Length:551         Length:551         Length:551        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##     state             zipcode         
##  Length:551         Length:551        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

Every column is a character vector of length 551.

head(job_data)
## # A tibble: 6 x 5
##   title                          company             city    state zipcode
##   <chr>                          <chr>               <chr>   <chr> <chr>  
## 1 Hadoop Administrator / Archit~ Systems Technology~ Dearbo~ MI    48126  
## 2 C++ developer with exp. in si~ Systems Technology~ Warren  MI    48090  
## 3 Project Manager - Ann Arbor, ~ Intellisoft Techno~ Ann Ar~ MI    48103  
## 4 Adobe Experience Manager (AEM~ Systems Technology~ Dearbo~ MI    48126  
## 5 Adobe Experience Manager (AEM~ Systems Technology~ Dearbo~ MI    48126  
## 6 VCS STORAGE ADMINISTRATOR MUS~ Tayback Staffing    Auburn~ MI    48321

1D) After making sure that the data was collected correctly, save the data as a CSV file.

  • Commands: write_csv(x = DATAFRAME, path = “data/DATAFRAME” )
write_csv(x = job_data, path = "data/vice_mi.csv" )
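A quick sanity check that the file round-trips, assuming the data/ folder exists in the project:

# Read the file back and confirm the dimensions match the scraped data
check <- read_csv("data/vice_mi.csv")
dim(check)   # should be 551 rows x 5 columns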

Task 2: Visualization Analysis - Tableau


2A) Upload your data in CSV format to Tableau and make any changes to the data types (GEOLOCATION, TEXT, NUMERIC). Take a screenshot of Tableau’s data inspection.

knitr::include_graphics('imgs/vice_inspection.png')

2B) Using Tableau’s geolocation features, map the cities using bubbles, where the size of each bubble is the cumulative number of job listings in that city. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('imgs/geomap_vice.png')

## Most of the locations are clustered in one area near the lower part of the thumb, with a few others scattered across the state.

2C) Create a tree map to compare the different cities by the cumulative number of job postings in each city. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('imgs/treedata.png')

## Most of the jobs are in Lansing and Grand Rapids, which makes sense because they are big cities. Auburn Hills has the fewest jobs.

2D) Create a bar plot by state and city to display the cumulative number of jobs in each city. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('imgs/vicebar.png')

## The bigger cities have more jobs available. This makes sense since more people live in big cities.

2E) Create a dashboard to display the three plots above. Use half of the dashboard to display the map with the locations of the cities. On the bottom of the dashboard, place the other two charts and add a title to each chart. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('imgs/vicedash.png')

## All the data fits together, showing that the same cities have the largest number of jobs available.

Task 3: Watson Analysis


To complete the last task, follow the directions found below. Make sure to take screenshots and attach pictures of the results obtained and of any questions asked.

3A) Upload your data to Watson and explore the different insights. Take 3 screenshots of your insights and describe your findings.

knitr::include_graphics('imgs/watson1.png')

## The zip codes with the highest values depend on the title, city, and company associated with the posting.

3B) Watson Analytics Insights, describe your findings.

knitr::include_graphics('imgs/watson2.png')

## This is like a word cloud that shows which cities have the most job offerings.

3C) Watson Analytics Insights, describe your findings.

knitr::include_graphics('imgs/watson3.png')

## This shows the types of jobs and the number of each in certain cities. There are a lot of jobs in systems technology and digital technology. Bigger cities have more types of job offerings.