Notebook Instructions


About

  • Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

  • This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (a database or spreadsheet).

  • A fictitious London-based training company, WeTrainYou, wants to start a local training facility in California. It is looking for a city where ample Salesforce* development jobs are available.

  • Its goal is to train engineers to fill full-time and part-time jobs. WeTrainYou has hired you to determine where it should set up the business.

  • Case Study: https://software.intel.com/en-us/articles/using-visualization-to-tell-a-compelling-data-story

Load Packages in R/RStudio

We are going to use the tidyverse, a collection of R packages designed for data science, along with rvest for web scraping.
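The exact library calls are not shown in the knitted output; judging from the "Loading required package" messages below, something along these lines was presumably used:

require(tidyverse)
require(rvest)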

## Loading required package: tidyverse
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: rvest
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding

Web Scraping Functions

# This function creates a URL for a job search on www.dice.com so the data can be extracted
create_url <- function(website, title, location, radius, page){
  url <- paste0(website, "?q=", title, "&l=", location, "&radius=", radius)
  url <- paste0(url, "&startPage=", page, "&jobs")
  return(url)
}
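As a quick illustration of what create_url() builds, a call with hypothetical arguments (the website URL and radius below are made-up examples, not the values used in the actual search; the job title matches the one in Task 1) returns:

create_url(website = "https://www.dice.com/jobs", title = "Data+Analyst",
           location = "NY", radius = 30, page = 1)
## [1] "https://www.dice.com/jobs?q=Data+Analyst&l=NY&radius=30&startPage=1&jobs"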

# This function uses the unstructured data from the HTML page to create a data frame
# (tibble) containing only the fields needed for the analysis
create_tibble <- function(html){
  
  # Job title, taken from the title attribute of the link in each search result
  search_title <- html %>% 
    html_nodes(".complete-serp-result-div") %>%
    html_nodes(xpath = "//a/@title") %>%
    html_text()
  
  # State abbreviation, from the address block of each result
  search_region <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=addressRegion]") %>%
    html_text()
  
  # Zip code, from the address block of each result
  search_zipcode <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=postalCode]") %>%
    html_text()
  
  # City: the street address with the trailing ", <state>" removed
  search_address <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=streetAddress]") %>%
    html_text() %>% 
    str_replace(pattern = paste0(", ", search_region), "")
  
  # Hiring company name
  search_company <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="hiringOrganization"]') %>%
    html_nodes("[itemprop=name]") %>%
    html_text()
  
  # Combine the extracted vectors into a tibble with one row per job posting
  df <- tibble(title = search_title,
               company = search_company,
               city = search_address, 
               state = search_region,
               zipcode = search_zipcode)
  
  return(df)
}

Task 1: Data Collection - Web Scraping


1B) From the target website, determine the number of pages for the given search. Create a variable “num_pages” equal to the maximum number of pages for the job search. Create a variable “url” using the create_url() function with the same parameters as the previous search; for the page number, use “i” since we are looping over all the pages.

  • Commands: create_url(website = JOBSITE , title = JOBTITLE, location = STATE, radius = NUM_MILES, page = PAG_NUM)
#site = JOB_WEBSITE
#job = "Data+Analyst"
#region = STATE_TWO_LETTERS
#miles = MILES_NUM
num_pages <- 92
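# NOTE (illustrative sketch, not part of the graded solution): instead of hard-coding
# num_pages, it could be scraped from the first results page. The CSS selector below
# is hypothetical and may not match dice.com's actual markup.
# first_page <- read_html(create_url(website = site, title = job,
#                                    location = region, radius = miles, page = 1))
# num_pages  <- first_page %>%
#   html_nodes(".pagination-total") %>%   # hypothetical selector
#   html_text() %>%
#   readr::parse_number()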

# COMMENT: Loop over the max number of pages for the job search
for (i in 1:num_pages) {
  
  # COMMENT: Create a URL for page i of the job search
  url <- create_url(website = site, title = job, location = region, radius = miles, page = i)
  
  # COMMENT: Read the created URL and collect the html code
  web_html <- read_html(url)

  # COMMENT: If statement to create the first dataframe
  if (i == 1) {
    
    # COMMENT: Create a tibble dataframe extracting information from the html code
    job_data <- create_tibble(html = web_html)
    
  } else {
    
    # COMMENT: Append the new observations to the first dataframe
    df <- create_tibble(html = web_html)
    job_data <- bind_rows(job_data, df)
  }
  
  # COMMENT: Pause for a second before moving to the next page so we don't overload the site
  Sys.sleep(1.0)
}
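As a side note, the same collection step could be written without the explicit for loop by using purrr, which is already attached with the tidyverse. This is only a sketch, assuming site, job, region, and miles are defined as above:

job_data_alt <- map_dfr(1:num_pages, function(i) {
  page_html <- read_html(create_url(website = site, title = job,
                                    location = region, radius = miles, page = i))
  Sys.sleep(1)  # pause between requests, as in the loop above
  create_tibble(html = page_html)
})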

1C) Make sure that the data was collected correctly by using functions to inspect and summarize the data. Describe the summary statistics and note any significant observations.

  • Dataframe: job_data
  • Commands: head() summary()
summary(job_data)
##     title             company              city          
##  Length:2726        Length:2726        Length:2726       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##     state             zipcode         
##  Length:2726        Length:2726       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
head(job_data)
## # A tibble: 6 x 5
##   title                      company              city       state zipcode
##   <chr>                      <chr>                <chr>      <chr> <chr>  
## 1 Data Analyst               Talent Minds Networ~ New York ~ NY    10023  
## 2 Data Analyst - Data Mining Harris Corporation   Rochester  NY    14602  
## 3 Business/Data Analyst      Morgan Stanley UK L~ New York   NY    10001  
## 4 Data Analyst - MDM         Publishers Clearing~ Jericho    NY    11753  
## 5 Data Analyst Intern        DST Systems, Inc     New York   NY    10001  
## 6 Data Analyst               Matlen Silver        New York   NY    10036

1D) After making sure that the data was collected correctly, save the data as a CSV file.

  • Commands: write_csv(x = DATAFRAME, path = “data/DATAFRAME” )
write_csv(x = job_data, path = "data/dice-ny.csv" )
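As an optional sanity check (assuming the working directory is the project root used above), the file can be read back in and inspected:

read_csv("data/dice-ny.csv") %>% head()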

Task 2: Visualization Analysis - Tableau


2A) Upload your data in CSV format to Tableau and make any needed changes to the data types (GEOLOCATION, TEXT, NUMERIC). Take a screenshot of Tableau’s data inspection.

knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 6\\06-notebook-lab\\data\\img1.png')

2B) Using Tableau’s geolocation features, map the cities using bubbles where the size of each bubble is the cumulative number of job listings in that city. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 6\\06-notebook-lab\\data\\img2.png')


It appears the majority of companies are located on the East coast, whereas Upstate New York has more scattered job postings that are not as concentrated in cities as they are in Manhattan.

2C) Create a tree map to compare the different cities by the cumulative number of job postings in each city. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 6\\06-notebook-lab\\data\\img3.png')


It appears the majority of jobs can be found in New York City (shown by the size of the box), which makes sense, as large cities are known to have more job opportunities than suburban areas.

2D) Create a bar plot by state and city to display the cumulative number of jobs in each city. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 6\\06-notebook-lab\\data\\img5.png')


This graph illustrates the number of postings by company in the listed cities. According to this graph, Brooklyn has the most company postings.

2E) Create a dashboard to display the three plots above. Use half of the dashboard to display the map with the locations of the cities, and place the other two charts on the bottom of the dashboard; add titles to each chart. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 6\\06-notebook-lab\\data\\img6.png')

Overall, this dashboard compiles our data into a single conclusion: New York City has the most job postings.

Task 3: Watson Analysis


To complete the last task, follow the directions found below. Make sure to take screenshots and attach pictures of the results obtained and of any questions asked.

3A) Upload your data to Watson and explore the different insights. Take 3 screenshots of your insights and describe your findings.

knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 6\\06-notebook-lab\\data\\img7.png')

knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 6\\06-notebook-lab\\data\\img8.png')

knitr::include_graphics('C:\\Users\\hp\\Documents\\Spring 2018\\BSAD 343H\\Labs\\Lab 6\\06-notebook-lab\\data\\img9.png')

The first figure shows the relationship between three variables, which allows one to identify a more narrowly defined finding; here, the top city is New York. The second figure is a word cloud that offers a different way to read the data: the bigger the word, the more frequently it appears, so we again conclude that New York has the most job postings. The third figure, which is very similar to the Tableau finding, shows how New York, based on the size of its box, is the most prevalent and comprises the most job postings, followed by New York City.

3B) Watson Analytics Insights: describe your findings.

These graphs show different perspectives on the data. Here, we reach the same conclusion as with Tableau: New York has the greatest number of job postings, closely followed by Rochester and White Plains. These visualizations give researchers more opportunities to infer implications from their data.