---
title: "Web Scraping and Visual Analytics"
author: "Jax Farquhar"
output:
  html_notebook: default
  html_document: default
date: "Spring 2018"
subtitle: "CME Group Foundation Business Analytics Lab"
---

-------------

* For your assignment you may be using a different dataset than the one included here.

* Always read the instructions on Sakai carefully.

* Tasks and questions to be completed or answered are highlighted in larger bold font and numbered according to their section.

### About

* Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

* This technique mostly focuses on transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet).

* A fictitious London-based training company, WeTrainYou, wants to start a local training facility in California. It is looking for a city where ample Salesforce development jobs are available.

* Its goal is to train engineers to fill full-time and part-time jobs. WeTrainYou has hired you to determine where it should set up the business.

* Case Study: https://software.intel.com/en-us/articles/using-visualization-to-tell-a-compelling-data-story
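
The transformation described above, from unstructured HTML to a structured table, can be sketched on a tiny inline page. This is only an illustration: the HTML fragment, the class names (`job`, `title`, `city`), and the values are made up and are not taken from www.dice.com.

```r
library(rvest)   # read_html(), html_nodes(), html_text(), %>%
library(tibble)  # tibble()

# Hypothetical HTML fragment standing in for a scraped page
page <- read_html(paste0(
  '<div class="job"><span class="title">Data Analyst</span>',
  '<span class="city">Albany</span></div>',
  '<div class="job"><span class="title">Data Engineer</span>',
  '<span class="city">Buffalo</span></div>'
))

# Pull each field out of the markup as a character vector
titles <- page %>% html_nodes(".job .title") %>% html_text()
cities <- page %>% html_nodes(".job .city") %>% html_text()

# Combine the vectors into a structured table, one row per job
jobs <- tibble(title = titles, city = cities)
jobs
```

The functions defined later in this notebook follow the same pattern, only with selectors that match the job site's markup.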


### Load Packages in R/RStudio 

We are going to use tidyverse, a collection of R packages designed for data science. We also load rvest, which provides the HTML parsing functions used below.

* Info: https://www.tidyverse.org/

```{r, echo = FALSE}

# Load tidyverse; if it is not installed, install it first and then load it
if(!require("tidyverse")){
  install.packages("tidyverse", dependencies = TRUE)
  library("tidyverse")
}

# Load rvest; if it is not installed, install it first and then load it
if(!require("rvest")){
  install.packages("rvest", dependencies = TRUE)
  library("rvest")
}

```


#### Web Scraping Functions

```{r}

# This function builds the search URL for the www.dice.com website
create_url <- function(website, title, location, radius, page){
  url <- paste0(website, "?q=", title, "&l=", location, "&radius=", radius)
  url <- paste0(url, "&startPage=", page, "&jobs")
  return(url)
}

# This function uses the unstructured data from the HTML page to create a data frame
# with only the data that is needed for analysis
create_tibble <- function(html){
  
  search_title <- html %>% 
    html_nodes(".complete-serp-result-div") %>%
    html_nodes(xpath = "//a/@title") %>%
    html_text()
  
  search_region <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=addressRegion]") %>%
    html_text()
  
  search_zipcode <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=postalCode]") %>%
    html_text()
  
  search_address <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=streetAddress]") %>%
    html_text() %>% 
    str_replace(pattern = paste0(", ",search_region), "")
  
  search_company <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="hiringOrganization"]') %>%
    html_nodes("[itemprop=name]") %>%
    html_text()
  
    df <- tibble(title = search_title,
                 company = search_company,
                 city = search_address, 
                 state = search_region,
                 zipcode = search_zipcode)

    return(df)
}

```
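
One caveat with `create_tibble()`: each column is collected from the whole page, so if a listing lacks one of the fields the vectors can end up with different lengths, which `tibble()` generally rejects. A minimal sketch of one alternative is to extract the fields one result card at a time with `html_node()`, which yields a missing value when a card has no match. The HTML, the `card` and `title` class names, and the helper name `cards_to_tibble()` are hypothetical; the real page structure would need to be checked.

```r
library(rvest)
library(tibble)

# Hypothetical helper: one row per result card, NA where a field is absent
cards_to_tibble <- function(html) {
  cards <- html_nodes(html, ".card")
  tibble(
    title   = cards %>% html_node(".title") %>% html_text(),
    zipcode = cards %>% html_node("[itemprop=postalCode]") %>% html_text()
  )
}

# Made-up page: the second card has no postal code
example_html <- read_html(paste0(
  '<div class="card"><a class="title">Data Analyst</a>',
  '<span itemprop="postalCode">10001</span></div>',
  '<div class="card"><a class="title">BI Analyst</a></div>'
))

listings <- cards_to_tibble(example_html)
listings
```

Because every column is derived from the same set of cards, the rows stay aligned even when a field is missing.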


-------------

## Task 1: Data Collection - Web Scraping

-------------

### 1A) Use the create_url() function and the given website to scope the data that you will collect. For "title", use one of the "Trending Searches" from the website or a job title that you are interested in. For "location", use a US state (CA, IL, ...); for "radius", use 30 miles. We are only interested in the first page. Take a screenshot of only the data that you need.

* Trending Searches: www.dice.com
* Job Website: "https://www.dice.com/jobs"
* Commands: create_url(website = JOBSITE , title = JOBTITLE, location = STATE, radius = NUM_MILES, page = PAG_NUM)

#### Create the URL, then open it in the browser and post a screenshot of only the data that you need
```{r}
site <- "https://www.dice.com/jobs"
job <- "Data+Analyst"
region <- "NY"
miles <- 30
pag <- 1

# Build the search URL for the first page of results
url <- create_url(website = site, title = job, location = region, radius = miles, page = pag)
url

# Screenshot of the search results page
knitr::include_graphics('1A')
```
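
The search term is pasted into the URL as-is, which is why the job title is written as "Data+Analyst" rather than "Data Analyst". A small sketch of a helper that does this conversion follows. `encode_title()` is a hypothetical name, not part of the assignment, and it only handles whitespace, not other special characters.

```r
# Copy of create_url() from the functions section, so this block runs on its own
create_url <- function(website, title, location, radius, page){
  url <- paste0(website, "?q=", title, "&l=", location, "&radius=", radius)
  url <- paste0(url, "&startPage=", page, "&jobs")
  return(url)
}

# Hypothetical helper: trim the title and replace runs of whitespace with "+"
encode_title <- function(title) {
  gsub("\\s+", "+", trimws(title))
}

example_url <- create_url(website = "https://www.dice.com/jobs",
                          title = encode_title("Data Analyst"),
                          location = "NY", radius = 30, page = 1)
example_url
```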


### 1B) From the target website, determine the number of pages for the given search. Create a variable "num_pages" equal to the maximum number of pages for the job search. Create a variable "url" using the create_url() function with the same parameters as the previous search; for the page number use "i", since we are looping over all the pages.

* Commands: create_url(website = JOBSITE , title = JOBTITLE, location = STATE, radius = NUM_MILES, page = PAG_NUM)

```{r, eval=FALSE}

# The search parameters (site, job, region, miles) are defined in 1A.
# Maximum number of result pages, read off the website for this search
num_pages <- 92

# Loop over every page of the job search
for (i in 1:num_pages) {

  # Create the URL for page i of the job search
  url <- create_url(website = site, title = job, location = region, radius = miles, page = i)

  # Read the URL and collect the HTML code
  web_html <- read_html(url)

  if (i == 1) {

    # First page: create the tibble from the extracted information
    job_data <- create_tibble(html = web_html)

  } else {

    # Later pages: append the new observations to the existing tibble
    df <- create_tibble(html = web_html)
    job_data <- bind_rows(job_data, df)
  }

  # Wait half a second before requesting the next page
  Sys.sleep(0.5)
}

```
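
A single failed request stops the loop above partway through. The sketch below shows one way to guard against this, under stated assumptions: `scrape_pages()`, `fetch_page`, and `fake_fetch()` are hypothetical names, and the stub stands in for `read_html()` plus `create_tibble()` so the example runs without network access.

```r
library(dplyr)   # bind_rows()
library(tibble)  # tibble()

# fetch_page: a function that takes a page number and returns a tibble.
# Pages that raise an error are reported and skipped instead of stopping the run.
scrape_pages <- function(pages, fetch_page, pause = 0.5) {
  results <- lapply(pages, function(p) {
    out <- tryCatch(
      fetch_page(p),
      error = function(e) {
        message("Skipping page ", p, ": ", conditionMessage(e))
        NULL
      }
    )
    Sys.sleep(pause)  # be polite to the server between requests
    out
  })
  # Drop the failed pages and stack the rest into a single tibble
  bind_rows(Filter(Negate(is.null), results))
}

# Stub that fails on page 2 to simulate a network error
fake_fetch <- function(p) {
  if (p == 2) stop("simulated network error")
  tibble(title = paste("Job", p), page = p)
}

collected <- scrape_pages(1:3, fake_fetch, pause = 0)
collected
```

Collecting the pages in a list and binding them once at the end also avoids the special case for the first page.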

### 1C) Make sure that the data was collected correctly by using the functions below to inspect and summarize it. Describe the summary statistics and note any significant observations.

* Dataframe: job_data
* Commands: head() summary() 

```{r}
summary(job_data)
head(job_data)
```
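
Because every column in job_data is text, `summary()` mostly reports lengths and classes. A few additional checks can help. This is a sketch on a small made-up tibble (`example_jobs` is hypothetical) rather than on the scraped data.

```r
library(dplyr)   # distinct()
library(tibble)  # tibble()

# Made-up data with one duplicated row and one row with missing fields
example_jobs <- tibble(
  title   = c("Data Analyst", "Data Analyst", "BI Analyst", "Data Analyst"),
  company = c("Acme", "Acme", "Globex", "Initech"),
  city    = c("New York", "New York", "Rochester", NA),
  state   = c("NY", "NY", "NY", "NY"),
  zipcode = c("10001", "10001", "14604", NA)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(example_jobs))
missing_per_column

# Number of rows that exactly repeat an earlier row
n_duplicates <- nrow(example_jobs) - nrow(distinct(example_jobs))
n_duplicates
```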

### 1D) After making sure that the data was collected correctly, save the data as a CSV file.

* Commands: write_csv(x = DATAFRAME, path = "data/DATAFRAME" )

```{r}
# The output file is passed by position: newer readr versions name this
# argument `file` and deprecate `path`
write_csv(job_data, "data/dice-ny.csv")
```
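
`write_csv()` does not create missing folders, so the call above fails if the `data/` directory does not exist yet. Below is a hedged sketch of creating the folder first and reading the file back as a check. It writes to a temporary directory and uses a made-up tibble (`example_jobs`), so the paths and file name are assumptions.

```r
library(readr)   # write_csv(), read_csv(), cols(), col_character()
library(tibble)  # tibble()

example_jobs <- tibble(title   = c("Data Analyst", "BI Analyst"),
                       zipcode = c("10001", "14604"))

# Create the output folder if it is not there yet
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

out_file <- file.path(out_dir, "dice-example.csv")
write_csv(example_jobs, out_file)

# Read the file back with all columns as text, so ZIP codes are not
# converted to numbers
check <- read_csv(out_file, col_types = cols(.default = col_character()))
check
```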


-------------

## Task 2: Visualization Analysis - Tableau

-------------

### 2A) Upload your data in CSV format to Tableau and make any needed changes to the data types (GEOLOCATION, TEXT, NUMERIC). Take a screenshot of Tableau's data inspection.

```{r}
knitr::include_graphics('2A.PNG')
```


### 2B) Using Tableau's geolocation features, map the cities using bubbles, where the size of each bubble is the cumulative number of job listings in that city. Note any interesting patterns and add a screenshot of your visualization.

```{r}
knitr::include_graphics('2B.PNG')
```
This image shows the number of Data Analyst positions around New York City. There are more jobs close to the city than farther away, which makes sense because the city is a natural hub for company offices.

### 2C) Create a tree map to compare the different cities and the cumulative number of job postings in each city. Note any interesting patterns and add a screenshot of your visualization.

```{r}
knitr::include_graphics('2C.PNG')
```
Clearly the state as a whole will have more open positions than any individual city. As someone who is not very familiar with cities in New York State, I found it interesting that the cities listed are the ones I have heard of. This makes sense because most people are more familiar with larger cities, which have more job opportunities.

### 2D) Create a bar plot by state and city to display the cumulative number of jobs in each city. Note any interesting patterns and add a screenshot of your visualization.

```{r}
knitr::include_graphics('2D.PNG')
```
According to the graph, NYC has the most opportunities followed by Rochester.
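
The ranking in the bar plot can be cross-checked in R by counting postings per city. The sketch below uses a small made-up tibble (`example_jobs`), so the counts are illustrative only and are not the results from the scraped data.

```r
library(dplyr)    # count()
library(ggplot2)  # ggplot(), aes(), geom_col(), coord_flip(), labs()
library(tibble)   # tibble()

# Made-up postings: three in New York, two in Rochester, one in Albany
example_jobs <- tibble(
  state = rep("NY", 6),
  city  = c("New York", "New York", "New York", "Rochester", "Rochester", "Albany")
)

# Number of postings per state and city, largest first
city_counts <- count(example_jobs, state, city, sort = TRUE)
city_counts

# Horizontal bar chart of the same counts
city_plot <- ggplot(city_counts, aes(x = reorder(city, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "City", y = "Number of job postings")
city_plot
```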

### 2E) Create a dashboard to display the three plots above. Use half of the dashboard to display the map with the location of the cities. On the bottom of the dashboard, place the other two charts and add titles to each chart. Note any interesting patterns and add a screenshot of your visualization.
```{r}
knitr::include_graphics('2E.PNG')
```
Because the data was consistent throughout this process, the dashboard leads to the same conclusion as each of the individual graphs: larger, more populated cities have more job prospects.

-------------

## Task 3: Watson Analysis

-------------

To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked. 

### 3A) Upload your data to Watson and explore the different insights. Take 3 screenshots of your insights and describe your findings.

```{r}
knitr::include_graphics('3A.PNG')

```
I thought it would be interesting to include the same graph shown in Tableau to see if it produced the same results. It did, so there is no new information, but it demonstrates consistency in the findings.

### 3B) Watson Analytics Insights, describe your findings.
```{r}
knitr::include_graphics('3B.PNG')
```
While I couldn't get a good screenshot, this graph was a decision tree that broke down job titles by location, which was very interesting. Following its branches, you could see which jobs were more likely to be found in which locations and how frequently they were posted.

### 3C) Watson Analytics Insights, describe your findings.
```{r}
knitr::include_graphics('3C.PNG')
```
This broke down the number of companies located in each area. All of the companies were in New York, so it clearly had the highest concentration. It was interesting that the breakdown by company, rather than by position, was more evenly dispersed. This suggests that larger cities may not have more companies, but rather a similar number of companies with a greater need for new employees.

