About

Load Packages in R/RStudio

We are going to use the tidyverse, a collection of R packages designed for data science.
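The functions in the next section call read_html(), html_nodes(), and html_text(), which come from the rvest package. rvest is installed with the tidyverse but is not attached by library(tidyverse), so a minimal setup sketch, assuming both packages are already installed, might look like this:

```r
# Assumption: tidyverse and rvest are already installed on this machine
library(tidyverse)  # attaches tibble, stringr, dplyr, readr and the %>% pipe
library(rvest)      # read_html(), html_nodes(), html_text()
```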

Web Scraping Functions


# This function creates the URL used to extract data from the www.dice.com website
create_url <- function(website, title, location, radius, page){
  url <- paste0(website, "?q=", title, "&l=", location, "&radius=", radius)
  url <- paste0(url, "&startPage=", page, "&jobs")
  return(url)
}

# This function uses the unstructured data from the HTML file to create a dataframe
# with only the data that is needed for analysis
create_tibble <- function(html){
  
  search_title <- html %>% 
    html_nodes(".complete-serp-result-div") %>%
    # The leading "." keeps the XPath relative to each result div
    html_nodes(xpath = ".//a/@title") %>%
    html_text()
  
  search_region <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=addressRegion]") %>%
    html_text()
  
  search_zipcode <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=postalCode]") %>%
    html_text()
  
  search_address <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="address"]') %>%
    html_nodes("[itemprop=streetAddress]") %>%
    html_text() %>% 
    str_replace(pattern = paste0(", ",search_region), "")
  
  search_company <- html %>% 
    html_nodes(".complete-serp-result-div") %>% 
    html_nodes('[itemprop="hiringOrganization"]') %>%
    html_nodes("[itemprop=name]") %>%
    html_text()
  
  df <- tibble(title = search_title,
               company = search_company,
               city = search_address,
               state = search_region,
               zipcode = search_zipcode)

  return(df)
}
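To make the two helpers more concrete, here is a small sketch of what create_url() returns and of the cleanup step applied to search_address. The base URL and the address strings are made up for illustration; they are assumptions, not values taken from the actual search.

```r
library(stringr)

# Copy of create_url() from above so this block runs on its own
create_url <- function(website, title, location, radius, page) {
  url <- paste0(website, "?q=", title, "&l=", location, "&radius=", radius)
  paste0(url, "&startPage=", page, "&jobs")
}

# Hypothetical arguments; the base URL in particular is an assumption
example_url <- create_url(
  website  = "https://www.dice.com/jobs",
  title    = "Data+Analyst",  # spaces already encoded as "+"
  location = "NY",
  radius   = 30,
  page     = 2
)
print(example_url)
# "https://www.dice.com/jobs?q=Data+Analyst&l=NY&radius=30&startPage=2&jobs"

# Made-up address strings showing the cleanup used for search_address:
# each pattern is matched against the address in the same position
address <- c("New York, NY", "Rochester, NY")
region  <- c("NY", "NY")
city <- str_replace(address, pattern = paste0(", ", region), "")
print(city)  # city is c("New York", "Rochester")
```

Because str_replace() pairs addresses and patterns by position, the cleanup relies on search_address and search_region having the same length.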

Task 1: Data Collection - Web Scraping


1B) From the target website, determine the number of pages for the given search. Create a variable “num_pages” equal to the maximum number of pages for the job search. Create a variable “url” using the create_url() function with the same parameters as the previous search; for the page number, use “i”, since we are looping over all the pages.

  • Commands: create_url(website = JOBSITE , title = JOBTITLE, location = STATE, radius = NUM_MILES, page = PAG_NUM)
#site = JOB_WEBSITE
#job = "Data+Analyst"
#region = STATE_TWO_LETTERS
#miles = MILES_NUM
#pag = PAG_NUM

num_pages <- 92

# COMMENT: Loop over the max number of pages for the job search
for (i in 1:num_pages) {
  
  # TODO: Create a url for the job search
  #url <- create_url()
  # site, job, and region must be defined before this loop runs (see the template above)
  url <- create_url(website = site, title = job, location = region, radius = 30, page = i)
  
  # COMMENT: read the created URL and collects the html code
  web_html <- read_html(url)

    
  # COMMENT: If statement to create the first dataframe
  if(i == 1) {
    
    # COMMENT: Creates a tibble dataframe extracting information from the html code
    job_data <- create_tibble(html = web_html)
    
  } else {

    # COMMENT: Append the new observations to the first dataframe
    df <- create_tibble(html = web_html)
    job_data <- bind_rows(job_data, df)
  }
  
  # COMMENT: We have to wait briefly (half a second) before moving to the next page
  Sys.sleep(0.5)
}
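The loop above grows job_data one page at a time and needs a special case for the first page. As an alternative sketch, each page's tibble can be stored in a list and combined once at the end. Here fake_page() is a hypothetical stand-in for create_tibble(read_html(url)) so that the example runs without a network connection, and the page count is reduced to 3.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for create_tibble(read_html(url)); returns one row per page
fake_page <- function(i) {
  tibble(title = paste("Job", i), page = i)
}

num_pages <- 3  # small value for illustration only
pages <- vector("list", num_pages)

for (i in seq_len(num_pages)) {
  pages[[i]] <- fake_page(i)  # store the tibble for page i
  Sys.sleep(0.1)              # short pause between requests
}

# Combine all pages in a single step; no if/else for the first page is needed
job_data <- bind_rows(pages)
print(nrow(job_data))
```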

1C) Make sure that the data was collected correctly by using the functions to inspect and summarize the data. Describe the summary statistics and note any significant observations.

  • Dataframe: job_data
  • Commands: head() summary()
summary(job_data)
head(job_data)

1D) After making sure that the data was collected correctly, save the data as a CSV file.

  • Commands: write_csv(x = DATAFRAME, path = "data/DATAFRAME")
write_csv(x = job_data, path = "data/dice-ny.csv")
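In readr 1.4.0 and later, the path argument of write_csv() is deprecated in favor of file, and the call fails if the target folder does not exist. The sketch below shows the same save step under those assumptions; the two rows of job_data are made up, and a temporary folder is used so the example does not touch the project's data/ directory.

```r
library(readr)
library(tibble)

# Made-up rows standing in for the scraped job_data
job_data <- tibble(
  title = c("Data Analyst", "Senior Data Analyst"),
  state = c("NY", "NY")
)

# Create the output folder first; write_csv() does not create it
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

out_file <- file.path(out_dir, "dice-ny.csv")
write_csv(x = job_data, file = out_file)  # `file` replaces `path` in readr >= 1.4.0

# Read the file back to confirm that it was written
check <- read_csv(out_file, col_types = cols())
print(nrow(check))
```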

Task 2: Visualization Analysis - Tableau


2A) Upload your data in CSV format to Tableau and make any needed changes to the data types (GEOLOCATION, TEXT, NUMERIC). Take a screenshot of Tableau’s data inspection.

knitr::include_graphics('2A.PNG')

2B) Using Tableau’s geolocation features, map the cities using bubbles, where the size of each bubble is the cumulative number of job listings in that city. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('2B.PNG')

This image shows the number of Data Analyst positions around New York City. There are more jobs closer to the city than farther away, which makes sense because the city is a natural hub for company offices.

2C) Create a tree map to compare the different cities and the cumulative number of job postings in each city. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('2C.PNG')

Clearly, the state as a whole will have more open positions than any individual city. As someone who is not from New York State and not very familiar with its cities, I found it interesting that the cities listed are the ones I have heard of. This makes sense because most people are more familiar with larger cities, which have more job opportunities.

2D) Create a bar plot by State and City to display the cumulative number of jobs in each city. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('2D.PNG')

According to the graph, NYC has the most opportunities followed by Rochester.

2E) Create a dashboard to display the three plots above. Use half of the dashboard to display the map with the location of the cities. On the bottom of the dashboard, place the other two charts and add titles to each chart. Note any interesting patterns and add a screenshot of your visualization.

knitr::include_graphics('2E.PNG')

Because the data throughout this process was consistent, the dashboard gives the same conclusions that each of the graphs showed: larger, more populated cities have greater numbers of job prospects.

Task 3: Watson Analysis


To complete the last task, follow the directions found below. Make sure to screenshot and attach any pictures of the results obtained or any questions asked.

3A) Upload your data to Watson and explore the different insights. Take 3 screenshots of your insights and describe your findings.

knitr::include_graphics('3A.PNG')

I thought it would be interesting to include the same graph shown in Tableau to see if it produced the same results. It did, so there is no new information, but it demonstrates consistency in the findings.

3B) Watson Analytics Insights, describe your findings.

knitr::include_graphics('3B.PNG')

While I couldn’t get a good screenshot, this graph was a decision tree that broke down job titles by location, which was very interesting. As it filtered through, you could see which jobs were more likely to be found in which locations and how frequently they were posted.

3C) Watson Analytics Insights, describe your findings.

knitr::include_graphics('3C.PNG')

This broke down the number of companies located in each area. All of the companies were in New York, so that clearly had the highest concentration. It was interesting, though, that the breakdown by company rather than by position was more evenly dispersed. This suggests that larger cities may not have more companies, but rather a similar number of companies with a greater need for new employees.

