Exploring The Job Market for Data Scientists and Data Analysts In Boston, New York, and Philadelphia

Scraping Indeed with rvest | Data Wrangling with the Tidyverse | Text Mining with stringr & the Tidyverse | Visualization with ggplot2

Brian Ward
3/13/2019

Introduction

In this project, I aimed to explore the job market for data analyst and data scientist roles in Boston, New York, and Philadelphia. It seemed like a great opportunity to learn about web scraping, so I built a scraper to pull the listings from Indeed and then explored the data.

Part-1- Scraping the Data from Indeed

For the scraper I used two job titles in three cities, producing six different search terms:

  1. Data Scientist in New York
  2. Data Analyst in New York

  3. Data Scientist in Philadelphia
  4. Data Analyst in Philadelphia

  5. Data Scientist in Boston
  6. Data Analyst in Boston

I used a lot of great resources that walked me through web scraping and the rvest package, so instead of walking through building my own scraper I will just link the posts that I used as guidance. Once you understand how to use the CSS Selector tool and the basic layout of an HTML document, it's fairly straightforward. Tip for beginners: pay attention to what changes in the URL as you navigate the site of interest.

Beginner’s Guide on Web Scraping in R
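To make the URL tip concrete: on Indeed, the search keywords (q), the location (l), and the result offset (start) all live in the query string, which is exactly what the scraper below pieces together with paste0(). A tiny illustrative snippet, not part of the scraper itself:

# Page 1 of a "data analyst" search in Boston:
paste0("https://www.indeed.com/jobs?", "q=data+analyst&l=Boston,+MA", "&start=", 0)
#> [1] "https://www.indeed.com/jobs?q=data+analyst&l=Boston,+MA&start=0"
# Page 2 is the same URL with start=10, page 3 with start=20, and so on.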

Here's my scraper:

library(rvest)
library(xml2)
library(tidyverse)
library(lubridate)  # for today()

search <- c("q=data+analyst&l=New+York,+NY", 
            "q=data+scientist&l=New+York,+NY",
            "q=data+analyst&l=Boston,+MA", 
            "q=data+scientist&l=Boston,+MA",
            "q=data+analyst&l=Philadelphia,+PA" , 
            "q=data+scientist&l=Philadelphia,+PA")

page_index <- seq(from = 0, to = 990, by = 10)
full_df1 <- data.frame()


for(i in seq_along(search)){
  
  first_page_url <- paste0("https://www.indeed.com/jobs?", search[i])
  
  for(start in page_index) {
    
    url <- paste0(first_page_url, "&start=", start)
    
    try(page <- xml2::read_html(url))
    Sys.sleep(2)
    
    # job titles
    try(job_title <- page %>% 
          rvest::html_nodes("div") %>%
          rvest::html_nodes(xpath = '//a[@data-tn-element = "jobTitle"]') %>%
          rvest::html_attr("title"))
    # company names
    try(company_name <- page %>% 
          rvest::html_nodes("span") %>% 
          rvest::html_nodes(xpath = '//*[@class="company"]') %>% 
          rvest::html_text() %>%
          stringi::stri_trim_both())
    # job locations
    try(job_location <- page %>% 
          rvest::html_nodes("span") %>% 
          rvest::html_nodes(xpath = '//*[@class="location"]') %>% 
          rvest::html_text() %>%
          stringi::stri_trim_both())
    # links to the full job postings
    try(links <- page %>% 
          rvest::html_nodes("div") %>%
          rvest::html_nodes(xpath = '//*[@data-tn-element="jobTitle"]') %>%
          rvest::html_attr("href"))
    
    job_description <- c()
    footer <- c()
    
    # follow each link to grab the full description and footer
    for(j in seq_along(links)) {
      url2 <- paste0("https://indeed.com/", links[j])
      try(page2 <- xml2::read_html(url2))
      
      try(job_description[j] <- page2 %>%
            rvest::html_nodes("span") %>% 
            rvest::html_nodes(xpath = '//*[@class="jobsearch-JobComponent-description icl-u-xs-mt--md"]') %>% 
            rvest::html_text() %>%
            stringi::stri_trim_both())
      try(footer[j] <- page2 %>%
            rvest::html_nodes("span") %>% 
            rvest::html_nodes(xpath = '//*[@class="jobsearch-JobMetadataFooter"]') %>% 
            rvest::html_text())
    }
    
    df <- data.frame(job_title, company_name, job_location, job_description, footer)
    df$search_terms <- search[i]
    df$date_scraped <- today()
    full_df1 <- rbind(full_df1, df)
  } 
}
  
write_csv(full_df1, "~/Desktop/CS/webscraping/scrape_2019_03_04.csv")


Part-2- Cleaning and Organizing the Data

Let's start by taking a look at the data frame that the scraper produced.

postings1 <- read_csv("~/Desktop/CS/webscraping/full_first_scrape_2019_02_05.csv")
glimpse(postings1)
Observations: 5,299
Variables: 7
$ job_title       <chr> "Analyst, Retail Space Planning", "Retail Mercha…
$ company_name    <chr> "SAS Retail Services", "SAS Retail Services", "S…
$ job_location    <chr> "Quincy, MA", "Boston, MA", "Braintree, MA", "Bo…
$ job_description <chr> "We’re looking for Planning Analysts to join our…
$ footer          <chr> "8 days ago - save job - report jobIf you requir…
$ search_terms    <chr> "q=data+analyst&l=Boston,+MA", "q=data+analyst&l…
$ date_scraped    <date> 2019-02-05, 2019-02-05, 2019-02-05, 2019-02-05,…

As you can see, we have 7 variables, all of which are pretty self-explanatory except for the footer. I wanted the footer because it contains the “time ago” element, which tells me how long each job posting had been up. The job_description is by far the densest element, containing all of the text from the job listing. This is the part I was most interested in, because I could search through the text for some interesting information.

In this part, I am going to extract a few more features from the data before I look for trends and compare the analyst and scientist positions visually.
Here are my 6 tasks for this part:
1. Creating variables title_search and city_search from the search_terms variable.
2. Creating variable title_term from the job_title variable.
3. Creating variable days_ago from the footer.
4. Creating variable min_ed “minimum level of education” from the job descriptions.
5. Creating variable min_exp “minimum years of experience” from the job descriptions.
6. Adding multiple “technology/tool” variables to test for popularity.

Let’s get started.

1. Creating Variables title_search and city_search from the search_terms Variable

Here I simply wanted to pull the searched job title (title_search) and location (city_search) into their own columns so I can refer to them directly. The search_terms value was the segment of the URL from the scraper that told Indeed what keywords and location to use in each search.

postings1$title_search <- NA
postings1$city_search <- NA
# Boston
postings1[which(postings1$search_terms == "q=data+analyst&l=Boston,+MA"), "title_search"] <- "data analyst"
postings1[which(postings1$search_terms == "q=data+analyst&l=Boston,+MA"), "city_search"] <- "Boston"
postings1[which(postings1$search_terms == "q=data+scientist&l=Boston,+MA"), "title_search"] <- "data scientist"
postings1[which(postings1$search_terms == "q=data+scientist&l=Boston,+MA"), "city_search"] <- "Boston"
# New York
postings1[which(postings1$search_terms == "q=data+analyst&l=New+York,+NY"), "title_search"] <- "data analyst"
postings1[which(postings1$search_terms == "q=data+analyst&l=New+York,+NY"), "city_search"] <- "New York"
postings1[which(postings1$search_terms == "q=data+scientist&l=New+York,+NY"),"title_search"] <- "data scientist"
postings1[which(postings1$search_terms == "q=data+scientist&l=New+York,+NY"), "city_search"] <- "New York"
# Philly
postings1[which(postings1$search_terms ==  "q=data+analyst&l=Philadelphia,+PA"), "title_search"] <- "data analyst"
postings1[which(postings1$search_terms ==  "q=data+analyst&l=Philadelphia,+PA"), "city_search"] <- "Philadelphia"
postings1[which(postings1$search_terms == "q=data+scientist&l=Philadelphia,+PA"), "title_search"] <- "data scientist"
postings1[which(postings1$search_terms == "q=data+scientist&l=Philadelphia,+PA"), "city_search"] <- "Philadelphia"

Simple enough, now I can filter through the listings using these variables.
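As an aside, the same mapping could be written more compactly. Here is a sketch that should be equivalent to the block above, just vectorized with stringr's str_detect() and dplyr's case_when() instead of indexing row by row:

# A more compact, equivalent version of the block above
postings1 <- postings1 %>%
  mutate(
    title_search = case_when(
      str_detect(search_terms, "data\\+analyst")   ~ "data analyst",
      str_detect(search_terms, "data\\+scientist") ~ "data scientist"
    ),
    city_search = case_when(
      str_detect(search_terms, "Boston")       ~ "Boston",
      str_detect(search_terms, "New\\+York")   ~ "New York",
      str_detect(search_terms, "Philadelphia") ~ "Philadelphia"
    )
  )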

2. Creating variable title_term from the job_title

I expected that the searches might overlap a bit (i.e., the search term “data analyst” would yield data scientist job titles and vice versa), so I created the title_term variable, which uses the actual job title to differentiate between analyst and scientist positions. Note: as you can see, I truncated the words in the patterns so that they will pick up variants like “analyst”, “analytics”, “analysis”, etc.

postings1$title_term <- NA
postings1$job_title <- str_to_lower(postings1$job_title)
postings1$job_title <- str_trim(postings1$job_title, side = "both")

postings1[str_which(postings1$job_title, "analy"), "title_term"] <- "analyst"
postings1[str_which(postings1$job_title, "scien"), "title_term"] <- "science"

# The NA's here are still worth showing in the visualizations, so label listings whose job titles contain neither "analy" nor "scien" as "neither"
postings1[(which(is.na(postings1$title_term))), "title_term"] <- "neither"
glimpse(postings1)
Observations: 5,299
Variables: 10
$ job_title       <chr> "analyst, retail space planning", "retail mercha…
$ company_name    <chr> "SAS Retail Services", "SAS Retail Services", "S…
$ job_location    <chr> "Quincy, MA", "Boston, MA", "Braintree, MA", "Bo…
$ job_description <chr> "We’re looking for Planning Analysts to join our…
$ footer          <chr> "8 days ago - save job - report jobIf you requir…
$ search_terms    <chr> "q=data+analyst&l=Boston,+MA", "q=data+analyst&l…
$ date_scraped    <date> 2019-02-05, 2019-02-05, 2019-02-05, 2019-02-05,…
$ title_search    <chr> "data analyst", "data analyst", "data analyst", …
$ city_search     <chr> "Boston", "Boston", "Boston", "Boston", "Boston"…
$ title_term      <chr> "analyst", "analyst", "analyst", "analyst", "ana…

I can now use this variable in my visualizations to compare the analyst and scientist roles.

3. Creating variable days_ago from the footer.

I wanted to know how long each listing had been up, and the only place I could find this in the HTML was in the footer. Each footer had a few other things in it along with the ‘time ago’ value. Let's see what an example footer looks like:

postings1$footer[1]
[1] "8 days ago - save job - report jobIf you require alternative methods of application or screening, you must approach the employer directly to request this as Indeed is not responsible for the employer's application process."

You can see it looks like there are 3 different items:
1. the “time ago” element (they’re not all measured in days)
2. a “save job” element
3. a “report job” element

I just need to pull out the first element. Unfortunately, they weren't all formatted like this, so I couldn't simply split them on the “-” separator. Instead, I search for the pattern “ago” and then select it plus the 10 preceding characters to make sure I get the whole element. I know this is a really confusing for-loop, but the rest of it is just cleaning and trimming the string so that it starts with the first number of the time_ago element, which you can see in the output:

postings1$time_ago <- NA
for( i in 1:nrow(postings1)){
  # locate " ago" in the footer, then take it plus the 10 preceding characters
  ago_pos <- as.vector(str_locate(postings1[i, "footer"], " ago"))
  var <- str_sub(postings1[i, "footer"],
                 start = if_else((ago_pos[1] - 10) < 0, 0, (ago_pos[1] - 10)),
                 end = ago_pos[2])
  var1 <- unlist(var)
  var2 <- str_split(var, "-")
  var3 <- unlist(var2)
  var4 <- var3[if_else(length(var3) < 2, 1, 2)]
  var5 <- str_trim(var4, side = "both")
  postings1[i, "time_ago"] <- var5
}
# Let's take a look at all the unique values
unique(postings1$time_ago)
 [1] "8 days ago"   "29 days ago"  "4 days ago"   "25 days ago" 
 [5] "18 hours ago" "9 days ago"   "5 days ago"   "4 months ago"
 [9] "30+ days ago" "3 days ago"   "20 days ago"  "1 day ago"   
[13] "13 days ago"  "30 days ago"  "11 days ago"  "28 days ago" 
[17] "22 days ago"  "15 days ago"  "23 days ago"  "23 hours ago"
[21] "12 days ago"  "10 days ago"  "7 days ago"   "2 days ago"  
[25] "22 hours ago" "21 hours ago" "14 days ago"  "19 days ago" 
[29] "17 days ago"  "21 days ago"  "16 days ago"  "11 hours ago"
[33] "18 days ago"  "26 days ago"  "7 hours ago"  "15 hours ago"
[37] "8 hours ago"  "24 days ago"  "3 hours ago"  "19 hours ago"
[41] "27 days ago"  "16 hours ago" "6 days ago"   "17 hours ago"
[45] "4 hours ago"  "2 hours ago"  "9 hours ago"  "1 hour ago"  
[49] NA             "13 hours ago" "5 hours ago"  "10 hours ago"
[53] "6 hours ago"  "14 hours ago" "12 hours ago" "20 hours ago"

As you can see, the time_ago variable goes all the way from “1 hour ago” to “4 months ago”.

Note: I actually had a hard time believing that this data was good because of the jump from 30+ days to 4 months, but since this project is more about the process, I chose to use it anyway. I also only ran the scraper twice; if I were to run it daily and track exactly how long each posting stayed up before it was taken down, we might be able to infer the supply/demand of different roles from the average number of days those listings were up.
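For what it's worth, the same “time ago” phrase could probably be pulled out with a single regular expression instead of the loop above. A sketch, which should return the same kind of values (NA where no match is found):

# One-pass alternative to the loop above: match a number (optionally with "+"),
# a unit, and the word "ago" anywhere in the footer.
postings1$time_ago <- str_extract(
  postings1$footer,
  "[0-9]+\\+?\\s+(hour|hours|day|days|month|months)\\s+ago"
)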

Convert time_ago to a days_ago Number
I wanted to represent this data with a numerical days_ago value, so I decided to give each listing a number from 1 to 30 days with some rounding: a listing with an “hour ago” value becomes 1 day, and one with a “month ago” value becomes 30.

postings1$days_ago <- NA
# These ones are straightforward
postings1[str_which(postings1$time_ago, "month"), "days_ago"] <- 30
postings1[str_which(postings1$time_ago, "hour"), "days_ago"] <- 1
# And now for the day values: grab the leading number
which_days <- str_which(postings1$time_ago, "day")
for(i in which_days){
  postings1[i, "days_ago"] <- regmatches(postings1[i, "time_ago"], gregexpr("[[:digit:]]+", postings1[i, "time_ago"]))[[1]][1]
}

Now for the Interesting Part: Pulling Data from the Job Descriptions
This is the part I was really interested in. Let's start by taking a look at what one of these job descriptions looks like.

postings1$job_description[[5]]
[1] "Overview :Supports the J.Jill customer analytics team, uses data to get both a broad and deep understanding of the customers and her behavior, analyzes and evaluates marketing efforts, facilitates contact campaigns and provides various ad hoc analytics. Provides database support to the business and assists the Sr. Manager in directing the activities of 3rd party database providers. Designs, runs, and maintains reporting and contributes to predictive analytics modeling efforts.\nResponsibilities :Provides a wide array of analytics/data functions including data access, cleansing, querying, ETL modeling, visualizations etc.\nConverts data and insights into stories through the use of clear, insightful, and scalable visualizations\nSupports and enhances corporate projects and goals such as augmenting customer segmentation strategies\nInterfaces with critical third party database vendors and develops “best practices” for database support\nCommunicates business needs to third party vendors, monitors their progress, and verifies the quality and accuracy of the executed project\nSupports Senior Manager on Big Data initiatives (i.e. structured and un-structured sources) such as integrating online behavioral data sets with our current customer database platform\nProvides guidance and support to end users and the Business Analytics team in the creation and execution of data pulls from the customer database\nBuilds ad hoc and production-level customer behavior reporting for a variety of projects including: Lifetime Value Analysis, Product Affinity, and Market Basket Reporting\nQualifications :Highly data savvy, flexible with tools, willing and able to quickly learn new tools/techniques and extreme “can do” attitude\nBroad understanding and experience with SQL, data stores, data modeling, analytical tools, and programming languages\nWorking knowledge of statistical or general programming languages such as R, Python, Julia etc. required\nProficiency with visualization tools (especially Tableau) required\nProficiency with Alteryx a plus\nAbility to understand complex business processes or requirements and translate them into simple business practices\nProficiency in MS Excel required, VBA a plus\nStrong technical skills with the ability to learn/understand the specialty womens apparel business\nAnalytical thinking and problem solving ability\nCapable of assisting others in the proper use of the Marketing Database\nIntellectual curiosity\nAcute attention to detail\nB.S. or M.S. degree in Computer Science, Business, Statistics, Mathematics, Economics or related discipline\n2+ Experience with, visualization, SQL, ETL, and statistical/general programming and database tools\nKnowledge of data management, business intelligence, master data management, and data services architectural frameworks\n2+ years’ experience with applications of predictive analytics, statistical modeling, and data mining or other quantitative techniques a plus\nExperience in retail industry, especially apparel or fashion, a plus"

So you can see it's a big, messy string… I'm going to want to clean this up a bit. I definitely want to remove any case differences, so I'll just make everything lowercase. I am also going to remove some annoying punctuation, as well as the “\n” newlines, so they don't get in the way of my pattern matching.

# For some reason there were 3 listings that didn't have job descriptions, so I'm just going to remove those.
postings1 <- filter(postings1, !is.na(postings1$job_description))
postings1$job_description <- str_to_lower(postings1$job_description)
postings1$job_description <- str_replace_all(postings1$job_description, "\n", " ")
postings1$job_description <- str_replace_all(postings1$job_description, ",", " ")
postings1$job_description <- str_replace_all(postings1$job_description, "/", " ")
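The same cleanup could also be done in one pipeline with a small regex character class; a sketch that should be equivalent to the lowercasing and the three str_replace_all() calls above:

# Equivalent one-pass version of the cleanup above:
# lowercase everything, then turn newlines, commas, and slashes into spaces.
postings1$job_description <- postings1$job_description %>%
  str_to_lower() %>%
  str_replace_all("[\\n,/]", " ")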

Cool, now we’re ready to look around a bit.

4. Creating variable min_ed “minimum level of education” from the job descriptions.

So I am basically just going to pick a few string patterns that I think will represent either a bachelor's, master's, or doctorate degree. Then I will detect which job descriptions contain those patterns, and from those results I can derive a minimum level of education.

Here are the words that I chose:

words_bachelors <- c("ba/bs", " bs", "b.s", "bachelors", "bachelor's")
words_masters <-  c(" mba", "masters", "master's", " msc", " ms", "m.s")
words_doctorate <-  c("phd", "p.h.d", "doctorate", "doctoral") 

Note: for abbreviations like “bs” and “ms” I had to put a space in front of the pattern so that it wouldn't match inside other words.

Now I just have to search for these terms in all of the job postings.

postings1$bachelors <- NA
postings1$masters <- NA
postings1$doctorate <- NA

for(i in 1:nrow(postings1)){

  bach <- c()
  for(word in words_bachelors){
    # check each bachelor's pattern against this description
    bach1 <- str_detect(postings1$job_description[[i]], word)
    bach <- c(bach, bach1)
  }
  mast <- c()
  for(word in words_masters){
    # check each master's pattern against this description
    mast1 <- str_detect(postings1$job_description[[i]], word)
    mast <- c(mast, mast1)
  }
  doc <- c()
  for(word in words_doctorate){
    # check each doctorate pattern against this description
    doc1 <- str_detect(postings1$job_description[[i]], word)
    doc <- c(doc, doc1)
  }
  # TRUE if any of the patterns for that level matched
  postings1$bachelors[[i]] <- ifelse(sum(bach) > 0, TRUE, FALSE)
  postings1$masters[[i]] <- ifelse(sum(mast) > 0, TRUE, FALSE)
  postings1$doctorate[[i]] <- ifelse(sum(doc) > 0, TRUE, FALSE)
}
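As a side note, the nested loops above could be avoided by collapsing each keyword list into a single regex with paste(collapse = "|"); a hedged sketch of the vectorized equivalent, which flags a listing if any of the patterns appear:

# Vectorized alternative to the loop above
postings1$bachelors <- str_detect(postings1$job_description, paste(words_bachelors, collapse = "|"))
postings1$masters   <- str_detect(postings1$job_description, paste(words_masters, collapse = "|"))
postings1$doctorate <- str_detect(postings1$job_description, paste(words_doctorate, collapse = "|"))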

To get the minimum level of education, all I need to do is create a min_ed variable and fill it in from the highest level to the lowest; that way, if both a bachelor's and a master's are mentioned in a post, the bachelor's overwrites it.

postings1$min_ed <- NA
postings1[which(postings1$doctorate == TRUE), "min_ed"] <- "doctorate"
postings1[which(postings1$masters == TRUE), "min_ed"] <- "masters"
postings1[which(postings1$bachelors == TRUE), "min_ed"] <- "bachelors"

# Now let's see where we're at...
glimpse(postings1)
Observations: 5,296
Variables: 16
$ job_title       <chr> "analyst, retail space planning", "retail mercha…
$ company_name    <chr> "SAS Retail Services", "SAS Retail Services", "S…
$ job_location    <chr> "Quincy, MA", "Boston, MA", "Braintree, MA", "Bo…
$ job_description <chr> "we’re looking for planning analysts to join our…
$ footer          <chr> "8 days ago - save job - report jobIf you requir…
$ search_terms    <chr> "q=data+analyst&l=Boston,+MA", "q=data+analyst&l…
$ date_scraped    <date> 2019-02-05, 2019-02-05, 2019-02-05, 2019-02-05,…
$ title_search    <chr> "data analyst", "data analyst", "data analyst", …
$ city_search     <chr> "Boston", "Boston", "Boston", "Boston", "Boston"…
$ title_term      <chr> "analyst", "analyst", "analyst", "analyst", "ana…
$ time_ago        <chr> "8 days ago", "8 days ago", "29 days ago", "4 da…
$ days_ago        <chr> "8", "8", "29", "4", "25", "1", "9", "5", "30", …
$ bachelors       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, …
$ masters         <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, …
$ doctorate       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ min_ed          <chr> NA, NA, NA, NA, "masters", "bachelors", NA, NA, …

5. Creating variable min_exp “minimum years of experience” from the job descriptions.

Now I want to do the same thing for years of experience. Again, I started by thinking of the different ways these might be written in the job descriptions. Here's what I came up with:

words_experience <- c("0-1+ year",
                      "0-2+ year",
                      "0-3+ year",
                      "0-4+ year",
                      "0-5+ year",
                      "0-1 year",
                      "0-2 year",
                      "0-3 year",
                      "0-4 year",
                      "0-5 year",
                      "0 - 1+ year",
                      "0 - 2+ year",
                      "0 - 3+ year",
                      "0 - 4+ year",
                      "0 - 5+ year",
                      "0 - 1 year",
                      "0- 2 year",
                      "0- 3 year",
                      "0- 4 year",
                      "0- 5 year",
                      ### 1- 
                      " 1+ year",
                      "1-2+ year",
                      "1-3+ year",
                      "1-4+ year",
                      "1-5+ year",
                      " 1 year",
                      "1-2 year",
                      "1-3 year",
                      "1-4 year",
                      "1-5 year",
                      "1 - 2+ year",
                      "1 - 3+ year",
                      "1 - 4+ year",
                      "1 - 5+ year",
                      "1 - 2 year",
                      "1 - 3 year",
                      "1 - 4 year",
                      "1 - 5 year",
                      ### 2- 
                      " 2+ year",
                      "2-3+ year",
                      "2-4+ year",
                      "2-5+ year",
                      " 2 year",
                      "2-3 year",
                      "2-4 year",
                      "2-5 year",
                      "2 - 3+ year",
                      "2 - 4+ year",
                      "2 - 5+ year",
                      "2 - 3 year",
                      "2 - 4 year",
                      "2 - 5 year",
                      #### 3 -
                      " 3+ year",
                      "3-4+ year",
                      "3-5+ year",
                      " 3 year",
                      "3-4 year",
                      "3-5 year",
                      "3 - 4+ year",
                      "3 - 5+ year",
                      "3 - 4 year",
                      "3 - 5 year",
                      ### 4 - 
                      " 4+ year",
                      "4-5+ year",
                       " 4 year",
                      "4-5 year",
                      "4 - 5+ year",
                      "4 - 5 year",
                      ### 5- 
                      " 5+ year",
                      " 6+ year",
                      " 7+ year",
                      " 8+ year",
                      " 9+ year",
                      " 10+ year")

Now I just have to search for them in each posting and fill in the variable when there's a positive match.

postings1$exp <- NA

for(i in 1:nrow(postings1)){
  for(term in words_experience){
    if(str_detect(postings1$job_description[[i]], term)){
      postings1$exp[i] <- term
    }
    else{
      next
    }
  }
}
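An alternative to enumerating every phrasing would be a single regular expression. Here is a hedged sketch that extracts the experience phrase directly; note it keeps the first match in each description, whereas the loop above keeps the last matching pattern in the list, so the two approaches may disagree on some postings:

# Regex alternative to the word list + loop above:
# matches shapes like "2 year", "5+ year", "3-5 year", and "1 - 3+ year"
exp_pattern <- "[0-9]+\\s*(-\\s*[0-9]+)?\\s*\\+?\\s*year"
postings1$exp <- str_extract(postings1$job_description, exp_pattern)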

Pulling Out the Minimum
Since I know they all start with the minimum number, I can simply select the first character of the string after trimming the whitespace.

postings1$exp <- str_trim(postings1$exp, side = "both")
postings1$min_exp <- NA

for(i in 1:nrow(postings1)){
postings1$min_exp[i] <- str_sub(postings1$exp[i], 1, 1)
}

# now let's take a look
unique(postings1$min_exp)
 [1] NA  "1" "5" "0" "3" "2" "4" "7" "8" "6" "9"

Great, so we have minimum experience values ranging all the way from 0 to 9 years.

6. Adding Multiple “Technology/Tool” Variables to Test for Popularity.

I wanted to look at the different technologies/tools that appear in the job descriptions to see which are the most popular, like Python vs. R, etc. For this part I am going to start by picking the tools/tech I want to look for, then create a new column for each one and fill it with TRUE/FALSE depending on whether it shows up in the job description. I made this list after reading through a few job descriptions and using general knowledge. Let me know if I missed anything big.

postings1 <- mutate(postings1, 
                  python = NA,
                  r = NA,
                  tableau = NA,
                  scala = NA,
                  google_analytics = NA,
                  java = NA,
                  sql = NA,
                  mysql =  NA,
                  nosql = NA,
                  mongodb = NA,
                  excel = NA,
                  powerpoint = NA,
                  matlab = NA,
                  sas = NA,
                  hadoop = NA,
                  tensorflow = NA,
                  postgres = NA,
                  linux = NA,
                  aws = NA,
                  hive = NA,
                  spark = NA,
                  power_bi =  NA,
                  scikit_learn =  NA,
                  azure = NA,
                  emr = NA)

# great, now let's go ahead and fill them in

for(i in 1:nrow(postings1)){
  postings1$python[i] <- if_else(str_detect(postings1$job_description[i], "python"), TRUE, FALSE)
  postings1$r[i] <- if_else(str_detect(postings1$job_description[i],  " r "), TRUE, FALSE)
  postings1$tableau[i] <- if_else(str_detect(postings1$job_description[i], "tableau"), TRUE, FALSE)
  postings1$scala[i] <- if_else(str_detect(postings1$job_description[i], "scala"), TRUE, FALSE)
  postings1$google_analytics[i] <- if_else(str_detect(postings1$job_description[i], "google analytics"), TRUE, FALSE)
  postings1$java[i] <- if_else(str_detect(postings1$job_description[i],  "java"), TRUE, FALSE)
  postings1$sql[i] <- if_else(str_detect(postings1$job_description[i], " sql"), TRUE, FALSE)
  postings1$mysql[i] <- if_else(str_detect(postings1$job_description[i], "mysql"), TRUE, FALSE)
  postings1$nosql[i] <- if_else(str_detect(postings1$job_description[i], "nosql"), TRUE, FALSE)
  postings1$mongodb[i] <- if_else(str_detect(postings1$job_description[i], "mongodb"), TRUE, FALSE)
  postings1$excel[i] <- if_else(str_detect(postings1$job_description[i], "excel"), TRUE, FALSE)
  postings1$powerpoint[i] <- if_else(str_detect(postings1$job_description[i], "powerpoint"), TRUE, FALSE)
  postings1$matlab[i] <- if_else(str_detect(postings1$job_description[i], "matlab"), TRUE, FALSE)
  postings1$sas[i] <- if_else(str_detect(postings1$job_description[i], " sas "), TRUE, FALSE)
  postings1$hadoop[i] <- if_else(str_detect(postings1$job_description[i], "hadoop"), TRUE, FALSE)
  postings1$tensorflow[i] <- if_else(str_detect(postings1$job_description[i], "tensorflow"), TRUE, FALSE)
  postings1$postgres[i] <- if_else(str_detect(postings1$job_description[i], "postgres"), TRUE, FALSE)
  postings1$linux[i] <- if_else(str_detect(postings1$job_description[i], "linux"), TRUE, FALSE)
  postings1$aws[i] <- if_else(str_detect(postings1$job_description[i], "aws"), TRUE, FALSE)
  postings1$hive[i] <- if_else(str_detect(postings1$job_description[i], "hive"), TRUE, FALSE)
  postings1$spark[i] <- if_else(str_detect(postings1$job_description[i], "spark"), TRUE, FALSE)
  postings1$power_bi[i] <- if_else(str_detect(postings1$job_description[i], "power-bi")|str_detect(postings1$job_description[i], "power bi"), TRUE, FALSE)
  postings1$scikit_learn[i] <- if_else(str_detect(postings1$job_description[i], "scikit-learn"), TRUE, FALSE)
  postings1$azure[i] <- if_else(str_detect(postings1$job_description[i], "azure"), TRUE, FALSE)
  postings1$emr[i] <- if_else(str_detect(postings1$job_description[i], "emr"), TRUE, FALSE)
  }
# Now let's see how it worked
postings1[17:41] %>% colSums(na.rm = TRUE)
          python                r          tableau            scala 
            1623             1089              700              733 
google_analytics             java              sql            mysql 
             154              572             1668              125 
           nosql          mongodb            excel       powerpoint 
             127               64             3363              752 
          matlab              sas           hadoop       tensorflow 
             225              473              381              198 
        postgres            linux              aws             hive 
              79              182              635              214 
           spark         power_bi     scikit_learn            azure 
             475              106              117               99 
             emr 
              74 
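As a side note, those 25 nearly identical lines could be generated from a named vector of patterns instead; here is a hedged sketch. Using \b word boundaries would also cut down on false positives (a bare “excel” substring also matches “excellent”, and “scala” matches “scalable”), though the counts shown above were produced with the simpler substring patterns.

# Sketch of a more compact, pattern-driven version of the loop above
tool_patterns <- c(
  python = "\\bpython\\b",
  r      = "\\br\\b",
  sql    = "\\bsql\\b",
  excel  = "\\bexcel\\b",
  scala  = "\\bscala\\b"
  # ...and so on for the rest of the tools
)

for(tool in names(tool_patterns)){
  postings1[[tool]] <- str_detect(postings1$job_description, tool_patterns[[tool]])
}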

Okay great, we have some pretty high numbers there. Now let's take a look at our final data frame before we start on the visualization part.

glimpse(postings1)
Observations: 5,296
Variables: 41
$ date_scraped     <date> 2019-02-05, 2019-02-05, 2019-02-05, 2019-02-05…
$ title_search     <chr> "data analyst", "data analyst", "data analyst",…
$ city_search      <chr> "Boston", "Boston", "Boston", "Boston", "Boston…
$ job_title        <chr> "analyst, retail space planning", "retail merch…
$ title_term       <chr> "analyst", "analyst", "analyst", "analyst", "an…
$ company_name     <chr> "SAS Retail Services", "SAS Retail Services", "…
$ job_location     <chr> "Quincy, MA", "Boston, MA", "Braintree, MA", "B…
$ job_description  <chr> "we’re looking for planning analysts to join ou…
$ time_ago         <chr> "8 days ago", "8 days ago", "29 days ago", "4 d…
$ days_ago         <chr> "8", "8", "29", "4", "25", "1", "9", "5", "30",…
$ bachelors        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ masters          <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,…
$ doctorate        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ min_ed           <chr> NA, NA, NA, NA, "masters", "bachelors", NA, NA,…
$ exp              <chr> NA, "1 year", "1 year", NA, NA, NA, NA, "5+ yea…
$ min_exp          <chr> NA, "1", "1", NA, NA, NA, NA, "5", "1", NA, NA,…
$ python           <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,…
$ r                <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, …
$ tableau          <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,…
$ scala            <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,…
$ google_analytics <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ java             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ sql              <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, …
$ mysql            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ nosql            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ mongodb          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ excel            <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, …
$ powerpoint       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ matlab           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,…
$ sas              <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FA…
$ hadoop           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ tensorflow       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ postgres         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ linux            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ aws              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hive             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ spark            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ power_bi         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ scikit_learn     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ azure            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ emr              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…

Awesome, we now have 41 variables, up from the original seven.

Part-3- Visualization

Now that we have a nice data frame with variables that should help us understand what the job market for data analysts and data scientists looks like in three of the largest cities in the Northeast, let's make some graphs to help us understand the market.

Things to note:

  • title_term depicts whether “analy”, “scien”, or neither was found in the job title. I am using this variable to differentiate between data analyst and data scientist roles throughout the project.

  • The graphs will represent all of the data from Boston, NY, and Philadelphia unless otherwise specified.


Let’s start out by simply looking at the distribution of our listings across each of the three cities.

# okay cool, let's go ahead and make a simple bar chart to show the differences
ggplot(data = postings1, aes(x = city_search, fill = title_term)) + 
  geom_bar(position = 'stack') + 
  scale_fill_brewer(palette = 'Blues') + 
  theme_minimal() + 
  theme(panel.grid.major = element_blank()) + 
  labs(x = "City", y = "Number of Listings") + 
  ggtitle("Number of Listings") + 
  theme(plot.title = element_text(hjust = 0.5))

So we have about 5,300 listings total: half of those are from NY, 30% from Philly, and 20% from Boston. As for analyst vs. scientist roles, there appear to be many more analyst roles overall; only in Boston is the number of scientist roles comparable.

Top Companies in Each City

Now let’s see what the top companies are. I thought it would be better to split this up by each city.


Note: I'm just showing the code block for Boston here to save room.

temp <- postings1 %>% 
  filter(city_search == "Boston") %>%
  group_by(company_name) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(n = 10)

temp2 <- subset(postings1, postings1$company_name %in% temp$company_name & city_search == "Boston") %>%
  select(company_name, title_term)

# fixing levels 
temp2$company_name <- parse_factor(as.character(temp2$company_name), rev(as.character(temp$company_name)))

ggplot(data = temp2, aes(x = company_name, fill = title_term)) + 
  geom_bar(stat = "count") + 
  coord_flip() + 
  scale_fill_brewer(palette = 'Blues') + 
  theme_minimal() + 
  theme(panel.grid.major = element_blank()) + 
  labs(x = "Company Name", y = "Number of Listings") + 
  ggtitle("Hiring Companies in Boston") + 
  theme(plot.title = element_text(hjust = 0.5))

Boston

Wow, so Wayfair is on top in Boston; I guess that's not too surprising given the rate they are growing. The ratios of analyst to scientist roles are kind of all over the place. Amazon is looking for data scientists!
***

New York

JPMorgan Chase comes in hot with almost double the listings of second place… I guess it makes sense that this large bank is looking for tons of analysts.


***

Philadelphia

Comcast, J&J, and IQVIA are all large companies; nothing too crazy here.

***

Distribution of the days_ago Feature

Now let's take a look at the distribution of the days_ago variable and see if we can find any patterns in how long these listings have been up.

# days_ago was stored as character, so convert it for the histogram
ggplot(data = subset(postings1, !is.na(title_term)), aes(x = as.numeric(days_ago), fill = title_term)) + 
  geom_histogram(binwidth = 5) + 
  scale_fill_brewer(palette = 'Blues') + 
  theme_minimal() + 
  theme(panel.grid.major = element_blank()) + 
  labs(x = "Days Ago", y = "Number of Listings") + 
  ggtitle("Days Ago") + 
  theme(plot.title = element_text(hjust = 0.5))

We can see that the majority of the listings were posted 30+ days ago. Either the job-filling process usually takes more than 30 days, or companies are just leaving their listings up; I'd bet it's a mix of both. There don't appear to be any trends differentiating the analyst and scientist roles.

Minimum Years of Experience

Now let's start looking at the information we were able to pull out of the job descriptions. Let's create a stacked bar chart to look at the minimum years of experience for analyst vs. scientist positions.

ggplot(data = subset(postings1, !is.na(postings1$min_exp)), aes(x = min_exp, fill = title_term)) + 
  geom_bar() + 
  scale_fill_brewer(palette = 'Blues') + 
  theme_minimal() + 
  theme(panel.grid.major = element_blank()) + 
  labs(x = "Years of Experience", y = "Number of Listings") + 
  ggtitle("Minimum Years of Experience") + 
  theme(plot.title = element_text(hjust = 0.5))

This is about what I expected, with 1–3 as the most common minimum years of experience. I'm a little surprised by the dip at 4 years; I guess if you're looking for a more experienced candidate, 5+ just sounds better than 4+. There don't appear to be any clear trends differentiating the analyst and scientist roles. I was expecting to see a trend towards more years of experience required for data scientist roles.

Minimum Level of Education

Now let's take a look at our minimum level of education variable. Let's start by seeing how many data points we were able to extract from the job descriptions.

nrow(postings1[postings1$bachelors == TRUE | postings1$masters == TRUE | postings1$doctorate == TRUE,  ])
nrow(postings1[postings1$bachelors == TRUE | postings1$masters == TRUE | postings1$doctorate == TRUE,  ])/nrow(postings1)
sum(!is.na(postings1$min_ed))

length(which(postings1$min_ed == "bachelors"))
length(which(postings1$min_ed == "masters"))
length(which(postings1$min_ed == "doctorate"))
[1] 2341
[1] 0.4420317
[1] 2341
[1] 1465
[1] 827
[1] 49

Okay cool, so we have just over 2,300 data points here, around 44%. I'm actually pretty happy with that. I'm sure I missed a few, but that is definitely good enough to look for some trends! Let's take a look.

ggplot(data = subset(postings1, !is.na(min_ed)), aes(x = min_ed, fill = title_term)) + 
  geom_bar(position = 'dodge') + 
  scale_fill_brewer(palette = 'Blues') + 
  theme_minimal() + 
  theme(panel.grid.major = element_blank()) + 
  labs(x = "Minimum Level of Education", y = "Number of Listings") + 
  ggtitle("Minimum Level of Education") + 
  theme(plot.title = element_text(hjust = 0.5))

I think that looks about right, with the majority of the analyst roles requiring a bachelor's degree. Even the data scientist roles are split fairly evenly between bachelor's and master's, with a few doctorate requirements.

Top Tools/Technologies for Analyst and Scientist Positions

And finally, let's take a look at all of the tools and technology data. Let's make a bar chart showing the most popular tools, and compare them for the analyst roles and the scientist roles as well.

Things to note:

  • I am hiding the code for producing the data scientist plot to save room.

  • The percentages are calculated using only the analyst listings (or only the scientist listings, below).


temp     <- subset(postings1, postings1$title_term == "analyst")
tools.df <- colSums(temp[17:41],na.rm = TRUE) %>% as.data.frame()
tools.df <- rownames_to_column(tools.df)
colnames(tools.df) <- c("tech", "count")
tools.df <- as_tibble(tools.df)
tools.df <- arrange(tools.df, desc(count))
# ordering levels
tools.df$tech <- parse_factor(as.character(tools.df$tech), rev(as.character(tools.df$tech)))

# The Plot
ggplot(data = tools.df, aes(x = tech, y = count, fill = "Blues")) + geom_bar(stat = "identity") + 
  geom_text(aes(label=count), size = 2.75, hjust = -.1) +coord_flip() + 
  scale_fill_brewer(palette = 'Blues') + theme_minimal() + 
  theme(legend.position="none", panel.grid.major = element_blank()) + 
  labs( x ="Tools", y =  "Number of Listings") + ggtitle("Most Popular Tools & Tech") + 
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_blank())

Data Analyst Positions

No surprises here: Excel came in on top, mentioned in almost 65% of the analyst listings, and Python beat out R. I am actually surprised that SQL ranks as high as it does, above both Python and R. Scala and Tableau were also pretty popular.


Data Scientist Positions

Similar results here, but Python beat out even Excel, appearing in 63% of the scientist listings. Spark jumps up, appearing in nearly a quarter of the listings.


Thanks for Reading

Thanks for checking out my project. I am hoping to expand it with some more text analysis on the job descriptions to see what else I might be able to pull out of them. I would love to hear any thoughts on my methodology, as I am still learning myself. Thanks again!