Indeed Data Science Skills Ranking

Concept

Like many websites, Indeed.com runs its searches through a GET request whose parameters appear in the URL. By requesting this URL for many combinations of “Data Science Term” + “Skill” + “City” + “Radius” and reading the job count reported on each results page, we can estimate which skills are most in demand and whether that demand varies by metro region.
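
To make the URL format concrete, here is a minimal sketch of building one search URL in R. The helper name build_search_url is hypothetical, and percent-encoding the parameters with URLencode is an assumption added here so that multi-word terms such as "Big Data" produce a valid URL; the q, l, and radius parameter names are the ones used in the scraping code further down.

```r
# Hypothetical helper (not part of the project code): build one search URL
# with each parameter percent-encoded.
build_search_url <- function(skill, base_term, city, radius) {
  # Skill and base term are joined by a space; a space in a query string
  # plays the same role as the "+" separator.
  query <- paste(trimws(skill), trimws(base_term))
  paste0(
    "http://www.indeed.com/jobs",
    "?q=", URLencode(query, reserved = TRUE),
    "&l=", URLencode(city, reserved = TRUE),
    "&radius=", radius
  )
}

build_search_url("Big Data", "Data Scientist", "New York,NY", 25)
# "http://www.indeed.com/jobs?q=Big%20Data%20Data%20Scientist&l=New%20York%2CNY&radius=25"
```

With reserved = TRUE, URLencode escapes spaces and commas as well, so the city value cannot be confused with the surrounding URL syntax.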

Below is a SUBSET of the Indeed.com skills and cities collected for DA 607 Project 3. The search results were written to a CSV output file containing about 6,400 results for the 20 selected cities. Those cities, in descending order of population, were:

Input Cities

| CITY, STATE       | POPULATION |
|-------------------|------------|
| New York, NY      | 8,363,710  |
| Los Angeles, CA   | 3,833,995  |
| Chicago, IL       | 2,853,114  |
| Houston, TX       | 2,242,193  |
| Phoenix, AZ       | 1,567,924  |
| Philadelphia, PA  | 1,447,395  |
| San Antonio, TX   | 1,351,305  |
| Dallas, TX        | 1,279,910  |
| San Diego, CA     | 1,279,329  |
| San Jose, CA      | 948,279    |
| Detroit, MI       | 912,062    |
| San Francisco, CA | 808,976    |
| Jacksonville, FL  | 807,815    |
| Indianapolis, IN  | 798,382    |
| Austin, TX        | 757,688    |
| Columbus, OH      | 754,885    |
| Fort Worth, TX    | 703,073    |
| Charlotte, NC     | 687,456    |
| Memphis, TN       | 669,651    |
| Baltimore, MD     | 636,919    |
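
Each search is one combination of base term, skill, city, and radius, so the number of requests grows as the product of the four list lengths. The sketch below illustrates this with expand.grid on small made-up inputs; the three skills and two cities are placeholders chosen for illustration, not the full project lists.

```r
# Small illustrative inputs; the real skill and city lists are read from CSV files.
base_terms <- c("Data Scientist", "Data Analytics")
skills     <- c("Hadoop", "Hive", "JSON")
cities     <- c("New York,NY", "Chicago,IL")
radii      <- c(25, 50)

# One row per combination; the first argument varies fastest.
grid <- expand.grid(
  Base_Term  = base_terms,
  Skill_Term = skills,
  City       = cities,
  radius     = radii,
  stringsAsFactors = FALSE
)

nrow(grid)     # 2 * 3 * 2 * 2 = 24 searches
head(grid, 3)  # Data Scientist/Hadoop, Data Analytics/Hadoop, Data Scientist/Hive
```

Because the first argument varies fastest, the two base terms alternate row by row for each skill before the city or radius changes.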

Scraping Code and (Subset) Results Table

library(stringr)
library(knitr)

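# Build the search URL for one skill / base term / city / radius combination.
# The terms are inserted as given (not URL-encoded), so multi-word terms keep their spaces.
get_indeed_url <- function(skill, base_term, city, radius){
  indeed_url <- paste0("http://www.indeed.com/jobs?q=", trimws(skill), "+", trimws(base_term), "&l=", city, "&radius=", radius)
  return (indeed_url)
}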

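get_count <- function(indeed_url){
  # Download the results page as a character vector, one element per line
  web_page <- readLines(indeed_url, warn = FALSE)

  # Find the "Jobs 1 to N of TOTAL" line; [0-9,]+ allows multi-digit, comma-grouped numbers
  count_str <- grep("Jobs 1 to [0-9,]+ of [0-9,]+", web_page, value = TRUE)
  # Strip everything up to and including "of ", then everything from </div> onward
  count_str <- gsub(".*Jobs 1 to [0-9,]+ of ", "", count_str)
  count_str <- gsub("</div>.*", "", count_str)
  # No matching line means no count was found on the page
  if(length(count_str) == 0){
    return (0)
  }

  # Drop thousands separators and convert the first match to a number
  return (as.numeric(gsub(",", "", count_str[1])))
}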

write_indeed_results <- function(base_terms_list, skill_terms, cities, radius_list){

  # One row per combination of base term, skill, city, and radius
  dfx <- expand.grid(base_terms_list, skill_terms, cities, radius_list, stringsAsFactors = FALSE)
  colnames(dfx) <- c("Base_Term","Skill_Term","City","radius")

  # Build the search URL for each row, then fetch the page and extract its job count
  dfx$indeed_url <- mapply(get_indeed_url, dfx$Skill_Term, dfx$Base_Term, dfx$City, dfx$radius, USE.NAMES = FALSE)
  dfx$new_jobs_count <- mapply(get_count, dfx$indeed_url, USE.NAMES = FALSE)

  write.csv(dfx, file = "Indeed_Data_Science_Job_Search_Results.csv", row.names = FALSE, na = "")

  # Show the first 20 rows as a table
  kable(head(dfx, n = 20))
}

go <- function(){
  # Base job titles paired with every skill
  base_terms <- c('Data Scientist','Data Analytics')

  # Skill names come from the SKILL column of the subset CSV
  skill_terms.df <- read.csv("IndeedDataScienceSkillsList.SUBSET.csv", header = TRUE)
  skill_terms <- trimws(skill_terms.df$SKILL)

  # Cities are assembled as "CITY,STATE" from the subset CSV
  cities.df <- read.csv("IndeedDataScienceCitiesList.SUBSET.csv", header = TRUE)
  cities <- paste(trimws(cities.df$CITY), trimws(cities.df$STATE), sep=",")

  # Values passed as the radius parameter of each search
  radius_list <- c(25,50)

  write_indeed_results(base_terms, skill_terms, cities, radius_list)
}

go()

| Base_Term | Skill_Term | City | radius | indeed_url | new_jobs_count |
|---|---|---|---|---|---|
| Data Scientist | Big Data | New York,NY | 25 | http://www.indeed.com/jobs?q=Big Data+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | Big Data | New York,NY | 25 | http://www.indeed.com/jobs?q=Big Data+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | GIS | New York,NY | 25 | http://www.indeed.com/jobs?q=GIS+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | GIS | New York,NY | 25 | http://www.indeed.com/jobs?q=GIS+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | Hadoop | New York,NY | 25 | http://www.indeed.com/jobs?q=Hadoop+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | Hadoop | New York,NY | 25 | http://www.indeed.com/jobs?q=Hadoop+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | Hive | New York,NY | 25 | http://www.indeed.com/jobs?q=Hive+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | Hive | New York,NY | 25 | http://www.indeed.com/jobs?q=Hive+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | HTML | New York,NY | 25 | http://www.indeed.com/jobs?q=HTML+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | HTML | New York,NY | 25 | http://www.indeed.com/jobs?q=HTML+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | JAVA | New York,NY | 25 | http://www.indeed.com/jobs?q=JAVA+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | JAVA | New York,NY | 25 | http://www.indeed.com/jobs?q=JAVA+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | JavaScript | New York,NY | 25 | http://www.indeed.com/jobs?q=JavaScript+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | JavaScript | New York,NY | 25 | http://www.indeed.com/jobs?q=JavaScript+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | JSON | New York,NY | 25 | http://www.indeed.com/jobs?q=JSON+Data Scientist&l=New York,NY&radius=25 | 9 |
| Data Analytics | JSON | New York,NY | 25 | http://www.indeed.com/jobs?q=JSON+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | Machine Learning | New York,NY | 25 | http://www.indeed.com/jobs?q=Machine Learning+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | Machine Learning | New York,NY | 25 | http://www.indeed.com/jobs?q=Machine Learning+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | Map/Reduce | New York,NY | 25 | http://www.indeed.com/jobs?q=Map/Reduce+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | Map/Reduce | New York,NY | 25 | http://www.indeed.com/jobs?q=Map/Reduce+Data Analytics&l=New York,NY&radius=25 | 0 |

*** The scrape did not seem to work when run from inside an Rmd file, only from a standalone R script, which is why the new_jobs_count column on the right shows 0 for nearly every row.
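
One way to check the count extraction without touching the network is to run it against a saved or hand-written line of HTML. The sketch below does that in R. The helper name parse_job_count and the sample markup are hypothetical; the only thing taken from the scraping code is the assumption that the page reports its total as "Jobs 1 to N of TOTAL" inside a div.

```r
# Hypothetical helper: extract the total from text such as
# "Jobs 1 to 10 of 1,234" (assumed page format), returning 0 if absent.
parse_job_count <- function(html_lines) {
  pattern <- "Jobs [0-9,]+ to [0-9,]+ of ([0-9,]+)"
  # regmatches() keeps only the elements where the pattern matched
  hits <- regmatches(html_lines, regexpr(pattern, html_lines))
  if (length(hits) == 0) {
    return(0L)
  }
  # Keep the captured total, drop thousands separators, convert to integer
  total <- sub(pattern, "\\1", hits[1])
  as.integer(gsub(",", "", total))
}

# Made-up page content for illustration only
sample_page <- c(
  "<html><body>",
  "<div id=\"searchCount\">Jobs 1 to 10 of 1,234</div>",
  "</body></html>"
)

parse_job_count(sample_page)          # 1234
parse_job_count("<p>No results</p>")  # 0
```

Running the same function on the lines returned by readLines, once from an R script and once from an Rmd chunk, would show whether a 0 comes from the download step or from the pattern not matching.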