Like many websites, Indeed.com uses a GET request with parameters for its searches. By querying this URL format with many combinations of “Data Science Term” + “Skill” + “City” + “Radius”, we can estimate which skills are most in demand and whether that demand varies by metro region.
Below is a SUBSET of the Indeed.com skills and cities collected for DA 607 Project 3. The full search results were written to a CSV output file containing about 6,400 rows for the 20 selected cities. Ranked by population, these cities were:
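As a rough illustration of the URL format described above, the sketch below assembles a search URL from its parameters and percent-encodes the values so that spaces and commas survive the request. The helper name `build_search_url` is hypothetical and not part of the project code; the parameter names `q`, `l`, and `radius` are the ones used by the script later in this document. Here the skill and base term are passed as one space-separated query string, whereas the project script joins them with `+`.

```r
# Hypothetical helper (not in the original project): build an Indeed-style
# search URL and percent-encode each parameter value.
build_search_url <- function(query, location, radius) {
  paste0(
    "http://www.indeed.com/jobs",
    "?q=", URLencode(query, reserved = TRUE),
    "&l=", URLencode(location, reserved = TRUE),
    "&radius=", radius
  )
}

# One combination of term + skill, city, and radius.
example_url <- build_search_url("Hadoop Data Scientist", "New York,NY", 25)
print(example_url)
```

Decoding the result with `URLdecode(example_url)` should give back the readable, unencoded form of the URL.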
| CITY, STATE | POPULATION |
|---|---|
| New York, NY | 8,363,710 |
| Los Angeles, CA | 3,833,995 |
| Chicago, IL | 2,853,114 |
| Houston, TX | 2,242,193 |
| Phoenix, AZ | 1,567,924 |
| Philadelphia, PA | 1,447,395 |
| San Antonio, TX | 1,351,305 |
| Dallas, TX | 1,279,910 |
| San Diego, CA | 1,279,329 |
| San Jose, CA | 948,279 |
| Detroit, MI | 912,062 |
| San Francisco, CA | 808,976 |
| Jacksonville, FL | 807,815 |
| Indianapolis, IN | 798,382 |
| Austin, TX | 757,688 |
| Columbus, OH | 754,885 |
| Fort Worth, TX | 703,073 |
| Charlotte, NC | 687,456 |
| Memphis, TN | 669,651 |
| Baltimore, MD | 636,919 |
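The size of the result file follows from the number of combinations searched. As a back-of-the-envelope sketch: the script that follows uses two base terms and two radii, and the table above lists 20 cities, so roughly 80 skills would give about 6,400 rows. The figure of 80 skills is an assumption inferred from that arithmetic, not a number stated in the project, and the skill and city names below are placeholders.

```r
base_terms  <- c("Data Scientist", "Data Analytics")  # as in the script below
radius_list <- c(25, 50)                              # as in the script below
n_cities    <- 20                                     # cities in the table above
n_skills    <- 80                                     # ASSUMPTION: inferred, not stated

# One row per (base term, skill, city, radius) combination.
search_grid <- expand.grid(
  Base_Term  = base_terms,
  Skill_Term = paste0("skill_", seq_len(n_skills)),   # placeholder names
  City       = paste0("city_", seq_len(n_cities)),    # placeholder names
  radius     = radius_list,
  stringsAsFactors = FALSE
)

n_requests <- nrow(search_grid)
print(n_requests)  # 2 * 80 * 20 * 2 = 6400
```

Each row corresponds to one HTTP request, so the grid size is also a quick estimate of how long a full run will take.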
library(stringr)
library(knitr)
# Build the search URL for one skill / base term / city / radius combination.
get_indeed_url <- function(skill, base_term, city, radius){
  indeed_url <- paste("http://www.indeed.com/jobs?q=", trimws(skill), "+", trimws(base_term), "&l=", city, "&radius=", radius, sep = "")
  return (indeed_url)
}
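# Fetch one results page and return the total job count as a number (0 if none found).
get_count <- function(indeed_url){
  # Encode spaces and other unsafe characters before requesting the page.
  web_page <- readLines(URLencode(indeed_url), warn = FALSE)
  # Find the line holding the "Jobs 1 to N of M" summary; N and M may have several digits.
  count_str <- grep("Jobs 1 to [0-9,]+ of [0-9,]+", web_page, value = TRUE)
  if(length(count_str) == 0){
    return (0)
  }
  # Keep only the total M, drop thousands separators, and convert to numeric.
  count_str <- sub(".*Jobs 1 to [0-9,]+ of ([0-9,]+).*", "\\1", count_str[1])
  return (as.numeric(gsub(",", "", count_str)))
}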
# Query every combination of base term, skill, city, and radius, then save the counts.
write_indeed_results <- function(base_terms_list, skill_terms, cities, radius_list){
  # expand.grid already returns a data frame with one row per combination.
  dfx <- expand.grid(base_terms_list, skill_terms, cities, radius_list, stringsAsFactors = FALSE)
  colnames(dfx) <- c("Base_Term","Skill_Term","City","radius")
  dfx$indeed_url <- mapply(get_indeed_url, dfx$Skill_Term, dfx$Base_Term, dfx$City, dfx$radius)
  dfx$new_jobs_count <- mapply(get_count, dfx$indeed_url)
  write.csv(dfx, file = "Indeed_Data_Science_Job_Search_Results.csv", row.names = FALSE, na = "")
  # Show the first 20 rows as a table.
  kable(head(dfx, n = 20))
}
# Read the skill and city lists, then run the full search.
go <- function(){
  base_terms <- c('Data Scientist','Data Analytics')
  skill_terms.df <- read.csv("IndeedDataScienceSkillsList.SUBSET.csv", header = TRUE)
  skill_terms <- trimws(skill_terms.df$SKILL)
  cities.df <- read.csv("IndeedDataScienceCitiesList.SUBSET.csv", header = TRUE)
  # Combine city and state into the "City,ST" form used in the URL.
  cities <- paste(trimws(cities.df$CITY), trimws(cities.df$STATE), sep=",")
  radius_list <- c(25,50)
  # The inputs were trimmed above, so they are passed through unchanged.
  write_indeed_results(base_terms, skill_terms, cities, radius_list)
}
go()
| Base_Term | Skill_Term | City | radius | indeed_url | new_jobs_count |
|---|---|---|---|---|---|
| Data Scientist | Big Data | New York,NY | 25 | http://www.indeed.com/jobs?q=Big Data+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | Big Data | New York,NY | 25 | http://www.indeed.com/jobs?q=Big Data+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | GIS | New York,NY | 25 | http://www.indeed.com/jobs?q=GIS+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | GIS | New York,NY | 25 | http://www.indeed.com/jobs?q=GIS+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | Hadoop | New York,NY | 25 | http://www.indeed.com/jobs?q=Hadoop+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | Hadoop | New York,NY | 25 | http://www.indeed.com/jobs?q=Hadoop+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | Hive | New York,NY | 25 | http://www.indeed.com/jobs?q=Hive+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | Hive | New York,NY | 25 | http://www.indeed.com/jobs?q=Hive+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | HTML | New York,NY | 25 | http://www.indeed.com/jobs?q=HTML+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | HTML | New York,NY | 25 | http://www.indeed.com/jobs?q=HTML+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | JAVA | New York,NY | 25 | http://www.indeed.com/jobs?q=JAVA+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | JAVA | New York,NY | 25 | http://www.indeed.com/jobs?q=JAVA+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | JavaScript | New York,NY | 25 | http://www.indeed.com/jobs?q=JavaScript+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | JavaScript | New York,NY | 25 | http://www.indeed.com/jobs?q=JavaScript+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | JSON | New York,NY | 25 | http://www.indeed.com/jobs?q=JSON+Data Scientist&l=New York,NY&radius=25 | 9 |
| Data Analytics | JSON | New York,NY | 25 | http://www.indeed.com/jobs?q=JSON+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | Machine Learning | New York,NY | 25 | http://www.indeed.com/jobs?q=Machine Learning+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | Machine Learning | New York,NY | 25 | http://www.indeed.com/jobs?q=Machine Learning+Data Analytics&l=New York,NY&radius=25 | 0 |
| Data Scientist | Map/Reduce | New York,NY | 25 | http://www.indeed.com/jobs?q=Map/Reduce+Data Scientist&l=New York,NY&radius=25 | 0 |
| Data Analytics | Map/Reduce | New York,NY | 25 | http://www.indeed.com/jobs?q=Map/Reduce+Data Analytics&l=New York,NY&radius=25 | 0 |
*** Note: the scrape did not seem to work when run from inside an Rmd file, only from an R script, hence the 0 values in the new_jobs_count column above.
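When counts come back as 0, it can help to test the count extraction separately from the network request. The sketch below runs a "Jobs 1 to N of M" pattern against a made-up HTML fragment, so it needs no connection to Indeed.com. The helper name `parse_job_count` and the sample markup are hypothetical; the real page layout may differ, and this pattern is an assumption about what the regular expression in `get_count` is meant to match.

```r
# Hypothetical helper: pull the total M out of a "Jobs 1 to N of M" line.
parse_job_count <- function(html_lines) {
  pattern <- "Jobs 1 to [0-9,]+ of ([0-9,]+)"
  hit <- grep(pattern, html_lines, value = TRUE)
  if (length(hit) == 0) {
    return(0)  # no summary line found
  }
  # Keep the captured total and strip thousands separators.
  total <- sub(paste0(".*", pattern, ".*"), "\\1", hit[1])
  as.numeric(gsub(",", "", total))
}

# ASSUMPTION: invented markup for illustration only, not a real Indeed page.
sample_page <- c(
  "<html><body>",
  "<div id=\"searchCount\">Jobs 1 to 10 of 1,234</div>",
  "</body></html>"
)

found   <- parse_job_count(sample_page)
missing <- parse_job_count(c("<html>", "no summary here", "</html>"))
print(c(found, missing))
```

If a check like this passes on a saved copy of a real results page but the live run still returns zeros, the problem is more likely in fetching the page than in parsing it.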