In this project, we used supervised and unsupervised data mining techniques on a scraped dataset of 1,303 Indeed job listings to answer the following question:
What are the most valued data science skills?
We collaborated as a team to understand the question, get the data, clean it, analyze it, and draw conclusions. We used Slack, Google Docs, Google Hangouts, GitHub, and in-person meetings to work together, and we gave pair programming – live and virtual – a shot, too.
Kavya Beheraj, GitHub
Paul Britton, GitHub
Jeremy O’Brien, GitHub
Rickidon Singh, GitHub
Violeta Stoyanova, GitHub
Iden Watanabe, GitHub
Data Acquisition — Iden and Paul
Data Cleaning — Jeremy and Kavya
Unsupervised Analysis — Iden and Paul
Supervised Analysis — Rickidon and Violeta
Conclusions — Whole Team
library(rvest)
library(RCurl)
library(plyr)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(tm)
library(wordcloud)
library(tidytext)
library(xtable)
library(readr)
library(knitr)
library(kableExtra)

To motivate our approach to data collection and analysis, we began with the concepts of “skills” and of “value.”
What are the most valued data science skills?
Skills
As discussed in our class, data science requires not only multiple skills but multiple categories of skills. The many fields and industries where data science is applied likely group these skills into different categories, but after some desk research and discussion we felt that in addition to toolsets (software and platforms), both “hard” (analytical and technical) and “soft” (communicative and collaborative) skills are also important.
We used the following list of categories – found in an article on Data Scientist Resume Skills – as a basis for our supervised analyses.
Value
To avoid wading into philosophical abstractions, we interpreted value in its economic sense – that is, what skills are sought after and / or rewarded in the marketplace.
As the economic value of data science skills is not directly measurable, we discussed three different approaches to getting a dataset:
Mining existing custom research on data scientists (like that found here).
Analyzing online discussion boards focused on data science (like this one on Reddit). While threads can provide a historical record (i.e., the evolution of value), there are potential compromises in data quality and bias (whether due to fanboys, trolls, or a silent majority), and informational content does not necessarily accord with economic value.
Scraping online job postings for data science roles provides perspective on what skills employers emphasize and prioritize. This third approach has its limitations: there are multiple platforms (Glassdoor, LinkedIn, Monster, CareerBuilder, etc.), none of which has a complete view of the marketplace, and scraping time-delimited job postings captures a moment in time without any longitudinality.
We dismissed custom research as it didn’t seem to accord with the intent of the project. We thought that exploring online discussion boards could be a valuable alternative, fallback, or follow-up analysis. We agreed to focus on job postings.
Constraints of the data source notwithstanding, testing what signals of “skill value” (i.e., frequency of terms related to data science skills) could be detected in job postings would be a good approach to this project, and one that allowed us to meet technical requirements and collaborate.
After some exploration, we decided to focus on Indeed.com, which has a wealth of data science job postings that can be scraped. We scraped them – first a test set for evaluation and troubleshooting, then a larger, more robust set – to be cleaned and analyzed. We initially used Python, and later replicated the scraper in R.
We felt that the project could benefit from a two-pronged approach to analysis:
A more prescriptive, supervised approach based on cross-referencing job summaries with categorized lists of terms and calculating the frequency of recurring keywords. To prove the concept, we used the “hard,” “soft,” and “tools” lists referenced above as we found them.
A more exploratory, unsupervised approach based on TF-IDF (term frequency-inverse document frequency) and sentiment analysis, which don’t impose preconceived keywords upon job postings (short of filtering out stop words).
To streamline our process, we conducted the two analyses in parallel, cleaning and preparing the data for both. We iterated and collaborated on the scraper, cleaning, and analysis using Slack and GitHub.
This scraper is working code; however, we’ve disabled it here as it can take a while to run. It’s provided as a working demonstration of how our data was collected. All the actual work for this project was completed on a static dataset which we collected early in our efforts. This was done to ensure that all group members were always working with identical data and that any user could reproduce our results as desired.
The following chunk of code scrapes job postings from Indeed.com and collects the results into a dataframe. It’s a port of the Python code originally used to scrape our dataset.
First we’ll set a few variables that we’ll use in our scraping activity. We’ve used a smaller set of cities just to demonstrate how it works.
city.set_small <- c("New+York+NY", "Seattle+WA")
city.set <- c("New+York+NY", "Seattle+WA", "San+Francisco+CA",
"Washington+DC","Atlanta+GA","Boston+MA", "Austin+TX",
"Cincinnati+OH", "Pittsburgh+PA")
target.job <- "data+scientist"
base.url <- "https://www.indeed.com/"
max.results <- 50

Indeed.com appears to use the “GET” request method, so we can manipulate the URL directly to get the data that we want. We’re going to iterate over our target cities and scrape the particulars for each job - this includes getting the links to each individual job page so that we can also pull the full summary.
After the above is complete, we’re going to iterate over all the links that we’ve collected, pull each one, and grab the full job summary. Note that job postings are sometimes removed, in which case we pull an empty variable. We could probably do some cleaning in this step while downloading, but we’re going to handle that downstream.
#create a df to hold everything that we collect
jobs.data <- data.frame(matrix(ncol = 8, nrow = 0))
n <- c("city", "job.title", "company.name", "job.location", "summary.short",
       "job.salary", "links", "summary.full")
colnames(jobs.data) <- n
for (city in city.set_small){
print(paste("Downloading data for: ", city))
for (start in seq(0, max.results, 10)){
url <- paste(base.url,"jobs?q=",target.job,"&l=",city,"&start=", start ,sep="")
page <- read_html(url)
Sys.sleep(1)
#recorded the city search term << not working yet...
#i<-i+1
#job.city[i] <- city
#get the links
links <- page %>%
html_nodes("div") %>%
html_nodes(xpath = '//*[@data-tn-element="jobTitle"]') %>%
html_attr("href")
#get the job title
job.title <- page %>%
html_nodes("div") %>%
html_nodes(xpath = '//*[@data-tn-element="jobTitle"]') %>%
html_attr("title")
#get the company name
company.name <- page %>%
html_nodes("span") %>%
html_nodes(xpath = '//*[@class="company"]') %>%
html_text() %>%
trimws()
#get the job location
job.location <- page %>%
html_nodes("span") %>%
html_nodes(xpath = '//*[@class="location"]') %>%
html_text() %>%
trimws()
#get the short summary
summary.short <- page %>%
html_nodes("span") %>%
html_nodes(xpath = '//*[@class="summary"]') %>%
html_text() %>%
trimws()
}
#create a structure to hold our full summaries
summary.full <- rep(NA, length(links))
#fill in the job data
job.city <- rep(city,length(links))
#add a place-holder for the salary
job.salary <- rep(0,length(links))
#iterate over the links that we collected
for ( n in 1:length(links) ){
#build the link
link <- paste(base.url,links[n],sep="")
#pull the link
page <- read_html(link)
#get the full summary
s.full <- page %>%
html_nodes("span") %>%
html_nodes(xpath = '//*[@class="summary"]') %>%
html_text() %>%
trimws()
#check to make sure we got some data and if so, append it.
#as expired postings return an empty var
if (length(s.full) > 0 ){
summary.full[n] = s.full
}
}
#add the newly collected data to the jobs.data
jobs.data <- rbind(jobs.data,data.frame(city,
job.title,
company.name,
job.location,
summary.short,
job.salary,
links,
summary.full))
}

The previous step resulted in a raw CSV file with over 1,300 rows. To clean the data, we first read in the CSV file, tested a cleaning procedure on a 100-row sample, and applied the procedure to the full dataset.
Read in raw dataframe, set separator as pipe
url <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Projects/Project%2003/indeed_jobs_large.csv"
df <- read.csv(url, sep="|", stringsAsFactors = F)

Removed “location” and “salary” columns, to reduce redundancy.
df <- df[, -c(5,7)]

Took a 100-row sample of the full dataset.
sample <- df[sample(1:nrow(df), 100, replace=F),]

Removed brackets surrounding summaries.
sample1 <- sample %>% separate(summary_full, c("bracket", "new_summary"), sep="^[\\[]", remove=T, convert=F) %>%
separate(new_summary, c("summary_full", "bracket"), sep="[\\]]$", remove=T, convert=F)
sample1 <- sample1[, -c(5, 8)]

Renamed column headers.
names(sample1) <- c("list_ID", "city", "job_title", "company_name", "link", "summary")

Removed state and plus signs from city column.
# Separate City column into City and State by pattern of two uppercase letters after a plus sign (i.e., "+NY")
sample2 <- sample1 %>% separate(city, c("city", "state"), sep="[\\+][[:upper:]][[:upper:]]$", convert=T)
# Remove empty State column
sample2 <- sample2[, -c(3)]
# Replace plus signs with spaces
sample2$city <- str_replace_all(sample2$city, "[\\+]", " ")

Removed rows where summary is blank.
sample3 <- filter(sample2, sample2$summary!="")
head(sample3, 2) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width = "800px", height = "200px")

| list_ID | city | job_title | company_name | link | summary |
|---|---|---|---|---|---|
| 1156 | Cincinnati | Research Scientist (Relocation) | Novateur Research Solutions | https://www.indeed.com//pagead/clk?mo=r&ad=-6NYlbfkN0A1UtxlOJl6Y6F02-oTF9zwI-Nb9Dq9qDYX5XseiH9toZY47T_vf17yZekOKh6vKrDDCDFX3Wa6lCO1lqbbMub-n5iRQsliyvER1hj8ydFeCvr1dDdzSE2BRg0jsitAerWdpH4lwQy2YN10RaU4x7razrc-N-TWOgAyoXesX5yFmLmikvJOrRM-mJ2QsmncrEVzjnqE8VgKeF_x0lsgGnZN2GlJjM5j94KuMaZhknKONUmt3NyzS3B840l_G_X-KjfhwX--3eVKBuquaptQCQsYAcJ3v2p-KoEpvuZ1wwIJFoB6z2I9zVukI5GWav1HwGDD1kr6iMGqn-ux9tpldnJaFtcL4IQW_f6z8ykbTV8RsTnwlmLwcvh-92bR_meG52AyxJt-ZIKvIqeCz8ahaxpCVU6t24-WRy0Q7DOX1cNlfqlDB5Irejsjy22fJ0AKWyQ=&vjs=3&p=2&sk=&fvj=1 | RESEARCH SCIENTIST JOB SUMMARYNovateur Research Solutions is looking for research scientists and senior research scientists to develop cutting edge technologies and products. The successful candidates will be self-motivated individuals with strong background in one or two of the following areas: computer vision, image processing, cyber security, and machine learning.RESEARCH SCIENTIST RESPONSIBILITIESCollaborate with other researchers at Novateur and academia to solve challenging operational problems for government and commercial clients in above-mentioned areas.Develop novel algorithms leveraging the state-of-the-art technologies.RESEARCH SCIENTIST REQUIRED QUALIFICATIONSMasters or PhD. Degree in computer science, engineering, physics, applied mathematics or a related field with a focus on computer vision, image processing, cyber security or machine learningExcellent understanding of latest computer vision and machine learning technologies.Must be proficient in C/C++ and/or Python in Windows and Linux environment.Ability to work in a dynamic and fast-paced environment.Passion for working on cutting-edge technologies.Team player with excellent written and oral communication skills.RESEARCH SCIENTIST DESIRED QUALIFICATIONSPhD. DegreeStrong record of research publications.Experience with OpenCV, ROS, deep-learning packages, and similar tools.Experience with hardware optimization such as GPU programming using CUDA.Experience with development and prototyping of real-time systems.COMPANY BENEFITSNovateur offers competitive pay and benefits including a wide choice of healthcare options with generous company subsidy, 401(k) with generous employer match, paid holidays and paid time off increasing with tenure, and company paid short-term disability, long-term disability, and life insurance.About Novateur Research Solutions: Novateur is a research and development company providing innovative solutions for customers in defense, civilian government and commercial industries. Our focus areas include computer vision, machine learning, robotics and big data mining. We work at the forefront of innovation to enable transition from ideas to market by providing our customers the right blend of enabling scientific solutions, technologies, and product development expertise. We are located in Northern Virginia in the historic district of Leesburg. We offer a work environment which fosters individual thinking along with collaboration opportunities within and beyond Novateur. In return, we expect a high level of performance and passion to deliver enduring results for our clients.Novateur is an Equal Opportunity Employer. 
This company does not and will not discriminate in employment and personnel practices on the basis of race, sex, age, handicap, religion, national origin or any other basis prohibited by applicable law.Job Type: Full-timeJob Type: Full-timeRequired experience:Machine Learning: 1 yearPython or C++ programming: 1 yearComputer Vision: 1 yearRequired education:DoctorateRequired license or certification:US Permanent Residency or Citizenship |
| 1234 | Pittsburgh | Data Scientist | rue21 | https://www.indeed.com//rc/clk?jk=b9727f45a638d7e3&fccid=24d0f67f23343905&vjs=3 | Overviewrue21 is looking for an experienced Data Scientist. We are looking for an individual who will take ownership of data mining, building predictive models, and driving the construction of data insight products at rue21. The right candidate should be a self-motivated, highly detail-oriented team-player with a positive drive to provide insight to rue21 business partners. The right candidate will have a passion for discovering solutions hidden in large data sets and working with stakeholders to improve business outcomes.ResponsibilitiesWork with stakeholders throughout the organization to identify opportunities for leveraging company data to drive business solutions.Coordinate with different functional teams to implement models and monitor outcomes.Deliver insight to leadership, helping drive strategic business decisions.Execute data mining activities providing insight from datasets.Build accurate statistical supervised & unsupervised predictive models.Develop processes and tools to monitor and analyze model performance and data accuracy.Qualifications5+ years of experience with several years in hands on experience manipulating data sets and building statistical models.Knowledge of advanced statistical techniques and concepts (regression, properties of distributions, statistical tests and proper usage, etc.) and experience with applicationsExperience querying databases and using statistical computer languages: R, Python, etc.A strong analytical and quantitative analysis mindsetAbility to think strategically and translate plans into phased actions in a fast paced, high pressure environmentDimensional and/or Multidimensional modeling experience is a plusKnowledge of BI related principles such as ETL, data modeling, & data warehousingBS/MS degree in Statistics, Mathematics, Computer Science or equivalent/related degree. |
Removed brackets surrounding summaries.
df1 <- df %>% separate(summary_full, c("bracket", "new_summary"), sep="^[\\[]", remove=T, convert=F) %>%
separate(new_summary, c("summary_full", "bracket"), sep="[\\]]$", remove=T, convert=F)
df1 <- df1[, -c(5, 8)]

Renamed column headers.
names(df1) <- c("list_ID", "city", "job_title", "company_name", "link", "summary")

Removed state and plus signs from city column.
# Separate city column into city and state by pattern of two uppercase letters after a plus sign (i.e., "+NY")
df2 <- df1 %>% separate(city, c("city", "state"), sep="[\\+][[:upper:]][[:upper:]]$", convert=T)
# Remove empty State column
df2 <- df2[, -c(3)]
# Replace plus signs with spaces
df2$city <- str_replace_all(df2$city, "[\\+]", " ")

Removed rows where summary is blank.
df_final <- filter(df2, df2$summary!="")
write.csv(df_final, "indeed_final.csv", row.names = FALSE)
head(df_final, 2) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width = "800px", height = "200px")

| list_ID | city | job_title | company_name | link | summary |
|---|---|---|---|---|---|
| 1 | New York | Data Scientist | AbleTo, Inc. | https://www.indeed.com//pagead/clk?mo=r&ad=-6NYlbfkN0DDxuA8Y4K3JnPiGV4kjt8LJAX0ZelysMhEJeNM_3_rWb_L4BVNF4KpDHXkayWIYw5H919b29Wv9kQgd-mQGWEY-QQXRpTL5rlsZ7_n6AWh5RptzR11B6ZJnyJplt7VTSzq1CsFIpNQMytVyIBVMSrguwd-ESHqspczWm_AdnUxSs1jiwfYm9e3e6AT-Hh20UGh0KJi238J21XagkNbN-QSCV_9RqzVw6HbR-OEHLPDucMHuaKN_gpW6UiDyjfm9he2EbTXP8Rgkx-e3GecFU2APK8l5g0ymwqGZL_hEDQEXV_KCaQj10Dyd4spfmQ6-j1RHkdoOudRJ0SRuoIRiPTchuCTcuQkwtd3m8kmIFcdVjsaT9wxfgYx0pWhP32vCKTfgUSE5Ze7ePGboPkMAkLMaOYkaRUqFtt4g34rxsAyjHfpUr8XszxgN6pyyvexN_0qEzEkIQ46oO30CzvOcnYXJPKhYtRhvtFcPiklsBqwWkbTMmDmRUE5TIw22dDZ2kAzGQubdRYnp6KnbgP0mfeHzCnDljLM8wcus3PTL101_j1VZ-Z0h897BHWzlKMcHYSNmT4-oxr4d5bxD_BuSQiLqRe5nQlujMoZ2IsWl8LOIZKiN556aul-NqrzquBk3sLy-nEegXA6fQ==&vjs=3&p=1&sk=&fvj=0 | OverviewAbleTo Inc. combines best-in-class patient engagement with behavior change treatment programs that allow health plans and plan sponsors to improve health outcomes, while lowering overall spending, for high-cost medical populations. Benefitting groups include heart patients and diabetics, as well as those suffering from depression/anxiety and chronic pain. All sixteen of our programs utilize best practice protocols and are delivered nationally by teams of licensed therapists and behavioral coaches. Program participants experience improved health, better recovery and in the case of heart patients, fewer hospitalizations.ResponsibilitiesData & Analytics TeamThe Data & Analytics team plays a key role at AbleTo. Data is part of nearly everything that we do. The Data & Analytics team is responsible for storing data, making it useful and actionable for business managers, and using it to validate the impact of our programs and inform organizational improvements.As a Data Scientist at AbleTo, you will work in a uniquely cross-functional capacity, helping teams across the organization apply analytical rigor to their decision-making:With our Research team, you will build claims-based predictive models to identify patients who can benefit from our programs and drive value for our clientsWith our Engagement team, you will help optimize a coordinated, multi-channel patient outreach processWith our Account Management team, you will develop reporting that demonstrates how AbleTo is helping clients manage their patient populationsWith our Clinical team, you will ensure that our therapist network is operating efficiently and is structured appropriately to meet demandWith our Product team, you will use data-driven insights to design new product features for our platformWith our Engineering team, you will help manage our data stack and ETL processes that feed our data warehouseMembers of the Data & Analytics team are:Both technically and strategically savvyAble to sift through large amounts of data and extract insightsAble to present recommendations to non-technical audiencesHighly organized and able to manage time across multiple competing prioritiesFocused on building flexible, durable, and well-designed solutionsFocused on preserving quality controlWilling to experiment with new tools / techniquesData & Analytics team culture:Collaboration: You will work with teams across the organization to both design and build data-driven solutionsInnovation: Our team prioritizes innovation over precedent. Weâre always looking for new, more efficient ways of doing thingsHigh-Impact Focus: Our team receives a lot of requests for new analyses, processes, reports, etc. 
It is our job to prioritize these requests with our stakeholders and invest time in projects with the highest potential value-add for the organization. You will be actively involved in making these decisionsQualificationsTechnical SkillsStrong proficiency in SQL and R/PythonExperience working on cloud-based platforms like AWS/GCSComfortable with variety of analytical techniques:Predictive modeling (e.g. regularized regressions, decision trees / random forests, association rules mining)Optimization (e.g. linear programming)Exploratory analysis (e.g. clustering, PCA)Some familiarity with BI tools such as Looker or TableauEducation / ExperienceBA in Operations Research, Computer Science, or other related discipline2-3 years work experience is strongly preferredAdditionally to the technical requirements for this specific position, AbleTo seeks candidates that will demonstrate:Personal ownership of assignments and responsibilitiesResilience and grit to ensure task completion even in the face of adversityDiscipline and organization to handle multiple tasks simultaneouslyGreat team playing skillsHigh levels of energy and positive ambitionA healthy balance of curiosity, humility and assertivenessExcellent written and verbal communication skillsProfessional attitude and demeanorAbleTo is committed to equal employment opportunities (EEO) and employs qualified persons without regard to race, color, religion, national origin, sex, age, handicap, or any other classification protected by federal, state, or local laws.AbleTo is an E-Verify Company |
| 2 | New York | Data Scientist | Shore Group Associates, LLC | https://www.indeed.com//pagead/clk?mo=r&ad=-6NYlbfkN0AmtW71uMJ-FMTnSgQAi6MO2hfu514W2To_ok1EkQDsLzCSnVx2dJJdWh_eVltX3NiKTJsOZ1PdrHmGwruy3Gwutw1Y3myrtnW-EAYSCQP1_EOIcyntJdxtj6FPF62TyGkihYDHIjTCVu_fBirizqhYKRHGco3FiaFPo1aadANl5b8sxxh_xfPnT7IgmCq1uhzhaBEqvJTIOxzME67gwkFDQMqxfRy5NeNlACthstIIJrrNKGb_4rHddOAIBkREr7GI5VMt-mMTREeanvd2N26PKfQEgntwy8IRwFCIBN2KLrb4LrABKQS4hpxFDEeLxkLy_brMWIhE5yVTBHezMc1KQdnpROfTZk9-mjnfIkcuCQ5m8avDk2OYqqcGBMs15Nw_jrnylFns0l3hlK9gyQa-a51XJpetjmCi6VJL-lVn1fC2_jm2_2cU&vjs=3&p=2&sk=&fvj=1 | Description -Our team is building machine learning tools to help determine predict information about donors to a given foundation. This includes the likelihood of donation, the size of the donation, when is best to reach out to a given possible donor, and how to reach out. We are looking for data scientists who are not only interested in plugging data into a model, but also understanding the data like the back of their hand.Requirements -Analyze requirements and formulate an appropriate technical solution that meets functional and non-functional requirements.Experience with large datasets in the 100’s of GBFundamental/broad understanding of data mining and predictive analytics techniques2+ years of Data Science Experience and a deep knowledge of various modeling techniques.Python (Pandas, SKLearn, NumPy, Matplotlib), SQL, GitStrong communication skills - both verbal and written â is a must.Strong experience with predictive regression modelsStrong software engineering skillsJob Type: Full-timeExperience:Data Science: 3 years (Required) |
The result is a cleaned dataframe, df_final, that has 1,303 job listings.

TF-IDF stands for “term frequency-inverse document frequency”. It is calculated by first computing the term frequency of a word, \(tf(t,d)\), and multiplying it by its inverse document frequency, \(idf(t,D)\), so that how frequently a word appears in a document is offset by its frequency across the corpus.
For example, a word might appear frequently in one chapter of a book, so much so that its frequency might put it in the top 10 words, but TF-IDF weighs the value of this word by the fact that it only appears in one chapter of, say, a hundred chapter textbook.
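To make the weighting concrete, here is a minimal sketch that computes tf-idf directly from the definition above on a hypothetical two-document toy corpus (not our Indeed data). The tm weighting used in the next chunk applies the same idea, modulo normalization details.

# Toy example: "python" is frequent in doc 1 and absent from doc 2, so it scores high;
# "sql" appears in every document, so its idf (and therefore its tf-idf) is zero.
docs   <- list(c("python", "sql", "python", "statistics"),
               c("communication", "sql", "teamwork"))
tf     <- function(t, d) sum(d == t) / length(d)                                 # tf(t, d)
idf    <- function(t, D) log(length(D) / sum(sapply(D, function(d) t %in% d)))   # idf(t, D)
tf.idf <- function(t, d, D) tf(t, d) * idf(t, D)
tf.idf("python", docs[[1]], docs)   # 0.5 * log(2), roughly 0.35
tf.idf("sql", docs[[1]], docs)      # 0.25 * log(1) = 0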
tfidf <- read.csv("indeed_final.csv", stringsAsFactors = FALSE)
# Make all job titles lower case for later
tfidf$job_title <- tolower(tfidf$job_title)
# Control list to be used for all corpuses
control_list <- list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE,
weighting = weightTfIdf)

corpus.all <- VCorpus(VectorSource(tfidf$summary))
tdm.all <- TermDocumentMatrix(corpus.all, control = control_list)
# Remove outliers consisting of very rare terms
tdm.80 <- removeSparseTerms(tdm.all, sparse = 0.80)
# Sum rows for total & make dataframe
df_all <- tidy(sort(rowSums(as.matrix(tdm.80))))
colnames(df_all) <- c("words", "count")
# Graph
ggplot(tail(df_all, 25), aes(reorder(words, count), count)) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "TF-IDF of Indeed Job Postings",
x = "Words", y = "Frequency") +
coord_flip()

First, a note on sparsity: the sparse threshold roughly controls how rare a term can be and still be kept. If we run removeSparseTerms(tdm, sparse = 0.99), it removes only the rarest words, that is, words that appear in only about 1% of the documents in the corpus. On the other hand, with removeSparseTerms(tdm, sparse = 0.01), only words that appear in nearly every document of the corpus are kept.
For most analyses, we found that a sparsity of 80% was most useful. Sparsity above 80% often included words that were important to job postings as a whole, and not to the specifics we wanted for the purpose of our question.
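As a quick illustration of how the threshold behaves, here is a small sketch on a made-up three-document corpus (the object names here are ours and not part of the analysis): with sparse = 0.80 every term survives, while sparse = 0.40 keeps only the term present in most documents.

toy.corpus <- VCorpus(VectorSource(c("python sql statistics",
                                     "python machine learning",
                                     "python communication")))
toy.tdm <- TermDocumentMatrix(toy.corpus)
# "python" appears in 3/3 documents; every other term appears in only 1/3 (about 67% sparse)
dim(removeSparseTerms(toy.tdm, sparse = 0.80))   # all terms kept (67% < 80%)
dim(removeSparseTerms(toy.tdm, sparse = 0.40))   # only "python" survives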
When each job posting is treated as an individual document, skills like “machine learning”, “analytics”, “statistics/statistical”, and “models/modeling” emerge as very important for data scientists to have.
# Trying to divide the corpus by cities
nyc <- paste(tfidf[tfidf$city == "New York", 6], collapse = " ")
sea <- paste(tfidf[tfidf$city == "Seattle", 6], collapse = " ")
sf <- paste(tfidf[tfidf$city == "San Francisco", 6], collapse = " ")
dc <- paste(tfidf[tfidf$city == "Washington", 6], collapse = " ")
atl <- paste(tfidf[tfidf$city == "Atlanta", 6], collapse = " ")
bos <- paste(tfidf[tfidf$city == "Boston", 6], collapse = " ")
aus <- paste(tfidf[tfidf$city == "Austin", 6], collapse = " ")
cin <- paste(tfidf[tfidf$city == "Cincinnati", 6], collapse = " ")
pitt <- paste(tfidf[tfidf$city == "Pittsburgh", 6], collapse = " ")
cities <- c(nyc, sea, sf, dc, atl, bos, aus, cin, pitt)
corpus.city <- VCorpus(VectorSource(cities))
tdm.city <- TermDocumentMatrix(corpus.city, control = control_list)
# Make city dataframe
df_city <- tidy(tdm.city)
df_city$document <- mapvalues(df_city$document,
from = 1:9,
to = c("NYC", "SEA", "SF",
"DC", "ATL", "BOS",
"AUS", "CIN", "PITT"))
df_city %>%
arrange(desc(count)) %>%
mutate(word = factor(term, levels = rev(unique(term))),
city = factor(document, levels = c("NYC", "SEA", "SF",
"DC", "ATL", "BOS",
"AUS", "CIN", "PITT"))) %>%
group_by(document) %>%
top_n(6, wt = count) %>%
ungroup() %>%
ggplot(aes(word, count, fill = document)) +
geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
labs(title = "Highest TF-IDF Words in Job Listings by City",
x = "Words", y = "TF-IDF") +
facet_wrap(~city, ncol = 2, scales = "free") +
coord_flip()

# write.csv(df_city, "city_tfidf.csv", row.names = FALSE)

In this attempt, job postings were grouped by the cities they were listed in. When broken down this way, the company names themselves became the most important words, rather than skills.
# Data Scientist - 739 instances
ds <- tfidf[grep("data scientist", tolower(tfidf$job_title)), 6]
ds.corpus <- VCorpus(VectorSource(ds))
ds.tdm <- TermDocumentMatrix(ds.corpus, control = control_list)
ds.80 <- removeSparseTerms(ds.tdm, sparse = 0.80)
df_ds <- tidy(sort(rowSums(as.matrix(ds.80))))
colnames(df_ds) <- c("words", "count")
ggplot(tail(df_ds, 25), aes(reorder(words, count), count)) +
geom_bar(stat = "identity", fill = "red") +
labs(title = "TF-IDF of Data Scientist Job Titles",
x = "Words", y = "Frequency") +
coord_flip()

# Senior / Sr. - 84 instances
# Intern - 61 instance
# Senior vs Intern
# Not very illuminating
senior <- paste(tfidf[grep("senior", tolower(tfidf$job_title)), 6], collapse = " ")
intern <- paste(tfidf[grep("intern", tolower(tfidf$job_title)), 6], collapse = " ")
jrsr.corpus <- VCorpus(VectorSource(c(senior, intern)))
jrsr.tdm <- TermDocumentMatrix(jrsr.corpus, control = control_list)
df_jrsr <- tidy(jrsr.tdm)
df_jrsr$document <- mapvalues(df_jrsr$document, from = 1:2,
to = c("senior", "intern"))
df_jrsr %>%
arrange(desc(count)) %>%
mutate(word = factor(term, levels = rev(unique(term))),
type = factor(document, levels = c("senior", "intern"))) %>%
group_by(document) %>%
top_n(25, wt = count) %>%
ungroup() %>%
ggplot(aes(word, count, fill = document)) +
geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
labs(title = "TF-IDF of Senior vs Junior Jobs",
x = "Words", y = "TF-IDF") +
facet_wrap(~type, ncol = 2, scales = "free") +
coord_flip()

# Machine Learning - 124 instances
ml <- tfidf[grep("machine learning", tolower(tfidf$job_title)), 6]
ml.corpus <- VCorpus(VectorSource(ml))
ml.tdm <- TermDocumentMatrix(ml.corpus, control = control_list)
ml.70 <- removeSparseTerms(ml.tdm, sparse = 0.70)
df_ml <- tidy(sort(rowSums(as.matrix(ml.70))))
colnames(df_ml) <- c("words", "count")
ggplot(tail(df_ml, 25), aes(reorder(words, count), count)) +
geom_bar(stat = "identity", fill = "green") +
labs(title = "TF-IDF for Machine Learning Jobs",
x = "Words", y = "Count") +
coord_flip()

# Research - 119 instances
research <- tfidf[grep("research", tfidf$job_title), 6]
r.corpus <- VCorpus(VectorSource(research))
r.tdm <- TermDocumentMatrix(r.corpus, control = control_list)
r.80 <- removeSparseTerms(r.tdm, sparse = 0.80)
df_r <- tidy(sort(rowSums(as.matrix(r.80))))
colnames(df_r) <- c("words", "count")
ggplot(tail(df_r, 25), aes(reorder(words, count), count)) +
geom_bar(stat = "identity", fill = "orange") +
labs(title = "TF-IDF for Research Job Titles",
x = "Words", y = "Count") +
coord_flip()

Though our primary search term was “Data Scientist”, Indeed also returned other job titles; these were some of the most common. Unsurprisingly, “Data Scientist” itself matches what we see in the analysis of all job postings. We thought there might be an interesting shift between “senior” level jobs and internships, with perhaps a strong prevalence of “soft skills” for the higher-level jobs, but did not see much evidence of that in the data.
The idea here is to take a look at the “sentiment” of the text within each job posting and use that information as a proxy for company quality. The thinking is that a higher sentiment ranking will be indicative of better company quality (a leap, to be sure, but probably acceptable given the scope of this project). We’ll then use this data to take a look at which skills are more heavily referred to by the highest (and lowest) sentiment-ranked companies.
The first thing we’re going to do is tokenize the “summary” column of the data, which contains all the text we are interested in. This essentially amounts to parsing the column into individual words and reshaping the dataframe into a “tidy” format where each individual word (token) gets its own row.
We’ll then remove all the “stop_words” from this newly created data – words like “if”, “and”, “the”… etc.
#tokenize the summary into individual words, drop stop words
df.sent <- df_final %>%
unnest_tokens(token, summary) %>%
anti_join(stop_words, by=c("token" = "word"))
head(df.sent,5) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width = "800px", height = "200px")

Next we’ll map a numeric sentiment score to the words in our token column. We’re going to use the AFINN set for simplicity as it maps to a simple integer score between [-5, +5], with numbers below zero representing negative sentiments and numbers above zero representing positive sentiments.
#map the words to a sentiment score
df.sentiment <- df.sent %>%
inner_join(get_sentiments("afinn"), by=c("token" = "word"))
head(df.sentiment[c("city","job_title","company_name","token","score")],5) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| city | job_title | company_name | token | score |
|---|---|---|---|---|
| New York | Data Scientist | AbleTo, Inc. | improve | 2 |
| New York | Data Scientist | AbleTo, Inc. | benefitting | 2 |
| New York | Data Scientist | AbleTo, Inc. | suffering | -2 |
| New York | Data Scientist | AbleTo, Inc. | anxiety | -2 |
| New York | Data Scientist | AbleTo, Inc. | pain | -2 |
Next we’re going to compute an average sentiment score for each company by aggregating the total sentiment score per company and dividing by the number of scored tokens found for that particular company. We’ll also order the data by average sentiment.
#pare down the data
df.sentByComp <- df.sentiment[,c("company_name","score")]
#get the number of observations per co.
df.compCount <- df.sentiment %>%
dplyr::group_by(company_name) %>%
dplyr::summarize(num_obs = length(company_name))
#aggregate the sentiment score by company
df.sentByComp <-df.sentByComp %>%
dplyr::group_by(company_name) %>%
dplyr::summarize(sentiment = sum(score))
#get the average sentiment score per observation
df.sentByComp$num_obs = df.compCount$num_obs
df.sentByComp$avg.sentiment = df.sentByComp$sentiment / df.sentByComp$num_obs
df.sentByComp <- df.sentByComp[order(-df.sentByComp$avg.sentiment),]
head(df.sentByComp,5) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| company_name | sentiment | num_obs | avg.sentiment |
|---|---|---|---|
| Naval Nuclear Laboratory | 23 | 10 | 2.300000 |
| RJ Lee Group, Inc. | 20 | 9 | 2.222222 |
| Austin Fraser | 11 | 5 | 2.200000 |
| Directions Research, Inc. | 70 | 32 | 2.187500 |
| HarperCollins Publishers Inc. | 13 | 6 | 2.166667 |
Next we downsample the data to look at the top and bottom few companies, as per the sentiment rankings.
n <- 5 # number of companies to get
#get the top and bottom "n" ranked companies
bestNworst <- rbind(head(df.sentByComp,n),tail(df.sentByComp,n))
bestNworst %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| company_name | sentiment | num_obs | avg.sentiment |
|---|---|---|---|
| Naval Nuclear Laboratory | 23 | 10 | 2.3000000 |
| RJ Lee Group, Inc. | 20 | 9 | 2.2222222 |
| Austin Fraser | 11 | 5 | 2.2000000 |
| Directions Research, Inc. | 70 | 32 | 2.1875000 |
| HarperCollins Publishers Inc. | 13 | 6 | 2.1666667 |
| Expedia | -1 | 22 | -0.0454545 |
| Oracle | -6 | 28 | -0.2142857 |
| ZenX Solutions LLC | -3 | 11 | -0.2727273 |
| Ezra Penland Actuarial Recruitment | -44 | 22 | -2.0000000 |
| Affirm | -8 | 2 | -4.0000000 |
Next, we inner-join our bestNworst data back to the tokenized data (df.sent), preserving only entries that correspond to companies which fall in the top or bottom “n” in terms of sentiment rank. This should dramatically reduce the row count from about 400K to somewhere in the low thousands.
df.result <- inner_join(df.sent, bestNworst[c("company_name","avg.sentiment")])

## Joining, by = "company_name"
colnames(df.result)

## [1] "list_ID"      "city"         "job_title"    "company_name"
## [5] "link" "token" "avg.sentiment"
tail(df.result[c("city","company_name","token","avg.sentiment")], 5) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| | city | company_name | token | avg.sentiment |
|---|---|---|---|---|
| 2430 | Pittsburgh | Naval Nuclear Laboratory | proficiency | 2.3 |
| 2431 | Pittsburgh | Naval Nuclear Laboratory | technical | 2.3 |
| 2432 | Pittsburgh | Naval Nuclear Laboratory | writing | 2.3 |
| 2433 | Pittsburgh | Naval Nuclear Laboratory | data | 2.3 |
| 2434 | Pittsburgh | Naval Nuclear Laboratory | analysis | 2.3 |
Now we’ll count and rank the terms.
#remove any commas from the token column... makes it easier to remove #s
df.result$token <- gsub(",","",df.result$token)
#count the terms for the top rated companies
top.terms <- df.result %>%
dplyr::filter(is.na(as.numeric(as.character(token)))) %>% # removes numbers
dplyr::filter(avg.sentiment > 0 ) %>%
dplyr::count(token, sort = TRUE)
head(top.terms,5)

## # A tibble: 5 x 2
## token n
## <chr> <int>
## 1 data 18
## 2 chemistry 9
## 3 experience 7
## 4 position 7
## 5 college 6
#count the terms for the bottom rated companies
bottom.terms <- df.result %>%
dplyr::filter(is.na(as.numeric(as.character(token)))) %>% # removes numbers
dplyr::filter(avg.sentiment < 0 ) %>%
dplyr::count(token, sort = TRUE)
head(bottom.terms,5) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| token | n |
|---|---|
| position | 42 |
| data | 33 |
| actuarial | 32 |
| actuary | 31 |
| job | 26 |
ggplot(head(top.terms,33), aes(reorder(token, n), n)) +
geom_bar(stat = "identity", fill = "Blue") +
labs(title = "Top Terms for Companies with Highest Sentiment",
x = "Term", y = "Frequency") +
coord_flip()

ggplot(head(bottom.terms,33), aes(reorder(token, n), n)) +
geom_bar(stat = "identity", fill = "Red") +
labs(title = "Top Terms for Companies with Lowest Sentiment",
x = "Term", y = "Frequency") +
coord_flip()

Our goal was to find the most valued data science skills using a supervised approach. We created new variables for analyzing three types of skills (hard skills, soft skills, and tool skills).
Assumptions:
We assumed certain terms fell into certain categories and searched for them.
We arrived at these categories based on outside SMEs (subject-matter experts).
We assumed that our analysis tools would lead to conclusions without human intervention.
We used the mutate function to create new variables for the tool skills category and preserve existing variables from the summary column, and turned on case insensitivity.
toolskills <- df_final %>%
mutate(R = grepl("\\bR\\b,", summary)) %>%
mutate(python = grepl("Python", summary, ignore.case=TRUE)) %>%
mutate(SQL = grepl("SQL", summary, ignore.case=TRUE)) %>%
mutate(hadoop = grepl("hadoop", summary, ignore.case=TRUE)) %>%
mutate(perl = grepl("perl", summary, ignore.case=TRUE)) %>%
mutate(matplotlib = grepl("matplotlib", summary, ignore.case=TRUE)) %>%
mutate(Cplusplus = grepl("C++", summary, fixed=TRUE)) %>%
mutate(VB = grepl("VB", summary, ignore.case=TRUE)) %>%
mutate(java = grepl("java\\b", summary, ignore.case=TRUE)) %>%
mutate(scala = grepl("scala", summary, ignore.case=TRUE)) %>%
mutate(tensorflow = grepl("tensorflow", summary, ignore.case=TRUE)) %>%
mutate(javascript = grepl("javascript", summary, ignore.case=TRUE)) %>%
mutate(spark = grepl("spark", summary, ignore.case=TRUE)) %>%
select(job_title, company_name, R, python, SQL, hadoop, perl, matplotlib, Cplusplus, VB, java, scala, tensorflow, javascript, spark)

Applied the summarise_all function to all (non-grouping) columns.
toolskills2 <- toolskills %>% select(-(1:2)) %>% summarise_all(sum) %>% gather(variable,value) %>% arrange(desc(value))

Visualized the most in-demand tool skills:
ggplot(toolskills2,aes(x=reorder(variable, value), y=value)) + geom_bar(stat='identity',fill="green") + xlab('') + ylab('Frequency') + labs(title='Tool Skills') + coord_flip() + theme_minimal()

Python, SQL, and R are the most in-demand tool skills according to Indeed job posts, with the least in-demand tool skill being VB.
We used the mutate function to create new variables for the soft skills category and preserve existing variables from the summary column, and turned on case insensitivity.
softskills <- df_final %>%
mutate(workingremote = grepl("working remote", summary, ignore.case=TRUE)) %>%
mutate(communication = grepl("communicat", summary, ignore.case=TRUE)) %>%
mutate(collaborative = grepl("collaborat", summary, ignore.case=TRUE)) %>%
mutate(creative = grepl("creativ", summary, ignore.case=TRUE)) %>%
mutate(critical = grepl("critical", summary, ignore.case=TRUE)) %>%
mutate(problemsolving = grepl("problem solving", summary, ignore.case=TRUE)) %>%
mutate(activelearning = grepl("active learning", summary, ignore.case=TRUE)) %>%
mutate(hypothesis = grepl("hypothesis", summary, ignore.case=TRUE)) %>%
mutate(organized = grepl("organize", summary, ignore.case=TRUE)) %>%
mutate(judgement = grepl("judgement", summary, ignore.case=TRUE)) %>%
mutate(selfstarter = grepl("self Starter", summary, ignore.case=TRUE)) %>%
mutate(interpersonalskills = grepl("interpersonal skills", summary, ignore.case=TRUE)) %>%
mutate(atttodetail = grepl("attention to detail", summary, ignore.case=TRUE)) %>%
mutate(visualization = grepl("visualization", summary, ignore.case=TRUE)) %>%
mutate(leadership = grepl("leadership", summary, ignore.case=TRUE)) %>%
select(
job_title, company_name, workingremote, communication, collaborative, creative, critical, problemsolving,
activelearning, hypothesis, organized, judgement, selfstarter, interpersonalskills, atttodetail,
visualization, leadership)
summary(softskills) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width = "800px", height = "200px")

| job_title | company_name | workingremote | communication | collaborative | creative | critical | problemsolving | activelearning | hypothesis | organized | judgement | selfstarter | interpersonalskills | atttodetail | visualization | leadership | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:1303 | Length:1303 | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | |
| Class :character | Class :character | FALSE:1302 | FALSE:445 | FALSE:650 | FALSE:1066 | FALSE:1105 | FALSE:1172 | FALSE:1303 | FALSE:1255 | FALSE:1217 | FALSE:1297 | FALSE:1301 | FALSE:1175 | FALSE:1206 | FALSE:946 | FALSE:1024 | |
| Mode :character | Mode :character | TRUE :1 | TRUE :858 | TRUE :653 | TRUE :237 | TRUE :198 | TRUE :131 | NA | TRUE :48 | TRUE :86 | TRUE :6 | TRUE :2 | TRUE :128 | TRUE :97 | TRUE :357 | TRUE :279 |
Applied the summarise_all function to all (non-grouping) columns.
softskills2 <- softskills %>%
select(-(1:2)) %>%
summarise_all(sum) %>%
gather(variable,value) %>%
arrange(desc(value))

Visualized the most in-demand soft skills:
ggplot(softskills2,aes(x=reorder(variable, value), y=value)) + geom_bar(stat='identity',fill="green") + xlab('') + ylab('Frequency') + labs(title='Soft Skills') + coord_flip() + theme_minimal()

Communication, collaboration, and visualization are the most in-demand soft skills according to Indeed job posts, with the least in-demand soft skill being active learning.
We used the mutate function to create new variables for the hard skills category and preserve existing variables from the summary column, and turned on case insensitivity.
hardskills <- df_final %>%
mutate(machinelearning = grepl("machine learning", summary, ignore.case=TRUE)) %>%
mutate(modeling = grepl("model", summary, ignore.case=TRUE)) %>%
mutate(statistics = grepl("statistics", summary, ignore.case=TRUE)) %>%
mutate(programming = grepl("programming", summary, ignore.case=TRUE)) %>%
mutate(quantitative = grepl("quantitative", summary, ignore.case=TRUE)) %>%
mutate(debugging = grepl("debugging", summary, ignore.case=TRUE)) %>%
select(job_title, company_name, machinelearning, modeling, statistics, programming, quantitative, debugging)
summary(hardskills) %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| job_title | company_name | machinelearning | modeling | statistics | programming | quantitative | debugging | |
|---|---|---|---|---|---|---|---|---|
| Length:1303 | Length:1303 | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | Mode :logical | |
| Class :character | Class :character | FALSE:585 | FALSE:457 | FALSE:642 | FALSE:768 | FALSE:939 | FALSE:1298 | |
| Mode :character | Mode :character | TRUE :718 | TRUE :846 | TRUE :661 | TRUE :535 | TRUE :364 | TRUE :5 |
Applied the summarise_all function to all (non-grouping) columns.
hardskills2 <- hardskills %>%
select(-(1:2)) %>%
summarise_all(sum) %>%
gather(variable,value) %>%
arrange(desc(value))

Visualized the most in-demand hard skills:
ggplot(hardskills2,aes(x=reorder(variable, value), y=value)) +
geom_bar(stat='identity',fill="green") +
xlab('') +
ylab('Frequency') +
labs(title='Hard Skills') +
coord_flip() +
theme_minimal()

Modeling and machine learning are the most in-demand hard skills according to Indeed job posts, with the least in-demand hard skill being debugging.
We used the summary column to create a word cloud of data science skills. To begin, we specified the removal of irrelevant words and stopwords that don’t add context.
datacloud <- Corpus(VectorSource(df_final$summary))
datacloud <- tm_map(datacloud, removePunctuation)
datacloud <- tm_map(datacloud, tolower)
datacloud <- tm_map(datacloud, removeWords, c("services", "data", "andor", "ability", "using", "new", "science", "scientist" , "you", "must", "will", "including", "can", stopwords('english')))

Then, we visualized our data science word cloud:
wordcloud(datacloud,
max.words = 80,
random.order = FALSE,
scale=c(3,.3),
random.color = FALSE,
colors=palette())

The supervised approach showed us which of the data science skills we looked up were the most and least in-demand – and therefore, we assume, the most valuable:
Most in-demand
Least in-demand
The list of most in-demand skills is consistent with the tenor and topics of conversation in the field. While motivation, independence, and continuous learning do seem like important requisites for success in data science, these either appear relatively less often in our sample or are conveyed in language opaque to our analysis.
These findings can inform not only how job seekers position their experience and skills, but also how learning programs and bootcamps structure their curricula. Additionally, this information could help prospective employers of data scientists understand how to communicate their own requirements, how their requirements compare with others hiring in the marketplace, or even what their competitive set looks like.
Hard skills scored higher in most situations. “Machine” and “learning”, “statistics”, and “analysis” appeared high in most of our graphs; soft and technical skills didn’t appear at all. However, the results should not be considered conclusive, as the TF-IDF approach lacked any proper context. When the corpus was broken down by cities, the names of companies became the highest-scoring terms.
We attempted to make the results more coherent by lowering the sparsity, but this method was arbitrary in nature. This makes sense, as TF-IDF is often used in conjunction with other algorithms. For example, we might have been able to get the context we needed by feeding our TF-IDF matrices into a k-means algorithm and seeing how certain words grouped together. In this way, we could prune our corpuses to focus on the words that are important to data scientist jobs specifically, as opposed to job posts in general.
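As a rough illustration of that follow-up idea (not something we ran for this project), the sketch below clusters the terms of tdm.80 from the earlier chunk by their tf-idf profiles, assuming that object is still in scope; the choice of five clusters and the seed are arbitrary and would need tuning.

tfidf.matrix <- as.matrix(tdm.80)               # rows = terms, columns = job postings
set.seed(607)
term.clusters <- kmeans(tfidf.matrix, centers = 5)
# Terms that land in the same cluster tend to co-occur in similar postings,
# which could help separate data-science vocabulary from generic job-ad language.
split(rownames(tfidf.matrix), term.clusters$cluster)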
The sentiment analysis did not seem to be biased towards either soft or hard skills, with terms such as “data”, “experience”, “technical”, “modeling”, “development”, “python” and “team” coming in near the top. It did, however, seem to be confused by terms that the supervised approach would never include as “relevant”. For example, terms like “including” and “national” were found near the top of the list; these terms are likely artifacts of the process rather than important data science terms.
There did not appear to be any coherent difference between top- and bottom-rated companies, which made drawing a meaningful conclusion from any comparison between the two sets difficult. One thing of note was that the distribution for “positive sentiment” companies had significantly more right-skew in the frequency of terms being reported as relevant. More analysis would likely be required to uncover the meaning behind this observation.
At a high level, the sentiment testing was likely a bit mis-specified: a generic sentiment mapping was used, and it was applied to individual words (i.e., ignoring the context of the word within the sentence), which may be less than optimal for the specific task here.
Future sentiment-related work should include an examination of n-grams, sentences, paragraphs, or even entire summaries so as to capture the full context of what is being said. A custom list of stop words would also be beneficial, as we found many words that are not traditionally considered stop words but which recur in almost every job posting and thus provide little value.
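A possible starting point for that future work is sketched below, assuming df_final and the tidytext stop_words data are available; custom_stops is a hypothetical list that would in practice be built by inspecting the output.

# Count bigrams in the summaries after dropping standard and custom stop words.
custom_stops <- c("experience", "position", "including", "job", "work")   # hypothetical list
df.bigrams <- df_final %>%
  unnest_tokens(bigram, summary, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% c(stop_words$word, custom_stops),
         !word2 %in% c(stop_words$word, custom_stops)) %>%
  count(word1, word2, sort = TRUE)
head(df.bigrams, 10)   # pairs like "machine learning" should surface near the top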
Running parallel workstreams added depth to our analysis. It allowed us to compare and contrast the outputs of the supervised and unsupervised methods – a neat and tangible learning opportunity for the team.
Context matters. All data science methods – supervised or unsupervised – require that the user has background knowledge and context to determine whether the output is valuable. We can’t assume that the results of any method will be salient on their own, especially if the method is unsupervised.
Collaboration is key in a data science project. For a group of six strangers who are (mostly) not co-located and must work virtually (across time zones), it’s important to have regular check-ins to monitor progress and correct course, if necessary. It’s also key to align early on overall workflow and timeline, roles / responsibilities in the process, and how next steps evolve. It’s good to make project management and decision frameworks explicit, though not doing so did not present issues on this project (because this team is awesome). While we stayed on track through well-attended meetings held every few days, a project plan would be a good idea for our next collaboration.
While touched on above, we want to make clear the assumptions we held at each step of the process. These assumptions ultimately informed the direction of our approach, the content of our analysis, and the conclusions we could draw.
“Skills” refer to hard skills, soft skills, and technical skills of the individual data scientist.
We can determine a skill’s value by looking at job postings – specifically, job postings on Indeed for the search term “Data Scientist”. Employers list the skills that they need and value most.
Job postings have the most up-to-date information on real employer needs, compared to other data sources such as data science course curricula, surveys of data scientists, and the Reddit page for data science.
The data we scraped is representative of all jobs. Indeed data is comparable to other job posting websites.
The moment we scraped data can be longitudinally extrapolated – it isn’t an outlier. What we scrape is expected to be valid right now, but not necessarily into the future.
All of the sections listed in a job post – and not just the ones related to the potential employee – are useful in identifying the skills that employers most value. We arrived at this assumption after reviewing a random sample of postings and finding valuable information throughout the summaries.
We kept observations discrete – as a corpus – rather than as a single string. We also kept special characters and stopwords. This allowed the downstream user to decide what was important.
Overall
Supervised Approach