1. Background

In this project, we used supervised and unsupervised data mining techniques on a scraped dataset of 1,303 Indeed job listings to answer the following question:

What are the most valued data science skills?

We collaborated as a team to understand the question, get the data, clean it, analyze it, and draw conclusions. We used Slack, Google Docs, Google Hangouts, GitHub, and in-person meetings to work together, and we gave pair programming – live and virtual – a shot, too.


Team Rouge (Group 4)


Process

  • Data Acquisition — Iden and Paul

  • Data Cleaning — Jeremy and Kavya

  • Unsupervised Analysis — Iden and Paul

  • Supervised Analysis — Rickidon and Violeta

  • Conclusions — Whole Team


Libraries

library(rvest)
library(RCurl)
library(plyr)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(tm)
library(wordcloud)
library(tidytext)
library(xtable)
library(readr)
library(knitr)
library(kableExtra)



2. Approach

To motivate our approach to data collection and analysis, we began with the concepts of “skills” and of “value.”

A. Definitions

What are the most valued data science skills?

Skills

As discussed in our class, data science requires not only multiple skills but multiple categories of skills. The many fields and industries where data science is applied likely group these skills into different categories, but after some desk research and discussion we felt that, in addition to toolsets (software and platforms), both “hard” (analytical and technical) and “soft” (communicative and collaborative) skills are important.

We used the following list of categories – found in an article on Data Scientist Resume Skills – as a basis for our supervised analyses.
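
Such a list can be represented as simple keyword vectors in R. The sketch below is illustrative only – it is seeded with a handful of the terms used in our supervised analysis (Section 7) rather than the article’s full list:

skill.categories <- list(
  hard  = c("machine learning", "modeling", "statistics", "programming", "quantitative"),
  soft  = c("communication", "collaboration", "leadership", "attention to detail"),
  tools = c("R", "Python", "SQL", "Hadoop", "Spark")
)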


Value

To avoid wading into philosophical abstractions, we interpreted value in its economic sense – that is, which skills are sought after and/or rewarded in the marketplace.


B. Data Source

As the economic value of data science skills is not directly measurable, we discussed three different approaches to getting a dataset:

  • Mining existing custom research on data scientists (like that found here).

  • Analyzing online discussion boards focused on data science (like this one on Reddit). While threads can provide a historical record (i.e., the evolution of value), there are potential compromises in data quality and bias (whether due to fanboys, trolls, or a silent majority), and informational content does not necessarily accord with economic value.

  • Scraping online job postings for data science roles provides perspective on what skills employers emphasize and prioritize. This third approach has its limitations: there are multiple platforms (Glassdoor, LinkedIn, Monster, CareerBuilder, etc.), none of which has a complete view of the marketplace, and scraping time-delimited job postings captures a moment in time without any longitudinality.

We dismissed custom research as it didn’t seem to accord with the intent of the project. We thought that exploring online discussion boards could be a valuable alternative, fallback, or follow-up analysis. We agreed to focus on job postings.

Constraints of the data source notwithstanding, testing what signals of “skill value” (i.e., the frequency of terms related to data science skills) could be detected in job postings seemed a good approach to this project, and one that allowed us to meet the technical requirements and collaborate.

After some exploration, we decided to focus on Indeed.com, which has a wealth of data science job postings that can be scraped. We scraped them – first a test set for evaluation and troubleshooting, then a larger, more robust set – to be cleaned and analyzed. We initially used Python, and later replicated the scraper in R.


C. Analysis

We felt that the project could benefit from a two-pronged approach to analysis:

  1. A more prescriptive, supervised approach based on cross-referencing job summaries with categorized lists of terms and calculating the frequency of recurring keywords. To prove the concept, we used the “hard,” “soft,” and “tools” lists referenced above as we found them.

  2. A more exploratory, unsupervised approach based on TF-IDF (term frequency-inverse document frequency) and sentiment analysis, which doesn’t semantically impose preconceived keywords on the job postings (beyond filtering out stop words).

To streamline our process, we conducted the two analyses in parallel, cleaning and preparing the data for both. We iterated and collaborated on the scraper, cleaning, and analysis using Slack and GitHub.



3. Data Acquisition

A. Note

This scraper is working code; however, we’ve disabled it here as it can take a while to run. It’s provided as a working demonstration of how our data was collected. All the actual work for this project was completed on a static data set which we collected early in our efforts. This was done to ensure that all group members were always working with identical data and that any user could reproduce our results as desired.

The following chunk of code scrapes job postings from indeed.com and collects the results into a dataframe. It’s a port of the Python code originally used to scrape our data set.


B. Set the variables

First we’ll set a few variables that we’ll use in our scraping activity. We’ve used a smaller set of cities just to demonstrate how it works.

city.set_small <- c("New+York+NY", "Seattle+WA")

city.set <- c("New+York+NY", "Seattle+WA", "San+Francisco+CA",
              "Washington+DC","Atlanta+GA","Boston+MA", "Austin+TX",
              "Cincinnati+OH", "Pittsburgh+PA")


target.job <- "data+scientist"   

base.url <- "https://www.indeed.com/"

max.results <- 50


C. Scrape the Details

Indeed.com appears to use the “GET” request method, so we can manipulate the URL directly to get the data that we want. We’re going to iterate over our target cities and scrape the particulars for each job – this includes getting the links to each individual job page so that we can also pull the full summary.
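
For example, plugging the variables set above into the URL pattern used in the loop below (the page offset of 10 is arbitrary):

paste(base.url, "jobs?q=", target.job, "&l=", "New+York+NY", "&start=", 10, sep = "")
## [1] "https://www.indeed.com/jobs?q=data+scientist&l=New+York+NY&start=10"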


D. Get the full Summary

After the above is complete, we’re going to iterate over all the links that we’ve collected, pull each one, and grab the full job summary. Note that job postings are sometimes removed, in which case we pull an empty variable. We could probably do some cleaning in this step while downloading, but we’re going to handle that downstream.

#create a df to hold everything that we collect
jobs.data <- data.frame(matrix(ncol = 8, nrow = 0))
n <- c("city","job.title","company.name","job.location","summary.short","job.salary","links","summary.full")
colnames(jobs.data) <- n


for (city in city.set_small){
  print(paste("Downloading data for: ", city))

  
  for (start in seq(0, max.results, by = 10)){  # seq() is R's analogue of Python's range()

    url <- paste(base.url,"jobs?q=",target.job,"&l=",city,"&start=", start ,sep="")
    page <- read_html(url)
    Sys.sleep(1)
  
    #recorded the city search term << not working yet...
    #i<-i+1
    #job.city[i] <- city
  
    #get the links
    links <- page %>% 
      html_nodes("div") %>%
      html_nodes(xpath = '//*[@data-tn-element="jobTitle"]') %>%
      html_attr("href")
    
  
    #get the job title
    job.title <- page %>% 
      html_nodes("div") %>%
      html_nodes(xpath = '//*[@data-tn-element="jobTitle"]') %>%
      html_attr("title")
  
    #get the company name
    company.name <- page %>% 
      html_nodes("span")  %>% 
      html_nodes(xpath = '//*[@class="company"]')  %>% 
      html_text() %>%
      trimws()
  
    #get the job location
    job.location <- page %>% 
      html_nodes("span") %>% 
      html_nodes(xpath = '//*[@class="location"]') %>% 
      html_text() %>%
      trimws()
    
    #get the short summary
    summary.short <- page %>% 
      html_nodes("span")  %>% 
      html_nodes(xpath = '//*[@class="summary"]')  %>% 
      html_text() %>%
      trimws()
    
  }
  
  #create a structure to hold our full summaries
  summary.full <- rep(NA, length(links))
  
  #fill in the job data
  job.city <- rep(city,length(links))
  
  #add a place-holder for the salary
  job.salary <- rep(0,length(links))
  
  #iterate over the links that we collected
  for ( n in 1:length(links) ){
    
    #build the link
    link <- paste(base.url,links[n],sep="")
    
    #pull the link
    page <- read_html(link)
  
    #get the full summary
    s.full <- page %>%
     html_nodes("span")  %>% 
     html_nodes(xpath = '//*[@class="summary"]') %>% 
     html_text() %>%
     trimws()
  
    #check to make sure we got some data and if so, append it.
    #as expired postings return an empty var
    if (length(s.full) > 0 ){
        summary.full[n] = s.full  
        } 
  
    }
  
    #add the newly collected data to the jobs.data
    jobs.data <- rbind(jobs.data,data.frame(city,
                                            job.title,
                                            company.name,
                                            job.location,
                                            summary.short,
                                            job.salary,
                                            links,
                                            summary.full))

    
}



4. Data Cleaning

The previous step resulted in a raw CSV file with over 1,300 rows. To clean the data, we first read in the CSV file, tested a cleaning procedure on a 100-row sample, and then applied the procedure to the full dataset.

A. Read in the dataframe

Read in the raw dataframe, setting the separator as a pipe.

url <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Projects/Project%2003/indeed_jobs_large.csv"

df <- read.csv(url, sep="|", stringsAsFactors = F)


Removed “location” and “salary” columns, to reduce redundancy.

df <- df[, -c(5,7)]


B. Test cleaning procedure

Took 100-row sample of full dataset.

sample <- df[sample(1:nrow(df), 100, replace=F),]


Removed brackets surrounding summaries.

sample1 <- sample %>% separate(summary_full, c("bracket", "new_summary"), sep="^[\\[]", remove=T, convert=F) %>%
                      separate(new_summary, c("summary_full", "bracket"), sep="[\\]]$", remove=T, convert=F)

sample1 <- sample1[, -c(5, 8)]


Renamed column headers.

names(sample1) <- c("list_ID", "city", "job_title", "company_name", "link", "summary")


Removed state and plus signs from city column.

# Separate City column into City and State by pattern of two uppercase letters after a plus sign (i.e., "+NY")
sample2 <- sample1 %>% separate(city, c("city", "state"), sep="[\\+][[:upper:]][[:upper:]]$", convert=T)

# Remove empty State column
sample2 <- sample2[, -c(3)]

# Replace plus signs with spaces
sample2$city <- str_replace_all(sample2$city, "[\\+]", " ")


Removed rows where summary is blank.

sample3 <- filter(sample2, sample2$summary!="")

head(sample3, 2) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "800px", height = "200px")
list_ID city job_title company_name link summary
1156 Cincinnati Research Scientist (Relocation) Novateur Research Solutions https://www.indeed.com//pagead/clk?mo=r&ad=-6NYlbfkN0A1UtxlOJl6Y6F02-oTF9zwI-Nb9Dq9qDYX5XseiH9toZY47T_vf17yZekOKh6vKrDDCDFX3Wa6lCO1lqbbMub-n5iRQsliyvER1hj8ydFeCvr1dDdzSE2BRg0jsitAerWdpH4lwQy2YN10RaU4x7razrc-N-TWOgAyoXesX5yFmLmikvJOrRM-mJ2QsmncrEVzjnqE8VgKeF_x0lsgGnZN2GlJjM5j94KuMaZhknKONUmt3NyzS3B840l_G_X-KjfhwX--3eVKBuquaptQCQsYAcJ3v2p-KoEpvuZ1wwIJFoB6z2I9zVukI5GWav1HwGDD1kr6iMGqn-ux9tpldnJaFtcL4IQW_f6z8ykbTV8RsTnwlmLwcvh-92bR_meG52AyxJt-ZIKvIqeCz8ahaxpCVU6t24-WRy0Q7DOX1cNlfqlDB5Irejsjy22fJ0AKWyQ=&vjs=3&p=2&sk=&fvj=1 RESEARCH SCIENTIST JOB SUMMARYNovateur Research Solutions is looking for research scientists and senior research scientists to develop cutting edge technologies and products. The successful candidates will be self-motivated individuals with strong background in one or two of the following areas: computer vision, image processing, cyber security, and machine learning.RESEARCH SCIENTIST RESPONSIBILITIESCollaborate with other researchers at Novateur and academia to solve challenging operational problems for government and commercial clients in above-mentioned areas.Develop novel algorithms leveraging the state-of-the-art technologies.RESEARCH SCIENTIST REQUIRED QUALIFICATIONSMasters or PhD. Degree in computer science, engineering, physics, applied mathematics or a related field with a focus on computer vision, image processing, cyber security or machine learningExcellent understanding of latest computer vision and machine learning technologies.Must be proficient in C/C++ and/or Python in Windows and Linux environment.Ability to work in a dynamic and fast-paced environment.Passion for working on cutting-edge technologies.Team player with excellent written and oral communication skills.RESEARCH SCIENTIST DESIRED QUALIFICATIONSPhD. DegreeStrong record of research publications.Experience with OpenCV, ROS, deep-learning packages, and similar tools.Experience with hardware optimization such as GPU programming using CUDA.Experience with development and prototyping of real-time systems.COMPANY BENEFITSNovateur offers competitive pay and benefits including a wide choice of healthcare options with generous company subsidy, 401(k) with generous employer match, paid holidays and paid time off increasing with tenure, and company paid short-term disability, long-term disability, and life insurance.About Novateur Research Solutions: Novateur is a research and development company providing innovative solutions for customers in defense, civilian government and commercial industries. Our focus areas include computer vision, machine learning, robotics and big data mining. We work at the forefront of innovation to enable transition from ideas to market by providing our customers the right blend of enabling scientific solutions, technologies, and product development expertise. We are located in Northern Virginia in the historic district of Leesburg. We offer a work environment which fosters individual thinking along with collaboration opportunities within and beyond Novateur. In return, we expect a high level of performance and passion to deliver enduring results for our clients.Novateur is an Equal Opportunity Employer. 
This company does not and will not discriminate in employment and personnel practices on the basis of race, sex, age, handicap, religion, national origin or any other basis prohibited by applicable law.Job Type: Full-timeJob Type: Full-timeRequired experience:Machine Learning: 1 yearPython or C++ programming: 1 yearComputer Vision: 1 yearRequired education:DoctorateRequired license or certification:US Permanent Residency or Citizenship
1234 Pittsburgh Data Scientist rue21 https://www.indeed.com//rc/clk?jk=b9727f45a638d7e3&fccid=24d0f67f23343905&vjs=3 Overviewrue21 is looking for an experienced Data Scientist. We are looking for an individual who will take ownership of data mining, building predictive models, and driving the construction of data insight products at rue21. The right candidate should be a self-motivated, highly detail-oriented team-player with a positive drive to provide insight to rue21 business partners. The right candidate will have a passion for discovering solutions hidden in large data sets and working with stakeholders to improve business outcomes.ResponsibilitiesWork with stakeholders throughout the organization to identify opportunities for leveraging company data to drive business solutions.Coordinate with different functional teams to implement models and monitor outcomes.Deliver insight to leadership, helping drive strategic business decisions.Execute data mining activities providing insight from datasets.Build accurate statistical supervised & unsupervised predictive models.Develop processes and tools to monitor and analyze model performance and data accuracy.Qualifications5+ years of experience with several years in hands on experience manipulating data sets and building statistical models.Knowledge of advanced statistical techniques and concepts (regression, properties of distributions, statistical tests and proper usage, etc.) and experience with applicationsExperience querying databases and using statistical computer languages: R, Python, etc.A strong analytical and quantitative analysis mindsetAbility to think strategically and translate plans into phased actions in a fast paced, high pressure environmentDimensional and/or Multidimensional modeling experience is a plusKnowledge of BI related principles such as ETL, data modeling, & data warehousingBS/MS degree in Statistics, Mathematics, Computer Science or equivalent/related degree.


C. Apply cleaning procedure to full dataset

Removed brackets surrounding summaries.

df1 <- df %>% separate(summary_full, c("bracket", "new_summary"), sep="^[\\[]", remove=T, convert=F) %>%
              separate(new_summary, c("summary_full", "bracket"), sep="[\\]]$", remove=T, convert=F)

df1 <- df1[, -c(5, 8)]


Renamed column headers.

names(df1) <- c("list_ID", "city", "job_title", "company_name", "link", "summary")


Removed state and plus signs from city column.

# Separate city column into city and state by pattern of two uppercase letters after a plus sign (i.e., "+NY")
df2 <- df1 %>% separate(city, c("city", "state"), sep="[\\+][[:upper:]][[:upper:]]$", convert=T)

# Remove empty State column
df2 <- df2[, -c(3)]

# Replace plus signs with spaces
df2$city <- str_replace_all(df2$city, "[\\+]", " ")


Removed rows where summary is blank.

df_final <- filter(df2, df2$summary!="")
write.csv(df_final, "indeed_final.csv", row.names = FALSE)

head(df_final, 2) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "800px", height = "200px")
list_ID city job_title company_name link summary
1 New York Data Scientist AbleTo, Inc. https://www.indeed.com//pagead/clk?mo=r&ad=-6NYlbfkN0DDxuA8Y4K3JnPiGV4kjt8LJAX0ZelysMhEJeNM_3_rWb_L4BVNF4KpDHXkayWIYw5H919b29Wv9kQgd-mQGWEY-QQXRpTL5rlsZ7_n6AWh5RptzR11B6ZJnyJplt7VTSzq1CsFIpNQMytVyIBVMSrguwd-ESHqspczWm_AdnUxSs1jiwfYm9e3e6AT-Hh20UGh0KJi238J21XagkNbN-QSCV_9RqzVw6HbR-OEHLPDucMHuaKN_gpW6UiDyjfm9he2EbTXP8Rgkx-e3GecFU2APK8l5g0ymwqGZL_hEDQEXV_KCaQj10Dyd4spfmQ6-j1RHkdoOudRJ0SRuoIRiPTchuCTcuQkwtd3m8kmIFcdVjsaT9wxfgYx0pWhP32vCKTfgUSE5Ze7ePGboPkMAkLMaOYkaRUqFtt4g34rxsAyjHfpUr8XszxgN6pyyvexN_0qEzEkIQ46oO30CzvOcnYXJPKhYtRhvtFcPiklsBqwWkbTMmDmRUE5TIw22dDZ2kAzGQubdRYnp6KnbgP0mfeHzCnDljLM8wcus3PTL101_j1VZ-Z0h897BHWzlKMcHYSNmT4-oxr4d5bxD_BuSQiLqRe5nQlujMoZ2IsWl8LOIZKiN556aul-NqrzquBk3sLy-nEegXA6fQ==&vjs=3&p=1&sk=&fvj=0 OverviewAbleTo Inc. combines best-in-class patient engagement with behavior change treatment programs that allow health plans and plan sponsors to improve health outcomes, while lowering overall spending, for high-cost medical populations. Benefitting groups include heart patients and diabetics, as well as those suffering from depression/anxiety and chronic pain. All sixteen of our programs utilize best practice protocols and are delivered nationally by teams of licensed therapists and behavioral coaches. Program participants experience improved health, better recovery and in the case of heart patients, fewer hospitalizations.ResponsibilitiesData & Analytics TeamThe Data & Analytics team plays a key role at AbleTo. Data is part of nearly everything that we do. The Data & Analytics team is responsible for storing data, making it useful and actionable for business managers, and using it to validate the impact of our programs and inform organizational improvements.As a Data Scientist at AbleTo, you will work in a uniquely cross-functional capacity, helping teams across the organization apply analytical rigor to their decision-making:With our Research team, you will build claims-based predictive models to identify patients who can benefit from our programs and drive value for our clientsWith our Engagement team, you will help optimize a coordinated, multi-channel patient outreach processWith our Account Management team, you will develop reporting that demonstrates how AbleTo is helping clients manage their patient populationsWith our Clinical team, you will ensure that our therapist network is operating efficiently and is structured appropriately to meet demandWith our Product team, you will use data-driven insights to design new product features for our platformWith our Engineering team, you will help manage our data stack and ETL processes that feed our data warehouseMembers of the Data & Analytics team are:Both technically and strategically savvyAble to sift through large amounts of data and extract insightsAble to present recommendations to non-technical audiencesHighly organized and able to manage time across multiple competing prioritiesFocused on building flexible, durable, and well-designed solutionsFocused on preserving quality controlWilling to experiment with new tools / techniquesData & Analytics team culture:Collaboration: You will work with teams across the organization to both design and build data-driven solutionsInnovation: Our team prioritizes innovation over precedent. We’re always looking for new, more efficient ways of doing thingsHigh-Impact Focus: Our team receives a lot of requests for new analyses, processes, reports, etc. 
It is our job to prioritize these requests with our stakeholders and invest time in projects with the highest potential value-add for the organization. You will be actively involved in making these decisionsQualificationsTechnical SkillsStrong proficiency in SQL and R/PythonExperience working on cloud-based platforms like AWS/GCSComfortable with variety of analytical techniques:Predictive modeling (e.g. regularized regressions, decision trees / random forests, association rules mining)Optimization (e.g. linear programming)Exploratory analysis (e.g. clustering, PCA)Some familiarity with BI tools such as Looker or TableauEducation / ExperienceBA in Operations Research, Computer Science, or other related discipline2-3 years work experience is strongly preferredAdditionally to the technical requirements for this specific position, AbleTo seeks candidates that will demonstrate:Personal ownership of assignments and responsibilitiesResilience and grit to ensure task completion even in the face of adversityDiscipline and organization to handle multiple tasks simultaneouslyGreat team playing skillsHigh levels of energy and positive ambitionA healthy balance of curiosity, humility and assertivenessExcellent written and verbal communication skillsProfessional attitude and demeanorAbleTo is committed to equal employment opportunities (EEO) and employs qualified persons without regard to race, color, religion, national origin, sex, age, handicap, or any other classification protected by federal, state, or local laws.AbleTo is an E-Verify Company
2 New York Data Scientist Shore Group Associates, LLC https://www.indeed.com//pagead/clk?mo=r&ad=-6NYlbfkN0AmtW71uMJ-FMTnSgQAi6MO2hfu514W2To_ok1EkQDsLzCSnVx2dJJdWh_eVltX3NiKTJsOZ1PdrHmGwruy3Gwutw1Y3myrtnW-EAYSCQP1_EOIcyntJdxtj6FPF62TyGkihYDHIjTCVu_fBirizqhYKRHGco3FiaFPo1aadANl5b8sxxh_xfPnT7IgmCq1uhzhaBEqvJTIOxzME67gwkFDQMqxfRy5NeNlACthstIIJrrNKGb_4rHddOAIBkREr7GI5VMt-mMTREeanvd2N26PKfQEgntwy8IRwFCIBN2KLrb4LrABKQS4hpxFDEeLxkLy_brMWIhE5yVTBHezMc1KQdnpROfTZk9-mjnfIkcuCQ5m8avDk2OYqqcGBMs15Nw_jrnylFns0l3hlK9gyQa-a51XJpetjmCi6VJL-lVn1fC2_jm2_2cU&vjs=3&p=2&sk=&fvj=1 Description -Our team is building machine learning tools to help determine predict information about donors to a given foundation. This includes the likelihood of donation, the size of the donation, when is best to reach out to a given possible donor, and how to reach out. We are looking for data scientists who are not only interested in plugging data into a model, but also understanding the data like the back of their hand.Requirements -Analyze requirements and formulate an appropriate technical solution that meets functional and non-functional requirements.Experience with large datasets in the 100’s of GBFundamental/broad understanding of data mining and predictive analytics techniques2+ years of Data Science Experience and a deep knowledge of various modeling techniques.Python (Pandas, SKLearn, NumPy, Matplotlib), SQL, GitStrong communication skills - both verbal and written – is a must.Strong experience with predictive regression modelsStrong software engineering skillsJob Type: Full-timeExperience:Data Science: 3 years (Required)


We are left with a dataset called df_final that has 1,303 job listings.



5. TF-IDF Analysis

A. About TF-IDF

TF-IDF stands for “term frequency-inverse document frequency”. It is calculated by taking the term frequency of a word, \(tf(t,d)\), and multiplying it by its inverse document frequency, \(idf(t,D)\), so that how frequently a word appears in a document is offset by how common that word is across the corpus.

For example, a word might appear so frequently in one chapter of a book that its raw frequency puts it in the top 10 words, but TF-IDF discounts that word by the fact that it appears in only one chapter of, say, a hundred-chapter textbook.
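
In symbols, for a term \(t\), a document \(d\), and a corpus \(D\) of \(N\) documents, the weight is

\[
\text{tf-idf}(t,d,D) \;=\; tf(t,d) \times idf(t,D),
\qquad
idf(t,D) \;=\; \log\frac{N}{\lvert\{\,d \in D : t \in d\,\}\rvert},
\]

so the rarer a term is across the corpus, the larger its \(idf\) weight.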


B. Create control List

tfidf <- read.csv("indeed_final.csv", stringsAsFactors = FALSE)

# Make all job titles lower case for later
tfidf$job_title <- tolower(tfidf$job_title)

# Control list to be used for all corpuses
control_list <- list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE,
                     weighting = weightTfIdf)


C. TF-IDF on All Job Postings

corpus.all <- VCorpus(VectorSource(tfidf$summary))

tdm.all <- TermDocumentMatrix(corpus.all, control = control_list)

# Remove outliers consisting of very rare terms
tdm.80 <- removeSparseTerms(tdm.all, sparse = 0.80)

# Sum rows for total & make dataframe
df_all <- tidy(sort(rowSums(as.matrix(tdm.80))))
colnames(df_all) <- c("words", "count")

# Graph
ggplot(tail(df_all, 25), aes(reorder(words, count), count)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "TF-IDF of Indeed Job Postings",
       x = "Words", y = "Frequency") +
  coord_flip()


D. Sparsity

First, a note on sparsity: sparsity roughly controls how rare a term can be before it is dropped. If we run removeSparseTerms(tdm, sparse = 0.99), it will remove only the rarest words – those that appear in fewer than 1% of the documents in the corpus. On the other hand, with removeSparseTerms(tdm, sparse = 0.01), only words that appear in nearly every document of the corpus will be kept.

For most analyses, we found that a sparsity of 80% was most useful. Sparsity above 80% often admitted words that were important to job postings as a whole, rather than to the data science specifics we wanted for the purposes of our question.
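
As a quick illustration of the effect (assuming the tdm.all object from above is still in memory; dim() reports terms by documents):

dim(tdm.all)                                    # every term in the corpus
dim(removeSparseTerms(tdm.all, sparse = 0.99))  # drops terms appearing in fewer than ~1% of postings
dim(removeSparseTerms(tdm.all, sparse = 0.80))  # keeps only terms appearing in at least ~20% of postings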

When each job posting is treated as an individual document, skills like “machine learning”, “analytics”, “statistics/statistical”, and “models/modeling” emerge as very important for data scientists to have.


E. TF-IDF on Job Postings by Cities

# Trying to divide the corpus by cities
nyc <- paste(tfidf[tfidf$city == "New York", 6], collapse = " ")
sea <- paste(tfidf[tfidf$city == "Seattle", 6], collapse = " ")
sf <- paste(tfidf[tfidf$city == "San Francisco", 6], collapse = " ")
dc <- paste(tfidf[tfidf$city == "Washington", 6], collapse = " ")
atl <- paste(tfidf[tfidf$city == "Atlanta", 6], collapse = " ")
bos <- paste(tfidf[tfidf$city == "Boston", 6], collapse = " ")
aus <- paste(tfidf[tfidf$city == "Austin", 6], collapse = " ")
cin <- paste(tfidf[tfidf$city == "Cincinnati", 6], collapse = " ")
pitt <- paste(tfidf[tfidf$city == "Pittsburgh", 6], collapse = " ")

cities <- c(nyc, sea, sf, dc, atl, bos, aus, cin, pitt)

corpus.city <- VCorpus(VectorSource(cities))

tdm.city <- TermDocumentMatrix(corpus.city, control = control_list)

# Make city dataframe
df_city <- tidy(tdm.city)
df_city$document <- mapvalues(df_city$document,
                              from = 1:9,
                              to = c("NYC", "SEA", "SF",
                                     "DC", "ATL", "BOS",
                                     "AUS", "CIN", "PITT"))

df_city %>%
  arrange(desc(count)) %>%
  mutate(word = factor(term, levels = rev(unique(term))),
           city = factor(document, levels = c("NYC", "SEA", "SF",
                                              "DC", "ATL", "BOS",
                                              "AUS", "CIN", "PITT"))) %>%
  group_by(document) %>%
  top_n(6, wt = count) %>%
  ungroup() %>%
  ggplot(aes(word, count, fill = document)) +
  geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
  labs(title = "Highest TF-IDF Words in Job Listings by City",
       x = "Words", y = "TF-IDF") +
  facet_wrap(~city, ncol = 2, scales = "free") +
  coord_flip()

# write.csv(df_city, "city_tfidf.csv", row.names = FALSE)


In this attempt, job postings were grouped by the cities they were listed in. When broken down this way, the companies themselves became the most important words rather than skills.


F. TF-IDF Based on Job Titles

# Data Scientist - 739 instances
ds <- tfidf[grep("data scientist", tolower(tfidf$job_title)), 6]
ds.corpus <- VCorpus(VectorSource(ds))
ds.tdm <- TermDocumentMatrix(ds.corpus, control = control_list)

ds.80 <- removeSparseTerms(ds.tdm, sparse = 0.80)
df_ds <- tidy(sort(rowSums(as.matrix(ds.80))))
colnames(df_ds) <- c("words", "count")

ggplot(tail(df_ds, 25), aes(reorder(words, count), count)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "TF-IDF of Data Scientist Job Titles",
       x = "Words", y = "Frequency") +
  coord_flip()

# Senior / Sr. - 84 instances
# Intern - 61 instances
# Senior vs Intern
# Not very illuminating
senior <- paste(tfidf[grep("senior", tolower(tfidf$job_title)), 6], collapse = " ")
intern <- paste(tfidf[grep("intern", tolower(tfidf$job_title)), 6], collapse = " ")
jrsr.corpus <- VCorpus(VectorSource(c(senior, intern)))
jrsr.tdm <- TermDocumentMatrix(jrsr.corpus, control = control_list)
df_jrsr <- tidy(jrsr.tdm)
df_jrsr$document <- mapvalues(df_jrsr$document, from = 1:2,
                              to = c("senior", "intern"))

df_jrsr %>%
  arrange(desc(count)) %>%
  mutate(word = factor(term, levels = rev(unique(term))),
           type = factor(document, levels = c("senior", "intern"))) %>%
  group_by(document) %>%
  top_n(25, wt = count) %>%
  ungroup() %>%
  ggplot(aes(word, count, fill = document)) +
  geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
  labs(title = "TF-IDF of Senior vs Junior Jobs",
       x = "Words", y = "TF-IDF") +
  facet_wrap(~type, ncol = 2, scales = "free") +
  coord_flip()

# Machine Learning - 124 instances
ml <- tfidf[grep("machine learning", tolower(tfidf$job_title)), 6]
ml.corpus <- VCorpus(VectorSource(ml))
ml.tdm <- TermDocumentMatrix(ml.corpus, control = control_list)

ml.70 <- removeSparseTerms(ml.tdm, sparse = 0.70)
df_ml <- tidy(sort(rowSums(as.matrix(ml.70))))
colnames(df_ml) <- c("words", "count")

ggplot(tail(df_ml, 25), aes(reorder(words, count), count)) +
  geom_bar(stat = "identity", fill = "green") +
  labs(title = "TF-IDF for Machine Learning Jobs",
       x = "Words", y = "Count") +
  coord_flip()

# Research - 119 instances
research <- tfidf[grep("research", tfidf$job_title), 6]
r.corpus <- VCorpus(VectorSource(research))
r.tdm <- TermDocumentMatrix(r.corpus, control = control_list)

r.80 <- removeSparseTerms(r.tdm, sparse = 0.80)
df_r <- tidy(sort(rowSums(as.matrix(r.80))))
colnames(df_r) <- c("words", "count")

ggplot(tail(df_r, 25), aes(reorder(words, count), count)) +
  geom_bar(stat = "identity", fill = "orange") +
  labs(title = "TF-IDF for Research Job Titles",
       x = "Words", y = "Count") +
  coord_flip()


Though our primary search term was “Data Scientist”, Indeed also returned other job titles; these were some of the most common. Unsurprisingly, “Data Scientist” itself matches what we see in the analysis of all job postings. We thought there might be an interesting shift between “senior” level jobs and internships, with perhaps a stronger prevalence of “soft skills” for the higher-level jobs, but did not see much evidence of that in the data.



6. Sentiment Analysis

The idea here is to take a look at the “sentiment” of the text within each job posting and use that information as a proxy for company quality. The thinking is that a higher sentiment ranking will be indicative of better company quality (a leap, to be sure, but probably acceptable given the scope of this project). We’ll then use this data to take a look at which skills are more heavily referred to by the highest (and lowest) sentiment-ranked companies.


A. Prepare the data

The first thing we’re going to do is tokenize the “summary” column of the data, which contains all the text we are interested in. This essentially amounts to parsing the column into individual words and reshaping the dataframe into a “tidy” format where each individual word (token) gets its own row.

We’ll then remove all the “stop_words” from this newly created data – words like “if”, “and”, “the”… etc.


#tokenize the summary into individual words, drop stop words
df.sent <- df_final %>%
  unnest_tokens(token, summary) %>%
  anti_join(stop_words, by=c("token" = "word")) 

head(df.sent,5) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width = "800px", height = "200px")
list_ID city job_title company_name link token
1 New York Data Scientist AbleTo, Inc. https://www.indeed.com//pagead/clk?mo=r&ad=-6NYlbfkN0DDxuA8Y4K3JnPiGV4kjt8LJAX0ZelysMhEJeNM_3_rWb_L4BVNF4KpDHXkayWIYw5H919b29Wv9kQgd-mQGWEY-QQXRpTL5rlsZ7_n6AWh5RptzR11B6ZJnyJplt7VTSzq1CsFIpNQMytVyIBVMSrguwd-ESHqspczWm_AdnUxSs1jiwfYm9e3e6AT-Hh20UGh0KJi238J21XagkNbN-QSCV_9RqzVw6HbR-OEHLPDucMHuaKN_gpW6UiDyjfm9he2EbTXP8Rgkx-e3GecFU2APK8l5g0ymwqGZL_hEDQEXV_KCaQj10Dyd4spfmQ6-j1RHkdoOudRJ0SRuoIRiPTchuCTcuQkwtd3m8kmIFcdVjsaT9wxfgYx0pWhP32vCKTfgUSE5Ze7ePGboPkMAkLMaOYkaRUqFtt4g34rxsAyjHfpUr8XszxgN6pyyvexN_0qEzEkIQ46oO30CzvOcnYXJPKhYtRhvtFcPiklsBqwWkbTMmDmRUE5TIw22dDZ2kAzGQubdRYnp6KnbgP0mfeHzCnDljLM8wcus3PTL101_j1VZ-Z0h897BHWzlKMcHYSNmT4-oxr4d5bxD_BuSQiLqRe5nQlujMoZ2IsWl8LOIZKiN556aul-NqrzquBk3sLy-nEegXA6fQ==&vjs=3&p=1&sk=&fvj=0 overviewableto
1 New York Data Scientist AbleTo, Inc. https://www.indeed.com//pagead/clk?mo=r&ad=-6NYlbfkN0DDxuA8Y4K3JnPiGV4kjt8LJAX0ZelysMhEJeNM_3_rWb_L4BVNF4KpDHXkayWIYw5H919b29Wv9kQgd-mQGWEY-QQXRpTL5rlsZ7_n6AWh5RptzR11B6ZJnyJplt7VTSzq1CsFIpNQMytVyIBVMSrguwd-ESHqspczWm_AdnUxSs1jiwfYm9e3e6AT-Hh20UGh0KJi238J21XagkNbN-QSCV_9RqzVw6HbR-OEHLPDucMHuaKN_gpW6UiDyjfm9he2EbTXP8Rgkx-e3GecFU2APK8l5g0ymwqGZL_hEDQEXV_KCaQj10Dyd4spfmQ6-j1RHkdoOudRJ0SRuoIRiPTchuCTcuQkwtd3m8kmIFcdVjsaT9wxfgYx0pWhP32vCKTfgUSE5Ze7ePGboPkMAkLMaOYkaRUqFtt4g34rxsAyjHfpUr8XszxgN6pyyvexN_0qEzEkIQ46oO30CzvOcnYXJPKhYtRhvtFcPiklsBqwWkbTMmDmRUE5TIw22dDZ2kAzGQubdRYnp6KnbgP0mfeHzCnDljLM8wcus3PTL101_j1VZ-Z0h897BHWzlKMcHYSNmT4-oxr4d5bxD_BuSQiLqRe5nQlujMoZ2IsWl8LOIZKiN556aul-NqrzquBk3sLy-nEegXA6fQ==&vjs=3&p=1&sk=&fvj=0 combines
1 New York Data Scientist AbleTo, Inc. https://www.indeed.com//pagead/clk?mo=r&ad=-6NYlbfkN0DDxuA8Y4K3JnPiGV4kjt8LJAX0ZelysMhEJeNM_3_rWb_L4BVNF4KpDHXkayWIYw5H919b29Wv9kQgd-mQGWEY-QQXRpTL5rlsZ7_n6AWh5RptzR11B6ZJnyJplt7VTSzq1CsFIpNQMytVyIBVMSrguwd-ESHqspczWm_AdnUxSs1jiwfYm9e3e6AT-Hh20UGh0KJi238J21XagkNbN-QSCV_9RqzVw6HbR-OEHLPDucMHuaKN_gpW6UiDyjfm9he2EbTXP8Rgkx-e3GecFU2APK8l5g0ymwqGZL_hEDQEXV_KCaQj10Dyd4spfmQ6-j1RHkdoOudRJ0SRuoIRiPTchuCTcuQkwtd3m8kmIFcdVjsaT9wxfgYx0pWhP32vCKTfgUSE5Ze7ePGboPkMAkLMaOYkaRUqFtt4g34rxsAyjHfpUr8XszxgN6pyyvexN_0qEzEkIQ46oO30CzvOcnYXJPKhYtRhvtFcPiklsBqwWkbTMmDmRUE5TIw22dDZ2kAzGQubdRYnp6KnbgP0mfeHzCnDljLM8wcus3PTL101_j1VZ-Z0h897BHWzlKMcHYSNmT4-oxr4d5bxD_BuSQiLqRe5nQlujMoZ2IsWl8LOIZKiN556aul-NqrzquBk3sLy-nEegXA6fQ==&vjs=3&p=1&sk=&fvj=0 class
1 New York Data Scientist AbleTo, Inc. https://www.indeed.com//pagead/clk?mo=r&ad=-6NYlbfkN0DDxuA8Y4K3JnPiGV4kjt8LJAX0ZelysMhEJeNM_3_rWb_L4BVNF4KpDHXkayWIYw5H919b29Wv9kQgd-mQGWEY-QQXRpTL5rlsZ7_n6AWh5RptzR11B6ZJnyJplt7VTSzq1CsFIpNQMytVyIBVMSrguwd-ESHqspczWm_AdnUxSs1jiwfYm9e3e6AT-Hh20UGh0KJi238J21XagkNbN-QSCV_9RqzVw6HbR-OEHLPDucMHuaKN_gpW6UiDyjfm9he2EbTXP8Rgkx-e3GecFU2APK8l5g0ymwqGZL_hEDQEXV_KCaQj10Dyd4spfmQ6-j1RHkdoOudRJ0SRuoIRiPTchuCTcuQkwtd3m8kmIFcdVjsaT9wxfgYx0pWhP32vCKTfgUSE5Ze7ePGboPkMAkLMaOYkaRUqFtt4g34rxsAyjHfpUr8XszxgN6pyyvexN_0qEzEkIQ46oO30CzvOcnYXJPKhYtRhvtFcPiklsBqwWkbTMmDmRUE5TIw22dDZ2kAzGQubdRYnp6KnbgP0mfeHzCnDljLM8wcus3PTL101_j1VZ-Z0h897BHWzlKMcHYSNmT4-oxr4d5bxD_BuSQiLqRe5nQlujMoZ2IsWl8LOIZKiN556aul-NqrzquBk3sLy-nEegXA6fQ==&vjs=3&p=1&sk=&fvj=0 patient
1 New York Data Scientist AbleTo, Inc. https://www.indeed.com//pagead/clk?mo=r&ad=-6NYlbfkN0DDxuA8Y4K3JnPiGV4kjt8LJAX0ZelysMhEJeNM_3_rWb_L4BVNF4KpDHXkayWIYw5H919b29Wv9kQgd-mQGWEY-QQXRpTL5rlsZ7_n6AWh5RptzR11B6ZJnyJplt7VTSzq1CsFIpNQMytVyIBVMSrguwd-ESHqspczWm_AdnUxSs1jiwfYm9e3e6AT-Hh20UGh0KJi238J21XagkNbN-QSCV_9RqzVw6HbR-OEHLPDucMHuaKN_gpW6UiDyjfm9he2EbTXP8Rgkx-e3GecFU2APK8l5g0ymwqGZL_hEDQEXV_KCaQj10Dyd4spfmQ6-j1RHkdoOudRJ0SRuoIRiPTchuCTcuQkwtd3m8kmIFcdVjsaT9wxfgYx0pWhP32vCKTfgUSE5Ze7ePGboPkMAkLMaOYkaRUqFtt4g34rxsAyjHfpUr8XszxgN6pyyvexN_0qEzEkIQ46oO30CzvOcnYXJPKhYtRhvtFcPiklsBqwWkbTMmDmRUE5TIw22dDZ2kAzGQubdRYnp6KnbgP0mfeHzCnDljLM8wcus3PTL101_j1VZ-Z0h897BHWzlKMcHYSNmT4-oxr4d5bxD_BuSQiLqRe5nQlujMoZ2IsWl8LOIZKiN556aul-NqrzquBk3sLy-nEegXA6fQ==&vjs=3&p=1&sk=&fvj=0 engagement


Next we’ll map a numeric sentiment score to the words in our token column. We’re going to use the AFINN lexicon for simplicity, as it maps each word to an integer score between -5 and +5, with numbers below zero representing negative sentiment and numbers above zero representing positive sentiment.

#map the words to a sentiment score
df.sentiment <- df.sent %>%
  inner_join(get_sentiments("afinn"), by = c("token" = "word"))

head(df.sentiment[c("city","job_title","company_name","token","score")],5) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
city job_title company_name token score
New York Data Scientist AbleTo, Inc. improve 2
New York Data Scientist AbleTo, Inc. benefitting 2
New York Data Scientist AbleTo, Inc. suffering -2
New York Data Scientist AbleTo, Inc. anxiety -2
New York Data Scientist AbleTo, Inc. pain -2


Next we’re going to compute an average sentiment score for each company by aggregating the total sentiment score per company and dividing by the number of sentiment-scored tokens observed for that company. We’ll also order the data by average sentiment.

#pare down the data
df.sentByComp <- df.sentiment[,c("company_name","score")]

#get the number of observations per co.
df.compCount <- df.sentiment %>% 
  dplyr::group_by(company_name) %>% 
  dplyr::summarize(num_obs = length(company_name))

#aggregate the sentiment score by company
df.sentByComp <-df.sentByComp %>%
   dplyr::group_by(company_name) %>%
   dplyr::summarize(sentiment = sum(score))

#get the average sentiment score per observation
df.sentByComp$num_obs = df.compCount$num_obs
df.sentByComp$avg.sentiment = df.sentByComp$sentiment / df.sentByComp$num_obs
df.sentByComp <- df.sentByComp[order(-df.sentByComp$avg.sentiment),]

head(df.sentByComp,5) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
company_name sentiment num_obs avg.sentiment
Naval Nuclear Laboratory 23 10 2.300000
RJ Lee Group, Inc. 20 9 2.222222
Austin Fraser 11 5 2.200000
Directions Research, Inc. 70 32 2.187500
HarperCollins Publishers Inc. 13 6 2.166667


Next we downsample the data to look at the top and bottom few companies, as per the sentiment rankings.

n <- 5 # number of companies to get

#get the top and bottom "n" ranked companies
bestNworst <- rbind(head(df.sentByComp,n),tail(df.sentByComp,n))

bestNworst %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
company_name sentiment num_obs avg.sentiment
Naval Nuclear Laboratory 23 10 2.3000000
RJ Lee Group, Inc. 20 9 2.2222222
Austin Fraser 11 5 2.2000000
Directions Research, Inc. 70 32 2.1875000
HarperCollins Publishers Inc. 13 6 2.1666667
Expedia -1 22 -0.0454545
Oracle -6 28 -0.2142857
ZenX Solutions LLC -3 11 -0.2727273
Ezra Penland Actuarial Recruitment -44 22 -2.0000000
Affirm -8 2 -4.0000000


Next, we inner-join our bestNworst data back to the tokenized data (df.sent), preserving only entries that correspond to companies which fall in the top or bottom “n” in terms of sentiment rank. This should dramatically reduce the row count from about 400K to somewhere in the low thousands.

df.result <- inner_join(df.sent,bestNworst[c("company_name","avg.sentiment")])
## Joining, by = "company_name"
colnames(df.result)
## [1] "list_ID"       "city"          "job_title"     "company_name" 
## [5] "link"          "token"         "avg.sentiment"
tail(df.result[c("city","company_name","token","avg.sentiment")], 5) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
city company_name token avg.sentiment
2430 Pittsburgh Naval Nuclear Laboratory proficiency 2.3
2431 Pittsburgh Naval Nuclear Laboratory technical 2.3
2432 Pittsburgh Naval Nuclear Laboratory writing 2.3
2433 Pittsburgh Naval Nuclear Laboratory data 2.3
2434 Pittsburgh Naval Nuclear Laboratory analysis 2.3


Now we’ll count and rank the terms.

#remove any commas from the token column... makes it easier to remove #s 
df.result$token <- gsub(",","",df.result$token)

#count the terms for the top rated companies
top.terms <- df.result %>%
  dplyr::filter(is.na(as.numeric(as.character(token)))) %>%   # removes numbers
  dplyr::filter(avg.sentiment > 0 ) %>%
  dplyr::count(token, sort = TRUE) 

head(top.terms,5)
## # A tibble: 5 x 2
##   token          n
##   <chr>      <int>
## 1 data          18
## 2 chemistry      9
## 3 experience     7
## 4 position       7
## 5 college        6
#count the terms for the bottom rated companies
bottom.terms <- df.result %>%
  dplyr::filter(is.na(as.numeric(as.character(token)))) %>%  # removes numbers
  dplyr::filter(avg.sentiment < 0 ) %>%
  dplyr::count(token, sort = TRUE) 

head(bottom.terms,5) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
token n
position 42
data 33
actuarial 32
actuary 31
job 26


B. Plot Some Findings

ggplot(head(top.terms,33), aes(reorder(token, n), n)) +
  geom_bar(stat = "identity", fill = "Blue") +
  labs(title = "Top Terms for Companies with Highest Sentiment",
       x = "Term", y = "Frequency") +
  coord_flip()

ggplot(head(bottom.terms,33), aes(reorder(token, n), n)) +
  geom_bar(stat = "identity", fill = "Red") +
  labs(title = "Top Terms for Companies with Lowest Sentiment",
       x = "Term", y = "Frequency") +
  coord_flip()



7. Supervised Analysis

Our goal was to find the most valued data science skills using a supervised approach. We created new variables for analyzing three types of skills (hard skills, soft skills, and tool skills).

Assumptions:

  • We assumed certain terms fell into certain categories and searched for them.

  • We arrived at these categories based on outside SMEs (subject-matter experts).

  • We assumed that our analysis tools would lead to conclusions without human intervention.


A. Frequency of Skills


Technical Skills

We used the mutate function to create a new logical variable for each tool skill, flagging whether a (mostly case-insensitive) pattern match appears in the summary column.

toolskills <- df_final %>%
    mutate(R = grepl("\\bR\\b,", summary)) %>%
    mutate(python = grepl("Python", summary, ignore.case=TRUE)) %>%
    mutate(SQL = grepl("SQL", summary, ignore.case=TRUE)) %>%
    mutate(hadoop = grepl("hadoop", summary, ignore.case=TRUE)) %>%
    mutate(perl = grepl("perl", summary, ignore.case=TRUE)) %>%
    mutate(matplotlib = grepl("matplotlib", summary, ignore.case=TRUE)) %>%
    mutate(Cplusplus = grepl("C++", summary, fixed=TRUE)) %>%
    mutate(VB = grepl("VB", summary, ignore.case=TRUE)) %>%
    mutate(java = grepl("java\\b", summary, ignore.case=TRUE)) %>%
    mutate(scala = grepl("scala", summary, ignore.case=TRUE)) %>%
    mutate(tensorflow = grepl("tensorflow", summary, ignore.case=TRUE)) %>%
    mutate(javascript = grepl("javascript", summary, ignore.case=TRUE)) %>%
    mutate(spark = grepl("spark", summary, ignore.case=TRUE)) %>%
    
select(job_title, company_name, R, python, SQL, hadoop, perl, matplotlib, Cplusplus, VB, java, scala, tensorflow, javascript, spark)


Applied the summarise_all function to all (non-grouping) columns.

toolskills2 <- toolskills %>% select(-(1:2)) %>% summarise_all(sum) %>% gather(variable,value) %>% arrange(desc(value))


Visualized the most in-demand tool skills:

ggplot(toolskills2,aes(x=reorder(variable, value), y=value)) + geom_bar(stat='identity',fill="green") + xlab('') + ylab('Frequency') + labs(title='Tool Skills') + coord_flip() + theme_minimal()

Python, SQL, and R are the most in-demand tool skills according to Indeed job posts, with the least in-demand tool skill being VB.


Soft Skills

We used the mutate function to create a new logical variable for each soft skill, flagging whether a case-insensitive pattern match appears in the summary column.

softskills <- df_final %>%
    mutate(workingremote = grepl("working remote", summary, ignore.case=TRUE)) %>%
    mutate(communication = grepl("communicat", summary, ignore.case=TRUE)) %>%
    mutate(collaborative = grepl("collaborat", summary, ignore.case=TRUE)) %>%
    mutate(creative = grepl("creativ", summary, ignore.case=TRUE)) %>%
    mutate(critical = grepl("critical", summary, ignore.case=TRUE)) %>%
    mutate(problemsolving = grepl("problem solving", summary, ignore.case=TRUE)) %>%
    mutate(activelearning = grepl("active learning", summary, ignore.case=TRUE)) %>%
    mutate(hypothesis = grepl("hypothesis", summary, ignore.case=TRUE)) %>%
    mutate(organized = grepl("organize", summary, ignore.case=TRUE)) %>%
    mutate(judgement = grepl("judgement", summary, ignore.case=TRUE)) %>%
    mutate(selfstarter = grepl("self Starter", summary, ignore.case=TRUE)) %>%
    mutate(interpersonalskills = grepl("interpersonal skills", summary, ignore.case=TRUE)) %>%
    mutate(atttodetail = grepl("attention to detail", summary, ignore.case=TRUE)) %>%
    mutate(visualization = grepl("visualization", summary, ignore.case=TRUE)) %>%
    mutate(leadership = grepl("leadership", summary, ignore.case=TRUE)) %>%

select(
  job_title, company_name, workingremote, communication, collaborative, creative, critical, problemsolving, 
  activelearning, hypothesis, organized, judgement, selfstarter, interpersonalskills, atttodetail, 
  visualization, leadership)
    
summary(softskills) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "800px", height = "200px")
job_title company_name workingremote communication collaborative creative critical problemsolving activelearning hypothesis organized judgement selfstarter interpersonalskills atttodetail visualization leadership
Length:1303 Length:1303 Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical
Class :character Class :character FALSE:1302 FALSE:445 FALSE:650 FALSE:1066 FALSE:1105 FALSE:1172 FALSE:1303 FALSE:1255 FALSE:1217 FALSE:1297 FALSE:1301 FALSE:1175 FALSE:1206 FALSE:946 FALSE:1024
Mode :character Mode :character TRUE :1 TRUE :858 TRUE :653 TRUE :237 TRUE :198 TRUE :131 NA TRUE :48 TRUE :86 TRUE :6 TRUE :2 TRUE :128 TRUE :97 TRUE :357 TRUE :279


Applied the summarise_all function to all (non-grouping) columns.

softskills2 <- softskills %>% 
               select(-(1:2)) %>% 
               summarise_all(sum) %>% 
               gather(variable,value) %>% 
               arrange(desc(value))


Visualized the most in-demand soft skills:

ggplot(softskills2,aes(x=reorder(variable, value), y=value)) + geom_bar(stat='identity',fill="green") + xlab('') + ylab('Frequency') + labs(title='Soft Skills') + coord_flip() + theme_minimal()

Communication, collaboration, and visualization are the most in-demand soft skills according to Indeed job posts, with the least in-demand soft skill being active learning.


Hard Skills

We used the mutate function to create a new logical variable for each hard skill, flagging whether a case-insensitive pattern match appears in the summary column.

hardskills <- df_final %>%
    mutate(machinelearning = grepl("machine learning", summary, ignore.case=TRUE)) %>%
    mutate(modeling = grepl("model", summary, ignore.case=TRUE)) %>%
    mutate(statistics = grepl("statistics", summary, ignore.case=TRUE)) %>%
    mutate(programming = grepl("programming", summary, ignore.case=TRUE)) %>%
    mutate(quantitative = grepl("quantitative", summary, ignore.case=TRUE)) %>%
    mutate(debugging = grepl("debugging", summary, ignore.case=TRUE)) %>%
    

select(job_title, company_name, machinelearning, modeling, statistics, programming, quantitative, debugging)

summary(hardskills) %>% 
               kable("html") %>% 
               kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
job_title company_name machinelearning modeling statistics programming quantitative debugging
Length:1303 Length:1303 Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical
Class :character Class :character FALSE:585 FALSE:457 FALSE:642 FALSE:768 FALSE:939 FALSE:1298
Mode :character Mode :character TRUE :718 TRUE :846 TRUE :661 TRUE :535 TRUE :364 TRUE :5


Applied the summarise_all function to all (non-grouping) columns.

hardskills2 <- hardskills %>% 
               select(-(1:2)) %>% 
               summarise_all(sum) %>% 
               gather(variable,value) %>% 
               arrange(desc(value))


Visualized the most in-demand hard skills:

ggplot(hardskills2,aes(x=reorder(variable, value), y=value)) + 
  geom_bar(stat='identity',fill="green") + 
  xlab('') + 
  ylab('Frequency') + 
  labs(title='Hard Skills') + 
  coord_flip() + 
  theme_minimal()

Modeling and machine learning are the most in-demand hard skills according to Indeed job posts, with the least in-demand hard skill being debugging.


B. Word Cloud

We used the summary column to create a word cloud of data science skills. To begin, we specified the removal of irrelevant words and stopwords that don’t add context.

datacloud <- Corpus(VectorSource(df_final$summary))

datacloud <- tm_map(datacloud, removePunctuation)

datacloud <- tm_map(datacloud, tolower)

datacloud <- tm_map(datacloud, removeWords, c("services", "data", "andor", "ability", "using", "new", "science", "scientist" , "you", "must", "will", "including", "can", stopwords('english')))


Then, we visualized our data science word cloud:

wordcloud(datacloud, 
          max.words = 80, 
          random.order = FALSE, 
          scale=c(3,.3),
          random.color = FALSE,
          colors=palette())



8. Conclusions


A. What are the most valued data science skills?

Supervised Approach

The supervised approach showed us which of the data science skills we searched for were the most and least in demand – and therefore, we assume, the most and least valuable:

Most in-demand

  • Hard skills: modeling and machine learning
  • Soft skills: communication, collaboration, and visualization
  • Tools skills: Python, SQL, R

Least in-demand

  • Hard skills: debugging
  • Soft skills: self-starter, working remotely, and active learning
  • Tools skills: Perl and VB

The list of most in-demand skills is consistent with the tenor and topics of conversation in the field. While motivation, independence, and continuous learning do seem like important requisites for success in data science, these appear either relatively less often in our sample or are conveyed in language opaque to our analysis.

These findings can inform not only how job seekers position their experience and skills, but also how learning programs and bootcamps structure their curricula. Additionally, this information could help prospective employers of data scientists understand how to communicate their own requirements, how their requirements compare with others hiring in the marketplace, or even what their competitive set looks like.


TF-IDF Approach

Hard skills scored higher in most situations: “machine” and “learning”, “statistics”, and “analysis” appeared near the top in most of our graphs, while soft skills and tool skills did not appear at all. However, the results should not be considered conclusive, as the TF-IDF approach lacked proper context. When the corpus was broken down by cities, company names became the highest-scoring terms.

We attempted to make the results more coherent by lowering the sparsity, but this method was arbitrary in nature. This makes sense as TF-IDF is often used in conjunction with other algorithms. For example, we might have been able to get our much-needed context by feeding our TF-IDF matrices into a k-means algorithm. Then we could see how certain words grouped together. In this way, we could prune our corpuses to focus solely on what words are important to data scientist jobs, as opposed to what words are important to job posts in general.
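
As a rough sketch of that idea (assuming the tdm.all matrix from Section 5 is still in memory; the 0.95 sparsity threshold and the choice of five clusters are arbitrary):

# Cluster terms by their tf-idf profiles across postings
tdm.mat <- as.matrix(removeSparseTerms(tdm.all, sparse = 0.95))
set.seed(607)
km <- kmeans(tdm.mat, centers = 5, nstart = 10)

# Inspect which terms group together
split(rownames(tdm.mat), km$cluster)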


Sentiment Approach

The sentiment analysis did not seem to be biased towards either soft or hard skills, with terms such as “data”, “experience”, “technical”, “modeling”, “development”, “python”, and “team” coming in near the top. It did, however, seem to be confused by terms that the supervised approach would never include as “relevant”. For example, terms like “including” and “national” were found near the top of the list – these are likely artifacts of the process rather than important data science terms.

There did not appear to be any coherent difference between top- and bottom-rated companies, which made drawing a meaningful conclusion from any comparison between the two sets difficult. One thing of note was that the distribution for “positive sentiment” companies had significantly more right-skew in the frequency of terms being reported as relevant. More analysis would likely be required to uncover the meaning behind this observation.

At a high level, the sentiment testing was likely a bit mis-specified, in the sense that a generic sentiment mapping was applied to individual words (i.e., ignoring the context of each word within its sentence), which may be less than optimal for the specific task here.

Future sentiment-related work here should include an examination of n-grams, sentences, paragraphs, or even entire summaries so as to capture the full context of what is being said. A custom list of stop words would also be beneficial, as we found many words that are not traditionally considered stop words but recur in almost every job posting and thus provide little value.
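
A starting point for the n-gram idea might look like the sketch below (assuming df_final from Section 4 is in memory):

# Count bigrams in the summaries, dropping any bigram that contains a stop word
bigrams <- df_final %>%
  unnest_tokens(bigram, summary, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

head(bigrams)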


B. Conclusions about the process

  • Running parallel workstreams added depth to our analysis. It allowed us to compare and contrast the outputs of the supervised and unsupervised methods – a neat and tangible learning opportunity for the team.

  • Context matters. All data science methods – supervised or unsupervised – require that the user has background knowledge and context to determine whether the output is valuable. We can’t assume that the results of any method will be salient on their own, especially if the method is unsupervised.

    • On the whole, we found the supervised results much more coherent than the unsupervised. For example, a layman could review the table of skills from the supervised analysis and theorize which ones matter, but the same is not true for tables resulting from the unsupervised analysis, which contain a lot of syntactical noise and don’t feel terribly salient. While we recognize this as a product of the unsupervised methodology that could be “tuned” or “pruned,” for an analysis of low complexity such tuning and pruning comes to approximate a supervised approach. This seems less efficient than deliberately prescribing keywords and terms up front.
  • Collaboration is key in a data science project. For a group of six strangers who are (mostly) not co-located and must work virtually (across time zones), it’s important to have regular check-ins to monitor progress and correct course, if necessary. It’s also key to align early on overall workflow and timeline, roles / responsibilities in the process, and how next steps evolve. It’s good to make project management and decision frameworks explicit, though not doing so did not present issues on this project (because this team is awesome). While we stayed on track through well-attended meetings held every few days, a project plan would be a good idea for our next collaboration.



9. Assumptions

While touched on above, we want to make clear the assumptions we held at each step of the process. These assumptions ultimately informed the direction of our approach, the content of our analysis, and the conclusions we could draw.

A. Overall Assumptions

  • “Skills” refer to hard skills, soft skills, and technical skills of the individual data scientist.

  • We can determine a skill’s value by looking at job postings – specifically, job postings on Indeed for the search term “Data Scientist.” Employers list the skills that they need and find most valuable.

  • Job postings have the most up-to-date information on real employer needs, compared to other data sources such as data science course curricula, surveys of data scientists, and the Reddit page for data science.


B. Scraping Assumptions

  • The data we scraped is representative of all jobs. Indeed data is comparable to other job posting websites.

  • The moment at which we scraped the data can reasonably be extrapolated – it isn’t an outlier. That said, what we scraped is expected to be valid right now, but not necessarily into the future.


C. Cleaning Assumptions

  • All of the sections listed in a job post – and not just ones related to the potential employee – are useful in identifying the skills that employers most value. We arrived at this conclusion after reviewing a random sample of postings and concluding that there was valuable information throughout the summary.

  • We kept observations discrete – as a corpus – rather than as a single string. We also kept special characters and stopwords. This allowed the downstream user to decide what was important.


D. Analysis Assumptions

  • Overall

    • We assumed that we would be able to compare the results of the two approaches.
    • We assumed that if we applied the appropriate data mining methods to the raw data, the methods – on their own – would tell us what was important.


  • Supervised Approach

    • We assumed certain terms fell into certain categories and searched for them.
    • We arrived at these categories based on online subject-matter experts.
    • We removed some stopwords, and words like “data”, that we assumed would not add context.


  • TF-IDF Approach
    • We assumed that TF-IDF would simply be a “smarter” version of calculating term frequency. After running the algorithm on our corpus of job postings, we thought the key words would be roughly the same as in the supervised approach, with the added bonus of not having to assemble a list ahead of time. This proved not to be the case, as the highest-weighted words were sometimes generic in nature.
    • It appeared that the analysis would prove more useful to the business side – helping HR write a better data scientist job posting – than to an employee trying to figure out what data science skills might be in demand.


  • Sentiment Analysis
    • We assumed that higher-sentiment companies are companies more people want to work for, therefore the skills they look for are more valuable.