1. Background

In this project, we used supervised and unsupervised data mining techniques on a scraped dataset of 1,303 Indeed job listings to answer the following question:

What are the most valued data science skills?

We collaborated as a team to understand the question, get the data, clean it, analyze it, and draw conclusions. We used Slack, Google Docs, Google Hangouts, GitHub, and in-person meetings to work together, and we gave pair programming – live and virtual – a shot, too.


Team Rouge (Group 4)


Process

  • Data Acquisition — Iden and Paul

  • Data Cleaning — Jeremy and Kavya

  • Unsupervised Analysis — Iden and Paul

  • Supervised Analysis — Rickidon and Violeta

  • Conclusions — Whole Team


Libraries

library(rvest)
library(RCurl)
library(plyr)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(tm)
library(wordcloud)
library(tidytext)
library(xtable)
library(readr)
library(knitr)
library(kableExtra)



2. Approach

To motivate our approach to data collection and analysis, we began with the concepts of “skills” and of “value.”

A. Definitions

What are the most valued data science skills?

Skills

As discussed in our class, data science requires not only multiple skills but multiple categories of skills. The many fields and industries where data science is applied likely group these skills differently, but after some desk research and discussion we felt that, in addition to toolsets (software and platforms), both “hard” (analytical and technical) and “soft” (communicative and collaborative) skills are important.

We used the following three categories – hard skills, soft skills, and toolsets, found in an article on Data Scientist Resume Skills – as a basis for our supervised analyses.
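To give a flavor of what these category lists look like in practice, here is a small illustrative excerpt in R. The term sets shown are abbreviated examples: the tool and soft-skill terms are drawn from our Supervised Analysis section below, and the hard-skill list is only started here (the complete sets appear in that section).

# Illustrative excerpts only -- the complete keyword sets appear in the
# Supervised Analysis section
tool_terms <- c("R", "Python", "SQL", "Hadoop", "Spark", "Java", "C++")
soft_terms <- c("communication", "collaboration", "leadership",
                "problem solving", "attention to detail")
hard_terms <- c("machine learning")   # first of the hard-skill terms; see the Supervised Analysis section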


Value

To avoid wading into philosophical abstractions, we interpreted value in its economic sense – that is, which skills are sought after and/or rewarded in the marketplace.


B. Data Source

As the economic value of data science skills is not directly measurable, we discussed three different approaches to getting a dataset:

  • Mining existing custom research on data scientists (like that found here).

  • Analyzing online discussion boards focused on data science (like this one on Reddit). While threads can provide a historical record (i.e., the evolution of value), there are potential compromises in data quality and bias (whether due to fanboys, trolls, or a silent majority), and informational content does not necessarily accord with economic value.

  • Scraping online job postings for data science roles provides perspective on which skills employers emphasize and prioritize. This third approach has its limitations: there are multiple platforms (Glassdoor, LinkedIn, Monster, CareerBuilder, etc.), none of which has a complete view of the marketplace, and scraping time-delimited job postings captures a single moment in time without any longitudinal view.

We dismissed custom research as it didn’t seem to accord with the intent of the project. We thought that exploring online discussion boards could be a valuable alternative, fallback, or follow-up analysis. We agreed to focus on job postings.

Constraints of the data source notwithstanding, testing what signals of “skill value” (i.e., frequency of terms related to data science skills) could be detected in job postings seemed a good approach to this project, and one that allowed us to meet technical requirements and collaborate.

After some exploration, we decided to focus on Indeed.com, which has a wealth of data science job postings that can be scraped. We scraped them – first a test set for evaluation and troubleshooting, then a larger, more robust set – to be cleaned and analyzed. We initially used Python, and later replicated the scraper in R.


C. Analysis

We felt that the project could benefit from a two-pronged approach to analysis:

  1. A more prescriptive, supervised approach based on cross-referencing job summaries with categorized lists of terms and calculating the frequency of recurring keywords. To prove the concept, we used the “hard,” “soft,” and “tools” lists referenced above as we found them.

  2. A more exploratory, unsupervised approach based on TF-IDF (term frequency–inverse document frequency) and sentiment analysis, which do not impose preconceived keywords on the job postings (beyond filtering out stop-words).

To streamline our process, we conducted the two analyses in parallel, cleaning and preparing the data for both. We iterated and collaborated on the scraper, cleaning, and analysis using Slack and GitHub.



3. Data Acquisition

A. Note

This scraper is working code; however, we’ve disabled it here because it can take a while to run. It’s provided as a working demonstration of how our data was collected. All the actual work for this project was completed on a static dataset that we collected early in our efforts. This ensured that all group members were always working with identical data and that any user could reproduce our results as desired.

The following chunk of code scrapes job postings from indeed.com and collects the results into a dataframe. It’s a port of the Python code originally used to scrape our dataset.


B. Set the variables

First we’ll set a few variables that we’ll use in our scraping activity. We’ve used a smaller set of cities just to demonstrate how it works.

city.set_small <- c("New+York+NY", "Seattle+WA")

city.set <- c("New+York+NY", "Seattle+WA", "San+Francisco+CA",
              "Washington+DC","Atlanta+GA","Boston+MA", "Austin+TX",
              "Cincinnati+OH", "Pittsburgh+PA")


target.job <- "data+scientist"   

base.url <- "https://www.indeed.com/"

max.results <- 50


C. Scrape the Details

Indeed.com appears to use the “GET” request method, so we can manipulate the URL query string directly to get the data that we want. We’re going to iterate over our target cities and scrape the particulars for each job – this includes getting the links to each individual job page so that we can also pull the full summary.
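For example, the URL for the second page of “data scientist” results in New York can be assembled by pasting the query parameters onto the base URL (a quick illustration of the pattern used in the loop below):

url.example <- paste0("https://www.indeed.com/", "jobs?q=", "data+scientist",
                      "&l=", "New+York+NY", "&start=", 10)
url.example
# [1] "https://www.indeed.com/jobs?q=data+scientist&l=New+York+NY&start=10"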


D. Get the full Summary

After the above is complete, we’re going to iterate over all the links that we’ve collected, pull them, and grab the full job summary for each. Note that job postings are sometimes removed, in which case we pull an empty variable. We could probably do some cleaning in this step while downloading, but we’re going to handle that downstream.

#create a df to hold everything that we collect
jobs.data <- data.frame(matrix(ncol = 8, nrow = 0))
colnames(jobs.data) <- c("city","job.title","company.name","job.location",
                         "summary.short","job.salary","links","summary.full")


for (city in city.set_small){
  print(paste("Downloading data for: ", city))

  
  for (start in seq(0, max.results - 10, by = 10)){   #result pages are offset in steps of 10

    url <- paste(base.url,"jobs?q=",target.job,"&l=",city,"&start=", start ,sep="")
    page <- read_html(url)
    Sys.sleep(1)
  
    #recorded the city search term << not working yet...
    #i<-i+1
    #job.city[i] <- city
  
    #get the links
    links <- page %>% 
      html_nodes("div") %>%
      html_nodes(xpath = '//*[@data-tn-element="jobTitle"]') %>%
      html_attr("href")
    
  
    #get the job title
    job.title <- page %>% 
      html_nodes("div") %>%
      html_nodes(xpath = '//*[@data-tn-element="jobTitle"]') %>%
      html_attr("title")
  
    
    #get the company name
    company.name <- page %>% 
      html_nodes("span") %>% 
      html_nodes(xpath = '//*[@class="company"]') %>% 
      html_text() %>%
      trimws()
  
    #get the job location
    job.location <- page %>% 
      html_nodes("span") %>% 
      html_nodes(xpath = '//*[@class="location"]') %>% 
      html_text() %>%
      trimws()
    
    #get the short summary
    summary.short <- page %>% 
      html_nodes("span") %>% 
      html_nodes(xpath = '//*[@class="summary"]') %>% 
      html_text() %>%
      trimws()
    
  }
  
  #create a structure to hold our full summaries
  summary.full <- rep(NA, length(links))
  
  #fill in the job data
  job.city <- rep(city,length(links))
  
  #add a place-holder for the salary
  job.salary <- rep(0,length(links))
  
  #iterate over the links that we collected
  for ( n in 1:length(links) ){
    
    #build the link
    link <- paste(base.url,links[n],sep="")
    
    #pull the link
    page <- read_html(link)
  
    #get the full summary
    s.full <- page %>%
     html_nodes("span") %>% 
     html_nodes(xpath = '//*[@class="summary"]') %>% 
     html_text() %>%
     trimws()
  
    #check to make sure we got some data and if so, append it.
    #as expired postings return an empty var
    if (length(s.full) > 0 ){
        summary.full[n] = s.full  
        } 
  
    }
  
    #add the newly collected data to the jobs.data
    jobs.data <- rbind(jobs.data,data.frame(city,
                                            job.title,
                                            company.name,
                                            job.location,
                                            summary.short,
                                            job.salary,
                                            links,
                                            summary.full))

    
}



4. Data Cleaning

The previous step resulted in a raw CSV file with over 1,300 rows. To clean the data, we first read in the CSV file, tested a cleaning procedure on a 100-row sample, and then applied the procedure to the full dataset.

A. Read in the dataframe

Read in the raw dataframe, setting the separator to the pipe character.

url <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Projects/Project%2003/indeed_jobs_large.csv"

df <- read.csv(url, sep="|", stringsAsFactors = F)


Removed the “location” and “salary” columns to reduce redundancy.

df <- df[, -c(5,7)]


B. Test cleaning procedure

Took a 100-row sample of the full dataset.

sample <- df[sample(1:nrow(df), 100, replace=F),]


Removed brackets surrounding summaries.

sample1 <- sample %>% separate(summary_full, c("bracket", "new_summary"), sep="^[\\[]", remove=T, convert=F) %>%
                      separate(new_summary, c("summary_full", "bracket"), sep="[\\]]$", remove=T, convert=F)

sample1 <- sample1[, -c(5, 8)]


Renamed column headers.

names(sample1) <- c("list_ID", "city", "job_title", "company_name", "link", "summary")


Removed state and plus signs from city column.

# Separate City column into City and State by pattern of two uppercase letters after a plus sign (i.e., "+NY")
sample2 <- sample1 %>% separate(city, c("city", "state"), sep="[\\+][[:upper:]][[:upper:]]$", convert=T)

# Remove empty State column
sample2 <- sample2[, -c(3)]

# Replace plus signs with spaces
sample2$city <- str_replace_all(sample2$city, "[\\+]", " ")


Removed rows where summary is blank.

sample3 <- filter(sample2, sample2$summary!="")

head(sample3, 2) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "800px", height = "200px")
(Table: first two rows of the cleaned sample, with columns list_ID, city, job_title, company_name, link, and summary. The preview shows a “Data Scientist, NGA Explorer Program” posting from the National Geospatial-Intelligence Agency and a “Cons, Advanced Analytics” posting from Avanade, both in Atlanta, each with its full summary text.)


C. Apply cleaning procedure to full dataset

Removed brackets surrounding summaries.

df1 <- df %>% separate(summary_full, c("bracket", "new_summary"), sep="^[\\[]", remove=T, convert=F) %>%
              separate(new_summary, c("summary_full", "bracket"), sep="[\\]]$", remove=T, convert=F)

df1 <- df1[, -c(5, 8)]


Renamed column headers.

names(df1) <- c("list_ID", "city", "job_title", "company_name", "link", "summary")


Removed state and plus signs from city column.

# Separate city column into city and state by pattern of two uppercase letters after a plus sign (i.e., "+NY")
df2 <- df1 %>% separate(city, c("city", "state"), sep="[\\+][[:upper:]][[:upper:]]$", convert=T)

# Remove empty State column
df2 <- df2[, -c(3)]

# Replace plus signs with spaces
df2$city <- str_replace_all(df2$city, "[\\+]", " ")


Removed rows where summary is blank.

df_final <- filter(df2, df2$summary!="")
write.csv(df_final, "indeed_final.csv", row.names = FALSE)

head(df_final, 2) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "800px", height = "200px")
(Table: first two rows of df_final, with columns list_ID, city, job_title, company_name, link, and summary. The preview shows a Data Scientist posting from AbleTo, Inc. and a Data Scientist posting from Shore Group Associates, LLC, both in New York, each with its full summary text.)


We are left with a dataset called df_final that has 1,303 job listings.
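As a quick sanity check on the cleaned data (a sketch; the dimensions and column names follow from the cleaning steps above):

dim(df_final)     # 1303 rows, 6 columns
names(df_final)   # "list_ID" "city" "job_title" "company_name" "link" "summary"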



5. TF-IDF Analysis

A. About TF-IDF

TF-IDF stands for “term frequency–inverse document frequency.” It is calculated by taking the term frequency of a word, \(tf(t,d)\), and multiplying it by its inverse document frequency, \(idf(t,D)\), so that how frequently a word appears in a document is offset by how common it is across the whole corpus.

For example, a word might appear so frequently in one chapter of a book that its raw frequency puts it among the top 10 words, but TF-IDF discounts that word by the fact that it only appears in one chapter of, say, a hundred-chapter textbook.
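For reference, the usual formulation multiplies the term frequency by the log of the total number of documents divided by the number of documents containing the term (the tm weighting we use below, weightTfIdf, is a normalized variant of this):

\[
tfidf(t,d,D) \;=\; tf(t,d) \times idf(t,D), \qquad idf(t,D) \;=\; \log\frac{N}{\left|\{\,d \in D : t \in d\,\}\right|}
\]

where \(N\) is the total number of documents in the corpus \(D\).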


B. Create Control List

tfidf <- read.csv("indeed_final.csv", stringsAsFactors = FALSE)

# Make all job titles lower case for later
tfidf$job_title <- tolower(tfidf$job_title)

# Control list to be used for all corpuses
control_list <- list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE,
                     weighting = weightTfIdf)


C. TF-IDF on All Job Postings

corpus.all <- VCorpus(VectorSource(tfidf$summary))

tdm.all <- TermDocumentMatrix(corpus.all, control = control_list)

# Remove outliers consisting of very rare terms
tdm.80 <- removeSparseTerms(tdm.all, sparse = 0.80)

# Sum rows for total & make dataframe
df_all <- tidy(sort(rowSums(as.matrix(tdm.80))))
colnames(df_all) <- c("words", "count")

# Graph
ggplot(tail(df_all, 25), aes(reorder(words, count), count)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "TF-IDF of Indeed Job Postings",
       x = "Words", y = "Frequency") +
  coord_flip()


D. Sparsity

First, a note on sparsity: sparsity roughly controls how rare a term can be before it is dropped. If we run removeSparseTerms(tdm, sparse = 0.99), it removes only the rarest words – those that appear in fewer than 1% of the documents in the corpus. On the other hand, with removeSparseTerms(tdm, sparse = 0.01), only words that appear in nearly every document of the corpus are kept.
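As a toy illustration of how the threshold behaves (the numbers here are made up for illustration, not drawn from our data):

n_docs    <- 1303                     # documents in the corpus
docs_with <- 150                      # documents containing some term
sparsity  <- 1 - docs_with / n_docs   # ~0.885
sparsity > 0.80                       # TRUE  -> the term is removed at sparse = 0.80
sparsity > 0.90                       # FALSE -> the term is kept at sparse = 0.90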

For most analyses, we found that a sparsity of 80% was most useful. Sparsity thresholds above 80% often included words that were more relevant to job postings in general than to the specifics of our question.

When each job posting is treated as an individual document, skills like “machine learning,” “analytics,” “statistics/statistical,” and “models/modeling” emerge as very important for data scientists to have.


E. TF-IDF on Job Postings by Cities

# Trying to divide the corpus by cities
nyc <- paste(tfidf[tfidf$city == "New York", 6], collapse = " ")
sea <- paste(tfidf[tfidf$city == "Seattle", 6], collapse = " ")
sf <- paste(tfidf[tfidf$city == "San Francisco", 6], collapse = " ")
dc <- paste(tfidf[tfidf$city == "Washington", 6], collapse = " ")
atl <- paste(tfidf[tfidf$city == "Atlanta", 6], collapse = " ")
bos <- paste(tfidf[tfidf$city == "Boston", 6], collapse = " ")
aus <- paste(tfidf[tfidf$city == "Austin", 6], collapse = " ")
cin <- paste(tfidf[tfidf$city == "Cincinnati", 6], collapse = " ")
pitt <- paste(tfidf[tfidf$city == "Pittsburgh", 6], collapse = " ")

cities <- c(nyc, sea, sf, dc, atl, bos, aus, cin, pitt)

corpus.city <- VCorpus(VectorSource(cities))

tdm.city <- TermDocumentMatrix(corpus.city, control = control_list)

# Make city dataframe
df_city <- tidy(tdm.city)
df_city$document <- mapvalues(df_city$document,
                              from = 1:9,
                              to = c("NYC", "SEA", "SF",
                                     "DC", "ATL", "BOS",
                                     "AUS", "CIN", "PITT"))

df_city %>%
  arrange(desc(count)) %>%
  mutate(word = factor(term, levels = rev(unique(term))),
           city = factor(document, levels = c("NYC", "SEA", "SF",
                                              "DC", "ATL", "BOS",
                                              "AUS", "CIN", "PITT"))) %>%
  group_by(document) %>%
  top_n(6, wt = count) %>%
  ungroup() %>%
  ggplot(aes(word, count, fill = document)) +
  geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
  labs(title = "Highest TF-IDF Words in Job Listings by City",
       x = "Words", y = "TF-IDF") +
  facet_wrap(~city, ncol = 2, scales = "free") +
  coord_flip()

# write.csv(df_city, "city_tfidf.csv", row.names = FALSE)


In this attempt, job postings were grouped by the cities they were listed in. When broken down this way, the company names themselves became the most important words, rather than skills.


F. TF-IDF Based on Job Titles

# Data Scientist - 739 instances
ds <- tfidf[grep("data scientist", tolower(tfidf$job_title)), 6]
ds.corpus <- VCorpus(VectorSource(ds))
ds.tdm <- TermDocumentMatrix(ds.corpus, control = control_list)

ds.80 <- removeSparseTerms(ds.tdm, sparse = 0.80)
df_ds <- tidy(sort(rowSums(as.matrix(ds.80))))
colnames(df_ds) <- c("words", "count")

ggplot(tail(df_ds, 25), aes(reorder(words, count), count)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "TF-IDF of Data Scientist Job Titles",
       x = "Words", y = "Frequency") +
  coord_flip()

# Senior / Sr. - 84 instances
# Intern - 61 instance
# Senior vs Intern
# Not very illuminating
senior <- paste(tfidf[grep("senior", tolower(tfidf$job_title)), 6], collapse = " ")
intern <- paste(tfidf[grep("intern", tolower(tfidf$job_title)), 6], collapse = " ")
jrsr.corpus <- VCorpus(VectorSource(c(senior, intern)))
jrsr.tdm <- TermDocumentMatrix(jrsr.corpus, control = control_list)
df_jrsr <- tidy(jrsr.tdm)
df_jrsr$document <- mapvalues(df_jrsr$document, from = 1:2,
                              to = c("senior", "intern"))

df_jrsr %>%
  arrange(desc(count)) %>%
  mutate(word = factor(term, levels = rev(unique(term))),
           type = factor(document, levels = c("senior", "intern"))) %>%
  group_by(document) %>%
  top_n(25, wt = count) %>%
  ungroup() %>%
  ggplot(aes(word, count, fill = document)) +
  geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
  labs(title = "TF-IDF of Senior vs Junior Jobs",
       x = "Words", y = "TF-IDF") +
  facet_wrap(~type, ncol = 2, scales = "free") +
  coord_flip()

# Machine Learning - 124 instances
ml <- tfidf[grep("machine learning", tolower(tfidf$job_title)), 6]
ml.corpus <- VCorpus(VectorSource(ml))
ml.tdm <- TermDocumentMatrix(ml.corpus, control = control_list)

ml.70 <- removeSparseTerms(ml.tdm, sparse = 0.70)
df_ml <- tidy(sort(rowSums(as.matrix(ml.70))))
colnames(df_ml) <- c("words", "count")

ggplot(tail(df_ml, 25), aes(reorder(words, count), count)) +
  geom_bar(stat = "identity", fill = "green") +
  labs(title = "TF-IDF for Machine Learning Jobs",
       x = "Words", y = "Count") +
  coord_flip()

# Research - 119 instances
research <- tfidf[grep("research", tfidf$job_title), 6]
r.corpus <- VCorpus(VectorSource(research))
r.tdm <- TermDocumentMatrix(r.corpus, control = control_list)

r.80 <- removeSparseTerms(r.tdm, sparse = 0.80)
df_r <- tidy(sort(rowSums(as.matrix(r.80))))
colnames(df_r) <- c("words", "count")

ggplot(tail(df_r, 25), aes(reorder(words, count), count)) +
  geom_bar(stat = "identity", fill = "orange") +
  labs(title = "TF-IDF for Research Job Titles",
       x = "Words", y = "Count") +
  coord_flip()


Though our primary search term was “Data Scientist,” Indeed also returned other job titles; these were some of the most common. Unsurprisingly, “Data Scientist” itself matches what we see in the analysis of all job postings. We thought there might be an interesting shift between “senior” level jobs and internships, with perhaps a stronger prevalence of “soft skills” for the higher-level jobs, but did not see much evidence of that in the data.



6. Sentiment Analysis

The idea here is to take a look at the “sentiment” of the text within each job posting and use that information as a proxy for company quality. The thinking is that a higher sentiment ranking is indicative of better company quality (a leap, to be sure, but probably acceptable given the scope of this project). We’ll then use this data to take a look at which skills are more heavily referred to by the highest (and lowest) sentiment-ranked companies.


A. Prepare the data

The first thing we’re going to do is tokenize the “summary” column of the data, which contains all the text we are interested in. This essentially amounts to parsing the column into individual words and reshaping the dataframe into a “tidy” format where each individual word (token) occupies its own row.

We’ll then remove all the “stop_words” from this newly created data – words like “if”, “and”, “the”… etc.


#tokenize the summary into individual words, drop stop words
df.sent <- df_final %>%
  unnest_tokens(token, summary) %>%
  anti_join(stop_words, by=c("token" = "word")) 

head(df.sent,5) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width = "800px", height = "200px")
(Table: first five rows of the tokenized data. Each row carries the posting’s list_ID, city, job_title, company_name, and link, plus a single token – here “overviewableto”, “combines”, “class”, “patient”, and “engagement” from the AbleTo, Inc. posting in New York.)


Next we’ll map a numeric sentiment score to the words in our token column. We’re going to use the AFINN lexicon for simplicity, as it maps each word to an integer score between -5 and +5, with numbers below zero representing negative sentiment and numbers above zero representing positive sentiment.

#map the words to a sentiment score
df.sentiment <- df.sent %>%
  inner_join(get_sentiments("afinn"), by = c("token" = "word"))

head(df.sentiment[c("city","job_title","company_name","token","score")],5) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
city job_title company_name token score
New York Data Scientist AbleTo, Inc. improve 2
New York Data Scientist AbleTo, Inc. benefitting 2
New York Data Scientist AbleTo, Inc. suffering -2
New York Data Scientist AbleTo, Inc. anxiety -2
New York Data Scientist AbleTo, Inc. pain -2


Next we’re going to compute an average sentiment score for each company by aggregating the total sentiment score per company and dividing by the number of sentiment-scored words found in that company’s postings. We’ll also order the data by average sentiment.

#pare down the data
df.sentByComp <- df.sentiment[,c("company_name","score")]

#get the number of observations per co.
df.compCount <- df.sentiment %>% 
  dplyr::group_by(company_name) %>% 
  dplyr::summarize(num_obs = length(company_name))

#aggregate the sentiment score by company
df.sentByComp <-df.sentByComp %>%
   dplyr::group_by(company_name) %>%
   dplyr::summarize(sentiment = sum(score))

#get the average sentiment score per observation
df.sentByComp$num_obs = df.compCount$num_obs
df.sentByComp$avg.sentiment = df.sentByComp$sentiment / df.sentByComp$num_obs
df.sentByComp <- df.sentByComp[order(-df.sentByComp$avg.sentiment),]

head(df.sentByComp,5) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
company_name sentiment num_obs avg.sentiment
Naval Nuclear Laboratory 23 10 2.300000
RJ Lee Group, Inc. 20 9 2.222222
Austin Fraser 11 5 2.200000
Directions Research, Inc. 70 32 2.187500
HarperCollins Publishers Inc. 13 6 2.166667
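As an aside, the same per-company aggregation could be written as a single dplyr pipeline. This is just a sketch of an equivalent formulation (it assumes df.sentiment as built above), not the code we actually ran:

df.sentByComp.alt <- df.sentiment %>%
  dplyr::group_by(company_name) %>%
  dplyr::summarize(sentiment     = sum(score),
                   num_obs       = dplyr::n(),
                   avg.sentiment = mean(score)) %>%
  dplyr::arrange(dplyr::desc(avg.sentiment))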


Next we subset the data to look at the top and bottom few companies, as ranked by average sentiment.

n <- 5 # number of companies to get

#get the top and bottom "n" ranked companies
bestNworst <- rbind(head(df.sentByComp,n),tail(df.sentByComp,n))

bestNworst %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
company_name sentiment num_obs avg.sentiment
Naval Nuclear Laboratory 23 10 2.3000000
RJ Lee Group, Inc. 20 9 2.2222222
Austin Fraser 11 5 2.2000000
Directions Research, Inc. 70 32 2.1875000
HarperCollins Publishers Inc. 13 6 2.1666667
Expedia -1 22 -0.0454545
Oracle -6 28 -0.2142857
ZenX Solutions LLC -3 11 -0.2727273
Ezra Penland Actuarial Recruitment -44 22 -2.0000000
Affirm -8 2 -4.0000000


Next, we inner-join our bestNworst data back to the tokenized data (df.sent), preserving only entries that correspond to companies that fall in the top or bottom “n” in terms of sentiment rank. This should dramatically reduce the row count from about 400K to somewhere in the low thousands.

df.result <- inner_join(df.sent,bestNworst[c("company_name","avg.sentiment")])
## Joining, by = "company_name"
colnames(df.result)
## [1] "list_ID"       "city"          "job_title"     "company_name" 
## [5] "link"          "token"         "avg.sentiment"
tail(df.result[c("city","company_name","token","avg.sentiment")], 5) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
city company_name token avg.sentiment
2430 Pittsburgh Naval Nuclear Laboratory proficiency 2.3
2431 Pittsburgh Naval Nuclear Laboratory technical 2.3
2432 Pittsburgh Naval Nuclear Laboratory writing 2.3
2433 Pittsburgh Naval Nuclear Laboratory data 2.3
2434 Pittsburgh Naval Nuclear Laboratory analysis 2.3


Now we’ll count and rank the terms.

#remove any commas from the token column... makes it easier to remove #s 
df.result$token <- gsub(",","",df.result$token)

#count the terms for the top rated companies
top.terms <- df.result %>%
  dplyr::filter(is.na(as.numeric(as.character(token)))) %>%   # removes numbers
  dplyr::filter(avg.sentiment > 0 ) %>%
  dplyr::count(token, sort = TRUE) 

head(top.terms,5)
## # A tibble: 5 x 2
##   token          n
##   <chr>      <int>
## 1 data          18
## 2 chemistry      9
## 3 experience     7
## 4 position       7
## 5 college        6
#count the terms for the bottom rated companies
bottom.terms <- df.result %>%
  dplyr::filter(is.na(as.numeric(as.character(token)))) %>%  # removes numbers
  dplyr::filter(avg.sentiment < 0 ) %>%
  dplyr::count(token, sort = TRUE) 

head(bottom.terms,5) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
token n
position 42
data 33
actuarial 32
actuary 31
job 26


B. Plot Some Findings

ggplot(head(top.terms,33), aes(reorder(token, n), n)) +
  geom_bar(stat = "identity", fill = "Blue") +
  labs(title = "Top Terms for Companies with Highest Sentiment",
       x = "Term", y = "Frequency") +
  coord_flip()

ggplot(head(bottom.terms,33), aes(reorder(token, n), n)) +
  geom_bar(stat = "identity", fill = "Red") +
  labs(title = "Top Terms for Companies with Lowest Sentiment",
       x = "Term", y = "Frequency") +
  coord_flip()



7. Supervised Analysis

Our goal was to find the most valued data science skills using a supervised approach. We created new variables for analyzing three types of skills (hard skills, soft skills, and tool skills).

Assumptions:

  • We assumed certain terms fell into certain categories and searched for them.

  • We arrived at these categories based on outside SMEs (subject-matter experts).

  • We assumed that our analysis tools would lead to conclusions without human intervention.


A. Frequency of Skills


Technical Skills

We used the mutate function to create a new logical variable for each tool skill, flagging whether the term appears in the summary column, with case-insensitive matching for most terms.

toolskills <- df_final %>%
    mutate(R = grepl("\\bR\\b,", summary)) %>%
    mutate(python = grepl("Python", summary, ignore.case=TRUE)) %>%
    mutate(SQL = grepl("SQL", summary, ignore.case=TRUE)) %>%
    mutate(hadoop = grepl("hadoop", summary, ignore.case=TRUE)) %>%
    mutate(perl = grepl("perl", summary, ignore.case=TRUE)) %>%
    mutate(matplotlib = grepl("matplotlib", summary, ignore.case=TRUE)) %>%
    mutate(Cplusplus = grepl("C++", summary, fixed=TRUE)) %>%
    mutate(VB = grepl("VB", summary, ignore.case=TRUE)) %>%
    mutate(java = grepl("java\\b", summary, ignore.case=TRUE)) %>%
    mutate(scala = grepl("scala", summary, ignore.case=TRUE)) %>%
    mutate(tensorflow = grepl("tensorflow", summary, ignore.case=TRUE)) %>%
    mutate(javascript = grepl("javascript", summary, ignore.case=TRUE)) %>%
    mutate(spark = grepl("spark", summary, ignore.case=TRUE)) %>%
    select(job_title, company_name, R, python, SQL, hadoop, perl, matplotlib, Cplusplus,
           VB, java, scala, tensorflow, javascript, spark)


Applied the summarise_all function to all (non-grouping) columns.

toolskills2 <- toolskills %>% select(-(1:2)) %>% summarise_all(sum) %>% gather(variable,value) %>% arrange(desc(value))


Visualized the most in-demand tool skills:

ggplot(toolskills2,aes(x=reorder(variable, value), y=value)) + geom_bar(stat='identity',fill="green") + xlab('') + ylab('Frequency') + labs(title='Tool Skills') + coord_flip() + theme_minimal()

Python, SQL, and R are the most in-demand tool skills according to Indeed job posts, with the least in-demand tool skill being VB.


Soft Skills

We used the mutate function to create a new logical variable for each soft skill by case-insensitive pattern-matching against the summary column.

softskills <- df_final %>%
    mutate(workingremote = grepl("working remote", summary, ignore.case=TRUE)) %>%
    mutate(communication = grepl("communicat", summary, ignore.case=TRUE)) %>%
    mutate(collaborative = grepl("collaborat", summary, ignore.case=TRUE)) %>%
    mutate(creative = grepl("creativ", summary, ignore.case=TRUE)) %>%
    mutate(critical = grepl("critical", summary, ignore.case=TRUE)) %>%
    mutate(problemsolving = grepl("problem solving", summary, ignore.case=TRUE)) %>%
    mutate(activelearning = grepl("active learning", summary, ignore.case=TRUE)) %>%
    mutate(hypothesis = grepl("hypothesis", summary, ignore.case=TRUE)) %>%
    mutate(organized = grepl("organize", summary, ignore.case=TRUE)) %>%
    mutate(judgement = grepl("judgement", summary, ignore.case=TRUE)) %>%        # matches the British spelling only
    mutate(selfstarter = grepl("self Starter", summary, ignore.case=TRUE)) %>%   # does not match the hyphenated "self-starter"
    mutate(interpersonalskills = grepl("interpersonal skills", summary, ignore.case=TRUE)) %>%
    mutate(atttodetail = grepl("attention to detail", summary, ignore.case=TRUE)) %>%
    mutate(visualization = grepl("visualization", summary, ignore.case=TRUE)) %>%
    mutate(leadership = grepl("leadership", summary, ignore.case=TRUE)) %>%
    select(job_title, company_name, workingremote, communication, collaborative, creative, critical, problemsolving,
           activelearning, hypothesis, organized, judgement, selfstarter, interpersonalskills, atttodetail,
           visualization, leadership)
    
summary(softskills) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  scroll_box(width = "800px", height = "200px")
job_title, company_name: character (1,303 postings)

skill                  TRUE   FALSE
workingremote             1    1302
communication           858     445
collaborative           653     650
creative                237    1066
critical                198    1105
problemsolving          131    1172
activelearning            0    1303
hypothesis               48    1255
organized                86    1217
judgement                 6    1297
selfstarter               2    1301
interpersonalskills     128    1175
atttodetail              97    1206
visualization           357     946
leadership              279    1024


We applied the summarise_all function to the remaining (non-grouping) columns to total the matches for each skill.

softskills2 <- softskills %>% 
               select(-(1:2)) %>% 
               summarise_all(sum) %>% 
               gather(variable,value) %>% 
               arrange(desc(value))


We visualized the most in-demand soft skills:

ggplot(softskills2, aes(x=reorder(variable, value), y=value)) + 
  geom_bar(stat='identity', fill="green") + 
  xlab('') + 
  ylab('Frequency') + 
  labs(title='Soft Skills') + 
  coord_flip() + 
  theme_minimal()

Communication, collaboration, and visualization are the most in-demand soft skills according to Indeed job posts, with the least in-demand soft skill being active learning.


Hard Skills

We used the mutate function to create a new logical variable for each hard skill by case-insensitive pattern-matching against the summary column.

hardskills <- df_final %>%
    mutate(machinelearning = grepl("machine learning", summary, ignore.case=TRUE)) %>%
    mutate(modeling = grepl("model", summary, ignore.case=TRUE)) %>%
    mutate(statistics = grepl("statistics", summary, ignore.case=TRUE)) %>%
    mutate(programming = grepl("programming", summary, ignore.case=TRUE)) %>%
    mutate(quantitative = grepl("quantitative", summary, ignore.case=TRUE)) %>%
    mutate(debugging = grepl("debugging", summary, ignore.case=TRUE)) %>%
    select(job_title, company_name, machinelearning, modeling, statistics, programming, quantitative, debugging)

summary(hardskills) %>% 
               kable("html") %>% 
               kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
job_title, company_name: character (1,303 postings)

skill              TRUE   FALSE
machinelearning     718     585
modeling            846     457
statistics          661     642
programming         535     768
quantitative        364     939
debugging             5    1298


We applied the summarise_all function to the remaining (non-grouping) columns to total the matches for each skill.

hardskills2 <- hardskills %>% 
               select(-(1:2)) %>% 
               summarise_all(sum) %>% 
               gather(variable,value) %>% 
               arrange(desc(value))


We visualized the most in-demand hard skills:

ggplot(hardskills2,aes(x=reorder(variable, value), y=value)) + 
  geom_bar(stat='identity',fill="green") + 
  xlab('') + 
  ylab('Frequency') + 
  labs(title='Hard Skills') + 
  coord_flip() + 
  theme_minimal()

Modeling and machine learning are the most in-demand hard skills according to Indeed job posts, with the least in-demand hard skill being debugging.


B. Word Cloud

We used the summary column to create a word cloud of data science skills. To begin, we specified the removal of irrelevant words and stopwords that don’t add context.

datacloud <- Corpus(VectorSource(df_final$summary))

datacloud <- tm_map(datacloud, removePunctuation)

datacloud <- tm_map(datacloud, content_transformer(tolower))   # wrap base functions in content_transformer() to keep the corpus structure intact

datacloud <- tm_map(datacloud, removeWords, c("services", "data", "andor", "ability", "using", "new", "science", "scientist", "you", "must", "will", "including", "can", stopwords('english')))


Then, we visualized our data science word cloud:

wordcloud(datacloud, 
          max.words = 80, 
          random.order = FALSE, 
          scale=c(3,.3),
          random.color = FALSE,
          colors=palette())



8. Conclusions


A. What are the most valued data science skills?

Supervised Approach

The supervised approach showed us which of the data science skills we searched for were the most and least in demand – and therefore, we assume, the most and least valuable:

Most in-demand

  • Hard skills: modeling and machine learning
  • Soft skills: communication, collaboration, and visualization
  • Tools skills: Python, SQL, R

Least in-demand

  • Hard skills: debugging
  • Soft skills: self-starter, working remotely, and active learning
  • Tools skills: Perl and VB

The list of most in-demand skills is consistent with the tenor and topics of conversation in the field. While motivation, independence, and continuous learning do seem like important requisites for success in data science, they are either relatively less emphasized in our sample or conveyed in language that is opaque to our analysis.

These findings can inform not only how job seekers position their experience and skills, but also how learning programs and bootcamps structure their curricula. Additionally, this information could help prospective employers of data scientists understand how to communicate their own requirements, how their requirements compare with others hiring in the marketplace, or even what their competitive set looks like.


TF-IDF Approach

Hard skills scored higher in most situations: "machine", "learning", "statistics", and "analysis" appeared near the top of most of our graphs, while soft and technical skills did not appear at all. However, the results should not be considered conclusive, as the TF-IDF approach lacked proper context. When the corpus was broken down by city, company names became the highest-scoring terms.

We attempted to make the results more coherent by lowering the sparsity, but this method was arbitrary in nature. This makes sense, as TF-IDF is often used in conjunction with other algorithms. For example, we might have been able to get our much-needed context by feeding our TF-IDF matrices into a k-means algorithm and seeing how certain words group together. In this way, we could prune our corpora to focus on the words that matter to data scientist jobs specifically, as opposed to the words that matter to job posts in general.
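As a rough sketch of that idea – assuming a DocumentTermMatrix named dtm built from the job-post corpus; the object name, the 0.95 sparsity cutoff, and the choice of five clusters are all illustrative rather than part of our actual pipeline – the clustering step might look like this:

dtm.tfidf <- weightTfIdf(dtm)                     # re-weight the assumed document-term matrix by TF-IDF
dtm.tfidf <- removeSparseTerms(dtm.tfidf, 0.95)   # trim very sparse terms to keep the matrix manageable
m <- as.matrix(dtm.tfidf)

set.seed(123)
km <- kmeans(m, centers = 5)                      # k = 5 is an arbitrary, illustrative choice

# inspect the ten highest-weighted terms in each cluster centroid
for (i in 1:5) {
  cat("Cluster", i, ":",
      names(sort(km$centers[i, ], decreasing = TRUE))[1:10], "\n")
}

Each cluster's top terms could then be reviewed by hand to separate data science vocabulary from generic job-posting vocabulary.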


Sentiment Approach

The sentiment analysis did not seem to be biased towards either soft or hard skills, with terms such as "data", "experience", "technical", "modeling", "development", "python", and "team" coming in near the top. It did, however, seem to be confused by terms that the supervised approach would never include as "relevant". For example, terms like "including" and "national" were found near the top of the list – these are likely artifacts of the process rather than important data science terms.

There did not appear to be any coherent difference between top- and bottom-rated companies, which made it difficult to draw a meaningful conclusion from comparing the two sets. One thing of note was that the distribution of term frequencies for the "positive sentiment" companies had significantly more right-skew. More analysis would likely be required to uncover the meaning behind this observation.

At a high level, the sentiment testing was likely somewhat mis-specified: a generic sentiment mapping was applied to individual words, ignoring the context of each word within its sentence, which may be less than optimal for the specific task here.

Future sentiment-related work should include an examination of n-grams, sentences, paragraphs, or even entire summaries, so as to capture the full context of what is being said. A custom list of stop words would also be beneficial, as we found many words that are not traditionally considered stop words but which recur in almost every job posting and thus provide little value.
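To illustrate both ideas, here is a minimal bigram-level sketch using tidytext. The custom stop-word list, the use of the AFINN lexicon, and the simple negation rule are illustrative assumptions rather than part of our actual analysis; the sketch assumes the cleaned df_final data frame with summary and company_name columns:

library(dplyr)
library(tidyr)
library(tidytext)

# illustrative custom stop words that recur in almost every posting
custom.stops <- tibble(word = c("experience", "position", "job", "work", "including", "required"))

bigram.sentiment <- df_final %>%
  unnest_tokens(bigram, summary, token = "ngrams", n = 2) %>%        # tokenize into two-word phrases
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word2 %in% stop_words$word,
         !word2 %in% custom.stops$word) %>%                          # drop standard and custom stop words
  inner_join(get_sentiments("afinn"), by = c("word2" = "word")) %>%  # score the second word of each bigram
  mutate(value = ifelse(word1 %in% c("not", "no", "never"),
                        -value, value)) %>%                          # flip the score after a simple negator
  group_by(company_name) %>%
  summarise(avg.sentiment = mean(value))

Scoring at the bigram (or sentence) level in this way would at least capture simple negation, which a word-by-word approach cannot.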


B. Conclusions about the process

  • Running parallel workstreams added depth to our analysis. It allowed us to compare and contrast the outputs of the supervised and unsupervised methods – a neat and tangible learning opportunity for the team.

  • Context matters. All data science methods – supervised or unsupervised – require that the user has the background knowledge and context to determine whether the output is valuable. We can't assume that the results of any method will be salient on their own, especially if the method is unsupervised.

    • On the whole, we found the supervised results much more coherent than the unsupervised ones. For example, a layman could review the table of skills from the supervised analysis and theorize about which ones matter, but the same is not true for the tables resulting from the unsupervised analysis, which contain a lot of syntactical noise and don't feel terribly salient. While we recognize this as a product of an unsupervised methodology that could be "tuned" or "pruned," for an analysis of low complexity such tuning and pruning comes to approximate a supervised approach, and it seems less efficient than deliberately prescribing keywords and terms up front.
  • Collaboration is key in a data science project. For a group of six strangers who are (mostly) not co-located and must work virtually (across time zones), it's important to have regular check-ins to monitor progress and correct course if necessary. It's also key to align early on the overall workflow and timeline, roles and responsibilities in the process, and how next steps evolve. It's good to make project management and decision frameworks explicit, though not doing so did not present issues on this project (because this team is awesome). We stayed on track through well-attended meetings held every few days.



9. Assumptions

While touched on above, we want to make clear the assumptions we held at each step of the process. These assumptions ultimately informed the direction of our approach, the content of our analysis, and the conclusions we could draw.

A. Overall Assumptions

  • “Skills” refer to hard skills, soft skills, and technical skills of the individual data scientist.

  • We can determine a skill's value by looking at job postings – specifically, postings on Indeed for the search term "Data Scientist". Employers list the skills that they need and find valuable.

  • Job postings have the most up-to-date information on real employer needs, compared to other data sources such as data science course curricula, surveys of data scientists, and the Reddit page for data science.


B. Scraping Assumptions

  • The data we scraped is representative of data science jobs generally, and Indeed data is comparable to that of other job posting websites.

  • The moment at which we scraped the data is not a longitudinal outlier. What we scraped is expected to be valid right now, but not necessarily into the future.


C. Cleaning Assumptions

  • All of the sections listed in a job post – and not just ones related to the potential employee – are useful in identifying the skills that employers most value. We arrived at this conclusion after reviewing a random sample of postings and concluding that there was valuable information throughout the summary.

  • We kept observations discrete – as a corpus – rather than as a single string. We also kept special characters and stopwords. This allowed the downstream user to decide what was important.


D. Analysis Assumptions

  • Overall

    • We assumed that we would be able to compare the results of the two approaches.
    • We assumed that if we applied the appropriate data mining methods to the raw data, the methods – on their own – would tell us what was important.


  • Supervised Approach

    • We assumed certain terms fell into certain categories and searched for them.
    • We arrived at these categories based on online subject-matter experts.
    • We removed some stopwords, as well as words like "data" that we assumed would not add context.


  • TF-IDF Approach
    • We assumed that TF-IDF would simply be a "smarter" version of calculating term frequency. After running the algorithm on our corpus of job postings, we expected the key words to be roughly the same as in the supervised approach, with the added bonus of not having to define a list ahead of time. This proved not to be the case, as the highest-scoring words were sometimes generic in nature.
    • It appeared that this analysis would prove more useful to the business side – helping HR write a better data scientist job posting – than to a job seeker trying to figure out which data science skills are in demand.


  • Sentiment Analysis
    • We assumed that higher-sentiment companies are companies more people want to work for, therefore the skills they look for are more valuable.