Getting Started

Instructions

Use data to answer the question, “Which are the most valued data science skills?”

Proposed Implementation

Our preliminary goal was to use multiple data sets from Kaggle to:

create a rough association between job titles and pay bands
identify skill sets in job listings using a codex
assign a value to individual skill sets using the above

Our actual implementation was to:

create a tidied and stripped corpus from job descriptions
view words associated with ‘data’ using a ‘Document Term Matrix’
extract and rank programming languages based on frequency in the cleaned corpus
extract and rank soft skills based on frequency in the cleaned corpus

Libraries

Here we load our libraries.

# Install new packages
#install.packages('RMariaDB')
#install.packages('bcputility')
#install.packages('tm')

# Load packages --------------------------------------
library(RMariaDB) # For accessing our SQL data
library(DBI)
library(bcputility)
library(tm) # Our star package with a host of functionality for working with corpora
library(dplyr)
library(stringr)
library(tidytext)

Data

Source Data

Our underlying data comes from four kaggle databases that we have unified to total nearly 26,000 data science-focused job listings.

NICOLAE RIGHERIU (2020) “10000 Data Scientist Job Postings from the USA”
https://www.kaggle.com/nrigheriu/data-scientist-job-postings/data

RASHIK RAHMAN (2021) “Data Science Job Posting on Glassdoor”
https://www.kaggle.com/rashikrahmanpritom/data-science-job-posting-on-glassdoor

MICHAEL BRYANT (2022) “Data Science Job Postings and Salaries”
https://www.kaggle.com/michaelbryantds/california-salaries-in-data-science

KAJAL YADAV (2021) “Glassdoor Pre-pandemic Dataset for USA”
https://www.kaggle.com/techykajal/glassdoor-prepandemic-dataset-for-usa

Data Manipulation, Ex 1

Here we memorialize how salary information from one of the data sets was cleaned. After a fork in the project we no longer needed to use Salary information but document it here for completeness.

library(tidyr)
library(dplyr)
library(naniar)
library(stringr)

glassdoor <- read.csv("C:\\Users\\Carlos\\Documents\\GitHub\\Project_3\\Data_Sets\\Glassdoor USA Dataset.csv")

glassdoordf <- separate(data = glassdoor, col = Salary.Estimate, into = c("Salary_Min", "Salary_Max"), sep = "-")

table(glassdoordf$Salary_Min, useNA = 'always')

# Removing all characters after K (thousand). Results with just $ are records with an 
# hourly rate ($17 - $23 Per Hour)
glassdoordf$Salary_Max = sub('\\ .*', '', glassdoordf$Salary_Max)
glassdoordf$Salary_Max = sub('\\(.*', '', glassdoordf$Salary_Max)
glassdoordf$Salary_Min[glassdoordf$Salary_Min == "$ 17"] <- "0"

# Replacing all values smaller than length of 4 (Example: Anything with less characters than $60k) with NA.
glassdoordf$Salary_Max[nchar(glassdoordf$Salary_Max) < 4] <- NA
glassdoordf$Salary_Min[nchar(glassdoordf$Salary_Min) < 4] <- NA

table(glassdoordf$'Salary_Max', useNA = 'always')

# If the record doesn't have a range, salary_Min will have the salary for the job posting. If this
# is the case, we'll be applying that value to Salary_Max as well.
glassdoordf %>%
  mutate(Salary_Max = ifelse(is.na(Salary_Max), Salary_Min, Salary_Max))

Data Manipulation, Ex 2

Here we memorialize how salary information from another data set was cleaned. After a fork in the project we no longer needed to use Salary information but document it here for completeness.

library(RMariaDB)
library(DBI)
library(bcputility)
library(tm)
library(dplyr)
library(stringr)

#DB Set up 
con <- dbConnect(RMariaDB::MariaDB(), username="ahmed", password="buckets123!@#", dbname ="basic", host="localhost")
dbListTables(con)

df <- dbReadTable(con, "listings2019_2022")
df

getMoney <- function(str1, str2) {
  extractboi1 <- str_extract(str1,"\\d+")
  extractboi2 <- str_extract(str2,"\\d+")
  myextract = ''
  
  #check the first string, if its empyty then use the second
  if(is.na(extractboi1)){
    myextract=extractboi2
  }
  else{
    myextract=extractboi1
  }
  
  
  cleanboi = ''
  if(nchar(myextract) == 5){
    cleanboi=substr(myextract, 1,2)
  }
  else if(nchar(myextract) >= 6){
    cleanboi=substr(myextract, 1,3)
  }
  else if(nchar(myextract) == 3 || nchar(myextract == 2)){
    cleanboi=myextract
  }
  as.integer(cleanboi)
}

full_time <- df %>% filter(workType == "Full Time") %>% 
  filter(salary_string != '') %>% 
  filter(grepl("\\d", salary_string)) %>% 
  separate(salary_string, c('min','max') ,'-') %>% 
  rowwise() %>%
  mutate(`min_cleaned` = getMoney(min, max)) %>%
  mutate(`max_cleaned` = getMoney(max, min)) %>%
  filter(min_cleaned >= 50 && max_cleaned >= 50) 

full_time

# write to maria}
dbWriteTable(con, "australia_rectified", full_time)

Project Fork

Having all of this wonderful data, we wanted to make a salary predictor based on location, job title, and skills mentioned in the job description. As a flex we could use that predictor to assign a value to each individual skill based on expected pay (holding location and job title constant).

Where we were able to meet the objectives of the project was in analyzing the job descriptions, so in our following Analysis section we rely just on job descriptions from each job listing.

Analysis

Data Read in

Here we read in our aggregated data.

Code Summary
+ SQL connection set up
+ Read our aggregated data to a dataframe
+ Restrict data based on memory as a limiting factor

# SQL connection set up
con <- dbConnect(RMariaDB::MariaDB(), username="root", password="TestCase123.", dbname ="basic", host="127.0.0.1")
dbListTables(con)

## [1] "data_science_job_posting"            "Rectified_data_science_job_postings"
## [3] "Glassdoor USA Dataset"               "Rectified_Listings"                 
## [5] "listings2019_2022"                   "Rectified_Glassdoor"                
## [7] "Rectified_Cleaned_DS_Jobs"           "Cleaned_DS_Jobs"                    
## [9] "Rectified_Total"

# Read our aggregated data to a dataframe
df <- dbReadTable(con, "Rectified_Total")

# Restrict data based on memory as a limiting factor
sub_length = 100
new_df <- data.frame( doc_id = 1:sub_length, text =df$Job_Description[1:sub_length] )

Note, this file was knit using “4 GB 1600 MHz DDR3” Memory. You can reproduce the results using a sub_length greater than 100. Since the corpus is held dynamically in memory, memory becomes a limiting factor in running these results.

Corpus Creation

Corpus Creation, Option 1

Here we create the corpora

Code summary
+ Create the corpora
+ Write the corpora to file

# Create the corpora
ds <- DataframeSource(new_df)
v <- VCorpus(ds)
x <- Corpus(ds)

Note, the function Corpus() from the tm package generates either a Virtual Corpus or a Simple Corpus based on context. Both are fully kept in memory, the difference being that a simple corpus is optimized for the most common tasks required of the tm package. Going forward we will use the simple corpus, x; However we create and store a Virtual Corpus as well in case a future project requires that flexibility.

# Write the corpus to file, not evaluated for our purposes
writeCorpus(v, path = "./Vcorpus_sub")
writeCorpus(x, path = "./corpus_sub")

Note, this step allows us to create the corpus once to be read into multiple projects. The code to read an already created corpus is in the next section.

Load the Corpus, Option 2

Assuming we’ve already created our corpus we can read it directly into a given project.

# Load the corpus as an option instead of recreating, not evaluated for our purposes
x <- Corpus(DirSource("./corpus_sub"), readerControl = list(language="lat"))

Tidy and Clean Corpus

Here we tidy and strip our corpus.

Code Summary
+ Transform content to lower case
+ Remove filler words
+ Inspect one element of the corpus, equivalent to one job description, before cleaning
+ Inspect the same element of the corpus, after cleaning

Notice how after the cleaning connector words are gone and everything is lower case.

Also note, the job descriptions are long. The output is not hidden so you can see the difference between the corpus before and after cleaning for one job description. Instead of scrolling you can navigate between sections using the Table of Contents.

# Transform content to lower case
xc <- tm_map(x, content_transformer(tolower))

# Remove filler words
xc <- tm_map(xc, removeWords, stopwords("english"))

# Inspect one element of the corpus, equivalent to one job description, before cleaning
inspect(x[2])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      2 
## Overview\n\n\nAnalysis Group is one of the largest international economics consulting firms, with more than 1,000 professionals across 14 offices in North America, Europe, and Asia. Since 1981, we have provided expertise in economics, finance, health care analytics, and strategy to top law firms, Fortune Global 500 companies, and government agencies worldwide. Our internal experts, together with our network of affiliated experts from academia, industry, and government, offer our clients exceptional breadth and depth of expertise.\n\nWe are currently seeking a Data Scientist to join our team. The ideal candidate should be passionate about working on cutting edge research and analytical services for Fortune 500 companies, global pharma/biotech firms and leaders in industries such as finance, energy and life sciences. The Data Scientist will be a contributing member to client engagements and have the opportunity to work with our network of world-class experts and thought leaders.\n\nJob Functions and Responsibilities\n\nThe candidate Data Scientist will help develop, maintain and teach new tools and methodologies related to data science and high performance computing. This position will also help Analysis Group in maintaining our leadership position in terms of advancing methodology and data analytics. The Data Scientist will be responsible for staying abreast of new developments in technology relating to data science, to share more broadly with Analysis Group.\n\nKey responsibilities for this position will include:\nWorking with project teams to address data science/computing challenges\nIdentifying opportunities for technology to enhance service offerings\nActing as a resource and participating in client engagements and research as part of the project team\nMaintaining up-to-date knowledge of computing tools, providing technical training and helping to grow the in-house knowledge base, specifically in a Linux environment\nPresenting research at selected conferences\nExamples of activities for the Data Scientist will include:\nDeveloping data engineering and machine learning production systems for full stack data science projects\nUsing natural language processing methodologies to work with EMR data, social media data and other unstructured data\nOptimizing procedures for managing and accessing large databases (e.g., insurance claims, electronic health records, financial transactions)\nCreating interactive analytics portals and data visualizations (e.g., using R/Shiny, Python/Flask, D3)\nBuilding and maintaining high performance computing (HPC) tools on grid and cloud computing environments\nDeveloping and reviewing software and packages in R, Python and other Object Oriented Languages\nEstablishing optimized procedures for repetitive or computationally intensive tasks (C, C++, Cuda-C)\nQualifications\nStrong credentials and experience in database management and data visualization\nSignificant experience working within a Linux environment required\nBackground in Statistics/Econometrics or Biostatistics\nIdeally PhD in Computer Science, Mathematics, Statistics, Economics or other relevant scientific degree with relevant experience. Other candidates with at least one year of experience in the field may also be considered\nExcellent written and verbal communication skills\nProject experience with R and/or Python\nFamiliar with online/cloud computing/storage (e.g., AWS)\nDemonstrated experience working on project teams and collaborating with others\nSCIENTIFIQUE DES DONNÃ‰ES\n\n*Lâ€™utilisation du genre masculin sert uniquement Ã  allÃ©ger le texte et est utilisÃ© ici en tant que genre neutre\n\nSurvol\n\nGroupe dâ€™analyse ltÃ©e est lâ€™une des plus grandes firmes de services-conseils en Ã©conomie, comptant plus de 950 professionnels rÃ©partis dans 14 bureaux en AmÃ©rique du Nord, en Europe et en Asie. Depuis 1981, nous offrons notre expertise en matiÃ¨re de stratÃ©gie, dâ€™Ã©conomie, de finance et dâ€™analyse dans le domaine des soins de santÃ© aux grands cabinets dâ€™avocats, aux sociÃ©tÃ©s Fortune Global 500 et aux agences gouvernementales du monde entier. Nos professionnels en poste conjuguÃ©s Ã  notre rÃ©seau de spÃ©cialistes affiliÃ©s issus dâ€™universitÃ©s, dâ€™industries spÃ©cifiques et dâ€™organismes gouvernementaux procurent Ã  notre clientÃ¨le un savoir-faire dâ€™une portÃ©e et dâ€™une profondeur exceptionnelles.\n\nNous sommes prÃ©sentement Ã  la recherche d'un Scientifique des donnÃ©es (Â« Data Scientist Â») pour se joindre Ã  notre Ã©quipe. Le candidat idÃ©al devrait Ãªtre passionnÃ© par la recherche de pointe et les services analytiques pour les entreprises Fortune 500, les entreprises pharmaceutiques et biotechnologiques mondiales et les chefs de file dans des secteurs de la finance, l'Ã©nergie et les sciences de la vie. Le Scientifique des donnÃ©es sera un membre contributeur aux mandats des clients et aura l'occasion de travailler avec notre rÃ©seau d'experts et de leaders d'opinion de classe mondiale.\n\nDescription du poste et des responsabilitÃ©s\n\nLe scientifique des donnÃ©es aidera Ã  dÃ©velopper, maintenir et enseigner de nouveaux outils et mÃ©thodologies liÃ©s Ã  la science des donnÃ©es (Â« Data Science Â») et au HPC. Ce poste aidera Ã©galement le Groupe d'analyse Ã  maintenir sa position de chef de file en ce qui a trait Ã  l'avancement de la mÃ©thodologie et de l'analyse des donnÃ©es. Le scientifique des donnÃ©es sera chargÃ© de se tenir au courant des nouveaux dÃ©veloppements technologiques liÃ©s Ã  la science des donnÃ©es, afin de les partager plus largement avec le Groupe d'analyse.\n\nLes principales responsabilitÃ©s de ce poste comprendront:\n\n- Collaborer avec les consultants pour relever les dÃ©fis de la science des donnÃ©es et de sciences informatiques\n\n- Agir Ã  titre de ressource et participer aux mandats et Ã  la recherche en tant que membre de l'Ã©quipe de projet\n\n- Maintenir Ã  jour les connaissances sur les outils informatiques, fournir une formation technique et aider Ã  dÃ©velopper la base de connaissances interne, notamment dans un environnement Linux\n\n- PrÃ©senter la recherche Ã  des confÃ©rences choisies\n\nExemples de tÃ¢ches du scientifique des donnÃ©es :\n\n- DÃ©veloppement de systÃ¨mes de production en ingÃ©nierie des donnÃ©es ainsi quâ€™en apprentissage machine pour des projets de science des donnÃ©es full stack\n\n- Utiliser des mÃ©thodologies NLP pour travailler avec les donnÃ©es mÃ©dicales Ã©lectroniques, les donnÃ©es des mÃ©dias sociaux et d'autres donnÃ©es non structures\n\n- Optimiser les procÃ©dures de gestion et d'accÃ¨s aux grandes bases de donnÃ©es (ex. rÃ©clamations d'assurance, dossiers de santÃ© Ã©lectroniques, transactions financiÃ¨res)\n\n- CrÃ©ation de portails d'analyse interactifs et de visualisations de donnÃ©es (par exemple, en utilisant R/Shiny, Python/Flask, D3)\n\n- Construire et maintenir des outils de calcul de haute performance (HPC).\n\n- DÃ©veloppement et rÃ©vision de codes en R, Python et autres langages\n\n- Mise en place de procÃ©dures optimisÃ©es pour les tÃ¢ches rÃ©pÃ©titives ou intensives en calcul (C, C++, Cuda-C)\n\nQualifications requises\n\n- Solides rÃ©fÃ©rences et expÃ©rience dans la gestion de bases de donnÃ©es et de la visualisation de donnÃ©es\n\n- ExpÃ©rience de travail significative dans un environnement Linux requise\n\n- ExpÃ©rience antÃ©rieure en statistique/Ã©conomÃ©trie ou bio-statistique\n\n- IdÃ©alement, Ãªtre titulaire d'un doctorat en sciences informatiques, en mathÃ©matiques, en statistique, en Ã©conomie ou d'un autre diplÃ´me scientifique pertinent et possÃ©der une expÃ©rience pertinente. Les candidats ayant au moins un an d'expÃ©rience dans le domaine peuvent Ã©galement Ãªtre considÃ©rÃ©s.\n\n- Excellentes aptitudes de communication Ã©crite et verbale\n\n- ExpÃ©rience de projet avec R et/ou Python\n\n- FamiliaritÃ© avec l'informatique en ligne/info nuagique et le stockage (AWS)\n\n- ExpÃ©rience de travail dÃ©montrÃ©e au sein d'Ã©quipes de projet et de collaboration avec d'autres personnes\n\nÂ\nEqual Opportunity Employer/Protected Veterans/Individuals with Disabilities.\nPlease view Equal Employment Opportunity Posters provided by OFCCP here.\nThe contractor will not discharge or in any other manner discriminate against employees or applicants because they have inquired about, discussed, or disclosed their own pay or the pay of another employee or applicant. However, employees who have access to the compensation information of other employees or applicants as a part of their essential job functions cannot disclose the pay of other employees or applicants to individuals who do not otherwise have access to compensation information, unless the disclosure is (a) in response to a formal complaint or charge, (b) in furtherance of an investigation, proceeding, hearing, or action, including an investigation conducted by the employer, or (c) consistent with the contractor's legal duty to furnish information. 41 CFR 60-1.35(c)

# Inspect the same element of the corpus, after cleaning
inspect(xc[2])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              2 
## overview\n\n\nanalysis group  one   largest international economics consulting firms,    1,000 professionals across 14 offices  north america, europe,  asia. since 1981,   provided expertise  economics, finance, health care analytics,  strategy  top law firms, fortune global 500 companies,  government agencies worldwide.  internal experts, together   network  affiliated experts  academia, industry,  government, offer  clients exceptional breadth  depth  expertise.\n\n  currently seeking  data scientist  join  team.  ideal candidate   passionate  working  cutting edge research  analytical services  fortune 500 companies, global pharma/biotech firms  leaders  industries   finance, energy  life sciences.  data scientist will   contributing member  client engagements    opportunity  work   network  world-class experts  thought leaders.\n\njob functions  responsibilities\n\n candidate data scientist will help develop, maintain  teach new tools  methodologies related  data science  high performance computing.  position will also help analysis group  maintaining  leadership position  terms  advancing methodology  data analytics.  data scientist will  responsible  staying abreast  new developments  technology relating  data science,  share  broadly  analysis group.\n\nkey responsibilities   position will include:\nworking  project teams  address data science/computing challenges\nidentifying opportunities  technology  enhance service offerings\nacting   resource  participating  client engagements  research  part   project team\nmaintaining --date knowledge  computing tools, providing technical training  helping  grow  -house knowledge base, specifically   linux environment\npresenting research  selected conferences\nexamples  activities   data scientist will include:\ndeveloping data engineering  machine learning production systems  full stack data science projects\nusing natural language processing methodologies  work  emr data, social media data   unstructured data\noptimizing procedures  managing  accessing large databases (e.g., insurance claims, electronic health records, financial transactions)\ncreating interactive analytics portals  data visualizations (e.g., using r/shiny, python/flask, d3)\nbuilding  maintaining high performance computing (hpc) tools  grid  cloud computing environments\ndeveloping  reviewing software  packages  r, python   object oriented languages\nestablishing optimized procedures  repetitive  computationally intensive tasks (c, c++, cuda-c)\nqualifications\nstrong credentials  experience  database management  data visualization\nsignificant experience working within  linux environment required\nbackground  statistics/econometrics  biostatistics\nideally phd  computer science, mathematics, statistics, economics   relevant scientific degree  relevant experience.  candidates   least one year  experience   field may also  considered\nexcellent written  verbal communication skills\nproject experience  r / python\nfamiliar  online/cloud computing/storage (e.g., aws)\ndemonstrated experience working  project teams  collaborating  others\nscientifique des donnã‰es\n\n*lâ€™utilisation du genre masculin sert uniquement ã  allã©ger le texte et est utilisã© ici en tant que genre neutre\n\nsurvol\n\ngroupe dâ€™analyse ltã©e est lâ€™une des plus grandes firmes de services-conseils en ã©conomie, comptant plus de 950 professionnels rã©partis dans 14 bureaux en amã©rique du nord, en europe et en asie. depuis 1981, nous offrons notre expertise en matiã¨re de stratã©gie, dâ€™ã©conomie, de finance et dâ€™analyse dans le domaine des soins de santã© aux grands cabinets dâ€™avocats, aux sociã©tã©s fortune global 500 et aux agences gouvernementales du monde entier. nos professionnels en poste conjuguã©s ã  notre rã©seau de spã©cialistes affiliã©s issus dâ€™universitã©s, dâ€™industries spã©cifiques et dâ€™organismes gouvernementaux procurent ã  notre clientã¨le un savoir-faire dâ€™une portã©e et dâ€™une profondeur exceptionnelles.\n\nnous sommes prã©sentement ã  la recherche d'un scientifique des donnã©es (â« data scientist â») pour se joindre ã  notre ã©quipe. le candidat idã©al devrait ãªtre passionnã© par la recherche de pointe et les services analytiques pour les entreprises fortune 500, les entreprises pharmaceutiques et biotechnologiques mondiales et les chefs de file dans des secteurs de la finance, l'ã©nergie et les sciences de la vie. le scientifique des donnã©es sera un membre contributeur aux mandats des clients et aura l'occasion de travailler avec notre rã©seau d'experts et de leaders d'opinion de classe mondiale.\n\ndescription du poste et des responsabilitã©s\n\nle scientifique des donnã©es aidera ã  dã©velopper, maintenir et enseigner de nouveaux outils et mã©thodologies liã©s ã  la science des donnã©es (â« data science â») et au hpc. ce poste aidera ã©galement le groupe d'analyse ã  maintenir sa position de chef de file en ce qui  trait ã  l'avancement de la mã©thodologie et de l'analyse des donnã©es. le scientifique des donnã©es sera chargã© de se tenir au courant des nouveaux dã©veloppements technologiques liã©s ã  la science des donnã©es, afin de les partager plus largement avec le groupe d'analyse.\n\nles principales responsabilitã©s de ce poste comprendront:\n\n- collaborer avec les consultants pour relever les dã©fis de la science des donnã©es et de sciences informatiques\n\n- agir ã  titre de ressource et participer aux mandats et ã  la recherche en tant que membre de l'ã©quipe de projet\n\n- maintenir ã  jour les connaissances sur les outils informatiques, fournir une formation technique et aider ã  dã©velopper la base de connaissances interne, notamment dans un environnement linux\n\n- prã©senter la recherche ã  des confã©rences choisies\n\nexemples de tã¢ches du scientifique des donnã©es :\n\n- dã©veloppement de systã¨mes de production en ingã©nierie des donnã©es ainsi quâ€™en apprentissage machine pour des projets de science des donnã©es full stack\n\n- utiliser des mã©thodologies nlp pour travailler avec les donnã©es mã©dicales ã©lectroniques, les donnã©es des mã©dias sociaux et d'autres donnã©es non structures\n\n- optimiser les procã©dures de gestion et d'accã¨s aux grandes bases de donnã©es (ex. rã©clamations d'assurance, dossiers de santã© ã©lectroniques, transactions financiã¨res)\n\n- crã©ation de portails d'analyse interactifs et de visualisations de donnã©es (par exemple, en utilisant r/shiny, python/flask, d3)\n\n- construire et maintenir des outils de calcul de haute performance (hpc).\n\n- dã©veloppement et rã©vision de codes en r, python et autres langages\n\n- mise en place de procã©dures optimisã©es pour les tã¢ches rã©pã©titives ou intensives en calcul (c, c++, cuda-c)\n\nqualifications requises\n\n- solides rã©fã©rences et expã©rience dans la gestion de bases de donnã©es et de la visualisation de donnã©es\n\n- expã©rience de travail significative dans un environnement linux requise\n\n- expã©rience antã©rieure en statistique/ã©conomã©trie ou bio-statistique\n\n- idã©alement, ãªtre titulaire d'un doctorat en sciences informatiques, en mathã©matiques, en statistique, en ã©conomie ou d'un autre diplã´ scientifique pertinent et possã©der une expã©rience pertinente. les candidats ayant au moins un  d'expã©rience dans le domaine peuvent ã©galement ãªtre considã©rã©s.\n\n- excellentes aptitudes de communication ã©crite et verbale\n\n- expã©rience de projet avec r et/ou python\n\n- familiaritã© avec l'informatique en ligne/info nuagique et le stockage (aws)\n\n- expã©rience de travail dã©montrã©e au sein d'ã©quipes de projet et de collaboration avec d'autres personnes\n\nâ\nequal opportunity employer/protected veterans/individuals  disabilities.\nplease view equal employment opportunity posters provided  ofccp .\n contractor will  discharge     manner discriminate  employees  applicants    inquired , discussed,  disclosed   pay   pay  another employee  applicant. however, employees   access   compensation information   employees  applicants   part   essential job functions  disclose  pay   employees  applicants  individuals    otherwise  access  compensation information, unless  disclosure  ()  response   formal complaint  charge, (b)  furtherance   investigation, proceeding, hearing,  action, including  investigation conducted   employer,  (c) consistent   contractor's legal duty  furnish information. 41 cfr 60-1.35(c)

Demonstrate the DTM

Here we demonstrate the Document Term Matrix. This is a grid where terms are listed on both the x and the y-axis. Where the word is equal to itself, along the diagonal from top-left to bottom-right, the value is not applicable. In all of the other cells a value is assigned based on how often the two different words are close to each other.

Code Summary
+ Create the Document Term Matrix (DTM)
+ View summary details of the DTM
+ Find the words associated with the word ‘data’

# Create the Document Term Matrix
dtm <- DocumentTermMatrix(xc)

# View summary details of the DTM
inspect(dtm)

## <<DocumentTermMatrix (documents: 100, terms: 6530)>>
## Non-/sparse entries: 20117/632883
## Sparsity           : 97%
## Maximal term length: 72
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs ability business data experience learning machine science team will work
##   13       5       17   25         14        0       0       1    7    6    9
##   16       2        1   13         11        2       2       1    7   16    8
##   17       2        1   13         11        2       2       1    7   16    8
##   2        0        0   17          5        1       2       7    1    7    2
##   20       0       12   36         22        2       2       8   14   10    8
##   25      13        0    4          3        1       0       1    2    1    5
##   8        0        0   17          5        1       2       7    1    7    2
##   89       2       10   20         11        3       4      11    2    4    6
##   9        2        1   13         11        2       2       1    7   16    8
##   90       2        7   15         11        3       4       5    2    5    6

Note, we set the association level to ‘0.45’. By lowering the association level we would produce more results.

# Find the words associated with the word 'data' 
findAssocs(dtm, "data", 0.45)

## $data
##        management              team        experience             tools 
##              0.64              0.60              0.59              0.58 
##              work          building          science,           analyze 
##              0.54              0.54              0.54              0.53 
##          business           reports              part          database 
##              0.53              0.53              0.52              0.52 
##           closely         providing           science              also 
##              0.52              0.51              0.50              0.50 
##           working       development             needs responsibilities: 
##              0.50              0.49              0.49              0.49 
##        individual            design            skills            strong 
##              0.49              0.49              0.48              0.48 
##         languages             agile          physical      demonstrated 
##              0.48              0.48              0.47              0.47 
##     visualization       proactively      bachelorâ€™s           project 
##              0.47              0.47              0.47              0.46 
##          services              will               key          projects 
##              0.46              0.46              0.46              0.46 
##       statistics,             using           utilize         warehouse 
##              0.46              0.46              0.46              0.46 
##         analytics         including             field             group 
##              0.45              0.45              0.45              0.45 
##           written       operations,              play 
##              0.45              0.45              0.45

Programming Languages

Here we apply a series of manipulations to arrive at a list of programming languages ranked by frequency in the corpus.

The Master List

Here we grab a list of programming languages.

The data is a list of programming languages from Wikipedia collected in 2015 by github user, Jamhall.

# Grab a list of programming languages
url.data <- "https://raw.githubusercontent.com/jamhall/programming-languages-csv/master/languages.csv"
raw <- read.csv(url(url.data), header = TRUE,)
programming_list <- tolower(raw$name)

Technical Skills Data

Here we use the power of the Document Term Matrix to find instances of the programming languages in the corpus of job descriptions.

# Produce the DTM using the list of programming languages
programming_list_dtm <- DocumentTermMatrix(xc, list(dictionary = c(programming_list)))

# Convert the DTM into a dataframe for ease of use
programming_list_df <- as.data.frame(as.matrix(programming_list_dtm), stringsAsFactors=False)

Test Metrics for Count

Here we test a routine. We’re trying to count the number of rows that exist for the skill ‘python’.

# Count the number of rows that exist for the skill 'python'
sum(programming_list_df$python != 0, na.rm=TRUE)

## [1] 25

Skills and Counts Dataframe

Here we make a new dataframe with skills and counts.

Code Summary
+ Create a raw dataframe
+ Populate the dataframe with counts of programming languages
+ Remove the programming languages with a count of zero

# Create a raw dataframe ready to be populated
index_df <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(index_df) <-c("Skill", "Count")

# Populate the raw dataframe similarly using the methodology in the preceding 
# section to count 'python' but now for all programming languages
list_holder <- list()
for(i in 1:ncol(programming_list_df)) {       # for-loop over columns
  index_df[i,] <- c(colnames(programming_list_df)[i], (sum(programming_list_df[, i] != 0, na.rm=TRUE)))
}
index_df

# Remove all programming languages with a count of zero
finalizedProgramList <- index_df[index_df$Count !=0,]

Programming Languages Ranked

Here we display the programming languages ranked by frequency in the corpus. While all of the words are associated with programming languages there are results whose counts are conflated with non-programming language uses of the word, such as “make”.

finalizedProgramList$Count <- as.integer(finalizedProgramList$Count)
arrange(finalizedProgramList, desc(Count))

Soft Skills

Here we repeat the preceding series of manipulations to arrive at a list of soft skills ranked by frequency in the corpus.

The Master List

Here we grab a list of soft skills.

The list of soft skills was compiled by combining the following three lists:
+ https://simplicable.com/en/data-science-skills
+ https://simplicable.com/new/soft-skills
+ https://simplicable.com/new/communication-skills

# Grab a list of soft skills
url.data <- "https://raw.githubusercontent.com/Amantux/Project_3/main/Data_Skills.csv"
raw <- read.csv(url(url.data), header= TRUE,)
head(raw)

soft_skills <- tolower(raw$ï..Skills)

Non-Technical Skills Data

Here we use the power of the Document Term Matrix to find instances of the soft skills in the corpus of job descriptions.

# Produce the DTM using the list of soft skills
soft_skills_dtm <- DocumentTermMatrix(xc, list(dictionary = c(soft_skills)))

# Convert the DTM into a dataframe for ease of use
soft_skills_dtm <- as.data.frame(as.matrix(soft_skills_dtm), stringsAsFactors=False)

Test Metrics for Count

Here we test a routine. We’re trying to count the number of rows that exist for the skill ‘analytics’.

# Count the number of rows that exist for the skill 'analytics'
sum(soft_skills_dtm$analytics != 0, na.rm=TRUE)

## [1] 27

Skills and Counts Dataframe

Here we make a new dataframe with skills and counts.

Code Summary
+ Create a raw dataframe
+ Populate the dataframe with counts of soft skills
+ Remove the soft skills with a count of zero

# Create a raw dataframe ready to be populated
index_df_soft <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(index_df_soft) <-c("Skill", "Count")

# Populate the raw dataframe similarly using the methodology in the preceding 
# section to count 'analytics' but now for all soft skills
list_holder <- list()
for(i in 1:ncol(soft_skills_dtm)) {       # for-loop over columns
  index_df_soft[i,] <- c(colnames(soft_skills_dtm)[i], (sum(soft_skills_dtm[, i] != 0, na.rm=TRUE)))
}
index_df_soft

# Remove all soft skills with a count of zero
finalizedSoftList <- index_df_soft[index_df$Count !=0,]

Soft Skills Ranked

Here we display the soft skills ranked by frequency in the corpus. While all of the words are associated with soft skills there are results whose counts are conflated by other, generic use, such as the word “using”.

finalizedSoftList$Count <- as.integer(finalizedSoftList$Count)
arrange(finalizedSoftList, desc(Count))

Concluding Remarks

Decisions made

Data Initially we wanted to scrape job listings from job boards like Indeed and LinkedIn; However we ran into issues with being able to collect data at scale. The kaggle datasets allowed us to pursue our lines of analysis as a proof of concept. The natural progression of the project would be to duplicate it using current, scraped job listings.

Predictors Because we have so much data from the job listings, an interesting project would be to create salary predictors based on location, job title, and required skills.

Resources

Here is a tutorial for web scraping in RSelenium
Trello is a way to collective track and prioritize elements of a project
neo4j is database system which organizes data based on association
docker is a way to create instances of an environment to run your code. It’s a best in practice because any docking container you create would be identical across computers.
tm, or text miner, is a package for working with corpora
MariaDB is a SQL language that works well with docker containers.

Data607 Project 3 - Most Valued Data Science Skills

MEAO - A. Moyse, A. Elsaeyed, C. Alvarez, P. O’Flaherty

2022-03-22

Getting Started

Instructions

Proposed Implementation

Libraries

Data

Source Data

Data Manipulation, Ex 1

Data Manipulation, Ex 2

Project Fork

Analysis

Data Read in

Corpus Creation

Corpus Creation, Option 1

Load the Corpus, Option 2

Tidy and Clean Corpus

Demonstrate the DTM

Programming Languages

The Master List

Technical Skills Data

Test Metrics for Count

Skills and Counts Dataframe

Programming Languages Ranked

Soft Skills

The Master List

Non-Technical Skills Data

Test Metrics for Count

Skills and Counts Dataframe

Soft Skills Ranked

Concluding Remarks

Decisions made

Resources