Use data to answer the question, “Which are the most valued data science skills?”
Our preliminary goal was to use multiple data sets from Kaggle to:
+ Unify roughly 26,000 data science job listings into one table
+ Build a salary predictor based on location, job title, and the skills mentioned in each job description
Our actual implementation was to:
+ Unify the job listings and treat the job descriptions as a corpus
+ Rank programming languages and soft skills by how frequently they appear across the listings
Here we load our libraries.
# Install new packages
#install.packages('RMariaDB')
#install.packages('bcputility')
#install.packages('tm')
# Load packages --------------------------------------
library(RMariaDB) # For accessing our SQL data
library(DBI)      # Database interface used alongside RMariaDB
library(bcputility) # Bulk copy utilities for SQL
library(tm)       # Our star package with a host of functionality for working with corpora
library(dplyr)    # Data manipulation
library(stringr)  # String helpers
library(tidytext) # Tidy-friendly text mining
Our underlying data comes from four Kaggle datasets that we unified into a combined set of nearly 26,000 data science-focused job listings.
+ Nicolae Righeriu (2020), "10000 Data Scientist Job Postings from the USA," https://www.kaggle.com/nrigheriu/data-scientist-job-postings/data
+ Rashik Rahman (2021), "Data Science Job Posting on Glassdoor," https://www.kaggle.com/rashikrahmanpritom/data-science-job-posting-on-glassdoor
+ Michael Bryant (2022), "Data Science Job Postings and Salaries," https://www.kaggle.com/michaelbryantds/california-salaries-in-data-science
+ Kajal Yadav (2021), "Glassdoor Pre-pandemic Dataset for USA," https://www.kaggle.com/techykajal/glassdoor-prepandemic-dataset-for-usa
Here we memorialize how salary information from one of the data sets was cleaned. After a fork in the project we no longer needed salary information, but we document the cleaning here for completeness.
library(tidyr)
library(dplyr)
library(naniar)
library(stringr)
glassdoor <- read.csv("C:\\Users\\Carlos\\Documents\\GitHub\\Project_3\\Data_Sets\\Glassdoor USA Dataset.csv")
glassdoordf <- separate(data = glassdoor, col = Salary.Estimate, into = c("Salary_Min", "Salary_Max"), sep = "-")
table(glassdoordf$Salary_Min, useNA = 'always')
# Removing all characters after K (thousand). Results with just $ are records with an
# hourly rate ($17 - $23 Per Hour)
glassdoordf$Salary_Max = sub('\\ .*', '', glassdoordf$Salary_Max)
glassdoordf$Salary_Max = sub('\\(.*', '', glassdoordf$Salary_Max)
glassdoordf$Salary_Min[glassdoordf$Salary_Min == "$ 17"] <- "0"
# Replace all values shorter than 4 characters (i.e., anything shorter than "$60K") with NA.
glassdoordf$Salary_Max[nchar(glassdoordf$Salary_Max) < 4] <- NA
glassdoordf$Salary_Min[nchar(glassdoordf$Salary_Min) < 4] <- NA
table(glassdoordf$Salary_Max, useNA = 'always')
# If the record doesn't have a range, salary_Min will have the salary for the job posting. If this
# is the case, we'll be applying that value to Salary_Max as well.
glassdoordf <- glassdoordf %>%
  mutate(Salary_Max = ifelse(is.na(Salary_Max), Salary_Min, Salary_Max))
Here we memorialize how salary information from another data set was cleaned. As above, salary information was ultimately not needed, but we document the cleaning for completeness.
library(RMariaDB)
library(DBI)
library(bcputility)
library(tm)
library(dplyr)
library(tidyr) # for separate()
library(stringr)
#DB Set up
con <- dbConnect(RMariaDB::MariaDB(), username="ahmed", password="buckets123!@#", dbname ="basic", host="localhost")
dbListTables(con)
df <- dbReadTable(con, "listings2019_2022")
df
getMoney <- function(str1, str2) {
  # Pull the first run of digits from each salary string
  extractboi1 <- str_extract(str1, "\\d+")
  extractboi2 <- str_extract(str2, "\\d+")
  # Check the first string; if it's empty then use the second
  if (is.na(extractboi1)) {
    myextract <- extractboi2
  } else {
    myextract <- extractboi1
  }
  # Normalize to thousands: 5 digits -> keep first 2, 6+ digits -> keep first 3,
  # 2-3 digits are already expressed in thousands
  cleanboi <- ''
  if (nchar(myextract) == 5) {
    cleanboi <- substr(myextract, 1, 2)
  } else if (nchar(myextract) >= 6) {
    cleanboi <- substr(myextract, 1, 3)
  } else if (nchar(myextract) == 3 || nchar(myextract) == 2) {
    cleanboi <- myextract
  }
  as.integer(cleanboi)
}
full_time <- df %>%
  filter(workType == "Full Time") %>%
  filter(salary_string != '') %>%
  filter(grepl("\\d", salary_string)) %>%
  separate(salary_string, c('min', 'max'), '-') %>%
  rowwise() %>%
  mutate(min_cleaned = getMoney(min, max)) %>%
  mutate(max_cleaned = getMoney(max, min)) %>%
  filter(min_cleaned >= 50 & max_cleaned >= 50)
full_time
# Write the cleaned table back to MariaDB
dbWriteTable(con, "australia_rectified", full_time)
Having all of this wonderful data, we wanted to build a salary predictor based on location, job title, and the skills mentioned in each job description. As a flex, we could then use that predictor to assign a value to each individual skill based on expected pay (holding location and job title constant).
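As a sketch only: the predictor was never built, and every name below (salary_df, Salary_Max, Location, Job_Title, has_python, has_sql) is hypothetical, but a first pass might have looked something like this.
# Hypothetical sketch of the salary predictor we had in mind; 'salary_df' and all of
# its columns are placeholders, not objects created elsewhere in this project.
# Each has_* column would be a 0/1 indicator for a skill mentioned in the description.
salary_model <- lm(Salary_Max ~ Location + Job_Title + has_python + has_sql,
                   data = salary_df)
summary(salary_model)
# Holding location and job title constant, the coefficient on has_python would
# approximate the expected pay difference associated with mentioning Python.
coef(salary_model)["has_python"]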
Where we were able to meet the objectives of the project was in analyzing the job descriptions, so the following Analysis section relies solely on the job description from each listing.
Here we read in our aggregated data.
Code Summary
+ SQL connection set up
+ Read our aggregated data to a dataframe
+ Restrict data based on memory as a limiting factor
# SQL connection set up
con <- dbConnect(RMariaDB::MariaDB(), username="root", password="TestCase123.", dbname ="basic", host="127.0.0.1")
dbListTables(con)
## [1] "data_science_job_posting" "Rectified_data_science_job_postings"
## [3] "Glassdoor USA Dataset" "Rectified_Listings"
## [5] "listings2019_2022" "Rectified_Glassdoor"
## [7] "Rectified_Cleaned_DS_Jobs" "Cleaned_DS_Jobs"
## [9] "Rectified_Total"
# Read our aggregated data to a dataframe
df <- dbReadTable(con, "Rectified_Total")
# Restrict data based on memory as a limiting factor
sub_length = 100
new_df <- data.frame(doc_id = 1:sub_length, text = df$Job_Description[1:sub_length])
Note, this file was knit on a machine with 4 GB 1600 MHz DDR3 memory. You can reproduce the results with a sub_length greater than 100 if you have more memory available; since the corpus is held entirely in memory, memory is the limiting factor in running these results.
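One way to gauge how much headroom there is before increasing sub_length is to check the size of the objects already in memory; a small sketch using base R's object.size() (not evaluated above):
# Check the in-memory footprint of the full table and the subset before scaling up
format(object.size(df), units = "MB")
format(object.size(new_df), units = "MB")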
Here we create the corpora.
Code Summary
+ Create the corpora
+ Write the corpora to file
# Create the corpora
ds <- DataframeSource(new_df)
v <- VCorpus(ds)
x <- Corpus(ds)
Note, the function Corpus() from the tm package generates either a Virtual Corpus or a Simple Corpus based on context. Both are fully kept in memory, the difference being that a simple corpus is optimized for the most common tasks required of the tm package. Going forward we will use the simple corpus, x; however, we also create and store a Virtual Corpus in case a future project requires that flexibility.
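To confirm which flavor each constructor produced, we can check the classes directly; a quick sketch:
# Corpus() returned a SimpleCorpus here, while VCorpus() returned a VCorpus
class(x)
class(v)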
# Write the corpus to file, not evaluated for our purposes
writeCorpus(v, path = "./Vcorpus_sub")
writeCorpus(x, path = "./corpus_sub")
Note, this step allows us to create the corpus once to be read into multiple projects. The code to read an already created corpus is in the next section.
Assuming we’ve already created our corpus we can read it directly into a given project.
# Load the corpus as an option instead of recreating, not evaluated for our purposes
x <- Corpus(DirSource("./corpus_sub"), readerControl = list(language="lat"))
Here we tidy and strip our corpus.
Code Summary
+ Transform content to lower case
+ Remove filler words
+ Inspect one element of the corpus, equivalent to one job description, before cleaning
+ Inspect the same element of the corpus, after cleaning
Notice how, after cleaning, connector words are gone and everything is lower case.
Also note that the job descriptions are long. The output is not hidden so you can see the difference between the corpus before and after cleaning for one job description; instead of scrolling, you can navigate between sections using the Table of Contents.
# Transform content to lower case
xc <- tm_map(x, content_transformer(tolower))
# Remove filler words
xc <- tm_map(xc, removeWords, stopwords("english"))
# Inspect one element of the corpus, equivalent to one job description, before cleaning
inspect(x[2])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## 2
## Overview\n\n\nAnalysis Group is one of the largest international economics consulting firms, with more than 1,000 professionals across 14 offices in North America, Europe, and Asia. Since 1981, we have provided expertise in economics, finance, health care analytics, and strategy to top law firms, Fortune Global 500 companies, and government agencies worldwide. Our internal experts, together with our network of affiliated experts from academia, industry, and government, offer our clients exceptional breadth and depth of expertise.\n\nWe are currently seeking a Data Scientist to join our team. The ideal candidate should be passionate about working on cutting edge research and analytical services for Fortune 500 companies, global pharma/biotech firms and leaders in industries such as finance, energy and life sciences. The Data Scientist will be a contributing member to client engagements and have the opportunity to work with our network of world-class experts and thought leaders.\n\nJob Functions and Responsibilities\n\nThe candidate Data Scientist will help develop, maintain and teach new tools and methodologies related to data science and high performance computing. This position will also help Analysis Group in maintaining our leadership position in terms of advancing methodology and data analytics. The Data Scientist will be responsible for staying abreast of new developments in technology relating to data science, to share more broadly with Analysis Group.\n\nKey responsibilities for this position will include:\nWorking with project teams to address data science/computing challenges\nIdentifying opportunities for technology to enhance service offerings\nActing as a resource and participating in client engagements and research as part of the project team\nMaintaining up-to-date knowledge of computing tools, providing technical training and helping to grow the in-house knowledge base, specifically in a Linux environment\nPresenting research at selected conferences\nExamples of activities for the Data Scientist will include:\nDeveloping data engineering and machine learning production systems for full stack data science projects\nUsing natural language processing methodologies to work with EMR data, social media data and other unstructured data\nOptimizing procedures for managing and accessing large databases (e.g., insurance claims, electronic health records, financial transactions)\nCreating interactive analytics portals and data visualizations (e.g., using R/Shiny, Python/Flask, D3)\nBuilding and maintaining high performance computing (HPC) tools on grid and cloud computing environments\nDeveloping and reviewing software and packages in R, Python and other Object Oriented Languages\nEstablishing optimized procedures for repetitive or computationally intensive tasks (C, C++, Cuda-C)\nQualifications\nStrong credentials and experience in database management and data visualization\nSignificant experience working within a Linux environment required\nBackground in Statistics/Econometrics or Biostatistics\nIdeally PhD in Computer Science, Mathematics, Statistics, Economics or other relevant scientific degree with relevant experience. 
Other candidates with at least one year of experience in the field may also be considered\nExcellent written and verbal communication skills\nProject experience with R and/or Python\nFamiliar with online/cloud computing/storage (e.g., AWS)\nDemonstrated experience working on project teams and collaborating with others\nSCIENTIFIQUE DES DONNÉES\n\n*L’utilisation du genre masculin sert uniquement à alléger le texte et est utilisé ici en tant que genre neutre\n\nSurvol\n\nGroupe d’analyse ltée est l’une des plus grandes firmes de services-conseils en économie, comptant plus de 950 professionnels répartis dans 14 bureaux en Amérique du Nord, en Europe et en Asie. Depuis 1981, nous offrons notre expertise en matière de stratégie, d’économie, de finance et d’analyse dans le domaine des soins de santé aux grands cabinets d’avocats, aux sociétés Fortune Global 500 et aux agences gouvernementales du monde entier. Nos professionnels en poste conjugués à notre réseau de spécialistes affiliés issus d’universités, d’industries spécifiques et d’organismes gouvernementaux procurent à notre clientèle un savoir-faire d’une portée et d’une profondeur exceptionnelles.\n\nNous sommes présentement à la recherche d'un Scientifique des données (« Data Scientist ») pour se joindre à notre équipe. Le candidat idéal devrait être passionné par la recherche de pointe et les services analytiques pour les entreprises Fortune 500, les entreprises pharmaceutiques et biotechnologiques mondiales et les chefs de file dans des secteurs de la finance, l'énergie et les sciences de la vie. Le Scientifique des données sera un membre contributeur aux mandats des clients et aura l'occasion de travailler avec notre réseau d'experts et de leaders d'opinion de classe mondiale.\n\nDescription du poste et des responsabilités\n\nLe scientifique des données aidera à développer, maintenir et enseigner de nouveaux outils et méthodologies liés à la science des données (« Data Science ») et au HPC. Ce poste aidera également le Groupe d'analyse à maintenir sa position de chef de file en ce qui a trait à l'avancement de la méthodologie et de l'analyse des données. Le scientifique des données sera chargé de se tenir au courant des nouveaux développements technologiques liés à la science des données, afin de les partager plus largement avec le Groupe d'analyse.\n\nLes principales responsabilités de ce poste comprendront:\n\n- Collaborer avec les consultants pour relever les défis de la science des données et de sciences informatiques\n\n- Agir à titre de ressource et participer aux mandats et à la recherche en tant que membre de l'équipe de projet\n\n- Maintenir à jour les connaissances sur les outils informatiques, fournir une formation technique et aider à développer la base de connaissances interne, notamment dans un environnement Linux\n\n- Présenter la recherche à des conférences choisies\n\nExemples de tâches du scientifique des données :\n\n- Développement de systèmes de production en ingénierie des données ainsi qu’en apprentissage machine pour des projets de science des données full stack\n\n- Utiliser des méthodologies NLP pour travailler avec les données médicales électroniques, les données des médias sociaux et d'autres données non structures\n\n- Optimiser les procédures de gestion et d'accès aux grandes bases de données (ex. 
réclamations d'assurance, dossiers de santé électroniques, transactions financières)\n\n- Création de portails d'analyse interactifs et de visualisations de données (par exemple, en utilisant R/Shiny, Python/Flask, D3)\n\n- Construire et maintenir des outils de calcul de haute performance (HPC).\n\n- Développement et révision de codes en R, Python et autres langages\n\n- Mise en place de procédures optimisées pour les tâches répétitives ou intensives en calcul (C, C++, Cuda-C)\n\nQualifications requises\n\n- Solides références et expérience dans la gestion de bases de données et de la visualisation de données\n\n- Expérience de travail significative dans un environnement Linux requise\n\n- Expérience antérieure en statistique/économétrie ou bio-statistique\n\n- Idéalement, être titulaire d'un doctorat en sciences informatiques, en mathématiques, en statistique, en économie ou d'un autre diplôme scientifique pertinent et posséder une expérience pertinente. Les candidats ayant au moins un an d'expérience dans le domaine peuvent également être considérés.\n\n- Excellentes aptitudes de communication écrite et verbale\n\n- Expérience de projet avec R et/ou Python\n\n- Familiarité avec l'informatique en ligne/info nuagique et le stockage (AWS)\n\n- Expérience de travail démontrée au sein d'équipes de projet et de collaboration avec d'autres personnes\n\nÂ\nEqual Opportunity Employer/Protected Veterans/Individuals with Disabilities.\nPlease view Equal Employment Opportunity Posters provided by OFCCP here.\nThe contractor will not discharge or in any other manner discriminate against employees or applicants because they have inquired about, discussed, or disclosed their own pay or the pay of another employee or applicant. However, employees who have access to the compensation information of other employees or applicants as a part of their essential job functions cannot disclose the pay of other employees or applicants to individuals who do not otherwise have access to compensation information, unless the disclosure is (a) in response to a formal complaint or charge, (b) in furtherance of an investigation, proceeding, hearing, or action, including an investigation conducted by the employer, or (c) consistent with the contractor's legal duty to furnish information. 41 CFR 60-1.35(c)
# Inspect the same element of the corpus, after cleaning
inspect(xc[2])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## 2
## overview\n\n\nanalysis group one largest international economics consulting firms, 1,000 professionals across 14 offices north america, europe, asia. since 1981, provided expertise economics, finance, health care analytics, strategy top law firms, fortune global 500 companies, government agencies worldwide. internal experts, together network affiliated experts academia, industry, government, offer clients exceptional breadth depth expertise.\n\n currently seeking data scientist join team. ideal candidate passionate working cutting edge research analytical services fortune 500 companies, global pharma/biotech firms leaders industries finance, energy life sciences. data scientist will contributing member client engagements opportunity work network world-class experts thought leaders.\n\njob functions responsibilities\n\n candidate data scientist will help develop, maintain teach new tools methodologies related data science high performance computing. position will also help analysis group maintaining leadership position terms advancing methodology data analytics. data scientist will responsible staying abreast new developments technology relating data science, share broadly analysis group.\n\nkey responsibilities position will include:\nworking project teams address data science/computing challenges\nidentifying opportunities technology enhance service offerings\nacting resource participating client engagements research part project team\nmaintaining --date knowledge computing tools, providing technical training helping grow -house knowledge base, specifically linux environment\npresenting research selected conferences\nexamples activities data scientist will include:\ndeveloping data engineering machine learning production systems full stack data science projects\nusing natural language processing methodologies work emr data, social media data unstructured data\noptimizing procedures managing accessing large databases (e.g., insurance claims, electronic health records, financial transactions)\ncreating interactive analytics portals data visualizations (e.g., using r/shiny, python/flask, d3)\nbuilding maintaining high performance computing (hpc) tools grid cloud computing environments\ndeveloping reviewing software packages r, python object oriented languages\nestablishing optimized procedures repetitive computationally intensive tasks (c, c++, cuda-c)\nqualifications\nstrong credentials experience database management data visualization\nsignificant experience working within linux environment required\nbackground statistics/econometrics biostatistics\nideally phd computer science, mathematics, statistics, economics relevant scientific degree relevant experience. candidates least one year experience field may also considered\nexcellent written verbal communication skills\nproject experience r / python\nfamiliar online/cloud computing/storage (e.g., aws)\ndemonstrated experience working project teams collaborating others\nscientifique des donnã‰es\n\n*l’utilisation du genre masculin sert uniquement ã allã©ger le texte et est utilisã© ici en tant que genre neutre\n\nsurvol\n\ngroupe d’analyse ltã©e est l’une des plus grandes firmes de services-conseils en ã©conomie, comptant plus de 950 professionnels rã©partis dans 14 bureaux en amã©rique du nord, en europe et en asie. 
depuis 1981, nous offrons notre expertise en matiã¨re de stratã©gie, d’ã©conomie, de finance et d’analyse dans le domaine des soins de santã© aux grands cabinets d’avocats, aux sociã©tã©s fortune global 500 et aux agences gouvernementales du monde entier. nos professionnels en poste conjuguã©s ã notre rã©seau de spã©cialistes affiliã©s issus d’universitã©s, d’industries spã©cifiques et d’organismes gouvernementaux procurent ã notre clientã¨le un savoir-faire d’une portã©e et d’une profondeur exceptionnelles.\n\nnous sommes prã©sentement ã la recherche d'un scientifique des donnã©es (â« data scientist â») pour se joindre ã notre ã©quipe. le candidat idã©al devrait ãªtre passionnã© par la recherche de pointe et les services analytiques pour les entreprises fortune 500, les entreprises pharmaceutiques et biotechnologiques mondiales et les chefs de file dans des secteurs de la finance, l'ã©nergie et les sciences de la vie. le scientifique des donnã©es sera un membre contributeur aux mandats des clients et aura l'occasion de travailler avec notre rã©seau d'experts et de leaders d'opinion de classe mondiale.\n\ndescription du poste et des responsabilitã©s\n\nle scientifique des donnã©es aidera ã dã©velopper, maintenir et enseigner de nouveaux outils et mã©thodologies liã©s ã la science des donnã©es (â« data science â») et au hpc. ce poste aidera ã©galement le groupe d'analyse ã maintenir sa position de chef de file en ce qui trait ã l'avancement de la mã©thodologie et de l'analyse des donnã©es. le scientifique des donnã©es sera chargã© de se tenir au courant des nouveaux dã©veloppements technologiques liã©s ã la science des donnã©es, afin de les partager plus largement avec le groupe d'analyse.\n\nles principales responsabilitã©s de ce poste comprendront:\n\n- collaborer avec les consultants pour relever les dã©fis de la science des donnã©es et de sciences informatiques\n\n- agir ã titre de ressource et participer aux mandats et ã la recherche en tant que membre de l'ã©quipe de projet\n\n- maintenir ã jour les connaissances sur les outils informatiques, fournir une formation technique et aider ã dã©velopper la base de connaissances interne, notamment dans un environnement linux\n\n- prã©senter la recherche ã des confã©rences choisies\n\nexemples de tã¢ches du scientifique des donnã©es :\n\n- dã©veloppement de systã¨mes de production en ingã©nierie des donnã©es ainsi qu’en apprentissage machine pour des projets de science des donnã©es full stack\n\n- utiliser des mã©thodologies nlp pour travailler avec les donnã©es mã©dicales ã©lectroniques, les donnã©es des mã©dias sociaux et d'autres donnã©es non structures\n\n- optimiser les procã©dures de gestion et d'accã¨s aux grandes bases de donnã©es (ex. 
rã©clamations d'assurance, dossiers de santã© ã©lectroniques, transactions financiã¨res)\n\n- crã©ation de portails d'analyse interactifs et de visualisations de donnã©es (par exemple, en utilisant r/shiny, python/flask, d3)\n\n- construire et maintenir des outils de calcul de haute performance (hpc).\n\n- dã©veloppement et rã©vision de codes en r, python et autres langages\n\n- mise en place de procã©dures optimisã©es pour les tã¢ches rã©pã©titives ou intensives en calcul (c, c++, cuda-c)\n\nqualifications requises\n\n- solides rã©fã©rences et expã©rience dans la gestion de bases de donnã©es et de la visualisation de donnã©es\n\n- expã©rience de travail significative dans un environnement linux requise\n\n- expã©rience antã©rieure en statistique/ã©conomã©trie ou bio-statistique\n\n- idã©alement, ãªtre titulaire d'un doctorat en sciences informatiques, en mathã©matiques, en statistique, en ã©conomie ou d'un autre diplã´ scientifique pertinent et possã©der une expã©rience pertinente. les candidats ayant au moins un d'expã©rience dans le domaine peuvent ã©galement ãªtre considã©rã©s.\n\n- excellentes aptitudes de communication ã©crite et verbale\n\n- expã©rience de projet avec r et/ou python\n\n- familiaritã© avec l'informatique en ligne/info nuagique et le stockage (aws)\n\n- expã©rience de travail dã©montrã©e au sein d'ã©quipes de projet et de collaboration avec d'autres personnes\n\nâ\nequal opportunity employer/protected veterans/individuals disabilities.\nplease view equal employment opportunity posters provided ofccp .\n contractor will discharge manner discriminate employees applicants inquired , discussed, disclosed pay pay another employee applicant. however, employees access compensation information employees applicants part essential job functions disclose pay employees applicants individuals otherwise access compensation information, unless disclosure () response formal complaint charge, (b) furtherance investigation, proceeding, hearing, action, including investigation conducted employer, (c) consistent contractor's legal duty furnish information. 41 cfr 60-1.35(c)
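If the analysis had needed it, further cleaning is available through the same tm_map() interface; a few common transformations, sketched here but not applied in this project:
# Optional additional cleaning steps (not evaluated for our purposes)
#xc <- tm_map(xc, removePunctuation)
#xc <- tm_map(xc, removeNumbers)
#xc <- tm_map(xc, stripWhitespace)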
Here we demonstrate the Document Term Matrix (DTM). This is a grid with one row per document (job description) and one column per term; each cell holds the number of times that term appears in that document. From this matrix we can both summarize term frequencies and measure how strongly terms are associated with one another across documents.
Code Summary
+ Create the Document Term Matrix (DTM)
+ View summary details of the DTM
+ Find the words associated with the word ‘data’
# Create the Document Term Matrix
dtm <- DocumentTermMatrix(xc)
# View summary details of the DTM
inspect(dtm)
## <<DocumentTermMatrix (documents: 100, terms: 6530)>>
## Non-/sparse entries: 20117/632883
## Sparsity : 97%
## Maximal term length: 72
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs ability business data experience learning machine science team will work
## 13 5 17 25 14 0 0 1 7 6 9
## 16 2 1 13 11 2 2 1 7 16 8
## 17 2 1 13 11 2 2 1 7 16 8
## 2 0 0 17 5 1 2 7 1 7 2
## 20 0 12 36 22 2 2 8 14 10 8
## 25 13 0 4 3 1 0 1 2 1 5
## 8 0 0 17 5 1 2 7 1 7 2
## 89 2 10 20 11 3 4 11 2 4 6
## 9 2 1 13 11 2 2 1 7 16 8
## 90 2 7 15 11 3 4 5 2 5 6
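Because the rows of the DTM are documents and the columns are terms, the total frequency of each term across the corpus is simply a column sum; a short sketch (not evaluated above) of ranking terms that way:
# Rank every term by its total frequency across all job descriptions
term_totals <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(term_totals, 10)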
Note, we set the correlation limit (corlimit) to 0.45; lowering it would return more associated terms.
# Find the words associated with the word 'data'
findAssocs(dtm, "data", 0.45)
## $data
## management team experience tools
## 0.64 0.60 0.59 0.58
## work building science, analyze
## 0.54 0.54 0.54 0.53
## business reports part database
## 0.53 0.53 0.52 0.52
## closely providing science also
## 0.52 0.51 0.50 0.50
## working development needs responsibilities:
## 0.50 0.49 0.49 0.49
## individual design skills strong
## 0.49 0.49 0.48 0.48
## languages agile physical demonstrated
## 0.48 0.48 0.47 0.47
## visualization proactively bachelor’s project
## 0.47 0.47 0.47 0.46
## services will key projects
## 0.46 0.46 0.46 0.46
## statistics, using utilize warehouse
## 0.46 0.46 0.46 0.46
## analytics including field group
## 0.45 0.45 0.45 0.45
## written operations, play
## 0.45 0.45 0.45
Here we apply a series of manipulations to arrive at a list of programming languages ranked by frequency in the corpus.
Here we grab a list of programming languages.
The data is a list of programming languages from Wikipedia, compiled in 2015 by GitHub user jamhall.
# Grab a list of programming languages
url.data <- "https://raw.githubusercontent.com/jamhall/programming-languages-csv/master/languages.csv"
raw <- read.csv(url(url.data), header = TRUE)
programming_list <- tolower(raw$name)
Here we use the power of the Document Term Matrix to find instances of the programming languages in the corpus of job descriptions.
# Produce the DTM using the list of programming languages
programming_list_dtm <- DocumentTermMatrix(xc, list(dictionary = c(programming_list)))
# Convert the DTM into a dataframe for ease of use
programming_list_df <- as.data.frame(as.matrix(programming_list_dtm), stringsAsFactors = FALSE)
Here we test a routine: counting how many job descriptions (rows) mention the skill 'python' at least once.
# Count the number of job descriptions that mention the skill 'python'
sum(programming_list_df$python != 0, na.rm=TRUE)
## [1] 25
Here we make a new dataframe with skills and counts.
Code Summary
+ Create a raw dataframe
+ Populate the dataframe with counts of programming languages
+ Remove the programming languages with a count of zero
# Create a raw dataframe ready to be populated
index_df <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(index_df) <-c("Skill", "Count")
# Populate the raw dataframe using the same counting approach we tested above
# for 'python', now for every programming language
for (i in 1:ncol(programming_list_df)) { # for-loop over columns
  index_df[i, ] <- c(colnames(programming_list_df)[i],
                     sum(programming_list_df[, i] != 0, na.rm = TRUE))
}
index_df
# Remove all programming languages with a count of zero
finalizedProgramList <- index_df[index_df$Count !=0,]
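The same counts can be produced without an explicit loop: comparing the whole data frame to zero yields a logical matrix whose column sums are the document counts. A sketch of an equivalent, vectorized version (it also returns Count as a number, so the as.integer() conversion below would be unnecessary):
# Vectorized equivalent of the loop above
index_df_vec <- data.frame(Skill = colnames(programming_list_df),
                           Count = unname(colSums(programming_list_df != 0, na.rm = TRUE)))
finalizedProgramList_vec <- index_df_vec[index_df_vec$Count != 0, ]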
Here we display the programming languages ranked by frequency in the corpus. While every term in this list is a programming language name, some counts are inflated by non-programming uses of the same word, such as "make".
finalizedProgramList$Count <- as.integer(finalizedProgramList$Count)
arrange(finalizedProgramList, desc(Count))
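One way to reduce that conflation would be to drop a hand-picked exclusion list of language names that double as ordinary English words before ranking; the list below is illustrative only:
# Hypothetical, hand-picked exclusion list of ambiguous language names
ambiguous_terms <- c("make", "processing", "go", "scheme")
filteredProgramList <- finalizedProgramList[!(finalizedProgramList$Skill %in% ambiguous_terms), ]
arrange(filteredProgramList, desc(Count))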
Here we repeat the preceding series of manipulations to arrive at a list of soft skills ranked by frequency in the corpus.
Here we grab a list of soft skills.
The list of soft skills was compiled by combining the following three lists:
+ https://simplicable.com/en/data-science-skills
+ https://simplicable.com/new/soft-skills
+ https://simplicable.com/new/communication-skills
# Grab a list of soft skills
url.data <- "https://raw.githubusercontent.com/Amantux/Project_3/main/Data_Skills.csv"
raw <- read.csv(url(url.data), header = TRUE)
head(raw)
soft_skills <- tolower(raw$ï..Skills)
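The "ï.." prefix on the column name comes from a byte-order mark at the start of the CSV header. As an alternative (assuming the underlying column is named Skills once the BOM is stripped), the file could be read with an explicit encoding so the column name survives intact:
# Alternative read that strips the byte-order mark from the header
raw_bom <- read.csv(url.data, header = TRUE, fileEncoding = "UTF-8-BOM")
soft_skills <- tolower(raw_bom$Skills)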
Here we use the power of the Document Term Matrix to find instances of the soft skills in the corpus of job descriptions.
# Produce the DTM using the list of soft skills
soft_skills_dtm <- DocumentTermMatrix(xc, list(dictionary = c(soft_skills)))
# Convert the DTM into a dataframe for ease of use
soft_skills_dtm <- as.data.frame(as.matrix(soft_skills_dtm), stringsAsFactors = FALSE)
Here we test a routine: counting how many job descriptions (rows) mention the skill 'analytics' at least once.
# Count the number of job descriptions that mention the skill 'analytics'
sum(soft_skills_dtm$analytics != 0, na.rm=TRUE)
## [1] 27
Here we make a new dataframe with skills and counts.
Code Summary
+ Create a raw dataframe
+ Populate the dataframe with counts of soft skills
+ Remove the soft skills with a count of zero
# Create a raw dataframe ready to be populated
index_df_soft <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(index_df_soft) <-c("Skill", "Count")
# Populate the raw dataframe using the same counting approach we tested above
# for 'analytics', now for every soft skill
for (i in 1:ncol(soft_skills_dtm)) { # for-loop over columns
  index_df_soft[i, ] <- c(colnames(soft_skills_dtm)[i],
                          sum(soft_skills_dtm[, i] != 0, na.rm = TRUE))
}
}
index_df_soft
# Remove all soft skills with a count of zero
finalizedSoftList <- index_df_soft[index_df_soft$Count != 0, ]
Here we display the soft skills ranked by frequency in the corpus. While every term in this list relates to a soft skill, some counts are inflated by other, generic uses of the same word, such as "using".
finalizedSoftList$Count <- as.integer(finalizedSoftList$Count)
arrange(finalizedSoftList, desc(Count))
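As a possible cross-check on these counts, the tidytext package loaded at the start (but not otherwise used) can reach the same per-document counts without a DTM, working directly from the new_df data frame built earlier; results may differ slightly because the tokenizers are not identical:
# Count, for each soft skill, how many job descriptions mention it at least once
tidy_soft_counts <- new_df %>%
  unnest_tokens(word, text) %>%
  filter(word %in% soft_skills) %>%
  distinct(doc_id, word) %>%
  count(word, sort = TRUE)
tidy_soft_counts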
Data
Initially we wanted to scrape job listings from job boards like Indeed and LinkedIn; however, we ran into issues collecting data at scale. The Kaggle datasets allowed us to pursue our lines of analysis as a proof of concept. The natural progression of the project would be to repeat it using current, scraped job listings.
Predictors
Because we have so much data from the job listings, an interesting follow-on project would be to create salary predictors based on location, job title, and required skills.