Assignment on RPubs
Rmd on GitHub

Introduction

The purpose of this assignment is to answer the question, “Which are the most valued data science skills?” Within the team we have had experience applying for jobs, but we were each relatively new to the data science vernacular; for example, we have one seasoned programmer and a student with a background in chemical engineering. This project documents how we answered this question.

Approach

We had discussions about which data sets to use. The internet is filled with a plethora of data, but few sources were of applicable use, and LinkedIn was not readily accessible. We worked from our experiences applying for jobs, and during some brainstorming over Zoom we came up with searching government sites for available data, checking Kaggle, and possibly web scraping Monster or Indeed. True, we had our own presuppositions about the skills a data scientist needs, but we wanted the data to speak for itself and hopefully challenge those presuppositions.

Jackson Pollock the Data

Our thinking was that if we obtained data science job listings, we could mine the text for words and phrases. In other words, if we splash the data in front of us and find some frequencies, perhaps a pattern would arise.

Pollock painting


After googling and searching, we settled on a data set that will be reviewed in the next section. However, little did we realize how much cleaning and how many hours would be committed to making the data accessible and usable for the entire team.

Continuing on, the following are the R libraries we will be using.

library(RMySQL)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(stringr)
library(RWeka)
library(ggplot2)
library(dplyr)
library(tidytext)
library(tidyverse)
library(knitr)
library(summarytools)
library(ggraph)
library(igraph)
library(quanteda)
library(tidyr)
library(topicmodels)

RMySQL is used for working with a MySQL database. tm and RWeka were used for text mining. SnowballC, wordcloud, and RColorBrewer were used for creating the word cloud in R. stringr, dplyr, tidytext, and tidyverse were used for text manipulation. ggplot2, ggraph, and igraph were used for graphs, and knitr was used for making some tables. topicmodels and quanteda were used for topic modeling and additional text analysis. summarytools provides a handy frequency function to tally the occurrences in a column.

Importing

The data we ended up using is from the web site https://data.world/jobspikr/10000-data-scientist-job-postings-from-the-usa. Phil was able to obtain the csv, which was scraped using JobsPikr. The csv consists of 10,000 records and includes the following text fields: crawl_timestamp, url, job_title, category, company_name, city, state, country, inferred_city, inferred_state, inferred_country, post_date, job_description, job_type, salary_offered, job_board, geo, cursor, contact_email, contact_phone_number, uniq_id, and html_job_desc.

This is the original csv with field names.

CSV Fieldnames

Initial import of the csv file.

# Using read_csv as it allows for on-the-fly decompression of compressed csv files
jobs_df <- read_csv("https://github.com/logicalschema/DATA607/raw/master/project3/data/data_scientist_united_states_job_postings_jobspikr.csv.gz")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   post_date = col_date(format = ""),
##   salary_offered = col_logical(),
##   cursor = col_double(),
##   contact_email = col_logical(),
##   html_job_description = col_logical()
## )
## See spec(...) for full column specifications.
# Look at the first group of imported rows
head(jobs_df)
jobs_desc <- tibble(id = jobs_df$uniq_id, 
                        desc = jobs_df$job_description)

jobs_desc %>% 
  select(desc) %>% 
  sample_n(5)
jobs_cat <- tibble(id = jobs_df$uniq_id, 
                        category = jobs_df$category)

jobs_cat <- jobs_cat %>% filter(jobs_cat$category != "") %>% print(n=10)
## # A tibble: 2,136 x 2
##    id                               category          
##    <chr>                            <chr>             
##  1 3b6c6acfcba6135a31c83bd7ea493b18 Accounting/Finance
##  2 1c8541cd2c2c924f9391c7d3f526f64e Accounting/Finance
##  3 445652a560a5441060857853cf267470 biotech           
##  4 9571ec617ba209fd9a4f842973a4e9c8 Accounting/Finance
##  5 0ec629c03f3e82651711f2626c23cadb Accounting/Finance
##  6 80d64b46bc7c89602f63daf06b9f1b4c Accounting/Finance
##  7 9cd3ed78e5cac9e516ea41173de2c25f Computer/Internet 
##  8 53224da901548e7137bbb163d456ba6a Computer/Internet 
##  9 010b5d671896d26eba50948c7c337f94 Computer/Internet 
## 10 4a8b875d1acc4f560716561d699aa022 Computer/Internet 
## # ... with 2,126 more rows

job_description is the most important field in the dataset for the data mining exercise. The field contains the complete write-up posted for the job listing. The category field provides some context for the job listing and will be used to capture important words per category.


Initial problems and solutions:

  1. The csv file is 48.5 MB and contains UTF-8 characters.

    The file was zipped up to reduce download time. When the file is imported into R, we will tidy the data.

  2. HTML code was left in the job_description field, evidence of the HTML scrape of the data source.

    Invalid characters will be removed.

  3. The initial load of the csv into R would take time depending upon the hardware of each group member’s system.

    Compressing the file and cutting it into smaller pieces reduced load time and the need for additional RAM.

  4. How would we convert to a normalized database?

    The csv was converted to a MySQL script file, with the header column names encapsulated in " marks. Workbench was used to create a new table on the MySQL database. Subsequently, through SQL SELECT statements, we normalized the data (a sketch follows this list).

    In addition, when the data was imported into the database, Sung ran Text Analytics on the job_description column to find key phrases. This information was used to create a new column called keyphrases in the normalized database.

  5. Group members’ machines did not have the same amount of RAM: Vanita’s laptop had 8 GB while Sung’s was running 16 GB.
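
Returning to point 4, the following is a minimal sketch of what that normalization step could look like, assuming the raw csv had first been loaded into a hypothetical staging table named raw_jobs; the credentials are placeholders and the real column selection was broader.

library(RMySQL)

# Placeholder connection; the actual connection details appear in the Excursus below
mydb <- dbConnect(MySQL(),
                  user = "data607@data607",
                  password = "<password>",
                  dbname = "project3data607",
                  host = "data607.mysql.database.azure.com")

# Pull the fields we analyze into a normalized job table
# (raw_jobs is a hypothetical name for the staging table built from the csv)
rs <- dbSendQuery(mydb, "
  CREATE TABLE job AS
  SELECT uniq_id, job_title, company_name, city, state, post_date, job_description
  FROM raw_jobs")
dbClearResult(rs)

# Derived information, such as the Azure key phrases, lives in a separate meta table
rs <- dbSendQuery(mydb, "
  CREATE TABLE meta (
    uniq_id    VARCHAR(64) PRIMARY KEY,
    keyphrases TEXT)")
dbClearResult(rs)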


How did the data look in Azure?

When the data was imported into Microsoft’s cloud service, we ran some initial tests on the data to look for patterns. Using Microsoft’s Text Analytics1 tools, the following is a word cloud2 of the key phrases discovered by Microsoft’s AI:

Word Cloud Courtesy of Azure

The key phrases were stored in the database. Key phrases were extracted from the job description of each job listed. Because our use of Microsoft’s service was limited by cost, we needed persistent storage for the results; Sung had used up the $200 credit for one of his subscriptions within a week of testing. In working with the Azure Text Analytics API, there is an option to send data via JSON to have it analyzed.
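
We ran the Text Analytics step through Power BI rather than calling the API directly, but for reference, a minimal sketch of a direct JSON call could look like the following. The endpoint path, API version, and environment-variable name are assumptions; substitute the values for your own resource.

library(httr)
library(jsonlite)

# Assumed endpoint and key; replace with your own Text Analytics resource URL and subscription key
endpoint <- "https://<your-resource>.cognitiveservices.azure.com/text/analytics/v3.0/keyPhrases"
api_key  <- Sys.getenv("AZURE_TEXT_ANALYTICS_KEY")

# The service expects a JSON payload of documents, each with an id, language, and text
payload <- list(documents = list(
  list(id = "1", language = "en", text = jobs_df$job_description[1])
))

response <- POST(
  endpoint,
  add_headers("Ocp-Apim-Subscription-Key" = api_key,
              "Content-Type" = "application/json"),
  body = toJSON(payload, auto_unbox = TRUE)
)

# Key phrases are returned per document
content(response, as = "parsed")$documents[[1]]$keyPhrases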


Excursus

This is a brief description of some of the steps taken to help alleviate the problems of importing the information for use in our project.

Infrastructure

Azure, MySQL, and Cognitive Text Analytics

Sung wanted to experiment with cloud computing, so he used his SPS email to create a trial account for Azure. He created a MySQL database instance to host the databases used. Each trial account is given a $200 credit.

MySQL database access

This connection is made public but is restricted by IP address. If you desire access, please email one of the Team members.

Server: data607.mysql.database.azure.com
Port: 3306
Username:
Password: student#2020

Default timeout settings for the database server had to be lengthened to enable longer processing.
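
For example, the session-level timeouts can be lengthened from R over an open connection (mydb is the connection object created with dbConnect() in the Tidying section below); the values here are illustrative, and on Azure Database for MySQL the server-wide defaults can also be changed through the portal’s server parameters.

# Lengthen the session timeouts; values are illustrative
rs <- dbSendQuery(mydb, "SET SESSION wait_timeout = 28800")
dbClearResult(rs)
rs <- dbSendQuery(mydb, "SET SESSION net_read_timeout = 600")
dbClearResult(rs)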


This is a link to the mysqldump of the database.

Diagram of Database

E-R Diagram

Tools

Microsoft’s Cognitive Text Analytics, Power BI, Slack, GitHub, and MySQL Workbench

In addition to the database instance he created, Sung created an instance of Azure’s Cognitive Text Analytics to see what information Microsoft’s AI could mine from our data. These tools were used to facilitate handling the data.

Power BI is a Microsoft product used to visualize data. It was employed with Microsoft’s Cognitive Text Analytics to extract key phrases from the job_description field of the data and to create a simple word cloud to compare with our later analysis.

Slack and GitHub were used to collaborate. Files were exchanged via Slack. Code was edited via GitHub.

MySQL Workbench was used to connect to the MySQL database.

Data Management

As an option to curtail the amount of time needed to process the 10,000-row csv, group members worked with a small subset of the file, tested code, and then moved to the larger data set. Some group members’ machines would complain about memory errors.

Files being worked on from GitHub were compressed to enable quicker transport across the network.

This is an example of a memory error on Sung’s laptop when trying to text mine on the job description data: Memory Error

To avert this problem, Sung modified his code to use the Corpus object instead of VCorpus for collections of text documents. A VCorpus is described as a volatile corpus that is “fully kept in memory”.3
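
As a rough illustration of the difference, the two corpus types can be built from the same character vector and their in-memory sizes compared; the repeated sentence below is just a stand-in for the job descriptions.

library(tm)

# Stand-in for the job_description column
docs <- rep("Senior data scientist with Python, R, and SQL experience.", 1000)

simple   <- Corpus(VectorSource(docs))   # SimpleCorpus: content stored as a plain character vector
volatile <- VCorpus(VectorSource(docs))  # VCorpus: every document fully kept in memory as a separate object

format(object.size(simple),   units = "KB")
format(object.size(volatile), units = "KB")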

In addition, wherever possible, snippets of the data objects were viewed and knitted in R Markdown. So instead of viewing 10,000 rows, we did our best to display 10-20 rows of a data set.

Tidying

Although we have imported our data into a database, we still need to clean up the job descriptions. We hope to tidy the data so that we can determine frequencies and see what picture the data paints for us.

We will now begin to query the database for the job descriptions and the key phrases that Azure extrapolated for us. We will store job descriptions in the variable descriptions and the key phrases in keyphrases.

#Form a connection to the MySQL database
mydb <- dbConnect(MySQL(), 
                  user='data607@data607', 
                  password='student#2020',
                  dbname='project3data607', 
                  host='data607.mysql.database.azure.com'
                  )

#Key phrases are stored in the meta table of the database
rs <- dbSendQuery(mydb, "SELECT keyphrases FROM meta")

#Remove argument for n = -1 if you don't want all the records in the db
dbRows <- dbFetch(rs, n = -1 )
#dbRows <- dbFetch(rs)
dbClearResult(rs)
## [1] TRUE
#Job descriptions are stored in the job table of the database
rs <- dbSendQuery(mydb, "SELECT job_description FROM job")

#Remove argument for n = -1 if you don't want all the records in the db
descriptions <- dbFetch(rs, n = -1)
#descriptions <- dbFetch(rs)
dbClearResult(rs)
## [1] TRUE
#This is a sample of the Key Phrases from Azure
kable(dbRows[1:5, ], 
      caption = "Job Description Key Phrases", 
      col.names = c("Key Phrases") 
      )
Job Description Key Phrases
Key Phrases
logistics-related data analysis, data integrity, historical data, data scientist, hard data, internal reporting, data science role, external reporting, analysis processes, ongoing reporting, transportation analysis projects, in-depth logistics analytics, predictive logistics pricing models, transportation bids, analytical tools, years of robust analytical experience, advanced degree, operations team, ability, theoretical client distribution models, carrier bids, in-depth investigations, s degree, computer science, potential opportunities, quantitative skills, similar quantitative field, client cost reduction, institutional analytics capabilities, methodologies relevant, process improvements, optimization, mathematics, engineering, statistics, customer, corporate finance, pandas, current tms software provider, growth, business problems, assumptions, things, design, vision, python, focus, validation, metrics, eye, financial impacts, best practices, recommendations, drive actions, benefits, vacation, proficiency, bachelorâ, example, automation, company, supply chain, location, norwell, mysql, candidate, documentation, position, time, desired background, aborn, compensation, salary
data science, data analyses, data scientists, significant experience, talented data scientist, big data tools, data mining problems, significant programming experience, machine learning software libraries, large amounts of complex data, advanced data analytics division, challenging machine learning, related libraries, related algorithms, deep learning algorithms, s degree, reston, extensible software, innovative designs, employment individuals, results of innovative solutions, software developers, related field, qualified applicants, s entire process, computer science, algorithm design, protected veterans, off-the-shelf solutions, people, d projects, sexual orientation, gender identity, classification, national origin, religion, scalable anomaly detection, concept, recommendation, minorities, performance metrics, women, java, clientsâ, mathematics, knowledge of best practices, disability, color, policy, freedom, apache spark, statistics, python, mathematicians, masterâ, tensorflow, scikit, eligibility requirements, affirmative action, support, classified information, access, projectâ, nlp, natural language processing, torch, oral reports, basis of race, ideal candidate, metron, mahout, implementation, demonstration, products, apis, disabilities, position location, consideration, security investigation, veteran status, vevraa federal contractor, project teams, management structure, deeplearning4j, us citizenship, bachelorâ, job description, desired qualifications, person, weka
data science, data analyses, data scientists, significant experience, talented data scientist, big data tools, data mining problems, significant programming experience, machine learning software libraries, large amounts of complex data, advanced data analytics division, challenging machine learning, related libraries, related algorithms, deep learning algorithms, s degree, reston, extensible software, innovative designs, employment individuals, results of innovative solutions, software developers, related field, qualified applicants, s entire process, computer science, algorithm design, protected veterans, off-the-shelf solutions, people, d projects, sexual orientation, gender identity, classification, national origin, religion, scalable anomaly detection, concept, recommendation, minorities, performance metrics, women, java, clientsâ, mathematics, knowledge of best practices, disability, color, policy, freedom, apache spark, statistics, python, mathematicians, masterâ, tensorflow, scikit, eligibility requirements, affirmative action, support, classified information, access, projectâ, nlp, natural language processing, torch, oral reports, basis of race, ideal candidate, metron, mahout, implementation, demonstration, products, apis, disabilities, position location, consideration, security investigation, veteran status, vevraa federal contractor, project teams, management structure, deeplearning4j, us citizenship, bachelorâ, job description, desired qualifications, person, weka
data visualizations, interpreting data, data wrangling, data capture firm, rigorous data capture, unstructured data sets, ssrs data scientist, reports of data metrics, experienced data scientist, data visualization tools, leading data-driven survey research, growing team of skilled data capture professionals, research outcomes, machine learning methods, cluster analysis, analysis of variance, s degree, bagging methods, simulation methods, appropriate methods, modeling process, years of related work experience, time series analysis, business case analysis, year degree, quantitative modeling, successful modeling, non-partisan research areas, challenging research needs, trends, python, hands, advanced mathematical techniques, public policy, public opinion, logistics regression, problem identification, statistical analyses, random forest, neural networks, correlation, decision trees, health care, feature engineering, clientsâ, bootstrap, matplotlib, nosql dbs, prescriptive analytic solutions, technical bachelorâ, oracle, algorithms, teradata, monte carlo, information technology field of study, graphical representation, insight generation, ggplot, models, databases, exposure, sas, tableau, masterâ, outliers, srss, ability, dashboards, key responsibilities, qualifications
data visualizations, interpreting data, data wrangling, data capture firm, rigorous data capture, unstructured data sets, ssrs data scientist, reports of data metrics, experienced data scientist, data visualization tools, leading data-driven survey research, growing team of skilled data capture professionals, research outcomes, machine learning methods, cluster analysis, analysis of variance, s degree, bagging methods, simulation methods, appropriate methods, modeling process, years of related work experience, time series analysis, business case analysis, year degree, quantitative modeling, successful modeling, non-partisan research areas, challenging research needs, trends, python, hands, advanced mathematical techniques, public policy, public opinion, logistics regression, problem identification, statistical analyses, random forest, neural networks, correlation, decision trees, health care, feature engineering, clientsâ, bootstrap, matplotlib, nosql dbs, prescriptive analytic solutions, technical bachelorâ, oracle, algorithms, teradata, monte carlo, information technology field of study, graphical representation, insight generation, ggplot, models, databases, exposure, sas, tableau, masterâ, outliers, srss, ability, dashboards, key responsibilities, qualifications


Azure Key Phrases

Let’s get the unique key phrases that Azure found.

#Grab the unique keyphrases after some clean up: 
keyphrases <- unlist(str_split(dbRows, ","))
keyphrases <- str_replace_all(keyphrases, "^ ", "")
keyphrases <- str_replace_all(keyphrases, "[^\x20-\x7E]", "")
keyphrases <- str_replace_all(keyphrases, "\"", "")

# Remove the "c(" in the first element
keyphrases[[1]] <- keyphrases[[1]] %>% 
  str_replace("c[(]", "")

keyphrases <- as.data.frame(keyphrases)

#Some of the key phrases
keyphrases
#Frequency chart of the key phrases using the summary tools package
freq(keyphrases$keyphrases, order = "freq", style = "rmarkdown", rows = 1:20)
## ### Frequencies  
## #### keyphrases$keyphrases  
## **Type:** Factor  
## 
## |                  &nbsp; |    Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
## |------------------------:|--------:|--------:|-------------:|--------:|-------------:|
## |              **python** |    6471 |    0.63 |         0.63 |    0.63 |         0.63 |
## |    **computer science** |    4441 |    0.44 |         1.07 |    0.44 |         1.07 |
## |          **statistics** |    3910 |    0.38 |         1.45 |    0.38 |         1.45 |
## |            **religion** |    2746 |    0.27 |         1.72 |    0.27 |         1.72 |
## |               **color** |    2675 |    0.26 |         1.99 |    0.26 |         1.99 |
## |                 **sql** |    2672 |    0.26 |         2.25 |    0.26 |         2.25 |
## | **years of experience** |    2672 |    0.26 |         2.51 |    0.26 |         2.51 |
## |     **national origin** |    2649 |    0.26 |         2.77 |    0.26 |         2.77 |
## |         **mathematics** |    2587 |    0.25 |         3.02 |    0.25 |         3.02 |
## |    **responsibilities** |    2483 |    0.24 |         3.27 |    0.24 |         3.27 |
## |                **race** |    2476 |    0.24 |         3.51 |    0.24 |         3.51 |
## |  **sexual orientation** |    2409 |    0.24 |         3.75 |    0.24 |         3.75 |
## |      **data scientist** |    2365 |    0.23 |         3.98 |    0.23 |         3.98 |
## |      **veteran status** |    2321 |    0.23 |         4.21 |    0.23 |         4.21 |
## |     **gender identity** |    2236 |    0.22 |         4.43 |    0.22 |         4.43 |
## |              **people** |    2123 |    0.21 |         4.63 |    0.21 |         4.63 |
## |          **disability** |    2113 |    0.21 |         4.84 |    0.21 |         4.84 |
## |               **spark** |    2064 |    0.20 |         5.04 |    0.20 |         5.04 |
## |            **position** |    2051 |    0.20 |         5.24 |    0.20 |         5.24 |
## |              **hadoop** |    1969 |    0.19 |         5.44 |    0.19 |         5.44 |
## |             **(Other)** |  963928 |   94.56 |       100.00 |   94.56 |       100.00 |
## |              **\<NA\>** |       0 |         |              |    0.00 |       100.00 |
## |               **Total** | 1019361 |  100.00 |       100.00 |  100.00 |       100.00 |

Briefly looking at the frequencies found in the key phrases, some skills started to appear, such as Python (count: 6471), Computer Science (4441), and SQL (2672). However, some problems came up: the boilerplate about diversity and equal opportunity policies in many job descriptions was surfacing, as evidenced by the frequency of phrases such as Religion (2746), Color (2675), and National Origin (2649).


Tidy the job descriptions

Before we continue, we have to tidy the job descriptions. We will remove non-ASCII characters, unnecessary characters, punctuation, and stop words; convert the text to lowercase; and strip extra whitespace. Stop words are common words such as the, is, at, which, etc. This will be handy in our analysis.

#Remove non-ASCII characters
descriptions <-  gsub("[^\x20-\x7E]", " ", descriptions)


#Convert our variable to a Corpus variable for the package tm to work on.
#A corpus is a collection of documents
descriptions <- Corpus(VectorSource(descriptions))

#Create a function to make patterns into spaces
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

descriptions <- tm_map(descriptions, toSpace, "/")
descriptions <- tm_map(descriptions, toSpace, "\n")
descriptions <- tm_map(descriptions, toSpace, "@")
descriptions <- tm_map(descriptions, toSpace, "\\|")

#Convert the text to lowercase
descriptions <- tm_map(descriptions, content_transformer(tolower))

#Remove English stop words
descriptions <- tm_map(descriptions, removeWords, stopwords("english"))

#Remove punctuation , ., etc.
descriptions <- tm_map(descriptions, removePunctuation)

#Remove whitespace
descriptions <- tm_map(descriptions, stripWhitespace)

Text Mining

After we have imported and tidied our data, we will perform tests to mine it.


Process Input Data

We follow the NASA case study example presented at https://www.tidytextmining.com/nasa.html for term frequency and topic modeling.

From job_description, tokenize all the words and remove stop words (common words in the English language) to allow us to focus on the meaningful words of the job listing.

# Use tidytext’s unnest_tokens() for the description field so we can do the text analysis.
# unnest_tokens() will tokenize all the words in the description field and create a tidy dataframe of the word by identifer

jobs_desc <- jobs_desc %>% 
  unnest_tokens(word, desc) %>% 
  anti_join(stop_words)
## Joining, by = "word"
jobs_desc

Provide a count, in table form, of the most common words in the job descriptions.

# Most common words in the description field
jobs_desc %>%
  count(word, sort = TRUE) %>% print(n=10)
## # A tibble: 18,517 x 2
##    word           n
##    <chr>      <int>
##  1 data       31593
##  2 experience 14064
##  3 business    7720
##  4 learning    6339
##  5 science     6032
##  6 analytics   5561
##  7 analysis    4913
##  8 skills      4910
##  9 machine     4833
## 10 team        4821
## # ... with 1.851e+04 more rows

Applying lowercase to all the words to ensure different cases of the same word aren’t considered different.

# lowercase all the words just to make sure there's no redundancy
jobs_desc <- jobs_desc %>% 
  mutate(word = tolower(word))

Term Frequency

The term frequency times inverse document frequency (TF-IDF) is used to identify words that are especially important to a document within a collection of documents. The results are the most important words in the description fields as measured by TF-IDF, meaning the words are common but not too common.
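
For reference, for a term $t$ in a job description $d$, with $N$ descriptions in total, the quantity being computed is the product of the two factors below (bind_tf_idf() uses the natural logarithm):

$$
\text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\frac{N}{\#\{d' : t \in d'\}}
$$

where $n_{t,d}$ is the number of times term $t$ appears in description $d$, and the denominator of the logarithm counts the descriptions containing $t$.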

  1. Calculate the TF-IDF
# Calculating tf-idf for the description fields

desc_tf_idf <- jobs_desc %>% 
  count(id, word, sort = TRUE) %>%
  ungroup() %>%
  bind_tf_idf(word, id, n)

desc_tf_idf %>% filter(n >= 10) %>%
  arrange(-tf_idf)
  2. Combine the data frame of the TF-IDF of the job descriptions with the job categories.

The join is performed on the unique ID as key. Joining with the categories will identify the most important words from the job descriptions per job category.

# Join with the category
desc_tf_idf <- full_join(desc_tf_idf, jobs_cat, by = "id")

desc_tf_idf %>% 
  filter(!near(tf, 1)) %>%
  filter(category %in% jobs_cat$category) %>%
  arrange(desc(tf_idf)) %>%
  group_by(category) %>%
  distinct(word, category, .keep_all = TRUE) %>%
  top_n(8, tf_idf) %>% 
  ungroup() %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  ggplot(aes(word, tf_idf, fill = category)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~category, ncol = 3, scales = "free") +
  coord_flip() +
  labs(title = "Highest tf-idf words in job listing description fields",
       caption = "From jobpickr dataset",
       x = NULL, y = "tf-idf")

The resulting plot did not prove useful for identifying skills across all the job listings for data scientists. The plot does indicate which words are more common within each specific job category. The results suggest that job listings in a category are often posted by the same company or recruiter, and thus the same boilerplate description is used across many listings.

Topic Modeling

In order to perform topic modeling, a document-term matrix is required.

  1. Calculate the word count by document ID. Each job description is considered a unique document by the job listing’s unique ID.
# 8.4 Topic Modeling

# Casting to a document-term matrix
word_counts <- jobs_desc %>%
  count(id, word, sort = TRUE) %>%
  ungroup()

word_counts %>% print(n=10)
## # A tibble: 483,690 x 3
##    id                               word      n
##    <chr>                            <chr> <int>
##  1 f563735821e0d15153c487edc700db30 data     88
##  2 f9c3f51cbdf699189e3175c66310934b data     72
##  3 25992aeea0ec81a362eb8d5c8a945bac data     66
##  4 04f4b7504d7cb71baf8a22c5a6dafb4e data     64
##  5 25f126b32f26022e8e1d3fbfe85b1802 data     64
##  6 d869251bbe6e76b971246ce06642a44d data     62
##  7 626eb8b274e2369aabdcc153e4e1f32c data     60
##  8 3d8f0398cc54891ea8882148c5ccbc94 data     58
##  9 049784c976f14a08fd6ad657112ee142 data     56
## 10 97e787bb231a3947d4535d7caccc8a2b data     54
## # ... with 4.837e+05 more rows
  2. Construct the document-term matrix.

The resulting document-term matrix shows a high level of sparsity. The non-zero entries correspond to a particular word appearing in a particular document.

# Construct DTM
desc_dtm <- word_counts %>%
  cast_dtm(id, word, n)

desc_dtm
## <<DocumentTermMatrix (documents: 2497, terms: 18517)>>
## Non-/sparse entries: 483690/45753259
## Sparsity           : 99%
## Maximal term length: 72
## Weighting          : term frequency (tf)
  3. Calculate the LDA

According to Wikipedia, “In natural language processing, the latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.”

# Running this model is time intensive
# Define there to be 16 topics.
desc_lda <- LDA(desc_dtm, k = 16, control = list(seed = 1234))
desc_lda
## A LDA_VEM topic model with 16 topics.
  4. Tidy the resulting LDA topics.
# Interpreting the data model
tidy_lda <- tidy(desc_lda)

tidy_lda
  5. Identify the top 10 terms for each topic.
# Top 10 Terms
top_terms <- tidy_lda %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
  6. Plot the top 10 terms for each topic.

Even though the topics are anonymous, only identified by number, the groupings of words show some similarities and differences, but do not necessarily provide much value at this point.

The topic modeling process has identified groupings of terms that we can understand as human readers of these description fields.

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Top 10 terms in each LDA topic",
       x = NULL, y = expression(beta)) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

  7. Calculate gamma

Gamma defines the probability that each document belongs to each topic.

# LDA gamma
lda_gamma <- tidy(desc_lda, matrix = "gamma")

lda_gamma
  8. Identify the categories associated with each topic
lda_gamma <- full_join(lda_gamma, jobs_cat, by = c("document" = "id"))

lda_gamma
top_cats <- lda_gamma %>% 
  filter(gamma > 0.5) %>% 
  count(topic, category, sort = TRUE)

top_cats <- top_cats %>% filter(!is.na(category))

Topic 9 identifies ‘business and financial operations’ as the top category and is the only topic to include the term ‘aws’. Topic 4, most identified with the category ‘Arts/Entertainment/Publishing’, contains the terms ‘experience’ and ‘content’, which align broadly with the category.

# One more graph from 8.4.4
top_cats %>%
  group_by(topic) %>%
  top_n(5, n) %>%
  ungroup %>%
  mutate(category = reorder_within(category, n, topic)) %>%
  ggplot(aes(category, n, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  labs(title = "Top categories for each LDA topic",
       x = NULL, y = "Number of documents") +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~ topic, ncol = 2, scales = "free")


Lexical Dispersion

Following the example at https://www.r-bloggers.com/advancing-text-mining-with-r-and-quanteda/ and https://quanteda.io/articles/pkgdown/examples/plotting.html

  1. Create a corpus based on the unique ID and the job description.
# Generate a corpus

uniq_jobs_df <- jobs_df %>% distinct(uniq_id, .keep_all = TRUE)

my_corpus <- corpus(uniq_jobs_df, docid_field = "uniq_id", text_field = "job_description")

mycorpus_stats <- summary(my_corpus)
  2. Preprocess the text.

Remove numbers, remove punctuation, remove symbols, remove URLs, split hyphens. Clean for OCR.

# Preprocess the text

# Create tokens
token <-
  tokens(
    my_corpus,
    remove_numbers  = TRUE,
    remove_punct    = TRUE,
    remove_symbols  = TRUE,
    remove_url      = TRUE,
    split_hyphens   = TRUE
  )

# Clean tokens created by OCR
token_ungd <- tokens_select(
  token,
  c("[\\d-]", "[[:punct:]]", "^.{1,2}$"),
  selection = "remove",
  valuetype = "regex",
  verbose = TRUE
)
## removed 2,961 features
  3. Create a Document-Feature Matrix

Using the quanteda library, create the document-feature matrix and filter out words that appear in fewer than 7.5% or more than 90% of documents.

# Data frequency matrix
my_dfm <- dfm(token_ungd,
              tolower = TRUE,
              stem = TRUE,
              remove = stopwords("english")
              )

my_dfm_trim <-
  dfm_trim(
    my_dfm,
    min_docfreq = 0.075,
    # min 7.5%
    max_docfreq = 0.90,
    # max 90%
    docfreq_type = "prop"
  )

head(dfm_sort(my_dfm_trim, decreasing = TRUE, margin = "both"),
     n = 10,
     nf = 10)
## Document-feature matrix of: 10 documents, 10 features (4.0% sparse) and 20 docvars.
##                                   features
## docs                               analyt busi develop model learn team scienc
##   9824c201fbaf4d211ef817cdac428156     14    8       2    10     2   12     10
##   f563735821e0d15153c487edc700db30     22   27      18     6     8    8     22
##   0f4fc982e72ca26aaad1f66637019652     10    6      10     6     6   12     12
##   9f2a8aee7e7a102720cfb3dbc39c6c24      2    0      28     4    18   18     10
##   6737d173f0bdf3d4d8fd3543518e40c7      2    0      30     4    18   18     10
##   3fcf1d237cf41524a247f9b940393b95     40   22      18    18     4   18      0
##                                   features
## docs                               statist use scientist
##   9824c201fbaf4d211ef817cdac428156      14   8        10
##   f563735821e0d15153c487edc700db30      12  24         6
##   0f4fc982e72ca26aaad1f66637019652      10  10         4
##   9f2a8aee7e7a102720cfb3dbc39c6c24      10   4        12
##   6737d173f0bdf3d4d8fd3543518e40c7      10   4         8
##   3fcf1d237cf41524a247f9b940393b95      12   6         6
## [ reached max_ndoc ... 4 more documents ]
  4. Plot lexical dispersion

The plot shows the occurrences of the terms ‘python’ and ‘r’ across all documents for the state of Oregon (OR). The state was chosen to give a natural subset of all the documents initially included.

The lexical dispersion appears to indicate that the terms ‘python’ and ‘r’ often occur together in the documents (job descriptions), suggesting the listings mention the two programming languages in or near the same sentence.

my_corpus_sub <- corpus_subset(my_corpus, state == "OR")

theme_set(theme_bw())

g <- textplot_xray(
     kwic(my_corpus_sub, pattern = "python"),
     kwic(my_corpus_sub, pattern = "r")
)

g + aes(color = keyword) + 
    scale_color_manual(values = c("blue", "red")) +
    theme(legend.position = "none")


Bigrams

Following the tutorial at https://www.tidytextmining.com/, the bigrams identify the word pairs that occur most frequently.

  1. Identify bigrams (n = 2).

This exercise finds bigrams (pairs of words). The function also allows for n-grams of more than two words.

jobs_bigrams <- jobs_df %>%
  unnest_tokens(bigram, job_description, token = "ngrams", n = 2)

jobs_bigrams %>%
  count(bigram, sort = TRUE)
bigrams_separated <- jobs_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

# This result is valuable
bigram_counts %>% print(n = 20)
## # A tibble: 110,235 x 3
##    word1         word2           n
##    <chr>         <chr>       <int>
##  1 machine       learning     4721
##  2 data          scientist    4271
##  3 data          science      3998
##  4 computer      science      1415
##  5 data          sets         1272
##  6 data          analysis     1208
##  7 data          mining       1150
##  8 equal         opportunity  1129
##  9 national      origin        926
## 10 data          scientists    875
## 11 veteran       status        858
## 12 communication skills        851
## 13 sexual        orientation   842
## 14 data          analytics     838
## 15 opportunity   employer      829
## 16 data          sources       817
## 17 gender        identity      752
## 18 race          color         737
## 19 color         religion      730
## 20 related       field         713
## # ... with 1.102e+05 more rows
  2. Filter the bigrams

Include only bigrams that occurred more than 1,250 times in order to filter out visual noise.

# filter for only relatively common combinations
bigram_graph <- bigram_counts %>%
  filter(n > 1250) %>%
  graph_from_data_frame()

bigram_graph
## IGRAPH ad3c94f DN-- 7 5 -- 
## + attr: name (v/c), n (e/n)
## + edges from ad3c94f (vertex names):
## [1] machine ->learning  data    ->scientist data    ->science  
## [4] computer->science   data    ->sets
  3. Visualize the bigrams in a network plot
set.seed(2020)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

  4. Visualize the bigrams with a directional plot
set.seed(2021)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()


Frequency Plot

Simple bar plot of the most common words across all the job listings’ descriptions.

jobs_words <- jobs_df %>%
  unnest_tokens(word, job_description) %>%
  anti_join(stop_words) %>%
  count(uniq_id, word, sort = TRUE)
## Joining, by = "word"
total_words <- jobs_words %>% 
  group_by(uniq_id) %>% 
  summarize(total = sum(n))

jobs_words <- left_join(jobs_words, total_words)
## Joining, by = "uniq_id"
jobs_words <- jobs_words %>%
    anti_join(stop_words)
## Joining, by = "word"
jobs_words %>%
  count(word, sort = TRUE)
jobs_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 2500) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()


Words by Frequency

#Create a term document matrix based upon the variable descriptions
tdm <- TermDocumentMatrix(descriptions)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)

#Creates a frequency table for words
freqTable <- data.frame(word = names(v), freq=v)


Here is a frequency table and barplot for words found in job descriptions:

freqTable
#Bar plot of the 20 most common words
barplot(v[1:20], col = "#003366", las = 2)


Word Cloud


A Word Cloud Based Upon the Job Descriptions4

set.seed(1234)
wordcloud(words = freqTable$word, freq = freqTable$freq, min.freq = 100,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))


Analysis

After mining the job descriptions of our data set, the team met again over Zoom to discuss the results of our Pollock method for looking at the data.

Observations:

  1. The Equal Employment Opportunity (EEO) information for jobs was impacting the frequencies.

    An example of an EEO is below:

    CUNY encourages people with disabilities, minorities, veterans and women to apply. At CUNY, Italian Americans are also included among our protected groups. Applicants and employees will not be discriminated against on the basis of any legally protected category, including sexual orientation or gender identity. EEO/AA/Vet/Disability Employer.5

    Words such as religion, national origin, and diversity would appear in our frequencies. We would have to exclude these. We would not have been able to make this decision had we not regrouped.

  2. Loss of context

    A word or phrase can be taken out of context. For example, the word experience sounds good by itself, but when it is counted it could have come from a sentence asking for 3-5 years of experience or stating that no experience is needed (a keyword-in-context sketch follows this list). In addition, we did not examine the role or title of the job description. A senior data scientist would require a different gamut of skills from an entry-level data scientist, and one job title might require a PhD as opposed to a Bachelor’s degree.

  3. Hard Skills

    After going through the analysis, these were the top 5 hard skills we determined by looking at the data together. Note: hard skills are “teachable abilities or skill sets that are easy to quantify.”6

    • Programming Skills: Python, R, SQL
    • Platforms: Tableau, Apache Spark, Hadoop, AWS
    • Computer Science
    • Statistics, Mathematics, Algorithms
    • Experience and Education
  4. Soft Skills

    The following were the top 5 soft skills we determined by looking at the data together. Note: soft skills are “subjective skills that are much harder to quantify.”7

    • Communication Skills: presentation skills
    • Interpersonal Skills: i.e. EQ Emotional Intelligence
    • Team Player
    • Analytical
    • Motivated, driven
  5. Debugging?

    Given the number of days we spent debugging and troubleshooting, the team noticed that the skill of debugging was absent from the listings. Why was troubleshooting seemingly absent from the job descriptions?
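
As a rough sketch of how context could be recovered for an ambiguous term (point 2 above), quanteda’s keyword-in-context function can be run against the corpus built in the Lexical Dispersion section; the window size here is arbitrary.

# Show six words on either side of every occurrence of "experience"
kwic(my_corpus, pattern = "experience", window = 6) %>% head(10)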


Conclusions

The above skills are somewhat subjective, as the team was interpreting the data before us. The data and analysis did not provide any earth-shattering “Aha” moments. We did note that the data we obtained is for job postings and represents only a slice of the skills needed to be a data scientist; our analysis is not a holistic one.

Every company looks for the ideal candidate, and the job posting is only one step in any recruitment process. For every job, candidates have to interview and sometimes navigate a barrage of interactions with managers and headhunters.

For future steps, it would be necessary to include headhunters and other recruiters to provide data. In addition, a breakdown of the skills by job title might produce additional insight. For the Azure Text Analytics work, one could remove the EEO policies from the job descriptions and employ the rjson package to encapsulate the job descriptions as JSON objects to be analyzed.
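
As a closing sketch of that future step, common EEO phrases could be stripped before the descriptions are packaged as JSON; the phrase list below is a small, hand-picked assumption (a fuller list would be curated), and rjson stands in for the package mentioned above.

library(stringr)
library(rjson)

# Hand-picked EEO boilerplate phrases; an assumption for illustration only
eeo_pattern <- regex(paste(
  "equal opportunity employer", "race, color, religion", "national origin",
  "veteran status", "gender identity", "sexual orientation", sep = "|"),
  ignore_case = TRUE)

# Remove the matched phrases (a fuller approach would drop the entire EEO sentence)
cleaned <- jobs_df %>%
  mutate(job_description = str_remove_all(job_description, eeo_pattern))

# Encapsulate a handful of cleaned descriptions as a JSON payload for the Text Analytics API
payload <- toJSON(list(documents = lapply(1:5, function(i) {
  list(id = as.character(i), language = "en", text = cleaned$job_description[i])
})))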