1. Introduction

a. Overview

This project is focused on extending the work completed in Project 3, where we set out to identify the top skills required for a Data Scientist.

In that project, we gathered more than 100 postings from several job boards and extracted the requirements listed in each posting to generate a corpus of the most common Data Science terms.

For this project, I shifted my focus to building a classification model that takes an entry from a Data Science job posting and classifies it by the category of skill it describes. I see this as one part of a larger project that will let a user determine whether they are a fit for a job, based on their own skills and qualifications and the skills listed in the job posting.

The two primary components of this project are:

  1. Build an unsupervised model that can take the unlabeled job postings and identify common groups amongst the data based on the skills listed in the posting;

  2. Use the identified clusters to label the data, then use the labeled data to build a classifier that takes in new data and classifies the job skill requirement.

b. Objective

The research question for this project was: “Can I build a classifier that predicts the type of job skill in a new job posting with > 80% accuracy?”

c. Project Motivation

There were several key motivations that I had for this project:

  1. To continue building my Machine Learning skills. I want to become more adept at working with different supervised and unsupervised ML models.
  2. To build upon previous work and extend the functionality of the data collected in Project 3.
  3. I envision this work being extended further by adding a front-end interface that lets me or another individual feed in a Data Science job posting and match their own skills against the skills requested by the company. This will ultimately let the individual determine whether they are a good match for the position.
  4. Finally, I want to be able to apply these same concepts and principles to different types of problems.

d. Methodology

Finally, the core methodology for this project was as follows:

  1. Import corpus of job requirements created in Project 3
  2. Pre-process the data in order to generate a list of core words found in the data
  3. Create a features array of the commonly used words across the different job postings
  4. Use this features array in a KMeans model to identify groupings amongst the job skills requirements
  5. Examine these clusters and label the data based on these clusters
  6. Using the newly labeled data, build a classifier that is trained on 70% of the data
  7. Evaluate the performance of the classifier on the test data

2. Setup

For this project, I used the following libraries:

  1. tidyverse
  2. janitor
  3. tidytext
  4. stopwords
  5. tm
  6. class
  7. kableExtra
  8. groupdata2
  9. wordcloud

The class library was new to me; it provides the knn function that was used to build the classifier. I also used the groupdata2 library for the first time, to upsample the features array and compensate for unbalanced data clusters.

a. Import libraries

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2      ✔ purrr   1.0.1 
## ✔ tibble  3.2.1      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.4      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(tidytext)
library(stopwords)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## 
## Attaching package: 'tm'
## 
## The following object is masked from 'package:stopwords':
## 
##     stopwords
library(class)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(wordcloud)
## Loading required package: RColorBrewer
library(groupdata2)

3. Develop List of words

In this step, we begin by importing the job skills data that we collected in Project 3, and then conducting a number of pre-processing steps in order to build our word features array that we will use in our clustering algorithm.

This process is divided into the following steps:

  1. Import data science job data collected in Project 3
  2. Break each job posting into discrete sets of job skills requirements using a number of pre-processing steps
  3. Import standard English stopwords and build custom stop words based on domain knowledge and common job posting words that don’t add value to the data
  4. Create a modified version of the job skills info that removes stop words and special characters, and stems the remaining words
  5. Create a unique corpus of terms based on n-grams of 1-3 words
  6. Create a feature vector for the terms in the distinct words list

a. Import Job Postings

We begin by importing the job postings data collected and processed in Project 3

path = './Job Postings Data.csv'

job_postings_raw <- read_csv(path)
## Rows: 94 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Job Board, Post URL, Company Name, Industry, Location, Position/Ti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
job_postings_raw <- clean_names(job_postings_raw)

b. Create individual job requirements documents

In some instances, a single requirements entry in the underlying data contains several independent sentences, each describing different skills. In this step, we split the data further so that each independent skill requirement sits on its own line.

job_postings_simple <- job_postings_raw %>%
  select(job_board, position_title, skills)

cols = c("job_board", "position_title", "skill")
job_skills_df = data.frame(matrix(nrow=0, ncol=length(cols)))
colnames(job_skills_df) = cols

row_index = 1

job_postings_simple <- 
  job_postings_simple %>% 
  drop_na()

# Split each posting's skills text into one row per requirement
for(i in seq_len(nrow(job_postings_simple))) {
  job_board <- job_postings_simple[i,]$job_board
  position_title <- job_postings_simple[i,]$position_title
  skills <- job_postings_simple[i,]$skills
  # Break on double spaces, newlines, semicolons, periods, and commas
  skills_list <- str_split_1(skills,regex("\\s\\s|\\n|;|\\.|\\,"))
  
  # seq_along() is safe even if a split returns no pieces
  for(j in seq_along(skills_list)) {
    
    job_skills_df[row_index,]$job_board <- job_board
    job_skills_df[row_index,]$position_title <- position_title
    job_skills_df[row_index,]$skill <- skills_list[j]
      
    row_index <- row_index + 1  
    }
  
}

job_skills_df<- tibble(job_skills_df) %>%
  filter(skill != "") %>%
  filter(nchar(skill) > 1) %>%
  mutate(skill = str_remove_all(skill, regex("\\d")))

job_skills_df <- job_skills_df %>% 
  mutate(doc_num = row_number()) %>%
  relocate(doc_num, .before=job_board) 
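The nested loop above grows the data frame one row at a time. As a rough sketch of the same idea without the loop, the splitting can be done for all postings at once in base R. The `postings` data frame and its contents are made-up stand-ins for `job_postings_simple`, and the `trimws()` call is an extra step that the original code leaves to the later `str_squish()`.

```r
# Toy postings standing in for job_postings_simple (hypothetical data)
postings <- data.frame(
  job_board = c("BoardA", "BoardB"),
  position_title = c("Data Scientist", "ML Engineer"),
  skills = c("Python, SQL; Tableau", "Spark.\nDeep learning"),
  stringsAsFactors = FALSE
)

# Split every skills string on the same delimiters used in the loop
pieces <- strsplit(postings$skills, "\\s\\s|\\n|;|\\.|,", perl = TRUE)

# Repeat each posting's metadata once per extracted piece
split_df <- data.frame(
  job_board = rep(postings$job_board, lengths(pieces)),
  position_title = rep(postings$position_title, lengths(pieces)),
  skill = trimws(unlist(pieces)),
  stringsAsFactors = FALSE
)

# Drop empty or one-character fragments, then number the documents
split_df <- split_df[nchar(split_df$skill) > 1, ]
split_df$doc_num <- seq_len(nrow(split_df))

print(split_df)
```

Each row of `split_df` is one skill requirement with its posting's job board and title attached, which is the same shape as `job_skills_df`.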

c. Import and Create stopwords list

To keep words that are not essential to our models out of the data, we imported a standard English stopwords list. Additionally, I conducted a manual review of the data and used it to create a stop words list of stemmed words. Finally, we created a list of stop words related to Data Science job skills and general job postings, for use when creating the labels for our different groups.

english_stopwords <- stopwords("en")

job_skills_stopwords = sort(unique(c("abil", "coursework", "spanish", "approach","perspect",
    "strong","year","work","prior","experi","must", "proven","prefer","understand", "etc", "e",
    "advanc", "requir", "field", "techniques", "well","flexibl", "knowledg", "realworld", "employe",
    "fluent","english", "respect","detail","establi","experience", "experi","use", "strong",
    "g", "travel", "qualiti","maintain","well", "preferred", "hard", "ask", "champion", 
    "prefer", "quantit", "also", "degree", "learn", "understand", "master", "environ", "general",
    "skill", "requir", "hands", "field", "techniqu", "includ","tool","use", "think","learn","will",
    "manag", "solid", "skill","profici","o","understand","arena","thrive","familiar","focus",
    "willing","work","prioriti", "experti","profici","thrive","enabl","either","can","prior",
    "relev","study","requir", "minimum","skill","role","similar", "develop", "especi","willing",
    "signific","perspect","work","flexibl","d","demand","techniques","answering","process","ori",
    "good","abl","effici","build", "expert","level","interact","bachelor","degre","long",
    "shortterm", "task","succeed","fastpac","deliver","plus", "department","exercis","experienc",
    "partner","expertis","thought","leader","phd","new","two","team","major","practic","teamwork",
    "need","offici","obtain","team","small","thing","years","masters")))


job_skills_categories_stopwords = sort(unique(c("ability", "experience", "skills", "knowledge", "working",
                          "related", "using", "understanding", "including", "advanced", "relevant", 
                          "proficiency", "excellent", "building", "processes","able","effectively",
                          "master's", "demonstrated", "non", "concepts","expertise", "familiarity",
                          "results", "within","applying","others","data", "years", "including","get",
                          "willingness","machine", "statistical", "arrive","extensive","data science",
                          "science", "least","languages","large", "open", "work", "techniques", "large",
                          "solutions","vision","large scale","questions","convert","management",
                          "focus", "technologies","high","sets","coursework","materials", "one","complex",
                          "methods","following","production")))

d. Preprocessing terms

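For each document of text, I standardized the data and stored the text in a new column called skill_mod. The core steps conducted to generate this column of data, in the order the code applies them, are:

  1. Convert text to lower-case
  2. Remove standard English stop words
  3. Stem the text
  4. Remove the custom (stemmed) job skills stop words
  5. Remove digits and special characters, and collapse extra whitespace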
job_skills_clean <- job_skills_df %>%
  mutate(skill_mod = tolower(skill), 
         skill_mod = removeWords(skill_mod, english_stopwords),
         skill_mod = stemDocument(skill_mod, language='english'),
         skill_mod = removeWords(skill_mod, job_skills_stopwords),
         skill_mod = str_replace_all(skill_mod, regex("\\d"), "") ,
         skill_mod = str_replace_all(skill_mod, regex("[[:punct:]]"), ""),
         skill_mod = str_replace_all(skill_mod, regex("(^\\b|^\\W)"), ""),
         skill_mod = str_squish(skill_mod),
         num_words = str_count(skill_mod, "\\w+"))


job_skills_clean <- job_skills_clean %>% 
  filter(num_words != 0) %>%
  mutate(doc_num = row_number())
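To make the effect of these cleaning steps concrete, here is a minimal base R sketch applied to a single string. It is an approximation under stated assumptions: `clean_skill` and `toy_stopwords` are hypothetical names, the stop word list is a tiny made-up one, and stemming is left out because base R has no stemmer (the report uses `stemDocument` from tm for that).

```r
# Hypothetical helper: lower-case, drop stop words, strip digits and
# punctuation, and collapse whitespace. Stemming is intentionally omitted.
clean_skill <- function(text, stop_words) {
  text <- tolower(text)
  # Remove whole-word matches of any stop word
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  text <- gsub(pattern, "", text, perl = TRUE)
  text <- gsub("[[:digit:]]", "", text)
  text <- gsub("[[:punct:]]", "", text)
  # Collapse runs of whitespace and trim the ends
  trimws(gsub("\\s+", " ", text))
}

# Made-up stop words for illustration only
toy_stopwords <- c("with", "and", "in", "strong", "experience")

cleaned <- clean_skill("Strong experience with Python 3 and SQL!", toy_stopwords)
num_words <- lengths(strsplit(cleaned, " "))

print(cleaned)
print(num_words)
```

Documents that end up with zero words after this kind of cleaning are the ones removed by the `filter(num_words != 0)` step.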

e. Get words from skills

Next we generated a list of unique n-grams of length 1-3 and used it to build the list of distinct words that serves as the basis for our word features array.

text_df <- job_skills_clean %>% select(doc_num, skill_mod)

one_word_df <- text_df %>% unnest_tokens(word, skill_mod, token = 'ngrams', n=1)
two_word_df <- text_df %>% unnest_tokens(word, skill_mod, token = 'ngrams', n=2)
three_word_df <- text_df %>% unnest_tokens(word, skill_mod, token = 'ngrams', n=3)


combined_df <- rbind(one_word_df, two_word_df)
combined_df <- rbind(combined_df, three_word_df)
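For readers unfamiliar with n-grams, the sketch below shows what the three tokenization passes produce for one cleaned document. `make_ngrams` is a hypothetical base R helper written for illustration; it is not how `unnest_tokens` is implemented, and it simply returns an empty vector when a document has fewer than `n` words.

```r
# Hypothetical helper: all runs of n consecutive words in a string
make_ngrams <- function(text, n) {
  words <- strsplit(text, "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# A made-up cleaned and stemmed skill document
doc <- "python machin learn model"

# Unigrams, bigrams, and trigrams combined into one vector of terms
all_ngrams <- unlist(lapply(1:3, make_ngrams, text = doc))

print(all_ngrams)
```

A four-word document yields four unigrams, three bigrams, and two trigrams, so the combined term list grows quickly relative to the number of documents.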

f. Create features vector based on unique terms

Next we used this list of words to create a term-frequency array that counts the number of times a word appears in the dataset. This is how we convert each word feature into a numerical value. Through trial and error, I determined that the final features array would exclude words that appeared in fewer than 10 documents, as well as documents that contained fewer than 5 words.

Although these thresholds were found through trial and error, their purpose is to drop words that are used too infrequently across the overall set of documents to be relevant to the data set.

combined_df_clean <- combined_df %>%
  filter(!is.na(word))

skills_df_a <-left_join(combined_df_clean, job_skills_clean) %>% 
  select(-c("job_board", "position_title", "skill", "skill_mod"))
## Joining, by = "doc_num"
skills_df_b <- skills_df_a %>% 
  group_by(doc_num, word) %>% 
  mutate(n = n()) %>%
  ungroup() %>%
  group_by(word) %>%
  mutate(word_count = n(), 
         num_docs = n_distinct(doc_num)) %>% 
  ungroup() %>% 
  mutate(total_words = n_distinct(word), 
         total_docs = n_distinct(doc_num))

skills_df_c <- skills_df_b %>% 
  filter(num_docs >= 10,
         num_words >= 5) %>%
  mutate(total_words = n_distinct(word), 
         total_docs = n_distinct(doc_num))

tf_df <- skills_df_c %>% 
  mutate(tf = word_count/total_words)

tf_df_mod <- tf_df %>% 
  distinct(doc_num, word, .keep_all = TRUE)

4. Create features array for use in clustering model

In this step, we take the words from the previous step and create a features array for the terms.

Rather than use tf-idf for our feature values, we will use the overall term frequency as the values for our word features.

Additionally, for each document, we will replace the NA values with a -1 value. We use -1 to penalize a document for not having a word in the document.

tf_df_spread <- tf_df_mod %>% 
  select(c(doc_num, word, tf)) %>% 
  spread(key=word, value=tf, fill = NA)


tf_df_mod_2 <- tf_df_spread %>%
  group_by(doc_num) %>%
  fill(colnames(.),.direction="updown") %>%
  ungroup()

included_docs <- tf_df_mod %>% distinct(doc_num)


## Create a data frame of just the word features for each job posting
features_array <- tf_df_mod_2 %>% 
  ungroup()
  

## Replaces NA values with -1
features_array_mod <- replace(features_array, is.na(features_array),-1)

features_array_clean <- features_array_mod %>%
  select(-doc_num) 
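The following base R sketch walks through the same construction on a toy example: a corpus-level term frequency (occurrences of a term divided by the number of distinct terms, mirroring `word_count/total_words` above) placed into a document-by-term matrix, with -1 wherever a document lacks the term. The `tokens` data and all variable names here are made up for illustration.

```r
# Toy (document, term) pairs standing in for the tokenized skills
tokens <- data.frame(
  doc_num = c(1, 1, 2, 2, 3),
  word = c("python", "sql", "python", "model", "python"),
  stringsAsFactors = FALSE
)

# Corpus-wide occurrences of each term and the number of distinct terms
word_count <- table(tokens$word)
total_words <- length(word_count)
tf <- setNames(as.numeric(word_count) / total_words, names(word_count))

# Start every cell at -1 (term absent), then fill in the terms present
docs <- as.character(sort(unique(tokens$doc_num)))
features <- matrix(
  -1,
  nrow = length(docs), ncol = length(tf),
  dimnames = list(docs, names(tf))
)
for (i in seq_len(nrow(tokens))) {
  features[as.character(tokens$doc_num[i]), tokens$word[i]] <- tf[[tokens$word[i]]]
}

print(features)
```

Because the value depends only on the term, every document that contains a given term gets the same number in that column; what distinguishes documents is which cells hold -1.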

5. Conduct KMeans Clustering

Now we use our features array to train our KMeans clustering model in order to identify the clusters in our data.

We used 30 groups for our clustering algorithm. As we continue to expand upon this project in the future, we will consider testing different numbers of groups and evaluating the resulting clusters to determine the best number to use.

Once the model has been created, we will look at the size of the clusters, as well as the variance within the clusters.

a. Train Model

num_groups = 30

## Run KMeans algorithm on the features array
set.seed(4152023)

kmeans_info <- kmeans(features_array_clean,num_groups, nstart = 15)


kmeans_info$size
##  [1]  9 11  5 16  4  4 20  3 10  3 14  3  6  9  7 12  3 16 11 35  6  9 19  5  1
## [26]  5 87  4  6 10
kmeans_info$withinss
##  [1]  30.130773  41.078400  24.259302  30.788456  13.387488  16.082544
##  [7]  85.363901   4.753877  35.935072   8.642389  29.299579   5.792128
## [13]  13.782336  25.319808  14.673042  40.190384   4.240597  56.775936
## [19]  34.066153  94.427820  20.562325  29.929458  57.106776  13.594906
## [25]   0.000000  12.560896 260.699494  14.810704  18.643659  32.886170
kmeans_info$totss
## [1] 2715.78
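One common way to compare different numbers of groups is the elbow heuristic: fit KMeans for several values of k and look at where the total within-cluster sum of squares stops dropping sharply. The sketch below does this on synthetic data with three well-separated groups. The data, the candidate range, and the variable names are all assumptions for illustration; this was not run on the job skills features.

```r
set.seed(42)

# Synthetic 2-D data with three well-separated groups of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

# Total sum of squares around the overall mean, for reference
totss <- sum(scale(toy, scale = FALSE)^2)

# Fit KMeans for each candidate k and record the within-cluster total
candidate_k <- 2:6
wss <- sapply(candidate_k, function(k) {
  kmeans(toy, centers = k, nstart = 15)$tot.withinss
})
names(wss) <- candidate_k

# Share of the total variation left unexplained at each k
print(round(wss / totss, 4))
```

On this toy data the unexplained share falls steeply up to k = 3 and then flattens, which is the elbow. The same loop could be pointed at `features_array_clean` with a wider range of k.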

b. Label job skills data

Next we take our cluster information and use them to label the original data with the new cluster numbers.

Additionally, we create a plot to show how many documents were included in each cluster group.

job_skills_clean_b <- job_skills_clean %>% filter(doc_num %in% included_docs$doc_num)

job_skills_df_updated <- tibble(cbind(job_skills_clean_b, kmeans_info$cluster))
job_skills_df_updated <- job_skills_df_updated %>% rename(cluster=`kmeans_info$cluster`)
job_skills_df_updated <- job_skills_df_updated %>% mutate(cluster = as.character(cluster))


job_skills_df_updated %>%
  mutate(cluster = as.numeric(cluster)) %>%
  count(cluster) %>%
  ggplot() +
  geom_bar(aes(x=-cluster, y=n, fill=n), stat='identity') +
  coord_flip() +
  ggtitle("Cluster Size") +
  xlab("Cluster Number") +
  ylab("")

  job_skills_df_updated %>%
    mutate(cluster = as.numeric(cluster)) %>%
    count(cluster) %>%
    select(cluster, n) %>%
    mutate(pct_total = paste0(round((n/sum(n)*100),1),"%")) %>%
    kable(
      caption = "Cluster Size",
      col.names = c("Cluster", "Size", "Pct of Total")
    ) %>%
    kable_material(c("striped"))
Cluster Size

Cluster   Size   Pct of Total
1         9      2.5%
2         11     3.1%
3         5      1.4%
4         16     4.5%
5         4      1.1%
6         4      1.1%
7         20     5.7%
8         3      0.8%
9         10     2.8%
10        3      0.8%
11        14     4%
12        3      0.8%
13        6      1.7%
14        9      2.5%
15        7      2%
16        12     3.4%
17        3      0.8%
18        16     4.5%
19        11     3.1%
20        35     9.9%
21        6      1.7%
22        9      2.5%
23        19     5.4%
24        5      1.4%
25        1      0.3%
26        5      1.4%
27        87     24.6%
28        4      1.1%
29        6      1.7%
30        10     2.8%

6. Label Jobs

Now that we have identified the cluster for each of our documents, we review the job postings included in each group in order to infer a label for the group. This is done by looking at the original job postings associated with each cluster, counting the unique terms found in each group, and using the most frequently appearing term as the group's label.

a. Generate Labels for the groups

job_skills_labels = c()

for(cluster_num in seq(num_groups)) {
    
  
    cluster_docs <- (job_skills_df_updated %>% 
      filter(cluster == cluster_num) %>%
      distinct(doc_num))$doc_num
  
  
    relevant_docs <- job_skills_clean %>% filter(doc_num %in% cluster_docs)
    
    one_word_df <- relevant_docs %>% unnest_tokens(word, skill, token = 'ngrams', n=1)
    two_word_df <- relevant_docs %>% unnest_tokens(word, skill, token = 'ngrams', n=2)
    three_word_df <- relevant_docs %>% unnest_tokens(word, skill, token = 'ngrams', n=3)
    
    combined_words <- rbind(one_word_df, two_word_df)
    combined_words <- rbind(combined_words, three_word_df)
    
    top_words <- combined_words %>% 
      mutate(word = removeWords(word, english_stopwords),
             word = str_remove(word, regex("(years)")),
             word = str_squish(word), ) %>%
      filter(!word %in% english_stopwords) %>% 
      filter(!word %in% job_skills_categories_stopwords) %>%
      filter(!is.na(word)) %>%
      filter(word != "") %>%
      count(word, sort=TRUE)
    
    top_word <- top_words[[1,1]]
    job_skills_labels <- c(job_skills_labels, top_word)
  
}

cluster_labels <- 
  tibble(job_skills_labels) %>% 
  mutate(cluster = as.character(row_number())) %>% 
  rename(skills_label = job_skills_labels) %>% 
  relocate(cluster, .before=cluster)


cluster_labels %>%
  kable(
    caption='Cluster Labels',
    col.names = c("Label Name", "Cluster")
  ) %>%
  kable_material(c("striped",
                   font_size=8)) %>%
  kable_styling(fixed_thead = T)
Cluster Labels

Label Name             Cluster
python                 1
analysis               2
business               3
business               4
develop                5
technical              6
machine learning       7
research               8
analytics              9
r                      10
project                11
statistical analysis   12
data sets              13
machine learning       14
linux                  15
analytical             16
open source            17
communicate            18
analysis               19
insights               20
models                 21
big data               22
models                 23
computer science       24
seven                  25
relational databases   26
technical              27
communicate            28
commercial             29
models                 30
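Several clusters received the same top term, so the classifier later sees fewer distinct labels than there are clusters. The short check below counts them, using the label names copied from the Cluster Labels table; the variable names are made up for this sketch.

```r
# Label names copied from the Cluster Labels table, in cluster order
cluster_label_names <- c(
  "python", "analysis", "business", "business", "develop", "technical",
  "machine learning", "research", "analytics", "r", "project",
  "statistical analysis", "data sets", "machine learning", "linux",
  "analytical", "open source", "communicate", "analysis", "insights",
  "models", "big data", "models", "computer science", "seven",
  "relational databases", "technical", "communicate", "commercial", "models"
)

# How many clusters share each label
label_counts <- table(cluster_label_names)
shared_labels <- label_counts[label_counts > 1]

# Number of distinct classes available to the classifier
n_classes <- length(label_counts)

print(shared_labels)
print(n_classes)
```

The 30 clusters collapse to 23 distinct labels, with "models" covering three clusters and five other labels covering two each.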

Additionally, we want to add our new labels to the features array, which will be used in our classification model.

features_array_labeled <- tibble(cbind(features_array_mod, kmeans_info$cluster))

 features_array_labeled <- features_array_labeled %>% 
  rename(cluster_name=`kmeans_info$cluster`)

features_array_labeled <- features_array_labeled %>% 
  mutate(cluster_name = as.character(cluster_name)) %>%
  left_join(cluster_labels, by=c('cluster_name'='cluster'))


features_array_upsampled <- tibble(upsample(features_array_labeled, cat_col='skills_label')) %>%
  relocate(c(skills_label),.after=doc_num)


features_array_labeled %>% 
  select(doc_num, skills_label) %>% 
  count(skills_label) %>% 
  ggplot() +
  geom_bar(aes(x=reorder(skills_label,n), y=n, fill=skills_label), stat='identity') +
  coord_flip() +
  ggtitle("Cluster Labels") +
  xlab("Label") +
  ylab("n")

features_array_labeled %>% 
  select(doc_num, skills_label) %>% 
  count(skills_label) %>% 
  with(wordcloud(skills_label, n, max.words=150))
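The `upsample()` call above balances the labels by repeating rows from the smaller groups. The base R sketch below illustrates that general idea on toy data. It is not the groupdata2 implementation, and the `labeled` data frame is made up.

```r
set.seed(1)

# Toy labeled data with one dominant class
labeled <- data.frame(
  doc_num = 1:6,
  skills_label = c("python", "python", "python", "python", "sql", "r"),
  stringsAsFactors = FALSE
)

# Every class is grown to the size of the largest class
target <- max(table(labeled$skills_label))

upsampled <- do.call(rbind, lapply(
  split(labeled, labeled$skills_label),
  function(grp) {
    # Draw the missing rows with replacement from the class itself
    extra <- sample(seq_len(nrow(grp)), target - nrow(grp), replace = TRUE)
    grp[c(seq_len(nrow(grp)), extra), ]
  }
))

print(table(upsampled$skills_label))
```

Because upsampling duplicates rows, applying it before a train/test split can place copies of the same document in both sets, which may make test accuracy look better than it would on unseen documents. Splitting first and upsampling only the training rows avoids that.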

7. Train and Evaluate Classifier

Finally, we focus on building a classification model using our newly labeled data. We will run several iterations of the model and record the accuracy score of each, so that we can evaluate the model's performance and determine whether it is > 80% accurate.

a. Build and evaluate Classification model

For 500 iterations, we will build a model and evaluate the accuracy by completing the following steps in each iteration:

  1. Shuffle the dataframe
  2. Split the data into a training set and a test set, with 70% of the data included in the training set
  3. Apply KNN to the training set
  4. Evaluate the performance of the model on the test set
  5. Generate the accuracy score and store it in a vector of accuracy scores, one for each iteration
seed_num = 5142023
sample_size = 500

# Accuracy = correctly classified (diagonal) / all test observations
accuracy <- function(x) {
  sum(diag(x)) / sum(x)
}


accuracy_sample = c()

for(i in seq(sample_size)) {
  
  set.seed(seed_num + i)  
  shuffled_features <- features_array_upsampled %>% sample_frac()

  train_size = .70
  test_size = 1-train_size
  
  set.seed(seed_num + i)  
  rand <- sample(seq(1,nrow(shuffled_features)),size=train_size*nrow(shuffled_features),replace=FALSE)
  
  train_data <- shuffled_features %>% slice(rand)
  test_data <- shuffled_features %>% slice(-rand)
  
  X_train <- train_data %>% select(-c(doc_num, skills_label,cluster_name))
  X_test <- test_data %>% select(-c(doc_num, skills_label, cluster_name))
  
  y_train <- train_data %>% select(skills_label)
  y_test <- test_data %>% select(skills_label)
  
  y_train <- y_train %>%
    mutate(category = as.factor(skills_label))
  
  y_test <- y_test %>% 
    mutate(category = as.factor(skills_label))
  
  
  y_train_labels <- y_train$skills_label
  y_test_labels <- y_test$skills_label
  
  knn_pred1 <- knn(
    train = X_train,
    test = X_test,
    cl = y_train_labels,
    k = 25
  )


  tab <- table(knn_pred1,y_test_labels)
  
  
  accuracy_sample <- c(accuracy_sample, accuracy(tab))
    
}
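As a compact illustration of one iteration of this loop, the sketch below runs a single 70/30 split and a KNN fit on R's built-in `iris` data, which stands in for the features array. The value k = 5 and the variable names are assumptions for the example. Accuracy is computed as the share of matching predictions, which gives the same number as the confusion-table version when the table is square and stays correct even if a label happens to be missing from the test set.

```r
library(class)  # provides knn(); class is distributed with standard R installs

set.seed(123)

# 70/30 split of row indices
n <- nrow(iris)
train_idx <- sample(seq_len(n), size = floor(0.70 * n))

# Numeric feature columns only; labels kept separately
X_train <- iris[train_idx, 1:4]
X_test <- iris[-train_idx, 1:4]
y_train <- factor(iris$Species[train_idx])
y_test <- iris$Species[-train_idx]

# Predict each test row from its 5 nearest training rows
pred <- knn(train = X_train, test = X_test, cl = y_train, k = 5)

# Share of test rows whose predicted label matches the true label
acc <- mean(as.character(pred) == as.character(y_test))

print(acc)
```

Wrapping these steps in a loop over different seeds and collecting `acc` each time reproduces the structure of the 500-iteration experiment.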

b. View distribution of sample accuracy scores

Next, we will evaluate our accuracy scores to determine if the mean accuracy score of our model is > 80% at a 90% confidence level.

mean(accuracy_sample)
## [1] 0.9173312
sd(accuracy_sample)
## [1] 0.01696199
quantile(accuracy_sample)
##        0%       25%       50%       75%      100% 
## 0.8566879 0.9076433 0.9187898 0.9283439 0.9538217
accuracy_sample <- tibble(accuracy_sample)

ggplot(accuracy_sample) +
  geom_histogram(aes(x=accuracy_sample), fill='white', color='black') +
  ggtitle("Distribution of Model Accuracy Scores (n=500)") +
  xlab("Accuracy Score") +
  ylab("")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

c. Generate Confidence Interval for Mean Accuracy Score

Finally, we compute the confidence interval for our mean accuracy score to determine whether the mean accuracy for our data is > 80%.

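# Confidence level for the interval (90%)
conf_level = .90

n <- sample_size
avg <- mean(accuracy_sample$accuracy_sample)
std_dev <- sd(accuracy_sample$accuracy_sample)

# Margin of error from the t distribution with n - 1 degrees of freedom
margin <- qt(1-((1-conf_level)/2),df=(n-1))*std_dev/sqrt(n)

lower_interval <- avg - margin
upper_interval <- avg + margin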

print(paste(lower_interval, upper_interval))
## [1] "0.916081163889318 0.918581256492847"
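The interval calculation can be wrapped in a small function and checked against the summary statistics reported above (mean 0.9173312, standard deviation 0.01696199, n = 500). `t_interval` is a hypothetical helper name introduced for this sketch; the formula is the same t-based interval used in the code above.

```r
# Hypothetical helper: two-sided t interval for a mean from summary stats
t_interval <- function(avg, std_dev, n, conf_level = 0.90) {
  # Critical t value times the standard error of the mean
  margin <- qt(1 - (1 - conf_level) / 2, df = n - 1) * std_dev / sqrt(n)
  c(lower = avg - margin, upper = avg + margin)
}

# Summary statistics taken from the accuracy scores reported above
ci <- t_interval(avg = 0.9173312, std_dev = 0.01696199, n = 500)

print(round(ci, 4))
```

The whole interval sits well above 0.80, which is the comparison the research question asks for. The interval is narrow because it describes the mean over 500 runs; individual runs ranged from about 0.857 to 0.954.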

For this model, we are 90% confident that the true mean accuracy score is between 91.6% and 91.9%.

8. Conclusion

Through this process, we were able to take a corpus of job postings and break them down into individual discrete job skills requirement documents, and then vectorize these skills into a word array, and use this data to generate clusters of related job skills. From there, we were able to generate labels for each of the groupings and then train a model for classifying new job skills requirements.

Overall, our objective was to build a model that performs this classification with an accuracy rate above 80%. After running 500 simulations, we found that the mean accuracy of our model had a confidence interval of 91.6% to 91.9% at a 90% level of confidence.

While I was able to work through the steps to build this model, I still have a lot to learn about tuning models and adjusting their performance. For this model, the main issues that arose were the following:

  1. Clusters were not cohesive
  2. Some of the labels generated for the clusters did not make sense
  3. Across multiple iterations, there were some clusters that were significantly larger than other groups.

While this code reflects the final work-product, there was a lot of work done to improve the performance of the models, which is not reflected in the final code.