This project is focused on extending the work completed in Project 3, where we set out to identify the top skills required for a Data Scientist.
In that project, we collected more than 100 postings across several job boards and extracted the job requirements from each posting to generate a corpus of the most common Data Science related terms.
For this project, I shifted my focus to building a classification model that takes an entry from a Data Science job posting and classifies it by the category of skill it belongs to. I see this as part of a larger project that will allow a user to determine whether they are a fit for a job, based on their own skills and qualifications and the skills listed in the job posting.
The two primary components of this project are:
Build an unsupervised model that takes the unlabeled job postings and identifies common groups in the data based on the skills listed in each posting;
Use the identified clusters to label the data, then use the labeled data to build a classifier that assigns new job requirement entries to a skill category.
The overall research question for this project was: “Can I build a classifier that can be used to predict the type of job skills for a new job posting with > 80% accuracy?”
I had several key motivations for this project:
Finally, the core methodology for this project was as follows:
For this project, I used the following libraries:
The class library was new to me; it contains the knn() function that was used to build the classifier. Additionally, I used the groupdata2 library for the first time, to upsample the features array and deal with the unbalanced data clusters. (A toy sketch of both libraries follows the library-loading output below.)
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(tidytext)
library(stopwords)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
##
##
## Attaching package: 'tm'
##
## The following object is masked from 'package:stopwords':
##
## stopwords
library(class)
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(wordcloud)
## Loading required package: RColorBrewer
library(groupdata2)
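Since class and groupdata2 were both new to me, here is a minimal toy sketch of how each is used. The data here is made up for illustration and is unrelated to the project data:

## Toy sketch (not project data): class::knn() classifies each test point by a
## majority vote of its k nearest training points
train_x <- data.frame(x = c(1, 2, 8, 9), y = c(1, 2, 8, 9))
test_x <- data.frame(x = c(1.5, 8.5), y = c(1.5, 8.5))
train_labels <- factor(c("a", "a", "b", "b"))
knn(train = train_x, test = test_x, cl = train_labels, k = 1)

## groupdata2::upsample() repeats rows of minority classes until every class
## in cat_col has the same number of rows
toy <- tibble(label = c("a", "a", "a", "b"), value = 1:4)
upsample(toy, cat_col = "label")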
In this step, we begin by importing the job skills data that we collected in Project 3, and then conduct a number of pre-processing steps to build the word features array that we will use in our clustering algorithm.
This process is divided into the following steps:
We begin by importing the job postings data collected and processed in Project 3.
path = './Job Postings Data.csv'
job_postings_raw <- read_csv(path)
## Rows: 94 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Job Board, Post URL, Company Name, Industry, Location, Position/Ti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
job_postings_raw <- clean_names(job_postings_raw)
In some instances, the underlying data lists job requirements as multiple independent sentences covering different skills. In this step, we further atomize the data to produce a separate line for each independent job skills requirement.
job_postings_simple <- job_postings_raw %>%
select(job_board, position_title, skills)
cols = c("job_board", "position_title", "skill")
job_skills_df = data.frame(matrix(nrow=0, ncol=length(cols)))
colnames(job_skills_df) = cols
row_index = 1
job_postings_simple <-
job_postings_simple %>%
drop_na()
## Iterate over each posting and create one row per individual skill entry
for(i in 1:nrow(job_postings_simple)) {
job_board <- job_postings_simple[i,]$job_board
position_title <- job_postings_simple[i,]$position_title
skills <- job_postings_simple[i,]$skills
skills_list <- str_split_1(skills,regex("\\s\\s|\\n|;|\\.|\\,"))
for(j in 1:length(skills_list)) {
job_skills_df[row_index,]$job_board <- job_board
job_skills_df[row_index,]$position_title <- position_title
job_skills_df[row_index,]$skill <- skills_list[j]
row_index <- row_index + 1
}
}
job_skills_df<- tibble(job_skills_df) %>%
filter(skill != "") %>%
filter(nchar(skill) > 1) %>%
mutate(skill = str_remove_all(skill, regex("\\d")))
job_skills_df <- job_skills_df %>%
mutate(doc_num = row_number()) %>%
relocate(doc_num, .before=job_board)
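As an aside, the row-expansion loop above could be written more compactly with tidyr::separate_rows. This sketch is illustrative only and is not used in the pipeline; it assumes the same splitting pattern as the loop:

## Alternative to the loop: one row per atomized skill via separate_rows()
job_skills_alt <- job_postings_simple %>%
  separate_rows(skills, sep = "\\s\\s|\\n|;|\\.|\\,") %>%
  rename(skill = skills) %>%
  filter(skill != "", nchar(skill) > 1) %>%
  mutate(skill = str_remove_all(skill, regex("\\d")),
         doc_num = row_number())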
In order to reduce the inclusion of stop words that are not essential to our models, we imported a standard English stopwords list. Additionally, I conducted a manual review of the data and used it to create a stop words list of stemmed words. Finally, we created a list of stop words related to Data Science job skills and general job postings, for use when creating the labels for our different groups.
english_stopwords <- stopwords("en")
job_skills_stopwords = sort(unique(c("abil", "coursework", "spanish", "approach","perspect",
"strong","year","work","prior","experi","must", "proven","prefer","understand", "etc", "e",
"advanc", "requir", "field", "techniques", "well","flexibl", "knowledg", "realworld", "employe",
"fluent","english", "respect","detail","establi","experience", "experi","use", "strong",
"g", "travel", "qualiti","maintain","well", "preferred", "hard", "ask", "champion",
"prefer", "quantit", "also", "degree", "learn", "understand", "master", "environ", "general",
"skill", "requir", "hands", "field", "techniqu", "includ","tool","use", "think","learn","will",
"manag", "solid", "skill","profici","o","understand","arena","thrive","familiar","focus",
"willing","work","prioriti", "experti","profici","thrive","enabl","either","can","prior",
"relev","study","requir", "minimum","skill","role","similar", "develop", "especi","willing",
"signific","perspect","work","flexibl","d","demand","techniques","answering","process","ori",
"good","abl","effici","build", "expert","level","interact","bachelor","degre","long",
"shortterm", "task","succeed","fastpac","deliver","plus", "department","exercis","experienc",
"partner","expertis","thought","leader","phd","new","two","team","major","practic","teamwork",
"need","offici","obtain","team","small","thing","years","masters")))
job_skills_categories_stopwords = sort(unique(c("ability", "experience", "skills", "knowledge", "working",
"related", "using", "understanding", "including", "advanced", "relevant",
"proficiency", "excellent", "building", "processes","able","effectively",
"master's", "demonstrated", "non", "concepts","expertise", "familiarity",
"results", "within","applying","others","data", "years", "including","get",
"willingness","machine", "statistical", "arrive","extensive","data science",
"science", "least","languages","large", "open", "work", "techniques", "large",
"solutions","vision","large scale","questions","convert","management",
"focus", "technologies","high","sets","coursework","materials", "one","complex",
"methods","following","production")))
For each document of text, I standardized the data and stored the result in a new column called skill_mod. The core steps used to generate this column are shown in the code below:
job_skills_clean <- job_skills_df %>%
mutate(skill_mod = tolower(skill),
skill_mod = removeWords(skill_mod, english_stopwords),
skill_mod = stemDocument(skill_mod, language='english'),
skill_mod = removeWords(skill_mod, job_skills_stopwords),
skill_mod = str_replace_all(skill_mod, regex("\\d"), "") ,
skill_mod = str_replace_all(skill_mod, regex("[[:punct:]]"), ""),
skill_mod = str_replace_all(skill_mod, regex("(^\\b|^\\W)"), ""),
skill_mod = str_squish(skill_mod),
num_words = str_count(skill_mod, "\\w+"))
job_skills_clean <- job_skills_clean %>%
filter(num_words != 0) %>%
mutate(doc_num = row_number())
Next, we generate a list of unique n-grams of length 1-3, and use it to build the list of distinct words that will serve as the basis for our word features array.
text_df <- job_skills_clean %>% select(doc_num, skill_mod)
one_word_df <- text_df %>% unnest_tokens(word, skill_mod, token = 'ngrams', n=1)
two_word_df <- text_df %>% unnest_tokens(word, skill_mod, token = 'ngrams', n=2)
three_word_df <- text_df %>% unnest_tokens(word, skill_mod, token = 'ngrams', n=3)
combined_df <- rbind(one_word_df, two_word_df)
combined_df <- rbind(combined_df, three_word_df)
Next, we used this list of words to create a term-frequency array that counts the number of times each word appears in the dataset. This is how we convert each word feature into a numerical value. Through trial and error, I determined that the final features array would exclude words appearing in fewer than 10 documents, as well as documents containing fewer than 5 words.
Although these thresholds were found by trial and error, they serve to remove words that are too infrequent across the overall set of documents to be considered relevant.
combined_df_clean <- combined_df %>%
filter(!is.na(word))
skills_df_a <-left_join(combined_df_clean, job_skills_clean) %>%
select(-c("job_board", "position_title", "skill", "skill_mod"))
## Joining, by = "doc_num"
skills_df_b <- skills_df_a %>%
group_by(doc_num, word) %>%
mutate(n = n()) %>%
ungroup() %>%
group_by(word) %>%
mutate(word_count = n(),
num_docs = n_distinct(doc_num)) %>%
ungroup() %>%
mutate(total_words = n_distinct(word),
total_docs = n_distinct(doc_num))
skills_df_c <- skills_df_b %>%
filter(num_docs >= 10,
num_words >= 5) %>%
mutate(total_words = n_distinct(word),
total_docs = n_distinct(doc_num))
tf_df <- skills_df_c %>%
mutate(tf = word_count/total_words)
tf_df_mod <- tf_df %>%
distinct(doc_num, word, .keep_all = TRUE)
In this step, we take the words from the previous step and create a features array for the terms.
Rather than use tf-idf for our feature values, we use the overall term frequency as the value for each word feature.
Additionally, for each document, we replace the NA values with -1; the -1 penalizes a document for not containing a given word.
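For comparison, had we wanted tf-idf weights instead, tidytext provides bind_tf_idf(). A sketch using the per-document counts already computed above; this is illustrative only and is not used in the pipeline:

## Illustrative only: tf-idf weights from the per-document word counts
tfidf_alt <- tf_df_mod %>%
  select(doc_num, word, n) %>%
  bind_tf_idf(word, doc_num, n)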
tf_df_spread <- tf_df_mod %>%
select(c(doc_num, word, tf)) %>%
spread(key=word, value=tf, fill = NA)
tf_df_mod_2 <- tf_df_spread %>%
group_by(doc_num) %>%
fill(colnames(.),.direction="updown") %>%
ungroup()
included_docs <- tf_df_mod %>% distinct(doc_num)
## Create a data frame of just the word features for each job posting
features_array <- tf_df_mod_2 %>%
ungroup()
## Replaces NA values with -1
features_array_mod <- replace(features_array, is.na(features_array),-1)
features_array_clean <- features_array_mod %>%
select(-doc_num)
Now we use our features array to train our KMeans clustering model and identify the clusters in our data.
We used 30 groups for the clustering algorithm; however, as we continue to expand this project, we will consider testing different numbers of groups and evaluating cluster performance to determine the best choice (a brief elbow-method sketch follows the clustering output below).
Once the model has been created, we will look at the size of the clusters, as well as the within-cluster variance.
num_groups = 30
## Run KMeans algorithm on the features array
set.seed(4152023)
kmeans_info <- kmeans(features_array_clean,num_groups, nstart = 15)
kmeans_info$size
## [1] 9 11 5 16 4 4 20 3 10 3 14 3 6 9 7 12 3 16 11 35 6 9 19 5 1
## [26] 5 87 4 6 10
kmeans_info$withinss
## [1] 30.130773 41.078400 24.259302 30.788456 13.387488 16.082544
## [7] 85.363901 4.753877 35.935072 8.642389 29.299579 5.792128
## [13] 13.782336 25.319808 14.673042 40.190384 4.240597 56.775936
## [19] 34.066153 94.427820 20.562325 29.929458 57.106776 13.594906
## [25] 0.000000 12.560896 260.699494 14.810704 18.643659 32.886170
kmeans_info$totss
## [1] 2715.78
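As noted above, the choice of 30 groups is a candidate for future tuning. A minimal elbow-method sketch is shown here; the range of k values is an arbitrary choice for illustration:

## Sweep candidate k values and record the total within-cluster sum of squares;
## an "elbow" in this curve suggests a reasonable number of groups
k_candidates <- seq(5, 50, by = 5)
wss <- sapply(k_candidates, function(k) {
  set.seed(4152023)
  kmeans(features_array_clean, centers = k, nstart = 15)$tot.withinss
})
tibble(k = k_candidates, wss = wss) %>%
  ggplot(aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  ggtitle("Total Within-Cluster SS by Number of Groups")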
Next, we take the cluster assignments and use them to label the original data with the new cluster numbers.
Additionally, we create a plot showing how many documents were included in each cluster group.
job_skills_clean_b <- job_skills_clean %>% filter(doc_num %in% included_docs$doc_num)
job_skills_df_updated <- tibble(cbind(job_skills_clean_b, kmeans_info$cluster))
job_skills_df_updated <- job_skills_df_updated %>% rename(cluster=`kmeans_info$cluster`)
job_skills_df_updated <- job_skills_df_updated %>% mutate(cluster = as.character(cluster))
job_skills_df_updated %>%
mutate(cluster = as.numeric(cluster)) %>%
count(cluster) %>%
ggplot() +
geom_bar(aes(x=-cluster, y=n, fill=n), stat='identity') +
coord_flip() +
ggtitle("Cluster Size") +
xlab("Cluster Number") +
ylab("")
job_skills_df_updated %>%
mutate(cluster = as.numeric(cluster)) %>%
count(cluster) %>%
select(cluster, n) %>%
mutate(pct_total = paste0(round((n/sum(n)*100),1),"%")) %>%
kable(
caption = "Cluster Size",
col.names = c("Cluster", "Size", "Pct of Total")
) %>%
kable_material(c("striped"))
| Cluster | Size | Pct of Total |
|---|---|---|
| 1 | 9 | 2.5% |
| 2 | 11 | 3.1% |
| 3 | 5 | 1.4% |
| 4 | 16 | 4.5% |
| 5 | 4 | 1.1% |
| 6 | 4 | 1.1% |
| 7 | 20 | 5.7% |
| 8 | 3 | 0.8% |
| 9 | 10 | 2.8% |
| 10 | 3 | 0.8% |
| 11 | 14 | 4% |
| 12 | 3 | 0.8% |
| 13 | 6 | 1.7% |
| 14 | 9 | 2.5% |
| 15 | 7 | 2% |
| 16 | 12 | 3.4% |
| 17 | 3 | 0.8% |
| 18 | 16 | 4.5% |
| 19 | 11 | 3.1% |
| 20 | 35 | 9.9% |
| 21 | 6 | 1.7% |
| 22 | 9 | 2.5% |
| 23 | 19 | 5.4% |
| 24 | 5 | 1.4% |
| 25 | 1 | 0.3% |
| 26 | 5 | 1.4% |
| 27 | 87 | 24.6% |
| 28 | 4 | 1.1% |
| 29 | 6 | 1.7% |
| 30 | 10 | 2.8% |
Now that we have a cluster for each of our documents, we review the job postings included in each group in order to infer a label for the group. We do this by looking at the original job postings associated with each cluster, counting the unique terms found in each group, and using the most frequently appearing term as the group's label.
job_skills_labels = c()
for(cluster_num in seq(num_groups)) {
cluster_docs <- (job_skills_df_updated %>%
filter(cluster == cluster_num) %>%
distinct(doc_num))$doc_num
relevant_docs <- job_skills_clean %>% filter(doc_num %in% cluster_docs)
one_word_df <- relevant_docs %>% unnest_tokens(word, skill, token = 'ngrams', n=1)
two_word_df <- relevant_docs %>% unnest_tokens(word, skill, token = 'ngrams', n=2)
three_word_df <- relevant_docs %>% unnest_tokens(word, skill, token = 'ngrams', n=3)
combined_words <- rbind(one_word_df, two_word_df)
combined_words <- rbind(combined_words, three_word_df)
top_words <- combined_words %>%
mutate(word = removeWords(word, english_stopwords),
word = str_remove(word, regex("(years)")),
word = str_squish(word)) %>%
filter(!word %in% english_stopwords) %>%
filter(!word %in% job_skills_categories_stopwords) %>%
filter(!is.na(word)) %>%
filter(word != "") %>%
count(word, sort=TRUE)
top_word <- top_words[[1,1]]
job_skills_labels <- c(job_skills_labels, top_word)
}
cluster_labels <-
tibble(job_skills_labels) %>%
mutate(cluster = as.character(row_number())) %>%
rename(skills_label = job_skills_labels)
cluster_labels %>%
kable(
caption='Cluster Labels',
col.names = c("Label Name", "Cluster")
) %>%
kable_material(c("striped",
font_size=8)) %>%
kable_styling(fixed_thead = T)
| Label Name | Cluster |
|---|---|
| python | 1 |
| analysis | 2 |
| business | 3 |
| business | 4 |
| develop | 5 |
| technical | 6 |
| machine learning | 7 |
| research | 8 |
| analytics | 9 |
| r | 10 |
| project | 11 |
| statistical analysis | 12 |
| data sets | 13 |
| machine learning | 14 |
| linux | 15 |
| analytical | 16 |
| open source | 17 |
| communicate | 18 |
| analysis | 19 |
| insights | 20 |
| models | 21 |
| big data | 22 |
| models | 23 |
| computer science | 24 |
| seven | 25 |
| relational databases | 26 |
| technical | 27 |
| communicate | 28 |
| commercial | 29 |
| models | 30 |
Additionally, we want to add our new labels to the features array, which will be used in our classification model.
features_array_labeled <- tibble(cbind(features_array_mod, kmeans_info$cluster))
features_array_labeled <- features_array_labeled %>%
rename(cluster_name=`kmeans_info$cluster`)
features_array_labeled <- features_array_labeled %>%
mutate(cluster_name = as.character(cluster_name)) %>%
left_join(cluster_labels, by=c('cluster_name'='cluster'))
features_array_upsampled <- tibble(upsample(features_array_labeled, cat_col='skills_label')) %>%
relocate(c(skills_label),.after=doc_num)
features_array_labeled %>%
select(doc_num, skills_label) %>%
count(skills_label) %>%
ggplot() +
geom_bar(aes(x=reorder(skills_label,n), y=n, fill=skills_label), stat='identity') +
coord_flip() +
ggtitle("Cluster Labels") +
xlab("Label") +
ylab("n")
features_array_labeled %>%
select(doc_num, skills_label) %>%
count(skills_label) %>%
with(wordcloud(skills_label, n, max.words=150))
Finally, we focus on building a classification model using our now-labeled data. We will run several iterations of the model and record the accuracy score so that we can evaluate the model's performance and determine whether it exceeds 80% accuracy.
For 500 iterations, we will build a model and evaluate the accuracy by completing the following steps in each iteration:
seed_num = 5142023
sample_size = 500
## Accuracy from a confusion matrix: correct predictions / total predictions
accuracy <- function(x) {
sum(diag(x)/sum(rowSums(x)))
}
accuracy_sample = c()
for(i in seq(sample_size)) {
set.seed(seed_num + i)
shuffled_features <- features_array_upsampled %>% sample_frac()
train_size = .70
test_size = 1-train_size
set.seed(seed_num + i)
rand <- sample(seq(1,nrow(shuffled_features)),size=train_size*nrow(shuffled_features),replace=FALSE)
train_data <- shuffled_features %>% slice(rand)
test_data <- shuffled_features %>% slice(-rand)
X_train <- train_data %>% select(-c(doc_num, skills_label,cluster_name))
X_test <- test_data %>% select(-c(doc_num, skills_label, cluster_name))
y_train <- train_data %>% select(skills_label)
y_test <- test_data %>% select(skills_label)
y_train <- y_train %>%
mutate(category = as.factor(skills_label))
y_test <- y_test %>%
mutate(category = as.factor(skills_label))
y_train_labels <- y_train$skills_label
y_test_labels <- y_test$skills_label
knn_pred1 <- knn(
train = X_train,
test = X_test,
cl = y_train_labels,
k = 25
)
tab <- table(knn_pred1,y_test_labels)
accuracy_sample <- c(accuracy_sample, accuracy(tab))
}
Next, we will evaluate our accuracy scores to determine if the mean accuracy score of our model is > 80% at a 90% confidence level.
mean(accuracy_sample)
## [1] 0.9173312
sd(accuracy_sample)
## [1] 0.01696199
quantile(accuracy_sample)
## 0% 25% 50% 75% 100%
## 0.8566879 0.9076433 0.9187898 0.9283439 0.9538217
accuracy_sample <- tibble(accuracy_sample)
ggplot(accuracy_sample) +
geom_histogram(aes(x=accuracy_sample), fill='white', color='black') +
ggtitle("Distribution of Model Accuracy Scores (n=500)") +
xlab("Accuracy Score") +
ylab("")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Finally, we compute the confidence interval for the mean accuracy score to determine whether it is >= 80%.
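The margin of error below is the half-width of a standard one-sample t-interval, where $\bar{x}$ is the mean accuracy, $s$ the sample standard deviation, $n = 500$ the number of iterations, and $C = 0.90$ the confidence level:

$$\bar{x} \pm t_{1-\frac{1-C}{2},\; n-1} \cdot \frac{s}{\sqrt{n}}$$

This is exactly what the `qt()` call below computes, with the `alpha` variable holding the confidence level.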
alpha = .90 ## confidence level for the interval
n <- sample_size
avg <- mean(accuracy_sample$accuracy_sample)
std_dev <- sd(accuracy_sample$accuracy_sample)
margin <- qt(1-((1-alpha)/2),df=(n-1))*std_dev/sqrt(n)
lower_interval <- avg - margin
upper_interval <- avg + margin
print(paste(lower_interval, upper_interval))
## [1] "0.916081163889318 0.918581256492847"
Based on this interval, we are confident at the 90% level that the true mean accuracy score for the model is between 91.6% and 91.9%.
Through this process, we took a corpus of job postings and broke it down into discrete job-skill requirement documents, vectorized those skills into a word features array, and used that data to generate clusters of related job skills. From there, we generated labels for each grouping and trained a model to classify new job skills requirements.
Overall, our objective was to build a model able to perform this classification with an accuracy rate above 80%. After running 500 simulations, we found that the mean accuracy of our model had a confidence interval of 91.6% to 91.9% at a 90% level of confidence, comfortably above our target.
While I was able to work through the steps of building this model, there is a lot more for me to learn about tuning models and adjusting their performance (see the sketch at the end of this section for one example). For this model, several of the issues that arose were related to the following:
While this code reflects the final work-product, there was a lot of work done to improve the performance of the models, which is not reflected in the final code.
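As one example of the tuning work that could follow, the choice of k = 25 for the kNN classifier could itself be swept. A sketch, assuming the X_train, X_test, y_train_labels, and y_test_labels objects from the final loop iteration above are still in scope:

## Sweep odd k values and record test accuracy for each candidate
k_values <- seq(1, 51, by = 2)
k_accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = X_train, test = X_test, cl = y_train_labels, k = k)
  mean(pred == y_test_labels)
})
tibble(k = k_values, accuracy = k_accuracy) %>%
  ggplot(aes(x = k, y = accuracy)) +
  geom_line() +
  geom_point() +
  ggtitle("kNN Test Accuracy by k")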