This project is an upgrade of my Resume Wizard web app; the paper for the original version can be found here. The original project was built in September 2023 and was my first-ever Shiny web app (not just a dashboard), and I was in a hurry to submit it to an event at the data science school I graduated from. Consequently, the web app's code and logic were a bit messy and inefficient, so I planned to upgrade it after the event to make it more polished and efficient. There were also some tokenization challenges in the web app that I couldn't address at the time because I didn't have the time to learn about them. Since then I have studied them, I know how to handle those challenges, and I will do so in this upgrade.
In July 2023, I was scrolling on LinkedIn to learn more about data science applications and what the industry currently needs in a data scientist. I stumbled upon a data scientist job posting that had over 250 applicants. I was shocked and thought, “This is just one posting; what about all the others? How much time and energy must the HR team spend just to find a few of the most suitable applicants, who probably aren’t even the most suitable?” At that moment, I said to myself, “I need to create something that makes the resume screening process more effective and objective, so that the HR team can save their time and energy for other crucial hiring steps like tests and interviews.” And that’s how Resume Wizard came to be.
Resume Wizard is an innovative web application designed to tackle the challenge of screening an abundance of resumes. Inspired by that personal realization, when I stumbled upon a job posting with hundreds of applicants, this project aims to make resume screening smarter, faster, and more objective.
With Resume Wizard, companies can find the most suitable resumes for the desired skills, even if those resumes are worded a bit differently. It’s a bridge between robots and real people, making hiring better for everyone. I believe this will help companies succeed in today’s competitive job market.
N-grams are contiguous sequences of n items from a given sample of text or speech. In the context of text analysis, these items are often words. Unigrams (1-grams) represent single words, bigrams (2-grams) are pairs of consecutive words, trigrams (3-grams) consist of three-word sequences, and so on. N-grams capture local patterns and relationships between words, allowing for a more comprehensive analysis of language structure.
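To make this concrete, here is a small sketch using the tokenizers package (the same package loaded later in this project); the sentence is just an illustration:

library(tokenizers)

text <- "designed floral arrangements for corporate events"
tokenize_ngrams(text, n = 1) # unigrams: "designed", "floral", "arrangements", ...
tokenize_ngrams(text, n = 2) # bigrams: "designed floral", "floral arrangements", ...
tokenize_ngrams(text, n = 3) # trigrams: "designed floral arrangements", ...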
The output of this project is a Shiny-based web app that anyone can use to rank and cluster resumes. The goal is to find the most suitable resumes based on the desired skills.
Resume Wizard significantly enhances the efficiency of the resume screening process for HR professionals. By automating the ranking and clustering of resumes based on desired skills, the app streamlines the initial screening phase. This leads to a more efficient use of HR resources, allowing them to focus on other critical aspects of the hiring process.
The automation of text processing, ranking, and clustering reduces the manual effort required to sift through numerous resumes. HR professionals can save valuable time that would otherwise be spent on manual resume evaluation. This enables HR teams to handle larger volumes of resumes effectively and lets recruiters devote more time and energy to other critical aspects of the hiring process, such as tests and interviews.
The ranking algorithm, based on matching skills and percentages, provides a more objective and standardized approach to evaluating resumes. This reduces the subjectivity often associated with manual resume screening, ensuring a fair and consistent evaluation process.
By clustering resumes based on desired skills, the app enables HR professionals to identify patterns and similarities among resumes. This can lead to more accurate and targeted candidate matching, ensuring that candidates with the most relevant skills are considered for specific roles.
library(tidyverse)
library(ggplot2)
library(plotly)
library(scales)
library(purrr)
library(tm)
library(NLP)
library(proxy)
library(pdftools)
library(data.table)
library(tools)
library(tidytext)
library(glue)
library(textclean)
library(tokenizers)
library(stringr)
library(wordcloud)
library(RColorBrewer)

The PDFs_folder_to_table() function transforms a folder of PDF documents into a structured table, represented as a data frame. The output is a data frame whose row names are the file names and whose PDF_Text column contains the raw extracted text.
PDFs_folder_to_table <- function(folder_path) {
  # Helper to convert a single PDF into one text string
  convert_pdf_to_text <- function(pdf_path) {
    pdf_text_content <- pdf_text(pdf_path)
    extracted_text <- list()
    for (page in seq_along(pdf_text_content)) {
      text <- pdf_text_content[[page]]
      extracted_text[[page]] <- text
    }
    all_text <- paste(extracted_text, collapse = "\n")
    return(all_text)
  }
  # Helper to get the file name without its extension
  get_file_name <- function(file_path) {
    file_path_sans_ext(basename(file_path))
  }
  # Getting the PDF files from the specified folder
  pdf_files <- list.files(folder_path, pattern = "\\.pdf$", full.names = TRUE)
  # Converting the PDFs to text
  pdf_texts <- lapply(pdf_files, convert_pdf_to_text)
  # Creating a data frame for the extracted texts
  table_data <- data.frame(
    PDF_Text = unlist(pdf_texts)
  )
  # Renaming the data frame rows with the file names
  rownames(table_data) <- get_file_name(pdf_files)
  return(table_data)
}

I will use a collection of designer-profession resumes that I got from here.
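A minimal sketch of how this function might be called to build the resumes_df used below; the "resumes/" folder path is a placeholder, and printing the first row would give something like the output shown next:

resumes_df <- PDFs_folder_to_table("resumes/")
resumes_df[1, "PDF_Text"]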
## [1] "FLORAL DESIGNER\nSummary\nPersonable Customer Service Associate dedicated to providing the highest level of customer service. Outgoing, and efficient with the capacity to\nmulti-task.\nHighlights\n Floral designer\n Inventory control Organized\n Employee scheduling Placing orders in person and over the phone\n Cash handling and banking Customer service\n Excellent multi-tasker\n\nExperience\nJune 2013\nto\nMarch 2016\nCompany Name City , State Floral designer\nDesigned arrangements for wide range of events, which included wedding and corporate parties. I did all of the prep work as well. I kept the\nshowroom clean and maintained properly for display\nJanuary 2011\nto\nDecember 2012\nCompany Name City , State Floral designer\nOpened and closed the store, which included counting cash drawers and making bank deposits. Helped customers select products that best fit\ntheir personal needs, as well as floral designing.\nApril 2008\nto\nAugust 2009\nCompany Name City , State Cashier\nCashier main function. In addition helped unloaded trucks, stocked shelves and carried merchandise out on the floor for customers. Marked\nclearance products with updated price tags.\nOctober 2002\nto\nApril 2008\nCompany Name City , State Manager/Floral designer\nOpened and closed the store, which included counting cash drawers and making bank deposits.Maintained visually appealing and effective\ndisplays for the entire store. Answered customers' questions and addressed problems and complaints in person and via phone. Helped customers\nselect products that best fit their personal needs, as well as design floral arrangements for the cooler display and for outgoing orders.\nEducation\nNorthwestern College City , State , Dupage Medical Assistant\n"
I know it’s a little messy, but don’t worry, I’ll clean the texts with the next function.
The clean_text() function is designed to clean text data. The output is the cleaned text, returned either as a corpus or as a plain character vector depending on the as.corpus argument.
clean_text <- function(text, as.corpus = T, lower = T, rm.number = T, rm.stopwords_english = T, rm.stopwords_bahasa = T, rm.punctuation = T, stem = T, rm.whitespace = T){
text_corpus <- text %>% VectorSource() %>% VCorpus()
# Lowercasing
if (lower){
text_corpus <- tm_map(x = text_corpus,
FUN = content_transformer(tolower))
}
# Removing numbers
if (rm.number){
text_corpus <- tm_map(x = text_corpus,
FUN = removeNumbers)
}
# Removing english stop words
if (rm.stopwords_english){
list_stop_words_english <- readLines("stop-words_english.txt", warn = FALSE, encoding = "UTF-8")
text_corpus <- tm_map(x = text_corpus,
FUN = removeWords,
list_stop_words_english)
}
# Removing bahasa stop words
if (rm.stopwords_bahasa){
list_stop_words_bahasa <- readLines("stop-words_bahasa.txt", warn = FALSE, encoding = "UTF-8")
text_corpus <- tm_map(x = text_corpus,
FUN = removeWords,
list_stop_words_bahasa)
}
# Removing punctuation
if (rm.punctuation){
text_corpus <- tm_map(x = text_corpus,
FUN = removePunctuation)
}
# Reducing words to their base form
if (stem){
text_corpus <- tm_map(x = text_corpus,
FUN = stemDocument)
}
# Removing white/blank spaces
if (rm.whitespace){
text_corpus <- tm_map(x = text_corpus,
FUN = stripWhitespace)
}
  # Returning the text as a corpus or as a plain character vector
  if (as.corpus){
    return(text_corpus)
  } else {
    return(sapply(text_corpus, as.character))
  }
}

You may think, “Why put if statements around each transformation?” It’s because this function is taken directly from my personal package, and there’s no need to remove them; I can just use the parameters to toggle each step.
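As a quick illustration of what clean_text() does, here is a hedged sketch on a made-up sentence (it assumes the two stop-word files are present in the working directory):

clean_text("Designed 12 floral arrangements for Weddings and corporate parties!", as.corpus = F)
# returns a single lower-cased string with the number, the stop words, and the
# punctuation removed and the remaining words stemmed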
The get_wordcloud() function generates a word cloud that emphasizes the most frequent tokens. The output is a word cloud plot representing the input text/words.
get_wordcloud <- function(tokens, scale = c(2, 0), normalize_higher_ngrams = F) {
# Creating a data frame of tokens and count their occurrences.
words <- data.frame(token = tokens) %>%
count(token, sort = TRUE) %>%
na.omit()
  # Up-weighting tokens made of more words (higher-order n-grams) so they are not drowned out by unigrams
  if (normalize_higher_ngrams){
    words <- words %>% rowwise() %>% mutate(n = n * length(unlist(str_split(token, " "))) + 1)
  }
# Generating a word cloud with specified settings, scaling word size by frequency
words %>%
with(
wordcloud(
words = token,
random.order = FALSE,
color = colorRampPalette(c("#B8E2FF", "#00416F"))(length(unique(words$token))),
min.freq = 1,
scale = scale,
rot.per = 0,
freq = n
)
)
}

The rank_resumes() function is designed to rank resumes based on their similarity to the desired skills. The input is the resumes data frame produced by PDFs_folder_to_table(). The output is a data frame with the columns Resume, PDF_Text, Matching_Skills, Matching_Percentage, and Matching_Percentage_Formatted.
rank_resumes <- function(resumes_df, desired_skills, descriptive_skills = F) {
# Function to tokenize the text by n gram range
tokenize_ngrams_by_range <- function(text, upper_range) {
tokens <- c()
for (i in upper_range:1) {
tokens <- c(tokens, unlist(tokenize_ngrams(text, n = i)))
}
return(tokens)
}
# Tokenizing the desired skills
if (descriptive_skills){
desired_skills <- desired_skills %>% clean_text(as.corpus = F,
lower = T,
rm.number = F,
rm.stopwords_english = T,
rm.stopwords_bahasa = T,
rm.punctuation = T,
stem = T,
rm.whitespace = T)
tokenized_desired_skills <- desired_skills %>% tokenize_ngrams_by_range(2)
resumes_df$PDF_Text <- resumes_df$PDF_Text %>% clean_text(as.corpus = F,
lower = T,
rm.number = F,
rm.stopwords_english = T,
rm.stopwords_bahasa = T,
rm.punctuation = T,
stem = T,
rm.whitespace = T)
max_n_gram <- 2
} else{
tokenized_desired_skills <- unlist(str_split(desired_skills %>% tolower() %>% trimws(), ", ")) %>% unique()
    # The longest skill (in words) determines how far up the n-gram range the resumes need to be tokenized
    max_n_gram <- str_split(tokenized_desired_skills, " ") %>% sapply(length) %>% max()
}
# Creating the ranking data frame
ranking_df <- resumes_df %>%
rowwise() %>%
# Extracting the matching skills
mutate(Matching_Skills = intersect(tokenize_ngrams_by_range(PDF_Text, max_n_gram), tokenized_desired_skills) %>% paste(collapse = ", "),
# Calculating the matching percentage
Matching_Percentage = ifelse(Matching_Skills == "", 0, round(length(unlist(str_split(Matching_Skills, ", "))) / length(tokenized_desired_skills), 2)),
Matching_Percentage_Formatted = paste0(Matching_Percentage * 100, "%")) %>%
as.data.frame() %>%
mutate(Resume = rownames(resumes_df)) %>%
arrange(desc(Matching_Percentage)) %>%
select(Resume, everything())
return(ranking_df)
}

desired_skills <- "- **Expertise in Image Manipulation:** Skilled in working with light, transparencies, color density, shadowing, and understanding image resolution and sizing.
- **Retouching and Selection:** Strategic approach to retouching, manipulating selections, and using advanced selection tools like Magnetic Lasso.
- **Layer Management:** Creating and managing layers, applying gradients, layer styles, borders, and adjustment layers for enhancing images.
- **Proficient in Adobe Photoshop:** Extensive experience in utilizing Photoshop tools such as masking, layers, silos, and camera raw adjustments.
- **Specialized Techniques:** Proficient in advanced techniques like creating panoramas, correcting image distortion, and extending depth of field.
- **Content Manipulation:** Expertise in content-aware tools for moving and manipulating objects seamlessly within images.
- **Prototyping:** Building interactive prototypes to demonstrate the functionality and flow of designs.
- **Wireframing:** Creating basic layouts outlining the structure of a website or application.
- **Vector Graphics:** Proficiency in using vector-based software like Adobe Illustrator for scalable and high-quality graphics.
- **Photo Editing:** Skills in enhancing and retouching photos for various design projects.
- **Creative Problem-Solving:** Ability to find innovative solutions to design challenges and think critically about design problems.
- **Storyboarding:** Planning and visualizing the sequence of events for multimedia presentations or animations.
- **Digital Illustration:** Creating original digital artwork and illustrations using tools like Adobe Photoshop or Adobe Illustrator.
- **Print Design:** Experience in designing for print media, including brochures, posters, business cards, and other promotional materials.
- **Collaboration:** Ability to work effectively in a team, collaborate with clients, and communicate design ideas clearly.
- **Time Management:** Skill in managing multiple projects, meeting deadlines, and prioritizing tasks efficiently."
ranking <- rank_resumes(resumes_df, desired_skills, T)
ranking

desired_skills <- "AutoCAD, Photoshop, Illustrator, SketchUp, Lumion, InDesign, CorelDRAW, SolidWorks, Blender, Revit, Rhinoceros (Rhino), Cinema 4D, Premiere Pro, Lightroom, Maya, ZBrush, Dreamweaver, XD, Figma, QuarkXPress, Spark, Procreate, Inkscape, Unity3D"
ranking <- rank_resumes(resumes_df, desired_skills, F)
ranking

The cluster_resumes() function is made to cluster the ranked resumes based on their matching skills. The input is the ranked data frame produced by the rank_resumes() function. The output is the same resumes with a new column: Cluster.
cluster_resumes <- function(ranking_df, k){
# Creating document term matrix of ranking_df
dtm <- DocumentTermMatrix((ranking_df %>% mutate(Matching_Skills = removePunctuation(Matching_Skills)))$Matching_Skills)
# Converting the document term matrix into a data frame
df <- dtm %>%
as.matrix() %>%
as.data.frame()
# Replacing the row names with the resume file names
rownames(df) <- rownames(ranking_df)
# Applying K-means Clustering algorithm
clusters <- kmeans(x = df,
centers = k)
clustered_ranking_df <- ranking_df %>% mutate(Cluster = clusters$cluster)
return(clustered_ranking_df)
}

ggplot <- ranking %>%
slice(1:15) %>%
ggplot(aes(y = reorder(Resume, Matching_Percentage),
x = Matching_Percentage,
fill = Matching_Percentage,
text = glue("{Resume}
{Matching_Percentage_Formatted} Matched"))) +
geom_col(show.legend = F) +
labs(title = "Top 15 Resumes",
y = NULL,
x = NULL) +
scale_x_continuous(labels = percent_format(scale = 100)) +
  scale_fill_gradientn(colours = brewer.pal(9, "Blues")) +
theme_minimal()
ggplotly(ggplot, tooltip = "text")

resume_name <- "26496059"
resume_data <- clustered_ranking_df %>%
filter(Resume == resume_name)
resume_words_cleaned <- clean_text(resume_data$PDF_Text, as.corpus = F, rm.number = F, rm.stopwords_english = T, rm.stopwords_bahasa = T, rm.punctuation = F, stem = F, rm.whitespace = T)
get_wordcloud(resume_words_cleaned %>% str_split(" ") %>% unlist(), scale = c(2, 0))

😎
Here is the link to the web-app.
The inputs are the resume PDFs, the desired skills, a flag stating whether the input skills are descriptive or not, and the number of clusters/segments.
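For context, here is a minimal, hypothetical sketch of what those inputs could look like in Shiny; the IDs and labels are illustrative, not necessarily the deployed app’s actual code:

library(shiny)

ui_inputs <- tagList(
  fileInput("resumes", "Upload resume PDFs", multiple = TRUE, accept = ".pdf"),
  textAreaInput("skills", "Desired skills"),
  checkboxInput("descriptive", "The skills above are descriptive (not a comma-separated list)", value = FALSE),
  numericInput("k", "Number of clusters/segments", value = 3, min = 1)
)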
The app converts the PDF files into raw text and then creates a data frame containing each PDF’s raw text, with the file names as row names.
The text processing involves converting all text to lowercase, removing punctuation marks, and stripping white space. If the desired skills are descriptive, the text processing continues by removing numbers, removing stop words, and stemming words to their base form.
The desired skills are then lowercased and tokenized, and the cleaned text is tokenized into a specific range of n-grams. The app then calculates the matching percentage by dividing the number of matching skills by the total number of desired skills (for example, 8 matching skills out of 16 desired skills gives 50%). The resumes data frame is then sorted by matching percentage in descending order.
The app then displays a text input where the user can enter a resume name. When a resume name is entered, the app displays a value box containing the matching percentage, followed by the matching skills. Below the value box there is a word cloud of the resume’s word distribution. Below the word cloud there is a radio-button input with three n-gram choices: Unigram, Bigram, and Trigram (Unigram by default). When an n-gram option is chosen, the word cloud changes accordingly. Below the radio buttons there is a PDF viewer of the selected resume itself.
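Here is a minimal, hypothetical sketch of that n-gram switch, assuming get_wordcloud() is defined as above and resume_text_cleaned holds one resume’s cleaned text; the deployed app’s actual code may differ:

library(shiny)
library(tokenizers)

ui <- fluidPage(
  radioButtons("ngram", "N-gram", choices = c(Unigram = 1, Bigram = 2, Trigram = 3), selected = 1),
  plotOutput("resume_wordcloud")
)

server <- function(input, output, session) {
  output$resume_wordcloud <- renderPlot({
    # Re-tokenize the cleaned resume text into the chosen n-gram size and redraw the word cloud
    n <- as.integer(input$ngram)
    get_wordcloud(unlist(tokenize_ngrams(resume_text_cleaned, n = n)), scale = c(2, 0))
  })
}

# shinyApp(ui, server)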
After the ranking, the matching skills are used as features for clustering, and the resumes are grouped into the desired number of clusters/segments. The app then displays a word cloud of the matching skills in each cluster, followed by a table of the resumes in each cluster.
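Putting the pieces together, here is a hedged end-to-end sketch of the pipeline behind the app; the folder path, the skill list, and the number of clusters are placeholders:

resumes_df <- PDFs_folder_to_table("resumes/")
desired_skills <- "Photoshop, Illustrator, Figma, InDesign"
ranking <- rank_resumes(resumes_df, desired_skills, descriptive_skills = F)
clustered_ranking_df <- cluster_resumes(ranking, k = 3)
head(clustered_ranking_df[, c("Resume", "Matching_Percentage_Formatted", "Cluster")])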