Final Project: Job Recommender
Introduction: This project builds a basic job recommendation system using a content-based approach.
Purpose 1: In this project, we present a recommender system designed for job seekers in Data Science. The recommender aims to surface the jobs and companies most relevant to a target candidate. To meet this objective, job descriptions and candidate resumes are examined along with other user inputs. The recommendation approach is content-based and relies on natural language processing. The dataset consists of job postings scraped from Glassdoor and resumes from PostJobFree.
Instructions
This project is a proof-of-concept (POC) with certain assumptions about the data. For this implementation, Purpose 1 is demonstrated in the markdown file below, which shows step by step how the text data is processed. Purpose 2 is presented in the Shiny App, which lets the user manipulate and filter settings to gain more insight into today's job market.
Load the libraries
library(tidyverse)
library(tidytext)
library(httr)
library(rvest)
library(stringr)
library(readr)
library(tm)
library(slam)
library(dplyr)
library(tidyr)
library(textstem)
library(lsa)
library(data.table)
library(VennDiagram)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)
Recommender System: Cosine Similarity
Load the data
The scraped job postings are stored in a dataframe and read in below. A sample resume is scraped, and its raw text is stored in the variable “resume”.
library(readr)
urlfile<-"https://raw.githubusercontent.com/baruab/Team2_Project_3_607/main/job_posting.csv"
jobs<-read_csv(url(urlfile))
#url_res<-"https://www.postjobfree.com/resume/adktqz/senior-data-scientist-brooklyn-ny"
#url_res<-"https://www.postjobfree.com/resume/adk07o/data-science-new-york-ny"
#url_res<-"https://www.postjobfree.com/resume/adol8d/data-scientist-new-york-ny"
#url_res<-"https://www.postjobfree.com/resume/adost3/data-scientist-new-york-ny"
url_res<-"https://www.postjobfree.com/resume/adonl3/data-scientist-charlotte-nc"
#url_res<-"https://www.postjobfree.com/resume/ado61j/data-scientist-arlington-va"
#url_res<-"https://www.postjobfree.com/resume/adol8d/data-scientist-new-york-ny"
web<- read_html(url_res)
resume<-web %>%html_nodes(".normalText")%>%html_text()
head(jobs)
## # A tibble: 6 × 13
## ...1 job_title job_description min_salary max_salary city state company_name
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 Data Sci… "Polypore Inte… 71310 122975 Char… NC Polypore In…
## 2 2 Data Sci… "Key Responsib… NA NA Not … MA Van Pool Tr…
## 3 3 Data Sci… "The Challenge… 69795 111477 Norf… VA Booz Allen …
## 4 4 Senior D… "Position Summ… 75217 121211 Beac… OH Penske
## 5 5 Data Sci… "The kind of p… 74839 112212 Newt… MA Paytronix S…
## 6 6 Data Sci… "The Data Scie… 80867 133796 Trum… CT HPOne
## # … with 5 more variables: company_industry <chr>, company_rating <dbl>,
## # bachelors <dbl>, masters <dbl>, phd <dbl>
head(resume)## [1] "\r\n\t\t\t\t\tNicholas Kim\r\nData Scientist\r\nP: 980-***-****\r\nG: adonl3@r.postjobfree.com\r\nPROFESSIONAL SUMMARY\r\nData Scientist with 7+ years’ experience processing and analyzing data across a variety of industries. Leverages various mathematical, statistical, and Machine Learning tools to collaboratively synthesize business insights and drive innovative solutions for productivity, efficiency, and revenue.\r\n\r\n•Experience applying statistical models on big data sets using cloud-based cluster computing assets with AWS, Azure, and other Unix-based architectures.\r\n•Experience applying Bayesian Techniques, Advanced Analytics, Neural Networks and Deep Neural Networks, Support Vector Machines (SVMs), and Decision Trees with Random Forest ensemble.\r\n•Experience implementing industry standard analytics within specific domains and applying data science techniques to expand these methods using Natural Language Processing, implementing clustering algorithms, and deriving insight.\r\n•In-depth knowledge of statistical procedures that are applied in both Supervised and Unsupervised Machine Learning problems.\r\n•Machine Learning techniques to promote marketing and merchandising ideas.\r\n•Proven creative thinker with a strong ability to devise and propose novel ways to look at and approach problems using a combination of business acumen and mathematical methods.\r\n•Identification of patterns in data and using experimental and iterative approaches to validate findings.\r\n•Advanced predictive modeling techniques to build, maintain, and improve on real-time decision systems.\r\n•Contributed to advanced analytical teams to design, build, validate, and re-train models.\r\n•Excellent communication skills (verbal and written) to communicate with clients, stakeholders, and team members.\r\n•Ability to quickly gain an understanding of niche subject matter domains, and design and implement effective novel solutions to be used by other subject matter experts.\r\n\r\nTECHNICAL SKILLS\r\n•Analytic Development: Python, R, Spark, SQL.\r\n•Python Packages: NumPy, Pandas, Scikit-learn, TensorFlow, Keras, PyTorch, Fastai, SciPy, Matplotlib, Seaborn, Numba.\r\n•Programming Tools: Jupyter, RStudio, Github, Git.\r\n•Cloud Computing: Amazon Web Services (AWS), Azure, Google Cloud Platform (GCP)\r\nMachine Learning, Natural Language Processing & Understanding, Machine Intelligence, Machine Learning algorithms.\r\n•Analysis Methods: Forecasting, Predictive, Statistical, Sentiment, Exploratory and Bayesian Analysis. Regression Analysis, Linear models, Multivariate analysis, Sampling methods, Clustering.\r\n•Applied Data Science: Natural Language Processing, Machine Learning, Social Analytics, Predictive Maintenance, Chatbots, Interactive Dashboards.\r\n•Artificial Intelligence: Classification and Regression Trees (CART), Support Vector Machine, Random Forest, Gradient Boosting Machine (GBM), TensorFlow, PCA, Regression, Naïve Bayes.\r\n•Natural Language Processing: Text analysis, classification, chatbots.\r\n•Deep Learning: Machine Perception, Data Mining, Machine Learning, Neural Networks, TensorFlow, Keras, PyTorch, Transfer Learning.\r\n•Data Modeling: Bayesian Analysis, Statistical Inference, Predictive Modeling, Stochastic Modeling, Linear Modeling, Behavioral Modeling, Probabilistic Modeling, Time-Series analysis.\r\n•Soft Skills: Excellent communication and presentation skills. Ability to work well with stakeholders to discern needs. 
Leadership, mentoring.\r\n•Other Programming Languages & Skills: APIs, C++, Java, Linux, Kubernetes, Back-End, Databases.\r\n\r\nWORK EXPERIENCE\r\nBank of America, Charlotte, NC February 2020 - Present\r\nSenior Data Scientist\r\n\r\nAt Bank of America, I worked as a Natural Language Processing expert and model architect where I built, trained, and tested multiple Natural Language Processing models which classified user descriptions and wrote SQL code based on user questions. The goal of the project was to centralize and search for Splunk dashboards within the Bank of America network, and to create an A.I. assistant to automate the coding process to extract information from these dashboards.\r\n\r\n•Used Python and SQL to collect, explore, analyze the structured/unstructured data.\r\n•Used Python, NLTK, and Tensorflow to tokenize and pad comments/tweets and vectorize.\r\n•Vectorized the documents using Bag of Words, TF-IDF, Word2Vec, and GloVe to test the performance it had on each model.\r\n•Created and trained an Artificial Neural Network with TensorFlow on the tokenized documents/articles/SQL/user inputs.\r\n•Performed Named Entity Recognition (NER) by utilizing ANNs, RNNs, LSTMs, and Transformers.\r\n•Involved in model deployment using Flask with a REST API deployed on internal Bank of America systems.\r\n•Wrote extensive SQL queries to extract data from the MySQL database hosted on Bank of America internal servers.\r\n•Built a deep learning model for text classification and analysis.\r\n•Performed classification on text data using NLP fundamental concepts including tokenization, stemming, lemmatization, and padding.\r\n•Performed EDA using Pandas library in Python to inspect and clean the data.\r\n•Visualized the data using matplotlib and seaborn.\r\n•Explored using word embedding techniques such as Word2Vec, GloVe, and Bert.\r\n•Built an ETL pipeline that could read data from multiple macros, processed it using self-made preprocessing functions, and stored the processed data on a separate internal server.\r\n•Automated ETL tasks and scheduling using self-built data pull-request functions.\r\n\r\nDominion Energy, Richmond, VA June 2017 – February 2020\r\nData Scientist / ML Ops Engineer\r\n\r\nWorked as a Data Scientist for a large American power and energy company headquartered in Richmond, Virginia that supplies electricity and natural gas to various states. Member of a small team of data scientists and analysts where we created numerous demand forecasting models from Dominion’s historical data hosted on Hadoop HDFS and Hive to estimate short-term demand peaks for optimizing economic load dispatch. 
Models were built using Time Series analysis using algorithms like ARIMA, SARIMA, ARIMAX, and Facebook Prophet.\r\n\r\n•Endeavored multiple approaches for predicting day ahead energy demand with Python, including exponential smoothing, ARIMA, Prophet, TBATS, and RNNs (LSTM).\r\n•Successfully built a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) using PyFlux, to model the uncertainty of Dominion’s other time series, ensuring a ‘safety’ stock of generating units.\r\n•Incorporated geographical and socio-economic data scraped from outside resources to improve accuracy.\r\n•Incessantly validated models using a train-validate-test split to ensure forecasting was sufficient to elevate optimal output of the number of generation facilities to meet system load.\r\n•Prevented over-fitting with the use of a validation set while training.\r\n•Built a meta-model to ensemble the predictions of several different models.\r\n•Performed feature engineering with the use of NumPy, Pandas, and FeatureTools to engineer time-series features.\r\n•Coordinated with facility engineers to understand the problem and ensure our predictions were beneficial.\r\n•Participated in daily standups working under an Agile KanBan environment.\r\n•Queried Hive by utilizing Spark through the use of Python’s PySpark Library.\r\n\r\nCargill, Minneapolis, MN June 2015 – June 2017\r\nComputer Vision Engineer\r\n\r\nCargill is an American privately held international food conglomerate; major businesses are trading, purchasing and distributing grain and other agricultural commodities. Our team used CNNs with Computer Vision to build the Machine Learning model to detect unhealthy hydrophytes. Our model helped regulators work more efficiently by detecting unhealthy hydrophytes in hydroponic farming automatically, and increased their harvesting rate which increased their revenue.\r\n\r\n•Performed statistical analysis and built statistical models in R and Python using various supervised and unsupervised Machine Learning algorithms like Regression, Decision Trees, Random Forests, Support Vector Machines, K-Means Clustering, and dimensionality reduction.\r\n•Used MLlib, Spark's Machine Learning library, to build and evaluate different models.\r\n•Defined the list codes and code conversions between the source systems and the data mart enterprise metadata library with any changes or updates.\r\n•Developed Ridge regression model to predict energy consumption of customers. Evaluated model using Mean Absolute Percent Error (MAPE).\r\n•Developed and enhanced statistical models by leveraging best-in-class modeling techniques.\r\n•Developed a predictive model and validated Neural Network Classification model for predicting the feature label.\r\n•Implemented logistic regression to model customer default and identified factors that were good predictors.\r\n•Designed a model to predict if a customer would respond to marketing campaign based on customer information.\r\n•Developed Random Forest and logistic regression models to observe this classification. Fine-tuned models to obtain more recall than accuracy. Tradeoff between False Positives and False Negatives.\r\n•Evaluated and optimized performance of models by tuning parameters with K-Fold Cross Validation.\r\n\r\nHilton Hotels, McLean, VA April 2014 – June 2015\r\nData Analyst\r\n\r\nWorked with NLP to classify text with data draw from a big data system. The text categorization involved labeling natural language texts with relevant categories from a predefined set. 
One goal was to target users by automated classification. In this way we could create cohorts to improve marketing. The NLP text analysis monitored, tracked, and classified user discussion about product and/or service in online discussion. The Machine Learning classifier was trained to identify whether a cohort was a promoter or a detractor. Overall, the project improved marketing ROI and customer satisfaction. Also incorporated a Churn Analysis model to examine repeat business/dropoff.\r\n\r\n•Worked the entire production cycle to extract and display metadata from various assets and helped develop a report display that was easy to grasp and gain insights from.\r\n•Collaborated with both the Research and Engineering teams to productionize the application.\r\n•Assisted various teams in bringing prototyped assets into production.\r\n•Applied data mining techniques and optimization techniques standard to B2B and B2C industries, and applied Machine Learning, Data/Text Mining, Statistical Analysis and Predictive Modeling.\r\n•Utilized MapReduce/PySpark Python modules for Machine Learning and predictive analytics on AWS.\r\n•Implemented assets and scripts for various projects using R, Java, and Python.\r\n•Built sustainable rapport with senior leaders.\r\n•Developed and maintained Data Dictionary to create metadata reports for technical and business purposes.\r\n•Built and maintained dashboard and reporting based on the statistical models to identify and track key metrics and risk indicators.\r\n•Kept up to date with latest NLP methodologies by reading 10 to 15 articles and whitepapers per week.\r\n•Extracted source data from Oracle tables, MS SQL Server, sequential files, and Excel sheets.\r\n•Parsed and manipulated raw, complex data streams to prepare for loading into an analytical tool.\r\n•Involved in defining the source to target data mappings, business rules, and data definitions.\r\n•Project environment was AWS and Linux.\r\n\r\nEDUCATION\r\nBachelor of Arts - Data Science - University of California, Berkeley\r\n\t\t\t\t\tContact this candidate\r\n\t\t\t\t\t\r\n\t\t\t\t"
Subsetting the data
A total of 2527 Data Science related job postings are available for the candidate to consider. However, only the first 300 are evaluated here to save on computation time. The job descriptions are stored as raw text in a new dataframe, “des_all”.
##Multiple Job postings at once (Corpus)
#One row of posting
postings<-300
des_all<-subset(jobs,select=c(3))
#des_all<-data.frame(jobs$job_description)
des_all<-des_all[1:postings,]
head(des_all)
## # A tibble: 6 × 1
## job_description
## <chr>
## 1 "Polypore International, an Asahi Kasei Group Company, is a leading technolog…
## 2 "Key Responsibilities: Beacon is seeking a Data Scientist to join the organiz…
## 3 "The Challenge: Are you excited at the prospect of unlocking the secrets held…
## 4 "Position Summary As a Senior Data Scientist, you develop the next generation…
## 5 "The kind of person we're looking for: We're looking for an energetic, though…
## 6 "The Data Scientist is responsible for collecting, cleaning, translating data…
Cleaning The Text
The variable “resume”, which contains the resume text, is appended as the last row of the des_all dataframe (after all the job postings in the preceding rows). In preparation for NLP, the text is processed by: 1) removing unnecessary symbols and notation with regular expressions; 2) converting all letters to lower case; 3) removing numbers; 4) removing punctuation; 5) removing English stop words; 6) lemmatizing each string to reduce every word to its base form.
#adding resume text as doc_id last
des_all<-rbind(des_all,resume)
des_all$job_description<-des_all$job_description%>%
str_replace_all(pattern="\n",replacement=" ")%>%
str_replace_all(pattern="www+|com|@\\S+|#\\S+|http|\\*|\\s[A-Z]\\s|\\s[a-z]\\s|\\d|�+",replacement=" ")
des_all$job_description<-tolower(des_all$job_description)
des_all$job_description<-removeNumbers(des_all$job_description)
des_all$job_description<-removePunctuation(des_all$job_description)
#des_all$job_description<-stripWhitespace(des_all$job_description)
des_all$job_description<-removeWords(des_all$job_description,stopwords("en"))
des_all$job_description<-sapply(des_all$job_description,lemmatize_strings)
head(des_all)
## # A tibble: 6 × 1
## job_description
## <chr>
## 1 polypore international asahi kasei group company lead technology pany special…
## 2 key responsibility beacon seek datum scientist join organization serve key pl…
## 3 challenge excite prospect unlock secret hold datum set fascinate possibility …
## 4 position summary senior datum scientist develop next generation supply chain …
## 5 kind person look look energetic thoughtful intelligent creative thinker join …
## 6 datum scientist responsible collect clean translate datum meet panys need eve…
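Note that the lemmatizer maps each word to its dictionary form, which is why “data” appears as “datum” in the cleaned text above. A quick, hedged illustration on a made-up sentence:
#Quick illustration of lemmatization (toy sentence, not project data)
lemmatize_strings("data scientists are analyzing statistical models")
#expected to return something like "datum scientist be analyze statistical model"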
Term Matrix
The job descriptions are stored in a Volatile Corpus and the words are tokenized into a document-term matrix. The term frequency per job description is recorded in the matrix, and the terms are weighted using term frequency-inverse document frequency (tf-idf). The tf-idf weighting offsets the number of times a term appears in a document by the number of documents in the corpus that contain the term. This prevents a term from looking significant simply because it occurs often, for example because one document has more text than the others.
des_all_df<-data.frame(
doc_id=1:(postings+1),
text=des_all$job_description
)
Corpus=VCorpus(DataframeSource(des_all_df))
tf<-DocumentTermMatrix(Corpus,control=list(weighting=weightTf))
tfidf<-DocumentTermMatrix(Corpus,control=list(weighting=weightTfIdf))
inspect(tf)
## <<DocumentTermMatrix (documents: 301, terms: 8883)>>
## Non-/sparse entries: 62137/2611646
## Sparsity : 98%
## Maximal term length: 73
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs business datum experience learn model science team use will work
## 100 2 17 5 0 0 4 3 5 6 16
## 121 0 17 8 10 4 5 9 0 10 8
## 179 3 34 1 4 2 5 1 8 12 5
## 232 4 15 10 3 9 4 6 1 2 8
## 268 5 7 10 2 1 2 5 1 1 16
## 301 5 34 5 16 38 3 6 27 0 8
## 62 2 16 5 6 5 4 4 0 7 5
## 63 0 43 16 7 3 34 3 5 14 7
## 91 0 13 7 2 5 1 7 7 4 10
## 98 6 12 3 3 5 2 4 7 4 4
inspect(tfidf)
## <<DocumentTermMatrix (documents: 301, terms: 8883)>>
## Non-/sparse entries: 62137/2611646
## Sparsity : 98%
## Maximal term length: 73
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample :
## Terms
## Docs business covid datum health hire model
## 120 0.000000000 0.00000000 0.0008532882 0.00000000 0.000000000 0.000000000
## 15 0.000000000 0.00000000 0.0025206677 0.00000000 0.000000000 0.000000000
## 195 0.000000000 0.00000000 0.0011771435 0.00000000 0.000000000 0.000000000
## 232 0.004337884 0.00000000 0.0040408740 0.00000000 0.000000000 0.008921024
## 242 0.000000000 0.00000000 0.0015782057 0.00000000 0.000000000 0.001935666
## 275 0.000000000 0.00000000 0.0047983114 0.00000000 0.008572608 0.000000000
## 34 0.000000000 0.02430545 0.0003998593 0.00000000 0.000000000 0.000000000
## 53 0.000000000 0.00000000 0.0022669988 0.00000000 0.008100388 0.000000000
## 54 0.000000000 0.00000000 0.0040531191 0.02728102 0.000000000 0.007456714
## 68 0.000000000 0.00000000 0.0034968087 0.00000000 0.000000000 0.000000000
## Terms
## Docs pany product response surge
## 120 0.000000000 0.009251992 0 0
## 15 0.005885090 0.003416372 0 0
## 195 0.010993271 0.000000000 0 0
## 232 0.001257916 0.000000000 0 0
## 242 0.004912922 0.000000000 0 0
## 275 0.000000000 0.004335575 0 0
## 34 0.001867131 0.000000000 0 0
## 53 0.000000000 0.000000000 0 0
## 54 0.000000000 0.000000000 0 0
## 68 0.000000000 0.000000000 0 0
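To make the weighting concrete, here is a small, hedged sketch on toy documents (not project data) showing how a single normalized tf-idf entry is derived: the term count divided by the number of terms in the document, multiplied by log2 of the number of documents over the number of documents containing the term.
#Illustrative sketch: derive one normalized tf-idf weight by hand (toy corpus)
toy<-VCorpus(VectorSource(c("data science job","data job data","science role")))
toy_tfidf<-DocumentTermMatrix(toy,control=list(weighting=weightTfIdf))
#Manual weight of "science" in document 1: tf = 1/3, idf = log2(3 docs / 2 docs containing it)
(1/3)*log2(3/2)
#Should match (up to floating point) the corresponding matrix entry
as.matrix(toy_tfidf)["1","science"]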
TF-IDF Cosine Similarity
The similarity between each job posting's description and the candidate's resume is assessed using cosine similarity. Mathematically, cosine similarity measures the cosine of the angle between two vectors; the closer the output is to 1, the more similar the documents are. The lsa package is used to calculate the cosine matrix. Since the resume text is stored in the last row of the matrix, the last row of the cosine similarity output compares the resume to all the job postings.
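Before applying it to the full matrix, the measure itself can be illustrated on two toy term-weight vectors (made-up numbers, not project data):
#Minimal illustration of cosine similarity on toy vectors
a<-c(1,0,2,3)
b<-c(0,1,2,1)
sum(a*b)/(sqrt(sum(a^2))*sqrt(sum(b^2)))   #manual formula, roughly 0.76
cosine(a,b)                                #same value via the lsa package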
#cosine() compares columns, so transpose the matrix: terms become rows, documents become columns
tfidf_a<-as.matrix(tfidf)
tfidf_a<-transpose(data.frame(tfidf_a))
tfidf_a<-as.matrix(tfidf_a)
cos_df<-data.frame(cosine(tfidf_a))
#the last row compares the resume (document 301) to every job posting
resume_similarity<-cos_df[(postings+1),]
head(resume_similarity)
## V1 V2 V3 V4 V5 V6 V7
## V301 0.1416551 0.07769467 0.1470682 0.1108904 0.1207849 0.1089555 0.08929687
## V8 V9 V10 V11 V12 V13
## V301 0.07795933 0.09617149 0.1022105 0.07522777 0.05413861 0.05010286
## V14 V15 V16 V17 V18 V19
## V301 0.05302425 0.05430907 0.09197351 0.1060665 0.07356662 0.07220639
## V20 V21 V22 V23 V24 V25 V26
## V301 0.09708145 0.1456521 0.05204723 0.06929169 0.07617128 0.08086151 0.0554781
## V27 V28 V29 V30 V31 V32 V33
## V301 0.07167883 0.08772479 0.08438069 0.04300432 0.0464312 0.1283961 0.09102716
## V34 V35 V36 V37 V38 V39
## V301 0.02580397 0.06711945 0.1294315 0.05802339 0.07459732 0.06155406
## V40 V41 V42 V43 V44 V45
## V301 0.04910052 0.07197415 0.1510529 0.05628731 0.08363072 0.05485513
## V46 V47 V48 V49 V50 V51 V52
## V301 0.08944897 0.04737076 0.1245482 0.05575244 0.0996943 0.05733874 0.07921441
## V53 V54 V55 V56 V57 V58 V59
## V301 0.0324837 0.03008588 0.1146318 0.05723049 0.1434239 0.0632432 0.08114656
## V60 V61 V62 V63 V64 V65
## V301 0.08693144 0.0373362 0.05764025 0.08077321 0.06960547 0.06249381
## V66 V67 V68 V69 V70 V71 V72
## V301 0.08397868 0.04738006 0.02321146 0.03304222 0.1323106 0.1315139 0.1183743
## V73 V74 V75 V76 V77 V78 V79
## V301 0.05484786 0.101101 0.100935 0.09970921 0.1320837 0.0624572 0.07594105
## V80 V81 V82 V83 V84 V85
## V301 0.06359105 0.05767434 0.07778547 0.08371811 0.06836139 0.06726932
## V86 V87 V88 V89 V90 V91
## V301 0.04396347 0.04029002 0.02586653 0.04269718 0.04788154 0.04924198
## V92 V93 V94 V95 V96 V97 V98
## V301 0.07188841 0.074638 0.08510355 0.1326032 0.07139971 0.03304222 0.1301215
## V99 V100 V101 V102 V103 V104 V105
## V301 0.05548018 0.04143765 0.09534898 0.05544516 0.1296061 0.05794908 0.1011755
## V106 V107 V108 V109 V110 V111
## V301 0.05424914 0.1008594 0.06628381 0.08212385 0.0006584202 0.07623733
## V112 V113 V114 V115 V116 V117
## V301 0.01220999 0.08725037 0.0006584202 0.06295279 0.02861998 0.04715003
## V118 V119 V120 V121 V122 V123
## V301 0.06791097 0.06712756 0.03551908 0.06731936 0.02618158 0.0006584202
## V124 V125 V126 V127 V128 V129
## V301 0.03431915 0.0006584202 0.05663141 0.0451591 0.05902404 0.0006584202
## V130 V131 V132 V133 V134 V135 V136
## V301 0.0525206 0.04763502 0.06171601 0.03998052 0.09928592 0.1308348 0.02761613
## V137 V138 V139 V140 V141 V142
## V301 0.06655464 0.0006584202 0.08050519 0.06233227 0.09019632 0.0006584202
## V143 V144 V145 V146 V147 V148
## V301 0.055537 0.05199758 0.04770371 0.04767373 0.06029579 0.04866286
## V149 V150 V151 V152 V153 V154
## V301 0.0006584202 0.0006584202 0.04067046 0.09415757 0.05117887 0.07445718
## V155 V156 V157 V158 V159 V160 V161
## V301 0.05695702 0.1194858 0.09758605 0.0813059 0.0292141 0.05920644 0.08735253
## V162 V163 V164 V165 V166 V167 V168
## V301 0.08706643 0.1667365 0.05595936 0.03197698 0.1464094 0.06115246 0.09214396
## V169 V170 V171 V172 V173 V174
## V301 0.04929526 0.02877888 0.0638164 0.07177846 0.0006584202 0.08717067
## V175 V176 V177 V178 V179 V180
## V301 0.03400786 0.0006584202 0.04311297 0.1229501 0.07525265 0.06012819
## V181 V182 V183 V184 V185 V186
## V301 0.02687784 0.04273866 0.07923166 0.0590084 0.09158149 0.07473451
## V187 V188 V189 V190 V191 V192
## V301 0.0006584202 0.04663357 0.1076789 0.0006584202 0.1262571 0.06115601
## V193 V194 V195 V196 V197 V198
## V301 0.03167277 0.02638287 0.01800265 0.09340836 0.04555181 0.04891775
## V199 V200 V201 V202 V203 V204
## V301 0.09117405 0.1017901 0.05430446 0.06054871 0.0006584202 0.0006584202
## V205 V206 V207 V208 V209 V210 V211
## V301 0.05909225 0.1382049 0.1926756 0.01220999 0.02961342 0.07180154 0.0704373
## V212 V213 V214 V215 V216 V217
## V301 0.0522907 0.0006584202 0.06753681 0.1141436 0.03727778 0.05957519
## V218 V219 V220 V221 V222 V223
## V301 0.02585522 0.2210862 0.08652779 0.07693054 0.04616294 0.01220999
## V224 V225 V226 V227 V228 V229
## V301 0.07082259 0.0778447 0.0006584202 0.05391029 0.04877406 0.1118834
## V230 V231 V232 V233 V234 V235
## V301 0.0852164 0.0006584202 0.07599066 0.1199132 0.03112881 0.03235892
## V236 V237 V238 V239 V240 V241 V242
## V301 0.03157602 0.05542685 0.108302 0.1430948 0.0486259 0.0006584202 0.04118837
## V243 V244 V245 V246 V247 V248
## V301 0.0591698 0.06370276 0.0006584202 0.07662721 0.0006584202 0.07901934
## V249 V250 V251 V252 V253 V254
## V301 0.0006584202 0.05804969 0.0006584202 0.09182605 0.06979577 0.04603223
## V255 V256 V257 V258 V259 V260 V261
## V301 0.0521297 0.037593 0.1020849 0.0006584202 0.03857487 0.02910581 0.04502476
## V262 V263 V264 V265 V266 V267 V268
## V301 0.07302718 0.09399763 0.1483487 0.01220999 0.03864722 0.1076212 0.08008692
## V269 V270 V271 V272 V273 V274
## V301 0.02862698 0.09645855 0.05278978 0.05334496 0.05514225 0.03865138
## V275 V276 V277 V278 V279 V280
## V301 0.0264474 0.08242506 0.0006584202 0.06504287 0.03591798 0.09838319
## V281 V282 V283 V284 V285 V286 V287
## V301 0.0611253 0.147724 0.1261354 0.0006584202 0.03095455 0.04303923 0.03662237
## V288 V289 V290 V291 V292 V293
## V301 0.02654855 0.05788278 0.0006584202 0.09938876 0.07476916 0.06727576
## V294 V295 V296 V297 V298 V299
## V301 0.04089304 0.0006584202 0.06449686 0.05204562 0.04311112 0.05768125
## V300 V301
## V301 0.1519766 1
Recommendation Dataframe
The job posting dataframe is re-arranged based on the cosine similarity output: the higher the row number, the lower the similarity between the resume and the job posting. The new dataframe is called “rec_df”.
#strip the V1..V301 names and rank the postings by decreasing similarity to the resume
names(resume_similarity)<-NULL
list<-unlist(c(resume_similarity))
order<-order(list,decreasing=TRUE)
#drop the first index, which is the resume matched against itself (similarity = 1)
order<-order[-c(1)]
doc_ID<-data.frame(order)
rec_df<-doc_ID
colnames(rec_df)<-c("doc_ID")
rec_df<-rec_df%>%
mutate(job_title=jobs[order,2])%>%
mutate(min_salary=jobs[order,4])%>%
mutate(max_salary=jobs[order,5])%>%
mutate(city=jobs[order,6])%>%
mutate(state=jobs[order,7])%>%
mutate(company_name=jobs[order,8])%>%
mutate(company_industry=jobs[order,9])%>%
mutate(company_rating=jobs[order,10])%>%
mutate(bachelors=jobs[order,11])%>%
mutate(masters=jobs[order,12])%>%
mutate(PHD=jobs[order,13])
rec_df<-unnest(rec_df)
head(rec_df)
## # A tibble: 6 × 12
## doc_ID job_title min_salary max_salary city state company_name
## <int> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 219 Machine Learning Engineer 122998 151838 San … CA price.com
## 2 219 Data Scientist NA NA Palo… CA Landing AI
## 3 219 Data Scientist 118964 187172 Bris… CA Nomis Solut…
## 4 219 Senior Data Scientist 150691 151132 Menl… CA Quantifind
## 5 219 Chief Data Scientist NA NA Remo… Remo… Espire Serv…
## 6 219 Data Scientist II NA NA San … CA EDI Special…
## # … with 5 more variables: company_industry <chr>, company_rating <dbl>,
## # bachelors <dbl>, masters <dbl>, PHD <dbl>
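The same table could also be assembled more directly by indexing the jobs dataframe with the similarity ordering; the following is a minimal sketch under the assumption that the column positions used in the mutate() chain above (2 and 4 through 13) stay fixed.
#Alternative sketch: build the recommendation table by direct row indexing
rec_df_alt<-cbind(doc_ID=order,jobs[order,c(2,4:13)])
head(rec_df_alt)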
Visualization
The following section defines a function, rec(), which returns the ranking-th recommended job posting along with a word cloud and a Venn diagram showing which terms the resume and that posting share. The first 3 recommendations are shown.
#understanding the terms that are most relevant
rec<-function(ranking){
#document number of the ranking-th recommendation (taken from the similarity ordering)
doc_num<-order[ranking]
z<-data.frame(as.matrix(tfidf))
#row 1: recommended job posting, row 2: candidate resume (last document in the matrix)
compare1<-rbind(z[doc_num,],z[nrow(z),])
comp1<-compare1%>%
mutate(row_n=1:n())%>%
select_if(function(x) any(x!=0))
comp1_t<-transpose(comp1)
colnames(comp1_t)<-c("Job","Resume")
comp1_t$terms<-colnames(comp1)
comp1_matches<-comp1_t[comp1_t$Resume!=0 & comp1_t$Job!=0,]
rownames(comp1_matches)<-NULL
#drop the last row, which corresponds to the helper column row_n
comp1_matches<-comp1_matches[-c(nrow(comp1_matches)),]
comp1_matches_n<-nrow(comp1_matches)
Job1_diff<-comp1_t[comp1_t$Resume==0 & comp1_t$Job!=0,]
Job1_diff_n<-nrow(Job1_diff)
Resume1_diff<-comp1_t[comp1_t$Resume!=0 & comp1_t$Job==0,]
Resume1_diff_n<-nrow(Resume1_diff)
#Venn Diagram of common and different words between resume and job posting.
grid.newpage()
draw.pairwise.venn(Resume1_diff_n+comp1_matches_n, Job1_diff_n+comp1_matches_n, comp1_matches_n,
category = c("Terms in your resume", "Terms in Job Posting"),
lty = rep("blank", 2), fill = c("light blue", "pink"), alpha = rep(0.5, 2),
cat.pos = c(0, 0), cat.dist = rep(0.025, 2), scaled = FALSE)
#Word Cloud for top recommended job
comp1_matches_adjust<-comp1_matches%>%
mutate(adjust=Job*1000)   #scale the job-posting weights so wordcloud's min.freq works
set.seed(1234)
wordcloud(words = comp1_matches_adjust$terms, freq = comp1_matches_adjust$adjust, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
return(rec_df[ranking,2:9])
}
Test cases
rec(1) returns the top recommended job posting; rec(2) and rec(3) return the second and third.
rec(1)
## # A tibble: 1 × 8
## job_title min_salary max_salary city state company_name company_industry
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Machine Lear… 122998 151838 San F… CA price.com <NA>
## # … with 1 more variable: company_rating <dbl>
rec(2)
## # A tibble: 1 × 8
## job_title min_salary max_salary city state company_name company_industry
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Data Scientist NA NA Palo Alto CA Landing AI Information Tec…
## # … with 1 more variable: company_rating <dbl>
rec(3)
## # A tibble: 1 × 8
## job_title min_salary max_salary city state company_name company_industry
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Data Scientist 118964 187172 Brisbane CA Nomis Solut… Information Tec…
## # … with 1 more variable: company_rating <dbl>
Create the Shiny UI
Authenticate the user
Gather Inputs
Present visualizations
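The Shiny layer itself lives outside this markdown file. As a rough outline of the steps above, a minimal sketch of a UI/server pair is shown below; every widget name and default here is an illustrative assumption, not the deployed app.
#Minimal Shiny sketch (illustrative assumptions only; not the deployed app)
library(shiny)
ui<-fluidPage(
  titlePanel("Data Science Job Recommender"),
  sidebarLayout(
    sidebarPanel(
      textInput("resume_url","PostJobFree resume URL"),      #gather inputs
      numericInput("n_jobs","Number of recommendations",5)
    ),
    mainPanel(tableOutput("recommendations"))                #present the results
  )
)
server<-function(input,output){
  output$recommendations<-renderTable({
    head(rec_df,input$n_jobs)   #assumes rec_df is built as in the steps above
  })
}
#shinyApp(ui,server)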
Conclusion
Future Improvement