Final Project: Job Recommender
Introduction: This project builds a basic job recommendation system using a content-based approach.
Purpose 1: In this project, we present a recommender system designed for job seekers in Data Science. The recommender aims to surface the jobs and companies most relevant to a target candidate. To meet this objective, job descriptions and candidate resumes are examined along with other user inputs. The recommendation approach is content-based and relies on natural language processing. The dataset consists of job postings scraped from Glassdoor and resumes from PostJobFree.
Instructions
This project is a proof-of-concept (POC) with certain assumptions about the data. For this implementation, Purpose 1 is demonstrated in the markdown file below, which shows step by step how the text data is processed. Purpose 2 is presented in the Shiny App, which lets the user manipulate and filter settings to gain more insight into today's job market.
Load the libraries
library(tidyverse)
library(tidytext)
library(httr)
library(rvest)
library(stringr)
library(readr)
library(tm)
library(slam)
library(dplyr)
library(tidyr)
library(textstem)
library(lsa)
library(data.table)
library(VennDiagram)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)
Recommender System: Cosine Similarity
Load the data
The scraped job postings are stored in a dataframe and read in below. A sample resume is scraped, and its raw text is stored in the variable “resume”.
library(readr)
urlfile<-"https://raw.githubusercontent.com/baruab/Team2_Project_3_607/main/job_posting.csv"
jobs<-read_csv(url(urlfile))
#url_res<-"https://www.postjobfree.com/resume/adktqz/senior-data-scientist-brooklyn-ny"
#url_res<-"https://www.postjobfree.com/resume/adk07o/data-science-new-york-ny"
#url_res<-"https://www.postjobfree.com/resume/adol8d/data-scientist-new-york-ny"
#url_res<-"https://www.postjobfree.com/resume/adost3/data-scientist-new-york-ny"
url_res<-"https://www.postjobfree.com/resume/adonl3/data-scientist-charlotte-nc"
#url_res<-"https://www.postjobfree.com/resume/ado61j/data-scientist-arlington-va"
#url_res<-"https://www.postjobfree.com/resume/adol8d/data-scientist-new-york-ny"
web<- read_html(url_res)
resume<-web %>%html_nodes(".normalText")%>%html_text()
head(jobs)
## # A tibble: 6 × 13
## ...1 job_title job_description min_salary max_salary city state company_name
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 Data Sci… "Polypore Inte… 71310 122975 Char… NC Polypore In…
## 2 2 Data Sci… "Key Responsib… NA NA Not … MA Van Pool Tr…
## 3 3 Data Sci… "The Challenge… 69795 111477 Norf… VA Booz Allen …
## 4 4 Senior D… "Position Summ… 75217 121211 Beac… OH Penske
## 5 5 Data Sci… "The kind of p… 74839 112212 Newt… MA Paytronix S…
## 6 6 Data Sci… "The Data Scie… 80867 133796 Trum… CT HPOne
## # … with 5 more variables: company_industry <chr>, company_rating <dbl>,
## # bachelors <dbl>, masters <dbl>, phd <dbl>
head(resume)## [1] "\r\n\t\t\t\t\tNicholas Kim\r\nData Scientist\r\nP: 980-***-****\r\nG: adonl3@r.postjobfree.com\r\nPROFESSIONAL SUMMARY\r\nData Scientist with 7+ years’ experience processing and analyzing data across a variety of industries. Leverages various mathematical, statistical, and Machine Learning tools to collaboratively synthesize business insights and drive innovative solutions for productivity, efficiency, and revenue.\r\n\r\n•Experience applying statistical models on big data sets using cloud-based cluster computing assets with AWS, Azure, and other Unix-based architectures.\r\n•Experience applying Bayesian Techniques, Advanced Analytics, Neural Networks and Deep Neural Networks, Support Vector Machines (SVMs), and Decision Trees with Random Forest ensemble.\r\n•Experience implementing industry standard analytics within specific domains and applying data science techniques to expand these methods using Natural Language Processing, implementing clustering algorithms, and deriving insight.\r\n•In-depth knowledge of statistical procedures that are applied in both Supervised and Unsupervised Machine Learning problems.\r\n•Machine Learning techniques to promote marketing and merchandising ideas.\r\n•Proven creative thinker with a strong ability to devise and propose novel ways to look at and approach problems using a combination of business acumen and mathematical methods.\r\n•Identification of patterns in data and using experimental and iterative approaches to validate findings.\r\n•Advanced predictive modeling techniques to build, maintain, and improve on real-time decision systems.\r\n•Contributed to advanced analytical teams to design, build, validate, and re-train models.\r\n•Excellent communication skills (verbal and written) to communicate with clients, stakeholders, and team members.\r\n•Ability to quickly gain an understanding of niche subject matter domains, and design and implement effective novel solutions to be used by other subject matter experts.\r\n\r\nTECHNICAL SKILLS\r\n•Analytic Development: Python, R, Spark, SQL.\r\n•Python Packages: NumPy, Pandas, Scikit-learn, TensorFlow, Keras, PyTorch, Fastai, SciPy, Matplotlib, Seaborn, Numba.\r\n•Programming Tools: Jupyter, RStudio, Github, Git.\r\n•Cloud Computing: Amazon Web Services (AWS), Azure, Google Cloud Platform (GCP)\r\nMachine Learning, Natural Language Processing & Understanding, Machine Intelligence, Machine Learning algorithms.\r\n•Analysis Methods: Forecasting, Predictive, Statistical, Sentiment, Exploratory and Bayesian Analysis. Regression Analysis, Linear models, Multivariate analysis, Sampling methods, Clustering.\r\n•Applied Data Science: Natural Language Processing, Machine Learning, Social Analytics, Predictive Maintenance, Chatbots, Interactive Dashboards.\r\n•Artificial Intelligence: Classification and Regression Trees (CART), Support Vector Machine, Random Forest, Gradient Boosting Machine (GBM), TensorFlow, PCA, Regression, Naïve Bayes.\r\n•Natural Language Processing: Text analysis, classification, chatbots.\r\n•Deep Learning: Machine Perception, Data Mining, Machine Learning, Neural Networks, TensorFlow, Keras, PyTorch, Transfer Learning.\r\n•Data Modeling: Bayesian Analysis, Statistical Inference, Predictive Modeling, Stochastic Modeling, Linear Modeling, Behavioral Modeling, Probabilistic Modeling, Time-Series analysis.\r\n•Soft Skills: Excellent communication and presentation skills. Ability to work well with stakeholders to discern needs. 
Leadership, mentoring.\r\n•Other Programming Languages & Skills: APIs, C++, Java, Linux, Kubernetes, Back-End, Databases.\r\n\r\nWORK EXPERIENCE\r\nBank of America, Charlotte, NC February 2020 - Present\r\nSenior Data Scientist\r\n\r\nAt Bank of America, I worked as a Natural Language Processing expert and model architect where I built, trained, and tested multiple Natural Language Processing models which classified user descriptions and wrote SQL code based on user questions. The goal of the project was to centralize and search for Splunk dashboards within the Bank of America network, and to create an A.I. assistant to automate the coding process to extract information from these dashboards.\r\n\r\n•Used Python and SQL to collect, explore, analyze the structured/unstructured data.\r\n•Used Python, NLTK, and Tensorflow to tokenize and pad comments/tweets and vectorize.\r\n•Vectorized the documents using Bag of Words, TF-IDF, Word2Vec, and GloVe to test the performance it had on each model.\r\n•Created and trained an Artificial Neural Network with TensorFlow on the tokenized documents/articles/SQL/user inputs.\r\n•Performed Named Entity Recognition (NER) by utilizing ANNs, RNNs, LSTMs, and Transformers.\r\n•Involved in model deployment using Flask with a REST API deployed on internal Bank of America systems.\r\n•Wrote extensive SQL queries to extract data from the MySQL database hosted on Bank of America internal servers.\r\n•Built a deep learning model for text classification and analysis.\r\n•Performed classification on text data using NLP fundamental concepts including tokenization, stemming, lemmatization, and padding.\r\n•Performed EDA using Pandas library in Python to inspect and clean the data.\r\n•Visualized the data using matplotlib and seaborn.\r\n•Explored using word embedding techniques such as Word2Vec, GloVe, and Bert.\r\n•Built an ETL pipeline that could read data from multiple macros, processed it using self-made preprocessing functions, and stored the processed data on a separate internal server.\r\n•Automated ETL tasks and scheduling using self-built data pull-request functions.\r\n\r\nDominion Energy, Richmond, VA June 2017 – February 2020\r\nData Scientist / ML Ops Engineer\r\n\r\nWorked as a Data Scientist for a large American power and energy company headquartered in Richmond, Virginia that supplies electricity and natural gas to various states. Member of a small team of data scientists and analysts where we created numerous demand forecasting models from Dominion’s historical data hosted on Hadoop HDFS and Hive to estimate short-term demand peaks for optimizing economic load dispatch. 
Models were built using Time Series analysis using algorithms like ARIMA, SARIMA, ARIMAX, and Facebook Prophet.\r\n\r\n•Endeavored multiple approaches for predicting day ahead energy demand with Python, including exponential smoothing, ARIMA, Prophet, TBATS, and RNNs (LSTM).\r\n•Successfully built a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) using PyFlux, to model the uncertainty of Dominion’s other time series, ensuring a ‘safety’ stock of generating units.\r\n•Incorporated geographical and socio-economic data scraped from outside resources to improve accuracy.\r\n•Incessantly validated models using a train-validate-test split to ensure forecasting was sufficient to elevate optimal output of the number of generation facilities to meet system load.\r\n•Prevented over-fitting with the use of a validation set while training.\r\n•Built a meta-model to ensemble the predictions of several different models.\r\n•Performed feature engineering with the use of NumPy, Pandas, and FeatureTools to engineer time-series features.\r\n•Coordinated with facility engineers to understand the problem and ensure our predictions were beneficial.\r\n•Participated in daily standups working under an Agile KanBan environment.\r\n•Queried Hive by utilizing Spark through the use of Python’s PySpark Library.\r\n\r\nCargill, Minneapolis, MN June 2015 – June 2017\r\nComputer Vision Engineer\r\n\r\nCargill is an American privately held international food conglomerate; major businesses are trading, purchasing and distributing grain and other agricultural commodities. Our team used CNNs with Computer Vision to build the Machine Learning model to detect unhealthy hydrophytes. Our model helped regulators work more efficiently by detecting unhealthy hydrophytes in hydroponic farming automatically, and increased their harvesting rate which increased their revenue.\r\n\r\n•Performed statistical analysis and built statistical models in R and Python using various supervised and unsupervised Machine Learning algorithms like Regression, Decision Trees, Random Forests, Support Vector Machines, K-Means Clustering, and dimensionality reduction.\r\n•Used MLlib, Spark's Machine Learning library, to build and evaluate different models.\r\n•Defined the list codes and code conversions between the source systems and the data mart enterprise metadata library with any changes or updates.\r\n•Developed Ridge regression model to predict energy consumption of customers. Evaluated model using Mean Absolute Percent Error (MAPE).\r\n•Developed and enhanced statistical models by leveraging best-in-class modeling techniques.\r\n•Developed a predictive model and validated Neural Network Classification model for predicting the feature label.\r\n•Implemented logistic regression to model customer default and identified factors that were good predictors.\r\n•Designed a model to predict if a customer would respond to marketing campaign based on customer information.\r\n•Developed Random Forest and logistic regression models to observe this classification. Fine-tuned models to obtain more recall than accuracy. Tradeoff between False Positives and False Negatives.\r\n•Evaluated and optimized performance of models by tuning parameters with K-Fold Cross Validation.\r\n\r\nHilton Hotels, McLean, VA April 2014 – June 2015\r\nData Analyst\r\n\r\nWorked with NLP to classify text with data draw from a big data system. The text categorization involved labeling natural language texts with relevant categories from a predefined set. 
One goal was to target users by automated classification. In this way we could create cohorts to improve marketing. The NLP text analysis monitored, tracked, and classified user discussion about product and/or service in online discussion. The Machine Learning classifier was trained to identify whether a cohort was a promoter or a detractor. Overall, the project improved marketing ROI and customer satisfaction. Also incorporated a Churn Analysis model to examine repeat business/dropoff.\r\n\r\n•Worked the entire production cycle to extract and display metadata from various assets and helped develop a report display that was easy to grasp and gain insights from.\r\n•Collaborated with both the Research and Engineering teams to productionize the application.\r\n•Assisted various teams in bringing prototyped assets into production.\r\n•Applied data mining techniques and optimization techniques standard to B2B and B2C industries, and applied Machine Learning, Data/Text Mining, Statistical Analysis and Predictive Modeling.\r\n•Utilized MapReduce/PySpark Python modules for Machine Learning and predictive analytics on AWS.\r\n•Implemented assets and scripts for various projects using R, Java, and Python.\r\n•Built sustainable rapport with senior leaders.\r\n•Developed and maintained Data Dictionary to create metadata reports for technical and business purposes.\r\n•Built and maintained dashboard and reporting based on the statistical models to identify and track key metrics and risk indicators.\r\n•Kept up to date with latest NLP methodologies by reading 10 to 15 articles and whitepapers per week.\r\n•Extracted source data from Oracle tables, MS SQL Server, sequential files, and Excel sheets.\r\n•Parsed and manipulated raw, complex data streams to prepare for loading into an analytical tool.\r\n•Involved in defining the source to target data mappings, business rules, and data definitions.\r\n•Project environment was AWS and Linux.\r\n\r\nEDUCATION\r\nBachelor of Arts - Data Science - University of California, Berkeley\r\n\t\t\t\t\tContact this candidate\r\n\t\t\t\t\t\r\n\t\t\t\t"
Subsetting the data
A total of 2527 Data Science related job postings are available for the candidate to consider. However, only the first 300 are evaluated here to save on computation time. The job descriptions are stored as raw text in a new dataframe, “des_all”.
##Multiple Job postings at once (Corpus)
#One row of posting
postings<-300
des_all<-subset(jobs,select=c(3))
#des_all<-data.frame(jobs$job_description)
des_all<-des_all[1:postings,]
head(des_all)
## # A tibble: 6 × 1
## job_description
## <chr>
## 1 "Polypore International, an Asahi Kasei Group Company, is a leading technolog…
## 2 "Key Responsibilities: Beacon is seeking a Data Scientist to join the organiz…
## 3 "The Challenge: Are you excited at the prospect of unlocking the secrets held…
## 4 "Position Summary As a Senior Data Scientist, you develop the next generation…
## 5 "The kind of person we're looking for: We're looking for an energetic, though…
## 6 "The Data Scientist is responsible for collecting, cleaning, translating data…
Cleaning The Text
The variable “resume”, which contains the resume text, is appended as the last row of the des_all dataframe (after all the job postings in the preceding rows). In preparation for NLP, the text is processed by: 1) removing unnecessary symbols and notation with regular expressions; 2) converting all letters to lower case; 3) removing numbers; 4) removing punctuation; 5) removing English stop words; 6) lemmatizing each string to reduce every word to its base form.
#adding resume text as doc_id last
des_all<-rbind(des_all,resume)
des_all$job_description<-des_all$job_description%>%
str_replace_all(pattern="\n",replacement=" ")%>%
str_replace_all(pattern="www+|com|@\\S+|#\\S+|http|\\*|\\s[A-Z]\\s|\\s[a-z]\\s|\\d|�+",replacement=" ")
des_all$job_description<-tolower(des_all$job_description)
des_all$job_description<-removeNumbers(des_all$job_description)
des_all$job_description<-removePunctuation(des_all$job_description)
#des_all$job_description<-stripWhitespace(des_all$job_description)
des_all$job_description<-removeWords(des_all$job_description,stopwords("en"))
des_all$job_description<-sapply(des_all$job_description,lemmatize_strings)
head(des_all)
## # A tibble: 6 × 1
## job_description
## <chr>
## 1 polypore international asahi kasei group company lead technology pany special…
## 2 key responsibility beacon seek datum scientist join organization serve key pl…
## 3 challenge excite prospect unlock secret hold datum set fascinate possibility …
## 4 position summary senior datum scientist develop next generation supply chain …
## 5 kind person look look energetic thoughtful intelligent creative thinker join …
## 6 datum scientist responsible collect clean translate datum meet panys need eve…
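Note that the lemmatizer maps each word to its dictionary form, which is why “data” appears as “datum” in the cleaned text above. A quick, hedged illustration on a made-up sentence:
#Quick illustration of lemmatization (toy sentence, not project data)
lemmatize_strings("data scientists are analyzing statistical models")
#expected to return something like "datum scientist be analyze statistical model"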
Term Matrix
The job descriptions are stored in a Volatile Corpus and the words are tokenized into a document-term matrix. The term frequency per job description is recorded in the matrix, and the terms are weighted using term frequency-inverse document frequency (tf-idf). The tf-idf weighting offsets the number of times a term appears in a document by the number of documents in the corpus that contain the term. This prevents a term from looking significant simply because it occurs often, for example because one document has more text than the others.
des_all_df<-data.frame(
doc_id=1:(postings+1),
text=des_all$job_description
)
Corpus=VCorpus(DataframeSource(des_all_df))
tf<-DocumentTermMatrix(Corpus,control=list(weighting=weightTf))
tfidf<-DocumentTermMatrix(Corpus,control=list(weighting=weightTfIdf))
inspect(tf)
## <<DocumentTermMatrix (documents: 301, terms: 8883)>>
## Non-/sparse entries: 62137/2611646
## Sparsity : 98%
## Maximal term length: 73
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs business datum experience learn model science team use will work
## 100 2 17 5 0 0 4 3 5 6 16
## 121 0 17 8 10 4 5 9 0 10 8
## 179 3 34 1 4 2 5 1 8 12 5
## 232 4 15 10 3 9 4 6 1 2 8
## 268 5 7 10 2 1 2 5 1 1 16
## 301 5 34 5 16 38 3 6 27 0 8
## 62 2 16 5 6 5 4 4 0 7 5
## 63 0 43 16 7 3 34 3 5 14 7
## 91 0 13 7 2 5 1 7 7 4 10
## 98 6 12 3 3 5 2 4 7 4 4
inspect(tfidf)
## <<DocumentTermMatrix (documents: 301, terms: 8883)>>
## Non-/sparse entries: 62137/2611646
## Sparsity : 98%
## Maximal term length: 73
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample :
## Terms
## Docs business covid datum health hire model
## 120 0.000000000 0.00000000 0.0008532882 0.00000000 0.000000000 0.000000000
## 15 0.000000000 0.00000000 0.0025206677 0.00000000 0.000000000 0.000000000
## 195 0.000000000 0.00000000 0.0011771435 0.00000000 0.000000000 0.000000000
## 232 0.004337884 0.00000000 0.0040408740 0.00000000 0.000000000 0.008921024
## 242 0.000000000 0.00000000 0.0015782057 0.00000000 0.000000000 0.001935666
## 275 0.000000000 0.00000000 0.0047983114 0.00000000 0.008572608 0.000000000
## 34 0.000000000 0.02430545 0.0003998593 0.00000000 0.000000000 0.000000000
## 53 0.000000000 0.00000000 0.0022669988 0.00000000 0.008100388 0.000000000
## 54 0.000000000 0.00000000 0.0040531191 0.02728102 0.000000000 0.007456714
## 68 0.000000000 0.00000000 0.0034968087 0.00000000 0.000000000 0.000000000
## Terms
## Docs pany product response surge
## 120 0.000000000 0.009251992 0 0
## 15 0.005885090 0.003416372 0 0
## 195 0.010993271 0.000000000 0 0
## 232 0.001257916 0.000000000 0 0
## 242 0.004912922 0.000000000 0 0
## 275 0.000000000 0.004335575 0 0
## 34 0.001867131 0.000000000 0 0
## 53 0.000000000 0.000000000 0 0
## 54 0.000000000 0.000000000 0 0
## 68 0.000000000 0.000000000 0 0
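To make the weighting concrete, here is a small, hedged sketch on toy documents (not project data) showing how a single normalized tf-idf entry is derived: the term count divided by the number of terms in the document, multiplied by log2 of the number of documents over the number of documents containing the term.
#Illustrative sketch: derive one normalized tf-idf weight by hand (toy corpus)
toy<-VCorpus(VectorSource(c("data science job","data job data","science role")))
toy_tfidf<-DocumentTermMatrix(toy,control=list(weighting=weightTfIdf))
#Manual weight of "science" in document 1: tf = 1/3, idf = log2(3 docs / 2 docs containing it)
(1/3)*log2(3/2)
#Should match (up to floating point) the corresponding matrix entry
as.matrix(toy_tfidf)["1","science"]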
TF-IDF Cosine Similarity
The similarity between each job posting's description and the candidate's resume is assessed using cosine similarity. Mathematically, cosine similarity measures the cosine of the angle between two vectors; the closer the output is to 1, the more similar the documents are. The lsa package is used to calculate the cosine matrix. Since the resume text is stored in the last row of the matrix, the last row of the cosine similarity output compares the resume to all the job postings.
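Before applying it to the full matrix, the measure itself can be illustrated on two toy term-weight vectors (made-up numbers, not project data):
#Minimal illustration of cosine similarity on toy vectors
a<-c(1,0,2,3)
b<-c(0,1,2,1)
sum(a*b)/(sqrt(sum(a^2))*sqrt(sum(b^2)))   #manual formula, roughly 0.76
cosine(a,b)                                #same value via the lsa package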
#cosine() compares columns, so transpose the matrix: terms become rows, documents become columns
tfidf_a<-as.matrix(tfidf)
tfidf_a<-transpose(data.frame(tfidf_a))
tfidf_a<-as.matrix(tfidf_a)
cos_df<-data.frame(cosine(tfidf_a))
#the last row compares the resume (document 301) to every job posting
resume_similarity<-cos_df[(postings+1),]
head(resume_similarity)
## V1 V2 V3 V4 V5 V6 V7
## V301 0.1416551 0.07769467 0.1470682 0.1108904 0.1207849 0.1089555 0.08929687
## V8 V9 V10 V11 V12 V13
## V301 0.07795933 0.09617149 0.1022105 0.07522777 0.05413861 0.05010286
## V14 V15 V16 V17 V18 V19
## V301 0.05302425 0.05430907 0.09197351 0.1060665 0.07356662 0.07220639
## V20 V21 V22 V23 V24 V25 V26
## V301 0.09708145 0.1456521 0.05204723 0.06929169 0.07617128 0.08086151 0.0554781
## V27 V28 V29 V30 V31 V32 V33
## V301 0.07167883 0.08772479 0.08438069 0.04300432 0.0464312 0.1283961 0.09102716
## V34 V35 V36 V37 V38 V39
## V301 0.02580397 0.06711945 0.1294315 0.05802339 0.07459732 0.06155406
## V40 V41 V42 V43 V44 V45
## V301 0.04910052 0.07197415 0.1510529 0.05628731 0.08363072 0.05485513
## V46 V47 V48 V49 V50 V51 V52
## V301 0.08944897 0.04737076 0.1245482 0.05575244 0.0996943 0.05733874 0.07921441
## V53 V54 V55 V56 V57 V58 V59
## V301 0.0324837 0.03008588 0.1146318 0.05723049 0.1434239 0.0632432 0.08114656
## V60 V61 V62 V63 V64 V65
## V301 0.08693144 0.0373362 0.05764025 0.08077321 0.06960547 0.06249381
## V66 V67 V68 V69 V70 V71 V72
## V301 0.08397868 0.04738006 0.02321146 0.03304222 0.1323106 0.1315139 0.1183743
## V73 V74 V75 V76 V77 V78 V79
## V301 0.05484786 0.101101 0.100935 0.09970921 0.1320837 0.0624572 0.07594105
## V80 V81 V82 V83 V84 V85
## V301 0.06359105 0.05767434 0.07778547 0.08371811 0.06836139 0.06726932
## V86 V87 V88 V89 V90 V91
## V301 0.04396347 0.04029002 0.02586653 0.04269718 0.04788154 0.04924198
## V92 V93 V94 V95 V96 V97 V98
## V301 0.07188841 0.074638 0.08510355 0.1326032 0.07139971 0.03304222 0.1301215
## V99 V100 V101 V102 V103 V104 V105
## V301 0.05548018 0.04143765 0.09534898 0.05544516 0.1296061 0.05794908 0.1011755
## V106 V107 V108 V109 V110 V111
## V301 0.05424914 0.1008594 0.06628381 0.08212385 0.0006584202 0.07623733
## V112 V113 V114 V115 V116 V117
## V301 0.01220999 0.08725037 0.0006584202 0.06295279 0.02861998 0.04715003
## V118 V119 V120 V121 V122 V123
## V301 0.06791097 0.06712756 0.03551908 0.06731936 0.02618158 0.0006584202
## V124 V125 V126 V127 V128 V129
## V301 0.03431915 0.0006584202 0.05663141 0.0451591 0.05902404 0.0006584202
## V130 V131 V132 V133 V134 V135 V136
## V301 0.0525206 0.04763502 0.06171601 0.03998052 0.09928592 0.1308348 0.02761613
## V137 V138 V139 V140 V141 V142
## V301 0.06655464 0.0006584202 0.08050519 0.06233227 0.09019632 0.0006584202
## V143 V144 V145 V146 V147 V148
## V301 0.055537 0.05199758 0.04770371 0.04767373 0.06029579 0.04866286
## V149 V150 V151 V152 V153 V154
## V301 0.0006584202 0.0006584202 0.04067046 0.09415757 0.05117887 0.07445718
## V155 V156 V157 V158 V159 V160 V161
## V301 0.05695702 0.1194858 0.09758605 0.0813059 0.0292141 0.05920644 0.08735253
## V162 V163 V164 V165 V166 V167 V168
## V301 0.08706643 0.1667365 0.05595936 0.03197698 0.1464094 0.06115246 0.09214396
## V169 V170 V171 V172 V173 V174
## V301 0.04929526 0.02877888 0.0638164 0.07177846 0.0006584202 0.08717067
## V175 V176 V177 V178 V179 V180
## V301 0.03400786 0.0006584202 0.04311297 0.1229501 0.07525265 0.06012819
## V181 V182 V183 V184 V185 V186
## V301 0.02687784 0.04273866 0.07923166 0.0590084 0.09158149 0.07473451
## V187 V188 V189 V190 V191 V192
## V301 0.0006584202 0.04663357 0.1076789 0.0006584202 0.1262571 0.06115601
## V193 V194 V195 V196 V197 V198
## V301 0.03167277 0.02638287 0.01800265 0.09340836 0.04555181 0.04891775
## V199 V200 V201 V202 V203 V204
## V301 0.09117405 0.1017901 0.05430446 0.06054871 0.0006584202 0.0006584202
## V205 V206 V207 V208 V209 V210 V211
## V301 0.05909225 0.1382049 0.1926756 0.01220999 0.02961342 0.07180154 0.0704373
## V212 V213 V214 V215 V216 V217
## V301 0.0522907 0.0006584202 0.06753681 0.1141436 0.03727778 0.05957519
## V218 V219 V220 V221 V222 V223
## V301 0.02585522 0.2210862 0.08652779 0.07693054 0.04616294 0.01220999
## V224 V225 V226 V227 V228 V229
## V301 0.07082259 0.0778447 0.0006584202 0.05391029 0.04877406 0.1118834
## V230 V231 V232 V233 V234 V235
## V301 0.0852164 0.0006584202 0.07599066 0.1199132 0.03112881 0.03235892
## V236 V237 V238 V239 V240 V241 V242
## V301 0.03157602 0.05542685 0.108302 0.1430948 0.0486259 0.0006584202 0.04118837
## V243 V244 V245 V246 V247 V248
## V301 0.0591698 0.06370276 0.0006584202 0.07662721 0.0006584202 0.07901934
## V249 V250 V251 V252 V253 V254
## V301 0.0006584202 0.05804969 0.0006584202 0.09182605 0.06979577 0.04603223
## V255 V256 V257 V258 V259 V260 V261
## V301 0.0521297 0.037593 0.1020849 0.0006584202 0.03857487 0.02910581 0.04502476
## V262 V263 V264 V265 V266 V267 V268
## V301 0.07302718 0.09399763 0.1483487 0.01220999 0.03864722 0.1076212 0.08008692
## V269 V270 V271 V272 V273 V274
## V301 0.02862698 0.09645855 0.05278978 0.05334496 0.05514225 0.03865138
## V275 V276 V277 V278 V279 V280
## V301 0.0264474 0.08242506 0.0006584202 0.06504287 0.03591798 0.09838319
## V281 V282 V283 V284 V285 V286 V287
## V301 0.0611253 0.147724 0.1261354 0.0006584202 0.03095455 0.04303923 0.03662237
## V288 V289 V290 V291 V292 V293
## V301 0.02654855 0.05788278 0.0006584202 0.09938876 0.07476916 0.06727576
## V294 V295 V296 V297 V298 V299
## V301 0.04089304 0.0006584202 0.06449686 0.05204562 0.04311112 0.05768125
## V300 V301
## V301 0.1519766 1
Recommendation Dataframe
The job posting dataframe is re-arranged based on the cosine similarity output: the higher the row number, the lower the similarity between the resume and the job posting. The new dataframe is called “rec_df”.
#strip the V1..V301 names and rank the postings by decreasing similarity to the resume
names(resume_similarity)<-NULL
list<-unlist(c(resume_similarity))
order<-order(list,decreasing=TRUE)
#drop the first index, which is the resume matched against itself (similarity = 1)
order<-order[-c(1)]
doc_ID<-data.frame(order)
rec_df<-doc_ID
colnames(rec_df)<-c("doc_ID")
rec_df<-rec_df%>%
mutate(job_title=jobs[order,2])%>%
mutate(min_salary=jobs[order,4])%>%
mutate(max_salary=jobs[order,5])%>%
mutate(city=jobs[order,6])%>%
mutate(state=jobs[order,7])%>%
mutate(company_name=jobs[order,8])%>%
mutate(company_industry=jobs[order,9])%>%
mutate(company_rating=jobs[order,10])%>%
mutate(bachelors=jobs[order,11])%>%
mutate(masters=jobs[order,12])%>%
mutate(PHD=jobs[order,13])
rec_df<-unnest(rec_df)
head(rec_df)
## # A tibble: 6 × 12
## doc_ID job_title min_salary max_salary city state company_name
## <int> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 219 Machine Learning Engineer 122998 151838 San … CA price.com
## 2 219 Data Scientist NA NA Palo… CA Landing AI
## 3 219 Data Scientist 118964 187172 Bris… CA Nomis Solut…
## 4 219 Senior Data Scientist 150691 151132 Menl… CA Quantifind
## 5 219 Chief Data Scientist NA NA Remo… Remo… Espire Serv…
## 6 219 Data Scientist II NA NA San … CA EDI Special…
## # … with 5 more variables: company_industry <chr>, company_rating <dbl>,
## # bachelors <dbl>, masters <dbl>, PHD <dbl>
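The same table could also be assembled more directly by indexing the jobs dataframe with the similarity ordering; the following is a minimal sketch under the assumption that the column positions used in the mutate() chain above (2 and 4 through 13) stay fixed.
#Alternative sketch: build the recommendation table by direct row indexing
rec_df_alt<-cbind(doc_ID=order,jobs[order,c(2,4:13)])
head(rec_df_alt)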
Visualization
The following section defines a function, rec(), which returns the ranking-th recommended job posting along with a word cloud and a Venn diagram showing which terms the resume and that posting share. The first 3 recommendations are shown.
#understanding the terms that are most relevant
rec<-function(ranking){
#document number of the ranking-th recommendation (taken from the similarity ordering)
doc_num<-order[ranking]
z<-data.frame(as.matrix(tfidf))
#row 1: recommended job posting, row 2: candidate resume (last document in the matrix)
compare1<-rbind(z[doc_num,],z[nrow(z),])
comp1<-compare1%>%
mutate(row_n=1:n())%>%
select_if(function(x) any(x!=0))
comp1_t<-transpose(comp1)
colnames(comp1_t)<-c("Job","Resume")
comp1_t$terms<-colnames(comp1)
comp1_matches<-comp1_t[comp1_t$Resume!=0 & comp1_t$Job!=0,]
rownames(comp1_matches)<-NULL
#drop the last row, which corresponds to the helper column row_n
comp1_matches<-comp1_matches[-c(nrow(comp1_matches)),]
comp1_matches_n<-nrow(comp1_matches)
Job1_diff<-comp1_t[comp1_t$Resume==0 & comp1_t$Job!=0,]
Job1_diff_n<-nrow(Job1_diff)
Resume1_diff<-comp1_t[comp1_t$Resume!=0 & comp1_t$Job==0,]
Resume1_diff_n<-nrow(Resume1_diff)
#Venn Diagram of common and different words between resume and job posting.
grid.newpage()
draw.pairwise.venn(Resume1_diff_n+comp1_matches_n, Job1_diff_n+comp1_matches_n, comp1_matches_n,
category = c("Terms in your resume", "Terms in Job Posting"),
lty = rep("blank", 2), fill = c("light blue", "pink"), alpha = rep(0.5, 2),
cat.pos = c(0, 0), cat.dist = rep(0.025, 2), scaled = FALSE)
#Word Cloud for top recommended job
comp1_matches_adjust<-comp1_matches%>%
mutate(adjust=Job*1000)   #scale the job-posting weights so wordcloud's min.freq works
set.seed(1234)
wordcloud(words = comp1_matches_adjust$terms, freq = comp1_matches_adjust$adjust, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
return(rec_df[ranking,2:9])
}
Test cases
rec(1) returns the top recommended job posting; rec(2) and rec(3) return the second and third.
rec(1)
## # A tibble: 1 × 8
## job_title min_salary max_salary city state company_name company_industry
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Machine Lear… 122998 151838 San F… CA price.com <NA>
## # … with 1 more variable: company_rating <dbl>
rec(2)
## # A tibble: 1 × 8
## job_title min_salary max_salary city state company_name company_industry
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Data Scientist NA NA Palo Alto CA Landing AI Information Tec…
## # … with 1 more variable: company_rating <dbl>
rec(3)
## # A tibble: 1 × 8
## job_title min_salary max_salary city state company_name company_industry
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Data Scientist 118964 187172 Brisbane CA Nomis Solut… Information Tec…
## # … with 1 more variable: company_rating <dbl>
Create the Shiny UI
Authenticate the user
Gather Inputs
Present visualizations
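The Shiny layer itself lives outside this markdown file. As a rough outline of the steps above, a minimal sketch of a UI/server pair is shown below; every widget name and default here is an illustrative assumption, not the deployed app.
#Minimal Shiny sketch (illustrative assumptions only; not the deployed app)
library(shiny)
ui<-fluidPage(
  titlePanel("Data Science Job Recommender"),
  sidebarLayout(
    sidebarPanel(
      textInput("resume_url","PostJobFree resume URL"),      #gather inputs
      numericInput("n_jobs","Number of recommendations",5)
    ),
    mainPanel(tableOutput("recommendations"))                #present the results
  )
)
server<-function(input,output){
  output$recommendations<-renderTable({
    head(rec_df,input$n_jobs)   #assumes rec_df is built as in the steps above
  })
}
#shinyApp(ui,server)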
Conclusion
Future Improvement