Real or Fake? Using a Naïve Bayes Classifier to Identify Fraudulent Job Postings

Author

Kimberly Ouimette

Abstract

The proliferation of fraudulent job postings presents a significant concern globally, with negative implications for both individuals and economies. In this study, I aim to replicate the preprocessing steps outlined by Amaar et al. (2022) and extend their work to compare the performance of Naïve Bayes classifiers to other machine learning algorithms (e.g., random forest, support vector machines) in predicting fraudulent job postings. Using the same dataset of real job advertisements analyzed by Amaar et al. (2022), I replicated their rigorous text preprocessing techniques and oversampling methods to address imbalanced data and then applied a Naïve Bayes classification model. Results indicate that while Naïve Bayes models perform well, they do not surpass the performance of other machine learning algorithms with the exception of K Nearest Neighbor. Notably, further refinement of the preprocessing steps (i.e., reducing the feature space) improved Naïve Bayes performance significantly, highlighting the importance of preprocessing in text analysis via machine learning. These findings contribute to a growing literature on how different machine learning algorithms perform in detecting fraudulent job postings and emphasize the need for comparative analysis in text classification.

Introduction

Background

The proliferation of fraudulent job postings presents a significant and burgeoning concern within the United States, impacting approximately 14 million people annually and resulting in $2 billion in direct losses (Better Business Bureau, 2020). Alarmingly, losses reported to the Federal Bureau of Investigation's Internet Crime Complaint Center skyrocketed by 27 percent between 2018 and 2020 (Better Business Bureau, 2020), highlighting the growing threat posed by online job scams. While anyone can be susceptible to a job scam, some groups are disproportionately affected. Specifically, people aged 25-34 accounted for 28 percent of job scams reported to the Better Business Bureau (BBB) between 2017 and March 2020. Additionally, women (66.7%) and unemployed individuals (54%) accounted for a significant portion of complaints to the BBB during that time frame.

In response to this growing crisis, regulatory bodies like the BBB have emphasized the urgent need for online job boards to bolster their screening procedures (Better Business Bureau, 2020). Past research has demonstrated the efficacy of machine learning methodologies in detecting online fraudulent activities, including job postings (Naudé et al., 2022; Vidros et al., 2017; Amaar et al., 2022). Furthermore, existing studies on fraud detection in human-generated content, such as email spam (CITATION), online reviews (Ott et al., 2013), and fake news (Braşoveanu & Andonie, 2021), underscore the applicability of supervised machine learning techniques in identifying fraud.

Of particular importance to this study, Amaar and colleagues (2022) compared the efficacy of several types of supervised machine learning models (i.e., Support Vector Machines, Random Forest, Logistic Regression, Extra-Trees Classifier, K-Nearest Neighbor, & Multilayer Perceptron) in predicting fraudulent job descriptions and found their extra trees classifier (ETC) to perform the best, achieving accuracy, precision, and recall rates over 99%. Despite comparing multiple machine learning methods, Amaar et al. (2022) did not include another prominent supervised classifier, Naïve Bayes. Although research prior to Amaar et al. (2022) compared Naïve Bayes to other machine learning approaches (e.g., random forests, K-Nearest Neighbor) on the same dataset, different techniques have, interestingly, emerged as top performers for this use case (Dutta & Bandyopadhyay, 2020). These differences perhaps emerge from variations in text pre-processing approaches, as Amaar et al. (2022) highlight the sensitivity of machine learning algorithms' performance, particularly that of Naïve Bayes, to pre-processing techniques.

To ensure an accurate comparison of Naïve Bayes classification with other machine learning techniques in identifying fraudulent job postings, this study aims to replicate the text pre-processing techniques utilized by Amaar et al. (2022) and apply a Naïve Bayes classification model to the same set of data. By employing the same data cleaning and text pre-processing procedures, as outlined in the original study by Amaar et al. (2022), this research strives to provide a robust evaluation of the performance of Naïve Bayes classifiers relative to other prominent machine learning algorithms, including Support Vector Machines, Random Forest, Logistic Regression, Extra Trees Classifier, K-Nearest Neighbor, and Multilayer Perceptron. Through this approach, this study aims to offer insights into the comparative effectiveness of Naïve Bayes in the domain of fraudulent job posting detection, thereby contributing to a deeper understanding of the strengths and limitations of various supervised classification methodologies.

Research Question

How can supervised learning algorithms, specifically Naïve Bayes classifiers, effectively differentiate between real and fraudulent job descriptions?

Method

Data Acquisition

The data derive from a Kaggle dataset (Bansal, 2020) that republishes the Employment Scam Aegean Dataset (EMSCAD), an open dataset released by Vidros et al. (2017) and gathered from real-life job advertisements posted between 2012 and 2014 on Workable, an online recruitment platform. The dataset contains 17,880 job postings, 866 (4.8%) of which were classified as fraudulent. In addition, the dataset includes several text fields, including job title, company biography, job description, job requirements, required education, and department. Furthermore, the dataset contains binary indicators of whether the posting is remote eligible, contains the company logo, and is fraudulent. Overall, the dataset provides sufficient observations of both verified and fraudulent job postings to develop and deploy a Naïve Bayes supervised classification model and is identical to the data utilized by Amaar et al. (2022).

Variables

The objective of this algorithm is to correctly distinguish between real and fraudulent job postings based on a document frequency matrix of the words used in their job descriptions. Taking this into consideration, the dependent variable of interest (i.e., ‘fraudulent’) is a binary variable where a value of 0 represents a verified job posting and a value of 1 represents a fraudulent job posting. The document frequency matrix of the job descriptions and other characteristics (e.g., job title, location), generated via the Quanteda package in R, served as the predictors within the Naïve Bayes algorithm. Table 1 below lists all variables within the dataset, a brief description of what they represent, and sample input (from row 7). In accordance with the analysis conducted by Amaar et al. (2022), only the following textual variables were included in the document frequency matrix: company’s profile, location, job description, job title, department, benefits, job requirements, type of employment, industry, and function.

Table 1. Attributes of dataset.
Variable Description Example
job_id Unique identifier of job posting 7
title Title of job advertisement Head of Content (m/f)
location Geographical location of job posts DE, BE, Berlin
department Corporate department (e.g., sales, marketing, human resources) ANDROIDPIT
salary_range Posted salary range (if applicable) 20000-28000
company_profile Short description of company (e.g., mission statement, history) Founded in 2009, the Fonpit AG rose with its international web portal ANDROIDPIT to the world’s largest Android community…
description Job description Your Responsibilities: Manage the English-speaking editorial team and build a team of best-in-class editors…
requirements List of job requirements University or college degree in journalism, media or other communication studies…
benefits Describes benefits offered for position Your Benefits: Being part of a fast-growing company in a booming industryFast decision-making thanks to flat hierarchies and clear structures…
telecommuting Binary indicator of whether job offers working from home (1 = True, 0 = False) 0
has_company_logo Binary indicator of whether job has company logo (1 = True, 0 = False) 1
has_questions Binary indicator of whether job has FAQ section (1 = True, 0 = False) 1
employment_type Indicates type of employment (e.g., full-time, part-time, contract-based) Full-Time
required_experience Required experience level for position (e.g., Mid Senior-Level, Entry Level) Mid-Senior Level
required_education Required education level for position (e.g., High School, College) Master’s Degree
industry Type of industry (e.g., Health Care, Computer Software) Online Media
function Description of job function Management
fraudulent Binary indicator of whether job is fraudulent (1) or real (0) 0
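
To make the document-frequency-matrix representation described above concrete, the toy sketch below (hypothetical text, not drawn from the dataset) builds a small matrix from two short job descriptions using quanteda; each row is a document and each column a token count.

# Toy illustration of a document frequency matrix (hypothetical text)
library(quanteda)
library(magrittr) # for the %>% pipe

toy_texts <- c(doc1 = "Manage the editorial team and build a team of editors",
               doc2 = "Work from home and earn money fast")

toy_dfm <- corpus(toy_texts) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  dfm()

toy_dfm  # e.g., the "team" column holds 2 for doc1 and 0 for doc2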

Text Pre-Processing

In accordance with Amaar and colleagues' (2022) approach to pre-processing the textual data, the following steps were taken:

  1. Merging of Textual Fields. The following 10 textual fields were merged into one column (i.e., “text”): company’s profile, location, job description, job title, department, benefits, job requirements, type of employment, industry, and function. Once these columns were united, a corpus, and subsequently a document frequency matrix, of the “text” column was created via the Quanteda package in R.
  2. Stop-Words Removal. Stopwords are words that help humans parse sentences grammatically; however, they add little information and can overcomplicate machine learning models (Amaar et al., 2022). Unfortunately, Amaar and colleagues’ (2022) approach to removing stopwords via the Natural Language Toolkit (NLTK) is only available in Python, not in R. However, I obtained the list of stop words from the NLTK package documentation and removed all 179 stopwords (e.g., “had”, “most”, “through”) from the corpus.
  3. Punctuation Removal. Punctuation is the part of a sentence that assists the reader in understanding the message being conveyed (e.g., possession, end or separation of a thought). However, similar to stopwords, these tokens are not useful in machine learning processes. In this analysis, punctuation (i.e., !.,¿¡/([-=+&%$#)]) was removed via the Quanteda package. Similarly, other non-alphanumeric characters and URLs were removed from the corpus.
  4. Numerical Removal. Similar to punctuation, numerical characters do not add any specific meaning within text analysis. Thus, numerical characters were removed to reduce the size of the feature space to improve model performance.
  5. Stemming. Stemming is used to reduce different iterations of the same word to their root (e.g., “manage”, “managing”, “managed” -> “manag”). Amaar et al. (2022) employed the Porter stemmer library to further trim their feature space down to the roots of words. In R, the SnowballC library (used by quanteda’s tokens_wordstem function) implements the same Porter stemming algorithm.
  6. Case Normalization. To ensure that the casing of letters (e.g., upper vs. lower case) did not add unnecessary tokens to the corpus, I ensured that all words were lowercased.
  7. Feature Engineering. Finally, to minimize the feature space and improve model performance, the document frequency matrix was trimmed to include only those tokens that appeared at least 10 times. Unfortunately, Amaar et al. (2022) did not extensively document their criteria for feature engineering. In an attempt to replicate their process, this study employed a rather conservative cut-off such that only the most infrequent terms (<10 instances) were excluded from the analysis. For future reference, the model with the n = 10 cutoff will be referred to as Model 1.

Note: Code for all text pre-processing can be found in Appendix under I. Text Pre-Processing.
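
For orientation, the sketch below condenses steps 1 through 7 into a single quanteda pipeline. It assumes the merged “text” column from Step 1 is stored in jobs$text and that nltk is the 179-word NLTK stopword vector saved in Appendix I; the full replication code is in the Appendix.

# Condensed sketch of the pre-processing pipeline (full code in Appendix I)
library(quanteda)
library(magrittr) # for the %>% pipe

job_dfm_trim <- corpus(as.character(jobs$text)) %>%     # Step 1: merged "text" column
  tokens(remove_punct = TRUE, remove_symbols = TRUE,    # Step 3: punctuation & symbols
         remove_url = TRUE, remove_numbers = TRUE) %>%  # Steps 3-4: URLs & numbers
  tokens_remove(pattern = nltk) %>%                     # Step 2: NLTK stopwords
  tokens_wordstem(language = "english") %>%             # Step 5: Porter/Snowball stemming
  tokens_tolower() %>%                                  # Step 6: case normalization
  dfm() %>%                                             # document frequency matrix
  dfm_trim(min_termfreq = 10)                           # Step 7: keep tokens appearing >= 10 times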

Data Analysis

This section covers the construction, evaluation, and comparison of the Naïve Bayes classifier to the machine learning algorithms detailed in Amaar et al. (2022).

Oversampling

As discussed earlier, the original EMSCAD dataset, despite containing nearly 18,000 job postings, was heavily imbalanced, with nearly 95% of job postings being verified. Imbalanced datasets can negatively impact model performance by biasing predictions toward the more common outcome (Gosain & Sardana, 2017). To address this issue in line with the approach outlined in Amaar et al. (2022), I employed the Adaptive Synthetic Sampling (ADASYN) technique on a subset of 2,000 rows of the original data via the UBL package in R.1 ADASYN is an oversampling technique that generates synthetic cases of the underrepresented outcome in a dataset; a dataset to which ADASYN has been applied therefore becomes more balanced. The breakdown of the percentage of fraudulent cases in the subset before and after the application of the ADASYN technique can be seen in Figure 1 below. As the figure demonstrates, the application of ADASYN increased the proportion of fraudulent cases from 5% to 50%. Furthermore, a breakdown of the sample sizes of fraudulent and real job postings before and after the application of the ADASYN technique is documented within Table 2. In total, the ADASYN technique added 1,811 synthetic, fraudulent cases to the dataset.

Table 2. Summary of legitimate and fraudulent postings across imbalanced and balanced (i.e., after application of ADASYN technique) datasets.
Classification Imbalanced Dataset Balanced Dataset
Legitimate Postings 1,900 1,900
Fraudulent Postings 100 1,911
Total (N) 2,000 3,811

Note: Code for the application of the ADASYN technique can be found under A. Oversampling in the II. Naïve Bayes Classifier section of the Appendix. Similarly, code for Figure 1 can be found in the III. Visualization section of the Appendix.
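
A minimal sketch of the balancing step follows, assuming job_df holds the trimmed document frequency matrix as a data frame with the fraudulent factor attached (full code in Appendix II.A and III.A):

# Sketch of the ADASYN balancing step (see Appendix II.A for the full code)
library(UBL)

set.seed(33)

# Subset of 2,000 rows (see Footnote 1), then synthesize minority-class cases
job_subset <- dplyr::sample_n(job_df, 2000)
balanced   <- AdasynClassif(fraudulent ~ ., job_subset)

# Class balance before vs. after oversampling
prop.table(table(job_subset$fraudulent)) * 100  # roughly 95% real / 5% fraudulent
prop.table(table(balanced$fraudulent)) * 100    # roughly 50% / 50%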

Training & Testing Data

Once a balanced dataset was achieved, the data was then split into training and testing datasets in accordance with the ratio set by Amaar et al. (2022). Specifically, roughly 80 percent, or 3,048 rows, were randomly assigned to be within the training dataset. The remaining 20 percent, or 763 rows, were assigned to the testing dataset. A breakdown of verified and fraudulent postings within the training and testing datasets is outlined in Table 3.

Table 3. Breakdown of verified and fraudulent postings within training (N = 3,048) and testing (N = 763) datasets.
Dataset Fraudulent, n Fraudulent, % Verified, n Verified, % Total, N
Training 1,539 50.49% 1,509 49.51% 3,048
Testing 372 48.75% 391 51.24% 763

Note: Code for the splitting of data into training and testing datasets can be found under B. Training & Testing Data in the II. Naïve Bayes Classifier section of the Appendix.
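
As an aside, the simple random split used here does not force identical class proportions in the two sets (hence the slight 49/51 differences in Table 3). A stratified alternative, not used in this replication, could be sketched with caret::createDataPartition, assuming new_data is the balanced data frame from the previous step:

# Hypothetical stratified 80/20 split preserving the fraudulent/real ratio
library(caret)

set.seed(33)
train_idx  <- createDataPartition(new_data$fraudulent, p = 0.80, list = FALSE)
train_data <- new_data[train_idx, ]
test_data  <- new_data[-train_idx, ]

prop.table(table(train_data$fraudulent))  # ~50/50 in both sets by construction
prop.table(table(test_data$fraudulent))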

Model Construction, Evaluation, & Comparison

The Naïve Bayes classification model was constructed via the e1071 package in R and was first trained on the 3,048 rows of training data. The algorithm was then applied to the smaller testing dataset (N = 763). A confusion matrix was then generated, from which the accuracy, precision, recall, and F1-score of the classification model were calculated. These values were then compared to those of the machine learning algorithms employed by Amaar et al. (2022) on page 2238 of their article.

Note: Code for the model construction and confusion matrix can be found in the C. Model Construction & Confusion Matrix in the II. Naïve Bayes Classifier section of the Appendix.
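
For reference, the accuracy, precision, recall, and F1 values reported in the Results section can be pulled directly from the caret confusionMatrix object; a short sketch, assuming con_matrix1 is the object created in Appendix II.C:

# Extracting the reported metrics from the caret confusionMatrix object
con_matrix1$overall["Accuracy"]                      # overall accuracy
con_matrix1$byClass[c("Precision", "Recall", "F1")]  # metrics for the 'positive' class
con_matrix1$table                                    # raw counts behind Table 6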

Model Refinement

Upon successful replication of the text preprocessing and training/testing data preparation outlined by Amaar et al. (2022), further refinements to the original preprocessing pipeline were explored. Specifically, the original preprocessing used a rather conservative cut-off such that only the most infrequent terms (<10 instances) were excluded from the document frequency matrix. To gauge the impact of this feature engineering step on model performance, the minimum threshold was increased to 50 (Model 2), 75 (Model 3), and 100 (Model 4) appearances of a token in the document frequency matrix. All other preprocessing steps remained the same, in accordance with the original plan set out by Amaar et al. (2022).

Note: Code for this refinement can be found in D. Model Refinement in the II. Naïve Bayes Classifier section of the Appendix.
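
To gauge how much each cutoff shrinks the feature space before any model is fit, the trimming step can be run across all four thresholds in one pass; a sketch assuming job_dfm is the untrimmed document frequency matrix from Appendix I:

# Feature counts at each minimum term-frequency threshold (cf. Table 4)
library(quanteda)

thresholds <- c(0, 10, 50, 75, 100)
sapply(thresholds, function(k) nfeat(dfm_trim(job_dfm, min_termfreq = k)))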

Results

Preprocessing Results

Table 4 illustrates the token count for each level of document frequency matrix trimming. Notably, the number of tokens decreased immensely, from nearly 100,000 with no minimum threshold to only 12,573 with a threshold of 10. Unsurprisingly, each increase in the minimum threshold further reduced the number of tokens in the document frequency matrix. That said, there was only a 560-token drop between the thresholds of 75 and 100.

Additionally, Figure 2 illustrates the 10 most commonly occurring tokens in the document frequency matrix. Notably, although the order changed slightly, there were minimal differences in the most commonly occurring words between verified and fraudulent job postings. Thus, Figure 2 only illustrates the most commonly occurring tokens for the entire dataset. The breakdown of the most commonly occurring words by fraudulent status can be found under B. Figure 2 in the III. Visualization section of the Appendix.

Table 4. Number of tokens at each token frequency minimal threshold within document frequency matrix.

Model Number Token Frequency Minimum Threshold Number of Tokens within Document Frequency Matrix
No Minimum Threshold 0 98,580
Model 1 10 12,573
Model 2 50 4,803
Model 3 75 3,853
Model 4 100 3,293

Figure 2. Top 10 most common words in job postings (code in Appendix III.B). Note: The most commonly occurring words did not vary much between real and fraudulent job postings; thus, the bar graph emphasizes the most commonly occurring tokens overall.

Model Performance

Across all classification algorithms used by Amaar et al. (2022; i.e., Random Forest, Extra Tree Classifier, Logistic Regression, Support Vector Machine, Multilayer Perceptron, and K Nearest Neighbor), high levels of accuracy, precision, recall, and F1 scores were achieved, consistently hovering around 99 percent. Notably, the K Nearest Neighbor model demonstrated slightly lower performance metrics, particularly in precision (77%) and F1 score (87%).

Despite replicating the preprocessing technique utilized by Amaar et al. (2022) as closely as possible, the Naïve Bayes models generated within this study did not achieve the same level of performance, as shown in Table 5. Model performance did, however, improve as the document frequency matrix was trimmed more aggressively. Specifically, Models 3 and 4 demonstrated the highest performance, with accuracy, precision, recall, and F1 scores all around 96%, outperforming Models 1 and 2 as well as the K-Nearest Neighbor model in Amaar et al. (2022).

In examining the confusion matrices shown in Table 6, we see that although the number of false positives (i.e., real postings erroneously classified as fraudulent) increases from two or fewer in Models 1 and 2 to 17-18 in Models 3 and 4, the number of false negatives (i.e., fraudulent postings erroneously classified as verified) decreases tremendously: from 192 in the first iteration to only 13 in Models 3 and 4. Although this shift slightly decreased the models' recall, in this use case minimizing the false negative rate is most favorable given the damage that an unidentified fraudulent job posting can inflict on job seekers.

Furthermore, when examining the F1 scores of all classification models (see Figure 3), we see that the Naïve Bayes models, with the exception of Model 1, outperform only the K-Nearest Neighbor model from Amaar et al. (2022). F1 scores range from 0 to 1 and are used in binary classification tasks to evaluate a model by combining precision and recall into a single value (Amaar et al., 2022). A high F1 score (close to 1) indicates that the model has both high precision and high recall, meaning it makes accurate predictions and captures most of the positive instances in the dataset. Conversely, a low F1 score (close to 0) suggests that the model struggles to make accurate positive predictions or fails to identify many positive instances in the dataset. Thus, although Models 3 and 4 achieved higher F1 scores of 96%, they remain less effective at identifying fraudulent job postings than the other machine learning algorithms employed by Amaar et al. (2022), namely random forest, extra trees classifier, support vector machines, and multilayer perceptron.
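
As a quick arithmetic check, the F1 score is the harmonic mean of precision and recall; plugging in Model 1's values from Table 5 reproduces its reported F1 of roughly 80%:

# F1 as the harmonic mean of precision and recall (Model 1, Table 5)
precision <- 0.67
recall    <- 1.00
f1 <- 2 * precision * recall / (precision + recall)
f1  # ~0.80, matching the F1 score reported for Model 1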

Table 5. Comparison of model performance between Amaar et al. (2022) and the four Naïve Bayes models produced in this study. Naïve Bayes Models 1 through 4 represent the different minimum frequency thresholds (i.e., n = 10, 50, 75, & 100) a token had to reach in the document frequency matrix in order to be included.
Model Accuracy Precision Recall F1 Score
Random Forest 99% 99% 99% 99%
Extra Tree Classifier 99% 99% 99% 99%
Logistic Regression 99% 99% 98% 99%
Support Vector Machine 99% 99% 98% 99%
Multilayer Perceptron 99% 99% 100% 99%
K Nearest Neighbor 85% 77% 100% 87%
Naïve Bayes, Model 1 75% 67% 100% 80%
Naïve Bayes, Model 2 94% 89% 99% 94%
Naïve Bayes, Model 3 96% 96% 95% 96%
Naïve Bayes, Model 4 96% 95% 96% 96%
Table 6. Naïve Bayes classification model confusion matrices.
Note: When interpreting this table, please keep the following in mind: (1) true positives represent fraudulent postings correctly identified as fraudulent; (2) false positives represent verified postings falsely classified as fraudulent; (3) true negatives represent verified postings correctly identified as verified; and (4) false negatives represent fraudulent postings falsely identified as verified.
Model Number True Positives False Positives True Negatives False Negatives
Naïve Bayes, Model 1 180 0 391 192
Naïve Bayes, Model 2 333 2 385 44
Naïve Bayes, Model 3 380 18 360 13
Naïve Bayes, Model 4 387 17 353 13

Key: RF: Random Forest; ETC: Extra Tree Classifier; SVM: Support Vector Machine; MLP: Multilayer Perceptron; KNN: K Nearest Neighbor; NB-1: Naïve Bayes, Model 1; NB-2: Naïve Bayes, Model 2; NB-3: Naïve Bayes, Model 3; NB-4: Naïve Bayes, Model 4.

Note: Code evaluating model performance can be found in C. Model Construction & Confusion Matrix of the II. Naïve Bayes Classifier section within the Appendix. Additionally, model performance from the Amaar et al. (2022) article can be found in Table 12, columns 1-7 on page 2239 of the original publication.

Discussion

This study sought to replicate the text preprocessing and data transformation steps outlined by Amaar et al. (2022) to compare the performance of a Naïve Bayes classification algorithm against other prominent machine learning algorithms (i.e., Random Forest, Extra Tree Classifier, Support Vector Machines, Multilayer Perceptron, and K Nearest Neighbor) in predicting fraudulent job postings. Ultimately, the Naïve Bayes models generated in this study, although performing well, did not exceed the performance standards set by the other machine learning algorithms utilized by Amaar et al. (2022), with the exception of K Nearest Neighbor.

These findings suggest that while Naïve Bayes classifiers offer a viable approach to classifying job postings, they may not be the best choice for capturing the underlying, nuanced patterns in job description data. However, it is worth highlighting that the Naïve Bayes models demonstrated competitive performance, particularly Models 3 and 4, where further trimming of the document frequency matrix improved overall performance. With further exploration of the predictive quality of individual tokens and additional trimming of the document frequency matrix, these performance metrics could improve further.

Additionally, it is worth highlighting that complete replication of the preprocessing steps outlined by Amaar et al. (2022) was not possible, particularly in regard to feature engineering. Although Amaar et al. (2022) briefly described their methods for trimming their document frequency matrix, the description was unfortunately too vague to apply the same methodology in this study. Furthermore, due to limitations in computational power, I was unable to utilize the full dataset in this analysis, as was done in Amaar et al. (2022). Therefore, it is possible that the performance of the Naïve Bayes models reported here would change if those feature engineering steps were known and followed and if all observations were included.

In conclusion, while Naïve Bayes classifiers offer a straightforward and interpretable approach to text classification, they may not always yield the highest performance in more complex scenarios, such as predicting job scams. Ultimately, this study contributes to the broader understanding of machine learning algorithms’ performance in identifying fraudulent job postings and highlights the importance of comparative analysis in machine learning research.

References

Bansal, S. (2020, February). Real/Fake Job Posting Prediction, Version 1. Retrieved February 25, 2024, from https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction?resource=download

Better Business Bureau (2020). Job scams: Full study. Retrieved from https://www.bbb.org/all/scamstudies/jobscams/jobscamsfullstudy

Braşoveanu, A. M., & Andonie, R. (2021). Integrating machine learning techniques in semantic fake news detection. Neural Processing Letters, 53(5), 3055-3072. https://doi.org/10.1007/s11063-020-10365-x

Dutta, S., & Bandyopadhyay, S. K. (2020). Fake job recruitment detection using machine learning approach. International Journal of Engineering Trends and Technology, 68(4), 48-53. https://doi.org/10.14445/22315381/IJETT-V68I4P209S

Gosain, A., & Sardana, S. (2017, September). Handling class imbalance problem using oversampling techniques: A review. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI) (pp. 79-85). IEEE. https://doi.org/10.1109/ICACCI.2017.8125820

Naudé, M., Adebayo, K. J., & Nanda, R. (2022). A machine learning approach to detecting fraudulent job types. AI & Society, 38(2), 1013-1024. https://doi.org/10.1007/s00146-022-01469-0

Ott, M., Cardie, C., & Hancock, J. T. (2013, June). Negative deceptive opinion spam. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 497-501).

Vidros, S., Kolias, C., Kambourakis, G., & Akoglu, L. (2017). Automatic detection of online recruitment frauds: Characteristics, methods, and a public dataset. Future Internet, 9(1), 6. https://doi.org/10.3390/fi9010006

Note: References to specific R packages utilized are linked within text throughout report.

Appendix

I. Text Pre-Processing

# Setting seed for reproducibility 
set.seed(33)

# Loading required packages 
pacman::p_load(
  naivebayes, 
  caret, 
  e1071,
  quanteda, 
  tidyverse,
  here,
  SnowballC,
  quanteda.textstats
)
# Setting working directory
wd <- here()
setwd(wd)

# Reading in data
df <- read_csv('fake_job_postings.csv')
Rows: 17880 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): title, location, department, salary_range, company_profile, descri...
dbl  (6): job_id, STEM, telecommuting, has_company_logo, has_questions, frau...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Creating factor of dependent variable, fraudulent (1 = Fraudulent, 0 = Real)
df$fraudulent <- as.factor(df$fraudulent)

# Selecting necessary columns 
jobs <- df %>% 
  select(title, location, department, company_profile, description, requirements, benefits, employment_type, industry, `function`,fraudulent)

# Uniting text columns 
jobs <- jobs %>% 
  unite(col = "text", c("title", "location", "department", "company_profile", "description", "requirements", "benefits", "employment_type", "industry", "function"), sep = " ")

# Saving NLTK stopwords
nltk <- c('had', 'most', "aren't", "shan't", 'such', 'his', 'at', 'which', 'd', 'i', 'yourself', 'nor', 're', 'being', 'won', 'itself', 'don', 'for', 'my', 'what', 'was', 've', 'aren', "wasn't", 'wouldn', 'than', 'before', 'shouldn', 'our', 'the', 'ma', 'it', 'hadn', 'them', 'through', 'who', "mustn't", 'shan', 'couldn', 'haven', "couldn't", 'those', "should've", "you've", 'yourselves', 'by', 'on', 'during', 'their', 'further', 'with', 'will', 'himself', 'be', 'any', 'some', 'until', 'too', 'between', 'can', 'your', 'off', 'weren', 'hasn', 'up', 'hers', 'ain', 'again', 'below', 'same', 'themselves', "that'll", 'should', 'each', 'both', 'we', 'herself', 'yours', 'to', "hadn't", "needn't", 'while', 'above', 'but', 'her', 'under', "isn't", 'only', "haven't", 'its', 'wasn', 'is', "doesn't", 'doing', "didn't", 'you', 'theirs', 'an', 's', 'when', 'against', 'ours', 'ourselves', 'out', 'more', 'are', 'where', 'down', 'no', 'in', 'have', 'were', 'mustn', 'having', 'now', 'they', 'here', 'does', 'whom', 'him', 'm', "it's", 'll', "mightn't", 'am', 'about', 'other', 'from', 'has', 'or', 'so', 'how', 'very', 'he', 'o', 'doesn', 'own', 'once', 'y', 'few', 'just', 'isn', 'been', 'because', "wouldn't", "she's", 'as', 'over', 'after', 'didn', 'these', 'then', "don't", 'she', 'if', 'why', 'not', "weren't", 'into', 'all', 'that', "you'd", 'myself', 'needn', 'me', "won't", 'mightn', 'a', 'do', 'of', 't', "shouldn't", "you'll", "hasn't", 'this', 'there', "you're", 'did', 'and')


# Use Quanteda for pre-processing
        job_corpus <- corpus(as.character(jobs$text)) %>% 
          tokens(remove_punct = TRUE, # Removing punctuation
                 remove_symbols = TRUE, # Removing symbols
                 remove_url = TRUE, # Removing URLs
                 remove_separators = TRUE,# Removing separators
                 split_hyphens = TRUE, # Splitting hyphenated words (e.g., self-aware) 
                 remove_numbers = TRUE) %>% # Removing numbers
         tokens_remove(pattern = nltk) %>%  # Removing NLTK stopwords
         tokens_wordstem(language = "english")  %>% # Stemming tokens using SnowballC
         tokens_tolower() # making sure all tokens are lowercase 

        # Creating dfm
        job_dfm <- dfm(job_corpus)
        
        
        # Trimming document frequency matrix to only include those tokens that appear at least 10 times
        job_dfm_trim <- dfm_trim(job_dfm, min_termfreq = 10)

        # Converting dfm to matrix
        job_matrix <- as.matrix(job_dfm_trim)
        
        
        # Getting the most commonly occurring tokens
        job_dfm %>%
          textstat_frequency() %>%
          head(20)
   feature frequency rank docfreq group
1     work     53349    1   14502   all
2       na     39967    2   14982   all
3     team     36964    3   12641   all
4    manag     36147    4   11309   all
5   servic     34517    5   10824   all
6  develop     32937    6   10008   all
7   experi     31865    7   13906   all
8     time     30335    8   14864   all
9   custom     29683    9    8534   all
10 compani     27540   10   11511   all
11 product     27217   11    8833   all
12    busi     24868   12    8927   all
13      us     23060   13   13589   all
14  market     21151   14    6295   all
15    full     19998   15   13359   all
16  client     19926   16    7049   all
17  provid     19708   17   10118   all
18   skill     19342   18   10142   all
19     new     18939   19    8832   all
20  design     18868   20    6010   all

II. Naïve Bayes Classifier

A. Oversampling

# Setting seed for reproducibility 
set.seed(33)

# Loading UBL package
pacman::p_load(
  UBL
)

# Converting job_matrix to data frame
job_df <- as.data.frame(job_matrix)

# Adding fraudulent column to job document frequency matrix
job_df$fraudulent <- df$fraudulent

# Taking a sample of 2,000 rows 
job_subset <- sample_n(job_df, 2000)

# Defining formula for prediction problem 
formula <- fraudulent ~ .

# Applying ADASYN to dataset
new_data <- AdasynClassif(formula, job_subset)

B. Training & Testing Data

# Setting seed for reproducibility
set.seed(33)

     # Splitting sample into a training set and testing set
        sample <- sample.int(n = nrow(new_data), 
                             size = floor(.80*nrow(new_data)),
                             replace = F)

        # Selecting training dataset
        train_matrix <- new_data[sample,]
        train_labels <- new_data$fraudulent[sample]
        test_matrix <- new_data[-sample,]
        test_labels <- new_data$fraudulent[-sample]

# Saving data as a fail safe
write.csv(test_matrix, "testdata.csv")
write.csv(train_matrix, "traindata.csv")

# Reading testing and training data back in 
test_data <- read_csv('testdata.csv')
New names:
Rows: 763 Columns: 12574
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(1): ...1 dbl (12573): market, intern, us, ny, new, york, we'r, food52, we'v,
creat, g...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
train_data <- read_csv('traindata.csv')
New names:
Rows: 3048 Columns: 12574
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(1): ...1 dbl (12573): market, intern, us, ny, new, york, we'r, food52, we'v,
creat, g...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
# Removing first column of both datasets
test_data <- test_data[,-1]
train_data <- train_data[,-1]

# Getting counts & proportions 
    # Training Data 
      prop.table(table(train_data$fraudulent)) * 100

       0        1 
49.50787 50.49213 
      train_data %>%
        summarize(Real = sum(fraudulent == 0), 
                  Fake = sum(fraudulent == 1))
# A tibble: 1 × 2
   Real  Fake
  <int> <int>
1  1509  1539
    # Testing Data
      prop.table(table(test_data$fraudulent)) * 100

       0        1 
51.24509 48.75491 
      test_data %>%
        summarize(Real = sum(fraudulent == 0), 
                  Fake = sum(fraudulent == 1))
# A tibble: 1 × 2
   Real  Fake
  <int> <int>
1   391   372

C. Model Construction & Confusion Matrix

# Removing 'fraudulent' column from training and testing datasets
train_remove <- train_data %>%
  select(-c(fraudulent))

test_remove <- test_data %>% 
  select(-c(fraudulent))

# Creating matrices to pass into e1071
train_matrix <- as.matrix(train_remove)
test_matrix <- as.matrix(test_remove)

# Creating factor of dependent variable (fraudulent) and checking levels
category <- as.factor(train_data$fraudulent)
levels(category)
[1] "0" "1"
# Training model
nb <- e1071::naiveBayes(
  x=train_matrix, 
  y=train_labels,
  method='class'
)

# Applying model to testing matrix 
nb_test_prediction <- predict(nb, test_matrix)

# Running confusion matrix 
        con_matrix1 <- confusionMatrix(data = nb_test_prediction, 
                                       reference = test_labels, 
                                       mode = 'prec_recall')

        # Returning accuracy, precision, and recall
        print(con_matrix1)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 391 192
         1   0 180
                                         
               Accuracy : 0.7484         
                 95% CI : (0.716, 0.7788)
    No Information Rate : 0.5125         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.49           
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
              Precision : 0.6707         
                 Recall : 1.0000         
                     F1 : 0.8029         
             Prevalence : 0.5125         
         Detection Rate : 0.5125         
   Detection Prevalence : 0.7641         
      Balanced Accuracy : 0.7419         
                                         
       'Positive' Class : 0              
                                         

D. Model Refinement

Minimum of 50 Instances of Token Frequency
# Trimming document frequency matrix to only include tokens that appeared at least 50 times
        job_dfm_50 <- dfm_trim(job_dfm, min_termfreq = 50)

# Converting dfm to data frame
        job_data50 <- as.data.frame(job_dfm_50)
Warning: 'as.data.frame.dfm' is deprecated.
Use 'convert(x, to = "data.frame")' instead.
See help("Deprecated")
# Adding fraudulent column to job document frequency matrix
job_data50$fraudulent <- as.factor(df$fraudulent)

# Taking a sample of 2,000 rows 
job_subset50 <- sample_n(job_data50, 2000)

# Removing first column
job_subset50 <- job_subset50[,-1]

# Defining formula for prediction problem 
formula <- fraudulent ~ .

# Applying ADASYN to dataset
new_data50 <- AdasynClassif(formula, job_subset50)

# Setting seed for reproducibility
set.seed(33)

     # Splitting sample into a training set and testing set
        sample50 <- sample.int(n = nrow(new_data50), 
                             size = floor(.80*nrow(new_data50)),
                             replace = F)

        # Selecting training dataset
        train_matrix50 <- new_data50[sample50,]
        train_labels50 <- new_data50$fraudulent[sample50]
        test_matrix50 <- new_data50[-sample50,]
        test_labels50 <- new_data50$fraudulent[-sample50]
        
# Removing 'fraudulent' column from training and testing datasets
train_remove50 <- train_matrix50 %>%
  select(-c(fraudulent))

test_remove50 <- test_matrix50 %>% 
  select(-c(fraudulent))

# Creating matrices to pass into e1071
train_matrix50 <- as.matrix(train_remove50)
test_matrix50 <- as.matrix(test_remove50)

# Training model
nb50 <- e1071::naiveBayes(
  x=train_matrix50, 
  y=train_labels50,
  method='class'
)

# Applying model to testing matrix 
nb_test_prediction50 <- predict(nb50, test_matrix50)

# Running confusion matrix 
        con_matrix50 <- confusionMatrix(data = nb_test_prediction50, 
                                       reference = test_labels50, 
                                       mode = 'prec_recall')

        # Returning accuracy, precision, and recall
        print(con_matrix50)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 379  19
         1  11 357
                                          
               Accuracy : 0.9608          
                 95% CI : (0.9446, 0.9734)
    No Information Rate : 0.5091          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9216          
                                          
 Mcnemar's Test P-Value : 0.2012          
                                          
              Precision : 0.9523          
                 Recall : 0.9718          
                     F1 : 0.9619          
             Prevalence : 0.5091          
         Detection Rate : 0.4948          
   Detection Prevalence : 0.5196          
      Balanced Accuracy : 0.9606          
                                          
       'Positive' Class : 0               
                                          
Minimum of 75 Instances of Token Frequency
# Trimming document frequency matrix to only include tokens that appeared at least 75 times
        job_dfm_75 <- dfm_trim(job_dfm, min_termfreq = 75)

# Converting dfm to data frame
        job_data75 <- as.data.frame(job_dfm_75)
Warning: 'as.data.frame.dfm' is deprecated.
Use 'convert(x, to = "data.frame")' instead.
See help("Deprecated")
# Adding fraudulent column to job document frequency matrix
job_data75$fraudulent <- as.factor(df$fraudulent)

# Taking a sample of 2,000 rows 
job_subset75 <- sample_n(job_data75, 2000)

# Removing first column
job_subset75 <- job_subset75[,-1]

# Defining formula for prediction problem 
formula <- fraudulent ~ .

# Applying ADASYN to dataset
new_data75 <- AdasynClassif(formula, job_subset75)

# Setting seed for reproducibility
set.seed(33)

     # Splitting sample into a training set and testing set
        sample75 <- sample.int(n = nrow(new_data75), 
                             size = floor(.80*nrow(new_data75)),
                             replace = F)

        # Selecting training dataset
        train_matrix75 <- new_data75[sample75,]
        train_labels75 <- new_data75$fraudulent[sample75]
        test_matrix75 <- new_data75[-sample75,]
        test_labels75 <- new_data75$fraudulent[-sample75]
        
# Removing 'fraudulent' column from training and testing datasets
train_remove75 <- train_matrix75 %>%
  select(-c(fraudulent))

test_remove75 <- test_matrix75 %>% 
  select(-c(fraudulent))

# Creating matrices to pass into e1071
train_matrix75 <- as.matrix(train_remove75)
test_matrix75 <- as.matrix(test_remove75)

# Training model
nb75 <- e1071::naiveBayes(
  x=train_matrix75, 
  y=train_labels75,
  method='class'
)

# Applying model to testing matrix 
nb_test_prediction75 <- predict(nb75, test_matrix75)

# Running confusion matrix 
        con_matrix75 <- confusionMatrix(data = nb_test_prediction75, 
                                       reference = test_labels75, 
                                       mode = 'prec_recall')

        # Returning accuracy, precision, and recall
        print(con_matrix75)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 387  18
         1   7 354
                                          
               Accuracy : 0.9674          
                 95% CI : (0.9522, 0.9788)
    No Information Rate : 0.5144          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9346          
                                          
 Mcnemar's Test P-Value : 0.0455          
                                          
              Precision : 0.9556          
                 Recall : 0.9822          
                     F1 : 0.9687          
             Prevalence : 0.5144          
         Detection Rate : 0.5052          
   Detection Prevalence : 0.5287          
      Balanced Accuracy : 0.9669          
                                          
       'Positive' Class : 0               
                                          
Minimum of 100 Instances of Token Frequency
# Trimming document frequency matrix to only include tokens that appeared at least 100 times
        job_dfm_100 <- dfm_trim(job_dfm, min_termfreq = 100)

# Converting dfm to data frame
        job_data100 <- as.data.frame(job_dfm_100)
Warning: 'as.data.frame.dfm' is deprecated.
Use 'convert(x, to = "data.frame")' instead.
See help("Deprecated")
# Adding fraudulent column to job document frequency matrix
job_data100$fraudulent <- as.factor(df$fraudulent)

# Taking a sample of 2,000 rows 
job_subset100 <- sample_n(job_data100, 2000)

# Removing first column
job_subset100 <- job_subset100[,-1]

# Defining formula for prediction problem 
formula <- fraudulent ~ .

# Applying ADASYN to dataset
new_data100 <- AdasynClassif(formula, job_subset100)

# Setting seed for reproducibility
set.seed(33)

     # Splitting sample into a training set and testing set
        sample100 <- sample.int(n = nrow(new_data100), 
                             size = floor(.80*nrow(new_data100)),
                             replace = F)

        # Selecting training dataset
        train_matrix100 <- new_data100[sample100,]
        train_labels100 <- new_data100$fraudulent[sample100]
        test_matrix100 <- new_data100[-sample100,]
        test_labels100 <- new_data100$fraudulent[-sample100]
        
# Removing 'fraudulent' column from training and testing datasets
train_remove100 <- train_matrix100 %>%
  select(-c(fraudulent))

test_remove100 <- test_matrix100 %>% 
  select(-c(fraudulent))

# Creating matrices to pass into e1071
train_matrix100 <- as.matrix(train_remove100)
test_matrix100 <- as.matrix(test_remove100)

# Training model
nb100 <- e1071::naiveBayes(
  x=train_matrix100, 
  y=train_labels100,
  method='class'
)

# Applying model to testing matrix 
nb_test_prediction100 <- predict(nb100, test_matrix100)

# Running confusion matrix 
        con_matrix100 <- confusionMatrix(data = nb_test_prediction100, 
                                       reference = test_labels100, 
                                       mode = 'prec_recall')

        # Returning accuracy, precision, and recall
        print(con_matrix100)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 351  24
         1  37 351
                                          
               Accuracy : 0.9201          
                 95% CI : (0.8985, 0.9383)
    No Information Rate : 0.5085          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8402          
                                          
 Mcnemar's Test P-Value : 0.1244          
                                          
              Precision : 0.9360          
                 Recall : 0.9046          
                     F1 : 0.9201          
             Prevalence : 0.5085          
         Detection Rate : 0.4600          
   Detection Prevalence : 0.4915          
      Balanced Accuracy : 0.9203          
                                          
       'Positive' Class : 0               
                                          

III. Visualization

A. Figure 1. Percentage of fraudulent cases before and after ADASYN.

# Comparing percentage of fraudulent cases before and after ADASYN
percentage_before <- prop.table(table(job_subset$fraudulent))*100

percentage_after <- prop.table(table(new_data$fraudulent)) * 100

# Printing proportions
percentage_before

 0  1 
95  5 
percentage_after

       0        1 
49.85568 50.14432 
# Creating plot data 
plot_data <- data.frame(
  Method = c("Before ADASYN", "After ADASYN"),
  Fraudulent = c(percentage_before[2], percentage_after[2])
)

# Flipping order of the levels on the x-axis
plot_data$Method <- as.factor(plot_data$Method)
plot_data$Method <- factor(plot_data$Method, levels = c("Before ADASYN", "After ADASYN"))


# Creating grouped bar chart
figure1 <- ggplot(plot_data, aes(x = Method, y = Fraudulent, fill = Method)) +
  geom_bar(stat = "identity", fill = c("maroon", "steelblue4")) +
  labs(title = "Figure 1. Percentage of Fraudulent Cases Before\n and After Application of ADASYN",
       x = "Method",
       y = "% Fraudulent") +
  theme_minimal() +
  theme(legend.position = "none")


# Getting breakdown of real and fraudulent postings by balanced and imbalanced datasets
job_subset %>%
  summarize(Real = sum(fraudulent == 0), 
            Fake = sum(fraudulent == 1))
  Real Fake
1 1900  100
new_data %>%
  summarize(Real = sum(fraudulent == 0), 
            Fake = sum(fraudulent == 1))
  Real Fake
1 1900 1911

B. Figure 2. Most commonly occurring tokens in document frequency matrix.

top10 <-job_dfm_50 %>%
  textstat_frequency() %>%
  head(10)

top10 <- top10 %>% select(feature, frequency)

figure2 <- ggplot(top10, aes(x = reorder(feature, -frequency), y = frequency)) +
         geom_bar(stat = "identity", fill = "steelblue4") +
         labs(title = "Figure 2. Top 10 Most Common Words in Job Postings", 
              x = "Word", 
              y = "Frequency (n)") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 55000))  # Setting y-axis limits

# NO SIGNIFICANT DIFFERENCES IN MOST COMMONLY OCCURRING WORDS BY REAL OR FRAUDULENT STATUS
# Subsetting by fraudulent status 
fraud <- subset(df, fraudulent == 1)
real <- subset(df, fraudulent == 0)
# Creating factor of dependent variable, fraudulent (1 = Fraudulent, 0 = Real)
df$fraudulent <- as.factor(df$fraudulent)

# Selecting necessary columns 
fraud_jobs <- fraud %>% 
 select(title, location, department, company_profile, description, requirements, benefits, employment_type, industry, `function`,fraudulent)

# Uniting text columns 
fraud_jobs <- fraud_jobs %>% 
  unite(col = "text", c("title", "location", "department", "company_profile", "description", "requirements", "benefits", "employment_type", "industry", "function"), sep = " ")

 # Saving NLTK stopwords
nltk <- c('had', 'most', "aren't", "shan't", 'such', 'his', 'at', 'which', 'd', 'i', 'yourself', 'nor', 're', 'being', 'won', 'itself', 'don', 'for', 'my', 'what', 'was', 've', 'aren', "wasn't", 'wouldn', 'than', 'before', 'shouldn', 'our', 'the', 'ma', 'it', 'hadn', 'them', 'through', 'who', "mustn't", 'shan', 'couldn', 'haven', "couldn't", 'those', "should've", "you've", 'yourselves', 'by', 'on', 'during', 'their', 'further', 'with', 'will', 'himself', 'be', 'any', 'some', 'until', 'too', 'between', 'can', 'your', 'off', 'weren', 'hasn', 'up', 'hers', 'ain', 'again', 'below', 'same', 'themselves', "that'll", 'should', 'each', 'both', 'we', 'herself', 'yours', 'to', "hadn't", "needn't", 'while', 'above', 'but', 'her', 'under', "isn't", 'only', "haven't", 'its', 'wasn', 'is', "doesn't", 'doing', "didn't", 'you', 'theirs', 'an', 's', 'when', 'against', 'ours', 'ourselves', 'out', 'more', 'are', 'where', 'down', 'no', 'in', 'have', 'were', 'mustn', 'having', 'now', 'they', 'here', 'does', 'whom', 'him', 'm', "it's", 'll', "mightn't", 'am', 'about', 'other', 'from', 'has', 'or', 'so', 'how', 'very', 'he', 'o', 'doesn', 'own', 'once', 'y', 'few', 'just', 'isn', 'been', 'because', "wouldn't", "she's", 'as', 'over', 'after', 'didn', 'these', 'then', "don't", 'she', 'if', 'why', 'not', "weren't", 'into', 'all', 'that', "you'd", 'myself', 'needn', 'me', "won't", 'mightn', 'a', 'do', 'of', 't', "shouldn't", "you'll", "hasn't", 'this', 'there', "you're", 'did', 'and')

# Use Quanteda for pre-processing
fraud_job_corpus <- corpus(as.character(fraud_jobs$text)) %>% 
          tokens(remove_punct = TRUE, # Removing punctuation
                  remove_symbols = TRUE, # Removing symbols
                  remove_url = TRUE, # Removing URLs
                  remove_separators = TRUE,# Removing separators
                  split_hyphens = TRUE, # Splitting hyphenated words (e.g., self-aware) 
                  remove_numbers = TRUE) %>% # Removing numbers
         tokens_remove(pattern = nltk) %>%  # Removing NLTK stopwords
          tokens_wordstem(language = "english")  %>% # Stemming tokens using SnowballC
         tokens_tolower() # making sure all tokens are lowercase 
 
        # Creating dfm
         fraud_job_dfm <- dfm(fraud_job_corpus)
    
       
       # Trimming document frequency matrix to only include those tokens that appear at least 10 times
      fraud_job_dfm_trim <- dfm_trim(fraud_job_dfm, min_termfreq = 50)

       # Converting dfm to matrix
     fraud_job_matrix <- as.matrix(fraud_job_dfm_trim)
       
        
      
   # Getting the most commonly occurring tokens
  fraud_freq <- fraud_job_dfm %>%
  textstat_frequency() %>%
    head(20)
  
  # Selecting necessary columns 
real_jobs <- real %>% 
  select(title, location, department, company_profile, description, requirements, benefits, employment_type, industry, `function`,fraudulent)

# Uniting text columns 
real_jobs <- real_jobs %>% 
  unite(col = "text", c("title", "location", "department", "company_profile", "description", "requirements", "benefits", "employment_type", "industry", "function"), sep = " ")

# Saving NLTK stopwords
nltk <- c('had', 'most', "aren't", "shan't", 'such', 'his', 'at', 'which', 'd', 'i', 'yourself', 'nor', 're', 'being', 'won', 'itself', 'don', 'for', 'my', 'what', 'was', 've', 'aren', "wasn't", 'wouldn', 'than', 'before', 'shouldn', 'our', 'the', 'ma', 'it', 'hadn', 'them', 'through', 'who', "mustn't", 'shan', 'couldn', 'haven', "couldn't", 'those', "should've", "you've", 'yourselves', 'by', 'on', 'during', 'their', 'further', 'with', 'will', 'himself', 'be', 'any', 'some', 'until', 'too', 'between', 'can', 'your', 'off', 'weren', 'hasn', 'up', 'hers', 'ain', 'again', 'below', 'same', 'themselves', "that'll", 'should', 'each', 'both', 'we', 'herself', 'yours', 'to', "hadn't", "needn't", 'while', 'above', 'but', 'her', 'under', "isn't", 'only', "haven't", 'its', 'wasn', 'is', "doesn't", 'doing', "didn't", 'you', 'theirs', 'an', 's', 'when', 'against', 'ours', 'ourselves', 'out', 'more', 'are', 'where', 'down', 'no', 'in', 'have', 'were', 'mustn', 'having', 'now', 'they', 'here', 'does', 'whom', 'him', 'm', "it's", 'll', "mightn't", 'am', 'about', 'other', 'from', 'has', 'or', 'so', 'how', 'very', 'he', 'o', 'doesn', 'own', 'once', 'y', 'few', 'just', 'isn', 'been', 'because', "wouldn't", "she's", 'as', 'over', 'after', 'didn', 'these', 'then', "don't", 'she', 'if', 'why', 'not', "weren't", 'into', 'all', 'that', "you'd", 'myself', 'needn', 'me', "won't", 'mightn', 'a', 'do', 'of', 't', "shouldn't", "you'll", "hasn't", 'this', 'there', "you're", 'did', 'and')


# Use Quanteda for pre-processing
        real_job_corpus <- corpus(as.character(real_jobs$text)) %>% 
          tokens(remove_punct = TRUE, # Removing punctuation
                 remove_symbols = TRUE, # Removing symbols
                 remove_url = TRUE, # Removing URLs
                 remove_separators = TRUE,# Removing separators
                 split_hyphens = TRUE, # Splitting hyphenated words (e.g., self-aware) 
                 remove_numbers = TRUE) %>% # Removing numbers
         tokens_remove(pattern = nltk) %>%  # Removing NLTK stopwords
         tokens_wordstem(language = "english")  %>% # Stemming tokens using SnowballC
         tokens_tolower() # making sure all tokens are lowercase 

        # Creating dfm
        real_job_dfm <- dfm(real_job_corpus)
        
        
        # Trimming document frequency matrix to only include those tokens that appear at least 10 times
        real_job_dfm_trim <- dfm_trim(real_job_dfm, min_termfreq = 50)

        # Converting dfm to matrix
        real_job_matrix <- as.matrix(real_job_dfm_trim)
        
        
        # Getting the most commonly occurring tokens
     real_freq <-  real_job_dfm %>%
          textstat_frequency() %>%
          head(20)
    
# Printing results by fraudulent status
     real_freq
   feature frequency rank docfreq group
1     work     51119    1   13786   all
2       na     37459    2   14243   all
3     team     36051    3   12250   all
4    manag     34612    4   10849   all
5   servic     33056    5   10254   all
6  develop     31895    6    9674   all
7   experi     30693    7   13362   all
8     time     28820    8   14120   all
9   custom     28570    9    8129   all
10 compani     26635   10   11047   all
11 product     26027   11    8546   all
12    busi     24085   12    8597   all
13      us     22103   13   12826   all
14  market     20788   14    6093   all
15  client     19454   15    6817   all
16    full     19062   16   12691   all
17  provid     18805   17    9682   all
18     new     18359   18    8530   all
19    sale     18334   19    4370   all
20  design     18334   19    5779   all
     fraud_freq
   feature frequency rank docfreq group
1       na      2508    1     739   all
2     work      2230    2     716   all
3    manag      1535    3     460   all
4     time      1515    4     744   all
5   servic      1461    5     570   all
6    skill      1236    6     537   all
7  product      1190    7     287   all
8   experi      1172    8     544   all
9      amp      1154    9     378   all
10  custom      1113   10     405   all
11 develop      1042   11     334   all
12      us       957   12     763   all
13  requir       957   12     492   all
14    full       936   14     668   all
15    team       913   15     391   all
16   posit       907   16     446   all
17 compani       905   17     464   all
18  provid       903   18     436   all
19 project       837   19     253   all
20    busi       783   20     330   all

C. Figure 3. Comparison of model F1 scores.

# Creating Dataset of F1 scores
fig3_data <- data.frame(f1score = c(99, 99, 99, 99, 99, 96, 96, 94, 87, 80),
model_name = c("RF", "ETC", "LR", "SVM", "MLP", "NB-3", "NB-4", "NB-2", "KNN", "NB-1"),
model_type = c("Amaar et al. (2022)", "Amaar et al. (2022)", "Amaar et al. (2022)", "Amaar et al. (2022)", "Amaar et al. (2022)", "Current Study", "Current Study", "Current Study", "Amaar et al. (2022)", "Current Study"))

fig3_data$Source <- as.factor(fig3_data$model_type)
# Reorder model_name based on f1score
fig3_data$model_name <- factor(fig3_data$model_name, levels = fig3_data$model_name[order(fig3_data$f1score, decreasing = TRUE)])


# Creating visualization
figure3<-ggplot(data = fig3_data, aes(x=model_name, y=f1score, fill=Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Figure 3. F1 Score by Classification Model", 
       subtitle = "Summary of F1 Scores (Range 0-1) across several classification models of job postings within\ncurrent study and Amaar et al. (2022)",
       x = "Model Abbreviation", 
       y = "F1-Score") +
 scale_fill_manual(values = c('maroon', 'steelblue4')) +  # Specify fill colors
  theme_minimal()

Footnotes

  1. Due to limitations in computational power, the ADASYN technique could not be applied to the entire dataset, as was done in Amaar et al. (2022). However, 2,000 rows is a more than adequate sample size for running a Naïve Bayes classification algorithm while avoiding program crashes and excessive run times.↩︎