Real or Fake? Using a Naïve Bayes Classifier to Identify Fraudulent Job Postings
Author
Kimberly Ouimette
Abstract
The proliferation of fraudulent job postings presents a significant concern globally, with negative implications for both individuals and economies. In this study, I aim to replicate the preprocessing steps outlined by Amaar et al. (2022) and extend their work by comparing the performance of Naïve Bayes classifiers to other machine learning algorithms (e.g., random forest, support vector machines) in predicting fraudulent job postings. Using the same dataset of real job advertisements analyzed by Amaar et al. (2022), I replicated their text preprocessing techniques and oversampling methods to address imbalanced data and then applied a Naïve Bayes classification model. Results indicate that while Naïve Bayes models perform well, they do not surpass the performance of the other machine learning algorithms, with the exception of K Nearest Neighbor. Notably, further refinement of the preprocessing steps (i.e., reducing the feature space) improved Naïve Bayes performance significantly, highlighting the importance of preprocessing in text analysis via machine learning. These findings contribute to a growing literature on how different machine learning algorithms perform in detecting fraudulent job postings and emphasize the need for comparative analysis in text classification.
Introduction
Background
The proliferation of fraudulent job postings presents a significant and burgeoning concern within the United States, impacting approximately 14 million people annually and resulting in $2 billion in direct losses (Better Business Bureau, 2020). Alarmingly, losses reported to the Federal Bureau of Investigation's Internet Crime Complaint Center rose by 27 percent between 2018 and 2020 (Better Business Bureau, 2020), highlighting the growing threat posed by online job scams. While anyone can be susceptible to a job scam, some groups are disproportionately affected. Specifically, people aged 25-34 accounted for 28 percent of job scams reported to the Better Business Bureau (BBB) between 2017 and March 2020. Additionally, women (66.7%) and unemployed individuals (54%) accounted for a significant portion of complaints to the BBB during that time frame.
In response to this growing crisis, regulatory bodies like the BBB have emphasized the urgent need for online job boards to bolster their screening procedures (Better Business Bureau, 2020). Past research has demonstrated the efficacy of machine learning methodologies in detecting online fraudulent activities, including fraudulent job postings (Naudé et al., 2022; Vidros et al., 2017; Amaar et al., 2022). Furthermore, existing studies on fraud detection in human-generated content, such as email spam (CITATION), online reviews (Ott et al., 2013), and fake news (Braşoveanu & Andonie, 2021), underscore the applicability of supervised machine learning techniques in identifying fraud.
Of particular importance to this study, Amaar and colleagues (2022) compared the efficacy of several types of supervised machine learning models (i.e., Support Vector Machines, Random Forest, Logistic Regression, Extra-Trees Classifier, K-Nearest Neighbor, and Multilayer Perceptron) in predicting fraudulent job descriptions and found their extra trees classifier (ETC) to perform the best, achieving accuracy, precision, and recall rates over 99%. Despite comparing multiple machine learning methods, Amaar et al. (2022) did not include another prominent supervised classifier, Naïve Bayes. Although research prior to Amaar et al. (2022) has compared Naïve Bayes to other machine learning approaches (e.g., random forests, K-Nearest Neighbor) on the same dataset, different techniques have emerged as the top performers for this use case (Dutta & Bandyopadhyay, 2020). These differences may stem from variations in text pre-processing approaches, as Amaar et al. (2022) highlight the sensitivity of machine learning algorithms' performance, particularly that of Naïve Bayes, to pre-processing techniques.
To ensure an accurate comparison of Naïve Bayes classification with other machine learning techniques in identifying fraudulent job postings, this study replicates the text pre-processing techniques utilized by Amaar et al. (2022) and applies a Naïve Bayes classification model to the same data. By employing the same data cleaning and text pre-processing procedures as the original study, this research strives to provide a robust evaluation of the performance of Naïve Bayes classifiers relative to other prominent machine learning algorithms, including Support Vector Machines, Random Forest, Logistic Regression, Extra Trees Classifier, K-Nearest Neighbor, and Multilayer Perceptron. Through this approach, the study aims to offer insights into the comparative effectiveness of Naïve Bayes in the domain of fraudulent job posting detection, thereby contributing to a deeper understanding of the strengths and limitations of various supervised classification methodologies.
Research Question
How can supervised learning algorithms, specifically Naïve Bayes classifiers, effectively differentiate between real and fraudulent job descriptions?
Method
Data Acquisition
The data derive from a Kaggle dataset (Bansal, 2020) that republishes the Employment Scam Aegean Dataset (EMSCAD), an open dataset released by Vidros et al. (2017) containing real-life job advertisements posted to Workable, an online job posting platform, between 2012 and 2014. The dataset contains 17,880 job postings, 866 (4.8%) of which were classified as fraudulent. In addition, the dataset includes several text fields, including job title, company biography, job description, job requirements, required education, and department. Furthermore, the dataset contains binary indicators of whether the posting is remote eligible, contains the company logo, and is fraudulent. Overall, the dataset provides sufficient observations of both verified and fraudulent job postings to develop and deploy a Naïve Bayes supervised classification model and is identical to the data utilized by Amaar et al. (2022).
Variables
The objective of this algorithm is to correctly distinguish between real and fraudulent job postings based on a document frequency matrix of the words used in their job descriptions. Taking this into consideration, the dependent variable of interest (i.e., 'fraudulent') is a binary variable where an assigned value of 0 represents a verified job posting and a value of 1 represents a fraudulent job posting. The document frequency matrix of the job descriptions and other characteristics (e.g., job title, location), generated via the quanteda package in R, served as the predictors within the Naïve Bayes algorithm. Table 1 below lists all variables within the dataset, a brief description of what they represent, and sample input (from row 7). In accordance with the analysis conducted by Amaar et al. (2022), only the following textual variables were included in the document frequency matrix: company's profile, location, job description, job title, department, benefits, job requirements, type of employment, industry, and function.
Table 1. Attributes of dataset.
Variable | Description | Example
job_id | Unique identifier of job posting | 7
title | Title of job advertisement | Head of Content (m/f)
location | Geographical location of job posting | DE, BE, Berlin
department | Corporate department (e.g., sales, marketing, human resources) | ANDROIDPIT
salary_range | Posted salary range (if applicable) | 20000-28000
company_profile | Short description of company (e.g., mission statement, history) | Founded in 2009, the Fonpit AG rose with its international web portal ANDROIDPIT to the world's largest Android community…
description | Job description | Your Responsibilities: Manage the English-speaking editorial team and build a team of best-in-class editors…
requirements | List of job requirements | University or college degree in journalism, media or other communication studies…
benefits | Describes benefits offered for position | Your Benefits: Being part of a fast-growing company in a booming industry; fast decision-making thanks to flat hierarchies and clear structures…
telecommuting | Binary indicator of whether job offers working from home (1 = True, 0 = False) | 0
has_company_logo | Binary indicator of whether job posting has company logo (1 = True, 0 = False) | 1
has_questions | Binary indicator of whether job posting includes screening questions (1 = True, 0 = False) | 1
employment_type | Indicates type of employment (e.g., full-time, part-time, contract-based) | Full-Time
required_experience | Required experience level for position (e.g., Mid-Senior Level, Entry Level) | Mid-Senior Level
required_education | Required education level for position (e.g., High School, College) | Master's Degree
industry | Type of industry (e.g., Health Care, Computer Software) | Online Media
function | Description of job function | Management
fraudulent | Binary indicator of whether job is fraudulent (1) or real (0) | 0
Text Pre-Processing
In accordance with Amaar and colleagues' (2022) approach to pre-processing the textual data, the following steps were taken:
Merging of Textual Fields. The following 10 textual fields were merged into one column (i.e., "text"): company's profile, location, job description, job title, department, benefits, job requirements, type of employment, industry, and function. Once these columns were united, a corpus, and subsequently a document frequency matrix, of the "text" column was created via the quanteda package in R, as sketched below.
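A minimal sketch of this merging step is shown below, using the tidyverse and quanteda packages; the unite() call and the object names (df, job_corpus) are illustrative and may differ from the exact code used in this study. The document frequency matrix itself is built after the cleaning steps sketched in the following subsections.

# Illustrative sketch: merging the ten textual fields and creating a corpus
library(tidyverse)
library(quanteda)

df <- read_csv("fake_job_postings.csv")

# Combine the ten textual fields into a single "text" column
# (missing fields are pasted as the literal string "NA", consistent with the
#  frequent "na" token visible in the Appendix frequency tables)
df <- df %>%
  unite("text",
        company_profile, location, description, title, department,
        benefits, requirements, employment_type, industry, `function`,
        sep = " ", remove = FALSE)

# Create a corpus from the merged text column
job_corpus <- corpus(df, text_field = "text")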
Stop-Words Removal. Stopwords are words that help humans parse sentences grammatically; however, they add little information to the sentence and can overcomplicate machine learning models (Amaar et al., 2022). Unfortunately, Amaar and colleagues' (2022) approach to removing stopwords via the Natural Language Toolkit (NLTK) is only available in Python, not in R. However, I was able to obtain the list of stop words from the NLTK package documentation and removed all 179 stopwords (e.g., "had", "most", "through") from the corpus.
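As a sketch of this workaround, the NLTK stop-word list can be stored as a character vector in R and dropped with quanteda's tokens_remove(); the vector below is abbreviated for space, and the full 179-word list would be copied from the NLTK documentation.

# Abbreviated NLTK English stop-word list (full 179-word list taken from the NLTK documentation)
nltk_stopwords <- c("i", "me", "my", "myself", "we", "our", "had", "most", "through")

# Remove the stop words from the tokenized corpus (job_corpus from the sketch above)
job_tokens <- tokens(job_corpus) %>%
  tokens_remove(nltk_stopwords)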
Punctuation Removal. Punctuation assists the reader in understanding the message being conveyed (e.g., possession, the end or separation of a thought). However, similar to stopwords, these tokens are not useful in machine learning processes. In this analysis, punctuation (i.e., !.,¿¡/([-=+&%$#)]) was removed via the quanteda package. Similarly, other non-alphanumeric characters and URLs were removed from the corpus.
Numerical Removal. Similar to punctuation, numerical characters do not add any specific meaning within text analysis. Thus, numerical characters were removed to reduce the size of the feature space and improve model performance.
Stemming. Stemming reduces different iterations of the same word to their root (e.g., "go", "going", "gone" -> "go"). Amaar et al. (2022) employed the Porter stemmer to further trim their feature space down to word roots. In R, the SnowballC package implements this same Porter stemming algorithm.
Case Normalization. To prevent the casing of letters (e.g., upper vs. lower case) from adding unnecessary tokens to the corpus, all words were lowercased.
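The remaining cleaning steps can be expressed as a single quanteda pipeline, sketched below under the assumption that job_tokens was created in the stop-word step above; the exact arguments used in the original analysis may differ.

# Punctuation, symbol, number, and URL removal, lowercasing, and Porter stemming
job_tokens <- tokens(job_tokens,
                     remove_punct = TRUE,
                     remove_symbols = TRUE,
                     remove_numbers = TRUE,
                     remove_url = TRUE) %>%
  tokens_tolower() %>%                   # case normalization
  tokens_wordstem(language = "porter")   # Porter stemming via SnowballC

# Document frequency matrix of the cleaned tokens
job_dfm <- dfm(job_tokens)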
Feature Engineering. Finally, to minimize the feature space and improve model performance, the document frequency matrix was trimmed to include only those tokens that appeared at least 10 times. Unfortunately, Amaar et al. (2022) did not extensively document their criteria for feature engineering. In an attempt to replicate their process, this study employed a rather conservative cut-off such that only the most infrequent terms (fewer than 10 occurrences) were excluded from the analysis. For future reference, the model with this n = 10 cutoff is referred to as Model 1.
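A minimal sketch of this trimming step uses quanteda's dfm_trim() on the document frequency matrix built above; the object name job_dfm_10 is illustrative.

# Model 1: keep only tokens that appear at least 10 times across the corpus
job_dfm_10 <- dfm_trim(job_dfm, min_termfreq = 10)

# Number of tokens (features) remaining after trimming
nfeat(job_dfm_10)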
This section covers the construction, evaluation, and comparison of the Naïve Bayes classifier to the machine learning algorithms detailed in Amaar et al. (2022).
Oversampling
As discussed earlier, the original EMSCAD dataset, despite containing nearly 18,000 job postings, was highly imbalanced, with nearly 95% of job postings being verified. Imbalanced datasets can negatively impact model performance by biasing predictions toward the more common outcome (Gosain & Sardana, 2017). To address this issue in line with the approach outlined in Amaar et al. (2022), I employed the Adaptive Synthetic Sampling (ADASYN) technique on a subset of 2,000 rows of the original data via the UBL package in R.1 ADASYN is an oversampling technique that generates synthetic cases of the underrepresented outcome in a dataset, so a dataset to which ADASYN has been applied becomes more balanced. The breakdown of the percentage of fraudulent cases in the subset before and after the application of ADASYN can be seen in Figure 1 below; the application of ADASYN increased the proportion of fraudulent cases from 5% to 50%. Furthermore, a breakdown of the sample sizes of fraudulent and real job postings before and after applying ADASYN is documented in Table 2. In total, the ADASYN technique added 1,811 synthetic fraudulent cases to the dataset.
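A condensed sketch of this oversampling step is shown below; the full version appears in Appendix II.A. Here, job_df is assumed to be the trimmed document frequency matrix converted to a data frame with the fraudulent column (stored as a factor) appended.

# Condensed sketch of ADASYN oversampling (full code in Appendix II.A)
pacman::p_load(UBL)
set.seed(33)

job_subset <- sample_n(job_df, 2000)                   # 2,000-row subset
new_data <- AdasynClassif(fraudulent ~ ., job_subset)  # synthesize minority-class cases

# Class balance before and after ADASYN
prop.table(table(job_subset$fraudulent)) * 100
prop.table(table(new_data$fraudulent)) * 100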
Table 2. Summary of legitimate and fraudulent postings across imbalanced and balanced (i.e., after application of ADASYN technique) datasets.
Once a balanced dataset was achieved, the data were split into training and testing datasets in accordance with the ratio set by Amaar et al. (2022). Specifically, roughly 80 percent, or 3,048 rows, were randomly assigned to the training dataset; the remaining 20 percent, or 763 rows, were assigned to the testing dataset. A breakdown of verified and fraudulent postings within the training and testing datasets is outlined in Table 3.
Table 3. Breakdown of verified and fraudulent postings within training (N = 3,048) and testing (N = 763) datasets.
The Naïve Bayes classification model was constructed via the e1071 package in R and was first trained on the 3,048 rows of training data. The algorithm was then applied to the smaller testing dataset (N = 763). A confusion matrix was then generated, from which the accuracy, precision, recall, and F1-score of the classification model were calculated. These values were then compared to those of the machine learning algorithms employed by Amaar et al. (2022), reported on page 2238 of their article.
Upon successful replication of the text preprocessing and training/testing data preparation outlined by Amaar et al. (2022), further refinements to the original preprocessing process were explored. Specifically, the original preprocessing steps used a rather conservative cut-off such that only the most infrequent terms (fewer than 10 occurrences) were excluded from the document frequency matrix. To gauge the impact of this feature engineering on model performance, this threshold was increased to a minimum of 50 (Model 2), 75 (Model 3), and 100 (Model 4) appearances of a token in the document frequency matrix. All other preprocessing steps remained the same, in accordance with the original plan set out by Amaar et al. (2022).
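The Appendix repeats the trimming, oversampling, and modeling pipeline once per threshold; purely as a hypothetical, compact alternative, the token counts at each threshold could also be computed in a single pass:

# Hypothetical compact comparison of trimming thresholds (the Appendix instead repeats
# the full pipeline once per threshold)
thresholds <- c(10, 50, 75, 100)
token_counts <- sapply(thresholds, function(n) nfeat(dfm_trim(job_dfm, min_termfreq = n)))
data.frame(min_termfreq = thresholds, n_tokens = token_counts)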
Table 4 illustrates the token count for each level of document frequency matrix trimming. Notably, the number of tokens decreased dramatically from nearly 100,000 with no minimum threshold to only 12,573 with a threshold of 10. Unsurprisingly, each increase in the minimum threshold further decreased the number of tokens in the document frequency matrix, although there was a drop of only 560 tokens between the thresholds of 75 and 100.
Additionally, Figure 2 illustrates the 10 most commonly occurring tokens in the document frequency matrix. Although the order changed slightly, there were minimal differences in the most commonly occurring words between verified and fraudulent job postings; thus, Figure 2 illustrates only the most commonly occurring tokens for the entire dataset. The most commonly occurring words for each group can be found in Section III (Visualization), Part B of the Appendix.
Table 4. Number of tokens at each token frequency minimal threshold within document frequency matrix.
Model Number | Token Frequency Minimum Threshold | Number of Tokens within Document Frequency Matrix
No Minimum Threshold | 0 | 98,580
Model 1 | 10 | 12,573
Model 2 | 50 | 4,803
Model 3 | 75 | 3,853
Model 4 | 100 | 3,293
Model Performance
Across all classification algorithms used by Amaar et al. (2022; i.e., Random Forest, Extra Tree Classifier, Logistic Regression, Support Vector Machine, Multilayer Perceptron, and K Nearest Neighbor), high levels of accuracy, precision, recall, and F1 scores were achieved, consistently hovering around 99 percent. Notably, the K Nearest Neighbor model demonstrated slightly lower performance metrics, particularly in precision (77%) and F1 score (87%).
Despite replicating the preprocessing technique utilized by Amaar et al. (2022) as closely as possible, the Naïve Bayes models generated within this study did not achieve as high levels of performance, as shown in Table 5. Model performance did, however, improve with further trimming of the document frequency matrix. Specifically, Models 3 and 4 demonstrated the highest performance, with accuracy, precision, recall, and F1 scores all around 96%, outperforming Models 1 and 2 as well as the K-Nearest Neighbor model in Amaar et al. (2022).
In examining the confusion matrices shown in Table 6, we see that although the number of false positives (i.e., real postings erroneously classified as fraudulent) rises from 2 or fewer in Models 1 and 2 to 17-18 in Models 3 and 4, the number of false negatives (i.e., fraudulent postings erroneously classified as verified) decreases tremendously, from nearly 200 in the first iteration to only 13 in Models 3 and 4. Although this shift slightly decreased the models' reported recall, in this use case minimizing the false negative rate is most favorable given the damage that an unidentified fraudulent job posting can inflict on job seekers.
Furthermore, when examining the F1 scores of all classification models (see Figure 3), we see that the Naïve Bayes models, with the exception of Model 1, outperform only the K-Nearest Neighbor model in Amaar et al. (2022). F1 scores range from 0 to 1 and are used in binary classification tasks to evaluate a model's performance by combining precision and recall into a single value (Amaar et al., 2022). A high F1 score (close to 1) indicates the model has both high precision and high recall, meaning it makes accurate predictions and captures most of the positive instances in the dataset. Conversely, a low F1 score (close to 0) suggests that the model struggles to make accurate positive predictions or fails to identify many positive instances in the dataset. Thus, although Models 3 and 4 achieved F1 scores of 96%, they remain less sensitive in identifying fraudulent job postings than the other machine learning algorithms employed by Amaar et al. (2022), namely random forest, extra trees classifier, support vector machines, and multilayer perceptron.
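As a worked example of how these metrics relate, the Model 3 values can be recomputed from the Table 6 counts, treating fraudulent postings as the positive class (the caret output in the Appendix instead reports class 0 as the positive class, so the printed figures differ slightly):

# Worked example: precision, recall, and F1 for Model 3 from the Table 6 counts,
# treating fraudulent postings as the positive class
tp <- 380; fp <- 18; fn <- 13

precision <- tp / (tp + fp)                          # ~0.95
recall <- tp / (tp + fn)                             # ~0.97
f1 <- 2 * precision * recall / (precision + recall)  # ~0.96, i.e., roughly 96%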
Table 5. Model performance of the algorithms reported by Amaar et al. (2022) and the four Naïve Bayes models produced in this study. Naïve Bayes Models 1 through 4 correspond to the different minimum frequency thresholds (i.e., n = 10, 50, 75, and 100) a token had to meet to be included in the document frequency matrix.
Model | Accuracy | Precision | Recall | F1 Score
Random Forest | 99% | 99% | 99% | 99%
Extra Tree Classifier | 99% | 99% | 99% | 99%
Logistic Regression | 99% | 99% | 98% | 99%
Support Vector Machine | 99% | 99% | 98% | 99%
Multilayer Perceptron | 99% | 99% | 100% | 99%
K Nearest Neighbor | 85% | 77% | 100% | 87%
Naïve Bayes, Model 1 | 75% | 67% | 100% | 80%
Naïve Bayes, Model 2 | 94% | 89% | 99% | 94%
Naïve Bayes, Model 3 | 96% | 96% | 95% | 96%
Naïve Bayes, Model 4 | 96% | 95% | 96% | 96%
Table 6. Naïve Bayes classification model confusion matrices.
Note: When interpreting this table, please keep the following in mind: 1) True positives represent those fraudulent postings that were correctly identified as fraudulent. 2) False positives represent those verified postings that were falsely classified as fraudulent. 3) True negatives represent those verified postings that were correctly identified as verified. 4) False negatives represent those fraudulent postings that were falsely identified as verified.
Model Number | True Positives | False Positives | True Negatives | False Negatives
Naïve Bayes, Model 1 | 180 | 0 | 391 | 192
Naïve Bayes, Model 2 | 333 | 2 | 385 | 44
Naïve Bayes, Model 3 | 380 | 18 | 360 | 13
Naïve Bayes, Model 4 | 387 | 17 | 353 | 13
Key: RF: Random Forest; ETC: Extra Tree Classifier; LR: Logistic Regression; SVM: Support Vector Machine; MLP: Multilayer Perceptron; KNN: K Nearest Neighbor; NB-1: Naïve Bayes, Model 1; NB-2: Naïve Bayes, Model 2; NB-3: Naïve Bayes, Model 3; NB-4: Naïve Bayes, Model 4.
Note: Code evaluating model performance can be found in Section II (Naïve Bayes Classifier), Part C (Model Construction & Confusion Matrix) of the Appendix. Additionally, model performance from the Amaar et al. (2022) article can be found in Table 12, columns 1-7, on page 2239 of the original publication.
Discussion
This study sought to replicate the text preprocessing and data transformation steps outlined by Amaar et al. (2022) to compare the performance of a Naïve Bayes classification algorithm against other prominent machine learning algorithms (i.e., Random Forest, Extra Tree Classifier, Logistic Regression, Support Vector Machines, Multilayer Perceptron, and K Nearest Neighbor) in predicting fraudulent job postings. Ultimately, the Naïve Bayes models generated in this study, although performing well, did not exceed the performance standards set by the other machine learning algorithms utilized by Amaar et al. (2022), with the exception of K Nearest Neighbor.
These findings suggest that while Naïve Bayes classifiers offer a viable approach to classifying job postings, they may not be the best choice for capturing the underlying, nuanced patterns in job description data. However, it is worth highlighting that the Naïve Bayes models demonstrated competitive performance, particularly Models 3 and 4, where further trimming of the document frequency matrix improved overall performance. It is possible that, with further exploration of the predictive quality of individual tokens and subsequent trimming of the document frequency matrix, these performance metrics could improve further.
Additionally, it is worth highlighting that complete replication of the preprocessing steps outlined by Amaar et al. (2022) was not possible, particularly in regard to feature engineering. Although Amaar et al. (2022) briefly described their method for trimming their document frequency matrix, the description was unfortunately too vague to apply the same methodology in this study. Furthermore, due to limitations in computational power, I was unable to utilize the full dataset for this analysis, as was done in Amaar et al. (2022). Therefore, it is possible that the performance of the Naïve Bayes models reported here would change if those feature engineering steps were known and followed and if all observations were included.
In conclusion, while Naïve Bayes classifiers offer a straightforward and interpretable approach to text classification, they may not always yield the highest performance in more complex scenarios, such as predicting job scams. Ultimately, this study contributes to the broader understanding of machine learning algorithms’ performance in identifying fraudulent job postings and highlights the importance of comparative analysis in machine learning research.
References
Braşoveanu, A. M., & Andonie, R. (2021). Integrating machine learning techniques in semantic fake news detection. Neural Processing Letters, 53(5), 3055-3072. https://doi.org/10.1007/s11063-020-10365-x
Dutta, S., & Bandyopadhyay, S. K. (2020). Fake job recruitment detection using machine learning approach. International Journal of Engineering Trends and Technology, 68(4), 48-53. https://doi.org/10.14445/22315381/IJETT-V68I4P209S
Gosain, A., & Sardana, S. (2017, September). Handling class imbalance problem using oversampling techniques: A review. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI) (pp. 79-85). IEEE. https://doi.org/10.1109/ICACCI.2017.8125820
Naudé, M., Adebayo, K. J., & Nanda, R. (2022). A machine learning approach to detecting fraudulent job types. AI & Society, 38(2), 1013-1024. https://doi.org/10.1007/s00146-022-01469-0
Ott, M., Cardie, C., & Hancock, J. T. (2013, June). Negative deceptive opinion spam. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 497-501).
Vidros, S., Kolias, C., Kambourakis, G., & Akoglu, L. (2017). Automatic detection of online recruitment frauds: Characteristics, methods, and a public dataset. Future Internet, 9(1), 6. https://doi.org/10.3390/fi9010006
Note: References to specific R packages utilized are linked within text throughout report.
# Setting working directory
wd <- here()
setwd(wd)

# Reading in data
df <- read_csv('fake_job_postings.csv')
Rows: 17880 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): title, location, department, salary_range, company_profile, descri...
dbl (6): job_id, STEM, telecommuting, has_company_logo, has_questions, frau...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feature frequency rank docfreq group
1 work 53349 1 14502 all
2 na 39967 2 14982 all
3 team 36964 3 12641 all
4 manag 36147 4 11309 all
5 servic 34517 5 10824 all
6 develop 32937 6 10008 all
7 experi 31865 7 13906 all
8 time 30335 8 14864 all
9 custom 29683 9 8534 all
10 compani 27540 10 11511 all
11 product 27217 11 8833 all
12 busi 24868 12 8927 all
13 us 23060 13 13589 all
14 market 21151 14 6295 all
15 full 19998 15 13359 all
16 client 19926 16 7049 all
17 provid 19708 17 10118 all
18 skill 19342 18 10142 all
19 new 18939 19 8832 all
20 design 18868 20 6010 all
II. Naïve Bayes Classifier
A. Oversampling
# Setting seed for reproducibility
set.seed(33)

# Loading UBL package
pacman::p_load(UBL)

# Converting job_matrix to data frame
job_df <- as.data.frame(job_matrix)

# Adding fraudulent column to job document frequency matrix
job_df$fraudulent <- df$fraudulent

# Taking a sample of 2,000 rows
job_subset <- sample_n(job_df, 2000)

# Defining formula for prediction problem
formula <- fraudulent ~ .

# Applying ADASYN to dataset
new_data <- AdasynClassif(formula, job_subset)
B. Training & Testing Data
# Setting seed for reproducibility
set.seed(33)

# Splitting sample into a training set and testing set
sample <- sample.int(n = nrow(new_data), size = floor(.80 * nrow(new_data)), replace = F)

# Selecting training dataset
train_matrix <- new_data[sample, ]
train_labels <- new_data$fraudulent[sample]
test_matrix <- new_data[-sample, ]
test_labels <- new_data$fraudulent[-sample]

# Saving data as a fail safe
write.csv(test_matrix, "testdata.csv")
write.csv(train_matrix, "traindata.csv")

# Reading testing and training data back in
test_data <- read_csv('testdata.csv')
New names:
Rows: 763 Columns: 12574
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(1): ...1 dbl (12573): market, intern, us, ny, new, york, we'r, food52, we'v,
creat, g...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
train_data <-read_csv('traindata.csv')
New names:
Rows: 3048 Columns: 12574
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(1): ...1 dbl (12573): market, intern, us, ny, new, york, we'r, food52, we'v,
creat, g...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
# Removing first column of both datasets
test_data <- test_data[, -1]
train_data <- train_data[, -1]

# Getting counts & proportions
# Training Data
prop.table(table(train_data$fraudulent)) * 100
# Removing 'fraudulent' column from training and testing datasets
train_remove <- train_data %>% select(-c(fraudulent))
test_remove <- test_data %>% select(-c(fraudulent))

# Creating matrices to pass into e1071
train_matrix <- as.matrix(train_remove)
test_matrix <- as.matrix(test_remove)

# Creating factor of dependent variable (fraudulent)
category <- as.factor(train_data$fraudulent)
levels(category)
[1] "0" "1"
C. Model Construction & Confusion Matrix
# Training model
nb <- e1071::naiveBayes(x = train_matrix, y = train_labels, method = 'class')

# Applying model to testing matrix
nb_test_prediction <- predict(nb, test_matrix)

# Running confusion matrix
con_matrix1 <- confusionMatrix(data = nb_test_prediction,
                               reference = test_labels,
                               mode = 'prec_recall')

# Returning accuracy, precision, and recall
print(con_matrix1)
Minimum of 50 Instances of Token Frequency
# Trimming document frequency matrix to only include tokens that appeared at least 50 times
job_dfm_50 <- dfm_trim(job_dfm, min_termfreq = 50)

# Converting dfm to data frame
job_data50 <- as.data.frame(job_dfm_50)
Warning: 'as.data.frame.dfm' is deprecated.
Use 'convert(x, to = "data.frame")' instead.
See help("Deprecated")
# Adding fraudulent column to job document frequency matrix
job_data50$fraudulent <- as.factor(df$fraudulent)

# Taking a sample of 2,000 rows
job_subset50 <- sample_n(job_data50, 2000)

# Removing first column
job_subset50 <- job_subset50[, -1]

# Defining formula for prediction problem
formula <- fraudulent ~ .

# Applying ADASYN to dataset
new_data50 <- AdasynClassif(formula, job_subset50)

# Setting seed for reproducibility
set.seed(33)

# Splitting sample into a training set and testing set
sample50 <- sample.int(n = nrow(new_data50), size = floor(.80 * nrow(new_data50)), replace = F)

# Selecting training dataset
train_matrix50 <- new_data50[sample50, ]
train_labels50 <- new_data50$fraudulent[sample50]
test_matrix50 <- new_data50[-sample50, ]
test_labels50 <- new_data50$fraudulent[-sample50]

# Removing 'fraudulent' column from training and testing datasets
train_remove50 <- train_matrix50 %>% select(-c(fraudulent))
test_remove50 <- test_matrix50 %>% select(-c(fraudulent))

# Creating matrices to pass into e1071
train_matrix50 <- as.matrix(train_remove50)
test_matrix50 <- as.matrix(test_remove50)

# Training model
nb50 <- e1071::naiveBayes(x = train_matrix50, y = train_labels50, method = 'class')

# Applying model to testing matrix
nb_test_prediction50 <- predict(nb50, test_matrix50)

# Running confusion matrix
con_matrix50 <- confusionMatrix(data = nb_test_prediction50,
                                reference = test_labels50,
                                mode = 'prec_recall')

# Returning accuracy, precision, and recall
print(con_matrix50)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 379 19
1 11 357
Accuracy : 0.9608
95% CI : (0.9446, 0.9734)
No Information Rate : 0.5091
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9216
Mcnemar's Test P-Value : 0.2012
Precision : 0.9523
Recall : 0.9718
F1 : 0.9619
Prevalence : 0.5091
Detection Rate : 0.4948
Detection Prevalence : 0.5196
Balanced Accuracy : 0.9606
'Positive' Class : 0
Minimum of 75 Instances of Token Frequency
# Trimming document frequency matrix to only include tokens that appeared at least 75 times
job_dfm_75 <- dfm_trim(job_dfm, min_termfreq = 75)

# Converting dfm to data frame
job_data75 <- as.data.frame(job_dfm_75)
Warning: 'as.data.frame.dfm' is deprecated.
Use 'convert(x, to = "data.frame")' instead.
See help("Deprecated")
# Adding fraudulent column to job document frequency matrix
job_data75$fraudulent <- as.factor(df$fraudulent)

# Taking a sample of 2,000 rows
job_subset75 <- sample_n(job_data75, 2000)

# Removing first column
job_subset75 <- job_subset75[, -1]

# Defining formula for prediction problem
formula <- fraudulent ~ .

# Applying ADASYN to dataset
new_data75 <- AdasynClassif(formula, job_subset75)

# Setting seed for reproducibility
set.seed(33)

# Splitting sample into a training set and testing set
sample75 <- sample.int(n = nrow(new_data75), size = floor(.80 * nrow(new_data75)), replace = F)

# Selecting training dataset
train_matrix75 <- new_data75[sample75, ]
train_labels75 <- new_data75$fraudulent[sample75]
test_matrix75 <- new_data75[-sample75, ]
test_labels75 <- new_data75$fraudulent[-sample75]

# Removing 'fraudulent' column from training and testing datasets
train_remove75 <- train_matrix75 %>% select(-c(fraudulent))
test_remove75 <- test_matrix75 %>% select(-c(fraudulent))

# Creating matrices to pass into e1071
train_matrix75 <- as.matrix(train_remove75)
test_matrix75 <- as.matrix(test_remove75)

# Training model
nb75 <- e1071::naiveBayes(x = train_matrix75, y = train_labels75, method = 'class')

# Applying model to testing matrix
nb_test_prediction75 <- predict(nb75, test_matrix75)

# Running confusion matrix
con_matrix75 <- confusionMatrix(data = nb_test_prediction75,
                                reference = test_labels75,
                                mode = 'prec_recall')

# Returning accuracy, precision, and recall
print(con_matrix75)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 387 18
1 7 354
Accuracy : 0.9674
95% CI : (0.9522, 0.9788)
No Information Rate : 0.5144
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9346
Mcnemar's Test P-Value : 0.0455
Precision : 0.9556
Recall : 0.9822
F1 : 0.9687
Prevalence : 0.5144
Detection Rate : 0.5052
Detection Prevalence : 0.5287
Balanced Accuracy : 0.9669
'Positive' Class : 0
Minimum of 100 Instances of Token Frequency
# Trimming document frequency matrix to only include tokens that appeared at least 100 times
job_dfm_100 <- dfm_trim(job_dfm, min_termfreq = 100)

# Converting dfm to data frame
job_data100 <- as.data.frame(job_dfm_100)
Warning: 'as.data.frame.dfm' is deprecated.
Use 'convert(x, to = "data.frame")' instead.
See help("Deprecated")
# Adding fraudulent column to job document frequency matrix
job_data100$fraudulent <- as.factor(df$fraudulent)

# Taking a sample of 2,000 rows
job_subset100 <- sample_n(job_data100, 2000)

# Removing first column
job_subset100 <- job_subset100[, -1]

# Defining formula for prediction problem
formula <- fraudulent ~ .

# Applying ADASYN to dataset
new_data100 <- AdasynClassif(formula, job_subset100)

# Setting seed for reproducibility
set.seed(33)

# Splitting sample into a training set and testing set
sample100 <- sample.int(n = nrow(new_data100), size = floor(.80 * nrow(new_data100)), replace = F)

# Selecting training dataset
train_matrix100 <- new_data100[sample100, ]
train_labels100 <- new_data100$fraudulent[sample100]
test_matrix100 <- new_data100[-sample100, ]
test_labels100 <- new_data100$fraudulent[-sample100]

# Removing 'fraudulent' column from training and testing datasets
train_remove100 <- train_matrix100 %>% select(-c(fraudulent))
test_remove100 <- test_matrix100 %>% select(-c(fraudulent))

# Creating matrices to pass into e1071
train_matrix100 <- as.matrix(train_remove100)
test_matrix100 <- as.matrix(test_remove100)

# Training model
nb100 <- e1071::naiveBayes(x = train_matrix100, y = train_labels100, method = 'class')

# Applying model to testing matrix
nb_test_prediction100 <- predict(nb100, test_matrix100)

# Running confusion matrix
con_matrix100 <- confusionMatrix(data = nb_test_prediction100,
                                 reference = test_labels100,
                                 mode = 'prec_recall')

# Returning accuracy, precision, and recall
print(con_matrix100)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 351 24
1 37 351
Accuracy : 0.9201
95% CI : (0.8985, 0.9383)
No Information Rate : 0.5085
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8402
Mcnemar's Test P-Value : 0.1244
Precision : 0.9360
Recall : 0.9046
F1 : 0.9201
Prevalence : 0.5085
Detection Rate : 0.4600
Detection Prevalence : 0.4915
Balanced Accuracy : 0.9203
'Positive' Class : 0
III. Visualization
A. Figure 1. Percentage of fraudulent cases before and after ADASYN.
# Comparing percentage of fraudulent cases before and after ADASYN
percentage_before <- prop.table(table(job_subset$fraudulent)) * 100
percentage_after <- prop.table(table(new_data$fraudulent)) * 100

# Printing proportions
percentage_before
0 1
95 5
percentage_after
0 1
49.85568 50.14432
# Creating plot data
plot_data <- data.frame(
  Method = c("Before ADASYN", "After ADASYN"),
  Fraudulent = c(percentage_before[2], percentage_after[2])
)

# Flipping order of the levels on the x-axis
plot_data$Method <- as.factor(plot_data$Method)
plot_data$Method <- factor(plot_data$Method, levels = c("Before ADASYN", "After ADASYN"))

# Creating grouped bar chart
figure1 <- ggplot(plot_data, aes(x = Method, y = Fraudulent, fill = Method)) +
  geom_bar(stat = "identity", fill = c("maroon", "steelblue4")) +
  labs(title = "Figure 1. Percentage of Fraudulent Cases Before\n and After Application of ADASYN",
       x = "Method",
       y = "% Fraudulent") +
  theme_minimal() +
  theme(legend.position = "none")

# Getting breakdown of real and fraudulent postings by balanced and imbalanced datasets
job_subset %>%
  summarize(Real = sum(fraudulent == 0), Fake = sum(fraudulent == 1))
B. Figure 2. Most commonly occurring tokens in document frequency matrix.
Most common tokens among verified (real) postings:
   feature frequency rank docfreq group
1 work 51119 1 13786 all
2 na 37459 2 14243 all
3 team 36051 3 12250 all
4 manag 34612 4 10849 all
5 servic 33056 5 10254 all
6 develop 31895 6 9674 all
7 experi 30693 7 13362 all
8 time 28820 8 14120 all
9 custom 28570 9 8129 all
10 compani 26635 10 11047 all
11 product 26027 11 8546 all
12 busi 24085 12 8597 all
13 us 22103 13 12826 all
14 market 20788 14 6093 all
15 client 19454 15 6817 all
16 full 19062 16 12691 all
17 provid 18805 17 9682 all
18 new 18359 18 8530 all
19 sale 18334 19 4370 all
20 design 18334 19 5779 all
Most common tokens among fraudulent postings:
feature frequency rank docfreq group
1 na 2508 1 739 all
2 work 2230 2 716 all
3 manag 1535 3 460 all
4 time 1515 4 744 all
5 servic 1461 5 570 all
6 skill 1236 6 537 all
7 product 1190 7 287 all
8 experi 1172 8 544 all
9 amp 1154 9 378 all
10 custom 1113 10 405 all
11 develop 1042 11 334 all
12 us 957 12 763 all
13 requir 957 12 492 all
14 full 936 14 668 all
15 team 913 15 391 all
16 posit 907 16 446 all
17 compani 905 17 464 all
18 provid 903 18 436 all
19 project 837 19 253 all
20 busi 783 20 330 all
C. Figure 3. Comparison of model F1 scores.
# Creating dataset of F1 scores
fig3_data <- data.frame(
  f1score = c(99, 99, 99, 99, 99, 96, 96, 94, 87, 80),
  model_name = c("RF", "ETC", "LR", "SVM", "MLP", "NB-3", "NB-4", "NB-2", "KNN", "NB-1"),
  model_type = c("Amaar et al. (2022)", "Amaar et al. (2022)", "Amaar et al. (2022)",
                 "Amaar et al. (2022)", "Amaar et al. (2022)", "Current Study",
                 "Current Study", "Current Study", "Amaar et al. (2022)", "Current Study")
)
fig3_data$Source <- as.factor(fig3_data$model_type)

# Reordering model_name based on f1score
fig3_data$model_name <- factor(fig3_data$model_name,
                               levels = fig3_data$model_name[order(fig3_data$f1score, decreasing = TRUE)])

# Creating visualization
figure3 <- ggplot(data = fig3_data, aes(x = model_name, y = f1score, fill = Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Figure 3. F1 Score by Classification Model",
       subtitle = "Summary of F1 Scores (Range 0-1) across several classification models of job postings within\ncurrent study and Amaar et al. (2022)",
       x = "Model Abbreviation", y = "F1-Score") +
  scale_fill_manual(values = c('maroon', 'steelblue4')) +  # Specify fill colors
  theme_minimal()
Footnotes
Due to limitations in computational power, the ADASYN technique could not be applied to the entire dataset, as was done in Amaar et al. (2022). However, 2,000 rows is a more than adequate sample size for training a Naïve Bayes classification algorithm without sacrificing computational performance or causing program crashes.