Final Project: Detecting Fraudulent Job Postings with NLP
Real or Fake Jobs?
Author: Daria Smyslova
Published: April 26, 2025
Detecting Fraudulent Job Postings with NLP
This project explores the use of natural language processing (NLP) techniques to identify fraudulent job postings in a real-world dataset of approximately 18,000 online job ads. Each posting includes both textual components (such as job title, description, requirements, company profile) and metadata (such as location, employment type, and salary range). Approximately 5% of postings are labeled fraudulent, making fraud detection a challenging yet highly relevant classification task.
The dataset was collected from real online job boards and curated through verification of scam reports, making it a rich source for studying how deceptive job listings differ linguistically and structurally from authentic ones. In today’s digital economy, where job searching increasingly happens online, this context is crucial for enhancing digital literacy and safety practices among vulnerable groups such as students and early-career professionals.
Research Objectives
This project is guided by the following core research questions:
RQ1: What linguistic or structural features distinguish real from fake job descriptions?
RQ2: Can sentiment analysis, keyword patterns, or other NLP-derived features help identify fraudulent postings?
RQ3: How can insights from this analysis inform the design of educational tools or curricula that support digital safety and critical thinking?
RQ1: What linguistic or structural features distinguish real from fake job descriptions?
Code
library(dplyr)
library(tidytext)
library(ggplot2)

# Compute TF-IDF by fraudulent status
tfidf_fraud <- tokens %>%
  count(fraudulent, word, sort = TRUE) %>%
  bind_tf_idf(word, fraudulent, n)

# Plot top TF-IDF words for real vs fake jobs
tfidf_fraud %>%
  group_by(fraudulent) %>%
  slice_max(tf_idf, n = 10) %>%  # Top 10 words per group
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, fraudulent)) %>%
  ggplot(aes(word, tf_idf, fill = as.factor(fraudulent))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ fraudulent, scales = "free_y",
             labeller = as_labeller(c("0" = "Real Jobs", "1" = "Fake Jobs"))) +
  scale_x_reordered() +
  coord_flip() +
  labs(x = NULL, y = "TF-IDF",
       title = "Top TF-IDF Words: Real vs Fake Job Postings")
Real Job Postings:
Real job postings are characterized by geographic and educational references, including terms such as “tidewater,” “european,” “asia,” “greece,” “athens,” and “berlin.”
The presence of words like “tefl” (Teaching English as a Foreign Language) and “tesol” (Teaching English to Speakers of Other Languages) suggests many postings relate to legitimate international teaching and professional opportunities.
These terms reflect specific organizational ties, location identifiers, or professional certifications, adding credibility and specificity to real ads.
Fake Job Postings:
Fake job postings frequently contain company-like names such as “aker,” “subsea,” “accion,” “novation,” and “expro.” Several of these match the names of real firms (for example, Aker and Expro in oil and gas services), which is consistent with fraudsters impersonating established employers without giving verifiable information.
Terms like “makeing” (a misspelled form of “making”) further suggest lower-quality text or hastily constructed ads.
Some tokens (e.g., “overviewaker,” “onlyclick”) look like words fused together during text cleaning or copied from poorly formatted postings, so they are better read as signs of sloppy text than as meaningful vocabulary.
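Because the TF-IDF above treats each class (real, fake) as a single document, it helps to see how the score behaves with only two documents. The following is a minimal base R sketch on made-up counts (the words and numbers are illustrative, not taken from the dataset). It uses the same definitions as tidytext's bind_tf_idf(): tf = n / class total, and idf = ln(number of documents / documents containing the term).

```r
# Toy word counts per class (hypothetical values, not from the dataset)
counts <- data.frame(
  fraudulent = c(0, 0, 0, 1, 1, 1),
  word       = c("team", "tefl", "berlin", "team", "aker", "makeing"),
  n          = c(50, 5, 3, 10, 4, 2),
  stringsAsFactors = FALSE
)

# Term frequency: share of each word within its class
class_totals <- tapply(counts$n, counts$fraudulent, sum)
counts$tf <- counts$n / as.numeric(class_totals[as.character(counts$fraudulent)])

# Inverse document frequency: ln(number of classes / classes containing the word)
n_docs <- length(unique(counts$fraudulent))
docs_with_word <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(docs_with_word[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
print(counts[order(-counts$tf_idf), ])
```

Shared words such as “team” score exactly zero here, so a top-10 list built this way contains only words that occur in one class and never in the other. That is one reason rare proper nouns and misspellings dominate the plot.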
Code
library(dplyr)
library(wordcloud)
library(RColorBrewer)

# Draw side-by-side wordclouds with titles
par(mfrow = c(1, 2))        # 1 row, 2 columns
par(mar = c(0, 0, 2, 0))    # space for a title (top margin)

# --- Real Jobs Wordcloud ---
tokens %>%
  filter(fraudulent == 0, word != "na") %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(
    words = word,
    freq = n,
    max.words = 50,
    random.order = FALSE,
    rot.per = 0.20,
    colors = brewer.pal(8, "Dark2")
  ))
title("Real Job Postings", line = 0, cex.main = 1)  # title for left plot

# --- Fake Jobs Wordcloud ---
tokens %>%
  filter(fraudulent == 1, word != "na") %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(
    words = word,
    freq = n,
    max.words = 50,
    random.order = FALSE,
    rot.per = 0.20,
    colors = brewer.pal(8, "Dark2")
  ))
title("Fake Job Postings", line = 0, cex.main = 1)  # title for right plot
Words like “team,” “service,” “customer,” “company,” “skill,” and “time” appear prominently in both real and fake postings. This suggests that fraudsters deliberately mimic the general tone and keywords of authentic job ads to appear credible and familiar to job seekers.
Distinctive Features of Real Job Postings:
“Development,” “offer,” “build,” “market,” “management” emerge more clearly in real job postings.
Real ads are more likely to emphasize career growth opportunities, structured roles, and organizational functions (e.g., “development,” “management,” “lead”).
Words like “product” and “web” also suggest that real postings often highlight specific project work or company offerings.
Distinctive Features of Fake Job Postings:
Fake postings feature stronger emphasis on “position,” “system,” “plan,” “solution,” “provide,” and “test.”
The language tends to be vaguer and more generic, focusing on the idea of employment (“position”) without detailed reference to company structure or career paths.
Terms like “home” and “entry” suggest that fake postings may target people seeking remote or entry-level work, exploiting the appeal of such offers.
Educational Implication:
This comparison highlights the need for critical reading strategies when evaluating online job postings:
Specificity (e.g., role, career development, product descriptions) is a positive indicator of authenticity.
Generic language centered around “positions,” “plans,” or “systems” without concrete detail could signal potential fraud.
Teaching students and job seekers to recognize these linguistic patterns can enhance digital discernment and online safety.
Code
library(dplyr)
library(tidyr)
library(wordcloud)

# Word frequencies
real_words <- tokens %>%
  filter(fraudulent == 0, word != "na") %>%
  count(word, sort = TRUE)

fake_words <- tokens %>%
  filter(fraudulent == 1, word != "na") %>%
  count(word, sort = TRUE)

# Merge frequencies
word_freqs <- full_join(real_words, fake_words, by = "word",
                        suffix = c("_real", "_fake")) %>%
  replace_na(list(n_real = 0, n_fake = 0))

# Now directly build a matrix
freq_matrix <- as.matrix(word_freqs[, c("n_real", "n_fake")])
rownames(freq_matrix) <- word_freqs$word

# Remove rownames "n_real" and "n_fake" if they appear
freq_matrix <- freq_matrix[!(rownames(freq_matrix) %in% c("n_real", "n_fake")), ]

# Plotting layout
layout(matrix(c(1, 2), nrow = 2), heights = c(1, 8))

# Title
par(mar = c(0, 0, 0, 0))
plot.new()
text(0.5, 0.5, "Real vs Fake Job Postings (Comparison Wordcloud)", cex = 1, font = 1)

# Wordcloud
par(mar = c(0, 0, 0, 0))
comparison.cloud(
  freq_matrix,
  max.words = 100,
  colors = c("steelblue3", "indianred3"),
  title.size = 2,
  scale = c(4, 0.5),
  random.order = FALSE,
  rot.per = 0.20,
  match.colors = TRUE
)
Real Job Postings (Blue):
Real postings prominently feature terms such as “market,” “team,” “web,” “build,” “client,” and “grow.”
These words suggest a stronger emphasis on business development, project building, and professional collaboration.
The focus on “client” and “grow” indicates that real postings often detail the company’s mission, customer engagement, and future growth plans.
Fake Job Postings (Red):
Fake postings are dominated by terms such as “skill,” “entry,” “system,” “amp,” “oil,” and “control.” The token “amp” is most likely residue of the HTML entity “&amp;” rather than a meaningful word.
The language is more generic and task-oriented, centering on the idea of performing a “skill” or filling a “system” position without concrete descriptions.
Words like “entry” and “assist” suggest a targeting strategy aimed at individuals seeking low-barrier, quick-access job opportunities, which are commonly exploited by fraudulent ads.
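Since roughly 95% of postings are real, raw counts are not directly comparable across the two classes. One hedged way to quantify which words lean toward one class is a smoothed log ratio of within-class proportions. The sketch below uses base R on hypothetical counts; the numbers and the add-one smoothing are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical word counts per class (illustrative numbers only)
word_freqs <- data.frame(
  word   = c("team", "market", "grow", "entry", "oil"),
  n_real = c(9000, 4000, 2500, 600, 150),
  n_fake = c(450, 90, 60, 120, 80),
  stringsAsFactors = FALSE
)

# Convert counts to within-class proportions, with add-one smoothing so that
# words missing from one class do not produce log(0)
p_real <- (word_freqs$n_real + 1) / sum(word_freqs$n_real + 1)
p_fake <- (word_freqs$n_fake + 1) / sum(word_freqs$n_fake + 1)

# Positive values: relatively more common in fake postings
word_freqs$log_ratio <- log2(p_fake / p_real)
word_freqs <- word_freqs[order(-word_freqs$log_ratio), ]
print(word_freqs)
```

In this toy table, “oil” and “entry” rise to the top while “team” sits near zero, mirroring the pattern that shared vocabulary carries little signal on its own.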
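The topic-model plot shows the 10 most probable terms for each of the 5 topics extracted from the job postings dataset. Each beta value is the estimated probability of a term within a topic, so higher values indicate a stronger association between word and topic. The terms appear to be lemmatized, which is why forms such as “datum” and “medium” stand in for “data” and “media.”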
Topic 1: Care/Service-Oriented Jobs
Top terms: job, service, experience, require, provide, home, care, time, include, pay
Interpretation:
Language here reflects service-oriented positions like caregiving, healthcare, home services, and customer support.
Focus on requirements and provision of care.
Topic 2: Hiring and Staffing Language
Top terms: job, experience, company, service, position, candidate, provide, include, engineer, time
Interpretation:
Focuses on company-centric language, recruiting candidates for roles (likely generic staffing ads).
Mix of technical (engineer) and general service positions.
Topic 3: Sales and Business-Focused Jobs
Top terms: customer, sale, service, business, client, team, management, experience, skill, company
Interpretation:
Strong sales, business, and client-focused vocabulary.
Highlights customer relationships, teamwork, and managerial responsibilities.
Topic 4: Tech/Product Development Jobs
Top terms: team, market, experience, company, people, product, build, digital, grow, medium
Interpretation:
Language typical for tech startups, product design, digital marketing, or software development.
Themes of growth, digital work, and building products.
Topic 5: Engineering and Software Design Jobs
Top terms: experience, development, design, team, software, technology, project, application, system, datum
Interpretation:
Focused on software engineering, system design, and technology projects.
Likely reflects real postings from tech companies.
In conclusion, topics 3, 4, and 5 seem more likely associated with real postings (structured roles, specific skills). Topic 1 could include fake postings if service language is vague (common in scams targeting job seekers for caregiving or remote work). Language across topics varies by industry domain (service vs sales vs tech).
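The link between topics and fraud labels in this conclusion is an interpretation of the top terms. One way it could be checked is to compare the average per-document topic proportions (called gamma in tidytext's output) across the two classes. The sketch below uses simulated proportions and labels in base R; the object names gamma and fraudulent are hypothetical stand-ins, and in a real analysis the proportions would come from the fitted topic model.

```r
set.seed(42)

# Simulated stand-in for per-document topic proportions (rows sum to 1)
n_docs <- 200
raw <- matrix(rexp(n_docs * 5), nrow = n_docs, ncol = 5)
gamma <- raw / rowSums(raw)
colnames(gamma) <- paste0("topic_", 1:5)

# Simulated labels with 5% fraud, mirroring the dataset's imbalance
fraudulent <- rep(c(0, 1), times = c(190, 10))

# Mean topic share per class: one row per class, one column per topic
topic_by_class <- aggregate(as.data.frame(gamma),
                            by = list(fraudulent = fraudulent),
                            FUN = mean)
print(round(topic_by_class, 3))
```

With real data, a topic whose mean share is clearly higher in the fraudulent row would support reading it as scam-prone; with the random data here the two rows should look similar.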
RQ2: Can sentiment analysis, keyword patterns, or other NLP-derived features help identify fraudulent postings?
Code
library(dplyr)
library(ggplot2)

# First join the fraudulent label back
sentiment_with_labels <- sentiment_scores %>%
  left_join(df_clean %>% select(job_id, fraudulent), by = "job_id")

# Now plot
ggplot(sentiment_with_labels,
       aes(x = factor(fraudulent), y = sentiment_score, fill = factor(fraudulent))) +
  geom_boxplot() +
  scale_fill_manual(values = c("0" = "steelblue", "1" = "indianred")) +
  labs(
    title = "Sentiment Comparison: Real vs Fake Job Postings",
    x = "Fraudulent (0 = Real, 1 = Fake)",
    y = "Sentiment Score"
  ) +
  theme_minimal()
Real Job Postings (Fraudulent = 0, blue):
Median sentiment score is slightly higher compared to fake postings.
The interquartile range (IQR)—the spread of the middle 50%—is wider, indicating greater variability in sentiment among real jobs.
There are more high positive outliers (above 50), suggesting that a small number of real postings are highly enthusiastic (e.g., emphasizing benefits, growth opportunities).
Fake Job Postings (Fraudulent = 1, red):
Median sentiment score is lower than that of real jobs, indicating that fake postings are, on average, less positive.
The distribution is more compressed, with less variability and fewer extremely positive postings.
A small number of negative sentiment outliers appear, reflecting some postings with strongly negative or concerning wording.
Overall, real job postings tend to use more positive and varied emotional language, while fake job postings are flatter, slightly less positive, and occasionally exhibit concerning or unusual sentiment. Because the medians differ only slightly, this difference in tone is better treated as one signal among several for deception detection than as a standalone indicator.
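The boxplot comparison can be backed by numbers. The following base R sketch computes the median and interquartile range per class and runs a rank-based test on hypothetical scores; the data frame mimics the column names used above, but the values are invented and the choice of the Wilcoxon test is an assumption, not something the original analysis reports.

```r
# Hypothetical per-posting sentiment scores (not the project's data)
sentiment_with_labels <- data.frame(
  fraudulent      = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
  sentiment_score = c(12, 18, 25, 9, 40, 55, 10, 14, 8, -3)
)

# Median and interquartile range of sentiment for each class
summary_by_class <- aggregate(
  sentiment_score ~ fraudulent,
  data = sentiment_with_labels,
  FUN = function(x) c(median = median(x), iqr = IQR(x))
)
print(summary_by_class)

# Rank-based test of whether the two distributions differ in location
test_result <- wilcox.test(sentiment_score ~ fraudulent,
                           data = sentiment_with_labels, exact = FALSE)
print(test_result$p.value)
```

Reporting the group medians and a test statistic alongside the plot makes it easier to judge how much weight the sentiment feature deserves.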
RQ3: How can insights from this analysis inform the design of educational tools or curricula that support digital safety and critical thinking?
To address RQ3, we extend our analysis by building classification models that predict whether a job posting is real or fraudulent based on linguistic and sentiment-based features. Specifically, we merge sentiment scores with metadata (e.g., employment type, required experience, industry) and generate TF-IDF features from the top 500 words across postings. Using these combined features, we train and evaluate multiple machine learning models—including Logistic Regression, Random Forest, and XGBoost—to assess how well simple text-based indicators can support fraud detection. This modeling approach serves as a proof of concept for developing educational tools that help users critically evaluate online job ads by highlighting suspicious patterns in language and structure.
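With only about 5% of postings labeled fraudulent, a purely random train/test split can leave very few fraudulent cases in the test set. The source does not say how the split was made, so the following is only a sketch of one common option, a stratified 80/20 split in base R. The labels are simulated, and the 80/20 ratio and variable names are assumptions.

```r
set.seed(123)

# Simulated labels: 1000 postings, 5% fraudulent (illustrative only)
fraudulent <- rep(c(0, 1), times = c(950, 50))

# Sample 80% of the row indices within each class separately
rows_by_class <- split(seq_along(fraudulent), fraudulent)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))
test_idx <- setdiff(seq_along(fraudulent), train_idx)

# Both splits keep the original 5% fraud rate
print(mean(fraudulent[train_idx]))
print(mean(fraudulent[test_idx]))
```

Sampling within each class keeps the fraud rate identical in both partitions, which makes test-set metrics such as AUC less sensitive to the luck of the draw.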
Model Evaluation
Code
library(pROC)

# ROC curves
plot(roc(y_test, lr_preds), col = "blue", main = "ROC Curves")
lines(roc(y_test, rf_preds), col = "green")
lines(roc(y_test, xgb_preds), col = "red")
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Code
auc(roc(y_test, lr_preds))
Area under the curve: 0.7933
Code
auc(roc(y_test, rf_preds))
Area under the curve: 0.7275
Code
auc(roc(y_test, xgb_preds))
Area under the curve: 0.8385
The ROC curve above compares the classification performance of three models: Logistic Regression (blue), Random Forest (green), and XGBoost (red). The steeper and higher the curve bows toward the top-left corner, the better the model distinguishes between real and fake job postings.
The Area Under the Curve (AUC) values for each model are:
Logistic Regression: ~0.793
Random Forest: ~0.728
XGBoost: ~0.839
From both the ROC plot and AUC metrics, XGBoost exhibits the strongest discriminative ability, achieving the highest AUC and consistently outperforming the other models across thresholds. Logistic Regression shows good performance and remains a strong lightweight baseline, while Random Forest trails behind, likely due to challenges in handling the high-dimensional, sparse TF-IDF feature space without deeper tuning.
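AUC can be read as the probability that a randomly chosen fraudulent posting receives a higher score than a randomly chosen real one. As a hedged illustration, the base R sketch below computes it with the rank-sum (Mann-Whitney) identity on toy inputs. The function name auc_rank and the toy vectors are hypothetical, and the result should agree with what pROC reports for the same labels and scores.

```r
# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(labels, scores) {
  r <- rank(scores)                    # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example (hypothetical labels and predicted probabilities)
y_toy     <- c(0, 0, 0, 0, 1, 1)
preds_toy <- c(0.10, 0.20, 0.70, 0.30, 0.60, 0.90)

print(auc_rank(y_toy, preds_toy))  # 0.875: 7 of the 8 fake/real pairs are ordered correctly
```

Read this way, XGBoost's AUC of about 0.84 means it ranks a fake posting above a real one in roughly 84% of such pairs.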
These results directly inform RQ3 by demonstrating that relatively simple linguistic and sentiment-based features, extracted using accessible NLP techniques, can distinguish between real and fake job ads with reasonable accuracy (AUC between roughly 0.73 and 0.84). This supports the idea that educational tools or curricula designed to foster digital discernment could leverage such textual patterns. For instance, training users to recognize keyword anomalies (e.g., excessive emphasis on vague benefits or urgent recruitment language) or to critically reflect on sentiment extremity could meaningfully strengthen digital safety and critical thinking skills. Furthermore, the solid performance of an interpretable model like Logistic Regression suggests that these insights can be made transparent and teachable rather than relying solely on “black-box” AI systems. Overall, the findings underscore the potential for lightweight NLP pipelines to empower users, especially students and job seekers, to detect online fraud through evidence-based reasoning.
Implications
This analysis would most benefit students, early-career job seekers, educators in digital literacy, and career support services. Students and early-career job seekers are especially vulnerable to online employment scams because they often lack training in detecting the subtle language cues that indicate fraud. By identifying key linguistic and structural markers that differentiate real and fake postings, educators can design curricula or workshops that teach critical evaluation of online job ads. Students and job seekers can use this knowledge to make safer application decisions, and institutions can embed AI literacy modules into broader digital safety initiatives. In practice, users might learn to spot suspicious phrases, unusually emotional language, or patterns of solicitation typical of fraudulent postings.
Limitations
Generalizability: The dataset may not represent all industries, regions, or updated fraud tactics, especially as scam techniques evolve.
Simplistic features: The models rely primarily on text-based features without incorporating richer context such as company reputation or user reports.
False positives/negatives: Some legitimate postings may share surface-level traits with scams, leading to misclassification.
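The cost of false positives and false negatives depends on the decision threshold, and with so few fraudulent postings even a small number of false alarms can outnumber the true detections. The base R sketch below shows how a confusion matrix, precision, and recall could be computed. The labels, probabilities, and the 0.5 threshold are all hypothetical and not taken from the project's models.

```r
# Hypothetical test-set labels and predicted fraud probabilities
y_true <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
p_hat  <- c(0.05, 0.10, 0.20, 0.55, 0.15, 0.30, 0.08, 0.60, 0.70, 0.40)

threshold <- 0.5                       # assumed cut-off, not from the source
y_pred <- as.integer(p_hat >= threshold)

# Confusion matrix with fixed levels so empty cells still appear
conf_mat <- table(
  actual    = factor(y_true, levels = c(0, 1)),
  predicted = factor(y_pred, levels = c(0, 1))
)
print(conf_mat)

tp <- conf_mat["1", "1"]
fp <- conf_mat["0", "1"]
fn <- conf_mat["1", "0"]

precision <- tp / (tp + fp)   # share of flagged postings that are truly fake
recall    <- tp / (tp + fn)   # share of fake postings that were flagged
print(c(precision = precision, recall = recall))
```

In this toy case only one of the three flagged postings is actually fake, which illustrates why AUC alone does not show how a classifier would behave as a warning tool.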
Ethical and Legal Considerations
Overreliance on automation: Teaching users to trust AI classifications blindly, without critical thinking, could create new vulnerabilities.
Bias in labeling: The dataset’s original labeling may reflect historical biases or misjudgments about what constitutes “fraudulent,” which could propagate unfair skepticism.
Data privacy: Although the dataset is public, future extensions involving scraping or real-world user data would need strict compliance with data protection regulations (e.g., GDPR).
Overall, the project emphasizes AI-assisted decision-making, not AI replacement of human judgment, aligning with ethical guidelines for responsible AI literacy education.