Group 4 members:

JingYu Shen (S2113037)
JiPing Zhang (S2042894)
Lee Mun Mun (S2112842)
Nayli Hatim (S2149344)
Jenifer Mayang Jues (S2016572)

Introduction

In the past, job seekers relied on newspapers to find job opportunities. Today they use employment websites such as JobStreet, LinkedIn, Indeed and many others, thanks to advances in technology and social communication. As the number of job scams keeps rising, the authenticity of job postings has become a critical concern. According to Habiba et al. (2021), a job scam is a fake job advertisement that steals the personal and professional information of job seekers instead of offering them a genuine job. Job scams often involve fake online job advertisements on social platforms and untrusted job portals that offer high-paying jobs. Victims may also receive unsolicited messages through channels such as WhatsApp, Facebook and WeChat offering jobs that do not exist. For example, scammers may ask victims to disclose personal or banking details, or to pay upfront fees to secure an interview or to obtain more information about the fraudulent job. Given the growing concern about job scams, our aim is to raise job seekers’ awareness during the job application process and to give them an early warning using Machine Learning (ML) and Natural Language Processing (NLP) approaches.

Objectives

To identify the key features of fraudulent job postings.
To build a model that classifies job postings as real or fake.

Initial Questions

What are the key features/characteristics of fraudulent job postings?
Which classification model is best at determining whether a job posting is real or fake?

Data Cleaning and Pre-processing

The dataset used in this project is the Employment Scam Aegean Dataset (EMSCAD), retrieved from Kaggle. It contains 17,880 observations, of which 866 are fake, and 18 variables, with a mix of numeric and text features. A brief description of the variables is given below:

Variable	Description
job_id	ID of each job posting
title	Description of position or job
location	Where the job is located
department	Department of the job offered
salary_range	Expected salary range
company_profile	Company information
description	Description about the position offered
requirements	Pre-requisites to qualify for the job
benefits	Benefits provided by the job
telecommuting	Is work from home or remote work allowed
has_company_logo	Does the post have a company logo
has_questions	Does the post have any questions
employment_type	Full-time, part-time, contract, temporary and others
required_experience	Experience level, e.g. Entry level, Executive, Director…
required_education	Education level, e.g. High School, Bachelor, Master…
industry	Relevant industry
function	Job’s functionality
fraudulent	Target variable (0: Real, 1: Fake)

Import libraries
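The library calls themselves are not shown in this report. Judging from the functions used in the steps below, a minimal set would probably look like the following; this list is an inference, not the authors' actual import chunk.

```r
# Assumed package list, inferred from the functions called later in this section
library(dplyr)    # sample_n(), select(), glimpse()
library(stringr)  # str_split_fixed()
library(skimr)    # skim_without_charts()
```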

Load data

df <- read.csv("https://raw.githubusercontent.com/abbylmm/fake_job_posting/main/data/fake_job_postings.csv")
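One side effect of read.csv() is worth knowing here. With the default check.names = TRUE, column names are passed through make.names(), and because function is a reserved word in R, the function column is imported as function. (with a trailing dot), which is how it appears in the output further down. A small base R sketch of that behaviour, using a made-up two-row CSV rather than the real file:

```r
# Toy CSV text (made up for illustration) with a column named "function"
csv_text <- "job_id,function,fraudulent\n1,Marketing,0\n2,Sales,0"

# Default: names are sanitised by make.names(), so "function" becomes "function."
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```

Keeping the default is a reasonable choice, since a column literally named function would have to be quoted with backticks every time it is referenced.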

Display a random sample of n rows

df_fake_job <- df
sample_n(df_fake_job, 3)

##   job_id                                   title          location department
## 1   4794 NARRATIVE: Influencer Marketing Manager  US, NY, New York           
## 2  10541                 English Teacher Abroad    US, NH, Hanover           
## 3  16242                         Quality Manager US, MO, St. Louis           
##   salary_range
## 1             
## 2             
## 3             
##   company_profile
## 1 We are not your average Monday mail recruiters. We are here to align stars and connect dots, not just match titles with positions &amp; salary demands with salary offerings. Our approach is simple; we read between the lines to see YOU. Both of you. Employer and employee. You &amp; Them is the most personal, innovative and open-minded professional recruiting can be. Or should be. Our network is a community of people with the same mentality; that work is a part of our lives and not the other way around. A creative community of great minds who seek minds that think alike.You &amp; Them is Us. Real people. Nice to meet you.
## 2 We help teachers get safe &amp; secure jobs abroad :)
## 3 We Provide Full Time Permanent Positions for many medium to large US companies. We are interested in finding/recruiting high quality candidates in IT, Engineering, Manufacturing and other highly technical and non-technical jobs.
##   description
## 1 Narrative is looking for a Senior Influencer Marketing Manager to join our team in New York. With this role, you will report directly to the CEO. We are looking for someone that is:● well connected● proactive, detail oriented and professional● a master negotiator● comfortable working closely and collaboratively across the entire agency● willing to go above and beyond the call of duty● experienced in pitch workAs the Senior Influencer Marketing Manager, you obsess over people across all walks of life. You know thein’s and out’s of the industry, and you have a knack for identifying and connecting with talent. You knowtheir story, background, favorite color and a lot more that make you a little creepy if this wasn’t your field ofwork. You’re smart, articulate and base your decisions on data and strategic thinking. You are an influencerin your own right, and possess the ability to persuade at will.You will be immediately injected into our yearlong music activation – ADD52 to help drive engagement anddefine/execute marketing initiatives with and through influencers. ADD52 is a talent discovery platformreinventing how emerging artists and fans find, share and listen to music. Created by Russell Simmons andSteve Rifkind in partnership with Samsung, ADD52 gives unsigned artists the opportunity to get discoveredand signed by All Def Music.
## 2 Play with kids, get paid for it Love travel? Jobs in Asia$1,500+ USD monthly ($200 Cost of living)Housing provided (Private/Furnished)Airfare ReimbursedExcellent for student loans/credit cardsGabriel Adkins : #URL_ed9094c60184b8a4975333957f05be37e69d3cdb68decc9dd9a4242733cfd7f7##URL_75db76d58f7994c7db24e8998c2fc953ab9a20ea9ac948b217693963f78d2e6b#12 month contract : Apply today
## 3 (We have more than 1500+ Job openings in our website and some of them are relevant to this job. Feel free to search it in the website and apply directly. Just Click the “Apply Now” and you will redirect to our main website where you can search for the other jobs.)Implementation and maintenance of quality management system throughout the organization.5. Conducting management review meeting and providing recommendations for improvement.6. To provide customer complaint addressal, resolution and application support.7. Implementation of various standards such as QS 9000, ISO/TS 16949, ISO 9000, Kaizen projects, Six sigma projects, TPM etc.8. To act as management representative for the plant / company.We have many more Global Healthcare Professionals jobs are available in our website. Please go through our website and search the relevant job and apply directly.Visit : #URL_ec64af2b4fe2ca316e828f93b0cd098c22f8beba98dcac09d4dd7384b221a5e8#-#URL_9753a54b28303bf636a2816399b9c255d76fabb791336a4c748da2611a23264f#
##   requirements
## 1 GENERAL RESPONSIBILITIES● Conceptualize, create, manage and execute all influencer marketing initiatives through completion ensuring all aspects align with client goals.● Work closely with agencies, publicity departments, management, production houses and media to execute celebrity/influencer seeding strategy.● Leverage influencer marketing programs on social media platforms. Track social media engagement, create content, and increase social interaction and relevance.THE TALENT AND YOU● Manage national and regional events, including sponsorship activations.● Create and analyze reports + establish KPIs that measure the impact of the influencer marketing program to better serve future strategies.● Identify new passion groups and individuals we should engage with.● Work with clients, partner agencies and 3rd party vendors on joint activities.● Work with product team to give feedback regarding features and optimizations that will drive user/influencer growth.● Research and identify prospective influencers (including/not limited to: blogs, Twitter, Instagram, Vine, YouTube, etc.).● Help coordinate influencer communications and plan activities.● Develop and execute influencer marketing programs that drive sales and generate positive brand exposure.● Establishing contact, seed and manage influencer relationships on an ongoing basis.● Lead the influencer communication strategy and delivery of content through various mediums.● Manage talent negotiations and working closely with legal to draft talent agreements/contracts.● Activating talent against client’s goals and objective. Working closely with talent to develop content and platforms.● Maintaining talent schedule to ensure that we are aligned with timelines/ deliveries.REQUIRED SKILLS● Must have 5+ years of PR/digital marketing and/or influencer marketing experience.● Expertise in building communities and key relationships.● Knowledge and expertise in using various social media platforms.● Advanced proficiency with PR applications, Keynote, MS office Suite, Google Docs.● Keen sense of awareness of influencer/celebrity culture.● Excellent project management, organization, communication, writing and relationship building skills.● Well versed in marketing strategies across multiple categories.● Proven successes in both traditional and interactive PR channels.● Collaborative with a solutions oriented attitude and a willingness to pitch in when necessary.● Ability to work on multiple projects simultaneously with tight deadlines.● Has a clear understanding of industry standards and practices.● Strong analytic skills and ability to think strategically and applying them.● Continually working to understand the clients, their industry &amp; how we can make a difference in their business.
## 2 University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not necessaryCanada/US passport holders only
## 3 
##              benefits telecommuting has_company_logo has_questions
## 1                                 0                1             0
## 2 See job description             0                1             1
## 3                                 0                0             0
##   employment_type required_experience required_education             industry
## 1                                                                            
## 2        Contract                      Bachelor's Degree Education Management
## 3       Full-time                                                            
##   function. fraudulent
## 1                    0
## 2                    0
## 3                    0
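The free-text columns contain non-ASCII characters such as curly apostrophes and bullet symbols. Depending on the platform's default locale, read.csv() may decode these incorrectly unless the encoding is declared. The sketch below assumes the file is UTF-8 encoded (the report does not state this) and demonstrates the encoding argument on a small temporary file instead of the real dataset.

```r
# Build a tiny UTF-8 CSV on disk; the fourth character of the value is a
# right single quotation mark (U+2019, bytes E2 80 99)
tmp <- tempfile(fileext = ".csv")
bytes <- c(charToRaw("description\nYou"),
           as.raw(c(0xE2, 0x80, 0x99)),
           charToRaw("re smart\n"))
writeBin(bytes, tmp)

# encoding = "UTF-8" marks the strings that are read as UTF-8
toy <- read.csv(tmp, encoding = "UTF-8")

print(toy$description)
print(Encoding(toy$description))
```

For the real file, the same argument could be added to the read.csv() call in the Load data step.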

Summarize the data

summary(df_fake_job)

##      job_id         title             location          department       
##  Min.   :    1   Length:17880       Length:17880       Length:17880      
##  1st Qu.: 4471   Class :character   Class :character   Class :character  
##  Median : 8940   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 8940                                                           
##  3rd Qu.:13410                                                           
##  Max.   :17880                                                           
##  salary_range       company_profile    description        requirements      
##  Length:17880       Length:17880       Length:17880       Length:17880      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    benefits         telecommuting    has_company_logo has_questions   
##  Length:17880       Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :1.0000   Median :0.0000  
##                     Mean   :0.0429   Mean   :0.7953   Mean   :0.4917  
##                     3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  employment_type    required_experience required_education   industry        
##  Length:17880       Length:17880        Length:17880       Length:17880      
##  Class :character   Class :character    Class :character   Class :character  
##  Mode  :character   Mode  :character    Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##   function.           fraudulent     
##  Length:17880       Min.   :0.00000  
##  Class :character   1st Qu.:0.00000  
##  Mode  :character   Median :0.00000  
##                     Mean   :0.04843  
##                     3rd Qu.:0.00000  
##                     Max.   :1.00000
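Because fraudulent is coded 0/1, its mean is also the share of fake postings, so the class counts can be recovered from the summary alone. A quick check of that arithmetic, using only the row count and mean printed above:

```r
n_rows    <- 17880     # rows reported by summary()
fake_rate <- 0.04843   # mean of the 0/1 fraudulent column

n_fake <- round(n_rows * fake_rate)  # postings labelled fake
n_real <- n_rows - n_fake            # postings labelled real

cat("fake:", n_fake, " real:", n_real, "\n")
cat("real postings per fake posting:", round(n_real / n_fake, 1), "\n")
```

With roughly one fake posting for every twenty real ones, the classes are heavily imbalanced, which is worth keeping in mind when the models are evaluated.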

Check for missing values, including empty strings (the ‘empty’ column)

skim_without_charts(df_fake_job)

Data summary
Name	df_fake_job
Number of rows	17880
Number of columns	18
_______________________
Column type frequency:
character	13
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
title	0	1	3	142	0	11231	0
location	0	1	0	161	346	3106	0
department	0	1	0	255	11547	1338	6
salary_range	0	1	0	20	15012	875	0
company_profile	0	1	0	6230	3308	1710	0
description	0	1	3	22722	0	14802	0
requirements	0	1	0	10921	2694	11970	0
benefits	2	1	0	4489	7206	6207	0
employment_type	0	1	0	9	3471	6	0
required_experience	0	1	0	16	7050	8	0
required_education	0	1	0	33	8105	14	0
industry	0	1	0	36	4903	132	0
function.	0	1	0	22	6455	38	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
job_id	1	8940.50	5161.66	1	4470.75	8940.5	13410.25	17880
telecommuting	1	0.04	0.20	0	0.00	0.0	0.00	1
has_company_logo	1	0.80	0.40	0	1.00	1.0	1.00	1
has_questions	1	0.49	0.50	0	0.00	0.0	1.00	1
fraudulent	1	0.05	0.21	0	0.00	0.0	0.00	1
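skim reports almost no n_missing for the character columns because blank fields were read as empty strings, not NA; they appear in the empty column instead. The same distinction can be reproduced in base R. The data frame below is a made-up stand-in for df_fake_job:

```r
# Made-up miniature of the job data: blanks are "" and only one value is NA
toy <- data.frame(
  department = c("Marketing", "", ""),
  benefits   = c("", "See job description", NA),
  stringsAsFactors = FALSE
)

# Count empty strings per column (NA entries are skipped)
n_empty <- sapply(toy, function(x) sum(x == "", na.rm = TRUE))

# Count true NA values per column, which is all that is.na() detects
n_na <- colSums(is.na(toy))

print(n_empty)
print(n_na)
```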

Split location into country, state and city, and replace empty strings with NA

df_fake_job[c("country", "state", "city")] <- str_split_fixed(df_fake_job$location, ", ", 3)
df_fake_job[c("country", "state", "city")][df_fake_job[c("country", "state", "city")] == ""] <- NA
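To see what this split does with incomplete locations, here is a small sketch on made-up location strings. str_split_fixed() always returns exactly three columns and pads with empty strings, which is why the second line is needed to turn those into NA.

```r
library(stringr)

# Made-up examples: full location, missing state, country only, empty
loc <- c("US, NY, New York", "NZ, , Auckland", "GB", "")

parts <- str_split_fixed(loc, ", ", 3)
colnames(parts) <- c("country", "state", "city")

# Empty pieces become NA, mirroring the step above
parts[parts == ""] <- NA
print(parts)
```

Because the split is capped at three pieces, any further ", " in a location stays inside the city column.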

Split salary_range into min_salary and max_salary, and replace empty strings with NA

df_fake_job[c("min_salary", "max_salary")] <- str_split_fixed(df_fake_job$salary_range, "-", 2)
df_fake_job[c("min_salary", "max_salary")][df_fake_job[c("min_salary", "max_salary")] == ""] <- NA
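The two new salary columns are still character vectors at this point. If numeric values are needed later, a conversion along the following lines could be used. This step is not part of the original report, it uses base R in place of str_split_fixed(), and the sample values (including the non-numeric one) are made up.

```r
# Made-up salary strings in the same "min-max" form as salary_range
salary_range <- c("20000-28000", "", "50000", "negotiable")

# Text before the first "-" is the minimum; text after it is the maximum
min_salary <- sub("-.*$", "", salary_range)
max_salary <- ifelse(grepl("-", salary_range),
                     sub("^[^-]*-", "", salary_range), "")

# Empty strings become NA, then convert; non-numeric text also becomes NA
min_salary[min_salary == ""] <- NA
max_salary[max_salary == ""] <- NA
min_num <- suppressWarnings(as.numeric(min_salary))
max_num <- suppressWarnings(as.numeric(max_salary))

print(data.frame(salary_range, min_num, max_num))
```

A value without a hyphen, like the third toy entry, yields a minimum but no maximum. That would be one possible explanation for max_salary having one more NA than min_salary in the counts further down.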

Drop location and salary_range

df_fake_job <- select(df_fake_job, -c(location, salary_range))

View the structure of the data

glimpse(df_fake_job)

## Rows: 17,880
## Columns: 21
## $ job_id              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,~
## $ title               <chr> "Marketing Intern", "Customer Service - Cloud Vide~
## $ department          <chr> "Marketing", "Success", "", "Sales", "", "", "ANDR~
## $ company_profile     <chr> "We're Food52, and we've created a groundbreaking ~
## $ description         <chr> "Food52, a fast-growing, James Beard Award-winning~
## $ requirements        <chr> "Experience with content management systems a majo~
## $ benefits            <chr> "", "What you will get from usThrough being part o~
## $ telecommuting       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ has_company_logo    <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,~
## $ has_questions       <int> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,~
## $ employment_type     <chr> "Other", "Full-time", "", "Full-time", "Full-time"~
## $ required_experience <chr> "Internship", "Not Applicable", "", "Mid-Senior le~
## $ required_education  <chr> "", "", "", "Bachelor's Degree", "Bachelor's Degre~
## $ industry            <chr> "", "Marketing and Advertising", "", "Computer Sof~
## $ function.           <chr> "Marketing", "Customer Service", "", "Sales", "Hea~
## $ fraudulent          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ country             <chr> "US", "NZ", "US", "US", "US", "US", "DE", "US", "U~
## $ state               <chr> "NY", NA, "IA", "DC", "FL", "MD", "BE", "CA", "FL"~
## $ city                <chr> "New York", "Auckland", "Wever", "Washington", "Fo~
## $ min_salary          <chr> NA, NA, NA, NA, NA, NA, "20000", NA, NA, NA, "1000~
## $ max_salary          <chr> NA, NA, NA, NA, NA, NA, "28000", NA, NA, NA, "1200~

class(df_fake_job)

## [1] "data.frame"

View column names

names(df_fake_job)

##  [1] "job_id"              "title"               "department"         
##  [4] "company_profile"     "description"         "requirements"       
##  [7] "benefits"            "telecommuting"       "has_company_logo"   
## [10] "has_questions"       "employment_type"     "required_experience"
## [13] "required_education"  "industry"            "function."          
## [16] "fraudulent"          "country"             "state"              
## [19] "city"                "min_salary"          "max_salary"

Check for duplicate job IDs

table(duplicated(df_fake_job$job_id))

## 
## FALSE 
## 17880

There are no duplicate job IDs.
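Unique IDs do not rule out the same advertisement being posted more than once under different IDs. A possible follow-up check, sketched here on a made-up data frame, is to look for duplicates across all columns except job_id:

```r
# Made-up postings: rows 1 and 3 are identical apart from job_id
toy <- data.frame(
  job_id      = 1:3,
  title       = c("Data Analyst", "Sales Executive", "Data Analyst"),
  description = c("Analyse data", "Sell things", "Analyse data"),
  stringsAsFactors = FALSE
)

# The IDs are unique ...
id_dupes <- sum(duplicated(toy$job_id))

# ... but the content is not: drop job_id before checking
content_dupes <- sum(duplicated(toy[, setdiff(names(toy), "job_id")]))

cat("duplicate IDs:", id_dupes, " duplicate postings:", content_dupes, "\n")
```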

Count the missing values for each feature

colSums(is.na(df_fake_job))

##              job_id               title          department     company_profile 
##                   0                   0                   0                   0 
##         description        requirements            benefits       telecommuting 
##                   0                   0                   2                   0 
##    has_company_logo       has_questions     employment_type required_experience 
##                   0                   0                   0                   0 
##  required_education            industry           function.          fraudulent 
##                   0                   0                   0                   0 
##             country               state                city          min_salary 
##                 346                2580                2067               15012 
##          max_salary 
##               15013

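Among the original columns, only ‘benefits’ has missing values (two). The derived country, state, city, min_salary and max_salary columns contain NA wherever the source field was empty.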
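is.na() only sees true NA values, so blank strings in columns such as department are not counted above. If blanks should be treated as missing everywhere, one option (not a step taken in the report) is to convert them in all character columns before counting. Shown on a made-up data frame:

```r
# Made-up miniature of the job data with blanks stored as ""
toy <- data.frame(
  department    = c("Marketing", "", ""),
  telecommuting = c(0L, 1L, 0L),
  benefits      = c("", "See job description", NA),
  stringsAsFactors = FALSE
)

# Replace "" with NA in character columns only; numeric columns are untouched
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], function(x) {
  x[which(x == "")] <- NA
  x
})

na_counts <- colSums(is.na(toy))
print(na_counts)
```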

List rows with missing values

missingdf <- df_fake_job[!complete.cases(df_fake_job), ]
sample_n(missingdf, 3)

##   job_id                                            title department
## 1  12183 Title Insurance: Commercial Underwriting Counsel           
## 2   7336                  Customer Service Representative           
## 3  11857                          Telesales Opportunities           
##   company_profile
## 1 #URL_e7c9057d5e6f097876436d175031e95669ede4ebaab52b6be0957c837bc98343#
## 2 Hawkeye Recruitment provides cost effective recruitment advertising solutions to help you cast the widest net to find the perfect candidate for your job. We can help improve your recruitment efforts, and streamline your hiring process.
## 3 Established on the principles that full time education is not for everyone Spectrum Learning is made up of a team of passionate consultants with the drive for putting people who wish to grow themselves through education whilst working into long term and relevant job roles.We also are official re-sellers for The Institute of Recruiters/ Study Course professional courses in HR Practice, In-House Recruitment and Agency RecruitmentIt is our mission to help anyone wishing to pursue an apprenticeship onto the right qualification and into the right job.We work closely with both the candidate and the employer to ensure when the learner is enrolled they are at the start of a long and successful career.We have great relationships with a number of national training providers to ensure we can cover any apprenticeship available.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  A well run, very well connected Title Insurance Agency based in NY, has a need for an experienced Commercial Underwriting Counsel. This position can be based either in the NYC location or Garden City -Long Island location. He or she will have significant responsibility within the organization and should be able to operate as a senior executive in interactions with both internal and external constituents. This is a great opportunity for the right person. Drop us a line if you fit the qualifications below and are interested in the role.Underwriting counsel1. 5-10 years of NY, commercial, independent underwriting for an underwriter or large commercial agent2. Strong commercial reading background;3. Strong commercial surveys reading experience;4. Strong NY Practice experience;5. Strong understanding for development rights transactions in NY;6. Strong understanding of NYC mortgage tax and NY transfer tax consequencesAll Inquiries are strictly confidential
## 2 As a trusted systems integrator for more than 50 years, General Dynamics Information Technology provides information technology (IT), systems engineering, professional services and simulation and training to customers in the defense, federal civilian government, health, homeland security, intelligence, state and local government and commercial sectors. With approximately 28,000 professionals worldwide, the company delivers IT enterprise solutions, manages large-scale, mission-critical IT programs and provides mission support services. GDIT is an Equal Opportunity/Affirmative Action Employer - Minorities/Females/Protected Veterans/DisabledGENERAL SUMMARY:Â The CMS Customer Service Representative I (CSR) is responsible for delivering general Marketplace information to callers. The CSRs use basic office equipment and technology such as telephones, email, and web browsers to perform their duties. The processes that the CSRs must follow are well defined and documented in standard operating procedures and scripts. Prescribed scripts must be read verbatim to the caller. Neither subject matter knowledge nor independent decision making is required by this position.The Customer Service Representative I reports directly to the Customer Service Supervisor. This is an entry level position responsible for disseminating general Marketplace information. Application processing, enrollment guidelines and a general Marketplace background will be the focus with callers. The Customer Service Representative I will follow scripting to determine when to transfer the caller to a Customer Service Representative IIGeneral Dynamics Information Technology is an Equal Opportunity/Affirmative Action Employer (M/F/D/V).
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    We are a busy recruitment agency in Wakefield looking for Telesales Executives.Â We are now able to offer a number of apprenticeship and training opportunities to businesses looking for new staff or employers looking to train their staff, and we urgently need Sales staff to sell these opportunities!The role will involve business to business telesales making a high volume of calls each day.Â As our company is currently growing at the moment this position has excellent career prospects and we are looking for long term members of staff.If you are interested please apply now.
##   requirements
## 1 
## 2 JOB RESPONSIBILITIES:• Utilize standard technology such as telephone, e-mail, and web browser to perform job duties.• Provide knowledgeable responses to telephone inquiries in a courteous and professionalmanner, utilizing pre-scripted responses which they must read verbatim to provide basic general and claims specific information.• Follow established and documented policies and standard operating procedures such as filling out timesheets, adhering to privacy rules and responding to numerous phone inquiries.• Assist caller with filling out online application and submitting it electronically to plan provider for processing.• Complete basic call log related to the phone inquiries such as clicking radio buttons to confirm which scripts were read by the CSR to the caller.• Refer calls as required to Customer Service Representative II.• Maintain up-to-date knowledge of CMS regulations and policies as they apply.• Report problems that occur via the online system so they can be addressed by the appropriate parties.• Respond to telephone inquiries within the set departmental staffing and time parameters.• May be required to work GDIT scheduled holidays. Overtime may be required.• Perform other related duties as assigned.• High School diploma or equivalent requiredWORKING CONDITIONS:The work is typically performed in an office environment, which requires proper safety and security precautions. To ensure our contact center production area is at minimal risk for unauthorized disclosure (that is, the release or divulgence of information by an entity to persons or organizations outside of that entity) of Personally Identifiable Information (PII) or Protected Health Information (PHI), the work environment operates under a Secure Floor Policy. The Secure Floor Policy limits or restricts personal belongings, electronic devices, or paper that can be brought into production areas.The above job description is not intended to be, nor should it be construed as, exhaustive of all responsibilities, skills, efforts, or working conditions associated with this job.Requests for reasonable accommodations will be considered to enable individuals with disabilities to perform the principal (essential) functions of this job.EXPERIENCE:• Minimum 6 months customer service/secretarial/telemarketing experience required.• Must be able to speak and read English clearly, professionally and fluently.• Must be able to type a minimum of 20 WPM.• Ability to effectively work within established contractual turnaround times required.• Must have demonstrated excellent interpersonal and the ability to organize simultaneous tasks.• Proven ability to work as a member of a team.• All CMS personnel will be required by contract to undergo program update training as the program changes.• Spanish fluency is desirable
## 3 Love of sales.Excellent telephone manner.2 year business to business telesales experience.
##                          benefits telecommuting has_company_logo has_questions
## 1                                             0                1             0
## 2                                             0                1             0
## 3 Career prospects.Busy workload.             0                1             1
##   employment_type required_experience        required_education
## 1       Full-time                                              
## 2       Full-time         Entry level High School or equivalent
## 3       Full-time           Associate                          
##             industry        function. fraudulent country state       city
## 1        Real Estate                           0      US    NY   New York
## 2 Telecommunications Customer Service          0      US    IA Coralville
## 3                               Sales          0      GB   WKF  Wakefield
##   min_salary max_salary
## 1       <NA>       <NA>
## 2       <NA>       <NA>
## 3       <NA>       <NA>

Visualize missing rates for each feature

gg_miss_var(df_fake_job, show_pct = TRUE) + labs(y = "% Missing")
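
The missing rates can also be tabulated numerically, which is useful when the plot is not at hand. The following is a minimal sketch on a toy data frame; `toy` and its values are hypothetical stand-ins for df_fake_job, and only base R functions are used.

```r
# Toy data frame standing in for df_fake_job (hypothetical values)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, "x", "y"),
  stringsAsFactors = FALSE
)

# Share of NA per column as a percentage, sorted from most to least missing
miss_pct <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
print(miss_pct)
```

Replacing `toy` with the real data frame should give the same percentages that the plot displays.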

Merge columns and create a new ‘full_text’ column

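# Drop the sparse salary and location columns before merging the text fields
viz_df <- select(df_fake_job, -c(max_salary, min_salary, state, city))

# Columns to merge into 'full_text'
text_cols <- c("title", "country", "department", "company_profile",
               "description", "requirements", "benefits", "employment_type",
               "required_experience", "required_education", "industry",
               "function.")

# Replace NA with "" in place. na.omit() would shorten a column, and paste()
# would then recycle it, pairing text from different rows.
blank_na <- function(x) {
  x <- as.character(x)
  x[is.na(x)] <- ""
  x
}
viz_df$full_text <- do.call(paste, unname(lapply(viz_df[text_cols], blank_na)))

# Treat empty strings as missing
viz_df[viz_df == ""] <- NA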
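
A caution on row alignment when merging columns that contain NA: na.omit() returns a shorter vector, and paste() silently recycles shorter inputs, so text from different rows can end up in the same merged string. The sketch below illustrates this on toy data; `toy` and its values are hypothetical and not taken from the dataset.

```r
# Toy data: the second row has a missing country (hypothetical values)
toy <- data.frame(
  title   = c("A", "B", "C"),
  country = c("US", NA, "GB"),
  stringsAsFactors = FALSE
)

# Dropping NA shortens 'country' to c("US", "GB"); paste() recycles it,
# so row 2 picks up "GB" and row 3 picks up "US"
misaligned <- paste(toy$title, na.omit(toy$country))

# Replacing NA with "" keeps every value on its own row
aligned <- paste(toy$title, ifelse(is.na(toy$country), "", toy$country))

print(misaligned)  # "A US" "B GB" "C US"
print(aligned)     # "A US" "B "   "C GB"
```

Checking a few merged rows against their source columns is a quick way to confirm that the alignment holds on the real data.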

Visualize the missing-value profile for each feature

plot_missing(viz_df)

Heatmap of missingness across the data frame

vis_miss(viz_df)

Drop the text columns that were merged into ‘full_text’

model_df <- select(viz_df, 
                   -c(title, 
                      country, 
                      department, 
                      company_profile, 
                      description, 
                      requirements, 
                      benefits, 
                      employment_type, 
                      required_experience, 
                      required_education, 
                      industry, 
                      function.))
sample_n(model_df, 3)

##   job_id telecommuting has_company_logo has_questions fraudulent
## 1   4394             0                1             0          0
## 2   8490             0                1             1          0
## 3   6125             0                1             0          0
##   full_text
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      EXPERIENCED WAITER/ESS NEEDED @ YOOBI - LONDON'S 1ST TEMAKERIA - SUSHI RESTAURANT US   STRONG COMMAND OF ENGLISH NECESSARYÂ Here at Yoobi, we are looking to expand our team in order to accommodate for large customer demand.Â We have a simple mission Â– making LondonÂ’s best temaki sushi with the freshest, most sustainable ingredients 
around whilst having fun together and with our customers.Â Role DescriptionÂ Customer service is where it all starts at Yoobi Â– it is the first step to building your career with us. Sharpen your people, and teamwork skills, and learn how to run every aspect of creating a great experience for our customers. Get ready to grow!Â We are looking forÂ…:Â Passionate people. People who operate with a sense of urgency. People who smile uncontrollably. People who love to serve. Foodies, eaters, and sushi aficionados. Neat-freaks. People who are willing to learn from their mistakes. People who want to have a voice in their workplace. People who want to jump at the opportunity to join a rapidly growing company with extremely high standards.Â The ideal candidate will need to have:Â - Have excellent command of English - written, spoken &amp; comprehensionÂ - Experience in working in a restaurantÂ - Have great customer service skillsÂ - Be able to work under pressureÂ - Quick LearnerÂ - Have a position attitudeÂ In return, We will offer you:Â - Competitive wage plus cash tipsÂ - Free staff mealÂ - Paid holidayÂ - Help you develop your careerÂ 3 Quick Questions You Must Answer:Â 1. Who is the coolest person in the world?Â 2. What is your favourite current song?Â 3. Can you whistle?Â Send us a message, answer the questions and attach a copy of your resume with references. This is your first step to starting your career at Yoobi!   Full-time Not Applicable Unspecified Restaurants Customer Service
## 2 Front end engineer US  Mindscape is a Wellington based software development company that specialises in building tools for software engineers. We have a high growth product, Raygun (#URL_6b2f170addc3dd0415d65e21a8ece81d4c134c2b1a8b449386367dfaa286971b#) that's growing strongly. Mindscape is profitable and recently raised money to aggressively expand. Well respected, Mindscape has won international and national awards for excellence in software and has thousands of customers, including BMW, NATO, Intel, Microsoft &amp; Beats Music to name a few. If you're up for the challenge of joining a fast growing business then look no further. RaygunÂ is a fast growing MindscapeÂ product (#URL_6b2f170addc3dd0415d65e21a8ece81d4c134c2b1a8b449386367dfaa286971b#). Raygun is a hosted service for automatically collecting data about software crashes and errors. It has a strong design aesthetic with plenty of opportunity to be creative, quirky and professional all at once so it's no suprise that customers love the current design and cite it as being one of the many reasons they choose our service.You'll be joining a small team and have a direct impact on the Raygun web application. You should have extremely solid production skills with CSS/JavaScript, as well as a strong interest in the usability of what you're designing. This role is predominantly about design, but a full-stack skillset to implement your designs in the application would be a substantial benefit.One of the great things about building a product for a technical audience that we can use cutting edge technologies. Forget Internet Explorer 7 support - if our customers used that, they'd already be out of a job. 
You get to work with all the latest buzzword technologies and frameworks -Â HTML5 (we particularly love the Canvas tag), CSS3, D3.js, #URL_b7bad8ac916069eadd573f035544c52dc3519a0ba054fb7ab1ff9ba3e1525399#.Â Our team is tight, and you'll be working directly with our lead designer, implementing great stuff with him and also being part of the design process yourself. You'll be tasked with creating a world leading user experience. We have users who want to pay for our product just for how beautiful it looks and we want you to help dial up the front end even further!Raygun is growing strongly, with thousands of developers globally using the service. Mindscape is well respected company for excellence in product development. The opportunity to join a fast growing, fast moving company where you have a direct impact on the application is here -Â are you up for the challenge? 3 years of frontend development experience.Highly skilled at HTML, CSS and JavaScript.Great taste, strong empathy, customer focus.Effectively incorporates broad goals into tactical work.Experience with Backbone, Angular JS, #URL_1d0f9eb2a7073ab63d5cfc0f9762fb40962b2b8ad1607a31c869aa4fd0382977# or #URL_ec870d4c32d3db2026283bb633aad057f18c5d5242768ddea14d56d6a38b12ef# is a plus.Experience with D3 is a plus. Youâ\200\231ll get other perks in the office like having a sweet place to work, where weirdness is welcomed and encouraged. Youâ\200\231ll get fresh fruit, and lollies (a balanced diet!). You can choose to work from a couch or a standing desk or a sitting desk. Â And lastly, youâ\200\231ll get the opportunity to join one hell of a crazy awesome ride with us. There arenâ\200\231t very many New Zealand-based SaaS companies who are in the same position to dent the world. Full-time  Bachelor's Degree Information Technology and Services Engineering
## 3 Marketing Communications Specialist US  As a growing and successful startup, Conversocial is a great place to work for ambitious individuals.We build a market leading social customer service solution, and we need even more great people to help us push that position even further. 
Youâ\200\231ll get the opportunity to work in an exciting new market, where weâ\200\231re helping companies to understand the solution to their problems and are changing the way they interact with consumers.We have a trusting, hands-off management style, which is suited for people that are self-motivated.Our employees have the opportunity for independence and responsibility over their own projects, but we provide all the support and training they need to get there and to develop their careers.At Conversocial we like to balance work and play.We eat lunch together everyday (a company perk) and all enjoy a Friday treat of cake and few drinks. Our close-knit team is very sociable, which makes the Conversocial office a relaxed, fun and supportive working environment. You will work closely with Head of Marketing - EMEA to develop outstanding content that engages our audience and, ultimately, drives inbound leads.Â As the Marketing Communications Specialist, you will be responsible for drafting Conversocialâ\200\231s best practice guides, white papers, case studies and some contributions to our blog.Â  You will need to be confident, as a large part of this role will be interviewing/building relationships with clients in order to create engaging content.Â The marketing team is small, so you also need to be a team player prepared to â\200œmuck inâ\200\235 with marketing activities such as manning event stands.This is an exciting opportunity to challenge yourself and join a talented team within the technology space. You must want to be a team player and thrive off creating engaging content and copy.You will enjoy and have experience of delivering thought leadership content within the B2B technology space. 
As a Digital Communications Specialist, you will have:â\200¢Â Â Â Demonstrable experience in creating thought leadership contentâ\200¢Â Â Â Knowledge and understanding of Social Customer Serviceâ\200¢Â Â Â Great organisational skillsâ\200¢Â Â Â Be a team player, yet capable of working independently $50 - $70k DOE and Performance + Medical, 401kHealth, Vision and Dental Insurance401k w/ 4% matchingGrowth Opportunities Available

Check NA or missing values

sum(is.na(model_df))

## [1] 0

sum(model_df == "")

## [1] 0

Visualize missing values

vis_miss(model_df)

vis_dat(model_df)

Exploratory Data Analysis (EDA)

Before building our models, we performed exploratory data analysis to understand the dataset.

Visualize fraudulent and real job postings

viz_df2 <- viz_df
viz_df2$fraudulent[viz_df2$fraudulent == 1] <- "Fraud"
viz_df2$fraudulent[viz_df2$fraudulent == 0] <- "Non Fraud"
count <- table(viz_df2$fraudulent)
bar <- barplot(count, 
               main="Proportion of fraudulent job postings", 
               xlab="fraudulent", 
               ylab="count", 
               col=c(rgb(0.3,0.1,0.4,0.6), rgb(0.3,0.9,0.4,0.6)))
text(bar, count/2, labels = count)

The dataset contains 17,014 legitimate job postings and 866 fraudulent ones, which gives a fraud rate of 4.84%.
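As a quick check of the figures above, the fraud rate follows directly from the two counts. The variable names below are illustrative and are not part of the original analysis.

```r
# Counts reported for the bar chart above
n_real  <- 17014
n_fraud <- 866

n_total    <- n_real + n_fraud    # total number of postings
fraud_rate <- n_fraud / n_total   # proportion of fraudulent postings

round(100 * fraud_rate, 2)        # fraud rate in percent: 4.84
```

With fewer than 5 in every 100 postings labelled as fraud, the two classes are strongly imbalanced.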

Visualize country-wise job postings

temp <- na.omit(subset(viz_df, select = c(country))) %>% 
  group_by(country) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  top_n(10, n)

par(mar=c(6,4,4,4))
barplot(height=temp$n, 
        main="Top 10 country-wise job postings", 
        ylab="count", 
        col=brewer.pal(10, "Set3"), 
        names.arg=c("United States",
                    "United Kingdom",
                    "Greece",
                    "Canada",
                    "Germany",
                    "New Zealand",
                    "India",
                    "Australia",
                    "Philippines",
                    "Netherlands"), 
        cex.names=0.7, 
        las=2)

The ten countries with the most job postings are US, GB, GR, CA, DE, NZ, IN, AU, PH and NL. The United States accounts for 10,656 job postings, followed by the United Kingdom with 2,384 and Greece with 940.

Visualize the industries

temp <- na.omit(subset(viz_df, select = c(industry))) %>% 
  group_by(industry) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  top_n(10, n)

par(mar=c(10,4,4,4))
barplot(height=temp$n, 
        names=temp$industry, 
        main="Top 10 industries", 
        ylab="count", 
        col=brewer.pal(10, "Set3"), 
        cex.names=0.6, 
        las=2)

The most common industries are IT related: Information Technology and Services (1,734), Computer Software (1,376) and Internet (1,062).

Visualize the departments

temp <- na.omit(subset(viz_df, select = c(department))) %>% 
  group_by(department) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  top_n(10, n)

par(mar=c(8,4,4,4))
barplot(height=temp$n, 
        names=temp$department, 
        main="Top 10 departments", 
        ylab="count", 
        col=brewer.pal(10, "Set3"), 
        cex.names=0.6, 
        las=2)

The top hiring departments are Sales (551), Engineering (487) and Marketing (401).

Visualize the required experiences in the jobs

viz_df %>% group_by(required_experience) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  drop_na() %>% 
  top_n(10, n) %>% 
  ggplot(aes(x=reorder(required_experience, -n), y = n)) + 
  geom_segment(aes(x=reorder(required_experience, -n), xend=reorder(required_experience, -n), y=0, yend=n), color="skyblue") + 
  geom_point(color="steelblue", size=2, alpha=1) + 
  theme_bw() + # apply the complete theme first so that the tweaks below are not overwritten
  coord_flip() + 
  theme(panel.grid.major.y = element_blank(), 
        panel.border = element_blank(), 
        axis.ticks.y = element_blank()) + 
  labs(title = "Listed jobs with required experiences", 
       x = "Experience", 
       y = "Count", 
       fill = "Experience") + 
  geom_text(aes(label=round(n,0)), vjust=-0.6)

Mid-Senior level is the most frequently listed experience requirement, followed by Entry level and Associate.

Visualize the required education in the jobs

viz_df %>% group_by(required_education) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  drop_na() %>% 
  top_n(10, n) %>% 
  ggplot(aes(x=reorder(required_education, -n), y = n)) + 
  geom_segment(aes(x=reorder(required_education, -n), xend=reorder(required_education, -n), y=0, yend=n), color="skyblue") + 
  geom_point(color="steelblue", size=2, alpha=1) + 
  theme_bw() + # apply the complete theme first so that the tweaks below are not overwritten
  coord_flip() + 
  theme(panel.grid.major.y = element_blank(), 
        panel.border = element_blank(), 
        axis.ticks.y = element_blank()) + 
  labs(title = "Listed jobs with required education", 
       x = "Education", 
       y = "Count", 
       fill = "Education") + 
  geom_text(aes(label=round(n,0)), vjust=-0.6)

Most job ads that state an education requirement ask for at least a Bachelor’s degree.

Visualize fraudulent job postings based on employment types

viz_df2 <- viz_df
viz_df2$employment_type <- ifelse(is.na(viz_df2$employment_type), "Missing", viz_df2$employment_type)
df1 <- subset(viz_df2, select = c(employment_type, fraudulent)) %>% 
  group_by(employment_type, fraudulent) %>% 
  summarize(yes = sum(fraudulent==1), .groups = 'drop') %>% 
  filter(fraudulent==1)
df2 <- subset(viz_df2, select = c(employment_type, fraudulent)) %>% 
  group_by(employment_type, fraudulent) %>% 
  summarize(no = sum(fraudulent==0), .groups = 'drop') %>% 
  filter(fraudulent==0)
df_new <- merge(df1, df2, by = c("employment_type")) %>% 
  group_by(employment_type) %>% 
  summarize(pct_fraud = round(yes/(yes+no), digits=3), 
            pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>% 
  mutate(employment_type = factor(employment_type, 
                                  levels = c('Part-time',
                                             'Missing',
                                             'Other',
                                             'Full-time',
                                             'Contract',
                                             'Temporary')))
fig <- df_new %>% plot_ly(width = 700, height = 400)
fig <- fig %>% add_trace(x = ~employment_type, y = ~pct_non_fraud, type = 'bar', 
             text = ~paste0(pct_non_fraud*100,"%"), textposition = 'outside', name = 'pct_non_fraud', 
             marker = list(color = 'rgb(158,202,225)', 
                           line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% add_trace(x = ~employment_type, y = ~pct_fraud, type = 'bar', 
            text = ~paste0(pct_fraud*100,"%"), textposition = 'outside', name = 'pct_fraud', 
            marker = list(color = 'rgb(58,200,225)', 
                          line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% layout(title = "Employment types with % fraud and non-fraud",
         barmode = 'group',
         xaxis = list(title = "employment_type"),
         yaxis = list(title = "percentage"))
fig

The percentage of fraudulent job postings is the highest for part-time jobs, nearly 9%. Jobs without an employment type also have a high fraud rate, around 7%.
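The per-group fraud rate can also be computed in a single grouped summary, without building and merging two intermediate data frames. The following is a minimal sketch on made-up data; `toy_df` and `fraud_by_type` are hypothetical names, and the values are not taken from the dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for viz_df (values are made up)
toy_df <- data.frame(
  employment_type = c("Part-time", "Part-time", "Full-time", "Full-time",
                      "Full-time", NA, NA, "Contract"),
  fraudulent      = c(1, 0, 0, 0, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)

fraud_by_type <- toy_df %>%
  # Treat a missing employment type as its own category
  mutate(employment_type = ifelse(is.na(employment_type), "Missing", employment_type)) %>%
  group_by(employment_type) %>%
  # The mean of a logical vector is the share of TRUE values
  summarize(pct_fraud = round(mean(fraudulent == 1), digits = 3),
            pct_non_fraud = 1 - pct_fraud,
            .groups = "drop") %>%
  arrange(desc(pct_fraud))

fraud_by_type
```

One difference worth noting: a category with no fraudulent postings (here "Contract") stays in this result with a rate of 0, whereas an inner merge of the fraud and non-fraud tables would drop it.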

Visualize fraudulent job postings based on required experiences

viz_df2 <- viz_df
viz_df2$required_experience <- ifelse(is.na(viz_df2$required_experience), "Not Applicable", viz_df2$required_experience)
df1 <- subset(viz_df2, select = c(required_experience, fraudulent)) %>% 
  group_by(required_experience, fraudulent) %>% 
  summarize(yes = sum(fraudulent==1), .groups = 'drop') %>% 
  filter(fraudulent==1)
df2 <- subset(viz_df2, select = c(required_experience, fraudulent)) %>% 
  group_by(required_experience, fraudulent) %>% 
  summarize(no = sum(fraudulent==0), .groups = 'drop') %>% 
  filter(fraudulent==0)
df_new <- merge(df1, df2, by = c("required_experience")) %>% 
  group_by(required_experience) %>% 
  summarize(pct_fraud = round(yes/(yes+no), digits=3), 
            pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>% 
  mutate(required_experience = factor(required_experience, 
                                      levels = c('Executive',
                                                 'Entry level',
                                                 'Not Applicable',
                                                 'Director',
                                                 'Mid-Senior level',
                                                 'Internship',
                                                 'Associate')))
fig <- df_new %>% plot_ly(width = 700, height = 400)
fig <- fig %>% add_trace(x = ~required_experience, y = ~pct_non_fraud, type = 'bar', 
             text = ~paste0(pct_non_fraud*100,"%"), textposition = 'outside', name = 'pct_non_fraud', 
             marker = list(color = 'rgb(158,202,225)', 
                           line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% add_trace(x = ~required_experience, y = ~pct_fraud, type = 'bar', 
            text = ~paste0(pct_fraud*100,"%"), textposition = 'outside', name = 'pct_fraud', 
            marker = list(color = 'rgb(58,200,225)', 
                          line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% layout(title = "Required experiences with % fraud and non-fraud",
         barmode = 'group',
         xaxis = list(title = "required_experience"),
         yaxis = list(title = "percentage"))
fig

Executive and Entry level postings have the highest fraud rates, nearly 7%. The Entry level result is consistent with scams targeting jobs that require minimal qualifications and little experience.

Visualize fraudulent job postings based on job functions

viz_df2 <- viz_df
viz_df2$fraudulent[viz_df2$fraudulent == 1] <- "Fraud"
viz_df2$fraudulent[viz_df2$fraudulent == 0] <- "Non Fraud"
temp <- na.omit(subset(viz_df2, select = c(function., fraudulent))) %>% 
  group_by(function., fraudulent) %>% 
  summarize(n = n(), .groups = 'drop') %>% 
  group_by(function.) %>% 
  summarize(pct_fraud = round(sum(n[fraudulent == "Fraud"]) / sum(n), digits = 3), 
            pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>% 
  arrange(desc(pct_fraud)) %>% 
  top_n(10, pct_fraud) %>% 
  mutate(function. = factor(function., 
                            levels = c('Administrative',
                                       'Financial Analyst',
                                       'Accounting/Auditing',
                                       'Distribution',
                                       'Other',
                                       'Finance',
                                       'Engineering',
                                       'Business Development',
                                       'Advertising',
                                       'Customer Service')))
melted_temp <- melt(temp, id = "function.")
ggplot(melted_temp, aes(x = function., y = value, fill = variable)) + 
  geom_bar(position = "fill", 
           stat = "identity", 
           color = "black", 
           width = 0.8) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.6)) + 
  scale_y_continuous(labels = scales::percent) + 
  geom_text(aes(label = paste0(value*100,"%")), 
            position = position_stack(vjust = 0.6), size = 2) + 
  ggtitle("Job functions with % fraud and non-fraud") + 
  xlab("function") + 
  ylab("percentage")

The job function with the highest share of fraudulent postings is Administrative, at close to 19%, followed by Financial Analyst and Accounting/Auditing. Administrative jobs therefore appear the most suspicious, possibly because such roles make it easy for scammers to disguise their scams.

Visualize fraudulent job postings based on required education

temp <- na.omit(subset(viz_df2, select = c(required_education, fraudulent))) %>% 
  group_by(required_education, fraudulent) %>% 
  summarize(n = n(), .groups = 'drop') %>% 
  group_by(required_education) %>% 
  summarize(pct_fraud = round(sum(n[fraudulent == "Fraud"]) / sum(n), digits = 3), 
            pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>% 
  arrange(desc(pct_fraud)) %>% 
  top_n(10, pct_fraud) %>% 
  mutate(required_education = factor(required_education, 
                                     levels = c("Some High School Coursework",
                                                "Certification",
                                                "High School or equivalent",
                                                "Master's Degree",
                                                "Professional",
                                                "Unspecified",
                                                "Doctorate",
                                                "Some College Coursework Completed",
                                                "Associate Degree",
                                                "Bachelor's Degree")))
melted_temp <- melt(temp, id = "required_education")
ggplot(melted_temp, aes(x = required_education, y = value, fill = variable)) + 
  geom_bar(position = "fill", 
           stat = "identity", 
           color = "black", 
           width = 0.8) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.6)) + 
  scale_y_continuous(labels = scales::percent) + 
  geom_text(aes(label = paste0(value*100,"%")), 
            position = position_stack(vjust = 0.6), size = 2) + 
  ggtitle("Required education with % fraud and non-fraud") + 
  xlab("required_education") + 
  ylab("percentage")

As many as 74% of the job postings that require only “Some High School Coursework” are fraudulent, the highest fraud rate of any education level.
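The `pct_fraud` value computed in the code above is the share of fraud within each education level, which is not the same as the share of each education level among fraudulent postings. The sketch below shows the difference using hypothetical counts; the numbers are made up for illustration and are not taken from the dataset.

```r
# Hypothetical counts (made up for illustration)
counts <- matrix(c(20,   7,
                   100, 4900),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(
                   required_education = c("Some High School Coursework",
                                          "Bachelor's Degree"),
                   fraudulent = c("Fraud", "Non Fraud")))

# Row proportions: share of fraud WITHIN each education level
# (the quantity that pct_fraud measures)
within_level <- round(prop.table(counts, margin = 1), 3)

# Column proportions: share of each education level AMONG fraudulent postings
among_fraud <- round(prop.table(counts, margin = 2), 3)

within_level
among_fraud
```

In this toy table, 74.1% of the "Some High School Coursework" postings are fraudulent, yet they make up only 16.7% of all fraudulent postings, so the two readings should not be interchanged.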

Word Cloud

Word clouds are used to show the most frequent keywords in fraudulent and real job postings. The postings are split into two subsets by the fraudulent label, and a word cloud is plotted from the job titles of each subset.

Word Cloud of fraudulent job postings

selected_df <- subset(viz_df, fraudulent == 1)

# Create a vector containing only the text
text <- selected_df$title

# Create a corpus
docs <- Corpus(VectorSource(text))

docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))

dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix), decreasing=TRUE)
df <- data.frame(word = names(words), freq=words)

wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

Many of the fraudulent job postings have common keywords in the job titles - “Data Entry”, “Administrative”, “Home Based”, “Earn Daily”.

Word Cloud of NON-fraudulent job postings

selected_df <- subset(viz_df, fraudulent == 0)

# Create a vector containing only the text
text <- selected_df$title

# Create a corpus
docs <- Corpus(VectorSource(text))

docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))

dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix), decreasing=TRUE)
df <- data.frame(word = names(words), freq=words)

wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

Many of the NON-fraudulent job postings have common keywords in the job titles - “Manager”, “Developer”, “Engineer”.
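Both word clouds rely on the same term-frequency computation, so it could be wrapped in one function and applied to each subset. The following base R sketch illustrates the idea; `title_term_freq` and `toy_titles` are hypothetical names, the titles are made up, and, unlike the tm pipeline above, this version replaces punctuation with a space and does not remove stop words.

```r
# Hypothetical helper: term frequencies for a character vector of titles
title_term_freq <- function(titles) {
  cleaned <- tolower(titles)
  cleaned <- gsub("[[:digit:]]+", " ", cleaned)   # remove numbers
  cleaned <- gsub("[[:punct:]]+", " ", cleaned)   # replace punctuation with a space
  tokens  <- unlist(strsplit(cleaned, "[[:space:]]+"))
  tokens  <- tokens[nchar(tokens) > 0]            # drop empty tokens
  freq    <- sort(table(tokens), decreasing = TRUE)
  data.frame(word = names(freq), freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

# Made-up titles for illustration
toy_titles <- c("Data Entry Clerk", "Home Based Data Entry", "Senior Software Engineer")
toy_freq <- title_term_freq(toy_titles)
toy_freq
```

The result has the same `word` and `freq` columns that the `wordcloud()` calls above take as input.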

Modeling

Before modeling, the final dataset is fixed. It contains the following features:

fraudulent (target variable)
telecommuting
has_company_logo
has_questions
full_text: a combination of title, country, department, company_profile, description, requirements, benefits, employment_type, required_experience, required_education, industry and function

The five supervised machine learning algorithms used in this project are:

Logistic Regression
Random Forest
K-Nearest Neighbor (KNN)
XGBoost
Support Vector Machine (SVM)

Data pre-processing (full_text)

For this analysis, the entire full_text column is converted to a DocumentTermMatrix and then to a data frame.

docs <- Corpus(VectorSource(model_df$full_text))
docs <- docs %>%
  tm_map(removeNumbers) %>% # Remove numbers
  tm_map(removePunctuation) %>% # Remove punctuation
  tm_map(stripWhitespace) # Eliminate extra white spaces
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))

# Convert each full_text into a row with columns containing each term in the document and giving the frequency of unique words used in the full_text
dtm <- DocumentTermMatrix(docs)
sparse_data <- removeSparseTerms(dtm, 0.90) # Drop terms that are missing from roughly 90% or more of the documents

# Convert to dataframe for further analysis
sparse_data_df <- as.data.frame(as.matrix(sparse_data))
# Drop the stray dash token that survived the punctuation removal
final_df <- subset(sparse_data_df, select = -c(`–`))

# Add other variables
final_df$telecommuting <- model_df$telecommuting
final_df$has_company_logo <- model_df$has_company_logo
final_df$has_questions <- model_df$has_questions
final_df$fraudulent <- model_df$fraudulent
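The effect of the 0.90 sparsity threshold can be illustrated on a small count matrix. This is a base R sketch on made-up data, assuming the threshold acts as a strict upper bound on a term's sparsity (the share of documents in which the term does not occur); `toy_dtm`, `sparsity` and `kept_terms` are hypothetical names.

```r
# Toy document-term counts: 20 documents (rows) x 3 terms (columns), made up
toy_dtm <- matrix(0, nrow = 20, ncol = 3,
                  dimnames = list(NULL, c("team", "experience", "whistle")))
toy_dtm[1:16, "team"]      <- 1   # occurs in 16 of 20 documents
toy_dtm[1:4, "experience"] <- 1   # occurs in 4 of 20 documents
toy_dtm[1, "whistle"]      <- 1   # occurs in 1 of 20 documents

# Sparsity of a term = share of documents in which it does NOT occur
sparsity <- colMeans(toy_dtm == 0)

# Keep only the terms whose sparsity is below the threshold
kept_terms <- names(sparsity)[sparsity < 0.90]
kept_terms
```

Here "whistle" has a sparsity of 0.95 and is dropped, while "team" (0.20) and "experience" (0.80) are kept. A lower threshold keeps fewer, more common terms and gives a narrower data frame.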

View the dimension of the dataframe

dim(final_df)

## [1] 17880   313

# 17880 rows, 313 columns

Visualize data

# Histogram
par(mfrow=c(2,2))
for(i in 310:313) {
    hist(final_df[,i], main=names(final_df)[i], border="blue", col="yellow")
}

# Boxplot
par(mfrow=c(2,2))
for(i in 310:313) {
    boxplot(final_df[,i], main=names(final_df)[i], border="blue", col="yellow")
}

Correlation

A correlation matrix is computed to examine the relationships among the numeric variables.

# Calculate the correlation between each pair of numeric variables
selected_df <- final_df[, 310:313]
corr_df <- round(cor(selected_df), 2)
corr_df

##                  telecommuting has_company_logo has_questions fraudulent
## telecommuting             1.00            -0.02          0.02       0.03
## has_company_logo         -0.02             1.00          0.23      -0.26
## has_questions             0.02             0.23          1.00      -0.09
## fraudulent                0.03            -0.26         -0.09       1.00

Visualize correlation heatmap

# Reshape the correlation matrix to long format for plotting
melted_corr_mat <- melt(corr_df)

# Plot the correlation heatmap
ggplot(data = melted_corr_mat, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  geom_text(aes( label = value), color = "black", size = 4)

None of the features are highly correlated with one another. However, has_company_logo (-0.26) and has_questions (-0.09) are negatively correlated with fraudulent, which suggests that a job posting with a company logo or with screening questions is less likely to be fraudulent.
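Because these variables are all coded 0/1, a negative Pearson correlation with fraudulent corresponds to a lower fraud rate in the group where the feature equals 1. The sketch below shows this link on made-up binary data; the vectors are hypothetical and much smaller than the real columns.

```r
# Made-up binary data for illustration
has_company_logo <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
fraudulent       <- c(0, 0, 0, 0, 0, 1, 1, 1, 0, 0)

# Pearson correlation between two 0/1 variables (the phi coefficient)
phi <- cor(has_company_logo, fraudulent)

# Fraud rate among postings without (0) and with (1) a logo
rate_by_logo <- tapply(fraudulent, has_company_logo, mean)

round(phi, 2)
rate_by_logo
```

In this toy example the correlation is negative, and the fraud rate is 0.5 without a logo against about 0.17 with one.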

Split data into 70% training, 30% testing

# Fix the seed so that the train/test split is reproducible
set.seed(123)
train_index <- sample(dim(final_df)[1], 0.7 * dim(final_df)[1])
model_dftrain <- final_df[train_index, ]
model_dftest <- final_df[-train_index, ]
paste("train sample size: ", dim(model_dftrain)[1])

## [1] "train sample size:  12516"

paste("test sample size: ", dim(model_dftest)[1])

## [1] "test sample size:  5364"
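The split above samples rows at random without regard to the label. Since fewer than 5% of the postings are fraudulent, one possible alternative is a stratified split that samples 70% within each class, so that both parts keep the same fraud rate. The sketch below is not the method used in this project; it is a base R illustration on made-up data, and `toy_df`, `toy_train` and `toy_test` are hypothetical names.

```r
set.seed(123)

# Made-up data frame standing in for final_df: 1000 rows, 5% fraudulent
toy_df <- data.frame(x = rnorm(1000),
                     fraudulent = rep(c(1, 0), times = c(50, 950)))

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy_df)), toy_df$fraudulent)

# Sample 70% of the row indices within each class separately
train_index <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), round(0.7 * length(idx)))]
}))

toy_train <- toy_df[train_index, ]
toy_test  <- toy_df[-train_index, ]

# The fraud rate is preserved in both parts
mean(toy_train$fraudulent)
mean(toy_test$fraudulent)
```

With a purely random split the fraud rate in the two parts can differ slightly by chance, which matters more when the minority class is small.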

View training set

sample_n(model_dftrain, 3)

##       also amp andor around attention best big business communication company
## 17169    0   0     0      0         0    1   0        0             0       0
## 3482     0   0     0      0         0    0   0        1             0       0
## 9043     1   0     0      0         0    0   0        0             1       1
##       content currently daily drive engineering existing experience full highly
## 17169       0         0     0     0           0        0          0    0      0
## 3482        0         0     0     0           1        1          2    0      0
## 9043        0         0     0     0           0        0          0    0      0
##       hours information like long management market marketing media need new
## 17169     0           1    0    0          1      0         0     0    0   0
## 3482      0           2    1    0          0      0         0     0    0   1
## 9043      0           0    0    1          1      0         0     0    0   2
##       offer office one online people plus small social staff startup support
## 17169     0      0   0      0      0    0     0      0     0       0       0
## 3482      0      0   1      0      0    0     0      0     0       0       0
## 9043      1      1   0      0      0    0     0      0     0       0       0
##       systems talented team technology top using various website work working
## 17169       0        0    1          1   0     0       1       0    0       0
## 3482        0        0    2          3   0     1       0       0    0       1
## 9043        0        0    0          0   0     0       2       0    1       0
##       able apply based can candidates client clients communicate companies
## 17169    0     0     0   0          0      0       0           0         0
## 3482     0     0     0   0          0      0       0           0         0
## 9043     0     0     0   0          0      0       0           0         0
##       computer cost creative customer delivery effectively email environment
## 17169        0    0        0        0        0           0     0           0
## 3482         2    0        0        0        0           0     0           0
## 9043         0    0        0        1        0           1     0           0
##       every excellent fast following fulltime get global great grow growing
## 17169     0         0    0         0        1   0      0     0    0       0
## 3482      0         0    0         0        1   0      0     0    0       0
## 9043      0         0    0         0        2   0      0     0    0       1
##       growth high include including international issues key know knowledge
## 17169      0    0       0         0             0      0   0    0         0
## 3482       0    2       0         0             0      0   0    0         0
## 9043       0    0       0         1             0      0   0    0         1
##       large learn level looking making manage manager managing network
## 17169     0     0     1       0      0      0       0        0       0
## 3482      0     0     0       0      0      0       0        0       0
## 9043      0     0     0       0      0      0       0        1       0
##       opportunity part passion person phone planning platform please position
## 17169           0    0       0      0     0        0        0      0        1
## 3482            2    1       0      0     0        0        0      0        0
## 9043            0    0       0      0     0        0        0      0        0
##       process product production project projects provides quality range right
## 17169       0       1          0       0        0        0       0     0     0
## 3482        0       0          0       0        0        0       2     0     0
## 9043        0       2          0       0        0        0       0     0     0
##       role service skills software success successful system teams understand
## 17169    0       0      0        1       0          0      0     0          0
## 3482     0       0      0        3       0          0      0     1          1
## 9043     0       2      1        0       0          1      0     0          0
##       web will world across activities candidate career contract engineer
## 17169   0    1     0      0          0         0      1        0        0
## 3482    2    1     0      0          0         0      0        0        0
## 9043    1    0     0      0          0         0      0        0        0
##       ensure experienced field focus health ideal meet must needs opportunities
## 17169      1           0     0     0      0     1    0    0     0             0
## 3482       0           0     0     0      0     0    0    0     0             0
## 9043       0           0     0     0      3     0    0    0     1             0
##       per provide requirements resources seeking services solutions strong
## 17169   0       0            0         0       0        0         2      0
## 3482    0       0            0         0       0        1         5      0
## 9043    0       0            1         0       0        1         0      0
##       unique vision way ability analysis available bachelors benefits build
## 17169      0      0   0       0        0         0         1        0     0
## 3482       0      0   0       0        0         0         1        0     1
## 9043       0      1   0       1        1         0         0        1     0
##       competitive culture customers degree develop development equivalent first
## 17169           0       0         0      1       0           2          0     0
## 3482            0       0         0      2       1           3          0     0
## 9043            1       0         0      1       0           0          0     0
##       goals good help industry lead life maintain make midsenior motivated
## 17169     0    0    0        0    0    0        0    0         1         0
## 3482      0    0    0        1    0    1        0    0         0         0
## 9043      0    0    0        0    0    1        0    0         0         0
##       order organization personal problem professional providing related
## 17169     0            0        0       0            0         0       0
## 3482      0            0        0       0            0         0       0
## 9043      0            0        0       0            0         0       1
##       responsible sales strategy travel understanding value verbal within
## 17169           0     0        0      1             0     0      0      0
## 3482            0     0        0      0             0     0      0      0
## 9043            0     1        0      0             0     0      0      1
##       written year years care current deliver directly innovative interested
## 17169       0    0     0    0       0       0        0          0          0
## 3482        0    0     1    0       0       2        0          1          0
## 9043        0    1     1    2       0       0        0          0          0
##       job leadership monthly offers open operations performance positions
## 17169   0          0       0      0    0          0           0         0
## 3482    0          0       0      0    0          0           0         0
## 9043    0          0       0      1    1          0           0         0
##       potential preferred processes reports results standards time training
## 17169         0         0         1       0       0         0    1        0
## 3482          0         0         0       0       0         0    0        0
## 9043          0         0         0       0       0         0    3        0
##       well areas come design driven employees excel financial join relevant
## 17169    1     0    0      1      0         0     0         0    0        0
## 3482     0     0    0      0      0         0     0         0    0        0
## 9043     0     0    0      0      0         1     0         0    0        1
##       school senior technical we’re without brand dynamic ideas leading many
## 17169      0      1         0     0       0     0       0     0       0    0
## 3482       0      0         0     1       0     0       1     0       0    0
## 9043       0      0         0     0       0     0       0     0       0    1
##       mobile take creating flexible free just love minimum mission multiple
## 17169      0    0        0        0    0    0    0       0       0        0
## 3482       0    0        0        0    0    0    0       0       0        0
## 9043       0    0        0        1    0    0    0       0       0        1
##       passionate play record required use want applications associate change
## 17169          0    0      0        0   0    0            0         0      0
## 3482           0    1      0        0   0    0            3         0      0
## 9043           1    0      0        0   0    0            0         1      0
##       tools background delivering duties entry improve months reporting tasks
## 17169     0          0          0      0     0       1      0         0     0
## 3482      0          0          0      0     0       0      0         0     0
## 9043      0          0          0      1     1       0      0         0     1
##       agency building data developer developing digital internal learning
## 17169      0        0    0         2          0       0        0        0
## 3482       0        0    0         1          2       0        0        0
## 9043       0        0    1         0          0       0        0        0
##       products technologies closely employee internet start track application
## 17169        0            0       0        0        0     0     0           0
## 3482         1            0       0        0        0     0     0           2
## 9043         2            0       0        0        0     0     0           0
##       create established may user hard insurance believe now plan problems
## 17169      0           0   0    0    1         0       0   0    0        0
## 3482       1           0   0    0    0         0       0   0    0        0
## 9043       0           2   0    0    0         2       0   1    1        0
##       complex day education individuals relationships jobs fun see english
## 17169       0   0         0           0             0    0   0   0       0
## 3482        0   0         0           0             0    0   0   0       0
## 9043        0   0         2           0             0    0   0   0       0
##       individual salary dental group package paid medical exciting members
## 17169          0      0      0     0       0    0       0        0       1
## 3482           0      0      0     0       0    0       0        0       0
## 9043           0      0      1     0       1    1      12        0       1
##       least telecommuting has_company_logo has_questions fraudulent
## 17169     0             0                1             1          0
## 3482      1             0                1             1          0
## 9043      0             0                1             0          0

Convert the dependent variable to a factor

model_dftrain$fraudulent = as.factor(model_dftrain$fraudulent)
model_dftest$fraudulent = as.factor(model_dftest$fraudulent)

Logistic Regression

# Train logistic regression
lr_model <- glm(formula = fraudulent ~ ., family = "binomial", data = model_dftrain)

Predict the testing set

# Predicted probabilities of the positive class (fraudulent = 1)
lr_pred_test <- predict(lr_model, newdata = model_dftest, type = "response")

# Convert the probabilities to class labels with a 0.5 cut-off
test <- model_dftest
test$pred_glm <- factor(ifelse(lr_pred_test > 0.5, "1", "0"), levels = c("0", "1"))

Calculate AUC of the model

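library(ROCR)  # provides prediction() and performance()

# AUC helper: predcol holds the model scores, outcol the true labels.
# If predcol contains hard class labels instead of probabilities,
# the result reduces to balanced accuracy.
calcAUC <- function(predcol, outcol) {
  perf <- performance(prediction(as.numeric(predcol), outcol == 1), "auc")
  as.numeric(perf@y.values)
}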

paste("AUC of Logistic Regression is", round(calcAUC(lr_pred_test, model_dftest$fraudulent), digits=4))

## [1] "AUC of Logistic Regression is 0.953"
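
As a cross-check on what calcAUC measures, AUC can also be computed directly from ranks (the Mann-Whitney statistic). The sketch below uses base R only; auc_rank and the toy vectors are hypothetical and are not part of the original analysis. It illustrates that feeding hard 0/1 labels instead of probabilities collapses AUC to the mean of sensitivity and specificity, i.e. balanced accuracy.

```r
# AUC as the Mann-Whitney statistic: the probability that a randomly chosen
# positive case gets a higher score than a randomly chosen negative case.
auc_rank <- function(score, truth) {
  r <- rank(score)                       # average ranks handle ties
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (made up for illustration)
truth <- c(0, 0, 0, 0, 1, 1, 0, 1)
prob  <- c(0.05, 0.10, 0.20, 0.40, 0.45, 0.70, 0.60, 0.90)

auc_prob  <- auc_rank(prob, truth)       # uses the full ranking of scores
label     <- as.numeric(prob > 0.5)      # hard 0/1 predictions
auc_label <- auc_rank(label, truth)      # only a single threshold is used

# With hard labels, AUC equals balanced accuracy
sens <- sum(label == 1 & truth == 1) / sum(truth == 1)
spec <- sum(label == 0 & truth == 0) / sum(truth == 0)
balanced_accuracy <- (sens + spec) / 2

c(auc_prob = auc_prob, auc_label = auc_label, balanced_accuracy = balanced_accuracy)
```

This matters when comparing models: an AUC computed from probabilities and one computed from class labels are not on the same footing.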

Random Forest

# Train random forest
trcontrol <- trainControl(method = "repeatedcv", number = 2, repeats = 1, search = "random", verboseIter = TRUE)
grid <- data.frame(mtry = c(100))
rf_model <- train(fraudulent ~ ., method = "rf", data = model_dftrain, ntree = 200, trControl = trcontrol, tuneGrid = grid)

## + Fold1.Rep1: mtry=100 
## - Fold1.Rep1: mtry=100 
## + Fold2.Rep1: mtry=100 
## - Fold2.Rep1: mtry=100 
## Aggregating results
## Fitting final model on full training set

rf_model

## Random Forest 
## 
## 12516 samples
##   312 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold, repeated 1 times) 
## Summary of sample sizes: 6258, 6258 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9691595  0.5253441
## 
## Tuning parameter 'mtry' was held constant at a value of 100

Predict the testing set

rf_pred_test <- predict(rf_model, newdata = model_dftest)

Calculate AUC of the model

paste("AUC of Random Forest is", round(calcAUC(rf_pred_test, model_dftest$fraudulent), digits=4))

## [1] "AUC of Random Forest is 0.8028"

K-Nearest Neighbor (KNN)

# Train knn: kknn() takes the training and test sets together and
# computes the test-set predictions in the same call
knn <- kknn(fraudulent ~ ., model_dftrain, model_dftest, k = 25)

Predict the testing set

knn_pred_test <- predict(knn, newdata = model_dftest)

Calculate AUC of the model

paste("AUC of KNN is", round(calcAUC(knn_pred_test, model_dftest$fraudulent), digits=4))

## [1] "AUC of KNN is 0.767"

XGBoost

x_train = subset(model_dftrain, select = -c(fraudulent))
y_train = subset(model_dftrain, select = c(fraudulent))
x_test = subset(model_dftest, select = -c(fraudulent))
y_test= subset(model_dftest, select = c(fraudulent))
x_train = as.matrix(x_train)
y_train = as.matrix(y_train)
x_test = as.matrix(x_test)
y_test = as.matrix(y_test)
xgboost_train = xgb.DMatrix(data=x_train, label=y_train)
xgboost_test = xgb.DMatrix(data=x_test, label=y_test)

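# No objective is specified, so xgboost uses its default squared-error
# regression objective (hence the train-rmse log below); the continuous
# predictions are thresholded at 0.5 further down.
model <- xgboost(data = xgboost_train,
                 max_depth = 3,
                 eta = 0.1,
                 nrounds = 100,
                 booster = "gbtree")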

## [1]  train-rmse:0.457630 
## [2]  train-rmse:0.420081 
## [3]  train-rmse:0.386988 
## [4]  train-rmse:0.357967 
## [5]  train-rmse:0.332568 
## [6]  train-rmse:0.310259 
## [7]  train-rmse:0.290797 
## [8]  train-rmse:0.273965 
## [9]  train-rmse:0.259404 
## [10] train-rmse:0.246891 
## [11] train-rmse:0.236248 
## [12] train-rmse:0.227150 
## [13] train-rmse:0.219307 
## [14] train-rmse:0.212578 
## [15] train-rmse:0.207041 
## [16] train-rmse:0.202255 
## [17] train-rmse:0.198396 
## [18] train-rmse:0.194924 
## [19] train-rmse:0.192325 
## [20] train-rmse:0.189878 
## [21] train-rmse:0.187593 
## [22] train-rmse:0.186022 
## [23] train-rmse:0.184275 
## [24] train-rmse:0.182951 
## [25] train-rmse:0.181881 
## [26] train-rmse:0.181043 
## [27] train-rmse:0.180074 
## [28] train-rmse:0.179555 
## [29] train-rmse:0.178357 
## [30] train-rmse:0.177737 
## [31] train-rmse:0.176941 
## [32] train-rmse:0.176630 
## [33] train-rmse:0.176375 
## [34] train-rmse:0.175657 
## [35] train-rmse:0.175207 
## [36] train-rmse:0.174575 
## [37] train-rmse:0.174157 
## [38] train-rmse:0.173986 
## [39] train-rmse:0.173817 
## [40] train-rmse:0.173649 
## [41] train-rmse:0.172673 
## [42] train-rmse:0.172193 
## [43] train-rmse:0.172038 
## [44] train-rmse:0.171735 
## [45] train-rmse:0.171296 
## [46] train-rmse:0.171182 
## [47] train-rmse:0.170742 
## [48] train-rmse:0.170479 
## [49] train-rmse:0.170209 
## [50] train-rmse:0.169823 
## [51] train-rmse:0.169673 
## [52] train-rmse:0.169418 
## [53] train-rmse:0.169115 
## [54] train-rmse:0.168875 
## [55] train-rmse:0.168692 
## [56] train-rmse:0.168299 
## [57] train-rmse:0.167796 
## [58] train-rmse:0.167589 
## [59] train-rmse:0.167490 
## [60] train-rmse:0.167180 
## [61] train-rmse:0.167008 
## [62] train-rmse:0.166682 
## [63] train-rmse:0.166507 
## [64] train-rmse:0.166344 
## [65] train-rmse:0.165948 
## [66] train-rmse:0.165773 
## [67] train-rmse:0.165665 
## [68] train-rmse:0.165345 
## [69] train-rmse:0.164959 
## [70] train-rmse:0.164591 
## [71] train-rmse:0.164412 
## [72] train-rmse:0.164269 
## [73] train-rmse:0.164155 
## [74] train-rmse:0.163932 
## [75] train-rmse:0.163832 
## [76] train-rmse:0.163560 
## [77] train-rmse:0.163200 
## [78] train-rmse:0.162873 
## [79] train-rmse:0.162655 
## [80] train-rmse:0.162445 
## [81] train-rmse:0.162223 
## [82] train-rmse:0.162022 
## [83] train-rmse:0.161935 
## [84] train-rmse:0.161770 
## [85] train-rmse:0.161594 
## [86] train-rmse:0.161420 
## [87] train-rmse:0.161321 
## [88] train-rmse:0.160963 
## [89] train-rmse:0.160885 
## [90] train-rmse:0.160762 
## [91] train-rmse:0.160711 
## [92] train-rmse:0.160495 
## [93] train-rmse:0.160235 
## [94] train-rmse:0.160145 
## [95] train-rmse:0.160026 
## [96] train-rmse:0.159598 
## [97] train-rmse:0.159510 
## [98] train-rmse:0.159454 
## [99] train-rmse:0.159272 
## [100]    train-rmse:0.159154

Predict the testing set

summary(model)

##                Length Class              Mode       
## handle              1 xgb.Booster.handle externalptr
## raw            115056 -none-             raw        
## niter               1 -none-             numeric    
## evaluation_log      2 data.table         list       
## call               16 -none-             call       
## params              4 -none-             list       
## callbacks           2 -none-             list       
## feature_names     312 -none-             character  
## nfeatures           1 -none-             numeric

pred_test = predict(model, x_test)

prediction = as.numeric(pred_test > 0.5)
y_test = as.numeric(y_test)
prediction = as.factor(prediction)
y_test = as.factor(y_test)

Calculate AUC of the model

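# Note: calcAUC expects (predictions, truth). The arguments are reversed here,
# so the value below is measured against the predicted labels, not the true ones.
paste("AUC of XGBoost is", round(calcAUC(y_test, prediction), digits=4))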

## [1] "AUC of XGBoost is 0.9545"

Support Vector Machine (SVM)

fraudulentSVM = svm(formula = fraudulent ~ ., data = model_dftrain, type='C-classification', kernel='linear')

Predict the testing set

fraudulentSVMPrediction = predict(fraudulentSVM, newdata = model_dftest)

Calculate AUC of the model

paste("AUC of SVM is", round(calcAUC(fraudulentSVMPrediction, model_dftest$fraudulent), digits=4))

## [1] "AUC of SVM is 0.759"

Evaluation

Accuracy and area under the curve (AUC) are used to evaluate the effectiveness of models in terms of classifying real and fake job postings. However, the dataset used for training is highly imbalanced. Thus, it is necessary to use F1, precision and recall scores to evaluate the model’s ability to identify both real and fake job postings.

Accuracy score: Share of all job postings, real and fake, that the model classifies correctly; it gives a general idea of the model performance.
AUC score: Measures how well the model can distinguish real from fake job postings.
Precision score: Percentage of postings predicted as fake that are actually fake.
Recall score: Percentage of actual fake postings that the model correctly identifies.
F1 score: Harmonic mean of precision and recall.
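
As a worked example of these definitions, the sketch below recomputes the four count-based metrics from the cells of the logistic regression confusion matrix reported later in this section, treating fraudulent (1) as the positive class. The variable names are illustrative only.

```r
# Counts taken from the logistic regression confusion matrix in this report
tp <- 158    # fraudulent postings predicted as fraudulent
fp <- 67     # real postings predicted as fraudulent
fn <- 114    # fraudulent postings predicted as real
tn <- 5025   # real postings predicted as real

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 4)
```

The rounded values (0.9663, 0.7022, 0.5809, 0.6358) agree with the Accuracy, Precision, Recall and F1 lines printed by confusionMatrix for that model. Accuracy is high mainly because of the 5025 correctly classified real postings, which is why it says little on its own for imbalanced data.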

Confusion Matrix and Error Metrics of Logistic Regression

confMatrix_lr = confusionMatrix(test$pred_glm, test$fraudulent, mode = "everything", positive = "1")
print(confMatrix_lr)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5025  114
##          1   67  158
##                                           
##                Accuracy : 0.9663          
##                  95% CI : (0.9611, 0.9709)
##     No Information Rate : 0.9493          
##     P-Value [Acc > NIR] : 1.193e-09       
##                                           
##                   Kappa : 0.6183          
##                                           
##  Mcnemar's Test P-Value : 0.0006282       
##                                           
##             Sensitivity : 0.58088         
##             Specificity : 0.98684         
##          Pos Pred Value : 0.70222         
##          Neg Pred Value : 0.97782         
##               Precision : 0.70222         
##                  Recall : 0.58088         
##                      F1 : 0.63581         
##              Prevalence : 0.05071         
##          Detection Rate : 0.02946         
##    Detection Prevalence : 0.04195         
##       Balanced Accuracy : 0.78386         
##                                           
##        'Positive' Class : 1               
##

Confusion Matrix and Error Metrics of Random Forest

confMatrix_rf = confusionMatrix(rf_pred_test, model_dftest$fraudulent, mode = "everything", positive = "1")
print(confMatrix_rf)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5087  107
##          1    5  165
##                                           
##                Accuracy : 0.9791          
##                  95% CI : (0.9749, 0.9828)
##     No Information Rate : 0.9493          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7363          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.60662         
##             Specificity : 0.99902         
##          Pos Pred Value : 0.97059         
##          Neg Pred Value : 0.97940         
##               Precision : 0.97059         
##                  Recall : 0.60662         
##                      F1 : 0.74661         
##              Prevalence : 0.05071         
##          Detection Rate : 0.03076         
##    Detection Prevalence : 0.03169         
##       Balanced Accuracy : 0.80282         
##                                           
##        'Positive' Class : 1               
##

Confusion Matrix and Error Metrics of KNN

confMatrix_knn = confusionMatrix(knn_pred_test, model_dftest$fraudulent, mode = "everything", positive = "1")
print(confMatrix_knn)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5078  126
##          1   14  146
##                                          
##                Accuracy : 0.9739         
##                  95% CI : (0.9693, 0.978)
##     No Information Rate : 0.9493         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6633         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.53676        
##             Specificity : 0.99725        
##          Pos Pred Value : 0.91250        
##          Neg Pred Value : 0.97579        
##               Precision : 0.91250        
##                  Recall : 0.53676        
##                      F1 : 0.67593        
##              Prevalence : 0.05071        
##          Detection Rate : 0.02722        
##    Detection Prevalence : 0.02983        
##       Balanced Accuracy : 0.76701        
##                                          
##        'Positive' Class : 1              
##

Confusion Matrix and Error Metrics of XGBoost

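# Note: confusionMatrix() expects (data = predictions, reference = truth).
# y_test is passed first here, so the rows below are the true labels and
# Precision and Recall are interchanged in the printed output.
conf_mat = confusionMatrix(y_test, prediction, mode = "everything", positive = "1")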
print(conf_mat)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5087    5
##          1  187   85
##                                          
##                Accuracy : 0.9642         
##                  95% CI : (0.9589, 0.969)
##     No Information Rate : 0.9832         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.4559         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.94444        
##             Specificity : 0.96454        
##          Pos Pred Value : 0.31250        
##          Neg Pred Value : 0.99902        
##               Precision : 0.31250        
##                  Recall : 0.94444        
##                      F1 : 0.46961        
##              Prevalence : 0.01678        
##          Detection Rate : 0.01585        
##    Detection Prevalence : 0.05071        
##       Balanced Accuracy : 0.95449        
##                                          
##        'Positive' Class : 1              
##
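
In the XGBoost matrix above, the row labelled 1 sums to 187 + 85 = 272, which is the number of fraudulent postings in the test set according to the other four matrices. This suggests that the rows hold the true labels rather than the predictions. The base R sketch below, built under that assumption, rebuilds label vectors from the four cells and shows how the argument order changes precision and recall; precision_recall is a hypothetical helper, not a caret function.

```r
# Rebuild label vectors from the four cells of the XGBoost matrix above,
# assuming its rows are the true labels and its columns the predictions.
truth <- c(rep(0, 5087), rep(0, 5), rep(1, 187), rep(1, 85))
pred  <- c(rep(0, 5087), rep(1, 5), rep(0, 187), rep(1, 85))

# Hypothetical helper: precision and recall for the positive class 1
precision_recall <- function(predicted, actual) {
  tp <- sum(predicted == 1 & actual == 1)
  c(precision = tp / sum(predicted == 1),
    recall    = tp / sum(actual == 1))
}

as_intended <- precision_recall(predicted = pred,  actual = truth)
swapped     <- precision_recall(predicted = truth, actual = pred)

round(rbind(as_intended, swapped), 4)
```

The swapped row reproduces the printed values (precision 0.3125, recall 0.9444), while the intended order gives precision 0.9444 and recall 0.3125. F1 is symmetric in the two and stays the same.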

Confusion Matrix and Error Metrics of SVM

confMatrix_svm = confusionMatrix(fraudulentSVMPrediction, model_dftest$fraudulent, mode = "everything", positive = "1")
print(confMatrix_svm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5034  128
##          1   58  144
##                                           
##                Accuracy : 0.9653          
##                  95% CI : (0.9601, 0.9701)
##     No Information Rate : 0.9493          
##     P-Value [Acc > NIR] : 9.687e-09       
##                                           
##                   Kappa : 0.5899          
##                                           
##  Mcnemar's Test P-Value : 4.207e-07       
##                                           
##             Sensitivity : 0.52941         
##             Specificity : 0.98861         
##          Pos Pred Value : 0.71287         
##          Neg Pred Value : 0.97520         
##               Precision : 0.71287         
##                  Recall : 0.52941         
##                      F1 : 0.60759         
##              Prevalence : 0.05071         
##          Detection Rate : 0.02685         
##    Detection Prevalence : 0.03766         
##       Balanced Accuracy : 0.75901         
##                                           
##        'Positive' Class : 1               
##

Summary of Results

Metric	Logistic Regression	Random Forest	KNN	XGBoost	SVM
Accuracy	0.97	0.98	0.97	0.96	0.97
Precision	0.70	0.97	0.91	0.31	0.71
Recall	0.58	0.61	0.54	0.94	0.53
F1	0.64	0.75	0.68	0.47	0.61
AUC	0.95	0.80	0.77	0.95	0.76

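Random Forest achieved the best accuracy, precision and F1 scores. Logistic Regression and XGBoost show the highest AUC values, while their reported precision scores are comparatively lower than those of the other models. Two caveats apply when reading the table. First, only the Logistic Regression AUC was computed from predicted probabilities; the AUC values of the other models were computed from hard class labels and therefore coincide with the Balanced Accuracy in their confusion matrices (for example, 0.8028 for Random Forest). Second, the XGBoost confusion matrix was built with the true labels in the prediction position, so its precision and recall are interchanged: from the same counts, precision is 85/90 ≈ 0.94 and recall is 85/272 ≈ 0.31, with F1 unchanged at 0.47. Neither caveat changes the ranking on accuracy, precision and F1, so we conclude that Random Forest is the best model for classifying real and fake job postings.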

Results Analysis Summary

What are the key features/characteristics of fraudulent job postings?

Based on the correlation analysis, none of the features is highly correlated with our target feature (fraudulent), so it is difficult to identify the key features or characteristics of fraudulent job postings. However, the has_company_logo and has_questions features are negatively correlated with fraudulent. This indicates that a job posting that has a company logo or includes questions is less likely to be fraudulent.
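
The kind of check described above can be sketched in base R as follows. The data frame toy is made up for illustration and only borrows column names from the dataset; the resulting coefficients say nothing about the real data.

```r
# Toy data (made up): 1 = present / fraudulent, 0 = absent / real
toy <- data.frame(
  has_company_logo = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 1),
  has_questions    = c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0),
  telecommuting    = c(0, 0, 0, 1, 0, 0, 1, 0, 0, 0),
  fraudulent       = c(0, 0, 0, 0, 0, 0, 1, 1, 0, 1)
)

# Pearson correlation of each feature with the target
# (for two 0/1 variables this is the phi coefficient)
features <- setdiff(names(toy), "fraudulent")
cors <- sapply(toy[features], function(x) cor(x, toy$fraudulent))

sort(cors)
```

A negative coefficient means the feature is present less often in fraudulent rows than in real ones; the sign gives the direction of the association, while the magnitude indicates how strong it is.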

Which classification model is the best to determine whether the job is real or not?

Random Forest is the best classification model for determining whether a job is real or not. This conclusion is based on the Random Forest model showing the best accuracy, precision and F1 scores among the models compared.

Other findings

74% of fake jobs require few educational credentials - “Some High School Coursework”. This may indicate that fake job postings target job seekers with few educational credentials, such as high school students.
Executive or entry-level jobs that require minimal qualifications and little experience have the highest fraud rate, nearly 7%. This implies that job seekers who lack experience, such as fresh graduates, are the most likely targets of these fake job postings.
Many fraudulent job postings share common keywords in their job titles - “Data Entry”, “Administrative”, “Home Based”, “Earn Daily”. These are words that can attract the attention of job seekers.

Limitation and Improvement

The dataset is highly imbalanced: most of the job postings are legitimate and only a few are fraudulent. As a result, real jobs are identified much better than fraudulent ones. Techniques for handling imbalanced data, such as SMOTE, can be applied to make a fairer comparison between real and fraudulent jobs. In addition, other NLP representations, such as a TF-IDF vectorizer, could be tried to find the best numerical representation of the text strings for the ML models.
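
To make the TF-IDF suggestion concrete, here is a minimal base R sketch on a made-up three-document corpus. It uses the plain definition, term frequency times log(N / document frequency); text-mining packages often apply smoothed or normalised variants, so their numbers may differ. The corpus and variable names are hypothetical.

```r
# Toy corpus (made up); each element stands in for one cleaned job posting
docs <- c(
  "earn money daily from home",
  "data entry work from home",
  "senior software engineer"
)

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency matrix: rows = documents, columns = terms
tf <- t(sapply(tokens, function(tok) {
  table(factor(tok, levels = vocab)) / length(tok)
}))

# Inverse document frequency: log(N / number of documents containing the term)
doc_freq <- colSums(tf > 0)
idf      <- log(length(docs) / doc_freq)

# TF-IDF: weight each term-frequency column by its IDF
tfidf <- sweep(tf, 2, idf, `*`)

round(tfidf, 3)
```

Terms shared by several postings (here "from" and "home") receive a lower weight than terms that appear in only one posting (such as "earn"), which is the property that raw term counts lack.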

Conclusion

In most instances, if something appears too good to be true, it probably is. Most fraudulent job descriptions and requirements are vague and too good to be true, such as easy work for unrealistic pay. Be wary of part-time, entry-level jobs that require minimal qualifications and little experience, such as data entry and administrative roles. Home-based jobs and job listings without a company logo can be warning signs. In terms of classification models, Random Forest gives the best accuracy, precision and F1 scores; however, better results could be achieved with a more balanced dataset containing sufficient examples of both real and fake job postings. Finally, with a little research, we can not only find out whether a company and a job are legitimate, but also discover whether the company is the right fit.

Fake Job Posting Analysis

WQD7004 Group Project
