Group 4 members:
- JingYu Shen (S2113037)
- JiPing Zhang (S2042894)
- Lee Mun Mun (S2112842)
- Nayli Hatim (S2149344)
- Jenifer Mayang Jues (S2016572)
Introduction
In the past, job seekers relied on newspapers to find openings. Today they use employment websites such as JobStreet, LinkedIn, and Indeed, a shift driven by advances in technology and online communication. As the number of job scams keeps rising, the authenticity of job postings has become critical. According to Habiba et al. (2021), a job scam is a fake job advertisement that steals job seekers' personal and professional information instead of offering them a real job. Job scams often involve fake online job ads on social platforms and untrusted job portals that offer high-paying jobs. Victims may also receive unsolicited messages on platforms such as WhatsApp, Facebook, and WeChat offering jobs that do not exist. For example, scammers ask victims to disclose personal or banking details, or to pay upfront fees to secure an interview or to obtain more information about the fraudulent job. Given the growing concern about job scams, our aim is to raise job seekers' awareness during the job application process and to give them an early warning using Machine Learning (ML) and Natural Language Processing (NLP) approaches.
Objectives
- To identify the key features of fraudulent job postings.
- To build a model to classify real or fake job postings.
Initial Questions
- What are the key features/characteristics of fraudulent job postings?
- Which classification model is the best to determine whether the job is real or not?
Data Cleaning and Pre-processing
The dataset used in this project is the Employment Scam Aegean Dataset (EMSCAD), retrieved from Kaggle. It contains 17,880 observations, of which 866 are fake, and 18 features. The features are a mix of numeric and text variables. A brief description of the variables is given below:
| Variable | Description |
|---|---|
| job_id | ID of each job posting |
| title | Description of position or job |
| location | Where the job is located |
| department | Department of the job offered |
| salary_range | Expected salary range |
| company_profile | Company information |
| description | Description about the position offered |
| requirements | Pre-requisites to qualify for the job |
| benefits | Benefits provided by the job |
| telecommuting | Is work from home or remote work allowed |
| has_company_logo | Does the post have a company logo |
| has_questions | Does the post have any questions |
| employment_type | Full-time, part-time, contract, temporary and others |
| required_experience | Experience level, e.g. Entry level, Executive, Director… |
| required_education | Education level, e.g. High School, Bachelor, Master… |
| industry | Relevant industry |
| function | Job’s functionality |
| fraudulent | Target variable (0: Real, 1: Fake) |
Import libraries
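The code for this step is not shown in the report. Judging only from the functions called later (`sample_n`, `str_split_fixed`, `skim_without_charts`, `gg_miss_var`, `vis_miss`, `vis_dat`, `plot_missing`, `brewer.pal`, `labs`), the packages below appear to be needed. This list is an inference, not the authors' original chunk, and the original may have loaded other packages as well.

```r
# Packages inferred from the function calls used later in this report (assumption).
pkgs <- c("dplyr", "stringr", "skimr", "naniar", "visdat",
          "DataExplorer", "RColorBrewer", "ggplot2")

# Check which of them are installed before attaching anything.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# Attach only the packages that are available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```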
Load data
df <- read.csv("https://raw.githubusercontent.com/abbylmm/fake_job_posting/main/data/fake_job_postings.csv")
Display a sample of n rows
df_fake_job <- df
sample_n(df_fake_job, 3)
## job_id title location department
## 1 4794 NARRATIVE: Influencer Marketing Manager US, NY, New York
## 2 10541 English Teacher Abroad US, NH, Hanover
## 3 16242 Quality Manager US, MO, St. Louis
## salary_range
## 1
## 2
## 3
## company_profile
## 1 We are not your average Monday mail recruiters. We are here to align stars and connect dots, not just match titles with positions & salary demands with salary offerings. Our approach is simple; we read between the lines to see YOU. Both of you. Employer and employee. You & Them is the most personal, innovative and open-minded professional recruiting can be. Or should be. Our network is a community of people with the same mentality; that work is a part of our lives and not the other way around. A creative community of great minds who seek minds that think alike.You & Them is Us. Real people. Nice to meet you.
## 2 We help teachers get safe & secure jobs abroad :)
## 3 We Provide Full Time Permanent Positions for many medium to large US companies. We are interested in finding/recruiting high quality candidates in IT, Engineering, Manufacturing and other highly technical and non-technical jobs.
## description
## 1 Narrative is looking for a Senior Influencer Marketing Manager to join our team in New York. With this role, you will report directly to the CEO. We are looking for someone that is:● well connected● proactive, detail oriented and professional● a master negotiator● comfortable working closely and collaboratively across the entire agency● willing to go above and beyond the call of duty● experienced in pitch workAs the Senior Influencer Marketing Manager, you obsess over people across all walks of life. You know thein’s and out’s of the industry, and you have a knack for identifying and connecting with talent. You knowtheir story, background, favorite color and a lot more that make you a little creepy if this wasn’t your field ofwork. You’re smart, articulate and base your decisions on data and strategic thinking. You are an influencerin your own right, and possess the ability to persuade at will.You will be immediately injected into our yearlong music activation – ADD52 to help drive engagement anddefine/execute marketing initiatives with and through influencers. ADD52 is a talent discovery platformreinventing how emerging artists and fans find, share and listen to music. Created by Russell Simmons andSteve Rifkind in partnership with Samsung, ADD52 gives unsigned artists the opportunity to get discoveredand signed by All Def Music.
## 2 Play with kids, get paid for it Love travel? Jobs in Asia$1,500+ USD monthly ($200 Cost of living)Housing provided (Private/Furnished)Airfare ReimbursedExcellent for student loans/credit cardsGabriel Adkins : #URL_ed9094c60184b8a4975333957f05be37e69d3cdb68decc9dd9a4242733cfd7f7##URL_75db76d58f7994c7db24e8998c2fc953ab9a20ea9ac948b217693963f78d2e6b#12 month contract : Apply today
## 3 (We have more than 1500+ Job openings in our website and some of them are relevant to this job. Feel free to search it in the website and apply directly. Just Click the “Apply Now” and you will redirect to our main website where you can search for the other jobs.)Implementation and maintenance of quality management system throughout the organization.5. Conducting management review meeting and providing recommendations for improvement.6. To provide customer complaint addressal, resolution and application support.7. Implementation of various standards such as QS 9000, ISO/TS 16949, ISO 9000, Kaizen projects, Six sigma projects, TPM etc.8. To act as management representative for the plant / company.We have many more Global Healthcare Professionals jobs are available in our website. Please go through our website and search the relevant job and apply directly.Visit : #URL_ec64af2b4fe2ca316e828f93b0cd098c22f8beba98dcac09d4dd7384b221a5e8#-#URL_9753a54b28303bf636a2816399b9c255d76fabb791336a4c748da2611a23264f#
## requirements
## 1 GENERAL RESPONSIBILITIES● Conceptualize, create, manage and execute all influencer marketing initiatives through completion ensuring all aspects align with client goals.● Work closely with agencies, publicity departments, management, production houses and media to execute celebrity/influencer seeding strategy.● Leverage influencer marketing programs on social media platforms. Track social media engagement, create content, and increase social interaction and relevance.THE TALENT AND YOU● Manage national and regional events, including sponsorship activations.● Create and analyze reports + establish KPIs that measure the impact of the influencer marketing program to better serve future strategies.● Identify new passion groups and individuals we should engage with.● Work with clients, partner agencies and 3rd party vendors on joint activities.● Work with product team to give feedback regarding features and optimizations that will drive user/influencer growth.● Research and identify prospective influencers (including/not limited to: blogs, Twitter, Instagram, Vine, YouTube, etc.).● Help coordinate influencer communications and plan activities.● Develop and execute influencer marketing programs that drive sales and generate positive brand exposure.● Establishing contact, seed and manage influencer relationships on an ongoing basis.● Lead the influencer communication strategy and delivery of content through various mediums.● Manage talent negotiations and working closely with legal to draft talent agreements/contracts.● Activating talent against client’s goals and objective. 
Working closely with talent to develop content and platforms.● Maintaining talent schedule to ensure that we are aligned with timelines/ deliveries.REQUIRED SKILLS● Must have 5+ years of PR/digital marketing and/or influencer marketing experience.● Expertise in building communities and key relationships.● Knowledge and expertise in using various social media platforms.● Advanced proficiency with PR applications, Keynote, MS office Suite, Google Docs.● Keen sense of awareness of influencer/celebrity culture.● Excellent project management, organization, communication, writing and relationship building skills.● Well versed in marketing strategies across multiple categories.● Proven successes in both traditional and interactive PR channels.● Collaborative with a solutions oriented attitude and a willingness to pitch in when necessary.● Ability to work on multiple projects simultaneously with tight deadlines.● Has a clear understanding of industry standards and practices.● Strong analytic skills and ability to think strategically and applying them.● Continually working to understand the clients, their industry & how we can make a difference in their business.
## 2 University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not necessaryCanada/US passport holders only
## 3
## benefits telecommuting has_company_logo has_questions
## 1 0 1 0
## 2 See job description 0 1 1
## 3 0 0 0
## employment_type required_experience required_education industry
## 1
## 2 Contract Bachelor's Degree Education Management
## 3 Full-time
## function. fraudulent
## 1 0
## 2 0
## 3 0
Summary data
summary(df_fake_job)
## job_id title location department
## Min. : 1 Length:17880 Length:17880 Length:17880
## 1st Qu.: 4471 Class :character Class :character Class :character
## Median : 8940 Mode :character Mode :character Mode :character
## Mean : 8940
## 3rd Qu.:13410
## Max. :17880
## salary_range company_profile description requirements
## Length:17880 Length:17880 Length:17880 Length:17880
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## benefits telecommuting has_company_logo has_questions
## Length:17880 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :1.0000 Median :0.0000
## Mean :0.0429 Mean :0.7953 Mean :0.4917
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## employment_type required_experience required_education industry
## Length:17880 Length:17880 Length:17880 Length:17880
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## function. fraudulent
## Length:17880 Min. :0.00000
## Class :character 1st Qu.:0.00000
## Mode :character Median :0.00000
## Mean :0.04843
## 3rd Qu.:0.00000
## Max. :1.00000
Check all missing values, including ‘empty’ strings
skim_without_charts(df_fake_job)
| Name | df_fake_job |
|---|---|
| Number of rows | 17880 |
| Number of columns | 18 |
| Column type frequency: | |
| character | 13 |
| numeric | 5 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 3 | 142 | 0 | 11231 | 0 |
| location | 0 | 1 | 0 | 161 | 346 | 3106 | 0 |
| department | 0 | 1 | 0 | 255 | 11547 | 1338 | 6 |
| salary_range | 0 | 1 | 0 | 20 | 15012 | 875 | 0 |
| company_profile | 0 | 1 | 0 | 6230 | 3308 | 1710 | 0 |
| description | 0 | 1 | 3 | 22722 | 0 | 14802 | 0 |
| requirements | 0 | 1 | 0 | 10921 | 2694 | 11970 | 0 |
| benefits | 2 | 1 | 0 | 4489 | 7206 | 6207 | 0 |
| employment_type | 0 | 1 | 0 | 9 | 3471 | 6 | 0 |
| required_experience | 0 | 1 | 0 | 16 | 7050 | 8 | 0 |
| required_education | 0 | 1 | 0 | 33 | 8105 | 14 | 0 |
| industry | 0 | 1 | 0 | 36 | 4903 | 132 | 0 |
| function. | 0 | 1 | 0 | 22 | 6455 | 38 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| job_id | 0 | 1 | 8940.50 | 5161.66 | 1 | 4470.75 | 8940.5 | 13410.25 | 17880 |
| telecommuting | 0 | 1 | 0.04 | 0.20 | 0 | 0.00 | 0.0 | 0.00 | 1 |
| has_company_logo | 0 | 1 | 0.80 | 0.40 | 0 | 1.00 | 1.0 | 1.00 | 1 |
| has_questions | 0 | 1 | 0.49 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 1 |
| fraudulent | 0 | 1 | 0.05 | 0.21 | 0 | 0.00 | 0.0 | 0.00 | 1 |
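In the character table above, `n_missing` is 0 for almost every text column while the `empty` count is large. This is what one would expect if blank fields were read in as empty strings rather than `NA`. The sketch below uses a small made-up CSV (not the real data) to show the difference, and one possible way to treat blanks as missing at read time through the `na.strings` argument of `read.csv`.

```r
# A tiny, hypothetical CSV with blank text fields (not the real dataset).
csv_text <- "job_id,department,benefits\n1,Marketing,\n2,,See job description\n"

# Default read: blank character fields become "", which is.na() does not detect.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
na_before <- colSums(is.na(raw))
empty_before <- colSums(raw == "")

# Alternative: declare "" as a missing-value marker while reading.
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = c("", "NA"))
na_after <- colSums(is.na(clean))

print(na_before)
print(empty_before)
print(na_after)
```

Whether blanks should become `NA` at read time or later is a design choice; the report converts them later, column by column.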
Split location to country, state, city and fill empty with NA
df_fake_job[c("country", "state", "city")] <- str_split_fixed(df_fake_job$location, ", ", 3)
df_fake_job[c("country", "state", "city")][df_fake_job[c("country", "state", "city")] == ""] <- NA
Split salary_range to min_salary, max_salary and fill empty with NA
df_fake_job[c("min_salary", "max_salary")] <- str_split_fixed(df_fake_job$salary_range, "-", 2)
df_fake_job[c("min_salary", "max_salary")][df_fake_job[c("min_salary", "max_salary")] == ""] <- NA
Drop location and salary_range
df_fake_job <- select(df_fake_job, -c(location, salary_range))
View the structure of data
glimpse(df_fake_job)
## Rows: 17,880
## Columns: 21
## $ job_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,~
## $ title <chr> "Marketing Intern", "Customer Service - Cloud Vide~
## $ department <chr> "Marketing", "Success", "", "Sales", "", "", "ANDR~
## $ company_profile <chr> "We're Food52, and we've created a groundbreaking ~
## $ description <chr> "Food52, a fast-growing, James Beard Award-winning~
## $ requirements <chr> "Experience with content management systems a majo~
## $ benefits <chr> "", "What you will get from usThrough being part o~
## $ telecommuting <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ has_company_logo <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,~
## $ has_questions <int> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,~
## $ employment_type <chr> "Other", "Full-time", "", "Full-time", "Full-time"~
## $ required_experience <chr> "Internship", "Not Applicable", "", "Mid-Senior le~
## $ required_education <chr> "", "", "", "Bachelor's Degree", "Bachelor's Degre~
## $ industry <chr> "", "Marketing and Advertising", "", "Computer Sof~
## $ function. <chr> "Marketing", "Customer Service", "", "Sales", "Hea~
## $ fraudulent <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ country <chr> "US", "NZ", "US", "US", "US", "US", "DE", "US", "U~
## $ state <chr> "NY", NA, "IA", "DC", "FL", "MD", "BE", "CA", "FL"~
## $ city <chr> "New York", "Auckland", "Wever", "Washington", "Fo~
## $ min_salary <chr> NA, NA, NA, NA, NA, NA, "20000", NA, NA, NA, "1000~
## $ max_salary <chr> NA, NA, NA, NA, NA, NA, "28000", NA, NA, NA, "1200~
class(df_fake_job)
## [1] "data.frame"
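The `glimpse()` output shows that `min_salary` and `max_salary` are still character columns after the split. If numeric salary analysis were wanted, a conversion could look like the base-R sketch below. The input values are hypothetical, and the split is done with `strsplit` rather than the `str_split_fixed` call used in the report. The third value illustrates one plausible way a row can have a minimum but no maximum, which would make the two NA counts differ by one.

```r
# Hypothetical salary_range values: a normal range, an empty string,
# and a value with no "-" separator.
salary_range <- c("20000-28000", "", "40000")

# Split on "-"; pieces that are absent become NA.
parts <- strsplit(salary_range, "-", fixed = TRUE)
min_chr <- vapply(parts, function(p) if (length(p) >= 1) p[1] else NA_character_,
                  character(1))
max_chr <- vapply(parts, function(p) if (length(p) >= 2) p[2] else NA_character_,
                  character(1))

# Convert the text pieces to numbers; NA stays NA.
min_salary <- as.numeric(min_chr)
max_salary <- as.numeric(max_chr)

print(data.frame(salary_range, min_salary, max_salary))
```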
View column names
names(df_fake_job)
## [1] "job_id" "title" "department"
## [4] "company_profile" "description" "requirements"
## [7] "benefits" "telecommuting" "has_company_logo"
## [10] "has_questions" "employment_type" "required_experience"
## [13] "required_education" "industry" "function."
## [16] "fraudulent" "country" "state"
## [19] "city" "min_salary" "max_salary"
Check for duplicate IDs
table(duplicated(df_fake_job$job_id))
##
## FALSE
## 17880
There are no duplicate IDs.
Check for total missing values for each feature
colSums(is.na(df_fake_job))
## job_id title department company_profile
## 0 0 0 0
## description requirements benefits telecommuting
## 0 0 2 0
## has_company_logo has_questions employment_type required_experience
## 0 0 0 0
## required_education industry function. fraudulent
## 0 0 0 0
## country state city min_salary
## 346 2580 2067 15012
## max_salary
## 15013
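The ‘benefits’ column has two missing values. The NA counts for country, state, city, min_salary, and max_salary come from the empty strings that were converted to NA in the splitting steps above; empty strings in the other text columns are not counted here.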
List rows with missing values
missingdf <- df_fake_job[!complete.cases(df_fake_job), ]
sample_n(missingdf, 3)
## job_id title department
## 1 12183 Title Insurance: Commercial Underwriting Counsel
## 2 7336 Customer Service Representative
## 3 11857 Telesales Opportunities
## company_profile
## 1 #URL_e7c9057d5e6f097876436d175031e95669ede4ebaab52b6be0957c837bc98343#
## 2 Hawkeye Recruitment provides cost effective recruitment advertising solutions to help you cast the widest net to find the perfect candidate for your job. We can help improve your recruitment efforts, and streamline your hiring process.
## 3 Established on the principles that full time education is not for everyone Spectrum Learning is made up of a team of passionate consultants with the drive for putting people who wish to grow themselves through education whilst working into long term and relevant job roles.We also are official re-sellers for The Institute of Recruiters/ Study Course professional courses in HR Practice, In-House Recruitment and Agency RecruitmentIt is our mission to help anyone wishing to pursue an apprenticeship onto the right qualification and into the right job.We work closely with both the candidate and the employer to ensure when the learner is enrolled they are at the start of a long and successful career.We have great relationships with a number of national training providers to ensure we can cover any apprenticeship available.
## description
## 1 A well run, very well connected Title Insurance Agency based in NY, has a need for an experienced Commercial Underwriting Counsel. This position can be based either in the NYC location or Garden City -Long Island location. He or she will have significant responsibility within the organization and should be able to operate as a senior executive in interactions with both internal and external constituents. This is a great opportunity for the right person. Drop us a line if you fit the qualifications below and are interested in the role.Underwriting counsel1. 5-10 years of NY, commercial, independent underwriting for an underwriter or large commercial agent2. Strong commercial reading background;3. Strong commercial surveys reading experience;4. Strong NY Practice experience;5. Strong understanding for development rights transactions in NY;6. Strong understanding of NYC mortgage tax and NY transfer tax consequencesAll Inquiries are strictly confidential
## 2 As a trusted systems integrator for more than 50 years, General Dynamics Information Technology provides information technology (IT), systems engineering, professional services and simulation and training to customers in the defense, federal civilian government, health, homeland security, intelligence, state and local government and commercial sectors. With approximately 28,000 professionals worldwide, the company delivers IT enterprise solutions, manages large-scale, mission-critical IT programs and provides mission support services. GDIT is an Equal Opportunity/Affirmative Action Employer - Minorities/Females/Protected Veterans/DisabledGENERAL SUMMARY: The CMS Customer Service Representative I (CSR) is responsible for delivering general Marketplace information to callers. The CSRs use basic office equipment and technology such as telephones, email, and web browsers to perform their duties. The processes that the CSRs must follow are well defined and documented in standard operating procedures and scripts. Prescribed scripts must be read verbatim to the caller. Neither subject matter knowledge nor independent decision making is required by this position.The Customer Service Representative I reports directly to the Customer Service Supervisor. This is an entry level position responsible for disseminating general Marketplace information. Application processing, enrollment guidelines and a general Marketplace background will be the focus with callers. The Customer Service Representative I will follow scripting to determine when to transfer the caller to a Customer Service Representative IIGeneral Dynamics Information Technology is an Equal Opportunity/Affirmative Action Employer (M/F/D/V).
## 3 We are a busy recruitment agency in Wakefield looking for Telesales Executives. We are now able to offer a number of apprenticeship and training opportunities to businesses looking for new staff or employers looking to train their staff, and we urgently need Sales staff to sell these opportunities!The role will involve business to business telesales making a high volume of calls each day. As our company is currently growing at the moment this position has excellent career prospects and we are looking for long term members of staff.If you are interested please apply now.
## requirements
## 1
## 2 JOB RESPONSIBILITIES:• Utilize standard technology such as telephone, e-mail, and web browser to perform job duties.• Provide knowledgeable responses to telephone inquiries in a courteous and professionalmanner, utilizing pre-scripted responses which they must read verbatim to provide basic general and claims specific information.• Follow established and documented policies and standard operating procedures such as filling out timesheets, adhering to privacy rules and responding to numerous phone inquiries.• Assist caller with filling out online application and submitting it electronically to plan provider for processing.• Complete basic call log related to the phone inquiries such as clicking radio buttons to confirm which scripts were read by the CSR to the caller.• Refer calls as required to Customer Service Representative II.• Maintain up-to-date knowledge of CMS regulations and policies as they apply.• Report problems that occur via the online system so they can be addressed by the appropriate parties.• Respond to telephone inquiries within the set departmental staffing and time parameters.• May be required to work GDIT scheduled holidays. Overtime may be required.• Perform other related duties as assigned.• High School diploma or equivalent requiredWORKING CONDITIONS:The work is typically performed in an office environment, which requires proper safety and security precautions. To ensure our contact center production area is at minimal risk for unauthorized disclosure (that is, the release or divulgence of information by an entity to persons or organizations outside of that entity) of Personally Identifiable Information (PII) or Protected Health Information (PHI), the work environment operates under a Secure Floor Policy. 
The Secure Floor Policy limits or restricts personal belongings, electronic devices, or paper that can be brought into production areas.The above job description is not intended to be, nor should it be construed as, exhaustive of all responsibilities, skills, efforts, or working conditions associated with this job.Requests for reasonable accommodations will be considered to enable individuals with disabilities to perform the principal (essential) functions of this job.EXPERIENCE:• Minimum 6 months customer service/secretarial/telemarketing experience required.• Must be able to speak and read English clearly, professionally and fluently.• Must be able to type a minimum of 20 WPM.• Ability to effectively work within established contractual turnaround times required.• Must have demonstrated excellent interpersonal and the ability to organize simultaneous tasks.• Proven ability to work as a member of a team.• All CMS personnel will be required by contract to undergo program update training as the program changes.• Spanish fluency is desirable
## 3 Love of sales.Excellent telephone manner.2 year business to business telesales experience.
## benefits telecommuting has_company_logo has_questions
## 1 0 1 0
## 2 0 1 0
## 3 Career prospects.Busy workload. 0 1 1
## employment_type required_experience required_education
## 1 Full-time
## 2 Full-time Entry level High School or equivalent
## 3 Full-time Associate
## industry function. fraudulent country state city
## 1 Real Estate 0 US NY New York
## 2 Telecommunications Customer Service 0 US IA Coralville
## 3 Sales 0 GB WKF Wakefield
## min_salary max_salary
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
Visualize missing rates for each feature
gg_miss_var(df_fake_job, show_pct = TRUE) + labs(y = "% Missing")
Merge columns and create a new ‘full_text’ column
viz_df <- select(df_fake_job, -c(max_salary, min_salary, state, city))
viz_df$full_text <-
paste(na.omit(viz_df$title),
na.omit(viz_df$country),
na.omit(viz_df$department),
na.omit(viz_df$company_profile),
na.omit(viz_df$description),
na.omit(viz_df$requirements),
na.omit(viz_df$benefits),
na.omit(viz_df$employment_type),
na.omit(viz_df$required_experience),
na.omit(viz_df$required_education),
na.omit(viz_df$industry),
na.omit(viz_df$function.))
viz_df[viz_df == ""] <- NA
Visualize missing profile for each feature
plot_missing(viz_df)
Heatmap of missingness across the dataframe
vis_miss(viz_df)
Drop columns
model_df <- select(viz_df,
-c(title,
country,
department,
company_profile,
description,
requirements,
benefits,
employment_type,
required_experience,
required_education,
industry,
function.))
sample_n(model_df, 3)
## job_id telecommuting has_company_logo has_questions fraudulent
## 1 4394 0 1 0 0
## 2 8490 0 1 1 0
## 3 6125 0 1 0 0
## full_text
## 1 EXPERIENCED WAITER/ESS NEEDED @ YOOBI - LONDON'S 1ST TEMAKERIA - SUSHI RESTAURANT US STRONG COMMAND OF ENGLISH NECESSARY Here at Yoobi, we are looking to expand our team in order to accommodate for large customer demand. We have a simple mission – making London’s best temaki sushi with the freshest, most sustainable ingredients around whilst having fun together and with our customers. Role Description Customer service is where it all starts at Yoobi – it is the first step to building your career with us. Sharpen your people, and teamwork skills, and learn how to run every aspect of creating a great experience for our customers. Get ready to grow! We are looking for…: Passionate people. People who operate with a sense of urgency. People who smile uncontrollably. People who love to serve. Foodies, eaters, and sushi aficionados. Neat-freaks. People who are willing to learn from their mistakes. People who want to have a voice in their workplace. People who want to jump at the opportunity to join a rapidly growing company with extremely high standards. The ideal candidate will need to have: - Have excellent command of English - written, spoken & comprehension - Experience in working in a restaurant - Have great customer service skills - Be able to work under pressure - Quick Learner - Have a position attitude In return, We will offer you: - Competitive wage plus cash tips - Free staff meal - Paid holiday - Help you develop your career 3 Quick Questions You Must Answer: 1. Who is the coolest person in the world? 2. What is your favourite current song? 3. Can you whistle? Send us a message, answer the questions and attach a copy of your resume with references. This is your first step to starting your career at Yoobi! Full-time Not Applicable Unspecified Restaurants Customer Service
## 2 Front end engineer US Mindscape is a Wellington based software development company that specialises in building tools for software engineers. We have a high growth product, Raygun (#URL_6b2f170addc3dd0415d65e21a8ece81d4c134c2b1a8b449386367dfaa286971b#) that's growing strongly. Mindscape is profitable and recently raised money to aggressively expand. Well respected, Mindscape has won international and national awards for excellence in software and has thousands of customers, including BMW, NATO, Intel, Microsoft & Beats Music to name a few. If you're up for the challenge of joining a fast growing business then look no further. Raygun is a fast growing Mindscape product (#URL_6b2f170addc3dd0415d65e21a8ece81d4c134c2b1a8b449386367dfaa286971b#). Raygun is a hosted service for automatically collecting data about software crashes and errors. It has a strong design aesthetic with plenty of opportunity to be creative, quirky and professional all at once so it's no suprise that customers love the current design and cite it as being one of the many reasons they choose our service.You'll be joining a small team and have a direct impact on the Raygun web application. You should have extremely solid production skills with CSS/JavaScript, as well as a strong interest in the usability of what you're designing. This role is predominantly about design, but a full-stack skillset to implement your designs in the application would be a substantial benefit.One of the great things about building a product for a technical audience that we can use cutting edge technologies. Forget Internet Explorer 7 support - if our customers used that, they'd already be out of a job. You get to work with all the latest buzzword technologies and frameworks - HTML5 (we particularly love the Canvas tag), CSS3, D3.js, #URL_b7bad8ac916069eadd573f035544c52dc3519a0ba054fb7ab1ff9ba3e1525399#. 
Our team is tight, and you'll be working directly with our lead designer, implementing great stuff with him and also being part of the design process yourself. You'll be tasked with creating a world leading user experience. We have users who want to pay for our product just for how beautiful it looks and we want you to help dial up the front end even further!Raygun is growing strongly, with thousands of developers globally using the service. Mindscape is well respected company for excellence in product development. The opportunity to join a fast growing, fast moving company where you have a direct impact on the application is here - are you up for the challenge? 3 years of frontend development experience.Highly skilled at HTML, CSS and JavaScript.Great taste, strong empathy, customer focus.Effectively incorporates broad goals into tactical work.Experience with Backbone, Angular JS, #URL_1d0f9eb2a7073ab63d5cfc0f9762fb40962b2b8ad1607a31c869aa4fd0382977# or #URL_ec870d4c32d3db2026283bb633aad057f18c5d5242768ddea14d56d6a38b12ef# is a plus.Experience with D3 is a plus. You’ll get other perks in the office like having a sweet place to work, where weirdness is welcomed and encouraged. You’ll get fresh fruit, and lollies (a balanced diet!). You can choose to work from a couch or a standing desk or a sitting desk. And lastly, you’ll get the opportunity to join one hell of a crazy awesome ride with us. There aren’t very many New Zealand-based SaaS companies who are in the same position to dent the world. Full-time Bachelor's Degree Information Technology and Services Engineering
## 3 Marketing Communications Specialist US As a growing and successful startup, Conversocial is a great place to work for ambitious individuals.We build a market leading social customer service solution, and we need even more great people to help us push that position even further. You’ll get the opportunity to work in an exciting new market, where we’re helping companies to understand the solution to their problems and are changing the way they interact with consumers.We have a trusting, hands-off management style, which is suited for people that are self-motivated.Our employees have the opportunity for independence and responsibility over their own projects, but we provide all the support and training they need to get there and to develop their careers.At Conversocial we like to balance work and play.We eat lunch together everyday (a company perk) and all enjoy a Friday treat of cake and few drinks. Our close-knit team is very sociable, which makes the Conversocial office a relaxed, fun and supportive working environment. You will work closely with Head of Marketing - EMEA to develop outstanding content that engages our audience and, ultimately, drives inbound leads. As the Marketing Communications Specialist, you will be responsible for drafting Conversocial’s best practice guides, white papers, case studies and some contributions to our blog. You will need to be confident, as a large part of this role will be interviewing/building relationships with clients in order to create engaging content. The marketing team is small, so you also need to be a team player prepared to “muck in” with marketing activities such as manning event stands.This is an exciting opportunity to challenge yourself and join a talented team within the technology space. 
You must want to be a team player and thrive off creating engaging content and copy.You will enjoy and have experience of delivering thought leadership content within the B2B technology space. As a Digital Communications Specialist, you will have:• Demonstrable experience in creating thought leadership content• Knowledge and understanding of Social Customer Service• Great organisational skills• Be a team player, yet capable of working independently $50 - $70k DOE and Performance + Medical, 401kHealth, Vision and Dental Insurance401k w/ 4% matchingGrowth Opportunities Available
Check NA or missing values
sum(is.na(model_df))
## [1] 0
sum(model_df == "")
## [1] 0
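A per-column breakdown can be more informative than a single total. The sketch below is only an illustration in base R: `toy_df` and its values are made up and stand in for `model_df`. It counts `NA` values and empty strings column by column.

```r
# Toy data frame standing in for model_df (hypothetical values)
toy_df <- data.frame(
  title      = c("Data Entry Clerk", "", "Web Developer"),
  department = c(NA, "Sales", "Engineering"),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy_df))

# Number of empty strings in each column; NA entries are skipped
empty_per_column <- colSums(toy_df == "", na.rm = TRUE)

print(na_per_column)
print(empty_per_column)
```

Note that without `na.rm = TRUE` a single `NA` would turn the whole count into `NA`, so the empty-string check is best run after the `NA` check.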
Visualize missing values
vis_miss(model_df)
vis_dat(model_df)
Exploratory Data Analysis (EDA)
Before building our models, we performed exploratory data analysis to understand the dataset.
Visualize fraud and real
viz_df2 <- viz_df
viz_df2$fraudulent[viz_df2$fraudulent == 1] <- "Fraud"
viz_df2$fraudulent[viz_df2$fraudulent == 0] <- "Non Fraud"
count <- table(viz_df2$fraudulent)
bar <- barplot(count,
main="Proportion of fraudulent job postings",
xlab="fraudulent",
ylab="count",
col=c(rgb(0.3,0.1,0.4,0.6), rgb(0.3,0.9,0.4,0.6)))
text(bar, count/2, labels = count)
The dataset contains 17,014 legitimate job postings and 866 fraudulent ones, a fraud rate of 4.84%.
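The fraud rate follows directly from the two class counts. As a small worked example in base R (the counts are the ones reported above; the variable names are ours):

```r
# Class counts as reported in the text
counts <- c("Non Fraud" = 17014, "Fraud" = 866)

# Share of each class, in percent
pct <- round(100 * counts / sum(counts), 2)
print(pct)

# Number of legitimate postings per fraudulent one
imbalance_ratio <- counts[["Non Fraud"]] / counts[["Fraud"]]
print(round(imbalance_ratio, 1))
```

With roughly twenty legitimate postings for every fraudulent one, a model that always predicts "Non Fraud" would already be about 95% accurate, which is why accuracy alone is a weak yardstick for this dataset.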
Visualize country-wise job postings
temp <- na.omit(subset(viz_df, select = c(country))) %>%
group_by(country) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
top_n(10, n)
par(mar=c(6,4,4,4))
barplot(height=temp$n,
main="Top 10 country-wise job postings",
ylab="count",
col=brewer.pal(10, "Set3"),
names.arg=c("United States",
"United Kingdom",
"Greece",
"Canada",
"Germany",
"New Zealand",
"India",
"Australia",
"Philippines",
"Netherlands"),
cex.names=0.7,
las=2)
The ten countries with the most job postings are US, GB, GR, CA, DE, NZ, IN, AU, PH and NL. The United States has 10,656 postings, followed by the United Kingdom with 2,384 and Greece with 940.
Visualize the industries
temp <- na.omit(subset(viz_df, select = c(industry))) %>%
group_by(industry) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
top_n(10, n)
par(mar=c(10,4,4,4))
barplot(height=temp$n,
names=temp$industry,
main="Top 10 industries",
ylab="count",
col=brewer.pal(10, "Set3"),
cex.names=0.6,
las=2)
The industries with the most job openings are IT-related: Information Technology and Services (1,734), Computer Software (1,376) and Internet (1,062).
Visualize the departments
temp <- na.omit(subset(viz_df, select = c(department))) %>%
group_by(department) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
top_n(10, n)
par(mar=c(8,4,4,4))
barplot(height=temp$n,
names=temp$department,
main="Top 10 departments",
ylab="count",
col=brewer.pal(10, "Set3"),
cex.names=0.6,
las=2)
The top hiring departments are Sales (551), Engineering (487) and Marketing (401).
Visualize the required experiences in the jobs
viz_df %>% group_by(required_experience) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
drop_na() %>%
top_n(10, n) %>%
ggplot(aes(x=reorder(required_experience, -n), y = n)) +
geom_segment(aes(x=reorder(required_experience, -n), xend=reorder(required_experience, -n), y=0, yend=n), color="skyblue") +
geom_point(color="steelblue", size=2, alpha=1) +
theme_light() +
coord_flip() +
theme(panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()) +
theme_bw() + labs(title = "Listed jobs with required experiences",
x = "Experience",
y = "Count",
fill = "Experience") +
geom_text(aes(label=round(n,0)), vjust=-0.6)
Mid-Senior level positions are the most common, followed by Entry level and Associate.
Visualize the required education in the jobs
viz_df %>% group_by(required_education) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
drop_na() %>%
top_n(10, n) %>%
ggplot(aes(x=reorder(required_education, -n), y = n)) +
geom_segment(aes(x=reorder(required_education, -n), xend=reorder(required_education, -n), y=0, yend=n), color="skyblue") +
geom_point(color="steelblue", size=2, alpha=1) +
theme_light() +
coord_flip() +
theme(panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()) +
theme_bw() + labs(title = "Listed jobs with required education",
x = "Education",
y = "Count",
fill = "Education") +
geom_text(aes(label=round(n,0)), vjust=-0.6)
Most job ads that state an education requirement ask for at least a Bachelor’s degree.
Visualize fraudulent job postings based on employment types
viz_df2 <- viz_df
viz_df2$employment_type <- ifelse(is.na(viz_df2$employment_type), "Missing", viz_df2$employment_type)
df1 <- subset(viz_df2, select = c(employment_type, fraudulent)) %>%
group_by(employment_type, fraudulent) %>%
summarize(yes = sum(fraudulent==1), .groups = 'drop') %>%
filter(fraudulent==1)
df2 <- subset(viz_df2, select = c(employment_type, fraudulent)) %>%
group_by(employment_type, fraudulent) %>%
summarize(no = sum(fraudulent==0), .groups = 'drop') %>%
filter(fraudulent==0)
df_new <- merge(df1, df2, by = c("employment_type")) %>%
group_by(employment_type) %>%
summarize(pct_fraud = round(yes/(yes+no), digits=3),
pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>%
mutate(employment_type = factor(employment_type,
levels = c('Part-time',
'Missing',
'Other',
'Full-time',
'Contract',
'Temporary')))
fig <- df_new %>% plot_ly(width = 700, height = 400)
fig <- fig %>% add_trace(x = ~employment_type, y = ~pct_non_fraud, type = 'bar',
text = ~paste0(pct_non_fraud*100,"%"), textposition = 'outside', name = 'pct_non_fraud',
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% add_trace(x = ~employment_type, y = ~pct_fraud, type = 'bar',
text = ~paste0(pct_fraud*100,"%"), textposition = 'outside', name = 'pct_fraud',
marker = list(color = 'rgb(58,200,225)',
line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% layout(title = "Employment types with % fraud and non-fraud",
barmode = 'group',
xaxis = list(title = "employment_type"),
yaxis = list(title = "percentage"))
fig
Part-time jobs have the highest percentage of fraudulent postings, nearly 9%. Postings without an employment type also have a high fraud rate, around 7%.
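The per-category fraud rate shown in the chart is the mean of the 0/1 `fraudulent` flag within each group. A compact base-R sketch of that idea, using made-up rows rather than the EMSCAD data, could look like this:

```r
# Toy postings (hypothetical values, not the EMSCAD data)
toy <- data.frame(
  employment_type = c("Part-time", "Part-time", "Full-time", "Full-time",
                      "Full-time", NA, NA, "Contract"),
  fraudulent      = c(1, 0, 0, 0, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Treat a missing employment type as its own category
toy$employment_type[is.na(toy$employment_type)] <- "Missing"

# The mean of a 0/1 column within each group is that group's fraud rate
pct_fraud <- round(tapply(toy$fraudulent, toy$employment_type, mean), 3)
pct_non_fraud <- 1 - pct_fraud

print(sort(pct_fraud, decreasing = TRUE))
```

This gives the same kind of `pct_fraud` and `pct_non_fraud` columns as the merge-based code, in fewer steps, and it also keeps categories that contain no fraudulent postings at all.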
Visualize fraudulent job postings based on required experiences
viz_df2 <- viz_df
viz_df2$required_experience <- ifelse(is.na(viz_df2$required_experience), "Not Applicable", viz_df2$required_experience)
df1 <- subset(viz_df2, select = c(required_experience, fraudulent)) %>%
group_by(required_experience, fraudulent) %>%
summarize(yes = sum(fraudulent==1), .groups = 'drop') %>%
filter(fraudulent==1)
df2 <- subset(viz_df2, select = c(required_experience, fraudulent)) %>%
group_by(required_experience, fraudulent) %>%
summarize(no = sum(fraudulent==0), .groups = 'drop') %>%
filter(fraudulent==0)
df_new <- merge(df1, df2, by = c("required_experience")) %>%
group_by(required_experience) %>%
summarize(pct_fraud = round(yes/(yes+no), digits=3),
pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>%
mutate(required_experience = factor(required_experience,
levels = c('Executive',
'Entry level',
'Not Applicable',
'Director',
'Mid-Senior level',
'Internship',
'Associate')))
fig <- df_new %>% plot_ly(width = 700, height = 400)
fig <- fig %>% add_trace(x = ~required_experience, y = ~pct_non_fraud, type = 'bar',
text = ~paste0(pct_non_fraud*100,"%"), textposition = 'outside', name = 'pct_non_fraud',
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% add_trace(x = ~required_experience, y = ~pct_fraud, type = 'bar',
text = ~paste0(pct_fraud*100,"%"), textposition = 'outside', name = 'pct_fraud',
marker = list(color = 'rgb(58,200,225)',
line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% layout(title = "Required experiences with % fraud and non-fraud",
barmode = 'group',
xaxis = list(title = "required_experience"),
yaxis = list(title = "percentage"))
fig
Executive and Entry level postings have the highest fraud rates, nearly 7%.
Visualize fraudulent job postings based on job functions
viz_df2 <- viz_df
viz_df2$fraudulent[viz_df2$fraudulent == 1] <- "Fraud"
viz_df2$fraudulent[viz_df2$fraudulent == 0] <- "Non Fraud"
temp <- na.omit(subset(viz_df2, select = c(function., fraudulent))) %>%
group_by(function., fraudulent) %>%
summarize(n = n(), .groups = 'drop') %>%
group_by(function.) %>%
summarize(pct_fraud = round(sum(n[fraudulent=="Fraud"]/sum(n)), digits=3),
pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>%
arrange(desc(pct_fraud)) %>%
top_n(10, pct_fraud) %>%
mutate(function. = factor(function.,
levels = c('Administrative',
'Financial Analyst',
'Accounting/Auditing',
'Distribution',
'Other',
'Finance',
'Engineering',
'Business Development',
'Advertising',
'Customer Service')))
melted_temp <- melt(temp, id = "function.")
ggplot(melted_temp, aes(x = function., y = value, fill = variable)) +
geom_bar(position = "fill",
stat = "identity",
color = "black",
width = 0.8) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.6)) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = paste0(value*100,"%")),
position = position_stack(vjust = 0.6), size = 2) +
ggtitle("Job functions with % fraud and non-fraud") +
xlab("function") +
ylab("percentage")
The job function with the highest fraud rate is Administrative, close to 19%, followed by Financial Analyst and Accounting/Auditing. Administrative jobs appear the most suspicious, possibly because scams are easy to disguise as such roles.
Visualize fraudulent job postings based on required education
temp <- na.omit(subset(viz_df2, select = c(required_education, fraudulent))) %>%
group_by(required_education, fraudulent) %>%
summarize(n = n(), .groups = 'drop') %>%
group_by(required_education) %>%
summarize(pct_fraud = round(sum(n[fraudulent=="Fraud"]/sum(n)), digits=3),
pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>%
arrange(desc(pct_fraud)) %>%
top_n(10, pct_fraud) %>%
mutate(required_education = factor(required_education,
levels = c("Some High School Coursework",
"Certification",
"High School or equivalent",
"Master's Degree",
"Professional",
"Unspecified",
"Doctorate",
"Some College Coursework Completed",
"Associate Degree",
"Bachelor's Degree")))
melted_temp <- melt(temp, id = "required_education")
ggplot(melted_temp, aes(x = required_education, y = value, fill = variable)) +
geom_bar(position = "fill",
stat = "identity",
color = "black",
width = 0.8) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.6)) +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = paste0(value*100,"%")),
position = position_stack(vjust = 0.6), size = 2) +
ggtitle("Required education with % fraud and non-fraud") +
xlab("required_education") +
ylab("percentage")
As many as 74% of the postings that list “Some High School Coursework” as the required education are fraudulent, which suggests that fake jobs tend to ask for few educational credentials.
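The percentages in these charts are computed within each category, that is, the share of postings in a category that are fraudulent. This is different from the share of all fraudulent postings that fall into a category. The base-R sketch below uses made-up counts to show the two readings side by side.

```r
# Made-up counts (rows: required education, columns: class)
counts <- matrix(c(5, 15,
                   950, 50),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(
                   education = c("Some High School Coursework",
                                 "Bachelor's Degree"),
                   class     = c("Non Fraud", "Fraud")))

# Share of each education level that is fraudulent (rows sum to 1)
within_education <- prop.table(counts, margin = 1)

# Share of each class that comes from each education level (columns sum to 1)
within_class <- prop.table(counts, margin = 2)

print(round(within_education, 3))
print(round(within_class, 3))
```

In this toy table 75% of the "Some High School Coursework" postings are fraudulent, yet they account for only about 23% of all fraudulent postings, so a small category can have a very high fraud rate without contributing most of the fraud.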
Word Cloud
To compare fraudulent and real job postings, word clouds are used to show the most frequent keywords in the job titles. The postings are split by the fraudulent label and a word cloud is plotted for each subset.
Word Cloud of fraudulent job postings
selected_df <- subset(viz_df, fraudulent == 1)
# Create a vector containing only the text
text <- selected_df$title
# Create a corpus
docs <- Corpus(VectorSource(text))
docs <- docs %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix), decreasing=TRUE)
df <- data.frame(word = names(words), freq=words)
wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
Many fraudulent job postings share common keywords in their job titles: “Data Entry”, “Administrative”, “Home Based”, “Earn Daily”.
Word Cloud of NON-fraudulent job postings
selected_df <- subset(viz_df, fraudulent == 0)
# Create a vector containing only the text
text <- selected_df$title
# Create a corpus
docs <- Corpus(VectorSource(text))
docs <- docs %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix), decreasing=TRUE)
df <- data.frame(word = names(words), freq=words)
wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
Many non-fraudulent job postings share common keywords in their job titles: “Manager”, “Developer”, “Engineer”.
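A word cloud is a visual rendering of a term-frequency table. The sketch below reproduces that counting step in base R on a few made-up titles. It is not the `tm` pipeline used above: here punctuation is replaced by spaces, and the stop-word list is a tiny hand-picked one chosen for illustration.

```r
# Made-up job titles (not taken from the dataset)
titles <- c("Data Entry Clerk - Home Based", "Senior Data Engineer",
            "Administrative Assistant", "Data Entry Assistant 2")

# Lowercase, then replace digits and punctuation with spaces
clean <- gsub("[^a-z ]", " ", tolower(titles))

# Split on whitespace and drop empty tokens
tokens <- unlist(strsplit(clean, "\\s+"))
tokens <- tokens[tokens != ""]

# Drop a few stop words (hypothetical mini list)
tokens <- tokens[!tokens %in% c("the", "and", "of")]

# Count and sort term frequencies, as a word cloud would size them
freq <- sort(table(tokens), decreasing = TRUE)
print(head(freq, 3))
```

Inspecting the sorted table alongside the cloud makes it easier to quote exact counts for the most frequent terms.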
Modeling
Before modeling, a final dataset is determined. This project will use a dataset with these features for the final analysis:
- fraudulent (target variable)
- telecommuting
- has_company_logo
- has_questions
- full_text: a combination of title, country, department, company_profile, description, requirements, benefits, employment_type, required_experience, required_education, industry and function
The five supervised machine learning algorithms used in this project are:
- Logistic Regression
- Random Forest
- K-Nearest Neighbor (KNN)
- XGBoost
- Support Vector Machine (SVM)
Data pre-process (full_text)
For this analysis, the full_text column is converted to a DocumentTermMatrix and then to a data frame.
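As a rough illustration of what this step produces, the base-R sketch below builds a small document-term matrix by hand and drops rare terms. The documents are made up, and the filter assumes that `removeSparseTerms(dtm, 0.90)` keeps only terms that appear in more than roughly 10% of the documents; the sketch uses a 50% cut-off so the effect is visible on three documents.

```r
# Three made-up documents, already lowercased and stripped of punctuation
docs <- c("work from home data entry",
          "software engineer work with team",
          "data engineer team")

# Tokenize and build the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# Fraction of documents in which each term appears
doc_share <- colMeans(dtm > 0)

# Keep terms present in more than half of the documents
# (an illustrative threshold, not the one used in the report)
dense_dtm <- dtm[, doc_share > 0.5, drop = FALSE]
print(dense_dtm)
```

Each remaining column becomes one predictor, which is how the text ends up as a few hundred numeric features in the final data frame.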
docs <- Corpus(VectorSource(model_df$full_text))
docs <- docs %>%
tm_map(removeNumbers) %>% # Remove numbers
tm_map(removePunctuation) %>% # Remove punctuation
tm_map(stripWhitespace) # Eliminate extra white spaces
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
# Convert each full_text into a row with columns containing each term in the document and giving the frequency of unique words used in the full_text
dtm <- DocumentTermMatrix(docs)
sparse_data <- removeSparseTerms(dtm, 0.90) # Remove sparse terms
# Convert to dataframe for further analysis
sparse_data_df <- as.data.frame(as.matrix(sparse_data))
final_df <- subset(sparse_data_df, select = -c(`–`))
# Add other variables
final_df$telecommuting <- model_df$telecommuting
final_df$has_company_logo <- model_df$has_company_logo
final_df$has_questions <- model_df$has_questions
final_df$fraudulent <- model_df$fraudulent
View the dimension of the dataframe
dim(final_df)
## [1] 17880 313
# 17880 rows, 313 columns
Visualize data
# Histogram
par(mfrow=c(2,2))
for(i in 310:313) {
hist(final_df[,i], main=names(final_df)[i], border="blue", col="yellow")
}
# Boxplot
par(mfrow=c(2,2))
for(i in 310:313) {
boxplot(final_df[,i], main=names(final_df)[i], border="blue", col="yellow")
}
Correlation
A correlation matrix is created to visualize the numeric data relationship.
# Calculate the correlation between each pair of numeric variables
selected_df <- final_df[, 310:313]
corr_df <- round(cor(selected_df), 2)
corr_df
##                  telecommuting has_company_logo has_questions fraudulent
## telecommuting             1.00            -0.02          0.02       0.03
## has_company_logo         -0.02             1.00          0.23      -0.26
## has_questions             0.02             0.23          1.00      -0.09
## fraudulent                0.03            -0.26         -0.09       1.00
Visualize correlation heatmap
# Reshape the correlation matrix into long format for plotting
melted_corr_mat <- melt(corr_df)
# Plot the correlation heatmap
ggplot(data = melted_corr_mat, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
geom_text(aes( label = value), color = "black", size = 4)
None of the features is highly correlated with another. However, has_company_logo and has_questions are negatively correlated with fraudulent (-0.26 and -0.09), which indicates that a job posting with a company logo or with questions is less likely to be fraudulent.
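Because these variables are all 0/1 indicators, the Pearson correlation is the phi coefficient, and a group-wise fraud rate is often easier to read than the coefficient itself. The sketch below illustrates both on ten made-up postings (the values are hypothetical, not taken from the dataset).

```r
# Made-up binary indicators for ten postings
has_company_logo <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
fraudulent       <- c(0, 0, 0, 0, 0, 1, 1, 1, 0, 0)

# Pearson correlation of two 0/1 variables (the phi coefficient)
phi <- cor(has_company_logo, fraudulent)

# Fraud rate with and without a logo, which is easier to interpret
rate_by_logo <- tapply(fraudulent, has_company_logo, mean)

print(round(phi, 2))
print(rate_by_logo)
```

In this toy example a modest negative correlation corresponds to a fraud rate that is three times higher among postings without a logo, so a small coefficient does not by itself mean that a feature is uninformative.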
Split data into 70% training, 30% testing
# Using the same seed value, reproduce the division of the training and testing sets
set.seed(123)
train_index <- sample(dim(final_df)[1], 0.7 * dim(final_df)[1])
model_dftrain<- final_df[train_index, ]
model_dftest <- final_df[-train_index, ]
paste("train sample size: ", dim(model_dftrain)[1])
## [1] "train sample size: 12516"
paste("test sample size: ", dim(model_dftest)[1])
## [1] "test sample size: 5364"
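The role of `set.seed()` in this split can be checked on a smaller example. The sketch below uses a hypothetical row count `n` and confirms that the same seed reproduces the same 70/30 partition and that the two index sets do not overlap.

```r
n <- 1000                     # hypothetical number of rows

set.seed(123)                 # fixes the random draw
train_index <- sample(n, round(0.7 * n))
test_index  <- setdiff(seq_len(n), train_index)

# Re-running with the same seed reproduces the same split
set.seed(123)
train_index_again <- sample(n, round(0.7 * n))

print(length(train_index))
print(length(test_index))
print(identical(train_index, train_index_again))
```

Since the split is purely random, the share of fraudulent postings can differ slightly between the training and testing sets; with a class this rare it is worth checking the two proportions after splitting.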
View training set
sample_n(model_dftrain, 3)
## also amp andor around attention best big business communication company
## 17169 0 0 0 0 0 1 0 0 0 0
## 3482 0 0 0 0 0 0 0 1 0 0
## 9043 1 0 0 0 0 0 0 0 1 1
## content currently daily drive engineering existing experience full highly
## 17169 0 0 0 0 0 0 0 0 0
## 3482 0 0 0 0 1 1 2 0 0
## 9043 0 0 0 0 0 0 0 0 0
## hours information like long management market marketing media need new
## 17169 0 1 0 0 1 0 0 0 0 0
## 3482 0 2 1 0 0 0 0 0 0 1
## 9043 0 0 0 1 1 0 0 0 0 2
## offer office one online people plus small social staff startup support
## 17169 0 0 0 0 0 0 0 0 0 0 0
## 3482 0 0 1 0 0 0 0 0 0 0 0
## 9043 1 1 0 0 0 0 0 0 0 0 0
## systems talented team technology top using various website work working
## 17169 0 0 1 1 0 0 1 0 0 0
## 3482 0 0 2 3 0 1 0 0 0 1
## 9043 0 0 0 0 0 0 2 0 1 0
## able apply based can candidates client clients communicate companies
## 17169 0 0 0 0 0 0 0 0 0
## 3482 0 0 0 0 0 0 0 0 0
## 9043 0 0 0 0 0 0 0 0 0
## computer cost creative customer delivery effectively email environment
## 17169 0 0 0 0 0 0 0 0
## 3482 2 0 0 0 0 0 0 0
## 9043 0 0 0 1 0 1 0 0
## every excellent fast following fulltime get global great grow growing
## 17169 0 0 0 0 1 0 0 0 0 0
## 3482 0 0 0 0 1 0 0 0 0 0
## 9043 0 0 0 0 2 0 0 0 0 1
## growth high include including international issues key know knowledge
## 17169 0 0 0 0 0 0 0 0 0
## 3482 0 2 0 0 0 0 0 0 0
## 9043 0 0 0 1 0 0 0 0 1
## large learn level looking making manage manager managing network
## 17169 0 0 1 0 0 0 0 0 0
## 3482 0 0 0 0 0 0 0 0 0
## 9043 0 0 0 0 0 0 0 1 0
## opportunity part passion person phone planning platform please position
## 17169 0 0 0 0 0 0 0 0 1
## 3482 2 1 0 0 0 0 0 0 0
## 9043 0 0 0 0 0 0 0 0 0
## process product production project projects provides quality range right
## 17169 0 1 0 0 0 0 0 0 0
## 3482 0 0 0 0 0 0 2 0 0
## 9043 0 2 0 0 0 0 0 0 0
## role service skills software success successful system teams understand
## 17169 0 0 0 1 0 0 0 0 0
## 3482 0 0 0 3 0 0 0 1 1
## 9043 0 2 1 0 0 1 0 0 0
## web will world across activities candidate career contract engineer
## 17169 0 1 0 0 0 0 1 0 0
## 3482 2 1 0 0 0 0 0 0 0
## 9043 1 0 0 0 0 0 0 0 0
## ensure experienced field focus health ideal meet must needs opportunities
## 17169 1 0 0 0 0 1 0 0 0 0
## 3482 0 0 0 0 0 0 0 0 0 0
## 9043 0 0 0 0 3 0 0 0 1 0
## per provide requirements resources seeking services solutions strong
## 17169 0 0 0 0 0 0 2 0
## 3482 0 0 0 0 0 1 5 0
## 9043 0 0 1 0 0 1 0 0
## unique vision way ability analysis available bachelors benefits build
## 17169 0 0 0 0 0 0 1 0 0
## 3482 0 0 0 0 0 0 1 0 1
## 9043 0 1 0 1 1 0 0 1 0
## competitive culture customers degree develop development equivalent first
## 17169 0 0 0 1 0 2 0 0
## 3482 0 0 0 2 1 3 0 0
## 9043 1 0 0 1 0 0 0 0
## goals good help industry lead life maintain make midsenior motivated
## 17169 0 0 0 0 0 0 0 0 1 0
## 3482 0 0 0 1 0 1 0 0 0 0
## 9043 0 0 0 0 0 1 0 0 0 0
## order organization personal problem professional providing related
## 17169 0 0 0 0 0 0 0
## 3482 0 0 0 0 0 0 0
## 9043 0 0 0 0 0 0 1
## responsible sales strategy travel understanding value verbal within
## 17169 0 0 0 1 0 0 0 0
## 3482 0 0 0 0 0 0 0 0
## 9043 0 1 0 0 0 0 0 1
## written year years care current deliver directly innovative interested
## 17169 0 0 0 0 0 0 0 0 0
## 3482 0 0 1 0 0 2 0 1 0
## 9043 0 1 1 2 0 0 0 0 0
## job leadership monthly offers open operations performance positions
## 17169 0 0 0 0 0 0 0 0
## 3482 0 0 0 0 0 0 0 0
## 9043 0 0 0 1 1 0 0 0
## potential preferred processes reports results standards time training
## 17169 0 0 1 0 0 0 1 0
## 3482 0 0 0 0 0 0 0 0
## 9043 0 0 0 0 0 0 3 0
## well areas come design driven employees excel financial join relevant
## 17169 1 0 0 1 0 0 0 0 0 0
## 3482 0 0 0 0 0 0 0 0 0 0
## 9043 0 0 0 0 0 1 0 0 0 1
## school senior technical we’re without brand dynamic ideas leading many
## 17169 0 1 0 0 0 0 0 0 0 0
## 3482 0 0 0 1 0 0 1 0 0 0
## 9043 0 0 0 0 0 0 0 0 0 1
## mobile take creating flexible free just love minimum mission multiple
## 17169 0 0 0 0 0 0 0 0 0 0
## 3482 0 0 0 0 0 0 0 0 0 0
## 9043 0 0 0 1 0 0 0 0 0 1
## passionate play record required use want applications associate change
## 17169 0 0 0 0 0 0 0 0 0
## 3482 0 1 0 0 0 0 3 0 0
## 9043 1 0 0 0 0 0 0 1 0
## tools background delivering duties entry improve months reporting tasks
## 17169 0 0 0 0 0 1 0 0 0
## 3482 0 0 0 0 0 0 0 0 0
## 9043 0 0 0 1 1 0 0 0 1
## agency building data developer developing digital internal learning
## 17169 0 0 0 2 0 0 0 0
## 3482 0 0 0 1 2 0 0 0
## 9043 0 0 1 0 0 0 0 0
## products technologies closely employee internet start track application
## 17169 0 0 0 0 0 0 0 0
## 3482 1 0 0 0 0 0 0 2
## 9043 2 0 0 0 0 0 0 0
## create established may user hard insurance believe now plan problems
## 17169 0 0 0 0 1 0 0 0 0 0
## 3482 1 0 0 0 0 0 0 0 0 0
## 9043 0 2 0 0 0 2 0 1 1 0
## complex day education individuals relationships jobs fun see english
## 17169 0 0 0 0 0 0 0 0 0
## 3482 0 0 0 0 0 0 0 0 0
## 9043 0 0 2 0 0 0 0 0 0
## individual salary dental group package paid medical exciting members
## 17169 0 0 0 0 0 0 0 0 1
## 3482 0 0 0 0 0 0 0 0 0
## 9043 0 0 1 0 1 1 12 0 1
## least telecommuting has_company_logo has_questions fraudulent
## 17169 0 0 1 1 0
## 3482 1 0 1 1 0
## 9043 0 0 1 0 0
Convert the dependent variable as a factor
model_dftrain$fraudulent = as.factor(model_dftrain$fraudulent)
model_dftest$fraudulent = as.factor(model_dftest$fraudulent)
Logistic Regression
# Train logistic regression
lr_model <- glm(formula = fraudulent ~ ., family = "binomial", data = model_dftrain)
Predict the testing set
# Predicted probabilities of the positive class (fraudulent = 1)
lr_pred_test <- predict(lr_model, newdata = model_dftest, type = "response")
test <- model_dftest
glm.probs = predict(lr_model, newdata = test, type = "response")
# Classify as fraudulent when the predicted probability exceeds 0.5
test$pred_glm = ifelse(glm.probs > 0.5, "1", "0")
test$pred_glm = as.factor(test$pred_glm)
Calculate AUC of the model
# Compute AUC with the ROCR package (prediction() and performance())
calcAUC <- function(predcol, outcol) {
perf <- performance(prediction(as.numeric(predcol), outcol == 1), "auc")
as.numeric(perf@y.values)
}
paste("AUC of Logistic Regression is", round(calcAUC(lr_pred_test, model_dftest$fraudulent), digits=4))
## [1] "AUC of Logistic Regression is 0.953"
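AUC can be read as the probability that a randomly chosen fraudulent posting receives a higher score than a randomly chosen legitimate one, with ties counted as one half. The base-R sketch below implements that rank-based definition (the helper `rank_auc` is ours, not part of ROCR). It also shows that when the score is a hard 0/1 prediction, as for the Random Forest, KNN and SVM outputs in this report, the AUC reduces to balanced accuracy. The second example rebuilds the predictions from the Random Forest confusion matrix counts reported in the Evaluation section.

```r
# Rank-based AUC: probability that a random positive scores above a random
# negative, counting ties as one half
rank_auc <- function(score, label) {
  r     <- rank(score)                       # average ranks for ties
  n_pos <- as.numeric(sum(label == 1))
  n_neg <- as.numeric(sum(label == 0))
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Continuous scores (made-up values)
auc_scores <- rank_auc(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))
print(auc_scores)

# Hard 0/1 predictions rebuilt from the Random Forest confusion matrix:
# TN = 5087, FP = 5, FN = 107, TP = 165
truth <- c(rep(0, 5087 + 5), rep(1, 107 + 165))
pred  <- c(rep(0, 5087), rep(1, 5), rep(0, 107), rep(1, 165))

auc_hard     <- rank_auc(pred, truth)
balanced_acc <- (165 / 272 + 5087 / 5092) / 2
print(round(c(auc_hard, balanced_acc), 4))
```

For a like-for-like comparison between models, AUC would need to be computed from predicted probabilities (or other continuous scores) for every model, not from class labels for some and probabilities for others.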
Random Forest
# Train random forest
trcontrol <- trainControl(method = "repeatedcv", number = 2, repeats = 1, search = "random", verboseIter = TRUE)
grid <- data.frame(mtry = c(100))
rf_model <- train(fraudulent ~ ., method = "rf", data = model_dftrain, ntree = 200, trControl = trcontrol, tuneGrid = grid)
## + Fold1.Rep1: mtry=100
## - Fold1.Rep1: mtry=100
## + Fold2.Rep1: mtry=100
## - Fold2.Rep1: mtry=100
## Aggregating results
## Fitting final model on full training set
rf_model
## Random Forest
##
## 12516 samples
## 312 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold, repeated 1 times)
## Summary of sample sizes: 6258, 6258
## Resampling results:
##
## Accuracy Kappa
## 0.9691595 0.5253441
##
## Tuning parameter 'mtry' was held constant at a value of 100
Predict the testing set
rf_pred_test <- predict(rf_model, newdata = model_dftest)
Calculate AUC of the model
paste("AUC of Random Forest is", round(calcAUC(rf_pred_test, model_dftest$fraudulent), digits=4))
## [1] "AUC of Random Forest is 0.8028"
K-Nearest Neighbor (KNN)
# Train knn
knn <- kknn(fraudulent ~ ., model_dftrain, model_dftest, k = 25)
# View(knn)
Predict the testing set
knn_pred_test <- predict(knn, newdata = model_dftest)
Calculate AUC of the model
paste("AUC of KNN is", round(calcAUC(knn_pred_test, model_dftest$fraudulent), digits=4))
## [1] "AUC of KNN is 0.767"
XGBoost
x_train = subset(model_dftrain, select = -c(fraudulent))
y_train = subset(model_dftrain, select = c(fraudulent))
x_test = subset(model_dftest, select = -c(fraudulent))
y_test= subset(model_dftest, select = c(fraudulent))
x_train = as.matrix(x_train)
y_train = as.matrix(y_train)
x_test = as.matrix(x_test)
y_test = as.matrix(y_test)
xgboost_train = xgb.DMatrix(data=x_train, label=y_train)
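xgboost_test = xgb.DMatrix(data=x_test, label=y_test)
# No objective is specified, so the model is fitted as a regression on the 0/1 label (the log below reports RMSE)
model <- xgboost(data = xgboost_train,
max_depth = 3,
eta = 0.1,
nrounds = 100,
booster = "gbtree")
## [1] train-rmse:0.457630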
## [2] train-rmse:0.420081
## [3] train-rmse:0.386988
## [4] train-rmse:0.357967
## [5] train-rmse:0.332568
## [6] train-rmse:0.310259
## [7] train-rmse:0.290797
## [8] train-rmse:0.273965
## [9] train-rmse:0.259404
## [10] train-rmse:0.246891
## [11] train-rmse:0.236248
## [12] train-rmse:0.227150
## [13] train-rmse:0.219307
## [14] train-rmse:0.212578
## [15] train-rmse:0.207041
## [16] train-rmse:0.202255
## [17] train-rmse:0.198396
## [18] train-rmse:0.194924
## [19] train-rmse:0.192325
## [20] train-rmse:0.189878
## [21] train-rmse:0.187593
## [22] train-rmse:0.186022
## [23] train-rmse:0.184275
## [24] train-rmse:0.182951
## [25] train-rmse:0.181881
## [26] train-rmse:0.181043
## [27] train-rmse:0.180074
## [28] train-rmse:0.179555
## [29] train-rmse:0.178357
## [30] train-rmse:0.177737
## [31] train-rmse:0.176941
## [32] train-rmse:0.176630
## [33] train-rmse:0.176375
## [34] train-rmse:0.175657
## [35] train-rmse:0.175207
## [36] train-rmse:0.174575
## [37] train-rmse:0.174157
## [38] train-rmse:0.173986
## [39] train-rmse:0.173817
## [40] train-rmse:0.173649
## [41] train-rmse:0.172673
## [42] train-rmse:0.172193
## [43] train-rmse:0.172038
## [44] train-rmse:0.171735
## [45] train-rmse:0.171296
## [46] train-rmse:0.171182
## [47] train-rmse:0.170742
## [48] train-rmse:0.170479
## [49] train-rmse:0.170209
## [50] train-rmse:0.169823
## [51] train-rmse:0.169673
## [52] train-rmse:0.169418
## [53] train-rmse:0.169115
## [54] train-rmse:0.168875
## [55] train-rmse:0.168692
## [56] train-rmse:0.168299
## [57] train-rmse:0.167796
## [58] train-rmse:0.167589
## [59] train-rmse:0.167490
## [60] train-rmse:0.167180
## [61] train-rmse:0.167008
## [62] train-rmse:0.166682
## [63] train-rmse:0.166507
## [64] train-rmse:0.166344
## [65] train-rmse:0.165948
## [66] train-rmse:0.165773
## [67] train-rmse:0.165665
## [68] train-rmse:0.165345
## [69] train-rmse:0.164959
## [70] train-rmse:0.164591
## [71] train-rmse:0.164412
## [72] train-rmse:0.164269
## [73] train-rmse:0.164155
## [74] train-rmse:0.163932
## [75] train-rmse:0.163832
## [76] train-rmse:0.163560
## [77] train-rmse:0.163200
## [78] train-rmse:0.162873
## [79] train-rmse:0.162655
## [80] train-rmse:0.162445
## [81] train-rmse:0.162223
## [82] train-rmse:0.162022
## [83] train-rmse:0.161935
## [84] train-rmse:0.161770
## [85] train-rmse:0.161594
## [86] train-rmse:0.161420
## [87] train-rmse:0.161321
## [88] train-rmse:0.160963
## [89] train-rmse:0.160885
## [90] train-rmse:0.160762
## [91] train-rmse:0.160711
## [92] train-rmse:0.160495
## [93] train-rmse:0.160235
## [94] train-rmse:0.160145
## [95] train-rmse:0.160026
## [96] train-rmse:0.159598
## [97] train-rmse:0.159510
## [98] train-rmse:0.159454
## [99] train-rmse:0.159272
## [100] train-rmse:0.159154
Predict the testing set
summary(model)
## Length Class Mode
## handle 1 xgb.Booster.handle externalptr
## raw 115056 -none- raw
## niter 1 -none- numeric
## evaluation_log 2 data.table list
## call 16 -none- call
## params 4 -none- list
## callbacks 2 -none- list
## feature_names 312 -none- character
## nfeatures 1 -none- numeric
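pred_test = predict(model, x_test)
# Classify as fraudulent when the predicted value exceeds 0.5
prediction = as.numeric(pred_test > 0.5)
y_test = as.numeric(y_test)
prediction = as.factor(prediction)
y_test = as.factor(y_test)
Calculate AUC of the model
# Note: calcAUC() takes (predictions, labels); here the true labels y_test are passed first
paste("AUC of XGBoost is", round(calcAUC(y_test, prediction), digits=4))
## [1] "AUC of XGBoost is 0.9545"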
Support Vector Machine (SVM)
fraudulentSVM = svm(formula = fraudulent ~ ., data = model_dftrain, type='C-classification', kernel='linear')
Predict the testing set
fraudulentSVMPrediction = predict(fraudulentSVM, newdata = model_dftest)
Calculate AUC of the model
paste("AUC of SVM is", round(calcAUC(fraudulentSVMPrediction, model_dftest$fraudulent), digits=4))
## [1] "AUC of SVM is 0.759"
Evaluation
Accuracy and area under the curve (AUC) are used to evaluate the effectiveness of models in terms of classifying real and fake job postings. However, the dataset used for training is highly imbalanced. Thus, it is necessary to use F1, precision and recall scores to evaluate the model’s ability to identify both real and fake job postings.
- Accuracy score: Metric that provides a general idea of the model performance.
- AUC score: Measure how well the model can distinguish real and fake job postings.
- Precision score: Percentage of postings predicted as fraudulent that are actually fraudulent.
- Recall score: Percentage of actually fraudulent postings that the model correctly identifies.
- F1 score: Harmonic mean of precision and recall.
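These definitions can be checked by hand from a confusion matrix. The sketch below uses the four counts from the Logistic Regression confusion matrix in the next subsection, with "1" (fraudulent) as the positive class; the variable names are ours.

```r
# Counts from the Logistic Regression confusion matrix
tp <- 158   # predicted fraud, actually fraud
fp <- 67    # predicted fraud, actually real
fn <- 114   # predicted real, actually fraud
tn <- 5025  # predicted real, actually real

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 4))
```

The results agree with the Accuracy, Precision, Recall and F1 lines printed by `confusionMatrix()` for that model, and they show how an accuracy near 97% can coexist with a recall below 60% on imbalanced data.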
Confusion Matrix and Error Metrics of Logistic Regression
confMatrix_lr = confusionMatrix(test$pred_glm, test$fraudulent, mode = "everything", positive = "1")
print(confMatrix_lr)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5025 114
## 1 67 158
##
## Accuracy : 0.9663
## 95% CI : (0.9611, 0.9709)
## No Information Rate : 0.9493
## P-Value [Acc > NIR] : 1.193e-09
##
## Kappa : 0.6183
##
## Mcnemar's Test P-Value : 0.0006282
##
## Sensitivity : 0.58088
## Specificity : 0.98684
## Pos Pred Value : 0.70222
## Neg Pred Value : 0.97782
## Precision : 0.70222
## Recall : 0.58088
## F1 : 0.63581
## Prevalence : 0.05071
## Detection Rate : 0.02946
## Detection Prevalence : 0.04195
## Balanced Accuracy : 0.78386
##
## 'Positive' Class : 1
##
Confusion Matrix and Error Metrics of Random Forest
confMatrix_rf = confusionMatrix(rf_pred_test, model_dftest$fraudulent, mode = "everything", positive = "1")
print(confMatrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5087 107
## 1 5 165
##
## Accuracy : 0.9791
## 95% CI : (0.9749, 0.9828)
## No Information Rate : 0.9493
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7363
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.60662
## Specificity : 0.99902
## Pos Pred Value : 0.97059
## Neg Pred Value : 0.97940
## Precision : 0.97059
## Recall : 0.60662
## F1 : 0.74661
## Prevalence : 0.05071
## Detection Rate : 0.03076
## Detection Prevalence : 0.03169
## Balanced Accuracy : 0.80282
##
## 'Positive' Class : 1
##
Confusion Matrix and Error Metrics of KNN
confMatrix_knn = confusionMatrix(knn_pred_test, model_dftest$fraudulent, mode = "everything", positive = "1")
print(confMatrix_knn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5078 126
## 1 14 146
##
## Accuracy : 0.9739
## 95% CI : (0.9693, 0.978)
## No Information Rate : 0.9493
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6633
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.53676
## Specificity : 0.99725
## Pos Pred Value : 0.91250
## Neg Pred Value : 0.97579
## Precision : 0.91250
## Recall : 0.53676
## F1 : 0.67593
## Prevalence : 0.05071
## Detection Rate : 0.02722
## Detection Prevalence : 0.02983
## Balanced Accuracy : 0.76701
##
## 'Positive' Class : 1
##
Confusion Matrix and Error Metrics of XGBoost
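# Note: confusionMatrix() expects predictions first and true labels second;
# here y_test (the true labels) is passed first, so "Prediction" and "Reference" are interchanged in the output below
conf_mat = confusionMatrix(y_test, prediction, mode = "everything", positive = "1")
print(conf_mat)
## Confusion Matrix and Statistics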
##
## Reference
## Prediction 0 1
## 0 5087 5
## 1 187 85
##
## Accuracy : 0.9642
## 95% CI : (0.9589, 0.969)
## No Information Rate : 0.9832
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4559
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.94444
## Specificity : 0.96454
## Pos Pred Value : 0.31250
## Neg Pred Value : 0.99902
## Precision : 0.31250
## Recall : 0.94444
## F1 : 0.46961
## Prevalence : 0.01678
## Detection Rate : 0.01585
## Detection Prevalence : 0.05071
## Balanced Accuracy : 0.95449
##
## 'Positive' Class : 1
##
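The effect of the argument order on the XGBoost figures can be checked from the four printed cells alone. The following is a minimal sketch in base R; the helper `prf` and the matrix `m` are illustrative names that are not part of the original analysis. It reads the same matrix once with the rows taken as predictions (as printed) and once with the rows taken as true labels (which is what `y_test` contains).

```r
# Precision, recall and F1 from true positives, false positives and false negatives.
prf <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(precision = precision, recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

# Cells of the printed XGBoost matrix: rows = first argument, columns = second argument.
m <- matrix(c(5087, 187, 5, 85), nrow = 2,
            dimnames = list(first_arg = c("0", "1"), second_arg = c("0", "1")))

# Reading 1: rows are predictions (how caret labels the printed output).
as_printed <- prf(tp = m["1", "1"], fp = m["1", "0"], fn = m["0", "1"])

# Reading 2: rows are true labels, because y_test was passed first.
rows_as_truth <- prf(tp = m["1", "1"], fp = m["0", "1"], fn = m["1", "0"])

print(round(rbind(as_printed, rows_as_truth), 4))
```

Under the second reading the row total for class 1 is 187 + 85 = 272, the same number of fraudulent test postings that appears in the Random Forest, KNN and SVM matrices. F1 is identical under both readings because it is symmetric in precision and recall.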
Confusion Matrix and Error Metrics of SVM
confMatrix_svm = confusionMatrix(fraudulentSVMPrediction, model_dftest$fraudulent, mode = "everything", positive = "1")
print(confMatrix_svm)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 5034  128
##          1   58  144
##
## Accuracy : 0.9653
## 95% CI : (0.9601, 0.9701)
## No Information Rate : 0.9493
## P-Value [Acc > NIR] : 9.687e-09
##
## Kappa : 0.5899
##
## Mcnemar's Test P-Value : 4.207e-07
##
## Sensitivity : 0.52941
## Specificity : 0.98861
## Pos Pred Value : 0.71287
## Neg Pred Value : 0.97520
## Precision : 0.71287
## Recall : 0.52941
## F1 : 0.60759
## Prevalence : 0.05071
## Detection Rate : 0.02685
## Detection Prevalence : 0.03766
## Balanced Accuracy : 0.75901
##
## 'Positive' Class : 1
##
Summary of Results
| Metric | Logistic Regression | Random Forest | KNN | XGBoost | SVM |
|---|---|---|---|---|---|
| Accuracy | 0.97 | 0.98 | 0.97 | 0.96 | 0.97 |
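| Precision | 0.70 | 0.97 | 0.91 | 0.94 | 0.71 |
| Recall | 0.58 | 0.61 | 0.54 | 0.31 | 0.53 |
| F1 | 0.64 | 0.75 | 0.68 | 0.47 | 0.61 |
| AUC | 0.95 | 0.80 | 0.77 | 0.95 | 0.76 |
Random Forest achieves the best accuracy (0.98), precision (0.97) and F1 score (0.75). The XGBoost precision and recall are read with the model's predictions as the predicted class: the XGBoost confusion matrix above was computed with the true labels as the first argument, so its printed output lists the two metrics interchanged (precision is 85/90 = 0.94 and recall is 85/272 = 0.31). XGBoost is therefore precise but misses most fraudulent postings, while Logistic Regression has the lowest precision (0.70). Logistic Regression and XGBoost show the highest values in the AUC row (0.95); however, the values listed there for Random Forest, KNN and SVM coincide with the balanced accuracy reported by confusionMatrix, and the XGBoost value coincides with the balanced accuracy of its printed matrix, so this row should be compared with caution. Given the precision and F1 scores, we conclude that Random Forest is the best model for classifying real and fake job postings.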
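AUC and the threshold-based metrics in the table measure different things: AUC is computed from predicted scores, whereas balanced accuracy is computed from hard class predictions at a fixed cut-off. The following base R sketch illustrates the difference on a small set of hypothetical scores (the data and the function names `auc_rank` and `balanced_accuracy` are assumptions made for illustration and are not taken from the models above).

```r
# AUC as the probability that a random positive is scored above a random negative
# (Mann-Whitney U statistic); average ranks give ties a weight of one half.
auc_rank <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  r <- rank(c(pos, neg))
  (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
}

# Balanced accuracy: mean of sensitivity and specificity for hard predictions.
balanced_accuracy <- function(pred, labels) {
  sens <- mean(pred[labels == 1] == 1)
  spec <- mean(pred[labels == 0] == 0)
  (sens + spec) / 2
}

# Hypothetical scores for 4 fraudulent (1) and 6 real (0) postings.
labels <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
scores <- c(0.90, 0.60, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05)
pred <- as.integer(scores >= 0.5)   # hard predictions at a 0.5 cut-off

auc_value <- auc_rank(scores, labels)
bal_acc <- balanced_accuracy(pred, labels)
print(c(auc = auc_value, balanced_accuracy = bal_acc))
```

In this toy example every fraudulent posting is ranked above every real one, so the AUC is 1, yet only two of the four fraudulent postings exceed the 0.5 cut-off and the balanced accuracy is 0.75. A model can thus rank postings well and still have low recall at the default threshold.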
Results Analysis Summary
- What are the key features/characteristics of fraudulent job postings?
Based on the correlation analysis, none of the features is highly correlated with the target feature (fraudulent), so it is difficult to single out key features or characteristics of fraudulent job postings. However, the has_company_logo and has_questions features are negatively correlated with fraudulent. This indicates that a job posting that has a company logo or includes screening questions is less likely to be fraudulent.
- Which classification model is the best to determine whether the job is real or not?
Random Forest is the best classification model for determining whether a job posting is real or fake, because it achieved the best accuracy, precision and F1 score of the five models.
- Other findings
- 74% of fake job postings require only minimal educational credentials (“Some High School Coursework”). This suggests that fake job postings target job seekers with few educational credentials, such as high school students.
- Executive and entry-level jobs that require minimal qualifications and little experience have the highest fraud rate, at nearly 7%. This implies that job seekers who lack experience, such as fresh graduates, are the most likely targets of these fake job postings.
- Many fraudulent job postings share common keywords in their job titles, such as “Data Entry”, “Administrative”, “Home Based” and “Earn Daily”. These words are likely to attract the attention of job seekers.
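The keyword finding can be turned into a simple early-warning flag. The sketch below uses base R and a few hypothetical job titles (the `titles` vector and the variable names are assumptions for illustration, not rows from the dataset); a match is only a warning sign, not proof of fraud.

```r
# Keywords reported as common in fraudulent job titles.
keywords <- c("data entry", "administrative", "home based", "earn daily")

# One case-insensitive pattern that matches any of the keywords.
pattern <- paste(keywords, collapse = "|")

# Hypothetical job titles used only to demonstrate the check.
titles <- c("Data Entry Clerk - Home Based",
            "Senior Software Engineer",
            "Administrative Assistant",
            "Earn Daily Cash From Home",
            "Marketing Manager")

# TRUE where a title contains at least one keyword.
flagged <- grepl(pattern, titles, ignore.case = TRUE)
print(data.frame(title = titles, flagged = flagged))
```

Such a flag could also be added as a binary feature alongside has_company_logo and has_questions, although its usefulness would need to be checked on the training data.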
Limitation and Improvement
The dataset is highly imbalanced: most of the job postings are legitimate and only a few are fraudulent. As a result, real jobs are identified well, while a large share of fraudulent postings is missed by most models. Techniques for handling imbalanced data, such as SMOTE, can be applied to the training set so that the models learn from a comparable number of real and fraudulent postings. In addition, other NLP representations, such as a TF-IDF vectorizer, can be tried to find a better numerical representation of the text fields for the ML models.
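To illustrate the idea behind SMOTE, the following is a minimal base R sketch, not a production implementation: it creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. The function name `smote_sketch`, the toy matrix `X_fraud` and the parameter values are assumptions; it handles numeric features only and should be applied to the training split alone. Maintained implementations exist in packages such as smotefamily.

```r
# Simplified SMOTE-style oversampling for numeric features.
# X_min: matrix of minority-class rows; n_new: number of synthetic rows;
# k: number of nearest minority neighbours to interpolate towards.
smote_sketch <- function(X_min, n_new, k = 5) {
  X_min <- as.matrix(X_min)
  n <- nrow(X_min)
  stopifnot(n >= 2)
  k <- min(k, n - 1)
  d <- as.matrix(dist(X_min))   # pairwise Euclidean distances
  diag(d) <- Inf                # a point is not its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)                 # random minority point
    nbrs <- order(d[base, ])[seq_len(k)]     # its k nearest neighbours
    nb <- nbrs[sample.int(k, 1)]             # pick one neighbour at random
    gap <- runif(1)                          # position along the segment
    synth[i, ] <- X_min[base, ] + gap * (X_min[nb, ] - X_min[base, ])
  }
  synth
}

set.seed(42)
# Hypothetical numeric features for 6 fraudulent postings (2 features each).
X_fraud <- matrix(rnorm(12), ncol = 2)
X_new <- smote_sketch(X_fraud, n_new = 10, k = 3)
print(dim(X_new))
```

Because every synthetic row lies on a segment between two existing fraudulent rows, the new rows stay inside the range of the observed minority data rather than duplicating rows exactly, which is the main difference from plain random oversampling.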
Conclusion
In most instances, if something appears too good to be true, it probably is. Most fraudulent job descriptions and requirements are vague and too good to be true, for example easy work for unrealistic pay. Be wary of part-time, entry-level jobs that require minimal qualifications and little experience, such as data entry and administrative roles. Home-based jobs and listings without a company logo are further warning signs. In terms of classification models, Random Forest gives the best accuracy, precision and F1 score; however, better results could be achieved with a more balanced dataset containing sufficient examples of both real and fake job postings. Finally, with a little research, job seekers can not only find out whether a company and a job are legitimate but also discover whether the company is the right fit.