In this assignment, you will be immersed in a real-world scenario. You work for the Ministry of Labor, which is currently analysing data on job postings. Your team has realized that a significant share of job posts are fake, and wants to be able to detect which job posts are fake and to understand what distinguishes these posts from real ones.

Your goal is to understand the main differences between a real and a fake job posting, and to build a model to predict whether a job posting is fake. Because the people who will read the ministry’s report are not economists, the department asked you to write a short report explaining intuitively (1) the insights you get from the data, (2) the best model you found, and (3) the performance of the model, including a discussion of how much it should be trusted and what error rate we should expect when it is applied to new data.

At the end, you will come up with a model that will be tested on a dataset of job postings you have no access to. You will be competing against other groups and part of your grade on the assignment (20%) will depend on the relative performance of your model in this data, measured by its accuracy. More details in question 9.

Practicalities

Your submission must be uploaded on Moodle by TBD (one submission by group).

Deliverables: You must deliver two components: (i) your code and (ii) a discussion of your results to present to the manager of the company.

You can either complete this .Rmd file with both the code and your answers/interpretations (and then knit it to an html or pdf file) OR deliver the code and a short report with your insights separately.

Be complete but succinct. Follow the questions below as the guideline for your code and discussion. Make sure your code is executable, so that I can evaluate your performance.

Prep

# PRELIMINARIES

rm(list=ls()) 

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(tibble)
library(readr)
library(haven)
library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:tidyr':
## 
##     extract

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

# Text mining library
library(tidytext)

# This sets global options
knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE)

# Start from scratch - clean your environment
rm(list = ls())

Questions

Start by loading the data fake_job_postings_students.csv and take a look at it. The column fraudulent indicates whether a given job posting is real (0) or fake (1).

Here we load the data using the read_csv() function. The data contains 777 observations for fraudulent job posts and 15,315 observations for real job posts. We obtain these numbers with the count() function, which counts the number of rows where fraudulent == 1 and the number of rows where fraudulent == 0.

# In order to load the data, we use the function read_csv 
job_data <- read_csv("fake_job_postings_students.csv")
job_data %>% count(fraudulent)

## # A tibble: 2 × 2
##   fraudulent     n
##        <dbl> <int>
## 1          0 15315
## 2          1   777

Let’s start by quickly exploring some differences between fraudulent and real job postings. Create a table with the percentage of job postings that have the company logo, have questions and allow telecommuting, split by whether the post is real or fake. What do you conclude?

As fraudulent is a binary variable, we use group_by() to compute the averages separately for fraudulent and real job posts. We then call summarise() right after group_by() to compute the mean of each column listed in the vector c(). Because these columns are coded 0/1, their mean multiplied by 100 is the percentage of postings with that characteristic.

job_percentage <- job_data %>% 
  group_by(fraudulent) %>% 
  summarise(across(c("has_company_logo", "has_questions", "telecommuting"), 
                   ~ mean(.x, na.rm = TRUE)*100))



job_percentage <- job_percentage %>% 
  as.data.frame() %>% 
  pivot_longer(cols = -fraudulent, 
               names_to = "Job characteristics", 
               values_to = "Percentage") %>% 
  pivot_wider(names_from = fraudulent, values_from = Percentage) %>%  
  rename(Real_job = "0", Fake_job = "1")

job_percentage

## # A tibble: 3 × 3
##   `Job characteristics` Real_job Fake_job
##   <chr>                    <dbl>    <dbl>
## 1 has_company_logo         81.9     31.8 
## 2 has_questions            50.2     28.7 
## 3 telecommuting             4.13     7.21

Interpretation : the table shows that about 82% of real job postings contain the company logo while only about 32% of the fake job postings have one. One possible explanation is that fraudsters cannot legitimately use an official logo and do not bother to create a fake one.

About 50% of the real job postings contain questions while only about 29% of the fake job postings do, which suggests that the presence of questions is a useful variable to differentiate real and fake job postings.

Finally, the variable telecommuting is less useful to differentiate the two types of job posts: about 4% of real job postings mention telecommuting (defined as the possibility of working away from the workplace) against about 7% of fake job postings. Fake postings mention it more often in relative terms, but it is rare in both groups.

The ministry commonly uses these variables to detect fake job postings, but you learned in your master’s degree how to analyse text data. You believe that analysing the text in the job will yield important insights.

We use the function unnest_tokens() to split the descriptions word by word, which gives one row per word. The goal is then to plot the most common words used in fake and real job postings taken together.

Split the description of the job post by word (tokens) and represent graphically the most common words (the 15 most common).

# We use the function unnest_tokens() to split the texts into lists of words 

description_word <- job_data %>% 
  select(description, fraudulent) %>% 
  unnest_tokens(word, "description")

description_word %>% 
  count(word, sort = TRUE) %>% 
  slice_max(n, n = 15) %>%   # keep the 15 most frequent words
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "orange", alpha = 0.7) +
  coord_flip() + 
  labs(x = "Unique word", y = "count") + 
  ggtitle("Most common words in job description") + 
  theme_minimal()

Is the previous plot informative? Why/Why not? Improve it by eliminating stop words. Explain briefly what these words are.

Answer : The previous plot is not informative because it is dominated by stop words such as “in”, “for”, “will”… These words are not useful in our analysis for two reasons.

First, stop words do not carry any meaningful information on their own, as they only serve to build sentences and connect other words. They can be considered noise that we want to get rid of.

Second, these words are used very frequently in everyday language, so their frequency in our data is very high and hides the words that matter. Removing the stop words reduces the dimensionality of our data (the number of distinct words, and so the number of features) and keeps only the information that is essential for our analysis.

The graph is also not informative enough because it does not split the data between fraudulent and real job posts. The whole goal of this research is to find the main differences between real and fake job postings, in order to predict whether a job posting is fraudulent or not.

It would therefore be more interesting to know the most common words in fraudulent postings and in real postings separately, and to identify the words that are specific to each type of post.

data("stop_words")
# Remove every token that appears in the tidytext stop-word list
description_nostop <- description_word %>% 
  anti_join(stop_words, by = "word")
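
To make the effect of stop-word removal concrete, here is a minimal sketch in base R. The sentence `toy_text` and the short list `toy_stop_words` are hypothetical and only serve as an illustration; the analysis itself relies on the much longer stop_words table from tidytext.

```r
# Toy illustration of stop-word removal (base R only).
# `toy_text` and `toy_stop_words` are made up for this example.
toy_text <- "We are looking for a data analyst to join the team in Paris"
toy_stop_words <- c("we", "are", "for", "a", "to", "the", "in")

# Lowercase the text and split it on anything that is not a letter
tokens <- strsplit(tolower(toy_text), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]

# Keep only the tokens that are not in the stop-word list
content_tokens <- tokens[!tokens %in% toy_stop_words]

print(content_tokens)
length(tokens)          # number of tokens before filtering
length(content_tokens)  # number of tokens after filtering
```

In this toy case more than half of the tokens are stop words, which is why they crowd out the informative words in a frequency plot.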

Do the same plot but now splitting between real and fake job posts. What insights do you get?

Answer : These separate plots help us understand the differences between real and fake job postings in two ways :

Firstly, they no longer contain stop words, so the words shown carry more meaning and can help our analysis.

Secondly, they allow us to compare the most common words in real and fake job posts. For example, we can see that fraudulent and real job posts share several frequent words, such as “team”, “experience”, or “management”, but do not use them with the same frequency (for example the word “management” is the 7th most common word in fraudulent job posts while it is the 9th most common word in real posts).

# First we create two datasets, one containing the words from fraudulent job posts and one containing the words from real job posts 

fraudulent_word <- description_nostop %>% 
  filter(fraudulent == 1)   # fraudulent is numeric, so compare with a number

real_word <- description_nostop %>% 
  filter(fraudulent == 0)

fraudulent_plot <- fraudulent_word %>% 
  count(word, sort = TRUE) %>% 
  slice_max(n, n = 15) %>%   # keep the 15 most frequent words
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "blue", alpha = 0.5) +
  coord_flip() + 
  labs(x = "Unique word", y = "count") + 
  ggtitle("Most common words (fraudulent job posts)") + 
  theme_minimal() + 
  theme(plot.title = element_text(color = "black", size = 9, face = "bold.italic"))


real_plot <- real_word %>% 
  count(word, sort = TRUE) %>% 
  slice_max(n, n = 15) %>%   # keep the 15 most frequent words
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "green", alpha = 0.5) +
  coord_flip() + 
  labs(x = "Unique word", y = "count") + 
  ggtitle("Most common words (real job posts)") +
  theme_minimal() + 
  theme(plot.title = element_text(color = "black", size = 9, face = "bold.italic"))

grid.arrange(fraudulent_plot, real_plot, ncol=2)

To predict whether a job posting is fraudulent, it is important to look beyond words, and analyse what we call “text features”. These include characteristics like the text length, number of characters per word, number of digits, number of prepositions etc. One package in R that extracts text features is textfeatures. Install and explore the documentation of this package. Next, extract the text features from the descriptions in the dataset. In the function textfeatures, set word_dims=FALSE. Additional hint: It may be useful to join the dataset with the features (after extraction) with the dataset of the job postings (to keep all other variables).

if (!require(textfeatures)) {
  install.packages("textfeatures")
  library(textfeatures)
}

text_features <- textfeatures(job_data$description,
                          normalize = FALSE,
                          sentiment = FALSE,
                          word_dims = 0)

## ↪ Counting features in text...
## ↪ Parts of speech...
## ↪ Word dimensions started
## ✔ Job's done!

job_features <- cbind(job_data, text_features)

Note : we keep the variables containing _uq_ because they carry additional information. Take the features n_words and n_uq_words. While n_words tells us how many words are in each job posting, n_uq_words tells us how many distinct words are used. So n_words measures the length of the text, while n_uq_words measures the diversity of its vocabulary.

For example, a job posting with a high n_uq_words value uses a rich and diverse vocabulary. On the contrary, a job posting with a low n_uq_words value relative to its length uses a restricted, repetitive vocabulary.
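
The difference between the two counts can be illustrated with a small base-R sketch. The helper `count_words()` and the two example sentences are hypothetical, and the tokenisation is deliberately simple, so the numbers need not match exactly what textfeatures computes on the same text.

```r
# Hypothetical helper: total words, distinct words, and their ratio for one text
count_words <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[words != ""]
  c(n_words    = length(words),
    n_uq_words = length(unique(words)),
    uq_share   = length(unique(words)) / length(words))
}

# Two made-up descriptions of similar length
repetitive <- "earn money fast earn money now earn money today"
varied     <- "analyse sales data and present findings to the regional manager"

rbind(repetitive = count_words(repetitive),
      varied     = count_words(varied))
```

Because longer texts mechanically contain more distinct words, the share of unique words (`uq_share` here) is often easier to compare across postings than the raw count.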

Find now the average of the text features for each of the two types of posts. Then, represent the features for real vs fake posts, either in a plot or a table. Interpret your findings. Which type of job posting has usually more characters? Which has more digits? Why do you think this happens?

avg_fraudulent <- job_features %>% 
  select(fraudulent, 20:48) %>%   # columns 20 to 48 hold the extracted text features
  group_by(fraudulent) %>% 
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# We can represent the results in a bar chart
# First, we need to pivot the data from wide to long format 

avg_fraudulent_long <- avg_fraudulent %>%  
  pivot_longer(cols = -fraudulent, names_to = "feature", values_to = "average") 

ggplot(avg_fraudulent_long, aes(x = feature, y = average, fill = as.factor(fraudulent))) + 
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "text feature", y = "average value", fill = "is the job post fraudulent ?") + 
  theme_light() + 
  scale_fill_discrete(name = "job post type", breaks = c(0,1), labels = c("Real post", "Fraudulent post")) + 
  coord_flip()

Analysis : We use a graph to look at the main differences between the two types of job postings. For example :

Real job posts have more words on average than fake job posts. This could be explained by the fact that real job posts are written by managers or recruiters who know the position well. Scammers, in contrast, may stay vague and general since they have no detailed knowledge of the fictitious job they advertise.
Real job posts have more lowercase letters than fraudulent job posts. This partly follows from their greater length, and possibly also from a more formal, professional style that relies less on capital letters.
Fraudulent job posts contain more uppercase letters. A plausible explanation is that fake job posts use capital letters to grab attention, for example by emphasizing certain strategic words.
Fraudulent job posts contain more digits. This could be explained by the fact that they mention more numbers, such as high salaries, in order to attract people.

We will now start building a model to predict whether a job posting is spam. We will use two types of models: a multiple linear regression model and a decision tree.

The extracted features will be our explanatory variables, coupled with the following variables: telecommuting, has_company_logo, employment_type and required_education. Prepare a dataset with text features from the job description and these variables.

Start by splitting the data into a training and a testing set. Explain why this is important.

Answer : Splitting the data into a training set and a testing set is crucial in machine learning. The training set is the data on which we fit our predictive model. The testing set is held out and used only to evaluate the model (this is the out-of-sample evaluation).

The goal of using a test set is to check that our model generalizes well, in other words that it makes good predictions on data it has never seen before.

Checking the performance on the test set also reveals overfitting, the case where the model memorizes the particularities of the training data and then loses the ability to adapt to new data.

# First step : create a data containing our variables of interest
job_features$location <- substr(job_features$location, start=1, stop=2)

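# Note: na.omit() drops every row with a missing value in any column,
# which removes most postings and leaves only a few hundred rows for modelling
job_features <- job_features %>%  
  na.omit()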

lower_salary <- as.numeric(sapply(strsplit(job_features$salary_range, "-"), "[[", 1))
upper_salary <- as.numeric(sapply(strsplit(job_features$salary_range, "-"), "[[", 2))
job_features$salary_range <- (lower_salary + upper_salary) / 2



reg_data <- job_features %>% 
  select(fraudulent, telecommuting, has_company_logo, employment_type, required_education, 20:48) %>% 
  na.omit()

# Second step : splitting the data into training and testing set 
# Each row is assigned to the training set with probability 0.7 and to the testing set with probability 0.3
set.seed(123)
is_train <- sample(c(TRUE, FALSE), nrow(reg_data), replace = TRUE, prob = c(0.7, 0.3))
train <- reg_data[is_train, ]
test <- reg_data[!is_train, ]
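
With a rare outcome such as fraud, a purely random split can leave slightly different shares of fake postings in the two sets. One possible refinement, sketched below in base R on hypothetical toy data (the data frame `toy` and its 5% fraud rate are assumptions for illustration, not the real dataset), is a stratified split that samples 70% of the rows within each class.

```r
set.seed(123)

# Hypothetical toy data: 1000 postings, roughly 5% of them fraudulent
toy <- data.frame(id = 1:1000,
                  fraudulent = rbinom(1000, size = 1, prob = 0.05))

# Draw 70% of the row indices separately within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$fraudulent)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.7 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The share of fraudulent postings should be almost identical in both sets
mean(toy_train$fraudulent)
mean(toy_test$fraudulent)
```

The caret package offers createDataPartition() for the same purpose; the manual version above only makes the mechanism explicit.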

Fit a logit regression to the training set (using glm()), where the dependent variable is a binary variable equal to 1 if the posting is fake, and equal to zero otherwise. Interpret the output of the regression. Use the model to predict whether a job posting is fake in your test set. How accurate are your predictions? Hint: It may be useful to use the function confusionMatrix from the package caret.

Interpretation of the method: For the first prediction method, we fit a logit regression with the glm() function. A generalized linear model with family = "binomial" is appropriate when the outcome is binary (here fraudulent is either 1 or 0). We predict fraudulent using all the other variables in the training data as regressors / explanatory variables.

The goal of the model is to predict the probability that a job posting is fake (fraudulent = 1) rather than real (fraudulent = 0).

Interpretation of the output of the regression : The regression output gives the estimated coefficient for each regressor, its standard error (in parentheses) and significance stars. The stars summarise the p-value: *** means p < 0.01, ** means p < 0.05 and * means p < 0.1. The p-value is the probability of obtaining an estimate at least as far from zero as the one observed if the true coefficient were zero, that is, if the variable had no effect on the outcome.

So the higher the p-value, the weaker the evidence that the variable is related to the outcome, and the less significant the variable is in the regression.

Let’s interpret some values from the regression table :

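The coefficient associated with has_company_logo is -0.349. Logit coefficients are expressed in log-odds, not in percentage points: holding the other variables fixed, a posting with a company logo has its odds of being fraudulent multiplied by exp(-0.349) ≈ 0.71, that is about 29% lower. The standard error (1.110) is large relative to the estimate, so this effect is not statistically significant in our sample.
The coefficient associated with n_hashtags is 0.810, which would mean that each additional hashtag multiplies the odds of being fraudulent by exp(0.810) ≈ 2.25. However, its standard error (4,957.770) is enormous, so the estimate is uninformative. Such inflated standard errors, which also appear for several employment_type and required_education categories, typically arise when a feature is very rare or almost perfectly predicts the outcome in a small sample.
The only coefficients significant at the 5% level are n_words, n_prepositions and n_charsperword, all negative: longer descriptions, more prepositions and longer words are associated with lower odds of being fraudulent.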
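
Since a logit coefficient is a change in log-odds, its effect on the predicted probability depends on the starting point. The sketch below uses the has_company_logo estimate reported in the regression table (-0.349); the baseline probabilities are hypothetical values chosen only to illustrate the conversion.

```r
# Coefficient for has_company_logo as reported in the regression table
beta_logo <- -0.349

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(beta_logo)
odds_ratio

# Hypothetical baseline probabilities of being fraudulent without a logo
baseline_p <- c(0.05, 0.20, 0.50)

# Move to the log-odds scale, add the coefficient, and come back to probabilities
with_logo_p <- plogis(qlogis(baseline_p) + beta_logo)

round(data.frame(baseline_p,
                 with_logo_p,
                 change = with_logo_p - baseline_p), 3)
```

The same coefficient therefore shifts the probability by a different number of percentage points at each baseline, which is why it cannot be read directly as a change in probability.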

# First step : fit the regression model 
if (!require(stargazer)) {
  install.packages("stargazer")
  library(stargazer)
}

logit_model <- glm(fraudulent ~ ., family = "binomial", data = train)
stargazer(logit_model, type = "text")

## 
## ===============================================================================
##                                                         Dependent variable:    
##                                                     ---------------------------
##                                                             fraudulent         
## -------------------------------------------------------------------------------
## telecommuting                                                 -0.189           
##                                                               (0.956)          
##                                                                                
## has_company_logo                                              -0.349           
##                                                               (1.110)          
##                                                                                
## employment_typeFull-time                                      -2.678*          
##                                                               (1.518)          
##                                                                                
## employment_typeOther                                          -21.652          
##                                                             (9,065.526)        
##                                                                                
## employment_typePart-time                                      -0.300           
##                                                               (2.515)          
##                                                                                
## employment_typeTemporary                                      -21.725          
##                                                             (7,414.435)        
##                                                                                
## required_educationBachelor's Degree                           14.625           
##                                                             (2,495.072)        
##                                                                                
## required_educationCertification                               19.721           
##                                                             (2,495.072)        
##                                                                                
## required_educationHigh School or equivalent                   18.597           
##                                                             (2,495.072)        
##                                                                                
## required_educationMaster's Degree                             -5.463           
##                                                             (3,950.473)        
##                                                                                
## required_educationProfessional                                21.941           
##                                                            (21,629.190)        
##                                                                                
## required_educationSome College Coursework Completed           14.363           
##                                                             (2,495.074)        
##                                                                                
## required_educationUnspecified                                 -0.318           
##                                                             (2,980.144)        
##                                                                                
## required_educationVocational                                  -3.229           
##                                                            (17,905.070)        
##                                                                                
## required_educationVocational - HS Diploma                      2.139           
##                                                            (17,905.070)        
##                                                                                
## n_urls                                                        -17.227          
##                                                            (12,133.860)        
##                                                                                
## n_uq_urls                                                                      
##                                                                                
##                                                                                
## n_hashtags                                                     0.810           
##                                                             (4,957.770)        
##                                                                                
## n_uq_hashtags                                                 -0.392           
##                                                             (4,957.770)        
##                                                                                
## n_mentions                                                                     
##                                                                                
##                                                                                
## n_uq_mentions                                                                  
##                                                                                
##                                                                                
## n_chars                                                        0.213           
##                                                               (0.187)          
##                                                                                
## n_uq_chars                                                     0.164           
##                                                               (0.138)          
##                                                                                
## n_commas                                                      -0.242           
##                                                               (0.205)          
##                                                                                
## n_digits                                                      -0.337           
##                                                               (0.249)          
##                                                                                
## n_exclaims                                                    -0.777           
##                                                               (1.193)          
##                                                                                
## n_extraspaces                                                  0.124           
##                                                               (0.098)          
##                                                                                
## n_lowers                                                      -0.191           
##                                                               (0.187)          
##                                                                                
## n_lowersp                                                     -2.769           
##                                                              (53.541)          
##                                                                                
## n_periods                                                     -0.194           
##                                                               (0.208)          
##                                                                                
## n_words                                                      -0.179***         
##                                                               (0.062)          
##                                                                                
## n_uq_words                                                     0.061           
##                                                               (0.059)          
##                                                                                
## n_caps                                                        -0.123           
##                                                               (0.190)          
##                                                                                
## n_nonasciis                                                   -0.072           
##                                                               (0.084)          
##                                                                                
## n_puncts                                                      -0.310*          
##                                                               (0.179)          
##                                                                                
## n_capsp                                                      -102.093*         
##                                                              (61.977)          
##                                                                                
## n_charsperword                                               -2.107**          
##                                                               (0.957)          
##                                                                                
## n_first_person                                                 0.746           
##                                                               (0.572)          
##                                                                                
## n_first_personp                                               -0.551           
##                                                               (0.414)          
##                                                                                
## n_second_person                                               -0.766           
##                                                               (0.525)          
##                                                                                
## n_second_personp                                              -0.648           
##                                                               (0.780)          
##                                                                                
## n_third_person                                                -0.075           
##                                                               (0.383)          
##                                                                                
## n_tobe                                                         0.013           
##                                                               (0.254)          
##                                                                                
## n_prepositions                                               -0.565***         
##                                                               (0.211)          
##                                                                                
## Constant                                                      -0.163           
##                                                             (2,495.692)        
##                                                                                
## -------------------------------------------------------------------------------
## Observations                                                    481            
## Log Likelihood                                                -52.026          
## Akaike Inf. Crit.                                             188.053          
## ===============================================================================
## Note:                                               *p<0.1; **p<0.05; ***p<0.01

# Second step : predict whether a job is fake on the test set using the glm model 
if (!require(caret)) {
  install.packages("caret")
  library(caret)
}


prediction_reg <- predict(logit_model, test, type = "response")
predicted_fraudulent <- ifelse(prediction_reg > 0.4, 1, 0)
actual_class <- test$fraudulent
cm <- confusionMatrix(as.factor(predicted_fraudulent), as.factor(actual_class))
print(cm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 183   8
##          1   5  15
##                                           
##                Accuracy : 0.9384          
##                  95% CI : (0.8969, 0.9668)
##     No Information Rate : 0.891           
##     P-Value [Acc > NIR] : 0.01311         
##                                           
##                   Kappa : 0.6636          
##                                           
##  Mcnemar's Test P-Value : 0.57910         
##                                           
##             Sensitivity : 0.9734          
##             Specificity : 0.6522          
##          Pos Pred Value : 0.9581          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.8910          
##          Detection Rate : 0.8673          
##    Detection Prevalence : 0.9052          
##       Balanced Accuracy : 0.8128          
##                                           
##        'Positive' Class : 0               
##

Looking at the confusion matrix you obtained before, is there an issue with how the model is working? Would you use this model if someone tells you that it is very important to detect fake postings? Why? Why not? Hint: Maybe accuracy is not the only metric we should look at, look at the percentage of job postings that are predicted to be fake and comment on the usefulness of the model.

Interpretation of the matrix result: first, it is important to understand the metrics reported with the matrix. Note that caret treats class 0 (real job posts) as the "positive" class here.

True positives: real job posts in the data that were correctly predicted as real by the model
False positives: fraudulent job posts in the data that were wrongly predicted as real by the model
True negatives: fraudulent job posts in the data that were correctly predicted as fraudulent by the model
False negatives: real job posts in the data that were wrongly predicted as fraudulent by the model
Accuracy: the proportion of correct predictions out of all the predictions made by our model
Sensitivity: the proportion of actual positives that the model predicted correctly. In our case, it measures the ability of our model to recognize real job posts.
Specificity: the proportion of actual negatives that the model predicted correctly. In our case, it measures the ability of our model to recognize fraudulent job posts.
Balanced accuracy: the average of sensitivity and specificity. It is particularly important in our case because we are facing an imbalanced dataset, with a majority of real job posts and a minority of fraudulent posts. Balanced accuracy measures the model's ability to predict both real and fraudulent job posts, so it is more informative here than plain accuracy, which is too simplistic.
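As a quick check of these definitions, the sketch below recomputes the metrics by hand from the four counts in the confusion matrix above. The variable names are illustrative and not part of the original code.

```r
# Counts read from the logit confusion matrix (rows = prediction, columns = reference).
# caret treats class 0 (real post) as the 'positive' class here.
tp <- 183  # real posts predicted as real
fn <- 5    # real posts predicted as fraudulent
fp <- 8    # fraudulent posts predicted as real
tn <- 15   # fraudulent posts predicted as fraudulent

sensitivity <- tp / (tp + fn)                  # share of real posts recognized
specificity <- tn / (tn + fp)                  # share of fraudulent posts recognized
accuracy <- (tp + tn) / (tp + fn + fp + tn)    # share of all posts classified correctly
balanced_accuracy <- (sensitivity + specificity) / 2

metrics <- round(c(accuracy = accuracy,
                   sensitivity = sensitivity,
                   specificity = specificity,
                   balanced_accuracy = balanced_accuracy), 4)
print(metrics)
```

The rounded values should match the caret output: 0.9384, 0.9734, 0.6522 and 0.8128.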

Looking at the confusion matrix we obtained, we see that while the regression model predicts real posts (the positive class) very well, it does a poorer job on fraudulent posts. The model has a high sensitivity (0.97) but a much lower specificity (0.65): 8 of the 23 fraudulent posts in the test set are classified as real.

So if we are told that it is very important to detect fake postings, this regression model is not well suited to that goal, as it misses about a third of the fraudulent postings. It flags only 20 of the 211 test posts (9.5%) as fake, and 5 of those are actually real. Even though the accuracy of the model is high (0.94), it is only about 5 points above the no-information rate (0.89) that we would obtain by predicting every post as real.

So relying only on accuracy can be misleading, especially since our dataset contains a majority of real job postings, which naturally leads to a better prediction of the majority class. Looking at the balanced accuracy and the specificity is more informative, as they show that the model is not well suited to our goal of predicting fake postings.
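To make the point about accuracy concrete, the sketch below compares the logit model with a trivial rule that labels every post as real, using the test-set counts from the confusion matrix above. The variable names are illustrative.

```r
# Test-set composition, read from the reference columns of the confusion matrix
n_real <- 183 + 5   # 188 real posts
n_fake <- 8 + 15    # 23 fraudulent posts
n_total <- n_real + n_fake

# Trivial rule: predict "real" for every post
baseline_accuracy <- n_real / n_total   # equals the No Information Rate
baseline_specificity <- 0 / n_fake      # no fraudulent post is ever detected

# Logit model: how many posts does it flag as fake, and how many fakes does it catch?
share_predicted_fake <- (5 + 15) / n_total
share_actual_fake <- n_fake / n_total
share_fakes_caught <- 15 / n_fake

print(round(c(baseline_accuracy = baseline_accuracy,
              baseline_specificity = baseline_specificity,
              share_predicted_fake = share_predicted_fake,
              share_actual_fake = share_actual_fake,
              share_fakes_caught = share_fakes_caught), 4))
```

The trivial rule already reaches an accuracy of about 0.89 while detecting no fraudulent post at all, which is why accuracy alone says little about fraud detection here.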

Do the same now using a decision tree. You are free to play with the parameters of the tree, so as to maximize accuracy in the testing set. Represent the tree graphically. Explain briefly what you see.

# First step : fitting the decision tree on the train dataset

# cp is the complexity parameter. It defines the minimum gain in fit that each split must achieve to be kept. A high cp results in a smaller tree (because each split must deliver a larger improvement), while a low cp results in a large tree with many nodes, which can lead to over-fitting. 

# minsplit defines the minimum number of observations that a node must contain for an additional split to be attempted. 

if (!require(rpart)) {
  install.packages("rpart", dependencies = TRUE)
  library(rpart)
}
if (!require(rpart.plot)) {
  install.packages("rpart.plot", dependencies = TRUE)
  library(rpart.plot)
}

tree <- rpart(fraudulent ~ .,  
                     method="class", 
                     control=rpart.control(minsplit=30, cp=0.002, maxdepth=5), 
                     data = train)



# printcp() reports the cross-validation error (xerror) for each value of cp (the complexity parameter); we selected the cp associated with the lowest xerror
printcp(tree)

## 
## Classification tree:
## rpart(formula = fraudulent ~ ., data = train, method = "class", 
##     control = rpart.control(minsplit = 30, cp = 0.002, maxdepth = 5))
## 
## Variables actually used in tree construction:
## [1] n_charsperword     n_tobe             required_education
## 
## Root node error: 38/481 = 0.079002
## 
## n= 481 
## 
##        CP nsplit rel error  xerror    xstd
## 1 0.13158      0   1.00000 1.00000 0.15568
## 2 0.00200      3   0.60526 0.84211 0.14383

print(tree$cptable)

##          CP nsplit rel error    xerror      xstd
## 1 0.1315789      0 1.0000000 1.0000000 0.1556817
## 2 0.0020000      3 0.6052632 0.8421053 0.1438275
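As a small sketch of how the cp with the lowest cross-validation error can be read off programmatically, the table printed above is rebuilt by hand below so that the snippet runs without the fitted tree. With the real object, one would use tree$cptable directly; the names cptable, best_row and best_cp are illustrative.

```r
# Hand-copied version of tree$cptable (values taken from the output above)
cptable <- matrix(
  c(0.1315789, 0, 1.0000000, 1.0000000, 0.1556817,
    0.0020000, 3, 0.6052632, 0.8421053, 0.1438275),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1", "2"),
                  c("CP", "nsplit", "rel error", "xerror", "xstd"))
)

# Row with the lowest cross-validation error, and the cp / number of splits on that row
best_row <- which.min(cptable[, "xerror"])
best_cp <- unname(cptable[best_row, "CP"])
best_nsplit <- unname(cptable[best_row, "nsplit"])

print(c(best_cp = best_cp, best_nsplit = best_nsplit))
```

With the fitted object, prune(tree, cp = best_cp) from rpart would then return the tree pruned at that value.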

rpart.plot(tree)

Analysis of the tree form: the result of our code is a decision tree, which applies a binary split to the data at each internal node.

The tree starts with all observations of the data (here the training dataset) and then splits the data according to the characteristics we selected in the data (text features, has_company_logo,…). According to the printcp() output, only three variables are actually used in the splits: n_charsperword, n_tobe and required_education.

The root node and each decision node in the middle of the tree contain a condition used to split the data, while the leaf nodes at the end give the predicted class. In our case, we start from the root node and, after 3 splits, end up with 4 leaf nodes.

The general strategy of the tree is to perform efficient splits, that is to say splits that achieve the largest possible reduction in class impurity among the child nodes (rpart uses the Gini index by default for classification), while avoiding over-fitting of the model.
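The idea of an "efficient" split can be illustrated with the Gini impurity. The sketch below uses the real root-node counts (38 fraudulent posts out of 481), but the child-node counts are hypothetical and chosen only for illustration; they are not the splits found by rpart.

```r
# Gini impurity of a node with two classes: 2 * p * (1 - p),
# where p is the share of fraudulent posts in the node
gini <- function(n_fake, n_total) {
  p <- n_fake / n_total
  2 * p * (1 - p)
}

# Root node of the training set (from the printcp output: 38/481)
root_impurity <- gini(38, 481)

# HYPOTHETICAL candidate split of the 481 posts into two child nodes
left_n <- 400;  left_fake <- 12
right_n <- 81;  right_fake <- 26

# Impurity after the split: average of the children, weighted by their size
child_impurity <- (left_n / 481) * gini(left_fake, left_n) +
  (right_n / 481) * gini(right_fake, right_n)

# The tree prefers the split with the largest reduction in impurity
impurity_reduction <- root_impurity - child_impurity
print(c(root = root_impurity, children = child_impurity,
        reduction = impurity_reduction))
```

A split that separated the classes perfectly would bring the child impurity to zero; a split that left the share of fraudulent posts unchanged in both children would give no reduction at all.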

# Second step : prediction on the test set 

prediction_probs <- predict(tree, test, type="prob")
predicted_class_tree <- ifelse(prediction_probs[,2] > 0.03, 1, 0)
actual_class_tree <- test$fraudulent
accuracy <- sum(predicted_class_tree == actual_class_tree) / length(actual_class_tree)
print(accuracy)

## [1] 0.8293839

cm_tree <- confusionMatrix(as.factor(predicted_class_tree), as.factor(actual_class_tree))
print(cm_tree)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 154   2
##          1  34  21
##                                           
##                Accuracy : 0.8294          
##                  95% CI : (0.7717, 0.8775)
##     No Information Rate : 0.891           
##     P-Value [Acc > NIR] : 0.9975          
##                                           
##                   Kappa : 0.4546          
##                                           
##  Mcnemar's Test P-Value : 2.383e-07       
##                                           
##             Sensitivity : 0.8191          
##             Specificity : 0.9130          
##          Pos Pred Value : 0.9872          
##          Neg Pred Value : 0.3818          
##              Prevalence : 0.8910          
##          Detection Rate : 0.7299          
##    Detection Prevalence : 0.7393          
##       Balanced Accuracy : 0.8661          
##                                           
##        'Positive' Class : 0               
##

Compare the results of the decision tree with the ones from the regression. Which model is more useful in your opinion? Why?

Comparison: looking at the main metrics of the two confusion matrices, we see that the decision tree has a lower accuracy than the regression (0.83 against 0.94) but a much higher specificity (0.91 against 0.65): the tree misses only 2 of the 23 fraudulent job posts, where the regression misses 8. This gain comes at a cost, since sensitivity falls from 0.97 to 0.82, meaning that 34 real job posts are wrongly flagged as fraudulent. Overall, the balanced accuracy of the tree (87%) is higher than that of the regression model (81%).

Looking simply at the matrix, we see that even though the tree still makes some prediction errors on the fraudulent job posts, it does a better job on them than the regression model. The accuracy is lower with the tree than with the regression, but our dataset is unbalanced, as it has many more real job posts than fraudulent ones. In this context, it is more relevant and useful to look at the balanced accuracy, which is higher for the decision tree (meaning that the tree does relatively well on both real and fraudulent posts). If detecting fake postings is the priority, the tree is therefore the more useful model.

However, the decision tree still has a lower accuracy than the regression model, and its accuracy (0.83) is even below the no-information rate (0.89).
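The side-by-side comparison can be reproduced from the two confusion matrices alone. The helper summarise_cm below is an illustrative function written for this sketch, not part of caret.

```r
# Illustrative helper: metrics from the four cells of a confusion matrix,
# with class 0 (real post) as the positive class, as in the caret output
summarise_cm <- function(real_as_real, real_as_fake, fake_as_real, fake_as_fake) {
  sensitivity <- real_as_real / (real_as_real + real_as_fake)
  specificity <- fake_as_fake / (fake_as_fake + fake_as_real)
  total <- real_as_real + real_as_fake + fake_as_real + fake_as_fake
  c(accuracy = (real_as_real + fake_as_fake) / total,
    sensitivity = sensitivity,
    specificity = specificity,
    balanced_accuracy = (sensitivity + specificity) / 2)
}

# Cell counts copied from the two confusion matrices printed above
comparison <- rbind(
  logit = summarise_cm(183, 5, 8, 15),
  tree  = summarise_cm(154, 34, 2, 21)
)
print(round(comparison, 4))
```

Reading the table row by row makes the trade-off visible: the logit model wins on accuracy and sensitivity, the tree on specificity and balanced accuracy.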

Finally, choose from the two models above the one who maximizes accuracy - call this model final_model, to which you can add other variables. You can also come up with another model if you prefer, using other methods.

Explain briefly what this model does and the main variables it uses to detect spam messages. To do this, you can explore the vip package, which allows you to plot the most important variables of a model. Discuss also how trustworthy your model is to detect fake job postings in data it has not seen.

Note: Your final model will be tested in a test dataset that you have no access to. Please make sure your code runs so that I can evaluate its performance. I will rank your performance in terms of accuracy against the other groups. 20% of your grade in the assignment depends on how accurate your model is. Therefore, make sure that your model performs well in data it has not seen.

Analysis and explanation of the final model: for the final model, we chose to keep a logistic regression, since it is the model that displays the highest accuracy. A generalized linear model with family = binomial is a way to predict a binary outcome. It works as a generalization of a simple linear regression: a linear combination of the explanatory variables is passed through the logistic function, which gives the predicted probability that a post is fraudulent. The impact of each explanatory variable on the outcome is captured by its estimated coefficient.
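The step from the linear combination to a predicted class can be sketched as follows. The values in linear_predictor are hypothetical, standing in for what predict(final_model, type = "link") would return; the 0.4 cut-off is the one used in the code of this report.

```r
# HYPOTHETICAL values of the linear combination (log-odds) for four posts
linear_predictor <- c(-3, -0.5, 0, 1.2)

# The logistic function 1 / (1 + exp(-x)) turns log-odds into probabilities
prob_fake <- plogis(linear_predictor)

# Same decision rule as in the report: flag a post as fake above 0.4
predicted_fraudulent <- ifelse(prob_fake > 0.4, 1, 0)

# A 0.4 cut-off on the probability is equivalent to a cut-off on the log-odds
cutoff_log_odds <- qlogis(0.4)   # log(0.4 / 0.6)

print(data.frame(linear_predictor,
                 prob_fake = round(prob_fake, 3),
                 predicted_fraudulent))
```

Lowering the cut-off below 0.5 makes the model flag more posts as fake, which trades some sensitivity for specificity.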

Even though we kept the same structure, we chose to remove some explanatory variables in order to prevent the model from over-fitting. Putting many variables in the model may make it perform well on our data set, but there is a risk that the model becomes too specific to our data set and performs poorly on an unknown one.

So when we try to maximize the performance of our regression on the job_data set, we need to bear in mind that we should not focus only on that, and we must make sure that the model is able to generalize well. If we include too many variables, the model will stay too specific and will not adapt well to another data set that may present a different proportion of real and fraudulent jobs, or a different distribution of the variables, for example.

We decided to add a variable that was not included in the previous models, salary_range. In order to include it, we computed the average of the range given for each row at the beginning of the code. We thought it would be interesting to include it, since we expect fraudulent job posts to advertise very high and unrealistic salaries in order to attract potential scam victims.
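The conversion of a salary range into a single number could look like the sketch below. It assumes that the raw salary_range values are strings of the form "min-max", which is an assumption about the data and not something shown in this section; the example values and the name salary_mid are hypothetical.

```r
# HYPOTHETICAL raw values, assuming a "min-max" string format
salary_range_raw <- c("20000-28000", "100000-120000", NA, "50000")

# Split each string on the dash and average the numeric parts;
# anything that cannot be read as a number becomes NA
parts <- strsplit(salary_range_raw, "-", fixed = TRUE)
salary_mid <- vapply(
  parts,
  function(p) mean(suppressWarnings(as.numeric(p))),
  numeric(1)
)

print(salary_mid)
```

Rows where the salary is missing stay NA, so they would be dropped by the na.omit() call used when building the final data set.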

Looking at the vip plot, which ranks the variables of the final model by importance, it seems that the variables included in the final model have a meaningful impact on the fraudulent outcome variable.

How trustworthy is the model: the chosen model seems to do a good job at predicting fake job postings, since its specificity reaches 76% while its sensitivity remains high at 98%. The model is therefore well balanced and predicts both real and fraudulent posts well in our data set (it is best to increase specificity without decreasing sensitivity too much, since it is also important not to wrongly flag real job posts as fraudulent).

However, unless we test the model on multiple unseen data sets, we cannot be completely sure of its ability to generalize well. Still, its balanced accuracy tells us that the model is not biased toward only predicting the majority class in our data, and that it does a good job at predicting both classes. If a model were only good at predicting the majority class (here the real job posts), the matrix would still report a high accuracy on our data. But if we tested such a model on a new data set that contains a majority of fraudulent job posts, it would have a very low accuracy. That is why we conclude again that accuracy is not the best metric to look at, especially when the training data is unbalanced.
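The dependence of accuracy on the class proportions can be worked out directly. The sketch below assumes that the sensitivity and specificity measured on our test set (0.9841 and 0.7647) would stay the same on new data, which is a simplifying assumption; the shares of fraudulent posts other than the first one are hypothetical.

```r
# Sensitivity and specificity of the final model on our test set
sens <- 0.9841   # share of real posts classified as real
spec <- 0.7647   # share of fraudulent posts classified as fraudulent

# Share of fraudulent posts: 0.1189 is our test set, the others are HYPOTHETICAL
share_fake <- c(0.1189, 0.30, 0.50, 0.80)

# Expected accuracy = weighted average of sensitivity and specificity
accuracy_final <- sens * (1 - share_fake) + spec * share_fake

# A rule that labels every post as real has sensitivity 1 and specificity 0
accuracy_all_real <- 1 - share_fake

print(data.frame(share_fake,
                 accuracy_final = round(accuracy_final, 3),
                 accuracy_all_real = round(accuracy_all_real, 3)))
```

Under these assumptions the accuracy of the final model falls gradually as fraudulent posts become more common, while a majority-class rule collapses.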

Finally, we can see that compared to the first regression (accuracy of 93.8% and balanced accuracy of 81%), both accuracy (95.8%) and balanced accuracy (87%) are higher, which explains why it was chosen as the final model.

# First step : create a new data set that contains the variables we want to include into the final model 

data_final <- job_features %>%  
  select(fraudulent,n_charsperword,
          n_lowersp, n_capsp, required_education, required_experience,
         n_uq_hashtags, salary_range
         ) %>% 
  na.omit()


# Second step : separate the data set between train and test data 

set.seed(123)
sample_final <- sample(c(TRUE, FALSE), nrow(data_final), replace = TRUE, prob = c(0.8,0.2))
train_final <- data_final[sample_final, ]
test_final <- data_final[!sample_final, ]

# Third step : fit in the regression model using the train data set 

final_model <- glm(fraudulent ~ ., family = "binomial", data = train_final)

# Fourth step : testing the model prediction power on testing data set 

prediction_final <- predict(final_model, test_final, type = "response")
predicted_fraudulent_final <- ifelse(prediction_final > 0.4, 1, 0)
actual_class_final <- test_final$fraudulent

cm_final <- confusionMatrix(as.factor(predicted_fraudulent_final), as.factor(actual_class_final))
print(cm_final)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 124   4
##          1   2  13
##                                           
##                Accuracy : 0.958           
##                  95% CI : (0.9109, 0.9844)
##     No Information Rate : 0.8811          
##     P-Value [Acc > NIR] : 0.001271        
##                                           
##                   Kappa : 0.789           
##                                           
##  Mcnemar's Test P-Value : 0.683091        
##                                           
##             Sensitivity : 0.9841          
##             Specificity : 0.7647          
##          Pos Pred Value : 0.9687          
##          Neg Pred Value : 0.8667          
##              Prevalence : 0.8811          
##          Detection Rate : 0.8671          
##    Detection Prevalence : 0.8951          
##       Balanced Accuracy : 0.8744          
##                                           
##        'Positive' Class : 0               
##

# The vip() function plots the importance of each variable in explaining the outcome variable "fraudulent" that we want to predict

if (!require(vip)) {
  install.packages("vip")
  library(vip)
}


vip(final_model) + ggtitle("Variable Importance")

Assignment 1

Fall 2023
