Group 1 Members:

  • David Moste
  • Vinayak Kamath
  • Kenan Sooklall
  • Christian Thieme
  • Lidiia Tronina

Introduction

Data scientists are in extremely high demand around the world. Companies are constantly fighting to acquire and keep talented professionals. However, as we’ve seen in the last year, many professionals are leaving their jobs and looking for new employment. Why is this? What are the factors behind it? And is this something that can be predicted?

In this project we will be utilizing a dataset from a Kaggle competition titled “HR Analytics: Job Change of Data Scientists”. The competition’s focus is to determine which data scientists will be looking for a job change. Accordingly, the target variable, labeled target, is binary: 0 indicates the individual is not looking for a job change and 1 indicates that they are. To do this, we’ve been provided with a training dataset containing the following columns:

Variable Description
enrollee_id Unique ID for each candidate
city City code
city_development_index Scaled development index of the city
gender Gender of the candidate
relevant_experience Binary field indicating whether the candidate has relevant experience
enrolled_university University enrollment type (if any)
education_level Level of education completed
major_discipline The focus area of their studies
experience Number of years of experience
company_size Number of employees at the candidate's current company
company_type Type of company (startup, private, etc.)
last_new_job Number of years the candidate was at their previous job
training_hours Number of training hours completed
target 0 - Not looking for a job change, 1 - Looking for a job change

Our task will be to perform an extensive exploratory data analysis (including a clustering analysis) to get an understanding of the data, perform any necessary data transformations (feature engineering, imputation, etc.), and then build the following models to see which is the most accurate at predicting which data scientists are looking to leave their jobs:

  • Decision Tree
  • Random Forest
  • K-Nearest Neighbors
  • XGBoost
  • SVM

Exploratory Data Analysis

Let’s begin exploring by taking a high level look at our dataset:

glimpse(train)
## Rows: 19,158
## Columns: 14
## $ enrollee_id            <dbl> 8949, 29725, 11561, 33241, 666, 21651, 28806, 4~
## $ city                   <chr> "city_103", "city_40", "city_21", "city_115", "~
## $ city_development_index <dbl> 0.920, 0.776, 0.624, 0.789, 0.767, 0.764, 0.920~
## $ gender                 <chr> "Male", "Male", NA, NA, "Male", NA, "Male", "Ma~
## $ relevent_experience    <chr> "Has relevent experience", "No relevent experie~
## $ enrolled_university    <chr> "no_enrollment", "no_enrollment", "Full time co~
## $ education_level        <chr> "Graduate", "Graduate", "Graduate", "Graduate",~
## $ major_discipline       <chr> "STEM", "STEM", "STEM", "Business Degree", "STE~
## $ experience             <chr> ">20", "15", "5", "<1", ">20", "11", "5", "13",~
## $ company_size           <chr> NA, "50-99", NA, NA, "50-99", NA, "50-99", "<10~
## $ company_type           <chr> NA, "Pvt Ltd", NA, "Pvt Ltd", "Funded Startup",~
## $ last_new_job           <chr> "1", ">4", "never", "never", "4", "1", "1", ">4~
## $ training_hours         <dbl> 36, 47, 83, 52, 8, 24, 24, 18, 46, 123, 32, 108~
## $ target                 <dbl> 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,~

We can see that there is a mix of both numeric and categorical data. We note that most of the columns contain categorical data and should be converted to factors. Additionally, our target variable target should also be a factor. We’ll make these changes now.
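
The conversion code itself is not shown here; a minimal sketch of how it could be done with dplyr (assuming the tidyverse is loaded) is:

train <- train %>%
  mutate(across(where(is.character), as.factor)) %>%  # character columns to factors
  mutate(target = as.factor(target))                  # target (0/1) to factor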

Having made the necessary changes, we can see in the visual above that our dataset is mostly categorical data. However, three columns remain numeric: enrollee_id, city_development_index, and training_hours. Of these, enrollee_id is just an identification field and will not be used during modeling.

Numerical Features

As mentioned previously, we have two numerical features, city_development_index and training_hours. Let’s visualize these distributions to get a better sense of their shape:
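
The distribution plots are generated with ggplot2; a minimal sketch of the histograms (the bin widths are our assumption) is:

library(ggplot2)

ggplot(train, aes(x = city_development_index)) +
  geom_histogram(binwidth = 0.02) +
  labs(title = "Distribution of city_development_index")

ggplot(train, aes(x = training_hours)) +
  geom_histogram(binwidth = 10) +
  labs(title = "Distribution of training_hours")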

In looking at the distribution of city_development_index, it appears that the distribution is multi-modal. This often means that there are sub-populations within the data. We’ll keep this in mind as we continue our exploration. Additionally, we note that a significant portion of the individuals in our dataset live in cities where the city_development_index is greater than ~0.87.

As we’d expect, the distribution of training_hours is extremely right-skewed. We see that the majority of the population has fewer than 100 hours of training.

Let’s now visualize the relationship each of these variables has with target, starting with city_development_index:
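
A sketch of how these boxplots can be produced (for both city_development_index here and the training_hours comparison discussed below):

ggplot(train, aes(x = target, y = city_development_index, fill = target)) +
  geom_boxplot() +
  labs(title = "city_development_index by target")

ggplot(train, aes(x = target, y = training_hours, fill = target)) +
  geom_boxplot() +
  labs(title = "training_hours by target")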

Looking at the boxplots above, we can see that those who are not looking to leave their jobs, most often, live in cities with a high city development index. Looking at the boxplot on the right of those who are considering leaving their jobs, we can see that the median value is much lower and the interquartile range is much wider. It does appear that city_development_index has a strong relationship with target.

We’ll now look at the relationship between training_hours and target:

In looking at the boxplots above, we do not note a significant relationship between this variable and target. In fact, the boxplots are almost identical.

Categorical Features

Now let’s turn our attention to our categorical variables. We’ll start by visualizing this data:

In looking at the visualization above, we note:

  • city: A large portion of our dataset comes from 5-6 different cities; however, this factor has many unique values.
  • company_size: This column was most often left blank. Among the responses, there appears to be a good mix of individuals from every company size.
  • company_type: A large portion of individuals work at a private company. This column was also left blank very frequently.
  • education_level: A large percentage of the population has a “Graduate” degree, which often means the same thing as “Masters”, the second most frequent value. Perhaps these values should be combined?
  • enrolled_university: About 75% of the population is not enrolled in a university. This is what we’d expect, as most individuals have probably already graduated with their degrees.
  • experience: This column is surprising. The largest group in this factor is “>20” years of experience, which is unusual as there are few people in the data science industry with that much experience. We do, however, see a good spread of other responses within this column.
  • gender: The majority of our population is Male or did not answer. This is probably expected, as the engineering world often has more men than women.
  • last_new_job: Interestingly, a good percentage of our population was only at their last job for 1 year. The next largest group was at their last job for “>4” years.
  • major_discipline: Not surprisingly, the overwhelming focus of individuals’ studies was in the STEM field. We note a substantial number of missing values in this field.
  • relevant_experience: It appears that about 2/3 of our dataset has relevant experience.
  • target: We note a class imbalance here: about 75% of our dataset is not looking to leave their current job.

As we will be building models to predict target, let’s see if a relationship exists between these categorical variables and target. To do this, we can explore the percentage of individuals who are looking for new jobs vs those who are not at each level of the factor. If we see significant differences between the percentages in each level, we can conclude that the factor and its associated levels are predictive of target.
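
One way to build these comparison charts is to compute, for each level of every categorical feature, the proportion of individuals in each target class. A sketch follows (the layout of the original charts is not shown, so the faceting here is our assumption; city is excluded and shown separately further below):

library(tidyverse)

train %>%
  dplyr::select(where(is.factor)) %>%
  dplyr::select(-city) %>%
  mutate(across(-target, as.character)) %>%
  pivot_longer(-target, names_to = "feature", values_to = "level") %>%
  ggplot(aes(y = level, fill = target)) +
  geom_bar(position = "fill") +
  facet_wrap(~ feature, scales = "free_y") +
  labs(x = "Proportion of individuals", y = NULL)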

In looking at the charts above, we see very minor differences between the percentages for those who are looking to leave their jobs and those who are not in almost all of our categorical features. However, we do note the following significant differences:

  • enrolled_university: Those who are enrolled in a full time course are twice as likely to be looking for a new position. This makes sense as people who are in school are often preparing to start a new career or job when they finish.
  • experience: Those with “>20” years of experience are twice as likely to stay at their current job. This also makes sense as they may be nearing the end of their career and want some security during the last several years of their career.

We excluded the view of city from the graphic above because of the number of distinct categories it contains. We show this chart in a separate view below:

In almost all cities, we see very little difference, however, we do note a difference in city_21 and city_103 in that both of these cities are much more likely to have data scientists looking for a new position.

Based on our EDA, it appears that there are quite a few of our categorical variables that hold little to no predictive power:

  • gender
  • relevant_experience
  • education_level
  • major_discipline
  • company_size
  • company_type
  • last_new_job

Some of the algorithms we have chosen to model with may handle non-predictive variables better than others. We will make the decision to include/exclude each of these variables in the modeling section for each of these algorithms. Additionally, as some of these variables have a large percentage of missing values, we can also elect to drop them from the dataset if necessary.

Missing Data

As we’ve seen throughout our EDA, there are quite a few missing values within the columns. Let’s take a closer look at this so we can determine the best way to deal with them.
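
A quick tabulation of the missingness (our addition; the original report used a missing-data visualization) looks like:

# overall share of missing cells
mean(is.na(train))

# share of missing values per column, largest first
sort(colMeans(is.na(train)), decreasing = TRUE)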

We can see above that ~8% of our dataset is missing, with the largest contributors being company_type, company_size, gender, and major_discipline. Based on our analysis above, these factors hold little to no predictive power; as such, we’ve decided to drop them from the dataset.

train <- train %>%
  dplyr::select(-company_type, -company_size, -gender, -major_discipline, -enrollee_id)

Having dropped these variables, let’s look again at our missing values:

We can see that only 0.7% of the values within our dataset are now missing. Additionally, we don’t note any patterns within the missing data, so we can use an imputation method to fill in the remaining values.

We have chosen to use the pmm method (predictive mean matching) from the mice library to impute our missing values. Predictive mean matching calculates a predicted value for each incomplete variable and, for each missing entry, forms a small set of “candidate donors” from the complete cases whose predicted values are closest to the predicted value for that entry. A donor is then randomly chosen from the candidates and its observed value is imputed where the value was missing. Applying pmm assumes that the distribution of the missing cells is the same as that of the observed data, so the approach may be less reliable when the percentage of missing values is high.

library(mice)

train <- mice(data = train, m = 1, method = "pmm", seed = 500)
## 
##  iter imp variable
##   1   1  enrolled_university  education_level  experience  last_new_job
##   2   1  enrolled_university  education_level  experience  last_new_job
##   3   1  enrolled_university  education_level  experience  last_new_job
##   4   1  enrolled_university  education_level  experience  last_new_job
##   5   1  enrolled_university  education_level  experience  last_new_job
## Warning: Number of logged events: 20
train <- mice::complete(train, 1)

We can see now that we have successfully imputed all missing values:
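
A quick sanity check (our addition) confirms this programmatically:

sum(is.na(train))  # should be 0 after imputation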

Unsupervised Learning - K-Means Clustering

As a conclusion to our exploratory data analysis, we determined it appropriate to see if clustering could provide additional insight into this dataset. We have chosen to use K-Means Clustering for this analysis.

K-Means Clustering

K-Means clustering is a technique that creates clusters by finding groups of similar observations, as determined by Euclidean distance.

The first step is to select the columns we want to use. Since k-means is based on Euclidean distance, we will include only variables that are continuous or can be sensibly recoded as numeric (experience and last_new_job), ruling out the remaining binary and factor-based predictors.

library(tidyverse)

kmeans_df <- train %>%
  dplyr::select(city_development_index, training_hours, experience, last_new_job) %>%
  mutate_if(is.factor, as.character) %>%
  mutate(experience = replace(experience, experience == ">20", 20)) %>%
  mutate(experience = replace(experience, experience == "<1", 0)) %>%
  mutate(last_new_job = replace(last_new_job, last_new_job == ">4", 4)) %>%
  mutate(last_new_job = replace(last_new_job, last_new_job == "never", 0)) %>%
  mutate_if(is.character, as.numeric)

The next step is to transform the data. The Euclidean distance measure requires that we center and scale the data. We also chose to apply a Box-Cox transformation to help reduce skewness.

library(caret)

kmeans_trans <- preProcess(kmeans_df,
                           method = c("center", "scale", "BoxCox"))
kmeans_trans <- predict(kmeans_trans, kmeans_df)

Now that we have our data set up, we can get a sense of how many clusters to create. To do this, we used the silhouette method: the optimal number of clusters is the one that maximizes the average silhouette width, since once the average silhouette width starts to decrease we stop gaining information by dividing the data into more clusters. This led us to an optimal number of 3 clusters.
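
The silhouette analysis itself can be sketched with factoextra (our assumption about the tooling); we subsample here only to keep the pairwise-distance computation tractable:

library(factoextra)

set.seed(500)
fviz_nbclust(dplyr::sample_n(kmeans_trans, 2000), kmeans, method = "silhouette")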

Now we can visualize our clusters! To visualize these four dimensional clusters, we used PCA to reduce the dimensionality of our data.
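
A sketch of the clustering and its PCA-based visualization (km_fit is a name of our choosing; fviz_cluster from factoextra projects the points onto the first two principal components):

set.seed(500)
km_fit <- kmeans(kmeans_trans, centers = 3, nstart = 25)

fviz_cluster(km_fit, data = kmeans_trans, geom = "point")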

What incredible clusters! To get a sense of our clusters, we first computed the principal components of the original data via PCA. We then graphed the contribution of each variable to the top two dimensions of the PCA. We can see that dimension 2 is composed almost entirely of training_hours and that dimension 1 is a combination of the rest of the variables.

Unfortunately, the k-means cluster assignments were not fully reproducible between runs in our environment (even with a set seed), so our discussion here is based on what we noted after our first model run.

In an effort to get a better sense of the differences between these clusters and what made them unique, we went back to the original data and performed a comparative analysis. We computed the mean of each feature within each cluster and plotted these values to see the differences between the three clusters more clearly.
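
A sketch of that comparative analysis, reusing the km_fit object from the sketch above:

kmeans_df %>%
  mutate(cluster = factor(km_fit$cluster)) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean)) %>%
  pivot_longer(-cluster, names_to = "feature", values_to = "mean_value") %>%
  ggplot(aes(x = feature, y = mean_value, color = cluster, group = cluster)) +
  geom_point() +
  geom_line() +
  coord_flip()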

These 3 groups are generally different combinations of experience (high vs low) and last_new_job (high vs low). Our 3 groups look like:

  • lots of experience and a long time since the last job switch
  • medium experience and a short time since last job switch
  • little experience and a short time since last job switch

With an understanding of these clusters, we could potentially engineer features to capture some of these signals, which might lead to increased prediction accuracy.


Modeling

Predicting Which Data Scientists Are Looking to Leave

Having concluded our exploratory data analysis, we’ll now build several models to attempt to predict which data scientists are looking to leave vs stay at their current job. We will use the following modeling approaches:

  • Decision Tree
  • Random Forest
  • K-Nearest Neighbors
  • XGBoost
  • SVM

Decision Tree

A decision tree is a supervised machine learning algorithm that can be used for both classification and regression problems. A decision tree is simply a series of sequential decisions made to reach a specific result.

library(tidymodels)
set.seed(42)
df_split <- train %>% initial_split(strata = 'target')

df_train <- training(df_split)
df_test <- testing(df_split)

set.seed(43)
df_folds <- bootstraps(df_train, strata = target, times=10)

The training data is imbalanced, with roughly 14k 0s and 5k 1s overall. This can cause our model to overfit to the over-represented class; stated differently, if our model always guessed a target of 0, it would be ~75% accurate. To adjust for the imbalance we downsample the training set so the target classes are 50-50.

df_train_rec <- recipe(target ~ ., data=df_train) %>%
  themis::step_downsample(target)

smp <- df_train_rec %>% 
  prep() %>% 
  bake(new_data=NULL)
table(smp$target)
## 
##    0    1 
## 3582 3582

We instantiate the decision tree and fit it with the training set:

library(rpart)

dt <- rpart(target ~ ., data = df_train)

The trained model is then used on the test set to evaluate its performance.

dt.predictions <- predict(dt, df_test, type='class')
dt_df = data.frame(y_true=df_test$target, y_pred=dt.predictions)
dt_cm <- confusionMatrix(table(dt_df$y_true, dt_df$y_pred), positive='1')
dt_cm
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 3219  377
##   1  699  496
##                                           
##                Accuracy : 0.7754          
##                  95% CI : (0.7633, 0.7872)
##     No Information Rate : 0.8178          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3409          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5682          
##             Specificity : 0.8216          
##          Pos Pred Value : 0.4151          
##          Neg Pred Value : 0.8952          
##              Prevalence : 0.1822          
##          Detection Rate : 0.1035          
##    Detection Prevalence : 0.2494          
##       Balanced Accuracy : 0.6949          
##                                           
##        'Positive' Class : 1               
## 

The confusion matrix shows we correctly predicted No 3,219 times and Yes 496 times, for an overall accuracy of 77.54%. Our model has higher recall (56.8%) than precision (41.5%). For our purposes, this model isn’t very useful: out of the 873 data scientists who were looking to leave their job, the model only correctly identified 496 of them (~57%), and it also produced nearly 700 false positives. In the real world, this model would most likely be unusable.

Random Forest

The decision tree algorithm is quite easy to understand and interpret. However, often a single tree is not sufficient for producing effective results. This is where the Random Forest algorithm comes into the picture.

We’ll build a random forest model with 1500 trees where mtry and min_n are tunable parameters.

set.seed(42)
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees=1500) %>%
  set_mode('classification') %>%
  set_engine('ranger')

rf_workflow <- workflow() %>%
  add_recipe(df_train_rec) %>%
  add_model(rf_spec)

The random forest is tuned through a grid of 30 possible values with parallel processing.

set.seed(42)
doParallel::registerDoParallel()
rf_tune <- tune_grid(rf_workflow, resamples = df_folds, grid=30)
## i Creating pre-processing data to finalize unknown parameter: mtry

With the model tuned, we can look at the parameter combinations that produced the highest roc_auc.

set.seed(42)
show_best(rf_tune, metric='roc_auc')
## # A tibble: 5 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     1    25 roc_auc binary     0.748    10 0.00219 Preprocessor1_Model25
## 2     1    20 roc_auc binary     0.748    10 0.00219 Preprocessor1_Model12
## 3     2    32 roc_auc binary     0.748    10 0.00196 Preprocessor1_Model13
## 4     2    27 roc_auc binary     0.747    10 0.00191 Preprocessor1_Model30
## 5     3    39 roc_auc binary     0.744    10 0.00185 Preprocessor1_Model07
autoplot(rf_tune)

Now, with the best parameters selected, we finalize the workflow to obtain our final model.

set.seed(42)
rf_model <- rf_workflow %>%
  finalize_workflow(select_best(rf_tune, metric='roc_auc'))
rf_fit <- last_fit(rf_model, df_split)

With tuning complete, last_fit() trains the finalized model on the full training set and evaluates it on the test set.

Testing

The finalized model is evaluated on the test set to determine how it would perform on unseen data.

set.seed(42)
model <- rf_fit$.workflow[[1]]
y_pred <- predict(model, df_test)
rf_df <- data.frame(y_pred=y_pred$.pred_class, y_true=df_test$target)
rf_cm <- confusionMatrix(table(rf_df$y_true, rf_df$y_pred), positive='1')
rf_cm
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 2778  818
##   1  449  746
##                                          
##                Accuracy : 0.7355         
##                  95% CI : (0.7228, 0.748)
##     No Information Rate : 0.6736         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3597         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.4770         
##             Specificity : 0.8609         
##          Pos Pred Value : 0.6243         
##          Neg Pred Value : 0.7725         
##              Prevalence : 0.3264         
##          Detection Rate : 0.1557         
##    Detection Prevalence : 0.2494         
##       Balanced Accuracy : 0.6689         
##                                          
##        'Positive' Class : 1              
## 

The confusion matrix shows we correctly predicted an individual will stay at their current job 2,778 times and is looking for a new position 746 times. The sensitivity of this model is also not great at 47.7%, meaning we correctly identified fewer than half of the individuals looking for new jobs. We’ll have to hope that one of our other models will be able to perform better.

Feature importance

One of the benefits of the random forest model is the ability to see feature importance within the model. While our model is not as accurate as we’d like, it is still beneficial to see what features the model is placing the most importance on.

The most important feature is city_development_index by a wide margin. Next are city and relevant_experience, followed by enrolled_university and education_level. The city_development_index would be a variable to take a closer look at in the future to see if some additional feature engineering could produce more signal.
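
The workflow above did not record importance scores, so the sketch below shows one way the importance ranking could be reproduced: refit the best configuration with the ranger engine's impurity importance enabled (this is our assumption about how the chart was generated).

rf_imp_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1500) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity")

rf_imp_fit <- workflow() %>%
  add_recipe(df_train_rec) %>%
  add_model(finalize_model(rf_imp_spec, select_best(rf_tune, metric = "roc_auc"))) %>%
  fit(data = df_train)

# impurity-based importance scores from the underlying ranger object
sort(extract_fit_parsnip(rf_imp_fit)$fit$variable.importance, decreasing = TRUE)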


KNN

KNN is highly susceptible to data that is on different scales (large values will dominate the distance calculation). With this in mind, we have chosen to center and scale all of our predictors. The final step of pre-processing is to check for and remove any predictors that have near-zero variance, as they contribute little information to the distance calculation.

library(caret)

knn_data <- train %>%
  dplyr::select(city_development_index, training_hours, experience, last_new_job) %>%
  mutate_if(is.factor, as.character) %>%
  mutate(experience = replace(experience, experience == ">20", 20)) %>%
  mutate(experience = replace(experience, experience == "<1", 0)) %>%
  mutate(last_new_job = replace(last_new_job, last_new_job == ">4", 4)) %>%
  mutate(last_new_job = replace(last_new_job, last_new_job == "never", 0)) %>%
  mutate_if(is.character, as.numeric)

knn_trans <- preProcess(knn_data,
                        method = c("center", "scale"))
knn_transformed_feat <- predict(knn_trans, knn_data)
nzv <- nearZeroVar(knn_transformed_feat, saveMetrics = TRUE)
nzv[nzv[,"nzv"] == TRUE,]
## [1] freqRatio     percentUnique zeroVar       nzv          
## <0 rows> (or 0-length row.names)

It turns out that none of our predictors have near-zero variance, so we’re good to proceed!

At this point, we are ready to build our model. We’ll start by recombining the target column with our transformed features and splitting the data into training and testing sets.

knn_processed <- cbind(train[9], knn_transformed_feat)
knn_processed <- knn_processed[complete.cases(knn_processed),]
set.seed(54321)
train_ind <- sample(seq_len(nrow(knn_processed)),
                    size = floor(0.75*nrow(knn_processed)))
knn_train <- knn_processed[train_ind,]
knn_test <- knn_processed[-train_ind,]

Our KNN model uses the kknn library. With this library we are able to test different distances (Manhattan, Euclidean, etc.) as well as different weights (kernels).

library(kknn)

kknn_func <- function(train_x, train_y, test_x, test_y){
  results_df <- data.frame(matrix(nrow = 0, ncol = 6))
  
  weights <- c("rectangular","triangular")
  
  for(d in 1:2){
    print(d)
    for(w in weights){
      print(w)
      for(i in 2:30){
        kknnModel <- kknn(train_y ~ .,
                          train_x,
                          test_x,
                          k = i,
                          distance = d,
                          kernel = w)
        
        cM <- table(test_y, fitted(kknnModel))
        accuracy <- (cM[1]+cM[4])/(cM[1]+cM[2]+cM[3]+cM[4])
        sensitivity <- (cM[4])/(cM[4]+cM[2])
        specificity <- (cM[1])/(cM[1]+cM[3])
        results_df <- rbind(results_df,c(i,
                                         accuracy,
                                         specificity,
                                         sensitivity,
                                         w,
                                         d))
      }
    }
  }

  colnames(results_df) <- c("k",
                            "Accuracy",
                            "Specificity",
                            "Sensitivity",
                            "Weight",
                            "Distance")
  results_df[,1] <- as.integer(results_df[,1])
  results_df[,2] <- as.numeric(results_df[,2])
  results_df[,3] <- as.numeric(results_df[,3])
  results_df[,4] <- as.numeric(results_df[,4])
  results_df[,6] <- as.integer(results_df[,6])
  
  return(results_df)
}

kknn_results <- kknn_func(knn_train[,-1],
                          knn_train[,1],
                          knn_test[,-1],
                          knn_test[,1])


# Plot accuracy by k, distinguishing the two distances tested (1 and 2)
ggplot(data = kknn_results,
       aes(x = k, y = Accuracy, color = Weight, linetype = factor(Distance))) +
  geom_line() +
  geom_point() +
  labs(title = "KKNN: accuracy by k",
       x = "k",
       y = "Accuracy",
       linetype = "Distance")

From the visual above, we can see that our model found that a k value of around 27 with a distance of 2 and a weighting function of rectangular produced the best model with an accuracy of 77.2%.

This is a rather large value for k, which indicates that the groups are rather spread out and that KNN may not be the best method for predicting our target variable. Additionally, the sensitivity of this model is extremely low, once again indicating that this model is likely not ideal.
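
For reference, the best-performing configuration reported above can be read directly off the results grid; a minimal sketch:

kknn_results %>%
  arrange(desc(Accuracy)) %>%
  head(1)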


Gradient Boosting

Gradient boosting is similar to random forests or bagging approaches, but instead of growing a large number of trees from independent random samples of the data, trees are grown sequentially, each one fit to a modified version of the data that emphasizes the errors of the trees grown so far. Boosting thus improves weak learners sequentially and increases accuracy by combining them into a single model.

Unlike random forests, GBMs can have high variability in accuracy depending on their settings.

Boosting has 4 tuning parameters that we can focus on:

  • The number of trees (n.trees).

  • The interaction depth (interaction.depth): the number of splits in each tree, which controls the complexity of the boosted ensemble.

  • The minimum number of observations in a terminal node (n.minobsinnode).

  • The distribution type. We will use “adaboost”, as we are working with 0-1 outcomes.

We will tune the model with the training set to find the combination with the lowest RMSE.

library(gbm)

# search grid
hyper_grid <- expand.grid(
  n.trees = c(100, 500),
  interaction.depth = c(6, 9),
  n.minobsinnode = c(5, 10)
)

# create model fit function (shrinkage is held fixed at 0.05)
model_fit <- function(n.trees, interaction.depth, n.minobsinnode) {
  set.seed(777)
  m <- gbm(
    formula = as.character(target) ~ .,
    data = data.frame(df_train),
    distribution = "adaboost",
    n.trees = n.trees,
    shrinkage = 0.05,
    interaction.depth = interaction.depth,
    n.minobsinnode = n.minobsinnode,
    cv.folds = 10
  )
  # compute RMSE from the cross-validation error
  sqrt(min(m$cv.error))
}

hyper_grid$rmse <- purrr::pmap_dbl(
  hyper_grid,
  ~ model_fit(
    n.trees = ..1,
    interaction.depth = ..2,
    n.minobsinnode = ..3
  )
)

# results
arrange(hyper_grid, rmse)
##   n.trees interaction.depth n.minobsinnode      rmse
## 1     100                 6              5 0.8795454
## 2     500                 6              5 0.8795454
## 3     100                 6             10 0.8796386
## 4     500                 6             10 0.8796386
## 5     100                 9              5 0.8797920
## 6     500                 9              5 0.8797920
## 7     100                 9             10 0.8799780
## 8     500                 9             10 0.8799780

Increasing the number of trees and the interaction depth did not improve the RMSE, but it did significantly increase processing time.

From the results above, the RMSE differences across configurations are negligible. We will fit our final model with the following hyperparameters:

  • 100 trees

  • an interaction depth of 9

  • n.minobsinnode of 10

We will use these values to build the model and evaluate it on the test set.

set.seed(777)
modelGBM <- gbm(as.character(target) ~ ., data = data.frame(df_train),
                n.trees = 100, distribution = "adaboost",
                interaction.depth = 9, shrinkage = 0.05, n.minobsinnode = 10)

print(modelGBM)
## gbm(formula = as.character(target) ~ ., distribution = "adaboost", 
##     data = data.frame(df_train), n.trees = 100, interaction.depth = 9, 
##     n.minobsinnode = 10, shrinkage = 0.05)
## A gradient boosted model with adaboost loss function.
## 100 iterations were performed.
## There were 8 predictors of which 7 had non-zero influence.
set.seed(777)
pgbm <- predict(modelGBM, newdata = data.frame(df_test), n.trees = 100, type = "response")
pgbm[pgbm > 0.5] <- 1
pgbm[pgbm <= 0.5] <- 0
confusionMatrix(as.factor(df_test$target), as.factor(pgbm), positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3270  326
##          1  734  461
##                                           
##                Accuracy : 0.7788          
##                  95% CI : (0.7667, 0.7904)
##     No Information Rate : 0.8357          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3331          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.58577         
##             Specificity : 0.81668         
##          Pos Pred Value : 0.38577         
##          Neg Pred Value : 0.90934         
##              Prevalence : 0.16427         
##          Detection Rate : 0.09622         
##    Detection Prevalence : 0.24943         
##       Balanced Accuracy : 0.70123         
##                                           
##        'Positive' Class : 1               
## 

Above, we can see the accuracy metrics for our final gradient boosting model. Boosting achieves a correct classification rate of 77.88% and a sensitivity of 58.58%.

As a point of interest, we can use the summary function on our final model object to produce a variable importance plot. Not surprisingly, city and experience are the most important variables in the model. This partially overlaps with the important features from our random forest model, although city_development_index, which dominated there, shows no influence here.

summary(modelGBM,
        cBars = 10,
        n.trees = modelGBM$n.trees,
        plotit = TRUE,
        order = TRUE,
        method = relative.influence,
        normalize = TRUE)

##                                           var   rel.inf
## city                                     city 75.043972
## experience                         experience 12.262542
## relevent_experience       relevent_experience  4.103970
## enrolled_university       enrolled_university  3.033921
## education_level               education_level  2.727331
## training_hours                 training_hours  1.522973
## last_new_job                     last_new_job  1.305290
## city_development_index city_development_index  0.000000

Support Vector Machine

A Support Vector Machine (SVM) is an algorithm that searches the feature space for the optimal hyperplane, the one that separates the classes with the maximum margin. Here we train an SVM to find the dividing plane between those who are not looking for a job change and those who are, based on the features we have.

The downsampled training set smp is used to fit an SVM:

library(e1071)

set.seed(42)
svm_model <- svm(target ~ ., data = smp)
summary(svm_model)
## 
## Call:
## svm(formula = target ~ ., data = smp)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  4993
## 
##  ( 2511 2482 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Our base model consists of 4993 support vectors where 2511 are assigned to label 0 (not looking for a job change) and 2482 to label 1 (looking for a job change).

We will tune the SVM with the training set to find the best values for gamma and cost. The tuning is done with 10-fold cross-validation.
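
The tuning code is not shown above; a sketch using e1071's tune() (which cross-validates with 10 folds by default; the search grid here is our assumption) is:

library(e1071)

set.seed(42)
svm_tune <- tune(
  svm, target ~ ., data = smp,
  kernel = "radial",
  ranges = list(gamma = c(0.1, 0.5, 1, 2), cost = c(0.1, 1, 10))
)
summary(svm_tune)

# refit the best parameters, keeping probability estimates as in the printed call below
svm_best <- svm(target ~ ., data = smp, kernel = "radial",
                gamma = svm_tune$best.parameters$gamma,
                cost = svm_tune$best.parameters$cost,
                probability = TRUE)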

It looks like the best parameters are gamma = 0.5 and cost = 1.

## 
## Call:
## svm(formula = target ~ ., data = smp, cost = 1, gamma = 0.5, kernel = "radial", 
##     probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  5657
## 
##  ( 2888 2769 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Now with the best model we can try it against the test set:

set.seed(42)

x <- subset(df_test, select = -target)
y <- subset(df_test, select = target)

# svm_best is the tuned model selected by the cross-validation step above
y_test <- predict(svm_best, x)
confusionMatrix(df_test[, 'target'], y_test, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2692  904
##          1  433  762
##                                          
##                Accuracy : 0.7209         
##                  95% CI : (0.708, 0.7336)
##     No Information Rate : 0.6523         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3414         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.4574         
##             Specificity : 0.8614         
##          Pos Pred Value : 0.6377         
##          Neg Pred Value : 0.7486         
##              Prevalence : 0.3477         
##          Detection Rate : 0.1590         
##    Detection Prevalence : 0.2494         
##       Balanced Accuracy : 0.6594         
##                                          
##        'Positive' Class : 1              
## 

Based on the results above, we can see that the model is most accurate at predicting someone who will not change jobs, with 2,692 true negatives. There are also 433 false positives, where the model predicted those individuals would change jobs when they did not, and 904 false negatives, where the model predicted individuals would not change jobs when they did.

Modeling Comparison

Detailed results of the individual models are given above; a comparison of our models is provided below:

Model           Accuracy  Specificity  Sensitivity
Decision Tree   77.5%     82.2%        56.8%
Random Forest   73.6%     86.1%        47.7%
KNN             77.2%     91.7%        33.6%
Gradient Boost  77.8%     81.7%        58.3%
SVM             72.1%     86.1%        45.7%
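
The table above can be assembled from the stored confusionMatrix objects. A sketch follows, assuming the GBM and SVM confusion matrices were also saved (gbm_cm and svm_cm are hypothetical names; the KNN row comes from kknn_results instead):

cms <- list(`Decision Tree`  = dt_cm,
            `Random Forest`  = rf_cm,
            `Gradient Boost` = gbm_cm,
            SVM              = svm_cm)

purrr::map_dfr(cms, function(cm) {
  tibble::tibble(Accuracy    = cm$overall["Accuracy"],
                 Specificity = cm$byClass["Specificity"],
                 Sensitivity = cm$byClass["Sensitivity"])
}, .id = "Model")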

While the accuracy of these models may at first glance seem satisfactory, the purpose of these models was to predict whether a data scientist is looking to leave their job. This means we shouldn’t necessarily focus on accuracy as our measure; instead we should focus on sensitivity, which tells us how many of the data scientists who were looking to leave their job were correctly identified by our model. Looking at the sensitivity values above, our results are lackluster, and we would not recommend using any of these models in production to determine whether a data scientist is planning on leaving their job. If our objective were instead to predict who will stay at their job, these models would be potential prospects, as all of them had fairly high specificity (82%+).

So why did our modeling attempts fail? It’s hard to say with certainty, but a large reason could be a lack of signal from our dataset. We saw when performing our exploratory data analysis that very few features within the dataset had any signal at all. It is quite possible that the few columns that did hold predictive power just didn’t quite have enough signal for our models to detect.

Conclusion

Based on our analysis and modeling, it appears that detecting which data scientists are planning to leave their jobs does not have a straightforward solution (at least with the present dataset). This should, intuitively, make some sense. If companies were good at detecting who was going to leave, they could do more to incentivize them to stay, which is not what we have seen in the past year and a half. This problem will continue to stump analysts for years to come.

As a next step in this project, we would recommend expanding the dataset to include additional data points that held more predictive power.