Overview

You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homework assignments. You can work on a problem from your job or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau, USGS, or other open data portals.

Select one of the methodologies studied in weeks 1-10 and one methodology from weeks 11-15 to apply to the newly selected dataset.

Introduction

I will use the HR Analytics: Job Change of Data Scientists dataset from Kaggle. The goal of this dataset is to determine which data scientists are likely to look for a job change.

Dataset Features

Features

  • enrollee_id: Unique ID for candidate
  • city: City code
  • city_development_index: Development index of the city (scaled)
  • gender: Gender of candidate
  • relevent_experience: Relevant experience of candidate
  • enrolled_university: Type of university course enrolled in, if any
  • education_level: Education level of candidate
  • major_discipline: Education major discipline of candidate
  • experience: Candidate's total experience in years
  • company_size: Number of employees in current employer's company
  • company_type: Type of current employer
  • last_new_job: Difference in years between previous job and current job
  • training_hours: Training hours completed
  • target: 0 – not looking for a job change, 1 – looking for a job change

The target variable is labeled target and can be either 0, indicating the individual is not looking for a job change, or 1, indicating the individual is looking for a job change.

Methodology of Analysis

I will transform the given dataset and perform exploratory analysis and clustering to understand the data, then build models to determine which is most accurate at predicting which data scientists are looking to leave their jobs.

I will build the following models:

  1. Decision Tree
  2. Random Forest
  3. K-Nearest Neighbors
  4. SVM

Loading Data
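
The loading step is not shown in the output; a minimal sketch, assuming the Kaggle file aug_train.csv (the file name is an assumption):

df <- read.csv("aug_train.csv", stringsAsFactors = FALSE)  # hypothetical file name
dim(df)  # number of rows and columns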

## [1] 19158    14
The first six rows are shown below (blank cells indicate missing values):

enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target
8949 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | | | 1 | 36 | 1
29725 | city_40 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0
11561 | city_21 | 0.624 | | No relevent experience | Full time course | Graduate | STEM | 5 | | | never | 83 | 0
33241 | city_115 | 0.789 | | No relevent experience | | Graduate | Business Degree | <1 | | Pvt Ltd | never | 52 | 1
666 | city_162 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0
21651 | city_176 | 0.764 | | Has relevent experience | Part time course | Graduate | STEM | 11 | | | 1 | 24 | 1

Data Exploration

## 'data.frame':    19158 obs. of  14 variables:
##  $ enrollee_id           : int  8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
##  $ city                  : chr  "city_103" "city_40" "city_21" "city_115" ...
##  $ city_development_index: num  0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
##  $ gender                : chr  "Male" "Male" "" "" ...
##  $ relevent_experience   : chr  "Has relevent experience" "No relevent experience" "No relevent experience" "No relevent experience" ...
##  $ enrolled_university   : chr  "no_enrollment" "no_enrollment" "Full time course" "" ...
##  $ education_level       : chr  "Graduate" "Graduate" "Graduate" "Graduate" ...
##  $ major_discipline      : chr  "STEM" "STEM" "STEM" "Business Degree" ...
##  $ experience            : chr  ">20" "15" "5" "<1" ...
##  $ company_size          : chr  "" "50-99" "" "" ...
##  $ company_type          : chr  "" "Pvt Ltd" "" "Pvt Ltd" ...
##  $ last_new_job          : chr  "1" ">4" "never" "never" ...
##  $ training_hours        : int  36 47 83 52 8 24 24 18 46 123 ...
##  $ target                : num  1 0 0 1 0 1 0 1 1 0 ...

There is a mix of numerical and categorical data, and the target variable target should be a factor. I will convert the categorical columns, as well as the target variable, to factors. In addition, enrollee_id is only an identifier, so I will not use it during modeling.
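
A minimal sketch of these conversions, assuming dplyr and a data frame named df:

library(dplyr)

df <- df %>%
  mutate(across(where(is.character), as.factor),  # categorical columns to factors
         target = as.factor(target)) %>%          # target to factor for classification
  select(-enrollee_id)                            # drop the identifier column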

Numerical Features

There are two numerical features, city_development_index and training_hours. Below is a visualization of the relationship of each of these variables with target.
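
The plotting code is not shown; a sketch with ggplot2 (an assumption) would be:

library(ggplot2)

ggplot(df, aes(x = target, y = city_development_index)) + geom_boxplot()
ggplot(df, aes(x = target, y = training_hours)) + geom_boxplot()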

From the boxplots above, those who are not looking to leave their jobs mostly live in cities with a high city development index. For those who are looking for a change (the boxplot on the right), the median value is much lower and the interquartile range much wider. This implies that city_development_index has a strong relationship with target.

From the boxplots above, there is no significant relationship between training_hours and target.

Categorical Features

To explore the relationship between the categorical variables and target, I will focus on the percentage of individuals who are looking for new jobs versus those who are not at each level of the factor. Differences between the percentages across levels can imply that the factor and its associated levels are predictive of target.
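
A sketch of how one of these charts can be built, assuming dplyr and ggplot2 (the package and feature choices are assumptions):

library(dplyr)
library(ggplot2)

# Share of each target class within each level of enrolled_university
df %>%
  count(enrolled_university, target) %>%
  group_by(enrolled_university) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = enrolled_university, y = pct, fill = target)) +
  geom_col(position = "dodge")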

From the charts above, there are only minor differences between the percentages of those looking to leave their jobs and those who are not across almost all of the categorical features. The notable exceptions are:

  • enrolled_university: Those who are enrolled in a full time course are twice as likely to be looking for a new position.
  • experience: Those with more than 20 years of experience are twice as likely to stay at their current job.

I excluded city from the charts above due to the number of distinct categories it contains.

Thus far, it appears that the following categorical variables hold insignificant predictive power:

  • gender
  • relevent_experience
  • education_level
  • major_discipline
  • company_size
  • company_type
  • last_new_job

Missing Data

Visualization of missing values.
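
The plotting code is not shown; one option is the naniar package (an assumption, the original may have used a different package):

library(naniar)

vis_miss(df)      # heatmap of missing (NA) cells across the data frame
gg_miss_var(df)   # count of missing values per variable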

From the above plots, the dataset has no missing (NA) values; the blanks visible in several categorical columns are empty strings rather than NA. I am going to drop company_type, company_size, gender, and major_discipline from the dataset, since these factors hold little to no predictive power.
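
A one-line sketch of the drop, again assuming dplyr:

library(dplyr)

df <- df %>%
  select(-company_type, -company_size, -gender, -major_discipline)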

Modelling

Decision Tree

The training set is imbalanced, with around 14k 0s and around 5k 1s. This will bias the model toward the overrepresented class. To address this, I will downsample so that the target classes are balanced 50-50.

library(recipes)
library(themis)   # step_downsample() now lives in themis

df_train_rec <- recipe(target ~ ., data = df_train) %>%
  step_downsample(target)   # balance classes by downsampling the majority

smp <- df_train_rec %>%
  prep() %>%
  bake(new_data = NULL)

table(smp$target)
## 
##    0    1 
## 3582 3582

Below I am instantiating the decision tree and fitting it with the training set:

library(rpart)   # recursive partitioning decision trees

dt <- rpart(target ~ ., data = df_train)

I then use the trained model on the test set to evaluate its performance.

library(caret)   # confusionMatrix()

dt.predictions <- predict(dt, df_test, type = 'class')
dt_df <- data.frame(y_true = df_test$target, y_pred = dt.predictions)
dt_cm <- confusionMatrix(table(dt_df$y_true, dt_df$y_pred), positive = '1')
dt_cm
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 3219  377
##   1  699  496
##                                           
##                Accuracy : 0.7754          
##                  95% CI : (0.7633, 0.7872)
##     No Information Rate : 0.8178          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3409          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5682          
##             Specificity : 0.8216          
##          Pos Pred Value : 0.4151          
##          Neg Pred Value : 0.8952          
##              Prevalence : 0.1822          
##          Detection Rate : 0.1035          
##    Detection Prevalence : 0.2494          
##       Balanced Accuracy : 0.6949          
##                                           
##        'Positive' Class : 1               
## 

From the confusion matrix, the model correctly predicted No 3219 times and Yes 496 times, giving an overall accuracy of 77.54%. Of the 873 data scientists who were looking to leave their jobs, the model correctly identified only 496, and it also produced nearly 700 false positives. This model isn't very useful in practice.

KNN

I will center and scale all predictors and then check for predictors with near-zero variance, removing any that are found, since such predictors carry almost no information.
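
The preprocessing code is not shown; a sketch using caret (assumed, since confusionMatrix() from caret appears elsewhere; df_train is the assumed training frame):

library(caret)

# Center and scale the numeric predictors
pp <- preProcess(df_train, method = c("center", "scale"))
df_train_scaled <- predict(pp, df_train)

# Report any predictors flagged as near-zero variance (none are, per the output below)
nzv <- nearZeroVar(df_train_scaled, saveMetrics = TRUE)
nzv[nzv$nzv, ]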

## [1] freqRatio     percentUnique zeroVar       nzv          
## <0 rows> (or 0-length row.names)

None of the predictors have near-zero variance.

I build the model by splitting the data into training and testing sets and then removing enrollee_id.
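
The tuning code is not shown; a sketch consistent with the printed output below, using kknn::train.kknn (an assumption), might look like:

library(kknn)

# Try both Minkowski distances with two weighting kernels; train.kknn
# selects the best k up to kmax by leave-one-out cross-validation.
fits <- list()
for (d in c(1, 2)) {                       # 1 = Manhattan, 2 = Euclidean
  print(d)
  for (kern in c("rectangular", "triangular")) {
    print(kern)
    fits[[paste(d, kern)]] <- train.kknn(target ~ ., data = df_train,
                                         kmax = 50, distance = d,
                                         kernel = kern)
  }
}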

## [1] 1
## [1] "rectangular"
## [1] "triangular"
## [1] 2
## [1] "rectangular"
## [1] "triangular"

The search found that k ≈ 27 with a distance of 2 and a rectangular weighting function produced the best model, with an accuracy of 77.2%.

This is a rather large value for k, implying that the groups are spread out, and hence KNN may not be the best method for predicting the target variable. The sensitivity of this model is also very low.

Support Vector Machine

I am going to train an SVM to find the boundary that separates those not looking for a job change from those who are, based on the features we have.

The downsampled training set smp from the decision tree section is fit to an SVM with a radial kernel.
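
A sketch of the base fit with e1071 (the package is implied by the svm() call printed further below):

library(e1071)

svm_base <- svm(target ~ ., data = smp, kernel = "radial", probability = TRUE)
summary(svm_base)   # reports the number of support vectors per class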

The base model consists of 4993 support vectors with 2511 assigned to label 0 (not looking for a job change) and 2482 to label 1 (looking for a job change).

I will tune the SVM on the training set to find the best values for gamma and cost, using 10-fold cross-validation.
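
A sketch of the grid search with e1071::tune.svm; the gamma and cost grids are assumptions:

library(e1071)

tuned <- tune.svm(target ~ ., data = smp,
                  gamma = c(0.1, 0.5, 1), cost = c(0.1, 1, 10),
                  tunecontrol = tune.control(cross = 10))  # 10-fold CV
tuned$best.parameters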

The best parameters are gamma = 0.5 and cost = 1.

## 
## Call:
## svm(formula = target ~ ., data = smp, cost = 1, gamma = 0.5, kernel = "radial", 
##     probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  5784
## 
##  ( 2946 2838 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Now we can try the best model against the test set:
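
A sketch of this evaluation step (model and variable names assumed; e1071 and caret as above):

library(e1071)
library(caret)

svm_best <- svm(target ~ ., data = smp, cost = 1, gamma = 0.5,
                kernel = "radial", probability = TRUE)
svm_pred <- predict(svm_best, df_test)
confusionMatrix(data = svm_pred, reference = df_test$target, positive = '1')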

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2697  899
##          1  422  773
##                                           
##                Accuracy : 0.7243          
##                  95% CI : (0.7114, 0.7369)
##     No Information Rate : 0.651           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3502          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4623          
##             Specificity : 0.8647          
##          Pos Pred Value : 0.6469          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.3490          
##          Detection Rate : 0.1613          
##    Detection Prevalence : 0.2494          
##       Balanced Accuracy : 0.6635          
##                                           
##        'Positive' Class : 1               
## 

This model is most accurate at predicting who will not change jobs, with 2697 true negatives. There are also 422 false positives, where the model predicted individuals would change jobs when they did not, and 899 false negatives, where the model predicted individuals would not change jobs when they did.

Model Comparison

The accuracies of these models may seem satisfactory. However, accuracy is not the focus here; instead we focus on each model's sensitivity, which tells us how many of the data scientists looking to leave their jobs were correctly identified. Judged on sensitivity, none of the models can be recommended.

In my opinion, one reason the modeling fell short is that the dataset contains few features with real predictive power.

Conclusion

From my analysis and modeling of the given dataset, there is no reliable way of predicting which data scientists are planning to leave their jobs. I recommend adding features to the dataset that offer more predictive power.