Introduction

I will use the HR Analytics: Job Changes of Data Scientists dataset from Kaggle. The goal is to determine which data scientists will be looking for a job change.

Dataset Features

enrollee_id: Unique ID for candidate
city: City code
city_development_index: Development index of the city (scaled)
gender: Gender of candidate
relevent_experience: Relevant experience of candidate
enrolled_university: Type of university course enrolled in, if any
education_level: Education level of candidate
major_discipline: Major discipline of candidate's education
experience: Candidate's total experience in years
company_size: Number of employees in current employer’s company
company_type: Type of current employer
last_new_job: Difference in years between previous job and current job
training_hours: Training hours completed
target: 0 – Not looking for job change, 1 – Looking for a job change

The target variable is labeled target and can be either 0, indicating the individual is not looking for a job change, or 1, indicating the individual is looking for a job change.

Methodology of Analysis

I will transform the given dataset and perform exploratory analysis, as well as clustering, to understand the data. I will then build several models and determine which is most accurate at predicting which data scientists are looking to leave their jobs.

I will build the following models:

Decision Tree
Random Forest
K-Nearest Neighbors
SVM

Loading Data

## [1] 19158    14

enrollee_id	city	city_development_index	gender	relevent_experience	enrolled_university	education_level	major_discipline	experience	company_size	company_type	last_new_job	training_hours	target
8949	city_103	0.920	Male	Has relevent experience	no_enrollment	Graduate	STEM	>20			1	36	1
29725	city_40	0.776	Male	No relevent experience	no_enrollment	Graduate	STEM	15	50-99	Pvt Ltd	>4	47	0
11561	city_21	0.624		No relevent experience	Full time course	Graduate	STEM	5			never	83	0
33241	city_115	0.789		No relevent experience		Graduate	Business Degree	<1		Pvt Ltd	never	52	1
666	city_162	0.767	Male	Has relevent experience	no_enrollment	Masters	STEM	>20	50-99	Funded Startup	4	8	0
21651	city_176	0.764		Has relevent experience	Part time course	Graduate	STEM	11			1	24	1

Data Exploration

## 'data.frame':    19158 obs. of  14 variables:
##  $ enrollee_id           : int  8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
##  $ city                  : chr  "city_103" "city_40" "city_21" "city_115" ...
##  $ city_development_index: num  0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
##  $ gender                : chr  "Male" "Male" "" "" ...
##  $ relevent_experience   : chr  "Has relevent experience" "No relevent experience" "No relevent experience" "No relevent experience" ...
##  $ enrolled_university   : chr  "no_enrollment" "no_enrollment" "Full time course" "" ...
##  $ education_level       : chr  "Graduate" "Graduate" "Graduate" "Graduate" ...
##  $ major_discipline      : chr  "STEM" "STEM" "STEM" "Business Degree" ...
##  $ experience            : chr  ">20" "15" "5" "<1" ...
##  $ company_size          : chr  "" "50-99" "" "" ...
##  $ company_type          : chr  "" "Pvt Ltd" "" "Pvt Ltd" ...
##  $ last_new_job          : chr  "1" ">4" "never" "never" ...
##  $ training_hours        : int  36 47 83 52 8 24 24 18 46 123 ...
##  $ target                : num  1 0 0 1 0 1 0 1 1 0 ...

The data contain a mix of numerical and categorical columns. The categorical columns are read in as character and the target variable target as numeric, so I will convert all of them to factors. In addition, enrollee_id is only used for identification, so I will not use it during modeling.
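The conversion described above can be sketched in base R as follows. The toy data frame `df` is a hypothetical three-row stand-in for the real data, and `df_model` is an illustrative name, not one taken from the original analysis.

```r
# Hypothetical miniature of the dataset (values copied from the first rows above)
df <- data.frame(
  enrollee_id = c(8949L, 29725L, 11561L),
  city = c("city_103", "city_40", "city_21"),
  city_development_index = c(0.920, 0.776, 0.624),
  gender = c("Male", "Male", ""),
  training_hours = c(36L, 47L, 83L),
  target = c(1, 0, 0),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor
chr_cols <- vapply(df, is.character, logical(1))
df[chr_cols] <- lapply(df[chr_cols], factor)

# target is numeric 0/1; make it a factor with explicit levels
df$target <- factor(df$target, levels = c(0, 1))

# Drop the identifier so it cannot be used as a predictor
df_model <- df[, setdiff(names(df), "enrollee_id")]
str(df_model)
```

Setting the levels of target explicitly keeps 0 as the first level even if a subset happens to contain only one class.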

Numerical Features

There are two numerical features, city_development_index and training_hours. Below is a visualization of the relationship of each of these variables with target.

From the boxplots above, those who are not looking to leave their jobs mostly live in cities with a high city development index. In the boxplot on the right, the median is much lower and the interquartile range is much wider. This suggests that city_development_index has a strong relationship with target.

From the boxplots above, there is no significant relationship between training_hours and target.

Categorical Features

To explore the relationship between the categorical variables and target, I will compare, at each level of a factor, the percentage of individuals who are looking for new jobs with the percentage who are not. Differences in these percentages across levels can imply that the factor is predictive of target.
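A minimal sketch of that per-level comparison, using a hypothetical six-row data frame `toy` rather than the real data:

```r
# Hypothetical miniature: one categorical feature and the target
toy <- data.frame(
  enrolled_university = factor(c("Full time course", "Full time course",
                                 "no_enrollment", "no_enrollment",
                                 "no_enrollment", "no_enrollment")),
  target = factor(c(1, 0, 0, 0, 0, 1), levels = c(0, 1))
)

# Cross-tabulate feature level (rows) against target (columns)
counts <- table(toy$enrolled_university, toy$target)

# margin = 1 normalises each row, giving the share of 0s and 1s within each level
pct <- round(100 * prop.table(counts, margin = 1), 1)
print(pct)
```

Normalising by row rather than over the whole table keeps levels with few observations comparable to levels with many.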

From the charts above, there are very minor differences between the percentages for those who are looking to leave their jobs and those who are not in almost all of our categorical features. The following are the significant differences:

enrolled_university: Those who are enrolled in a full time course are twice as likely to be looking for a new position.
experience: Those with more than 20 years of experience are twice as likely to stay at their current job.

I excluded city from the graphs above because of the number of distinct categories it contains.

Thus far, it appears that the following categorical variables hold little predictive power:

gender
relevent_experience
education_level
major_discipline
company_size
company_type
last_new_job

Missing Data

Visualization of missing values.

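From the above plots, the dataset has no values coded as NA. However, the str() output shows empty strings in several character columns (for example gender, enrolled_university, company_size, and company_type), and these blanks are not counted as missing. I am going to drop company_type, company_size, gender, and major_discipline from the dataset since these factors hold little to no predictive power.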
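Since the str() output shows empty strings in columns such as gender and company_size, a check based on NA alone can miss them. The sketch below, on a hypothetical four-row data frame `toy`, counts missing values before and after recoding blanks as NA.

```r
# Hypothetical miniature with blank strings, as seen in the str() output
toy <- data.frame(
  gender = c("Male", "Male", "", ""),
  company_size = c("", "50-99", "", ""),
  training_hours = c(36L, 47L, 83L, 52L),
  stringsAsFactors = FALSE
)

# Blanks are ordinary strings, so is.na() does not see them
na_before <- colSums(is.na(toy))

# Recode empty strings as NA in character columns only
chr_cols <- vapply(toy, is.character, logical(1))
toy[chr_cols] <- lapply(toy[chr_cols], function(x) {
  x[!is.na(x) & x == ""] <- NA
  x
})

na_after <- colSums(is.na(toy))
print(rbind(na_before, na_after))
```

The same effect can be had at load time by passing na.strings = c("", "NA") to read.csv().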

Modelling

Decision Tree

The target is imbalanced, with around 14k 0s and around 5k 1s in the full dataset. This can bias the model toward the over-represented class. To address this, I will down-sample the training set so that the two classes are split 50-50.

df_train_rec <- recipe(target ~ ., data=df_train) %>%
  themis::step_downsample(target)  # step_downsample() moved from recipes to themis

smp <- df_train_rec %>% 
  prep() %>% 
  bake(new_data=NULL)
table(smp$target)

## 
##    0    1 
## 3582 3582

Below I instantiate the decision tree and fit it to the training set:

dt <- rpart(target~., data=df_train)

I then use the trained model on the test set to evaluate its performance.

dt.predictions <- predict(dt, df_test, type='class')
dt_df = data.frame(y_true=df_test$target, y_pred=dt.predictions)
dt_cm <- confusionMatrix(table(dt_df$y_true, dt_df$y_pred), positive='1')
dt_cm

## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 3219  377
##   1  699  496
##                                           
##                Accuracy : 0.7754          
##                  95% CI : (0.7633, 0.7872)
##     No Information Rate : 0.8178          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3409          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5682          
##             Specificity : 0.8216          
##          Pos Pred Value : 0.4151          
##          Neg Pred Value : 0.8952          
##              Prevalence : 0.1822          
##          Detection Rate : 0.1035          
##    Detection Prevalence : 0.2494          
##       Balanced Accuracy : 0.6949          
##                                           
##        'Positive' Class : 1               
##

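The table was built as table(y_true, y_pred), so its rows are the true labels and its columns are the predictions. Read that way, the model correctly predicted 0 (not looking) 3219 times and 1 (looking) 496 times, giving an overall accuracy of 77.54%. Out of the 1195 data scientists in the test set who were looking to leave their job, the model correctly identified only 496 (41.5%) and missed 699, and it also produced 377 false positives. Because confusionMatrix() treats the rows of a table as predictions, the value it labels Sensitivity (0.5682) is really the precision, 496/873. This model isn’t very useful in practice.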
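As a cross-check, the headline metrics can be recomputed by hand from the four cell counts. This sketch assumes the table is read with true labels in rows and predictions in columns, which is the order passed to table() in the code above; the variable names are illustrative.

```r
# Cell counts from the decision tree table (rows = true label, columns = prediction)
tn <- 3219  # true 0, predicted 0
fp <- 377   # true 0, predicted 1
fn <- 699   # true 1, predicted 0
tp <- 496   # true 1, predicted 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # share of actual job seekers that were caught
specificity <- tn / (tn + fp)  # share of non-seekers correctly left alone
precision   <- tp / (tp + fp)  # share of predicted job seekers that were right

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

Passing the predictions and the true labels to confusionMatrix() through its data and reference arguments, rather than as a pre-built table, avoids having to remember which dimension is which.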

KNN

I will center and scale all predictors, then remove any predictors that have near-zero variance, since such predictors carry almost no information.

## [1] freqRatio     percentUnique zeroVar       nzv          
## <0 rows> (or 0-length row.names)

None of the predictors have near-zero variance.
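To make the two preprocessing steps concrete, here is a base R sketch on hypothetical toy predictors. It re-implements the freqRatio and percentUnique columns shown in the output above rather than calling caret, and the cut-offs (95/5 and 10) are assumed to match caret's defaults for nearZeroVar().

```r
# Hypothetical predictors: one informative, one almost constant
x <- data.frame(
  training_hours = rep(c(36, 47, 83, 52, 8, 24, 24, 18, 46, 123), 10),
  almost_const   = c(rep(0, 99), 1)
)

# Center each column to mean 0 and scale to standard deviation 1
x_scaled <- scale(x)

# freqRatio: count of the most common value over the second most common
freq_ratio <- function(v) {
  counts <- sort(table(v), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1] / counts[2])
}

# percentUnique: distinct values as a percentage of all observations
pct_unique <- function(v) 100 * length(unique(v)) / length(v)

metrics <- data.frame(
  freqRatio     = vapply(x, freq_ratio, numeric(1)),
  percentUnique = vapply(x, pct_unique, numeric(1)),
  row.names     = names(x)
)

# Flag a predictor only when both conditions hold (assumed default cut-offs)
metrics$nzv <- metrics$freqRatio > 95 / 5 & metrics$percentUnique <= 10
print(metrics)
```

Here only almost_const is flagged: training_hours has few distinct values, but no single value dominates.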

I build the model by splitting the data into training and testing sets and then removing enrollee_id.

## [1] 1
## [1] "rectangular"
## [1] "triangular"
## [1] 2
## [1] "rectangular"
## [1] "triangular"

Tuning found that a k value of around 27, with a distance of 2 and a rectangular weighting function, produced the best model, with an accuracy of 77.2%.

This is a rather large value for k, implying that the groups are spread out, so KNN may not be the best method for predicting the target variable. The sensitivity of this model is also very low.

Support Vector Machine

I am going to train an SVM to find the dividing plane between those not looking for a job change and those that are looking for a job change based on the features we have.

The SVM is fit to smp, the down-sampled training set created in the decision tree section.

The base model consists of 4993 support vectors with 2511 assigned to label 0 (not looking for a job change) and 2482 to label 1 (looking for a job change).

I will tune the SVM on the training set to find the best values for gamma and cost, using 10-fold cross-validation.

The best parameters are gamma = 0.5 and cost = 1.
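The tuning step can be sketched with e1071's tune.svm(), the package whose svm() call appears in the output below. This is only an illustration: the two-class subset of iris stands in for smp, and the gamma and cost grids are hypothetical, not the values searched in the original analysis.

```r
library(e1071)

# Two-class toy data standing in for the down-sampled training set
toy <- droplevels(iris[iris$Species != "setosa", ])

set.seed(622)  # cross-validation folds are random
tuned <- tune.svm(
  Species ~ ., data = toy,
  kernel = "radial",
  gamma = c(0.1, 0.5, 1),   # hypothetical grid
  cost  = c(0.1, 1, 10),    # hypothetical grid
  tunecontrol = tune.control(cross = 10)  # 10-fold cross-validation
)

# Best gamma/cost pair and the model refit with it
print(tuned$best.parameters)
best_svm <- tuned$best.model
```

tune.svm() fits one model per fold for every grid point, so the grid is worth keeping small on a training set of several thousand rows.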

## 
## Call:
## svm(formula = target ~ ., data = smp, cost = 1, gamma = 0.5, kernel = "radial", 
##     probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  5784
## 
##  ( 2946 2838 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

With the best model selected, we can evaluate it on the test set:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2697  899
##          1  422  773
##                                           
##                Accuracy : 0.7243          
##                  95% CI : (0.7114, 0.7369)
##     No Information Rate : 0.651           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3502          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4623          
##             Specificity : 0.8647          
##          Pos Pred Value : 0.6469          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.3490          
##          Detection Rate : 0.1613          
##    Detection Prevalence : 0.2494          
##       Balanced Accuracy : 0.6635          
##                                           
##        'Positive' Class : 1               
##

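This model is best at identifying those who will not change jobs, with 2697 true negatives. Read as labeled, the matrix also shows 422 false positives, where the model predicted a job change for individuals who did not change, and 899 false negatives, where the model predicted no job change for individuals who did. Note that the row totals (3596 and 1195) equal the true class counts in the decision tree's test table, which suggests the prediction and reference arguments may have been swapped here; if so, the two error counts trade places.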

Model Comparison

The accuracies of these models may seem satisfactory. However, accuracy is not the focus in this case; the focus is on each model’s sensitivity. Sensitivity tells us what share of the data scientists looking to leave their job were correctly identified by the model. Judged by sensitivity, none of the models is recommended.

In my opinion, one reason the modeling failed is that few features in the dataset have predictive power.

Conclusion

From my analysis and modeling of the given dataset, there is no reliable way of predicting which data scientists are planning to leave their jobs. I recommend adding features with more predictive power to the dataset.

Data 622 Final Project

Trishita Nath

5/22/2022

Overview