You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals.
Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15, to apply to the newly selected dataset. To complete this task:
I will use the HR Analytics: Job Changes of Data Scientists dataset from Kaggle. This dataset aims to determine which data scientists will be looking for a job change.
Features
The target variable is labeled target and can either be 0 indicating the individual is not looking for a job change or 1 indicating the individual is looking for a job change.
I will transform the given dataset and perform exploratory analysis as well as clustering to understand the data, then build models and determine which one is most accurate in predicting which data scientists are looking to leave their jobs.
I will build the following models
## [1] 19158 14
enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8949 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | | | 1 | 36 | 1 |
29725 | city_40 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0 |
11561 | city_21 | 0.624 | | No relevent experience | Full time course | Graduate | STEM | 5 | | | never | 83 | 0 |
33241 | city_115 | 0.789 | | No relevent experience | | Graduate | Business Degree | <1 | | Pvt Ltd | never | 52 | 1 |
666 | city_162 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0 |
21651 | city_176 | 0.764 | | Has relevent experience | Part time course | Graduate | STEM | 11 | | | 1 | 24 | 1 |
## 'data.frame': 19158 obs. of 14 variables:
## $ enrollee_id : int 8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
## $ city : chr "city_103" "city_40" "city_21" "city_115" ...
## $ city_development_index: num 0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
## $ gender : chr "Male" "Male" "" "" ...
## $ relevent_experience : chr "Has relevent experience" "No relevent experience" "No relevent experience" "No relevent experience" ...
## $ enrolled_university : chr "no_enrollment" "no_enrollment" "Full time course" "" ...
## $ education_level : chr "Graduate" "Graduate" "Graduate" "Graduate" ...
## $ major_discipline : chr "STEM" "STEM" "STEM" "Business Degree" ...
## $ experience : chr ">20" "15" "5" "<1" ...
## $ company_size : chr "" "50-99" "" "" ...
## $ company_type : chr "" "Pvt Ltd" "" "Pvt Ltd" ...
## $ last_new_job : chr "1" ">4" "never" "never" ...
## $ training_hours : int 36 47 83 52 8 24 24 18 46 123 ...
## $ target : num 1 0 0 1 0 1 0 1 1 0 ...
There is a mix of numerical and categorical data. Additionally, our target variable target should also be a factor. I will convert categorical column types, as well as the target variable, to factor. In addition, enrollee_id is only used for identification, hence I will not use it during modeling.
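The type conversion described above can be sketched as follows. This is a minimal illustration, not the code used in the report: the small data frame `df` is a stand-in built from a few values in the table above, and it uses base R only.

```r
# Toy stand-in for the first rows of the data (values copied from the table above).
df <- data.frame(
  enrollee_id = c(8949L, 29725L, 11561L),
  city = c("city_103", "city_40", "city_21"),
  city_development_index = c(0.920, 0.776, 0.624),
  gender = c("Male", "Male", ""),
  training_hours = c(36L, 47L, 83L),
  target = c(1, 0, 0),
  stringsAsFactors = FALSE
)

# Drop the identifier: it is only used for identification, not for modeling.
df$enrollee_id <- NULL

# Convert every character column to a factor.
chr_cols <- vapply(df, is.character, logical(1))
df[chr_cols] <- lapply(df[chr_cols], factor)

# The target is stored as numeric 0/1; make it a two-level factor.
df$target <- factor(df$target, levels = c(0, 1))

str(df)
```

Numeric columns such as city_development_index and training_hours are left untouched by this conversion.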
There are two numerical features, city_development_index and training_hours. Below is a visualization of the relationship of each of these variables with target.
From the boxplots above, those who are not looking to leave their jobs mostly live in cities with a high city development index. In the boxplot on the right, for those looking to leave, the median value is much lower and the interquartile range is much wider. This suggests that city_development_index has a strong relationship with target.
From the boxplots above, there is no significant relationship between training_hours and target.
To explore the relationship between the categorical variables and target, I will focus on the percentage of individuals who are looking for new jobs versus those who are not at each level of the factor. Differences between the percentages across levels can imply that the factor and its associated levels are predictive of target.
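The per-level percentages described above can be computed with a row-normalized contingency table. The sketch below uses a made-up toy data frame (`toy`, with a hypothetical `enrolled` column); the values are for illustration only and are not taken from the dataset.

```r
# Hypothetical toy data: the values are made up for illustration.
toy <- data.frame(
  enrolled = c("Full time course", "Full time course", "no_enrollment",
               "no_enrollment", "no_enrollment", "no_enrollment"),
  target = factor(c(1, 0, 0, 0, 1, 0), levels = c(0, 1))
)

# Count individuals per factor level and target class.
counts <- table(toy$enrolled, toy$target)

# margin = 1 normalizes within each row, so every level sums to 100%.
pct <- round(100 * prop.table(counts, margin = 1), 1)
pct
```

Comparing the "1" column across rows shows whether one level has a noticeably higher share of job seekers than another.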
From the charts above, there are very minor differences between the percentages for those who are looking to leave their jobs and those who are not in almost all of our categorical features. The following are the significant differences:
enrolled_university: Those who are enrolled in a full time course are twice as likely to be looking for a new position.
experience: Those with more than 20 years of experience are twice as likely to stay at their current job.
I excluded city from the above graph due to the number of distinct categories it contains.
Thus far, it appears that the following categorical variables hold insignificant predictive power:
gender
relevent_experience
education_level
major_discipline
company_size
company_type
last_new_job
Visualization of missing values.
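From the above plots, the dataset has no NA values. Note, however, that the str() output above shows empty strings in columns such as gender, company_size, and company_type, which the plots do not count as missing. I am going to drop company_type, company_size, gender, and major_discipline from the dataset since these factors hold little to no predictive power.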
The training set is imbalanced, with around 14k 0s and around 5k 1s. This can bias the model toward the over-represented class. To address this, I will down-sample the majority class so that the two classes are split 50-50.
df_train_rec <- recipe(target ~ ., data = df_train) %>%
  step_downsample(target)
## Warning: `step_downsample()` was deprecated in recipes 0.1.13.
## Please use `themis::step_downsample()` instead.
smp <- df_train_rec %>%
  prep() %>%
  bake(new_data = NULL)
table(smp$target)
##
## 0 1
## 3582 3582
Below I am instantiating the decision tree and fitting it with the training set:
dt <- rpart(target~., data=df_train)
I then use the trained model on the test set to evaluate its performance.
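dt.predictions <- predict(dt, df_test, type='class')
dt_df <- data.frame(y_true=df_test$target, y_pred=dt.predictions)
# Note: this table has actual classes in rows and predictions in columns,
# whereas confusionMatrix() reads a table's rows as predictions.
dt_cm <- confusionMatrix(table(dt_df$y_true, dt_df$y_pred), positive='1')
dt_cm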
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 3219 377
## 1 699 496
##
## Accuracy : 0.7754
## 95% CI : (0.7633, 0.7872)
## No Information Rate : 0.8178
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3409
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5682
## Specificity : 0.8216
## Pos Pred Value : 0.4151
## Neg Pred Value : 0.8952
## Prevalence : 0.1822
## Detection Rate : 0.1035
## Detection Prevalence : 0.2494
## Balanced Accuracy : 0.6949
##
## 'Positive' Class : 1
##
From the confusion matrix, the model correctly predicted No 3219 times and Yes 496 times, giving an overall accuracy of 77.54%. Because the table was built as table(y_true, y_pred), its rows are the actual classes: of the 1,195 data scientists who were looking to leave their job, the model correctly identified only 496 (about 41.5%) and missed 699, and it also produced 377 false positives. This model isn’t very useful in practice.
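As a cross-check, the headline metrics can be recomputed by hand from the four counts. The sketch below assumes the orientation implied by the code above, table(dt_df$y_true, dt_df$y_pred), so rows are actual classes and columns are predictions; the variable names are illustrative.

```r
# Counts copied from the decision tree table above
# (assumed orientation: rows = actual class, columns = predicted class).
cm <- matrix(c(3219, 377,
               699, 496),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

tp <- cm["1", "1"]; fn <- cm["1", "0"]
fp <- cm["0", "1"]; tn <- cm["0", "0"]

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of actual job seekers the model found
precision   <- tp / (tp + fp)   # share of flagged individuals who were job seekers
specificity <- tn / (tn + fp)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        precision = precision, specificity = specificity), 4)
```

Under this orientation, the value printed as "Pos Pred Value" in the output above corresponds to sensitivity, and the value printed as "Sensitivity" corresponds to precision.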
I will center and scale all predictors, then remove any predictors that have near-zero variance, since such predictors carry almost no information.
## [1] freqRatio percentUnique zeroVar nzv
## <0 rows> (or 0-length row.names)
None of the predictors have near-zero variance.
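A minimal sketch of this check and the centering and scaling step, assuming the caret package is used (the column names of the output above match caret's nearZeroVar with saveMetrics = TRUE). The toy data frame and its `constant` column are hypothetical and exist only to show a predictor that would be flagged.

```r
library(caret)

# Hypothetical toy predictors: `constant` never varies and should be flagged.
set.seed(1)
toy <- data.frame(
  city_development_index = runif(100, 0.4, 0.95),
  training_hours = rpois(100, 60),
  constant = rep(1, 100)
)

# saveMetrics = TRUE returns freqRatio, percentUnique, zeroVar and nzv per column.
nzv_metrics <- nearZeroVar(toy, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]

# Keep only the columns that were not flagged, then center and scale them.
kept <- toy[, !nzv_metrics$nzv, drop = FALSE]
pp <- preProcess(kept, method = c("center", "scale"))
scaled <- predict(pp, kept)

# After scaling, each column has mean ~0 and standard deviation ~1.
round(colMeans(scaled), 10)
```

An empty result from the filtered metrics table, as in the output above, means no predictor was flagged.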
I am building the model by splitting the data into training and testing sets and then removing enrollee_id.
## [1] 1
## [1] "rectangular"
## [1] "triangular"
## [1] 2
## [1] "rectangular"
## [1] "triangular"
The model found that a k value of around 27 with a distance of 2 and a rectangular weighting function produced the best model, with an accuracy of 77.2%.
This is a rather large value of k, implying that the groups are spread out and hence that KNN may not be the best method for predicting the target variable. The sensitivity of this model is also very low.
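One way such a search over k, distance, and weighting function could be set up is with caret's "kknn" method, whose tuning parameters are kmax, distance, and kernel. This is a sketch under assumptions: the report's actual tuning code is not shown, the grid values below are illustrative, and the toy data stands in for the real training set. It requires the caret and kknn packages.

```r
library(caret)   # train() wraps the kknn package when method = "kknn"

set.seed(42)
# Hypothetical two-class toy data standing in for the real training set.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$target <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0, "1", "0"))

# Illustrative grid: distance 1 (Manhattan) and 2 (Euclidean),
# with rectangular (unweighted) and triangular weighting.
grid <- expand.grid(
  kmax = c(5, 15, 27),
  distance = c(1, 2),
  kernel = c("rectangular", "triangular")
)

knn_fit <- train(
  target ~ ., data = toy,
  method = "kknn",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = grid
)

# Best combination found by cross-validated accuracy.
knn_fit$bestTune
```

The full table of cross-validated accuracies for every grid row is available in knn_fit$results.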
I am going to train an SVM to find the dividing plane between those not looking for a job change and those that are looking for a job change based on the features we have.
The down-sampled training set smp, prepared in the decision tree section, is used to fit an SVM.
The base model consists of 4993 support vectors with 2511 assigned to label 0 (not looking for a job change) and 2482 to label 1 (looking for a job change).
I will tune the SVM on the training set to find the best values for gamma and cost, using 10-fold cross-validation.
The best parameters are gamma = 0.5 and cost = 1.
##
## Call:
## svm(formula = target ~ ., data = smp, cost = 1, gamma = 0.5, kernel = "radial",
## probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 5784
##
## ( 2946 2838 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
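The 10-fold tuning over gamma and cost described above could be sketched with e1071's tune.svm. This is an illustration under assumptions, not the report's code: the candidate values for gamma and cost are made up, and the toy data stands in for smp.

```r
library(e1071)

set.seed(42)
# Hypothetical toy data with a non-linear class boundary.
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$target <- factor(ifelse(toy$x1^2 + toy$x2^2 > 1.5, 1, 0))

# Grid search over gamma and cost, scored by 10-fold cross-validation.
tuned <- tune.svm(
  target ~ ., data = toy,
  kernel = "radial",
  gamma = c(0.1, 0.5, 1),
  cost = c(0.1, 1, 10),
  tunecontrol = tune.control(sampling = "cross", cross = 10)
)

tuned$best.parameters         # gamma and cost with the lowest CV error
best_svm <- tuned$best.model  # SVM refit on all rows with those values
```

The cross-validated error for each gamma and cost pair is stored in tuned$performances, which is useful for checking whether the best value sits at the edge of the grid.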
Now with the best model we can try it against the test set:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2697 899
## 1 422 773
##
## Accuracy : 0.7243
## 95% CI : (0.7114, 0.7369)
## No Information Rate : 0.651
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3502
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4623
## Specificity : 0.8647
## Pos Pred Value : 0.6469
## Neg Pred Value : 0.7500
## Prevalence : 0.3490
## Detection Rate : 0.1613
## Detection Prevalence : 0.2494
## Balanced Accuracy : 0.6635
##
## 'Positive' Class : 1
##
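This model is most accurate at predicting someone who will not change jobs, with 2697 true negatives. Reading the matrix as labeled, there are also 422 false positives, where the model predicted that individuals would change jobs when they did not, and 899 false negatives, where the model predicted that individuals would not change jobs when they did. Note that the rows sum to 3,596 and 1,195, the same actual-class totals as in the decision tree table, which suggests that predictions and reference may be swapped here; if so, the 422 and 899 counts trade roles and the true sensitivity would be 773/1195, about 64.7%.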
The accuracies of these models may seem satisfactory. However, accuracy is not the focus in this case; rather, we focus on the models’ sensitivity. Sensitivity tells us how many of the data scientists looking to leave their jobs were correctly identified by our model. Looking at sensitivity, none of the models is recommended.
In my opinion, one reason the modeling failed is that few features in the dataset have predictive power.
From my analysis and modeling of the given dataset, there is no direct way of predicting which data scientists are planning to leave their jobs. I recommend adding more features with greater predictive power to the dataset.