This report describes the prediction of job change for data scientists using machine learning algorithms. We investigated four algorithms: Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine (SVM).
The dataset used in this report is HR Analytics: Job Change of Data Scientists, hosted on Kaggle.
Report Outline:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
The dataset is downloaded from Kaggle and saved in the data folder. We used read.csv() to read the dataset into the data frame df, with the na.strings argument converting all blank values ("") to NA.
df = read.csv("data/aug_train.csv", na.strings=c("","NA"))
We used the head() function to view the first six rows of the data frame.
head(df)
## enrollee_id city city_development_index gender relevent_experience
## 1 8949 city_103 0.920 Male Has relevent experience
## 2 29725 city_40 0.776 Male No relevent experience
## 3 11561 city_21 0.624 <NA> No relevent experience
## 4 33241 city_115 0.789 <NA> No relevent experience
## 5 666 city_162 0.767 Male Has relevent experience
## 6 21651 city_176 0.764 <NA> Has relevent experience
## enrolled_university education_level major_discipline experience company_size
## 1 no_enrollment Graduate STEM >20 <NA>
## 2 no_enrollment Graduate STEM 15 50-99
## 3 Full time course Graduate STEM 5 <NA>
## 4 <NA> Graduate Business Degree <1 <NA>
## 5 no_enrollment Masters STEM >20 50-99
## 6 Part time course Graduate STEM 11 <NA>
## company_type last_new_job training_hours target
## 1 <NA> 1 36 1
## 2 Pvt Ltd >4 47 0
## 3 <NA> never 83 0
## 4 Pvt Ltd never 52 1
## 5 Funded Startup 4 8 0
## 6 <NA> 1 24 1
To see the number of rows and columns, we used the dim() function. The dataset has 19158 rows and 14 columns.
dim(df)
## [1] 19158 14
To see the column names and types, we used the str() function.
str(df)
## 'data.frame': 19158 obs. of 14 variables:
## $ enrollee_id : int 8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
## $ city : chr "city_103" "city_40" "city_21" "city_115" ...
## $ city_development_index: num 0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
## $ gender : chr "Male" "Male" NA NA ...
## $ relevent_experience : chr "Has relevent experience" "No relevent experience" "No relevent experience" "No relevent experience" ...
## $ enrolled_university : chr "no_enrollment" "no_enrollment" "Full time course" NA ...
## $ education_level : chr "Graduate" "Graduate" "Graduate" "Graduate" ...
## $ major_discipline : chr "STEM" "STEM" "STEM" "Business Degree" ...
## $ experience : chr ">20" "15" "5" "<1" ...
## $ company_size : chr NA "50-99" NA NA ...
## $ company_type : chr NA "Pvt Ltd" NA "Pvt Ltd" ...
## $ last_new_job : chr "1" ">4" "never" "never" ...
## $ training_hours : int 36 47 83 52 8 24 24 18 46 123 ...
## $ target : num 1 0 0 1 0 1 0 1 1 0 ...
From the result above, we know the following:
1. enrollee_id is a unique ID for each candidate. It is an unnecessary column and will be removed later.
2. city is a city code. It should be a categorical variable, but its current type is char; it will be converted to factor later.
3. city_development_index is the development index of the city (scaled); its type is numeric.
4. gender is the gender of the candidate. It should be categorical and will be converted to factor later.
5. relevent_experience is the relevant experience of the candidate; it will be converted to factor later.
6. enrolled_university is the type of university course enrolled in, if any; it will be converted to factor later.
7. education_level is the education level of the candidate; it will be converted to factor later.
8. major_discipline is the education major discipline of the candidate; it will be converted to factor later.
9. experience is the candidate's total experience in years; it will be converted to factor later.
10. company_size is the number of employees in the current employer's company; it will be converted to factor later.
11. company_type is the type of the current employer; it will be converted to factor later.
12. last_new_job is the difference in years between the previous job and the current job; it will be converted to factor later.
13. training_hours is the number of training hours completed; its type is integer.
14. target is the variable we want to predict: the probability that a candidate will work for the company.
# Remove enrollee_id
df$enrollee_id = NULL
To see how many categories each column has, we used the count() function. The city column alone has 123 categories (see the output below), so we will not include city in the plots.
library(plyr)
dim(count(df$city))
## [1] 123 2
Now we create histograms of every categorical variable to check the missing data.
We first create a helper function for plotting.
library(ggplot2)
library(gridExtra)
# Plotting function: bar chart of a column, filled by target class
ploting = function(x_column, label){
  ggplot(df, aes(x = x_column, fill = as.factor(target))) +
    geom_bar() +
    labs(title = paste(label, "histogram"), x = label, fill = "target") +
    theme(legend.position = "top",
          plot.title = element_blank())  # element_blank() hides the title set in labs()
}
p1 = ploting(df$gender, "Gender")
p2 = ploting(df$relevent_experience, "Relevent Experience")
p3 = ploting(df$enrolled_university, "Enrolled University")
grid.arrange(p1, p2, p3, ncol = 1, nrow = 3)
p4 = ploting(df$education_level, "Education Level")
p5 = ploting(df$major_discipline, "Major Discipline")
p6 = ploting(df$experience, "Experience")
grid.arrange(p4, p5, p6, ncol = 1, nrow = 3)
p7 = ploting(df$company_size, "Company Size")
p8 = ploting(df$company_type, "Company Type")
p9 = ploting(df$last_new_job, "Last New Job")
grid.arrange(p7, p8, p9, ncol = 1, nrow = 3)
From the plots above, we can conclude that eight of the nine plotted variables contain NA values.
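As a quick numeric check (a minimal sketch using base R, not part of the original pipeline), we can count the NA values per column:
# Count missing values in each column; company_size and company_type dominate
colSums(is.na(df))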
We saw earlier that company_size and company_type have far more missing values than the other columns. So we first fill the missing values of gender, experience, major_discipline, education_level, enrolled_university, and last_new_job, because those columns are then used to predict the NA values of company_size and company_type.
Here we fill the missing values of gender, experience, major_discipline, education_level, enrolled_university, and last_new_job with each column's mode.
library(dplyr)
# Helper: compute the mode (most frequent value) of a vector
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
## Fill missing values with the mode of each column
df = df %>%
  mutate(gender = replace(gender, is.na(gender), getmode(gender)),
         experience = replace(experience, is.na(experience), getmode(experience)),
         major_discipline = replace(major_discipline, is.na(major_discipline),
                                    getmode(major_discipline)),
         education_level = replace(education_level, is.na(education_level),
                                   getmode(education_level)),
         enrolled_university = replace(enrolled_university, is.na(enrolled_university),
                                       getmode(enrolled_university)),
         last_new_job = replace(last_new_job, is.na(last_new_job),
                                getmode(last_new_job)))
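As a toy illustration of getmode (a sketch, not part of the pipeline):
getmode(c("a", "b", "a", "c"))  # returns "a", the most frequent value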
We change the data type to factor for gender, experience, major_discipline, education_level, enrolled_university, last_new_job, relevent_experience, and city.
# Change Factor
df$gender = as.factor(df$gender)
df$last_new_job = as.factor(df$last_new_job)
df$education_level = as.factor(df$education_level)
df$experience = as.factor(df$experience)
df$relevent_experience = as.factor(df$relevent_experience)
df$major_discipline = as.factor(df$major_discipline)
df$enrolled_university = as.factor(df$enrolled_university)
df$city = as.factor(df$city)
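An equivalent, more compact conversion (a sketch covering the same columns as above):
# Convert all listed categorical columns to factor in one pass
factor_cols = c("gender", "last_new_job", "education_level", "experience",
                "relevent_experience", "major_discipline",
                "enrolled_university", "city")
df[factor_cols] = lapply(df[factor_cols], as.factor)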
In this step, we predict the missing values of company_type with a classification tree, removing company_size from the predictors (it still has NAs of its own).
After company_type is filled, we convert it to factor.
library(party)
df_without_size = subset(df, select = -company_size)
na_idx = which(is.na(df_without_size$company_type), arr.ind=TRUE)
train_without_size = df_without_size[-na_idx, ]
test_without_size = df_without_size[na_idx, ]
train_without_size$company_type = as.factor(train_without_size$company_type)
# predict missing value company type
fit_ctree = ctree(formula = company_type ~ ., data = train_without_size)
pred_company_type = predict(fit_ctree, test_without_size)
# Convert the factor predictions to a character vector
v = as.character(pred_company_type)
df = df %>%
mutate(company_type = replace(company_type,
is.na(company_type),
v))
df$company_type = as.factor(df$company_type)
In the last imputation step, we predict the missing values of company_size with a classification tree, this time keeping the now-complete company_type as a predictor.
After company_size is filled, we convert it to factor.
na_idx = which(is.na(df$company_size), arr.ind=TRUE)
train_size_without_na = df[-na_idx, ]
test_size_with_na = df[na_idx, ]
train_size_without_na$company_size = as.factor(train_size_without_na$company_size)
# predict missing values of company_size
fit_ctree = ctree(formula = company_size ~ ., data = train_size_without_na)
pred_size_type1 = predict(fit_ctree, test_size_with_na)
# Convert the factor predictions to a character vector
v = as.character(pred_size_type1)
df = df %>%
mutate(company_size = replace(company_size,
is.na(company_size),
v))
df$company_size = as.factor(df$company_size)
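As a sanity check (a minimal sketch), no missing values should remain after the two imputations:
sum(is.na(df))  # expected: 0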
We removed the city variable. This decision was based on discussion with others and on experiments removing columns one at a time.
df$city = NULL
We create a function to one-hot encode the data frame df, producing two encoded data frames: df_encode_for_pca (used for PCA) and df_encode (encoded features plus the target).
library(caret)
dummy = function(data_type){
  # One-hot encode with caret::dummyVars.
  # Note: after removing enrollee_id and city, df has exactly 12 columns,
  # so df[, 1:12] is the whole data frame, including the target column.
  dmy = dummyVars(" ~ .", data = df[, 1:12])
  df_encode_for_pca = data.frame(predict(dmy, newdata = df[, 1:12]))
  df_encode = df_encode_for_pca
  df_encode$target = df$target
  if (data_type == "pca") {
    return(df_encode_for_pca)
  } else if (data_type == "encode") {
    return(df_encode)
  }
}
df_encode = dummy("encode")
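A quick shape check (a sketch, assuming the encoding above): the encoded frame should have 64 columns, matching the 64 principal components reported below.
dim(df_encode)  # expected: 19158 rows, 64 columns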
Next, we transform the encoded data frame into a PCA data frame.
We create the function get_pca to determine how many principal components we need.
get_pca = function(df_encode_for_pca, type){
  pr_out = prcomp(df_encode_for_pca, scale = TRUE)
  pr_var = pr_out$sdev ^ 2    # variance of each principal component
  pve = pr_var / sum(pr_var)  # proportion of variance explained (PVE)
  if (type == "count") {
    plot(pve,
         xlab = "Principal Component",
         ylab = "Proportion of Variance Explained",
         ylim = c(0, 1), type = "b")
    plot(cumsum(pve),
         xlab = "Principal Component",
         ylab = "Cumulative Proportion of Variance Explained",
         ylim = c(0, 1), type = "b")
    return(pve)
  } else if (type == "result") {
    return(pr_out)
  }
}
We create the function pca to transform the data and return the new data frame.
pca = function(df_encode_for_pca, size){
pr_out = get_pca(df_encode_for_pca = df_encode_for_pca, type = "result")
df_transform = pr_out$x[, 1:size]
df_transform = cbind(df_transform, target = df$target)
df_transform = as.data.frame(df_transform)
return(df_transform)
}
To see how many components we need, we used the get_pca function. We keep 55 principal components, because the first 55 explain essentially all of the variance; the PVE of the remaining components is numerically zero (around 1e-29, see below).
get_pca(df_encode_for_pca = df_encode, type = "count")
## [1] 5.818281e-02 3.514174e-02 3.439676e-02 3.167619e-02 2.849552e-02
## [6] 2.548207e-02 2.331527e-02 2.203020e-02 2.148684e-02 2.054113e-02
## [11] 2.024944e-02 1.871574e-02 1.849730e-02 1.811017e-02 1.776721e-02
## [16] 1.750449e-02 1.722536e-02 1.715866e-02 1.699786e-02 1.682313e-02
## [21] 1.672160e-02 1.669305e-02 1.661753e-02 1.654004e-02 1.648518e-02
## [26] 1.636428e-02 1.631785e-02 1.630282e-02 1.623661e-02 1.617872e-02
## [31] 1.607596e-02 1.601509e-02 1.597434e-02 1.589711e-02 1.587992e-02
## [36] 1.586938e-02 1.574873e-02 1.572035e-02 1.560815e-02 1.548359e-02
## [41] 1.545455e-02 1.536186e-02 1.526094e-02 1.512123e-02 1.509058e-02
## [46] 1.482443e-02 1.440615e-02 1.382816e-02 1.357677e-02 1.350540e-02
## [51] 1.244944e-02 1.147335e-02 1.055545e-02 9.065626e-03 7.497900e-03
## [56] 6.627407e-29 2.832323e-29 2.465461e-29 1.405880e-29 5.006776e-30
## [61] 2.519493e-30 1.603192e-30 7.846857e-31 1.855042e-32
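As a check (a minimal sketch reusing get_pca), the first 55 components should explain essentially all of the variance:
pve = get_pca(df_encode_for_pca = dummy("pca"), type = "count")
sum(pve[1:55])  # ~1: the PVE of components 56-64 is numerically zero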
We set size = 55, i.e. we keep 55 principal components, and save the result into df_pca.
df_pca = pca(df_encode_for_pca = dummy("pca"), 55)
We use set.seed(2021) for reproducible results, with a train:test ratio of 80:20.
We create a function to split the data.
divide_data = function(data, result){
n = nrow(data)
set.seed(2021)
train_index = sample(n, 0.8 * n)
df_train = data[train_index, ]
df_test = data[-train_index, ]
if (result == "train") {
return(df_train)
}else if (result == "test") {
return(df_test)
}
}
We call the function to get the training and test data for each of our three data frames: df (raw), df_encode (encoded), and df_pca (PCA).
df_train_raw = divide_data(data = df, result = "train")
df_test_raw = divide_data(data = df, result = "test")
df_train_encode = divide_data(data = df_encode, result = "train")
df_test_encode = divide_data(data = df_encode, result = "test")
df_train_pca = divide_data(data = df_pca, result = "train")
df_test_pca = divide_data(data = df_pca, result = "test")
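A quick check of the split sizes (a sketch): sample() truncates the non-integer size 0.8 * 19158 = 15326.4 to 15326, leaving 3832 test rows, which matches the confusion-matrix totals below.
nrow(df_train_raw)  # expected: 15326
nrow(df_test_raw)   # expected: 3832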
We apply four machine learning algorithms, training each on all three data frames.
forest = function(train, test){
  library(randomForest)
  set.seed(2021)
  # target is numeric, so randomForest fits a regression forest;
  # predictions are thresholded at 0.5 to obtain class labels
  fit_forest = randomForest(formula = target ~ ., data = train,
                            importance = TRUE)
  pred_forest <- predict(fit_forest, test, type = "response")
  pred_forest = factor(pred_forest > 0.5, c(FALSE, TRUE), labels = c(0, 1))
  perf_forest <- table(test$target, pred_forest,
                       dnn = c("Actual", "Predicted"))
  return(perf_forest)
}
perf_forest_pca = forest(train = df_train_pca, test = df_test_pca)
perf_forest_encode = forest(train = df_train_encode, test = df_test_encode)
perf_forest_raw = forest(train = df_train_raw, test = df_test_raw)
func_tree = function(train, test){
  library(party)
  # ctree on a numeric target fits a regression tree;
  # predictions are thresholded at 0.5 to obtain class labels
  fit_tree = ctree(formula = target ~ ., data = train)
  prob_ctree = predict(fit_tree, test)
  pred_ctree = factor(prob_ctree > 0.5, c(FALSE, TRUE), labels = c(0, 1))
  perf_ctree <- table(test$target, pred_ctree,
                      dnn = c("Actual", "Predicted"))
  return(perf_ctree)
}
perf_ctree_pca = func_tree(train = df_train_pca, test = df_test_pca)
perf_ctree_encode = func_tree(train = df_train_encode, test = df_test_encode)
perf_ctree_raw = func_tree(train = df_train_raw, test = df_test_raw)
func_svm = function(train, test){
  library(e1071)
  # svm on a numeric target fits eps-regression;
  # predictions are thresholded at 0.5 to obtain class labels
  fit_svm = svm(formula = target ~ ., data = train)
  prob_svm = predict(fit_svm, test)
  pred_svm = factor(prob_svm > 0.5, c(FALSE, TRUE), labels = c(0, 1))
  perf_svm <- table(test$target, pred_svm,
                    dnn = c("Actual", "Prediction"))
  return(perf_svm)
}
perf_svm_pca = func_svm(train = df_train_pca, test = df_test_pca )
perf_svm_encode = func_svm(train = df_train_encode, test = df_test_encode )
perf_svm_raw = func_svm(train = df_train_raw, test = df_test_raw )
func_logit = function(train, test){
  # Note: glm() defaults to family = gaussian, so this fits a linear
  # probability model; a true logistic regression would pass
  # family = binomial. Predictions are thresholded at 0.5.
  fit_logit = glm(formula = target ~ ., data = train)
  prob_logit = predict(fit_logit, test)
  pred_logit = factor(prob_logit > 0.5, c(FALSE, TRUE), labels = c(0, 1))
  perf_logit = table(test$target, pred_logit,
                     dnn = c("Actual", "Prediction"))
  return(perf_logit)
}
perf_logit_pca = func_logit(train = df_train_pca, test = df_test_pca )
perf_logit_encode = func_logit(train = df_train_encode, test = df_test_encode )
perf_logit_raw = func_logit(train = df_train_raw, test = df_test_raw )
To compute accuracy, precision, recall, and F1 score, we created the perform() function.
# Compute evaluation metrics from a confusion matrix
perform = function(table, n = 2, check, data_model){
  TN = table[1,1]
  TP = table[2,2]
  FN = table[2,1]
  FP = table[1,2]
  accuracy = (TP + TN) / (TP + FP + FN + TN)
  precision = TP / (FP + TP)
  recall = TP / (FN + TP)
  f1 = 2 * precision * recall / (precision + recall)
  # Formatted metric scores
  result = paste(data_model, "->",
                 "Accuracy =", round(accuracy, n),
                 ",Precision =", round(precision, n),
                 ",Recall =", round(recall, n),
                 ",F1 Score =", round(f1, n))
  # Raw confusion-matrix counts ("hasil" is Indonesian for "result")
  hasil = paste(data_model, "->",
                "TP :", TP,
                ",TN :", TN,
                ",FP :", FP,
                ",FN :", FN)
  if(check == "score"){
    result
  } else if(check == "value") {
    hasil
  } else if(check == "all"){
    print(result)
    print(hasil)
  }
}
# RESULT RANDOM FOREST
perform(perf_forest_pca, check = "all", data_model = "PCA")
## [1] "PCA -> Accuracy = 0.99 ,Precision = 1 ,Recall = 0.97 ,F1 Score = 0.98"
## [1] "PCA -> TP : 923 ,TN : 2879 ,FP : 0 ,FN : 30"
perform(perf_forest_encode, check = "all", data_model = "Encode")
## [1] "Encode -> Accuracy = 0.77 ,Precision = 0.55 ,Recall = 0.4 ,F1 Score = 0.46"
## [1] "Encode -> TP : 377 ,TN : 2575 ,FP : 304 ,FN : 576"
perform(perf_forest_raw, check = "all", data_model = "Raw")
## [1] "Raw -> Accuracy = 0.77 ,Precision = 0.58 ,Recall = 0.34 ,F1 Score = 0.43"
## [1] "Raw -> TP : 322 ,TN : 2645 ,FP : 234 ,FN : 631"
# RESULT CTREE
perform(perf_ctree_pca, check = "all", data_model = "PCA")
## [1] "PCA -> Accuracy = 0.96 ,Precision = 0.94 ,Recall = 0.91 ,F1 Score = 0.93"
## [1] "PCA -> TP : 870 ,TN : 2822 ,FP : 57 ,FN : 83"
perform(perf_ctree_encode, check = "all", data_model = "Encode")
## [1] "Encode -> Accuracy = 0.78 ,Precision = 0.58 ,Recall = 0.42 ,F1 Score = 0.49"
## [1] "Encode -> TP : 402 ,TN : 2589 ,FP : 290 ,FN : 551"
perform(perf_ctree_raw, check = "all", data_model = "Raw")
## [1] "Raw -> Accuracy = 0.78 ,Precision = 0.57 ,Recall = 0.43 ,F1 Score = 0.49"
## [1] "Raw -> TP : 407 ,TN : 2570 ,FP : 309 ,FN : 546"
# RESULT SUPPORT VECTOR MACHINE (SVM)
perform(perf_svm_pca, check = "all", data_model = "PCA")
## [1] "PCA -> Accuracy = 1 ,Precision = 1 ,Recall = 0.99 ,F1 Score = 1"
## [1] "PCA -> TP : 944 ,TN : 2879 ,FP : 0 ,FN : 9"
perform(perf_svm_encode, check = "all", data_model = "Encode")
## [1] "Encode -> Accuracy = 0.76 ,Precision = 0.55 ,Recall = 0.18 ,F1 Score = 0.27"
## [1] "Encode -> TP : 173 ,TN : 2736 ,FP : 143 ,FN : 780"
perform(perf_svm_raw, check = "all", data_model = "Raw")
## [1] "Raw -> Accuracy = 0.76 ,Precision = 0.54 ,Recall = 0.21 ,F1 Score = 0.31"
## [1] "Raw -> TP : 203 ,TN : 2708 ,FP : 171 ,FN : 750"
# RESULT LOGIT
perform(perf_logit_pca, check = "all", data_model = "PCA")
## [1] "PCA -> Accuracy = 1 ,Precision = 1 ,Recall = 1 ,F1 Score = 1"
## [1] "PCA -> TP : 953 ,TN : 2879 ,FP : 0 ,FN : 0"
perform(perf_logit_encode, check = "all", data_model = "Encode")
## [1] "Encode -> Accuracy = 0.76 ,Precision = 0.54 ,Recall = 0.25 ,F1 Score = 0.35"
## [1] "Encode -> TP : 241 ,TN : 2677 ,FP : 202 ,FN : 712"
perform(perf_logit_raw, check = "all", data_model = "Raw")
## [1] "Raw -> Accuracy = 0.76 ,Precision = 0.54 ,Recall = 0.25 ,F1 Score = 0.35"
## [1] "Raw -> TP : 241 ,TN : 2677 ,FP : 202 ,FN : 712"
On the PCA data, Logistic Regression is the best of the tested algorithms, with accuracy = 1, precision = 1, recall = 1, and F1 score = 1. One caveat: the PCA features were built from df[, 1:12], which includes the target column, so the near-perfect PCA results should be interpreted with caution; on the encoded and raw data, the decision tree (ctree) achieves the best F1 score (0.49).