This report describes the prediction of job change for data scientists using machine learning algorithms. We investigated four algorithms: Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine (SVM).
The dataset used in this report is HR Analytics: Job Change of Data Scientists, hosted on Kaggle.
Report Outline:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
The dataset is downloaded from Kaggle and saved in the data folder. We used read.csv() to read the dataset into the data frame df, with the na.strings argument converting all blank values ("") to NA.
df = read.csv("data/aug_train.csv", na.strings=c("","NA"))
We used the head() function to view the first six rows of the data frame.
head(df)
## enrollee_id city city_development_index gender relevent_experience
## 1 8949 city_103 0.920 Male Has relevent experience
## 2 29725 city_40 0.776 Male No relevent experience
## 3 11561 city_21 0.624 <NA> No relevent experience
## 4 33241 city_115 0.789 <NA> No relevent experience
## 5 666 city_162 0.767 Male Has relevent experience
## 6 21651 city_176 0.764 <NA> Has relevent experience
## enrolled_university education_level major_discipline experience company_size
## 1 no_enrollment Graduate STEM >20 <NA>
## 2 no_enrollment Graduate STEM 15 50-99
## 3 Full time course Graduate STEM 5 <NA>
## 4 <NA> Graduate Business Degree <1 <NA>
## 5 no_enrollment Masters STEM >20 50-99
## 6 Part time course Graduate STEM 11 <NA>
## company_type last_new_job training_hours target
## 1 <NA> 1 36 1
## 2 Pvt Ltd >4 47 0
## 3 <NA> never 83 0
## 4 Pvt Ltd never 52 1
## 5 Funded Startup 4 8 0
## 6 <NA> 1 24 1
To see the number of rows and columns, we used the dim() function. The dataset has 19158 rows and 14 columns.
dim(df)
## [1] 19158 14
To see the column names and types, we used the str() function.
str(df)
## 'data.frame': 19158 obs. of 14 variables:
## $ enrollee_id : int 8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
## $ city : chr "city_103" "city_40" "city_21" "city_115" ...
## $ city_development_index: num 0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
## $ gender : chr "Male" "Male" NA NA ...
## $ relevent_experience : chr "Has relevent experience" "No relevent experience" "No relevent experience" "No relevent experience" ...
## $ enrolled_university : chr "no_enrollment" "no_enrollment" "Full time course" NA ...
## $ education_level : chr "Graduate" "Graduate" "Graduate" "Graduate" ...
## $ major_discipline : chr "STEM" "STEM" "STEM" "Business Degree" ...
## $ experience : chr ">20" "15" "5" "<1" ...
## $ company_size : chr NA "50-99" NA NA ...
## $ company_type : chr NA "Pvt Ltd" NA "Pvt Ltd" ...
## $ last_new_job : chr "1" ">4" "never" "never" ...
## $ training_hours : int 36 47 83 52 8 24 24 18 46 123 ...
## $ target : num 1 0 0 1 0 1 0 1 1 0 ...
From the result above, we know the following:
1. enrollee_id is a unique ID for each candidate. It is an unnecessary column and will be removed later.
2. city is a city code. It should be a categorical variable, but its current type is char; it will be converted to factor later.
3. city_development_index is the development index of the city (scaled); its type is numeric.
4. gender is the gender of the candidate. It should be categorical and will be converted to factor later.
5. relevent_experience is the relevant experience of the candidate; it will be converted to factor later.
6. enrolled_university is the type of university course enrolled in, if any; it will be converted to factor later.
7. education_level is the education level of the candidate; it will be converted to factor later.
8. major_discipline is the education major discipline of the candidate; it will be converted to factor later.
9. experience is the candidate's total experience in years; it will be converted to factor later.
10. company_size is the number of employees in the current employer's company; it will be converted to factor later.
11. company_type is the type of the current employer; it will be converted to factor later.
12. last_new_job is the difference in years between the previous job and the current job; it will be converted to factor later.
13. training_hours is the number of training hours completed; its type is integer.
14. target is the variable we want to predict: the probability that a candidate will work for the company.
# Remove enrollee_id
df$enrollee_id = NULL
To see how many categories each column has, we used the count() function. The city column alone has 123 categories (see the output below), so we will not include city in the plots.
library(plyr)
dim(count(df$city))
## [1] 123 2
Now we create histograms of every categorical variable to check the missing data.
We first create a helper function for plotting.
library(ggplot2)
library(gridExtra)
# Plotting function: bar chart of a column, filled by target class
ploting = function(x_column, label){
  ggplot(df, aes(x = x_column, fill = as.factor(target))) +
    geom_bar() +
    labs(title = paste(label, "histogram"), x = label, fill = "target") +
    theme(legend.position = "top",
          plot.title = element_blank())  # element_blank() hides the title set in labs()
}
p1 = ploting(df$gender, "Gender")
p2 = ploting(df$relevent_experience, "Relevent Experience")
p3 = ploting(df$enrolled_university, "Enrolled University")
grid.arrange(p1, p2, p3, ncol = 1, nrow = 3)
p4 = ploting(df$education_level, "Education Level")
p5 = ploting(df$major_discipline, "Major Discipline")
p6 = ploting(df$experience, "Experience")
grid.arrange(p4, p5, p6, ncol = 1, nrow = 3)
p7 = ploting(df$company_size, "Company Size")
p8 = ploting(df$company_type, "Company Type")
p9 = ploting(df$last_new_job, "Last New Job")
grid.arrange(p7, p8, p9, ncol = 1, nrow = 3)
From the plots above, we can conclude that eight of the nine plotted variables contain NA values.
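As a quick numeric check (a minimal sketch using base R, not part of the original pipeline), we can count the NA values per column:
# Count missing values in each column; company_size and company_type dominate
colSums(is.na(df))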
We saw earlier that company_size and company_type have far more missing values than the other columns. So we first fill the missing values of gender, experience, major_discipline, education_level, enrolled_university, and last_new_job, because those columns are then used to predict the NA values of company_size and company_type.
Here we fill the missing values of gender, experience, major_discipline, education_level, enrolled_university, and last_new_job with each column's mode.
library(dplyr)
# Helper: compute the mode (most frequent value) of a vector
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
## Fill missing values with the mode of each column
df = df %>%
  mutate(gender = replace(gender, is.na(gender), getmode(gender)),
         experience = replace(experience, is.na(experience), getmode(experience)),
         major_discipline = replace(major_discipline, is.na(major_discipline),
                                    getmode(major_discipline)),
         education_level = replace(education_level, is.na(education_level),
                                   getmode(education_level)),
         enrolled_university = replace(enrolled_university, is.na(enrolled_university),
                                       getmode(enrolled_university)),
         last_new_job = replace(last_new_job, is.na(last_new_job),
                                getmode(last_new_job)))
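As a toy illustration of getmode (a sketch, not part of the pipeline):
getmode(c("a", "b", "a", "c"))  # returns "a", the most frequent value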
We change the data type to factor for gender, experience, major_discipline, education_level, enrolled_university, last_new_job, relevent_experience, and city.
# Change Factor
df$gender = as.factor(df$gender)
df$last_new_job = as.factor(df$last_new_job)
df$education_level = as.factor(df$education_level)
df$experience = as.factor(df$experience)
df$relevent_experience = as.factor(df$relevent_experience)
df$major_discipline = as.factor(df$major_discipline)
df$enrolled_university = as.factor(df$enrolled_university)
df$city = as.factor(df$city)
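An equivalent, more compact conversion (a sketch covering the same columns as above):
# Convert all listed categorical columns to factor in one pass
factor_cols = c("gender", "last_new_job", "education_level", "experience",
                "relevent_experience", "major_discipline",
                "enrolled_university", "city")
df[factor_cols] = lapply(df[factor_cols], as.factor)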
In this step, we predict the missing values of company_type with a classification tree, removing company_size from the predictors (it still has NAs of its own).
After company_type is filled, we convert it to factor.
library(party)
df_without_size = subset(df, select = -company_size)
na_idx = which(is.na(df_without_size$company_type), arr.ind=TRUE)
train_without_size = df_without_size[-na_idx, ]
test_without_size = df_without_size[na_idx, ]
train_without_size$company_type = as.factor(train_without_size$company_type)
# predict missing value company type
fit_ctree = ctree(formula = company_type ~ ., data = train_without_size)
pred_company_type = predict(fit_ctree, test_without_size)
# Convert the factor predictions to a character vector
v = as.character(pred_company_type)
df = df %>%
mutate(company_type = replace(company_type,
is.na(company_type),
v))
df$company_type = as.factor(df$company_type)
In the last imputation step, we predict the missing values of company_size with a classification tree, this time keeping the now-complete company_type as a predictor.
After company_size is filled, we convert it to factor.
na_idx = which(is.na(df$company_size), arr.ind=TRUE)
train_size_without_na = df[-na_idx, ]
test_size_with_na = df[na_idx, ]
train_size_without_na$company_size = as.factor(train_size_without_na$company_size)
# predict missing values of company_size
fit_ctree = ctree(formula = company_size ~ ., data = train_size_without_na)
pred_size_type1 = predict(fit_ctree, test_size_with_na)
# Convert the factor predictions to a character vector
v = as.character(pred_size_type1)
df = df %>%
mutate(company_size = replace(company_size,
is.na(company_size),
v))
df$company_size = as.factor(df$company_size)
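As a sanity check (a minimal sketch), no missing values should remain after the two imputations:
sum(is.na(df))  # expected: 0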
We removed the city variable. This decision was based on discussion with others and on experiments removing columns one at a time.
df$city = NULL
We create a function to one-hot encode the data frame df, producing two encoded data frames: df_encode_for_pca (used for PCA) and df_encode (encoded features plus the target).
library(caret)
dummy = function(data_type){
  # One-hot encode with caret::dummyVars.
  # Note: after removing enrollee_id and city, df has exactly 12 columns,
  # so df[, 1:12] is the whole data frame, including the target column.
  dmy = dummyVars(" ~ .", data = df[, 1:12])
  df_encode_for_pca = data.frame(predict(dmy, newdata = df[, 1:12]))
  df_encode = df_encode_for_pca
  df_encode$target = df$target
  if (data_type == "pca") {
    return(df_encode_for_pca)
  } else if (data_type == "encode") {
    return(df_encode)
  }
}
df_encode = dummy("encode")
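A quick shape check (a sketch, assuming the encoding above): the encoded frame should have 64 columns, matching the 64 principal components reported below.
dim(df_encode)  # expected: 19158 rows, 64 columns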
Next, we transform the encoded data frame into a PCA data frame.
We create the function get_pca to determine how many principal components we need.
get_pca = function(df_encode_for_pca, type){
  pr_out = prcomp(df_encode_for_pca, scale = TRUE)
  pr_var = pr_out$sdev ^ 2    # variance of each principal component
  pve = pr_var / sum(pr_var)  # proportion of variance explained (PVE)
  if (type == "count") {
    plot(pve,
         xlab = "Principal Component",
         ylab = "Proportion of Variance Explained",
         ylim = c(0, 1), type = "b")
    plot(cumsum(pve),
         xlab = "Principal Component",
         ylab = "Cumulative Proportion of Variance Explained",
         ylim = c(0, 1), type = "b")
    return(pve)
  } else if (type == "result") {
    return(pr_out)
  }
}
We create the function pca to transform the data and return the new data frame.
pca = function(df_encode_for_pca, size){
pr_out = get_pca(df_encode_for_pca = df_encode_for_pca, type = "result")
df_transform = pr_out$x[, 1:size]
df_transform = cbind(df_transform, target = df$target)
df_transform = as.data.frame(df_transform)
return(df_transform)
}
To see how many components we need, we used the get_pca function. We keep 55 principal components, because the first 55 explain essentially all of the variance; the PVE of the remaining components is numerically zero (around 1e-29, see below).
get_pca(df_encode_for_pca = df_encode, type = "count")
## [1] 5.818281e-02 3.514174e-02 3.439676e-02 3.167619e-02 2.849552e-02
## [6] 2.548207e-02 2.331527e-02 2.203020e-02 2.148684e-02 2.054113e-02
## [11] 2.024944e-02 1.871574e-02 1.849730e-02 1.811017e-02 1.776721e-02
## [16] 1.750449e-02 1.722536e-02 1.715866e-02 1.699786e-02 1.682313e-02
## [21] 1.672160e-02 1.669305e-02 1.661753e-02 1.654004e-02 1.648518e-02
## [26] 1.636428e-02 1.631785e-02 1.630282e-02 1.623661e-02 1.617872e-02
## [31] 1.607596e-02 1.601509e-02 1.597434e-02 1.589711e-02 1.587992e-02
## [36] 1.586938e-02 1.574873e-02 1.572035e-02 1.560815e-02 1.548359e-02
## [41] 1.545455e-02 1.536186e-02 1.526094e-02 1.512123e-02 1.509058e-02
## [46] 1.482443e-02 1.440615e-02 1.382816e-02 1.357677e-02 1.350540e-02
## [51] 1.244944e-02 1.147335e-02 1.055545e-02 9.065626e-03 7.497900e-03
## [56] 6.627407e-29 2.832323e-29 2.465461e-29 1.405880e-29 5.006776e-30
## [61] 2.519493e-30 1.603192e-30 7.846857e-31 1.855042e-32
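As a check (a minimal sketch reusing get_pca), the first 55 components should explain essentially all of the variance:
pve = get_pca(df_encode_for_pca = dummy("pca"), type = "count")
sum(pve[1:55])  # ~1: the PVE of components 56-64 is numerically zero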
We set size = 55, i.e. we keep 55 principal components, and save the result into df_pca.
df_pca = pca(df_encode_for_pca = dummy("pca"), 55)
We use set.seed(2021) for reproducible results, with a train:test ratio of 80:20.
We create a function to split the data.
divide_data = function(data, result){
n = nrow(data)
set.seed(2021)
train_index = sample(n, 0.8 * n)
df_train = data[train_index, ]
df_test = data[-train_index, ]
if (result == "train") {
return(df_train)
}else if (result == "test") {
return(df_test)
}
}
We call the function to get the training and test data for each of our three data frames: df (raw), df_encode (encoded), and df_pca (PCA).
df_train_raw = divide_data(data = df, result = "train")
df_test_raw = divide_data(data = df, result = "test")
df_train_encode = divide_data(data = df_encode, result = "train")
df_test_encode = divide_data(data = df_encode, result = "test")
df_train_pca = divide_data(data = df_pca, result = "train")
df_test_pca = divide_data(data = df_pca, result = "test")
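A quick check of the split sizes (a sketch): sample() truncates the non-integer size 0.8 * 19158 = 15326.4 to 15326, leaving 3832 test rows, which matches the confusion-matrix totals below.
nrow(df_train_raw)  # expected: 15326
nrow(df_test_raw)   # expected: 3832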
We apply four machine learning algorithms, training each on all three data frames.
forest = function(train, test){
  library(randomForest)
  set.seed(2021)
  # target is numeric, so randomForest fits a regression forest;
  # predictions are thresholded at 0.5 to obtain class labels
  fit_forest = randomForest(formula = target ~ ., data = train,
                            importance = TRUE)
  pred_forest <- predict(fit_forest, test, type = "response")
  pred_forest = factor(pred_forest > 0.5, c(FALSE, TRUE), labels = c(0, 1))
  perf_forest <- table(test$target, pred_forest,
                       dnn = c("Actual", "Predicted"))
  return(perf_forest)
}
perf_forest_pca = forest(train = df_train_pca, test = df_test_pca)
perf_forest_encode = forest(train = df_train_encode, test = df_test_encode)
perf_forest_raw = forest(train = df_train_raw, test = df_test_raw)
func_tree = function(train, test){
  library(party)
  # ctree on a numeric target fits a regression tree;
  # predictions are thresholded at 0.5 to obtain class labels
  fit_tree = ctree(formula = target ~ ., data = train)
  prob_ctree = predict(fit_tree, test)
  pred_ctree = factor(prob_ctree > 0.5, c(FALSE, TRUE), labels = c(0, 1))
  perf_ctree <- table(test$target, pred_ctree,
                      dnn = c("Actual", "Predicted"))
  return(perf_ctree)
}
perf_ctree_pca = func_tree(train = df_train_pca, test = df_test_pca)
perf_ctree_encode = func_tree(train = df_train_encode, test = df_test_encode)
perf_ctree_raw = func_tree(train = df_train_raw, test = df_test_raw)
func_svm = function(train, test){
  library(e1071)
  # svm on a numeric target fits eps-regression;
  # predictions are thresholded at 0.5 to obtain class labels
  fit_svm = svm(formula = target ~ ., data = train)
  prob_svm = predict(fit_svm, test)
  pred_svm = factor(prob_svm > 0.5, c(FALSE, TRUE), labels = c(0, 1))
  perf_svm <- table(test$target, pred_svm,
                    dnn = c("Actual", "Prediction"))
  return(perf_svm)
}
perf_svm_pca = func_svm(train = df_train_pca, test = df_test_pca )
perf_svm_encode = func_svm(train = df_train_encode, test = df_test_encode )
perf_svm_raw = func_svm(train = df_train_raw, test = df_test_raw )
func_logit = function(train, test){
  # Note: glm() defaults to family = gaussian, so this fits a linear
  # probability model; a true logistic regression would pass
  # family = binomial. Predictions are thresholded at 0.5.
  fit_logit = glm(formula = target ~ ., data = train)
  prob_logit = predict(fit_logit, test)
  pred_logit = factor(prob_logit > 0.5, c(FALSE, TRUE), labels = c(0, 1))
  perf_logit = table(test$target, pred_logit,
                     dnn = c("Actual", "Prediction"))
  return(perf_logit)
}
perf_logit_pca = func_logit(train = df_train_pca, test = df_test_pca )
perf_logit_encode = func_logit(train = df_train_encode, test = df_test_encode )
perf_logit_raw = func_logit(train = df_train_raw, test = df_test_raw )
To compute accuracy, precision, recall, and F1 score, we created the perform() function.
# Compute evaluation metrics from a confusion matrix
perform = function(table, n = 2, check, data_model){
  TN = table[1,1]
  TP = table[2,2]
  FN = table[2,1]
  FP = table[1,2]
  accuracy = (TP + TN) / (TP + FP + FN + TN)
  precision = TP / (FP + TP)
  recall = TP / (FN + TP)
  f1 = 2 * precision * recall / (precision + recall)
  # Formatted metric scores
  result = paste(data_model, "->",
                 "Accuracy =", round(accuracy, n),
                 ",Precision =", round(precision, n),
                 ",Recall =", round(recall, n),
                 ",F1 Score =", round(f1, n))
  # Raw confusion-matrix counts ("hasil" is Indonesian for "result")
  hasil = paste(data_model, "->",
                "TP :", TP,
                ",TN :", TN,
                ",FP :", FP,
                ",FN :", FN)
  if(check == "score"){
    result
  } else if(check == "value") {
    hasil
  } else if(check == "all"){
    print(result)
    print(hasil)
  }
}
# RESULT RANDOM FOREST
perform(perf_forest_pca, check = "all", data_model = "PCA")
## [1] "PCA -> Accuracy = 0.99 ,Precision = 1 ,Recall = 0.97 ,F1 Score = 0.98"
## [1] "PCA -> TP : 923 ,TN : 2879 ,FP : 0 ,FN : 30"
perform(perf_forest_encode, check = "all", data_model = "Encode")
## [1] "Encode -> Accuracy = 0.77 ,Precision = 0.55 ,Recall = 0.4 ,F1 Score = 0.46"
## [1] "Encode -> TP : 377 ,TN : 2575 ,FP : 304 ,FN : 576"
perform(perf_forest_raw, check = "all", data_model = "Raw")
## [1] "Raw -> Accuracy = 0.77 ,Precision = 0.58 ,Recall = 0.34 ,F1 Score = 0.43"
## [1] "Raw -> TP : 322 ,TN : 2645 ,FP : 234 ,FN : 631"
# RESULT CTREE
perform(perf_ctree_pca, check = "all", data_model = "PCA")
## [1] "PCA -> Accuracy = 0.96 ,Precision = 0.94 ,Recall = 0.91 ,F1 Score = 0.93"
## [1] "PCA -> TP : 870 ,TN : 2822 ,FP : 57 ,FN : 83"
perform(perf_ctree_encode, check = "all", data_model = "Encode")
## [1] "Encode -> Accuracy = 0.78 ,Precision = 0.58 ,Recall = 0.42 ,F1 Score = 0.49"
## [1] "Encode -> TP : 402 ,TN : 2589 ,FP : 290 ,FN : 551"
perform(perf_ctree_raw, check = "all", data_model = "Raw")
## [1] "Raw -> Accuracy = 0.78 ,Precision = 0.57 ,Recall = 0.43 ,F1 Score = 0.49"
## [1] "Raw -> TP : 407 ,TN : 2570 ,FP : 309 ,FN : 546"
# RESULT SUPPORT VECTOR MACHINE (SVM)
perform(perf_svm_pca, check = "all", data_model = "PCA")
## [1] "PCA -> Accuracy = 1 ,Precision = 1 ,Recall = 0.99 ,F1 Score = 1"
## [1] "PCA -> TP : 944 ,TN : 2879 ,FP : 0 ,FN : 9"
perform(perf_svm_encode, check = "all", data_model = "Encode")
## [1] "Encode -> Accuracy = 0.76 ,Precision = 0.55 ,Recall = 0.18 ,F1 Score = 0.27"
## [1] "Encode -> TP : 173 ,TN : 2736 ,FP : 143 ,FN : 780"
perform(perf_svm_raw, check = "all", data_model = "Raw")
## [1] "Raw -> Accuracy = 0.76 ,Precision = 0.54 ,Recall = 0.21 ,F1 Score = 0.31"
## [1] "Raw -> TP : 203 ,TN : 2708 ,FP : 171 ,FN : 750"
# RESULT LOGIT
perform(perf_logit_pca, check = "all", data_model = "PCA")
## [1] "PCA -> Accuracy = 1 ,Precision = 1 ,Recall = 1 ,F1 Score = 1"
## [1] "PCA -> TP : 953 ,TN : 2879 ,FP : 0 ,FN : 0"
perform(perf_logit_encode, check = "all", data_model = "Encode")
## [1] "Encode -> Accuracy = 0.76 ,Precision = 0.54 ,Recall = 0.25 ,F1 Score = 0.35"
## [1] "Encode -> TP : 241 ,TN : 2677 ,FP : 202 ,FN : 712"
perform(perf_logit_raw, check = "all", data_model = "Raw")
## [1] "Raw -> Accuracy = 0.76 ,Precision = 0.54 ,Recall = 0.25 ,F1 Score = 0.35"
## [1] "Raw -> TP : 241 ,TN : 2677 ,FP : 202 ,FN : 712"
On the PCA data, Logistic Regression is the best of the tested algorithms, with accuracy = 1, precision = 1, recall = 1, and F1 score = 1. One caveat: the PCA features were built from df[, 1:12], which includes the target column, so the near-perfect PCA results should be interpreted with caution; on the encoded and raw data, the decision tree (ctree) achieves the best F1 score (0.49).