LBB-C2: Evaluating Models of Classification

Wayan K.

5/03/2021


About the Data

This report assesses a dataset designed to understand the factors that lead a person to leave their current job, for use in HR research. Using variables that capture credentials, demographic information, and experience, the report will try to predict the probability that a candidate will look for a new job or stay with the company, as well as interpret the factors that affect that decision.

This report evaluates three classification models: a Naive Bayes model, a Random Forest model, and a Decision Tree model, each using all available predictor variables to predict the target variable on the test data. Based on the evaluation of each model, the report will then try to answer which model best supports the business decision.
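
The setup chunk is not shown in this report; based on the functions used throughout (select(), mutate(), glimpse(), naiveBayes(), train(), trainControl(), confusionMatrix(), varImp(), ctree()), the following packages are assumed to be loaded:

library(dplyr)        # data wrangling: select(), mutate(), glimpse()
library(e1071)        # naiveBayes()
library(caret)        # train(), trainControl(), confusionMatrix(), varImp()
library(randomForest) # backend for caret's method = "rf"
library(party)        # ctree(), ctree_control(); also loaded again later in the report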

Data Source

After some pre-processing, the data will be divided into two parts: train data and test data. The train data will be used for modeling and model selection, while the test data will be used to test the predictions of the selected models. The data comes from a CSV file with the following information:

job <- read.csv("job.csv", stringsAsFactors = T, na.strings = c("", "NA"))

str(job)
#> 'data.frame':    19158 obs. of  14 variables:
#>  $ enrollee_id           : int  8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
#>  $ city                  : Factor w/ 123 levels "city_1","city_10",..: 6 78 65 15 51 58 50 84 6 6 ...
#>  $ city_development_index: num  0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
#>  $ gender                : Factor w/ 3 levels "Female","Male",..: 2 2 NA NA 2 NA 2 2 2 NA ...
#>  $ relevent_experience   : Factor w/ 2 levels "Has relevent experience",..: 1 2 2 2 1 1 1 1 1 1 ...
#>  $ enrolled_university   : Factor w/ 3 levels "Full time course",..: 2 2 1 NA 2 3 2 2 2 2 ...
#>  $ education_level       : Factor w/ 5 levels "Graduate","High School",..: 1 1 1 1 3 1 2 1 1 1 ...
#>  $ major_discipline      : Factor w/ 6 levels "Arts","Business Degree",..: 6 6 6 2 6 6 NA 6 6 6 ...
#>  $ experience            : Factor w/ 22 levels "<1",">20","1",..: 2 9 18 1 2 5 18 7 20 11 ...
#>  $ company_size          : Factor w/ 8 levels "<10","10/49",..: NA 6 NA NA 6 NA 6 1 6 5 ...
#>  $ company_type          : Factor w/ 6 levels "Early Stage Startup",..: NA 6 NA 6 2 NA 2 6 6 6 ...
#>  $ last_new_job          : Factor w/ 6 levels ">4","1","2","3",..: 2 1 6 6 5 2 2 1 2 1 ...
#>  $ training_hours        : int  36 47 83 52 8 24 24 18 46 123 ...
#>  $ target                : num  1 0 0 1 0 1 0 1 1 0 ...

Data Dictionary

  • enrollee_id : Unique ID for candidate
  • city: City code
  • city_development_index : Development index of the city (scaled)
  • gender: Gender of candidate
  • relevent_experience: Relevant experience of candidate
  • enrolled_university: Type of University course enrolled if any
  • education_level: Education level of candidate
  • major_discipline: Education major discipline of candidate
  • experience: Candidate total experience in years
  • company_size: No of employees in current employer’s company
  • company_type: Type of current employer
  • last_new_job: Difference in years between previous job and current job
  • training_hours: Training hours completed
  • target: 0 – Not looking for job change, 1 – Looking for a job change

Data Wrangling

Checking the data types and NA values of the data:

summary(job)
#>   enrollee_id          city      city_development_index    gender     
#>  Min.   :    1   city_103:4355   Min.   :0.4480         Female: 1238  
#>  1st Qu.: 8554   city_21 :2702   1st Qu.:0.7400         Male  :13221  
#>  Median :16983   city_16 :1533   Median :0.9030         Other :  191  
#>  Mean   :16875   city_114:1336   Mean   :0.8288         NA's  : 4508  
#>  3rd Qu.:25170   city_160: 845   3rd Qu.:0.9200                       
#>  Max.   :33380   city_136: 586   Max.   :0.9490                       
#>                  (Other) :7801                                        
#>               relevent_experience       enrolled_university
#>  Has relevent experience:13792    Full time course: 3757   
#>  No relevent experience : 5366    no_enrollment   :13817   
#>                                   Part time course: 1198   
#>                                   NA's            :  386   
#>                                                            
#>                                                            
#>                                                            
#>        education_level         major_discipline   experience   
#>  Graduate      :11598   Arts           :  253   >20    : 3286  
#>  High School   : 2017   Business Degree:  327   5      : 1430  
#>  Masters       : 4361   Humanities     :  669   4      : 1403  
#>  Phd           :  414   No Major       :  223   3      : 1354  
#>  Primary School:  308   Other          :  381   6      : 1216  
#>  NA's          :  460   STEM           :14492   (Other):10404  
#>                         NA's           : 2813   NA's   :   65  
#>     company_size               company_type  last_new_job training_hours  
#>  50-99    :3083   Early Stage Startup: 603   >4   :3290   Min.   :  1.00  
#>  100-500  :2571   Funded Startup     :1001   1    :8040   1st Qu.: 23.00  
#>  10000+   :2019   NGO                : 521   2    :2900   Median : 47.00  
#>  10/49    :1471   Other              : 121   3    :1024   Mean   : 65.37  
#>  1000-4999:1328   Public Sector      : 955   4    :1029   3rd Qu.: 88.00  
#>  (Other)  :2748   Pvt Ltd            :9817   never:2452   Max.   :336.00  
#>  NA's     :5938   NA's               :6140   NA's : 423                   
#>      target      
#>  Min.   :0.0000  
#>  1st Qu.:0.0000  
#>  Median :0.0000  
#>  Mean   :0.2493  
#>  3rd Qu.:0.0000  
#>  Max.   :1.0000  
#> 
glimpse(job)
#> Rows: 19,158
#> Columns: 14
#> $ enrollee_id            <int> 8949, 29725, 11561, 33241, 666, 21651, 28806, 4~
#> $ city                   <fct> city_103, city_40, city_21, city_115, city_162,~
#> $ city_development_index <dbl> 0.920, 0.776, 0.624, 0.789, 0.767, 0.764, 0.920~
#> $ gender                 <fct> Male, Male, NA, NA, Male, NA, Male, Male, Male,~
#> $ relevent_experience    <fct> Has relevent experience, No relevent experience~
#> $ enrolled_university    <fct> no_enrollment, no_enrollment, Full time course,~
#> $ education_level        <fct> Graduate, Graduate, Graduate, Graduate, Masters~
#> $ major_discipline       <fct> STEM, STEM, STEM, Business Degree, STEM, STEM, ~
#> $ experience             <fct> >20, 15, 5, <1, >20, 11, 5, 13, 7, 17, 2, 5, >2~
#> $ company_size           <fct> NA, 50-99, NA, NA, 50-99, NA, 50-99, <10, 50-99~
#> $ company_type           <fct> NA, Pvt Ltd, NA, Pvt Ltd, Funded Startup, NA, F~
#> $ last_new_job           <fct> 1, >4, never, never, 4, 1, 1, >4, 1, >4, never,~
#> $ training_hours         <int> 36, 47, 83, 52, 8, 24, 24, 18, 46, 123, 32, 108~
#> $ target                 <dbl> 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,~

Data Cleansing & Exploratory Data Analysis

These data cleansing steps will be performed to prepare both the train and test data sets:

  • enrollee_id: will be omitted as it will not be used in the report.

  • target: will be converted to a factor data type.

job <- job %>% 
  dplyr::select(-enrollee_id) %>% 
  mutate(target = as.factor(target))

glimpse(job)
#> Rows: 19,158
#> Columns: 13
#> $ city                   <fct> city_103, city_40, city_21, city_115, city_162,~
#> $ city_development_index <dbl> 0.920, 0.776, 0.624, 0.789, 0.767, 0.764, 0.920~
#> $ gender                 <fct> Male, Male, NA, NA, Male, NA, Male, Male, Male,~
#> $ relevent_experience    <fct> Has relevent experience, No relevent experience~
#> $ enrolled_university    <fct> no_enrollment, no_enrollment, Full time course,~
#> $ education_level        <fct> Graduate, Graduate, Graduate, Graduate, Masters~
#> $ major_discipline       <fct> STEM, STEM, STEM, Business Degree, STEM, STEM, ~
#> $ experience             <fct> >20, 15, 5, <1, >20, 11, 5, 13, 7, 17, 2, 5, >2~
#> $ company_size           <fct> NA, 50-99, NA, NA, 50-99, NA, 50-99, <10, 50-99~
#> $ company_type           <fct> NA, Pvt Ltd, NA, Pvt Ltd, Funded Startup, NA, F~
#> $ last_new_job           <fct> 1, >4, never, never, 4, 1, 1, >4, 1, >4, never,~
#> $ training_hours         <int> 36, 47, 83, 52, 8, 24, 24, 18, 46, 123, 32, 108~
#> $ target                 <fct> 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,~
anyNA(job)
#> [1] TRUE
colSums(is.na(job))
#>                   city city_development_index                 gender 
#>                      0                      0                   4508 
#>    relevent_experience    enrolled_university        education_level 
#>                      0                    386                    460 
#>       major_discipline             experience           company_size 
#>                   2813                     65                   5938 
#>           company_type           last_new_job         training_hours 
#>                   6140                    423                      0 
#>                 target 
#>                      0

Note:

  • Column target, the target variable, has been changed to a factor data type.

  • A check of the class proportions (see the quick check after this list) shows that the data is quite imbalanced, so the train data would ideally be balanced before it is used for modelling.

  • A preliminary check also shows ‘NA’ values in the following columns:

    • gender
    • enrolled_university
    • education_level
    • major_discipline
    • experience
    • company_size
    • company_type
    • last_new_job
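
As a quick check of the imbalance mentioned above (a minimal sketch; the report itself only shows this check after the train/test split):

prop.table(table(job$target))
# roughly 0.75 for class 0 vs 0.25 for class 1, consistent with the
# mean of 0.2493 reported for target in summary(job) above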

In order to create a better overall result, we will replace any missing/NA values according to the column type:

  • Columns with missing numeric values: replaced by the column mean (using the mean() function).

  • Columns with the factor data type: replaced by the value with the highest number of occurrences in that column (using the custom Mode() function defined below; base R’s mode() returns the storage type, not the statistical mode).

Creating a helper function for the imputation (it returns the most frequent value, or NA when every value is equally frequent; in case of a tie it can return more than one value):

Mode = function(x){
  a = table(x)            # frequency of each value (NAs are dropped by table())
  b = max(a)              # highest frequency
  if(all(a == b))         # every value equally frequent: no meaningful mode
    mod = NA
  else if(is.numeric(x))  # numeric input: return the modal value(s) as numeric
    mod = as.numeric(names(a))[a == b]
  else                    # factor/character input: return the modal level(s)
    mod = names(a)[a == b]
  return(mod)
}
job$gender[is.na(job$gender)] <-  Mode(job$gender)

job$enrolled_university[is.na(job$enrolled_university)] <- Mode(job$enrolled_university)

job$education_level[is.na(job$education_level)] <-  Mode(job$education_level)

job$major_discipline[is.na(job$major_discipline)] <-  Mode(job$major_discipline)

job$company_type[is.na(job$company_type)] <-  Mode(job$company_type)

job$experience[is.na(job$experience)] <-  Mode(job$experience)

job$company_size[is.na(job$company_size)] <-  Mode(job$company_size)

job$last_new_job[is.na(job$last_new_job)] <-  Mode(job$last_new_job)
summary(job)
#>        city      city_development_index    gender     
#>  city_103:4355   Min.   :0.4480         Female: 1238  
#>  city_21 :2702   1st Qu.:0.7400         Male  :17729  
#>  city_16 :1533   Median :0.9030         Other :  191  
#>  city_114:1336   Mean   :0.8288                       
#>  city_160: 845   3rd Qu.:0.9200                       
#>  city_136: 586   Max.   :0.9490                       
#>  (Other) :7801                                        
#>               relevent_experience       enrolled_university
#>  Has relevent experience:13792    Full time course: 3757   
#>  No relevent experience : 5366    no_enrollment   :14203   
#>                                   Part time course: 1198   
#>                                                            
#>                                                            
#>                                                            
#>                                                            
#>        education_level         major_discipline   experience      company_size 
#>  Graduate      :12058   Arts           :  253   >20    :3351   50-99    :9021  
#>  High School   : 2017   Business Degree:  327   5      :1430   100-500  :2571  
#>  Masters       : 4361   Humanities     :  669   4      :1403   10000+   :2019  
#>  Phd           :  414   No Major       :  223   3      :1354   10/49    :1471  
#>  Primary School:  308   Other          :  381   6      :1216   1000-4999:1328  
#>                         STEM           :17305   2      :1127   <10      :1308  
#>                                                 (Other):9277   (Other)  :1440  
#>               company_type   last_new_job training_hours   target   
#>  Early Stage Startup:  603   >4   :3290   Min.   :  1.00   0:14381  
#>  Funded Startup     : 1001   1    :8463   1st Qu.: 23.00   1: 4777  
#>  NGO                :  521   2    :2900   Median : 47.00            
#>  Other              :  121   3    :1024   Mean   : 65.37            
#>  Public Sector      :  955   4    :1029   3rd Qu.: 88.00            
#>  Pvt Ltd            :15957   never:2452   Max.   :336.00            
#> 
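
The eight column-by-column assignments above could also be written in a single step (a sketch, assuming dplyr >= 1.0 for across(); the [1] guards against ties returned by Mode()):

job <- job %>% 
  mutate(across(where(is.factor), function(x) {
    x[is.na(x)] <- Mode(x)[1]   # impute NAs with the most frequent level
    x
  }))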

Cross Validation

In order to do the cross validation, the data will be split into two parts: train data (consisting of 80% of the data) and test data (consisting of 20% of the data).

The train data will be used to train the models, while the test data will be used to test model performance: each model's predictions on the test data will be compared against the actual values to validate how well the model generalizes.

RNGkind(sample.kind = "Rounding")
set.seed(123)

index <- sample(nrow(job), 
       nrow(job)*0.8)

job_train <- job[index,]
job_test <- job[-index,]

head(job_train)
#>           city city_development_index gender     relevent_experience
#> 5510  city_152                  0.698   Male Has relevent experience
#> 15102  city_45                  0.890   Male Has relevent experience
#> 7835   city_75                  0.939   Male  No relevent experience
#> 16915  city_21                  0.624   Male  No relevent experience
#> 18014 city_173                  0.878   Male Has relevent experience
#> 873     city_9                  0.743 Female  No relevent experience
#>       enrolled_university education_level major_discipline experience
#> 5510     Full time course        Graduate             STEM          9
#> 15102       no_enrollment        Graduate             STEM         15
#> 7835        no_enrollment         Masters             STEM         15
#> 16915    Full time course     High School             STEM         <1
#> 18014       no_enrollment     High School             STEM          6
#> 873      Full time course     High School             STEM          4
#>       company_size company_type last_new_job training_hours target
#> 5510           <10      Pvt Ltd           >4              9      0
#> 15102      500-999      Pvt Ltd        never             74      0
#> 7835        10000+      Pvt Ltd           >4             52      0
#> 16915        50-99      Pvt Ltd        never            111      0
#> 18014        50-99      Pvt Ltd            1             85      0
#> 873          50-99      Pvt Ltd        never             22      0
tail(job_test)
#>           city city_development_index gender     relevent_experience
#> 19136  city_65                  0.802   Male Has relevent experience
#> 19142  city_23                  0.899   Male Has relevent experience
#> 19149  city_21                  0.624   Male Has relevent experience
#> 19150 city_103                  0.920   Male Has relevent experience
#> 19152 city_149                  0.689   Male  No relevent experience
#> 19153 city_103                  0.920 Female Has relevent experience
#>       enrolled_university education_level major_discipline experience
#> 19136       no_enrollment        Graduate             STEM          8
#> 19142       no_enrollment        Graduate             STEM         17
#> 19149       no_enrollment         Masters             STEM          3
#> 19150       no_enrollment         Masters             STEM          9
#> 19152    Full time course        Graduate             STEM          2
#> 19153       no_enrollment        Graduate       Humanities          7
#>       company_size   company_type last_new_job training_hours target
#> 19136        50-99  Public Sector            2            136      0
#> 19142        10/49 Funded Startup            3             12      0
#> 19149      100-500        Pvt Ltd            3             40      1
#> 19150        50-99        Pvt Ltd            1             36      1
#> 19152        50-99        Pvt Ltd            1             60      0
#> 19153        10/49 Funded Startup            1             25      0

We will also check the proportion of the target variable (looking for a job change vs. not looking for a job change).

prop.table(table(job_train$target))
#> 
#>         0         1 
#> 0.7508809 0.2491191

As the proportions show, the classes of the target variable are not balanced, but due to data and time constraints this is considered adequate to continue with the modeling (a sketch of how the train data could be balanced is shown below).
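
For reference, a minimal sketch of that balancing step using caret's downSample() (the models below are trained on the original, unbalanced job_train; this sketch is not used further):

set.seed(123)
job_train_down <- downSample(x = job_train %>% dplyr::select(-target),
                             y = job_train$target,
                             yname = "target")
prop.table(table(job_train_down$target))   # should be 0.5 / 0.5 after downsampling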

Model Fitting

Model 1 - Naive Bayes Model

model_naive <- naiveBayes(target~., data = job_train)
model_naive
#> 
#> Naive Bayes Classifier for Discrete Predictors
#> 
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#> 
#> A-priori probabilities:
#> Y
#>         0         1 
#> 0.7508809 0.2491191 
#> 
#> Conditional probabilities:
#>    city
#> Y          city_1       city_10      city_100      city_101      city_102
#>   0 0.00156412930 0.00547445255 0.01511991658 0.00208550574 0.01772679875
#>   1 0.00078575170 0.00235725511 0.01283394447 0.00916710320 0.01126244107
#>    city
#> Y        city_103      city_104      city_105      city_106      city_107
#>   0 0.23618352450 0.01885644769 0.00434480361 0.00034758429 0.00017379214
#>   1 0.19696176008 0.00654793085 0.00130958617 0.00052383447 0.00078575170
#>    city
#> Y        city_109       city_11      city_111      city_114      city_115
#>   0 0.00017379214 0.00695168578 0.00026068822 0.08507125478 0.00217240181
#>   1 0.00052383447 0.03195390257 0.00000000000 0.02671555788 0.00419067575
#>    city
#> Y        city_116      city_117      city_118       city_12      city_120
#>   0 0.00790754258 0.00069516858 0.00112964894 0.00078206465 0.00034758429
#>   1 0.00392875851 0.00104766894 0.00157150340 0.00052383447 0.00026191723
#>    city
#> Y        city_121      city_123      city_126      city_127      city_128
#>   0 0.00017379214 0.00408411540 0.00078206465 0.00069516858 0.00278067431
#>   1 0.00026191723 0.00392875851 0.00366684128 0.00026191723 0.01073860660
#>    city
#> Y        city_129       city_13      city_131      city_133      city_134
#>   0 0.00017379214 0.00286757039 0.00034758429 0.00060827251 0.00234619395
#>   1 0.00000000000 0.00104766894 0.00078575170 0.00052383447 0.00235725511
#>    city
#> Y        city_136      city_138      city_139       city_14      city_140
#>   0 0.03710462287 0.00799443865 0.00008689607 0.00139033716 0.00008689607
#>   1 0.01309586171 0.00183342064 0.00078575170 0.00130958617 0.00000000000
#>    city
#> Y        city_141      city_142      city_143      city_144      city_145
#>   0 0.00156412930 0.00260688217 0.00173792145 0.00121654501 0.00173792145
#>   1 0.00052383447 0.00288108958 0.00340492404 0.00183342064 0.00811943426
#>    city
#> Y        city_146      city_149      city_150      city_152      city_155
#>   0 0.00026068822 0.00538755648 0.00312825860 0.00269377824 0.00026068822
#>   1 0.00078575170 0.00392875851 0.00340492404 0.00235725511 0.00261917234
#>    city
#> Y        city_157      city_158      city_159       city_16      city_160
#>   0 0.00156412930 0.00182481752 0.00538755648 0.09289190129 0.04449078902
#>   1 0.00052383447 0.00314300681 0.00314300681 0.03771608172 0.04112100576
#>    city
#> Y        city_162      city_165      city_166      city_167      city_171
#>   0 0.00625651721 0.00408411540 0.00017379214 0.00060827251 0.00000000000
#>   1 0.00680984809 0.00392875851 0.00026191723 0.00078575170 0.00000000000
#>    city
#> Y        city_173      city_175      city_176      city_179       city_18
#>   0 0.00929787974 0.00060827251 0.00104275287 0.00008689607 0.00026068822
#>   1 0.00419067575 0.00052383447 0.00157150340 0.00078575170 0.00026191723
#>    city
#> Y        city_180       city_19        city_2       city_20       city_21
#>   0 0.00043448036 0.00530066041 0.00034758429 0.00182481752 0.07838025721
#>   1 0.00052383447 0.00969093766 0.00000000000 0.00078575170 0.33368255631
#>    city
#> Y         city_23       city_24       city_25       city_26       city_27
#>   0 0.01112269725 0.00356273896 0.00008689607 0.00156412930 0.00312825860
#>   1 0.00366684128 0.00340492404 0.00052383447 0.00104766894 0.00183342064
#>    city
#> Y         city_28       city_30       city_31       city_33       city_36
#>   0 0.01251303441 0.00104275287 0.00026068822 0.00043448036 0.01051442475
#>   1 0.00288108958 0.00026191723 0.00026191723 0.00235725511 0.00288108958
#>    city
#> Y         city_37       city_39       city_40       city_41       city_42
#>   0 0.00078206465 0.00034758429 0.00382342718 0.00460549183 0.00026068822
#>   1 0.00052383447 0.00000000000 0.00235725511 0.00340492404 0.00157150340
#>    city
#> Y         city_43       city_44       city_45       city_46       city_48
#>   0 0.00043448036 0.00078206465 0.00669099757 0.00686478971 0.00043448036
#>   1 0.00104766894 0.00157150340 0.00340492404 0.00707176532 0.00130958617
#>    city
#> Y         city_50       city_53       city_54       city_55       city_57
#>   0 0.00842891901 0.00104275287 0.00086896072 0.00078206465 0.00643030935
#>   1 0.00340492404 0.00104766894 0.00052383447 0.00078575170 0.00288108958
#>    city
#> Y         city_59       city_61       city_62       city_64       city_65
#>   0 0.00052137643 0.01216545012 0.00043448036 0.00634341328 0.01060132082
#>   1 0.00052383447 0.00419067575 0.00000000000 0.00314300681 0.00471451021
#>    city
#> Y         city_67       city_69        city_7       city_70       city_71
#>   0 0.02537365311 0.00121654501 0.00165102537 0.00199860966 0.01581508516
#>   1 0.01257202724 0.00026191723 0.00130958617 0.00392875851 0.00995285490
#>    city
#> Y         city_72       city_73       city_74       city_75       city_76
#>   0 0.00112964894 0.01364268335 0.00382342718 0.01946472019 0.00260688217
#>   1 0.00026191723 0.01571503405 0.01073860660 0.00654793085 0.00235725511
#>    city
#> Y         city_77       city_78       city_79        city_8       city_80
#>   0 0.00243309002 0.00121654501 0.00034758429 0.00034758429 0.00095585680
#>   1 0.00000000000 0.00314300681 0.00052383447 0.00000000000 0.00052383447
#>    city
#> Y         city_81       city_82       city_83       city_84       city_89
#>   0 0.00043448036 0.00026068822 0.00947167188 0.00139033716 0.00347584289
#>   1 0.00000000000 0.00000000000 0.00471451021 0.00157150340 0.00340492404
#>    city
#> Y          city_9       city_90       city_91       city_93       city_94
#>   0 0.00095585680 0.00895029545 0.00217240181 0.00156412930 0.00078206465
#>   1 0.00104766894 0.01309586171 0.00314300681 0.00104766894 0.00261917234
#>    city
#> Y         city_97       city_98       city_99
#>   0 0.00651720542 0.00495307612 0.00582203684
#>   1 0.00104766894 0.00130958617 0.00340492404
#> 
#>    city_development_index
#> Y        [,1]      [,2]
#>   0 0.8528382 0.1057588
#>   1 0.7562677 0.1436121
#> 
#>    gender
#> Y        Female        Male       Other
#>   0 0.063694821 0.926485923 0.009819256
#>   1 0.070717653 0.918543740 0.010738607
#> 
#>    relevent_experience
#> Y   Has relevent experience No relevent experience
#>   0               0.7539103              0.2460897
#>   1               0.6165532              0.3834468
#> 
#>    enrolled_university
#> Y   Full time course no_enrollment Part time course
#>   0       0.16388599    0.77189781       0.06421620
#>   1       0.29701414    0.64012572       0.06286014
#> 
#>    education_level
#> Y      Graduate High School     Masters         Phd Primary School
#>   0 0.605665624 0.113051790 0.239051095 0.024070212    0.018161279
#>   1 0.690937664 0.083551598 0.204819277 0.012310110    0.008381351
#> 
#>    major_discipline
#> Y         Arts Business Degree Humanities   No Major      Other       STEM
#>   0 0.01338200      0.01746611 0.03597497 0.01129649 0.01981230 0.90206813
#>   1 0.01126244      0.01754845 0.02985856 0.01100052 0.02147721 0.90885280
#> 
#>    experience
#> Y            <1         >20           1          10          11          12
#>   0 0.018943344 0.195429267 0.023461940 0.054396941 0.035279805 0.027546055
#>   1 0.046883185 0.113933997 0.047668937 0.042430592 0.031168151 0.017810372
#>    experience
#> Y            13          14          15          16          17          18
#>   0 0.022506083 0.032412235 0.037799791 0.030674314 0.019638512 0.017379214
#>   1 0.015976951 0.023834468 0.024358303 0.012833944 0.011786276 0.009690938
#>    experience
#> Y            19           2          20           3           4           5
#>   0 0.017292318 0.051529371 0.007733750 0.061087939 0.065172054 0.071080987
#>   1 0.010476689 0.075432163 0.007071765 0.099266632 0.098218963 0.087218439
#>    experience
#> Y             6           7           8           9
#>   0 0.060740355 0.052572124 0.042752868 0.054570733
#>   1 0.069408067 0.067050812 0.041121006 0.046359350
#> 
#>    company_size
#> Y          <10      10/49    100-500  1000-4999     10000+      50-99
#>   0 0.07559958 0.07881474 0.15059089 0.07864095 0.11478971 0.41979493
#>   1 0.04557360 0.07255107 0.08721844 0.03981142 0.08067051 0.62074384
#>    company_size
#> Y      500-999  5000-9999
#>   0 0.04961766 0.03215155
#>   1 0.03195390 0.02147721
#> 
#>    company_type
#> Y   Early Stage Startup Funded Startup         NGO       Other Public Sector
#>   0         0.033368092    0.060305874 0.029544665 0.006430309   0.051007994
#>   1         0.029072813    0.031953903 0.020167627 0.005500262   0.043740178
#>    company_type
#> Y       Pvt Ltd
#>   0 0.819343066
#>   1 0.869565217
#> 
#>    last_new_job
#> Y           >4          1          2          3          4      never
#>   0 0.18300313 0.43248175 0.15154675 0.05622176 0.05848106 0.11826555
#>   1 0.12807753 0.47249869 0.15217391 0.04924044 0.04740702 0.15060241
#> 
#>    training_hours
#> Y       [,1]     [,2]
#>   0 65.93491 61.08793
#>   1 63.17575 57.32707

Interpretation of Naive Bayes Model Results:

From the result we can see that the model estimates the conditional probabilities for each feature separately, and the a-priori probabilities reflect the class distribution of the train data (about 75% vs. 25%). Note that a few rare city levels have a conditional probability of exactly zero for one of the classes; this kind of data scarcity can be handled with Laplace smoothing, as sketched below.
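
A minimal sketch of that smoothing step (this refit is not part of the original report and is not used further):

model_naive_smooth <- naiveBayes(target ~ ., data = job_train, laplace = 1)
# laplace = 1 adds one pseudo-count per level, so rare city levels no longer
# receive a conditional probability of exactly zero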

Create Prediction based on Naive Bayes Model

pred_naive <- predict(model_naive, newdata = job_test)
pred_naive
#>    [1] 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0
#>   [38] 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0
#>   [75] 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 0 1 0 0
#>  [112] 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
#>  [149] 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0
#>  [186] 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 1 1 0
#>  [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1
#>  [260] 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1
#>  [297] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
#>  [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
#>  [371] 0 0 0 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 1 0 1 0 0
#>  [408] 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0
#>  [445] 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0
#>  [482] 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 1 0 1
#>  [519] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1
#>  [556] 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
#>  [593] 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1
#>  [630] 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0
#>  [667] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 1 1
#>  [704] 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
#>  [741] 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 1 0 0 0 0
#>  [778] 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0
#>  [815] 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0
#>  [852] 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0
#>  [889] 0 1 0 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0
#>  [926] 0 1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
#>  [963] 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1
#> [1000] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0
#> [1037] 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 1 0 0 0
#> [1074] 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0
#> [1111] 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
#> [1148] 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1
#> [1185] 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [1222] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
#> [1259] 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
#> [1296] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 1 1 0 0 0 0
#> [1333] 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1370] 0 1 1 0 0 1 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [1407] 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
#> [1444] 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0
#> [1481] 1 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 1
#> [1518] 1 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1555] 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0
#> [1592] 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0
#> [1629] 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0
#> [1666] 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 1 0 1 0 0
#> [1703] 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
#> [1740] 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0
#> [1777] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
#> [1814] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0
#> [1851] 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
#> [1888] 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0
#> [1925] 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1962] 1 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
#> [1999] 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0
#> [2036] 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 1
#> [2073] 1 0 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0
#> [2110] 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [2147] 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
#> [2184] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0
#> [2221] 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0
#> [2258] 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0
#> [2295] 1 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 1 1
#> [2332] 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0
#> [2369] 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0
#> [2406] 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1
#> [2443] 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0
#> [2480] 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
#> [2517] 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0
#> [2554] 0 0 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1
#> [2591] 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1
#> [2628] 0 0 0 0 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 1 0 1 1 0 0 0
#> [2665] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0
#> [2702] 0 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [2739] 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0
#> [2776] 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [2813] 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1
#> [2850] 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 0
#> [2887] 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0
#> [2924] 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 1 1 1 0 1 0 0 1 1 0 1 0 0
#> [2961] 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0
#> [2998] 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 1 1
#> [3035] 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0
#> [3072] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [3109] 0 1 0 1 0 0 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
#> [3146] 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0
#> [3183] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
#> [3220] 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0
#> [3257] 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0
#> [3294] 0 1 0 1 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0
#> [3331] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 1
#> [3368] 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
#> [3405] 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1
#> [3442] 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0
#> [3479] 1 0 1 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
#> [3516] 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 1 0 0
#> [3553] 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 1
#> [3590] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0
#> [3627] 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0
#> [3664] 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
#> [3701] 0 0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1
#> [3738] 0 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0 0 1 0 1
#> [3775] 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0
#> [3812] 0 0 1 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0
#> Levels: 0 1
plot(pred_naive)

Checking Accuracy of Naive Bayes Model

confusionMatrix(data = pred_naive, reference = job_test$target, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 2484  471
#>          1  389  488
#>                                          
#>                Accuracy : 0.7756         
#>                  95% CI : (0.762, 0.7887)
#>     No Information Rate : 0.7497         
#>     P-Value [Acc > NIR] : 0.0001018      
#>                                          
#>                   Kappa : 0.3844         
#>                                          
#>  Mcnemar's Test P-Value : 0.0057435      
#>                                          
#>             Sensitivity : 0.5089         
#>             Specificity : 0.8646         
#>          Pos Pred Value : 0.5564         
#>          Neg Pred Value : 0.8406         
#>              Prevalence : 0.2503         
#>          Detection Rate : 0.1273         
#>    Detection Prevalence : 0.2289         
#>       Balanced Accuracy : 0.6867         
#>                                          
#>        'Positive' Class : 1              
#> 

Interpreting the Naive Bayes Model

From the result above, it can be concluded that the model correctly identifies 2484 out of 2873 “0” (Not looking for job change) cases and 488 out of 959 “1” (Looking for job change) cases. In other words, the Naive Bayes model's specificity for the “0” class is about 86.5%, but its sensitivity for the “1” class falls to about 50.9%, resulting in an overall accuracy of about 77.56%.

Model 2 - Random Forest Model

The report will also generate predictions for the test data set using a Random Forest model as an alternative.

The Random Forest is an ensemble classification method: the model is built from many decision trees with different characteristics, where each tree uses a different sample of observations and predictors.

Creating a Random Forest model on all available predictors using k-fold cross validation:

set.seed(123)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# job_forest <- train(target ~., data = job_train, method = "rf", trControl= ctrl)
# saveRDS(object = job_forest, file = "job_forest.RDS")

Reading the model that was previously trained and saved to an RDS file:

job_forest <- readRDS("job_forest.RDS")

Checking the summary of the final model that has been built, using job_forest$finalModel:

summary(job_forest$finalModel)
#>                 Length Class      Mode     
#> call                4  -none-     call     
#> type                1  -none-     character
#> predicted       15326  factor     numeric  
#> err.rate         1500  -none-     numeric  
#> confusion           6  -none-     numeric  
#> votes           30652  matrix     numeric  
#> oob.times       15326  -none-     numeric  
#> classes             2  -none-     character
#> importance        176  -none-     numeric  
#> importanceSD        0  -none-     NULL     
#> localImportance     0  -none-     NULL     
#> proximity           0  -none-     NULL     
#> ntree               1  -none-     numeric  
#> mtry                1  -none-     numeric  
#> forest             14  -none-     list     
#> y               15326  factor     numeric  
#> test                0  -none-     NULL     
#> inbag               0  -none-     NULL     
#> xNames            176  -none-     character
#> problemType         1  -none-     character
#> tuneValue           1  data.frame list     
#> obsLevels           2  -none-     character
#> param               0  -none-     list

Visualizing the Random Forest Model

plot(job_forest)

In practice, a random forest already provides out-of-bag (OOB) estimates, which represent an unbiased estimate of its accuracy on unseen data.

For this model, the out-of-bag estimate of the error rate is 23.54%, meaning we can expect roughly a 23.54% error rate on unseen data (see the quick check below; the summary() above only lists the components of the model object).
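
The OOB figure quoted above is reported when the underlying randomForest object is printed (a quick check; output omitted here):

job_forest$finalModel
# printing the randomForest object shows the "OOB estimate of error rate"
# together with the class-wise confusion matrix on the training data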

Predicting the test data

After building the Random Forest model, we will predict the test data set using the fitted model.

pred_test_jf <-  predict(object = job_forest, 
                          newdata = job_test,
                          type = "raw")
pred_test_jf
#>    [1] 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
#>   [38] 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0
#>   [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 0 0 0 0 0
#>  [112] 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
#>  [149] 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
#>  [186] 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0
#>  [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0
#>  [260] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
#>  [297] 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1
#>  [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0
#>  [371] 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 1 1 0 1 0 0
#>  [408] 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0
#>  [445] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0
#>  [482] 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 1 1
#>  [519] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
#>  [556] 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#>  [593] 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1
#>  [630] 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0
#>  [667] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
#>  [704] 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0
#>  [741] 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0
#>  [778] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#>  [815] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0
#>  [852] 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
#>  [889] 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
#>  [926] 0 1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
#>  [963] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
#> [1000] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0
#> [1037] 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
#> [1074] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 1
#> [1111] 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 1
#> [1148] 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
#> [1185] 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
#> [1222] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [1259] 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0
#> [1296] 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
#> [1333] 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0
#> [1370] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0
#> [1407] 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
#> [1444] 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [1481] 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1
#> [1518] 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
#> [1555] 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0
#> [1592] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
#> [1629] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
#> [1666] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 0 0
#> [1703] 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0
#> [1740] 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0
#> [1777] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0
#> [1814] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1851] 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
#> [1888] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
#> [1925] 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1962] 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
#> [1999] 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
#> [2036] 0 1 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
#> [2073] 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0
#> [2110] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [2147] 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
#> [2184] 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0
#> [2221] 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
#> [2258] 1 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0
#> [2295] 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 1 0 0 1 0 0
#> [2332] 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1 0
#> [2369] 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0
#> [2406] 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1
#> [2443] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [2480] 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0
#> [2517] 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0
#> [2554] 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1
#> [2591] 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1
#> [2628] 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 1 0 1 1 0 0 0
#> [2665] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0
#> [2702] 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
#> [2739] 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0
#> [2776] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
#> [2813] 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
#> [2850] 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
#> [2887] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0
#> [2924] 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 0 1 0 0 1 1 1 0 1 0 1 0 0 0 1 0 0
#> [2961] 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [2998] 1 0 0 0 0 0 1 1 0 0 1 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
#> [3035] 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
#> [3072] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [3109] 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
#> [3146] 1 0 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
#> [3183] 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
#> [3220] 1 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [3257] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 0 1 1 0
#> [3294] 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [3331] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1
#> [3368] 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [3405] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
#> [3442] 0 1 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [3479] 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
#> [3516] 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0
#> [3553] 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0
#> [3590] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [3627] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
#> [3664] 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [3701] 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
#> [3738] 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1
#> [3775] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
#> [3812] 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0
#> Levels: 0 1

Random Forest Model Evaluation

We will evaluate the performance of the Random Forest model on the test data using the confusion matrix.

# data test
confusionMatrix(data = pred_test_jf, 
                reference = job_test$target, 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 2560  580
#>          1  313  379
#>                                                
#>                Accuracy : 0.767                
#>                  95% CI : (0.7532, 0.7803)     
#>     No Information Rate : 0.7497               
#>     P-Value [Acc > NIR] : 0.006971             
#>                                                
#>                   Kappa : 0.3155               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.3952               
#>             Specificity : 0.8911               
#>          Pos Pred Value : 0.5477               
#>          Neg Pred Value : 0.8153               
#>              Prevalence : 0.2503               
#>          Detection Rate : 0.0989               
#>    Detection Prevalence : 0.1806               
#>       Balanced Accuracy : 0.6431               
#>                                                
#>        'Positive' Class : 1                    
#> 

We will then check which variables contribute most to the prediction using the varImp function and visualize the result.

varImp_jf <- varImp(job_forest)
varImp_jf
#> rf variable importance
#> 
#>   only 20 most important variables shown (out of 176)
#> 
#>                                           Overall
#> training_hours                            100.000
#> city_development_index                     67.950
#> citycity_21                                16.975
#> company_size50-99                          16.521
#> education_levelMasters                     10.428
#> last_new_job1                              10.393
#> enrolled_universityno_enrollment            9.673
#> relevent_experienceNo relevent experience   9.592
#> last_new_job2                               7.666
#> education_levelHigh School                  6.839
#> experience4                                 6.737
#> company_typePvt Ltd                         6.585
#> experience5                                 6.472
#> genderMale                                  6.415
#> experience3                                 6.291
#> last_new_jobnever                           6.247
#> experience>20                               6.243
#> experience7                                 5.576
#> experience2                                 5.506
#> experience6                                 5.336
plot(varImp_jf)

From the result above (grouping dummy-coded levels back to their original variables), the top five variables with the highest importance for the prediction are as follows:

  • training_hours
  • city_development_index
  • city
  • company_size
  • education_level

Interpreting the Random Forest Model

From the result above, it can be concluded that the model correctly identifies 2560 out of 2873 “0” (Not looking for job change) cases and 379 out of 959 “1” (Looking for job change) cases. In other words, the Random Forest model's specificity for the “0” class is about 89.1%, but its sensitivity for the “1” class falls to about 39.5%, resulting in an overall accuracy of about 76.7%.

Model 3 - Decision Tree Model

The report will also generate predictions for the test data set using a Decision Tree model as an alternative.

The decision tree model is selected because it is powerful, versatile, and very interpretable: it expresses the classification as a set of simple decision rules.

When building a decision tree model, we can control how complex the rules become through pruning (limiting the formation of branches to simplify the resulting tree) in order to prevent overfitting. This report uses the ctree_control function with the following parameters (all three are combined in the short sketch after this list; the model fitted below sets only mincriterion):

  • mincriterion: the value of 1 - p-value required for a split; it acts as a “regulator” for tree depth. The smaller the value, the more complex the resulting tree will be.
  • minsplit: minimum number of observations in a node before splitting.
  • minbucket: minimum number of observations in a terminal/leaf node.
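
A short sketch combining all three controls in a single ctree_control() call (the minsplit and minbucket values here are purely illustrative; the model fitted below uses only mincriterion = 0.90):

ctrl_tree <- ctree_control(mincriterion = 0.90, # require p-value < 0.10 to split
                           minsplit = 100,      # illustrative: at least 100 obs to attempt a split
                           minbucket = 50)      # illustrative: at least 50 obs per leaf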

Creating Decision Tree Model by Pruning method:

library(party)

model_tree <- ctree(target ~., 
                    data = job_train, 
                    control = ctree_control(mincriterion = 0.90))
plot(model_tree)

Predicting the test data

After building the Decision Tree model, we will predict both the train and test data sets using the fitted model.

ctree_test <- predict(model_tree, newdata = job_test)
ctree_train <- predict(model_tree, newdata = job_train)

Decision Tree Model Evaluation

We will evaluate the performance of the Decision Tree model using the confusion matrix, on both the test data and the train data.

# confusion matrix
confusionMatrix(ctree_test, reference = job_test$target, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 2495  477
#>          1  378  482
#>                                         
#>                Accuracy : 0.7769        
#>                  95% CI : (0.7634, 0.79)
#>     No Information Rate : 0.7497        
#>     P-Value [Acc > NIR] : 0.00004697    
#>                                         
#>                   Kappa : 0.3843        
#>                                         
#>  Mcnemar's Test P-Value : 0.0008037     
#>                                         
#>             Sensitivity : 0.5026        
#>             Specificity : 0.8684        
#>          Pos Pred Value : 0.5605        
#>          Neg Pred Value : 0.8395        
#>              Prevalence : 0.2503        
#>          Detection Rate : 0.1258        
#>    Detection Prevalence : 0.2244        
#>       Balanced Accuracy : 0.6855        
#>                                         
#>        'Positive' Class : 1             
#> 
confusionMatrix(ctree_train, reference = job_train$target, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 9997 1760
#>          1 1511 2058
#>                                                
#>                Accuracy : 0.7866               
#>                  95% CI : (0.78, 0.793)        
#>     No Information Rate : 0.7509               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.4168               
#>                                                
#>  Mcnemar's Test P-Value : 0.0000145            
#>                                                
#>             Sensitivity : 0.5390               
#>             Specificity : 0.8687               
#>          Pos Pred Value : 0.5766               
#>          Neg Pred Value : 0.8503               
#>              Prevalence : 0.2491               
#>          Detection Rate : 0.1343               
#>    Detection Prevalence : 0.2329               
#>       Balanced Accuracy : 0.7039               
#>                                                
#>        'Positive' Class : 1                    
#> 

Interpreting the Decision Tree Model

From the result above, it can be concluded that the model correctly identifies 2495 out of 2873 “0” (Not looking for job change) cases and 482 out of 959 “1” (Looking for job change) cases. In other words, the Decision Tree model's specificity for the “0” class is about 86.8%, but its sensitivity for the “1” class falls to about 50.3%, resulting in an overall accuracy of about 77.7%.

Summary of Naive Bayes Model vs. Random Forest Model vs. Decision Tree Model

Below are highlights that can be concluded based on the comparison of the three Models:

  • Comparing the three models, the Naive Bayes model has an overall accuracy of approximately 77.56%, the Random Forest model about 76.7%, and the Decision Tree model approximately 77.7% (see the comparison table sketched below).

  • While the Decision Tree model is a quite powerful classification model (it can handle predictors that are interrelated/dependent) and is easy to interpret, it also has limitations: it tends to overfit, and a small change in the data can lead to a large change in the structure of the optimal tree.

  • Although the accuracy differences between the three models are small, the most optimal accuracy among the three models in this report comes from the Decision Tree model. However, all of these models have their own strengths and limitations, so it is up to the business users to decide how these models will be used later on.
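
The headline metrics from the three confusion matrices above can also be collected into a single comparison table (a sketch; the values are copied from the outputs shown earlier):

model_comparison <- data.frame(
  Model       = c("Naive Bayes", "Random Forest", "Decision Tree"),
  Accuracy    = c(0.7756, 0.7670, 0.7769),
  Sensitivity = c(0.5089, 0.3952, 0.5026),
  Specificity = c(0.8646, 0.8911, 0.8684)
)
model_comparison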