Group Member

Matric Number Name
S2153101 ROWENA CHOY XIN HUI
S2115935 LIM JIE-YING
S2151909 LIM KIM HOONG
S2141911 JING SU
S2117541 RONG SONG

HR Analytics: Job Change of Data Scientists

Year of dataset

  • 2021

Purpose of dataset

  • The selected dataset is designed to predict whether candidates will leave or stay with the company after the training provided. Information such as candidates’ demographics, education and hands-on experience is recorded during the sign-up process.

Dimension of dataset

  • 19158 rows and 14 columns

Content

  • A big data and data science company would like to hire data scientists from among the people who successfully pass the training courses it provides. Initially, many people sign up for the training. The company wants to know which of these candidates genuinely intend to work for the company after completing the training, as this would help the company reduce unnecessary cost and time spent on training, as well as improve the quality of the training.

Structure

  • Overall, the dataset is imported into RStudio as a data frame containing 14 columns (independent and dependent variables) and 19158 rows of data.
Column Name : Description
1. enrollee_id : Unique ID for every candidate
2. city : City code
3. city_development_index : Development index of the city (scaled)
4. gender : Gender of candidate
5. relevant_experience : Relevant experience of candidate
6. enrolled_university : Type of university course enrolled in, if any
7. education_level : Education level of candidate
8. major_discipline : Education major discipline of candidate
9. experience : Candidate's total experience in years
10. company_size : Number of employees in current employer's company
11. company_type : Type of current employer
12. last_new_job : Difference in years between previous job and current job
13. training_hours : Training hours completed
14. target : 0 – Not looking for a job change, 1 – Looking for a job change
  • Notes
  1. Most features are categorical (nominal, ordinal, binary), some with high cardinality.
  2. Missing-value imputation can be part of our pipeline as well.

Inspiration

  • Predict the probability that a candidate will work for the company or leave after completing the training

  • Interpret the model(s) in a way that illustrates which features affect candidates’ decisions

Question(s)

  • Regression: What is the probability that a candidate will leave the company after the training?
  • Classification: Will the candidate decide to stay or leave the company after the training?

Objective of processing this dataset

  • We would like to understand the probability that a candidate will work for or leave the company after attending the training provided by the company. Using the same independent variables, we would also like to predict whether new candidates will leave or stay. In addition, we want to analyse which factor(s) lead a candidate to leave.
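
A rough sketch of this modelling direction is shown below. It is illustrative only: it assumes the cleaned data frame data built later in this report (with target treated as a factor), and the 70/30 split, predictor set and 0.5 threshold are arbitrary choices, not the project’s final model.

# Illustrative sketch only -- assumes the cleaned 'data' frame produced later in
# this report; the chosen predictors and the 0.5 threshold are arbitrary.
library(caTools)
data$target <- as.factor(data$target)
set.seed(123)
split <- sample.split(data$target, SplitRatio = 0.7)   # 70/30 train-test split
train <- subset(data, split == TRUE)
test  <- subset(data, split == FALSE)
# Logistic regression answers both questions at once: a predicted probability of
# leaving (regression view) and a 0/1 label after thresholding (classification view).
logit_model <- glm(target ~ city_development_index + relevant_experience +
                     education_level + experience + last_new_job + training_hours,
                   data = train, family = binomial)
prob_leave <- predict(logit_model, newdata = test, type = "response")
pred_class <- ifelse(prob_leave > 0.5, 1, 0)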

Data processing

install packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.6     ✓ purrr   0.3.4
## ✓ tibble  3.1.7     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(formattable)
library(fastDummies)
library(caTools)
library(InformationValue)
library(e1071)
library(class)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(C50)
library(tree)
library(rpart)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:formattable':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
library(ROCR)

load dataset

data = read.csv('https://raw.githubusercontent.com/RongSong1110/WDQ7004/main/aug_train.csv') 
head(data)
##   enrollee_id     city city_development_index gender     relevent_experience
## 1        8949 city_103                  0.920   Male Has relevent experience
## 2       29725  city_40                  0.776   Male  No relevent experience
## 3       11561  city_21                  0.624         No relevent experience
## 4       33241 city_115                  0.789         No relevent experience
## 5         666 city_162                  0.767   Male Has relevent experience
## 6       21651 city_176                  0.764        Has relevent experience
##   enrolled_university education_level major_discipline experience company_size
## 1       no_enrollment        Graduate             STEM        >20             
## 2       no_enrollment        Graduate             STEM         15        50-99
## 3    Full time course        Graduate             STEM          5             
## 4                            Graduate  Business Degree         <1             
## 5       no_enrollment         Masters             STEM        >20        50-99
## 6    Part time course        Graduate             STEM         11             
##     company_type last_new_job training_hours target
## 1                           1             36      1
## 2        Pvt Ltd           >4             47      0
## 3                       never             83      0
## 4        Pvt Ltd        never             52      1
## 5 Funded Startup            4              8      0
## 6                           1             24      1
class(data)
## [1] "data.frame"

Dimension of the dataset

dim(data)
## [1] 19158    14

Overall Statistics Report

summary(data)
##   enrollee_id        city           city_development_index    gender         
##  Min.   :    1   Length:19158       Min.   :0.4480         Length:19158      
##  1st Qu.: 8554   Class :character   1st Qu.:0.7400         Class :character  
##  Median :16982   Mode  :character   Median :0.9030         Mode  :character  
##  Mean   :16875                      Mean   :0.8288                           
##  3rd Qu.:25170                      3rd Qu.:0.9200                           
##  Max.   :33380                      Max.   :0.9490                           
##  relevent_experience enrolled_university education_level    major_discipline  
##  Length:19158        Length:19158        Length:19158       Length:19158      
##  Class :character    Class :character    Class :character   Class :character  
##  Mode  :character    Mode  :character    Mode  :character   Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##   experience        company_size       company_type       last_new_job      
##  Length:19158       Length:19158       Length:19158       Length:19158      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  training_hours       target      
##  Min.   :  1.00   Min.   :0.0000  
##  1st Qu.: 23.00   1st Qu.:0.0000  
##  Median : 47.00   Median :0.0000  
##  Mean   : 65.37   Mean   :0.2493  
##  3rd Qu.: 88.00   3rd Qu.:0.0000  
##  Max.   :336.00   Max.   :1.0000
str(data)
## 'data.frame':    19158 obs. of  14 variables:
##  $ enrollee_id           : int  8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
##  $ city                  : chr  "city_103" "city_40" "city_21" "city_115" ...
##  $ city_development_index: num  0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
##  $ gender                : chr  "Male" "Male" "" "" ...
##  $ relevent_experience   : chr  "Has relevent experience" "No relevent experience" "No relevent experience" "No relevent experience" ...
##  $ enrolled_university   : chr  "no_enrollment" "no_enrollment" "Full time course" "" ...
##  $ education_level       : chr  "Graduate" "Graduate" "Graduate" "Graduate" ...
##  $ major_discipline      : chr  "STEM" "STEM" "STEM" "Business Degree" ...
##  $ experience            : chr  ">20" "15" "5" "<1" ...
##  $ company_size          : chr  "" "50-99" "" "" ...
##  $ company_type          : chr  "" "Pvt Ltd" "" "Pvt Ltd" ...
##  $ last_new_job          : chr  "1" ">4" "never" "never" ...
##  $ training_hours        : int  36 47 83 52 8 24 24 18 46 123 ...
##  $ target                : num  1 0 0 1 0 1 0 1 1 0 ...

Remove the unwanted columns

  • enrollee_id does not contribute to the output, as it is simply a unique entry identifier, so we decided to remove this column.
data[,c('enrollee_id')]<-list(NULL)

rename the column with a spelling error

names(data)[names(data) == 'relevent_experience'] <- "relevant_experience"

Explore the unique values of each categorical variable

col_name<-names(data)
categorical_col_name<-names(Filter(is.character,data))
num_col_name<-setdiff(col_name,categorical_col_name)
num_col_name<-num_col_name[!num_col_name %in% 'target']
categorical_data<-data[,c(categorical_col_name)]
unique_value<-function(x){
  print("Unique values of categorical variables in the dataset:")
  lapply(x,unique)
}
unique_value(categorical_data)
## [1] "Unique values of categorical variables in the dataset:"
## $city
##   [1] "city_103" "city_40"  "city_21"  "city_115" "city_162" "city_176"
##   [7] "city_160" "city_46"  "city_61"  "city_114" "city_13"  "city_159"
##  [13] "city_102" "city_67"  "city_100" "city_16"  "city_71"  "city_104"
##  [19] "city_64"  "city_101" "city_83"  "city_105" "city_73"  "city_75" 
##  [25] "city_41"  "city_11"  "city_93"  "city_90"  "city_36"  "city_20" 
##  [31] "city_57"  "city_152" "city_19"  "city_65"  "city_74"  "city_173"
##  [37] "city_136" "city_98"  "city_97"  "city_50"  "city_138" "city_82" 
##  [43] "city_157" "city_89"  "city_150" "city_70"  "city_175" "city_94" 
##  [49] "city_28"  "city_59"  "city_165" "city_145" "city_142" "city_26" 
##  [55] "city_12"  "city_37"  "city_43"  "city_116" "city_23"  "city_99" 
##  [61] "city_149" "city_10"  "city_45"  "city_80"  "city_128" "city_158"
##  [67] "city_123" "city_7"   "city_72"  "city_106" "city_143" "city_78" 
##  [73] "city_109" "city_24"  "city_134" "city_48"  "city_144" "city_91" 
##  [79] "city_146" "city_133" "city_126" "city_118" "city_9"   "city_167"
##  [85] "city_27"  "city_84"  "city_54"  "city_39"  "city_79"  "city_76" 
##  [91] "city_77"  "city_81"  "city_131" "city_44"  "city_117" "city_155"
##  [97] "city_33"  "city_141" "city_127" "city_62"  "city_53"  "city_25" 
## [103] "city_2"   "city_69"  "city_120" "city_111" "city_30"  "city_1"  
## [109] "city_140" "city_179" "city_55"  "city_14"  "city_42"  "city_107"
## [115] "city_18"  "city_139" "city_180" "city_166" "city_121" "city_129"
## [121] "city_8"   "city_31"  "city_171"
## 
## $gender
## [1] "Male"   ""       "Female" "Other" 
## 
## $relevant_experience
## [1] "Has relevent experience" "No relevent experience" 
## 
## $enrolled_university
## [1] "no_enrollment"    "Full time course" ""                 "Part time course"
## 
## $education_level
## [1] "Graduate"       "Masters"        "High School"    ""              
## [5] "Phd"            "Primary School"
## 
## $major_discipline
## [1] "STEM"            "Business Degree" ""                "Arts"           
## [5] "Humanities"      "No Major"        "Other"          
## 
## $experience
##  [1] ">20" "15"  "5"   "<1"  "11"  "13"  "7"   "17"  "2"   "16"  "1"   "4"  
## [13] "10"  "14"  "18"  "19"  "12"  "3"   "6"   "9"   "8"   "20"  ""   
## 
## $company_size
## [1] ""          "50-99"     "<10"       "10000+"    "5000-9999" "1000-4999"
## [7] "10/49"     "100-500"   "500-999"  
## 
## $company_type
## [1] ""                    "Pvt Ltd"             "Funded Startup"     
## [4] "Early Stage Startup" "Other"               "Public Sector"      
## [7] "NGO"                
## 
## $last_new_job
## [1] "1"     ">4"    "never" "4"     "3"     "2"     ""

Recode the values of some categorical variables

# replace 'Has relevent experience' and 'No relevent experience' with 'yes' and 'no'
data$relevant_experience[data$relevant_experience=='Has relevent experience']<-'yes'
data$relevant_experience[data$relevant_experience=='No relevent experience']<-'no'
# standardise the inconsistent values of the company_size variable
data$company_size<-replace(data$company_size,data$company_size == '10/49', '10-49')
data$company_size<-replace(data$company_size,data$company_size == '100-500', '100-499')
data$company_size<-replace(data$company_size,data$company_size == '10000+', '>9999')
# encode 'never' in last_new_job as 0
data$last_new_job<-replace(data$last_new_job,data$last_new_job=='never',0)

Data cleaning and preprocessing

detect missing values

missing_value<-function(x){
  print("Missing values in the dataset:")
  for(i in x) {
    print(paste(i,sum(data[i]==""|is.na(data[i]))))
  }
}
missing_value(col_name)
## [1] "Missing values in the dataset:"
## [1] "city 0"
## [1] "city_development_index 0"
## [1] "gender 4508"
## [1] "relevant_experience 0"
## [1] "enrolled_university 386"
## [1] "education_level 460"
## [1] "major_discipline 2813"
## [1] "experience 65"
## [1] "company_size 5938"
## [1] "company_type 6140"
## [1] "last_new_job 423"
## [1] "training_hours 0"
## [1] "target 0"

Visualize each variable with missing values

cols_with_nan_data = data[,c('gender', 'enrolled_university', 'major_discipline', 'experience', 'company_size', 'last_new_job', 'company_type', 'education_level')] 
print(head(cols_with_nan_data))
##   gender enrolled_university major_discipline experience company_size
## 1   Male       no_enrollment             STEM        >20             
## 2   Male       no_enrollment             STEM         15        50-99
## 3           Full time course             STEM          5             
## 4                             Business Degree         <1             
## 5   Male       no_enrollment             STEM        >20        50-99
## 6           Part time course             STEM         11             
##   last_new_job   company_type education_level
## 1            1                       Graduate
## 2           >4        Pvt Ltd        Graduate
## 3            0                       Graduate
## 4            0        Pvt Ltd        Graduate
## 5            4 Funded Startup         Masters
## 6            1                       Graduate
lapply(names(cols_with_nan_data), function(col) {
  ggplot(cols_with_nan_data, aes(.data[[col]], ..count..)) + 
    geom_bar(aes(fill = .data[[col]]), position = "dodge")
}) -> list_plots
list_plots
## (output: eight bar charts, one for each of gender, enrolled_university, major_discipline, experience, company_size, last_new_job, company_type and education_level; the plots themselves are not reproduced here)

From the plots above, we can observe that missing values occur only in the categorical variables.

Filling missing values for each variable

# convert empty strings to explicit NA values
for(i in col_name){
  data[!is.na(data[i])&data[i]=="",i]<-NA
}
# Find the proportion of missing values for each variable
missing_value_size<-function(x){
  print("The proportion of missing value")
  for(i in x){
    if(sum(is.na(data[i]))>0){
      print(paste(i,percent((sum(is.na(data[i]))/nrow(data)))))
    }
  }
}
missing_value_size(col_name)
## [1] "The proportion of missing value"
## [1] "gender 23.53%"
## [1] "enrolled_university 2.01%"
## [1] "education_level 2.40%"
## [1] "major_discipline 14.68%"
## [1] "experience 0.34%"
## [1] "company_size 30.99%"
## [1] "company_type 32.05%"
## [1] "last_new_job 2.21%"

From the above result, we observe that the proportion of missing values for ‘experience’ is extremely small (0.34%). Hence, we replace them with the mode value.

data$experience<-replace(data$experience,is.na(data$experience), '>20')
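
Rather than hard-coding ‘>20’, the mode of ‘experience’ could also be derived programmatically; a small alternative sketch, assuming the data frame at this point of the pipeline:

# Alternative sketch: compute the most frequent (mode) level of 'experience'
# instead of hard-coding it, then use it to fill the remaining NA values.
exp_mode <- names(sort(table(data$experience, useNA = "no"), decreasing = TRUE))[1]
data$experience <- replace(data$experience, is.na(data$experience), exp_mode)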

We will replace the rest of the missing values with a new class called ‘unknown’, as the proportions of missing values for these variables are too large to impute reliably.

col_miss<-c('gender','enrolled_university','major_discipline','company_size','last_new_job','company_type','education_level')
for(i in col_miss){
  data[is.na(data[i]),i]<-'unknown'
}
head(data)
##       city city_development_index  gender relevant_experience
## 1 city_103                  0.920    Male                 yes
## 2  city_40                  0.776    Male                  no
## 3  city_21                  0.624 unknown                  no
## 4 city_115                  0.789 unknown                  no
## 5 city_162                  0.767    Male                 yes
## 6 city_176                  0.764 unknown                 yes
##   enrolled_university education_level major_discipline experience company_size
## 1       no_enrollment        Graduate             STEM        >20      unknown
## 2       no_enrollment        Graduate             STEM         15        50-99
## 3    Full time course        Graduate             STEM          5      unknown
## 4             unknown        Graduate  Business Degree         <1      unknown
## 5       no_enrollment         Masters             STEM        >20        50-99
## 6    Part time course        Graduate             STEM         11      unknown
##     company_type last_new_job training_hours target
## 1        unknown            1             36      1
## 2        Pvt Ltd           >4             47      0
## 3        unknown            0             83      0
## 4        Pvt Ltd            0             52      1
## 5 Funded Startup            4              8      0
## 6        unknown            1             24      1

Check again if missing values have been processed or not

missing_value(col_name)
## [1] "Missing values in the dataset:"
## [1] "city 0"
## [1] "city_development_index 0"
## [1] "gender 0"
## [1] "relevant_experience 0"
## [1] "enrolled_university 0"
## [1] "education_level 0"
## [1] "major_discipline 0"
## [1] "experience 0"
## [1] "company_size 0"
## [1] "company_type 0"
## [1] "last_new_job 0"
## [1] "training_hours 0"
## [1] "target 0"

Feature Engineering

# split the 'city' values (e.g. 'city_103') so that only the numeric code is kept
data <- separate(data, city, c('Name','city'),sep = '_')
data[,'Name']<-list(NULL)   # drop the constant 'city' prefix column
head(data)
##   city city_development_index  gender relevant_experience enrolled_university
## 1  103                  0.920    Male                 yes       no_enrollment
## 2   40                  0.776    Male                  no       no_enrollment
## 3   21                  0.624 unknown                  no    Full time course
## 4  115                  0.789 unknown                  no             unknown
## 5  162                  0.767    Male                 yes       no_enrollment
## 6  176                  0.764 unknown                 yes    Part time course
##   education_level major_discipline experience company_size   company_type
## 1        Graduate             STEM        >20      unknown        unknown
## 2        Graduate             STEM         15        50-99        Pvt Ltd
## 3        Graduate             STEM          5      unknown        unknown
## 4        Graduate  Business Degree         <1      unknown        Pvt Ltd
## 5         Masters             STEM        >20        50-99 Funded Startup
## 6        Graduate             STEM         11      unknown        unknown
##   last_new_job training_hours target
## 1            1             36      1
## 2           >4             47      0
## 3            0             83      0
## 4            0             52      1
## 5            4              8      0
## 6            1             24      1

Check outliers

for(i in num_col_name){
  boxplot(data[i],main="Boxplot")
  print(paste(i,length(data[,i][data[,i] %in% boxplot.stats(data[,i])$out])))
}

## [1] "city_development_index 17"

## [1] "training_hours 984"

From the above result, we observe that there are 17 outliers for the variable ‘city_development_index’ and 984 outliers for ‘training_hours’ (boxplot.stats flags points lying more than roughly 1.5 times the interquartile range beyond the quartiles). We will check the minimum and maximum outlier values for ‘training_hours’.

print(paste('minimum outlier value for training hour is:',min(data[,'training_hours'][data[,'training_hours'] %in% boxplot.stats(data[,'training_hours'])$out])))
## [1] "minimum outlier value for training hour is: 188"
print(paste('maximum outlier value for training hour is:',max(data[,'training_hours'][data[,'training_hours'] %in% boxplot.stats(data[,'training_hours'])$out])))
## [1] "maximum outlier value for training hour is: 336"

If we assume a data scientist works 8 hours per day, the minimum outlier value for training hours corresponds to 23.5 working days (almost one month) and the maximum to 42 working days (almost two months). These training hours seem quite reasonable, and training hours may be a significant factor in the output, so we decided not to replace the outliers for ‘training_hours’.
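
For reference, the cut-off applied by boxplot.stats() (approximately Q3 + 1.5 * IQR) and the hour-to-day conversion used above can be reproduced directly; a small sketch based on the quartiles reported in the summary:

# Sketch: approximate upper whisker for training_hours and the 8-hour-day conversion.
q <- quantile(data$training_hours, c(0.25, 0.75))   # Q1 = 23, Q3 = 88
upper_fence <- q[2] + 1.5 * (q[2] - q[1])           # 88 + 1.5 * 65 = 185.5, so values >= 188 are flagged
c(min_outlier_days = 188 / 8, max_outlier_days = 336 / 8)   # 23.5 and 42 working days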

Next, we check the outliers for ‘city_development_index’.

print(unique(data[,'city_development_index'][data[,'city_development_index'] %in% boxplot.stats(data[,'city_development_index'])$out]))
## [1] 0.448

The outlier value for city_development_index is 0.448. We now need to check whether all the entries with this value come from the same city.

unique(filter(data,city_development_index==0.448)[,'city'])
## [1] "33"

From the above result, we observe that all the entries with a city development index of 0.448 come from the same city, city 33. We now further validate whether all the entries from city 33 share the same city development index of 0.448.

unique(filter(data,city==33)[,'city_development_index'])
## [1] 0.448

From the above result, all the entries for city 33 have the same city development index of 0.448. Hence, we can say that 0.448 is not a genuine outlier but a reasonable entry for that city.

Convert the categorical variables to dummy variables to facilitate the data analysis

data_preprocess_categorical<-dummy_cols(data,select_columns=categorical_col_name)
data_preprocess_categorical[,c(categorical_col_name)]<-list(NULL)
dim(data_preprocess_categorical)
## [1] 19158   194

EDA

Bivariate Analysis

# city & target
dat0 <- data.frame(table(data$city,data$target))
names(dat0) <- c("City","target","Count")
dat0<-spread(dat0,target,Count)
dat0['prob_of_stay']<-round(dat0['0']/(dat0['0']+dat0['1']),4)
dat0<-dat0[order(dat0[,"prob_of_stay"],decreasing=TRUE),]
dat0
##     City    0    1 prob_of_stay
## 13   111    3    0       1.0000
## 26   129    3    0       1.0000
## 35   140    1    0       1.0000
## 63     2    7    0       1.0000
## 77    39   11    0       1.0000
## 93    62    5    0       1.0000
## 109    8    4    0       1.0000
## 112   82    4    0       1.0000
## 106   77   31    1       0.9688
## 121   97   96    8       0.9231
## 71    28  177   15       0.9219
## 75    36  147   13       0.9188
## 32   138  110   10       0.9167
## 66    23  166   16       0.9121
## 7    104  273   28       0.9070
## 92    61  178   19       0.9036
## 14   114 1203  133       0.9004
## 101   72   18    2       0.9000
## 122   98   71    8       0.8987
## 104   75  274   31       0.8984
## 64    20   26    3       0.8966
## 31   136  525   61       0.8959
## 2     10   77    9       0.8953
## 86    50  125   15       0.8929
## 8    105   70    9       0.8861
## 1      1   23    3       0.8846
## 49    16 1354  179       0.8832
## 97    69   15    2       0.8824
## 27    13   42    6       0.8750
## 56   173  132   19       0.8742
## 90    57   90   13       0.8738
## 96    67  374   57       0.8677
## 110   80   13    2       0.8667
## 46   157   19    3       0.8636
## 95    65  151   24       0.8629
## 94    64   98   16       0.8596
## 83    45   97   16       0.8584
## 72    30   18    3       0.8571
## 76    37   12    2       0.8571
## 111   81    6    1       0.8571
## 100   71  227   39       0.8534
## 78    40   58   10       0.8529
## 36   141   23    4       0.8519
## 48   159   80   14       0.8511
## 19    12   11    2       0.8462
## 123   99   79   15       0.8404
## 113   83  120   23       0.8392
## 20   120    5    1       0.8333
## 5    102  252   52       0.8289
## 16   116  106   22       0.8281
## 98     7   22    5       0.8148
## 70    27   38    9       0.8085
## 119   93   21    5       0.8077
## 24   127    8    2       0.8000
## 29   133    8    2       0.8000
## 79    41   71   18       0.7978
## 69    26   19    5       0.7917
## 6    103 3427  928       0.7869
## 57   175   11    3       0.7857
## 88    54   11    3       0.7857
## 44   152   40   11       0.7843
## 52   165   64   18       0.7805
## 67    24   48   14       0.7742
## 42   149   78   24       0.7647
## 50   160  646  199       0.7645
## 3    100  210   65       0.7636
## 115   89   51   16       0.7612
## 43   150   49   16       0.7538
## 34    14   21    7       0.7500
## 53   166    3    1       0.7500
## 60    18    3    1       0.7500
## 73    31    3    1       0.7500
## 102   73  206   74       0.7357
## 22   123   58   21       0.7342
## 84    46   93   35       0.7266
## 39   144   21    8       0.7241
## 116    9   13    5       0.7222
## 30   134   31   12       0.7209
## 105   76   36   14       0.7200
## 61   180    5    2       0.7143
## 89    55   10    4       0.7143
## 51   162   91   37       0.7109
## 58   176   17    7       0.7083
## 114   84   17    7       0.7083
## 54   167    7    3       0.7000
## 91    59    7    3       0.7000
## 37   142   37   16       0.6981
## 47   158   34   15       0.6939
## 17   117    9    4       0.6923
## 87    53   18    8       0.6923
## 117   90  135   62       0.6853
## 9    106    6    3       0.6667
## 11   109    6    3       0.6667
## 18   118   18    9       0.6667
## 21   121    2    1       0.6667
## 28   131    6    3       0.6667
## 82    44   12    6       0.6667
## 118   91   29   16       0.6444
## 41   146    5    3       0.6250
## 108   79    5    3       0.6250
## 62    19   74   45       0.6218
## 120   94   16   10       0.6154
## 15   115   33   21       0.6111
## 38   143   25   16       0.6098
## 99    70   25   19       0.5682
## 68    25    2    2       0.5000
## 107   78   15   16       0.4839
## 103   74   50   54       0.4808
## 23   126   13   15       0.4643
## 85    48    6    7       0.4615
## 25   128   40   52       0.4348
## 4    101   32   43       0.4267
## 81    43    5    7       0.4167
## 40   145   26   37       0.4127
## 65    21 1105 1597       0.4090
## 12    11  100  147       0.4049
## 59   179    2    3       0.4000
## 74    33    6   11       0.3529
## 10   107    2    4       0.3333
## 80    42    4    9       0.3077
## 45   155    3   11       0.2143
## 33   139    1    4       0.2000
## 55   171    0    1       0.0000

According to the analysis, the data scientists based in cities 111, 129, 140, 2, 39, 62, 8 and 82 all chose to stay in their company (probability of staying = 1), i.e. none of them is looking for a job change.

filter(dat0,prob_of_stay<0.5)
##    City    0    1 prob_of_stay
## 1    78   15   16       0.4839
## 2    74   50   54       0.4808
## 3   126   13   15       0.4643
## 4    48    6    7       0.4615
## 5   128   40   52       0.4348
## 6   101   32   43       0.4267
## 7    43    5    7       0.4167
## 8   145   26   37       0.4127
## 9    21 1105 1597       0.4090
## 10   11  100  147       0.4049
## 11  179    2    3       0.4000
## 12   33    6   11       0.3529
## 13  107    2    4       0.3333
## 14   42    4    9       0.3077
## 15  155    3   11       0.2143
## 16  139    1    4       0.2000
## 17  171    0    1       0.0000

Out of the 123 cities where the data scientists live, data scientists from only 17 cities have a higher probability of changing their job than of staying. The largest number of data scientists who chose to leave is from city 21 (1597), followed by city 11 (147) and city 74 (54).

dat0['total']<-dat0['0']+dat0['1']
dat0<-dat0[order(dat0[,"total"],decreasing=TRUE),]
dat0
##     City    0    1 prob_of_stay total
## 6    103 3427  928       0.7869  4355
## 65    21 1105 1597       0.4090  2702
## 49    16 1354  179       0.8832  1533
## 14   114 1203  133       0.9004  1336
## 50   160  646  199       0.7645   845
## 31   136  525   61       0.8959   586
## 96    67  374   57       0.8677   431
## 104   75  274   31       0.8984   305
## 5    102  252   52       0.8289   304
## 7    104  273   28       0.9070   301
## 102   73  206   74       0.7357   280
## 3    100  210   65       0.7636   275
## 100   71  227   39       0.8534   266
## 12    11  100  147       0.4049   247
## 92    61  178   19       0.9036   197
## 117   90  135   62       0.6853   197
## 71    28  177   15       0.9219   192
## 66    23  166   16       0.9121   182
## 95    65  151   24       0.8629   175
## 75    36  147   13       0.9188   160
## 56   173  132   19       0.8742   151
## 113   83  120   23       0.8392   143
## 86    50  125   15       0.8929   140
## 16   116  106   22       0.8281   128
## 84    46   93   35       0.7266   128
## 51   162   91   37       0.7109   128
## 32   138  110   10       0.9167   120
## 62    19   74   45       0.6218   119
## 94    64   98   16       0.8596   114
## 83    45   97   16       0.8584   113
## 121   97   96    8       0.9231   104
## 103   74   50   54       0.4808   104
## 90    57   90   13       0.8738   103
## 42   149   78   24       0.7647   102
## 48   159   80   14       0.8511    94
## 123   99   79   15       0.8404    94
## 25   128   40   52       0.4348    92
## 79    41   71   18       0.7978    89
## 2     10   77    9       0.8953    86
## 52   165   64   18       0.7805    82
## 122   98   71    8       0.8987    79
## 8    105   70    9       0.8861    79
## 22   123   58   21       0.7342    79
## 4    101   32   43       0.4267    75
## 78    40   58   10       0.8529    68
## 115   89   51   16       0.7612    67
## 43   150   49   16       0.7538    65
## 40   145   26   37       0.4127    63
## 67    24   48   14       0.7742    62
## 15   115   33   21       0.6111    54
## 37   142   37   16       0.6981    53
## 44   152   40   11       0.7843    51
## 105   76   36   14       0.7200    50
## 47   158   34   15       0.6939    49
## 27    13   42    6       0.8750    48
## 70    27   38    9       0.8085    47
## 118   91   29   16       0.6444    45
## 99    70   25   19       0.5682    44
## 30   134   31   12       0.7209    43
## 38   143   25   16       0.6098    41
## 106   77   31    1       0.9688    32
## 107   78   15   16       0.4839    31
## 64    20   26    3       0.8966    29
## 39   144   21    8       0.7241    29
## 34    14   21    7       0.7500    28
## 23   126   13   15       0.4643    28
## 36   141   23    4       0.8519    27
## 98     7   22    5       0.8148    27
## 18   118   18    9       0.6667    27
## 1      1   23    3       0.8846    26
## 119   93   21    5       0.8077    26
## 87    53   18    8       0.6923    26
## 120   94   16   10       0.6154    26
## 69    26   19    5       0.7917    24
## 58   176   17    7       0.7083    24
## 114   84   17    7       0.7083    24
## 46   157   19    3       0.8636    22
## 72    30   18    3       0.8571    21
## 101   72   18    2       0.9000    20
## 116    9   13    5       0.7222    18
## 82    44   12    6       0.6667    18
## 97    69   15    2       0.8824    17
## 74    33    6   11       0.3529    17
## 110   80   13    2       0.8667    15
## 76    37   12    2       0.8571    14
## 57   175   11    3       0.7857    14
## 88    54   11    3       0.7857    14
## 89    55   10    4       0.7143    14
## 45   155    3   11       0.2143    14
## 19    12   11    2       0.8462    13
## 17   117    9    4       0.6923    13
## 85    48    6    7       0.4615    13
## 80    42    4    9       0.3077    13
## 81    43    5    7       0.4167    12
## 77    39   11    0       1.0000    11
## 24   127    8    2       0.8000    10
## 29   133    8    2       0.8000    10
## 54   167    7    3       0.7000    10
## 91    59    7    3       0.7000    10
## 9    106    6    3       0.6667     9
## 11   109    6    3       0.6667     9
## 28   131    6    3       0.6667     9
## 41   146    5    3       0.6250     8
## 108   79    5    3       0.6250     8
## 63     2    7    0       1.0000     7
## 111   81    6    1       0.8571     7
## 61   180    5    2       0.7143     7
## 20   120    5    1       0.8333     6
## 10   107    2    4       0.3333     6
## 93    62    5    0       1.0000     5
## 59   179    2    3       0.4000     5
## 33   139    1    4       0.2000     5
## 109    8    4    0       1.0000     4
## 112   82    4    0       1.0000     4
## 53   166    3    1       0.7500     4
## 60    18    3    1       0.7500     4
## 73    31    3    1       0.7500     4
## 68    25    2    2       0.5000     4
## 13   111    3    0       1.0000     3
## 26   129    3    0       1.0000     3
## 21   121    2    1       0.6667     3
## 35   140    1    0       1.0000     1
## 55   171    0    1       0.0000     1

If we sort the table by total count, most of the data scientists live in city 103, followed by city 21 and city 16.

# city_development_index & target
dat1 <- data.frame(table(data$city_development_index,data$target))
names(dat1) <- c("city_development_index","target","Count")
dat1<-spread(dat1,target,Count)
dat1['total']<-dat1['0']+dat1['1']
dat1<-dat1[order(dat1[,"total"],decreasing=TRUE),]
dat1
##    city_development_index    0    1 total
## 86                   0.92 4073 1127  5200
## 15                  0.624 1105 1597  2702
## 83                   0.91 1354  179  1533
## 91                  0.926 1203  133  1336
## 28                  0.698  489  194   683
## 79                  0.897  525   61   586
## 92                  0.939  451   46   497
## 68                  0.855  374   57   431
## 58                  0.804  252   52   304
## 89                  0.924  273   28   301
## 41                  0.754  206   74   280
## 74                  0.887  210   65   275
## 73                  0.884  227   39   266
## 9                    0.55  100  147   247
## 84                  0.913  178   19   197
## 81                  0.899  166   16   182
## 57                  0.802  151   24   175
## 90                  0.925  147   24   171
## 76                  0.893  147   13   160
## 72                  0.878  132   19   151
## 39                  0.743  119   27   146
## 88                  0.923  120   23   143
## 78                  0.896  125   15   140
## 61                  0.827  113   24   137
## 14                  0.579   65   70   135
## 42                  0.762   93   35   128
## 46                  0.767   91   37   128
## 63                  0.836  110   10   120
## 24                  0.682   74   45   119
## 22                  0.666   98   16   114
## 75                   0.89   97   16   113
## 71                  0.866   90   13   103
## 25                  0.689   78   24   102
## 65                  0.843   80   14    94
## 85                  0.915   79   15    94
## 54                  0.794   82   11    93
## 8                   0.527   40   52    92
## 77                  0.895   77    9    86
## 49                  0.776   69   13    82
## 82                  0.903   64   18    82
## 35                  0.738   58   21    79
## 93                  0.949   71    8    79
## 12                  0.558   32   43    75
## 37                   0.74   43   24    67
## 10                  0.555   26   37    63
## 53                  0.789   33   21    54
## 32                  0.727   37   16    53
## 45                  0.766   34   15    49
## 67                  0.848   38    9    47
## 26                  0.691   29   16    45
## 66                  0.847   36    5    41
## 62                   0.83   31    1    32
## 69                  0.856   27    5    32
## 56                  0.796   26    3    29
## 64                   0.84   21    8    29
## 2                   0.479   13   15    28
## 19                  0.647   22    5    27
## 30                  0.722   18    9    27
## 43                  0.763   23    4    27
## 70                  0.865   21    5    26
## 44                  0.764   17    7    24
## 47                  0.769   19    3    22
## 55                  0.795   18    2    20
## 31                  0.725   12    6    18
## 1                   0.448    6   11    17
## 11                  0.556    3   11    14
## 36                  0.739   10    4    14
## 4                   0.493    6    7    13
## 13                  0.563    4    9    13
## 17                   0.64   11    2    13
## 6                   0.516    5    7    12
## 80                  0.898   11    0    11
## 38                  0.742    8    2    10
## 40                  0.745    8    2    10
## 48                  0.775    7    3    10
## 87                  0.921    7    3    10
## 23                   0.68    6    3     9
## 29                  0.701    6    3     9
## 34                  0.735    5    3     8
## 33                   0.73    6    1     7
## 52                  0.788    7    0     7
## 7                   0.518    2    4     6
## 50                   0.78    5    1     6
## 3                   0.487    1    4     5
## 5                   0.512    2    3     5
## 18                  0.645    5    0     5
## 20                  0.649    3    1     4
## 27                  0.693    4    0     4
## 59                  0.807    3    1     4
## 60                  0.824    3    1     4
## 16                  0.625    3    0     3
## 51                  0.781    2    1     3
## 21                  0.664    0    1     1
plot(data$city_development_index,data$target)

From the scatter plot, we can see that most of the entries in this dataset come from cities with a city development index between 0.8 and 0.93, apart from cities with indices of 0.624, 0.754 and 0.55.

# Gender & target
dat2 <- data.frame(table(data$gender,data$target))
names(dat2) <- c("Gender","target","Count")
dat2
##    Gender target Count
## 1  Female      0   912
## 2    Male      0 10209
## 3   Other      0   141
## 4 unknown      0  3119
## 5  Female      1   326
## 6    Male      1  3012
## 7   Other      1    50
## 8 unknown      1  1389
ggplot(data = dat2,aes(y = Gender, 
                       x = Count, fill=target))+
  geom_bar(stat="identity",position="dodge")+
  labs(title = 'Number of target by gender', x= 'Number of target', y='Gender')

The bar chart illustrates that most of the data was collected from male respondents. Overall, we can see that employees choose to stay more often than they leave. Looking at the details, the probability that a male employee stays with his company is higher than that of a female employee.
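
The stay-probability claim above can be checked with a small sketch (using the same data frame as this section) that converts the gender-by-target counts into row-wise proportions:

# Sketch: proportion of stayers (target = 0) and leavers (target = 1) by gender.
round(prop.table(table(data$gender, data$target), margin = 1), 4)
# e.g. Male: 10209 / (10209 + 3012) = 0.7722 versus Female: 912 / (912 + 326) = 0.7367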

# relevent_experience & target
dat3 <- data.frame(table(data$relevant_experience,data$target))
names(dat3) <- c("Relevant_experience","target","Count")
dat3
##   Relevant_experience target Count
## 1                  no      0  3550
## 2                 yes      0 10831
## 3                  no      1  1816
## 4                 yes      1  2961
ggplot(data = dat3,aes(y = Relevant_experience,
                       x = Count, fill=target))+
  geom_bar(stat="identity",position="dodge")+
  labs(title = 'Number of targets by relevant experience', x= 'Number of target', y='Relevant experience')

It can be seen from the bar chart that, regardless of whether they have relevant work experience, the proportion of employees who choose to stay is greater than the proportion who choose to leave. In terms of overall composition, most of the data scientists in this dataset have relevant experience, and the proportion of those with relevant experience who choose to stay with their company is much higher than that of those without relevant experience.

# enrolled_university & target
dat4 <- data.frame(table(data$enrolled_university,data$target))
names(dat4) <- c("Enrolled_university","target","Count")
dat4
##   Enrolled_university target Count
## 1    Full time course      0  2326
## 2       no_enrollment      0 10896
## 3    Part time course      0   896
## 4             unknown      0   263
## 5    Full time course      1  1431
## 6       no_enrollment      1  2921
## 7    Part time course      1   302
## 8             unknown      1   123
ggplot(data = dat4,aes(y = Enrolled_university,
                       x = Count, fill=target))+
  geom_bar(stat="identity",position="dodge")+
  labs(title = 'Number of targets by enrolled university', x= 'Number of target', y='Enrolled university')

The bar chart shows that employees choose to stay more than they leave, regardless of whether they are enrolled in university. In terms of overall composition, most of the data was collected from those who are not enrolled in any full-time or part-time university course, and they are more prone to stay with the company.

# education_level & target
dat5 <- data.frame(table(data$education_level,data$target))
names(dat5) <- c("Education_level","target","Count")
dat5
##    Education_level target Count
## 1         Graduate      0  8353
## 2      High School      0  1623
## 3          Masters      0  3426
## 4              Phd      0   356
## 5   Primary School      0   267
## 6          unknown      0   356
## 7         Graduate      1  3245
## 8      High School      1   394
## 9          Masters      1   935
## 10             Phd      1    58
## 11  Primary School      1    41
## 12         unknown      1   104
ggplot(data = dat5,aes(y = Education_level,
                       x = Count, fill=target))+
  geom_bar(stat="identity",position="dodge")+
  labs(title = 'Number of targets by education level', x= 'Number of target', y='Education level')

The bar chart shows that, regardless of education level, more employees choose to stay than leave. In terms of overall composition, graduates account for the largest proportion. We can also see from the graph that it is very rare for a data scientist to have only a primary school education. Employees with a high school education are more likely to stay with the company, whereas graduates are relatively more likely to go for a job change.

# major_discipline & target
dat6 <- data.frame(table(data$major_discipline,data$target))
names(dat6) <- c("Major_discipline","target","Count")
dat6
##    Major_discipline target Count
## 1              Arts      0   200
## 2   Business Degree      0   241
## 3        Humanities      0   528
## 4          No Major      0   168
## 5             Other      0   279
## 6              STEM      0 10701
## 7           unknown      0  2264
## 8              Arts      1    53
## 9   Business Degree      1    86
## 10       Humanities      1   141
## 11         No Major      1    55
## 12            Other      1   102
## 13             STEM      1  3791
## 14          unknown      1   549
ggplot(data = dat6,aes(y = Major_discipline,
                       x = Count, fill=target))+
  geom_bar(stat="identity",position="dodge")+
  labs(title = 'Number of targets by major discipline', x= 'Number of target', y='Major discipline')

From the bar chart, we can observe that most of the employees working as data scientists majored in STEM.

# experience & target
dat7 <- data.frame(table(data$experience,data$target))
names(dat7) <- c("Experience","target","Count")
dat7<-spread(dat7,target,Count)
dat7['prob_of_stay']<-round(dat7['0']/(dat7['0']+dat7['1']),4)
dat7<-dat7[order(dat7[,"prob_of_stay"],decreasing=TRUE),]
dat7
##    Experience    0   1 prob_of_stay
## 10         16  436  72       0.8583
## 12         18  237  43       0.8464
## 2         >20 2825 526       0.8430
## 9          15  572 114       0.8338
## 11         17  285  57       0.8333
## 13         19  251  53       0.8257
## 8          14  479 107       0.8174
## 6          12  402  92       0.8138
## 7          13  322  77       0.8070
## 4          10  778 207       0.7898
## 22          9  767 213       0.7827
## 15         20  115  33       0.7770
## 5          11  513 151       0.7726
## 21          8  607 195       0.7569
## 19          6  873 343       0.7179
## 18          5 1018 412       0.7119
## 20          7  725 303       0.7053
## 17          4  946 457       0.6743
## 14          2  753 374       0.6681
## 16          3  876 478       0.6470
## 3           1  316 233       0.5756
## 1          <1  285 237       0.5460

We can observe that data scientists with less experience are more likely to go for a job change, while those with the highest probability of staying with the company are those with more than 10 years of experience.

# company_size & target
dat8 <- data.frame(table(data$company_size,data$target))
names(dat8) <- c("Company_size","target","Count")
dat8
##    Company_size target Count
## 1           <10      0  1084
## 2         >9999      0  1634
## 3         10-49      0  1127
## 4       100-499      0  2156
## 5     1000-4999      0  1128
## 6         50-99      0  2538
## 7       500-999      0   725
## 8     5000-9999      0   461
## 9       unknown      0  3528
## 10          <10      1   224
## 11        >9999      1   385
## 12        10-49      1   344
## 13      100-499      1   415
## 14    1000-4999      1   200
## 15        50-99      1   545
## 16      500-999      1   152
## 17    5000-9999      1   102
## 18      unknown      1  2410
ggplot(data = dat8,aes(y = Company_size, x = Count, fill=target))+
  geom_bar(stat="identity",position="dodge")+
  labs(title = 'Number of targets by company size', x= 'Number of target', y='Company size')

Since the ‘unknown’ category represents missing values, it is not considered here. It can be seen from the bar chart that most of the data scientists work in companies with 50-99 employees.

# company_type & target
dat9 <- data.frame(table(data$company_type,data$target))
names(dat9) <- c("Company_type","target","Count")
dat9
##           Company_type target Count
## 1  Early Stage Startup      0   461
## 2       Funded Startup      0   861
## 3                  NGO      0   424
## 4                Other      0    92
## 5        Public Sector      0   745
## 6              Pvt Ltd      0  8042
## 7              unknown      0  3756
## 8  Early Stage Startup      1   142
## 9       Funded Startup      1   140
## 10                 NGO      1    97
## 11               Other      1    29
## 12       Public Sector      1   210
## 13             Pvt Ltd      1  1775
## 14             unknown      1  2384
ggplot(data = dat9,aes(y =Company_type,
                       x = Count, fill=target))+
  geom_bar(stat="identity",position="dodge")+
  labs(title = 'Number of targets by company type', x= 'Number of target', y='Company type')

We can observe that most of the data scientists work for companies of type Pvt Ltd. The funded startup companies are the most able to retain their data scientist workforce.

# last_new_job & target
dat10 <- data.frame(table(data$last_new_job,data$target))
names(dat10) <- c("Last_new_job","target","Count")
dat10
##    Last_new_job target Count
## 1            >4      0  2690
## 2             0      0  1713
## 3             1      0  5915
## 4             2      0  2200
## 5             3      0   793
## 6             4      0   801
## 7       unknown      0   269
## 8            >4      1   600
## 9             0      1   739
## 10            1      1  2125
## 11            2      1   700
## 12            3      1   231
## 13            4      1   228
## 14      unknown      1   154
ggplot(data = dat10,aes(y = Last_new_job, x = Count, fill=target))+
  geom_bar(stat="identity",position="dodge")+
  labs(title = 'Number of targets by last_new_job', x= 'Number of target', y='last_new_job')

We can observe that the largest proportion of employees left their last job about one year before the current one, followed by those with gaps of more than four years and of two years. If we analyse the data by probability instead, the data scientists who have not changed their job for more than the past four years are the most likely to continue working for their company, whereas those who changed jobs more recently are more likely to leave the company they are working for now.
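
The probability reading above can be made explicit by adding a stay-probability column to dat10, following the same pattern used earlier for dat0 and dat7; a small sketch:

# Sketch: probability of staying for each last_new_job category, mirroring the
# prob_of_stay columns computed above for dat0 and dat7.
dat10_prob <- spread(dat10, target, Count)
dat10_prob['prob_of_stay'] <- round(dat10_prob['0'] / (dat10_prob['0'] + dat10_prob['1']), 4)
dat10_prob[order(dat10_prob[,'prob_of_stay'], decreasing = TRUE),]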

# training_hours & target
dat11 <- data.frame(table(data$training_hours,data$target))
names(dat11) <- c("training_hours","target","Count")
dat11<-spread(dat11,target,Count)
dat11['total']<-dat11['0']+dat11['1']
dat11['prob_of_stay']<-dat11['0']/dat11['total']
dat11<-dat11[order(dat11[,"total"],decreasing=TRUE),]
dat11
##     training_hours   0  1 total prob_of_stay
## 28              28 250 79   329    0.7598784
## 12              12 229 63   292    0.7842466
## 18              18 215 76   291    0.7388316
## 22              22 220 62   282    0.7801418
## 50              50 195 84   279    0.6989247
## 20              20 210 68   278    0.7553957
## 17              17 202 71   273    0.7399267
## 24              24 200 73   273    0.7326007
## 6                6 193 68   261    0.7394636
## 34              34 194 67   261    0.7432950
## 23              23 192 66   258    0.7441860
## 21              21 186 70   256    0.7265625
## 26              26 182 72   254    0.7165354
## 56              56 175 75   250    0.7000000
## 42              42 186 56   242    0.7685950
## 10              10 175 66   241    0.7261411
## 11              11 187 50   237    0.7890295
## 48              48 174 63   237    0.7341772
## 9                9 162 72   234    0.6923077
## 14              14 177 54   231    0.7662338
## 15              15 174 56   230    0.7565217
## 8                8 176 51   227    0.7753304
## 4                4 186 38   224    0.8303571
## 46              46 163 60   223    0.7309417
## 13              13 148 65   213    0.6948357
## 36              36 156 55   211    0.7393365
## 7                7 148 61   209    0.7081340
## 32              32 144 63   207    0.6956522
## 44              44 151 54   205    0.7365854
## 25              25 143 56   199    0.7185930
## 43              43 141 58   199    0.7085427
## 52              52 155 41   196    0.7908163
## 16              16 145 47   192    0.7552083
## 40              40 137 55   192    0.7135417
## 30              30 139 48   187    0.7433155
## 31              31 144 40   184    0.7826087
## 29              29 132 47   179    0.7374302
## 39              39 131 47   178    0.7359551
## 51              51 122 54   176    0.6931818
## 45              45 129 46   175    0.7371429
## 55              55 131 40   171    0.7660819
## 78              78 107 58   165    0.6484848
## 19              19 119 44   163    0.7300613
## 37              37 125 38   163    0.7668712
## 35              35 124 38   162    0.7654321
## 54              54 127 34   161    0.7888199
## 47              47 117 40   157    0.7452229
## 72              72 120 33   153    0.7843137
## 33              33 114 36   150    0.7600000
## 41              41 117 28   145    0.8068966
## 80              80 110 34   144    0.7638889
## 57              57 102 40   142    0.7183099
## 101            102  95 42   137    0.6934307
## 53              53  99 37   136    0.7279412
## 64              64 102 30   132    0.7727273
## 70              70 108 24   132    0.8181818
## 58              58  98 33   131    0.7480916
## 74              74  99 32   131    0.7557252
## 62              62  91 37   128    0.7109375
## 93              94  97 27   124    0.7822581
## 95              96  92 31   123    0.7479675
## 3                3  87 32   119    0.7310924
## 90              90  96 22   118    0.8135593
## 27              27  87 29   116    0.7500000
## 38              38  88 27   115    0.7652174
## 68              68  77 36   113    0.6814159
## 99             100  87 26   113    0.7699115
## 84              84  89 22   111    0.8018018
## 5                5  81 26   107    0.7570093
## 66              66  85 22   107    0.7943925
## 61              61  73 25    98    0.7448980
## 82              82  78 20    98    0.7959184
## 60              60  75 22    97    0.7731959
## 92              92  76 21    97    0.7835052
## 2                2  71 25    96    0.7395833
## 111            112  66 30    96    0.6875000
## 107            108  76 19    95    0.8000000
## 86              86  76 18    94    0.8085106
## 105            106  76 18    94    0.8085106
## 88              88  66 26    92    0.7173913
## 67              67  70 20    90    0.7777778
## 77              77  65 23    88    0.7386364
## 83              83  68 18    86    0.7906977
## 76              76  59 24    83    0.7108434
## 65              65  62 17    79    0.7848101
## 97              98  63 16    79    0.7974684
## 69              69  61 17    78    0.7820513
## 109            110  57 20    77    0.7402597
## 63              63  54 21    75    0.7200000
## 162            166  56 15    71    0.7887324
## 59              59  57 12    69    0.8260870
## 91              91  49 19    68    0.7205882
## 113            114  48 19    67    0.7164179
## 103            104  48 17    65    0.7384615
## 104            105  56  9    65    0.8615385
## 89              89  47 17    64    0.7343750
## 114            116  41 23    64    0.6406250
## 106            107  43 20    63    0.6825397
## 73              73  48 14    62    0.7741935
## 81              81  46 16    62    0.7419355
## 79              79  48 13    61    0.7868852
## 85              85  52  9    61    0.8524590
## 110            111  39 21    60    0.6500000
## 108            109  42 17    59    0.7118644
## 153            156  38 20    58    0.6551724
## 75              75  40 17    57    0.7017544
## 87              87  35 20    55    0.6363636
## 132            134  39 15    54    0.7222222
## 156            160  44 10    54    0.8148148
## 128            130  42  9    51    0.8235294
## 143            146  38 12    50    0.7600000
## 49              49  36 12    48    0.7500000
## 120            122  37 11    48    0.7708333
## 137            140  34 14    48    0.7083333
## 98              99  37 10    47    0.7872340
## 149            152  35 11    46    0.7608696
## 112            113  36  9    45    0.8000000
## 141            144  32 12    44    0.7272727
## 96              97  33 10    43    0.7674419
## 122            124  32 11    43    0.7441860
## 147            150  31 12    43    0.7209302
## 155            158  36  6    42    0.8571429
## 151            154  35  6    41    0.8536585
## 158            162  32  9    41    0.7804878
## 134            136  28 12    40    0.7000000
## 135            138  30 10    40    0.7500000
## 171            182  31  9    40    0.7750000
## 164            168  26 13    39    0.6666667
## 175            192  26 13    39    0.6666667
## 100            101  28 10    38    0.7368421
## 116            118  26 12    38    0.6842105
## 94              95  25 12    37    0.6756757
## 126            128  29  8    37    0.7837838
## 102            103  26  9    35    0.7428571
## 124            126  29  5    34    0.8529412
## 145            148  26  8    34    0.7647059
## 189            222  27  6    33    0.8181818
## 169            178  25  7    32    0.7812500
## 181            204  27  5    32    0.8437500
## 142            145  23  8    31    0.7419355
## 185            214  24  7    31    0.7741935
## 130            132  21  9    30    0.7000000
## 159            163  24  5    29    0.8275862
## 170            180  23  6    29    0.7931034
## 177            196  21  8    29    0.7241379
## 183            210  24  5    29    0.8275862
## 167            174  19  9    28    0.6785714
## 182            206  24  4    28    0.8571429
## 129            131  23  4    27    0.8518519
## 154            157  20  7    27    0.7407407
## 173            188  19  8    27    0.7037037
## 178            198  18  9    27    0.6666667
## 133            135  21  5    26    0.8076923
## 138            141  19  7    26    0.7307692
## 146            149  23  3    26    0.8846154
## 152            155  19  6    25    0.7600000
## 172            184  18  7    25    0.7200000
## 123            125  18  6    24    0.7500000
## 165            170  21  3    24    0.8750000
## 187            218  20  4    24    0.8333333
## 131            133  19  4    23    0.8260870
## 136            139  16  7    23    0.6956522
## 179            200  18  5    23    0.7826087
## 140            143  20  2    22    0.9090909
## 115            117  14  7    21    0.6666667
## 125            127  17  4    21    0.8095238
## 148            151  13  8    21    0.6190476
## 180            202  15  6    21    0.7142857
## 71              71  14  6    20    0.7000000
## 121            123  18  2    20    0.9000000
## 127            129  15  5    20    0.7500000
## 139            142  14  6    20    0.7000000
## 190            224  17  3    20    0.8500000
## 191            226  16  4    20    0.8000000
## 160            164  15  4    19    0.7894737
## 166            172  14  5    19    0.7368421
## 168            176  16  3    19    0.8421053
## 176            194  16  3    19    0.8421053
## 117            119  13  5    18    0.7222222
## 157            161  15  3    18    0.8333333
## 174            190  15  3    18    0.8333333
## 118            120  14  3    17    0.8235294
## 163            167  13  4    17    0.7647059
## 186            216  12  5    17    0.7058824
## 188            220  12  5    17    0.7058824
## 119            121  13  3    16    0.8125000
## 161            165  12  4    16    0.7500000
## 184            212  12  4    16    0.7500000
## 150            153  10  5    15    0.6666667
## 204            256  14  1    15    0.9333333
## 208            264  13  2    15    0.8666667
## 231            314  12  3    15    0.8000000
## 236            326  13  2    15    0.8666667
## 144            147  11  3    14    0.7857143
## 193            232  12  2    14    0.8571429
## 200            246  12  2    14    0.8571429
## 214            278  14  0    14    1.0000000
## 228            308  10  4    14    0.7142857
## 202            250  11  2    13    0.8461538
## 205            258   7  6    13    0.5384615
## 223            298   7  6    13    0.5384615
## 226            304  10  3    13    0.7692308
## 234            322  11  2    13    0.8461538
## 198            242  12  0    12    1.0000000
## 207            262  11  1    12    0.9166667
## 219            288  10  2    12    0.8333333
## 224            300  11  1    12    0.9166667
## 227            306  10  2    12    0.8333333
## 230            312  11  1    12    0.9166667
## 232            316   9  3    12    0.7500000
## 239            332   8  4    12    0.6666667
## 201            248   8  3    11    0.7272727
## 210            268   6  5    11    0.5454545
## 221            292   8  3    11    0.7272727
## 237            328   9  2    11    0.8181818
## 238            330  10  1    11    0.9090909
## 240            334   9  2    11    0.8181818
## 241            336   8  3    11    0.7272727
## 233            320   9  1    10    0.9000000
## 1                1   7  2     9    0.7777778
## 203            254   8  1     9    0.8888889
## 206            260   8  1     9    0.8888889
## 217            284   6  3     9    0.6666667
## 220            290   5  4     9    0.5555556
## 225            302   6  3     9    0.6666667
## 235            324   7  2     9    0.7777778
## 199            244   6  2     8    0.7500000
## 216            282   6  2     8    0.7500000
## 229            310   8  0     8    1.0000000
## 192            228   3  4     7    0.4285714
## 195            236   7  0     7    1.0000000
## 197            240   6  1     7    0.8571429
## 211            270   4  3     7    0.5714286
## 215            280   4  3     7    0.5714286
## 209            266   5  1     6    0.8333333
## 213            276   6  0     6    1.0000000
## 222            294   6  0     6    1.0000000
## 194            234   5  0     5    1.0000000
## 212            272   4  1     5    0.8000000
## 218            286   2  3     5    0.4000000
## 196            238   4  0     4    1.0000000
plot(data$training_hours,data$target)

From the scatter plot, it can be seen that most of the data scientists completed about 150 hours of training or fewer.

dat11<-dat11[order(dat11[,"prob_of_stay"],decreasing=TRUE),]
dat11
##     training_hours   0  1 total prob_of_stay
## 214            278  14  0    14    1.0000000
## 198            242  12  0    12    1.0000000
## 229            310   8  0     8    1.0000000
## 195            236   7  0     7    1.0000000
## 213            276   6  0     6    1.0000000
## 222            294   6  0     6    1.0000000
## 194            234   5  0     5    1.0000000
## 196            238   4  0     4    1.0000000
## 204            256  14  1    15    0.9333333
## 207            262  11  1    12    0.9166667
## 224            300  11  1    12    0.9166667
## 230            312  11  1    12    0.9166667
## 140            143  20  2    22    0.9090909
## 238            330  10  1    11    0.9090909
## 121            123  18  2    20    0.9000000
## 233            320   9  1    10    0.9000000
## 203            254   8  1     9    0.8888889
## 206            260   8  1     9    0.8888889
## 146            149  23  3    26    0.8846154
## 165            170  21  3    24    0.8750000
## 208            264  13  2    15    0.8666667
## 236            326  13  2    15    0.8666667
## 104            105  56  9    65    0.8615385
## 155            158  36  6    42    0.8571429
## 182            206  24  4    28    0.8571429
## 193            232  12  2    14    0.8571429
## 200            246  12  2    14    0.8571429
## 197            240   6  1     7    0.8571429
## 151            154  35  6    41    0.8536585
## 124            126  29  5    34    0.8529412
## 85              85  52  9    61    0.8524590
## 129            131  23  4    27    0.8518519
## 190            224  17  3    20    0.8500000
## 202            250  11  2    13    0.8461538
## 234            322  11  2    13    0.8461538
## 181            204  27  5    32    0.8437500
## 168            176  16  3    19    0.8421053
## 176            194  16  3    19    0.8421053
## 187            218  20  4    24    0.8333333
## 157            161  15  3    18    0.8333333
## 174            190  15  3    18    0.8333333
## 219            288  10  2    12    0.8333333
## 227            306  10  2    12    0.8333333
## 209            266   5  1     6    0.8333333
## 4                4 186 38   224    0.8303571
## 159            163  24  5    29    0.8275862
## 183            210  24  5    29    0.8275862
## 59              59  57 12    69    0.8260870
## 131            133  19  4    23    0.8260870
## 128            130  42  9    51    0.8235294
## 118            120  14  3    17    0.8235294
## 70              70 108 24   132    0.8181818
## 189            222  27  6    33    0.8181818
## 237            328   9  2    11    0.8181818
## 240            334   9  2    11    0.8181818
## 156            160  44 10    54    0.8148148
## 90              90  96 22   118    0.8135593
## 119            121  13  3    16    0.8125000
## 125            127  17  4    21    0.8095238
## 86              86  76 18    94    0.8085106
## 105            106  76 18    94    0.8085106
## 133            135  21  5    26    0.8076923
## 41              41 117 28   145    0.8068966
## 84              84  89 22   111    0.8018018
## 107            108  76 19    95    0.8000000
## 112            113  36  9    45    0.8000000
## 191            226  16  4    20    0.8000000
## 231            314  12  3    15    0.8000000
## 212            272   4  1     5    0.8000000
## 97              98  63 16    79    0.7974684
## 82              82  78 20    98    0.7959184
## 66              66  85 22   107    0.7943925
## 170            180  23  6    29    0.7931034
## 52              52 155 41   196    0.7908163
## 83              83  68 18    86    0.7906977
## 160            164  15  4    19    0.7894737
## 11              11 187 50   237    0.7890295
## 54              54 127 34   161    0.7888199
## 162            166  56 15    71    0.7887324
## 98              99  37 10    47    0.7872340
## 79              79  48 13    61    0.7868852
## 144            147  11  3    14    0.7857143
## 65              65  62 17    79    0.7848101
## 72              72 120 33   153    0.7843137
## 12              12 229 63   292    0.7842466
## 126            128  29  8    37    0.7837838
## 92              92  76 21    97    0.7835052
## 31              31 144 40   184    0.7826087
## 179            200  18  5    23    0.7826087
## 93              94  97 27   124    0.7822581
## 69              69  61 17    78    0.7820513
## 169            178  25  7    32    0.7812500
## 158            162  32  9    41    0.7804878
## 22              22 220 62   282    0.7801418
## 67              67  70 20    90    0.7777778
## 1                1   7  2     9    0.7777778
## 235            324   7  2     9    0.7777778
## 8                8 176 51   227    0.7753304
## 171            182  31  9    40    0.7750000
## 73              73  48 14    62    0.7741935
## 185            214  24  7    31    0.7741935
## 60              60  75 22    97    0.7731959
## 64              64 102 30   132    0.7727273
## 120            122  37 11    48    0.7708333
## 99             100  87 26   113    0.7699115
## 226            304  10  3    13    0.7692308
## 42              42 186 56   242    0.7685950
## 96              97  33 10    43    0.7674419
## 37              37 125 38   163    0.7668712
## 14              14 177 54   231    0.7662338
## 55              55 131 40   171    0.7660819
## 35              35 124 38   162    0.7654321
## 38              38  88 27   115    0.7652174
## 145            148  26  8    34    0.7647059
## 163            167  13  4    17    0.7647059
## 80              80 110 34   144    0.7638889
## 149            152  35 11    46    0.7608696
## 33              33 114 36   150    0.7600000
## 143            146  38 12    50    0.7600000
## 152            155  19  6    25    0.7600000
## 28              28 250 79   329    0.7598784
## 5                5  81 26   107    0.7570093
## 15              15 174 56   230    0.7565217
## 74              74  99 32   131    0.7557252
## 20              20 210 68   278    0.7553957
## 16              16 145 47   192    0.7552083
## 27              27  87 29   116    0.7500000
## 49              49  36 12    48    0.7500000
## 135            138  30 10    40    0.7500000
## 123            125  18  6    24    0.7500000
## 127            129  15  5    20    0.7500000
## 161            165  12  4    16    0.7500000
## 184            212  12  4    16    0.7500000
## 232            316   9  3    12    0.7500000
## 199            244   6  2     8    0.7500000
## 216            282   6  2     8    0.7500000
## 58              58  98 33   131    0.7480916
## 95              96  92 31   123    0.7479675
## 47              47 117 40   157    0.7452229
## 61              61  73 25    98    0.7448980
## 23              23 192 66   258    0.7441860
## 122            124  32 11    43    0.7441860
## 30              30 139 48   187    0.7433155
## 34              34 194 67   261    0.7432950
## 102            103  26  9    35    0.7428571
## 81              81  46 16    62    0.7419355
## 142            145  23  8    31    0.7419355
## 154            157  20  7    27    0.7407407
## 109            110  57 20    77    0.7402597
## 17              17 202 71   273    0.7399267
## 2                2  71 25    96    0.7395833
## 6                6 193 68   261    0.7394636
## 36              36 156 55   211    0.7393365
## 18              18 215 76   291    0.7388316
## 77              77  65 23    88    0.7386364
## 103            104  48 17    65    0.7384615
## 29              29 132 47   179    0.7374302
## 45              45 129 46   175    0.7371429
## 100            101  28 10    38    0.7368421
## 166            172  14  5    19    0.7368421
## 44              44 151 54   205    0.7365854
## 39              39 131 47   178    0.7359551
## 89              89  47 17    64    0.7343750
## 48              48 174 63   237    0.7341772
## 24              24 200 73   273    0.7326007
## 3                3  87 32   119    0.7310924
## 46              46 163 60   223    0.7309417
## 138            141  19  7    26    0.7307692
## 19              19 119 44   163    0.7300613
## 53              53  99 37   136    0.7279412
## 141            144  32 12    44    0.7272727
## 201            248   8  3    11    0.7272727
## 221            292   8  3    11    0.7272727
## 241            336   8  3    11    0.7272727
## 21              21 186 70   256    0.7265625
## 10              10 175 66   241    0.7261411
## 177            196  21  8    29    0.7241379
## 132            134  39 15    54    0.7222222
## 117            119  13  5    18    0.7222222
## 147            150  31 12    43    0.7209302
## 91              91  49 19    68    0.7205882
## 63              63  54 21    75    0.7200000
## 172            184  18  7    25    0.7200000
## 25              25 143 56   199    0.7185930
## 57              57 102 40   142    0.7183099
## 88              88  66 26    92    0.7173913
## 26              26 182 72   254    0.7165354
## 113            114  48 19    67    0.7164179
## 180            202  15  6    21    0.7142857
## 228            308  10  4    14    0.7142857
## 40              40 137 55   192    0.7135417
## 108            109  42 17    59    0.7118644
## 62              62  91 37   128    0.7109375
## 76              76  59 24    83    0.7108434
## 43              43 141 58   199    0.7085427
## 137            140  34 14    48    0.7083333
## 7                7 148 61   209    0.7081340
## 186            216  12  5    17    0.7058824
## 188            220  12  5    17    0.7058824
## 173            188  19  8    27    0.7037037
## 75              75  40 17    57    0.7017544
## 56              56 175 75   250    0.7000000
## 134            136  28 12    40    0.7000000
## 130            132  21  9    30    0.7000000
## 71              71  14  6    20    0.7000000
## 139            142  14  6    20    0.7000000
## 50              50 195 84   279    0.6989247
## 32              32 144 63   207    0.6956522
## 136            139  16  7    23    0.6956522
## 13              13 148 65   213    0.6948357
## 101            102  95 42   137    0.6934307
## 51              51 122 54   176    0.6931818
## 9                9 162 72   234    0.6923077
## 111            112  66 30    96    0.6875000
## 116            118  26 12    38    0.6842105
## 106            107  43 20    63    0.6825397
## 68              68  77 36   113    0.6814159
## 167            174  19  9    28    0.6785714
## 94              95  25 12    37    0.6756757
## 164            168  26 13    39    0.6666667
## 175            192  26 13    39    0.6666667
## 178            198  18  9    27    0.6666667
## 115            117  14  7    21    0.6666667
## 150            153  10  5    15    0.6666667
## 239            332   8  4    12    0.6666667
## 217            284   6  3     9    0.6666667
## 225            302   6  3     9    0.6666667
## 153            156  38 20    58    0.6551724
## 110            111  39 21    60    0.6500000
## 78              78 107 58   165    0.6484848
## 114            116  41 23    64    0.6406250
## 87              87  35 20    55    0.6363636
## 148            151  13  8    21    0.6190476
## 211            270   4  3     7    0.5714286
## 215            280   4  3     7    0.5714286
## 220            290   5  4     9    0.5555556
## 210            268   6  5    11    0.5454545
## 205            258   7  6    13    0.5384615
## 223            298   7  6    13    0.5384615
## 192            228   3  4     7    0.4285714
## 218            286   2  3     5    0.4000000

However, sorting by the probability of staying shows that data scientists who took more than 150 hours of training are more likely to choose to stay with the company.

Split train and test dataset

preprocessed_data<-data_preprocess_categorical
preprocessed_data$target<-as.factor(preprocessed_data$target)
set.seed(1)
# sample.split() comes from the caTools package
sample<-sample.split(preprocessed_data$target,SplitRatio=0.7)
train<-subset(preprocessed_data,sample==TRUE)
test<-subset(preprocessed_data,sample==FALSE)

Model

Construct a performance table

performance = function(xtab, desc=""){
    # xtab must be a 2x2 confusion matrix with the positive class (target = 1)
    # in the first row (predictions) and the first column (actual values)
    cat(desc,"\n")
    ACR = sum(diag(xtab))/sum(xtab)                               # overall accuracy
    TPR = xtab[1,1]/sum(xtab[,1]); TNR = xtab[2,2]/sum(xtab[,2])  # sensitivity / specificity
    PPV = xtab[1,1]/sum(xtab[1,]); NPV = xtab[2,2]/sum(xtab[2,])  # precision of each class
    FPR = 1 - TNR                ; FNR = 1 - TPR                  # Type I / Type II error rates
    RandomAccuracy = (sum(xtab[,2])*sum(xtab[2,]) + 
      sum(xtab[,1])*sum(xtab[1,]))/(sum(xtab)^2)
    Kappa = (ACR - RandomAccuracy)/(1 - RandomAccuracy)           # Cohen's kappa (computed for reference, not printed)
    print(xtab)
    cat("\nAccuracy (ACR)                  :", ACR, "\n")
    cat("Sensitivity(TPR)                :", TPR, "\n")
    cat("Specificity (TNR)               :", TNR, "\n")
    cat("Positive Predictive Value (PPV) :", PPV, "\n")
    cat("Negative Predictive Value (NPV) :", NPV, "\n")
    cat("False Positive Rate (FPR)       :", FPR, "\n")
    cat("False Negative Rate(FNR)        :", FNR, "\n")
}

prob_gen<-function(x){
  # x is the same reordered confusion matrix: row 1 = predicted leave (target 1), row 2 = predicted stay (target 0)
  print(paste("Predicted probability of candidate will stay at the company",round(sum(x[2,])/sum(x),4)))
  print(paste("Predicted probability of candidate will leave the company",round(sum(x[1,])/sum(x),4)))
}
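Both helpers expect the confusion matrix to have the positive class (target = 1) in the first row and column, which is why every call below reorders the table() output with [2:1, 2:1]. A toy example with made-up labels (purely illustrative, not part of the analysis) shows the intended usage:

# Toy example only: illustrate the expected [positive, negative] ordering
toy_pred   <- factor(c(1, 1, 0, 0, 0, 1, 0, 0), levels = c(0, 1))
toy_actual <- factor(c(1, 0, 0, 0, 1, 1, 0, 0), levels = c(0, 1))
toy_tab <- table(toy_pred, toy_actual)[2:1, 2:1]   # put class "1" first, as performance() expects
performance(toy_tab, "Toy example")
prob_gen(toy_tab)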

LDA model

set.seed(123)
LDA_model = lda(target ~., data = train)
## Warning in lda.default(x, grouping, ...): variables are collinear
summary(LDA_model)
##         Length Class  Mode     
## prior     2    -none- numeric  
## counts    2    -none- numeric  
## means   386    -none- numeric  
## scaling 193    -none- numeric  
## lev       2    -none- character
## svd       1    -none- numeric  
## N         1    -none- numeric  
## call      3    -none- call     
## terms     3    terms  call     
## xlevels   0    -none- list
head(coef(LDA_model),10)
##                                  LD1
## city_development_index -2.8176411446
## training_hours         -0.0008037416
## city_1                 -0.9476649769
## city_2                 -1.2760041362
## city_7                 -0.9714193180
## city_8                 -2.8664365483
## city_9                 -0.0911228721
## city_10                -0.2367186872
## city_11                 0.7851554560
## city_12                -0.7449616264
plot(LDA_model)

LDA_pred= LDA_model %>% predict(test)
mean(LDA_pred$class == test$target)
## [1] 0.7845833
LDA = test$target
mean(LDA_pred$class == LDA)
## [1] 0.7845833
cfmat_LDA= table(LDA_pred$class,LDA)
cfmat_LDA = cfmat_LDA[2:1,2:1]
performance(cfmat_LDA, "LDA")
## LDA 
##    LDA
##        1    0
##   1  634  439
##   0  799 3875
## 
## Accuracy (ACR)                  : 0.7845833 
## Sensitivity(TPR)                : 0.4424285 
## Specificity (TNR)               : 0.8982383 
## Positive Predictive Value (PPV) : 0.5908667 
## Negative Predictive Value (NPV) : 0.8290543 
## False Positive Rate (FPR)       : 0.1017617 
## False Negative Rate(FNR)        : 0.5575715

Linear Discriminant Analysis (LDA) is a "parametric", generative model. LDA assumes the predictors are numeric and drawn from a multivariate Gaussian (normal) distribution, which is one of its limitations since it can only take numerical predictors; in our assignment we therefore feed it the fully encoded set of attributes. From the confusion matrix generated above, we observe that the LDA model does not perform particularly well for the data scientist job-change analysis. Taking 1 as "looking for a job change" and 0 as the opposite, the accuracy of the prediction is 78.46%, which is neither high nor low. The sensitivity is quite low (44.24%) while the PPV is higher at 59.09%: only 44.24% of the actual positive cases were detected by the model, whereas 59.09% of the cases predicted as positive were truly positive. On the other hand, the FPR (10.18%) is impressively low, meaning there is a low probability of committing a Type I error.
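To make this interpretation concrete, the quoted rates can be recovered by hand from the counts in the confusion matrix above (a small sanity check, not new analysis):

# Sanity check: recompute the LDA metrics from the printed confusion matrix counts
TP <- 634; FN <- 799; FP <- 439; TN <- 3875
TP / (TP + FN)   # sensitivity ~ 0.4424: share of actual job-changers the model detects
TP / (TP + FP)   # PPV ~ 0.5909: share of predicted job-changers that really are
FP / (FP + TN)   # FPR ~ 0.1018: probability of a Type I error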

prob_gen(cfmat_LDA)
## [1] "Predicted probability of candidate will stay at the company 0.8133"
## [1] "Predicted probability of candidate will leave the company 0.1867"

We observe that the predicted probability of a candidate staying with the company is higher than that of leaving.

Logistic Regression

set.seed(123)
log_model<-glm(target~.,data=train,family=binomial)
head(summary(log_model)$coef)
##                             Estimate   Std. Error     z value    Pr(>|z|)
## (Intercept)              5.733172973 4.775930e+00  1.20043062 0.229972143
## city_development_index  -9.037900075 7.635061e+00 -1.18373639 0.236517430
## training_hours          -0.001075671 3.947257e-04 -2.72511003 0.006428006
## city_1                  -0.322369868 2.165329e+00 -0.14887802 0.881649882
## city_2                 -12.214310032 6.304742e+02 -0.01937321 0.984543380
## city_7                  -0.913119562 9.205612e-01 -0.99191616 0.321238424

These are the first few coefficient estimates from the summary of the logistic regression model.

logreg_prob<-log_model %>% predict(test,type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
head(logreg_prob)
##          2          5          8         10         11         12 
## 0.09507453 0.11999194 0.17029563 0.12335782 0.64942558 0.11304205
hist(logreg_prob)

The values above show the predicted probability that each data scientist in the test set is looking for a job change, and the histogram summarises how these probabilities are distributed.

contrasts(test$target)
##   1
## 0 0
## 1 1
logreg_pred<-ifelse(logreg_prob>0.5,1,0)
cfmat_LR1 = table(logreg_pred,test$target)
cfmat_LR1 = cfmat_LR1[2:1,2:1]
performance(cfmat_LR1)
##  
##            
## logreg_pred    1    0
##           1  562  371
##           0  871 3943
## 
## Accuracy (ACR)                  : 0.7838872 
## Sensitivity(TPR)                : 0.3921842 
## Specificity (TNR)               : 0.9140009 
## Positive Predictive Value (PPV) : 0.602358 
## Negative Predictive Value (NPV) : 0.8190694 
## False Positive Rate (FPR)       : 0.08599907 
## False Negative Rate(FNR)        : 0.6078158

From the above result, we observe an accuracy of 78.39% with a cut-off point of 0.5. The logistic regression model shows that the positive class has a lower recall (sensitivity) of 39.22% and lower precision (positive predictive value) of 60.24%, while the negative class has a higher recall (specificity) of 91.40% and higher precision (negative predictive value) of 81.91%.

prob_gen(cfmat_LR1)
## [1] "Predicted probability of candidate will stay at the company 0.8377"
## [1] "Predicted probability of candidate will leave the company 0.1623"

Here, the predicted probability of staying with the company is again higher than that of leaving.

Next, we would like to find the optimal cut-off point for this model.

optcutoff<-optimalCutoff(test$target, logreg_prob)[1]  # optimalCutoff() comes from the InformationValue package

The optimal cut-off point is 0.3883685.
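As a rough cross-check of this value (a sketch, not part of the original pipeline), the cut-off can also be found with a simple grid search that keeps the value giving the highest test accuracy:

# Sketch only: grid search over cut-off values, keeping the most accurate one
cutoffs <- seq(0.05, 0.95, by = 0.01)
acc <- sapply(cutoffs, function(cut) mean(ifelse(logreg_prob > cut, 1, 0) == test$target))
cutoffs[which.max(acc)]   # compare with the reported optimum of about 0.39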

logreg_pred1<-ifelse(logreg_prob>optcutoff,1,0)
cfmat_LR2=table(logreg_pred1,test$target)
cfmat_LR2=cfmat_LR2[2:1,2:1]
performance(cfmat_LR2)
##  
##             
## logreg_pred1    1    0
##            1  842  620
##            0  591 3694
## 
## Accuracy (ACR)                  : 0.7892814 
## Sensitivity(TPR)                : 0.5875785 
## Specificity (TNR)               : 0.8562819 
## Positive Predictive Value (PPV) : 0.5759234 
## Negative Predictive Value (NPV) : 0.862077 
## False Positive Rate (FPR)       : 0.1437181 
## False Negative Rate(FNR)        : 0.4124215

Once the optimal cut-off point (0.3883685) is obtained, we use it to re-classify the predicted probabilities (the fitted model itself is unchanged). This gives a slightly higher accuracy of 78.93%, compared with 78.39% for the earlier classification at a cut-off point of 0.5.

prob_gen(cfmat_LR2)
## [1] "Predicted probability of candidate will stay at the company 0.7456"
## [1] "Predicted probability of candidate will leave the company 0.2544"

With the optimal cut-off point, the predicted probability of staying with the company is still higher than that of leaving.

Naive Bayes

set.seed(123)
naive_model<-naiveBayes(target~.,data=train)
naive_pred<-predict(naive_model,test)
cfmat = table(naive_pred,test$target)
cfmat = cfmat[2:1,2:1]
performance(cfmat)
##  
##           
## naive_pred    1    0
##          1 1287 3276
##          0  146 1038
## 
## Accuracy (ACR)                  : 0.4045589 
## Sensitivity(TPR)                : 0.8981158 
## Specificity (TNR)               : 0.240612 
## Positive Predictive Value (PPV) : 0.2820513 
## Negative Predictive Value (NPV) : 0.8766892 
## False Positive Rate (FPR)       : 0.759388 
## False Negative Rate(FNR)        : 0.1018842

In this section, we used the Naïve Bayes classifier (without Laplace smoothing) to build a model from the training data. We then carried out the prediction on the testing data and computed the confusion matrix to evaluate accuracy. The accuracy without Laplace smoothing is 40.46%, which is extremely low: the model correctly predicts the job-change status (looking for a job change or not) only 40.46% of the time. For the positive class, Naïve Bayes has a high recall/sensitivity of 89.81% but a precision (PPV) of only 28.21%. In other words, the model detects 89.81% of the people who are actually looking for a job change, but of those it flags, only 28.21% truly are. For the negative class, Naïve Bayes has a very low recall (specificity) of 24.06% but a high precision (NPV) of 87.67%: it captures few of the people who are not looking for a job change, but the ones it does label as negative are usually correct.

prob_gen(cfmat)
## [1] "Predicted probability of candidate will stay at the company 0.206"
## [1] "Predicted probability of candidate will leave the company 0.794"

Here, in contrast, the predicted probability of staying with the company is lower than that of leaving.

We now build naive bayes model with laplace smoothing.

set.seed(123)
naive_model_laplace<-naiveBayes(target~.,data=train,laplace=1)
naive_pred_laplace<-predict(naive_model_laplace,test,type="class")
cfmat_laplace=table(naive_pred_laplace,test$target)
cfmat_laplace = cfmat_laplace[2:1,2:1]
performance(cfmat_laplace)
##  
##                   
## naive_pred_laplace    1    0
##                  1 1287 3276
##                  0  146 1038
## 
## Accuracy (ACR)                  : 0.4045589 
## Sensitivity(TPR)                : 0.8981158 
## Specificity (TNR)               : 0.240612 
## Positive Predictive Value (PPV) : 0.2820513 
## Negative Predictive Value (NPV) : 0.8766892 
## False Positive Rate (FPR)       : 0.759388 
## False Negative Rate(FNR)        : 0.1018842

We also constructed a Naïve Bayes model with Laplace smoothing. Because the Naïve Bayes algorithm multiplies conditional probabilities, a zero count in the numerator produces a zero probability and can cause the classifier to fail; Laplace smoothing is introduced to avoid this and can improve the prediction accuracy. After training the Naïve Bayes model with Laplace smoothing, the results above indicate that smoothing has essentially no effect on our model: the confusion matrix and accuracy are identical to the unsmoothed version. In other words, the model likely has no zero conditional probabilities affecting the earlier (unsmoothed) prediction, which is why the accuracy is unchanged.
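To illustrate what Laplace smoothing would change if a zero count did occur, here is a toy example with made-up counts (not taken from our data):

# Toy example only: add-one (Laplace) smoothing of conditional probabilities
counts <- c(level_A = 12, level_B = 8, level_C = 0)            # hypothetical counts within one class
counts / sum(counts)                                           # unsmoothed: P(level_C | class) = 0
(counts + 1) / (sum(counts) + length(counts))                  # smoothed: every level gets a non-zero probability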

prob_gen(cfmat_laplace)
## [1] "Predicted probability of candidate will stay at the company 0.206"
## [1] "Predicted probability of candidate will leave the company 0.794"

Again, the predicted probability of staying with the company is lower than that of leaving.

KNN

We first try k = 115 and k = 116, the floor and ceiling of the square root of the number of training observations.

set.seed(123)
knn115_model<-knn(train, test, cl=train$target, k=floor(sqrt(nrow(train))))
knn116_model<-knn(train, test,cl=train$target,  k=ceiling(sqrt(nrow(train))))
cfmat_115=table(knn115_model,test$target)
cfmat_115 = cfmat_115[2:1,2:1]
performance(cfmat_115)
##  
##             
## knn115_model    1    0
##            1  200    0
##            0 1233 4314
## 
## Accuracy (ACR)                  : 0.7854533 
## Sensitivity(TPR)                : 0.1395673 
## Specificity (TNR)               : 1 
## Positive Predictive Value (PPV) : 1 
## Negative Predictive Value (NPV) : 0.7777177 
## False Positive Rate (FPR)       : 0 
## False Negative Rate(FNR)        : 0.8604327
prob_gen(cfmat_115)
## [1] "Predicted probability of candidate will stay at the company 0.9652"
## [1] "Predicted probability of candidate will leave the company 0.0348"
cfmat_116=table(knn116_model,test$target)
cfmat_116 = cfmat_116[2:1,2:1]
performance(cfmat_116)
##  
##             
## knn116_model    1    0
##            1  188    0
##            0 1245 4314
## 
## Accuracy (ACR)                  : 0.7833652 
## Sensitivity(TPR)                : 0.1311933 
## Specificity (TNR)               : 1 
## Positive Predictive Value (PPV) : 1 
## Negative Predictive Value (NPV) : 0.7760389 
## False Positive Rate (FPR)       : 0 
## False Negative Rate(FNR)        : 0.8688067
prob_gen(cfmat_116)
## [1] "Predicted probability of candidate will stay at the company 0.9673"
## [1] "Predicted probability of candidate will leave the company 0.0327"

From the above results, the KNN model with k = 115 has a slightly higher accuracy (78.55%) than k = 116 (78.34%). For both values of k, specificity and PPV reach 100%: every candidate the model predicts as looking for a job change really is, and no actual negatives are misclassified. However, the sensitivity is very low (around 13-14%) and the NPV is roughly 78%, so the two models behave almost identically. In both cases the predicted probability of leaving the company is much lower than the predicted probability of staying.

Now, we would like to find the optimal k to achieve a better accuracy.

k.optm=1
k_optm=c()
# Loop over k = 1..300, storing the test-set accuracy (%) for each k
for (i in 1:300){
  knn.mod <- knn(train, test,cl=train$target, k=i)
  k.optm [i]<- 100 * sum(test$target == knn.mod)/NROW(test$target)
  k_optm[i]<-k.optm[i]
}
k_optm
##   [1] 83.41744 83.88725 86.28850 86.53210 87.24552 86.93231 87.75013 87.57613
##   [9] 87.68053 87.59353 87.45432 87.55873 87.40212 87.24552 87.22812 87.12372
##  [17] 87.10632 86.89751 86.79311 86.70611 86.61911 86.35810 86.34070 86.44510
##  [25] 86.21890 86.14930 86.30590 85.92309 85.88829 85.66208 85.50548 85.40108
##  [33] 85.26188 85.36628 85.08787 85.01827 85.01827 84.94867 84.87907 84.49626
##  [41] 84.46146 84.44406 84.42666 84.35706 84.28745 84.13085 83.88725 83.74804
##  [49] 83.66104 83.59144 83.45224 83.40003 83.24343 83.19123 83.08683 82.87802
##  [57] 82.77362 82.61702 82.49521 82.51262 82.40821 82.28641 82.14721 82.09501
##  [65] 82.00800 81.86880 81.71220 81.66000 81.55559 81.43379 81.27719 81.15539
##  [73] 80.92918 80.91178 80.98138 80.82478 80.65077 80.65077 80.58117 80.54637
##  [81] 80.42457 80.42457 80.40717 80.32017 80.32017 80.26797 80.25057 80.30277
##  [89] 80.25057 80.21576 80.14616 80.00696 79.97216 79.85036 79.79816 79.74595
##  [97] 79.62415 79.57195 79.46755 79.46755 79.32835 79.22394 79.22394 79.22394
## [105] 79.24134 79.04994 79.04994 78.89334 78.82373 78.84113 78.70193 78.70193
## [113] 78.64973 78.58013 78.56273 78.38872 78.28432 78.21472 78.12772 78.09292
## [121] 78.17992 78.07552 78.04072 78.09292 78.09292 77.97112 77.98852 77.88411
## [129] 77.84931 77.77971 77.72751 77.65791 77.58831 77.62311 77.58831 77.50131
## [137] 77.51871 77.48390 77.46650 77.44910 77.44910 77.39690 77.27510 77.34470
## [145] 77.25770 77.29250 77.24030 77.20550 77.15330 77.04890 77.06630 77.06630
## [153] 77.06630 77.03149 77.01409 76.96189 76.97929 76.96189 76.92709 76.90969
## [161] 76.87489 76.85749 76.80529 76.80529 76.73569 76.75309 76.71829 76.71829
## [169] 76.66609 76.64869 76.64869 76.54428 76.47468 76.40508 76.42248 76.35288
## [177] 76.37028 76.31808 76.28328 76.21368 76.16148 76.12667 76.09187 76.09187
## [185] 76.09187 76.07447 76.05707 76.05707 76.05707 76.02227 75.98747 75.98747
## [193] 75.95267 75.93527 75.88307 75.86567 75.81347 75.77867 75.76127 75.76127
## [201] 75.76127 75.74387 75.70907 75.63946 75.60466 75.60466 75.60466 75.58726
## [209] 75.62206 75.62206 75.60466 75.56986 75.56986 75.55246 75.55246 75.58726
## [217] 75.50026 75.48286 75.50026 75.53506 75.50026 75.48286 75.48286 75.44806
## [225] 75.46546 75.48286 75.46546 75.44806 75.46546 75.46546 75.46546 75.46546
## [233] 75.46546 75.46546 75.41326 75.39586 75.37846 75.37846 75.37846 75.37846
## [241] 75.36106 75.36106 75.32626 75.34366 75.30886 75.32626 75.30886 75.32626
## [249] 75.32626 75.32626 75.29146 75.30886 75.30886 75.27406 75.27406 75.27406
## [257] 75.27406 75.25666 75.25666 75.27406 75.25666 75.25666 75.25666 75.22185
## [265] 75.22185 75.22185 75.20445 75.18705 75.16965 75.16965 75.16965 75.16965
## [273] 75.16965 75.16965 75.16965 75.16965 75.16965 75.15225 75.15225 75.15225
## [281] 75.15225 75.15225 75.15225 75.15225 75.13485 75.13485 75.13485 75.13485
## [289] 75.13485 75.13485 75.13485 75.13485 75.11745 75.13485 75.11745 75.11745
## [297] 75.10005 75.10005 75.10005 75.10005
k_optm_pred<-which(k_optm==max(k_optm))
print(k_optm_pred)
## [1] 7
plot(k.optm,type="b",xlab="k-value",ylab="Accuracy level")

Hence, from the above result we use the optimal k value, k = 7, to build the final KNN model.

knn_optm_model<-knn(train, test, cl=train$target, k=k_optm_pred)
cfmat_optm=table(knn_optm_model,test$target)
cfmat_optm = cfmat_optm[2:1,2:1]
performance(cfmat_optm)
##  
##               
## knn_optm_model    1    0
##              1  855  135
##              0  578 4179
## 
## Accuracy (ACR)                  : 0.8759353 
## Sensitivity(TPR)                : 0.5966504 
## Specificity (TNR)               : 0.9687065 
## Positive Predictive Value (PPV) : 0.8636364 
## Negative Predictive Value (NPV) : 0.8784948 
## False Positive Rate (FPR)       : 0.03129346 
## False Negative Rate(FNR)        : 0.4033496

With k = 7, the KNN model achieves an accuracy of 87.59%, the highest among all the k values we compared. We no longer observe the 100% specificity seen earlier; instead, the sensitivity rises to 59.67% while the specificity remains high at 96.87% after reducing k. With k = 7 the false negative rate (40.33%) is also relatively low compared with the other models. In our case we are most concerned with accuracy and the false negative rate. Accuracy is the proportion of correct results, whether true positives or true negatives, and measures the overall correctness of the prediction. A false negative occurs when the test wrongly indicates that a person is not looking for a job change when in fact they are. Both measures matter because we do not want to misclassify someone who actually intends to change jobs as someone who does not.

prob_gen(cfmat_optm)
## [1] "Predicted probability of candidate will stay at the company 0.8277"
## [1] "Predicted probability of candidate will leave the company 0.1723"

With the optimal k = 7, the predicted probability of leaving the company is lower than that of staying.

Random Forest

set.seed(123)
train_RF<-train
train_RF[,c('target')]<-list(NULL)
set.seed(2)
RF_model<-randomForest(x=train_RF,y=train$target,mtry=6,importance=TRUE)
RF_pred<-predict(RF_model,test,type="class")
cfmat_RF=table(RF_pred,test$target)
cfmat_RF = cfmat_RF[2:1,2:1]
performance(cfmat_RF,"\n*** Random Forest (RF) Strategy (m=6) Performance")
## 
## *** Random Forest (RF) Strategy (m=6) Performance 
##        
## RF_pred    1    0
##       1  699  506
##       0  734 3808
## 
## Accuracy (ACR)                  : 0.7842353 
## Sensitivity(TPR)                : 0.4877879 
## Specificity (TNR)               : 0.8827075 
## Positive Predictive Value (PPV) : 0.580083 
## Negative Predictive Value (NPV) : 0.8383972 
## False Positive Rate (FPR)       : 0.1172925 
## False Negative Rate(FNR)        : 0.5122121
prob_gen(cfmat_RF)
## [1] "Predicted probability of candidate will stay at the company 0.7903"
## [1] "Predicted probability of candidate will leave the company 0.2097"

The predicted probability of leaving the company is lower than that of staying.

print(head(importance(RF_model),10))
##                                  0          1 MeanDecreaseAccuracy
## city_development_index 22.48625289 30.8574514            28.909420
## training_hours         -1.64999709  1.8695377            -0.115869
## city_1                 -4.08862763  1.1247649            -2.575812
## city_2                  0.00000000  0.0000000             0.000000
## city_7                 -0.10053494 -2.3408929            -1.911969
## city_8                  0.00000000  0.0000000             0.000000
## city_9                 -3.20366875  0.4181672            -2.975831
## city_10                -2.14319873  2.8904382            -0.262619
## city_11                14.76277210  9.9920435            17.217732
## city_12                 0.01345356  2.4638771             0.894456
##                        MeanDecreaseGini
## city_development_index     307.14523553
## training_hours              68.54850291
## city_1                       0.37960444
## city_2                       0.01146603
## city_7                       0.87084751
## city_8                       0.10958582
## city_9                       0.88825253
## city_10                      1.38089198
## city_11                     17.30325071
## city_12                      0.69658997

For our selected dataset, we fitted a Random Forest model with mtry = 6. Based on the confusion matrix, the Random Forest achieves an accuracy of 78.42%, comparable with most of the other models. The accuracy, sensitivity, specificity, PPV and NPV range from roughly 48% to 88%. Accuracy and specificity both exceed 78% (78.42% and 88.27% respectively), but the sensitivity is only 48.78%, which is quite low. The positive class has both lower recall (sensitivity, 48.78%) and lower precision (PPV, 58.01%), both below 60%, while the negative class has higher recall (specificity, 88.27%) and higher precision (NPV, 83.84%). Within the positive class, recall is lower than precision, meaning the model does not capture many of the actual job-changers, but the ones it does flag are usually correct. Within the negative class, recall is slightly higher than precision: the model captures most of the people who are not looking for a job change, although a somewhat smaller share of its negative predictions are correct.
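The value mtry = 6 was fixed in advance; a possible refinement, sketched below on the assumption that a smaller ntree is acceptable for a quick comparison (we did not run this as part of the analysis), is to compare the out-of-bag (OOB) error across several mtry values before settling on one:

# Sketch only: compare OOB error across a few mtry values (ntree reduced for speed)
oob_err <- sapply(c(2, 4, 6, 8, 12), function(m) {
  rf <- randomForest(x = train_RF, y = train$target, mtry = m, ntree = 100)
  rf$err.rate[nrow(rf$err.rate), "OOB"]   # OOB error after the final tree
})
oob_err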

C5.0 Decision Tree

set.seed(123)
C50_model = C5.0(x=train_RF, y=train$target)
summary(C50_model)
## 
## Call:
## C5.0.default(x = train_RF, y = train$target)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Jun 20 13:12:54 2022
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 13411 cases (194 attributes) from undefined.data
## 
## Decision tree:
## 
## city_development_index <= 0.624:
## :...experience_15 > 0: 0 (35/13)
## :   experience_15 <= 0:
## :   :...education_level_High School <= 0: 1 (2154/859)
## :       education_level_High School > 0:
## :       :...company_type_Pvt Ltd <= 0: 1 (176/81)
## :           company_type_Pvt Ltd > 0: 0 (26/7)
## city_development_index > 0.624:
## :...company_size_unknown <= 0: 0 (7667/708)
##     company_size_unknown > 0:
##     :...major_discipline_unknown > 0: 0 (953/163)
##         major_discipline_unknown <= 0:
##         :...education_level_Phd > 0: 0 (55/7)
##             education_level_Phd <= 0:
##             :...city_114 > 0: 0 (110/20)
##                 city_114 <= 0:
##                 :...company_type_Pvt Ltd > 0: 0 (124/25)
##                     company_type_Pvt Ltd <= 0:
##                     :...city_50 > 0: 0 (29/5)
##                         city_50 <= 0:
##                         :...city_136 > 0: 0 (86/21)
##                             city_136 <= 0:
##                             :...city_16 > 0:
##                                 :...relevant_experience_no <= 0: 0 (111/13)
##                                 :   relevant_experience_no > 0:
##                                 :   :...education_level_Graduate <= 0: 0 (17/4)
##                                 :       education_level_Graduate > 0: 1 (52/20)
##                                 city_16 <= 0:
##                                 :...city_67 > 0: 0 (57/16)
##                                     city_67 <= 0:
##                                     :...city_104 > 0: 0 (32/9)
##                                         city_104 <= 0:
##                                         :...city_75 > 0: 0 (41/12)
##                                             city_75 <= 0:
##                                             :...experience_16 > 0: 0 (40/13)
##                                                 experience_16 <= 0:
##                                                 :...city_103 > 0: 1 (624/239)
##                                                     city_103 <= 0:
##                                                     :...experience_>20 > 0: [S1]
##                                                         experience_>20 <= 0: [S2]
## 
## SubTree [S1]
## 
## city_160 <= 0: 0 (137/31)
## city_160 > 0: 1 (44/13)
## 
## SubTree [S2]
## 
## training_hours > 157: 0 (57/15)
## training_hours <= 157:
## :...education_level_Graduate > 0: 1 (635/275)
##     education_level_Graduate <= 0:
##     :...experience_4 <= 0: 0 (135/53)
##         experience_4 > 0: 1 (14/3)
## 
## 
## Evaluation on training data (13411 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      25 2625(19.6%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    8577  1490    (a): class 0
##    1135  2209    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% city_development_index
##   82.17% company_size_unknown
##   25.00% major_discipline_unknown
##   18.17% company_type_Pvt Ltd
##   17.90% education_level_Phd
##   17.83% experience_15
##   17.57% education_level_High School
##   17.49% city_114
##   15.74% city_50
##   15.52% city_136
##   14.88% city_16
##   13.54% city_67
##   13.12% city_104
##   12.88% city_75
##   12.57% experience_16
##   12.27% city_103
##    7.62% experience_>20
##    6.36% education_level_Graduate
##    6.27% training_hours
##    1.35% city_160
##    1.34% relevant_experience_no
##    1.11% experience_4
## 
## 
## Time: 0.6 secs
C50_pred=predict(C50_model, test)
cfmat_C50=table(C50_pred,test$target)
cfmat_C50 = cfmat_C50[2:1,2:1]
performance(cfmat_C50)
##  
##         
## C50_pred    1    0
##        1  920  673
##        0  513 3641
## 
## Accuracy (ACR)                  : 0.7936315 
## Sensitivity(TPR)                : 0.6420098 
## Specificity (TNR)               : 0.8439963 
## Positive Predictive Value (PPV) : 0.5775267 
## Negative Predictive Value (NPV) : 0.8765046 
## False Positive Rate (FPR)       : 0.1560037 
## False Negative Rate(FNR)        : 0.3579902

In this section, we fit a C5.0 decision tree to our training dataset and obtain a test accuracy of 79.36%. Within the positive class, recall (sensitivity, 64.20%) is higher than precision (PPV, 57.75%): the tree captures a large share of the actual job-changers, but only 57.75% of the candidates it flags as positive really are.

prob_gen(cfmat_C50)
## [1] "Predicted probability of candidate will stay at the company 0.7228"
## [1] "Predicted probability of candidate will leave the company 0.2772"

The predicted probability of leaving the company is lower than that of staying.
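One possible way to push the C5.0 accuracy further, sketched below and not something we ran in this analysis, is to enable adaptive boosting through the trials argument of the C50 package:

# Sketch only: boosted C5.0 with 10 boosting iterations
set.seed(123)
C50_boost <- C5.0(x = train_RF, y = train$target, trials = 10)
C50_boost_pred <- predict(C50_boost, test)
mean(C50_boost_pred == test$target)   # compare against the 79.36% single-tree accuracy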

Decision Tree

set.seed(123)
DT_model<-rpart(target~.,train)
DT_pred<-predict(DT_model,test,type="class")
cfmat_DT=table(DT_pred,test$target)
cfmat_DT = cfmat_DT[2:1,2:1]
performance(cfmat_DT)
##  
##        
## DT_pred    1    0
##       1  607  429
##       0  826 3885
## 
## Accuracy (ACR)                  : 0.7816252 
## Sensitivity(TPR)                : 0.4235869 
## Specificity (TNR)               : 0.9005563 
## Positive Predictive Value (PPV) : 0.5859073 
## Negative Predictive Value (NPV) : 0.8246657 
## False Positive Rate (FPR)       : 0.09944367 
## False Negative Rate(FNR)        : 0.5764131

In this section, we fit an rpart decision tree to our training dataset and observe a lower accuracy (78.16%) than the C5.0 tree. The main difference between the two is the algorithm: rpart implements CART, which can grow either classification or regression trees, whereas C5.0 builds rule-based classification trees using information-gain splits; we built both so that we could compare their accuracy. In contrast to C5.0, the positive class now has lower sensitivity than PPV, indicating the model captures fewer of the actual positives but the ones it does flag are mostly correct. For the negative class, specificity (90.06%) is higher than the NPV (82.47%).

prob_gen(cfmat_DT)
## [1] "Predicted probability of candidate will stay at the company 0.8197"
## [1] "Predicted probability of candidate will leave the company 0.1803"

The predicted probability of leaving the company is lower than that of staying.
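A possible refinement of the rpart tree, sketched below using rpart's standard printcp/prune helpers (not something we ran as part of the analysis), is to prune at the complexity parameter with the lowest cross-validated error:

# Sketch only: prune the rpart tree at the cp with the lowest cross-validated error
printcp(DT_model)                                                  # CV error (xerror) per cp value
best_cp <- DT_model$cptable[which.min(DT_model$cptable[, "xerror"]), "CP"]
DT_pruned <- prune(DT_model, cp = best_cp)
mean(predict(DT_pruned, test, type = "class") == test$target)      # test accuracy of the pruned tree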

SVM

set.seed(123)
SVM_model = svm(target ~ ., data = train, kernel='radial', scale=F)
summary(SVM_model)
## 
## Call:
## svm(formula = target ~ ., data = train, kernel = "radial", scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  7271
## 
##  ( 3344 3927 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
SVM_pred = predict(SVM_model, newdata = test)
cfmat_svm = table(SVM_pred, test$target)
cfmat_svm = cfmat_svm[2:1,2:1]
performance(cfmat_svm, "SVM Model")
## SVM Model 
##         
## SVM_pred    1    0
##        1  144  105
##        0 1289 4209
## 
## Accuracy (ACR)                  : 0.7574387 
## Sensitivity(TPR)                : 0.1004885 
## Specificity (TNR)               : 0.9756606 
## Positive Predictive Value (PPV) : 0.5783133 
## Negative Predictive Value (NPV) : 0.7655511 
## False Positive Rate (FPR)       : 0.02433936 
## False Negative Rate(FNR)        : 0.8995115

From the confusion matrix, the SVM model has an accuracy of 75.74%, the second lowest among the algorithms we trained. The ACR, TNR and NPV are all at least 75%, but the sensitivity (TPR) is only about 10%. With such low sensitivity, the model rarely identifies the people who are actually looking for a job change: out of 100 candidates who intend to change jobs, only about 10 are predicted correctly. Correspondingly, the false negative rate is very high at 89.95%, so roughly 90 out of 100 actual job-changers would be predicted as not looking for a change. On the other hand, the FPR is very low at about 2.43%.

prob_gen(cfmat_svm)
## [1] "Predicted probability of candidate will stay at the company 0.9567"
## [1] "Predicted probability of candidate will leave the company 0.0433"

The predicted probability of leaving the company is lower than that of staying.
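Since the low sensitivity points to the strong class imbalance (far fewer candidates with target = 1), one possible remedy, sketched below under the assumption that up-weighting the minority class is acceptable, is to pass class weights to e1071's svm; the weight of 3 is an arbitrary illustrative choice, not a tuned value.

# Sketch only: up-weight the minority class to trade some specificity for sensitivity
set.seed(123)
SVM_weighted <- svm(target ~ ., data = train, kernel = "radial", scale = FALSE,
                    class.weights = c("0" = 1, "1" = 3))   # weight 3 is an illustrative guess
SVM_w_pred <- predict(SVM_weighted, newdata = test)
performance(table(SVM_w_pred, test$target)[2:1, 2:1], "Weighted SVM (sketch)")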

Discussion and Conclusion

The table below summarises the performance of all the models built and trained.

(Model performance summary table)

We achieved an accuracy of about 88% using K-Nearest Neighbours with k = 7, the value found by searching for the k that scores the best accuracy. KNN handled our binary response well when trained and tested on our dataset. However, this approach is time-consuming: finding the optimal k by looping over candidate values takes time, and although KNN is easy to implement, its speed declines quickly as the dataset grows. We therefore recommend k-fold cross-validation for further work, as it can help prevent overfitting and may yield higher accuracy, although it too can be time-consuming. Notably, when k = 115 or 116 the KNN model obtained 100% specificity and PPV, meaning it classified every true negative correctly and every positive prediction it made was correct; however, it achieved an accuracy of only about 78% and captured very few of the true positives, so it is not a good model for the positive class or for negative predictive value.
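As a concrete starting point for that recommendation, the sketch below shows how 10-fold cross-validation over a small grid of k values could replace the manual accuracy loop. This assumes the caret package is available; it was not part of our original pipeline.

# Sketch only: 10-fold CV to choose k for KNN via caret (assumed available)
library(caret)
set.seed(123)
knn_cv <- caret::train(target ~ ., data = preprocessed_data,
                       method = "knn",
                       trControl = trainControl(method = "cv", number = 10),
                       tuneGrid = data.frame(k = seq(3, 15, 2)))
knn_cv$bestTune   # cross-validated choice of k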

On top of that, we also achieved an accuracy of about 79% using the C5.0 decision tree model. This model works by splitting the training data on the attribute that provides the maximum information gain; in our tree, 'city_development_index' is the root node used to indicate whether a person will leave their job to work as a Data Scientist. While these differences in accuracy may look small, they matter for reducing the time and cost of false positive errors, for example when a Data Science company devotes 150 hours of training to candidates who never apply for the Data Scientist role.

One interesting point is that we obtained the same result for the Naïve Bayes model with and without Laplace smoothing. Laplace smoothing is a technique for handling zero probabilities in the algorithm, so we can conclude that our dataset does not produce any zero probabilities in the prior or conditional estimates.

Next, regarding the probability that a candidate will leave the company after the training, every model predicts a higher probability of staying than of leaving, except the Naïve Bayes model. For Naïve Bayes, the predicted probability of leaving is higher than that of staying, so if this model were used to predict the job change of data scientists, the most likely predicted outcome would be that a data scientist leaves the company.

We discovered that far more candidates come from developed cities than from anywhere else, and that job retention is highest in developed cities. In comparison, people from less developed cities tend to leave their jobs, presumably expecting to thrive in a developed city with more Data Science work opportunities. Another striking finding is that younger professionals who have just started their careers after graduating are more willing to test their potential elsewhere, and are therefore more prone to leave their jobs as data scientists. This could tempt employers to screen out candidates with the least experience in the field; however, such a rule should be applied with caution, as it risks filtering out potential talent in the first place.

These findings can significantly improve how the Data Science company allocates resources in its training outreach to prospective employees. Human Resources teams at Data Science companies should target their training toward cities with lower city development indices and, as the models above suggest, toward individuals with less professional experience. In return, people with an as-yet-unrealised passion for Data Science as a long-term career can be selected for the free Data Scientist bootcamps, gaining the opportunity to improve their skills and the motivation to take the leap and apply for a Data Scientist position.