This report presents employee promotion prediction using classification algorithms. The dataset used for modeling is real employee data from a company and is publicly available on Kaggle.
It can be downloaded here: https://www.kaggle.com/muhammadimran112233/employees-evaluation-for-promotion
The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
Import necessary libraries
library(dplyr)
library(ggplot2)
library(corrgram)
library(gridExtra)
library(e1071)
library(party)
library(randomForest)
library(caret)
library(ROSE)
Library dplyr: for data manipulation.
Library ggplot2: for graphics and visualization.
Library corrgram: for visualizing correlation coefficients.
Library gridExtra: for arranging multiple plots in one figure.
Library e1071: for the Support Vector Machine (SVM) model.
Library party: for the Decision Tree model.
Library randomForest: for the Random Forest model.
Library caret: for data partitioning and other modeling utilities.
Library ROSE: for handling imbalanced data with over- and under-sampling.
Read the employee promotion dataset from the .csv file into an R dataframe, then inspect its structure.
# read data to dataframe
employee_df <- read.csv("employee_promotion.csv", na.strings = "")
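For reference, the structure mentioned above can be inspected with str(); a minimal check (output omitted here):
# inspect column names and types
str(employee_df)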
The dataset has 54808 observations (rows) and 13 variables (columns). The target variable is is_promoted and the remaining variables are candidate features.
Compute a statistical summary of each variable.
# statistical summary
summary(employee_df)
## employee_id department region education
## Min. : 1 Length:54808 Length:54808 Length:54808
## 1st Qu.:19670 Class :character Class :character Class :character
## Median :39226 Mode :character Mode :character Mode :character
## Mean :39196
## 3rd Qu.:58731
## Max. :78298
##
## gender recruitment_channel no_of_trainings age
## Length:54808 Length:54808 Min. : 1.000 Min. :20.0
## Class :character Class :character 1st Qu.: 1.000 1st Qu.:29.0
## Mode :character Mode :character Median : 1.000 Median :33.0
## Mean : 1.253 Mean :34.8
## 3rd Qu.: 1.000 3rd Qu.:39.0
## Max. :10.000 Max. :60.0
##
## previous_year_rating length_of_service awards_won avg_training_score
## Min. :1.000 Min. : 1.000 Min. :0.00000 Min. :39.00
## 1st Qu.:3.000 1st Qu.: 3.000 1st Qu.:0.00000 1st Qu.:51.00
## Median :3.000 Median : 5.000 Median :0.00000 Median :60.00
## Mean :3.329 Mean : 5.866 Mean :0.02317 Mean :63.71
## 3rd Qu.:4.000 3rd Qu.: 7.000 3rd Qu.:0.00000 3rd Qu.:77.00
## Max. :5.000 Max. :37.000 Max. :1.00000 Max. :99.00
## NA's :4124 NA's :2560
## is_promoted
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.08517
## 3rd Qu.:0.00000
## Max. :1.00000
##
We can see the min, median, mean, and max values of each numeric variable.
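The per-column missing-value counts behind the NA's rows above can also be obtained directly; a small sketch:
# count missing values in each column
colSums(is.na(employee_df))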
Next, create a copy of the dataframe; this copy will be used for Exploratory Data Analysis.
# create a new dataframe
employee_df2 <- employee_df
Now, change the relevant column types to factors for Exploratory Data Analysis.
# change column type is_promoted
employee_df2$is_promoted <- factor(employee_df$is_promoted,
levels = c(0,1),
labels = c("Not Promoted", "Promoted"))
# change column type previous_year_rating
employee_df2$previous_year_rating <- factor(employee_df$previous_year_rating,
levels = c(1,2,3,4,5),
labels = c("Very Bad",
"Bad",
"Enough",
"Good",
"Very Good"))
# change column type awards_won
employee_df2$awards_won <- factor(employee_df$awards_won,
levels = c(0,1),
labels = c("None", "Won"))
# change column type gender
employee_df2$gender <- factor(employee_df$gender,
levels = c("f", "m"),
labels = c("Female", "Male"))
Plot the distribution of department using a bar plot.
# department distribution
ggplot(data = employee_df2, aes(x = department, fill = department)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Department")
Based on the bar plot above, we can see that the Sales & Marketing department is the department with the most employees.
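The same ranking can be read off a frequency table; for example:
# employee counts per department, largest first
sort(table(employee_df2$department), decreasing = TRUE)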
Plot the distribution of region using a bar plot.
# region distribution
ggplot(data = employee_df2, aes(x = region, fill = region)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Region")
Based on the bar plot above, we can see that region_2 has the most employees.
Plot the distribution of gender using a bar plot.
# gender distribution
ggplot(data = employee_df2, aes(x = gender, fill = gender)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Gender")
Based on the bar plot above, we can see that most employees are male.
Plot the distribution of previous_year_rating using a bar plot.
# previous_year_rating distribution
ggplot(data = employee_df2, aes(x = previous_year_rating,
fill = previous_year_rating)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Previous year rating")
Based on the bar plot above, we can see that most employees have a previous-year rating of 3 ("Enough"). We treat the missing (NA) ratings as employees who have just joined the company.
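One way to sanity-check this assumption (a sketch, not part of the original analysis) is to cross-tabulate missing ratings against first-year employees:
# TRUE/TRUE cell: employees with no rating who are in their first year of service
table(missing_rating = is.na(employee_df2$previous_year_rating),
first_year = employee_df2$length_of_service == 1)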
Plot the distribution of recruitment_channel using a bar plot.
# recruitment_channel distribution
ggplot(data = employee_df2, aes(x = recruitment_channel,
fill = recruitment_channel)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Recruitment channel")
Based on the bar plot above, we can see that the most common recruitment channel is "other".
Plot the distribution of education using a bar plot.
# education distribution
ggplot(data = employee_df2, aes(x = education, fill = education)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Education")
Based on the bar plot above, we can see that most employees hold a Bachelor's degree. We treat the NA values here as missing data and handle them in the data preparation section.
Plot the distribution of awards_won using a bar plot.
# awards_won distribution
ggplot(data = employee_df2, aes(x = awards_won, fill = awards_won)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Awards won")
Based on the bar plot above, we can see that most employees have not won an award.
Plot the distributions of the numeric features using histograms and a boxplot.
# no_of_trainings distribution
univariate1 <- ggplot(data = employee_df2,
aes (x = no_of_trainings)) +
geom_histogram(bins = 20, fill = '#107009AA') +
labs(title = "Distribution by Number of trainings")
# age distribution
univariate2 <- ggplot(data = employee_df2,
aes (y = age)) +
geom_boxplot(color = '#107009AA') +
labs(title = "Distribution by Age")
# length_of_service distribution
univariate3 <- ggplot(data = employee_df2,
aes (x = length_of_service)) +
geom_histogram(bins = 20, fill = '#107009AA') +
labs(title = "Distribution by Length of service")
# avg_training_score distribution
univariate4 <- ggplot(data = employee_df2,
aes (x = avg_training_score)) +
geom_histogram(bins = 20, fill = '#107009AA') +
labs(title = "Distribution by Avg training score")
Combine the histograms and the boxplot into one figure using grid.arrange.
univariate.plot <- grid.arrange(univariate1,
univariate2,
univariate3,
univariate4, nrow = 2)
Based on the histograms and boxplot above, we can see that the age variable has outliers. In addition, most employees attend only one training and most training scores are around 50. Many employees have a length of service between one and ten years.
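The ages flagged as outliers by the boxplot can be listed explicitly; for example:
# age values beyond the boxplot whiskers
unique(boxplot.stats(employee_df2$age)$out)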
Plot the distribution of gender and is_promoted using a bar plot.
# Promotion by gender
ggplot(data = employee_df2, aes(x = gender, fill = is_promoted)) +
geom_bar(position = "dodge") +
labs(title = "Promotion by Gender")
Based on the bar plot above, we can see that more male employees are promoted than female employees. In addition, there are still more male employees who are not promoted than female employees.
Plot the distribution of no_of_trainings and is_promoted using a bar plot.
# Promotion by no_of_trainings
ggplot(data = employee_df2, aes(x = no_of_trainings, fill = is_promoted)) +
geom_bar(position = "dodge") +
labs(title = "Promotion by Number of trainings")
Based on the bar plot above, we can see that among employees who attended one training, more were not promoted than promoted.
Plot the distribution of length_of_service and is_promoted using a bar plot.
# Promotion by length_of_service
ggplot(data = employee_df2, aes(x = length_of_service, fill = is_promoted)) +
geom_bar(position = "dodge") +
labs(title = "Promotion by Length of service")
Based on the bar plot above, we can see that most of the promoted employees are in the range of 1-10 years of service.
Plot the distribution of awards_won and is_promoted using a bar plot.
# Promotion by awards_won
ggplot(data = employee_df2, aes(x = awards_won, fill = is_promoted)) +
geom_bar(position = "dodge") +
labs(title = "Promotion by Awards won")
Based on the bar plot above, we can see that, in absolute counts, more of the promoted employees had not won an award than had won one.
Plot the distribution of previous_year_rating and is_promoted using a bar plot.
# Promotion by previous_year_rating
ggplot(data = employee_df2, aes(x = previous_year_rating, fill = is_promoted)) +
geom_bar(position = "stack") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Promotion by Previous year rating")
Based on the bar plot above, we can see that employees who get enough rating, good rating, and very good rating are more likely to be promoted.
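Because the rating groups have very different sizes, promotion rates per rating level make this comparison more precise; a quick sketch:
# promotion rate within each previous_year_rating level
prop.table(table(employee_df2$previous_year_rating,
employee_df2$is_promoted), margin = 1)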
Plot the distribution of avg_training_score and is_promoted using a bar plot.
# Promotion by avg_training_score
ggplot(data = employee_df2, aes(x = avg_training_score, fill = is_promoted)) +
geom_bar(position = "dodge") +
labs(title = "Promotion by Average training score")
Based on the bar plot above, we can see that many employees who received training scores in the 40-80 range were promoted, although many were not promoted.
Plot the relationship between is_promoted and several other variables.
# Relationship between is_promoted and avg_training_score
ggplot(data = employee_df2, aes(y = avg_training_score,
x = no_of_trainings,
color = awards_won)) +
geom_point(position = "jitter" ) +
facet_grid(gender ~ is_promoted) +
labs(title = "Relationship between Number of trainings vs Avg training score & Awards won",
x = "Number of trainings",
y = "Avg taining score",
color = "Awards won")
Based on the relationship above, we can see that both male and female employees who attended one training and won an award tend to be promoted.
# Relationship between length_of_service and age
ggplot(data = employee_df2, aes(y = age,
x = length_of_service,
color = awards_won)) +
geom_point(position = "jitter" ) +
facet_grid(gender ~ is_promoted) +
labs(title = "Relationship between Length of service vs Age & Awards won",
x = "Length of service",
y = "Age",
color = "Awards won")
Based on the relationship above, we can see that both male and female employees with one to ten years of service who won awards tend to be promoted.
Now, create a corrgram to inspect the correlation coefficients between the variables.
# Correlation Coefficient
corrgram(employee_df,
upper.panel = panel.cor)
Based on the corrgram above, we can see that previous_year_rating, awards_won, and avg_training_score have the largest correlation coefficients with is_promoted.
Other findings from the Exploratory Data Analysis:
- The education variable shows "N/A" values that are not read as missing because the column type is character.
- Employees with less than 10 years of service mostly have no_of_trainings in the range 0-1, which makes it difficult to relate these variables to the others.
- The age variable has outliers in the 55-60 year range.
- We must clean the missing values in previous_year_rating, avg_training_score, and education.
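The coefficients behind this reading can also be computed directly on the numeric columns (a sketch; complete cases only):
# correlation of selected numeric features with is_promoted
cor(employee_df[, c("previous_year_rating", "awards_won",
"avg_training_score", "is_promoted")],
use = "complete.obs")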
Count the missing values in the first dataframe before handling them.
# count missing values before cleaning
sum(is.na(employee_df))
## [1] 9093
We use the first dataframe to impute the missing values in previous_year_rating and avg_training_score with the column means, and we create the second dataframe by removing the rows with a missing education.
# Impute missing previous_year_rating with the column mean
employee_df$previous_year_rating = ifelse(is.na(employee_df$previous_year_rating),
ave(employee_df$previous_year_rating,
FUN = function(x) mean(x, na.rm = TRUE)),
employee_df$previous_year_rating)
# Impute missing avg_training_score with the column mean
employee_df$avg_training_score = ifelse(is.na(employee_df$avg_training_score),
ave(employee_df$avg_training_score,
FUN = function(x) mean(x, na.rm = TRUE)),
employee_df$avg_training_score)
# Drop rows with missing education
employee_df2 <- employee_df[!is.na(employee_df$education), ]
Now, check the number of missing values in the first dataframe again after cleaning.
# count missing values after cleaning
sum(is.na(employee_df))
## [1] 2409
Then, remove the employee_id column from the second dataframe, because employee_id is only an identifier and carries no predictive information.
# Remove employee_id column
employee_df2 <- select(employee_df2, -employee_id)
Now, we have the new second dataframe after removing the employee_id column.
str(employee_df2)
## 'data.frame': 52399 obs. of 12 variables:
## $ department : chr "Sales & Marketing" "Operations" "Sales & Marketing" "Sales & Marketing" ...
## $ region : chr "region_7" "region_22" "region_19" "region_23" ...
## $ education : chr "Master's & above" "Bachelor's" "Bachelor's" "Bachelor's" ...
## $ gender : chr "f" "m" "m" "m" ...
## $ recruitment_channel : chr "sourcing" "other" "sourcing" "other" ...
## $ no_of_trainings : int 1 1 1 2 1 2 1 1 1 1 ...
## $ age : int 35 30 34 39 45 31 31 33 28 32 ...
## $ previous_year_rating: num 5 5 3 1 3 3 3 3 4 5 ...
## $ length_of_service : int 8 4 7 10 2 7 5 6 5 5 ...
## $ awards_won : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_training_score : num 49 60 50 50 73 85 59 63 83 54 ...
## $ is_promoted : int 0 0 0 0 0 0 0 0 0 0 ...
Change the relevant column types in the first dataframe to factors.
# Change is_promoted column type
employee_df$is_promoted <- factor(employee_df$is_promoted,
levels = c(0,1),
labels = c("Not Promoted", "Promoted"))
# Change previous_year_rating column type
employee_df$previous_year_rating <- factor(employee_df$previous_year_rating,
levels = c(0,1,2,3,4,5),
labels = c("Null",
"Very Bad",
"Bad",
"Enough",
"Good",
"Very Good"))
# Change awards_won column type
employee_df$awards_won <- factor(employee_df$awards_won,
levels = c(0,1),
labels = c("None", "Won"))
Now, check whether the data is imbalanced using the target variable (is_promoted).
ggplot(data = employee_df, aes(x = is_promoted, fill = is_promoted)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Promotion")
Based on the bar plot above, we can conclude that the data is imbalanced, so we resample it (over-sampling, under-sampling, and a combination of both) to balance the classes.
Now, partition the data before resampling.
set.seed(2021)
idx <- sample(2, nrow(employee_df), replace = TRUE)
train <- employee_df[idx==1, ]
test <- employee_df[idx==2,]
# class counts in the training data
table(train$is_promoted)
##
## Not Promoted Promoted
## 24997 2329
# class proportions in the training data before resampling
prop.table(table(train$is_promoted))
##
## Not Promoted Promoted
## 0.91476982 0.08523018
Once the data is partitioned, we can handle the class imbalance.
employee_df3 <- ovun.sample(is_promoted~., data = train, method = "over", N = 44154)$data
employee_df4 <- ovun.sample(is_promoted~., data = train, method = "under", N = 4340)$data
employee_df5 <- ovun.sample(is_promoted~., data = train, method = "both",
p = 0.5,
seed = 222,
N = 10000)$data
# class proportions in the over-sampled training data
prop.table(table(employee_df3$is_promoted))
##
## Not Promoted Promoted
## 0.5003624 0.4996376
Check the class distribution of the target variable after resampling.
ggplot(data = employee_df3, aes(x = is_promoted, fill = is_promoted)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Promotion")
Based on the bar plot above, we can conclude that after resampling the class proportions are much more balanced than before. With balanced classes, we can move on to the next step.
Next, we perform feature selection on the third dataframe (employee_df3), which is the over-sampled training data.
X <- select(employee_df3, -department, -region, -education, -gender,
-recruitment_channel, -is_promoted, -previous_year_rating, -awards_won)
After feature selection, we process the data with Principal Component Analysis (PCA), keeping the components that explain at least 80% of the variance.
pr.out <- prcomp(X, scale. = TRUE) # PCA
pr.var <- pr.out$sdev^2 # Variance
pve <- pr.var / sum(pr.var) # Proportion of Variance Explained (PVE)
cumsum(pve)
## [1] 0.3270913 0.5346093 0.7318913 0.9229744 1.0000000
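As a cross-check on the 80% threshold, the number of components can be chosen programmatically from pve (a sketch; num_pc is a hypothetical name):
# smallest number of principal components explaining at least 80% of the variance
num_pc <- which(cumsum(pve) >= 0.80)[1]
num_pc
With the proportions shown above, this gives 4, matching the four components kept below.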
# create the feature matrix, target vector, and a new dataframe
features <- data.frame(pr.out$x[ , 1:4])
target <- employee_df3$is_promoted
employee_df6 <- cbind(features, target)
Now, split the new dataframe into training and testing data using an 80:20 ratio.
# number of all samples and number of train samples
m <- nrow(employee_df6)
m_train <- 0.8 * m
m_train <- floor(m_train)
# random with set.seed for reproducible result
set.seed(2021)
train_index <- sample(m, m_train)
train_index[1:5]
## [1] 25018 17478 37189 16487 35763
# create dataframe for train and test
train_df <- employee_df6[train_index, ]
test_df <- employee_df6[-train_index, ]
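Since caret is already loaded, a stratified 80:20 split could be written as follows (a sketch only; train_df2 and test_df2 are hypothetical names and this is not the split used above):
# stratified split on the target using caret
set.seed(2021)
strat_index <- createDataPartition(employee_df6$target, p = 0.8, list = FALSE)
train_df2 <- employee_df6[strat_index, ]
test_df2 <- employee_df6[-strat_index, ]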
After creating training data and testing data, we can enter the modeling process.
In this modeling step, we fit four classification models on the training data: logistic regression, a decision tree, a random forest, and a support vector machine.
model.logit <- glm(formula = target ~ .,
data = train_df,
family = binomial)
model.logit
##
## Call: glm(formula = target ~ ., family = binomial, data = train_df)
##
## Coefficients:
## (Intercept) PC1 PC2 PC3 PC4
## -0.003941 0.074198 0.271251 0.305247 0.515660
##
## Degrees of Freedom: 35322 Total (i.e. Null); 35318 Residual
## Null Deviance: 48970
## Residual Deviance: 45550 AIC: 45560
model.tree <- ctree(formula = target ~ .,
data = train_df)
model.tree
##
## Conditional inference tree with 88 terminal nodes
##
## Response: target
## Inputs: PC1, PC2, PC3, PC4
## Number of observations: 35323
##
## 1) PC4 <= 0.1834015; criterion = 1, statistic = 1902.366
## 2) PC3 <= 0.3319995; criterion = 1, statistic = 243.112
## 3) PC2 <= 2.664413; criterion = 1, statistic = 44.368
## 4) PC4 <= -1.367108; criterion = 1, statistic = 93.044
## 5) PC2 <= 0.619071; criterion = 1, statistic = 34.198
## 6) PC3 <= -1.866225; criterion = 0.999, statistic = 14.217
## 7)* weights = 16
## 6) PC3 > -1.866225
## 8)* weights = 749
## 5) PC2 > 0.619071
## 9) PC3 <= -0.5424553; criterion = 0.99, statistic = 9.164
## 10) PC4 <= -3.136583; criterion = 1, statistic = 20.203
## 11)* weights = 13
## 10) PC4 > -3.136583
## 12) PC2 <= 2.072296; criterion = 0.992, statistic = 9.474
## 13)* weights = 201
## 12) PC2 > 2.072296
## 14) PC3 <= -1.074581; criterion = 1, statistic = 20.148
## 15)* weights = 7
## 14) PC3 > -1.074581
## 16)* weights = 29
## 9) PC3 > -0.5424553
## 17)* weights = 215
## 4) PC4 > -1.367108
## 18) PC2 <= 0.9170353; criterion = 1, statistic = 65.903
## 19) PC4 <= -0.2059072; criterion = 1, statistic = 22.109
## 20) PC3 <= -0.6977661; criterion = 0.998, statistic = 12.622
## 21) PC3 <= -1.052106; criterion = 0.99, statistic = 9.079
## 22) PC3 <= -1.849351; criterion = 0.998, statistic = 12.622
## 23)* weights = 270
## 22) PC3 > -1.849351
## 24)* weights = 1409
## 21) PC3 > -1.052106
## 25) PC4 <= -0.6731815; criterion = 1, statistic = 25.566
## 26) PC1 <= -3.660994; criterion = 1, statistic = 20.656
## 27) PC4 <= -0.757441; criterion = 0.987, statistic = 8.72
## 28)* weights = 10
## 27) PC4 > -0.757441
## 29)* weights = 10
## 26) PC1 > -3.660994
## 30) PC3 <= -0.7385885; criterion = 1, statistic = 14.807
## 31)* weights = 41
## 30) PC3 > -0.7385885
## 32) PC3 <= -0.7309253; criterion = 1, statistic = 14.954
## 33)* weights = 10
## 32) PC3 > -0.7309253
## 34)* weights = 11
## 25) PC4 > -0.6731815
## 35) PC3 <= -0.9562167; criterion = 0.959, statistic = 6.578
## 36)* weights = 136
## 35) PC3 > -0.9562167
## 37)* weights = 464
## 20) PC3 > -0.6977661
## 38) PC3 <= 0.1877901; criterion = 1, statistic = 15.938
## 39) PC1 <= -0.4132578; criterion = 0.992, statistic = 9.621
## 40) PC3 <= -0.6873006; criterion = 0.993, statistic = 9.741
## 41)* weights = 26
## 40) PC3 > -0.6873006
## 42) PC1 <= -0.6088672; criterion = 0.993, statistic = 9.691
## 43)* weights = 819
## 42) PC1 > -0.6088672
## 44)* weights = 101
## 39) PC1 > -0.4132578
## 45) PC1 <= -0.3961709; criterion = 0.974, statistic = 7.394
## 46)* weights = 35
## 45) PC1 > -0.3961709
## 47)* weights = 1839
## 38) PC3 > 0.1877901
## 48) PC4 <= -0.6006943; criterion = 0.998, statistic = 11.715
## 49)* weights = 209
## 48) PC4 > -0.6006943
## 50) PC4 <= -0.4597684; criterion = 0.999, statistic = 14.523
## 51)* weights = 122
## 50) PC4 > -0.4597684
## 52)* weights = 120
## 19) PC4 > -0.2059072
## 53) PC3 <= -0.3696245; criterion = 0.999, statistic = 12.993
## 54)* weights = 2996
## 53) PC3 > -0.3696245
## 55)* weights = 1537
## 18) PC2 > 0.9170353
## 56)* weights = 1102
## 3) PC2 > 2.664413
## 57) PC4 <= -3.936369; criterion = 0.999, statistic = 13.87
## 58) PC1 <= 2.261653; criterion = 0.961, statistic = 6.667
## 59)* weights = 24
## 58) PC1 > 2.261653
## 60)* weights = 7
## 57) PC4 > -3.936369
## 61)* weights = 199
## 2) PC3 > 0.3319995
## 62) PC4 <= -1.591272; criterion = 1, statistic = 158.546
## 63) PC2 <= 0.7507008; criterion = 1, statistic = 80.541
## 64) PC4 <= -1.973742; criterion = 0.999, statistic = 13.194
## 65)* weights = 214
## 64) PC4 > -1.973742
## 66) PC4 <= -1.966164; criterion = 1, statistic = 14.946
## 67)* weights = 15
## 66) PC4 > -1.966164
## 68) PC1 <= -2.738524; criterion = 0.998, statistic = 12.276
## 69)* weights = 13
## 68) PC1 > -2.738524
## 70)* weights = 177
## 63) PC2 > 0.7507008
## 71) PC4 <= -2.922163; criterion = 1, statistic = 28.703
## 72)* weights = 238
## 71) PC4 > -2.922163
## 73) PC2 <= 4.251806; criterion = 1, statistic = 16.713
## 74) PC3 <= 1.136422; criterion = 1, statistic = 15.196
## 75) PC1 <= 0.005669644; criterion = 0.963, statistic = 6.748
## 76) PC3 <= 0.5425239; criterion = 0.993, statistic = 9.85
## 77)* weights = 7
## 76) PC3 > 0.5425239
## 78) PC1 <= -0.3620893; criterion = 0.997, statistic = 11.377
## 79)* weights = 13
## 78) PC1 > -0.3620893
## 80)* weights = 15
## 75) PC1 > 0.005669644
## 81)* weights = 91
## 74) PC3 > 1.136422
## 82)* weights = 272
## 73) PC2 > 4.251806
## 83)* weights = 32
## 62) PC4 > -1.591272
## 84) PC2 <= 1.297146; criterion = 1, statistic = 112.001
## 85) PC1 <= -1.709617; criterion = 1, statistic = 24.551
## 86) PC4 <= 0.01179968; criterion = 1, statistic = 20.81
## 87)* weights = 569
## 86) PC4 > 0.01179968
## 88)* weights = 127
## 85) PC1 > -1.709617
## 89) PC3 <= 0.5565738; criterion = 0.996, statistic = 10.69
## 90) PC3 <= 0.4047064; criterion = 1, statistic = 20.925
## 91) PC2 <= -0.7849073; criterion = 1, statistic = 15.509
## 92)* weights = 459
## 91) PC2 > -0.7849073
## 93) PC1 <= -1.519028; criterion = 0.99, statistic = 9.225
## 94)* weights = 11
## 93) PC1 > -1.519028
## 95) PC3 <= 0.4043296; criterion = 1, statistic = 16.486
## 96)* weights = 48
## 95) PC3 > 0.4043296
## 97)* weights = 7
## 90) PC3 > 0.4047064
## 98) PC2 <= 1.005243; criterion = 0.991, statistic = 9.421
## 99)* weights = 728
## 98) PC2 > 1.005243
## 100) PC1 <= -0.5717216; criterion = 0.996, statistic = 11.051
## 101)* weights = 17
## 100) PC1 > -0.5717216
## 102)* weights = 16
## 89) PC3 > 0.5565738
## 103)* weights = 3438
## 84) PC2 > 1.297146
## 104) PC3 <= 1.168326; criterion = 1, statistic = 35.341
## 105) PC1 <= -0.5142853; criterion = 1, statistic = 29.728
## 106) PC2 <= 2.415267; criterion = 1, statistic = 16.952
## 107)* weights = 29
## 106) PC2 > 2.415267
## 108)* weights = 7
## 105) PC1 > -0.5142853
## 109)* weights = 419
## 104) PC3 > 1.168326
## 110) PC1 <= -1.231104; criterion = 0.998, statistic = 12.027
## 111) PC4 <= -0.856759; criterion = 0.983, statistic = 8.195
## 112) PC1 <= -2.595527; criterion = 0.988, statistic = 8.756
## 113)* weights = 14
## 112) PC1 > -2.595527
## 114)* weights = 13
## 111) PC4 > -0.856759
## 115)* weights = 17
## 110) PC1 > -1.231104
## 116) PC2 <= 1.51405; criterion = 0.978, statistic = 7.7
## 117) PC3 <= 1.518513; criterion = 0.995, statistic = 10.401
## 118)* weights = 16
## 117) PC3 > 1.518513
## 119)* weights = 55
## 116) PC2 > 1.51405
## 120)* weights = 109
## 1) PC4 > 0.1834015
## 121) PC4 <= 1.187093; criterion = 1, statistic = 619.878
## 122) PC3 <= 0.9999766; criterion = 1, statistic = 123.732
## 123) PC3 <= -1.651292; criterion = 1, statistic = 18.273
## 124)* weights = 216
## 123) PC3 > -1.651292
## 125)* weights = 9202
## 122) PC3 > 0.9999766
## 126) PC4 <= 0.9336104; criterion = 1, statistic = 83.055
## 127) PC2 <= 0.3869843; criterion = 0.972, statistic = 7.267
## 128) PC2 <= -0.2685459; criterion = 0.973, statistic = 7.298
## 129)* weights = 1339
## 128) PC2 > -0.2685459
## 130)* weights = 725
## 127) PC2 > 0.3869843
## 131)* weights = 184
## 126) PC4 > 0.9336104
## 132) PC3 <= 1.215289; criterion = 1, statistic = 28.505
## 133) PC1 <= 0.4961082; criterion = 0.999, statistic = 13.604
## 134)* weights = 19
## 133) PC1 > 0.4961082
## 135) PC1 <= 1.217222; criterion = 0.967, statistic = 6.974
## 136)* weights = 74
## 135) PC1 > 1.217222
## 137)* weights = 17
## 132) PC3 > 1.215289
## 138) PC4 <= 1.115847; criterion = 1, statistic = 15.367
## 139) PC1 <= -0.4169971; criterion = 0.997, statistic = 11.436
## 140)* weights = 116
## 139) PC1 > -0.4169971
## 141)* weights = 331
## 138) PC4 > 1.115847
## 142)* weights = 201
## 121) PC4 > 1.187093
## 143) PC3 <= -0.2128683; criterion = 1, statistic = 305.279
## 144) PC4 <= 1.514837; criterion = 1, statistic = 59.457
## 145) PC2 <= 0.8918199; criterion = 0.963, statistic = 6.729
## 146) PC3 <= -0.3881231; criterion = 1, statistic = 16.873
## 147) PC1 <= 1.680952; criterion = 0.994, statistic = 10.232
## 148)* weights = 146
## 147) PC1 > 1.680952
## 149)* weights = 9
## 146) PC3 > -0.3881231
## 150)* weights = 14
## 145) PC2 > 0.8918199
## 151) PC3 <= -0.804613; criterion = 1, statistic = 15.602
## 152) PC4 <= 1.190255; criterion = 1, statistic = 16.258
## 153)* weights = 17
## 152) PC4 > 1.190255
## 154)* weights = 127
## 151) PC3 > -0.804613
## 155)* weights = 538
## 144) PC4 > 1.514837
## 156) PC3 <= -0.655496; criterion = 0.996, statistic = 10.99
## 157) PC1 <= 0.376782; criterion = 1, statistic = 19.481
## 158)* weights = 15
## 157) PC1 > 0.376782
## 159)* weights = 7
## 156) PC3 > -0.655496
## 160)* weights = 228
## 143) PC3 > -0.2128683
## 161) PC4 <= 1.366997; criterion = 1, statistic = 103.474
## 162) PC3 <= 0.4799058; criterion = 1, statistic = 103.977
## 163) PC2 <= 1.071399; criterion = 1, statistic = 16.325
## 164)* weights = 210
## 163) PC2 > 1.071399
## 165)* weights = 91
## 162) PC3 > 0.4799058
## 166) PC3 <= 0.891597; criterion = 1, statistic = 22.893
## 167) PC2 <= 0.4328277; criterion = 0.998, statistic = 12.333
## 168)* weights = 9
## 167) PC2 > 0.4328277
## 169)* weights = 139
## 166) PC3 > 0.891597
## 170)* weights = 375
## 161) PC4 > 1.366997
## 171) PC3 <= 0.5383737; criterion = 0.99, statistic = 9.226
## 172) PC4 <= 1.451175; criterion = 0.997, statistic = 11.613
## 173)* weights = 152
## 172) PC4 > 1.451175
## 174)* weights = 485
## 171) PC3 > 0.5383737
## 175)* weights = 424
set.seed(2021)
model.forest <- randomForest(formula = target ~ .,
data = train_df)
model.forest
##
## Call:
## randomForest(formula = target ~ ., data = train_df)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 1.61%
## Confusion matrix:
## Not Promoted Promoted class.error
## Not Promoted 17153 563 0.0317791827
## Promoted 4 17603 0.0002271824
set.seed(2021)
model.svm <- svm(formula = target ~ .,
data = train_df)
model.svm
##
## Call:
## svm(formula = target ~ ., data = train_df)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 27238
Now, compute the actual and predicted values so we can build a confusion matrix and measure the classification performance of each model.
# Actual values
actual <- test_df$target
# Probability [0-1] of predicted values
pred.logit.prob <- predict(model.logit, test_df, type = "response")
pred.logit.boolen <- pred.logit.prob > 0.5
# Change to factor (FALSE ==> "Not Promoted", TRUE ==> "Promoted")
pred.logit <- factor(pred.logit.boolen,
levels = c(FALSE, TRUE),
labels = c("Not Promoted", "Promoted"))
pred.tree <- predict(model.tree, test_df)
pred.forest <- predict(model.forest, test_df)
pred.svm <- predict(model.svm, test_df)
Create a confusion matrix for each model to compare actual and predicted values.
# Confusion Matrix : Logistic Regression
logit.perf <- table(test_df$target, pred.logit,
dnn = c("Actual","Predicted"))
logit.perf
## Predicted
## Actual Not Promoted Promoted
## Not Promoted 2902 1475
## Promoted 1900 2554
# Confusion Matrix : Decision Tree
ctree.perf <- table(test_df$target, pred.tree,
dnn = c("Actual","Predicted"))
ctree.perf
## Predicted
## Actual Not Promoted Promoted
## Not Promoted 2428 1949
## Promoted 1260 3194
# Confusion Matrix : Random Forest
forest.perf <- table(test_df$target, pred.forest,
dnn = c("Actual","Predicted"))
forest.perf
## Predicted
## Actual Not Promoted Promoted
## Not Promoted 4241 136
## Promoted 8 4446
# Confusion Matrix : Support Vector Machine
svm.perf <- table(na.omit(test_df)$target, pred.svm,
dnn = c("Actual","Predicted"))
svm.perf
## Predicted
## Actual Not Promoted Promoted
## Not Promoted 2777 1600
## Promoted 1699 2755
Now, create a classification performance function to evaluate each model.
performance_classification <- function(prediction, actual){
# Confusion matrix
conf.matrix <- table(actual, prediction,
dnn = c("Actual", "Predicted"))
conf.matrix
# Compute performance
tp <- conf.matrix[2,2]
tn <- conf.matrix[1,1]
fp <- conf.matrix[1,2]
fn <- conf.matrix[2,1]
accuracy <- (tp+tn) / (tp+tn+fp+fn)
precision <- tp / (tp+fp)
recall <- tp / (tp+fn)
f1score <- (2*precision*recall) / (precision+recall)
result <- paste("Accuracy = ", round(accuracy,3),
"\nPrecision = ", round(precision,3),
"\nRecall = ", round(recall,3),
"\nF1-Score = ", round(f1score,3),
"\n")
cat(result)
}
Now, we can check the classification performance of each model.
# Performance Classification : Logistic Regression
performance_classification(pred.logit, actual)
## Accuracy = 0.618
## Precision = 0.634
## Recall = 0.573
## F1-Score = 0.602
# Performance Classification : Decision Tree
performance_classification(pred.tree, actual)
## Accuracy = 0.637
## Precision = 0.621
## Recall = 0.717
## F1-Score = 0.666
# Performance Classification : Random Forest
performance_classification(pred.forest, actual)
## Accuracy = 0.984
## Precision = 0.97
## Recall = 0.998
## F1-Score = 0.984
# Performance Classification : Support Vector Machine
performance_classification(pred.svm, actual)
## Accuracy = 0.626
## Precision = 0.633
## Recall = 0.619
## F1-Score = 0.625
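To compare the four models side by side, the confusion matrices built above can be summarised in one table (a sketch; metrics is a hypothetical helper):
# accuracy, precision, and recall computed from each confusion matrix
metrics <- function(cm) {
tp <- cm[2, 2]; tn <- cm[1, 1]; fp <- cm[1, 2]; fn <- cm[2, 1]
c(accuracy = (tp + tn) / sum(cm),
precision = tp / (tp + fp),
recall = tp / (tp + fn))
}
round(rbind(Logistic = metrics(logit.perf),
`Decision Tree` = metrics(ctree.perf),
`Random Forest` = metrics(forest.perf),
SVM = metrics(svm.perf)), 3)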
Based on the evaluation above, the Random Forest model gives the best performance and is the recommended model for predicting is_promoted from avg_training_score, length_of_service, no_of_trainings, and age.
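To see which inputs the recommended model relies on most, the random forest's variable importance over the principal components used as features can be inspected; for example:
# mean decrease in Gini impurity for each principal component
importance(model.forest)
varImpPlot(model.forest)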