Objective

Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree that solves a classification or regression problem and predicts the outcome of a particular feature of the data used.

Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results.

Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?

Data

I am using a Kaggle dataset that contains information about employees who worked at a company: https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction. I will create two regression decision trees using the satisfaction_level variable as the output, and then a random forest for comparison.

#Import required libraries
library(ggplot2)
library(dplyr)
library(modelr)
library(caTools)
library(rpart)
library(yardstick)
library(randomForest)
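#Load the dataset; the filename below is assumed to be the CSV that ships with the Kaggle dataset,
#so adjust the path if your local copy is named differently
Data = read.csv("HR_comma_sep.csv")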
head(Data)
##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   time_spend_company Work_accident left promotion_last_5years sales salary
## 1                  3             0    1                     0 sales    low
## 2                  6             0    1                     0 sales medium
## 3                  4             0    1                     0 sales medium
## 4                  5             0    1                     0 sales    low
## 5                  3             0    1                     0 sales    low
## 6                  3             0    1                     0 sales    low
cols = colnames(Data)
print(cols)
##  [1] "satisfaction_level"    "last_evaluation"       "number_project"       
##  [4] "average_montly_hours"  "time_spend_company"    "Work_accident"        
##  [7] "left"                  "promotion_last_5years" "sales"                
## [10] "salary"
summary(Data)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##  time_spend_company Work_accident         left        promotion_last_5years
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000   Min.   :0.00000      
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000      
##  Median : 3.000     Median :0.0000   Median :0.0000   Median :0.00000      
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381   Mean   :0.02127      
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000      
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000   Max.   :1.00000      
##     sales              salary         
##  Length:14999       Length:14999      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Exploratory Data Analysis

#Check if there are any NA values

sum(is.na(Data))
## [1] 0

The data does not have any missing values.

Data %>%
  count(salary)
##   salary    n
## 1   high 1237
## 2    low 7316
## 3 medium 6446

Employees with a high salary are by far the smallest group, compared to those with medium and low salaries.
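A quick proportion table (a sketch, output not shown) makes this imbalance concrete:

#Share of employees in each salary band
Data %>%
  count(salary) %>%
  mutate(prop = round(n / sum(n), 3))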

Data %>%
  count(sales, salary) %>%
  group_by(sales)
## # A tibble: 30 x 3
## # Groups:   sales [10]
##    sales      salary     n
##    <chr>      <chr>  <int>
##  1 accounting high      74
##  2 accounting low      358
##  3 accounting medium   335
##  4 hr         high      45
##  5 hr         low      335
##  6 hr         medium   359
##  7 IT         high      83
##  8 IT         low      609
##  9 IT         medium   535
## 10 management high     225
## # ... with 20 more rows
#Bar chart of employees who left vs. stayed, broken down by salary band
#(geom_bar counts discrete values directly; geom_histogram(stat = "count") raises a warning)
ggplot(Data, aes(factor(left), fill = factor(salary))) +
  geom_bar()

Data Splitting

#Data splitting to 80/20; a seed (arbitrary value) makes the split reproducible
set.seed(123)
sample = sample.split(Data$satisfaction_level, SplitRatio = .8)
train= subset(Data,sample==TRUE)
test = subset(Data,sample==FALSE)
X_test = subset(test, select= -c(satisfaction_level))
Y_test = subset(test, select= c(satisfaction_level))
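As a sanity check (output not shown here), the split sizes can be confirmed; the tree output below reports n = 12000 for training, so the test set holds the remaining 2,999 rows:

#Verify the 80/20 split
nrow(train)
nrow(test)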

Decision Tree: Model 1

#Build a regression tree on the training dataset
dt_model = rpart(satisfaction_level ~ last_evaluation + number_project +
                   average_montly_hours + time_spend_company,
                 method = "anova", data = train)
print(dt_model) 
## n= 12000 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 12000 741.50290 0.6129642  
##    2) number_project>=5.5 1159  69.90773 0.2488352  
##      4) average_montly_hours>=242.5 790  19.52350 0.1520127 *
##      5) average_montly_hours< 242.5 369  27.12276 0.4561247 *
##    3) number_project< 5.5 10841 501.49460 0.6518928  
##      6) number_project< 2.5 1902  46.98782 0.4796951  
##       12) average_montly_hours< 165.5 1467  15.54887 0.4350511 *
##       13) average_montly_hours>=165.5 435  18.65467 0.6302529 *
##      7) number_project>=2.5 8939 386.10830 0.6885323  
##       14) average_montly_hours>=275.5 135  10.41079 0.4277037 *
##       15) average_montly_hours< 275.5 8804 366.37250 0.6925318  
##         30) last_evaluation< 0.475 314  15.17669 0.5312739 *
##         31) last_evaluation>=0.475 8490 342.72850 0.6984959 *

Visualization

plot(dt_model,uniform = TRUE)
text(dt_model,use.n = TRUE)
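The base plot/text rendering can be hard to read; if the rpart.plot package is installed (an assumption, it is not loaded above), it draws a cleaner tree:

#Optional: a more readable tree diagram (assumes rpart.plot is installed)
library(rpart.plot)
rpart.plot(dt_model, type = 2, digits = 3)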

Prediction

#For an anova tree, predict() returns fitted values directly (the method argument belongs to rpart, not predict)
Y_pred = predict(dt_model, X_test)
#Y_pred

Evaluation

#Bundle MAE, RMSE, and R-squared into one metric set; named eval_metrics to avoid masking base::eval
eval_metrics = metric_set(mae, rmse, rsq)
dt1_eval = eval_metrics(data = test, truth = satisfaction_level, estimate = Y_pred)
dt1_eval
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard       0.149
## 2 rmse    standard       0.195
## 3 rsq     standard       0.385
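One way to counter the overfitting criticism raised in the objective is cost-complexity pruning. The sketch below (output not shown) uses rpart's built-in cross-validation table to pick the cp value with the lowest cross-validated error:

#Inspect the cross-validated error at each complexity parameter
printcp(dt_model)
#Prune at the cp with the lowest xerror
best_cp = dt_model$cptable[which.min(dt_model$cptable[, "xerror"]), "CP"]
dt_pruned = prune(dt_model, cp = best_cp)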

Decision Tree: Model 2

dt_model2 = rpart(satisfaction_level ~ Work_accident + promotion_last_5years +
                    left + salary + sales,
                  method = "anova", data = train)
print(dt_model2) 
## n= 12000 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 12000 741.5029 0.6129642  
##   2) left>=0.5 2865 200.1217 0.4418569 *
##   3) left< 0.5 9135 431.1932 0.6666284 *

Prediction

Y_pred2 = predict(dt_model2, X_test)

Evaluation

dt2_eval = eval_metrics(data = test, truth = satisfaction_level, estimate = Y_pred2)

dt2_eval
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard       0.182
## 2 rmse    standard       0.228
## 3 rsq     standard       0.160

Conclusion

The first model performs better than the second: it has a lower RMSE (0.195 vs. 0.228) and a higher R-squared (0.385 vs. 0.160).

Random Forest

I will create a random forest model using the predictors from the first model, since it performed better.

#Create a random forest model. randomForest infers regression from the numeric response,
#and importance = TRUE stores variable importance scores for inspection below.
rf_model = randomForest(satisfaction_level ~ last_evaluation + number_project +
                          average_montly_hours + time_spend_company,
                        ntree = 1200, importance = TRUE, data = train)
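The main tuning knob is mtry, the number of predictors tried at each split; with only four predictors the search space is small, but tuneRF (bundled with randomForest) can explore it using out-of-bag error (a sketch, output not shown):

#Optional: search for the best mtry via out-of-bag error
X_train = subset(train, select = c(last_evaluation, number_project,
                                   average_montly_hours, time_spend_company))
tuneRF(X_train, train$satisfaction_level, ntreeTry = 500, stepFactor = 1.5, improve = 0.01)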

Prediction and Evaluation

Y_pred3 = predict(rf_model,X_test)
dt3_eval = eval_metrics(data = test, truth = satisfaction_level, estimate = Y_pred3)
dt3_eval
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard       0.138
## 2 rmse    standard       0.180
## 3 rsq     standard       0.481

The random forest achieves a lower RMSE than either of the two decision tree models.

Check the variable importance scores.

importance(rf_model)
##                      IncNodePurity
## last_evaluation           64.06552
## number_project           163.59319
## average_montly_hours     109.65167
## time_spend_company        49.60247
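randomForest can also plot these scores directly:

#Dot chart of variable importance
varImpPlot(rf_model)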

Summary

The RMSE from the random forest is 0.1800717, while the RMSE from the first decision tree model is 0.1955787; therefore, the random forest performs better than the single decision tree. In addition, looking at the variable importance scores, we see that number_project and average_montly_hours are the most important predictors of satisfaction_level.
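For a side-by-side view, the three metric tibbles computed above can be stacked with dplyr (a sketch; the values are those already reported):

#Stack the three evaluation tibbles for a direct comparison
bind_rows(
  dt1_eval %>% mutate(model = "decision tree 1"),
  dt2_eval %>% mutate(model = "decision tree 2"),
  dt3_eval %>% mutate(model = "random forest")
)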