Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.
Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results.
Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?
I am using a Kaggle dataset that contains information about the employees of a company, https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction. I will build two regression decision trees and a random forest, all with satisfaction_level as the response variable.
#Import required libraries
library(ggplot2)
library(dplyr)
library(modelr)
library(caTools)
library(rpart)
library(yardstick)
library(randomForest)
#Data is assumed to be a data frame already loaded from the Kaggle CSV (e.g. with read.csv)
head(Data)
## satisfaction_level last_evaluation number_project average_montly_hours
## 1 0.38 0.53 2 157
## 2 0.80 0.86 5 262
## 3 0.11 0.88 7 272
## 4 0.72 0.87 5 223
## 5 0.37 0.52 2 159
## 6 0.41 0.50 2 153
## time_spend_company Work_accident left promotion_last_5years sales salary
## 1 3 0 1 0 sales low
## 2 6 0 1 0 sales medium
## 3 4 0 1 0 sales medium
## 4 5 0 1 0 sales low
## 5 3 0 1 0 sales low
## 6 3 0 1 0 sales low
cols = colnames(Data)
print(cols)
## [1] "satisfaction_level" "last_evaluation" "number_project"
## [4] "average_montly_hours" "time_spend_company" "Work_accident"
## [7] "left" "promotion_last_5years" "sales"
## [10] "salary"
summary(Data)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
## time_spend_company Work_accident left promotion_last_5years
## Min. : 2.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 3.000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean : 3.498 Mean :0.1446 Mean :0.2381 Mean :0.02127
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :10.000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## sales salary
## Length:14999 Length:14999
## Class :character Class :character
## Mode :character Mode :character
##
##
##
#Check if there are any NA values
sum(is.na(Data))
## [1] 0
The data does not have any missing values.
Data %>%
count(salary)
## salary n
## 1 high 1237
## 2 low 7316
## 3 medium 6446
Employees with a high salary are by far the smallest group (1,237), compared with 7,316 on a low salary and 6,446 on a medium salary.
Data %>%
count(sales,salary)%>%
group_by(sales)
## # A tibble: 30 x 3
## # Groups: sales [10]
## sales salary n
## <chr> <chr> <int>
## 1 accounting high 74
## 2 accounting low 358
## 3 accounting medium 335
## 4 hr high 45
## 5 hr low 335
## 6 hr medium 359
## 7 IT high 83
## 8 IT low 609
## 9 IT medium 535
## 10 management high 225
## # ... with 20 more rows
#Bar chart of employees who stayed (0) or left (1), coloured by salary band
ggplot(Data, aes(factor(left), fill = factor(salary))) +
  geom_bar()
#Split the data 80/20 into training and test sets (sample.split draws a random partition)
sample = sample.split(Data$satisfaction_level,SplitRatio = .8)
train= subset(Data,sample==TRUE)
test = subset(Data,sample==FALSE)
X_test = subset(test, select= -c(satisfaction_level))
Y_test = subset(test, select= c(satisfaction_level))
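Because sample.split() draws a random partition, the metrics reported further down will shift slightly from run to run. A minimal sketch of a reproducible 80/20 split follows, shown on the built-in mtcars data so that it runs on its own. The seed value and the names train_idx, train_demo and test_demo are illustrative assumptions, not part of the analysis here.

```r
# Reproducible 80/20 split in base R (illustration on mtcars, not the HR data)
set.seed(42)                                    # arbitrary seed; fixes the random draw
n <- nrow(mtcars)
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_demo <- mtcars[train_idx, ]               # 80% of the rows
test_demo  <- mtcars[-train_idx, ]              # the remaining rows
c(train = nrow(train_demo), test = nrow(test_demo))
```

Calling set.seed() once before sample.split() should have the same effect for the split used in this report.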
#Build the first regression tree on the training set.
dt_model = rpart(satisfaction_level ~ last_evaluation + number_project + average_montly_hours + time_spend_company, method = "anova", data = train)
print(dt_model)
## n= 12000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 12000 741.50290 0.6129642
## 2) number_project>=5.5 1159 69.90773 0.2488352
## 4) average_montly_hours>=242.5 790 19.52350 0.1520127 *
## 5) average_montly_hours< 242.5 369 27.12276 0.4561247 *
## 3) number_project< 5.5 10841 501.49460 0.6518928
## 6) number_project< 2.5 1902 46.98782 0.4796951
## 12) average_montly_hours< 165.5 1467 15.54887 0.4350511 *
## 13) average_montly_hours>=165.5 435 18.65467 0.6302529 *
## 7) number_project>=2.5 8939 386.10830 0.6885323
## 14) average_montly_hours>=275.5 135 10.41079 0.4277037 *
## 15) average_montly_hours< 275.5 8804 366.37250 0.6925318
## 30) last_evaluation< 0.475 314 15.17669 0.5312739 *
## 31) last_evaluation>=0.475 8490 342.72850 0.6984959 *
plot(dt_model,uniform = TRUE)
text(dt_model,use.n = TRUE)
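In the printed tree, the yval of each terminal node is the mean satisfaction_level of the training rows that reach that node, and that mean is what predict() returns for any row landing there. The sketch below checks this on the built-in mtcars data; fit_demo, leaf and leaf_means are hypothetical names used only for the illustration.

```r
library(rpart)

# Small regression tree on mtcars (illustration only)
fit_demo <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")

leaf <- fit_demo$where                          # leaf (frame row) reached by each training row
leaf_means <- tapply(mtcars$mpg, leaf, mean)    # mean response within each leaf
pred <- predict(fit_demo, mtcars)               # fitted values on the training rows

# Each prediction should equal the mean of its leaf
all.equal(as.numeric(pred), as.numeric(leaf_means[as.character(leaf)]))
```

This is also why a regression tree can only ever predict as many distinct values as it has terminal nodes.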
#Predict satisfaction_level for the test set
Y_pred = predict(dt_model, X_test)
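#Evaluate with MAE, RMSE and R-squared; the predictions are added as a column so yardstick can find them
eval = metric_set(mae, rmse, rsq)
dt1_eval = eval(data = mutate(test, pred = Y_pred), truth = satisfaction_level, estimate = pred)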
dt1_eval
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 0.149
## 2 rmse standard 0.195
## 3 rsq standard 0.385
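For reference, the three metrics can be written as below, where y_i is the observed satisfaction_level, \hat{y}_i the predicted value, and n the number of test rows. As far as I understand the yardstick documentation, rsq() is the squared correlation between truth and estimate, while the traditional 1 - SSE/SST form is provided separately as rsq_trad().

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}, \qquad
R^2 = \operatorname{cor}\!\left(y, \hat{y}\right)^2
```

Lower MAE and RMSE mean smaller errors on the 0 to 1 satisfaction scale, while a higher R-squared means the predictions track the observed values more closely.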
#Build the second regression tree, swapping in the remaining variables as predictors.
dt_model2 = rpart(satisfaction_level ~ Work_accident + promotion_last_5years + left + salary + sales, method = "anova", data = train)
print(dt_model2)
## n= 12000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 12000 741.5029 0.6129642
## 2) left>=0.5 2865 200.1217 0.4418569 *
## 3) left< 0.5 9135 431.1932 0.6666284 *
Y_pred2 = predict(dt_model2, X_test)
dt2_eval = eval(data = mutate(test, pred = Y_pred2), truth = satisfaction_level, estimate = pred)
dt2_eval
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 0.182
## 2 rmse standard 0.228
## 3 rsq standard 0.160
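Conclusion
The first tree outperforms the second on every metric: its RMSE is lower (0.195 vs. 0.228), its MAE is lower (0.149 vs. 0.182), and its R-squared is higher (0.385 vs. 0.160). The second tree makes a single split on left, so it can only predict two values.
I will therefore build the random forest with the predictors of the first model.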
#Create a random forest model; randomForest() fits a regression forest because the response is numeric.
rf_model = randomForest(satisfaction_level ~ last_evaluation + number_project + average_montly_hours + time_spend_company, ntree = 1200, data = train)
Y_pred3 = predict(rf_model,X_test)
dt3_eval = eval(data = mutate(test, pred = Y_pred3), truth = satisfaction_level, estimate = pred)
dt3_eval
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 0.138
## 2 rmse standard 0.180
## 3 rsq standard 0.481
The random forest has a lower RMSE (0.180) than either decision tree (0.195 and 0.228).
Check the feature importance.
importance(rf_model)
## IncNodePurity
## last_evaluation 64.06552
## number_project 163.59319
## average_montly_hours 109.65167
## time_spend_company 49.60247
The RMSE of the random forest is 0.1800717, while the RMSE of the first decision tree is 0.1955787, so the random forest performs better than the decision tree. Its R-squared is also higher (0.481 vs. 0.385). Looking at the feature importance (IncNodePurity), number_project and average_montly_hours are the most important predictors of satisfaction_level.
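The importance table for rf_model lists only IncNodePurity. randomForest() can also report a permutation-based measure (%IncMSE) when the forest is fit with importance = TRUE. The sketch below shows this on the built-in mtcars data so that it runs on its own; rf_demo, imp, the seed and the choice of predictors are illustrative assumptions.

```r
library(randomForest)

set.seed(1)                                     # arbitrary seed for a repeatable fit
# Small regression forest on mtcars (illustration only)
rf_demo <- randomForest(mpg ~ wt + hp + disp + qsec,
                        data = mtcars, ntree = 200, importance = TRUE)

imp <- importance(rf_demo)                      # matrix with one row per predictor
colnames(imp)                                   # "%IncMSE" and "IncNodePurity"
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]  # rank by permutation importance
```

Comparing the two columns is a reasonable sanity check, since node-purity importance can favour predictors with many possible split points.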