To demonstrate how decision trees work, I will select a dataset and build a decision tree that addresses a classification or regression problem. Using different sets of predictor variables, I will build two distinct decision trees and compare their results. I will then fit a random forest regression model and analyse its output.
For this task I have chosen the HR Analytics and Job Prediction dataset from Kaggle, which contains information about employees of a company. The dataset is available at https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction.
To address this problem, I will build two regression decision trees with satisfaction_level as the target variable, and then apply the random forest algorithm to the same data to derive further insights.
#Import required libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(modelr)
library(caTools)
library(rpart)
library(yardstick)
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:modelr':
##
## mae, mape, rmse
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Load the dataset and preview the first rows
Data <- read.csv("HR_comma_sep.csv")
head(Data)
## satisfaction_level last_evaluation number_project average_montly_hours
## 1 0.38 0.53 2 157
## 2 0.80 0.86 5 262
## 3 0.11 0.88 7 272
## 4 0.72 0.87 5 223
## 5 0.37 0.52 2 159
## 6 0.41 0.50 2 153
## time_spend_company Work_accident left promotion_last_5years Department salary
## 1 3 0 1 0 sales low
## 2 6 0 1 0 sales medium
## 3 4 0 1 0 sales medium
## 4 5 0 1 0 sales low
## 5 3 0 1 0 sales low
## 6 3 0 1 0 sales low
cols = colnames(Data)
print(cols)
## [1] "satisfaction_level" "last_evaluation" "number_project"
## [4] "average_montly_hours" "time_spend_company" "Work_accident"
## [7] "left" "promotion_last_5years" "Department"
## [10] "salary"
summary(Data)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
## time_spend_company Work_accident left promotion_last_5years
## Min. : 2.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 3.000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean : 3.498 Mean :0.1446 Mean :0.2381 Mean :0.02127
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :10.000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## Department salary
## Length:14999 Length:14999
## Class :character Class :character
## Mode :character Mode :character
##
##
##
#Check if there are any NA values
sum(is.na(Data))
## [1] 0
The data does not have any missing values.
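For a per-column view (a quick sketch using base R only), the same check can be broken down by variable:
# Count missing values per column; every column should report 0
colSums(is.na(Data))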
Data %>%
count(salary)
## salary n
## 1 high 1237
## 2 low 7316
## 3 medium 6446
Across the three salary categories (high, medium and low), employees on a high salary form by far the smallest group.
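To make the imbalance explicit, the counts can also be expressed as shares of the workforce (a small sketch reusing the same dplyr pipeline):
# Proportion of employees in each salary band
Data %>%
  count(salary) %>%
  mutate(proportion = n / sum(n))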
Data %>%
count(Department,salary)%>%
group_by(Department)
## # A tibble: 30 × 3
## # Groups: Department [10]
## Department salary n
## <chr> <chr> <int>
## 1 IT high 83
## 2 IT low 609
## 3 IT medium 535
## 4 RandD high 51
## 5 RandD low 364
## 6 RandD medium 372
## 7 accounting high 74
## 8 accounting low 358
## 9 accounting medium 335
## 10 hr high 45
## # … with 20 more rows
ggplot(Data, aes(x = factor(left), fill = factor(salary))) +
  geom_bar()
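Since the bar chart only shows raw counts, a complementary summary (a sketch, not part of the original pipeline) is the attrition rate within each salary band, computed directly from the 0/1 left indicator:
# Share of employees who left, per salary band
Data %>%
  group_by(salary) %>%
  summarise(n = n(), attrition_rate = mean(left))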
# Split the data 80/20 into training and test sets
# (a seed could be set beforehand with set.seed() for reproducibility)
split_flag <- sample.split(Data$satisfaction_level, SplitRatio = 0.8)
train <- subset(Data, split_flag == TRUE)
test <- subset(Data, split_flag == FALSE)
X_test <- subset(test, select = -c(satisfaction_level))
Y_test <- subset(test, select = c(satisfaction_level))
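As a quick sanity check (a sketch, not required for the models below), the resulting proportions can be compared against the intended 80/20 split:
# Verify the train/test proportions produced by sample.split
nrow(train) / nrow(Data)  # expected to be close to 0.80
nrow(test) / nrow(Data)   # expected to be close to 0.20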
# Build a regression decision tree on the training dataset
dt_model <- rpart(satisfaction_level ~ last_evaluation + number_project +
                    average_montly_hours + time_spend_company,
                  method = "anova", data = train)
print(dt_model)
## n= 12000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 12000 741.502900 0.6129642
## 2) number_project>=5.5 1144 69.236830 0.2480682
## 4) average_montly_hours>=242.5 796 20.732250 0.1557538
## 8) last_evaluation>=0.765 722 7.696794 0.1244044 *
## 9) last_evaluation< 0.765 74 5.402805 0.4616216 *
## 5) average_montly_hours< 242.5 348 26.204890 0.4592241 *
## 3) number_project< 5.5 10856 503.891800 0.6514167
## 6) number_project< 2.5 1918 48.457990 0.4783681
## 12) average_montly_hours< 165.5 1474 15.566560 0.4339281 *
## 13) average_montly_hours>=165.5 444 20.316340 0.6259009 *
## 7) number_project>=2.5 8938 385.672500 0.6885511
## 14) average_montly_hours>=276.5 129 9.313206 0.4375194 *
## 15) average_montly_hours< 276.5 8809 368.111100 0.6922273
## 30) last_evaluation< 0.475 297 13.260270 0.5296970 *
## 31) last_evaluation>=0.475 8512 346.731500 0.6978983 *
plot(dt_model,uniform = TRUE)
text(dt_model,use.n = TRUE)
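The base plot()/text() rendering can be hard to read; if the rpart.plot package is installed (an assumption, it is not loaded above), a cleaner tree diagram can be drawn:
# Optional: prettier tree visualisation via the rpart.plot package
# install.packages("rpart.plot")
rpart.plot::rpart.plot(dt_model, type = 2, extra = 101)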
# Predict satisfaction_level on the held-out test set
Y_pred <- predict(dt_model, X_test)
# Define the evaluation metrics: MAE, RMSE and R-squared
eval_metrics <- metric_set(mae, rmse, rsq)
# Create a data frame with the truth and estimated values
evaluation_data <- data.frame(truth = test$satisfaction_level, estimate = Y_pred)
# Evaluate the model
dt1_eval <- eval_metrics(data = evaluation_data, truth = truth, estimate = estimate)
# Print the evaluation results
dt1_eval
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 0.145
## 2 rmse standard 0.192
## 3 rsq standard 0.407
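To confirm what yardstick reports, the RMSE can also be computed by hand from the same data frame (a minimal cross-check using base R only):
# Manual RMSE: square root of the mean squared residual
sqrt(mean((evaluation_data$truth - evaluation_data$estimate)^2))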
# Build a second regression tree from the remaining, mostly categorical, variables
dt_model2 <- rpart(satisfaction_level ~ Work_accident + promotion_last_5years +
                     left + salary + Department,
                   method = "anova", data = train)
print(dt_model2)
## n= 12000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 12000 741.5029 0.6129642
## 2) left>=0.5 2849 198.1144 0.4396665 *
## 3) left< 0.5 9151 431.1891 0.6669173 *
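The second tree stops after a single split on left because the default complexity parameter prunes away weaker candidate splits. As a hypothetical variation (dt_model2_deep is not used for the results below), lowering cp would let rpart explore deeper splits on these variables:
# Relax the complexity parameter to allow a deeper tree
dt_model2_deep <- rpart(satisfaction_level ~ Work_accident + promotion_last_5years +
                          left + salary + Department,
                        method = "anova", data = train,
                        control = rpart.control(cp = 0.001))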
Y_pred2 <- predict(dt_model2, X_test)
# Create a data frame with the truth and estimated values for dt2
evaluation_data2 <- data.frame(truth = test$satisfaction_level, estimate = Y_pred2)
# Evaluate the model
dt2_eval <- eval_metrics(data = evaluation_data2, truth = truth, estimate = estimate)
# Print the evaluation results for dt2
dt2_eval
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 0.183
## 2 rmse standard 0.230
## 3 rsq standard 0.149
The first model performs better than the second: it has a lower RMSE (0.192 vs 0.230) and a considerably higher R-squared (0.407 vs 0.149).
Because of this, I will reuse the predictors from the first model to train a random forest and assess its performance.
# Create a random forest regression model
# (regression is inferred automatically from the numeric response;
#  adding importance = TRUE would also compute permutation importance)
rf_model <- randomForest(satisfaction_level ~ last_evaluation + number_project +
                           average_montly_hours + time_spend_company,
                         ntree = 1200, data = train)
# Make predictions using the random forest model
Y_pred3 <- predict(rf_model, X_test)
# Create a data frame with the truth and estimated values for dt3
evaluation_data3 <- data.frame(truth = test$satisfaction_level, estimate = Y_pred3)
# Evaluate the model
dt3_eval <- eval_metrics(data = evaluation_data3, truth = truth, estimate = estimate)
# Print the evaluation results for dt3
dt3_eval
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 0.137
## 2 rmse standard 0.177
## 3 rsq standard 0.500
The random forest achieves a lower RMSE than either of the two decision tree models.
Check the variable importance of the fitted forest.
importance(rf_model)
## IncNodePurity
## last_evaluation 62.99741
## number_project 159.85841
## average_montly_hours 109.03291
## time_spend_company 51.17485
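For a visual version of the same ranking (assuming the fitted rf_model above), randomForest provides a plotting helper:
# Plot variable importance (IncNodePurity) for the fitted forest
varImpPlot(rf_model)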
The random forest model achieved an RMSE (root mean square error) of 0.177, whereas the first decision tree model had an RMSE of 0.192, so the random forest outperforms the decision tree in predictive accuracy. Moreover, the importance scores show that number_project and average_montly_hours are the most influential predictors of satisfaction_level.
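To summarise the comparison in a single table, the three metric tibbles can be stacked (a sketch using dplyr::bind_rows; column names follow the yardstick output above):
# Side-by-side comparison of the two decision trees and the random forest
bind_rows(list(tree_1 = dt1_eval, tree_2 = dt2_eval, random_forest = dt3_eval),
          .id = "model") %>%
  arrange(.metric, .estimate)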