Objective

To demonstrate how decision trees work, I will select a dataset and build a decision tree that addresses a classification or regression problem. Using different sets of variables, I will fit two distinct decision trees and compare their results. I will then build a random forest model for regression and analyze its output.

Data

For this task, I have selected the HR Analytics and Job Prediction dataset from Kaggle, which contains information about the employees of a company. The dataset can be found at https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction.

To address this problem, I will fit two regression decision trees with satisfaction_level as the target variable, and then apply the random forest algorithm to the same task.

#Import required libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(modelr)
library(caTools)
library(rpart)
library(yardstick)
## 
## Attaching package: 'yardstick'
## The following objects are masked from 'package:modelr':
## 
##     mae, mape, rmse
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
Data <- read.csv("HR_comma_sep.csv")
head(Data)
##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   time_spend_company Work_accident left promotion_last_5years Department salary
## 1                  3             0    1                     0      sales    low
## 2                  6             0    1                     0      sales medium
## 3                  4             0    1                     0      sales medium
## 4                  5             0    1                     0      sales    low
## 5                  3             0    1                     0      sales    low
## 6                  3             0    1                     0      sales    low
cols = colnames(Data)
print(cols)
##  [1] "satisfaction_level"    "last_evaluation"       "number_project"       
##  [4] "average_montly_hours"  "time_spend_company"    "Work_accident"        
##  [7] "left"                  "promotion_last_5years" "Department"           
## [10] "salary"
summary(Data)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##  time_spend_company Work_accident         left        promotion_last_5years
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000   Min.   :0.00000      
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000      
##  Median : 3.000     Median :0.0000   Median :0.0000   Median :0.00000      
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381   Mean   :0.02127      
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000      
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000   Max.   :1.00000      
##   Department           salary         
##  Length:14999       Length:14999      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Exploratory Data Analysis

#Check if there are any NA values

sum(is.na(Data))
## [1] 0

The data does not have any missing values.
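
For completeness, a per-column tally (a minimal base-R sketch) would show where any gaps were concentrated, had there been any:

#Count missing values in each column; every count is zero here
colSums(is.na(Data))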

Data %>%
  count(salary)
##   salary    n
## 1   high 1237
## 2    low 7316
## 3 medium 6446

Of the three salary bands (high, medium, and low), employees on a high salary form by far the smallest group: 1,237 of them, versus 6,446 on a medium salary and 7,316 on a low one.
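
The same counts can be expressed as shares of the workforce (a small sketch in the same dplyr style):

#Share of employees in each salary band
Data %>%
  count(salary) %>%
  mutate(share = n / sum(n))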

Data %>%
  count(Department, salary) %>%
  group_by(Department)
## # A tibble: 30 × 3
## # Groups:   Department [10]
##    Department salary     n
##    <chr>      <chr>  <int>
##  1 IT         high      83
##  2 IT         low      609
##  3 IT         medium   535
##  4 RandD      high      51
##  5 RandD      low      364
##  6 RandD      medium   372
##  7 accounting high      74
##  8 accounting low      358
##  9 accounting medium   335
## 10 hr         high      45
## # … with 20 more rows
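
This long grouped tibble is awkward to scan. If the tidyr package is available (it is not loaded above, so this is an optional aside), the same counts can be spread into one row per department:

#Optional: one row per department, one column per salary band
Data %>%
  count(Department, salary) %>%
  tidyr::pivot_wider(names_from = salary, values_from = n)
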
ggplot(Data, aes(factor(left), fill = salary)) +
  geom_bar() +
  labs(x = "left")
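
Because the salary groups differ so much in size, a proportional version of the same plot (each bar scaled to 1) can be easier to compare across bars:

#Same plot with bars normalised to proportions
ggplot(Data, aes(factor(left), fill = salary)) +
  geom_bar(position = "fill") +
  labs(x = "left", y = "proportion")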

Data Splitting

#Split the data 80/20 (calling set.seed() first would make the split reproducible)
sample = sample.split(Data$satisfaction_level, SplitRatio = 0.8)
train = subset(Data, sample == TRUE)
test = subset(Data, sample == FALSE)
X_test = subset(test, select = -c(satisfaction_level))
Y_test = subset(test, select = c(satisfaction_level))
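
A quick check (a one-line sketch) confirms the split: roughly 12,000 of the 14,999 rows end up in training and about 3,000 in test.

#Verify the 80/20 split
nrow(train)
nrow(test)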

Decision Tree: Model 1

#Build a regression decision tree on the training set
dt_model = rpart(satisfaction_level ~ last_evaluation + number_project +
                   average_montly_hours + time_spend_company,
                 method = "anova", data = train)
print(dt_model)
## n= 12000 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 12000 741.502900 0.6129642  
##    2) number_project>=5.5 1144  69.236830 0.2480682  
##      4) average_montly_hours>=242.5 796  20.732250 0.1557538  
##        8) last_evaluation>=0.765 722   7.696794 0.1244044 *
##        9) last_evaluation< 0.765 74   5.402805 0.4616216 *
##      5) average_montly_hours< 242.5 348  26.204890 0.4592241 *
##    3) number_project< 5.5 10856 503.891800 0.6514167  
##      6) number_project< 2.5 1918  48.457990 0.4783681  
##       12) average_montly_hours< 165.5 1474  15.566560 0.4339281 *
##       13) average_montly_hours>=165.5 444  20.316340 0.6259009 *
##      7) number_project>=2.5 8938 385.672500 0.6885511  
##       14) average_montly_hours>=276.5 129   9.313206 0.4375194 *
##       15) average_montly_hours< 276.5 8809 368.111100 0.6922273  
##         30) last_evaluation< 0.475 297  13.260270 0.5296970 *
##         31) last_evaluation>=0.475 8512 346.731500 0.6978983 *

Visualization

plot(dt_model, uniform = TRUE)
text(dt_model, use.n = TRUE)
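
The base graphics rendering of rpart trees is rather spartan; if the rpart.plot package is available (an optional extra, not used elsewhere here), it draws a much more readable tree:

#Optional: nicer tree rendering
#install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(dt_model)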

Prediction

Y_pred = predict(dt_model, X_test)

Evaluation

eval_metrics = metric_set(mae, rmse, rsq)
# Create a data frame with the truth and estimated values
evaluation_data <- data.frame(truth = test$satisfaction_level, estimate = Y_pred)

# Evaluate the model
dt1_eval <- eval_metrics(data = evaluation_data, truth = truth, estimate = estimate)

# Print the evaluation results
dt1_eval
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard       0.145
## 2 rmse    standard       0.192
## 3 rsq     standard       0.407
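
As a sanity check on the yardstick numbers, the RMSE can also be computed by hand from the residuals (a minimal sketch):

#RMSE computed directly: square root of the mean squared residual
sqrt(mean((test$satisfaction_level - Y_pred)^2))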

Decision Tree: Model 2

dt_model2 = rpart(satisfaction_level ~ Work_accident + promotion_last_5years +
                    left + salary + Department,
                  method = "anova", data = train)
print(dt_model2) 
## n= 12000 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 12000 741.5029 0.6129642  
##   2) left>=0.5 2849 198.1144 0.4396665 *
##   3) left< 0.5 9151 431.1891 0.6669173 *
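
The fitted tree makes only a single split, on left: employees who left average a satisfaction of 0.44, against 0.67 for those who stayed, and none of the other predictors are used for splitting. Their relative influence can still be inspected via a standard rpart field (a small sketch):

#Importance scores rpart recorded while fitting
dt_model2$variable.importance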

Prediction

Y_pred2 = predict(dt_model2, X_test)

Evaluation

# Create a data frame with the truth and estimated values for dt2
evaluation_data2 <- data.frame(truth = test$satisfaction_level, estimate = Y_pred2)

# Evaluate the model
dt2_eval <- eval_metrics(data = evaluation_data2, truth = truth, estimate = estimate)

# Print the evaluation results for dt2
dt2_eval
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard       0.183
## 2 rmse    standard       0.230
## 3 rsq     standard       0.149

Conclusion

The first model performs better than the second: its RMSE is lower (0.192 vs. 0.230) and it explains far more of the variance (R² of 0.407 vs. 0.149). This is unsurprising, since the second tree splits only once, on left.
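
For a side-by-side view, the two metric tables can be stacked (a small sketch using dplyr's bind_rows):

#Stack the metric tibbles, labelling each row with its model
bind_rows(dt1 = dt1_eval, dt2 = dt2_eval, .id = "model")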

Random Forest

Since the first model performed better, I will reuse its predictor variables to train a random forest on the same training set and assess its performance.

#Create a random forest model; regression is inferred from the numeric response
rf_model = randomForest(satisfaction_level ~ last_evaluation + number_project +
                          average_montly_hours + time_spend_company,
                        ntree = 1200, data = train)
# Make predictions using the random forest model
Y_pred3 <- predict(rf_model, X_test)

# Create a data frame with the truth and estimated values for dt3
evaluation_data3 <- data.frame(truth = test$satisfaction_level, estimate = Y_pred3)

# Evaluate the model
dt3_eval <- eval_metrics(data = evaluation_data3, truth = truth, estimate = estimate)

# Print the evaluation results for dt3
dt3_eval
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard       0.137
## 2 rmse    standard       0.177
## 3 rsq     standard       0.500
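
The forest above uses the default mtry; the randomForest package provides tuneRF() for tuning it against out-of-bag error (a minimal sketch; the ntreeTry and improve values are my assumptions, and results vary from run to run without a seed):

#Search over mtry, doubling each step, keeping the OOB-error-minimising value
predictors <- train[, c("last_evaluation", "number_project",
                        "average_montly_hours", "time_spend_company")]
tuned <- tuneRF(predictors, train$satisfaction_level,
                ntreeTry = 200, stepFactor = 2, improve = 0.01)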

Compared with the two decision tree models, the random forest has a lower RMSE (0.177) and a noticeably higher R² (0.500).

Check the feature importances.

importance(rf_model)
##                      IncNodePurity
## last_evaluation           62.99741
## number_project           159.85841
## average_montly_hours     109.03291
## time_spend_company        51.17485
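
The same scores can be drawn as a dot chart with varImpPlot(), which is also part of the randomForest package:

#Dot chart of the node-purity importance scores
varImpPlot(rf_model)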

Summary

The random forest achieved an RMSE (root mean squared error) of 0.177, whereas the first decision tree had an RMSE of 0.192, so the random forest outperforms the single tree in predictive accuracy. Moreover, the feature importances show that number_project and average_montly_hours are the most influential predictors of satisfaction_level.