Objective

Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from the previous homework. Base the analysis on these articles: https://www.hindawi.com/journals/complexity/2021/5550344/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/

Search for academic content (at least three articles) that compares the use of decision trees vs. SVMs in your current area of expertise. Which algorithm is recommended for more accurate results? Is each better suited to classification or regression scenarios? Do you agree with the recommendations? Why?

Data

For homework 2, I used the HR Analytics dataset from Kaggle, which contains information about employees at a company: https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction.

In homework 2, I created two regression decision trees using the satisfaction_level variable as the output, as well as one random forest.

#Import required libraries
library(ggplot2)
library(dplyr)
library(modelr)
library(caTools)
library(rpart)
library(yardstick)
library(randomForest)
library(e1071)
Data <- read.csv("HR.csv")
head(Data)
##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   time_spend_company Work_accident left promotion_last_5years sales salary
## 1                  3             0    1                     0 sales    low
## 2                  6             0    1                     0 sales medium
## 3                  4             0    1                     0 sales medium
## 4                  5             0    1                     0 sales    low
## 5                  3             0    1                     0 sales    low
## 6                  3             0    1                     0 sales    low

Exploratory Data Analysis

cols = colnames(Data)
print(cols)
##  [1] "satisfaction_level"    "last_evaluation"       "number_project"       
##  [4] "average_montly_hours"  "time_spend_company"    "Work_accident"        
##  [7] "left"                  "promotion_last_5years" "sales"                
## [10] "salary"
summary(Data)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##  time_spend_company Work_accident         left        promotion_last_5years
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000   Min.   :0.00000      
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000      
##  Median : 3.000     Median :0.0000   Median :0.0000   Median :0.00000      
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381   Mean   :0.02127      
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000      
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000   Max.   :1.00000      
##     sales              salary         
##  Length:14999       Length:14999      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
#Check if there are any NA values

sum(is.na(Data))
## [1] 0

The data does not have any missing values.
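
As a quick per-column sanity check, base R's colSums can confirm that no single column hides missing values (a minimal sketch; output not shown):

#Count missing values per column (all zeros expected)
colSums(is.na(Data))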

Data %>%
  count(salary)
##   salary    n
## 1   high 1237
## 2    low 7316
## 3 medium 6446

Employees with a high salary form the smallest group, well behind the medium and low salary bands.
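
To put those counts in proportion, a quick sketch reusing dplyr (already loaded above; output not shown):

#Share of employees in each salary band
Data %>%
  count(salary) %>%
  mutate(share = n / sum(n))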

Data %>%
  count(sales,salary)%>%
  group_by(sales)
## # A tibble: 30 x 3
## # Groups:   sales [10]
##    sales      salary     n
##    <chr>      <chr>  <int>
##  1 accounting high      74
##  2 accounting low      358
##  3 accounting medium   335
##  4 hr         high      45
##  5 hr         low      335
##  6 hr         medium   359
##  7 IT         high      83
##  8 IT         low      609
##  9 IT         medium   535
## 10 management high     225
## # ... with 20 more rows
#Bar chart of employees who left vs. stayed, split by salary band
ggplot(Data, aes(factor(left), fill = salary)) +
  geom_bar()

Data Splitting

#Split the data 80/20 into training and test sets
set.seed(42)  #seed added for reproducibility (not set in the original run)
sample = sample.split(Data$satisfaction_level, SplitRatio = 0.8)
train = subset(Data, sample == TRUE)
test = subset(Data, sample == FALSE)
X_test = subset(test, select = -c(satisfaction_level))
Y_test = subset(test, select = c(satisfaction_level))

Decision Tree Model:

We will build the SVM model and compare it to the decision tree model from the last assignment. I rebuild the decision tree in this section so its results are easy to reference for the comparison.

#Build a regression decision tree using the training dataset.
dt_model = rpart(satisfaction_level ~ last_evaluation + number_project +
                   average_montly_hours + time_spend_company,
                 method = "anova", data = train)
print(dt_model) 
## n= 12000 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 12000 741.50290 0.6129642  
##    2) number_project>=5.5 1139  67.89310 0.2464442  
##      4) average_montly_hours>=243.5 773  16.96882 0.1466494 *
##      5) average_montly_hours< 243.5 366  26.96696 0.4572131 *
##    3) number_project< 5.5 10861 504.55390 0.6514013  
##      6) number_project< 2.5 1895  46.94878 0.4775778  
##       12) average_montly_hours< 165.5 1455  15.09145 0.4336838 *
##       13) average_montly_hours>=165.5 440  19.78393 0.6227273 *
##      7) number_project>=2.5 8966 388.24700 0.6881396  
##       14) average_montly_hours>=275.5 144  10.76406 0.4409722 *
##       15) average_montly_hours< 275.5 8822 368.54210 0.6921741  
##         30) last_evaluation< 0.475 311  14.14855 0.5296141 *
##         31) last_evaluation>=0.475 8511 345.87480 0.6981142 *
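
These splits are easier to read as a diagram. A minimal sketch, assuming the rpart.plot package (an extra dependency, not loaded above) is installed:

#Visualize the fitted regression tree
library(rpart.plot)
rpart.plot(dt_model)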

Prediction

#Predict satisfaction_level on the test set (the method was fixed at fit time)
Y_pred = predict(dt_model, X_test)
#Y_pred

Evaluation

Mean absolute error (MAE), root mean squared error (RMSE), and R-squared (RSQ) results for the decision tree:

eval = metric_set(mae,rmse,rsq)
dt1_eval = eval(data= test, truth=satisfaction_level,estimate =Y_pred )
dt1_eval
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mae     standard       0.147
## 2 rmse    standard       0.195
## 3 rsq     standard       0.388

The decision tree model gave an RMSE of about 0.195 and an RSQ of about 0.388.

Support Vector Machine Model:

svm_model <- svm(satisfaction_level ~ last_evaluation + number_project +
                   average_montly_hours + time_spend_company,
                 data = train, kernel = "linear", scale = FALSE)

summary(svm_model)
## 
## Call:
## svm(formula = satisfaction_level ~ last_evaluation + number_project + 
##     average_montly_hours + time_spend_company, data = train, kernel = "linear", 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.25 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  11538

Prediction and Evaluation

#Predictions
Y_pred2 <- predict(svm_model, X_test, na.action = na.exclude)
# Root Mean Square Error: average the squared residuals, then take the root
sqrt(mean((test$satisfaction_level - Y_pred2)^2, na.rm = TRUE))
## [1] 0.1270532
# R-square
(cor(test$satisfaction_level, Y_pred2, use = "complete.obs"))^2
## [1] 0.0002790528
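
For an apples-to-apples comparison, the same yardstick metric set used for the decision tree can be reused on the SVM predictions (a sketch; output not shown):

#Reuse the metric set (mae, rmse, rsq) defined in the Evaluation section
svm_eval = eval(data = test, truth = satisfaction_level, estimate = Y_pred2)
svm_eval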

As shown above, the SVM gives a lower RMSE of about 0.127 than the decision tree's 0.195, but its R-squared of roughly 0.0003 is far below the tree's 0.388.

Given the lower RMSE, I did not tune the SVM model further, although the very low R-squared suggests there is still room for improvement.

Summary

The SVM model achieved a better (lower) RMSE, about 0.127, than the decision tree model from the previous homework, about 0.195. For this homework, I lean toward SVMs over decision trees for regression.

One comparison article, "Decision Tree vs SVM," stated that SVM uses a “kernel trick to solve non-linear problems whereas decision trees derive hyper-rectangles in input space to solve the problem,” and that “decision trees are better for categorical data and it deals with collinearity” better than SVM.
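
To illustrate the kernel trick the article mentions, switching the same model to a non-linear radial kernel is a one-line change. A minimal, untuned sketch (results not run here; scale = TRUE is the e1071 default):

#Radial-kernel variant of the same regression model (untuned)
svm_rbf <- svm(satisfaction_level ~ last_evaluation + number_project +
                 average_montly_hours + time_spend_company,
               data = train, kernel = "radial", scale = TRUE)
#RMSE of the radial-kernel model on the test set
sqrt(mean((test$satisfaction_level - predict(svm_rbf, X_test))^2, na.rm = TRUE))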