Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from the previous homework, drawing on the articles https://www.hindawi.com/journals/complexity/2021/5550344/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8137961/
Search for academic content (at least 3 articles) that compares the use of decision trees vs. SVMs in your current area of expertise. Which algorithm is recommended for more accurate results? Is it better suited to classification or regression scenarios? Do you agree with the recommendations? Why?
In homework 2, I used the HR Analytics dataset from Kaggle, which contains information about employees at a company: https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction.
In homework 2, I created two regression decision trees, using the satisfaction_level variable as the output, and one random forest.
#Import required libraries
library(ggplot2)
library(dplyr)
library(modelr)
library(caTools)
library(rpart)
library(yardstick)
library(randomForest)
library(e1071)
Data <- read.csv("HR.csv")
head(Data)
## satisfaction_level last_evaluation number_project average_montly_hours
## 1 0.38 0.53 2 157
## 2 0.80 0.86 5 262
## 3 0.11 0.88 7 272
## 4 0.72 0.87 5 223
## 5 0.37 0.52 2 159
## 6 0.41 0.50 2 153
## time_spend_company Work_accident left promotion_last_5years sales salary
## 1 3 0 1 0 sales low
## 2 6 0 1 0 sales medium
## 3 4 0 1 0 sales medium
## 4 5 0 1 0 sales low
## 5 3 0 1 0 sales low
## 6 3 0 1 0 sales low
cols = colnames(Data)
print(cols)
## [1] "satisfaction_level" "last_evaluation" "number_project"
## [4] "average_montly_hours" "time_spend_company" "Work_accident"
## [7] "left" "promotion_last_5years" "sales"
## [10] "salary"
summary(Data)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
## time_spend_company Work_accident left promotion_last_5years
## Min. : 2.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 3.000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean : 3.498 Mean :0.1446 Mean :0.2381 Mean :0.02127
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :10.000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## sales salary
## Length:14999 Length:14999
## Class :character Class :character
## Mode :character Mode :character
##
##
##
#Check if there are any NA values
sum(is.na(Data))
## [1] 0
The data does not have any missing values.
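For a per-column view of missingness, a quick supplementary check (my own addition, not part of the original assignment) is:
#Count missing values in each column
colSums(is.na(Data))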
Data %>%
count(salary)
## salary n
## 1 high 1237
## 2 low 7316
## 3 medium 6446
Employees with a high salary form the smallest group; far more employees earn medium or low salaries.
Data %>%
count(sales,salary)%>%
group_by(sales)
## # A tibble: 30 x 3
## # Groups: sales [10]
## sales salary n
## <chr> <chr> <int>
## 1 accounting high 74
## 2 accounting low 358
## 3 accounting medium 335
## 4 hr high 45
## 5 hr low 335
## 6 hr medium 359
## 7 IT high 83
## 8 IT low 609
## 9 IT medium 535
## 10 management high 225
## # ... with 20 more rows
#Bar chart of attrition (left) broken down by salary level
ggplot(Data, aes(left, fill = factor(salary))) +
  geom_bar()
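A proportion view can make differences in salary mix between leavers and stayers easier to read; this variant is my own illustrative addition:
#Proportions of salary levels within each value of left (illustrative)
ggplot(Data, aes(factor(left), fill = factor(salary))) +
  geom_bar(position = "fill")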
#Split the data 80/20 into training and test sets
sample = sample.split(Data$satisfaction_level,SplitRatio = .8)
train= subset(Data,sample==TRUE)
test = subset(Data,sample==FALSE)
X_test = subset(test, select= -c(satisfaction_level))
Y_test = subset(test, select= c(satisfaction_level))
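Note that sample.split draws at random, so the exact split, and every number downstream of it, varies between runs. For reproducibility one would set a seed before sampling; the seed value below is an arbitrary choice of mine:
#For a reproducible split, set a seed before sampling (seed is arbitrary)
set.seed(42)
sample = sample.split(Data$satisfaction_level, SplitRatio = .8)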
We will build the SVM model and compare it to the decision tree model from the last assignment. I rebuild the decision tree in this section so its results are at hand for comparison.
#Build a Decision Tree regression tree using the training dataset.
dt_model = rpart (satisfaction_level ~ last_evaluation+number_project+average_montly_hours +time_spend_company, method="anova",data=train)
print(dt_model)
## n= 12000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 12000 741.50290 0.6129642
## 2) number_project>=5.5 1139 67.89310 0.2464442
## 4) average_montly_hours>=243.5 773 16.96882 0.1466494 *
## 5) average_montly_hours< 243.5 366 26.96696 0.4572131 *
## 3) number_project< 5.5 10861 504.55390 0.6514013
## 6) number_project< 2.5 1895 46.94878 0.4775778
## 12) average_montly_hours< 165.5 1455 15.09145 0.4336838 *
## 13) average_montly_hours>=165.5 440 19.78393 0.6227273 *
## 7) number_project>=2.5 8966 388.24700 0.6881396
## 14) average_montly_hours>=275.5 144 10.76406 0.4409722 *
## 15) average_montly_hours< 275.5 8822 368.54210 0.6921741
## 30) last_evaluation< 0.475 311 14.14855 0.5296141 *
## 31) last_evaluation>=0.475 8511 345.87480 0.6981142 *
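To see the splits graphically rather than as a text dump, the rpart.plot package (assumed installed; it is not loaded above) can draw the fitted tree:
#Visualize the regression tree (requires the rpart.plot package)
library(rpart.plot)
rpart.plot(dt_model)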
Y_pred = predict(dt_model, X_test)
Root mean square error (RMSE), R-square (RSQ), and Mean Absolute Error (MAE) results from the decision tree:
eval = metric_set(mae,rmse,rsq)
dt1_eval = eval(data= test, truth=satisfaction_level,estimate =Y_pred )
dt1_eval
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mae standard 0.147
## 2 rmse standard 0.195
## 3 rsq standard 0.388
The decision tree model gave an MAE of 0.147, an RMSE of 0.195, and an R-squared of 0.388.
svm_model <- svm(satisfaction_level ~ last_evaluation+ number_project+ average_montly_hours+ time_spend_company,data=train, kernel="linear",scale=FALSE)
summary(svm_model)
##
## Call:
## svm(formula = satisfaction_level ~ last_evaluation + number_project +
## average_montly_hours + time_spend_company, data = train, kernel = "linear",
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 1
## gamma: 0.25
## epsilon: 0.1
##
##
## Number of Support Vectors: 11538
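A linear kernel with scale = FALSE is the simplest configuration, but the predictors sit on very different scales (monthly hours in the hundreds, project counts under ten), so a scaled radial-kernel model is a natural variant to try. This sketch is my own assumption and was not run for this homework:
#Alternative: radial kernel with feature scaling (illustrative, not evaluated here)
svm_model_rbf <- svm(satisfaction_level ~ last_evaluation + number_project +
                       average_montly_hours + time_spend_company,
                     data = train, kernel = "radial", scale = TRUE)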
#Predictions
Y_pred2 <- predict(svm_model, X_test, na.action = na.exclude)
# Root Mean Square Error
sqrt(mean((test$satisfaction_level - Y_pred2)^2, na.rm = TRUE))
## [1] 0.1270532
# R-square
(cor(test$satisfaction_level, Y_pred2, use = "complete.obs"))^2
## [1] 0.0002790528
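For a like-for-like comparison with the decision tree, the same yardstick metric set defined earlier can score the SVM predictions as well; this is a supplementary check of mine, not part of the original output:
#Score the SVM with the same metric set used for the decision tree
svm_eval = eval(data = test, truth = satisfaction_level, estimate = Y_pred2)
svm_eval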
As we can see, the SVM gives a lower RMSE of about 0.127 than the decision tree's 0.195, but its R-squared of roughly 0.0003 is far below the tree's 0.388, meaning the linear SVM explains almost none of the variance in satisfaction_level.
Given the lower RMSE, I did not tune the SVM further for this assignment.
Because the SVM model beat the decision tree model from the previous homework on RMSE, for this homework I lean towards SVMs over decision trees for regression.
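If tuning were wanted later, e1071's tune() grid-searches hyper-parameters by cross-validation. Below is a minimal sketch; the radial kernel and the cost/epsilon grids are my own illustrative assumptions, not values from the assignment:
#Illustrative grid search over cost and epsilon (10-fold CV by default)
tuned <- tune(svm, satisfaction_level ~ last_evaluation + number_project +
                average_montly_hours + time_spend_company,
              data = train, kernel = "radial",
              ranges = list(cost = c(0.1, 1, 10), epsilon = c(0.01, 0.1, 0.5)))
summary(tuned)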
One article, “Decision Tree vs SVM,” states that SVM uses a “kernel trick to solve non-linear problems whereas decision trees derive hyper-rectangles in input space to solve the problem,” and that “decision trees are better for categorical data” and deal with collinearity better than SVM.