APPLYING MACHINE LEARNING MODELS TO PREDICT MSME PERFORMANCE

Methodology

CRISP-DM data mining workflow (image source: https://datascience-pm.com)

  1. Business Understanding
  • Understand the process behind the data comprehensively
  2. Data Understanding
  • Gain an initial understanding of the data needed to solve the given problem
  3. Data Preparation
  • Collect, merge, arrange, and organize the data
  4. Modelling
  • Build a descriptive or predictive model
  5. Evaluation
  • Interpret the results of the data mining
  6. Deployment
  • Plan how the model will be deployed or used

Business Understanding

MSMEs (UMKM) affected by the pandemic (image source: https://xmui.or.id.com)

One of the sectors hit by the COVID-19 pandemic is the economy. PT X, which regularly invests in micro, small, and medium enterprises (MSMEs; UMKM in Indonesian), suffered losses because its investments were poorly targeted. An analysis of MSME performance is therefore needed so that PT X can invest wisely, based on the actual condition of each MSME.

Data Understanding

The data contains the following features:

Variable            Description                                                         Columns
Sales               Sales revenue in rupiah for 2020 and 2021                           [Penj_20] [Penj_21]
Expense Ratio       Ratio of expenses to sales for 2020 and 2021                        [RB_Penj20] [RB_Penj21]
Main Profit Source  Source of profit (main business or other) for 2020 and 2021         [SL20] [SL21]
Profit/Loss         Profit or loss in rupiah for 2020 and 2021                          [LR20] [LR21]
Liabilities         Liabilities in rupiah for 2020 and 2021                             [Liab_20] [Liab_21]
Assets              Assets in rupiah for 2020 and 2021                                  [Aset20] [Aset21]
Cash                Cash in rupiah for 2020 and 2021                                    [Kas20] [Kas21]
Performance         MSME performance category, good (baik) or poor (buruk)              [Performa]

Loading the Data

Load the data and convert each column to the appropriate data type.

library(readr)
df <- read_delim("Data Final SSO.csv",
                 delim = ";",
                 escape_double = FALSE, trim_ws = TRUE)

names(df) <- c("Penj_21","Penj_20","RB_Penj21","RB_Penj20","SL21",
               "SL20","LR21","LR20","Liab_21","Liab_20","Aset21",
               "Aset20","Kas21","Kas20","Performa")

# Convert the tibble to a plain data frame
df <- data.frame(df)

# Convert the ratio features to numeric, replacing the decimal comma with a point first
library(stringr)
df$RB_Penj21 <- str_replace_all(string = df$RB_Penj21,pattern = ",",
                                replacement = ".")
df$RB_Penj20 <- str_replace_all(string = df$RB_Penj20,pattern = ",",
                                replacement = ".")
df$SL21 <- as.factor(df$SL21)
df$SL20 <- as.factor(df$SL20)
df$Performa <- as.factor(df$Performa)
df$RB_Penj21 <- as.numeric(df$RB_Penj21)
df$RB_Penj20 <- as.numeric(df$RB_Penj20)
str(df)
## 'data.frame':    2396 obs. of  15 variables:
##  $ Penj_21  : num  1.49e+08 8.87e+07 1.09e+08 1.33e+08 1.36e+08 ...
##  $ Penj_20  : num  1.37e+08 8.57e+07 7.84e+07 1.20e+08 8.69e+07 ...
##  $ RB_Penj21: num  0.8682 0.9786 0.0602 0.5316 0.2299 ...
##  $ RB_Penj20: num  0.8558 0.7311 0.0809 0.9895 0.162 ...
##  $ SL21     : Factor w/ 5 levels "Bidang Lain",..: 4 3 5 5 3 4 4 5 5 3 ...
##  $ SL20     : Factor w/ 5 levels "Bidang lain",..: 4 5 5 5 3 4 4 5 5 3 ...
##  $ LR21     : num  8.76e+09 5.64e+09 4.86e+09 3.61e+09 5.77e+09 ...
##  $ LR20     : num  7.32e+09 4.84e+09 3.99e+09 4.59e+09 4.94e+09 ...
##  $ Liab_21  : num  20792734 24472651 20586078 26786479 33833705 ...
##  $ Liab_20  : num  84451989 98329359 67450502 79683551 96559757 ...
##  $ Aset21   : num  1.95e+08 1.75e+08 1.83e+08 1.90e+08 1.96e+08 ...
##  $ Aset20   : num  1.84e+08 1.99e+08 1.78e+08 1.85e+08 1.95e+08 ...
##  $ Kas21    : num  2523754 30930175 194345 24602549 4836727 ...
##  $ Kas20    : num  11437529 36333943 10186417 30149281 11705490 ...
##  $ Performa : Factor w/ 3 levels "baik","Baik",..: 1 3 1 1 3 1 1 1 1 3 ...
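As an aside, base R's `read.csv2()` assumes a semicolon separator and a decimal comma, so the comma-to-point replacement could probably be avoided at read time. The sketch below is only an illustration under that assumption: it reads a small inline sample (`csv_text` and `toy` are hypothetical names, and the three rows are stand-ins, not the real file).

```r
# Inline stand-in for a semicolon-delimited file with decimal commas
# (illustrative rows, not the real "Data Final SSO.csv")
csv_text <- "Penj_21;RB_Penj21;Performa
149207164;0,8682;baik
88706100;0,9786;buruk
108887378;0,0602;Baik"

# read.csv2() defaults to sep = ";" and dec = ",",
# so RB_Penj21 is parsed as numeric directly
toy <- read.csv2(text = csv_text, stringsAsFactors = TRUE)
str(toy)
```

The inconsistent spellings "baik" and "Baik" still arrive as separate factor levels, so the recoding step that follows remains necessary either way.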

Summary of Categorical Data

View a summary of the categorical columns.

summary(df[,c(5,6,15)])
##              SL21                  SL20       Performa   
##  Bidang Lain   :   2   Bidang lain   :   2   baik : 133  
##  Bidang lainnya:   2   Bidang Lain   :  48   Baik :  33  
##  Bidang Lainnya: 360   Bidang Lainnya: 291   buruk:2230  
##  Bidang usaha  : 736   Bidang usaha  : 635               
##  Bidang Usaha  :1296   Bidang Usaha  :1420

Some categories are spelled differently but mean the same thing, so they need a common name. SL21 and SL20 are reduced to two categories, “bidang lainnya” (other activities) and “bidang usaha” (main business), and Performa is reduced to two categories, “baik” (good) and “buruk” (poor).

#levels(df$Performa)
# Change "Baik" to "baik"

#levels(df$SL21)
# Merge the "bidang lain" variants into one category and the "bidang usaha" variants into another

#levels(df$SL20)
# Merge the "bidang lain" variants into one category and the "bidang usaha" variants into another

## RECODE THE CATEGORIES
levels(df$Performa)[2] <- c("baik")
levels(df$SL21)[1:3] <- c("bidang lainnya")
levels(df$SL21)[2:3] <- c("bidang usaha")
levels(df$SL20)[1:3] <- c("bidang lainnya")
levels(df$SL20)[2:3] <- c("bidang usaha")

summary(df[,c(5,6,15)])
##              SL21                  SL20       Performa   
##  bidang lainnya: 364   bidang lainnya: 341   baik : 166  
##  bidang usaha  :2032   bidang usaha  :2055   buruk:2230
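The recoding above relies on the position of each level, which works as long as the levels sort in the expected order. A minimal alternative sketch, assuming the only differences are letter case and the short form "bidang lain", normalizes the labels by value instead. The vector `sl` is a made-up example, not the real column.

```r
# Made-up sample containing the five spellings seen in the summary
sl <- factor(c("Bidang Lain", "Bidang lainnya", "Bidang Lainnya",
               "Bidang usaha", "Bidang Usaha", "Bidang Usaha"))

# Lower-case everything, then map the short form onto the long one
labels <- tolower(as.character(sl))
labels[labels == "bidang lain"] <- "bidang lainnya"

sl_clean <- factor(labels)
table(sl_clean)
```

Because the mapping is by label rather than by index, it would keep working if a new spelling changed the order of the levels.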

Descriptive Statistics

The descriptive statistics summarize the data: how it is distributed, the minimum, maximum, mean, and standard deviation, whether any values are missing, and whether the data clusters within a particular quartile range. This informs the preprocessing steps that follow.

library(skimr)

# Note: skim_to_wide() is deprecated in skimr 2.x; skim(df) is the current equivalent
skimmed <- skim_to_wide(df)
skimmed
Data summary
Name Piped data
Number of rows 2396
Number of columns 15
_______________________
Column type frequency:
factor 3
numeric 12
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
SL21 0 1 FALSE 2 bid: 2032, bid: 364
SL20 0 1 FALSE 2 bid: 2055, bid: 341
Performa 0 1 FALSE 2 bur: 2230, bai: 166

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Penj_21 0 1 9.584888e+07 3.040022e+07 50004986 7.178802e+07 9.104714e+07 1.179365e+08 169934173 ▇▇▅▃▂
Penj_20 0 1 9.641617e+07 3.022387e+07 50015181 7.299453e+07 9.106843e+07 1.189492e+08 169882206 ▆▇▃▃▂
RB_Penj21 0 1 5.100000e-01 2.900000e-01 0 2.500000e-01 5.100000e-01 7.600000e-01 1 ▇▇▇▇▇
RB_Penj20 0 1 5.000000e-01 2.900000e-01 0 2.400000e-01 5.100000e-01 7.400000e-01 1 ▇▇▇▇▇
LR21 0 1 1.528404e+09 4.013853e+09 -9973969453 -1.327392e+09 1.969982e+09 4.252804e+09 9970196722 ▁▂▇▇▂
LR20 0 1 1.726908e+09 3.970113e+09 -9996772003 -1.162818e+09 2.150437e+09 4.444185e+09 9990549175 ▁▂▆▇▂
Liab_21 0 1 6.009127e+07 2.305624e+07 20033006 3.995161e+07 6.001215e+07 8.042310e+07 99985816 ▇▇▇▇▇
Liab_20 0 1 6.086028e+07 2.317820e+07 20039450 4.103766e+07 6.090982e+07 8.070269e+07 99964618 ▇▇▇▇▇
Aset21 0 1 1.701125e+08 3.407744e+07 120047150 1.399093e+08 1.650409e+08 1.962644e+08 239980549 ▇▅▅▅▃
Aset20 0 1 1.701452e+08 3.399318e+07 120049822 1.403411e+08 1.649616e+08 1.964649e+08 239919416 ▇▆▅▅▃
Kas21 0 1 1.766185e+07 8.690964e+06 125897 1.018294e+07 1.767224e+07 2.494508e+07 34504948 ▅▇▇▇▆
Kas20 0 1 2.515371e+07 8.598909e+06 10019797 1.781578e+07 2.528927e+07 3.245916e+07 39994863 ▇▇▇▇▇

Check whether the data is imbalanced.

library(tidyverse)

df %>% 
  ggplot(aes(x = Performa)) +
  geom_bar(fill = "blue") +  # a constant color belongs outside aes()
  ggtitle("Distribution of MSMEs with Good and Poor Performance") +
  theme(legend.position="none")

Checking for Missing Values

Check for missing data.

sapply(df,function(x) sum(is.na(x)))
##   Penj_21   Penj_20 RB_Penj21 RB_Penj20      SL21      SL20      LR21      LR20 
##         0         0         0         0         0         0         0         0 
##   Liab_21   Liab_20    Aset21    Aset20     Kas21     Kas20  Performa 
##         0         0         0         0         0         0         0

There are no missing values, so no missing-value handling is needed.
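For reference, the same per-column count can be written with `colSums(is.na(...))`. The sketch below uses a tiny made-up data frame (`toy`) that does contain gaps, only to show what the check and a simple complete-case filter would look like had any values been missing.

```r
# Made-up data frame with one missing value in each column
toy <- data.frame(Kas21 = c(2523754, NA, 194345),
                  Performa = factor(c("baik", "buruk", NA)))

# Number of missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only the rows without any missing value
complete <- toy[complete.cases(toy), ]
complete
```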

Correlation Between Features

Examine the correlation between features, after first converting the categorical variables to numeric.

df[,c('SL21','SL20')] <- sapply(df[,c('SL21','SL20')], unclass)

# Recompute the descriptive statistics on the converted data

skimmed <- skim_to_wide(df)
skimmed
Data summary
Name Piped data
Number of rows 2396
Number of columns 15
_______________________
Column type frequency:
factor 1
numeric 14
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Performa 0 1 FALSE 2 bur: 2230, bai: 166

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Penj_21 0 1 9.584888e+07 3.040022e+07 50004986 7.178802e+07 9.104714e+07 1.179365e+08 169934173 ▇▇▅▃▂
Penj_20 0 1 9.641617e+07 3.022387e+07 50015181 7.299453e+07 9.106843e+07 1.189492e+08 169882206 ▆▇▃▃▂
RB_Penj21 0 1 5.100000e-01 2.900000e-01 0 2.500000e-01 5.100000e-01 7.600000e-01 1 ▇▇▇▇▇
RB_Penj20 0 1 5.000000e-01 2.900000e-01 0 2.400000e-01 5.100000e-01 7.400000e-01 1 ▇▇▇▇▇
SL21 0 1 1.850000e+00 3.600000e-01 1 2.000000e+00 2.000000e+00 2.000000e+00 2 ▂▁▁▁▇
SL20 0 1 1.860000e+00 3.500000e-01 1 2.000000e+00 2.000000e+00 2.000000e+00 2 ▁▁▁▁▇
LR21 0 1 1.528404e+09 4.013853e+09 -9973969453 -1.327392e+09 1.969982e+09 4.252804e+09 9970196722 ▁▂▇▇▂
LR20 0 1 1.726908e+09 3.970113e+09 -9996772003 -1.162818e+09 2.150437e+09 4.444185e+09 9990549175 ▁▂▆▇▂
Liab_21 0 1 6.009127e+07 2.305624e+07 20033006 3.995161e+07 6.001215e+07 8.042310e+07 99985816 ▇▇▇▇▇
Liab_20 0 1 6.086028e+07 2.317820e+07 20039450 4.103766e+07 6.090982e+07 8.070269e+07 99964618 ▇▇▇▇▇
Aset21 0 1 1.701125e+08 3.407744e+07 120047150 1.399093e+08 1.650409e+08 1.962644e+08 239980549 ▇▅▅▅▃
Aset20 0 1 1.701452e+08 3.399318e+07 120049822 1.403411e+08 1.649616e+08 1.964649e+08 239919416 ▇▆▅▅▃
Kas21 0 1 1.766185e+07 8.690964e+06 125897 1.018294e+07 1.767224e+07 2.494508e+07 34504948 ▅▇▇▇▆
Kas20 0 1 2.515371e+07 8.598909e+06 10019797 1.781578e+07 2.528927e+07 3.245916e+07 39994863 ▇▇▇▇▇
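The numeric SL21 and SL20 columns in this summary are the integer codes of the factor levels. A small sketch of that conversion, using a made-up factor `sl` with the two recoded labels:

```r
# Made-up factor with the two recoded labels
sl <- factor(c("bidang usaha", "bidang lainnya", "bidang usaha"))

# Levels are sorted alphabetically, so "bidang lainnya" comes first
levels(sl)

# The integer codes follow the level order: 1 = bidang lainnya, 2 = bidang usaha
codes <- as.integer(sl)
codes
```

Under this coding, a mean of about 1.85 for SL21 simply reflects that most rows fall in level 2, "bidang usaha", which agrees with the earlier counts of 2032 against 364.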

The correlation between features is analyzed as follows.

cordata <- data.matrix(df)
cormat <- round(cor(cordata, method = "pearson"),2)

library(reshape2)
melted_cormat <- melt(cormat)

# Get the lower triangle of the correlation matrix
get_lower_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}

# Get the upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}

upper_tri <- get_upper_tri(cormat)

# Reshape the correlation matrix to long format, dropping the NA cells
melted_cormat <- melt(upper_tri, na.rm = TRUE)

# Build the heatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1,1), space = "Lab",
                       name="Pearson\nCorrelation") +
  theme_minimal() + # minimal theme
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                   size = 9, hjust = 1))+
  coord_fixed()

ggheatmap +
  geom_text(aes(Var2, Var1, label = value), color = "black",
            size = 2) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.ticks = element_blank(),
    legend.justification = c(1, 0),
    legend.position = c(0.6, 0.7),
    legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                               title.position = "top",
                               title.hjust = 0.5))
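To make the data flow behind the heatmap easier to follow, here is a minimal base R sketch of the same steps on a made-up three-column data frame: compute the Pearson matrix, blank the lower triangle, and reshape the rest to long format. The data (`toy`) and the use of `as.table()` in place of `reshape2::melt()` are assumptions made for illustration.

```r
set.seed(1)
# Made-up data: c is constructed to correlate strongly with a
toy <- data.frame(a = rnorm(50), b = rnorm(50))
toy$c <- toy$a + rnorm(50, sd = 0.1)

# Pearson correlation matrix, rounded as in the analysis above
cormat <- round(cor(toy, method = "pearson"), 2)

# Blank the lower triangle; the diagonal and upper triangle remain
cormat[lower.tri(cormat)] <- NA

# Long format: one row per remaining cell
long <- na.omit(as.data.frame(as.table(cormat)))
names(long) <- c("Var1", "Var2", "value")
long
```

Each row of `long` corresponds to one tile of the heatmap, which is why only half of the matrix is drawn.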

Data Preparation

Resampling

Because the class (label) distribution of Performa is imbalanced, as the earlier bar chart shows, the data needs to be resampled.

plotImbalance <- function(data){
  ggplot(data, aes(x = Performa)) +
    geom_bar(fill = "blue") +  # a constant color belongs outside aes()
    ggtitle("Distribution of MSMEs with Good and Poor Performance") +
    theme(legend.position="none")
}

head(df)
##     Penj_21   Penj_20  RB_Penj21  RB_Penj20 SL21 SL20       LR21       LR20
## 1 149207164 136533758 0.86817985 0.85582566    2    2 8760704261 7317358071
## 2  88706100  85686167 0.97856674 0.73113980    1    2 5637124272 4835651598
## 3 108887378  78405829 0.06015514 0.08093554    2    2 4861115054 3990873285
## 4 133103033 120427245 0.53161329 0.98946267    2    2 3611989881 4587487935
## 5 136475517  86949433 0.22992654 0.16201631    1    1 5773418132 4935491114
## 6 117591684  96478934 0.51937699 0.66088548    2    2 2153954438 4328320319
##    Liab_21  Liab_20    Aset21    Aset20    Kas21    Kas20 Performa
## 1 20792734 84451989 195406981 183660291  2523754 11437529     baik
## 2 24472651 98329359 174655998 199015767 30930175 36333943    buruk
## 3 20586078 67450502 182547662 177529646   194345 10186417     baik
## 4 26786479 79683551 190119809 185314245 24602549 30149281     baik
## 5 33833705 96559757 196396280 195426775  4836727 11705490    buruk
## 6 21977440 58498050 190926650 182901681 13546277 20701195     baik
plotImbalance(df)

library(imbalance)
imbalanceRatio(df, classAttr = "Performa")
## [1] 0.07443946
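The reported value is consistent with the size of the minority class divided by the size of the majority class. The sketch below reproduces it from the class counts shown in the earlier summary (166 baik, 2230 buruk); the vector `performa` is rebuilt from those counts rather than read from the data.

```r
# Rebuild the label vector from the class counts in the summary
performa <- factor(rep(c("baik", "buruk"), times = c(166, 2230)))

counts <- table(performa)

# Minority class size divided by majority class size
ratio <- min(counts) / max(counts)
round(ratio, 8)
```

A ratio of 1 would mean perfectly balanced classes, so a value near 0.07 indicates a strong imbalance.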

Several resampling algorithms are compared for balancing the class distribution. If a model is trained on the imbalanced data, it learns mostly from MSMEs with poor performance and is therefore biased toward predicting poor performance.

  1. Synthetic Minority Over-sampling Technique (SMOTE)
dfSMOTE <- oversample(df, ratio = 0.90, method = "SMOTE",
                      classAttr = "Performa")
plotImbalance(dfSMOTE)

imbalanceRatio(dfSMOTE, classAttr = "Performa")
## [1] 0.9
  2. Adaptive Synthetic (ADASYN)
dfADASYN <- oversample(df, method = "ADASYN", classAttr = "Performa")
plotImbalance(dfADASYN)

imbalanceRatio(dfADASYN, classAttr = "Performa")
## [1] 0.9869955
  3. Majority-Weighted Minority Over-sampling Technique (MWMOTE)
dfMWMOTE <- oversample(df, ratio = 0.95, method = "MWMOTE", classAttr = "Performa")
plotImbalance(dfMWMOTE)

imbalanceRatio(dfMWMOTE, classAttr = "Performa")
## [1] 0.9502242
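For intuition, SMOTE-style methods create synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbors. The sketch below is a simplified illustration of that idea only; it is not the implementation used by the imbalance package, and `smote_one`, `minority`, and `k` are hypothetical names.

```r
set.seed(42)
# Made-up minority class: 10 points with 2 numeric features
minority <- matrix(rnorm(20), ncol = 2)

# Create one synthetic point between a random minority point
# and one of its k nearest minority neighbors
smote_one <- function(X, k = 3) {
  i <- sample(nrow(X), 1)
  d <- sqrt(rowSums(sweep(X, 2, X[i, ])^2))  # Euclidean distances to point i
  nn <- order(d)[2:(k + 1)]                  # skip the point itself
  j <- nn[sample.int(length(nn), 1)]
  gap <- runif(1)                            # position along the segment
  X[i, ] + gap * (X[j, ] - X[i, ])
}

synthetic <- t(replicate(5, smote_one(minority)))
synthetic
```

ADASYN and MWMOTE build on the same interpolation step but differ in how they choose which minority points to generate from.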

Of the three resampling methods, ADASYN is chosen because it yields the most nearly balanced classes (imbalance ratio of about 0.987).

library(writexl)
write_xlsx(dfADASYN,"dataADASYN.xlsx")

Modelling

As a first step, k-fold cross-validation is set up.

k-fold Cross Validation (image source: https://rpubs.com/jvaldeleon)

# Set the seed so the random splits are reproducible
set.seed(123)

# Define repeated cross-validation with 10 folds and one repetition
library(caret)
evaluationSetting <- trainControl(method='repeatedcv', 
                                  number=10, 
                                  repeats=1,
                                  summaryFunction = multiClassSummary)
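To illustrate what 10-fold cross-validation does with the 4431 rows of the resampled data, the sketch below assigns each row to one of 10 folds at random and reports the resulting training-set sizes. It is a plain random assignment written for illustration; caret builds its own folds with stratification on the outcome, which this sketch does not attempt.

```r
set.seed(123)
n <- 4431   # rows in the resampled data, as reported in the model output
k <- 10     # number of folds

# Assign every row to a fold so that fold sizes differ by at most one
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once; the other k - 1 folds form the training set
fold_sizes  <- tabulate(fold_id, nbins = k)
train_sizes <- n - fold_sizes
train_sizes
```

The training sizes come out as 3987 or 3988, the same values that appear under "Summary of sample sizes" in the model output.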

Decision Tree

Classification and Regression Tree (CART)

DT_Model <- train(Performa~.,
                    data=dfADASYN,
                    method="rpart",
                    trControl=evaluationSetting)
print(DT_Model)
## CART 
## 
## 4431 samples
##   14 predictor
##    2 classes: 'baik', 'buruk' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 3987, 3988, 3988, 3988, 3988, 3988, ... 
## Resampling results across tuning parameters:
## 
##   cp         Accuracy   Kappa      F1         Sensitivity  Specificity
##   0.1004089  0.7962118  0.5930495  0.8169223  0.9136672    0.6802691  
##   0.1785552  0.7165472  0.4350117  0.7740602  0.9718244    0.4645740  
##   0.3698319  0.5897286  0.1776729  0.7531898  0.4972727    0.6811659  
##   Pos_Pred_Value  Neg_Pred_Value  Precision  Recall     Detection_Rate
##   0.7410003       0.8929800       0.7410003  0.9136672  0.4538456     
##   0.6451266       0.9525796       0.6451266  0.9718244  0.4827334     
##   0.6061078       0.7442883       0.6061078  0.4972727  0.2469526     
##   Balanced_Accuracy
##   0.7969681        
##   0.7181992        
##   0.5892193        
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.1004089.

Logistic Regression

RegLog_Model <- train(Performa~.,
                      data=dfADASYN,
                      method="glm",
                      family = 'binomial',
                      trControl=evaluationSetting)
print(RegLog_Model)
## Generalized Linear Model 
## 
## 4431 samples
##   14 predictor
##    2 classes: 'baik', 'buruk' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 3988, 3988, 3987, 3988, 3988, 3988, ... 
## Resampling results:
## 
##   Accuracy  Kappa      F1         Sensitivity  Specificity  Pos_Pred_Value
##   0.869106  0.7382837  0.8712062  0.8914315    0.8470852    0.8522767     
##   Neg_Pred_Value  Precision  Recall     Detection_Rate  Balanced_Accuracy
##   0.8880895       0.8522767  0.8914315  0.4427938       0.8692584

Support Vector Machine (SVM) with Radial Kernel

library(kernlab)
SVMRad_Model <- train(Performa~.,
                   data=dfADASYN,
                   method="svmRadial",
                   trControl=evaluationSetting)
print(SVMRad_Model)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 4431 samples
##   14 predictor
##    2 classes: 'baik', 'buruk' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 3988, 3988, 3988, 3988, 3988, 3987, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa      F1         Sensitivity  Specificity
##   0.25  0.9451610  0.8903608  0.9464649  0.9731900    0.9174888  
##   0.50  0.9535101  0.9070500  0.9544987  0.9795496    0.9278027  
##   1.00  0.9620849  0.9241955  0.9629253  0.9890930    0.9354260  
##   Pos_Pred_Value  Neg_Pred_Value  Precision  Recall     Detection_Rate
##   0.9215046       0.9720526       0.9215046  0.9731900  0.4834111     
##   0.9309551       0.9788086       0.9309551  0.9795496  0.4865704     
##   0.9382552       0.9886282       0.9382552  0.9890930  0.4913103     
##   Balanced_Accuracy
##   0.9453394        
##   0.9536761        
##   0.9622595        
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05147064
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05147064 and C = 1.

Support Vector Machine (SVM) with Linear Kernel

SVMLin_Model <- train(Performa~.,
                   data=dfADASYN,
                   method="svmLinear",
                   trControl=evaluationSetting)
print(SVMLin_Model)
## Support Vector Machines with Linear Kernel 
## 
## 4431 samples
##   14 predictor
##    2 classes: 'baik', 'buruk' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 3987, 3988, 3988, 3988, 3988, 3988, ... 
## Resampling results:
## 
##   Accuracy   Kappa      F1         Sensitivity  Specificity  Pos_Pred_Value
##   0.8697786  0.7396848  0.8739646  0.908225     0.8318386    0.8427512     
##   Neg_Pred_Value  Precision  Recall    Detection_Rate  Balanced_Accuracy
##   0.9021679       0.8427512  0.908225  0.4511399       0.8700318        
## 
## Tuning parameter 'C' was held constant at a value of 1

Naive Bayes

library(naivebayes)
NB_Model <- train(Performa~.,
                    data=dfADASYN,
                    method="naive_bayes",
                    trControl=evaluationSetting)
print(NB_Model)
## Naive Bayes 
## 
## 4431 samples
##   14 predictor
##    2 classes: 'baik', 'buruk' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 3988, 3988, 3988, 3988, 3987, 3988, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa      F1         Sensitivity  Specificity
##   FALSE      0.6795315  0.3617138  0.7561499  1.0000000    0.3632287  
##    TRUE      0.8521765  0.7039299  0.8335684  0.7464644    0.9565022  
##   Pos_Pred_Value  Neg_Pred_Value  Precision  Recall     Detection_Rate
##   0.6079515       1.0000000       0.6079515  1.0000000  0.4967274     
##   0.9448611       0.7930638       0.9448611  0.7464644  0.3707929     
##   Balanced_Accuracy
##   0.6816143        
##   0.8514833        
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
##  and adjust = 1.

Random Forest

library(randomForest)
RF_Model <- train(Performa~.,
                    data=dfADASYN,
                    method="rf",
                    tuneGrid=expand.grid(.mtry= 7),
                    trControl=evaluationSetting,
                    ntree = 300)
print(RF_Model)
## Random Forest 
## 
## 4431 samples
##   14 predictor
##    2 classes: 'baik', 'buruk' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 3988, 3988, 3988, 3988, 3988, 3988, ... 
## Resampling results:
## 
##   Accuracy  Kappa      F1         Sensitivity  Specificity  Pos_Pred_Value
##   0.980139  0.9602824  0.9802359  0.9909132    0.9695067    0.9698749     
##   Neg_Pred_Value  Precision  Recall     Detection_Rate  Balanced_Accuracy
##   0.9909016       0.9698749  0.9909132  0.4922137       0.98021          
## 
## Tuning parameter 'mtry' was held constant at a value of 7

Evaluation

Performance Comparison

# Compare model performance using resamples()
models_compare <- resamples(list(DecisionTree = DT_Model, 
                                 LogisticRegression = RegLog_Model,
                                 SupportVectorMachineRadial = SVMRad_Model,
                                 SupportVectorMachineLinear = SVMLin_Model,
                                 NaiveBayes = NB_Model,
                                 RandomForest = RF_Model
                                 ))

# Summary of model performance
summary(models_compare)
## 
## Call:
## summary.resamples(object = models_compare)
## 
## Models: DecisionTree, LogisticRegression, SupportVectorMachineRadial, SupportVectorMachineLinear, NaiveBayes, RandomForest 
## Number of resamples: 10 
## 
## Accuracy 
##                                 Min.   1st Qu.    Median      Mean   3rd Qu.
## DecisionTree               0.7629797 0.7777757 0.7878104 0.7962118 0.8182844
## LogisticRegression         0.8465011 0.8619746 0.8702032 0.8691060 0.8752822
## SupportVectorMachineRadial 0.9367946 0.9520519 0.9661400 0.9620849 0.9706546
## SupportVectorMachineLinear 0.8419865 0.8628668 0.8656885 0.8697786 0.8783059
## NaiveBayes                 0.8329571 0.8442438 0.8545620 0.8521765 0.8600451
## RandomForest               0.9729120 0.9779910 0.9819413 0.9801390 0.9836343
##                                 Max. NA's
## DecisionTree               0.8306998    0
## LogisticRegression         0.8871332    0
## SupportVectorMachineRadial 0.9796840    0
## SupportVectorMachineLinear 0.9097065    0
## NaiveBayes                 0.8645598    0
## RandomForest               0.9842342    0
## 
## Balanced_Accuracy 
##                                 Min.   1st Qu.    Median      Mean   3rd Qu.
## DecisionTree               0.7643294 0.7788655 0.7882898 0.7969681 0.8187806
## LogisticRegression         0.8468304 0.8619749 0.8704494 0.8692584 0.8754484
## SupportVectorMachineRadial 0.9370669 0.9522804 0.9662913 0.9622595 0.9708214
## SupportVectorMachineLinear 0.8423461 0.8632236 0.8659040 0.8700318 0.8784284
## NaiveBayes                 0.8320628 0.8435105 0.8541044 0.8514833 0.8594043
## RandomForest               0.9729719 0.9780932 0.9819863 0.9802100 0.9836756
##                                 Max. NA's
## DecisionTree               0.8313188    0
## LogisticRegression         0.8873115    0
## SupportVectorMachineRadial 0.9797595    0
## SupportVectorMachineLinear 0.9100082    0
## NaiveBayes                 0.8639115    0
## RandomForest               0.9842744    0
## 
## Detection_Rate 
##                                 Min.   1st Qu.    Median      Mean   3rd Qu.
## DecisionTree               0.4153499 0.4401806 0.4593679 0.4538456 0.4681355
## LogisticRegression         0.4234234 0.4401806 0.4458239 0.4427938 0.4492099
## SupportVectorMachineRadial 0.4830700 0.4864560 0.4932280 0.4913103 0.4952108
## SupportVectorMachineLinear 0.4311512 0.4458239 0.4498302 0.4511399 0.4548533
## NaiveBayes                 0.3476298 0.3594808 0.3758465 0.3707929 0.3814898
## RandomForest               0.4875847 0.4898420 0.4938000 0.4922137 0.4943567
##                                 Max. NA's
## DecisionTree               0.4785553    0
## LogisticRegression         0.4537246    0
## SupportVectorMachineRadial 0.4966140    0
## SupportVectorMachineLinear 0.4740406    0
## NaiveBayes                 0.3873874    0
## RandomForest               0.4966140    0
## 
## F1 
##                                 Min.   1st Qu.    Median      Mean   3rd Qu.
## DecisionTree               0.7982646 0.8040349 0.8097055 0.8169223 0.8297685
## LogisticRegression         0.8528139 0.8609848 0.8736232 0.8712062 0.8777588
## SupportVectorMachineRadial 0.9388646 0.9534819 0.9666626 0.9629253 0.9711752
## SupportVectorMachineLinear 0.8491379 0.8668642 0.8706800 0.8739646 0.8806552
## NaiveBayes                 0.8062827 0.8242305 0.8396366 0.8335684 0.8451566
## RandomForest               0.9729730 0.9781799 0.9819405 0.9802359 0.9836310
##                                 Max. NA's
## DecisionTree               0.8440748    0
## LogisticRegression         0.8893805    0
## SupportVectorMachineRadial 0.9797753    0
## SupportVectorMachineLinear 0.9130435    0
## NaiveBayes                 0.8492462    0
## RandomForest               0.9842697    0
## 
## Kappa 
##                                 Min.   1st Qu.    Median      Mean   3rd Qu.
## DecisionTree               0.5272241 0.5565102 0.5760176 0.5930495 0.6369179
## LogisticRegression         0.6931945 0.7239431 0.7405253 0.7382837 0.7506385
## SupportVectorMachineRadial 0.8736529 0.9041455 0.9322975 0.9241955 0.9413263
## SupportVectorMachineLinear 0.6841902 0.7259202 0.7314846 0.7396848 0.7566669
## NaiveBayes                 0.6653054 0.6880217 0.7088492 0.7039299 0.7197218
## RandomForest               0.9458281 0.9559891 0.9638843 0.9602824 0.9672701
##                                 Max. NA's
## DecisionTree               0.6618080    0
## LogisticRegression         0.7743388    0
## SupportVectorMachineRadial 0.9593723    0
## SupportVectorMachineLinear 0.8195152    0
## NaiveBayes                 0.7287589    0
## RandomForest               0.9684697    0
## 
## Neg_Pred_Value 
##                                 Min.   1st Qu.    Median      Mean   3rd Qu.
## DecisionTree               0.8217822 0.8622549 0.9052198 0.8929800 0.9163036
## LogisticRegression         0.8546256 0.8835813 0.8916233 0.8880895 0.8996377
## SupportVectorMachineRadial 0.9715640 0.9791574 0.9930205 0.9886282 0.9952830
## SupportVectorMachineLinear 0.8693694 0.8919823 0.9012106 0.9021679 0.9075126
## NaiveBayes                 0.7651246 0.7811488 0.8011152 0.7930638 0.8053349
## RandomForest               0.9817352 0.9863943 0.9931071 0.9909016 0.9953810
##                                 Max. NA's
## DecisionTree               0.9402985    0
## LogisticRegression         0.9099526    0
## SupportVectorMachineRadial 1.0000000    0
## SupportVectorMachineLinear 0.9507389    0
## NaiveBayes                 0.8100775    0
## RandomForest               1.0000000    0
## 
## Pos_Pred_Value 
##                                 Min.   1st Qu.    Median      Mean   3rd Qu.
## DecisionTree               0.6860841 0.7058465 0.7505640 0.7410003 0.7725310
## LogisticRegression         0.8140496 0.8457935 0.8530476 0.8522767 0.8654775
## SupportVectorMachineRadial 0.9033613 0.9258529 0.9421111 0.9382552 0.9482199
## SupportVectorMachineLinear 0.8073770 0.8230579 0.8430736 0.8427512 0.8637073
## NaiveBayes                 0.9184783 0.9293245 0.9500277 0.9448611 0.9533221
## RandomForest               0.9521739 0.9644424 0.9710604 0.9698749 0.9765923
##                                 Max. NA's
## DecisionTree               0.7827869    0
## LogisticRegression         0.8684211    0
## SupportVectorMachineRadial 0.9688889    0
## SupportVectorMachineLinear 0.8766520    0
## NaiveBayes                 0.9753086    0
## RandomForest               0.9819005    0
## 
## Precision 
##                                 Min.   1st Qu.    Median      Mean   3rd Qu.
## DecisionTree               0.6860841 0.7058465 0.7505640 0.7410003 0.7725310
## LogisticRegression         0.8140496 0.8457935 0.8530476 0.8522767 0.8654775
## SupportVectorMachineRadial 0.9033613 0.9258529 0.9421111 0.9382552 0.9482199
## SupportVectorMachineLinear 0.8073770 0.8230579 0.8430736 0.8427512 0.8637073
## NaiveBayes                 0.9184783 0.9293245 0.9500277 0.9448611 0.9533221
## RandomForest               0.9521739 0.9644424 0.9710604 0.9698749 0.9765923
##                                 Max. NA's
## DecisionTree               0.7827869    0
## LogisticRegression         0.8684211    0
## SupportVectorMachineRadial 0.9688889    0
## SupportVectorMachineLinear 0.8766520    0
## NaiveBayes                 0.9753086    0
## RandomForest               0.9819005    0
## 
## Recall 
##                                 Min.   1st Qu.    Median      Mean   3rd Qu.
## DecisionTree               0.8363636 0.8863636 0.9250000 0.9136672 0.9421226
## LogisticRegression         0.8506787 0.8863636 0.8977273 0.8914315 0.9045455
## SupportVectorMachineRadial 0.9727273 0.9795455 0.9931818 0.9890930 0.9954700
## SupportVectorMachineLinear 0.8681818 0.8977273 0.9047614 0.9082250 0.9159091
## NaiveBayes                 0.7000000 0.7238636 0.7568182 0.7464644 0.7681818
## RandomForest               0.9818182 0.9863636 0.9932024 0.9909132 0.9954545
##                                 Max. NA's
## DecisionTree               0.9636364    0
## LogisticRegression         0.9136364    0
## SupportVectorMachineRadial 1.0000000    0
## SupportVectorMachineLinear 0.9545455    0
## NaiveBayes                 0.7782805    0
## RandomForest               1.0000000    0
## 
## Sensitivity 
##                                 Min.   1st Qu.    Median      Mean   3rd Qu.
## DecisionTree               0.8363636 0.8863636 0.9250000 0.9136672 0.9421226
## LogisticRegression         0.8506787 0.8863636 0.8977273 0.8914315 0.9045455
## SupportVectorMachineRadial 0.9727273 0.9795455 0.9931818 0.9890930 0.9954700
## SupportVectorMachineLinear 0.8681818 0.8977273 0.9047614 0.9082250 0.9159091
## NaiveBayes                 0.7000000 0.7238636 0.7568182 0.7464644 0.7681818
## RandomForest               0.9818182 0.9863636 0.9932024 0.9909132 0.9954545
##                                 Max. NA's
## DecisionTree               0.9636364    0
## LogisticRegression         0.9136364    0
## SupportVectorMachineRadial 1.0000000    0
## SupportVectorMachineLinear 0.9545455    0
## NaiveBayes                 0.7782805    0
## RandomForest               1.0000000    0
## 
## Specificity 
##                                 Min.   1st Qu.    Median      Mean   3rd Qu.
## DecisionTree               0.5650224 0.6121076 0.7085202 0.6802691 0.7399103
## LogisticRegression         0.7982063 0.8374439 0.8497758 0.8470852 0.8609865
## SupportVectorMachineRadial 0.8968610 0.9226457 0.9394619 0.9354260 0.9461883
## SupportVectorMachineLinear 0.7892377 0.8038117 0.8340807 0.8318386 0.8632287
## NaiveBayes                 0.9327354 0.9439462 0.9618834 0.9565022 0.9641256
## RandomForest               0.9506726 0.9641256 0.9708520 0.9695067 0.9764574
##                                 Max. NA's
## DecisionTree               0.7623318    0
## LogisticRegression         0.8699552    0
## SupportVectorMachineRadial 0.9686099    0
## SupportVectorMachineLinear 0.8744395    0
## NaiveBayes                 0.9820628    0
## RandomForest               0.9820628    0
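The metrics in this summary all derive from the confusion matrix of each validation fold. The sketch below computes them from hypothetical counts (`tp`, `fn`, `fp`, `tn` are illustrative numbers, not taken from the models above), assuming "baik" is the positive class because it is the first factor level.

```r
# Hypothetical confusion-matrix counts for one validation fold
tp <- 218  # actual baik,  predicted baik
fn <- 2    # actual baik,  predicted buruk
fp <- 7    # actual buruk, predicted baik
tn <- 216  # actual buruk, predicted buruk
n  <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # reported as both Sensitivity and Recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # reported as both Precision and Pos_Pred_Value
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what chance alone would produce
p_chance    <- ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision, f1 = f1,
        balanced_accuracy = balanced_accuracy, kappa = cohen_kappa), 4)
```

As a check against the reported results, the Random Forest means give (0.9909132 + 0.9695067) / 2 = 0.98021, which matches its Balanced_Accuracy. This also explains why the Precision and Pos_Pred_Value tables are identical, as are Recall and Sensitivity.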

Conclusion

Comparing the evaluation results of the Decision Tree, Logistic Regression, SVM, Naive Bayes, and Random Forest models, the best model for identifying MSME performance is Random Forest, with a mean cross-validated accuracy of about 0.98.