APPLYING MACHINE LEARNING MODELS TO PREDICT THE PERFORMANCE OF MSMEs (UMKM)
Methodology
CRISP-DM data mining workflow (image source: https://datascience-pm.com)
- Business Understanding: understand the data process comprehensively.
- Data Understanding: gain an initial understanding of the data needed to solve the stated problem.
- Data Preparation: collect, merge, arrange, and organise the data.
- Modelling: build a descriptive or predictive model.
- Evaluation: interpret the results of the data mining.
- Deployment: plan how the model will be applied or used.
Business Understanding
MSMEs affected by the pandemic (image source: https://xmui.or.id.com)
The economy is one of the sectors hit by the COVID-19 pandemic. PT X, which regularly invests in MSMEs (micro, small, and medium enterprises; UMKM in Indonesian), incurred losses because its investments were poorly targeted. An analysis of the performance of these MSMEs is therefore needed so that PT X can invest prudently, in line with the actual condition of each MSME.
Data Understanding
The data contains the following features:
| Variable | Description |
|---|---|
| Sales | Sales in rupiah for 2020 and 2021 [Penj_20] [Penj_21] |
| Expense Ratio | Ratio of expenses to sales for 2020 and 2021 [RB_Penj20] [RB_Penj21] |
| Main Profit Source | Source of profit, main line of business or other, for 2020 and 2021 [SL20] [SL21] |
| Profit/Loss | Profit or loss in rupiah for 2020 and 2021 [LR20] [LR21] |
| Liabilities | Liabilities in rupiah for 2020 and 2021 [Liab_20] [Liab_21] |
| Assets | Assets in rupiah for 2020 and 2021 [Aset20] [Aset21] |
| Cash | Cash in rupiah for 2020 and 2021 [Kas20] [Kas21] |
| Performance | MSME performance category, good (baik) or poor (buruk) [Performa] |
Load Data
Load the data and convert each column to the appropriate type.
library(readr)
df <- read_delim("Data Final SSO.csv",
                 delim = ";",
                 escape_double = FALSE, trim_ws = TRUE)
names(df) <- c("Penj_21","Penj_20","RB_Penj21","RB_Penj20","SL21",
               "SL20","LR21","LR20","Liab_21","Liab_20","Aset21",
               "Aset20","Kas21","Kas20","Performa")
# Convert the tibble returned by readr to a plain data frame
df <- data.frame(df)
# Convert the ratio features to numeric, replacing the decimal comma with a dot first
library(stringr)
df$RB_Penj21 <- str_replace_all(string = df$RB_Penj21,pattern = ",",
                                replacement = ".")
df$RB_Penj20 <- str_replace_all(string = df$RB_Penj20,pattern = ",",
                                replacement = ".")
# Categorical columns become factors
df$SL21 <- as.factor(df$SL21)
df$SL20 <- as.factor(df$SL20)
df$Performa <- as.factor(df$Performa)
df$RB_Penj21 <- as.numeric(df$RB_Penj21)
df$RB_Penj20 <- as.numeric(df$RB_Penj20)
str(df)
## 'data.frame': 2396 obs. of 15 variables:
## $ Penj_21 : num 1.49e+08 8.87e+07 1.09e+08 1.33e+08 1.36e+08 ...
## $ Penj_20 : num 1.37e+08 8.57e+07 7.84e+07 1.20e+08 8.69e+07 ...
## $ RB_Penj21: num 0.8682 0.9786 0.0602 0.5316 0.2299 ...
## $ RB_Penj20: num 0.8558 0.7311 0.0809 0.9895 0.162 ...
## $ SL21 : Factor w/ 5 levels "Bidang Lain",..: 4 3 5 5 3 4 4 5 5 3 ...
## $ SL20 : Factor w/ 5 levels "Bidang lain",..: 4 5 5 5 3 4 4 5 5 3 ...
## $ LR21 : num 8.76e+09 5.64e+09 4.86e+09 3.61e+09 5.77e+09 ...
## $ LR20 : num 7.32e+09 4.84e+09 3.99e+09 4.59e+09 4.94e+09 ...
## $ Liab_21 : num 20792734 24472651 20586078 26786479 33833705 ...
## $ Liab_20 : num 84451989 98329359 67450502 79683551 96559757 ...
## $ Aset21 : num 1.95e+08 1.75e+08 1.83e+08 1.90e+08 1.96e+08 ...
## $ Aset20 : num 1.84e+08 1.99e+08 1.78e+08 1.85e+08 1.95e+08 ...
## $ Kas21 : num 2523754 30930175 194345 24602549 4836727 ...
## $ Kas20 : num 11437529 36333943 10186417 30149281 11705490 ...
## $ Performa : Factor w/ 3 levels "baik","Baik",..: 1 3 1 1 3 1 1 1 1 3 ...
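The comma-to-dot replacement above can also be avoided by telling the reader function which decimal mark the file uses. The following is a minimal base R sketch on a made-up two-row sample, not the original file; `sample_text` and `sample_df` are illustrative names.

```r
# Hypothetical two-row sample that mimics the file layout:
# semicolon-delimited, with a decimal comma in the ratio column.
sample_text <- "Penj_21;RB_Penj21;Performa
149207164;0,8682;baik
88706100;0,9786;buruk"

# read.csv2() defaults to sep = ";" and dec = ",",
# so "0,8682" is parsed directly as the number 0.8682.
sample_df <- read.csv2(text = sample_text, stringsAsFactors = TRUE)

str(sample_df)
```

With readr, the comparable option is passing `locale = locale(decimal_mark = ",")` to `read_delim()`.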
Categorical Data Summary
Summarise the categorical columns.
summary(df[,c(5,6,15)])
## SL21 SL20 Performa
## Bidang Lain : 2 Bidang lain : 2 baik : 133
## Bidang lainnya: 2 Bidang Lain : 48 Baik : 33
## Bidang Lainnya: 360 Bidang Lainnya: 291 buruk:2230
## Bidang usaha : 736 Bidang usaha : 635
## Bidang Usaha :1296 Bidang Usaha :1420
The summary shows categories that are spelled differently but mean the same thing, so they need a common name. SL21 and SL20 are collapsed into two categories, "bidang lainnya" (other line of business) and "bidang usaha" (main line of business), and Performa into two categories, "baik" (good) and "buruk" (poor).
# levels(df$Performa)
# Change "Baik" to "baik"
# levels(df$SL21)
# Merge the "bidang lain" variants into one category and the "bidang usaha" variants into another
# levels(df$SL20)
# Merge the "bidang lain" variants into one category and the "bidang usaha" variants into another
## RECODE THE CATEGORIES
levels(df$Performa)[2] <- c("baik")
# After the first assignment only three levels remain, so the "bidang usaha" variants sit at positions 2:3
levels(df$SL21)[1:3] <- c("bidang lainnya")
levels(df$SL21)[2:3] <- c("bidang usaha")
levels(df$SL20)[1:3] <- c("bidang lainnya")
levels(df$SL20)[2:3] <- c("bidang usaha")
summary(df[,c(5,6,15)])
## SL21 SL20 Performa
## bidang lainnya: 364 bidang lainnya: 341 baik : 166
## bidang usaha :2032 bidang usaha :2055 buruk:2230
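Merging levels by position works here, but it depends on the order in which R happens to sort the spellings. A name-based variant is sketched below on a toy factor; `normalise_levels()` is a hypothetical helper and not part of the original analysis.

```r
# Toy factor with the same inconsistent spellings seen in SL21 and SL20.
sl <- factor(c("Bidang Lain", "Bidang lainnya", "Bidang Lainnya",
               "Bidang usaha", "Bidang Usaha", "Bidang Usaha"))

# Hypothetical helper: normalise the level names instead of indexing them.
normalise_levels <- function(f) {
  lv <- tolower(levels(f))                     # remove case differences
  lv[lv == "bidang lain"] <- "bidang lainnya"  # unify the shortened spelling
  levels(f) <- lv                              # duplicated names merge the levels
  f
}

sl_clean <- normalise_levels(sl)
table(sl_clean)
```

Because the mapping is keyed on the names, it gives the same result regardless of how many spelling variants a column contains.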
Descriptive Statistics
The descriptive statistics summarise the data: the shape of each distribution, the minimum, maximum, mean, and standard deviation, the number of missing values, and whether the values are concentrated within a particular quartile range. This informs the preprocessing steps that follow.
library(skimr)
# Note: skim_to_wide() is deprecated in skimr 2.x in favour of skim()
skimmed <- skim_to_wide(df)
skimmed
| Name | Piped data |
|---|---|
| Number of rows | 2396 |
| Number of columns | 15 |
| Column type frequency: | |
| factor | 3 |
| numeric | 12 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| SL21 | 0 | 1 | FALSE | 2 | bid: 2032, bid: 364 |
| SL20 | 0 | 1 | FALSE | 2 | bid: 2055, bid: 341 |
| Performa | 0 | 1 | FALSE | 2 | bur: 2230, bai: 166 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Penj_21 | 0 | 1 | 9.584888e+07 | 3.040022e+07 | 50004986 | 7.178802e+07 | 9.104714e+07 | 1.179365e+08 | 169934173 | ▇▇▅▃▂ |
| Penj_20 | 0 | 1 | 9.641617e+07 | 3.022387e+07 | 50015181 | 7.299453e+07 | 9.106843e+07 | 1.189492e+08 | 169882206 | ▆▇▃▃▂ |
| RB_Penj21 | 0 | 1 | 5.100000e-01 | 2.900000e-01 | 0 | 2.500000e-01 | 5.100000e-01 | 7.600000e-01 | 1 | ▇▇▇▇▇ |
| RB_Penj20 | 0 | 1 | 5.000000e-01 | 2.900000e-01 | 0 | 2.400000e-01 | 5.100000e-01 | 7.400000e-01 | 1 | ▇▇▇▇▇ |
| LR21 | 0 | 1 | 1.528404e+09 | 4.013853e+09 | -9973969453 | -1.327392e+09 | 1.969982e+09 | 4.252804e+09 | 9970196722 | ▁▂▇▇▂ |
| LR20 | 0 | 1 | 1.726908e+09 | 3.970113e+09 | -9996772003 | -1.162818e+09 | 2.150437e+09 | 4.444185e+09 | 9990549175 | ▁▂▆▇▂ |
| Liab_21 | 0 | 1 | 6.009127e+07 | 2.305624e+07 | 20033006 | 3.995161e+07 | 6.001215e+07 | 8.042310e+07 | 99985816 | ▇▇▇▇▇ |
| Liab_20 | 0 | 1 | 6.086028e+07 | 2.317820e+07 | 20039450 | 4.103766e+07 | 6.090982e+07 | 8.070269e+07 | 99964618 | ▇▇▇▇▇ |
| Aset21 | 0 | 1 | 1.701125e+08 | 3.407744e+07 | 120047150 | 1.399093e+08 | 1.650409e+08 | 1.962644e+08 | 239980549 | ▇▅▅▅▃ |
| Aset20 | 0 | 1 | 1.701452e+08 | 3.399318e+07 | 120049822 | 1.403411e+08 | 1.649616e+08 | 1.964649e+08 | 239919416 | ▇▆▅▅▃ |
| Kas21 | 0 | 1 | 1.766185e+07 | 8.690964e+06 | 125897 | 1.018294e+07 | 1.767224e+07 | 2.494508e+07 | 34504948 | ▅▇▇▇▆ |
| Kas20 | 0 | 1 | 2.515371e+07 | 8.598909e+06 | 10019797 | 1.781578e+07 | 2.528927e+07 | 3.245916e+07 | 39994863 | ▇▇▇▇▇ |
Check whether the classes are imbalanced.
library(tidyverse)
# fill is a constant colour, so it is set outside aes(); inside aes() it would be treated as a data value
df %>%
  ggplot(aes(x = Performa)) +
  geom_bar(fill = "blue") +
  ggtitle("Distribution of MSMEs with Good and Poor Performance") +
  theme(legend.position="none")
Missing Value Check
Count the missing values in each column.
sapply(df,function(x) sum(is.na(x)))
## Penj_21 Penj_20 RB_Penj21 RB_Penj20 SL21 SL20 LR21 LR20
## 0 0 0 0 0 0 0 0
## Liab_21 Liab_20 Aset21 Aset20 Kas21 Kas20 Performa
## 0 0 0 0 0 0 0
There are no missing values, so no missing-value handling is needed.
Correlation Between Features
Check the correlation between features, first converting the categorical variables to numeric.
# unclass() drops the factor class and leaves the integer level codes (1 = "bidang lainnya", 2 = "bidang usaha")
df[,c('SL21','SL20')] <- sapply(df[,c('SL21','SL20')], unclass)
# Re-run the descriptive statistics on the converted data
skimmed <- skim_to_wide(df)
skimmed
| Name | Piped data |
|---|---|
| Number of rows | 2396 |
| Number of columns | 15 |
| Column type frequency: | |
| factor | 1 |
| numeric | 14 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Performa | 0 | 1 | FALSE | 2 | bur: 2230, bai: 166 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Penj_21 | 0 | 1 | 9.584888e+07 | 3.040022e+07 | 50004986 | 7.178802e+07 | 9.104714e+07 | 1.179365e+08 | 169934173 | ▇▇▅▃▂ |
| Penj_20 | 0 | 1 | 9.641617e+07 | 3.022387e+07 | 50015181 | 7.299453e+07 | 9.106843e+07 | 1.189492e+08 | 169882206 | ▆▇▃▃▂ |
| RB_Penj21 | 0 | 1 | 5.100000e-01 | 2.900000e-01 | 0 | 2.500000e-01 | 5.100000e-01 | 7.600000e-01 | 1 | ▇▇▇▇▇ |
| RB_Penj20 | 0 | 1 | 5.000000e-01 | 2.900000e-01 | 0 | 2.400000e-01 | 5.100000e-01 | 7.400000e-01 | 1 | ▇▇▇▇▇ |
| SL21 | 0 | 1 | 1.850000e+00 | 3.600000e-01 | 1 | 2.000000e+00 | 2.000000e+00 | 2.000000e+00 | 2 | ▂▁▁▁▇ |
| SL20 | 0 | 1 | 1.860000e+00 | 3.500000e-01 | 1 | 2.000000e+00 | 2.000000e+00 | 2.000000e+00 | 2 | ▁▁▁▁▇ |
| LR21 | 0 | 1 | 1.528404e+09 | 4.013853e+09 | -9973969453 | -1.327392e+09 | 1.969982e+09 | 4.252804e+09 | 9970196722 | ▁▂▇▇▂ |
| LR20 | 0 | 1 | 1.726908e+09 | 3.970113e+09 | -9996772003 | -1.162818e+09 | 2.150437e+09 | 4.444185e+09 | 9990549175 | ▁▂▆▇▂ |
| Liab_21 | 0 | 1 | 6.009127e+07 | 2.305624e+07 | 20033006 | 3.995161e+07 | 6.001215e+07 | 8.042310e+07 | 99985816 | ▇▇▇▇▇ |
| Liab_20 | 0 | 1 | 6.086028e+07 | 2.317820e+07 | 20039450 | 4.103766e+07 | 6.090982e+07 | 8.070269e+07 | 99964618 | ▇▇▇▇▇ |
| Aset21 | 0 | 1 | 1.701125e+08 | 3.407744e+07 | 120047150 | 1.399093e+08 | 1.650409e+08 | 1.962644e+08 | 239980549 | ▇▅▅▅▃ |
| Aset20 | 0 | 1 | 1.701452e+08 | 3.399318e+07 | 120049822 | 1.403411e+08 | 1.649616e+08 | 1.964649e+08 | 239919416 | ▇▆▅▅▃ |
| Kas21 | 0 | 1 | 1.766185e+07 | 8.690964e+06 | 125897 | 1.018294e+07 | 1.767224e+07 | 2.494508e+07 | 34504948 | ▅▇▇▇▆ |
| Kas20 | 0 | 1 | 2.515371e+07 | 8.598909e+06 | 10019797 | 1.781578e+07 | 2.528927e+07 | 3.245916e+07 | 39994863 | ▇▇▇▇▇ |
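The integer coding used in this conversion determines how the sign of a correlation should be read. The toy example below is a sketch with made-up values; `performa`, `codes`, and `toy` are illustrative names.

```r
# Toy factor with the two Performa levels; levels are sorted alphabetically.
performa <- factor(c("baik", "buruk", "buruk", "baik"))

levels(performa)            # "baik" is the first level, so it gets code 1
codes <- unclass(performa)
as.integer(codes)           # 1 2 2 1

# data.matrix() applies the same coding to every factor column.
toy <- data.frame(Performa = performa, Kas21 = c(10, 30, 25, 12))
data.matrix(toy)
cor(data.matrix(toy))
```

Since "buruk" is coded 2, a positive Pearson coefficient between a feature and Performa means that higher values of the feature go with poor performance.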
The correlation between features is analysed as follows.
# data.matrix() converts the remaining factor (Performa) to integer codes: 1 = "baik", 2 = "buruk"
cordata <- data.matrix(df)
cormat <- round(cor(cordata, method = "pearson"),2)
library(reshape2)
melted_cormat <- melt(cormat)
# Get the lower triangle of the correlation matrix
get_lower_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
# Get the upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
upper_tri <- get_upper_tri(cormat)
# Melt the upper triangle into long format, dropping the NA cells
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Build the heatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limits = c(-1,1), space = "Lab",
                       name="Pearson\nCorrelation") +
  theme_minimal() + # minimal theme
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                   size = 9, hjust = 1))+
  coord_fixed()
# Add the coefficients as text labels and tidy up the theme
ggheatmap +
  geom_text(aes(Var2, Var1, label = value), color = "black",
            size = 2) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.ticks = element_blank(),
    legend.justification = c(1, 0),
    legend.position = c(0.6, 0.7),
    legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                               title.position = "top",
                               title.hjust = 0.5))
Data Preparation
Resampling
The class distribution of Performa is imbalanced, as the earlier bar chart shows, so the data needs to be resampled.
plotImbalance <- function(data){
  ggplot(data, aes(x = Performa)) +
    geom_bar(fill = "blue") +   # constant colour belongs outside aes()
    ggtitle("Distribution of MSMEs with Good and Poor Performance") +
    theme(legend.position="none")
}
head(df)
## Penj_21 Penj_20 RB_Penj21 RB_Penj20 SL21 SL20 LR21 LR20
## 1 149207164 136533758 0.86817985 0.85582566 2 2 8760704261 7317358071
## 2 88706100 85686167 0.97856674 0.73113980 1 2 5637124272 4835651598
## 3 108887378 78405829 0.06015514 0.08093554 2 2 4861115054 3990873285
## 4 133103033 120427245 0.53161329 0.98946267 2 2 3611989881 4587487935
## 5 136475517 86949433 0.22992654 0.16201631 1 1 5773418132 4935491114
## 6 117591684 96478934 0.51937699 0.66088548 2 2 2153954438 4328320319
## Liab_21 Liab_20 Aset21 Aset20 Kas21 Kas20 Performa
## 1 20792734 84451989 195406981 183660291 2523754 11437529 baik
## 2 24472651 98329359 174655998 199015767 30930175 36333943 buruk
## 3 20586078 67450502 182547662 177529646 194345 10186417 baik
## 4 26786479 79683551 190119809 185314245 24602549 30149281 baik
## 5 33833705 96559757 196396280 195426775 4836727 11705490 buruk
## 6 21977440 58498050 190926650 182901681 13546277 20701195 baik
plotImbalance(df)
library(imbalance)
imbalanceRatio(df, classAttr = "Performa")
## [1] 0.07443946
Several resampling algorithms are compared for balancing the class distribution. If a model is trained on the imbalanced data, it learns mostly from MSMEs with poor performance and therefore leans towards predicting poor performance.
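The reported ratio and the risk described above can be checked by hand from the class counts in the summary. This sketch assumes the imbalance ratio is the minority size divided by the majority size, which is consistent with the value printed by `imbalanceRatio()`; `counts`, `ratio`, and `baseline_accuracy` are illustrative names.

```r
# Class counts after recoding, taken from the summary() output.
counts <- c(baik = 166, buruk = 2230)

# Imbalance ratio as minority size divided by majority size.
ratio <- min(counts) / max(counts)
round(ratio, 8)

# Accuracy of a trivial classifier that always predicts the majority class.
baseline_accuracy <- max(counts) / sum(counts)
round(baseline_accuracy, 4)
```

A model that always answers "buruk" would already be right for roughly 93% of the rows, which is why accuracy on the raw data says little about how well the "baik" class is recognised.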
- Synthetic Minority Over-sampling Technique (SMOTE)
dfSMOTE <- oversample(df, ratio = 0.90, method = "SMOTE",
classAttr = "Performa")
plotImbalance(dfSMOTE)
imbalanceRatio(dfSMOTE, classAttr = "Performa")
## [1] 0.9
- Adaptive Synthetic (ADASYN)
dfADASYN <- oversample(df, method = "ADASYN", classAttr = "Performa")
plotImbalance(dfADASYN)
imbalanceRatio(dfADASYN, classAttr = "Performa")
## [1] 0.9869955
- Majority-Weighted Minority Over-sampling Technique (MWMOTE)
dfMWMOTE <- oversample(df, ratio = 0.95, method = "MWMOTE", classAttr = "Performa")
plotImbalance(dfMWMOTE)
imbalanceRatio(dfMWMOTE, classAttr = "Performa")
## [1] 0.9502242
Of the three resampling methods, ADASYN is chosen because it produces the most balanced classes, with an imbalance ratio of about 0.987 compared with 0.90 for SMOTE and 0.95 for MWMOTE.
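The ratios can be turned back into class sizes to see what each method produced. The sketch below assumes that oversampling leaves the 2230 majority rows untouched and only adds minority rows; `majority`, `ratios`, `minority`, and `total` are illustrative names.

```r
majority <- 2230   # "buruk" rows, assumed unchanged by oversampling

# Imbalance ratios reported after each method.
ratios <- c(SMOTE = 0.9, ADASYN = 0.9869955, MWMOTE = 0.9502242)

# Implied minority size and total number of rows per method.
minority <- round(ratios * majority)
total <- minority + majority
data.frame(minority, total)
```

For ADASYN this gives 2201 minority rows and 4431 rows in total, which agrees with the sample count that caret reports for the models trained on `dfADASYN`.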
library(writexl)
write_xlsx(dfADASYN,"dataADASYN.xlsx")
Modelling
The first step is to set up k-fold cross-validation.
k-fold cross-validation (image source: https://rpubs.com/jvaldeleon)
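To illustrate what 10-fold cross-validation does with 4431 rows, the base R sketch below assigns rows to folds at random and derives the training-set sizes. It is a simplification: caret builds its folds with class stratification, which this sketch does not attempt. `fold_id`, `holdout_sizes`, and `training_sizes` are illustrative names.

```r
set.seed(123)
n <- 4431   # rows in the ADASYN-resampled data
k <- 10     # number of folds

# Assign every row to one of k folds at random, keeping fold sizes near-equal.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once; the other k - 1 folds form the training set.
holdout_sizes <- as.integer(table(fold_id))
training_sizes <- n - holdout_sizes
training_sizes
```

Every row is used for validation exactly once, and each model is trained on 3987 or 3988 rows, the same sizes that appear in the "Summary of sample sizes" lines of the caret output.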
# Set the seed so the random splits are reproducible
set.seed(123)
# Define repeated cross-validation with 10 folds and a single repetition
library(caret)
evaluationSetting <- trainControl(method='repeatedcv',
                                  number=10,
                                  repeats=1,
                                  summaryFunction = multiClassSummary)
Decision Tree
Classification and Regression Tree (CART)
DT_Model <- train(Performa~.,
data=dfADASYN,
method="rpart",
trControl=evaluationSetting)
print(DT_Model)
## CART
##
## 4431 samples
## 14 predictor
## 2 classes: 'baik', 'buruk'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 3987, 3988, 3988, 3988, 3988, 3988, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa F1 Sensitivity Specificity
## 0.1004089 0.7962118 0.5930495 0.8169223 0.9136672 0.6802691
## 0.1785552 0.7165472 0.4350117 0.7740602 0.9718244 0.4645740
## 0.3698319 0.5897286 0.1776729 0.7531898 0.4972727 0.6811659
## Pos_Pred_Value Neg_Pred_Value Precision Recall Detection_Rate
## 0.7410003 0.8929800 0.7410003 0.9136672 0.4538456
## 0.6451266 0.9525796 0.6451266 0.9718244 0.4827334
## 0.6061078 0.7442883 0.6061078 0.4972727 0.2469526
## Balanced_Accuracy
## 0.7969681
## 0.7181992
## 0.5892193
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.1004089.
Logistic Regression
RegLog_Model <- train(Performa~.,
data=dfADASYN,
method="glm",
family = 'binomial',
trControl=evaluationSetting)
print(RegLog_Model)
## Generalized Linear Model
##
## 4431 samples
## 14 predictor
## 2 classes: 'baik', 'buruk'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 3988, 3988, 3987, 3988, 3988, 3988, ...
## Resampling results:
##
## Accuracy Kappa F1 Sensitivity Specificity Pos_Pred_Value
## 0.869106 0.7382837 0.8712062 0.8914315 0.8470852 0.8522767
## Neg_Pred_Value Precision Recall Detection_Rate Balanced_Accuracy
## 0.8880895 0.8522767 0.8914315 0.4427938 0.8692584
Support Vector Machine (SVM) with Radial Kernel
library(kernlab)
SVMRad_Model <- train(Performa~.,
data=dfADASYN,
method="svmRadial",
trControl=evaluationSetting)
print(SVMRad_Model)
## Support Vector Machines with Radial Basis Function Kernel
##
## 4431 samples
## 14 predictor
## 2 classes: 'baik', 'buruk'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 3988, 3988, 3988, 3988, 3988, 3987, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa F1 Sensitivity Specificity
## 0.25 0.9451610 0.8903608 0.9464649 0.9731900 0.9174888
## 0.50 0.9535101 0.9070500 0.9544987 0.9795496 0.9278027
## 1.00 0.9620849 0.9241955 0.9629253 0.9890930 0.9354260
## Pos_Pred_Value Neg_Pred_Value Precision Recall Detection_Rate
## 0.9215046 0.9720526 0.9215046 0.9731900 0.4834111
## 0.9309551 0.9788086 0.9309551 0.9795496 0.4865704
## 0.9382552 0.9886282 0.9382552 0.9890930 0.4913103
## Balanced_Accuracy
## 0.9453394
## 0.9536761
## 0.9622595
##
## Tuning parameter 'sigma' was held constant at a value of 0.05147064
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05147064 and C = 1.
Support Vector Machine (SVM) with Linear Kernel
SVMLin_Model <- train(Performa~.,
data=dfADASYN,
method="svmLinear",
trControl=evaluationSetting)
print(SVMLin_Model)
## Support Vector Machines with Linear Kernel
##
## 4431 samples
## 14 predictor
## 2 classes: 'baik', 'buruk'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 3987, 3988, 3988, 3988, 3988, 3988, ...
## Resampling results:
##
## Accuracy Kappa F1 Sensitivity Specificity Pos_Pred_Value
## 0.8697786 0.7396848 0.8739646 0.908225 0.8318386 0.8427512
## Neg_Pred_Value Precision Recall Detection_Rate Balanced_Accuracy
## 0.9021679 0.8427512 0.908225 0.4511399 0.8700318
##
## Tuning parameter 'C' was held constant at a value of 1
Naive Bayes
library(naivebayes)
NB_Model <- train(Performa~.,
data=dfADASYN,
method="naive_bayes",
trControl=evaluationSetting)
print(NB_Model)
## Naive Bayes
##
## 4431 samples
## 14 predictor
## 2 classes: 'baik', 'buruk'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 3988, 3988, 3988, 3988, 3987, 3988, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa F1 Sensitivity Specificity
## FALSE 0.6795315 0.3617138 0.7561499 1.0000000 0.3632287
## TRUE 0.8521765 0.7039299 0.8335684 0.7464644 0.9565022
## Pos_Pred_Value Neg_Pred_Value Precision Recall Detection_Rate
## 0.6079515 1.0000000 0.6079515 1.0000000 0.4967274
## 0.9448611 0.7930638 0.9448611 0.7464644 0.3707929
## Balanced_Accuracy
## 0.6816143
## 0.8514833
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
## and adjust = 1.
Random Forest
library(randomForest)
RF_Model <- train(Performa~.,
data=dfADASYN,
method="rf",
tuneGrid=expand.grid(.mtry= 7),
trControl=evaluationSetting,
ntree = 300)
print(RF_Model)
## Random Forest
##
## 4431 samples
## 14 predictor
## 2 classes: 'baik', 'buruk'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 3988, 3988, 3988, 3988, 3988, 3988, ...
## Resampling results:
##
## Accuracy Kappa F1 Sensitivity Specificity Pos_Pred_Value
## 0.980139 0.9602824 0.9802359 0.9909132 0.9695067 0.9698749
## Neg_Pred_Value Precision Recall Detection_Rate Balanced_Accuracy
## 0.9909016 0.9698749 0.9909132 0.4922137 0.98021
##
## Tuning parameter 'mtry' was held constant at a value of 7
Evaluation
Performance Comparison
# Compare model performance using resamples()
models_compare <- resamples(list(DecisionTree = DT_Model,
LogisticRegression = RegLog_Model,
SupportVectorMachineRadial = SVMRad_Model,
SupportVectorMachineLinear = SVMLin_Model,
NaiveBayes = NB_Model,
RandomForest = RF_Model
))
# Summary of model performance
summary(models_compare)
##
## Call:
## summary.resamples(object = models_compare)
##
## Models: DecisionTree, LogisticRegression, SupportVectorMachineRadial, SupportVectorMachineLinear, NaiveBayes, RandomForest
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu.
## DecisionTree 0.7629797 0.7777757 0.7878104 0.7962118 0.8182844
## LogisticRegression 0.8465011 0.8619746 0.8702032 0.8691060 0.8752822
## SupportVectorMachineRadial 0.9367946 0.9520519 0.9661400 0.9620849 0.9706546
## SupportVectorMachineLinear 0.8419865 0.8628668 0.8656885 0.8697786 0.8783059
## NaiveBayes 0.8329571 0.8442438 0.8545620 0.8521765 0.8600451
## RandomForest 0.9729120 0.9779910 0.9819413 0.9801390 0.9836343
## Max. NA's
## DecisionTree 0.8306998 0
## LogisticRegression 0.8871332 0
## SupportVectorMachineRadial 0.9796840 0
## SupportVectorMachineLinear 0.9097065 0
## NaiveBayes 0.8645598 0
## RandomForest 0.9842342 0
##
## Balanced_Accuracy
## Min. 1st Qu. Median Mean 3rd Qu.
## DecisionTree 0.7643294 0.7788655 0.7882898 0.7969681 0.8187806
## LogisticRegression 0.8468304 0.8619749 0.8704494 0.8692584 0.8754484
## SupportVectorMachineRadial 0.9370669 0.9522804 0.9662913 0.9622595 0.9708214
## SupportVectorMachineLinear 0.8423461 0.8632236 0.8659040 0.8700318 0.8784284
## NaiveBayes 0.8320628 0.8435105 0.8541044 0.8514833 0.8594043
## RandomForest 0.9729719 0.9780932 0.9819863 0.9802100 0.9836756
## Max. NA's
## DecisionTree 0.8313188 0
## LogisticRegression 0.8873115 0
## SupportVectorMachineRadial 0.9797595 0
## SupportVectorMachineLinear 0.9100082 0
## NaiveBayes 0.8639115 0
## RandomForest 0.9842744 0
##
## Detection_Rate
## Min. 1st Qu. Median Mean 3rd Qu.
## DecisionTree 0.4153499 0.4401806 0.4593679 0.4538456 0.4681355
## LogisticRegression 0.4234234 0.4401806 0.4458239 0.4427938 0.4492099
## SupportVectorMachineRadial 0.4830700 0.4864560 0.4932280 0.4913103 0.4952108
## SupportVectorMachineLinear 0.4311512 0.4458239 0.4498302 0.4511399 0.4548533
## NaiveBayes 0.3476298 0.3594808 0.3758465 0.3707929 0.3814898
## RandomForest 0.4875847 0.4898420 0.4938000 0.4922137 0.4943567
## Max. NA's
## DecisionTree 0.4785553 0
## LogisticRegression 0.4537246 0
## SupportVectorMachineRadial 0.4966140 0
## SupportVectorMachineLinear 0.4740406 0
## NaiveBayes 0.3873874 0
## RandomForest 0.4966140 0
##
## F1
## Min. 1st Qu. Median Mean 3rd Qu.
## DecisionTree 0.7982646 0.8040349 0.8097055 0.8169223 0.8297685
## LogisticRegression 0.8528139 0.8609848 0.8736232 0.8712062 0.8777588
## SupportVectorMachineRadial 0.9388646 0.9534819 0.9666626 0.9629253 0.9711752
## SupportVectorMachineLinear 0.8491379 0.8668642 0.8706800 0.8739646 0.8806552
## NaiveBayes 0.8062827 0.8242305 0.8396366 0.8335684 0.8451566
## RandomForest 0.9729730 0.9781799 0.9819405 0.9802359 0.9836310
## Max. NA's
## DecisionTree 0.8440748 0
## LogisticRegression 0.8893805 0
## SupportVectorMachineRadial 0.9797753 0
## SupportVectorMachineLinear 0.9130435 0
## NaiveBayes 0.8492462 0
## RandomForest 0.9842697 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu.
## DecisionTree 0.5272241 0.5565102 0.5760176 0.5930495 0.6369179
## LogisticRegression 0.6931945 0.7239431 0.7405253 0.7382837 0.7506385
## SupportVectorMachineRadial 0.8736529 0.9041455 0.9322975 0.9241955 0.9413263
## SupportVectorMachineLinear 0.6841902 0.7259202 0.7314846 0.7396848 0.7566669
## NaiveBayes 0.6653054 0.6880217 0.7088492 0.7039299 0.7197218
## RandomForest 0.9458281 0.9559891 0.9638843 0.9602824 0.9672701
## Max. NA's
## DecisionTree 0.6618080 0
## LogisticRegression 0.7743388 0
## SupportVectorMachineRadial 0.9593723 0
## SupportVectorMachineLinear 0.8195152 0
## NaiveBayes 0.7287589 0
## RandomForest 0.9684697 0
##
## Neg_Pred_Value
## Min. 1st Qu. Median Mean 3rd Qu.
## DecisionTree 0.8217822 0.8622549 0.9052198 0.8929800 0.9163036
## LogisticRegression 0.8546256 0.8835813 0.8916233 0.8880895 0.8996377
## SupportVectorMachineRadial 0.9715640 0.9791574 0.9930205 0.9886282 0.9952830
## SupportVectorMachineLinear 0.8693694 0.8919823 0.9012106 0.9021679 0.9075126
## NaiveBayes 0.7651246 0.7811488 0.8011152 0.7930638 0.8053349
## RandomForest 0.9817352 0.9863943 0.9931071 0.9909016 0.9953810
## Max. NA's
## DecisionTree 0.9402985 0
## LogisticRegression 0.9099526 0
## SupportVectorMachineRadial 1.0000000 0
## SupportVectorMachineLinear 0.9507389 0
## NaiveBayes 0.8100775 0
## RandomForest 1.0000000 0
##
## Pos_Pred_Value
## Min. 1st Qu. Median Mean 3rd Qu.
## DecisionTree 0.6860841 0.7058465 0.7505640 0.7410003 0.7725310
## LogisticRegression 0.8140496 0.8457935 0.8530476 0.8522767 0.8654775
## SupportVectorMachineRadial 0.9033613 0.9258529 0.9421111 0.9382552 0.9482199
## SupportVectorMachineLinear 0.8073770 0.8230579 0.8430736 0.8427512 0.8637073
## NaiveBayes 0.9184783 0.9293245 0.9500277 0.9448611 0.9533221
## RandomForest 0.9521739 0.9644424 0.9710604 0.9698749 0.9765923
## Max. NA's
## DecisionTree 0.7827869 0
## LogisticRegression 0.8684211 0
## SupportVectorMachineRadial 0.9688889 0
## SupportVectorMachineLinear 0.8766520 0
## NaiveBayes 0.9753086 0
## RandomForest 0.9819005 0
##
## Precision
## Min. 1st Qu. Median Mean 3rd Qu.
## DecisionTree 0.6860841 0.7058465 0.7505640 0.7410003 0.7725310
## LogisticRegression 0.8140496 0.8457935 0.8530476 0.8522767 0.8654775
## SupportVectorMachineRadial 0.9033613 0.9258529 0.9421111 0.9382552 0.9482199
## SupportVectorMachineLinear 0.8073770 0.8230579 0.8430736 0.8427512 0.8637073
## NaiveBayes 0.9184783 0.9293245 0.9500277 0.9448611 0.9533221
## RandomForest 0.9521739 0.9644424 0.9710604 0.9698749 0.9765923
## Max. NA's
## DecisionTree 0.7827869 0
## LogisticRegression 0.8684211 0
## SupportVectorMachineRadial 0.9688889 0
## SupportVectorMachineLinear 0.8766520 0
## NaiveBayes 0.9753086 0
## RandomForest 0.9819005 0
##
## Recall
## Min. 1st Qu. Median Mean 3rd Qu.
## DecisionTree 0.8363636 0.8863636 0.9250000 0.9136672 0.9421226
## LogisticRegression 0.8506787 0.8863636 0.8977273 0.8914315 0.9045455
## SupportVectorMachineRadial 0.9727273 0.9795455 0.9931818 0.9890930 0.9954700
## SupportVectorMachineLinear 0.8681818 0.8977273 0.9047614 0.9082250 0.9159091
## NaiveBayes 0.7000000 0.7238636 0.7568182 0.7464644 0.7681818
## RandomForest 0.9818182 0.9863636 0.9932024 0.9909132 0.9954545
## Max. NA's
## DecisionTree 0.9636364 0
## LogisticRegression 0.9136364 0
## SupportVectorMachineRadial 1.0000000 0
## SupportVectorMachineLinear 0.9545455 0
## NaiveBayes 0.7782805 0
## RandomForest 1.0000000 0
##
## Sensitivity
## Min. 1st Qu. Median Mean 3rd Qu.
## DecisionTree 0.8363636 0.8863636 0.9250000 0.9136672 0.9421226
## LogisticRegression 0.8506787 0.8863636 0.8977273 0.8914315 0.9045455
## SupportVectorMachineRadial 0.9727273 0.9795455 0.9931818 0.9890930 0.9954700
## SupportVectorMachineLinear 0.8681818 0.8977273 0.9047614 0.9082250 0.9159091
## NaiveBayes 0.7000000 0.7238636 0.7568182 0.7464644 0.7681818
## RandomForest 0.9818182 0.9863636 0.9932024 0.9909132 0.9954545
## Max. NA's
## DecisionTree 0.9636364 0
## LogisticRegression 0.9136364 0
## SupportVectorMachineRadial 1.0000000 0
## SupportVectorMachineLinear 0.9545455 0
## NaiveBayes 0.7782805 0
## RandomForest 1.0000000 0
##
## Specificity
## Min. 1st Qu. Median Mean 3rd Qu.
## DecisionTree 0.5650224 0.6121076 0.7085202 0.6802691 0.7399103
## LogisticRegression 0.7982063 0.8374439 0.8497758 0.8470852 0.8609865
## SupportVectorMachineRadial 0.8968610 0.9226457 0.9394619 0.9354260 0.9461883
## SupportVectorMachineLinear 0.7892377 0.8038117 0.8340807 0.8318386 0.8632287
## NaiveBayes 0.9327354 0.9439462 0.9618834 0.9565022 0.9641256
## RandomForest 0.9506726 0.9641256 0.9708520 0.9695067 0.9764574
## Max. NA's
## DecisionTree 0.7623318 0
## LogisticRegression 0.8699552 0
## SupportVectorMachineRadial 0.9686099 0
## SupportVectorMachineLinear 0.8744395 0
## NaiveBayes 0.9820628 0
## RandomForest 0.9820628 0
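The metrics in the summary all derive from a 2x2 confusion matrix. The sketch below computes them from made-up counts, taking "baik" as the positive class (the first factor level); the counts and variable names are hypothetical and do not come from the models above.

```r
# Hypothetical confusion-matrix counts, with "baik" as the positive class.
tp <- 90; fn <- 10   # actual baik:  predicted baik / predicted buruk
fp <- 20; tn <- 80   # actual buruk: predicted baik / predicted buruk
n  <- tp + fn + fp + tn

sensitivity <- tp / (tp + fn)   # identical to recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # identical to positive predictive value
npv         <- tn / (tn + fn)   # negative predictive value
accuracy    <- (tp + tn) / n
balanced_accuracy <- (sensitivity + specificity) / 2
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
detection_rate <- tp / n

# Cohen's kappa: observed agreement corrected for chance agreement.
p_chance <- ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n^2
kappa <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, kappa = kappa, f1 = f1,
        balanced_accuracy = balanced_accuracy), 4)
```

This explains why the Precision and Pos_Pred_Value blocks are identical, as are Recall and Sensitivity. The same relation can be checked against the reported values: for the decision tree, the mean of sensitivity 0.9136672 and specificity 0.6802691 is 0.7969681, the reported balanced accuracy.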
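Conclusion
Comparing the evaluation results of the Decision Tree, Logistic Regression, SVM (radial and linear kernels), Naive Bayes, and Random Forest models on the ADASYN-resampled data, the best model for identifying MSME performance is Random Forest, with a mean cross-validated accuracy of about 0.98, ahead of the radial-kernel SVM at about 0.96.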