Install the required packages from the console:
install.packages(c("dplyr", "readxl", "rpart", "rattle", "caret"))
Step 1: Loading the Dataset
# Load the data from the location where it is stored
library(readxl)
OVA <- read_excel("C:/Bibib/03_MATERI NGAJAR/2122_GASAL/DATA MINING/OVA.xlsx")
# Summarize the data
summary(OVA)
## JK Umur Pekerjaan LamaNasabah
## Length:1600 Length:1600 Length:1600 Length:1600
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Persepsi
## Length:1600
## Class :character
## Mode :character
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Convert all character variables to factors (categorical data)
OVA <- OVA %>%
  mutate_if(is.character, as.factor)
summary(OVA)
##          JK               Umur                Pekerjaan      LamaNasabah
##  Laki-laki:943   <25 tahun  :208   Ibu rumah tangga: 32   <1 tahun :174
##  Wanita   :657   >55 tahun  :176   Mahasiswa       :144   >5 tahun :576
##                  25-35 tahun:430   Pengusaha       :543   1-2 tahun:224
##                  36-45 tahun:578   PNS             :224   2-5 tahun:626
##                  46-55 tahun:208   Swasta          :657
##        Persepsi
##  Cukup Puas: 433
##  Puas      :1167
##
##
##
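The character-to-factor conversion can also be done without dplyr. The following is a minimal base R sketch on a small made-up data frame; `toy` and its column `Skor` are hypothetical and are not part of the OVA data.

```r
# Hypothetical toy data: one character column, one numeric column
toy <- data.frame(
  JK = c("Laki-laki", "Wanita", "Wanita"),
  Skor = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Find the character columns, then convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

Numeric columns are left untouched, which is the same behavior as `mutate_if(is.character, as.factor)`.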
# To check the dimensions of the data, use the following command
dim(OVA)
## [1] 1600 5
Step 2: Splitting the Data into Training and Testing Sets
In this case the data are split 80:20, that is, 80% for training and 20% for testing. Here I use set.seed(123). The seed value initializes R's random number generator so that the same random sample can be drawn again; it is not the number of times the data are shuffled. Any integer can be used as the seed, but for this lesson use 123 so that your random split matches the one shown here.
# Split the data into training and testing sets
n <- round(nrow(OVA)*0.80);n
## [1] 1280
set.seed(123)
samp=sample(1:nrow(OVA),n)
data.train = OVA[samp,]
data.test = OVA[-samp,]
dim(data.train)
## [1] 1280 5
dim(data.test)
## [1] 320 5
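To see what the seed does, the short sketch below draws the same sample twice and then repeats the 80:20 split on R's built-in `iris` data. The use of `iris` and the names `first`, `second`, `idx`, `train`, and `test` are illustrative assumptions, not part of the OVA analysis.

```r
# Resetting the seed before each draw gives the same "random" sample
set.seed(123)
first <- sample(1:10, 5)
set.seed(123)
second <- sample(1:10, 5)
identical(first, second)

# The same 80:20 split pattern on the built-in iris data (150 rows)
n <- round(nrow(iris) * 0.80)
set.seed(123)
idx <- sample(seq_len(nrow(iris)), n)
train <- iris[idx, ]    # rows drawn by sample()
test <- iris[-idx, ]    # all remaining rows
dim(train)
dim(test)
```

Because the testing set is built with negative indexing, no row can appear in both sets.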
Step 3: Classification with a Decision Tree
The next step is to build a decision tree classification model using the training data.
library(rpart)
fit <- rpart(Persepsi~., data = data.train, method = 'class')
summary(fit)
## Call:
## rpart(formula = Persepsi ~ ., data = data.train, method = "class")
## n= 1280
##
## CP nsplit rel error xerror xstd
## 1 0.11420613 0 1.0000000 1.0000000 0.04476902
## 2 0.03621170 1 0.8857939 0.8857939 0.04306274
## 3 0.03528319 2 0.8495822 0.8440111 0.04236129
## 4 0.03203343 6 0.7047354 0.7186630 0.03997936
## 5 0.01810585 12 0.5125348 0.5125348 0.03496349
## 6 0.01671309 14 0.4763231 0.5069638 0.03480466
## 7 0.01000000 16 0.4428969 0.4623955 0.03348091
##
## Variable importance
## JK LamaNasabah Umur Pekerjaan
## 45 25 23 7
##
## Node number 1: 1280 observations, complexity param=0.1142061
## predicted class=Puas expected loss=0.2804687 P(node) =1
## class counts: 359 921
## probabilities: 0.280 0.720
## left son=2 (145 obs) right son=3 (1135 obs)
## Primary splits:
## Umur splits as RLRRR, improve=42.600170, (0 missing)
## Pekerjaan splits as LRRLR, improve= 6.252716, (0 missing)
## LamaNasabah splits as RLLL, improve= 1.377072, (0 missing)
## JK splits as RL, improve= 1.096365, (0 missing)
## Surrogate splits:
## Pekerjaan splits as LRRRR, agree=0.89, adj=0.028, (0 split)
##
## Node number 2: 145 observations, complexity param=0.0362117
## predicted class=Cukup Puas expected loss=0.3586207 P(node) =0.1132812
## class counts: 93 52
## probabilities: 0.641 0.359
## left son=4 (132 obs) right son=5 (13 obs)
## Primary splits:
## LamaNasabah splits as -LLR, improve=11.748900, (0 missing)
## Pekerjaan splits as L-R--, improve= 4.625929, (0 missing)
## JK splits as RL, improve= 2.932922, (0 missing)
##
## Node number 3: 1135 observations, complexity param=0.03528319
## predicted class=Puas expected loss=0.2343612 P(node) =0.8867188
## class counts: 266 869
## probabilities: 0.234 0.766
## left son=6 (799 obs) right son=7 (336 obs)
## Primary splits:
## Umur splits as R-LLR, improve=11.4168200, (0 missing)
## Pekerjaan splits as RRRLR, improve= 7.3098630, (0 missing)
## LamaNasabah splits as RRLL, improve= 6.7135900, (0 missing)
## JK splits as RL, improve= 0.6641095, (0 missing)
## Surrogate splits:
## Pekerjaan splits as RRLLL, agree=0.817, adj=0.381, (0 split)
## LamaNasabah splits as RLLL, agree=0.785, adj=0.274, (0 split)
##
## Node number 4: 132 observations
## predicted class=Cukup Puas expected loss=0.2954545 P(node) =0.103125
## class counts: 93 39
## probabilities: 0.705 0.295
##
## Node number 5: 13 observations
## predicted class=Puas expected loss=0 P(node) =0.01015625
## class counts: 0 13
## probabilities: 0.000 1.000
##
## Node number 6: 799 observations, complexity param=0.03528319
## predicted class=Puas expected loss=0.2803504 P(node) =0.6242188
## class counts: 224 575
## probabilities: 0.280 0.720
## left son=12 (155 obs) right son=13 (644 obs)
## Primary splits:
## Pekerjaan splits as --RLR, improve=8.137406000, (0 missing)
## LamaNasabah splits as RLLL, improve=3.889455000, (0 missing)
## JK splits as LR, improve=0.268494700, (0 missing)
## Umur splits as --RL-, improve=0.005302165, (0 missing)
##
## Node number 7: 336 observations
## predicted class=Puas expected loss=0.125 P(node) =0.2625
## class counts: 42 294
## probabilities: 0.125 0.875
##
## Node number 12: 155 observations, complexity param=0.03528319
## predicted class=Puas expected loss=0.4258065 P(node) =0.1210938
## class counts: 66 89
## probabilities: 0.426 0.574
## left son=24 (38 obs) right son=25 (117 obs)
## Primary splits:
## LamaNasabah splits as -LRR, improve=33.19526, (0 missing)
## JK splits as LR, improve=28.82432, (0 missing)
## Umur splits as --RL-, improve=25.21411, (0 missing)
##
## Node number 13: 644 observations, complexity param=0.03203343
## predicted class=Puas expected loss=0.2453416 P(node) =0.503125
## class counts: 158 486
## probabilities: 0.245 0.755
## left son=26 (461 obs) right son=27 (183 obs)
## Primary splits:
## LamaNasabah splits as RRLL, improve=14.575080, (0 missing)
## Umur splits as --LR-, improve= 6.688275, (0 missing)
## JK splits as RL, improve= 3.057631, (0 missing)
## Pekerjaan splits as --R-L, improve= 1.188159, (0 missing)
##
## Node number 24: 38 observations
## predicted class=Cukup Puas expected loss=0 P(node) =0.0296875
## class counts: 38 0
## probabilities: 1.000 0.000
##
## Node number 25: 117 observations, complexity param=0.03528319
## predicted class=Puas expected loss=0.2393162 P(node) =0.09140625
## class counts: 28 89
## probabilities: 0.239 0.761
## left son=50 (42 obs) right son=51 (75 obs)
## Primary splits:
## JK splits as LR, improve=23.931620, (0 missing)
## Umur splits as --RL-, improve= 9.322928, (0 missing)
## LamaNasabah splits as --RL, improve= 3.103554, (0 missing)
## Surrogate splits:
## Umur splits as --RL-, agree=0.769, adj=0.357, (0 split)
##
## Node number 26: 461 observations, complexity param=0.03203343
## predicted class=Puas expected loss=0.3123644 P(node) =0.3601563
## class counts: 144 317
## probabilities: 0.312 0.688
## left son=52 (222 obs) right son=53 (239 obs)
## Primary splits:
## Umur splits as --LR-, improve=2.78298600, (0 missing)
## Pekerjaan splits as --L-R, improve=0.84840840, (0 missing)
## JK splits as RL, improve=0.29976450, (0 missing)
## LamaNasabah splits as --RL, improve=0.02088677, (0 missing)
## Surrogate splits:
## LamaNasabah splits as --LR, agree=0.683, adj=0.342, (0 split)
## JK splits as RL, agree=0.523, adj=0.009, (0 split)
##
## Node number 27: 183 observations
## predicted class=Puas expected loss=0.07650273 P(node) =0.1429688
## class counts: 14 169
## probabilities: 0.077 0.923
##
## Node number 50: 42 observations
## predicted class=Cukup Puas expected loss=0.3333333 P(node) =0.0328125
## class counts: 28 14
## probabilities: 0.667 0.333
##
## Node number 51: 75 observations
## predicted class=Puas expected loss=0 P(node) =0.05859375
## class counts: 0 75
## probabilities: 0.000 1.000
##
## Node number 52: 222 observations, complexity param=0.03203343
## predicted class=Puas expected loss=0.3693694 P(node) =0.1734375
## class counts: 82 140
## probabilities: 0.369 0.631
## left son=104 (173 obs) right son=105 (49 obs)
## Primary splits:
## Pekerjaan splits as --R-L, improve=1.3618450, (0 missing)
## JK splits as LR, improve=0.5343538, (0 missing)
## LamaNasabah splits as --LR, improve=0.0636195, (0 missing)
##
## Node number 53: 239 observations, complexity param=0.01671309
## predicted class=Puas expected loss=0.2594142 P(node) =0.1867187
## class counts: 62 177
## probabilities: 0.259 0.741
## left son=106 (51 obs) right son=107 (188 obs)
## Primary splits:
## Pekerjaan splits as --L-R, improve=5.782573, (0 missing)
## LamaNasabah splits as --RL, improve=3.926533, (0 missing)
## JK splits as RL, improve=1.850180, (0 missing)
##
## Node number 104: 173 observations, complexity param=0.03203343
## predicted class=Puas expected loss=0.3988439 P(node) =0.1351563
## class counts: 69 104
## probabilities: 0.399 0.601
## left son=208 (79 obs) right son=209 (94 obs)
## Primary splits:
## LamaNasabah splits as --LR, improve=2.6148030, (0 missing)
## JK splits as LR, improve=0.2078039, (0 missing)
## Surrogate splits:
## JK splits as LR, agree=0.601, adj=0.127, (0 split)
##
## Node number 105: 49 observations, complexity param=0.01810585
## predicted class=Puas expected loss=0.2653061 P(node) =0.03828125
## class counts: 13 36
## probabilities: 0.265 0.735
## left son=210 (26 obs) right son=211 (23 obs)
## Primary splits:
## LamaNasabah splits as --RL, improve=6.102041, (0 missing)
## JK splits as LR, improve=2.490930, (0 missing)
## Surrogate splits:
## JK splits as RL, agree=0.735, adj=0.435, (0 split)
##
## Node number 106: 51 observations, complexity param=0.01671309
## predicted class=Puas expected loss=0.4705882 P(node) =0.03984375
## class counts: 24 27
## probabilities: 0.471 0.529
## left son=212 (12 obs) right son=213 (39 obs)
## Primary splits:
## JK splits as LR, improve=8.79638, (0 missing)
##
## Node number 107: 188 observations
## predicted class=Puas expected loss=0.2021277 P(node) =0.146875
## class counts: 38 150
## probabilities: 0.202 0.798
##
## Node number 208: 79 observations, complexity param=0.03203343
## predicted class=Puas expected loss=0.4936709 P(node) =0.06171875
## class counts: 39 40
## probabilities: 0.494 0.506
## left son=416 (39 obs) right son=417 (40 obs)
## Primary splits:
## JK splits as RL, improve=39.49367, (0 missing)
##
## Node number 209: 94 observations, complexity param=0.03203343
## predicted class=Puas expected loss=0.3191489 P(node) =0.0734375
## class counts: 30 64
## probabilities: 0.319 0.681
## left son=418 (30 obs) right son=419 (64 obs)
## Primary splits:
## JK splits as LR, improve=40.85106, (0 missing)
##
## Node number 210: 26 observations, complexity param=0.01810585
## predicted class=Cukup Puas expected loss=0.5 P(node) =0.0203125
## class counts: 13 13
## probabilities: 0.500 0.500
## left son=420 (13 obs) right son=421 (13 obs)
## Primary splits:
## JK splits as LR, improve=13, (0 missing)
##
## Node number 211: 23 observations
## predicted class=Puas expected loss=0 P(node) =0.01796875
## class counts: 0 23
## probabilities: 0.000 1.000
##
## Node number 212: 12 observations
## predicted class=Cukup Puas expected loss=0 P(node) =0.009375
## class counts: 12 0
## probabilities: 1.000 0.000
##
## Node number 213: 39 observations
## predicted class=Puas expected loss=0.3076923 P(node) =0.03046875
## class counts: 12 27
## probabilities: 0.308 0.692
##
## Node number 416: 39 observations
## predicted class=Cukup Puas expected loss=0 P(node) =0.03046875
## class counts: 39 0
## probabilities: 1.000 0.000
##
## Node number 417: 40 observations
## predicted class=Puas expected loss=0 P(node) =0.03125
## class counts: 0 40
## probabilities: 0.000 1.000
##
## Node number 418: 30 observations
## predicted class=Cukup Puas expected loss=0 P(node) =0.0234375
## class counts: 30 0
## probabilities: 1.000 0.000
##
## Node number 419: 64 observations
## predicted class=Puas expected loss=0 P(node) =0.05
## class counts: 0 64
## probabilities: 0.000 1.000
##
## Node number 420: 13 observations
## predicted class=Cukup Puas expected loss=0 P(node) =0.01015625
## class counts: 13 0
## probabilities: 1.000 0.000
##
## Node number 421: 13 observations
## predicted class=Puas expected loss=0 P(node) =0.01015625
## class counts: 0 13
## probabilities: 0.000 1.000
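Two numbers in this output are worth checking by hand. The sketch below copies the class counts of node 1 and the last `rel error` value from the CP table and converts them into a training error; the variable names are illustrative. It assumes the default 0/1 loss, under which the expected loss of a node is the share of its minority class, and it relies on `rel error` being expressed relative to the root node error.

```r
# Class counts at the root node, copied from the summary(fit) output
root_counts <- c("Cukup Puas" = 359, "Puas" = 921)
n_train <- sum(root_counts)

# Expected loss at node 1: share of observations outside the majority class
root_error <- min(root_counts) / n_train

# 'rel error' of the final tree (16 splits), copied from the CP table
rel_error_final <- 0.4428969

# Rescale by the root error to get the training misclassification rate
train_error <- rel_error_final * root_error
misclassified <- round(train_error * n_train)

root_error
train_error
misclassified
```

The same rescaling applies to the `xerror` column, which gives a cross-validated rather than a training estimate.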
fit$variable.importance
## JK LamaNasabah Umur Pekerjaan
## 129.08186 72.31485 65.34698 20.80627
barplot(fit$variable.importance)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(fit)
Step 5: Predicting the Testing Data
Next, predict the classes of the testing data to assess the model's accuracy.
# Predict on the testing data
prediksi = predict(fit, newdata = data.test, type = "class")
# Confusion matrix
table(prediksi, data.test$Persepsi)
##
## prediksi Cukup Puas Puas
## Cukup Puas 51 11
## Puas 23 235
The confusionMatrix() function from the caret package reports the same table together with the accuracy and related statistics.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# Predict on the testing data
prediksi = predict(fit, newdata = data.test, type = "class")
# Confusion matrix
confusionMatrix(data=prediksi, reference=data.test$Persepsi)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Cukup Puas Puas
## Cukup Puas 51 11
## Puas 23 235
##
## Accuracy : 0.8938
## 95% CI : (0.8547, 0.9253)
## No Information Rate : 0.7688
## P-Value [Acc > NIR] : 6.929e-09
##
## Kappa : 0.6832
##
## Mcnemar's Test P-Value : 0.05923
##
## Sensitivity : 0.6892
## Specificity : 0.9553
## Pos Pred Value : 0.8226
## Neg Pred Value : 0.9109
## Prevalence : 0.2313
## Detection Rate : 0.1594
## Detection Prevalence : 0.1938
## Balanced Accuracy : 0.8222
##
## 'Positive' Class : Cukup Puas
##
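The main statistics in this output follow directly from the four cells of the confusion matrix. The sketch below re-enters the table by hand and recomputes them in base R, with "Cukup Puas" as the positive class; the object name `cm` and the metric variable names are illustrative.

```r
# Confusion matrix copied from the output: rows = prediction, columns = reference
cm <- matrix(
  c(51, 23, 11, 235),
  nrow = 2,
  dimnames = list(
    Prediction = c("Cukup Puas", "Puas"),
    Reference = c("Cukup Puas", "Puas")
  )
)

accuracy <- sum(diag(cm)) / sum(cm)        # (51 + 235) / 320
sensitivity <- cm[1, 1] / sum(cm[, 1])     # 51 / 74
specificity <- cm[2, 2] / sum(cm[, 2])     # 235 / 246
pos_pred_value <- cm[1, 1] / sum(cm[1, ])  # 51 / 62
neg_pred_value <- cm[2, 2] / sum(cm[2, ])  # 235 / 258
balanced_accuracy <- (sensitivity + specificity) / 2
no_information_rate <- max(colSums(cm)) / sum(cm)  # 246 / 320

c(accuracy, sensitivity, specificity, balanced_accuracy, no_information_rate)
```

The no information rate is the accuracy obtained by always predicting the larger class, so the model's accuracy should be judged against that value rather than against 50%.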
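Step 6: Variable Importance
Variable importance helps identify which variables contribute most to the model. In rpart, a variable's importance is the sum of the improvement of every split in which it is the primary splitter, plus a weighted contribution from the splits in which it acts as a surrogate. The variable with the highest importance is therefore not necessarily the one at the top of the tree: here JK has the highest importance, even though the root split is on Umur.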
fit$variable.importance
## JK LamaNasabah Umur Pekerjaan
## 129.08186 72.31485 65.34698 20.80627
barplot(fit$variable.importance)
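The raw importance values are easier to compare as percentages. The sketch below copies the four values from the output and rescales them so that they sum to 100, which appears to be how the "Variable importance" line in summary(fit) is derived; the names `imp` and `pct` are illustrative.

```r
# Raw importance values copied from fit$variable.importance
imp <- c(
  JK = 129.08186,
  LamaNasabah = 72.31485,
  Umur = 65.34698,
  Pekerjaan = 20.80627
)

# Rescale to percentages of the total and round to whole numbers
pct <- round(100 * imp / sum(imp))
pct
```

The result matches the 45, 25, 23, 7 reported in the summary(fit) output.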