A decision tree is a prediction method for cases where the target variable is categorical; in other words, it can be used for classification. When classifying, a decision tree produces a set of rules, and following those rules leads to the predicted class.
In this tutorial, we will use a decision tree to classify people as Health or Not Health using a diabetes dataset.
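To make the idea of "rules" concrete, here is a minimal sketch of what a small tree amounts to once written out by hand. The function name `classify_patient` and both thresholds are hypothetical and chosen only for illustration; they are not learned from the diabetes data.

```r
# Hypothetical rule set: the thresholds are illustrative, not fitted values.
classify_patient <- function(glucose, bmi) {
  if (glucose > 140) {
    "Not Health"          # first split: high glucose
  } else if (bmi > 35) {
    "Not Health"          # second split: high BMI among lower-glucose cases
  } else {
    "Health"
  }
}

classify_patient(glucose = 150, bmi = 28)
classify_patient(glucose = 100, bmi = 25)
```

Each path from the first condition to a returned label corresponds to one rule of the tree.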
# Load the data and inspect its structure
diabetes <- read.csv("diabetes.csv")
str(diabetes)
## 'data.frame': 296 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
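# Note: mutate_if() turns every integer column (Glucose, Age, ...) into a factor.
# `target` is a labelled copy of `Outcome`.
diabetes <- diabetes %>%
  mutate_if(is.integer, as.factor) %>%
  mutate(
    target = factor(Outcome, levels = c(0, 1),
                    labels = c("Health", "Not Health")))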
glimpse(diabetes)
## Rows: 296
## Columns: 10
## $ Pregnancies <fct> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, ~
## $ Glucose <fct> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125~
## $ BloodPressure <fct> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74~
## $ SkinThickness <fct> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, ~
## $ Insulin <fct> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, ~
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.~
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2~
## $ Age <fct> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3~
## $ Outcome <fct> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, ~
## $ target <fct> Not Health, Health, Not Health, Health, Not H~
colSums(is.na(diabetes))
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
## target
## 0
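The `colSums(is.na(...))` check above only counts `NA` values. Several measurements in this dataset appear as 0 (for example in BMI, BloodPressure and Glucose), which may stand for unrecorded values; that reading is an assumption, not something the source states. A minimal sketch on a made-up toy data frame shows how such zeros could be counted:

```r
# Toy data frame (made-up values) mimicking a few columns of the dataset.
toy <- data.frame(
  Glucose       = c(148, 85, 0, 89),
  BloodPressure = c(72, 0, 64, 66),
  BMI           = c(33.6, 26.6, 0, 28.1)
)

# is.na() finds nothing, because the gaps are stored as 0 rather than NA.
na_counts <- colSums(is.na(toy))

# Counting zeros per column exposes the implausible entries.
zero_counts <- colSums(toy == 0)

na_counts
zero_counts
```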
The diabetes data contains no NA values.
### Checking Variance
library(caret)
## Warning: package 'caret' was built under R version 4.1.2
## Loading required package: ggplot2
## Loading required package: lattice
nearZeroVar(diabetes)
## integer(0)
nearZeroVar() checks whether any variable has variance close to zero. The output, integer(0), shows that no variable in the data has near-zero variance.
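As a rough illustration of what such a check looks at, the sketch below computes two metrics in base R: the ratio between the most common and second most common value, and the percentage of distinct values. The helper name `near_zero_var_sketch` is hypothetical, and the cut-offs (95/5 and 10) are assumed to mirror caret's documented defaults; this is a simplification, not caret's implementation.

```r
# Base R sketch of a near-zero-variance check (simplified, hypothetical helper).
near_zero_var_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # A column with a single distinct value has zero variance.
  if (length(counts) < 2) return(TRUE)
  freq_ratio     <- as.numeric(counts[1] / counts[2])
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique <= unique_cut
}

almost_constant <- c(rep(0, 99), 1)   # 99 zeros and a single 1
varied          <- 1:100              # every value distinct

near_zero_var_sketch(almost_constant)
near_zero_var_sketch(varied)
```

A column like `almost_constant` carries almost no information for a split, which is why it would be flagged.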
set.seed(250)
intrain <- sample(nrow(diabetes),nrow(diabetes)*0.8)
diabetes_train <- diabetes[intrain, ]
diabetes_test <- diabetes[-intrain, ]
diabetes_train
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 202 1 138 82 0 0 40.1
## 18 7 107 74 0 0 29.6
## 105 2 85 65 0 0 39.6
## 131 4 173 70 14 168 29.7
## 206 5 111 72 28 0 23.9
## 187 8 181 68 36 495 30.1
## 97 2 92 62 28 0 31.6
## 99 6 93 50 30 64 28.7
## 59 0 146 82 0 0 40.5
## 64 2 141 58 34 128 25.4
## 263 4 95 70 32 0 32.1
## 196 5 158 84 41 210 39.4
## 76 1 0 48 20 0 24.7
## 284 7 161 86 0 0 30.4
## 163 0 114 80 34 285 44.2
## 268 2 128 64 42 0 40.0
## 108 4 144 58 28 140 29.5
## 237 7 181 84 21 192 35.9
## 209 1 96 64 27 87 33.2
## 181 6 87 80 0 0 23.2
## 70 4 146 85 27 100 28.9
## 101 1 163 72 0 0 39.0
## 86 2 110 74 29 125 32.4
## 175 2 75 64 24 55 29.7
## 211 2 81 60 22 0 27.7
## 253 2 90 80 14 55 24.4
## 222 2 158 90 0 0 31.6
## 16 7 100 0 0 0 30.0
## 246 9 184 85 15 0 30.0
## 84 0 101 65 28 0 24.6
## 279 5 114 74 0 0 24.9
## 203 0 108 68 20 0 27.3
## 271 10 101 86 37 0 45.6
## 252 2 129 84 0 0 28.0
## 221 0 177 60 29 478 34.6
## 264 3 142 80 15 0 32.4
## 79 0 131 0 0 0 43.2
## 179 5 143 78 0 0 45.0
## 180 5 130 82 0 0 39.1
## 137 0 100 70 26 50 30.8
## 149 5 147 78 0 0 33.7
## 225 1 100 66 15 56 23.6
## 295 0 161 50 0 0 21.9
## 212 0 147 85 54 0 42.8
## 292 0 107 62 30 74 36.6
## 56 1 73 50 10 0 23.0
## 71 2 100 66 20 90 32.9
## 150 2 90 70 17 0 27.3
## 51 1 103 80 11 82 19.4
## 290 5 108 72 43 75 36.1
## 182 0 119 64 18 92 34.9
## 254 0 86 68 32 0 35.8
## 55 7 150 66 42 342 34.7
## 174 1 79 60 42 48 43.5
## 12 10 168 74 0 0 38.0
## 11 4 110 92 0 0 37.6
## 110 0 95 85 25 36 37.4
## 41 3 180 64 25 70 34.0
## 281 0 146 70 0 0 37.9
## 224 7 142 60 33 190 28.8
## 100 1 122 90 51 220 49.7
## 68 2 109 92 0 0 42.7
## 233 1 79 80 25 37 25.4
## 262 3 141 0 0 0 30.0
## 118 5 78 48 0 0 33.7
## 167 3 148 66 25 0 32.5
## 139 0 129 80 0 0 31.2
## 120 4 99 76 15 51 23.2
## 230 0 117 80 31 53 45.2
## 106 1 126 56 29 152 28.7
## 199 4 109 64 44 99 34.8
## 58 0 100 88 60 110 46.8
## 134 8 84 74 31 0 38.3
## 161 4 151 90 38 0 29.7
## 146 0 102 75 23 0 0.0
## 126 1 88 30 42 99 55.0
## 213 7 179 95 31 0 34.2
## 154 1 153 82 42 485 40.6
## 69 1 95 66 13 38 19.6
## 62 8 133 72 0 0 32.9
## 229 4 197 70 39 744 36.7
## 287 5 155 84 44 545 38.7
## 82 2 74 0 0 0 0.0
## 25 11 143 94 33 146 36.6
## 122 6 111 64 39 0 34.2
## 159 2 88 74 19 53 29.0
## 198 3 107 62 13 48 22.9
## 183 1 0 74 20 23 27.7
## 13 10 139 80 0 0 27.1
## 245 2 146 76 35 194 38.2
## 259 1 193 50 16 375 25.9
## 185 4 141 74 0 0 27.6
## 83 7 83 78 26 71 29.3
## 234 4 122 68 0 0 35.0
## 204 2 99 70 16 44 20.4
## 141 3 128 78 0 0 21.1
## 14 1 189 60 23 846 30.1
## 93 7 81 78 40 48 46.7
## 218 6 125 68 30 120 30.0
## 26 10 125 70 26 115 31.1
## 148 2 106 64 35 119 30.5
## 210 7 184 84 33 0 35.5
## 73 13 126 90 0 0 43.4
## 267 0 138 0 0 0 36.3
## 27 7 147 76 0 0 39.4
## 111 3 171 72 33 135 33.3
## 286 7 136 74 26 135 26.0
## 158 1 109 56 21 135 25.2
## 147 9 57 80 37 0 32.8
## 189 8 109 76 39 114 27.9
## 102 1 151 60 0 0 26.1
## 22 8 99 84 0 0 35.4
## 236 4 171 72 0 0 43.6
## 127 3 120 70 30 135 42.9
## 31 5 109 75 26 0 36.0
## 63 5 44 62 0 0 25.0
## 112 8 155 62 26 495 34.0
## 272 2 108 62 32 56 25.2
## 33 3 88 58 11 54 24.8
## 65 7 114 66 0 0 32.8
## 129 1 117 88 24 145 34.5
## 88 2 100 68 25 71 38.5
## 143 2 108 52 26 63 32.5
## 74 4 129 86 20 270 35.1
## 44 9 171 110 24 240 45.4
## 94 4 134 72 0 0 23.8
## 104 1 81 72 18 40 26.6
## 156 7 152 88 44 0 50.0
## 30 5 117 92 0 0 34.1
## 171 6 102 82 0 0 30.8
## 6 5 116 74 0 0 25.6
## 241 1 91 64 24 0 29.2
## 61 2 84 0 0 0 0.0
## 153 9 156 86 28 155 34.3
## 157 2 99 52 15 94 24.6
## 128 1 118 58 36 94 33.3
## 208 5 162 104 0 0 37.7
## 193 7 159 66 0 0 30.4
## 155 8 188 78 0 0 47.9
## 151 1 136 74 50 204 37.4
## 107 1 96 122 0 0 22.4
## 278 0 104 64 23 116 27.8
## 29 13 145 82 19 110 22.2
## 169 4 110 66 0 0 31.9
## 23 7 196 90 0 0 39.8
## 170 3 111 90 12 78 28.4
## 194 11 135 0 0 0 52.3
## 255 12 92 62 7 258 27.6
## 168 4 120 68 0 0 29.6
## 75 1 79 75 30 0 32.0
## 124 5 132 80 0 0 26.8
## 36 4 103 60 33 192 24.0
## 121 0 162 76 56 100 53.2
## 244 6 119 50 22 176 27.1
## 67 0 109 88 30 0 32.5
## 231 4 142 86 0 0 44.0
## 145 4 154 62 31 284 32.8
## 5 0 137 40 35 168 43.1
## 130 0 105 84 0 0 27.9
## 165 0 131 88 0 0 31.6
## 91 1 80 55 0 0 19.1
## 260 11 155 76 28 150 33.3
## 34 6 92 92 0 0 19.9
## 3 8 183 64 0 0 23.3
## 92 4 123 80 15 176 32.0
## 1 6 148 72 35 0 33.6
## 192 9 123 70 44 94 33.1
## 160 17 163 72 41 114 40.9
## 289 4 96 56 17 49 20.8
## 47 1 146 56 0 0 29.7
## 266 5 96 74 18 67 33.6
## 9 2 197 70 45 543 30.5
## 54 8 176 90 34 300 33.7
## 109 3 83 58 31 18 34.3
## 39 2 90 68 42 0 38.2
## 258 2 114 68 22 0 28.7
## 89 15 136 70 32 110 37.1
## 269 0 102 52 0 0 25.1
## 85 5 137 108 0 0 48.8
## 242 4 91 70 32 88 33.1
## 66 5 99 74 27 0 29.0
## 52 1 101 50 15 36 24.2
## 288 1 119 86 39 220 45.6
## 219 5 85 74 22 0 29.0
## 283 7 133 88 15 155 32.4
## 214 0 140 65 26 130 42.6
## 35 10 122 78 31 0 27.6
## 257 3 111 56 39 0 30.1
## 140 5 105 72 29 325 36.9
## 138 0 93 60 25 92 28.7
## 103 0 125 96 0 0 22.5
## 215 9 112 82 32 175 34.2
## 277 7 106 60 24 0 26.5
## 256 1 113 64 35 0 33.6
## 87 13 106 72 54 0 36.6
## 195 8 85 55 20 0 24.4
## 95 2 142 82 18 64 24.7
## 207 8 196 76 29 280 37.5
## 136 2 125 60 20 140 33.8
## 96 6 144 72 27 228 33.9
## 191 3 111 62 0 0 22.6
## 123 2 107 74 30 100 33.6
## 173 2 87 0 23 0 28.9
## 280 2 108 62 10 278 25.3
## 235 3 74 68 28 45 29.7
## 240 0 104 76 0 0 18.4
## 50 7 105 0 0 0 0.0
## 57 7 187 68 39 304 37.7
## 205 6 103 72 32 190 37.7
## 172 6 134 70 23 130 35.4
## 114 4 76 62 0 0 34.0
## 19 1 103 30 38 83 43.3
## 282 10 129 76 28 122 35.9
## 188 1 128 98 41 58 32.0
## 265 4 123 62 0 0 32.0
## 178 0 129 110 46 130 67.1
## 250 1 111 86 19 0 30.1
## 239 9 164 84 21 0 30.8
## 166 6 104 74 18 156 29.9
## 4 1 89 66 23 94 28.1
## 274 1 71 78 50 45 33.2
## 261 3 191 68 15 130 30.9
## 40 4 111 72 47 207 37.1
## 216 12 151 70 40 271 41.8
## 249 9 124 70 33 402 35.4
## 17 0 118 84 47 230 45.8
## 248 0 165 90 33 680 52.3
## 32 3 158 76 36 245 31.6
## 42 7 133 84 0 0 40.2
## 15 5 166 72 19 175 25.8
## 45 7 159 64 0 0 27.4
## 113 1 89 76 34 37 31.2
## 291 0 78 88 29 40 36.9
## 217 5 109 62 41 129 35.8
## 90 1 107 68 19 0 26.5
## 78 5 95 72 33 0 37.7
## DiabetesPedigreeFunction Age Outcome target
## 202 0.236 28 0 Health
## 18 0.254 31 1 Not Health
## 105 0.930 27 0 Health
## 131 0.361 33 1 Not Health
## 206 0.407 27 0 Health
## 187 0.615 60 1 Not Health
## 97 0.130 24 0 Health
## 99 0.356 23 0 Health
## 59 1.781 44 0 Health
## 64 0.699 24 0 Health
## 263 0.612 24 0 Health
## 196 0.395 29 1 Not Health
## 76 0.140 22 0 Health
## 284 0.165 47 1 Not Health
## 163 0.167 27 0 Health
## 268 1.101 24 0 Health
## 108 0.287 37 0 Health
## 237 0.586 51 1 Not Health
## 209 0.289 21 0 Health
## 181 0.084 32 0 Health
## 70 0.189 27 0 Health
## 101 1.222 33 1 Not Health
## 86 0.698 27 0 Health
## 175 0.370 33 0 Health
## 211 0.290 25 0 Health
## 253 0.249 24 0 Health
## 222 0.805 66 1 Not Health
## 16 0.484 32 1 Not Health
## 246 1.213 49 1 Not Health
## 84 0.237 22 0 Health
## 279 0.744 57 0 Health
## 203 0.787 32 0 Health
## 271 1.136 38 1 Not Health
## 252 0.284 27 0 Health
## 221 1.072 21 1 Not Health
## 264 0.200 63 0 Health
## 79 0.270 26 1 Not Health
## 179 0.190 47 0 Health
## 180 0.956 37 1 Not Health
## 137 0.597 21 0 Health
## 149 0.218 65 0 Health
## 225 0.666 26 0 Health
## 295 0.254 65 0 Health
## 212 0.375 24 0 Health
## 292 0.757 25 1 Not Health
## 56 0.248 21 0 Health
## 71 0.867 28 1 Not Health
## 150 0.085 22 0 Health
## 51 0.491 22 0 Health
## 290 0.263 33 0 Health
## 182 0.725 23 0 Health
## 254 0.238 25 0 Health
## 55 0.718 42 0 Health
## 174 0.678 23 0 Health
## 12 0.537 34 1 Not Health
## 11 0.191 30 0 Health
## 110 0.247 24 1 Not Health
## 41 0.271 26 0 Health
## 281 0.334 28 1 Not Health
## 224 0.687 61 0 Health
## 100 0.325 31 1 Not Health
## 68 0.845 54 0 Health
## 233 0.583 22 0 Health
## 262 0.761 27 1 Not Health
## 118 0.654 25 0 Health
## 167 0.256 22 0 Health
## 139 0.703 29 0 Health
## 120 0.223 21 0 Health
## 230 0.089 24 0 Health
## 106 0.801 21 0 Health
## 199 0.905 26 1 Not Health
## 58 0.962 31 0 Health
## 134 0.457 39 0 Health
## 161 0.294 36 0 Health
## 146 0.572 21 0 Health
## 126 0.496 26 1 Not Health
## 213 0.164 60 0 Health
## 154 0.687 23 0 Health
## 69 0.334 25 0 Health
## 62 0.270 39 1 Not Health
## 229 2.329 31 0 Health
## 287 0.619 34 0 Health
## 82 0.102 22 0 Health
## 25 0.254 51 1 Not Health
## 122 0.260 24 0 Health
## 159 0.229 22 0 Health
## 198 0.678 23 1 Not Health
## 183 0.299 21 0 Health
## 13 1.441 57 0 Health
## 245 0.329 29 0 Health
## 259 0.655 24 0 Health
## 185 0.244 40 0 Health
## 83 0.767 36 0 Health
## 234 0.394 29 0 Health
## 204 0.235 27 0 Health
## 141 0.268 55 0 Health
## 14 0.398 59 1 Not Health
## 93 0.261 42 0 Health
## 218 0.464 32 0 Health
## 26 0.205 41 1 Not Health
## 148 1.400 34 0 Health
## 210 0.355 41 1 Not Health
## 73 0.583 42 1 Not Health
## 267 0.933 25 1 Not Health
## 27 0.257 43 1 Not Health
## 111 0.199 24 1 Not Health
## 286 0.647 51 0 Health
## 158 0.833 23 0 Health
## 147 0.096 41 0 Health
## 189 0.640 31 1 Not Health
## 102 0.179 22 0 Health
## 22 0.388 50 0 Health
## 236 0.479 26 1 Not Health
## 127 0.452 30 0 Health
## 31 0.546 60 0 Health
## 63 0.587 36 0 Health
## 112 0.543 46 1 Not Health
## 272 0.128 21 0 Health
## 33 0.267 22 0 Health
## 65 0.258 42 1 Not Health
## 129 0.403 40 1 Not Health
## 88 0.324 26 0 Health
## 143 0.318 22 0 Health
## 74 0.231 23 0 Health
## 44 0.721 54 1 Not Health
## 94 0.277 60 1 Not Health
## 104 0.283 24 0 Health
## 156 0.337 36 1 Not Health
## 30 0.337 38 0 Health
## 171 0.180 36 1 Not Health
## 6 0.201 30 0 Health
## 241 0.192 21 0 Health
## 61 0.304 21 0 Health
## 153 1.189 42 1 Not Health
## 157 0.637 21 0 Health
## 128 0.261 23 0 Health
## 208 0.151 52 1 Not Health
## 193 0.383 36 1 Not Health
## 155 0.137 43 1 Not Health
## 151 0.399 24 0 Health
## 107 0.207 27 0 Health
## 278 0.454 23 0 Health
## 29 0.245 57 0 Health
## 169 0.471 29 0 Health
## 23 0.451 41 1 Not Health
## 170 0.495 29 0 Health
## 194 0.578 40 1 Not Health
## 255 0.926 44 1 Not Health
## 168 0.709 34 0 Health
## 75 0.396 22 0 Health
## 124 0.186 69 0 Health
## 36 0.966 33 0 Health
## 121 0.759 25 1 Not Health
## 244 1.318 33 1 Not Health
## 67 0.855 38 1 Not Health
## 231 0.645 22 1 Not Health
## 145 0.237 23 0 Health
## 5 2.288 33 1 Not Health
## 130 0.741 62 1 Not Health
## 165 0.743 32 1 Not Health
## 91 0.258 21 0 Health
## 260 1.353 51 1 Not Health
## 34 0.188 28 0 Health
## 3 0.672 32 1 Not Health
## 92 0.443 34 0 Health
## 1 0.627 50 1 Not Health
## 192 0.374 40 0 Health
## 160 0.817 47 1 Not Health
## 289 0.340 26 0 Health
## 47 0.564 29 0 Health
## 266 0.997 43 0 Health
## 9 0.158 53 1 Not Health
## 54 0.467 58 1 Not Health
## 109 0.336 25 0 Health
## 39 0.503 27 1 Not Health
## 258 0.092 25 0 Health
## 89 0.153 43 1 Not Health
## 269 0.078 21 0 Health
## 85 0.227 37 1 Not Health
## 242 0.446 22 0 Health
## 66 0.203 32 0 Health
## 52 0.526 26 0 Health
## 288 0.808 29 1 Not Health
## 219 1.224 32 1 Not Health
## 283 0.262 37 0 Health
## 214 0.431 24 1 Not Health
## 35 0.512 45 0 Health
## 257 0.557 30 0 Health
## 140 0.159 28 0 Health
## 138 0.532 22 0 Health
## 103 0.262 21 0 Health
## 215 0.260 36 1 Not Health
## 277 0.296 29 1 Not Health
## 256 0.543 21 1 Not Health
## 87 0.178 45 0 Health
## 195 0.136 42 0 Health
## 95 0.761 21 0 Health
## 207 0.605 57 1 Not Health
## 136 0.088 31 0 Health
## 96 0.255 40 0 Health
## 191 0.142 21 0 Health
## 123 0.404 23 0 Health
## 173 0.773 25 0 Health
## 280 0.881 22 0 Health
## 235 0.293 23 0 Health
## 240 0.582 27 0 Health
## 50 0.305 24 0 Health
## 57 0.254 41 1 Not Health
## 205 0.324 55 0 Health
## 172 0.542 29 1 Not Health
## 114 0.391 25 0 Health
## 19 0.183 33 0 Health
## 282 0.280 39 0 Health
## 188 1.321 33 1 Not Health
## 265 0.226 35 1 Not Health
## 178 0.319 26 1 Not Health
## 250 0.143 23 0 Health
## 239 0.831 32 1 Not Health
## 166 0.722 41 1 Not Health
## 4 0.167 21 0 Health
## 274 0.422 21 0 Health
## 261 0.299 34 0 Health
## 40 1.390 56 1 Not Health
## 216 0.742 38 1 Not Health
## 249 0.282 34 0 Health
## 17 0.551 31 1 Not Health
## 248 0.427 23 0 Health
## 32 0.851 28 1 Not Health
## 42 0.696 37 0 Health
## 15 0.587 51 1 Not Health
## 45 0.294 40 0 Health
## 113 0.192 23 0 Health
## 291 0.434 21 0 Health
## 217 0.514 25 1 Not Health
## 90 0.165 24 0 Health
## 78 0.370 27 0 Health
Next, check whether the class proportions in each data split are balanced.
prop.table(table(diabetes$target))
##
## Health Not Health
## 0.6216216 0.3783784
prop.table(table(diabetes_train$target))
##
## Health Not Health
## 0.6398305 0.3601695
prop.table(table(diabetes_test$target))
##
## Health Not Health
## 0.55 0.45
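From the proportions above, the train data (64% Health, 36% Not Health) stays close to the full data (62% and 38%), while the test data is somewhat more balanced (55% and 45%). In every split, Health remains the majority class.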
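A plain random split can shift the class proportions between train and test, as seen above. One option is to sample within each class separately (a stratified split). The sketch below does this in base R on made-up labels; caret's `createDataPartition()` offers a similar stratified split, but it is not what this tutorial uses.

```r
set.seed(250)

# Toy labels (made up) with roughly the same imbalance as the full dataset.
labels <- factor(c(rep("Health", 62), rep("Not Health", 38)))

# Sample 80% of the row indices within each class separately.
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.8 * length(idx)))]
}))

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

prop.table(table(train_labels))
prop.table(table(test_labels))
```

Because each class is sampled at the same rate, both splits keep close to the original 62/38 proportion.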
library(partykit)
## Warning: package 'partykit' was built under R version 4.1.2
## Loading required package: grid
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 4.1.2
## Loading required package: mvtnorm
diabetes_tree <- ctree(target~., data = diabetes_train)
diabetes_tree
##
## Model formula:
## target ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction + Age + Outcome
##
## Fitted party:
## [1] root
## | [2] Outcome in 0: Health (n = 151, err = 0.0%)
## | [3] Outcome in 1: Not Health (n = 85, err = 0.0%)
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
plot(diabetes_tree, type = "simple")
Based on the fitted decision tree, the root node splits on the Outcome variable. The tree has 1 inner node and 2 terminal nodes. When Outcome = 0, 151 observations are classified as Health with an error of 0%; when Outcome = 1, 85 observations are classified as Not Health, also with an error of 0%. The split is perfect because target was created directly from Outcome, and Outcome is still among the predictors in target ~ . (see the model formula in the output).
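A tree that separates the classes with a single split and 0% error is worth a second look, since a predictor may simply be a copy of the target. The sketch below, on made-up rows, shows one way to detect such a column with a cross-table and to drop it before fitting. The variable names `xtab`, `is_leaky` and `predictors_only` are introduced here for illustration only.

```r
# Toy rows (made up) where `target` is just a relabelled copy of `Outcome`.
toy <- data.frame(Outcome = factor(c(0, 1, 1, 0, 1, 0)))
toy$target <- factor(toy$Outcome, levels = c(0, 1),
                     labels = c("Health", "Not Health"))

# Each Outcome level maps to exactly one target level: a perfect predictor.
xtab <- table(toy$Outcome, toy$target)
xtab

# A column leaks the target if every row of the cross-table has one non-zero cell.
is_leaky <- all(rowSums(xtab > 0) == 1)
is_leaky

# One way to avoid the leak: drop the source column before fitting.
predictors_only <- toy[, setdiff(names(toy), "Outcome"), drop = FALSE]
names(predictors_only)
```

Applied to this tutorial, that would mean removing Outcome from the data before calling ctree(); the outputs shown here were produced with Outcome still included.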
Before predicting on the test data, we first evaluate the model on the train data to check whether it overfits or underfits.
pred_diabetes_train <- predict(diabetes_tree, diabetes_train)
confusionMatrix(pred_diabetes_train, diabetes_train$target, positive = "Not Health")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Health Not Health
## Health 151 0
## Not Health 0 85
##
## Accuracy : 1
## 95% CI : (0.9845, 1)
## No Information Rate : 0.6398
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.3602
## Detection Rate : 0.3602
## Detection Prevalence : 0.3602
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : Not Health
##
From the model evaluation, we focus on recall (sensitivity), because we do not want to predict Health for someone who is actually Not Health. The recall on the train data is 100%, as shown by the Sensitivity value in the output above. Next, we try the test data.
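For reference, recall can be computed by hand from the confusion matrix counts. The sketch below rebuilds the training confusion matrix shown above as a plain matrix and derives recall and precision for the positive class Not Health; the variable names are introduced here for illustration.

```r
# Counts taken from the training confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(151, 0,
               0,  85),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("Health", "Not Health"),
                             Reference  = c("Health", "Not Health")))

positive <- "Not Health"
tp <- cm[positive, positive]      # Not Health predicted as Not Health
fn <- sum(cm[, positive]) - tp    # Not Health predicted as Health
fp <- sum(cm[positive, ]) - tp    # Health predicted as Not Health

recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)
c(recall = recall, precision = precision)
```

Recall only depends on the Not Health column of the reference, which is why it is the metric of choice when missing a Not Health case is the costly error.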
pred_diabetes_test <- predict(diabetes_tree, diabetes_test)
confusionMatrix(pred_diabetes_test, diabetes_test$target, positive = "Not Health")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Health Not Health
## Health 33 0
## Not Health 0 27
##
## Accuracy : 1
## 95% CI : (0.9404, 1)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 2.641e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00
## Specificity : 1.00
## Pos Pred Value : 1.00
## Neg Pred Value : 1.00
## Prevalence : 0.45
## Detection Rate : 0.45
## Detection Prevalence : 0.45
## Balanced Accuracy : 1.00
##
## 'Positive' Class : Not Health
##
When the model is used to predict the test data, the sensitivity is also 100%. To make the classification of the target variable more stable, we can resample the train data so that both classes are equally represented.
set.seed(250)
intrain <- sample(nrow(diabetes), nrow(diabetes)*0.8)
re_train <- diabetes[intrain,]
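# Upsample the minority class based on column 9 (Outcome); column 10 (target) stays among the other columns.
re_train <- upSample(re_train[, -9], re_train[, 9], yname = "Outcome")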
re_test <- diabetes[-intrain,]
Next, we check the class proportions of the train and test data.
prop.table(table(re_train$target))
##
## Health Not Health
## 0.5 0.5
prop.table(table(re_test$target))
##
## Health Not Health
## 0.55 0.45
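The 50/50 proportion in the train data comes from upsampling: rows of the minority class are drawn again with replacement until both classes have the same size. The base R sketch below illustrates the idea on made-up rows; it is a simplified stand-in for what upSample() does, not its actual implementation.

```r
set.seed(250)

# Toy training rows (made up): 6 majority rows, 3 minority rows.
toy <- data.frame(id = 1:9,
                  target = factor(c(rep("Health", 6), rep("Not Health", 3))))

# Resample each class with replacement up to the size of the largest class.
target_size   <- max(table(toy$target))
rows_by_class <- split(seq_len(nrow(toy)), toy$target)
balanced_idx  <- unlist(lapply(rows_by_class, function(idx) {
  extra <- idx[sample.int(length(idx), size = target_size - length(idx),
                          replace = TRUE)]
  c(idx, extra)   # keep the original rows, then add the resampled ones
}))
balanced <- toy[balanced_idx, ]

table(balanced$target)
```

Only the train data is resampled; the test data keeps its original proportions so that the evaluation reflects the real class balance.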
After that, we build the model again with the resampled train data.
diabetes_tree_new <- ctree(target~., re_train)
diabetes_tree_new
##
## Model formula:
## target ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction + Age + Outcome
##
## Fitted party:
## [1] root
## | [2] Outcome in 0: Health (n = 151, err = 0.0%)
## | [3] Outcome in 1: Not Health (n = 151, err = 0.0%)
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
plot(diabetes_tree_new, type = "simple")
Based on the plot of the new decision tree, the root node again splits on the Outcome variable; the tree has 1 internal node and 2 terminal nodes.
pred_train <- predict(diabetes_tree_new, re_train)
confusionMatrix(pred_train, re_train$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Health Not Health
## Health 151 0
## Not Health 0 151
##
## Accuracy : 1
## 95% CI : (0.9879, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0
## Specificity : 1.0
## Pos Pred Value : 1.0
## Neg Pred Value : 1.0
## Prevalence : 0.5
## Detection Rate : 0.5
## Detection Prevalence : 0.5
## Balanced Accuracy : 1.0
##
## 'Positive' Class : Health
##
The recall obtained is 100%, and the accuracy is also 100%. Note that this call does not set positive, so the positive class reported here is Health.
pred_test <- predict(diabetes_tree_new, re_test)
confusionMatrix(pred_test, re_test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Health Not Health
## Health 33 0
## Not Health 0 27
##
## Accuracy : 1
## 95% CI : (0.9404, 1)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 2.641e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00
## Specificity : 1.00
## Pos Pred Value : 1.00
## Neg Pred Value : 1.00
## Prevalence : 0.55
## Detection Rate : 0.55
## Detection Prevalence : 0.55
## Balanced Accuracy : 1.00
##
## 'Positive' Class : Health
##
After resampling, the model reaches a recall of 100% on the test data, the same value as before resampling, so the outputs above do not show a change in recall from resampling. On this data, the decision tree classifies the target variable without error, because both trees split on Outcome, the variable from which target was derived.
The conclusions that can be drawn are: