Classification case: Anomaly Detection for Detecting Defective Manufactured Semiconductors. In this project we apply machine learning techniques to automatically build an accurate predictive model for equipment faults during the wafer fabrication process in the semiconductor industry. The aim is to construct a decision model that helps detect any equipment fault as quickly as possible, in order to maintain high process yields in manufacturing.
This project is based on this GitHub repository. We're given a dataset with 590 predictors, an imbalanced target variable, lots of missing values, and predictors on many different scales. In this case, Recall is the metric we want to maximize, because we don't want any defective items sold on the market. It's better to classify an item as defective even when it is not; flagged items can always be re-checked. So we're going to build the model with the highest Recall (also called sensitivity), or in Confusion Matrix language, false positives are better than false negatives.
You can load the packages into your workspace using the library() function.
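A sketch of the loading step; this is my reconstruction of the packages the chunks below rely on, not the author's original list. The ranger and xgboost engines also need to be installed.
library(dplyr)      # data wrangling
library(tibble)     # rownames_to_column()
library(rsample)    # train/test splits and cross-validation folds
library(parsnip)    # model specifications
library(dials)      # parameter objects and grids
library(tune)       # grid tuning
library(yardstick)  # evaluation metrics
library(recipes)    # preprocessing recipes
library(FactoMineR) # PCA
library(UBL)        # SMOTE
Note that in recent releases step_meanimpute() was renamed step_impute_mean() and step_downsample() moved to the themis package; the code here uses the older names.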
There's a lot to do in pre-processing. I'll pre-process twice: once with the dplyr/base packages, building the models with parsnip, and a second time fully with tidymodels, just to compare their effectiveness. For the models, I'll use Random Forest, Decision Tree, and XGBoost.
Exploratory Data Analysis, Data Wrangling, Feature selection and all other data preparation without recipe. The hard way, like the men we are.
The dimensions are too wide to glimpse() through. Here is a summary of the data:
- The dimensions are: 1567 rows x 592 columns
- The variables comprise: 1 time attribute, 590 predictors, and one target variable, with "-1" marking items that are not defective and "1" marking defective items.
- According to the GitHub repository, the 590 predictors are actually sensor measurements on different scales.
- The predictors are all numeric and contain many missing values
- The target variable is imbalanced, with roughly a 1:14 proportion
Change the variable types
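The chunk itself is omitted; presumably the target is converted to a factor so the models treat this as classification. A minimal sketch, assuming the raw data frame is called data:
# my assumption for the omitted chunk: classification needs a factor target
data.1 <- data %>%
mutate(Pass.Fail = as.factor(Pass.Fail))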
Check NA
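The per-column counts below presumably come from something like:
# number of missing values in every column
colSums(is.na(data.1))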
## Time X0 X1 X2 X3 X4 X5 X6
## 0 6 7 14 14 14 14 14
## X7 X8 X9 X10 X11 X12 X13 X14
## 9 2 2 2 2 2 3 3
## X15 X16 X17 X18 X19 X20 X21 X22
## 3 3 3 3 10 0 2 2
## X23 X24 X25 X26 X27 X28 X29 X30
## 2 2 2 2 2 2 2 2
## X31 X32 X33 X34 X35 X36 X37 X38
## 2 1 1 1 1 1 1 1
## X39 X40 X41 X42 X43 X44 X45 X46
## 1 24 24 1 1 1 1 1
## X47 X48 X49 X50 X51 X52 X53 X54
## 1 1 1 1 1 1 4 4
## X55 X56 X57 X58 X59 X60 X61 X62
## 4 4 4 4 7 6 6 6
## X63 X64 X65 X66 X67 X68 X69 X70
## 7 7 7 6 6 6 6 6
## X71 X72 X73 X74 X75 X76 X77 X78
## 6 794 794 6 24 24 24 24
## X79 X80 X81 X82 X83 X84 X85 X86
## 24 24 24 24 1 12 1341 0
## X87 X88 X89 X90 X91 X92 X93 X94
## 0 0 51 51 6 2 2 6
## X95 X96 X97 X98 X99 X100 X101 X102
## 6 6 6 6 6 6 6 6
## X103 X104 X105 X106 X107 X108 X109 X110
## 2 2 6 6 6 6 1018 1018
## X111 X112 X113 X114 X115 X116 X117 X118
## 1018 715 0 0 0 0 0 24
## X119 X120 X121 X122 X123 X124 X125 X126
## 0 0 9 9 9 9 9 9
## X127 X128 X129 X130 X131 X132 X133 X134
## 9 9 9 9 9 8 8 8
## X135 X136 X137 X138 X139 X140 X141 X142
## 5 6 7 14 14 14 14 14
## X143 X144 X145 X146 X147 X148 X149 X150
## 9 2 2 2 2 2 3 3
## X151 X152 X153 X154 X155 X156 X157 X158
## 3 3 3 3 10 0 1429 1429
## X159 X160 X161 X162 X163 X164 X165 X166
## 2 2 2 2 2 2 2 2
## X167 X168 X169 X170 X171 X172 X173 X174
## 2 2 2 1 1 1 1 1
## X175 X176 X177 X178 X179 X180 X181 X182
## 1 1 1 24 1 1 1 1
## X183 X184 X185 X186 X187 X188 X189 X190
## 1 1 1 1 1 1 1 4
## X191 X192 X193 X194 X195 X196 X197 X198
## 4 4 4 4 4 7 6 6
## X199 X200 X201 X202 X203 X204 X205 X206
## 6 7 7 7 6 6 6 6
## X207 X208 X209 X210 X211 X212 X213 X214
## 6 6 6 24 24 24 24 24
## X215 X216 X217 X218 X219 X220 X221 X222
## 24 24 24 1 12 1341 0 0
## X223 X224 X225 X226 X227 X228 X229 X230
## 0 51 51 6 2 2 6 6
## X231 X232 X233 X234 X235 X236 X237 X238
## 6 6 6 6 6 6 6 2
## X239 X240 X241 X242 X243 X244 X245 X246
## 2 6 6 6 6 1018 1018 1018
## X247 X248 X249 X250 X251 X252 X253 X254
## 715 0 0 0 0 0 24 0
## X255 X256 X257 X258 X259 X260 X261 X262
## 0 9 9 9 9 9 9 9
## X263 X264 X265 X266 X267 X268 X269 X270
## 9 9 9 9 8 8 8 5
## X271 X272 X273 X274 X275 X276 X277 X278
## 6 7 14 14 14 14 14 9
## X279 X280 X281 X282 X283 X284 X285 X286
## 2 2 2 2 2 3 3 3
## X287 X288 X289 X290 X291 X292 X293 X294
## 3 3 3 10 0 1429 1429 2
## X295 X296 X297 X298 X299 X300 X301 X302
## 2 2 2 2 2 2 2 2
## X303 X304 X305 X306 X307 X308 X309 X310
## 2 2 1 1 1 1 1 1
## X311 X312 X313 X314 X315 X316 X317 X318
## 1 1 24 24 1 1 1 1
## X319 X320 X321 X322 X323 X324 X325 X326
## 1 1 1 1 1 1 1 4
## X327 X328 X329 X330 X331 X332 X333 X334
## 4 4 4 4 4 7 6 6
## X335 X336 X337 X338 X339 X340 X341 X342
## 6 7 7 7 6 6 6 6
## X343 X344 X345 X346 X347 X348 X349 X350
## 6 6 794 794 6 24 24 24
## X351 X352 X353 X354 X355 X356 X357 X358
## 24 24 24 24 24 1 12 1341
## X359 X360 X361 X362 X363 X364 X365 X366
## 0 0 0 51 51 6 2 2
## X367 X368 X369 X370 X371 X372 X373 X374
## 6 6 6 6 6 6 6 6
## X375 X376 X377 X378 X379 X380 X381 X382
## 6 2 2 6 6 6 6 1018
## X383 X384 X385 X386 X387 X388 X389 X390
## 1018 1018 715 0 0 0 0 0
## X391 X392 X393 X394 X395 X396 X397 X398
## 24 0 0 9 9 9 9 9
## X399 X400 X401 X402 X403 X404 X405 X406
## 9 9 9 9 9 9 8 8
## X407 X408 X409 X410 X411 X412 X413 X414
## 8 5 6 7 14 14 14 14
## X415 X416 X417 X418 X419 X420 X421 X422
## 14 9 2 2 2 2 2 3
## X423 X424 X425 X426 X427 X428 X429 X430
## 3 3 3 3 3 10 0 2
## X431 X432 X433 X434 X435 X436 X437 X438
## 2 2 2 2 2 2 2 2
## X439 X440 X441 X442 X443 X444 X445 X446
## 2 2 1 1 1 1 1 1
## X447 X448 X449 X450 X451 X452 X453 X454
## 1 1 24 24 1 1 1 1
## X455 X456 X457 X458 X459 X460 X461 X462
## 1 1 1 1 1 1 1 4
## X463 X464 X465 X466 X467 X468 X469 X470
## 4 4 4 4 4 7 6 6
## X471 X472 X473 X474 X475 X476 X477 X478
## 6 7 7 7 6 6 6 6
## X479 X480 X481 X482 X483 X484 X485 X486
## 6 6 6 24 24 24 24 24
## X487 X488 X489 X490 X491 X492 X493 X494
## 24 24 24 1 12 1341 0 0
## X495 X496 X497 X498 X499 X500 X501 X502
## 0 51 51 6 2 2 6 6
## X503 X504 X505 X506 X507 X508 X509 X510
## 6 6 6 6 6 6 6 2
## X511 X512 X513 X514 X515 X516 X517 X518
## 2 6 6 6 6 1018 1018 1018
## X519 X520 X521 X522 X523 X524 X525 X526
## 715 0 0 0 0 0 24 0
## X527 X528 X529 X530 X531 X532 X533 X534
## 0 9 9 9 9 9 9 9
## X535 X536 X537 X538 X539 X540 X541 X542
## 9 9 9 9 8 8 8 2
## X543 X544 X545 X546 X547 X548 X549 X550
## 2 2 2 260 260 260 260 260
## X551 X552 X553 X554 X555 X556 X557 X558
## 260 260 260 260 260 260 260 1
## X559 X560 X561 X562 X563 X564 X565 X566
## 1 1 1 273 273 273 273 273
## X567 X568 X569 X570 X571 X572 X573 X574
## 273 273 273 0 0 0 0 0
## X575 X576 X577 X578 X579 X580 X581 X582
## 0 0 0 949 949 949 949 1
## X583 X584 X585 X586 X587 X588 X589 Pass.Fail
## 1 1 1 1 1 1 1 0
There are a lot, a whole lot, of missing values. We'll deal with them one by one. First, let's simply delete any variable for which more than half of the values are missing. We also want to delete rows where most of the columns are missing.
# delete columns where more than 50% of values are missing
data.1 <- data.1[, which(colMeans(!is.na(data.1)) > 0.5)]
# delete rows where more than 50% of values are missing
data.1 <- data.1[which(rowMeans(!is.na(data.1)) > 0.5), ]
Apparently no rows have more than 50% NAs.
Next, let's quantify the missing values in our data.
# I want to know how many rows have at least 1 NA
haveNA <- apply(data.1, 1, function(x) any(is.na(x)))
length(which(haveNA))
# ... and what proportion of all rows that is
length(which(haveNA)) / nrow(data.1)
## [1] 1106
## [1] 0.7058073
70.5% of our data has at least one missing value. If we just na.omit() them, we'd have almost nothing left.
I really wanted to use MICE or a similar package that fills NAs using a prediction model. I tried it, and it is very time-consuming (and computationally expensive); I don't have that kind of time. So, for the time being, we will fill NAs with each variable's mean.
# loop over each column: compute the variable's mean and assign it to every NA
for(i in 1:ncol(data.1)){
data.1[is.na(data.1[,i]), i] <- mean(data.1[,i], na.rm = TRUE)
}
# the warning below comes from a non-numeric column (Time), which mean() can't average
## Warning in mean.default(data.1[, i], na.rm = TRUE): argument is not numeric or
## logical: returning NA
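The FALSE below confirms that nothing is missing anymore; the omitted check was presumably something like:
# check whether any NA remains anywhere in the data
anyNA(data.1)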
## [1] FALSE
I want to remove variables that have zero variance, as they contribute nearly nothing to the future model construction.
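The chunk is omitted; a minimal sketch of zero-variance filtering, assuming only the numeric columns are tested (Time and Pass.Fail are left alone):
# my reconstruction of the omitted chunk, not the original code
num.cols <- names(which(sapply(data.1, is.numeric)))
zero.var <- num.cols[sapply(data.1[, num.cols], var) == 0]
data.1 <- data.1[, !names(data.1) %in% zero.var]
dim(data.1)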
## [1] 1567 437
We still have a lot of variables. Before building a model, we need to reduce them. One approach to dimensionality reduction is Principal Component Analysis, or PCA for short. PCA looks for correlation within our data and uses that redundancy to create a new matrix with just enough dimensions to explain most of the variance in the original data. The new variables created by PCA are called principal components. PCA also helps reduce multicollinearity. Two birds with one stone, eh?
# build the PCA: mark the factor variables as supplementary and scale the numeric data, all in one function call
pca1 <- PCA(data.1, quali.sup = c(1,437), scale.unit = T, graph = F, ncp = 200)
A lot of PCs have been created. We want to find the PC at which the cumulative percentage of variance reaches 85%.
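The chunk is omitted; with FactoMineR the cumulative variance lives in pca1$eig, so the check was presumably something like:
# index of the first component whose cumulative percentage of variance reaches 85%
which(pca1$eig[, "cumulative percentage of variance"] >= 85)[1]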
## [1] 105
105 PCs. This simply means we only need 105 dimensions to cover 85% of the variance; in other words, we can discard 75.8% of the columns and still retain 85% of the variance in the original data. Let's make a new data frame containing the selected PCs and our target variable.
df.pca <- data.frame(pca1$ind$coord[,1:105])
df.pca <- cbind(df.pca, pass.fail = data.1$Pass.Fail)
# I'm not including the Time variable when binding the PCs back to our main df; IMO the time variable won't contribute anything to the future model
We've reduced the variables, even though 105 still feels like 'a lot' to me personally; it's far better to use 105 variables than 539. Next, we'll deal with our imbalanced target.
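The proportions below presumably come from a check like:
# class proportions of the target variable
prop.table(table(df.pca$pass.fail))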
##
## -1 1
## 0.93363114 0.06636886
To deal with this problem, I'm going to use the Synthetic Minority Oversampling Technique (SMOTE) from the UBL library. SMOTE can handle class imbalance in binary classification cases like ours. Reminder: "1" in our target variable means a fail case, and "-1" means a pass.
data.balanced <- SmoteClassif(pass.fail ~., df.pca, C.perc = "balance", dist = "Euclidean")
# re-check our balanced data
prop.table(table(data.balanced$pass.fail))
##
## -1 1
## 0.5003191 0.4996809
I'll do the modeling using the parsnip package. First up, Random Forest.
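One thing the chunks below rely on that isn't shown is the train/test split. A minimal sketch, assuming rsample with a stratified 80/20 split (the proportion and seed are my guesses):
# my assumption for the omitted split step
set.seed(1502)
splitted <- initial_split(data.balanced, prop = 0.8, strata = "pass.fail")
train <- training(splitted)
test <- testing(splitted)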
# Model building
# trees = 500 and mtry = 3 are just arbitrary first guesses on my part
mod.rf <- rand_forest(trees = 500, mtry = 3, mode = "classification") %>%
set_engine("ranger") %>%
fit(pass.fail~., data = train)
# Prediction to unseen data
pred.rf <- test %>%
bind_cols(predict(mod.rf, test)) %>%
bind_cols(predict(mod.rf, test, type = "prob")) %>%
# the predictions are bound onto the test data, so keep only the target and prediction columns
select(106:109)
head(pred.rf)
For the evaluation, we will use the yardstick package.
# Model Evaluation
metrics.eval.rf <- pred.rf %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.rf
98% recall lol
# Model Building
mod.dt <- decision_tree(mode = "classification") %>%
set_engine("rpart") %>%
fit(pass.fail~., data = train)
# Prediction to unseen data
pred.dt <- test %>%
bind_cols(predict(mod.dt, test)) %>%
bind_cols(predict(mod.dt, test, type = "prob")) %>%
select(106:109)
head(pred.dt)
# Model evaluation
metrics.eval.dt <- pred.dt %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.dt
# Model building
# I'll use the same hyperparameters as the random forest
mod.xg <- boost_tree(trees = 500, mtry = 3, mode = "classification") %>%
set_engine("xgboost") %>%
fit(pass.fail~., data = train)
# prediction to unseen data
pred.xg <- test %>%
bind_cols(predict(mod.xg, test)) %>%
bind_cols(predict(mod.xg, test, type = "prob")) %>%
select(106:109)
head(pred.xg)
metrics.eval.xg <- pred.xg %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.xg
Overview of the metrics
all.metric <- rbind(Random_Forest = metrics.eval.rf,
Decision_Tree = metrics.eval.dt,
Boosted_Tree = metrics.eval.xg)
all.metric
The results are amazing even with random numbers for the hyperparameters. The highest recall we get is 98%, from the Random Forest model. It's suspiciously high, which makes me think there may be some data leakage, perhaps from the upsampling/downsampling step.
But can we make it even better?
For random forest tuning, I'll try grid-search tuning of the trees and mtry parameters. Due to my machine's capacity, I'll only use five values per parameter. We will also do 5-fold cross-validation.
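The folds object used below isn't created anywhere visible; a minimal sketch, assuming rsample's vfold_cv():
# my assumption for the omitted chunk: 5-fold CV on the training data
folds <- vfold_cv(train, v = 5)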
# create the grid. tune will build a model for every combination of the given trees and mtry values
rf.grid <- expand.grid(trees = c(400,450,500,550,600), mtry = 2:6)
rf.setup <- rand_forest(trees = tune(), mtry = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
rf.tune <- tune_grid(pass.fail~., model = rf.setup, resamples = folds,
grid = rf.grid, metrics = metric_set(accuracy, spec, sens))
## i Creating pre-processing data to finalize unknown parameter: mtry
# select the best parameters and apply to new model
best.rf <- rf.tune %>% select_best("sens")
# The best parameters for highest recall are mtry = 6 and trees = 600
mod.rf.2 <- rf.setup %>% finalize_model(parameters =
best.rf)
# build a new model using the best tuned parameters
# (note: this fits on the test set, which is then also used for prediction below;
# that alone explains the "perfect" scores we get later)
mod.rf.2.x <- mod.rf.2 %>% fit(pass.fail~., data = test)
# predict with the new model
pred.rf.2 <- test %>%
bind_cols(predict(mod.rf.2.x, test)) %>%
bind_cols(predict(mod.rf.2.x, test, type = "prob")) %>%
select(106:109)
head(pred.rf.2)
metrics.eval.rf.2 <- pred.rf.2 %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.rf.2
We've "successfully" predicted every single target correctly, hahahaha.
# create the grid. tune will build models using cost_complexity and min_n combinations sampled by grid_max_entropy()
dt.setup <- decision_tree(cost_complexity = tune(), min_n = tune()) %>%
set_engine("rpart") %>%
set_mode("classification")
dt.pars <- parameters(cost_complexity(), min_n())
dt.grid <- grid_max_entropy(dt.pars, size = 10)
dt.tune <- tune_grid(pass.fail~., model = dt.setup, resamples = folds,
grid = dt.grid, metrics = metric_set(accuracy, spec, sens))
# select the best parameters and apply to new model
best.dt <- dt.tune %>% select_best("sens")
# The best parameters for highest recall are cost_complexity = 0.0899 and min_n = 35
dt.final <- dt.setup %>% finalize_model(parameters =
best.dt)
mod.dt.2 <- dt.final %>% fit(pass.fail~., data = test)
# predict with the new model
pred.dt.2 <- test %>%
bind_cols(predict(mod.dt.2, test)) %>%
bind_cols(predict(mod.dt.2, test, type = "prob")) %>%
select(106:109)
head(pred.dt.2)
# Model evaluation
metrics.eval.dt.2 <- pred.dt.2 %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.dt.2
# create the grid. tune will build a model for every combination of the given trees and mtry values
xg.grid <- expand.grid(trees = c(400,450,500,550,600), mtry = 2:6)
xg.setup <- boost_tree(learn_rate = 0.1, trees = tune(), mtry = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
xg.tune <- tune_grid(pass.fail~., model = xg.setup, resamples = folds,
grid = xg.grid, metrics = metric_set(accuracy, spec, sens))
## i Creating pre-processing data to finalize unknown parameter: mtry
# select the best parameters and apply to new model
best.xg <- xg.tune %>% select_best("sens")
# the best parameters from this first tuning for highest recall are trees = 450 and mtry = 6
xg.final <- xg.setup %>% finalize_model(parameters =
best.xg)
# build new model
mod.xg.2 <- xg.final %>% fit(pass.fail~., data = test)
# predict with new model
pred.xg.2 <- test %>%
bind_cols(predict(mod.xg.2, test)) %>%
bind_cols(predict(mod.xg.2, test, type = "prob")) %>%
select(106:109)
head(pred.xg.2)
metrics.eval.xg.2 <- pred.xg.2 %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.xg.2
Again we find a perfect score on every metric.
all.metric.2 <- rbind(Random_Forest.tn = metrics.eval.rf.2,
Decision_Tree.tn = metrics.eval.dt.2,
Boosted_Tree.tn = metrics.eval.xg.2)
all.metric.2
The results are perfect, but there's no such thing as perfect accuracy, so something must be wrong. One culprit is the upsampling step: it generates synthetic minority observations that are very similar to existing rows, so near-duplicates end up in both train and test, and the evaluation step is no longer predicting truly unseen data. The other culprit is visible in the code above: the tuned models were refit on the test set itself before being evaluated on it.
Anyway, on to our second goal: let's build the preprocessing steps using tidymodels, and see how simple it is.
tidymodels
Re-import the data so everything starts anew.
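The chunk is omitted; a sketch of the re-import and split, with a hypothetical file path and my own guesses for the proportion and seed:
# my reconstruction, not the original code
data <- read.csv("data_input/uci-secom.csv") # hypothetical path
data$Pass.Fail <- as.factor(data$Pass.Fail)
set.seed(1502)
splitted <- initial_split(data, prop = 0.8, strata = "Pass.Fail")
train <- training(splitted)
test <- testing(splitted)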
Define the pre-processing steps.
data.recipe <- recipe(Pass.Fail ~., data = train) %>%
# remove the unnecessary Time variable
step_rm(Time) %>%
# remove near-zero-variance predictors
step_nzv(all_predictors()) %>%
# impute missing values with each variable's mean
step_meanimpute(all_predictors()) %>%
# downsample the majority class to a 1:0.75 ratio
step_downsample(Pass.Fail, under_ratio = 1/0.75, seed = 1502) %>%
# center each numeric column to mean zero
step_center(all_numeric()) %>%
# scale each numeric column to standard deviation one
step_scale(all_numeric()) %>%
# create principal components covering 85% of the variance
step_pca(all_numeric(), threshold = 0.85) %>%
prep(strings_as_factors = F)
Apply the preprocessing to our new train and test data.
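The chunk is omitted; presumably juice() extracts the processed training set and bake() applies the recipe to the test set:
# my assumption for the omitted chunk
train.2 <- juice(data.recipe)
test.2 <- bake(data.recipe, new_data = test)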
Our old training set had 105 predictor variables; the new one has 72. Both used the same 85% variance threshold for PCA. I also set the balancing step to a 1:0.75 ratio, not an actual 1:1 balance.
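The class proportions below presumably come from:
prop.table(table(train.2$Pass.Fail))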
##
## -1 1
## 0.5699482 0.4300518
I'll do the modeling using the tidymodels packages. First up, Random Forest.
mod.rfX <- rand_forest(trees = 500, mtry = 3, mode = "classification") %>%
set_engine("ranger") %>%
fit(Pass.Fail~., data = train.2)
pred.rfX <- test.2 %>%
bind_cols(predict(mod.rfX, test.2)) %>%
bind_cols(predict(mod.rfX, test.2, type = "prob"))
pred.rfX <- pred.rfX %>% select(1, tail(names(.),3))
head(pred.rfX)
# model evaluation
metrics.eval.rfX <- pred.rfX %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.rfX
94% recall, nearly as good as the old one.
mod.dtX <- decision_tree(mode = "classification") %>%
set_engine("rpart") %>%
fit(Pass.Fail~., data = train.2)
pred.dtX <- test.2 %>%
bind_cols(predict(mod.dtX, test.2)) %>%
bind_cols(predict(mod.dtX, test.2, type = "prob"))
pred.dtX <- pred.dtX %>% select(1, tail(names(.),3))
head(pred.dtX)
metrics.eval.dtX <- pred.dtX %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.dtX
# I'll use the same hyperparameters as the random forest
mod.xgX <- boost_tree(trees = 500, mtry = 3, mode = "classification") %>%
set_engine("xgboost") %>%
fit(Pass.Fail~., data = train.2)
pred.xgX <- test.2 %>%
bind_cols(predict(mod.xgX, test.2)) %>%
bind_cols(predict(mod.xgX, test.2, type = "prob"))
pred.xgX <- pred.xgX %>% select(1, tail(names(.),3))
head(pred.xgX)
# model evaluation
metrics.eval.xgX <- pred.xgX %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.xgX
Overview of the metrics
all.metric.X <- rbind(Random_Forest.2 = metrics.eval.rfX,
Decision_Tree.2 = metrics.eval.dtX,
Boosted_Tree.2 = metrics.eval.xgX) %>% data.frame()
all.metric.X
We see that the recall for the boosted tree is not quite as high as before. We'll do the tuning just like before, hoping for a better result.
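Again, the resamples object isn't shown; a minimal sketch, assuming rsample:
# my assumption for the omitted chunk: 5-fold CV on the recipe-processed training data
foldsX <- vfold_cv(train.2, v = 5)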
rf.grid <- expand.grid(trees = c(400,450,500,550,600), mtry = 2:6)
rf.setup <- rand_forest(trees = tune(), mtry = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
rf.tune <- tune_grid(Pass.Fail~., model = rf.setup, resamples = foldsX,
grid = rf.grid, metrics = metric_set(accuracy, sens, spec))
## i Creating pre-processing data to finalize unknown parameter: mtry
best.rfX <- rf.tune %>% select_best("sens")
# the best parameters for highest recall are mtry = 6 and trees = 450
mod.rf.2X <- rf.setup %>% finalize_model(parameters =
best.rfX)
# rebuild the model (note: fit on the test set again, the same leak as before)
mod.rf.2.new <- mod.rf.2X %>% fit(Pass.Fail~., data = test.2)
# predict to unseen data
pred.rf.2X <- test.2 %>%
bind_cols(predict(mod.rf.2.new, test.2)) %>%
bind_cols(predict(mod.rf.2.new, test.2, type = "prob"))
pred.rf.2X <- pred.rf.2X %>% select(1, tail(names(.),3))
head(pred.rf.2X)
# model evaluation
metrics.eval.rf.2X <- pred.rf.2X %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.rf.2X
Again, a perfect recall score.
dt.setup <- decision_tree(cost_complexity = tune(), min_n = tune()) %>%
set_engine("rpart") %>%
set_mode("classification")
dt.pars <- parameters(cost_complexity(), min_n())
dt.grid <- grid_max_entropy(dt.pars, size = 10)
dt.tune <- tune_grid(Pass.Fail~., model = dt.setup, resamples = foldsX,
grid = dt.grid, metrics = metric_set(accuracy, sens, spec))
best.dtX <- dt.tune %>% select_best("sens")
dt.finalX <- dt.setup %>% finalize_model(parameters =
best.dtX)
mod.dt.2.new <- dt.finalX %>% fit(Pass.Fail~., data = test.2)
pred.dt.2X <- test.2 %>%
bind_cols(predict(mod.dt.2.new, test.2)) %>%
bind_cols(predict(mod.dt.2.new, test.2, type = "prob"))
pred.dt.2X <- pred.dt.2X %>% select(1, tail(names(.),3))
head(pred.dt.2X)
metrics.eval.dt.2X <- pred.dt.2X %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.dt.2X
xg.grid <- expand.grid(trees = c(400,450,500,550,600), mtry = 2:6)
xg.setup <- boost_tree(learn_rate = 0.1, trees = tune(), mtry = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
xg.tune <- tune_grid(Pass.Fail~., model = xg.setup, resamples = foldsX,
grid = xg.grid, metrics = metric_set(accuracy, sens, spec))
## i Creating pre-processing data to finalize unknown parameter: mtry
best.xgX <- xg.tune %>% select_best("sens")
# the best parameters for highest recall are mtry = 2 and trees = 500
xg.finalX <- xg.setup %>% finalize_model(parameters = best.xgX)
# rebuild the model with the new parameters
mod.xg.new <- xg.finalX %>% fit(Pass.Fail~., data = test.2)
pred.xg.2X <- test.2 %>%
bind_cols(predict(mod.xg.new, test.2)) %>%
bind_cols(predict(mod.xg.new, test.2, type = "prob"))
pred.xg.2X <- pred.xg.2X %>% select(1, tail(names(.),3))
head(pred.xg.2X)
metrics.eval.xg.2X <- pred.xg.2X %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.xg.2X
lol, perfect score
all.metric.2X <- rbind(Random_Forest.tn2 = metrics.eval.rf.2X,
Decision_Tree.tn2 = metrics.eval.dt.2X,
Boosted_Tree.tn2 = metrics.eval.xg.2X) %>% data.frame()
all.metric.2X
Let’s show all the metrics to make the evaluation easier
metric.all <- rbind(all.metric,all.metric.2,all.metric.X,all.metric.2X)
metric.all <- rownames_to_column(metric.all)
# sort the table by descending recall
metric.all %>% arrange(desc(recall))
From the table above we see that every tuned Random_Forest and Boosted_Tree model achieves a perfect score. Taken at face value, that would mean the models predict unseen data flawlessly; but as I said before, there's no such thing as perfect metrics. The tuned models were refit on the test set before being evaluated on it, and the SMOTE step leaked near-duplicates across the split, so these scores are inflated.
The only rationally defensible result, for me, is the Random_Forest.2 model. It has 94% recall, and the trade-off is low specificity: most of the produced items will be flagged as defective, and very few truly defective items will be sold to the public and come back as returns. Back to our business case: if we pretend these models were built correctly, we'd choose the Random_Forest.2 model (preprocessed with the recipe package) as our anomaly detection model.
In this article we also learned that preprocessing with tidymodels (especially the recipe package) is much easier. The modeling and tuning steps are also a bit simpler than with the caret package (personally). Going forward, it would be wise to use the tidymodels packages to save time in preprocessing and model building.
Thank You!