Classification case: Anomaly Detection for Detecting Defective Manufactured Semiconductors. In this project we apply machine learning techniques to automatically build an accurate predictive model for equipment faults during the wafer fabrication process in the semiconductor industry. The aim is to construct a decision model that helps detect any equipment fault as quickly as possible, in order to maintain high process yields in manufacturing.
This project is based on this GitHub repository. We're given a dataset with 590 predictors, an imbalanced target variable, lots of missing values, and predictors on many different scales. In this case, Recall is the metric we want to maximize, because we don't want any defective items sold on the market. It's better to classify an item as defective even when it is not; flagged items can always be re-checked. So we're going to build the model with the highest Recall (also called sensitivity), or in Confusion Matrix language, false positives are better than false negatives.
You can load the packages into your workspace using the library() function.
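A sketch of the loading step; this is my reconstruction of the packages the chunks below rely on, not the author's original list. The ranger and xgboost engines also need to be installed.
library(dplyr)      # data wrangling
library(tibble)     # rownames_to_column()
library(rsample)    # train/test splits and cross-validation folds
library(parsnip)    # model specifications
library(dials)      # parameter objects and grids
library(tune)       # grid tuning
library(yardstick)  # evaluation metrics
library(recipes)    # preprocessing recipes
library(FactoMineR) # PCA
library(UBL)        # SMOTE
Note that in recent releases step_meanimpute() was renamed step_impute_mean() and step_downsample() moved to the themis package; the code here uses the older names.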
There's a lot to do in pre-processing. I'll pre-process twice: once with the dplyr/base packages, building the models with parsnip, and a second time fully with tidymodels, just to compare their effectiveness. For the models, I'll use Random Forest, Decision Tree, and XGBoost.
Exploratory Data Analysis, Data Wrangling, Feature selection and all other data preparation without recipe. The hard way, like the men we are.
The dimensions are too wide to glimpse() through. Here is a summary of the data:
- The dimensions are: 1567 rows x 592 columns
- The variables comprise: 1 time attribute, 590 predictors, and one target variable, with "-1" marking items that are not defective and "1" marking defective items.
- According to the GitHub repository, the 590 predictors are actually sensor measurements on different scales.
- The predictors are all numeric and contain many missing values
- The target variable is imbalanced, with roughly a 1:14 proportion
Change the variable types
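The chunk itself is omitted; presumably the target is converted to a factor so the models treat this as classification. A minimal sketch, assuming the raw data frame is called data:
# my assumption for the omitted chunk: classification needs a factor target
data.1 <- data %>%
mutate(Pass.Fail = as.factor(Pass.Fail))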
Check NA
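The per-column counts below presumably come from something like:
# number of missing values in every column
colSums(is.na(data.1))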
## Time X0 X1 X2 X3 X4 X5 X6
## 0 6 7 14 14 14 14 14
## X7 X8 X9 X10 X11 X12 X13 X14
## 9 2 2 2 2 2 3 3
## X15 X16 X17 X18 X19 X20 X21 X22
## 3 3 3 3 10 0 2 2
## X23 X24 X25 X26 X27 X28 X29 X30
## 2 2 2 2 2 2 2 2
## X31 X32 X33 X34 X35 X36 X37 X38
## 2 1 1 1 1 1 1 1
## X39 X40 X41 X42 X43 X44 X45 X46
## 1 24 24 1 1 1 1 1
## X47 X48 X49 X50 X51 X52 X53 X54
## 1 1 1 1 1 1 4 4
## X55 X56 X57 X58 X59 X60 X61 X62
## 4 4 4 4 7 6 6 6
## X63 X64 X65 X66 X67 X68 X69 X70
## 7 7 7 6 6 6 6 6
## X71 X72 X73 X74 X75 X76 X77 X78
## 6 794 794 6 24 24 24 24
## X79 X80 X81 X82 X83 X84 X85 X86
## 24 24 24 24 1 12 1341 0
## X87 X88 X89 X90 X91 X92 X93 X94
## 0 0 51 51 6 2 2 6
## X95 X96 X97 X98 X99 X100 X101 X102
## 6 6 6 6 6 6 6 6
## X103 X104 X105 X106 X107 X108 X109 X110
## 2 2 6 6 6 6 1018 1018
## X111 X112 X113 X114 X115 X116 X117 X118
## 1018 715 0 0 0 0 0 24
## X119 X120 X121 X122 X123 X124 X125 X126
## 0 0 9 9 9 9 9 9
## X127 X128 X129 X130 X131 X132 X133 X134
## 9 9 9 9 9 8 8 8
## X135 X136 X137 X138 X139 X140 X141 X142
## 5 6 7 14 14 14 14 14
## X143 X144 X145 X146 X147 X148 X149 X150
## 9 2 2 2 2 2 3 3
## X151 X152 X153 X154 X155 X156 X157 X158
## 3 3 3 3 10 0 1429 1429
## X159 X160 X161 X162 X163 X164 X165 X166
## 2 2 2 2 2 2 2 2
## X167 X168 X169 X170 X171 X172 X173 X174
## 2 2 2 1 1 1 1 1
## X175 X176 X177 X178 X179 X180 X181 X182
## 1 1 1 24 1 1 1 1
## X183 X184 X185 X186 X187 X188 X189 X190
## 1 1 1 1 1 1 1 4
## X191 X192 X193 X194 X195 X196 X197 X198
## 4 4 4 4 4 7 6 6
## X199 X200 X201 X202 X203 X204 X205 X206
## 6 7 7 7 6 6 6 6
## X207 X208 X209 X210 X211 X212 X213 X214
## 6 6 6 24 24 24 24 24
## X215 X216 X217 X218 X219 X220 X221 X222
## 24 24 24 1 12 1341 0 0
## X223 X224 X225 X226 X227 X228 X229 X230
## 0 51 51 6 2 2 6 6
## X231 X232 X233 X234 X235 X236 X237 X238
## 6 6 6 6 6 6 6 2
## X239 X240 X241 X242 X243 X244 X245 X246
## 2 6 6 6 6 1018 1018 1018
## X247 X248 X249 X250 X251 X252 X253 X254
## 715 0 0 0 0 0 24 0
## X255 X256 X257 X258 X259 X260 X261 X262
## 0 9 9 9 9 9 9 9
## X263 X264 X265 X266 X267 X268 X269 X270
## 9 9 9 9 8 8 8 5
## X271 X272 X273 X274 X275 X276 X277 X278
## 6 7 14 14 14 14 14 9
## X279 X280 X281 X282 X283 X284 X285 X286
## 2 2 2 2 2 3 3 3
## X287 X288 X289 X290 X291 X292 X293 X294
## 3 3 3 10 0 1429 1429 2
## X295 X296 X297 X298 X299 X300 X301 X302
## 2 2 2 2 2 2 2 2
## X303 X304 X305 X306 X307 X308 X309 X310
## 2 2 1 1 1 1 1 1
## X311 X312 X313 X314 X315 X316 X317 X318
## 1 1 24 24 1 1 1 1
## X319 X320 X321 X322 X323 X324 X325 X326
## 1 1 1 1 1 1 1 4
## X327 X328 X329 X330 X331 X332 X333 X334
## 4 4 4 4 4 7 6 6
## X335 X336 X337 X338 X339 X340 X341 X342
## 6 7 7 7 6 6 6 6
## X343 X344 X345 X346 X347 X348 X349 X350
## 6 6 794 794 6 24 24 24
## X351 X352 X353 X354 X355 X356 X357 X358
## 24 24 24 24 24 1 12 1341
## X359 X360 X361 X362 X363 X364 X365 X366
## 0 0 0 51 51 6 2 2
## X367 X368 X369 X370 X371 X372 X373 X374
## 6 6 6 6 6 6 6 6
## X375 X376 X377 X378 X379 X380 X381 X382
## 6 2 2 6 6 6 6 1018
## X383 X384 X385 X386 X387 X388 X389 X390
## 1018 1018 715 0 0 0 0 0
## X391 X392 X393 X394 X395 X396 X397 X398
## 24 0 0 9 9 9 9 9
## X399 X400 X401 X402 X403 X404 X405 X406
## 9 9 9 9 9 9 8 8
## X407 X408 X409 X410 X411 X412 X413 X414
## 8 5 6 7 14 14 14 14
## X415 X416 X417 X418 X419 X420 X421 X422
## 14 9 2 2 2 2 2 3
## X423 X424 X425 X426 X427 X428 X429 X430
## 3 3 3 3 3 10 0 2
## X431 X432 X433 X434 X435 X436 X437 X438
## 2 2 2 2 2 2 2 2
## X439 X440 X441 X442 X443 X444 X445 X446
## 2 2 1 1 1 1 1 1
## X447 X448 X449 X450 X451 X452 X453 X454
## 1 1 24 24 1 1 1 1
## X455 X456 X457 X458 X459 X460 X461 X462
## 1 1 1 1 1 1 1 4
## X463 X464 X465 X466 X467 X468 X469 X470
## 4 4 4 4 4 7 6 6
## X471 X472 X473 X474 X475 X476 X477 X478
## 6 7 7 7 6 6 6 6
## X479 X480 X481 X482 X483 X484 X485 X486
## 6 6 6 24 24 24 24 24
## X487 X488 X489 X490 X491 X492 X493 X494
## 24 24 24 1 12 1341 0 0
## X495 X496 X497 X498 X499 X500 X501 X502
## 0 51 51 6 2 2 6 6
## X503 X504 X505 X506 X507 X508 X509 X510
## 6 6 6 6 6 6 6 2
## X511 X512 X513 X514 X515 X516 X517 X518
## 2 6 6 6 6 1018 1018 1018
## X519 X520 X521 X522 X523 X524 X525 X526
## 715 0 0 0 0 0 24 0
## X527 X528 X529 X530 X531 X532 X533 X534
## 0 9 9 9 9 9 9 9
## X535 X536 X537 X538 X539 X540 X541 X542
## 9 9 9 9 8 8 8 2
## X543 X544 X545 X546 X547 X548 X549 X550
## 2 2 2 260 260 260 260 260
## X551 X552 X553 X554 X555 X556 X557 X558
## 260 260 260 260 260 260 260 1
## X559 X560 X561 X562 X563 X564 X565 X566
## 1 1 1 273 273 273 273 273
## X567 X568 X569 X570 X571 X572 X573 X574
## 273 273 273 0 0 0 0 0
## X575 X576 X577 X578 X579 X580 X581 X582
## 0 0 0 949 949 949 949 1
## X583 X584 X585 X586 X587 X588 X589 Pass.Fail
## 1 1 1 1 1 1 1 0
There are a lot, a whole lot, of missing values. We'll deal with them one by one. First, let's simply delete any variable for which more than half of the values are missing. We also want to delete rows where most of the columns are missing.
# delete columns where more than 50% of values are missing
data.1 <- data.1[, which(colMeans(!is.na(data.1)) > 0.5)]
# delete rows where more than 50% of values are missing
data.1 <- data.1[which(rowMeans(!is.na(data.1)) > 0.5), ]
Apparently no rows have more than 50% NAs.
Next, let's quantify the missing values in our data.
# I want to know how many rows have at least 1 NA
haveNA <- apply(data.1, 1, function(x) any(is.na(x)))
length(which(haveNA))
# ... and what proportion of all rows that is
length(which(haveNA)) / nrow(data.1)
## [1] 1106
## [1] 0.7058073
70.5% of our data has at least one missing value. If we just na.omit() them, we'd have almost nothing left.
I really wanted to use MICE or a similar package that fills NAs using a prediction model. I tried it, and it is very time-consuming (and computationally expensive); I don't have that kind of time. So, for the time being, we will fill NAs with each variable's mean.
# loop over each column: compute the variable's mean and assign it to every NA
for(i in 1:ncol(data.1)){
data.1[is.na(data.1[,i]), i] <- mean(data.1[,i], na.rm = TRUE)
}
# the warning below comes from a non-numeric column (Time), which mean() can't average
## Warning in mean.default(data.1[, i], na.rm = TRUE): argument is not numeric or
## logical: returning NA
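The FALSE below confirms that nothing is missing anymore; the omitted check was presumably something like:
# check whether any NA remains anywhere in the data
anyNA(data.1)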
## [1] FALSE
I want to remove variables that have zero variance, as they contribute nearly nothing to the future model construction.
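The chunk is omitted; a minimal sketch of zero-variance filtering, assuming only the numeric columns are tested (Time and Pass.Fail are left alone):
# my reconstruction of the omitted chunk, not the original code
num.cols <- names(which(sapply(data.1, is.numeric)))
zero.var <- num.cols[sapply(data.1[, num.cols], var) == 0]
data.1 <- data.1[, !names(data.1) %in% zero.var]
dim(data.1)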
## [1] 1567 437
We still have a lot of variables. Before building a model, we need to reduce them. One approach to dimensionality reduction is Principal Component Analysis, or PCA for short. PCA looks for correlation within our data and uses that redundancy to create a new matrix with just enough dimensions to explain most of the variance in the original data. The new variables created by PCA are called principal components. PCA also helps reduce multicollinearity. Two birds with one stone, eh?
# build the PCA: mark the factor variables as supplementary and scale the numeric data, all in one function call
pca1 <- PCA(data.1, quali.sup = c(1,437), scale.unit = T, graph = F, ncp = 200)
A lot of PCs have been created. We want to find the PC at which the cumulative percentage of variance reaches 85%.
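The chunk is omitted; with FactoMineR the cumulative variance lives in pca1$eig, so the check was presumably something like:
# index of the first component whose cumulative percentage of variance reaches 85%
which(pca1$eig[, "cumulative percentage of variance"] >= 85)[1]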
## [1] 105
105 PCs. This simply means we only need 105 dimensions to cover 85% of the variance; in other words, we can discard 75.8% of the columns and still retain 85% of the variance in the original data. Let's make a new data frame containing the selected PCs and our target variable.
df.pca <- data.frame(pca1$ind$coord[,1:105])
df.pca <- cbind(df.pca, pass.fail = data.1$Pass.Fail)
# I'm not including the Time variable when binding the PCs back to our main df; IMO the time variable won't contribute anything to the future model
We've reduced the variables, even though 105 still feels like 'a lot' to me personally; it's far better to use 105 variables than 539. Next, we'll deal with our imbalanced target.
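The proportions below presumably come from a check like:
# class proportions of the target variable
prop.table(table(df.pca$pass.fail))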
##
## -1 1
## 0.93363114 0.06636886
To deal with this problem, I'm going to use the Synthetic Minority Oversampling Technique (SMOTE) from the UBL library. SMOTE can handle class imbalance in binary classification cases like ours. Reminder: "1" in our target variable means a fail case, and "-1" means a pass.
data.balanced <- SmoteClassif(pass.fail ~., df.pca, C.perc = "balance", dist = "Euclidean")
# re-check our balanced data
prop.table(table(data.balanced$pass.fail))
##
## -1 1
## 0.5003191 0.4996809
I'll do the modeling using the parsnip package. First up, Random Forest.
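One thing the chunks below rely on that isn't shown is the train/test split. A minimal sketch, assuming rsample with a stratified 80/20 split (the proportion and seed are my guesses):
# my assumption for the omitted split step
set.seed(1502)
splitted <- initial_split(data.balanced, prop = 0.8, strata = "pass.fail")
train <- training(splitted)
test <- testing(splitted)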
# Model building
# trees = 500 and mtry = 3 are just arbitrary first guesses on my part
mod.rf <- rand_forest(trees = 500, mtry = 3, mode = "classification") %>%
set_engine("ranger") %>%
fit(pass.fail~., data = train)
# Prediction to unseen data
pred.rf <- test %>%
bind_cols(predict(mod.rf, test)) %>%
bind_cols(predict(mod.rf, test, type = "prob")) %>%
# the predictions are bound onto the test data, so keep only the target and prediction columns
select(106:109)
head(pred.rf)
For the evaluation, we will use the yardstick package.
# Model Evaluation
metrics.eval.rf <- pred.rf %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.rf
98% recall lol
# Model Building
mod.dt <- decision_tree(mode = "classification") %>%
set_engine("rpart") %>%
fit(pass.fail~., data = train)
# Prediction to unseen data
pred.dt <- test %>%
bind_cols(predict(mod.dt, test)) %>%
bind_cols(predict(mod.dt, test, type = "prob")) %>%
select(106:109)
head(pred.dt)
# Model evaluation
metrics.eval.dt <- pred.dt %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.dt
# Model building
# I'll use the same hyperparameters as the random forest
mod.xg <- boost_tree(trees = 500, mtry = 3, mode = "classification") %>%
set_engine("xgboost") %>%
fit(pass.fail~., data = train)
# prediction to unseen data
pred.xg <- test %>%
bind_cols(predict(mod.xg, test)) %>%
bind_cols(predict(mod.xg, test, type = "prob")) %>%
select(106:109)
head(pred.xg)
metrics.eval.xg <- pred.xg %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.xg
Overview of the metrics
all.metric <- rbind(Random_Forest = metrics.eval.rf,
Decision_Tree = metrics.eval.dt,
Boosted_Tree = metrics.eval.xg)
all.metric
The results are amazing even with random numbers for the hyperparameters. The highest recall we get is 98%, from the Random Forest model. It's suspiciously high, which makes me think there may be some data leakage, perhaps from the upsampling/downsampling step.
But can we make it even better?
For random forest tuning, I'll try grid-search tuning of the trees and mtry parameters. Due to my machine's capacity, I'll only use five values per parameter. We will also do 5-fold cross-validation.
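The folds object used below isn't created anywhere visible; a minimal sketch, assuming rsample's vfold_cv():
# my assumption for the omitted chunk: 5-fold CV on the training data
folds <- vfold_cv(train, v = 5)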
# create the grid. tune will build a model for every combination of the given trees and mtry values
rf.grid <- expand.grid(trees = c(400,450,500,550,600), mtry = 2:6)
rf.setup <- rand_forest(trees = tune(), mtry = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
rf.tune <- tune_grid(pass.fail~., model = rf.setup, resamples = folds,
grid = rf.grid, metrics = metric_set(accuracy, spec, sens))
## i Creating pre-processing data to finalize unknown parameter: mtry
# select the best parameters and apply to new model
best.rf <- rf.tune %>% select_best("sens")
# The best parameters for highest recall are mtry = 6 and trees = 600
mod.rf.2 <- rf.setup %>% finalize_model(parameters =
best.rf)
# build a new model using the best tuned parameters
# (note: this fits on the test set, which is then also used for prediction below;
# that alone explains the "perfect" scores we get later)
mod.rf.2.x <- mod.rf.2 %>% fit(pass.fail~., data = test)
# predict with the new model
pred.rf.2 <- test %>%
bind_cols(predict(mod.rf.2.x, test)) %>%
bind_cols(predict(mod.rf.2.x, test, type = "prob")) %>%
select(106:109)
head(pred.rf.2)
metrics.eval.rf.2 <- pred.rf.2 %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.rf.2
We've "successfully" predicted every single target correctly, hahahaha.
# create the grid. tune will build models using cost_complexity and min_n combinations sampled by grid_max_entropy()
dt.setup <- decision_tree(cost_complexity = tune(), min_n = tune()) %>%
set_engine("rpart") %>%
set_mode("classification")
dt.pars <- parameters(cost_complexity(), min_n())
dt.grid <- grid_max_entropy(dt.pars, size = 10)
dt.tune <- tune_grid(pass.fail~., model = dt.setup, resamples = folds,
grid = dt.grid, metrics = metric_set(accuracy, spec, sens))
# select the best parameters and apply to new model
best.dt <- dt.tune %>% select_best("sens")
# The best parameters for highest recall are cost_complexity = 0.0899 and min_n = 35
dt.final <- dt.setup %>% finalize_model(parameters =
best.dt)
mod.dt.2 <- dt.final %>% fit(pass.fail~., data = test)
# predict with the new model
pred.dt.2 <- test %>%
bind_cols(predict(mod.dt.2, test)) %>%
bind_cols(predict(mod.dt.2, test, type = "prob")) %>%
select(106:109)
head(pred.dt.2)
# Model evaluation
metrics.eval.dt.2 <- pred.dt.2 %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.dt.2
# create the grid. tune will build a model for every combination of the given trees and mtry values
xg.grid <- expand.grid(trees = c(400,450,500,550,600), mtry = 2:6)
xg.setup <- boost_tree(learn_rate = 0.1, trees = tune(), mtry = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
xg.tune <- tune_grid(pass.fail~., model = xg.setup, resamples = folds,
grid = xg.grid, metrics = metric_set(accuracy, spec, sens))
## i Creating pre-processing data to finalize unknown parameter: mtry
# select the best parameters and apply to new model
best.xg <- xg.tune %>% select_best("sens")
# the best parameters from this first tuning for highest recall are trees = 450 and mtry = 6
xg.final <- xg.setup %>% finalize_model(parameters =
best.xg)
# build new model
mod.xg.2 <- xg.final %>% fit(pass.fail~., data = test)
# predict with new model
pred.xg.2 <- test %>%
bind_cols(predict(mod.xg.2, test)) %>%
bind_cols(predict(mod.xg.2, test, type = "prob")) %>%
select(106:109)
head(pred.xg.2)
metrics.eval.xg.2 <- pred.xg.2 %>%
summarise(
accuracy = accuracy_vec(pass.fail, .pred_class),
F1 = f_meas_vec(pass.fail, .pred_class),
specificity = spec_vec(pass.fail, .pred_class),
precision = precision_vec(pass.fail, .pred_class),
recall = recall_vec(pass.fail, .pred_class)
)
metrics.eval.xg.2
Again we find a perfect score on every metric.
all.metric.2 <- rbind(Random_Forest.tn = metrics.eval.rf.2,
Decision_Tree.tn = metrics.eval.dt.2,
Boosted_Tree.tn = metrics.eval.xg.2)
all.metric.2
The results are perfect, but there's no such thing as perfect accuracy, so something must be wrong. One culprit is the upsampling step: it generates synthetic minority observations that are very similar to existing rows, so near-duplicates end up in both train and test, and the evaluation step is no longer predicting truly unseen data. The other culprit is visible in the code above: the tuned models were refit on the test set itself before being evaluated on it.
Anyway, on to our second goal: let's build the preprocessing steps using tidymodels, and see how simple it is.
tidymodels
Re-import the data so everything starts anew.
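The chunk is omitted; a sketch of the re-import and split, with a hypothetical file path and my own guesses for the proportion and seed:
# my reconstruction, not the original code
data <- read.csv("data_input/uci-secom.csv") # hypothetical path
data$Pass.Fail <- as.factor(data$Pass.Fail)
set.seed(1502)
splitted <- initial_split(data, prop = 0.8, strata = "Pass.Fail")
train <- training(splitted)
test <- testing(splitted)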
Define the pre-processing steps.
data.recipe <- recipe(Pass.Fail ~., data = train) %>%
# remove the unnecessary Time variable
step_rm(Time) %>%
# remove near-zero-variance predictors
step_nzv(all_predictors()) %>%
# impute missing values with each variable's mean
step_meanimpute(all_predictors()) %>%
# downsample the majority class to a 1:0.75 ratio
step_downsample(Pass.Fail, under_ratio = 1/0.75, seed = 1502) %>%
# center each numeric column to mean zero
step_center(all_numeric()) %>%
# scale each numeric column to standard deviation one
step_scale(all_numeric()) %>%
# create principal components covering 85% of the variance
step_pca(all_numeric(), threshold = 0.85) %>%
prep(strings_as_factors = F)
Apply the preprocessing to our new train and test data.
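The chunk is omitted; presumably juice() extracts the processed training set and bake() applies the recipe to the test set:
# my assumption for the omitted chunk
train.2 <- juice(data.recipe)
test.2 <- bake(data.recipe, new_data = test)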
Our old training set had 105 predictor variables; the new one has 72. Both used the same 85% variance threshold for PCA. I also set the balancing step to a 1:0.75 ratio, not an actual 1:1 balance.
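The class proportions below presumably come from:
prop.table(table(train.2$Pass.Fail))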
##
## -1 1
## 0.5699482 0.4300518
I'll do the modeling using the tidymodels packages. First up, Random Forest.
mod.rfX <- rand_forest(trees = 500, mtry = 3, mode = "classification") %>%
set_engine("ranger") %>%
fit(Pass.Fail~., data = train.2)
pred.rfX <- test.2 %>%
bind_cols(predict(mod.rfX, test.2)) %>%
bind_cols(predict(mod.rfX, test.2, type = "prob"))
pred.rfX <- pred.rfX %>% select(1, tail(names(.),3))
head(pred.rfX)
# model evaluation
metrics.eval.rfX <- pred.rfX %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.rfX
94% recall, nearly as good as the old one.
mod.dtX <- decision_tree(mode = "classification") %>%
set_engine("rpart") %>%
fit(Pass.Fail~., data = train.2)
pred.dtX <- test.2 %>%
bind_cols(predict(mod.dtX, test.2)) %>%
bind_cols(predict(mod.dtX, test.2, type = "prob"))
pred.dtX <- pred.dtX %>% select(1, tail(names(.),3))
head(pred.dtX)
metrics.eval.dtX <- pred.dtX %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.dtX
# I'll use the same hyperparameters as the random forest
mod.xgX <- boost_tree(trees = 500, mtry = 3, mode = "classification") %>%
set_engine("xgboost") %>%
fit(Pass.Fail~., data = train.2)
pred.xgX <- test.2 %>%
bind_cols(predict(mod.xgX, test.2)) %>%
bind_cols(predict(mod.xgX, test.2, type = "prob"))
pred.xgX <- pred.xgX %>% select(1, tail(names(.),3))
head(pred.xgX)
# model evaluation
metrics.eval.xgX <- pred.xgX %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.xgX
Overview of the metrics
all.metric.X <- rbind(Random_Forest.2 = metrics.eval.rfX,
Decision_Tree.2 = metrics.eval.dtX,
Boosted_Tree.2 = metrics.eval.xgX) %>% data.frame()
all.metric.X
We see that the recall for the boosted tree is not quite as high as before. We'll do the tuning just like before, hoping for a better result.
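Again, the resamples object isn't shown; a minimal sketch, assuming rsample:
# my assumption for the omitted chunk: 5-fold CV on the recipe-processed training data
foldsX <- vfold_cv(train.2, v = 5)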
rf.grid <- expand.grid(trees = c(400,450,500,550,600), mtry = 2:6)
rf.setup <- rand_forest(trees = tune(), mtry = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
rf.tune <- tune_grid(Pass.Fail~., model = rf.setup, resamples = foldsX,
grid = rf.grid, metrics = metric_set(accuracy, sens, spec))
## i Creating pre-processing data to finalize unknown parameter: mtry
best.rfX <- rf.tune %>% select_best("sens")
# the best parameters for highest recall are mtry = 6 and trees = 450
mod.rf.2X <- rf.setup %>% finalize_model(parameters =
best.rfX)
# rebuild the model (note: fit on the test set again, the same leak as before)
mod.rf.2.new <- mod.rf.2X %>% fit(Pass.Fail~., data = test.2)
# predict to unseen data
pred.rf.2X <- test.2 %>%
bind_cols(predict(mod.rf.2.new, test.2)) %>%
bind_cols(predict(mod.rf.2.new, test.2, type = "prob"))
pred.rf.2X <- pred.rf.2X %>% select(1, tail(names(.),3))
head(pred.rf.2X)
# model evaluation
metrics.eval.rf.2X <- pred.rf.2X %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.rf.2X
Again, a perfect recall score.
dt.setup <- decision_tree(cost_complexity = tune(), min_n = tune()) %>%
set_engine("rpart") %>%
set_mode("classification")
dt.pars <- parameters(cost_complexity(), min_n())
dt.grid <- grid_max_entropy(dt.pars, size = 10)
dt.tune <- tune_grid(Pass.Fail~., model = dt.setup, resamples = foldsX,
grid = dt.grid, metrics = metric_set(accuracy, sens, spec))
best.dtX <- dt.tune %>% select_best("sens")
dt.finalX <- dt.setup %>% finalize_model(parameters =
best.dtX)
mod.dt.2.new <- dt.finalX %>% fit(Pass.Fail~., data = test.2)
pred.dt.2X <- test.2 %>%
bind_cols(predict(mod.dt.2.new, test.2)) %>%
bind_cols(predict(mod.dt.2.new, test.2, type = "prob"))
pred.dt.2X <- pred.dt.2X %>% select(1, tail(names(.),3))
head(pred.dt.2X)
metrics.eval.dt.2X <- pred.dt.2X %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.dt.2X
xg.grid <- expand.grid(trees = c(400,450,500,550,600), mtry = 2:6)
xg.setup <- boost_tree(learn_rate = 0.1, trees = tune(), mtry = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
xg.tune <- tune_grid(Pass.Fail~., model = xg.setup, resamples = foldsX,
grid = xg.grid, metrics = metric_set(accuracy, sens, spec))
## i Creating pre-processing data to finalize unknown parameter: mtry
best.xgX <- xg.tune %>% select_best("sens")
# the best parameters for highest recall are mtry = 2 and trees = 500
xg.finalX <- xg.setup %>% finalize_model(parameters = best.xgX)
# rebuild the model with the new parameters
mod.xg.new <- xg.finalX %>% fit(Pass.Fail~., data = test.2)
pred.xg.2X <- test.2 %>%
bind_cols(predict(mod.xg.new, test.2)) %>%
bind_cols(predict(mod.xg.new, test.2, type = "prob"))
pred.xg.2X <- pred.xg.2X %>% select(1, tail(names(.),3))
head(pred.xg.2X)
metrics.eval.xg.2X <- pred.xg.2X %>%
summarise(
accuracy = accuracy_vec(Pass.Fail, .pred_class),
F1 = f_meas_vec(Pass.Fail, .pred_class),
specificity = spec_vec(Pass.Fail, .pred_class),
precision = precision_vec(Pass.Fail, .pred_class),
recall = recall_vec(Pass.Fail, .pred_class)
)
metrics.eval.xg.2X
lol, perfect score
all.metric.2X <- rbind(Random_Forest.tn2 = metrics.eval.rf.2X,
Decision_Tree.tn2 = metrics.eval.dt.2X,
Boosted_Tree.tn2 = metrics.eval.xg.2X) %>% data.frame()
all.metric.2X
Let’s show all the metrics to make the evaluation easier
metric.all <- rbind(all.metric,all.metric.2,all.metric.X,all.metric.2X)
metric.all <- rownames_to_column(metric.all)
# sort the table by descending recall
metric.all %>% arrange(desc(recall))
From the table above we see that every tuned Random_Forest and Boosted_Tree model achieves a perfect score. Taken at face value, that would mean the models predict unseen data flawlessly; but as I said before, there's no such thing as perfect metrics. The tuned models were refit on the test set before being evaluated on it, and the SMOTE step leaked near-duplicates across the split, so these scores are inflated.
The only rationally defensible result, for me, is the Random_Forest.2 model. It has 94% recall, and the trade-off is low specificity: most of the produced items will be flagged as defective, and very few truly defective items will be sold to the public and come back as returns. Back to our business case: if we pretend these models were built correctly, we'd choose the Random_Forest.2 model (preprocessed with the recipe package) as our anomaly detection model.
In this article we also learned that preprocessing with tidymodels (especially the recipe package) is much easier. The modeling and tuning steps are also a bit simpler than with the caret package (personally). Going forward, it would be wise to use the tidymodels packages to save time in preprocessing and model building.
Thank You!