Jeff Hung

Data scientist at the Institute of Manufacturing Information and Systems, National Cheng Kung University


Abstract

     Semiconductor manufacturing is one of the most technologically complex manufacturing processes. Traditional machine learning algorithms such as univariate and multivariate analyses have long been deployed as tools for building predictive models that detect faults. In the past decade, major collaborative research projects in predictive modeling have been undertaken between fab industries and academia. In this study we propose machine learning techniques to automatically generate an accurate predictive model of equipment faults during the wafer fabrication process in the semiconductor industry. We aim to construct a decision model that helps detect equipment faults as quickly as possible, in order to maintain high process yields in manufacturing.


Framework

  1. Data Description
  2. Data Preprocessing
  3. Feature Selection
  4. Model Construction & Validation
  5. Conclusion

Data Description

Data Sources & Goal

     The SECOM (Semiconductor Manufacturing) dataset consists of manufacturing operation data and semiconductor quality data. It contains 1567 observations taken from a wafer fabrication production line; each observation is a vector of 590 sensor measurements plus a pass/fail test label. There are only 104 fail cases, labeled as positive (encoded as 1), whereas the much larger number of examples that pass the test are labeled as negative (encoded as -1), a roughly 1:14 proportion. In this work, not only is a feature selection method proposed for extracting the most discriminative sensors, but boosting and data generation techniques are also devised to deal with the high imbalance between the pass and fail cases.
Source: UCI SECOM DataSet

Load Packages

     Before loading the packages below, install any you are missing on your local machine.
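
     A one-time installation along these lines should work (a sketch; note that DMwR has been archived and may need to be installed from the CRAN archive on newer R versions):

install.packages(c("data.table", "plotly", "dplyr", "DMwR", "mice", "missForest",
                   "ROSE", "glmnet", "plotmo", "xgboost", "caret"))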

library(data.table)   # Read Data
library(plotly)       # Data Visualization
library(dplyr)        # Data Manipulation (select, %>%), used in the GBM section
library(DMwR)         # Data Imputation by Knn
library(mice)         # Data Imputation by Regression
library(missForest)   # Data Imputation by Random Forest
library(ROSE)         # Synthetic Data Generation
library(glmnet)       # Lasso Regression
library(plotmo)       # Lasso Regression Visualization
library(xgboost)      # Gradient Boosting Machine
library(caret)        # Cross Validation

Read Data

     After reading the feature and label datasets, we bound them together and renamed the columns. We also converted the response to a factor and the timestamp to POSIXct for the analysis afterward.

feature <- fread("/Users/hungyushin/Desktop/R/Rpubs/SECOM/Semiconductor/secom.data.txt", data.table = F)
label <- fread("/Users/hungyushin/Desktop/R/Rpubs/SECOM/Semiconductor/secom_labels.data.txt", data.table = F)
data <- cbind(label, feature)                                   # labels first, then the 590 sensors
colnames(data) <- c("Class", "Time", paste0("Feature", seq_len(ncol(feature))))
data$Class <- factor(data$Class, labels = c("pass", "fail"))    # -1 -> pass, 1 -> fail
data$Time <- as.POSIXct(data$Time, format = "%d/%m/%Y %H:%M:%S", tz = "GMT")

Data Summary

     The dataset contains 1567 observations taken from a wafer fabrication production line. Each observation is a vector of 590 sensor measurements plus a pass/fail label, and there are only 104 fail cases, a roughly 1:14 proportion. Looking at some of the features, we can see missing values and constant-valued columns that need preprocessing, which is the next step.

str(data, list.len=8)
## 'data.frame':    1567 obs. of  592 variables:
##  $ Class     : Factor w/ 2 levels "pass","fail": 1 1 2 1 1 1 1 1 1 1 ...
##  $ Time      : POSIXct, format: "2008-07-19 11:55:00" "2008-07-19 12:32:00" ...
##  $ Feature1  : num  3031 3096 2933 2989 3032 ...
##  $ Feature2  : num  2564 2465 2560 2480 2503 ...
##  $ Feature3  : num  2188 2230 2186 2199 2233 ...
##  $ Feature4  : num  1411 1464 1698 910 1327 ...
##  $ Feature5  : num  1.36 0.829 1.51 1.32 1.533 ...
##  $ Feature6  : num  100 100 100 100 100 100 100 100 100 100 ...
##   [list output truncated]
summary(data[,1:8])
##   Class           Time                        Feature1       Feature2   
##  pass:1463   Min.   :2008-07-19 11:55:00   Min.   :2743   Min.   :2159  
##  fail: 104   1st Qu.:2008-08-22 00:55:30   1st Qu.:2966   1st Qu.:2452  
##              Median :2008-09-11 08:06:00   Median :3011   Median :2499  
##              Mean   :2008-09-09 18:37:39   Mean   :3014   Mean   :2496  
##              3rd Qu.:2008-09-29 11:33:00   3rd Qu.:3057   3rd Qu.:2539  
##              Max.   :2008-10-17 06:07:00   Max.   :3356   Max.   :2846  
##                                            NA's   :6      NA's   :7     
##     Feature3       Feature4       Feature5            Feature6  
##  Min.   :2061   Min.   :   0   Min.   :   0.6815   Min.   :100  
##  1st Qu.:2181   1st Qu.:1082   1st Qu.:   1.0177   1st Qu.:100  
##  Median :2201   Median :1285   Median :   1.3168   Median :100  
##  Mean   :2201   Mean   :1396   Mean   :   4.1970   Mean   :100  
##  3rd Qu.:2218   3rd Qu.:1591   3rd Qu.:   1.5257   3rd Qu.:100  
##  Max.   :2315   Max.   :3715   Max.   :1114.5366   Max.   :100  
##  NA's   :14     NA's   :14     NA's   :14          NA's   :14
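
     As a quick check of the 1:14 pass/fail proportion mentioned above (a small sketch, not part of the original output):

tab <- table(data$Class)
tab["pass"] / tab["fail"]   # roughly 14 passing wafers for every failing one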

Data Preprocessing

     After observing all variables, there are two kinds of issues that need to be corrected: redundant variables and missing values.

Variable Redundant

     Drop the constant-valued features, as well as the variable “Time”, which is not of concern in this study.

# Time #
index_vr1 <- which(colnames(data) == "Time")

# Equal Value #
equal_v <- apply(data, 2, function(x) max(na.omit(x)) == min(na.omit(x)))
index_vr2 <- which(equal_v == T)

Missing Value Imputation

     To deal with missing values, we examine the proportion of missing values in each observation and each variable. The figures below show that no observation has more than 25.7% missing values; however, high proportions of missing values do occur in some of the variables. Thus, we drop the variables that contain more than 40% missing values and impute the remaining ones.
     There are several methods for missing value imputation, such as k-nearest neighbors, regression, and random forest. K-nearest neighbors (KNN) is a natural improvement over mean imputation that exploits the observed data structure. Multivariate Imputation by Chained Equations (MICE) is based on a much more complex algorithm whose behavior appears to be related to the size of the dataset, and it becomes time-intensive on large datasets. More details can be found in the following article. In this study, we use KNN to impute the missing values, as it is a relatively efficient algorithm for high-dimensional data.
Review Article: A Comparison of Six Methods for Missing Data Imputation

# Missing Value #
row_NA <- apply(data, 1, function(x) sum(is.na(x))/ncol(data))   # proportion missing per observation
col_NA <- apply(data, 2, function(x) sum(is.na(x))/nrow(data))   # proportion missing per variable
plot_ly(x = seq(1,nrow(data)), y = row_NA, type = "scatter", mode = "markers") %>%
  layout(title = "Observation Missing Value Proportion",
         xaxis = list(title = "Observation Index"),
         yaxis = list(title = "Proportion"))
plot_ly(x = seq(1,ncol(data)), y = col_NA, type = "scatter", mode = "markers") %>%
  layout(title = "Variable Missing Value Proportion",
         xaxis = list(title = "Variable Index"),
         yaxis = list(title = "Proportion"))
index_mr <- which(col_NA > 0.4)                              # variables with > 40% missing
data_c <- data[,-unique(c(index_vr1, index_vr2, index_mr))]  # drop redundant & sparse columns
data_I <- knnImputation(data_c)                              # KNN imputation (DMwR), k = 10 by default
# Alternatives (not run):
#data_IIpmm <- mice(data_c, m=1, maxit = 1, method = 'pmm', seed = 500)   # MICE (pmm)
#data_II <- complete(data_IIpmm,1)
#data_miss <- missForest(data_c)                                          # random forest
#data_III <- data_miss$ximp

Training & Testing

     Split the dataset into training and testing sets (roughly 90/10) for model construction and validation.

set.seed(2)
index <- sample(1:nrow(data_I), nrow(data_I)/10)
train <- data_I[-index,]
test <- data_I[index,]

Synthetic Data Generation

     To deal with the imbalanced training data, we applied the ROSE (Random Over-Sampling Examples) method, which generates artificial balanced samples via a smoothed bootstrap of the feature space. The related SMOTE algorithm, which uses bootstrapping and k-nearest neighbors to generate artificial minority examples, is another common choice. More details can be found in the following article.
Guide: Practical Guide to deal with Imbalanced Classification Problems in R

train_rose <- ROSE(Class ~ ., data = train, seed = 1)$data
table(train_rose$Class)
## 
## pass fail 
##  741  670
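
     For reference, a SMOTE-based alternative using the already-loaded DMwR package could look like this (a sketch; the oversampling rates are illustrative, not tuned):

train_smote <- SMOTE(Class ~ ., data = train, perc.over = 200, perc.under = 150)   # k-NN based oversampling of "fail"
table(train_smote$Class)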

Feature Selection

Lasso Regression

     There are 590 sensor measurements, so this is a high-dimensional dataset. In this part, we apply a powerful dimension reduction technique, Lasso regression, to select the important features. The first figure below shows that the higher the penalty (lambda), the more each feature's coefficient shrinks. To choose a proper penalty, we applied cross validation to find the lambda that gives the smallest misclassification error (ME). The second figure shows that the model attains the smallest ME when roughly 140 features remain, so we selected these ~140 features to construct the classification model.

fit_LS <- glmnet(as.matrix(train_rose[,-1]), train_rose[,1], family="binomial", alpha=1)
plot_glmnet(fit_LS, "lambda", label=5)

fit_LS_cv <- cv.glmnet(as.matrix(train_rose[,-1]), as.matrix(as.numeric(train_rose[,1])-1), type.measure="class", family="binomial", alpha=1)
plot(fit_LS_cv)

coef <- coef(fit_LS_cv, s = "lambda.min")                    # coefficients at the best lambda
coef_df <- as.data.frame(as.matrix(coef))
index_LS <- rownames(coef_df)[which(coef_df[,1] != 0)][-1]   # features with nonzero coefficients, intercept dropped
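
     A quick sanity check (not part of the original output) that the selection matches the cross-validation figure:

length(index_LS)   # number of selected features, about 140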

Model Construction

Logistic Regression

     After feature selection, we constructed a binary logistic regression model with the training set. The following table shows the twenty most significant coefficients, from which we can see how the parameters of these sensors affect the semiconductors. Since the baseline of this model is “pass”, when Feature60, for example, increases by one unit, the odds of failure are multiplied by approximately exp(0.0379); in other words, as Feature60 increases, the semiconductor is more likely to fail, and vice versa for negative coefficients. The confusion matrix and the receiver operating characteristic (ROC) curve are two common performance metrics for evaluating accuracy, so we use both in this study. Since we hope to find the failures, choosing a proper threshold to balance sensitivity and specificity is also important: the more rigorously we flag defective units, the higher the chance of predicting good units as bad. This is a trade-off that needs to be balanced.

fit_LR <- glm(Class ~ ., data=train_rose[,c("Class",index_LS)], family = "binomial")
table_LR <- round(summary(fit_LR)$coefficient, 4)
table_LR[order(table_LR[,4])[1:20],]
##            Estimate Std. Error z value Pr(>|z|)
## Feature32   -0.8370     0.1760 -4.7558   0.0000
## Feature60    0.0379     0.0085  4.4684   0.0000
## Feature130   0.3404     0.0723  4.7072   0.0000
## Feature501   0.0010     0.0002  4.4735   0.0000
## Feature319   0.3263     0.0823  3.9641   0.0001
## Feature166  -0.5996     0.1662 -3.6083   0.0003
## Feature222 -12.6033     3.4697 -3.6324   0.0003
## Feature481  -0.0080     0.0022 -3.5747   0.0004
## Feature211   5.0271     1.4354  3.5022   0.0005
## Feature12  -25.4590     7.4407 -3.4216   0.0006
## Feature189   0.0135     0.0040  3.4164   0.0006
## Feature442   1.3228     0.3978  3.3253   0.0009
## Feature589  89.8406    27.5770  3.2578   0.0011
## Feature282 -50.3598    15.9272 -3.1619   0.0016
## Feature30    0.6186     0.1967  3.1445   0.0017
## Feature424   0.0068     0.0022  3.0310   0.0024
## Feature78    6.9996     2.3285  3.0060   0.0026
## Feature351 -17.6774     6.0846 -2.9053   0.0037
## Feature103  -3.0308     1.0735 -2.8233   0.0048
## Feature486  -0.0010     0.0004 -2.8123   0.0049
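
     To read a coefficient as an odds ratio (a quick sketch, not part of the original output):

exp(coef(fit_LR)["Feature60"])   # exp(0.0379) ≈ 1.039: each unit increase multiplies the odds of failure by about 1.04
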
pred_LR <- factor(ifelse(predict(fit_LR, test, type = "response") > 0.5, "fail", "pass"), levels = c("pass", "fail"))
table(test$Class, pred_LR)
##       pred_LR
##        pass fail
##   pass  111   34
##   fail    5    6
roc.curve(test$Class, predict(fit_LR, test))

## Area under the curve (AUC): 0.746
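
     caret's confusionMatrix() reports sensitivity and specificity directly at the 0.5 cutoff used above (a sketch, not part of the original output):

confusionMatrix(pred_LR, test$Class, positive = "fail")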

Gradient Boosting Machine

     A particular implementation of gradient boosting, XGBoost, is consistently used to win machine learning competitions on Kaggle. In this study, we constructed a GBM as a second predictive model; further tuning of this methodology is not shown here. In short, its predictive power is approximately the same as the logistic regression's, even without the help of synthetic data generation and feature selection.

params <- list(
  "objective"           = "reg:logistic",
  "eval_metric"         = "logloss",
  "eta"                 = 0.1,
  "max_depth"           = 3,
  "min_child_weight"    = 10,
  "gamma"               = 0.70,
  "subsample"           = 0.76,
  "colsample_bytree"    = 0.95,
  "alpha"               = 2e-05,
  "lambda"              = 10
)
X <- xgb.DMatrix(as.matrix(train %>% select(-Class)), label = as.numeric(train$Class)-1)
fit_GBM <- xgboost(data = X, params = params, nrounds = 50, verbose = 0)
importance <- xgb.importance(setdiff(colnames(train), "Class"), model = fit_GBM)   # feature names must exclude the label
xgb.plot.importance(importance[1:20])

Y <- xgb.DMatrix(as.matrix(test %>% select(-Class)))
pred_GBM <- factor(ifelse(predict(fit_GBM, Y) > 0.07, "fail", "pass"), levels = c("pass", "fail"))   # low cutoff, presumably to catch more fails under the class imbalance
table(test$Class, pred_GBM)
##       pred_GBM
##        pass fail
##   pass  101   44
##   fail    3    8
roc.curve(test$Class, predict(fit_GBM, Y))

## Area under the curve (AUC): 0.744
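
     To see the trade-off behind the 0.07 cutoff, one can scan a few candidate thresholds (a sketch, not part of the original run):

prob_GBM <- predict(fit_GBM, Y)
for (cut in c(0.05, 0.07, 0.1, 0.3, 0.5)) {
  pred <- ifelse(prob_GBM > cut, "fail", "pass")
  cat(sprintf("cutoff %.2f: sensitivity %.3f, specificity %.3f\n", cut,
              mean(pred[test$Class == "fail"] == "fail"),
              mean(pred[test$Class == "pass"] == "pass")))
}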

Cross Validation
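
     The caret package was loaded for cross validation. A minimal sketch of how the logistic regression on the selected features could be cross-validated (assumed setup, not results from the original run):

ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
fit_cv <- train(Class ~ ., data = train_rose[, c("Class", index_LS)],
                method = "glm", family = "binomial",
                metric = "ROC", trControl = ctrl)
fit_cv$results   # cross-validated ROC (AUC), sensitivity, and specificity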


Conclusion

     The semiconductor industry is one of the most capital-intensive industries, with a high level of capital investment in equipment. Optimization of manufacturing equipment has received significant attention and has been shown to be a necessary competitive advantage. There are exciting challenges and opportunities for engineers and researchers to develop a new standard for this vigorously growing industry. A good classification model is beneficial for prediction in the semiconductor fabrication process. Most semiconductor manufacturing is highly complex and constantly produces hundreds of metrology measurements that await analysis by process engineers, who must maintain efficient operations and obtain optimum yield of high-quality products. For such a large volume of measurement data, automatic data analysis techniques such as data mining are essential.