Jeff Hung
Data scientist of the Institute of Manufacturing Information and Systems of National Cheng Kung University
Abstract
Semiconductor manufacturing is one of the most technologically complex manufacturing processes. Traditional methods such as univariate and multivariate analyses have long been used to build predictive models for fault detection. Over the past decade, fabs and academia have undertaken major collaborative research projects in predictive modeling. In this study we apply machine learning techniques to automatically generate an accurate predictive model of equipment faults during wafer fabrication. The aim is a decision model that helps detect equipment faults as quickly as possible in order to maintain high process yields.
Framework
- Data Description
- Data Preprocessing
- Feature Selection
- Model Construction & Validation
- Conclusion
Data Description
Data Sources & Goal
The SECOM (Semiconductor Manufacturing) dataset consists of manufacturing operation data and semiconductor quality data. It contains 1567 observations taken from a wafer fabrication production line. Each observation is a vector of 590 sensor measurements plus a pass/fail test label. Only 104 cases fail; these are labeled positive (encoded as 1), whereas the much larger number of passing cases are labeled negative (encoded as -1). This is roughly a 1:14 ratio of fail to pass. In this work we use a feature selection method to extract the most discriminative sensors, and we apply boosting and synthetic data generation to deal with the heavy imbalance between pass and fail cases.
Source: UCI SECOM Dataset
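The imbalance described above can be checked with a few lines of arithmetic. This is a minimal sketch using only the counts quoted in the text; the variable names are illustrative.

```r
# Counts taken from the dataset description
n_total <- 1567
n_fail  <- 104
n_pass  <- n_total - n_fail               # passing observations

fail_rate     <- n_fail / n_total         # share of failing wafers
pass_per_fail <- n_pass / n_fail          # pass cases per fail case
baseline_acc  <- n_pass / n_total         # accuracy of always predicting "pass"

cat(sprintf("fail rate: %.2f%%  pass:fail = %.1f:1  always-pass accuracy: %.1f%%\n",
            100 * fail_rate, pass_per_fail, 100 * baseline_acc))
```

A classifier that always predicts "pass" would already be right for more than 93% of observations, which is why accuracy alone is a poor yardstick for this dataset.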
Load Packages
Install the packages below on your local machine before loading them with library().
library(data.table) # Read Data
library(plotly) # Data Visualization
library(DMwR) # Data Imputation by Knn
library(mice) # Data Imputation by Regression
library(missForest) # Data Imputation by Random Forest
library(ROSE) # Synthetic Data Generation
library(glmnet) # Lasso Regression
library(plotmo) # Lasso Regression Visualization
library(xgboost) # Gradient Boosting Machine
library(caret) # Cross Validation
Read Data
After reading the feature and label files, we bound them together and renamed the columns. We also converted the response to a factor and parsed the timestamp for the later analysis.
feature <- fread("/Users/hungyushin/Desktop/R/Rpubs/SECOM/Semiconductor/secom.data.txt", data.table = F)
label <- fread("/Users/hungyushin/Desktop/R/Rpubs/SECOM/Semiconductor/secom_labels.data.txt", data.table = F)
data <- cbind(label,feature)
colnames(data) <- c("Class", "Time", paste0(rep("Feature", ncol(feature)), seq(1,ncol(feature))))
data$Class <- factor(data$Class, labels = c("pass", "fail"))
data$Time <- as.POSIXct(data$Time, format = "%d/%m/%Y %H:%M:%S", tz = "GMT")
Data Summary
The dataset contains 1567 observations taken from a wafer fabrication production line. Each observation is a vector of 590 sensor measurements plus a pass/fail test label. Only 104 cases fail, which is roughly a 1:14 ratio. Looking at the first few features, we can see missing values and constant (equal-value) features such as Feature6; both need to be handled in the preprocessing step that follows.
str(data, list.len=8)
## 'data.frame': 1567 obs. of 592 variables:
## $ Class : Factor w/ 2 levels "pass","fail": 1 1 2 1 1 1 1 1 1 1 ...
## $ Time : POSIXct, format: "2008-07-19 11:55:00" "2008-07-19 12:32:00" ...
## $ Feature1 : num 3031 3096 2933 2989 3032 ...
## $ Feature2 : num 2564 2465 2560 2480 2503 ...
## $ Feature3 : num 2188 2230 2186 2199 2233 ...
## $ Feature4 : num 1411 1464 1698 910 1327 ...
## $ Feature5 : num 1.36 0.829 1.51 1.32 1.533 ...
## $ Feature6 : num 100 100 100 100 100 100 100 100 100 100 ...
## [list output truncated]
summary(data[,1:8])
## Class Time Feature1 Feature2
## pass:1463 Min. :2008-07-19 11:55:00 Min. :2743 Min. :2159
## fail: 104 1st Qu.:2008-08-22 00:55:30 1st Qu.:2966 1st Qu.:2452
## Median :2008-09-11 08:06:00 Median :3011 Median :2499
## Mean :2008-09-09 18:37:39 Mean :3014 Mean :2496
## 3rd Qu.:2008-09-29 11:33:00 3rd Qu.:3057 3rd Qu.:2539
## Max. :2008-10-17 06:07:00 Max. :3356 Max. :2846
## NA's :6 NA's :7
## Feature3 Feature4 Feature5 Feature6
## Min. :2061 Min. : 0 Min. : 0.6815 Min. :100
## 1st Qu.:2181 1st Qu.:1082 1st Qu.: 1.0177 1st Qu.:100
## Median :2201 Median :1285 Median : 1.3168 Median :100
## Mean :2201 Mean :1396 Mean : 4.1970 Mean :100
## 3rd Qu.:2218 3rd Qu.:1591 3rd Qu.: 1.5257 3rd Qu.:100
## Max. :2315 Max. :3715 Max. :1114.5366 Max. :100
## NA's :14 NA's :14 NA's :14 NA's :14
Data Preprocessing
After inspecting all variables, two issues need to be corrected: redundant variables and missing values.
Redundant Variables
We drop the equal-value (constant) features and the variable “Time”, which is not of interest in this study.
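As a minimal sketch of the equal-value check, the toy data frame below (hypothetical values, not SECOM data) flags any column whose non-missing entries all share one value. It works column by column with `vapply`, which keeps numeric columns numeric; that is a design choice of this sketch and not necessarily how the study's own code behaves.

```r
# Toy data frame (hypothetical values): Feature2 is constant apart from an NA
toy <- data.frame(
  Feature1 = c(1.2, 3.4, NA, 2.2),
  Feature2 = c(100, 100, NA, 100),
  Feature3 = c(0.5, 0.5, 0.7, NA)
)

# A column counts as "equal value" when its non-missing entries
# take at most one distinct value
is_constant <- vapply(toy, function(x) length(unique(x[!is.na(x)])) <= 1, logical(1))

constant_cols <- names(toy)[is_constant]          # columns to drop
toy_reduced   <- toy[, !is_constant, drop = FALSE]
print(constant_cols)
```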
# Time #
index_vr1 <- which(colnames(data) == "Time")
# Equal Value #
equal_v <- apply(data, 2, function(x) max(na.omit(x)) == min(na.omit(x)))
index_vr2 <- which(equal_v == T)
Missing Value Imputation
To deal with missing values, we check the share of missing values for each observation and for each variable. The figures below show that no observation has more than 25.7% missing values, but some variables do have a high percentage of missing values. Thus, we drop the variables with more than 40% missing values and impute the remaining ones.
There are several methods for missing value imputation, such as k-nearest neighbors, regression, and random forest. K-nearest neighbors (KNN) imputation is a natural improvement on mean imputation that exploits the structure of the observed data. Multivariate Imputation by Chained Equations (MICE) is based on a much more complex algorithm whose behavior appears to depend on the size of the dataset, and it becomes time-intensive on large datasets. More detail can be found in the review article below. In this study, we use KNN to impute the missing values, as it is relatively efficient on high-dimensional data.
Review Article: A Comparison of Six Methods for Missing Data Imputation
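The two steps described above (drop sparse variables, then fill the rest from nearby observations) can be sketched in base R on toy data. The helper `knn_impute` is hypothetical and deliberately simplified: it uses unscaled Euclidean distance and an unweighted neighbor mean, so it illustrates the idea and is not a stand-in for a packaged implementation.

```r
set.seed(1)
# Toy numeric data (hypothetical): 8 rows, 3 sensors, two missing cells
toy <- data.frame(a = rnorm(8), b = rnorm(8), c = rnorm(8))
toy$a[2] <- NA
toy$c[5] <- NA

# Share of missing values per column; drop columns above the 40% threshold
col_na <- colMeans(is.na(toy))
keep   <- col_na <= 0.4
toy    <- toy[, keep, drop = FALSE]

# Simplified kNN imputation: fill each missing cell with the mean of that
# column over the k complete rows closest on the columns the row does have
knn_impute <- function(df, k = 3) {
  m <- as.matrix(df)
  complete <- m[stats::complete.cases(m), , drop = FALSE]
  for (i in which(!stats::complete.cases(m))) {
    obs <- !is.na(m[i, ])
    d   <- sqrt(rowSums(sweep(complete[, obs, drop = FALSE], 2, m[i, obs])^2))
    nn  <- order(d)[seq_len(min(k, nrow(complete)))]
    m[i, !obs] <- colMeans(complete[nn, !obs, drop = FALSE])
  }
  as.data.frame(m)
}

toy_imputed <- knn_impute(toy, k = 3)
```

In practice the features would usually be scaled before computing distances, since sensors measured on large scales would otherwise dominate the neighbor search.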
# Missing Value #
row_NA <- apply(data, 1, function(x) sum(is.na(x))/ncol(data))
col_NA <- apply(data, 2, function(x) sum(is.na(x))/nrow(data))
plot_ly(x = seq(1,nrow(data)), y = row_NA, type = "scatter", mode = "markers") %>%
  layout(title = "Observation Missing Values Percentage",
         xaxis = list(title = "Observation Index"),
         yaxis = list(title = "Percentage(%)"))
plot_ly(x = seq(1,ncol(data)), y = col_NA, type = "scatter", mode = "markers") %>%
  layout(title = "Variable Missing Values Percentage",
         xaxis = list(title = "Variable Index"),
         yaxis = list(title = "Percentage(%)"))
index_mr <- which(col_NA > 0.4)
data_c <- data[,-unique(c(index_vr1, index_vr2, index_mr))]
data_I <- knnImputation(data_c)
#data_IIpmm <- mice(data_c, m=1, maxit = 1, method = 'pmm', seed = 500)
#data_II <- complete(data_IIpmm,1)
#data_miss <- missForest(data_c)
#data_III <- data_miss$ximp
Training & Testing
We split the dataset into a training set and a testing set (a random 10% of the observations) for model construction and validation.
set.seed(2)
index <- sample(1:nrow(data_I), nrow(data_I)/10)
train <- data_I[-index,]
test <- data_I[index,]
Synthetic Data Generation
To deal with the imbalanced data, we applied a sampling method called ROSE (Random Over-Sampling Examples) through the ROSE() function. It generates artificial data with a smoothed bootstrap, drawing new examples from the neighborhood of existing observations in each class. More detail can be found in the following guide.
Guide: Practical Guide to deal with Imbalanced Classification Problems in R
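The code in this section calls `ROSE()`, which is based on a smoothed bootstrap. The sketch below illustrates that idea in base R on toy data: pick a class with equal probability, resample one of its rows, and jitter it with Gaussian noise. The function `smoothed_bootstrap` and the fixed bandwidth `h` are assumptions made for illustration; the package chooses its own smoothing parameters.

```r
set.seed(1)
# Toy imbalanced data (hypothetical): 40 "pass" rows, 5 "fail" rows, 2 sensors
toy <- data.frame(
  Class = factor(rep(c("pass", "fail"), c(40, 5)), levels = c("pass", "fail")),
  x1 = c(rnorm(40, 0), rnorm(5, 2)),
  x2 = c(rnorm(40, 0), rnorm(5, 2))
)

# Smoothed bootstrap: choose a class with equal probability, resample one of
# its rows, then add Gaussian noise scaled by the bandwidth h (assumed value)
smoothed_bootstrap <- function(df, n_new, h = 0.5) {
  num_cols <- setdiff(names(df), "Class")
  out <- vector("list", n_new)
  for (i in seq_len(n_new)) {
    cls    <- sample(levels(df$Class), 1)
    pool   <- df[df$Class == cls, num_cols, drop = FALSE]
    centre <- unlist(pool[sample(nrow(pool), 1), ])
    sds    <- vapply(pool, stats::sd, numeric(1))
    out[[i]] <- data.frame(Class = cls,
                           t(centre + rnorm(length(centre), 0, h * sds)))
  }
  res <- do.call(rbind, out)
  res$Class <- factor(res$Class, levels = levels(df$Class))
  res
}

toy_balanced <- smoothed_bootstrap(toy, n_new = nrow(toy))
table(toy_balanced$Class)
```

The output has the same number of rows as the input but roughly equal class counts, which matches the pattern in the `table(train_rose$Class)` output of this section.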
train_rose <- ROSE(Class ~ ., data = train, seed = 1)$data
table(train_rose$Class)
##
## pass fail
## 741 670
Feature Selection
Lasso Regression
With 590 sensor measurements, this is a high-dimensional dataset. In this part, we use Lasso regression, a powerful dimension reduction technique, to select the important features. The first figure below shows that the higher the penalty (lambda), the more the coefficient of each feature shrinks toward zero. To choose a proper penalty, we applied cross-validation to find the lambda that gives the lowest misclassification error (ME). The second figure shows that the model gives the smallest ME when about 140 features remain with nonzero coefficients. Thus, we selected these 140 features to construct the classification model.
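To make the shrinkage effect concrete, the sketch below runs a small coordinate-descent Lasso in base R and counts the nonzero coefficients at several penalties. It is an illustration under simplifying assumptions: simulated data, a Gaussian (squared-error) loss instead of the binomial loss used in this study, and a hypothetical helper `lasso_cd`.

```r
set.seed(1)
# Simulated data (hypothetical): 100 rows, 6 features, only the first two matter
n <- 100; p <- 6
X <- scale(matrix(rnorm(n * p), n, p)) * sqrt(n / (n - 1))  # mean 0, mean square 1
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)
y <- y - mean(y)

# Soft-thresholding: shrink toward zero, and set to exactly zero when |z| <= lambda
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Coordinate descent for (1 / (2n)) * ||y - X b||^2 + lambda * ||b||_1
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  b <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      r_j  <- y - X[, -j, drop = FALSE] %*% b[-j]   # residual without feature j
      b[j] <- soft_threshold(mean(X[, j] * r_j), lambda)
    }
  }
  b
}

for (lambda in c(0.01, 0.3, 1.0, 3.0)) {
  b <- lasso_cd(X, y, lambda)
  cat(sprintf("lambda = %.2f  nonzero coefficients = %d\n", lambda, sum(b != 0)))
}
```

Because the soft-threshold step sets small coefficients to exactly zero, the Lasso can serve as a feature selector: the features that survive at the cross-validated lambda are the ones kept.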
fit_LS <- glmnet(as.matrix(train_rose[,-1]), train_rose[,1], family="binomial", alpha=1)
plot_glmnet(fit_LS, "lambda", label=5)
fit_LS_cv <- cv.glmnet(as.matrix(train_rose[,-1]), as.matrix(as.numeric(train_rose[,1])-1), type.measure="class", family="binomial", alpha=1)
plot(fit_LS_cv)
coef <- coef(fit_LS_cv, s = "lambda.min")
coef_df <- as.data.frame(as.matrix(coef))
index_LS <- rownames(coef_df)[which(coef_df[,1] != 0)][-1]
Model Construction
Logistic Regression
After feature selection, we fitted a binary logistic regression model on the training set. The following table shows the twenty most significant coefficients, ordered by p-value, from which we can see how these sensor readings relate to failure. As the baseline of this model is “pass”, when Feature60, for example, increases by one unit, the odds of failure are multiplied by approximately exp(0.0379) ≈ 1.039, holding the other features fixed. In other words, higher values of Feature60 are associated with a higher chance of failure, and the reverse holds for features with negative coefficients. The confusion matrix and the receiver operating characteristic (ROC) curve are two common ways to evaluate a classifier, so we use both in this study. Since the goal is to catch failures, choosing a threshold that balances sensitivity and specificity is also important. A lower probability threshold for “fail” catches more defective wafers but also flags more good wafers as bad; this trade-off needs to be balanced.
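The interpretation of a logistic regression coefficient can be worked through numerically. The sketch below uses the Feature60 estimate reported in the coefficient table of this section; the variable names are illustrative.

```r
# Coefficient for Feature60 as reported in the coefficient table
beta_feature60 <- 0.0379

odds_ratio <- exp(beta_feature60)        # multiplicative change in odds per unit
pct_change <- 100 * (odds_ratio - 1)     # the same change as a percentage

cat(sprintf("odds ratio: %.4f (about %.1f%% higher odds of failure per unit)\n",
            odds_ratio, pct_change))

# The effect compounds multiplicatively: a 10-unit increase multiplies the odds by
odds_ratio_10 <- exp(10 * beta_feature60)
cat(sprintf("odds ratio for a 10-unit increase: %.4f\n", odds_ratio_10))
```

Note that the coefficient changes the odds, not the probability directly, and that its practical size depends on the units and range of the sensor.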
fit_LR <- glm(Class ~ ., data=train_rose[,c("Class",index_LS)], family = "binomial")
table_LR <- round(summary(fit_LR)$coefficient, 4)
table_LR[order(table_LR[,4])[1:20],]
## Estimate Std. Error z value Pr(>|z|)
## Feature32 -0.8370 0.1760 -4.7558 0.0000
## Feature60 0.0379 0.0085 4.4684 0.0000
## Feature130 0.3404 0.0723 4.7072 0.0000
## Feature501 0.0010 0.0002 4.4735 0.0000
## Feature319 0.3263 0.0823 3.9641 0.0001
## Feature166 -0.5996 0.1662 -3.6083 0.0003
## Feature222 -12.6033 3.4697 -3.6324 0.0003
## Feature481 -0.0080 0.0022 -3.5747 0.0004
## Feature211 5.0271 1.4354 3.5022 0.0005
## Feature12 -25.4590 7.4407 -3.4216 0.0006
## Feature189 0.0135 0.0040 3.4164 0.0006
## Feature442 1.3228 0.3978 3.3253 0.0009
## Feature589 89.8406 27.5770 3.2578 0.0011
## Feature282 -50.3598 15.9272 -3.1619 0.0016
## Feature30 0.6186 0.1967 3.1445 0.0017
## Feature424 0.0068 0.0022 3.0310 0.0024
## Feature78 6.9996 2.3285 3.0060 0.0026
## Feature351 -17.6774 6.0846 -2.9053 0.0037
## Feature103 -3.0308 1.0735 -2.8233 0.0048
## Feature486 -0.0010 0.0004 -2.8123 0.0049
pred_LR <- factor(ifelse(predict(fit_LR, test, type = "response") > 0.5, "fail", "pass"), levels = c("pass", "fail"))
table(test$Class, pred_LR)
## pred_LR
## pass fail
## pass 111 34
## fail 5 6
roc.curve(test$Class, predict(fit_LR, test))
## Area under the curve (AUC): 0.746
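The AUC reported above can be read as the probability that a randomly chosen fail case receives a higher score than a randomly chosen pass case. A minimal base R sketch of that definition, using ranks and toy scores (hypothetical values, not the model output from this study), looks like this; `auc_rank` is an illustrative helper.

```r
# AUC from ranks (equivalent to the Mann-Whitney U statistic divided by
# the number of positive-negative pairs)
auc_rank <- function(scores, is_positive) {
  r     <- rank(scores)                  # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores (hypothetical): two positives among six cases
scores      <- c(0.10, 0.35, 0.40, 0.80, 0.20, 0.70)
is_positive <- c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)

auc_toy <- auc_rank(scores, is_positive)
print(auc_toy)   # 7 of the 8 positive-negative pairs are ordered correctly: 0.875
```

Because AUC depends only on the ordering of the scores, passing log-odds (the default output of `predict()` on a `glm` object) or probabilities gives the same value.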
Gradient Boosting Machine
XGBoost, a particular implementation of gradient boosting, has frequently been used in winning entries to machine learning competitions on Kaggle. In this study, we constructed a gradient boosting machine (GBM) as a second predictive model. Further details of the method are not covered here. In short, its predictive power is approximately the same as that of the logistic regression, even without the help of synthetic data generation and feature selection.
params <- list(
"objective" = "reg:logistic",
"eval_metric" = "logloss",
"eta" = 0.1,
"max_depth" = 3,
"min_child_weight" = 10,
"gamma" = 0.70,
"subsample" = 0.76,
"colsample_bytree" = 0.95,
"alpha" = 2e-05,
"lambda" = 10
)
X <- xgb.DMatrix(as.matrix(train %>% select(-Class)), label = as.numeric(train$Class)-1)
fit_GBM <- xgboost(data = X, params = params, nrounds = 50, verbose = 0)
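# Feature names must exclude the label column so they line up with the training matrix
importance <- xgb.importance(setdiff(colnames(train), "Class"), model = fit_GBM)
xgb.plot.importance(importance[1:20])
Y <- xgb.DMatrix(as.matrix(test %>% select(-Class)))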
pred_GBM <- factor(ifelse(predict(fit_GBM, Y) > 0.07, "fail", "pass"), levels = c("pass", "fail"))
table(test$Class, pred_GBM)
## pred_GBM
## pass fail
## pass 101 44
## fail 3 8
roc.curve(test$Class, predict(fit_GBM, Y))
## Area under the curve (AUC): 0.744
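To compare the two models on the sensitivity/specificity trade-off, the confusion matrices printed above can be turned into rates. This is a small sketch that copies the counts from those outputs; `cm_metrics` is an illustrative helper, and "fail" is treated as the positive class.

```r
# Confusion matrices copied from the outputs above (rows: actual, cols: predicted)
labels <- list(actual = c("pass", "fail"), predicted = c("pass", "fail"))
cm_LR  <- matrix(c(111, 34, 5, 6), nrow = 2, byrow = TRUE, dimnames = labels)
cm_GBM <- matrix(c(101, 44, 3, 8), nrow = 2, byrow = TRUE, dimnames = labels)

# Sensitivity: share of actual fails caught; specificity: share of passes kept
cm_metrics <- function(cm) {
  c(sensitivity = cm["fail", "fail"] / sum(cm["fail", ]),
    specificity = cm["pass", "pass"] / sum(cm["pass", ]),
    accuracy    = sum(diag(cm)) / sum(cm))
}

metrics <- round(rbind(LR = cm_metrics(cm_LR), GBM = cm_metrics(cm_GBM)), 3)
print(metrics)
```

At the thresholds used here, the GBM catches 8 of the 11 failures against 6 of 11 for the logistic regression, at the cost of more false alarms (44 against 34). With only 11 fail cases in the test set, these differences should be read cautiously.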
Cross Validation
Conclusion
The semiconductor industry is one of the most capital-intensive industries, with heavy capital investment in equipment. Optimization of manufacturing equipment has received significant attention and has been shown to be a necessary competitive advantage. There are exciting challenges and opportunities for engineers and researchers to develop new standards for this vigorously growing industry. A good classification model is beneficial for fault prediction in the semiconductor fabrication process. Semiconductor manufacturing is highly complex and constantly produces hundreds of metrology measurements that process engineers must analyze in order to maintain efficient operations and achieve an optimum yield of high-quality products. For such a large volume of measurement data, automatic data analysis techniques such as data mining are essential.