Introduction

In electronics, a wafer is a thin slice of semiconductor used for the fabrication of integrated circuits and, in photovoltaics, to manufacture solar cells. The wafer serves as the substrate for microelectronic devices built in and upon it.

The following is the general flow of the semiconductor wafer fabrication process1:

  1. A silicon wafer is first prepared from an ingot by cutting and polishing. The wafer then has layers of material applied: a silicon oxide layer, a silicon nitride layer, and a layer of photoresist.

  2. Light is then projected through a reticle and a lens onto the wafer surface. The pattern is projected onto the wafer numerous times, once for each chip.

  3. The photoresist that was exposed to the light can now be chemically removed.

  4. The areas where the photoresist has been removed can now be etched, which in the case above is done with gases.

  5. An ionic gaseous stream showers the chip and “dopes” the regions that were exposed by etching. New photoresist can then be applied to the wafer, and steps 2-4 are repeated.

  6. In a similar repeated cycle, metal links can be laid down between transistors.

Every step of the process requires elastomer seals to isolate the process from the outside atmosphere.

Given the scale of semiconductor devices and the complexity and delicacy of the fabrication process, defects during production are practically unavoidable. In the wafer industry, defects are the number one enemy. Any defect can greatly affect the quality of a silicon wafer; even scratches that are invisible to the eye matter. Many manufacturers use sensors to monitor conditions during fabrication so that any kind of abnormality is detected early. Building an automated inspection system is therefore very beneficial for manufacturers, as shown by the amount of research into defect detection and pattern recognition in this field. In this article, I will illustrate how machine learning can be applied to detect abnormalities in the fabrication process.

Library and Setup

If you are interested in reproducing the code in this article, here are the required libraries and setup.
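
Here is a minimal sketch of the setup, inferred from the models and metrics used later in this article rather than taken from an original chunk:

# assumed package list: keras for the neural networks, isotree for the
# Isolation Forest, caret and pROC for the evaluation metrics
library(dplyr)    # data wrangling
library(ggplot2)  # visualization
library(keras)    # MLP and LSTM models; run keras::install_keras() once
library(isotree)  # Isolation Forest
library(caret)    # confusion matrix, precision/recall
library(pROC)     # ROC curve

set.seed(100)     # reproducibility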

Data

The data comes from R. Olszewski’s study on the semiconductor fabrication process. The description of the data below is cited directly from the website.

“A collection of inline process control measurements recorded from various sensors during the processing of silicon wafers for semiconductor fabrication constitute the wafer database; each data set in the wafer database contains the measurements recorded by one sensor during the processing of one wafer by one tool. The two classes are normal and abnormal. There is a large class imbalance between normal and abnormal (10.7% of the train are abnormal, 12.1% of the test)”
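
A minimal loading sketch; the file names below are hypothetical, so adjust them to wherever you stored the downloaded data:

# hypothetical file names; each row is assumed to be one wafer run,
# with the target stored in a column named `class` (0 = abnormal, 1 = normal)
train <- read.csv("data/wafer_train.csv")
test  <- read.csv("data/wafer_test.csv")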

Exploratory Data Analysis

We will do exploratory data analysis to get a better understanding of the data, which will help us prepare it properly before feeding it into the machine learning model.

Correlation Between Variables

Let’s see the correlation between the features/variables. Since we have a lot of features (around 152 sensor measurements), if many of them are correlated we can compress them using Principal Component Analysis (PCA) to reduce the number of dimensions. A lower number of dimensions means we can train the model faster.
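
As a sketch (assuming the target column is named class), we can compute the correlation matrix of the sensor features and eyeball it as a heatmap:

# correlation matrix of the sensor features (target column excluded)
features <- setdiff(names(train), "class")
cor_mat  <- cor(train[, features])

# quick visual check; large blocks of strong correlation justify PCA
heatmap(cor_mat, Rowv = NA, Colv = NA, scale = "none")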

Class Imbalance

Let’s check the class imbalance to verify the description of the data. First, we check the proportion of each target class in the training set.
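
A one-liner does the job, again assuming the target column is named class (0 = abnormal, 1 = normal):

# class proportions in the training set; the same call on the test set
# produces the proportions shown further below
prop.table(table(train$class))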

## 
##         0         1 
## 0.1078845 0.8921155

Here is the proportion of each class in the test dataset.

## 
##     0     1 
## 0.097 0.903

The target class in both the training and testing datasets is highly imbalanced, with very few abnormal observations (class 0) present in the data. This kind of imbalance makes the abnormal cases harder for a model to detect. We can either build a classification model or an outlier detection model. On this occasion, I will illustrate both approaches: I will build a Neural Network model to classify the data based on the target, and an Isolation Forest model to detect anomalies in the data.

Data Preprocessing

Before we feed the data into the model, we will preprocess it. The first step is to scale all the features so that they have the same variance of 1. This gives all features the same weight, so no single feature distorts the model simply because it has a larger range. The second step is Principal Component Analysis (PCA). The exploratory data analysis showed that some features are highly correlated with each other, so instead of using the original 152 features, we will reduce the dimensionality while still retaining 99% of the total variation in the data.
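
Here is a sketch of both steps with base R’s prcomp, which can center and scale the features before rotating them; the column names are assumptions:

features <- setdiff(names(train), "class")

# scale to unit variance and fit PCA on the training features only
pca <- prcomp(train[, features], center = TRUE, scale. = TRUE)

# keep the smallest number of components that retains 99% of the variance
var_cum <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pc    <- which(var_cum >= 0.99)[1]   # 50 components in this article

train_pca <- data.frame(pca$x[, 1:n_pc], class = train$class)

# project the test set with the training set's centering, scaling and rotation
test_pca <- data.frame(predict(pca, test[, features])[, 1:n_pc],
                       class = test$class)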

After these preprocessing steps, we have only 50 features and 1 target column, instead of the original 152 features.

Cross-Validation

We will split the testing dataset into a validation set and a testing set (30:70), with 30% of the test data used as the validation set.
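
A sketch of the split, using a fixed seed so it is reproducible:

set.seed(100)
val_idx <- sample(nrow(test_pca), size = round(0.3 * nrow(test_pca)))

val_pca  <- test_pca[val_idx, ]    # 30%: validation set
test_pca <- test_pca[-val_idx, ]   # 70%: final test set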

So, we will have 3 kinds of datasets. The training set will be used to train the model. The validation set will be used to evaluate the model while we tune its hyperparameters. And lastly, the testing set will be used for the final evaluation of model performance.

Build Neural Network Architecture

General Multilayer Perceptron (MLP)

A Neural Network is inspired by the biological neural network of our brain. It consists of an input layer, hidden layers, and an output layer. The data is fed into the input layer, processed through the hidden layers, and converted into specific values, such as probabilities, in the output layer. An MLP is trained with back-propagation: the network goes back and forth adjusting the weight of each connection between neurons in order to minimize the loss function and improve performance.

We need to convert the data to suit the keras infrastructure.
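
keras expects plain numeric matrices for the features and, since the output layer below has two units, a one-hot encoded target; a sketch:

# features as matrices, targets one-hot encoded (2 columns: class 0 and 1);
# assumes the class is coded as numeric 0/1
train_x <- as.matrix(train_pca[, -ncol(train_pca)])
train_y <- to_categorical(train_pca$class, num_classes = 2)

val_x <- as.matrix(val_pca[, -ncol(val_pca)])
val_y <- to_categorical(val_pca$class, num_classes = 2)

test_x <- as.matrix(test_pca[, -ncol(test_pca)])
test_y <- to_categorical(test_pca$class, num_classes = 2)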

Neural Network Architecture

After the data is ready, we can build the neural network architecture. Here, I build an MLP model with 3 dense layers. The first layer has 32 hidden neurons, with an input shape equal to the number of features. The second layer has 16 hidden neurons, and the last layer has 2 neurons, the same as the number of target classes (normal and abnormal). All layers use the sigmoid activation function.
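
A sketch that reproduces the summary below (50 input features, as produced by the PCA step):

model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "sigmoid", input_shape = 50) %>%
  layer_dense(units = 16, activation = "sigmoid") %>%
  layer_dense(units = 2,  activation = "sigmoid")

summary(model)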

## Model
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 32)                      1632        
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 16)                      528         
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 2)                       34          
## ================================================================================
## Total params: 2,194
## Trainable params: 2,194
## Non-trainable params: 0
## ________________________________________________________________________________

Model Fitting

Now we train the NN model. The model will iterate over 40 epochs so it has time to learn and adjust the weight of each hidden neuron in order to minimize the loss function, which is binary cross-entropy. When the loss starts to stagnate, the model has converged.
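
A sketch of the training call; the loss and the 40 epochs come from the description above, while the optimizer and batch size are assumptions:

model %>% compile(
  loss      = "binary_crossentropy",
  optimizer = optimizer_adam(),   # assumed; the original optimizer is not stated
  metrics   = "accuracy"
)

history <- model %>% fit(
  train_x, train_y,
  epochs          = 40,
  batch_size      = 128,          # assumed
  validation_data = list(val_x, val_y)
)

plot(history)   # loss/accuracy curves per epoch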

After around 10 epochs, the model starts to converge and doesn’t improve any further. The model seems to perform impressively, with 90% accuracy on both the training and validation datasets. We shall analyze the performance further using the testing dataset.

Model Evaluation

Sometimes accuracy is not a good representation of model performance, especially when the classes are as imbalanced as ours: a model that always predicts “normal” would already achieve about 89% accuracy here. We also need to consider other metrics, such as precision and recall. We can also look at the F1 score, which seeks a balance between precision and recall.
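
A sketch of the evaluation on the test set, treating the abnormal state (class 0) as the positive class; caret’s confusionMatrix reports precision, recall, and F1 in one call:

# predicted class = column with the highest probability (column 1 is class 0)
prob <- predict(model, test_x)
pred <- max.col(prob) - 1

confusionMatrix(factor(pred,           levels = c(0, 1)),
                factor(test_pca$class, levels = c(0, 1)),
                positive = "0", mode = "prec_recall")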

Based on the performance metrics, the accuracy is not good enough. Moreover, the model has very low recall and precision, which implies that it has difficulty detecting the positive class (here, the abnormal state). Based on this result, a simple neural network model is not enough to handle our problem; we need to build a better model. Next, I will build an NN model with LSTM (Long Short-Term Memory), which has proven effective for this kind of case.

LSTM (Long Short-Term Memory)

Since our data consists of sensor readings from a sequenced process, we can treat it as time series data, with each observation representing a single timestep; the observations are therefore ordered. Just as in text mining, where previous words can heavily alter the meaning of a sentence, information from previous observations can give context to the current observation. This problem belongs to a group called time series classification (TSC). TSC problems are differentiated from traditional classification problems because the attributes/observations are ordered2. Whether the ordering is by time or not is in fact irrelevant; the important characteristic is that there may be discriminatory features dependent on the ordering.

LSTM is a deep learning layer that is often employed for sequential data, such as time series forecasting and NLP (natural language processing), where the model is expected to read text as a sequence rather than as a collection of single words. A benefit of LSTM is that it can handle the phenomenon called the vanishing gradient, where the weight updates shrink as they travel back through the layers. LSTMs are explicitly designed to avoid the long-term dependency problem.

Here is an illustration from developpaper.

The main part of the LSTM is the cell state, which runs straight through horizontally. Along the way, additional information is controlled and fed into the cell state by three kinds of gates: the forget gate, the input gate, and the output gate. The forget gate decides which previous information to discard whenever new information arrives; this decision is handled by the sigmoid (S) function. Next, we decide what new information to store in the cell state. The input gate consists of two parts: a sigmoid layer that decides which part of the cell state will be updated, and a tanh (T) layer that decides what the new candidate values will be. The product of these two parts updates the part of the cell state that the forget gate has cleared. Finally, the output gate controls what is fed into the current hidden state/units: a sigmoid layer decides which part of the cell state to output, and this is multiplied by the cell state after it has passed through the tanh layer.

Further explanation and use cases are available on Algotech3 and Colah4.

We will keep using PCA to cut the computational time, and also to see whether PCA is still relevant for this kind of problem. Before we proceed, we will transform the data to suit the LSTM neural network model. The input will look like the figure below. Thus, we will create a 3-dimensional array as the LSTM input, with the timesteps on the x-axis, the features on the y-axis, and a depth of 1 on the z-axis (containing just the value of each attribute/feature).

Let’s check the dimensions of our 3-dimensional array:

## [1]   50    2 6164

The array still contains both the feature values and the target values. Now we will separate the input and the output.
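
A sketch of this step, assuming train_pca holds 6,164 rows of 50 principal components plus the target; keras wants the LSTM input as a (samples, timesteps, features) array:

n <- nrow(train_pca)   # 6164 wafers

# input: each of the 50 components becomes one timestep with a single feature
train_x_lstm <- array(as.matrix(train_pca[, 1:50]), dim = c(n, 50, 1))

# output: one label per wafer
train_y_lstm <- array(train_pca$class, dim = c(n, 1))

# the validation and test sets (val_x_lstm, test_x_lstm, ...) are reshaped
# the same way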

Let’s check the dimensions of the input:

## [1] 6164   50    1

Let’s check the dimensions of the output:

## [1] 6164    1

Neural Network Architecture

Now we build the architecture. The model will have 3 layers: LSTM, flatten, and dense. The LSTM layer reads the sequence of the input. The flatten layer converts the 3-dimensional output of the LSTM into a 2-dimensional array that is fed into the dense layer. The dense layer with a sigmoid activation function predicts the probability that an observation belongs to a certain class.
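
A sketch matching the summary below; return_sequences = TRUE keeps the output of every timestep, so the flatten layer receives a 50 × 100 tensor:

model_lstm <- keras_model_sequential() %>%
  layer_lstm(units = 100, input_shape = c(50, 1), return_sequences = TRUE) %>%
  layer_flatten() %>%
  layer_dense(units = 1, activation = "sigmoid")

summary(model_lstm)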

## Model
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## lstm (LSTM)                         (None, 50, 100)                 40800       
## ________________________________________________________________________________
## flatten (Flatten)                   (None, 5000)                    0           
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 1)                       5001        
## ================================================================================
## Total params: 45,801
## Trainable params: 45,801
## Non-trainable params: 0
## ________________________________________________________________________________

Model Fitting

Now we fit or train the model and see the result.
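
A sketch of the training call; the epochs and batch size are taken from the log below, while the optimizer is an assumption:

model_lstm %>% compile(
  loss      = "binary_crossentropy",
  optimizer = optimizer_adam(),   # assumed
  metrics   = "accuracy"
)

history_lstm <- model_lstm %>% fit(
  train_x_lstm, train_y_lstm,
  epochs          = 15,
  batch_size      = 512,
  validation_data = list(val_x_lstm, val_y_lstm)
)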

## Trained on 6,164 samples (batch_size=512, epochs=15)
## Final epoch (plot to see history):
##         loss: 0.0214
##     accuracy: 0.9942
##     val_loss: 0.01469
## val_accuracy: 0.9933

As we can see, the LSTM model has really good accuracy on both the training and validation sets. The model converges before the 10th epoch.

Model Evaluation

Now we measure the model performance to see if the model actually works.

Based on the result, all metrics look promising, with more than 90% recall and precision. This means that almost all of the positive class (the abnormal state) is correctly predicted. Using PCA for this kind of problem is still effective.

Another metric we can use is the ROC curve. The ROC curve is a performance measurement for classification problems at various threshold settings. It is a probability curve that tells how well the model can distinguish between classes. A good model should have a curve above the dashed diagonal line, which represents the result of classifying by random guess. The closer the curve gets to the top-left corner, the better the model.
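
A sketch with the pROC package (an assumption; any ROC utility works):

prob_lstm <- predict(model_lstm, test_x_lstm)[, 1]   # P(class = 1) per wafer

roc_lstm <- roc(response = test_y_lstm[, 1], predictor = prob_lstm)
plot(roc_lstm, legacy.axes = TRUE)   # dashed diagonal = random guessing
auc(roc_lstm)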

We’ve successfully built a classifier model to detect the abnormal state in the semiconductor fabrication process. Now we will proceed to apply anomaly detection to the same problem.

Anomaly Detection with Isolation Forest

Anomaly detection is often used to handle problems with imbalanced classes, and it generally works better than classification models such as K-NN or Naive Bayes in that setting. First, we look at the positions of the abnormal observations in the space of the first two principal components, PC1 and PC2. We will merge the training and testing sets and see whether the abnormal observations are truly distinguishable from the normal ones.

There are some abnormal observations in between the two major clusters, and these can be easily detected. However, most of the abnormal observations are gathered close to the normal ones, which may make them harder to detect. We will verify this in the following section.

In order to do anomaly detection, we will use the Isolation Forest model, which is conceptually similar to the Random Forest model used in classification problems. Isolation Forest adopts the tree-based approach to detect distinct or distant observations. However, the output is quite different: Isolation Forest outputs an anomaly score, while Random Forest outputs the probability of belonging to a specific class.

“The term isolation means ‘separating an instance from the rest of the instances’. Since anomalies are ‘few and different’ and therefore they are more susceptible to isolation. In a data-induced random tree, partitioning of instances are repeated recursively until all instances are isolated. This random partitioning produces noticeable shorter paths for anomalies since (a) the fewer instances of anomalies result in a smaller number of partitions – shorter paths in a tree structure, and (b) instances with distinguishable attribute-values are more likely to be separated in early partitioning. Hence, when a forest of random trees collectively produce shorter path lengths for some particular points, then they are highly likely to be anomalies.”
Liu et al.

The complete, step-by-step explanation of Isolation Forest can be found in the study by Liu et al.5

Model Fitting

Since the data has already been preprocessed, we can now feed it directly into the model. The Isolation Forest implementation comes from the isotree package.
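
A sketch of the fit; ndim = 3 gives the Extended Isolation Forest splitting on three variables at a time, and ntrees = 500 matches the print below:

# fitting on the whole preprocessed frame; the print below ("Categorical
# columns: 1") suggests the class factor was included alongside the 50
# numeric components
iso <- isolation.forest(train_pca, ndim = 3, ntrees = 500)
iso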

## Extended Isolation Forest model
## Splitting by 3 variables at a time
## Consisting of 500 trees
## Numeric columns: 50
## Categorical columns: 1

Model Evaluation

As I mentioned earlier, the output of the Isolation Forest is an anomaly score. For any given observation x:

  • If the anomaly score is close to 1, then x is very likely to be an anomaly
  • If the anomaly score is smaller than 0.5, then x is likely to be a normal value
  • If all instances in a sample are assigned an anomaly score of around 0.5, then it is safe to assume that the sample doesn’t contain any anomaly

I will classify the predictions of the Isolation Forest with a threshold of 0.5: if the anomaly score is more than 0.5, then the observation will be tagged as an anomaly.
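
A sketch of the scoring and thresholding; predict() on an isotree model returns the anomaly score by default:

score_train <- predict(iso, train_pca)

# anomaly score > 0.5 -> tagged abnormal (class 0), otherwise normal (class 1)
pred_train <- ifelse(score_train > 0.5, 0, 1)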

Based on the distribution of the anomaly scores, most of the data have an anomaly score of around 0.35. There are some observations with an anomaly score above 0.5.

Next, we will measure the model performance, just as we measured the NN model.

Despite having good accuracy, the Isolation Forest still fails to successfully detect the abnormal state, as shown by the low recall and precision. We can also inspect the results visually.

There are many abnormal observations that are not detected by the Isolation Forest. This is not surprising, since the concept of Isolation Forest is to detect outliers: values that are isolated or distinguishable from the others. However, some abnormal observations have values almost identical to the normal observations, making them less likely to be flagged as outliers. Therefore, only the isolated ones, the observations positioned between the two clusters, are detected by the forest. An interesting finding is that most observations with a negative PC1 value are less likely to be detected. We will come back to this later.

Next, we measure the model performance for the testing dataset.

The result is similar to the training dataset: the model fails to correctly detect the abnormal observations. We will also inspect them visually.

Just like in the training set, abnormal observations that are close to the normal observations are unlikely to be detected by the model. An interesting takeaway is that observations with negative PC1 values are unlikely to be detected, which also happened in the training dataset. We may want to see which features contribute to PC1.

PCA

First, to give context, we will look at the contribution of each principal component toward the total information/variance.
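
The per-component share of variance comes straight from the prcomp object; a sketch:

var_explained <- pca$sdev^2 / sum(pca$sdev^2)

round(head(var_explained, 5), 3)     # share of each of the first 5 PCs
round(cumsum(var_explained)[5], 3)   # cumulative share of the first 5 PCs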

It turns out that with only the first 5 PCs we can keep around 85% of the total information, with the first PC (PC1) contributing more than 65% of the total information.

Now we will observe the loading factor of each feature on PC1 and PC2 and visualize them. The loading factors are extracted from the rotation matrix. A loading factor represents the degree of influence of each feature; a feature with a loading factor close to 0 has almost no contribution to that component.
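
A sketch that pulls the PC1/PC2 loadings out of the rotation matrix and orders them:

loadings <- data.frame(feature = rownames(pca$rotation),
                       PC1     = pca$rotation[, 1],
                       PC2     = pca$rotation[, 2])

# most negative PC1 loadings first
head(loadings[order(loadings$PC1), ])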

Here is the full list of features with negative PC1 loadings. As we saw from the Isolation Forest results, an observation with a negative PC1 value is likely to be tagged as a normal state. I have sorted the features by their PC1 loading, so the first feature listed has the most strongly negative value.

Conclusion

A semiconductor wafer fabrication process is often monitored by sensors in order to detect any abnormality during the process. A neural network with general MLP layers failed to detect the abnormal state, and outlier detection with an Isolation Forest was not good enough either. A neural network with an LSTM layer, however, proved to have satisfying performance in detecting the abnormal state. By employing the right method, machine learning can be applied to help manufacturers automate their inspection systems, or to complement field inspection workers, in order to achieve better productivity, flexibility, and product quality6.

Reference