Faulty Steel Plates Classification

James Wu

Abstract

This article classifies steel plate faults using the Steel Plates Faults Data Set (UCI) and a support vector machine (SVM). It is a practical exercise in which the author works through a complete data analysis process.

Framework

The chart shows the process we follow in this data analysis.

Data Description

Data Sources & Goal

This dataset comes from research by Semeion, Research Center of Sciences of Communication. The original aim of the research was to correctly classify the type of surface defects in stainless steel plates, with six types of possible defects (plus “other”). The Input vector was made up of 27 indicators that approximately [describe] the geometric shape of the defect and its outline. According to the research paper, Semeion was commissioned by the Centro Sviluppo Materiali (Italy) for this task and therefore it is not possible to provide details on the nature of the 27 indicators used as Input vectors or the types of the 6 classes of defects.
Source: UCI Steel Plates Faults Data
Reference: Kaggle

Load Packages

We use the following packages in this analysis.

library(tibble)  #Data frame
library(MASS)    #Modern Applied Statistics with S
library(ggplot2) #Data Visualization
library(lattice) #Data Visualization
library(caret)   #Cross Validation
library(e1071)   #SVM

Read Data

link: feature
link: label
Download the two text files from the links above and save them as “feature.txt” and “label.txt”.
We rename the columns of the feature data using the variable names in label.txt.

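# Read the whitespace-separated feature table and the variable names.
data <- read.table("C:/Users/User/Desktop/R/feature.txt", sep = "")
label <- read.table("C:/Users/User/Desktop/R/label.txt")
# Use the names stored in label.txt as the column names.
colnames(data) <- t(label)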

Data Summary

There are 1941 observations described by 34 variables: 27 feature columns and 7 fault-type indicator columns. Unfortunately, I know of no further information that describes these columns.

str(data, list.len=8)

## 'data.frame':    1941 obs. of  34 variables:
##  $ X_Minimum            : int  42 645 829 853 1289 430 413 190 330 74 ...
##  $ X_Maximum            : int  50 651 835 860 1306 441 446 200 343 90 ...
##  $ Y_Minimum            : int  270900 2538079 1553913 369370 498078 100250 138468 210936 429227 779144 ...
##  $ Y_Maximum            : int  270944 2538108 1553931 369415 498335 100337 138883 210956 429253 779308 ...
##  $ Pixels_Areas         : int  267 108 71 176 2409 630 9052 132 264 1506 ...
##  $ X_Perimeter          : int  17 10 8 13 60 20 230 11 15 46 ...
##  $ Y_Perimeter          : int  44 30 19 45 260 87 432 20 26 167 ...
##  $ Sum_of_Luminosity    : int  24220 11397 7972 18996 246930 62357 1481991 20007 29748 180215 ...
##   [list output truncated]

summary(data[,1:8])

##    X_Minimum        X_Maximum      Y_Minimum          Y_Maximum       
##  Min.   :   0.0   Min.   :   4   Min.   :    6712   Min.   :    6724  
##  1st Qu.:  51.0   1st Qu.: 192   1st Qu.:  471253   1st Qu.:  471281  
##  Median : 435.0   Median : 467   Median : 1204128   Median : 1204136  
##  Mean   : 571.1   Mean   : 618   Mean   : 1650685   Mean   : 1650739  
##  3rd Qu.:1053.0   3rd Qu.:1072   3rd Qu.: 2183073   3rd Qu.: 2183084  
##  Max.   :1705.0   Max.   :1713   Max.   :12987661   Max.   :12987692  
##   Pixels_Areas     X_Perimeter       Y_Perimeter       Sum_of_Luminosity 
##  Min.   :     2   Min.   :    2.0   Min.   :    1.00   Min.   :     250  
##  1st Qu.:    84   1st Qu.:   15.0   1st Qu.:   13.00   1st Qu.:    9522  
##  Median :   174   Median :   26.0   Median :   25.00   Median :   19202  
##  Mean   :  1894   Mean   :  111.9   Mean   :   82.97   Mean   :  206312  
##  3rd Qu.:   822   3rd Qu.:   84.0   3rd Qu.:   83.00   3rd Qu.:   83011  
##  Max.   :152655   Max.   :10449.0   Max.   :18152.00   Max.   :11591414

The last seven columns are dummy variables, i.e. if the plate fault is classified as “Stains” there will be a 1 in that column and 0’s in the other columns.
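
A quick sanity check on this encoding is to confirm that every row has exactly one 1 across the fault columns. The sketch below uses a small made-up data frame; the toy values and the name `fault_cols` are illustrative and not taken from the dataset.

```r
# Toy data: one feature plus three fault indicator columns (made-up values).
toy <- data.frame(
  Pixels_Areas = c(267, 108, 71, 176),
  Pastry       = c(1, 0, 0, 0),
  Stains       = c(0, 1, 0, 0),
  Bumps        = c(0, 0, 1, 1)
)

fault_cols <- c("Pastry", "Stains", "Bumps")

# Each row of a valid one-hot encoding sums to exactly 1.
ones_per_row <- rowSums(toy[, fault_cols])
all_one_hot  <- all(ones_per_row == 1)

# Number of observations per fault type.
class_counts <- colSums(toy[, fault_cols])

print(all_one_hot)
print(class_counts)
```

On the steel data as loaded here, the same check would presumably read `all(rowSums(data[, 28:34]) == 1)`, since the fault indicators are the last seven of the 34 columns.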

Data Preprocessing

The data are complete, that is, there are no missing values, which simplifies classification. The only preprocessing needed is a transformation: we use for loops to combine dummy variables into a new column. First we combine “typeofsteel_A300” and “typeofsteel_A400” into the new column “typeofsteel”, as shown below.

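data <- add_column(data, 0)               # placeholder column, filled below
colnames(data)[35] <- "typeofsteel"
colnames(data)[12] <- "A300"
colnames(data)[13] <- "A400"

# Replace each 1 in the two dummy columns with the column name.
for (i in 12:13) {
  for (j in 1:nrow(data)) {
    if (data[j, i] == 1)
      data[j, i] <- colnames(data)[i]
  }
}

# Copy the name from the dummy columns into the new column.
for (i in 12:13) {
  for (j in 1:nrow(data)) {
    if (data[j, i] != 0)
      data[j, 35] <- data[j, i]
  }
}

data <- data[, -c(12:13)]                 # drop the two dummy columns
data[33] <- lapply(data[33], as.factor)   # typeofsteel is now column 33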

The same method combines the seven plate fault columns into the new column “type”, as shown below.

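# Replace each 1 in the seven fault columns with the column name.
for (i in 26:32) {
  for (j in 1:nrow(data)) {
    if (data[j, i] == 1)
      data[j, i] <- colnames(data)[i]
  }
}

data <- add_column(data, 0)               # placeholder column, filled below
colnames(data)[34] <- "type"

# Copy the fault name into the new column.
for (i in 26:32) {
  for (j in 1:nrow(data)) {
    if (data[j, i] != 0)
      data[j, 34] <- data[j, i]
  }
}

data <- data[, -c(26:32)]                 # drop the seven dummy columns
data[27] <- lapply(data[27], as.factor)   # type is now column 27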
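
The nested loops above visit every cell, which becomes slow on larger tables. A vectorized alternative, sketched here on made-up data, uses `max.col()` to find the column that holds the 1 in each row. It assumes exactly one 1 per row; the toy values and object names are illustrative.

```r
# Toy one-hot columns (made-up values).
toy <- data.frame(
  Pastry = c(1, 0, 0, 0),
  Stains = c(0, 1, 0, 0),
  Bumps  = c(0, 0, 1, 1)
)

# max.col() returns, for each row, the index of the column with the largest
# value; for one-hot rows this is the column that contains the 1.
idx <- max.col(as.matrix(toy), ties.method = "first")

# Map the column index back to the column name and store it as a factor.
type <- factor(colnames(toy)[idx], levels = colnames(toy))
print(type)
```

The result is a single factor column, the same shape of output the loops produce, without coercing the dummy columns to character along the way.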

Support Vector Machine

Training & Testing

Before training the model, we need to split the dataset into training data and testing data for model construction and validation.

smp.size = floor(0.8*nrow(data)) 
set.seed(1029)                     
train.ind = sample(seq(nrow(data)), smp.size)
train = data[train.ind, ] 
test = data[-train.ind, ]
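
A simple random split can leave rare classes under-represented in the testing data. One option, sketched below in base R on the built-in `iris` data (a stand-in, not the steel data), is to sample 80% within each class. The helper name `stratified_split` is hypothetical.

```r
# Return row indices for a training set that keeps the class proportions.
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- lapply(idx_by_class, function(idx) {
    # Index into idx explicitly so a class with a single row is handled correctly.
    idx[sample.int(length(idx), size = floor(p * length(idx)))]
  })
  sort(unlist(train_idx, use.names = FALSE))
}

set.seed(1029)
train_ind  <- stratified_split(iris$Species, p = 0.8)
train_demo <- iris[train_ind, ]
test_demo  <- iris[-train_ind, ]

print(table(train_demo$Species))
print(table(test_demo$Species))
```

With the steel data, the call would presumably be `stratified_split(data$type)`, which keeps small classes such as Dirtiness and Stains represented in both splits.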

After splitting the data, we construct the support vector machine model with the training dataset.

result <- svm(type~. ,train)

Prediction

With the model trained, we obtain predictions for both datasets and summarize the results in a confusion matrix.

train.pred = predict(result, train)
test.pred = predict(result, test)
table(real=train$type, predict=train.pred) #Confusion matrix of train data

##               predict
## real           Bumps Dirtiness K_Scatch Other_Faults Pastry Stains
##   Bumps          228         0        0           78      5      0
##   Dirtiness        4        32        0            4      1      0
##   K_Scatch         5         0      295           13      0      0
##   Other_Faults    74         3        6          428      9      4
##   Pastry          16         0        0           39     69      0
##   Stains           2         0        0            3      0     55
##   Z_Scratch        3         0        0           19      0      0
##               predict
## real           Z_Scratch
##   Bumps                7
##   Dirtiness            0
##   K_Scatch             0
##   Other_Faults        14
##   Pastry               2
##   Stains               0
##   Z_Scratch          134

table(real=test$type, predict=test.pred) #Confusion matrix of test data

##               predict
## real           Bumps Dirtiness K_Scatch Other_Faults Pastry Stains
##   Bumps           57         0        0           27      0      0
##   Dirtiness        0        10        0            3      1      0
##   K_Scatch         0         0       72            6      0      0
##   Other_Faults    18         1        0          107      2      1
##   Pastry           5         0        0           16     10      0
##   Stains           0         0        0            0      0     12
##   Z_Scratch        0         0        1            3      0      0
##               predict
## real           Z_Scratch
##   Bumps                0
##   Dirtiness            0
##   K_Scatch             0
##   Other_Faults         6
##   Pastry               1
##   Stains               0
##   Z_Scratch           30

Finally, we compute the prediction accuracy: about 80% on the training data and about 77% on the testing data.

confus.matrix = table(real=train$type, predict=train.pred)
confus.matrix2 = table(real=test$type, predict=test.pred)
sum(diag(confus.matrix))/sum(confus.matrix)# the training accuracy

## [1] 0.7996134

sum(diag(confus.matrix2))/sum(confus.matrix2)# the testing accuracy

## [1] 0.7660668
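
Overall accuracy hides large differences between classes. The sketch below re-enters the testing confusion matrix printed above and computes per-class recall and precision in base R; the object names are illustrative.

```r
classes <- c("Bumps", "Dirtiness", "K_Scatch", "Other_Faults",
             "Pastry", "Stains", "Z_Scratch")

# Rows are the real classes, columns the predicted classes (testing data).
cm <- matrix(c(
  57,  0,  0,  27,  0,  0,  0,
   0, 10,  0,   3,  1,  0,  0,
   0,  0, 72,   6,  0,  0,  0,
  18,  1,  0, 107,  2,  1,  6,
   5,  0,  0,  16, 10,  0,  1,
   0,  0,  0,   0,  0, 12,  0,
   0,  0,  1,   3,  0,  0, 30
), nrow = 7, byrow = TRUE, dimnames = list(real = classes, predict = classes))

recall    <- diag(cm) / rowSums(cm)   # share of each real class that is found
precision <- diag(cm) / colSums(cm)   # share of each predicted class that is right
accuracy  <- sum(diag(cm)) / sum(cm)

print(round(cbind(recall, precision), 3))
print(accuracy)
```

For example, only 10 of the 32 real Pastry plates are recognized (recall 0.3125), and 16 of them are predicted as Other_Faults.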

Tune Parameters

The cost of SVM

The tuning parameter C, often described as “the price of misclassification”, is the weight that penalizes violations of the soft margin.
If C becomes larger, the tolerance for misclassified points is lower. There are fewer support vectors, and the model overfits easily.
If C becomes smaller, the tolerance for misclassified points is higher. There are more support vectors and the margin is wider.

Reference: What are C and gamma with regards to a support vector machine?
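
For reference, the textbook soft-margin objective makes the role of C explicit. This is the standard linear formulation, not quoted from the source; the slack variables \xi_i measure how far each point violates the margin, and C weights their sum against the margin width.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
```

A large C makes violations expensive, so the optimizer accepts a narrower margin to avoid them; a small C makes them cheap, so the margin widens and more points end up as support vectors.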

num.SV = sapply(X=1:100, 
                FUN=function(C) svm(type~., train, cost=C, epsilon =.1)$tot.nSV)
data.plot = data.frame(index=1:100, num.SV)
ggplot(data.plot) +
  geom_point(aes(x=index, y=num.SV)) +
  ggtitle("Cost of SVM ") +
  labs(x= "C value", y = "support vectors")

The gamma of SVM

With a large gamma, the influence of each data point reaches only points close to it, so individual points weigh heavily on the decision boundary. This easily causes overfitting.

When gamma is very small, the model is too constrained and cannot capture the complexity or “shape” of the data. The figure below compares C with gamma.
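
To make the role of gamma concrete: the radial kernel has the form exp(-gamma * |u - v|^2). The base R sketch below evaluates it for two fixed points at several gamma values; the function name `rbf_kernel` and the sample points are illustrative.

```r
# Radial (RBF) kernel between two numeric vectors.
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

u <- c(0, 0)
v <- c(1, 2)                      # squared distance to u is 5
gammas <- c(0.01, 0.15, 1, 10)

similarity <- sapply(gammas, function(g) rbf_kernel(u, v, g))
print(data.frame(gamma = gammas, similarity = similarity))
```

The similarity between the same two points shrinks as gamma grows, so with a large gamma only very close points influence each other.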

Reference: RBF SVM parameters
How do we find good parameters for the SVM model? We use the tune function below.
Tuning time: 16 mins
Personal computer: Windows 7
CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
RAM: 64 GB

tune.model = tune(svm,
                  type~.,
                  data=data,
                  kernel="radial",
                  ranges=list(cost=c(1:25), gamma=c(0.05,0.1,0.15,0.2,0.25))
)
tune.model$best.model

## 
## Call:
## best.tune(method = svm, train.x = type ~ ., data = data, ranges = list(cost = c(1:25), 
##     gamma = c(0.05, 0.1, 0.15, 0.2, 0.25)), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  2 
##       gamma:  0.15 
## 
## Number of Support Vectors:  1356
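
Note that the call above tunes on the full `data`, so the rows later used for testing have already influenced the chosen parameters. A hedged sketch of tuning on the training rows only follows, shown on the built-in `iris` data with a much smaller grid; the grid values and the 5-fold setting are assumptions made for illustration.

```r
library(e1071)

set.seed(1029)
train_ind  <- sample(seq_len(nrow(iris)), floor(0.8 * nrow(iris)))
train_demo <- iris[train_ind, ]
test_demo  <- iris[-train_ind, ]

# Grid search with 5-fold cross-validation on the training rows only.
tuned <- tune(svm, Species ~ ., data = train_demo,
              kernel = "radial",
              ranges = list(cost = c(1, 2, 4), gamma = c(0.05, 0.15, 0.25)),
              tunecontrol = tune.control(cross = 5))

print(tuned$best.parameters)

# Evaluate the selected model once on the held-out rows.
test_pred <- predict(tuned$best.model, test_demo)
test_acc  <- mean(test_pred == test_demo$Species)
print(test_acc)
```

Keeping the testing rows out of the search means the final accuracy is measured on data the tuning step never saw.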

Finally, we use the tuned parameters to build a new model, which reaches a training accuracy of about 93%.

new.model <- svm(type~. ,train , cost=2, gamma =.15)
train.pred = predict(new.model, train)
test.pred = predict(new.model, test)
confus.matrix = table(real=train$type, predict=train.pred)
sum(diag(confus.matrix))/sum(confus.matrix)

## [1] 0.9278351
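
The value printed above is computed on the training data. A small helper, sketched here with made-up label vectors, makes it easy to report the same measure for both splits; `accuracy` is a hypothetical name.

```r
accuracy <- function(real, predicted) {
  # Fraction of observations whose predicted label equals the real label.
  mean(as.character(real) == as.character(predicted))
}

# Made-up labels for illustration.
real      <- factor(c("Bumps", "Stains", "Pastry", "Bumps", "Stains"))
predicted <- factor(c("Bumps", "Stains", "Bumps",  "Bumps", "Pastry"))

acc <- accuracy(real, predicted)
print(acc)

# Same value as the confusion-matrix formula when both factors share levels.
cm <- table(real = real, predict = predicted)
print(sum(diag(cm)) / sum(cm))
```

With the objects from this article, `accuracy(test$type, test.pred)` would presumably give the testing accuracy of the tuned model, which is the figure to compare against the earlier 77%.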

Result & K-fold cross validation

The goal of cross validation is to define a dataset to “test” the model in the training phase, in order to limit problems like overfitting.

Reference: Cross-validation
We use 10-fold cross-validation to check the model. In the resulting figure, no testing accuracy is higher than the corresponding training accuracy.

set.seed(12)

tmp = createFolds(c(1:nrow(data)), k = 10, list = TRUE)

# Numeric vectors that collect the accuracy of each fold.
final.pred.train <- numeric()
final.pred.test <- numeric()

for (i in 1:10) {
  train <- data[-tmp[[i]], ]   # nine folds for training
  test <- data[tmp[[i]], ]     # one fold for testing

  result <- svm(type~., train)

  train.pred = predict(result, train)
  test.pred = predict(result, test)

  confus.matrix = table(real=train$type, predict=train.pred)
  confus.matrix2 = table(real=test$type, predict=test.pred)

  pred.train <- sum(diag(confus.matrix))/sum(confus.matrix)
  pred.test <- sum(diag(confus.matrix2))/sum(confus.matrix2)

  final.pred.train <- c(final.pred.train, pred.train)
  final.pred.test <- c(final.pred.test, pred.test)
}

# Training accuracy in blue, testing accuracy in brown.
data.plot3 = data.frame(index=1:10, final.pred.train, final.pred.test)
ggplot(data.plot3) +
  geom_point(aes(x=index, y=final.pred.train), color="blue") +
  geom_point(aes(x=index, y=final.pred.test), color="brown") +
  ggtitle("10-fold cross validations") +
  labs(x= "index", y = "Accuracy")
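
The per-fold accuracies are usually summarized by their mean and standard deviation. A compact sketch of the same procedure on the built-in `iris` data follows (a stand-in for the steel data; the fold assignment here uses base R rather than `createFolds`, and 5 folds are an arbitrary choice).

```r
library(e1071)

set.seed(12)
k <- 5
# Assign every row to one of k folds at random.
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

acc_train <- numeric(k)
acc_test  <- numeric(k)

for (i in seq_len(k)) {
  train_demo <- iris[fold_id != i, ]
  test_demo  <- iris[fold_id == i, ]

  model <- svm(Species ~ ., data = train_demo)

  acc_train[i] <- mean(predict(model, train_demo) == train_demo$Species)
  acc_test[i]  <- mean(predict(model, test_demo)  == test_demo$Species)
}

cat(sprintf("train: mean %.3f, sd %.3f\n", mean(acc_train), sd(acc_train)))
cat(sprintf("test:  mean %.3f, sd %.3f\n", mean(acc_test),  sd(acc_test)))
```

Preallocating numeric vectors and filling them by index avoids growing a vector inside the loop and keeps the accuracies numeric for plotting.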

Conclusion

This case walks through the complete flow of a data analysis. Reading the data, transforming binary variables, building a model, selecting parameters, and cross-validating are steps widely used in data analysis. Working through an analysis like this is a good way for readers to build experience, so that they can handle more types of data smoothly.

Reference

R筆記 (R Notes)