This article walks through a data classification exercise on the Steel Plates Faults Data Set (UCI), using a support vector machine (SVM) to predict the fault class. It is the authors' practical exercise in a complete data analysis process.
We carry out the analysis following the process shown in the chart.
This dataset comes from research by Semeion, Research Center of Sciences of Communication. The original aim of the research was to correctly classify the type of surface defects in stainless steel plates, with six types of possible defects (plus “other”). The Input vector was made up of 27 indicators that approximately [describe] the geometric shape of the defect and its outline. According to the research paper, Semeion was commissioned by the Centro Sviluppo Materiali (Italy) for this task and therefore it is not possible to provide details on the nature of the 27 indicators used as Input vectors or the types of the 6 classes of defects.
Source: UCI Steel Plates Faults Data
Reference: Kaggle
In this case, we will use these packages for analysis.
library(tibble) #Data frame
library(MASS) #Modern Applied Statistics with S
library(ggplot2) #Data Visualization
library(lattice) #Data Visualization
library(caret) #Cross Validation
library(e1071) #SVM
link: feature
link: label
Using the two links above, download the two text files and save them as “feature.txt” and “label.txt”.
We rename the columns of the data using the variable names in label.txt.
data <- read.table("C:/Users/User/Desktop/R/feature.txt",sep = "")
label <- read.table("C:/Users/User/Desktop/R/label.txt")
colnames(data) <- t(label)
There are 1941 observations, each described by 34 variables. Unfortunately, there is no other information that I know of to describe these columns.
str(data, list.len=8)
## 'data.frame': 1941 obs. of 34 variables:
## $ X_Minimum : int 42 645 829 853 1289 430 413 190 330 74 ...
## $ X_Maximum : int 50 651 835 860 1306 441 446 200 343 90 ...
## $ Y_Minimum : int 270900 2538079 1553913 369370 498078 100250 138468 210936 429227 779144 ...
## $ Y_Maximum : int 270944 2538108 1553931 369415 498335 100337 138883 210956 429253 779308 ...
## $ Pixels_Areas : int 267 108 71 176 2409 630 9052 132 264 1506 ...
## $ X_Perimeter : int 17 10 8 13 60 20 230 11 15 46 ...
## $ Y_Perimeter : int 44 30 19 45 260 87 432 20 26 167 ...
## $ Sum_of_Luminosity : int 24220 11397 7972 18996 246930 62357 1481991 20007 29748 180215 ...
## [list output truncated]
summary(data[,1:8])
## X_Minimum X_Maximum Y_Minimum Y_Maximum
## Min. : 0.0 Min. : 4 Min. : 6712 Min. : 6724
## 1st Qu.: 51.0 1st Qu.: 192 1st Qu.: 471253 1st Qu.: 471281
## Median : 435.0 Median : 467 Median : 1204128 Median : 1204136
## Mean : 571.1 Mean : 618 Mean : 1650685 Mean : 1650739
## 3rd Qu.:1053.0 3rd Qu.:1072 3rd Qu.: 2183073 3rd Qu.: 2183084
## Max. :1705.0 Max. :1713 Max. :12987661 Max. :12987692
## Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity
## Min. : 2 Min. : 2.0 Min. : 1.00 Min. : 250
## 1st Qu.: 84 1st Qu.: 15.0 1st Qu.: 13.00 1st Qu.: 9522
## Median : 174 Median : 26.0 Median : 25.00 Median : 19202
## Mean : 1894 Mean : 111.9 Mean : 82.97 Mean : 206312
## 3rd Qu.: 822 3rd Qu.: 84.0 3rd Qu.: 83.00 3rd Qu.: 83011
## Max. :152655 Max. :10449.0 Max. :18152.00 Max. :11591414
The last seven columns are dummy variables, i.e. if the plate fault is classified as “Stains” there will be a 1 in that column and 0’s in the other columns.
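As a quick illustration of this one-hot layout, the sketch below builds a small made-up data frame, checks that every row has exactly one 1 across the dummy columns, and recovers the class label with base R's `max.col()`. The `toy` data frame and its rows are assumptions for illustration, not rows from the dataset.

```r
# Toy one-hot table; the column names mimic three of the fault columns,
# but the rows are invented for illustration.
toy <- data.frame(
  Pastry    = c(1, 0, 0, 0),
  Stains    = c(0, 0, 1, 0),
  Z_Scratch = c(0, 1, 0, 1)
)

# Each observation should belong to exactly one class.
stopifnot(all(rowSums(toy) == 1))

# max.col() returns, for each row, the index of the largest entry,
# which here is the position of the single 1.
label <- factor(names(toy)[max.col(as.matrix(toy), ties.method = "first")])
print(label)
```

The same idea could be applied to the seven fault columns of the real data as a vectorized alternative to looping over rows.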
The data set is complete, meaning there are no missing values, which simplifies classification. Some preprocessing is still needed: we use for loops to combine the dummy variables into a single new column. First, we combine “typeofsteel_A300” and “typeofsteel_A400” into the new column “typeofsteel”, as shown below.
data <- add_column(data,0) #placeholder column 35
colnames(data)[35] <- c("typeofsteel")
colnames(data)[12] <- c("A300")
colnames(data)[13] <- c("A400")
#Replace each 1 in the two dummy columns with the column name
for(i in 12:13)
{
  for(j in 1:nrow(data))
    if (data[j,i]==1)
      data[j,i] <- colnames(data)[i]
}
#Copy the non-zero label into the new column
for (i in 12:13)
{
  for(j in 1:nrow(data))
    if (data[j,i] != 0)
      data[j,35] <- data[j,i]
}
data <- data[,-c(12:13)] #drop the two dummy columns
data[33] <- lapply(data[33], as.factor)
With the same method, we can also combine the plate fault types into the new column “type”, as shown below.
#Replace each 1 in the seven fault columns with the column name
for(i in 26:32)
{
  for(j in 1:nrow(data))
    if (data[j,i]==1)
      data[j,i] <- colnames(data)[i]
}
data <- add_column(data,0) #placeholder column 34
colnames(data)[34] <- c("type")
#Copy the non-zero label into the new column
for(i in 26:32)
{
  for(j in 1:nrow(data))
    if (data[j,i] != 0)
      data[j,34] <- data[j,i]
}
data <- data[,-c(26:32)] #drop the seven dummy columns
data[27] <- lapply(data[27], as.factor)
Before training the model, we need to split the dataset into training and testing sets for model construction and validation.
smp.size = floor(0.8*nrow(data))
set.seed(1029)
train.ind = sample(seq(nrow(data)), smp.size)
train = data[train.ind, ]
test = data[-train.ind, ]
After splitting the data, we construct the support vector machine model with the training set.
result <- svm(type~. ,train)
With the model trained, we can obtain the predicted values and evaluate them with a confusion matrix.
train.pred = predict(result, train)
test.pred = predict(result, test)
table(real=train$type, predict=train.pred) #Confusion matrix of train data
## predict
## real Bumps Dirtiness K_Scatch Other_Faults Pastry Stains
## Bumps 228 0 0 78 5 0
## Dirtiness 4 32 0 4 1 0
## K_Scatch 5 0 295 13 0 0
## Other_Faults 74 3 6 428 9 4
## Pastry 16 0 0 39 69 0
## Stains 2 0 0 3 0 55
## Z_Scratch 3 0 0 19 0 0
## predict
## real Z_Scratch
## Bumps 7
## Dirtiness 0
## K_Scatch 0
## Other_Faults 14
## Pastry 2
## Stains 0
## Z_Scratch 134
table(real=test$type, predict=test.pred) #Confusion matrix of test data
## predict
## real Bumps Dirtiness K_Scatch Other_Faults Pastry Stains
## Bumps 57 0 0 27 0 0
## Dirtiness 0 10 0 3 1 0
## K_Scatch 0 0 72 6 0 0
## Other_Faults 18 1 0 107 2 1
## Pastry 5 0 0 16 10 0
## Stains 0 0 0 0 0 12
## Z_Scratch 0 0 1 3 0 0
## predict
## real Z_Scratch
## Bumps 0
## Dirtiness 0
## K_Scatch 0
## Other_Faults 6
## Pastry 1
## Stains 0
## Z_Scratch 30
Finally, we get a prediction accuracy of about 80% on the training data and about 77% on the testing data.
confus.matrix = table(real=train$type, predict=train.pred)
confus.matrix2 = table(real=test$type, predict=test.pred)
sum(diag(confus.matrix))/sum(confus.matrix) # the training accuracy
## [1] 0.7996134
sum(diag(confus.matrix2))/sum(confus.matrix2) # the testing accuracy
## [1] 0.7660668
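Overall accuracy hides how the individual classes behave. As a minimal sketch, the block below re-enters the test-set confusion matrix printed earlier by hand and derives per-class recall and precision from it. The names `cm`, `recall`, and `precision` are introduced here for illustration and are not part of the original analysis.

```r
classes <- c("Bumps", "Dirtiness", "K_Scatch", "Other_Faults",
             "Pastry", "Stains", "Z_Scratch")

# Test-set confusion matrix copied from the printed output
# (rows = real class, columns = predicted class).
cm <- matrix(c(57,  0,  0,  27,  0,  0,  0,
                0, 10,  0,   3,  1,  0,  0,
                0,  0, 72,   6,  0,  0,  0,
               18,  1,  0, 107,  2,  1,  6,
                5,  0,  0,  16, 10,  0,  1,
                0,  0,  0,   0,  0, 12,  0,
                0,  0,  1,   3,  0,  0, 30),
             nrow = 7, byrow = TRUE,
             dimnames = list(real = classes, predict = classes))

accuracy  <- sum(diag(cm)) / sum(cm)   # overall share of correct predictions
recall    <- diag(cm) / rowSums(cm)    # correct share within each real class
precision <- diag(cm) / colSums(cm)    # correct share within each predicted class

print(round(accuracy, 4))
print(round(cbind(recall, precision), 3))
```

For example, only 10 of the 32 real Pastry plates are predicted as Pastry, while 16 of them are predicted as Other_Faults.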
The tuning parameter C, often described as “the price of misclassification”, is the weight that penalizes violations of the soft margin.
If C is large, misclassified training points are penalized heavily, so the model tolerates few errors. There are fewer support vectors, and the model overfits easily.
If C is small, the model tolerates more errors. There are more support vectors and a wider margin.
Reference:What are C and gamma with regards to a support vector machine?
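To make the role of C concrete, here is a minimal sketch that evaluates the standard soft-margin objective of a linear SVM, 0.5 * ||w||^2 + C * (sum of hinge losses), for a fixed hyperplane. The function name `soft_margin_objective`, the toy points, and the chosen `w` and `b` are assumptions for illustration. `svm()` solves the optimization internally; this block only evaluates the objective.

```r
soft_margin_objective <- function(w, b, X, y, C) {
  margins <- y * (X %*% w + b)      # signed margin of every point
  slack   <- pmax(0, 1 - margins)   # hinge loss: positive inside the margin or on the wrong side
  0.5 * sum(w^2) + C * sum(slack)
}

# Four toy points; the last one lies on the wrong side of the hyperplane.
X <- rbind(c(2, 2), c(1, 3), c(-2, -1), c(0.2, 0.1))
y <- c(1, 1, -1, -1)
w <- c(1, 1)
b <- 0

for (C in c(0.1, 1, 10)) {
  cat("C =", C, " objective =", soft_margin_objective(w, b, X, y, C), "\n")
}
```

The same margin violation costs more as C grows, which is why a large C pushes the optimizer toward boundaries that fit the training points more tightly.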
num.SV = sapply(X=1:100,
FUN=function(C) svm(type~., train, cost=C, epsilon =.1)$tot.nSV)
data.plot = data.frame(index=1:100, num.SV)
ggplot(data.plot) +
geom_point(aes(x=index, y=num.SV)) +
ggtitle("Cost of SVM ") +
labs(x= "C value", y = "support vectors")
With a large gamma, the influence of each training point reaches only the points close to it, so individual points weigh heavily on the hyperplane. This easily causes overfitting.
When gamma is very small, the model is too constrained and cannot capture the complexity or “shape” of the data. The figure below compares the effects of C and gamma.
Reference:RBF SVM parameters
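The effect of gamma can be seen directly in the radial (RBF) kernel, exp(-gamma * ||x1 - x2||^2). The sketch below evaluates it for two fixed points and several gamma values. The helper `rbf_kernel` and the points `a` and `b` are assumptions for illustration.

```r
# RBF kernel: exp(-gamma * squared Euclidean distance)
rbf_kernel <- function(x1, x2, gamma) exp(-gamma * sum((x1 - x2)^2))

a <- c(0, 0)
b <- c(1, 1)   # squared distance to a is 2

for (g in c(0.05, 0.15, 1, 10)) {
  cat("gamma =", g, " K(a, b) =", signif(rbf_kernel(a, b, g), 3), "\n")
}
```

As gamma grows, the kernel value for the same pair of points drops toward 0, so each point influences only its immediate neighborhood.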
How do we find good parameters for the SVM model? We use the tune() function below.
Tuning time: 16 mins
Personal computer: Windows 7
CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
RAM: 64 GB
tune.model = tune(svm,
type~.,
data=data,
kernel="radial",
ranges=list(cost=c(1:25), gamma=c(0.05,0.1,0.15,0.2,0.25))
)
tune.model$best.model
##
## Call:
## best.tune(method = svm, train.x = type ~ ., data = data, ranges = list(cost = c(1:25),
## gamma = c(0.05, 0.1, 0.15, 0.2, 0.25)), kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 2
## gamma: 0.15
##
## Number of Support Vectors: 1356
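To get a feel for why the search takes several minutes, the sketch below enumerates the same parameter grid in base R and counts the model fits. It assumes that tune() evaluates every combination with 10-fold cross validation, which I believe is the default of tune.control(); treat that fold count as an assumption.

```r
# Same grid as in the tune() call: 25 cost values x 5 gamma values
grid <- expand.grid(cost = 1:25, gamma = c(0.05, 0.1, 0.15, 0.2, 0.25))

n_combos <- nrow(grid)   # number of parameter combinations
n_folds  <- 10           # assumed default of tune.control()

cat("combinations:", n_combos, " model fits:", n_combos * n_folds, "\n")
```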
Finally, we use the tuned parameters to build a new model, which reaches an accuracy of about 92.8% on the training data.
new.model <- svm(type~. ,train , cost=2, gamma =.15)
train.pred = predict(new.model, train)
test.pred = predict(new.model, test)
confus.matrix = table(real=train$type, predict=train.pred)
sum(diag(confus.matrix))/sum(confus.matrix)
## [1] 0.9278351
The goal of cross validation is to define a dataset to “test” the model in the training phase, in order to limit problems like overfitting.
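As a minimal sketch of the idea, the block below partitions a toy set of row indices into k folds with base R, so that each fold serves once as the held-out set. The sizes `n` and `k` are made up, and this is not how caret's createFolds() is implemented.

```r
set.seed(12)
n <- 20                                       # toy number of observations
k <- 5                                        # toy number of folds
fold_id <- sample(rep(1:k, length.out = n))   # random, balanced fold labels

for (i in 1:k) {
  test_idx  <- which(fold_id == i)            # held-out rows for fold i
  train_idx <- which(fold_id != i)            # rows used to fit the model
  cat("fold", i, ": train =", length(train_idx),
      " test =", length(test_idx), "\n")
}
```

Every row lands in exactly one test fold, so each observation is used for testing once and for training k - 1 times.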
Reference: Cross-validation
We use 10-fold cross validation to check the model. The figure shows that in no fold is the testing accuracy higher than the training accuracy.
set.seed(12)
tmp = createFolds(c(1:nrow(data)), k = 10, list = TRUE)
final.pred.train <- numeric() #accuracies are numeric, not factors
final.pred.test <- numeric()
for(i in 1:10)
{
  train <- data[-unlist(tmp[i]),] #nine folds for training
  test <- data[unlist(tmp[i]),] #one fold held out for testing
  result <- svm(type~. ,train )
  train.pred = predict(result, train)
  test.pred = predict(result, test)
  confus.matrix = table(real=train$type, predict=train.pred)
  confus.matrix2 = table(real=test$type, predict=test.pred)
  pred.train <- sum(diag(confus.matrix))/sum(confus.matrix)
  pred.test <- sum(diag(confus.matrix2))/sum(confus.matrix2)
  final.pred.train <- c(final.pred.train ,pred.train)
  final.pred.test <- c(final.pred.test ,pred.test)
}
data.plot3 = data.frame(index=1:10, final.pred.train, final.pred.test)
ggplot(data.plot3) +
  geom_point(aes(x=index, y=final.pred.train),color="blue") +
  geom_point(aes(x=index, y=final.pred.test),color="brown") +
  ggtitle("10-fold cross validations") +
  labs(x= "index", y = "Accuracy")
Through this case, you can learn the complete flow of a data analysis. Reading the data, transforming binary variables, building the model, selecting parameters, and cross-validating are all steps widely used in data analysis. Working through an analysis like this is a good way to build experience, so that you can handle other types of data more smoothly.