At a high level, classification is the separation of data into two or more categories, or, for a single data point, the category that point is assigned to. In data mining, classification is a function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.
A classification task begins with a data set in which the class assignments are known. The term classifier is often used to refer to an algorithm that performs classification. In this article, I will be working with two classifiers: the Support Vector Machine (SVM), a classification algorithm that uses a boundary to separate the data into two or more categories/classes, and k-Nearest Neighbors (KNN), a classification algorithm that assigns a data point to a category based on the k data points nearest to it.
For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case.
Let’s look at UCI’s Credit Approval Data Set as an example: http://archive.ics.uci.edu/ml/datasets/credit+approval.
To begin, we need to load in the UCI credit card approval data and analyze it.
ccdata=read.table("credit_card_data-headers.txt",header = T,sep='\t')
The file credit_card_data-headers.txt (with headers) contains a dataset of 654 data points with 6 continuous and 4 binary predictor variables. It holds anonymized credit card applications with a binary response variable (the last column) indicating whether the application outcome was positive or negative. The dataset is the “Credit Approval Data Set” from the UCI Machine Learning Repository, with the categorical variables and the data points containing missing values removed.
A portion of the raw data is shown below for reference. Variables A1 through A15 are the predictors and R1 is the target.
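A preview like this can be produced with head() (the exact call here is an assumption on my part):
# Show the first six rows of the data
head(ccdata)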
## A1 A2 A3 A8 A9 A10 A11 A12 A14 A15 R1
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.50 1 1 0 0 360 0 1
The data structure is:
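The listing that follows matches the output of glimpse() from the dplyr package (an assumption; str(ccdata) reports similar information):
library(dplyr)
glimpse(ccdata)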
## Observations: 654
## Variables: 11
## $ A1 <int> 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0...
## $ A2 <dbl> 30.83, 58.67, 24.50, 27.83, 20.17, 32.08, 33.17, 22.92, 54...
## $ A3 <dbl> 0.000, 4.460, 0.500, 1.540, 5.625, 4.000, 1.040, 11.585, 0...
## $ A8 <dbl> 1.250, 3.040, 1.500, 3.750, 1.710, 2.500, 6.500, 0.040, 3....
## $ A9 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1...
## $ A10 <int> 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0...
## $ A11 <int> 1, 6, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 10, 3, 10, 0,...
## $ A12 <int> 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1...
## $ A14 <int> 202, 43, 280, 100, 120, 360, 164, 80, 180, 52, 128, 260, 0...
## $ A15 <int> 0, 560, 824, 3, 0, 0, 31285, 1349, 314, 1442, 0, 200, 0, 2...
## $ R1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
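Since the models below are judged by overall prediction accuracy, it is also worth checking how balanced the response variable is; this quick check is my own addition:
# Class balance of the response R1 (counts and proportions)
table(ccdata$R1)
prop.table(table(ccdata$R1))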
The Support Vector Machine (SVM) is a supervised machine learning algorithm that is well suited to classification problems. Here I build my SVM models in R using ksvm from the kernlab package. The matrix interface of ksvm expects a numeric matrix of predictors and a factor response, so it’s important to wrap the data in as.matrix and as.factor.
I decided to create four models with four different kernel functions to see what the difference would be. As shown below, it makes a difference!
library(kernlab)
# Setting the random number generator seed so that our results are reproducible
set.seed(1)
# Create our model using ksvm.
# Four different kernel functions are used for comparison - vanilladot, anovadot, rbfdot, and polydot
model1 <- ksvm(as.matrix(ccdata[,1:10]), as.factor(ccdata[,11]), type="C-svc", kernel="vanilladot", C=100, scaled=TRUE)
## Setting default kernel parameters
model1
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 100
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 189
##
## Objective Function Value : -17887.92
## Training error : 0.136086
model2 <- ksvm(as.matrix(ccdata[,1:10]), as.factor(ccdata[,11]), type="C-svc", kernel="anovadot", C=100, scaled=TRUE)
## Setting default kernel parameters
model2
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 100
##
## Anova RBF kernel function.
## Hyperparameter : sigma = 1 degree = 1
##
## Number of Support Vectors : 205
##
## Objective Function Value : -16400.74
## Training error : 0.093272
model3 <- ksvm(as.matrix(ccdata[,1:10]), as.factor(ccdata[,11]), type="C-svc", kernel="rbfdot", C=100, scaled=TRUE)
model3
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 100
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0890718384846732
##
## Number of Support Vectors : 244
##
## Objective Function Value : -9313.911
## Training error : 0.04893
Note: model3 (using the rbfdot kernel) has the lowest training error! It should therefore also have the highest accuracy on the training data; I verify this below when I print the prediction accuracy for each model. (Keep in mind that a low training error does not guarantee the best performance on unseen data.)
model4 <- ksvm(as.matrix(ccdata[,1:10]), as.factor(ccdata[,11]), type="C-svc", kernel="polydot", C=100, scaled=TRUE)
## Setting default kernel parameters
model4
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 100
##
## Polynomial kernel function.
## Hyperparameters : degree = 1 scale = 1 offset = 1
##
## Number of Support Vectors : 190
##
## Objective Function Value : -17887.98
## Training error : 0.136086
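To compare the four kernels at a glance, the training errors can also be collected programmatically; this is a small addition using kernlab’s error() accessor rather than reading them off the printed summaries.
# Gather the training error of each kernel into one named vector
sapply(list(vanilladot = model1, anovadot = model2, rbfdot = model3, polydot = model4), error)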
ksvm does not directly return the coefficients a0 and a1…am, so I calculate the weight vector a and the intercept a0 of the decision boundary for each model and then run the predict function across my four models, as shown below. (These weights describe a linear boundary, so they are only directly interpretable for the linear vanilladot kernel; for the other kernels the actual decision boundary is non-linear in the original predictors.) Since model3 has the lowest training error, I’ll only show the calculated results for a3 and a03:
# calculate the weights - xmatrix and coef are attributes of the model used to get the correct coefficients.
# The goal here is to obtain the weight for each of the 10 attributes from our credit card data so that the classifier will be:
# a1*x1 + a2*x2 + ... + a10*x10 + a0 = 0
# In R, we multiply each support vector in xmatrix by its coefficient from coef, then sum the columns to get the weights a1, a2, ..., a10.
a1 = colSums(model1@xmatrix[[1]] * model1@coef[[1]])
a2 = colSums(model2@xmatrix[[1]] * model2@coef[[1]])
a3 = colSums(model3@xmatrix[[1]] * model3@coef[[1]]); a3
## A1 A2 A3 A8 A9 A10 A11
## -19.30440 -38.81550 -8.82235 56.79429 49.16483 -22.83664 11.46010
## A12 A14 A15
## -23.06623 -59.18260 49.67840
a4 = colSums(model4@xmatrix[[1]] * model4@coef[[1]])
# calculate the constant a0 (the negative of the intercept b stored in the model) for each model
a01 = -model1@b
a02 = -model2@b
a03 = -model3@b; a03
## [1] 0.8469984
a04 = -model4@b
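As a sanity check (my own addition, not part of the original walkthrough), the linear model’s weights can be applied by hand: scale the predictors, compute a1·x + a01, and classify by sign. This assumes that ksvm’s scaled=TRUE uses the same centering and scaling as base R’s scale(), and since the sign-to-class mapping depends on how ksvm coded the factor levels, both orientations are checked.
# Manually classify with model1's linear decision boundary and compare with the true labels
scaled_x <- scale(as.matrix(ccdata[,1:10]))   # scale predictors the way ksvm did internally (assumed)
decision_vals <- scaled_x %*% a1 + a01        # signed distance from model1's separating hyperplane
manual_pred <- ifelse(decision_vals > 0, 1, 0)
# The larger of the two orientations should roughly match model1's training accuracy reported below
max(mean(manual_pred == ccdata[,11]), mean(1 - manual_pred == ccdata[,11]))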
# predict is a generic function for predictions from the results of various model fitting functions.
pred1 <- predict(model1,ccdata[,1:10])
pred2 <- predict(model2,ccdata[,1:10])
pred3 <- predict(model3,ccdata[,1:10])
pred4 <- predict(model4,ccdata[,1:10])
Let’s see what fraction of each model’s predictions match the actual classifications:
sum(pred1 == ccdata[,11]) / nrow(ccdata)
## [1] 0.8639144
sum(pred2 == ccdata[,11]) / nrow(ccdata)
## [1] 0.9067278
sum(pred3 == ccdata[,11]) / nrow(ccdata)
## [1] 0.9510703
sum(pred4 == ccdata[,11]) / nrow(ccdata)
## [1] 0.8639144
Now let’s try the k-nearest-neighbors classification function kknn from the R package of the same name.
Unlike SVM, which attempts to find a hyperplane that separates the different classes of the training instances with the maximum margin, k-Nearest Neighbors classifies a target instance by finding the k training instances nearest to it. Figuring out which k are the nearest requires calculating a distance function.
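To make the “nearest k” idea concrete, here is a small illustration of the distance computation itself (my own addition, using plain Euclidean distance on scaled predictors; kknn’s internal distance and kernel weighting can differ):
# Euclidean distance from the first application to every other (scaled) application,
# then the indices of its 5 closest neighbors (indices are relative to ccdata[-1,])
scaled_preds <- scale(as.matrix(ccdata[,1:10]))
dists <- sqrt(rowSums(sweep(scaled_preds[-1,], 2, scaled_preds[1,])^2))
head(order(dists), 5)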
library(kknn)
#Setting seed to produce reproducible results
set.seed(1)
check_accuracy = function(X){
predicted <- rep(0,(nrow(ccdata))) # predictions: start with a vector of all zeros
# for each row, estimate its response based on the other rows
for (i in 1:nrow(ccdata)){
# data[-i] means we remove row i of the data when finding nearest neighbors...
#...otherwise, it'll be its own nearest neighbor!
model=kknn(R1~.,ccdata[-i,],ccdata[i,],k=X, scale = TRUE) # use scaled data
# record whether the prediction is at least 0.5 (round to one) or less than 0.5 (round to zero)
predicted[i] <- as.integer(fitted(model)+0.5) # round off to 0 or 1
}
# calculate fraction of correct predictions
acc = sum(predicted == ccdata[,11]) / nrow(ccdata)
return(acc)
}
#
# Now call the function for values of k from 1 to 40 (you could try higher values of k too)
#
accuracy = rep(0,40) # set up a vector of 40 zeros to start
for (X in 1:40){
accuracy[X] = check_accuracy(X) # test knn with X neighbors
}
#
# report accuracies
#
plot(accuracy)
title("K-Nearest-Neighbors")
max(accuracy)
## [1] 0.853211
which.max(accuracy)
## [1] 12
Using k-Nearest Neighbors gave a prediction accuracy of about 85% at k = 12.
You can also train kknn via leave-one-out cross-validation using train.kknn. This method can give better accuracy, as we will see below.
# Use two-thirds of the data for training and the remaining third for testing
# Number of rows in the credit card data
d.rows = nrow(ccdata)
# Randomly select one-third of the 654 row indices
d.sample = sample(1:d.rows, size = round(d.rows/3), replace = FALSE)
# Training data: everything except the sampled third
d.train = ccdata[-d.sample,]
# Test data: the sampled third
d.test = ccdata[d.sample,]
# Train kknn via leave-one-out cross-validation (train.kknn) to find the optimal value of k
xval=train.kknn(R1 ~ ., data = d.train, kmax = 100, kernel = c("optimal","rectangular", "inv", "gaussian", "triangular"), scale = TRUE)
xval
##
## Call:
## train.kknn(formula = R1 ~ ., data = d.train, kmax = 100, kernel = c("optimal", "rectangular", "inv", "gaussian", "triangular"), scale = TRUE)
##
## Type of response variable: continuous
## minimal mean absolute error: 0.1880734
## Minimal mean squared error: 0.1114344
## Best kernel: inv
## Best k: 38
Now that we have trained kknn, we can plot the cross-validated error for different values of k and for each kernel function to see which combination works best.
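For example (my addition, relying on the plot method that the kknn package provides for train.kknn objects), a single call draws the cross-validated error against k for each kernel:
# Plot cross-validated error versus k for each kernel tried by train.kknn
plot(xval)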
A confusion matrix displays the number of correct and incorrect predictions made by the model compared with the actual classifications in the test data. The matrix is n-by-n, where n is the number of classes. Below is the confusion matrix of my model on the test data:
#Testing the model on test data
pred<-predict(xval, d.test)
pred_bin<-round(pred)
pred_accuracy<-table(pred_bin,d.test$R1)
pred_accuracy
##
## pred_bin 0 1
## 0 107 19
## 1 7 85
The calculated prediction accuracy is:
#Print KKNN prediction accuracy
sum(pred_bin==d.test$R1)/length(d.test$R1)
## [1] 0.8807339
KKNN prediction accuracy using the leave-one-out cross-validation method is 88% on the held-out test data, with the best k being 38; this is about 3 percentage points better than the standard kknn approach on the same data set.
Of the different methods tested (SVM, kknn, and train.kknn), SVM with the rbfdot kernel provided the best prediction accuracy for this data set, at roughly 95% (though note that the SVM accuracy was measured on the same data used for training, while the train.kknn figure comes from a held-out test set).