Suppose you have data on petals and sepals length and width of flowers of gender Iris. Would you be able to tell what is the species? Perhaps a machine learning model can help you.
The following data set contains 50 observations on those morphometric variables for three species of Iris. Check the means:
data(iris)
aggregate( .~ Species, data = iris, FUN = mean) Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.0 3.4 1.5 0.25
2 versicolor 5.9 2.8 4.3 1.33
3 virginica 6.6 3.0 5.6 2.03
There are several types of machine learning models, for regression and classification. We shall focus here on a just a couple of models suitable for multi-classification problems based on supervised learning, in which the labels of classes (Species) are already given.
To evaluate the accuracy of models, we should fit them using part of the data, say 70% or 80%, and use the rest for testing, so that we can compute a confusion matrix containing the count (or proportions) of correct (diagonal) and misclassifications.
id <- sample(nrow(iris), size = nrow(iris) * 0.7)
train_data <- iris[id, ]
nrow(train_data)[1] 105
test_data <- iris[-id, ]
nrow(test_data)[1] 45
Just like principal components, LDA consists of finding linear combinations of the original variables X to obtain latent variables, Z, the linear discriminant variables.
\[{\bf Z}_j = {\bf Xa}_j\] But, differently from PCA, whose objective is to explain variance, with LDA the objective is to maximize the ratio of the separation of the class means to the within-class variance. This is also done though singular value decomposition.
To fit a LDA model,
library(MASS)
model_lda <- lda(Species ~., data = train_data)
model_ldaCall:
lda(Species ~ ., data = train_data)
Prior probabilities of groups:
setosa versicolor virginica
0.35 0.30 0.34
Group means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.0 3.4 1.5 0.26
versicolor 5.9 2.7 4.3 1.33
virginica 6.5 2.9 5.5 1.98
Coefficients of linear discriminants:
LD1 LD2
Sepal.Length 1.1 0.19
Sepal.Width 1.2 2.02
Petal.Length -2.8 -0.97
Petal.Width -2.0 2.83
Proportion of trace:
LD1 LD2
0.9934 0.0066
Most of the separation (98.9%) is explaind by LD1. The number of discriminant variables is given by \(\min(p, k-1)\), where \(p\) is the number of variables in \({\bf X}\) and \(k\) is the number of classes.
The classification rule is to determine decision boundaries that maximizes distances between the class means on LD1 and LD2, as shown in the plot below. Because of the underlying linearity, the boundaries consist of straight lines. In higher dimensions, it is a hyperplane.
To get predictions for the testing data,
pred_lda <- predict(model_lda, newdata = test_data)$class
pred_lda[1:5] # check first 5 predictions[1] setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
Now we can compute the confusion matrix
table(obs = test_data$Species, pred = pred_lda) pred
obs setosa versicolor virginica
setosa 13 0 0
versicolor 0 17 1
virginica 0 0 14
The classification rule adopted by SVMs are similar to that used in LDA, but more flexible, using kernel functions for that, such as polynomial, radial, sigmoid or simply linear. The objective is to fit a ‘hyperplane’ (which can be nonlinear, depending on the kernel, and that is why SVMs are powerful) in the ‘gap’ between the classes, with maximal margins (distance from hyperplane to the nearest point). The points exactly in the margins are called support vectors. A program that finds such vectors are called support vector machines.
Source: Hastie et al. (2009) The elements of statistical learning: data mining, inference, and prediction (Vol. 2). New York: Springer.
However, there may be no such hyperplane well fitted enough to prevent misclassifications (left side ofthe figure above). The strategy then is to allow points to be on the wrong size of their margin (right side of the figure). This is determined by a hyperparameter cost. Another hyperparameter, gamma, might be needed if the kernel is not the linear one.
The choice of the hyperparameters impacts the accuracy of the fitted model. A cross-validation technique can be employed to find estimates that produce higher accuracy - this is done by fitting several models.
library(e1071)
model_svm <- svm(Species ~ Sepal.Length + Petal.Length,
data = train_data, scale = FALSE,
kernel = 'linear', gamma = 1/2, cost = 50,
cross = 10)
summary(model_svm)
Call:
svm(formula = Species ~ Sepal.Length + Petal.Length, data = train_data,
kernel = "linear", gamma = 1/2, cost = 50, cross = 10, scale = FALSE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 50
Number of Support Vectors: 15
( 2 6 7 )
Number of Classes: 3
Levels:
setosa versicolor virginica
10-fold cross-validation on training data:
Total Accuracy: 97
Single Accuracies:
100 100 100 91 100 100 90 91 100 100
The support vectors found can be seen (+) in the following plot:
plot(Petal.Length ~ Sepal.Length, data = train_data,
col = train_data$Species)
points(model_svm$SV, pch = 3)Now we can compute the confusion matrix
pred_svm <- predict(model_svm, newdata = test_data)
table(obs = test_data$Species, pred = pred_svm) pred
obs setosa versicolor virginica
setosa 13 0 0
versicolor 0 17 1
virginica 0 1 13
The package caret offers a broad list of machine learning models that can be fitted using a common framework. There are functions to split and pre-process the data, to evaluate the accuracy of the models and the importance of variables. Also, the process of tuning the hyperparameters is easier.
library(caret)
model <- train(Species ~., data = train_data,
method = "svmLinear2",
preProcess = c("center", "scale"),
tuneLength = 12,
trControl = trainControl(method = "cv", number = 7, p = 0.7),
metric = "Accuracy")
modelSupport Vector Machines with Linear Kernel
105 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (7 fold)
Summary of sample sizes: 90, 89, 89, 90, 91, 90, ...
Resampling results across tuning parameters:
cost Accuracy Kappa
0.25 0.94 0.91
0.50 0.96 0.94
1.00 0.96 0.94
2.00 0.96 0.94
4.00 0.95 0.93
8.00 0.95 0.93
16.00 0.96 0.94
32.00 0.95 0.93
64.00 0.95 0.93
128.00 0.96 0.94
256.00 0.96 0.94
512.00 0.97 0.96
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cost = 512.
Run also confusionMatrix(model) and varImp(model)