Supervised machine learning covers a range of methods for predicting categorical outcomes from a set of predictor variables. Discriminant analysis is one of these methods.
In this lesson, we demonstrate how to use linear discriminant analysis to predict categorical outcomes.
The goal of discriminant analysis is as follows: given a dataset with several continuous (predictor) variables and a categorical (response) variable, determine the linear combination of the predictor variables that best separates the observations into the mutually exclusive groups defined by the categorical variable.
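To make "best separates" concrete, here is a minimal base-R sketch, assuming the usual Fisher formulation in which a direction is scored by the ratio of between-group to within-group variation along it. The names `W`, `B`, `fisher_ratio` and `a1` are introduced here for illustration only, and the sketch uses the full iris data rather than a training subset.

```r
# Sketch only: Fisher's criterion computed by hand on the built-in iris data.
X <- as.matrix(iris[, 1:4])        # continuous predictors
g <- iris$Species                  # categorical response
grand_mean <- colMeans(X)

W <- matrix(0, ncol(X), ncol(X))   # within-group scatter
B <- matrix(0, ncol(X), ncol(X))   # between-group scatter
for (level in levels(g)) {
  Xg <- X[g == level, , drop = FALSE]
  group_mean <- colMeans(Xg)
  W <- W + crossprod(sweep(Xg, 2, group_mean))
  B <- B + nrow(Xg) * tcrossprod(group_mean - grand_mean)
}

# Ratio of between-group to within-group variation along a direction a
fisher_ratio <- function(a) {
  as.numeric(t(a) %*% B %*% a) / as.numeric(t(a) %*% W %*% a)
}

# The direction that maximises the ratio is the leading eigenvector of W^{-1} B
eig <- eigen(solve(W) %*% B)
a1 <- Re(eig$vectors[, 1])

fisher_ratio(a1)              # ratio along the first discriminant direction
fisher_ratio(c(0, 0, 1, 0))   # ratio along Petal.Length alone, for comparison
```

The ratio along `a1` cannot be smaller than the ratio along any single predictor, which is the sense in which the linear combination separates the groups "best".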
Discriminant analysis rests on the following assumptions and requirements:
Sample size - as a rule of thumb, the dataset should contain at least 4 to 5 times as many observations as predictor variables.
Normal distribution - the predictor variables are assumed to be sampled from a multivariate normal distribution.
Outliers - discriminant analysis is sensitive to univariate and multivariate outliers.
Multicollinearity of the independent variables - the predictor variables should not be collinear. Multicollinearity limits the ability to reliably assess the importance of each predictor variable.
Homogeneity of the variance/covariance matrices - linear discriminant analysis assumes that the groups share a common variance/covariance matrix and is sensitive to heterogeneity of these matrices.
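The conditions above can be screened roughly with base R before fitting a model. The following is a sketch under stated assumptions, not a formal test procedure: the Shapiro-Wilk test checks each variable separately (so it is only a partial check of multivariate normality), the 0.999 chi-square cutoff for Mahalanobis distances is a common convention assumed here, and comparing log-determinants is an informal stand-in for a formal test of covariance homogeneity. All object names are illustrative.

```r
# Sketch only: rough, base-R checks of the conditions listed above, on the iris data.
predictors <- iris[, 1:4]
groups <- iris$Species

# Sample size: observations per predictor variable
obs_per_predictor <- nrow(predictors) / ncol(predictors)

# Normality: Shapiro-Wilk p-values per variable and group (univariate screen only)
shapiro_p <- sapply(predictors, function(v) {
  tapply(v, groups, function(z) shapiro.test(z)$p.value)
})

# Outliers: squared Mahalanobis distance of each observation from its group centre
is_outlier <- unlist(lapply(split(predictors, groups), function(d) {
  mahalanobis(d, colMeans(d), cov(d)) > qchisq(0.999, df = ncol(d))
}))

# Multicollinearity: largest absolute pairwise correlation among the predictors
cor_matrix <- cor(predictors)
max_abs_cor <- max(abs(cor_matrix[upper.tri(cor_matrix)]))

# Homogeneity: log-determinants of the group covariance matrices should be similar
log_det <- sapply(split(predictors, groups), function(d) {
  as.numeric(determinant(cov(d), logarithm = TRUE)$modulus)
})

obs_per_predictor
round(shapiro_p, 3)
sum(is_outlier)
max_abs_cor
log_det
```

For iris, `obs_per_predictor` is 37.5, comfortably above the 4 to 5 rule of thumb, while the petal measurements are strongly correlated, so the individual coefficients should be interpreted with some care.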
The data for this analysis are the classic iris data. This dataset contains 150 observations and five variables: four numeric flower measurements (sepal length, sepal width, petal length and petal width) and the categorical variable Species, a factor with three levels (setosa, versicolor and virginica). As is usual, we partition the data into a training set and a testing set: we train the model on the training set and then estimate its accuracy on the testing set, which the model has not seen.
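#Load the packages used in this lesson: psych (pairs.panels), MASS (lda, ldahist),
#klaR (partimat) and ggord (ggord). ggord may need to be installed from its GitHub
#repository (fawda123/ggord) rather than from CRAN.
library(psych)
library(MASS)
library(klaR)
library(ggord)
#Explore the dataset using scatterplots between the variables and their correlation coefficients.
#pch = 21 is a fillable plotting symbol, so the bg colours identify the three species.
pairs.panels(iris[, -5], gap = 0, bg = c("red", "blue", "green")[iris$Species], pch = 21)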
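#Randomly assign each row to the training set (probability 0.7) or the test set (probability 0.3).
#The labels 1 and 2 are drawn independently per row, so the split is approximately, not exactly, 70/30.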
set.seed(1234)
Index <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
TrainSet <- iris[Index==1, ]
TestSet <- iris[Index==2, ]
#
#str(TrainSet)
#str(TestSet)
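Because each row is assigned at random, the split above is only approximately 70/30 and the species are not equally represented in the training set (the output in this lesson corresponds to 112 training and 38 test rows). If an exact split with equal class proportions is preferred, one possible alternative is a stratified sample, sketched below. `TrainStratified` and `TestStratified` are illustrative names; they are not used in the rest of this lesson, which continues with `TrainSet` and `TestSet`.

```r
# Sketch only: an exact, stratified 70/30 split (35 of the 50 flowers per species).
set.seed(1234)
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_rows <- unlist(lapply(rows_by_species, sample, size = 35))

TrainStratified <- iris[train_rows, ]
TestStratified <- iris[-train_rows, ]

table(TrainStratified$Species)   # 35 rows per species
table(TestStratified$Species)    # 15 rows per species
```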
#Fit the Linear discriminant model and explore its attributes
LinearDA <- lda(Species~., TrainSet)
LinearDA
## Call:
## lda(Species ~ ., data = TrainSet)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3571429 0.3392857 0.3035714
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 5.010000 3.440000 1.480000 0.255000
## versicolor 5.976316 2.776316 4.297368 1.336842
## virginica 6.605882 2.991176 5.611765 2.035294
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.8263761 0.2167902
## Sepal.Width 1.2790649 -2.4275193
## Petal.Length -2.2743967 0.9044990
## Petal.Width -2.2224018 -3.1236791
##
## Proportion of trace:
## LD1 LD2
## 0.9884 0.0116
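The printed components can be tied back to the fitted object. The sketch below assumes the MASS package and, to stay self-contained, refits the model on the full iris data under the illustrative name `fit_full`, so its numbers differ from the training-set output shown here. It recomputes the "proportion of trace" from the `svd` component and reproduces the discriminant scores from the `scaling` coefficients after centring the predictors at the prior-weighted average of the group means.

```r
# Sketch only: how the printed pieces of an lda fit relate to each other.
library(MASS)

fit_full <- lda(Species ~ ., data = iris)

# Proportion of trace: share of between-group separation carried by each discriminant
trace_share <- fit_full$svd^2 / sum(fit_full$svd^2)
round(trace_share, 4)

# Discriminant scores = centred predictors multiplied by the coefficient matrix
centre <- colSums(fit_full$prior * fit_full$means)
X_full <- as.matrix(iris[, 1:4])
manual_scores <- scale(X_full, center = centre, scale = FALSE) %*% fit_full$scaling

# Compare with the scores returned by predict()
all.equal(manual_scores, predict(fit_full, iris)$x, check.attributes = FALSE)
```

In the training-set output above, LD1 accounts for 0.9884 of the trace, so almost all of the separation between species lies along the first discriminant function.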
#
#attributes(LinearDA)
#Explore the performance of the discriminant model, first with the training data set
#and then with the test data set
#
PerformanceTest <- predict(LinearDA, TrainSet)
#PerformanceTest
#
#attributes(PerformanceTest)
#
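The object returned by `predict()` for an lda fit holds more than the predicted class. As a small sketch (again assuming MASS, and using a model refitted on the full iris data under the illustrative names `fit_all` and `pred_all`), the posterior probabilities can be used to find the observations the model is least certain about.

```r
# Sketch only: the three components returned by predict() for an lda model.
library(MASS)

fit_all <- lda(Species ~ ., data = iris)
pred_all <- predict(fit_all, iris)
names(pred_all)   # "class" "posterior" "x"

# The predicted class is the group with the highest posterior probability
top_group <- colnames(pred_all$posterior)[max.col(pred_all$posterior, ties.method = "first")]

# The five observations with the lowest maximum posterior probability
least_certain <- head(order(apply(pred_all$posterior, 1, max)), 5)
round(pred_all$posterior[least_certain, ], 3)
```

The `x` component contains the discriminant scores (LD1 and LD2) that are plotted in the histograms that follow.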
# The separation achieved by the first discriminant function.
# ldahist() draws the histograms as a side effect and returns NULL, so its result is not stored.
ldahist(data = PerformanceTest$x[, 1], g = TrainSet$Species)
#
# The separation achieved by the second discriminant function
ldahist(data = PerformanceTest$x[, 2], g = TrainSet$Species)
#
#Use partition plots to show the separation achieved by each pair of the four
#predictor variables in turn. partimat() is called for its plot only.
partimat(Species~., data=TrainSet, method="lda")
#
#Model accuracy on the training data using a confusion matrix
pred <- predict(LinearDA, TrainSet)$class
tab <- table(predictions = pred, Actual = TrainSet$Species)
tab
## Actual
## predictions setosa versicolor virginica
## setosa 40 0 0
## versicolor 0 36 1
## virginica 0 2 33
#
modelAccuracyTrain <- sum(diag(tab))/sum(tab)
modelAccuracyTrain
## [1] 0.9732143
#Model performance and accuracy using the test data - unseen dataset
TestPred <- predict(LinearDA, TestSet)$class
tabtest <- table(Predicted = TestPred, Actual = TestSet$Species)
tabtest
## Actual
## Predicted setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 12 1
## virginica 0 0 15
#Model accuracy on unseen data - using TestSet dataset
modelAccuracyTest <- sum(diag(tabtest))/sum(tabtest)
modelAccuracyTest
## [1] 0.9736842
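Overall accuracy hides which species are confused with each other. As a short sketch, the per-class recall and precision can be read off the same confusion matrix. To keep the example self-contained, the counts are copied from the test-set table above into an illustrative object named `tabtest_copy`; with the objects from this lesson in memory, `tabtest` could be used directly.

```r
# Sketch only: per-class metrics from the test-set confusion matrix shown above
# (rows = predicted species, columns = actual species).
species <- c("setosa", "versicolor", "virginica")
tabtest_copy <- matrix(c(10,  0,  0,
                          0, 12,  1,
                          0,  0, 15),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(Predicted = species, Actual = species))

accuracy  <- sum(diag(tabtest_copy)) / sum(tabtest_copy)
recall    <- diag(tabtest_copy) / colSums(tabtest_copy)   # share of each actual class recovered
precision <- diag(tabtest_copy) / rowSums(tabtest_copy)   # share of each predicted class that is correct

round(rbind(recall, precision), 3)
accuracy
```

The only error is one virginica flower predicted as versicolor, so recall for virginica is 15/16 and precision for versicolor is 12/13, while every other entry is 1.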
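#Ordination plot of the training observations on LD1 and LD2, grouped by species,
#with the predictor variables drawn as arrows
ggord(LinearDA, TrainSet$Species, ylim = c(-10, 10), xlim = c(-15, 15))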