Introduction

Supervised machine learning covers a range of methods for predicting categorical outcomes using a set of predictor variables. These methods include logistic regression, decision trees, support vector machines, and discriminant analysis.

In this lesson, we demonstrate how to use discriminant analysis to predict categorical outcomes.

Objective

Given a dataset with several continuous (predictor) variables and a categorical (response) variable, the objective is to determine a linear combination of the predictor variables that best separates the observations into the mutually exclusive groups defined by the response variable.

Examples of use cases

  • To classify loan/mortgage applicants as good or bad credit risks given their financial records and credit history.
  • To classify e-mails as spam or non-spam given their characteristics.
  • To classify patients as likely or unlikely to suffer a heart attack using their vital signs and symptoms.
  • To characterize tumors as benign or malignant given the characteristics of the tissue samples.

Assumptions of the model

  • Sample size - generally, there should be at least 4 to 5 times as many observations as predictor variables.

  • Normal distribution - the predictor variables are assumed to be sampled from a multivariate normal distribution.

  • Outliers - discriminant analysis is sensitive to univariate and multivariate outliers.

  • Multicollinearity of the independent variables - the predictor variables should not be collinear; multicollinearity limits the ability to reliably assess the importance of each predictor variable.

  • Homogeneity of the variance/covariance matrices - discriminant analysis is sensitive to heterogeneity of the variance/covariance matrices across groups. Informal checks for these assumptions are sketched below.
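
Before fitting the model, these assumptions can be checked informally. The sketch below illustrates one way to do so for the iris data; the Box's M test at the end assumes the heplots package is installed.

#Sample size: observations per predictor (rule of thumb: at least 4 to 5)
nrow(iris) / (ncol(iris) - 1)
#
#Multicollinearity: inspect pairwise correlations among the predictors
cor(iris[, -5])
#
#Outliers: flag observations with unusually large Mahalanobis distances
X <- iris[, -5]
md <- mahalanobis(X, colMeans(X), cov(X))
which(md > qchisq(0.99, df = ncol(X)))
#
#Homogeneity of the covariance matrices: Box's M test (heplots package)
#library(heplots)
#boxM(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species, data = iris)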

Implementing a discriminant analysis model using R

The data for this analysis is the classic iris data. This dataset contains 150 observations and five variables. The categorical variable Species is a factor with three levels. In general, we partition the data into a training set and a testing set. We train the model using data from the training set and test the model for accuracy using data from the testing set.

Explore the iris dataset

#Explore the dataset using scatterplots between the variables and their
#correlation coefficients (pairs.panels() is in the psych package)

library(psych)
pairs.panels(iris[, -5], gap = 0, bg = c("red", "blue", "green")[iris$Species], pch = 18)
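
If the psych package is not available, base R's pairs() gives a similar scatterplot matrix (without the correlation panels):

#Base-R alternative: scatterplot matrix colored by species
pairs(iris[, -5], col = c("red", "blue", "green")[iris$Species], pch = 18)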

Partition the dataset into training and test data sets.

#Randomly split the data: each row is assigned to the training set with
#probability 0.7 and to the test set with probability 0.3

set.seed(1234)
Index <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
TrainSet <- iris[Index==1, ]
TestSet <- iris[Index==2, ]
#
#str(TrainSet) 
#str(TestSet)
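
Because the split is random, it is worth confirming that it roughly preserves the 70/30 proportions and the class balance:

#Check the size and class distribution of each partition
table(TrainSet$Species)
table(TestSet$Species)
prop.table(table(Index))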

Fitting a linear discriminant analysis model

#Fit the linear discriminant analysis model and explore its attributes
#(lda() is in the MASS package)

library(MASS)
LinearDA <- lda(Species ~ ., data = TrainSet)
LinearDA
## Call:
## lda(Species ~ ., data = TrainSet)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3571429  0.3392857  0.3035714 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         5.010000    3.440000     1.480000    0.255000
## versicolor     5.976316    2.776316     4.297368    1.336842
## virginica      6.605882    2.991176     5.611765    2.035294
## 
## Coefficients of linear discriminants:
##                     LD1        LD2
## Sepal.Length  0.8263761  0.2167902
## Sepal.Width   1.2790649 -2.4275193
## Petal.Length -2.2743967  0.9044990
## Petal.Width  -2.2224018 -3.1236791
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9884 0.0116
#
#attributes(LinearDA)
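
The discriminant scores that predict() reports later are this linear combination applied to the centered predictors. As an illustration (not needed for the analysis), they can be reproduced from the model object, along with the proportion of trace:

#Reproduce the discriminant scores by hand: center the predictors at the
#prior-weighted grand mean, then apply the coefficients
GrandMean <- colSums(LinearDA$prior * LinearDA$means)
Scores <- scale(as.matrix(TrainSet[, -5]), center = GrandMean, scale = FALSE) %*% LinearDA$scaling
#
#The proportion of trace is the share of between-group variance
#captured by each discriminant function
LinearDA$svd^2 / sum(LinearDA$svd^2)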

Performance of the fitted model - training dataset

#Explore the performance of the discriminant model, first with the training set
#and then with the test set
#

PerformanceTest <- predict(LinearDA, TrainSet)

#PerformanceTest
#
#attributes(PerformanceTest)
#
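
predict() returns a list with three components: class (the predicted labels), posterior (the per-class posterior probabilities), and x (the discriminant scores). For example:

#Inspect the three components of the prediction object
head(PerformanceTest$class)
head(PerformanceTest$posterior)
head(PerformanceTest$x)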

# The separation achieved by the first discriminant function
# (ldahist() draws stacked histograms directly and returns NULL,
# so its result is not saved)

ldahist(data = PerformanceTest$x[, 1], g = TrainSet$Species)
#
# The separation achieved by the second discriminant function

ldahist(data = PerformanceTest$x[, 2], g = TrainSet$Species)
#

#Use partition plots to show the separation achieved by each pair of the four
#predictor variables (partimat() is in the klaR package; like ldahist(), it
#draws the plots directly rather than returning a plot object)

library(klaR)
partimat(Species ~ ., data = TrainSet, method = "lda")
#

#Model accuracy on the training set using a confusion matrix

pred <- PerformanceTest$class
tab <- table(predictions = pred, Actual = TrainSet$Species)
tab
##             Actual
## predictions  setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         36         1
##   virginica       0          2        33
#
modelAccuracyTrain <- sum(diag(tab))/sum(tab)
modelAccuracyTrain
## [1] 0.9732143
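
Overall accuracy can hide class-specific errors, so per-class recall and precision can be read off the same table:

#Per-class recall: correct predictions divided by the actual count in each class
diag(tab) / colSums(tab)
#Per-class precision: correct predictions divided by the predicted count
diag(tab) / rowSums(tab)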

Performance of the fitted model - testing dataset

#Model performance and accuracy using the test data - unseen dataset

TestPred <- predict(LinearDA, TestSet)$class
tabtest <- table(Predicted = TestPred, Actual = TestSet$Species)
tabtest
##             Actual
## Predicted    setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         12         1
##   virginica       0          0        15
#Model accuracy on unseen data - using TestSet dataset

modelAccuracyTest <- sum(diag(tabtest))/sum(tabtest)
modelAccuracyTest
## [1] 0.9736842
#
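
As a further check on how well the model generalizes, lda() also supports leave-one-out cross-validation through its CV argument. A brief sketch on the full dataset:

#Leave-one-out cross-validation
cvModel <- lda(Species ~ ., data = iris, CV = TRUE)
cvTab <- table(Predicted = cvModel$class, Actual = iris$Species)
sum(diag(cvTab)) / sum(cvTab)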

The separation achieved by the discriminant functions

#Plot the training observations on the first two discriminant axes
#(the ggord package is not on CRAN; it can be installed from GitHub, e.g. fawda123/ggord)

library(ggord)
ggord(LinearDA, TrainSet$Species, ylim = c(-10, 10), xlim = c(-15, 15))
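
If ggord is not installed, a similar plot can be drawn with base graphics from the scores already computed by predict():

#Base-R alternative: scatterplot of the first two discriminant scores
plot(PerformanceTest$x[, 1], PerformanceTest$x[, 2],
     col = c("red", "blue", "green")[TrainSet$Species],
     pch = 18, xlab = "LD1", ylab = "LD2")
legend("topright", legend = levels(TrainSet$Species),
       col = c("red", "blue", "green"), pch = 18)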