Introduction

Supervised machine learning covers a range of methods for predicting categorical outcomes using a set of predictor variables. These methods include logistic regression, decision trees, support vector machines, and discriminant analysis.

In this lesson, we demonstrate how to use discriminant analysis to predict categorical outcomes.

Objective

Given a dataset with several continuous (predictor) variables and a categorical (response) variable, the objective is to determine a linear combination of the predictor variables that best separates the observations into the mutually exclusive groups defined by the response variable.

Examples of use cases

  • To classify loan/mortgage applicants as good or bad credit risks given their financial records and credit history.
  • To classify e-mails as spam or non-spam given their characteristics.
  • To classify patients as likely or unlikely to suffer a heart attack using their vital signs and symptoms.
  • To characterize tumors as benign or malignant given the characteristics of the tissue samples.

Assumptions of the model

  • Sample size - generally, there should be at least 4 to 5 times as many observations as predictor variables.

  • Normal distribution - the predictor variables are assumed to be sampled from a multivariate normal distribution.

  • Outliers - discriminant analysis is sensitive to univariate and multivariate outliers.

  • Multicollinearity of the independent variables - the predictor variables should not be collinear; multicollinearity limits the ability to reliably assess the importance of each predictor variable.

  • Homogeneity of the variance/covariance matrices - discriminant analysis is sensitive to heterogeneity of the variance/covariance matrices across groups. Informal checks for these assumptions are sketched below.
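
Before fitting the model, these assumptions can be checked informally. The sketch below illustrates one way to do so for the iris data; the Box's M test at the end assumes the heplots package is installed.

#Sample size: observations per predictor (rule of thumb: at least 4 to 5)
nrow(iris) / (ncol(iris) - 1)
#
#Multicollinearity: inspect pairwise correlations among the predictors
cor(iris[, -5])
#
#Outliers: flag observations with unusually large Mahalanobis distances
X <- iris[, -5]
md <- mahalanobis(X, colMeans(X), cov(X))
which(md > qchisq(0.99, df = ncol(X)))
#
#Homogeneity of the covariance matrices: Box's M test (heplots package)
#library(heplots)
#boxM(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species, data = iris)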

Implementing a discriminant analysis model using R

The data for this analysis is the classic iris data. This dataset contains 150 observations and five variables. The categorical variable Species is a factor with three levels. In general, we partition the data into a training set and a testing set. We train the model using data from the training set and test the model for accuracy using data from the testing set.

Explore the iris dataset

#Explore the dataset using scatterplots between the variables and their
#correlation coefficients (pairs.panels() is in the psych package)

library(psych)
pairs.panels(iris[, -5], gap = 0, bg = c("red", "blue", "green")[iris$Species], pch = 18)
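
If the psych package is not available, base R's pairs() gives a similar scatterplot matrix (without the correlation panels):

#Base-R alternative: scatterplot matrix colored by species
pairs(iris[, -5], col = c("red", "blue", "green")[iris$Species], pch = 18)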

Partition the dataset into training and test data sets.

#Randomly split the data: each row is assigned to the training set with
#probability 0.7 and to the test set with probability 0.3

set.seed(1234)
Index <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
TrainSet <- iris[Index==1, ]
TestSet <- iris[Index==2, ]
#
#str(TrainSet) 
#str(TestSet)
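
Because the split is random, it is worth confirming that it roughly preserves the 70/30 proportions and the class balance:

#Check the size and class distribution of each partition
table(TrainSet$Species)
table(TestSet$Species)
prop.table(table(Index))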

Fitting a linear discriminant analysis model

#Fit the linear discriminant analysis model and explore its attributes
#(lda() is in the MASS package)

library(MASS)
LinearDA <- lda(Species ~ ., data = TrainSet)
LinearDA
## Call:
## lda(Species ~ ., data = TrainSet)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3571429  0.3392857  0.3035714 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         5.010000    3.440000     1.480000    0.255000
## versicolor     5.976316    2.776316     4.297368    1.336842
## virginica      6.605882    2.991176     5.611765    2.035294
## 
## Coefficients of linear discriminants:
##                     LD1        LD2
## Sepal.Length  0.8263761  0.2167902
## Sepal.Width   1.2790649 -2.4275193
## Petal.Length -2.2743967  0.9044990
## Petal.Width  -2.2224018 -3.1236791
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9884 0.0116
#
#attributes(LinearDA)
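
The discriminant scores that predict() reports later are this linear combination applied to the centered predictors. As an illustration (not needed for the analysis), they can be reproduced from the model object, along with the proportion of trace:

#Reproduce the discriminant scores by hand: center the predictors at the
#prior-weighted grand mean, then apply the coefficients
GrandMean <- colSums(LinearDA$prior * LinearDA$means)
Scores <- scale(as.matrix(TrainSet[, -5]), center = GrandMean, scale = FALSE) %*% LinearDA$scaling
#
#The proportion of trace is the share of between-group variance
#captured by each discriminant function
LinearDA$svd^2 / sum(LinearDA$svd^2)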

Performance of the fitted model - training dataset

#Explore the performance of the discriminant model, first with the training set
#and then with the test set
#

PerformanceTest <- predict(LinearDA, TrainSet)

#PerformanceTest
#
#attributes(PerformanceTest)
#
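
predict() returns a list with three components: class (the predicted labels), posterior (the per-class posterior probabilities), and x (the discriminant scores). For example:

#Inspect the three components of the prediction object
head(PerformanceTest$class)
head(PerformanceTest$posterior)
head(PerformanceTest$x)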

# The separation achieved by the first discriminant function
# (ldahist() draws stacked histograms directly and returns NULL,
# so its result is not saved)

ldahist(data = PerformanceTest$x[, 1], g = TrainSet$Species)
#
# The separation achieved by the second discriminant function

ldahist(data = PerformanceTest$x[, 2], g = TrainSet$Species)
#

#Use partition plots to show the separation achieved by each pair of the four
#predictor variables (partimat() is in the klaR package; like ldahist(), it
#draws the plots directly rather than returning a plot object)

library(klaR)
partimat(Species ~ ., data = TrainSet, method = "lda")
#

#Model accuracy on the training set using a confusion matrix

pred <- PerformanceTest$class
tab <- table(predictions = pred, Actual = TrainSet$Species)
tab
##             Actual
## predictions  setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         36         1
##   virginica       0          2        33
#
modelAccuracyTrain <- sum(diag(tab))/sum(tab)
modelAccuracyTrain
## [1] 0.9732143
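
Overall accuracy can hide class-specific errors, so per-class recall and precision can be read off the same table:

#Per-class recall: correct predictions divided by the actual count in each class
diag(tab) / colSums(tab)
#Per-class precision: correct predictions divided by the predicted count
diag(tab) / rowSums(tab)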

Performance of the fitted model - testing dataset

#Model performance and accuracy using the test data - unseen dataset

TestPred <- predict(LinearDA, TestSet)$class
tabtest <- table(Predicted = TestPred, Actual = TestSet$Species)
tabtest
##             Actual
## Predicted    setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         12         1
##   virginica       0          0        15
#Model accuracy on unseen data - using TestSet dataset

modelAccuracyTest <- sum(diag(tabtest))/sum(tabtest)
modelAccuracyTest
## [1] 0.9736842
#
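
As a further check on how well the model generalizes, lda() also supports leave-one-out cross-validation through its CV argument. A brief sketch on the full dataset:

#Leave-one-out cross-validation
cvModel <- lda(Species ~ ., data = iris, CV = TRUE)
cvTab <- table(Predicted = cvModel$class, Actual = iris$Species)
sum(diag(cvTab)) / sum(cvTab)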

The separation achieved by the discriminant functions

#Plot the training observations on the first two discriminant axes
#(the ggord package is not on CRAN; it can be installed from GitHub, e.g. fawda123/ggord)

library(ggord)
ggord(LinearDA, TrainSet$Species, ylim = c(-10, 10), xlim = c(-15, 15))
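
If ggord is not installed, a similar plot can be drawn with base graphics from the scores already computed by predict():

#Base-R alternative: scatterplot of the first two discriminant scores
plot(PerformanceTest$x[, 1], PerformanceTest$x[, 2],
     col = c("red", "blue", "green")[TrainSet$Species],
     pch = 18, xlab = "LD1", ylab = "LD2")
legend("topright", legend = levels(TrainSet$Species),
       col = c("red", "blue", "green"), pch = 18)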