Introduction

The purpose of discriminant analysis is to classify objects into one of two or more groups based on a set of features that describe the objects. In general, we assign an object to one of a number of predetermined groups based on observations made on the object.

Discriminant analysis serves two purposes: feature selection (identifying which features best separate the groups) and classification (assigning new objects to a group).

For example, we may want to know whether a soap product is good or bad based on several measurements on the product such as weight, volume, people's preference score, smell, color contrast, etc. The object here is the soap. The class category, or group ("good" or "bad"), is what we are trying to predict (the dependent variable). The measurements on the product are the features that describe the object (the independent variables).

Thus, in discriminant analysis, the dependent variable (Y) is the group and the independent variables (X) are the object features that might describe the group. The dependent variable is always a categorical (nominal scale) variable, while the independent variables can be of any measurement scale (i.e., nominal, ordinal, interval or ratio).

If we can assume that the groups are linearly separable, we can use a linear discriminant analysis (LDA) model. Linearly separable means that the groups can be separated by a linear combination of the features that describe the objects.

The classification criterion is to minimize the total error of classification (TEC), i.e., to make the proportion of objects that the rule misclassifies as small as possible. TEC measures the rule's performance in the 'long run' on a random sample of objects, so it should be thought of as the probability that the rule under consideration will misclassify an object. The classification rule is to assign an object to the group with the highest conditional (posterior) probability. This is called the Bayes rule, and it minimizes the TEC.
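As a toy illustration (the posterior probabilities below are made up, not from any fitted model), the Bayes rule simply assigns each object to the group whose posterior probability is largest:

#Toy sketch of the Bayes rule: pick the group with the highest posterior probability
post <- matrix(c(0.7, 0.3,
                 0.4, 0.6,
                 0.1, 0.9),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("good", "bad")))
colnames(post)[max.col(post)]   #returns "good" "bad" "bad"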

Possible applications

Bankruptcy prediction: In bankruptcy prediction based on accounting ratios and other financial variables, linear discriminant analysis was the first statistical method applied to systematically explain which firms entered bankruptcy and which survived.

Marketing: In marketing, discriminant analysis was often used to determine the factors that distinguish different types of customers and/or products on the basis of surveys or other forms of collected data.

Biomedical studies: The main application of discriminant analysis in medicine is the assessment of a patient's severity state and the prognosis of disease outcome. For example, during a retrospective analysis, patients are divided into groups according to the severity of disease (mild, moderate and severe). The results of clinical and laboratory analyses are then studied to reveal the variables that differ statistically between the groups. Using these variables, discriminant functions are built that help to objectively classify the disease in a future patient as mild, moderate or severe.

Example

Libraries Required

#install.packages("tidyverse, repos = http://cran.us.r-project.org")
#install.packages("modelr, repos = http://cran.us.r-project.org")
#install.packages("broom, repos = http://cran.us.r-project.org")
#install.packages("ISLR, repos = http://cran.us.r-project.org"")
#install.packages("ROCR, repos = http://cran.us.r-project.org"")
#install.packages("MASS, repos = http://cran.us.r-project.org"")
#install.packages("caret, repos = http://cran.us.r-project.org"")

library(tidyverse)
library(modelr)
library(broom)
library(ISLR)
library(ROCR)
library(MASS)
library(caret)

We will predict the probability of default based on balance and student status.

Load data

(default <- as_tibble(ISLR::Default))
## # A tibble: 10,000 x 4
##    default student balance income
##    <fct>   <fct>     <dbl>  <dbl>
##  1 No      No          730  44362
##  2 No      Yes         817  12106
##  3 No      No         1074  31767
##  4 No      No          529  35704
##  5 No      No          786  38463
##  6 No      Yes         920   7492
##  7 No      No          826  24905
##  8 No      Yes         809  17600
##  9 No      No         1161  37469
## 10 No      No            0  29275
## # ... with 9,990 more rows
attach(default)
## The following object is masked _by_ .GlobalEnv:
## 
##     default

Data preparation

set.seed(123)
trainIndex <- createDataPartition(default$default,
                                  p=0.6,
                                  list = FALSE,
                                  times = 1)
train <- default[trainIndex, ]
test  <- default[-trainIndex,]

Logistic Model

model1 <- glm(default~balance + student, family = "binomial", data = train)
tidy(model1)
##          term      estimate    std.error  statistic       p.value
## 1 (Intercept) -10.581764956 0.4699918313 -22.514785 2.973785e-112
## 2     balance   0.005671935 0.0002966342  19.120977  1.689151e-81
## 3  studentYes  -0.782496238 0.1892222899  -4.135328  3.544480e-05

Predict test dataset

test.predicted.m1 <- predict(model1, newdata = test, type = "response")

Validation

#Classification rate
list(model1 = table(test$default, test.predicted.m1 > 0.5) %>% prop.table() %>% round(3))
## $model1
##      
##       FALSE  TRUE
##   No  0.962 0.005
##   Yes 0.022 0.011
#Misclassification rate
m1.pred = ifelse(test.predicted.m1 > 0.5, "Yes", "No")
m1.error = mean(test$default != m1.pred)
m1.error
## [1] 0.02675669
#Confusion matrix
table(test$default, test.predicted.m1 > 0.5)
##      
##       FALSE TRUE
##   No   3847   19
##   Yes    88   45

We see that there are a total of 88 + 45 = 133 customers that defaulted.
Of the total defaults, 88/133 = 66% were not predicted. Alternatively, we could say that only 45/133 = 34% of default occurrences were correctly predicted - this is known as the sensitivity (also called the recall) of our model.

The specificity, on the other hand, is the percentage of non-defaulters that are correctly identified, here 1 - (19/(3847+19)) = 99.5% (the high value here is largely driven by the fact that 97% of the observations in our data are non-defaulters).
The relative importance of sensitivity and specificity depends on context. In this case, a credit card company is likely to be more concerned with sensitivity, since it wants to reduce its risk. Therefore, it may be more interested in tuning the model so that its sensitivity is improved.
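These rates can be computed directly from the confusion matrix above; a quick sketch using the objects already created:

#Sensitivity and specificity for the logistic regression model
cm1 <- table(test$default, test.predicted.m1 > 0.5)
sensitivity <- cm1["Yes", "TRUE"]  / sum(cm1["Yes", ])    #45 / 133  ~ 0.34
specificity <- cm1["No",  "FALSE"] / sum(cm1["No",  ])    #3847 / 3866 ~ 0.995
c(sensitivity = sensitivity, specificity = specificity)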

LDA Model

(lda.m1 <- lda(default~balance + student, data = train))
## Call:
## lda(default ~ balance + student, data = train)
## 
## Prior probabilities of groups:
##         No        Yes 
## 0.96667222 0.03332778 
## 
## Group means:
##       balance studentYes
## No   798.8977  0.2994311
## Yes 1730.5745  0.3700000
## 
## Coefficients of linear discriminants:
##                     LD1
## balance     0.002249644
## studentYes -0.324036391

The LDA output indicates that our prior probabilities are No = 0.967 and Yes = 0.033; in other words, 96.7% of the training observations are customers who did not default and 3.3% represent those that defaulted.

It also provides the group means; these are the averages of each predictor within each class, and are used by LDA as estimates of the class means.
These suggest that customers who default have, on average, a credit card balance of about $1,730 and are more likely to be students than non-defaulters (30% of non-defaulters are students whereas 37% of defaulters are students).
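These group means can be reproduced directly from the training data (a quick check with dplyr, loaded above as part of the tidyverse):

#Group means computed from the training data (should match the LDA output above)
train %>%
  group_by(default) %>%
  summarise(balance = mean(balance), studentYes = mean(student == "Yes"))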

The coefficients of linear discriminants output provides the linear combination of balance and studentYes that is used to form the LDA decision rule.

If 0.0022 × balance - 0.324 × studentYes is large, then the LDA classifier will predict that the customer will default, and if it is small, then the LDA classifier will predict that the customer will not default.

We can use plot() to produce plots of the linear discriminants, obtained by computing 0.0022 × balance - 0.324 × student for each of the training observations. As you can see, when 0.0022 × balance - 0.324 × student < 0, the probability increases that the customer will not default, and
when 0.0022 × balance - 0.324 × student > 0, the probability increases that the customer will default.

plot(lda.m1)
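As a rough check on this interpretation, the linear combination can be formed by hand from the reported coefficients. Note that MASS centers the discriminant scores it returns, so the hand-computed values differ from predict(lda.m1)$x only by a constant shift:

#Hand-computed (uncentered) LD1 scores vs. the centered scores from predict()
ld1.raw <- 0.002249644 * train$balance - 0.324036391 * (train$student == "Yes")
head(ld1.raw)
head(predict(lda.m1)$x[, 1])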

Prediction

test.predicted.lda <- predict(lda.m1, newdata = test)

Validation

lda.cm <- table(test$default, test.predicted.lda$class)

#Classification rate
LDA_model = lda.cm %>% prop.table() %>% round(3)
LDA_model
##      
##          No   Yes
##   No  0.964 0.003
##   Yes 0.025 0.009

First we'll look at the confusion matrix in percentage form. The results show that the LDA model performs very similarly to the logistic regression model:
> 96% are true negatives
> ~1% are true positives
> The Type II error is less than 3%: the model predicts the customer will not default, but they actually do.
> The Type I error is less than 1%: the model predicts the customer will default, but they never do.

#Misclassification rate
lda.pred = test.predicted.lda$class
lda.error = mean(test$default != lda.pred)
lda.error
## [1] 0.02750688

The overall error rate is about 2.8%.

#Confusion matrix
LDA_model = lda.cm
LDA_model
##      
##         No  Yes
##   No  3854   12
##   Yes   98   35

Sensitivity: LDA model: 35/(98+35) = 26.3%
(The logistic regression model's sensitivity was 34%.)
Specificity: LDA model: 1 - (12/(3854+12)) = 99.7%
(The logistic regression model's specificity was 99.5%.)

The credit card company is likely to be more concerned with sensitivity, since it wants to reduce its risk.

If we are concerned with increasing the sensitivity of our model, we can tune it by adjusting the posterior probability threshold. For instance, we might label any customer with a posterior probability of default above 20% as high-risk.

Adjusted predictions

#Create adjusted predictions
lda.pred.adj <- ifelse(test.predicted.lda$posterior[, 2] > .20, "Yes", "No")

#Misclassification rate
lda.error2 = mean(test$default != lda.pred.adj)
lda.error2
## [1] 0.03450863
# create new confusion matrices
(LDA_model = table(test$default, lda.pred.adj))
##      lda.pred.adj
##         No  Yes
##   No  3772   94
##   Yes   44   89

Sensitivity: LDA model: 89/(44+89) = 67%
Specificity: LDA model: 1 - (94/(3772+94)) = 97.6%

However, the overall error rate has increased to about 3.5%. The credit card company may consider this slight increase in the total error rate a small price to pay for more accurate identification of individuals who do indeed default. It is important to keep these kinds of trade-offs in mind; they are common with classification models - tuning a model can improve certain classification rates while worsening others.
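One way to see this trade-off across all possible thresholds at once is an ROC curve. Below is a minimal sketch using the ROCR package loaded earlier, applied to the LDA posterior probabilities for the test set:

#ROC curve for the LDA model: true positive rate vs. false positive rate
lda.rocr <- prediction(test.predicted.lda$posterior[, 2], test$default)
plot(performance(lda.rocr, "tpr", "fpr"), main = "LDA ROC curve (test set)")
performance(lda.rocr, "auc")@y.values[[1]]   #area under the curve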

QDA Model

#Create a QDA model
(qda.m1 <- qda(default ~ balance + student, data = train)) 
## Call:
## qda(default ~ balance + student, data = train)
## 
## Prior probabilities of groups:
##         No        Yes 
## 0.96667222 0.03332778 
## 
## Group means:
##       balance studentYes
## No   798.8977  0.2994311
## Yes 1730.5745  0.3700000
#Predict test data using the QDA model
test.predicted.qda <- predict(qda.m1, newdata = test)

#Validate the model performance
qda.cm <- table(test$default, test.predicted.qda$class)

#Classification rate
(QDA_model = qda.cm %>% prop.table() %>% round(3))
##      
##          No   Yes
##   No  0.963 0.004
##   Yes 0.024 0.009
#Misclassification rate
qda.pred = test.predicted.qda$class
(qda.error = mean(test$default != qda.pred))
## [1] 0.028007
#Confusion matrix
(QDA_model = qda.cm)
##      
##         No  Yes
##   No  3850   16
##   Yes   96   37
#Posterior probability of default above 20% as high-risk
#Create adjusted predictions
qda.pred.adj <- ifelse(test.predicted.qda$posterior[, 2] > .20, "Yes", "No")

#Misclassification rate
qda.error2 = mean(test$default != qda.pred.adj)
qda.error2
## [1] 0.04376094
# create new confusion matrices
(QDA_model = table(test$default, qda.pred.adj))
##      qda.pred.adj
##         No  Yes
##   No  3727  139
##   Yes   36   97

Case Study: Wine Classification

The wine dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample.

#install.packages("car")
#install.packages("rattle")
library(car)  
library(rattle)

Load the wine data

UCI <- "ftp://ftp.ics.uci.edu/pub"
REPOS <- "machine-learning-databases"
wine.url <- sprintf("%s/%s/wine/wine.data", UCI, REPOS)
wine <- read.csv(wine.url, header=F) 
colnames(wine) <- c('Type', 'Alcohol', 'Malic', 'Ash', 
                    'Alcalinity', 'Magnesium', 'Phenols', 
                    'Flavanoids', 'Nonflavanoids',
                    'Proanthocyanins', 'Color', 'Hue', 
                    'Dilution', 'Proline')
wine$Type <- as.factor(wine$Type)
write.table(wine, "wine.csv", sep=",", row.names=FALSE)
save(wine, file="wine.Rdata", compress=TRUE)

Explore the dataset

head(wine)
##   Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
## 1    1   14.23  1.71 2.43       15.6       127    2.80       3.06
## 2    1   13.20  1.78 2.14       11.2       100    2.65       2.76
## 3    1   13.16  2.36 2.67       18.6       101    2.80       3.24
## 4    1   14.37  1.95 2.50       16.8       113    3.85       3.49
## 5    1   13.24  2.59 2.87       21.0       118    2.80       2.69
## 6    1   14.20  1.76 2.45       15.2       112    3.27       3.39
##   Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
## 1          0.28            2.29  5.64 1.04     3.92    1065
## 2          0.26            1.28  4.38 1.05     3.40    1050
## 3          0.30            2.81  5.68 1.03     3.17    1185
## 4          0.24            2.18  7.80 0.86     3.45    1480
## 5          0.39            1.82  4.32 1.04     2.93     735
## 6          0.34            1.97  6.75 1.05     2.85    1450
attach(wine)
scatterplotMatrix(wine[2:6])

The purpose of linear discriminant analysis (LDA) in this example is to find the linear combinations of the original variables (the 13 chemical concentrations) that give the best possible separation between the groups (wine cultivars) in the data set.

We want to separate the wines by cultivar. The wines come from three different cultivars, so the number of groups is G = 3, and the number of variables is 13 (the 13 chemical concentrations; p = 13). The maximum number of useful discriminant functions that can separate the wines by cultivar is the minimum of G - 1 and p, so in this case it is the minimum of 2 and 13, which is 2. Thus, we can find at most 2 useful discriminant functions to separate the wines by cultivar using the 13 chemical concentration variables.
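This count can be verified directly from the data (a quick sketch; wine$Type was converted to a factor when the data were loaded):

#Maximum number of useful discriminant functions: min(G - 1, p)
G <- nlevels(wine$Type)   #3 cultivars
p <- ncol(wine) - 1       #13 chemical measurements
min(G - 1, p)             #2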

Box’s M-Test

Box's M-test checks whether the variance-covariance matrices are equal across the groups.
> H0: The variance-covariance matrices are the same across groups
> H1: The variance-covariance matrices are not the same across groups

#install.packages("biotools")
library(biotools)
## ---
## biotools version 3.1
boxM(wine[,2:14],wine$Type)
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  wine[, 2:14]
## Chi-Sq (approx.) = 684.2, df = 182, p-value < 2.2e-16

The p-value is less than 0.05, so we reject H0; i.e., the variance-covariance matrices are not the same across groups.
LDA is recommended when the covariance matrices are equal; otherwise QDA is preferred. We fit both below for comparison.
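As a quick way to compare the two approaches on this data set, MASS can return leave-one-out cross-validated class predictions via CV = TRUE (a minimal sketch; the full LDA and QDA fits follow below):

#Leave-one-out cross-validated error rates for LDA and QDA on the wine data
lda.cv <- lda(Type ~ ., data = wine, CV = TRUE)
qda.cv <- qda(Type ~ ., data = wine, CV = TRUE)
c(lda = mean(lda.cv$class != wine$Type),
  qda = mean(qda.cv$class != wine$Type))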

LDA Model for the wine dataset

library(MASS)
wine.lda <- lda(Type~., data = wine)
wine.lda
## Call:
## lda(Type ~ ., data = wine)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.3314607 0.3988764 0.2696629 
## 
## Group means:
##    Alcohol    Malic      Ash Alcalinity Magnesium  Phenols Flavanoids
## 1 13.74475 2.010678 2.455593   17.03729  106.3390 2.840169  2.9823729
## 2 12.27873 1.932676 2.244789   20.23803   94.5493 2.258873  2.0808451
## 3 13.15375 3.333750 2.437083   21.41667   99.3125 1.678750  0.7814583
##   Nonflavanoids Proanthocyanins    Color       Hue Dilution   Proline
## 1      0.290000        1.899322 5.528305 1.0620339 3.157797 1115.7119
## 2      0.363662        1.630282 3.086620 1.0562817 2.785352  519.5070
## 3      0.447500        1.153542 7.396250 0.6827083 1.683542  629.8958
## 
## Coefficients of linear discriminants:
##                          LD1           LD2
## Alcohol         -0.403399781  0.8717930699
## Malic            0.165254596  0.3053797325
## Ash             -0.369075256  2.3458497486
## Alcalinity       0.154797889 -0.1463807654
## Magnesium       -0.002163496 -0.0004627565
## Phenols          0.618052068 -0.0322128171
## Flavanoids      -1.661191235 -0.4919980543
## Nonflavanoids   -1.495818440 -1.6309537953
## Proanthocyanins  0.134092628 -0.3070875776
## Color            0.355055710  0.2532306865
## Hue             -0.818036073 -1.5156344987
## Dilution        -1.157559376  0.0511839665
## Proline         -0.002691206  0.0028529846
## 
## Proportion of trace:
##    LD1    LD2 
## 0.6875 0.3125

The first discriminant function is a linear combination of the variables: LD1 = -0.403 × Alcohol + 0.165 × Malic + ... - 0.003 × Proline.
The values of each discriminant function are scaled so that their mean is zero and their variance is one.

The "proportion of trace" is the percentage of separation achieved by each discriminant function. For the wine data, the first discriminant function achieves 68.75% of the separation and the second 31.25%.
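The same proportions can be recovered from the singular values stored in the fitted object (wine.lda$svd), since the proportion of trace is just the squared singular values normalized to sum to one:

#Recover the proportion of trace from the fitted lda object
round(wine.lda$svd^2 / sum(wine.lda$svd^2), 4)   #should match 0.6875 0.3125 above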

Histogram of the LDA values

wine.lda.values <- predict(wine.lda)

#Stacked histogram of the first discriminant function
ldahist(data = wine.lda.values$x[,1], g=Type)

#Stacked histogram of the second discriminant function
ldahist(data = wine.lda.values$x[,2], g=Type)

Scatterplots of the Discriminant Functions

#make a scatterplot
plot(wine.lda.values$x[,1], wine.lda.values$x[,2])

#add labels
text(wine.lda.values$x[,1], wine.lda.values$x[,2], Type, cex = 0.7, pos = 4, col = "red")

From the scatterplot of the first two discriminant functions, we can see that the wines from the three cultivars are well separated. The first discriminant function (x-axis) separates cultivars 1 and 3 very well, but does not perfectly separate cultivars 1 and 2, or cultivars 2 and 3.

The second discriminant function (y-axis) achieves a fairly good separation of cultivars 1 and 2, and cultivars 2 and 3, although it is not totally perfect.

To achieve a very good separation of the three cultivars, it would be best to use both the first and second discriminant functions together, since the first discriminant function can separate cultivars 1 and 3 very well, and the second discriminant function can separate cultivars 1 and 2, and cultivars 2 and 3, reasonably well.

Scatterplot using ggplot2

#convert to data frame
newdata <- data.frame(type = wine[,1], lda=wine.lda.values$x)
library(ggplot2)
ggplot(newdata) + geom_point(aes(lda.LD1,lda.LD2, colour = type), size=2.5)

Prediction Accuracy

#install.packages("caret")
#install.packages("e1071")
library(caret)
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.4
wine.lda.predict <- train(Type~., method = "lda", data = wine)
confusionMatrix(wine$Type, predict(wine.lda.predict, wine))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 59  0  0
##          2  0 71  0
##          3  0  0 48
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9795, 1)
##     No Information Rate : 0.3989     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000
## Prevalence             0.3315   0.3989   0.2697
## Detection Rate         0.3315   0.3989   0.2697
## Detection Prevalence   0.3315   0.3989   0.2697
## Balanced Accuracy      1.0000   1.0000   1.0000

Other visualizations using klaR

#install.packages("klaR")
library(klaR)
## Warning: package 'klaR' was built under R version 3.4.4
partimat(Type~Alcohol+Alcalinity, data = wine, method="lda")

#Add more variable to examine other variables to the wine type classification
partimat(Type ~ Alcohol + Alcalinity + Ash + Magnesium, data = wine, method = "lda")

QDA Model for the wine dataset

wine.qda <- qda(Type~., data = wine)
wine.qda
## Call:
## qda(Type ~ ., data = wine)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.3314607 0.3988764 0.2696629 
## 
## Group means:
##    Alcohol    Malic      Ash Alcalinity Magnesium  Phenols Flavanoids
## 1 13.74475 2.010678 2.455593   17.03729  106.3390 2.840169  2.9823729
## 2 12.27873 1.932676 2.244789   20.23803   94.5493 2.258873  2.0808451
## 3 13.15375 3.333750 2.437083   21.41667   99.3125 1.678750  0.7814583
##   Nonflavanoids Proanthocyanins    Color       Hue Dilution   Proline
## 1      0.290000        1.899322 5.528305 1.0620339 3.157797 1115.7119
## 2      0.363662        1.630282 3.086620 1.0562817 2.785352  519.5070
## 3      0.447500        1.153542 7.396250 0.6827083 1.683542  629.8958

QDA: Prediction Accuracy

wine.qda.predict <- train(Type~., method = "qda", data = wine)
confusionMatrix(wine$Type, predict(wine.qda.predict, wine))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 59  0  0
##          2  1 70  0
##          3  0  0 48
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9944          
##                  95% CI : (0.9691, 0.9999)
##     No Information Rate : 0.3933          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9915          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.9833   1.0000   1.0000
## Specificity            1.0000   0.9907   1.0000
## Pos Pred Value         1.0000   0.9859   1.0000
## Neg Pred Value         0.9916   1.0000   1.0000
## Prevalence             0.3371   0.3933   0.2697
## Detection Rate         0.3315   0.3933   0.2697
## Detection Prevalence   0.3315   0.3989   0.2697
## Balanced Accuracy      0.9917   0.9954   1.0000