Brief notes on the theory of Discriminant Analysis

Linear discriminant analysis (LDA) and the related Fisher’s linear discriminant are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

Consider a set of observations \(x\) (also called features, attributes, variables or measurements) for each sample of an object or event with known class \(y\in\{1,2\}\). This set of samples is called the training set. The classification problem is then to find a good predictor for the class \(y\) of any sample from the same distribution (not necessarily from the training set), given only an observation \(x\).

LDA approaches the problem by assuming that the conditional probability density functions \(p(x|y=1)\) and \(p(x|y=2)\) are both normally distributed with mean and covariance parameters \((\mu_1, \Sigma_1)\) and \((\mu_2, \Sigma_2)\), respectively. Under this assumption, the Bayes optimal solution is to predict points as being from the second class if the log of the likelihood ratio exceeds some threshold \(T\), so that:

\[( x- \mu_1)^T \Sigma_1^{-1} ( x- \mu_1) + \ln|\Sigma_1| - ( x- \mu_2)^T \Sigma_2^{-1} ( x- \mu_2) - \ln|\Sigma_2| > T\] Without any further assumptions, the resulting classifier is referred to as QDA (quadratic discriminant analysis). When \(\Sigma_1=\Sigma_2=\Sigma\) (the homoscedasticity assumption), the discriminant criterion simplifies to:

\[( x- \mu_1)^T \Sigma^{-1} ( x- \mu_1) - ( x- \mu_2)^T \Sigma^{-1} ( x- \mu_2) > T,\]

which is the expression for the LDA (linear discriminant analysis) function. In this case, several terms cancel and the decision criterion becomes linear: \[ w \cdot x > c, \] for some threshold constant \(c\).

Fisher’s linear discriminant

The terms Fisher’s linear discriminant and LDA are often used interchangeably, although Fisher’s original article[1] actually describes a slightly different discriminant, which does not make some of the assumptions of LDA such as normally distributed classes or equal class covariances.

Suppose two classes of observations have means \(\mu_1, \mu_2\) and covariances \(\Sigma_1\) and \(\Sigma_2\). Then the linear combination of features \(w \cdot x\) will have means \(w \cdot \mu_i\) and variances \(w ^T \Sigma_i w\) for \(i=1,2\). Fisher defined the separation between these two distributions to be the ratio of the variance between the classes to the variance within the classes:

\[S=\frac{\sigma_{\text{between}}^2}{\sigma_{\text{within}}^2}= \frac{( w \cdot \mu_2 - w \cdot \mu_1)^2}{ w^T \Sigma_2 w + w^T \Sigma_1 w} = \frac{( w \cdot ( \mu_2 - \mu_1))^2}{ w^T (\Sigma_1+\Sigma_2) w}\] This measure is, in some sense, a measure of the signal-to-noise ratio for the class labeling. It can be shown that the maximum separation occurs when

\[w \propto (\Sigma_1+\Sigma_2)^{-1}( \mu_2 - \mu_1)\]

When the assumptions of LDA are satisfied, the above equation is equivalent to LDA.

Note that the vector \(w\) is the normal to the discriminant hyperplane. For example, in a two-dimensional problem, the line that best divides the two groups is perpendicular to \(w\).

Generally, the data points to be discriminated are projected onto \(w\); the threshold that best separates the data is then chosen from an analysis of the resulting one-dimensional distribution. There is no general rule for the threshold. However, if the projections of points from both classes exhibit approximately the same distributions, a good choice is the point halfway between the projections of the two means, \(w \cdot \mu_1\) and \(w \cdot \mu_2\). In this case the constant \(c\) in the threshold condition \(w\cdot x>c\) can be found explicitly:

\[c = w \cdot \frac{1}{2} ( \mu_1 + \mu_2) = \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1,\] where the second equality uses \(w = \Sigma^{-1}( \mu_2 - \mu_1)\), i.e. the equal-covariance (LDA) case.
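
In practice \(w\) and \(c\) are computed from sample estimates of the means and covariances. The following is a minimal sketch on simulated two-class data (not one of the study cases below; it assumes the MASS package is installed, used here only to draw multivariate normal samples):

# Sketch: estimate the Fisher/LDA direction w and the threshold c from data
set.seed(1)
n  <- 100
x1 <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = diag(2))   # class 1 sample
x2 <- MASS::mvrnorm(n, mu = c(2, 1), Sigma = diag(2))   # class 2 sample
mu1 <- colMeans(x1); mu2 <- colMeans(x2)
S   <- (cov(x1) + cov(x2)) / 2          # common covariance estimate (proportional to Sigma1 + Sigma2)
w   <- solve(S, mu2 - mu1)              # w = S^{-1} (mu2 - mu1)
c0  <- sum(w * (mu1 + mu2)) / 2         # threshold c = w . (mu1 + mu2) / 2
pred <- ifelse(rbind(x1, x2) %*% w > c0, 2, 1)   # classify: w . x > c  ->  class 2
table(predicted = pred, true = rep(c(1, 2), each = n))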

Possible applications:
  • Bankruptcy prediction: In bankruptcy prediction based on accounting ratios and other financial variables, linear discriminant analysis was the first statistical method applied to systematically explain which firms entered bankruptcy vs. survived.

  • Marketing: In marketing, discriminant analysis was once often used to determine the factors which distinguish different types of customers and/or products on the basis of surveys or other forms of collected data.
  • Biomedical studies: The main application of discriminant analysis in medicine is assessing the severity of a patient's condition and the prognosis of the disease outcome. For example, during a retrospective analysis, patients are divided into groups according to the severity of the disease (mild, moderate or severe). The results of clinical and laboratory analyses are then studied to reveal variables that differ statistically between the groups. Using these variables, discriminant functions are built that help to objectively classify the disease in a future patient as mild, moderate or severe.

Study Case I: Wine classification

In this first study case we use the wine data set, which contains 13 chemical concentrations describing wine samples from three cultivars.

library(car)                 # provides scatterplotMatrix()
# install.packages('rattle') # the wine data set ships with the rattle package
data(wine, package='rattle')
attach(wine)
head(wine)
##   Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
## 1    1   14.23  1.71 2.43       15.6       127    2.80       3.06
## 2    1   13.20  1.78 2.14       11.2       100    2.65       2.76
## 3    1   13.16  2.36 2.67       18.6       101    2.80       3.24
## 4    1   14.37  1.95 2.50       16.8       113    3.85       3.49
## 5    1   13.24  2.59 2.87       21.0       118    2.80       2.69
## 6    1   14.20  1.76 2.45       15.2       112    3.27       3.39
##   Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
## 1          0.28            2.29  5.64 1.04     3.92    1065
## 2          0.26            1.28  4.38 1.05     3.40    1050
## 3          0.30            2.81  5.68 1.03     3.17    1185
## 4          0.24            2.18  7.80 0.86     3.45    1480
## 5          0.39            1.82  4.32 1.04     2.93     735
## 6          0.34            1.97  6.75 1.05     2.85    1450
scatterplotMatrix(wine[2:6])

The purpose of linear discriminant analysis (LDA) in this example is to find the linear combinations of the original variables (the 13 chemical concentrations here) that give the best possible separation between the groups (the wine cultivars here) in our data set. Linear discriminant analysis is also known as “canonical discriminant analysis”, or simply “discriminant analysis”.

We want to separate the wines by cultivar. The wines come from three different cultivars, so the number of groups is \(G = 3\), and the number of variables is 13 (the 13 chemical concentrations; \(p = 13\)). The maximum number of useful discriminant functions that can separate the wines by cultivar is the minimum of \(G-1\) and \(p\), which in this case is the minimum of 2 and 13, i.e. 2. Thus, we can find at most 2 useful discriminant functions to separate the wines by cultivar, using the 13 chemical concentration variables.

You can carry out a linear discriminant analysis using the “lda()” function from the R MASS package. To use this function, we first need to load the MASS package (installing it if it is not already available).

# install.packages('MASS')   # usually not needed: MASS ships with standard R installations
library(MASS)
wine.lda <- lda(Type ~ ., data=wine)   # LDA of cultivar (Type) on all 13 chemical variables

To get the values of the loadings of the discriminant functions for the wine data, we can type:

wine.lda
## Call:
## lda(Type ~ ., data = wine)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.3314607 0.3988764 0.2696629 
## 
## Group means:
##    Alcohol    Malic      Ash Alcalinity Magnesium  Phenols Flavanoids
## 1 13.74475 2.010678 2.455593   17.03729  106.3390 2.840169  2.9823729
## 2 12.27873 1.932676 2.244789   20.23803   94.5493 2.258873  2.0808451
## 3 13.15375 3.333750 2.437083   21.41667   99.3125 1.678750  0.7814583
##   Nonflavanoids Proanthocyanins    Color       Hue Dilution   Proline
## 1      0.290000        1.899322 5.528305 1.0620339 3.157797 1115.7119
## 2      0.363662        1.630282 3.086620 1.0562817 2.785352  519.5070
## 3      0.447500        1.153542 7.396250 0.6827083 1.683542  629.8958
## 
## Coefficients of linear discriminants:
##                          LD1           LD2
## Alcohol         -0.403399781  0.8717930699
## Malic            0.165254596  0.3053797325
## Ash             -0.369075256  2.3458497486
## Alcalinity       0.154797889 -0.1463807654
## Magnesium       -0.002163496 -0.0004627565
## Phenols          0.618052068 -0.0322128171
## Flavanoids      -1.661191235 -0.4919980543
## Nonflavanoids   -1.495818440 -1.6309537953
## Proanthocyanins  0.134092628 -0.3070875776
## Color            0.355055710  0.2532306865
## Hue             -0.818036073 -1.5156344987
## Dilution        -1.157559376  0.0511839665
## Proline         -0.002691206  0.0028529846
## 
## Proportion of trace:
##    LD1    LD2 
## 0.6875 0.3125

This means that the first discriminant function is the linear combination of the variables \(-0.403*Alcohol + 0.165*Malic \dots - 0.003*Proline\). For convenience, the values of each discriminant function (e.g. the first discriminant function) are scaled so that their overall mean is zero and their pooled within-group variance is one.
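
As a quick illustration of this linear combination (a sketch using the coefficients printed above; predict() additionally centres the variables, so its score for the same sample differs from this value only by a constant shift), we can compute the first discriminant score of the first wine sample by hand:

# First discriminant function applied "by hand" to the first wine sample
sum(wine.lda$scaling[, "LD1"] * unlist(wine[1, -1]))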

The “proportion of trace” that is printed when you type “wine.lda” (the variable returned by the lda() function) is the percentage of the separation achieved by each discriminant function. For the wine data these are 68.75% and 31.25%, so the first discriminant function accounts for most of the separation between cultivars.
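
These proportions can also be recomputed from the fitted object (a quick check; the svd component of an lda fit stores, for each discriminant function, the ratio of the between- to within-group standard deviations):

ncol(wine.lda$scaling)      # number of discriminant functions: min(G - 1, p) = 2
prop.table(wine.lda$svd^2)  # proportion of trace; should reproduce 0.6875 and 0.3125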

A Stacked Histogram of the LDA Values

A nice way of displaying the results of a linear discriminant analysis (LDA) is to make a stacked histogram of the values of the discriminant function for the samples from different groups (different wine cultivars in our example).

We can do this using the “ldahist()” function in R. For example, to make a stacked histogram of the first discriminant function’s values for wine samples of the three different wine cultivars, we type:

wine.lda.values <- predict(wine.lda)
ldahist(data = wine.lda.values$x[,1], g=Type)

The first discriminant function does not separate all of the cultivars perfectly, so we also investigate whether the second discriminant function separates the remaining cultivars, by making a stacked histogram of its values:

ldahist(data = wine.lda.values$x[,2], g=Type)

Scatterplots of the Discriminant Functions

We can obtain a scatterplot of the best two discriminant functions, with the data points labelled by cultivar, by typing:

plot(wine.lda.values$x[,1],wine.lda.values$x[,2]) # make a scatterplot
text(wine.lda.values$x[,1],wine.lda.values$x[,2],Type,cex=0.7,pos=4,col="red") # add labels

From the scatterplot of the first two discriminant functions, we can see that the wines from the three cultivars are well separated. The first discriminant function (x-axis) separates cultivars 1 and 3 very well, but does not perfectly separate cultivars 1 and 2, or cultivars 2 and 3.

The second discriminant function (y-axis) achieves a fairly good separation of cultivars 1 and 2, and of cultivars 2 and 3, although it is not perfect.

To achieve a very good separation of the three cultivars, it would be best to use both the first and second discriminant functions together, since the first discriminant function can separate cultivars 1 and 3 very well, and the second discriminant function can separate cultivars 1 and 2, and cultivars 2 and 3, reasonably well.

Study Case II: MBA admission data

The dataset provides admission data for applicants to graduate schools in business. The objective is to use the GPA and GMAT scores to predict the likelihood of admission (admit, border, or notadmit).

url <- 'http://www.biz.uiowa.edu/faculty/jledolter/DataMining/admission.csv'

admit <- read.csv(url)

head(admit)
##    GPA GMAT    De
## 1 2.96  596 admit
## 2 3.14  473 admit
## 3 3.22  482 admit
## 4 3.29  527 admit
## 5 3.69  505 admit
## 6 3.46  693 admit

We can make a scatterplot with the data at hand:

adm=data.frame(admit)

plot(adm$GPA,adm$GMAT,col=adm$De)

Let's start by doing the LDA; notice that in this case we have \(3\) classes:

## linear discriminant analysis
m1=lda(De~.,adm)
m1
## Call:
## lda(De ~ ., data = adm)
## 
## Prior probabilities of groups:
##     admit    border  notadmit 
## 0.3647059 0.3058824 0.3294118 
## 
## Group means:
##               GPA     GMAT
## admit    3.403871 561.2258
## border   2.992692 446.2308
## notadmit 2.482500 447.0714
## 
## Coefficients of linear discriminants:
##              LD1         LD2
## GPA  5.008766354  1.87668220
## GMAT 0.008568593 -0.01445106
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9673 0.0327
predict(m1,newdata=data.frame(GPA=3.21,GMAT=497))
## $class
## [1] admit
## Levels: admit border notadmit
## 
## $posterior
##       admit    border     notadmit
## 1 0.5180421 0.4816015 0.0003563717
## 
## $x
##        LD1      LD2
## 1 1.252409 0.318194

The quadratic discriminant analysis:

m2=qda(De~.,adm)
m2
## Call:
## qda(De ~ ., data = adm)
## 
## Prior probabilities of groups:
##     admit    border  notadmit 
## 0.3647059 0.3058824 0.3294118 
## 
## Group means:
##               GPA     GMAT
## admit    3.403871 561.2258
## border   2.992692 446.2308
## notadmit 2.482500 447.0714
predict(m2,newdata=data.frame(GPA=3.21,GMAT=497))
## $class
## [1] admit
## Levels: admit border notadmit
## 
## $posterior
##       admit    border     notadmit
## 1 0.9226763 0.0768693 0.0004544468

Which model is best? In order to answer this question, we evaluate the linear discriminant analysis by randomly selecting 60 of the 85 students, estimating the parameters on the training data, and classifying the remaining 25 students of the holdout sample. We repeat this 100 times, and then do the same for the quadratic discriminant analysis.

n=85
nt=60
neval=n-nt
rep=100

### LDA
set.seed(123456789)
errlin=numeric(rep)   # holdout error rate of each replication
for (k in 1:rep) {
train=sample(1:n,nt)
## linear discriminant analysis
m1=lda(De~.,adm[train,])
predict(m1,adm[-train,])$class
tablin=table(adm$De[-train],predict(m1,adm[-train,])$class)
errlin[k]=(neval-sum(diag(tablin)))/neval
}
merrlin=mean(errlin)
merrlin
## [1] 0.102
### QDA
set.seed(123456789)
errqda=numeric(rep)   # holdout error rate of each replication
for (k in 1:rep) {
train=sample(1:n,nt)
## quadratic discriminant analysis
m1=qda(De~.,adm[train,])
predict(m1,adm[-train,])$class
tablin=table(adm$De[-train],predict(m1,adm[-train,])$class)
errqda[k]=(neval-sum(diag(tablin)))/neval
}
merrqda=mean(errqda)
merrqda

LDA achieves a misclassification rate of about 10.2% on the holdout samples; merrqda gives the corresponding holdout rate for QDA. R also gives us some visualization tools, for example the klaR library:

# Exploratory Graph for LDA or QDA
#install.packages('klaR')
library(klaR)
partimat(De~.,data=adm,method="lda") 


Study Case III: Credit Scoring on the German Bank Dataset

The German credit data set was obtained from the UCI Machine Learning Repository. The data set, which contains attributes and outcomes on 1000 loan applications, was provided in 1994 by Professor Dr. Hans Hofmann of the Institut fuer Statistik und Oekonometrie at the University of Hamburg. It has served as an important test data set for several credit-scoring algorithms. A description of the variables is given in the data set's documentation. We start by loading the data:

## read data 
credit <- read.csv("http://www.biz.uiowa.edu/faculty/jledolter/DataMining/germancredit.csv")
head(credit,2) # See details about codification in the attached documentation.
##   Default checkingstatus1 duration history purpose amount savings employ
## 1       0             A11        6     A34     A43   1169     A65    A75
## 2       1             A12       48     A32     A43   5951     A61    A73
##   installment status others residence property age otherplans housing
## 1           4    A93   A101         4     A121  67       A143    A152
## 2           2    A92   A101         2     A121  22       A143    A152
##   cards  job liable tele foreign
## 1     2 A173      1 A192    A201
## 2     1 A173      1 A191    A201

As can be seen, only the variables duration, amount, installment and age are numeric. For the remaining (indicator) variables the assumption of a normal distribution would be tenuous at best; therefore they are not considered here.

cred1=credit[, c("Default","duration","amount","installment","age")]
head(cred1)
##   Default duration amount installment age
## 1       0        6   1169           4  67
## 2       1       48   5951           2  22
## 3       0       12   2096           2  49
## 4       0       42   7882           2  45
## 5       1       24   4870           3  53
## 6       0       36   9055           2  35
summary(cred1)
##     Default       duration        amount       installment   
##  Min.   :0.0   Min.   : 4.0   Min.   :  250   Min.   :1.000  
##  1st Qu.:0.0   1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000  
##  Median :0.0   Median :18.0   Median : 2320   Median :3.000  
##  Mean   :0.3   Mean   :20.9   Mean   : 3271   Mean   :2.973  
##  3rd Qu.:1.0   3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000  
##  Max.   :1.0   Max.   :72.0   Max.   :18424   Max.   :4.000  
##       age       
##  Min.   :19.00  
##  1st Qu.:27.00  
##  Median :33.00  
##  Mean   :35.55  
##  3rd Qu.:42.00  
##  Max.   :75.00

Are the selected variables in cred1 normally distributed? Let's check this assumption with simple histograms:

hist(cred1$duration)

hist(cred1$amount)

hist(cred1$installment)

hist(cred1$age)

# Transform selected data into a data.frame to make LDA
cred1=data.frame(cred1)

The data do not fit the normality assumption, and this will be a problem for LDA and QDA; nevertheless, let us see what happens if we try to predict the defaults using discriminant functions.

library(MASS)
attach(cred1)
## LDA: class proportions of the training set used as prior probabilities
zlin=lda(Default~.,cred1)

# Confusion Matrix:
table(predict(zlin)$class, Default)
##    Default
##       0   1
##   0 669 256
##   1  31  44
# LDA prediction for a new observation:
predict(zlin,newdata=data.frame(duration=6,amount=1100,installment=4,age=67))
## $class
## [1] 0
## Levels: 0 1
## 
## $posterior
##           0         1
## 1 0.8704023 0.1295977
## 
## $x
##         LD1
## 1 -1.791853
predict(zlin,newdata=data.frame(duration=6,amount=1100, installment=4,age=67))$class 
## [1] 0
## Levels: 0 1
## QDA: class proportions of the training set used as prior probabilities
zqua=qda(Default~.,cred1)

# Confusion Matrix:
table(predict(zqua)$class, Default)
##    Default
##       0   1
##   0 628 221
##   1  72  79
# QDA prediction for a new observation:
predict(zqua,newdata=data.frame(duration=6,amount=1100,installment=4,age=67))
## $class
## [1] 0
## Levels: 0 1
## 
## $posterior
##           0          1
## 1 0.9375556 0.06244441
predict(zqua,newdata=data.frame(duration=6,amount=1100, installment=4,age=67))$class
## [1] 0
## Levels: 0 1
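
Before drawing conclusions, we can read the apparent (in-sample) misclassification rates directly off the two confusion matrices above (a small check computed on the training data itself):

# Apparent (in-sample) misclassification rates from the confusion matrices
errlda <- 1 - sum(diag(table(predict(zlin)$class, cred1$Default))) / nrow(cred1)
errqda <- 1 - sum(diag(table(predict(zqua)$class, cred1$Default))) / nrow(cred1)
c(lda = errlda, qda = errqda)   # about 0.287 and 0.293: (256+31) and (221+72) errors out of 1000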

The model performance is not good enough! In later sessions we will compare LDA/QDA with more sophisticated methods such as classification trees, SVMs and neural networks.