Discriminant Analysis Tutorial

Wine Classification

In this first case study, the wine data set, we have 13 chemical concentrations describing wine samples from three cultivars. A cultivar is a plant variety that has been developed from a natural species and maintained under cultivation.

Data

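# Load the wine data set that ships with the rattle.data package
data(wine, package = 'rattle.data')
# Attach the data frame so that columns such as Type can be referenced by name below
attach(wine)
head(wine)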

##   Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
## 1    1   14.23  1.71 2.43       15.6       127    2.80       3.06
## 2    1   13.20  1.78 2.14       11.2       100    2.65       2.76
## 3    1   13.16  2.36 2.67       18.6       101    2.80       3.24
## 4    1   14.37  1.95 2.50       16.8       113    3.85       3.49
## 5    1   13.24  2.59 2.87       21.0       118    2.80       2.69
## 6    1   14.20  1.76 2.45       15.2       112    3.27       3.39
##   Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
## 1          0.28            2.29  5.64 1.04     3.92    1065
## 2          0.26            1.28  4.38 1.05     3.40    1050
## 3          0.30            2.81  5.68 1.03     3.17    1185
## 4          0.24            2.18  7.80 0.86     3.45    1480
## 5          0.39            1.82  4.32 1.04     2.93     735
## 6          0.34            1.97  6.75 1.05     2.85    1450

str(wine)

## 'data.frame':    178 obs. of  14 variables:
##  $ Type           : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Alcalinity     : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium      : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Phenols        : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids     : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoids  : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color          : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ Dilution       : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline        : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

The structure of the wine data shows that Type is a categorical variable with three levels, one per cultivar. Because the response is a factor with more than two classes and the predictors are numeric, linear discriminant analysis is an appropriate method in this case.
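As a quick check of the class structure described above, the following sketch counts the levels of Type and the number of samples per cultivar. It assumes the rattle.data package is installed; the object names n_classes and class_counts are illustrative only.

```r
# Load the data (assumes the rattle.data package is installed)
data(wine, package = "rattle.data")

# Number of classes in the response variable
n_classes <- nlevels(wine$Type)

# Number of wine samples per cultivar
class_counts <- table(wine$Type)

print(n_classes)
print(class_counts)
```

The class proportions obtained this way are what lda() uses as prior probabilities when no priors are supplied.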

library(car)  # provides scatterplotMatrix()
scatterplotMatrix(wine[2:6])

The purpose of linear discriminant analysis is to find the linear combinations of the variables that give the best possible separation between the groups (wine cultivars) in our data set. The . in the formula argument means that all the remaining variables in data are used as covariates.

library(MASS)  # provides lda()
wine.lda <- lda(Type ~ ., data = wine)
wine.lda

## Call:
## lda(Type ~ ., data = wine)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.3314607 0.3988764 0.2696629 
## 
## Group means:
##    Alcohol    Malic      Ash Alcalinity Magnesium  Phenols Flavanoids
## 1 13.74475 2.010678 2.455593   17.03729  106.3390 2.840169  2.9823729
## 2 12.27873 1.932676 2.244789   20.23803   94.5493 2.258873  2.0808451
## 3 13.15375 3.333750 2.437083   21.41667   99.3125 1.678750  0.7814583
##   Nonflavanoids Proanthocyanins    Color       Hue Dilution   Proline
## 1      0.290000        1.899322 5.528305 1.0620339 3.157797 1115.7119
## 2      0.363662        1.630282 3.086620 1.0562817 2.785352  519.5070
## 3      0.447500        1.153542 7.396250 0.6827083 1.683542  629.8958
## 
## Coefficients of linear discriminants:
##                          LD1           LD2
## Alcohol         -0.403399781  0.8717930699
## Malic            0.165254596  0.3053797325
## Ash             -0.369075256  2.3458497486
## Alcalinity       0.154797889 -0.1463807654
## Magnesium       -0.002163496 -0.0004627565
## Phenols          0.618052068 -0.0322128171
## Flavanoids      -1.661191235 -0.4919980543
## Nonflavanoids   -1.495818440 -1.6309537953
## Proanthocyanins  0.134092628 -0.3070875776
## Color            0.355055710  0.2532306865
## Hue             -0.818036073 -1.5156344987
## Dilution        -1.157559376  0.0511839665
## Proline         -0.002691206  0.0028529846
## 
## Proportion of trace:
##    LD1    LD2 
## 0.6875 0.3125

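The first linear discriminant function (LD1) from the output above, with coefficients rounded to three decimals, is

\[\begin{aligned}
\text{LD1} = {} & -0.403 \times \text{Alcohol} + 0.165 \times \text{Malic} - 0.369 \times \text{Ash} + 0.155 \times \text{Alcalinity} - 0.002 \times \text{Magnesium} \\
& + 0.618 \times \text{Phenols} - 1.661 \times \text{Flavanoids} - 1.496 \times \text{Nonflavanoids} + 0.134 \times \text{Proanthocyanins} \\
& + 0.355 \times \text{Color} - 0.818 \times \text{Hue} - 1.158 \times \text{Dilution} - 0.003 \times \text{Proline}
\end{aligned}\]

The "proportion of trace" that is printed when you type wine.lda (the variable returned by the lda() function) is the percentage of the between-group separation achieved by each discriminant function. For the wine data, the first discriminant function accounts for \(68.75\%\) and the second for \(31.25\%\).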
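The proportion of trace can also be recomputed from the fitted object. The sketch below assumes the MASS and rattle.data packages are installed and uses the svd component of the lda fit, which holds the singular values for each discriminant; prop_trace is an illustrative name.

```r
library(MASS)
data(wine, package = "rattle.data")

wine.lda <- lda(Type ~ ., data = wine)

# svd holds one singular value per discriminant function; the squared
# values, normalised to sum to one, give the proportion of trace
prop_trace <- wine.lda$svd^2 / sum(wine.lda$svd^2)
names(prop_trace) <- colnames(wine.lda$scaling)

round(prop_trace, 4)
```

The rounded values should agree with the "Proportion of trace" lines printed by wine.lda.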

Stacked Histogram of the LDA Values

A histogram is a convenient way to display the results of the linear discriminant analysis. We can draw one with the ldahist() function in R. First, compute the discriminant scores with the predict function and store them in an object. When called without new data, predict returns values for the observations used to fit the model, so the result has one row per wine sample.

wine.lda.values <- predict(wine.lda)
ldahist(wine.lda.values$x[,1], g = Type)

To see how well the second discriminant function separates the cultivars, make a stacked histogram of the second discriminant function's values:

ldahist(wine.lda.values$x[,2], g = Type)
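To make the values plotted in these histograms less opaque, the following sketch inspects what predict returns for an lda fit and reproduces the discriminant scores by hand. It assumes MASS and rattle.data are installed; X, centre and manual_scores are illustrative names, and the centring step mirrors what predict does when the default priors are used.

```r
library(MASS)
data(wine, package = "rattle.data")

wine.lda <- lda(Type ~ ., data = wine)
wine.lda.values <- predict(wine.lda)

# predict() on an lda fit returns a list: predicted class,
# posterior probabilities, and the discriminant scores x
names(wine.lda.values)
dim(wine.lda.values$x)   # one row per wine, one column per discriminant

# Reproduce the scores: centre the predictors on the prior-weighted
# mean of the group means, then apply the discriminant coefficients
X <- as.matrix(wine[, -1])
centre <- colSums(wine.lda$prior * wine.lda$means)
manual_scores <- scale(X, center = centre, scale = FALSE) %*% wine.lda$scaling

all.equal(unname(manual_scores), unname(wine.lda.values$x))
```

Because the predictors are centred first, the scores differ from the raw coefficient-weighted sum of the variables by a constant shift, which does not affect the separation between groups.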

Scatterplot of Discriminant Function

Now we can produce a scatter plot of the two discriminant functions:

plot(wine.lda.values$x[,1], wine.lda.values$x[,2])
text(wine.lda.values$x[,1], wine.lda.values$x[,2], Type, cex = 0.7, pos = 4, col = "red")

For more advanced graphics, we use the ggplot2 package. Because ggplot2 works with data frames, create a new data frame containing the Type variable from the wine data and the discriminant scores from wine.lda.values.

#convert to data frame 
newdata <- data.frame(type = wine[,1], lda = wine.lda.values$x)
library(ggplot2)
ggplot(newdata) + geom_point(aes(lda.LD1, lda.LD2, colour = type), size = 2.5)

Prediction Accuracy

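Prediction accuracy is assessed by comparing the classes predicted by the model with the actual classes in the data. First, load the caret package. Then fit the LDA model with the train function and use the confusionMatrix command to tabulate the predicted classes against the actual ones. Note that the predictions here are made on the same data used to fit the model, so the reported accuracy is likely to be optimistic.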

library(caret)

## Loading required package: lattice

wine.lda.predict <- train(Type ~ ., method = "lda", data = wine)
# confusionMatrix() expects the predicted classes first and the actual (reference) classes second
confusionMatrix(predict(wine.lda.predict, wine), wine$Type)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 59  0  0
##          2  0 71  0
##          3  0  0 48
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9795, 1)
##     No Information Rate : 0.3989     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000
## Prevalence             0.3315   0.3989   0.2697
## Detection Rate         0.3315   0.3989   0.2697
## Detection Prevalence   0.3315   0.3989   0.2697
## Balanced Accuracy      1.0000   1.0000   1.0000
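The confusion matrix above is computed on the same data used to fit the model. As a hedged alternative that is not part of the original analysis, the sketch below uses the CV = TRUE argument of lda() to obtain leave-one-out predictions. It assumes MASS and rattle.data are installed; wine.lda.cv, cv_table and cv_accuracy are illustrative names.

```r
library(MASS)
data(wine, package = "rattle.data")

# With CV = TRUE, lda() returns leave-one-out cross-validated
# predictions (class and posterior) rather than a fitted model object
wine.lda.cv <- lda(Type ~ ., data = wine, CV = TRUE)

# Cross-tabulate actual against cross-validated predicted classes
cv_table <- table(Actual = wine$Type, Predicted = wine.lda.cv$class)

# Share of wines whose held-out prediction matches the true cultivar
cv_accuracy <- mean(wine.lda.cv$class == wine$Type)

print(cv_table)
print(cv_accuracy)
```

Each wine is classified by a model fitted without it, so this estimate is usually a fairer indication of how the classifier would behave on new samples.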

Another visualization tool is the partimat() function from the klaR package, which plots the classification regions for pairs of variables.

library(klaR)
partimat(Type ~ Alcohol + Alcalinity, data = wine, method = "lda")

You can add more variables to the formula to examine how other variables contribute to the wine type classification.
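For instance, a minimal sketch with two additional predictors might look like the following. The choice of Flavanoids and Proline is purely illustrative, and the klaR and rattle.data packages are assumed to be installed.

```r
library(klaR)
data(wine, package = "rattle.data")

# Illustrative set of predictors; any numeric columns of wine could be used
vars <- c("Alcohol", "Alcalinity", "Flavanoids", "Proline")
stopifnot(all(vars %in% names(wine)))

# With more than two predictors, partimat() draws one panel
# for each pair of variables
partimat(Type ~ Alcohol + Alcalinity + Flavanoids + Proline,
         data = wine, method = "lda")
```

Each panel shows the classification regions for one pair of variables, which makes it easier to see which pairs separate the cultivars well.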