Discriminant analysis is used to predict the probability of belonging to a given class (or category) based on one or multiple predictor variables. It works with continuous and/or categorical predictor variables. Discriminant analysis can be used for both binary classification task or multiclass classification task. Linear discriminant analysis (LDA) Uses linear combinations of predictors to predict the class of a given observation. it Assumes that the predictor variables (p) are normally distributed and the classes have identical variances (for univariate analysis, p = 1) or identical covariance matrices (for multivariate analysis, p > 1).

Quadratic discriminant analysis (QDA) is more flexible than LDA. Here, there is no assumption that the covariance matrix of classes is the same. Under Mixture discriminant analysis (MDA), each class is assumed to be a Gaussian mixture of subclasses. For MDA, there are classes, and each class is assumed to be a Gaussian mixture of subclasses, where each data point has a probability of belonging to each class. Equality of covariance matrix, among classes, is still assumed.

Read the data set

df=read.csv("E:/MScDataScience/stats602/letter-recognition.txt", header=FALSE)

head(df)

dim(df) # look at the dimesion of the data

## [1] 20000    17

summary(df$V1)

##    Length     Class      Mode 
##     20000 character character

table(df$V1)

## 
##   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T 
## 789 766 736 805 768 775 773 734 755 747 739 761 792 783 753 803 783 758 748 796 
##   U   V   W   X   Y   Z 
## 813 764 752 787 786 734

# filter to include only I and L
df.il=df[df$V1%in%c("I", "L"), ]

# get a snapshot of the data
head(df.il)

# summarise data
summary(df.il)

##       V1                  V2               V3               V4       
##  Length:1516        Min.   : 0.000   Min.   : 0.000   Min.   :0.000  
##  Class :character   1st Qu.: 1.000   1st Qu.: 5.000   1st Qu.:2.000  
##  Mode  :character   Median : 3.000   Median : 7.000   Median :4.000  
##                     Mean   : 2.846   Mean   : 7.055   Mean   :3.497  
##                     3rd Qu.: 4.000   3rd Qu.:10.000   3rd Qu.:5.000  
##                     Max.   :10.000   Max.   :15.000   Max.   :9.000  
##        V5               V6               V7               V8        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.000   1st Qu.: 1.000   1st Qu.: 5.000   1st Qu.: 4.000  
##  Median : 6.000   Median : 2.000   Median : 7.000   Median : 6.000  
##  Mean   : 5.253   Mean   : 2.239   Mean   : 6.124   Mean   : 5.307  
##  3rd Qu.: 7.000   3rd Qu.: 3.000   3rd Qu.: 7.000   3rd Qu.: 7.000  
##  Max.   :10.000   Max.   :11.000   Max.   :12.000   Max.   :11.000  
##        V9             V10              V11              V12       
##  Min.   :0.000   Min.   : 3.000   Min.   : 0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.: 5.000   1st Qu.: 6.000   1st Qu.:2.000  
##  Median :3.000   Median : 7.000   Median : 7.000   Median :6.000  
##  Mean   :2.699   Mean   : 6.309   Mean   : 7.209   Mean   :4.165  
##  3rd Qu.:4.000   3rd Qu.: 7.000   3rd Qu.:10.000   3rd Qu.:6.000  
##  Max.   :9.000   Max.   :11.000   Max.   :14.000   Max.   :9.000  
##       V13              V14              V15              V16        
##  Min.   : 2.000   Min.   :0.0000   Min.   : 4.000   Min.   : 0.000  
##  1st Qu.: 7.000   1st Qu.:0.0000   1st Qu.: 7.000   1st Qu.: 0.000  
##  Median : 8.000   Median :0.0000   Median : 8.000   Median : 2.000  
##  Mean   : 7.908   Mean   :0.8232   Mean   : 7.728   Mean   : 2.312  
##  3rd Qu.: 9.000   3rd Qu.:1.0000   3rd Qu.: 8.000   3rd Qu.: 3.000  
##  Max.   :13.000   Max.   :8.0000   Max.   :12.000   Max.   :11.000  
##       V17        
##  Min.   : 2.000  
##  1st Qu.: 7.000  
##  Median : 8.000  
##  Mean   : 7.791  
##  3rd Qu.: 8.000  
##  Max.   :15.000

table(df.il$V1)

## 
##   I   L 
## 755 761

# get the data into dataframe
df.il$V1<-paste(df.il$V1)

str(df.il)

## 'data.frame':    1516 obs. of  17 variables:
##  $ V1 : chr  "I" "L" "L" "L" ...
##  $ V2 : int  5 2 1 3 5 2 2 3 2 6 ...
##  $ V3 : int  12 3 3 6 10 3 9 9 6 11 ...
##  $ V4 : int  3 3 3 3 6 2 3 4 2 9 ...
##  $ V5 : int  7 4 1 4 8 4 7 7 4 8 ...
##  $ V6 : int  2 1 1 1 5 1 2 3 1 11 ...
##  $ V7 : int  10 0 6 1 9 0 8 7 9 7 ...
##  $ V8 : int  5 1 4 0 3 1 7 7 5 7 ...
##  $ V9 : int  5 5 1 6 2 5 0 0 0 3 ...
##  $ V10: int  4 6 7 6 6 6 7 7 6 6 ...
##  $ V11: int  13 0 8 0 9 0 13 13 13 6 ...
##  $ V12: int  3 0 3 1 2 0 6 6 5 7 ...
##  $ V13: int  9 6 10 5 9 6 9 8 9 11 ...
##  $ V14: int  2 0 0 0 2 0 0 0 0 6 ...
##  $ V15: int  8 8 7 8 5 8 8 8 8 11 ...
##  $ V16: int  4 0 2 0 3 0 1 1 1 9 ...
##  $ V17: int  10 8 9 8 9 8 8 8 8 7 ...

# data splitting into training and test set
(2/3)*sum(table(df.il$V1))

## [1] 1010.667

rows.trnil=sort(sample(1:1516, size = 1010, replace = F))


trnil.dat=df.il[rows.trnil,] # the training set
tstil.dat=df.il[-rows.trnil,] # the test set


summary(trnil.dat)

##       V1                  V2               V3               V4       
##  Length:1010        Min.   : 0.000   Min.   : 0.000   Min.   :0.000  
##  Class :character   1st Qu.: 1.000   1st Qu.: 5.000   1st Qu.:2.000  
##  Mode  :character   Median : 3.000   Median : 7.000   Median :4.000  
##                     Mean   : 2.827   Mean   : 6.982   Mean   :3.504  
##                     3rd Qu.: 4.000   3rd Qu.:10.000   3rd Qu.:5.000  
##                     Max.   :10.000   Max.   :15.000   Max.   :9.000  
##        V5               V6               V7               V8        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.000   1st Qu.: 1.000   1st Qu.: 5.000   1st Qu.: 4.000  
##  Median : 6.000   Median : 2.000   Median : 7.000   Median : 6.000  
##  Mean   : 5.226   Mean   : 2.286   Mean   : 6.141   Mean   : 5.381  
##  3rd Qu.: 7.000   3rd Qu.: 3.000   3rd Qu.: 7.000   3rd Qu.: 7.000  
##  Max.   :10.000   Max.   :11.000   Max.   :12.000   Max.   :11.000  
##        V9             V10             V11              V12       
##  Min.   :0.000   Min.   : 3.00   Min.   : 0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.: 5.00   1st Qu.: 6.000   1st Qu.:2.000  
##  Median :3.000   Median : 7.00   Median : 7.000   Median :6.000  
##  Mean   :2.719   Mean   : 6.29   Mean   : 7.153   Mean   :4.231  
##  3rd Qu.:4.000   3rd Qu.: 7.00   3rd Qu.: 9.000   3rd Qu.:6.000  
##  Max.   :9.000   Max.   :10.00   Max.   :14.000   Max.   :9.000  
##       V13              V14              V15              V16        
##  Min.   : 2.000   Min.   :0.0000   Min.   : 5.000   Min.   : 0.000  
##  1st Qu.: 7.000   1st Qu.:0.0000   1st Qu.: 7.000   1st Qu.: 0.000  
##  Median : 8.000   Median :0.0000   Median : 8.000   Median : 2.000  
##  Mean   : 7.938   Mean   :0.8693   Mean   : 7.744   Mean   : 2.375  
##  3rd Qu.: 9.000   3rd Qu.:1.0000   3rd Qu.: 8.000   3rd Qu.: 4.000  
##  Max.   :13.000   Max.   :8.0000   Max.   :12.000   Max.   :11.000  
##       V17       
##  Min.   : 4.00  
##  1st Qu.: 7.00  
##  Median : 8.00  
##  Mean   : 7.77  
##  3rd Qu.: 8.00  
##  Max.   :15.00

# boxplot dataset
boxplot(V2~V1, data = trnil.dat)

boxplot(V3~V1, data = trnil.dat)

boxplot(V4~V1, data = trnil.dat)

boxplot(V5~V1, data = trnil.dat)

boxplot(V6~V1, data = trnil.dat)

boxplot(V7~V1, data = trnil.dat)

boxplot(V8~V1, data = trnil.dat)

boxplot(V9~V1, data = trnil.dat)

boxplot(V10~V1, data = trnil.dat)

boxplot(V11~V1, data = trnil.dat)

boxplot(V12~V1, data = trnil.dat)

boxplot(V13~V1, data = trnil.dat)

boxplot(V14~V1, data = trnil.dat)

boxplot(V15~V1, data = trnil.dat)

boxplot(V16~V1, data = trnil.dat)

boxplot(V17~V1, data = trnil.dat)

boxplots 7 and 11 seems to have a good separation

X.dat=trnil.dat[,-1]  # get the predictor variables
Class.dat=trnil.dat[,1] # get the class variable

fit the model to MclustDA using different values of G. An integer vector specifying the numbers of mixture components (clusters)for which the BIC is to be calculated within each class. The default is G = 1:5

library(mclust)

## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.

set.seed(123)
mod.DA.G1= MclustDA(X.dat, Class.dat, G = 1)
mod.DA.G2= MclustDA(X.dat, Class.dat, G = 2)
mod.DA.G3= MclustDA(X.dat, Class.dat, G = 3)
mod.DA.G4= MclustDA(X.dat, Class.dat, G = 4)
summary(mod.DA.G1)

## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood    n  df       BIC
##        -22978.8 1010 304 -48060.58
##        
## Classes   n    % Model G
##       I 501 49.6   XXX 1
##       L 509 50.4   XXX 1
## 
## Training confusion matrix:
##      Predicted
## Class   I   L
##     I 494   7
##     L   5 504
## Classification error = 0.0119 
## Brier score          = 0.009

summary(mod.DA.G2)

## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood    n  df       BIC
##       -17017.14 1010 595 -38150.32
##        
## Classes   n    % Model G
##       I 501 49.6   VEV 2
##       L 509 50.4   VVV 2
## 
## Training confusion matrix:
##      Predicted
## Class   I   L
##     I 498   3
##     L   0 509
## Classification error = 0.003 
## Brier score          = 0.0031

summary(mod.DA.G3)

## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood    n  df       BIC
##       -14029.46 1010 856 -33980.48
##        
## Classes   n    % Model G
##       I 501 49.6   VEV 3
##       L 509 50.4   VEV 3
## 
## Training confusion matrix:
##      Predicted
## Class   I   L
##     I 498   3
##     L   1 508
## Classification error = 0.004 
## Brier score          = 0.0032

summary(mod.DA.G4)

## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood    n   df      BIC
##       -10857.48 1010 1132 -29545.8
##        
## Classes   n    % Model G
##       I 501 49.6   VEV 4
##       L 509 50.4   VEV 4
## 
## Training confusion matrix:
##      Predicted
## Class   I   L
##     I 496   5
##     L   0 509
## Classification error = 0.005 
## Brier score          = 0.0041

From the summary results, model 2 had the lowest classification error and thus would be used for the prediction of the test set

# Make prediction on the test set and column bind it to the actual dataset
results.2=cbind(paste(predict.MclustDA(mod.DA.G2, newdata = tstil.dat[, -1])$classification), paste(tstil.dat[, 1]))

# compute the accuracy
mean(results.2[,1]==results.2[,2])

## [1] 0.9960474

The computed accuracy is 98.8%

## make plots of the G2 model
#plot(mod.DA.G2)
summary(mod.DA.G2)

## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood    n  df       BIC
##       -17017.14 1010 595 -38150.32
##        
## Classes   n    % Model G
##       I 501 49.6   VEV 2
##       L 509 50.4   VVV 2
## 
## Training confusion matrix:
##      Predicted
## Class   I   L
##     I 498   3
##     L   0 509
## Classification error = 0.003 
## Brier score          = 0.0031

#######

######

mod.DA.G1.v7.v11= MclustDA(X.dat[, c(7, 11)], Class.dat, G = 2)

#plot(mod.DA.G1.v7.v11)
summary(mod.DA.G1.v7.v11)

## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood    n df       BIC
##       -3342.559 1010 17 -6802.719
##        
## Classes   n    % Model G
##       I 501 49.6   EEE 2
##       L 509 50.4   EEV 2
## 
## Training confusion matrix:
##      Predicted
## Class   I   L
##     I 466  35
##     L  69 440
## Classification error = 0.103 
## Brier score          = 0.0774

names(mod.DA.G1.v7.v11$models$U)

## NULL

mod.DA.G1.v7.v11$models$U$parameters

## NULL

mod.DA.G1.v7.v11$models$V$parameters

## NULL

R Notebook

Read the data set

fit the model to MclustDA using different values of G. An integer vector specifying the numbers of mixture components (clusters)for which the BIC is to be calculated within each class. The default is G = 1:5