Discriminant analysis is used to predict the probability of belonging to a given class (or category) based on one or multiple predictor variables. It works with continuous and/or categorical predictor variables. Discriminant analysis can be used for both binary classification task or multiclass classification task. Linear discriminant analysis (LDA) Uses linear combinations of predictors to predict the class of a given observation. it Assumes that the predictor variables (p) are normally distributed and the classes have identical variances (for univariate analysis, p = 1) or identical covariance matrices (for multivariate analysis, p > 1).
Quadratic discriminant analysis (QDA) is more flexible than LDA. Here, there is no assumption that the covariance matrix of classes is the same. Under Mixture discriminant analysis (MDA), each class is assumed to be a Gaussian mixture of subclasses. For MDA, there are classes, and each class is assumed to be a Gaussian mixture of subclasses, where each data point has a probability of belonging to each class. Equality of covariance matrix, among classes, is still assumed.
df=read.csv("E:/MScDataScience/stats602/letter-recognition.txt", header=FALSE)
head(df)
dim(df) # look at the dimesion of the data
## [1] 20000 17
summary(df$V1)
## Length Class Mode
## 20000 character character
table(df$V1)
##
## A B C D E F G H I J K L M N O P Q R S T
## 789 766 736 805 768 775 773 734 755 747 739 761 792 783 753 803 783 758 748 796
## U V W X Y Z
## 813 764 752 787 786 734
# filter to include only I and L
df.il=df[df$V1%in%c("I", "L"), ]
# get a snapshot of the data
head(df.il)
# summarise data
summary(df.il)
## V1 V2 V3 V4
## Length:1516 Min. : 0.000 Min. : 0.000 Min. :0.000
## Class :character 1st Qu.: 1.000 1st Qu.: 5.000 1st Qu.:2.000
## Mode :character Median : 3.000 Median : 7.000 Median :4.000
## Mean : 2.846 Mean : 7.055 Mean :3.497
## 3rd Qu.: 4.000 3rd Qu.:10.000 3rd Qu.:5.000
## Max. :10.000 Max. :15.000 Max. :9.000
## V5 V6 V7 V8
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.000 1st Qu.: 1.000 1st Qu.: 5.000 1st Qu.: 4.000
## Median : 6.000 Median : 2.000 Median : 7.000 Median : 6.000
## Mean : 5.253 Mean : 2.239 Mean : 6.124 Mean : 5.307
## 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000 3rd Qu.: 7.000
## Max. :10.000 Max. :11.000 Max. :12.000 Max. :11.000
## V9 V10 V11 V12
## Min. :0.000 Min. : 3.000 Min. : 0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.: 6.000 1st Qu.:2.000
## Median :3.000 Median : 7.000 Median : 7.000 Median :6.000
## Mean :2.699 Mean : 6.309 Mean : 7.209 Mean :4.165
## 3rd Qu.:4.000 3rd Qu.: 7.000 3rd Qu.:10.000 3rd Qu.:6.000
## Max. :9.000 Max. :11.000 Max. :14.000 Max. :9.000
## V13 V14 V15 V16
## Min. : 2.000 Min. :0.0000 Min. : 4.000 Min. : 0.000
## 1st Qu.: 7.000 1st Qu.:0.0000 1st Qu.: 7.000 1st Qu.: 0.000
## Median : 8.000 Median :0.0000 Median : 8.000 Median : 2.000
## Mean : 7.908 Mean :0.8232 Mean : 7.728 Mean : 2.312
## 3rd Qu.: 9.000 3rd Qu.:1.0000 3rd Qu.: 8.000 3rd Qu.: 3.000
## Max. :13.000 Max. :8.0000 Max. :12.000 Max. :11.000
## V17
## Min. : 2.000
## 1st Qu.: 7.000
## Median : 8.000
## Mean : 7.791
## 3rd Qu.: 8.000
## Max. :15.000
table(df.il$V1)
##
## I L
## 755 761
# get the data into dataframe
df.il$V1<-paste(df.il$V1)
str(df.il)
## 'data.frame': 1516 obs. of 17 variables:
## $ V1 : chr "I" "L" "L" "L" ...
## $ V2 : int 5 2 1 3 5 2 2 3 2 6 ...
## $ V3 : int 12 3 3 6 10 3 9 9 6 11 ...
## $ V4 : int 3 3 3 3 6 2 3 4 2 9 ...
## $ V5 : int 7 4 1 4 8 4 7 7 4 8 ...
## $ V6 : int 2 1 1 1 5 1 2 3 1 11 ...
## $ V7 : int 10 0 6 1 9 0 8 7 9 7 ...
## $ V8 : int 5 1 4 0 3 1 7 7 5 7 ...
## $ V9 : int 5 5 1 6 2 5 0 0 0 3 ...
## $ V10: int 4 6 7 6 6 6 7 7 6 6 ...
## $ V11: int 13 0 8 0 9 0 13 13 13 6 ...
## $ V12: int 3 0 3 1 2 0 6 6 5 7 ...
## $ V13: int 9 6 10 5 9 6 9 8 9 11 ...
## $ V14: int 2 0 0 0 2 0 0 0 0 6 ...
## $ V15: int 8 8 7 8 5 8 8 8 8 11 ...
## $ V16: int 4 0 2 0 3 0 1 1 1 9 ...
## $ V17: int 10 8 9 8 9 8 8 8 8 7 ...
# data splitting into training and test set
(2/3)*sum(table(df.il$V1))
## [1] 1010.667
rows.trnil=sort(sample(1:1516, size = 1010, replace = F))
trnil.dat=df.il[rows.trnil,] # the training set
tstil.dat=df.il[-rows.trnil,] # the test set
summary(trnil.dat)
## V1 V2 V3 V4
## Length:1010 Min. : 0.000 Min. : 0.000 Min. :0.000
## Class :character 1st Qu.: 1.000 1st Qu.: 5.000 1st Qu.:2.000
## Mode :character Median : 3.000 Median : 7.000 Median :4.000
## Mean : 2.827 Mean : 6.982 Mean :3.504
## 3rd Qu.: 4.000 3rd Qu.:10.000 3rd Qu.:5.000
## Max. :10.000 Max. :15.000 Max. :9.000
## V5 V6 V7 V8
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.000 1st Qu.: 1.000 1st Qu.: 5.000 1st Qu.: 4.000
## Median : 6.000 Median : 2.000 Median : 7.000 Median : 6.000
## Mean : 5.226 Mean : 2.286 Mean : 6.141 Mean : 5.381
## 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000 3rd Qu.: 7.000
## Max. :10.000 Max. :11.000 Max. :12.000 Max. :11.000
## V9 V10 V11 V12
## Min. :0.000 Min. : 3.00 Min. : 0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.: 5.00 1st Qu.: 6.000 1st Qu.:2.000
## Median :3.000 Median : 7.00 Median : 7.000 Median :6.000
## Mean :2.719 Mean : 6.29 Mean : 7.153 Mean :4.231
## 3rd Qu.:4.000 3rd Qu.: 7.00 3rd Qu.: 9.000 3rd Qu.:6.000
## Max. :9.000 Max. :10.00 Max. :14.000 Max. :9.000
## V13 V14 V15 V16
## Min. : 2.000 Min. :0.0000 Min. : 5.000 Min. : 0.000
## 1st Qu.: 7.000 1st Qu.:0.0000 1st Qu.: 7.000 1st Qu.: 0.000
## Median : 8.000 Median :0.0000 Median : 8.000 Median : 2.000
## Mean : 7.938 Mean :0.8693 Mean : 7.744 Mean : 2.375
## 3rd Qu.: 9.000 3rd Qu.:1.0000 3rd Qu.: 8.000 3rd Qu.: 4.000
## Max. :13.000 Max. :8.0000 Max. :12.000 Max. :11.000
## V17
## Min. : 4.00
## 1st Qu.: 7.00
## Median : 8.00
## Mean : 7.77
## 3rd Qu.: 8.00
## Max. :15.00
# boxplot dataset
boxplot(V2~V1, data = trnil.dat)
boxplot(V3~V1, data = trnil.dat)
boxplot(V4~V1, data = trnil.dat)
boxplot(V5~V1, data = trnil.dat)
boxplot(V6~V1, data = trnil.dat)
boxplot(V7~V1, data = trnil.dat)
boxplot(V8~V1, data = trnil.dat)
boxplot(V9~V1, data = trnil.dat)
boxplot(V10~V1, data = trnil.dat)
boxplot(V11~V1, data = trnil.dat)
boxplot(V12~V1, data = trnil.dat)
boxplot(V13~V1, data = trnil.dat)
boxplot(V14~V1, data = trnil.dat)
boxplot(V15~V1, data = trnil.dat)
boxplot(V16~V1, data = trnil.dat)
boxplot(V17~V1, data = trnil.dat)
boxplots 7 and 11 seems to have a good separation
X.dat=trnil.dat[,-1] # get the predictor variables
Class.dat=trnil.dat[,1] # get the class variable
library(mclust)
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
set.seed(123)
mod.DA.G1= MclustDA(X.dat, Class.dat, G = 1)
mod.DA.G2= MclustDA(X.dat, Class.dat, G = 2)
mod.DA.G3= MclustDA(X.dat, Class.dat, G = 3)
mod.DA.G4= MclustDA(X.dat, Class.dat, G = 4)
summary(mod.DA.G1)
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log-likelihood n df BIC
## -22978.8 1010 304 -48060.58
##
## Classes n % Model G
## I 501 49.6 XXX 1
## L 509 50.4 XXX 1
##
## Training confusion matrix:
## Predicted
## Class I L
## I 494 7
## L 5 504
## Classification error = 0.0119
## Brier score = 0.009
summary(mod.DA.G2)
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log-likelihood n df BIC
## -17017.14 1010 595 -38150.32
##
## Classes n % Model G
## I 501 49.6 VEV 2
## L 509 50.4 VVV 2
##
## Training confusion matrix:
## Predicted
## Class I L
## I 498 3
## L 0 509
## Classification error = 0.003
## Brier score = 0.0031
summary(mod.DA.G3)
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log-likelihood n df BIC
## -14029.46 1010 856 -33980.48
##
## Classes n % Model G
## I 501 49.6 VEV 3
## L 509 50.4 VEV 3
##
## Training confusion matrix:
## Predicted
## Class I L
## I 498 3
## L 1 508
## Classification error = 0.004
## Brier score = 0.0032
summary(mod.DA.G4)
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log-likelihood n df BIC
## -10857.48 1010 1132 -29545.8
##
## Classes n % Model G
## I 501 49.6 VEV 4
## L 509 50.4 VEV 4
##
## Training confusion matrix:
## Predicted
## Class I L
## I 496 5
## L 0 509
## Classification error = 0.005
## Brier score = 0.0041
From the summary results, model 2 had the lowest classification error and thus would be used for the prediction of the test set
# Make prediction on the test set and column bind it to the actual dataset
results.2=cbind(paste(predict.MclustDA(mod.DA.G2, newdata = tstil.dat[, -1])$classification), paste(tstil.dat[, 1]))
# compute the accuracy
mean(results.2[,1]==results.2[,2])
## [1] 0.9960474
The computed accuracy is 98.8%
## make plots of the G2 model
#plot(mod.DA.G2)
summary(mod.DA.G2)
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log-likelihood n df BIC
## -17017.14 1010 595 -38150.32
##
## Classes n % Model G
## I 501 49.6 VEV 2
## L 509 50.4 VVV 2
##
## Training confusion matrix:
## Predicted
## Class I L
## I 498 3
## L 0 509
## Classification error = 0.003
## Brier score = 0.0031
#######
######
mod.DA.G1.v7.v11= MclustDA(X.dat[, c(7, 11)], Class.dat, G = 2)
#plot(mod.DA.G1.v7.v11)
summary(mod.DA.G1.v7.v11)
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log-likelihood n df BIC
## -3342.559 1010 17 -6802.719
##
## Classes n % Model G
## I 501 49.6 EEE 2
## L 509 50.4 EEV 2
##
## Training confusion matrix:
## Predicted
## Class I L
## I 466 35
## L 69 440
## Classification error = 0.103
## Brier score = 0.0774
names(mod.DA.G1.v7.v11$models$U)
## NULL
mod.DA.G1.v7.v11$models$U$parameters
## NULL
mod.DA.G1.v7.v11$models$V$parameters
## NULL