In this project, I aim to study linear discriminant analysis (LDA). We will use the built-in iris dataset in R to learn about LDA.

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We can see that there are 5 variables: 4 numeric measurements describing different parts of each flower, plus the factor Species with three levels, namely setosa, versicolor, and virginica. Our goal is to build a linear discriminant model that classifies the three groups based on these measurements.
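
As a quick check before modeling, the three classes are perfectly balanced, with 50 observations each:

table(iris$Species)
##     setosa versicolor  virginica 
##         50         50         50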

library(psych)
## Warning: package 'psych' was built under R version 4.2.2
pairs.panels(iris[1:4],
             gap=0,
             bg=c("red","green","blue")[iris$Species],
             pch=21)

In the scatter plot matrix, some panels show good separation between the three groups, while others show overlap between groups. We will create linear combinations of the 4 variables to find the best separation between the three species.

First, we randomly split the data into a training dataset and a testing dataset.

set.seed(320)
id<-sample(2,size=nrow(iris),replace=TRUE,
           prob=c(0.7,0.3))
train_data<-iris[id==1,]
test_data<-iris[id==2,]
str(train_data)
## 'data.frame':    106 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.7 5 5.4 5 4.4 5.4 4.8 4.8 4.3 ...
##  $ Sepal.Width : num  3.5 3.2 3.6 3.9 3.4 2.9 3.7 3.4 3 3 ...
##  $ Petal.Length: num  1.4 1.3 1.4 1.7 1.5 1.4 1.5 1.6 1.4 1.1 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2 0.1 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str(test_data)
## 'data.frame':    44 obs. of  5 variables:
##  $ Sepal.Length: num  4.9 4.6 4.6 4.9 5.8 5.1 5.7 4.6 5.1 5 ...
##  $ Sepal.Width : num  3 3.1 3.4 3.1 4 3.5 3.8 3.6 3.3 3.4 ...
##  $ Petal.Length: num  1.4 1.5 1.4 1.5 1.2 1.4 1.7 1 1.7 1.6 ...
##  $ Petal.Width : num  0.2 0.2 0.3 0.1 0.2 0.3 0.3 0.2 0.5 0.4 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
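
Note that sample() assigns each row to a group independently, so the realized split is only approximately 70/30; here 106 of the 150 rows (about 71%) landed in the training set. We can verify the proportions directly:

prop.table(table(id))  # here 106/150 ~ 0.707 and 44/150 ~ 0.293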

We fit the model on the training data.

library(MASS)
## Warning: package 'MASS' was built under R version 4.2.2
linear<-lda(Species~.,train_data)
attributes(linear)
## $names
##  [1] "prior"   "counts"  "means"   "scaling" "lev"     "svd"     "N"      
##  [8] "call"    "terms"   "xlevels"
## 
## $class
## [1] "lda"

The lda() function from the MASS package fits the linear discriminant model.

linear$prior
##     setosa versicolor  virginica 
##  0.2924528  0.3679245  0.3396226
linear$means
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         4.967742    3.412903     1.458065   0.2419355
## versicolor     5.956410    2.728205     4.253846   1.3153846
## virginica      6.408333    2.950000     5.419444   2.0305556

This attribute shows the prior probability of each species. By default, the prior probability of class k is the number of training samples in class k divided by the total number of training samples. Here the prior probability of the setosa species is 31/106 ≈ 0.2925, and the three probabilities sum to 1.
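
The priors can be reproduced by hand from the training-set class frequencies:

table(train_data$Species)/nrow(train_data)  # e.g. 31/106 = 0.2924528 for setosa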

The means attribute shows the mean of each variable within each species. For example, the sepal length of setosa in our training data has a mean of about 4.968.
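
These group means can likewise be reproduced directly from the training data; the following one-liner should match linear$means:

aggregate(. ~ Species, data = train_data, FUN = mean)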

linear$scaling
##                     LD1        LD2
## Sepal.Length  0.8028726 -1.0606641
## Sepal.Width   1.7438864  2.4117434
## Petal.Length -2.2776605 -0.4946693
## Petal.Width  -2.9655438  2.7617091

The scaling attribute shows the coefficients of the two linear discriminants. With 3 groups and 4 predictors, the model has min(3 - 1, 4) = 2 discriminants.

The first linear discriminant is a linear combination of the four variables: LD1 = 0.8029*Sepal.Length + 1.7439*Sepal.Width - 2.2777*Petal.Length - 2.9655*Petal.Width.
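
We can compute the discriminant scores by hand. A minimal sketch, assuming (as I believe MASS does) that predict() centers the predictors at the prior-weighted grand mean before applying the scaling matrix:

X <- as.matrix(train_data[, 1:4])
grand_mean <- colSums(linear$prior * linear$means)  # assumed prior-weighted centroid
ld <- scale(X, center = grand_mean, scale = FALSE) %*% linear$scaling
head(ld[, "LD1"])  # should match head(predict(linear, train_data)$x[, "LD1"])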

linear
## Call:
## lda(Species ~ ., data = train_data)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.2924528  0.3679245  0.3396226 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         4.967742    3.412903     1.458065   0.2419355
## versicolor     5.956410    2.728205     4.253846   1.3153846
## virginica      6.408333    2.950000     5.419444   2.0305556
## 
## Coefficients of linear discriminants:
##                     LD1        LD2
## Sepal.Length  0.8028726 -1.0606641
## Sepal.Width   1.7438864  2.4117434
## Petal.Length -2.2776605 -0.4946693
## Petal.Width  -2.9655438  2.7617091
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9874 0.0126

The last attribute in our linear discriminant model is the proportion of trace, the share of the between-group variance captured by each discriminant. LD1 accounts for 98.74% of the separation, while the second discriminant contributes only 1.26%.
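
The proportion of trace can be recovered from the singular values stored in the model: each squared singular value divided by their sum.

prop.table(linear$svd^2)  # should give roughly 0.9874 and 0.0126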

Now, we will use the fitted model to predict the classes for our training dataset.

p<-predict(linear,train_data)
attributes(p)
## $names
## [1] "class"     "posterior" "x"

We store the predictions in p and go through each attribute of the prediction object.

p$class
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     versicolor versicolor versicolor versicolor versicolor
##  [37] versicolor versicolor versicolor versicolor versicolor versicolor
##  [43] versicolor versicolor versicolor versicolor versicolor versicolor
##  [49] versicolor versicolor versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor virginica  versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor virginica  virginica 
##  [73] virginica  virginica  virginica  virginica  virginica  virginica 
##  [79] virginica  virginica  virginica  virginica  virginica  virginica 
##  [85] virginica  virginica  virginica  virginica  virginica  virginica 
##  [91] virginica  virginica  versicolor virginica  virginica  virginica 
##  [97] virginica  virginica  virginica  virginica  virginica  virginica 
## [103] virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

class contains the predicted label for each training observation. The first 31 observations are predicted as setosa, followed mostly by versicolor and then virginica, with a couple of exceptions visible in the output.
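
We can locate the predictions that deviate from the actual labels directly; they correspond to the two off-diagonal entries of the confusion matrix computed later.

which(p$class != train_data$Species)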

p$posterior
##           setosa   versicolor    virginica
## 1   1.000000e+00 1.281326e-24 3.084223e-46
## 3   1.000000e+00 7.827576e-22 1.568234e-42
## 5   1.000000e+00 3.406489e-25 9.301194e-47
## 6   1.000000e+00 1.143400e-23 4.943282e-44
## 8   1.000000e+00 1.915839e-22 2.459458e-43
## 9   1.000000e+00 2.517236e-17 1.411450e-36
## 11  1.000000e+00 3.076029e-26 2.037798e-48
## 12  1.000000e+00 7.615632e-21 5.914609e-41
## 13  1.000000e+00 1.587369e-20 1.881123e-41
## 14  1.000000e+00 4.496141e-22 4.595802e-43
## 16  1.000000e+00 6.321926e-31 2.077113e-53
## 17  1.000000e+00 1.058265e-27 1.557266e-49
## 20  1.000000e+00 4.478290e-25 4.312065e-46
## 21  1.000000e+00 1.309994e-21 1.349954e-42
## 22  1.000000e+00 4.671607e-23 3.152005e-43
## 25  1.000000e+00 8.070655e-18 7.909798e-37
## 26  1.000000e+00 5.943776e-18 7.244823e-38
## 30  1.000000e+00 8.295262e-19 2.097251e-38
## 32  1.000000e+00 2.486132e-21 1.149233e-41
## 34  1.000000e+00 6.760653e-32 2.059171e-55
## 36  1.000000e+00 9.972579e-24 2.048744e-45
## 38  1.000000e+00 4.789073e-26 4.274818e-48
## 39  1.000000e+00 3.324278e-19 5.633894e-39
## 41  1.000000e+00 3.485460e-24 2.864357e-45
## 42  1.000000e+00 2.948031e-12 1.788695e-30
## 43  1.000000e+00 6.026236e-21 5.057327e-41
## 45  1.000000e+00 6.795861e-20 9.479723e-39
## 46  1.000000e+00 3.131374e-18 9.022621e-38
## 48  1.000000e+00 1.575800e-20 1.184843e-40
## 49  1.000000e+00 6.073841e-26 6.486313e-48
## 50  1.000000e+00 1.395673e-22 1.093629e-43
## 51  7.018881e-21 9.999788e-01 2.118236e-05
## 53  3.274050e-25 9.987047e-01 1.295258e-03
## 54  6.066443e-25 9.996470e-01 3.530496e-04
## 55  5.566963e-26 9.980124e-01 1.987596e-03
## 56  4.846861e-25 9.983985e-01 1.601531e-03
## 57  2.235145e-24 9.898529e-01 1.014708e-02
## 58  2.412500e-15 9.999999e-01 9.810250e-08
## 59  1.613698e-22 9.999642e-01 3.575494e-05
## 60  1.739842e-22 9.992323e-01 7.676993e-04
## 61  1.506036e-20 9.999987e-01 1.345281e-06
## 62  5.605881e-22 9.994114e-01 5.886185e-04
## 63  6.795351e-21 9.999996e-01 3.845921e-07
## 64  3.737727e-26 9.955519e-01 4.448113e-03
## 65  2.173288e-15 9.999991e-01 9.054551e-07
## 67  6.667505e-26 9.698072e-01 3.019276e-02
## 69  4.274448e-31 9.713352e-01 2.866485e-02
## 70  1.329440e-19 9.999981e-01 1.916537e-06
## 72  8.129272e-19 9.999965e-01 3.473073e-06
## 73  2.774938e-32 8.421635e-01 1.578365e-01
## 74  9.969231e-25 9.997389e-01 2.610707e-04
## 75  4.386232e-20 9.999926e-01 7.362738e-06
## 76  8.870068e-21 9.999771e-01 2.288908e-05
## 77  5.799639e-26 9.994776e-01 5.223590e-04
## 78  4.428307e-30 7.867178e-01 2.132822e-01
## 80  2.958695e-13 1.000000e+00 5.775080e-09
## 81  9.242109e-20 9.999981e-01 1.885763e-06
## 83  2.035810e-18 9.999982e-01 1.800871e-06
## 84  1.419814e-35 8.779558e-02 9.122044e-01
## 87  8.734423e-24 9.993786e-01 6.213729e-04
## 88  1.297683e-26 9.997725e-01 2.275249e-04
## 89  1.465277e-19 9.999564e-01 4.358039e-05
## 90  3.347054e-23 9.998251e-01 1.748545e-04
## 91  3.229763e-25 9.992673e-01 7.327158e-04
## 93  2.688494e-20 9.999940e-01 5.958148e-06
## 95  3.506989e-23 9.997089e-01 2.911055e-04
## 96  3.985966e-19 9.999872e-01 1.276508e-05
## 98  1.124969e-20 9.999809e-01 1.913206e-05
## 99  5.271244e-12 1.000000e+00 1.038071e-08
## 100 5.244894e-21 9.999454e-01 5.459577e-05
## 101 7.915420e-57 9.587546e-10 1.000000e+00
## 102 4.623367e-42 3.088381e-04 9.996912e-01
## 103 3.873214e-47 2.096444e-05 9.999790e-01
## 104 1.546476e-42 4.893009e-04 9.995107e-01
## 105 1.276484e-50 5.641472e-07 9.999994e-01
## 107 1.053375e-36 7.994670e-03 9.920053e-01
## 109 2.271101e-47 1.495139e-04 9.998505e-01
## 110 1.314991e-50 8.686077e-08 9.999999e-01
## 111 2.865981e-35 1.016566e-02 9.898343e-01
## 112 8.528104e-42 9.990544e-04 9.990009e-01
## 114 4.469965e-45 4.480519e-05 9.999552e-01
## 115 3.063708e-50 1.506040e-07 9.999998e-01
## 116 4.860379e-44 9.803761e-06 9.999902e-01
## 117 3.910770e-39 4.191047e-03 9.958090e-01
## 120 1.640620e-37 1.601208e-01 8.398792e-01
## 121 5.002694e-47 3.632164e-06 9.999964e-01
## 122 3.918779e-41 1.856821e-04 9.998143e-01
## 123 2.485225e-55 1.040567e-06 9.999990e-01
## 124 5.409830e-35 8.246335e-02 9.175367e-01
## 125 2.499600e-43 4.829424e-05 9.999517e-01
## 126 1.915321e-40 3.502230e-03 9.964978e-01
## 127 3.917957e-33 1.557340e-01 8.442660e-01
## 134 9.592218e-32 7.385625e-01 2.614375e-01
## 135 2.876384e-39 3.735841e-02 9.626416e-01
## 137 1.836657e-48 1.973802e-07 9.999998e-01
## 138 1.297433e-38 3.696516e-03 9.963035e-01
## 140 3.037919e-40 7.836165e-04 9.992164e-01
## 141 1.603406e-49 4.644126e-07 9.999995e-01
## 142 8.473518e-40 4.068602e-04 9.995931e-01
## 143 4.623367e-42 3.088381e-04 9.996912e-01
## 144 2.789602e-50 4.156987e-07 9.999996e-01
## 145 1.086576e-50 8.169567e-08 9.999999e-01
## 146 3.339562e-43 4.734166e-05 9.999527e-01
## 147 3.207165e-40 3.861707e-03 9.961383e-01
## 149 2.251511e-44 3.272595e-06 9.999967e-01
## 150 1.190333e-36 6.997823e-03 9.930022e-01

Continuing with the first observation, we see that the posterior probabilities of versicolor and virginica are far smaller than that of setosa. Therefore, the first case is assigned to class setosa.
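
In general, each observation is assigned to the class with the largest posterior probability. A quick check that p$class is exactly the row-wise argmax of p$posterior:

pred <- colnames(p$posterior)[max.col(p$posterior)]
all(pred == as.character(p$class))  # expect TRUE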

ldahist(data = p$x[,1],g=train_data$Species)

We plot histograms of the LD1 scores from our prediction p. The LD1 scores of class setosa are much larger than those of the other two groups, and there is a clear separation between setosa and versicolor. The distributions of versicolor and virginica are also largely separated, but with some overlap between the two classes.

Looking back at the proportion of trace in our model, the first discriminant carries almost all of the separation, and the histogram confirms this. Now we examine the second discriminant, which accounts for only about 1.26% of the trace.

ldahist(data = p$x[,2],g=train_data$Species)

Unsurprisingly, the histogram shows no separation between the three classes along LD2.

library(devtools)
## Warning: package 'devtools' was built under R version 4.2.2
## Loading required package: usethis
## Warning: package 'usethis' was built under R version 4.2.2
library(ggord)
ggord(linear,train_data$Species,ylim=c(-10,10))

The plot shows the difference between LD1 and LD2 in more detail. When the data are projected onto LD2 alone, the three classes cannot be distinguished, but projected onto LD1, the three species separate much better.

The plot also indicates the direction of each variable in the space of the two discriminants. For example, Sepal.Length enters LD1 with coefficient 0.8028726 and LD2 with coefficient -1.0606641. An observation whose LD1 score is around 5.3 falls well inside the setosa region and would be classified as setosa.
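
Note that ggord is not on CRAN but installed from GitHub, which is presumably why devtools is loaded above. If ggord is unavailable, a plain LD1-vs-LD2 scatter plot of the discriminant scores, without the variable arrows, conveys the same separation:

library(ggplot2)
scores <- data.frame(p$x, Species = train_data$Species)
ggplot(scores, aes(x = LD1, y = LD2, colour = Species)) +
  geom_point()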

We want to know the accuracy of our model on the training data.

p1<-predict(linear,train_data)$class
tab<-table(Predicted=p1,Actual=train_data$Species)
print(tab)
##             Actual
## Predicted    setosa versicolor virginica
##   setosa         31          0         0
##   versicolor      0         38         1
##   virginica       0          1        35

The table compares our predicted classes with the actual classes in the training dataset. We predict 31 setosa, and all 31 actual setosa are classified correctly. There are two misclassifications: one actual virginica is predicted as versicolor, and one actual versicolor is predicted as virginica.
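
Beyond overall accuracy, the per-class recall (diagonal counts divided by the actual class totals) shows where the errors occur:

diag(tab)/colSums(tab)  # setosa 1.000, versicolor 38/39 ~ 0.974, virginica 35/36 ~ 0.972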

sum(diag(tab))/sum(tab)
## [1] 0.9811321

So the accuracy of our model on the training data is 98.11%.

After training the model on the training data, we use it to predict the classes for the testing data.

p2<-predict(linear,test_data)$class
tab1<-table(Predicted=p2,Actual=test_data$Species)
print(tab1)
##             Actual
## Predicted    setosa versicolor virginica
##   setosa         19          0         0
##   versicolor      0         10         0
##   virginica       0          1        14
sum(diag(tab1))/sum(tab1)
## [1] 0.9772727

The accuracy of the linear discriminant model on the testing data is about 97.73%.
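
Finally, keep in mind that this 97.73% comes from a single random split. A short sketch, reusing the same lda() call as above, repeats the split several times to show how much the test accuracy varies:

set.seed(320)
accs <- replicate(20, {
  id <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
  fit <- lda(Species ~ ., data = iris[id == 1, ])
  mean(predict(fit, iris[id == 2, ])$class == iris[id == 2, "Species"])
})
summary(accs)  # spread of test accuracies across 20 random splits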