Title: Discriminant analyses

Name: Tammy L. Elliott

Date: September 8, 2017

R version 3.1

Discriminant analysis is used to predict class membership from several continuous variables, for example to predict which career stream (grouping) a student should enter based on several grades (continuous variables).

In this case, I use the table() function to create confusion matrices that compare the classes from a linear discriminant analysis with those obtained from several clustering methods (fuzzy clustering of Gower’s distances and raoD).

Treatment of data

Create matrix of variables of interest.

Drop soil moisture based on previous collinearity analyses.

Standardization of variables

Standardizing the variables is not necessary for LDA, because it does not change the classification results. However, the scatter matrices and eigenvectors will differ. I choose to standardize in case I need the eigenvectors in future analyses.
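The point about standardization can be checked with a minimal sketch. It assumes the MASS package is installed and uses the built-in iris data as a stand-in, since the site data are not shown here; the object names are hypothetical.

```r
library(MASS)  # provides lda()

# Stand-in data: the built-in iris set, not the site data from this analysis
raw <- iris
scaled <- iris
scaled[, 1:4] <- scale(iris[, 1:4])  # centre each variable and scale to unit sd

fit.raw <- lda(Species ~ ., data = raw)
fit.scaled <- lda(Species ~ ., data = scaled)

# The predicted classes agree between the two fits ...
same.classes <- all(predict(fit.raw)$class == predict(fit.scaled)$class)
# ... but the discriminant coefficients do not
same.coefs <- isTRUE(all.equal(unname(fit.raw$scaling), unname(fit.scaled$scaling)))

same.classes
same.coefs
```

The coefficients change because they are expressed in the units of the predictors, while the class assignments depend only on the geometry of the groups.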

## LDA for Site.Env

First - do I meet assumptions?

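The LDA classifier assumes that each class comes from a multivariate normal distribution with a class-specific mean vector and a covariance matrix that is common to all classes.

It is difficult to know which of LDA or QDA to go with, since both assume normality within each class (which my original data do not meet). QDA drops the assumption of a common covariance matrix, at the cost of estimating one covariance matrix per class, so it is generally preferred when the training set is large or when the covariances clearly differ among classes.

My choice is to go with an LDA, after transforming my data to meet the assumptions of equal variance and normality.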
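One rough way to screen these assumptions is sketched below. It uses only univariate tests from base R, which are weaker than tests of multivariate normality or of equal covariance matrices, and it uses the built-in iris data as a stand-in for the site data.

```r
# Stand-in data: iris replaces the site data, which are not shown here
vars <- names(iris)[1:4]

# Shapiro-Wilk test of normality for each variable within each class
shapiro.p <- sapply(vars, function(v)
  tapply(iris[[v]], iris$Species, function(x) shapiro.test(x)$p.value))

# Bartlett test of equal variances among classes for each variable
bartlett.p <- sapply(vars, function(v)
  bartlett.test(iris[[v]], iris$Species)$p.value)

round(shapiro.p, 3)   # rows are classes, columns are variables
round(bartlett.p, 3)
```

Small p-values suggest departures from normality or from equal variances, but passing these univariate checks does not guarantee that the multivariate assumptions hold.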

Transformations

In the end, I chose to scale only, as the transformations did not fully help to meet the assumptions, and lda() did not work with the missing values (NaN) produced by taking square roots of the scaled values, many of which are negative.
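The problem with square roots of scaled values can be reproduced with a few hypothetical numbers: centring pushes every value below the mean under zero, and the square root of a negative number is NaN in R.

```r
x <- c(1, 4, 9, 16, 25)                     # hypothetical non-negative measurements
x.scaled <- as.numeric(scale(x))            # centred and scaled: values below the mean turn negative
x.sqrt <- suppressWarnings(sqrt(x.scaled))  # sqrt() of a negative value gives NaN

x.scaled
x.sqrt
sum(is.na(x.sqrt))  # is.na() also counts NaN as missing
```

Taking the square root before scaling avoids the missing values, provided the raw measurements are non-negative.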

It appears that transforming the unscaled values (sqrt) does not really help much either.

The accuracy of the LDA comparison using site.env decreases when sqrt-transformed values are used.

I am keeping the priors at their default levels, since I do not think that there is a reason to set them otherwise.

LDA with scaled values; compare cross-validated with non-cross-validated results
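For reference, the default priors in lda() from MASS are the observed class proportions. The sketch below illustrates this with an unbalanced subset of the built-in iris data as a stand-in; the subset and object names are hypothetical.

```r
library(MASS)  # provides lda()

# Stand-in data: an unbalanced subset of iris (50, 30 and 20 rows per class)
dat <- iris[c(1:50, 51:80, 101:120), ]

fit <- lda(Species ~ ., data = dat)
fit$prior                       # default priors
prop.table(table(dat$Species))  # observed class proportions, for comparison

# Priors can be set explicitly, here to equal weights
fit.equal <- lda(Species ~ ., data = dat, prior = rep(1 / 3, 3))
fit.equal$prior
```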

library(MASS)  # provides lda()

site.lda <- lda(Group ~ Elevation + Visibility + Slope + Depth, data = groups.site)
site.lda.pred <- predict(site.lda)

# Redo with jackknifing (leave-one-out cross-validation)
site.lda.CV <- lda(Group ~ Elevation + Visibility + Slope + Depth, data = groups.site, CV = TRUE)

# Compare the predictions of the plain model and the jackknifed model
# (rows: cross-validated classes; columns: non-cross-validated classes)
(site.lda.noCV.CV <- table(site.lda.CV$class, site.lda.pred$class))
##    
##      1  2  3  4
##   1 42  0  0  0
##   2  0 49  0  0
##   3  1  0 39  1
##   4  0  0  0 44
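Because groups.site is not included here, the same comparison can be tried on a stand-in dataset. The sketch below uses the built-in iris data and hypothetical object names. With CV = TRUE, lda() returns a list holding the leave-one-out classes and posterior probabilities rather than a fitted model, so predict() is only needed for the plain fit.

```r
library(MASS)  # provides lda()

# Plain fit, with predictions for the same data used to fit the model
fit <- lda(Species ~ ., data = iris)
fit.pred <- predict(fit)

# Leave-one-out cross-validation: the result holds $class and $posterior
fit.cv <- lda(Species ~ ., data = iris, CV = TRUE)

# Cross-tabulate the two sets of predicted classes
agreement <- table(CV = fit.cv$class, noCV = fit.pred$class)
agreement
```

Off-diagonal counts mark observations whose predicted class changes once they are left out of the fit.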

Evaluation metrics

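A key metric to start with is the overall classification accuracy, defined as the fraction of instances that are correctly classified. Here the table compares the cross-validated with the non-cross-validated predictions, so the value is the fraction of instances on which the two agree.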

(accuracy.site.lda <- sum(diag(site.lda.noCV.CV)) / sum(site.lda.noCV.CV))
## [1] 0.9886364
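The value can be reproduced by hand from the counts printed in the table above. This sketch re-enters those counts as a plain matrix; the object names are hypothetical.

```r
# Counts copied from the table of cross-validated (rows) against
# non-cross-validated (columns) classes
agree <- matrix(c(42,  0,  0,  0,
                   0, 49,  0,  0,
                   1,  0, 39,  1,
                   0,  0,  0, 44),
                nrow = 4, byrow = TRUE,
                dimnames = list(CV = as.character(1:4), noCV = as.character(1:4)))

n.agree <- sum(diag(agree))  # instances on the diagonal: 174
n.total <- sum(agree)        # all instances: 176
n.agree / n.total
```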

More evaluation metrics

Per-class Precision, Recall, and F-1

In order to assess the performance with respect to every class in the dataset, we will compute common per-class metrics such as precision, recall, and the F-1 score. These metrics are particularly useful when the class labels are not uniformly distributed (most instances belong to one class, for example). In such cases, accuracy could be misleading as one could predict the dominant class most of the time and still achieve a relatively high overall accuracy but very low precision or recall for other classes.

Precision is defined as the fraction of instances predicted as a given class that truly belong to it, whereas recall is the fraction of instances of a class that were correctly predicted.

Notice that there is a trade-off between these two metrics. When a classifier predicts one class, say class a, most of the time, it will achieve a high recall for a (most of the instances of that class will be identified). However, instances of other classes will most likely be incorrectly predicted as a in that process, resulting in a lower precision for a.

In addition to precision and recall, the F-1 score is also commonly reported. It is defined as the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall).
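These definitions translate directly into operations on a confusion matrix. The sketch below uses a small hypothetical three-class matrix and assumes that rows hold the actual classes and columns the predicted classes; with the opposite layout, rowSums() and colSums() swap roles.

```r
# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm <- matrix(c(8, 1, 1,
               2, 6, 2,
               0, 1, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = c("a", "b", "c"),
                             predicted = c("a", "b", "c")))

true.pos <- diag(cm)                 # correctly predicted instances per class
precision <- true.pos / colSums(cm)  # share of the predictions of a class that are correct
recall <- true.pos / rowSums(cm)     # share of the actual members of a class that are found
f1 <- 2 * precision * recall / (precision + recall)  # harmonic mean of the two

round(cbind(precision, recall, f1), 3)
```

A class that is never predicted has a column sum of zero, so its precision is undefined (NaN) and needs separate handling.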

Create a confusion table for each of the 9 different cluster groupings with lda.CV for site.env

Columns are the predictions from site.lda.CV (made with lda() and CV = TRUE). Rows are the observed groupings from the clustering methods.

(clust.confusion.table<-lapply(clust.df,function(x) table(x, site.lda.CV$class)))
## $site.env.dist
##    
## x    1  2  3  4
##   1 40  1  1  1
##   2  0 48  0  0
##   3  2  0 37  2
##   4  0  0  3 41
## 
## $pa.clust
##    
## x    1  2  3  4
##   1 29 30  6  2
##   2  5 12 15 17
##   3  6  6 10  8
##   4  2  1 10 17
## 
## $pa.phy.clust
##    
## x    1  2  3  4
##   1 22 19  3 12
##   2 11 10 19  9
##   3  3 15  5 14
##   4  6  5 14  9
## 
## $abd.clust
##    
## x    1  2  3  4
##   1  6  7 15 19
##   2  9 12  9 14
##   3 27 30 13 10
##   4  0  0  4  1
## 
## $abd.phy.clust
##    
## x    1  2  3  4
##   1 13 15  4  1
##   2  6  8 13 25
##   3 18 23  6  6
##   4  5  3 18 12
## 
## $angio.pa.clust
##    
## x    1  2  3  4
##   1 26 27  7  8
##   2  6  3  8  7
##   3  2 11 14 23
##   4  8  8 12  6
## 
## $angio.pa.phy.clust
##    
## x    1  2  3  4
##   1 10 12 15 10
##   2 15 16  4 16
##   3 13  9 18  6
##   4  4 12  4 12
## 
## $angio.abd.clust
##    
## x    1  2  3  4
##   1 12 17 14 15
##   2 28 31 15 17
##   3  2  1  9  8
##   4  0  0  3  4
## 
## $angio.abd.phy.clust
##    
## x    1  2  3  4
##   1  9 20  1 14
##   2 19 14 10 13
##   3 11 13 20 10
##   4  3  2 10  7

Accuracy, Precision, Recall and F1 for clustering methods against LDA.CV of environmental variables

clust.df
##                      Accuracy Precision    Recall        F1
## site.env.dist       0.9431818 0.9424134 0.9415575 0.9419852
## pa.clust            0.3863636 0.3883761 0.3914101 0.3898872
## pa.phy.clust        0.2613636 0.2609787 0.2635970 0.2622813
## abd.clust           0.1818182 0.1849971 0.1818889 0.1834298
## abd.phy.clust       0.2215909 0.2215259 0.2229645 0.2222429
## angio.pa.clust      0.2784091 0.2895458 0.2895248 0.2895353
## angio.pa.phy.clust  0.3181818 0.3218235 0.3190944 0.3204531
## angio.abd.clust     0.3181818 0.3078452 0.3071972 0.3075209
## angio.abd.phy.clust 0.2840909 0.2908482 0.2867239 0.2887713
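One way a summary of this shape could be assembled from a list of confusion tables is sketched below. It is not necessarily how the table above was produced. The function name macro.metrics, the two small input matrices, the row = actual and column = predicted layout, and the choice of taking F1 as the harmonic mean of the macro-averaged precision and recall are all assumptions made for illustration.

```r
# Hypothetical helper: overall accuracy plus macro-averaged precision, recall and F1
# Assumes rows = actual classes and columns = predicted classes
macro.metrics <- function(cm) {
  true.pos <- diag(cm)
  precision <- mean(true.pos / colSums(cm))  # average of the per-class precisions
  recall <- mean(true.pos / rowSums(cm))     # average of the per-class recalls
  c(Accuracy = sum(true.pos) / sum(cm),
    Precision = precision,
    Recall = recall,
    F1 = 2 * precision * recall / (precision + recall))
}

# Two hypothetical confusion matrices standing in for the list of tables
tables <- list(
  clust.a = matrix(c(8, 1, 1,  2, 6, 2,  0, 1, 9), nrow = 3, byrow = TRUE),
  clust.b = matrix(c(5, 5, 0,  5, 5, 0,  0, 0, 10), nrow = 3, byrow = TRUE)
)

# One row per clustering, one column per metric
metrics <- as.data.frame(t(sapply(tables, macro.metrics)))
metrics
```

Macro-averaging weights every class equally, so small classes influence the summary as much as large ones.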