Theory behind the process

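Linear discriminant analysis (LDA) is a method for pattern recognition problems, in particular for classifying a subject into one of several known populations.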

The linear discriminant analysis equation: B = a1*x1 + a2*x2 + … + ap*xp, where x1, …, xp are the predictor values of a subject and a1, …, ap are the coefficients.

We are looking for coefficients and a cutoff such that a new subject is classified into one population if its value of B is below the cutoff, and into the other population otherwise.
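The rule can be sketched in a few lines of R. The coefficients, the cutoff and the population labels below are made up for illustration; they are not estimated from any data.

```r
# Hypothetical coefficients a1, a2 and cutoff (illustration only).
a      <- c(2, -1)
cutoff <- 0

# Each row of X is one subject with predictor values (x1, x2).
X <- rbind(c(1, 3),
           c(3, 1))

# B = a1*x1 + a2*x2, computed for every subject at once.
B <- as.vector(X %*% a)

# One population if B is below the cutoff, the other population otherwise.
population <- ifelse(B < cutoff, "population 1", "population 2")
```

LDA itself is about choosing a1, …, ap and the cutoff from data; this sketch only shows how the rule is applied once they are known.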

The idea came from Fisher, and it was the first classification protocol developed. It predates other classification methods such as logistic regression and classification trees.

Practical Implementation

Load the salmon data from the rrcov package.

setwd("~/Google Drive/CR Rao Course/")
library(rrcov)
## Loading required package: robustbase
## Scalable Robust Estimators with High Breakdown Point (version 1.3-8)
data(salmon)
head(salmon)
##   Gender Freshwater Marine  Origin
## 1      2        108    368 Alaskan
## 2      1        131    355 Alaskan
## 3      1        105    469 Alaskan
## 4      2         86    506 Alaskan
## 5      1         99    402 Alaskan
## 6      2         87    423 Alaskan
salmon <- salmon[ , -1]
summary(salmon)
##    Freshwater        Marine           Origin  
##  Min.   : 53.0   Min.   :301.0   Alaskan :50  
##  1st Qu.: 99.0   1st Qu.:367.0   Canadian:50  
##  Median :117.5   Median :396.5                
##  Mean   :117.9   Mean   :398.1                
##  3rd Qu.:140.0   3rd Qu.:428.2                
##  Max.   :179.0   Max.   :511.0

Checking the data to find a point of separation

alaska <- subset(salmon, Origin == "Alaskan")  # Create an Alaskan fish subset.
canada <- subset(salmon, Origin == "Canadian") # Create a Canadian fish subset.
# Generate a scatter plot with axis limits wide enough to hold both groups.
plot(alaska$Freshwater, alaska$Marine, pch = 20, col = 2,
     xlim = c(50, 200), ylim = c(300, 550),
     main = "Plot of Scale size of Salmon",
     xlab = "Freshwater scale diameter", ylab = "Marine scale diameter")
# A second plot() call would start a new figure, so use points() to add the Canadian fish to the existing one.
points(canada$Freshwater, canada$Marine, col = 3, pch = 15)
legend("topright", legend = c("Alaskan Salmon", "Canadian Salmon"), pch = c(20, 15), col = 2:3)

The objective of linear discriminant analysis is to fit a line that separates the Alaskan and Canadian fish.

Linear Discriminant Analysis

Load the package MASS.

library(MASS)
lda1 <- lda(salmon$Origin~salmon$Freshwater+salmon$Marine , na.action = "na.omit" )
lda1
## Call:
## lda(salmon$Origin ~ salmon$Freshwater + salmon$Marine, na.action = "na.omit")
## 
## Prior probabilities of groups:
##  Alaskan Canadian 
##      0.5      0.5 
## 
## Group means:
##          salmon$Freshwater salmon$Marine
## Alaskan              98.38        429.66
## Canadian            137.46        366.62
## 
## Coefficients of linear discriminants:
##                           LD1
## salmon$Freshwater  0.04458572
## salmon$Marine     -0.01803856

The output of the linear discriminant analysis starts with the prior (baseline) probabilities of a fish being Alaskan or Canadian. By default these are the group proportions in the data, here 0.5 each.

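The coefficients of linear discriminants give a1 and a2 (here 0.0446 for Freshwater and -0.0180 for Marine). They fix the direction of the separating line, so that we can draw it once a cutoff is chosen.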
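As a rough sketch, the line can be reconstructed from the numbers printed by lda() alone. This assumes equal priors (0.5 each, as in the printout), in which case the boundary passes through the midpoint of the two group means. The names a, mu, mid, score, slope and intercept are introduced here for illustration and are not part of the MASS output.

```r
# LD1 coefficients, copied from the lda() printout.
a <- c(Freshwater = 0.04458572, Marine = -0.01803856)

# Group means, copied from the lda() printout.
mu <- matrix(c(98.38, 137.46, 429.66, 366.62), nrow = 2,
             dimnames = list(c("Alaskan", "Canadian"),
                             c("Freshwater", "Marine")))

# Assumption: equal priors, so the boundary goes through the midpoint of the means.
mid <- colMeans(mu)

# Discriminant score, centred so that the cutoff is 0.
score <- function(freshwater, marine) {
  a[["Freshwater"]] * (freshwater - mid[["Freshwater"]]) +
    a[["Marine"]] * (marine - mid[["Marine"]])
}

# Boundary score = 0, solved for Marine: Marine = intercept + slope * Freshwater.
slope     <- -a[["Freshwater"]] / a[["Marine"]]
intercept <- mid[["Marine"]] - slope * mid[["Freshwater"]]

# With the scatter plot open, this would add the line to it:
# abline(a = intercept, b = slope, lty = 2)
```

Under these assumptions the slope is roughly 2.47 and the intercept about 107. Fish with a positive score (larger freshwater ring, smaller marine ring) fall on the Canadian side of the line and fish with a negative score on the Alaskan side.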

Prediction of the goodness of fit

salmon1 <- predict(lda1)
confus_m <- table(salmon$Origin, salmon1$class)
confus_m
##           
##            Alaskan Canadian
##   Alaskan       44        6
##   Canadian       1       49

So the classification protocol correctly classified 44 Alaskan and 49 Canadian fish. Misclassification rate = (6 + 1)/100 = 7%.
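The same figure can be computed from the confusion matrix rather than by hand. The sketch below retypes the table so that it runs on its own; in the session above, confus_m could be used directly. The names cm and misclass_rate are illustrative.

```r
# Confusion matrix from above: rows = true origin, columns = predicted origin.
cm <- matrix(c(44, 1, 6, 49), nrow = 2,
             dimnames = list(truth     = c("Alaskan", "Canadian"),
                             predicted = c("Alaskan", "Canadian")))

# Correct classifications sit on the diagonal; everything else is an error.
misclass_rate <- 1 - sum(diag(cm)) / sum(cm)
```

Because the fish being predicted are the same fish used to fit the model, this rate tends to be optimistic compared with the error on new fish.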

Support Vector Machines

Support vector machines can be seen as an extension of the LDA method. We transfer the problem to a higher dimension by adding powers and products of the predictor values. If x = (x1, x2) is an observation, then we build a higher-dimensional vector like this:
(x1, x2, x1^2, x2^2, x1*x2).
In this higher dimension we can potentially find a hyperplane that separates the groups, even when no line separates them in the original space.
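A small illustration of why the extra dimensions help, using made-up data: two rings of points around the origin cannot be separated by a line in (x1, x2), but a hyperplane in the expanded space separates them. This is not an SVM fit; the hyperplane weights w and the cutoff are chosen by hand, and the helper phi is a name introduced here.

```r
# Made-up data: an inner ring (radius 1) and an outer ring (radius 3).
theta <- seq(0, 2 * pi, length.out = 20)
X <- rbind(cbind(cos(theta), sin(theta)),
           cbind(3 * cos(theta), 3 * sin(theta)))
label <- rep(c("inner", "outer"), each = length(theta))

# Map (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2).
phi <- function(X) {
  cbind(x1 = X[, 1], x2 = X[, 2],
        x1sq = X[, 1]^2, x2sq = X[, 2]^2,
        x1x2 = X[, 1] * X[, 2])
}

# Hand-picked hyperplane in the expanded space: x1^2 + x2^2 = 4.
w      <- c(0, 0, 1, 1, 0)
cutoff <- 4

# A linear rule in the expanded space, which is a circle in the original space.
score <- as.vector(phi(X) %*% w)
pred  <- ifelse(score < cutoff, "inner", "outer")
```

Here every point is classified correctly by a rule that is linear in the five expanded coordinates. A support vector machine would choose the hyperplane from the data instead of by hand.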