Lab 4: LDA and KNN Classification

Installing the required packages
Visualization
- Explore the Data with some 3D graph
Training set and Test set
Linear Discriminant Analysis
Assignment

At this time, you should be familiar with the KNN and linear discriminant analysis model. During this lab, you will compare KNN classification and LDA, and compare your KNN classification result with the result from knn.

Installing the required packages

We need to load a few important packages first to begin our analysis

library(ggplot2)
library(GGally)
library(caret)
library(tidyr)
library(knitr)
library(rgl)

You are going to use lots of functions from caret package. Click here for the references.

Visualization

Can you try to reproduce this graph?

The top right triangular shows the correlation coefficients between different variable. In RStudio, Pearson Correlation Coefficient is calculating the correlation coefficient between two variable. \[ \rho_P = \frac{(x-\bar{x})(y-\bar{y})}{\sqrt{(x-\bar{x})^2(y-\bar{y})^2}} \] ‘cor()’ function will calculate the Pearson correlation coefficient. Here, we specify method="pearson"

a <- c(1,2,3)
b <- c(1,9,8)
cor(a,b,method = "pearson")

## [1] 0.8029551

To test if this relationship is statistically significant, use cor.test for correlation test.

cor.test(a,b,method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  a and b
## t = 1.3472, df = 1, p-value = 0.4065
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
##       cor 
## 0.8029551

In this hypothesis test, the null hypothesis is true correlation is equal to 0 (a and b are not correlated), while the alternative is true correlation is not equal to 0. Here the p-value is greater than 0.05, which suggest the null is true. There is a significantly correlation between a and b. [a and b are correlated, or associated. There is nothing on causation.]

Explore the Data with some 3D graph

The rgl package comes with the plot3d() function that is pretty close from the base R plot() function. Instead of providing just an x and a y argument, you also have to provide the z coordinate. Note that the output is interactive by default.

This is to output a rgl plot in a rmarkdown document. Note that you must add webgl=TRUE, results='hide' in the chunck header. For Mac user, download XQuartz from XQuartz.org before generating the 3d plot. For windows user, download X11.

library(knitr)
library(rgl)
knit_hooks$set(webgl = hook_webgl)

data <- iris
mycolors <- c('royalblue1', 'darkcyan', 'oldlace')
data$color <- mycolors[ as.numeric(data$Species) ]

plot3d( 
  x=data$`Sepal.Length`, y=data$`Sepal.Width`, z=data$`Petal.Length`, 
  col = data$color, 
  type = 's',  # try type = 'p' to see the difference
  radius = .1,
  xlab="Sepal Length", ylab="Sepal Width", zlab="Petal Length")

You must enable Javascript to view this page properly.

Suppose the response variable is Species. Let’s use Sepal.Length, Sepal.Width, Petal.Length and Petal.Width to set up a classifier.

Training set and Test set

We are going to use a different way to split the training set and test set.

In the previous lab, we randomly choose 70% of the data as training set, and 30% as the test set. That works most of the time. However, when there are more than 3 classes, or when we have imbalanced data(the prior probabilities of each class are not the same), simple random sampling has potential problem. Instead, stratified sampling will be used here. we sample 70% data from each class. createDataPartition function in caretpackage will create the index number for training set.

See here for more information on createDataPartition.

set.seed(1)
# We use the dataset to create a partition (70% training 30% testing)
index <- createDataPartition(iris$Species, p=0.70, list=FALSE)

# select 70% of data to train the models
training <- iris[index,]
train.x <- data.frame(training[,-5]) 
train.y <- training$Species


# select 30% of the data for testing
test <- iris[-index,]
test.x <- data.frame(test[,-5]) 
test.y <- test$Species

Now that we have loaded and prepared our data for analysis, we are ready to move onto the next step which is Exploring, Summarizing, Plotting and Understanding the Data`.

Let’s revisit the research problem:

The Problem: Given a set of data about the flowers (column 1 through 4) can we predict which of the 3 classes of flowers it belongs to.

We will build and fit a classification models to the training set and try to learn from the trainset data. This is machine learning. We will later use the best model selected (or even a combination of models) to predict the classification for the test. We will then measure how well our model is expected to perform in the real world.

Linear Discriminant Analysis

Use train function in caret package and specify method = "lda". See the R documentation here

x: predictors in training set.
y: response in training set.
method: The chosen model.

set.seed(1)
# Fit the model
model.lda<-train(x = train.x, y = train.y, method = "lda")
# Print the model
print(model.lda)

## Linear Discriminant Analysis 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9826199  0.9735528

The Accuracy here is the training accuracy. Remember to compare the model performance, we need to get the error metrics on the test set. Again, use predict function for prediction, and calculate the confusion matrix for test set.

## Verify the accuracy on the training set and testing set

pred_test <- predict(model.lda, test.x)
confusionMatrix(pred_test,test$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         13         0
##   virginica       0          2        15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9556          
##                  95% CI : (0.8485, 0.9946)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9333          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8667           1.0000
## Specificity                 1.0000            1.0000           0.9333
## Pos Pred Value              1.0000            1.0000           0.8824
## Neg Pred Value              1.0000            0.9375           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2889           0.3333
## Detection Prevalence        0.3333            0.2889           0.3778
## Balanced Accuracy           1.0000            0.9333           0.9667

Since we have 3 classes, the confusion matrix will be \(3\times3\). The error metrics are calculated for each class!

Let us now visualize the classification result.

ggplot(test) + ggtitle("LDA results for test data") +
  geom_text(aes(Petal.Width, Sepal.Width,label=Species, color = pred_test))

If there is no misclassification, there should be only one unique label within each color. For example, there should exist misclassification if there is any red ‘versicolor’, or blue ‘setosa’, or green ‘virginica’ on the graph.

Assignment

Dr. Wang is going to wear a ECG holter for 2 weeks. As you may see, wearable sensor including accelerometer is becoming more and more popular in clinical research. In this assignmenet, let’s apply the classification methods to the sequential motion data, and build a classifier for different activity class. Please visit UCI Machine Learning Repository for the description of the study and the dataset. You are going to write a short statistical report on this study. You will need to

read through the information, download the data( let’s use d1p01M, make sure to read the `README.txt``), import it to R;
use the acceleration readings as the predictors to predict the label of activity
try two different classification models,
define your own error metrics. Are you able to distinguish all 4 types of activities? If not, which 2 or 3 of them are the most indistinguishable? Write a short paragraph to discuss your findings.