PCA: Dimensionality Reduction for Classification Using the Iris Dataset

library(DT)
library(caret)
library(nnet)

Objective

This is a short demonstration of how PCA can be used to reduce dimensionality before fitting a machine learning model. It uses the iris dataset, which we looked at in class, and extends the class example to show how principal components can serve as model inputs.

data(iris)

datatable(iris)

Create Data Partition

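A training dataset and a testing dataset are needed to train the model and to assess its performance on data it has not seen. I used an 80:20 split for this demo. Because Species is a factor, createDataPartition samples within each species, so both sets keep the same class proportions (120 training rows and 30 test rows, 10 per species in the test set).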

set.seed(1234)

train_ind <- createDataPartition(iris[,"Species"], p = 0.8, list = FALSE)

train <- iris[train_ind, ]
test <- iris[-train_ind, ]
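As a rough illustration of what a stratified 80:20 split does, the base-R sketch below samples 80% of the row indices within each species. It is not caret's implementation, and the names `idx_by_species`, `train_idx_sketch`, `train_sketch`, and `test_sketch` are hypothetical, chosen so they do not overwrite the objects created above.

```r
# Base-R sketch of a stratified 80:20 split (illustrative; not caret's code).
data(iris)
set.seed(1234)

# Row indices grouped by species
idx_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Sample 80% of the rows within each species
train_idx_sketch <- unlist(
  lapply(idx_by_species, function(idx) {
    sample(idx, size = round(0.8 * length(idx)))
  }),
  use.names = FALSE
)

train_sketch <- iris[train_idx_sketch, ]
test_sketch  <- iris[-train_idx_sketch, ]

# Each species should appear in the same proportion in both sets
table(train_sketch$Species)
table(test_sketch$Species)
```

Because the random draws differ from those made by createDataPartition, the rows selected here will generally not match `train` and `test`, even with the same seed.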

Run PCA on Training Data and Transform Testing Data

It is important to run the PCA on only the training dataset, not on the test dataset or on the full iris dataset. If the full dataset is used, information from the test set leaks into the components (data leakage), which makes the test results an overly optimistic estimate of how well the model generalizes. Running PCA on the training and test sets separately would create two different component spaces, so a model trained in one could not be applied to the other. The correct approach is to fit PCA on the training data and then project the test data into that same space, using the training centering, scaling, and rotation.

# Fit PCA on the four numeric training columns (column 5 is Species)
pca.train <- prcomp(train[, -5],
                    center = TRUE,
                    scale. = TRUE)

# Training scores plus the class label
train.df <- data.frame(pca.train$x, Species = train$Species)

# Project the test rows using the training center, scale, and rotation
test.pcs <- predict(pca.train, newdata = test[, -5])
test.df <- data.frame(test.pcs, Species = test$Species)
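To make the projection step concrete, the sketch below reproduces by hand what predict() does for a prcomp fit: standardize the new rows with the training means and standard deviations, then multiply by the training rotation matrix. It is self-contained and uses a simple random split in base R rather than the caret partition above; `in_train`, `x_train`, `x_test`, `pca_fit`, and `manual_proj` are hypothetical names.

```r
# Sketch: reproduce predict() for a prcomp fit by hand (base R only).
data(iris)
set.seed(1234)

# Simple random 80:20 split, used only to keep this sketch self-contained
in_train <- sample(seq_len(nrow(iris)), size = 120)
x_train  <- iris[in_train, 1:4]
x_test   <- iris[-in_train, 1:4]

pca_fit <- prcomp(x_train, center = TRUE, scale. = TRUE)

# Standardize the test rows with the TRAINING means and standard deviations,
# then rotate them onto the training principal axes
x_test_std  <- scale(x_test, center = pca_fit$center, scale = pca_fit$scale)
manual_proj <- x_test_std %*% pca_fit$rotation

# Should agree with predict() up to floating-point error
all.equal(unname(manual_proj), unname(predict(pca_fit, newdata = x_test)))
```

Nothing about the test rows influences `pca_fit$center`, `pca_fit$scale`, or `pca_fit$rotation`, which is what keeps the test set free of leakage.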

summary(pca.train)

## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7029 0.9701 0.37127 0.14612
## Proportion of Variance 0.7249 0.2353 0.03446 0.00534
## Cumulative Proportion  0.7249 0.9602 0.99466 1.00000
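The proportions in this table follow directly from the standard deviations: each component's variance is its squared standard deviation, divided by the total. The sketch below recomputes them from the rounded values printed above, so small rounding differences are possible. The 95% threshold and the names `prop_var`, `cum_var`, `threshold`, and `n_keep` are assumptions for illustration only.

```r
# Standard deviations copied from the summary(pca.train) output above
sdev <- c(PC1 = 1.7029, PC2 = 0.9701, PC3 = 0.37127, PC4 = 0.14612)

# Each component's variance is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

round(prop_var, 4)
round(cum_var, 4)

# Smallest number of components reaching a (hypothetical) 95% threshold
threshold <- 0.95
n_keep <- unname(which(cum_var >= threshold)[1])
n_keep
```

Under that threshold two components would be retained, since PC1 alone explains about 72% of the variance and PC1 with PC2 about 96%.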

Model 1: Multinomial Logistic Regression using all variables in Training Data

This model uses Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width to predict the Species of plant. It generalized fairly well to the held-out test data, with an overall accuracy of 0.9 (27 of 30 test observations classified correctly). The model uses all 4 original dimensions. PCA can produce similar results with fewer dimensions.

train$Species <- relevel(train$Species, ref = "setosa")

model.all.var <- multinom(Species ~ ., data = train)

pred <- predict(model.all.var, test)

tab <- table(pred, test$Species)

confusionMatrix(tab)

## Confusion Matrix and Statistics
## 
##             
## pred         setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         2
##   virginica       0          1         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7347, 0.9789)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.665e-10       
##                                           
##                   Kappa : 0.85            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           0.8000
## Specificity                 1.0000            0.9000           0.9500
## Pos Pred Value              1.0000            0.8182           0.8889
## Neg Pred Value              1.0000            0.9474           0.9048
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.2667
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9000           0.8750
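The headline numbers in this output can be checked by hand from the confusion matrix. The sketch below copies the Model 1 table and recomputes accuracy, sensitivity, and specificity with base R; it illustrates the definitions and is not how confusionMatrix is implemented internally. The object names are hypothetical.

```r
# Confusion matrix copied from the Model 1 output above
# (rows = predicted class, columns = true class)
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10, 0, 0,
                0, 9, 2,
                0, 1, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(pred = classes, truth = classes))

total <- sum(cm)
tp <- diag(cm)            # correctly predicted, per class
fp <- rowSums(cm) - tp    # predicted as the class but actually another
fn <- colSums(cm) - tp    # actually the class but predicted as another
tn <- total - tp - fp - fn

accuracy    <- sum(tp) / total
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

accuracy
round(sensitivity, 4)
round(specificity, 4)
```

The same calculation applies to any confusion table laid out with predictions in rows and true classes in columns.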

Model 2: Multinomial Logistic Regression using First Principal Component from PCA Training Data

This second model uses only the first principal component (PC1) from the PCA of the training data to predict the Species of the plant. It achieved the same overall accuracy (0.9) as the first model, although the individual errors differ: here two versicolor and one virginica were misclassified, versus one versicolor and two virginica for Model 1. PCA reduced the number of input dimensions needed to reach this accuracy from 4 to 1.

train.df$Species <- relevel(train.df$Species, ref = "setosa")

model.pca <- multinom(Species ~ PC1, data = train.df)

pred <- predict(model.pca, test.df)

tab <- table(pred, test.df$Species)

confusionMatrix(tab)

## Confusion Matrix and Statistics
## 
##             
## pred         setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          8         1
##   virginica       0          2         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7347, 0.9789)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.665e-10       
##                                           
##                   Kappa : 0.85            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8000           0.9000
## Specificity                 1.0000            0.9500           0.9000
## Pos Pred Value              1.0000            0.8889           0.8182
## Neg Pred Value              1.0000            0.9048           0.9474
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2667           0.3000
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.8750           0.9000
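A natural follow-up is to check whether adding PC2 changes the test accuracy. The sketch below is one way to set up that comparison with nnet. It is self-contained and uses a simple random split in base R, not the caret partition above, so its accuracies need not match the 0.9 reported here. The helper `test_accuracy` and the objects `train_pc` and `test_pc` are hypothetical.

```r
# Sketch: compare a PC1-only model with a PC1 + PC2 model (uses nnet).
library(nnet)

data(iris)
set.seed(1234)

# Simple random 80:20 split, used only to keep this sketch self-contained;
# it is NOT the caret partition used above, so accuracies may differ
in_train <- sample(seq_len(nrow(iris)), size = 120)
pca_fit  <- prcomp(iris[in_train, 1:4], center = TRUE, scale. = TRUE)

train_pc <- data.frame(pca_fit$x, Species = iris$Species[in_train])
test_pc  <- data.frame(predict(pca_fit, newdata = iris[-in_train, 1:4]),
                       Species = iris$Species[-in_train])

# Hypothetical helper: fit a multinomial model and return test accuracy
test_accuracy <- function(model_formula) {
  fit <- multinom(model_formula, data = train_pc, trace = FALSE)
  mean(predict(fit, newdata = test_pc) == test_pc$Species)
}

acc_pc1  <- test_accuracy(Species ~ PC1)
acc_pc12 <- test_accuracy(Species ~ PC1 + PC2)
c(PC1 = acc_pc1, PC1_PC2 = acc_pc12)
```

With only 30 test rows, a difference of one or two classifications between the two models should not be read as meaningful.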

Summary

PCA can be used to reduce the number of features in a dataset. Each principal component accounts for a share of the variability in the data. In the iris training dataset, PC1 accounts for 72.49% of the overall variance, and PC1 and PC2 together account for over 96%. Although the iris dataset does not need dimensionality reduction to support effective models, this demo shows that a model built on a single principal component can match the test accuracy of a model that uses all four original variables. With only 30 test observations, the 95% confidence interval for that accuracy is wide (0.7347 to 0.9789), so the comparison is illustrative rather than conclusive.