PCA: Dimensionality Reduction Classification using Iris Dataset
library(DT)
library(caret)
library(nnet)
Objective
This is a short demonstration of how PCA can be used to reduce dimensionality before fitting a machine learning model. The example uses the iris dataset, which we looked at in class, and extends the class material to show how principal components can serve as model inputs.
data(iris)
datatable(iris)
Create Data Partition
A training and a testing dataset are needed to properly train the model and assess performance. I used an 80:20 split for this demo.
set.seed(1234)
# Stratified 80:20 split on Species; list = FALSE returns a matrix of row indices
train_ind <- createDataPartition(iris[,"Species"], p = 0.8, list = FALSE)
train <- iris[train_ind, ]
test <- iris[-train_ind, ]
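For readers without caret, the same kind of class-stratified 80:20 split can be sketched in base R. This is only an illustration of what a stratified split looks like, not a reimplementation of createDataPartition, and it will not select the same rows as the split above. The names rows_by_species, train_ind_base, train_base, and test_base are hypothetical.

```r
data(iris)
set.seed(1234)

# Group row indices by species (hypothetical helper name)
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Draw 80% of the rows within each species so class proportions are preserved
train_ind_base <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

train_base <- iris[train_ind_base, ]
test_base  <- iris[-train_ind_base, ]

# Each species contributes 40 training rows and 10 test rows
table(train_base$Species)
table(test_base$Species)
```

Sampling within each class keeps the 1:1:1 species balance in both partitions, which is why the test confusion matrices later in the demo have 10 plants per class.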
Run PCA on Training Data and Transform Testing Data
It is important to run the PCA on only the training dataset, not on the test dataset or on the full iris dataset. If the full dataset is used, information from the test data leaks into the transformation, so test performance no longer reflects the model's ability to generalize. Running PCA on the train and test sets separately creates two different component spaces, so a model fit in one cannot be applied in the other. The correct approach is to fit the PCA on the training data and then transform the test data into that same space, using the training centering, scaling, and rotation.
# Fit PCA on the four numeric training columns (column 5 is Species)
pca.train <- prcomp(train[,-5],
                    center = TRUE,
                    scale. = TRUE)
train.df <- data.frame(pca.train$x, train[,"Species"])
# Project the test data onto the training components
test.df <- predict(pca.train, newdata = test)
test.df <- data.frame(test.df, test[,"Species"])
colnames(train.df)[5] <- "Species"
colnames(test.df)[5] <- "Species"
summary(pca.train)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7029 0.9701 0.37127 0.14612
## Proportion of Variance 0.7249 0.2353 0.03446 0.00534
## Cumulative Proportion  0.7249 0.9602 0.99466 1.00000
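To make the "same space" idea concrete, the sketch below checks that projecting test data with predict() is equivalent to standardizing the test rows with the training means and standard deviations and then multiplying by the training rotation matrix. It uses base R only. The random split and the names idx, train_x, test_x, pca, manual, and auto are illustrative assumptions, not the objects created above.

```r
data(iris)
set.seed(1234)

# Hypothetical simple 80:20 split, used only for this check
idx     <- sample(nrow(iris), size = 120)
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]

# Fit PCA on the training rows only
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Manual projection: standardize test rows with the TRAINING center and scale,
# then rotate with the TRAINING loadings
manual <- scale(test_x, center = pca$center, scale = pca$scale) %*% pca$rotation

# predict() applies the stored training parameters to the new data
auto <- predict(pca, newdata = test_x)

# TRUE when the two projections agree
isTRUE(all.equal(manual, auto))
```

Nothing about the test rows is used to estimate the center, scale, or rotation, which is what keeps the test set free of leakage.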
Model 1: Multinomial Logistic Regression using all variables in Training Data
This model uses Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width to predict the Species of the plant. It generalized fairly well to unseen data, with an overall accuracy of 0.9 on the test set. The model uses all 4 dimensions; PCA can produce similar results with fewer.
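# Make setosa the reference class for the multinomial model
train$Species <- relevel(train$Species, ref = "setosa")
model.all.var <- multinom(Species ~ ., data = train)
pred <- predict(model.all.var, test)
# Rows are predicted classes, columns are actual classes
tab <- table(pred, test$Species)
confusionMatrix(tab)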
## Confusion Matrix and Statistics
##
##
## pred         setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         2
##   virginica       0          1         8
##
## Overall Statistics
##
##                Accuracy : 0.9
##                  95% CI : (0.7347, 0.9789)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : 1.665e-10
##
##                   Kappa : 0.85
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           0.8000
## Specificity                 1.0000            0.9000           0.9500
## Pos Pred Value              1.0000            0.8182           0.8889
## Neg Pred Value              1.0000            0.9474           0.9048
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.2667
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9000           0.8750
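The headline statistics above can be reproduced by hand from the confusion matrix, which helps when reading the per-class rows. The sketch below copies the Model 1 counts from the output and uses base R only. The names classes, tab1, accuracy, sensitivity, specificity, and ppv are hypothetical, and the formulas are the standard one-vs-rest definitions rather than caret's internal code.

```r
# Counts copied from the Model 1 confusion matrix
# (rows = predicted class, columns = actual class; matrix() fills column by column)
classes <- c("setosa", "versicolor", "virginica")
tab1 <- matrix(c(10, 0, 0,    # actual setosa
                  0, 9, 1,    # actual versicolor
                  0, 2, 8),   # actual virginica
               nrow = 3,
               dimnames = list(pred = classes, actual = classes))

# Overall accuracy: correct predictions over all test rows
accuracy <- sum(diag(tab1)) / sum(tab1)

# Sensitivity: correct predictions over the actual members of each class
sensitivity <- diag(tab1) / colSums(tab1)

# Positive predictive value: correct predictions over all predictions of each class
ppv <- diag(tab1) / rowSums(tab1)

# Specificity: 1 minus false positives over the actual non-members of each class
false_pos   <- rowSums(tab1) - diag(tab1)
specificity <- 1 - false_pos / (sum(tab1) - colSums(tab1))

accuracy
round(rbind(sensitivity, specificity, ppv), 4)
```

For example, virginica sensitivity is 8 / 10 = 0.8 because two of the ten actual virginica plants were predicted as versicolor.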
Model 2: Multinomial Logistic Regression using First Principal Component from PCA Training Data
This second model uses only the first principal component, PC1, from the PCA fit on the training data to predict the Species of the plant. It achieved the same overall accuracy (0.9) as the first model, although the versicolor and virginica errors are distributed differently. PCA reduced the number of dimensions needed to reach this level of accuracy from 4 to 1.
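# Make setosa the reference class, then model Species from PC1 alone
train.df$Species <- relevel(train.df$Species, ref = "setosa")
model.pca <- multinom(Species ~ PC1, data = train.df)
pred <- predict(model.pca, test.df)
# Rows are predicted classes, columns are actual classes
tab <- table(pred, test.df$Species)
confusionMatrix(tab)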
## Confusion Matrix and Statistics
##
##
## pred         setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          8         1
##   virginica       0          2         9
##
## Overall Statistics
##
##                Accuracy : 0.9
##                  95% CI : (0.7347, 0.9789)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : 1.665e-10
##
##                   Kappa : 0.85
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8000           0.9000
## Specificity                 1.0000            0.9500           0.9000
## Pos Pred Value              1.0000            0.8889           0.8182
## Neg Pred Value              1.0000            0.9048           0.9474
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2667           0.3000
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.8750           0.9000
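A natural follow-up is to see how test accuracy changes as more components are added. The sketch below is a self-contained variation on the demo, assuming the nnet package is installed. It uses a base R stratified split rather than the caret split, so its accuracies will not necessarily match the numbers reported above. The names pca, train_pc, test_pc, and acc are hypothetical.

```r
library(nnet)  # multinom(); the same package the demo loads

data(iris)
set.seed(1234)

# Hypothetical stratified 80:20 split in base R (not the caret split used above)
train_ind <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           function(rows) sample(rows, size = 40)))
train <- iris[train_ind, ]
test  <- iris[-train_ind, ]

# Fit PCA on training data only, then project both sets into that space
pca      <- prcomp(train[, -5], center = TRUE, scale. = TRUE)
train_pc <- data.frame(pca$x, Species = train$Species)
test_pc  <- data.frame(predict(pca, newdata = test), Species = test$Species)

# Test accuracy of a multinomial model using the first k components, k = 1..4
acc <- sapply(1:4, function(k) {
  f   <- reformulate(paste0("PC", 1:k), response = "Species")
  fit <- multinom(f, data = train_pc, trace = FALSE)
  mean(predict(fit, newdata = test_pc) == test_pc$Species)
})
names(acc) <- paste0("k=", 1:4)
acc
```

With only 30 test rows, each misclassification moves accuracy by about 0.033, so small differences between values of k should be read cautiously.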
Summary
PCA can be used to reduce the number of features in a dataset. Each principal component accounts for a percentage of the variability in the data. In the iris training dataset, PC1 accounts for 72.49% of the overall variance, and PC1 and PC2 together account for over 96%. Though the iris dataset does not need dimensionality reduction to run effective models, this demo shows that a model built on a single principal component can match the test accuracy of a more complex model that uses all four original variables.
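The variance percentages quoted here follow directly from the component standard deviations: each component's share is its squared standard deviation divided by the sum of all squared standard deviations. The sketch below recomputes them from the values printed by summary(pca.train). The 95% cutoff and the names sdev, var_explained, and cum_explained are illustrative assumptions.

```r
# Standard deviations copied from the summary(pca.train) output
sdev <- c(PC1 = 1.7029, PC2 = 0.9701, PC3 = 0.37127, PC4 = 0.14612)

# Proportion of variance = component variance / total variance
var_explained <- sdev^2 / sum(sdev^2)
cum_explained <- cumsum(var_explained)

round(var_explained, 4)
round(cum_explained, 4)

# Smallest number of components reaching a hypothetical 95% threshold
which(cum_explained >= 0.95)[1]
```

Because the variables were scaled before PCA, the total variance is approximately 4, one unit per original variable.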