When faced with a data set containing many numeric variables, some of which may be highly correlated, it is good practice to reduce the number of correlated variables. Principal component analysis (PCA) reduces both the number of variables and multicollinearity by producing a few new features that explain most of the variability in the data set.
I am going to use the iris data set to demonstrate how to perform a principal component analysis.
library(datasets)
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The data set has 150 observations and 5 variables. The first four variables are numeric, while the last is a factor with 3 levels.
Looking at the summary of the data set.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
From the summary statistics, the numeric variables are measured on different scales, so it is good practice to standardize them before PCA. The factor variable has an equal number of observations in each level.
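Standardizing puts each numeric variable on a comparable footing. Below is a minimal sketch of how the columns could be standardized manually with scale(); in this post we will instead let prcomp() do it for us through its center and scale. arguments.
# Standardize the four numeric columns: mean 0, standard deviation 1
iris_scaled <- scale(iris[, -5])
round(colMeans(iris_scaled), 2)      # column means are now 0
round(apply(iris_scaled, 2, sd), 2)  # column standard deviations are now 1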
We partition the data set into training and testing data sets.
library(caret)
set.seed(100)
ind <- createDataPartition(iris$Species,p=0.80,list = F)
train <- iris[ind,]
test <- iris[-ind,]
dim(train)
## [1] 120 5
dim(test)
## [1] 30 5
The train data set has 120 observations and the test data set has 30 observations.
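Because createDataPartition() samples within each level of Species, the equal class balance of the original data should be preserved in both splits; a quick sanity check:
# Class counts in each split (stratified sampling keeps them balanced)
table(train$Species)
table(test$Species)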
We explore the data set by means of scatter plots and correlation coefficients.
library(psych)
pairs.panels(train[,-5], gap = 0, bg = c("red","blue","yellow")[train$Species],
             pch = 21)
From the scatter plots, some variables are very highly correlated, with correlations above 0.8. This leads to a multicollinearity problem when building predictive models. Principal component analysis can help solve this problem.
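To see the exact figures behind the scatter plots, we can print the correlation matrix of the numeric predictors directly; this is just a quick check, and Petal.Length and Petal.Width in particular should show a correlation well above 0.9.
# Pairwise correlations among the numeric predictors
round(cor(train[, -5]), 2)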
We build a principal component analysis model.
pc <- prcomp(train[,-5],center = T,scale. = T)
pc
## Standard deviations (1, .., p=4):
## [1] 1.7205132 0.9347160 0.3801546 0.1470468
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5169663 0.39069181 0.7158654 0.2600817
## Sepal.Width -0.2957684 0.91562347 -0.2381302 -0.1320937
## Petal.Length 0.5760094 0.03313838 -0.1418198 -0.8043645
## Petal.Width 0.5598929 0.08885612 -0.6408728 0.5175969
Looking at the summary of the principal component analysis.
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.721 0.9347 0.38015 0.14705
## Proportion of Variance 0.740 0.2184 0.03613 0.00541
## Cumulative Proportion 0.740 0.9585 0.99459 1.00000
From the summary, the first component (PC1) explains about 74% of the total variance, and by the second component (PC2) about 96% of the total variance has been explained.
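The proportions reported by summary() can be reproduced directly from the component standard deviations; a short sketch (prop_var is just an illustrative helper name):
# Variance of each component divided by the total variance
prop_var <- pc$sdev^2 / sum(pc$sdev^2)
round(prop_var, 4)          # proportion of variance per component
round(cumsum(prop_var), 4)  # cumulative proportion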
pairs.panels(pc$x, gap = 0, bg = c("red","blue","yellow")[train$Species],
             pch = 21)
From the scatter plot, you can see that there is zero correlation among the principal components.
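We can confirm this numerically: the correlations between the principal component scores should be zero up to rounding.
# Correlation matrix of the principal component scores
round(cor(pc$x), 3)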
Making a biplot of the principal component analysis.
library(devtools)
library(ggbiplot)
ggbiplot(pc, obs.scale = 1, var.scale = 1,
         groups = train$Species, ellipse = TRUE, circle = TRUE,
         ellipse.prob = 0.68) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top')
We first prepare the data sets by projecting the train and test sets onto the principal components.
pred <- predict(pc,train)
train_1 <- data.frame(pred,train[5])
pred1 <- predict(pc,test)
test_1 <- data.frame(pred1,test[5])
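For intuition, predict() on a prcomp object centres and scales the new data using the training means and standard deviations stored in the model, then multiplies by the rotation (loading) matrix. The sketch below reproduces the test-set scores manually (manual is just an illustrative name) and should match pred1.
# Manual projection of the test set onto the principal components
manual <- scale(test[, -5], center = pc$center, scale = pc$scale) %*% pc$rotation
head(manual)
head(pred1)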
We train a multinomial logistic regression model because our target variable has three levels.
library(nnet)
set.seed(100)
mymodel <- multinom(Species ~ PC1 + PC2, data = train_1)
## # weights: 12 (6 variable)
## initial value 131.833475
## iter 10 value 16.703027
## iter 20 value 15.868024
## iter 30 value 15.800425
## iter 40 value 15.798653
## iter 40 value 15.798653
## final value 15.798653
## converged
summary(mymodel)
## Call:
## multinom(formula = Species ~ PC1 + PC2, data = train_1)
##
## Coefficients:
## (Intercept) PC1 PC2
## versicolor 6.9733588 10.52931 -4.896882
## virginica -0.9828699 17.78230 -4.497136
##
## Std. Errors:
## (Intercept) PC1 PC2
## versicolor 64.92176 62.31290 99.26767
## virginica 64.96125 62.34533 99.27009
##
## Residual Deviance: 31.59731
## AIC: 43.59731
Let's calculate the misclassification error on the train data set.
library(caret)
prd <- predict(mymodel,train_1)
confusionMatrix(prd,train_1$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 40 0 0
## versicolor 0 36 4
## virginica 0 4 36
##
## Overall Statistics
##
## Accuracy : 0.9333
## 95% CI : (0.8729, 0.9708)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9000 0.9000
## Specificity 1.0000 0.9500 0.9500
## Pos Pred Value 1.0000 0.9000 0.9000
## Neg Pred Value 1.0000 0.9500 0.9500
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3000 0.3000
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.9250 0.9250
We have an overall accuracy of 93%.
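The misclassification error itself is just the share of wrong predictions, i.e. one minus the accuracy, which here works out to 8/120, or about 6.7%.
# Training misclassification error: proportion of wrong predictions
mean(prd != train_1$Species)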
Calculating the misclassification error on the test data set.
prt <- predict(mymodel,test_1)
confusionMatrix(prt,test_1$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 8 2
## virginica 0 2 8
##
## Overall Statistics
##
## Accuracy : 0.8667
## 95% CI : (0.6928, 0.9624)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.296e-09
##
## Kappa : 0.8
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8000 0.8000
## Specificity 1.0000 0.9000 0.9000
## Pos Pred Value 1.0000 0.8000 0.8000
## Neg Pred Value 1.0000 0.9000 0.9000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2667 0.2667
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.8500 0.8500
We have an accuracy of about 87%.
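As before, the test misclassification error is the proportion of wrong test-set predictions, 4/30 or about 13%.
# Test misclassification error
mean(prt != test_1$Species)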
Principal component analysis can only be applied to numeric variables.
Thank you.