In this project, I will study principal component analysis (PCA) and how to apply it in R.

Principal component analysis (PCA) is a dimensionality-reduction method. It is useful when a dataset contains many predictors, not all of which are important for the analysis. PCA finds new variables that reduce the dimension of the dataset while keeping as much of the data’s variation as possible. The principal components describe the underlying structure of the data: they are the directions along which the data are most spread out. The first principal component is the straight line that captures the largest share of the variance in the data; the second, third, and subsequent components are then found in turn until all of the variance in the dataset is accounted for.
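
To make the idea concrete, here is a minimal sketch of PCA from first principles on a hypothetical numeric matrix X (toy data, not the iris set used below): the component directions are the eigenvectors of the covariance matrix of the centered and scaled data.

#Sketch: PCA via eigen-decomposition (X is a toy matrix standing in for real predictors)
X  <- matrix(rnorm(200), ncol = 4)
Xs <- scale(X)                      # center and scale each column
e  <- eigen(cov(Xs))                # eigen-decomposition of the covariance matrix
e$vectors                           # loading vectors (directions of maximum variance)
e$values / sum(e$values)            # proportion of total variance captured by each component
scores <- Xs %*% e$vectors          # coordinates of the observations on the new axes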

The iris dataset will be used to illustrate the concept of PCA.

data("iris")
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
pairs.panels(iris[,-5],
            gap=0,
            bg=c("red","yellow","blue")[iris$Species],
            pch=21)

set.seed(30)
id<-sample(2,nrow(iris),replace=TRUE, prob=c(0.7,0.3))
training<-iris[id==1,]
testing<-iris[id==2,]

Now we apply PCA to the training dataset. PCA works only on quantitative data, so we need to leave the Species variable out of the model.

#Principal component analysis
pca<-prcomp(training[,-5],
            center=TRUE,
            scale.=TRUE)
attributes(pca)
## $names
## [1] "sdev"     "rotation" "center"   "scale"    "x"       
## 
## $class
## [1] "prcomp"
#The standard deviation of each variable
pca$scale
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8339635    0.4357551    1.7904028    0.7668951
#The mean value of each variable
pca$center

The object has two relevant attributes: scale holds the standard deviation of each variable, and center holds the mean value of each variable.
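
As a quick sanity check (a small sketch reusing the objects above), these attributes are simply the column means and standard deviations of the training predictors:

#The stored center and scale are the column means and standard deviations
all.equal(pca$center, colMeans(training[, -5]))
all.equal(pca$scale, apply(training[, -5], 2, sd))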

print(pca)
## Standard deviations (1, .., p=4):
## [1] 1.7010568 0.9711254 0.3783093 0.1421386
## 
## Rotation (n x k) = (4 x 4):
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5243803 -0.36821883  0.7247239  0.2534075
## Sepal.Width  -0.2476450 -0.92848403 -0.2547587 -0.1081081
## Petal.Length  0.5829651 -0.02054087 -0.1533458 -0.7976308
## Petal.Width   0.5690773 -0.04370776 -0.6215772  0.5365468

Each principal component is a linear combination of the 4 variables in the model, and the rotation matrix shows the loading of each variable on each component. For example, PC1 increases when Sepal Length, Petal Length, and Petal Width increase, because their loadings are positive, whereas PC1 decreases as Sepal Width increases, because its loading is negative.
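
To make the linear-combination interpretation concrete, here is a small sketch (reusing the pca and training objects above) that rebuilds the scores stored in pca$x by hand from the loadings:

#Reproduce the PC scores manually: scale the predictors, then apply the rotation
Xtr <- scale(training[, -5], center = pca$center, scale = pca$scale)
manual_scores <- Xtr %*% pca$rotation   # each column is a linear combination of the 4 variables
max(abs(manual_scores - pca$x))         # effectively zero: the manual scores match pca$x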

#Variation
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7011 0.9711 0.37831 0.14214
## Proportion of Variance 0.7234 0.2358 0.03578 0.00505
## Cumulative Proportion  0.7234 0.9592 0.99495 1.00000

We can see that PC1 explains around 72.3% of the variance in the dataset, while PC2 explains nearly 24%. The cumulative proportion of variance of the 4 principal components equals 100%.

Usually, the first two principal components explain the majority of the variability.
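
The proportions reported by summary() come straight from the component standard deviations; a short sketch recomputes them and draws a basic scree plot:

#Proportion of variance explained, recomputed from the standard deviations
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 4)                 # matches the "Proportion of Variance" row
round(cumsum(prop_var), 4)         # matches the "Cumulative Proportion" row
screeplot(pca, type = "lines")     # visual check of how the variance drops off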

Now we create a scatter plot matrix of the four principal components to check the orthogonality between them.

pairs.panels(pca$x,
             gap=0,
             bg=c("red","yellow","green")[training$Species],
             pch=21)

We see that there is no correlation between the principal components, which is consistent with their orthogonality.
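
The orthogonality can also be checked numerically (a small sketch): the correlations between the score columns are essentially zero, and the loading vectors are orthonormal.

#Numerical check of orthogonality
round(cor(pca$x), 4)                          # off-diagonal entries are ~0
round(t(pca$rotation) %*% pca$rotation, 4)    # identity matrix: the loadings are orthonormal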

library(devtools)
## Warning: package 'devtools' was built under R version 4.2.2
## Loading required package: usethis
## Warning: package 'usethis' was built under R version 4.2.2
#install_github("vqv/ggbiplot")
library(ggbiplot)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 4.2.2
## Loading required package: scales
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale
## Loading required package: grid
g <- ggbiplot(pca,
              obs.scale = 1,
              var.scale = 1,
              groups = training$Species,
              ellipse = TRUE,
              circle = TRUE,
              ellipse.prob = 0.68)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
               legend.position = 'top')
print(g)

The plot shows that PC1 represents 72.3% of the variation in the dataset, and PC2 explains 23.6%. For PC1, the arrows for Petal Length, Petal Width, and Sepal Length point to the right, showing that these variables are positively correlated with PC1, while Sepal Width is negatively correlated with PC1.

Moreover, PCA is useful for clustering problems: in the biplot, virginica is characterized by high values of Petal Length, Petal Width, and Sepal Length.
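
The same rotation can also be applied to new data: the held-out testing set can be projected onto the components learned from the training set with predict() (a short sketch; the testing object was created in the split above).

#Project the testing observations onto the training-set principal components
pc_test <- predict(pca, newdata = testing[, -5])
head(pc_test)   # same four PC columns, now for unseen observations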

Now, we look at the difference between PCA and linear discriminant analysis (LDA).

library(MASS)
## Warning: package 'MASS' was built under R version 4.2.2
linear<-lda(Species~.,training)
library(devtools)
library(ggord)

g2<-ggord(linear,training$Species,ylim=c(-10,10)) 
plot(g2)

plot(g)

PCA is an unsupervised learning algorithm while LDA is a supervised learning algorithm. This means that PCA finds directions of maximum variance regardless of class labels while LDA finds directions of maximum class separability.

In general, you should use LDA when your goal is classification – that is, when you have labels for your data points and want to predict which label new points will have based on their feature values. On the other hand, if you don’t have labels for your data or if your goal is simply to find patterns in your data (not classification), then PCA will likely work better.
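
To illustrate the classification use case, the LDA model fitted above can predict the species of the held-out testing observations (a sketch reusing the linear and testing objects; accuracy depends on the random split).

#Predict species on the testing set with the fitted LDA model
lda_pred <- predict(linear, newdata = testing)
table(Predicted = lda_pred$class, Actual = testing$Species)   # confusion matrix
mean(lda_pred$class == testing$Species)                       # overall accuracy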