PCA (principal component analysis) creates a new coordinate system by converting a set of highly correlated variables into orthogonal, i.e. uncorrelated, variables.
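As a quick illustrative sketch (not part of the analysis below, and assuming only base R), the new uncorrelated variables can be obtained from the eigen decomposition of the correlation matrix of the standardized variables; prcomp(), used later, performs the equivalent computation via a singular value decomposition.
#sketch: PCA via eigen decomposition of the correlation matrix
x <- scale(iris[, 1:4])            #standardize the four numeric variables
e <- eigen(cor(iris[, 1:4]))       #eigenvectors = loadings, eigenvalues = component variances
scores <- x %*% e$vectors          #component scores (match prcomp()$x up to sign)
round(cor(scores), 10)             #the new variables are uncorrelated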
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
The summary shows that the variables have different ranges, e.g. the minimum value is 4.3 for Sepal.Length, 2.0 for Sepal.Width, 1.0 for Petal.Length and 0.1 for Petal.Width. Whenever the variables are on different scales, it is a good idea to standardize the data so that one variable does not overshadow all the others.
For the factor variable, Species, we can see that each of the three levels has an equal number of data points, 50.
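As a small illustration of what standardization means (prcomp() will handle this for us later through its center and scale. arguments), the numeric columns can be standardized explicitly with scale().
#standardize the four numeric columns: subtract the mean, divide by the standard deviation
iris_std <- scale(iris[, 1:4])
round(colMeans(iris_std), 10)   #column means are now 0
apply(iris_std, 2, sd)          #column standard deviations are now 1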
#data partition
set.seed(419)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8,0.2))
train <- iris[ind==1,]
test <- iris[ind==2,]
dim(train)
## [1] 113 5
dim(test)
## [1] 37 5
#scatter plot & correlations
library(psych)
pairs.panels(train[,-5], gap=0, bg=c("red","yellow", "blue")[train$Species], pch=21)
In developing predictive models, high correlation among the independent variables gives rise to multicollinearity. In that case the coefficient estimates we get for the model are very unstable, and the predictions may not be accurate. One way to handle this is through PCA.
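The pairwise correlations displayed by pairs.panels() can also be inspected numerically; large off-diagonal values, for example between Petal.Length and Petal.Width, are the source of the multicollinearity concern.
#numeric view of the correlations among the independent variables
round(cor(train[, -5]), 2)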
#Principal Component Analysis
pc <- prcomp(train[,-5], center = TRUE, scale. = TRUE)
pc
## Standard deviations (1, .., p=4):
## [1] 1.7215199 0.9373729 0.3734702 0.1349866
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5192954 -0.38123664 0.7175786 0.2647109
## Sepal.Width -0.2898304 -0.91973272 -0.2315345 -0.1283815
## Petal.Length 0.5763578 -0.03442750 -0.1385648 -0.8046280
## Petal.Width 0.5604840 -0.08697755 -0.6420845 0.5157714
prcomp() performs PCA on the given data matrix and returns the results as an object of class prcomp. PCA is done on the independent variables only. The center and scale. arguments standardize the data to mean 0 and standard deviation 1, so all the independent variables are standardized before the PCA is carried out.
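As a quick sanity check (just a sketch), the component scores stored in pc$x are the standardized training variables multiplied by the rotation matrix.
#reproduce the principal component scores by hand
z <- scale(train[, -5], center = pc$center, scale = pc$scale)   #standardize with the stored means and sds
all.equal(unname(z %*% pc$rotation), unname(pc$x))              #TRUE, up to numerical precision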
attributes(pc)
## $names
## [1] "sdev" "rotation" "center" "scale" "x"
##
## $class
## [1] "prcomp"
#averages (means) of the four variables
pc$center
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.824779 3.066372 3.720354 1.184956
mean(train$Sepal.Length)
## [1] 5.824779
#calculate standard deviations of the four variables
pc$scale
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.8402307 0.4425193 1.7989659 0.7788752
sd(train$Sepal.Length)
## [1] 0.8402307
#standard deviations and rotations (also called loadings)
print(pc)
## Standard deviations (1, .., p=4):
## [1] 1.7215199 0.9373729 0.3734702 0.1349866
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5192954 -0.38123664 0.7175786 0.2647109
## Sepal.Width -0.2898304 -0.91973272 -0.2315345 -0.1283815
## Petal.Length 0.5763578 -0.03442750 -0.1385648 -0.8046280
## Petal.Width 0.5604840 -0.08697755 -0.6420845 0.5157714
In the above output we have four principal components. Each principal component is a normalized linear combination of the original variables, which are listed here.
Rotations, or loadings, are the coefficients of these linear combinations of the continuous variables.
The PC1, PC2, PC3 and PC4 coefficients always lie between -1 and +1.
PC1 increases as Sepal.Length, Petal.Length and Petal.Width increase, whereas PC1 increases as Sepal.Width decreases, because the correlation between PC1 and Sepal.Width is negative. In terms of correlation, the highest value is for Petal.Length and the lowest is for Sepal.Width.
Similarly, for PC2 the largest value in magnitude is about -0.92: as PC2 increases, Sepal.Width decreases, so they have a negative correlation that is very close to -1. This principal component is therefore mainly characterized by Sepal.Width, as the other correlations are very low.
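One quick check (a small sketch): because each principal component is a normalized linear combination, the loadings in every column of pc$rotation have a sum of squares equal to 1, which is why the individual coefficients always lie between -1 and +1.
#each column of loadings has unit length
colSums(pc$rotation^2)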
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7215 0.9374 0.37347 0.13499
## Proportion of Variance 0.7409 0.2197 0.03487 0.00456
## Cumulative Proportion 0.7409 0.9606 0.99544 1.00000
We have the standard deviation, proportion of variance and cumulative proportion for each component.
PC1 alone captures about 74% of the variability, whereas PC2 captures about 22%. Looking at the cumulative proportion, more than 96% of the overall variability is already explained by the time we reach PC2. PC3 and PC4 do not play much of a role in explaining the variability.
In this case, we say PC1 and PC2 capture the majority of the variation.
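The numbers reported by summary(pc) can be reproduced directly from the component standard deviations.
#proportion of variance explained by each component and the cumulative proportion
pve <- pc$sdev^2 / sum(pc$sdev^2)
round(pve, 4)          #matches the Proportion of Variance row
round(cumsum(pve), 4)  #matches the Cumulative Proportion row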
#orthogonality of principal components
#scatter plot with principal components
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
pairs.panels(pc$x, gap=0, bg=c("red","green","blue")[train$Species], pch=21)
Since the principal components are orthogonal to each other, the correlations between them are always zero. This gets rid of the multicollinearity problem that occurs when many of the original variables are highly correlated.
Looking at the principal components, which are linear combinations of the standardized variables, we find that the correlations are now zero.
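This can be verified numerically: the correlations among the component scores are zero up to floating point error.
#correlations among the principal component scores
round(cor(pc$x), 10)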
print(pc)
## Standard deviations (1, .., p=4):
## [1] 1.7215199 0.9373729 0.3734702 0.1349866
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5192954 -0.38123664 0.7175786 0.2647109
## Sepal.Width -0.2898304 -0.91973272 -0.2315345 -0.1283815
## Petal.Length 0.5763578 -0.03442750 -0.1385648 -0.8046280
## Petal.Width 0.5604840 -0.08697755 -0.6420845 0.5157714
The correlation between PC1 and Petal.Length, about 0.58, is the highest; the correlation between PC1 and Petal.Width, about 0.56, is the second highest; and the correlation between PC1 and Sepal.Length, about 0.52, is the third highest.
The correlation between PC1 and Sepal.Width, on the other hand, is negative, at about -0.29.
Similarly, the correlations between PC2 and all of the variables are negative, and Sepal.Width has the strongest negative correlation. This is what we see in the graph above around the PC2 zero line: as PC2 increases, Sepal.Width goes down much more than the other three variables.
A data point far away on the right side has a high value for PC1, which also means it has high values for Petal.Length, Petal.Width and Sepal.Length.
Similarly, a data point on the left side has a low value for PC1, which means it has low values for Petal.Length, Petal.Width and Sepal.Length.
Likewise, a point with a high value for PC2 will tend to have a low value for Sepal.Width, because Sepal.Width is the dominant variable for PC2 and its correlation is negative.
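To make this concrete, a small illustrative check is to look at the training rows with the largest and smallest PC1 scores and confirm the pattern in their original measurements.
#observation with the highest PC1 score: large Petal.Length, Petal.Width and Sepal.Length
train[which.max(pc$x[, "PC1"]), ]
#observation with the lowest PC1 score: small values for those three variables
train[which.min(pc$x[, "PC1"]), ]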
The biplot is a very important tool for understanding a PCA.
#Bi-plot needs the devtools package to install ggbiplot from GitHub (the install line is commented out below)
library(devtools)
## Warning: package 'devtools' was built under R version 3.6.2
## Loading required package: usethis
## Warning: package 'usethis' was built under R version 3.6.2
#install_github("vqv/ggbiplot")
library(ggbiplot)
## Loading required package: plyr
## Loading required package: scales
##
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
##
## alpha, rescale
## Loading required package: grid
#ellipse.prob default value is 0.68 or 68%
g <- ggbiplot(pc, obs.scale=1, var.scale=1, groups =train$Species, circle=TRUE, ellipse=TRUE,
ellipse.prob=0.68)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position='top')
print(g)
On the x-axis, the plot has PC1, which explains 74.1% of the overall variability.
On the y-axis, it has PC2, which explains 22.0% of the variability.
Since we have given ellipse.prob=0.68, each ellipse captures 68% of the data in its class of species; these are the red, green and blue ellipses we see. We can change the probability to any required value, for example with ellipse.prob=0.95 each ellipse captures 95% of its class, so the ellipses get larger.
#Bi-plot with ellipse.prob=.95
g <- ggbiplot(pc, obs.scale=1, var.scale=1, groups =train$Species, circle=TRUE, ellipse=TRUE,
ellipse.prob=0.95)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position='top')
print(g)
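If ggbiplot cannot be installed (it is fetched from GitHub with devtools here), base R offers a simpler biplot that shows the same loading directions, though without the group colours and ellipses.
#fallback biplot using base R (no grouping or ellipses)
biplot(pc, scale = 0, cex = 0.6)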
#screeplot or elbow graph
screeplot(pc, type="lines")
#Kaiser's Criterion
pc$sdev^2
## [1] 2.96363071 0.87866793 0.13947996 0.01822139
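Under Kaiser's criterion, components whose eigenvalue (variance) exceeds 1 are retained; the check can be done programmatically, as sketched below. In this output only PC1 exceeds 1, while PC2 falls somewhat short, so the cumulative proportion of variance is what motivates keeping the first two components for the model later on.
#Kaiser's criterion: keep components whose variance (eigenvalue) is greater than 1
which(pc$sdev^2 > 1)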
#prediction with Principal components to create training dataset
trg <- predict(pc, train)
head(trg)
## PC1 PC2 PC3 PC4
## 2 -1.980261 0.711941701 0.235638645 0.113500649
## 4 -2.199130 0.638306376 -0.080593426 -0.054751770
## 5 -2.311431 -0.580471994 0.007109791 -0.029063873
## 6 -2.020666 -1.413559178 0.003772130 0.008178089
## 7 -2.355695 0.005532621 -0.312294066 -0.030839036
## 9 -2.223785 1.146646098 -0.139052280 -0.015010599
#prediction with principal components to create testing dataset
tst <- predict(pc, test)
head(tst)
## PC1 PC2 PC3 PC4
## 1 -2.184131 -0.4180047 0.14483426 0.03145219
## 3 -2.266899 0.3889209 -0.03210782 0.03719577
## 8 -2.148401 -0.1667055 0.10405113 -0.01576810
## 15 -2.143058 -1.7709879 0.49644764 0.19638099
## 19 -1.841719 -1.3306705 0.39473913 0.06548324
## 20 -2.276619 -1.0546059 -0.10227132 -0.03408958
#now we add the target variable, Species, to the principal components to create the train dataset
trg <- data.frame(trg, train[5])
head(trg)
## PC1 PC2 PC3 PC4 Species
## 2 -1.980261 0.711941701 0.235638645 0.113500649 setosa
## 4 -2.199130 0.638306376 -0.080593426 -0.054751770 setosa
## 5 -2.311431 -0.580471994 0.007109791 -0.029063873 setosa
## 6 -2.020666 -1.413559178 0.003772130 0.008178089 setosa
## 7 -2.355695 0.005532621 -0.312294066 -0.030839036 setosa
## 9 -2.223785 1.146646098 -0.139052280 -0.015010599 setosa
#adding target variable to the test dataset
tst <- data.frame(tst, test[5])
head(tst)
## PC1 PC2 PC3 PC4 Species
## 1 -2.184131 -0.4180047 0.14483426 0.03145219 setosa
## 3 -2.266899 0.3889209 -0.03210782 0.03719577 setosa
## 8 -2.148401 -0.1667055 0.10405113 -0.01576810 setosa
## 15 -2.143058 -1.7709879 0.49644764 0.19638099 setosa
## 19 -1.841719 -1.3306705 0.39473913 0.06548324 setosa
## 20 -2.276619 -1.0546059 -0.10227132 -0.03408958 setosa
Now the data is ready, in terms of principal components, for both train and test, so we can develop a model.
#build a multinomial logistic regression model with the first two PCs
library(nnet)
#relevel - the code below sets the reference level of Species to setosa (the first level)
trg$Species <- relevel(trg$Species, ref = "setosa")
#multinomial logistic regression
#we use the first two PCs because together they capture more than 95% of the variability
model <- multinom(Species ~ PC1+PC2, data = trg)
## # weights: 12 (6 variable)
## initial value 124.143189
## iter 10 value 18.039300
## iter 20 value 16.937119
## final value 16.858234
## converged
#the model converges very quickly and we get a solution
summary(model)
## Call:
## multinom(formula = Species ~ PC1 + PC2, data = trg)
##
## Coefficients:
## (Intercept) PC1 PC2
## versicolor 6.9718797 9.756222 3.767176
## virginica 0.2414769 15.918514 4.717151
##
## Std. Errors:
## (Intercept) PC1 PC2
## versicolor 38.16614 35.93577 52.88910
## virginica 38.20266 35.96631 52.89339
##
## Residual Deviance: 33.71647
## AIC: 45.71647
#confusion matrix & misclassification error - train
p.trg <- predict(model, trg)
tab <- table(p.trg, trg$Species)
tab
##
## p.trg setosa versicolor virginica
## setosa 40 0 0
## versicolor 0 30 4
## virginica 0 4 35
The value 4 in the virginica column of the versicolor row: these 4 observations actually belong to virginica, but the model predicted versicolor, so this is a misclassification error.
The value 4 in the versicolor column of the virginica row: these 4 observations actually belong to versicolor, but the model predicted virginica.
#calculate misclassification error with train
1-sum(diag(tab))/sum(tab)
## [1] 0.07079646
Misclassification error with the train data is about 7.1%.
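Since caret is already loaded, its confusionMatrix() function can report the same table together with accuracy and per-class statistics; this is an optional alternative to building the table by hand.
#same confusion matrix with accuracy and per-class statistics via caret
confusionMatrix(p.trg, trg$Species)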
#confusion matrix & misclassification error - test
p.tst <- predict(model,tst)
tab.tst <- table(p.tst, tst$Species)
tab.tst
##
## p.tst setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 12 1
## virginica 0 4 10
#calculate misclassification error with test
1-sum(diag(tab.tst))/sum(tab.tst)
## [1] 0.1351351
Misclassification error with the test data is about 13.5%.
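To score a completely new observation with this model, the same two-step pipeline applies: project the raw measurements onto the principal components with predict(pc, ...) and pass the resulting scores to the multinomial model. The measurements below are made up purely for illustration.
#hypothetical new flower; the measurement values are chosen only for illustration
new_obs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 3.0,
                      Petal.Length = 4.5, Petal.Width = 1.5)
new_pc <- predict(pc, new_obs)                      #project onto the principal components
predict(model, data.frame(new_pc), type = "class")  #predicted species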