PCA (principal component analysis) creates a new coordinate system by converting a set of highly correlated variables into orthogonal, i.e. uncorrelated, variables.
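As a quick illustrative sketch (not part of the analysis below, and assuming only base R), the new uncorrelated variables can be obtained from the eigen decomposition of the correlation matrix of the standardized variables; prcomp(), used later, performs the equivalent computation via a singular value decomposition.
#sketch: PCA via eigen decomposition of the correlation matrix
x <- scale(iris[, 1:4])            #standardize the four numeric variables
e <- eigen(cor(iris[, 1:4]))       #eigenvectors = loadings, eigenvalues = component variances
scores <- x %*% e$vectors          #component scores (match prcomp()$x up to sign)
round(cor(scores), 10)             #the new variables are uncorrelated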
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
The summary shows that the variables have different ranges, e.g. the minimum value is 4.3 for Sepal.Length, 2.0 for Sepal.Width, 1.0 for Petal.Length and 0.1 for Petal.Width. Whenever the variables are on different scales, it is a good idea to standardize the data so that one variable does not overshadow all the others.
For the factor variable, Species, we can see that each of the three levels has an equal number of data points, 50.
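As a small illustration of what standardization means (prcomp() will handle this for us later through its center and scale. arguments), the numeric columns can be standardized explicitly with scale().
#standardize the four numeric columns: subtract the mean, divide by the standard deviation
iris_std <- scale(iris[, 1:4])
round(colMeans(iris_std), 10)   #column means are now 0
apply(iris_std, 2, sd)          #column standard deviations are now 1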
#data partition
set.seed(419)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8,0.2))
train <- iris[ind==1,]
test <- iris[ind==2,]
dim(train)
## [1] 113 5
dim(test)
## [1] 37 5
#scatter plot & correlations
library(psych)
pairs.panels(train[,-5], gap=0, bg=c("red","yellow", "blue")[train$Species], pch=21)
In developing predictive models, high correlation among the independent variables gives rise to multicollinearity. In that case the coefficient estimates we get for the model are very unstable, and the predictions may not be accurate. One way to handle this is through PCA.
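The pairwise correlations displayed by pairs.panels() can also be inspected numerically; large off-diagonal values, for example between Petal.Length and Petal.Width, are the source of the multicollinearity concern.
#numeric view of the correlations among the independent variables
round(cor(train[, -5]), 2)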
#Principal Component Analysis
pc <- prcomp(train[,-5], center = TRUE, scale. = TRUE)
pc
## Standard deviations (1, .., p=4):
## [1] 1.7215199 0.9373729 0.3734702 0.1349866
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5192954 -0.38123664 0.7175786 0.2647109
## Sepal.Width -0.2898304 -0.91973272 -0.2315345 -0.1283815
## Petal.Length 0.5763578 -0.03442750 -0.1385648 -0.8046280
## Petal.Width 0.5604840 -0.08697755 -0.6420845 0.5157714
prcomp() performs PCA on the given data matrix and returns the results as an object of class prcomp. PCA is done on the independent variables only. The center and scale. arguments standardize the data to mean 0 and standard deviation 1, so all the independent variables are standardized before the PCA is carried out.
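As a quick sanity check (just a sketch), the component scores stored in pc$x are the standardized training variables multiplied by the rotation matrix.
#reproduce the principal component scores by hand
z <- scale(train[, -5], center = pc$center, scale = pc$scale)   #standardize with the stored means and sds
all.equal(unname(z %*% pc$rotation), unname(pc$x))              #TRUE, up to numerical precision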
attributes(pc)
## $names
## [1] "sdev" "rotation" "center" "scale" "x"
##
## $class
## [1] "prcomp"
#averages (means) of the four variables
pc$center
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.824779 3.066372 3.720354 1.184956
mean(train$Sepal.Length)
## [1] 5.824779
#calculate standard deviations of the four variables
pc$scale
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.8402307 0.4425193 1.7989659 0.7788752
sd(train$Sepal.Length)
## [1] 0.8402307
#standard deviations and rotations (also called loadings)
print(pc)
## Standard deviations (1, .., p=4):
## [1] 1.7215199 0.9373729 0.3734702 0.1349866
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5192954 -0.38123664 0.7175786 0.2647109
## Sepal.Width -0.2898304 -0.91973272 -0.2315345 -0.1283815
## Petal.Length 0.5763578 -0.03442750 -0.1385648 -0.8046280
## Petal.Width 0.5604840 -0.08697755 -0.6420845 0.5157714
In the above output we have four principal components. Each principal component is a normalized linear combination of the original variables, which are listed here.
Rotations, or loadings, are the coefficients of these linear combinations of the continuous variables.
The PC1, PC2, PC3 and PC4 coefficients always lie between -1 and +1.
PC1 increases as Sepal.Length, Petal.Length and Petal.Width increase, whereas PC1 increases as Sepal.Width decreases, because the correlation between PC1 and Sepal.Width is negative. In terms of correlation, the highest value is for Petal.Length and the lowest is for Sepal.Width.
Similarly, for PC2 the largest value in magnitude is about -0.92: as PC2 increases, Sepal.Width decreases, so they have a negative correlation that is very close to -1. This principal component is therefore mainly characterized by Sepal.Width, as the other correlations are very low.
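One quick check (a small sketch): because each principal component is a normalized linear combination, the loadings in every column of pc$rotation have a sum of squares equal to 1, which is why the individual coefficients always lie between -1 and +1.
#each column of loadings has unit length
colSums(pc$rotation^2)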
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7215 0.9374 0.37347 0.13499
## Proportion of Variance 0.7409 0.2197 0.03487 0.00456
## Cumulative Proportion 0.7409 0.9606 0.99544 1.00000
We have the standard deviation, proportion of variance and cumulative proportion for each component.
PC1 alone captures about 74% of the variability, whereas PC2 captures about 22%. Looking at the cumulative proportion, more than 96% of the overall variability is already explained by the time we reach PC2. PC3 and PC4 do not play much of a role in explaining the variability.
In this case, we say PC1 and PC2 capture the majority of the variation.
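The numbers reported by summary(pc) can be reproduced directly from the component standard deviations.
#proportion of variance explained by each component and the cumulative proportion
pve <- pc$sdev^2 / sum(pc$sdev^2)
round(pve, 4)          #matches the Proportion of Variance row
round(cumsum(pve), 4)  #matches the Cumulative Proportion row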
#orthogonality of principal components
#scatter plot with principal components
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
pairs.panels(pc$x, gap=0, bg=c("red","green","blue")[train$Species], pch=21)
Since the principal components are orthogonal to each other, the correlations between them are always zero. This gets rid of the multicollinearity problem that occurs when many of the original variables are highly correlated.
Looking at the principal components, which are linear combinations of the standardized variables, we find that the correlations are now zero.
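This can be verified numerically: the correlations among the component scores are zero up to floating point error.
#correlations among the principal component scores
round(cor(pc$x), 10)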
print(pc)
## Standard deviations (1, .., p=4):
## [1] 1.7215199 0.9373729 0.3734702 0.1349866
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5192954 -0.38123664 0.7175786 0.2647109
## Sepal.Width -0.2898304 -0.91973272 -0.2315345 -0.1283815
## Petal.Length 0.5763578 -0.03442750 -0.1385648 -0.8046280
## Petal.Width 0.5604840 -0.08697755 -0.6420845 0.5157714
The correlation between PC1 and Petal.Length, about 0.58, is the highest; the correlation between PC1 and Petal.Width, about 0.56, is the second highest; and the correlation between PC1 and Sepal.Length, about 0.52, is the third highest.
The correlation between PC1 and Sepal.Width, on the other hand, is negative, at about -0.29.
Similarly, the correlations between PC2 and all of the variables are negative, and Sepal.Width has the strongest negative correlation. This is what we see in the graph above around the PC2 zero line: as PC2 increases, Sepal.Width goes down much more than the other three variables.
A data point far away on the right side has a high value for PC1, which also means it has high values for Petal.Length, Petal.Width and Sepal.Length.
Similarly, a data point on the left side has a low value for PC1, which means it has low values for Petal.Length, Petal.Width and Sepal.Length.
Likewise, a point with a high value for PC2 will tend to have a low value for Sepal.Width, because Sepal.Width is the dominant variable for PC2 and its correlation is negative.
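To make this concrete, a small illustrative check is to look at the training rows with the largest and smallest PC1 scores and confirm the pattern in their original measurements.
#observation with the highest PC1 score: large Petal.Length, Petal.Width and Sepal.Length
train[which.max(pc$x[, "PC1"]), ]
#observation with the lowest PC1 score: small values for those three variables
train[which.min(pc$x[, "PC1"]), ]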
The biplot is a very important tool for understanding a PCA.
#Bi-plot needs the devtools package to install ggbiplot from GitHub (the install line is commented out below)
library(devtools)
## Warning: package 'devtools' was built under R version 3.6.2
## Loading required package: usethis
## Warning: package 'usethis' was built under R version 3.6.2
#install_github("vqv/ggbiplot")
library(ggbiplot)
## Loading required package: plyr
## Loading required package: scales
##
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
##
## alpha, rescale
## Loading required package: grid
#ellipse.prob default value is 0.68 or 68%
g <- ggbiplot(pc, obs.scale=1, var.scale=1, groups =train$Species, circle=TRUE, ellipse=TRUE,
ellipse.prob=0.68)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position='top')
print(g)
On the x-axis, the plot has PC1, which explains 74.1% of the overall variability.
On the y-axis, it has PC2, which explains 22.0% of the variability.
Since we have given ellipse.prob=0.68, each ellipse captures 68% of the data in its class of species; these are the red, green and blue ellipses we see. We can change the probability to any required value, for example with ellipse.prob=0.95 each ellipse captures 95% of its class, so the ellipses get larger.
#Bi-plot with ellipse.prob=.95
g <- ggbiplot(pc, obs.scale=1, var.scale=1, groups =train$Species, circle=TRUE, ellipse=TRUE,
ellipse.prob=0.95)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position='top')
print(g)
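If ggbiplot cannot be installed (it is fetched from GitHub with devtools here), base R offers a simpler biplot that shows the same loading directions, though without the group colours and ellipses.
#fallback biplot using base R (no grouping or ellipses)
biplot(pc, scale = 0, cex = 0.6)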
#screeplot or elbow graph
screeplot(pc, type="lines")
#Kaiser's Criterion
pc$sdev^2
## [1] 2.96363071 0.87866793 0.13947996 0.01822139
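Under Kaiser's criterion, components whose eigenvalue (variance) exceeds 1 are retained; the check can be done programmatically, as sketched below. In this output only PC1 exceeds 1, while PC2 falls somewhat short, so the cumulative proportion of variance is what motivates keeping the first two components for the model later on.
#Kaiser's criterion: keep components whose variance (eigenvalue) is greater than 1
which(pc$sdev^2 > 1)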
#prediction with Principal components to create training dataset
trg <- predict(pc, train)
head(trg)
## PC1 PC2 PC3 PC4
## 2 -1.980261 0.711941701 0.235638645 0.113500649
## 4 -2.199130 0.638306376 -0.080593426 -0.054751770
## 5 -2.311431 -0.580471994 0.007109791 -0.029063873
## 6 -2.020666 -1.413559178 0.003772130 0.008178089
## 7 -2.355695 0.005532621 -0.312294066 -0.030839036
## 9 -2.223785 1.146646098 -0.139052280 -0.015010599
#prediction with principal components to create testing dataset
tst <- predict(pc, test)
head(tst)
## PC1 PC2 PC3 PC4
## 1 -2.184131 -0.4180047 0.14483426 0.03145219
## 3 -2.266899 0.3889209 -0.03210782 0.03719577
## 8 -2.148401 -0.1667055 0.10405113 -0.01576810
## 15 -2.143058 -1.7709879 0.49644764 0.19638099
## 19 -1.841719 -1.3306705 0.39473913 0.06548324
## 20 -2.276619 -1.0546059 -0.10227132 -0.03408958
#now we add the target variable, Species, to the principal components to create the train dataset
trg <- data.frame(trg, train[5])
head(trg)
## PC1 PC2 PC3 PC4 Species
## 2 -1.980261 0.711941701 0.235638645 0.113500649 setosa
## 4 -2.199130 0.638306376 -0.080593426 -0.054751770 setosa
## 5 -2.311431 -0.580471994 0.007109791 -0.029063873 setosa
## 6 -2.020666 -1.413559178 0.003772130 0.008178089 setosa
## 7 -2.355695 0.005532621 -0.312294066 -0.030839036 setosa
## 9 -2.223785 1.146646098 -0.139052280 -0.015010599 setosa
#adding target variable to the test dataset
tst <- data.frame(tst, test[5])
head(tst)
## PC1 PC2 PC3 PC4 Species
## 1 -2.184131 -0.4180047 0.14483426 0.03145219 setosa
## 3 -2.266899 0.3889209 -0.03210782 0.03719577 setosa
## 8 -2.148401 -0.1667055 0.10405113 -0.01576810 setosa
## 15 -2.143058 -1.7709879 0.49644764 0.19638099 setosa
## 19 -1.841719 -1.3306705 0.39473913 0.06548324 setosa
## 20 -2.276619 -1.0546059 -0.10227132 -0.03408958 setosa
Now the data is ready, in terms of principal components, for both train and test, so we can develop a model.
#build a multinomial logistic regression model with the first two PCs
library(nnet)
#relevel - the code below sets the reference level of Species to setosa (the first level)
trg$Species <- relevel(trg$Species, ref = "setosa")
#multinomial logistic regression
#we use the first two PCs because together they capture more than 95% of the variability
model <- multinom(Species ~ PC1+PC2, data = trg)
## # weights: 12 (6 variable)
## initial value 124.143189
## iter 10 value 18.039300
## iter 20 value 16.937119
## final value 16.858234
## converged
#the model converges very quickly and we get a solution
summary(model)
## Call:
## multinom(formula = Species ~ PC1 + PC2, data = trg)
##
## Coefficients:
## (Intercept) PC1 PC2
## versicolor 6.9718797 9.756222 3.767176
## virginica 0.2414769 15.918514 4.717151
##
## Std. Errors:
## (Intercept) PC1 PC2
## versicolor 38.16614 35.93577 52.88910
## virginica 38.20266 35.96631 52.89339
##
## Residual Deviance: 33.71647
## AIC: 45.71647
#confusion matrix & misclassification error - train
p.trg <- predict(model, trg)
tab <- table(p.trg, trg$Species)
tab
##
## p.trg setosa versicolor virginica
## setosa 40 0 0
## versicolor 0 30 4
## virginica 0 4 35
The value 4 in the virginica column of the versicolor row: these 4 observations actually belong to virginica, but the model predicted versicolor, so this is a misclassification error.
The value 4 in the versicolor column of the virginica row: these 4 observations actually belong to versicolor, but the model predicted virginica.
#calculate misclassification error with train
1-sum(diag(tab))/sum(tab)
## [1] 0.07079646
Misclassification error with the train data is about 7.1%.
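Since caret is already loaded, its confusionMatrix() function can report the same table together with accuracy and per-class statistics; this is an optional alternative to building the table by hand.
#same confusion matrix with accuracy and per-class statistics via caret
confusionMatrix(p.trg, trg$Species)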
#confusion matrix & misclassification error - test
p.tst <- predict(model,tst)
tab.tst <- table(p.tst, tst$Species)
tab.tst
##
## p.tst setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 12 1
## virginica 0 4 10
#calculate misclassification error with test
1-sum(diag(tab.tst))/sum(tab.tst)
## [1] 0.1351351
Misclassification error with the test data is about 13.5%.
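To score a completely new observation with this model, the same two-step pipeline applies: project the raw measurements onto the principal components with predict(pc, ...) and pass the resulting scores to the multinomial model. The measurements below are made up purely for illustration.
#hypothetical new flower; the measurement values are chosen only for illustration
new_obs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 3.0,
                      Petal.Length = 4.5, Petal.Width = 1.5)
new_pc <- predict(pc, new_obs)                      #project onto the principal components
predict(model, data.frame(new_pc), type = "class")  #predicted species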