When faced with a data set containing many numeric variables, some of which may be highly correlated, it is good practice to reduce the number of correlated variables. Principal component analysis (PCA) reduces both the number of variables and multicollinearity by producing a few new features that explain most of the variability in the data set.
I am going to use the iris data set to demonstrate how to perform a principal component analysis.
library(datasets)
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The data set has 150 observations and 5 variables. The first four variables are numeric, while the last is a factor with 3 levels.
Looking at the summary of the data set.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
From the summary statistics, the numeric variables are measured on different scales, so it is good practice to standardize them before PCA. The factor variable has an equal number of observations in each level.
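Standardizing puts each numeric variable on a comparable footing. Below is a minimal sketch of how the columns could be standardized manually with scale(); in this post we will instead let prcomp() do it for us through its center and scale. arguments.
# Standardize the four numeric columns: mean 0, standard deviation 1
iris_scaled <- scale(iris[, -5])
round(colMeans(iris_scaled), 2)      # column means are now 0
round(apply(iris_scaled, 2, sd), 2)  # column standard deviations are now 1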
We partition the data set into training and testing data sets.
library(caret)
set.seed(100)
ind <- createDataPartition(iris$Species,p=0.80,list = F)
train <- iris[ind,]
test <- iris[-ind,]
dim(train)
## [1] 120 5
dim(test)
## [1] 30 5
The train data set has 120 observations and the test data set has 30 observations.
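Because createDataPartition() samples within each level of Species, the equal class balance of the original data should be preserved in both splits; a quick sanity check:
# Class counts in each split (stratified sampling keeps them balanced)
table(train$Species)
table(test$Species)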
We explore the data set by means of scatter plots and correlation coefficients.
library(psych)
pairs.panels(train[,-5], gap = 0, bg = c("red","blue","yellow")[train$Species],
             pch = 21)
From the scatter plots, some variables are very highly correlated, with correlations above 0.8. This leads to a multicollinearity problem when building predictive models. Principal component analysis can help solve this problem.
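To see the exact figures behind the scatter plots, we can print the correlation matrix of the numeric predictors directly; this is just a quick check, and Petal.Length and Petal.Width in particular should show a correlation well above 0.9.
# Pairwise correlations among the numeric predictors
round(cor(train[, -5]), 2)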
We build a principal component analysis model.
pc <- prcomp(train[,-5],center = T,scale. = T)
pc
## Standard deviations (1, .., p=4):
## [1] 1.7205132 0.9347160 0.3801546 0.1470468
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5169663 0.39069181 0.7158654 0.2600817
## Sepal.Width -0.2957684 0.91562347 -0.2381302 -0.1320937
## Petal.Length 0.5760094 0.03313838 -0.1418198 -0.8043645
## Petal.Width 0.5598929 0.08885612 -0.6408728 0.5175969
Looking at the summary of the principal component analysis.
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.721 0.9347 0.38015 0.14705
## Proportion of Variance 0.740 0.2184 0.03613 0.00541
## Cumulative Proportion 0.740 0.9585 0.99459 1.00000
From the summary, the first component (PC1) explains about 74% of the total variance, and by the second component (PC2) about 96% of the total variance has been explained.
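The proportions reported by summary() can be reproduced directly from the component standard deviations; a short sketch (prop_var is just an illustrative helper name):
# Variance of each component divided by the total variance
prop_var <- pc$sdev^2 / sum(pc$sdev^2)
round(prop_var, 4)          # proportion of variance per component
round(cumsum(prop_var), 4)  # cumulative proportion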
pairs.panels(pc$x, gap = 0, bg = c("red","blue","yellow")[train$Species],
             pch = 21)
From the scatter plot, you can see that there is zero correlation among the principal components.
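We can confirm this numerically: the correlations between the principal component scores should be zero up to rounding.
# Correlation matrix of the principal component scores
round(cor(pc$x), 3)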
Making a biplot of the principal component analysis.
library(devtools)
library(ggbiplot)
ggbiplot(pc, obs.scale = 1, var.scale = 1,
         groups = train$Species, ellipse = TRUE, circle = TRUE,
         ellipse.prob = 0.68) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top')
We first prepare the data sets by projecting the train and test sets onto the principal components.
pred <- predict(pc,train)
train_1 <- data.frame(pred,train[5])
pred1 <- predict(pc,test)
test_1 <- data.frame(pred1,test[5])
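For intuition, predict() on a prcomp object centres and scales the new data using the training means and standard deviations stored in the model, then multiplies by the rotation (loading) matrix. The sketch below reproduces the test-set scores manually (manual is just an illustrative name) and should match pred1.
# Manual projection of the test set onto the principal components
manual <- scale(test[, -5], center = pc$center, scale = pc$scale) %*% pc$rotation
head(manual)
head(pred1)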
We train a multinomial logistic regression model because our target variable has three levels.
library(nnet)
set.seed(100)
mymodel <- multinom(Species ~ PC1 + PC2, data = train_1)
## # weights: 12 (6 variable)
## initial value 131.833475
## iter 10 value 16.703027
## iter 20 value 15.868024
## iter 30 value 15.800425
## iter 40 value 15.798653
## iter 40 value 15.798653
## final value 15.798653
## converged
summary(mymodel)
## Call:
## multinom(formula = Species ~ PC1 + PC2, data = train_1)
##
## Coefficients:
## (Intercept) PC1 PC2
## versicolor 6.9733588 10.52931 -4.896882
## virginica -0.9828699 17.78230 -4.497136
##
## Std. Errors:
## (Intercept) PC1 PC2
## versicolor 64.92176 62.31290 99.26767
## virginica 64.96125 62.34533 99.27009
##
## Residual Deviance: 31.59731
## AIC: 43.59731
Let's calculate the misclassification error on the train data set.
library(caret)
prd <- predict(mymodel,train_1)
confusionMatrix(prd,train_1$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 40 0 0
## versicolor 0 36 4
## virginica 0 4 36
##
## Overall Statistics
##
## Accuracy : 0.9333
## 95% CI : (0.8729, 0.9708)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9000 0.9000
## Specificity 1.0000 0.9500 0.9500
## Pos Pred Value 1.0000 0.9000 0.9000
## Neg Pred Value 1.0000 0.9500 0.9500
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3000 0.3000
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.9250 0.9250
We have an overall accuracy of 93%.
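The misclassification error itself is just the share of wrong predictions, i.e. one minus the accuracy, which here works out to 8/120, or about 6.7%.
# Training misclassification error: proportion of wrong predictions
mean(prd != train_1$Species)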
Calculating the misclassification error on the test data set.
prt <- predict(mymodel,test_1)
confusionMatrix(prt,test_1$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 8 2
## virginica 0 2 8
##
## Overall Statistics
##
## Accuracy : 0.8667
## 95% CI : (0.6928, 0.9624)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.296e-09
##
## Kappa : 0.8
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8000 0.8000
## Specificity 1.0000 0.9000 0.9000
## Pos Pred Value 1.0000 0.8000 0.8000
## Neg Pred Value 1.0000 0.9000 0.9000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2667 0.2667
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.8500 0.8500
We have an accuracy of about 87%.
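As before, the test misclassification error is the proportion of wrong test-set predictions, 4/30 or about 13%.
# Test misclassification error
mean(prt != test_1$Species)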
Principal component analysis can only be applied to numeric variables.
Thank you.