Objective:
Carry out a principal component analysis on the Flea Beetle data set and make an inference.
Data:
There are two flea beetle data sets. The first is called Haltica-Oleracea and the second is called Haltica-Carduorum.
Flea Beetle Data Set:
A data frame with 39 observations (19 from Haltica oleracea and 20 from H. carduorum, denoted by a factor) and four measurements.
Species - a factor with levels oleracea carduorum
TG (X1) - Distance of the Transverse Groove to the posterior border of the prothorax (microns)
Elytra (X2) - Length of the Elytra (in units of 0.01mm)
Second.Antenna (X3) - Length of the second antennal joint (microns)
Third.Antenna (X4) - Length of the third antennal joint (microns)
Method:
Find the covariance and correlation matrices and decide which one to use.
Describe the eigenvalues and their fractional contributions to the variance.
Give the eigenvectors for the principal components to retain.
Make appropriate plots of pairs of principal components.
Use the plots and the data to draw a conclusion.
library(corrplot)
## corrplot 0.90 loaded
library(AMR)
## Warning: package 'AMR' was built under R version 4.0.5
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data1 <- read.csv("ho.csv")
head(data1)
## Experiment x1 x2 x3 x4
## 1 1 189 245 137 163
## 2 2 192 260 132 217
## 3 3 217 276 141 192
## 4 4 221 299 142 213
## 5 5 171 239 128 158
## 6 6 192 262 147 173
Data set
The flea beetle data set includes an index column (Experiment), which needs to be removed before the analysis.
df1 <- data1 %>% select(x1,x2,x3,x4)
head(df1)
## x1 x2 x3 x4
## 1 189 245 137 163
## 2 192 260 132 217
## 3 217 276 141 192
## 4 221 299 142 213
## 5 171 239 128 158
## 6 192 262 147 173
Pairs Plot
pairs(df1)
The pairs plot gives a scatter plot for every pair of variables and is used to inspect the relationship between two variables at a time. We can notice that the pairs (X1, X2) and (X2, X4) appear more strongly related than the other pairs.
Covariance Matrix
S <- var(df1)
round(S,2)
## x1 x2 x3 x4
## x1 187.60 176.86 48.37 113.58
## x2 176.86 345.39 75.98 118.78
## x3 48.37 75.98 66.36 16.24
## x4 113.58 118.78 16.24 239.94
We can notice that X2 (345.39) and X4 (239.94) have high variances in comparison to the others. In particular, X3 (66.36) has a much lower variance, so it may contribute little to a covariance-based PCA.
The covariance matrix confirms the positive relationships between X1 and X2 and between X2 and X4.
Correlation Matrix
C <- cor(df1)
round(C,2)
## x1 x2 x3 x4
## x1 1.00 0.69 0.43 0.54
## x2 0.69 1.00 0.50 0.41
## x3 0.43 0.50 1.00 0.13
## x4 0.54 0.41 0.13 1.00
corrplot(C, method = "number")
The correlation matrix is a scaled version of the covariance matrix: each covariance is divided by the two standard deviations involved. This removes the effect of variables whose large variance would otherwise dominate.
As we see here, X1 is still highly correlated with X2 (0.69), but X2 and X4 are no longer among the most strongly correlated variables (0.41). X1 and X4 are also fairly correlated (0.54).
X3 had only small entries in the covariance matrix, but it shows moderate correlations with X1 (0.43) and X2 (0.50) in the correlation matrix.
In this analysis, I am going to use the covariance matrix, because the variables have roughly the same scale.
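The link between the two matrices can be checked with a short sketch. It re-enters the rounded covariance matrix printed above (so results agree only to about two decimals) and rescales it by the standard deviations; `cov2cor()` from the stats package does the same rescaling. The variance ratio at the end is only an informal aid for judging whether one variable would dominate, not a rule taken from this analysis.

```r
# Covariance matrix as printed above (rounded to two decimals).
vars <- c("x1", "x2", "x3", "x4")
S <- matrix(c(187.60, 176.86, 48.37, 113.58,
              176.86, 345.39, 75.98, 118.78,
               48.37,  75.98, 66.36,  16.24,
              113.58, 118.78, 16.24, 239.94),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Divide each covariance by the product of the two standard deviations.
sds <- sqrt(diag(S))
C_manual <- S / outer(sds, sds)

# Built-in equivalent of the same rescaling.
C_builtin <- cov2cor(S)
print(round(C_builtin, 2))

# Informal check: ratio of the largest to the smallest variance.
variance_ratio <- max(diag(S)) / min(diag(S))
print(round(variance_ratio, 2))
```

The rounded result should reproduce the correlation matrix shown earlier, which confirms that the two matrices carry the same information apart from scale.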
Eigen Values
eig<- eigen(S)
round(eig$values,2)
## [1] 561.31 168.99 65.28 43.71
The first two eigenvalues are considerably larger than the other two. This is a sign that two principal components explain a large share of the variability.
The sum of the eigenvalues equals the sum of the diagonal of the covariance matrix, that is, the total variance. Therefore, each eigenvalue divided by this sum gives the fraction of the total variability explained by the corresponding component.
round(eig$values/sum(eig$values),3)
## [1] 0.669 0.201 0.078 0.052
In this example:
the 1st eigenvalue explains about 66.9% of the variability,
the 2nd eigenvalue explains about 20.1% of the variability,
the 3rd eigenvalue explains about 7.8% of the variability,
the 4th eigenvalue explains about 5.2% of the variability.
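The trace identity behind these proportions can be verified numerically. The sketch below re-enters the rounded covariance matrix printed earlier, so the eigenvalues match the reported ones only up to rounding; the object names are chosen for illustration.

```r
# Covariance matrix as printed earlier (rounded to two decimals).
vars <- c("x1", "x2", "x3", "x4")
S <- matrix(c(187.60, 176.86, 48.37, 113.58,
              176.86, 345.39, 75.98, 118.78,
               48.37,  75.98, 66.36,  16.24,
              113.58, 118.78, 16.24, 239.94),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

eig_S <- eigen(S)

# Total variance computed two ways: trace of S and sum of the eigenvalues.
total_from_trace <- sum(diag(S))
total_from_eigen <- sum(eig_S$values)
print(c(trace = total_from_trace, eigen_sum = total_from_eigen))

# Fraction of the total variance carried by each component.
prop_var <- eig_S$values / total_from_eigen
print(round(prop_var, 3))
```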
Criteria for PCs Selection
We can use a scree plot and the cumulative proportion of variance to determine how many PCs to keep.
Cumulative Sum
round(cumsum(eig$values)/sum(eig$values),2)
## [1] 0.67 0.87 0.95 1.00
The cumulative sum shows that the first two PCs explain 87% of the variability, which is good enough to make an inference about the data set.
Scree Plot
plot(eig$values, type = "b")
A scree plot is used to determine the number of principal components to retain. The scree plot criterion looks for the “elbow” in the curve and selects all the components to the left of it.
I will select the first two principal components, because they explain 87% of the variability and the elbow of the scree plot occurs at that point.
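The cumulative-proportion criterion can be written as a small helper. This is a minimal sketch: the function name `n_components` and the 0.85 default threshold are assumptions made for illustration, and the eigenvalues are the rounded values reported above.

```r
# Hypothetical helper: smallest number of components whose cumulative
# proportion of variance reaches the given threshold.
n_components <- function(eigenvalues, threshold = 0.85) {
  cum_prop <- cumsum(eigenvalues) / sum(eigenvalues)
  which(cum_prop >= threshold)[1]
}

# Eigenvalues as reported above (rounded to two decimals).
eigenvalues <- c(561.31, 168.99, 65.28, 43.71)

k_default <- n_components(eigenvalues)        # threshold 0.85
k_strict  <- n_components(eigenvalues, 0.90)  # a stricter, also arbitrary, threshold
print(c(k_default = k_default, k_strict = k_strict))
```

The answer depends on the threshold chosen, which is why it is worth reading the cumulative proportion together with the scree plot rather than relying on either alone.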
Eigen Vector Analysis
round(eig$vectors,2)
## [,1] [,2] [,3] [,4]
## [1,] -0.50 0.01 0.82 0.27
## [2,] -0.72 -0.48 -0.48 0.14
## [3,] -0.17 -0.22 0.20 -0.94
## [4,] -0.45 0.85 -0.23 -0.17
When we look at the eigenvectors, we check the magnitudes and signs of their entries (the loadings).
Interpretation of PC1: It reflects the overall size of the beetle. PC1 is a linear combination of all four variables, and since all loadings have the same sign, there is no contrast among the variables on this component.
Interpretation of PC2: The 4th variable has the largest loading (0.85). The loadings of the 2nd and 3rd variables have the opposite sign (-0.48 and -0.22), and the loading of the 1st variable is close to zero. Therefore, PC2 is a contrast between X4 on one side and X2 and X3 on the other. A beetle with an extreme PC2 score has a long third antennal joint relative to its second antennal joint and elytra, or vice versa.
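To make the role of the loadings concrete, the sketch below computes principal component scores by hand as centered data multiplied by the eigenvectors. It uses only the six rows printed by head(df1), so it is a toy subset and its eigenvectors will not match the full-data values above; the object names are chosen for illustration.

```r
# Toy subset: the six rows printed by head(df1), not the full data set.
toy <- data.frame(
  x1 = c(189, 192, 217, 221, 171, 192),
  x2 = c(245, 260, 276, 299, 239, 262),
  x3 = c(137, 132, 141, 142, 128, 147),
  x4 = c(163, 217, 192, 213, 158, 173)
)

# Center each column, then project onto the eigenvectors of the covariance matrix.
X_centered <- scale(as.matrix(toy), center = TRUE, scale = FALSE)
eig_toy <- eigen(var(toy))
scores <- X_centered %*% eig_toy$vectors
colnames(scores) <- paste0("PC", 1:4)

# The variance of each score column equals the corresponding eigenvalue.
score_var <- apply(scores, 2, var)
print(round(rbind(score_var, eigenvalue = eig_toy$values), 2))

# prcomp() returns the same scores up to the sign of each column.
matches_prcomp <- isTRUE(all.equal(abs(unname(prcomp(toy)$x)),
                                   abs(unname(scores))))
print(matches_prcomp)
```

Because the sign of an eigenvector is arbitrary, only the contrast pattern within a component is meaningful, not which end is labelled positive.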
Analysis of PCs
Plot of pairs of PC1 & PC2
By using Correlation matrix.
pca1 <- princomp(df1, cor = TRUE)
summary(pca1)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.5476548 0.9417681 0.6571892 0.53473353
## Proportion of Variance 0.5988088 0.2217318 0.1079744 0.07148499
## Cumulative Proportion 0.5988088 0.8205406 0.9285150 1.00000000
pca1 <- princomp(df1, cor = TRUE)
biplot(pca1)
Plot of pairs of PC1 & PC2
By using Covariance matrix.
pca1 <- princomp(df1, cor = FALSE)
summary(pca1)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 23.0599943 12.6527406 7.86393386 6.43516871
## Proportion of Variance 0.6687938 0.2013460 0.07777743 0.05208273
## Cumulative Proportion 0.6687938 0.8701398 0.94791727 1.00000000
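The proportions of variance in the summary table match those computed from the eigenvalues. The standard deviations are slightly smaller than the square roots of the eigenvalues because princomp divides by n rather than n - 1.
pca1 <- princomp(df1, cor = FALSE)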
biplot(pca1)
With the biplot function, we can take a closer look at each beetle and interpret its size.
For example, beetles 2, 4, 5 and 16 are outliers, so we can discuss them.
Observation 2 is a beetle of about average size, but it has a large PC2 score. Since PC2 contrasts the third antennal joint with the second antennal joint and elytra, this beetle has either a very long third antennal joint and a very short second one, or the reverse.
Observation 4 is considerably larger than the average beetle. It also has either a long third antennal joint and a short second one, or the reverse.
Observation 5 is smaller than the average beetle. Its PC2 score is about average, however, so its second and third antennal joints are in typical proportion to each other.
The rest of the beetles are of about average size, with average antennal joint and elytra lengths.
Additional Analysis
boxplot(df1)
We can notice that there is no huge difference between the spreads of the four variables, so it is acceptable to use the covariance matrix. By using the covariance matrix, we do not lose any interpretability.
df1.pca <- prcomp(df1)
summary(df1.pca)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 23.6919 12.9995 8.07942 6.61151
## Proportion of Variance 0.6688 0.2014 0.07778 0.05208
## Cumulative Proportion 0.6688 0.8701 0.94792 1.00000
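The prcomp standard deviations above differ slightly from the princomp ones reported earlier, although the proportions of variance agree. A likely explanation is the divisor: prcomp uses n - 1 while princomp uses n. The sketch below checks this with the printed values, assuming n = 19 (the number of H. oleracea observations given in the data description).

```r
# Standard deviations copied from the two summary tables.
sd_princomp <- c(23.0599943, 12.6527406, 7.86393386, 6.43516871)
sd_prcomp   <- c(23.6919, 12.9995, 8.07942, 6.61151)

# Assumption: the data frame has 19 rows.
n <- 19

# Rescale the divisor-n standard deviations to the divisor-(n - 1) convention.
rescaled <- sd_princomp * sqrt(n / (n - 1))
print(round(rescaled, 4))

# Largest absolute gap between the rescaled values and the prcomp values.
max_diff <- max(abs(rescaled - sd_prcomp))
print(max_diff)
```

If the assumption about n holds, the gap is at the level of the rounding in the printed output, so the two functions describe the same decomposition.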