The goal of this paper is to use PCA (Principal Component Analysis) method of dimension reduction on a possum dataset:
The dataset was downloaded from Kaggle, however, as the authors write:
’Data originally found in the DAAG R package and used in the book Maindonald, J.H. and Braun, W.J. (2003, 2007, 2010) “Data Analysis and Graphics Using R”).
Original Source of dataset:
Lindenmayer, D. B., Viggers, K. L., Cunningham, R. B., and Donnelly, C. F. 1995. Morphological variation among columns of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupiala). Australian Journal of Zoology 43: 449-458.’
First, let’s import our possums data characteristics and inspect it:
possum <- read.csv("possum.csv")
head(possum)
## case site Pop sex age hdlngth skullw totlngth taill footlgth earconch eye
## 1 1 1 Vic m 8 94.1 60.4 89.0 36.0 74.5 54.5 15.2
## 2 2 1 Vic f 6 92.5 57.6 91.5 36.5 72.5 51.2 16.0
## 3 3 1 Vic f 6 94.0 60.0 95.5 39.0 75.4 51.9 15.5
## 4 4 1 Vic f 6 93.2 57.1 92.0 38.0 76.1 52.2 15.2
## 5 5 1 Vic f 2 91.5 56.3 85.5 36.0 71.0 53.2 15.1
## 6 6 1 Vic f 1 93.1 54.8 90.5 35.5 73.2 53.6 14.2
## chest belly
## 1 28.0 36
## 2 28.5 33
## 3 30.0 34
## 4 28.0 34
## 5 28.5 33
## 6 30.0 32
summary(possum)
## case site Pop sex
## Min. : 1.00 Min. :1.000 Length:104 Length:104
## 1st Qu.: 26.75 1st Qu.:1.000 Class :character Class :character
## Median : 52.50 Median :3.000 Mode :character Mode :character
## Mean : 52.50 Mean :3.625
## 3rd Qu.: 78.25 3rd Qu.:6.000
## Max. :104.00 Max. :7.000
##
## age hdlngth skullw totlngth
## Min. :1.000 Min. : 82.50 Min. :50.00 Min. :75.00
## 1st Qu.:2.250 1st Qu.: 90.67 1st Qu.:54.98 1st Qu.:84.00
## Median :3.000 Median : 92.80 Median :56.35 Median :88.00
## Mean :3.833 Mean : 92.60 Mean :56.88 Mean :87.09
## 3rd Qu.:5.000 3rd Qu.: 94.72 3rd Qu.:58.10 3rd Qu.:90.00
## Max. :9.000 Max. :103.10 Max. :68.60 Max. :96.50
## NA's :2
## taill footlgth earconch eye chest
## Min. :32.00 Min. :60.30 Min. :40.30 Min. :12.80 Min. :22.0
## 1st Qu.:35.88 1st Qu.:64.60 1st Qu.:44.80 1st Qu.:14.40 1st Qu.:25.5
## Median :37.00 Median :68.00 Median :46.80 Median :14.90 Median :27.0
## Mean :37.01 Mean :68.46 Mean :48.13 Mean :15.05 Mean :27.0
## 3rd Qu.:38.00 3rd Qu.:72.50 3rd Qu.:52.00 3rd Qu.:15.72 3rd Qu.:28.0
## Max. :43.00 Max. :77.90 Max. :56.20 Max. :17.80 Max. :32.0
## NA's :1
## belly
## Min. :25.00
## 1st Qu.:31.00
## Median :32.50
## Mean :32.59
## 3rd Qu.:34.12
## Max. :40.00
##
As we can see, the dataset contains variables which are not numeric or are indexes of observation number or site.
Therefore, we cannot simply conduct a PCA procedure on the virgin dataset. Firstly, let’s limit the dataset only to numeric columns and get rid of the null values:
possum <- possum[sapply(possum, is.numeric)]
possum <- na.omit(possum)
Also, we need to get rid of the indexes and population descriptions:
possum_num <- possum[,-c(1,2,3)]
Let’s inspect boxplots, correlations matrix and distributions of the variables:
boxplot(possum_num, col = 'orange')
corrs <- cor(possum_num, method = c("spearman"))
round(corrs, 2)
## hdlngth skullw totlngth taill footlgth earconch eye chest belly
## hdlngth 1.00 0.74 0.62 0.18 0.45 0.22 0.37 0.61 0.55
## skullw 0.74 1.00 0.57 0.27 0.34 0.08 0.37 0.63 0.53
## totlngth 0.62 0.57 1.00 0.51 0.51 0.25 0.25 0.53 0.45
## taill 0.18 0.27 0.51 1.00 -0.14 -0.35 0.11 0.08 0.25
## footlgth 0.45 0.34 0.51 -0.14 1.00 0.74 0.07 0.50 0.33
## earconch 0.22 0.08 0.25 -0.35 0.74 1.00 -0.09 0.26 0.09
## eye 0.37 0.37 0.25 0.11 0.07 -0.09 1.00 0.12 0.24
## chest 0.61 0.63 0.53 0.08 0.50 0.26 0.12 1.00 0.58
## belly 0.55 0.53 0.45 0.25 0.33 0.09 0.24 0.58 1.00
chart.Correlation(possum_num, histogram=TRUE, pch=19)
possum_cor<-cor(possum_num, method="spearman")
corrplot(possum_cor, order ="alphabet", tl.cex=0.6, insig= 'n', col = COL1('YlOrBr', 200) )
Body length measures are highly correlated with each other. However, not all variables are correlated with each other. Before proceeding with the PCA method, we check whether it makes sense to perform principal component analysis at all. We can determine the adequacy statistic proposed by KaiserMeyerOlkin. This method compares correlations and partial correlations between variables. When the correlation is relatively high compared to the correlation, the KMO is small, which means that it is not feasible to obtain an adequate solution in the low dimensional space.
KMOS(possum_num)
##
## Kaiser-Meyer-Olkin Statistics
##
## Call: KMOS(x = possum_num)
##
## Measures of Sampling Adequacy (MSA):
## hdlngth skullw totlngth taill footlgth earconch eye chest
## 0.8327082 0.8241335 0.7474705 0.5581005 0.6982440 0.5593686 0.8111594 0.8330403
## belly
## 0.8602022
##
## KMO-Criterion: 0.7527333
In our case mean KMO-Criterion = 0.7527333 which is sufficient to perform further anaylsis.
Kaiser’s Stopping Rule is a method for deciding which components should be selected. In this approach, components with eigen values greater than 1 should be kept. It also involves a screening test, where we plot the eigenvalues on the vertical axis and the components on the horizontal axis. The ingredients are listed in descending order from largest to smallest and based on the rule of the elbow we choose the number of components. If the eigenvalue curve is stable, we should choose this number of components.
pca1 <- prcomp(possum_num, scale = TRUE)
fviz_eig(pca1, choice='eigenvalue', addlabels = TRUE, barfill = 'orange')+theme_bw()+geom_line(linetype = "dashed", y = 1)
Basing on Kaiser’s rule, we should choose 2 or 3 components. The third one has eigenvalue lower than 1.
Another approach is to look at the percentage of variance explained, it’s good if components explain 70-90%. I expect that this approach will clarify whether we should choose 2 or 3 components.
fviz_eig(pca1, choice = 'variance', addlabels = TRUE, ylim = c(0, 51), barfill = 'orange')
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9828 1.3959 0.9531 0.86693 0.75951 0.55665 0.51685
## Proportion of Variance 0.4368 0.2165 0.1009 0.08351 0.06409 0.03443 0.02968
## Cumulative Proportion 0.4368 0.6533 0.7543 0.83777 0.90186 0.93629 0.96597
## PC8 PC9
## Standard deviation 0.40314 0.37912
## Proportion of Variance 0.01806 0.01597
## Cumulative Proportion 0.98403 1.00000
Basing on the above plots, I will stand for choosing 3 components. If we take a look at the cumulative variance explained we will find that 2 components explain 65% of variance which is not enough. The first three components explain more than 75%, which will be satisfactory in our case.
fviz_pca_var(pca1, col.circle = "grey70", alpha.var = 1, col.var = "black")
pca1$rotation[,1:3]
## PC1 PC2 PC3
## hdlngth -0.4340508 -0.06327447 0.15067425
## skullw -0.3862225 -0.11092066 0.25134145
## totlngth -0.4182651 -0.07807963 -0.34060125
## taill -0.1931942 -0.48143520 -0.53479000
## footlgth -0.3069891 0.49706230 -0.07909586
## earconch -0.1385685 0.63854590 -0.02187670
## eye -0.1857378 -0.27637753 0.70938382
## chest -0.4094552 0.08877102 -0.01128898
## belly -0.3668335 -0.08209775 -0.04502895
grid.arrange(a,b,c,top='Contribution to the Principal Components')
We can try to interpret the principal components, but note that the principal components themselves do not have an economic interpretation. The first principal component accounts for the overall dimensions of the possums.
While the contribution to the second principal component is in 40% is the ear conch length variable.
The third principal component is in 50% the variable ‘eye’ corresponding to distance from medial canthus to lateral canthus of right eye.
Below in the graph, we can see that the possum population named Vic is dimensionally different from the others:
fviz_pca_ind(pca1,
geom="point",
habillage = possum1$Pop
)
A 3D chart was also created allowing us to visualize the 3 dimensions of our analysis on a single chart. The points on the graph correspond to each observation and their colors represent age.
library(rgl)
plot3d(pca1$x[,1:3], col = factor(possum$age))
text3d(pca1$rotation[,1:3], texts = rownames(pca1$rotation), col='blue')
coords <- NULL
for (i in 1:nrow(pca1$rotation)) {
coords <- rbind(coords, rbind(c(0,0,0),pca1$rotation[i,1:3]))
}
lines3d(coords, col="orange", lwd=4)
rglwidget()
The main goal of this research was to examine whether body dimensions of possums can be described by a smaller number of variables by dimension reduction. The results of the analysis suggest that it can be done by using three dimensions. This project was done by rather small dataset, but it was a great opportunity to get familiar with an algorithm and better understanding of the data and dependencies between variables.