Dimension Reduction in analysis of opposums characteristic dataset

Introduction and dataset description

The goal of this paper is to use PCA (Principal Component Analysis) method of dimension reduction on a possum dataset:

  • The dataset contains 104 observations and 14 columns:
    • case: observation number
    • site: The site number where the possum was trapped.
    • Pop: Population, either Vic (Victoria) or other (New South Wales or Queensland)
    • sex: Gender, either m (male) or f (female)
    • age: Age
    • hdlngth: Head length, in mm
    • skullw: Skull width, in mm
    • totlngth: Total length, in cm
    • taill: Tail length, in cm
    • footlgth: foot length
    • earconch: ear conch length
    • eye: distance from medial canthus to lateral canthus of right eye
    • chest: chest girth, in cm
    • belly: belly girth, in cm

The dataset was downloaded from Kaggle, however, as the authors write:
’Data originally found in the DAAG R package and used in the book Maindonald, J.H. and Braun, W.J. (2003, 2007, 2010) “Data Analysis and Graphics Using R”).

Original Source of dataset:
Lindenmayer, D. B., Viggers, K. L., Cunningham, R. B., and Donnelly, C. F. 1995. Morphological variation among columns of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupiala). Australian Journal of Zoology 43: 449-458.’

First, let’s import our possums data characteristics and inspect it:

possum <- read.csv("possum.csv")
head(possum)
##   case site Pop sex age hdlngth skullw totlngth taill footlgth earconch  eye
## 1    1    1 Vic   m   8    94.1   60.4     89.0  36.0     74.5     54.5 15.2
## 2    2    1 Vic   f   6    92.5   57.6     91.5  36.5     72.5     51.2 16.0
## 3    3    1 Vic   f   6    94.0   60.0     95.5  39.0     75.4     51.9 15.5
## 4    4    1 Vic   f   6    93.2   57.1     92.0  38.0     76.1     52.2 15.2
## 5    5    1 Vic   f   2    91.5   56.3     85.5  36.0     71.0     53.2 15.1
## 6    6    1 Vic   f   1    93.1   54.8     90.5  35.5     73.2     53.6 14.2
##   chest belly
## 1  28.0    36
## 2  28.5    33
## 3  30.0    34
## 4  28.0    34
## 5  28.5    33
## 6  30.0    32
summary(possum)
##       case             site           Pop                sex           
##  Min.   :  1.00   Min.   :1.000   Length:104         Length:104        
##  1st Qu.: 26.75   1st Qu.:1.000   Class :character   Class :character  
##  Median : 52.50   Median :3.000   Mode  :character   Mode  :character  
##  Mean   : 52.50   Mean   :3.625                                        
##  3rd Qu.: 78.25   3rd Qu.:6.000                                        
##  Max.   :104.00   Max.   :7.000                                        
##                                                                        
##       age           hdlngth           skullw         totlngth    
##  Min.   :1.000   Min.   : 82.50   Min.   :50.00   Min.   :75.00  
##  1st Qu.:2.250   1st Qu.: 90.67   1st Qu.:54.98   1st Qu.:84.00  
##  Median :3.000   Median : 92.80   Median :56.35   Median :88.00  
##  Mean   :3.833   Mean   : 92.60   Mean   :56.88   Mean   :87.09  
##  3rd Qu.:5.000   3rd Qu.: 94.72   3rd Qu.:58.10   3rd Qu.:90.00  
##  Max.   :9.000   Max.   :103.10   Max.   :68.60   Max.   :96.50  
##  NA's   :2                                                       
##      taill          footlgth        earconch          eye            chest     
##  Min.   :32.00   Min.   :60.30   Min.   :40.30   Min.   :12.80   Min.   :22.0  
##  1st Qu.:35.88   1st Qu.:64.60   1st Qu.:44.80   1st Qu.:14.40   1st Qu.:25.5  
##  Median :37.00   Median :68.00   Median :46.80   Median :14.90   Median :27.0  
##  Mean   :37.01   Mean   :68.46   Mean   :48.13   Mean   :15.05   Mean   :27.0  
##  3rd Qu.:38.00   3rd Qu.:72.50   3rd Qu.:52.00   3rd Qu.:15.72   3rd Qu.:28.0  
##  Max.   :43.00   Max.   :77.90   Max.   :56.20   Max.   :17.80   Max.   :32.0  
##                  NA's   :1                                                     
##      belly      
##  Min.   :25.00  
##  1st Qu.:31.00  
##  Median :32.50  
##  Mean   :32.59  
##  3rd Qu.:34.12  
##  Max.   :40.00  
## 

As we can see, the dataset contains variables which are not numeric or are indexes of observation number or site.
Therefore, we cannot simply conduct a PCA procedure on the virgin dataset. Firstly, let’s limit the dataset only to numeric columns and get rid of the null values:

possum <- possum[sapply(possum, is.numeric)]
possum <- na.omit(possum)

Also, we need to get rid of the indexes and population descriptions:

possum_num <- possum[,-c(1,2,3)]

Let’s inspect boxplots, correlations matrix and distributions of the variables:

boxplot(possum_num, col = 'orange')

corrs <- cor(possum_num, method = c("spearman"))
round(corrs, 2)
##          hdlngth skullw totlngth taill footlgth earconch   eye chest belly
## hdlngth     1.00   0.74     0.62  0.18     0.45     0.22  0.37  0.61  0.55
## skullw      0.74   1.00     0.57  0.27     0.34     0.08  0.37  0.63  0.53
## totlngth    0.62   0.57     1.00  0.51     0.51     0.25  0.25  0.53  0.45
## taill       0.18   0.27     0.51  1.00    -0.14    -0.35  0.11  0.08  0.25
## footlgth    0.45   0.34     0.51 -0.14     1.00     0.74  0.07  0.50  0.33
## earconch    0.22   0.08     0.25 -0.35     0.74     1.00 -0.09  0.26  0.09
## eye         0.37   0.37     0.25  0.11     0.07    -0.09  1.00  0.12  0.24
## chest       0.61   0.63     0.53  0.08     0.50     0.26  0.12  1.00  0.58
## belly       0.55   0.53     0.45  0.25     0.33     0.09  0.24  0.58  1.00
chart.Correlation(possum_num, histogram=TRUE, pch=19)

possum_cor<-cor(possum_num, method="spearman") 
corrplot(possum_cor, order ="alphabet", tl.cex=0.6, insig= 'n', col = COL1('YlOrBr', 200) )

Body length measures are highly correlated with each other. However, not all variables are correlated with each other. Before proceeding with the PCA method, we check whether it makes sense to perform principal component analysis at all. We can determine the adequacy statistic proposed by KaiserMeyerOlkin. This method compares correlations and partial correlations between variables. When the correlation is relatively high compared to the correlation, the KMO is small, which means that it is not feasible to obtain an adequate solution in the low dimensional space.

  • The magnitudes of the coefficients:
    • 0.00 to 0.49 unacceptable
    • 0.50 to 0.59 very weak
    • 0.60 to 0.69 weak
    • 0.70 to 0.79 sufficient
    • 0.80 to 0.89 good
    • 0.90 to 1.00 excellent
KMOS(possum_num)
## 
## Kaiser-Meyer-Olkin Statistics
## 
## Call: KMOS(x = possum_num)
## 
## Measures of Sampling Adequacy (MSA):
##   hdlngth    skullw  totlngth     taill  footlgth  earconch       eye     chest 
## 0.8327082 0.8241335 0.7474705 0.5581005 0.6982440 0.5593686 0.8111594 0.8330403 
##     belly 
## 0.8602022 
## 
## KMO-Criterion: 0.7527333

In our case mean KMO-Criterion = 0.7527333 which is sufficient to perform further anaylsis.

Principal Component Analysis

Optimal number of components

Kaiser’s Stopping Rule is a method for deciding which components should be selected. In this approach, components with eigen values greater than 1 should be kept. It also involves a screening test, where we plot the eigenvalues on the vertical axis and the components on the horizontal axis. The ingredients are listed in descending order from largest to smallest and based on the rule of the elbow we choose the number of components. If the eigenvalue curve is stable, we should choose this number of components.

pca1 <- prcomp(possum_num, scale = TRUE)
fviz_eig(pca1, choice='eigenvalue', addlabels = TRUE, barfill = 'orange')+theme_bw()+geom_line(linetype = "dashed", y = 1)

Basing on Kaiser’s rule, we should choose 2 or 3 components. The third one has eigenvalue lower than 1.
Another approach is to look at the percentage of variance explained, it’s good if components explain 70-90%. I expect that this approach will clarify whether we should choose 2 or 3 components.

fviz_eig(pca1, choice = 'variance', addlabels = TRUE, ylim = c(0, 51), barfill = 'orange')

summary(pca1)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9828 1.3959 0.9531 0.86693 0.75951 0.55665 0.51685
## Proportion of Variance 0.4368 0.2165 0.1009 0.08351 0.06409 0.03443 0.02968
## Cumulative Proportion  0.4368 0.6533 0.7543 0.83777 0.90186 0.93629 0.96597
##                            PC8     PC9
## Standard deviation     0.40314 0.37912
## Proportion of Variance 0.01806 0.01597
## Cumulative Proportion  0.98403 1.00000

Basing on the above plots, I will stand for choosing 3 components. If we take a look at the cumulative variance explained we will find that 2 components explain 65% of variance which is not enough. The first three components explain more than 75%, which will be satisfactory in our case.

fviz_pca_var(pca1, col.circle = "grey70", alpha.var = 1, col.var = "black")

pca1$rotation[,1:3]
##                 PC1         PC2         PC3
## hdlngth  -0.4340508 -0.06327447  0.15067425
## skullw   -0.3862225 -0.11092066  0.25134145
## totlngth -0.4182651 -0.07807963 -0.34060125
## taill    -0.1931942 -0.48143520 -0.53479000
## footlgth -0.3069891  0.49706230 -0.07909586
## earconch -0.1385685  0.63854590 -0.02187670
## eye      -0.1857378 -0.27637753  0.70938382
## chest    -0.4094552  0.08877102 -0.01128898
## belly    -0.3668335 -0.08209775 -0.04502895
grid.arrange(a,b,c,top='Contribution to the Principal Components')

We can try to interpret the principal components, but note that the principal components themselves do not have an economic interpretation. The first principal component accounts for the overall dimensions of the possums.
While the contribution to the second principal component is in 40% is the ear conch length variable.
The third principal component is in 50% the variable ‘eye’ corresponding to distance from medial canthus to lateral canthus of right eye.

Additional visualizations

Below in the graph, we can see that the possum population named Vic is dimensionally different from the others:

fviz_pca_ind(pca1, 
             geom="point", 
             habillage = possum1$Pop 
)

A 3D chart was also created allowing us to visualize the 3 dimensions of our analysis on a single chart. The points on the graph correspond to each observation and their colors represent age.

library(rgl)
plot3d(pca1$x[,1:3], col = factor(possum$age))
text3d(pca1$rotation[,1:3], texts = rownames(pca1$rotation), col='blue')

coords <- NULL
for (i in 1:nrow(pca1$rotation)) {
  coords <- rbind(coords, rbind(c(0,0,0),pca1$rotation[i,1:3]))
}
lines3d(coords, col="orange", lwd=4)
rglwidget()

Conclusion

The main goal of this research was to examine whether body dimensions of possums can be described by a smaller number of variables by dimension reduction. The results of the analysis suggest that it can be done by using three dimensions. This project was done by rather small dataset, but it was a great opportunity to get familiar with an algorithm and better understanding of the data and dependencies between variables.