In statistics, machine learning, and information theory, dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. The goal of dimension reduction is to make data more ‘workable’. Take the ANSUR II dataset as an example: it is made up of 108 columns, each describing an attribute of individuals in the army. However, some attributes may be highly correlated with others and hence provide little additional information. In short, dimensionality reduction effectively ignores the redundant attributes and keeps only the ‘important’ ones.
For this example, we are going to perform dimensionality reduction using Principal Component Analysis (PCA) and Factor Analysis (FA) on the Anthropometric Survey of US Army Personnel (ANSUR II) Male dataset. More on the data can be found here: https://www.openlab.psu.edu/ansur2/
library(ggplot2)
library(ggfortify)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ansur_male <- read.csv(file='D:/Dataset/ANSUR/ANSUR_II_MALE_Public.csv',header=T)
dim(ansur_male)
## [1] 4082 108
The ID column is also removed from the data as it does not provide additional information. Next, since PCA works on numerical data only, categorical variables are removed from the original dataset.
# Remove ID
ansur_male <- ansur_male[,c(2:108)]
# colnames(ansur_male)
# select only numeric number
ansur_numeric <- ansur_male %>% select_if(is.numeric)# %>% glimpse
ansur_non_numeric <- ansur_male %>% select_if(is.factor) %>% glimpse
## Observations: 4,082
## Variables: 9
## $ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Mal...
## $ Date <fct> 4-Oct-10, 4-Oct-10, 4-Oct-10, 12-Oct-10, 12-O...
## $ Installation <fct> Fort Hood, Fort Hood, Fort Hood, Fort Hood, F...
## $ Component <fct> Regular Army, Regular Army, Regular Army, Reg...
## $ Branch <fct> Combat Arms, Combat Support, Combat Support, ...
## $ PrimaryMOS <fct> 19D, 68W, 68W, 88M, 92G, 91L, 91D, 88M, 92F, ...
## $ SubjectsBirthLocation <fct> North Dakota, New York, New York, Wisconsin, ...
## $ Ethnicity <fct> , , , , , , , , , , Mexican Yaqui, , , , , , ...
## $ WritingPreference <fct> Right hand, Left hand, Left hand, Right hand,...
Next, instead of selecting 8 columns at random, a random search algorithm is run for 10,000 trials to find the combination of 8 variables that maximises the cumulative proportion of variance explained by PC1 and PC2. The variables selected by the algorithm are shown below. This step is optional; its only goal is to optimise the end result, and the random search could be replaced by another optimisation algorithm such as simulated annealing or grid search (a sketch of the former is given after the code).
score <- 0
df_best <- ansur_numeric[, 1:8]  # placeholder; overwritten by the first accepted trial

# Random search: try 10,000 random subsets of 8 numeric variables and keep the
# subset whose first two principal components explain the largest share of variance
for (i in 1:10000) {
  set.seed(i)
  x <- sample(1:ncol(ansur_numeric), 8, replace = FALSE)
  df <- ansur_numeric[, x]
  df_pca <- prcomp(df, scale = TRUE)
  eigs <- df_pca$sdev^2
  total_var <- (eigs[1] / sum(eigs)) + (eigs[2] / sum(eigs))
  if (total_var >= score) {
    score <- total_var
    df_best <- df
  }
}

colnames(df_best)  # the 8 variables selected by the search
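As a rough, optional sketch of the simulated-annealing alternative mentioned above (the neighbourhood move, iteration count and cooling schedule here are illustrative assumptions, not part of the original analysis):
# Illustrative only: simulated-annealing variant of the subset search.
# One column of the current subset is swapped per step; worse subsets are
# accepted with a probability that shrinks as the temperature decays.
pc12_score <- function(cols) {
  p <- prcomp(ansur_numeric[, cols], scale = TRUE)
  eigs <- p$sdev^2
  sum(eigs[1:2]) / sum(eigs)
}
set.seed(1)
current <- sample(ncol(ansur_numeric), 8)
current_score <- pc12_score(current)
temp <- 1
for (step in 1:2000) {
  candidate <- current
  candidate[sample(8, 1)] <- sample(setdiff(seq_len(ncol(ansur_numeric)), current), 1)
  candidate_score <- pc12_score(candidate)
  if (candidate_score > current_score ||
      runif(1) < exp((candidate_score - current_score) / temp)) {
    current <- candidate
    current_score <- candidate_score
  }
  temp <- temp * 0.995  # cool down
}
current_score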
Using the variables obtained from the random search, PCA is conducted and the result is summarised below as Exhibit 2. Based on Exhibit 2, PC1 explains 60.33% of the total variance, indicating that almost two-thirds of the dataset’s information can be captured by PC1 alone. PC2 explains a further 31.58% of the total variance. Since the cumulative proportion of variance for PC1 and PC2 is 91.91%, knowing a sample’s position with respect to only the first two principal components is enough to estimate fairly accurately where it lies relative to the other samples in the dataset.
Principal Component Analysis (PCA) is a technique used to perform dimensionality reduction by consolidating all the attributes into a smaller set of underlying factors (principal components) while preserving as much information as possible. To put this into context, waistline, weight and body fat percentage may all be highly correlated since they describe the same characteristic of an individual using different measurements; these attributes can therefore be transformed into a single principal component using PCA.
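This kind of redundancy can be checked directly in the data; the column names below are assumed from the ANSUR II code book, and any set of related measurements would illustrate the same point.
# Illustrative check: highly correlated measurements carry largely redundant information
round(cor(ansur_numeric[, c("weightkg", "waistcircumference", "thighcircumference")]), 2)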
df_pca_final <- prcomp(df_best, scale = T)
summary(df_pca_final)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1969 1.5894 0.47028 0.44447 0.30248 0.25027 0.22773
## Proportion of Variance 0.6033 0.3158 0.02764 0.02469 0.01144 0.00783 0.00648
## Cumulative Proportion 0.6033 0.9191 0.94673 0.97142 0.98286 0.99069 0.99717
## PC8
## Standard deviation 0.15042
## Proportion of Variance 0.00283
## Cumulative Proportion 1.00000
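The proportions in this table can be reproduced directly from the component standard deviations returned by prcomp, which is exactly the quantity the random search loop maximised:
eigs <- df_pca_final$sdev^2
round(eigs / sum(eigs), 4)          # per-component proportion of variance
round(cumsum(eigs) / sum(eigs), 4)  # cumulative proportion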
The scree plot summarising the variance of each component is shown below. PCA is effective on these 8 variables, as the plot shows a clear elbow (L-shape). This supports the numerical result that the first two principal components explain most of the variance.
plot(df_pca_final)
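As an optional alternative, the same information can be plotted on the proportion-of-variance scale in the ggplot style used elsewhere in this post (the data frame name scree_df is just illustrative):
scree_df <- data.frame(PC = seq_along(df_pca_final$sdev),
                       prop = df_pca_final$sdev^2 / sum(df_pca_final$sdev^2))
ggplot(scree_df, aes(x = PC, y = prop)) +
  geom_col() +
  geom_line() +
  geom_point() +
  labs(y = "Proportion of variance", title = "Scree Plot - Ansur Male") +
  theme(plot.title = element_text(hjust = 0.5))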
The biplot below shows both the loadings and the scores of the dataset in relation to PC1 and PC2.
# biplot() is the base R function; ggfortify::autoplot creates a better biplot with more customisation.
# biplot(df_pca_final)
autoplot(df_pca_final, data = ansur_male, size = 1, alpha = 0.1,
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 4, loadings.label.vjust = 1.5) +
  ggtitle("PCA Biplot - Ansur Male") +
  theme(plot.title = element_text(hjust = 0.5))
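If the reduced representation itself is needed downstream (for example as input to a clustering or regression model), the scores plotted above are stored in the prcomp object:
# First two principal component scores for each soldier
pc_scores <- as.data.frame(df_pca_final$x[, 1:2])
head(pc_scores)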
The second technique used is factor analysis (FA). Like PCA, new factors are created to encapsulate the original attributes. The main distinction is that in FA the factors are linear combinations that maximise the shared portion of the variance, whereas in PCA the components are orthogonal linear combinations that maximise the total variance. The FA output is summarised in the biplot below. Based on the biplot, 88.39% of the total variance in the dataset can be explained by factor 1 and factor 2 (45.2% and 43.19% respectively). The highly correlated attribute pairs are similar to those observed in the PCA biplot.
df_factor_analysis <- factanal(df_best, factors = 2, scores = 'regression')
autoplot(df_factor_analysis, label = FALSE, label.size = 1, alpha = 0.1,
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 4) +
  ggtitle("FA Biplot - Ansur Male") +
  theme(plot.title = element_text(hjust = 0.5))
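The loadings behind this biplot, and the per-factor variance figures quoted above, can be inspected directly on the factanal object:
# Factor loadings (small loadings suppressed) and uniquenesses
print(df_factor_analysis$loadings, cutoff = 0.3)
df_factor_analysis$uniquenesses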
In conclusion, PCA is better suited than FA for this dataset, as the percentage of total variance explained by the first two components is higher for PCA (91.91% versus 88.39% for FA).
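As a quick cross-check, both figures can be computed directly from the fitted objects:
# Proportion of total variance captured by the first two PCA components vs. the two FA factors
pca_two <- sum(df_pca_final$sdev[1:2]^2) / sum(df_pca_final$sdev^2)
fa_two  <- sum(colSums(df_factor_analysis$loadings^2)) / ncol(df_best)
c(PCA = pca_two, FA = fa_two)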