1.0 Overview

1.1 Intro to PCA

Principal Component Analysis (PCA) is a useful technique for exploratory data analysis, allowing us to better visualize the variation present in a dataset with many variables. It is particularly helpful in the case of “wide” datasets, where you have many variables for each sample. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after a normalization step of the initial data. PCA allows to see the overall “shape” of the data, identifying which samples are similar to one another and which are very different. This can enable us to identify groups of samples that are similar and work out which variables make one group different from another.The amount of variance retained by each principal component is measured by the so-called eigenvalue.

1.2 Eigenvalues / Variances

As described earlier, the eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs corresponds to the directions with the maximum amount of variation in the data set.

We examine the eigenvalues to determine the number of principal components to be considered. The eigenvalues and the proportion of variances (i.e., information) retained by the principal components (PCs) can be extracted using the function get_eigenvalue() [factoextra package].

1.3 Two general methods to perform PCA in R

Spectral decomposition which examines the covariances / correlations between variables
Singular value decomposition which examines the covariances / correlations between individuals The function princomp() uses the spectral decomposition approach. The functions prcomp() and PCA()[FactoMineR] use the singular value decomposition (SVD).

In this project we using the SVD concept for conducting the PCA hence we are deploying the prcomp() and PCA()[FactoMineR] packages to do the analysis.

Arguments for prcomp():

x: a numeric matrix or data frame

scale: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place

2.0 PCA for the Survey Questions

Because PCA works best with numerical data,we shall keep the ratings of the 26 main questions and exclude all other fields while doing the PCA.In this section we’ll provide an easy-to-use R code to compute and visualize PCA in R using the prcomp() function and the factoextra package to create a ggplot2-based elegant visualization.

The following is the table which shows the question and the number which we would be using to refer the question. List of Questions in the Survey

2.1 Loading Relevant Packages

The following snippet of code is used to load the relevant packages.

library(FactoMineR)
library(Factoshiny)
library(devtools)
library(plotly)
library(ggbiplot)
library(factoextra)

## Warning: package 'factoextra' was built under R version 3.6.2

library(corrplot)
library(tidyverse)
library(Hmisc)
library(corrgram)
library(DT)

2.2 Loading the dataset

The library dataset is imported in R console to perform the PCA

lib_surv_data<- read_csv("Data/Raw data 2018-03-07 SMU LCS data file - KLG.csv")
stu_data_h<- filter(lib_surv_data, Position == '1' | Position == '2'| Position == '3'| Position == '4'| Position == '5'| Position == '6'| Position == '7')
stu_data_hm<- stu_data_h%>%
  select(ResponseID,StudyArea,starts_with("I"),(-ID))
d<-na.omit(stu_data_hm)

Correlation Analysis of the Survey Responses Principal components analysis is based on correlations between input variables. Consequently, carrying out a principal components analysis only makes sense if the input variables are sufficiently correlated.One of the assumption of PCA is the input variables are linearly correlated. The correlation analysis is to ensure that this assumption is true.Therefore we are carrying out correlation analysis of the variables.

d1<-d[, 3:28]
c1<-cor(d1)
corrplot(c1)

2.3 Compute PCA

The below code is used to conduct PCA on the survey database-

res.pca <- prcomp(d[, 3:28], scale = TRUE,center = TRUE)

The calculation is done by a singular value decomposition of the centered and scaled data matrix X.By default, the prcomp() function centers the variables to have mean zero.By using the option scale=TRUE, we scale the variables to have standard deviation one.

Access to the PCA results

Eigenvalues

# Eigenvalues
eig.val <- get_eigenvalue(res.pca)
eig.val

The sum of all the eigenvalues give a total variance of 10.

The proportion of variation explained by each eigenvalue is given in the second column. For example, 10.6772 divided by 26 equals 0.41066, or, about 41.06% of the variation is explained by this first eigenvalue. The cumulative percentage explained is obtained by adding the successive proportions of variation explained to obtain the running total. For instance, 41.242% plus 9.46% equals 50.526%, and so forth. Therefore, about 50.52% of the variation is explained by the first two eigenvalues together.Eigenvalues can be used to determine the number of principal components to retain after PCA.

Visualize eigenvalues using SCREE PLOT An alternative method to determine the number of principal components is to look at a Scree Plot, which is the plot of eigenvalues ordered from largest to the smallest. The number of component is determined at the point, beyond which the remaining eigenvalues are all relatively small and of comparable size.

fviz_eig(res.pca,addlabels = TRUE, ylim = c(0, 50))

The above graph shows the percentage of variances explained by each principal component.

Quality of representation

The quality of representation of the variables on factor map is called cos2 (square cosine, squared coordinates) . You can access to the cos2 as follow:

var <- get_pca_var(res.pca)

head(var$cos2,4)

##         Dim.1      Dim.2        Dim.3      Dim.4      Dim.5        Dim.6
## I01 0.3698249 0.07795743 0.0097579316 0.13165527 0.08027516 0.0001323649
## I02 0.4012779 0.03264135 0.0006505832 0.01703074 0.19598649 0.0203935520
## I03 0.2983858 0.02773705 0.0311328662 0.01837885 0.10249154 0.2568099370
## I04 0.3139637 0.15238539 0.0130486999 0.10182436 0.01155662 0.0576178280
##            Dim.7       Dim.8       Dim.9      Dim.10      Dim.11
## I01 0.0001575286 0.027503954 0.005637628 0.047189446 0.055935684
## I02 0.0096793707 0.008810914 0.021633751 0.105625511 0.002450009
## I03 0.0112568966 0.015015940 0.033564414 0.150668381 0.007355093
## I04 0.0238361211 0.096714129 0.047899576 0.000329713 0.003251092
##           Dim.12       Dim.13       Dim.14       Dim.15       Dim.16
## I01 3.388296e-05 0.0014930577 0.0054326537 0.0478358463 0.0003517547
## I02 3.893051e-03 0.0157601441 0.0010747967 0.0285186168 0.0150004461
## I03 2.553992e-02 0.0001795323 0.0051072547 0.0004872657 0.0063017381
## I04 9.146311e-05 0.0207521669 0.0002072848 0.0423117658 0.0456266893
##           Dim.17      Dim.18      Dim.19       Dim.20       Dim.21
## I01 0.1028789323 0.010214263 0.001366510 0.0151083813 2.410928e-03
## I02 0.0979658097 0.015573082 0.001205045 0.0004431359 2.547832e-03
## I03 0.0006408435 0.001936561 0.001937114 0.0027221892 2.504878e-04
## I04 0.0013456278 0.010650548 0.028606312 0.0259577670 9.771452e-05
##           Dim.22       Dim.23       Dim.24       Dim.25       Dim.26
## I01 0.0060207668 3.073220e-04 2.085013e-04 4.973437e-05 0.0002601298
## I02 0.0008223325 6.756901e-06 4.737673e-04 3.344389e-05 0.0005016188
## I03 0.0001597469 1.143807e-03 6.092042e-04 1.086645e-05 0.0001766570
## I04 0.0009268968 3.630825e-04 7.035703e-05 3.327232e-04 0.0002323299

Note that:

A high cos2 indicates a good representation of the variable on the principal component. In this case the variable is positioned close to the circumference of the correlation circle.

A low cos2 indicates that the variable is not perfectly represented by the PCs. In this case the variable is close to the center of the circle.

For a given variable, the sum of the cos2 on all the principal components is equal to one.

If a variable is perfectly represented by only two principal components (Dim.1 & Dim.2), the sum of the cos2 on these two PCs is equal to one. In this case the variables will be positioned on the circle of correlations.For some of the variables, more than 2 components might be required to perfectly represent the data. In this case the variables are positioned inside the circle of correlations.

fviz_cos2(res.pca, choice = "var", axes = 1:2)

Contributions of variables to PCs

The contributions of variables in accounting for the variability in a given principal component are expressed in percentage.

Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set.Variables that do not correlated with any PC or correlated with the last dimensions are variables with low contribution and might be removed to simplify the overall analysis. The contribution of variables can be extracted as follow :

It’s possible to use the function corrplot() [corrplot package] to highlight the most contributing variables for each dimension:

corrplot(var$contrib, is.corr=FALSE)

Correlation circle for PCA:

The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC. The representation of variables differs from the plot of the observations: The observations are represented by their projections, but the variables are represented by their correlations.

The plot is also known as variable correlation plots. It shows the relationships between all variables. It can be interpreted as follow:

Positively correlated variables are grouped together.
Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).
The distance between variables and the origin measures the quality of the variables on the factor map. Variables that are away from the origin are well represented on the factor map.

fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

3.0 Interpreting the PCA

After doing the PCA for the survey data we need to see how we can get meaningful insights from our analysis.In addition to dimensional reduction, Principal Component Analysis can also be used to gain insight into the structure of the data set in two ways. First, the factor loadings can be used to plot the variables in the principal components space (e.g.loading plot), and it is sometimes possible to see which variables are “close” to each other in the principal components space. Second, the principal component scores can be plotted for each observation (e.g. score plot), and aberrant observations or small, unusual clusters might be noted.In this section, we consider both of these uses of Principal Component Analysis using the library survey dataset.

A modification of the corrplot plot involves arranging the nodes in a way that locates more highly correlated variables closer to one another. The sign of the correlations are indicated by color: positive correlations are green, negative magenta (or red). Since there are no negative correlations hence it is not shown in the fig below.

## numeric(0)

A second criteria we should consider is which principal components explain more than one variable’s worth of information. When we report our results, we should state the explained variance of each principal component we used, plus their combined explained variance.

The next step in our interpretation is to understand how our variables contribute to each of the principal components, and this is revealed by the loadings. Positive loadings indicate a variable and a principal component are positively correlated: an increase in one results in an increase in the other. Negative loadings indicate a negative correlation. Large (either positive or negative) loadings indicate that a variable has a strong effect on that principal component. We need to scan the table to find those large loadings, and we first need a criterion of what constitutes a “large” loading. Because the sum of the squares of all loadings for an individual principal component must sum to one, we can calculate what the loadings would be if all variables contributed equally to that principal component. Any variable that has a larger loading than this value contributes more than one variable’s worth of information and would be regarded as an important contributor to that principal component.

A table of loadings should always be presented for the principal components that are used. It is usually helpful to rearrange the rows so that all the variables that contribute strongly to principal component 1 are listed first, followed by those that contribute strongly to principal component 2, and so on. Large loadings can be highlighted in boldface to emphasize the variables that contribute to each principal component.

4.0 Conclusions

Using the Factor Laoding table we would see how each questions contribute to the corresponding Principal Component.

4.1 Variable loading Principal Component 1

The factor loading table below shows that Principal component 1 load heavily on I13, I22, I12, I24 and I25. All these questions are related to who effeciency of the services provided by the library.

Questions loading the PC1

4.2 Variable loading Principal Component 2

The factor loading table below shows that Principal component 1 load heavily on I14, I18, I19 and I15. All these questions are related to who facility and the resources of the library.

Questions loading the PC2

4.3 Variable loading Principal Component 3

The factor loading table below shows that Principal component 1 load heavily on I14, I15, I12 and I18. All these questions are related to the services offered by the library.

Questions loading the PC3

Final Conclusions Therefore using the PCA we can see how each questions are playing a role in laoding the respective Pricipal Components.Through the above illustrations we can see which variables are “close” to each other in the principal components space.

PCA for Survey Analysis

Shreyansh Shivam

25-Apr-2020, (updated on 03 May 2020)