---
title: "factor analysis of mixed data (LEEP)"
output:
flexdashboard::flex_dashboard:
source_code: embed
social: menu
theme: flatly
html_document:
df_print: paged
---
```{r setup, include=FALSE}
library(flexdashboard)
library(knitr)
library(kableExtra)
library(dplyr)
knitr::opts_chunk$set(cache = TRUE, echo=F, eval=T, warning = FALSE, message = FALSE)
```
Workflow {.storyboard}
=========================================
Inputs {.sidebar}
-------------------------------------
Dimensionality reduction using principal component methods is a very handy tool for identifying relationships amongst variables and hidden patterns in a dataset.
Principal component analysis (PCA) is arguably the most commonly known, but it is limited to datasets containing **only** continuous variables. As real-world data likely contain a combination of continuous and categorical variables, **factor analysis of mixed data (FAMD)** is an extremely valuable alternative approach to be familiar with. FAMD can be seen as combining PCA for continuous variables and multiple correspondence analysis (MCA) for categorical variables.
Here, I will perform FAMD on the [Lived Experience of Epilepsy: Patient and Caregiver Perspectives (LEEP)](https://medicine.utah.edu/internalmedicine/epidemiology/research_programs/torch/research/leep.php) dataset, to gain insights into how various aspects of the pandemic influenced the routine lives of people with epilepsy (PWE).
### **Calculate & inspect principal dimensions** Dimensionality reduction through creating a new feature space {data-commentary-width=500}
```{r, fig.width=7}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)
library(grid)
library(readxl)
library(car)
## Import data
LEEP_covid <- read.csv("C:/Users/ali_r/OneDrive/backup 9.9.19/MJ/LEEP/QOL/LEEP_covidbig.csv")
LEEP_covid<-LEEP_covid %>%
select(-X, -Fear_care_recipient) %>%
dplyr::rename(cancelPH=Q8.3,cancelMH=Q8.4)
LEEP_covid$Financial<-Recode(LEEP_covid$Financial, recodes="'You are having difficulty paying the bills, no matter what you do.' = 'Difficulty'; 'After paying the bills, you still have enough money for special things that you want' = 'Good'; 'You have enough money to pay the bills, but little spare money to buy extra or special things.' = 'Good'; 'You have money to pay the bills, but only because you have to cut back on things.' = 'Normal'; else=NA",as.factor=T)
df<-LEEP_covid
df[df=="Not applicable"]<- "No change"
## FAMD
## Treat column 20 (the outcome of interest) as a supplementary variable
res.famd <- FAMD(df,
sup.var = 20,
graph = FALSE,
ncp=25)
## Inspect principal dimensions
ev <- data.frame(get_eigenvalue(res.famd))
## Visualize
x <- fviz_eig(res.famd, ncp=16, choice='eigenvalue', geom='line') +
theme(text = element_text(size=8),
plot.title = element_blank(),
axis.text.x = element_text(size=7),
axis.text.y = element_text(size=7))
y <- fviz_eig(res.famd, ncp=16) +
theme(text = element_text(size=8),
plot.title = element_blank(),
axis.text.x = element_text(size=7),
axis.text.y = element_text(size=7))
grid.arrange(x, y, ncol=2)
```
***
As an **important note**, *standardization* of the numeric variables is critical for valid results from FAMD. This is done automatically by the three packages currently available for FAMD (`FactoMineR`, `PCAmixdata` and `prince`), so it will not be done beforehand.
We can first inspect the calculated principal dimensions (PDs), which are linear combinations of the original variables constructed to account for as much of the variance in the dataset as possible. Inspecting the eigenvalue and percentage of variance explained by each PD, using scree plots, can provide insights into the "informativeness" of the original variables.
An eigenvalue > 1 indicates that a PD accounts for more variance than one of the original variables in **standardized** data (N.B. this holds true **only** when the data are standardized). This is commonly used as a cutoff for deciding which PDs are retained for further analysis. The scree plot on the left indicates that 16 PDs each account for more variance than one of the original variables, whereas the one on the right shows that together they account for only 66.56% of the total variance in the dataset. This suggests that this dataset is quite complex, potentially because 1) the relationships between the variables are non-linear, and/or 2) some factors (variables) that could account for variance in this dataset were not included in this analysis.
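As a quick numeric check (a minimal sketch reusing the `ev` data frame and `res.famd` object computed above), we can count the PDs whose eigenvalue exceeds 1 and look up the cumulative variance they explain:
```{r, eval=F, echo=T}
## Number of PDs with an eigenvalue > 1
n_pd <- sum(ev$eigenvalue > 1)
n_pd
## Cumulative percentage of variance explained by those PDs
ev$cumulative.variance.percent[n_pd]
```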
All in all, FAMD can be a great first step in sizing up a dataset, which we will further demonstrate in the next slide.
```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)
library(grid)
## FAMD
res.famd <- FAMD(df)
## Visualize
a <- fviz_eig(res.famd,
choice='eigenvalue',
geom='line')
b <- fviz_eig(res.famd)
grid.arrange(a, b, ncol=2)
```
### **Plot individual observations in new feature space** Exploring "learnable" patterns in the dataset {data-commentary-width=500}
```{r}
## Import libraries
library(FactoMineR)
library(factoextra)
library(plotly)
## FAMD
res.famd <- FAMD(df,
sup.var = 20,
graph = FALSE,
ncp=25)
## Concatenate the original data with the coordinates for
## the first three principal dimensions
val_df <- as.data.frame(res.famd$ind)
x <- cbind(df, val_df[1:3])
## Plot
plot_ly(x, x = ~coord.Dim.1, y = ~coord.Dim.2, z = ~coord.Dim.3, color = ~Financial) %>%
add_markers(size=3, opacity = 0.2) %>%
layout(scene = list(xaxis = list(title = 'Principal dimension 1'),
yaxis = list(title = 'Principal dimension 2'),
zaxis = list(title = 'Principal dimension 3')))
```
***
We can now visualize the individual data points in the new feature space created by the first three, and thus "most informative", PDs. This is particularly useful when we want to see how "separable" groups of data points are, in our case in terms of financial status during COVID-19, ahead of building supervised predictive models. To this end, the points are coloured by the variable "Financial".
This nifty 3D scatter plot can be dragged around to inspect the points from different angles, so give it a try!
As individuals with similar profiles (*i.e.* personal characteristics) lie close to each other in the figure, the large overlap between the three levels of financial status (SES) suggests that, if there are significant/meaningful differences between the groups, they are likely complex and non-linear. It is also possible that the variables in this dataset are not suited or sufficient to capture the differences between PWE. Either way, the overlapping distributions suggest that we are probably not going to get very good separation by SES.
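One rough way to quantify what the plot shows (a minimal sketch reusing the `x` data frame built above, which combines the original data with the PD coordinates) is to compare the per-group centroids on the first three PDs; near-identical centroids are another sign of poor separability:
```{r, eval=F, echo=T}
library(dplyr)
## Mean coordinate on each of the first three PDs, by financial status
x %>%
  group_by(Financial) %>%
  summarise(across(starts_with("coord"), ~ mean(.x, na.rm = TRUE)))
```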
This is a good example of how performing dimensionality reduction of a dataset can help us see how “successful” other analyses could be on it.
```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(plotly)
## FAMD
res.famd <- FAMD(df)
## Concatenate the original data with the coordinates for
## the first three principal dimensions
val_df <- as.data.frame(res.famd$ind)
x <- cbind(df, val_df[1:3])
## Plot
plot_ly(x,
x = ~coord.Dim.1,
y = ~coord.Dim.2,
z = ~coord.Dim.3,
color = ~Financial)
```
### **Correlation circle** To examine relationships between quantitative variables {data-commentary-width=650}
```{r}
## Import library
library(PCAmixdata)
## (Optional) drop redundant variables before splitting
#df <- within(df, rm('severe'))
## Split quantitative and qualitative variables
split <- splitmix(df)
## FAMD
res.pcamix <- PCAmix(X.quanti=split$X.quanti,
X.quali=split$X.quali,
rename.level=TRUE,
graph=FALSE,
ndim=25)
## Plotting
p <- plot(res.pcamix,
choice="cor",
cex=0.6, main=' ')
```
***
As PDs are linear combinations of the original variables, understanding their relationships can help to identify which variables are the most important in describing the total variance in a dataset.
The **factor loading** of a variable describes the correlation, *i.e.* information shared, between it and a given PD. Squaring the factor loading gives the **squared loading** (which you may also see called *squared cosine* or *cos2*). This provides a measure of the proportion of variance in a variable that is captured by a particular PD. For each variable, the sum of its squared loadings across all PDs equals 1.
One way to depict this relationship is using **correlation circles**, which plot *only* continuous variables using their loadings for any given two PDs as coordinates. Recall that the sum of squared loadings for a given variable across all PDs equals 1. So if a given variable can be perfectly represented by only the two PDs plotted, then:
$$
(factor\, loading_{PD1})^{2} + (factor\, loading_{PD2})^{2} = 1
$$
When plotted using factor loading on each PD as coordinates in a Cartesian grid, this is the same as:
$$
x^{2} + y^{2} = 1
$$
The circle in the plot has a radius of 1, meaning that the projection endpoint for any such variable would be positioned on the circle.
If, instead, more PDs are needed to capture the information contained in a variable, the length of its projection will be less than 1 and the endpoint will lie inside the circle. In this correlation circle, the projections of the variables lie closer to the origin, indicating that more than just PD1 and PD2 are needed to completely represent the information they contain. In general, the closer a variable is to the circle, the more important it is for interpreting the PDs involved.
In addition, variables on opposite sides of the origin are inversely correlated, whereas those on the same side are positively correlated. It makes intuitive sense that `GASE` and `MSC` are inversely related, as PWE who reported higher seizure severity are likely to have lower QOL.
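We can also sanity-check the identity above (a sketch assuming the cos2 values for the quantitative variables are stored in `res.famd$quanti.var$cos2`, as returned by `FactoMineR::FAMD()`). Because only a subset of PDs is retained (`ncp=25`), the sums will fall somewhat short of 1:
```{r, eval=F, echo=T}
## Sum of squared loadings (cos2) across the retained PDs,
## per quantitative variable
rowSums(res.famd$quanti.var$cos2)
```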
```{r, eval=F, echo=T}
## Import library
library(PCAmixdata)
## Split quantitative and qualitative variables
split <- splitmix(df)
## FAMD
res.pcamix <- PCAmix(X.quanti=split$X.quanti,
X.quali=split$X.quali, rename.level=TRUE)
## Plotting
plot(res.pcamix, choice="cor")
```
### **Squared loading plot** To visualize the role of *all* variables in accounting for overall variation in a dataset {data-commentary-width=550}
```{r}
## Import libraries
library(FactoMineR)
library(factoextra)
## FAMD
res.famd <- FAMD(df,
sup.var = 20,
graph = FALSE,
ncp=25)
## Plot
## Colour variables by their squared loading (cos2)
fviz_famd_var(res.famd, 'var', axes = c(1, 2),
labelsize = 3, pointsize = 1,
col.var = 'cos2',
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel=TRUE) +
xlim(-0.1, 0.85) + ylim (-0.1, 0.70) +
theme(text = element_text(size=8),
plot.title = element_blank(),
axis.text.x = element_text(size=7),
axis.text.y = element_text(size=7))
## (Optional) add the supplementary variable to the plot
#fviz_add(p, res.famd$var$coord.sup,
#geom = c("arrow", "text"),
# labelsize = 3, pointsize = 1,
# col.var = 'cos2',
# color = "blue",
# repel = TRUE)
```
***
Squared loading plots allow us to visualize qualitative and quantitative variables **together** in the new feature space.
As an aside, unlike correlation circles, this plot depicts only positive values on the x- and y-axes. According to the authors of the package, the coordinates are to be interpreted as measuring "the links (signless) between variables and principal components". This may be interpreted as the coordinates of each variable being the absolute value of its squared loading.
Interpretation of the squared loading plot is very similar to that of the correlation circle. We see several interesting things:
1. The variables related to Health Care Use are more closely correlated with PD1 than with PD2, whereas `QOLEP` is described by a more even combination of PD1 and PD2.
2. Being furthest from the origin, the Mental Health variables have the highest squared loading values and so are more important in explaining the variance captured by PD1 and PD2 than variables clustered near the origin, such as `Having COVID-19` and `SES`.
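To read the same information off numerically (a sketch assuming `factoextra::get_famd_var()` exposes a `cos2` matrix for the combined set of variables), we can rank the variables by their summed cos2 on PD1 and PD2:
```{r, eval=F, echo=T}
library(factoextra)
## Variables best represented by the first two PDs
var_cos2 <- get_famd_var(res.famd)$cos2
head(sort(rowSums(var_cos2[, 1:2]), decreasing = TRUE), 10)
```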
```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
## FAMD
res.famd <- FAMD(df)
## Plot
## Colour indiv obs by their squared loading
p <- fviz_famd_var(res.famd, 'var',
axes = c(1, 2),
col.var = 'cos2')
## (Optional) add the supplementary variable
#fviz_add(p, res.famd$var$coord.sup, col.var = 'cos2')
```
### **Variable contribution** Contributions of variables to principal dimensions {data-commentary-width=550}
```{r, fig.width=7, fig.height=4, fig.align='center'}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)
## Plot
a <- fviz_contrib(res.famd,
choice = "var", axes = 1, top = 20)
b <- fviz_contrib(res.famd,
choice = "var", axes = 2, top = 20)
grid.arrange(a, b, nrow=1)
#fviz_contrib(res.famd, choice = "var", axes = 1:2, top = 15)
```
***
Whereas factor loading and squared loading measure how well a given PD "describes" the variation captured in a variable, **contribution** describes the converse, namely how much a variable accounts for the total variation captured by a given PD.
It is important to compare the squared loading and contribution for each variable to critically assess its relationship with a given PD, as a variable that contributes strongly to a PD may not itself be well represented by that PD, and the two situations warrant very different interpretations.
The top contributing variables to the first few PDs can provide insights into which variables underlie variation in the dataset, and may help with feature selection for downstream analyses. The red dashed line indicates the expected average contribution (100% divided by the total number of variables in the dataset), so variables exceeding this cut-off can be considered important contributors to the PD.
From the variables that meet the cut-off, we can glean some insights into which variables are the most important in this dataset, such as `Health Care Use`, `Health Conditions` and `Stress`. So, FAMD can also be a handy tool for variable selection.
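As a companion to the plots (a sketch assuming the contributions are available via `factoextra::get_famd_var()`), we can compute the expected average contribution and list the variables on PD1 that exceed it:
```{r, eval=F, echo=T}
library(factoextra)
## Expected average contribution (the red dashed line)
contrib <- get_famd_var(res.famd)$contrib
cutoff <- 100 / nrow(contrib)
## Variables contributing more than average to PD1
sort(contrib[contrib[, 1] > cutoff, 1], decreasing = TRUE)
```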
```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)
## FAMD
res.famd <- FAMD(df)
## Plots
a <- fviz_contrib(res.famd, choice = "var", axes = 1)
b <- fviz_contrib(res.famd, choice = "var", axes = 2)
grid.arrange(a, b, nrow=1)
#fviz_contrib(res.famd, choice = "var", axes = 1:2)
```
### **Level map** More granular insights into relationships amongst variables {data-commentary-width=550}
```{r}
## Import libraries
library(FactoMineR)
library(factoextra)
library(plyr)
library(dplyr)
library(arulesCBA)
df1<-df %>%
select(Access_mentalhealthcare,Access_physicalhealthcare, Fear_healthcare,Refilling_prescription, Use_health_care, Use_mental_health_care,Alcohol_consumption, Exercise, Hobbies, Meditation, Smoking,Time_spent_outside, Anxiety, Anger,Belief_system, Control_over_life,Social_isolation, Stress, Fear_seizure, health_conditions, Financial, cancelPH, cancelMH)
res.mca <- MCA(df1, ncp=5, graph = F)
## Plot relationship between levels of categorical variables obtained from MCA
fviz_mca_var(res.mca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
labelsize = 3, pointsize = 1, repel=TRUE) +
xlim(-1.5, 2) + ylim (-1.0, 1.25) + theme(text = element_text(size=8),
plot.title = element_blank(),
axis.text.x = element_text(size=7),
axis.text.y = element_text(size=7))
```
***
We can further visualize the relationships between possible values of variables in level maps. This allows us to get much more fine-grained insights.
There is a lot of information in this plot!
Interpretation of the level map is similar to that of correlation circles and squared loading plots:

- Values that are situated close to each other are more closely related
- Values that are closer to the origin are less important in accounting for the variance in the dataset than those farther away
- Values that are on opposite sides of the origin are negatively correlated, whereas those on the same side are positively correlated
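To complement the level map (a sketch assuming `factoextra::get_mca_var()` returns a `cos2` matrix for the category levels), we can rank the levels by how well the first two MCA dimensions represent them:
```{r, eval=F, echo=T}
library(factoextra)
## Category levels best represented by the first two dimensions
lvl_cos2 <- get_mca_var(res.mca)$cos2
head(sort(rowSums(lvl_cos2[, 1:2]), decreasing = TRUE), 10)
```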
```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(plyr)
library(dplyr)
library(arulesCBA)
df1<-df %>%
select(Access_mentalhealthcare,Access_physicalhealthcare, Fear_healthcare,Refilling_prescription, Use_health_care, Use_mental_health_care,Alcohol_consumption, Exercise, Hobbies, Meditation, Smoking,Time_spent_outside, Anxiety, Anger,Belief_system, Control_over_life,Social_isolation, Stress, Fear_seizure, health_conditions)
res.mca <- MCA(df1, ncp=5, graph = T)
## Plot relationship between levels of categorical variables obtained from MCA
fviz_mca_var(res.mca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
labelsize = 3, pointsize = 1, repel=TRUE) +
xlim(-1.5, 2) + ylim (-1.0, 1.25) + theme(text = element_text(size=8),
plot.title = element_blank(),
axis.text.x = element_text(size=7),
axis.text.y = element_text(size=7))
```
### **Varimax rotation** To further facilitate interpretation of the relationships between variables and principal dimensions {data-commentary-width=500}
```{r, fig.height=8, fig.width=8}
## Import library
library(PCAmixdata)
## Split quantitative and qualitative variables
split <- splitmix(df)
## FAMD
res.pcamix <- PCAmix(X.quanti=split$X.quanti,
X.quali=split$X.quali,
rename.level=TRUE,
graph=FALSE,
ndim=25)
## Apply varimax rotation to the first two PCs
res.pcarot <- PCArot(res.pcamix,
dim=2,
graph=FALSE)
## Plot
plot(res.pcarot,
choice="sqload",
coloring.var=TRUE,
axes=c(1, 2), leg=TRUE,
posleg="topright", main=' ',
xlim=c(0, 0.8), ylim=c(0, 0.6),
cex=0.8)
```
***
To further facilitate interpretation of the relationships between variables and PCs, an additional rotation can be applied to the PCs so that a few variables end up with high factor loadings and the rest with low factor loadings. In other words, a small number of variables become highly correlated with each PC. The most common form of rotation is varimax rotation, a generalized form of which is implemented in the `PCAmixdata` package for mixed data. To learn more, here is an excellent explanation on the [stats StackExchange](https://stats.stackexchange.com/questions/151653/what-is-the-intuitive-reason-behind-doing-rotations-in-factor-analysis-pca-how).
Here we see a slightly different version of the squared loading plot from before. What is different is that `QOLEP` and `Seizure Freedom` now load more heavily on the rotated PD1, while `Stress`, `Anger`, `Health Condition` and `Fear of Health care` load more heavily on the rotated PD2 (their projections are more closely aligned with one axis or the other).
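To see the effect of the rotation directly (a sketch assuming both the `PCAmix` and `PCArot` objects expose a `sqload` matrix of squared loadings), we can compare the squared loadings on the first two dimensions before and after rotation:
```{r, eval=F, echo=T}
## Squared loadings before rotation (first two dimensions)
round(res.pcamix$sqload[, 1:2], 2)
## Squared loadings after varimax rotation
round(res.pcarot$sqload, 2)
```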
```{r, eval=F, echo=T}
## Import library
library(PCAmixdata)
## Split quantitative and qualitative variables
split <- splitmix(df)
## FAMD
res.pcamix <- PCAmix(X.quanti=split$X.quanti,
X.quali=split$X.quali, rename.level=TRUE)
## Apply varimax rotation to the first two PCs
res.pcarot <- PCArot(res.pcamix, dim=2,
graph=FALSE)
## Plot
plot(res.pcarot, choice="sqload",
coloring.var=TRUE, axes=c(1, 2))
```
### **Hierarchical clustering** Unsupervised clustering based on principal dimensions {data-commentary-width=500}
```{r}
## Hierachical clustering
res.hcpc <- HCPC(res.famd, nb.clust=-1, graph = FALSE)
## Plot
fviz_cluster(res.hcpc, geom = "point", main = "Factor map") +
theme(plot.title = element_blank(),
axis.text.x = element_text(size=4),
axis.text.y = element_text(size=4))
```
```{r eval=FALSE, include=FALSE}
res.hcpc$desc.var$test.chi2
res.hcpc$desc.var$category
res.hcpc$desc.axes
res.hcpc$desc.ind
```
***
As principal components (PCs) are essentially "synthetic" variables that summarize the most important sources of variation in the data, they are very useful for reducing noise and redundancy in a dataset, which in turn helps reveal its inherent structure. One way to take this a step further is to hierarchically cluster the individual data points by their "position" in the principal-dimension feature space. The implementation in the `FactoMineR` package uses Ward's criterion to carry out the hierarchical clustering.
Doing this on the LEEP data revealed three partially overlapping clusters.
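To see which variables and category levels characterize each cluster, we can inspect the description tables returned by `HCPC()`:
```{r, eval=F, echo=T}
## Variables most associated with the cluster partition
res.hcpc$desc.var$test.chi2
## Category levels over- or under-represented in each cluster
res.hcpc$desc.var$category
```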
In the next slide, we will generate some summary visualizations to see how the clusters differ across the original variables.
```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
## FAMD
res.famd <- FAMD(df, sup.var = 20)
## Hierachical clustering
res.hcpc <- HCPC(res.famd, nb.clust=-1)
## Plot
fviz_cluster(res.hcpc, geom = "point")
```
### **Examine clusters** Explore characteristics of clusters {data-commentary-width=550}
```{r eval=FALSE, cache=FALSE, include=FALSE}
## Import libraries
#install.packages("autoEDA")
library(autoEDA)
library(svglite)
library(htmlwidgets)
library(slickR)
library(shiny)
## Rename cluster column
names(res.hcpc$data.clust)[names(res.hcpc$data.clust) == 'clust'] <- 'Cluster'
## autoEDA
auto_res <- autoEDA(res.hcpc$data.clust,
y = "Cluster", returnPlotList = TRUE,
plotCategorical = "groupedBar",
plotContinuous = "histogram",
bins = 30, rotateLabels = TRUE,
color = "#26A69A", verbose = FALSE)
## Create list of autoEDA figures converted to SVG
plotsToSVG <- list()
i <- 1
for (v in auto_res$plots) {
x <- xmlSVG({show(v)}, standalone=TRUE)
plotsToSVG[[i]] <- x
i <- i + 1
}
## Custom function needed to render SVGs in Chrome/Firefox
hash_encode_url <- function(url){
gsub("#", "%23", url)
}
## Pass list of figures to SlickR
s.in <- sapply(plotsToSVG, function(sv){hash_encode_url(paste0("data:image/svg+xml;utf8,",as.character(sv)))})
#Sys.sleep(4)
slickR(s.in, slideId = 'ex4', slickOpts = list(dots=T, arrows=F, autoplay=T, adaptiveWidth=TRUE, adaptiveHeight=TRUE), height='525px')
```
***
The automated exploratory data analysis package `autoEDA` is very useful for generating summary visualizations for each variable in a given dataset with respect to a target variable of interest.
In our case, we want to see the distribution of values for each variable in each of the three clusters generated using hierarchical clustering based on FAMD. Comparing the characteristics of these clusters can generate more nuanced insights into what differentiates groups of PWE in how the pandemic affected their daily lives.
As a side note, here you see a slider generated using the `slickR` package, which is a nifty way to show multiple figures when space is limited. Learn more about it at its Github repo [here](https://github.com/metrumresearchgroup/slickR).
```{r eval=FALSE, include=FALSE}
## Import libraries
library(autoEDA)
## autoEDA
auto_res <- autoEDA(res.hcpc$data.clust, y = "clust",
returnPlotList = TRUE,
plotCategorical = "groupedBar",
bins = 30, rotateLabels = TRUE,
color = "#26A69A", verbose = FALSE)
## Plot
auto_res$plots
```
Session info
=========================================
Column
---------------------------------------------
```{r}
sessionInfo()
```