---
title: "Factor analysis of mixed data"
output: 
  flexdashboard::flex_dashboard:
    source_code: embed
    social: menu
    theme: flatly
---

```{r setup, include=FALSE}
library(flexdashboard)
library(knitr)
library(kableExtra)

knitr::opts_chunk$set(cache = TRUE, echo = FALSE, eval = TRUE, warning = FALSE, message = FALSE)
```




Workflow {.storyboard}
=========================================

Inputs {.sidebar}
-------------------------------------

Dimensionality reduction using principal component methods is a very handy tool for identifying relationships amongst variables and hidden patterns in a dataset. 

Principal component analysis (PCA) is arguably the most widely known, but its use is limited to datasets containing **only** continuous variables. As real-world data often contain a mix of continuous and categorical variables, **factor analysis of mixed data (FAMD)** is an extremely valuable alternative approach to be familiar with. FAMD can be seen as combining PCA for continuous variables with multiple correspondence analysis (MCA) for categorical variables. 

Here, we will perform FAMD on the [IBM Telco customer churn dataset](https://developer.ibm.com/patterns/predict-customer-churn-using-watson-studio-and-jupyter-notebooks/), to gain insights into the relationships between various aspects of customer behaviour. This will be a toy example of how FAMD can be used to derive actionable business insights.



### **Calculate & inspect principal dimensions** 

Dimensionality reduction through creating a new feature space {data-commentary-width=500}

```{r, fig.width=7}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)
library(grid)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## FAMD
## Set the target variable "Churn" as a supplementary variable
res.famd <- FAMD(df, sup.var = 20, graph = FALSE, ncp = 25)

## Inspect principal dimensions
ev <- data.frame(get_eigenvalue(res.famd))

## Visualize
x <- fviz_eig(res.famd, ncp = 7, choice = 'eigenvalue', geom = 'line') + 
  theme(text = element_text(size = 8), plot.title = element_blank(), 
        axis.text.x = element_text(size = 7), axis.text.y = element_text(size = 7))

y <- fviz_eig(res.famd, ncp = 7) + 
  theme(text = element_text(size = 8), plot.title = element_blank(), 
        axis.text.x = element_text(size = 7), axis.text.y = element_text(size = 7))

grid.arrange(x, y, ncol = 2)
```

***

As an **important note**, *standardization* of the numeric variables is critical for obtaining valid results from FAMD. This is done automatically by the three packages currently available for FAMD (`FactoMineR`, `PCAmixdata` and `prince`), so it does not need to be done beforehand.

We can first inspect the calculated principal dimensions (PDs), which are linear combinations of the original variables constructed to better account for the variance in the dataset. Inspecting the eigenvalue and the percentage of variance explained by each PD, using scree plots, gives a sense of how "informative" the original variables are. For standardized data, an eigenvalue > 1 indicates that a PD accounts for more variance than any single original variable, and this is commonly used as a cutoff for deciding which PDs to retain for further analysis. The scree plot on the left indicates that only the first four PDs each account for more variance than a single original variable, whereas the one on the right shows that together they account for only 46.7% of the total variance in the dataset. This suggests that the dataset is quite complex, potentially because 1) the relationships between the variables are non-linear, and/or 2) some factors (variables) that account for variance in this dataset are not included in this analysis. All in all, FAMD can be a great first step in sizing up a dataset, which we will further demonstrate in the next slide.
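
As a quick sanity check of this cutoff, the eigenvalue table can also be inspected directly. This is a minimal sketch, assuming the `res.famd` object computed in the chunk above is still available in the session:

```{r, eval=F, echo=T}
## Import library
library(factoextra)

## Eigenvalue, % variance and cumulative % variance for each PD
## (assumes res.famd was computed as in the chunk above)
ev <- get_eigenvalue(res.famd)

head(ev)

## PDs that pass the eigenvalue > 1 cutoff
ev[ev[, "eigenvalue"] > 1, ]
```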

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)
library(grid)

## FAMD
res.famd <- FAMD(df)

## Visualize
a <- fviz_eig(res.famd, choice='eigenvalue', geom='line')
b <- fviz_eig(res.famd)

grid.arrange(a, b, ncol=2)
```

### **Plot individual observations in new feature space**

Exploring "learnable" patterns in the dataset {data-commentary-width=500} ```{r} ## Import libraries library(FactoMineR) library(factoextra) library(plotly) ## Import data df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv') ## FAMD res.famd <- FAMD(df, sup.var = 20, graph = FALSE, ncp=25) ## Concate original data with coordinates for ## the first three principal dimensions val_df <- as.data.frame(res.famd$ind) x <- cbind(df, val_df[1:3]) ## Plot plot_ly(x, x = ~coord.Dim.1, y = ~coord.Dim.2, z = ~coord.Dim.3, color = ~Churn, colors = c('#FF0000', '#3CB371')) %>% add_markers(size=3, opacity = 0.2) %>% layout(scene = list(xaxis = list(title = 'Principal dimension 1'), yaxis = list(title = 'Principal dimension 2'), zaxis = list(title = 'Principal dimension 3'))) ``` ***

We can now visualize the individual data points in the new feature space created by the first three, and thus "most informative", PDs. This is particularly useful when we want to see how "separable" groups of data points are, in our case with respect to customer churn, ahead of building supervised predictive models. To this end, the points are coloured by the variable `Churn`. This nifty 3D scatter plot can be dragged around to inspect the points from different angles, so give it a try! As individuals with similar profiles, *i.e.* personal characteristics and purchasing behaviour, lie close to each other in the figure, the large overlap between the "Churn" and "No churn" populations suggests that any significant/meaningful differences between the two populations are likely complex and non-linear. It is also possible that the variables in this dataset are not suited or sufficient to capture the difference between customers who churn and those who do not. Either way, the overlapping distributions suggest that we are probably not going to get very good separation of churned and not-churned customers using a predictive model such as a random forest classifier, at least with the data as it is now. This is a good example of how dimensionality reduction can help us gauge how "successful" other analyses of a dataset could be.

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(plotly)

## FAMD
res.famd <- FAMD(df)

## Concatenate original data with coordinates for
## the first three principal dimensions
val_df <- as.data.frame(res.famd$ind)

x <- cbind(df, val_df[1:3])

## Plot
plot_ly(x, x = ~coord.Dim.1, y = ~coord.Dim.2, z = ~coord.Dim.3, color = ~Churn)
```

### **Correlation circle**

To examine relationships between quantitative variables {data-commentary-width=650}

```{r}
## Import library
library(PCAmixdata)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## Drop the TotalCharges variable, as it is a product of MonthlyCharges and Tenure
df <- within(df, rm('TotalCharges'))

## Split quantitative and qualitative variables
split <- splitmix(df)

## FAMD
res.pcamix <- PCAmix(X.quanti = split$X.quanti, X.quali = split$X.quali, 
                     rename.level = TRUE, graph = FALSE, ndim = 25)

## Plotting
p <- plot(res.pcamix, choice = "cor", cex = 0.6, main = ' ')
```

***

As PDs are linear combinations of the original variables, understanding their relationships with those variables can help identify which variables are most important in describing the total variance in a dataset. The **factor loading** of a variable describes the correlation, *i.e.* the information shared, between it and a given PD. Squaring the factor loading gives the variable's **squared loading** (also called the *squared cosine* or *cos2*), which measures the proportion of variance in the variable that is captured by that PD. For each variable, the squared loadings across all PDs sum to 1.

One way to depict these relationships is a **correlation circle**, which plots *only* continuous variables, using their loadings on any two PDs as coordinates. Since the squared loadings of a given variable sum to 1 across all PDs, a variable that is perfectly represented by just the two plotted PDs satisfies:

$$ (factor\, loading_{PD1})^{2} + (factor\, loading_{PD2})^{2} = 1 $$

When plotted using the factor loading on each PD as Cartesian coordinates, this is the same as:

$$ x^{2} + y^{2} = 1 $$

The circle in the plot has a radius of 1, meaning that the projection endpoint of any such variable lies exactly on the circle.
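
To make the "sums to 1" property concrete, the squared loadings (cos2) can be pulled out of the FAMD result and summed across dimensions. This is a minimal sketch, assuming the earlier `res.famd` object (computed with a large `ncp`) is still available:

```{r, eval=F, echo=T}
## Import library
library(factoextra)

## Squared loadings (cos2) of the quantitative variables on each PD
## (assumes res.famd was computed earlier with a large ncp)
quanti_var <- get_famd_var(res.famd, "quanti.var")

head(quanti_var$cos2)

## Row sums should be close to 1 (exactly 1 only when every PD is retained)
rowSums(quanti_var$cos2)
```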

In this correlation circle, we see that PD1 and PD2 together do a pretty good job of capturing the information contained in the `MonthlyCharges` variable, as its endpoint lies very close to the circle. Conversely, if more PDs are needed to capture the information contained in a variable, the length of its projection will be less than 1 and the endpoint will sit inside the circle. The projection of the `Tenure` variable lies closer to the origin than that of `MonthlyCharges`, indicating that more than just PD1 and PD2 are needed to fully represent the information it contains. In general, the closer a variable is to the circle, the more important it is for interpreting the PDs involved. In addition, variables on opposite sides of the origin are inversely correlated, whereas those on the same side are positively correlated. It makes intuitive sense that `MonthlyCharges` and `Tenure` are inversely related, as customers who pay a high monthly fee are more likely to try to change providers.

```{r, eval=F, echo=T}
## Import library
library(PCAmixdata)

## Split quantitative and qualitative variables
split <- splitmix(df)

## FAMD
res.pcamix <- PCAmix(X.quanti = split$X.quanti, X.quali = split$X.quali)

## Plotting
plot(res.pcamix, choice = "cor")
```

### **Squared loading plot**

To visualize the role of *all* variables in accounting for overall variation in a dataset {data-commentary-width=550}

```{r}
## Import libraries
library(FactoMineR)
library(factoextra)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## FAMD
res.famd <- FAMD(df, sup.var = 20, graph = FALSE, ncp = 25)

## Plot
## Colour variables by their squared loading
p <- fviz_famd_var(res.famd, 'var', axes = c(1, 2), labelsize = 3, pointsize = 1, 
                   col.var = 'cos2', gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
                   repel = TRUE) + 
  xlim(-0.1, 0.85) + ylim(-0.1, 0.70) + 
  theme(text = element_text(size = 8), plot.title = element_blank(), 
        axis.text.x = element_text(size = 7), axis.text.y = element_text(size = 7))

## Add the supplementary variable, Churn, to the plot
fviz_add(p, res.famd$var$coord.sup, geom = c("arrow", "text"), labelsize = 3, 
         pointsize = 1, color = "blue", repel = TRUE)
```

***

Squared loading plots allow us to visualize qualitative and quantitative variables **together** in the new feature space. The `FactoMineR` implementation used here has the added benefit of allowing `Churn` to be included as a supplementary variable, so that we can see its relationship with the other variables without it being part of the original analysis. This is useful, as most downstream analyses would try to predict `Churn`. As an aside, unlike correlation circles, this plot shows only positive values on the x- and y-axes. According to the authors of the `PCAmixdata` package, which produces an equivalent plot, the coordinates are to be interpreted as measuring "the links (signless) between variables and principal components"; in other words, each variable's coordinate on a given axis can be read as its (sign-less) squared loading on that PD.
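
For reference, the squared loadings themselves can also be inspected as a table rather than a plot. This is a minimal sketch using `PCAmixdata`, assuming the `res.pcamix` object from the correlation circle slide is available:

```{r, eval=F, echo=T}
## Import library
library(PCAmixdata)

## Squared loadings of all variables (quantitative and qualitative) on each PD
## (assumes res.pcamix was computed as in the correlation circle slide)
head(res.pcamix$sqload)
```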

Interpretation of the squared loading plot is very similar to that of the correlation circle. We see several interesting things:

1. Consistent with what we saw in the correlation circle, `MonthlyCharges` is more closely correlated with PD1 than with PD2, whereas `Tenure` is described by a more even combination of PD1 and PD2.
2. Being furthest from the origin, the variables `Contract`, `InternetService` and `MonthlyCharges` have the highest squared loading values and so are more important in explaining the variance captured by PD1 and PD2 than variables clustered near the origin, such as `Gender`, `PhoneService` and `SeniorCitizen`.
3. The variable of interest, `Churn`, lies essentially on the y-axis, indicating that, of the first two PDs, it is PD2 that captures whatever variation in this variable they represent.

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD
res.famd <- FAMD(df)

## Plot
## Colour variables by their squared loading
p <- fviz_famd_var(res.famd, 'var', axes = c(1, 2), col.var = 'cos2')

## Add the supplementary variable, Churn
fviz_add(p, res.famd$var$coord.sup, color = "blue")
```

### **Variable contribution**

Contributions of variables to principal dimensions {data-commentary-width=550}

```{r, fig.width=7, fig.height=4, fig.align='center'}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)

## Plot
a <- fviz_contrib(res.famd, choice = "var", axes = 1, top = 10)
b <- fviz_contrib(res.famd, choice = "var", axes = 2, top = 10)

grid.arrange(a, b, nrow = 1)

#fviz_contrib(res.famd, choice = "var", axes = 1:2, top = 15)
```

***

Whereas factor loading and squared loading measure how well a given PD "describes" the variation captured in a variable, **contribution** describes the converse: how much a variable accounts for the total variation captured by a given PD. It is important to compare the squared loading and the contribution of each variable to critically assess its relationship with a given PD, as a variable that contributes strongly to a PD may not itself be well represented by that PD, and the two situations warrant very different interpretations.

The top contributing variables to the first few PDs can provide insights into which variables underlie variation in the dataset, and may help with feature selection for downstream analyses. The red dashed line indicates the expected average contribution (100% divided by the total number of variables available in the dataset), so variables above this cut-off can be considered important contributors to the PD. From the variables that meet the cut-off, we can glean some insights into which variables are most important in this dataset, such as `MonthlyCharges`, `InternetService` and `Tenure`. So, FAMD can also be a handy tool for variable selection.
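
The contribution values behind these bar plots can also be extracted directly. This is a minimal sketch, assuming `res.famd` is available and using the `factoextra` accessor:

```{r, eval=F, echo=T}
## Import library
library(factoextra)

## Contribution (%) of each variable to each PD
## (assumes res.famd was computed earlier)
contrib <- get_famd_var(res.famd)$contrib

## Expected average contribution if all variables contributed equally
avg_contrib <- 100 / nrow(contrib)

## Variables contributing more than average to PD1
contrib[contrib[, 1] > avg_contrib, 1, drop = FALSE]
```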

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)

## FAMD
res.famd <- FAMD(df)

## Plots
a <- fviz_contrib(res.famd, choice = "var", axes = 1)
b <- fviz_contrib(res.famd, choice = "var", axes = 2)

grid.arrange(a, b, nrow = 1)

#fviz_contrib(res.famd, choice = "var", axes = 1:2)
```

### **Level map**

More granular insights into relationships amongst variables {data-commentary-width=550}

```{r}
## Import libraries
library(FactoMineR)
library(factoextra)
library(plyr)
library(dplyr)
library(arulesCBA)

## Import data
df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv")

## Discretize "MonthlyCharges" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_MonthlyCharges <- discretizeDF.supervised(Churn ~ ., 
                                                    df[, c('MonthlyCharges', 'Churn')], 
                                                    method = 'mdlp')$MonthlyCharges

## Rename the levels based on knowledge of min/max monthly charges
df$Binned_MonthlyCharges = revalue(df$Binned_MonthlyCharges, 
                                   c("[-Inf,29.4)" = "$0-29.4", 
                                     "[29.4,56)" = "$29.4-56", 
                                     "[56,68.8)" = "$56-68.8", 
                                     "[68.8,107)" = "$68.8-107", 
                                     "[107, Inf]" = "$107-118.75"))

## Discretize "Tenure" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_Tenure <- discretizeDF.supervised(Churn ~ ., 
                                            df[, c('Tenure', 'Churn')], 
                                            method = 'mdlp')$Tenure

## Rename the levels based on knowledge of min/max tenures
df$Binned_Tenure = revalue(df$Binned_Tenure, 
                           c("[-Inf,1.5)" = "1-1.5m", 
                             "[1.5,5.5)" = "1.5-5.5m", 
                             "[5.5,17.5)" = "5.5-17.5m", 
                             "[17.5,43.5)" = "17.5-43.5m", 
                             "[43.5,59.5)" = "43.5-59.5m", 
                             "[59.5,70.5)" = "59.5-70.5m", 
                             "[70.5, Inf]" = "70.5-72m"))

## MCA, with "Churn" set as the supplementary variable
res.mca <- MCA(df, quanti.sup = c(5, 18, 19), quali.sup = c(20), graph = FALSE)

## Plot relationship between levels of categorical variables obtained from MCA
fviz_mca_var(res.mca, col.var = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             labelsize = 3, pointsize = 1, repel = TRUE) + 
  xlim(-1.5, 2) + ylim(-1.0, 1.25) + 
  theme(text = element_text(size = 8), plot.title = element_blank(), 
        axis.text.x = element_text(size = 7), axis.text.y = element_text(size = 7))
```

***

We can further visualize the relationships between possible values of variables in level maps. This allows us to get much more fine-grained insights, as for example "Senior Citizen" and "Not Senior Citizen" carry very different meanings, which are lost when lumped together into a single variable. For continuous variables to be used in this analysis, they need to be discretized and made into "pseudo" categorical variables. In this case, as I am interested in getting the most informative value bins with respect to the outcome variable of interest, `Churn`, I used the supervised discretization function implemented by the `arulesCBA` package (see more about it [here](http://rpubs.com/nchelaru/eda)).

There is a lot of information in this plot! Interpretation of the level map is similar to that of correlation circles and squared loading plots:

- Values that are situated close to each other are more closely related
- Values that are closer to the origin are less important in accounting for the variance in the dataset than those farther away
- Values that are on opposite sides of the origin are negatively correlated, whereas those on the same side are positively correlated

Briefly, since we are most interested in `Churn`, we can see that having a month-to-month plan and paying by electronic cheque are associated with customers who churn, whereas having a one-year contract and not being a senior citizen are associated with those who do not. These are actionable insights that can inform marketing campaigns and customer retention strategies.

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(arulesCBA)

## Discretize "Tenure" with respect to "Churn"/"No Churn"
df$Binned_Tenure <- discretizeDF.supervised(Churn ~ ., 
                                            df[, c('Tenure', 'Churn')], 
                                            method = 'mdlp')$Tenure

## MCA, with "Churn" set as the supplementary variable
res.mca <- MCA(df, quanti.sup = c(5, 18, 19), quali.sup = c(20))

## Plot relationship between levels of categorical variables obtained from MCA
fviz_mca_var(res.mca, col.var = "cos2")
```

### **Varimax rotation**

To further facilitate interpretation of the relationships between variables and principal dimensions {data-commentary-width=500}

```{r, fig.height=8, fig.width=8}
## Import library
library(PCAmixdata)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## Drop the TotalCharges variable, as it is a product of MonthlyCharges and Tenure
df <- within(df, rm('TotalCharges'))

## Split quantitative and qualitative variables
split <- splitmix(df)

## FAMD
res.pcamix <- PCAmix(X.quanti = split$X.quanti, X.quali = split$X.quali, 
                     rename.level = TRUE, graph = FALSE, ndim = 25)

## Apply varimax rotation to the first two PDs
res.pcarot <- PCArot(res.pcamix, dim = 2, graph = FALSE)

## Plot
plot(res.pcarot, choice = "sqload", coloring.var = TRUE, axes = c(1, 2), 
     leg = TRUE, posleg = "topright", main = ' ', 
     xlim = c(0, 0.8), ylim = c(0, 0.6), cex = 0.8)
```

***

To further facilitate interpretation of the relationships between variables and PDs, an additional rotation can be applied to the PDs so that a few variables end up with high factor loadings and the rest with low factor loadings. In other words, a small number of variables become highly correlated with each rotated PD. The most common form is varimax rotation, a generalized version of which is implemented for mixed data in the `PCAmixdata` package. To learn more, there is an excellent explanation on [stats StackExchange](https://stats.stackexchange.com/questions/151653/what-is-the-intuitive-reason-behind-doing-rotations-in-factor-analysis-pca-how). Here we see a slightly different version of the squared loading plot from before: after rotation, `MonthlyCharges` and `InternetService` load more heavily on the rotated PD1, and `Tenure` and `Contract` on the rotated PD2, as their projections are more closely aligned with one axis or the other. This indicates that these four variables are the most important in accounting for the overall variance in the dataset.
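
To see the effect of the rotation numerically rather than graphically, the squared loadings before and after rotation can be compared. This is a minimal sketch, assuming the `res.pcamix` and `res.pcarot` objects from the chunk above are available:

```{r, eval=F, echo=T}
## Import library
library(PCAmixdata)

## Squared loadings on the first two PDs before rotation
## (assumes res.pcamix and res.pcarot were computed as in the chunk above)
res.pcamix$sqload[, 1:2]

## Squared loadings after varimax rotation: each variable should now load
## more heavily on one of the two rotated dimensions
res.pcarot$sqload
```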

```{r, eval=F, echo=T}
## Import library
library(PCAmixdata)

## Split quantitative and qualitative variables
split <- splitmix(df)

## FAMD
res.pcamix <- PCAmix(X.quanti = split$X.quanti, X.quali = split$X.quali)

## Apply varimax rotation to the first two PDs
res.pcarot <- PCArot(res.pcamix, dim = 2, graph = FALSE)

## Plot
plot(res.pcarot, choice = "sqload", coloring.var = TRUE, axes = c(1, 2))
```

### **Hierarchical clustering**

Unsupervised clustering based on principal dimensions {data-commentary-width=500}

```{r}
## Hierarchical clustering
res.hcpc <- HCPC(res.famd, nb.clust = -1, graph = FALSE)

## Plot
fviz_cluster(res.hcpc, geom = "point", main = "Factor map") + 
  theme(plot.title = element_blank(), 
        axis.text.x = element_text(size = 4), axis.text.y = element_text(size = 4))
```

***

As PDs are essentially "synthetic" variables that summarize the most important sources of variation in the data, they are very useful for reducing noise and redundancy in a dataset, which in turn helps to reveal its inherent structure. One way to take this a step further is to hierarchically cluster the individual data points by their "positions" in the principal dimension feature space; the implementation in the `FactoMineR` package uses Ward's criterion. Doing this on the Telco customer churn data reveals three partially overlapping clusters, each of which contains some proportion of churned/not churned customers. This further supports the idea that the customers do not divide cleanly along that categorisation, but rather by some complex interaction of multiple factors. In the next slide, we will generate some summary visualizations to see how the clusters differ in terms of customer demographics and purchasing behaviour.
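
Beyond the factor map, `HCPC` also returns a statistical description of each cluster, which is handy for the comparison planned for the next slide. This is a minimal sketch, assuming the `res.hcpc` object from the chunk above is available:

```{r, eval=F, echo=T}
## Import library
library(FactoMineR)

## Variables and categories that best characterize each cluster
## (assumes res.hcpc was computed as in the chunk above)
res.hcpc$desc.var

## Principal dimensions that best separate the clusters
res.hcpc$desc.axes
```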

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD
res.famd <- FAMD(df, sup.var = 20)

## Hierarchical clustering
res.hcpc <- HCPC(res.famd, nb.clust = -1)

## Plot
fviz_cluster(res.hcpc, geom = "point")
```

### **Examine clusters**

Explore characteristics of clusters {data-commentary-width=550}

```{r cache = TRUE}
## Import libraries
library(autoEDA)
library(svglite)
library(htmlwidgets)
library(slickR)
library(shiny)

## Rename cluster column
names(res.hcpc$data.clust)[names(res.hcpc$data.clust) == 'clust'] <- 'Cluster'

## autoEDA
auto_res <- autoEDA(res.hcpc$data.clust, y = "Cluster", returnPlotList = TRUE,
                    plotCategorical = "groupedBar", plotContinuous = "histogram", 
                    bins = 30, rotateLabels = TRUE, color = "#26A69A", verbose = FALSE)

## Create list of autoEDA figures converted to SVG
plotsToSVG <- list()

i <- 1

for (v in auto_res$plots) {
  x <- xmlSVG({show(v)}, standalone = TRUE)
  plotsToSVG[[i]] <- x
  i <- i + 1
}

## Custom function needed to render SVGs in Chrome/Firefox
hash_encode_url <- function(url){
  gsub("#", "%23", url)
}

## Pass list of figures to SlickR
s.in <- sapply(plotsToSVG, function(sv){
  hash_encode_url(paste0("data:image/svg+xml;utf8,", as.character(sv)))
})

#Sys.sleep(4)

slickR(s.in, slideId = 'ex4', 
       slickOpts = list(dots = TRUE, arrows = FALSE, autoplay = TRUE, 
                        adaptiveWidth = TRUE, adaptiveHeight = TRUE), 
       height = '525px')
```

***

The automated exploratory data analysis package `autoEDA` is very useful for generating summary visualizations of each variable in a dataset with respect to a target variable of interest. In our case, we want to see the distribution of values of each variable within each of the three clusters generated by hierarchical clustering on the FAMD results. We see that each of the clusters has a different proportion of churned customers, with clusters 1 and 3 containing very few churned customers and cluster 2 being a more even split. Comparing the demographic characteristics and purchasing behaviours of these clusters can generate more nuanced insights into what differentiates loyal customers from those who are much more "on the fence". As a side note, the slider you see here is generated using the `slickR` package, which is a nifty way to show multiple figures when space is limited. Learn more about it at its GitHub repo [here](https://github.com/metrumresearchgroup/slickR).

```{r, eval=F, echo=T}
## Import library
library(autoEDA)

## autoEDA
auto_res <- autoEDA(res.hcpc$data.clust, y = "clust", returnPlotList = TRUE,
                    plotCategorical = "groupedBar", bins = 30, 
                    rotateLabels = TRUE, color = "#26A69A", verbose = FALSE)

## Plot
auto_res$plots
```

Session info
=========================================

Column
---------------------------------------------

```{r}
sessionInfo()
```