---
title: "Factor analysis of mixed data"
output: 
  flexdashboard::flex_dashboard:
    source_code: embed
    social: menu
    theme: flatly
---

```{r setup, include=FALSE}
library(flexdashboard)
library(knitr)
library(kableExtra)

knitr::opts_chunk$set(cache = TRUE, echo = FALSE, eval = TRUE, warning = FALSE, message = FALSE)
```




Workflow {.storyboard}
=========================================

Inputs {.sidebar}
-------------------------------------

Dimensionality reduction using principal component methods is a very handy tool for identifying relationships amongst variables and hidden patterns in a dataset. 

Principal component analysis (PCA) is arguably the most widely known, but its use is limited to datasets containing **only** continuous variables. As real-world data often contain a mix of continuous and categorical variables, **factor analysis of mixed data (FAMD)** is an extremely valuable alternative approach to be familiar with. FAMD can be seen as combining PCA for continuous variables with multiple correspondence analysis (MCA) for categorical variables. 

Here, we will perform FAMD on the [IBM Telco customer churn dataset](https://developer.ibm.com/patterns/predict-customer-churn-using-watson-studio-and-jupyter-notebooks/), to gain insights into the relationships between various aspects of customer behaviour. This will be a toy example of how FAMD can be used to derive actionable business insights.



### **Calculate & inspect principal dimensions** 

Dimensionality reduction through creating a new feature space {data-commentary-width=500}

```{r, fig.width=7}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)
library(grid)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## FAMD
## Set the target variable "Churn" as a supplementary variable
res.famd <- FAMD(df, sup.var = 20, graph = FALSE, ncp = 25)

## Inspect principal dimensions
ev <- data.frame(get_eigenvalue(res.famd))

## Visualize
x <- fviz_eig(res.famd, ncp = 7, choice = 'eigenvalue', geom = 'line') + 
  theme(text = element_text(size = 8), plot.title = element_blank(), 
        axis.text.x = element_text(size = 7), axis.text.y = element_text(size = 7))

y <- fviz_eig(res.famd, ncp = 7) + 
  theme(text = element_text(size = 8), plot.title = element_blank(), 
        axis.text.x = element_text(size = 7), axis.text.y = element_text(size = 7))

grid.arrange(x, y, ncol = 2)
```

***

As an **important note**, *standardization* of the numeric variables is critical for obtaining valid results from FAMD. This is done automatically by the three packages currently available for FAMD (`FactoMineR`, `PCAmixdata` and `prince`), so it does not need to be done beforehand.

We can first inspect the calculated principal dimensions (PDs), which are linear combinations of the original variables constructed to better account for the variance in the dataset. Inspecting the eigenvalue and the percentage of variance explained by each PD, using scree plots, gives a sense of how "informative" the original variables are. For standardized data, an eigenvalue > 1 indicates that a PD accounts for more variance than any single original variable, and this is commonly used as a cutoff for deciding which PDs to retain for further analysis. The scree plot on the left indicates that only the first four PDs each account for more variance than a single original variable, whereas the one on the right shows that together they account for only 46.7% of the total variance in the dataset. This suggests that the dataset is quite complex, potentially because 1) the relationships between the variables are non-linear, and/or 2) some factors (variables) that account for variance in this dataset are not included in this analysis. All in all, FAMD can be a great first step in sizing up a dataset, which we will further demonstrate in the next slide.
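
As a quick sanity check of this cutoff, the eigenvalue table can also be inspected directly. This is a minimal sketch, assuming the `res.famd` object computed in the chunk above is still available in the session:

```{r, eval=F, echo=T}
## Import library
library(factoextra)

## Eigenvalue, % variance and cumulative % variance for each PD
## (assumes res.famd was computed as in the chunk above)
ev <- get_eigenvalue(res.famd)

head(ev)

## PDs that pass the eigenvalue > 1 cutoff
ev[ev[, "eigenvalue"] > 1, ]
```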

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)
library(grid)

## FAMD
res.famd <- FAMD(df)

## Visualize
a <- fviz_eig(res.famd, choice='eigenvalue', geom='line')
b <- fviz_eig(res.famd)

grid.arrange(a, b, ncol=2)
```

### **Plot individual observations in new feature space**

Exploring "learnable" patterns in the dataset {data-commentary-width=500} ```{r} ## Import libraries library(FactoMineR) library(factoextra) library(plotly) ## Import data df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv') ## FAMD res.famd <- FAMD(df, sup.var = 20, graph = FALSE, ncp=25) ## Concate original data with coordinates for ## the first three principal dimensions val_df <- as.data.frame(res.famd$ind) x <- cbind(df, val_df[1:3]) ## Plot plot_ly(x, x = ~coord.Dim.1, y = ~coord.Dim.2, z = ~coord.Dim.3, color = ~Churn, colors = c('#FF0000', '#3CB371')) %>% add_markers(size=3, opacity = 0.2) %>% layout(scene = list(xaxis = list(title = 'Principal dimension 1'), yaxis = list(title = 'Principal dimension 2'), zaxis = list(title = 'Principal dimension 3'))) ``` ***

We can now visualize the individual data points in the new feature space created by the first three, and thus "most informative", PDs. This is particularly useful when we want to see how "separable" groups of data points are, in our case with respect to customer churn, ahead of building supervised predictive models. To this end, the points are coloured by the variable `Churn`. This nifty 3D scatter plot can be dragged around to inspect the points from different angles, so give it a try! As individuals with similar profiles, *i.e.* personal characteristics and purchasing behaviour, lie close to each other in the figure, the large overlap between the "Churn" and "No churn" populations suggests that any significant/meaningful differences between the two populations are likely complex and non-linear. It is also possible that the variables in this dataset are not suited or sufficient to capture the difference between customers who churn and those who do not. Either way, the overlapping distributions suggest that we are probably not going to get very good separation of churned and not-churned customers using a predictive model such as a random forest classifier, at least with the data as it is now. This is a good example of how dimensionality reduction can help us gauge how "successful" other analyses of a dataset could be.

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(plotly)

## FAMD
res.famd <- FAMD(df)

## Concatenate original data with coordinates for
## the first three principal dimensions
val_df <- as.data.frame(res.famd$ind)

x <- cbind(df, val_df[1:3])

## Plot
plot_ly(x, x = ~coord.Dim.1, y = ~coord.Dim.2, z = ~coord.Dim.3, color = ~Churn)
```

### **Correlation circle**

To examine relationships between quantitative variables {data-commentary-width=650}

```{r}
## Import library
library(PCAmixdata)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## Drop the TotalCharges variable, as it is a product of MonthlyCharges and Tenure
df <- within(df, rm('TotalCharges'))

## Split quantitative and qualitative variables
split <- splitmix(df)

## FAMD
res.pcamix <- PCAmix(X.quanti = split$X.quanti, X.quali = split$X.quali, 
                     rename.level = TRUE, graph = FALSE, ndim = 25)

## Plotting
p <- plot(res.pcamix, choice = "cor", cex = 0.6, main = ' ')
```

***

As PDs are linear combinations of the original variables, understanding their relationships with those variables can help identify which variables are most important in describing the total variance in a dataset. The **factor loading** of a variable describes the correlation, *i.e.* the information shared, between it and a given PD. Squaring the factor loading gives the variable's **squared loading** (also called the *squared cosine* or *cos2*), which measures the proportion of variance in the variable that is captured by that PD. For each variable, the squared loadings across all PDs sum to 1.

One way to depict these relationships is a **correlation circle**, which plots *only* continuous variables, using their loadings on any two PDs as coordinates. Since the squared loadings of a given variable sum to 1 across all PDs, a variable that is perfectly represented by just the two plotted PDs satisfies:

$$ (factor\, loading_{PD1})^{2} + (factor\, loading_{PD2})^{2} = 1 $$

When plotted using the factor loading on each PD as Cartesian coordinates, this is the same as:

$$ x^{2} + y^{2} = 1 $$

The circle in the plot has a radius of 1, meaning that the projection endpoint of any such variable lies exactly on the circle.
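
To make the "sums to 1" property concrete, the squared loadings (cos2) can be pulled out of the FAMD result and summed across dimensions. This is a minimal sketch, assuming the earlier `res.famd` object (computed with a large `ncp`) is still available:

```{r, eval=F, echo=T}
## Import library
library(factoextra)

## Squared loadings (cos2) of the quantitative variables on each PD
## (assumes res.famd was computed earlier with a large ncp)
quanti_var <- get_famd_var(res.famd, "quanti.var")

head(quanti_var$cos2)

## Row sums should be close to 1 (exactly 1 only when every PD is retained)
rowSums(quanti_var$cos2)
```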

In this correlation circle, we see that PD1 and PD2 together do a pretty good job of capturing the information contained in the `MonthlyCharges` variable, as its endpoint lies very close to the circle. Conversely, if more PDs are needed to capture the information contained in a variable, the length of its projection will be less than 1 and the endpoint will sit inside the circle. The projection of the `Tenure` variable lies closer to the origin than that of `MonthlyCharges`, indicating that more than just PD1 and PD2 are needed to fully represent the information it contains. In general, the closer a variable is to the circle, the more important it is for interpreting the PDs involved. In addition, variables on opposite sides of the origin are inversely correlated, whereas those on the same side are positively correlated. It makes intuitive sense that `MonthlyCharges` and `Tenure` are inversely related, as customers who pay a high monthly fee are more likely to try to change providers.

```{r, eval=F, echo=T}
## Import library
library(PCAmixdata)

## Split quantitative and qualitative variables
split <- splitmix(df)

## FAMD
res.pcamix <- PCAmix(X.quanti = split$X.quanti, X.quali = split$X.quali)

## Plotting
plot(res.pcamix, choice = "cor")
```

### **Squared loading plot**

To visualize the role of *all* variables in accounting for overall variation in a dataset {data-commentary-width=550}

```{r}
## Import libraries
library(FactoMineR)
library(factoextra)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## FAMD
res.famd <- FAMD(df, sup.var = 20, graph = FALSE, ncp = 25)

## Plot
## Colour variables by their squared loading
p <- fviz_famd_var(res.famd, 'var', axes = c(1, 2), labelsize = 3, pointsize = 1, 
                   col.var = 'cos2', gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
                   repel = TRUE) + 
  xlim(-0.1, 0.85) + ylim(-0.1, 0.70) + 
  theme(text = element_text(size = 8), plot.title = element_blank(), 
        axis.text.x = element_text(size = 7), axis.text.y = element_text(size = 7))

## Add the supplementary variable, Churn, to the plot
fviz_add(p, res.famd$var$coord.sup, geom = c("arrow", "text"), labelsize = 3, 
         pointsize = 1, color = "blue", repel = TRUE)
```

***

Squared loading plots allow us to visualize qualitative and quantitative variables **together** in the new feature space. The `FactoMineR` implementation used here has the added benefit of allowing `Churn` to be included as a supplementary variable, so that we can see its relationship with the other variables without it being part of the original analysis. This is useful, as most downstream analyses would try to predict `Churn`. As an aside, unlike correlation circles, this plot shows only positive values on the x- and y-axes. According to the authors of the `PCAmixdata` package, which produces an equivalent plot, the coordinates are to be interpreted as measuring "the links (signless) between variables and principal components"; in other words, each variable's coordinate on a given axis can be read as its (sign-less) squared loading on that PD.
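
For reference, the squared loadings themselves can also be inspected as a table rather than a plot. This is a minimal sketch using `PCAmixdata`, assuming the `res.pcamix` object from the correlation circle slide is available:

```{r, eval=F, echo=T}
## Import library
library(PCAmixdata)

## Squared loadings of all variables (quantitative and qualitative) on each PD
## (assumes res.pcamix was computed as in the correlation circle slide)
head(res.pcamix$sqload)
```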

Interpretation of the squared loading plot is very similar to that of the correlation circle. We see several interesting things:

1. Consistent with what we saw in the correlation circle, `MonthlyCharges` is more closely correlated with PD1 than with PD2, whereas `Tenure` is described by a more even combination of PD1 and PD2.
2. Being furthest from the origin, the variables `Contract`, `InternetService` and `MonthlyCharges` have the highest squared loading values and so are more important in explaining the variance captured by PD1 and PD2 than variables clustered near the origin, such as `Gender`, `PhoneService` and `SeniorCitizen`.
3. The variable of interest, `Churn`, lies essentially on the y-axis, indicating that, of the first two PDs, it is PD2 that captures whatever variation in this variable they represent.

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD
res.famd <- FAMD(df)

## Plot
## Colour variables by their squared loading
p <- fviz_famd_var(res.famd, 'var', axes = c(1, 2), col.var = 'cos2')

## Add the supplementary variable, Churn
fviz_add(p, res.famd$var$coord.sup, color = "blue")
```

### **Variable contribution**

Contributions of variables to principal dimensions {data-commentary-width=550}

```{r, fig.width=7, fig.height=4, fig.align='center'}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)

## Plot
a <- fviz_contrib(res.famd, choice = "var", axes = 1, top = 10)
b <- fviz_contrib(res.famd, choice = "var", axes = 2, top = 10)

grid.arrange(a, b, nrow = 1)

#fviz_contrib(res.famd, choice = "var", axes = 1:2, top = 15)
```

***

Whereas factor loading and squared loading measure how well a given PD "describes" the variation captured in a variable, **contribution** describes the converse: how much a variable accounts for the total variation captured by a given PD. It is important to compare the squared loading and the contribution of each variable to critically assess its relationship with a given PD, as a variable that contributes strongly to a PD may not itself be well represented by that PD, and the two situations warrant very different interpretations.

The top contributing variables to the first few PDs can provide insights into which variables underlie variation in the dataset, and may help with feature selection for downstream analyses. The red dashed line indicates the expected average contribution (100% divided by the total number of variables available in the dataset), so variables above this cut-off can be considered important contributors to the PD. From the variables that meet the cut-off, we can glean some insights into which variables are most important in this dataset, such as `MonthlyCharges`, `InternetService` and `Tenure`. So, FAMD can also be a handy tool for variable selection.
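
The contribution values behind these bar plots can also be extracted directly. This is a minimal sketch, assuming `res.famd` is available and using the `factoextra` accessor:

```{r, eval=F, echo=T}
## Import library
library(factoextra)

## Contribution (%) of each variable to each PD
## (assumes res.famd was computed earlier)
contrib <- get_famd_var(res.famd)$contrib

## Expected average contribution if all variables contributed equally
avg_contrib <- 100 / nrow(contrib)

## Variables contributing more than average to PD1
contrib[contrib[, 1] > avg_contrib, 1, drop = FALSE]
```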

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(gridExtra)

## FAMD
res.famd <- FAMD(df)

## Plots
a <- fviz_contrib(res.famd, choice = "var", axes = 1)
b <- fviz_contrib(res.famd, choice = "var", axes = 2)

grid.arrange(a, b, nrow = 1)

#fviz_contrib(res.famd, choice = "var", axes = 1:2)
```

### **Level map**

More granular insights into relationships amongst variables {data-commentary-width=550}

```{r}
## Import libraries
library(FactoMineR)
library(factoextra)
library(plyr)
library(dplyr)
library(arulesCBA)

## Import data
df <- read.csv("https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv")

## Discretize "MonthlyCharges" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_MonthlyCharges <- discretizeDF.supervised(Churn ~ ., 
                                                    df[, c('MonthlyCharges', 'Churn')], 
                                                    method = 'mdlp')$MonthlyCharges

## Rename the levels based on knowledge of min/max monthly charges
df$Binned_MonthlyCharges = revalue(df$Binned_MonthlyCharges, 
                                   c("[-Inf,29.4)" = "$0-29.4", 
                                     "[29.4,56)" = "$29.4-56", 
                                     "[56,68.8)" = "$56-68.8", 
                                     "[68.8,107)" = "$68.8-107", 
                                     "[107, Inf]" = "$107-118.75"))

## Discretize "Tenure" with respect to "Churn"/"No Churn" label and assign to new column in dataframe
df$Binned_Tenure <- discretizeDF.supervised(Churn ~ ., 
                                            df[, c('Tenure', 'Churn')], 
                                            method = 'mdlp')$Tenure

## Rename the levels based on knowledge of min/max tenures
df$Binned_Tenure = revalue(df$Binned_Tenure, 
                           c("[-Inf,1.5)" = "1-1.5m", 
                             "[1.5,5.5)" = "1.5-5.5m", 
                             "[5.5,17.5)" = "5.5-17.5m", 
                             "[17.5,43.5)" = "17.5-43.5m", 
                             "[43.5,59.5)" = "43.5-59.5m", 
                             "[59.5,70.5)" = "59.5-70.5m", 
                             "[70.5, Inf]" = "70.5-72m"))

## MCA, with "Churn" set as the supplementary variable
res.mca <- MCA(df, quanti.sup = c(5, 18, 19), quali.sup = c(20), graph = FALSE)

## Plot relationship between levels of categorical variables obtained from MCA
fviz_mca_var(res.mca, col.var = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             labelsize = 3, pointsize = 1, repel = TRUE) + 
  xlim(-1.5, 2) + ylim(-1.0, 1.25) + 
  theme(text = element_text(size = 8), plot.title = element_blank(), 
        axis.text.x = element_text(size = 7), axis.text.y = element_text(size = 7))
```

***

We can further visualize the relationships between possible values of variables in level maps. This allows us to get much more fine-grained insights, as for example "Senior Citizen" and "Not Senior Citizen" carry very different meanings, which are lost when lumped together into a single variable. For continuous variables to be used in this analysis, they need to be discretized and made into "pseudo" categorical variables. In this case, as I am interested in getting the most informative value bins with respect to the outcome variable of interest, `Churn`, I used the supervised discretization function implemented by the `arulesCBA` package (see more about it [here](http://rpubs.com/nchelaru/eda)).

There is a lot of information in this plot! Interpretation of the level map is similar to that of correlation circles and squared loading plots:

- Values that are situated close to each other are more closely related
- Values that are closer to the origin are less important in accounting for the variance in the dataset than those farther away
- Values that are on opposite sides of the origin are negatively correlated, whereas those on the same side are positively correlated

Briefly, since we are most interested in `Churn`, we can see that having a month-to-month plan and paying by electronic cheque are associated with customers who churn, whereas having a one-year contract and not being a senior citizen are associated with those who do not. These are actionable insights that can inform marketing campaigns and customer retention strategies.

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)
library(arulesCBA)

## Discretize "Tenure" with respect to "Churn"/"No Churn"
df$Binned_Tenure <- discretizeDF.supervised(Churn ~ ., 
                                            df[, c('Tenure', 'Churn')], 
                                            method = 'mdlp')$Tenure

## MCA, with "Churn" set as the supplementary variable
res.mca <- MCA(df, quanti.sup = c(5, 18, 19), quali.sup = c(20))

## Plot relationship between levels of categorical variables obtained from MCA
fviz_mca_var(res.mca, col.var = "cos2")
```

### **Varimax rotation**

To further facilitate interpretation of the relationships between variables and principal dimensions {data-commentary-width=500}

```{r, fig.height=8, fig.width=8}
## Import library
library(PCAmixdata)

## Import data
df <- read.csv('https://github.com/nchelaru/data-prep/raw/master/telco_cleaned_renamed.csv')

## Drop the TotalCharges variable, as it is a product of MonthlyCharges and Tenure
df <- within(df, rm('TotalCharges'))

## Split quantitative and qualitative variables
split <- splitmix(df)

## FAMD
res.pcamix <- PCAmix(X.quanti = split$X.quanti, X.quali = split$X.quali, 
                     rename.level = TRUE, graph = FALSE, ndim = 25)

## Apply varimax rotation to the first two PDs
res.pcarot <- PCArot(res.pcamix, dim = 2, graph = FALSE)

## Plot
plot(res.pcarot, choice = "sqload", coloring.var = TRUE, axes = c(1, 2), 
     leg = TRUE, posleg = "topright", main = ' ', 
     xlim = c(0, 0.8), ylim = c(0, 0.6), cex = 0.8)
```

***

To further facilitate interpretation of the relationships between variables and PDs, an additional rotation can be applied to the PDs so that a few variables end up with high factor loadings and the rest with low factor loadings. In other words, a small number of variables become highly correlated with each rotated PD. The most common form is varimax rotation, a generalized version of which is implemented for mixed data in the `PCAmixdata` package. To learn more, there is an excellent explanation on [stats StackExchange](https://stats.stackexchange.com/questions/151653/what-is-the-intuitive-reason-behind-doing-rotations-in-factor-analysis-pca-how). Here we see a slightly different version of the squared loading plot from before: after rotation, `MonthlyCharges` and `InternetService` load more heavily on the rotated PD1, and `Tenure` and `Contract` on the rotated PD2, as their projections are more closely aligned with one axis or the other. This indicates that these four variables are the most important in accounting for the overall variance in the dataset.
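
To see the effect of the rotation numerically rather than graphically, the squared loadings before and after rotation can be compared. This is a minimal sketch, assuming the `res.pcamix` and `res.pcarot` objects from the chunk above are available:

```{r, eval=F, echo=T}
## Import library
library(PCAmixdata)

## Squared loadings on the first two PDs before rotation
## (assumes res.pcamix and res.pcarot were computed as in the chunk above)
res.pcamix$sqload[, 1:2]

## Squared loadings after varimax rotation: each variable should now load
## more heavily on one of the two rotated dimensions
res.pcarot$sqload
```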

```{r, eval=F, echo=T}
## Import library
library(PCAmixdata)

## Split quantitative and qualitative variables
split <- splitmix(df)

## FAMD
res.pcamix <- PCAmix(X.quanti = split$X.quanti, X.quali = split$X.quali)

## Apply varimax rotation to the first two PDs
res.pcarot <- PCArot(res.pcamix, dim = 2, graph = FALSE)

## Plot
plot(res.pcarot, choice = "sqload", coloring.var = TRUE, axes = c(1, 2))
```

### **Hierarchical clustering**

Unsupervised clustering based on principal dimensions {data-commentary-width=500}

```{r}
## Hierarchical clustering
res.hcpc <- HCPC(res.famd, nb.clust = -1, graph = FALSE)

## Plot
fviz_cluster(res.hcpc, geom = "point", main = "Factor map") + 
  theme(plot.title = element_blank(), 
        axis.text.x = element_text(size = 4), axis.text.y = element_text(size = 4))
```

***

As PDs are essentially "synthetic" variables that summarize the most important sources of variation in the data, they are very useful for reducing noise and redundancy in a dataset, which in turn helps to reveal its inherent structure. One way to take this a step further is to hierarchically cluster the individual data points by their "positions" in the principal dimension feature space; the implementation in the `FactoMineR` package uses Ward's criterion. Doing this on the Telco customer churn data reveals three partially overlapping clusters, each of which contains some proportion of churned/not churned customers. This further supports the idea that the customers do not divide cleanly along that categorisation, but rather by some complex interaction of multiple factors. In the next slide, we will generate some summary visualizations to see how the clusters differ in terms of customer demographics and purchasing behaviour.
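
Beyond the factor map, `HCPC` also returns a statistical description of each cluster, which is handy for the comparison planned for the next slide. This is a minimal sketch, assuming the `res.hcpc` object from the chunk above is available:

```{r, eval=F, echo=T}
## Import library
library(FactoMineR)

## Variables and categories that best characterize each cluster
## (assumes res.hcpc was computed as in the chunk above)
res.hcpc$desc.var

## Principal dimensions that best separate the clusters
res.hcpc$desc.axes
```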

```{r, eval=F, echo=T}
## Import libraries
library(FactoMineR)
library(factoextra)

## FAMD
res.famd <- FAMD(df, sup.var = 20)

## Hierarchical clustering
res.hcpc <- HCPC(res.famd, nb.clust = -1)

## Plot
fviz_cluster(res.hcpc, geom = "point")
```

### **Examine clusters**

Explore characteristics of clusters {data-commentary-width=550}

```{r cache = TRUE}
## Import libraries
library(autoEDA)
library(svglite)
library(htmlwidgets)
library(slickR)
library(shiny)

## Rename cluster column
names(res.hcpc$data.clust)[names(res.hcpc$data.clust) == 'clust'] <- 'Cluster'

## autoEDA
auto_res <- autoEDA(res.hcpc$data.clust, y = "Cluster", returnPlotList = TRUE,
                    plotCategorical = "groupedBar", plotContinuous = "histogram", 
                    bins = 30, rotateLabels = TRUE, color = "#26A69A", verbose = FALSE)

## Create list of autoEDA figures converted to SVG
plotsToSVG <- list()

i <- 1

for (v in auto_res$plots) {
  x <- xmlSVG({show(v)}, standalone = TRUE)
  plotsToSVG[[i]] <- x
  i <- i + 1
}

## Custom function needed to render SVGs in Chrome/Firefox
hash_encode_url <- function(url){
  gsub("#", "%23", url)
}

## Pass list of figures to SlickR
s.in <- sapply(plotsToSVG, function(sv){
  hash_encode_url(paste0("data:image/svg+xml;utf8,", as.character(sv)))
})

#Sys.sleep(4)

slickR(s.in, slideId = 'ex4', 
       slickOpts = list(dots = TRUE, arrows = FALSE, autoplay = TRUE, 
                        adaptiveWidth = TRUE, adaptiveHeight = TRUE), 
       height = '525px')
```

***

The automated exploratory data analysis package `autoEDA` is very useful for generating summary visualizations of each variable in a dataset with respect to a target variable of interest. In our case, we want to see the distribution of values of each variable within each of the three clusters generated by hierarchical clustering on the FAMD results. We see that each of the clusters has a different proportion of churned customers, with clusters 1 and 3 containing very few churned customers and cluster 2 being a more even split. Comparing the demographic characteristics and purchasing behaviours of these clusters can generate more nuanced insights into what differentiates loyal customers from those who are much more "on the fence". As a side note, the slider you see here is generated using the `slickR` package, which is a nifty way to show multiple figures when space is limited. Learn more about it at its GitHub repo [here](https://github.com/metrumresearchgroup/slickR).

```{r, eval=F, echo=T}
## Import library
library(autoEDA)

## autoEDA
auto_res <- autoEDA(res.hcpc$data.clust, y = "clust", returnPlotList = TRUE,
                    plotCategorical = "groupedBar", bins = 30, 
                    rotateLabels = TRUE, color = "#26A69A", verbose = FALSE)

## Plot
auto_res$plots
```

Session info
=========================================

Column
---------------------------------------------

```{r}
sessionInfo()
```