Chapter 4

###K-Means Clustering

K-means clustering (MacQueen, 1967) is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e., k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.

In this article, we'll describe the k-means algorithm and provide practical examples using R software.

—#4.1 K-means basic ideas

The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized.

There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:

W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2

where x_i is a data point belonging to the cluster C_k, and \mu_k is the mean value of the points assigned to the cluster C_k. Each observation is assigned to a given cluster such that its sum-of-squares distance to the assigned cluster center \mu_k is minimized. The total within-cluster variation is then:

tot.withinss = \sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2

—#4.2 K-means algorithm

The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final solution.

The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids.

Next, each of the remaining objects is assigned to its closest centroid, where closest is defined using the Euclidean distance (Chapter 3) between the object and the cluster mean. This step is called the "cluster assignment step". Note that, to use correlation distance, the data are input as z-scores.

After the assignment step, the algorithm computes the new mean value of each cluster. The term cluster "centroid update" is used to design this step. Now that the centers have been recalculated, every observation is checked again to see whether it might be closer to a different cluster. All the objects are reassigned again using the updated cluster means.

The cluster assignment and centroid update steps are iteratively repeated until the cluster assignments stop changing (i.e., until convergence is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration.

The K-means algorithm can be summarized as follows (a minimal R sketch of these steps is given after the list):

  1. Specify the number of clusters (K) to be created (by the analyst)
  2. Select randomly k objects from the data set as the initial cluster centers or means
  3. Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid
  4. For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of a kth cluster is a vector of length p containing the means of all variables for the observations in the kth cluster; p is the number of variables.
  5. Iteratively minimize the total within sum of squares. That is, iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. By default, the R software uses 10 as the default value for the maximum number of iterations.
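
To make these steps concrete, here is a minimal, illustrative sketch of the loop in base R. The name kmeans_sketch is our own (it is not part of any package); the sketch assumes a numeric data matrix, assumes no cluster becomes empty during the iterations and, unlike stats::kmeans(), uses a single random start:

kmeans_sketch <- function(x, k, iter.max = 10) {
  # Step 2: randomly pick k distinct rows as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0, nrow(x))
  for (i in seq_len(iter.max)) {
    # Step 3: distances from every point to every centroid (Euclidean),
    # then assign each point to its closest centroid
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_cluster <- max.col(-d)
    # Step 5: stop when the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 4: update each centroid as the mean of its assigned points
    centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster = cluster, centers = centers)
}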

—#4.3 Computing k-means clustering in R

–#4.3.1 Data

We'll use the demo data set "USArrests". The data should be prepared as described in Chapter 2. The data must contain only continuous variables, as the k-means algorithm uses variable means. As we don't want the k-means algorithm to depend on an arbitrary variable unit, we start by scaling the data using the R function scale() as follows:

data("USArrests")
# Loading the data set
df <- scale (USArrests) # Scaling the data
#View the firt 3 rows of the data
head (df, n = 3)
##             Murder   Assault   UrbanPop         Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska  0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona 0.07163341 1.4788032  0.9989801  1.042878388

–#4.3.2 Required R packages and functions

The standard R function for k-means clustering is kmeans() [stats package], whose simplified format is as follows:

set.seed(123)
x <- matrix(rnorm(100), nrow = 10)
kmeans(x, centers, iter.max = 10, nstart = 1)

- x: numeric matrix, numeric data frame or a numeric vector
- centers: Possible values are the number of clusters (k) or a set of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in x is chosen as the initial centers.
- iter.max: The maximum number of iterations allowed. The default value is 10.
- nstart: The number of random starting partitions when centers is a number. Trying nstart > 1 is often recommended.

To create a beautiful graph of the clusters generated with the kmeans() function, we'll use the factoextra package.

- Install the factoextra package as:

install.packages("factoextra")

-Loading factoextra:

library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

–#4.3.3 Estimating the optimal number of clusters

K-means clustering requires the users to specify the number of clusters to be generated.

One fundamental question is: How to choose the right number of expected clusters (k)?

Different methods will be presented in the chapter "cluster evaluation and validation statistics".

Here, we provide a simple solution. The idea is to compute k-means clustering using different values of clusters k. Next, the wss (within sum of squares) is drawn according to the number of clusters. The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.

The R function fviz_nbclust() [in factoextra package] provides a convenient solution to estimate the optimal number of clusters.

library(factoextra)
fviz_nbclust (df, kmeans, method="wss") + 
  geom_vline (xintercept =4, linetype =2)

The plot above represents the variance within the clusters. It decreases as k increases, but a bend (or "elbow") can be seen at k = 4. This bend indicates that additional clusters beyond the fourth have little value. In the next section, we'll classify the observations into 4 clusters.
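
For intuition, the wss curve that fviz_nbclust() plots can be reproduced with a short loop (a rough sketch; the real function also takes care of the plotting details):

# Total within-cluster sum of squares for k = 1..10; plotted against k,
# the location of the "elbow" suggests the appropriate number of clusters.
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")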

–#4.3.4 Computing k-means clustering

As the k-means clustering algorithm starts with k randomly selected centroids, it's always recommended to use the set.seed() function in order to set a seed for R's random number generator. The aim is to make the results reproducible, so that the reader of this article will obtain exactly the same results as those shown below.

The following R code performs k-means clustering with k = 4:

# Compute k-means with k = 4
set.seed(123)
km.res <- kmeans (df, 4, nstart = 25)

As the final result of k-means clustering is sensitive to the random starting assignments, we specify nstart = 25. This means that R will try 25 different random starting assignments and then select the best result, corresponding to the one with the lowest within-cluster variation. The default value of nstart in R is one. But, it's strongly recommended to compute k-means clustering with a large value of nstart such as 25 or 50, in order to have a more stable result.
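
The effect of nstart is easy to check yourself (an illustrative experiment; the exact values depend on your seeds):

# Single random starts can land in different local optima across seeds;
# nstart = 25 keeps the best of 25 starts, so the result stabilizes.
set.seed(1); kmeans(df, 4, nstart = 1)$tot.withinss
set.seed(2); kmeans(df, 4, nstart = 1)$tot.withinss
set.seed(1); kmeans(df, 4, nstart = 25)$tot.withinss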

# Print the results
print (km.res)
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4  0.6950701  1.0394414  0.7226370  1.27693964
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              4              4              1              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              3              3              4              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              2              4              3              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              2              1              2              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              4              2              1              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              4              2              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              1              2              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              4              3              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              2              2              3 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 11.952463 16.212213 19.922437
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

It's possible to compute the mean of each variable by clusters using the original data:

aggregate (USArrests, by = list (cluster=km.res$cluster), mean)
##   cluster   Murder   Assault UrbanPop     Rape
## 1       1 13.93750 243.62500 53.75000 21.41250
## 2       2  3.60000  78.53846 52.07692 12.17692
## 3       3  5.65625 138.87500 73.87500 18.78125
## 4       4 10.81538 257.38462 76.00000 33.19231

If you want to add the point classifications to the original data, use this:

dd <- cbind(USArrests, cluster = km.res$cluster) 
head (dd)
##            Murder Assault UrbanPop Rape cluster
## Alabama      13.2     236       58 21.2       1
## Alaska       10.0     263       48 44.5       4
## Arizona       8.1     294       80 31.0       4
## Arkansas      8.8     190       50 19.5       1
## California    9.0     276       91 40.6       4
## Colorado      7.9     204       78 38.7       4

–#4.3.5 Accessing the results of the kmeans() function

The kmeans() function returns a list of components, including:

⚫ cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated
⚫ centers: A matrix of cluster centers (cluster means)
⚫ totss: The total sum of squares (TSS), i.e., \sum (x_i - \bar{x})^2. TSS measures the total variance in the data.
⚫ withinss: Vector of within-cluster sum of squares, one component per cluster
⚫ tot.withinss: Total within-cluster sum of squares, i.e., sum(withinss)
⚫ betweenss: The between-cluster sum of squares, i.e., totss - tot.withinss
⚫ size: The number of observations in each cluster

These components can be accessed as follows:

# Cluster number for each of the observations
km.res$cluster
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              4              4              1              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              3              3              4              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              2              4              3              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              2              1              2              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              4              2              1              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              4              2              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              1              2              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              4              3              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              2              2              3
head (km.res$cluster, 4)
##  Alabama   Alaska  Arizona Arkansas 
##        1        4        4        1
# Cluster size
km.res$size
## [1]  8 13 16 13
# Cluster means
km.res$centers
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4  0.6950701  1.0394414  0.7226370  1.27693964
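
The variance decomposition reported by print() can be pulled out the same way; tot.withinss is simply the sum of the per-cluster withinss values shown above:

# Within-cluster sums of squares and their total
km.res$withinss
km.res$tot.withinss  # equals sum(km.res$withinss)
# Between-cluster sum of squares; km.res$betweenss / km.res$totss
# is the "between_SS / total_SS = 71.2 %" line in the printout
km.res$betweenss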

–#4.3.6 Visualizing k-means clusters

It's a good idea to plot the cluster results. These can be used to assess the choice of the number of clusters as well as to compare two different cluster analyses.

Now, we want to visualize the data in a scatter plot, coloring each data point according to its cluster assignment.

The problem is that the data contains more than 2 variables, and the question is which variables to choose for the xy scatter plot. A solution is to reduce the number of dimensions by applying a dimensionality reduction algorithm, such as Principal Component Analysis (PCA), which operates on the four variables and outputs two new variables (that represent the original variables) that you can use for the plot.

In other words, if we have a multi-dimensional data set, a solution is to perform Principal Component Analysis (PCA) and to plot data points according to the coordinates of the first two principal components.
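
With base R this amounts to a couple of lines (a sketch using prcomp(); the fviz_cluster() call below automates this and adds ellipses and labels):

# Project the scaled data onto the first two principal components and
# color each observation by its k-means cluster assignment
pc <- prcomp(df)
plot(pc$x[, 1:2], col = km.res$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
# Overlay the cluster centroids, projected into the same PC space
points(predict(pc, newdata = km.res$centers)[, 1:2], pch = 8, cex = 2)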

The function fviz_cluster() [factoextra package] can be used to easily visualize k-means clusters. It takes the k-means results and the original data as arguments. In the resulting plot, observations are represented by points, using principal components if the number of variables is greater than 2. It's also possible to draw a concentration ellipse around each cluster.

fviz_cluster (km.res, data = df, 
              palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"), 
              ellipse.type = "euclid", # Concentration ellipse
              star.plot = TRUE, # Add segments from centroids to items 
              repel = TRUE, # Avoid label overplotting (slow)
              ggtheme = theme_minimal()
)

—#4.4 K-means clustering advantages and disadvantages

K-means clustering is a very simple and fast algorithm. It can efficiently deal with very large data sets. However, there are some weaknesses, including:

  1. It assumes prior knowledge of the data and requires the analyst to choose the appropriate number of clusters (k) in advance.
  2. The final results obtained are sensitive to the initial random selection of cluster centers. Why is this a problem? Because, for every different run of the algorithm on the same data set, you may choose a different set of initial centers. This may lead to different clustering results on different runs of the algorithm.
  3. It's sensitive to outliers.
  4. If you rearrange your data, it's very possible that you'll get a different solution every time you change the ordering of your data.

Possible solutions to these weaknesses include:

  1. Solution to issue 1: Compute k-means for a range of k values, for example by varying k between 2 and 10. Then, choose the best k by comparing the clustering results obtained for the different k values.
  2. Solution to issue 2: Compute the K-means algorithm several times with different initial cluster centers. The run with the lowest total within-cluster sum of squares is selected as the final clustering solution.
  3. To avoid distortions caused by excessive outliers, it's possible to use the PAM algorithm, which is less sensitive to outliers.

—#4.5 Alternative to k-means clustering

A robust alternative to k-means is PAM, which is based on medoids. As discussed in the next chapter, PAM clustering can be computed using the function pam() [cluster package]. The function pamk() [fpc package] is a wrapper around PAM that also prints the suggested number of clusters based on optimum average silhouette width.

—#4.6 Summary

K-means clustering can be used to classify observations into k groups, based on their similarity. Each group is represented by the mean value of the points in the group, known as the cluster centroid. The K-means algorithm requires the users to specify the number of clusters to be generated. The R function kmeans() [stats package] can be used to compute the k-means algorithm. The simplified format is kmeans(x, centers), where "x" is the data and centers is the number of clusters to be produced. After computing k-means clustering, the R function fviz_cluster() [factoextra package] can be used to visualize the results. The format is fviz_cluster(km.res, data), where km.res is the k-means results and data corresponds to the original data sets.

Chapter 5

###K-Medoids

The k-medoids algorithm is a clustering approach related to k-means clustering (Chapter 4) for partitioning a data set into k groups or clusters. In k-medoids clustering, each cluster is represented by one of the data points in the cluster. These points are named cluster medoids.

The term medoid refers to an object within a cluster for which the average dissimilarity between it and all the other members of the cluster is minimal. It corresponds to the most centrally located point in the cluster. These objects (one per cluster) can be considered as a representative example of the members of that cluster, which may be useful in some situations. Recall that, in k-means clustering, the center of a given cluster is calculated as the mean value of all the data points in the cluster.
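
The definition translates directly into R (an illustrative sketch, not part of any package):

# The medoid is the observation with the smallest average dissimilarity to
# all other observations (the zero self-distance doesn't affect the argmin)
D <- as.matrix(dist(scale(USArrests)))
medoid <- which.min(rowMeans(D))
rownames(D)[medoid]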

K-medoids is a robust alternative to k-means clustering. This means that the algorithm is less sensitive to noise and outliers, compared to k-means, because it uses medoids as cluster centers instead of means (used in k-means).

The k-medoids algorithm requires the user to specify k, the number of clusters to be generated (like in k-means clustering). A useful approach to determine the optimal number of clusters is the silhouette method, described in the following sections.

The most common k-medoids clustering method is the PAM algorithm (Partitioning Around Medoids, Kaufman & Rousseeuw, 1990). In this article, we'll describe the PAM algorithm and provide practical examples using R software. In the next chapter, we'll also discuss a variant of PAM named CLARA (Clustering Large Applications), which is used for analyzing large data sets.

—#5.1 PAM concept

The use of means implies that k-means clustering is highly sensitive to outliers. This can severely affect the assignment of observations to clusters. The PAM algorithm provides a more robust alternative.

—#5.2 PAM algorithm

The PAM algorithm is based on the search for k representative objects or medoids among the observations of the data set. After finding a set of k medoids, clusters are constructed by assigning each observation to the nearest medoid.

Next, each selected medoid m and each non-medoid data point are swapped and the objective function is computed. The objective function corresponds to the sum of the dissimilarities of all objects to their nearest medoid.

The SWAP step attempts to improve the quality of the clustering by exchanging selected objects (medoids) and non-selected objects. If the objective function can be reduced by interchanging a selected object with an unselected object, then the swap is carried out.

This is continued until the objective function can no longer be decreased. The goal is to find k representative objects which minimize the sum of the dissimilarities of the observations to their closest representative object.

In summary, the PAM algorithm proceeds in two phases as follows (a small sketch of the objective function follows the list):

  1. Select k objects to become the medoids, or in case these objects were provided use them as the medoids;
  2. Calculate the dissimilarity matrix if it was not provided;
  3. Assign every object to its closest medoid;
  4. For each cluster, search if any of the objects of the cluster decreases the average dissimilarity coefficient; if it does, select the entity that decreases this coefficient the most as the medoid for this cluster;
  5. If at least one medoid has changed go to (3), else end the algorithm.
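
To make the objective function concrete, here is a minimal helper (pam_cost is our own illustrative name, not the cluster package's implementation) that scores a candidate set of medoids:

# Objective: sum of the dissimilarities of all objects to their nearest
# medoid. D is a full dissimilarity matrix; medoid_ids holds k row indices.
pam_cost <- function(D, medoid_ids) {
  sum(apply(D[, medoid_ids, drop = FALSE], 1, min))
}
D <- as.matrix(dist(scale(USArrests)))
pam_cost(D, c(31, 27))  # e.g., New Mexico and Nebraska, the medoids found below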

As mentioned above, the PAM algorithm works with a matrix of dissimilarity, and to compute this matrix the algorithm can use two metrics:

  1. The Euclidean distances, that are the root sum-of-squares of differences;
  2. And the Manhattan distance, that is the sum of absolute distances.

Note that, in practice, you should get similar results most of the time, using either Euclidean or Manhattan distance. If your data contains outliers, Manhattan distance should give more robust results, whereas Euclidean distance would be influenced by unusual values.

Read more about distance measures in Chapter 3.

—#5.3 Computing PAM in R

–#5.3.1 Data

We'll use the demo data set "USArrests", which we start by scaling (Chapter 2) using the R function scale() as follows:

data("USArrests")
#Load the data set
df <- scale (USArrests) # Scale the data
head (df, n = 3)
##             Murder   Assault   UrbanPop         Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska  0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona 0.07163341 1.4788032  0.9989801  1.042878388
#View the firt 3 rows of the data

–#5.3.2 Required R packages and functions

The functions pam() [cluster package] and pamk() [fpc package] can be used to compute PAM. The function pamk() does not require the user to decide the number of clusters K.

In the following examples, we'll describe only the function pam(), whose simplified format is:

# install.packages("cluster")
# library(cluster)
pam(x, k, metric = "euclidean", stand = FALSE)

⚫ x: possible values include:
  - Numeric data matrix or numeric data frame: each row corresponds to an observation, and each column corresponds to a variable.
  - Dissimilarity matrix: in this case x is typically the output of daisy() or dist()
⚫ k: The number of clusters
⚫ metric: the distance metrics to be used. Available options are "euclidean" and "manhattan".
⚫ stand: logical value; if TRUE, the variables (columns) in x are standardized before calculating the dissimilarities. Ignored when x is a dissimilarity matrix.

To create a beautiful graph of the clusters generated with the pam() function, we'll use the factoextra package.

  1. Installing the required packages:

install.packages(c("cluster", "factoextra"))

  2. Loading the packages:

library(cluster)
library (factoextra)

–#5.3.3 Estimating the optimal number of clusters

To estimate the optimal number of clusters, we'll use the average silhouette method. The idea is to compute the PAM algorithm using different values of clusters k. Next, the average clusters silhouette is drawn according to the number of clusters. The average silhouette measures the quality of a clustering. A high average silhouette width indicates a good clustering. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k (Kaufman and Rousseeuw [1990]).

The R function fviz_nbclust() [factoextra package] provides a convenient solution to estimate the optimal number of clusters.

library(cluster)
library (factoextra)
fviz_nbclust (df, pam, method = "silhouette")+ 
  theme_classic()
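
Under the hood, this criterion boils down to a short loop (a sketch; pam() stores the average silhouette width of its solution in the silinfo component):

# Average silhouette width of PAM solutions for k = 2..8;
# the best k is the one that maximizes this value
avg_sil <- sapply(2:8, function(k) pam(df, k)$silinfo$avg.width)
names(avg_sil) <- 2:8
avg_sil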

–#5.3.4 Computing PAM clustering

The R code below computes the PAM algorithm with k = 2:

pam.res <- pam(df, 2)
print (pam.res)
## Medoids:
##            ID     Murder    Assault   UrbanPop       Rape
## New Mexico 31  0.8292944  1.3708088  0.3081225  1.1603196
## Nebraska   27 -0.8008247 -0.8250772 -0.2445636 -0.5052109
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              2              2              1              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              2              1              2              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              2              1              2              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              2              1              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              1              2              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              2              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              1              2              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              2              2              2 
## Objective function:
##    build     swap 
## 1.441358 1.368969 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"

The printed output shows:

- the cluster medoids: a matrix, whose rows are the medoids and columns are variables
- the clustering vector: a vector of integers (from 1:k) indicating the cluster to which each point is allocated

If you want to add the point classifications to the original data, use this:

dd <- cbind(USArrests, cluster = pam.res$cluster)
head (dd, n = 3)
##         Murder Assault UrbanPop Rape cluster
## Alabama   13.2     236       58 21.2       1
## Alaska    10.0     263       48 44.5       1
## Arizona    8.1     294       80 31.0       1

–#5.3.5 Accessing the results of the pam() function

The function pam() returns an object of class pam, whose components include:

⚫ medoids: Objects that represent clusters
⚫ clustering: a vector containing the cluster number of each object

These components can be accessed as follows:

# Cluster medoids: New Mexico, Nebraska
pam.res$medoids
##                Murder    Assault   UrbanPop       Rape
## New Mexico  0.8292944  1.3708088  0.3081225  1.1603196
## Nebraska   -0.8008247 -0.8250772 -0.2445636 -0.5052109
# Cluster numbers
head (pam.res$clustering)
##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##          1          1          1          2          1          1

–#5.3.6 Visualizing PAM clusters

To visualize the partitioning results, we'll use the function fviz_cluster() [factoextra package]. It draws a scatter plot of data points colored by cluster numbers. If the data contains more than 2 variables, the Principal Component Analysis (PCA) algorithm is used to reduce the dimensionality of the data. In this case, the first two principal dimensions are used to plot the data.

fviz_cluster (pam.res,
              palette = c("#00AFBB", "#FC4E07"), # color palette 
              ellipse.type = "t", # Concentration ellipse
              repel= TRUE, # Avoid label overplotting (slow)
              ggtheme = theme_classic()
)

—#5.4 Summary

The K-medoids algorithm, PAM, is a robust alternative to k-means for partitioning a data set into clusters of observations.

In the k-medoids method, each cluster is represented by a selected object within the cluster. The selected objects are named medoids and correspond to the most centrally located points within their clusters.

The PAM algorithm requires the user to know the data and to indicate the appropriate number of clusters to be produced. This can be estimated using the function fviz_nbclust [in the factoextra R package].

The R function pam() [cluster package] can be used to compute the PAM algorithm. The simplified format is pam(x, k), where "x" is the data and k is the number of clusters to be generated.

After performing PAM clustering, the R function fviz_cluster() [factoextra package] can be used to visualize the results. The format is fviz_cluster(pam.res), where pam.res is the PAM results.

Note that, for large data sets, pam() may need too much memory or too much computation time. In this case, the function clara() is preferable. This should not be a problem for modern computers.

Chapter 6

###CLARA - Clustering Large Applications

CLARA (Clustering Large Applications, Kaufman and Rousseeuw (1990)) is an extension to k-medoids methods (Chapter 5) to deal with data containing a large number of objects (more than several thousand observations) in order to reduce the computing time and RAM storage problem. This is achieved using the sampling approach.

—#6.1 CLARA concept

Instead of finding medoids for the entire data set, CLARA considers a small sample of the data with fixed size (sampsize) and applies the PAM algorithm (Chapter 5) to generate an optimal set of medoids for the sample. The quality of the resulting medoids is measured by the average dissimilarity between every object in the entire data set and the medoid of its cluster, defined as the cost function.

CLARA repeats the sampling and clustering processes a pre-specified number of times in order to minimize the sampling bias. The final clustering results correspond to the set of medoids with the minimal cost. The CLARA algorithm is summarized in the next section.

—#6.2 CLARA algorithm

The algorithm is as follows (a compact R sketch of the loop is given after the list):

  1. Split randomly the data sets into multiple subsets with fixed size (sampsize)
  2. Compute the PAM algorithm on each subset and choose the corresponding k representative objects (medoids). Assign each observation of the entire data set to the closest medoid.
  3. Calculate the mean (or the sum) of the dissimilarities of the observations to their closest medoid. This is used as a measure of the goodness of the clustering.
  4. Retain the sub-data set for which the mean (or sum) is minimal. A further analysis is carried out on the final partition. Note that each sub-data set is forced to contain the medoids obtained from the best sub-data set until then. Randomly drawn observations are added to this set until sampsize has been reached.
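
The sample-then-score idea can be sketched in a few lines (illustrative only: clara_sketch is our own name, it scores each sample against a full distance matrix, and it omits the detail that later samples are forced to contain the best medoids found so far; use cluster::clara() in practice):

library(cluster)
# Draw `samples` random subsets, run PAM on each, and keep the medoid set
# with the lowest average dissimilarity over the FULL data set
clara_sketch <- function(x, k, samples = 5, sampsize = 40 + 2 * k) {
  D <- as.matrix(dist(x))  # fine for a demo; real CLARA never builds this
  best <- NULL
  best_cost <- Inf
  for (s in seq_len(samples)) {
    idx <- sample(nrow(x), sampsize)
    med <- idx[pam(x[idx, ], k)$id.med]  # medoids as row indices of x
    cost <- mean(apply(D[, med], 1, min))
    if (cost < best_cost) {
      best_cost <- cost
      best <- med
    }
  }
  list(medoids = best, cost = best_cost,
       clustering = apply(D[, best], 1, which.min))
}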

—#6.3 Computing CLARA in R

–#6.3.1 Data format and preparation

To compute the CLARA algorithm in R, the data should be prepared as indicated in Chapter 2. Here, we'll generate a random data set. To make the result reproducible, we start by using the function set.seed().

set.seed(1234)
# Generate 500 objects, divided into 2 clusters. 
df <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)), 
            cbind(rnorm (300,50,8), rnorm(300,50,8)))
# Specify column and row names
colnames(df) <- c("x", "y")
rownames (df)<- paste0 ("S", 1:nrow(df))
# Previewing the data
head(df, n = 6)
##             x        y
## S1  -9.656526 3.881815
## S2   2.219434 5.574150
## S3   8.675529 1.484111
## S4 -18.765582 5.605868
## S5   3.432998 2.493448
## S6   4.048447 6.083699

–#6.3.2 Required R packages and functions

The function clara() [cluster package] can be used to compute CLARA. The simplified format is as follows:

clara(x, k, metric = "euclidean", stand = FALSE, samples = 5, pamLike = FALSE)

⚫ x: a numeric data matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. Missing values (NAs) are allowed.
⚫ k: the number of clusters.
⚫ metric: the distance metrics to be used. Available options are "euclidean" and "manhattan". Euclidean distances are root sum-of-squares of differences, and Manhattan distances are the sum of absolute differences. Read more about distance measures (Chapter 3). Note that the Manhattan distance is less sensitive to outliers.
⚫ stand: logical value; if TRUE, the variables (columns) in x are standardized before calculating the dissimilarities. Note that it is recommended to standardize variables before clustering.
⚫ samples: number of samples to be drawn from the data set. The default value is 5, but a much larger value is recommended.
⚫ pamLike: logical indicating whether the same algorithm as in the pam() function should be used. This should always be true.

To create a beautiful graph of the clusters generated with the clara() function, we'll use the factoextra package.

  1. Installing required packages:
install.packages(c("cluster", "factoextra"))
## Warning: packages 'cluster', 'factoextra' are in use and will not be installed
  2. Loading the packages:
library(cluster) 
library (factoextra)

–#6.3.3 Estimating the optimal number of clusters

To estimate the optimal number of clusters in your data, it's possible to use the average silhouette method as described in the PAM clustering chapter (Chapter 5). The R function fviz_nbclust() [factoextra package] provides a solution to facilitate this step.

library(cluster) 
library(factoextra)
fviz_nbclust (df, clara, method = "silhouette")+
  theme_classic()

From the plot, the suggested number of clusters is 2. In the next section, we'll classify the observations into 2 clusters.

–#6.3.4 Computing CLARA

The R code below computes the CLARA algorithm with k = 2:

# Compute CLARA
clara.res <- clara(df, 2, samples = 50, pamLike = TRUE)
# Print components of clara.res
print(clara.res)

The output of the function clara() includes the following components:

⚫ medoids: Objects that represent clusters
⚫ clustering: a vector containing the cluster number of each object
⚫ sample: labels or case numbers of the observations in the best sample, that is, the sample used by the clara algorithm for the final partition.

If you want to add the point classifications to the original data, use this:

dd <- cbind(df, cluster = clara.res$cluster)
head (dd, n = 4)
##             x        y cluster
## S1  -9.656526 3.881815       1
## S2   2.219434 5.574150       1
## S3   8.675529 1.484111       1
## S4 -18.765582 5.605868       1
# Medoids
clara.res$medoids
##              x         y
## S121 -1.531137  1.145057
## S455 48.357304 50.233499
# Clustering
head (clara.res$clustering, 10)
##  S1  S2  S3  S4  S5  S6  S7  S8  S9 S10 
##   1   1   1   1   1   1   1   1   1   1

–#6.3.5 Visualizing CLARA clusters

To visualize the partitioning results, we'll use the function fviz_cluster() [factoextra package], which draws a scatter plot of data points colored by cluster numbers.

fviz_cluster(clara.res, 
             palette = c("#00AFBB", "#FC4E07"), # paleta de colores
             ellipse.type = "t", # Elipse de concentración
             geom = "point", pointsize = 1,
             ggtheme = theme_classic()
)

—#6.4 Summary

The CLARA (Clustering Large Applications) algorithm is an extension to the PAM (Partitioning Around Medoids) clustering method for large data sets. It is intended to reduce the computation time in the case of a large data set.

Like almost all partitioning algorithms, it requires the user to specify the appropriate number of clusters to be produced. This can be estimated using the function fviz_nbclust [in the factoextra R package].

The R function clara() [cluster package] can be used to compute the CLARA algorithm. The simplified format is clara(x, k, pamLike = TRUE), where "x" is the data and k is the number of clusters to be generated.

After computing CLARA, the R function fviz_cluster() [factoextra package] can be used to visualize the results. The format is fviz_cluster(clara.res), where clara.res is the CLARA results.

Chapter 7

###Agglomerative Clustering

The agglomerative clustering is the most common type of hierarchical clustering used to group objects in clusters based on their similarity. It's also known as AGNES (Agglomerative Nesting). The algorithm starts by treating each object as a singleton cluster. Next, pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects. The result is a tree-based representation of the objects, named dendrogram.

In this article we start by describing the agglomerative clustering algorithms. Next, we provide R lab sections with many examples for computing and visualizing hierarchical clustering. We continue by explaining how to interpret dendrograms. Finally, we provide R codes for cutting dendrograms into groups.

—#7.1 Algorithm

Agglomerative clustering works in a “bottom-up” manner. That is, each object is initially considered as a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new bigger cluster (nodes). This procedure is iterated until all points are member of just one single big cluster (root) (see figure below).

The inverse of agglomerative clustering is divisive clustering, which is also known as DIANA (Divisive Analysis) and it works in a "top-down" manner. It begins with the root, in which all objects are included in a single cluster. At each step of iteration, the most heterogeneous cluster is divided into two. The process is iterated until all objects are in their own cluster (see figure below).

Note that, agglomerative clustering is good at identifying small clusters. Divisive clustering is good at identifying large clusters. In this article, we’ll focus mainly on agglomerative hierarchical clustering.

—#7.2 Steps to agglomerative hierarchical clustering

We'll follow the steps below to perform agglomerative hierarchical clustering using R software:

  1. Preparing the data
  2. Computing (dis)similarity information between every pair of objects in the data set.
  3. Using a linkage function to group objects into a hierarchical cluster tree, based on the distance information generated at step 2. Objects/clusters that are in close proximity are linked together using the linkage function.
  4. Determining where to cut the hierarchical tree into clusters. This creates a partition of the data.

We'll describe each of these steps in the next section.

–#7.2.1 Data structure and preparation

The data should be a numeric matrix with:

- rows representing observations (individuals);
- and columns representing variables.

Here, we’ll use the R base USArrests data sets.

Note that, it’s generally recommended to standardize variables in the data set before performing subsequent analysis. Standardization makes variables comparable, when they are measured in different scales. For example one variable can measure the height in meter and another variable can measure the weight in kg. The R function scale() can be used for standardization, See ?scale documentation.

# Load the data 
data("USArrests")
# Standardize the data
df <- scale (USArrests)
#Show the first 6 rows 
head(df, n = 6)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

–#7.2.2 Similarity measures

In order to decide which objects/clusters should be combined or divided, we need methods for measuring the similarity between objects.

There are many methods to calculate the (dis)similarity information, including Euclidean and Manhattan distances (Chapter 3). In R software, you can use the function dist() to compute the distance between every pair of objects in a data set. The result of this computation is known as a distance or dissimilarity matrix.

By default, the function dist() computes the Euclidean distance between objects; however, it’s possible to indicate other metrics using the argument method. See ?dist for more information.

For example, consider the R base data set USArrests, you can compute the distance matrix as follow:

# Compute the dissimilarity matrix
#df = the standardized data
res.dist<-dist (df, method = "euclidean")

Note that, the function dist() computes the distance between the rows of a data matrix using the specified distance measure method.

To see easily the distance information between objects, we reformat the results of the function dist() into a matrix using the as.matrix() function. In this matrix, value in the cell formed by the row i, the column j, represents the distance between object i and object j in the original data set. For instance, element 1,1 represents the distance between object 1 and itself (which is zero). Element 1,2 represents the distance between object 1 and object 2, and so on.

The R code below displays the first 6 rows and columns of the distance matrix:

as.matrix(res.dist)[1:6, 1:6]
##             Alabama   Alaska  Arizona Arkansas California Colorado
## Alabama    0.000000 2.703754 2.293520 1.289810   3.263110 2.651067
## Alaska     2.703754 0.000000 2.700643 2.826039   3.012541 2.326519
## Arizona    2.293520 2.700643 0.000000 2.717758   1.310484 1.365031
## Arkansas   1.289810 2.826039 2.717758 0.000000   3.763641 2.831051
## California 3.263110 3.012541 1.310484 3.763641   0.000000 1.287619
## Colorado   2.651067 2.326519 1.365031 2.831051   1.287619 0.000000

–#7.2.3 Linkage

The linkage function takes the distance information, returned by the function dist(), and groups pairs of objects into clusters based on their similarity. Next, these newly formed clusters are linked to each other to create bigger clusters. This process is iterated until all the objects in the original data set are linked together in a hierarchical tree.

For example, given a distance matrix “res.dist” generated by the function dist(), the R base function hclust() can be used to create the hierarchical tree.

hclust() can be used as follow:

res.hc <-hclust (d = res.dist, method = "ward.D2")

⚫ d: a dissimilarity structure as produced by the dist() function.
⚫ method: The agglomeration (linkage) method to be used for computing distance between clusters. Allowed values are "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid".

There are many cluster agglomeration methods (i.e., linkage methods). The most common linkage methods are described below.

• Maximum or complete linkage: The distance between two clusters is defined as the maximum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce more compact clusters.
• Minimum or single linkage: The distance between two clusters is defined as the minimum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce long, 'loose' clusters.
• Mean or average linkage: The distance between two clusters is defined as the average distance between the elements in cluster 1 and the elements in cluster 2.
• Centroid linkage: The distance between two clusters is defined as the distance between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.
• Ward's minimum variance method: It minimizes the total within-cluster variance. At each step the pair of clusters with minimum between-cluster distance are merged.

Note that, at each stage of the clustering process the two clusters, that have the smallest linkage distance, are linked together.

Complete linkage and Ward’s method are generally preferred.

–#7.2.4 Dendrogram

Dendrograms correspond to the graphical representation of the hierarchical tree generated by the function hclust(). A dendrogram can be produced in R using the base function plot(res.hc), where res.hc is the output of hclust(). Here, we'll use the function fviz_dend() [in factoextra R package] to produce a beautiful dendrogram. First install factoextra by typing this: install.packages("factoextra"); next visualize the dendrogram as follows:

#cex: label size
library("factoextra")
fviz_dend (res.hc, cex = 0.5)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

In the dendrogram displayed above, each leaf corresponds to one object. As we move up the tree, objects that are similar to each other are combined into branches, which are themselves fused at a higher height.

The height of the fusion, provided on the vertical axis, indicates the (dis)similarity/distance between two objects/clusters. The higher the height of the fusion, the less similar the objects are. This height is known as the cophenetic distance between the two objects.

Note that, conclusions about the proximity of two objects can be drawn only based on the height where branches containing those two objects first are fused. We cannot use the proximity of two objects along the horizontal axis as a criteria of their similarity.

In order to identify sub-groups, we can cut the dendrogram at a certain height as described in the next sections.

—#7.3 Verify the cluster tree

After linking the objects in a data set into a hierarchical cluster tree, you might want to assess that the distances (i.e., heights) in the tree reflect the original distances accurately.

One way to measure how well the cluster tree generated by the hclust() function reflects your data is to compute the correlation between the cophenetic distances and the original distance data generated by the dist() function. If the clustering is valid, the linking of objects in the cluster tree should have a strong correlation with the distances between objects in the original distance matrix.

The closer the value of the correlation coefficient is to 1, the more accurately the clustering solution reflects your data. Values above 0.75 are felt to be good. The “average” linkage method appears to produce high values of this statistic. This may be one reason that it is so popular.

The R base function cophenetic() can be used to compute the cophenetic distances for hierarchical clustering.

# Compute cophenetic distance 
res.coph<-cophenetic (res.hc)
# Correlation between cophenetic distance and
#the original distance
cor (res.dist, res.coph)
## [1] 0.6975266

Execute the hclust() function again using the average linkage method. Next, call cophenetic() to evaluate the clustering solution.

res.hc2<-hclust (res.dist, method = "average")

cor(res.dist, cophenetic (res.hc2))
## [1] 0.7180382

The correlation coefficient shows that using a different linkage method creates a tree that represents the original distances slightly better.

—#7.4 Cut the dendrogram into different groups

One of the problems with hierarchical clustering is that, it does not tell us how many clusters there are, or where to cut the dendrogram to form clusters.

You can cut the hierarchical tree at a given height in order to partition your data into clusters. The R base function cutree() can be used to cut a tree, generated by the hclust() function, into several groups either by specifying the desired number of groups or the cut height. It returns a vector containing the cluster number of each observation.

#Cut tree into 4 groups
grp <- cutree (res.hc, k = 4)
head (grp, n = 4)
##  Alabama   Alaska  Arizona Arkansas 
##        1        2        2        3
# Number of members in each cluster 
table(grp)
## grp
##  1  2  3  4 
##  7 12 19 12
# Get the names for the members of cluster 1
rownames (df) [grp == 1]
## [1] "Alabama"        "Georgia"        "Louisiana"      "Mississippi"   
## [5] "North Carolina" "South Carolina" "Tennessee"

The result of the cuts can be visualized easily using the function fviz_dend() in factoextra:

# Cut in 4 groups and color by groups 
fviz_dend (res.hc, k = 4, #Cut in four groups 
           cex = 0.5, # label size
           k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),  
           color_labels_by_km = TRUE, # color labels by groups
           rect = TRUE # Add rectangle around groups
)

fviz_cluster(list(data = df, cluster = grp),
             palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"), 
             ellipse.type = "convex", # Concentration ellipse
             repel = TRUE, # Avoid label overplotting (slow)
             show.clust.cent = FALSE, 
             ggtheme = theme_minimal()
)

—#7.6 Application of hierarchical clustering to gene expression data analysis

In gene expression data analysis, clustering is generally used as one of the first steps to explore the data. We are interested in whether there are groups of genes or groups of samples that have similar gene expression patterns.

Several distance measures (Chapter 3) have been described for assessing the similarity or the dissimilarity between items, in order to decide which items have to be grouped together or not. These measures can be used to cluster genes or samples that are similar.

For most common clustering software, the default distance measure is the Euclidean distance. For gene expression data, the most popular approach is to use log2(expression + 0.25) transformed data, correlation distance and complete linkage agglomerative clustering.
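
That pipeline is only a few lines of R (a sketch; the expression matrix expr below is a hypothetical stand-in, with genes in rows and samples in columns):

# Hypothetical expression matrix: 100 genes x 6 samples (stand-in data)
expr <- matrix(2^rnorm(100 * 6, mean = 5), nrow = 100)
x <- log2(expr + 0.25)                # log-transform the expression values
d <- as.dist(1 - cor(t(x)))           # correlation distance between genes
hc <- hclust(d, method = "complete")  # complete linkage clustering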

Single and Complete linkage give the same dendrogram whether you use the raw data, the log of the data or any other transformation of the data that preserves the order because what matters is which ones have the smallest distance. The other methods are sensitive to the measurement scale.

Note that, when the data are scaled to z-scores, the Euclidean distance is equivalent to the correlation distance: the squared Euclidean distance between two standardized profiles is proportional to one minus their Pearson correlation.

Pearson’s correlation is quite sensitive to outliers. When clustering genes, it is important to be aware of the possible impact of outliers. An alternative option is to use Spearman’s correlation instead of Pearson’s correlation.

In principle it is possible to cluster all the genes, although visualizing a huge dendrogram might be problematic. Usually, some type of preliminary analysis, such as differential expression analysis, is used to select genes for clustering.

Selecting genes based on differential expression analysis removes genes which are likely to have only chance patterns. This should enhance the patterns found in the gene clusters.

—#7.7 Summary

Hierarchical clustering is a cluster analysis method which produces a tree-based representation (i.e., dendrogram) of the data. Objects in the dendrogram are linked together based on their similarity.

To perform hierarchical cluster analysis in R, the first step is to calculate the pairwise distance matrix using the function dist(). Next, the result of this computation is used by the hclust() function to produce the hierarchical tree. Finally, you can use the function fviz_dend() [in factoextra R package] to plot easily a beautiful dendrogram. It's also possible to cut the tree at a given height for partitioning the data into multiple groups (R function cutree()).

Chapter 8

###Comparing Dendrograms

After showing how to compute hierarchical clustering (Chapter 7), we describe, here, how to compare two dendrograms using the dendextend R package.

The dendextend package provides several functions for comparing dendrograms. Here, we'll focus on two functions:

» tanglegram() for visual comparison of two dendrograms
» and cor.dendlist() for computing a correlation matrix between dendrograms.

—#8.1 Data preparation

We’ll use the R base USArrests data sets and we start by standardizing the variables using the function scale() as follow:

df <- scale(USArrests)

To make the plots generated in the next sections readable, we'll work with a small random subset of the data set. Therefore, we'll use the function sample() to randomly select 10 observations among the 50 observations contained in the data set:

# Subset containing 10 rows
set.seed(123)
ss <- sample(1:50, 10)
df <- df[ss, ]

—#8.2 Comparing dendrograms

We start by creating a list of two dendrograms by computing hierarchical clustering (HC) using two different linkage methods (“average” and “ward.D2”). Next, we transform the results as dendrograms and create a list to hold the two dendrograms.

library (dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.17.1
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree
# Compute distance matrix
res.dist <- dist(df, method = "euclidean")
#Compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "average")
hc2 <- hclust(res.dist, method = "ward.D2")
#Create two dendrograms
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
#Create a list to hold dendrograms
dend_list <- dendlist(dend1, dend2)

–#8.2.1 Visual comparison of two dendrograms

To visually compare two dendrograms, we’ll use the tanglegram() function [dendextend package], which plots the two dendrograms side by side, with their labels connected by lines.

The quality of the alignment of the two trees can be measured using the function entanglement(). Entanglement is a measure between 1 (full entanglement) and 0 (no entanglement). A lower entanglement coefficient corresponds to a good alignment.
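
For instance, a minimal sketch using the dend_list object created above (untangle() is one dendextend option for searching for a better layout before plotting):

# Entanglement of the current layout
entanglement(dend_list)

# Optionally search for a less entangled layout first
dend_list %>% untangle(method = "step1side") %>% entanglement()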

-Draw a tanglegram:

tanglegram(dend1, dend2)

-Customize the tanglegram using many other options as follows:

tanglegram(dend1, dend2,
           highlight_distinct_edges = FALSE, # Turn off dashed lines
           common_subtrees_color_lines = FALSE, # Turn off line colors
           common_subtrees_color_branches = TRUE, # Color common branches
           main = paste("entanglement =", round(entanglement(dend_list), 2)))

Note that “unique” nodes, with a combination of labels/items not present in the other tree, are highlighted with dashed lines.

–#8.2.2 Correlation matrix between a list of dendrograms

The function cor.dendlist() is used to compute the “Baker” or “Cophenetic” correlation matrix between a list of trees. The value can range between -1 and 1, with values near 0 meaning that the two trees are not statistically similar.

#Cophenetic correlation matrix
cor.dendlist(dend_list, method = "cophenetic")
##           [,1]      [,2]
## [1,] 1.0000000 0.9895569
## [2,] 0.9895569 1.0000000
#Baker correlation matrix
cor.dendlist(dend_list, method = "baker")
##           [,1]      [,2]
## [1,] 1.0000000 0.8951486
## [2,] 0.8951486 1.0000000
cor_cophenetic(dend1, dend2)
## [1] 0.9895569
# Baker's gamma correlation coefficient
cor_bakers_gamma(dend1, dend2)
## [1] 0.8951486

It’s also possible to compare multiple dendrograms simultaneously. The chaining operator %>% is used to run multiple functions sequentially; it’s useful for simplifying the code:

#Create multiple dendrograms by chaining

dend1 <- df %>% dist %>% hclust("complete") %>% as.dendrogram
dend2 <- df %>% dist %>% hclust("single") %>% as.dendrogram
dend3 <- df %>% dist %>% hclust("average") %>% as.dendrogram
dend4 <- df %>% dist %>% hclust("centroid") %>% as.dendrogram

#Compute correlation matrix

dend_list <- dendlist("Complete" = dend1, "Single" = dend2, 
                      "Average" = dend3, "Centroid" = dend4)

cors <- cor.dendlist(dend_list)

#Print correlation matrix

round(cors, 2)
##          Complete Single Average Centroid
## Complete     1.00   0.96    0.97     0.97
## Single       0.96   1.00    0.99     1.00
## Average      0.97   0.99    1.00     1.00
## Centroid     0.97   1.00    1.00     1.00
# Visualize the correlation matrix using the corrplot package
library(corrplot)
## corrplot 0.92 loaded
corrplot(cors, "pie", "lower")

Chapter 9

###Visualizing Dendrograms

As described in previous chapters, a dendrogram is a tree-based representation of data created using hierarchical clustering methods (Chapter 7). In this chapter, we provide R code for visualizing and customizing dendrograms. Additionally, we show how to save and to zoom a large dendrogram.

We start by computing hierarchical clustering using the USArrests data set:

#Load data
data(USArrests)

#Compute distances and hierarchical clustering
dd <- dist(scale(USArrests), method = "euclidean")
hc <- hclust(dd, method = "ward.D2")

To visualize the dendrogram, we’ll use the function fviz_dend() [in the factoextra R package] and functions from the dendextend package.

Before continuing, install the required packages as follows:

install.packages(c("factoextra", "dendextend"))
## Warning: packages 'factoextra', 'dendextend' are in use and will not be
## installed

–#9.1 Visualizing dendrograms

We’ll use the function fviz_dend() [in the factoextra R package] to easily create a beautiful dendrogram using either the R base plot or ggplot2. It also provides options for drawing circular dendrograms and phylogenic-like trees.

To create a basic dendrogram, type this:

library(factoextra)
fviz_dend(hc, cex = 0.5)

You can use the arguments main, sub, xlab, ylab to change plot titles as follows:

fviz_dend(hc, cex = 0.5,
          main = "Dendrogram - ward.D2",
          xlab = "Objects", ylab = "Distance", sub = "")

To draw a horizontal dendrogram, type this:

fviz_dend(hc, cex = 0.5, horiz = TRUE)

It’s also possible to cut the tree at a given height for partitioning the data into multiple groups as described in the previous chapter: Hierarchical clustering (Chapter 7). In this case, it’s possible to color branches by groups and to add a rectangle around each group.

For example:

library(factoextra)

# Assuming hc is an hclust object
fviz_dend(hc, k = 4, # Cut in four groups
          cex = 0.5, # Label size
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # Color labels by groups
          rect = TRUE, # Add rectangle around groups
          rect_border = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          rect_fill = TRUE) # Fill the rectangles


To change the plot theme, use the argument ggtheme, whose allowed values include ggplot2 official themes [theme_gray(), theme_bw(), theme_minimal(), theme_classic(), theme_void()] or any other user-defined ggplot2 theme.

fviz_dend(
  hc,
  k = 4,                   # Cut in four groups
  cex = 0.5,               # Label size
  k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),  # Colors for the groups
  color_labels_by_k = TRUE,  # Color labels by groups
  ggtheme = theme_gray()    # Change theme
)

Allowed values for k_colors include brewer palettes from the RColorBrewer package (e.g. “RdBu”, “Blues”, “Dark2”, “Set2”, …) and scientific journal palettes from the ggsci R package (e.g.: “npg”, “aaas”, “lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”).

In the R code below, we’ll change group colors using “jco” (journal of clinical oncology) color palette:

fviz_dend(hc, cex = 0.5, k = 4, # Cut in four groups
          k_colors = "jco")

If you want to draw a horizontal dendrogram with rectangle around clusters, use this:

fviz_dend(hc, k = 4, cex = 0.4, horiz = TRUE, k_colors = "jco",
          rect = TRUE, rect_border = "jco", rect_fill = TRUE)

Additionally, you can plot a circular dendrogram using the option type = “circular”.

fviz_dend(hc, cex = 0.5, k = 4,
          k_colors = "jco", type = "circular")

To plot a phylogenic-like tree, use type = “phylogenic” and repel = TRUE (to avoid label overplotting). This functionality requires the R package igraph. Make sure that it’s installed before typing the following R code.

require("igraph")
## Loading required package: igraph
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
fviz_dend(hc, k = 4, k_colors = "jco",
          type = "phylogenic", repel = TRUE)

The default layout for phylogenic trees is “layout.auto”. Allowed values are one of: c(“layout.auto”, “layout_with_drl”, “layout_as_tree”, “layout.gem”, “layout.mds”, “layout_with_lgl”). To read more about these layouts, read the documentation of the igraph R package.

Let’s try phylo_layout = “layout.gem”:

require("igraph")

fviz_dend(hc, k = 4, # Cut in four groups
           k_colors = "jco",
           type = "phylogenic", repel = TRUE,
           phylo_layout = "layout.gem")

—#9.2 Case of dendrogram with large data sets

If you compute hierarchical clustering on a large data set, you might want to zoom in on the dendrogram or to plot only a subset of it.

Alternatively, you could also plot the dendrogram to a large page on a PDF, which can be zoomed without loss of resolution.

–#9.2.1 Zooming in the dendrogram

If you want to zoom in on the first clusters, it’s possible to use the options xlim and ylim to limit the plot area. For example, type the code below:

fviz_dend(hc, xlim = c(1, 20), ylim = c(1, 8))

–#9.2.2 Plotting a sub-tree of dendrograms

To plot a sub-tree, we’ll follow the procedure below:

  1. Create the whole dendrogram using fviz_dend() and save the result into an object, named dend_plot for example.

  2. Use the R base function cut.dendrogram() to cut the dendrogram, at a given height (h), into multiple sub-trees. This returns a list with components $upper and $lower: the first is a truncated version of the original tree (also of class dendrogram); the latter is a list of the branches obtained from cutting the tree, each a dendrogram.

  3. Visualize the sub-trees using fviz_dend().

The R code is as follows.

-Cut the dendrogram and visualize the truncated version:

# Create a plot of the whole dendrogram
dend_plot <- fviz_dend(
  hc,                     # Clustering result object
  k = 4,                  # Cut in four groups
  cex = 0.5,              # Label size
  k_colors = "jco"        # Colors for the groups
)

# Extract the dendrogram data
dend_data <- attr(dend_plot, "dendrogram")

# Cut the dendrogram at height h = 10
dend_cuts <- cut(dend_data, h = 10)

# Visualize the truncated version containing two branches
fviz_dend(dend_cuts$upper)
## Warning in min(-diff(our_dend_heights)): no finite arguments to min;
## returning Inf

-Plot dendrogram sub-trees:

#Plot the whole dendrogram
print(dend_plot)

# Visualize subtree 1
fviz_dend(
  dend_cuts$lower[[1]],   # First subtree
  main = "Subtree 1"      # Plot title
)

# Visualize subtree 2
fviz_dend(
  dend_cuts$lower[[2]],   # Second subtree
  main = "Subtree 2"      # Plot title
)

You can also plot circular trees as follows:

# Visualize subtree 2 in circular format
fviz_dend(
  dend_cuts$lower[[2]],   # Second subtree
  type = "circular",      # Circular plot type
  main = "Subtree 2"      # Plot title
)

–#9.2.3 Saving the dendrogram into a large PDF page

If you have a large dendrogram, you can save it to a large PDF page, which can be zoomed without loss of resolution.

# Open a PDF file
pdf("dendrogram.pdf", width = 30, height = 15)

# Create the dendrogram plot
p <- fviz_dend(
  hc,                   # Clustering object
  k = 4,                # Cut in four groups
  cex = 1,              # Label size
  k_colors = "jco"      # Colors for the groups
)

# Print the plot to the PDF
print(p)

# Close the PDF file
dev.off()
## png 
##   2

—#9.3 Manipulating dendrograms using dendextend

The package dendextend provides functions for easily changing the appearance of a dendrogram and for comparing dendrograms.

In this section we’ll use the chaining operator (%>%) to simplify our code. The chaining operator turns x %>% f(y) into f(x, y), so you can use it to rewrite multiple operations such that they can be read from left to right, top to bottom. For instance, the results of the two R code blocks below are equivalent.

» Standard R code for creating a dendrogram:

data <- scale(USArrests)

dist.res <- dist(data)

hc <- hclust(dist.res, method = "ward.D2")
dend <- as.dendrogram(hc)

plot(dend)

» R code for creating a dendrogram using the chaining operator:

library(dendextend)
dend <- USArrests[1:5,] %>% # data
                  scale %>% # Scale the data
                  dist %>% # Calculate a distance matrix
                  hclust(method = "ward.D2") %>% # Hierarchical clustering
                  as.dendrogram # Turn the object into a dendrogram
plot(dend)

-Functions to customize dendrograms: The function set() [in the dendextend package] can be used to change the parameters of a dendrogram. The format is:

set(object, what, value)

» object: a dendrogram object
» what: a character indicating the property of the tree to be set/updated
» value: a vector with the value to set in the tree (the type of the value depends on “what”).

Possible values for the argument what include, among others, “labels_colors”, “labels_cex”, “branches_k_color” and “branches_lwd” (see the dendextend documentation for the full list).

Examples:

library(dendextend)

# 1. Create a customized dendrogram

mycols <- c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07")

dend <- as.dendrogram(hc) %>%
  set("branches_lwd", 1) %>% # Branches line width
  set("branches_k_color", mycols, k = 4) %>% # Color branches by groups
  set("labels_colors", mycols, k = 4) %>% # Color labels by groups
  set("labels_cex", 0.5) # Change label size

# 2. Create plot
fviz_dend(dend)

—#9.4 Summary

We described functions and packages for visualizing and customizing dendrograms including:

» fviz_dend() [in the factoextra R package], which provides convenient solutions for easily plotting a beautiful dendrogram. It can be used to create rectangular and circular dendrograms, as well as phylogenic trees.

» and the dendextend package, which provides flexible methods to customize dendrograms.

Additionally, we described how to plot a subset of large dendrograms.

Chapter 10

###Heatmap: Static and Interactive

A heatmap (or heat map) is another way to visualize hierarchical clustering. It’s also called a false colored image, where data values are transformed to a color scale.

Heat maps allow us to simultaneously visualize clusters of samples and features. First, hierarchical clustering is applied to both the rows and the columns of the data matrix.

The columns/rows of the data matrix are re-ordered according to the hierarchical clustering result, putting similar observations close to each other, so that blocks of “high” and “low” values become adjacent in the data matrix. Finally, a color scheme is applied and the data matrix is displayed. Visualizing the data matrix in this way can help to find the variables that appear to be characteristic for each sample cluster.

Previously, we described how to visualize dendrograms (Chapter 9). Here, we’ll demonstrate how to draw and arrange a heatmap in R.

–#10.1 R Packages/functions for drawing heatmaps

There are multiple R packages and functions for drawing interactive and static heatmaps, including:

» heatmap() [R base function, stats package]: Draws a simple heatmap.

» heatmap.2() [gplots R package]: Draws an enhanced heatmap compared to the R base function.

» pheatmap() [pheatmap R package]: Draws pretty heatmaps and provides more control to change the appearance of heatmaps.

» d3heatmap() [d3heatmap R package]: Draws an interactive/clickable heatmap

» Heatmap() [ComplexHeatmap R/Bioconductor package]: Draws, annotates and arranges complex heatmaps (very useful for genomic data analysis)

Here, we start by describing the 5 R functions for drawing heatmaps. Next, we’ll focus on the ComplexHeatmap package, which provides a flexible solution to arrange and annotate multiple heatmaps. It also allows visualizing associations between data from different sources.

–#10.2 Data preparation

We use mtcars data as a demo data set. We start by standardizing the data to make variables comparable:

df <- scale(mtcars)

–#10.3 R base heatmap: heatmap()

The built-in R heatmap() function [in stats package] can be used.

A simplified format is:

heatmap(x, scale = "row")

» x: a numeric matrix.
» scale: a character indicating if the values should be centered and scaled in either the row direction or the column direction, or none. Allowed values are in c(“row”, “column”, “none”). Default is “row”.

#Default plot
heatmap(df, scale = "none")

In the plot above, high values are in red and low values are in yellow.

It’s possible to specify a color palette using the argument col, which can be defined as follow:

col <- colorRampPalette(c("red", "white", "blue"))(256)

-Or, using an RColorBrewer color palette:

library("RColorBrewer")
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)

Additionally, you can use the arguments RowSideColors and ColSideColors to annotate rows and columns, respectively.

For example, the R code below will customize the heatmap as follows:

  1. An RColorBrewer color palette name is used to change the appearance

  2. The arguments RowSideColors and ColSideColors are used to annotate rows and columns, respectively. The expected values for these options are vectors containing color names specifying the classes for rows/columns.

#Use RColorBrewer color palette names

library("RColorBrewer")
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
heatmap(df, scale = "none", col = col,
        RowSideColors = rep(c("blue", "pink"), each = 16),
        ColSideColors = c(rep("purple", 5), rep("orange", 6)))

—#10.4 Enhanced heat maps: heatmap.2()

The function heatmap.2() [in gplots package] provides many extensions to the standard R heatmap() function presented in the previous section.

#install.packages("gplots")

library("gplots")
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
heatmap.2(df, scale = "none", col = bluered(100),
trace = "none", density.info = "none")

Other arguments can be used including:

» labRow, labCol
» hclustfun: e.g. hclustfun = function(x) hclust(x, method = "ward.D2")

In the R code above, the bluered() function [in gplots package] is used to generate a smoothly varying set of colors. You can also use the following color generator functions:

» colorpanel(n, low, mid, high)
  - n: desired number of color elements to be generated
  - low, mid, high: colors to use for the lowest, middle, and highest values (mid may be omitted)
» redgreen(n), greenred(n), bluered(n) and redblue(n)
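
For example, a small sketch using colorpanel() with the heatmap drawn above (the 75-step blue-white-red panel is an arbitrary choice):

library("gplots")
# Generate 75 colors ranging from blue through white to red
my.col <- colorpanel(75, low = "blue", mid = "white", high = "red")
heatmap.2(df, scale = "none", col = my.col,
          trace = "none", density.info = "none")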

—#10.5 Pretty heat maps: pheatmap()

First, install the pheatmap package: install.packages(“pheatmap”); then type this:

library("pheatmap")
pheatmap(df, cutree_rows = 4)

Arguments are available for changing the default clustering metric (“euclidean”) and method (“complete”). It’s also possible to annotate rows and columns using grouping variables.
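
As an illustrative sketch (the metric, method and annotation variable below are arbitrary choices, not the package defaults):

library("pheatmap")
# Row annotation: row names must match those of df
ann <- data.frame(cyl = factor(mtcars$cyl), row.names = rownames(mtcars))
pheatmap(df, cutree_rows = 4,
         clustering_distance_rows = "manhattan", # instead of "euclidean"
         clustering_method = "ward.D2",          # instead of "complete"
         annotation_row = ann)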

—#10.6 Interactive heat maps: d3heatmap()

First, install the d3heatmap package: install.packages(“d3heatmap”); then type this:

library("d3heatmap")
d3heatmap(scale(mtcars), colors = "RdYlBu",
          k_row = 4, # Number of groups in rows
          k_col = 2) # Number of groups in columns


The d3heatmap() function makes it possible to:

+ Put the mouse on a heatmap cell of interest to view the row and the column names as well as the corresponding value.

+ Select an area for zooming. After zooming, click on the heatmap again to go back to the previous display.

—#10.7 Enhancing heatmaps using dendextend

The package dendextend can be used to enhance functions from other packages. The mtcars data is used in the following sections. We’ll start by defining the order and the appearance for rows and columns using dendextend. These results are then used in functions from other packages.

The order and the appearance for rows and columns can be defined as follows:


library(dendextend)

# order for rows

Rowv <- mtcars %>% scale %>% dist %>% hclust %>% as.dendrogram %>% 
  set("branches_k_color", k = 3) %>% set("branches_lwd", 1.2) %>%
  ladderize

#Order for columns: We must transpose the data

Colv <- mtcars %>% scale %>% t %>% dist %>% hclust %>% as.dendrogram %>%
  set("branches_k_color", k = 2, value = c("orange", "blue")) %>%
  set("branches_lwd", 1.2) %>%
  ladderize

The arguments above can be used in the functions below:

  1. The standard heatmap() function [in stats package]:
heatmap(scale(mtcars), Rowv = Rowv, Colv = Colv,
        scale = "none")

  2. The enhanced heatmap.2() function [in gplots package]:
library(gplots)

heatmap.2(scale(mtcars), scale = "none", col = bluered(100),
          Rowv = Rowv, Colv = Colv,
          trace = "none", density.info = "none")

  3. The interactive heatmap generator d3heatmap() function [in d3heatmap package]:

library("d3heatmap")
d3heatmap(scale(mtcars), colors = "RdBu", Rowv = Rowv, Colv = Colv)

—#10.8 Complex heatmap

ComplexHeatmap is an R/Bioconductor package, developed by Zuguang Gu, which provides a flexible solution to arrange and annotate multiple heatmaps. It also allows visualizing associations between data from different sources.

It can be installed as follows:

# Install BiocManager if it is not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# Use BiocManager to install ComplexHeatmap
BiocManager::install("ComplexHeatmap")

# Legacy installation method:
# source("https://bioconductor.org/biocLite.R")
# biocLite("ComplexHeatmap")



–#10.8.1 Simple heatmap

You can draw a simple heatmap as follows:


# Load the required package
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.18.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##     genomic data. Bioinformatics 2016.
## 
## 
## The new InteractiveComplexHeatmap package can directly export static 
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
## ! pheatmap() has been masked by ComplexHeatmap::pheatmap(). Most of the arguments
##    in the original pheatmap() are identically supported in the new function. You 
##    can still use the original function by explicitly calling pheatmap::pheatmap().
## 
## Attaching package: 'ComplexHeatmap'
## The following object is masked from 'package:pheatmap':
## 
##     pheatmap
library(grid)  # Needed for gpar()

# df is the standardized mtcars data created above:
# df <- scale(mtcars)

# Create the heatmap
Heatmap(
  df,
  name = "mtcars",            # Legend title
  column_title = "Variables", # Column title
  row_title = "Samples",      # Row title
  row_names_gp = gpar(fontsize = 7) # Text size for row names
)

  1. show_row_names, show_column_names: whether to show row and column names, respectively. Default value is TRUE.

  2. show_row_hclust, show_column_hclust: logical value; whether to show row and column clusters. Default is TRUE.

  3. clustering_distance_rows, clustering_distance_columns: metric for clustering: “euclidean”, “maximum”, “manhattan”, “canberra”, “binary”, “minkowski”, “pearson”, “spearman”, “kendall”.

  4. clustering_method_rows, clustering_method_columns: clustering methods: “ward.D”, “ward.D2”, “single”, “complete”, “average”, … (see ?hclust).

To specify custom colors, you must use the colorRamp2() function [circlize package], as follows:

library(circlize)
## ========================================
## circlize version 0.4.16
## CRAN page: https://cran.r-project.org/package=circlize
## Github page: https://github.com/jokergoo/circlize
## Documentation: https://jokergoo.github.io/circlize_book/book/
## 
## If you use it in published research, please cite:
## Gu, Z. circlize implements and enhances circular visualization
##   in R. Bioinformatics 2014.
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(circlize))
## ========================================
## 
## Attaching package: 'circlize'
## The following object is masked from 'package:igraph':
## 
##     degree
mycols <- colorRamp2(breaks = c(-2, 0, 2),
                     colors = c("green", "white", "red"))
Heatmap(df, name = "mtcars", col = mycols)

It’s also possible to use RColorBrewer color palettes:

library("circlize")
library("RColorBrewer")
Heatmap(df, name = "mtcars",
        col = colorRamp2(c(-2, 0, 2), brewer.pal(n=3, name="RdBu")))

We can also customize the appearance of dendrograms using the function color_branches() [dendextend package]:

library(dendextend)
row_dend = hclust(dist(df)) # row clustering
col_dend = hclust(dist(t(df))) # column clustering
Heatmap(df, name = "mtcars",
        row_names_gp = gpar(fontsize = 6.5),
        cluster_rows = color_branches(row_dend, k = 4),
        cluster_columns = color_branches(col_dend, k = 2))

–#10.8.2 Splitting heatmap by rows

You can split the heatmap using either the k-means algorithm or a grouping variable.

It’s important to use the set.seed() function when performing k-means so that the results obtained can be reproduced precisely at a later time.

#Divide into 2 groups using k-means
set.seed(2)
Heatmap(df, name = "mtcars", km = 2)

#Split by a vector specifying row groups
Heatmap(df, name = "mtcars", split = mtcars$cyl,
        row_names_gp = gpar(fontsize = 7))

Note that, split can be also a data frame in which different combinations of levels split the rows of the heatmap.

#Split by combining multiple variables
Heatmap(df, name ="mtcars",
        split = data.frame(cyl = mtcars$cyl, am = mtcars$am))

–#10.8.3 Heatmap annotation

The HeatmapAnnotation class is used to define annotation on row or column. A simplified format is:

HeatmapAnnotation(df, name, col, show_legend)

For the example below, we’ll transpose our data to have the observations in columns and the variables in rows.

df <- t(df)

–#10.8.3.1 Simple annotation

A vector, containing discrete or continuous values, is used to annotate rows or columns. We’ll use the qualitative variables cyl (levels = “4”, “6” and “8”) and am (levels = “0” and “1”), and the continuous variable mpg to annotate columns.

For each of these 3 variables, custom colors are defined as follows:

# Annotation data frame
annot_df <- data.frame(cyl = mtcars$cyl, am = mtcars$am, mpg = mtcars$mpg)

# Define colors for each level of the qualitative variables,
# and a gradient color for the continuous variable (mpg)
col = list(cyl = c("4" = "green", "6" = "gray", "8" = "darkred"),
           am = c("0" = "yellow", "1" = "orange"),
           mpg = circlize::colorRamp2(c(17, 25), c("lightblue", "purple")))

# Create the heatmap annotation
ha <- HeatmapAnnotation(annot_df, col = col)

# Combine the heatmap and the annotation
Heatmap(df, name = "mtcars", top_annotation = ha)


It's possible to hide the annotation legend using the argument show_legend = FALSE as follows:

ha <- HeatmapAnnotation(annot_df, col = col, show_legend = FALSE)
Heatmap(df, name = "mtcars", top_annotation = ha)

–#10.8.3.2 Complex annotation

In this section we’ll see how to combine a heatmap with some basic graphs to show the data distribution. For simple annotation graphics, the following functions can be used: anno_points(), anno_barplot(), anno_boxplot(), anno_density() and anno_histogram().

An example is shown below:

# Load the required packages
library(ComplexHeatmap)
library(circlize)

# df is assumed to be the (transposed) matrix defined above

# Define some graphics to show the distribution of columns
.hist <- anno_histogram(df, gp = gpar(fill = "lightblue"))
.density <- anno_density(df, type = "line", gp = gpar(col = "blue"))
ha_mix_top <- HeatmapAnnotation(hist = .hist, density = .density)

# Define some graphics to show the distribution of rows
.violin <- anno_density(df, type = "violin",
                        gp = gpar(fill = "lightblue"), which = "row")
.boxplot <- anno_boxplot(df, which = "row")
ha_mix_right <- HeatmapAnnotation(violin = .violin, boxplot = .boxplot,
                                  which = "row", width = unit(4, "cm"))

# Combine the annotation with the heatmap
Heatmap(df, name = "mtcars",                     # Legend title
        column_names_gp = gpar(fontsize = 8),    # Column name text size
        top_annotation = ha_mix_top,             # Top annotation
        top_annotation_height = unit(3.8, "cm")  # Top annotation height
) + ha_mix_right                                 # Add the right annotation


–#10.8.3.3 Combining multiple heatmaps

Multiple heatmaps can be arranged as follows:


# Heatmap 1
ht1 = Heatmap(df, name = "ht1", km = 2,
              column_names_gp = gpar(fontsize = 9))

#Heatmap 2
ht2 = Heatmap(df, name = "ht2",
              col = circlize::colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
              column_names_gp = gpar(fontsize = 9))
#Combine the two heatmaps
ht1 + ht2

You can use the option width = unit(3, "cm") to control the size of the heatmaps.

Note that when combining multiple heatmaps, the first heatmap is considered as the main heatmap. Some settings of the remaining heatmaps are auto-adjusted according to the setting of the main heatmap. These include: removing row clusters and titles, and adding splitting.

The draw() function can be used to customize the appearance of the final image:

draw(ht1 + ht2,
     row_title = "Two heatmaps, row title",
     row_title_gp = gpar(col = "red"),
     column_title = "Two heatmaps, column title",
     column_title_side = "bottom",
     gap = unit(0.5, "cm"))

Legends can be removed using the arguments show_heatmap_legend = FALSE and show_annotation_legend = FALSE.
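
For example, a quick sketch using the two heatmaps created above:

# Draw the heatmap list without any legends
draw(ht1 + ht2,
     show_heatmap_legend = FALSE,
     show_annotation_legend = FALSE)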

—#10.9 Application to gene expression matrix

In gene expression data, rows are genes and columns are samples. More information about genes can be attached after the expression heatmap such as gene length and type of genes.

# Load the required packages
library(ComplexHeatmap)
library(circlize)

# Read the data
expr <- readRDS(paste0(system.file(package = "ComplexHeatmap"), "/extdata/gene_expression.rds"))

# Convert to matrix
mat <- as.matrix(expr[, grep("cell", colnames(expr))])

# Extract the cell type
type <- gsub("s\\d+_", "", colnames(mat))

# Create the annotation
ha = HeatmapAnnotation(df = data.frame(type = type))

# Create and combine the heatmaps
ht_list <- Heatmap(mat, name = "expression", km = 5, top_annotation = ha, 
                   show_row_names = FALSE, show_column_names = FALSE) +
           Heatmap(expr$length, name = "length", width = unit(5, "mm"), 
                   col = circlize::colorRamp2(c(0, 100000), c("white", "orange"))) +
           Heatmap(expr$type, name = "type", width = unit(5, "mm")) +
           Heatmap(expr$chr, name = "chr", width = unit(5, "mm"), 
                   col = circlize::rand_color(length(unique(expr$chr))))
## There are 23 unique colors in the vector `col` and 23 unique values in
## `matrix`. `Heatmap()` will treat it as an exact discrete one-to-one
## mapping. If this is not what you want, slightly change the number of
## colors, e.g. by adding one more color or removing a color.
# Draw the heatmap list
draw(ht_list)


It's also possible to visualize genomic alterations and to integrate different molecular levels (gene expression, DNA methylation, …). Read the vignette on Bioconductor for further examples.

—#10.10 Summary

We described many functions for drawing heatmaps in R (from basic to complex heatmaps). A basic heatmap can be produced using either the R base function heatmap() or the function heatmap.2() [in the gplots package].

The pheatmap() function, in the package of the same name, creates pretty heatmaps, where one has better control over some graphical parameters such as cell size.

The Heatmap() function [in the ComplexHeatmap package] allows us to easily draw, annotate and arrange complex heatmaps. This might be very useful in genomic fields.









Chapter 3

###Principal Component Analysis

—#3.1 Introduction

Principal component analysis (PCA) allows us to summarize and to visualize the information in a data set containing individuals/observations described by multiple inter-correlated quantitative variables. Each variable could be considered as a different dimension. If you have more than 3 variables in your data set, it could be very difficult to visualize a multi-dimensional hyperspace.

Principal component analysis is used to extract the important information from a multivariate data table and to express this information as a set of a few new variables called principal components. These new variables correspond to a linear combination of the originals. The number of principal components is less than or equal to the number of original variables.

The information in a given data set corresponds to the total variation it contains. The goal of PCA is to identify directions (or principal components) along which the variation in the data is maximal.

In other words, PCA reduces the dimensionality of multivariate data to two or three principal components, that can be visualized graphically, with minimal loss of information.

In this chapter, we describe the basic idea of PCA and demonstrate how to compute and visualize PCA using R software. Additionally, we'll show how to reveal the most important variables that explain the variations in a data set.

—#3.2 Basics

Understanding the details of PCA requires knowledge of linear algebra. Here, we'll explain only the basics with simple graphical representations of the data.

In Plot 1A below, the data are represented in the X-Y coordinate system. The dimension reduction is achieved by identifying the principal directions, called principal components, in which the data varies.



PCA assumes that the directions with the largest variances are the most “important” (i.e., the most principal).

In the figure below, the PC1 axis is the first principal direction along which the samples show the largest variation. The PC2 axis is the second most important direction, and it is orthogonal to the PC1 axis.

The dimensionality of our two-dimensional data can be reduced to a single dimension by projecting each sample onto the first principal component (Plot 1B).



Technically speaking, the amount of variance retained by each principal component is measured by the so-called eigenvalue.

Note that, the PCA method is particularly useful when the variables within the data set are highly correlated. Correlation indicates that there is redundancy in the data. Due to this redundancy, PCA can be used to reduce the original variables into a smaller number of new variables (= principal components) explaining most of the variance in the original variables.


Taken together, the main purposes of principal component analysis are to:

» identify hidden patterns in a data set,
» reduce the dimensionality of the data by removing noise and redundancy,
» identify correlated variables.


—#3.3 Computation

–#3.3.1 R packages

Several functions from different packages are available in the R software for computing PCA:

-prcomp() and princomp() [built-in R stats package],
-PCA() [FactoMineR package],
-dudi.pca() [ade4 package],
-and epPCA() [ExPosition package]

No matter what function you decide to use, you can easily extract and visualize the results of PCA using R functions provided in the factoextra R package.

Here, we'll use the two packages FactoMineR (for the analysis) and factoextra (for ggplot2-based visualization).

Install the two packages as follows:

install.packages(c("FactoMineR", "factoextra"))

Load them in R, by typing this:

library("FactoMineR")
library("factoextra")

–#3.3.2 Data format

We’ll use the demo data set decathlon2 from the factoextra package:

data(decathlon2)
# head(decathlon2)

As illustrated in Figure 3.1, the data used here describe athletes’ performance during two sporting events (Decastar and OlympicG). It contains 27 individuals (athletes) described by 13 variables.

Note that, only some of these individuals and variables will be used to perform the principal component analysis. The coordinates of the remaining individuals and variables on the factor map will be predicted after the PCA.

In PCA terminology, our data contains:

» Active individuals (in light blue, rows 1:23) : Individuals that are used during the principal component analysis.

–#3.3.3 Data standardization

In principal component analysis, variables are often scaled (i.e., standardized). This is particularly recommended when variables are measured in different scales (e.g., kilograms, kilometers, centimeters, …); otherwise, the PCA outputs obtained will be severely affected.

The goal is to make the variables comparable. Generally, variables are scaled to have i) standard deviation one and ii) mean zero.

The standardization of data is an approach widely used in the context of gene expression data analysis before PCA and clustering analysis. We might also want to scale the data when the mean and/or the standard deviation of variables are largely different.

When scaling variables, the data can be transformed as follows:

(x_i - mean(x)) / sd(x)

Where mean(x) is the mean of x values, and sd(x) is the standard deviation (SD).

The R base function scale() can be used to standardize the data. It takes a numeric matrix as an input and performs the scaling on the columns.
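
As a quick sketch of this equivalence (using the X100m column of decathlon2 as an arbitrary example):

# scale() applies the same (x - mean(x)) / sd(x) transformation
x <- decathlon2[1:23, "X100m"]
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x)) # TRUE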

Note that, by default, the function PCA() [in FactoMineR] standardizes the data automatically during the PCA, so you don’t need to do this transformation before the PCA.

–#3.3.4 R code

The function PCA() [FactoMineR package] can be used. A simplified format is :

PCA(X, scale.unit = TRUE, ncp = 5, graph = TRUE)
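
The main arguments (per the FactoMineR documentation) are:

» X: a data frame with rows = individuals and columns = numeric variables;
» scale.unit: a logical value; if TRUE, the data are scaled to unit variance before the analysis;
» ncp: the number of dimensions kept in the final results (default: 5);
» graph: a logical value; if TRUE, a graph is displayed.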

The R code below computes principal component analysis on the active individuals/variables:

# Load the FactoMineR library
library(FactoMineR)

# Keep only the active individuals (rows 1:23) and
# the active variables (columns 1:10)
decathlon2.active <- decathlon2[1:23, 1:10]

# Perform PCA without generating graphs
res.pca <- PCA(decathlon2.active, graph = FALSE)

The output of the function PCA() is a list, including the following components:

print(res.pca)

The object created by the function PCA() contains much information stored in different lists and matrices. These values are described in the next section.

—#3.4 Visualization and Interpretation

We’ll use the factoextra R package to help in the interpretation of PCA. No matter what function you decide to use [stats::prcomp(), FactoMineR::PCA(), ade4::dudi.pca(), ExPosition::epPCA()], you can easily extract and visualize the results of PCA using R functions provided in the factoextra R package.

These functions include:

» get_eigenvalue(res.pca): Extract the eigenvalues/variances of principal components.

» fviz_pca_ind(res.pca), fviz_pca_var(res.pca): Visualize the results for individuals and variables, respectively.

» fviz_pca_biplot(res.pca): Make a biplot of individuals and variables.

In the next sections, we’ll illustrate each of these functions.

–#3.4.1 Eigenvalues / Variances

As described in previous sections, the eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs correspond to the directions with the maximum amount of variation in the data set.

We examine the eigenvalues to determine the number of principal components to be considered. The eigenvalues and the proportion of variances (i.e., information) retained by the principal components (PCs) can be extracted using the function get_eigenvalue() [factoextra package].

# Install the factoextra package if it is not already installed
if (!requireNamespace("factoextra", quietly = TRUE)) {
  install.packages("factoextra")
}

# Load the package
library(factoextra)

# Extract the eigenvalues/variances from the PCA computed above
eig.val <- get_eigenvalue(res.pca)
eig.val

The sum of all the eigenvalues gives a total variance of 10.

The proportion of variation explained by each eigenvalue is given in the second column. For example, 4.124 divided by 10 equals 0.4124, or, about 41.24% of the variation is explained by this first eigenvalue. The cumulative percentage explained is obtained by adding the successive proportions of variation explained to obtain the running total. For instance, 41.242% plus 18.385% equals 59.627%, and so forth. Therefore, about 59.627% of the variation is explained by the first two eigenvalues together.
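
In R, this running total is just a cumulative sum, e.g. for the two proportions quoted above:

cumsum(c(41.242, 18.385))
## [1] 41.242 59.627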

Eigenvalues can be used to determine the number of principal components to retain after PCA (Kaiser, 1961):

» An eigenvalue > 1 indicates that the PC accounts for more variance than one of the original variables in standardized data. This is commonly used as a cutoff point for which PCs are retained. This holds true only when the data are standardized.

» You can also limit the number of components to the number that accounts for a certain fraction of the total variance. For example, if you are satisfied with 70% of the total variance explained, then use the number of components needed to achieve that.

Unfortunately, there is no well-accepted objective way to decide how many principal components are enough. This will depend on the specific field of application and the specific data set. In practice, we tend to look at the first few principal components in order to find interesting patterns in the data.

In our analysis, the first three principal components explain 72% of the variation. This is an acceptably large percentage.

An alternative method to determine the number of principal components is to look at a Scree Plot, which is the plot of eigenvalues ordered from largest to smallest. The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size (Jollife, 2002; Peres-Neto et al., 2005).

The scree plot can be produced using the function fviz_eig() or fviz_screeplot() [factoextra package].

fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))

From the plot above, we might want to stop at the fifth principal component. 87% of the information (variances) contained in the data is retained by the first five principal components.

–#3.4.2 Graph of variables

-#3.4.2.1 Results

A simple method to extract the results for variables from a PCA output is to use the function get_pca_var() [factoextra package]. This function provides a list of matrices containing all the results for the active variables (coordinates, correlation between variables and axes, squared cosine and contributions).

var <- get_pca_var(res.pca)
var



The components of get_pca_var() can be used in the plot of variables as follows:

» var$coord: coordinates of variables to create a scatter plot.

» var$cos2: represents the quality of representation for variables on the factor map. It's calculated as the squared coordinates: var.cos2 = var.coord * var.coord.

» var$contrib: contains the contributions (in percentage) of the variables to the principal components. The contribution of a variable (var) to a given principal component is (in percentage): (var.cos2 * 100) / (total cos2 of the component).

Note that it's possible to plot variables and to color them according to either i) their quality on the factor map (cos2) or ii) their contribution values to the principal components (contrib).

The different components can be accessed as follows:

# Coordinates
head(var$coord)

# Cos2: quality on the factor map
head(var$cos2)

# Contributions to the principal components
head(var$contrib)
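
Since the contributions are derived from the cos2 values, they can also be recomputed by hand as a quick sanity check; a minimal sketch, assuming var <- get_pca_var(res.pca) as above:

# (var.cos2 * 100) / (total cos2 of the component), column by column
contrib.check <- sweep(var$cos2, 2, colSums(var$cos2), "/") * 100
head(round(contrib.check, 2)) # should match head(round(var$contrib, 2))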

In this section, we describe how to visualize variables and draw conclusions about their correlations. Next, we highlight variables according to either i) their quality of representation on the factor map or ii) their contributions to the principal components.

-#3.4.2.2 Correlation circle

The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC. The representation of variables differs from the plot of the observations: the observations are represented by their projections, but the variables are represented by their correlations (Abdi and Williams, 2010).

#Coordinates of variables
head(var$coord, 4)

To plot variables, type this:

fviz_pca_var(res.pca, col.var = "black")

The plot above is also known as a variable correlation plot. It shows the relationships between all variables. It can be interpreted as follows:

» Positively correlated variables are grouped together.

» Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).

» The distance between variables and the origin measures the quality of the variables on the factor map. Variables that are away from the origin are well represented on the factor map.

-#3.4.2.3 Quality of representation

The quality of representation of the variables on the factor map is called cos2 (square cosine, squared coordinates). You can access the cos2 values as follows:

head(var$cos2, 4)

You can visualize the cos2 of variables on all the dimensions using the corrplot package:

library("corrplot")
corrplot(var$cos2, is.corr = FALSE)

It's also possible to create a bar plot of variables' cos2 using the function fviz_cos2() [in factoextra]:

#Total cos2 of variables on Dim.1 and Dim.2
fviz_cos2(res.pca, choice = "var", axes = 1:2)

Note that,

» A low cos2 indicates that the variable is not perfectly represented by the PCs. In this case the variable is close to the center of the circle.

» For a given variable, the sum of the cos2 on all the principal components is equal to one.

If a variable is perfectly represented by only two principal components (Dim.1 & Dim.2), the sum of the cos2 on these two PCs is equal to one. In this case the variables will be positioned on the circle of correlations.

For some of the variables, more than 2 components might be required to perfectly represent the data. In this case the variables are positioned inside the circle of correlations.
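
A quick sketch of this property: PCA() keeps ncp = 5 dimensions by default, so to check that the cos2 of each variable sums to one you need to keep all the components first:

# Re-run the PCA keeping all 10 dimensions
res.pca.full <- PCA(decathlon2.active, ncp = 10, graph = FALSE)
head(rowSums(get_pca_var(res.pca.full)$cos2)) # each value is ~1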

In summary:

» The cos2 values are used to estimate the quality of the representation.

It's possible to color variables by their cos2 values using the argument col.var = “cos2”. This produces gradient colors. In this case, the argument gradient.cols can be used to provide custom colors. For instance, gradient.cols = c(“white”, “blue”, “red”) means that:

» variables with low cos2 values will be colored in “white”
» variables with mid cos2 values will be colored in “blue”
» variables with high cos2 values will be colored in “red”

#Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
)

Note that, it’s also possible to change the transparency of the variables according to their cos2 values using the option alpha.var = “cos2”. For example, type this:

#Change the transparency by cos2 values
fviz_pca_var(res.pca, alpha.var = "cos2")

-#3.4.2.4 Contributions of variables to PCs

The contributions of variables in accounting for the variability in a given principal component are expressed in percentage.

» Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set.

The contribution of variables can be extracted as follows:

head(var$contrib, 4)

The larger the value of the contribution, the more the variable contributes to the component.

It’s possible to use the function corrplot() [corrplot package] to highlight the most contributing variables for each dimension:

library("corrplot")
corrplot(var$contrib, is.corr = FALSE)

The function fviz_contrib() [factoextra package] can be used to draw a bar plot of variable contributions. If your data contains many variables, you can decide to show only the top contributing variables. The R code below shows the top 10 variables contributing to the principal components:

# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)

# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

The total contribution to PC1 and PC2 is obtained with the following R code:

fviz_contrib(res.pca, choice = "var", axes = 1:2, top = 10)

The red dashed line on the graph above indicates the expected average contribution. If the contribution of the variables were uniform, the expected value would be 1/length(variables) = 1/10 = 10%. For a given component, a variable with a contribution larger than this cutoff could be considered as important in contributing to the component.

Note that, the total contribution of a given variable, on explaining the variations retained by two principal components, say PC1 and PC2, is calculated as contrib = [(C1 * Eig1) + (C2 * Eig2)] / (Eig1 + Eig2), where C1 and C2 are the contributions of the variable on PC1 and PC2, respectively, and Eig1 and Eig2 are the eigenvalues of PC1 and PC2.

In this case, the expected average contribution (cutoff) is calculated as follows: as mentioned above, if the contributions of the 10 variables were uniform, the expected average contribution on a given PC would be 1/10 = 10%. The expected average contribution of a variable for PC1 and PC2 is: [(10 * Eig1) + (10 * Eig2)] / (Eig1 + Eig2) = 10%.
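
This formula can be checked by hand; a minimal sketch, assuming var and res.pca from above:

# Total contribution of each variable on PC1 and PC2:
# contrib = [(C1 * Eig1) + (C2 * Eig2)] / (Eig1 + Eig2)
eig <- get_eigenvalue(res.pca)$eigenvalue
total.contrib <- (var$contrib[, 1] * eig[1] + var$contrib[, 2] * eig[2]) /
  (eig[1] + eig[2])
# Same ranking as fviz_contrib(res.pca, choice = "var", axes = 1:2)
sort(total.contrib, decreasing = TRUE)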

It can be seen that the variables - X100m, Long.jump and Pole.vault - contribute the most to the dimensions 1 and 2.

The most important (or contributing) variables can be highlighted on the correlation plot as follows:

fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
             )

Note that, it’s also possible to change the transparency of variables according to their contrib values using the option alpha.var = “contrib”. For example, type this:

#Change the transparency by contrib values
fviz_pca_var(res.pca, alpha.var = "contrib")

-#3.4.2.5 Color by a custom continuous variable

In the previous sections, we showed how to color variables by their contributions and their cos2. Note that, it’s possible to color variables by any custom continuous variable. The coloring variable should have the same length as the number of active variables in the PCA (here n = 10).

For example, type this:

#Create a random continuous variable of length 10
set.seed(123)
my.cont.var <- rnorm(10)

#Color variables by the continuous variable
fviz_pca_var(res.pca, col.var = my.cont.var,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Cont.Var")


-#3.4.2.6 Color by groups

It's also possible to change the color of variables by groups defined by a qualitative/categorical variable, also called a factor in R terminology.

As we don't have any grouping variable in our data set for classifying variables, we'll create it.

In the following demo example, we start by classifying the variables into 3 groups using the k-means clustering algorithm. Next, we use the clusters returned by the k-means algorithm to color the variables.

Note that, if you are interested in learning clustering, we previously published a book named “Practical Guide To Cluster Analysis in R” (https://goo.gl/DmJ5y5).

# Load the required packages
library(factoextra)
library(cluster)

# Set the seed for reproducibility
set.seed(123)

# Run k-means to create 3 groups of variables
res.km <- kmeans(var$coord, centers = 3, nstart = 25)

# Convert the clusters into a factor
grp <- as.factor(res.km$cluster)

# Visualize the PCA variables colored by group
fviz_pca_var(res.pca, col.var = grp,
             palette = c("#0073C2FF", "#EFC000FF", "#686868FF"),
             legend.title = "Cluster")

Note that, to change the color of groups the argument palette should be used. To change gradient colors, the argument gradient.cols should be used.

–#3.4.3 Dimension description

In section 3.4.2.4, we described how to highlight variables according to their contributions to the principal components.

Note also that, the function dimdesc() [in FactoMineR], for dimension description, can be used to identify the most significantly associated variables with a given principal component. It can be used as follows:

res.desc <- dimdesc(res.pca, axes = c(1,2), proba = 0.05)
#Description of dimension 1
res.desc$Dim.1

#Description of dimension 2
res.desc$Dim.2

In the output above, $quanti means results for quantitative variables. Note that, variables are sorted by the p-value of the correlation.

–#3.4.4 Graph of individuals

-#3.4.4.1 Results

The results for individuals can be extracted using the function get_pca_ind() [factoextra package]. Similarly to get_pca_var(), the function get_pca_ind() provides a list of matrices containing all the results for the individuals (coordinates, correlation between variables and axes, squared cosine and contributions).

ind <- get_pca_ind(res.pca)
ind
## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description                       
## 1 "$coord"   "Coordinates for the individuals" 
## 2 "$cos2"    "Cos2 for the individuals"        
## 3 "$contrib" "contributions of the individuals"

To get access to the different components, use this:

# Coordinates of individuals
head(ind$coord)
##       Dim.1      Dim.2       Dim.3        Dim.4
## 1 -2.257141 -0.4784238  0.12727962  0.024087508
## 2 -2.074013  0.6718827  0.23382552  0.102662845
## 3 -2.356335  0.3407664 -0.04405390  0.028282305
## 4 -2.291707  0.5953999 -0.09098530 -0.065735340
## 5 -2.381863 -0.6446757 -0.01568565 -0.035802870
## 6 -2.068701 -1.4842053 -0.02687825  0.006586116
# Quality of individuals
head(ind$cos2)
##       Dim.1      Dim.2        Dim.3        Dim.4
## 1 0.9539975 0.04286032 0.0030335249 1.086460e-04
## 2 0.8927725 0.09369248 0.0113475382 2.187482e-03
## 3 0.9790410 0.02047578 0.0003422122 1.410446e-04
## 4 0.9346682 0.06308947 0.0014732682 7.690193e-04
## 5 0.9315095 0.06823959 0.0000403979 2.104697e-04
## 6 0.6600989 0.33978301 0.0001114335 6.690714e-06
#Contributions of individuals
head(ind$contrib)
##       Dim.1      Dim.2       Dim.3       Dim.4
## 1 1.1637691 0.16694510 0.073591567 0.018672867
## 2 0.9825900 0.32925696 0.248367113 0.339198420
## 3 1.2683043 0.08469576 0.008816151 0.025742863
## 4 1.1996857 0.25856249 0.037605617 0.139067312
## 5 1.2959338 0.30313118 0.001117674 0.041253702
## 6 0.9775628 1.60670454 0.003281801 0.001396002

-#3.4.4.2 Plots: quality and contribution

The fviz_pca_ind() is used to produce the graph of individuals. To create a simple plot, type this:

fviz_pca_ind(res.pca)

Like variables, it’s also possible to color individuals by their cos2 values:

fviz_pca_ind(res.pca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
)

Note that, individuals that are similar are grouped together on the plot.

You can also change the point size according to the cos2 of the corresponding individuals:

fviz_pca_ind(res.pca, pointsize = "cos2",
             pointshape = 21, fill = "#E7B800",
             repel = TRUE # Avoid text overlapping (slow if many points)
)

To change both point size and color by cos2, try this:

fviz_pca_ind(res.pca,
             col.ind = "cos2",   # Color by cos2
             pointsize = "cos2", # Point size by cos2
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)       # Avoid text overlapping

To create a bar plot of the quality of representation (cos2) of individuals on the factor map, you can use the function fviz_cos2() as previously described for variables:

fviz_cos2(res.pca, choice = "ind")

To visualize the contribution of individuals to the first two principal components, type this:

#Total contribution on PC1 and PC2
fviz_contrib(res.pca, choice = "ind", axes = 1:2)

-#3.4.4.3 Color by a custom continuous variable

As for variables, individuals can be colored by any custom continuous variable by specifying the argument col.ind.

For example, type this:

# Create a random continuous variable of length 23,
# same length as the number of active individuals in the PCA
set.seed(123)
my.cont.var <- rnorm(23)

# Color individuals by the continuous variable
fviz_pca_ind(res.pca, col.ind = my.cont.var,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Cont.Var")


-#3.4.4.4 Color by groups

Here, we describe how to color individuals by group. Additionally, we show how to add concentration ellipses and confidence ellipses by groups. For this, we'll use the iris data as demo data set.

The iris data set looks like this:

``` r
head(iris, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

The column “Species” will be used as grouping variable. We start by computing principal component analysis as follow:

# The variable Species (index = 5) is removed
# before PCA analysis
iris.pca <- PCA(iris[,-5], graph = FALSE)

In the R code below: the argument habillage or col.ind can be used to specify the factor variable for coloring the individuals by groups.

To add a concentration ellipse around each group, specify the argument addEllipses = TRUE. The argument palette can be used to change group colors.

fviz_pca_ind(iris.pca,
             geom.ind = "point", # show points only (nbut not "tewt")
             col.ind = iris$Species, #color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             legend.title = "Groups"
             )

To remove the group mean point, specify the argument mean.point = FALSE.
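
For example, a minimal sketch (reusing the iris.pca result computed above) that hides the group barycenters:

``` r
# Color individuals by groups, but hide the group mean points
fviz_pca_ind(iris.pca, geom.ind = "point",
             col.ind = iris$Species,
             addEllipses = TRUE,
             mean.point = FALSE)
```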

If you want confidence ellipses instead of concentration ellipses, use ellipse.type = “confidence”.

# Add confidence ellipses
fviz_pca_ind(iris.pca, geom.ind = "point",
             col.ind = iris$Species,
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "confidence",
             legend.title = "Groups")


Note that, allowed values for palette include:

* "grey" for grey color palettes;
* brewer palettes e.g. "RdBu", "Blues", …; To view all, type this in R: RColorBrewer::display.brewer.all().
* custom color palette e.g. c("blue", "red");
* and scientific journal palettes from the ggsci R package, e.g.: "npg", "aaas", "lancet", "jco", "ucscgb", "uchicago", "simpsons" and "rickandmorty".
For example, to use the jco (Journal of Clinical Oncology) color palette, type this:

fviz_pca_ind(iris.pca,
             label = "none", # hide individual labels
             habillage = iris$Species, # color by groups
             addEllipses = TRUE, # Concentration ellipses
             palette = "jco"
             )

–#3.4.5 Graph customization

Note that, fviz_pca_ind(), fviz_pca_var() and related functions are wrappers around the core function fviz() [in factoextra]. fviz() is a wrapper around the function ggscatter() [in ggpubr]. Therefore, further arguments, to be passed to the functions fviz() and ggscatter(), can be specified in fviz_pca_ind() and fviz_pca_var().

Here, we present some of these additional arguments to customize the PCA graph of variables and individuals.

-#3.4.5.1 Dimensions

By default, variables/individuals are represented on dimensions 1 and 2. If you want to visualize them on dimensions 2 and 3, for example, you should specify the argument axes = c(2, 3).

# Variables on dimensions 2 and 3
fviz_pca_var(res.pca, axes = c(2, 3))

#Individuals on dimensions 2 and 3
fviz_pca_ind(res.pca, axes = c(2, 3))

-#3.4.5.2 Plot elements: point, text, arrow

The argument geom (for geometry) and derivatives are used to specify the geometry elements or graphical elements to be used for plotting.

  1. geom.var: a text specifying the geometry to be used for plotting variables. Allowed values are the combination of c("point", "arrow", "text").

* Use geom.var = "point" to show only points;
* Use geom.var = "text" to show only text labels;
* Use geom.var = c("point", "text") to show both points and text labels;
* Use geom.var = c("arrow", "text") to show arrows and text labels (default).

For example, type this:

#Show variable points and text labels
fviz_pca_var(res.pca, geom.var = c("point", "text"))

  2. geom.ind: a text specifying the geometry to be used for plotting individuals. Allowed values are the combination of c("point", "text").

* Use geom.ind = "point" to show only points;
* Use geom.ind = "text" to show only text labels;
* Use geom.ind = c("point", "text") to show both point and text labels (default).

For example, type this:

#Show individuals text labels only
fviz_pca_ind(res.pca, geom.ind = "text")

-#3.4.5.3 Size and shape of plot elements

  1. labelsize: font size for the text labels, e.g.: labelsize = 4.

  2. pointsize: the size of points, e.g.: pointsize = 1.5.

  3. arrowsize: the size of arrows. Controls the thickness of arrows, e.g.: arrowsize = 0.5.

  4. pointshape: the shape of points, pointshape = 21. Type ggpubr::show_point_shapes() to see available point shapes.

# Change the size of arrows and labels
fviz_pca_var(res.pca, arrowsize = 1, labelsize = 5,
             repel = TRUE)

#Change points size, shape and fill color

#Change labelsize

fviz_pca_ind(res.pca,
             pointsize = 3, pointshape = 21, fill = "lightblue",
             labelsize = 5, repel = TRUE)

-#3.4.5.4 Ellipses

As we described in the previous section 3.4.4.4, when coloring individuals by groups, you can add point concentration ellipses using the argument addEllipses = TRUE.

Note that, the argument ellipse.type can be used to change the type of ellipses. Possible values are:

* "convex": plot convex hull of a set of points.

* "confidence": plot confidence ellipses around group mean points as the function coord.ellipse() [in FactoMineR].

* "t": assumes a multivariate t-distribution.

* "euclid": draws a circle with the radius equal to level, representing the euclidean distance from the center. This ellipse probably won't appear circular unless coord_fixed() is applied.

The argument ellipse.level is also available to change the size of the concentration ellipse in normal probability. For example, specify ellipse.level = 0.95 or ellipse.level = 0.66.
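
As an illustration, a small sketch (reusing the iris.pca result from above) that draws tighter ellipses covering about 66% of each group:

``` r
# Concentration ellipses covering ~66% of each group (normal probability)
fviz_pca_ind(iris.pca, geom.ind = "point",
             col.ind = iris$Species,
             addEllipses = TRUE, ellipse.level = 0.66)
```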

# Add confidence ellipses

fviz_pca_ind(iris.pca, geom.ind = "point",
             col.ind = iris$Species, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "confidence",
             legend.title = "Groups"
)

library(factoextra)

# Perform PCA on the iris dataset
iris.pca <- prcomp(iris[, -5], scale = TRUE)

# Plot PCA with convex hulls and color by species
fviz_pca_ind(iris.pca, geom.ind = "point",
             col.ind = iris$Species, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "convex",
             legend.title = "Groups"
)

-#3.4.5.5 Group mean points

When coloring individuals by groups (section 3.4.4.4), the mean points of groups (barycenters) are also displayed by default.

To remove the mean points, use the argument mean.point = FALSE.

fviz_pca_ind(iris.pca,
             geom.ind = "point", # show points only (but not "text")
             col.ind = iris$Species, # color by groups
             legend.title = "Groups",
             mean.point = FALSE)


-#3.4.5.6 Axis lines

The argument axes.linetype can be used to specify the line type of axes. Default is
“dashed”. Allowed values include “blank”, “solid”, “dotted”, etc. To see all possible
values type ggpubr::show_line_types() in R.

To remove axis lines, use axes.linetype = “blank”:

``` r
fviz_pca_var(res.pca, axes.linetype = "blank")

-#3.4.5.7 Graphical parameters

To easily change the graphical parameters of any ggplot, you can use the function ggpar() [ggpubr package].

The graphical parameters that can be changed using ggpar() include:

* Main titles, axis labels and legend titles
* Legend position. Possible values: "top", "bottom", "left", "right", "none".
* Color palettes
* Themes

ind.p <- fviz_pca_ind(iris.pca, geom = "point", col.ind = iris$Species)
ggpubr::ggpar(ind.p,
              title = "Principal Component Analysis",
              subtitle = "Iris data set",
              caption = "Source: factoextra",
              xlab = "PC1", ylab = "PC2",
              legend.title = "Species", legend.position = "top",
              ggtheme = theme_gray(), palette = "jco"
)


–#3.4.6 Biplot

To make a simple biplot of individuals and variables, type this:

``` r
fviz_pca_biplot(res.pca, repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969" # Individuals color
                )

Note that, the biplot is only useful when there is a low number of variables and individuals in the data set; otherwise the final plot would be unreadable.

Note also that, the coordinates of individuals and variables are not constructed in the same space. Therefore, in the biplot, you should mainly focus on the direction of variables but not on their absolute positions on the plot.

Roughly speaking a biplot can be interpreted as follow:

* an individual that is on the same side of a given variable has a high value for this variable;
* an individual that is on the opposite side of a given variable has a low value for this variable.

Now, using the iris.pca output, let's:

* make a biplot of individuals and variables
* change the color of individuals by groups: col.ind = iris$Species
* show only the labels for variables: label = "var" or use geom.ind = "point"

fviz_pca_biplot(iris.pca,
                col.ind = iris$Species, palette = "jco",
                addEllipses = TRUE, label = "var",
                col.var = "black", repel = TRUE,
                legend.title = "Species")


In the following example, we color both individuals and variables by groups. The trick is to use pointshape = 21 for individual points: this particular point shape can be filled by a color using the argument fill.ind. The border line color of individual points is set to "black" using col.ind. To color variables by groups, the argument col.var will be used.

To customize individuals and variable colors, we use the helper functions fill_palette()
and color_palette() [in ggpubr package].


``` r
# Load necessary libraries
library(factoextra)
library(ggpubr)
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:dendextend':
## 
##     rotate
# Perform PCA on the iris dataset
iris.pca <- prcomp(iris[, -5], scale = TRUE)

# Create a custom factor column if needed
# iris$fSpecies <- as.factor(iris$Species) # Uncomment if not already done

# Create the PCA biplot
p <- fviz_pca_biplot(iris.pca,
                     geom.ind = "point",
                     pointshape = 21,
                     pointsize = 2.5,
                     fill.ind = iris$Species, # Use Species directly if fSpecies is not defined
                     col.ind = "black",
                     col.var = factor(c("sepal", "sepal", "petal", "petal")),
                     legend.title = list(fill = "Species", color = "Clusters"),
                     repel = TRUE # Avoid label overplotting
                     ) +
  ggpubr::fill_palette("jco") +  # Individual fill color
  ggpubr::color_palette("npg")   # Variable colors

# Print the plot
print(p)

Another complex example is to color individuals by groups (discrete color) and variables by their contributions to the principal components (gradient colors). Additionally, we'll change the transparency of variables by their contributions using the argument alpha.var.

# Load required libraries
library(factoextra)
library(FactoMineR)

# Perform PCA on the iris data set
iris.pca <- PCA(iris[, -5], graph = FALSE)

# PCA biplot
fviz_pca_biplot(iris.pca,
                # Individuals settings
                geom.ind = "point",
                fill.ind = iris$Species, # fill points by groups
                col.ind = "black",
                pointshape = 21,
                pointsize = 2,
                palette = "jco",
                addEllipses = TRUE,

                # Variables settings
                alpha.var = "contrib",
                col.var = "contrib",
                gradient.cols = "RdYlBu", # brewer palette

                # Legends settings
                legend.title = list(fill = "Species",
                                    color = "Contrib",
                                    alpha = "Contrib")
               )

—#3.5 Supplementary elements

–#3.5.1 Definition and types

As described above (section 3.3.2), the decathlon2 data set contains supplementary continuous variables (quanti.sup, columns 11:12), supplementary qualitative variables (quali.sup, column 13) and supplementary individuals (ind.sup, rows 24:27).

Supplementary variables and individuals are not used for the determination of the principal components. Their coordinates are predicted using only the information provided by the principal component analysis performed on active variables/individuals.
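
As a quick sanity check (a small sketch, assuming the factoextra package is loaded), the dimensions and the supplementary elements described above can be inspected directly:

``` r
# decathlon2: individuals in rows, variables in columns
data(decathlon2)
dim(decathlon2) # 27 rows, 13 columns

# Supplementary individuals and variables described above
rownames(decathlon2)[24:27]
colnames(decathlon2)[11:13]
```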

–#3.5.2 Specification in PCA

To specify supplementary individuals and variables, the function PCA() can be used as follow:

PCA(X, ind.sup = NULL, quanti.sup = NULL, quali.sup = NULL, graph = TRUE)

* X : a data frame. Rows are individuals and columns are numeric variables.

* ind.sup : a numeric vector specifying the indexes of the supplementary individuals.

* quanti.sup, quali.sup : numeric vectors specifying, respectively, the indexes of the supplementary quantitative and qualitative variables.

For example, type this:

res.pca <- PCA(decathlon2, ind.sup = 24:27,
               quanti.sup = 11:12, quali.sup = 13, graph=FALSE)

–#3.5.3 Quantitative variables

* Predicted results (coordinates, correlation and cos2) for the supplementary quantitative variables:

res.pca$quanti.sup
## $coord
##             Dim.1       Dim.2      Dim.3       Dim.4       Dim.5
## Rank   -0.7014777 -0.24519443 -0.1834294  0.05575186 -0.07382647
## Points  0.9637075  0.07768262  0.1580225 -0.16623092 -0.03114711
## 
## $cor
##             Dim.1       Dim.2      Dim.3       Dim.4       Dim.5
## Rank   -0.7014777 -0.24519443 -0.1834294  0.05575186 -0.07382647
## Points  0.9637075  0.07768262  0.1580225 -0.16623092 -0.03114711
## 
## $cos2
##            Dim.1       Dim.2      Dim.3      Dim.4        Dim.5
## Rank   0.4920710 0.060120310 0.03364635 0.00310827 0.0054503477
## Points 0.9287322 0.006034589 0.02497110 0.02763272 0.0009701427
fviz_pca_var(res.pca)

Note that, by default, supplementary quantitative variables are shown in blue color and dashed lines.

Further arguments to customize the plot:

#Change color of variables
fviz_pca_var(res.pca,
             col.var = "black", # Active variable
             col.quanti.sup = "red" #Suppl. quantitative variables
)

#Hide active variables on the plot,
#show only supplementary variables
fviz_pca_var(res.pca, invisible = "var")

#Hide supplementary variables
fviz_pca_var(res.pca, invisible = "quanti.sup")

Using fviz_pca_var(), the quantitative supplementary variables are displayed automatically on the correlation circle plot. Note that, you can add the quanti.sup variables manually, using the fviz_add() function, for further customization. An example is shown below.

#Plot of active variables
p <- fviz_pca_var(res.pca, invisible = "quanti.sup")
#Add supplementary variables

fviz_add(p, res.pca$quanti.sup$coord,
         geom = c("arrow", "text"),
         color = "red")

–#3.5.4 Individuals

* Predicted results (coordinates and cos2) for the supplementary individuals:

res.pca$ind.sup
## $coord
##              Dim.1       Dim.2      Dim.3      Dim.4       Dim.5
## KARPOV   0.7947206  0.77951227 -1.6330203  1.7242283 -0.75070396
## WARNERS -0.3864645 -0.12159237 -1.7387332 -0.7063341 -0.03230011
## Nool    -0.5591306  1.97748871 -0.4830358 -2.2784526 -0.25461493
## Drews   -1.1092038  0.01741477 -3.0488182 -1.5343468 -0.32642192
## 
## $cos2
##              Dim.1        Dim.2      Dim.3      Dim.4        Dim.5
## KARPOV  0.05104677 4.911173e-02 0.21553730 0.24028620 0.0455487744
## WARNERS 0.02422707 2.398250e-03 0.49039677 0.08092862 0.0001692349
## Nool    0.02897149 3.623868e-01 0.02162236 0.48108780 0.0060077529
## Drews   0.09207094 2.269527e-05 0.69560547 0.17617609 0.0079736753
## 
## $dist
##   KARPOV  WARNERS     Nool    Drews 
## 3.517470 2.482899 3.284943 3.655527
p <- fviz_pca_ind(res.pca, col.ind.sup = "blue", repel = TRUE)
p <- fviz_add(p, res.pca$quali.sup$coord, color = "red")
p

Supplementary individuals are shown in blue. The levels of the supplementary qualitative variable are shown in red color.

–#3.5.5 Qualitative variables

In the previous section, we showed that you can add the supplementary qualitative variables on the individuals plot using fviz_add().

Note that, the supplementary qualitative variables can be also used for coloring individuals by groups. This can help to interpret the data. The data set decathlon2 contains a supplementary qualitative variable at column 13 corresponding to the type of competitions.

The results concerning the supplementary qualitative variable are:

res.pca$quali.sup
## $coord
##              Dim.1      Dim.2       Dim.3      Dim.4      Dim.5
## Decastar -1.343451  0.1218097 -0.03789524  0.1808357  0.1343364
## OlympicG  1.231497 -0.1116589  0.03473730 -0.1657661 -0.1231417
## 
## $cos2
##              Dim.1       Dim.2        Dim.3      Dim.4       Dim.5
## Decastar 0.9051233 0.007440939 0.0007201669 0.01639956 0.009050062
## OlympicG 0.9051233 0.007440939 0.0007201669 0.01639956 0.009050062
## 
## $v.test
##              Dim.1      Dim.2      Dim.3      Dim.4      Dim.5
## Decastar -2.970766  0.4034256 -0.1528767  0.8971036  0.7202457
## OlympicG  2.970766 -0.4034256  0.1528767 -0.8971036 -0.7202457
## 
## $dist
## Decastar OlympicG 
## 1.412108 1.294433 
## 
## $eta2
##                 Dim.1      Dim.2       Dim.3      Dim.4      Dim.5
## Competition 0.4011568 0.00739783 0.001062332 0.03658159 0.02357972

To color individuals by a supplementary qualitative variable, the argument habillage is used to specify the index of the supplementary qualitative variable. Historically, this argument name comes from the FactoMineR package. It's a French word meaning "dressing" in English. To keep consistency between FactoMineR and factoextra, we decided to keep the same argument name.

fviz_pca_ind(res.pca, habillage = 13,
             addEllipses =TRUE, ellipse.type = "confidence",
             palette = "jco", repel = TRUE)

Recall that, to remove the mean points of groups, specify the argument mean.point = FALSE.

—#3.6 Filtering results

If you have many individuals/variables, it's possible to visualize only some of them using the arguments select.ind and select.var.

select.ind, select.var: a selection of individuals/variables to be plotted. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:

* name: a character vector containing the names of the individuals/variables to be plotted
* cos2: if cos2 is in [0, 1], ex: 0.6, then individuals/variables with a cos2 > 0.6 are plotted; if cos2 > 1, ex: 5, then the top 5 individuals/variables with the highest cos2 are plotted
* contrib: if contrib > 1, ex: 5, then the top 5 individuals/variables with the highest contributions are plotted

#Visualize variables with cos2 >= 0.6
fviz_pca_var(res.pca, select.var = list(cos2 = 0.6))

#Top 5 active variables with the highest cos2
fviz_pca_var(res.pca, select.var = list(cos2 = 5))

#Select by names
name <- list(name = c("Long.jump", "High.jump", "X100m"))
fviz_pca_var(res.pca, select.var = name)

#Top 5 contributing individuals and variables
fviz_pca_biplot(res.pca, select.ind = list(contrib = 5),
                select.var = list(contrib = 5),
                ggtheme = theme_minimal())

When the selection is done according to the contribution values, supplementary individuals/variables are not shown because they don't contribute to the construction of the axes.




—#3.7 Exporting results

–#3.7.1 Export plots to PDF/PNG files

The factoextra package produces ggplot2-based graphs. To save any ggplot, the standard R code is as follow:

#Print the plot to a pdf file
pdf("myplot.pdf")
print(myplot)
dev.off()

In the following examples, we'll show you how to save the different graphs into pdf or png files.

The first step is to create the plots you want as an R object:

# Scree plot
scree.plot <- fviz_eig(res.pca)

# Plot of individuals
ind.plot <- fviz_pca_ind(res.pca)

# Plot of variables
var.plot <- fviz_pca_var(res.pca)

Next, the plots can be exported into a single pdf file as follow:

pdf("PCA.pdf") # Create a new pdf device
print(scree.plot)
print(ind.plot)
print(var.plot)
dev.off() # Close the pdf device

Note that, using the above R code will create the PDF file in your current working directory. To see the path of your current working directory, type getwd() in the R console.

To print each plot to specific png file, the R code looks like this:

# Print scree plot to a png file
png("pca-scree-plot.png")
print(scree.plot)
dev.off()

–#3.7.2 Export results to txt/csv files

Easy to use R function: write.infile() [in FactoMineR package]:

# Export into a CSV file
write.infile(res.pca, "pca.csv", sep = ";")



—#3.8 Summary

In conclusion, we described how to perform and interpret principal component analysis (PCA). We computed PCA using the PCA() function [FactoMineR]. Next, we used the factoextra R package to produce ggplot2-based visualization of the PCA results.

There are other functions [packages] to compute PCA in R:
1) Using prcomp() [stats]

``` r
res.pca <- prcomp(iris[, -5], scale. = TRUE)

Read more: http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp

2) Using princomp() [stats]

res.pca <- princomp(iris[, -5], cor = TRUE)

Read more: http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp

3) Using dudi.pca() [ade4]
library("ade4")
## 
## Attaching package: 'ade4'
## The following object is masked from 'package:FactoMineR':
## 
##     reconst
res.pca <- dudi.pca(iris[, -5], scannf = FALSE, nf = 5)
4) Using epPCA() [ExPosition]
library("ExPosition")
## Loading required package: prettyGraphs
res.pca <- epPCA(iris[, -5], graph = FALSE)

No matter what functions you decide to use, in the list above, the factoextra package can handle the output for creating beautiful plots similar to what we described in the previous sections for FactoMineR:

fviz_eig(res.pca) # Scree plot

fviz_pca_ind(res.pca) # Graph of individuals

fviz_pca_var(res.pca) # Graph of variables










Chapter 4

###Correspondence Analysis

—#4.1 Introduction

Correspondence analysis (CA) is an extension of principal component analysis (Chapter 3) suited to explore relationships among qualitative variables (or categorical data). Like principal component analysis, it provides a solution for summarizing and visualizing data sets in two-dimensional plots.

Here, we describe the simple correspondence analysis, which is used to analyze frequencies formed by two categorical variables, a data table known as a contingency table. It provides factor scores (coordinates) for both row and column points of the contingency table. These coordinates are used to visualize graphically the association between row and column elements in the contingency table.

When analyzing a two-way contingency table, a typical question is whether certain row elements are associated with some column elements. Correspondence analysis is a geometric approach for visualizing the rows and columns of a two-way contingency table as points in a low-dimensional space, such that the positions of the row and column points are consistent with their associations in the table. The aim is to have a global view of the data that is useful for interpretation.

In the current chapter, we'll show how to compute and interpret correspondence analysis using two R packages: i) FactoMineR for the analysis and ii) factoextra for data visualization. Additionally, we'll show how to reveal the most important variables that explain the variations in a data set. We continue by explaining how to apply correspondence analysis using supplementary rows and columns. This is important, if you want to make predictions with CA. The last sections of this guide describe also how to filter CA results in order to keep only the most contributing variables. Finally, we'll see how to deal with outliers.


—#4.2 Computation

–#4.2.1 R packages

Several functions from different packages are available in the R software for computing
correspondence analysis:

* CA() [FactoMineR package],
* ca() [ca package],
* dudi.coa() [ade4 package],
* corresp() [MASS package],
* and epCA() [ExPosition package]

No matter what function you decide to use, you can easily extract and visualize the results
of correspondence analysis using R functions provided in the factoextra R package.

Here, we'll use FactoMineR (for the analysis) and factoextra (for ggplot2-based elegant visualization). To install the two packages, type this:

install.packages(c("FactoMineR", "factoextra"))

Load the packages:

library("FactoMineR")
library("factoextra")

–#4.2.2 Data format

The data should be a contingency table. We'll use the demo data set housetasks available in the factoextra R package:

data(housetasks)
# head(housetasks)

The data is a contingency table containing 13 housetasks and their repartition in the couple (Wife, Alternating, Husband, Jointly).


–#4.2.3 Graph of contingency tables and chi-square test

The above contingency table is not very large. Therefore, it’s easy to visually inspect and interpret row and column profiles:

* It's evident that the housetasks - Laundry, Main_meal and Dinner - are more frequently done by the "Wife".
* Repairs and driving are dominantly done by the husband.
* Holidays are frequently associated with the column "jointly".

Exploratory data analysis and visualization of contingency tables have been covered in our previous article: Chi-Square test of independence in R. Briefly, a contingency table can be visualized using the functions balloonplot() [gplots package] and mosaicplot() [graphics package]:

library("gplots")

# 1. convert the data as a table

dt <- as.table(as.matrix(housetasks))

# 2. Graph

balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

Note that, row and column sums are printed by default in the bottom and right margins, respectively. These values are hidden, in the above plot, using the argument show.margins = FALSE.

For a small contingency table, you can use the Chi-square test to evaluate whether there is a significant dependence between row and column categories:

chisq <- chisq.test(housetasks)
chisq
## 
##  Pearson's Chi-squared test
## 
## data:  housetasks
## X-squared = 1944.5, df = 36, p-value < 2.2e-16

In our example, the row and the column variables are statistically significantly associated (p-value < 2.2e-16).

–#4.2.4 R code to compute CA

The function CA() [FactoMineR package] can be used. A simplified format is:

CA(X, ncp = 5, graph = TRUE)

To compute correspondence analysis, type this:

library("FactoMineR")
res.ca <- CA(housetasks, graph = FALSE)

The output of the function CA() is a list including:

print(res.ca)
## **Results of the Correspondence Analysis (CA)**
## The row variable has  13  categories; the column variable has 4 categories
## The chi square of independence between the two variables is equal to 1944.456 (p-value =  0 ).
## *The results are available in the following objects:
## 
##    name              description                   
## 1  "$eig"            "eigenvalues"                 
## 2  "$col"            "results for the columns"     
## 3  "$col$coord"      "coord. for the columns"      
## 4  "$col$cos2"       "cos2 for the columns"        
## 5  "$col$contrib"    "contributions of the columns"
## 6  "$row"            "results for the rows"        
## 7  "$row$coord"      "coord. for the rows"         
## 8  "$row$cos2"       "cos2 for the rows"           
## 9  "$row$contrib"    "contributions of the rows"   
## 10 "$call"           "summary called parameters"   
## 11 "$call$marge.col" "weights of the columns"      
## 12 "$call$marge.row" "weights of the rows"

The object that is created using the function CA() contains much information organized in different lists and matrices. These values are described in the next section.

—#4.3 Visualization and interpretation

We'll use the following functions [in factoextra] to help in the interpretation and the visualization of the correspondence analysis:

* get_eigenvalue(res.ca): Extract the eigenvalues/variances retained by each dimension (axis)
* fviz_eig(res.ca): Visualize the eigenvalues
* fviz_ca_row(res.ca), fviz_ca_col(res.ca): Visualize the results for rows and columns, respectively.

In the next sections, we’ll illustrate each of these functions.

–#4.3.1 Statistical significance

To interpret correspondence analysis, the first step is to evaluate whether there is a significant dependency between the rows and columns.

A rigorous method is to use the chi-square statistic for examining the association between row and column variables. This appears at the top of the report generated by the function summary(res.ca) or print(res.ca), see section 4.2.4. A high chi-square statistic means strong link between row and column variables.

In our example, the association is highly significant (chi-square: 1944.456, p = 0).

# Chi-square statistics

chi2 <- 1944.456

# Degree of freedom

df <- (nrow(housetasks) - 1) * (ncol(housetasks) - 1)
# P-value

pval <- pchisq(chi2, df = df, lower.tail = FALSE)
pval
## [1] 0

–#4.3.2 Eigenvalues / Variances

Recall that, we examine the eigenvalues to determine the number of axes to be considered. The eigenvalues and the proportion of variances retained by the different axes can be extracted using the function get_eigenvalue() [factoextra package]. Eigenvalues are large for the first axis and small for the subsequent axes.

library("factoextra")
eig.val <- get_eigenvalue(res.ca)

eig.val
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  0.5428893         48.69222                    48.69222
## Dim.2  0.4450028         39.91269                    88.60491
## Dim.3  0.1270484         11.39509                   100.00000

Eigenvalues correspond to the amount of information retained by each axis. Dimensions are ordered decreasingly and listed according to the amount of variance explained in the solution. Dimension 1 explains the most variance in the solution, followed by dimension 2 and so on.

The cumulative percentage explained is obtained by adding the successive proportions of variation explained to obtain the running total. For instance, 48.69% plus 39.91% equals 88.6%, and so forth. Therefore, about 88.6% of the variation is explained by the first two dimensions.

Eigenvalues can be used to determine the number of axes to retain. There is no “rule of thumb” to choose the number of dimensions to keep for the data interpretation. It depends on the research question and the researcher’s need. For example, if you are satisfied with 80% of the total variances explained then use the number of dimensions necessary to achieve that.
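
Both quantities can be read directly from the eig.val object computed above (a short sketch, not part of the original workflow):

``` r
# Cumulative percentage of variance, recomputed from the output above
cumsum(eig.val[, "variance.percent"])

# First dimension at which at least 80% of the total variance is reached
which(eig.val[, "cumulative.variance.percent"] >= 80)[1]
```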

Note that, a good dimension reduction is achieved when the first few dimensions account for a large proportion of the variability.

In our analysis, the first two axes explain 88.6% of the variation. This is an acceptably large percentage.

An alternative method to determine the number of dimensions is to look at a Scree Plot, which is the plot of eigenvalues/variances ordered from largest to smallest. The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size.

The scree plot can be produced using the function fviz_eig() or fviz_screeplot() [factoextra package].

fviz_screeplot(res.ca, addlabels = TRUE, ylim = c(0, 50))

The point at which the scree plot shows a bend (so called “elbow”) can be considered as indicating an optimal dimensionality.

It’s also possible to calculate an average eigenvalue above which the axis should be kept in the solution.

Our data contains 13 rows and 4 columns.

If the data were random, the expected value of the eigenvalue for each axis would be 1/(nrow(housetasks)-1) = 1/12 = 8.33% in terms of rows.

Likewise, the average axis should account for 1/(ncol(housetasks)-1) = 1/3 = 33.33% in terms of the 4 columns.
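
These two reference values are trivial to reproduce in R (a sketch using the housetasks dimensions):

``` r
# Expected average eigenvalue (in %) if the data were random
1/(nrow(housetasks) - 1) * 100 # rows:    8.33%
1/(ncol(housetasks) - 1) * 100 # columns: 33.33%
```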

According to (Bendixen, 1995):

Any axis with a contribution larger than the maximum of these two percentages should be considered as important and included in the solution for the interpretation of the data.

The R code below, draws the scree plot with a red dashed line specifying the average eigenvalue:

fviz_screeplot(res.ca) +
  geom_hline(yintercept=33.33, linetype=2, color="red")

According to the graph above, only dimensions 1 and 2 should be used in the solution. Dimension 3 explains only 11.4% of the total inertia, which is below the average eigenvalue (33.33%) and too little to be kept for further analysis.

Note that, you can use more than 2 dimensions. However, the supplementary dimensions are unlikely to contribute significantly to the interpretation of the nature of the association between the rows and columns.

Dimensions 1 and 2 explain approximately 48.7% and 39.9% of the total inertia respectively. This corresponds to a cumulative total of 88.6% of total inertia retained by the 2 dimensions. The higher the retention, the more subtlety in the original data is retained in the low-dimensional solution (Bendixen, 2003).

–#4.3.3 Biplot

The function fviz_ca_biplot() [factoextra package] can be used to draw the biplot of row and column variables.

# repel = TRUE to avoid text overlapping (slow if many points)
fviz_ca_biplot(res.ca, repel = TRUE)

The graph above is called a symmetric plot and shows a global pattern within the data. Rows are represented by blue points and columns by red triangles.

The distance between any row points or column points gives a measure of their similarity (or dissimilarity). Row points with a similar profile are close on the factor map. The same holds true for column points.

This graph shows that:

* housetasks such as dinner, breakfeast and laundry are done more often by the wife
* driving and repairs are done by the husband

* The symmetric plot represents the row and column profiles simultaneously in a common space. In this case, only the distance between row points or the distance between column points can be really interpreted.
* The distance between any row and column items is not meaningful! You can only make general statements about the observed pattern.

The next step for the interpretation is to determine which row and column variables contribute the most in the definition of the different dimensions retained in the model.

–#4.3.4 Graph of row variables

-#4.3.4.1 Results

The function get_ca_row() [in factoextra] is used to extract the results for row variables. This function returns a list containing the coordinates, the cos2, the contribution and the inertia of row variables:

row <- get_ca_row(res.ca)
row
## Correspondence Analysis - Results for rows
##  ===================================================
##   Name       Description                
## 1 "$coord"   "Coordinates for the rows" 
## 2 "$cos2"    "Cos2 for the rows"        
## 3 "$contrib" "contributions of the rows"
## 4 "$inertia" "Inertia of the rows"

The components of the get_ca_row() function can be used in the plot of rows as follow:

* row$cos2: quality of representation of rows.

* row$contrib: contribution of rows (in %) to the definition of the dimensions.

Note that, it’s possible to plot row points and to color them according to either i) their quality on the factor map (cos2) or ii) their contribution values to the definition of dimensions (contrib).

The different components can be accessed as follow:

#Coordinates
head(row$coord)

#Cos2: quality on the factor map
head(row$cos2)

#Contributions to the principal components
head(row$contrib)



In this section, we describe how to visualize row points only. Next, we highlight rows according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.

-#4.3.4.2 Coordinates of row points

The R code below displays the coordinates of each row point in each dimension (1, 2 and
3):

``` r
head (row$coord)
##                 Dim 1      Dim 2       Dim 3
## Laundry    -0.9918368  0.4953220 -0.31672897
## Main_meal  -0.8755855  0.4901092 -0.16406487
## Dinner     -0.6925740  0.3081043 -0.20741377
## Breakfeast -0.5086002  0.4528038  0.22040453
## Tidying    -0.3938084 -0.4343444 -0.09421375
## Dishes     -0.1889641 -0.4419662  0.26694926

Use the function fviz_ca_row() [in factoextra] to visualize only row points:

fviz_ca_row(res.ca, repel = TRUE)

It’s possible to change the color and the shape of the row points using the arguments col.row and shape.row as follow:

fviz_ca_row(res.ca, col.row="steelblue", shape.row = 15)

The plot above shows the relationships between row points:

* Rows with a similar profile are grouped together.
* Negatively correlated rows are positioned on opposite sides of the plot origin (opposed quadrants).

-#4.3.4.3 Quality of representation of rows

The result of the analysis shows that the contingency table has been successfully represented in low dimension space using correspondence analysis. The two dimensions 1 and 2 are sufficient to retain 88.6% of the total inertia (variation) contained in the data.

However, not all the points are equally well displayed in the two dimensions.

Recall that, the quality of representation of the rows on the factor map is called the squared cosine (cos2) or the squared correlations.

The cos2 measures the degree of association between rows/columns and a particular axis. The cos2 of row points can be extracted as follow:

head(row$cos2, 4)
##                Dim 1     Dim 2      Dim 3
## Laundry    0.7399874 0.1845521 0.07546047
## Main_meal  0.7416028 0.2323593 0.02603787
## Dinner     0.7766401 0.1537032 0.06965666
## Breakfeast 0.5049433 0.4002300 0.09482670

The values of the cos2 are comprised between 0 and 1. The sum of the cos2 for rows on all the CA dimensions is equal to one.

The quality of representation of a row or column in n dimensions is simply the sum of the squared cosine of that row or column over the n dimensions.

If a row item is well represented by two dimensions, the sum of the cos2 is close to one. For some of the row items, more than 2 dimensions are required to perfectly represent the data.
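
This property is easy to verify on the row object extracted above (a small sketch):

``` r
# The cos2 of each row sums to 1 over all CA dimensions
rowSums(row$cos2)

# Quality of representation on the first two dimensions only
rowSums(row$cos2[, 1:2])
```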

It's possible to color row points by their cos2 values using the argument col.row = "cos2". This produces a gradient color, which can be customized using the argument gradient.cols. For instance, gradient.cols = c("white", "blue", "red") means that rows with low cos2 values will be colored in white, rows with mid cos2 values in blue, and rows with high cos2 values in red:

#Color by cos2 values: quality on the factor map
fviz_ca_row(res.ca, col.row = "cos2",
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
            repel = TRUE)

Note that, it's also possible to change the transparency of the row points according to their cos2 values using the option alpha.row = "cos2". For example, type this:

#Change the transparency by cos2 values
fviz_ca_row(res.ca, alpha.row = "cos2")

You can visualize the cos2 of row points on all the dimensions using the corrplot package:

``` r
library("corrplot")
corrplot(row$cos2, is.corr = FALSE)

It's also possible to create a bar plot of rows cos2 using the function fviz_cos2() [in factoextra]:

#Cos2 of rows on Dim.1 and Dim.2
fviz_cos2(res.ca, choice = "row", axes = 1:2)

Note that, all row points except Official are well represented by the first two dimensions. This implies that the position of the point corresponding to the item Official on the scatter plot should be interpreted with some caution. A higher dimensional solution is probably necessary for the item Official.

-#4.3.4.4 Contributions of rows to the dimensions

The contribution of rows (in %) to the definition of the dimensions can be extracted as follow:

head(row$contrib)
##                 Dim 1    Dim 2    Dim 3
## Laundry    18.2867003 5.563891 7.968424
## Main_meal  12.3888433 4.735523 1.858689
## Dinner      5.4713982 1.321022 2.096926
## Breakfeast  3.8249284 3.698613 3.069399
## Tidying     1.9983518 2.965644 0.488734
## Dishes      0.4261663 2.844117 3.634294

The row variables with the larger value contribute the most to the definition of the dimensions.

* Rows that do not contribute much to any dimension or that contribute to the last dimensions are less important.

It's possible to use the function corrplot() [corrplot package] to highlight the most contributing row points for each dimension:

library("corrplot")
corrplot(row$contrib, is.corr = FALSE)

The function fviz_contrib() [factoextra package] can be used to draw a bar plot of row contributions. If your data contains many rows, you can decide to show only the top contributing rows. The R code below shows the top 10 rows contributing to the dimensions:

# Contributions of rows to dimension 1
fviz_contrib(res.ca, choice = "row", axes = 1, top = 10)

# Contributions of rows to dimension 2
fviz_contrib(res.ca, choice = "row", axes = 2, top = 10)

The total contribution to dimension 1 and 2 can be obtained as follow:

# Total contribution to dimension 1 and 2
fviz_contrib(res.ca, choice = "row", axes = 1:2, top = 10)

The red dashed line on the graph above indicates the expected average value, if the contributions were uniform. The calculation of the expected contribution value, under the null hypothesis, has been detailed in the principal component analysis chapter (Chapter 3).
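
For the housetasks data, that reference value for rows can be sketched as follows (assuming uniform contributions):

``` r
# Expected average row contribution (%) if all 13 rows contributed uniformly
100 / nrow(housetasks) # ~7.7%
```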

It can be seen that:

* the row items Repairs, Laundry, Main_meal and Driving are the most important in the definition of the first dimension.
* the row items Holidays and Repairs contribute the most to the dimension 2.

The most important (or contributing) row points can be highlighted on the scatter plot as follow:

fviz_ca_row(res.ca, col.row = "contrib",
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
            repel = TRUE)

The scatter plot gives an idea of what pole of the dimensions the row categories are actually contributing to. It is evident that the row categories Repairs and Driving have an important contribution to the positive pole of the first dimension, while the categories Laundry and Main_meal have a major contribution to the negative pole of the first dimension.

In other words, dimension 1 is mainly defined by the opposition of Repairs and Driving (positive pole), and Laundry and Main_meal (negative pole).

Note that, it’s also possible to control the transparency of row points according to their contribution values using the option alpha.row = “contrib”. For example, type this:

#Change the transparency by contrib values
fviz_ca_row(res.ca, alpha.row="contrib",
            repel = TRUE)

–#4.3.5 Graph of column variables

-#4.3.5.1 Results

The function get_ca_col() [in factoextra] is used to extract the results for column variables.

This function returns a list containing the coordinates, the cos2, the contribution and the inertia of columns variables:

col <- get_ca_col(res.ca)
col
## Correspondence Analysis - Results for columns
##  ===================================================
##   Name       Description                   
## 1 "$coord"   "Coordinates for the columns" 
## 2 "$cos2"    "Cos2 for the columns"        
## 3 "$contrib" "contributions of the columns"
## 4 "$inertia" "Inertia of the columns"

The result for columns gives the same information as described for rows. For this reason, we'll just display the result for columns in this section with only a very brief comment.

To get access to the different components, use this:

# Coordinates of column points
head(col$coord)
##                   Dim 1      Dim 2       Dim 3
## Wife        -0.83762154  0.3652207 -0.19991139
## Alternating -0.06218462  0.2915938  0.84858939
## Husband      1.16091847  0.6019199 -0.18885924
## Jointly      0.14942609 -1.0265791 -0.04644302
# Quality of representation
head(col$cos2)
##                   Dim 1     Dim 2       Dim 3
## Wife        0.801875947 0.1524482 0.045675847
## Alternating 0.004779897 0.1051016 0.890118521
## Husband     0.772026244 0.2075420 0.020431728
## Jointly     0.020705858 0.9772939 0.002000236
# Contributions
head(col$contrib)
##                 Dim 1     Dim 2      Dim 3
## Wife        44.462018 10.312237 10.8220753
## Alternating  0.103739  2.782794 82.5492464
## Husband     54.233879 17.786612  6.1331792
## Jointly      1.200364 69.118357  0.4954991

-#4.3.5.2 Plots: quality and contribution

The fviz_ca_col() is used to produce the graph of column points. To create a simple plot, type this:

fviz_ca_col(res.ca)

Like row points, it’s also possible to color column points by their cos2 values:

fviz_ca_col(res.ca, col.col = "cos2",
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
            repel = TRUE)

The R code below creates a barplot of columns cos2:

fviz_cos2(res.ca, choice = "col", axes = 1:2)

Recall that, the value of the cos2 is between 0 and 1. A cos2 close to 1 corresponds to column/row variables that are well represented on the factor map.

Note that, only the column item Alternating is not very well displayed on the first two dimensions. The position of this item must be interpreted with caution in the space formed by dimensions 1 and 2.

To visualize the contribution of columns to the first two dimensions, type this:

fviz_contrib(res.ca, choice = "col", axes = 1:2)

–#4.3.6 Biplot options

Biplot is a graphical display of rows and columns in 2 or 3 dimensions. We have already described how to create CA biplots in section 4.3.3. Here, we'll describe different types of CA biplots.

-#4.3.6.1 Symmetric biplot

As mentioned above, the standard plot of correspondence analysis is a symmetric biplot in which both rows (blue points) and columns (red triangles) are represented in the same space using the principal coordinates. These coordinates represent the row and column profiles. In this case, only the distance between row points or the distance between column points can be really interpreted.

With the symmetric plot, the inter-distance between rows and columns can't be interpreted. Only general statements can be made about the pattern.

fviz_ca_biplot(res.ca, repel = TRUE)

Note that, in order to interpret the distance between column points and row points, the simplest way is to make an asymmetric plot. This means that, the column profiles must be presented in row space or vice-versa.

-#4.3.6.2 Asymmetric biplot

To make an asymmetric biplot, rows (or columns) points are plotted from the standard coordinates (S) and the profiles of the columns (or the rows) are plotted from the principal coordinates (P) (Bendixen, 2003).

For a given axis, the standard and principal coordinates are related as follows: P = sqrt(eigenvalue) × S
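
As a small numerical check (a sketch, assuming the housetasks res.ca from section 4.2.4), the standard coordinates can be recovered by dividing the principal coordinates by the square root of the corresponding eigenvalue:

``` r
eig <- get_eigenvalue(res.ca)

# Principal coordinates (P) of rows on dimension 1
p.coord <- res.ca$row$coord[, 1]

# Corresponding standard coordinates: S = P / sqrt(eigenvalue)
s.coord <- p.coord / sqrt(eig[1, "eigenvalue"])
```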

Depending on the situation, other types of display can be set using the argument map (Nenadic and Greenacre, 2007) in the function fviz_ca_biplot() [in factoextra].

The allowed options for the argument map are:

  1. "rowprincipal" or "colprincipal" - these are the so-called asymmetric biplots, with either rows in principal coordinates and columns in standard coordinates, or vice versa (also known as row-metric-preserving or column-metric-preserving, respectively).

* "rowprincipal": columns are represented in row space
* "colprincipal": rows are represented in column space

  2. "symbiplot" - both rows and columns are scaled to have variances equal to the singular values (square roots of eigenvalues), which gives a symmetric biplot but does not preserve row or column metrics.

  3. "rowgab" or "colgab": asymmetric maps proposed by Gabriel & Odoroff (Gabriel and Odoroff, 1990):

* "rowgab": rows in principal coordinates and columns in standard coordinates multiplied by the mass.
* "colgab": columns in principal coordinates and rows in standard coordinates multiplied by the mass.

  4. "rowgreen" or "colgreen": the so-called contribution biplots showing visually the most contributing points (Greenacre, 2006b):

* "rowgreen": rows in principal coordinates and columns in standard coordinates multiplied by the square root of the mass.
* "colgreen": columns in principal coordinates and rows in standard coordinates multiplied by the square root of the mass.

The R code below draws a standard asymmetric biplot:

fviz_ca_biplot(res.ca,
               map ="rowprincipal", arrow = c(TRUE, TRUE),
               repel = TRUE)

We used the argument arrow, which is a vector of two logicals specifying if the plot should contain points (FALSE, default) or arrows (TRUE). The first value sets the rows and the second value sets the columns.

If the angle between two arrows is acute, then there is a strong association between the corresponding row and column.

To interpret the distance between rows and a column, you should perpendicularly project row points on the column arrow.

-#4.3.6.3 Contribution biplot

In the standard symmetric biplot (mentioned in the previous section), it’s difficult to know the most contributing points to the solution of the CA.

Michael Greenacre proposed a new scaling display (called contribution biplot) which incorporates the contribution of points (Greenacre, 2013). In this display, points that contribute very little to the solution are close to the center of the biplot and are relatively unimportant to the interpretation.

A contribution biplot can be drawn using the argument map = “rowgreen” or map = “colgreen”.

Firstly, you have to decide whether to analyse the contributions of rows or columns to the definition of the axes.

In our example, we'll interpret the contribution of rows to the axes. The argument map = "colgreen" is used. In this case, recall that columns are in principal coordinates and rows in standard coordinates multiplied by the square root of the mass. For a given row, the square of the new coordinate on an axis i is exactly the contribution of this row to the inertia of the axis i.

fviz_ca_biplot(res.ca, map ="colgreen", arrow = c(TRUE, FALSE),
               repel = TRUE)

In the graph above, the position of the column profile points is unchanged relative to that in the conventional biplot. However, the distances of the row points from the plot origin are related to their contributions to the two-dimensional factor map.

The closer an arrow is (in terms of angular distance) to an axis, the greater is the contribution of the row category on that axis relative to the other axis. If the arrow is halfway between the two, its row category contributes to the two axes to the same extent.

* Dimension 2 is mainly defined by the row category Holidays.
* The row category Driving contributes to the two axes to the same extent.

–#4.3.7 Dimension description

To easily identify row and column points that are the most associated with the principal dimensions, you can use the function dimdesc() [in FactoMineR]. Row/column variables are sorted by their coordinates in the dimdesc() output.

#Dimension description
res.desc <- dimdesc(res.ca, axes = c(1,2))

Description of dimension 1:

#Description of dimension 1 by row points
head(res.desc[[1]]$row, 4)
##                 coord
## Laundry    -0.9918368
## Main_meal  -0.8755855
## Dinner     -0.6925740
## Breakfeast -0.5086002
#Description of dimension 1 by column points
head(res.desc[[1]]$col, 4)
##                  coord
## Wife        -0.8376215
## Alternating -0.0621846
## Jointly      0.1494261
## Husband      1.1609185

Description of dimension 2:

#Description of dimension 2 by row points
res.desc[[2]]$row
##                 coord
## Holidays   -1.4350066
## Finances   -0.6178684
## Insurance  -0.4737832
## Dishes     -0.4419662
## Tidying    -0.4343444
## Shopping   -0.4033171
## Official    0.2536132
## Dinner      0.3081043
## Breakfeast  0.4528038
## Main_meal   0.4901092
## Laundry     0.4953220
## Driving     0.6534143
## Repairs     0.8642647
#Description of dimension 2 by column points
res.desc[[2]]$col
##                  coord
## Jointly     -1.0265791
## Alternating  0.2915938
## Wife         0.3652207
## Husband      0.6019199

—#4.4 Supplementary elements

–#4.4.1 Data format

We’ll use the data set children [in FactoMineR package]. It contains 18 rows and 8 columns:

data(children)
#head(children)

The data used here is a contingency table describing the answers given by different categories of people to the following question: what are the reasons that can make a woman or a couple hesitate to have children?

Only some of the rows and columns will be used to perform the correspondence analysis (CA). The coordinates of the remaining (supplementary) rows/columns on the factor map will be predicted after the CA.

In CA terminology, our data contains:

* Active rows (rows 1:14): rows that are used for the correspondence analysis.
* Supplementary rows (row.sup, rows 15:18): the coordinates of these rows will be predicted using the CA information and parameters obtained with the active rows/columns.
* Active columns (columns 1:5): columns that are used for the correspondence analysis.
* Supplementary columns (col.sup, columns 6:8): as for supplementary rows, the coordinates of these columns will be predicted also.

–#4.4.2 Specification in CA

As mentioned above, supplementary rows and columns are not used for the definition of the principal dimensions. Their coordinates are predicted using only the information provided by the performed CA on active rows/columns.

To specify supplementary rows/columns, the function CA() [in FactoMineR] can be used as follow:

CA(X, ncp = 5, row.sup = NULL, col.sup = NULL, graph = TRUE)

* X : a data frame (contingency table)
* row.sup : a numeric vector specifying the indexes of the supplementary rows
* col.sup : a numeric vector specifying the indexes of the supplementary columns
* ncp : number of dimensions kept in the final results.
* graph : a logical value. If TRUE a graph is displayed.

For example, type this:


``` r
res.ca <- CA(children, row.sup = 15:18, col.sup = 6:8,
              graph = FALSE)

–#4.4.3 Biplot of rows and columns

fviz_ca_biplot(res.ca, repel = TRUE)

* Supplementary rows are in darkblue
* Columns are in red
* Supplementary columns are in darkred

It’s also possible to hide supplementary rows and columns using the argument invisible:

fviz_ca_biplot(res.ca, repel = TRUE,
               invisible = c("row.sup", "col.sup"))

–#4.4.4 Supplementary rows

Predicted results (coordinates and cos2) for the supplementary rows:

res.ca$row.sup
## $coord
##                  Dim 1     Dim 2      Dim 3      Dim 4
## comfort      0.2096705 0.7031677 0.07111168  0.3071354
## disagreement 0.1462777 0.1190106 0.17108916 -0.3132169
## world        0.5233045 0.1429707 0.08399269 -0.1063597
## to_live      0.3083067 0.5020193 0.52093397  0.2557357
## 
## $cos2
##                   Dim 1      Dim 2       Dim 3      Dim 4
## comfort      0.06892759 0.77524032 0.007928672 0.14790342
## disagreement 0.13132177 0.08692632 0.179649183 0.60210272
## world        0.87587685 0.06537746 0.022564054 0.03618163
## to_live      0.13899699 0.36853645 0.396830367 0.09563620

Plot of active and supplementary row points:

fviz_ca_row(res.ca, repel = TRUE)

Supplementary rows are shown in darkblue color.

-#4.4.5 Supplementary columns

Predicted results (coordinates and cos2) for the supplementary columns:

res.ca$col.sup
## $coord
##                  Dim 1       Dim 2       Dim 3       Dim 4
## thirty      0.10541339 -0.05969594 -0.10322613  0.06977996
## fifty      -0.01706444  0.04907657 -0.01568923 -0.01306117
## more_fifty -0.17706810 -0.04813788  0.10077299 -0.08517528
## 
## $cos2
##                Dim 1      Dim 2       Dim 3       Dim 4
## thirty     0.1375601 0.04411543 0.131910759 0.060278490
## fifty      0.0108695 0.08990298 0.009188167 0.006367804
## more_fifty 0.2860989 0.02114509 0.092666735 0.066200714

Plot of active and supplementary column points:

fviz_ca_col(res.ca, repel = TRUE)

Supplementary columns are shown in darkred.

—#4.5 Filtering results

If you have many row/column variables, it's possible to visualize only some of them using the arguments select.row and select.col.

select.col, select.row: a selection of columns/rows to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:

* cos2: if cos2 is in [0, 1], ex: 0.6, then columns/rows with a cos2 > 0.6 are drawn

#Visualize rows with cos2 >= 0.8
fviz_ca_row(res.ca, select.row = list(cos2 = 0.8))

#Top 5 active rows and 5 suppl. rows with the highest cos2
fviz_ca_row(res.ca, select.row = list(cos2 = 5))

#Select by names
name <- list(name = c("employment", "fear", "future"))
fviz_ca_row(res.ca, select.row = name)

#Top 5 contributing rows and columns
fviz_ca_biplot(res.ca, select.row = list(contrib = 5),
               select.col = list(contrib = 5)) +
  theme_minimal()

—#4.6 Outliers

If one or more “outliers” are present in the contingency table, they can dominate the interpretation of the axes (Bendixen, 2003).

Outliers are points that have high absolute coordinate values and high contributions. They are represented, on the graph, very far from the centroid. In this case, the remaining row/column points tend to be tightly clustered in the graph, which makes them difficult to interpret.

In the CA output, the coordinates of row/column points represent the number of standard deviations the row/column is away from the barycentre (Bendixen, 2003).

According to Bendixen (2003):

Outliers are points that are at least one standard deviation away from the barycentre. They also contribute significantly to the interpretation of one pole of an axis.

There are no apparent outliers in our data. If there were outliers in the data, they should be suppressed or treated as supplementary points when re-running the correspondence analysis.
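As a rough screen, the two criteria above can be combined programmatically. The code below is a minimal, hypothetical sketch (assuming the res.ca object computed in section 4.4.2): it flags active row points that lie more than one standard deviation from the barycentre on dimension 1 and whose contribution exceeds the uniform average.

``` r
# Flag active row points far from the barycentre (|coord| > 1 on Dim 1)
# that also contribute more than the average expected under uniformity.
row <- get_ca_row(res.ca)
expected <- 100 / nrow(row$coord)  # average contribution (%) if uniform
suspect <- abs(row$coord[, 1]) > 1 & row$contrib[, 1] > expected
rownames(row$coord)[suspect]
```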

—#4.7 Exporting results

–#4.7.1 Export plots to PDF/PNG files

To save the different graphs into pdf or png files, we start by creating the plot of interest as an R object:

#Scree plot
scree.plot <- fviz_eig(res.ca)

#Biplot of row and column variables
biplot.ca <- fviz_ca_biplot(res.ca)

Next, the plots can be exported into a single pdf file as follows (one plot per page):

library(ggpubr)
ggexport(plotlist = list(scree.plot, biplot.ca),
          filename = "CA.pdf")
## file saved to CA.pdf

–#4.7.2 Export results to txt/csv files

An easy-to-use option is the function write.infile() [in the FactoMineR package]:

# Export into a TXT file
write.infile(res.ca, "ca.txt", sep = "\t")

# Export into a CSV file
write.infile(res.ca, "ca.csv", sep = ";")
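If you only need one table rather than the full dump, a minimal alternative sketch using base R is to export a single component of the result, for example the row coordinates (the file name here is arbitrary):

``` r
# Export only the row coordinates to their own CSV file
write.csv(res.ca$row$coord, file = "ca_row_coord.csv")
```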

—#4.8 Summary

In conclusion, we described how to perform and interpret correspondence analysis (CA). We computed CA using the CA() function [FactoMineR package]. Next, we used the factoextra R package to produce ggplot2-based visualization of the CA results.

Other functions [packages] to compute CA in R include:

  1. Using dudi.coa() [ade4]
library("ade4")
res.ca <- dudi.coa(housetasks, scannf = FALSE, nf = 5)
  2. Using ca() [ca]
library("ca")
res.ca <- ca(housetasks)
  3. Using corresp() [MASS]
library("MASS")
res.ca <- corresp(housetasks, nf = 3)
  4. Using epCA() [ExPosition]
library("ExPosition")
res.ca <- epCA(housetasks, graph = FALSE)

No matter which of the functions above you decide to use, the factoextra package can handle the output:

fviz_eig(res.ca) # Scree plot

fviz_ca_biplot(res.ca) # Biplot of rows and columns

Chapter 5

###Multiple Correspondence Analysis

—#5.1 Introduction

Multiple correspondence analysis (MCA) is an extension of the simple correspondence analysis (Chapter 4) for summarizing and visualizing a data table containing more than two categorical variables. It can also be seen as a generalization of principal component analysis when the variables to be analyzed are categorical instead of quantitative (Abdi and Williams, 2010).

MCA is generally used to analyze a data set from a survey. The goal is to identify:

* A group of individuals with a similar profile in their answers to the questions
* The associations between variable categories

Previously, we described how to compute and interpret the simple correspondence analysis (Chapter 4). In the current chapter, we demonstrate how to compute and visualize multiple correspondence analysis in R software using FactoMineR (for the analysis) and factoextra (for data visualization). Additionally, we’ll show how to reveal the most important variables that contribute the most in explaining the variations in the data set. We continue by explaining how to predict the results for supplementary individuals and variables. Finally, we’ll demonstrate how to filter MCA results in order to keep only the most contributing variables.

—#5.2 Computation

–#5.2.1 R packages

Several functions from different packages are available in the R software for computing multiple correspondence analysis. These functions/packages include:

* MCA() function [FactoMineR package]
* dudi.mca() function [ade4 package]
* epMCA() function [ExPosition package]

No matter what function you decide to use, you can easily extract and visualize the MCA results using the R functions provided in the factoextra R package.

Here, we’ll use FactoMineR (for the analysis) and factoextra (for ggplot2-based elegant visualization). To install the two packages, type this:

``` r
install.packages(c("FactoMineR", "factoextra"))
```

Load the packages:

``` r
library("FactoMineR")
library("factoextra")
```


–#5.2.2 Data format

We’ll use the demo data set poison available in the FactoMineR package:

``` r
data(poison)
head(poison[, 1:7], 3)
##   Age Time   Sick Sex   Nausea Vomiting Abdominals
## 1   9   22 Sick_y   F Nausea_y  Vomit_n     Abdo_y
## 2   5    0 Sick_n   F Nausea_n  Vomit_n     Abdo_n
## 3   6   16 Sick_y   F Nausea_n  Vomit_y     Abdo_y
```

This data set results from a survey carried out on primary school children who suffered from food poisoning. They were asked about their symptoms and about what they ate.

The data contains 55 rows (individuals) and 15 columns (variables). We’ll use only some of these individuals (children) and variables to perform the multiple correspondence analysis. The coordinates of the remaining individuals and variables on the factor map will be predicted from the MCA results.

In MCA terminology, our data contains :

* Active individuals (rows 1:55): Individuals that are used in the multiple correspondence analysis.
* Active variables (columns 5:15): Variables that are used in the MCA.
* Supplementary variables: They don’t participate in the MCA. The coordinates of these variables will be predicted.
  - Supplementary quantitative variables (quanti.sup): Columns 1 and 2, corresponding to the columns Age and Time, respectively.
  - Supplementary qualitative variables (quali.sup): Columns 3 and 4, corresponding to the columns Sick and Sex, respectively. These factor variables will be used to color individuals by groups.

Subset only active individuals and variables for multiple correspondence analysis:

poison.active <- poison[1:55, 5:15]
head(poison.active[, 1:6], 3)
##     Nausea Vomiting Abdominals   Fever   Diarrhae   Potato
## 1 Nausea_y  Vomit_n     Abdo_y Fever_y Diarrhea_y Potato_y
## 2 Nausea_n  Vomit_n     Abdo_n Fever_n Diarrhea_n Potato_y
## 3 Nausea_n  Vomit_y     Abdo_y Fever_y Diarrhea_y Potato_y

–#5.2.3 Data summary

The R base function summary() can be used to compute the frequency of variable categories. As the data table contains a large number of variables, we’ll display only the results for the first 4 variables.

Statistical summaries:

# Summary of the 4 first variables
summary(poison.active)[, 1:4]
##       Nausea      Vomiting   Abdominals     Fever   
##  Nausea_n:43   Vomit_n:33   Abdo_n:18   Fever_n:20  
##  Nausea_y:12   Vomit_y:22   Abdo_y:37   Fever_y:35

The summary() function returns the size of each variable category.

It’s also possible to plot the frequency of variable categories. The R code below plots the first 4 columns:

for (i in 1:4) {
  plot(poison.active[, i], main = colnames(poison.active)[i],
       ylab = "Count", col = "steelblue", las = 2)
}

The graphs above can be used to identify variable categories with a very low fre- quency. These types of variables can distort the analysis and should be removed.
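A minimal sketch of such a check (the threshold of 3 observations is an arbitrary, hypothetical choice): list, for each variable, the categories observed fewer times than the threshold.

``` r
# Flag rare categories (observed fewer than 3 times, arbitrary threshold)
min.count <- 3
low.freq <- lapply(poison.active, function(x) {
  tab <- table(x)
  tab[tab < min.count]
})
Filter(length, low.freq)  # keep only variables that have rare categories
```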

–#5.2.4 R code

The function MCA() [FactoMineR package] can be used. A simplified format is:

``` r
MCA(X, ncp = 5, graph = TRUE)
```

* X: a data frame with n rows (individuals) and p columns (categorical variables)
* ncp: number of dimensions kept in the final results.
* graph: a logical value. If TRUE a graph is displayed.

In the R code below, the MCA is performed only on the active individuals/variables:

``` r
res.mca <- MCA(poison.active, graph = FALSE)
```

The output of the MCA() function is a list including:

print(res.mca)
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 55 individuals, described by 11 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$var$eta2"       "coord. of variables"             
## 8  "$ind"            "results for the individuals"     
## 9  "$ind$coord"      "coord. for the individuals"      
## 10 "$ind$cos2"       "cos2 for the individuals"        
## 11 "$ind$contrib"    "contributions of the individuals"
## 12 "$call"           "intermediate results"            
## 13 "$call$marge.col" "weights of columns"              
## 14 "$call$marge.li"  "weights of rows"

The object created by the function MCA() contains a lot of information spread over several lists and matrices. These values are described in the next section.

—#5.3 Visualization and interpretation

We’ll use the factoextra R package to help in the interpretation and the visualization of the multiple correspondence analysis. No matter what function you decide to use [FactoMineR::MCA(), ade4::dudi.mca()], you can easily extract and visualize the results of multiple correspondence analysis using the R functions provided in the factoextra R package.

These factoextra functions include:

* get_eigenvalue(res.mca): Extract the eigenvalues/variances retained by each dimension (axis)
* fviz_eig(res.mca): Visualize the eigenvalues/variances
* get_mca_ind(res.mca), get_mca_var(res.mca): Extract the results for individuals and variables, respectively.
* fviz_mca_ind(res.mca), fviz_mca_var(res.mca): Visualize the results for individuals and variables, respectively.
* fviz_mca_biplot(res.mca): Make a biplot of rows and columns.

In the next sections, we’ll illustrate each of these functions.

Note that the MCA results are interpreted in the same way as the results from a simple correspondence analysis (CA). Therefore, it’s strongly recommended to read the interpretation of simple CA, which is comprehensively described in Chapter 4.

–#5.3.1 Eigenvalues / Variances

The proportion of variances retained by the different dimensions (axes) can be extracted using the function get_eigenvalue() [factoextra package] as follows:

# Load the required package
library(factoextra)

# Extract the eigenvalues from the MCA result (res.mca computed above)
eig.val <- get_eigenvalue(res.mca)

# Display the first rows of the eigenvalues
head(eig.val)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 0.33523140        33.523140                    33.52314
## Dim.2 0.12913979        12.913979                    46.43712
## Dim.3 0.10734849        10.734849                    57.17197
## Dim.4 0.09587950         9.587950                    66.75992
## Dim.5 0.07883277         7.883277                    74.64319
## Dim.6 0.07108981         7.108981                    81.75217

To visualize the percentages of inertia explained by each MCA dimension, use the function fviz_eig() or fviz_screeplot() [factoextra package]:

fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))

–#5.3.2 Biplot

The function fviz_mca_biplot() [factoextra package] is used to draw the biplot of individuals and variable categories:

fviz_mca_biplot(res.mca,
                repel = TRUE, # Avoid text overlapping (slow if many points)
                ggtheme = theme_minimal())

The plot above shows a global pattern within the data. Rows (individuals) are represented by blue points and columns (variable categories) by red triangles.

The distance between any row points or column points gives a measure of their similarity (or dissimilarity). Row points with a similar profile are close to each other on the factor map. The same holds true for column points.
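This notion of similarity can be made explicit: the Euclidean distance between individuals in the reduced space quantifies how different their answer profiles are. A minimal sketch, using the get_mca_ind() accessor introduced above:

``` r
# Pairwise distances between the first three individuals on dims 1-2:
# small distances indicate similar answer profiles.
ind.coord <- get_mca_ind(res.mca)$coord
round(as.matrix(dist(ind.coord[1:3, 1:2])), 2)
```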

–#5.3.3 Graph of variables

-#5.3.3.1 Results

The function get_mca_var() [in factoextra] is used to extract the results for variable categories. This function returns a list containing the coordinates, the cos2 and the contribution of variable categories:

var <- get_mca_var(res.mca)

var
## Multiple Correspondence Analysis Results for variables
##  ===================================================
##   Name       Description                  
## 1 "$coord"   "Coordinates for categories" 
## 2 "$cos2"    "Cos2 for categories"        
## 3 "$contrib" "contributions of categories"

* var$coord: coordinates of the variable categories, used to create the scatter plot.
* var$cos2: quality of representation of the variable categories on the factor map.
* var$contrib: contributions (in percentage) of the variable categories to the definition of the dimensions.

Note that, it’s possible to plot variable categories and to color them according to either i) their quality on the factor map (cos2) or ii) their contribution values to the definition of dimensions (contrib).

The different components can be accessed as follows:

# Coordinates
head(var$coord)
##               Dim 1       Dim 2        Dim 3       Dim 4       Dim 5
## Nausea_n  0.2673909  0.12139029 -0.265583253  0.03376130  0.07370500
## Nausea_y -0.9581506 -0.43498187  0.951673323 -0.12097801 -0.26410958
## Vomit_n   0.4790279 -0.40919465  0.084492799  0.27361142  0.05245250
## Vomit_y  -0.7185419  0.61379197 -0.126739198 -0.41041713 -0.07867876
## Abdo_n    1.3180221 -0.03574501 -0.005094243 -0.15360951 -0.06986987
## Abdo_y   -0.6411999  0.01738946  0.002478280  0.07472895  0.03399075
# Cos2: quality on the factor map
head(var$cos2)
##              Dim 1        Dim 2        Dim 3       Dim 4       Dim 5
## Nausea_n 0.2562007 0.0528025759 2.527485e-01 0.004084375 0.019466197
## Nausea_y 0.2562007 0.0528025759 2.527485e-01 0.004084375 0.019466197
## Vomit_n  0.3442016 0.2511603912 1.070855e-02 0.112294813 0.004126898
## Vomit_y  0.3442016 0.2511603912 1.070855e-02 0.112294813 0.004126898
## Abdo_n   0.8451157 0.0006215864 1.262496e-05 0.011479077 0.002374929
## Abdo_y   0.8451157 0.0006215864 1.262496e-05 0.011479077 0.002374929
# Contributions to the principal components
head(var$contrib)
##              Dim 1       Dim 2        Dim 3      Dim 4      Dim 5
## Nausea_n  1.515869  0.81100008 4.670018e+00 0.08449397 0.48977906
## Nausea_y  5.431862  2.90608363 1.673423e+01 0.30277007 1.75504164
## Vomit_n   3.733667  7.07226253 3.627455e-01 4.25893721 0.19036376
## Vomit_y   5.600500 10.60839380 5.441183e-01 6.38840581 0.28554563
## Abdo_n   15.417637  0.02943661 7.192511e-04 0.73219636 0.18424268
## Abdo_y    7.500472  0.01432051 3.499060e-04 0.35620363 0.08963157

In this section, we’ll describe how to visualize variable categories only. Next, we’ll highlight variable categories according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.

-#5.3.3.2 Correlation between variables and principal dimensions

To visualize the correlation between variables and the MCA principal dimensions, type this:

fviz_mca_var(res.mca, choice = "mca.cor",
             repel = TRUE, # Avoid text overlapping (slow)
             ggtheme = theme_minimal())

It can be seen that the variables Diarrhae, Abdominals and Fever are the most correlated with dimension 1. Similarly, the variables Courgette and Potato are the most correlated with dimension 2.

-#5.3.3.3 Coordinates of variable categories

The R code below displays the coordinates of each variable category in each dimension:

head(round(var$coord, 2), 4)
##          Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
## Nausea_n  0.27  0.12 -0.27  0.03  0.07
## Nausea_y -0.96 -0.43  0.95 -0.12 -0.26
## Vomit_n   0.48 -0.41  0.08  0.27  0.05
## Vomit_y  -0.72  0.61 -0.13 -0.41 -0.08

Use the function fviz_mca_var() [in factoextra] to visualize only variable categories:

fviz_mca_var(res.mca,
             repel = TRUE, # Avoid text overlapping (slow)
             ggtheme = theme_minimal())

It’s possible to change the color and the shape of the variable points using the arguments col.var and shape.var as follows:

fviz_mca_var(res.mca, col.var="black", shape.var = 15,
             repel = TRUE)

The plot above shows the relationships between variable categories. It can be interpreted as follows:

* Variable categories with a similar profile are grouped together.
* Negatively correlated variable categories are positioned on opposite sides of the plot origin (opposed quadrants).
* The distance between category points and the origin measures the quality of the variable category on the factor map. Category points that are away from the origin are well represented on the factor map.

-#5.3.3.4 Quality of representation of variable categories

The two dimensions 1 and 2 are sufficient to retain 46% of the total inertia (variation) contained in the data. Not all the points are equally well displayed in the two dimensions.

The quality of the representation is called the squared cosine (cos2), which measures the degree of association between variable categories and a particular axis. The cos2 of variable categories can be extracted as follows:

``` r
head(var$cos2, 4)
```

If a variable category is well represented by two dimensions, the sum of the cos2 is close to one. For some of the items, more than 2 dimensions are required to perfectly represent the data.
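A minimal sketch of this check, using the var object extracted above: sum the cos2 over the first two dimensions and sort to see the least well represented categories.

``` r
# Quality of representation on the plane (Dim 1, Dim 2) for each category
quality <- rowSums(var$cos2[, 1:2])
head(sort(quality), 4)  # the least well represented categories
```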

It’s possible to color variable categories by their cos2 values using the argument col.var = “cos2”. This produces gradient colors, which can be customized using the argument gradient.cols. For instance, gradient.cols = c(“white”, “blue”, “red”) means that:

* variable categories with low cos2 values will be colored in “white”
* variable categories with mid cos2 values will be colored in “blue”
* variable categories with high cos2 values will be colored in “red”

#Color by cos2 values: quality on the factor map
fviz_mca_var(res.mca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, # Avoid text overlapping
             ggtheme = theme_minimal())

Note that, it’s also possible to change the transparency of the variable categories according to their cos2 values using the option alpha.var = “cos2”. For example, type this:

#Change the transparency by cos2 values
fviz_mca_var(res.mca, alpha.var="cos2",
             repel = TRUE,
             ggtheme = theme_minimal())

You can visualize the cos2 of variable categories on all the dimensions using the corrplot package:

library("corrplot")
corrplot(var$cos2, is.corr=FALSE)

It’s also possible to create a bar plot of variable cos2 using the function fviz_cos2() [in factoextra]:

#Cos2 of variable categories on Dim.1 and Dim.2
fviz_cos2(res.mca, choice = "var", axes = 1:2)

Note that, variable categories Fish_n, Fish_y, Icecream_n and Icecream_y are not very well represented by the first two dimensions. This implies that the position of the corresponding points on the scatter plot should be interpreted with some caution. A higher dimensional solution is probably necessary.

-#5.3.3.5 Contribution of variable categories to the dimensions

The contribution of the variable categories (in %) to the definition of the dimensions can be extracted as follows:

``` r
head(round(var$contrib, 2), 4)
```

The variable categories with the largest values contribute the most to the definition of the dimensions. Variable categories that contribute the most to Dim.1 and Dim.2 are the most important in explaining the variability in the data set.
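When contributions are combined over two axes, a natural convention (and, as far as we can tell, the one used by fviz_contrib() below) is to weight each axis by its eigenvalue. A minimal sketch, assuming this weighting:

``` r
# Contribution to the Dim1-Dim2 plane, weighting each axis by its eigenvalue
eig <- res.mca$eig[1:2, 1]
contrib12 <- (var$contrib[, 1] * eig[1] + var$contrib[, 2] * eig[2]) / sum(eig)
head(sort(contrib12, decreasing = TRUE), 4)
```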

The function fviz_contrib() [factoextra package] can be used to draw a bar plot of variable category contributions. The R code below shows the top 15 variable categories contributing to the dimensions:

# Contributions of variable categories to dimension 1
fviz_contrib(res.mca, choice = "var", axes = 1, top = 15)

# Contributions of variable categories to dimension 2
fviz_contrib(res.mca, choice = "var", axes = 2, top = 15)

The total contributions to dimension 1 and 2 are obtained as follow:

# Total contribution to dimension 1 and 2
fviz_contrib(res.mca, choice = "var", axes = 1:2, top = 15)

The red dashed line on the graph above indicates the expected average value if the contributions were uniform. The calculation of the expected contribution value, under the null hypothesis, is detailed in the principal component analysis chapter (Chapter 3).
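For variable categories, that reference value is simply 100 divided by the number of categories. A minimal sketch:

``` r
# Expected average contribution (in %) if all K categories contributed equally
K <- nrow(var$contrib)
100 / K
```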

It can be seen that:

* the categories Abdo_n, Diarrhea_n, Fever_n and Mayo_n contribute the most to dimension 1;
* the categories Courg_n, Potato_n, Vomit_y and Icecream_n contribute the most to dimension 2.

The most important (or contributing) variable categories can be highlighted on the scatter plot as follows:

fviz_mca_var(res.mca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, # avoid text overlapping (slow)
             ggtheme = theme_minimal())

The plot above gives an idea of what pole of the dimensions the categories are actually contributing to.

It is evident that the categories Abdo_n, Diarrhea_n, Fever_n and Mayo_n have an important contribution to the positive pole of the first dimension, while the categories Fever_y and Diarrhea_y have a major contribution to the negative pole of the first dimension.

Note that, it’s also possible to control the transparency of variable categories according to their contribution values using the option alpha.var = “contrib”. For example, type this:

#Change the transparency by contrib values
fviz_mca_var(res.mca, alpha.var="contrib",
             repel = TRUE,
             ggtheme = theme_minimal())

–#5.3.4 Graph of individuals

-#5.3.4.1 Results

The function get_mca_ind() [in factoextra] is used to extract the results for individuals. This function returns a list containing the coordinates, the cos2 and the contributions of individuals:

ind <- get_mca_ind(res.mca)
ind
## Multiple Correspondence Analysis Results for individuals
##  ===================================================
##   Name       Description                       
## 1 "$coord"   "Coordinates for the individuals" 
## 2 "$cos2"    "Cos2 for the individuals"        
## 3 "$contrib" "contributions of the individuals"

The results for individuals give the same information as described for variable categories. For this reason, we’ll just display the results for individuals in this section without commenting.

To get access to the different components, use this:

#Coordinates of individuals
head(ind$coord)
##        Dim 1       Dim 2       Dim 3       Dim 4       Dim 5
## 1 -0.4525811 -0.26415072  0.17151614  0.01369348 -0.11696806
## 2  0.8361700 -0.03193457 -0.07208249 -0.08550351  0.51978710
## 3 -0.4481892  0.13538726 -0.22484048 -0.14170168 -0.05004753
## 4  0.8803694 -0.08536230 -0.02052044 -0.07275873 -0.22935022
## 5 -0.4481892  0.13538726 -0.22484048 -0.14170168 -0.05004753
## 6 -0.3594324 -0.43604390 -1.20932223  1.72464616  0.04348157
#Quality of representation
head(ind$cos2)
##        Dim 1        Dim 2        Dim 3        Dim 4        Dim 5
## 1 0.34652591 0.1180447167 0.0497683175 0.0003172275 0.0231460846
## 2 0.55589562 0.0008108236 0.0041310808 0.0058126211 0.2148103098
## 3 0.54813888 0.0500176790 0.1379484860 0.0547920948 0.0068349171
## 4 0.74773962 0.0070299584 0.0004062504 0.0051072923 0.0507479873
## 5 0.54813888 0.0500176790 0.1379484860 0.0547920948 0.0068349171
## 6 0.02485357 0.0365775483 0.2813443706 0.5722083217 0.0003637178
#Contributions
head(ind$contrib)
##      Dim 1      Dim 2        Dim 3        Dim 4      Dim 5
## 1 1.110927 0.98238297  0.498254685  0.003555817 0.31554778
## 2 3.792117 0.01435818  0.088003703  0.138637089 6.23134138
## 3 1.089470 0.25806722  0.856229950  0.380768961 0.05776914
## 4 4.203611 0.10259105  0.007132055  0.100387990 1.21319013
## 5 1.089470 0.25806722  0.856229950  0.380768961 0.05776914
## 6 0.700692 2.67693398 24.769968729 56.404214518 0.04360547

-#5.3.4.2 Plots: quality and contribution

The function fviz_mca_ind() [in factoextra] is used to visualize only individuals. Like variable categories, it’s also possible to color individuals by their cos2 values:

fviz_mca_ind(res.mca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, # Avoid text overlapping (slow if many points)
             ggtheme = theme_minimal())

The R code below creates bar plots of the individuals’ cos2 and contributions:

#Cos2 of individuals
fviz_cos2(res.mca, choice = "ind", axes = 1:2, top = 20)

#Contribution of individuals to the dimensions
fviz_contrib(res.mca, choice = "ind", axes = 1:2, top = 20)

–#5.3.5 Color individuals by groups

Note that, it’s possible to color the individuals using any of the qualitative variables in the initial data table (poison).

The R code below colors the individuals by groups using the levels of the variable Vomiting. The argument habillage is used to specify the factor variable for coloring the individuals by groups. A concentration ellipse can also be added around each group using the argument addEllipses = TRUE. If you want a confidence ellipse around the mean point of categories, use ellipse.type = “confidence”. The argument palette is used to change group colors.

fviz_mca_ind(res.mca,
             label = "none", # hide individual labels
             habillage = "Vomiting", # color by groups
             palette = c("#00AFBB", "#E7B800"),
             addEllipses = TRUE, ellipse.type = "confidence",
             ggtheme = theme_minimal())

Note that, to specify the value of the argument habillage, it’s also possible to use the index of the column as follows (habillage = 2). Additionally, you can provide an external grouping variable as follows: habillage = poison$Vomiting. For example:

#habillage = index of the column to be used as grouping variable
fviz_mca_ind(res.mca, habillage = 2, addEllipses = TRUE)

#habillage = external grouping variable
#fviz_mca_ind(res.mca, habillage = poison$Vomiting, addEllipses = TRUE)

If you want to color individuals using multiple categorical variables at the same time, use the function fviz_ellipses() [in factoextra] as follows:

fviz_ellipses(res.mca, c("Vomiting", "Fever"),
              geom = "point")

Alternatively, you can specify categorical variable indices:

fviz_ellipses(res.mca, 1:4, geom = "point")

–#5.3.6 Dimension description

The function dimdesc() [in FactoMineR] can be used to identify the most correlated variables with a given dimension:

res.desc <- dimdesc(res.mca, axes = c(1,2))

# Description of dimension 1
res.desc[[1]]
## $row
##                 coord
## Laundry    -0.9918368
## Main_meal  -0.8755855
## Dinner     -0.6925740
## Breakfeast -0.5086002
## Tidying    -0.3938084
## Dishes     -0.1889641
## Shopping   -0.1176813
## Official    0.2266324
## Holidays    0.2524863
## Finances    0.2707669
## Insurance   0.6470759
## Driving     0.7417696
## Repairs     1.5287787
## 
## $col
##                   coord
## Wife        -0.83762154
## Alternating -0.06218462
## Jointly      0.14942609
## Husband      1.16091847
# Description of dimension 2
res.desc[[2]]
## $row
##                 coord
## Holidays   -1.4350066
## Finances   -0.6178684
## Insurance  -0.4737832
## Dishes     -0.4419662
## Tidying    -0.4343444
## Shopping   -0.4033171
## Official    0.2536132
## Dinner      0.3081043
## Breakfeast  0.4528038
## Main_meal   0.4901092
## Laundry     0.4953220
## Driving     0.6534143
## Repairs     0.8642647
## 
## $col
##                  coord
## Jointly     -1.0265791
## Alternating  0.2915938
## Wife         0.3652207
## Husband      0.6019199

—#5.4 Supplementary elements

–#5.4.1 Definition and types

As described above (section 5.2.2), the data set poison contains:

* Active individuals (rows 1:55) and active variables (columns 5:15)
* Supplementary quantitative variables (columns 1:2): Age and Time
* Supplementary qualitative variables (columns 3:4): Sick and Sex

The data doesn’t contain supplementary individuals. However, for demonstration, we’ll use the individuals 53:55 as supplementary individuals.

Supplementary variables and individuals are not used for the determination of the principal dimensions. Their coordinates are predicted using only the information provided by the multiple correspondence analysis performed on the active variables/individuals.

–#5.4.2 Specification in MCA

To specify supplementary individuals and variables, the function MCA() can be used as follows:

``` r
MCA(X, ind.sup = NULL, quanti.sup = NULL, quali.sup = NULL,
    graph = TRUE, axes = c(1, 2))
```

* axes: a vector of length 2 specifying the components to be plotted.

For example, type this:

res.mca <- MCA(poison, ind.sup = 53:55,
               quanti.sup = 1:2, quali.sup = 3:4, graph = FALSE)
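Since supplementary elements do not influence the construction of the axes, a minimal sanity check (a sketch worth verifying, not an official recipe) is to re-run the MCA on the active part only and compare the eigenvalues:

``` r
# MCA restricted to the active individuals (1:52) and variables (5:15)
res.active <- MCA(poison[1:52, 5:15], graph = FALSE)
# The eigenvalues should match those of the analysis with supplements
all.equal(res.active$eig[, 1], res.mca$eig[, 1])
```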

–#5.4.3 Results

The predicted results for supplementary individuals/variables can be extracted as follows:

# Supplementary qualitative variable categories
res.mca$quali.sup
## $coord
##              Dim 1         Dim 2       Dim 3        Dim 4       Dim 5
## Sick_n  1.41809140  0.0020394048  0.13199139 -0.016036841 -0.08354663
## Sick_y -0.63026284 -0.0009064021 -0.05866284  0.007127485  0.03713184
## F      -0.03108147  0.1123143957  0.05033124 -0.055927173 -0.06832928
## M       0.03356798 -0.1212995474 -0.05435774  0.060401347  0.07379562
## 
## $cos2
##              Dim 1        Dim 2       Dim 3        Dim 4       Dim 5
## Sick_n 0.893770319 1.848521e-06 0.007742990 0.0001143023 0.003102240
## Sick_y 0.893770319 1.848521e-06 0.007742990 0.0001143023 0.003102240
## F      0.001043342 1.362369e-02 0.002735892 0.0033780765 0.005042401
## M      0.001043342 1.362369e-02 0.002735892 0.0033780765 0.005042401
## 
## $v.test
##             Dim 1        Dim 2      Dim 3       Dim 4      Dim 5
## Sick_n  6.7514655  0.009709509  0.6284047 -0.07635063 -0.3977615
## Sick_y -6.7514655 -0.009709509 -0.6284047  0.07635063  0.3977615
## F      -0.2306739  0.833551410  0.3735378 -0.41506855 -0.5071119
## M       0.2306739 -0.833551410 -0.3735378  0.41506855  0.5071119
## 
## $eta2
##            Dim 1        Dim 2       Dim 3        Dim 4       Dim 5
## Sick 0.893770319 1.848521e-06 0.007742990 0.0001143023 0.003102240
## Sex  0.001043342 1.362369e-02 0.002735892 0.0033780765 0.005042401
# Supplementary quantitative variables
res.mca$quanti.sup
## $coord
##             Dim 1       Dim 2       Dim 3       Dim 4       Dim 5
## Age   0.003934896 -0.00741340 -0.26494536  0.20015501  0.02928483
## Time -0.838158507 -0.08330586 -0.08718851 -0.08421599 -0.02316931
# Supplementary individuals
res.mca$ind.sup
## $coord
##         Dim 1     Dim 2      Dim 3      Dim 4      Dim 5
## 53  1.0835684 0.5172478  0.5794063  0.5390903  0.4553650
## 54 -0.1249473 0.1417271 -0.1765234 -0.1526587 -0.2779565
## 55 -0.4315948 0.1270468 -0.2071580 -0.1186804 -0.1891760
## 
## $cos2
##         Dim 1      Dim 2      Dim 3      Dim 4      Dim 5
## 53 0.36304957 0.08272764 0.10380536 0.08986204 0.06411692
## 54 0.03157652 0.04062716 0.06302535 0.04713607 0.15626590
## 55 0.50232519 0.04352713 0.11572730 0.03798314 0.09650827

–#5.4.4 Plots

To make a biplot of individuals and variable categories, type this:

#Biplot of individuals and variable categories
fviz_mca_biplot(res.mca, repel = TRUE,
                ggtheme = theme_minimal())

* Active individuals are in blue
* Supplementary individuals are in darkblue

If you want to highlight the correlation between variables (active & supplementary) and dimensions, use the function fviz_mca_var() with the argument choice = “mca.cor”:

fviz_mca_var(res.mca, choice = "mca.cor",
             repel = TRUE)

The R code below plots qualitative variable categories (active & supplementary variables):

fviz_mca_var(res.mca, repel = TRUE,
             ggtheme = theme_minimal())

For supplementary quantitative variables, type this:

fviz_mca_var(res.mca, choice = "quanti.sup",
             ggtheme = theme_minimal())

To visualize supplementary individuals, type this:

fviz_mca_ind(res.mca,
             label = "ind.sup", #XShow the label of ind.sup only
             ggtheme = theme_minimal ())

—#5.5 Filtering results

If you have many individuals/variable categories, it’s possible to visualize only some of them using the arguments select.ind and select.var.

select.ind, select.var: a selection of individuals/variable categories to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:

* name: a character vector containing the names of the individuals/variable categories to be drawn
* cos2: if cos2 is in [0, 1], e.g. 0.4, then individuals/variable categories with a cos2 > 0.4 are drawn; if cos2 > 1, e.g. 5, then the top 5 active individuals/variable categories and the top 5 supplementary individuals/variable categories with the highest cos2 are drawn
* contrib: an integer, e.g. 5, specifying the top 5 most contributing individuals/variable categories to be drawn

# Load the required package
library(factoextra)

# Visualize variable categories with cos2 >= 0.4
fviz_mca_var(res.mca, select.var = list(cos2 = 0.4))

# Top 10 active variable categories with the highest cos2
fviz_mca_var(res.mca, select.var = list(cos2 = 10))

# Select by names
name <- list(name = c("Fever_n", "Abdo_y", "Diarrhea_n", "Fever_y", "Vomit_y", "Vomit_n"))
fviz_mca_var(res.mca, select.var = name)

# MCA biplot with the top 5 contributing individuals and variable categories
fviz_mca_biplot(res.mca,
                select.ind = list(contrib = 5),
                select.var = list(contrib = 5),
                ggtheme = theme_minimal())

When the selection is done according to the contribution values, supplementary individuals/variable categories are not shown, because they don’t contribute to the construction of the axes.

—#5.6 Exporting results

–#5.6.1 Export plots to PDF/PNG files

Two steps:

  1. Create the plot of interest as an R object:
#Scree plot
scree.plot <- fviz_eig(res.mca)
#Biplot of row and column variables
biplot.mca <- fviz_mca_biplot(res.mca)
  2. Export the plots into a single pdf file as follows (one plot per page):
library(ggpubr)
ggexport(plotlist = list(scree.plot, biplot.mca),
         filename = "MCA.pdf")
## file saved to MCA.pdf

–#5.6.2 Export results to txt/csv files

An easy-to-use option is the function write.infile() [in the FactoMineR package].

#Export into a TXT file
write.infile(res.mca, "mca.txt", sep = "\t")

# Export into a CSV file
write.infile(res.mca, "mca.csv", sep = ";")

—#5.7 Summary

In conclusion, we described how to perform and interpret multiple correspondence analysis (MCA). We computed MCA using the MCA() function [FactoMineR package]. Next, we used the factoextra R package to produce ggplot2-based visualization of the MCA results.

Other functions [packages] to compute MCA in R include:

  1. Using dudi.acm() [ade4]
library("ade4")
res.mca <- dudi.acm(poison.active, scannf = FALSE, nf = 5)
  2. Using epMCA() [ExPosition]
library("ExPosition")
res.mca <- epMCA(poison.active, graph = FALSE, correction = "bg")

No matter which of the functions above you decide to use, the factoextra package can handle the output:

fviz_eig(res.mca) #Scree plot

fviz_mca_biplot(res.mca) # Biplot of rows and columns

Chapter 6

###Factor Analysis of Mixed Data

—#6.1 Introduction

Factor analysis of mixed data (FAMD) is a principal component method dedicated to analyzing a data set containing both quantitative and qualitative variables (Pagès, 2004). It makes it possible to analyze the similarity between individuals by taking into account mixed types of variables. Additionally, one can explore the association between all variables, both quantitative and qualitative.

Roughly speaking, the FAMD algorithm can be seen as a mix between principal component analysis (PCA) (Chapter 3) and multiple correspondence analysis (MCA) (Chapter 5). In other words, it acts as PCA for quantitative variables and as MCA for qualitative variables.

Quantitative and qualitative variables are normalized during the analysis in order to balance the influence of each set of variables.

In the current chapter, we demonstrate how to compute and visualize factor analysis of mixed data using the FactoMineR (for the analysis) and factoextra (for data visualization) R packages.

—#6.2 Computation

–#6.2.1 R packages

Install the required packages as follows:

``` r
install.packages(c("FactoMineR", "factoextra"))
```

Load the packages:

``` r
library("FactoMineR")
library("factoextra")
```

–#6.2.2 Data format

We’ll use a subset of the wine data set available in the FactoMineR package:

# Load necessary libraries
library(FactoMineR)

# Load the wine dataset from FactoMineR package
data(wine)

# Select specific columns from the wine dataset to create a subset dataframe
df <- wine[, c(1, 2, 16, 22, 29, 28, 30, 31)]

# Display the first few rows and columns of the subset dataframe
head(df[, 1:7], 4)
##           Label Soil Plante Acidity Harmony Intensity Overall.quality
## 2EL      Saumur Env1  2.000   2.107   3.143     2.857           3.393
## 1CHA     Saumur Env1  2.000   2.107   2.964     2.893           3.214
## 1FON Bourgueuil Env1  1.750   2.179   3.143     3.074           3.536
## 1VAU     Chinon Env2  2.304   3.179   2.038     2.462           2.464

To see the structure of the data, type this:

str(df)
## 'data.frame':    21 obs. of  8 variables:
##  $ Label          : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 2 3 1 2 2 1 3 1 ...
##  $ Soil           : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 3 1 1 1 2 2 3 ...
##  $ Plante         : num  2 2 1.75 2.3 1.76 ...
##  $ Acidity        : num  2.11 2.11 2.18 3.18 2.57 ...
##  $ Harmony        : num  3.14 2.96 3.14 2.04 3.64 ...
##  $ Intensity      : num  2.86 2.89 3.07 2.46 3.64 ...
##  $ Overall.quality: num  3.39 3.21 3.54 2.46 3.74 ...
##  $ Typical        : num  3.25 3.04 3.18 2.25 3.44 ...

The data contains 21 rows (wines, individuals) and 8 columns (variables):

* The first two columns are factors (categorical variables): Label (Saumur, Bourgueuil, Chinon) and Soil (4 categories).
* The next six columns are continuous sensory variables (Plante, Acidity, Harmony, Intensity, Overall.quality, Typical).

The goal of this study is to analyze the characteristics of the wines.

–#6.2.3 R code

The function FAMD() [FactoMineR package] can be used to compute FAMD. A simplified format is:

``` r
FAMD(base, ncp = 5, sup.var = NULL, ind.sup = NULL, graph = TRUE)
```

* base: a data frame with n rows (individuals) and p columns (variables).
* ncp: the number of dimensions kept in the final results.
* sup.var: a vector indicating the indexes of the supplementary variables.
* ind.sup: a vector indicating the indexes of the supplementary individuals.
* graph: a logical value. If TRUE a graph is displayed.

To compute FAMD, type this:

library(FactoMineR)
res.famd <- FAMD(df, graph = FALSE)

The output of the FAMD() function is a list including:

print(res.famd)
## *The results are available in the following objects:
## 
##   name          description                             
## 1 "$eig"        "eigenvalues and inertia"               
## 2 "$var"        "Results for the variables"             
## 3 "$ind"        "results for the individuals"           
## 4 "$quali.var"  "Results for the qualitative variables" 
## 5 "$quanti.var" "Results for the quantitative variables"

—#6.3 Visualization and interpretation

We’ll use the following factoextra functions:

* get_eigenvalue(res.famd): Extract the eigenvalues/variances retained by each dimension (axis).
* fviz_eig(res.famd): Visualize the eigenvalues/variances.
* get_famd_ind(res.famd): Extract the results for individuals.
* fviz_famd_ind(res.famd), fviz_famd_var(res.famd): Visualize the results for individuals and variables, respectively.

In the next sections, we’ll illustrate each of these functions.

To help in the interpretation of FAMD, we highly recommend reading the interpretation of principal component analysis (Chapter 3) and multiple correspondence analysis (Chapter 5). Many of the graphs presented here have already been described in our previous chapters.

–#6.3.1 Eigenvalues / Variances

The proportion of variances retained by the different dimensions (axes) can be extracted using the function get_eigenvalue() [factoextra package] as follows:

library("factoextra")
eig.val <- get_eigenvalue(res.famd)
head(eig.val)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  4.8315174        43.922886                    43.92289
## Dim.2  1.8568797        16.880724                    60.80361
## Dim.3  1.5824794        14.386176                    75.18979
## Dim.4  1.1491200        10.446546                    85.63633
## Dim.5  0.6518053         5.925503                    91.56183

The function fviz_eig() or fviz_screeplot() [factoextra package] can be used to draw the scree plot (the percentages of inertia explained by each FAMD dimension):

fviz_screeplot(res.famd)

–#6.3.2 Graph of variables

-#6.3.2.1 All variables

The function get_famd_var() [in factoextra] is used to extract the results for variables. By default, this function returns a list containing the coordinates, the cos2 and the contribution of all variables:

var <- get_famd_var(res.famd)
var
## FAMD results for variables 
##  ===================================================
##   Name       Description                      
## 1 "$coord"   "Coordinates"                    
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"

The different components can be accessed as follows:

#Coordinates of variables
head(var$coord)
##                     Dim.1       Dim.2       Dim.3       Dim.4        Dim.5
## Plante          0.7344160 0.060551966 0.105902048 0.004011299 0.0010340559
## Acidity         0.1732738 0.491118153 0.126394029 0.115376784 0.0045862935
## Harmony         0.8943968 0.023628146 0.040124469 0.003653813 0.0086624633
## Intensity       0.6991811 0.134639254 0.065382234 0.023214984 0.0064730431
## Overall.quality 0.9115699 0.005246728 0.009336677 0.005445276 0.0007961880
## Typical         0.7808611 0.027094327 0.001549791 0.083446627 0.0005912942
#Cos2: quality of representation on the factor map
head(var$cos2)
##                      Dim.1        Dim.2        Dim.3        Dim.4        Dim.5
## Plante          0.53936692 3.666541e-03 1.121524e-02 1.609052e-05 1.069272e-06
## Acidity         0.03002381 2.411970e-01 1.597545e-02 1.331180e-02 2.103409e-05
## Harmony         0.79994566 5.582893e-04 1.609973e-03 1.335035e-05 7.503827e-05
## Intensity       0.48885427 1.812773e-02 4.274836e-03 5.389355e-04 4.190029e-05
## Overall.quality 0.83095973 2.752815e-05 8.717353e-05 2.965103e-05 6.339153e-07
## Typical         0.60974400 7.341026e-04 2.401853e-06 6.963340e-03 3.496288e-07
#Contributions to the dimensions
head(var$contrib)

The following figure shows the correlation between variables (both quantitative and qualitative) and the principal dimensions, as well as the contribution of variables to dimensions 1 and 2. The following functions [in the factoextra package] are used:

* fviz_famd_var() to plot both quantitative and qualitative variables
* fviz_contrib() to visualize the contribution of variables to the principal dimensions

# Plot of variables
fviz_famd_var(res.famd, repel = TRUE)

# Contribution to the first dimension
fviz_contrib(res.famd, "var", axes = 1)

# Contribution to the second dimension
fviz_contrib(res.famd, "var", axes = 2)

The red dashed line on the graph above indicates the expected average value if the contributions were uniform. Read more in the PCA chapter (Chapter 3).

From the plots above, it can be seen that:

* variables that contribute the most to the first dimension are: Overall.quality and Harmony;
* variables that contribute the most to the second dimension are: Soil and Acidity.

-#6.3.2.2 Quantitative variables

To extract the results for quantitative variables, type this:

quanti.var <- get_famd_var(res.famd, "quanti.var")
quanti.var
## FAMD results for quantitative variables 
##  ===================================================
##   Name       Description                      
## 1 "$coord"   "Coordinates"                    
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"

In this section, we’ll describe how to visualize quantitative variables. Additionally, we’ll show how to highlight variables according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.

The R code below plots quantitative variables. We use repel = TRUE to avoid text overlapping.

fviz_famd_var(res.famd, "quanti.var", repel = TRUE,
              col.var = "black")

Briefly, the graph of variables (correlation circle) shows the relationship between variables, the quality of the representation of variables, as well as the correlation between variables and the dimensions. Read more at PCA (Chapter 3), MCA (Chapter 5) and MFA (Chapter 7).

The most contributing quantitative variables can be highlighted on the scatter plot using the argument col.var = “contrib”. This produces gradient colors, which can be customized using the argument gradient.cols.

fviz_famd_var(res.famd, "quanti.var", col.var = "contrib",
              gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
              repel = TRUE)

Similarly, you can highlight quantitative variables by their cos2 values, representing the quality of representation on the factor map. If a variable is well represented by two dimensions, the sum of the cos2 is close to one. For some of the items, more than 2 dimensions might be required to perfectly represent the data.
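A minimal sketch of this check for the quantitative variables extracted above:

``` r
# Sum of cos2 over the first two dimensions; values near 1 indicate a
# good representation of the variable in the Dim1-Dim2 plane.
rowSums(quanti.var$cos2[, 1:2])
```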

#Color by cos2 values: quality on the factor map

fviz_famd_var(res.famd, "quanti.var", col.var = "cos2",
              gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
              repel = TRUE)

-#6.3.2.3 Graph of qualitative variables

Like quantitative variables, the results for qualitative variables can be extracted as follows:

quali.var <- get_famd_var(res.famd, "quali.var")
quali.var
## FAMD results for qualitative variable categories 
##  ===================================================
##   Name       Description                      
## 1 "$coord"   "Coordinates"                    
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"

To visualize qualitative variables, type this:

fviz_famd_var(res.famd, "quali.var", col.var = "contrib",
              gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

The plot above shows the categories of the categorical variables.

–#6.3.3 Graph of individuals

To get the results for individuals, type this:

ind <- get_famd_ind(res.famd)
ind
## FAMD results for individuals 
##  ===================================================
##   Name       Description                      
## 1 "$coord"   "Coordinates"                    
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"

To plot individuals, use the function fviz_famd_ind() [in factoextra]. By default, individuals are colored in blue. However, like variables, it’s also possible to color individuals by their cos2 and contribution values:

fviz_famd_ind(res.famd, col.ind = "cos2",
                gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                repel = TRUE)

In the plot above, the qualitative variable categories are shown in black. Env1, Env2, etc. are the categories of the soil. Saumur, Bourgueuil and Chinon are the categories of the wine Label. If you don’t want to show them on the plot, use the argument invisible = “quali.var”.

Individuals with similar profiles are close to each other on the factor map. For the interpretation, read more at Chapter 5 (MCA) and Chapter 7 (MFA).

Note that, it’s possible to color the individuals using any of the qualitative variables in the initial data table. To do this, the argument habillage is used in the fviz_famd_ind() function. For example, if you want to color the wines according to the qualitative variable “Label”, type this:

fviz_famd_ind(res.famd,
              habillage = "Label", # color by groups
              palette = c("#00AFBB", "#E7B800", "#FC4E07"),
              addEllipses = TRUE, ellipse.type = "confidence",
              repel = TRUE) # Avoid text overlapping

If you want to color individuals using multiple categorical variables at the same time, use the function fviz_ellipses() [in factoextra] as follow:

fviz_ellipses(res.famd, c("Label", "Soil"), repel = TRUE)

Alternatively, you can specify categorical variable indices:

fviz_ellipses(res.famd, 1:2, geom = "point")

—#6.4 Summary

The factor analysis of mixed data (FAMD) makes it possible to analyze a data set in which individuals are described by both qualitative and quantitative variables. In this article, we described how to perform and interpret FAMD using the FactoMineR and factoextra R packages.

Chapter 7

###Multiple Factor Analysis

—#7.1 Introduction

Multiple factor analysis (MFA) (Pagès, 2002) is a multivariate data analysis method for summarizing and visualizing a complex data table in which individuals are described by several sets of variables (quantitative and/or qualitative) structured into groups. It takes into account the contribution of all active groups of variables to define the distance between individuals.

The number of variables in each group may differ and the nature of the variables (qualitative or quantitative) can vary from one group to the other, but the variables should be of the same nature in a given group (Abdi and Williams, 2010).

MFA may be considered as a general factor analysis; roughly, its core is a global analysis in which each group of variables is weighted, as described below.

This global analysis, where multiple sets of variables are simultaneously considered, requires balancing the influences of each set of variables. Therefore, in MFA, the variables are weighted during the analysis. Variables in the same group are normalized using the same weighting value, which can vary from one group to another. Technically, MFA assigns to each variable of group j a weight equal to the inverse of the first eigenvalue of the analysis (PCA or MCA according to the type of variable) of group j.
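These group weights can be inspected directly. A minimal sketch, assuming the res.mfa object computed in section 7.2.3 below: the separate analysis of each group is stored in res.mfa$separate.analyses, and MFA weights each variable of group j by the inverse of that group's first eigenvalue.

``` r
# First eigenvalue of each group's separate PCA/MCA; the MFA weight of a
# variable in group j is 1 / (this value for group j)
sapply(res.mfa$separate.analyses, function(res) res$eig[1, 1])
```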

Multiple factor analysis can be used in a variety of fields (Pagés, 2002), where the variables are organized into groups:

  1. Survey analysis, where an individual is a person; a variable is a question. Questions are organized by themes (groups of questions).

  2. Sensory analysis, where an individual is a food product. A first set of variables in- cludes sensory variables (sweetness, bitterness, etc.); a second one includes chemical variables (pH, glucose rate, etc.).

  3. Ecology, where an individual is an observation place. A first set of variables de- scribes soil characteristics ; a second one describes flora.

  4. Time series, where several individuals are observed at different dates. In this situation, there are commonly two ways of defining groups of variables:

* Generally, variables observed at the same time (date) are gathered together.
* When variables are the same from one date to the others, each set can gather the different dates for one variable.

In the current chapter, we show how to compute and visualize multiple factor analysis in R software using FactoMineR (for the analysis) and factoextra (for data visualization). Additionally, we’ll show how to reveal the most important variables that contribute the most in explaining the variations in the data set.

—#7.2 Computation

–#7.2.1 R packages

Install FactoMineR and factoextra as follows:

``` r
install.packages(c("FactoMineR", "factoextra"))
```

Load the packages:

``` r
library("FactoMineR")
library("factoextra")
```

–#7.2.2 Data format

We’ll use the demo data set wine available in the FactoMineR package. This data set is about a sensory evaluation of wines by different judges.

library("FactoMineR")

data(wine)

colnames(wine)
##  [1] "Label"                         "Soil"                         
##  [3] "Odor.Intensity.before.shaking" "Aroma.quality.before.shaking" 
##  [5] "Fruity.before.shaking"         "Flower.before.shaking"        
##  [7] "Spice.before.shaking"          "Visual.intensity"             
##  [9] "Nuance"                        "Surface.feeling"              
## [11] "Odor.Intensity"                "Quality.of.odour"             
## [13] "Fruity"                        "Flower"                       
## [15] "Spice"                         "Plante"                       
## [17] "Phenolic"                      "Aroma.intensity"              
## [19] "Aroma.persistency"             "Aroma.quality"                
## [21] "Attack.intensity"              "Acidity"                      
## [23] "Astringency"                   "Alcohol"                      
## [25] "Balance"                       "Smooth"                       
## [27] "Bitterness"                    "Intensity"                    
## [29] "Harmony"                       "Overall.quality"              
## [31] "Typical"

The data contains 21 rows (wines, individuals) and 31 columns (variables):

* The first 2 columns are categorical variables: Label and Soil.
* The next 29 columns are continuous sensory variables. For each wine, the value is the mean score for all the judges.

The goal of this study is to analyze the characteristics of the wines.

The variables are organized in groups as follows:

  1. First group - A group of categorical variables specifying the origin of the wines, including the variables Label and Soil, corresponding to the first 2 columns in the data table. In FactoMineR terminology, the argument group = 2 is used to define the first 2 columns as a group.

  2. Second group - A group of continuous variables describing the odor of the wines before shaking, including the variables Odor.Intensity.before.shaking, Aroma.quality.before.shaking, Fruity.before.shaking, Flower.before.shaking and Spice.before.shaking. These variables correspond to the next 5 columns after the first group. FactoMineR terminology: group = 5.

  3. Third group - A group of continuous variables quantifying the visual inspection of the wines, including the variables Visual.intensity, Nuance and Surface.feeling. These variables correspond to the next 3 columns after the second group. FactoMineR terminology: group = 3.

  4. Fourth group - A group of continuous variables concerning the odor of the wines after shaking, including the variables Odor.Intensity, Quality.of.odour, Fruity, Flower, Spice, Plante, Phenolic, Aroma.intensity, Aroma.persistency and Aroma.quality. These variables correspond to the next 10 columns after the third group. FactoMineR terminology: group = 10.

  5. Fifth group - A group of continuous variables evaluating the taste of the wines, including the variables Attack.intensity, Acidity, Astringency, Alcohol, Balance, Smooth, Bitterness, Intensity and Harmony. These variables correspond to the next 9 columns after the fourth group. FactoMineR terminology: group = 9.

  6. Sixth group - A group of continuous variables concerning the overall judgement of the wines, including the variables Overall.quality and Typical. These variables correspond to the next 2 columns after the fifth group. FactoMineR terminology: group = 2.

In summary:

* We have 6 groups of variables, which can be specified to FactoMineR as follows: group = c(2, 5, 3, 10, 9, 2).
* These groups can be named as follows: name.group = c("origin", "odor", "visual", "odor.after.shaking", "taste", "overall").

–#7.2.3 R code

The function MFA() [FactoMineR package] can be used. A simplified format is:

``` r
MFA(base, group, type = rep("s", length(group)), ind.sup = NULL,
    name.group = NULL, num.group.sup = NULL, graph = TRUE)
```

* base: a data frame with n rows (individuals) and p columns (variables).
* group: a vector with the number of variables in each group.
* type: the type of variables in each group. By default, all variables are quantitative and scaled to unit variance. Allowed values include:
  - "c" or "s" for quantitative variables. If "s", the variables are scaled to unit variance.
  - "n" for categorical variables.
  - "f" for frequencies (from a contingency table).
* ind.sup: a vector indicating the indexes of the supplementary individuals.
* name.group: a vector containing the names of the groups (by default, NULL and the groups are named group.1, group.2 and so on).
* num.group.sup: a vector indicating the indexes of the supplementary groups.
* graph: a logical value. If TRUE a graph is displayed.

The R code below performs the MFA on the wines data using the groups: odor, visual, odor after shaking and taste. These groups are named active groups. The remaining group of variables - origin (the first group) and overall judgement (the sixth group) - are named supplementary groups; num.group.sup = c(1, 6):

# Load the required package
library(FactoMineR)

# Load the wine data set
data(wine)

# Perform the multiple factor analysis (MFA)
res.mfa <- MFA(wine,
               group = c(2, 5, 3, 10, 9, 2),
               type = c("n", "s", "s", "s", "s", "s"),
               name.group = c("origin", "odor", "visual",
                              "odor.after.shaking", "taste", "overall"),
               num.group.sup = c(1, 6),
               graph = FALSE)

The output of the MFA() function is a list including:

print(res.mfa)
## **Results of the Multiple Factor Analysis (MFA)**
## The analysis was performed on 21 individuals, described by 31 variables
## *Results are available in the following objects :
## 
##    name                 description                                           
## 1  "$eig"               "eigenvalues"                                         
## 2  "$separate.analyses" "separate analyses for each group of variables"       
## 3  "$group"             "results for all the groups"                          
## 4  "$partial.axes"      "results for the partial axes"                        
## 5  "$inertia.ratio"     "inertia ratio"                                       
## 6  "$ind"               "results for the individuals"                         
## 7  "$quanti.var"        "results for the quantitative variables"              
## 8  "$quanti.var.sup"    "results for the quantitative supplementary variables"
## 9  "$quali.var.sup"     "results for the categorical supplementary variables" 
## 10 "$summary.quanti"    "summary for the quantitative variables"              
## 11 "$summary.quali"     "summary for the categorical variables"               
## 12 "$global.pca"        "results for the global PCA"

—#7.3 Visualization and interpretation

We’ll use the factoextra R package to help in the interpretation and the visualization of the multiple factor analysis.

The functions below [in factoextra package] will be used:

• get_eigenvalue(res.mfa): Extract the eigenvalues/variances retained by each dimension (axis).

• fviz_eig(res.mfa): Visualize the eigenvalues/variances.

• get_mfa_ind(res.mfa): Extract the results for individuals.

In the next sections, we’ll illustrate each of these functions.

To help in the interpretation of MFA, we highly recommend reading the interpretation of principal component analysis (Chapter 3), simple (Chapter 4) and multiple correspondence analysis (Chapter 5). Many of the graphs presented here have already been described in previous chapters.

–#7.3.1 Eigenvalues / Variances

The proportion of variance retained by the different dimensions (axes) can be extracted using the function get_eigenvalue() [factoextra package] as follows:

library("factoextra")
eig.val <- get_eigenvalue(res.mfa)
head(eig.val)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.4619504        49.378382                    49.37838
## Dim.2  1.3667683        19.494446                    68.87283
## Dim.3  0.6154291         8.777969                    77.65080
## Dim.4  0.3721997         5.308747                    82.95954
## Dim.5  0.2703825         3.856511                    86.81605
## Dim.6  0.2024033         2.886912                    89.70297
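As a complement to the table above, you can compute how many dimensions are needed to retain a given proportion of the total variance (a small sketch of our own; the 80% threshold is an arbitrary illustrative choice):

# Number of dimensions needed to retain at least 80% of the total variance
which(eig.val[, "cumulative.variance.percent"] >= 80)[1]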

The function fviz_eig() or fviz_screeplot() [factoextra package] can be used to draw the scree plot:

fviz_screeplot(res.mfa)

–#7.3.2 Graph of variables

-#7.3.2.1 Groups of variables

The function get_mfa_var() [in factoextra] is used to extract the results for groups of variables. This function returns a list containing the coordinates, the cos2 and the contribution of groups, as well as the correlation between groups and principal dimensions:

group <- get_mfa_var(res.mfa, "group")
group
## Multiple Factor Analysis results for variable groups 
##  ===================================================
##   Name           Description                                          
## 1 "$coord"       "Coordinates"                                        
## 2 "$cos2"        "Cos2, quality of representation"                    
## 3 "$contrib"     "Contributions"                                      
## 4 "$correlation" "Correlation between groups and principal dimensions"

The different components can be accessed as follows:

# Coordinates of groups
head(group$coord)
##                        Dim.1      Dim.2      Dim.3      Dim.4      Dim.5
## odor               0.7820738 0.61977283 0.37353451 0.17260092 0.08553276
## visual             0.8546846 0.04014481 0.01438360 0.04550736 0.02966750
## odor.after.shaking 0.9247734 0.46892047 0.18009116 0.10139051 0.11589439
## taste              0.9004187 0.23793016 0.04741982 0.05270088 0.03928784
# Cos2: quality of representation on the factor map
head(group$cos2)
##                        Dim.1       Dim.2        Dim.3       Dim.4        Dim.5
## odor               0.3799491 0.238613517 0.0866745169 0.018506155 0.0045445922
## visual             0.7284016 0.001607007 0.0002062976 0.002065011 0.0008776492
## odor.after.shaking 0.6245535 0.160582210 0.0236855692 0.007507471 0.0098089810
## taste              0.7222292 0.050429542 0.0020031144 0.002474125 0.0013749986
# Contributions to the dimensions
head(group$contrib)
##                       Dim.1     Dim.2     Dim.3    Dim.4    Dim.5
## odor               22.59055 45.345861 60.694972 46.37321 31.63399
## visual             24.68795  2.937207  2.337166 12.22660 10.97242
## odor.after.shaking 26.71250 34.308703 29.262699 27.24089 42.86313
## taste              26.00900 17.408230  7.705163 14.15930 14.53047

To plot the groups of variables, type this:

fviz_mfa_var(res.mfa, "group")

• red color = active groups of variables
• green color = supplementary groups of variables

The plot above illustrates the correlation between groups and dimensions. The coordinates of the four active groups on the first dimension are almost identical. This means that they contribute similarly to the first dimension. Concerning the second dimension, the two groups - odor and odor.after.shaking - have the highest coordinates, indicating the highest contribution to the second dimension.
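This reading can be checked numerically using the group$coord matrix extracted above (a small sketch of our own):

# Group coordinates on the second dimension, sorted in decreasing order
sort(group$coord[, "Dim.2"], decreasing = TRUE)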

To draw a bar plot of the groups' contribution to the dimensions, use the function fviz_contrib():

#Contribution to the first dimension
fviz_contrib(res.mfa, "group", axes = 1)

#Contribution to the second dimension
fviz_contrib(res.mfa, "group", axes = 2)

-#7.3.2.2 Quantitative variables

The function get_mfa_var() [in factoextra] is used to extract the results for quantitative variables. This function returns a list containing the coordinates, the cos2 and the contribution of variables:

quanti.var <- get_mfa_var(res.mfa, "quanti.var")
quanti.var
## Multiple Factor Analysis results for quantitative variables 
##  ===================================================
##   Name       Description                      
## 1 "$coord"   "Coordinates"                    
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"

The different components can be accessed as follows:

# Coordinates
head(quanti.var$coord)
##                                   Dim.1       Dim.2       Dim.3      Dim.4
## Odor.Intensity.before.shaking 0.5908036  0.66723783 -0.02326175  0.3287015
## Aroma.quality.before.shaking  0.8352510 -0.07539908 -0.35417877  0.1414425
## Fruity.before.shaking         0.7160259 -0.15069626 -0.53748761  0.2517063
## Flower.before.shaking         0.4387181 -0.40937751  0.63731284  0.4029075
## Spice.before.shaking          0.0380525  0.86501993  0.12795122 -0.1822298
## Visual.intensity              0.8811873  0.23833245  0.14099033 -0.2128871
##                                     Dim.5
## Odor.Intensity.before.shaking  0.05786231
## Aroma.quality.before.shaking   0.04992114
## Fruity.before.shaking          0.18981578
## Flower.before.shaking          0.12200773
## Spice.before.shaking           0.36741971
## Visual.intensity              -0.17676282
# Cos2: quality on the factor map
head(quanti.var$cos2)
##                                     Dim.1       Dim.2       Dim.3      Dim.4
## Odor.Intensity.before.shaking 0.349048863 0.445206325 0.000541109 0.10804466
## Aroma.quality.before.shaking  0.697644264 0.005685021 0.125442602 0.02000597
## Fruity.before.shaking         0.512693037 0.022709361 0.288892928 0.06335608
## Flower.before.shaking         0.192473567 0.167589944 0.406167661 0.16233443
## Spice.before.shaking          0.001447992 0.748259477 0.016371514 0.03320769
## Visual.intensity              0.776491025 0.056802358 0.019878273 0.04532093
##                                     Dim.5
## Odor.Intensity.before.shaking 0.003348047
## Aroma.quality.before.shaking  0.002492121
## Fruity.before.shaking         0.036030029
## Flower.before.shaking         0.014885886
## Spice.before.shaking          0.134997242
## Visual.intensity              0.031245096
# Contributions to the dimensions
head(quanti.var$contrib)
##                                    Dim.1      Dim.2       Dim.3     Dim.4
## Odor.Intensity.before.shaking 4.49733206 14.5296787  0.03921898 12.948424
## Aroma.quality.before.shaking  8.98882147  0.1855354  9.09194110  2.397581
## Fruity.before.shaking         6.60581103  0.7411389 20.93864000  7.592798
## Flower.before.shaking         2.47993227  5.4694372 29.43858302 19.454686
## Spice.before.shaking          0.01865671 24.4200703  1.18658923  3.979718
## Visual.intensity              7.91221841  1.4660681  1.13941864  4.295418
##                                    Dim.5
## Odor.Intensity.before.shaking  0.5523351
## Aroma.quality.before.shaking   0.4111309
## Fruity.before.shaking          5.9439566
## Flower.before.shaking          2.4557588
## Spice.before.shaking          22.2708049
## Visual.intensity               4.0764862

In this section, we’ll describe how to visualize quantitative variables colored by groups.

Next, we’ll highlight variables according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.

To interpret the graphs presented here, read the chapter on PCA (Chapter 3) and MCA (Chapter 5).

Correlation between quantitative variables and dimensions. The R code below plots quantitative variables colored by groups. The argument palette is used to change group colors (see ?ggpubr::ggpar for more information about palette). Supplementary quantitative variables are shown as dashed arrows in violet. We use repel = TRUE to avoid text overlapping.

fviz_mfa_var(res.mfa, "quanti.var", palette = "jco",
             col.var.sup = "violet", repel = TRUE)

To make the plot more readable, we can use geom = c("point", "text") instead of geom = c("arrow", "text"). We'll also change the legend position from "right" to "bottom", using the argument legend = "bottom":

fviz_mfa_var(res.mfa, "quanti.var", palette = "jco",
             col.var.sup = "violet", repel = TRUE,
             geom = c("point", "text"), legend = "bottom")

Briefly, the graph of variables (correlation circle) shows the relationship between variables, the quality of the representation of variables, as well as the correlation between variables and the dimensions:

• For a given dimension, the variables most correlated with the dimension are close to the dimension.

For example, the first dimension represents the positive sentiments about wines: "intensity" and "harmony". The variables most correlated with the second dimension are: i) Spice.before.shaking and Odor.intensity.before.shaking for the odor group; ii) Spice, Plant and Odor.intensity for the odor-after-shaking group; and iii) Bitterness for the taste group. This dimension essentially represents the "spiciness" and the vegetal characteristic due to olfaction.

The contribution of quantitative variables (in %) to the definition of the dimensions can be visualized using the function fviz_contrib() [factoextra package]. Variables are colored by groups. The R code below shows the top 20 variables contributing to the dimensions:

#Contributions to dimension 1
fviz_contrib(res.mfa, choice = "quanti.var", axes = 1, top = 20,
             palette = "jco")

#Contributions to dimension 2
fviz_contrib(res.mfa, choice = "quanti.var", axes = 2, top = 20,
             palette = "jco")

The red dashed line on the graph above indicates the expected average value if the contributions were uniform. The calculation of the expected contribution value, under the null hypothesis, has been detailed in the principal component analysis chapter (Chapter 3).

Variables with larger values contribute the most to the definition of the dimensions. Variables that contribute the most to Dim.1 and Dim.2 are the most important in explaining the variability in the data set.
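To see which variables dominate the first two dimensions taken together, the two axes can be combined in a single call (assuming fviz_contrib() accepts a vector of axes, as in the PCA chapter):

# Total contribution to Dim.1 and Dim.2
fviz_contrib(res.mfa, choice = "quanti.var", axes = 1:2, top = 20,
             palette = "jco")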

The most contributing quantitative variables can be highlighted on the scatter plot using the argument col.var = "contrib". This produces a color gradient, which can be customized using the argument gradient.cols.

# Load the required package
library(factoextra)

# Visualize the quantitative variables of the MFA
fviz_mfa_var(res.mfa, 
             "quanti.var", 
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             col.var.sup = "violet", 
             repel = TRUE,
             geom = c("point", "text"))

Similarly, you can highlight quantitative variables using their cos2 values, representing the quality of representation on the factor map. If a variable is well represented by two dimensions, the sum of the cos2 is close to one. For some items, more than two dimensions might be required to perfectly represent the data.
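This can be checked numerically with the quanti.var object extracted above (a small sketch of our own):

# Quality of representation of each variable on the first two dimensions
# (values close to 1 indicate well represented variables)
head(rowSums(quanti.var$cos2[, 1:2]))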

#Color by cos2 values: quality on the factor map

fviz_mfa_var(res.mfa, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             col.var.sup = "violet", repel = TRUE)

To create a bar plot of the variables' cos2, type this:

fviz_cos2(res.mfa, choice = "quanti.var", axes = 1)

–#7.3.3 Graph of individuals

To get the results for individuals, type this:

ind <- get_mfa_ind(res.mfa)
ind
## Multiple Factor Analysis results for individuals 
##  ===================================================
##   Name                      Description                      
## 1 "$coord"                  "Coordinates"                    
## 2 "$cos2"                   "Cos2, quality of representation"
## 3 "$contrib"                "Contributions"                  
## 4 "$coord.partiel"          "Partial coordinates"            
## 5 "$within.inertia"         "Within inertia"                 
## 6 "$within.partial.inertia" "Within partial inertia"

To plot individuals, use the function fviz_mfa_ind() [in factoextra]. By default, individuals are colored in blue. However, like variables, it's also possible to color individuals by their cos2 values:

fviz_mfa_ind(res.mfa, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)

In the plot above, the supplementary qualitative variable categories are shown in black. Env1, Env2 and Env3 are the categories of the soil. Saumur, Bourgueuil and Chinon are the categories of the wine Label. If you don't want to show them on the plot, use the argument invisible = "quali.var".
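For example, a variant of the previous plot with the supplementary categories hidden, using the argument mentioned above:

# Same plot as above, with the qualitative variable categories hidden
fviz_mfa_ind(res.mfa, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, invisible = "quali.var")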

Individuals with similar profiles are close to each other on the factor map. The first axis mainly opposes the wine 1DAM to the wines 1VAU and 2ING. As described in the previous section, the first dimension represents the harmony and the intensity of wines.

Thus, the wine 1DAM (positive coordinates) was evaluated as the most "intense" and "harmonious", contrary to the wines 1VAU and 2ING (negative coordinates), which are the least "intense" and "harmonious". The second axis is essentially associated with the two wines T1 and T2, characterized by a strong value of the variables Spice.before.shaking and Odor.intensity.before.shaking.

Most of the supplementary qualitative variable categories are close to the origin of the map. This result indicates that the concerned categories are not related to the first axis (wine "intensity" & "harmony") or the second axis (wines T1 and T2).

The category Env4 has high coordinates on the second axis, related to T1 and T2.

The category “Reference” is known to be related to an excellent wine-producing soil. As expected, our analysis demonstrates that the category “Reference” has high coordinates on the first axis, which is positively correlated with wines “intensity” and “harmony”.

Note that it's possible to color the individuals using any of the qualitative variables in the initial data table. To do this, the argument habillage is used in the fviz_mfa_ind() function. For example, if you want to color the wines according to the supplementary qualitative variable "Label", type this:

fviz_mfa_ind(res.mfa,
              habillage = "Label", # color by groups
              palette = c("#00AFBB", "#E7B800", "#FC4E07"),
              addEllipses = TRUE, ellipse.type = "confidence",
              repel = TRUE #Avoid text overlapping
              )

If you want to color individuals using multiple categorical variables at the same time, use the function fviz_ellipses() [in factoextra] as follows:

fviz_ellipses(res.mfa, c("Label", "Soil"), repel = TRUE)

Alternatively, you can specify categorical variable indices:

{r}# fviz_ellipses(res.mfa, 1:2, geom = "point")

–#7.3.4 Graph of partial individuals

The results for individuals obtained from the analysis performed with a single group are named partial individuals. In other words, an individual considered from the point of view of a single group is called a partial individual.

In the default fviz_mfa_ind() plot, for a given individual, the point corresponds to the mean individual, or the center of gravity of the partial points of the individual; that is, the individual viewed by all groups of variables.

For a given individual, there are as many partial points as groups of variables.
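These partial points can be inspected directly through the ind object extracted in the previous section (a small sketch of our own; coord.partiel stores one row per individual and per active group):

# Partial coordinates: one row per (individual, group) pair
head(ind$coord.partiel)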

The graph of partial individuals represents each wine viewed by each group and its barycenter. To plot the partial points of all individuals, type this:

fviz_mfa_ind(res.mfa, partial = "all")

If you want to visualize the partial points of specific wines of interest, say c("1DAM", "1VAU", "2ING"), use this:

fviz_mfa_ind(res.mfa, partial = c("1DAM", "1VAU", "2ING"))

Red color represents the wines seen by only the odor variables; violet color represents the wines seen by only the visual variables, and so on.

The wine 1DAM has been described in the previous section as particularly "intense" and "harmonious", particularly by the odor group: it has a high coordinate on the first axis from the point of view of the odor variables group compared to the point of view of the other groups.

From the odor group's point of view, 2ING was more "intense" and "harmonious" than 1VAU; but from the taste group's point of view, 1VAU was more "intense" and "harmonious" than 2ING.

–#7.3.5 Graph of partial axes

The graph of partial axes shows the relationship between the principal axes of the MFA and the ones obtained from analyzing each group using either a PCA (for groups of continuous variables) or an MCA (for qualitative variables).

fviz_mfa_axes(res.mfa)

It can be seen that the first dimension of each group is highly correlated with the MFA's first dimension. The second dimension of the MFA is essentially correlated with the second dimension of the olfactory groups.

—#7.4 Summary

The multiple factor analysis (MFA) makes it possible to analyze individuals characterized by multiple sets of variables. In this article, we described how to perform and interpret MFA using the FactoMineR and factoextra R packages.

Chapter 8

##HCPC: Hierarchical Clustering on Principal Components

—#8.1 Introduction

Clustering is one of the important data mining methods for discovering knowledge in multivariate data sets. The goal is to identify groups (i.e. clusters) of similar objects within a data set of interest. To learn more about clustering, you can read our book entitled "Practical Guide to Cluster Analysis in R" (https://goo.gl/DmJ5y5).

Briefly, the two most common clustering strategies are:

  1. Hierarchical clustering, used for identifying groups of similar observations in a data set.

  2. Partitioning clustering such as k-means algorithm, used for splitting a data set into several groups.

The HCPC (Hierarchical Clustering on Principal Components) approach allows us to combine the three standard methods used in multivariate data analyses (Husson et al., 2010):

  1. Principal component methods (PCA, CA, MCA, FAMD, MFA),
  2. Hierarchical clustering and
  3. Partitioning clustering, particularly the k-means method.

This chapter describes WHY and HOW to combine principal components and clustering methods. Finally, we demonstrate how to compute and visualize HCPC using R software.

—#8.2 Why HCPC?

Combining principal component methods and clustering methods is useful in at least three situations.

–#8.2.1 Case 1: Continuous variables

In the situation where you have a multidimensional data set containing multiple continuous variables, principal component analysis (PCA) can be used to reduce the dimension of the data into a few continuous variables containing the most important information in the data. Next, you can perform cluster analysis on the PCA results.

The PCA step can be considered as a denoising step which can lead to a more stable clustering. This might be very useful if you have a large data set with multiple variables, such as in gene expression data.

–#8.2.2 Case 2: Clustering on categorical data

In order to perform clustering analysis on categorical data, correspondence analysis (CA, for analyzing contingency tables) and multiple correspondence analysis (MCA, for analyzing multidimensional categorical variables) can be used to transform categorical variables into a set of few continuous variables (the principal components). The cluster analysis can then be applied on the (M)CA results.

In this case, the (M)CA method can be considered as a pre-processing step that makes it possible to compute clustering on categorical data.

–#8.2.3 Case 3: Clustering on mixed data

When you have mixed data containing both continuous and categorical variables, you can first perform FAMD (factor analysis of mixed data) or MFA (multiple factor analysis). Next, you can apply cluster analysis on the FAMD/MFA outputs.

—#8.3 Algorithm of the HCPC method

The algorithm of the HCPC method, as implemented in the FactoMineR package, can be summarized as follows:

  1. Compute principal component methods: PCA, (M)CA or MFA depending on the types of variables in the data set and the structure of the data set. At this step, you can choose the number of dimensions to be retained in the output by specifying the argument ncp. The default value is 5.

  2. Compute hierarchical clustering: Hierarchical clustering is performed using Ward's criterion on the selected principal components. Ward's criterion is used because, like principal component analysis, it is based on the multidimensional variance.

  3. Choose the number of clusters based on the hierarchical tree: An initial partitioning is performed by cutting the hierarchical tree.

  4. Perform k-means clustering to improve the initial partition obtained from hierarchical clustering. The final partitioning solution, obtained after consolidation with k-means, can be (slightly) different from the one obtained with the hierarchical clustering (see the sketch below).
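The consolidation idea can be sketched with base R functions (a conceptual illustration only, not FactoMineR's exact implementation; it assumes res.pca is the result of PCA(), as computed in Section 8.4.3 below, and a cut into 4 clusters):

# Step 2: Ward hierarchical clustering on the retained principal components
coords <- res.pca$ind$coord
hc <- hclust(dist(coords), method = "ward.D2")

# Step 3: initial partition obtained by cutting the tree
init <- cutree(hc, k = 4)

# Step 4: k-means consolidation, started from the centers of the initial partition
centers <- apply(coords, 2, function(x) tapply(x, init, mean))
km <- kmeans(coords, centers = centers)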

—#8.4 Computation

–#8.4.1 R packages

We'll use two R packages: i) FactoMineR for computing HCPC and ii) factoextra for visualizing the results.

To install the packages, type this: {r}# install.packages(c("FactoMineR", "factoextra"))

After the installation, load the packages as follows: {r}# library(factoextra); library(FactoMineR)

–#8.4.2 R function

The function HCPC() [in FactoMineR package] can be used to compute hierarchical clustering on principal components.

A simplified format is: {r}# HCPC(res, nb.clust = 0, min = 3, max = NULL, graph = TRUE)

• res: either the result of a factor analysis or a data frame.
• nb.clust: an integer specifying the number of clusters. Possible values are:
  – 0: the tree is cut at the level the user clicks on
  – -1: the tree is automatically cut at the suggested level
  – any positive integer: the tree is cut with nb.clust clusters
• min, max: the minimum and the maximum number of clusters to be generated, respectively.
• graph: if TRUE, graphics are displayed.
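For example, to cut the tree automatically at the suggested level (assuming res.pca, as computed in the next section): {r}# res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)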

–#8.4.3 Case of continuous variables

We start by computing the principal component analysis (PCA) again. The argument ncp = 3 is used in the function PCA() to keep only the first three principal components. Next, the HCPC is applied on the result of the PCA.

library(FactoMineR)

# Compute PCA with ncp = 3
res.pca <- PCA(USArrests, ncp = 3, graph = FALSE)

# Compute hierarchical clustering on principal components

res.hcpc <- HCPC(res.pca, graph = FALSE)

To visualize the dendrogram generated by the hierarchical clustering, we'll use the function fviz_dend() [in factoextra package]:

fviz_dend(
  res.hcpc,
  cex = 0.7,                   # Label size
  palette = "jco",             # Color palette
  rect = TRUE,                 # Add rectangles around the groups
  rect_fill = TRUE,            # Fill the rectangles
  rect_border = "jco",         # Rectangle border color
  labels_track_height = 0.8    # Increase the room for labels
)

The dendrogram suggests a 4-cluster solution.

It's possible to visualize individuals on the principal component map and to color individuals according to the cluster they belong to. The function fviz_cluster() [in factoextra] can be used to visualize the clusters of individuals.

fviz_cluster(res.hcpc,
             repel = TRUE, #Avoid label overlapping
             show.clust.cent = TRUE, # Show cluster centers
             palette = "jco",
             ggtheme = theme_minimal(),
             main = "Factor map"
)

You can also draw a three dimensional plot combining the hierarchical clustering and the factorial map using the R base function plot():

#Principal components + tree
plot(res.hcpc, choice = "3D.map")

The function HCPC() returns a list containing, among other things, data.clust (the original data with a column indicating the cluster of each observation), desc.var (the variables describing the clusters), desc.axes (the principal dimensions describing the clusters) and desc.ind (the most representative individuals of each cluster).

To display the original data with cluster assignments, type this:

head(res.hcpc$data.clust, 10)
##             Murder Assault UrbanPop Rape clust
## Alabama       13.2     236       58 21.2     3
## Alaska        10.0     263       48 44.5     4
## Arizona        8.1     294       80 31.0     4
## Arkansas       8.8     190       50 19.5     3
## California     9.0     276       91 40.6     4
## Colorado       7.9     204       78 38.7     4
## Connecticut    3.3     110       77 11.1     2
## Delaware       5.9     238       72 15.8     2
## Florida       15.4     335       80 31.9     4
## Georgia       17.4     211       60 25.8     3

In the table above, the last column contains the cluster assignments.
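The size of each cluster can be obtained from this column (a small check of our own):

# Number of observations in each cluster
table(res.hcpc$data.clust$clust)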

To display the quantitative variables that describe each cluster the most, type this:

res.hcpc$desc.var$quanti
## $`1`
##             v.test Mean in category Overall mean sd in category Overall sd
## UrbanPop -3.898420         52.07692       65.540       9.691087  14.329285
## Murder   -4.030171          3.60000        7.788       2.269870   4.311735
## Rape     -4.052061         12.17692       21.232       3.130779   9.272248
## Assault  -4.638172         78.53846      170.760      24.700095  82.500075
##               p.value
## UrbanPop 9.682222e-05
## Murder   5.573624e-05
## Rape     5.076842e-05
## Assault  3.515038e-06
## 
## $`2`
##             v.test Mean in category Overall mean sd in category Overall sd
## UrbanPop  2.793185         73.87500       65.540       8.652131  14.329285
## Murder   -2.374121          5.65625        7.788       1.594902   4.311735
##              p.value
## UrbanPop 0.005219187
## Murder   0.017590794
## 
## $`3`
##             v.test Mean in category Overall mean sd in category Overall sd
## Murder    4.357187          13.9375        7.788       2.433587   4.311735
## Assault   2.698255         243.6250      170.760      46.540137  82.500075
## UrbanPop -2.513667          53.7500       65.540       7.529110  14.329285
##               p.value
## Murder   1.317449e-05
## Assault  6.970399e-03
## UrbanPop 1.194833e-02
## 
## $`4`
##            v.test Mean in category Overall mean sd in category Overall sd
## Rape     5.352124         33.19231       21.232       6.996643   9.272248
## Assault  4.356682        257.38462      170.760      41.850537  82.500075
## UrbanPop 3.028838         76.00000       65.540      10.347798  14.329285
## Murder   2.913295         10.81538        7.788       2.001863   4.311735
##               p.value
## Rape     8.692769e-08
## Assault  1.320491e-05
## UrbanPop 2.454964e-03
## Murder   3.576369e-03

Here, the columns of main interest are: "Mean in category", "Overall mean" and "p.value".

From the output above, it can be seen that:

• the variables UrbanPop, Murder, Rape and Assault are the most significantly associated with cluster 1. For example, the mean value of the Assault variable in cluster 1 is 78.54, which is less than its overall mean (170.76) across all clusters. It can therefore be concluded that cluster 1 is characterized by a low rate of Assault compared to all clusters.

• the variables UrbanPop and Murder are the most significantly associated with cluster 2.

• ...and so on.

Similarly, to show the principal dimensions that are the most associated with the clusters, type this:

res.hcpc$desc.axes$quanti
## $`1`
##          v.test Mean in category  Overall mean sd in category Overall sd
## Dim.1 -5.175764        -1.964502 -5.828671e-16      0.6192556   1.574878
##            p.value
## Dim.1 2.269806e-07
## 
## $`2`
##         v.test Mean in category  Overall mean sd in category Overall sd
## Dim.2 3.585635        0.7428712 -4.951595e-16      0.6137936  0.9948694
##            p.value
## Dim.2 0.0003362596
## 
## $`3`
##          v.test Mean in category  Overall mean sd in category Overall sd
## Dim.1  2.058338        1.0610731 -5.828671e-16      0.5146613  1.5748783
## Dim.3  2.028887        0.3965588 -4.163336e-18      0.3714503  0.5971291
## Dim.2 -4.536594       -1.4773302 -4.951595e-16      0.5750284  0.9948694
##            p.value
## Dim.1 3.955769e-02
## Dim.3 4.246985e-02
## Dim.2 5.717010e-06
## 
## $`4`
##         v.test Mean in category  Overall mean sd in category Overall sd
## Dim.1 4.986474         1.892656 -5.828671e-16      0.6126035   1.574878
##            p.value
## Dim.1 6.149115e-07

The results above indicate that individuals in cluster 4 have high coordinates on the first axis, whereas individuals in cluster 1 have low (negative) coordinates on it. Individuals in cluster 2 have high coordinates on the second axis. Individuals who belong to the third cluster are characterized by their coordinates on axes 1, 2 and 3.

Finally, representative individuals of each cluster can be extracted as follows:

res.hcpc$desc.ind$para
## Cluster: 1
##         Idaho  South Dakota         Maine          Iowa New Hampshire 
##     0.3674381     0.4993032     0.5012072     0.5533105     0.5891145 
## ------------------------------------------------------------ 
## Cluster: 2
##         Ohio     Oklahoma Pennsylvania       Kansas      Indiana 
##    0.2796100    0.5047549    0.5088363    0.6039091    0.7100820 
## ------------------------------------------------------------ 
## Cluster: 3
##        Alabama South Carolina        Georgia      Tennessee      Louisiana 
##      0.3553460      0.5335189      0.6136865      0.8522640      0.8780872 
## ------------------------------------------------------------ 
## Cluster: 4
##   Michigan    Arizona New Mexico   Maryland      Texas 
##  0.3246254  0.4532480  0.5176322  0.9013514  0.9239792

For each cluster, the top 5 individuals closest to the cluster center are shown, together with the distance between each individual and the cluster center. For example, representative individuals for cluster 1 include Idaho, South Dakota, Maine, Iowa and New Hampshire.
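If you only need the representative individuals of a single cluster, the corresponding list element can be extracted (assuming, as the printed output above suggests, that $para is a list indexed by cluster):

# Representative individuals of cluster 1 only
res.hcpc$desc.ind$para[["1"]]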

–#8.4.4 Case of categorical variables

For categorical variables, compute CA or MCA and then apply the function HCPC() on the results, as described above.

Here, we'll use the tea data [in FactoMineR] as a demo data set: rows represent the individuals and columns represent categorical variables.

We start by performing an MCA on the individuals. We keep the first 20 axes of the MCA, which retain 87% of the information.

# Loading data
library(FactoMineR)
data(tea)

# Performing MCA

res.mca <- MCA(tea,
               ncp = 20, # Number of components kept
               quanti.sup = 19, # Quantitative supplementary variables
               quali.sup = c(20:36), # Qualitative supplementary variables
               graph=FALSE)

Next, we apply hierarchical clustering on the results of the MCA:

res.hcpc <- HCPC(res.mca, graph = FALSE, max = 3)

The results can be visualized as follows:

#Dendrogram
fviz_dend(res.hcpc, show_labels = FALSE)

# Individuals factor map
fviz_cluster(res.hcpc, geom = "point", main = "Factor map")

As mentioned above, clusters can be described by i) variables and/or categories, ii) principal axes and iii) individuals. In the example below, we display only a subset of the results.

• Description by variables and categories

#Description by variables
res.hcpc$desc.var$test.chi2
##                    p.value df
## where         8.465616e-79  4
## how           3.144675e-47  4
## price         1.862462e-28 10
## tearoom       9.624188e-19  2
## pub           8.539893e-10  2
## friends       6.137618e-08  2
## resto         3.537876e-07  2
## How           3.616532e-06  6
## Tea           1.778330e-03  4
## sex           1.789593e-03  2
## frequency     1.973274e-03  6
## work          3.052988e-03  2
## tea.time      3.679599e-03  2
## lunch         1.052478e-02  2
## dinner        2.234313e-02  2
## always        3.600913e-02  2
## sugar         3.685785e-02  2
## sophisticated 4.077297e-02  2
#Description by variable categories
res.hcpc$desc.var$category
## $`1`
##                               Cla/Mod   Mod/Cla    Global      p.value
## where=chain store           85.937500 93.750000 64.000000 2.094419e-40
## how=tea bag                 84.117647 81.250000 56.666667 1.478564e-25
## tearoom=Not.tearoom         70.661157 97.159091 80.666667 1.082077e-18
## price=p_branded             83.157895 44.886364 31.666667 1.631861e-09
## pub=Not.pub                 67.088608 90.340909 79.000000 1.249296e-08
## friends=Not.friends         76.923077 45.454545 34.666667 2.177180e-06
## resto=Not.resto             64.705882 81.250000 73.666667 4.546462e-04
## price=p_private label       90.476190 10.795455  7.000000 1.343844e-03
## tea.time=Not.tea time       67.938931 50.568182 43.666667 4.174032e-03
## How=alone                   64.102564 71.022727 65.000000 9.868387e-03
## work=Not.work               63.380282 76.704545 71.000000 1.036429e-02
## sugar=sugar                 66.206897 54.545455 48.333333 1.066744e-02
## always=Not.always           63.959391 71.590909 65.666667 1.079912e-02
## price=p_unknown             91.666667  6.250000  4.000000 1.559798e-02
## frequency=1 to 2/week       75.000000 18.750000 14.666667 1.649092e-02
## frequency=1/day             68.421053 36.931818 31.666667 1.958790e-02
## age_Q=15-24                 68.478261 35.795455 30.666667 2.179803e-02
## price=p_cheap              100.000000  3.977273  2.333333 2.274539e-02
## lunch=Not.lunch             61.328125 89.204545 85.333333 2.681490e-02
## SPC=senior                  42.857143  8.522727 11.666667 4.813710e-02
## lunch=lunch                 43.181818 10.795455 14.666667 2.681490e-02
## always=always               48.543689 28.409091 34.333333 1.079912e-02
## sugar=No.sugar              51.612903 45.454545 51.666667 1.066744e-02
## work=work                   47.126437 23.295455 29.000000 1.036429e-02
## tea.time=tea time           51.479290 49.431818 56.333333 4.174032e-03
## How=lemon                   30.303030  5.681818 11.000000 5.943089e-04
## resto=resto                 41.772152 18.750000 26.333333 4.546462e-04
## How=other                    0.000000  0.000000  3.000000 2.952904e-04
## price=p_variable            44.642857 28.409091 37.333333 1.595638e-04
## frequency=+2/day            45.669291 32.954545 42.333333 9.872288e-05
## friends=friends             48.979592 54.545455 65.333333 2.177180e-06
## how=unpackaged              19.444444  3.977273 12.000000 4.328211e-07
## pub=pub                     26.984127  9.659091 21.000000 1.249296e-08
## where=tea shop               6.666667  1.136364 10.000000 4.770573e-10
## price=p_upscale             18.867925  5.681818 17.666667 9.472539e-11
## how=tea bag+unpackaged      27.659574 14.772727 31.333333 1.927326e-13
## tearoom=tearoom              8.620690  2.840909 19.333333 1.082077e-18
## where=chain store+tea shop  11.538462  5.113636 26.000000 1.133459e-23
##                                v.test
## where=chain store           13.307475
## how=tea bag                 10.449142
## tearoom=Not.tearoom          8.826287
## price=p_branded              6.030764
## pub=Not.pub                  5.692859
## friends=Not.friends          4.736242
## resto=Not.resto              3.506146
## price=p_private label        3.206448
## tea.time=Not.tea time        2.864701
## How=alone                    2.580407
## work=Not.work                2.563432
## sugar=sugar                  2.553408
## always=Not.always            2.549133
## price=p_unknown              2.418189
## frequency=1 to 2/week        2.397866
## frequency=1/day              2.334149
## age_Q=15-24                  2.293869
## price=p_cheap                2.277684
## lunch=Not.lunch              2.214202
## SPC=senior                  -1.976156
## lunch=lunch                 -2.214202
## always=always               -2.549133
## sugar=No.sugar              -2.553408
## work=work                   -2.563432
## tea.time=tea time           -2.864701
## How=lemon                   -3.434198
## resto=resto                 -3.506146
## How=other                   -3.619397
## price=p_variable            -3.775692
## frequency=+2/day            -3.893709
## friends=friends             -4.736242
## how=unpackaged              -5.053925
## pub=pub                     -5.692859
## where=tea shop              -6.226471
## price=p_upscale             -6.475138
## how=tea bag+unpackaged      -7.353743
## tearoom=tearoom             -8.826287
## where=chain store+tea shop -10.029275
## 
## $`2`
##                                         Cla/Mod Mod/Cla   Global      p.value
## where=tea shop                        90.000000  84.375 10.00000 3.703402e-30
## how=unpackaged                        66.666667  75.000 12.00000 5.346850e-20
## price=p_upscale                       49.056604  81.250 17.66667 2.392655e-17
## Tea=green                             27.272727  28.125 11.00000 4.436713e-03
## sophisticated=sophisticated           13.488372  90.625 71.66667 8.080918e-03
## sex=M                                 16.393443  62.500 40.66667 9.511848e-03
## resto=Not.resto                       13.122172  90.625 73.66667 1.587879e-02
## dinner=dinner                         28.571429  18.750  7.00000 1.874042e-02
## escape.exoticism=Not.escape-exoticism 14.556962  71.875 52.66667 2.177458e-02
## how=tea bag+unpackaged                 5.319149  15.625 31.33333 3.876799e-02
## escape.exoticism=escape-exoticism      6.338028  28.125 47.33333 2.177458e-02
## dinner=Not.dinner                      9.318996  81.250 93.00000 1.874042e-02
## resto=resto                            3.797468   9.375 26.33333 1.587879e-02
## Tea=Earl Grey                          7.253886  43.750 64.33333 1.314753e-02
## sex=F                                  6.741573  37.500 59.33333 9.511848e-03
## sophisticated=Not.sophisticated        3.529412   9.375 28.33333 8.080918e-03
## where=chain store+tea shop             2.564103   6.250 26.00000 3.794134e-03
## price=p_variable                       3.571429  12.500 37.33333 1.349384e-03
## age_Q=15-24                            2.173913   6.250 30.66667 6.100227e-04
## price=p_branded                        2.105263   6.250 31.66667 4.024289e-04
## how=tea bag                            1.764706   9.375 56.66667 5.537403e-09
## where=chain store                      1.562500   9.375 64.00000 1.664577e-11
##                                          v.test
## where=tea shop                        11.410559
## how=unpackaged                         9.156781
## price=p_upscale                        8.472945
## Tea=green                              2.845318
## sophisticated=sophisticated            2.648670
## sex=M                                  2.593088
## resto=Not.resto                        2.411690
## dinner=dinner                          2.350655
## escape.exoticism=Not.escape-exoticism  2.294277
## how=tea bag+unpackaged                -2.066641
## escape.exoticism=escape-exoticism     -2.294277
## dinner=Not.dinner                     -2.350655
## resto=resto                           -2.411690
## Tea=Earl Grey                         -2.479748
## sex=F                                 -2.593088
## sophisticated=Not.sophisticated       -2.648670
## where=chain store+tea shop            -2.894789
## price=p_variable                      -3.205264
## age_Q=15-24                           -3.427119
## price=p_branded                       -3.538486
## how=tea bag                           -5.830161
## where=chain store                     -6.732775
## 
## $`3`
##                               Cla/Mod    Mod/Cla   Global      p.value
## where=chain store+tea shop  85.897436  72.826087 26.00000 5.730651e-34
## how=tea bag+unpackaged      67.021277  68.478261 31.33333 1.382641e-19
## tearoom=tearoom             77.586207  48.913043 19.33333 1.252051e-16
## pub=pub                     63.492063  43.478261 21.00000 1.126679e-09
## friends=friends             41.836735  89.130435 65.33333 1.429181e-09
## price=p_variable            51.785714  63.043478 37.33333 1.572243e-09
## resto=resto                 54.430380  46.739130 26.33333 2.406386e-07
## How=other                  100.000000   9.782609  3.00000 1.807938e-05
## frequency=+2/day            41.732283  57.608696 42.33333 4.237330e-04
## tea.time=tea time           38.461538  70.652174 56.33333 8.453564e-04
## work=work                   44.827586  42.391304 29.00000 9.079377e-04
## sex=F                       37.078652  71.739130 59.33333 3.494245e-03
## lunch=lunch                 50.000000  23.913043 14.66667 3.917102e-03
## How=lemon                   51.515152  18.478261 11.00000 8.747530e-03
## sugar=No.sugar              36.129032  60.869565 51.66667 3.484061e-02
## home=home                   31.615120 100.000000 97.00000 3.506563e-02
## home=Not.home                0.000000   0.000000  3.00000 3.506563e-02
## sugar=sugar                 24.827586  39.130435 48.33333 3.484061e-02
## price=p_private label        9.523810   2.173913  7.00000 2.370629e-02
## how=unpackaged              13.888889   5.434783 12.00000 1.645107e-02
## How=alone                   25.128205  53.260870 65.00000 5.300881e-03
## lunch=Not.lunch             27.343750  76.086957 85.33333 3.917102e-03
## sex=M                       21.311475  28.260870 40.66667 3.494245e-03
## Tea=green                    9.090909   3.260870 11.00000 2.545816e-03
## frequency=1 to 2/week       11.363636   5.434783 14.66667 1.604219e-03
## work=Not.work               24.882629  57.608696 71.00000 9.079377e-04
## tea.time=Not.tea time       20.610687  29.347826 43.66667 8.453564e-04
## where=tea shop               3.333333   1.086957 10.00000 1.466234e-04
## price=p_branded             14.736842  15.217391 31.66667 2.746948e-05
## resto=Not.resto             22.171946  53.260870 73.66667 2.406386e-07
## friends=Not.friends          9.615385  10.869565 34.66667 1.429181e-09
## pub=Not.pub                 21.940928  56.521739 79.00000 1.126679e-09
## how=tea bag                 14.117647  26.086957 56.66667 1.082059e-12
## tearoom=Not.tearoom         19.421488  51.086957 80.66667 1.252051e-16
## where=chain store           12.500000  26.086957 64.00000 1.711522e-19
##                               v.test
## where=chain store+tea shop 12.150084
## how=tea bag+unpackaged      9.053653
## tearoom=tearoom             8.278053
## pub=pub                     6.090345
## friends=friends             6.052158
## price=p_variable            6.036775
## resto=resto                 5.164845
## How=other                   4.287379
## frequency=+2/day            3.524844
## tea.time=tea time           3.337500
## work=work                   3.317602
## sex=F                       2.920541
## lunch=lunch                 2.884762
## How=lemon                   2.621767
## sugar=No.sugar              2.110206
## home=home                   2.107600
## home=Not.home              -2.107600
## sugar=sugar                -2.110206
## price=p_private label      -2.261856
## how=unpackaged             -2.398752
## How=alone                  -2.788157
## lunch=Not.lunch            -2.884762
## sex=M                      -2.920541
## Tea=green                  -3.017842
## frequency=1 to 2/week      -3.155139
## work=Not.work              -3.317602
## tea.time=Not.tea time      -3.337500
## where=tea shop             -3.796720
## price=p_branded            -4.193490
## resto=Not.resto            -5.164845
## friends=Not.friends        -6.052158
## pub=Not.pub                -6.090345
## how=tea bag                -7.119644
## tearoom=Not.tearoom        -8.278053
## where=chain store          -9.030332

The variables that characterize the clusters the most are "where" and "how". Each cluster is characterized by a category of the variables "where" and "how". For example, individuals who belong to the first cluster buy tea as tea bags in chain stores.

• Description by principal components

res.hcpc$desc.axes
## 
## Link between the cluster variable and the quantitative variables
## ================================================================
##              Eta2      P-value
## Dim.2  0.66509105 2.828937e-71
## Dim.1  0.63497903 1.009707e-65
## Dim.4  0.11231020 2.073924e-08
## Dim.14 0.03141943 8.732913e-03
## Dim.6  0.02358138 2.890373e-02
## 
## Description of each cluster by quantitative variables
## =====================================================
## $`1`
##           v.test Mean in category Overall mean sd in category Overall sd
## Dim.6   2.647552       0.03433626 1.608233e-17      0.2655618  0.2671712
## Dim.2  -7.796641      -0.13194656 5.612264e-17      0.1813156  0.3486355
## Dim.1 -12.409741      -0.23196088 1.299886e-17      0.2143767  0.3850642
##            p.value
## Dim.6 8.107689e-03
## Dim.2 6.357699e-15
## Dim.1 2.314001e-35
## 
## $`2`
##           v.test Mean in category  Overall mean sd in category Overall sd
## Dim.2  13.918285       0.81210870  5.612264e-17      0.2340345  0.3486355
## Dim.4   4.350620       0.20342610 -2.717589e-17      0.3700048  0.2793822
## Dim.14  2.909073       0.10749165  2.015604e-17      0.2161509  0.2207818
## Dim.13  2.341566       0.08930402 -4.333917e-18      0.1606616  0.2278809
## Dim.3   2.208179       0.11087544  1.000213e-17      0.2449710  0.3000159
## Dim.11 -2.234447      -0.08934293 -1.678345e-17      0.2066708  0.2389094
##             p.value
## Dim.2  4.905356e-44
## Dim.4  1.357531e-05
## Dim.14 3.625025e-03
## Dim.13 1.920305e-02
## Dim.3  2.723180e-02
## Dim.11 2.545367e-02
## 
## $`3`
##          v.test Mean in category  Overall mean sd in category Overall sd
## Dim.1 13.485906       0.45155993  1.299886e-17      0.2516544  0.3850642
## Dim.6 -2.221728      -0.05161581  1.608233e-17      0.2488566  0.2671712
## Dim.4 -4.725270      -0.11479621 -2.717589e-17      0.2924881  0.2793822
##            p.value
## Dim.1 1.893256e-41
## Dim.6 2.630166e-02
## Dim.4 2.298093e-06

• Description by individuals

res.hcpc$desc.ind$para
## Cluster: 1
##       285       152       166       143        71 
## 0.5884476 0.6242123 0.6242123 0.6244176 0.6478185 
## ------------------------------------------------------------ 
## Cluster: 2
##        31        95        53       182       202 
## 0.6620553 0.7442013 0.7610437 0.7948663 0.8154826 
## ------------------------------------------------------------ 
## Cluster: 3
##       172        33       233        18        67 
## 0.7380497 0.7407711 0.7503006 0.7572188 0.7701598

—#8.5 Summary

We described how to compute hierarchical clustering on principal components (HCPC). This approach is useful in several situations, including:

• When you have a large data set containing continuous variables, a principal component analysis can be used to reduce the dimension of the data before the hierarchical clustering analysis.

We used the FactoMineR package to compute the HCPC and the factoextra R package for ggplot2-based elegant data visualization.