Introduction

Clustering is one of the most important data mining methods for discovering knowledge in multivariate data sets. The goal is to identify groups of similar objects within a data set of interest.

install.packages(c("FactoMineR", "factoextra"))

We will use two R packages: FactoMineR to compute HCPC (Hierarchical Clustering on Principal Components) and factoextra to visualize the results.

library(factoextra)  # fviz_dend(), fviz_cluster()
library(FactoMineR)  # PCA(), MCA(), HCPC()
library(ggplot2)     # theme_minimal()

# PCA on USArrests, keeping only the first three principal components
res.pca <- PCA(USArrests, ncp = 3, graph = FALSE)

We start by computing the principal component analysis (PCA). In the PCA() call above, the argument ncp = 3 keeps only the first three principal components. Next, HCPC is applied to the PCA result.

res.hcpc<- HCPC(res.pca, graph = FALSE)
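
To make the idea behind the HCPC() call above more concrete, here is a minimal base-R sketch of the same two-step approach: a PCA followed by Ward hierarchical clustering on the retained components. It is not FactoMineR's implementation. Here prcomp(), hclust() and cutree() stand in for PCA() and HCPC(), the number of clusters (four) is fixed by hand, and HCPC() may additionally consolidate the partition with k-means, so the resulting groups may differ from those shown later.

```r
# Step 1: PCA on standardized variables, keeping three components
# (prcomp() is used here as a stand-in for FactoMineR::PCA()).
pca <- prcomp(USArrests, scale. = TRUE)
scores <- pca$x[, 1:3]

# Step 2: Ward hierarchical clustering on the component scores.
hc <- hclust(dist(scores), method = "ward.D2")

# Step 3: cut the tree into k groups (k = 4 is an assumption made here).
groups <- cutree(hc, k = 4)
table(groups)
```

Working on the component scores rather than on the raw variables is the point of the method: the discarded components are treated as noise, which tends to make the clustering more stable.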

To visualize the dendrogram generated by the hierarchical clustering, we use the fviz_dend() function from the factoextra package.

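fviz_dend(res.hcpc,
          cex = 0.7,                      # label size
          palette = "jco",                # color palette for the clusters
          rect = TRUE, rect_fill = TRUE,  # draw filled rectangles around the clusters
          rect_border = "jco",            # rectangle border colors
          labels_track_height = 0.8       # extra room for the labels
)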

Individuals can also be plotted on the principal component map and colored according to the cluster they belong to. The fviz_cluster() function in factoextra can be used to visualize the clusters of individuals.

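fviz_cluster(res.hcpc,
             repel = TRUE,               # avoid overlapping labels
             show.clust.cent = TRUE,     # show the cluster centers
             palette = "jco",            # color palette for the clusters
             ggtheme = theme_minimal(),  # ggplot2 theme
             main = "factor"             # plot title
             )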

You can also draw a three-dimensional plot combining the hierarchical clustering and the factor map using the generic R function plot().

plot(res.hcpc, choice = "3D.map")

The HCPC() function returns a list containing:

- data.clust: the original data with an additional column named clust that contains the partition.
- desc.var: the variables that describe the clusters.
- desc.ind: the most typical individuals of each cluster.
- desc.axes: the axes that describe the clusters.

To display the original data with the cluster assignments:

head(res.hcpc$data.clust, 10)
##             Murder Assault UrbanPop Rape clust
## Alabama       13.2     236       58 21.2     3
## Alaska        10.0     263       48 44.5     4
## Arizona        8.1     294       80 31.0     4
## Arkansas       8.8     190       50 19.5     3
## California     9.0     276       91 40.6     4
## Colorado       7.9     204       78 38.7     4
## Connecticut    3.3     110       77 11.1     2
## Delaware       5.9     238       72 15.8     2
## Florida       15.4     335       80 31.9     4
## Georgia       17.4     211       60 25.8     3

In the table above, the last column contains the cluster assignments. To display the quantitative variables that best describe each cluster, inspect desc.var$quanti. The columns of most interest are "Mean in category", "Overall mean", and "p.value".

res.hcpc$desc.var$quanti
## $`1`
##             v.test Mean in category Overall mean sd in category Overall sd
## UrbanPop -3.898420         52.07692       65.540       9.691087  14.329285
## Murder   -4.030171          3.60000        7.788       2.269870   4.311735
## Rape     -4.052061         12.17692       21.232       3.130779   9.272248
## Assault  -4.638172         78.53846      170.760      24.700095  82.500075
##               p.value
## UrbanPop 9.682222e-05
## Murder   5.573624e-05
## Rape     5.076842e-05
## Assault  3.515038e-06
## 
## $`2`
##             v.test Mean in category Overall mean sd in category Overall sd
## UrbanPop  2.793185         73.87500       65.540       8.652131  14.329285
## Murder   -2.374121          5.65625        7.788       1.594902   4.311735
##              p.value
## UrbanPop 0.005219187
## Murder   0.017590794
## 
## $`3`
##             v.test Mean in category Overall mean sd in category Overall sd
## Murder    4.357187          13.9375        7.788       2.433587   4.311735
## Assault   2.698255         243.6250      170.760      46.540137  82.500075
## UrbanPop -2.513667          53.7500       65.540       7.529110  14.329285
##               p.value
## Murder   1.317449e-05
## Assault  6.970399e-03
## UrbanPop 1.194833e-02
## 
## $`4`
##            v.test Mean in category Overall mean sd in category Overall sd
## Rape     5.352124         33.19231       21.232       6.996643   9.272248
## Assault  4.356682        257.38462      170.760      41.850537  82.500075
## UrbanPop 3.028838         76.00000       65.540      10.347798  14.329285
## Murder   2.913295         10.81538        7.788       2.001863   4.311735
##               p.value
## Rape     8.692769e-08
## Assault  1.320491e-05
## UrbanPop 2.454964e-03
## Murder   3.576369e-03
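
The v.test and p.value columns above can be reproduced approximately by hand. The sketch below assumes that the v.test compares the cluster mean with the overall mean, scaled by the population standard deviation with a finite-population correction, and that the p-value is the two-sided normal tail of that statistic. This is our reading of the printed numbers, not a quotation of FactoMineR's source, and the helper names v_test() and v_to_p() are hypothetical.

```r
# Hypothetical helper: v.test of a numeric variable x for one cluster.
# in_cluster is a logical vector flagging the members of that cluster.
v_test <- function(x, in_cluster) {
  N <- length(x)
  n_k <- sum(in_cluster)
  sigma2 <- mean((x - mean(x))^2)  # population variance (divides by N)
  (mean(x[in_cluster]) - mean(x)) / sqrt(sigma2 / n_k * (N - n_k) / (N - 1))
}

# Two-sided p-value from a v.test, assuming a standard normal reference.
v_to_p <- function(v) 2 * pnorm(-abs(v))

# Check against the Assault row of cluster 1 printed above (v.test = -4.638172).
p_assault <- v_to_p(-4.638172)
p_assault
```

If the assumption holds, p_assault should land close to the p-value printed for Assault in cluster 1. A negative v.test means the cluster mean is below the overall mean, and a positive one means it is above.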

From the output above, it can be seen that:

- The variables UrbanPop, Murder, Rape, and Assault are most significantly associated with cluster 1. For example, the mean value of Assault in cluster 1 is 78.54, which is below its overall mean across all clusters (170.76). Cluster 1 is therefore characterized by a low assault rate compared with the data set as a whole.
- The variables UrbanPop and Murder are the ones significantly associated with cluster 2.

Similarly, to show the principal dimensions that are most associated with the clusters:

res.hcpc$desc.axes$quanti
## $`1`
##          v.test Mean in category  Overall mean sd in category Overall sd
## Dim.1 -5.175764        -1.964502 -5.322132e-16      0.6192556   1.574878
##            p.value
## Dim.1 2.269806e-07
## 
## $`2`
##         v.test Mean in category  Overall mean sd in category Overall sd
## Dim.2 3.585635        0.7428712 -4.949513e-16      0.6137936  0.9948694
##            p.value
## Dim.2 0.0003362596
## 
## $`3`
##          v.test Mean in category  Overall mean sd in category Overall sd
## Dim.1  2.058338        1.0610731 -5.322132e-16      0.5146613  1.5748783
## Dim.3  2.028887        0.3965588  2.161465e-17      0.3714503  0.5971291
## Dim.2 -4.536594       -1.4773302 -4.949513e-16      0.5750284  0.9948694
##            p.value
## Dim.1 3.955769e-02
## Dim.3 4.246985e-02
## Dim.2 5.717010e-06
## 
## $`4`
##         v.test Mean in category  Overall mean sd in category Overall sd
## Dim.1 4.986474         1.892656 -5.322132e-16      0.6126035   1.574878
##            p.value
## Dim.1 6.149115e-07

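The results above indicate that individuals in clusters 1 and 4 have extreme coordinates on axis 1: strongly negative for cluster 1 (mean of -1.96) and strongly positive for cluster 4 (mean of 1.89). Individuals in cluster 2 have high coordinates on the second axis. Individuals in cluster 3 have positive coordinates on axes 1 and 3 and strongly negative coordinates on axis 2.

Finally, the most representative individuals of each cluster, listed with their distance to the cluster center, can be extracted as follows: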

res.hcpc$desc.ind$para
## Cluster: 1
##         Idaho  South Dakota         Maine          Iowa New Hampshire 
##     0.3674381     0.4993032     0.5012072     0.5533105     0.5891145 
## ------------------------------------------------------------ 
## Cluster: 2
##         Ohio     Oklahoma Pennsylvania       Kansas      Indiana 
##    0.2796100    0.5047549    0.5088363    0.6039091    0.7100820 
## ------------------------------------------------------------ 
## Cluster: 3
##        Alabama South Carolina        Georgia      Tennessee      Louisiana 
##      0.3553460      0.5335189      0.6136865      0.8522640      0.8780872 
## ------------------------------------------------------------ 
## Cluster: 4
##   Michigan    Arizona New Mexico   Maryland      Texas 
##  0.3246254  0.4532480  0.5176322  0.9013514  0.9239792
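
The notion of a representative individual shown above can be sketched in base R: compute each cluster's center in the component space, then rank the individuals of that cluster by their Euclidean distance to it. This is only an illustration under assumptions: the clusters come from prcomp() and hclust() with four groups rather than from HCPC(), so the names and distances will not match the output above.

```r
# Component scores and a four-group Ward partition (stand-ins for PCA()/HCPC()).
scores <- prcomp(USArrests, scale. = TRUE)$x[, 1:3]
groups <- cutree(hclust(dist(scores), method = "ward.D2"), k = 4)

# Cluster centers: one row per cluster, one column per component.
centers <- apply(scores, 2, function(col) tapply(col, groups, mean))

# Euclidean distance of every individual to the center of its own cluster.
d <- sqrt(rowSums((scores - centers[groups, ])^2))

# The (up to) five individuals closest to each center.
paragons <- lapply(split(d, groups), function(v) head(sort(v), 5))
paragons
```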

For categorical variables, compute a CA or an MCA and then apply the HCPC() function to the results as described above. Here we use the tea data set as demonstration data: rows represent individuals and columns represent categorical variables. We start by performing an MCA on the individuals, keeping the first 20 axes of the MCA, which retain 87% of the information.

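library(FactoMineR)  # provides the tea data set and MCA()
library(factoextra)
data(tea)
res.mca <- MCA(tea,
               ncp = 20,           # number of dimensions kept
               quanti.sup = 19,    # supplementary quantitative variable
               quali.sup = 20:36,  # supplementary qualitative variables
               graph = FALSE)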
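
The share of information retained by the first 20 axes can be checked from the eigenvalue table stored in the MCA result. The sketch below assumes FactoMineR is installed and that the third column of res.mca$eig holds the cumulative percentage of variance; the name cum20 is introduced here for illustration only.

```r
# Runs only if FactoMineR is available; otherwise the block is skipped.
if (requireNamespace("FactoMineR", quietly = TRUE)) {
  data(tea, package = "FactoMineR")
  res.mca <- FactoMineR::MCA(tea, ncp = 20, quanti.sup = 19,
                             quali.sup = 20:36, graph = FALSE)

  # Eigenvalues, percentage of variance, cumulative percentage of variance.
  eig <- res.mca$eig
  print(round(eig[1:20, ], 2))

  # Cumulative percentage of variance reached at the 20th axis.
  cum20 <- eig[20, 3]
  print(cum20)
}
```

Note that ncp only controls how many dimensions are kept in the result and passed on to the clustering; the eigenvalue table still lists all dimensions.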

Next, we apply hierarchical clustering to the MCA results:

res.hcpc<-HCPC(res.mca, graph = FALSE, max = 3)
fviz_dend(res.hcpc, show_labels =FALSE)

fviz_cluster(res.hcpc, geom= "point", main= "Factor map")

The first call above draws the dendrogram (with labels hidden) and the second draws the factor map with individuals colored by cluster.

As mentioned above, clusters can be described by (1) variables and/or categories, (2) principal axes, and (3) individuals. In the example below, we show a subset of the results.

res.hcpc$desc.var$test.chi2
##                    p.value df
## where         8.465616e-79  4
## how           3.144675e-47  4
## price         1.862462e-28 10
## tearoom       9.624188e-19  2
## pub           8.539893e-10  2
## friends       6.137618e-08  2
## resto         3.537876e-07  2
## How           3.616532e-06  6
## Tea           1.778330e-03  4
## sex           1.789593e-03  2
## frequency     1.973274e-03  6
## work          3.052988e-03  2
## tea.time      3.679599e-03  2
## lunch         1.052478e-02  2
## dinner        2.234313e-02  2
## always        3.600913e-02  2
## sugar         3.685785e-02  2
## sophisticated 4.077297e-02  2
res.hcpc$desc.var$category
## $`1`
##                               Cla/Mod   Mod/Cla    Global      p.value
## where=chain store           85.937500 93.750000 64.000000 2.094419e-40
## how=tea bag                 84.117647 81.250000 56.666667 1.478564e-25
## tearoom=Not.tearoom         70.661157 97.159091 80.666667 1.082077e-18
## price=p_branded             83.157895 44.886364 31.666667 1.631861e-09
## pub=Not.pub                 67.088608 90.340909 79.000000 1.249296e-08
## friends=Not.friends         76.923077 45.454545 34.666667 2.177180e-06
## resto=Not.resto             64.705882 81.250000 73.666667 4.546462e-04
## price=p_private label       90.476190 10.795455  7.000000 1.343844e-03
## tea.time=Not.tea time       67.938931 50.568182 43.666667 4.174032e-03
## How=alone                   64.102564 71.022727 65.000000 9.868387e-03
## work=Not.work               63.380282 76.704545 71.000000 1.036429e-02
## sugar=sugar                 66.206897 54.545455 48.333333 1.066744e-02
## always=Not.always           63.959391 71.590909 65.666667 1.079912e-02
## price=p_unknown             91.666667  6.250000  4.000000 1.559798e-02
## frequency=1 to 2/week       75.000000 18.750000 14.666667 1.649092e-02
## frequency=1/day             68.421053 36.931818 31.666667 1.958790e-02
## age_Q=15-24                 68.478261 35.795455 30.666667 2.179803e-02
## price=p_cheap              100.000000  3.977273  2.333333 2.274539e-02
## lunch=Not.lunch             61.328125 89.204545 85.333333 2.681490e-02
## SPC=senior                  42.857143  8.522727 11.666667 4.813710e-02
## lunch=lunch                 43.181818 10.795455 14.666667 2.681490e-02
## always=always               48.543689 28.409091 34.333333 1.079912e-02
## sugar=No.sugar              51.612903 45.454545 51.666667 1.066744e-02
## work=work                   47.126437 23.295455 29.000000 1.036429e-02
## tea.time=tea time           51.479290 49.431818 56.333333 4.174032e-03
## How=lemon                   30.303030  5.681818 11.000000 5.943089e-04
## resto=resto                 41.772152 18.750000 26.333333 4.546462e-04
## How=other                    0.000000  0.000000  3.000000 2.952904e-04
## price=p_variable            44.642857 28.409091 37.333333 1.595638e-04
## frequency=+2/day            45.669291 32.954545 42.333333 9.872288e-05
## friends=friends             48.979592 54.545455 65.333333 2.177180e-06
## how=unpackaged              19.444444  3.977273 12.000000 4.328211e-07
## pub=pub                     26.984127  9.659091 21.000000 1.249296e-08
## where=tea shop               6.666667  1.136364 10.000000 4.770573e-10
## price=p_upscale             18.867925  5.681818 17.666667 9.472539e-11
## how=tea bag+unpackaged      27.659574 14.772727 31.333333 1.927326e-13
## tearoom=tearoom              8.620690  2.840909 19.333333 1.082077e-18
## where=chain store+tea shop  11.538462  5.113636 26.000000 1.133459e-23
##                                v.test
## where=chain store           13.307475
## how=tea bag                 10.449142
## tearoom=Not.tearoom          8.826287
## price=p_branded              6.030764
## pub=Not.pub                  5.692859
## friends=Not.friends          4.736242
## resto=Not.resto              3.506146
## price=p_private label        3.206448
## tea.time=Not.tea time        2.864701
## How=alone                    2.580407
## work=Not.work                2.563432
## sugar=sugar                  2.553408
## always=Not.always            2.549133
## price=p_unknown              2.418189
## frequency=1 to 2/week        2.397866
## frequency=1/day              2.334149
## age_Q=15-24                  2.293869
## price=p_cheap                2.277684
## lunch=Not.lunch              2.214202
## SPC=senior                  -1.976156
## lunch=lunch                 -2.214202
## always=always               -2.549133
## sugar=No.sugar              -2.553408
## work=work                   -2.563432
## tea.time=tea time           -2.864701
## How=lemon                   -3.434198
## resto=resto                 -3.506146
## How=other                   -3.619397
## price=p_variable            -3.775692
## frequency=+2/day            -3.893709
## friends=friends             -4.736242
## how=unpackaged              -5.053925
## pub=pub                     -5.692859
## where=tea shop              -6.226471
## price=p_upscale             -6.475138
## how=tea bag+unpackaged      -7.353743
## tearoom=tearoom             -8.826287
## where=chain store+tea shop -10.029275
## 
## $`2`
##                                         Cla/Mod Mod/Cla   Global      p.value
## where=tea shop                        90.000000  84.375 10.00000 3.703402e-30
## how=unpackaged                        66.666667  75.000 12.00000 5.346850e-20
## price=p_upscale                       49.056604  81.250 17.66667 2.392655e-17
## Tea=green                             27.272727  28.125 11.00000 4.436713e-03
## sophisticated=sophisticated           13.488372  90.625 71.66667 8.080918e-03
## sex=M                                 16.393443  62.500 40.66667 9.511848e-03
## resto=Not.resto                       13.122172  90.625 73.66667 1.587879e-02
## dinner=dinner                         28.571429  18.750  7.00000 1.874042e-02
## escape.exoticism=Not.escape-exoticism 14.556962  71.875 52.66667 2.177458e-02
## how=tea bag+unpackaged                 5.319149  15.625 31.33333 3.876799e-02
## escape.exoticism=escape-exoticism      6.338028  28.125 47.33333 2.177458e-02
## dinner=Not.dinner                      9.318996  81.250 93.00000 1.874042e-02
## resto=resto                            3.797468   9.375 26.33333 1.587879e-02
## Tea=Earl Grey                          7.253886  43.750 64.33333 1.314753e-02
## sex=F                                  6.741573  37.500 59.33333 9.511848e-03
## sophisticated=Not.sophisticated        3.529412   9.375 28.33333 8.080918e-03
## where=chain store+tea shop             2.564103   6.250 26.00000 3.794134e-03
## price=p_variable                       3.571429  12.500 37.33333 1.349384e-03
## age_Q=15-24                            2.173913   6.250 30.66667 6.100227e-04
## price=p_branded                        2.105263   6.250 31.66667 4.024289e-04
## how=tea bag                            1.764706   9.375 56.66667 5.537403e-09
## where=chain store                      1.562500   9.375 64.00000 1.664577e-11
##                                          v.test
## where=tea shop                        11.410559
## how=unpackaged                         9.156781
## price=p_upscale                        8.472945
## Tea=green                              2.845318
## sophisticated=sophisticated            2.648670
## sex=M                                  2.593088
## resto=Not.resto                        2.411690
## dinner=dinner                          2.350655
## escape.exoticism=Not.escape-exoticism  2.294277
## how=tea bag+unpackaged                -2.066641
## escape.exoticism=escape-exoticism     -2.294277
## dinner=Not.dinner                     -2.350655
## resto=resto                           -2.411690
## Tea=Earl Grey                         -2.479748
## sex=F                                 -2.593088
## sophisticated=Not.sophisticated       -2.648670
## where=chain store+tea shop            -2.894789
## price=p_variable                      -3.205264
## age_Q=15-24                           -3.427119
## price=p_branded                       -3.538486
## how=tea bag                           -5.830161
## where=chain store                     -6.732775
## 
## $`3`
##                               Cla/Mod    Mod/Cla   Global      p.value
## where=chain store+tea shop  85.897436  72.826087 26.00000 5.730651e-34
## how=tea bag+unpackaged      67.021277  68.478261 31.33333 1.382641e-19
## tearoom=tearoom             77.586207  48.913043 19.33333 1.252051e-16
## pub=pub                     63.492063  43.478261 21.00000 1.126679e-09
## friends=friends             41.836735  89.130435 65.33333 1.429181e-09
## price=p_variable            51.785714  63.043478 37.33333 1.572243e-09
## resto=resto                 54.430380  46.739130 26.33333 2.406386e-07
## How=other                  100.000000   9.782609  3.00000 1.807938e-05
## frequency=+2/day            41.732283  57.608696 42.33333 4.237330e-04
## tea.time=tea time           38.461538  70.652174 56.33333 8.453564e-04
## work=work                   44.827586  42.391304 29.00000 9.079377e-04
## sex=F                       37.078652  71.739130 59.33333 3.494245e-03
## lunch=lunch                 50.000000  23.913043 14.66667 3.917102e-03
## How=lemon                   51.515152  18.478261 11.00000 8.747530e-03
## sugar=No.sugar              36.129032  60.869565 51.66667 3.484061e-02
## home=home                   31.615120 100.000000 97.00000 3.506563e-02
## home=Not.home                0.000000   0.000000  3.00000 3.506563e-02
## sugar=sugar                 24.827586  39.130435 48.33333 3.484061e-02
## price=p_private label        9.523810   2.173913  7.00000 2.370629e-02
## how=unpackaged              13.888889   5.434783 12.00000 1.645107e-02
## How=alone                   25.128205  53.260870 65.00000 5.300881e-03
## lunch=Not.lunch             27.343750  76.086957 85.33333 3.917102e-03
## sex=M                       21.311475  28.260870 40.66667 3.494245e-03
## Tea=green                    9.090909   3.260870 11.00000 2.545816e-03
## frequency=1 to 2/week       11.363636   5.434783 14.66667 1.604219e-03
## work=Not.work               24.882629  57.608696 71.00000 9.079377e-04
## tea.time=Not.tea time       20.610687  29.347826 43.66667 8.453564e-04
## where=tea shop               3.333333   1.086957 10.00000 1.466234e-04
## price=p_branded             14.736842  15.217391 31.66667 2.746948e-05
## resto=Not.resto             22.171946  53.260870 73.66667 2.406386e-07
## friends=Not.friends          9.615385  10.869565 34.66667 1.429181e-09
## pub=Not.pub                 21.940928  56.521739 79.00000 1.126679e-09
## how=tea bag                 14.117647  26.086957 56.66667 1.082059e-12
## tearoom=Not.tearoom         19.421488  51.086957 80.66667 1.252051e-16
## where=chain store           12.500000  26.086957 64.00000 1.711522e-19
##                               v.test
## where=chain store+tea shop 12.150084
## how=tea bag+unpackaged      9.053653
## tearoom=tearoom             8.278053
## pub=pub                     6.090345
## friends=friends             6.052158
## price=p_variable            6.036775
## resto=resto                 5.164845
## How=other                   4.287379
## frequency=+2/day            3.524844
## tea.time=tea time           3.337500
## work=work                   3.317602
## sex=F                       2.920541
## lunch=lunch                 2.884762
## How=lemon                   2.621767
## sugar=No.sugar              2.110206
## home=home                   2.107600
## home=Not.home              -2.107600
## sugar=sugar                -2.110206
## price=p_private label      -2.261856
## how=unpackaged             -2.398752
## How=alone                  -2.788157
## lunch=Not.lunch            -2.884762
## sex=M                      -2.920541
## Tea=green                  -3.017842
## frequency=1 to 2/week      -3.155139
## work=Not.work              -3.317602
## tea.time=Not.tea time      -3.337500
## where=tea shop             -3.796720
## price=p_branded            -4.193490
## resto=Not.resto            -5.164845
## friends=Not.friends        -6.052158
## pub=Not.pub                -6.090345
## how=tea bag                -7.119644
## tearoom=Not.tearoom        -8.278053
## where=chain store          -9.030332

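The variables that characterize the clusters most are where and how: each cluster is dominated by one category of each. For example, individuals in cluster 1 mostly buy tea bags in chain stores, while those in cluster 2 mostly buy unpackaged tea in tea shops.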
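
The Cla/Mod, Mod/Cla and Global columns of the category table can be read as three percentages over the same counts: Global is the share of all individuals with the category, Cla/Mod is the share of those individuals that fall in the cluster, and Mod/Cla is the share of the cluster that has the category. As a worked example, the sketch below recovers the underlying counts for where=chain store in cluster 1 from the printed percentages. It assumes the tea data set has 300 individuals, which is not stated in the output above.

```r
N <- 300            # assumed number of individuals in tea
global  <- 64.0     # % of all individuals with where = chain store
cla_mod <- 85.9375  # % of those individuals that fall in cluster 1
mod_cla <- 93.75    # % of cluster 1 with where = chain store

n_category <- N * global / 100            # individuals buying in chain stores
n_both     <- n_category * cla_mod / 100  # of those, members of cluster 1
n_cluster  <- n_both / (mod_cla / 100)    # implied size of cluster 1

round(c(category = n_category, both = n_both, cluster = n_cluster))
```

Under that assumption, 192 individuals buy in chain stores, 165 of them are in cluster 1, and cluster 1 has 176 members.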

res.hcpc$desc.axes
## 
## Link between the cluster variable and the quantitative variables
## ================================================================
##              Eta2      P-value
## Dim.2  0.66509105 2.828937e-71
## Dim.1  0.63497903 1.009707e-65
## Dim.4  0.11231020 2.073924e-08
## Dim.14 0.03141943 8.732913e-03
## Dim.6  0.02358138 2.890373e-02
## 
## Description of each cluster by quantitative variables
## =====================================================
## $`1`
##           v.test Mean in category  Overall mean sd in category Overall sd
## Dim.6   2.647552       0.03433626  1.987198e-17      0.2655618  0.2671712
## Dim.2  -7.796641      -0.13194656 -4.981548e-18      0.1813156  0.3486355
## Dim.1 -12.409741      -0.23196088  4.901172e-17      0.2143767  0.3850642
##            p.value
## Dim.6 8.107689e-03
## Dim.2 6.357699e-15
## Dim.1 2.314001e-35
## 
## $`2`
##           v.test Mean in category  Overall mean sd in category Overall sd
## Dim.2  13.918285       0.81210870 -4.981548e-18      0.2340345  0.3486355
## Dim.4   4.350620       0.20342610  1.369991e-16      0.3700048  0.2793822
## Dim.14  2.909073       0.10749165 -2.314988e-17      0.2161509  0.2207818
## Dim.13  2.341566       0.08930402  5.823069e-17      0.1606616  0.2278809
## Dim.3   2.208179       0.11087544 -5.123217e-17      0.2449710  0.3000159
## Dim.11 -2.234447      -0.08934293  6.206696e-17      0.2066708  0.2389094
##             p.value
## Dim.2  4.905356e-44
## Dim.4  1.357531e-05
## Dim.14 3.625025e-03
## Dim.13 1.920305e-02
## Dim.3  2.723180e-02
## Dim.11 2.545367e-02
## 
## $`3`
##          v.test Mean in category Overall mean sd in category Overall sd
## Dim.1 13.485906       0.45155993 4.901172e-17      0.2516544  0.3850642
## Dim.6 -2.221728      -0.05161581 1.987198e-17      0.2488566  0.2671712
## Dim.4 -4.725270      -0.11479621 1.369991e-16      0.2924881  0.2793822
##            p.value
## Dim.1 1.893256e-41
## Dim.6 2.630166e-02
## Dim.4 2.298093e-06
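
The Eta2 column at the top of the desc.axes output measures how strongly each dimension is linked to the cluster variable. A common definition, which we assume is the one used here, is the between-cluster sum of squares divided by the total sum of squares. The sketch below implements that definition with a hypothetical helper eta2() and illustrates it on USArrests with a base-R PCA and Ward clustering, so the values are not comparable to the tea output above.

```r
# Hypothetical helper: share of the variance of x explained by the grouping g.
eta2 <- function(x, g) {
  group_means <- ave(x, g)  # each value replaced by the mean of its group
  sum((group_means - mean(x))^2) / sum((x - mean(x))^2)
}

# Illustration: component scores and a four-group Ward partition.
scores <- prcomp(USArrests, scale. = TRUE)$x[, 1:3]
groups <- cutree(hclust(dist(scores), method = "ward.D2"), k = 4)

# Dimensions ranked by how well the partition separates them.
eta2_by_dim <- sort(apply(scores, 2, eta2, g = groups), decreasing = TRUE)
eta2_by_dim
```

A value near 1 means the clusters are almost entirely separated along that dimension, and a value near 0 means the dimension does not distinguish them.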
res.hcpc$desc.ind$para
## Cluster: 1
##       285       152       166       143        71 
## 0.5884476 0.6242123 0.6242123 0.6244176 0.6478185 
## ------------------------------------------------------------ 
## Cluster: 2
##        31        95        53       182       202 
## 0.6620553 0.7442013 0.7610437 0.7948663 0.8154826 
## ------------------------------------------------------------ 
## Cluster: 3
##       172        33       233        18        67 
## 0.7380497 0.7407711 0.7503006 0.7572188 0.7701598