Tarea_A32_PR18044 Aplicación Análisis de Clúster
Ejemplo 3.7 un ejemplo de aplicación del análisis de conglomerados
El director de ventas de una cadena de tiendas de electrodomésticos con implantación nacional está estudiando el plan de incentivos de sus vendedores. Considere que los incentivos deben estar ajustados a las dificultades de las distintas zonas de ventas, siendo necesario fijar incentivos más altos en aquellas zonas geográficas en que las condiciones de vida de sus habitantes hacen más difícil las ventas. Por este motivo quiere determinar si las comunidades autónomas se pueden segmentar en grupos homogéneos respecto al equipamiento de los hogares.
Para ello dispone de los datos que aparecen en el cuadro 3.22 y el objetivo es establecer cuántos grupos de comunidades autónomas con niveles de equipamiento similar pueden establecerse y en qué radican las diferencias entre esos grupos. El procedimiento que aplicaremos es el descrito en el tema, a saber:
Análisis de la existencia de outliers en la medida en que pueden generar importantes distorsiones en la detección del número de grupos.
Realización de un análisis de conglomerados jerárquicos, evaluando la solución de distintos métodos de conglomeración, aplicando los criterios presentados para identificar el número adecuado de grupos y obtención de los centroides que han de servir de partida para el paso siguiente.
Realización de un análisis de conglomerados no jerárquico mediante el método de k-medias para la obtención de una solución óptima en términos de homogeneidad intragrupos y heterogeneidad intersegmentos.
## CC.AA. Automóvil TV.color Vídeo Microondas Lavavajillas Teléfono
## 1 España 69.0 97.6 62.4 32.3 17.0 85.2
## 2 Andalucía 66.7 98.0 82.7 24.1 12.7 74.7
## 3 Aragón 67.2 97.5 56.8 43.4 20.6 88.4
## 4 Asturias 63.7 95.2 52.1 24.4 13.3 88.1
## 5 Baleares 71.9 98.8 62.4 29.8 10.1 87.9
## 6 Canarias 72.7 96.8 68.4 27.9 5.8 75.4
## 7 Cantabria 63.4 94.9 48.9 36.5 11.2 80.5
## 8 Castilla y León 65.8 97.1 47.7 28.1 14.0 85.0
## 9 Cast.-La Mancha 61.5 97.3 53.6 21.7 7.1 72.9
## 10 Cataluña 70.4 98.1 71.1 36.8 19.8 92.2
## 11 Com. Valenciana 72.7 98.4 68.2 26.6 12.1 84.4
## 12 Extremadura 60.5 97.7 43.7 20.7 11.7 67.1
## 13 Galicia 65.5 91.3 42.7 13.5 14.6 85.9
## 14 Madrid 74.0 99.4 76.3 53.9 32.3 95.7
## 15 Murcia 69.0 98.7 59.3 19.5 12.1 81.4
## 16 Navarra 76.4 99.3 60.6 44.0 20.6 87.4
## 17 País Vasco 71.3 98.3 61.6 45.7 23.7 94.3
## 18 La Rioja 64.9 98.6 54.4 44.4 17.6 83.4
resultados_outliers <- data.frame(CC.AA. = c("España", "Andalucía", "Aragón", "Asturias", "Baleares",
"Canarias", "Cantabria", "Castilla y León", "Castilla-La Mancha",
"Cataluña", "Com. Valenciana", "Extremadura", "Galicia",
"Madrid", "Murcia", "Navarra", "País Vasco", "La Rioja"),
D2 = c(0.40, 3.93, 1.94, 4.46, 6.02, 10.47, 7.27, 3.25, 4.12,
4.21, 2.85, 0.29, 13.30, 9.49, 4.61, 9.58, 2.55, 4.25),
p_value_chi = c(0.99, 0.68, 0.92, 9.61, 0.42, 0.10, 0.29, 0.77, 0.66,
0.64, 0.82, 0.15, 0.03, 0.14, 0.59, 0.14, 0.86, 0.64))
print(resultados_outliers)## CC.AA. D2 p_value_chi
## 1 España 0.40 0.99
## 2 Andalucía 3.93 0.68
## 3 Aragón 1.94 0.92
## 4 Asturias 4.46 9.61
## 5 Baleares 6.02 0.42
## 6 Canarias 10.47 0.10
## 7 Cantabria 7.27 0.29
## 8 Castilla y León 3.25 0.77
## 9 Castilla-La Mancha 4.12 0.66
## 10 Cataluña 4.21 0.64
## 11 Com. Valenciana 2.85 0.82
## 12 Extremadura 0.29 0.15
## 13 Galicia 13.30 0.03
## 14 Madrid 9.49 0.14
## 15 Murcia 4.61 0.59
## 16 Navarra 9.58 0.14
## 17 País Vasco 2.55 0.86
## 18 La Rioja 4.25 0.64
library(stats)
matriz.dis.euclid.caso3 <- dist(datos[, -1], method = "euclidean")
hclust.average.caso3 <- hclust(matriz.dis.euclid.caso3, method = "average")
print(hclust.average.caso3)##
## Call:
## hclust(d = matriz.dis.euclid.caso3, method = "average")
##
## Cluster method : average
## Distance : euclidean
## Number of objects: 18
## height merge.1 merge.2
## 1 6.874591 -3 -18
## 2 7.153321 -4 -8
## 3 7.805767 -5 -11
## 4 9.339233 -1 3
## 5 9.387225 -16 -17
## 6 12.380548 1 5
## Warning: package 'NbClust' was built under R version 4.5.2
Datos.NbClust <- datos[, c("Automóvil", "TV.color", "Vídeo", "Microondas", "Lavavajillas", "Teléfono")]
colnames(Datos.NbClust) <- c("automovil", "tvcolor", "video", "microond", "lavavaji", "telefono")
res.wardD2 <- NbClust(Datos.NbClust,
distance = "euclidean",
min.nc = 2,
max.nc = 15,
method = "ward.D2",
index = "alllong")## Warning in pf(beale, pp, df2): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in pf(beale, pp, df2): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in log(det(P)/det(W)): Se han producido NaNs
## Warning in pf(beale, pp, df2): Se han producido NaNs
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 8 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 1.3084 10.9552 8.6730 13.6750 182.7994 1.458381e+15 612504.2932 3659.0000
## 3 3.0067 11.9843 3.6644 8.7681 207.2635 8.429416e+14 195076.6371 2372.7967
## 4 0.9890 10.4186 3.3738 8.3192 227.0281 4.998113e+14 162904.8130 1906.9433
## 5 0.7929 9.7875 4.0918 7.8431 248.9889 2.305508e+14 115617.9267 1536.6400
## 6 1.3688 10.2580 3.2810 7.7737 278.7225 6.364045e+13 59789.0250 1168.7690
## 7 1.2121 10.4798 2.9057 7.5738 325.9617 6.278598e+12 37773.1391 917.8223
## 8 1.3307 10.7005 2.3395 7.3293 351.2572 2.011553e+12 26359.7034 726.0390
## 9 1.0311 10.6612 2.3793 6.9147 375.8871 6.480048e+11 18848.5893 588.3850
## 10 0.9984 10.8857 2.6022 6.5456 397.9883 2.343403e+11 10040.9976 465.3583
## 11 1.0601 11.5885 2.8409 6.2961 427.8106 5.408748e+10 7201.2983 351.1417
## 12 1.2541 12.9161 2.6984 6.1704 480.8366 3.382908e+09 3896.7090 249.7733
## 13 1.5600 14.4913 1.9617 5.9377 977.1256 4.200000e-03 1829.0523 172.2883
## 14 1.0507 15.0205 2.2118 5.0933 NaN 0.000000e+00 1084.5981 123.7400
## 15 1.4032 16.3636 1.8571 4.1649 NaN 0.000000e+00 511.7484 79.6800
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale
## 2 7.132820e+03 129.4288 0.3702 1.0867 0.3181 0.5504 8.1671 2.8565
## 3 8.070990e+03 199.5873 0.3956 1.0531 0.3161 4.2684 -3.0629 -2.3568
## 4 8.221908e+03 248.3451 0.6022 0.8698 0.3419 0.5515 3.2527 2.5028
## 5 1.185957e+04 308.1919 0.5579 0.8452 0.3434 8.6970 -3.5401 -2.7240
## 6 1.294684e+04 405.1956 0.5124 0.7586 0.3787 15.7979 -1.8734 -2.4025
## 7 6.141230e+04 515.9823 0.5742 0.6390 0.4212 0.4986 3.0168 2.9016
## 8 6.324076e+04 652.2791 0.6817 0.6654 0.3972 18.0014 -2.8333 -2.7252
## 9 9.074476e+04 804.8813 0.7296 0.6394 0.4261 28.6303 -0.9651 -1.8565
## 10 1.012226e+05 1017.6675 0.6985 0.5764 0.4785 24.9376 -1.9198 -2.4620
## 11 1.125365e+05 1348.6866 0.6335 0.5071 0.5212 45.5972 -0.9781 -1.8815
## 12 2.861114e+05 1896.0393 0.6906 0.4416 0.5749 132.3449 0.0000 0.0000
## 13 4.222924e+15 2748.7645 0.7908 0.3890 0.6565 60.4131 -0.9834 -1.8918
## 14 -1.880118e+15 3827.2188 0.8114 0.3620 0.6838 189.1116 0.0000 0.0000
## 15 1.049247e+16 5943.5248 0.9224 0.2904 0.7667 380.6423 0.0000 0.0000
## Ratkowsky Ball Ptbiserial Gap Frey McClain Gamma Gplus Tau
## 2 0.3969 1829.5000 0.4539 -0.6438 0.4531 0.6077 0.5302 8.9542 20.2092
## 3 0.4340 790.9322 0.5012 -1.1169 0.1493 1.4282 0.6811 5.0654 21.6340
## 4 0.3967 476.7358 0.5195 -1.4087 0.4821 1.5890 0.7456 3.7582 22.0261
## 5 0.3825 307.3280 0.5029 -1.6960 0.1499 2.0181 0.7887 2.6732 19.9608
## 6 0.3590 194.7948 0.5151 -0.1846 0.2784 2.2708 0.8795 1.3399 19.5556
## 7 0.3448 131.1175 0.5091 0.4693 0.8636 2.4952 0.9154 0.8562 18.5229
## 8 0.3307 90.7549 0.4538 0.9413 0.4384 3.3721 0.9210 0.6275 14.6275
## 9 0.3171 65.3761 0.4183 1.1037 0.3168 4.2268 0.9445 0.3529 12.0131
## 10 0.3033 46.5358 0.3994 1.5455 0.2956 4.7948 0.9551 0.2484 10.5621
## 11 0.2912 31.9220 0.3639 2.1145 0.2159 6.0396 0.9645 0.1503 8.1699
## 12 0.2809 20.8144 0.3363 2.5283 0.1518 7.2496 0.9863 0.0458 6.5882
## 13 0.2713 13.2529 0.3218 3.0074 0.4252 7.9794 0.9955 0.0131 5.7386
## 14 0.2631 8.8386 0.2690 3.3806 0.2014 11.4837 0.9966 0.0065 3.8824
## 15 0.2559 5.3120 0.2394 3.8445 0.4367 14.4335 1.0000 0.0000 2.9412
## Dunn Hubert SDindex Dindex SDbw
## 2 0.2865 2e-04 0.2608 13.1959 1.1107
## 3 0.3897 2e-04 0.2129 10.7652 0.7478
## 4 0.4871 3e-04 0.1881 9.4658 0.2643
## 5 0.4871 3e-04 0.2115 8.2952 0.2323
## 6 0.4978 4e-04 0.1814 7.1368 0.1605
## 7 0.5872 4e-04 0.1632 6.2177 0.1032
## 8 0.5566 4e-04 0.2029 5.5115 0.0873
## 9 0.6357 4e-04 0.2023 4.8338 0.0743
## 10 0.6418 4e-04 0.2219 4.1291 0.0581
## 11 0.6418 4e-04 0.2242 3.4737 0.0456
## 12 0.7629 4e-04 0.2223 2.8410 0.0317
## 13 0.9264 4e-04 0.2231 2.1494 0.0177
## 14 0.8976 4e-04 0.3024 1.7345 0.0150
## 15 1.0795 5e-04 0.2987 1.2130 0.0087
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale CritValue_Gap
## 2 0.3979 15.1322 0.0163 0.5234
## 3 0.1924 16.7852 1.0000 0.3500
## 4 0.1924 16.7852 0.0504 0.3758
## 5 0.1924 16.7852 1.0000 -0.7994
## 6 0.0348 55.4763 1.0000 0.3193
## 7 0.1255 20.9054 0.0369 0.6980
## 8 0.1255 20.9054 1.0000 1.1293
## 9 -0.0981 -11.1930 1.0000 1.1200
## 10 0.0348 55.4763 1.0000 1.2283
## 11 -0.0981 -11.1930 1.0000 1.5758
## 12 -0.3211 0.0000 NaN 1.8159
## 13 -0.0981 -11.1930 1.0000 2.1872
## 14 -0.3211 0.0000 NaN 2.5587
## 15 -0.3211 0.0000 NaN 2.7794
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 3.0000 15.0000 3.0000 2.000 13.000 3.00000e+00 3.0
## Value_Index 3.0067 16.3636 5.0086 13.675 496.289 2.72309e+14 417427.7
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.00 1.500000e+01 3.0000 2.0000 15.0000 15.0000 2.0000
## Value_Index 820.35 1.237259e+16 -21.4006 0.3702 0.2904 0.7667 0.5504
## PseudoT2 Beale Ratkowsky Ball PtBiserial Gap Frey
## Number_clusters 2.0000 3.0000 3.000 3.000 4.0000 2.0000 1
## Value_Index 8.1671 -2.3568 0.434 1038.568 0.5195 -0.6438 NA
## McClain Gamma Gplus Tau Dunn Hubert SDindex Dindex
## Number_clusters 2.0000 15 15 4.0000 15.0000 0 7.0000 0
## Value_Index 0.6077 1 0 22.0261 1.0795 0 0.1632 0
## SDbw
## Number_clusters 15.0000
## Value_Index 0.0087
##
## $Best.partition
## [1] 1 1 2 3 1 1 3 3 3 2 1 3 3 2 1 2 2 2
hclust.ward.caso3 <- hclust(matriz.dis.euclid.caso3, method = "ward.D2")
grupo.ward <- cutree(hclust.ward.caso3, k = 2, h = NULL)
datos.caso3.grupos <- cbind(datos, grupo.ward)
print(datos.caso3.grupos)## CC.AA. Automóvil TV.color Vídeo Microondas Lavavajillas Teléfono
## 1 España 69.0 97.6 62.4 32.3 17.0 85.2
## 2 Andalucía 66.7 98.0 82.7 24.1 12.7 74.7
## 3 Aragón 67.2 97.5 56.8 43.4 20.6 88.4
## 4 Asturias 63.7 95.2 52.1 24.4 13.3 88.1
## 5 Baleares 71.9 98.8 62.4 29.8 10.1 87.9
## 6 Canarias 72.7 96.8 68.4 27.9 5.8 75.4
## 7 Cantabria 63.4 94.9 48.9 36.5 11.2 80.5
## 8 Castilla y León 65.8 97.1 47.7 28.1 14.0 85.0
## 9 Cast.-La Mancha 61.5 97.3 53.6 21.7 7.1 72.9
## 10 Cataluña 70.4 98.1 71.1 36.8 19.8 92.2
## 11 Com. Valenciana 72.7 98.4 68.2 26.6 12.1 84.4
## 12 Extremadura 60.5 97.7 43.7 20.7 11.7 67.1
## 13 Galicia 65.5 91.3 42.7 13.5 14.6 85.9
## 14 Madrid 74.0 99.4 76.3 53.9 32.3 95.7
## 15 Murcia 69.0 98.7 59.3 19.5 12.1 81.4
## 16 Navarra 76.4 99.3 60.6 44.0 20.6 87.4
## 17 País Vasco 71.3 98.3 61.6 45.7 23.7 94.3
## 18 La Rioja 64.9 98.6 54.4 44.4 17.6 83.4
## grupo.ward
## 1 1
## 2 1
## 3 2
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 2
## 11 1
## 12 1
## 13 1
## 14 2
## 15 1
## 16 2
## 17 2
## 18 2
plot(hclust.ward.caso3, labels = datos$CC.AA.,
main = "Clustering Jerárquico - Método Ward",
xlab = "Comunidades Autónomas",
ylab = "Distancia",
cex = 0.8)
rect.hclust(hclust.ward.caso3, k = 2, border = c("pink", "pink"))##
## ---------------------
## Welcome to dendextend version 1.19.1
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags:
## https://stackoverflow.com/questions/tagged/dendextend
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Adjuntando el paquete: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
dend <- as.dendrogram(hclust.ward.caso3)
labels_colors(dend) <- grupo.ward + 1
plot(dend,
main = "Dendrograma - Asignación de Clusters",
ylab = "Altura",
cex = 0.7)
legend("topright",
legend = paste("Cluster", 1:2),
fill = 2:3,
cex = 0.8)## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_cluster(list(data = datos[, -1], cluster = grupo.ward),
palette = c("#2E9", "#00A"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw(),
main = "Visualización PCA - Clusters Ward")