## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Top 15 de las variables que más contribuyen en sus datos (1 dimensión)
Top 15 de las variables que más contribuyen en sus datos (2 dimensión)
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Aca se muestras la principales variables que más influyen. Las rojas contribuyen más. Algunas no se me muestran para evitar el solapmiento de la letras.
Se arman grupos según el municipio, para encontrar las principales variables en cada uno.Algunas varaibles se omiten para evitar el solapamiento.
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# Ahora lo que se puede hacer el filtrar la base de datos con las varialebes de mayor contribucion y volver a plantear el PCA. A partir de alli construir otras gráficas.
Algunas gráficas con las principales variables con mayor contribución
\[Clusters\] #
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.1 v dplyr 1.0.5
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
calcular_totwithinss <- function(n_clusters, datos, iter.max=1000, nstart=50){
# Esta función aplica el algoritmo kmeans y devuelve la suma total de
# cuadrados internos.
cluster_kmeans <- kmeans(centers = n_clusters, x = datos, iter.max = iter.max,
nstart = nstart)
return(cluster_kmeans$tot.withinss)
}
# Se aplica esta función con para diferentes valores de k
total_withinss <- map_dbl(.x = 1:15,
.f = calcular_totwithinss,
datos = datos)
total_withinss
## [1] 1896.0000 1523.0568 1337.5073 1195.5029 1108.5242 1032.7868 963.2228
## [8] 903.0566 854.2012 806.5478 763.3222 720.4764 683.4948 655.8959
## [15] 618.5793
data.frame(n_clusters = 1:15, suma_cuadrados_internos = total_withinss) %>%
ggplot(aes(x = n_clusters, y = suma_cuadrados_internos)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = 1:15) +
labs(title = "Evolución de la suma total de cuadrados intra-cluster") +
theme_bw()
km_clusters <- kmeans(x = datos, centers = 4, nstart = 50)
# Las funciones del paquete factoextra emplean el nombre de las filas del
# dataframe que contiene los datos como identificador de las observaciones.
# Esto permite añadir labels a los gráficos.
fviz_cluster(object = km_clusters, data = datos, show.clust.cent = TRUE,
ellipse.type = "euclid", star.plot = TRUE, repel = TRUE) +
labs(title = "Resultados clustering K-means") +
theme_bw() +
theme(legend.position = "none")
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
km_clusters <- eclust(x = datos, FUNcluster = "kmeans", k =2 , seed = 123,
hc_metric = "euclidean", nstart = 50, graph = FALSE)
fviz_silhouette(sil.obj = km_clusters, print.summary = TRUE, palette = "jco",
ggtheme = theme_classic())
## cluster size ave.sil.width
## 1 1 22 0.13
## 2 2 58 0.26
\[Silhoutte witdh\]
Estas son una métricas que se usan para enontrar la mejor cantidad de cluster. En este caso indica que según tus datos es mejor agruparlo en 2 clusters. Lo probe con más, pero la covertura (width) se cae. Ya aca es cosa del investigados decidir si usa estos test, o se agrupa de otra forma.
library(NbClust)
numero_clusters <- NbClust(data = datos, distance = "euclidean", min.nc = 2,
max.nc = 10, method = "kmeans", index = "alllong")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 12 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 8 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
fviz_nbclust(numero_clusters)
## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : la condición tiene longitud > 1 y sólo el primer elemento
## será usado
## Warning in if (class(best_nc) == "matrix") .viz_NbClust(x, print.summary, : la
## condición tiene longitud > 1 y sólo el primer elemento será usado
## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : la condición tiene longitud > 1 y sólo el primer elemento
## será usado
## Warning in if (class(best_nc) == "matrix") {: la condición tiene longitud > 1 y
## sólo el primer elemento será usado
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 12 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 8 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
km_clusters$silinfo$clus.avg.widths
## [1] 0.129349 0.260589
p <- fviz_cluster(object = km_clusters, geom = "point", ellipse.type = "norm",
palette = "jco")
p + geom_point(data = p$data[c(112, 128),], colour = "firebrick", size = 2.5) +
theme_bw() + theme(legend.position = "bottom")
## Warning: Removed 2 rows containing missing values (geom_point).
colores <- colorRampPalette(c("red", "white", "blue"))(256)
heatmap(x = datos, scale = "none",
distfun = function(x){dist(x, method = "euclidean")},
hclustfun = function(x){hclust(x, method = "average")},
cexRow = 0.7)
library(pheatmap)
pheatmap(mat = datos, scale = "none", clustering_distance_rows = "euclidean",
clustering_distance_cols = "euclidean", clustering_method = "average",
cutree_rows = 4, fontsize = 6)
## Loading required package: cluster
## Warning in clValid(obj = datos, nClust = 2:6, clMethods = c("hierarchical", :
## rownames for data not specified, using 1:nrow(data)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical APN 0.0107 0.0283 0.0332 0.0497 0.0294
## AD 6.0475 5.9357 5.7871 5.6234 5.3583
## ADM 0.1172 0.1880 0.2049 0.2905 0.4162
## FOM 0.9207 0.9157 0.9124 0.9111 0.8932
## Connectivity 9.2472 10.1583 14.0333 17.8913 26.9139
## Dunn 0.4591 0.4591 0.4688 0.4688 0.4794
## Silhouette 0.2696 0.2468 0.2428 0.1528 0.1496
## kmeans APN 0.0542 0.0975 0.1185 0.0980 0.1889
## AD 6.0338 5.7633 5.5906 5.1581 5.0842
## ADM 0.4058 0.5217 0.5918 0.6476 1.0933
## FOM 0.9313 0.9012 0.8906 0.8564 0.8441
## Connectivity 14.7012 28.5143 34.9595 40.3325 51.2762
## Dunn 0.4562 0.3240 0.3384 0.3025 0.2617
## Silhouette 0.2414 0.1579 0.1511 0.1716 0.1648
## pam APN 0.0464 0.1180 0.1853 0.0813 0.1454
## AD 6.0554 5.7437 5.5382 5.1590 5.0462
## ADM 0.3028 0.7106 0.9611 0.4406 0.8142
## FOM 0.9345 0.9096 0.8954 0.8505 0.8496
## Connectivity 16.4071 40.9472 50.6849 58.3036 57.9901
## Dunn 0.3988 0.2554 0.2690 0.2758 0.2933
## Silhouette 0.2496 0.1373 0.1172 0.1382 0.1399
##
## Optimal Scores:
##
## Score Method Clusters
## APN 0.0107 hierarchical 2
## AD 5.0462 pam 6
## ADM 0.1172 hierarchical 2
## FOM 0.8441 kmeans 6
## Connectivity 9.2472 hierarchical 2
## Dunn 0.4794 hierarchical 6
## Silhouette 0.2696 hierarchical 2
## Score Method Clusters
## APN 0.01070261 hierarchical 2
## AD 5.04624779 pam 6
## ADM 0.11720010 hierarchical 2
## FOM 0.84412827 kmeans 6
## Connectivity 9.24722222 hierarchical 2
## Dunn 0.47944631 hierarchical 6
## Silhouette 0.26960489 hierarchical 2
Este es un resumen de otras metricas indicando el mejor numero de cluster.
#Finalmente se hacen varios cluster de aceurdo a esas metricas
#2 CLUSTERS
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
#4 CLUSTERS
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Esta pienso que es la mejor agrupación
#6 CLUSTERS
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps