PCA Aguacate

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Top 15 de las variables que más contribuyen en sus datos (1 dimensión)

Top 15 de las variables que más contribuyen en sus datos (2 dimensión)

## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Aca se muestras la principales variables que más influyen. Las rojas contribuyen más. Algunas no se me muestran para evitar el solapmiento de la letras.

Se arman grupos según el municipio, para encontrar las principales variables en cada uno.Algunas varaibles se omiten para evitar el solapamiento.

## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# Ahora lo que se puede hacer el filtrar la base de datos con las varialebes de mayor contribucion y volver a plantear el PCA. A partir de alli construir otras gráficas.

Algunas gráficas con las principales variables con mayor contribución \[Clusters\] #

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v tibble  3.1.1     v dplyr   1.0.5
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## v purrr   0.3.4

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

calcular_totwithinss <- function(n_clusters, datos, iter.max=1000, nstart=50){
  # Esta función aplica el algoritmo kmeans y devuelve la suma total de
  # cuadrados internos.
  cluster_kmeans <- kmeans(centers = n_clusters, x = datos, iter.max = iter.max,
                           nstart = nstart)
  return(cluster_kmeans$tot.withinss)
}

# Se aplica esta función con para diferentes valores de k
total_withinss <- map_dbl(.x = 1:15,
                          .f = calcular_totwithinss,
                          datos = datos)
total_withinss

##  [1] 1896.0000 1523.0568 1337.5073 1195.5029 1108.5242 1032.7868  963.2228
##  [8]  903.0566  854.2012  806.5478  763.3222  720.4764  683.4948  655.8959
## [15]  618.5793

data.frame(n_clusters = 1:15, suma_cuadrados_internos = total_withinss) %>%
  ggplot(aes(x = n_clusters, y = suma_cuadrados_internos)) +
    geom_line() +
    geom_point() +
    scale_x_continuous(breaks = 1:15) +
    labs(title = "Evolución de la suma total de cuadrados intra-cluster") +
    theme_bw()

km_clusters <- kmeans(x = datos, centers = 4, nstart = 50)

# Las funciones del paquete factoextra emplean el nombre de las filas del
# dataframe que contiene los datos como identificador de las observaciones.
# Esto permite añadir labels a los gráficos.
fviz_cluster(object = km_clusters, data = datos, show.clust.cent = TRUE,
             ellipse.type = "euclid", star.plot = TRUE, repel = TRUE) +
  labs(title = "Resultados clustering K-means") +
  theme_bw() +
  theme(legend.position = "none")

## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

km_clusters <- eclust(x = datos, FUNcluster = "kmeans", k =2 , seed = 123,
                      hc_metric = "euclidean", nstart = 50, graph = FALSE)
fviz_silhouette(sil.obj = km_clusters, print.summary = TRUE, palette = "jco",
                ggtheme = theme_classic())

##   cluster size ave.sil.width
## 1       1   22          0.13
## 2       2   58          0.26

\[Silhoutte witdh\]

Estas son una métricas que se usan para enontrar la mejor cantidad de cluster. En este caso indica que según tus datos es mejor agruparlo en 2 clusters. Lo probe con más, pero la covertura (width) se cae. Ya aca es cosa del investigados decidir si usa estos test, o se agrupa de otra forma.

library(NbClust)
numero_clusters <- NbClust(data = datos, distance = "euclidean", min.nc = 2,
                           max.nc = 10, method = "kmeans", index = "alllong")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 12 proposed 2 as the best number of clusters 
## * 6 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 8 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

fviz_nbclust(numero_clusters)

## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : la condición tiene longitud > 1 y sólo el primer elemento
## será usado

## Warning in if (class(best_nc) == "matrix") .viz_NbClust(x, print.summary, : la
## condición tiene longitud > 1 y sólo el primer elemento será usado

## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : la condición tiene longitud > 1 y sólo el primer elemento
## será usado

## Warning in if (class(best_nc) == "matrix") {: la condición tiene longitud > 1 y
## sólo el primer elemento será usado

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 12 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 2 proposed  4 as the best number of clusters
## * 8 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

km_clusters$silinfo$clus.avg.widths

## [1] 0.129349 0.260589

p <- fviz_cluster(object = km_clusters, geom = "point", ellipse.type  = "norm",
                  palette = "jco") 
p + geom_point(data = p$data[c(112, 128),], colour = "firebrick", size = 2.5) +
    theme_bw() + theme(legend.position = "bottom")

## Warning: Removed 2 rows containing missing values (geom_point).

colores <- colorRampPalette(c("red", "white", "blue"))(256)

heatmap(x = datos, scale = "none",
        distfun = function(x){dist(x, method = "euclidean")},
        hclustfun = function(x){hclust(x, method = "average")},
        cexRow = 0.7)

library(pheatmap)
pheatmap(mat = datos, scale = "none", clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean", clustering_method = "average",
         cutree_rows = 4, fontsize = 6)

## Loading required package: cluster

## Warning in clValid(obj = datos, nClust = 2:6, clMethods = c("hierarchical", :
## rownames for data not specified, using 1:nrow(data)

## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                                  2       3       4       5       6
##                                                                   
## hierarchical APN            0.0107  0.0283  0.0332  0.0497  0.0294
##              AD             6.0475  5.9357  5.7871  5.6234  5.3583
##              ADM            0.1172  0.1880  0.2049  0.2905  0.4162
##              FOM            0.9207  0.9157  0.9124  0.9111  0.8932
##              Connectivity   9.2472 10.1583 14.0333 17.8913 26.9139
##              Dunn           0.4591  0.4591  0.4688  0.4688  0.4794
##              Silhouette     0.2696  0.2468  0.2428  0.1528  0.1496
## kmeans       APN            0.0542  0.0975  0.1185  0.0980  0.1889
##              AD             6.0338  5.7633  5.5906  5.1581  5.0842
##              ADM            0.4058  0.5217  0.5918  0.6476  1.0933
##              FOM            0.9313  0.9012  0.8906  0.8564  0.8441
##              Connectivity  14.7012 28.5143 34.9595 40.3325 51.2762
##              Dunn           0.4562  0.3240  0.3384  0.3025  0.2617
##              Silhouette     0.2414  0.1579  0.1511  0.1716  0.1648
## pam          APN            0.0464  0.1180  0.1853  0.0813  0.1454
##              AD             6.0554  5.7437  5.5382  5.1590  5.0462
##              ADM            0.3028  0.7106  0.9611  0.4406  0.8142
##              FOM            0.9345  0.9096  0.8954  0.8505  0.8496
##              Connectivity  16.4071 40.9472 50.6849 58.3036 57.9901
##              Dunn           0.3988  0.2554  0.2690  0.2758  0.2933
##              Silhouette     0.2496  0.1373  0.1172  0.1382  0.1399
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## APN          0.0107 hierarchical 2       
## AD           5.0462 pam          6       
## ADM          0.1172 hierarchical 2       
## FOM          0.8441 kmeans       6       
## Connectivity 9.2472 hierarchical 2       
## Dunn         0.4794 hierarchical 6       
## Silhouette   0.2696 hierarchical 2

##                   Score       Method Clusters
## APN          0.01070261 hierarchical        2
## AD           5.04624779          pam        6
## ADM          0.11720010 hierarchical        2
## FOM          0.84412827       kmeans        6
## Connectivity 9.24722222 hierarchical        2
## Dunn         0.47944631 hierarchical        6
## Silhouette   0.26960489 hierarchical        2

Este es un resumen de otras metricas indicando el mejor numero de cluster.

#Finalmente se hacen varios cluster de aceurdo a esas metricas

#2 CLUSTERS

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

#4 CLUSTERS

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Esta pienso que es la mejor agrupación

#6 CLUSTERS

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

PCA Aguacate

Mig

16/3/2022