Indicadores Económicos

Análisis de Cluster

Author

Eduardo Canales

Introducción

El siguiente post muestra un análisis de la técnica multivariante Análisis de Cluster para analizar la base de datos savin el paquete faraway, que nos da 5 indicadores económicos de 50 países en el período 1960–1970.

El experimento consiste en agrupar o clasificar los paíse en grupos homogeneos según los indicadores económicos :

sr: la tasa de ahorro de cada país.
pop15: su porcentaje de población menor de 15 años.
pop75: su porcentaje de población mayor de 75 años.
dpi: su renta per cápita en dólares.
ddpi: su tasa de crecimiento, como porcentaje de su renta per cápita.

             sr pop15 pop75     dpi ddpi
Australia 11.43 29.35  2.87 2329.68 2.87
Austria   12.07 23.32  4.41 1507.99 3.93
Belgium   13.17 23.80  4.43 2108.47 3.82
Bolivia    5.75 41.89  1.67  189.13 0.22
Brazil    12.88 42.19  0.83  728.47 4.56
Canada     8.79 31.72  2.85 2982.88 2.43

Para realizar la agrupación de los países comenzaremos determinando el numéro optimo de cluster que debemos utilizar.

Numéro optimo de cluster

# Medida de Distancia utilizada
d<-dist(savings,method = "euclidea")
round(as.matrix(d)[1:5, 1:5], 2)

          Australia Austria Belgium Bolivia  Brazil
Australia      0.00  821.71  221.29 2140.60 1601.26
Austria      821.71    0.00  600.48 1319.01  779.76
Belgium      221.29  600.48    0.00 1919.44 1380.13
Bolivia     2140.60 1319.01 1919.44    0.00  539.41
Brazil      1601.26  779.76 1380.13  539.41    0.00

## Numero optimo de cluster
library(ggplot2)
library(factoextra)

Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(NbClust)
## Evaluacion para la eleccion del numero de cluster
## Metodo del codo
fviz_nbclust(savings,kmeans,method = "wss")+
  labs(subtitle = "Elbow method")

## Metodo de la silueta
fviz_nbclust(savings,kmeans,method = "silhouette")+
  labs(subtitle = "Silhouete method")

## Gap static
fviz_nbclust(savings,nstart=25, kmeans,method = "gap_stat",nboot = 50)+
  labs(subtitle = "Gap statistic method")

### Metodo completo
res.nbclust<-NbClust(data = savings, distance = "euclidean", min.nc = 3, max.nc = 9, 
            method = "complete", index = "all")

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 6 proposed 3 as the best number of clusters 
* 8 proposed 4 as the best number of clusters 
* 3 proposed 5 as the best number of clusters 
* 2 proposed 6 as the best number of clusters 
* 1 proposed 8 as the best number of clusters 
* 4 proposed 9 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  4 
 
 
*******************************************************************

Según los resultado el numéro optimo de cluster es \(4\) por el método completo y con los otros tres métodos son \(2\) cluster.

Agrupación no jerárquica K-means

Ralizaremos agrupaciones con los distintos algoritmos que presenta la función kmeans

Algoritmo MacQueen,

k2<-kmeans(savings,centers = 4,iter.max =100, nstart = 25, algorithm = "MacQueen")
k2

K-means clustering with 4 clusters of sizes 3, 30, 8, 9

Cluster means:
         sr    pop15    pop75       dpi     ddpi
1  7.736667 27.65667 3.606667 3428.0867 2.630000
2  8.484667 40.86200 1.436667  407.2647 3.930667
3 12.671250 24.64750 3.801250 2364.6250 3.316250
4 11.603333 27.60778 3.368889 1546.5244 3.948889

Clustering vector:
     Australia        Austria        Belgium        Bolivia         Brazil 
             3              4              3              2              2 
        Canada          Chile          China       Colombia     Costa Rica 
             1              2              2              2              2 
       Denmark        Ecuador        Finland         France        Germany 
             3              2              4              3              3 
        Greece      Guatamala       Honduras        Iceland          India 
             2              2              2              4              2 
       Ireland          Italy          Japan          Korea     Luxembourg 
             4              4              4              2              3 
         Malta         Norway    Netherlands    New Zealand      Nicaragua 
             2              3              4              4              2 
        Panama       Paraguay           Peru    Philippines       Portugal 
             2              2              2              2              2 
  South Africa South Rhodesia          Spain         Sweden    Switzerland 
             2              2              2              1              3 
        Turkey        Tunisia United Kingdom  United States      Venezuela 
             2              2              4              1              2 
        Zambia        Jamaica        Uruguay          Libya       Malaysia 
             2              2              2              2              2 

Within cluster sum of squares by cluster:
[1]  544059.7 1582368.6  211576.8  531227.4
 (between_SS / total_SS =  94.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

fviz_cluster(k2,data = savings)

# Metodos de agrupación
library(purrr)
library(cluster)
Cofeciente_Aglomeracion<- c( "average", "single", "complete", "ward")
names(Cofeciente_Aglomeracion) <- c( "average", "single", "complete", "ward")

# function para encontrar el coeficiente de aglomeracion mas cercano 1
ac <- function(x) {
  agnes(savings, method = x)$ac
}

map_dbl(Cofeciente_Aglomeracion, ac)

  average    single  complete      ward 
0.9626906 0.9135734 0.9810209 0.9913942

Según los resultados el coeficiente de aglomeración más cercano a \(1\) es el del método ward

library(ggplot2)
library("factoextra")
resultado<-hclust(d,method = "ward.D2")
fviz_dend(resultado, k=4, cex = 0.5)+labs(title = "Agrupación Jerárquica",
       subtitle = "Distancia euclídea, Ward, K=4")