Classification of Countries by Sectoral Composition of GDP

Cluster Analysis

Author

Eduardo Canales

Introduction

This post applies the multivariate technique of cluster analysis to the List of Countries by GDP Sector Composition dataset, available at https://www.kaggle.com/datasets, from which the following countries were selected:

Japan, France, Brazil, Canada, Russia, Spain, Mexico, Argentina, Taiwan, Colombia, Chile, Venezuela, Peru, Puerto Rico, Ecuador, Cuba, Dominican Republic, Guatemala, Uruguay, Panama, Costa Rica, Bolivia, Paraguay, El Salvador, Honduras, Trinidad and Tobago, Jamaica, Nicaragua, Haiti, Belize.

The goal is to group these countries according to the nominal GDP of the following sectors:

Agriculture sector
Industry sector
Services sector

The exercise consists of classifying these countries into homogeneous groups according to GDP, using the following variables for each of the sectors mentioned above:

PIB : Monetary value of the production of goods and services (millions of $).
CPSA : Contribution to world agricultural-sector GDP.
Ranking SA : Country's rank by agricultural-sector GDP.
PCPSA : Percentage contribution of agricultural-sector GDP.
CPSI : Contribution to world industrial-sector GDP.
Ranking SI : Country's rank by industrial-sector GDP.
PCPSI : Percentage contribution of industrial-sector GDP.
CPSS : Contribution to world services-sector GDP.
Ranking SS : Country's rank by services-sector GDP.
PCPSS : Percentage contribution of services-sector GDP.

Agriculture Sector

We begin by grouping the countries for the agriculture sector using hierarchical clustering.

The first step is to compute the distance matrix of the agriculture-sector variables for each country using the dist function.

Distance Measure

# standardize the variables (drop the first column, which is not a clustering variable)
agricultura<-scale(Agricultura[,-1])
nombre<-c("Japan","France","Brazil","Canada","Russia","Spain","Mexico","Argentina","Taiwan","Colombia","Chile","Venezuela","Peru","Puerto Rico","Ecuador","Cuba","Dom. Republic","Guatemala","Uruguay","Panama","Costa Rica","Bolivia",
          "Paraguay","El Salvador","Honduras","T. Tobago","Jamaica","Nicaragua","Haiti","Belize")
row.names(agricultura)<-nombre

# Euclidean distance
distancia<-dist(x=agricultura, method = "euclidean")
round(as.matrix(distancia)[1:5, 1:5], 2)

       Japan France Brazil Canada Russia
Japan   0.00   2.18   3.94   3.15   3.36
France  2.18   0.00   2.82   1.23   1.30
Brazil  3.94   2.82   0.00   3.63   2.16
Canada  3.15   1.23   3.63   0.00   1.57
Russia  3.36   1.30   2.16   1.57   0.00
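
To make the entries of such a distance matrix concrete, here is a minimal sketch that standardizes a small made-up table and compares a hand-computed Euclidean distance with the value returned by dist. The table toy, its values and the names a, b, manual_ab and dist_ab are illustrative assumptions, not taken from the Kaggle data.

```r
# Toy data: three hypothetical countries, two variables (illustrative values only)
toy <- data.frame(x = c(1, 4, 7), y = c(10, 20, 60),
                  row.names = c("A", "B", "C"))
toy_std <- scale(toy)   # center each column and divide by its standard deviation

# Euclidean distance between rows A and B, written out explicitly
a <- toy_std["A", ]
b <- toy_std["B", ]
manual_ab <- sqrt(sum((a - b)^2))

# The same entry taken from the full distance matrix
dist_ab <- as.matrix(dist(toy_std, method = "euclidean"))["A", "B"]

all.equal(manual_ab, dist_ab)
```

Because the columns are standardized first, a variable measured in millions of dollars does not dominate one measured as a rank or a percentage.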

To decide which linkage method gives the best classification, we use the agnes function, which reports the agglomerative coefficient, a measure of the amount of clustering structure found (values close to 1 suggest a strong clustering structure).

# Linkage methods
library(purrr)
library(cluster)
Cofeciente_Aglomeracion<- c( "average", "single", "complete", "ward")
names(Cofeciente_Aglomeracion) <- c( "average", "single", "complete", "ward")

# function to compute coefficient
ac <- function(x) {
  agnes(agricultura, method = x)$ac
}

map_dbl(Cofeciente_Aglomeracion, ac)

  average    single  complete      ward 
0.8169654 0.6947719 0.8535576 0.9109951

According to these results, the method whose coefficient is closest to $1$ is ward (0.911).
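
As a rough illustration of what the agglomerative coefficient measures, the sketch below recomputes it by hand: for each observation it takes the height of the first merge the observation takes part in, divides it by the height of the final merge, and averages one minus that ratio. It uses the built-in USArrests data as a stand-in (an assumption, since the post's data frame is not reproduced here), and the names x, ag, hc, first_height and ac_manual are illustrative.

```r
library(cluster)   # agnes()

x  <- scale(USArrests)            # stand-in data, not the post's dataset
ag <- agnes(x, method = "ward")
hc <- as.hclust(ag)               # merge history in merge order

# In hc$merge, a negative entry -j means observation j joins a cluster for the
# first time at that step; hc$height holds the height of each merge.
n <- nrow(x)
first_height <- numeric(n)
for (step in seq_len(nrow(hc$merge))) {
  singles <- -hc$merge[step, hc$merge[step, ] < 0]
  first_height[singles] <- hc$height[step]
}

# Average of 1 - (first merge height / final merge height)
ac_manual <- mean(1 - first_height / max(hc$height))

c(manual = ac_manual, agnes = ag$ac)
```

A value near 1 means most observations merge at heights that are small relative to the final merge, which is what a clear group structure looks like in a dendrogram.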

library(ggplot2)
library("factoextra")

resultado<-hclust(distancia,method = "ward.D2")
fviz_dend(resultado, cex = 0.5)

The plot above shows how the countries would be grouped.

fviz_dend(resultado, k = 4,cex = 0.5,
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE,rect = TRUE )+
  labs(title = "Hierarchical Clustering",
       subtitle = "Euclidean distance, Ward, K=4")
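
The dendrogram colors the groups, but the cluster labels themselves can also be extracted as a vector. A minimal sketch, assuming the built-in USArrests data as a stand-in for the post's data and using cutree from base R (the names x, hc and groups are illustrative):

```r
x  <- scale(USArrests)                       # stand-in data
hc <- hclust(dist(x), method = "ward.D2")    # same linkage as in the post

groups <- cutree(hc, k = 4)                  # named integer vector: one label per row
table(groups)                                # cluster sizes
head(groups)                                 # labels of the first few rows
```

The resulting vector can be joined back to the original table to summarize each group.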

library(cluster)
k3<-kmeans(agricultura,centers = 4,iter.max = 100,algorithm = "MacQueen")
fviz_cluster(k3,data = agricultura)+
  labs(title = "Non-Hierarchical Clustering",subtitle = "MacQueen algorithm, K-means")

Industry Sector

For the industry sector we use non-hierarchical clustering, starting by determining the optimal number of clusters to use.

Optimal Number of Clusters

# Distance measure used
distanciaI<-dist(industria,method = "manhattan")
round(as.matrix(distanciaI)[1:5, 1:5], 2)

       Japan France Brazil Canada Russia
Japan   0.00   6.56   7.30   6.77   6.97
France  6.56   0.00   0.95   2.10   2.68
Brazil  7.30   0.95   0.00   1.36   2.12
Canada  6.77   2.10   1.36   0.00   0.76
Russia  6.97   2.68   2.12   0.76   0.00

## Optimal number of clusters
library(factoextra)
library(NbClust)
fviz_nbclust(industria, kmeans, method = "wss") + 
geom_vline(xintercept = 5, linetype = 2)

According to the wss (elbow) method, the optimal number of clusters is about $5$.
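
The wss curve can also be computed by hand, which makes it clearer what is being plotted: the total within-cluster sum of squares of a k-means fit for each candidate k. This is a minimal sketch on the built-in USArrests data (a stand-in, since the industry data frame is not reproduced here); the names x and wss and the seed are assumptions.

```r
set.seed(123)                      # k-means starts from random centers
x <- scale(USArrests)              # stand-in data

# Total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The "elbow" is the k after which adding another cluster reduces wss only slightly. Reading it off the plot is a judgment call, so it is worth cross-checking against other criteria.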

Now we look at which cluster each country was assigned to.

k2<-kmeans(industria,centers = 5,nstart = 25)
k2

K-means clustering with 5 clusters of sizes 12, 1, 9, 6, 2

Cluster means:
         PIB       CPSI  RankingSI      PCPSI
1 -0.5307384 -0.5330793  0.9470518 -0.7410622
2  4.0258879  4.3677428 -1.3053391  0.1121927
3 -0.3159431 -0.2629599 -0.4206791  0.6436317
4  1.0339904  0.8852053 -1.1237510 -0.3168951
5 -0.5087410 -0.4576918  0.2346677  2.4446196

Clustering vector:
        Japan        France        Brazil        Canada        Russia 
            2             4             4             4             4 
        Spain        Mexico     Argentina        Taiwan      Colombia 
            4             4             3             3             3 
        Chile     Venezuela          Peru   Puerto Rico       Ecuador 
            3             3             3             5             3 
         Cuba Dom. Republic     Guatemala       Uruguay        Panama 
            1             3             1             1             1 
   Costa Rica       Bolivia      Paraguay   El Salvador      Honduras 
            1             3             1             1             1 
    T. Tobago       Jamaica     Nicaragua         Haiti        Belize 
            5             1             1             1             1 

Within cluster sum of squares by cluster:
[1] 8.1980150 0.0000000 3.4906336 3.6712278 0.3587785
 (between_SS / total_SS =  86.4 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
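
The components listed in the printout can be used directly. The sketch below, again on the built-in USArrests data as a stand-in, recomputes the between_SS / total_SS percentage and lists the members of each cluster; the names x and km and the seed are illustrative assumptions.

```r
set.seed(123)
x  <- scale(USArrests)                        # stand-in data
km <- kmeans(x, centers = 5, nstart = 25)

# Share of total variation explained by the partition
# (the figure printed as between_SS / total_SS)
round(100 * km$betweenss / km$totss, 1)

# Rows assigned to each cluster
split(names(km$cluster), km$cluster)
```

A cluster of size 1 always has a within-cluster sum of squares of 0, as happens with cluster 2 in the output above.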

fviz_cluster(k2,data = industria)

library(cluster)  # pam() is provided by the cluster package
pam.res<-pam(industria,5)
fviz_cluster(pam.res,
             palette = c( "#FF4040","#9932CC","#FFD700","#009ACD","#2E8B57"), 
             ellipse.type = "t", 
             repel = TRUE, 
             ggtheme = theme_classic()
)

Too few points to calculate an ellipse

Services Sector

In the previous groupings we saw how the different clustering techniques are used; for this last sector we use additional functions that help obtain a better classification into clusters.

We begin by finding the optimal number of clusters:

library(factoextra)
library(NbClust)
### Complete linkage method
res.nbclust<-NbClust(data = servicio, 
                  distance = "euclidean", min.nc = 3, 
                  max.nc = 9, 
                  method = "complete", index = "all")

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 9 proposed 3 as the best number of clusters 
* 5 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 1 proposed 6 as the best number of clusters 
* 4 proposed 7 as the best number of clusters 
* 2 proposed 8 as the best number of clusters 
* 2 proposed 9 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  3 
 
 
*******************************************************************
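
To give a feel for what a single one of these indices does, the sketch below computes the average silhouette width for k = 3 to 9 on a complete-linkage hierarchy. It is only an illustration of one index, not a reimplementation of the majority vote: it uses the built-in USArrests data as a stand-in and silhouette from the cluster package, and the names x, d, hc, ks and avg_sil are assumptions.

```r
library(cluster)   # silhouette()

x  <- scale(USArrests)                 # stand-in data
d  <- dist(x, method = "euclidean")
hc <- hclust(d, method = "complete")

# Average silhouette width for k = 3, ..., 9 (higher is better)
ks <- 3:9
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})

ks[which.max(avg_sil)]   # k preferred by this single index
```

Different indices often disagree, which is why the output above reports how many indices voted for each k rather than a single answer.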

We now cluster the countries for the services sector.

library(cluster)  # pam() is provided by the cluster package
pam.res<-pam(servicio,3)
fviz_cluster(pam.res,
             palette = c( "#FF4040","#9932CC","#2E8B57"), 
             ellipse.type = "t", 
             repel = TRUE, 
             ggtheme = theme_classic()
)
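
Unlike k-means, whose centers are averages, PAM represents each cluster by a medoid, an actual observation from the data. A minimal sketch on the built-in USArrests data (a stand-in for the services data frame; the names x and fit are illustrative):

```r
library(cluster)   # pam()

x   <- scale(USArrests)          # stand-in data
fit <- pam(x, k = 3)

fit$medoids                      # the three representative rows
rownames(fit$medoids)            # their labels
table(fit$clustering)            # cluster sizes
```

With country data, the medoids would be three real countries, which can make the groups easier to describe than abstract centroids.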

Finally, we take the entire dataset, excluding PIB, to see what grouping is obtained.

# Distance measure used
d<-dist(sectores,method = "euclidean")
round(as.matrix(d)[1:5, 1:5], 2)

              United States China Japan France India
United States          0.00  7.02  5.11   5.67  6.77
China                  7.02  0.00  6.71   7.80  6.11
Japan                  5.11  6.71  0.00   1.82  3.58
France                 5.67  7.80  1.82   0.00  3.57
India                  6.77  6.11  3.58   3.57  0.00

## Optimal number of clusters
library(factoextra)
library(NbClust)
res.nbclust<-NbClust(data = sectores, 
                  distance = "euclidean", min.nc = 3, 
                  max.nc = 9, 
                  method = "complete", index = "all")

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 7 proposed 3 as the best number of clusters 
* 9 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 2 proposed 6 as the best number of clusters 
* 1 proposed 8 as the best number of clusters 
* 4 proposed 9 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  4 
 
 
*******************************************************************

Using hierarchical clustering, the countries are grouped as follows:

library(ggplot2)
library("factoextra")
resul<-hclust(d,method = "ward.D")
fviz_dend(resul, k = 4,cex = 0.5,
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"))+
  labs(title = "Hierarchical Clustering",
       subtitle = "Euclidean distance, Ward, K=4")

library(cluster)  # pam() is provided by the cluster package
pam.res<-pam(sectores,4)
fviz_cluster(pam.res,
             palette = c( "#FF4040","#9932CC","#2E8B57","#00AFBB"), 
             ellipse.type = "t", 
             repel = TRUE, 
             ggtheme = theme_classic()
)

Too few points to calculate an ellipse

library(cluster)
library(ggplot2)
library(factoextra)
matriz_distancias <-get_dist(x = sectores, method = "euclidean")
hc_diana <- diana(x = matriz_distancias, diss = TRUE, stand = FALSE)
fviz_dend(x = hc_diana, cex = 0.5) +
  labs(title = "Divisive hierarchical clustering",
       subtitle = "Euclidean distance")

library("clValid")
intern <- clValid(sectores, nClust = 2:6, 
              clMethods = c("hierarchical","kmeans","pam",'clara'),
              validation = "internal")
# Summary
summary(intern)


Clustering Methods:
 hierarchical kmeans pam clara 

Cluster sizes:
 2 3 4 5 6 

Validation Measures:
                                 2       3       4       5       6
                                                                  
hierarchical Connectivity   5.6079  5.8579  9.2230 13.0810 20.1595
             Dunn           0.7270  0.7333  0.3772  0.3881  0.2857
             Silhouette     0.5560  0.5271  0.3032  0.2679  0.3030
kmeans       Connectivity   5.6079  5.8579 19.5317 19.5663 28.4032
             Dunn           0.7270  0.7333  0.1974  0.2317  0.1449
             Silhouette     0.5560  0.5271  0.3203  0.3535  0.3238
pam          Connectivity  13.2099 16.1389 23.4512 26.3802 31.3488
             Dunn           0.1287  0.1360  0.0913  0.1394  0.1649
             Silhouette     0.2661  0.2844  0.2180  0.2560  0.2515
clara        Connectivity  13.2099 16.1389 23.4512 26.3802 31.3488
             Dunn           0.1287  0.1360  0.0913  0.1394  0.1649
             Silhouette     0.2661  0.2844  0.2180  0.2560  0.2515

Optimal Scores:

             Score  Method       Clusters
Connectivity 5.6079 hierarchical 2       
Dunn         0.7333 hierarchical 3       
Silhouette   0.5560 hierarchical 2

plot(intern)
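
Of the three internal measures, the Dunn index is the simplest to compute by hand: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The sketch below is an illustration under that definition, on the built-in USArrests data as a stand-in; the names x, d, cl, same, min_between, max_within and dunn_manual are assumptions.

```r
x  <- scale(USArrests)                          # stand-in data
d  <- as.matrix(dist(x))
cl <- cutree(hclust(dist(x), method = "average"), k = 3)

same <- outer(cl, cl, "==")                     # TRUE where two rows share a cluster

min_between <- min(d[!same])                    # closest pair in different clusters
max_within  <- max(d[same])                     # widest pair inside one cluster (diameter)

dunn_manual <- min_between / max_within
dunn_manual
```

Larger values indicate compact, well-separated clusters. Connectivity, by contrast, should be minimized, while Silhouette, like Dunn, should be maximized, which is how the Optimal Scores table above is read.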