Classification of Countries by Sectoral Composition of GDP
Cluster Analysis
Introduction
This post applies cluster analysis, a multivariate technique, to the List of Countries by GDP Sector Composition dataset from https://www.kaggle.com/datasets. The following countries were selected:
Japan, France, Brazil, Canada, Russia, Spain, Mexico, Argentina, Taiwan, Colombia, Chile, Venezuela, Peru, Puerto Rico, Ecuador, Cuba, Dominican Republic, Guatemala, Uruguay, Panama, Costa Rica, Bolivia, Paraguay, El Salvador, Honduras, Trinidad and Tobago, Jamaica, Nicaragua, Haiti, Belize.
The goal is to group these countries according to the nominal GDP of the following sectors:
Agriculture sector
Industry sector
Services sector
The experiment groups these countries into homogeneous clusters according to GDP, using the following variables for each of the sectors above:
PIB: Monetary value of the production of goods and services (millions of $).
CPSA: Contribution to world agricultural-sector GDP.
Ranking SA: The country's rank by agricultural-sector GDP.
PCPSA: Percentage contribution of agricultural-sector GDP.
CPSI: Contribution to world industry-sector GDP.
Ranking SI: The country's rank by industry-sector GDP.
PCPSI: Percentage contribution of industry-sector GDP.
CPSS: Contribution to world services-sector GDP.
Ranking SS: The country's rank by services-sector GDP.
PCPSS: Percentage contribution of services-sector GDP.
Agriculture Sector
We begin by grouping the countries for the agricultural sector using hierarchical clustering.
The first step is to compute the matrix of pairwise distances between countries from the agricultural-sector variables, using the dist function.
Distance Measure
# Standardize the data (the first column is dropped)
agricultura <- scale(Agricultura[, -1])
nombre <- c("Japan", "France", "Brazil", "Canada", "Russia", "Spain", "Mexico", "Argentina", "Taiwan", "Colombia", "Chile", "Venezuela", "Peru", "Puerto Rico", "Ecuador", "Cuba", "Dom. Republic", "Guatemala", "Uruguay", "Panama", "Costa Rica", "Bolivia",
            "Paraguay", "El Salvador", "Honduras", "T. Tobago", "Jamaica", "Nicaragua", "Haiti", "Belize")
row.names(agricultura) <- nombre
# Euclidean distance
distancia <- dist(x = agricultura, method = "euclidean")
round(as.matrix(distancia)[1:5, 1:5], 2)
       Japan France Brazil Canada Russia
Japan   0.00   2.18   3.94   3.15   3.36
France  2.18   0.00   2.82   1.23   1.30
Brazil  3.94   2.82   0.00   3.63   2.16
Canada  3.15   1.23   3.63   0.00   1.57
Russia  3.36   1.30   2.16   1.57   0.00
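To make the two steps above concrete, here is a minimal sketch on a toy matrix. The three "countries" A, B and C and their values are made up for illustration; they are not taken from the dataset. It checks that the value returned by dist matches the Euclidean distance computed directly from the standardized rows.

```r
# Toy data: three hypothetical countries, two variables on very different scales
x <- matrix(c(100, 1,
              200, 3,
              400, 2),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("gdp", "rank")))

# scale() centers each column and divides it by its sample standard deviation
z <- scale(x)

# dist() returns the pairwise distances between the rows of z
d <- dist(z, method = "euclidean")
d_ab <- as.matrix(d)["A", "B"]

# The same distance between A and B, computed from the definition
d_ab_manual <- sqrt(sum((z["A", ] - z["B", ])^2))

print(c(dist = d_ab, manual = d_ab_manual))
```

Standardizing first matters here because otherwise the variable with the largest scale (gdp in the toy data) would dominate the distance.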
To decide which linkage method gives the best clustering we use the agnes function, which reports the agglomerative coefficient, a measure of the amount of clustering structure found (values close to 1 suggest strong clustering structure).
# Linkage methods
library(purrr)
library(cluster)
Cofeciente_Aglomeracion <- c("average", "single", "complete", "ward")
names(Cofeciente_Aglomeracion) <- c("average", "single", "complete", "ward")
# Function to compute the agglomerative coefficient for a given linkage method
ac <- function(x) {
  agnes(agricultura, method = x)$ac
}
map_dbl(Cofeciente_Aglomeracion, ac)
  average    single  complete      ward
0.8169654 0.6947719 0.8535576 0.9109951
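According to these results, the coefficient closest to \(1\) is the one obtained with the ward method (0.911), so Ward linkage is used for the hierarchical clustering. In hclust, the method equivalent to the "ward" option of agnes is "ward.D2".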
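To get a feel for what the agglomerative coefficient measures, the following sketch compares two made-up one-dimensional datasets (they are not part of the post's data). In the first, all points are evenly spaced, so every merge happens at the same height and there is no structure to find. In the second, two tight groups sit far apart, so the early merges are tiny compared with the final one.

```r
library(cluster)

# Evenly spaced points: every merge happens at the same height
flat <- matrix(1:10, ncol = 1)
ac_flat <- agnes(flat, method = "single")$ac

# Two tight groups far apart: early merges are small relative to the last one
grouped <- matrix(c(0, 0.1, 0.2, 10, 10.1, 10.2), ncol = 1)
ac_grouped <- agnes(grouped, method = "single")$ac

print(c(flat = ac_flat, grouped = ac_grouped))
```

The first coefficient should come out at 0 and the second close to 1. One caveat from the cluster package documentation: the coefficient tends to grow with the number of observations, so it is best used to compare methods on the same data, as done above, rather than datasets of very different sizes.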
library(ggplot2)
library("factoextra")
resultado <- hclust(distancia, method = "ward.D2")
fviz_dend(resultado, cex = 0.5)
The resulting dendrogram shows how the countries would be grouped.
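A dendrogram is read visually, but the group memberships can also be extracted as labels with cutree from base R. The sketch below uses six made-up points rather than the country data; with the objects from this post, the analogous call would presumably be cutree(resultado, k = 4).

```r
# Toy data: two hypothetical groups of three points each
x <- rbind(c(0, 0), c(0, 1), c(1, 0),
           c(10, 10), c(10, 11), c(11, 10))
rownames(x) <- paste0("p", 1:6)

# Same pipeline as in the post: Euclidean distances, Ward linkage
hc <- hclust(dist(x, method = "euclidean"), method = "ward.D2")

# Cut the tree into k groups; the result is a named integer vector
groups <- cutree(hc, k = 2)

# List the members of each group
print(split(names(groups), groups))
```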
fviz_dend(resultado, k = 4, cex = 0.5,
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, rect = TRUE) +
  labs(title = "Hierarchical Clustering",
       subtitle = "Euclidean distance, Ward, K=4")
library(cluster)
k3 <- kmeans(agricultura, centers = 4, iter.max = 100, algorithm = "MacQueen")
fviz_cluster(k3, data = agricultura) +
  labs(title = "Non-Hierarchical Clustering", subtitle = "MacQueen algorithm, K-means")
Industry Sector
For the industry sector we use non-hierarchical clustering, starting by determining the optimal number of clusters.
Optimal Number of Clusters
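# industria is not defined in this post; it is assumed to be the standardized
# industry-sector matrix (columns PIB, CPSI, RankingSI, PCPSI), built like agricultura
# Distance measure used
distanciaI <- dist(industria, method = "manhattan")
round(as.matrix(distanciaI)[1:5, 1:5], 2)
       Japan France Brazil Canada Russia
Japan   0.00   6.56   7.30   6.77   6.97
France  6.56   0.00   0.95   2.10   2.68
Brazil  7.30   0.95   0.00   1.36   2.12
Canada  6.77   2.10   1.36   0.00   0.76
Russia  6.97   2.68   2.12   0.76   0.00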
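As a short aside on the distance used above, the sketch below contrasts the Manhattan and Euclidean distances between two made-up observations (hypothetical values, not rows of the dataset) and checks both against dist.

```r
# Two hypothetical standardized observations
z <- rbind(A = c(0.5, -1.0, 2.0),
           B = c(1.5,  0.0, 0.5))

# Manhattan: sum of absolute coordinate differences
manhattan_manual <- sum(abs(z["A", ] - z["B", ]))

# Euclidean: square root of the sum of squared differences
euclidean_manual <- sqrt(sum((z["A", ] - z["B", ])^2))

# The same quantities from dist()
manhattan_dist <- as.numeric(dist(z, method = "manhattan"))
euclidean_dist <- as.numeric(dist(z, method = "euclidean"))

print(c(manhattan = manhattan_dist, euclidean = euclidean_dist))
```

Note that kmeans works directly on the data matrix and minimizes within-cluster sums of squared Euclidean distances, so a Manhattan distance matrix such as distanciaI is useful for inspecting the data but is not what kmeans optimizes.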
## Optimal number of clusters
library(factoextra)
library(NbClust)
fviz_nbclust(industria, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2)
According to the wss method (total within-cluster sum of squares), the optimal number of clusters is around \(5\).
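The wss criterion can be reproduced by hand with base R, which helps when reading the elbow plot. The sketch below is an illustration on simulated data with three well-separated groups (the data and the seed are assumptions, not the industry-sector matrix): it runs kmeans for several values of k and stores the total within-cluster sum of squares.

```r
set.seed(123)  # kmeans() uses random starts; fix the seed for reproducibility

# Toy data: three hypothetical groups in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 6
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# For k = 1 this equals the total sum of squares around the overall mean
total_ss <- sum(scale(x, scale = FALSE)^2)

print(data.frame(k = ks, wss = round(wss, 2)))
```

In this toy case the drop in wss should be very large up to k = 3 and small afterwards, which is the "elbow" one looks for. On real data the bend is usually less sharp, so the chosen k remains a judgment call.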
We now look at which cluster each country was assigned to.
k2 <- kmeans(industria, centers = 5, nstart = 25)
k2
K-means clustering with 5 clusters of sizes 12, 1, 9, 6, 2
Cluster means:
         PIB       CPSI  RankingSI      PCPSI
1 -0.5307384 -0.5330793  0.9470518 -0.7410622
2  4.0258879  4.3677428 -1.3053391  0.1121927
3 -0.3159431 -0.2629599 -0.4206791  0.6436317
4  1.0339904  0.8852053 -1.1237510 -0.3168951
5 -0.5087410 -0.4576918  0.2346677  2.4446196
Clustering vector:
        Japan        France        Brazil        Canada        Russia
            2             4             4             4             4
        Spain        Mexico     Argentina        Taiwan      Colombia
            4             4             3             3             3
        Chile     Venezuela          Peru   Puerto Rico       Ecuador
            3             3             3             5             3
         Cuba Dom. Republic     Guatemala       Uruguay        Panama
            1             3             1             1             1
   Costa Rica       Bolivia      Paraguay   El Salvador      Honduras
            1             3             1             1             1
    T. Tobago       Jamaica     Nicaragua         Haiti        Belize
            5             1             1             1             1
Within cluster sum of squares by cluster:
[1] 8.1980150 0.0000000 3.4906336 3.6712278 0.3587785
(between_SS / total_SS = 86.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
# k2 was fitted on industria, so it is plotted against the same data
fviz_cluster(k2, data = industria)
library(fpc)
pam.res <- pam(industria, 5)
fviz_cluster(pam.res,
             palette = c("#FF4040", "#9932CC", "#FFD700", "#009ACD", "#2E8B57"),
             ellipse.type = "t",
             repel = TRUE,
             ggtheme = theme_classic()
)
Too few points to calculate an ellipse
Services Sector
The previous sections showed how different clustering techniques are applied. For this last sector we use additional functions that help obtain a better classification of the clusters.
We begin by finding the optimal number of clusters.
library(factoextra)
library(NbClust)
### Complete linkage method
# servicio is not defined in this post; it is assumed to be the services-sector data, prepared like agricultura
res.nbclust <- NbClust(data = servicio,
                       distance = "euclidean", min.nc = 3,
                       max.nc = 9,
                       method = "complete", index = "all")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 9 proposed 3 as the best number of clusters
* 5 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 1 proposed 6 as the best number of clusters
* 4 proposed 7 as the best number of clusters
* 2 proposed 8 as the best number of clusters
* 2 proposed 9 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
We now cluster the countries for the services sector.
library(fpc)
pam.res <- pam(servicio, 3)
fviz_cluster(pam.res,
             palette = c("#FF4040", "#9932CC", "#2E8B57"),
             ellipse.type = "t",
             repel = TRUE,
             ggtheme = theme_classic()
)
Finally, we take the whole dataset, excluding PIB, to see what grouping results.
# Distance measure used
d <- dist(sectores, method = "euclidean")
round(as.matrix(d)[1:5, 1:5], 2)
              United States China Japan France India
United States          0.00  7.02  5.11   5.67  6.77
China                  7.02  0.00  6.71   7.80  6.11
Japan                  5.11  6.71  0.00   1.82  3.58
France                 5.67  7.80  1.82   0.00  3.57
India                  6.77  6.11  3.58   3.57  0.00
## Optimal number of clusters
library(factoextra)
library(NbClust)
res.nbclust <- NbClust(data = sectores,
                       distance = "euclidean", min.nc = 3,
                       max.nc = 9,
                       method = "complete", index = "all")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 7 proposed 3 as the best number of clusters
* 9 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 2 proposed 6 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 4 proposed 9 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 4
*******************************************************************
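The majority rule in the two NbClust summaries above is simple vote counting, which the following sketch reproduces from the printed tallies. The vote counts are copied from the output; the variable names are only illustrative.

```r
# Votes reported by NbClust in the two runs above (number of indices per k)
votes_servicio <- c("3" = 9, "4" = 5, "5" = 1, "6" = 1, "7" = 4, "8" = 2, "9" = 2)
votes_sectores <- c("3" = 7, "4" = 9, "5" = 1, "6" = 2, "8" = 1, "9" = 4)

# Majority rule: pick the k proposed by the largest number of indices
best_servicio <- as.integer(names(which.max(votes_servicio)))
best_sectores <- as.integer(names(which.max(votes_sectores)))

# Share of indices backing the winner in each run
share_servicio <- max(votes_servicio) / sum(votes_servicio)
share_sectores <- max(votes_sectores) / sum(votes_sectores)

print(c(servicio = best_servicio, sectores = best_sectores))
```

In both runs the winning value is backed by 9 of the 24 votes, so it is a plurality rather than an absolute majority, and the runner-up (4 clusters for servicio, 3 for sectores) is a reasonable alternative to keep in mind.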
Using hierarchical clustering, the countries are grouped as follows:
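library(ggplot2)
library("factoextra")
# Note: "ward.D" applied to unsquared distances does not implement Ward's criterion;
# "ward.D2" (used for the agriculture sector) does
resul <- hclust(d, method = "ward.D")
fviz_dend(resul, k = 4, cex = 0.5,
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07")) +
  labs(title = "Hierarchical Clustering",
       subtitle = "Euclidean distance, Ward, K=4")
library(fpc)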
pam.res <- pam(sectores, 4)
fviz_cluster(pam.res,
             palette = c("#FF4040", "#9932CC", "#2E8B57", "#00AFBB"),
             ellipse.type = "t",
             repel = TRUE,
             ggtheme = theme_classic()
)
Too few points to calculate an ellipse
library(cluster)
library(ggplot2)
library(factoextra)
matriz_distancias <- get_dist(x = sectores, method = "euclidean")
hc_diana <- diana(x = matriz_distancias, diss = TRUE, stand = FALSE)
fviz_dend(x = hc_diana, cex = 0.5) +
  labs(title = "Divisive hierarchical clustering",
       subtitle = "Euclidean distance")
library("clValid")
intern <- clValid(sectores, nClust = 2:6,
clMethods = c("hierarchical","kmeans","pam",'clara'),
validation = "internal")
# Summary
summary(intern)
Clustering Methods:
hierarchical kmeans pam clara
Cluster sizes:
2 3 4 5 6
Validation Measures:
                                2       3       4       5       6
hierarchical Connectivity  5.6079  5.8579  9.2230 13.0810 20.1595
             Dunn          0.7270  0.7333  0.3772  0.3881  0.2857
             Silhouette    0.5560  0.5271  0.3032  0.2679  0.3030
kmeans       Connectivity  5.6079  5.8579 19.5317 19.5663 28.4032
             Dunn          0.7270  0.7333  0.1974  0.2317  0.1449
             Silhouette    0.5560  0.5271  0.3203  0.3535  0.3238
pam          Connectivity 13.2099 16.1389 23.4512 26.3802 31.3488
             Dunn          0.1287  0.1360  0.0913  0.1394  0.1649
             Silhouette    0.2661  0.2844  0.2180  0.2560  0.2515
clara        Connectivity 13.2099 16.1389 23.4512 26.3802 31.3488
             Dunn          0.1287  0.1360  0.0913  0.1394  0.1649
             Silhouette    0.2661  0.2844  0.2180  0.2560  0.2515
Optimal Scores:
             Score  Method       Clusters
Connectivity 5.6079 hierarchical 2
Dunn         0.7333 hierarchical 3
Silhouette   0.5560 hierarchical 2
plot(intern)
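When reading the clValid summary, it helps to remember the direction of each measure: connectivity should be as small as possible, while the Dunn index and the silhouette width should be as large as possible. The sketch below applies that rule to the scores of the hierarchical method, copied from the table above, and recovers the number of clusters listed under Optimal Scores. The variable names are only illustrative.

```r
# Scores for the hierarchical method, copied from the summary above
k <- 2:6
connectivity <- c(5.6079, 5.8579, 9.2230, 13.0810, 20.1595)
dunn         <- c(0.7270, 0.7333, 0.3772, 0.3881, 0.2857)
sil_width    <- c(0.5560, 0.5271, 0.3032, 0.2679, 0.3030)

# Connectivity is minimized; Dunn and silhouette are maximized
best <- c(Connectivity = k[which.min(connectivity)],
          Dunn         = k[which.max(dunn)],
          Silhouette   = k[which.max(sil_width)])

print(best)
```

The three internal measures therefore point to 2 or 3 clusters with the hierarchical method, fewer than the 4 suggested by the NbClust majority rule, which is worth weighing when choosing the final grouping.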