sr pop15 pop75 dpi ddpi
Australia 11.43 29.35 2.87 2329.68 2.87
Austria 12.07 23.32 4.41 1507.99 3.93
Belgium 13.17 23.80 4.43 2108.47 3.82
Bolivia 5.75 41.89 1.67 189.13 0.22
Brazil 12.88 42.19 0.83 728.47 4.56
Canada 8.79 31.72 2.85 2982.88 2.43
Indicadores Económicos
Análisis de Cluster
Introducción
El siguiente post muestra un análisis de la técnica multivariante Análisis de Cluster para analizar la base de datos savin el paquete faraway, que nos da 5 indicadores económicos de 50 países en el período 1960–1970.
El experimento consiste en agrupar o clasificar los paíse en grupos homogeneos según los indicadores económicos :
sr: la tasa de ahorro de cada país.pop15: su porcentaje de población menor de 15 años.pop75: su porcentaje de población mayor de 75 años.dpi: su renta per cápita en dólares.ddpi: su tasa de crecimiento, como porcentaje de su renta per cápita.
Para realizar la agrupación de los países comenzaremos determinando el numéro optimo de cluster que debemos utilizar.
Numéro optimo de cluster
# Medida de Distancia utilizada
d<-dist(savings,method = "euclidea")
round(as.matrix(d)[1:5, 1:5], 2) Australia Austria Belgium Bolivia Brazil
Australia 0.00 821.71 221.29 2140.60 1601.26
Austria 821.71 0.00 600.48 1319.01 779.76
Belgium 221.29 600.48 0.00 1919.44 1380.13
Bolivia 2140.60 1319.01 1919.44 0.00 539.41
Brazil 1601.26 779.76 1380.13 539.41 0.00
## Numero optimo de cluster
library(ggplot2)
library(factoextra)Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(NbClust)
## Evaluacion para la eleccion del numero de cluster
## Metodo del codo
fviz_nbclust(savings,kmeans,method = "wss")+
labs(subtitle = "Elbow method")## Metodo de la silueta
fviz_nbclust(savings,kmeans,method = "silhouette")+
labs(subtitle = "Silhouete method")## Gap static
fviz_nbclust(savings,nstart=25, kmeans,method = "gap_stat",nboot = 50)+
labs(subtitle = "Gap statistic method")### Metodo completo
res.nbclust<-NbClust(data = savings, distance = "euclidean", min.nc = 3, max.nc = 9,
method = "complete", index = "all")*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 6 proposed 3 as the best number of clusters
* 8 proposed 4 as the best number of clusters
* 3 proposed 5 as the best number of clusters
* 2 proposed 6 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 4 proposed 9 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 4
*******************************************************************
Según los resultado el numéro optimo de cluster es \(4\) por el método completo y con los otros tres métodos son \(2\) cluster.
Agrupación no jerárquica K-means
Ralizaremos agrupaciones con los distintos algoritmos que presenta la función kmeans
Algoritmo MacQueen,
k2<-kmeans(savings,centers = 4,iter.max =100, nstart = 25, algorithm = "MacQueen")
k2K-means clustering with 4 clusters of sizes 3, 30, 8, 9
Cluster means:
sr pop15 pop75 dpi ddpi
1 7.736667 27.65667 3.606667 3428.0867 2.630000
2 8.484667 40.86200 1.436667 407.2647 3.930667
3 12.671250 24.64750 3.801250 2364.6250 3.316250
4 11.603333 27.60778 3.368889 1546.5244 3.948889
Clustering vector:
Australia Austria Belgium Bolivia Brazil
3 4 3 2 2
Canada Chile China Colombia Costa Rica
1 2 2 2 2
Denmark Ecuador Finland France Germany
3 2 4 3 3
Greece Guatamala Honduras Iceland India
2 2 2 4 2
Ireland Italy Japan Korea Luxembourg
4 4 4 2 3
Malta Norway Netherlands New Zealand Nicaragua
2 3 4 4 2
Panama Paraguay Peru Philippines Portugal
2 2 2 2 2
South Africa South Rhodesia Spain Sweden Switzerland
2 2 2 1 3
Turkey Tunisia United Kingdom United States Venezuela
2 2 4 1 2
Zambia Jamaica Uruguay Libya Malaysia
2 2 2 2 2
Within cluster sum of squares by cluster:
[1] 544059.7 1582368.6 211576.8 531227.4
(between_SS / total_SS = 94.0 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
fviz_cluster(k2,data = savings)# Metodos de agrupación
library(purrr)
library(cluster)
Cofeciente_Aglomeracion<- c( "average", "single", "complete", "ward")
names(Cofeciente_Aglomeracion) <- c( "average", "single", "complete", "ward")
# function para encontrar el coeficiente de aglomeracion mas cercano 1
ac <- function(x) {
agnes(savings, method = x)$ac
}
map_dbl(Cofeciente_Aglomeracion, ac) average single complete ward
0.9626906 0.9135734 0.9810209 0.9913942
Según los resultados el coeficiente de aglomeración más cercano a \(1\) es el del método ward
library(ggplot2)
library("factoextra")
resultado<-hclust(d,method = "ward.D2")
fviz_dend(resultado, k=4, cex = 0.5)+labs(title = "Agrupación Jerárquica",
subtitle = "Distancia euclídea, Ward, K=4")