Mock Exam - Solution

We have the file IDE_limpio.

Should we create 4 clusters with this data?

subData=ide[,-c(1:6)]                     # drop the first 6 columns
ideS=scale(subData)                       # standardize the variables
row.names(ideS)=ide$provinciaNombre       # label rows with province names
library(NbClust)
nb <- NbClust(ideS, method = "complete")  # evaluate candidate numbers of clusters

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 9 proposed 3 as the best number of clusters 
## * 5 proposed 4 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

The recommendation is 3 clusters (not 4).
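The vote counts summarized above can also be read directly from the object NbClust returns; a minimal sketch, assuming the standard return structure of the nb object created above:

# Tally how many indices voted for each number of clusters
table(nb$Best.nc["Number_clusters",])
# Cluster membership under the majority-rule choice
head(nb$Best.partition)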

The data had no missing values; if it had, the results would not be the same. That is why it is important to know when to impute (replace) missing values.
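A quick way to confirm whether imputation is needed at all is to count the missing values per column; a minimal check on the same subData:

colSums(is.na(subData))   # NAs per variable
sum(is.na(subData))       # total NAs in the data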

Imputing the missing values in subData:

for(i in 1:ncol(subData)){
  MEDIA=mean(subData[,i], na.rm = TRUE)            # column mean, ignoring NAs
  subData[is.na(subData[,i]), i] <- round(MEDIA,0) # replace NAs with the rounded mean
}
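After the loop it is worth confirming that no missing values remain before scaling and clustering again; a minimal check:

sum(is.na(subData))   # should be 0 after the imputation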

If I use agglomerative hierarchical clustering on the IDE data, how many provinces would end up poorly assigned?

Knowing that I must request 3 clusters:

# rebuild and standardize the working data
subData=ide[,-c(1:6)]
ideS=scale(subData)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
algoritmo="hclust"
cuantosClusters=3
resultadoJ <- eclust(ideS, FUNcluster = algoritmo,
                     k = cuantosClusters,
                     method = "complete",
                     graph = FALSE)
siluetas=resultadoJ$silinfo$widths          # silhouette width per observation
BadCluster=siluetas[siluetas$sil_width<0,]  # negative width = poorly assigned
nrow(BadCluster)
## [1] 12

**The answer is 12**.
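To see which provinces those 12 are, and to inspect the full silhouette profile, the objects from the chunk above can be reused; a minimal sketch (the row names are informative only if provinciaNombre was set as the row names of ideS, as in the first chunk):

rownames(BadCluster)          # provinces with negative silhouette width
fviz_silhouette(resultadoJ)   # silhouette plot of the 3-cluster solution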

Will we get 3 latent variables among all the observed IDE variables (its components)?

matriz=cor(subData)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
KMO(matriz)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = matriz)
## Overall MSA =  0.74
## MSA for each item = 
##  identificacion2012         medicos2012     escolaridad2012 
##                0.76                0.81                0.68 
##     AguaDesague2012 electrificacion2012 
##                0.73                0.74
cortest.bartlett(matriz, n=nrow(ide))
## $chisq
## [1] 431.0569
## 
## $p.value
## [1] 2.286103e-86
## 
## $df
## [1] 10
LATENTES=principal(matriz,3,rotate="varimax", scores=T)  # 3 principal components, varimax rotation
print(LATENTES,digits=3,cut = 0.4)                       # show only loadings above 0.4 in absolute value
## Principal Components Analysis
## Call: principal(r = matriz, nfactors = 3, rotate = "varimax", scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                       RC2   RC1   RC3    h2     u2  com
## identificacion2012              0.932 0.980 0.0203 1.26
## medicos2012         0.891             0.834 0.1660 1.10
## escolaridad2012           0.815 0.451 0.886 0.1144 1.62
## AguaDesague2012     0.814             0.801 0.1994 1.42
## electrificacion2012 0.469 0.818       0.902 0.0980 1.64
## 
##                         RC2   RC1   RC3
## SS loadings           1.735 1.530 1.137
## Proportion Var        0.347 0.306 0.227
## Cumulative Var        0.347 0.653 0.880
## Proportion Explained  0.394 0.348 0.258
## Cumulative Proportion 0.394 0.742 1.000
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.072 
## 
## Fit based upon off diagonal values = 0.981

We do not get 3 clear groups of observed variables aligned with the requested latent variables: escolaridad2012 loads on both RC1 and RC3, and electrificacion2012 loads on both RC2 and RC1, so the three-component structure is not clean.
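As a complementary check of how many components the data supports, a parallel analysis from the same psych package can be run on the correlation matrix already computed; a minimal sketch (n.obs is required when a correlation matrix rather than raw data is supplied):

fa.parallel(matriz, n.obs = nrow(ide), fa = "pc")   # scree plot plus parallel analysis for principal components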