Mock Exam - Solution

We have the file IDE_limpio.

Should we create 4 clusters with this data?

subData=ide[,-c(1:6)]                     # drop the first 6 columns
ideS=scale(subData)                       # standardize the variables
row.names(ideS)=ide$provinciaNombre       # label rows with province names
library(NbClust)
nb <- NbClust(ideS, method = "complete")  # evaluate candidate numbers of clusters

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 9 proposed 3 as the best number of clusters 
## * 5 proposed 4 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

The recommendation is 3 clusters (not 4).
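The vote counts summarized above can also be read directly from the object NbClust returns; a minimal sketch, assuming the standard return structure of the nb object created above:

# Tally how many indices voted for each number of clusters
table(nb$Best.nc["Number_clusters",])
# Cluster membership under the majority-rule choice
head(nb$Best.partition)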

The data had no missing values; if it had, the results would not be the same. That is why it is important to know when to impute (replace) missing values.
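A quick way to confirm whether imputation is needed at all is to count the missing values per column; a minimal check on the same subData:

colSums(is.na(subData))   # NAs per variable
sum(is.na(subData))       # total NAs in the data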

Imputing the missing values in subData:

for(i in 1:ncol(subData)){
  MEDIA=mean(subData[,i], na.rm = TRUE)            # column mean, ignoring NAs
  subData[is.na(subData[,i]), i] <- round(MEDIA,0) # replace NAs with the rounded mean
}
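After the loop it is worth confirming that no missing values remain before scaling and clustering again; a minimal check:

sum(is.na(subData))   # should be 0 after the imputation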

If I use agglomerative hierarchical clustering on the IDE data, how many provinces would end up poorly assigned?

Knowing that I must request 3 clusters:

# rebuild and standardize the working data
subData=ide[,-c(1:6)]
ideS=scale(subData)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
algoritmo="hclust"
cuantosClusters=3
resultadoJ <- eclust(ideS, FUNcluster = algoritmo,
                     k = cuantosClusters,
                     method = "complete",
                     graph = FALSE)
siluetas=resultadoJ$silinfo$widths          # silhouette width per observation
BadCluster=siluetas[siluetas$sil_width<0,]  # negative width = poorly assigned
nrow(BadCluster)
## [1] 12

**The answer is 12**.
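To see which provinces those 12 are, and to inspect the full silhouette profile, the objects from the chunk above can be reused; a minimal sketch (the row names are informative only if provinciaNombre was set as the row names of ideS, as in the first chunk):

rownames(BadCluster)          # provinces with negative silhouette width
fviz_silhouette(resultadoJ)   # silhouette plot of the 3-cluster solution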

Will we get 3 latent variables among all the observed IDE variables (its components)?

matriz=cor(subData)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
KMO(matriz)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = matriz)
## Overall MSA =  0.74
## MSA for each item = 
##  identificacion2012         medicos2012     escolaridad2012 
##                0.76                0.81                0.68 
##     AguaDesague2012 electrificacion2012 
##                0.73                0.74
cortest.bartlett(matriz, n=nrow(ide))
## $chisq
## [1] 431.0569
## 
## $p.value
## [1] 2.286103e-86
## 
## $df
## [1] 10
LATENTES=principal(matriz,3,rotate="varimax", scores=T)  # 3 principal components, varimax rotation
print(LATENTES,digits=3,cut = 0.4)                       # show only loadings above 0.4 in absolute value
## Principal Components Analysis
## Call: principal(r = matriz, nfactors = 3, rotate = "varimax", scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                       RC2   RC1   RC3    h2     u2  com
## identificacion2012              0.932 0.980 0.0203 1.26
## medicos2012         0.891             0.834 0.1660 1.10
## escolaridad2012           0.815 0.451 0.886 0.1144 1.62
## AguaDesague2012     0.814             0.801 0.1994 1.42
## electrificacion2012 0.469 0.818       0.902 0.0980 1.64
## 
##                         RC2   RC1   RC3
## SS loadings           1.735 1.530 1.137
## Proportion Var        0.347 0.306 0.227
## Cumulative Var        0.347 0.653 0.880
## Proportion Explained  0.394 0.348 0.258
## Cumulative Proportion 0.394 0.742 1.000
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.072 
## 
## Fit based upon off diagonal values = 0.981

We do not get 3 clear groups of observed variables aligned with the requested latent variables: escolaridad2012 loads on both RC1 and RC3, and electrificacion2012 loads on both RC2 and RC1, so the three-component structure is not clean.
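As a complementary check of how many components the data supports, a parallel analysis from the same psych package can be run on the correlation matrix already computed; a minimal sketch (n.obs is required when a correlation matrix rather than raw data is supplied):

fa.parallel(matriz, n.obs = nrow(ide), fa = "pc")   # scree plot plus parallel analysis for principal components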