Topic 1

The data you have represents several district-level indicators. You also have additional information, such as the province and department each district belongs to, and whether or not the district is the provincial capital.

library(openxlsx)
datafile='extra.xlsx'
datat=read.xlsx(datafile)
head(datat)
##   ubiDep ubiPro ubiDis      nomDep    nomPro         nomDis Poblacion
## 1 130000 131200 131201 LA LIBERTAD      Virú           Viru     55446
## 2 190000 190100 190101       PASCO     Pasco    Chaupimarca     27731
## 3 080000 080800 080803       CUSCO   Espinar      Coporaque     17260
## 4 060000 060500 060501   CAJAMARCA Contumazá      Contumaza      9033
## 5 050000 050200 050203    AYACUCHO  Cangallo Los Morochucos      8094
## 6 060000 060100 060112   CAJAMARCA Cajamarca       San Juan      5156
##   EsperanzaVida SecundariaCompleta Educación25mas IngresoFamiliarPC
## 1      75.08095           46.93850       7.098978          448.0094
## 2      72.87408           80.49536      10.659090          623.7036
## 3      68.19872           30.33886       4.004448          173.0789
## 4      70.83607           31.64180       6.122623          371.1274
## 5      76.56993           23.89175       4.573024          211.6296
## 6      71.66868           21.56563       4.494292          237.0323
##   capitalProv
## 1          SI
## 2          SI
## 3          NO
## 4          SI
## 5          NO
## 6          NO

Checking which columns are numeric and which are not:

str(datat)
## 'data.frame':    1834 obs. of  12 variables:
##  $ ubiDep            : chr  "130000" "190000" "080000" "060000" ...
##  $ ubiPro            : chr  "131200" "190100" "080800" "060500" ...
##  $ ubiDis            : chr  "131201" "190101" "080803" "060501" ...
##  $ nomDep            : chr  "LA LIBERTAD" "PASCO" "CUSCO" "CAJAMARCA" ...
##  $ nomPro            : chr  "Virú" "Pasco" "Espinar" "Contumazá" ...
##  $ nomDis            : chr  "Viru" "Chaupimarca" "Coporaque" "Contumaza" ...
##  $ Poblacion         : num  55446 27731 17260 9033 8094 ...
##  $ EsperanzaVida     : num  75.1 72.9 68.2 70.8 76.6 ...
##  $ SecundariaCompleta: num  46.9 80.5 30.3 31.6 23.9 ...
##  $ Educación25mas    : num  7.1 10.66 4 6.12 4.57 ...
##  $ IngresoFamiliarPC : num  448 624 173 371 212 ...
##  $ capitalProv       : chr  "SI" "SI" "NO" "SI" ...

Question 1:

Do not worry if the data has missing values. If all the available numeric information is used to organize the districts into groups:

p1=datat
p1_scaled=scale(p1[,c(7:11)])   # standardize the five numeric indicators
library(NbClust)
nbP1 <- NbClust(p1_scaled, method = "complete")   # evaluate many indices with complete-linkage clustering

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 3 proposed 9 as the best number of clusters 
## * 1 proposed 11 as the best number of clusters 
## * 2 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

We see above that there should be 2 groups…
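
If you prefer to tabulate the votes yourself rather than read the printed summary, the Best.nc slot of the NbClust result can be cross-checked (a minimal sketch, assuming the usual structure of the nbP1 object created above):

votos <- nbP1$Best.nc["Number_clusters", ]        # number of clusters proposed by each index
sort(table(votos[votos > 0]), decreasing = TRUE)  # graphical indices report 0 and are dropped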

How many would end up badly assigned?

library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
algoritmo="hclust"
cuantosClusters=length(table(nbP1$Best.partition))   # number of clusters suggested by NbClust
solucionJerarquica1 <- eclust(p1_scaled, 
                              FUNcluster =algoritmo,
                              k = cuantosClusters,
                              method = "complete", # same linkage as in NbClust!
                              graph = FALSE) 
widths <-solucionJerarquica1$silinfo$widths   # silhouette width for every district
nrow(widths[widths$sil_width<0,])             # negative silhouette = poorly assigned
## [1] 59

There are 59; note that ggplot2 was loaded.
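
A quick way to see where those 59 districts fall is to plot the silhouette widths; observations with negative bars sit closer to the neighbouring cluster than to their own (a sketch, reusing the factoextra object already built):

library(factoextra)
fviz_silhouette(solucionJerarquica1)   # negative bars = poorly assigned districts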

Question 2:

INSTRUCTIONS: In this exercise work only with the districts that are not provincial capitals. Once you have them, check for missing values; if there are any, replace them with the column median. If you want to organize this whole subset of districts into groups:

p2=datat
p2sub=p2[p2$capitalProv=='NO',]   # keep only districts that are not provincial capitals

subTable=p2sub[,c(7:11)]          # numeric indicators only
for(i in 1:ncol(subTable)){       # for each column:
  MEDIANA=median(subTable[,i], na.rm = TRUE)              # column median, ignoring NAs
  subTable[is.na(subTable[,i]), i] <- round(MEDIANA,0)    # replace NAs with the (rounded) median
}
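
Before scaling, it is worth confirming that the loop really removed every missing value (a minimal check on the imputed table):

colSums(is.na(subTable))   # should be 0 for every column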

p2_scaled=scale(subTable)
library(NbClust)
nbP2 <- NbClust(p2_scaled, method = "complete")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 3 proposed 3 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 5 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 11 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

It comes out as 2, the same as in the previous question!

And how many come out badly assigned?

library(factoextra)
algoritmo="hclust"
cuantosClusters=length(table(nbP2$Best.partition))
solucionJerarquica1 <- eclust(p2_scaled, 
                              FUNcluster =algoritmo,
                              k = cuantosClusters,
                              method = "complete", # same linkage as in NbClust!
                              graph = FALSE) 
widths <-solucionJerarquica1$silinfo$widths
nrow(widths[widths$sil_width<0,])
## [1] 36

Question 3:

In this exercise, first check whether there are missing values. If there are, replace them with the column mean.

  • IMPUTING:
p3=datat

subTable=p3[,c(7:11)]
for(i in 1:ncol(subTable)){  # for each column:
  MEDIA=mean(subTable[,i], na.rm = TRUE) # compute the mean of that column, ignoring NAs
  subTable[is.na(subTable[,i]), i] <- round(MEDIA,0) # replace NAs in that column with the (rounded) mean
}
  • IS THE DATA ADEQUATE?
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
matrizCor <- cor(subTable)  # numeric variables require Pearson
KMO(matrizCor)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = matrizCor)
## Overall MSA =  0.71
## MSA for each item = 
##          Poblacion      EsperanzaVida SecundariaCompleta 
##               0.93               0.90               0.71 
##     Educación25mas  IngresoFamiliarPC 
##               0.63               0.70

DO WE HAVE AN IDENTITY MATRIX? NO

cortest.bartlett(matrizCor, n=nrow(subTable))
## $chisq
## [1] 4208.446
## 
## $p.value
## [1] 0
## 
## $df
## [1] 10

HOW MUCH VARIANCE DO 2 LATENT VARIABLES EXPLAIN?

resultadoPr=principal(matrizCor,2,rotate="varimax", scores=T)
print(resultadoPr,digits=3,cut = 0.4)
## Principal Components Analysis
## Call: principal(r = matrizCor, nfactors = 2, rotate = "varimax", scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                      RC1   RC2    h2     u2  com
## Poblacion                0.782 0.630 0.3703 1.06
## EsperanzaVida            0.699 0.532 0.4681 1.18
## SecundariaCompleta 0.894       0.801 0.1987 1.01
## Educación25mas     0.903       0.908 0.0922 1.22
## IngresoFamiliarPC  0.764 0.444 0.781 0.2185 1.61
## 
##                         RC1   RC2
## SS loadings           2.261 1.391
## Proportion Var        0.452 0.278
## Cumulative Var        0.452 0.730
## Proportion Explained  0.619 0.381
## Cumulative Proportion 0.619 1.000
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.138 
## 
## Fit based upon off diagonal values = 0.915

Explained = 0.730 (the cumulative variance of the two rotated components).
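
That figure can also be pulled straight from the psych object instead of reading the printout (a sketch, assuming the usual structure of a principal() result):

resultadoPr$Vaccounted["Cumulative Var", ]   # cumulative proportion of variance per rotated component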

Question 4:

And if we drop per-capita family income (IngresoFamiliarPC):

library(psych)
matrizCor <- cor(subTable[,-5])  # numeric variables require Pearson
KMO(matrizCor)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = matrizCor)
## Overall MSA =  0.6
## MSA for each item = 
##          Poblacion      EsperanzaVida SecundariaCompleta 
##               0.77               0.76               0.57 
##     Educación25mas 
##               0.56

DO WE HAVE AN IDENTITY MATRIX? NO

cortest.bartlett(matrizCor, n=nrow(subTable))
## $chisq
## [1] 1993.336
## 
## $p.value
## [1] 0
## 
## $df
## [1] 6

HOW MUCH VARIANCE DO 2 LATENT VARIABLES EXPLAIN?

resultadoPr=principal(matrizCor,2,rotate="varimax", scores=T)
print(resultadoPr,digits=3,cut = 0.4)
## Principal Components Analysis
## Call: principal(r = matrizCor, nfactors = 2, rotate = "varimax", scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                      RC1   RC2    h2    u2  com
## Poblacion                0.771 0.617 0.383 1.08
## EsperanzaVida            0.736 0.573 0.427 1.12
## SecundariaCompleta 0.943       0.899 0.101 1.02
## Educación25mas     0.875       0.868 0.132 1.26
## 
##                         RC1   RC2
## SS loadings           1.709 1.248
## Proportion Var        0.427 0.312
## Cumulative Var        0.427 0.739
## Proportion Explained  0.578 0.422
## Cumulative Proportion 0.578 1.000
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.17 
## 
## Fit based upon off diagonal values = 0.822

Question 5:

Transform the variables ‘IngresoFamiliarPC’ and ‘EsperanzaVida’ into ordinal variables. In both cases create only three levels: use ‘bajo’, ‘medio’, and ‘alto’ for the first, and ‘malo’, ‘regular’, ‘bueno’ for the second.

p4=datat
etiquetas1=c('bajo','medio','alto')
p4$IngresoFamiliarPC_O=cut(p4$IngresoFamiliarPC,
                           breaks=3,
                           labels=etiquetas1,
                           ordered_result = T)

etiquetas2=c('malo','regular','bueno')
p4$EsperanzaVida_O =cut(p4$EsperanzaVida,
                           breaks=3,
                           labels=etiquetas2,
                           ordered_result = T)

Did it work?

summary(p4[,c(13,14)])
##  IngresoFamiliarPC_O EsperanzaVida_O
##  bajo :1633          malo   :142    
##  medio: 151          regular:962    
##  alto :   8          bueno  :730    
##  NA's :  42
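
The very unbalanced counts for IngresoFamiliarPC_O are expected: cut() with breaks=3 splits the observed range into three intervals of equal width, so a right-skewed income variable piles most districts into the lowest bin (and the 42 NA’s are simply carried over). The actual cut points can be inspected like this (a small sketch):

levels(cut(p4$IngresoFamiliarPC, breaks = 3))   # the three equal-width intervals behind 'bajo','medio','alto'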

  • Table:

tablaTE=table(p4$IngresoFamiliarPC_O,p4$EsperanzaVida_O)
  • is there a relationship?
chisq.test(tablaTE,simulate.p.value = T)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  tablaTE
## X-squared = 117.72, df = NA, p-value = 0.0004998
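
The test only says the two ordinal variables are not independent; to see the direction of the association, the row profiles can be inspected before the correspondence analysis (a quick sketch):

round(prop.table(tablaTE, margin = 1), 2)   # distribution of EsperanzaVida_O within each income level
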
  • how are the categories related?
library(ca)
tablaCA_te=ca(tablaTE)
plot.ca(tablaCA_te, col=c("red","blue"))
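
The biplot reading below can be backed with numbers via the coordinates and contributions reported by summary() for the ca object (sketch):

summary(tablaCA_te)   # principal coordinates and contributions for rows and columns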

  • ‘regular’ and ‘malo’ show no association with ‘bajo’: they lie very close to the origin (‘0’).
  • ‘bueno’ is directly associated with ‘medio’ and ‘alto’.