The data you have contains several district-level indicators. You also have additional information, such as the province and department each district belongs to, and whether or not the district is the capital of its province.
library(openxlsx)
datafile='extra.xlsx'
datat=read.xlsx(datafile)
head(datat)
## ubiDep ubiPro ubiDis nomDep nomPro nomDis Poblacion
## 1 130000 131200 131201 LA LIBERTAD Virú Viru 55446
## 2 190000 190100 190101 PASCO Pasco Chaupimarca 27731
## 3 080000 080800 080803 CUSCO Espinar Coporaque 17260
## 4 060000 060500 060501 CAJAMARCA Contumazá Contumaza 9033
## 5 050000 050200 050203 AYACUCHO Cangallo Los Morochucos 8094
## 6 060000 060100 060112 CAJAMARCA Cajamarca San Juan 5156
## EsperanzaVida SecundariaCompleta Educación25mas IngresoFamiliarPC
## 1 75.08095 46.93850 7.098978 448.0094
## 2 72.87408 80.49536 10.659090 623.7036
## 3 68.19872 30.33886 4.004448 173.0789
## 4 70.83607 31.64180 6.122623 371.1274
## 5 76.56993 23.89175 4.573024 211.6296
## 6 71.66868 21.56563 4.494292 237.0323
## capitalProv
## 1 SI
## 2 SI
## 3 NO
## 4 SI
## 5 NO
## 6 NO
Checking which columns are numeric and which are not:
str(datat)
## 'data.frame': 1834 obs. of 12 variables:
## $ ubiDep : chr "130000" "190000" "080000" "060000" ...
## $ ubiPro : chr "131200" "190100" "080800" "060500" ...
## $ ubiDis : chr "131201" "190101" "080803" "060501" ...
## $ nomDep : chr "LA LIBERTAD" "PASCO" "CUSCO" "CAJAMARCA" ...
## $ nomPro : chr "Virú" "Pasco" "Espinar" "Contumazá" ...
## $ nomDis : chr "Viru" "Chaupimarca" "Coporaque" "Contumaza" ...
## $ Poblacion : num 55446 27731 17260 9033 8094 ...
## $ EsperanzaVida : num 75.1 72.9 68.2 70.8 76.6 ...
## $ SecundariaCompleta: num 46.9 80.5 30.3 31.6 23.9 ...
## $ Educación25mas : num 7.1 10.66 4 6.12 4.57 ...
## $ IngresoFamiliarPC : num 448 624 173 371 212 ...
## $ capitalProv : chr "SI" "SI" "NO" "SI" ...
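As a complement to str(), the numeric columns can be listed by name and their missing values counted. The sketch below runs on a small invented data frame (toy and its values are hypothetical, not the contents of extra.xlsx), so it only illustrates the idea.

```r
# Toy stand-in for the district table; all values are invented.
toy <- data.frame(
  nomDis      = c("A", "B", "C", "D"),
  Poblacion   = c(55446, 27731, NA, 9033),
  Ingreso     = c(448.0, NA, 173.1, NA),
  capitalProv = c("SI", "SI", "NO", "NO"),
  stringsAsFactors = FALSE
)

# Which columns are numeric?
is_num <- sapply(toy, is.numeric)
numeric_cols <- names(toy)[is_num]

# Number of missing values in each numeric column.
missing_counts <- colSums(is.na(toy[, is_num, drop = FALSE]))

print(numeric_cols)
print(missing_counts)
```

On the real table the same two calls would be applied to datat instead of toy.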
Do not worry if the data has missing values. If all the available numeric information is used to organize the districts into groups:
p1=datat
p1_scaled=scale(p1[,c(7:11)])
library(NbClust)
nbP1 <- NbClust(p1_scaled, method = "complete")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 2 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
As shown above, the suggestion is 2 groups.
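The "majority rule" in the summary is a tally of how many indices proposed each number of clusters. A minimal sketch of that tally, using the counts printed above (the vector votes is typed in by hand here, not read from the nbP1 object):

```r
# Votes copied from the NbClust summary: name = number of clusters,
# value = how many indices proposed it.
votes <- c("2" = 9, "3" = 5, "5" = 1, "6" = 2,
           "7" = 1, "9" = 3, "11" = 1, "15" = 2)

total_indices <- sum(votes)                      # indices that cast a vote
best_k <- as.integer(names(which.max(votes)))    # most voted number of clusters
share <- votes[["2"]] / total_indices            # fraction backing k = 2

print(best_k)
print(total_indices)
print(round(share, 3))
```

The winner is the most frequent proposal, which here means 9 of 24 indices, so it is a plurality rather than an absolute majority.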
How many districts would end up poorly assigned?
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
algoritmo="hclust"
cuantosClusters=length(table(nbP1$Best.partition))
solucionJerarquica1 <- eclust(p1_scaled,
FUNcluster =algoritmo,
k = cuantosClusters,
method = "complete", # same linkage as in NbClust
graph = FALSE)
widths <-solucionJerarquica1$silinfo$widths
nrow(widths[widths$sil_width<0,])
## [1] 59
There are 59. Note that loading factoextra also loaded ggplot2.
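A negative silhouette width means an observation is, on average, closer to the members of a neighbouring cluster than to those of its own, which is why such rows are counted as poorly assigned. The sketch below reproduces the count with base hclust and cluster::silhouette on invented data, assuming eclust's silinfo follows this standard silhouette definition.

```r
library(cluster)  # recommended package, ships with standard R installations

set.seed(1)
# Two invented, partly overlapping groups in two dimensions (30 rows each).
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 2), ncol = 2))
toy_scaled <- scale(toy)

d <- dist(toy_scaled)                     # Euclidean distances
tree <- hclust(d, method = "complete")    # complete linkage, as in the text
groups <- cutree(tree, k = 2)             # cut the tree into 2 clusters

sil <- silhouette(groups, d)              # one silhouette width per row
n_negative <- sum(sil[, "sil_width"] < 0) # rows closer to the other cluster

print(n_negative)
```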
INSTRUCTIONS: In this exercise, work only with the districts that are not provincial capitals. Once you have them, check whether there are missing values; if there are, replace them with the column median. If this whole subset of districts is to be organized into groups:
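The two preparation steps (keep the non-capital districts, then fill each gap with the column median) can be sketched on a small invented data frame. The names toy, non_capital and numeric_cols are hypothetical, and this sketch inserts the median as is, without rounding it.

```r
# Invented rows; only the column names echo the real table.
toy <- data.frame(
  capitalProv = c("SI", "NO", "NO", "NO", "NO"),
  Poblacion   = c(55446, 17260, NA, 8094, 5156),
  Ingreso     = c(448.0, 173.1, 211.6, NA, 237.0),
  stringsAsFactors = FALSE
)

# Keep only districts that are not provincial capitals.
non_capital <- toy[toy$capitalProv == "NO", ]

# Replace NAs in every numeric column with that column's median.
numeric_cols <- names(non_capital)[sapply(non_capital, is.numeric)]
for (col in numeric_cols) {
  med <- median(non_capital[[col]], na.rm = TRUE)
  non_capital[[col]][is.na(non_capital[[col]])] <- med
}

print(non_capital)
```

Computing the median after subsetting matters: the fill value then reflects only the non-capital districts.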
p2=datat
p2sub=p2[p2$capitalProv=='NO',]
subTable=p2sub[,c(7:11)]
for(i in 1:ncol(subTable)){
MEDIANA=median(subTable[,i], na.rm = TRUE)
subTable[is.na(subTable[,i]), i] <- round(MEDIANA,0)
}
p2_scaled=scale(subTable)
library(NbClust)
nbP2 <- NbClust(p2_scaled, method = "complete")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 3 proposed 3 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 5 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
The result is 2, the same as in the previous question.
And how many are poorly assigned?
library(factoextra)
algoritmo="hclust"
cuantosClusters=length(table(nbP2$Best.partition))
solucionJerarquica1 <- eclust(p2_scaled,
FUNcluster =algoritmo,
k = cuantosClusters,
method = "complete", # same linkage as in NbClust
graph = FALSE)
widths <-solucionJerarquica1$silinfo$widths
nrow(widths[widths$sil_width<0,])
## [1] 36
In this exercise, first check whether there are missing values. If there are, replace them with the column mean.
p3=datat
subTable=p3[,c(7:11)]
for(i in 1:ncol(subTable)){ # for each column:
MEDIA=mean(subTable[,i], na.rm = TRUE) # compute the mean of that column
subTable[is.na(subTable[,i]), i] <- round(MEDIA,0) # put the (rounded) mean wherever that column has an NA
}
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
matrizCor <- cor(subTable) # numeric variables call for Pearson correlation
KMO(matrizCor)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = matrizCor)
## Overall MSA = 0.71
## MSA for each item =
## Poblacion EsperanzaVida SecundariaCompleta
## 0.93 0.90 0.71
## Educación25mas IngresoFamiliarPC
## 0.63 0.70
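The overall MSA compares the squared correlations with the squared partial correlations: it approaches 1 when the variables share common variance and drops when most correlation is pairwise only. A minimal base-R sketch of that standard definition follows; kmo_by_hand and the simulated toy data are hypothetical, and the sketch is not meant as a replacement for psych::KMO.

```r
# Overall KMO from a correlation matrix R (standard textbook definition).
kmo_by_hand <- function(R) {
  P <- solve(R)                                  # inverse correlation matrix
  partial <- -P / sqrt(outer(diag(P), diag(P)))  # partial correlations
  off <- row(R) != col(R)                        # off-diagonal mask
  r2 <- sum(R[off]^2)                            # squared correlations
  q2 <- sum(partial[off]^2)                      # squared partial correlations
  r2 / (r2 + q2)
}

set.seed(1)
f <- rnorm(200)                                  # one invented common factor
toy <- cbind(a = f + rnorm(200),
             b = f + rnorm(200),
             c = f + rnorm(200))

overall_msa <- kmo_by_hand(cor(toy))
print(round(overall_msa, 2))
```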
Is the correlation matrix an identity matrix? No, Bartlett's test rejects that hypothesis:
cortest.bartlett(matrizCor, n=nrow(subTable))
## $chisq
## [1] 4208.446
##
## $p.value
## [1] 0
##
## $df
## [1] 10
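Bartlett's test of sphericity uses the statistic $-(n - 1 - (2p + 5)/6)\,\ln|R|$ with $p(p-1)/2$ degrees of freedom, where $p$ is the number of variables; for $p = 5$ that gives the 10 degrees of freedom shown above. A p-value printed as 0 means it is too small to be represented, not that it is exactly zero. The sketch below implements the formula in base R on invented data (bartlett_sphericity is a hypothetical helper, not the psych function).

```r
# Bartlett's test of sphericity from a correlation matrix R and sample size n.
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

set.seed(1)
f <- rnorm(300)                                   # invented common factor
toy <- sapply(1:5, function(i) f + rnorm(300))    # five correlated columns

res <- bartlett_sphericity(cor(toy), n = nrow(toy))
identity_res <- bartlett_sphericity(diag(5), n = 300)  # perfectly uncorrelated case

print(res$df)
print(res$p.value)
print(identity_res$p.value)
```

With an exact identity matrix the determinant is 1, the statistic is 0 and the p-value is 1, which is the situation the test is designed to rule out.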
resultadoPr=principal(matrizCor,2,rotate="varimax", scores=T)
print(resultadoPr,digits=3,cut = 0.4)
## Principal Components Analysis
## Call: principal(r = matrizCor, nfactors = 2, rotate = "varimax", scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## Poblacion 0.782 0.630 0.3703 1.06
## EsperanzaVida 0.699 0.532 0.4681 1.18
## SecundariaCompleta 0.894 0.801 0.1987 1.01
## Educación25mas 0.903 0.908 0.0922 1.22
## IngresoFamiliarPC 0.764 0.444 0.781 0.2185 1.61
##
## RC1 RC2
## SS loadings 2.261 1.391
## Proportion Var 0.452 0.278
## Cumulative Var 0.452 0.730
## Proportion Explained 0.619 0.381
## Cumulative Proportion 0.619 1.000
##
## Mean item complexity = 1.2
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.138
##
## Fit based upon off diagonal values = 0.915
Variance explained by the two components = 0.730 (the Cumulative Var row above).
And if per capita family income (IngresoFamiliarPC) is dropped:
library(psych)
matrizCor <- cor(subTable[,-5]) # numeric variables call for Pearson correlation
KMO(matrizCor)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = matrizCor)
## Overall MSA = 0.6
## MSA for each item =
## Poblacion EsperanzaVida SecundariaCompleta
## 0.77 0.76 0.57
## Educación25mas
## 0.56
Is the correlation matrix an identity matrix? No, Bartlett's test rejects that hypothesis:
cortest.bartlett(matrizCor, n=nrow(subTable))
## $chisq
## [1] 1993.336
##
## $p.value
## [1] 0
##
## $df
## [1] 6
resultadoPr=principal(matrizCor,2,rotate="varimax", scores=T)
print(resultadoPr,digits=3,cut = 0.4)
## Principal Components Analysis
## Call: principal(r = matrizCor, nfactors = 2, rotate = "varimax", scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## Poblacion 0.771 0.617 0.383 1.08
## EsperanzaVida 0.736 0.573 0.427 1.12
## SecundariaCompleta 0.943 0.899 0.101 1.02
## Educación25mas 0.875 0.868 0.132 1.26
##
## RC1 RC2
## SS loadings 1.709 1.248
## Proportion Var 0.427 0.312
## Cumulative Var 0.427 0.739
## Proportion Explained 0.578 0.422
## Cumulative Proportion 0.578 1.000
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.17
##
## Fit based upon off diagonal values = 0.822
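The Cumulative Var figure in each output is the sum of the SS loadings divided by the number of variables, because each standardized variable contributes one unit of variance. The arithmetic for both solutions can be checked directly; the numbers below are copied by hand from the two outputs.

```r
# SS loadings copied from the two principal() outputs.
ss_five_vars <- c(RC1 = 2.261, RC2 = 1.391)   # all five variables
ss_four_vars <- c(RC1 = 1.709, RC2 = 1.248)   # without IngresoFamiliarPC

cum_var_five <- sum(ss_five_vars) / 5         # share of total variance, 5 variables
cum_var_four <- sum(ss_four_vars) / 4         # share of total variance, 4 variables

print(round(cum_var_five, 3))
print(round(cum_var_four, 3))
```

Dropping IngresoFamiliarPC raises the explained share slightly (0.730 to 0.739), while the overall MSA falls from 0.71 to 0.6 and the off-diagonal fit from 0.915 to 0.822.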
Transform the variables 'IngresoFamiliarPC' and 'EsperanzaVida' into ordinal variables. In both cases use only three levels: 'bajo', 'medio' and 'alto' for the first, and 'malo', 'regular' and 'bueno' for the second.
p4=datat
etiquetas1=c('bajo','medio','alto')
p4$IngresoFamiliarPC_O=cut(p4$IngresoFamiliarPC,
breaks=3,
labels=etiquetas1,
ordered_result = T)
etiquetas2=c('malo','regular','bueno')
p4$EsperanzaVida_O =cut(p4$EsperanzaVida,
breaks=3,
labels=etiquetas2,
ordered_result = T)
Did it work?
summary(p4[,c(13,14)])
## IngresoFamiliarPC_O EsperanzaVida_O
## bajo :1633 malo :142
## medio: 151 regular:962
## alto : 8 bueno :730
## NA's : 42
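The lopsided counts follow from how cut() behaves with breaks = 3: it splits the range of the variable into three intervals of equal width, not into three groups of similar size, so a few very large values push almost everything into the lowest level. Missing inputs stay NA. The sketch below contrasts equal-width bins with tercile bins on an invented, strongly skewed vector; the tercile variant is shown only for comparison and is not what the exercise asks for.

```r
x <- c(1, 2, 3, 4, 100)                 # invented, strongly right-skewed values
labels3 <- c("bajo", "medio", "alto")

# Equal-width bins: the range is split into three intervals of equal length.
by_width <- cut(x, breaks = 3, labels = labels3, ordered_result = TRUE)

# Tercile bins: cut points at the 1/3 and 2/3 quantiles instead.
cuts <- quantile(x, probs = c(0, 1/3, 2/3, 1))
by_tercile <- cut(x, breaks = cuts, labels = labels3,
                  include.lowest = TRUE, ordered_result = TRUE)

print(table(by_width))
print(table(by_tercile))
```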
Contingency table:
tablaTE=table(p4$IngresoFamiliarPC_O,p4$EsperanzaVida_O)
chisq.test(tablaTE,simulate.p.value = T)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: tablaTE
## X-squared = 117.72, df = NA, p-value = 0.0004998
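Simulating the p-value is a reasonable choice when some cells have small expected counts, as is likely with only 8 districts in 'alto'. With 2000 replicates the smallest p-value the test can report is 1/(2000 + 1), about 0.0004998, so the value above means none of the simulated tables reached the observed statistic. The sketch below shows both points on an invented 3x3 table (tab is hypothetical, not tablaTE).

```r
# Invented 3x3 table with one sparse row, loosely mimicking the 'alto' level.
tab <- matrix(c(120, 700, 500,
                 10,  60,  80,
                  1,   2,   5),
              nrow = 3, byrow = TRUE,
              dimnames = list(ingreso = c("bajo", "medio", "alto"),
                              vida    = c("malo", "regular", "bueno")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
n_small <- sum(expected < 5)   # cells where the chi-squared approximation is shaky

# Smallest p-value a Monte Carlo test with B replicates can report.
B <- 2000
p_floor <- 1 / (B + 1)

print(round(expected, 1))
print(n_small)
print(p_floor)
```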
library(ca)
tablaCA_te=ca(tablaTE)
plot.ca(tablaCA_te, col=c("red","blue"))
- 'regular' and 'malo' show no relationship with 'bajo', since they lie very close to the origin ('0').
- 'bueno' is directly related to 'medio' and 'alto'.
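To make the reading of the map more concrete: in correspondence analysis a category plotted near the origin has a profile close to the average profile, so it contributes little to the association. The coordinates come from a singular value decomposition of the standardized residuals of the table. The base-R sketch below computes them for an invented table (tab is hypothetical), assuming the default symmetric map of the ca package plots principal coordinates for rows and columns.

```r
# Invented 3x3 contingency table; not the tablaTE from the exercise.
tab <- matrix(c(120, 700, 500,
                 10,  60,  80,
                  1,   2,   5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("bajo", "medio", "alto"),
                              c("malo", "regular", "bueno")))

P <- tab / sum(tab)        # correspondence matrix
r <- rowSums(P)            # row masses
c_m <- colSums(P)          # column masses

# Standardized residuals from independence, then their SVD.
S <- diag(1 / sqrt(r)) %*% (P - outer(r, c_m)) %*% diag(1 / sqrt(c_m))
dec <- svd(S)

# Principal coordinates of rows and columns.
row_coords <- diag(1 / sqrt(r)) %*% dec$u %*% diag(dec$d)
col_coords <- diag(1 / sqrt(c_m)) %*% dec$v %*% diag(dec$d)
rownames(row_coords) <- rownames(tab)
rownames(col_coords) <- colnames(tab)

total_inertia <- sum(dec$d^2)   # equals the chi-squared statistic divided by n

print(round(row_coords[, 1:2], 3))
print(round(col_coords[, 1:2], 3))
print(total_inertia)
```

The total inertia ties the map back to the chi-squared test: it is the test statistic divided by the number of observations.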