CASE 1: Peru's Education System

We inspect the data

educacion=read.csv("https://raw.githubusercontent.com/VictorGuevaraP/Mineria-de-datos-2020/master/Sistema_Educativo_Peru.csv",sep=";")
head(educacion)
##   region  departam almain_1 almapr_2 almase_3 taasi_17 taasp_18 taass_19
## 1  Selva  Amazonas    11235    81839    31688     46.0     79.7     55.1
## 2 Sierra    Ancash    32254   185875    95346     55.5     87.7     59.4
## 3 Sierra  Apurimac    14728    94015    44724     45.3     88.8     63.6
## 4  Costa  Arequipa    31601   143318   104735     44.8     94.5     81.3
## 5 Sierra  Ayacucho    17173   127033    55667     39.6     89.9     57.7
## 6 Sierra Cajamarca    33304   280254   107485     41.6     90.0     50.0
##   taefp_22 taefs_23 doin_24 dopr_25 dose_26 cein_29 cepr_30 cese_31
## 1     77.9     80.5     505    3272    1718     290    1100     199
## 2     79.6     80.9    1627    8460    6930     834    1781     478
## 3     77.4     79.0     609    3711    2277     361     857     207
## 4     90.5     89.3    1969    7534    6869     910    1107     442
## 5     75.1     78.9     816    5585    3502     397    1392     298
## 6     79.1     80.5    1718   11260    6154     894    3532     642
summary(educacion)
##     region            departam            almain_1        almapr_2     
##  Length:23          Length:23          Min.   : 3298   Min.   : 15104  
##  Class :character   Class :character   1st Qu.:11794   1st Qu.: 86892  
##  Mode  :character   Mode  :character   Median :20469   Median :131746  
##                                        Mean   :21127   Mean   :135380  
##                                        3rd Qu.:29979   3rd Qu.:198012  
##                                        Max.   :41212   Max.   :280254  
##     almase_3         taasi_17        taasp_18        taass_19    
##  Min.   :  9361   Min.   :24.90   Min.   :79.70   Min.   :48.60  
##  1st Qu.: 35865   1st Qu.:41.00   1st Qu.:89.15   1st Qu.:57.80  
##  Median : 57783   Median :45.30   Median :90.40   Median :63.60  
##  Mean   : 68830   Mean   :48.09   Mean   :90.51   Mean   :65.46  
##  3rd Qu.:103454   3rd Qu.:55.95   3rd Qu.:93.25   3rd Qu.:74.70  
##  Max.   :141192   Max.   :76.00   Max.   :95.50   Max.   :86.50  
##     taefp_22        taefs_23        doin_24        dopr_25         dose_26    
##  Min.   :74.10   Min.   :77.50   Min.   : 157   Min.   :  611   Min.   : 541  
##  1st Qu.:78.50   1st Qu.:80.30   1st Qu.: 559   1st Qu.: 3227   1st Qu.:2020  
##  Median :84.00   Median :83.10   Median : 927   Median : 5313   Median :3502  
##  Mean   :83.22   Mean   :83.26   Mean   :1053   Mean   : 5569   Mean   :4119  
##  3rd Qu.:86.90   3rd Qu.:85.30   3rd Qu.:1543   3rd Qu.: 8014   3rd Qu.:6512  
##  Max.   :93.10   Max.   :90.90   Max.   :2283   Max.   :11260   Max.   :7907  
##     cein_29         cepr_30          cese_31     
##  Min.   : 75.0   Min.   : 164.0   Min.   : 45.0  
##  1st Qu.:253.0   1st Qu.: 685.5   1st Qu.:189.5  
##  Median :437.0   Median :1185.0   Median :264.0  
##  Mean   :489.0   Mean   :1252.0   Mean   :300.7  
##  3rd Qu.:775.5   3rd Qu.:1805.5   3rd Qu.:454.5  
##  Max.   :910.0   Max.   :3532.0   Max.   :642.0
dim(educacion)
## [1] 23 16

We drop the first two variables (region and departam) because they are not numeric

educacion_prep=educacion[,3:16]
head(educacion_prep)
##   almain_1 almapr_2 almase_3 taasi_17 taasp_18 taass_19 taefp_22 taefs_23
## 1    11235    81839    31688     46.0     79.7     55.1     77.9     80.5
## 2    32254   185875    95346     55.5     87.7     59.4     79.6     80.9
## 3    14728    94015    44724     45.3     88.8     63.6     77.4     79.0
## 4    31601   143318   104735     44.8     94.5     81.3     90.5     89.3
## 5    17173   127033    55667     39.6     89.9     57.7     75.1     78.9
## 6    33304   280254   107485     41.6     90.0     50.0     79.1     80.5
##   doin_24 dopr_25 dose_26 cein_29 cepr_30 cese_31
## 1     505    3272    1718     290    1100     199
## 2    1627    8460    6930     834    1781     478
## 3     609    3711    2277     361     857     207
## 4    1969    7534    6869     910    1107     442
## 5     816    5585    3502     397    1392     298
## 6    1718   11260    6154     894    3532     642
dim(educacion_prep)
## [1] 23 14
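
Selecting columns 3 to 16 by position works here, but it silently breaks if the column order changes. A minimal sketch of selecting by type instead, assuming a small stand-in data frame (`educacion_demo`, a hypothetical name) built from the first three rows printed above:

```r
# Stand-in data frame: first three rows shown above, four of the sixteen columns.
educacion_demo <- data.frame(
  region   = c("Selva", "Sierra", "Sierra"),
  departam = c("Amazonas", "Ancash", "Apurimac"),
  almain_1 = c(11235, 32254, 14728),
  taasi_17 = c(46.0, 55.5, 45.3)
)

# Flag every numeric column, wherever it sits in the data frame.
is_numeric_col <- vapply(educacion_demo, is.numeric, logical(1))

# Keep only the numeric columns; drop = FALSE keeps a data frame.
educacion_demo_prep <- educacion_demo[, is_numeric_col, drop = FALSE]

names(educacion_demo_prep)
dim(educacion_demo_prep)
```

Applied to the full data, the same two lines should give the same 23 x 14 result as `educacion[,3:16]`.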

Preliminary tests

library(corrplot)  # provides corrplot()
corrplot(cor(educacion_prep))

library(PerformanceAnalytics)  # provides chart.Correlation()
chart.Correlation(educacion_prep)

Decision rule for the hypothesis tests

If p-value < significance level (0.05), reject Ho

library(psych)  # provides cortest() and KMO()
cortest(educacion_prep)
## Tests of correlation matrices 
## Call:cortest(R1 = educacion_prep)
##  Chi Square value 1863.54  with df =  91   with probability < 0

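The reported p-value (printed as 0) is below 0.05, so Ho is rejected.

Here Ho states that the correlation matrix is the identity matrix, so rejecting it means the correlations between the variables are significantly different from 0.

The variables are therefore correlated enough for principal components to be worth building before the modeling phase.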

Bartlett's test of sphericity

bartlett.test(educacion_prep)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  educacion_prep
## Bartlett's K-squared = 2645.8, df = 13, p-value < 2.2e-16

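p-value < 2.2e-16 < 0.05, so Ho is rejected.

Note that bartlett.test() from the stats package tests homogeneity of variances, as the output header shows, so this rejection only says that the 14 variables have different variances, which is expected given their different scales.

The statement that the correlation matrix differs from the identity matrix is what Bartlett's sphericity test checks; in psych that test is cortest.bartlett(), and the cortest() result above already points to the same conclusion.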
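
As an illustration, Bartlett's sphericity statistic can be computed in base R from the determinant of the correlation matrix. This is a sketch under assumptions: `sphericity_test` is a hypothetical helper name, and the built-in `mtcars` data is used as a stand-in rather than the education data.

```r
# Bartlett's sphericity test for a numeric data matrix x:
#   chi2 = -(n - 1 - (2p + 5) / 6) * log(det(R)),  df = p * (p - 1) / 2
# where R is the correlation matrix, n the number of rows, p the number of columns.
sphericity_test <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  R <- cor(x)
  chi2 <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  p_value <- pchisq(chi2, df, lower.tail = FALSE)
  list(chisq = chi2, df = df, p.value = p_value)
}

# Stand-in data (assumption): four columns of the built-in mtcars data set.
res <- sphericity_test(mtcars[, c("mpg", "disp", "hp", "wt")])
res$df
res$p.value < 0.05  # TRUE means Ho (R is the identity matrix) is rejected
```

With the 14 variables of this case, p * (p - 1) / 2 = 91, which matches the degrees of freedom reported by cortest() above.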

KMO test (Kaiser-Meyer-Olkin)

KMO(educacion_prep)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = educacion_prep)
## Overall MSA =  0.74
## MSA for each item = 
## almain_1 almapr_2 almase_3 taasi_17 taasp_18 taass_19 taefp_22 taefs_23 
##     0.79     0.88     0.77     0.54     0.51     0.80     0.49     0.43 
##  doin_24  dopr_25  dose_26  cein_29  cepr_30  cese_31 
##     0.68     0.77     0.78     0.77     0.71     0.93

Overall MSA = 0.74 > 0.5 (criterion met), although taefp_22 (0.49) and taefs_23 (0.43) fall below 0.5 individually.

Principal component analysis is therefore justified.
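
To make the MSA values less of a black box, here is a sketch of how the KMO index can be computed by hand: it compares squared correlations with squared partial correlations obtained from the inverse correlation matrix. `kmo_by_hand` is a hypothetical helper name, and `mtcars` is again a stand-in for the education data.

```r
# KMO = sum(r_ij^2) / (sum(r_ij^2) + sum(q_ij^2)) over i != j,
# where r_ij are correlations and q_ij are partial correlations.
kmo_by_hand <- function(x) {
  R <- cor(as.matrix(x))
  R_inv <- solve(R)
  # Partial correlations: rescale the inverse correlation matrix and flip the sign.
  Q <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
  # Only off-diagonal elements enter the index.
  diag(R) <- 0
  diag(Q) <- 0
  overall <- sum(R^2) / (sum(R^2) + sum(Q^2))
  per_item <- colSums(R^2) / (colSums(R^2) + colSums(Q^2))
  list(overall = overall, per_item = per_item)
}

# Stand-in data (assumption): four columns of the built-in mtcars data set.
kmo <- kmo_by_hand(mtcars[, c("mpg", "disp", "hp", "wt")])
kmo$overall
round(kmo$per_item, 2)
```

Values close to 1 mean the partial correlations are small relative to the raw correlations, which is the situation where a few common components can summarize the variables.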

CASE 2: Districts

Reading the data

distritos<-read.csv("https://gist.githubusercontent.com/BenjiSantos/33fee8a00211146958990b66f864c70b/raw/a83679c2f26e016d7e71268badd388a5356d3db6/distritos.csv", sep = ";")
summary(distritos)
##    distrito            ocu_vivi        pobpjov         sinelect     
##  Length:34          Min.   :1.030   Min.   :3.700   Min.   : 0.350  
##  Class :character   1st Qu.:1.065   1st Qu.:4.525   1st Qu.: 1.775  
##  Mode  :character   Median :1.100   Median :5.100   Median : 4.355  
##                     Mean   :1.116   Mean   :5.024   Mean   : 9.687  
##                     3rd Qu.:1.150   3rd Qu.:5.475   3rd Qu.:16.310  
##                     Max.   :1.330   Max.   :6.500   Max.   :32.930  
##     sinagua         pea1619         pocprin           peam15     
##  Min.   : 7.02   Min.   :0.100   Min.   : 1.100   Min.   :15.63  
##  1st Qu.:14.14   1st Qu.:0.725   1st Qu.: 4.125   1st Qu.:29.10  
##  Median :22.51   Median :1.700   Median : 6.650   Median :42.69  
##  Mean   :25.17   Mean   :2.162   Mean   : 7.141   Mean   :43.05  
##  3rd Qu.:36.66   3rd Qu.:3.875   3rd Qu.:10.075   3rd Qu.:59.59  
##  Max.   :65.76   Max.   :5.400   Max.   :14.400   Max.   :65.11
head(distritos)
##     distrito ocu_vivi pobpjov sinelect sinagua pea1619 pocprin peam15
## 1        Ate     1.15     5.3    27.60   51.10     3.9     1.1  63.48
## 2   Barranco     1.09     4.5     1.59    8.32     0.8     3.9  33.48
## 3      Breña     1.08     4.4     2.20   23.15     0.9     4.0  37.89
## 4 Carabayllo     1.10     5.1    30.13   38.09     4.5    12.6  63.65
## 5      Comas     1.20     5.9    10.92   24.27     3.8     9.4  60.37
## 6 Chorrillos     1.15     5.5    16.77   37.11     3.2    10.6  18.78

We standardize the data before applying K-means.

To do so, the qualitative variable (distrito) must be removed first

distritosFilter=distritos[,2:8]
distritosStandar<-scale(distritosFilter)
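
For reference, `scale()` with its defaults subtracts each column mean and divides by the column standard deviation. A small sketch that checks this by hand, assuming a stand-in data frame (`distritos_demo`, a hypothetical name) with the first three rows and three columns printed above:

```r
# Stand-in data: first three districts, three of the seven numeric columns.
distritos_demo <- data.frame(
  ocu_vivi = c(1.15, 1.09, 1.08),
  pobpjov  = c(5.3, 4.5, 4.4),
  sinelect = c(27.60, 1.59, 2.20)
)

# Standardize each column by hand: (value - mean) / standard deviation.
by_hand <- sapply(distritos_demo, function(v) (v - mean(v)) / sd(v))

# Standardize with scale() and compare (ignoring the extra attributes it stores).
scaled <- scale(distritos_demo)
all.equal(by_hand, scaled, check.attributes = FALSE)

# After standardization every column has mean 0 and standard deviation 1.
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

Standardizing matters for K-means because the algorithm uses Euclidean distances, so a variable with a large range such as sinagua would otherwise dominate one with a small range such as ocu_vivi.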

Applying the elbow method to find the optimal number of clusters

library(factoextra)  # provides fviz_nbclust()
fviz_nbclust(distritosStandar, kmeans, method = "wss")
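
The quantity behind the "wss" plot is the total within-cluster sum of squares for each candidate k. A rough sketch of that computation in base R, assuming a stand-in data frame with only the six districts printed by head(distritos) above (the seed, `nstart` value and range of k are arbitrary choices for the illustration):

```r
# Stand-in data: the six districts shown above, all seven numeric columns.
distritos_demo <- data.frame(
  ocu_vivi = c(1.15, 1.09, 1.08, 1.10, 1.20, 1.15),
  pobpjov  = c(5.3, 4.5, 4.4, 5.1, 5.9, 5.5),
  sinelect = c(27.60, 1.59, 2.20, 30.13, 10.92, 16.77),
  sinagua  = c(51.10, 8.32, 23.15, 38.09, 24.27, 37.11),
  pea1619  = c(3.9, 0.8, 0.9, 4.5, 3.8, 3.2),
  pocprin  = c(1.1, 3.9, 4.0, 12.6, 9.4, 10.6),
  peam15   = c(63.48, 33.48, 37.89, 63.65, 60.37, 18.78)
)
x <- scale(distritos_demo)

set.seed(123)  # K-means starts from random centers
ks <- 1:4

# Total within-cluster sum of squares for each number of clusters.
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

data.frame(k = ks, wss = wss)
```

The elbow is the k after which adding another cluster reduces wss only marginally. With a single cluster, wss equals the total sum of squares, which for standardized data is (n - 1) * p.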

Applying the silhouette method to find the optimal number of clusters

fviz_nbclust(distritosStandar, FUNcluster = kmeans, method = "silhouette") + theme_classic()
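
The silhouette criterion picks the k with the highest average silhouette width. A sketch of that idea using `silhouette()` from the cluster package, on synthetic data with two known groups (the data, seed and range of k are assumptions made for the illustration, not the district data):

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Synthetic stand-in: two well-separated groups of 20 points in 2 dimensions.
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)
x <- scale(x)
d <- dist(x)  # Euclidean distances between points

ks <- 2:5
# Average silhouette width of the K-means partition for each k.
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

data.frame(k = ks, avg_sil = avg_sil)
best_k <- ks[which.max(avg_sil)]
best_k
```

The silhouette width is undefined for a single cluster, which is why the candidates start at k = 2.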