Introduction

This document is part of the coursework for the Applied Multivariate Methods course (Métodos multivariados aplicados) in the Master's and Specialization programs in Statistics at the Universidad Nacional de Colombia. It gives a technical, step-by-step account of how the work was put together.

General descriptive statistics

Descriptive statistics for the categorical variables

data %>% select(VENDEDOR, NUM_DOC, FECHA, ITEM, CLIENTE, COD_SUBGRUPO, COD_DEP, DEP, COD_CIU, COD_CCO) 

 10  Variables      9898  Observations
--------------------------------------------------------------------------------
VENDEDOR 
       n  missing distinct 
    9898        0       17 

lowest : 001 002 003 005 007, highest: 030 031 032 035 036
                                                                            
Value          1     2     3     5     7     9    11    14    17    18    22
Frequency    201   290   357  1209   687    28     1  1344  2302   164  2379
Proportion 0.020 0.029 0.036 0.122 0.069 0.003 0.000 0.136 0.233 0.017 0.240
                                              
Value         27    30    31    32    35    36
Frequency    122   350     4   375    24    61
Proportion 0.012 0.035 0.000 0.038 0.002 0.006
--------------------------------------------------------------------------------
NUM_DOC 
       n  missing distinct 
    9898        0     7788 

lowest : DVV-0000260 DVV-0000261 DVV-0000262 DVV-0000263 DVV-0000318
highest: MED-37335   MED-37337   MED-37338   MED-37339   MED-37340  
--------------------------------------------------------------------------------
FECHA 
         n    missing   distinct       Info       Mean        Gmd        .05 
      9898          0        736          1 2015-07-27        364 2014-03-21 
       .10        .25        .50        .75        .90        .95 
2014-05-22 2014-10-22 2015-08-01 2016-04-25 2016-10-08 2016-11-21 

lowest : 2014-01-10 2014-01-11 2014-01-12 2014-01-13 2014-01-16
highest: 2017-01-26 2017-01-27 2017-01-28 2017-01-29 2017-01-30
--------------------------------------------------------------------------------
ITEM 
       n  missing distinct 
    9898        0      248 

lowest : 1000-0001 1000-0004 1000-0009 1100-0002 1100-0003
highest: 900-0020  900-0022  900-0024  900-0025  900-0027 
--------------------------------------------------------------------------------
CLIENTE 
       n  missing distinct 
    9898        0      371 

lowest : 10109050   102197600  1022348758 1024501802 1026553774
highest: 9815991    98535261   98584936   985849362  98665688  
--------------------------------------------------------------------------------
COD_SUBGRUPO 
       n  missing distinct 
    9898        0       24 

lowest : 001 002 003 004 005, highest: 028 040 100 101 102
--------------------------------------------------------------------------------
COD_DEP 
       n  missing distinct 
    9305      593       17 

lowest : 05 08 11 13 15, highest: 63 66 68 73 76
                                                                            
Value          5     8    11    13    15    17    19    25    41    50    52
Frequency   3421   143  2790     1    11    41     8  1066     6     1     4
Proportion 0.368 0.015 0.300 0.000 0.001 0.004 0.001 0.115 0.001 0.000 0.000
                                              
Value         54    63    66    68    73    76
Frequency     64    32   444   129    61  1083
Proportion 0.007 0.003 0.048 0.014 0.007 0.116
--------------------------------------------------------------------------------
DEP 
       n  missing distinct 
    9305      593       16 

lowest : ANTIOQUIA       ATLÁNTICO       BOGOTÁ, D.C.    BOLÍVAR         BOYACÁ         
highest: QUINDIO         RISARALDA       SANTANDER       TOLIMA          VALLE DEL CAUCA

ANTIOQUIA (3421, 0.368), ATLÁNTICO (143, 0.015), BOGOTÁ, D.C. (3856, 0.414),
BOLÍVAR (1, 0.000), BOYACÁ (11, 0.001), CALDAS (41, 0.004), CAUCA (8, 0.001),
HUILA (6, 0.001), META (1, 0.000), NARIÑO (4, 0.000), NORTE DE SANTANDER (64,
0.007), QUINDIO (32, 0.003), RISARALDA (444, 0.048), SANTANDER (129, 0.014),
TOLIMA (61, 0.007), VALLE DEL CAUCA (1083, 0.116)
--------------------------------------------------------------------------------
COD_CIU 
       n  missing distinct 
    9898        0       51 

lowest : 05000 05001 05088 05129 05237, highest: 76147 76364 76520 76834 76892
--------------------------------------------------------------------------------
COD_CCO 
       n  missing distinct 
    9898        0       12 

lowest : 001 002 003 005 006, highest: 009 010 012 014 015
                                                                            
Value          1     2     3     5     6     7     8     9    10    12    14
Frequency   7080  2802     7     1     1     1     1     1     1     1     1
Proportion 0.715 0.283 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
                
Value         15
Frequency      1
Proportion 0.000
--------------------------------------------------------------------------------

Descriptive statistics for the numeric variables

data %>% select(-c(VENDEDOR, NUM_DOC, FECHA, ITEM, CLIENTE, COD_SUBGRUPO, COD_DEP, DEP, COD_CIU, COD_CCO)) 

 7  Variables      9898  Observations
--------------------------------------------------------------------------------
AÑO 
       n  missing distinct     Info     Mean      Gmd 
    9898        0        4    0.893     2015    0.917 
                                  
Value       2014  2015  2016  2017
Frequency   3160  3254  3354   130
Proportion 0.319 0.329 0.339 0.013
--------------------------------------------------------------------------------
MES 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    9898        0       12    0.993     6.85     3.83        1        2 
     .25      .50      .75      .90      .95 
       4        7       10       11       12 

lowest :  1  2  3  4  5, highest:  8  9 10 11 12
                                                                            
Value          1     2     3     4     5     6     7     8     9    10    11
Frequency    552   712   762   882   791   765   922   926   853   983   949
Proportion 0.056 0.072 0.077 0.089 0.080 0.077 0.093 0.094 0.086 0.099 0.096
                
Value         12
Frequency    801
Proportion 0.081
--------------------------------------------------------------------------------
DIA_PLA 
       n  missing distinct     Info     Mean      Gmd 
    9898        0        9    0.892       41       27 

lowest :  0  1  8 15 30, highest: 30 45 60 75 90
                                                                
Value          0     1     8    15    30    45    60    75    90
Frequency   1793     6     6    52  2915   385  4204    33   504
Proportion 0.181 0.001 0.001 0.005 0.295 0.039 0.425 0.003 0.051
--------------------------------------------------------------------------------
CANTIDAD 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    9898        0      272    0.995      105      142      3.0      9.7 
     .25      .50      .75      .90      .95 
    25.0     42.0    100.0    200.0    400.0 

lowest : -2000 -1000  -760  -600  -500, highest:  4000  4100  4500  4994  7500
--------------------------------------------------------------------------------
PRE_TOT 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    9898        0     4071        1  1798791  2777935    76948   132500 
     .25      .50      .75      .90      .95 
  212000   450000  1166782  3547008  7072510 

lowest : -20320816 -16082100 -11235194 -10494000  -7009593
highest:  68350560  71879720 124830404 158253480 160170648
--------------------------------------------------------------------------------
TRM 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    9898        0      693        1     1900      131     1772     1780 
     .25      .50      .75      .90      .95 
    1802     1882     1934     2049     2158 

lowest : 1754.89 1757.24 1758.03 1758.38 1758.45
highest: 2414.39 2423.56 2438.79 2442.03 2446.35
--------------------------------------------------------------------------------
PRE_TOT_US 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    9898        0     9180        1      941     1450     40.0     67.9 
     .25      .50      .75      .90      .95 
   111.5    232.8    606.0   1831.2   3648.3 

lowest : -10569.665  -8957.391  -5841.957  -5122.148  -3883.130
highest:  35498.808  38688.067  64779.996  66383.999  66383.999
--------------------------------------------------------------------------------

Cluster analysis

This section carries out the cluster analysis using several techniques, both hierarchical and non-hierarchical.

Clustering of the numeric variables
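The hierarchical methods named in the subsections that follow differ only in the linkage criterion used to merge groups. As a minimal sketch of that idea, the block below runs base R's `hclust` with the four linkages on synthetic data. The data frame `num_data`, its values, the standardization step, and the choice of two groups are assumptions made for illustration; they are not taken from the report's data or code.

```r
# Synthetic stand-in for two numeric columns (hypothetical values, not the report's data).
set.seed(123)
num_data <- data.frame(
  CANTIDAD   = c(rnorm(20, mean = 50,  sd = 10), rnorm(20, mean = 400,  sd = 50)),
  PRE_TOT_US = c(rnorm(20, mean = 200, sd = 40), rnorm(20, mean = 3000, sd = 300))
)

# Standardize so that no single variable dominates the Euclidean distance (assumed step).
scaled <- scale(num_data)
d <- dist(scaled, method = "euclidean")

# One tree per linkage: nearest neighbor (single), farthest neighbor (complete),
# group average (average), and Ward's criterion (ward.D2).
linkages <- c(single = "single", complete = "complete",
              average = "average", ward = "ward.D2")
trees <- lapply(linkages, function(m) hclust(d, method = m))

# Cut every tree into two groups and compare the resulting group sizes.
groups <- lapply(trees, cutree, k = 2)
sizes <- sapply(groups, function(g) as.vector(table(g)))
print(sizes)
```

Calling `plot(trees$ward)` would draw the dendrogram for one linkage; comparing the dendrograms side by side is a common way to see how sensitive the grouping is to the linkage choice.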

Cluster analysis using nearest neighbors (single linkage)

Cluster analysis using farthest neighbors (complete linkage)

Cluster analysis using the group average (average linkage)

Cluster analysis using Ward's method

Cluster analysis using k-means
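A minimal sketch of a k-means run with base R's `kmeans`, again on synthetic data. The object names, the use of `scale`, `centers = 2`, and `nstart = 25` are illustrative assumptions rather than the settings used in the report.

```r
# Synthetic stand-in for two numeric columns (hypothetical values).
set.seed(123)
num_data <- data.frame(
  CANTIDAD   = c(rnorm(20, mean = 50,  sd = 10), rnorm(20, mean = 400,  sd = 50)),
  PRE_TOT_US = c(rnorm(20, mean = 200, sd = 40), rnorm(20, mean = 3000, sd = 300))
)
scaled <- scale(num_data)

# Several random starts reduce the risk of stopping at a poor local optimum.
km <- kmeans(scaled, centers = 2, nstart = 25)

print(table(km$cluster))  # observations per cluster
print(km$centers)         # cluster centers on the standardized scale
```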

Plot of the optimal number of clusters k for k-means
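One common way to produce such a plot is the elbow method: fit k-means for a range of k and plot the total within-cluster sum of squares against k. The sketch below assumes that approach; the report does not state which criterion it used, and the data and the range `1:8` are hypothetical.

```r
# Synthetic data with three loose groups (hypothetical values).
set.seed(123)
x <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 6),  ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)
scaled <- scale(x)

# Total within-cluster sum of squares for each candidate k.
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(scaled, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is the k after which the curve flattens.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```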

Cluster analysis for the categorical variables
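Euclidean distance is not meaningful for unordered categories, so a different dissimilarity is needed before hierarchical clustering. The sketch below uses the Gower dissimilarity from `daisy` in the `cluster` package; this is a swapped-in technique chosen for illustration, not necessarily the method behind the report's plots. The data frame `cat_data`, its level values, average linkage, and `k = 3` are all hypothetical.

```r
library(cluster)  # recommended package, normally shipped with R

# Synthetic categorical data (hypothetical values; column names borrowed for readability).
set.seed(123)
cat_data <- data.frame(
  VENDEDOR = factor(sample(c("005", "014", "017", "022"), 60, replace = TRUE)),
  COD_CCO  = factor(sample(c("001", "002"), 60, replace = TRUE)),
  DEP      = factor(sample(c("ANTIOQUIA", "RISARALDA", "VALLE DEL CAUCA"), 60, replace = TRUE))
)

# With factor columns only, Gower dissimilarity is the share of mismatching variables.
d_cat <- daisy(cat_data, metric = "gower")

# Hierarchical clustering on that dissimilarity, cut into three groups.
tree_cat <- hclust(as.dist(d_cat), method = "average")
groups_cat <- cutree(tree_cat, k = 3)
print(table(groups_cat))
```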

Hierarchical clustering plot

Biplot of the categorical variables on axes 1-2

Biplot of the categorical variables on axes 3-4