Data manipulation, tidying, and cleaning

Raw data and processed data

Data: values of quantitative or qualitative variables belonging to a set of objects (a population, etc.)

Qualitative: country of origin, sex, treatment, etc.
Quantitative: height, weight, blood pressure, etc.

Raw data

The original source of the data.
Often hard to use directly for data analysis.
Analyzing data involves processing it.
Raw data must be processed before analysis.

Raw data can be binary files, an unformatted Excel file, manually collected data, etc.

We know the data are in the correct raw form if:
- Nothing in the data was manipulated.
- Nothing was removed from the data.
- The data were not summarized in any way.

Processed data

Data that are ready for analysis.
Processing data involves merging, subsetting, transforming, etc.
Depending on the field, there may be standards for processing (e.g., genetics).
Every step must be recorded. Preprocessing is often the most important component of a data analysis.
Be careful with every step, and make sure you understand what is happening to the data during processing.

Example of the steps followed when processing data

\[ \textbf{Genome}:\ \text{Samples} \rightarrow \frac{\text{Amplification}}{\text{List of genes and sequences}} \rightarrow \frac{\text{Images}}{\text{raw data}} \rightarrow \frac{\text{Colors}}{\text{process to obtain genetic profiles}} \]

Characteristics of processed, tidy data

Raw data must be converted into a tidy data set. To do so, we need the following components:

Raw data -> the files from which we extract the information.
A tidy data set.
A code book describing each variable and its values in the tidy data set, e.g., the units of measurement of each variable.
All the code used to process the data, that is, a record of the exact steps.

Processing goals for tidy data

Each measured variable should be in one column (one variable per column).
Each observation of that variable should be in a different row.
There should be one table for each kind of variable.
When there are multiple tables, each should include a column that allows them to be joined (merging data).

##   Ozone Solar.R Wind Temp Month Day           Region
## 1    41     190  7.4   67     5   1 CIUDAD DE MÉXICO
## 2    36     118  8.0   72     5   2 CIUDAD DE MÉXICO
## 3    12     149 12.6   74     5   3 CIUDAD DE MÉXICO
## 4    18     313 11.5   62     5   4 CIUDAD DE MÉXICO
## 7    23     299  8.6   65     5   7 CIUDAD DE MÉXICO
## 8    19      99 13.8   59     5   8 CIUDAD DE MÉXICO

##    Ozone Solar.R Wind Temp Month Day Region
## 28    23      13 12.0   67     5  28 PUEBLA
## 29    45     252 14.9   81     5  29 PUEBLA
## 30   115     223  5.7   79     5  30 PUEBLA
## 31    37     279  7.4   76     5  31 PUEBLA
## 38    29     127  9.7   82     6   7 PUEBLA
## 40    71     291 13.8   90     6   9 PUEBLA

##    Ozone Solar.R Wind Temp Month Day           Region
## 1     41     190  7.4   67     5   1 CIUDAD DE MÉXICO
## 2     36     118  8.0   72     5   2 CIUDAD DE MÉXICO
## 3     12     149 12.6   74     5   3 CIUDAD DE MÉXICO
## 4     18     313 11.5   62     5   4 CIUDAD DE MÉXICO
## 7     23     299  8.6   65     5   7 CIUDAD DE MÉXICO
## 8     19      99 13.8   59     5   8 CIUDAD DE MÉXICO
## 9      8      19 20.1   61     5   9 CIUDAD DE MÉXICO
## 12    16     256  9.7   69     5  12 CIUDAD DE MÉXICO
## 13    11     290  9.2   66     5  13 CIUDAD DE MÉXICO
## 14    14     274 10.9   68     5  14 CIUDAD DE MÉXICO
## 15    18      65 13.2   58     5  15 CIUDAD DE MÉXICO
## 16    14     334 11.5   64     5  16 CIUDAD DE MÉXICO
## 17    34     307 12.0   66     5  17 CIUDAD DE MÉXICO
## 18     6      78 18.4   57     5  18 CIUDAD DE MÉXICO
## 19    30     322 11.5   68     5  19 CIUDAD DE MÉXICO
## 20    11      44  9.7   62     5  20 CIUDAD DE MÉXICO
## 21     1       8  9.7   59     5  21 CIUDAD DE MÉXICO
## 22    11     320 16.6   73     5  22 CIUDAD DE MÉXICO
## 23     4      25  9.7   61     5  23 CIUDAD DE MÉXICO
## 24    32      92 12.0   61     5  24 CIUDAD DE MÉXICO
## 28    23      13 12.0   67     5  28           PUEBLA
## 29    45     252 14.9   81     5  29           PUEBLA
## 30   115     223  5.7   79     5  30           PUEBLA
## 31    37     279  7.4   76     5  31           PUEBLA
## 38    29     127  9.7   82     6   7           PUEBLA
## 40    71     291 13.8   90     6   9           PUEBLA
## 41    39     323 11.5   87     6  10           PUEBLA
## 44    23     148  8.0   82     6  13           PUEBLA
## 47    21     191 14.9   77     6  16           PUEBLA
## 48    37     284 20.7   72     6  17           PUEBLA
## 49    20      37  9.2   65     6  18           PUEBLA
## 50    12     120 11.5   73     6  19           PUEBLA
## 51    13     137 10.3   76     6  20           PUEBLA
## 62   135     269  4.1   84     7   1           PUEBLA
## 63    49     248  9.2   85     7   2           PUEBLA
## 64    32     236  9.2   81     7   3           PUEBLA
## 66    64     175  4.6   83     7   5           PUEBLA
## 67    40     314 10.9   83     7   6           PUEBLA
## 68    77     276  5.1   88     7   7           PUEBLA
## 69    97     267  6.3   92     7   8           PUEBLA
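The idea of a shared key column can be sketched with two small tables. The tables and the column name station_id below are made up for illustration; they are not the data shown above, and this is only one possible way to join tables in base R.

```r
# Hypothetical tables sharing the key column "station_id"
stations <- data.frame(
  station_id = c(1, 2),
  Region = c("CIUDAD DE MÉXICO", "PUEBLA")
)
readings <- data.frame(
  station_id = c(1, 1, 2),
  Ozone = c(41, 36, 23)
)

# merge() joins the two tables on the shared column
merged <- merge(readings, stations, by = "station_id")
merged
```

Each row of readings picks up the Region of its station, which is why the key column has to be present in both tables.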

TIPS

The first row should contain the variable (column) names.
Column names should be human-readable.

List of instructions

You should get the same result every time you reprocess the raw data. If you do not, something is wrong in your pipeline and needs to be fixed.

A code script
The input to the script is the raw data.
The output is the processed, tidy data.
The script takes no parameters, so the result is the same on every run.
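A minimal sketch of this reproducibility check, using the built-in airquality data as a stand-in for raw data. The function name process_raw and the specific cleaning steps are assumptions chosen for illustration, not a prescribed pipeline.

```r
# Hypothetical processing step: raw data in, tidy data out, nothing else to tune
process_raw <- function(raw) {
  tidy <- na.omit(raw)                          # drop rows with missing values
  tidy <- tidy[order(tidy$Month, tidy$Day), ]   # fixed, deterministic ordering
  tidy
}

run1 <- process_raw(airquality)
run2 <- process_raw(airquality)

# Reprocessing the same raw data should give exactly the same result
identical(run1, run2)
```

If identical() returned FALSE here, some step would depend on something other than the raw data (for example an unseeded random draw), and that step would need fixing.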

STEPS

Set the working directory.
Download the files.
Read them into R.
Process them.
Analyze them.
Visualize them.
Interpret them.

The "data.table" package

It is faster and more memory-efficient than the usual base R commands for manipulating data tables.
- A data.table inherits from data.frame, so functions that accept a data.frame also work on a data.table.
- It is faster at subsetting, grouping, and updating variables.
- It has something of a learning curve.

Install the package and load it.

install.packages("data.table")
library(data.table)

Creating tables with data.frame() and data.table()

library(data.table)

## Using the data.frame() function
DF <-  data.frame(x=rnorm(9), y=rep(c("a", "b", "c"), each=3), z= rnorm(9))
head(DF)

##            x y          z
## 1 -1.1447733 a  0.8116274
## 2  1.1398627 a  0.6328489
## 3  0.1210255 a -2.0270648
## 4  1.4389043 b  0.8637169
## 5 -0.5279340 b -1.0525449
## 6 -1.3782284 b  0.6091437

## Using data.table()
DT <- data.table(x=rnorm(9), y=rep(c("a", "b", "c"), each=3), z= rnorm(9))
head(DT)

##             x y          z
## 1: -0.8436824 a  1.0074831
## 2: -1.8978058 a  0.1939378
## 3: -0.4020842 a -1.4574328
## 4: -0.3116095 b -0.2612332
## 5:  0.1891363 b  0.4151691
## 6: -0.7363604 b  0.4137015

We can list all the data tables in memory

tables()

##    NAME NROW NCOL MB  COLS KEY
## 1:   DT    9    3  0 x,y,z    
## Total: 0MB

Subsetting rows works a little differently: no comma is needed to tell rows from columns, because a single index automatically selects rows.

DT[3]

##             x y         z
## 1: -0.4020842 a -1.457433

DT[c(2,3)]

##             x y          z
## 1: -1.8978058 a  0.1939378
## 2: -0.4020842 a -1.4574328

Columns, however, work a little differently. The argument after the comma is an expression. In R, an expression can be a collection of statements enclosed in curly braces {}.

Compute values from variables using expressions. Here we apply functions to two columns of the table created with data.table().

## mean of column "x" and sum of column "z", using list()
DT[,list(mean(x), sum(z))]

##            V1        V2
## 1: -0.2993343 -2.838312

## Build a table for one column with the number of elements for each of its values.
DT[,table(y)]

## y
## a b c 
## 3 3 3

Add new columns with the := operator: the name of the new column comes first, followed by a colon and an equals sign (:=), and finally the values for the new column.

DT$z+3

## [1] 4.007483 3.193938 1.542567 2.738767 3.415169 3.413702 1.870567 2.506959
## [9] 1.472536

DT[, w:=z^2]
head(DT)

##             x y          z          w
## 1: -0.8436824 a  1.0074831 1.01502227
## 2: -1.8978058 a  0.1939378 0.03761186
## 3: -0.4020842 a -1.4574328 2.12411051
## 4: -0.3116095 b -0.2612332 0.06824276
## 5:  0.1891363 b  0.4151691 0.17236540
## 6: -0.7363604 b  0.4137015 0.17114894

There is no need to overwrite the table: this package avoids the extra memory of creating a new table, because := MODIFIES THE ORIGINAL TABLE in place.

## DT2 IS THE SAME TABLE (a reference, not a copy); use copy(DT) for an independent copy
DT2 <- DT
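The difference between assigning a data.table and copying it can be sketched as follows. The object names DT_a, DT_b and DT_c are made up for this illustration, and the sketch assumes the data.table package is installed.

```r
library(data.table)

DT_a <- data.table(x = 1:3)
DT_b <- DT_a          # same table, not a copy
DT_c <- copy(DT_a)    # independent copy

DT_a[, y := x * 2]    # modifies DT_a in place

names(DT_b)   # includes "y", because DT_b refers to the same table
names(DT_c)   # still only "x"
```

Using copy() costs extra memory, so it is worth reserving for the cases where the original table really has to stay unchanged.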

EXERCISE: Work out what each of the following expressions does

2+2; 3+4

## [1] 4

## [1] 7

DT[, m:={temp <- x+z; log2(temp+5)}]
DT[, a:=x>0]
DT[, b:= mean(x+w), by=a]

Subsetting and sorting data

Subsetting data

Typical subsetting with the [] operators

set.seed(12345)
data <- na.omit(airquality)

## Columns
head(data[,1])

## [1] 41 36 12 18 23 19

head(data[,"Ozone"])

## [1] 41 36 12 18 23 19

## Rows and columns
data[1:2, "Solar.R"]

## [1] 190 118

However, we can do much more sophisticated and precise subsetting. The conditions usually go before the comma, that is, where rows are selected.

Subsetting with AND (&), OR (|), and the which() function

#AND
head(data[(data$Ozone>=50 & data$Wind<=10),])

##    Ozone Solar.R Wind Temp Month Day
## 30   115     223  5.7   79     5  30
## 62   135     269  4.1   84     7   1
## 66    64     175  4.6   83     7   5
## 68    77     276  5.1   88     7   7
## 69    97     267  6.3   92     7   8
## 70    97     272  5.7   92     7   9

#OR
head(data[(data$Ozone<=50 | data$Wind>=11),])

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 7    23     299  8.6   65     5   7
## 8    19      99 13.8   59     5   8

##Which
head(data[which(data$Temp>=90),])

##     Ozone Solar.R Wind Temp Month Day
## 40     71     291 13.8   90     6   9
## 69     97     267  6.3   92     7   8
## 70     97     272  5.7   92     7   9
## 100    89     229 10.3   90     8   8
## 101   110     207  8.0   90     8   9
## 120    76     203  9.7   97     8  28

Finding values with specific characteristics using the %in% or == operators

## %in%
head(data[data$Month %in% c(5,6),])

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 7    23     299  8.6   65     5   7
## 8    19      99 13.8   59     5   8

data$Mes <- factor(data$Month, labels = c("mayo", "junio", "julio", "agosto", "septiembre"))
data[data$Mes %in% c("mayo", "julio"),]

##    Ozone Solar.R Wind Temp Month Day   Mes
## 1     41     190  7.4   67     5   1  mayo
## 2     36     118  8.0   72     5   2  mayo
## 3     12     149 12.6   74     5   3  mayo
## 4     18     313 11.5   62     5   4  mayo
## 7     23     299  8.6   65     5   7  mayo
## 8     19      99 13.8   59     5   8  mayo
## 9      8      19 20.1   61     5   9  mayo
## 12    16     256  9.7   69     5  12  mayo
## 13    11     290  9.2   66     5  13  mayo
## 14    14     274 10.9   68     5  14  mayo
## 15    18      65 13.2   58     5  15  mayo
## 16    14     334 11.5   64     5  16  mayo
## 17    34     307 12.0   66     5  17  mayo
## 18     6      78 18.4   57     5  18  mayo
## 19    30     322 11.5   68     5  19  mayo
## 20    11      44  9.7   62     5  20  mayo
## 21     1       8  9.7   59     5  21  mayo
## 22    11     320 16.6   73     5  22  mayo
## 23     4      25  9.7   61     5  23  mayo
## 24    32      92 12.0   61     5  24  mayo
## 28    23      13 12.0   67     5  28  mayo
## 29    45     252 14.9   81     5  29  mayo
## 30   115     223  5.7   79     5  30  mayo
## 31    37     279  7.4   76     5  31  mayo
## 62   135     269  4.1   84     7   1 julio
## 63    49     248  9.2   85     7   2 julio
## 64    32     236  9.2   81     7   3 julio
## 66    64     175  4.6   83     7   5 julio
## 67    40     314 10.9   83     7   6 julio
## 68    77     276  5.1   88     7   7 julio
## 69    97     267  6.3   92     7   8 julio
## 70    97     272  5.7   92     7   9 julio
## 71    85     175  7.4   89     7  10 julio
## 73    10     264 14.3   73     7  12 julio
## 74    27     175 14.9   81     7  13 julio
## 76     7      48 14.3   80     7  15 julio
## 77    48     260  6.9   81     7  16 julio
## 78    35     274 10.3   82     7  17 julio
## 79    61     285  6.3   84     7  18 julio
## 80    79     187  5.1   87     7  19 julio
## 81    63     220 11.5   85     7  20 julio
## 82    16       7  6.9   74     7  21 julio
## 85    80     294  8.6   86     7  24 julio
## 86   108     223  8.0   85     7  25 julio
## 87    20      81  8.6   82     7  26 julio
## 88    52      82 12.0   86     7  27 julio
## 89    82     213  7.4   88     7  28 julio
## 90    50     275  7.4   86     7  29 julio
## 91    64     253  7.4   83     7  30 julio
## 92    59     254  9.2   81     7  31 julio

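## ==
## Note: == recycles c(9,5) element by element, so this does not select months 9 and 5; use %in% for that
head(data[data$Month == c(9,5),])

## Warning in data$Month == c(9, 5): longer object length is not a multiple of
## shorter object length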

##    Ozone Solar.R Wind Temp Month Day  Mes
## 2     36     118  8.0   72     5   2 mayo
## 4     18     313 11.5   62     5   4 mayo
## 8     19      99 13.8   59     5   8 mayo
## 12    16     256  9.7   69     5  12 mayo
## 14    14     274 10.9   68     5  14 mayo
## 16    14     334 11.5   64     5  16 mayo
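The warning and the odd selection above come from vector recycling. A small sketch with a made-up vector x shows how == and %in% differ when the right-hand side has more than one value:

```r
x <- c(5, 5, 9, 9, 5)

# == compares position by position against the recycled vector 9, 5, 9, 5, 9
eq <- suppressWarnings(x == c(9, 5))

# %in% asks, for each element, whether it appears anywhere in the set
isin <- x %in% c(9, 5)

eq
isin
```

Only the positions where x happens to line up with the recycled pattern are TRUE for ==, while %in% marks every 5 and every 9.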

head(data[data$Month >= 6,])

##    Ozone Solar.R Wind Temp Month Day   Mes
## 38    29     127  9.7   82     6   7 junio
## 40    71     291 13.8   90     6   9 junio
## 41    39     323 11.5   87     6  10 junio
## 44    23     148  8.0   82     6  13 junio
## 47    21     191 14.9   77     6  16 junio
## 48    37     284 20.7   72     6  17 junio

data[data$Mes == "septiembre",]

##     Ozone Solar.R Wind Temp Month Day        Mes
## 124    96     167  6.9   91     9   1 septiembre
## 125    78     197  5.1   92     9   2 septiembre
## 126    73     183  2.8   93     9   3 septiembre
## 127    91     189  4.6   93     9   4 septiembre
## 128    47      95  7.4   87     9   5 septiembre
## 129    32      92 15.5   84     9   6 septiembre
## 130    20     252 10.9   80     9   7 septiembre
## 131    23     220 10.3   78     9   8 septiembre
## 132    21     230 10.9   75     9   9 septiembre
## 133    24     259  9.7   73     9  10 septiembre
## 134    44     236 14.9   81     9  11 septiembre
## 135    21     259 15.5   76     9  12 septiembre
## 136    28     238  6.3   77     9  13 septiembre
## 137     9      24 10.9   71     9  14 septiembre
## 138    13     112 11.5   71     9  15 septiembre
## 139    46     237  6.9   78     9  16 septiembre
## 140    18     224 13.8   67     9  17 septiembre
## 141    13      27 10.3   76     9  18 septiembre
## 142    24     238 10.3   68     9  19 septiembre
## 143    16     201  8.0   82     9  20 septiembre
## 144    13     238 12.6   64     9  21 septiembre
## 145    23      14  9.2   71     9  22 septiembre
## 146    36     139 10.3   81     9  23 septiembre
## 147     7      49 10.3   69     9  24 septiembre
## 148    14      20 16.6   63     9  25 septiembre
## 149    30     193  6.9   70     9  26 septiembre
## 151    14     191 14.3   75     9  28 septiembre
## 152    18     131  8.0   76     9  29 septiembre
## 153    20     223 11.5   68     9  30 septiembre

Sorting data

We can sort vectors or whole tables as needed.

Vectors: use the sort() function

## Increasing order
sort(data$Wind)

##   [1]  2.3  2.8  3.4  4.0  4.1  4.6  4.6  5.1  5.1  5.1  5.7  5.7  6.3  6.3  6.3
##  [16]  6.3  6.3  6.3  6.9  6.9  6.9  6.9  6.9  6.9  7.4  7.4  7.4  7.4  7.4  7.4
##  [31]  7.4  7.4  7.4  8.0  8.0  8.0  8.0  8.0  8.0  8.0  8.6  8.6  8.6  9.2  9.2
##  [46]  9.2  9.2  9.2  9.2  9.7  9.7  9.7  9.7  9.7  9.7  9.7  9.7  9.7 10.3 10.3
##  [61] 10.3 10.3 10.3 10.3 10.3 10.3 10.3 10.3 10.9 10.9 10.9 10.9 10.9 10.9 11.5
##  [76] 11.5 11.5 11.5 11.5 11.5 11.5 11.5 11.5 11.5 12.0 12.0 12.0 12.0 12.6 12.6
##  [91] 13.2 13.8 13.8 13.8 13.8 14.3 14.3 14.3 14.3 14.9 14.9 14.9 14.9 15.5 15.5
## [106] 15.5 16.6 16.6 18.4 20.1 20.7

## Decreasing order
sort(data$Wind, decreasing = TRUE)

##   [1] 20.7 20.1 18.4 16.6 16.6 15.5 15.5 15.5 14.9 14.9 14.9 14.9 14.3 14.3 14.3
##  [16] 14.3 13.8 13.8 13.8 13.8 13.2 12.6 12.6 12.0 12.0 12.0 12.0 11.5 11.5 11.5
##  [31] 11.5 11.5 11.5 11.5 11.5 11.5 11.5 10.9 10.9 10.9 10.9 10.9 10.9 10.3 10.3
##  [46] 10.3 10.3 10.3 10.3 10.3 10.3 10.3 10.3  9.7  9.7  9.7  9.7  9.7  9.7  9.7
##  [61]  9.7  9.7  9.2  9.2  9.2  9.2  9.2  9.2  8.6  8.6  8.6  8.0  8.0  8.0  8.0
##  [76]  8.0  8.0  8.0  7.4  7.4  7.4  7.4  7.4  7.4  7.4  7.4  7.4  6.9  6.9  6.9
##  [91]  6.9  6.9  6.9  6.3  6.3  6.3  6.3  6.3  6.3  5.7  5.7  5.1  5.1  5.1  4.6
## [106]  4.6  4.1  4.0  3.4  2.8  2.3

## Placing missing values at the end
sort(airquality$Ozone, na.last = TRUE)

##   [1]   1   4   6   7   7   7   8   9   9   9  10  11  11  11  12  12  13  13
##  [19]  13  13  14  14  14  14  16  16  16  16  18  18  18  18  19  20  20  20
##  [37]  20  21  21  21  21  22  23  23  23  23  23  23  24  24  27  28  28  28
##  [55]  29  30  30  31  32  32  32  34  35  35  36  36  37  37  39  39  40  41
##  [73]  44  44  44  45  45  46  47  48  49  50  52  59  59  61  63  64  64  65
##  [91]  66  71  73  73  76  77  78  78  79  80  82  84  85  85  89  91  96  97
## [109]  97 108 110 115 118 122 135 168  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
## [127]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
## [145]  NA  NA  NA  NA  NA  NA  NA  NA  NA

Data frames: we can sort them by any variable (column) with the order() function.

head(data[order(data$Ozone),])

##     Ozone Solar.R Wind Temp Month Day        Mes
## 21      1       8  9.7   59     5  21       mayo
## 23      4      25  9.7   61     5  23       mayo
## 18      6      78 18.4   57     5  18       mayo
## 76      7      48 14.3   80     7  15      julio
## 147     7      49 10.3   69     9  24 septiembre
## 9       8      19 20.1   61     5   9       mayo

We can sort tables by multiple variables. Rows are sorted in increasing order of the first variable; ties in the first variable are broken by the second variable, also in increasing order.

head(data[order(data$Ozone, data$Wind),])

##     Ozone Solar.R Wind Temp Month Day        Mes
## 21      1       8  9.7   59     5  21       mayo
## 23      4      25  9.7   61     5  23       mayo
## 18      6      78 18.4   57     5  18       mayo
## 147     7      49 10.3   69     9  24 septiembre
## 76      7      48 14.3   80     7  15      julio
## 9       8      19 20.1   61     5   9       mayo

head(data[order(data$Ozone, data$Wind, decreasing = TRUE),])

##     Ozone Solar.R Wind Temp Month Day    Mes
## 117   168     238  3.4   81     8  25 agosto
## 62    135     269  4.1   84     7   1  julio
## 99    122     255  4.0   89     8   7 agosto
## 121   118     225  2.3   94     8  29 agosto
## 30    115     223  5.7   79     5  30   mayo
## 101   110     207  8.0   90     8   9 agosto
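The decreasing argument used above applies to every sorting variable at once. When the directions need to differ, one common base R approach is to negate a numeric column. The object name aq is an assumption used here so the sketch stands on its own.

```r
aq <- na.omit(airquality)

# Ozone increasing; ties in Ozone broken by Wind decreasing (minus sign on a numeric column)
mixed <- aq[order(aq$Ozone, -aq$Wind), ]
head(mixed)
```

The minus sign only works for numeric columns; for character or factor columns a different approach would be needed.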

Sorting with the dplyr package

library(dplyr)
starwarsdata <- na.omit(starwars)
head(arrange(starwarsdata, height))

## # A tibble: 6 x 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Wicket S~     88    20 brown      brown      brown              8 male  mascu~
## 2 Leia Org~    150    49 brown      light      brown             19 fema~ femin~
## 3 Beru Whi~    165    75 brown      light      blue              47 fema~ femin~
## 4 Padmé Am~    165    45 brown      light      brown             46 fema~ femin~
## 5 Barriss ~    166    50 black      yellow     blue              40 fema~ femin~
## 6 Wedge An~    170    77 brown      fair       hazel             21 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

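## Note: desc = height is read as a sorting expression that happens to be named "desc", so the result is still ascending;
## for decreasing order, wrap the column in desc(): arrange(starwarsdata, desc(height))
head(arrange(starwarsdata, desc = height))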

## # A tibble: 6 x 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Wicket S~     88    20 brown      brown      brown              8 male  mascu~
## 2 Leia Org~    150    49 brown      light      brown             19 fema~ femin~
## 3 Beru Whi~    165    75 brown      light      blue              47 fema~ femin~
## 4 Padmé Am~    165    45 brown      light      brown             46 fema~ femin~
## 5 Barriss ~    166    50 black      yellow     blue              40 fema~ femin~
## 6 Wedge An~    170    77 brown      fair       hazel             21 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
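Decreasing order in dplyr is requested by wrapping the column in desc(). A minimal sketch with a small made-up table (the data frame heights is hypothetical and assumes dplyr is installed):

```r
library(dplyr)

# Small made-up table, used only for illustration
heights <- data.frame(name = c("a", "b", "c"), height = c(170, 88, 228))

asc <- arrange(heights, height)         # increasing
dsc <- arrange(heights, desc(height))   # decreasing

dsc
```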

Summarizing data

Summarizing data is key because it reveals oddities such as missing values and other problems that need to be resolved. It is therefore advisable to summarize and inspect the data before cleaning them. The following functions are useful for this.

1. Quick summaries

summary(): descriptive information for every variable in a data table

summary(airquality)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
##

str(): describes the structure of the table: class, dimensions, column names, variable types, etc.

str(airquality)

## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

quantile(): returns the quantiles of a quantitative variable.

quantile(data$Ozone)

##   0%  25%  50%  75% 100% 
##    1   18   31   62  168

quantile(data$Solar.R, probs = c(.5,.3))

## 50% 30% 
## 207 137

2. Counting, or checking whether at least one specific element is present in an object

table(): counts how many times each distinct value occurs in a variable.

table(starwarsdata$eye_color)

## 
##     black      blue blue-gray     brown     hazel    orange       red    yellow 
##         1         8         1        10         2         2         1         4

sum(): applied to a logical vector, counts how many elements are TRUE.

sum(starwarsdata$sex=="female")

## [1] 6

sum(is.na(starwars))

## [1] 105

sum(is.na(starwars$height))

## [1] 6

any(): returns TRUE if at least one element of a logical vector is TRUE.

any(starwarsdata$skin_color == "pale")

## [1] TRUE

any(is.na(starwars$height))

## [1] TRUE

all(): returns TRUE only if every element of a logical vector is TRUE.

all(starwars$skin_color == "pale")

## [1] FALSE

all(airquality$Ozone<0)

## [1] FALSE

colSums(): column-wise sums; combined with is.na() it counts missing values per column.

colSums(is.na(airquality))

##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0
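Two related checks are often useful next to the counts above. This is a sketch on the built-in airquality data; the variable names na_share and n_complete are made up for illustration.

```r
# Proportion of missing values per column
na_share <- colMeans(is.na(airquality))
round(na_share, 3)

# Number of rows without any missing value
n_complete <- sum(complete.cases(airquality))
n_complete
```

The number of complete rows is the number of rows that na.omit() would keep.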

3. Looking for relationships between variables in the data set

xtabs(): the first argument is a formula and the second is the data set the variables come from. We can replace the column names with a dot "." after the "~" to cross-tabulate by all the remaining columns of the data set.

colnames(starwars)

##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
## [11] "species"    "films"      "vehicles"   "starships"

xtabs(mass~name+gender, data = starwars)

##                        gender
## name                    feminine masculine
##   Ackbar                     0.0      83.0
##   Adi Gallia                50.0       0.0
##   Anakin Skywalker           0.0      84.0
##   Ayla Secura               55.0       0.0
##   Barriss Offee             50.0       0.0
##   Ben Quadinaros             0.0      65.0
##   Beru Whitesun lars        75.0       0.0
##   Biggs Darklighter          0.0      84.0
##   Boba Fett                  0.0      78.2
##   Bossk                      0.0     113.0
##   C-3PO                      0.0      75.0
##   Chewbacca                  0.0     112.0
##   Darth Maul                 0.0      80.0
##   Darth Vader                0.0     136.0
##   Dexter Jettster            0.0     102.0
##   Dooku                      0.0      80.0
##   Dud Bolt                   0.0      45.0
##   Greedo                     0.0      74.0
##   Gregar Typho               0.0      85.0
##   Grievous                   0.0     159.0
##   Han Solo                   0.0      80.0
##   IG-88                      0.0     140.0
##   Jabba Desilijic Tiure      0.0    1358.0
##   Jango Fett                 0.0      79.0
##   Jar Jar Binks              0.0      66.0
##   Jek Tono Porkins           0.0     110.0
##   Ki-Adi-Mundi               0.0      82.0
##   Kit Fisto                  0.0      87.0
##   Lama Su                    0.0      88.0
##   Lando Calrissian           0.0      79.0
##   Leia Organa               49.0       0.0
##   Lobot                      0.0      79.0
##   Luke Skywalker             0.0      77.0
##   Luminara Unduli           56.2       0.0
##   Mace Windu                 0.0      84.0
##   Nien Nunb                  0.0      68.0
##   Nute Gunray                0.0      90.0
##   Obi-Wan Kenobi             0.0      77.0
##   Owen Lars                  0.0     120.0
##   Padmé Amidala             45.0       0.0
##   Palpatine                  0.0      75.0
##   Plo Koon                   0.0      80.0
##   Poggle the Lesser          0.0      80.0
##   Qui-Gon Jinn               0.0      89.0
##   R2-D2                      0.0      32.0
##   R5-D4                      0.0      32.0
##   Ratts Tyerell              0.0      15.0
##   Raymus Antilles            0.0      79.0
##   Roos Tarpals               0.0      82.0
##   Sebulba                    0.0      40.0
##   Shaak Ti                  57.0       0.0
##   Tarfful                    0.0     136.0
##   Tion Medon                 0.0      80.0
##   Wat Tambor                 0.0      48.0
##   Wedge Antilles             0.0      77.0
##   Wicket Systri Warrick      0.0      20.0
##   Yoda                       0.0      17.0
##   Zam Wesell                55.0       0.0

# Using the "." in xtabs.
ChickWeight2 <- aggregate(ChickWeight$weight, by= list(ChickWeight$Diet, ChickWeight$Time), FUN=mean)
names(ChickWeight2) <-c("diet", "time", "mean weight")
ChickWeight2 <- ChickWeight2[order(ChickWeight2$diet),]
xtabs(`mean weight`~., data = ChickWeight2)

##     time
## diet         0         2         4         6         8        10        12
##    1  41.40000  47.25000  56.47368  66.78947  79.68421  93.05263 108.52632
##    2  40.70000  49.40000  59.80000  75.40000  91.70000 108.50000 131.30000
##    3  40.80000  50.40000  62.20000  77.90000  98.40000 117.10000 144.40000
##    4  41.00000  51.80000  64.50000  83.90000 105.60000 126.00000 151.40000
##     time
## diet        14        16        18        20        21
##    1 123.38889 144.64706 158.94118 170.41176 177.75000
##    2 141.90000 164.70000 187.70000 205.60000 214.70000
##    3 164.50000 197.40000 233.10000 258.90000 270.30000
##    4 161.80000 182.00000 202.90000 233.88889 238.55556

Manipulating and creating variables

Sometimes the data set we have loaded lacks certain information, so we need to transform the data to obtain the values we need and add them to the table. Some common cases:

1. Creating sequences

Sequences are used to index different operations on the data.

seq(): takes the arguments by and length (short for length.out). by sets the step between consecutive values and length sets the number of values. Sequences are mainly used to build a vector for loops or for subsetting specific parts of the data.

## by: step size
seq(1,20, by=3)

## [1]  1  4  7 10 13 16 19

## length: number of values
seq(1,20, length=5)

## [1]  1.00  5.75 10.50 15.25 20.00
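A short sketch of how such sequences are typically used for looping. The vector x and the variable total are made up for illustration.

```r
x <- c(10, 20, 30, 40)

# length.out is the full name of the argument abbreviated as length above
seq(1, 20, length.out = 5)

# seq_along() gives one index per element, handy for loops
total <- 0
for (i in seq_along(x)) {
  total <- total + x[i]
}
total
```

seq_along(x) is generally safer than 1:length(x) in loops, because it yields an empty sequence when x is empty.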

2. Subsetting variables

We create new variables from subsets of other variables

ChickWeight <- ChickWeight
ChickWeight$min.time.max.time <- ChickWeight$Time %in% c(0,21)
head(ChickWeight)

##   weight Time Chick Diet min.time.max.time
## 1     42    0     1    1              TRUE
## 2     51    2     1    1             FALSE
## 3     59    4     1    1             FALSE
## 4     64    6     1    1             FALSE
## 5     76    8     1    1             FALSE
## 6     93   10     1    1             FALSE

3. Creating binary variables

ChickWeight$time_great_six <- ifelse(ChickWeight$Time>6, TRUE, FALSE)
head(ChickWeight)

##   weight Time Chick Diet min.time.max.time time_great_six
## 1     42    0     1    1              TRUE          FALSE
## 2     51    2     1    1             FALSE          FALSE
## 3     59    4     1    1             FALSE          FALSE
## 4     64    6     1    1             FALSE          FALSE
## 5     76    8     1    1             FALSE           TRUE
## 6     93   10     1    1             FALSE           TRUE

4. Creating categorical variables

cut(): has two main arguments: x, a numeric vector or quantitative variable, and breaks, the cut points that split the vector into the groups we ask for.

starwarsdata <- starwarsdata[,1:9]

starwarsdata$heightGroups <- cut(starwarsdata$height, breaks = quantile(starwarsdata$height))

table(starwarsdata$heightGroups)

## 
##  (88,170] (170,180] (180,188] (188,228] 
##         7         8         7         6

table(starwarsdata$heightGroups, starwarsdata$height)

##            
##             88 150 165 166 170 172 175 177 178 180 182 183 188 190 193 196 198
##   (88,170]   0   1   2   1   3   0   0   0   0   0   0   0   0   0   0   0   0
##   (170,180]  0   0   0   0   0   1   2   1   2   2   0   0   0   0   0   0   0
##   (180,188]  0   0   0   0   0   0   0   0   0   0   1   3   3   0   0   0   0
##   (188,228]  0   0   0   0   0   0   0   0   0   0   0   0   0   1   1   1   1
##            
##             202 228
##   (88,170]    0   0
##   (170,180]   0   0
##   (180,188]   0   0
##   (188,228]   1   1

class(starwarsdata$heightGroups)

## [1] "factor"
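In the table above, the smallest height (88) is not counted in any group: by default cut() builds intervals that are open on the left, so the minimum falls outside the first interval and becomes NA. A sketch with made-up heights shows the effect and one way to keep the minimum, using the include.lowest argument:

```r
h <- c(88, 150, 165, 170, 180, 228)   # made-up heights for illustration

# Default intervals are open on the left, so the minimum becomes NA
g_default <- cut(h, breaks = quantile(h))

# include.lowest = TRUE closes the first interval so the minimum is kept
g_closed <- cut(h, breaks = quantile(h), include.lowest = TRUE)

sum(is.na(g_default))
sum(is.na(g_closed))
```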

Common transformations

abs(): Absolute value
sqrt(): Square root
ceiling(): Rounds up to the nearest integer
floor(): Rounds down to the nearest integer
round(): Rounds to the number of decimal places given in the digits argument (2, 3, 4, etc.)
log(): Natural logarithm
log2(), log10(): Base-2 and base-10 logarithms
exp(): Exponential of x (e^x)
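A quick way to get a feel for these functions is to apply them to a small vector. The vector x below is arbitrary and only serves as an illustration:

```r
x <- c(-2.718, 0.5, 4, 9.99)

abs(x)                # absolute value
sqrt(abs(x))          # square root (taken of non-negative values)
ceiling(x)            # round up to the nearest integer
floor(x)              # round down to the nearest integer
round(x, digits = 1)  # round to one decimal place
log(exp(1))           # natural logarithm; log(e) is 1
log2(8)               # base-2 logarithm
log10(1000)           # base-10 logarithm
exp(0)                # e^0 is 1
```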

Reshaping the data

Sometimes the data in R are not arranged the way we want, and we have to reorder or reshape them. For this we will use the reshape2 package.

Recall that the goals for a tidy dataset are: 1. Each variable forms a column. 2. Each observation is a row. 3. Each table/file stores data about one type of observation.

melt(): Takes a data frame and reshapes it from wide to long format. Useful when one variable is spread across several columns. The id argument (short for id.vars) takes the identifier variables, which are kept as they are, and measure.vars takes the variables that will be stacked into a single value column; a companion column named variable records which original column each value came from.

library(reshape2)

## 
## Attaching package: 'reshape2'

## The following objects are masked from 'package:data.table':
## 
##     dcast, melt

lol_champs <- read.csv("./data/lol_champs.csv")
lol_champs <- lol_champs[,-1]
names(lol_champs)

##  [1] "Champions"     "HP"            "HP."           "HP5"          
##  [5] "HP5."          "MP"            "MP."           "MP5"          
##  [9] "MP5."          "AD"            "AD."           "AS"           
## [13] "AS."           "AR"            "AR."           "MR"           
## [17] "MR."           "MS"            "Range"         "Title"        
## [21] "Release.date"  "Last.changed"  "Class.es."     "Legacy"       
## [25] "Position.s."   "Resource"      "Range.type"    "Adaptive.type"
## [29] "Store.price"   "Crafting"      "Style"         "Passive"      
## [33] "Q.Spell"       "W.Spell"       "E.Spell"       "Ultimate"     
## [37] "Popularity"    "Winrate"       "BanRate"       "Mainedby"     
## [41] "PentaKill"     "Gold"          "Minions"       "Wards"        
## [45] "DamageDealt"   "KDA_kills"     "KDA_deaths"    "KDA_assists"

meltedlol <- melt(lol_champs, id=c("Champions", "Title"), measure.vars = c( "KDA_kills", "KDA_deaths", "KDA_assists"))
head(meltedlol)

##   Champions               Title  variable value
## 1    Aatrox    the Darkin Blade KDA_kills   5.7
## 2      Ahri the Nine-Tailed Fox KDA_kills   6.2
## 3     Akali  the Rogue Assassin KDA_kills   7.5
## 4    Akshan  the Rogue Sentinel KDA_kills   8.9
## 5   Alistar        the Minotaur KDA_kills   1.7
## 6     Amumu       the Sad Mummy KDA_kills   5.0

head(melt(lol_champs, id=c("Champions", "Position.s."), measure.vars = c("Winrate","BanRate")))

##   Champions Position.s. variable value
## 1    Aatrox  Top Middle  Winrate 50.4%
## 2      Ahri      Middle  Winrate 51.6%
## 3     Akali  Top Middle  Winrate 46.5%
## 4    Akshan      Middle  Winrate 50.1%
## 5   Alistar     Support  Winrate 49.8%
## 6     Amumu      Jungle  Winrate 52.2%

dcast(): Reshapes a long (melted) data frame into wide format according to a formula. Here, the data produced by melt() are returned to their previous layout. We can also pass a function to summarise the data while casting.

head(dcast(meltedlol, Champions~variable))

##   Champions KDA_kills KDA_deaths KDA_assists
## 1    Aatrox       5.7        5.7         6.0
## 2      Ahri       6.2        5.4         7.8
## 3     Akali       7.5        6.0         5.1
## 4    Akshan       8.9        6.7         6.0
## 5   Alistar       1.7        6.1        13.6
## 6     Amumu       5.0        6.1        11.0

names(starwarsdata)

##  [1] "name"         "height"       "mass"         "hair_color"   "skin_color"  
##  [6] "eye_color"    "birth_year"   "sex"          "gender"       "heightGroups"

## Applying a function while casting
meltedstarwars <- melt(starwarsdata, id=c("eye_color", "sex"), measure.vars = c("height", "mass"))
dcast(meltedstarwars, sex~variable, mean)

##      sex   height     mass
## 1 female 165.6667 55.03333
## 2   male 181.1739 83.70435
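To follow the round trip from wide to long and back on a table small enough to check by hand, here is a minimal sketch. The scores data frame and its column names are made up for this illustration; only melt() and dcast() from reshape2 are used.

```r
library(reshape2)

# Hypothetical wide table: one column per test
scores <- data.frame(
  student = c("a", "b", "c"),
  group   = c("x", "x", "y"),
  test1   = c(5, 7, 9),
  test2   = c(6, 8, 10),
  stringsAsFactors = FALSE
)

# Wide -> long: test1 and test2 are stacked into variable/value columns
long <- melt(scores, id.vars = c("student", "group"),
             measure.vars = c("test1", "test2"))

# Long -> wide again: one row per student, one column per test
wide <- dcast(long, student ~ variable, value.var = "value")

# Casting with an aggregation function: mean score per group and test
by_group <- dcast(long, group ~ variable, fun.aggregate = mean,
                  value.var = "value")
by_group
```

Naming value.var explicitly avoids the message dcast() prints when it has to guess which column holds the values.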

aggregate(): Computes a statistic for groups of the data. The first argument is the numeric variable the function is applied to, the by argument is a list of the (usually categorical) variables used to group the table, and FUN is the function applied to the variable in the first argument.

head(ChickWeight)

##   weight Time Chick Diet min.time.max.time time_great_six
## 1     42    0     1    1              TRUE          FALSE
## 2     51    2     1    1             FALSE          FALSE
## 3     59    4     1    1             FALSE          FALSE
## 4     64    6     1    1             FALSE          FALSE
## 5     76    8     1    1             FALSE           TRUE
## 6     93   10     1    1             FALSE           TRUE

ChickWeight2 <- aggregate(ChickWeight$weight, by= list(ChickWeight$Diet, ChickWeight$Time), FUN=mean)
names(ChickWeight2) <-c("diet", "time", "mean weight")
ChickWeight2 <- ChickWeight2[order(ChickWeight2$diet),]
head(ChickWeight2)

##    diet time mean weight
## 1     1    0    41.40000
## 5     1    2    47.25000
## 9     1    4    56.47368
## 13    1    6    66.78947
## 17    1    8    79.68421
## 21    1   10    93.05263

head(aggregate(InsectSprays$count, by=list(InsectSprays$spray), FUN=mean))

##   Group.1         x
## 1       A 14.500000
## 2       B 15.333333
## 3       C  2.083333
## 4       D  4.916667
## 5       E  3.500000
## 6       F 16.666667
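aggregate() also accepts a formula, which keeps the original column names in the result instead of Group.1 and x. A short sketch of that alternative on the same built-in datasets (the object names spray_means and weight_means are arbitrary):

```r
# Mean insect count per spray; the result columns are named spray and count
spray_means <- aggregate(count ~ spray, data = InsectSprays, FUN = mean)

# Two grouping variables: mean chick weight per diet and day
weight_means <- aggregate(weight ~ Diet + Time,
                          data = as.data.frame(ChickWeight), FUN = mean)

head(spray_means)
head(weight_means)
```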

The “dplyr” package

It is a package designed specifically for working with data frames, and one of the most powerful packages for data manipulation.

The properties of dplyr are the following:

1. Arguments: function(data frame, what to do with that data frame)
2. The result is a new data frame
3. Columns can be referred to directly, without using the $ or [] operators

Functions:

select(): Extracts columns from a data frame

library(dplyr)
names(lol_champs)

##  [1] "Champions"     "HP"            "HP."           "HP5"          
##  [5] "HP5."          "MP"            "MP."           "MP5"          
##  [9] "MP5."          "AD"            "AD."           "AS"           
## [13] "AS."           "AR"            "AR."           "MR"           
## [17] "MR."           "MS"            "Range"         "Title"        
## [21] "Release.date"  "Last.changed"  "Class.es."     "Legacy"       
## [25] "Position.s."   "Resource"      "Range.type"    "Adaptive.type"
## [29] "Store.price"   "Crafting"      "Style"         "Passive"      
## [33] "Q.Spell"       "W.Spell"       "E.Spell"       "Ultimate"     
## [37] "Popularity"    "Winrate"       "BanRate"       "Mainedby"     
## [41] "PentaKill"     "Gold"          "Minions"       "Wards"        
## [45] "DamageDealt"   "KDA_kills"     "KDA_deaths"    "KDA_assists"

head(select(lol_champs, Champions:Title ))

##   Champions  HP HP.  HP5 HP5.  MP MP.    MP5 MP5. AD  AD.    AS     AS. AR  AR.
## 1    Aatrox 580  90 3.00 1.00   0   0  0.000 0.00 60 5.00 0.651   +2.5% 38 3.25
## 2      Ahri 526  92 5.50 0.60 418  25  8.000 0.80 53 3.00 0.668     +2% 21 3.50
## 3     Akali 500 105 8.00 0.50 200   0 50.000 0.00 62 3.30 0.625   +3.2% 23 3.50
## 4    Akshan 560  90 3.75 0.65 350  40  8.175 0.70 52 3.50 0.638     +4% 26 3.00
## 5   Alistar 600 106 8.50 0.85 350  40  8.500 0.80 62 3.75 0.625 +2.125% 44 3.50
## 6     Amumu 615  75 9.00 0.85 285  40  7.380 0.53 53 3.80 0.736  +2.18% 30 3.50
##   MR  MR.  MS Range               Title
## 1 32 1.25 345   175    the Darkin Blade
## 2 30 0.50 330   550 the Nine-Tailed Fox
## 3 37 1.25 345   125  the Rogue Assassin
## 4 30 0.50 330   500  the Rogue Sentinel
## 5 32 1.25 330   125        the Minotaur
## 6 32 1.25 335   125       the Sad Mummy

head(select(lol_champs, c(Champions, Range, Title)))

##   Champions Range               Title
## 1    Aatrox   175    the Darkin Blade
## 2      Ahri   550 the Nine-Tailed Fox
## 3     Akali   125  the Rogue Assassin
## 4    Akshan   500  the Rogue Sentinel
## 5   Alistar   125        the Minotaur
## 6     Amumu   125       the Sad Mummy

head(select(lol_champs, -(Release.date:KDA_assists)))

##   Champions  HP HP.  HP5 HP5.  MP MP.    MP5 MP5. AD  AD.    AS     AS. AR  AR.
## 1    Aatrox 580  90 3.00 1.00   0   0  0.000 0.00 60 5.00 0.651   +2.5% 38 3.25
## 2      Ahri 526  92 5.50 0.60 418  25  8.000 0.80 53 3.00 0.668     +2% 21 3.50
## 3     Akali 500 105 8.00 0.50 200   0 50.000 0.00 62 3.30 0.625   +3.2% 23 3.50
## 4    Akshan 560  90 3.75 0.65 350  40  8.175 0.70 52 3.50 0.638     +4% 26 3.00
## 5   Alistar 600 106 8.50 0.85 350  40  8.500 0.80 62 3.75 0.625 +2.125% 44 3.50
## 6     Amumu 615  75 9.00 0.85 285  40  7.380 0.53 53 3.80 0.736  +2.18% 30 3.50
##   MR  MR.  MS Range               Title
## 1 32 1.25 345   175    the Darkin Blade
## 2 30 0.50 330   550 the Nine-Tailed Fox
## 3 37 1.25 345   125  the Rogue Assassin
## 4 30 0.50 330   500  the Rogue Sentinel
## 5 32 1.25 330   125        the Minotaur
## 6 32 1.25 335   125       the Sad Mummy
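Besides ranges and explicit names, select() accepts helper functions that match column names by pattern. A brief sketch on the built-in iris data; with lol_champs, something like starts_with("KDA_") would presumably pick the three KDA columns, but that is not run here:

```r
library(dplyr)

# Columns whose name starts with "Sepal"
sepal_cols <- select(iris, starts_with("Sepal"))

# Every column except Species
no_species <- select(iris, -Species)

# Columns whose name ends with "Width", followed by Species
width_cols <- select(iris, ends_with("Width"), Species)

names(sepal_cols)
names(width_cols)
```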

filter(): Extracts rows from a data frame based on a logical condition

head(filter(lol_champs, HP>=500))

##   Champions  HP HP.  HP5 HP5.  MP MP.    MP5 MP5. AD  AD.    AS     AS. AR  AR.
## 1    Aatrox 580  90 3.00 1.00   0   0  0.000 0.00 60 5.00 0.651   +2.5% 38 3.25
## 2      Ahri 526  92 5.50 0.60 418  25  8.000 0.80 53 3.00 0.668     +2% 21 3.50
## 3     Akali 500 105 8.00 0.50 200   0 50.000 0.00 62 3.30 0.625   +3.2% 23 3.50
## 4    Akshan 560  90 3.75 0.65 350  40  8.175 0.70 52 3.50 0.638     +4% 26 3.00
## 5   Alistar 600 106 8.50 0.85 350  40  8.500 0.80 62 3.75 0.625 +2.125% 44 3.50
## 6     Amumu 615  75 9.00 0.85 285  40  7.380 0.53 53 3.80 0.736  +2.18% 30 3.50
##   MR  MR.  MS Range               Title Release.date Last.changed   Class.es.
## 1 32 1.25 345   175    the Darkin Blade   2013-06-13        V11.2  Juggernaut
## 2 30 0.50 330   550 the Nine-Tailed Fox   2011-12-14       V11.11       Burst
## 3 37 1.25 345   125  the Rogue Assassin   2010-05-11       V11.14    Assassin
## 4 30 0.50 330   500  the Rogue Sentinel   2021-07-22       V11.17    Marksman
## 5 32 1.25 330   125        the Minotaur   2009-02-21       V11.11    Vanguard
## 6 32 1.25 335   125       the Sad Mummy   2009-06-26       V11.17    Vanguard
##               Legacy Position.s.                Resource Range.type
## 1       Fighter Tank  Top Middle  Manaless ( Blood Well)      Melee
## 2      Mage Assassin      Middle                    Mana     Ranged
## 3           Assassin  Top Middle                  Energy      Melee
## 4  Marksman Assassin      Middle                    Mana     Ranged
## 5       Tank Support     Support                    Mana      Melee
## 6          Tank Mage      Jungle                    Mana      Melee
##   Adaptive.type  Store.price Crafting
## 1      Physical  4800 |  880  +  2880
## 2         Magic  3150 |  790  +  1890
## 3      Physical  3150 |  790  +  1890
## 4      Physical  6300 |  975  +  3780
## 5         Magic  1350 |  585   +  810
## 6         Magic   450 |  260   +  270
##                                                Style             Passive
## 1    Damage3 Toughness3 Control2 Mobility2 Utility2  Deathbringer Stance
## 2    Damage3 Toughness1 Control2 Mobility3 Utility1        Essence Theft
## 3    Damage3 Toughness1 Control1 Mobility3 Utility1      Assassin's Mark
## 4    Damage3 Toughness1 Control1 Mobility3 Utility2       Dirty Fighting
## 5    Damage1 Toughness3 Control3 Mobility1 Utility2      Triumphant Roar
## 6    Damage2 Toughness3 Control3 Mobility1 Utility1         Cursed Touch
##             Q.Spell         W.Spell       E.Spell               Ultimate
## 1  The Darkin Blade Infernal Chains   Umbral Dash            World Ender
## 2  Orb of Deception        Fox-Fire         Charm            Spirit Rush
## 3 Five Point Strike Twilight Shroud Shuriken Flip      Perfect Execution
## 4        Avengerang     Going Rogue  Heroic Swing            Comeuppance
## 5         Pulverize        Headbutt       Trample       Unbreakable Will
## 6      Bandage Toss         Despair       Tantrum Curse of the Sad Mummy
##   Popularity Winrate BanRate Mainedby PentaKill   Gold Minions Wards
## 1       5.2%   50.4%    2.5%     0.6%    0.0017 10,804   164.9   7.7
## 2       5.2%   51.6%    1.1%     0.6%    0.0003 10,858   160.3   9.8
## 3       5.9%   46.5%    7.0%     0.6%    0.0015 10,786   151.1   7.9
## 4       9.7%   50.1%   33.4%     2.6%    0.0034 12,175   163.9   8.3
## 5       5.7%   49.8%    1.8%     0.3%    0.0000  7,378    32.6  22.7
## 6       8.6%   52.2%    6.6%     0.2%    0.0002 10,466   135.1   7.4
##   DamageDealt KDA_kills KDA_deaths KDA_assists
## 1      19,100       5.7        5.7         6.0
## 2      19,146       6.2        5.4         7.8
## 3      19,521       7.5        6.0         5.1
## 4      21,382       8.9        6.7         6.0
## 5       7,094       1.7        6.1        13.6
## 6      14,506       5.0        6.1        11.0
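Conditions can be combined. Separating them with commas means "and", the | operator means "or", and %in% tests membership in a set. A minimal sketch on a made-up champs data frame (all names and values are hypothetical):

```r
library(dplyr)

# Hypothetical data, only for illustration
champs <- data.frame(
  name       = c("A", "B", "C", "D"),
  hp         = c(580, 526, 480, 615),
  range_type = c("Melee", "Ranged", "Ranged", "Melee"),
  stringsAsFactors = FALSE
)

# Both conditions must hold (comma = AND)
tanky_melee <- filter(champs, hp >= 500, range_type == "Melee")

# At least one condition must hold
low_or_ranged <- filter(champs, hp < 500 | range_type == "Ranged")

# Rows whose name is in a given set
named <- filter(champs, name %in% c("A", "C"))

tanky_melee
```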

arrange(): Reorders the rows of a data frame by the values of one or more columns, keeping the values of each row together

## ascending order is the default
head(arrange(lol_champs[,c("Champions", "Position.s.", "Winrate", "Minions")], Winrate))

##    Champions Position.s. Winrate Minions
## 1  Gangplank         Top   42.6%   194.3
## 2       Azir      Middle   46.4%   181.8
## 3      Akali  Top Middle   46.5%   151.1
## 4       Gwen     Top Mid   46.8%   177.0
## 5      Jayce  Top Middle   46.8%   185.8
## 6       Ryze  Middle Top   46.9%   204.5

## descending order
head(arrange(lol_champs[,c("Champions", "Position.s.", "Winrate", "Minions")], desc(Winrate)))

##        Champions   Position.s. Winrate Minions
## 1           Sona       Support   53.2%    18.1
## 2          Corki        Middle   52.8%   196.9
## 3       Malzahar        Middle   52.7%   190.3
## 4           Ashe        Bottom   52.6%   171.8
## 5           Kled    Top Jungle   52.6%   157.2
## 6  Kled & Skaarl  Fighter Tank   52.6%   157.2
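Winrate is stored as text such as "50.4%", so arrange() sorts it as text. That happens to give the expected order when every value has two digits before the decimal point, but not in general. A sketch of the difference and of converting to numbers first, on a made-up rates data frame:

```r
library(dplyr)

# Hypothetical data with a single-digit percentage to expose the problem
rates <- data.frame(
  champion = c("A", "B", "C"),
  winrate  = c("9.5%", "50.4%", "46.5%"),
  stringsAsFactors = FALSE
)

# Text sort: "46.5%" < "50.4%" < "9.5%"
text_order <- arrange(rates, winrate)

# Strip the percent sign and convert, then sort on the numeric column
rates_num <- mutate(rates,
                    winrate_num = as.numeric(sub("%", "", winrate, fixed = TRUE)))
num_order  <- arrange(rates_num, winrate_num)
num_desc   <- arrange(rates_num, desc(winrate_num))

text_order$champion
num_order$champion
```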

rename(): Renames the variables of a data frame, using the form new_name = old_name

head(rename(ChickWeight, peso=weight, dia_dieta=Time, pollo=Chick))

##   peso dia_dieta pollo Diet min.time.max.time time_great_six
## 1   42         0     1    1              TRUE          FALSE
## 2   51         2     1    1             FALSE          FALSE
## 3   59         4     1    1             FALSE          FALSE
## 4   64         6     1    1             FALSE          FALSE
## 5   76         8     1    1             FALSE           TRUE
## 6   93        10     1    1             FALSE           TRUE

mutate(): Creates and adds new variables/columns, or transforms existing variables of a data frame

head(mutate(ChickWeight2, Pesos2 = `mean weight`*2))

##    diet time mean weight   Pesos2
## 1     1    0    41.40000  82.8000
## 5     1    2    47.25000  94.5000
## 9     1    4    56.47368 112.9474
## 13    1    6    66.78947 133.5789
## 17    1    8    79.68421 159.3684
## 21    1   10    93.05263 186.1053

head(mutate(ChickWeight2, `mean weight`=`mean weight`*2))

##    diet time mean weight
## 1     1    0     82.8000
## 5     1    2     94.5000
## 9     1    4    112.9474
## 13    1    6    133.5789
## 17    1    8    159.3684
## 21    1   10    186.1053

group_by(): Groups a data frame by one or more categorical variables; the rows stay the same, but later operations are applied per group

grouped.data <- group_by(lol_champs[,c("Champions", "Position.s.", "Winrate", "Minions")], Position.s.)
head(grouped.data)

## # A tibble: 6 x 4
## # Groups:   Position.s. [4]
##   Champions  Position.s.   Winrate Minions
##   <chr>      <chr>         <chr>     <dbl>
## 1 " Aatrox"  " Top Middle" 50.4%     165. 
## 2 " Ahri"    " Middle"     51.6%     160. 
## 3 " Akali"   " Top Middle" 46.5%     151. 
## 4 " Akshan"  " Middle"     50.1%     164. 
## 5 " Alistar" " Support"    49.8%      32.6
## 6 " Amumu"   " Jungle"     52.2%     135.

summarise(): Produces a statistical summary of variables in the data frame. It works very well together with group_by().

summarise(lol_champs, Winrate_mean=mean(as.numeric(gsub("%", "", Winrate))), 
          Winrate_sd=sd(as.numeric(gsub("%", "", Winrate))))

##   Winrate_mean Winrate_sd
## 1      50.0943   1.568449

## Together with data that group_by() has been applied to
summarise(grouped.data, Minion_mean=mean(Minions))

## # A tibble: 32 x 2
##    Position.s.           Minion_mean
##    <chr>                       <dbl>
##  1 " Bottom"                    186.
##  2 " Bottom Top Middle"         194.
##  3 " Fighter Tank"              157.
##  4 " Jungle"                    151.
##  5 " Jungle Middle"             139.
##  6 " Jungle Support"            125.
##  7 " Jungle Top"                160.
##  8 " Jungle Top Middle"         169.
##  9 " Jungle Top Support"        131 
## 10 " Mana"                       95 
## # ... with 22 more rows

names(ChickWeight)

## [1] "weight"            "Time"              "Chick"            
## [4] "Diet"              "min.time.max.time" "time_great_six"

grouped_pollos <- group_by(ChickWeight, Diet)
summarise(grouped_pollos, media.pesos = mean(weight))

## # A tibble: 4 x 2
##   Diet  media.pesos
##   <fct>       <dbl>
## 1 1            103.
## 2 2            123.
## 3 3            143.
## 4 4            135.
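summarise() can compute several statistics per group in one call, and n() counts the rows in each group. A small sketch on the built-in InsectSprays data (the result name spray_summary and its column names are arbitrary):

```r
library(dplyr)

# Group by spray, then compute count, mean and standard deviation per group
spray_summary <- summarise(
  group_by(InsectSprays, spray),
  n          = n(),
  mean_count = mean(count),
  sd_count   = sd(count)
)

spray_summary
```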

The %>% operator

Streamlines the use of the dplyr package. We write the data set first, then the %>% operator, and then the functions; each step passes its result on as the first argument of the next function.

lol_champs %>%
  select(Champions, Position.s., Adaptive.type, Winrate, Minions, Title) %>%
  filter(Position.s. == " Middle") %>%
  group_by(Adaptive.type) %>%
  summarise(Minions = mean(Minions))

## # A tibble: 2 x 2
##   Adaptive.type Minions
##   <chr>           <dbl>
## 1 " Magic"         165.
## 2 " Physical"      170.
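The text values in this dataset carry a leading space (" Middle", " Magic"), which is why the filter above has to match " Middle" exactly. One option is to trim the whitespace as the first step of the pipeline. A sketch on a made-up champs data frame that mimics the problem:

```r
library(dplyr)

# Hypothetical data with the same leading-space issue
champs <- data.frame(
  position = c(" Middle", " Middle", " Top", " Middle"),
  adaptive = c(" Magic", " Physical", " Physical", " Magic"),
  minions  = c(160, 175, 150, 180),
  stringsAsFactors = FALSE
)

mid_summary <- champs %>%
  mutate(position = trimws(position),      # remove leading/trailing spaces
         adaptive = trimws(adaptive)) %>%
  filter(position == "Middle") %>%         # clean value, no leading space
  group_by(adaptive) %>%
  summarise(minions = mean(minions))

mid_summary
```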

Merging data

Sometimes we read more than one dataset into R and want to merge these datasets based on some ID.

We use the merge() function. Its arguments x and y are the two data frames to merge, and by names the column or columns they have in common. When the key column has a different name in each data frame, by.x gives the column in the data frame passed as x and by.y the column in the data frame passed as y.

First we need to see which column names the data tables have in common. For this we use the intersect() function.

intersect(names(educacion.1), names(educacion.2))

## [1] "Entidad.federativa" "Total"              "No.especificado"

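Then we merge them on the columns they share. If we do not specify the columns to merge on, merge() uses all the variables the two tables have in common. The all argument controls whether rows that appear in one data frame but not in the other are kept; with all = TRUE they are included and the missing values are filled with NA. Because Total and No.especificado hold different values in each table, merging on all common columns matches no rows, which is why the unspecified merge further down lists each state twice.

## Equivalent calls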
head(merge(educacion.1,educacion.2, by.x ="Entidad.federativa", by.y = "Entidad.federativa"))

##    Entidad.federativa Total.x Sabe.leer.y.escribir No.sabe.leer.y.escribir
## 1      Aguascalientes  234498               216002                   17770
## 2     Baja California  553060               495977                   52605
## 3 Baja California Sur  121220               107786                   12780
## 4            Campeche  145473               129890                   14959
## 5             Chiapas 1072615               866691                  202707
## 6           Chihuahua  586667               528252                   55258
##   No.especificado.x Total.y  Asiste No.asiste No.especificado.y
## 1               726 1352235  407267    944493               475
## 2              4478 3610844  976069   2615724             19051
## 3               654  758642  213156    544548               938
## 4               624  878528  253183    624987               358
## 5              3217 5181929 1571582   3608207              2140
## 6              3157 3570280 1005113   2561934              3233

head(merge(educacion.1,educacion.2, by = "Entidad.federativa"))

##    Entidad.federativa Total.x Sabe.leer.y.escribir No.sabe.leer.y.escribir
## 1      Aguascalientes  234498               216002                   17770
## 2     Baja California  553060               495977                   52605
## 3 Baja California Sur  121220               107786                   12780
## 4            Campeche  145473               129890                   14959
## 5             Chiapas 1072615               866691                  202707
## 6           Chihuahua  586667               528252                   55258
##   No.especificado.x Total.y  Asiste No.asiste No.especificado.y
## 1               726 1352235  407267    944493               475
## 2              4478 3610844  976069   2615724             19051
## 3               654  758642  213156    544548               938
## 4               624  878528  253183    624987               358
## 5              3217 5181929 1571582   3608207              2140
## 6              3157 3570280 1005113   2561934              3233

## Without specifying the columns
head(merge(educacion.1,educacion.2, all = TRUE))

##    Entidad.federativa   Total No.especificado Sabe.leer.y.escribir
## 1      Aguascalientes  234498             726               216002
## 2      Aguascalientes 1352235             475                   NA
## 3     Baja California  553060            4478               495977
## 4     Baja California 3610844           19051                   NA
## 5 Baja California Sur  121220             654               107786
## 6 Baja California Sur  758642             938                   NA
##   No.sabe.leer.y.escribir Asiste No.asiste
## 1                   17770     NA        NA
## 2                      NA 407267    944493
## 3                   52605     NA        NA
## 4                      NA 976069   2615724
## 5                   12780     NA        NA
## 6                      NA 213156    544548
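The effect of by and all is easier to see on two tiny tables. A minimal sketch in base R; the left and right data frames are made up and only mimic the situation above, where a shared column (total) holds different values in each table:

```r
# Hypothetical tables sharing the columns state and total
left  <- data.frame(state = c("A", "B", "C"), total = c(10, 20, 30))
right <- data.frame(state = c("B", "C", "D"), total = c(200, 300, 400))

# Inner join on state: only B and C; total becomes total.x and total.y
inner <- merge(left, right, by = "state")

# Full join on state: A to D, with NA where a table has no matching row
full <- merge(left, right, by = "state", all = TRUE)

# Keep every row of left only
left_only <- merge(left, right, by = "state", all.x = TRUE)

# No by: merges on state AND total, nothing matches, so rows are stacked
no_key <- merge(left, right, all = TRUE)

full
no_key
```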

Another way to merge the data is the join() function from the plyr package; join_all() does the same for a list of data frames

library(plyr)
head(join(educacion.2, educacion.1, by="Entidad.federativa"))

##         Entidad.federativa     Total   Asiste No.asiste No.especificado
## 1 Estados Unidos Mexicanos 119976584 33795678  86037103          143803
## 2           Aguascalientes   1352235   407267    944493             475
## 3          Baja California   3610844   976069   2615724           19051
## 4      Baja California Sur    758642   213156    544548             938
## 5                 Campeche    878528   253183    624987             358
## 6     Coahuila de Zaragoza   2980244   845319   2133016            1909
##      Total Sabe.leer.y.escribir No.sabe.leer.y.escribir No.especificado
## 1 19529018             17554529                 1871713          102776
## 2   234498               216002                   17770             726
## 3   553060               495977                   52605            4478
## 4   121220               107786                   12780             654
## 5   145473               129890                   14959             624
## 6   486221               445183                   39038            2000

educ.list <- list(educacion.1, educacion.2, educacion.3)
head(join_all(educ.list, by = "Entidad.federativa"))

##         Entidad.federativa    Total Sabe.leer.y.escribir
## 1 Estados Unidos Mexicanos 19529018             17554529
## 2           Aguascalientes   234498               216002
## 3          Baja California   553060               495977
## 4      Baja California Sur   121220               107786
## 5                 Campeche   145473               129890
## 6     Coahuila de Zaragoza   486221               445183
##   No.sabe.leer.y.escribir No.especificado     Total   Asiste No.asiste
## 1                 1871713          102776 119976584 33795678  86037103
## 2                   17770             726   1352235   407267    944493
## 3                   52605            4478   3610844   976069   2615724
## 4                   12780             654    758642   213156    544548
## 5                   14959             624    878528   253183    624987
## 6                   39038            2000   2980244   845319   2133016
##   No.especificado     Total Sin.escolaridad Preescolar Primaria Secundaria
## 1          143803 119976584         7701507    6110435 33253208   29426059
## 2             475   1352235           63957      72236   346807     361191
## 3           19051   3610844          180275     155091   872803     945609
## 4             938    758642           36121      38810   184093     182005
## 5             358    878528           68195      49037   233844     218926
## 6            1909   2980244          135915     156848   705577     823255
##   Estudios.técnicos.o.comerciales.con.primaria.terminada
## 1                                                 354724
## 2                                                   4705
## 3                                                  12001
## 4                                                   2653
## 5                                                   1948
## 6                                                  21514
##   Estudios.técnicos.o.comerciales.con.secundaria.terminada
## 1                                                  1244138
## 2                                                    14150
## 3                                                    30223
## 4                                                     8057
## 5                                                     7641
## 6                                                    50777
##   Preparatoria.o.bachillerato Normal.básica
## 1                    21149168        123608
## 2                      229410          1597
## 3                      773443          2847
## 4                      165452           762
## 5                      143267          1182
## 6                      521913          4391
##   Estudios.técnicos.o.comerciales.con.preparatoria.terminada
## 1                                                    1453857
## 2                                                      20758
## 3                                                      37916
## 4                                                       9603
## 5                                                       9373
## 6                                                      65532
##   Licenciatura.o.equivalente Posgrado No.especificado
## 1                   16777488  2055605          326787
## 2                     207540    27203            2681
## 3                     521809    63133           15694
## 4                     116667    11975            2444
## 5                     127182    16002            1931
## 6                     429571    53651           11300
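The plyr output above keeps repeated column names (Total and No.especificado appear more than once), and loading plyr after dplyr masks several dplyr functions such as summarise() and mutate(). An alternative that avoids both is to stay with the dplyr join functions. A sketch under those assumptions; the three small tables below are made up and stand in for the educacion data frames:

```r
library(dplyr)

# Hypothetical stand-ins for the three tables
literacy   <- data.frame(state = c("A", "B", "C"), total = c(10, 20, 30))
attendance <- data.frame(state = c("A", "B", "D"), total = c(100, 200, 400))
schooling  <- data.frame(state = c("A", "B", "C"), primary = c(1, 2, 3))

# Two tables: keep all rows of the first; suffix disambiguates shared names
two <- left_join(literacy, attendance, by = "state",
                 suffix = c(".literacy", ".attendance"))

# Several tables: fold a join over a list, similar in spirit to join_all()
tables    <- list(literacy, attendance, schooling)
all_three <- Reduce(function(x, y) full_join(x, y, by = "state"), tables)

two
all_three
```

full_join() keeps states that appear in any table; left_join() would be the closer match to the default behaviour of plyr's join().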

The tidyr package

It is a package that helps reshape data tables when the data are messy.

gather(): Collapses several columns of a data set into key-value pairs. Its arguments are a data set, the names of the two new columns (one for the former column names, one for their values), and either the columns to gather or, with a minus sign, the columns to leave out.

When the column headers are values, not variable names

library(tidyr)
names(educacion.1)

## [1] "Entidad.federativa"      "Total"                  
## [3] "Sabe.leer.y.escribir"    "No.sabe.leer.y.escribir"
## [5] "No.especificado"

head(gather(educacion.1, Aptitud.Escritura.y.lectura, Count, -c(Entidad.federativa, Total)))

##         Entidad.federativa    Total Aptitud.Escritura.y.lectura    Count
## 1 Estados Unidos Mexicanos 19529018        Sabe.leer.y.escribir 17554529
## 2           Aguascalientes   234498        Sabe.leer.y.escribir   216002
## 3          Baja California   553060        Sabe.leer.y.escribir   495977
## 4      Baja California Sur   121220        Sabe.leer.y.escribir   107786
## 5                 Campeche   145473        Sabe.leer.y.escribir   129890
## 6     Coahuila de Zaragoza   486221        Sabe.leer.y.escribir   445183

When multiple variables are stored in one column, we use gather() and separate()

data <- data.frame(materia=c("fisica","matematicas","español"), M_sj1=c(5,7,8), M_sj2=c(8,9,9), H_sj1=c(10,7,6), H_sj2=c(7,9,7))

data2 <- gather(data, sex_suj, calificacion,-materia)

separate(data2, sex_suj, into = c("sexo", "sujeto"))

##        materia sexo sujeto calificacion
## 1       fisica    M    sj1            5
## 2  matematicas    M    sj1            7
## 3      español    M    sj1            8
## 4       fisica    M    sj2            8
## 5  matematicas    M    sj2            9
## 6      español    M    sj2            9
## 7       fisica    H    sj1           10
## 8  matematicas    H    sj1            7
## 9      español    H    sj1            6
## 10      fisica    H    sj2            7
## 11 matematicas    H    sj2            9
## 12     español    H    sj2            7
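In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), which can split the column names while reshaping, so the separate() step is not needed. A sketch of the same example under that assumption (the object names grades and long are arbitrary):

```r
library(tidyr)

# Same toy data as above: sex and subject are encoded in the column names
grades <- data.frame(
  materia = c("fisica", "matematicas", "español"),
  M_sj1 = c(5, 7, 8), M_sj2 = c(8, 9, 9),
  H_sj1 = c(10, 7, 6), H_sj2 = c(7, 9, 7)
)

# Gather every column except materia and split names such as "M_sj1" at "_"
long <- pivot_longer(
  grades,
  cols      = -materia,
  names_to  = c("sexo", "sujeto"),
  names_sep = "_",
  values_to = "calificacion"
)

head(long)
```

The rows come out grouped by materia rather than by column, so the order differs from the gather() output even though the content is the same.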

When variables are stored in both rows and columns, we use gather() and spread()

lol <- lol_champs[,c("Champions", "Adaptive.type", "KDA_kills", "KDA_assists", "KDA_deaths")]
lol <- gather(lol, KDA, valor, KDA_kills:KDA_deaths)
head(spread(lol, Adaptive.type, valor))

##   Champions         KDA  6300 |  975  Magic  Melee  Physical
## 1    Aatrox KDA_assists           NA     NA     NA       6.0
## 2    Aatrox  KDA_deaths           NA     NA     NA       5.7
## 3    Aatrox   KDA_kills           NA     NA     NA       5.7
## 4      Ahri KDA_assists           NA    7.8     NA        NA
## 5      Ahri  KDA_deaths           NA    5.4     NA        NA
## 6      Ahri   KDA_kills           NA    6.2     NA        NA
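The unexpected column names in the output above (" 6300 |  975 ", " Melee") suggest that a few rows of Adaptive.type hold values belonging to other columns, which is worth checking before spreading on it. For the spreading step itself, tidyr 1.0.0 and later offers pivot_wider() as the successor of spread(). A minimal sketch on a made-up kda table, spreading the statistic names instead:

```r
library(tidyr)

# Hypothetical long table: one row per champion and statistic
kda <- data.frame(
  champion = c("A", "A", "B", "B"),
  stat     = c("kills", "deaths", "kills", "deaths"),
  value    = c(5.7, 5.7, 6.2, 5.4),
  stringsAsFactors = FALSE
)

# One row per champion, one column per statistic
wide <- pivot_wider(kda, names_from = stat, values_from = value)

wide
```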