Programming and Working Environment for AI: R

Author

Kamal Romero

Published

June 3, 2024

Data frames

Much of the data we work with in data science is organized in rectangular arrays, where rows represent the instances, individuals, or objects under study, and columns represent their characteristics. Because the units of study are multidimensional entities, rows are by definition heterogeneous (a row is an individual characterized by, for example, age, postal code of residence, bank account balance, etc.), while columns are homogeneous (the age column contains only integers if age is measured in years).

In R, the container used to analyze rectangular data is the data frame.

Formally speaking, data frames are lists of vectors of equal length. Therefore, everything we learned in the previous sections also applies to data frames.
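To make the "list of equal-length vectors" idea concrete, here is a minimal sketch. The `people` data frame and its values are hypothetical, invented only for illustration; the functions used are all base R.

```r
# Hypothetical toy data: three individuals described by age, postal code, and balance
people <- data.frame(
  age         = c(34L, 51L, 27L),
  postal_code = c("28001", "08002", "41003"),
  balance     = c(1200.50, 89.90, 15300.00),
  stringsAsFactors = FALSE  # keep text as character, regardless of the R version
)

is.list(people)        # a data frame is a list under the hood
lengths(people)        # every column (vector) has the same length
sapply(people, class)  # each column is homogeneous: one class per column

# List tools work on data frames: extract one column as a plain vector
people[["age"]]
```

Each row mixes types (integer, character, numeric), while each column holds a single type, which matches the heterogeneous-rows, homogeneous-columns description above.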

Inspecting the data

We usually inspect the data in a table to become familiar with it and to see what kind of analysis we can perform.

We will use one of the datasets that ships with the base R installation, mtcars.

We can open a friendly version of the table in RStudio (for instance with View(mtcars)), or simply print it to the console

mtcars

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

There are several ways to take a quick look at the data without displaying the entire table.

For example, we can view the first 6 rows

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Or the last 6

tail(mtcars)

                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

If, instead of 6, we want to see a specific number of rows, we simply pass that number to the head and tail functions

head(mtcars, n = 3)

               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

tail(mtcars, n = 2)

               mpg cyl disp  hp drat   wt qsec vs am gear carb
Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
Volvo 142E    21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

When there are many columns, visual inspection becomes cumbersome. However, since a data frame is a list of vectors in which the vectors are the columns, we can get the column names with names(), just as with lists

names(mtcars)

 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

Likewise, a data frame is a rectangular object, that is, one with more than one dimension, like a matrix, so we can obtain the number of rows and columns with dim()

dim(mtcars)

[1] 32 11

We can get the number of columns and rows separately with ncol and nrow

ncol(mtcars)

[1] 11

nrow(mtcars)

[1] 32
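As a small sketch of how these functions relate, the following checks (base R only) confirm that dim() returns the row count followed by the column count, and that the list view of a data frame counts columns.

```r
shape <- dim(mtcars)             # integer vector: c(rows, columns)

shape[1] == nrow(mtcars)         # first element is the number of rows
shape[2] == ncol(mtcars)         # second element is the number of columns
length(mtcars) == ncol(mtcars)   # as a list, its length is the number of columns
```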

Another option is the str function. Although it is not intuitive and may even look intimidating at first glance, it is not that complicated and provides useful information

str(mtcars)

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

A version that provides statistical information is summary

summary(mtcars)

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000

Recall that we could access the elements of vectors by position or by name. With data frames we do the same, giving one index for the rows and one for the columns

head(mtcars, n = 1)

          mpg cyl disp  hp drat   wt  qsec vs am gear carb
Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4

mtcars[1,2]

[1] 6

mtcars[1,4]

[1] 110

In the first case we ask R for the value in the first row and second column; in the second case, for the value in the first row and fourth column.

To view a specific row or column, we leave one of the dimensions blank.

To access the second row

mtcars[2, ]

              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

To access the fourth column

mtcars[ , 4]

 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52
[20]  65  97 150 150 245 175  66  91 113 264 175 335 109

We can also retrieve a column by its name

mtcars[ , "cyl"]

 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

Likewise, we can inspect a column of a data frame by writing the name of the data frame followed by a $ sign and the name of the column.

The advantage of this method is that the column names autocomplete

mtcars$cyl

 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
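The different ways of extracting a column shown above return the same vector. The sketch below checks this with base R only; the `[[ ]]` form and the `drop = FALSE` argument are not covered in the text and are included as an aside.

```r
by_dollar   <- mtcars$cyl        # $ with the column name
by_name     <- mtcars[, "cyl"]   # name in the column dimension
by_position <- mtcars[, 2]       # position in the column dimension
by_list     <- mtcars[["cyl"]]   # list-style extraction

identical(by_dollar, by_name)
identical(by_dollar, by_position)
identical(by_dollar, by_list)

# drop = FALSE keeps the result as a one-column data frame instead of a vector
one_col <- mtcars[, "cyl", drop = FALSE]
class(one_col)
dim(one_col)
```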

Using positions, as with vectors, we can access sets of rows and columns (slices).

For example, to keep only the first 3 rows and the fourth and fifth columns

mtcars[1:3, 4:5]

               hp drat
Mazda RX4     110 3.90
Mazda RX4 Wag 110 3.90
Datsun 710     93 3.85
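The same slice can be requested by column name, and negative positions drop columns instead of selecting them. This is a short base R sketch; the variable names are only illustrative.

```r
by_position <- mtcars[1:3, 4:5]              # columns 4 and 5
by_name     <- mtcars[1:3, c("hp", "drat")]  # the same columns, by name

identical(by_position, by_name)

# Negative positions exclude columns: everything except the first two
without_first_two <- mtcars[1:3, -(1:2)]
names(without_first_two)
```

Selecting by name tends to be more robust than selecting by position when the column order may change.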

Manipulating data frames

We can add columns to a data frame. Let us use a reduced version of mtcars

(mtcars_mini <- mtcars[1:10, ])

                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

We create a character vector of length 10 and assign it to a variable

( nueva_columna <- letters[1:10] )

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

We add this vector as a new column of mtcars_mini

mtcars_mini$new <- nueva_columna

mtcars_mini

                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb new
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   a
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   b
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1   c
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   d
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   e
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   f
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4   g
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2   h
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2   i
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   j

Since a data frame is a rectangular array, we can use functions that we used earlier with matrices. Another way to add one or more new columns is the cbind() function

( new2 <- letters[11:20] )

 [1] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"

( mtcars_mini <- cbind(mtcars_mini, new2) )

                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb new new2
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   a    k
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   b    l
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1   c    m
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   d    n
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   e    o
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   f    p
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4   g    q
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2   h    r
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2   i    s
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   j    t

Likewise, we can use rbind() to add new rows to the data frame

( rbind(mtcars_mini[, 1:(ncol(mtcars_mini)-2)], mtcars[11:15, ]) )

                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4          21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag      21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710         22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive     21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant            18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360         14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D          24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230           22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280           19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C          17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE         16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL         17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC        15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4

Note what we did to make the columns of mtcars_mini match those of the subset of mtcars: we dropped its last two columns (new and new2) before binding.

To apply cbind(), make sure the data frames have the same number of rows; for rbind(), make sure they have the same number of columns (with matching names).
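One way to make these requirements explicit is to check them before binding. The following is a minimal sketch using base R; the names `mini`, `extra_rows`, `label`, `wide`, and `long` are chosen here for illustration and do not appear in the original examples.

```r
mini       <- mtcars[1:10, ]    # 10 rows, 11 columns
extra_rows <- mtcars[11:15, ]   # 5 more rows with the same columns
label      <- letters[1:10]     # one value per row of mini

# cbind(): the new column needs as many elements as the data frame has rows
stopifnot(length(label) == nrow(mini))
wide <- cbind(mini, label)

# rbind(): both data frames need the same columns
stopifnot(identical(names(mini), names(extra_rows)))
long <- rbind(mini, extra_rows)

dim(wide)   # one more column than mini
dim(long)   # five more rows than mini
```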

To change the order of the columns, we pass a character vector with the desired order in the column dimension.

orden_nuevo <- c("carb", "mpg",  "cyl",  "disp", "hp", "drat", "wt", "qsec", "vs",  "am", "gear")

mtcars[, orden_nuevo]

                    carb  mpg cyl  disp  hp drat    wt  qsec vs am gear
Mazda RX4              4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4
Mazda RX4 Wag          4 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4
Datsun 710             1 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4
Hornet 4 Drive         1 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3
Hornet Sportabout      2 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3
Valiant                1 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3
Duster 360             4 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3
Merc 240D              2 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4
Merc 230               2 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4
Merc 280               4 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4
Merc 280C              4 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4
Merc 450SE             3 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3
Merc 450SL             3 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3
Merc 450SLC            3 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3
Cadillac Fleetwood     4 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3
Lincoln Continental    4 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3
Chrysler Imperial      4 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3
Fiat 128               1 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4
Honda Civic            2 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4
Toyota Corolla         1 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4
Toyota Corona          1 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3
Dodge Challenger       2 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3
AMC Javelin            2 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3
Camaro Z28             4 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3
Pontiac Firebird       2 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3
Fiat X1-9              1 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4
Porsche 914-2          2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5
Lotus Europa           2 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5
Ford Pantera L         4 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5
Ferrari Dino           6 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5
Maserati Bora          8 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5
Volvo 142E             2 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4
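Typing every column name by hand becomes tedious with wide tables. As a sketch of an alternative (base R only, with illustrative variable names), the new order can be built from names() and setdiff().

```r
first      <- "carb"
# Put the chosen column first and keep the remaining ones in their original order
orden_auto <- c(first, setdiff(names(mtcars), first))

reordered <- mtcars[, orden_auto]
names(reordered)
```

This produces the same column order as the hand-written orden_nuevo vector above.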

To create a new column from existing ones

mtcars_mini$mpgOvercyl <- mtcars_mini$mpg / mtcars_mini$cyl

mtcars_mini

                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb new new2
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   a    k
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   b    l
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1   c    m
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   d    n
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   e    o
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   f    p
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4   g    q
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2   h    r
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2   i    s
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   j    t
                  mpgOvercyl
Mazda RX4           3.500000
Mazda RX4 Wag       3.500000
Datsun 710          5.700000
Hornet 4 Drive      3.566667
Hornet Sportabout   2.337500
Valiant             3.016667
Duster 360          1.787500
Merc 240D           6.100000
Merc 230            5.700000
Merc 280            3.200000

We can also filter rows using logical conditions, just as we did with vectors

mtcars[mtcars$cyl > 4 & mtcars$hp > 100, ]

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

mtcars[mtcars$cyl > 4 | mtcars$hp < 100, ]

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

mtcars[mtcars$cyl == 6, ]

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
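The row filters above work because each condition evaluates to a logical vector with one entry per row, and R keeps the rows where that vector is TRUE. A short base R sketch follows; subset() is a base alternative not used in the text.

```r
keep <- mtcars$cyl > 4 & mtcars$hp > 100

class(keep)    # "logical"
length(keep)   # one entry per row of mtcars
sum(keep)      # number of rows that pass the filter (TRUE counts as 1)

filtered <- mtcars[keep, ]
nrow(filtered) == sum(keep)

# subset() expresses the same filter without repeating mtcars$
same <- subset(mtcars, cyl > 4 & hp > 100)
identical(rownames(filtered), rownames(same))
```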

Beyond data frames

Data frames are an extremely useful structure, with the great advantage of being part of base R (they do not depend on libraries developed by third parties). However, this is also a limitation, so libraries have been developed that complement the properties of data frames.

The Tidyverse universe

Figure 1: Collection of packages in the Tidyverse ecosystem. See its website

The main reference for this section is this chapter of the book R for Data Science (here in Spanish)

In this first part we will focus on the dplyr package and on data manipulation, and we will introduce the pipe from the magrittr library

Dplyr

On its website, the dplyr library is presented as follows: “dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges”

Another definition can be found in this text by Carlos Gil Bellosta

The “set of verbs” is the following:

mutate() adds new variables that are functions of existing variables
select() picks variables based on their names
filter() picks cases based on their values
summarise() reduces multiple values down to a single summary
arrange() changes the ordering of the rows

Since this library is developed by third parties, it has to be installed (only once) and then loaded

install.packages("dplyr")
library(dplyr)
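With dplyr loaded, the five verbs listed above can be tried on mtcars before moving on. This is only a sketch: it assumes dplyr is installed, uses intermediate variables instead of a pipe, and the names `six_cyl`, `small`, and `mpg_per_cyl` are illustrative.

```r
library(dplyr)

# filter(): keep rows based on their values
six_cyl <- filter(mtcars, cyl == 6)

# select(): keep columns based on their names
small <- select(six_cyl, mpg, cyl, hp)

# mutate(): add a column computed from existing ones
small <- mutate(small, mpg_per_cyl = mpg / cyl)

# arrange(): reorder rows, here from highest to lowest mpg
small <- arrange(small, desc(mpg))

# summarise(): collapse the rows into a single summary row
summarise(small, mean_mpg = mean(mpg), n = n())
```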

We will work with the starwars dataset

starwars

# A tibble: 87 x 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
 3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
 4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
 5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
 6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
 7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
 9 Biggs D~    183    84 black      light      brown           24   male  mascu~
10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
# i 77 more rows
# i 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Note that our data frame now goes by a different name: it is a tibble. Read this chapter of the book for a detailed discussion.

We inspect the data frame (yes, a tibble is still a data frame) with glimpse

glimpse(starwars)

Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J~
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~
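The claim that a tibble is still a data frame can be checked directly. This sketch assumes dplyr is installed, since starwars ships with that package.

```r
library(dplyr)

class(starwars)          # includes "data.frame" alongside the tibble classes
is.data.frame(starwars)  # TRUE

# Base R tools from the previous sections keep working on a tibble
dim(starwars)
names(starwars)[1:3]
starwars[1:3, c("name", "height")]
```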

Before continuing, it is best to introduce the magrittr pipe. We start with a simple example in which we want to compute the logarithm of the square root of the sum of a sequence of 20 draws from a uniform distribution.

The steps are as follows:

Generate the sequence of 20 draws from a uniform distribution: runif(20)
Compute the sum: sum(runif(20))
Take the square root: sqrt(sum(runif(20)))
Apply the logarithm: log(sqrt(sum(runif(20))))

set.seed(356)
log(sqrt(sum(runif(20))))

[1] 1.257018

The expression is somewhat hard to read, and it gets harder as we add more functions.

The magrittr pipe, %>%, is based on the same idea as the bash pipe |: passing the output of one function on to another function in a chain.

A “clean” way to run several verbs in sequence is to use the pipe

According to its website, this tool makes our code more readable by:

structuring sequences of data operations left-to-right (as opposed to from the inside and out)
avoiding nested function calls
minimizing the need for local variables and function definitions
making it easy to add steps anywhere in the sequence of operations

With pipes, the previous operation looks like this:

set.seed(356)

runif(20) %>% 
  sum() %>% 
  sqrt() %>% 
  log()

[1] 1.257018
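To see what the pipe does without random numbers getting in the way, here is a small deterministic sketch. It assumes magrittr is installed and R 4.1 or later for the native pipe |>; the variable names are illustrative.

```r
library(magrittr)

x <- c(1, 4, 9)

nested <- sqrt(sum(x))              # read from the inside out
piped  <- x %>% sum() %>% sqrt()    # read from left to right
native <- x |> sum() |> sqrt()      # native pipe, available since R 4.1

identical(nested, piped)
identical(nested, native)

# The piped value becomes the first argument; other arguments follow as usual
rounded <- 3.14159 %>% round(2)     # same as round(3.14159, 2)
identical(rounded, round(3.14159, 2))
```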

This is more natural and easier to read. We will use this chaining of functions with the pipe extensively with dplyr.

We start by selecting rows that satisfy a given condition, which is done with the verb filter()

For example, we keep only the Star Wars characters who are from Tatooine

starwars %>% 
  dplyr::filter(homeworld == "Tatooine")

# A tibble: 10 x 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
 3 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
 4 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
 5 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
 6 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
 7 Biggs D~    183    84 black      light      brown           24   male  mascu~
 8 Anakin ~    188    84 blond      fair       blue            41.9 male  mascu~
 9 Shmi Sk~    163    NA black      fair       brown           72   fema~ femin~
10 Cliegg ~    183    NA brown      fair       blue            82   male  mascu~
# i 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Or those who are from Tatooine and are of masculine gender

starwars %>%
  dplyr::filter(homeworld == "Tatooine" & gender == "masculine")

# A tibble: 8 x 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky~    172    77 blond      fair       blue            19   male  mascu~
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu~
3 Darth Va~    202   136 none       white      yellow          41.9 male  mascu~
4 Owen Lars    178   120 brown, gr~ light      blue            52   male  mascu~
5 R5-D4         97    32 <NA>       white, red red             NA   none  mascu~
6 Biggs Da~    183    84 black      light      brown           24   male  mascu~
7 Anakin S~    188    84 blond      fair       blue            41.9 male  mascu~
8 Cliegg L~    183    NA brown      fair       blue            82   male  mascu~
# i 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Or we may want to keep the records that are not human (this example uses R's native pipe |>, which behaves like %>% here)

starwars |> 
  dplyr::filter(species != "Human")

# A tibble: 48 x 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu~
 2 R2-D2        96    32 <NA>       white, bl~ red               33 none  mascu~
 3 R5-D4        97    32 <NA>       white, red red               NA none  mascu~
 4 Chewbac~    228   112 brown      unknown    blue             200 male  mascu~
 5 Greedo      173    74 <NA>       green      black             44 male  mascu~
 6 Jabba D~    175  1358 <NA>       green-tan~ orange           600 herm~ mascu~
 7 Yoda         66    17 white      green      brown            896 male  mascu~
 8 IG-88       200   140 none       metal      red               15 none  mascu~
 9 Bossk       190   113 none       green      red               53 male  mascu~
10 Ackbar      180    83 none       brown mot~ orange            41 male  mascu~
# i 38 more rows
# i 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
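One detail of the filter above is worth a small sketch: filter() keeps only the rows where the condition is TRUE, so characters whose species is NA are dropped along with the humans. The following is a minimal illustration, assuming dplyr is installed; the object names non_human and non_human_or_na are made up for this example.

```r
library(dplyr)

# Rows where species is NA evaluate to NA (not TRUE) and are dropped.
non_human <- starwars |>
  filter(species != "Human")

# To keep the characters of unknown species, test for NA explicitly.
non_human_or_na <- starwars |>
  filter(species != "Human" | is.na(species))

nrow(non_human)
nrow(non_human_or_na)
```

The difference between the two row counts is the number of records with a missing species.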

If what we want is to select columns, we use the verb select()

Continuing with the first example, we want to know which Star Wars characters are from Tatooine, but we only want to see the column with the character's name

starwars %>% 
  dplyr::filter(homeworld == "Tatooine") %>%
  select(name)

# A tibble: 10 x 1
   name              
   <chr>             
 1 Luke Skywalker    
 2 C-3PO             
 3 Darth Vader       
 4 Owen Lars         
 5 Beru Whitesun Lars
 6 R5-D4             
 7 Biggs Darklighter 
 8 Anakin Skywalker  
 9 Shmi Skywalker    
10 Cliegg Lars

In addition to the name, we want to know their height and species

starwars %>% 
  dplyr::filter(homeworld == "Tatooine") %>%
  select(name, height, species)

# A tibble: 10 x 3
   name               height species
   <chr>               <int> <chr>  
 1 Luke Skywalker        172 Human  
 2 C-3PO                 167 Droid  
 3 Darth Vader           202 Human  
 4 Owen Lars             178 Human  
 5 Beru Whitesun Lars    165 Human  
 6 R5-D4                  97 Droid  
 7 Biggs Darklighter     183 Human  
 8 Anakin Skywalker      188 Human  
 9 Shmi Skywalker        163 Human  
10 Cliegg Lars           183 Human

Suppose we do not know in advance how many species there are, and we want to find out in order to apply the right filters. In base R we would have to extract the column and use the function unique() (unique(starwars$species)). In dplyr we have the function distinct()

starwars |> 
  select(species) |> 
  distinct()

# A tibble: 38 x 1
   species       
   <chr>         
 1 Human         
 2 Droid         
 3 Wookiee       
 4 Rodian        
 5 Hutt          
 6 <NA>          
 7 Yoda's species
 8 Trandoshan    
 9 Mon Calamari  
10 Ewok          
# i 28 more rows

If we want to count how many records there are per species, we use the function count()

starwars |> 
  count(species)

# A tibble: 38 x 2
   species       n
   <chr>     <int>
 1 Aleena        1
 2 Besalisk      1
 3 Cerean        1
 4 Chagrian      1
 5 Clawdite      1
 6 Droid         6
 7 Dug           1
 8 Ewok          1
 9 Geonosian     1
10 Gungan        3
# i 28 more rows

If we want the counts presented in descending order

starwars |> 
  count(species, sort = TRUE)

# A tibble: 38 x 2
   species      n
   <chr>    <int>
 1 Human       35
 2 Droid        6
 3 <NA>         4
 4 Gungan       3
 5 Kaminoan     2
 6 Mirialan     2
 7 Twi'lek      2
 8 Wookiee      2
 9 Zabrak       2
10 Aleena       1
# i 28 more rows

If, once the desired columns have been selected, we want to sort the rows as in the previous case, we use the verb arrange(), which sorts in ascending order by default. In this case we select the name and body mass columns

starwars %>%
  select(name, mass) %>%
  arrange(mass)

# A tibble: 87 x 2
   name                   mass
   <chr>                 <dbl>
 1 Ratts Tyerel             15
 2 Yoda                     17
 3 Wicket Systri Warrick    20
 4 R2-D2                    32
 5 R5-D4                    32
 6 Sebulba                  40
 7 Padmé Amidala            45
 8 Dud Bolt                 45
 9 Wat Tambor               48
10 Sly Moore                48
# i 77 more rows

To sort from highest to lowest we use the function desc() inside arrange()

starwars %>%
  select(name, mass) %>%
  arrange(desc(mass))

# A tibble: 87 x 2
   name                   mass
   <chr>                 <dbl>
 1 Jabba Desilijic Tiure  1358
 2 Grievous                159
 3 IG-88                   140
 4 Darth Vader             136
 5 Tarfful                 136
 6 Owen Lars               120
 7 Bossk                   113
 8 Chewbacca               112
 9 Jek Tono Porkins        110
10 Dexter Jettster         102
# i 77 more rows

Suppose we want to select only the columns that meet a given condition. This can be done with the function where() inside the verb select(). For example, if we only want the columns that contain numeric variables

starwars %>%
  select(where(is.numeric))

# A tibble: 87 x 3
   height  mass birth_year
    <int> <dbl>      <dbl>
 1    172    77       19  
 2    167    75      112  
 3     96    32       33  
 4    202   136       41.9
 5    150    49       19  
 6    178   120       52  
 7    165    75       47  
 8     97    32       NA  
 9    183    84       24  
10    182    77       57  
# i 77 more rows

There are also more specific variants of select(). For this case we could also have used select_if() with the condition, although recent versions of dplyr mark select_if() as superseded in favor of select(where(...))

starwars %>%
  select_if(is.numeric)

# A tibble: 87 x 3
   height  mass birth_year
    <int> <dbl>      <dbl>
 1    172    77       19  
 2    167    75      112  
 3     96    32       33  
 4    202   136       41.9
 5    150    49       19  
 6    178   120       52  
 7    165    75       47  
 8     97    32       NA  
 9    183    84       24  
10    182    77       57  
# i 77 more rows

Note

There are several helper functions that allow a finer selection of columns. To take a look, run ?select
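As an illustration of those helpers, here is a short sketch using three of them on starwars. It assumes dplyr is installed; the object names color_cols, underscore_cols and first_cols are only for this example.

```r
library(dplyr)

# Columns whose names end in "color", plus the character's name
color_cols <- starwars |>
  select(name, ends_with("color"))

# Columns whose names contain an underscore
underscore_cols <- starwars |>
  select(contains("_"))

# A contiguous range of columns, from name to mass
first_cols <- starwars |>
  select(name:mass)

names(color_cols)
names(underscore_cols)
names(first_cols)
```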

Something we do very frequently is create new columns by transforming existing ones. We do this with the verb mutate()

starwars |> 
  mutate(H_W = height / mass)

# A tibble: 87 x 15
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
 3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
 4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
 5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
 6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
 7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
 9 Biggs D~    183    84 black      light      brown           24   male  mascu~
10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
# i 77 more rows
# i 6 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, H_W <dbl>

It is possible to create several columns at the same time

starwars |> 
  mutate(H_W = height / mass,
         fakeVar = sqrt(mass))

# A tibble: 87 x 16
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
 3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
 4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
 5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
 6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
 7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
 9 Biggs D~    183    84 black      light      brown           24   male  mascu~
10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
# i 77 more rows
# i 7 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, H_W <dbl>, fakeVar <dbl>

To wrap up this overly brief tour of dplyr, we introduce two of the most used verbs in the data analysis cycle: group_by() and summarise()

For those who know SQL, these two verbs are the equivalent of a GROUP BY and an aggregation operation

For example, suppose we want to obtain the mean height by gender

starwars |> 
  group_by(gender) |> 
  summarise(altura_media = mean(height, na.rm = TRUE))

# A tibble: 3 x 2
  gender    altura_media
  <chr>            <dbl>
1 feminine          167.
2 masculine         177.
3 <NA>              175

Or the mean weight by home planet

starwars |> 
  group_by(homeworld) |> 
  summarise(peso_medio = mean(mass, na.rm = TRUE))

# A tibble: 49 x 2
   homeworld      peso_medio
   <chr>               <dbl>
 1 Alderaan             64  
 2 Aleen Minor          15  
 3 Bespin               79  
 4 Bestine IV          110  
 5 Cato Neimoidia       90  
 6 Cerea                82  
 7 Champala            NaN  
 8 Chandrila           NaN  
 9 Concord Dawn         79  
10 Corellia             78.5
# i 39 more rows

Remember that we can sort the result

starwars |> 
  group_by(homeworld) |> 
  summarise(peso_medio = mean(mass, na.rm = TRUE)) |> 
  arrange(desc(peso_medio))

# A tibble: 49 x 2
   homeworld      peso_medio
   <chr>               <dbl>
 1 Nal Hutta          1358  
 2 Kalee               159  
 3 Kashyyyk            124  
 4 Trandosha           113  
 5 Bestine IV          110  
 6 Ojom                102  
 7 Cato Neimoidia       90  
 8 Glee Anselm          87  
 9 Tatooine             85.4
10 Haruun Kal           84  
# i 39 more rows

Number of records per planet, sorted from highest to lowest

starwars |> 
  group_by(homeworld) |> 
  summarise(registros = n()) |> 
  arrange(desc(registros))

# A tibble: 49 x 2
   homeworld registros
   <chr>         <int>
 1 Naboo            11
 2 Tatooine         10
 3 <NA>             10
 4 Alderaan          3
 5 Coruscant         3
 6 Kamino            3
 7 Corellia          2
 8 Kashyyyk          2
 9 Mirial            2
10 Ryloth            2
# i 39 more rows

Finally, if we want to change the name of a column we use the function rename(), and to change the order of the columns we use relocate()

starwars |> 
  rename(planeta = homeworld)

# A tibble: 87 x 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
 3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
 4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
 5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
 6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
 7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
 9 Biggs D~    183    84 black      light      brown           24   male  mascu~
10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
# i 77 more rows
# i 5 more variables: planeta <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

starwars |> 
  relocate(species, .after = mass)

# A tibble: 87 x 14
   name    height  mass species hair_color skin_color eye_color birth_year sex  
   <chr>    <int> <dbl> <chr>   <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke S~    172    77 Human   blond      fair       blue            19   male 
 2 C-3PO      167    75 Droid   <NA>       gold       yellow         112   none 
 3 R2-D2       96    32 Droid   <NA>       white, bl~ red             33   none 
 4 Darth ~    202   136 Human   none       white      yellow          41.9 male 
 5 Leia O~    150    49 Human   brown      light      brown           19   fema~
 6 Owen L~    178   120 Human   brown, gr~ light      blue            52   male 
 7 Beru W~    165    75 Human   brown      light      blue            47   fema~
 8 R5-D4       97    32 Droid   <NA>       white, red red             NA   none 
 9 Biggs ~    183    84 Human   black      light      brown           24   male 
10 Obi-Wa~    182    77 Human   auburn, w~ fair       blue-gray       57   male 
# i 77 more rows
# i 5 more variables: gender <chr>, homeworld <chr>, films <list>,
#   vehicles <list>, starships <list>

One more thing worth knowing, which combines well with the map functions, is nested tibbles. Nested tibbles can be created with the verb nest_by()

starwars_nested <- starwars |>
  nest_by(homeworld)

starwars_nested

# A tibble: 49 x 2
# Rowwise:  homeworld
   homeworld                     data
   <chr>          <list<tibble[,13]>>
 1 Alderaan                  [3 x 13]
 2 Aleen Minor               [1 x 13]
 3 Bespin                    [1 x 13]
 4 Bestine IV                [1 x 13]
 5 Cato Neimoidia            [1 x 13]
 6 Cerea                     [1 x 13]
 7 Champala                  [1 x 13]
 8 Chandrila                 [1 x 13]
 9 Concord Dawn              [1 x 13]
10 Corellia                  [2 x 13]
# i 39 more rows

A nested tibble is accessed with list syntax; here, position 40 holds the Tatooine group

starwars_nested$data[[40]]

# A tibble: 10 x 13
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
 3 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
 4 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
 5 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
 6 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
 7 Biggs D~    183    84 black      light      brown           24   male  mascu~
 8 Anakin ~    188    84 blond      fair       blue            41.9 male  mascu~
 9 Shmi Sk~    163    NA black      fair       brown           72   fema~ femin~
10 Cliegg ~    183    NA brown      fair       blue            82   male  mascu~
# i 4 more variables: species <chr>, films <list>, vehicles <list>,
#   starships <list>
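Because nest_by() returns a rowwise tibble, each nested tibble can also be summarised without leaving dplyr. This is a minimal sketch assuming dplyr is installed; it uses the rowwise behaviour rather than the map functions mentioned above, and the column names n_characters and mean_height are invented for the example.

```r
library(dplyr)

starwars_nested <- starwars |>
  nest_by(homeworld)

# In a rowwise tibble, `data` inside mutate() refers to the single
# nested tibble of the current row, not to the whole list column.
nested_summary <- starwars_nested |>
  mutate(n_characters = nrow(data),
         mean_height = mean(data$height, na.rm = TRUE)) |>
  ungroup() |>
  select(homeworld, n_characters, mean_height)

nested_summary
```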

Beyond tibbles: joins

We close this section by addressing a problem you have already run into in previous modules: it is common to work with more than one data table that are related to each other, so we turn to the familiar JOINs and their variants.

We can also perform these operations in the tidyverse with the following functions:

left_join()
right_join()
inner_join()
full_join()
anti_join()
semi_join()

Note

A good introduction to joins with dplyr is this one. And for a comparison with base R and SQL, this one

We will use a toy example to show the most basic joins. We create two datasets using the function tribble()

df_A <- tribble(
  ~ID, ~y,
   "A", 5,
   "B", 5,
   "C", 8,
   "D", 0,
   "F", 9)
df_B <- tribble(
  ~ID, ~z,
   "A", 30,
   "B", 21,
   "C", 22,
   "D", 25,
   "E", 29)

df_A

# A tibble: 5 x 2
  ID        y
  <chr> <dbl>
1 A         5
2 B         5
3 C         8
4 D         0
5 F         9

df_B

# A tibble: 5 x 2
  ID        z
  <chr> <dbl>
1 A        30
2 B        21
3 C        22
4 D        25
5 E        29

Left join

left_join(x = df_A, y = df_B, by = "ID")

# A tibble: 5 x 3
  ID        y     z
  <chr> <dbl> <dbl>
1 A         5    30
2 B         5    21
3 C         8    22
4 D         0    25
5 F         9    NA

Right join

right_join(x = df_A, y = df_B, by = "ID")

# A tibble: 5 x 3
  ID        y     z
  <chr> <dbl> <dbl>
1 A         5    30
2 B         5    21
3 C         8    22
4 D         0    25
5 E        NA    29

Inner join

inner_join(x = df_A, y = df_B, by = "ID")

# A tibble: 4 x 3
  ID        y     z
  <chr> <dbl> <dbl>
1 A         5    30
2 B         5    21
3 C         8    22
4 D         0    25

Full join

full_join(x = df_A, y = df_B, by = "ID")

# A tibble: 6 x 3
  ID        y     z
  <chr> <dbl> <dbl>
1 A         5    30
2 B         5    21
3 C         8    22
4 D         0    25
5 F         9    NA
6 E        NA    29

What happens if the key columns have different names?

colnames(df_B)[1] <- "KL"

full_join(x = df_A, y = df_B, by = c("ID"="KL"))

# A tibble: 6 x 3
  ID        y     z
  <chr> <dbl> <dbl>
1 A         5    30
2 B         5    21
3 C         8    22
4 D         0    25
5 F         9    NA
6 E        NA    29
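The list of join functions above also includes semi_join() and anti_join(), which filter one table by the other instead of adding columns. A brief sketch with the same toy data, assuming dplyr is installed (the tables are redefined here so the example runs on its own; only_matching and without_match are names chosen for the example):

```r
library(dplyr)

df_A <- tribble(
  ~ID, ~y,
  "A", 5,
  "B", 5,
  "C", 8,
  "D", 0,
  "F", 9)
df_B <- tribble(
  ~ID, ~z,
  "A", 30,
  "B", 21,
  "C", 22,
  "D", 25,
  "E", 29)

# Rows of df_A that have a match in df_B; only df_A's columns are kept
only_matching <- semi_join(x = df_A, y = df_B, by = "ID")

# Rows of df_A that have no match in df_B
without_match <- anti_join(x = df_A, y = df_B, by = "ID")

only_matching
without_match
```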

data.table: The need for speed

Now that we have covered part of the tidyverse, we will learn the other major alternative to base R's data frame structure, the data.table library

Before we start looking at this library, we should first ask: why learn other alternatives when we already have base R and the tidyverse?

data.table has some distinctive advantages that make it an attractive alternative. According to the library's own developers, data.table:

“is an extremely fast and memory-efficient package for transforming data in R.”

Following Carlos Gil Bellosta, as he mentions in his text:

“The tidyverse is not the only popular dialect of R. For example, the data.table package proposes another dialect with very different characteristics. Code in that dialect is much less readable, but it has one important advantage: it is incredibly fast and manages memory very well. It is a package (or dialect) worth becoming familiar with in order to work with very large datasets, of millions, tens of millions or even hundreds of millions of rows”

And we finish with this statement from Grant McDermott, professor at the University of Oregon and big data analytics consultant

“The tidyverse is great (…) So why bother learning another data management package/syntax? As far as data.table is concerned, I can think of at least five reasons:

Concise
Incredibly fast
Memory efficient
Feature rich (and stable)
Dependency free”

My personal view is quite specific: data.table is very fast and efficient when handling large amounts of data that have not yet crossed the threshold into Big Data.

But nothing comes for free. As Gil Bellosta mentions, the syntax is less friendly compared with that of the tidyverse.

The data.table syntax can be summarized by the general form DT[i, j, by], where:

i indicates the rows we want to select, either to filter or to run an operation on that subset of rows. It is the equivalent of filter(), slice() and arrange() in dplyr, or WHERE in SQL
j indicates either the columns we want to select or the operation we want to perform on the columns. It is the equivalent of select() and mutate() in dplyr, or SELECT and the aggregation functions in SQL
by indicates how the dataset should be grouped. It is the equivalent of group_by() in dplyr, or GROUP BY in SQL

We will replicate the examples from the previous section so that a direct comparison can be made

First, we install the library and then load it

install.packages("data.table")
library(data.table)

We convert starwars to a data.table object

starwars_dt <- as.data.table(starwars)

starwars_dt[1:10]

                  name height mass    hair_color  skin_color eye_color
 1:     Luke Skywalker    172   77         blond        fair      blue
 2:              C-3PO    167   75          <NA>        gold    yellow
 3:              R2-D2     96   32          <NA> white, blue       red
 4:        Darth Vader    202  136          none       white    yellow
 5:        Leia Organa    150   49         brown       light     brown
 6:          Owen Lars    178  120   brown, grey       light      blue
 7: Beru Whitesun Lars    165   75         brown       light      blue
 8:              R5-D4     97   32          <NA>  white, red       red
 9:  Biggs Darklighter    183   84         black       light     brown
10:     Obi-Wan Kenobi    182   77 auburn, white        fair blue-gray
    birth_year    sex    gender homeworld species
 1:       19.0   male masculine  Tatooine   Human
 2:      112.0   none masculine  Tatooine   Droid
 3:       33.0   none masculine     Naboo   Droid
 4:       41.9   male masculine  Tatooine   Human
 5:       19.0 female  feminine  Alderaan   Human
 6:       52.0   male masculine  Tatooine   Human
 7:       47.0 female  feminine  Tatooine   Human
 8:         NA   none masculine  Tatooine   Droid
 9:       24.0   male masculine  Tatooine   Human
10:       57.0   male masculine   Stewjon   Human
                                                                                                                    films
 1:                           A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
 2:     A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith
 3: A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith,...
 4:                                             A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith
 5:                           A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
 6:                                                                   A New Hope,Attack of the Clones,Revenge of the Sith
 7:                                                                   A New Hope,Attack of the Clones,Revenge of the Sith
 8:                                                                                                            A New Hope
 9:                                                                                                            A New Hope
10:     A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith
                             vehicles
 1: Snowspeeder,Imperial Speeder Bike
 2:                                  
 3:                                  
 4:                                  
 5:             Imperial Speeder Bike
 6:                                  
 7:                                  
 8:                                  
 9:                                  
10:                   Tribubble bongo
                                                                                               starships
 1:                                                                              X-wing,Imperial shuttle
 2:                                                                                                     
 3:                                                                                                     
 4:                                                                                      TIE Advanced x1
 5:                                                                                                     
 6:                                                                                                     
 7:                                                                                                     
 8:                                                                                                     
 9:                                                                                               X-wing
10: Jedi starfighter,Trade Federation cruiser,Naboo star skiff,Jedi Interceptor,Belbullab-22 starfighter

Note that the table is printed in the console differently from a tibble, which in turn was printed differently from a data frame

To begin, we keep only the Star Wars characters who are from Tatooine

starwars_dt[homeworld == "Tatooine"] |> head()

                 name height mass  hair_color skin_color eye_color birth_year
1:     Luke Skywalker    172   77       blond       fair      blue       19.0
2:              C-3PO    167   75        <NA>       gold    yellow      112.0
3:        Darth Vader    202  136        none      white    yellow       41.9
4:          Owen Lars    178  120 brown, grey      light      blue       52.0
5: Beru Whitesun Lars    165   75       brown      light      blue       47.0
6:              R5-D4     97   32        <NA> white, red       red         NA
      sex    gender homeworld species
1:   male masculine  Tatooine   Human
2:   none masculine  Tatooine   Droid
3:   male masculine  Tatooine   Human
4:   male masculine  Tatooine   Human
5: female  feminine  Tatooine   Human
6:   none masculine  Tatooine   Droid
                                                                                                               films
1:                       A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
2: A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith
3:                                         A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith
4:                                                               A New Hope,Attack of the Clones,Revenge of the Sith
5:                                                               A New Hope,Attack of the Clones,Revenge of the Sith
6:                                                                                                        A New Hope
                            vehicles               starships
1: Snowspeeder,Imperial Speeder Bike X-wing,Imperial shuttle
2:                                                          
3:                                           TIE Advanced x1
4:                                                          
5:                                                          
6:

Those who live on Tatooine and whose gender is masculine

starwars_dt[homeworld == "Tatooine" & gender == "masculine"]

                name height mass  hair_color skin_color eye_color birth_year
1:    Luke Skywalker    172   77       blond       fair      blue       19.0
2:             C-3PO    167   75        <NA>       gold    yellow      112.0
3:       Darth Vader    202  136        none      white    yellow       41.9
4:         Owen Lars    178  120 brown, grey      light      blue       52.0
5:             R5-D4     97   32        <NA> white, red       red         NA
6: Biggs Darklighter    183   84       black      light     brown       24.0
7:  Anakin Skywalker    188   84       blond       fair      blue       41.9
8:       Cliegg Lars    183   NA       brown       fair      blue       82.0
    sex    gender homeworld species
1: male masculine  Tatooine   Human
2: none masculine  Tatooine   Droid
3: male masculine  Tatooine   Human
4: male masculine  Tatooine   Human
5: none masculine  Tatooine   Droid
6: male masculine  Tatooine   Human
7: male masculine  Tatooine   Human
8: male masculine  Tatooine   Human
                                                                                                               films
1:                       A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
2: A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith
3:                                         A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith
4:                                                               A New Hope,Attack of the Clones,Revenge of the Sith
5:                                                                                                        A New Hope
6:                                                                                                        A New Hope
7:                                                       The Phantom Menace,Attack of the Clones,Revenge of the Sith
8:                                                                                              Attack of the Clones
                              vehicles
1:   Snowspeeder,Imperial Speeder Bike
2:                                    
3:                                    
4:                                    
5:                                    
6:                                    
7: Zephyr-G swoop bike,XJ-6 airspeeder
8:                                    
                                                 starships
1:                                 X-wing,Imperial shuttle
2:                                                        
3:                                         TIE Advanced x1
4:                                                        
5:                                                        
6:                                                  X-wing
7: Naboo fighter,Trade Federation cruiser,Jedi Interceptor
8:

Only the records that are not human

starwars_dt[species != "Human"] |> head()

                    name height mass hair_color       skin_color eye_color
1:                 C-3PO    167   75       <NA>             gold    yellow
2:                 R2-D2     96   32       <NA>      white, blue       red
3:                 R5-D4     97   32       <NA>       white, red       red
4:             Chewbacca    228  112      brown          unknown      blue
5:                Greedo    173   74       <NA>            green     black
6: Jabba Desilijic Tiure    175 1358       <NA> green-tan, brown    orange
   birth_year            sex    gender homeworld species
1:        112           none masculine  Tatooine   Droid
2:         33           none masculine     Naboo   Droid
3:         NA           none masculine  Tatooine   Droid
4:        200           male masculine  Kashyyyk Wookiee
5:         44           male masculine     Rodia  Rodian
6:        600 hermaphroditic masculine Nal Hutta    Hutt
                                                                                                                   films
1:     A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith
2: A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith,...
3:                                                                                                            A New Hope
4:                           A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
5:                                                                                                            A New Hope
6:                                                                      A New Hope,Return of the Jedi,The Phantom Menace
   vehicles                          starships
1:                                            
2:                                            
3:                                            
4:    AT-ST Millennium Falcon,Imperial shuttle
5:                                            
6:

We can also select rows by position, for example the first 6 rows

starwars_dt[1:6]

             name height mass  hair_color  skin_color eye_color birth_year
1: Luke Skywalker    172   77       blond        fair      blue       19.0
2:          C-3PO    167   75        <NA>        gold    yellow      112.0
3:          R2-D2     96   32        <NA> white, blue       red       33.0
4:    Darth Vader    202  136        none       white    yellow       41.9
5:    Leia Organa    150   49       brown       light     brown       19.0
6:      Owen Lars    178  120 brown, grey       light      blue       52.0
      sex    gender homeworld species
1:   male masculine  Tatooine   Human
2:   none masculine  Tatooine   Droid
3:   none masculine     Naboo   Droid
4:   male masculine  Tatooine   Human
5: female  feminine  Alderaan   Human
6:   male masculine  Tatooine   Human
                                                                                                                   films
1:                           A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
2:     A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith
3: A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith,...
4:                                             A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith
5:                           A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
6:                                                                   A New Hope,Attack of the Clones,Revenge of the Sith
                            vehicles               starships
1: Snowspeeder,Imperial Speeder Bike X-wing,Imperial shuttle
2:                                                          
3:                                                          
4:                                           TIE Advanced x1
5:             Imperial Speeder Bike                        
6:
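As a minimal sketch of positional row selection, the snippet below uses a small made-up table (dt and its columns id and value are hypothetical, not part of starwars_dt). It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy table used only for illustration
dt <- data.table(id = 1:5, value = c(10, 20, 30, 40, 50))

first_three <- dt[1:3]      # rows 1 to 3
picked      <- dt[c(1, 5)]  # rows 1 and 5 only
without_1   <- dt[-1]       # every row except the first
last_row    <- dt[.N]       # .N holds the number of rows, so this is the last row
```

With a single argument inside the brackets, data.table treats it as a row selection, which is why no trailing comma is needed here.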

Now we select columns. Remember that column selection goes in the second argument of the brackets (j):

starwars_dt[homeworld == "Tatooine", list(name)]

                  name
 1:     Luke Skywalker
 2:              C-3PO
 3:        Darth Vader
 4:          Owen Lars
 5: Beru Whitesun Lars
 6:              R5-D4
 7:  Biggs Darklighter
 8:   Anakin Skywalker
 9:     Shmi Skywalker
10:        Cliegg Lars

Column names must be passed as a list. Fortunately, data.table offers .() as an alias for list():

starwars_dt[homeworld == "Tatooine", .(name)]

                  name
 1:     Luke Skywalker
 2:              C-3PO
 3:        Darth Vader
 4:          Owen Lars
 5: Beru Whitesun Lars
 6:              R5-D4
 7:  Biggs Darklighter
 8:   Anakin Skywalker
 9:     Shmi Skywalker
10:        Cliegg Lars

If we pass only the bare column name, the result is a vector instead of a data.table:

starwars_dt[homeworld == "Tatooine", name]

 [1] "Luke Skywalker"     "C-3PO"              "Darth Vader"       
 [4] "Owen Lars"          "Beru Whitesun Lars" "R5-D4"             
 [7] "Biggs Darklighter"  "Anakin Skywalker"   "Shmi Skywalker"    
[10] "Cliegg Lars"
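The difference between the two forms can be checked directly. The sketch below uses a made-up table (dt, name and height are hypothetical names) and assumes data.table is installed.

```r
library(data.table)

# Hypothetical toy table used only for illustration
dt <- data.table(name = c("a", "b", "c"), height = c(1, 2, 3))

as_table  <- dt[, .(name)]  # wrapped in .(): the result is still a data.table
as_vector <- dt[, name]     # bare name: the result is a plain character vector

is.data.table(as_table)
is.data.table(as_vector)
is.character(as_vector)
```

Keeping the .() wrapper is usually the safer choice when the result feeds into further data.table operations, since a plain vector no longer accepts the [i, j] syntax.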

Note

Try the above with two columns. What does it return? Does it make sense?

Now try it with three columns. Why do you think it raises an error?

Try the same selection by passing a character vector with the names of the columns you want to select

More columns

starwars_dt[homeworld == "Tatooine", .(name, height, species)]

                  name height species
 1:     Luke Skywalker    172   Human
 2:              C-3PO    167   Droid
 3:        Darth Vader    202   Human
 4:          Owen Lars    178   Human
 5: Beru Whitesun Lars    165   Human
 6:              R5-D4     97   Droid
 7:  Biggs Darklighter    183   Human
 8:   Anakin Skywalker    188   Human
 9:     Shmi Skywalker    163   Human
10:        Cliegg Lars    183   Human

To sort the data.table by a particular column:

starwars_dt[, .(name, mass)][order(mass)]

                     name   mass
 1:          Ratts Tyerel   15.0
 2:                  Yoda   17.0
 3: Wicket Systri Warrick   20.0
 4:                 R2-D2   32.0
 5:                 R5-D4   32.0
 6:               Sebulba   40.0
 7:         Padmé Amidala   45.0
 8:              Dud Bolt   45.0
 9:            Wat Tambor   48.0
10:             Sly Moore   48.0
11:           Leia Organa   49.0
12:            Adi Gallia   50.0
13:         Barriss Offee   50.0
14:           Ayla Secura   55.0
15:            Zam Wesell   55.0
16:       Luminara Unduli   56.2
17:              Shaak Ti   57.0
18:        Ben Quadinaros   65.0
19:         Jar Jar Binks   66.0
20:             Nien Nunb   68.0
21:                Greedo   74.0
22:                 C-3PO   75.0
23:    Beru Whitesun Lars   75.0
24:             Palpatine   75.0
25:        Luke Skywalker   77.0
26:        Obi-Wan Kenobi   77.0
27:        Wedge Antilles   77.0
28:             Boba Fett   78.2
29:      Lando Calrissian   79.0
30:                 Lobot   79.0
31:            Jango Fett   79.0
32:       Raymus Antilles   79.0
33:              Han Solo   80.0
34:            Darth Maul   80.0
35:              Plo Koon   80.0
36:     Poggle the Lesser   80.0
37:                 Dooku   80.0
38:            Tion Medon   80.0
39:          Roos Tarpals   82.0
40:          Ki-Adi-Mundi   82.0
41:                Ackbar   83.0
42:     Biggs Darklighter   84.0
43:      Anakin Skywalker   84.0
44:            Mace Windu   84.0
45:          Gregar Typho   85.0
46:             Kit Fisto   87.0
47:               Lama Su   88.0
48:          Qui-Gon Jinn   89.0
49:           Nute Gunray   90.0
50:       Dexter Jettster  102.0
51:      Jek Tono Porkins  110.0
52:             Chewbacca  112.0
53:                 Bossk  113.0
54:             Owen Lars  120.0
55:           Darth Vader  136.0
56:               Tarfful  136.0
57:                 IG-88  140.0
58:              Grievous  159.0
59: Jabba Desilijic Tiure 1358.0
60:        Wilhuff Tarkin     NA
61:            Mon Mothma     NA
62:          Arvel Crynyd     NA
63:         Finis Valorum     NA
64:            Rugor Nass     NA
65:              Ric Olié     NA
66:                 Watto     NA
67:         Quarsh Panaka     NA
68:        Shmi Skywalker     NA
69:           Bib Fortuna     NA
70:               Gasgano     NA
71:             Eeth Koth     NA
72:           Saesee Tiin     NA
73:           Yarael Poof     NA
74:            Mas Amedda     NA
75:                 Cordé     NA
76:           Cliegg Lars     NA
77:                 Dormé     NA
78:   Bail Prestor Organa     NA
79:               Taun We     NA
80:            Jocasta Nu     NA
81:                R4-P17     NA
82:              San Hill     NA
83:                  Finn     NA
84:                   Rey     NA
85:           Poe Dameron     NA
86:                   BB8     NA
87:        Captain Phasma     NA
                     name   mass

Two things are new here:

First, when there is no operation on the rows, the first argument is left empty and the columns are selected in the second argument.

Second, once the columns of interest are selected, we want to operate on the rows, so we chain that operation by opening a new pair of brackets and using its first argument, which is the row argument.
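The two points above can be sketched on a toy table. The table dt and its columns are hypothetical, and the comparison with a single-bracket call is an addition for illustration, not something the original example does.

```r
library(data.table)

# Hypothetical toy table used only for illustration
dt <- data.table(
  name   = c("a", "b", "c", "d"),
  mass   = c(30, NA, 10, 20),
  height = c(1, 2, 3, 4)
)

# Chained form: select columns first (i left empty), then order the rows
chained <- dt[, .(name, mass)][order(mass)]

# Single call: order rows in i and select columns in j at the same time
single <- dt[order(mass), .(name, mass)]

# Both forms should describe the same table
isTRUE(all.equal(chained, single))
```

Chaining becomes necessary when the second step depends on something that only exists after the first step, such as a column created or summarised in j.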

To sort in descending order:

starwars_dt[, .(name, mass)][order(-mass)]

                     name   mass
 1: Jabba Desilijic Tiure 1358.0
 2:              Grievous  159.0
 3:                 IG-88  140.0
 4:           Darth Vader  136.0
 5:               Tarfful  136.0
 6:             Owen Lars  120.0
 7:                 Bossk  113.0
 8:             Chewbacca  112.0
 9:      Jek Tono Porkins  110.0
10:       Dexter Jettster  102.0
11:           Nute Gunray   90.0
12:          Qui-Gon Jinn   89.0
13:               Lama Su   88.0
14:             Kit Fisto   87.0
15:          Gregar Typho   85.0
16:     Biggs Darklighter   84.0
17:      Anakin Skywalker   84.0
18:            Mace Windu   84.0
19:                Ackbar   83.0
20:          Roos Tarpals   82.0
21:          Ki-Adi-Mundi   82.0
22:              Han Solo   80.0
23:            Darth Maul   80.0
24:              Plo Koon   80.0
25:     Poggle the Lesser   80.0
26:                 Dooku   80.0
27:            Tion Medon   80.0
28:      Lando Calrissian   79.0
29:                 Lobot   79.0
30:            Jango Fett   79.0
31:       Raymus Antilles   79.0
32:             Boba Fett   78.2
33:        Luke Skywalker   77.0
34:        Obi-Wan Kenobi   77.0
35:        Wedge Antilles   77.0
36:                 C-3PO   75.0
37:    Beru Whitesun Lars   75.0
38:             Palpatine   75.0
39:                Greedo   74.0
40:             Nien Nunb   68.0
41:         Jar Jar Binks   66.0
42:        Ben Quadinaros   65.0
43:              Shaak Ti   57.0
44:       Luminara Unduli   56.2
45:           Ayla Secura   55.0
46:            Zam Wesell   55.0
47:            Adi Gallia   50.0
48:         Barriss Offee   50.0
49:           Leia Organa   49.0
50:            Wat Tambor   48.0
51:             Sly Moore   48.0
52:         Padmé Amidala   45.0
53:              Dud Bolt   45.0
54:               Sebulba   40.0
55:                 R2-D2   32.0
56:                 R5-D4   32.0
57: Wicket Systri Warrick   20.0
58:                  Yoda   17.0
59:          Ratts Tyerel   15.0
60:        Wilhuff Tarkin     NA
61:            Mon Mothma     NA
62:          Arvel Crynyd     NA
63:         Finis Valorum     NA
64:            Rugor Nass     NA
65:              Ric Olié     NA
66:                 Watto     NA
67:         Quarsh Panaka     NA
68:        Shmi Skywalker     NA
69:           Bib Fortuna     NA
70:               Gasgano     NA
71:             Eeth Koth     NA
72:           Saesee Tiin     NA
73:           Yarael Poof     NA
74:            Mas Amedda     NA
75:                 Cordé     NA
76:           Cliegg Lars     NA
77:                 Dormé     NA
78:   Bail Prestor Organa     NA
79:               Taun We     NA
80:            Jocasta Nu     NA
81:                R4-P17     NA
82:              San Hill     NA
83:                  Finn     NA
84:                   Rey     NA
85:           Poe Dameron     NA
86:                   BB8     NA
87:        Captain Phasma     NA
                     name   mass
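As a small sketch of descending order with a tie-breaker, the snippet below sorts a made-up table by mass from largest to smallest and then by name. The table and column names are hypothetical. It also shows setorder(), which reorders the table in place instead of returning a new one.

```r
library(data.table)

# Hypothetical toy table used only for illustration
dt <- data.table(
  name = c("b", "a", "c", "d"),
  mass = c(80, 80, NA, 15)
)

# Descending by mass, ties broken by name in ascending order;
# rows with missing mass end up last, as in the output above
by_mass_desc <- dt[order(-mass, name)]

# In-place alternative: setorder() modifies its argument by reference,
# so we work on a copy to leave dt untouched
dt_sorted <- copy(dt)
setorder(dt_sorted, -mass, name, na.last = TRUE)
```

Note that setorder() places missing values first unless na.last = TRUE is given, which differs from order() inside the brackets.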

To select only the columns that meet a given condition, for example only the columns that hold numeric variables, the operation is somewhat more verbose:

starwars_dt[, .SD, .SDcols = is.numeric]

    height   mass birth_year
 1:    172   77.0       19.0
 2:    167   75.0      112.0
 3:     96   32.0       33.0
 4:    202  136.0       41.9
 5:    150   49.0       19.0
 6:    178  120.0       52.0
 7:    165   75.0       47.0
 8:     97   32.0         NA
 9:    183   84.0       24.0
10:    182   77.0       57.0
11:    188   84.0       41.9
12:    180     NA       64.0
13:    228  112.0      200.0
14:    180   80.0       29.0
15:    173   74.0       44.0
16:    175 1358.0      600.0
17:    170   77.0       21.0
18:    180  110.0         NA
19:     66   17.0      896.0
20:    170   75.0       82.0
21:    183   78.2       31.5
22:    200  140.0       15.0
23:    190  113.0       53.0
24:    177   79.0       31.0
25:    175   79.0       37.0
26:    180   83.0       41.0
27:    150     NA       48.0
28:     NA     NA         NA
29:     88   20.0        8.0
30:    160   68.0         NA
31:    193   89.0       92.0
32:    191   90.0         NA
33:    170     NA       91.0
34:    185   45.0       46.0
35:    196   66.0       52.0
36:    224   82.0         NA
37:    206     NA         NA
38:    183     NA         NA
39:    137     NA         NA
40:    112   40.0         NA
41:    183     NA       62.0
42:    163     NA       72.0
43:    175   80.0       54.0
44:    180     NA         NA
45:    178   55.0       48.0
46:     79   15.0         NA
47:     94   45.0         NA
48:    122     NA         NA
49:    163   65.0         NA
50:    188   84.0       72.0
51:    198   82.0       92.0
52:    196   87.0         NA
53:    171     NA         NA
54:    184   50.0         NA
55:    188     NA         NA
56:    264     NA         NA
57:    188   80.0       22.0
58:    196     NA         NA
59:    185   85.0         NA
60:    157     NA         NA
61:    183     NA       82.0
62:    183   80.0         NA
63:    170   56.2       58.0
64:    166   50.0       40.0
65:    165     NA         NA
66:    193   80.0      102.0
67:    191     NA       67.0
68:    183   79.0       66.0
69:    168   55.0         NA
70:    198  102.0         NA
71:    229   88.0         NA
72:    213     NA         NA
73:    167     NA         NA
74:     96     NA         NA
75:    193   48.0         NA
76:    191     NA         NA
77:    178   57.0         NA
78:    216  159.0         NA
79:    234  136.0         NA
80:    188   79.0         NA
81:    178   48.0         NA
82:    206   80.0         NA
83:     NA     NA         NA
84:     NA     NA         NA
85:     NA     NA         NA
86:     NA     NA         NA
87:     NA     NA         NA
    height   mass birth_year

Here we introduce two new elements: .SD (short for Subset of Data) stands for a subset of the data.table, made up of the columns listed in .SDcols. It is not intuitive at first sight, so let us look at a less trivial example
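To make the role of .SDcols more concrete, here is a minimal sketch on a made-up table (dt and its columns are hypothetical). It assumes a data.table version recent enough to accept a function or patterns() in .SDcols.

```r
library(data.table)

# Hypothetical toy table used only for illustration
dt <- data.table(
  name   = c("a", "b"),
  height = c(1.5, 2),
  mass   = c(10L, 20L),
  planet = c("x", "y")
)

# .SDcols as a predicate: keep the columns for which the function returns TRUE
numeric_cols <- dt[, .SD, .SDcols = is.numeric]
char_cols    <- dt[, .SD, .SDcols = is.character]

# .SDcols as a regular expression on the column names
by_pattern <- dt[, .SD, .SDcols = patterns("^ma")]

names(numeric_cols)
names(char_cols)
names(by_pattern)
```

In each call .SD is the same object, the current subset of the table, and .SDcols only decides which columns that subset contains.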

Suppose we want to compute the mean of the numeric columns:

starwars_dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = is.numeric]

     height     mass birth_year
1: 174.6049 97.31186   87.56512

Or of just two columns:

starwars_dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("height", "mass")]

     height     mass
1: 174.6049 97.31186
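One way to read the lapply(.SD, ...) idiom is as a shortcut for writing the same summary once per column. The sketch below compares both spellings on a made-up table (dt and its values are hypothetical).

```r
library(data.table)

# Hypothetical toy table used only for illustration
dt <- data.table(
  height = c(100, 200, NA),
  mass   = c(10, NA, 30)
)

# Apply mean() to every column listed in .SDcols
via_sd <- dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("height", "mass")]

# The same summary written out by hand, one expression per column
by_hand <- dt[, .(height = mean(height, na.rm = TRUE),
                  mass   = mean(mass, na.rm = TRUE))]

# Both should give the same one-row table
isTRUE(all.equal(via_sd, by_hand))
```

The hand-written form is fine for two columns, while the .SD form scales to any number of columns without repeating the function call.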

In data.table, new columns are created by transforming existing ones with :=

starwars_dt[, H_W := height / mass][] |> head()

             name height mass  hair_color  skin_color eye_color birth_year
1: Luke Skywalker    172   77       blond        fair      blue       19.0
2:          C-3PO    167   75        <NA>        gold    yellow      112.0
3:          R2-D2     96   32        <NA> white, blue       red       33.0
4:    Darth Vader    202  136        none       white    yellow       41.9
5:    Leia Organa    150   49       brown       light     brown       19.0
6:      Owen Lars    178  120 brown, grey       light      blue       52.0
      sex    gender homeworld species
1:   male masculine  Tatooine   Human
2:   none masculine  Tatooine   Droid
3:   none masculine     Naboo   Droid
4:   male masculine  Tatooine   Human
5: female  feminine  Alderaan   Human
6:   male masculine  Tatooine   Human
                                                                                                                   films
1:                           A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
2:     A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith
3: A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith,...
4:                                             A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith
5:                           A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
6:                                                                   A New Hope,Attack of the Clones,Revenge of the Sith
                            vehicles               starships      H_W
1: Snowspeeder,Imperial Speeder Bike X-wing,Imperial shuttle 2.233766
2:                                                           2.226667
3:                                                           3.000000
4:                                           TIE Advanced x1 1.485294
5:             Imperial Speeder Bike                         3.061224
6:                                                           1.483333

Several columns can be created at the same time:

starwars_dt[, `:=` (H_W = height / mass,
                    fakeVar = sqrt(mass))][] |> head()

             name height mass  hair_color  skin_color eye_color birth_year
1: Luke Skywalker    172   77       blond        fair      blue       19.0
2:          C-3PO    167   75        <NA>        gold    yellow      112.0
3:          R2-D2     96   32        <NA> white, blue       red       33.0
4:    Darth Vader    202  136        none       white    yellow       41.9
5:    Leia Organa    150   49       brown       light     brown       19.0
6:      Owen Lars    178  120 brown, grey       light      blue       52.0
      sex    gender homeworld species
1:   male masculine  Tatooine   Human
2:   none masculine  Tatooine   Droid
3:   none masculine     Naboo   Droid
4:   male masculine  Tatooine   Human
5: female  feminine  Alderaan   Human
6:   male masculine  Tatooine   Human
                                                                                                                   films
1:                           A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
2:     A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith
3: A New Hope,The Empire Strikes Back,Return of the Jedi,The Phantom Menace,Attack of the Clones,Revenge of the Sith,...
4:                                             A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith
5:                           A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith,The Force Awakens
6:                                                                   A New Hope,Attack of the Clones,Revenge of the Sith
                            vehicles               starships      H_W   fakeVar
1: Snowspeeder,Imperial Speeder Bike X-wing,Imperial shuttle 2.233766  8.774964
2:                                                           2.226667  8.660254
3:                                                           3.000000  5.656854
4:                                           TIE Advanced x1 1.485294 11.661904
5:             Imperial Speeder Bike                         3.061224  7.000000
6:                                                           1.483333 10.954451
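A point worth illustrating is that := changes the table in place, which is why the examples above need no assignment and append [] only to force printing. The following sketch uses a made-up table (dt, ratio and flag are hypothetical names) to show the consequence for plain assignment versus copy().

```r
library(data.table)

# Hypothetical toy table used only for illustration
dt <- data.table(height = c(172, 167), mass = c(77, 75))

# := adds the column to dt itself; nothing is assigned and nothing is printed
dt[, ratio := height / mass]

# A plain assignment does not copy: both names point at the same table,
# so a column added through the alias also appears in dt
dt_alias <- dt
dt_alias[, flag := TRUE]

# copy() creates an independent table; removing a column here leaves dt intact
dt_copy <- copy(dt)
dt_copy[, ratio := NULL]

names(dt)
names(dt_copy)
```

Using copy() before modifying is a common precaution when the original table must stay unchanged.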

And to perform grouping operations in data.table, we use the by argument:

starwars_dt[, .(altura_media = mean(height, na.rm = TRUE)), by = gender]

      gender altura_media
1: masculine     176.5323
2:  feminine     166.5333
3:      <NA>     175.0000
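Grouped summaries can return several values per group. As a minimal sketch on a made-up table (dt and its values are hypothetical), the snippet below counts the rows in each group with .N and computes a mean in the same call.

```r
library(data.table)

# Hypothetical toy table used only for illustration
dt <- data.table(
  gender = c("m", "m", "f", NA),
  height = c(180, 170, 160, 175)
)

# One row per group: group size and mean height
# With by, groups appear in order of first appearance; NA forms its own group
summary_dt <- dt[, .(n = .N, mean_height = mean(height, na.rm = TRUE)),
                 by = gender]

# keyby does the same but also sorts the result by the grouping column
summary_sorted <- dt[, .(n = .N), keyby = gender]
```

The choice between by and keyby depends on whether the original order of appearance or a sorted result is more useful downstream.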