Data transformation with dplyr

Objective
Introduction
dplyr basics
The filter() function
The arrange() function
The select() function
The mutate() function
The summarize() function

Objective

The goal of this document is to learn how the “dplyr” package works.

Introduction

dplyr is built around 5 main functions that help us manipulate our data:

Pick observations by their values with filter()
Reorder the rows with arrange()
Pick variables by name with select()
Create new variables as functions of existing ones with mutate()
Collapse many values into a single summary with summarize()

All of these functions can be used together with group_by(), which changes their scope from operating on the whole dataset to operating on it group by group.

In addition, they all follow the same usage pattern:

The first argument is always the data frame we are working with
The subsequent arguments describe the action we want to carry out, using the variable names (without quotes)
The result is a new data frame
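
The three points above can be sketched with a small example. The data frame `df` and its columns are hypothetical toy data, not part of the flights dataset used in the rest of these notes.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
df <- tibble(id = 1:4, value = c(10, 25, 5, 40))

# First argument: the data frame. Remaining arguments: what to do with it,
# written with bare (unquoted) column names.
big <- filter(df, value > 8)

# The input is left untouched and the result is a new data frame
nrow(df)
nrow(big)
is.data.frame(big)
```

Because the input is never modified in place, the result has to be assigned to a name if we want to keep it.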

dplyr basics

The filter() function

filter() lets us subset the rows of our data frame based on their values.

library(tidyverse)
library(nycflights13)

(dec25 <- filter(flights, month == 12, day == 25))

## # A tibble: 719 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    12    25      456            500        -4      649
##  2  2013    12    25      524            515         9      805
##  3  2013    12    25      542            540         2      832
##  4  2013    12    25      546            550        -4     1022
##  5  2013    12    25      556            600        -4      730
##  6  2013    12    25      557            600        -3      743
##  7  2013    12    25      557            600        -3      818
##  8  2013    12    25      559            600        -1      855
##  9  2013    12    25      559            600        -1      849
## 10  2013    12    25      600            600         0      850
## # ... with 709 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

To use filter() effectively we need to know how to select the observations we want, and that is where R's comparison operators come in. The standard set is: >, >=, <, <=, != (not equal) and == (equal).

Logical operators let us combine several conditions into a single filter() expression: & is “and”, | is “or” and ! is “not”.
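
A minimal sketch of these operators inside filter(). The tibble `flights_toy` is hypothetical toy data that only mimics a few columns of `flights`, so the example runs without nycflights13.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
flights_toy <- tibble(
  month     = c(1, 11, 12, 12),
  day       = c(1, 5, 25, 31),
  dep_delay = c(5, 130, -2, 200)
)

# "or": flights in November or December
nov_or_dec <- filter(flights_toy, month == 11 | month == 12)

# "and" plus "not": December flights NOT delayed by more than two hours
dec_ok <- filter(flights_toy, month == 12 & !(dep_delay > 120))

nov_or_dec
dec_ok
```

Note that each side of | has to be a complete comparison: `month == 11 | 12` does not express "November or December".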

A useful shorthand when working with filter() is %in%: x %in% y selects every row where x is one of the values in y.

nov_dec <- filter(flights, month %in% c(11,12))

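De Morgan's laws also help to simplify complicated conditions: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y.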
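
De Morgan's laws state that !(x | y) is equivalent to !x & !y (and !(x & y) to !x | !y). The sketch below checks the first equivalence inside filter(); `toy` is hypothetical data with column names borrowed from `flights`.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  arr_delay = c(10, 150, 90, 200),
  dep_delay = c(130, 20, 60, 180)
)

# Flights that were NOT delayed by more than two hours on arrival or departure
a <- filter(toy, !(arr_delay > 120 | dep_delay > 120))

# The same condition after applying De Morgan's law
# (comma-separated conditions in filter() are combined with "and")
b <- filter(toy, arr_delay <= 120, dep_delay <= 120)

all.equal(a, b)
```

Both versions keep only the third row; which one to write is mostly a question of readability.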

The arrange() function

arrange() works similarly to filter(), except that instead of selecting rows it changes their order.

arrange(flights, year, month, day)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

We can use desc() to sort the rows by a column in descending order.

arrange(flights, desc(arr_delay))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     7    22     2257            759       898      121
##  9  2013    12     5      756           1700       896     1058
## 10  2013     5     3     1133           2055       878     1250
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The select() function

select() lets us zoom in on the variables we need and create a specific subset of columns.

select(flights, year:day)

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

Within select() we can also use a number of helper functions, for example:

starts_with("abc"): names that begin with “abc”
ends_with("xyz"): names that end with “xyz”
contains("ijk"): names that contain “ijk”
matches(): names that match a regular expression
num_range("x", 1:3): matches x1, x2 and x3
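
The helpers listed above can be tried on a small example. The one-row tibble `toy` is hypothetical; its column names are chosen only so that each helper matches something.

```r
library(dplyr)

# Hypothetical one-row data frame, invented for illustration only
toy <- tibble(
  dep_time = 517, arr_time = 830,
  dep_delay = 2, arr_delay = 11,
  x1 = 1, x2 = 2, x3 = 3
)

names(select(toy, starts_with("dep")))    # dep_time, dep_delay
names(select(toy, ends_with("delay")))    # dep_delay, arr_delay
names(select(toy, contains("_t")))        # dep_time, arr_time
names(select(toy, matches("^x[0-9]$")))   # x1, x2, x3
names(select(toy, num_range("x", 1:2)))   # x1, x2
```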

Another interesting option is to combine select() with the everything() helper. This is very useful for moving a handful of variables to the start of the data frame.

select(flights, time_hour, air_time, everything())

## # A tibble: 336,776 x 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

The mutate() function

Besides selecting variables and filtering rows, it is often useful to add new columns that are functions of existing columns. By default, mutate() adds the new columns at the end of the dataset.

flights_sml <- select(flights, 
                      year:day,
                      ends_with("delay"),
                      distance,
                      air_time
                      )
mutate(flights_sml, 
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60)

## # A tibble: 336,776 x 9
##     year month   day dep_delay arr_delay distance air_time  gain speed
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227     9  370.
##  2  2013     1     1         4        20     1416      227    16  374.
##  3  2013     1     1         2        33     1089      160    31  408.
##  4  2013     1     1        -1       -18     1576      183   -17  517.
##  5  2013     1     1        -6       -25      762      116   -19  394.
##  6  2013     1     1        -4        12      719      150    16  288.
##  7  2013     1     1        -5        19     1065      158    24  404.
##  8  2013     1     1        -3       -14      229       53   -11  259.
##  9  2013     1     1        -3        -8      944      140    -5  405.
## 10  2013     1     1        -2         8      733      138    10  319.
## # ... with 336,766 more rows

If we only want to keep the new variables, we can use transmute().

transmute(flights, 
          gain = arr_delay - dep_delay,
          hours = air_time / 60,
          gain_per_hour = gain / hours)

## # A tibble: 336,776 x 3
##     gain hours gain_per_hour
##    <dbl> <dbl>         <dbl>
##  1     9 3.78           2.38
##  2    16 3.78           4.23
##  3    31 2.67          11.6 
##  4   -17 3.05          -5.57
##  5   -19 1.93          -9.83
##  6    16 2.5            6.4 
##  7    24 2.63           9.11
##  8   -11 0.883        -12.5 
##  9    -5 2.33          -2.14
## 10    10 2.3            4.35
## # ... with 336,766 more rows

Many functions can be used with mutate(). The key property is that the function must be vectorized: it has to take a vector of values as input and return a vector with the same number of values as output.

Arithmetic operators (+, -, *, /, ^)

Following the “recycling rules”, if one operand is shorter than the other, it is automatically extended to the length of the longer one. This is most useful when one of the operands is a single value.
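
A minimal sketch of recycling, first in base R and then inside mutate(). The `toy` tibble is hypothetical data. As far as I know, current versions of dplyr only recycle length-one values inside mutate(), which is the single-value case described above.

```r
library(dplyr)

# Base R: the shorter vector c(10, 20) is repeated to length 4
recycled <- c(1, 2, 3, 4) + c(10, 20)
recycled

# Hypothetical toy data, invented for illustration only
toy <- tibble(air_time = c(227, 160, 53))

# The single value 60 is recycled against every row of air_time
out <- mutate(toy, hours = air_time / 60)
out
```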

Modular arithmetic (%/% and %%)

%/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y).
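
One typical use of these two operators is splitting a clock time stored as a number such as 517 (meaning 5:17) into hours and minutes. The sketch below uses a hypothetical `toy` tibble with a `dep_time` column in the same format as in `flights`.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(dep_time = c(517L, 1432L, 2321L))

out <- mutate(toy,
  hour   = dep_time %/% 100,  # integer division: 5, 14, 23
  minute = dep_time %% 100    # remainder: 17, 32, 21
)
out

# The identity x == y * (x %/% y) + (x %% y) holds for every row
all(out$dep_time == 100 * out$hour + out$minute)
```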

Logs: log(), log2(), log10()
Offsets: lead() and lag()
Cumulative aggregates: cumsum(), cummean(), cummax()
Logical comparisons (<, <=, >, >=, !=, ==)
Ranking, for example min_rank()
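
Several of the vectorized functions listed above can be compared side by side on a short vector. The `toy` tibble and the new column names are hypothetical and chosen for illustration only.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(x = c(3, 1, 4, 1, 5))

out <- mutate(toy,
  previous     = lag(x),      # NA 3 1 4 1
  following    = lead(x),     # 1 4 1 5 NA
  running_sum  = cumsum(x),   # 3 4 8 9 14
  running_mean = cummean(x),  # mean of all values seen so far
  rank         = min_rank(x)  # 3 1 4 1 5 (ties share the lowest rank)
)
out
```

Every function returns one value per input row, which is what makes it usable inside mutate().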

group_by() + mutate()

group_by() is most often paired with summarize(), but it also opens up useful possibilities together with mutate() and filter(). For example:

popular_dests <- flights %>%
  group_by(dest) %>%
  filter(n() > 365)


popular_dests %>% 
  filter(arr_delay > 0) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(year:day, dest, arr_delay, prop_delay)

## # A tibble: 131,106 x 6
## # Groups:   dest [77]
##     year month   day dest  arr_delay prop_delay
##    <int> <int> <int> <chr>     <dbl>      <dbl>
##  1  2013     1     1 IAH          11  0.000111 
##  2  2013     1     1 IAH          20  0.000201 
##  3  2013     1     1 MIA          33  0.000235 
##  4  2013     1     1 ORD          12  0.0000424
##  5  2013     1     1 FLL          19  0.0000938
##  6  2013     1     1 ORD           8  0.0000283
##  7  2013     1     1 LAX           7  0.0000344
##  8  2013     1     1 DFW          31  0.000282 
##  9  2013     1     1 ATL          12  0.0000400
## 10  2013     1     1 DTW          16  0.000116 
## # ... with 131,096 more rows

The summarize() function

summarize() collapses a data frame into a single row. It is far more useful when paired with group_by(), because the summary is then computed once per group.

by_day <- group_by(flights,year,month,day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

Here is another example, this time using the pipe, %>%. Regarding missing values, the na.rm argument controls whether they are removed before the computation. Aggregation functions normally follow the rule “if there is a missing value in the input, the output will be a missing value”.

delay <- flights %>% 
  group_by(dest) %>%
  summarize( 
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
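
The missing-value rule mentioned before this pipeline can be checked on a tiny example. The `toy` tibble is hypothetical data with one missing arrival delay.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  dest      = c("A", "A", "B"),
  arr_delay = c(10, NA, 30)
)

# Without na.rm, one NA in a group makes that group's mean NA
with_na <- toy %>%
  group_by(dest) %>%
  summarize(delay = mean(arr_delay))

# With na.rm = TRUE, missing values are dropped before computing the mean
without_na <- toy %>%
  group_by(dest) %>%
  summarize(delay = mean(arr_delay, na.rm = TRUE))

with_na
without_na
```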

Whenever we aggregate, it is a good idea to include a count with n(), so that we can check we are not drawing conclusions from very small groups.

delays <- flights %>%
  group_by(tailnum) %>%
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

In addition, summarize() can be combined with many different summary functions:

Measures of location: mean(), median()
Measures of spread: sd(), IQR(), mad()
Measures of rank: min(), quantile(), max()
Measures of position: first(), last()
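
A short sketch that puts one function from each family above into a single summarize() call. The `toy` tibble, its groups and the output column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  g = c("a", "a", "a", "b", "b"),
  x = c(1, 2, 9, 4, 6)
)

out <- toy %>%
  group_by(g) %>%
  summarize(
    avg     = mean(x),    # location
    mid     = median(x),  # location, robust to outliers
    spread  = sd(x),      # spread
    lowest  = min(x),     # rank
    highest = max(x),     # rank
    first_x = first(x),   # position
    last_x  = last(x),    # position
    n       = n()         # group size
  )
out
```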

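An important point to keep in mind is grouping by multiple variables: each call to summarize() peels off one level of the grouping, which makes it easy to roll up a dataset progressively.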

daily <- group_by(flights, year, month, day)

(per_day <- summarize(daily, flights = n()))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day flights
##    <int> <int> <int>   <int>
##  1  2013     1     1     842
##  2  2013     1     2     943
##  3  2013     1     3     914
##  4  2013     1     4     915
##  5  2013     1     5     720
##  6  2013     1     6     832
##  7  2013     1     7     933
##  8  2013     1     8     899
##  9  2013     1     9     902
## 10  2013     1    10     932
## # ... with 355 more rows

(per_month <- summarize(per_day, flights = sum(flights)))

## # A tibble: 12 x 3
## # Groups:   year [?]
##     year month flights
##    <int> <int>   <int>
##  1  2013     1   27004
##  2  2013     2   24951
##  3  2013     3   28834
##  4  2013     4   28330
##  5  2013     5   28796
##  6  2013     6   28243
##  7  2013     7   29425
##  8  2013     8   29327
##  9  2013     9   27574
## 10  2013    10   28889
## 11  2013    11   27268
## 12  2013    12   28135

Be careful when rolling up summaries progressively: it is fine for sums and counts, but you need to think about weighting for means and variances, and it cannot be done exactly for rank-based statistics such as the median. In other words, the sum of the groupwise sums is the overall sum, but the median of the groupwise medians is not the overall median.
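
This warning can be verified numerically. The sketch below uses a hypothetical `toy` tibble with two groups of different sizes and compares rolled-up statistics with the ones computed on the full data.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  g = c("a", "a", "a", "b", "b"),
  x = c(1, 2, 9, 4, 6)
)

per_group <- toy %>%
  group_by(g) %>%
  summarize(total = sum(x), med = median(x), avg = mean(x), n = n())

# Sums roll up exactly: 12 + 10 equals the overall sum, 22
sum(per_group$total) == sum(toy$x)

# Medians do not: the median of the group medians is 3.5, the overall median is 4
median(per_group$med)
median(toy$x)

# Means need weighting by group size: the unweighted mean of means is 4.5,
# while the weighted version recovers the overall mean, 4.4
mean(per_group$avg)
weighted.mean(per_group$avg, per_group$n)
mean(toy$x)
```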