Today's class is taken from Part II, Chapter 9 of the book Advanced R by Hadley Wickham: https://adv-r.hadley.nz/fp.html.
R is a functional language. This means it has a problem-solving style centered on functions.
Functional techniques have attracted a lot of interest because they can produce efficient and elegant solutions to many modern problems.
Below we look at the three key functional techniques for decomposing problems into smaller parts:
Technique 1: we will see how to replace many for loops with functionals, which are functions (such as lapply()) that take another function as an argument. Functionals let you take a function that solves the problem for a single input and generalize it to handle any number of inputs. Functionals are the most important technique and are used all the time in data analysis.
Technique 2: we will introduce function factories: functions that create functions. Function factories are used less often than functionals, but they can let you elegantly partition work between different parts of your code.
Technique 3: we will show how to create function operators: functionals that take functions as input and produce functions as output. They are like adverbs, because they typically modify how a function works (techniques 2 and 3 are sketched in code right after this list).
A functional is a function that takes a function as input and returns a vector as output.
randomise <- function(f) f(runif(1e3)) # applies the input function f to 1000 random uniform numbers
randomise(mean)
## [1] 0.492595
randomise(mean) # the resulting mean is different each time
## [1] 0.5021419
randomise(sum)
## [1] 491.0315
randomise(sum) # the resulting sum is different each time
## [1] 497.2087
You may already have used for-loop replacements such as lapply(), apply() and tapply() from base R, or map().
A common use of functionals is as an alternative to for loops. For loops have a bad reputation in R because many people believe they are slow, but the real downside of for loops is that they are too flexible: a loop conveys that you are iterating, but not what should be done with the results. Just as it is better to use while than repeat, and better to use for than while, it is better to use a functional than for.
Each functional is designed for a specific task, so when you recognize the functional, you immediately know why it is being used.
Prerequisites
This first technique focuses on the functions provided by the purrr package (Henry and Wickham 2018a).
The most fundamental functional is purrr::map(). It takes a vector and a function, calls the function once for each element of the vector, and returns the results in a list. In other words, map(1:3, f) is equivalent to list(f(1), f(2), f(3)).
library(purrr)
triple <- function(x) x*3
map(1:3, triple)
## [[1]]
## [1] 3
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 9
Or, graphically:
The base equivalent of map() is lapply(). The only difference is that lapply() does not support the helpers you will learn about below.
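For comparison, the lapply() call below (a minimal base-R sketch) produces the same list as map(1:3, triple) above:
lapply(1:3, triple)
## [[1]]
## [1] 3
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 9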
Producing atomic vectors
map() returns a list, which makes it the most general of the map family because you can put anything in a list. But it is inconvenient to return a list when a simpler data structure would do, so there are four more specific variants: map_lgl(), map_int(), map_dbl() and map_chr(). Each returns an atomic vector of the specified type:
# map_chr() always returns a character vector
map_chr(mtcars, typeof)
## mpg cyl disp hp drat wt qsec vs
## "double" "double" "double" "double" "double" "double" "double" "double"
## am gear carb
## "double" "double" "double"
# map_lgl() always returns a logical vector
map_lgl(mtcars, is.double)
## mpg cyl disp hp drat wt qsec vs am gear carb
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# map_int() always returns an integer vector
n_unique <- function(x) length(unique(x))
map_int(mtcars, n_unique)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 25 3 27 22 22 29 30 2 2 3 6
# map_dbl() always returns a double vector
map_dbl(mtcars, mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
purrr uses the convention that the suffixes, like _dbl(), refer to the output. All map_*() functions can take any type of vector as input. These examples rely on two facts: mtcars is a data.frame, and data.frames are lists containing vectors of the same length. This is more obvious if we draw a data.frame with the same orientation as a vector:
All map functions always return an output vector of the same length as the input, which implies that each call to .f must return a single value. If it does not, you will get an error:
pair <- function(x) c(x, x)
#map_dbl(1:2, pair)
This is similar to the error you will get if .f returns the wrong type of result:
#map_dbl(1:2, as.character)
In either case, it is often useful to switch back to map(), because map() can accept any type of output. That lets you see the problematic result and figure out what to do with it.
map(1:2, pair)
## [[1]]
## [1] 1 1
##
## [[2]]
## [1] 2 2
map(1:2, as.character)
## [[1]]
## [1] "1"
##
## [[2]]
## [1] "2"
What is big? (for this class)
When R does not work for you because you have too much data
What becomes harder when data are big?
The data may not load into memory
Analyzing the data can take a long time
Visualizations become cluttered
Etc.
How much data can R load?
memory.limit()
## [1] 16267
Changing the memory limit
You can use memory.limit(size = ...) to change R's allocation limit (memory.size() only reports current usage). But…
If you are running 32-bit R on any operating system, you will have 2 or 3 GB available
If you are running 64-bit R on a 64-bit operating system, the upper limit is effectively infinite, but…
What does a 2 GB (or 3 GB) memory limit mean?
2 GB of memory used by R is not the same as 2 GB on disk
data(esoph)
object.size(esoph)
## 5952 bytes
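As a rough illustration (the vector below is ours, not part of the original notes): a numeric vector stores each value as an 8-byte double, so one million values occupy roughly 8 MB of RAM:
x <- rnorm(10^6)
print(object.size(x), units = "Mb")
## 7.6 Mb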
Timing your R processes
When processing large data sets, the time a function takes to run a task can become a limiting factor. Below are three options for timing R functions:
# OPTION 1:
ptm <- proc.time()
some.output <- rnorm(10^6)
diff1 <- proc.time() - ptm
diff1 # in seconds
## user system elapsed
## 0.09 0.00 0.09
# OPTION 2:
system.time(
some.output <- rnorm(10^6)
) # in seconds
## user system elapsed
## 0.10 0.00 0.09
# OPTION 3:
t1 <- Sys.time()
some.output <- rnorm(10^6)
t2 <- Sys.time()
difftime(t2,t1) # in seconds
## Time difference of 0.09253907 secs
Depending on the computer you are using, the computed times may look different. Options 1 and 2 produce three numbers ("user", "system" and "elapsed"). The last number is the most useful: it gives the total elapsed time (in seconds). Option 3 has a slight advantage in that you can set the unit to be reported ("secs", "mins", "hours", "days", "weeks").
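For example, the units argument of difftime() controls the unit reported (the exact times below are illustrative and will vary):
t1 <- Sys.time()
Sys.sleep(2) # pause for about two seconds
t2 <- Sys.time()
difftime(t2, t1, units = "secs")
## Time difference of 2.002403 secs
difftime(t2, t1, units = "mins")
## Time difference of 0.03337338 mins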
Pipe functions
Pipes are a relatively recent addition to R, introduced in the magrittr package. They refer to a syntax that chains individual functions together with a pipe symbol (e.g. %>%, %<>%). By using pipes, you can write shorter and more readable R code. Pipes are an excellent companion to dplyr.
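A minimal sketch of the same computation written with nested calls and with the pipe:
library(magrittr)
# nested calls, read inside-out
round(mean(sqrt(mtcars$mpg)), 2)
## [1] 4.43
# piped, read left to right
mtcars$mpg %>% sqrt() %>% mean() %>% round(2)
## [1] 4.43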
Reading data into the R workspace
Common file formats for data storage, and the R packages and functions for importing data:
R can also access databases remotely, without importing them into your workspace. This option is more convenient for databases that are too large to fit in R's memory. You can use the dplyr package for remote access to SQL databases; we will go through an example (a minimal sketch follows).
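Here is a hedged sketch of that idea (it assumes the DBI, dbplyr and RSQLite packages are installed; the in-memory SQLite database is only a stand-in for a real remote server):
library(DBI)
library(dplyr)
con <- dbConnect(RSQLite::SQLite(), ":memory:") # stand-in for a remote connection
copy_to(con, mtcars, "mtcars")
remote_cars <- tbl(con, "mtcars") # a lazy reference: the data stay in the database
remote_cars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect() # only now are the results brought into R's memory
dbDisconnect(con)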
library(purrr)
rm(list=ls()) # clear the workspace
options(java.parameters="-Xmx8000m") # avoids 'java.lang.OutOfMemoryError' errors when using libraries that depend on rJava, such as the xlsx package.
if(.Platform$OS.type == "windows") withAutoprint({
memory.size()
memory.size(TRUE)
memory.limit()
}) # reports the current and maximum memory allocation that R is using.
## > memory.size()
## [1] 88.79
## > memory.size(TRUE)
## [1] 91.81
## > memory.limit()
## [1] 16267
#t1 = data.table::fread('bal2018.txt', encoding = 'UTF-8')
#Error in data.table::fread("bal2018.txt", encoding = "UTF-8") :
# File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.
ptm <- proc.time()
t2 = read.csv('bal2018.txt', fileEncoding = "UTF-16", sep = "\t", header = T)
diff1 <- proc.time() - ptm
diff1 # in seconds
## user system elapsed
## 56.17 1.12 58.68
dim(t2)
## [1] 78200 807
Base functions
library(dplyr)
ptm <- proc.time()
# create a data.frame for the subset cyl == 4
cyl_4 <- filter(mtcars, cyl == 4)
# fit a linear regression model
lm_4 <- lm(mpg ~ wt, data = cyl_4)
# get the summary
lm_4_summary <- summary(lm_4)
# extract the R-squared value
lm_4_r_squared <- lm_4_summary["r.squared"]
# check the value
lm_4_r_squared
## $r.squared
## [1] 0.5086326
diff1 <- proc.time() - ptm
diff1 # in seconds
## user system elapsed
## 0.02 0.02 0.03
Using dplyr
Alternatively, the same thing can be done with dplyr pipes. You type much less, but doing this for the 3 data subsets means copying and pasting several times, so if we wanted to fit a linear model of mpg ~ disp in addition to mpg ~ wt, we would have to duplicate the code 3 more times and change it 3 more times.
This may not seem like a big problem, but it eventually will be once you start scaling the code (say, 10x or 100x, etc.).
ptm <- proc.time()
lm_4cyl_rsquared <- mtcars %>%
filter(cyl == 4) %>%
lm(mpg ~ wt, data = .) %>%
summary() %>%
.$"coefficients"
diff1 <- proc.time() - ptm
diff1 # in seconds
## user system elapsed
## 0.02 0.00 0.01
Using purrr
To solve this problem of minimizing repetition, you can load purrr on its own, but it is also loaded as part of the tidyverse.
The basic arguments to map() are .x, a vector or list to iterate over, and .f, the function to apply to each element.
Returning to our example of extracting the R-squared of a linear model, we use the following purrr code.
library(purrr)
ptm <- proc.time()
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map("coefficients")
## $`4`
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.571196 4.346582 9.103980 7.771511e-06
## wt -5.647025 1.850119 -3.052251 1.374278e-02
##
## $`6`
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.408845 4.184369 6.789278 0.001054844
## wt -2.780106 1.334917 -2.082605 0.091757660
##
## $`8`
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.868029 3.0054619 7.941551 4.052705e-06
## wt -2.192438 0.7392393 -2.965803 1.179281e-02
diff1 <- proc.time() - ptm
diff1 # in seconds
## user system elapsed
## 0 0 0
This produces the output of our 3 linear models, one per number of cylinders, in 5 lines of code!
# piped
mtcars %>%
split(.$cyl)
## $`4`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
##
## $`6`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
##
## $`8`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
# base R
split(mtcars, mtcars$cyl)
## $`4`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
##
## $`6`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
##
## $`8`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
We continue with the example:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .))
## $`4`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Coefficients:
## (Intercept) wt
## 39.571 -5.647
##
##
## $`6`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Coefficients:
## (Intercept) wt
## 28.41 -2.78
##
##
## $`8`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Coefficients:
## (Intercept) wt
## 23.868 -2.192
Next, we map our summary function over each element of the list to get cleaner results including R-squared values:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary)
## $`4`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1513 -1.9795 -0.6272 1.9299 5.2523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.571 4.347 9.104 7.77e-06 ***
## wt -5.647 1.850 -3.052 0.0137 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.332 on 9 degrees of freedom
## Multiple R-squared: 0.5086, Adjusted R-squared: 0.454
## F-statistic: 9.316 on 1 and 9 DF, p-value: 0.01374
##
##
## $`6`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Residuals:
## Mazda RX4 Mazda RX4 Wag Hornet 4 Drive Valiant Merc 280
## -0.1250 0.5840 1.9292 -0.6897 0.3547
## Merc 280C Ferrari Dino
## -1.0453 -1.0080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.409 4.184 6.789 0.00105 **
## wt -2.780 1.335 -2.083 0.09176 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.165 on 5 degrees of freedom
## Multiple R-squared: 0.4645, Adjusted R-squared: 0.3574
## F-statistic: 4.337 on 1 and 5 DF, p-value: 0.09176
##
##
## $`8`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1491 -1.4664 -0.8458 1.5711 3.7619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.8680 3.0055 7.942 4.05e-06 ***
## wt -2.1924 0.7392 -2.966 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.024 on 12 degrees of freedom
## Multiple R-squared: 0.423, Adjusted R-squared: 0.3749
## F-statistic: 8.796 on 1 and 12 DF, p-value: 0.01179
Since our output is double (numeric), we can use a typed variant; here map_df() collects the R-squared values into a one-row tibble.
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map_df("r.squared")
## # A tibble: 1 x 3
## `4` `6` `8`
## <dbl> <dbl> <dbl>
## 1 0.509 0.465 0.423
If we simply use map(), the result is a list.
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map("r.squared")
## $`4`
## [1] 0.5086326
##
## $`6`
## [1] 0.4645102
##
## $`8`
## [1] 0.4229655
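Since each R-squared value is a single double, map_dbl() returns them as a named numeric vector instead of a list (a minimal variant of the pipeline above):
mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
##         4         6         8
## 0.5086326 0.4645102 0.4229655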
pmap and imap
Libraries
library(purrr) # Functional programming
library(dplyr) # Data wrangling
library(tidyr) # Tidy-ing data
library(stringr) # String operations
library(repurrrsive) # Game of Thrones data
Getting the data
dat <- got_chars
glimpse(dat[[1]])
## List of 18
## $ url : chr "https://www.anapioficeandfire.com/api/characters/1022"
## $ id : int 1022
## $ name : chr "Theon Greyjoy"
## $ gender : chr "Male"
## $ culture : chr "Ironborn"
## $ born : chr "In 278 AC or 279 AC, at Pyke"
## $ died : chr ""
## $ alive : logi TRUE
## $ titles : chr [1:3] "Prince of Winterfell" "Captain of Sea Bitch" "Lord of the Iron Islands (by law of the green lands)"
## $ aliases : chr [1:4] "Prince of Fools" "Theon Turncloak" "Reek" "Theon Kinslayer"
## $ father : chr ""
## $ mother : chr ""
## $ spouse : chr ""
## $ allegiances: chr "House Greyjoy of Pyke"
## $ books : chr [1:3] "A Game of Thrones" "A Storm of Swords" "A Feast for Crows"
## $ povBooks : chr [1:2] "A Clash of Kings" "A Dance with Dragons"
## $ tvSeries : chr [1:6] "Season 1" "Season 2" "Season 3" "Season 4" ...
## $ playedBy : chr "Alfie Allen"
The map function: extracting a single element from a list
# Method 1: using the name of the list element, similar to dat[[1]]["name"], dat[[2]]["name"], etc
map(dat, "name")
## [[1]]
## [1] "Theon Greyjoy"
##
## [[2]]
## [1] "Tyrion Lannister"
##
## [[3]]
## [1] "Victarion Greyjoy"
##
## [[4]]
## [1] "Will"
##
## [[5]]
## [1] "Areo Hotah"
##
## [[6]]
## [1] "Chett"
##
## [[7]]
## [1] "Cressen"
##
## [[8]]
## [1] "Arianne Martell"
##
## [[9]]
## [1] "Daenerys Targaryen"
##
## [[10]]
## [1] "Davos Seaworth"
##
## [[11]]
## [1] "Arya Stark"
##
## [[12]]
## [1] "Arys Oakheart"
##
## [[13]]
## [1] "Asha Greyjoy"
##
## [[14]]
## [1] "Barristan Selmy"
##
## [[15]]
## [1] "Varamyr"
##
## [[16]]
## [1] "Brandon Stark"
##
## [[17]]
## [1] "Brienne of Tarth"
##
## [[18]]
## [1] "Catelyn Stark"
##
## [[19]]
## [1] "Cersei Lannister"
##
## [[20]]
## [1] "Eddard Stark"
##
## [[21]]
## [1] "Jaime Lannister"
##
## [[22]]
## [1] "Jon Connington"
##
## [[23]]
## [1] "Jon Snow"
##
## [[24]]
## [1] "Aeron Greyjoy"
##
## [[25]]
## [1] "Kevan Lannister"
##
## [[26]]
## [1] "Melisandre"
##
## [[27]]
## [1] "Merrett Frey"
##
## [[28]]
## [1] "Quentyn Martell"
##
## [[29]]
## [1] "Samwell Tarly"
##
## [[30]]
## [1] "Sansa Stark"
# Method 2: using the `pluck` function
map(dat, pluck("name"))
## [[1]]
## [1] "Theon Greyjoy"
##
## [[2]]
## [1] "Tyrion Lannister"
##
## [[3]]
## [1] "Victarion Greyjoy"
##
## [[4]]
## [1] "Will"
##
## [[5]]
## [1] "Areo Hotah"
##
## [[6]]
## [1] "Chett"
##
## [[7]]
## [1] "Cressen"
##
## [[8]]
## [1] "Arianne Martell"
##
## [[9]]
## [1] "Daenerys Targaryen"
##
## [[10]]
## [1] "Davos Seaworth"
##
## [[11]]
## [1] "Arya Stark"
##
## [[12]]
## [1] "Arys Oakheart"
##
## [[13]]
## [1] "Asha Greyjoy"
##
## [[14]]
## [1] "Barristan Selmy"
##
## [[15]]
## [1] "Varamyr"
##
## [[16]]
## [1] "Brandon Stark"
##
## [[17]]
## [1] "Brienne of Tarth"
##
## [[18]]
## [1] "Catelyn Stark"
##
## [[19]]
## [1] "Cersei Lannister"
##
## [[20]]
## [1] "Eddard Stark"
##
## [[21]]
## [1] "Jaime Lannister"
##
## [[22]]
## [1] "Jon Connington"
##
## [[23]]
## [1] "Jon Snow"
##
## [[24]]
## [1] "Aeron Greyjoy"
##
## [[25]]
## [1] "Kevan Lannister"
##
## [[26]]
## [1] "Melisandre"
##
## [[27]]
## [1] "Merrett Frey"
##
## [[28]]
## [1] "Quentyn Martell"
##
## [[29]]
## [1] "Samwell Tarly"
##
## [[30]]
## [1] "Sansa Stark"
# Method 3: using the index of the list element
map(dat, 3)
## [[1]]
## [1] "Theon Greyjoy"
##
## [[2]]
## [1] "Tyrion Lannister"
##
## [[3]]
## [1] "Victarion Greyjoy"
##
## [[4]]
## [1] "Will"
##
## [[5]]
## [1] "Areo Hotah"
##
## [[6]]
## [1] "Chett"
##
## [[7]]
## [1] "Cressen"
##
## [[8]]
## [1] "Arianne Martell"
##
## [[9]]
## [1] "Daenerys Targaryen"
##
## [[10]]
## [1] "Davos Seaworth"
##
## [[11]]
## [1] "Arya Stark"
##
## [[12]]
## [1] "Arys Oakheart"
##
## [[13]]
## [1] "Asha Greyjoy"
##
## [[14]]
## [1] "Barristan Selmy"
##
## [[15]]
## [1] "Varamyr"
##
## [[16]]
## [1] "Brandon Stark"
##
## [[17]]
## [1] "Brienne of Tarth"
##
## [[18]]
## [1] "Catelyn Stark"
##
## [[19]]
## [1] "Cersei Lannister"
##
## [[20]]
## [1] "Eddard Stark"
##
## [[21]]
## [1] "Jaime Lannister"
##
## [[22]]
## [1] "Jon Connington"
##
## [[23]]
## [1] "Jon Snow"
##
## [[24]]
## [1] "Aeron Greyjoy"
##
## [[25]]
## [1] "Kevan Lannister"
##
## [[26]]
## [1] "Melisandre"
##
## [[27]]
## [1] "Merrett Frey"
##
## [[28]]
## [1] "Quentyn Martell"
##
## [[29]]
## [1] "Samwell Tarly"
##
## [[30]]
## [1] "Sansa Stark"
Creating a data frame from a list
# The `[` is the function here -- essentially telling it to apply [] to each list element,
# and name, gender and culture are the arguments passed to []
map_dfr(dat,`[`, c("name", "gender", "culture"))
## # A tibble: 30 x 3
## name gender culture
## <chr> <chr> <chr>
## 1 Theon Greyjoy Male "Ironborn"
## 2 Tyrion Lannister Male ""
## 3 Victarion Greyjoy Male "Ironborn"
## 4 Will Male ""
## 5 Areo Hotah Male "Norvoshi"
## 6 Chett Male ""
## 7 Cressen Male ""
## 8 Arianne Martell Female "Dornish"
## 9 Daenerys Targaryen Female "Valyrian"
## 10 Davos Seaworth Male "Westeros"
## # ... with 20 more rows
Applying your own function with map
dead_or_alive <- function(x){
ifelse(x[["alive"]], paste(x[["name"]], "is alive!"),
paste(x[["name"]], "is dead :("))
}
map_chr(dat, dead_or_alive)
## [1] "Theon Greyjoy is alive!" "Tyrion Lannister is alive!"
## [3] "Victarion Greyjoy is alive!" "Will is dead :("
## [5] "Areo Hotah is alive!" "Chett is dead :("
## [7] "Cressen is dead :(" "Arianne Martell is alive!"
## [9] "Daenerys Targaryen is alive!" "Davos Seaworth is alive!"
## [11] "Arya Stark is alive!" "Arys Oakheart is dead :("
## [13] "Asha Greyjoy is alive!" "Barristan Selmy is alive!"
## [15] "Varamyr is dead :(" "Brandon Stark is alive!"
## [17] "Brienne of Tarth is alive!" "Catelyn Stark is dead :("
## [19] "Cersei Lannister is alive!" "Eddard Stark is dead :("
## [21] "Jaime Lannister is alive!" "Jon Connington is alive!"
## [23] "Jon Snow is alive!" "Aeron Greyjoy is alive!"
## [25] "Kevan Lannister is dead :(" "Melisandre is alive!"
## [27] "Merrett Frey is dead :(" "Quentyn Martell is dead :("
## [29] "Samwell Tarly is alive!" "Sansa Stark is alive!"
Using pmap row-wise
dat_m <- dat %>% {
tibble(
name = map_chr(., "name"),
gender = map_chr(., "gender"),
culture = map_chr(., "culture"),
aliases = map(., "aliases"),
allegiances = map(., "allegiances")
)}
pmap(dat_m, paste)
## [[1]]
## [1] "Theon Greyjoy Male Ironborn Prince of Fools House Greyjoy of Pyke"
## [2] "Theon Greyjoy Male Ironborn Theon Turncloak House Greyjoy of Pyke"
## [3] "Theon Greyjoy Male Ironborn Reek House Greyjoy of Pyke"
## [4] "Theon Greyjoy Male Ironborn Theon Kinslayer House Greyjoy of Pyke"
##
## [[2]]
## [1] "Tyrion Lannister Male The Imp House Lannister of Casterly Rock"
## [2] "Tyrion Lannister Male Halfman House Lannister of Casterly Rock"
## [3] "Tyrion Lannister Male The boyman House Lannister of Casterly Rock"
## [4] "Tyrion Lannister Male Giant of Lannister House Lannister of Casterly Rock"
## [5] "Tyrion Lannister Male Lord Tywin's Doom House Lannister of Casterly Rock"
## [6] "Tyrion Lannister Male Lord Tywin's Bane House Lannister of Casterly Rock"
## [7] "Tyrion Lannister Male Yollo House Lannister of Casterly Rock"
## [8] "Tyrion Lannister Male Hugor Hill House Lannister of Casterly Rock"
## [9] "Tyrion Lannister Male No-Nose House Lannister of Casterly Rock"
## [10] "Tyrion Lannister Male Freak House Lannister of Casterly Rock"
## [11] "Tyrion Lannister Male Dwarf House Lannister of Casterly Rock"
##
## [[3]]
## [1] "Victarion Greyjoy Male Ironborn The Iron Captain House Greyjoy of Pyke"
##
## [[4]]
## [1] "Will Male "
##
## [[5]]
## [1] "Areo Hotah Male Norvoshi House Nymeros Martell of Sunspear"
##
## [[6]]
## [1] "Chett Male "
##
## [[7]]
## [1] "Cressen Male "
##
## [[8]]
## [1] "Arianne Martell Female Dornish House Nymeros Martell of Sunspear"
##
## [[9]]
## [1] "Daenerys Targaryen Female Valyrian Dany House Targaryen of King's Landing"
## [2] "Daenerys Targaryen Female Valyrian Daenerys Stormborn House Targaryen of King's Landing"
## [3] "Daenerys Targaryen Female Valyrian The Unburnt House Targaryen of King's Landing"
## [4] "Daenerys Targaryen Female Valyrian Mother of Dragons House Targaryen of King's Landing"
## [5] "Daenerys Targaryen Female Valyrian Mother House Targaryen of King's Landing"
## [6] "Daenerys Targaryen Female Valyrian Mhysa House Targaryen of King's Landing"
## [7] "Daenerys Targaryen Female Valyrian The Silver Queen House Targaryen of King's Landing"
## [8] "Daenerys Targaryen Female Valyrian Silver Lady House Targaryen of King's Landing"
## [9] "Daenerys Targaryen Female Valyrian Dragonmother House Targaryen of King's Landing"
## [10] "Daenerys Targaryen Female Valyrian The Dragon Queen House Targaryen of King's Landing"
## [11] "Daenerys Targaryen Female Valyrian The Mad King's daughter House Targaryen of King's Landing"
##
## [[10]]
## [1] "Davos Seaworth Male Westeros Onion Knight House Baratheon of Dragonstone"
## [2] "Davos Seaworth Male Westeros Davos Shorthand House Seaworth of Cape Wrath"
## [3] "Davos Seaworth Male Westeros Ser Onions House Baratheon of Dragonstone"
## [4] "Davos Seaworth Male Westeros Onion Lord House Seaworth of Cape Wrath"
## [5] "Davos Seaworth Male Westeros Smuggler House Baratheon of Dragonstone"
##
## [[11]]
## [1] "Arya Stark Female Northmen Arya Horseface House Stark of Winterfell"
## [2] "Arya Stark Female Northmen Arya Underfoot House Stark of Winterfell"
## [3] "Arya Stark Female Northmen Arry House Stark of Winterfell"
## [4] "Arya Stark Female Northmen Lumpyface House Stark of Winterfell"
## [5] "Arya Stark Female Northmen Lumpyhead House Stark of Winterfell"
## [6] "Arya Stark Female Northmen Stickboy House Stark of Winterfell"
## [7] "Arya Stark Female Northmen Weasel House Stark of Winterfell"
## [8] "Arya Stark Female Northmen Nymeria House Stark of Winterfell"
## [9] "Arya Stark Female Northmen Squan House Stark of Winterfell"
## [10] "Arya Stark Female Northmen Saltb House Stark of Winterfell"
## [11] "Arya Stark Female Northmen Cat of the Canaly House Stark of Winterfell"
## [12] "Arya Stark Female Northmen Bets House Stark of Winterfell"
## [13] "Arya Stark Female Northmen The Blind Girh House Stark of Winterfell"
## [14] "Arya Stark Female Northmen The Ugly Little Girl House Stark of Winterfell"
## [15] "Arya Stark Female Northmen Mercedenl House Stark of Winterfell"
## [16] "Arya Stark Female Northmen Mercye House Stark of Winterfell"
##
## [[12]]
## [1] "Arys Oakheart Male Reach House Oakheart of Old Oak"
##
## [[13]]
## [1] "Asha Greyjoy Female Ironborn Esgred House Greyjoy of Pyke"
## [2] "Asha Greyjoy Female Ironborn The Kraken's Daughter House Ironmaker"
##
## [[14]]
## [1] "Barristan Selmy Male Westeros Barristan the Bold House Selmy of Harvest Hall"
## [2] "Barristan Selmy Male Westeros Arstan Whitebeard House Targaryen of King's Landing"
## [3] "Barristan Selmy Male Westeros Ser Grandfather House Selmy of Harvest Hall"
## [4] "Barristan Selmy Male Westeros Barristan the Old House Targaryen of King's Landing"
## [5] "Barristan Selmy Male Westeros Old Ser House Selmy of Harvest Hall"
##
## [[15]]
## [1] "Varamyr Male Free Folk Varamyr Sixskins "
## [2] "Varamyr Male Free Folk Haggon "
## [3] "Varamyr Male Free Folk Lump "
##
## [[16]]
## [1] "Brandon Stark Male Northmen Bran House Stark of Winterfell"
## [2] "Brandon Stark Male Northmen Bran the Broken House Stark of Winterfell"
## [3] "Brandon Stark Male Northmen The Winged Wolf House Stark of Winterfell"
##
## [[17]]
## [1] "Brienne of Tarth Female The Maid of Tarth House Baratheon of Storm's End"
## [2] "Brienne of Tarth Female Brienne the Beauty House Stark of Winterfell"
## [3] "Brienne of Tarth Female Brienne the Blue House Tarth of Evenfall Hall"
##
## [[18]]
## [1] "Catelyn Stark Female Rivermen Catelyn Tully House Stark of Winterfell"
## [2] "Catelyn Stark Female Rivermen Lady Stoneheart House Tully of Riverrun"
## [3] "Catelyn Stark Female Rivermen The Silent Sistet House Stark of Winterfell"
## [4] "Catelyn Stark Female Rivermen Mother Mercilesr House Tully of Riverrun"
## [5] "Catelyn Stark Female Rivermen The Hangwomans House Stark of Winterfell"
##
## [[19]]
## [1] "Cersei Lannister Female Westerman House Lannister of Casterly Rock"
##
## [[20]]
## [1] "Eddard Stark Male Northmen Ned House Stark of Winterfell"
## [2] "Eddard Stark Male Northmen The Ned House Stark of Winterfell"
## [3] "Eddard Stark Male Northmen The Quiet Wolf House Stark of Winterfell"
##
## [[21]]
## [1] "Jaime Lannister Male Westerlands The Kingslayer House Lannister of Casterly Rock"
## [2] "Jaime Lannister Male Westerlands The Lion of Lannister House Lannister of Casterly Rock"
## [3] "Jaime Lannister Male Westerlands The Young Lion House Lannister of Casterly Rock"
## [4] "Jaime Lannister Male Westerlands Cripple House Lannister of Casterly Rock"
##
## [[22]]
## [1] "Jon Connington Male Stormlands Griffthe Mad King's Hand House Connington of Griffin's Roost"
## [2] "Jon Connington Male Stormlands Griffthe Mad King's Hand House Targaryen of King's Landing"
##
## [[23]]
## [1] "Jon Snow Male Northmen Lord Snow House Stark of Winterfell"
## [2] "Jon Snow Male Northmen Ned Stark's Bastard House Stark of Winterfell"
## [3] "Jon Snow Male Northmen The Snow of Winterfell House Stark of Winterfell"
## [4] "Jon Snow Male Northmen The Crow-Come-Over House Stark of Winterfell"
## [5] "Jon Snow Male Northmen The 998th Lord Commander of the Night's Watch House Stark of Winterfell"
## [6] "Jon Snow Male Northmen The Bastard of Winterfell House Stark of Winterfell"
## [7] "Jon Snow Male Northmen The Black Bastard of the Wall House Stark of Winterfell"
## [8] "Jon Snow Male Northmen Lord Crow House Stark of Winterfell"
##
## [[24]]
## [1] "Aeron Greyjoy Male Ironborn The Damphair House Greyjoy of Pyke"
## [2] "Aeron Greyjoy Male Ironborn Aeron Damphair House Greyjoy of Pyke"
##
## [[25]]
## [1] "Kevan Lannister Male House Lannister of Casterly Rock"
##
## [[26]]
## [1] "Melisandre Female Asshai The Red Priestess "
## [2] "Melisandre Female Asshai The Red Woman "
## [3] "Melisandre Female Asshai The King's Red Shadow "
## [4] "Melisandre Female Asshai Lady Red "
## [5] "Melisandre Female Asshai Lot Seven "
##
## [[27]]
## [1] "Merrett Frey Male Rivermen Merrett Muttonhead House Frey of the Crossing"
##
## [[28]]
## [1] "Quentyn Martell Male Dornish Frog House Nymeros Martell of Sunspear"
## [2] "Quentyn Martell Male Dornish Prince Frog House Nymeros Martell of Sunspear"
## [3] "Quentyn Martell Male Dornish The prince who came too late House Nymeros Martell of Sunspear"
## [4] "Quentyn Martell Male Dornish The Dragonrider House Nymeros Martell of Sunspear"
##
## [[29]]
## [1] "Samwell Tarly Male Andal Sam House Tarly of Horn Hill"
## [2] "Samwell Tarly Male Andal Ser Piggy House Tarly of Horn Hill"
## [3] "Samwell Tarly Male Andal Prince Pork-chop House Tarly of Horn Hill"
## [4] "Samwell Tarly Male Andal Lady Piggy House Tarly of Horn Hill"
## [5] "Samwell Tarly Male Andal Sam the Slayer House Tarly of Horn Hill"
## [6] "Samwell Tarly Male Andal Black Sam House Tarly of Horn Hill"
## [7] "Samwell Tarly Male Andal Lord of Ham House Tarly of Horn Hill"
##
## [[30]]
## [1] "Sansa Stark Female Northmen Little bird House Baelish of Harrenhal"
## [2] "Sansa Stark Female Northmen Alayne Stone House Stark of Winterfell"
## [3] "Sansa Stark Female Northmen Jonquil House Baelish of Harrenhal"
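The section title also mentions imap(), which iterates over a vector together with its indices (or names). A minimal sketch, reusing the character names extracted earlier:
got_names <- map_chr(dat, "name")
# .x is each element, .y is its index (or its name, if the vector is named)
head(imap_chr(got_names, ~ paste0(.y, ": ", .x)), 3)
## [1] "1: Theon Greyjoy"    "2: Tyrion Lannister" "3: Victarion Greyjoy"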
Example of a user-defined function
library(reshape2) # dcast() comes from the reshape2 package
TablaDinamica <- function(base, ColumnaPrincipal, ColumnasPivotables){
tabla <<- dcast(base, formula=as.formula(paste(ColumnaPrincipal, paste(ColumnasPivotables, collapse = "+"), sep = '~'))) # with <<- the table is also kept in R's Global Environment
return(tabla)
}
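A hypothetical usage sketch (the toy data frame below is ours; dcast() guesses the value column, here sales, and reports which one it used):
df <- data.frame(region = c("N", "N", "S", "S"),
                 year   = c(2020, 2021, 2020, 2021),
                 sales  = c(10, 20, 30, 40))
TablaDinamica(df, "region", "year")
##   region 2020 2021
## 1      N   10   20
## 2      S   30   40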
More information on the purrr package: https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_purrr.pdf
Apache Spark is an open-source, general-purpose computing engine used to process and analyze large amounts of data in a distributed way. It works with the operating system to distribute data across cores and process the data in parallel.
Spark uses a master/slave architecture, that is, a central coordinator and many distributed workers. Its programming language is Scala.
Terms:
Spark offers multiple advantages. It can run an application on a Hadoop cluster much faster, both in memory and on disk. It also reduces the number of read and write operations to disk. It supports several programming languages, with built-in APIs in Java, Python and Scala so the programmer can write applications in different languages. In addition, it provides support for streaming data, graphs, and machine-learning algorithms for advanced data analysis.
Terms + Hive metadata: a Hive metastore (also known as metastore_db) is a relational database for managing the metadata of persistent relational entities, e.g. databases, tables, columns and partitions. It is what Spark uses to manage the metadata of persistent relational entities (for example databases, tables, columns, partitions) in a relational database (for fast access, Spark SQL).
# libraries
library(tidyverse) # important to load this beforehand
#library(dplyr)
library(sparklyr)
library(DBI)
# install Spark locally (on our computer)
#spark_install("2.0.1") # run this line only once per project
# connect to the local version
sc <- spark_connect(master = "local", version = "2.0.1")
# copy a data set into Spark's distributed memory
import_iris <- copy_to(sc, iris, "spark_iris",
overwrite = TRUE)
# partition the data set
partition_iris <- sdf_random_split(
import_iris,training=0.5, testing=0.5)
# create a Hive metadata entry for each partition
sdf_register(partition_iris,
c("spark_iris_training","spark_iris_test"))
## $spark_iris_training
## # Source: spark<spark_iris_training> [?? x 5]
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 4.4 2.9 1.4 0.2 setosa
## 2 4.4 3 1.3 0.2 setosa
## 3 4.6 3.1 1.5 0.2 setosa
## 4 4.6 3.4 1.4 0.3 setosa
## 5 4.7 3.2 1.3 0.2 setosa
## 6 4.9 3.1 1.5 0.2 setosa
## 7 4.9 3.6 1.4 0.1 setosa
## 8 5 2.3 3.3 1 versicolor
## 9 5 3.2 1.2 0.2 setosa
## 10 5 3.3 1.4 0.2 setosa
## # ... with more rows
##
## $spark_iris_test
## # Source: spark<spark_iris_test> [?? x 5]
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 4.3 3 1.1 0.1 setosa
## 2 4.4 3.2 1.3 0.2 setosa
## 3 4.5 2.3 1.3 0.3 setosa
## 4 4.6 3.2 1.4 0.2 setosa
## 5 4.6 3.6 1 0.2 setosa
## 6 4.7 3.2 1.6 0.2 setosa
## 7 4.8 3 1.4 0.1 setosa
## 8 4.8 3 1.4 0.3 setosa
## 9 4.8 3.1 1.6 0.2 setosa
## 10 4.8 3.4 1.6 0.2 setosa
## # ... with more rows
# create a Spark data frame from the source sc
tidy_iris <- tbl(sc,"spark_iris_training") %>%
select(Species, Petal_Length, Petal_Width)
# Spark ML Decision Tree Model
model_iris <- tidy_iris %>%
ml_decision_tree(response="Species",
features=c("Petal_Length","Petal_Width"))
# create a reference to the table in Spark
test_iris <- tbl(sc,"spark_iris_test")
# bring the data into R's memory with collect
pred_iris <- ml_predict(model_iris, test_iris) %>% collect
# we continue in RStudio
library(ggplot2)
pred_iris %>%
inner_join(data.frame(prediction=0:2,
lab=model_iris$index_labels)) %>%
ggplot(aes(Petal_Length, Petal_Width, col=lab)) +
geom_point()
# disconnect Spark
spark_disconnect(sc)
More information on the sparklyr package: [cheat sheet] https://science.nu/wp-content/uploads/2018/07/r-sparklyr.pdf, [full documentation] https://cran.r-project.org/web/packages/sparklyr/readme/README.html
4. Using sparklyr
Most Spark commands are run from the R console; however, tracking and analyzing execution is done through Spark's web interface, shown below.
You can see in the Spark web interface that a job is started to collect the information from Spark. You can also select the Storage tab to see the data set cached in Spark's memory:
In the image above, note that the data are fully loaded into memory, as indicated by the Fraction Cached column, which shows 100%; you can therefore see exactly how much memory this data set is using via the Size in Memory column.
The Executors tab provides a view of your cluster's resources. For local connections you will find only one active executor, with only 2 GB of memory allocated to Spark and 384 MB available for computation. We will also learn how to request more instances and computing resources, and how memory is allocated.
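As a hedged sketch of requesting more resources before connecting (these sparklyr.shell.* settings are standard sparklyr configuration options; the 4G values are illustrative):
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "4G"   # memory for the driver
config$`sparklyr.shell.executor-memory` <- "4G" # memory per executor
sc <- spark_connect(master = "local", version = "2.0.1", config = config)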
When you use Spark from R to analyze data, you can use SQL (Structured Query Language) or dplyr (a grammar of data manipulation). You can use SQL through the DBI package, for example:
library(dplyr)
library(sparklyr)
library(DBI)
sc <- spark_connect(master = "local", version = "2.0.1")
spark_web(sc) # report of the operations performed in Spark
cars <- copy_to(sc, mtcars, overwrite = T) # when we use a data set from the environment
dbGetQuery(sc, "SELECT count(*) FROM mtcars") # the Spark connection sc always comes first as an argument to the Spark functions
## count(1)
## 1 32
summarize_all(cars, mean) %>%
show_query() # to translate the dplyr code into SQL
## <SQL>
## SELECT AVG(`mpg`) AS `mpg`, AVG(`cyl`) AS `cyl`, AVG(`disp`) AS `disp`, AVG(`hp`) AS `hp`, AVG(`drat`) AS `drat`, AVG(`wt`) AS `wt`, AVG(`qsec`) AS `qsec`, AVG(`vs`) AS `vs`, AVG(`am`) AS `am`, AVG(`gear`) AS `gear`, AVG(`carb`) AS `carb`
## FROM `mtcars`
summarise(cars, mpg_percentile = percentile(mpg, 0.25)) %>%
show_query() # SQL functions can also be called
## <SQL>
## SELECT percentile(`mpg`, 0.25) AS `mpg_percentile`
## FROM `mtcars`
In general, we usually start by analyzing data in Spark with dplyr, followed by sampling rows and selecting a subset of the available columns. The last step is to collect data from Spark to do further data processing in R, such as data visualization.
library(tidyverse)
summarize_all(cars, mean) %>% collect
## # A tibble: 1 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20.1 6.19 231. 147. 3.60 3.22 17.8 0.438 0.406 3.69 2.81
class(cars)
## [1] "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"
cars %>%
mutate(transmission = ifelse(am == 0, "automatic", "manual")) %>%
group_by(transmission) %>%
summarise_all(mean) %>% collect # earlier in dplyr we used summarise; here in sparklyr we use summarise_all
## # A tibble: 2 x 12
## transmission mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 manual 24.4 5.08 144. 127. 4.05 2.41 17.4 0.538 1 4.38 2.92
## 2 automatic 17.1 6.95 290. 160. 3.29 3.77 18.2 0.368 0 3.21 2.74
select(cars, hp, mpg) %>%
sample_n(100) %>%
collect() %>%
plot()
library(ggplot2)
ggplot(aes(as.factor(cyl), mpg), data = mtcars) + geom_col()
model <- ml_linear_regression(cars, mpg ~ hp)
model
## Formula: mpg ~ hp
##
## Coefficients:
## (Intercept) hp
## 30.09886054 -0.06822828
model %>%
ml_predict(copy_to(sc, data.frame(hp = 250 + 10 * 1:10))) %>%
transmute(hp = hp, mpg = prediction) %>%
full_join(select(cars, hp, mpg)) %>%
collect() %>%
plot() # plotting from Spark
modelo <- ml_logistic_regression(cars, am ~ .)
modelo
## Formula: am ~ .
##
## Coefficients:
## (Intercept) mpg cyl disp hp drat
## -0.68057477 1.73068529 -6.50306685 -0.11106774 0.01566047 33.02750111
## wt qsec vs gear carb
## -20.68143251 -9.52647833 -6.81113196 29.16524289 3.33862282
# bring the data to R with Spark and plot
car_group <- cars %>%
group_by(cyl) %>%
summarise(mpg = sum(mpg, na.rm = TRUE)) %>%
collect() %>%
print()
## # A tibble: 3 x 2
## cyl mpg
## <dbl> <dbl>
## 1 6 138.
## 2 4 293.
## 3 8 211.
ggplot(aes(as.factor(cyl), mpg), data = car_group) +
geom_col(fill = "#999999") + coord_flip()
Using dbplot
The dbplot package provides helper functions for plotting with remote data. dbplot uses R code to transform the data, and that code is run inside Spark. It then uses the results to create a plot with the ggplot2 package, where both the data transformation and the plot are triggered by a single function.
library(dbplot)
cars %>%
dbplot_histogram(mpg, binwidth = 3) + # plot for analyzing a single continuous variable
labs(title = "MPG Distribution",
subtitle = "Histogram over miles per gallon")
ggplot(aes(mpg, wt), data = mtcars) + # plot for analyzing two continuous variables
geom_point()
dbplot_raster(cars, mpg, wt, resolution = 16) # a raster plot returns a grid of x/y positions and the results of a given aggregation, usually represented by the color of each square.
db_compute_raster(cars, mpg, wt)
## # A tibble: 32 x 3
## mpg wt `n()`
## <dbl> <dbl> <dbl>
## 1 21.0 2.61 1
## 2 21.0 2.84 1
## 3 22.6 2.30 1
## 4 21.2 3.19 1
## 5 18.6 3.43 1
## 6 17.9 3.43 1
## 7 14.2 3.55 1
## 8 24.3 3.16 1
## 9 22.6 3.12 1
## 10 19.1 3.43 1
## # ... with 22 more rows
You can also use dbplot to retrieve the raw data and visualize it by other means; to retrieve the aggregates, use db_compute_bins(), db_compute_count(), db_compute_raster() and db_compute_boxplot(). A minimal sketch follows.
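For example, with db_compute_bins() (assuming, as in the dbplot documentation, that it returns the binned variable plus a count column):
cars %>%
  db_compute_bins(mpg, bins = 10) %>% # the aggregation runs in Spark
  ggplot() +
  geom_col(aes(mpg, count))           # the plot is drawn in R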
Data
Data are read from existing data sources in a variety of formats, such as plain text, CSV, JSON, Java Database Connectivity (JDBC) and many more. For example, we can export our example data set as a CSV file and read it back:
#spark_write_csv(cars, "cars.csv")
#list.files(pattern = 'cars') # to look for a file in the project's working directory
#cars2 <- spark_read_csv(sc, "cars.csv")
#class(cars2)
# delete a table that is in Spark's memory
db_drop_table(sc, "mtcars")
## [1] 0
Logs
The log is a tool that records information relevant to the execution of tasks on the cluster. For local clusters, we can retrieve all recent logs by running the following:
spark_log(sc)
## 20/08/27 09:57:26 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 93 (MapPartitionsRDD[261] at collect at utils.scala:114)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Adding task set 93.0 with 1 tasks
## 20/08/27 09:57:26 INFO TaskSetManager: Starting task 0.0 in stage 93.0 (TID 111, localhost, partition 0, ANY, 5379 bytes)
## 20/08/27 09:57:26 INFO Executor: Running task 0.0 in stage 93.0 (TID 111)
## 20/08/27 09:57:26 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
## 20/08/27 09:57:26 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
## 20/08/27 09:57:26 INFO Executor: Finished task 0.0 in stage 93.0 (TID 111). 3031 bytes result sent to driver
## 20/08/27 09:57:26 INFO TaskSetManager: Finished task 0.0 in stage 93.0 (TID 111) in 46 ms on localhost (1/1)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Removed TaskSet 93.0, whose tasks have all completed, from pool
## 20/08/27 09:57:26 INFO DAGScheduler: ResultStage 93 (collect at utils.scala:114) finished in 0.046 s
## 20/08/27 09:57:26 INFO DAGScheduler: Job 64 finished: collect at utils.scala:114, took 0.077047 s
## 20/08/27 09:57:26 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
## 20/08/27 09:57:26 INFO CodeGenerator: Code generated in 48.801 ms
## 20/08/27 09:57:26 INFO SparkContext: Starting job: count at utils.scala:114
## 20/08/27 09:57:26 INFO DAGScheduler: Registering RDD 263 (count at utils.scala:114)
## 20/08/27 09:57:26 INFO DAGScheduler: Got job 65 (count at utils.scala:114) with 1 output partitions
## 20/08/27 09:57:26 INFO DAGScheduler: Final stage: ResultStage 95 (count at utils.scala:114)
## 20/08/27 09:57:26 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 94)
## 20/08/27 09:57:26 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 94)
## 20/08/27 09:57:26 INFO DAGScheduler: Submitting ShuffleMapStage 94 (MapPartitionsRDD[263] at count at utils.scala:114), which has no missing parents
## 20/08/27 09:57:26 INFO MemoryStore: Block broadcast_94 stored as values in memory (estimated size 18.8 KB, free 912.1 MB)
## 20/08/27 09:57:26 INFO MemoryStore: Block broadcast_94_piece0 stored as bytes in memory (estimated size 7.8 KB, free 912.1 MB)
## 20/08/27 09:57:26 INFO BlockManagerInfo: Added broadcast_94_piece0 in memory on 127.0.0.1:54827 (size: 7.8 KB, free: 912.2 MB)
## 20/08/27 09:57:26 INFO SparkContext: Created broadcast 94 from broadcast at DAGScheduler.scala:1012
## 20/08/27 09:57:26 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 94 (MapPartitionsRDD[263] at count at utils.scala:114)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Adding task set 94.0 with 1 tasks
## 20/08/27 09:57:26 INFO TaskSetManager: Starting task 0.0 in stage 94.0 (TID 112, localhost, partition 0, PROCESS_LOCAL, 6769 bytes)
## 20/08/27 09:57:26 INFO Executor: Running task 0.0 in stage 94.0 (TID 112)
## 20/08/27 09:57:26 INFO BlockManager: Found block rdd_11_0 locally
## 20/08/27 09:57:26 INFO Executor: Finished task 0.0 in stage 94.0 (TID 112). 1978 bytes result sent to driver
## 20/08/27 09:57:26 INFO TaskSetManager: Finished task 0.0 in stage 94.0 (TID 112) in 15 ms on localhost (1/1)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Removed TaskSet 94.0, whose tasks have all completed, from pool
## 20/08/27 09:57:26 INFO DAGScheduler: ShuffleMapStage 94 (count at utils.scala:114) finished in 0.015 s
## 20/08/27 09:57:26 INFO DAGScheduler: looking for newly runnable stages
## 20/08/27 09:57:26 INFO DAGScheduler: running: Set()
## 20/08/27 09:57:26 INFO DAGScheduler: waiting: Set(ResultStage 95)
## 20/08/27 09:57:26 INFO DAGScheduler: failed: Set()
## 20/08/27 09:57:26 INFO DAGScheduler: Submitting ResultStage 95 (MapPartitionsRDD[267] at count at utils.scala:114), which has no missing parents
## 20/08/27 09:57:26 INFO MemoryStore: Block broadcast_95 stored as values in memory (estimated size 67.3 KB, free 912.0 MB)
## 20/08/27 09:57:26 INFO MemoryStore: Block broadcast_95_piece0 stored as bytes in memory (estimated size 22.9 KB, free 912.0 MB)
## 20/08/27 09:57:26 INFO BlockManagerInfo: Added broadcast_95_piece0 in memory on 127.0.0.1:54827 (size: 22.9 KB, free: 912.2 MB)
## 20/08/27 09:57:26 INFO SparkContext: Created broadcast 95 from broadcast at DAGScheduler.scala:1012
## 20/08/27 09:57:26 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 95 (MapPartitionsRDD[267] at count at utils.scala:114)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Adding task set 95.0 with 1 tasks
## 20/08/27 09:57:26 INFO TaskSetManager: Starting task 0.0 in stage 95.0 (TID 113, localhost, partition 0, ANY, 5379 bytes)
## 20/08/27 09:57:26 INFO Executor: Running task 0.0 in stage 95.0 (TID 113)
## 20/08/27 09:57:26 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
## 20/08/27 09:57:26 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
## 20/08/27 09:57:26 INFO Executor: Finished task 0.0 in stage 95.0 (TID 113). 3232 bytes result sent to driver
## 20/08/27 09:57:26 INFO DAGScheduler: ResultStage 95 (count at utils.scala:114) finished in 0.049 s
## 20/08/27 09:57:26 INFO DAGScheduler: Job 65 finished: count at utils.scala:114, took 0.081230 s
## 20/08/27 09:57:26 INFO TaskSetManager: Finished task 0.0 in stage 95.0 (TID 113) in 49 ms on localhost (1/1)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Removed TaskSet 95.0, whose tasks have all completed, from pool
## 20/08/27 09:57:27 INFO SparkSqlParser: Parsing command: DROP TABLE `mtcars`
## 20/08/27 09:57:27 INFO HiveMetaStore: 0: get_database: default
## 20/08/27 09:57:27 INFO audit: ugi=Hugo ip=unknown-ip-addr cmd=get_database: default
## 20/08/27 09:57:27 INFO HiveMetaStore: 0: get_table : db=default tbl=mtcars
## 20/08/27 09:57:27 INFO audit: ugi=Hugo ip=unknown-ip-addr cmd=get_table : db=default tbl=mtcars
## 20/08/27 09:57:27 INFO SparkSqlParser: Parsing command: `mtcars`
## 20/08/27 09:57:27 INFO MapPartitionsRDD: Removing RDD 11 from persistence list
## 20/08/27 09:57:27 INFO BlockManager: Removing RDD 11
## 20/08/27 09:57:27 INFO SparkContext: Starting job: count at null:-2
## 20/08/27 09:57:27 INFO DAGScheduler: Registering RDD 271 (count at null:-2)
## 20/08/27 09:57:27 INFO DAGScheduler: Got job 66 (count at null:-2) with 1 output partitions
## 20/08/27 09:57:27 INFO DAGScheduler: Final stage: ResultStage 97 (count at null:-2)
## 20/08/27 09:57:27 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 96)
## 20/08/27 09:57:27 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 96)
## 20/08/27 09:57:27 INFO DAGScheduler: Submitting ShuffleMapStage 96 (MapPartitionsRDD[271] at count at null:-2), which has no missing parents
## 20/08/27 09:57:27 INFO MemoryStore: Block broadcast_96 stored as values in memory (estimated size 8.1 KB, free 912.0 MB)
## 20/08/27 09:57:27 INFO MemoryStore: Block broadcast_96_piece0 stored as bytes in memory (estimated size 4.3 KB, free 912.0 MB)
## 20/08/27 09:57:27 INFO BlockManagerInfo: Added broadcast_96_piece0 in memory on 127.0.0.1:54827 (size: 4.3 KB, free: 912.2 MB)
## 20/08/27 09:57:27 INFO SparkContext: Created broadcast 96 from broadcast at DAGScheduler.scala:1012
## 20/08/27 09:57:27 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 96 (MapPartitionsRDD[271] at count at null:-2)
## 20/08/27 09:57:27 INFO TaskSchedulerImpl: Adding task set 96.0 with 1 tasks
## 20/08/27 09:57:27 INFO TaskSetManager: Starting task 0.0 in stage 96.0 (TID 114, localhost, partition 0, PROCESS_LOCAL, 5494 bytes)
## 20/08/27 09:57:27 INFO Executor: Running task 0.0 in stage 96.0 (TID 114)
## 20/08/27 09:57:27 INFO Executor: Finished task 0.0 in stage 96.0 (TID 114). 1636 bytes result sent to driver
## 20/08/27 09:57:27 INFO DAGScheduler: ShuffleMapStage 96 (count at null:-2) finished in 0.029 s
## 20/08/27 09:57:27 INFO DAGScheduler: looking for newly runnable stages
## 20/08/27 09:57:27 INFO DAGScheduler: running: Set()
## 20/08/27 09:57:27 INFO DAGScheduler: waiting: Set(ResultStage 97)
## 20/08/27 09:57:27 INFO DAGScheduler: failed: Set()
## 20/08/27 09:57:27 INFO DAGScheduler: Submitting ResultStage 97 (MapPartitionsRDD[274] at count at null:-2), which has no missing parents
## 20/08/27 09:57:27 INFO MemoryStore: Block broadcast_97 stored as values in memory (estimated size 7.0 KB, free 912.0 MB)
## 20/08/27 09:57:27 INFO TaskSetManager: Finished task 0.0 in stage 96.0 (TID 114) in 29 ms on localhost (1/1)
## 20/08/27 09:57:27 INFO TaskSchedulerImpl: Removed TaskSet 96.0, whose tasks have all completed, from pool
## 20/08/27 09:57:27 INFO MemoryStore: Block broadcast_97_piece0 stored as bytes in memory (estimated size 3.7 KB, free 912.0 MB)
## 20/08/27 09:57:27 INFO BlockManagerInfo: Added broadcast_97_piece0 in memory on 127.0.0.1:54827 (size: 3.7 KB, free: 912.2 MB)
## 20/08/27 09:57:27 INFO SparkContext: Created broadcast 97 from broadcast at DAGScheduler.scala:1012
## 20/08/27 09:57:27 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 97 (MapPartitionsRDD[274] at count at null:-2)
## 20/08/27 09:57:27 INFO TaskSchedulerImpl: Adding task set 97.0 with 1 tasks
## 20/08/27 09:57:27 INFO TaskSetManager: Starting task 0.0 in stage 97.0 (TID 115, localhost, partition 0, ANY, 5379 bytes)
## 20/08/27 09:57:27 INFO Executor: Running task 0.0 in stage 97.0 (TID 115)
## 20/08/27 09:57:27 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
## 20/08/27 09:57:27 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
## 20/08/27 09:57:27 INFO Executor: Finished task 0.0 in stage 97.0 (TID 115). 1866 bytes result sent to driver
## 20/08/27 09:57:27 INFO DAGScheduler: ResultStage 97 (count at null:-2) finished in 0.007 s
## 20/08/27 09:57:27 INFO DAGScheduler: Job 66 finished: count at null:-2, took 0.052031 s
## 20/08/27 09:57:27 INFO TaskSetManager: Finished task 0.0 in stage 97.0 (TID 115) in 7 ms on localhost (1/1)
## 20/08/27 09:57:27 INFO TaskSchedulerImpl: Removed TaskSet 97.0, whose tasks have all completed, from pool
spark_log(sc, filter = "sparklyr") # filtering by a topic
## 20/08/27 09:51:25 INFO SparkContext: Added JAR file:/C:/Users/Karen/Documents/R/win-library/3.6/sparklyr/java/sparklyr-2.0-2.11.jar at spark://127.0.0.1:54216/jars/sparklyr-2.0-2.11.jar with timestamp 1598539885927
## 20/08/27 09:51:38 INFO Executor: Fetching spark://127.0.0.1:54216/jars/sparklyr-2.0-2.11.jar with timestamp 1598539885927
## 20/08/27 09:51:38 INFO Utils: Fetching spark://127.0.0.1:54216/jars/sparklyr-2.0-2.11.jar to C:\Users\Karen\AppData\Local\spark\spark-2.0.1-bin-hadoop2.7\tmp\local\spark-9a62abb7-e822-4618-9b95-5547479c3599\userFiles-adbd7a96-8c2e-491c-b6c7-2bd0a43545cc\fetchFileTemp1132867395228677426.tmp
## 20/08/27 09:51:38 INFO Executor: Adding file:/C:/Users/Karen/AppData/Local/spark/spark-2.0.1-bin-hadoop2.7/tmp/local/spark-9a62abb7-e822-4618-9b95-5547479c3599/userFiles-adbd7a96-8c2e-491c-b6c7-2bd0a43545cc/sparklyr-2.0-2.11.jar to class loader
## 20/08/27 09:51:40 INFO SparkSqlParser: Parsing command: sparklyr_tmp_59fd0ca6_9364_489a_a014_5b06929d7d65
## FROM `sparklyr_tmp_59fd0ca6_9364_489a_a014_5b06929d7d65` AS `zzz2`
## 20/08/27 09:51:40 INFO SparkSqlParser: Parsing command: sparklyr_tmp_749a5caa_0040_482d_9e4d_f2ab11217bb3
## FROM `sparklyr_tmp_749a5caa_0040_482d_9e4d_f2ab11217bb3` AS `zzz3`
## FROM `sparklyr_tmp_59fd0ca6_9364_489a_a014_5b06929d7d65`
## FROM `sparklyr_tmp_749a5caa_0040_482d_9e4d_f2ab11217bb3`
## 20/08/27 09:51:45 INFO SparkSqlParser: Parsing command: sparklyr_tmp_a81e93d3_9e90_494c_b7fd_8caa4cc78c69
## FROM `sparklyr_tmp_a81e93d3_9e90_494c_b7fd_8caa4cc78c69` AS `zzz7`
## FROM `sparklyr_tmp_a81e93d3_9e90_494c_b7fd_8caa4cc78c69`
## 20/08/27 09:51:46 INFO SparkSqlParser: Parsing command: sparklyr_tmp_c473fa71_6350_43a2_b8cb_761a43f24bef
## FROM `sparklyr_tmp_c473fa71_6350_43a2_b8cb_761a43f24bef` AS `zzz8`
## FROM `sparklyr_tmp_c473fa71_6350_43a2_b8cb_761a43f24bef`
## 20/08/27 09:51:47 INFO SparkSqlParser: Parsing command: sparklyr_tmp_e722ed52_f51a_46a3_b58d_af476c3133b2
## FROM `sparklyr_tmp_e722ed52_f51a_46a3_b58d_af476c3133b2` AS `zzz10`
## FROM `sparklyr_tmp_e722ed52_f51a_46a3_b58d_af476c3133b2`
## 20/08/27 09:51:47 INFO SparkSqlParser: Parsing command: sparklyr_tmp_8d99257e_19cc_4429_a9ca_84a1a5b4f867
## FROM `sparklyr_tmp_8d99257e_19cc_4429_a9ca_84a1a5b4f867` AS `zzz11`
## FROM `sparklyr_tmp_8d99257e_19cc_4429_a9ca_84a1a5b4f867`
## FROM `sparklyr_tmp_8d99257e_19cc_4429_a9ca_84a1a5b4f867`
## 20/08/27 09:51:57 INFO SparkContext: Added JAR file:/C:/Users/Karen/Documents/R/win-library/3.6/sparklyr/java/sparklyr-2.0-2.11.jar at spark://127.0.0.1:54293/jars/sparklyr-2.0-2.11.jar with timestamp 1598539917005
## 20/08/27 09:52:10 INFO Executor: Fetching spark://127.0.0.1:54293/jars/sparklyr-2.0-2.11.jar with timestamp 1598539917005
## 20/08/27 09:52:10 INFO Utils: Fetching spark://127.0.0.1:54293/jars/sparklyr-2.0-2.11.jar to C:\Users\Karen\AppData\Local\spark\spark-2.0.1-bin-hadoop2.7\tmp\local\spark-1bfbb169-6176-449c-be4a-fe3c08fec275\userFiles-c86573f6-64d3-4951-8f29-789b7bff3e82\fetchFileTemp8452961142174595398.tmp
## 20/08/27 09:52:11 INFO Executor: Adding file:/C:/Users/Karen/AppData/Local/spark/spark-2.0.1-bin-hadoop2.7/tmp/local/spark-1bfbb169-6176-449c-be4a-fe3c08fec275/userFiles-c86573f6-64d3-4951-8f29-789b7bff3e82/sparklyr-2.0-2.11.jar to class loader
## 20/08/27 09:53:42 INFO SparkContext: Added JAR file:/C:/Users/Karen/Documents/R/win-library/3.6/sparklyr/java/sparklyr-2.0-2.11.jar at spark://127.0.0.1:54410/jars/sparklyr-2.0-2.11.jar with timestamp 1598540022654
## 20/08/27 09:53:53 INFO Executor: Fetching spark://127.0.0.1:54410/jars/sparklyr-2.0-2.11.jar with timestamp 1598540022654
## 20/08/27 09:53:53 INFO Utils: Fetching spark://127.0.0.1:54410/jars/sparklyr-2.0-2.11.jar to C:\Users\Karen\AppData\Local\spark\spark-2.0.1-bin-hadoop2.7\tmp\local\spark-cd010f6c-f324-47db-892c-d4679f698670\userFiles-71211c0f-8947-4248-9d6c-0c06ac71fa40\fetchFileTemp7840457036382436762.tmp
## 20/08/27 09:53:53 INFO Executor: Adding file:/C:/Users/Karen/AppData/Local/spark/spark-2.0.1-bin-hadoop2.7/tmp/local/spark-cd010f6c-f324-47db-892c-d4679f698670/userFiles-71211c0f-8947-4248-9d6c-0c06ac71fa40/sparklyr-2.0-2.11.jar to class loader
## 20/08/27 09:56:15 INFO SparkContext: Added JAR file:/C:/Users/Karen/Documents/R/win-library/3.6/sparklyr/java/sparklyr-2.0-2.11.jar at spark://127.0.0.1:54734/jars/sparklyr-2.0-2.11.jar with timestamp 1598540175585
## 20/08/27 09:56:25 INFO Executor: Fetching spark://127.0.0.1:54734/jars/sparklyr-2.0-2.11.jar with timestamp 1598540175585
## 20/08/27 09:56:25 INFO Utils: Fetching spark://127.0.0.1:54734/jars/sparklyr-2.0-2.11.jar to C:\Users\Karen\AppData\Local\spark\spark-2.0.1-bin-hadoop2.7\tmp\local\spark-d8f5f96a-383a-4bf3-999d-69c482372062\userFiles-f296a21e-a704-4f50-a81e-f1706c48f1b6\fetchFileTemp1086623201380595667.tmp
## 20/08/27 09:56:25 INFO Executor: Adding file:/C:/Users/Karen/AppData/Local/spark/spark-2.0.1-bin-hadoop2.7/tmp/local/spark-d8f5f96a-383a-4bf3-999d-69c482372062/userFiles-f296a21e-a704-4f50-a81e-f1706c48f1b6/sparklyr-2.0-2.11.jar to class loader
## 20/08/27 09:56:27 INFO SparkSqlParser: Parsing command: sparklyr_tmp_cb530f84_1a46_4d68_a364_da539e345adb
## FROM `sparklyr_tmp_cb530f84_1a46_4d68_a364_da539e345adb` AS `zzz2`
## 20/08/27 09:56:27 INFO SparkSqlParser: Parsing command: sparklyr_tmp_c9dce0e3_bb9f_4939_a97c_d0bb9f1a6c23
## FROM `sparklyr_tmp_c9dce0e3_bb9f_4939_a97c_d0bb9f1a6c23` AS `zzz3`
## FROM `sparklyr_tmp_cb530f84_1a46_4d68_a364_da539e345adb`
## FROM `sparklyr_tmp_c9dce0e3_bb9f_4939_a97c_d0bb9f1a6c23`
## 20/08/27 09:56:31 INFO SparkSqlParser: Parsing command: sparklyr_tmp_0d1b933c_53e9_434a_9f19_bf7129c3a692
## FROM `sparklyr_tmp_0d1b933c_53e9_434a_9f19_bf7129c3a692` AS `zzz7`
## FROM `sparklyr_tmp_0d1b933c_53e9_434a_9f19_bf7129c3a692`
## 20/08/27 09:56:31 INFO SparkSqlParser: Parsing command: sparklyr_tmp_73792aa5_bfd5_47d8_9a18_a629b18bcc8a
## FROM `sparklyr_tmp_73792aa5_bfd5_47d8_9a18_a629b18bcc8a` AS `zzz8`
## FROM `sparklyr_tmp_73792aa5_bfd5_47d8_9a18_a629b18bcc8a`
## 20/08/27 09:56:32 INFO SparkSqlParser: Parsing command: sparklyr_tmp_2b9cafdd_433e_4e1e_b203_12184ba12b29
## FROM `sparklyr_tmp_2b9cafdd_433e_4e1e_b203_12184ba12b29` AS `zzz10`
## FROM `sparklyr_tmp_2b9cafdd_433e_4e1e_b203_12184ba12b29`
## 20/08/27 09:56:32 INFO SparkSqlParser: Parsing command: sparklyr_tmp_76fe3384_2881_4fac_9815_377e8f28ff08
## FROM `sparklyr_tmp_76fe3384_2881_4fac_9815_377e8f28ff08` AS `zzz11`
## FROM `sparklyr_tmp_76fe3384_2881_4fac_9815_377e8f28ff08`
## FROM `sparklyr_tmp_76fe3384_2881_4fac_9815_377e8f28ff08`
## 20/08/27 09:56:42 INFO SparkContext: Added JAR file:/C:/Users/Karen/Documents/R/win-library/3.6/sparklyr/java/sparklyr-2.0-2.11.jar at spark://127.0.0.1:54806/jars/sparklyr-2.0-2.11.jar with timestamp 1598540202682
## 20/08/27 09:56:55 INFO Executor: Fetching spark://127.0.0.1:54806/jars/sparklyr-2.0-2.11.jar with timestamp 1598540202682
## 20/08/27 09:56:56 INFO Utils: Fetching spark://127.0.0.1:54806/jars/sparklyr-2.0-2.11.jar to C:\Users\Karen\AppData\Local\spark\spark-2.0.1-bin-hadoop2.7\tmp\local\spark-e10b3f19-3b14-406f-84cd-7a1adf57e767\userFiles-f6a77198-2fe1-4cc3-ac2f-c6aeec14ffe2\fetchFileTemp6298598374547532482.tmp
## 20/08/27 09:56:56 INFO Executor: Adding file:/C:/Users/Karen/AppData/Local/spark/spark-2.0.1-bin-hadoop2.7/tmp/local/spark-e10b3f19-3b14-406f-84cd-7a1adf57e767/userFiles-f6a77198-2fe1-4cc3-ac2f-c6aeec14ffe2/sparklyr-2.0-2.11.jar to class loader
## 20/08/27 09:57:08 INFO SparkSqlParser: Parsing command: sparklyr_tmp_ef05a58f_b600_4b5e_b1da_1c944fa5a3c3
## FROM `sparklyr_tmp_ef05a58f_b600_4b5e_b1da_1c944fa5a3c3` AS `zzz13`
## 20/08/27 09:57:08 INFO SparkSqlParser: Parsing command: sparklyr_tmp_a47670be_625f_4b15_a6c1_4b5776fe697b
## FROM `sparklyr_tmp_a47670be_625f_4b15_a6c1_4b5776fe697b` AS `zzz14`
## FROM `sparklyr_tmp_a47670be_625f_4b15_a6c1_4b5776fe697b`
## 20/08/27 09:57:09 INFO SparkSqlParser: Parsing command: sparklyr_tmp_4f9e8dac_4383_48bf_8e4e_1fa7d299ea43
## FROM `sparklyr_tmp_4f9e8dac_4383_48bf_8e4e_1fa7d299ea43` AS `zzz15`
## 20/08/27 09:57:09 INFO SparkSqlParser: Parsing command: sparklyr_tmp_2e7c2a2d4672
## 20/08/27 09:57:09 INFO SparkSqlParser: Parsing command: CACHE TABLE `sparklyr_tmp_2e7c2a2d4672`
## 20/08/27 09:57:09 INFO SparkSqlParser: Parsing command: `sparklyr_tmp_2e7c2a2d4672`
## 20/08/27 09:57:10 INFO SparkSqlParser: Parsing command: SELECT count(*) FROM `sparklyr_tmp_2e7c2a2d4672`
## FROM `sparklyr_tmp_2e7c2a2d4672` AS `zzz16`
## FROM `sparklyr_tmp_2e7c2a2d4672`
## 20/08/27 09:57:10 INFO SparkSqlParser: Parsing command: sparklyr_tmp_15423af4_f638_4272_9163_23e86bb6f396
## FROM `sparklyr_tmp_15423af4_f638_4272_9163_23e86bb6f396` AS `zzz17`
## FROM `sparklyr_tmp_15423af4_f638_4272_9163_23e86bb6f396`) `LHS`
## FROM `sparklyr_tmp_15423af4_f638_4272_9163_23e86bb6f396`) `LHS`
## 20/08/27 09:57:16 INFO SparkSqlParser: Parsing command: sparklyr_tmp_05e04002_445f_4949_92ed_f8d280d3040f
## FROM `sparklyr_tmp_05e04002_445f_4949_92ed_f8d280d3040f` AS `zzz18`
## FROM `sparklyr_tmp_05e04002_445f_4949_92ed_f8d280d3040f`
## 20/08/27 09:57:17 INFO SparkSqlParser: Parsing command: sparklyr_tmp_7fbd53b2_b6b3_4534_8342_0adec098cc66
## FROM `sparklyr_tmp_7fbd53b2_b6b3_4534_8342_0adec098cc66` AS `zzz19`
## FROM `sparklyr_tmp_7fbd53b2_b6b3_4534_8342_0adec098cc66`
#spark_disconnect_all() # once Spark is disconnected, all its data is removed
We can carry out data import, analysis, and modeling entirely within Spark.
The examples we have worked through so far use small datasets. In real-life scenarios, models are fit on large amounts of data. When the data must be transformed before fitting the models, it is a good idea to save the result of all those transformations in a new table loaded into Spark memory.
The compute() command takes the final result of a dplyr pipeline and saves it to Spark memory:
library("ggplot2")
library("corrr")
library("dbplot")
library("rmarkdown")
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.1")
cars <- copy_to(sc, mtcars, overwrite = T)
cached_cars <- cars %>%
mutate(cyl = paste0("cyl_", cyl)) %>%
compute("cached_cars")
We will use the OkCupid dataset. It consists of user profile data from an online dating site and contains a diverse set of features, including characteristics such as gender and profession, as well as free-text fields about personal interests. There are about 60,000 profiles in the dataset.
download.file(
  "https://github.com/r-spark/okcupid/raw/master/profiles.csv.zip",
  "okcupid.zip")
unzip("okcupid.zip", exdir = "data")
unlink("okcupid.zip")
profiles <- read.csv("data/profiles.csv")
write.csv(dplyr::sample_n(profiles, 10^3),
          "data/profiles.csv", row.names = FALSE) # downsample to 1,000 rows so a very large dataset stays manageable
En la prÔctica, es mejor que utilice una implementación eficiente y no distribuida del algoritmo de modelado. Por ejemplo, es posible que desee utilizar el paquete ranger.
library(sparklyr)
library(ggplot2)
library(dbplot)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.1")
okc <- spark_read_csv(sc, "data/profiles.csv",
  escape = "\"",
  memory = FALSE,
  options = list(multiline = TRUE)
) %>%
  mutate(
    height = as.numeric(height),
    income = ifelse(income == "-1", NA, as.numeric(income))
  ) %>%
  # recode missing values in the categorical predictors
  mutate(sex = ifelse(is.na(sex), "missing", sex)) %>%
  mutate(drinks = ifelse(is.na(drinks), "missing", drinks)) %>%
  mutate(drugs = ifelse(is.na(drugs), "missing", drugs)) %>%
  mutate(job = ifelse(is.na(job), "missing", job)) %>%
  compute() # cache the cleaned table in Spark memory
glimpse(okc) # for a quick look at the data
## Rows: ??
## Columns: 31
## Database: spark_connection
## $ age <chr> "29", "40", "38", "20", "34", "27", "27", "33", "68", "...
## $ body_type <chr> "athletic", "athletic", "thin", "average", "average", "...
## $ diet <chr> "strictly anything", NA, "vegan", "mostly anything", "s...
## $ drinks <chr> "socially", "missing", "often", "socially", "rarely", "...
## $ drugs <chr> "never", "missing", "missing", "never", "never", "missi...
## $ education <chr> "dropped out of space camp", "graduated from college/un...
## $ essay0 <chr> "im a small furry mammal that roams the deserts of life...
## $ essay1 <chr> "eating and havin fun! wooo! rebel yell!", "working and...
## $ essay2 <chr> "chasing down my prey", "i used to be good at skiing, b...
## $ essay3 <chr> "my cute little tail", "i'm tall.", NA, NA, "is my long...
## $ essay4 <chr> "romance novels that have a very masculine and mysterio...
## $ essay5 <chr> "food. sleep. food. sleep. food. sleep.", "cheese<br />...
## $ essay6 <chr> "food. and sleep", "too many things!", NA, "nothing at ...
## $ essay7 <chr> "on the prowl for people whos feelings i can hurt and f...
## $ essay8 <chr> "i have very high standards in life", NA, NA, "i have n...
## $ essay9 <chr> ":)", "you live fairly close", "despite your rejection ...
## $ ethnicity <chr> "other", "white", "white", "white", "asian", "white", "...
## $ height <dbl> 94, 74, 65, 68, 62, 71, 71, 72, 72, 70, 69, 69, 61, 69,...
## $ income <dbl> 1e+06, 6e+04, NaN, 2e+04, NaN, NaN, NaN, NaN, NaN, NaN,...
## $ job <chr> "rather not say", "computer / hardware / software", "ot...
## $ last_online <chr> "2012-06-26-21-46", "2012-06-26-14-13", "2012-06-28-21-...
## $ location <chr> "emeryville, california", "berkeley, california", "oakl...
## $ offspring <chr> "doesn't have kids", "has kids", "doesn't w...
## $ orientation <chr> "straight", "straight", "straight", "straight", "straig...
## $ pets <chr> "has dogs and likes cats", "has dogs", "likes dogs and ...
## $ religion <chr> "other and very serious about it", NA, NA, "atheism", N...
## $ sex <chr> "f", "m", "f", "m", "f", "m", "m", "m", "m", "m", "f", ...
## $ sign <chr> "taurus but it doesn't matter", NA, "aries and it...
## $ smokes <chr> "no", NA, "no", "yes", "no", "no", "yes", NA, "no", "tr...
## $ speaks <chr> "english (fluently), french (okay), spanish (fluently)"...
## $ status <chr> "single", "single", "single", "single", "single", "sing...
Now we add our response variable as a column in the dataset and look at its distribution:
okc <- okc %>%
  mutate(
    not_working = ifelse(job %in% c("student", "unemployed", "retired"), 1, 0)
  )
okc %>%
  group_by(not_working) %>%
  tally()
## # Source: spark<?> [?? x 2]
## not_working n
## <dbl> <dbl>
## 1 0 900
## 2 1 100
# modeling: split into training and testing sets
data_splits <- sdf_random_split(okc, training = 0.8, testing = 0.2, seed = 42)
okc_train <- data_splits$training
okc_test <- data_splits$testing
# distribution of the response variable in the training set
okc_train %>%
  group_by(not_working) %>%
  tally() %>%
  mutate(frac = n / sum(n))
## # Source: spark<?> [?? x 3]
## not_working n frac
## <dbl> <dbl> <dbl>
## 1 1 85 0.105
## 2 0 721 0.895
# summary of the numeric variables
sdf_describe(okc_train, cols = c("age", "income"))
## # Source: spark<?> [?? x 3]
## summary age income
## <chr> <chr> <chr>
## 1 count 806 160
## 2 mean 32.06699751861042 115750.0
## 3 stddev 9.27900067801498 223284.6562929237
## 4 min 18 20000.0
## 5 max 68 1000000.0
dbplot_histogram(okc_train, age) # age histogram, binned inside Spark and plotted with ggplot2
# response variable vs. the other predictors
prop_data <- okc_train %>%
  mutate(religion = regexp_extract(religion, "^\\\\w+", 0)) %>% # keep only the first word
  group_by(religion, not_working) %>%
  tally() %>%
  group_by(religion) %>%
  summarize(
    count = sum(n),
    prop = sum(not_working * n) / sum(n)
  ) %>%
  mutate(se = sqrt(prop * (1 - prop) / count)) %>% # standard error of the proportion
  collect()
prop_data # the collected result now lives in R's memory
## # A tibble: 10 x 4
## religion count prop se
## <chr> <dbl> <dbl> <dbl>
## 1 judaism 46 0.0870 0.0415
## 2 atheism 89 0.112 0.0335
## 3 christianity 81 0.123 0.0366
## 4 hinduism 8 0.125 0.117
## 5 agnosticism 128 0.117 0.0284
## 6 other 100 0.15 0.0357
## 7 buddhism 25 0.24 0.0854
## 8 islam 4 0.5 0.25
## 9 <NA> 260 0.0577 0.0145
## 10 catholicism 65 0.108 0.0384
# Proportion of people not currently employed, by religion
prop_data %>%
  ggplot(aes(x = religion, y = prop)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = prop - 1.96 * se, ymax = prop + 1.96 * se),
                width = .1) +
  geom_hline(yintercept = sum(prop_data$prop * prop_data$count) /
               sum(prop_data$count)) # overall proportion as a reference line
# alcohol and drugs? cross-tabulations
contingency_tbl <- okc_train %>%
  sdf_crosstab("drinks", "drugs") %>%
  collect()
contingency_tbl
## # A tibble: 7 x 5
## drinks_drugs missing never often sometimes
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 very often 0 2 1 1
## 2 socially 125 375 3 66
## 3 not at all 3 43 0 4
## 4 desperately 1 1 1 0
## 5 often 24 26 0 19
## 6 missing 19 14 0 3
## 7 rarely 5 61 2 7
We can visualize this information with a mosaic plot:
library(ggmosaic)
library(forcats)
library(tidyr)
contingency_tbl %>%
  rename(drinks = drinks_drugs) %>%
  gather("drugs", "count", missing:sometimes) %>%
  mutate(
    drinks = as_factor(drinks) %>%
      fct_relevel("missing", "not at all", "rarely", "socially",
                  "very often", "desperately"),
    drugs = as_factor(drugs) %>%
      fct_relevel("missing", "never", "sometimes", "often")
  ) %>%
  ggplot() +
  geom_mosaic(aes(x = product(drinks, drugs), fill = drinks,
                  weight = count))
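To close the loop opened by the train/test split, here is a minimal modeling sketch in Spark. Assumptions: the okc_train and okc_test tables from above, and that ml_evaluate() on a logistic regression model returns an evaluation object exposing area_under_roc(), as in recent sparklyr versions; treat it as indicative rather than guaranteed on Spark 2.0.1.
# fit a logistic regression in Spark using the categorical predictors explored above
lr_model <- ml_logistic_regression(
  okc_train,
  not_working ~ drinks + drugs + sex
)
# evaluate on the held-out testing set (accessor assumed; see note above)
validation_summary <- ml_evaluate(lr_model, okc_test)
validation_summary$area_under_roc()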