Today's class is taken from Part II, Chapter 9 of the book Advanced R by Hadley Wickham: https://adv-r.hadley.nz/fp.html.
R is a functional language. This means it has a problem-solving style centered on functions.
Functional techniques have attracted a lot of interest because they can produce efficient and elegant solutions to many modern problems.
Below we look at the three key functional techniques for decomposing problems into smaller parts:
Technique 1: we will see how to replace many for loops with functionals, which are functions (such as lapply()) that take another function as an argument. Functionals let you take a function that solves the problem for a single input and generalize it to handle any number of inputs. Functionals are the most important technique and are used all the time in data analysis.
Technique 2: we will introduce function factories: functions that create functions. Function factories are used less often than functionals, but they can let you elegantly partition work between different parts of your code.
Technique 3: we will show how to create function operators: functionals that take functions as input and produce functions as output. They are like adverbs, because they typically modify how a function works (techniques 2 and 3 are sketched in code right after this list).
A functional is a function that takes a function as input and returns a vector as output.
randomise <- function(f) f(runif(1e3)) # applies the input function f to 1000 random uniform numbers
randomise(mean)
## [1] 0.492595
randomise(mean) # the resulting mean is different each time
## [1] 0.5021419
randomise(sum)
## [1] 491.0315
randomise(sum) # the resulting sum is different each time
## [1] 497.2087
You may already have used for-loop replacements such as lapply(), apply() and tapply() from base R, or map().
A common use of functionals is as an alternative to for loops. For loops have a bad reputation in R because many people believe they are slow, but the real downside of for loops is that they are too flexible: a loop conveys that you are iterating, but not what should be done with the results. Just as it is better to use while than repeat, and better to use for than while, it is better to use a functional than for.
Each functional is designed for a specific task, so when you recognize the functional, you immediately know why it is being used.
Prerequisites
This first technique focuses on the functions provided by the purrr package (Henry and Wickham 2018a).
The most fundamental functional is purrr::map(). It takes a vector and a function, calls the function once for each element of the vector, and returns the results in a list. In other words, map(1:3, f) is equivalent to list(f(1), f(2), f(3)).
library(purrr)
triple <- function(x) x*3
map(1:3, triple)
## [[1]]
## [1] 3
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 9
Or, graphically:
The base equivalent of map() is lapply(). The only difference is that lapply() does not support the helpers you will learn about below.
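For comparison, the lapply() call below (a minimal base-R sketch) produces the same list as map(1:3, triple) above:
lapply(1:3, triple)
## [[1]]
## [1] 3
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 9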
Producing atomic vectors
map() returns a list, which makes it the most general of the map family because you can put anything in a list. But it is inconvenient to return a list when a simpler data structure would do, so there are four more specific variants: map_lgl(), map_int(), map_dbl() and map_chr(). Each returns an atomic vector of the specified type:
# map_chr() always returns a character vector
map_chr(mtcars, typeof)
## mpg cyl disp hp drat wt qsec vs
## "double" "double" "double" "double" "double" "double" "double" "double"
## am gear carb
## "double" "double" "double"
# map_lgl() always returns a logical vector
map_lgl(mtcars, is.double)
## mpg cyl disp hp drat wt qsec vs am gear carb
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# map_int() always returns an integer vector
n_unique <- function(x) length(unique(x))
map_int(mtcars, n_unique)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 25 3 27 22 22 29 30 2 2 3 6
# map_dbl() always returns a double vector
map_dbl(mtcars, mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
purrr uses the convention that the suffixes, like _dbl(), refer to the output. All map_*() functions can take any type of vector as input. These examples rely on two facts: mtcars is a data.frame, and data.frames are lists containing vectors of the same length. This is more obvious if we draw a data.frame with the same orientation as a vector:
All map functions always return an output vector of the same length as the input, which implies that each call to .f must return a single value. If it does not, you will get an error:
pair <- function(x) c(x, x)
#map_dbl(1:2, pair)
This is similar to the error you will get if .f returns the wrong type of result:
#map_dbl(1:2, as.character)
In either case, it is often useful to switch back to map(), because map() can accept any type of output. That lets you see the problematic result and figure out what to do with it.
map(1:2, pair)
## [[1]]
## [1] 1 1
##
## [[2]]
## [1] 2 2
map(1:2, as.character)
## [[1]]
## [1] "1"
##
## [[2]]
## [1] "2"
What is big? (for this class)
When R does not work for you because you have too much data
What becomes harder when data are big?
The data may not load into memory
Analyzing the data can take a long time
Visualizations become cluttered
Etc.
How much data can R load?
memory.limit()
## [1] 16267
Changing the memory limit
You can use memory.limit(size = ...) to change R's allocation limit (memory.size() only reports current usage). But…
If you are running 32-bit R on any operating system, you will have 2 or 3 GB available
If you are running 64-bit R on a 64-bit operating system, the upper limit is effectively infinite, but…
What does a 2 GB (or 3 GB) memory limit mean?
2 GB of memory used by R is not the same as 2 GB on disk
data(esoph)
object.size(esoph)
## 5952 bytes
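As a rough illustration (the vector below is ours, not part of the original notes): a numeric vector stores each value as an 8-byte double, so one million values occupy roughly 8 MB of RAM:
x <- rnorm(10^6)
print(object.size(x), units = "Mb")
## 7.6 Mb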
Timing your R processes
When processing large data sets, the time a function takes to run a task can become a limiting factor. Below are three options for timing R functions:
# OPTION 1:
ptm <- proc.time()
some.output <- rnorm(10^6)
diff1 <- proc.time() - ptm
diff1 # in seconds
## user system elapsed
## 0.09 0.00 0.09
# OPTION 2:
system.time(
some.output <- rnorm(10^6)
) # in seconds
## user system elapsed
## 0.10 0.00 0.09
# OPTION 3:
t1 <- Sys.time()
some.output <- rnorm(10^6)
t2 <- Sys.time()
difftime(t2,t1) # in seconds
## Time difference of 0.09253907 secs
Depending on the computer you are using, the computed times may look different. Options 1 and 2 produce three numbers ("user", "system" and "elapsed"). The last number is the most useful: it gives the total elapsed time (in seconds). Option 3 has a slight advantage in that you can set the unit to be reported ("secs", "mins", "hours", "days", "weeks").
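For example, the units argument of difftime() controls the unit reported (the exact times below are illustrative and will vary):
t1 <- Sys.time()
Sys.sleep(2) # pause for about two seconds
t2 <- Sys.time()
difftime(t2, t1, units = "secs")
## Time difference of 2.002403 secs
difftime(t2, t1, units = "mins")
## Time difference of 0.03337338 mins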
Pipe functions
Pipes are a relatively recent addition to R, introduced in the magrittr package. They refer to a syntax that chains individual functions together with a pipe symbol (e.g. %>%, %<>%). By using pipes, you can write shorter and more readable R code. Pipes are an excellent companion to dplyr.
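A minimal sketch of the same computation written with nested calls and with the pipe:
library(magrittr)
# nested calls, read inside-out
round(mean(sqrt(mtcars$mpg)), 2)
## [1] 4.43
# piped, read left to right
mtcars$mpg %>% sqrt() %>% mean() %>% round(2)
## [1] 4.43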
Reading data into the R workspace
Common file formats for data storage, and the R packages and functions for importing data:
R can also access databases remotely, without importing them into your workspace. This option is more convenient for databases that are too large to fit in R's memory. You can use the dplyr package for remote access to SQL databases; we will go through an example (a minimal sketch follows).
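Here is a hedged sketch of that idea (it assumes the DBI, dbplyr and RSQLite packages are installed; the in-memory SQLite database is only a stand-in for a real remote server):
library(DBI)
library(dplyr)
con <- dbConnect(RSQLite::SQLite(), ":memory:") # stand-in for a remote connection
copy_to(con, mtcars, "mtcars")
remote_cars <- tbl(con, "mtcars") # a lazy reference: the data stay in the database
remote_cars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect() # only now are the results brought into R's memory
dbDisconnect(con)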
library(purrr)
rm(list=ls()) # clear the workspace
options(java.parameters="-Xmx8000m") # avoids 'java.lang.OutOfMemoryError' errors when using libraries that depend on rJava, such as the xlsx package.
if(.Platform$OS.type == "windows") withAutoprint({
memory.size()
memory.size(TRUE)
memory.limit()
}) # reports the current and maximum memory allocation that R is using.
## > memory.size()
## [1] 88.79
## > memory.size(TRUE)
## [1] 91.81
## > memory.limit()
## [1] 16267
#t1 = data.table::fread('bal2018.txt', encoding = 'UTF-8')
#Error in data.table::fread("bal2018.txt", encoding = "UTF-8") :
# File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.
ptm <- proc.time()
t2 = read.csv('bal2018.txt', fileEncoding = "UTF-16", sep = "\t", header = T)
diff1 <- proc.time() - ptm
diff1 # in seconds
## user system elapsed
## 56.17 1.12 58.68
dim(t2)
## [1] 78200 807
Base functions
library(dplyr)
ptm <- proc.time()
# create a data.frame for the subset cyl == 4
cyl_4 <- filter(mtcars, cyl == 4)
# fit a linear regression model
lm_4 <- lm(mpg ~ wt, data = cyl_4)
# get the summary
lm_4_summary <- summary(lm_4)
# extract the R-squared value
lm_4_r_squared <- lm_4_summary["r.squared"]
# check the value
lm_4_r_squared
## $r.squared
## [1] 0.5086326
diff1 <- proc.time() - ptm
diff1 # in seconds
## user system elapsed
## 0.02 0.02 0.03
Using dplyr
Alternatively, the same thing can be done with dplyr pipes. You type much less, but doing this for the 3 data subsets means copying and pasting several times, so if we wanted to fit a linear model of mpg ~ disp in addition to mpg ~ wt, we would have to duplicate the code 3 more times and change it 3 more times.
This may not seem like a big problem, but it eventually will be once you start scaling the code (say, 10x or 100x, etc.).
ptm <- proc.time()
lm_4cyl_rsquared <- mtcars %>%
filter(cyl == 4) %>%
lm(mpg ~ wt, data = .) %>%
summary() %>%
.$"coefficients"
diff1 <- proc.time() - ptm
diff1 # in seconds
## user system elapsed
## 0.02 0.00 0.01
Using purrr
To solve this problem of minimizing repetition, you can load purrr on its own, but it is also loaded as part of the tidyverse.
The basic arguments to map() are .x, a vector or list to iterate over, and .f, the function to apply to each element.
Returning to our example of extracting the R-squared of a linear model, we use the following purrr code.
library(purrr)
ptm <- proc.time()
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map("coefficients")
## $`4`
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.571196 4.346582 9.103980 7.771511e-06
## wt -5.647025 1.850119 -3.052251 1.374278e-02
##
## $`6`
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.408845 4.184369 6.789278 0.001054844
## wt -2.780106 1.334917 -2.082605 0.091757660
##
## $`8`
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.868029 3.0054619 7.941551 4.052705e-06
## wt -2.192438 0.7392393 -2.965803 1.179281e-02
diff1 <- proc.time() - ptm
diff1 # in seconds
## user system elapsed
## 0 0 0
This produces the output of our 3 linear models, one per number of cylinders, in 5 lines of code!
# piped
mtcars %>%
split(.$cyl)
## $`4`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
##
## $`6`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
##
## $`8`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
# base R
split(mtcars, mtcars$cyl)
## $`4`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
##
## $`6`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
##
## $`8`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
We continue with the example:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .))
## $`4`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Coefficients:
## (Intercept) wt
## 39.571 -5.647
##
##
## $`6`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Coefficients:
## (Intercept) wt
## 28.41 -2.78
##
##
## $`8`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Coefficients:
## (Intercept) wt
## 23.868 -2.192
Next, we map our summary function over each element of the list to get cleaner results including R-squared values:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary)
## $`4`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1513 -1.9795 -0.6272 1.9299 5.2523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.571 4.347 9.104 7.77e-06 ***
## wt -5.647 1.850 -3.052 0.0137 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.332 on 9 degrees of freedom
## Multiple R-squared: 0.5086, Adjusted R-squared: 0.454
## F-statistic: 9.316 on 1 and 9 DF, p-value: 0.01374
##
##
## $`6`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Residuals:
## Mazda RX4 Mazda RX4 Wag Hornet 4 Drive Valiant Merc 280
## -0.1250 0.5840 1.9292 -0.6897 0.3547
## Merc 280C Ferrari Dino
## -1.0453 -1.0080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.409 4.184 6.789 0.00105 **
## wt -2.780 1.335 -2.083 0.09176 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.165 on 5 degrees of freedom
## Multiple R-squared: 0.4645, Adjusted R-squared: 0.3574
## F-statistic: 4.337 on 1 and 5 DF, p-value: 0.09176
##
##
## $`8`
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1491 -1.4664 -0.8458 1.5711 3.7619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.8680 3.0055 7.942 4.05e-06 ***
## wt -2.1924 0.7392 -2.966 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.024 on 12 degrees of freedom
## Multiple R-squared: 0.423, Adjusted R-squared: 0.3749
## F-statistic: 8.796 on 1 and 12 DF, p-value: 0.01179
Since our output is double (numeric), we can use a typed variant; here map_df() collects the R-squared values into a one-row tibble.
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map_df("r.squared")
## # A tibble: 1 x 3
## `4` `6` `8`
## <dbl> <dbl> <dbl>
## 1 0.509 0.465 0.423
If we simply use map(), the result is a list.
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map("r.squared")
## $`4`
## [1] 0.5086326
##
## $`6`
## [1] 0.4645102
##
## $`8`
## [1] 0.4229655
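Since each R-squared value is a single double, map_dbl() returns them as a named numeric vector instead of a list (a minimal variant of the pipeline above):
mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
##         4         6         8
## 0.5086326 0.4645102 0.4229655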
pmap and imap
Libraries
library(purrr) # Functional programming
library(dplyr) # Data wrangling
library(tidyr) # Tidy-ing data
library(stringr) # String operations
library(repurrrsive) # Game of Thrones data
Getting the data
dat <- got_chars
glimpse(dat[[1]])
## List of 18
## $ url : chr "https://www.anapioficeandfire.com/api/characters/1022"
## $ id : int 1022
## $ name : chr "Theon Greyjoy"
## $ gender : chr "Male"
## $ culture : chr "Ironborn"
## $ born : chr "In 278 AC or 279 AC, at Pyke"
## $ died : chr ""
## $ alive : logi TRUE
## $ titles : chr [1:3] "Prince of Winterfell" "Captain of Sea Bitch" "Lord of the Iron Islands (by law of the green lands)"
## $ aliases : chr [1:4] "Prince of Fools" "Theon Turncloak" "Reek" "Theon Kinslayer"
## $ father : chr ""
## $ mother : chr ""
## $ spouse : chr ""
## $ allegiances: chr "House Greyjoy of Pyke"
## $ books : chr [1:3] "A Game of Thrones" "A Storm of Swords" "A Feast for Crows"
## $ povBooks : chr [1:2] "A Clash of Kings" "A Dance with Dragons"
## $ tvSeries : chr [1:6] "Season 1" "Season 2" "Season 3" "Season 4" ...
## $ playedBy : chr "Alfie Allen"
The map function: extracting a single element from a list
# Method 1: using the name of the list element, similar to dat[[1]]["name"], dat[[2]]["name"], etc
map(dat, "name")
## [[1]]
## [1] "Theon Greyjoy"
##
## [[2]]
## [1] "Tyrion Lannister"
##
## [[3]]
## [1] "Victarion Greyjoy"
##
## [[4]]
## [1] "Will"
##
## [[5]]
## [1] "Areo Hotah"
##
## [[6]]
## [1] "Chett"
##
## [[7]]
## [1] "Cressen"
##
## [[8]]
## [1] "Arianne Martell"
##
## [[9]]
## [1] "Daenerys Targaryen"
##
## [[10]]
## [1] "Davos Seaworth"
##
## [[11]]
## [1] "Arya Stark"
##
## [[12]]
## [1] "Arys Oakheart"
##
## [[13]]
## [1] "Asha Greyjoy"
##
## [[14]]
## [1] "Barristan Selmy"
##
## [[15]]
## [1] "Varamyr"
##
## [[16]]
## [1] "Brandon Stark"
##
## [[17]]
## [1] "Brienne of Tarth"
##
## [[18]]
## [1] "Catelyn Stark"
##
## [[19]]
## [1] "Cersei Lannister"
##
## [[20]]
## [1] "Eddard Stark"
##
## [[21]]
## [1] "Jaime Lannister"
##
## [[22]]
## [1] "Jon Connington"
##
## [[23]]
## [1] "Jon Snow"
##
## [[24]]
## [1] "Aeron Greyjoy"
##
## [[25]]
## [1] "Kevan Lannister"
##
## [[26]]
## [1] "Melisandre"
##
## [[27]]
## [1] "Merrett Frey"
##
## [[28]]
## [1] "Quentyn Martell"
##
## [[29]]
## [1] "Samwell Tarly"
##
## [[30]]
## [1] "Sansa Stark"
# Method 2: using the `pluck` function
map(dat, pluck("name"))
## [[1]]
## [1] "Theon Greyjoy"
##
## [[2]]
## [1] "Tyrion Lannister"
##
## [[3]]
## [1] "Victarion Greyjoy"
##
## [[4]]
## [1] "Will"
##
## [[5]]
## [1] "Areo Hotah"
##
## [[6]]
## [1] "Chett"
##
## [[7]]
## [1] "Cressen"
##
## [[8]]
## [1] "Arianne Martell"
##
## [[9]]
## [1] "Daenerys Targaryen"
##
## [[10]]
## [1] "Davos Seaworth"
##
## [[11]]
## [1] "Arya Stark"
##
## [[12]]
## [1] "Arys Oakheart"
##
## [[13]]
## [1] "Asha Greyjoy"
##
## [[14]]
## [1] "Barristan Selmy"
##
## [[15]]
## [1] "Varamyr"
##
## [[16]]
## [1] "Brandon Stark"
##
## [[17]]
## [1] "Brienne of Tarth"
##
## [[18]]
## [1] "Catelyn Stark"
##
## [[19]]
## [1] "Cersei Lannister"
##
## [[20]]
## [1] "Eddard Stark"
##
## [[21]]
## [1] "Jaime Lannister"
##
## [[22]]
## [1] "Jon Connington"
##
## [[23]]
## [1] "Jon Snow"
##
## [[24]]
## [1] "Aeron Greyjoy"
##
## [[25]]
## [1] "Kevan Lannister"
##
## [[26]]
## [1] "Melisandre"
##
## [[27]]
## [1] "Merrett Frey"
##
## [[28]]
## [1] "Quentyn Martell"
##
## [[29]]
## [1] "Samwell Tarly"
##
## [[30]]
## [1] "Sansa Stark"
# Method 3: using the index of the list element
map(dat, 3)
## [[1]]
## [1] "Theon Greyjoy"
##
## [[2]]
## [1] "Tyrion Lannister"
##
## [[3]]
## [1] "Victarion Greyjoy"
##
## [[4]]
## [1] "Will"
##
## [[5]]
## [1] "Areo Hotah"
##
## [[6]]
## [1] "Chett"
##
## [[7]]
## [1] "Cressen"
##
## [[8]]
## [1] "Arianne Martell"
##
## [[9]]
## [1] "Daenerys Targaryen"
##
## [[10]]
## [1] "Davos Seaworth"
##
## [[11]]
## [1] "Arya Stark"
##
## [[12]]
## [1] "Arys Oakheart"
##
## [[13]]
## [1] "Asha Greyjoy"
##
## [[14]]
## [1] "Barristan Selmy"
##
## [[15]]
## [1] "Varamyr"
##
## [[16]]
## [1] "Brandon Stark"
##
## [[17]]
## [1] "Brienne of Tarth"
##
## [[18]]
## [1] "Catelyn Stark"
##
## [[19]]
## [1] "Cersei Lannister"
##
## [[20]]
## [1] "Eddard Stark"
##
## [[21]]
## [1] "Jaime Lannister"
##
## [[22]]
## [1] "Jon Connington"
##
## [[23]]
## [1] "Jon Snow"
##
## [[24]]
## [1] "Aeron Greyjoy"
##
## [[25]]
## [1] "Kevan Lannister"
##
## [[26]]
## [1] "Melisandre"
##
## [[27]]
## [1] "Merrett Frey"
##
## [[28]]
## [1] "Quentyn Martell"
##
## [[29]]
## [1] "Samwell Tarly"
##
## [[30]]
## [1] "Sansa Stark"
Creating a data frame from a list
# The `[` is the function here -- essentially telling it to apply [] to each list element,
# and name, gender and culture are the arguments passed to []
map_dfr(dat,`[`, c("name", "gender", "culture"))
## # A tibble: 30 x 3
## name gender culture
## <chr> <chr> <chr>
## 1 Theon Greyjoy Male "Ironborn"
## 2 Tyrion Lannister Male ""
## 3 Victarion Greyjoy Male "Ironborn"
## 4 Will Male ""
## 5 Areo Hotah Male "Norvoshi"
## 6 Chett Male ""
## 7 Cressen Male ""
## 8 Arianne Martell Female "Dornish"
## 9 Daenerys Targaryen Female "Valyrian"
## 10 Davos Seaworth Male "Westeros"
## # ... with 20 more rows
Applying your own function with map
dead_or_alive <- function(x){
ifelse(x[["alive"]], paste(x[["name"]], "is alive!"),
paste(x[["name"]], "is dead :("))
}
map_chr(dat, dead_or_alive)
## [1] "Theon Greyjoy is alive!" "Tyrion Lannister is alive!"
## [3] "Victarion Greyjoy is alive!" "Will is dead :("
## [5] "Areo Hotah is alive!" "Chett is dead :("
## [7] "Cressen is dead :(" "Arianne Martell is alive!"
## [9] "Daenerys Targaryen is alive!" "Davos Seaworth is alive!"
## [11] "Arya Stark is alive!" "Arys Oakheart is dead :("
## [13] "Asha Greyjoy is alive!" "Barristan Selmy is alive!"
## [15] "Varamyr is dead :(" "Brandon Stark is alive!"
## [17] "Brienne of Tarth is alive!" "Catelyn Stark is dead :("
## [19] "Cersei Lannister is alive!" "Eddard Stark is dead :("
## [21] "Jaime Lannister is alive!" "Jon Connington is alive!"
## [23] "Jon Snow is alive!" "Aeron Greyjoy is alive!"
## [25] "Kevan Lannister is dead :(" "Melisandre is alive!"
## [27] "Merrett Frey is dead :(" "Quentyn Martell is dead :("
## [29] "Samwell Tarly is alive!" "Sansa Stark is alive!"
Using pmap row-wise
dat_m <- dat %>% {
tibble(
name = map_chr(., "name"),
gender = map_chr(., "gender"),
culture = map_chr(., "culture"),
aliases = map(., "aliases"),
allegiances = map(., "allegiances")
)}
pmap(dat_m, paste)
## [[1]]
## [1] "Theon Greyjoy Male Ironborn Prince of Fools House Greyjoy of Pyke"
## [2] "Theon Greyjoy Male Ironborn Theon Turncloak House Greyjoy of Pyke"
## [3] "Theon Greyjoy Male Ironborn Reek House Greyjoy of Pyke"
## [4] "Theon Greyjoy Male Ironborn Theon Kinslayer House Greyjoy of Pyke"
##
## [[2]]
## [1] "Tyrion Lannister Male The Imp House Lannister of Casterly Rock"
## [2] "Tyrion Lannister Male Halfman House Lannister of Casterly Rock"
## [3] "Tyrion Lannister Male The boyman House Lannister of Casterly Rock"
## [4] "Tyrion Lannister Male Giant of Lannister House Lannister of Casterly Rock"
## [5] "Tyrion Lannister Male Lord Tywin's Doom House Lannister of Casterly Rock"
## [6] "Tyrion Lannister Male Lord Tywin's Bane House Lannister of Casterly Rock"
## [7] "Tyrion Lannister Male Yollo House Lannister of Casterly Rock"
## [8] "Tyrion Lannister Male Hugor Hill House Lannister of Casterly Rock"
## [9] "Tyrion Lannister Male No-Nose House Lannister of Casterly Rock"
## [10] "Tyrion Lannister Male Freak House Lannister of Casterly Rock"
## [11] "Tyrion Lannister Male Dwarf House Lannister of Casterly Rock"
##
## [[3]]
## [1] "Victarion Greyjoy Male Ironborn The Iron Captain House Greyjoy of Pyke"
##
## [[4]]
## [1] "Will Male "
##
## [[5]]
## [1] "Areo Hotah Male Norvoshi House Nymeros Martell of Sunspear"
##
## [[6]]
## [1] "Chett Male "
##
## [[7]]
## [1] "Cressen Male "
##
## [[8]]
## [1] "Arianne Martell Female Dornish House Nymeros Martell of Sunspear"
##
## [[9]]
## [1] "Daenerys Targaryen Female Valyrian Dany House Targaryen of King's Landing"
## [2] "Daenerys Targaryen Female Valyrian Daenerys Stormborn House Targaryen of King's Landing"
## [3] "Daenerys Targaryen Female Valyrian The Unburnt House Targaryen of King's Landing"
## [4] "Daenerys Targaryen Female Valyrian Mother of Dragons House Targaryen of King's Landing"
## [5] "Daenerys Targaryen Female Valyrian Mother House Targaryen of King's Landing"
## [6] "Daenerys Targaryen Female Valyrian Mhysa House Targaryen of King's Landing"
## [7] "Daenerys Targaryen Female Valyrian The Silver Queen House Targaryen of King's Landing"
## [8] "Daenerys Targaryen Female Valyrian Silver Lady House Targaryen of King's Landing"
## [9] "Daenerys Targaryen Female Valyrian Dragonmother House Targaryen of King's Landing"
## [10] "Daenerys Targaryen Female Valyrian The Dragon Queen House Targaryen of King's Landing"
## [11] "Daenerys Targaryen Female Valyrian The Mad King's daughter House Targaryen of King's Landing"
##
## [[10]]
## [1] "Davos Seaworth Male Westeros Onion Knight House Baratheon of Dragonstone"
## [2] "Davos Seaworth Male Westeros Davos Shorthand House Seaworth of Cape Wrath"
## [3] "Davos Seaworth Male Westeros Ser Onions House Baratheon of Dragonstone"
## [4] "Davos Seaworth Male Westeros Onion Lord House Seaworth of Cape Wrath"
## [5] "Davos Seaworth Male Westeros Smuggler House Baratheon of Dragonstone"
##
## [[11]]
## [1] "Arya Stark Female Northmen Arya Horseface House Stark of Winterfell"
## [2] "Arya Stark Female Northmen Arya Underfoot House Stark of Winterfell"
## [3] "Arya Stark Female Northmen Arry House Stark of Winterfell"
## [4] "Arya Stark Female Northmen Lumpyface House Stark of Winterfell"
## [5] "Arya Stark Female Northmen Lumpyhead House Stark of Winterfell"
## [6] "Arya Stark Female Northmen Stickboy House Stark of Winterfell"
## [7] "Arya Stark Female Northmen Weasel House Stark of Winterfell"
## [8] "Arya Stark Female Northmen Nymeria House Stark of Winterfell"
## [9] "Arya Stark Female Northmen Squan House Stark of Winterfell"
## [10] "Arya Stark Female Northmen Saltb House Stark of Winterfell"
## [11] "Arya Stark Female Northmen Cat of the Canaly House Stark of Winterfell"
## [12] "Arya Stark Female Northmen Bets House Stark of Winterfell"
## [13] "Arya Stark Female Northmen The Blind Girh House Stark of Winterfell"
## [14] "Arya Stark Female Northmen The Ugly Little Girl House Stark of Winterfell"
## [15] "Arya Stark Female Northmen Mercedenl House Stark of Winterfell"
## [16] "Arya Stark Female Northmen Mercye House Stark of Winterfell"
##
## [[12]]
## [1] "Arys Oakheart Male Reach House Oakheart of Old Oak"
##
## [[13]]
## [1] "Asha Greyjoy Female Ironborn Esgred House Greyjoy of Pyke"
## [2] "Asha Greyjoy Female Ironborn The Kraken's Daughter House Ironmaker"
##
## [[14]]
## [1] "Barristan Selmy Male Westeros Barristan the Bold House Selmy of Harvest Hall"
## [2] "Barristan Selmy Male Westeros Arstan Whitebeard House Targaryen of King's Landing"
## [3] "Barristan Selmy Male Westeros Ser Grandfather House Selmy of Harvest Hall"
## [4] "Barristan Selmy Male Westeros Barristan the Old House Targaryen of King's Landing"
## [5] "Barristan Selmy Male Westeros Old Ser House Selmy of Harvest Hall"
##
## [[15]]
## [1] "Varamyr Male Free Folk Varamyr Sixskins "
## [2] "Varamyr Male Free Folk Haggon "
## [3] "Varamyr Male Free Folk Lump "
##
## [[16]]
## [1] "Brandon Stark Male Northmen Bran House Stark of Winterfell"
## [2] "Brandon Stark Male Northmen Bran the Broken House Stark of Winterfell"
## [3] "Brandon Stark Male Northmen The Winged Wolf House Stark of Winterfell"
##
## [[17]]
## [1] "Brienne of Tarth Female The Maid of Tarth House Baratheon of Storm's End"
## [2] "Brienne of Tarth Female Brienne the Beauty House Stark of Winterfell"
## [3] "Brienne of Tarth Female Brienne the Blue House Tarth of Evenfall Hall"
##
## [[18]]
## [1] "Catelyn Stark Female Rivermen Catelyn Tully House Stark of Winterfell"
## [2] "Catelyn Stark Female Rivermen Lady Stoneheart House Tully of Riverrun"
## [3] "Catelyn Stark Female Rivermen The Silent Sistet House Stark of Winterfell"
## [4] "Catelyn Stark Female Rivermen Mother Mercilesr House Tully of Riverrun"
## [5] "Catelyn Stark Female Rivermen The Hangwomans House Stark of Winterfell"
##
## [[19]]
## [1] "Cersei Lannister Female Westerman House Lannister of Casterly Rock"
##
## [[20]]
## [1] "Eddard Stark Male Northmen Ned House Stark of Winterfell"
## [2] "Eddard Stark Male Northmen The Ned House Stark of Winterfell"
## [3] "Eddard Stark Male Northmen The Quiet Wolf House Stark of Winterfell"
##
## [[21]]
## [1] "Jaime Lannister Male Westerlands The Kingslayer House Lannister of Casterly Rock"
## [2] "Jaime Lannister Male Westerlands The Lion of Lannister House Lannister of Casterly Rock"
## [3] "Jaime Lannister Male Westerlands The Young Lion House Lannister of Casterly Rock"
## [4] "Jaime Lannister Male Westerlands Cripple House Lannister of Casterly Rock"
##
## [[22]]
## [1] "Jon Connington Male Stormlands Griffthe Mad King's Hand House Connington of Griffin's Roost"
## [2] "Jon Connington Male Stormlands Griffthe Mad King's Hand House Targaryen of King's Landing"
##
## [[23]]
## [1] "Jon Snow Male Northmen Lord Snow House Stark of Winterfell"
## [2] "Jon Snow Male Northmen Ned Stark's Bastard House Stark of Winterfell"
## [3] "Jon Snow Male Northmen The Snow of Winterfell House Stark of Winterfell"
## [4] "Jon Snow Male Northmen The Crow-Come-Over House Stark of Winterfell"
## [5] "Jon Snow Male Northmen The 998th Lord Commander of the Night's Watch House Stark of Winterfell"
## [6] "Jon Snow Male Northmen The Bastard of Winterfell House Stark of Winterfell"
## [7] "Jon Snow Male Northmen The Black Bastard of the Wall House Stark of Winterfell"
## [8] "Jon Snow Male Northmen Lord Crow House Stark of Winterfell"
##
## [[24]]
## [1] "Aeron Greyjoy Male Ironborn The Damphair House Greyjoy of Pyke"
## [2] "Aeron Greyjoy Male Ironborn Aeron Damphair House Greyjoy of Pyke"
##
## [[25]]
## [1] "Kevan Lannister Male House Lannister of Casterly Rock"
##
## [[26]]
## [1] "Melisandre Female Asshai The Red Priestess "
## [2] "Melisandre Female Asshai The Red Woman "
## [3] "Melisandre Female Asshai The King's Red Shadow "
## [4] "Melisandre Female Asshai Lady Red "
## [5] "Melisandre Female Asshai Lot Seven "
##
## [[27]]
## [1] "Merrett Frey Male Rivermen Merrett Muttonhead House Frey of the Crossing"
##
## [[28]]
## [1] "Quentyn Martell Male Dornish Frog House Nymeros Martell of Sunspear"
## [2] "Quentyn Martell Male Dornish Prince Frog House Nymeros Martell of Sunspear"
## [3] "Quentyn Martell Male Dornish The prince who came too late House Nymeros Martell of Sunspear"
## [4] "Quentyn Martell Male Dornish The Dragonrider House Nymeros Martell of Sunspear"
##
## [[29]]
## [1] "Samwell Tarly Male Andal Sam House Tarly of Horn Hill"
## [2] "Samwell Tarly Male Andal Ser Piggy House Tarly of Horn Hill"
## [3] "Samwell Tarly Male Andal Prince Pork-chop House Tarly of Horn Hill"
## [4] "Samwell Tarly Male Andal Lady Piggy House Tarly of Horn Hill"
## [5] "Samwell Tarly Male Andal Sam the Slayer House Tarly of Horn Hill"
## [6] "Samwell Tarly Male Andal Black Sam House Tarly of Horn Hill"
## [7] "Samwell Tarly Male Andal Lord of Ham House Tarly of Horn Hill"
##
## [[30]]
## [1] "Sansa Stark Female Northmen Little bird House Baelish of Harrenhal"
## [2] "Sansa Stark Female Northmen Alayne Stone House Stark of Winterfell"
## [3] "Sansa Stark Female Northmen Jonquil House Baelish of Harrenhal"
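The section title also mentions imap(), which iterates over a vector together with its indices (or names). A minimal sketch, reusing the character names extracted earlier:
got_names <- map_chr(dat, "name")
# .x is each element, .y is its index (or its name, if the vector is named)
head(imap_chr(got_names, ~ paste0(.y, ": ", .x)), 3)
## [1] "1: Theon Greyjoy"    "2: Tyrion Lannister" "3: Victarion Greyjoy"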
Example of a user-defined function
library(reshape2) # dcast() comes from the reshape2 package
TablaDinamica <- function(base, ColumnaPrincipal, ColumnasPivotables){
tabla <<- dcast(base, formula=as.formula(paste(ColumnaPrincipal, paste(ColumnasPivotables, collapse = "+"), sep = '~'))) # with <<- the table is also kept in R's Global Environment
return(tabla)
}
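A hypothetical usage sketch (the toy data frame below is ours; dcast() guesses the value column, here sales, and reports which one it used):
df <- data.frame(region = c("N", "N", "S", "S"),
                 year   = c(2020, 2021, 2020, 2021),
                 sales  = c(10, 20, 30, 40))
TablaDinamica(df, "region", "year")
##   region 2020 2021
## 1      N   10   20
## 2      S   30   40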
More information on the purrr package: https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_purrr.pdf
Apache Spark is an open-source, general-purpose computing engine used to process and analyze large amounts of data in a distributed way. It works with the operating system to distribute data across cores and process the data in parallel.
Spark uses a master/slave architecture, that is, a central coordinator and many distributed workers. Its programming language is Scala.
Terms:
Spark offers multiple advantages. It can run an application on a Hadoop cluster much faster, both in memory and on disk. It also reduces the number of read and write operations to disk. It supports several programming languages, with built-in APIs in Java, Python and Scala so the programmer can write applications in different languages. In addition, it provides support for streaming data, graphs, and machine-learning algorithms for advanced data analysis.
Terms + Hive metadata: a Hive metastore (also known as metastore_db) is a relational database for managing the metadata of persistent relational entities, e.g. databases, tables, columns and partitions. It is what Spark uses to manage the metadata of persistent relational entities (for example databases, tables, columns, partitions) in a relational database (for fast access, Spark SQL).
# libraries
library(tidyverse) # important to load this beforehand
#library(dplyr)
library(sparklyr)
library(DBI)
# install Spark locally (on our computer)
#spark_install("2.0.1") # run this line only once per project
# connect to the local version
sc <- spark_connect(master = "local", version = "2.0.1")
# copy a data set into Spark's distributed memory
import_iris <- copy_to(sc, iris, "spark_iris",
overwrite = TRUE)
# partition the data set
partition_iris <- sdf_random_split(
import_iris,training=0.5, testing=0.5)
# create a Hive metadata entry for each partition
sdf_register(partition_iris,
c("spark_iris_training","spark_iris_test"))
## $spark_iris_training
## # Source: spark<spark_iris_training> [?? x 5]
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 4.4 2.9 1.4 0.2 setosa
## 2 4.4 3 1.3 0.2 setosa
## 3 4.6 3.1 1.5 0.2 setosa
## 4 4.6 3.4 1.4 0.3 setosa
## 5 4.7 3.2 1.3 0.2 setosa
## 6 4.9 3.1 1.5 0.2 setosa
## 7 4.9 3.6 1.4 0.1 setosa
## 8 5 2.3 3.3 1 versicolor
## 9 5 3.2 1.2 0.2 setosa
## 10 5 3.3 1.4 0.2 setosa
## # ... with more rows
##
## $spark_iris_test
## # Source: spark<spark_iris_test> [?? x 5]
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 4.3 3 1.1 0.1 setosa
## 2 4.4 3.2 1.3 0.2 setosa
## 3 4.5 2.3 1.3 0.3 setosa
## 4 4.6 3.2 1.4 0.2 setosa
## 5 4.6 3.6 1 0.2 setosa
## 6 4.7 3.2 1.6 0.2 setosa
## 7 4.8 3 1.4 0.1 setosa
## 8 4.8 3 1.4 0.3 setosa
## 9 4.8 3.1 1.6 0.2 setosa
## 10 4.8 3.4 1.6 0.2 setosa
## # ... with more rows
# create a Spark data frame from the source sc
tidy_iris <- tbl(sc,"spark_iris_training") %>%
select(Species, Petal_Length, Petal_Width)
# Spark ML Decision Tree Model
model_iris <- tidy_iris %>%
ml_decision_tree(response="Species",
features=c("Petal_Length","Petal_Width"))
# create a reference to the table in Spark
test_iris <- tbl(sc,"spark_iris_test")
# bring the data into R's memory with collect
pred_iris <- ml_predict(model_iris, test_iris) %>% collect
# we continue in RStudio
library(ggplot2)
pred_iris %>%
inner_join(data.frame(prediction=0:2,
lab=model_iris$index_labels)) %>%
ggplot(aes(Petal_Length, Petal_Width, col=lab)) +
geom_point()
# disconnect Spark
spark_disconnect(sc)
More information on the sparklyr package: [cheat sheet] https://science.nu/wp-content/uploads/2018/07/r-sparklyr.pdf, [full documentation] https://cran.r-project.org/web/packages/sparklyr/readme/README.html
4. Using sparklyr
Most Spark commands are run from the R console; however, tracking and analyzing execution is done through Spark's web interface, shown below.
You can see in the Spark web interface that a job is started to collect the information from Spark. You can also select the Storage tab to see the data set cached in Spark's memory:
In the image above, note that the data are fully loaded into memory, as indicated by the Fraction Cached column, which shows 100%; you can therefore see exactly how much memory this data set is using via the Size in Memory column.
The Executors tab provides a view of your cluster's resources. For local connections you will find only one active executor, with only 2 GB of memory allocated to Spark and 384 MB available for computation. We will also learn how to request more instances and computing resources, and how memory is allocated.
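As a hedged sketch of requesting more resources before connecting (these sparklyr.shell.* settings are standard sparklyr configuration options; the 4G values are illustrative):
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "4G"   # memory for the driver
config$`sparklyr.shell.executor-memory` <- "4G" # memory per executor
sc <- spark_connect(master = "local", version = "2.0.1", config = config)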
When you use Spark from R to analyze data, you can use SQL (Structured Query Language) or dplyr (a grammar of data manipulation). You can use SQL through the DBI package, for example:
library(dplyr)
library(sparklyr)
library(DBI)
sc <- spark_connect(master = "local", version = "2.0.1")
spark_web(sc) # report of the operations performed in Spark
cars <- copy_to(sc, mtcars, overwrite = T) # when we use a data set from the environment
dbGetQuery(sc, "SELECT count(*) FROM mtcars") # the Spark connection sc always comes first as an argument to the Spark functions
## count(1)
## 1 32
summarize_all(cars, mean) %>%
show_query() # to translate the dplyr code into SQL
## <SQL>
## SELECT AVG(`mpg`) AS `mpg`, AVG(`cyl`) AS `cyl`, AVG(`disp`) AS `disp`, AVG(`hp`) AS `hp`, AVG(`drat`) AS `drat`, AVG(`wt`) AS `wt`, AVG(`qsec`) AS `qsec`, AVG(`vs`) AS `vs`, AVG(`am`) AS `am`, AVG(`gear`) AS `gear`, AVG(`carb`) AS `carb`
## FROM `mtcars`
summarise(cars, mpg_percentile = percentile(mpg, 0.25)) %>%
show_query() # SQL functions can also be called
## <SQL>
## SELECT percentile(`mpg`, 0.25) AS `mpg_percentile`
## FROM `mtcars`
In general, we usually start by analyzing data in Spark with dplyr, followed by sampling rows and selecting a subset of the available columns. The last step is to collect data from Spark to do further data processing in R, such as data visualization.
library(tidyverse)
summarize_all(cars, mean) %>% collect
## # A tibble: 1 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20.1 6.19 231. 147. 3.60 3.22 17.8 0.438 0.406 3.69 2.81
class(cars)
## [1] "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"
cars %>%
mutate(transmission = ifelse(am == 0, "automatic", "manual")) %>%
group_by(transmission) %>%
summarise_all(mean) %>% collect # earlier in dplyr we used summarise; here in sparklyr we use summarise_all
## # A tibble: 2 x 12
## transmission mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 manual 24.4 5.08 144. 127. 4.05 2.41 17.4 0.538 1 4.38 2.92
## 2 automatic 17.1 6.95 290. 160. 3.29 3.77 18.2 0.368 0 3.21 2.74
select(cars, hp, mpg) %>%
sample_n(100) %>%
collect() %>%
plot()
library(ggplot2)
ggplot(aes(as.factor(cyl), mpg), data = mtcars) + geom_col()
model <- ml_linear_regression(cars, mpg ~ hp)
model
## Formula: mpg ~ hp
##
## Coefficients:
## (Intercept) hp
## 30.09886054 -0.06822828
model %>%
ml_predict(copy_to(sc, data.frame(hp = 250 + 10 * 1:10))) %>%
transmute(hp = hp, mpg = prediction) %>%
full_join(select(cars, hp, mpg)) %>%
collect() %>%
plot() # plotting from Spark
modelo <- ml_logistic_regression(cars, am ~ .)
modelo
## Formula: am ~ .
##
## Coefficients:
## (Intercept) mpg cyl disp hp drat
## -0.68057477 1.73068529 -6.50306685 -0.11106774 0.01566047 33.02750111
## wt qsec vs gear carb
## -20.68143251 -9.52647833 -6.81113196 29.16524289 3.33862282
# bring the data to R with Spark and plot
car_group <- cars %>%
group_by(cyl) %>%
summarise(mpg = sum(mpg, na.rm = TRUE)) %>%
collect() %>%
print()
## # A tibble: 3 x 2
## cyl mpg
## <dbl> <dbl>
## 1 6 138.
## 2 4 293.
## 3 8 211.
ggplot(aes(as.factor(cyl), mpg), data = car_group) +
geom_col(fill = "#999999") + coord_flip()
Using dbplot
The dbplot package provides helper functions for plotting with remote data. dbplot uses R code to transform the data, and that code is run inside Spark. It then uses the results to create a plot with the ggplot2 package, where both the data transformation and the plot are triggered by a single function.
library(dbplot)
cars %>%
dbplot_histogram(mpg, binwidth = 3) + # plot for analyzing a single continuous variable
labs(title = "MPG Distribution",
subtitle = "Histogram over miles per gallon")
ggplot(aes(mpg, wt), data = mtcars) + # plot for analyzing two continuous variables
geom_point()
dbplot_raster(cars, mpg, wt, resolution = 16) # a raster plot returns a grid of x/y positions and the results of a given aggregation, usually represented by the color of each square.
db_compute_raster(cars, mpg, wt)
## # A tibble: 32 x 3
## mpg wt `n()`
## <dbl> <dbl> <dbl>
## 1 21.0 2.61 1
## 2 21.0 2.84 1
## 3 22.6 2.30 1
## 4 21.2 3.19 1
## 5 18.6 3.43 1
## 6 17.9 3.43 1
## 7 14.2 3.55 1
## 8 24.3 3.16 1
## 9 22.6 3.12 1
## 10 19.1 3.43 1
## # ... with 22 more rows
You can also use dbplot to retrieve the raw data and visualize it by other means; to retrieve the aggregates, use db_compute_bins(), db_compute_count(), db_compute_raster() and db_compute_boxplot(). A minimal sketch follows.
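For example, with db_compute_bins() (assuming, as in the dbplot documentation, that it returns the binned variable plus a count column):
cars %>%
  db_compute_bins(mpg, bins = 10) %>% # the aggregation runs in Spark
  ggplot() +
  geom_col(aes(mpg, count))           # the plot is drawn in R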
Data
Data are read from existing data sources in a variety of formats, such as plain text, CSV, JSON, Java Database Connectivity (JDBC) and many more. For example, we can export our example data set as a CSV file and read it back:
#spark_write_csv(cars, "cars.csv")
#list.files(pattern = 'cars') # to look for a file in the project's working directory
#cars2 <- spark_read_csv(sc, "cars.csv")
#class(cars2)
# delete a table that is in Spark's memory
db_drop_table(sc, "mtcars")
## [1] 0
Logs
The log is a tool that records information relevant to the execution of tasks on the cluster. For local clusters, we can retrieve all recent logs by running the following:
spark_log(sc)
## 20/08/27 09:57:26 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 93 (MapPartitionsRDD[261] at collect at utils.scala:114)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Adding task set 93.0 with 1 tasks
## 20/08/27 09:57:26 INFO TaskSetManager: Starting task 0.0 in stage 93.0 (TID 111, localhost, partition 0, ANY, 5379 bytes)
## 20/08/27 09:57:26 INFO Executor: Running task 0.0 in stage 93.0 (TID 111)
## 20/08/27 09:57:26 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
## 20/08/27 09:57:26 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
## 20/08/27 09:57:26 INFO Executor: Finished task 0.0 in stage 93.0 (TID 111). 3031 bytes result sent to driver
## 20/08/27 09:57:26 INFO TaskSetManager: Finished task 0.0 in stage 93.0 (TID 111) in 46 ms on localhost (1/1)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Removed TaskSet 93.0, whose tasks have all completed, from pool
## 20/08/27 09:57:26 INFO DAGScheduler: ResultStage 93 (collect at utils.scala:114) finished in 0.046 s
## 20/08/27 09:57:26 INFO DAGScheduler: Job 64 finished: collect at utils.scala:114, took 0.077047 s
## 20/08/27 09:57:26 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
## 20/08/27 09:57:26 INFO CodeGenerator: Code generated in 48.801 ms
## 20/08/27 09:57:26 INFO SparkContext: Starting job: count at utils.scala:114
## 20/08/27 09:57:26 INFO DAGScheduler: Registering RDD 263 (count at utils.scala:114)
## 20/08/27 09:57:26 INFO DAGScheduler: Got job 65 (count at utils.scala:114) with 1 output partitions
## 20/08/27 09:57:26 INFO DAGScheduler: Final stage: ResultStage 95 (count at utils.scala:114)
## 20/08/27 09:57:26 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 94)
## 20/08/27 09:57:26 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 94)
## 20/08/27 09:57:26 INFO DAGScheduler: Submitting ShuffleMapStage 94 (MapPartitionsRDD[263] at count at utils.scala:114), which has no missing parents
## 20/08/27 09:57:26 INFO MemoryStore: Block broadcast_94 stored as values in memory (estimated size 18.8 KB, free 912.1 MB)
## 20/08/27 09:57:26 INFO MemoryStore: Block broadcast_94_piece0 stored as bytes in memory (estimated size 7.8 KB, free 912.1 MB)
## 20/08/27 09:57:26 INFO BlockManagerInfo: Added broadcast_94_piece0 in memory on 127.0.0.1:54827 (size: 7.8 KB, free: 912.2 MB)
## 20/08/27 09:57:26 INFO SparkContext: Created broadcast 94 from broadcast at DAGScheduler.scala:1012
## 20/08/27 09:57:26 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 94 (MapPartitionsRDD[263] at count at utils.scala:114)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Adding task set 94.0 with 1 tasks
## 20/08/27 09:57:26 INFO TaskSetManager: Starting task 0.0 in stage 94.0 (TID 112, localhost, partition 0, PROCESS_LOCAL, 6769 bytes)
## 20/08/27 09:57:26 INFO Executor: Running task 0.0 in stage 94.0 (TID 112)
## 20/08/27 09:57:26 INFO BlockManager: Found block rdd_11_0 locally
## 20/08/27 09:57:26 INFO Executor: Finished task 0.0 in stage 94.0 (TID 112). 1978 bytes result sent to driver
## 20/08/27 09:57:26 INFO TaskSetManager: Finished task 0.0 in stage 94.0 (TID 112) in 15 ms on localhost (1/1)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Removed TaskSet 94.0, whose tasks have all completed, from pool
## 20/08/27 09:57:26 INFO DAGScheduler: ShuffleMapStage 94 (count at utils.scala:114) finished in 0.015 s
## 20/08/27 09:57:26 INFO DAGScheduler: looking for newly runnable stages
## 20/08/27 09:57:26 INFO DAGScheduler: running: Set()
## 20/08/27 09:57:26 INFO DAGScheduler: waiting: Set(ResultStage 95)
## 20/08/27 09:57:26 INFO DAGScheduler: failed: Set()
## 20/08/27 09:57:26 INFO DAGScheduler: Submitting ResultStage 95 (MapPartitionsRDD[267] at count at utils.scala:114), which has no missing parents
## 20/08/27 09:57:26 INFO MemoryStore: Block broadcast_95 stored as values in memory (estimated size 67.3 KB, free 912.0 MB)
## 20/08/27 09:57:26 INFO MemoryStore: Block broadcast_95_piece0 stored as bytes in memory (estimated size 22.9 KB, free 912.0 MB)
## 20/08/27 09:57:26 INFO BlockManagerInfo: Added broadcast_95_piece0 in memory on 127.0.0.1:54827 (size: 22.9 KB, free: 912.2 MB)
## 20/08/27 09:57:26 INFO SparkContext: Created broadcast 95 from broadcast at DAGScheduler.scala:1012
## 20/08/27 09:57:26 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 95 (MapPartitionsRDD[267] at count at utils.scala:114)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Adding task set 95.0 with 1 tasks
## 20/08/27 09:57:26 INFO TaskSetManager: Starting task 0.0 in stage 95.0 (TID 113, localhost, partition 0, ANY, 5379 bytes)
## 20/08/27 09:57:26 INFO Executor: Running task 0.0 in stage 95.0 (TID 113)
## 20/08/27 09:57:26 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
## 20/08/27 09:57:26 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
## 20/08/27 09:57:26 INFO Executor: Finished task 0.0 in stage 95.0 (TID 113). 3232 bytes result sent to driver
## 20/08/27 09:57:26 INFO DAGScheduler: ResultStage 95 (count at utils.scala:114) finished in 0.049 s
## 20/08/27 09:57:26 INFO DAGScheduler: Job 65 finished: count at utils.scala:114, took 0.081230 s
## 20/08/27 09:57:26 INFO TaskSetManager: Finished task 0.0 in stage 95.0 (TID 113) in 49 ms on localhost (1/1)
## 20/08/27 09:57:26 INFO TaskSchedulerImpl: Removed TaskSet 95.0, whose tasks have all completed, from pool
## 20/08/27 09:57:27 INFO SparkSqlParser: Parsing command: DROP TABLE `mtcars`
## 20/08/27 09:57:27 INFO HiveMetaStore: 0: get_database: default
## 20/08/27 09:57:27 INFO audit: ugi=Hugo ip=unknown-ip-addr cmd=get_database: default
## 20/08/27 09:57:27 INFO HiveMetaStore: 0: get_table : db=default tbl=mtcars
## 20/08/27 09:57:27 INFO audit: ugi=Hugo ip=unknown-ip-addr cmd=get_table : db=default tbl=mtcars
## 20/08/27 09:57:27 INFO SparkSqlParser: Parsing command: `mtcars`
## 20/08/27 09:57:27 INFO MapPartitionsRDD: Removing RDD 11 from persistence list
## 20/08/27 09:57:27 INFO BlockManager: Removing RDD 11
## 20/08/27 09:57:27 INFO SparkContext: Starting job: count at null:-2
## 20/08/27 09:57:27 INFO DAGScheduler: Registering RDD 271 (count at null:-2)
## 20/08/27 09:57:27 INFO DAGScheduler: Got job 66 (count at null:-2) with 1 output partitions
## 20/08/27 09:57:27 INFO DAGScheduler: Final stage: ResultStage 97 (count at null:-2)
## 20/08/27 09:57:27 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 96)
## 20/08/27 09:57:27 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 96)
## 20/08/27 09:57:27 INFO DAGScheduler: Submitting ShuffleMapStage 96 (MapPartitionsRDD[271] at count at null:-2), which has no missing parents
## 20/08/27 09:57:27 INFO MemoryStore: Block broadcast_96 stored as values in memory (estimated size 8.1 KB, free 912.0 MB)
## 20/08/27 09:57:27 INFO MemoryStore: Block broadcast_96_piece0 stored as bytes in memory (estimated size 4.3 KB, free 912.0 MB)
## 20/08/27 09:57:27 INFO BlockManagerInfo: Added broadcast_96_piece0 in memory on 127.0.0.1:54827 (size: 4.3 KB, free: 912.2 MB)
## 20/08/27 09:57:27 INFO SparkContext: Created broadcast 96 from broadcast at DAGScheduler.scala:1012
## 20/08/27 09:57:27 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 96 (MapPartitionsRDD[271] at count at null:-2)
## 20/08/27 09:57:27 INFO TaskSchedulerImpl: Adding task set 96.0 with 1 tasks
## 20/08/27 09:57:27 INFO TaskSetManager: Starting task 0.0 in stage 96.0 (TID 114, localhost, partition 0, PROCESS_LOCAL, 5494 bytes)
## 20/08/27 09:57:27 INFO Executor: Running task 0.0 in stage 96.0 (TID 114)
## 20/08/27 09:57:27 INFO Executor: Finished task 0.0 in stage 96.0 (TID 114). 1636 bytes result sent to driver
## 20/08/27 09:57:27 INFO DAGScheduler: ShuffleMapStage 96 (count at null:-2) finished in 0.029 s
## 20/08/27 09:57:27 INFO DAGScheduler: looking for newly runnable stages
## 20/08/27 09:57:27 INFO DAGScheduler: running: Set()
## 20/08/27 09:57:27 INFO DAGScheduler: waiting: Set(ResultStage 97)
## 20/08/27 09:57:27 INFO DAGScheduler: failed: Set()
## 20/08/27 09:57:27 INFO DAGScheduler: Submitting ResultStage 97 (MapPartitionsRDD[274] at count at null:-2), which has no missing parents
## 20/08/27 09:57:27 INFO MemoryStore: Block broadcast_97 stored as values in memory (estimated size 7.0 KB, free 912.0 MB)
## 20/08/27 09:57:27 INFO TaskSetManager: Finished task 0.0 in stage 96.0 (TID 114) in 29 ms on localhost (1/1)
## 20/08/27 09:57:27 INFO TaskSchedulerImpl: Removed TaskSet 96.0, whose tasks have all completed, from pool
## 20/08/27 09:57:27 INFO MemoryStore: Block broadcast_97_piece0 stored as bytes in memory (estimated size 3.7 KB, free 912.0 MB)
## 20/08/27 09:57:27 INFO BlockManagerInfo: Added broadcast_97_piece0 in memory on 127.0.0.1:54827 (size: 3.7 KB, free: 912.2 MB)
## 20/08/27 09:57:27 INFO SparkContext: Created broadcast 97 from broadcast at DAGScheduler.scala:1012
## 20/08/27 09:57:27 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 97 (MapPartitionsRDD[274] at count at null:-2)
## 20/08/27 09:57:27 INFO TaskSchedulerImpl: Adding task set 97.0 with 1 tasks
## 20/08/27 09:57:27 INFO TaskSetManager: Starting task 0.0 in stage 97.0 (TID 115, localhost, partition 0, ANY, 5379 bytes)
## 20/08/27 09:57:27 INFO Executor: Running task 0.0 in stage 97.0 (TID 115)
## 20/08/27 09:57:27 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
## 20/08/27 09:57:27 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
## 20/08/27 09:57:27 INFO Executor: Finished task 0.0 in stage 97.0 (TID 115). 1866 bytes result sent to driver
## 20/08/27 09:57:27 INFO DAGScheduler: ResultStage 97 (count at null:-2) finished in 0.007 s
## 20/08/27 09:57:27 INFO DAGScheduler: Job 66 finished: count at null:-2, took 0.052031 s
## 20/08/27 09:57:27 INFO TaskSetManager: Finished task 0.0 in stage 97.0 (TID 115) in 7 ms on localhost (1/1)
## 20/08/27 09:57:27 INFO TaskSchedulerImpl: Removed TaskSet 97.0, whose tasks have all completed, from pool
spark_log(sc, filter = "sparklyr") # filtering by a topic
## 20/08/27 09:51:25 INFO SparkContext: Added JAR file:/C:/Users/Karen/Documents/R/win-library/3.6/sparklyr/java/sparklyr-2.0-2.11.jar at spark://127.0.0.1:54216/jars/sparklyr-2.0-2.11.jar with timestamp 1598539885927
## 20/08/27 09:51:38 INFO Executor: Fetching spark://127.0.0.1:54216/jars/sparklyr-2.0-2.11.jar with timestamp 1598539885927
## 20/08/27 09:51:38 INFO Utils: Fetching spark://127.0.0.1:54216/jars/sparklyr-2.0-2.11.jar to C:\Users\Karen\AppData\Local\spark\spark-2.0.1-bin-hadoop2.7\tmp\local\spark-9a62abb7-e822-4618-9b95-5547479c3599\userFiles-adbd7a96-8c2e-491c-b6c7-2bd0a43545cc\fetchFileTemp1132867395228677426.tmp
## 20/08/27 09:51:38 INFO Executor: Adding file:/C:/Users/Karen/AppData/Local/spark/spark-2.0.1-bin-hadoop2.7/tmp/local/spark-9a62abb7-e822-4618-9b95-5547479c3599/userFiles-adbd7a96-8c2e-491c-b6c7-2bd0a43545cc/sparklyr-2.0-2.11.jar to class loader
## 20/08/27 09:51:40 INFO SparkSqlParser: Parsing command: sparklyr_tmp_59fd0ca6_9364_489a_a014_5b06929d7d65
## FROM `sparklyr_tmp_59fd0ca6_9364_489a_a014_5b06929d7d65` AS `zzz2`
## 20/08/27 09:51:40 INFO SparkSqlParser: Parsing command: sparklyr_tmp_749a5caa_0040_482d_9e4d_f2ab11217bb3
## FROM `sparklyr_tmp_749a5caa_0040_482d_9e4d_f2ab11217bb3` AS `zzz3`
## FROM `sparklyr_tmp_59fd0ca6_9364_489a_a014_5b06929d7d65`
## FROM `sparklyr_tmp_749a5caa_0040_482d_9e4d_f2ab11217bb3`
## 20/08/27 09:51:45 INFO SparkSqlParser: Parsing command: sparklyr_tmp_a81e93d3_9e90_494c_b7fd_8caa4cc78c69
## FROM `sparklyr_tmp_a81e93d3_9e90_494c_b7fd_8caa4cc78c69` AS `zzz7`
## FROM `sparklyr_tmp_a81e93d3_9e90_494c_b7fd_8caa4cc78c69`
## 20/08/27 09:51:46 INFO SparkSqlParser: Parsing command: sparklyr_tmp_c473fa71_6350_43a2_b8cb_761a43f24bef
## FROM `sparklyr_tmp_c473fa71_6350_43a2_b8cb_761a43f24bef` AS `zzz8`
## FROM `sparklyr_tmp_c473fa71_6350_43a2_b8cb_761a43f24bef`
## 20/08/27 09:51:47 INFO SparkSqlParser: Parsing command: sparklyr_tmp_e722ed52_f51a_46a3_b58d_af476c3133b2
## FROM `sparklyr_tmp_e722ed52_f51a_46a3_b58d_af476c3133b2` AS `zzz10`
## FROM `sparklyr_tmp_e722ed52_f51a_46a3_b58d_af476c3133b2`
## 20/08/27 09:51:47 INFO SparkSqlParser: Parsing command: sparklyr_tmp_8d99257e_19cc_4429_a9ca_84a1a5b4f867
## FROM `sparklyr_tmp_8d99257e_19cc_4429_a9ca_84a1a5b4f867` AS `zzz11`
## FROM `sparklyr_tmp_8d99257e_19cc_4429_a9ca_84a1a5b4f867`
## FROM `sparklyr_tmp_8d99257e_19cc_4429_a9ca_84a1a5b4f867`
## 20/08/27 09:51:57 INFO SparkContext: Added JAR file:/C:/Users/Karen/Documents/R/win-library/3.6/sparklyr/java/sparklyr-2.0-2.11.jar at spark://127.0.0.1:54293/jars/sparklyr-2.0-2.11.jar with timestamp 1598539917005
## 20/08/27 09:52:10 INFO Executor: Fetching spark://127.0.0.1:54293/jars/sparklyr-2.0-2.11.jar with timestamp 1598539917005
## 20/08/27 09:52:10 INFO Utils: Fetching spark://127.0.0.1:54293/jars/sparklyr-2.0-2.11.jar to C:\Users\Karen\AppData\Local\spark\spark-2.0.1-bin-hadoop2.7\tmp\local\spark-1bfbb169-6176-449c-be4a-fe3c08fec275\userFiles-c86573f6-64d3-4951-8f29-789b7bff3e82\fetchFileTemp8452961142174595398.tmp
## 20/08/27 09:52:11 INFO Executor: Adding file:/C:/Users/Karen/AppData/Local/spark/spark-2.0.1-bin-hadoop2.7/tmp/local/spark-1bfbb169-6176-449c-be4a-fe3c08fec275/userFiles-c86573f6-64d3-4951-8f29-789b7bff3e82/sparklyr-2.0-2.11.jar to class loader
## 20/08/27 09:53:42 INFO SparkContext: Added JAR file:/C:/Users/Karen/Documents/R/win-library/3.6/sparklyr/java/sparklyr-2.0-2.11.jar at spark://127.0.0.1:54410/jars/sparklyr-2.0-2.11.jar with timestamp 1598540022654
## 20/08/27 09:53:53 INFO Executor: Fetching spark://127.0.0.1:54410/jars/sparklyr-2.0-2.11.jar with timestamp 1598540022654
## 20/08/27 09:53:53 INFO Utils: Fetching spark://127.0.0.1:54410/jars/sparklyr-2.0-2.11.jar to C:\Users\Karen\AppData\Local\spark\spark-2.0.1-bin-hadoop2.7\tmp\local\spark-cd010f6c-f324-47db-892c-d4679f698670\userFiles-71211c0f-8947-4248-9d6c-0c06ac71fa40\fetchFileTemp7840457036382436762.tmp
## 20/08/27 09:53:53 INFO Executor: Adding file:/C:/Users/Karen/AppData/Local/spark/spark-2.0.1-bin-hadoop2.7/tmp/local/spark-cd010f6c-f324-47db-892c-d4679f698670/userFiles-71211c0f-8947-4248-9d6c-0c06ac71fa40/sparklyr-2.0-2.11.jar to class loader
## 20/08/27 09:56:15 INFO SparkContext: Added JAR file:/C:/Users/Karen/Documents/R/win-library/3.6/sparklyr/java/sparklyr-2.0-2.11.jar at spark://127.0.0.1:54734/jars/sparklyr-2.0-2.11.jar with timestamp 1598540175585
## 20/08/27 09:56:25 INFO Executor: Fetching spark://127.0.0.1:54734/jars/sparklyr-2.0-2.11.jar with timestamp 1598540175585
## 20/08/27 09:56:25 INFO Utils: Fetching spark://127.0.0.1:54734/jars/sparklyr-2.0-2.11.jar to C:\Users\Karen\AppData\Local\spark\spark-2.0.1-bin-hadoop2.7\tmp\local\spark-d8f5f96a-383a-4bf3-999d-69c482372062\userFiles-f296a21e-a704-4f50-a81e-f1706c48f1b6\fetchFileTemp1086623201380595667.tmp
## 20/08/27 09:56:25 INFO Executor: Adding file:/C:/Users/Karen/AppData/Local/spark/spark-2.0.1-bin-hadoop2.7/tmp/local/spark-d8f5f96a-383a-4bf3-999d-69c482372062/userFiles-f296a21e-a704-4f50-a81e-f1706c48f1b6/sparklyr-2.0-2.11.jar to class loader
## 20/08/27 09:56:27 INFO SparkSqlParser: Parsing command: sparklyr_tmp_cb530f84_1a46_4d68_a364_da539e345adb
## FROM `sparklyr_tmp_cb530f84_1a46_4d68_a364_da539e345adb` AS `zzz2`
## 20/08/27 09:56:27 INFO SparkSqlParser: Parsing command: sparklyr_tmp_c9dce0e3_bb9f_4939_a97c_d0bb9f1a6c23
## FROM `sparklyr_tmp_c9dce0e3_bb9f_4939_a97c_d0bb9f1a6c23` AS `zzz3`
## FROM `sparklyr_tmp_cb530f84_1a46_4d68_a364_da539e345adb`
## FROM `sparklyr_tmp_c9dce0e3_bb9f_4939_a97c_d0bb9f1a6c23`
## 20/08/27 09:56:31 INFO SparkSqlParser: Parsing command: sparklyr_tmp_0d1b933c_53e9_434a_9f19_bf7129c3a692
## FROM `sparklyr_tmp_0d1b933c_53e9_434a_9f19_bf7129c3a692` AS `zzz7`
## FROM `sparklyr_tmp_0d1b933c_53e9_434a_9f19_bf7129c3a692`
## 20/08/27 09:56:31 INFO SparkSqlParser: Parsing command: sparklyr_tmp_73792aa5_bfd5_47d8_9a18_a629b18bcc8a
## FROM `sparklyr_tmp_73792aa5_bfd5_47d8_9a18_a629b18bcc8a` AS `zzz8`
## FROM `sparklyr_tmp_73792aa5_bfd5_47d8_9a18_a629b18bcc8a`
## 20/08/27 09:56:32 INFO SparkSqlParser: Parsing command: sparklyr_tmp_2b9cafdd_433e_4e1e_b203_12184ba12b29
## FROM `sparklyr_tmp_2b9cafdd_433e_4e1e_b203_12184ba12b29` AS `zzz10`
## FROM `sparklyr_tmp_2b9cafdd_433e_4e1e_b203_12184ba12b29`
## 20/08/27 09:56:32 INFO SparkSqlParser: Parsing command: sparklyr_tmp_76fe3384_2881_4fac_9815_377e8f28ff08
## FROM `sparklyr_tmp_76fe3384_2881_4fac_9815_377e8f28ff08` AS `zzz11`
## FROM `sparklyr_tmp_76fe3384_2881_4fac_9815_377e8f28ff08`
## FROM `sparklyr_tmp_76fe3384_2881_4fac_9815_377e8f28ff08`
## 20/08/27 09:56:42 INFO SparkContext: Added JAR file:/C:/Users/Karen/Documents/R/win-library/3.6/sparklyr/java/sparklyr-2.0-2.11.jar at spark://127.0.0.1:54806/jars/sparklyr-2.0-2.11.jar with timestamp 1598540202682
## 20/08/27 09:56:55 INFO Executor: Fetching spark://127.0.0.1:54806/jars/sparklyr-2.0-2.11.jar with timestamp 1598540202682
## 20/08/27 09:56:56 INFO Utils: Fetching spark://127.0.0.1:54806/jars/sparklyr-2.0-2.11.jar to C:\Users\Karen\AppData\Local\spark\spark-2.0.1-bin-hadoop2.7\tmp\local\spark-e10b3f19-3b14-406f-84cd-7a1adf57e767\userFiles-f6a77198-2fe1-4cc3-ac2f-c6aeec14ffe2\fetchFileTemp6298598374547532482.tmp
## 20/08/27 09:56:56 INFO Executor: Adding file:/C:/Users/Karen/AppData/Local/spark/spark-2.0.1-bin-hadoop2.7/tmp/local/spark-e10b3f19-3b14-406f-84cd-7a1adf57e767/userFiles-f6a77198-2fe1-4cc3-ac2f-c6aeec14ffe2/sparklyr-2.0-2.11.jar to class loader
## 20/08/27 09:57:08 INFO SparkSqlParser: Parsing command: sparklyr_tmp_ef05a58f_b600_4b5e_b1da_1c944fa5a3c3
## FROM `sparklyr_tmp_ef05a58f_b600_4b5e_b1da_1c944fa5a3c3` AS `zzz13`
## 20/08/27 09:57:08 INFO SparkSqlParser: Parsing command: sparklyr_tmp_a47670be_625f_4b15_a6c1_4b5776fe697b
## FROM `sparklyr_tmp_a47670be_625f_4b15_a6c1_4b5776fe697b` AS `zzz14`
## FROM `sparklyr_tmp_a47670be_625f_4b15_a6c1_4b5776fe697b`
## 20/08/27 09:57:09 INFO SparkSqlParser: Parsing command: sparklyr_tmp_4f9e8dac_4383_48bf_8e4e_1fa7d299ea43
## FROM `sparklyr_tmp_4f9e8dac_4383_48bf_8e4e_1fa7d299ea43` AS `zzz15`
## 20/08/27 09:57:09 INFO SparkSqlParser: Parsing command: sparklyr_tmp_2e7c2a2d4672
## 20/08/27 09:57:09 INFO SparkSqlParser: Parsing command: CACHE TABLE `sparklyr_tmp_2e7c2a2d4672`
## 20/08/27 09:57:09 INFO SparkSqlParser: Parsing command: `sparklyr_tmp_2e7c2a2d4672`
## 20/08/27 09:57:10 INFO SparkSqlParser: Parsing command: SELECT count(*) FROM `sparklyr_tmp_2e7c2a2d4672`
## FROM `sparklyr_tmp_2e7c2a2d4672` AS `zzz16`
## FROM `sparklyr_tmp_2e7c2a2d4672`
## 20/08/27 09:57:10 INFO SparkSqlParser: Parsing command: sparklyr_tmp_15423af4_f638_4272_9163_23e86bb6f396
## FROM `sparklyr_tmp_15423af4_f638_4272_9163_23e86bb6f396` AS `zzz17`
## FROM `sparklyr_tmp_15423af4_f638_4272_9163_23e86bb6f396`) `LHS`
## FROM `sparklyr_tmp_15423af4_f638_4272_9163_23e86bb6f396`) `LHS`
## 20/08/27 09:57:16 INFO SparkSqlParser: Parsing command: sparklyr_tmp_05e04002_445f_4949_92ed_f8d280d3040f
## FROM `sparklyr_tmp_05e04002_445f_4949_92ed_f8d280d3040f` AS `zzz18`
## FROM `sparklyr_tmp_05e04002_445f_4949_92ed_f8d280d3040f`
## 20/08/27 09:57:17 INFO SparkSqlParser: Parsing command: sparklyr_tmp_7fbd53b2_b6b3_4534_8342_0adec098cc66
## FROM `sparklyr_tmp_7fbd53b2_b6b3_4534_8342_0adec098cc66` AS `zzz19`
## FROM `sparklyr_tmp_7fbd53b2_b6b3_4534_8342_0adec098cc66`
#spark_disconnect_all() # once Spark is disconnected, all its data is removed
We can carry out data import, analysis, and modeling entirely within Spark.
The examples we have worked through so far use small datasets. In real-life scenarios, models are fit on large amounts of data. When the data must be transformed before fitting the models, it is a good idea to save the result of all those transformations in a new table loaded into Spark memory.
The compute() command takes the final result of a dplyr pipeline and saves it to Spark memory:
library("ggplot2")
library("corrr")
library("dbplot")
library("rmarkdown")
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.1")
cars <- copy_to(sc, mtcars, overwrite = T)
cached_cars <- cars %>%
mutate(cyl = paste0("cyl_", cyl)) %>%
compute("cached_cars")
We will use the OkCupid dataset. It consists of user profile data from an online dating site and contains a diverse set of features, including characteristics such as gender and profession, as well as free-text fields about personal interests. There are about 60,000 profiles in the dataset.
download.file(
  "https://github.com/r-spark/okcupid/raw/master/profiles.csv.zip",
  "okcupid.zip")
unzip("okcupid.zip", exdir = "data")
unlink("okcupid.zip")
profiles <- read.csv("data/profiles.csv")
write.csv(dplyr::sample_n(profiles, 10^3),
          "data/profiles.csv", row.names = FALSE) # downsample to 1,000 rows so a very large dataset stays manageable
En la prÔctica, es mejor que utilice una implementación eficiente y no distribuida del algoritmo de modelado. Por ejemplo, es posible que desee utilizar el paquete ranger.
library(sparklyr)
library(ggplot2)
library(dbplot)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.1")
okc <- spark_read_csv(sc, "data/profiles.csv",
  escape = "\"",
  memory = FALSE,
  options = list(multiline = TRUE)
) %>%
  mutate(
    height = as.numeric(height),
    income = ifelse(income == "-1", NA, as.numeric(income))
  ) %>%
  # recode missing values in the categorical predictors
  mutate(sex = ifelse(is.na(sex), "missing", sex)) %>%
  mutate(drinks = ifelse(is.na(drinks), "missing", drinks)) %>%
  mutate(drugs = ifelse(is.na(drugs), "missing", drugs)) %>%
  mutate(job = ifelse(is.na(job), "missing", job)) %>%
  compute() # cache the cleaned table in Spark memory
glimpse(okc) # for a quick look at the data
## Rows: ??
## Columns: 31
## Database: spark_connection
## $ age <chr> "29", "40", "38", "20", "34", "27", "27", "33", "68", "...
## $ body_type <chr> "athletic", "athletic", "thin", "average", "average", "...
## $ diet <chr> "strictly anything", NA, "vegan", "mostly anything", "s...
## $ drinks <chr> "socially", "missing", "often", "socially", "rarely", "...
## $ drugs <chr> "never", "missing", "missing", "never", "never", "missi...
## $ education <chr> "dropped out of space camp", "graduated from college/un...
## $ essay0 <chr> "im a small furry mammal that roams the deserts of life...
## $ essay1 <chr> "eating and havin fun! wooo! rebel yell!", "working and...
## $ essay2 <chr> "chasing down my prey", "i used to be good at skiing, b...
## $ essay3 <chr> "my cute little tail", "i'm tall.", NA, NA, "is my long...
## $ essay4 <chr> "romance novels that have a very masculine and mysterio...
## $ essay5 <chr> "food. sleep. food. sleep. food. sleep.", "cheese<br />...
## $ essay6 <chr> "food. and sleep", "too many things!", NA, "nothing at ...
## $ essay7 <chr> "on the prowl for people whos feelings i can hurt and f...
## $ essay8 <chr> "i have very high standards in life", NA, NA, "i have n...
## $ essay9 <chr> ":)", "you live fairly close", "despite your rejection ...
## $ ethnicity <chr> "other", "white", "white", "white", "asian", "white", "...
## $ height <dbl> 94, 74, 65, 68, 62, 71, 71, 72, 72, 70, 69, 69, 61, 69,...
## $ income <dbl> 1e+06, 6e+04, NaN, 2e+04, NaN, NaN, NaN, NaN, NaN, NaN,...
## $ job <chr> "rather not say", "computer / hardware / software", "ot...
## $ last_online <chr> "2012-06-26-21-46", "2012-06-26-14-13", "2012-06-28-21-...
## $ location <chr> "emeryville, california", "berkeley, california", "oakl...
## $ offspring <chr> "doesn't have kids", "has kids", "doesn't w...
## $ orientation <chr> "straight", "straight", "straight", "straight", "straig...
## $ pets <chr> "has dogs and likes cats", "has dogs", "likes dogs and ...
## $ religion <chr> "other and very serious about it", NA, NA, "atheism", N...
## $ sex <chr> "f", "m", "f", "m", "f", "m", "m", "m", "m", "m", "f", ...
## $ sign <chr> "taurus but it doesn't matter", NA, "aries and it...
## $ smokes <chr> "no", NA, "no", "yes", "no", "no", "yes", NA, "no", "tr...
## $ speaks <chr> "english (fluently), french (okay), spanish (fluently)"...
## $ status <chr> "single", "single", "single", "single", "single", "sing...
Now we add our response variable as a column in the dataset and look at its distribution:
okc <- okc %>%
  mutate(
    not_working = ifelse(job %in% c("student", "unemployed", "retired"), 1, 0)
  )
okc %>%
  group_by(not_working) %>%
  tally()
## # Source: spark<?> [?? x 2]
## not_working n
## <dbl> <dbl>
## 1 0 900
## 2 1 100
# modeling: split into training and testing sets
data_splits <- sdf_random_split(okc, training = 0.8, testing = 0.2, seed = 42)
okc_train <- data_splits$training
okc_test <- data_splits$testing
# distribution of the response variable in the training set
okc_train %>%
  group_by(not_working) %>%
  tally() %>%
  mutate(frac = n / sum(n))
## # Source: spark<?> [?? x 3]
## not_working n frac
## <dbl> <dbl> <dbl>
## 1 1 85 0.105
## 2 0 721 0.895
# summary of the numeric variables
sdf_describe(okc_train, cols = c("age", "income"))
## # Source: spark<?> [?? x 3]
## summary age income
## <chr> <chr> <chr>
## 1 count 806 160
## 2 mean 32.06699751861042 115750.0
## 3 stddev 9.27900067801498 223284.6562929237
## 4 min 18 20000.0
## 5 max 68 1000000.0
dbplot_histogram(okc_train, age) # age histogram, binned inside Spark and plotted with ggplot2
# response variable vs. the other predictors
prop_data <- okc_train %>%
  mutate(religion = regexp_extract(religion, "^\\\\w+", 0)) %>% # keep only the first word
  group_by(religion, not_working) %>%
  tally() %>%
  group_by(religion) %>%
  summarize(
    count = sum(n),
    prop = sum(not_working * n) / sum(n)
  ) %>%
  mutate(se = sqrt(prop * (1 - prop) / count)) %>% # standard error of the proportion
  collect()
prop_data # the collected result now lives in R's memory
## # A tibble: 10 x 4
## religion count prop se
## <chr> <dbl> <dbl> <dbl>
## 1 judaism 46 0.0870 0.0415
## 2 atheism 89 0.112 0.0335
## 3 christianity 81 0.123 0.0366
## 4 hinduism 8 0.125 0.117
## 5 agnosticism 128 0.117 0.0284
## 6 other 100 0.15 0.0357
## 7 buddhism 25 0.24 0.0854
## 8 islam 4 0.5 0.25
## 9 <NA> 260 0.0577 0.0145
## 10 catholicism 65 0.108 0.0384
# Proportion of people not currently employed, by religion
prop_data %>%
  ggplot(aes(x = religion, y = prop)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = prop - 1.96 * se, ymax = prop + 1.96 * se),
                width = .1) +
  geom_hline(yintercept = sum(prop_data$prop * prop_data$count) /
               sum(prop_data$count)) # overall proportion as a reference line
# alcohol and drugs? cross-tabulations
contingency_tbl <- okc_train %>%
  sdf_crosstab("drinks", "drugs") %>%
  collect()
contingency_tbl
## # A tibble: 7 x 5
## drinks_drugs missing never often sometimes
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 very often 0 2 1 1
## 2 socially 125 375 3 66
## 3 not at all 3 43 0 4
## 4 desperately 1 1 1 0
## 5 often 24 26 0 19
## 6 missing 19 14 0 3
## 7 rarely 5 61 2 7
We can visualize this information with a mosaic plot:
library(ggmosaic)
library(forcats)
library(tidyr)
contingency_tbl %>%
  rename(drinks = drinks_drugs) %>%
  gather("drugs", "count", missing:sometimes) %>%
  mutate(
    drinks = as_factor(drinks) %>%
      fct_relevel("missing", "not at all", "rarely", "socially",
                  "very often", "desperately"),
    drugs = as_factor(drugs) %>%
      fct_relevel("missing", "never", "sometimes", "often")
  ) %>%
  ggplot() +
  geom_mosaic(aes(x = product(drinks, drugs), fill = drinks,
                  weight = count))
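To close the loop opened by the train/test split, here is a minimal modeling sketch in Spark. Assumptions: the okc_train and okc_test tables from above, and that ml_evaluate() on a logistic regression model returns an evaluation object exposing area_under_roc(), as in recent sparklyr versions; treat it as indicative rather than guaranteed on Spark 2.0.1.
# fit a logistic regression in Spark using the categorical predictors explored above
lr_model <- ml_logistic_regression(
  okc_train,
  not_working ~ drinks + drugs + sex
)
# evaluate on the held-out testing set (accessor assumed; see note above)
validation_summary <- ml_evaluate(lr_model, okc_test)
validation_summary$area_under_roc()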