July 9, 2018

Parallelization options in R

  • Functions with parallelization arguments
  • Packages for parallelizing on a PC (shared memory)
  • Packages for clusters (distributed memory)

Functions with parallelization arguments. boot example

A simple example: a bootstrapped correlation

library(boot)

x = c(109,88,96,96,109,116,114,96,85,100,113,117,107,104,101,81)
y = c(116,77,95,79,113,122,109,94,91,88,115,119,100,115,95,90)
datos = cbind(x, y)

cor2 <- function(data, indices) {
  r <- cor(data[indices,1],data[indices,2])  
  return(r)}

results <- boot(data=datos,cor2, R=100)

Functions with parallelization arguments. boot example

Profiling the run without parallelization

library(profvis)
## Warning: package 'profvis' was built under R version 3.4.4
profvis({
results <- boot(data=datos,cor2, R=1000000)
})

Functions with parallelization arguments. boot example

Profiling the run with parallelization

profvis({
results <- boot(data=datos,cor2, R=1000000, parallel="snow", ncpus = 6)
})

Functions with parallelization arguments

These functions transparently use the packages and functions we will look at in a moment. In the profiling output, note the calls to clusterApply and parLapply.
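
As a sketch of the same mechanism, boot's parallel argument also accepts "multicore" (fork, unix only), and with "snow" a cluster of your own can be passed through the cl argument. Using the datos and cor2 objects defined above:

library(boot)
library(parallel)

# Fork backend (Mac/Linux only):
# results <- boot(data = datos, statistic = cor2, R = 100000,
#                 parallel = "multicore", ncpus = 4)

# snow backend, reusing a cluster we create and stop ourselves:
cl <- makeCluster(4)
results <- boot(data = datos, statistic = cor2, R = 100000,
                parallel = "snow", ncpus = 4, cl = cl)
stopCluster(cl)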

Packages for parallelizing in R (on a PC)

Parallelization methods (a minimal sketch follows the list)

  1. Socket: starts a fresh R session on each processor.
  • Pro: Works on any system, including Windows.
  • Pro: Each process on each node is unique, so processes cannot contaminate one another.
  • Con: Because each process starts from scratch, it will be slower.
  • Con: Packages and variables must be loaded/defined explicitly on each processor.
  • Con: More complicated to set up.
  2. Fork: copies the current R session onto each processor.
  • Con: Works on Mac, Linux, Unix and BSD, but not on Windows.
  • Con: Processes are duplicated onto the processors, which can cause problems with random numbers and with GUIs (RStudio).
  • Pro: Faster than sockets.
  • Pro: The current R session is copied, so the whole workspace exists on every processor.
  • Pro: Easy to set up.

  • http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html
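
A minimal sketch contrasting the two approaches (the fork call only works on Mac/Linux):

library(parallel)

# Socket (PSOCK): fresh R sessions, works on any OS, including Windows
cl <- makeCluster(2, type = "PSOCK")
parSapply(cl, 1:4, function(i) i^2)
stopCluster(cl)

# Fork: workers are copies of the current session (not available on Windows)
# mclapply(1:4, function(i) i^2, mc.cores = 2)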

The parallel package

Parallelized versions of lapply, sapply, apply

  • parLapply(cl, x, FUN, …). Uses the cluster of nodes created with makeCluster(). Socket based.
  • mclapply(X, FUN, …, mc.cores). Creates its own set of workers for the call. Fork based. *Not available on Windows (unless mc.cores = 1, i.e. serial).

mclapply - Fork

Sequential version. Example: estimating "parameters in linear mixed-effects models with restricted maximum likelihood (REML)".

library(lme4)
## Warning: package 'lme4' was built under R version 3.4.4
## Loading required package: Matrix
f <- function(i) {
  lmer(Petal.Width ~ . - Species + (1 | Species), data = iris)
}
system.time(save1 <- lapply(1:100, f))
##    user  system elapsed 
##    2.09    0.00    2.09

mclapply - Fork

The parallel version only works on unix-alikes; on Windows the processing is sequential (lapply is executed instead of mclapply), so no speed is gained.

Replace lapply with mclapply.

#system.time(save2 <- mclapply(1:100, f))
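
On a unix-alike the call would look like this (a sketch, assuming 4 cores are available):

library(parallel)
system.time(save2 <- mclapply(1:100, f, mc.cores = 4))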

parLapply - Socket. Steps (a compact sketch follows the list)

  • Start a cluster with n nodes
  • Run pre-processing code on each node (e.g. load a package)
  • Use par*apply instead of *apply
  • Stop/close the cluster
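
A compact sketch of the four steps together:

library(parallel)
cl <- makeCluster(4)                # 1. start a cluster with 4 nodes
clusterEvalQ(cl, library(boot))     # 2. pre-processing on every node
res <- parSapply(cl, 1:8, sqrt)     # 3. par*apply instead of *apply
stopCluster(cl)                     # 4. stop the cluster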

Step 1. Available cores

library(parallel)
detectCores(logical = FALSE) # physical cores
## [1] 2
detectCores() # logical cores
## [1] 4

Step 2. Start and stop a cluster

cl <- makeCluster(4)
# code to run in parallel
stopCluster(cl)
# makeCluster(4, type="FORK") # fork-based cluster (not available on Windows)

Step 3. Pre-processing code. clusterEvalQ

Load packages and variables from the current R session on all processors. clusterEvalQ evaluates an expression on each processor.

cl <- makeCluster(4, type="PSOCK")
clusterEvalQ(cl, 2 + 2)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 4
## 
## [[4]]
## [1] 4

Step 3. Pre-processing code. clusterEvalQ

x <- 1
clusterEvalQ(cl, x)
## Error in checkForRemoteErrors(lapply(cl, recvResult)): 4 nodes produced errors; first error: objeto 'x' no encontrado
clusterEvalQ(cl, y <- 1)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 1
clusterEvalQ(cl, y)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 1

Step 3. Pre-processing code

The y assigned on the workers does not exist in the main process; on the master, y is still the vector defined at the beginning.

y
##  [1] 116  77  95  79 113 122 109  94  91  88 115 119 100 115  95  90

Step 3. Pre-processing code. clusterExport

clusterExport exports a variable to the processors.

x
## [1] 1
clusterExport(cl, "x")
clusterEvalQ(cl, x)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 1

Step 3. Pre-processing code. clusterEvalQ

Loading packages on the processors

clusterEvalQ(cl, {
  library(ggplot2)
  library(boot)
})
## [[1]]
## [1] "boot"      "ggplot2"   "stats"     "graphics"  "grDevices" "utils"    
## [7] "datasets"  "methods"   "base"     
## 
## [[2]]
## [1] "boot"      "ggplot2"   "stats"     "graphics"  "grDevices" "utils"    
## [7] "datasets"  "methods"   "base"     
## 
## [[3]]
## [1] "boot"      "ggplot2"   "stats"     "graphics"  "grDevices" "utils"    
## [7] "datasets"  "methods"   "base"     
## 
## [[4]]
## [1] "boot"      "ggplot2"   "stats"     "graphics"  "grDevices" "utils"    
## [7] "datasets"  "methods"   "base"

Step 4. Par-apply

Parallel versions of apply(), with an extra argument specifying the cluster of processors to use.

  • parApply, parallel version of apply
  • parLapply, parallel version of lapply
  • parSapply, parallel version of sapply

Step 4. Par-apply. parApply

apply applies a function over the rows or columns of a matrix or data frame. In this example the mean of each of the first four columns of airquality is computed. The extra argument na.rm = TRUE is passed on to mean. The result is a named vector.

  • Non-parallel version
apply( airquality[, 1:4], 2, mean, na.rm = TRUE) # '1' rows, '2' columns
##      Ozone    Solar.R       Wind       Temp 
##  42.129310 185.931507   9.957516  77.882353
  • Parallelized version
parApply(cl, airquality[, 1:4], 2, mean, na.rm = TRUE)
##      Ozone    Solar.R       Wind       Temp 
##  42.129310 185.931507   9.957516  77.882353

Step 4. Par-apply. parLapply

  • Non-parallel version
lapply(airquality[, 1:4], mean, na.rm = TRUE)
## $Ozone
## [1] 42.12931
## 
## $Solar.R
## [1] 185.9315
## 
## $Wind
## [1] 9.957516
## 
## $Temp
## [1] 77.88235
  • Parallelized version
parLapply(cl, airquality[, 1:4], mean, na.rm = TRUE)
## $Ozone
## [1] 42.12931
## 
## $Solar.R
## [1] 185.9315
## 
## $Wind
## [1] 9.957516
## 
## $Temp
## [1] 77.88235

Step 4. Par-apply. parSapply

sapply is a simplified version of lapply: it calls lapply and inspects the output; when the output admits a simpler representation than a list (a vector or matrix), it simplifies it.

  • Non-parallel version
sapply(airquality[, 1:4], mean, na.rm = TRUE)
##      Ozone    Solar.R       Wind       Temp 
##  42.129310 185.931507   9.957516  77.882353
  • Parallelized version
parSapply(cl, airquality[, 1:4], mean, na.rm = TRUE)
##      Ozone    Solar.R       Wind       Temp 
##  42.129310 185.931507   9.957516  77.882353

Step 4. Par-apply with load balancing

parLapplyLB(cl, airquality[, 1:4], mean, na.rm = TRUE)
## $Ozone
## [1] 42.12931
## 
## $Solar.R
## [1] 185.9315
## 
## $Wind
## [1] 9.957516
## 
## $Temp
## [1] 77.88235

Function recap

  • detectCores
  • makeCluster
  • stopCluster

Function recap

  • clusterEvalQ # evaluates a literal expression on each cluster node.
  • clusterExport
  • clusterCall # calls a function fun with identical arguments on each node.
  • clusterApply # takes a cluster, a vector or a list, and a function, and calls the function with the first element of the list on the first node, with the second element of the list on the second node, and so on.
  • clusterApplyLB # clusterApply with load balancing.
  • clusterSplit # splits 'seq' into one consecutive piece for each cluster node.

Function recap

  • parApply
  • parLapply
  • parSapply
  • parRapply # parallel row apply function for a matrix x.
  • parCapply # parallel column apply function for a matrix x.
  • parLapplyLB # parLapply with load balancing.
  • parSapplyLB # parSapply with load balancing.

clusterCall

The arguments of clusterCall are evaluated on the master (the main processor) and their values are sent to the nodes, where the call is executed.
cl <- makeCluster(4, type = "SOCK") 

myfunc <- function(x=2){x+1}
myfunc_argument <- 5
clusterCall(cl, myfunc, myfunc_argument) 
## [[1]]
## [1] 6
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 6
clusterCall(cl, function(x=2){x+1}, 5) 
## [[1]]
## [1] 6
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 6

http://www.sfu.ca/~sblay/R/snow.html

clusterApply

clusterApply(cl, 1:2, sum, 3)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 5

clusterApplyLB

Like clusterApply, but with load balancing: each task is handed to a worker as soon as it becomes free.

clusterApplyLB(cl, 1:3, sum, 3)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] 6
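
parRapply and parCapply, listed in the recap above, are not demonstrated elsewhere in these notes; a minimal self-contained sketch:

library(parallel)
cl2 <- makeCluster(2)
m <- matrix(1:6, nrow = 2)
parRapply(cl2, m, sum)  # one sum per row:    9 12
parCapply(cl2, m, sum)  # one sum per column: 3 7 11
stopCluster(cl2)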

par-apply example and comparison. 1. lapply
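
The sequential baseline that produces s1 (used in the comparison plot below) is the same as the earlier lapply example:

library(lme4)
f <- function(i) {
  lmer(Petal.Width ~ . - Species + (1 | Species), data = iris)
}
s1 <- system.time(save1 <- lapply(1:100, f))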

par-apply example and comparison. 2. mclapply

library(parallel)
f <- function(i) {
  lmer(Petal.Width ~ . - Species + (1 | Species), data = iris)
}

s2<- system.time({
  library(lme4)
  save2 <- mclapply(1:100, f)
})

par-apply example and comparison. 3. parLapply

library(parallel)
f <- function(i) {
  lmer(Petal.Width ~ . - Species + (1 | Species), data = iris)
}

s3<-system.time({
  cl <- makeCluster(detectCores())
  clusterEvalQ(cl, library(lme4))
  save3 <- parLapply(cl, 1:100, f)
  stopCluster(cl)
})

par-apply example and comparison. 4. parLapplyLB

library(parallel)
f <- function(i) {
  lmer(Petal.Width ~ . - Species + (1 | Species), data = iris)
}

s4<-system.time({
  cl <- makeCluster(detectCores())
  clusterEvalQ(cl, library(lme4))
  save4 <- parLapplyLB(cl, 1:100, f)
  stopCluster(cl)
})

par-apply example and comparison.

sysTime = do.call("rbind",list(s1,s2,s3,s4))
sysTime = cbind(sysTime,data.frame(fun=c("lapply","mclapply","parLapply","parLapplyLB")))
require(ggplot2)
## Loading required package: ggplot2
ggplot(data=sysTime, aes(x=fun,y=elapsed,fill=fun)) + 
  geom_bar(stat="identity") + ggtitle("Elapsed time of each function")

Let's dig a bit deeper.

Clear everything in memory

rm(list = ls())

snow.time() from the snow package

http://jaehyeon-kim.github.io/2015/03/Parallel-Processing-on-Single-Machine-Part-I.html

library(snow)
## Warning: package 'snow' was built under R version 3.4.4
## 
## Attaching package: 'snow'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, clusterSplit, makeCluster,
##     parApply, parCapply, parLapply, parRapply, parSapply,
##     splitIndices, stopCluster
set.seed(1237) # random seed, for reproducibility
sleep = sample(1:10,10) # sleep times (in seconds) that each task will pause for
sleep
##  [1]  4  9  1  8  2 10  5  6  3  7
cl = makeCluster(4, type="SOCK")

clusterSplit(cl, sleep) # split of the sleep times across the nodes
## [[1]]
## [1] 4 9 1
## 
## [[2]]
## [1] 8 2
## 
## [[3]]
## [1] 10  5
## 
## [[4]]
## [1] 6 3 7
st = snow.time(clusterApply(cl, sleep, Sys.sleep))
stLB = snow.time(clusterApplyLB(cl, sleep, Sys.sleep))
stPL = snow.time(parLapply(cl, sleep, Sys.sleep))

stopCluster(cl)

snow.time() from the snow package

  • Green: active computation
  • Blue: processor waiting to return its result
  • Red: master/worker communication
plot(st, title="clusterApply")
plot(stLB, title="clusterApplyLB")
plot(stPL, title="parLapply")

snow.time() from the snow package

Conclusion: clusterApplyLB() and parLapply() take less time than clusterApply(). The efficiency of the former comes from load balancing (a task is handed out whenever a worker is free), while that of the latter comes from fewer send/receive operations thanks to task prescheduling.

snow.time(). Example

library(snow)
set.seed(1237)
cl = makeCluster(4, type="SOCK")
newairquality <- airquality[sample(1:nrow(airquality), 10000000, replace = TRUE), 1:4]
st <- snow.time(clusterApply(cl, newairquality, mean, na.rm = TRUE))
stLB <- snow.time(clusterApplyLB(cl, newairquality, mean, na.rm = TRUE))
stPL <- snow.time(parLapply(cl, newairquality, mean, na.rm = TRUE))
plot(st, title="clusterApply")
plot(stLB, title="clusterApplyLB")
plot(stPL, title="parLapply")

Plotting times with parallel

require(parallel)
set.seed(1237)
sleep = sample(1:10,10)
cl = makeCluster(detectCores())

st = system.time(clusterApply(cl, sleep, Sys.sleep))
stLB = system.time(clusterApplyLB(cl, sleep, Sys.sleep))
stPL = system.time(parLapply(cl, sleep, Sys.sleep))
stPLB = system.time(parLapplyLB(cl, sleep, Sys.sleep))

stopCluster(cl)

sysTime = do.call("rbind",list(st,stLB,stPL,stPLB))
sysTime = cbind(sysTime,data.frame(fun=c("clusterApply","clusterApplyLB","parLapply","parLapplyLB")))

Plotting times with parallel

require(ggplot2)
ggplot(data=sysTime, aes(x=fun,y=elapsed,fill=fun)) + 
  geom_bar(stat="identity") + ggtitle("Elapsed time of each function")

Plotting times with parallel. Example

library(parallel)
set.seed(1237)
cl = makeCluster(4, type="SOCK")
newairquality <- airquality[sample(1:nrow(airquality), 10000000, replace = TRUE), 1:4]
st <- system.time(clusterApply(cl, newairquality, mean, na.rm = TRUE))
stLB <- system.time(clusterApplyLB(cl, newairquality, mean, na.rm = TRUE))
stPL <- system.time(parLapply(cl, newairquality, mean, na.rm = TRUE))
stPLB <- system.time(parLapplyLB(cl, newairquality, mean, na.rm = TRUE))
stopCluster(cl)

sysTime = do.call("rbind",list(st,stLB,stPL,stPLB))
sysTime = cbind(sysTime,data.frame(fun=c("clusterApply","clusterApplyLB","parLapply","parLapplyLB")))

Plotting times with parallel

require(ggplot2)
ggplot(data=sysTime, aes(x=fun,y=elapsed,fill=fun)) + 
  geom_bar(stat="identity") + ggtitle("Elapsed time of each function")

sample example

A sample with replacement is drawn from airquality. Here is a simple example of the function, with 10 sampled rows.

airquality[sample(1:nrow(airquality), 10, replace=TRUE),]
##      Ozone Solar.R Wind Temp Month Day
## 12      16     256  9.7   69     5  12
## 63      49     248  9.2   85     7   2
## 15      18      65 13.2   58     5  15
## 7       23     299  8.6   65     5   7
## 1       41     190  7.4   67     5   1
## 102     NA     222  8.6   92     8  10
## 74      27     175 14.9   81     7  13
## 12.1    16     256  9.7   69     5  12
## 89      82     213  7.4   88     7  28
## 72      NA     139  8.6   82     7  11

sample example. Parallelization.

  1. Define the function to parallelize
sample2 <- function(data) {
  sample.data <- airquality[sample(1:nrow(airquality), 100, replace=TRUE),]
  return(sample.data)}
  2. Load the parallel package and define the cluster
library(parallel)
cl <- makeCluster(detectCores())
  3. Define independent pseudo-random number streams for each processor. This makes the randomness reproducible and independent on each worker (a short reproducibility sketch follows).
clusterSetRNGStream(cl, 123)
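
A short self-contained sketch (on its own small cluster) of what clusterSetRNGStream provides: resetting the streams with the same seed reproduces the workers' random draws.

library(parallel)
cl2 <- makeCluster(2)
clusterSetRNGStream(cl2, 123)
r1 <- parSapply(cl2, 1:2, function(i) runif(1))
clusterSetRNGStream(cl2, 123)
r2 <- parSapply(cl2, 1:2, function(i) runif(1))
identical(r1, r2)  # TRUE: same seed, same streams, same draws
stopCluster(cl2)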

sample example. Parallelization.

  4. Export the airquality data to each processor
clusterExport(cl, c("airquality"))
  5. Run parLapply
airquality.extent <- parLapply(cl,airquality,sample2) 
class(airquality.extent) # a list
## [1] "list"
length(airquality.extent)
## [1] 6
str(airquality.extent)
## List of 6
##  $ Ozone  :'data.frame': 100 obs. of  6 variables:
##   ..$ Ozone  : int [1:100] NA NA 168 78 16 80 28 23 35 71 ...
##   ..$ Solar.R: int [1:100] 266 31 238 NA 201 294 NA 13 NA 291 ...
##   ..$ Wind   : num [1:100] 14.9 14.9 3.4 6.9 8 8.6 14.9 12 7.4 13.8 ...
##   ..$ Temp   : int [1:100] 58 77 81 86 82 86 66 67 85 90 ...
##   ..$ Month  : int [1:100] 5 6 8 8 9 7 5 5 8 6 ...
##   ..$ Day    : int [1:100] 26 29 25 4 20 24 6 28 5 9 ...
##  $ Solar.R:'data.frame': 100 obs. of  6 variables:
##   ..$ Ozone  : int [1:100] NA 23 16 122 66 23 23 23 39 16 ...
##   ..$ Solar.R: int [1:100] 322 14 256 255 NA 13 220 220 323 201 ...
##   ..$ Wind   : num [1:100] 11.5 9.2 9.7 4 4.6 12 10.3 10.3 11.5 8 ...
##   ..$ Temp   : int [1:100] 79 71 69 89 87 67 78 78 87 82 ...
##   ..$ Month  : int [1:100] 6 9 5 8 8 5 9 9 6 9 ...
##   ..$ Day    : int [1:100] 15 22 12 7 6 28 8 8 10 20 ...
##  $ Wind   :'data.frame': 100 obs. of  6 variables:
##   ..$ Ozone  : int [1:100] NA 30 78 23 115 122 16 23 13 65 ...
##   ..$ Solar.R: int [1:100] 59 193 197 13 223 255 201 14 112 157 ...
##   ..$ Wind   : num [1:100] 1.7 6.9 5.1 12 5.7 4 8 9.2 11.5 9.7 ...
##   ..$ Temp   : int [1:100] 76 70 92 67 79 89 82 71 71 80 ...
##   ..$ Month  : int [1:100] 6 9 9 5 5 8 9 9 9 8 ...
##   ..$ Day    : int [1:100] 22 26 2 28 30 7 20 22 15 14 ...
##  $ Temp   :'data.frame': 100 obs. of  6 variables:
##   ..$ Ozone  : int [1:100] 37 18 NA 59 NA NA 108 24 23 73 ...
##   ..$ Solar.R: int [1:100] 284 131 137 51 291 264 223 238 299 183 ...
##   ..$ Wind   : num [1:100] 20.7 8 11.5 6.3 14.9 14.3 8 10.3 8.6 2.8 ...
##   ..$ Temp   : int [1:100] 72 76 86 79 91 79 85 68 65 93 ...
##   ..$ Month  : int [1:100] 6 9 8 8 7 6 7 9 5 9 ...
##   ..$ Day    : int [1:100] 17 29 11 17 14 6 25 19 7 3 ...
##  $ Month  :'data.frame': 100 obs. of  6 variables:
##   ..$ Ozone  : int [1:100] 4 41 23 NA NA 35 NA NA 39 32 ...
##   ..$ Solar.R: int [1:100] 25 190 14 31 153 274 255 139 83 92 ...
##   ..$ Wind   : num [1:100] 9.7 7.4 9.2 14.9 5.7 10.3 12.6 8.6 6.9 15.5 ...
##   ..$ Temp   : int [1:100] 61 67 71 77 88 82 75 82 81 84 ...
##   ..$ Month  : int [1:100] 5 5 9 6 8 7 8 7 8 9 ...
##   ..$ Day    : int [1:100] 23 1 22 29 27 17 23 11 1 6 ...
##  $ Day    :'data.frame': 100 obs. of  6 variables:
##   ..$ Ozone  : int [1:100] NA 18 23 96 28 NA 14 7 64 122 ...
##   ..$ Solar.R: int [1:100] 194 313 220 167 273 59 20 48 175 255 ...
##   ..$ Wind   : num [1:100] 8.6 11.5 10.3 6.9 11.5 1.7 16.6 14.3 4.6 4 ...
##   ..$ Temp   : int [1:100] 69 62 78 91 82 76 63 80 83 89 ...
##   ..$ Month  : int [1:100] 5 5 9 9 8 6 9 7 7 8 ...
##   ..$ Day    : int [1:100] 10 4 8 1 13 22 25 15 5 7 ...

sample example. Parallelization.

  6. Convert the resulting list to a data.frame
airquality.extent.ul <- do.call(rbind.data.frame, airquality.extent)
str(airquality.extent.ul)
## 'data.frame':    600 obs. of  6 variables:
##  $ Ozone  : int  NA NA 168 78 16 80 28 23 35 71 ...
##  $ Solar.R: int  266 31 238 NA 201 294 NA 13 NA 291 ...
##  $ Wind   : num  14.9 14.9 3.4 6.9 8 8.6 14.9 12 7.4 13.8 ...
##  $ Temp   : int  58 77 81 86 82 86 66 67 85 90 ...
##  $ Month  : int  5 6 8 8 9 7 5 5 8 6 ...
##  $ Day    : int  26 29 25 4 20 24 6 28 5 9 ...

sample example. Parallelization.

  7. Alternative: run the sampling with clusterApply and convert in one step
airquality.extent2 <- do.call(rbind.data.frame, clusterApply(cl, airquality, sample2))
stopCluster(cl)
class(airquality.extent2)
## [1] "data.frame"
length(airquality.extent2)
## [1] 6
str(airquality.extent2)
## 'data.frame':    600 obs. of  6 variables:
##  $ Ozone  : int  NA 11 32 NA NA 96 77 13 NA 66 ...
##  $ Solar.R: int  250 320 92 101 138 167 276 27 101 NA ...
##  $ Wind   : num  6.3 16.6 15.5 10.9 8 6.9 5.1 10.3 10.9 4.6 ...
##  $ Temp   : int  76 73 84 84 83 91 88 76 84 87 ...
##  $ Month  : int  6 5 9 7 6 9 7 9 7 8 ...
##  $ Day    : int  24 22 6 4 30 1 7 18 4 6 ...

sample example. The importance of clusterExport

A data set of our own, called dt, is defined and assigned the airquality data frame. If dt is not exported to each processor, parLapply will raise an error. In this first example clusterExport is omitted.

rm(list = ls())
sample2 <- function(data) {
  sample.data <- dt[sample(1:nrow(dt), 100, replace=TRUE),]
  return(sample.data)}

dt<- airquality

library(parallel)
cl <- makeCluster(detectCores())

airquality.extent2 <- do.call(rbind.data.frame, parLapply(cl, dt, sample2))
## Error in checkForRemoteErrors(val): 4 nodes produced errors; first error: argument of length 0
stopCluster(cl)

sample example. The importance of clusterExport

In this second example clusterExport is included, and it works as expected.

rm(list = ls())
sample2 <- function(data) {
  sample.data <- dt[sample(1:nrow(dt), 100, replace=TRUE),]
  return(sample.data)}

dt<- airquality

library(parallel)
cl <- makeCluster(detectCores())

clusterExport(cl, "dt")

airquality.extent2 <- do.call(rbind.data.frame, parLapply(cl, dt, sample2))
stopCluster(cl)

boot example

Without parallelization

library(boot)

x = c(109,88,96,96,109,116,114,96,85,100,113,117,107,104,101,81)
y = c(116,77,95,79,113,122,109,94,91,88,115,119,100,115,95,90)
datos = cbind(x, y)

cor2 <- function(data, indices) {
  r <- cor(data[indices,1],data[indices,2])  
  return(r)}

results <- boot(data=datos,cor2, R=100)

boot example

Parallelized. Note that parLapply iterates over the 32 elements of the matrix datos and that funcion ignores its argument, so boot is run 32 times and the combined object contains 32 × 100 = 3200 replicates.

library(parallel)
cl <- makeCluster(detectCores())
clusterSetRNGStream(cl, 123) # make the randomness reproducible
library(boot) 

clusterEvalQ(cl, library(boot))
## [[1]]
## [1] "boot"      "snow"      "methods"   "stats"     "graphics"  "grDevices"
## [7] "utils"     "datasets"  "base"     
## 
## [[2]]
## [1] "boot"      "snow"      "methods"   "stats"     "graphics"  "grDevices"
## [7] "utils"     "datasets"  "base"     
## 
## [[3]]
## [1] "boot"      "snow"      "methods"   "stats"     "graphics"  "grDevices"
## [7] "utils"     "datasets"  "base"     
## 
## [[4]]
## [1] "boot"      "snow"      "methods"   "stats"     "graphics"  "grDevices"
## [7] "utils"     "datasets"  "base"
clusterExport(cl, c("datos", "cor2"))

funcion <- function(...) boot(datos,cor2, R=100)

boot1 <- parLapply(cl, datos, funcion) # a list with one boot object per element of datos
class(boot1)
## [1] "list"
boot2 <- do.call(c, parLapply(cl, datos, funcion))
class(boot2)
## [1] "boot"
length(boot2$t)
## [1] 3200
boot.ci(boot2, type = c("norm", "basic", "perc"), conf = 0.9)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 3200 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot2, conf = 0.9, type = c("norm", "basic", 
##     "perc"))
## 
## Intervals : 
## Level      Normal              Basic              Percentile     
## 90%   ( 0.7290,  0.9164 )   ( 0.7370,  0.9209 )   ( 0.7363,  0.9201 )  
## Calculations and Intervals on Original Scale
stopCluster(cl)

boot example. cd4 data (parametric bootstrap)

library(boot)
cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
cd4.boot <- boot(cd4, corr, R = 999, sim = "parametric",ran.gen = cd4.rg, mle = cd4.mle)
boot.ci(cd4.boot, type = c("norm", "basic", "perc"), conf = 0.9, h = atanh, hinv = tanh)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 999 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = cd4.boot, conf = 0.9, type = c("norm", "basic", 
##     "perc"), h = atanh, hinv = tanh)
## 
## Intervals : 
## Level      Normal              Basic              Percentile     
## 90%   ( 0.4620,  0.8603 )   ( 0.4660,  0.8580 )   ( 0.4952,  0.8677 )  
## Calculations on Transformed Scale;  Intervals on Original Scale
library(parallel)

cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
run1 <- function(...) boot(cd4, corr, R = 500, sim = "parametric", ran.gen = cd4.rg, mle = cd4.mle)
mc <- 2 # set as appropriate for your hardware
## To make this reproducible:
set.seed(123, "L'Ecuyer")
cd4.boot <- do.call(c, mclapply(seq_len(mc), run1))
boot.ci(cd4.boot, type = c("norm", "basic", "perc"), conf = 0.9, h = atanh, hinv = tanh)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = cd4.boot, conf = 0.9, type = c("norm", "basic", 
##     "perc"), h = atanh, hinv = tanh)
## 
## Intervals : 
## Level      Normal              Basic              Percentile     
## 90%   ( 0.4664,  0.8625 )   ( 0.4584,  0.8647 )   ( 0.4753,  0.8700 )  
## Calculations on Transformed Scale;  Intervals on Original Scale
run1 <- function(...) {
  library(boot)
  cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
  boot(cd4, corr, R = 500, sim = "parametric",
       ran.gen = cd4.rg, mle = cd4.mle)
}
cl <- makeCluster(mc)
## make this reproducible
clusterSetRNGStream(cl, 123)
library(boot) # needed for c() method on master
cd4.boot <- do.call(c, parLapply(cl, seq_len(mc), run1))
boot.ci(cd4.boot, type = c("norm", "basic", "perc"), conf = 0.9, h = atanh, hinv = tanh)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = cd4.boot, conf = 0.9, type = c("norm", "basic", 
##     "perc"), h = atanh, hinv = tanh)
## 
## Intervals : 
## Level      Normal              Basic              Percentile     
## 90%   ( 0.4705,  0.8589 )   ( 0.4620,  0.8597 )   ( 0.4900,  0.8689 )  
## Calculations on Transformed Scale;  Intervals on Original Scale
stopCluster(cl)

boot example

Without parallelization

library(boot)

x = c(109,88,96,96,109,116,114,96,85,100,113,117,107,104,101,81)
y = c(116,77,95,79,113,122,109,94,91,88,115,119,100,115,95,90)
datos = cbind(x, y)

cor2 <- function(data, indices) {
  r <- cor(data[indices,1],data[indices,2])  
  return(r)}

results <- boot(datos,cor2, R=100)

boot example

Parallelized. The clusterEvalQ and clusterExport calls are necessary; in this example, however, they are not run, so that the resulting error can be observed.

library(parallel)
cl <- makeCluster(detectCores())
clusterSetRNGStream(cl, 123) 

# clusterEvalQ(cl, library(boot)) # necessary, but not run here so the error can be observed
# clusterExport(cl, c("datos", "cor2"))

funcion <- function(...) boot(datos,cor2, R=100)

boot1 <- parLapply(cl, datos, funcion) # a list
## Error in checkForRemoteErrors(val): 4 nodes produced errors; first error: no se pudo encontrar la función "boot"

boot example.

Parallelized

clusterEvalQ(cl, library(boot))
## [[1]]
## [1] "boot"      "snow"      "methods"   "stats"     "graphics"  "grDevices"
## [7] "utils"     "datasets"  "base"     
## 
## [[2]]
## [1] "boot"      "snow"      "methods"   "stats"     "graphics"  "grDevices"
## [7] "utils"     "datasets"  "base"     
## 
## [[3]]
## [1] "boot"      "snow"      "methods"   "stats"     "graphics"  "grDevices"
## [7] "utils"     "datasets"  "base"     
## 
## [[4]]
## [1] "boot"      "snow"      "methods"   "stats"     "graphics"  "grDevices"
## [7] "utils"     "datasets"  "base"
clusterExport(cl, c("datos", "cor2"))

boot1 <- parLapply(cl, datos, funcion) # a list
class(boot1)
## [1] "list"
boot1[[1]]
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = datos, statistic = cor2, R = 100)
## 
## 
## Bootstrap Statistics :
##      original     bias    std. error
## t1* 0.8285718 0.00727428  0.05320471

boot example.

Parallelized. Alternative

boot2 <- do.call(c, parLapply(cl, datos, funcion))
class(boot2)
## [1] "boot"
stopCluster(cl)

Final example

The iris data: petal width and length for three plant species. k-means tutorial: https://datascienceplus.com/k-means-clustering-in-r/

library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width)) + geom_point(aes(shape = iris$Species), size = 5)

Final example

Clustering iris with the k-means algorithm, using two of its attributes (petal width and length). Three groups are produced.

set.seed(4242)
clusters <- kmeans(iris[, c("Petal.Length", "Petal.Width")], 3)
clusters
## K-means clustering with 3 clusters of sizes 50, 48, 52
## 
## Cluster means:
##   Petal.Length Petal.Width
## 1     1.462000    0.246000
## 2     5.595833    2.037500
## 3     4.269231    1.342308
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [71] 3 3 3 3 3 3 3 2 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2
## [106] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 2
## [141] 2 2 2 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1]  2.02200 16.29167 13.05769
##  (between_SS / total_SS =  94.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
table(iris$Species, clusters$cluster)  # contingency table of cluster assignments by species
##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0  2 48
##   virginica   0 46  4
ggplot(iris, aes(Petal.Length, Petal.Width)) + geom_point(aes(color = as.factor(clusters$cluster), shape = iris$Species), size = 5)

Final example

kmeans depends on the number of clusters to find and on the number of random starts (nstart) used for each clustering. We will look at this in more depth with parallel processing.

library(parallel)
iris.cluster <- iris[,-5]

cl <- makeCluster(detectCores())
clusterExport(cl, 'iris.cluster')
worker <- function(centers, nstart) {
  kmeans(iris.cluster, centers=centers, nstart=nstart)
}
myiter <- 3
nstarts <- rep(25, myiter)  # three replicates of nstart = 25
nclus <- 2:5                # number of clusters to try
g <- expand.grid(nstarts=nstarts, nclus=nclus)  # all 12 combinations
g
##    nstarts nclus
## 1       25     2
## 2       25     2
## 3       25     2
## 4       25     3
## 5       25     3
## 6       25     3
## 7       25     4
## 8       25     4
## 9       25     4
## 10      25     5
## 11      25     5
## 12      25     5

Final example

clusterMap works like a parallel mapply: worker is called once for each pair of values taken from g$nclus and g$nstarts.

results <- clusterMap(cl, worker, centers=g$nclus, nstart=g$nstarts)
stopCluster(cl)
results
## [[1]]
## K-means clustering with 2 clusters of sizes 97, 53
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.301031    2.886598     4.958763    1.695876
## 2     5.005660    3.369811     1.560377    0.290566
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 123.79588  28.55208
##  (between_SS / total_SS =  77.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## [[2]]
## K-means clustering with 2 clusters of sizes 97, 53
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.301031    2.886598     4.958763    1.695876
## 2     5.005660    3.369811     1.560377    0.290566
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 123.79588  28.55208
##  (between_SS / total_SS =  77.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## [[3]]
## K-means clustering with 2 clusters of sizes 53, 97
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.005660    3.369811     1.560377    0.290566
## 2     6.301031    2.886598     4.958763    1.695876
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1]  28.55208 123.79588
##  (between_SS / total_SS =  77.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## [[4]]
## K-means clustering with 3 clusters of sizes 62, 38, 50
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.901613    2.748387     4.393548    1.433871
## 2     6.850000    3.073684     5.742105    2.071053
## 3     5.006000    3.428000     1.462000    0.246000
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2
## [106] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2
## [141] 2 2 1 2 2 2 1 2 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 39.82097 23.87947 15.15100
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## [[5]]
## K-means clustering with 3 clusters of sizes 50, 62, 38
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.006000    3.428000     1.462000    0.246000
## 2     5.901613    2.748387     4.393548    1.433871
## 3     6.850000    3.073684     5.742105    2.071053
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3
## [106] 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3
## [141] 3 3 2 3 3 3 2 3 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 15.15100 39.82097 23.87947
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## [[6]]
## K-means clustering with 3 clusters of sizes 50, 38, 62
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.006000    3.428000     1.462000    0.246000
## 2     6.850000    3.073684     5.742105    2.071053
## 3     5.901613    2.748387     4.393548    1.433871
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [71] 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
## 
## Within cluster sum of squares by cluster:
## [1] 15.15100 23.87947 39.82097
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## [[7]]
## K-means clustering with 4 clusters of sizes 32, 50, 28, 40
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.912500    3.100000     5.846875    2.131250
## 2     5.006000    3.428000     1.462000    0.246000
## 3     5.532143    2.635714     3.960714    1.228571
## 4     6.252500    2.855000     4.815000    1.625000
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 3 4 3 4 3 4 3 3 3 3 4 3 4 3 3 4 3
##  [71] 4 3 4 4 4 4 4 4 4 3 3 3 3 4 3 4 4 4 3 3 3 4 3 3 3 3 3 4 3 3 1 4 1 1 1
## [106] 1 3 1 1 1 4 4 1 4 4 1 1 1 1 4 1 4 1 4 1 1 4 4 1 1 1 1 1 4 4 1 1 1 4 1
## [141] 1 1 4 1 1 1 4 4 1 4
## 
## Within cluster sum of squares by cluster:
## [1] 18.703437 15.151000  9.749286 13.624750
##  (between_SS / total_SS =  91.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## [[8]]
## K-means clustering with 4 clusters of sizes 40, 50, 32, 28
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.252500    2.855000     4.815000    1.625000
## 2     5.006000    3.428000     1.462000    0.246000
## 3     6.912500    3.100000     5.846875    2.131250
## 4     5.532143    2.635714     3.960714    1.228571
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 4 1 4 1 4 1 4 4 4 4 1 4 1 4 4 1 4
##  [71] 1 4 1 1 1 1 1 1 1 4 4 4 4 1 4 1 1 1 4 4 4 1 4 4 4 4 4 1 4 4 3 1 3 3 3
## [106] 3 4 3 3 3 1 1 3 1 1 3 3 3 3 1 3 1 3 1 3 3 1 1 3 3 3 3 3 1 1 3 3 3 1 3
## [141] 3 3 1 3 3 3 1 1 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 13.624750 15.151000 18.703437  9.749286
##  (between_SS / total_SS =  91.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## [[9]]
## K-means clustering with 4 clusters of sizes 32, 28, 50, 40
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.912500    3.100000     5.846875    2.131250
## 2     5.532143    2.635714     3.960714    1.228571
## 3     5.006000    3.428000     1.462000    0.246000
## 4     6.252500    2.855000     4.815000    1.625000
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 2 4 2 4 2 4 2 2 2 2 4 2 4 2 2 4 2
##  [71] 4 2 4 4 4 4 4 4 4 2 2 2 2 4 2 4 4 4 2 2 2 4 2 2 2 2 2 4 2 2 1 4 1 1 1
## [106] 1 2 1 1 1 4 4 1 4 4 1 1 1 1 4 1 4 1 4 1 1 4 4 1 1 1 1 1 4 4 1 1 1 4 1
## [141] 1 1 4 1 1 1 4 4 1 4
## 
## Within cluster sum of squares by cluster:
## [1] 18.703437  9.749286 15.151000 13.624750
##  (between_SS / total_SS =  91.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## [[10]]
## K-means clustering with 5 clusters of sizes 25, 50, 12, 24, 39
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.508000    2.600000     3.908000    1.204000
## 2     5.006000    3.428000     1.462000    0.246000
## 3     7.475000    3.125000     6.300000    2.050000
## 4     6.529167    3.058333     5.508333    2.162500
## 5     6.207692    2.853846     4.746154    1.564103
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 5 5 5 1 5 5 5 1 5 1 1 5 1 5 1 5 5 1 5 1
##  [71] 5 1 5 5 5 5 5 5 5 1 1 1 1 5 1 5 5 5 1 1 1 5 1 1 1 1 1 5 1 1 4 5 3 4 4
## [106] 3 1 3 4 3 4 4 4 5 4 4 4 3 3 5 4 5 3 5 4 3 5 5 4 3 3 3 4 5 5 3 4 4 5 4
## [141] 4 4 5 4 4 4 5 4 4 5
## 
## Within cluster sum of squares by cluster:
## [1]  8.36640 15.15100  4.65500  5.46250 12.81128
##  (between_SS / total_SS =  93.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## [[11]]
## K-means clustering with 5 clusters of sizes 39, 50, 24, 25, 12
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.207692    2.853846     4.746154    1.564103
## 2     5.006000    3.428000     1.462000    0.246000
## 3     6.529167    3.058333     5.508333    2.162500
## 4     5.508000    2.600000     3.908000    1.204000
## 5     7.475000    3.125000     6.300000    2.050000
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 4 1 1 1 4 1 4 4 1 4 1 4 1 1 4 1 4
##  [71] 1 4 1 1 1 1 1 1 1 4 4 4 4 1 4 1 1 1 4 4 4 1 4 4 4 4 4 1 4 4 3 1 5 3 3
## [106] 5 4 5 3 5 3 3 3 1 3 3 3 5 5 1 3 1 5 1 3 5 1 1 3 5 5 5 3 1 1 5 3 3 1 3
## [141] 3 3 1 3 3 3 1 3 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 12.81128 15.15100  5.46250  8.36640  4.65500
##  (between_SS / total_SS =  93.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## [[12]]
## K-means clustering with 5 clusters of sizes 37, 12, 50, 24, 27
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.229730    2.851351     4.767568    1.572973
## 2     7.475000    3.125000     6.300000    2.050000
## 3     5.006000    3.428000     1.462000    0.246000
## 4     6.529167    3.058333     5.508333    2.162500
## 5     5.529630    2.622222     3.940741    1.218519
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 5 1 5 1 5 1 5 5 5 5 1 5 1 1 5 1 5
##  [71] 1 5 1 1 1 1 1 1 1 5 5 5 5 1 5 1 1 1 5 5 5 1 5 5 5 5 5 1 5 5 4 1 2 4 4
## [106] 2 5 2 4 2 4 4 4 1 4 4 4 2 2 1 4 1 2 1 4 2 1 1 4 2 2 2 4 1 1 2 4 4 1 4
## [141] 4 4 1 4 4 4 1 4 4 1
## 
## Within cluster sum of squares by cluster:
## [1] 11.963784  4.655000 15.151000  5.462500  9.228889
##  (between_SS / total_SS =  93.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"