Tutorial TIC.EC2019 - Analisis de datos con R. Procesamiento secuencial, paralelo y distribuido.

Daniela Ballari - dballari@uazuay.edu.ec - Universidad del Azuay, Cuenca (Ecuador)

1 Temario

Objetivo

Introducir a los participantes al ambiente de procesamiento en R y RStudio, mostrando ejemplos del procesaminto paralelo y distribuido en el Cluster HPC de CEDIA.

Dirigido a

Investigadores y profesionales con necesidades de analizar grandes volumenes de informacion.

Temas

  • Introduccion a R .
  • Exploracion de los datos del tutorial.
  • Demostracion con Kmeans y Random Forest.
  • Procesamiento secuencial.
  • Procesamiento paralelo en windows.
  • Acceso al cluster de CEDIA.
  • Procesamiento paralelo en Linux.
  • Procesamiento distribuido con rslurm.

Fecha del tutorial

Jueves 28 de noviembre 2019. de 14:30 a 16:30hs. En el marco del congreso TIC.EC 2019

Descarga de script y datos

Vinculo: http://bit.ly/ticec_2019

2 Una muy breve introduccion a R

2.1 Operadores matematicos

2 + 2
## [1] 4
9 - 2
## [1] 7
10*10
## [1] 100
25/5
## [1] 5
3 ^ 2        # 3 al cuadrado
## [1] 9
100^(1/2)
## [1] 10
sqrt(100)    # Raiz cuadrada
## [1] 10
log(10)      # Logaritmo natural
## [1] 2.302585
log10(1000)  # Logaritmo de base 10
## [1] 3
exp(1)       # e elevado a la 1
## [1] 2.718282
cos(pi/4)    # coseno de pi sobre 4
## [1] 0.7071068

2.2 Vectores

x = c(3,4,7,8)      # Crear un vector con los numeros 3,4,7,8
x                   # Visualizar x en pantalla
## [1] 3 4 7 8
y = 1:4             # Crear un vector con los numeros 1,2,3,4

z = seq(0,1,0.1)    # Crear un vector de 0 a 1 con incrementos de 0.1
z[1:3]              # Visualizar los 3 primeros elementos de z
## [1] 0.0 0.1 0.2
z[3]                # Visualiar el elemento de la posicion 3
## [1] 0.2

2.3 Operaciones entre vectores

x+y
## [1]  4  6 10 12
x-y                
## [1] 2 2 4 4
x*y               
## [1]  3  8 21 32
x/y                
## [1] 3.000000 2.000000 2.333333 2.000000
sqrt(y)            
## [1] 1.000000 1.414214 1.732051 2.000000
y^2
## [1]  1  4  9 16
y^2+5*cos(y*pi/2)
## [1]  1 -1  9 21
names = c("Dani","Tom","Susana","Valeria")

2.4 Matrices

X = cbind(x,y)        # Combinar los vectores x, y en una matriz

Y = matrix(0,2,4)     # Crear una matriz de 2x4 con elementos 0

Z = matrix(x,2,2)     # Crear una matriz de 2x2 con elementos x
Z[,1]                 # Extraer la primera columna
## [1] 3 4
Z[2,]                 # Extraer la segunda fila
## [1] 4 8
Z[1,1]                # Extraer primera fila y primera columna
## [1] 3

2.5 Data Frames

Son como hojas de excel o matrices que contienen tanto datos numericos como de texto. En el ejemplo se muestra la altura y peso de 4 personas.

altura = c(1.5, 1.75, 1.63, 1.64) 
nombre = c("Miguel","Carlos","Isabel","Ema")

datos = data.frame(altura,nombre)
datos[1,] 
##   altura nombre
## 1    1.5 Miguel
peso = c(180, 175, 110, 40)  #Anadir una nueva columna de peso
datos = cbind(datos, peso)

2.6 Leer y escribir datos

Crear un archivo txt en el mismo diretorio que se encuentra este script en R, con los siguientes datos:

Bank Acct Amount
Chase 1536253 50.32
PNC 189273462 1563.82
Fleet 287363 20000.00

data1<-read.table("bank.txt", header=TRUE) 
write.table(data1,'data1.txt',sep="")

library(readxl)
ejemplo_excel <- read_excel("ejemplo_excel.xlsx")
# View(ejemplo_excel)

2.7 Estadistica descriptiva y graficos

Para graficos mas avanzados con ggplot consulte el tutorial: https://rpubs.com/daniballari/qplot y https://rpubs.com/daniballari/ggplot

data(trees)
head(trees)
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
# Medidas descriptivas
media<-mean(trees$Height)
desviacion<-sd(trees$Height)
varianza<-var(trees$Height)
Minimo<-min(trees$Height)
maximo<-max(trees$Height)
summary(trees)
##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00
#Correlacion
cor(trees$Girth, trees$Height)#correlation
## [1] 0.5192801
cor(trees)
##            Girth    Height    Volume
## Girth  1.0000000 0.5192801 0.9671194
## Height 0.5192801 1.0000000 0.5982497
## Volume 0.9671194 0.5982497 1.0000000
#Graficos
histograma<-hist(trees$Height)

boxplot(trees$Height)

plot(trees$Girth, trees$Height) #scatterplot

2.8 Crear funciones en R

Basado en https://swcarpentry.github.io/r-novice-inflammation/02-func-R/

myfunction <- function(arg1, arg2, ... ){
statements
return(object)
}

# Temperatura Fahrenheit a Kelvin

fahrenheit_to_kelvin <- function(temp_F) {
  temp_K <- ((temp_F - 32) * (5 / 9)) + 273.15
  return(temp_K)
}

# Punto de congelamiento del agua
fahrenheit_to_kelvin(32)

# Punto de ebullicion del agua
fahrenheit_to_kelvin(212)

2.9 Funcion apply

a <- 1:10
b <- 1:10
x<-data.frame(a,b)
head(x)
##   a b
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 5 5
## 6 6 6
apply(x,1,sum)
##  [1]  2  4  6  8 10 12 14 16 18 20
apply(x,2,sum)
##  a  b 
## 55 55

2.10 Funcion sapply

Devuelve la estructura de datos mas simple posible, por ejemplo un vector.

sapply(1:nrow(x), function(i) sum(x[i,]))
##  [1]  2  4  6  8 10 12 14 16 18 20

2.11 Funcion lapply

Se aplica sobre listas y devuelve en estructura de listas.

lapply(1:nrow(x), function(i) sum(x[i,]))
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 8
## 
## [[5]]
## [1] 10
## 
## [[6]]
## [1] 12
## 
## [[7]]
## [1] 14
## 
## [[8]]
## [1] 16
## 
## [[9]]
## [1] 18
## 
## [[10]]
## [1] 20

3 Ejemplos y Datos

Los ejemplos de procesamiento secuencial, paralelo y distribuido se realizaron utilizando diferentes ejemplos basados en el meodo de clusterizacion Kmeans y en el metodo de Random Forest (regresion y clasificacion).

El ejemplo de kmeans esta basado en https://rpubs.com/SANPANDE/BostonCl, https://bgstieber.github.io/post/an-introduction-to-the-kmeans-algorithm/ y https://bookdown.org/content/1498/paralelizacion-con-r-rslurm.html#ejemplos

El ejemplo de Random Forest esta basado en https://bookdown.org/content/2031/ensambladores-random-forest-parte-ii.html y https://datascienceplus.com/random-forests-in-r/

3.1 Datos a utilizar

Boston: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

506 registros x 14 variables - medv: median value of owner-occupied homes in $1000s.
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
- lstat: lower status of the population (percent).

library(MASS)

dim(Boston)     # n col y n rows
## [1] 506  14
head(Boston)    # Primeros 6 registros
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
##  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
str(Boston)     # Estructura
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

4 Kmeans

4.1 Procesamiento Secuencial

Boston2 <- Boston[,c(6,11:14)]
str(Boston2)
## 'data.frame':    506 obs. of  5 variables:
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston2)
##      rm ptratio  black lstat medv
## 1 6.575    15.3 396.90  4.98 24.0
## 2 6.421    17.8 396.90  9.14 21.6
## 3 7.185    17.8 392.83  4.03 34.7
## 4 6.998    18.7 394.63  2.94 33.4
## 5 7.147    18.7 396.90  5.33 36.2
## 6 6.430    18.7 394.12  5.21 28.7
# - *rm*: average number of rooms per dwelling.
# - *ptratio*: pupil-teacher ratio by town.
# - *black*: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
# - *lstat*: lower status of the population (percent).
# - *medv*: median value of owner-occupied homes in \$1000s. 

clusters=5
Boston2.cl=kmeans(Boston2,clusters)
Boston2.cl
## K-means clustering with 5 clusters of sizes 13, 36, 290, 117, 50
## 
## Cluster means:
##         rm  ptratio     black     lstat     medv
## 1 6.008538 18.28462 228.75846 18.757692 17.36923
## 2 6.068944 19.91389  50.15417 20.477500 12.63889
## 3 6.073269 18.98000 391.83521 13.544862 19.59690
## 4 7.096231 16.67350 387.46521  5.551282 35.23932
## 5 5.838500 18.57800 334.64020 16.878000 18.29400
## 
## Clustering vector:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  3  3  4  4  4  4  3  3  3  3  3  3  3  3  3  3  3  3  5  3  3  3  3  3  3  5 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
##  3  5  3  3  5  3  1  5  1  3  3  3  3  4  4  4  4  3  3  3  3  3  3  3  3  3 
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 
##  3  3  3  4  3  4  3  3  3  3  3  3  4  3  3  3  3  3  3  3  3  4  3 
##  [ reached getOption("max.print") -- omitted 431 entries ]
## 
## Within cluster sum of squares by cluster:
## [1] 15688.62 59413.96 31473.42 18595.83 29658.08
##  (between_SS / total_SS =  96.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

4.1.1 Visualizacion

plot(Boston2[4:5],col=Boston2.cl$cluster)
points(Boston2.cl$Centers,pch=19,cex=1.5,col=1:clusters)

4.1.2 Errores intra y entre clusters

kmeans_withinss <- unlist(lapply(2:10, FUN = function(i) 
                  (kmeans(x = Boston2, centers = i)[c('tot.withinss')])))
#tot.withinss - Total within-cluster sum of squares, i.e. sum(withinss).

plot(2:10, kmeans_withinss, xlab = 'Clusters', ylab = 'Within Cluster SSE')

kmeans_betweenss <- unlist(lapply(2:10, FUN = function(i) 
                  (kmeans(x = Boston2, centers = i)[c('betweenss')])))
#betweenss - The between-cluster sum of squares.

#Porcentaje de varianza explicada
tot.ss <- sum(apply(Boston2, 2, var)) * (nrow(Boston2) - 1)

var_explained <- kmeans_betweenss / tot.ss

plot(2:10, var_explained, xlab = 'Clusters', ylab = '% de la varianza total explicada')

4.1.3 Medicion de tiempos

Las unidades de elapsed son el minutos.

s1<- system.time({
    kmeans_withinss <- unlist(lapply(2:10, FUN = function(i) 
                  (kmeans(x = Boston2, centers = i)[c('tot.withinss')])))
})
s1
##    user  system elapsed 
##   0.014   0.000   0.014

4.2 Paralelizacion en Windows

Funcion parLapply de la libreria parallel

library(parallel)

clusters<- 2:10 # se evaluan de 2 a 10 clusters

cl = makeCluster(4)                # crear numero de cores
clusterSetRNGStream(cl, iseed=101) # semilla aleatoria
# init = clusterEvalQ(cl, {library(randomForest);library(MASS); NULL })  #inicializar librerias en cada core
clusterExport(cl, list("Boston2"))   # exportar datos a los cores

(s2<- system.time({
results = parLapply(cl, clusters, function(i) kmeans(Boston2,i))}))
##    user  system elapsed 
##   0.003   0.001   0.070
stopCluster(cl)

results[[1]]
## K-means clustering with 2 clusters of sizes 466, 40
## 
## Cluster means:
##         rm ptratio    black    lstat     medv
## 1 6.304245 18.3515 381.8433 11.97629 23.33348
## 2 6.056175 19.6675  63.4515 20.53750 13.20500
## 
## Clustering vector:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
##  [ reached getOption("max.print") -- omitted 431 entries ]
## 
## Within cluster sum of squares by cluster:
## [1] 414694.4 124508.7
##  (between_SS / total_SS =  87.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

4.2.1 Unir resultados

kmeans_withinss <- sapply(results, function(results) results$tot.withinss)
plot(2:10, kmeans_withinss, xlab = 'Clusters', ylab = 'Within Cluster SSE')

kmeans_betweenss <- sapply(results, function(results) results$betweenss)
#Porcentaje de varianza explicada
tot.ss <- sum(apply(Boston2, 2, var)) * (nrow(Boston2) - 1)
var_explained <- kmeans_betweenss / tot.ss
plot(2:10, var_explained, xlab = 'Clusters', ylab = '% de la varianza total explicada')

4.3 Paralelizacion en cluster de CEDIA - Linux

4.3.1 Acceso a Rstudio Server de CEDIA

Accede a https://rstudio.cedia.edu.ec/, ingresa el usuario y contrasena dado en el tutorial. Observar que la interfaz de Rstudio Server es similar al software Rstudio de escritorio.

Funcion mclapply de la libreria parallel

library(parallel)

RNGkind("L'Ecuyer-CMRG") #semillas aleatorias en los cores

clusters<- 2:10 # se evaluan de 2 a 10 clusters

(s3<- system.time({
results = mclapply(clusters, function(i) kmeans(Boston2,i),mc.cores = 8, mc.set.seed = TRUE)
}))
##    user  system elapsed 
##   0.005   0.035   0.044
results[[1]]
## K-means clustering with 2 clusters of sizes 466, 40
## 
## Cluster means:
##         rm ptratio    black    lstat     medv
## 1 6.304245 18.3515 381.8433 11.97629 23.33348
## 2 6.056175 19.6675  63.4515 20.53750 13.20500
## 
## Clustering vector:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
##  [ reached getOption("max.print") -- omitted 431 entries ]
## 
## Within cluster sum of squares by cluster:
## [1] 414694.4 124508.7
##  (between_SS / total_SS =  87.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

4.4 Comparando tiempos

sysTime = data.frame(do.call("rbind",list(s1,s2, s3)))
sysTime = cbind(elapsed=sysTime$elapsed,data.frame(fun=c("1-sin paralelizar","2-paralelizado W","3-paralelizado L")))

require(ggplot2)
## Loading required package: ggplot2
ggplot(data=sysTime, aes(x=fun,y=elapsed,fill=fun)) +
geom_bar(stat="identity") +
ggtitle("Elapsed time of each function")

4.5 Procesamiento distribuido en cluster de CEDIA

Se utilizara la libreria rslurm. Informacion detallada se puede acceder en: https://bookdown.org/content/1498/paralelizacion-con-r-rslurm.html

SLURM es un es un sistema de manejo de cluster y calendarizacion de tareas para cluster Linux. La libreria rslurm permite conectar r con SLURM para procesamiento distribuido.

Los pasos a seguir con:

1-Cargar libreria rslurm
2-Creacion de funciones a usar sobre el conjunto de parametros. Debe existir al menos una funcion principal
3-Creacion de data frame de parametros
4-Llamada a crear y (alternativamente) ejecutar el job usando a slurm -> slurm_apply
5-Retornar los resultados -> get_slurm_out
6-Eliminar (limpiar) archivos generados en el proceso -> cleanup_files

4.5.1 Ejemplo 1 media y desviacion estandar

#Cargar libreria rslurm
library(rslurm)

#Creacion de funcion. En este caso se simulan varios conjuntos de datos con distribucion normal (cada uno con media y desviacion estandar diferente).
test_func <- function(par_mu, par_sd) {
    samp <- rnorm(10^6, par_mu, par_sd)
    c(s_mu = mean(samp), s_sd = sd(samp))
}

#Creacion de data frame de parametros. En este caso, se crean 10 medias de 1 a 10, y 10 desviaciones estandar de 0.1 a 1.
pars <- data.frame(par_mu = 1:10,
                   par_sd = seq(0.1, 1, length.out = 10))
head(pars, 3)
##   par_mu par_sd
## 1      1    0.1
## 2      2    0.2
## 3      3    0.3
#ejecutar el job
sjob <- slurm_apply(test_func, pars, jobname = 'test_apply',
                    nodes = 2, cpus_per_node = 5, 
                    slurm_options = list(exclude = "compute-0-1"),
                    submit = TRUE, pkgs = c())
#pkgs = c() no cargar ningun paquete
sjob
## $jobname
## [1] "test_apply"
## 
## $nodes
## [1] 2
## 
## attr(,"class")
## [1] "slurm_job"
#Retornar los resultados
get_job_status(sjob)
## $completed
## [1] FALSE
## 
## $queue
##        JOBID PARTITION     NAME     USER ST TIME NODES NODELIST.REASON.
## 1 2794_[0-1]       oro test_app daniela. PD 0:00     1           (None)
## 
## $log
##                                                                         _rslurm_test_apply/slurm_0.out 
##                                                                                                     "" 
##                                                                         _rslurm_test_apply/slurm_1.out 
## "/var/spool/slurmd/job02792/slurm_script: line 8: /usr/lib64/R/bin/Rscript: No such file or directory" 
## 
## attr(,"class")
## [1] "slurm_job_status"
list.files('_rslurm_test_apply', 'results')
## [1] "results_0.RDS"
res <- get_slurm_out(sjob, outtype = 'table') 
head(res, 3)
##       s_mu      s_sd
## 1 1.000028 0.1000984
## 2 2.000066 0.1999072
## 3 3.000359 0.3001642
res_raw <- get_slurm_out(sjob, outtype = 'raw') #Para resultados mas complejos use "raw"

res_raw[1:3] 
## [[1]]
##      s_mu      s_sd 
## 1.0000277 0.1000984 
## 
## [[2]]
##      s_mu      s_sd 
## 2.0000656 0.1999072 
## 
## [[3]]
##      s_mu      s_sd 
## 3.0003594 0.3001642
#Eliminar  archivos
cleanup_files(sjob)

4.5.2 Ejemplo 2 kmeans

myFunction <- function(i, j) {
  kmeans(Boston2, centers=i, nstart=j )
}

myiter <- 3
nstarts <- rep(25, myiter)
nclus <- 2:11
params <- expand.grid(j=nstarts,i=nclus) 
params
##     j  i
## 1  25  2
## 2  25  2
## 3  25  2
## 4  25  3
## 5  25  3
## 6  25  3
## 7  25  4
## 8  25  4
## 9  25  4
## 10 25  5
## 11 25  5
## 12 25  5
## 13 25  6
## 14 25  6
## 15 25  6
## 16 25  7
## 17 25  7
## 18 25  7
## 19 25  8
## 20 25  8
## 21 25  8
## 22 25  9
## 23 25  9
## 24 25  9
## 25 25 10
## 26 25 10
## 27 25 10
## 28 25 11
## 29 25 11
## 30 25 11
dim(params)
## [1] 30  2
# Objetos adicionales "data"
sjob <- slurm_apply(myFunction, params, jobname = 'kmeans_apply', nodes = 3, cpus_per_node = 5, 
                    add_objects = "Boston2", slurm_options = list(exclude = "compute-0-1"), pkgs = c())

# Estado de ejecucion del job
get_job_status(sjob)
## $completed
## [1] FALSE
## 
## $queue
##        JOBID PARTITION     NAME     USER ST TIME NODES NODELIST.REASON.
## 1 2797_[0-2]       oro kmeans_a daniela. PD 0:00     1           (None)
## 
## $log
##                                                                       _rslurm_kmeans_apply/slurm_0.out 
##                                                                                                     "" 
##                                                                       _rslurm_kmeans_apply/slurm_1.out 
## "/var/spool/slurmd/job02791/slurm_script: line 8: /usr/lib64/R/bin/Rscript: No such file or directory" 
##                                                                       _rslurm_kmeans_apply/slurm_2.out 
##                                                                                                     "" 
## 
## attr(,"class")
## [1] "slurm_job_status"
# Recuperar resutados de archivos y combinarlos en un solo data frame 
results <- get_slurm_out(sjob, outtype = 'raw')
summary(results)
##       Length Class  Mode
##  [1,] 9      kmeans list
##  [2,] 9      kmeans list
##  [3,] 9      kmeans list
##  [4,] 9      kmeans list
##  [5,] 9      kmeans list
##  [6,] 9      kmeans list
##  [7,] 9      kmeans list
##  [8,] 9      kmeans list
##  [9,] 9      kmeans list
## [10,] 9      kmeans list
## [11,] 9      kmeans list
## [12,] 9      kmeans list
## [13,] 9      kmeans list
## [14,] 9      kmeans list
## [15,] 9      kmeans list
## [16,] 9      kmeans list
## [17,] 9      kmeans list
## [18,] 9      kmeans list
## [19,] 9      kmeans list
## [20,] 9      kmeans list
results[[1]]
## K-means clustering with 2 clusters of sizes 40, 466
## 
## Cluster means:
##         rm ptratio    black    lstat     medv
## 1 6.056175 19.6675  63.4515 20.53750 13.20500
## 2 6.304245 18.3515 381.8433 11.97629 23.33348
## 
## Clustering vector:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
##  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 
##  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
##  [ reached getOption("max.print") -- omitted 431 entries ]
## 
## Within cluster sum of squares by cluster:
## [1] 124508.7 414694.4
##  (between_SS / total_SS =  87.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Eliminar archivos temporales generados por slurm
cleanup_files(sjob)

5 Random Forest

5.1 Procesamiento secuencial

5.1.1 Matrices de correlacion

library(MASS)
library(PerformanceAnalytics)
chart.Correlation(data.frame(Boston[,"medv"], Boston[,c(1:7)]))

chart.Correlation(data.frame(Boston[,"medv"], Boston[,c(8:13)]))

5.1.2 Datos de entrenamiento y de validacion

set.seed(101)#Semilla aleatoria

train.i <- sample(1:nrow(Boston),300) # 300 datos para entrenamiento, seleccionados aleatoriamente
train<- Boston[train.i,]              # Set de datos de entrenamiento
test<- Boston[-train.i,]              # Set de datos de validacion

5.1.3 Aprendizaje del modelo

La variable medv sera la variable objetivo, la cual es la mediana del costo de vivienda. Se usaran 500 arboles.

library(randomForest)

Boston.rf <- randomForest(medv ~ . , data = train, ntree=500,localImp = TRUE)
Boston.rf
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, ntree = 500, localImp = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 14.91917
##                     % Var explained: 82.6

El error medio cuadratico y la varianza explicada son calculados usando la estimacion OBB. El numero de variables seleccionadas aleatoriamente en cada ramificacion es 4 por defecto.

5.1.4 Error OBB vs numero de arboles

plot(Boston.rf)

5.1.5 Importancia de variables

# importance(Boston.rf)
varImpPlot(Boston.rf)

5.1.6 Prediccion

La predicion consiste en aplicar el modelo RF a un set de datos independiente, que no ha sido utilizado en la validacion.

prediccion<- predict(Boston.rf, newdata=test)
head(prediccion)
##        6       10       17       19       21       22 
## 27.25811 18.89784 21.10072 17.76235 15.35398 17.71217

5.1.7 Validacion

Si bien no es la unica opcion, en este caso se calcularan diferentes metricas con la libreria Metrics.

library(Metrics)

test$prediccion <- prediccion

rmse(test$medv, test$prediccion)
## [1] 2.906019
cor(test$medv, test$prediccion)
## [1] 0.9551733
percent_bias(test$medv, test$prediccion)
## [1] -0.01785421

5.1.8 Medicion de tiempos

Las unidades de elapsed son el minutos.

s4<- system.time({
    Boston.rf=randomForest(medv ~ . , data = train, ntree=1000,localImp = TRUE)
})
s4
##    user  system elapsed 
##   1.493   0.007   1.501

5.2 Paralelizacion en Windows

4 cores, eachTrees es 250. Es decir 250 arboles por cada core.

library(parallel)
eachTrees = 250 #en cada core se procesaran 250 arboles

cl = makeCluster(4)                # crear numero de cores
clusterSetRNGStream(cl, iseed=101) # semilla aleatoria
init = clusterEvalQ(cl, {library(randomForest);library(MASS); NULL })  #inicializar librerias en cada core
clusterExport(cl, list("train"))   # exportar datos a los cores

(s5<- system.time({
results = parLapply(cl, rep(eachTrees, 4), function(i) randomForest(medv ~ . , data = train, ntree=i, localImp = TRUE))
}))
##    user  system elapsed 
##   0.025   0.006   0.467
stopCluster(cl)

results
## [[1]]
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, ntree = i, localImp = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 14.60396
##                     % Var explained: 82.96
## 
## [[2]]
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, ntree = i, localImp = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 14.68718
##                     % Var explained: 82.87
## 
## [[3]]
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, ntree = i, localImp = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 14.58814
##                     % Var explained: 82.98
## 
## [[4]]
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, ntree = i, localImp = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 15.19506
##                     % Var explained: 82.27

5.2.1 Unir los resultados procesados en los distintos cores

library(rfUtilities)

result<- rf.combine(results[[1]],results[[2]],results[[3]],results[[4]])
result
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, ntree = i, localImp = TRUE) 
##                Type of random forest: regression
##                Number of random forests models: 4
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 15.19506
##                     % Var explained: 82.27

5.3 Paralelizacion en cluster de CEDIA - Linux

5.3.1 Acceso a Rstudio Server de CEDIA

Accede a https://rstudio.cedia.edu.ec/, ingresa el usuario y contrasena dado en el tutorial. Observar que la interfaz de Rstudio Server es similar al

Funcion mclapply de la libreria parallel

4 cores, eachTrees es 250. Es decir 250 arboles por cada core. Las diferencias se deben a que son semillas aleatorias diferentes respecto a cuando se ejecuto en 4 cores en paralelo.

(s6<- system.time({
results = mclapply(rep(eachTrees, 4), function(i) randomForest(medv ~ . , data = train, ntree=i),mc.cores = 8, mc.set.seed = TRUE)
}))
##    user  system elapsed 
##   0.541   0.083   0.324
results
## [[1]]
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, ntree = i) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 14.81228
##                     % Var explained: 82.72
## 
## [[2]]
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, ntree = i) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 14.92028
##                     % Var explained: 82.59
## 
## [[3]]
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, ntree = i) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 14.35901
##                     % Var explained: 83.25
## 
## [[4]]
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, ntree = i) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 14.52452
##                     % Var explained: 83.06
(result<- rf.combine(results[[1]],results[[2]],results[[3]],results[[4]] ))
## 
## Call:
##  randomForest(formula = medv ~ ., data = train, ntree = i) 
##                Type of random forest: regression
##                Number of random forests models: 4
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 14.52452
##                     % Var explained: 83.06
varImpPlot(result)

5.4 Comparando tiempos

sysTime = do.call("rbind",list(s4,s5,s6))
sysTime = cbind(sysTime,data.frame(fun=c("sin paralelizar","paralelizado W","paralelizado L")))

require(ggplot2)
## Loading required package: ggplot2
ggplot(data=sysTime, aes(x=fun,y=elapsed,fill=fun)) +
geom_bar(stat="identity") +
ggtitle("Elapsed time of each function")

Daniela Ballari

2019-11-26