Tutorial TIC.EC2019 - Analisis de datos con R. Procesamiento secuencial, paralelo y distribuido.
Tutorial TIC.EC2019 - Analisis de datos con R. Procesamiento secuencial, paralelo y distribuido.
Daniela Ballari - dballari@uazuay.edu.ec - Universidad del Azuay, Cuenca (Ecuador)
1 Temario
Objetivo
Introducir a los participantes al ambiente de procesamiento en R y RStudio, mostrando ejemplos del procesaminto paralelo y distribuido en el Cluster HPC de CEDIA.
Dirigido a
Investigadores y profesionales con necesidades de analizar grandes volumenes de informacion.
Temas
- Introduccion a R .
- Exploracion de los datos del tutorial.
- Demostracion con Kmeans y Random Forest.
- Procesamiento secuencial.
- Procesamiento paralelo en windows.
- Acceso al cluster de CEDIA.
- Procesamiento paralelo en Linux.
- Procesamiento distribuido con rslurm.
Fecha del tutorial
Jueves 28 de noviembre 2019. de 14:30 a 16:30hs. En el marco del congreso TIC.EC 2019
Descarga de script y datos
Vinculo: http://bit.ly/ticec_2019
2 Una muy breve introduccion a R
2.1 Operadores matematicos
2 + 2## [1] 4
9 - 2## [1] 7
10*10## [1] 100
25/5## [1] 5
3 ^ 2 # 3 al cuadrado## [1] 9
100^(1/2)## [1] 10
sqrt(100) # Raiz cuadrada## [1] 10
log(10) # Logaritmo natural## [1] 2.302585
log10(1000) # Logaritmo de base 10## [1] 3
exp(1) # e elevado a la 1## [1] 2.718282
cos(pi/4) # coseno de pi sobre 4## [1] 0.7071068
2.2 Vectores
x = c(3,4,7,8) # Crear un vector con los numeros 3,4,7,8
x # Visualizar x en pantalla## [1] 3 4 7 8
y = 1:4 # Crear un vector con los numeros 1,2,3,4
z = seq(0,1,0.1) # Crear un vector de 0 a 1 con incrementos de 0.1
z[1:3] # Visualizar los 3 primeros elementos de z## [1] 0.0 0.1 0.2
z[3] # Visualiar el elemento de la posicion 3## [1] 0.2
2.3 Operaciones entre vectores
x+y## [1] 4 6 10 12
x-y ## [1] 2 2 4 4
x*y ## [1] 3 8 21 32
x/y ## [1] 3.000000 2.000000 2.333333 2.000000
sqrt(y) ## [1] 1.000000 1.414214 1.732051 2.000000
y^2## [1] 1 4 9 16
y^2+5*cos(y*pi/2)## [1] 1 -1 9 21
names = c("Dani","Tom","Susana","Valeria")2.4 Matrices
X = cbind(x,y) # Combinar los vectores x, y en una matriz
Y = matrix(0,2,4) # Crear una matriz de 2x4 con elementos 0
Z = matrix(x,2,2) # Crear una matriz de 2x2 con elementos x
Z[,1] # Extraer la primera columna## [1] 3 4
Z[2,] # Extraer la segunda fila## [1] 4 8
Z[1,1] # Extraer primera fila y primera columna## [1] 3
2.5 Data Frames
Son como hojas de excel o matrices que contienen tanto datos numericos como de texto. En el ejemplo se muestra la altura y peso de 4 personas.
altura = c(1.5, 1.75, 1.63, 1.64)
nombre = c("Miguel","Carlos","Isabel","Ema")
datos = data.frame(altura,nombre)
datos[1,] ## altura nombre
## 1 1.5 Miguel
peso = c(180, 175, 110, 40) #Anadir una nueva columna de peso
datos = cbind(datos, peso)2.6 Leer y escribir datos
Crear un archivo txt en el mismo diretorio que se encuentra este script en R, con los siguientes datos:
Bank Acct Amount Chase 1536253 50.32 PNC 189273462 1563.82 Fleet 287363 20000.00
data1<-read.table("bank.txt", header=TRUE)
write.table(data1,'data1.txt',sep="")
library(readxl)
ejemplo_excel <- read_excel("ejemplo_excel.xlsx")
# View(ejemplo_excel)2.7 Estadistica descriptiva y graficos
Para graficos mas avanzados con ggplot consulte el tutorial: https://rpubs.com/daniballari/qplot y https://rpubs.com/daniballari/ggplot
data(trees)
head(trees)## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
# Medidas descriptivas
media<-mean(trees$Height)
desviacion<-sd(trees$Height)
varianza<-var(trees$Height)
Minimo<-min(trees$Height)
maximo<-max(trees$Height)
summary(trees)## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
#Correlacion
cor(trees$Girth, trees$Height)#correlation## [1] 0.5192801
cor(trees)## Girth Height Volume
## Girth 1.0000000 0.5192801 0.9671194
## Height 0.5192801 1.0000000 0.5982497
## Volume 0.9671194 0.5982497 1.0000000
#Graficos
histograma<-hist(trees$Height)boxplot(trees$Height)plot(trees$Girth, trees$Height) #scatterplot2.8 Crear funciones en R
Basado en https://swcarpentry.github.io/r-novice-inflammation/02-func-R/
myfunction <- function(arg1, arg2, ... ){
statements
return(object)
}
# Temperatura Fahrenheit a Kelvin
fahrenheit_to_kelvin <- function(temp_F) {
temp_K <- ((temp_F - 32) * (5 / 9)) + 273.15
return(temp_K)
}
# Punto de congelamiento del agua
fahrenheit_to_kelvin(32)
# Punto de ebullicion del agua
fahrenheit_to_kelvin(212)2.9 Funcion apply
a <- 1:10
b <- 1:10
x<-data.frame(a,b)
head(x)## a b
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 5 5
## 6 6 6
apply(x,1,sum)## [1] 2 4 6 8 10 12 14 16 18 20
apply(x,2,sum)## a b
## 55 55
2.10 Funcion sapply
Devuelve la estructura de datos mas simple posible, por ejemplo un vector.
sapply(1:nrow(x), function(i) sum(x[i,]))## [1] 2 4 6 8 10 12 14 16 18 20
2.11 Funcion lapply
Se aplica sobre listas y devuelve en estructura de listas.
lapply(1:nrow(x), function(i) sum(x[i,]))## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 8
##
## [[5]]
## [1] 10
##
## [[6]]
## [1] 12
##
## [[7]]
## [1] 14
##
## [[8]]
## [1] 16
##
## [[9]]
## [1] 18
##
## [[10]]
## [1] 20
3 Ejemplos y Datos
Los ejemplos de procesamiento secuencial, paralelo y distribuido se realizaron utilizando diferentes ejemplos basados en el meodo de clusterizacion Kmeans y en el metodo de Random Forest (regresion y clasificacion).
El ejemplo de kmeans esta basado en https://rpubs.com/SANPANDE/BostonCl, https://bgstieber.github.io/post/an-introduction-to-the-kmeans-algorithm/ y https://bookdown.org/content/1498/paralelizacion-con-r-rslurm.html#ejemplos
El ejemplo de Random Forest esta basado en https://bookdown.org/content/2031/ensambladores-random-forest-parte-ii.html y https://datascienceplus.com/random-forests-in-r/
3.1 Datos a utilizar
Boston: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
506 registros x 14 variables - medv: median value of owner-occupied homes in $1000s. - crim: per capita crime rate by town. - zn: proportion of residential land zoned for lots over 25,000 sq.ft. - indus: proportion of non-retail business acres per town. - chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). - nox: nitrogen oxides concentration (parts per 10 million). - rm: average number of rooms per dwelling. - age: proportion of owner-occupied units built prior to 1940. - dis: weighted mean of distances to five Boston employment centres. - rad: index of accessibility to radial highways. - tax: full-value property-tax rate per $10,000. - ptratio: pupil-teacher ratio by town. - black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. - lstat: lower status of the population (percent).
library(MASS)
dim(Boston) # n col y n rows## [1] 506 14
head(Boston) # Primeros 6 registros## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
str(Boston) # Estructura## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
4 Kmeans
4.1 Procesamiento Secuencial
Boston2 <- Boston[,c(6,11:14)]
str(Boston2)## 'data.frame': 506 obs. of 5 variables:
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston2)## rm ptratio black lstat medv
## 1 6.575 15.3 396.90 4.98 24.0
## 2 6.421 17.8 396.90 9.14 21.6
## 3 7.185 17.8 392.83 4.03 34.7
## 4 6.998 18.7 394.63 2.94 33.4
## 5 7.147 18.7 396.90 5.33 36.2
## 6 6.430 18.7 394.12 5.21 28.7
# - *rm*: average number of rooms per dwelling.
# - *ptratio*: pupil-teacher ratio by town.
# - *black*: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
# - *lstat*: lower status of the population (percent).
# - *medv*: median value of owner-occupied homes in \$1000s.
clusters=5
Boston2.cl=kmeans(Boston2,clusters)
Boston2.cl## K-means clustering with 5 clusters of sizes 13, 36, 290, 117, 50
##
## Cluster means:
## rm ptratio black lstat medv
## 1 6.008538 18.28462 228.75846 18.757692 17.36923
## 2 6.068944 19.91389 50.15417 20.477500 12.63889
## 3 6.073269 18.98000 391.83521 13.544862 19.59690
## 4 7.096231 16.67350 387.46521 5.551282 35.23932
## 5 5.838500 18.57800 334.64020 16.878000 18.29400
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 3 3 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 5 3 3 3 3 3 3 5
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
## 3 5 3 3 5 3 1 5 1 3 3 3 3 4 4 4 4 3 3 3 3 3 3 3 3 3
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 3 3 3 4 3 4 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3 4 3
## [ reached getOption("max.print") -- omitted 431 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 15688.62 59413.96 31473.42 18595.83 29658.08
## (between_SS / total_SS = 96.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
4.1.1 Visualizacion
plot(Boston2[4:5],col=Boston2.cl$cluster)
points(Boston2.cl$Centers,pch=19,cex=1.5,col=1:clusters)4.1.2 Errores intra y entre clusters
kmeans_withinss <- unlist(lapply(2:10, FUN = function(i)
(kmeans(x = Boston2, centers = i)[c('tot.withinss')])))
#tot.withinss - Total within-cluster sum of squares, i.e. sum(withinss).
plot(2:10, kmeans_withinss, xlab = 'Clusters', ylab = 'Within Cluster SSE')kmeans_betweenss <- unlist(lapply(2:10, FUN = function(i)
(kmeans(x = Boston2, centers = i)[c('betweenss')])))
#betweenss - The between-cluster sum of squares.
#Porcentaje de varianza explicada
tot.ss <- sum(apply(Boston2, 2, var)) * (nrow(Boston2) - 1)
var_explained <- kmeans_betweenss / tot.ss
plot(2:10, var_explained, xlab = 'Clusters', ylab = '% de la varianza total explicada')4.1.3 Medicion de tiempos
Las unidades de elapsed son el minutos.
s1<- system.time({
kmeans_withinss <- unlist(lapply(2:10, FUN = function(i)
(kmeans(x = Boston2, centers = i)[c('tot.withinss')])))
})
s1## user system elapsed
## 0.014 0.000 0.014
4.2 Paralelizacion en Windows
Funcion parLapply de la libreria parallel
library(parallel)
clusters<- 2:10 # se evaluan de 2 a 10 clusters
cl = makeCluster(4) # crear numero de cores
clusterSetRNGStream(cl, iseed=101) # semilla aleatoria
# init = clusterEvalQ(cl, {library(randomForest);library(MASS); NULL }) #inicializar librerias en cada core
clusterExport(cl, list("Boston2")) # exportar datos a los cores
(s2<- system.time({
results = parLapply(cl, clusters, function(i) kmeans(Boston2,i))}))## user system elapsed
## 0.003 0.001 0.070
stopCluster(cl)
results[[1]]## K-means clustering with 2 clusters of sizes 466, 40
##
## Cluster means:
## rm ptratio black lstat medv
## 1 6.304245 18.3515 381.8433 11.97629 23.33348
## 2 6.056175 19.6675 63.4515 20.53750 13.20500
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [ reached getOption("max.print") -- omitted 431 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 414694.4 124508.7
## (between_SS / total_SS = 87.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
4.2.1 Unir resultados
kmeans_withinss <- sapply(results, function(results) results$tot.withinss)
plot(2:10, kmeans_withinss, xlab = 'Clusters', ylab = 'Within Cluster SSE')kmeans_betweenss <- sapply(results, function(results) results$betweenss)
#Porcentaje de varianza explicada
tot.ss <- sum(apply(Boston2, 2, var)) * (nrow(Boston2) - 1)
var_explained <- kmeans_betweenss / tot.ss
plot(2:10, var_explained, xlab = 'Clusters', ylab = '% de la varianza total explicada')4.3 Paralelizacion en cluster de CEDIA - Linux
4.3.1 Acceso a Rstudio Server de CEDIA
Accede a https://rstudio.cedia.edu.ec/, ingresa el usuario y contrasena dado en el tutorial. Observar que la interfaz de Rstudio Server es similar al software Rstudio de escritorio.
Funcion mclapply de la libreria parallel
library(parallel)
RNGkind("L'Ecuyer-CMRG") #semillas aleatorias en los cores
clusters<- 2:10 # se evaluan de 2 a 10 clusters
(s3<- system.time({
results = mclapply(clusters, function(i) kmeans(Boston2,i),mc.cores = 8, mc.set.seed = TRUE)
}))## user system elapsed
## 0.005 0.035 0.044
results[[1]]## K-means clustering with 2 clusters of sizes 466, 40
##
## Cluster means:
## rm ptratio black lstat medv
## 1 6.304245 18.3515 381.8433 11.97629 23.33348
## 2 6.056175 19.6675 63.4515 20.53750 13.20500
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [ reached getOption("max.print") -- omitted 431 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 414694.4 124508.7
## (between_SS / total_SS = 87.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
4.4 Comparando tiempos
sysTime = data.frame(do.call("rbind",list(s1,s2, s3)))
sysTime = cbind(elapsed=sysTime$elapsed,data.frame(fun=c("1-sin paralelizar","2-paralelizado W","3-paralelizado L")))
require(ggplot2)
## Loading required package: ggplot2
ggplot(data=sysTime, aes(x=fun,y=elapsed,fill=fun)) +
geom_bar(stat="identity") +
ggtitle("Elapsed time of each function")4.5 Procesamiento distribuido en cluster de CEDIA
Se utilizara la libreria rslurm. Informacion detallada se puede acceder en: https://bookdown.org/content/1498/paralelizacion-con-r-rslurm.html
SLURM es un es un sistema de manejo de cluster y calendarizacion de tareas para cluster Linux. La libreria rslurm permite conectar r con SLURM para procesamiento distribuido.
Los pasos a seguir con:
1-Cargar libreria rslurm 2-Creacion de funciones a usar sobre el conjunto de parametros. Debe existir al menos una funcion principal 3-Creacion de data frame de parametros 4-Llamada a crear y (alternativamente) ejecutar el job usando a slurm -> slurm_apply 5-Retornar los resultados -> get_slurm_out 6-Eliminar (limpiar) archivos generados en el proceso -> cleanup_files
4.5.1 Ejemplo 1 media y desviacion estandar
#Cargar libreria rslurm
library(rslurm)
#Creacion de funcion. En este caso se simulan varios conjuntos de datos con distribucion normal (cada uno con media y desviacion estandar diferente).
test_func <- function(par_mu, par_sd) {
samp <- rnorm(10^6, par_mu, par_sd)
c(s_mu = mean(samp), s_sd = sd(samp))
}
#Creacion de data frame de parametros. En este caso, se crean 10 medias de 1 a 10, y 10 desviaciones estandar de 0.1 a 1.
pars <- data.frame(par_mu = 1:10,
par_sd = seq(0.1, 1, length.out = 10))
head(pars, 3)## par_mu par_sd
## 1 1 0.1
## 2 2 0.2
## 3 3 0.3
#ejecutar el job
sjob <- slurm_apply(test_func, pars, jobname = 'test_apply',
nodes = 2, cpus_per_node = 5,
slurm_options = list(exclude = "compute-0-1"),
submit = TRUE, pkgs = c())
#pkgs = c() no cargar ningun paquete
sjob## $jobname
## [1] "test_apply"
##
## $nodes
## [1] 2
##
## attr(,"class")
## [1] "slurm_job"
#Retornar los resultados
get_job_status(sjob)## $completed
## [1] FALSE
##
## $queue
## JOBID PARTITION NAME USER ST TIME NODES NODELIST.REASON.
## 1 2794_[0-1] oro test_app daniela. PD 0:00 1 (None)
##
## $log
## _rslurm_test_apply/slurm_0.out
## ""
## _rslurm_test_apply/slurm_1.out
## "/var/spool/slurmd/job02792/slurm_script: line 8: /usr/lib64/R/bin/Rscript: No such file or directory"
##
## attr(,"class")
## [1] "slurm_job_status"
list.files('_rslurm_test_apply', 'results')## [1] "results_0.RDS"
res <- get_slurm_out(sjob, outtype = 'table')
head(res, 3)## s_mu s_sd
## 1 1.000028 0.1000984
## 2 2.000066 0.1999072
## 3 3.000359 0.3001642
res_raw <- get_slurm_out(sjob, outtype = 'raw') #Para resultados mas complejos use "raw"
res_raw[1:3] ## [[1]]
## s_mu s_sd
## 1.0000277 0.1000984
##
## [[2]]
## s_mu s_sd
## 2.0000656 0.1999072
##
## [[3]]
## s_mu s_sd
## 3.0003594 0.3001642
#Eliminar archivos
cleanup_files(sjob)4.5.2 Ejemplo 2 kmeans
myFunction <- function(i, j) {
kmeans(Boston2, centers=i, nstart=j )
}
myiter <- 3
nstarts <- rep(25, myiter)
nclus <- 2:11
params <- expand.grid(j=nstarts,i=nclus)
params## j i
## 1 25 2
## 2 25 2
## 3 25 2
## 4 25 3
## 5 25 3
## 6 25 3
## 7 25 4
## 8 25 4
## 9 25 4
## 10 25 5
## 11 25 5
## 12 25 5
## 13 25 6
## 14 25 6
## 15 25 6
## 16 25 7
## 17 25 7
## 18 25 7
## 19 25 8
## 20 25 8
## 21 25 8
## 22 25 9
## 23 25 9
## 24 25 9
## 25 25 10
## 26 25 10
## 27 25 10
## 28 25 11
## 29 25 11
## 30 25 11
dim(params)## [1] 30 2
# Objetos adicionales "data"
sjob <- slurm_apply(myFunction, params, jobname = 'kmeans_apply', nodes = 3, cpus_per_node = 5,
add_objects = "Boston2", slurm_options = list(exclude = "compute-0-1"), pkgs = c())
# Estado de ejecucion del job
get_job_status(sjob)## $completed
## [1] FALSE
##
## $queue
## JOBID PARTITION NAME USER ST TIME NODES NODELIST.REASON.
## 1 2797_[0-2] oro kmeans_a daniela. PD 0:00 1 (None)
##
## $log
## _rslurm_kmeans_apply/slurm_0.out
## ""
## _rslurm_kmeans_apply/slurm_1.out
## "/var/spool/slurmd/job02791/slurm_script: line 8: /usr/lib64/R/bin/Rscript: No such file or directory"
## _rslurm_kmeans_apply/slurm_2.out
## ""
##
## attr(,"class")
## [1] "slurm_job_status"
# Recuperar resutados de archivos y combinarlos en un solo data frame
results <- get_slurm_out(sjob, outtype = 'raw')
summary(results)## Length Class Mode
## [1,] 9 kmeans list
## [2,] 9 kmeans list
## [3,] 9 kmeans list
## [4,] 9 kmeans list
## [5,] 9 kmeans list
## [6,] 9 kmeans list
## [7,] 9 kmeans list
## [8,] 9 kmeans list
## [9,] 9 kmeans list
## [10,] 9 kmeans list
## [11,] 9 kmeans list
## [12,] 9 kmeans list
## [13,] 9 kmeans list
## [14,] 9 kmeans list
## [15,] 9 kmeans list
## [16,] 9 kmeans list
## [17,] 9 kmeans list
## [18,] 9 kmeans list
## [19,] 9 kmeans list
## [20,] 9 kmeans list
results[[1]]## K-means clustering with 2 clusters of sizes 40, 466
##
## Cluster means:
## rm ptratio black lstat medv
## 1 6.056175 19.6675 63.4515 20.53750 13.20500
## 2 6.304245 18.3515 381.8433 11.97629 23.33348
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [ reached getOption("max.print") -- omitted 431 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 124508.7 414694.4
## (between_SS / total_SS = 87.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Eliminar archivos temporales generados por slurm
cleanup_files(sjob)5 Random Forest
5.1 Procesamiento secuencial
5.1.1 Matrices de correlacion
library(MASS)
library(PerformanceAnalytics)
chart.Correlation(data.frame(Boston[,"medv"], Boston[,c(1:7)]))chart.Correlation(data.frame(Boston[,"medv"], Boston[,c(8:13)]))5.1.2 Datos de entrenamiento y de validacion
set.seed(101)#Semilla aleatoria
train.i <- sample(1:nrow(Boston),300) # 300 datos para entrenamiento, seleccionados aleatoriamente
train<- Boston[train.i,] # Set de datos de entrenamiento
test<- Boston[-train.i,] # Set de datos de validacion5.1.3 Aprendizaje del modelo
La variable medv sera la variable objetivo, la cual es la mediana del costo de vivienda. Se usaran 500 arboles.
library(randomForest)
Boston.rf <- randomForest(medv ~ . , data = train, ntree=500,localImp = TRUE)
Boston.rf##
## Call:
## randomForest(formula = medv ~ ., data = train, ntree = 500, localImp = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 14.91917
## % Var explained: 82.6
El error medio cuadratico y la varianza explicada son calculados usando la estimacion OBB. El numero de variables seleccionadas aleatoriamente en cada ramificacion es 4 por defecto.
5.1.4 Error OBB vs numero de arboles
plot(Boston.rf)5.1.5 Importancia de variables
# importance(Boston.rf)
varImpPlot(Boston.rf)5.1.6 Prediccion
La predicion consiste en aplicar el modelo RF a un set de datos independiente, que no ha sido utilizado en la validacion.
prediccion<- predict(Boston.rf, newdata=test)
head(prediccion)## 6 10 17 19 21 22
## 27.25811 18.89784 21.10072 17.76235 15.35398 17.71217
5.1.7 Validacion
Si bien no es la unica opcion, en este caso se calcularan diferentes metricas con la libreria Metrics.
library(Metrics)
test$prediccion <- prediccion
rmse(test$medv, test$prediccion)## [1] 2.906019
cor(test$medv, test$prediccion)## [1] 0.9551733
percent_bias(test$medv, test$prediccion)## [1] -0.01785421
5.1.8 Medicion de tiempos
Las unidades de elapsed son el minutos.
s4<- system.time({
Boston.rf=randomForest(medv ~ . , data = train, ntree=1000,localImp = TRUE)
})
s4## user system elapsed
## 1.493 0.007 1.501
5.2 Paralelizacion en Windows
4 cores, eachTrees es 250. Es decir 250 arboles por cada core.
library(parallel)
eachTrees = 250 #en cada core se procesaran 250 arboles
cl = makeCluster(4) # crear numero de cores
clusterSetRNGStream(cl, iseed=101) # semilla aleatoria
init = clusterEvalQ(cl, {library(randomForest);library(MASS); NULL }) #inicializar librerias en cada core
clusterExport(cl, list("train")) # exportar datos a los cores
(s5<- system.time({
results = parLapply(cl, rep(eachTrees, 4), function(i) randomForest(medv ~ . , data = train, ntree=i, localImp = TRUE))
}))## user system elapsed
## 0.025 0.006 0.467
stopCluster(cl)
results## [[1]]
##
## Call:
## randomForest(formula = medv ~ ., data = train, ntree = i, localImp = TRUE)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 14.60396
## % Var explained: 82.96
##
## [[2]]
##
## Call:
## randomForest(formula = medv ~ ., data = train, ntree = i, localImp = TRUE)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 14.68718
## % Var explained: 82.87
##
## [[3]]
##
## Call:
## randomForest(formula = medv ~ ., data = train, ntree = i, localImp = TRUE)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 14.58814
## % Var explained: 82.98
##
## [[4]]
##
## Call:
## randomForest(formula = medv ~ ., data = train, ntree = i, localImp = TRUE)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 15.19506
## % Var explained: 82.27
5.2.1 Unir los resultados procesados en los distintos cores
library(rfUtilities)
result<- rf.combine(results[[1]],results[[2]],results[[3]],results[[4]])
result##
## Call:
## randomForest(formula = medv ~ ., data = train, ntree = i, localImp = TRUE)
## Type of random forest: regression
## Number of random forests models: 4
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 15.19506
## % Var explained: 82.27
5.3 Paralelizacion en cluster de CEDIA - Linux
5.3.1 Acceso a Rstudio Server de CEDIA
Accede a https://rstudio.cedia.edu.ec/, ingresa el usuario y contrasena dado en el tutorial. Observar que la interfaz de Rstudio Server es similar al
Funcion mclapply de la libreria parallel
4 cores, eachTrees es 250. Es decir 250 arboles por cada core. Las diferencias se deben a que son semillas aleatorias diferentes respecto a cuando se ejecuto en 4 cores en paralelo.
(s6<- system.time({
results = mclapply(rep(eachTrees, 4), function(i) randomForest(medv ~ . , data = train, ntree=i),mc.cores = 8, mc.set.seed = TRUE)
}))## user system elapsed
## 0.541 0.083 0.324
results## [[1]]
##
## Call:
## randomForest(formula = medv ~ ., data = train, ntree = i)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 14.81228
## % Var explained: 82.72
##
## [[2]]
##
## Call:
## randomForest(formula = medv ~ ., data = train, ntree = i)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 14.92028
## % Var explained: 82.59
##
## [[3]]
##
## Call:
## randomForest(formula = medv ~ ., data = train, ntree = i)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 14.35901
## % Var explained: 83.25
##
## [[4]]
##
## Call:
## randomForest(formula = medv ~ ., data = train, ntree = i)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 14.52452
## % Var explained: 83.06
(result<- rf.combine(results[[1]],results[[2]],results[[3]],results[[4]] ))##
## Call:
## randomForest(formula = medv ~ ., data = train, ntree = i)
## Type of random forest: regression
## Number of random forests models: 4
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 14.52452
## % Var explained: 83.06
varImpPlot(result)5.4 Comparando tiempos
sysTime = do.call("rbind",list(s4,s5,s6))
sysTime = cbind(sysTime,data.frame(fun=c("sin paralelizar","paralelizado W","paralelizado L")))
require(ggplot2)
## Loading required package: ggplot2
ggplot(data=sysTime, aes(x=fun,y=elapsed,fill=fun)) +
geom_bar(stat="identity") +
ggtitle("Elapsed time of each function")