Cargamos sparklyr y dplyr
library(sparklyr)
library(dplyr)
library(rsparkling)
library(ggplot2)
Inicializamos la conexiĂ³n
sc <- spark_connect(master = "local", version = "1.6.2")
Utilizamos la funciĂ³n spark_read_parquet
que lee los datos y crea un DataFrame de spark. Al asignarlo a un objeto se crea un tbl_spark
que permite utilizar funciones de dplyr
sobre un dataframe de Spark.
censo1_tbl <- spark_read_parquet(sc, "censo1", path = "/home/jose/spark-warehouse/censo2/")
class(censo1_tbl)
[1] "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"
censo1_tbl
Source: query [?? x 9]
Database: spark connection master=local[4] app=sparklyr local=TRUE
edad sexo cpro ecivil factor
<dbl> <chr> <chr> <chr> <dbl>
1 63 Mujer Zamora Casado 2.887789
2 66 Hombre Zamora Casado 12.095818
3 63 Mujer Zamora Casado 12.095818
4 66 Hombre Zamora Casado 11.343187
5 63 Mujer Zamora Casado 11.343187
6 66 Hombre Zamora Casado 1.763703
7 63 Mujer Zamora Casado 1.763703
8 25 Hombre Zamora Soltero 1.763703
9 66 Hombre Zamora Divorciado 2.400193
10 66 Hombre Zamora Casado 1.734913
# ... with more rows, and 4 more variables: esreal <chr>, rela <chr>,
# nhijos <dbl>, nocu <dbl>
mod_dataset <- censo1_tbl %>%
filter(edad > 20, edad < 70, !is.na(edad),!is.na(rela), !is.na(esreal), !is.na(nhijos), !is.na(nocu)) %>%
mutate(respuesta= as.character(ifelse( ecivil == "Divorciado", 1,0))) %>%
sdf_register("mod_dataset")
summarise(mod_dataset, n())
Source: query [?? x 1]
Database: spark connection master=local[4] app=sparklyr local=TRUE
`_c0`
<dbl>
1 921239
Con sdf_partion
creamos datos para training y para test
partitions <- mod_dataset %>% sdf_partition(training = 0.7, test = 0.3, seed = 42)
partitions$training
Source: query [?? x 10]
Database: spark connection master=local[4] app=sparklyr local=TRUE
edad sexo cpro ecivil factor
<dbl> <chr> <chr> <chr> <dbl>
1 21 Mujer Alicante/Alacant Casado 0.7784233
2 21 Mujer Alicante/Alacant Casado 6.9307727
3 21 Mujer Alicante/Alacant Casado 8.7363786
4 21 Mujer Alicante/Alacant Casado 13.8079170
5 21 Mujer Alicante/Alacant Casado 14.1988963
6 21 Mujer Alicante/Alacant Casado 25.5533933
7 21 Mujer Alicante/Alacant Casado 27.8321311
8 21 Mujer Alicante/Alacant Casado 38.7815963
9 21 Mujer Alicante/Alacant Casado 38.8812580
10 21 Mujer Alicante/Alacant Casado 42.5381768
# ... with more rows, and 5 more variables: esreal <chr>, rela <chr>,
# nhijos <dbl>, nocu <dbl>, respuesta <chr>
library(h2o)
training <- as_h2o_frame(sc, partitions$training)
test <- as_h2o_frame(sc, partitions$test)
training$respuesta <- as.factor(training$respuesta)
training$esreal <- as.factor(training$esreal)
training$rela <- as.factor(training$rela
)
test$respuesta <- as.factor(test$respuesta)
test$esreal <- as.factor(test$esreal)
test$rela <- as.factor(test$rela)
x <- c("edad","esreal","rela","nhijos", "nocu")
mod.glm <- h2o.glm(x = x,
y ="respuesta",
training_frame = training,
validation_frame = test,
family = "binomial")
|
| | 0%
|
|======== | 12%
|
|======================================================================| 100%
summary(mod.glm)
Model Details:
==============
H2OBinomialModel: glm
Model Key: GLM_model_R_1478695292408_1
GLM Model: summary
family link regularization
1 binomial logit Elastic Net (alpha = 0.5, lambda = 2.897E-5 )
number_of_predictors_total number_of_active_predictors number_of_iterations
1 21 19 4
training_frame
1 RTMP_sid_9e35_3
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.05188648
RMSE: 0.227786
LogLoss: 0.2031256
Mean Per-Class Error: 0.3370466
AUC: 0.7172977
Gini: 0.4345954
R^2: 0.04009171
Null Deviance: 283413.1
Residual Deviance: 262164
AIC: 262204
Confusion Matrix for F1-optimal threshold:
0.0 1.0 Error Rate
0.0 516209 92112 0.151420 =92112/608321
1.0 19341 17663 0.522673 =19341/37004
Totals 535550 109775 0.172708 =111453/645325
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.104221 0.240675 167
2 max f2 0.067227 0.357195 221
3 max f0point5 0.128707 0.196296 131
4 max accuracy 0.424431 0.942652 0
5 max precision 0.165138 0.192532 84
6 max recall 0.000988 1.000000 399
7 max specificity 0.424431 0.999993 0
8 max absolute_mcc 0.104221 0.201666 167
9 max min_per_class_accuracy 0.056621 0.671495 247
10 max mean_per_class_accuracy 0.065424 0.683625 225
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: glm
** Reported on validation data. **
MSE: 0.05114582
RMSE: 0.2261544
LogLoss: 0.2003889
Mean Per-Class Error: 0.3392575
AUC: 0.7213122
Gini: 0.4426244
R^2: 0.04083666
Null Deviance: 119902.5
Residual Deviance: 110580.2
AIC: 110620.2
Confusion Matrix for F1-optimal threshold:
0.0 1.0 Error Rate
0.0 222947 37373 0.143566 =37373/260320
1.0 8342 7252 0.534949 =8342/15594
Totals 231289 44625 0.165686 =45715/275914
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.107991 0.240854 165
2 max f2 0.069463 0.361174 223
3 max f0point5 0.130409 0.195382 131
4 max accuracy 0.423631 0.943461 0
5 max precision 0.163828 0.188190 88
6 max recall 0.001102 1.000000 399
7 max specificity 0.423631 0.999977 0
8 max absolute_mcc 0.076092 0.206276 214
9 max min_per_class_accuracy 0.056636 0.673202 253
10 max mean_per_class_accuracy 0.066631 0.688301 228
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Scoring History:
timestamp duration iteration negative_log_likelihood objective
1 2016-11-09 13:43:41 0.000 sec 0 141706.55680 0.21959
2 2016-11-09 13:43:41 0.922 sec 1 132370.69769 0.20526
3 2016-11-09 13:43:42 1.177 sec 2 131129.12741 0.20335
4 2016-11-09 13:43:42 1.398 sec 3 131082.50579 0.20329
5 2016-11-09 13:43:42 1.673 sec 4 131082.01912 0.20329
Variable Importances: (Extract with `h2o.varimp`)
=================================================
Standardized Coefficient Magnitudes: standardized coefficient magnitudes
names
1 rela.Ocupado
2 rela.Otra situaci<0xC3B3>n
3 nocu
4 esreal.Sabe leer y escribir pero fue menos de 5 a<0xC3B1>os a la escuela
5 esreal.No sabe leer o escribir
coefficients sign
1 1.352854 POS
2 1.093698 NEG
3 0.968668 NEG
4 0.748729 NEG
5 0.672837 NEG
---
names
16 esreal.Lleg<0xC3B3> al <0xC3BA>ltimo curso de ESO, EGB o Bachiller Elemental o tiene el Certificado de Escolaridad o de Estudios Primarios
17 edad
18 nhijos
19 esreal.Doctorado
20 esreal.Diplomatura universitaria, Arquitectura T<0xC3A9>cnica, Ingenier<0xC3AD>a T<0xC3A9>cnica o equivalente
21 esreal.Licenciatura, Arquitectura, Ingenier<0xC3AD>a o equivalente
coefficients sign
16 0.081584 NEG
17 0.073724 POS
18 0.055346 NEG
19 0.050114 POS
20 0.000000 POS
21 0.000000 POS
Podemos ver la importancia de las variables
h2o.varimp_plot(mod.glm)
mod.rf <- h2o.randomForest(x = x,
y ="respuesta",
training_frame = training,
validation_frame = test,
balance_classes = TRUE, # opcion para balanceo de clases
# nfolds = 5,
stopping_rounds = 3,
stopping_metric = "AUC")
|
| | 0%
|
|= | 2%
|
|=== | 4%
|
|==== | 6%
|
|====== | 8%
|
|======= | 10%
|
|======== | 12%
|
|========== | 14%
|
|=========== | 16%
|
|============= | 18%
|
|============== | 20%
|
|=============== | 22%
|
|================= | 24%
|
|================== | 26%
|
|==================== | 28%
|
|===================== | 30%
|
|====================== | 32%
|
|======================== | 34%
|
|========================= | 36%
|
|=========================== | 38%
|
|============================ | 40%
|
|============================= | 42%
|
|=============================== | 44%
|
|================================ | 46%
|
|================================== | 48%
|
|=================================== | 50%
|
|==================================== | 52%
|
|====================================== | 54%
|
|======================================= | 56%
|
|========================================= | 58%
|
|========================================== | 60%
|
|=========================================== | 62%
|
|============================================= | 64%
|
|============================================== | 66%
|
|================================================ | 68%
|
|================================================= | 70%
|
|================================================== | 72%
|
|==================================================== | 74%
|
|===================================================== | 76%
|
|======================================================= | 78%
|
|======================================================== | 80%
|
|========================================================= | 82%
|
|=========================================================== | 84%
|
|============================================================ | 86%
|
|============================================================== | 88%
|
|=============================================================== | 90%
|
|================================================================ | 92%
|
|================================================================== | 94%
|
|=================================================================== | 96%
|
|===================================================================== | 98%
|
|======================================================================| 100%
summary(mod.rf)
Model Details:
==============
H2OBinomialModel: drf
Model Key: DRF_model_R_1478695292408_4
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1 50 50 7692571 20
max_depth mean_depth min_leaves max_leaves mean_leaves
1 20 20.00000 10307 13599 12040.10000
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.3786151
RMSE: 0.6153171
LogLoss: 1.151018
Mean Per-Class Error: 0.2949029
AUC: 0.7976158
Gini: 0.5952317
Confusion Matrix for F1-optimal threshold:
0.0 1.0 Error Rate
0.0 340189 268132 0.440774 =268132/608321
1.0 90667 517706 0.149032 =90667/608373
Totals 430856 785838 0.294897 =358799/1216694
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.042523 0.742651 335
2 max f2 0.019249 0.851399 372
3 max f0point5 0.086888 0.733831 278
4 max accuracy 0.064822 0.725218 303
5 max precision 0.940624 0.995327 2
6 max recall 0.001928 1.000000 397
7 max specificity 1.000000 0.999998 0
8 max absolute_mcc 0.065603 0.450594 302
9 max min_per_class_accuracy 0.063052 0.724383 305
10 max mean_per_class_accuracy 0.064822 0.725218 303
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: drf
** Reported on validation data. **
MSE: 0.05037769
RMSE: 0.2244498
LogLoss: 0.1988596
Mean Per-Class Error: 0.3512187
AUC: 0.7386109
Gini: 0.4772219
Confusion Matrix for F1-optimal threshold:
0.0 1.0 Error Rate
0.0 233964 26356 0.101245 =26356/260320
1.0 9375 6219 0.601193 =9375/15594
Totals 243339 32575 0.129500 =35731/275914
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.141814 0.258216 167
2 max f2 0.087443 0.360712 225
3 max f0point5 0.196106 0.228147 119
4 max accuracy 0.945414 0.943475 0
5 max precision 0.246608 0.239838 78
6 max recall 0.000067 1.000000 399
7 max specificity 0.945414 0.999992 0
8 max absolute_mcc 0.121696 0.216605 187
9 max min_per_class_accuracy 0.055326 0.683660 273
10 max mean_per_class_accuracy 0.067090 0.687795 252
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Scoring History:
timestamp duration number_of_trees training_rmse
1 2016-11-09 13:44:10 0.018 sec 0
2 2016-11-09 13:44:19 8.768 sec 2 0.61137
3 2016-11-09 13:45:35 1 min 24.576 sec 39 0.61509
4 2016-11-09 13:46:25 2 min 14.698 sec 50 0.61532
training_logloss training_auc training_lift training_classification_error
1
2 1.47140 0.76941 5.15621 0.31564
3 1.15089 0.79696 6.63000 0.29617
4 1.15102 0.79762 6.54708 0.29490
validation_rmse validation_logloss validation_auc validation_lift
1
2 0.22895 0.30278 0.72628 3.08356
3 0.22446 0.19938 0.73875 4.25437
4 0.22445 0.19886 0.73861 4.06764
validation_classification_error
1
2 0.13080
3 0.13031
4 0.12950
Variable Importances: (Extract with `h2o.varimp`)
=================================================
Variable Importances:
variable relative_importance scaled_importance percentage
1 rela 925030.875000 1.000000 0.276321
2 nocu 810065.062500 0.875717 0.241979
3 esreal 789558.062500 0.853548 0.235853
4 edad 636783.437500 0.688392 0.190217
5 nhijos 186228.718750 0.201322 0.055629
h2o.varimp_plot(mod.rf)
h2o.gainsLift(mod.rf, valid = TRUE)
Gains/Lift Table: Avg response rate: 5,65 %
group cumulative_data_fraction lower_threshold lift cumulative_lift
1 1 0.01018433 0.264715 4.067639 4.067639
2 2 0.02002073 0.251240 4.191962 4.128720
3 3 0.03002385 0.241891 4.365704 4.207677
4 4 0.04005234 0.222118 4.060512 4.170829
5 5 0.05009894 0.207432 3.817018 4.099877
6 6 0.10002754 0.153527 2.997740 3.549747
7 7 0.15002863 0.118713 2.217471 3.105730
8 8 0.20010221 0.089730 1.534231 2.712478
9 9 0.30008988 0.061227 1.087090 2.170911
10 10 0.40064295 0.047714 0.797819 1.826294
11 11 0.50136274 0.038380 0.650697 1.590126
12 12 0.60021601 0.031310 0.526105 1.414885
13 13 0.70006596 0.024364 0.429656 1.274363
14 14 0.80002465 0.018119 0.389413 1.163793
15 15 0.90000507 0.010741 0.333527 1.071560
16 16 1.00000000 0.000000 0.355924 1.000000
response_rate cumulative_response_rate capture_rate cumulative_capture_rate
1 0.229893 0.229893 0.041426 0.041426
2 0.236920 0.233345 0.041234 0.082660
3 0.246739 0.237808 0.043671 0.126331
4 0.229490 0.235725 0.040721 0.167051
5 0.215729 0.231715 0.038348 0.205400
6 0.169425 0.200623 0.149673 0.355072
7 0.125326 0.175528 0.110876 0.465948
8 0.086711 0.153303 0.076824 0.542773
9 0.061440 0.122695 0.108696 0.651469
10 0.045091 0.103218 0.080223 0.731692
11 0.036776 0.089870 0.065538 0.797230
12 0.029734 0.079966 0.052007 0.849237
13 0.024283 0.072024 0.042901 0.892138
14 0.022009 0.065775 0.038925 0.931063
15 0.018850 0.060562 0.033346 0.964409
16 0.020116 0.056518 0.035591 1.000000
gain cumulative_gain
1 306.763903 306.763903
2 319.196200 312.872023
3 336.570350 320.767651
4 306.051177 317.082873
5 281.701763 309.987710
6 199.773974 254.974687
7 121.747134 210.573012
8 53.423081 171.247815
9 8.709048 117.091128
10 -20.218088 82.629363
11 -34.930337 59.012554
12 -47.389515 41.488543
13 -57.034416 27.436277
14 -61.058683 16.379318
15 -66.647314 7.155995
16 -64.407582 0.000000
preds <- h2o.predict(mod.rf, test)
|
| | 0%
|
|============== | 20%
|
|============================ | 40%
|
|======================================================== | 80%
|
|======================================================================| 100%
preds
predict 0.0 1.0
1 0.0 0.9972467 0.0027532898
2 0.0 0.9509684 0.0490315735
3 0.0 0.9883099 0.0116900775
4 0.0 1.0000000 0.0000000000
5 0.0 0.9994901 0.0005098817
6 0.0 0.9993504 0.0006496143
[275914 rows x 3 columns]
Collect en R
preds.df <- as.data.frame(preds)
summary(preds.df)
predict X0.0 X1.0
0.0:243773 Min. :0.05212 Min. :0.00000
1.0: 32141 1st Qu.:0.92867 1st Qu.:0.02088
Median :0.96162 Median :0.03838
Mean :0.93937 Mean :0.06063
3rd Qu.:0.97912 3rd Qu.:0.07133
Max. :1.00000 Max. :0.94788