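Course overview: this course does not cover the H2O machine learning platform exhaustively. It focuses on the most important and practical topics. Pointers to further material are given at the end for students who want to go deeper.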

1. Introduction to the H2O machine learning platform

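H2O is a fully open-source, distributed, in-memory machine learning platform with linear scalability. It supports the most widely used statistical and machine learning algorithms, including gradient boosting machines, generalized linear models, and deep learning. H2O also provides AutoML, which automatically runs through the supported algorithms and their hyperparameters and produces a leaderboard of the best models. The platform is used by more than 14,000 organizations worldwide and is very popular in the R and Python communities.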

1.1. Main features of the H2O machine learning platform

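Leading machine learning algorithms, such as GLM, GBM, XGBoost, GLRM, and Word2Vec.
Easy to use from R or Python, or through H2O Flow, an interactive web interface that lets you build machine learning models without writing any code.
Distributed in-memory processing, with fast serialization between nodes and clusters to support very large datasets. Fine-grained parallelism speeds up distributed processing of big data by up to 100x without reducing computational accuracy.
Easy deployment: models can be saved in POJO and MOJO formats and deployed in any environment for fast prediction.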

2. Using H2O in R

2.1 Installation

# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
pkgs <- c("RCurl","jsonlite")
for (pkg in pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}

# Now we download and install the H2O package for R.
install.packages("h2o", type="source", repos="http://h2o-release.s3.amazonaws.com/h2o/rel-xu/3/R")

Running the code above in RStudio downloads and installs h2o.

2.2 A simple example

library(h2o)

## Warning: package 'h2o' was built under R version 3.4.4

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------

## 
## Attaching package: 'h2o'

## The following objects are masked from 'package:stats':
## 
##     cor, sd, var

## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc

h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 days 3 hours 
##     H2O cluster timezone:       Asia/Shanghai 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.20.0.8 
##     H2O cluster version age:    4 months and 22 days !!! 
##     H2O cluster name:           H2O_started_from_R_milin_qtp673 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.61 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.3 (2017-11-30)

## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 22 days)!
## Please download and install the latest version from http://h2o.ai/download/

iris_h2o <- as.h2o(iris)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

h2o_ran <- h2o.randomForest(x = setdiff(names(iris),names(iris)[5]),y = names(iris)[5]
                            ,training_frame = iris_h2o)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |==============================================                   |  70%
  |                                                                       
  |=================================================================| 100%

h2o_ran

## Model Details:
## ==============
## 
## H2OMultinomialModel: drf
## Model ID:  DRF_model_R_1549896359989_4 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              50                      150               21241         1
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         9    3.80000          2         14     6.31333
## 
## 
## H2OMultinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("iris")`
## MSE: (Extract with `h2o.mse`) 0.03630175
## RMSE: (Extract with `h2o.rmse`) 0.1905302
## Logloss: (Extract with `h2o.logloss`) 0.1167547
## Mean Per-Class Error: 0.04666667
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##            setosa versicolor virginica  Error      Rate
## setosa         50          0         0 0.0000 =  0 / 50
## versicolor      0         47         3 0.0600 =  3 / 50
## virginica       0          4        46 0.0800 =  4 / 50
## Totals         50         51        49 0.0467 = 7 / 150
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.953333
## 2 2  1.000000
## 3 3  1.000000

pre <- predict(h2o_ran,iris_h2o)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

pre

##   predict    setosa versicolor   virginica
## 1  setosa 0.9987384          0 0.001261564
## 2  setosa 0.9987384          0 0.001261564
## 3  setosa 0.9987384          0 0.001261564
## 4  setosa 0.9987384          0 0.001261564
## 5  setosa 0.9987384          0 0.001261564
## 6  setosa 0.9987384          0 0.001261564
## 
## [150 rows x 4 columns]

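The example above uses the built-in iris dataset to build a random forest model on the H2O platform and then makes predictions with it.

This example is deliberately simple. Next, we look at how to use h2o for each step on the way from data to model.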

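2.3 Input, output, and data processing

Reading data

h2o.importFile imports a file from a local path into h2o
as.h2o converts an R object, such as a data frame, into an h2o data structure (an H2OFrame)
h2o.loadModel loads a previously saved model

Writing data

h2o.exportFile writes a frame held on the h2o platform to a local file
h2o.saveModel saves a trained model
h2o.download_pojo downloads a trained model as Java source code (POJO)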
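The input and output functions listed above can be combined into a short round trip. The following is a minimal sketch, assuming a local H2O cluster started from R and a writable temporary directory. The file names, the small random forest, and the use of `h2o.download_mojo` (the MOJO counterpart of `h2o.download_pojo`) are choices made for this illustration.

```r
library(h2o)
h2o.init()

# Convert the built-in iris data frame into an H2OFrame.
iris_h2o <- as.h2o(iris)

# Write the frame to a CSV file, then import it again.
# The path under tempdir() is a placeholder chosen for this sketch.
csv_path <- file.path(tempdir(), "iris_export.csv")
h2o.exportFile(iris_h2o, path = csv_path, force = TRUE)
iris_back <- h2o.importFile(csv_path)

# Train a small model, save it to disk, and load it back.
model <- h2o.randomForest(
  x = names(iris)[1:4],
  y = "Species",
  training_frame = iris_h2o,
  ntrees = 10,
  seed = 1
)
model_path <- h2o.saveModel(model, path = tempdir(), force = TRUE)
model_back <- h2o.loadModel(model_path)

# Export the same model as a MOJO archive for deployment outside R.
mojo_file <- h2o.download_mojo(model, path = tempdir())
```

`h2o.saveModel` returns the full path of the saved model, which is what `h2o.loadModel` expects. Note that a saved model can generally only be loaded by the same H2O version that wrote it.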

Splitting data

h2o.splitFrame splits a frame into subsets, for example a training set and a test set
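A minimal sketch of splitting, assuming a local H2O cluster. The seed value and the variable names are arbitrary choices for this illustration. The split is probabilistic, so `ratios = 0.8` gives roughly, not exactly, 80% of the rows to the first subset.

```r
library(h2o)
h2o.init()

iris_h2o <- as.h2o(iris)

# Ask for roughly 80% training rows and 20% test rows.
# Setting a seed makes the split reproducible.
parts <- h2o.splitFrame(data = iris_h2o, ratios = 0.8, seed = 1234)
train <- parts[[1]]
test  <- parts[[2]]

# The two subsets together contain every row of the original frame.
nrow(train)
nrow(test)
```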

2.4 Machine learning models

Supervised models

h2o.deeplearning
h2o.gbm
h2o.glm
h2o.naiveBayes
h2o.randomForest
h2o.xgboost

Unsupervised models

h2o.prcomp
h2o.kmeans
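As a rough illustration of the two unsupervised functions, the sketch below clusters the four numeric iris columns with k-means and projects them onto two principal components. The values `k = 3`, `k = 2`, the seed, and the choice to standardize are assumptions made for this example, not recommendations.

```r
library(h2o)
h2o.init()

iris_h2o <- as.h2o(iris)
features <- names(iris)[1:4]  # the numeric columns; Species is left out

# k-means with three clusters.
km <- h2o.kmeans(training_frame = iris_h2o, x = features, k = 3, seed = 1)
h2o.centers(km)                          # cluster centers
clusters <- h2o.predict(km, iris_h2o)    # cluster assignment for each row

# PCA keeping the first two principal components.
pca <- h2o.prcomp(
  training_frame = iris_h2o,
  x = features,
  k = 2,
  transform = "STANDARDIZE"
)
scores <- h2o.predict(pca, iris_h2o)     # one column per component
head(scores)
```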

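Common model parameters

Different models have different parameters to tune. This course does not cover tuning for any particular model and only introduces the parameters that the models share.

stopping_metric the metric used to decide whether to stop training early; it only takes effect when stopping_rounds is greater than 0

Regression models:
MSE
deviance

Classification models:
misclassification
mean_per_class_error
logloss
MSE
AUC

The default is AUTO, which resolves to logloss for classification models and deviance for regression models.

x names (or column indices) of the predictor variables
y name (or column index) of the response variable
training_frame the dataset used to train the model
ignore_const_cols whether to drop constant columns
validation_frame the dataset used to evaluate the model
stopping_tolerance the relative improvement in the stopping metric below which training stops
max_runtime_secs the maximum training time, in seconds
model_id the name (ID) of the model
nfolds the number of folds for cross-validation
fold_assignment how the training data is assigned to folds; options include 'Random' and 'Modulo'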
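The parameters above can be seen together in one call. This is a sketch only, assuming a local H2O cluster; the GBM algorithm, the model ID, and all numeric values are illustrative choices. `stopping_rounds` and `score_tree_interval` are not in the list above; `stopping_rounds` is included because early stopping is switched off while it is 0.

```r
library(h2o)
h2o.init()

iris_h2o <- as.h2o(iris)
parts <- h2o.splitFrame(data = iris_h2o, ratios = 0.8, seed = 1234)

gbm <- h2o.gbm(
  x = names(iris)[1:4],
  y = "Species",
  training_frame = parts[[1]],
  validation_frame = parts[[2]],
  model_id = "gbm_early_stopping",   # hypothetical name for this sketch
  ignore_const_cols = TRUE,
  nfolds = 5,
  fold_assignment = "Modulo",
  ntrees = 200,                      # upper bound; early stopping may use fewer
  score_tree_interval = 5,           # score every 5 trees
  stopping_rounds = 3,               # must be > 0 for the stopping_* settings to apply
  stopping_metric = "logloss",
  stopping_tolerance = 0.001,
  max_runtime_secs = 60,
  seed = 1
)

# Number of trees actually built.
gbm@model$model_summary$number_of_trees
```

With these settings, training stops once logloss has failed to improve by at least 0.1% (relative) over 3 consecutive scoring events, or when the time limit is reached.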

Evaluating models

h2o.mse
h2o.confusionMatrix
h2o.performance
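A minimal sketch of how these evaluation functions fit together, assuming a local H2O cluster. The small random forest and the split are there only to have something to evaluate. `h2o.performance` computes metrics on a given frame, and the other functions extract individual values from its result.

```r
library(h2o)
h2o.init()

iris_h2o <- as.h2o(iris)
parts <- h2o.splitFrame(data = iris_h2o, ratios = 0.8, seed = 1234)
train <- parts[[1]]
test  <- parts[[2]]

model <- h2o.randomForest(
  x = names(iris)[1:4],
  y = "Species",
  training_frame = train,
  ntrees = 20,
  seed = 1
)

# Compute metrics on held-out data, then pull out individual values.
perf <- h2o.performance(model, newdata = test)
h2o.mse(perf)
h2o.logloss(perf)
h2o.confusionMatrix(perf)
```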

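2.5 A more complex example

This example sets several of the common parameters and evaluates the model:

# Split the data into a training set (about 80%) and a test set (about 20%).
tmp <- h2o.splitFrame(data = iris_h2o, ratios = 0.8)

iris_h2o.train <- tmp[[1]]
iris_h2o.test <- tmp[[2]]


# Train on the training split only, so that the test split stays unseen.
# stopping_metric is ignored unless stopping_rounds > 0 (see the warning below).
h2o_ran <- h2o.randomForest(
  x = setdiff(names(iris), names(iris)[5]),
  y = names(iris)[5],
  training_frame = iris_h2o.train,
  model_id = 'first_model',
  nfolds = 10,
  validation_frame = iris_h2o.test,
  stopping_metric = 'AUC',
  stopping_tolerance = 0.001
)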

## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Stopping metric is ignored for _stopping_rounds=0..

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=================================================================| 100%

pre <- h2o.predict(h2o_ran,newdata = iris_h2o.test)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

pre

##   predict    setosa versicolor   virginica
## 1  setosa 0.9977802          0 0.002219758
## 2  setosa 0.9977802          0 0.002219758
## 3  setosa 0.9977802          0 0.002219758
## 4  setosa 0.9977802          0 0.002219758
## 5  setosa 0.9977802          0 0.002219758
## 6  setosa 0.9977802          0 0.002219758
## 
## [32 rows x 4 columns]

h2o.performance(h2o_ran,iris_h2o.test)

## H2OMultinomialMetrics: drf
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.009922785
## RMSE: (Extract with `h2o.rmse`) 0.09961318
## Logloss: (Extract with `h2o.logloss`) 0.04358814
## Mean Per-Class Error: 0
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##            setosa versicolor virginica  Error     Rate
## setosa          8          0         0 0.0000 =  0 / 8
## versicolor      0          8         0 0.0000 =  0 / 8
## virginica       0          0        16 0.0000 = 0 / 16
## Totals          8          8        16 0.0000 = 0 / 32
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  1.000000
## 2 2  1.000000
## 3 3  1.000000

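3. H2O Flow

H2O Flow is the open-source user interface for H2O. It is a web-based interactive environment.

H2O Flow lets you use H2O interactively to import files, build models, and improve them iteratively. You can then make predictions with your models, all within Flow's browser-based environment.

Flow's hybrid user interface blends command-line computing with a modern graphical user interface. Rather than displaying output as plain text, Flow provides a point-and-click interface for every H2O operation, and it lets you access any H2O object as well-organized tabular data.

H2O Flow sends commands to H2O as a sequence of executable cells. Cells can be modified, rearranged, or saved to a library. Each cell contains an input field where you can enter commands, define functions, call other functions, and access other cells or objects on the page. When a cell is executed, its output is a graphical object that can be inspected for further details.

Although H2O Flow supports R scripts, no programming experience is needed to run it. You can build models entirely with the mouse, without writing any code. H2O Flow is designed to guide you through every step with input prompts, interactive help, and example flows.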

3.1 Installing H2O Flow

If h2o is already installed in RStudio, simply run the following in RStudio:

library(h2o)
h2o.init()

Then open a browser and go to http://127.0.0.1:54321 to open H2O Flow.

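If you are working on a server, RStudio is not needed. Download the H2O archive directly (running it requires Java):

cd ~/Downloads
wget http://h2o-release.s3.amazonaws.com/h2o/rel-xu/3/h2o-3.22.1.3.zip

Then run from the terminal:

cd ~/Downloads
unzip h2o-3.22.1.3.zip
cd h2o-3.22.1.3
java -jar h2o.jar

Open in a browser:

http://localhost:54321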

3.2 Basic usage of H2O Flow

Import the data

The data is available at the following link:

https://raw.githubusercontent.com/leestott/IrisData/master/irisTrainData.txt

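View the data

Split the data

Build a model

Set the parameters

Once the model has been built successfully, view the model

Make predictions

View the prediction results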

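4. Further study

After this course, you should be able to build machine learning models with h2o and to use H2O Flow to build models in a simpler way.

Some details are not covered in this course. For students who want to go further, I provide two resources:

The H2O machine learning platform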

Liam

2019/2/11