Quick-Start-R-H2O

#Start R, and type 'install.packages("h2o")'.
#install.packages("h2o")

#Let’s check that it worked by typing ‘library(h2o)’

library(h2o)

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
## 
## ----------------------------------------------------------------------

## 
## Attaching package: 'h2o'

## The following objects are masked from 'package:stats':
## 
##     cor, sd, var

## The following objects are masked from 'package:base':
## 
##     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc

#Your next step is to start H2O: > h2o.init()

h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         8 hours 24 minutes 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.44.0.3 
##     H2O cluster version age:    29 days 
##     H2O cluster name:           H2O_started_from_R_maria_egb186 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.94 GB 
##     H2O cluster total cores:    12 
##     H2O cluster allowed cores:  12 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.3.2 (2023-10-31 ucrt)

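#By default, h2o.init() may use only two cores on your machine (the limit in older H2O releases; newer ones may default to all cores) and roughly a quarter of your system memory. Passing nthreads = -1 asks for all available cores. Note that the output above reports an uptime of more than 8 hours, so this call connected to a cluster that was already running rather than starting a new one.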

library(h2o)
h2o.init(nthreads = -1)

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         8 hours 24 minutes 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.44.0.3 
##     H2O cluster version age:    29 days 
##     H2O cluster name:           H2O_started_from_R_maria_egb186 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.94 GB 
##     H2O cluster total cores:    12 
##     H2O cluster allowed cores:  12 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.3.2 (2023-10-31 ucrt)

#load the data and assign it to variables

datasets <- "https://raw.githubusercontent.com/DarrenCook/h2o/bk/datasets/"
data <- h2o.importFile(paste0(datasets, "iris_wheader.csv"))

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

#show the data
data

##   sepal_len sepal_wid petal_len petal_wid       class
## 1       5.1       3.5       1.4       0.2 Iris-setosa
## 2       4.9       3.0       1.4       0.2 Iris-setosa
## 3       4.7       3.2       1.3       0.2 Iris-setosa
## 4       4.6       3.1       1.5       0.2 Iris-setosa
## 5       5.0       3.6       1.4       0.2 Iris-setosa
## 6       5.4       3.9       1.7       0.4 Iris-setosa
## 
## [150 rows x 5 columns]

# update the packages before splitting the data
#update.packages(ask = FALSE, checkBuilt = TRUE)

#assign the response column (y) and the predictor columns (x) used for training
y <- "class"  
x <- setdiff(names(data), y)

#split the data: about 80% of the rows go to training and the remaining 20% are held back to test the trained model
#In R, h2o.splitFrame() takes an H2O frame and returns a list of the splits, which are assigned to train and test for readability.
parts <- h2o.splitFrame(data, 0.8)

#the parts (about 80% and 20% of the rows, respectively) are assigned to the training and testing variables used to run the algorithm
train <- parts[[1]]
test <- parts[[2]]
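
#A minimal base-R sketch of how a random 80/20 split behaves, assuming each row is assigned independently, which is why the two parts are only approximately 80% and 20%. It uses R's built-in iris data frame as a stand-in for the H2O frame; the names df, train_df and test_df and the seed value are illustrative assumptions, not part of the original session.

```r
# Stand-in for the H2O frame: R's built-in iris data (150 rows, 5 columns).
df <- iris

# Fix the random seed so the same split comes out on every run.
set.seed(42)

# Draw one uniform number per row; rows below 0.8 go to the training part.
in_train <- runif(nrow(df)) < 0.8
train_df <- df[in_train, ]
test_df  <- df[!in_train, ]

# The two parts always cover every row, but their sizes depend on the seed.
c(train = nrow(train_df), test = nrow(test_df))
```

#h2o.splitFrame() also accepts a seed argument, which makes its split repeatable in the same way.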

#train a deep learning model with h2o.deeplearning() on the training data and assign it to the variable "m"
m <- h2o.deeplearning(x, y, train)

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

#create predictions for the test data using the trained model m
#this will allow us to evaluate the accuracy of the predictions on data the model did not see during training
p <- h2o.predict(m, test)

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

#evaluate the metrics: 

#Mean squared error (MSE). This represents the average squared error of the model's predictions compared to the actual observations in the data set. Called on the model alone, h2o.mse(m) reports the value for the training data. In this case, the training MSE is about 0.085.
h2o.mse(m)

## [1] 0.08499313
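
#A small worked example of the arithmetic, assuming the MSE of a classifier is the mean squared shortfall of the probability given to the true class (my understanding of how H2O computes it, not something confirmed by this session). The probabilities below are made up for illustration.

```r
# Hypothetical predicted probabilities for three test flowers (rows sum to 1).
probs <- matrix(
  c(0.99, 0.01, 0.00,
    0.10, 0.80, 0.10,
    0.05, 0.35, 0.60),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("Iris-setosa", "Iris-versicolor", "Iris-virginica"))
)
actual <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica")

# Probability the model gave to the true class of each row.
p_true <- probs[cbind(seq_along(actual), match(actual, colnames(probs)))]

# Mean squared shortfall from a perfect probability of 1.
mse <- mean((1 - p_true)^2)
mse
```

#Under this assumption a value such as 0.085 is not a percentage of wrong answers; it reflects how far the predicted probabilities fall short of 1 for the correct class.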

#The confusion matrix tabulates the actual class of each flower (rows) against the class the model predicted (columns). Correct predictions sit on the diagonal, and an error rate is calculated for each class. The overall error, and therefore the overall accuracy, aggregates the results across all classes, in this case each type of flower. Called on the model alone, it is computed on the training data.
h2o.confusionMatrix(m)
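
#To make the layout concrete, here is a base-R sketch that rebuilds the test-set confusion matrix printed by h2o.performance() further down and recomputes its error columns. The label vectors are reconstructed from those counts; the variable names are illustrative.

```r
classes <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica")

# Actual and predicted labels rebuilt from the counts in the test-set matrix:
# 7 setosa, 8 versicolor, and 8 virginica, one of which was called versicolor.
actual    <- factor(rep(classes, times = c(7, 8, 8)), levels = classes)
predicted <- factor(c(rep(classes, times = c(7, 8, 7)), "Iris-versicolor"),
                    levels = classes)

# Rows are actual classes, columns are predicted classes.
cm <- table(Actual = actual, Predicted = predicted)

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall error: all off-diagonal counts over all rows.
total_error <- 1 - sum(diag(cm)) / sum(cm)

cm
round(class_error, 4)
round(total_error, 4)
```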

#The following command creates a data frame with the predicted class and a probability for each class on every row. The three probabilities add up to 1 (100%). The class with the highest probability is the prediction for that row.

#In the first row, the model predicted Iris-setosa with a probability of about 99%.
as.data.frame(p)
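
#A short base-R sketch of how such a table can be read, for a multiclass model like this one. The data frame below is hypothetical: its column names and values are made up and only mimic the general shape of the prediction table.

```r
# Hypothetical class probabilities for three rows, one column per class.
pred <- data.frame(
  setosa     = c(0.99, 0.02, 0.01),
  versicolor = c(0.01, 0.90, 0.30),
  virginica  = c(0.00, 0.08, 0.69)
)

# Each row of class probabilities sums to 1.
rowSums(pred)

# The predicted class is the column holding the largest probability in each row.
pred$predict <- names(pred)[max.col(as.matrix(pred), ties.method = "first")]
pred
```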

#which ones, if any, did H2O’s model get wrong?
as.data.frame( h2o.cbind(p$predict, test$class) )
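
#With more than a handful of rows it is easier to filter for the mismatches than to read the table by eye. A minimal sketch, using a hypothetical data frame res in place of the real result, with the same two column names (predict and class):

```r
# Hypothetical stand-in for as.data.frame(h2o.cbind(p$predict, test$class)).
res <- data.frame(
  predict = factor(c("Iris-setosa", "Iris-versicolor", "Iris-versicolor")),
  class   = factor(c("Iris-setosa", "Iris-versicolor", "Iris-virginica"))
)

# Compare as character: the two factors may not share the same level set.
wrong <- as.character(res$predict) != as.character(res$class)

# Rows the model got wrong, and how many there are.
res[wrong, ]
sum(wrong)
```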

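# In this run the model predicted the class of all 25 test flowers correctly. Because h2o.splitFrame() splits the rows at random, the size of the test set and the results change from run to run; the performance output below comes from a run with 23 test flowers, one of which was misclassified. A perfect score on such a small test set does not show that the model will work equally well on a different data set, and it may be overfitting. If we add data of a new kind to the current data set, such as a new class, the model will not be able to predict the new class without retraining.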

#evaluate the performance of the model
h2o.performance(m, test)

## H2OMultinomialMetrics: deeplearning
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.04363839
## RMSE: (Extract with `h2o.rmse`) 0.208898
## Logloss: (Extract with `h2o.logloss`) 0.3685096
## Mean Per-Class Error: 0.04166667
## AUC: (Extract with `h2o.auc`) NaN
## AUCPR: (Extract with `h2o.aucpr`) NaN
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##                 Iris-setosa Iris-versicolor Iris-virginica  Error     Rate
## Iris-setosa               7               0              0 0.0000 =  0 / 7
## Iris-versicolor           0               8              0 0.0000 =  0 / 8
## Iris-virginica            0               1              7 0.1250 =  1 / 8
## Totals                    7               9              7 0.0435 = 1 / 23
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.956522
## 2 2  1.000000
## 3 3  1.000000

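#The top-k hit ratio is the share of test rows for which the actual class was among the model's k most probable predictions. For k = 1 this is the overall accuracy, which corresponds to the diagonal of the confusion matrix.

#The performance analysis shows a hit ratio of about 95.7% (22 of 23) for k = 1 and 100% for k = 2 and k = 3. With only three classes, the top-3 hit ratio is always 100%. These figures alone do not confirm that the model overfits the data; the test MSE shown above (0.044) is lower than the training MSE reported earlier (0.085), which is not what overfitting usually looks like.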
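
#The top-k hit ratio can be reproduced by hand. The sketch below uses made-up probabilities for four rows and hypothetical variable names; it illustrates the definition and is not the code H2O runs internally.

```r
# Hypothetical class probabilities for four test rows (columns are classes).
probs <- matrix(
  c(0.90, 0.07, 0.03,
    0.10, 0.60, 0.30,
    0.05, 0.50, 0.45,
    0.20, 0.30, 0.50),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)
actual <- c("setosa", "versicolor", "virginica", "virginica")

# Rank of the true class in each row: 1 means it had the highest probability.
true_rank <- sapply(seq_along(actual), function(i) {
  ordered <- names(sort(probs[i, ], decreasing = TRUE))
  match(actual[i], ordered)
})

# Top-k hit ratio: share of rows whose true class is within the k best guesses.
hit_ratio <- sapply(1:3, function(k) mean(true_rank <= k))
hit_ratio
```

#Here the third row is wrong at k = 1 but counted as a hit at k = 2, because the true class was the model's second choice.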

#In my opinion, this model can be improved by adding more data. This could be in the form of more observations or additional features that could improve the robustness of the entire algorithm.