1 Load the Libraries

We’re going to use tools from the tidyverse, janitor, ggthemes, plyr, and h2o packages.

knitr::opts_chunk$set(echo = TRUE, message = F, warning = F, comment = NA)

library(tidyverse)
library(janitor)
library(ggthemes)
library(plyr)
library(h2o)

2 Ready the data and h2o

First we need to initialize the h2o cluster, which readies our environment for h2o machine learning. The h2o ML workflow requires data to be in the form of a special data frame called an H2OFrame.

h2o.init()
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         14 hours 7 minutes 
    H2O cluster timezone:       America/Los_Angeles 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.24.0.5 
    H2O cluster version age:    1 month and 7 days  
    H2O cluster name:           H2O_started_from_R_michaelespero_eps221 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.99 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.5.2 (2018-12-20) 
df_h2o <- h2o.importFile("dat1_EFA")

  |=================================================================| 100%
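Here the data come from a file on disk. If your data already live in an R data frame, as.h2o() converts it directly into an H2OFrame; a quick sketch, using mtcars purely as a stand-in dataset:

# Convert an in-memory R data frame to an H2OFrame
# (mtcars is only a stand-in; our working data come from h2o.importFile() above)
mtcars_h2o <- as.h2o(mtcars)
class(mtcars_h2o)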

3 Set the hyperparameters and train a GLRM

As you’ve learned in previous chapters, generalized low-rank modeling is a flexible technique. As with much of machine learning, finding the right hyperparameter settings is key to building good models. Apply what you’ve learned so far to set the hyperparameters and train your GLRM.

Keep in mind that finding the best model may require you to explore hyperparameters in a methodical fashion. In your own models this may mean looping through values for k and gamma, experimenting with different loss functions, and noticing how each choice influences the objective function and the model’s ability to approximate the observed data; a sketch of one such sweep follows.
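As a minimal sketch (the grid values below are illustrative assumptions, not recommendations), you could refit the GLRM over each combination of k and gamma and record the final objective for comparison:

sweep <- expand.grid(k = 2:4, gamma = c(0.5, 1, 2))
sweep$objective <- NA_real_

for (i in seq_len(nrow(sweep))) {
  fit <- h2o.glrm(
    training_frame = df_h2o,
    seed = 143,
    k = sweep$k[i],
    gamma_x = sweep$gamma[i], gamma_y = sweep$gamma[i],
    regularization_x = "Quadratic", regularization_y = "Quadratic",
    transform = "STANDARDIZE"
  )
  # Record the converged objective for this hyperparameter setting
  sweep$objective[i] <- fit@model$objective
}

sweep

For this exercise, though, we settle on a single setting: gamma = 1 and rank k = 2.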

gamma <- 1

rank_k <- 2

glrm_r2 <- h2o.glrm(
  training_frame = df_h2o,
  seed = 143,
  k = rank_k,
  gamma_x = gamma, gamma_y = gamma,
  regularization_x = "Quadratic", regularization_y = "Quadratic",
  transform = "STANDARDIZE"
)

  |=================================================================| 100%
glrm_r2_x <- h2o.getFrame(glrm_r2@model$representation_name)
glrm_r2_y <- glrm_r2@model$archetypes

glrm_r2_x %>% head()
       Arch1     Arch2
1  0.6152766 0.4214260
2 -0.2291777 0.8960790
3 -0.2315956 0.8976017
4 -0.1998442 0.8775549
5  1.1696696 0.4479189
6 -0.2291576 0.8960528
glrm_r2_y %>% dim()
[1]   2 968
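As a quick sanity check (a sketch, not part of the original workflow), the product of the X and Y factors should approximate the standardized training data, with shapes n × k and k × p lining up:

# Pull the factors into plain R matrices and multiply them:
# X (n x k) %*% Y (k x p) approximates the standardized data, since we
# trained with transform = "STANDARDIZE".
x_mat <- as.matrix(as.data.frame(glrm_r2_x))
y_mat <- as.matrix(glrm_r2_y)
approx_std <- x_mat %*% y_mat
dim(approx_std)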

4 Assess the Model

As you’ve learned, the GLRM trains in an iterative fashion. You can plot the objective function over these iterations to see how it changed throughout training. You can also look inside the H2ODimReductionModel object to learn about its elements. For instance, you can output the relative importance of the two components resulting from your previous GLRM specification.

plot(glrm_r2)

glrm_r2@model$objective
[1] 60845.49
glrm_r2@model$importance[2, ] %>% round(3)
Importance of components: 
                            pc1      pc2
Proportion of Variance 0.900000 0.100000
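If you want to dig deeper, you can list the elements the model object exposes and pull the per-iteration objective values behind plot(glrm_r2); a sketch (element names can vary across h2o versions):

# Elements stored inside the trained H2ODimReductionModel
names(glrm_r2@model)

# Per-iteration scoring history, including the objective plotted above
h2o.scoreHistory(glrm_r2) %>% head()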

5 Visualize the embedded features over 2 dimensions

One of the most valuable benefits of dimensionality reduction with GLRM is the ability to create compelling visualizations. With a few data processing steps, you can turn the Y matrix into a data frame, complete with labels for each of the embedded features. Finally, you can use this data frame to make a plot that shows the proximity of embedded features of any type across the k dimensions.

y_df <- t(glrm_r2_y) %>%
  data.frame() %>%
  cbind(embedded_features = row.names(t(glrm_r2_y)), .) %>%
  clean_names()

skimr::skim(y_df)
Skim summary statistics
 n obs: 968 
 n variables: 3 

── Variable type:factor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
          variable missing complete   n n_unique
 embedded_features       0      968 968      968
                     top_counts ordered
 aer: 1, aer: 1, aer: 1, aer: 1   FALSE

── Variable type:numeric ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
 variable missing complete   n  mean   sd    p0   p25   p50   p75 p100
    arch1       0      968 968 -0.56 0.76 -2.39 -0.84 -0.83 -0.82 3.54
    arch2       0      968 968 -1.1  0.67 -1.54 -1.47 -1.43 -1.33 2.12
     hist
 ▁▁▇▁▁▁▁▁
 ▇▁▁▂▁▁▁▁
y_plot <- ggplot(y_df, aes(x = arch2, y = arch1)) +
  geom_point() +
  theme_tufte() +
  geom_text(aes(label = embedded_features)) +
  labs(
    x = "Archetype 2 (10% Variance)", y = "Archetype 1 (90% Variance)",
    title = "Variable Space: 968 Embedded Features",
    caption = "Generalized low-rank model with rank 2 and quadratic regularization."
  )

y_plot
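The same recipe works for the X matrix, placing each row (observation) rather than each feature in the two-dimensional archetype space; a sketch reusing glrm_r2_x from above:

# Bring the row representations into R and tidy the column names
x_df <- glrm_r2_x %>%
  as.data.frame() %>%
  clean_names()

# Plot observations in the same archetype coordinates used for the features
ggplot(x_df, aes(x = arch2, y = arch1)) +
  geom_point(alpha = 0.3) +
  theme_tufte() +
  labs(
    x = "Archetype 2", y = "Archetype 1",
    title = "Observation Space: Rows Embedded Over 2 Dimensions"
  )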

6 Congratulations!

You’re no beginner when it comes to generalized low-rank models. Check out these links to get under the hood of the GLRM and explore tuning its hyperparameters to tap into valuable representations of observed data.