1 Load the Libraries

We’re going to use tools from the tidyverse, janitor, ggthemes, plyr, and h2o packages.

knitr::opts_chunk$set(echo = TRUE, message = F, warning = F, comment = NA)

library(tidyverse)
library(janitor)
library(ggthemes)
library(plyr)
library(h2o)

2 Ready the data and h2o

First we need to initialize the h2o cluster, which readies our environment for h2o machine learning. The h2o ML workflow requires data to be in the form of a special data frame called an H2OFrame.

h2o.init()
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         14 hours 7 minutes 
    H2O cluster timezone:       America/Los_Angeles 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.24.0.5 
    H2O cluster version age:    1 month and 7 days  
    H2O cluster name:           H2O_started_from_R_michaelespero_eps221 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.99 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.5.2 (2018-12-20) 
df_h2o <- h2o.importFile("dat1_EFA")

  |=================================================================| 100%
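Here the data come from a file on disk. If your data already live in an R data frame, as.h2o() converts it directly into an H2OFrame; a quick sketch, using mtcars purely as a stand-in dataset:

# Convert an in-memory R data frame to an H2OFrame
# (mtcars is only a stand-in; our working data come from h2o.importFile() above)
mtcars_h2o <- as.h2o(mtcars)
class(mtcars_h2o)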

3 Set the hyperparameters and train a GLRM

As you’ve learned in previous chapters, generalized low-rank modeling is a flexible technique. As with much of machine learning, finding the right hyperparameter settings is key to building good models. Apply what you’ve learned so far to set the hyperparameters and train your GLRM.

Keep in mind that finding the best model may require you to explore hyperparameters in a methodical fashion. In your own models this may mean looping through values for k and gamma, experimenting with different loss functions, and noticing how each choice influences the objective function and the model’s ability to approximate the observed data; a sketch of one such sweep follows.
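As a minimal sketch (the grid values below are illustrative assumptions, not recommendations), you could refit the GLRM over each combination of k and gamma and record the final objective for comparison:

sweep <- expand.grid(k = 2:4, gamma = c(0.5, 1, 2))
sweep$objective <- NA_real_

for (i in seq_len(nrow(sweep))) {
  fit <- h2o.glrm(
    training_frame = df_h2o,
    seed = 143,
    k = sweep$k[i],
    gamma_x = sweep$gamma[i], gamma_y = sweep$gamma[i],
    regularization_x = "Quadratic", regularization_y = "Quadratic",
    transform = "STANDARDIZE"
  )
  # Record the converged objective for this hyperparameter setting
  sweep$objective[i] <- fit@model$objective
}

sweep

For this exercise, though, we settle on a single setting: gamma = 1 and rank k = 2.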

gamma <- 1

rank_k <- 2

glrm_r2 <- h2o.glrm(
  training_frame = df_h2o,
  seed = 143,
  k = rank_k,
  gamma_x = gamma, gamma_y = gamma,
  regularization_x = "Quadratic", regularization_y = "Quadratic",
  transform = "STANDARDIZE"
)

  |=================================================================| 100%
glrm_r2_x <- h2o.getFrame(glrm_r2@model$representation_name)
glrm_r2_y <- glrm_r2@model$archetypes

glrm_r2_x %>% head()
       Arch1     Arch2
1  0.6152766 0.4214260
2 -0.2291777 0.8960790
3 -0.2315956 0.8976017
4 -0.1998442 0.8775549
5  1.1696696 0.4479189
6 -0.2291576 0.8960528
glrm_r2_y %>% dim()
[1]   2 968
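As a quick sanity check (a sketch, not part of the original workflow), the product of the X and Y factors should approximate the standardized training data, with shapes n × k and k × p lining up:

# Pull the factors into plain R matrices and multiply them:
# X (n x k) %*% Y (k x p) approximates the standardized data, since we
# trained with transform = "STANDARDIZE".
x_mat <- as.matrix(as.data.frame(glrm_r2_x))
y_mat <- as.matrix(glrm_r2_y)
approx_std <- x_mat %*% y_mat
dim(approx_std)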

4 Assess the Model

As you’ve learned, the GLRM trains in an iterative fashion. You can plot the objective function over these iterations to see how it changed throughout training. You can also look inside the H2ODimReductionModel object to learn about its elements. For instance, you can output the relative importance of the two components resulting from your previous GLRM specification.

plot(glrm_r2)

glrm_r2@model$objective
[1] 60845.49
glrm_r2@model$importance[2, ] %>% round(3)
Importance of components: 
                            pc1      pc2
Proportion of Variance 0.900000 0.100000
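If you want to dig deeper, you can list the elements the model object exposes and pull the per-iteration objective values behind plot(glrm_r2); a sketch (element names can vary across h2o versions):

# Elements stored inside the trained H2ODimReductionModel
names(glrm_r2@model)

# Per-iteration scoring history, including the objective plotted above
h2o.scoreHistory(glrm_r2) %>% head()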

5 Visualize the embedded features over 2 dimensions

One of the most valuable benefits of dimensionality reduction with GLRM is the ability to create compelling visualizations. With a few data processing steps, you can turn the Y matrix into a data frame, complete with labels for each of the embedded features. Finally, you can use this data frame to make a plot that shows the proximity of embedded features of any type across the k dimensions.

y_df <- t(glrm_r2_y) %>%
  data.frame() %>%
  cbind(embedded_features = row.names(t(glrm_r2_y)), .) %>%
  clean_names()

skimr::skim(y_df)
Skim summary statistics
 n obs: 968 
 n variables: 3 

── Variable type:factor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
          variable missing complete   n n_unique
 embedded_features       0      968 968      968
                     top_counts ordered
 aer: 1, aer: 1, aer: 1, aer: 1   FALSE

── Variable type:numeric ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
 variable missing complete   n  mean   sd    p0   p25   p50   p75 p100
    arch1       0      968 968 -0.56 0.76 -2.39 -0.84 -0.83 -0.82 3.54
    arch2       0      968 968 -1.1  0.67 -1.54 -1.47 -1.43 -1.33 2.12
     hist
 ▁▁▇▁▁▁▁▁
 ▇▁▁▂▁▁▁▁
y_plot <- ggplot(y_df, aes(x = arch2, y = arch1)) +
  geom_point() +
  theme_tufte() +
  geom_text(aes(label = embedded_features)) +
  labs(
    x = "Archetype 2 (10% Variance)", y = "Archetype 1 (90% Variance)",
    title = "Variable Space: 968 Embedded Features",
    caption = "Generalized low-rank model with rank 2 and quadratic regularization."
  )

y_plot
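The same recipe works for the X matrix, placing each row (observation) rather than each feature in the two-dimensional archetype space; a sketch reusing glrm_r2_x from above:

# Bring the row representations into R and tidy the column names
x_df <- glrm_r2_x %>%
  as.data.frame() %>%
  clean_names()

# Plot observations in the same archetype coordinates used for the features
ggplot(x_df, aes(x = arch2, y = arch1)) +
  geom_point(alpha = 0.3) +
  theme_tufte() +
  labs(
    x = "Archetype 2", y = "Archetype 1",
    title = "Observation Space: Rows Embedded Over 2 Dimensions"
  )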

6 Congratulations!

You’re no beginner when it comes to generalized low-rank models. Check out these links to get under the hood of the GLRM and explore tuning its hyperparameters to tap into valuable representations of observed data.