We’re going to use tools from the tidyverse, janitor, ggthemes, plyr, and h2o packages.
knitr::opts_chunk$set(echo = TRUE, message = F, warning = F, comment = NA)
library(tidyverse)
library(janitor)
library(ggthemes)
library(plyr)
library(h2o)
First we need to start, or initialize, the h2o cluster. This readies our environment for h2o machine learning. The h2o ML workflow requires data to be held in a special data frame called an H2OFrame.
h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 14 hours 7 minutes
H2O cluster timezone: America/Los_Angeles
H2O data parsing timezone: UTC
H2O cluster version: 3.24.0.5
H2O cluster version age: 1 month and 7 days
H2O cluster name: H2O_started_from_R_michaelespero_eps221
H2O cluster total nodes: 1
H2O cluster total memory: 1.99 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.5.2 (2018-12-20)
# Read the data from disk directly into an H2OFrame
df_h2o <- h2o.importFile("dat1_EFA")
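Here the data is read from disk with h2o.importFile(). If your data is already loaded as a regular R data frame, you can push it to the cluster instead. This is a minimal sketch, assuming a hypothetical in-memory data frame named dat1:
# Hypothetical alternative: dat1 is assumed to be a data frame already in your R session
df_h2o <- as.h2o(dat1, destination_frame = "dat1_EFA")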
As you’ve learned in previous chapters, generalized low-rank modeling is a flexible technique. As with much of machine learning, finding good hyperparameter settings is key to building good models. Apply what you’ve learned so far to set the hyperparameters and train your GLRM.
Keep in mind that finding the best model may require you to explore hyperparameters in a methodical fashion. In your own models this may mean looping through values of k and gamma, experimenting with different loss functions, and observing how each choice affects the objective function and the model’s ability to approximate the observed data.
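As a rough illustration, here is a minimal sketch of such a sweep over k and gamma. The value grids below are hypothetical, and keep in mind that the objective includes the regularization penalty, so it is not a pure measure of reconstruction error:
# Hypothetical grids of ranks and regularization weights to sweep over
k_values <- 2:4
gamma_values <- c(0.5, 1, 2)

tuning_results <- expand.grid(k = k_values, gamma = gamma_values)
tuning_results$objective <- NA_real_

for (i in seq_len(nrow(tuning_results))) {
  fit <- h2o.glrm(
    training_frame = df_h2o,
    seed = 143,
    k = tuning_results$k[i],
    gamma_x = tuning_results$gamma[i], gamma_y = tuning_results$gamma[i],
    regularization_x = "Quadratic", regularization_y = "Quadratic",
    transform = "STANDARDIZE"
  )
  # Record the final value of the objective function for this fit
  tuning_results$objective[i] <- fit@model$objective
}

tuning_results[order(tuning_results$objective), ]
For this walkthrough, we fit a single rank-2 model with quadratic regularization on both X and Y: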
gamma <- 1   # weight on the quadratic regularizers for X and Y
rank_k <- 2  # rank of the decomposition (number of archetypes)

glrm_r2 <- h2o.glrm(
  training_frame = df_h2o,
  seed = 143,
  k = rank_k,
  gamma_x = gamma, gamma_y = gamma,
  regularization_x = "Quadratic", regularization_y = "Quadratic",
  transform = "STANDARDIZE"
)
# X: the low-rank representation of each row (one row per observation, one column per archetype)
glrm_r2_x <- h2o.getFrame(glrm_r2@model$representation_name)
# Y: the archetypes (one row per archetype, one column per embedded feature)
glrm_r2_y <- glrm_r2@model$archetypes
glrm_r2_x %>% head()
       Arch1     Arch2
1  0.6152766 0.4214260
2 -0.2291777 0.8960790
3 -0.2315956 0.8976017
4 -0.1998442 0.8775549
5  1.1696696 0.4479189
6 -0.2291576 0.8960528
glrm_r2_y %>% dim()
[1]   2 968
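Because a GLRM approximates the training data as the product X %*% Y, you can multiply the two matrices to recover a low-rank reconstruction. This is a quick sketch; since the model used transform = "STANDARDIZE", the product approximates the standardized data rather than the raw values:
# Low-rank reconstruction of the (standardized) data: A is approximated by X %*% Y
approx_a <- as.matrix(glrm_r2_x) %*% as.matrix(glrm_r2_y)
dim(approx_a)  # one row per observation, one column per embedded feature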
As you’ve learned, a GLRM trains in an iterative fashion. You can plot the objective function over these iterations to see how it changed throughout training. You can also look inside the H2ODimReductionModel object to learn about its elements. For instance, you can output the relative importance of the two components produced by your previous GLRM specification.
plot(glrm_r2)
glrm_r2@model$objective
[1] 60845.49
glrm_r2@model$importance[2, ] %>% round(3)
Importance of components:
                            pc1      pc2
Proportion of Variance 0.900000 0.100000
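To see what else the fitted H2ODimReductionModel carries, you can list the elements of its model slot. A brief sketch, assuming your h2o version stores the per-iteration metrics (the table that plot() draws) in a scoring history element:
# List the elements stored in the fitted model object
glrm_r2@model %>% names()

# Per-iteration training metrics (assumed present; this is what plot() visualizes)
glrm_r2@model$scoring_history %>% head()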
One of the most valuable benefits of dimensionality reduction with GLRM is the ability to create compelling visualizations. With a few data processing steps, you can turn the Y matrix into a data frame, complete with labels for each of the embedded features. Finally, you can use this data frame to make a plot that shows the proximity of embedded features of any type across the k dimensions.
# Transpose Y so each embedded feature is a row, then attach the feature names as labels
y_df <- t(glrm_r2_y) %>%
  data.frame() %>%
  cbind(embedded_features = row.names(t(glrm_r2_y)), .) %>%
  clean_names()
skimr::skim(y_df)
Skim summary statistics
n obs: 968
n variables: 3
── Variable type:factor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
          variable missing complete   n n_unique                     top_counts ordered
 embedded_features       0      968 968      968 aer: 1, aer: 1, aer: 1, aer: 1   FALSE
── Variable type:numeric ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
 variable missing complete   n  mean   sd    p0   p25   p50   p75 p100     hist
    arch1       0      968 968 -0.56 0.76 -2.39 -0.84 -0.83 -0.82 3.54 ▁▁▇▁▁▁▁▁
    arch2       0      968 968  -1.1 0.67 -1.54 -1.47 -1.43 -1.33 2.12 ▇▁▁▂▁▁▁▁
y_plot <- ggplot(y_df, aes(x = arch2, y = arch1)) +
  geom_point() +
  theme_tufte() +
  geom_text(aes(label = embedded_features)) +
  labs(
    x = "Archetype 2 (10% Variance)", y = "Archetype 1 (90% Variance)",
    title = "Variable Space: 968 Embedded Features",
    caption = "Generalized low-rank model with rank 2 and quadratic regularization."
  )
y_plot
You’re no beginner when it comes to generalized low-rank models. Check out these links to get under the hood of the GLRM and explore tuning its hyperparameters to tap into valuable representations of observed data.