In this example, we present data visualization and model training for machine-learning-based electronic structure predictions. The data consist of electronic structure calculations for 16,243 molecules generated by RoboBohr. The preprint of the article describing RoboBohr can be accessed from here, while the GitHub repo can be accessed from here.
First we load the outcomes, i.e. the ground-state energies calculated with Quantum Espresso.
library(data.table) # provides fread() and the data.table syntax used throughout
outcomes <- fread("./data/out.dat.0_16274", skip = 2, header = FALSE, colClasses = c("integer", "numeric"))
colnames(outcomes) <- c("Id", "E")
For each molecule, the atomization energy is obtained by subtracting the ground-state energies of the isolated atoms that make up the molecule. These terms are included in the file AEsub.out:
AeSub <- fread("./data/AEsub.out", header = FALSE, colClasses = c("integer","numeric"))
colnames(AeSub) <- c("Id", "Esub")
# Merge AeSub and outcomes; compute atomization energies
outcomesAe <- merge(outcomes, AeSub, by = "Id")
outcomesAe[,Eat := E-Esub]
outcomesAe[, E := NULL]; outcomesAe[, Esub := NULL] # Remove unnecessary columns
rm(outcomes,AeSub)
# Scale factor: the negative of the largest absolute atomization energy in the data,
# used below to normalize the targets
scl <- -max(abs(outcomesAe$Eat), na.rm = TRUE)
Now, the data matrix composed of the eigenvalues of the Coulomb matrices is read from file:
CoulombLambda <- fread("./data/coulombL.csv", header = FALSE)
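The eigenvalue features in coulombL.csv are generated upstream by RoboBohr. For context, a standard Coulomb matrix and its sorted eigenvalues for a single molecule could be computed along the following lines (a minimal sketch, not part of the original pipeline; the function name coulomb_eigenvalues and the inputs Z and R are illustrative):
# Sketch: Coulomb matrix eigenvalues for one molecule (illustrative only;
# the actual features in coulombL.csv are produced by RoboBohr)
coulomb_eigenvalues <- function(Z, R) {
  # Z: vector of atomic numbers; R: n x 3 matrix of Cartesian positions (Bohr)
  n <- length(Z)
  C <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i == j) {
        C[i, j] <- 0.5 * Z[i]^2.4  # diagonal: self-interaction term
      } else {
        C[i, j] <- Z[i] * Z[j] / sqrt(sum((R[i, ] - R[j, ])^2))  # off-diagonal: Coulomb repulsion
      }
    }
  }
  # Sorted eigenvalues give a permutation-invariant descriptor of the molecule
  sort(eigen(C, symmetric = TRUE)$values, decreasing = TRUE)
}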
We process the data a little further:
# Assign column names
nam <- paste0('px', 1:(ncol(CoulombLambda)-1))
nam <- c("Id", nam)
colnames(CoulombLambda) <- nam
CoulombLambda[,Id:=as.integer(Id)] # Make Id variable integer
# Match on Id so that there is no mismatch in ordering
combined <- merge(CoulombLambda, outcomesAe, by="Id")
rm(outcomesAe,CoulombLambda)
# Remove NA's (calculations that failed to converge)
l.complete <- complete.cases(combined$Eat)
combined <- combined[l.complete,]
# Store atomization energies in a vector Y and remove unnecessary columns from combined
Y <- combined$Eat
combined[,Eat:=NULL] # No need for Eat
combined[,Id:=NULL] # No need for Id
To visualize the data, we first compute the principal component vectors:
pca <- prcomp(combined, center = TRUE)
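As an optional check that is not part of the original analysis, the fraction of variance captured by the leading components can be inspected:
# Proportion of variance explained by the first few principal components
summary(pca)$importance[, 1:3]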
Then, we can plot the atomization energies as a function of the first two principal components:
library(plot3D)
plot3D::points2D(pca$x[,1], pca$x[,2], type = "p", colvar = -Y,
                 xlab = expression("Z"[1]), ylab = expression("Z"[2]),
                 pch = 19, cex.axis = 1.0, cex.lab = 1.0, clab = "|E| (Ry)")
This plot shows the peculiar nonlinear relationship between the feature vectors and the atomization energies.
The nonlinear behavior illustrated above can be modeled accurately by a boosted regression tree. To train a model and assess its performance, we first split the data into training and test sets:
set.seed(101)
inTrain <- sample(1:dim(combined)[1], size = floor(0.7*dim(combined)[1]), replace = FALSE)
train.Y <- Y[inTrain]; test.Y <- Y[-inTrain]
train.X <- combined[inTrain,]; test.X <- combined[-inTrain,]
Then, we convert the training and test data to XGBoost-style matrices:
# XGBoost style matrices
library(xgboost)
dtrain.X <- xgb.DMatrix(as.matrix(train.X), label = train.Y/scl)
dtest.X <- xgb.DMatrix(as.matrix(test.X), label = test.Y/scl)
# Watchlist
watchlist <- list(train=dtrain.X, test=dtest.X)
The parameters below were determined from 5-fold cross-validation on the training data (the tuning procedure is not shown here; a sketch of such a call follows the parameter list):
# Parameters
param <- list(booster = "gbtree",
              eval_metric = "rmse",
              eta = 0.0156,
              colsample_bytree = 0.4,
              max_depth = 16,
              min_child_weight = 10,
              gamma = 0.0,
              lambda = 1.0,
              subsample = 0.8)
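For reference, a 5-fold cross-validation of this kind could be carried out with xgb.cv; the sketch below is illustrative and does not reproduce the original tuning grid (the choice of 25 early-stopping rounds is an assumption):
# Sketch: 5-fold cross-validation with the chosen parameters (illustrative only)
cv <- xgb.cv(params = param, data = dtrain.X, nrounds = 600, nfold = 5,
             early_stopping_rounds = 25, verbose = 0)
cv$evaluation_log[cv$best_iteration]  # CV RMSE at the best iteration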
We train the model:
# Train the xgboost model
xgb.model <- xgb.train(data = dtrain.X, params = param, watchlist = watchlist, nrounds = 600)
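Because a watchlist was supplied, the train and test RMSE at each round are stored in the fitted model and can be inspected, for example:
# Last few rounds of the train/test RMSE recorded via the watchlist
tail(xgb.model$evaluation_log)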
Then, we predict on the test set and assess performance:
# Predict
pred <- predict(xgb.model, newdata = dtest.X)*scl
# RMSE
sqrt(mean((pred - test.Y)^2))
## [1] 0.1309823
This RMSE is in units of Rydberg (Ry). Finally, we plot an error histogram:
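A simple version of such a histogram could look like the following (a sketch; the original plotting code is not reproduced here):
# Sketch: histogram of test-set prediction errors, in Ry
hist(pred - test.Y, breaks = 50,
     xlab = "Prediction error (Ry)",
     main = "Test-set error distribution")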