1 Introduction

The aim of this challenge is to predict the band gap and formation energies of sesquioxides with varying compositions of the metals aluminium (Al), gallium (Ga), and indium (In). Band gap energy is an important property for optoelectronic applications, whereas formation energy is an important indicator of a material's stability.

Along with the tabular dataset, a geometry file is provided for each observation; it contains the primitive translation lattice vectors and the spatial positions of the atoms in that compound. These geometry files can be used to gain insights and to engineer features that correlate well with the targets.

1.1 Load libraries and data files

library(tidyverse)
library(caret)
library(CORElearn)
library(corrplot)
library(gridExtra)
library(GGally)
library(plotly)
library(DT)
train <- read_csv('train.csv')
test  <- read_csv('test.csv')

1.2 Train data

summary(train)
##        id           spacegroup    number_of_total_atoms percent_atom_al 
##  Min.   :   1.0   Min.   : 12.0   Min.   :10.00         Min.   :0.0000  
##  1st Qu.: 600.8   1st Qu.: 33.0   1st Qu.:40.00         1st Qu.:0.1667  
##  Median :1200.5   Median :194.0   Median :80.00         Median :0.3750  
##  Mean   :1200.5   Mean   :141.5   Mean   :61.68         Mean   :0.3854  
##  3rd Qu.:1800.2   3rd Qu.:206.0   3rd Qu.:80.00         3rd Qu.:0.5833  
##  Max.   :2400.0   Max.   :227.0   Max.   :80.00         Max.   :1.0000  
##  percent_atom_ga  percent_atom_in  lattice_vector_1_ang
##  Min.   :0.0000   Min.   :0.0000   Min.   : 3.037      
##  1st Qu.:0.0938   1st Qu.:0.0625   1st Qu.: 6.141      
##  Median :0.2812   Median :0.2500   Median : 9.537      
##  Mean   :0.3086   Mean   :0.3060   Mean   :10.030      
##  3rd Qu.:0.4688   3rd Qu.:0.4688   3rd Qu.:10.292      
##  Max.   :1.0000   Max.   :1.0000   Max.   :24.913      
##  lattice_vector_2_ang lattice_vector_3_ang lattice_angle_alpha_degree
##  Min.   : 2.942       Min.   : 5.673       Min.   : 82.74            
##  1st Qu.: 5.834       1st Qu.: 9.298       1st Qu.: 90.00            
##  Median : 6.383       Median :10.125       Median : 90.00            
##  Mean   : 7.087       Mean   :12.593       Mean   : 90.24            
##  3rd Qu.: 9.093       3rd Qu.:14.372       3rd Qu.: 90.01            
##  Max.   :10.290       Max.   :25.346       Max.   :101.23            
##  lattice_angle_beta_degree lattice_angle_gamma_degree
##  Min.   : 81.64            Min.   : 29.73            
##  1st Qu.: 90.00            1st Qu.: 90.00            
##  Median : 90.00            Median : 90.00            
##  Mean   : 92.40            Mean   : 94.79            
##  3rd Qu.: 90.01            3rd Qu.:120.00            
##  Max.   :106.17            Max.   :120.05            
##  formation_energy_ev_natom bandgap_energy_ev
##  Min.   :0.0000            Min.   :0.0001   
##  1st Qu.:0.1056            1st Qu.:1.2785   
##  Median :0.1818            Median :1.9079   
##  Mean   :0.1876            Mean   :2.0772   
##  3rd Qu.:0.2563            3rd Qu.:2.7620   
##  Max.   :0.6572            Max.   :5.2861
glimpse(train)
## Observations: 2,400
## Variables: 14
## $ id                         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
## $ spacegroup                 <dbl> 33, 194, 227, 167, 194, 227, 206, 1...
## $ number_of_total_atoms      <dbl> 80, 80, 40, 30, 80, 40, 80, 20, 80,...
## $ percent_atom_al            <dbl> 0.6250, 0.6250, 0.8125, 0.7500, 0.0...
## $ percent_atom_ga            <dbl> 0.3750, 0.3750, 0.1875, 0.0000, 0.6...
## $ percent_atom_in            <dbl> 0.0000, 0.0000, 0.0000, 0.2500, 0.3...
## $ lattice_vector_1_ang       <dbl> 9.9523, 6.1840, 9.7510, 5.0036, 6.6...
## $ lattice_vector_2_ang       <dbl> 8.5513, 6.1838, 5.6595, 5.0034, 6.6...
## $ lattice_vector_3_ang       <dbl> 9.1775, 23.6287, 13.9630, 13.5318, ...
## $ lattice_angle_alpha_degree <dbl> 90.0026, 90.0186, 90.9688, 89.9888,...
## $ lattice_angle_beta_degree  <dbl> 90.0023, 89.9980, 91.1228, 90.0119,...
## $ lattice_angle_gamma_degree <dbl> 90.0017, 120.0025, 30.5185, 120.001...
## $ formation_energy_ev_natom  <dbl> 0.0680, 0.2490, 0.1821, 0.2172, 0.0...
## $ bandgap_energy_ev          <dbl> 3.4387, 2.9210, 2.7438, 3.3492, 1.3...

We observe that:

  • There are 11 feature variables besides the id column and the two targets, i.e. formation energy and band gap energy (both in electron volts), for the 2,400 observations in the train data set.
  • The percent_atom variables give the fraction of each element among the metal atoms only, i.e. oxygen is excluded (see the quick check after this list).
  • The lattice_vector variables give the lengths of the primitive translation lattice vectors in angstroms.
  • The lattice_angle variables give the angles between the lattice vectors in degrees.
  • The spacegroup variable encodes the symmetry of the crystal, so it can be treated as a categorical variable.
  • The number_of_total_atoms variable is self-explanatory.
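
Since the percent_atom variables are fractions of the metal sites only, they should sum to one for every compound. A minimal sanity check (a sketch added here, not part of the original analysis) could look like this:

# Metal fractions should sum to 1 for every observation
train %>%
  mutate(metal_total = percent_atom_al + percent_atom_ga + percent_atom_in) %>%
  summarise(min_total = min(metal_total), max_total = max(metal_total))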

1.3 Test data

summary(test)
##        id          spacegroup    number_of_total_atoms percent_atom_al 
##  Min.   :  1.0   Min.   : 12.0   Min.   :10.00         Min.   :0.0000  
##  1st Qu.:150.8   1st Qu.: 33.0   1st Qu.:40.00         1st Qu.:0.1250  
##  Median :300.5   Median :194.0   Median :80.00         Median :0.3750  
##  Mean   :300.5   Mean   :139.6   Mean   :61.73         Mean   :0.3710  
##  3rd Qu.:450.2   3rd Qu.:206.0   3rd Qu.:80.00         3rd Qu.:0.5625  
##  Max.   :600.0   Max.   :227.0   Max.   :80.00         Max.   :1.0000  
##  percent_atom_ga  percent_atom_in  lattice_vector_1_ang
##  Min.   :0.0000   Min.   :0.0000   Min.   : 3.073      
##  1st Qu.:0.0938   1st Qu.:0.0625   1st Qu.: 6.137      
##  Median :0.2500   Median :0.2812   Median : 9.495      
##  Mean   :0.3133   Mean   :0.3157   Mean   :10.098      
##  3rd Qu.:0.4688   3rd Qu.:0.4688   3rd Qu.:10.363      
##  Max.   :0.9688   Max.   :0.9688   Max.   :24.913      
##  lattice_vector_2_ang lattice_vector_3_ang lattice_angle_alpha_degree
##  Min.   : 2.960       Min.   : 5.698       Min.   : 83.74            
##  1st Qu.: 5.829       1st Qu.: 9.309       1st Qu.: 90.00            
##  Median : 6.398       Median :10.097       Median : 90.00            
##  Mean   : 7.082       Mean   :12.442       Mean   : 90.16            
##  3rd Qu.: 9.157       3rd Qu.:14.328       3rd Qu.: 90.01            
##  Max.   :10.249       Max.   :25.306       Max.   :100.95            
##  lattice_angle_beta_degree lattice_angle_gamma_degree
##  Min.   : 82.75            Min.   : 29.72            
##  1st Qu.: 90.00            1st Qu.: 90.00            
##  Median : 90.00            Median : 90.00            
##  Mean   : 92.49            Mean   : 96.33            
##  3rd Qu.: 90.01            3rd Qu.:120.00            
##  Max.   :105.98            Max.   :120.05
glimpse(test)
## Observations: 600
## Variables: 12
## $ id                         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
## $ spacegroup                 <dbl> 33, 33, 167, 12, 12, 33, 167, 33, 3...
## $ number_of_total_atoms      <dbl> 80, 80, 30, 80, 80, 40, 30, 80, 80,...
## $ percent_atom_al            <dbl> 0.1875, 0.7500, 0.6667, 0.5625, 0.1...
## $ percent_atom_ga            <dbl> 0.4688, 0.2500, 0.1667, 0.4375, 0.5...
## $ percent_atom_in            <dbl> 0.3438, 0.0000, 0.1667, 0.0000, 0.3...
## $ lattice_vector_1_ang       <dbl> 10.5381, 9.8938, 4.9811, 24.3370, 2...
## $ lattice_vector_2_ang       <dbl> 9.0141, 8.5014, 4.9808, 6.0091, 6.2...
## $ lattice_vector_3_ang       <dbl> 9.6361, 9.1298, 13.4799, 5.7620, 6....
## $ lattice_angle_alpha_degree <dbl> 89.9997, 90.0038, 89.9900, 89.9995,...
## $ lattice_angle_beta_degree  <dbl> 90.0003, 90.0023, 90.0109, 103.8581...
## $ lattice_angle_gamma_degree <dbl> 90.0006, 90.0015, 120.0014, 90.0002...

The test data contains 600 observations (i.e. 20% of the total 3,000 materials). All features show distributions similar to those in the train data.
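
To make this comparison concrete, a small sketch (assuming the feature columns are identical apart from id and the targets) can line up the feature means of the two sets:

# Side-by-side feature means for train and test (id and targets excluded)
bind_rows(
  train %>% select(-id, -contains("energy")) %>% summarise(across(everything(), mean)) %>% mutate(set = "train"),
  test  %>% select(-id) %>% summarise(across(everything(), mean)) %>% mutate(set = "test")
)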

1.4 Missing values

sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0

There are no missing values.

1.5 Near-zero variance variables

nearZeroVar(train)
## integer(0)

There are no near-zero variance variables.
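
For reference, nearZeroVar() can also report the diagnostics behind this conclusion; a minimal sketch:

# Frequency ratio and percentage of unique values for each column
nearZeroVar(train, saveMetrics = TRUE) %>% head()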

1.6 Individual variable plots

f1 <- train %>% ggplot(aes(formation_energy_ev_natom)) + 
  geom_histogram(bins = 100, fill = "red", alpha = 0.5) + 
  scale_x_log10()

f2 <- train %>% ggplot(aes(bandgap_energy_ev)) + 
  geom_histogram(bins = 100, fill = "blue", alpha = 0.5) + 
  scale_x_log10(breaks = c(0.1,0.5,1,5))

f3 <- train %>%
  ggplot(aes(number_of_total_atoms, fill = as.factor(number_of_total_atoms))) +
  geom_bar(width = 8) +
  theme(legend.position = "none")

f4 <- train %>% mutate(spacegroup = as.factor(spacegroup)) %>%
  ggplot(aes(spacegroup, fill = spacegroup)) +
  geom_bar() +
  theme(legend.position = "none")

grid.arrange(f1, f2, f3, f4, ncol = 2)

We observe that:

  • The formation energies are largely between 0.01 and 0.5 eV.
  • The bandgap energies are in the range of a few eV, with a peak around 1 eV. Values above 5 eV are very rare.
  • Compounds with 10, 20 and 60 atoms are rare, while those with 80 atoms predominate.
  • The symmetry spacegroups take only 6 different values. Their frequencies are much more balanced, ranging from about 350 to almost 500 cases.
f5 <- train %>%
  ggplot(aes(percent_atom_al)) +
  geom_density(color = "blue", fill = "lightblue", alpha = 0.5)

f6 <- train %>%
  ggplot(aes(percent_atom_ga)) +
  geom_density(color = "green", fill = "orange", alpha = 0.5) 

f7 <- train %>%
  ggplot(aes(percent_atom_in)) +
  geom_density(color = "red", fill = "brown", alpha = 0.5) 

f8 <- train %>%
  ggplot(aes(lattice_vector_1_ang)) +
  geom_histogram(bins = 30, fill = "brown")

f9 <- train %>%
  ggplot(aes(lattice_vector_2_ang)) +
  geom_histogram(bins = 30, fill = "purple")

f10 <- train %>%
  ggplot(aes(lattice_vector_3_ang)) +
  geom_histogram(bins = 30, fill = "dark green")

f11 <- train %>%
  ggplot(aes(lattice_angle_alpha_degree)) +
  geom_histogram(bins = 30, fill = "brown")

f12 <- train %>%
  ggplot(aes(lattice_angle_beta_degree)) +
  geom_histogram(bins = 30, fill = "purple")

f13 <- train %>%
  ggplot(aes(lattice_angle_gamma_degree)) +
  geom_histogram(bins = 30, fill = "dark green")

grid.arrange(f5, f6, f7, f8, f9, f10, f11, f12, f13, ncol = 3)

We observe that:

  • There are relatively few crystals with Al or In content around 10%. Crystals with high Al content are also more common than those with high In or Ga content.
  • The lengths of the lattice vectors are clearly grouped into 3 to 4 peaks. The second lattice vector has a notably smaller maximum of about 10 Å, compared to roughly 25 Å for the other two directions.
  • The angles alpha and beta are clustered around 90 degrees, whereas gamma has three dominant clusters.

1.7 Geometry file

Real crystals are composed of a specific arrangement of the constituent atoms in a unit cell, which is then repeated to form a periodic lattice. The geometry files contain the positions of the atoms in one-unit cell, along with the basis vectors used to construct a periodic array of cells. These are inherently three-dimensional structures and contain a rich set of information about bonding and symmetry.

geo <- read.table("train/10/geometry.xyz", skip = 6, as.is = T) %>% select(-V1) %>%
  rename(x = V2, y = V3, z = V4, atom = V5)

geo %>%
  plot_ly(x = ~x, y = ~y, z = ~z, color = ~atom) %>%
  add_markers() %>%
  layout(title = "Cell geometry id 10")

Above, we have plotted each atom at its physical position in the unit cell. The unit cells themselves need not be rectangular boxes. When extracting information about these materials, we would like the features to describe the material rather than just the unit cell, so we need to package the cell in a way that lets periodicity be handled efficiently. Transforming the structures to reduced coordinates, where the positions of the atoms are expressed in terms of the basis vectors rather than real-space coordinates, encodes all of the unit-cell information in a cube with boundaries from zero to one. Repeating this reduced unit cell in unit steps along any dimension then lets us encode periodicity without worrying about oblique cell shapes.

Let the reduced coordinate be denoted as \(r = (x,y,z)^{t}\), then position vector \(R\) is expressed using the lattice vectors as

\(R = (X,Y,Z)^{t} = xa_{1} + ya_{2} + za_{3} = Ar\)

where \(A = (a_{1}, a_{2}, a_{3})\) is a matrix of primitive lattice vectors.

Taking the inverse of the basis vector matrix, we can solve for the reduced form coordinates:

\(r = A^{-1}R\).

# Reading primitive lattice vectors
A <- read.table("train/10/geometry.xyz", skip = 3, nrows = 3) %>% select(-V1) %>%
  as.matrix() %>% t
B <- MASS::ginv(A)
# Reading spatial position of atoms
Geom <- read.table("train/10/geometry.xyz", skip = 6) %>% select(-V1) %>%
  rename(x = V2, y = V3, z = V4, atom = V5)

# Spatial position in reduced coordinates  
Geom[,1:3] <- t(B %*% t(Geom[,1:3]))

Geom %>%
  plot_ly(x = ~x, y = ~y, z = ~z, color = ~atom) %>%
  add_markers() %>%
  layout(title = "Cell geometry in reduced coordinates id 10")

The reduced coordinates fill a unit cube parameterized by the lattice vectors of the crystal. This allows us to build periodic arrays of the unit cell simply by stacking unit cubes, for arbitrary unit cells/basis vectors.
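
As a quick check (a sketch), the reduced coordinates of this cell should lie approximately within [0, 1] along each axis; values at or slightly outside the boundaries correspond to atoms sitting on cell faces:

# Range of the reduced coordinates per axis
apply(Geom[, 1:3], 2, range)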

2 Feature Engineering

We engineer the following features in addition to those given in the data set.

  • The mean reduced coordinates of the Al, Ga, In and O atoms for each crystal compound.
  • One-hot encoding of the spacegroup variable.
  • The lattice angles converted from degrees to radians.

2.1 Mean value of coordinates of atoms in reduced form

# Mean value of coordinates of atoms in reduced form
rcmv_train <- matrix(NA, 2400, 15) %>% data.frame()

for(id in 1:2400){
  # Reading primitive lattice vectors
  A <- read.table(str_c('train/',id,'/geometry.xyz'), skip = 3, nrows = 3) %>% select(-V1) %>% as.matrix %>% t
  B <- MASS::ginv(A)
  
  # Reading spatial position of atoms
  Geom <- read.table(str_c('train/',id,'/geometry.xyz'), skip = 6) %>% select(-V1) %>%
  rename(x = V2, y = V3, z = V4, atom = V5)
  
  # Cell centre
  cg <- c(mean(Geom$x), mean(Geom$y), mean(Geom$z))
  cg_r <- (B %*% cg) %>% as.matrix() %>% t()
  
  # Centre of each atom type
  D_cg <- Geom %>% group_by(atom) %>% summarise(x = mean(x), y = mean(y), z = mean(z))
  
  if("Al" %in% levels(Geom$atom)){
    Al_cg_r <- (B %*% (D_cg %>% filter(atom == 'Al') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
  }else{
    Al_cg_r <- matrix(0,1,3)
  }
  
  if("Ga" %in% levels(Geom$atom)){
    Ga_cg_r <- (B %*% (D_cg %>% filter(atom == 'Ga') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
  }else{
    Ga_cg_r <- matrix(0,1,3)
  }
  
  if("In" %in% levels(Geom$atom)){
    In_cg_r <- (B %*% (D_cg %>% filter(atom == 'In') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
  }else{
    In_cg_r <- matrix(0,1,3)
  }
  
  O_cg_r <- (B %*% (D_cg %>% filter(atom == 'O') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
  
  temprow <- cbind( cg_r, Al_cg_r, Ga_cg_r, In_cg_r, O_cg_r )
  rcmv_train[id,] <- temprow
}
colnames(rcmv_train) <- c('cg_r_x', 'cg_r_y', 'cg_r_z',
                    'Al_cg_r_x', 'Al_cg_r_y', 'Al_cg_r_z',
                    'Ga_cg_r_x', 'Ga_cg_r_y', 'Ga_cg_r_z',
                    'In_cg_r_x', 'In_cg_r_y', 'In_cg_r_z',
                    'O_cg_r_x', 'O_cg_r_y', 'O_cg_r_z')


rcmv_test <- matrix(NA, 600, 15) %>% data.frame()
for(id in 1:600){
  # Reading primitive lattice vectors
  A <- read.table(str_c('test/',id,'/geometry.xyz'), skip = 3, nrows = 3) %>% select(-V1)   %>% as.matrix %>% t
  B <- MASS::ginv(A)
  
  # Reading spatial position of atoms
  Geom <- read.table(str_c('test/',id,'/geometry.xyz'), skip = 6) %>% select(-V1) %>%
  rename(x = V2, y = V3, z = V4, atom = V5)
  
  # Cell centre
  cg <- c(mean(Geom$x), mean(Geom$y), mean(Geom$z))
  cg_r <- (B %*% cg) %>% as.matrix() %>% t()
  
  # Centre of each atom type
  D_cg <- Geom %>% group_by(atom) %>% summarise(x = mean(x), y = mean(y), z = mean(z))
  
  if("Al" %in% levels(Geom$atom)){
    Al_cg_r <- (B %*% (D_cg %>% filter(atom == 'Al') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
  }else{
    Al_cg_r <- matrix(0,1,3)
  }
  
  if("Ga" %in% levels(Geom$atom)){
    Ga_cg_r <- (B %*% (D_cg %>% filter(atom == 'Ga') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
  }else{
    Ga_cg_r <- matrix(0,1,3)
  }
  
  if("In" %in% levels(Geom$atom)){
    In_cg_r <- (B %*% (D_cg %>% filter(atom == 'In') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
  }else{
    In_cg_r <- matrix(0,1,3)
  }
  
  O_cg_r <- (B %*% (D_cg %>% filter(atom == 'O') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
  
  temprow <- cbind( cg_r, Al_cg_r, Ga_cg_r, In_cg_r, O_cg_r )
  rcmv_test[id,] <- temprow
}
colnames(rcmv_test) <- c('cg_r_x', 'cg_r_y', 'cg_r_z',
                    'Al_cg_r_x', 'Al_cg_r_y', 'Al_cg_r_z',
                    'Ga_cg_r_x', 'Ga_cg_r_y', 'Ga_cg_r_z',
                    'In_cg_r_x', 'In_cg_r_y', 'In_cg_r_z',
                    'O_cg_r_x', 'O_cg_r_y', 'O_cg_r_z')
head(rcmv_train)
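
The train and test loops above are identical apart from the folder name and the number of ids, so they could be collapsed into a single helper. The sketch below is an optional refactor with the same behaviour; the name reduced_centroids and its signature are illustrative only.

# Helper (sketch): mean reduced coordinates of the cell and of each atom type for one data folder
reduced_centroids <- function(folder, n_ids) {
  out <- matrix(NA, n_ids, 15) %>% data.frame()
  for (id in seq_len(n_ids)) {
    # Primitive lattice vectors and their (pseudo-)inverse
    A <- read.table(str_c(folder, '/', id, '/geometry.xyz'), skip = 3, nrows = 3) %>%
      select(-V1) %>% as.matrix() %>% t()
    B <- MASS::ginv(A)
    # Spatial positions of atoms
    Geom <- read.table(str_c(folder, '/', id, '/geometry.xyz'), skip = 6) %>%
      select(-V1) %>% rename(x = V2, y = V3, z = V4, atom = V5)
    # Cell centre in reduced coordinates
    cg_r <- t(B %*% c(mean(Geom$x), mean(Geom$y), mean(Geom$z)))
    # Centre of each atom type in reduced coordinates (zeros if the element is absent)
    D_cg <- Geom %>% group_by(atom) %>% summarise(x = mean(x), y = mean(y), z = mean(z))
    centroid_r <- function(el) {
      if (el %in% unique(Geom$atom)) {
        t(B %*% (D_cg %>% filter(atom == el) %>% select(x:z) %>% as.matrix() %>% t()))
      } else {
        matrix(0, 1, 3)
      }
    }
    out[id, ] <- cbind(cg_r, centroid_r("Al"), centroid_r("Ga"), centroid_r("In"), centroid_r("O"))
  }
  colnames(out) <- c('cg_r_x', 'cg_r_y', 'cg_r_z',
                     'Al_cg_r_x', 'Al_cg_r_y', 'Al_cg_r_z',
                     'Ga_cg_r_x', 'Ga_cg_r_y', 'Ga_cg_r_z',
                     'In_cg_r_x', 'In_cg_r_y', 'In_cg_r_z',
                     'O_cg_r_x', 'O_cg_r_y', 'O_cg_r_z')
  out
}
# Equivalent to the two loops above:
# rcmv_train <- reduced_centroids('train', 2400)
# rcmv_test  <- reduced_centroids('test', 600)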

2.2 Other features and formation of design matrix

# Design matrix
df_train <- train %>% 
  mutate(spacegroup = as.factor(spacegroup),
         alpha_rad = (lattice_angle_alpha_degree * pi/180),
         beta_rad = (lattice_angle_beta_degree * pi/180),
         gamma_rad = (lattice_angle_gamma_degree * pi/180)) %>% 
  select(-contains('degree'),-id,-contains('energy'))

df_test <- test %>% 
  mutate(spacegroup = as.factor(spacegroup),
         alpha_rad = (lattice_angle_alpha_degree * pi/180),
         beta_rad = (lattice_angle_beta_degree * pi/180),
         gamma_rad = (lattice_angle_gamma_degree * pi/180)) %>% 
  select(-contains('degree'),-id)

# One hot encoding variable spacegroup
dmy <- dummyVars(" ~ .", data = df_train)
df_train <- data.frame(predict(dmy, newdata = df_train))
df_test <- data.frame(predict(dmy, newdata = df_test))

# Forming a complete design matrix for train and test
df_train <- df_train %>% bind_cols(rcmv_train) 
df_test <- df_test %>% bind_cols(rcmv_test) 

rm(rcmv_train, rcmv_test)
head(df_train)

3 Duplicate rows in design matrix

The number of duplicate rows in the design matrix is

dup_rows <- duplicated(df_train) | duplicated(df_train, fromLast = T)
sum(dup_rows)
## [1] 24

The corresponding rows in the train data are

train[dup_rows,] %>% arrange_(.dots=colnames(train %>% select(-id,-contains("energy"))))
## Warning: arrange_() is deprecated. 
## Please use arrange() instead
## 
## The 'programming' vignette or the tidyeval book can help you
## to program with arrange() : https://tidyeval.tidyverse.org
## This warning is displayed once per session.

The number of unique target-value pairs among these duplicate rows is

nrow(unique(train[dup_rows,] %>% select(contains('energy'))))
## [1] 24

Thus, there are duplicate rows in the design matrix, but the corresponding target values are all distinct. We therefore remove those rows from both the design matrix and the train data.

df_train <- df_train[!dup_rows,]
train <- train[!dup_rows,]

4 Measuring Univariate Feature Importance

The dimension of the design matrix after removing the duplicate rows is

dim(df_train)
## [1] 2376   31

There are 31 features in total to work with.

4.1 Binary features

We have already dummified the spacegroup variable; let's check the relevance of each of its levels by applying the Wilcoxon rank-sum test followed by the Benjamini-Hochberg p-value correction. The null hypothesis is that the target distributions for feature values 1 and 0 differ by a location shift of zero. We apply this procedure to both formation energy and band gap energy.

4.1.1 Relevance for formation energy

We observe that all binary features derived from the spacegroup variable are significant at the 5% level.

# Wilcoxon rank sum test on binary features for formation energy
wrst_fe <- apply(df_train %>% select(contains("spacegroup")), 2,
                 function(x, y){
                   wxn <- wilcox.test(y ~ x, conf.int = T)[c("statistic", "p.value", "estimate")]
                   unlist(wxn)},
                 y = train$formation_energy_ev_natom)

# The results are a matrix with predictors in columns. We reverse this
wrst_fe <- as.data.frame(t(wrst_fe))
names(wrst_fe) <- c("statistic.W", "p.value", "estimate")
wrst_fe$p.adj <- p.adjust(wrst_fe$p.value , method = "fdr")
wrst_fe$feature = rownames(wrst_fe)
(wrst_fe <- wrst_fe %>% mutate(sig=ifelse(wrst_fe$p.adj<0.05, "p.adj<0.05", "Not Sig.")))

4.1.2 Relevance for band gap energy

We observe that 4 out of the 6 binary features derived from the spacegroup variable are significant at the 5% level.

# Wilcoxon rank sum test on binary features for band gap energy
wrst_be <- apply(df_train %>% select(contains("spacegroup")), 2,
              function(x, y){
                wxn <- wilcox.test(y ~ x, conf.int = T)[c("statistic", "p.value", "estimate")]
                unlist(wxn)},
              y = train$bandgap_energy_ev)

# The results are a matrix with predictors in columns. We reverse this
wrst_be <- as.data.frame(t(wrst_be))
names(wrst_be) <- c("statistic.W", "p.value", "estimate")
wrst_be$p.adj <- p.adjust(wrst_be$p.value , method = "fdr")
wrst_be$feature = rownames(wrst_be)
(wrst_be <- wrst_be %>% mutate(sig=ifelse(wrst_be$p.adj<0.05, "p.adj<0.05", "Not Sig.")))

4.2 Continuous features

We compute the Pearson correlation coefficient, the Spearman correlation coefficient, the \(R^2\) statistic from a loess smoother fit, and the RReliefFexpRank statistic to measure the strength of linear or non-linear association between the continuous features and the targets. But first, let's check the correlations among the continuous features themselves.

corrplot(cor(df_train %>% select(-contains('spacegroup')), method = "spearman"), order = "hclust")

We observe a strong correlation between cg_r_y and O_cg_r_y. The remaining correlations appear to be below 0.75 in absolute value, so univariate feature selection can be used (a quick check follows). Otherwise, if two predictors were highly correlated with the response and with each other, the univariate approach would flag both as important, leading to the selection of redundant features.
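
To make the visual check explicit, a small sketch can list any feature pairs whose absolute Spearman correlation exceeds 0.75:

# Continuous feature pairs with |Spearman correlation| > 0.75
cor_mat <- cor(df_train %>% select(-contains('spacegroup')), method = "spearman")
high_corr <- which(abs(cor_mat) > 0.75 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(feature_1 = rownames(cor_mat)[high_corr[, 1]],
           feature_2 = colnames(cor_mat)[high_corr[, 2]])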

4.2.1 Relevance for formation energy

We observe that:

  • The overall importance scores of the features are mediocre.
  • The most important feature according to RReliefFexpRank is percent_atom_in, while according to the other measures it is lattice_vector_3_ang.
# Pearson correlation
pearsonCorr_fe <- apply(df_train %>% select(-contains("spacegroup")), 
                        2,  FUN = function(x, y) cor(x, y), 
                        y = train$formation_energy_ev_natom)

# Spearman correlation
spearmanCorr_fe <- apply(df_train %>% select(-contains("spacegroup")),
                         2,  FUN = function(x, y) cor(x, y, method = "spearman"),
                         y = train$formation_energy_ev_natom)

# Loess scores
loessScores_fe <- filterVarImp(x = df_train %>% select(-contains("spacegroup")),
                               y = train$formation_energy_ev_natom, nonpara = TRUE)
names(loessScores_fe) = "loessScores"

# Relief scores
reliefScores_fe <- attrEval(fe ~., data = df_train %>% select(-contains("spacegroup")) %>% 
                              mutate(fe = train$formation_energy_ev_natom), estimator = "RReliefFexpRank")

# Scatter plot matrix of variable relevance scores for formation energy
scores_fe <- data.frame(abs_pearsonCorr = abs(pearsonCorr_fe), abs_spearmanCorr = abs(spearmanCorr_fe),
                        loessScores = loessScores_fe, reliefScores = reliefScores_fe)

scores_fe_plot <- scores_fe %>% mutate(feature = rownames(scores_fe)) %>%
  ggpairs(columns = 1:4, upper=list(continuous = wrap('cor', method = "spearman")), 
          diag = list(continuous = wrap("densityDiag", alpha = .5, color = 'red', fill = 'brown')),
          lower = list(continuous = wrap("points", size = 3, alpha = .5, color = 'brown'),
                       mapping = aes(text = (paste("feature:", feature)))),
          switch = "both", 
          title = "Scatter plot matrix and density of variable relevance scores for formation energy")
ggplotly(scores_fe_plot, tooltip = c("text", "x", "y"))

4.2.2 Relevance for band gap energy

We observe that:

  • The overall importance scores of the features are better than those for formation energy.
  • The most important feature according to RReliefFexpRank is percent_atom_al, while according to the other measures it is percent_atom_in.
# Pearson correlation
pearsonCorr_be <- apply(df_train %>% select(-contains("spacegroup")), 
                   2,  FUN = function(x, y) cor(x, y), y = train$bandgap_energy_ev)

# Spearman correlation
spearmanCorr_be <- apply(df_train %>% select(-contains("spacegroup")), 
                   2,  FUN = function(x, y) cor(x, y, method = "spearman"), y = train$bandgap_energy_ev)

# Loess scores
loessScores_be <- filterVarImp(x = df_train %>% select(-contains("spacegroup")),
                            y = train$bandgap_energy_ev, nonpara = TRUE)
names(loessScores_be) = "loessScores"

# Relief scores
reliefScores_be <- attrEval(be ~., data = df_train %>% select(-contains("spacegroup")) %>% 
                              mutate(be = train$bandgap_energy_ev), estimator = "RReliefFexpRank")

# Scatter plot matrix of variable relevance scores for band gap energy
scores_be <- data.frame(abs_pearsonCorr = abs(pearsonCorr_be), abs_spearmanCorr = abs(spearmanCorr_be), 
                        loessScores = loessScores_be, reliefScores = reliefScores_be)

scores_be_plot <- scores_be %>% mutate(feature = rownames(scores_be)) %>%
  ggpairs(columns = 1:4, upper=list(continuous = wrap('cor', method = "spearman")), 
          diag = list(continuous = wrap("densityDiag", alpha = .5, color = 'blue', fill = 'purple')),
          lower = list(continuous = wrap("points", size = 3, alpha = .5, color = 'purple'),
                       mapping = aes(text = (paste("feature:", feature)))),
          switch = "both", 
          title = "Scatter plot matrix and density of variable relevance scores for bandgap energy")
ggplotly(scores_be_plot, tooltip = c("text", "x", "y"))

5 Feature Selection

We select features for model building based on the importance scores found in the previous section. Only those features whose scores exceed a defined threshold for the respective score type are retained. Each threshold is taken as a fraction of the range of the corresponding score type. Since the ranges of the importance scores differ between formation energy and band gap energy, so does the fraction used.

5.1 Formation energy

The features used for modeling formation energy are

# Relevant predictors to formation energy
wilcox_pred_fe <- wrst_fe %>% filter(sig == "p.adj<0.05") %>% .$feature

pearson_pred_fe <- scores_fe %>% mutate(feature = rownames(scores_fe)) %>% 
  filter(abs_pearsonCorr > (max(abs_pearsonCorr) - min(abs_pearsonCorr))*1/5) %>% 
  .$feature

spearman_pred_fe <- scores_fe %>% mutate(feature = rownames(scores_fe)) %>% 
  filter(abs_spearmanCorr > (max(abs_spearmanCorr) - min(abs_spearmanCorr))*1/5) %>% 
  .$feature

loess_pred_fe <- scores_fe %>% mutate(feature = rownames(scores_fe)) %>%
  filter(loessScores > (max(loessScores) - min(loessScores))*1/5) %>% 
  .$feature

relief_pred_fe <- scores_fe %>% mutate(feature = rownames(scores_fe)) %>% 
  filter(reliefScores > (max(reliefScores) - min(reliefScores))*1/5) %>% 
  .$feature

pred_fe <- unique(c(wilcox_pred_fe, pearson_pred_fe, spearman_pred_fe, loess_pred_fe, relief_pred_fe))
pred_fe
##  [1] "spacegroup.12"        "spacegroup.33"        "spacegroup.167"      
##  [4] "spacegroup.194"       "spacegroup.206"       "spacegroup.227"      
##  [7] "percent_atom_al"      "percent_atom_ga"      "percent_atom_in"     
## [10] "lattice_vector_1_ang" "lattice_vector_2_ang" "lattice_vector_3_ang"
## [13] "alpha_rad"            "beta_rad"             "cg_r_y"              
## [16] "cg_r_z"               "Al_cg_r_x"            "Al_cg_r_z"           
## [19] "Ga_cg_r_x"            "Ga_cg_r_y"            "Ga_cg_r_z"           
## [22] "In_cg_r_x"            "In_cg_r_y"            "In_cg_r_z"           
## [25] "O_cg_r_y"

5.2 Band gap energy

The features used for modeling band gap energy are

# Relevant predictors to band gap energy
wilcox_pred_be <- wrst_be %>% filter(sig == "p.adj<0.05") %>% .$feature

pearson_pred_be <- scores_be %>% mutate(feature = rownames(scores_be)) %>%
  filter(abs_pearsonCorr > (max(abs_pearsonCorr) - min(abs_pearsonCorr))*1/20) %>% 
  .$feature

spearman_pred_be <- scores_be %>% mutate(feature = rownames(scores_be)) %>% 
  filter(abs_spearmanCorr > (max(abs_spearmanCorr) - min(abs_spearmanCorr))*1/20) %>% 
  .$feature

loess_pred_be <- scores_be %>% mutate(feature = rownames(scores_be)) %>%
  filter(loessScores > (max(loessScores) - min(loessScores))*1/20) %>% 
  .$feature

relief_pred_be <- scores_be %>% mutate(feature = rownames(scores_be)) %>%
  filter(reliefScores > (max(reliefScores) - min(reliefScores))*1/20) %>% 
  .$feature

pred_be <- unique(c(wilcox_pred_be, pearson_pred_be, spearman_pred_be, loess_pred_be, relief_pred_be))
pred_be
##  [1] "spacegroup.33"         "spacegroup.167"       
##  [3] "spacegroup.194"        "spacegroup.227"       
##  [5] "number_of_total_atoms" "percent_atom_al"      
##  [7] "percent_atom_in"       "lattice_vector_1_ang" 
##  [9] "lattice_vector_2_ang"  "lattice_vector_3_ang" 
## [11] "alpha_rad"             "beta_rad"             
## [13] "gamma_rad"             "cg_r_x"               
## [15] "cg_r_y"                "cg_r_z"               
## [17] "Al_cg_r_x"             "Al_cg_r_y"            
## [19] "Al_cg_r_z"             "Ga_cg_r_z"            
## [21] "In_cg_r_x"             "In_cg_r_y"            
## [23] "In_cg_r_z"             "O_cg_r_x"             
## [25] "O_cg_r_y"              "O_cg_r_z"             
## [27] "percent_atom_ga"       "Ga_cg_r_x"

6 Model Building

Caret's train function is used to build the models. The models are tuned using the root mean squared logarithmic error (RMSLE) as the performance metric, evaluated on the cross-validation hold-out set for each combination of hyperparameters. A leave-group-out cross-validation (LGOCV) data-splitting strategy is used, holding out 20 percent of the data in each split.
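
The metric computed by the summary function below is \(\text{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(\log(\hat{y}_i + 1) - \log(y_i + 1)\big)^2}\), where \(\hat{y}_i\) are the hold-out predictions and \(y_i\) the observed targets; the abs() in the code guards against occasional small negative predictions before the logarithm is taken.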

# Function to calculate RMSLE to estimate model performance
custom_summary = function(data, lev = NULL, model = NULL) {
  p <- log(abs(data$pred) + 1)
  o <- log(data$obs + 1)
  out <- sqrt(mean((p-o)^2))
  names(out) = "RMSLE"
  c(out, defaultSummary(data, lev = NULL, model = NULL))
}

6.1 Formation energy

6.1.1 Model averaged neural network

set.seed(7654)
nptcFE_avNNet <- train(fe ~ ., df_train %>% select(pred_fe) %>%
                              mutate(fe = train$formation_energy_ev_natom),
                            preProcess = c("center", "scale"),
                            method = "avNNet",
                            metric = "RMSLE",
                            maximize = FALSE,
                            tuneGrid =  expand.grid(decay = c(0, 0.005, 0.01),
                                                    size = 3:12,
                                                    bag = FALSE),
                            trControl = trainControl(method = "LGOCV",
                                                     number = 10, p = 0.8,
                                                     search = "grid",
                                                     summaryFunction = custom_summary),
                            linout = TRUE,
                            trace = FALSE,
                            MaxNWts = 12 * (ncol(df_train) + 1) + 12 + 1,
                            maxit = 500)

nptcFE_avNNet
## Model Averaged Neural Network 
## 
## 2367 samples
##   25 predictor
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSLE       RMSE        Rsquared   MAE       
##   0.000   3    0.03149024  0.03908853  0.8581717  0.02774245
##   0.000   4    0.03021502  0.03778860  0.8672894  0.02565481
##   0.000   5    0.02908929  0.03651229  0.8760976  0.02447570
##   0.000   6    0.02876068  0.03606204  0.8792777  0.02404277
##   0.000   7    0.02889554  0.03628229  0.8779912  0.02394308
##   0.000   8    0.02859275  0.03593098  0.8803242  0.02373446
##   0.000   9    0.02887669  0.03626575  0.8784617  0.02395037
##   0.000  10    0.02890050  0.03631248  0.8783470  0.02389547
##   0.000  11    0.02878974  0.03612602  0.8793467  0.02397479
##   0.000  12    0.02882691  0.03619273  0.8787039  0.02396833
##   0.005   3    0.03108146  0.03875938  0.8601698  0.02657043
##   0.005   4    0.02973270  0.03722034  0.8711975  0.02504237
##   0.005   5    0.02918607  0.03654355  0.8758055  0.02445884
##   0.005   6    0.02894849  0.03625586  0.8780864  0.02415901
##   0.005   7    0.02867279  0.03586829  0.8805183  0.02379670
##   0.005   8    0.02851321  0.03573462  0.8816817  0.02374711
##   0.005   9    0.02864061  0.03588883  0.8804884  0.02368777
##   0.005  10    0.02882829  0.03614085  0.8792661  0.02370830
##   0.005  11    0.02913308  0.03648491  0.8772512  0.02389598
##   0.005  12    0.02907267  0.03644188  0.8777136  0.02391190
##   0.010   3    0.03104301  0.03873024  0.8604753  0.02652098
##   0.010   4    0.02984401  0.03734823  0.8703895  0.02514102
##   0.010   5    0.02933141  0.03671637  0.8747831  0.02444757
##   0.010   6    0.02890674  0.03619782  0.8786137  0.02404539
##   0.010   7    0.02886190  0.03620084  0.8786230  0.02400355
##   0.010   8    0.02893008  0.03619588  0.8785412  0.02401678
##   0.010   9    0.02881043  0.03613592  0.8791302  0.02386310
##   0.010  10    0.02862595  0.03589427  0.8806830  0.02368999
##   0.010  11    0.02906606  0.03644242  0.8773971  0.02395558
##   0.010  12    0.02863584  0.03586894  0.8811141  0.02355162
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 8, decay = 0.005 and bag
##  = FALSE.

6.1.2 Cubist

set.seed(7654)
nptcFE_cubist <- train(fe ~ ., df_train %>% select(pred_fe) %>% 
                         mutate(fe = train$formation_energy_ev_natom),
                       method = "cubist",
                       metric = "RMSLE",
                       maximize = FALSE,
                       tuneGrid = expand.grid(committees = seq(30,60,5),
                                              neighbors = c(0,3,5,7,9)),
                       trControl = trainControl(method = "LGOCV",
                                                number = 10, p = 0.8,
                                                search = "grid",
                                                summaryFunction = custom_summary))
nptcFE_cubist
## Cubist 
## 
## 2367 samples
##   25 predictor
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSLE       RMSE        Rsquared   MAE       
##   30          0          0.02957016  0.03722273  0.8708993  0.02466582
##   30          3          0.03114621  0.03940150  0.8569248  0.02534928
##   30          5          0.03043428  0.03848389  0.8628360  0.02452768
##   30          7          0.03005676  0.03799134  0.8660486  0.02422966
##   30          9          0.02979644  0.03763616  0.8683386  0.02411804
##   35          0          0.02956325  0.03720809  0.8710880  0.02466517
##   35          3          0.03113407  0.03938151  0.8570839  0.02534030
##   35          5          0.03042705  0.03846945  0.8629697  0.02452864
##   35          7          0.03005341  0.03798417  0.8661439  0.02422727
##   35          9          0.02979395  0.03763137  0.8684284  0.02412064
##   40          0          0.02949262  0.03712540  0.8716577  0.02459935
##   40          3          0.03107228  0.03930603  0.8576443  0.02528890
##   40          5          0.03036546  0.03839433  0.8635063  0.02447924
##   40          7          0.02998942  0.03790616  0.8666987  0.02417347
##   40          9          0.02972829  0.03755108  0.8689934  0.02406138
##   45          0          0.02944117  0.03706004  0.8720804  0.02455587
##   45          3          0.03104112  0.03926465  0.8578999  0.02523988
##   45          5          0.03033212  0.03835096  0.8637751  0.02443984
##   45          7          0.02995686  0.03786362  0.8669699  0.02414033
##   45          9          0.02969639  0.03750951  0.8692582  0.02402416
##   50          0          0.02938374  0.03698593  0.8726146  0.02451478
##   50          3          0.03100840  0.03922368  0.8581845  0.02521334
##   50          5          0.03029597  0.03830497  0.8640927  0.02440162
##   50          7          0.02991557  0.03781181  0.8673287  0.02411004
##   50          9          0.02965429  0.03745657  0.8696244  0.02399545
##   55          0          0.02940731  0.03701933  0.8723490  0.02452062
##   55          3          0.03103499  0.03926094  0.8578705  0.02523118
##   55          5          0.03031443  0.03833288  0.8638557  0.02439992
##   55          7          0.02993567  0.03784054  0.8670887  0.02411554
##   55          9          0.02967713  0.03748850  0.8693634  0.02400568
##   60          0          0.02941467  0.03702106  0.8723143  0.02450930
##   60          3          0.03104795  0.03927372  0.8577687  0.02524475
##   60          5          0.03032252  0.03833905  0.8637922  0.02440879
##   60          7          0.02994389  0.03784688  0.8670249  0.02411989
##   60          9          0.02968367  0.03749217  0.8693191  0.02400103
## 
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 0.

6.1.3 Elasticnet

set.seed(7654)
nptcFE_enet <- train(fe ~ ., df_train %>% select(pred_fe) %>% 
                       mutate(fe = train$formation_energy_ev_natom),
                     preProcess = c("center", "scale"),
                     method = "enet",
                     metric = "RMSLE",
                     maximize = FALSE,
                     tuneGrid = expand.grid(lambda = c(0, 10 ^ seq(-1, -4, length = 5 - 1)),
                                             fraction = seq(.05, 1, length = 5)),
                     trControl = trainControl(method = "LGOCV",
                                              number = 10, p = 0.8,
                                              search = "grid",
                                              summaryFunction = custom_summary))
nptcFE_enet 
## Elasticnet 
## 
## 2367 samples
##   25 predictor
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSLE       RMSE        Rsquared   MAE       
##   0e+00   0.0500    0.05074030  0.06107614  0.6531706  0.04675650
##   0e+00   0.2875    0.05063567  0.06095741  0.6544851  0.04668791
##   0e+00   0.5250    0.05057467  0.06089000  0.6552470  0.04665545
##   0e+00   0.7625    0.05053349  0.06084695  0.6557443  0.04663992
##   0e+00   1.0000    0.05051821  0.06083412  0.6559095  0.04664493
##   1e-04   0.0500    0.07473336  0.08983755  0.4993005  0.07365245
##   1e-04   0.2875    0.05180106  0.06224362  0.6413947  0.04775483
##   1e-04   0.5250    0.05101060  0.06132791  0.6500113  0.04701351
##   1e-04   0.7625    0.05082492  0.06114089  0.6522883  0.04683486
##   1e-04   1.0000    0.05077312  0.06110925  0.6527852  0.04678099
##   1e-03   0.0500    0.07677097  0.09226973  0.4697795  0.07564147
##   1e-03   0.2875    0.05308787  0.06385717  0.6310419  0.04917971
##   1e-03   0.5250    0.05113181  0.06146224  0.6483942  0.04713483
##   1e-03   0.7625    0.05091773  0.06122700  0.6512249  0.04692493
##   1e-03   1.0000    0.05082287  0.06113894  0.6523290  0.04682662
##   1e-02   0.0500    0.08087807  0.09717862  0.3498557  0.07956296
##   1e-02   0.2875    0.06192762  0.07450211  0.5728810  0.05954148
##   1e-02   0.5250    0.05326690  0.06408331  0.6285474  0.04932644
##   1e-02   0.7625    0.05135807  0.06168585  0.6459813  0.04731885
##   1e-02   1.0000    0.05109311  0.06139895  0.6492548  0.04710838
##   1e-01   0.0500    0.08150550  0.09793757  0.3333873  0.08015292
##   1e-01   0.2875    0.06416070  0.07718144  0.5596165  0.06226918
##   1e-01   0.5250    0.05486502  0.06600259  0.6163365  0.05128783
##   1e-01   0.7625    0.05187597  0.06231058  0.6393749  0.04775096
##   1e-01   1.0000    0.05146032  0.06174992  0.6466953  0.04740162
## 
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 1 and lambda = 0.

6.1.4 Weighted k-nearest neighbors

set.seed(7654)
nptcFE_kknn <- train(fe ~ ., df_train %>% select(pred_fe) %>% 
                       mutate(fe = train$formation_energy_ev_natom),
                     preProcess = c("center", "scale"),
                     method = "kknn",
                     metric = "RMSLE",
                     maximize = FALSE,
                     tuneGrid = expand.grid(kmax = c(6, 8, 10, 12, 14),
                                            distance = 1,
                                            kernel = "epanechnikov"),
                     trControl = trainControl(method = "LGOCV",
                                              number = 10, p = 0.8,
                                              search = "grid",
                                              summaryFunction = custom_summary))
nptcFE_kknn
## k-Nearest Neighbors 
## 
## 2367 samples
##   25 predictor
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   kmax  RMSLE       RMSE        Rsquared   MAE       
##    6    0.03591260  0.04499291  0.8137949  0.02977108
##    8    0.03532952  0.04423627  0.8191747  0.02929978
##   10    0.03508564  0.04388869  0.8216095  0.02915689
##   12    0.03493763  0.04365258  0.8232661  0.02911865
##   14    0.03488801  0.04354089  0.8240192  0.02917551
## 
## Tuning parameter 'distance' was held constant at a value of 1
## 
## Tuning parameter 'kernel' was held constant at a value of epanechnikov
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 14, distance = 1
##  and kernel = epanechnikov.

6.1.5 Partial least squares

set.seed(7654)
nptcFE_pls <- train(fe ~ ., df_train %>% select(pred_fe) %>% 
                      mutate(fe = train$formation_energy_ev_natom),
                    preProcess = c("center", "scale"),
                    method = "pls",
                    metric = "RMSLE",
                    maximize = FALSE,
                    tuneLength = ncol(df_train)-1,
                    trControl = trainControl(method = "LGOCV",
                                             number = 10, p = 0.8,
                                             search = "grid",
                                             summaryFunction = custom_summary))
nptcFE_pls
## Partial Least Squares 
## 
## 2367 samples
##   25 predictor
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSLE       RMSE        Rsquared   MAE       
##    1     0.05656785  0.06799984  0.5695875  0.05220535
##    2     0.05335146  0.06408669  0.6180657  0.04940284
##    3     0.05233598  0.06290488  0.6318702  0.04801181
##    4     0.05152707  0.06189150  0.6435939  0.04718767
##    5     0.05138643  0.06172301  0.6454721  0.04717133
##    6     0.05124354  0.06152417  0.6478052  0.04753760
##    7     0.05127116  0.06157420  0.6471736  0.04734778
##    8     0.05126262  0.06158273  0.6471104  0.04735000
##    9     0.05129306  0.06163193  0.6465628  0.04733396
##   10     0.05129680  0.06164133  0.6464777  0.04731726
##   11     0.05127643  0.06161157  0.6467491  0.04730038
##   12     0.05130110  0.06162636  0.6465920  0.04736736
##   13     0.05119999  0.06157249  0.6471953  0.04723168
##   14     0.05122909  0.06158367  0.6471098  0.04721432
##   15     0.05116460  0.06153453  0.6476940  0.04714967
##   16     0.05110616  0.06147068  0.6484501  0.04710513
##   17     0.05106968  0.06142830  0.6489767  0.04709176
##   18     0.05088958  0.06123781  0.6511868  0.04695528
##   19     0.05082478  0.06118200  0.6519420  0.04683705
##   20     0.05082021  0.06117320  0.6520866  0.04680960
##   21     0.05082549  0.06117803  0.6520384  0.04681449
##   22     0.05077328  0.06111275  0.6527660  0.04678277
##   23     0.05077074  0.06111136  0.6527832  0.04677736
##   24     0.05051821  0.06083412  0.6559095  0.04664493
## 
## RMSLE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 24.

6.1.6 Support vector machines with polynomial kernel

set.seed(7654)
nptcFE_svmPoly <- train(fe ~ ., df_train %>% select(pred_fe) %>% 
                          mutate(fe = train$formation_energy_ev_natom),
                        preProcess = c("center", "scale"),
                        method = "svmPoly",
                        metric = "RMSLE",
                        maximize = FALSE,
                        tuneGrid = expand.grid(degree = c(1,2),
                                               scale = c(.01, .05),
                                               C = 2 ^((1:14) - 3)),
                        trControl = trainControl(method = "LGOCV",
                                                 number = 10, p = 0.8,
                                                 search = "grid",
                                                 summaryFunction = custom_summary))
nptcFE_svmPoly
## Support Vector Machines with Polynomial Kernel 
## 
## 2367 samples
##   25 predictor
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   degree  scale  C        RMSLE       RMSE        Rsquared   MAE       
##   1       0.01      0.25  0.05303163  0.06372709  0.6238714  0.04635678
##   1       0.01      0.50  0.05288598  0.06352739  0.6267144  0.04601864
##   1       0.01      1.00  0.05280635  0.06343812  0.6285939  0.04586983
##   1       0.01      2.00  0.05279848  0.06344259  0.6288899  0.04582669
##   1       0.01      4.00  0.05278871  0.06344653  0.6291808  0.04578770
##   1       0.01      8.00  0.05271643  0.06337637  0.6302997  0.04571689
##   1       0.01     16.00  0.05264935  0.06330237  0.6316051  0.04567156
##   1       0.01     32.00  0.05255486  0.06319271  0.6332019  0.04563178
##   1       0.01     64.00  0.05250365  0.06313705  0.6338591  0.04561002
##   1       0.01    128.00  0.05242958  0.06305212  0.6348403  0.04555832
##   1       0.01    256.00  0.05239535  0.06301471  0.6352561  0.04554813
##   1       0.01    512.00  0.05238151  0.06299599  0.6354864  0.04553870
##   1       0.01   1024.00  0.05237346  0.06298743  0.6356515  0.04553400
##   1       0.01   2048.00  0.05238252  0.06300110  0.6355282  0.04553735
##   1       0.05      0.25  0.05280266  0.06343668  0.6287438  0.04585723
##   1       0.05      0.50  0.05279025  0.06343735  0.6290528  0.04580967
##   1       0.05      1.00  0.05274856  0.06340446  0.6296441  0.04576344
##   1       0.05      2.00  0.05270107  0.06336280  0.6305994  0.04570423
##   1       0.05      4.00  0.05259092  0.06323326  0.6324550  0.04564298
##   1       0.05      8.00  0.05253242  0.06316834  0.6334921  0.04562092
##   1       0.05     16.00  0.05249231  0.06312381  0.6340780  0.04559687
##   1       0.05     32.00  0.05240220  0.06302081  0.6351793  0.04554804
##   1       0.05     64.00  0.05238480  0.06300187  0.6353761  0.04554546
##   1       0.05    128.00  0.05236934  0.06298267  0.6356564  0.04553404
##   1       0.05    256.00  0.05237475  0.06298914  0.6356178  0.04553684
##   1       0.05    512.00  0.05237901  0.06299464  0.6355719  0.04553875
##   1       0.05   1024.00  0.05237285  0.06298716  0.6356828  0.04553419
##   1       0.05   2048.00  0.05241944  0.06304129  0.6352567  0.04552324
##   2       0.01      0.25  0.04440479  0.05375998  0.7338174  0.03910168
##   2       0.01      0.50  0.04047780  0.04931863  0.7762195  0.03582314
##   2       0.01      1.00  0.03669306  0.04511253  0.8128260  0.03254008
##   2       0.01      2.00  0.03398348  0.04210019  0.8364578  0.02977119
##   2       0.01      4.00  0.03251737  0.04045806  0.8483744  0.02803416
##   2       0.01      8.00  0.03191646  0.03976558  0.8533835  0.02719694
##   2       0.01     16.00  0.03177420  0.03960129  0.8546791  0.02690888
##   2       0.01     32.00  0.03182931  0.03967116  0.8542443  0.02685753
##   2       0.01     64.00  0.03179462  0.03961379  0.8549181  0.02679009
##   2       0.01    128.00  0.03168266  0.03944079  0.8563496  0.02672254
##   2       0.01    256.00  0.03146866  0.03912237  0.8587685  0.02659977
##   2       0.01    512.00  0.03134772  0.03892149  0.8603416  0.02660528
##   2       0.01   1024.00  0.03128940  0.03882059  0.8611145  0.02662766
##   2       0.01   2048.00  0.03152237  0.03912191  0.8591974  0.02676966
##   2       0.05      0.25  0.03209372  0.03998135  0.8519491  0.02744699
##   2       0.05      0.50  0.03181429  0.03964124  0.8542995  0.02698939
##   2       0.05      1.00  0.03182882  0.03966594  0.8542439  0.02690531
##   2       0.05      2.00  0.03183607  0.03966390  0.8544557  0.02681500
##   2       0.05      4.00  0.03180999  0.03960732  0.8551200  0.02680538
##   2       0.05      8.00  0.03152910  0.03921542  0.8580325  0.02663160
##   2       0.05     16.00  0.03135023  0.03893678  0.8601425  0.02656074
##   2       0.05     32.00  0.03129999  0.03883898  0.8610172  0.02664799
##   2       0.05     64.00  0.03140662  0.03896018  0.8602210  0.02670596
##   2       0.05    128.00  0.03166015  0.03930672  0.8580440  0.02684595
##   2       0.05    256.00  0.03186476  0.03963626  0.8558823  0.02692723
##   2       0.05    512.00  0.03238235  0.04050478  0.8507114  0.02721318
##   2       0.05   1024.00  0.03405211  0.04321461  0.8332330  0.02855112
##   2       0.05   2048.00  0.03525620  0.04670598  0.8114283  0.02858585
## 
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 2, scale = 0.01 and C
##  = 1024.

6.1.7 Support vector machines with radial basis function kernel

# Estimate parameter sigma from function sigest
sigfe <- kernlab::sigest(fe ~ .,data = (df_train %>% select(pred_fe) %>% 
                                          mutate(fe = train$formation_energy_ev_natom)), frac=1)

set.seed(7654)
nptcFE_svmR <- train(fe ~ ., df_train %>% select(pred_fe) %>% 
                       mutate(fe = train$formation_energy_ev_natom),
                     preProcess = c("center", "scale"),
                     method = "svmRadial",
                     metric = "RMSLE",
                     maximize = FALSE,
                     tuneGrid = expand.grid(sigma = sigfe[2],
                                            C = 2 ^((1:10) - 3)),
                     trControl = trainControl(method = "LGOCV",
                                              number = 10, p = 0.8,
                                              search = "grid",
                                              summaryFunction = custom_summary))
nptcFE_svmR
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2367 samples
##   25 predictor
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSLE       RMSE        Rsquared   MAE       
##     0.25  0.03564336  0.04413309  0.8230643  0.03141073
##     0.50  0.03317713  0.04138695  0.8423178  0.02874373
##     1.00  0.03165542  0.03964695  0.8543552  0.02685582
##     2.00  0.03061761  0.03842070  0.8628775  0.02553323
##     4.00  0.03011976  0.03783849  0.8669832  0.02482815
##     8.00  0.02980051  0.03744094  0.8698271  0.02443938
##    16.00  0.02980813  0.03744774  0.8699221  0.02446343
##    32.00  0.03030306  0.03801979  0.8663222  0.02484204
##    64.00  0.03132871  0.03926209  0.8584334  0.02549841
##   128.00  0.03296946  0.04125547  0.8455543  0.02653585
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02028928
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02028928 and C = 8.

6.1.8 Extreme gradient boosting

set.seed(7654)
nptcFE_xgbt <- train(fe ~ ., df_train %>% select(pred_fe) %>% 
                       mutate(fe = train$formation_energy_ev_natom),
                     method = "xgbTree",
                     metric = "RMSLE",
                     maximize = FALSE,
                     tuneGrid = expand.grid(nrounds = seq(50,2000,50),
                                            max_depth = 5,
                                            eta = 0.01,
                                            gamma = 0,
                                            colsample_bytree = 0.5,
                                            min_child_weight = 1,
                                            subsample = 1),
                     trControl = trainControl(method = "LGOCV",
                                              number = 10, p = 0.8,
                                              search = "grid",
                                              summaryFunction = custom_summary))
nptcFE_xgbt
## eXtreme Gradient Boosting 
## 
## 2367 samples
##   25 predictor
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   nrounds  RMSLE       RMSE        Rsquared   MAE       
##     50     0.16493523  0.20389366  0.8235249  0.19056779
##    100     0.10799095  0.12900836  0.8400868  0.11803869
##    150     0.07250731  0.08521288  0.8511644  0.07612876
##    200     0.05157697  0.06073893  0.8585337  0.05248216
##    250     0.04015662  0.04805988  0.8636729  0.03948127
##    300     0.03440070  0.04197719  0.8673314  0.03240565
##    350     0.03167588  0.03921555  0.8696018  0.02859465
##    400     0.03041817  0.03797479  0.8711436  0.02661904
##    450     0.02981748  0.03738509  0.8723213  0.02566341
##    500     0.02952313  0.03709966  0.8730306  0.02520750
##    550     0.02936829  0.03695084  0.8735082  0.02497408
##    600     0.02926580  0.03684738  0.8739589  0.02483718
##    650     0.02920784  0.03678779  0.8742277  0.02475737
##    700     0.02916442  0.03674320  0.8744446  0.02469493
##    750     0.02912773  0.03670453  0.8746467  0.02464396
##    800     0.02909705  0.03667171  0.8748286  0.02460275
##    850     0.02907624  0.03665039  0.8749392  0.02457183
##    900     0.02905850  0.03663322  0.8750385  0.02453916
##    950     0.02904627  0.03662020  0.8751119  0.02451404
##   1000     0.02903989  0.03661488  0.8751360  0.02449278
##   1050     0.02903618  0.03661218  0.8751464  0.02447313
##   1100     0.02902356  0.03659723  0.8752331  0.02445197
##   1150     0.02901742  0.03659150  0.8752602  0.02443802
##   1200     0.02901106  0.03658604  0.8752869  0.02441893
##   1250     0.02900900  0.03658566  0.8752880  0.02440859
##   1300     0.02900679  0.03658396  0.8752935  0.02440077
##   1350     0.02900327  0.03657953  0.8753206  0.02439043
##   1400     0.02900090  0.03657666  0.8753329  0.02438279
##   1450     0.02900218  0.03657962  0.8753115  0.02437744
##   1500     0.02899915  0.03657646  0.8753331  0.02436831
##   1550     0.02900270  0.03658163  0.8752988  0.02436426
##   1600     0.02900570  0.03658624  0.8752668  0.02435865
##   1650     0.02901009  0.03659254  0.8752247  0.02435434
##   1700     0.02901894  0.03660561  0.8751400  0.02435544
##   1750     0.02901514  0.03660180  0.8751683  0.02434952
##   1800     0.02901408  0.03660074  0.8751776  0.02434778
##   1850     0.02901359  0.03660094  0.8751763  0.02434404
##   1900     0.02901844  0.03660874  0.8751244  0.02434412
##   1950     0.02902057  0.03661145  0.8751042  0.02434245
##   2000     0.02902437  0.03661773  0.8750637  0.02434155
## 
## Tuning parameter 'max_depth' was held constant at a value of 5
## Tuning parameter 'eta' was held constant at a value of 0.01
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.5
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 1
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 1500, max_depth =
##  5, eta = 0.01, gamma = 0, colsample_bytree = 0.5, min_child_weight =
##  1 and subsample = 1.

6.2 Band gap energy

6.2.1 Model averaged neural network

set.seed(7654)
nptcBE_avNNet <- train(be ~ ., df_train %>% select(pred_be) %>%
                              mutate(be = train$bandgap_energy_ev),
                            preProcess = c("center", "scale"),
                            method = "avNNet",
                            metric = "RMSLE",
                            maximize = FALSE,
                            tuneGrid =  expand.grid(decay = c(0, 0.005, 0.01),
                                                    size = 3:12,
                                                    bag = FALSE),
                            trControl = trainControl(method = "LGOCV",
                                                     number = 10, p = 0.8,
                                                     search = "grid",
                                                     summaryFunction = custom_summary),
                            linout = TRUE,
                            trace = FALSE,
                            MaxNWts = 12 * (ncol(df_train) + 1) + 12 + 1,
                            maxit = 500)

nptcBE_avNNet
## Model Averaged Neural Network 
## 
## 2367 samples
##   28 predictor
## 
## Pre-processing: centered (28), scaled (28) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSLE       RMSE       Rsquared   MAE      
##   0.000   3    0.08404025  0.2037214  0.9588276  0.1335227
##   0.000   4    0.08261769  0.2009471  0.9599725  0.1282769
##   0.000   5    0.08371242  0.2039432  0.9586260  0.1265017
##   0.000   6    0.08273097  0.2038222  0.9586080  0.1254884
##   0.000   7    0.08381438  0.2027824  0.9591426  0.1267370
##   0.000   8    0.08762308  0.2173988  0.9516979  0.1281144
##   0.000   9    0.08535635  0.2109302  0.9555472  0.1293183
##   0.000  10    0.08538280  0.2098329  0.9561475  0.1294692
##   0.000  11    0.08649547  0.2097370  0.9562834  0.1293949
##   0.000  12    0.08516309  0.2104170  0.9558893  0.1310796
##   0.005   3    0.08200315  0.1983676  0.9609411  0.1286399
##   0.005   4    0.08244157  0.1989804  0.9606857  0.1269683
##   0.005   5    0.08195756  0.1992827  0.9605967  0.1256481
##   0.005   6    0.08191602  0.1984433  0.9609101  0.1243202
##   0.005   7    0.08413509  0.2018799  0.9595533  0.1257244
##   0.005   8    0.08362498  0.2032402  0.9589942  0.1266037
##   0.005   9    0.08466929  0.2047813  0.9583684  0.1277540
##   0.005  10    0.08418874  0.2044626  0.9585109  0.1271724
##   0.005  11    0.08396103  0.2034121  0.9589198  0.1272324
##   0.005  12    0.08559540  0.2088600  0.9566854  0.1285773
##   0.010   3    0.08248061  0.1995188  0.9605467  0.1291922
##   0.010   4    0.08203812  0.1998114  0.9603224  0.1269501
##   0.010   5    0.08185753  0.2000541  0.9603152  0.1262980
##   0.010   6    0.08219443  0.1984496  0.9609156  0.1242304
##   0.010   7    0.08319571  0.2017529  0.9595479  0.1256168
##   0.010   8    0.08476149  0.2057500  0.9578969  0.1270994
##   0.010   9    0.08415407  0.2025704  0.9592362  0.1261027
##   0.010  10    0.08413237  0.2041983  0.9586509  0.1272307
##   0.010  11    0.08617947  0.2079835  0.9570539  0.1291672
##   0.010  12    0.08548313  0.2073811  0.9573169  0.1284922
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.01 and bag
##  = FALSE.

6.2.2 Cubist

set.seed(7654)
nptcBE_cubist <- train(be ~ ., df_train %>% select(pred_be) %>% 
                         mutate(be = train$bandgap_energy_ev),
                       method = "cubist",
                       metric = "RMSLE",
                       maximize = FALSE,
                       tuneGrid = expand.grid(committees = seq(30,60,5),
                                              neighbors = c(0,3,5,7,9)),
                       trControl = trainControl(method = "LGOCV",
                                                number = 10, p = 0.8,
                                                search = "grid",
                                                summaryFunction = custom_summary))
nptcBE_cubist
## Cubist 
## 
## 2367 samples
##   28 predictor
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSLE       RMSE       Rsquared   MAE      
##   30          0          0.08666475  0.2133435  0.9548239  0.1307867
##   30          3          0.08856421  0.2210377  0.9514488  0.1322443
##   30          5          0.08663693  0.2149943  0.9540184  0.1288934
##   30          7          0.08616532  0.2136550  0.9545934  0.1281160
##   30          9          0.08553913  0.2117360  0.9554004  0.1275286
##   35          0          0.08662359  0.2131488  0.9548967  0.1305923
##   35          3          0.08856284  0.2209245  0.9515046  0.1322135
##   35          5          0.08665173  0.2149025  0.9540594  0.1288794
##   35          7          0.08618969  0.2136094  0.9546139  0.1281794
##   35          9          0.08558043  0.2117640  0.9553867  0.1275976
##   40          0          0.08661211  0.2126452  0.9551128  0.1302890
##   40          3          0.08848481  0.2202981  0.9517745  0.1319701
##   40          5          0.08656986  0.2143245  0.9543047  0.1286075
##   40          7          0.08613397  0.2130629  0.9548431  0.1279354
##   40          9          0.08551883  0.2111951  0.9556226  0.1273328
##   45          0          0.08656795  0.2123644  0.9552429  0.1302462
##   45          3          0.08839650  0.2198807  0.9519715  0.1317481
##   45          5          0.08648553  0.2139287  0.9544859  0.1284283
##   45          7          0.08603627  0.2126020  0.9550484  0.1277412
##   45          9          0.08542599  0.2107551  0.9558184  0.1271660
##   50          0          0.08664465  0.2124133  0.9552243  0.1303230
##   50          3          0.08840085  0.2197853  0.9520223  0.1317035
##   50          5          0.08649170  0.2138480  0.9545295  0.1284415
##   50          7          0.08604147  0.2125226  0.9550902  0.1277346
##   50          9          0.08545066  0.2106964  0.9558498  0.1271661
##   55          0          0.08670069  0.2127431  0.9550800  0.1305373
##   55          3          0.08845284  0.2200438  0.9519081  0.1317233
##   55          5          0.08654849  0.2141106  0.9544158  0.1284885
##   55          7          0.08610913  0.2128081  0.9549649  0.1278077
##   55          9          0.08552276  0.2110051  0.9557151  0.1272741
##   60          0          0.08661851  0.2126222  0.9551207  0.1304141
##   60          3          0.08841104  0.2200103  0.9519193  0.1316963
##   60          5          0.08650127  0.2140779  0.9544256  0.1284180
##   60          7          0.08607442  0.2127927  0.9549651  0.1277213
##   60          9          0.08548766  0.2109936  0.9557128  0.1271736
## 
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 45 and neighbors = 9.

6.2.3 Elasticnet

set.seed(7654)
nptcBE_enet <- train(be ~ ., df_train %>% select(pred_be) %>%
                       mutate(be = train$bandgap_energy_ev),
                     preProcess = c("center", "scale"),
                     method = "enet",
                     metric = "RMSLE",
                     maximize = FALSE,
                     tuneGrid = expand.grid(lambda = c(0, 10 ^ seq(-1, -4, length = 5 - 1)),
                                             fraction = seq(.05, 1, length = 5)),
                     trControl = trainControl(method = "LGOCV",
                                              number = 10, p = 0.8,
                                              search = "grid",
                                              summaryFunction = custom_summary))
nptcBE_enet 
## Elasticnet 
## 
## 2367 samples
##   28 predictor
## 
## Pre-processing: centered (28), scaled (28) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSLE      RMSE       Rsquared   MAE      
##   0e+00   0.0500    0.1109490  0.2820488  0.9211144  0.2073603
##   0e+00   0.2875    0.1108917  0.2818933  0.9212061  0.2072545
##   0e+00   0.5250    0.1108531  0.2817903  0.9212688  0.2071955
##   0e+00   0.7625    0.1108332  0.2817398  0.9213024  0.2071975
##   0e+00   1.0000    0.1108321  0.2817419  0.9213068  0.2072515
##   1e-04   0.0500    0.2687003  0.7999602  0.6946055  0.6463144
##   1e-04   0.2875    0.1198239  0.3153043  0.9055252  0.2325662
##   1e-04   0.5250    0.1120995  0.2832871  0.9203850  0.2075914
##   1e-04   0.7625    0.1116511  0.2825539  0.9208210  0.2073073
##   1e-04   1.0000    0.1112337  0.2821427  0.9210565  0.2072695
##   1e-03   0.0500    0.2947844  0.8780888  0.6892341  0.7173401
##   1e-03   0.2875    0.1589198  0.4676690  0.8393872  0.3367075
##   1e-03   0.5250    0.1164680  0.2998870  0.9124843  0.2189411
##   1e-03   0.7625    0.1125337  0.2837817  0.9201028  0.2079721
##   1e-03   1.0000    0.1119830  0.2829681  0.9205878  0.2074517
##   1e-02   0.0500    0.3042525  0.9067878  0.6836412  0.7429120
##   1e-02   0.2875    0.1872207  0.5583297  0.7839152  0.4109622
##   1e-02   0.5250    0.1284262  0.3555164  0.8900068  0.2626566
##   1e-02   0.7625    0.1147026  0.2904636  0.9165636  0.2112560
##   1e-02   1.0000    0.1129874  0.2844091  0.9197943  0.2082797
##   1e-01   0.0500    0.3066216  0.9139026  0.6846550  0.7492952
##   1e-01   0.2875    0.1957727  0.5834742  0.7555983  0.4311719
##   1e-01   0.5250    0.1353658  0.3815008  0.8758520  0.2782292
##   1e-01   0.7625    0.1210658  0.3036766  0.9085381  0.2212191
##   1e-01   1.0000    0.1177591  0.2922457  0.9166275  0.2141557
## 
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 1 and lambda = 0.

6.2.4 Weighted k-nearest neighbors

set.seed(7654)
nptcBE_kknn <- train(be ~ ., df_train %>% select(pred_be) %>% 
                       mutate(be = train$bandgap_energy_ev),
                     preProcess = c("center", "scale"),
                     method = "kknn",
                     metric = "RMSLE",
                     maximize = FALSE,
                     tuneGrid = expand.grid(kmax = c(6, 8, 10, 12, 14),
                                            distance = 1,
                                            kernel = "optimal"),
                     trControl = trainControl(method = "LGOCV",
                                              number = 10, p = 0.8,
                                              search = "grid",
                                              summaryFunction = custom_summary))
nptcBE_kknn
## k-Nearest Neighbors 
## 
## 2367 samples
##   28 predictor
## 
## Pre-processing: centered (28), scaled (28) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   kmax  RMSLE      RMSE       Rsquared   MAE      
##    6    0.1048465  0.2834887  0.9213232  0.1893098
##    8    0.1045391  0.2834038  0.9220607  0.1900744
##   10    0.1046645  0.2837562  0.9219978  0.1906079
##   12    0.1046645  0.2837562  0.9219978  0.1906079
##   14    0.1046645  0.2837562  0.9219978  0.1906079
## 
## Tuning parameter 'distance' was held constant at a value of 1
## 
## Tuning parameter 'kernel' was held constant at a value of optimal
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 8, distance = 1 and
##  kernel = optimal.

6.2.5 Partial least squares

set.seed(7654)
nptcBE_pls <- train(be ~ ., df_train %>% select(pred_be) %>% 
                      mutate(be = train$bandgap_energy_ev),
                    preProcess = c("center", "scale"),
                    method = "pls",
                    metric = "RMSLE",
                    maximize = FALSE,
                    tuneLength = ncol(df_train)-1,
                    trControl = trainControl(method = "LGOCV",
                                             number = 10, p = 0.8,
                                             search = "grid",
                                             summaryFunction = custom_summary))
nptcBE_pls
## Partial Least Squares 
## 
## 2367 samples
##   28 predictor
## 
## Pre-processing: centered (28), scaled (28) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSLE      RMSE       Rsquared   MAE      
##    1     0.1913732  0.5314200  0.7200892  0.4079832
##    2     0.1549723  0.4262093  0.8199190  0.3219768
##    3     0.1211774  0.3161109  0.9009182  0.2346206
##    4     0.1177785  0.2955907  0.9132820  0.2141636
##    5     0.1166281  0.2904437  0.9162259  0.2091514
##    6     0.1152155  0.2872513  0.9181367  0.2098431
##    7     0.1135308  0.2868033  0.9184071  0.2099670
##    8     0.1128023  0.2859694  0.9188808  0.2097137
##    9     0.1127386  0.2854893  0.9191417  0.2096235
##   10     0.1129068  0.2853049  0.9192607  0.2094079
##   11     0.1131371  0.2850669  0.9194060  0.2089860
##   12     0.1129296  0.2849366  0.9194780  0.2090415
##   13     0.1129290  0.2848923  0.9194962  0.2091444
##   14     0.1127245  0.2843559  0.9198094  0.2085523
##   15     0.1126358  0.2842960  0.9198498  0.2081096
##   16     0.1124747  0.2839506  0.9200481  0.2077635
##   17     0.1124035  0.2840198  0.9200145  0.2077878
##   18     0.1122914  0.2835854  0.9202457  0.2076000
##   19     0.1121049  0.2834567  0.9203098  0.2074980
##   20     0.1120681  0.2835811  0.9202433  0.2075443
##   21     0.1120246  0.2835156  0.9202785  0.2074990
##   22     0.1116514  0.2826444  0.9207647  0.2071727
##   23     0.1116912  0.2826862  0.9207425  0.2071279
##   24     0.1116306  0.2825897  0.9208023  0.2073384
##   25     0.1112928  0.2821227  0.9210628  0.2074953
##   26     0.1109507  0.2820686  0.9211028  0.2073812
##   27     0.1108321  0.2817419  0.9213068  0.2072515
## 
## RMSLE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 27.

6.2.6 Support vector machines with polynomial kernel

set.seed(7654)
nptcBE_svmPoly <- train(be ~ ., df_train %>% select(pred_be) %>%
                          mutate(be = train$bandgap_energy_ev),
                        preProcess = c("center", "scale"),
                        method = "svmPoly",
                        metric = "RMSLE",
                        maximize = FALSE,
                        tuneGrid = expand.grid(degree = c(1,2),
                                               scale = c(.001, .005, .01),
                                               C = 2 ^((1:10) - 3)),
                        trControl = trainControl(method = "LGOCV",
                                                 number = 10, p = 0.8,
                                                 search = "grid",
                                                 summaryFunction = custom_summary))
nptcBE_svmPoly
## Support Vector Machines with Polynomial Kernel 
## 
## 2367 samples
##   28 predictor
## 
## Pre-processing: centered (28), scaled (28) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   degree  scale  C       RMSLE       RMSE       Rsquared   MAE      
##   1       0.001    0.25  0.16760030  0.5088291  0.8349129  0.3748828
##   1       0.001    0.50  0.13260661  0.3920336  0.8821464  0.2784349
##   1       0.001    1.00  0.11758773  0.3303453  0.9035441  0.2332121
##   1       0.001    2.00  0.11345362  0.3046767  0.9121194  0.2163861
##   1       0.001    4.00  0.11331273  0.2934693  0.9158364  0.2097283
##   1       0.001    8.00  0.11341680  0.2888294  0.9176800  0.2076724
##   1       0.001   16.00  0.11331904  0.2871094  0.9184318  0.2070740
##   1       0.001   32.00  0.11314948  0.2862559  0.9188472  0.2065886
##   1       0.001   64.00  0.11312813  0.2859056  0.9190226  0.2063122
##   1       0.001  128.00  0.11288512  0.2854668  0.9192818  0.2060789
##   1       0.005    0.25  0.11547602  0.3195180  0.9072570  0.2258988
##   1       0.005    0.50  0.11326551  0.3002486  0.9134875  0.2136172
##   1       0.005    1.00  0.11338369  0.2914300  0.9166207  0.2086136
##   1       0.005    2.00  0.11336778  0.2880360  0.9180248  0.2073371
##   1       0.005    4.00  0.11327934  0.2867897  0.9185919  0.2069433
##   1       0.005    8.00  0.11310092  0.2860682  0.9189518  0.2064310
##   1       0.005   16.00  0.11297354  0.2857073  0.9191524  0.2062406
##   1       0.005   32.00  0.11284091  0.2854208  0.9193035  0.2060014
##   1       0.005   64.00  0.11269149  0.2852091  0.9194260  0.2058856
##   1       0.005  128.00  0.11254760  0.2849682  0.9195566  0.2056609
##   1       0.010    0.25  0.11326502  0.3002443  0.9134915  0.2136102
##   1       0.010    0.50  0.11338460  0.2914297  0.9166194  0.2086156
##   1       0.010    1.00  0.11337346  0.2880591  0.9180132  0.2073469
##   1       0.010    2.00  0.11328114  0.2867916  0.9185914  0.2069488
##   1       0.010    4.00  0.11309087  0.2860670  0.9189514  0.2064290
##   1       0.010    8.00  0.11297296  0.2857128  0.9191494  0.2062336
##   1       0.010   16.00  0.11284045  0.2854194  0.9193041  0.2059962
##   1       0.010   32.00  0.11266060  0.2851875  0.9194326  0.2058766
##   1       0.010   64.00  0.11254521  0.2849728  0.9195525  0.2056643
##   1       0.010  128.00  0.11242154  0.2847192  0.9196881  0.2054642
##   2       0.001    0.25  0.13221063  0.3905877  0.8830853  0.2772337
##   2       0.001    0.50  0.11682868  0.3278178  0.9050609  0.2308856
##   2       0.001    1.00  0.11186260  0.3001572  0.9148577  0.2122337
##   2       0.001    2.00  0.11040134  0.2854315  0.9206269  0.2022943
##   2       0.001    4.00  0.10840004  0.2748755  0.9255356  0.1947110
##   2       0.001    8.00  0.10507000  0.2642405  0.9309142  0.1858584
##   2       0.001   16.00  0.10089888  0.2522106  0.9369823  0.1754251
##   2       0.001   32.00  0.09589693  0.2388211  0.9435133  0.1631097
##   2       0.001   64.00  0.09150384  0.2266615  0.9491221  0.1518735
##   2       0.001  128.00  0.08839324  0.2172527  0.9532187  0.1432503
##   2       0.005    0.25  0.10613680  0.2799612  0.9252540  0.1951508
##   2       0.005    0.50  0.10204485  0.2607430  0.9334862  0.1807551
##   2       0.005    1.00  0.09730535  0.2452557  0.9407083  0.1680388
##   2       0.005    2.00  0.09303509  0.2320019  0.9467751  0.1560923
##   2       0.005    4.00  0.08957678  0.2211542  0.9515448  0.1462550
##   2       0.005    8.00  0.08727414  0.2137974  0.9547189  0.1393041
##   2       0.005   16.00  0.08589502  0.2094329  0.9565180  0.1351232
##   2       0.005   32.00  0.08571292  0.2082324  0.9570047  0.1337566
##   2       0.005   64.00  0.08607687  0.2083921  0.9569366  0.1340237
##   2       0.005  128.00  0.08628569  0.2088570  0.9567150  0.1345763
##   2       0.010    0.25  0.09762560  0.2483019  0.9397491  0.1697841
##   2       0.010    0.50  0.09300358  0.2331161  0.9463885  0.1567831
##   2       0.010    1.00  0.08971619  0.2218234  0.9512833  0.1465606
##   2       0.010    2.00  0.08726747  0.2139900  0.9546364  0.1394059
##   2       0.010    4.00  0.08590408  0.2095281  0.9564827  0.1351720
##   2       0.010    8.00  0.08574467  0.2083553  0.9569618  0.1338936
##   2       0.010   16.00  0.08606128  0.2084090  0.9569240  0.1340454
##   2       0.010   32.00  0.08606444  0.2085500  0.9568488  0.1343117
##   2       0.010   64.00  0.08655509  0.2090538  0.9566459  0.1349958
##   2       0.010  128.00  0.08716553  0.2097148  0.9563964  0.1353621
## 
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 2, scale = 0.005 and C
##  = 32.

6.2.7 Support vector machines with radial basis function kernel

# Estimate parameter sigma from function sigest
sigbe <- kernlab::sigest(be ~ .,data = (df_train %>% select(pred_be) %>% 
                                          mutate(be = train$bandgap_energy_ev)), frac=1)
set.seed(7654)
nptcBE_svmR <- train(be ~ ., df_train %>% select(pred_be) %>% 
                       mutate(be = train$bandgap_energy_ev),
                     preProcess = c("center", "scale"),
                     method = "svmRadial",
                     metric = "RMSLE",
                     maximize = FALSE,
                     tuneGrid = expand.grid(sigma = sigbe[2],
                                            C = 2 ^((1:10) - 3)),
                     trControl = trainControl(method = "LGOCV",
                                              number = 10, p = 0.8,
                                              search = "grid",
                                              summaryFunction = custom_summary))
nptcBE_svmR
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2367 samples
##   28 predictor
## 
## Pre-processing: centered (28), scaled (28) 
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSLE       RMSE       Rsquared   MAE      
##     0.25  0.09555421  0.2493848  0.9397653  0.1612058
##     0.50  0.09170542  0.2346348  0.9458592  0.1507655
##     1.00  0.08959076  0.2257370  0.9495469  0.1439978
##     2.00  0.08809119  0.2201960  0.9518185  0.1399394
##     4.00  0.08734206  0.2175768  0.9528973  0.1374790
##     8.00  0.08779134  0.2175826  0.9528689  0.1368912
##    16.00  0.08907483  0.2198592  0.9518484  0.1375226
##    32.00  0.09155089  0.2260517  0.9491353  0.1408886
##    64.00  0.09693138  0.2378759  0.9438022  0.1477634
##   128.00  0.10240700  0.2533572  0.9366309  0.1562443
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01826589
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01826589 and C = 4.

6.2.8 Extreme gradient boosting

set.seed(7654)
nptcBE_xgbt <- train(be ~ ., df_train %>% select(pred_be) %>% 
                       mutate(be = train$bandgap_energy_ev),
                     method = "xgbTree",
                     metric = "RMSLE",
                     maximize = FALSE,
                     tuneGrid = expand.grid(nrounds = seq(50,2000,50),
                                            max_depth = 4,
                                            eta = 0.02,
                                            gamma = 0.006,
                                            colsample_bytree = 0.7,
                                            min_child_weight = 1,
                                            subsample = 1),
                     trControl = trainControl(method = "LGOCV",
                                              number = 10, p = 0.8,
                                              search = "grid",
                                              summaryFunction = custom_summary))
nptcBE_xgbt
## eXtreme Gradient Boosting 
## 
## 2367 samples
##   28 predictor
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%) 
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ... 
## Resampling results across tuning parameters:
## 
##   nrounds  RMSLE       RMSE       Rsquared   MAE      
##     50     0.22731412  0.7540470  0.9314675  0.6170701
##    100     0.11856587  0.3669656  0.9448323  0.2941888
##    150     0.09509505  0.2508754  0.9509037  0.1870335
##    200     0.08993156  0.2223069  0.9536518  0.1540430
##    250     0.08802606  0.2143664  0.9551324  0.1436202
##    300     0.08695346  0.2109817  0.9561451  0.1397350
##    350     0.08627867  0.2092370  0.9567390  0.1378994
##    400     0.08577763  0.2080224  0.9571780  0.1366337
##    450     0.08535223  0.2070865  0.9575288  0.1357365
##    500     0.08499707  0.2063638  0.9578032  0.1350049
##    550     0.08474098  0.2058341  0.9580015  0.1343646
##    600     0.08453966  0.2054274  0.9581560  0.1338640
##    650     0.08439679  0.2050918  0.9582863  0.1334444
##    700     0.08436009  0.2049672  0.9583321  0.1331799
##    750     0.08426845  0.2047519  0.9584156  0.1328730
##    800     0.08416334  0.2044871  0.9585209  0.1325970
##    850     0.08411278  0.2043789  0.9585640  0.1323875
##    900     0.08404836  0.2042505  0.9586126  0.1322020
##    950     0.08400962  0.2041124  0.9586674  0.1319674
##   1000     0.08398502  0.2040361  0.9586956  0.1318135
##   1050     0.08395417  0.2039769  0.9587178  0.1316942
##   1100     0.08392927  0.2039101  0.9587424  0.1315924
##   1150     0.08392302  0.2038674  0.9587577  0.1314863
##   1200     0.08390023  0.2038261  0.9587727  0.1313944
##   1250     0.08389008  0.2038296  0.9587685  0.1313577
##   1300     0.08387259  0.2038032  0.9587772  0.1312677
##   1350     0.08390664  0.2038692  0.9587479  0.1312614
##   1400     0.08386715  0.2037712  0.9587871  0.1311488
##   1450     0.08385119  0.2037322  0.9588012  0.1311128
##   1500     0.08382375  0.2036871  0.9588179  0.1310226
##   1550     0.08383425  0.2037056  0.9588120  0.1309960
##   1600     0.08384602  0.2037275  0.9588029  0.1309646
##   1650     0.08386195  0.2037648  0.9587892  0.1309330
##   1700     0.08387827  0.2038070  0.9587727  0.1309622
##   1750     0.08388969  0.2038577  0.9587560  0.1309589
##   1800     0.08392644  0.2039070  0.9587356  0.1309123
##   1850     0.08393212  0.2038926  0.9587415  0.1308403
##   1900     0.08393373  0.2039098  0.9587354  0.1308317
##   1950     0.08395218  0.2039344  0.9587242  0.1308229
##   2000     0.08394306  0.2039055  0.9587364  0.1307980
## 
## Tuning parameter 'max_depth' was held constant at a value of 4
## Tuning parameter 'eta' was held constant at a value of 0.02
## Tuning parameter 'gamma' was held constant at a value of 0.006
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.7
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 1
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 1500, max_depth =
##  4, eta = 0.02, gamma = 0.006, colsample_bytree = 0.7, min_child_weight
##  = 1 and subsample = 1.

7 Model Comparisons and Ensembling

Since we used the same resampling indices for every model (by setting the same seed value), we can now compare box-plots of the cross-validated RMSLE estimates across the models fitted for predicting formation energy and band gap energy. For ensembling we consider only those models whose box-plots overlap with that of the best-performing model. We then use a simple weighted average, with weights assigned to the models according to their performance ranking.
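
The weights used below are picked by hand in decreasing order of each model's rank. As a minimal illustration of the mechanic (a hypothetical helper, not part of the submission code that follows), a weighted average of per-model predictions can be written as:

# Hypothetical helper: weighted average of a list of prediction vectors.
# `weights` must be ordered like `preds` and sum to 1.
weighted_ensemble <- function(preds, weights) {
  stopifnot(length(preds) == length(weights),
            isTRUE(all.equal(sum(weights), 1)))
  Reduce(`+`, Map(`*`, preds, weights))
}

For example, weighted_ensemble(list(p1, p2), c(0.7, 0.3)) returns 0.7 * p1 + 0.3 * p2, which is exactly what the explicit expressions below compute.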

7.1 Formation energy

We see that the model averaged neural network outperforms the rest.

# List of models
resamps_FE <- resamples(list(avNNet = nptcFE_avNNet,
                             cubist = nptcFE_cubist,
                             enet = nptcFE_enet,
                             kknn = nptcFE_kknn,
                             pls = nptcFE_pls,
                             svmPoly = nptcFE_svmPoly,
                             svmRadial = nptcFE_svmR,
                             xgbt = nptcFE_xgbt))

# Box-plots for comparison
bwplot(resamps_FE, metric = 'RMSLE')
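
As a numeric check of the ranking suggested by the box-plots, the mean cross-validated RMSLE per model can be read off the resamples object. A minimal sketch, assuming custom_summary reports the metric under the name RMSLE and that the columns of resamps_FE$values follow caret's "<model>~<metric>" naming:

# Mean cross-validated RMSLE per model, sorted from best to worst.
rmsle_cols <- grep("~RMSLE$", names(resamps_FE$values), value = TRUE)
sort(sapply(resamps_FE$values[rmsle_cols], mean, na.rm = TRUE))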

There are four models whose box-plots overlap with that of the best model, so we use an ensemble of five models for the test data predictions.

ma_fe = .7 * predict(nptcFE_avNNet, df_test) +
  .15 * predict(nptcFE_xgbt, df_test) +
  .06 * predict(nptcFE_cubist, df_test) +
  .05 * predict(nptcFE_svmR, df_test) +
  .04 * predict(nptcFE_svmPoly, df_test)

7.2 Band gap energy

Again the model averaged neural network outperforms the rest, but the performance ranking of the other models differs from that observed for formation energy.

# List of models
resamps_BE <- resamples(list(avNNet = nptcBE_avNNet,
                             cubist = nptcBE_cubist,
                             enet = nptcBE_enet,
                             kknn = nptcBE_kknn,
                             pls = nptcBE_pls,
                             svmPoly = nptcBE_svmPoly,
                             svmRadial = nptcBE_svmR,
                             xgbt = nptcBE_xgbt))

# Box-plots for comparison
bwplot(resamps_BE, metric = 'RMSLE')
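
Besides the visual overlap check, caret's diff() on a resamples object gives pairwise differences of the resampled RMSLE values together with paired t-tests. A short sketch (not used for the submissions below):

# Pairwise differences in resampled RMSLE between the band gap models;
# summary() reports estimated differences and p-values for each pair.
be_diffs <- diff(resamps_BE, metric = "RMSLE")
summary(be_diffs)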

Again there are four models whose box-plots overlap with that of the best model, so we use an ensemble of five models for the test data predictions.

ma_be = .7 * predict(nptcBE_avNNet, df_test) +
  .15 * predict(nptcBE_xgbt, df_test) +
  .06 * predict(nptcBE_svmPoly, df_test) +
  .05 * predict(nptcBE_cubist, df_test) +
  .04 * predict(nptcBE_svmR, df_test) 

Let’s now check the performance of the best model, that is, the model averaged neural network, and of the ensemble on the test data. Since both energies are non-negative quantities, we take the absolute value of the predictions before submission. We see that the ensemble gives a small improvement over the best single model.

# Submission using best model
sub_avNNet <- read_csv('sample_submission.csv') %>%
  mutate(formation_energy_ev_natom = predict(nptcFE_avNNet, df_test),
         bandgap_energy_ev = predict(nptcBE_avNNet, df_test)) %>% abs()

write_excel_csv(sub_avNNet,'sub_avNNet.csv')

# Submission using weighted average ensemble
sub_wa5 <- read_csv('sample_submission.csv') %>%
  mutate(formation_energy_ev_natom = ma_fe,
         bandgap_energy_ev = ma_be) %>% abs()

write_excel_csv(sub_wa5, 'sub_wa5.csv')
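
Before submitting, a quick sanity check (a sketch, using the files created above) helps confirm that the absolute-value step worked as intended and the file has the expected shape:

# Sanity check: targets should be non-negative and the row count should
# match the sample submission file.
stopifnot(all(sub_wa5$formation_energy_ev_natom >= 0),
          all(sub_wa5$bandgap_energy_ev >= 0),
          nrow(sub_wa5) == nrow(read_csv('sample_submission.csv')))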

The private leaderboard scores are:

8 Conclusion

It is clear from the private leaderboard ranking that there is still scope for improving the performance metric, chiefly through more meticulous feature engineering guided by domain knowledge. Nevertheless, the solution proposed in this document improves on the \(45^{th}\) place solution, which is among the top 5% of solutions.