The aim of this challenge is to predict the band gap energy and formation energy of sesquioxides containing different proportions of the metals aluminium (Al), gallium (Ga) and indium (In). Band gap energy is an important property for optoelectronic applications, whereas formation energy is an important indicator of a material's stability.
Along with the tabular dataset, a geometry file is provided for each observation; it contains the primitive translation (lattice) vectors and the spatial positions of the atoms in that compound. These geometry files can be used to gain insights and to engineer features that correlate well with the targets.
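For orientation, each geometry.xyz file has a simple layout (an illustrative sketch, inferred from the read.table calls used later in this notebook; the placeholder values are not taken from any real file): three header lines, three lattice_vector lines holding the primitive translation vectors, and then one atom line per atom giving its Cartesian position and species.
lattice_vector a1x a1y a1z
lattice_vector a2x a2y a2z
lattice_vector a3x a3y a3z
atom x1 y1 z1 Al
atom x2 y2 z2 O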
library(tidyverse)
library(caret)
library(CORElearn)
library(corrplot)
library(gridExtra)
library(GGally)
library(plotly)
library(DT)
train <- read_csv('train.csv')
test <- read_csv('test.csv')
summary(train)
## id spacegroup number_of_total_atoms percent_atom_al
## Min. : 1.0 Min. : 12.0 Min. :10.00 Min. :0.0000
## 1st Qu.: 600.8 1st Qu.: 33.0 1st Qu.:40.00 1st Qu.:0.1667
## Median :1200.5 Median :194.0 Median :80.00 Median :0.3750
## Mean :1200.5 Mean :141.5 Mean :61.68 Mean :0.3854
## 3rd Qu.:1800.2 3rd Qu.:206.0 3rd Qu.:80.00 3rd Qu.:0.5833
## Max. :2400.0 Max. :227.0 Max. :80.00 Max. :1.0000
## percent_atom_ga percent_atom_in lattice_vector_1_ang
## Min. :0.0000 Min. :0.0000 Min. : 3.037
## 1st Qu.:0.0938 1st Qu.:0.0625 1st Qu.: 6.141
## Median :0.2812 Median :0.2500 Median : 9.537
## Mean :0.3086 Mean :0.3060 Mean :10.030
## 3rd Qu.:0.4688 3rd Qu.:0.4688 3rd Qu.:10.292
## Max. :1.0000 Max. :1.0000 Max. :24.913
## lattice_vector_2_ang lattice_vector_3_ang lattice_angle_alpha_degree
## Min. : 2.942 Min. : 5.673 Min. : 82.74
## 1st Qu.: 5.834 1st Qu.: 9.298 1st Qu.: 90.00
## Median : 6.383 Median :10.125 Median : 90.00
## Mean : 7.087 Mean :12.593 Mean : 90.24
## 3rd Qu.: 9.093 3rd Qu.:14.372 3rd Qu.: 90.01
## Max. :10.290 Max. :25.346 Max. :101.23
## lattice_angle_beta_degree lattice_angle_gamma_degree
## Min. : 81.64 Min. : 29.73
## 1st Qu.: 90.00 1st Qu.: 90.00
## Median : 90.00 Median : 90.00
## Mean : 92.40 Mean : 94.79
## 3rd Qu.: 90.01 3rd Qu.:120.00
## Max. :106.17 Max. :120.05
## formation_energy_ev_natom bandgap_energy_ev
## Min. :0.0000 Min. :0.0001
## 1st Qu.:0.1056 1st Qu.:1.2785
## Median :0.1818 Median :1.9079
## Mean :0.1876 Mean :2.0772
## 3rd Qu.:0.2563 3rd Qu.:2.7620
## Max. :0.6572 Max. :5.2861
glimpse(train)
## Observations: 2,400
## Variables: 14
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
## $ spacegroup <dbl> 33, 194, 227, 167, 194, 227, 206, 1...
## $ number_of_total_atoms <dbl> 80, 80, 40, 30, 80, 40, 80, 20, 80,...
## $ percent_atom_al <dbl> 0.6250, 0.6250, 0.8125, 0.7500, 0.0...
## $ percent_atom_ga <dbl> 0.3750, 0.3750, 0.1875, 0.0000, 0.6...
## $ percent_atom_in <dbl> 0.0000, 0.0000, 0.0000, 0.2500, 0.3...
## $ lattice_vector_1_ang <dbl> 9.9523, 6.1840, 9.7510, 5.0036, 6.6...
## $ lattice_vector_2_ang <dbl> 8.5513, 6.1838, 5.6595, 5.0034, 6.6...
## $ lattice_vector_3_ang <dbl> 9.1775, 23.6287, 13.9630, 13.5318, ...
## $ lattice_angle_alpha_degree <dbl> 90.0026, 90.0186, 90.9688, 89.9888,...
## $ lattice_angle_beta_degree <dbl> 90.0023, 89.9980, 91.1228, 90.0119,...
## $ lattice_angle_gamma_degree <dbl> 90.0017, 120.0025, 30.5185, 120.001...
## $ formation_energy_ev_natom <dbl> 0.0680, 0.2490, 0.1821, 0.2172, 0.0...
## $ bandgap_energy_ev <dbl> 3.4387, 2.9210, 2.7438, 3.3492, 1.3...
We observe that:
summary(test)
## id spacegroup number_of_total_atoms percent_atom_al
## Min. : 1.0 Min. : 12.0 Min. :10.00 Min. :0.0000
## 1st Qu.:150.8 1st Qu.: 33.0 1st Qu.:40.00 1st Qu.:0.1250
## Median :300.5 Median :194.0 Median :80.00 Median :0.3750
## Mean :300.5 Mean :139.6 Mean :61.73 Mean :0.3710
## 3rd Qu.:450.2 3rd Qu.:206.0 3rd Qu.:80.00 3rd Qu.:0.5625
## Max. :600.0 Max. :227.0 Max. :80.00 Max. :1.0000
## percent_atom_ga percent_atom_in lattice_vector_1_ang
## Min. :0.0000 Min. :0.0000 Min. : 3.073
## 1st Qu.:0.0938 1st Qu.:0.0625 1st Qu.: 6.137
## Median :0.2500 Median :0.2812 Median : 9.495
## Mean :0.3133 Mean :0.3157 Mean :10.098
## 3rd Qu.:0.4688 3rd Qu.:0.4688 3rd Qu.:10.363
## Max. :0.9688 Max. :0.9688 Max. :24.913
## lattice_vector_2_ang lattice_vector_3_ang lattice_angle_alpha_degree
## Min. : 2.960 Min. : 5.698 Min. : 83.74
## 1st Qu.: 5.829 1st Qu.: 9.309 1st Qu.: 90.00
## Median : 6.398 Median :10.097 Median : 90.00
## Mean : 7.082 Mean :12.442 Mean : 90.16
## 3rd Qu.: 9.157 3rd Qu.:14.328 3rd Qu.: 90.01
## Max. :10.249 Max. :25.306 Max. :100.95
## lattice_angle_beta_degree lattice_angle_gamma_degree
## Min. : 82.75 Min. : 29.72
## 1st Qu.: 90.00 1st Qu.: 90.00
## Median : 90.00 Median : 90.00
## Mean : 92.49 Mean : 96.33
## 3rd Qu.: 90.01 3rd Qu.:120.00
## Max. :105.98 Max. :120.05
glimpse(test)
## Observations: 600
## Variables: 12
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
## $ spacegroup <dbl> 33, 33, 167, 12, 12, 33, 167, 33, 3...
## $ number_of_total_atoms <dbl> 80, 80, 30, 80, 80, 40, 30, 80, 80,...
## $ percent_atom_al <dbl> 0.1875, 0.7500, 0.6667, 0.5625, 0.1...
## $ percent_atom_ga <dbl> 0.4688, 0.2500, 0.1667, 0.4375, 0.5...
## $ percent_atom_in <dbl> 0.3438, 0.0000, 0.1667, 0.0000, 0.3...
## $ lattice_vector_1_ang <dbl> 10.5381, 9.8938, 4.9811, 24.3370, 2...
## $ lattice_vector_2_ang <dbl> 9.0141, 8.5014, 4.9808, 6.0091, 6.2...
## $ lattice_vector_3_ang <dbl> 9.6361, 9.1298, 13.4799, 5.7620, 6....
## $ lattice_angle_alpha_degree <dbl> 89.9997, 90.0038, 89.9900, 89.9995,...
## $ lattice_angle_beta_degree <dbl> 90.0003, 90.0023, 90.0109, 103.8581...
## $ lattice_angle_gamma_degree <dbl> 90.0006, 90.0015, 120.0014, 90.0002...
The test data contains 600 observations (i.e. 20% of the 3,000 materials in total). All features appear to have average values similar to those in the train data.
sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0
There are no missing values.
nearZeroVar(train)
## integer(0)
There are no near-zero variance variables.
f1 <- train %>% ggplot(aes(formation_energy_ev_natom)) +
geom_histogram(bins = 100, fill = "red", alpha = 0.5) +
scale_x_log10()
f2 <- train %>% ggplot(aes(bandgap_energy_ev)) +
geom_histogram(bins = 100, fill = "blue", alpha = 0.5) +
scale_x_log10(breaks = c(0.1,0.5,1,5))
f3 <- train %>%
ggplot(aes(number_of_total_atoms, fill = as.factor(number_of_total_atoms))) +
geom_bar(width = 8) +
theme(legend.position = "none")
f4 <- train %>% mutate(spacegroup = as.factor(spacegroup)) %>%
ggplot(aes(spacegroup, fill = spacegroup)) +
geom_bar() +
theme(legend.position = "none")
grid.arrange(f1, f2, f3, f4, ncol = 2)
We observe that:
f5 <- train %>%
ggplot(aes(percent_atom_al)) +
geom_density(color = "blue", fill = "lightblue", alpha = 0.5)
f6 <- train %>%
ggplot(aes(percent_atom_ga)) +
geom_density(color = "green", fill = "orange", alpha = 0.5)
f7 <- train %>%
ggplot(aes(percent_atom_in)) +
geom_density(color = "red", fill = "brown", alpha = 0.5)
f8 <- train %>%
ggplot(aes(lattice_vector_1_ang)) +
geom_histogram(bins = 30, fill = "brown")
f9 <- train %>%
ggplot(aes(lattice_vector_2_ang)) +
geom_histogram(bins = 30, fill = "purple")
f10 <- train %>%
ggplot(aes(lattice_vector_3_ang)) +
geom_histogram(bins = 30, fill = "dark green")
f11 <- train %>%
ggplot(aes(lattice_angle_alpha_degree)) +
geom_histogram(bins = 30, fill = "brown")
f12 <- train %>%
ggplot(aes(lattice_angle_beta_degree)) +
geom_histogram(bins = 30, fill = "purple")
f13 <- train %>%
ggplot(aes(lattice_angle_gamma_degree)) +
geom_histogram(bins = 30, fill = "dark green")
grid.arrange(f5, f6, f7, f8, f9, f10, f11, f12, f13, ncol = 3)
We observe that:
Real crystals are composed of a specific arrangement of the constituent atoms in a unit cell, which is then repeated to form a periodic lattice. The geometry files contain the positions of the atoms in one unit cell, along with the basis vectors used to construct a periodic array of cells. These are inherently three-dimensional structures and contain a rich set of information about bonding and symmetry.
geo <- read.table("train/10/geometry.xyz", skip = 6, as.is = T) %>% select(-V1) %>%
rename(x = V2, y = V3, z = V4, atom = V5)
geo %>%
plot_ly(x = ~x, y = ~y, z = ~z, color = ~atom) %>%
add_markers() %>%
layout(title = "Cell geometry id 10")Above, we have plotted the position of each atom in the unit cell according to its physical position. The unit cells themselves may not occupy a perfect square. When extracting information about these materials, we would like the features to be descriptive of the material, rather than just the unit cell. Thus, we need to package our cell in such a way that periodicity can be handled efficiently. Transforming our structures to a reduced coordinate form, where the positions of the atoms are represented in terms of the basis vectors, rather than real space vectors, will encode all the unit cell information in a square with boundaries from zero to one. Repeating this reduced unit cell in unit steps along any dimension will allow us to encode periodicity without worrying about oblique shapes.
Let the reduced coordinate be denoted by \(r = (x,y,z)^{t}\); then the position vector \(R\) is expressed in terms of the lattice vectors as
\(R = (X,Y,Z)^{t} = xa_{1} + ya_{2} + za_{3} = Ar\)
where \(A = (a_{1}, a_{2}, a_{3})\) is the matrix whose columns are the primitive lattice vectors.
Taking the inverse of the basis vector matrix, we can solve for the reduced coordinates:
\(r = A^{-1}R\).
# Reading primitive lattice vectors
A <- read.table("train/10/geometry.xyz", skip = 3, nrows = 3) %>% select(-V1) %>%
as.matrix() %>% t
B <- MASS::ginv(A)
# Reading spatial position of atoms
Geom <- read.table("train/10/geometry.xyz", skip = 6) %>% select(-V1) %>%
rename(x = V2, y = V3, z = V4, atom = V5)
# Spatial position in reduced coordinates
Geom[,1:3] <- t(B %*% t(Geom[,1:3]))
Geom %>%
plot_ly(x = ~x, y = ~y, z = ~z, color = ~atom) %>%
add_markers() %>%
layout(title = "Cell geometry in reduced coordinates id 10")The reduced coordinates fill a unit square parameterized by the lattice vectors of the crystal. This will allow us to build periodic arrays of the unit cell simply by stacking unit squares, for an arbitrary unit cell/basis vectors.
We engineer the following features in addition to the ones given in the data set.
# Mean value of coordinates of atoms in reduced form
rcmv_train <- matrix(NA, 2400, 15) %>% data.frame()
for(id in 1:2400){
# Reading primitive lattice vectors
A <- read.table(str_c('train/',id,'/geometry.xyz'), skip = 3, nrows = 3) %>% select(-V1) %>% as.matrix %>% t
B <- MASS::ginv(A)
# Reading spatial position of atoms
Geom <- read.table(str_c('train/',id,'/geometry.xyz'), skip = 6) %>% select(-V1) %>%
rename(x = V2, y = V3, z = V4, atom = V5)
# Cell centre
cg <- c(mean(Geom$x), mean(Geom$y), mean(Geom$z))
cg_r <- (B %*% cg) %>% as.matrix() %>% t()
# Centre of each atom type
D_cg <- Geom %>% group_by(atom) %>% summarise(x = mean(x), y = mean(y), z = mean(z))
if("Al" %in% levels(Geom$atom)){
Al_cg_r <- (B %*% (D_cg %>% filter(atom == 'Al') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
}else{
Al_cg_r <- matrix(0,1,3)
}
if("Ga" %in% levels(Geom$atom)){
Ga_cg_r <- (B %*% (D_cg %>% filter(atom == 'Ga') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
}else{
Ga_cg_r <- matrix(0,1,3)
}
if("In" %in% levels(Geom$atom)){
In_cg_r <- (B %*% (D_cg %>% filter(atom == 'In') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
}else{
In_cg_r <- matrix(0,1,3)
}
O_cg_r <- (B %*% (D_cg %>% filter(atom == 'O') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
temprow <- cbind( cg_r, Al_cg_r, Ga_cg_r, In_cg_r, O_cg_r )
rcmv_train[id,] <- temprow
}
colnames(rcmv_train) <- c('cg_r_x', 'cg_r_y', 'cg_r_z',
'Al_cg_r_x', 'Al_cg_r_y', 'Al_cg_r_z',
'Ga_cg_r_x', 'Ga_cg_r_y', 'Ga_cg_r_z',
'In_cg_r_x', 'In_cg_r_y', 'In_cg_r_z',
'O_cg_r_x', 'O_cg_r_y', 'O_cg_r_z')
rcmv_test <- matrix(NA, 600, 15) %>% data.frame()
for(id in 1:600){
# Reading primitive lattice vectors
A <- read.table(str_c('test/',id,'/geometry.xyz'), skip = 3, nrows = 3) %>% select(-V1) %>% as.matrix %>% t
B <- MASS::ginv(A)
# Reading spatial position of atoms
Geom <- read.table(str_c('test/',id,'/geometry.xyz'), skip = 6) %>% select(-V1) %>%
rename(x = V2, y = V3, z = V4, atom = V5)
# Cell centre
cg <- c(mean(Geom$x), mean(Geom$y), mean(Geom$z))
cg_r <- (B %*% cg) %>% as.matrix() %>% t()
# Centre of each atom type
D_cg <- Geom %>% group_by(atom) %>% summarise(x = mean(x), y = mean(y), z = mean(z))
if("Al" %in% levels(Geom$atom)){
Al_cg_r <- (B %*% (D_cg %>% filter(atom == 'Al') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
}else{
Al_cg_r <- matrix(0,1,3)
}
if("Ga" %in% levels(Geom$atom)){
Ga_cg_r <- (B %*% (D_cg %>% filter(atom == 'Ga') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
}else{
Ga_cg_r <- matrix(0,1,3)
}
if("In" %in% levels(Geom$atom)){
In_cg_r <- (B %*% (D_cg %>% filter(atom == 'In') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
}else{
In_cg_r <- matrix(0,1,3)
}
O_cg_r <- (B %*% (D_cg %>% filter(atom == 'O') %>% select(x:z) %>% as.matrix() %>% t())) %>% t()
temprow <- cbind( cg_r, Al_cg_r, Ga_cg_r, In_cg_r, O_cg_r )
rcmv_test[id,] <- temprow
}
colnames(rcmv_test) <- c('cg_r_x', 'cg_r_y', 'cg_r_z',
'Al_cg_r_x', 'Al_cg_r_y', 'Al_cg_r_z',
'Ga_cg_r_x', 'Ga_cg_r_y', 'Ga_cg_r_z',
'In_cg_r_x', 'In_cg_r_y', 'In_cg_r_z',
'O_cg_r_x', 'O_cg_r_y', 'O_cg_r_z')
head(rcmv_train)
# Design matrix
df_train <- train %>%
mutate(spacegroup = as.factor(spacegroup),
alpha_rad = (lattice_angle_alpha_degree * pi/180),
beta_rad = (lattice_angle_beta_degree * pi/180),
gamma_rad = (lattice_angle_gamma_degree * pi/180)) %>%
select(-contains('degree'),-id,-contains('energy'))
df_test <- test %>%
mutate(spacegroup = as.factor(spacegroup),
alpha_rad = (lattice_angle_alpha_degree * pi/180),
beta_rad = (lattice_angle_beta_degree * pi/180),
gamma_rad = (lattice_angle_gamma_degree * pi/180)) %>%
select(-contains('degree'),-id)
# One hot encoding variable spacegroup
dmy <- dummyVars(" ~ .", data = df_train)
df_train <- data.frame(predict(dmy, newdata = df_train))
df_test <- data.frame(predict(dmy, newdata = df_test))
# Forming a complete design matrix for train and test
df_train <- df_train %>% bind_cols(rcmv_train)
df_test <- df_test %>% bind_cols(rcmv_test)
rm(rcmv_train, rcmv_test)
head(df_train)
The number of duplicate rows in the design matrix is
dup_rows <- duplicated(df_train) | duplicated(df_train, fromLast = T)
sum(dup_rows)
## [1] 24
The corresponding rows in the given train set are
train[dup_rows,] %>% arrange(across(all_of(colnames(train %>% select(-id, -contains("energy"))))))
The number of unique target-variable observations for the above duplicate rows is
nrow(unique(train[dup_rows,] %>% select(contains('energy'))))
## [1] 24
Thus, there are duplicate rows in the design matrix, but the corresponding target values are all distinct. We therefore remove those duplicate rows from both the design matrix and train.
df_train <- df_train[!dup_rows,]
train <- train[!dup_rows,]
The dimension of the design matrix after removing the duplicate rows is
dim(df_train)
## [1] 2376 31
There are 31 features in total to work with.
We have already dummified the variable spacegroup; let’s check the relevance of each of its levels by applying the Wilcoxon rank-sum test followed by the Benjamini-Hochberg p-value correction. The null hypothesis here is that the target distributions for feature values 1 and 0 differ by a location shift of zero. We apply this procedure to both formation energy and band gap energy.
For formation energy, we observe that all binary features derived from variable spacegroup pass the test at the 5% significance level.
# Wilcoxon rank sum test on binary features for formation energy
wrst_fe <- apply(df_train %>% select(contains("spacegroup")), 2,
function(x, y){
wxn <- wilcox.test(y ~ x, conf.int = T)[c("statistic", "p.value", "estimate")]
unlist(wxn)},
y = train$formation_energy_ev_natom)
# The results are a matrix with predictors in columns. We reverse this
wrst_fe <- as.data.frame(t(wrst_fe))
names(wrst_fe) <- c("statistic.W", "p.value", "estimate")
wrst_fe$p.adj <- p.adjust(wrst_fe$p.value , method = "fdr")
wrst_fe$feature = rownames(wrst_fe)
(wrst_fe <- wrst_fe %>% mutate(sig=ifelse(wrst_fe$p.adj<0.05, "p.adj<0.05", "Not Sig.")))
For band gap energy, we observe that 4 out of 6 binary features derived from variable spacegroup pass the test at the 5% significance level.
# Wilcoxon rank sum test on binary features for band gap energy
wrst_be <- apply(df_train %>% select(contains("spacegroup")), 2,
function(x, y){
wxn <- wilcox.test(y ~ x, conf.int = T)[c("statistic", "p.value", "estimate")]
unlist(wxn)},
y = train$bandgap_energy_ev)
# The results are a matrix with predictors in columns. We reverse this
wrst_be <- as.data.frame(t(wrst_be))
names(wrst_be) <- c("statistic.W", "p.value", "estimate")
wrst_be$p.adj <- p.adjust(wrst_be$p.value , method = "fdr")
wrst_be$feature = rownames(wrst_be)
(wrst_be <- wrst_be %>% mutate(sig=ifelse(wrst_be$p.adj<0.05, "p.adj<0.05", "Not Sig.")))
We compute the Pearson correlation coefficient, the Spearman correlation coefficient, the \(R^2\) statistic of a loess smoother fit, and the RReliefFexpRank statistic to measure the strength of linear or non-linear association between the continuous features and the targets. But before that, let’s check the correlations among the continuous features themselves.
corrplot(cor(df_train %>% select(-contains('spacegroup')), method = "spearman"), order = "hclust")
We observe a strong correlation between cg_r_y and O_cg_r_y. The rest of the correlations appear to be below 0.75 in absolute value, so univariate feature selection can be used. If two predictors were highly correlated with both the response and each other, the univariate approach would identify both as important, leading to the selection of redundant features.
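As a quick cross-check of that reading, caret’s findCorrelation can list the predictors whose absolute pairwise correlation exceeds a chosen cutoff; a minimal sketch on the same Spearman correlation matrix:
# Flag continuous predictors with absolute pairwise (Spearman) correlation above 0.75
cont_corr <- cor(df_train %>% select(-contains("spacegroup")), method = "spearman")
findCorrelation(cont_corr, cutoff = 0.75, names = TRUE)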
We observe that:
# Pearson correlation
pearsonCorr_fe <- apply(df_train %>% select(-contains("spacegroup")),
2, FUN = function(x, y) cor(x, y),
y = train$formation_energy_ev_natom)
# Spearman correlation
spearmanCorr_fe <- apply(df_train %>% select(-contains("spacegroup")),
2, FUN = function(x, y) cor(x, y, method = "spearman"),
y = train$formation_energy_ev_natom)
# Loess scores
loessScores_fe <- filterVarImp(x = df_train %>% select(-contains("spacegroup")),
y = train$formation_energy_ev_natom, nonpara = TRUE)
names(loessScores_fe) = "loessScores"
# Relief scores
reliefScores_fe <- attrEval(fe ~., data = df_train %>% select(-contains("spacegroup")) %>%
mutate(fe = train$formation_energy_ev_natom), estimator = "RReliefFexpRank")
# Scatter plot matrix of variable relevance scores for formation energy
scores_fe <- data.frame(abs_pearsonCorr = abs(pearsonCorr_fe), abs_spearmanCorr = abs(spearmanCorr_fe),
loessScores = loessScores_fe, reliefScores = reliefScores_fe)
scores_fe_plot <- scores_fe %>% mutate(feature = rownames(scores_fe)) %>%
ggpairs(columns = 1:4, upper=list(continuous = wrap('cor', method = "spearman")),
diag = list(continuous = wrap("densityDiag", alpha = .5, color = 'red', fill = 'brown')),
lower = list(continuous = wrap("points", size = 3, alpha = .5, color = 'brown'),
mapping = aes(text = (paste("feature:", feature)))),
switch = "both",
title = "Scatter plot matrix and density of variable relevance scores for formation energy")
ggplotly(scores_fe_plot, tooltip = c("text", "x", "y"))
We observe that:
# Pearson correlation
pearsonCorr_be <- apply(df_train %>% select(-contains("spacegroup")),
2, FUN = function(x, y) cor(x, y), y = train$bandgap_energy_ev)
# Spearman correlation
spearmanCorr_be <- apply(df_train %>% select(-contains("spacegroup")),
2, FUN = function(x, y) cor(x, y, method = "spearman"), y = train$bandgap_energy_ev)
# Loess scores
loessScores_be <- filterVarImp(x = df_train %>% select(-contains("spacegroup")),
y = train$bandgap_energy_ev, nonpara = TRUE)
names(loessScores_be) = "loessScores"
# Relief scores
reliefScores_be <- attrEval(be ~., data = df_train %>% select(-contains("spacegroup")) %>%
mutate(be = train$bandgap_energy_ev), estimator = "RReliefFexpRank")
# Scatter plot matrix of variable relevance scores for band gap energy
scores_be <- data.frame(abs_pearsonCorr = abs(pearsonCorr_be), abs_spearmanCorr = abs(spearmanCorr_be),
loessScores = loessScores_be, reliefScores = reliefScores_be)
scores_be_plot <- scores_be %>% mutate(feature = rownames(scores_be)) %>%
ggpairs(columns = 1:4, upper=list(continuous = wrap('cor', method = "spearman")),
diag = list(continuous = wrap("densityDiag", alpha = .5, color = 'blue', fill = 'purple')),
lower = list(continuous = wrap("points", size = 3, alpha = .5, color = 'purple'),
mapping = aes(text = (paste("feature:", feature)))),
switch = "both",
title = "Scatter plot matrix and density of variable relevance scores for bandgap energy")
ggplotly(scores_be_plot, tooltip = c("text", "x", "y"))
We select features for model building based on the importance scores found in the previous section. We keep only those features whose scores exceed the defined threshold for each score type. Each threshold is taken as a fraction of the range of the corresponding score type. Since the ranges of the importance scores differ between formation energy and band gap energy, so does the fraction used.
The features used for modeling formation energy are
# Relevant predictors to formation energy
wilcox_pred_fe <- wrst_fe %>% filter(sig == "p.adj<0.05") %>% .$feature
pearson_pred_fe <- scores_fe %>% mutate(feature = rownames(scores_fe)) %>%
filter(abs_pearsonCorr > (max(abs_pearsonCorr) - min(abs_pearsonCorr))*1/5) %>%
.$feature
spearman_pred_fe <- scores_fe %>% mutate(feature = rownames(scores_fe)) %>%
filter(abs_spearmanCorr > (max(abs_spearmanCorr) - min(abs_spearmanCorr))*1/5) %>%
.$feature
loess_pred_fe <- scores_fe %>% mutate(feature = rownames(scores_fe)) %>%
filter(loessScores > (max(loessScores) - min(loessScores))*1/5) %>%
.$feature
relief_pred_fe <- scores_fe %>% mutate(feature = rownames(scores_fe)) %>%
filter(reliefScores > (max(reliefScores) - min(reliefScores))*1/5) %>%
.$feature
pred_fe <- unique(c(wilcox_pred_fe, pearson_pred_fe, spearman_pred_fe, loess_pred_fe, relief_pred_fe))
pred_fe
## [1] "spacegroup.12" "spacegroup.33" "spacegroup.167"
## [4] "spacegroup.194" "spacegroup.206" "spacegroup.227"
## [7] "percent_atom_al" "percent_atom_ga" "percent_atom_in"
## [10] "lattice_vector_1_ang" "lattice_vector_2_ang" "lattice_vector_3_ang"
## [13] "alpha_rad" "beta_rad" "cg_r_y"
## [16] "cg_r_z" "Al_cg_r_x" "Al_cg_r_z"
## [19] "Ga_cg_r_x" "Ga_cg_r_y" "Ga_cg_r_z"
## [22] "In_cg_r_x" "In_cg_r_y" "In_cg_r_z"
## [25] "O_cg_r_y"
The features used for modeling band gap energy are
# Relevant predictors to band gap energy
wilcox_pred_be <- wrst_be %>% filter(sig == "p.adj<0.05") %>% .$feature
pearson_pred_be <- scores_be %>% mutate(feature = rownames(scores_be)) %>%
filter(abs_pearsonCorr > (max(abs_pearsonCorr) - min(abs_pearsonCorr))*1/20) %>%
.$feature
spearman_pred_be <- scores_be %>% mutate(feature = rownames(scores_be)) %>%
filter(abs_spearmanCorr > (max(abs_spearmanCorr) - min(abs_spearmanCorr))*1/20) %>%
.$feature
loess_pred_be <- scores_be %>% mutate(feature = rownames(scores_be)) %>%
filter(loessScores > (max(loessScores) - min(loessScores))*1/20) %>%
.$feature
relief_pred_be <- scores_be %>% mutate(feature = rownames(scores_be)) %>%
filter(reliefScores > (max(reliefScores) - min(reliefScores))*1/20) %>%
.$feature
pred_be <- unique(c(wilcox_pred_be, pearson_pred_be, spearman_pred_be, loess_pred_be, relief_pred_be))
pred_be
## [1] "spacegroup.33" "spacegroup.167"
## [3] "spacegroup.194" "spacegroup.227"
## [5] "number_of_total_atoms" "percent_atom_al"
## [7] "percent_atom_in" "lattice_vector_1_ang"
## [9] "lattice_vector_2_ang" "lattice_vector_3_ang"
## [11] "alpha_rad" "beta_rad"
## [13] "gamma_rad" "cg_r_x"
## [15] "cg_r_y" "cg_r_z"
## [17] "Al_cg_r_x" "Al_cg_r_y"
## [19] "Al_cg_r_z" "Ga_cg_r_z"
## [21] "In_cg_r_x" "In_cg_r_y"
## [23] "In_cg_r_z" "O_cg_r_x"
## [25] "O_cg_r_y" "O_cg_r_z"
## [27] "percent_atom_ga" "Ga_cg_r_x"
Caret’s train function is used to build the models. The models were tuned using root-mean-square logarithmic error (RMSLE) as the performance metric, evaluated on the cross-validation hold-out sets for the different hyperparameter settings. A leave-group-out cross-validation (LGOCV) data-splitting strategy is used, with 20 percent of the data held out in each resample.
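Concretely, for predicted values \(p_{i}\) and observed values \(o_{i}\) over \(n\) hold-out samples, \(RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_{i}+1)-\log(o_{i}+1)\right)^{2}}\). This is exactly what the custom summary function below computes, taking \(|p_{i}|\) to guard against occasional negative predictions.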
# Function to calculate RMSLE to estimate model performance
custom_summary = function(data, lev = NULL, model = NULL) {
p <- log(abs(data$pred) + 1)
o <- log(data$obs + 1)
out <- sqrt(mean((p-o)^2))
names(out) = "RMSLE"
c(out, defaultSummary(data, lev = NULL, model = NULL))
}
set.seed(7654)
nptcFE_avNNet <- train(fe ~ ., df_train %>% select(pred_fe) %>%
mutate(fe = train$formation_energy_ev_natom),
preProcess = c("center", "scale"),
method = "avNNet",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(decay = c(0, 0.005, 0.01),
size = 3:12,
bag = FALSE),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary),
linout = TRUE,
trace = FALSE,
MaxNWts = 12 * (ncol(df_train) + 1) + 12 + 1,
maxit = 500)
nptcFE_avNNet
## Model Averaged Neural Network
##
## 2367 samples
## 25 predictor
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## decay size RMSLE RMSE Rsquared MAE
## 0.000 3 0.03149024 0.03908853 0.8581717 0.02774245
## 0.000 4 0.03021502 0.03778860 0.8672894 0.02565481
## 0.000 5 0.02908929 0.03651229 0.8760976 0.02447570
## 0.000 6 0.02876068 0.03606204 0.8792777 0.02404277
## 0.000 7 0.02889554 0.03628229 0.8779912 0.02394308
## 0.000 8 0.02859275 0.03593098 0.8803242 0.02373446
## 0.000 9 0.02887669 0.03626575 0.8784617 0.02395037
## 0.000 10 0.02890050 0.03631248 0.8783470 0.02389547
## 0.000 11 0.02878974 0.03612602 0.8793467 0.02397479
## 0.000 12 0.02882691 0.03619273 0.8787039 0.02396833
## 0.005 3 0.03108146 0.03875938 0.8601698 0.02657043
## 0.005 4 0.02973270 0.03722034 0.8711975 0.02504237
## 0.005 5 0.02918607 0.03654355 0.8758055 0.02445884
## 0.005 6 0.02894849 0.03625586 0.8780864 0.02415901
## 0.005 7 0.02867279 0.03586829 0.8805183 0.02379670
## 0.005 8 0.02851321 0.03573462 0.8816817 0.02374711
## 0.005 9 0.02864061 0.03588883 0.8804884 0.02368777
## 0.005 10 0.02882829 0.03614085 0.8792661 0.02370830
## 0.005 11 0.02913308 0.03648491 0.8772512 0.02389598
## 0.005 12 0.02907267 0.03644188 0.8777136 0.02391190
## 0.010 3 0.03104301 0.03873024 0.8604753 0.02652098
## 0.010 4 0.02984401 0.03734823 0.8703895 0.02514102
## 0.010 5 0.02933141 0.03671637 0.8747831 0.02444757
## 0.010 6 0.02890674 0.03619782 0.8786137 0.02404539
## 0.010 7 0.02886190 0.03620084 0.8786230 0.02400355
## 0.010 8 0.02893008 0.03619588 0.8785412 0.02401678
## 0.010 9 0.02881043 0.03613592 0.8791302 0.02386310
## 0.010 10 0.02862595 0.03589427 0.8806830 0.02368999
## 0.010 11 0.02906606 0.03644242 0.8773971 0.02395558
## 0.010 12 0.02863584 0.03586894 0.8811141 0.02355162
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 8, decay = 0.005 and bag
## = FALSE.
set.seed(7654)
nptcFE_cubist <- train(fe ~ ., df_train %>% select(pred_fe) %>%
mutate(fe = train$formation_energy_ev_natom),
method = "cubist",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(committees = seq(30,60,5),
neighbors = c(0,3,5,7,9)),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcFE_cubist
## Cubist
##
## 2367 samples
## 25 predictor
##
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSLE RMSE Rsquared MAE
## 30 0 0.02957016 0.03722273 0.8708993 0.02466582
## 30 3 0.03114621 0.03940150 0.8569248 0.02534928
## 30 5 0.03043428 0.03848389 0.8628360 0.02452768
## 30 7 0.03005676 0.03799134 0.8660486 0.02422966
## 30 9 0.02979644 0.03763616 0.8683386 0.02411804
## 35 0 0.02956325 0.03720809 0.8710880 0.02466517
## 35 3 0.03113407 0.03938151 0.8570839 0.02534030
## 35 5 0.03042705 0.03846945 0.8629697 0.02452864
## 35 7 0.03005341 0.03798417 0.8661439 0.02422727
## 35 9 0.02979395 0.03763137 0.8684284 0.02412064
## 40 0 0.02949262 0.03712540 0.8716577 0.02459935
## 40 3 0.03107228 0.03930603 0.8576443 0.02528890
## 40 5 0.03036546 0.03839433 0.8635063 0.02447924
## 40 7 0.02998942 0.03790616 0.8666987 0.02417347
## 40 9 0.02972829 0.03755108 0.8689934 0.02406138
## 45 0 0.02944117 0.03706004 0.8720804 0.02455587
## 45 3 0.03104112 0.03926465 0.8578999 0.02523988
## 45 5 0.03033212 0.03835096 0.8637751 0.02443984
## 45 7 0.02995686 0.03786362 0.8669699 0.02414033
## 45 9 0.02969639 0.03750951 0.8692582 0.02402416
## 50 0 0.02938374 0.03698593 0.8726146 0.02451478
## 50 3 0.03100840 0.03922368 0.8581845 0.02521334
## 50 5 0.03029597 0.03830497 0.8640927 0.02440162
## 50 7 0.02991557 0.03781181 0.8673287 0.02411004
## 50 9 0.02965429 0.03745657 0.8696244 0.02399545
## 55 0 0.02940731 0.03701933 0.8723490 0.02452062
## 55 3 0.03103499 0.03926094 0.8578705 0.02523118
## 55 5 0.03031443 0.03833288 0.8638557 0.02439992
## 55 7 0.02993567 0.03784054 0.8670887 0.02411554
## 55 9 0.02967713 0.03748850 0.8693634 0.02400568
## 60 0 0.02941467 0.03702106 0.8723143 0.02450930
## 60 3 0.03104795 0.03927372 0.8577687 0.02524475
## 60 5 0.03032252 0.03833905 0.8637922 0.02440879
## 60 7 0.02994389 0.03784688 0.8670249 0.02411989
## 60 9 0.02968367 0.03749217 0.8693191 0.02400103
##
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 0.
set.seed(7654)
nptcFE_enet <- train(fe ~ ., df_train %>% select(pred_fe) %>%
mutate(fe = train$formation_energy_ev_natom),
preProcess = c("center", "scale"),
method = "enet",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(lambda = c(0, 10 ^ seq(-1, -4, length = 5 - 1)),
fraction = seq(.05, 1, length = 5)),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcFE_enet
## Elasticnet
##
## 2367 samples
## 25 predictor
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSLE RMSE Rsquared MAE
## 0e+00 0.0500 0.05074030 0.06107614 0.6531706 0.04675650
## 0e+00 0.2875 0.05063567 0.06095741 0.6544851 0.04668791
## 0e+00 0.5250 0.05057467 0.06089000 0.6552470 0.04665545
## 0e+00 0.7625 0.05053349 0.06084695 0.6557443 0.04663992
## 0e+00 1.0000 0.05051821 0.06083412 0.6559095 0.04664493
## 1e-04 0.0500 0.07473336 0.08983755 0.4993005 0.07365245
## 1e-04 0.2875 0.05180106 0.06224362 0.6413947 0.04775483
## 1e-04 0.5250 0.05101060 0.06132791 0.6500113 0.04701351
## 1e-04 0.7625 0.05082492 0.06114089 0.6522883 0.04683486
## 1e-04 1.0000 0.05077312 0.06110925 0.6527852 0.04678099
## 1e-03 0.0500 0.07677097 0.09226973 0.4697795 0.07564147
## 1e-03 0.2875 0.05308787 0.06385717 0.6310419 0.04917971
## 1e-03 0.5250 0.05113181 0.06146224 0.6483942 0.04713483
## 1e-03 0.7625 0.05091773 0.06122700 0.6512249 0.04692493
## 1e-03 1.0000 0.05082287 0.06113894 0.6523290 0.04682662
## 1e-02 0.0500 0.08087807 0.09717862 0.3498557 0.07956296
## 1e-02 0.2875 0.06192762 0.07450211 0.5728810 0.05954148
## 1e-02 0.5250 0.05326690 0.06408331 0.6285474 0.04932644
## 1e-02 0.7625 0.05135807 0.06168585 0.6459813 0.04731885
## 1e-02 1.0000 0.05109311 0.06139895 0.6492548 0.04710838
## 1e-01 0.0500 0.08150550 0.09793757 0.3333873 0.08015292
## 1e-01 0.2875 0.06416070 0.07718144 0.5596165 0.06226918
## 1e-01 0.5250 0.05486502 0.06600259 0.6163365 0.05128783
## 1e-01 0.7625 0.05187597 0.06231058 0.6393749 0.04775096
## 1e-01 1.0000 0.05146032 0.06174992 0.6466953 0.04740162
##
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 1 and lambda = 0.
set.seed(7654)
nptcFE_kknn <- train(fe ~ ., df_train %>% select(pred_fe) %>%
mutate(fe = train$formation_energy_ev_natom),
preProcess = c("center", "scale"),
method = "kknn",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(kmax = c(6, 8, 10, 12, 14),
distance = 1,
kernel = "epanechnikov"),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcFE_kknn
## k-Nearest Neighbors
##
## 2367 samples
## 25 predictor
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## kmax RMSLE RMSE Rsquared MAE
## 6 0.03591260 0.04499291 0.8137949 0.02977108
## 8 0.03532952 0.04423627 0.8191747 0.02929978
## 10 0.03508564 0.04388869 0.8216095 0.02915689
## 12 0.03493763 0.04365258 0.8232661 0.02911865
## 14 0.03488801 0.04354089 0.8240192 0.02917551
##
## Tuning parameter 'distance' was held constant at a value of 1
##
## Tuning parameter 'kernel' was held constant at a value of epanechnikov
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 14, distance = 1
## and kernel = epanechnikov.
set.seed(7654)
nptcFE_pls <- train(fe ~ ., df_train %>% select(pred_fe) %>%
mutate(fe = train$formation_energy_ev_natom),
preProcess = c("center", "scale"),
method = "pls",
metric = "RMSLE",
maximize = FALSE,
tuneLength = ncol(df_train)-1,
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcFE_pls
## Partial Least Squares
##
## 2367 samples
## 25 predictor
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## ncomp RMSLE RMSE Rsquared MAE
## 1 0.05656785 0.06799984 0.5695875 0.05220535
## 2 0.05335146 0.06408669 0.6180657 0.04940284
## 3 0.05233598 0.06290488 0.6318702 0.04801181
## 4 0.05152707 0.06189150 0.6435939 0.04718767
## 5 0.05138643 0.06172301 0.6454721 0.04717133
## 6 0.05124354 0.06152417 0.6478052 0.04753760
## 7 0.05127116 0.06157420 0.6471736 0.04734778
## 8 0.05126262 0.06158273 0.6471104 0.04735000
## 9 0.05129306 0.06163193 0.6465628 0.04733396
## 10 0.05129680 0.06164133 0.6464777 0.04731726
## 11 0.05127643 0.06161157 0.6467491 0.04730038
## 12 0.05130110 0.06162636 0.6465920 0.04736736
## 13 0.05119999 0.06157249 0.6471953 0.04723168
## 14 0.05122909 0.06158367 0.6471098 0.04721432
## 15 0.05116460 0.06153453 0.6476940 0.04714967
## 16 0.05110616 0.06147068 0.6484501 0.04710513
## 17 0.05106968 0.06142830 0.6489767 0.04709176
## 18 0.05088958 0.06123781 0.6511868 0.04695528
## 19 0.05082478 0.06118200 0.6519420 0.04683705
## 20 0.05082021 0.06117320 0.6520866 0.04680960
## 21 0.05082549 0.06117803 0.6520384 0.04681449
## 22 0.05077328 0.06111275 0.6527660 0.04678277
## 23 0.05077074 0.06111136 0.6527832 0.04677736
## 24 0.05051821 0.06083412 0.6559095 0.04664493
##
## RMSLE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 24.
set.seed(7654)
nptcFE_svmPoly <- train(fe ~ ., df_train %>% select(pred_fe) %>%
mutate(fe = train$formation_energy_ev_natom),
preProcess = c("center", "scale"),
method = "svmPoly",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(degree = c(1,2),
scale = c(.01, .05),
C = 2 ^((1:14) - 3)),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcFE_svmPoly
## Support Vector Machines with Polynomial Kernel
##
## 2367 samples
## 25 predictor
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## degree scale C RMSLE RMSE Rsquared MAE
## 1 0.01 0.25 0.05303163 0.06372709 0.6238714 0.04635678
## 1 0.01 0.50 0.05288598 0.06352739 0.6267144 0.04601864
## 1 0.01 1.00 0.05280635 0.06343812 0.6285939 0.04586983
## 1 0.01 2.00 0.05279848 0.06344259 0.6288899 0.04582669
## 1 0.01 4.00 0.05278871 0.06344653 0.6291808 0.04578770
## 1 0.01 8.00 0.05271643 0.06337637 0.6302997 0.04571689
## 1 0.01 16.00 0.05264935 0.06330237 0.6316051 0.04567156
## 1 0.01 32.00 0.05255486 0.06319271 0.6332019 0.04563178
## 1 0.01 64.00 0.05250365 0.06313705 0.6338591 0.04561002
## 1 0.01 128.00 0.05242958 0.06305212 0.6348403 0.04555832
## 1 0.01 256.00 0.05239535 0.06301471 0.6352561 0.04554813
## 1 0.01 512.00 0.05238151 0.06299599 0.6354864 0.04553870
## 1 0.01 1024.00 0.05237346 0.06298743 0.6356515 0.04553400
## 1 0.01 2048.00 0.05238252 0.06300110 0.6355282 0.04553735
## 1 0.05 0.25 0.05280266 0.06343668 0.6287438 0.04585723
## 1 0.05 0.50 0.05279025 0.06343735 0.6290528 0.04580967
## 1 0.05 1.00 0.05274856 0.06340446 0.6296441 0.04576344
## 1 0.05 2.00 0.05270107 0.06336280 0.6305994 0.04570423
## 1 0.05 4.00 0.05259092 0.06323326 0.6324550 0.04564298
## 1 0.05 8.00 0.05253242 0.06316834 0.6334921 0.04562092
## 1 0.05 16.00 0.05249231 0.06312381 0.6340780 0.04559687
## 1 0.05 32.00 0.05240220 0.06302081 0.6351793 0.04554804
## 1 0.05 64.00 0.05238480 0.06300187 0.6353761 0.04554546
## 1 0.05 128.00 0.05236934 0.06298267 0.6356564 0.04553404
## 1 0.05 256.00 0.05237475 0.06298914 0.6356178 0.04553684
## 1 0.05 512.00 0.05237901 0.06299464 0.6355719 0.04553875
## 1 0.05 1024.00 0.05237285 0.06298716 0.6356828 0.04553419
## 1 0.05 2048.00 0.05241944 0.06304129 0.6352567 0.04552324
## 2 0.01 0.25 0.04440479 0.05375998 0.7338174 0.03910168
## 2 0.01 0.50 0.04047780 0.04931863 0.7762195 0.03582314
## 2 0.01 1.00 0.03669306 0.04511253 0.8128260 0.03254008
## 2 0.01 2.00 0.03398348 0.04210019 0.8364578 0.02977119
## 2 0.01 4.00 0.03251737 0.04045806 0.8483744 0.02803416
## 2 0.01 8.00 0.03191646 0.03976558 0.8533835 0.02719694
## 2 0.01 16.00 0.03177420 0.03960129 0.8546791 0.02690888
## 2 0.01 32.00 0.03182931 0.03967116 0.8542443 0.02685753
## 2 0.01 64.00 0.03179462 0.03961379 0.8549181 0.02679009
## 2 0.01 128.00 0.03168266 0.03944079 0.8563496 0.02672254
## 2 0.01 256.00 0.03146866 0.03912237 0.8587685 0.02659977
## 2 0.01 512.00 0.03134772 0.03892149 0.8603416 0.02660528
## 2 0.01 1024.00 0.03128940 0.03882059 0.8611145 0.02662766
## 2 0.01 2048.00 0.03152237 0.03912191 0.8591974 0.02676966
## 2 0.05 0.25 0.03209372 0.03998135 0.8519491 0.02744699
## 2 0.05 0.50 0.03181429 0.03964124 0.8542995 0.02698939
## 2 0.05 1.00 0.03182882 0.03966594 0.8542439 0.02690531
## 2 0.05 2.00 0.03183607 0.03966390 0.8544557 0.02681500
## 2 0.05 4.00 0.03180999 0.03960732 0.8551200 0.02680538
## 2 0.05 8.00 0.03152910 0.03921542 0.8580325 0.02663160
## 2 0.05 16.00 0.03135023 0.03893678 0.8601425 0.02656074
## 2 0.05 32.00 0.03129999 0.03883898 0.8610172 0.02664799
## 2 0.05 64.00 0.03140662 0.03896018 0.8602210 0.02670596
## 2 0.05 128.00 0.03166015 0.03930672 0.8580440 0.02684595
## 2 0.05 256.00 0.03186476 0.03963626 0.8558823 0.02692723
## 2 0.05 512.00 0.03238235 0.04050478 0.8507114 0.02721318
## 2 0.05 1024.00 0.03405211 0.04321461 0.8332330 0.02855112
## 2 0.05 2048.00 0.03525620 0.04670598 0.8114283 0.02858585
##
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 2, scale = 0.01 and C
## = 1024.
# Estimate parameter sigma from function sigest
sigfe <- kernlab::sigest(fe ~ .,data = (df_train %>% select(pred_fe) %>%
mutate(fe = train$formation_energy_ev_natom)), frac=1)
set.seed(7654)
nptcFE_svmR <- train(fe ~ ., df_train %>% select(pred_fe) %>%
mutate(fe = train$formation_energy_ev_natom),
preProcess = c("center", "scale"),
method = "svmRadial",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(sigma = sigfe[2],
C = 2 ^((1:10) - 3)),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcFE_svmR
## Support Vector Machines with Radial Basis Function Kernel
##
## 2367 samples
## 25 predictor
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## C RMSLE RMSE Rsquared MAE
## 0.25 0.03564336 0.04413309 0.8230643 0.03141073
## 0.50 0.03317713 0.04138695 0.8423178 0.02874373
## 1.00 0.03165542 0.03964695 0.8543552 0.02685582
## 2.00 0.03061761 0.03842070 0.8628775 0.02553323
## 4.00 0.03011976 0.03783849 0.8669832 0.02482815
## 8.00 0.02980051 0.03744094 0.8698271 0.02443938
## 16.00 0.02980813 0.03744774 0.8699221 0.02446343
## 32.00 0.03030306 0.03801979 0.8663222 0.02484204
## 64.00 0.03132871 0.03926209 0.8584334 0.02549841
## 128.00 0.03296946 0.04125547 0.8455543 0.02653585
##
## Tuning parameter 'sigma' was held constant at a value of 0.02028928
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02028928 and C = 8.
set.seed(7654)
nptcFE_xgbt <- train(fe ~ ., df_train %>% select(pred_fe) %>%
mutate(fe = train$formation_energy_ev_natom),
method = "xgbTree",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(nrounds = seq(50,2000,50),
max_depth = 5,
eta = 0.01,
gamma = 0,
colsample_bytree = 0.5,
min_child_weight = 1,
subsample = 1),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcFE_xgbt
## eXtreme Gradient Boosting
##
## 2367 samples
## 25 predictor
##
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## nrounds RMSLE RMSE Rsquared MAE
## 50 0.16493523 0.20389366 0.8235249 0.19056779
## 100 0.10799095 0.12900836 0.8400868 0.11803869
## 150 0.07250731 0.08521288 0.8511644 0.07612876
## 200 0.05157697 0.06073893 0.8585337 0.05248216
## 250 0.04015662 0.04805988 0.8636729 0.03948127
## 300 0.03440070 0.04197719 0.8673314 0.03240565
## 350 0.03167588 0.03921555 0.8696018 0.02859465
## 400 0.03041817 0.03797479 0.8711436 0.02661904
## 450 0.02981748 0.03738509 0.8723213 0.02566341
## 500 0.02952313 0.03709966 0.8730306 0.02520750
## 550 0.02936829 0.03695084 0.8735082 0.02497408
## 600 0.02926580 0.03684738 0.8739589 0.02483718
## 650 0.02920784 0.03678779 0.8742277 0.02475737
## 700 0.02916442 0.03674320 0.8744446 0.02469493
## 750 0.02912773 0.03670453 0.8746467 0.02464396
## 800 0.02909705 0.03667171 0.8748286 0.02460275
## 850 0.02907624 0.03665039 0.8749392 0.02457183
## 900 0.02905850 0.03663322 0.8750385 0.02453916
## 950 0.02904627 0.03662020 0.8751119 0.02451404
## 1000 0.02903989 0.03661488 0.8751360 0.02449278
## 1050 0.02903618 0.03661218 0.8751464 0.02447313
## 1100 0.02902356 0.03659723 0.8752331 0.02445197
## 1150 0.02901742 0.03659150 0.8752602 0.02443802
## 1200 0.02901106 0.03658604 0.8752869 0.02441893
## 1250 0.02900900 0.03658566 0.8752880 0.02440859
## 1300 0.02900679 0.03658396 0.8752935 0.02440077
## 1350 0.02900327 0.03657953 0.8753206 0.02439043
## 1400 0.02900090 0.03657666 0.8753329 0.02438279
## 1450 0.02900218 0.03657962 0.8753115 0.02437744
## 1500 0.02899915 0.03657646 0.8753331 0.02436831
## 1550 0.02900270 0.03658163 0.8752988 0.02436426
## 1600 0.02900570 0.03658624 0.8752668 0.02435865
## 1650 0.02901009 0.03659254 0.8752247 0.02435434
## 1700 0.02901894 0.03660561 0.8751400 0.02435544
## 1750 0.02901514 0.03660180 0.8751683 0.02434952
## 1800 0.02901408 0.03660074 0.8751776 0.02434778
## 1850 0.02901359 0.03660094 0.8751763 0.02434404
## 1900 0.02901844 0.03660874 0.8751244 0.02434412
## 1950 0.02902057 0.03661145 0.8751042 0.02434245
## 2000 0.02902437 0.03661773 0.8750637 0.02434155
##
## Tuning parameter 'max_depth' was held constant at a value of 5
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.5
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 1
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 1500, max_depth =
## 5, eta = 0.01, gamma = 0, colsample_bytree = 0.5, min_child_weight =
## 1 and subsample = 1.
set.seed(7654)
nptcBE_avNNet <- train(be ~ ., df_train %>% select(pred_be) %>%
mutate(be = train$bandgap_energy_ev),
preProcess = c("center", "scale"),
method = "avNNet",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(decay = c(0, 0.005, 0.01),
size = 3:12,
bag = FALSE),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary),
linout = TRUE,
trace = FALSE,
MaxNWts = 12 * (ncol(df_train) + 1) + 12 + 1,
maxit = 500)
nptcBE_avNNet
## Model Averaged Neural Network
##
## 2367 samples
## 28 predictor
##
## Pre-processing: centered (28), scaled (28)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## decay size RMSLE RMSE Rsquared MAE
## 0.000 3 0.08404025 0.2037214 0.9588276 0.1335227
## 0.000 4 0.08261769 0.2009471 0.9599725 0.1282769
## 0.000 5 0.08371242 0.2039432 0.9586260 0.1265017
## 0.000 6 0.08273097 0.2038222 0.9586080 0.1254884
## 0.000 7 0.08381438 0.2027824 0.9591426 0.1267370
## 0.000 8 0.08762308 0.2173988 0.9516979 0.1281144
## 0.000 9 0.08535635 0.2109302 0.9555472 0.1293183
## 0.000 10 0.08538280 0.2098329 0.9561475 0.1294692
## 0.000 11 0.08649547 0.2097370 0.9562834 0.1293949
## 0.000 12 0.08516309 0.2104170 0.9558893 0.1310796
## 0.005 3 0.08200315 0.1983676 0.9609411 0.1286399
## 0.005 4 0.08244157 0.1989804 0.9606857 0.1269683
## 0.005 5 0.08195756 0.1992827 0.9605967 0.1256481
## 0.005 6 0.08191602 0.1984433 0.9609101 0.1243202
## 0.005 7 0.08413509 0.2018799 0.9595533 0.1257244
## 0.005 8 0.08362498 0.2032402 0.9589942 0.1266037
## 0.005 9 0.08466929 0.2047813 0.9583684 0.1277540
## 0.005 10 0.08418874 0.2044626 0.9585109 0.1271724
## 0.005 11 0.08396103 0.2034121 0.9589198 0.1272324
## 0.005 12 0.08559540 0.2088600 0.9566854 0.1285773
## 0.010 3 0.08248061 0.1995188 0.9605467 0.1291922
## 0.010 4 0.08203812 0.1998114 0.9603224 0.1269501
## 0.010 5 0.08185753 0.2000541 0.9603152 0.1262980
## 0.010 6 0.08219443 0.1984496 0.9609156 0.1242304
## 0.010 7 0.08319571 0.2017529 0.9595479 0.1256168
## 0.010 8 0.08476149 0.2057500 0.9578969 0.1270994
## 0.010 9 0.08415407 0.2025704 0.9592362 0.1261027
## 0.010 10 0.08413237 0.2041983 0.9586509 0.1272307
## 0.010 11 0.08617947 0.2079835 0.9570539 0.1291672
## 0.010 12 0.08548313 0.2073811 0.9573169 0.1284922
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.01 and bag
## = FALSE.
set.seed(7654)
nptcBE_cubist <- train(be ~ ., df_train %>% select(pred_be) %>%
mutate(be = train$bandgap_energy_ev),
method = "cubist",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(committees = seq(30,60,5),
neighbors = c(0,3,5,7,9)),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcBE_cubist
## Cubist
##
## 2367 samples
## 28 predictor
##
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSLE RMSE Rsquared MAE
## 30 0 0.08666475 0.2133435 0.9548239 0.1307867
## 30 3 0.08856421 0.2210377 0.9514488 0.1322443
## 30 5 0.08663693 0.2149943 0.9540184 0.1288934
## 30 7 0.08616532 0.2136550 0.9545934 0.1281160
## 30 9 0.08553913 0.2117360 0.9554004 0.1275286
## 35 0 0.08662359 0.2131488 0.9548967 0.1305923
## 35 3 0.08856284 0.2209245 0.9515046 0.1322135
## 35 5 0.08665173 0.2149025 0.9540594 0.1288794
## 35 7 0.08618969 0.2136094 0.9546139 0.1281794
## 35 9 0.08558043 0.2117640 0.9553867 0.1275976
## 40 0 0.08661211 0.2126452 0.9551128 0.1302890
## 40 3 0.08848481 0.2202981 0.9517745 0.1319701
## 40 5 0.08656986 0.2143245 0.9543047 0.1286075
## 40 7 0.08613397 0.2130629 0.9548431 0.1279354
## 40 9 0.08551883 0.2111951 0.9556226 0.1273328
## 45 0 0.08656795 0.2123644 0.9552429 0.1302462
## 45 3 0.08839650 0.2198807 0.9519715 0.1317481
## 45 5 0.08648553 0.2139287 0.9544859 0.1284283
## 45 7 0.08603627 0.2126020 0.9550484 0.1277412
## 45 9 0.08542599 0.2107551 0.9558184 0.1271660
## 50 0 0.08664465 0.2124133 0.9552243 0.1303230
## 50 3 0.08840085 0.2197853 0.9520223 0.1317035
## 50 5 0.08649170 0.2138480 0.9545295 0.1284415
## 50 7 0.08604147 0.2125226 0.9550902 0.1277346
## 50 9 0.08545066 0.2106964 0.9558498 0.1271661
## 55 0 0.08670069 0.2127431 0.9550800 0.1305373
## 55 3 0.08845284 0.2200438 0.9519081 0.1317233
## 55 5 0.08654849 0.2141106 0.9544158 0.1284885
## 55 7 0.08610913 0.2128081 0.9549649 0.1278077
## 55 9 0.08552276 0.2110051 0.9557151 0.1272741
## 60 0 0.08661851 0.2126222 0.9551207 0.1304141
## 60 3 0.08841104 0.2200103 0.9519193 0.1316963
## 60 5 0.08650127 0.2140779 0.9544256 0.1284180
## 60 7 0.08607442 0.2127927 0.9549651 0.1277213
## 60 9 0.08548766 0.2109936 0.9557128 0.1271736
##
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 45 and neighbors = 9.
set.seed(7654)
nptcBE_enet <- train(be ~ ., df_train %>% select(pred_be) %>%
mutate(be = train$bandgap_energy_ev),
preProcess = c("center", "scale"),
method = "enet",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(lambda = c(0, 10 ^ seq(-1, -4, length = 5 - 1)),
fraction = seq(.05, 1, length = 5)),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcBE_enet
## Elasticnet
##
## 2367 samples
## 28 predictor
##
## Pre-processing: centered (28), scaled (28)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSLE RMSE Rsquared MAE
## 0e+00 0.0500 0.1109490 0.2820488 0.9211144 0.2073603
## 0e+00 0.2875 0.1108917 0.2818933 0.9212061 0.2072545
## 0e+00 0.5250 0.1108531 0.2817903 0.9212688 0.2071955
## 0e+00 0.7625 0.1108332 0.2817398 0.9213024 0.2071975
## 0e+00 1.0000 0.1108321 0.2817419 0.9213068 0.2072515
## 1e-04 0.0500 0.2687003 0.7999602 0.6946055 0.6463144
## 1e-04 0.2875 0.1198239 0.3153043 0.9055252 0.2325662
## 1e-04 0.5250 0.1120995 0.2832871 0.9203850 0.2075914
## 1e-04 0.7625 0.1116511 0.2825539 0.9208210 0.2073073
## 1e-04 1.0000 0.1112337 0.2821427 0.9210565 0.2072695
## 1e-03 0.0500 0.2947844 0.8780888 0.6892341 0.7173401
## 1e-03 0.2875 0.1589198 0.4676690 0.8393872 0.3367075
## 1e-03 0.5250 0.1164680 0.2998870 0.9124843 0.2189411
## 1e-03 0.7625 0.1125337 0.2837817 0.9201028 0.2079721
## 1e-03 1.0000 0.1119830 0.2829681 0.9205878 0.2074517
## 1e-02 0.0500 0.3042525 0.9067878 0.6836412 0.7429120
## 1e-02 0.2875 0.1872207 0.5583297 0.7839152 0.4109622
## 1e-02 0.5250 0.1284262 0.3555164 0.8900068 0.2626566
## 1e-02 0.7625 0.1147026 0.2904636 0.9165636 0.2112560
## 1e-02 1.0000 0.1129874 0.2844091 0.9197943 0.2082797
## 1e-01 0.0500 0.3066216 0.9139026 0.6846550 0.7492952
## 1e-01 0.2875 0.1957727 0.5834742 0.7555983 0.4311719
## 1e-01 0.5250 0.1353658 0.3815008 0.8758520 0.2782292
## 1e-01 0.7625 0.1210658 0.3036766 0.9085381 0.2212191
## 1e-01 1.0000 0.1177591 0.2922457 0.9166275 0.2141557
##
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 1 and lambda = 0.
set.seed(7654)
nptcBE_kknn <- train(be ~ ., df_train %>% select(pred_be) %>%
mutate(be = train$bandgap_energy_ev),
preProcess = c("center", "scale"),
method = "kknn",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(kmax = c(6, 8, 10, 12, 14),
distance = 1,
kernel = "optimal"),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcBE_kknn
## k-Nearest Neighbors
##
## 2367 samples
## 28 predictor
##
## Pre-processing: centered (28), scaled (28)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## kmax RMSLE RMSE Rsquared MAE
## 6 0.1048465 0.2834887 0.9213232 0.1893098
## 8 0.1045391 0.2834038 0.9220607 0.1900744
## 10 0.1046645 0.2837562 0.9219978 0.1906079
## 12 0.1046645 0.2837562 0.9219978 0.1906079
## 14 0.1046645 0.2837562 0.9219978 0.1906079
##
## Tuning parameter 'distance' was held constant at a value of 1
##
## Tuning parameter 'kernel' was held constant at a value of optimal
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 8, distance = 1 and
## kernel = optimal.
set.seed(7654)
nptcBE_pls <- train(be ~ ., df_train %>% select(pred_be) %>%
mutate(be = train$bandgap_energy_ev),
preProcess = c("center", "scale"),
method = "pls",
metric = "RMSLE",
maximize = FALSE,
tuneLength = ncol(df_train)-1,
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcBE_pls
## Partial Least Squares
##
## 2367 samples
## 28 predictor
##
## Pre-processing: centered (28), scaled (28)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## ncomp RMSLE RMSE Rsquared MAE
## 1 0.1913732 0.5314200 0.7200892 0.4079832
## 2 0.1549723 0.4262093 0.8199190 0.3219768
## 3 0.1211774 0.3161109 0.9009182 0.2346206
## 4 0.1177785 0.2955907 0.9132820 0.2141636
## 5 0.1166281 0.2904437 0.9162259 0.2091514
## 6 0.1152155 0.2872513 0.9181367 0.2098431
## 7 0.1135308 0.2868033 0.9184071 0.2099670
## 8 0.1128023 0.2859694 0.9188808 0.2097137
## 9 0.1127386 0.2854893 0.9191417 0.2096235
## 10 0.1129068 0.2853049 0.9192607 0.2094079
## 11 0.1131371 0.2850669 0.9194060 0.2089860
## 12 0.1129296 0.2849366 0.9194780 0.2090415
## 13 0.1129290 0.2848923 0.9194962 0.2091444
## 14 0.1127245 0.2843559 0.9198094 0.2085523
## 15 0.1126358 0.2842960 0.9198498 0.2081096
## 16 0.1124747 0.2839506 0.9200481 0.2077635
## 17 0.1124035 0.2840198 0.9200145 0.2077878
## 18 0.1122914 0.2835854 0.9202457 0.2076000
## 19 0.1121049 0.2834567 0.9203098 0.2074980
## 20 0.1120681 0.2835811 0.9202433 0.2075443
## 21 0.1120246 0.2835156 0.9202785 0.2074990
## 22 0.1116514 0.2826444 0.9207647 0.2071727
## 23 0.1116912 0.2826862 0.9207425 0.2071279
## 24 0.1116306 0.2825897 0.9208023 0.2073384
## 25 0.1112928 0.2821227 0.9210628 0.2074953
## 26 0.1109507 0.2820686 0.9211028 0.2073812
## 27 0.1108321 0.2817419 0.9213068 0.2072515
##
## RMSLE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 27.
nptcBE_svmPoly <- train(be ~ ., df_train %>% select(pred_be) %>%
mutate(be = train$bandgap_energy_ev),
preProcess = c("center", "scale"),
method = "svmPoly",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(degree = c(1,2),
scale = c(.001, .005, .01),
C = 2 ^((1:10) - 3)),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcBE_svmPoly
## Support Vector Machines with Polynomial Kernel
##
## 2367 samples
## 28 predictor
##
## Pre-processing: centered (28), scaled (28)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## degree scale C RMSLE RMSE Rsquared MAE
## 1 0.001 0.25 0.16760030 0.5088291 0.8349129 0.3748828
## 1 0.001 0.50 0.13260661 0.3920336 0.8821464 0.2784349
## 1 0.001 1.00 0.11758773 0.3303453 0.9035441 0.2332121
## 1 0.001 2.00 0.11345362 0.3046767 0.9121194 0.2163861
## 1 0.001 4.00 0.11331273 0.2934693 0.9158364 0.2097283
## 1 0.001 8.00 0.11341680 0.2888294 0.9176800 0.2076724
## 1 0.001 16.00 0.11331904 0.2871094 0.9184318 0.2070740
## 1 0.001 32.00 0.11314948 0.2862559 0.9188472 0.2065886
## 1 0.001 64.00 0.11312813 0.2859056 0.9190226 0.2063122
## 1 0.001 128.00 0.11288512 0.2854668 0.9192818 0.2060789
## 1 0.005 0.25 0.11547602 0.3195180 0.9072570 0.2258988
## 1 0.005 0.50 0.11326551 0.3002486 0.9134875 0.2136172
## 1 0.005 1.00 0.11338369 0.2914300 0.9166207 0.2086136
## 1 0.005 2.00 0.11336778 0.2880360 0.9180248 0.2073371
## 1 0.005 4.00 0.11327934 0.2867897 0.9185919 0.2069433
## 1 0.005 8.00 0.11310092 0.2860682 0.9189518 0.2064310
## 1 0.005 16.00 0.11297354 0.2857073 0.9191524 0.2062406
## 1 0.005 32.00 0.11284091 0.2854208 0.9193035 0.2060014
## 1 0.005 64.00 0.11269149 0.2852091 0.9194260 0.2058856
## 1 0.005 128.00 0.11254760 0.2849682 0.9195566 0.2056609
## 1 0.010 0.25 0.11326502 0.3002443 0.9134915 0.2136102
## 1 0.010 0.50 0.11338460 0.2914297 0.9166194 0.2086156
## 1 0.010 1.00 0.11337346 0.2880591 0.9180132 0.2073469
## 1 0.010 2.00 0.11328114 0.2867916 0.9185914 0.2069488
## 1 0.010 4.00 0.11309087 0.2860670 0.9189514 0.2064290
## 1 0.010 8.00 0.11297296 0.2857128 0.9191494 0.2062336
## 1 0.010 16.00 0.11284045 0.2854194 0.9193041 0.2059962
## 1 0.010 32.00 0.11266060 0.2851875 0.9194326 0.2058766
## 1 0.010 64.00 0.11254521 0.2849728 0.9195525 0.2056643
## 1 0.010 128.00 0.11242154 0.2847192 0.9196881 0.2054642
## 2 0.001 0.25 0.13221063 0.3905877 0.8830853 0.2772337
## 2 0.001 0.50 0.11682868 0.3278178 0.9050609 0.2308856
## 2 0.001 1.00 0.11186260 0.3001572 0.9148577 0.2122337
## 2 0.001 2.00 0.11040134 0.2854315 0.9206269 0.2022943
## 2 0.001 4.00 0.10840004 0.2748755 0.9255356 0.1947110
## 2 0.001 8.00 0.10507000 0.2642405 0.9309142 0.1858584
## 2 0.001 16.00 0.10089888 0.2522106 0.9369823 0.1754251
## 2 0.001 32.00 0.09589693 0.2388211 0.9435133 0.1631097
## 2 0.001 64.00 0.09150384 0.2266615 0.9491221 0.1518735
## 2 0.001 128.00 0.08839324 0.2172527 0.9532187 0.1432503
## 2 0.005 0.25 0.10613680 0.2799612 0.9252540 0.1951508
## 2 0.005 0.50 0.10204485 0.2607430 0.9334862 0.1807551
## 2 0.005 1.00 0.09730535 0.2452557 0.9407083 0.1680388
## 2 0.005 2.00 0.09303509 0.2320019 0.9467751 0.1560923
## 2 0.005 4.00 0.08957678 0.2211542 0.9515448 0.1462550
## 2 0.005 8.00 0.08727414 0.2137974 0.9547189 0.1393041
## 2 0.005 16.00 0.08589502 0.2094329 0.9565180 0.1351232
## 2 0.005 32.00 0.08571292 0.2082324 0.9570047 0.1337566
## 2 0.005 64.00 0.08607687 0.2083921 0.9569366 0.1340237
## 2 0.005 128.00 0.08628569 0.2088570 0.9567150 0.1345763
## 2 0.010 0.25 0.09762560 0.2483019 0.9397491 0.1697841
## 2 0.010 0.50 0.09300358 0.2331161 0.9463885 0.1567831
## 2 0.010 1.00 0.08971619 0.2218234 0.9512833 0.1465606
## 2 0.010 2.00 0.08726747 0.2139900 0.9546364 0.1394059
## 2 0.010 4.00 0.08590408 0.2095281 0.9564827 0.1351720
## 2 0.010 8.00 0.08574467 0.2083553 0.9569618 0.1338936
## 2 0.010 16.00 0.08606128 0.2084090 0.9569240 0.1340454
## 2 0.010 32.00 0.08606444 0.2085500 0.9568488 0.1343117
## 2 0.010 64.00 0.08655509 0.2090538 0.9566459 0.1349958
## 2 0.010 128.00 0.08716553 0.2097148 0.9563964 0.1353621
##
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 2, scale = 0.005 and C
## = 32.
# Estimate the RBF kernel parameter sigma with kernlab::sigest(), which returns
# the 0.1, 0.5 and 0.9 quantiles of a data-driven sigma estimate; the median
# (second element) is used in the tuning grid below.
sigbe <- kernlab::sigest(be ~ ., data = (df_train %>% select(pred_be) %>%
    mutate(be = train$bandgap_energy_ev)), frac = 1)
set.seed(7654)
nptcBE_svmR <- train(be ~ ., df_train %>% select(pred_be) %>%
mutate(be = train$bandgap_energy_ev),
preProcess = c("center", "scale"),
method = "svmRadial",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(sigma = sigbe[2],
C = 2 ^((1:10) - 3)),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcBE_svmR## Support Vector Machines with Radial Basis Function Kernel
##
## 2367 samples
## 28 predictor
##
## Pre-processing: centered (28), scaled (28)
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## C RMSLE RMSE Rsquared MAE
## 0.25 0.09555421 0.2493848 0.9397653 0.1612058
## 0.50 0.09170542 0.2346348 0.9458592 0.1507655
## 1.00 0.08959076 0.2257370 0.9495469 0.1439978
## 2.00 0.08809119 0.2201960 0.9518185 0.1399394
## 4.00 0.08734206 0.2175768 0.9528973 0.1374790
## 8.00 0.08779134 0.2175826 0.9528689 0.1368912
## 16.00 0.08907483 0.2198592 0.9518484 0.1375226
## 32.00 0.09155089 0.2260517 0.9491353 0.1408886
## 64.00 0.09693138 0.2378759 0.9438022 0.1477634
## 128.00 0.10240700 0.2533572 0.9366309 0.1562443
##
## Tuning parameter 'sigma' was held constant at a value of 0.01826589
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01826589 and C = 4.
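Since sigest() returns three candidate values (the 0.1, 0.5 and 0.9 quantiles), one possible variant, sketched here under the hypothetical name nptcBE_svmR_grid rather than taken from the original workflow, is to let the grid search try all three instead of fixing sigma at the median:
# Hypothetical variant of the radial SVM fit above: tune over all three sigma
# estimates returned by sigest() instead of fixing sigma at the median value.
set.seed(7654)
nptcBE_svmR_grid <- train(be ~ ., df_train %>% select(pred_be) %>%
                            mutate(be = train$bandgap_energy_ev),
                          preProcess = c("center", "scale"),
                          method = "svmRadial",
                          metric = "RMSLE",
                          maximize = FALSE,
                          tuneGrid = expand.grid(sigma = sigbe,
                                                 C = 2 ^((1:10) - 3)),
                          trControl = trainControl(method = "LGOCV",
                                                   number = 10, p = 0.8,
                                                   search = "grid",
                                                   summaryFunction = custom_summary))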
set.seed(7654)
nptcBE_xgbt <- train(be ~ ., df_train %>% select(pred_be) %>%
mutate(be = train$bandgap_energy_ev),
method = "xgbTree",
metric = "RMSLE",
maximize = FALSE,
tuneGrid = expand.grid(nrounds = seq(50,2000,50),
max_depth = 4,
eta = 0.02,
gamma = 0.006,
colsample_bytree = 0.7,
min_child_weight = 1,
subsample = 1),
trControl = trainControl(method = "LGOCV",
number = 10, p = 0.8,
search = "grid",
summaryFunction = custom_summary))
nptcBE_xgbt## eXtreme Gradient Boosting
##
## 2367 samples
## 28 predictor
##
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 80%)
## Summary of sample sizes: 1895, 1895, 1895, 1895, 1895, 1895, ...
## Resampling results across tuning parameters:
##
## nrounds RMSLE RMSE Rsquared MAE
## 50 0.22731412 0.7540470 0.9314675 0.6170701
## 100 0.11856587 0.3669656 0.9448323 0.2941888
## 150 0.09509505 0.2508754 0.9509037 0.1870335
## 200 0.08993156 0.2223069 0.9536518 0.1540430
## 250 0.08802606 0.2143664 0.9551324 0.1436202
## 300 0.08695346 0.2109817 0.9561451 0.1397350
## 350 0.08627867 0.2092370 0.9567390 0.1378994
## 400 0.08577763 0.2080224 0.9571780 0.1366337
## 450 0.08535223 0.2070865 0.9575288 0.1357365
## 500 0.08499707 0.2063638 0.9578032 0.1350049
## 550 0.08474098 0.2058341 0.9580015 0.1343646
## 600 0.08453966 0.2054274 0.9581560 0.1338640
## 650 0.08439679 0.2050918 0.9582863 0.1334444
## 700 0.08436009 0.2049672 0.9583321 0.1331799
## 750 0.08426845 0.2047519 0.9584156 0.1328730
## 800 0.08416334 0.2044871 0.9585209 0.1325970
## 850 0.08411278 0.2043789 0.9585640 0.1323875
## 900 0.08404836 0.2042505 0.9586126 0.1322020
## 950 0.08400962 0.2041124 0.9586674 0.1319674
## 1000 0.08398502 0.2040361 0.9586956 0.1318135
## 1050 0.08395417 0.2039769 0.9587178 0.1316942
## 1100 0.08392927 0.2039101 0.9587424 0.1315924
## 1150 0.08392302 0.2038674 0.9587577 0.1314863
## 1200 0.08390023 0.2038261 0.9587727 0.1313944
## 1250 0.08389008 0.2038296 0.9587685 0.1313577
## 1300 0.08387259 0.2038032 0.9587772 0.1312677
## 1350 0.08390664 0.2038692 0.9587479 0.1312614
## 1400 0.08386715 0.2037712 0.9587871 0.1311488
## 1450 0.08385119 0.2037322 0.9588012 0.1311128
## 1500 0.08382375 0.2036871 0.9588179 0.1310226
## 1550 0.08383425 0.2037056 0.9588120 0.1309960
## 1600 0.08384602 0.2037275 0.9588029 0.1309646
## 1650 0.08386195 0.2037648 0.9587892 0.1309330
## 1700 0.08387827 0.2038070 0.9587727 0.1309622
## 1750 0.08388969 0.2038577 0.9587560 0.1309589
## 1800 0.08392644 0.2039070 0.9587356 0.1309123
## 1850 0.08393212 0.2038926 0.9587415 0.1308403
## 1900 0.08393373 0.2039098 0.9587354 0.1308317
## 1950 0.08395218 0.2039344 0.9587242 0.1308229
## 2000 0.08394306 0.2039055 0.9587364 0.1307980
##
## Tuning parameter 'max_depth' was held constant at a value of 4
## Tuning parameter 'eta' was held constant at a value of 0.02
## Tuning parameter 'gamma' was held constant at a value of 0.006
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.7
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 1
## RMSLE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 1500, max_depth =
## 4, eta = 0.02, gamma = 0.006, colsample_bytree = 0.7, min_child_weight
## = 1 and subsample = 1.
Since we used the same resampling indices (by setting the same seed before building each model), we can now compare box-plots of the cross-validated RMSLE estimates across the models fitted for formation energy and for band gap energy. Only those models whose box-plots overlap with that of the best-performing model are considered for ensembling. We then combine them with a simple weighted average, with weights assigned to the models according to their performance ranking (one way such weights could be derived programmatically is sketched below).
For formation energy, we see that the model-averaged neural network outperforms the rest.
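As an illustration of turning a performance ranking into weights, the following sketch derives inverse-rank weights from a hypothetical named vector of mean cross-validated RMSLE values (the numbers are illustrative, not the actual resampling results, and the weights used later in this document were chosen by hand). Once the resamples objects below are created, the mean RMSLE values could be pulled from, e.g., summary(resamps_FE)$statistics$RMSLE[, "Mean"].
# Illustrative sketch only: rank-based ensemble weights.
# rmsle_cv holds assumed mean cross-validated RMSLE values for the models
# retained in the ensemble (lower is better).
rmsle_cv <- c(avNNet = 0.080, xgbt = 0.084, svmPoly = 0.086,
              cubist = 0.087, svmRadial = 0.088)
rank_wts <- rank(-rmsle_cv)            # best model (lowest RMSLE) gets the largest rank
rank_wts <- rank_wts / sum(rank_wts)   # normalise so the weights sum to one
round(rank_wts, 3)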
# List of models
resamps_FE <- resamples(list(avNNet = nptcFE_avNNet,
cubist = nptcFE_cubist,
enet = nptcFE_enet,
kknn = nptcFE_kknn,
pls = nptcFE_pls,
svmPoly = nptcFE_svmPoly,
svmRadial = nptcFE_svmR,
xgbt = nptcFE_xgbt))
# Box-plots for comparison
bwplot(resamps_FE, metric = 'RMSLE')There are four models whose box-plots overlap with that of the best model, so we use an ensemble of five models for the test-set predictions (the overlap can also be checked numerically, as sketched below).
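Beyond eyeballing the box-plots, caret can quantify how close the models are: diff() on a resamples object computes paired differences of the resampled metric. A minimal sketch, not part of the original workflow, applied to resamps_FE:
# Paired differences of the resampled RMSLE values; in the printed summary the
# upper triangle holds the estimated differences and the lower triangle the
# p-values of the paired tests.
diffs_FE <- diff(resamps_FE, metric = "RMSLE")
summary(diffs_FE)
dotplot(diffs_FE, metric = "RMSLE")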
ma_fe = .7 * predict(nptcFE_avNNet, df_test) +
.15 * predict(nptcFE_xgbt, df_test) +
.06 * predict(nptcFE_cubist, df_test) +
.05 * predict(nptcFE_svmR, df_test) +
.04 * predict(nptcFE_svmPoly, df_test)For band gap energy, the model-averaged neural network again outperforms the rest, although the remaining models rank differently than they do for formation energy.
# List of models
resamps_BE <- resamples(list(avNNet = nptcBE_avNNet,
cubist = nptcBE_cubist,
enet = nptcBE_enet,
kknn = nptcBE_kknn,
pls = nptcBE_pls,
svmPoly = nptcBE_svmPoly,
svmRadial = nptcBE_svmR,
xgbt = nptcBE_xgbt))
# Box-plots for comparison
bwplot(resamps_BE, metric = 'RMSLE')Again there are four models whose box-plots overlap with that of the best model, so we use an ensemble of five models for the test-set predictions.
ma_be = .7 * predict(nptcBE_avNNet, df_test) +
.15 * predict(nptcBE_xgbt, df_test) +
.06 * predict(nptcBE_svmPoly, df_test) +
.05 * predict(nptcBE_cubist, df_test) +
.04 * predict(nptcBE_svmR, df_test) Let's now check the performance of the best single model, the model-averaged neural network, and of the ensemble on the test data. Since both target energies are non-negative quantities, we take the absolute value of the predictions before submission (an alternative that clamps negative predictions at zero is sketched below). We see that the ensemble gives a small improvement over the best single model.
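The submissions below use abs() to enforce non-negativity, which reflects a slightly negative prediction back into positive territory. A hypothetical alternative, sketched here with the ensemble predictions ma_fe and ma_be from above, is to clamp negatives at zero with pmax(), so a prediction of -0.05 becomes 0 rather than 0.05:
# Hypothetical variant: clamp negative ensemble predictions at zero instead of
# taking absolute values.
sub_wa5_clamped <- read_csv('sample_submission.csv') %>%
  mutate(formation_energy_ev_natom = pmax(ma_fe, 0),
         bandgap_energy_ev = pmax(ma_be, 0))
write_excel_csv(sub_wa5_clamped, 'sub_wa5_clamped.csv')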
# Submission using best model
sub_avNNet <- read_csv('sample_submission.csv') %>%
mutate(formation_energy_ev_natom = predict(nptcFE_avNNet, df_test),
bandgap_energy_ev = predict(nptcBE_avNNet, df_test)) %>% abs()
write_excel_csv(sub_avNNet,'sub_avNNet.csv')
# Submission using weighted average ensemble
sub_wa5 <- read_csv('sample_submission.csv') %>%
mutate(formation_energy_ev_natom = ma_fe,
bandgap_energy_ev = ma_be) %>% abs()
write_excel_csv(sub_wa5, 'sub_wa5.csv')The private leaderboard scores of the two submissions are:
It is clear from the private leaderboard ranking that there is still scope for improving the performance metric, most likely through more meticulous feature engineering guided by domain knowledge. Nevertheless, the solution proposed in this document improves on the \(45^{th}\)-place solution, placing it among the top 5% of solutions.