Formulation Challenges are Everywhere…

You develop pharmaceutical, cosmetic, food, industrial or civil-engineered products, and you are often confronted with the challenge of blending and formulating to meet process or performance targets. Traditional Research and Development does approach the problem through experimentation, but it comes with design, time and resource constraints; it tends to be slow and expensive, and its results are often redundant, quickly forgotten or even obsolete.

Consider the alternative that Machine Learning tools offer today. We will show that this approach is not only quick and efficient, but also particularly well suited to formulation and classification problems, and, we would argue, the way the Front End of Innovation should proceed.

Today, we will explain how Machine Learning can shed new light on this generic and very persistent formulation challenge. The other important aspects often associated with formulation challenges, classification and clustering, will be discussed in a forthcoming communication.

Step 1: Retrieve Existing Data

To illustrate the approach, we selected a formulation dataset hosted on the UCI Machine Learning Repository, with the goal of predicting how compressive strength depends on the formulation ingredients. This is the well-known composition-property relationship that scientists, engineers and business professionals must address daily, and any established R&D organization certainly holds similar, and sometimes hidden, knowledge in its archives…

We will use R to demonstrate the approach quickly on this dataset, and also to show how reproducibility of the analysis is enforced: the analysis tool and platform are documented, all libraries are clearly listed, and the data is retrieved programmatically and date-stamped from the repository.

Sys.info()[1:5]
##      sysname      release      version     nodename      machine 
##    "Windows"      "7 x64" "build 9200"   "STALLION"     "x86-64"
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] reshape_0.8.5       scales_0.3.0        devtools_1.8.0     
##  [4] rpart.plot_1.5.3    rpart_4.1-10        randomForest_4.6-10
##  [7] neuralnet_1.32      MASS_7.3-44         caret_6.0-52       
## [10] ggplot2_1.0.1       lattice_0.20-33     stringr_1.0.0      
## [13] xlsx_0.5.7          xlsxjars_0.6.1      rJava_0.9-7        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.0         git2r_0.11.0        formatR_1.2        
##  [4] nloptr_1.0.4        plyr_1.8.3          iterators_1.0.7    
##  [7] tools_3.2.2         digest_0.6.8        lme4_1.1-9         
## [10] memoise_0.2.1       evaluate_0.7.2      gtable_0.1.2       
## [13] nlme_3.1-121        mgcv_1.8-9          Matrix_1.2-2       
## [16] foreach_1.4.2       curl_0.9.3          parallel_3.2.2     
## [19] yaml_2.1.13         SparseM_1.7         brglm_0.5-9        
## [22] proto_0.3-10        xml2_0.1.1          BradleyTerry2_1.0-6
## [25] knitr_1.11          rversions_1.0.2     MatrixModels_0.4-1 
## [28] gtools_3.5.0        stats4_3.2.2        nnet_7.3-11        
## [31] rmarkdown_0.9.2     minqa_1.2.4         reshape2_1.4.1     
## [34] car_2.0-26          magrittr_1.5        codetools_0.2-14   
## [37] htmltools_0.2.6     splines_3.2.2       pbkrtest_0.4-2     
## [40] colorspace_1.2-6    quantreg_5.18       stringi_0.5-5      
## [43] munsell_0.4.2
library(xlsx)
library(stringr)
library(caret)
library(neuralnet)
library(devtools)
library(rpart)
library(rpart.plot)
userdir <- getwd()
datadir <- "./data"
# Create a local data folder, download the dataset and date-stamp the retrieval
if (!file.exists("data")){dir.create("data")}
fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/Concrete_Data.xls",method="curl")
dateDownloaded <- date()
# Read the Excel sheet into a data frame
concrete <- read.xlsx("./data/Concrete_Data.xls",sheetName="Sheet1")
str(concrete)
## 'data.frame':    1030 obs. of  9 variables:
##  $ Cement..component.1..kg.in.a.m.3.mixture.            : num  540 540 332 332 199 ...
##  $ Blast.Furnace.Slag..component.2..kg.in.a.m.3.mixture.: num  0 0 142 142 132 ...
##  $ Fly.Ash..component.3..kg.in.a.m.3.mixture.           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Water...component.4..kg.in.a.m.3.mixture.            : num  162 162 228 228 192 228 228 228 228 228 ...
##  $ Superplasticizer..component.5..kg.in.a.m.3.mixture.  : num  2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ Coarse.Aggregate...component.6..kg.in.a.m.3.mixture. : num  1040 1055 932 932 978 ...
##  $ Fine.Aggregate..component.7..kg.in.a.m.3.mixture.    : num  676 676 594 594 826 ...
##  $ Age..day.                                            : num  28 28 270 365 360 90 365 28 28 28 ...
##  $ Concrete.compressive.strength.MPa..megapascals..     : num  80 61.9 40.3 41.1 44.3 ...

Step 2: Normalize Data

The dataset information reveals 1030 observations of 9 variables: 8 inputs, of which 7 are ingredients and 1 is a process attribute (Age), and 1 output, the strength property. There are no missing values in this set. We can easily truncate the variable names, normalize the data, and display a summary of the normalized strength.

normalize <- function(x) {return((x - min(x)) / (max(x) - min(x)))}
# Truncate the long column names to their first word and rename the output
names(concrete)<-gsub("\\."," ",names(concrete))
names(concrete)<-word(names(concrete),1)
names(concrete)[9]<-"Strength"
# Min-max scale every column to the [0,1] range
concrete_norm <- as.data.frame(lapply(concrete, normalize))
summary(concrete_norm$Strength)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2663  0.4000  0.4172  0.5457  1.0000

These transformations are typical of a generic formulation problem, where the ingredients and process variables are the independent (input) variables and the property is the dependent (output) variable.
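
Because the models will be trained on normalized data, their predictions will also come out on the 0-1 scale. A small helper (the name denormalize is our own choice) makes it easy to map values back to the original units, for example MPa for strength:

denormalize <- function(x, ref) {return(x * (max(ref) - min(ref)) + min(ref))}
denormalize(c(0, 1), concrete$Strength)  # recovers the original strength range in MPa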

Step 3: Train and Test a Model

The method we follow now is a standard approach: randomly split the dataset into a training set and a test set. The caret package implements this task well. We will use 75% of the data to train the model and the remainder to test it. To make the analysis reproducible, we use the set.seed() function.

set.seed(12121)
inTrain<-createDataPartition(y=concrete_norm$Strength,p=0.75,list=FALSE)
concrete_train <- concrete_norm[inTrain, ]
concrete_test <- concrete_norm[-inTrain, ]
concrete_model <- neuralnet(formula = Strength ~ Cement + Blast + Fly + Water + Superplasticizer + Coarse + Fine + Age, data = concrete_train)

The network topology is then easily visualized; see the excellent NeuralNetTool page for details. Suffice it to say that even this simple first attempt highlights the main dependencies: inputs with higher impact are drawn with thicker links in the typical neural net representation. Here the I's represent inputs, O is the output, and H and B are the hidden and bias nodes defined in the theory. Note that a single bias node is added for the input layer and for each hidden layer; much like an intercept term, it lets the network produce non-zero outputs even when input features equal 0.
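
Such a visualization can be generated, for instance, with the plot method that ships with the neuralnet package; a minimal sketch:

# Draw the fitted network: input (I), hidden (H), bias (B) and output (O) nodes with weighted links
plot(concrete_model)
# The final error and the fitted weights are also stored in the result matrix
head(concrete_model$result.matrix)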

Step 4: Evaluate Model Performance

We will now compute predictions, compare them to the actual strength values, and examine the correlation between predicted and actual strength.

model_results <- compute(concrete_model, concrete_test[1:8])
predicted_strength <- model_results$net.result
cor(predicted_strength, concrete_test$Strength)[1,1]
## [1] 0.8336352308
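
Correlation alone does not tell us how large the errors are, so as a complementary check we can also compute a test-set RMSE in normalized units; a minimal sketch:

# Root mean squared error between actual and predicted normalized strength
rmse <- function(actual, predicted) {sqrt(mean((actual - predicted)^2))}
rmse(concrete_test$Strength, predicted_strength)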

The default neural net exhibits a correlation of 0.8336352. We can certainly try to improve it by including a few hidden neurons…

Step 5: Improving Model Performance with 5 Hidden Neurons

concrete_model2 <- neuralnet(formula = Strength ~ Cement + Blast + Fly + Water + Superplasticizer + Coarse + Fine + Age, data = concrete_train, hidden=5)
plot.nnet(concrete_model2,cex.val=0.75)

We observe that Age remains a key contributor, but the effects of Water, Superplasticizer, Cement and Blast are also clearly ranked. Four out of the five hidden nodes contribute roughly evenly…
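
To put numbers on these visual impressions, the NeuralNetTools package (assumed to be installed; it is not part of the session listed above) provides Garson's algorithm for relative variable importance; a quick sketch:

library(NeuralNetTools)
# Relative importance of each input according to Garson's algorithm
garson(concrete_model2)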

model_results2 <- compute(concrete_model2, concrete_test[1:8])
predicted_strength2 <- model_results2$net.result
cor(predicted_strength2, concrete_test$Strength)[1,1]
## [1] 0.9138705528
plot(concrete_test$Strength, predicted_strength2)

Step 6: Improving Model using Random Forest Algorithm

We will now rely on the Random Forest algorithm and attempt a further model improvement.

model_result3 <- train(Strength ~ ., data = concrete_train,method='rf',prox=TRUE)
model_result3
## Random Forest 
## 
## 774 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 774, 774, 774, 774, 774, 774, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE           Rsquared      RMSE SD         Rsquared SD  
##   2     0.07833903729  0.8774837294  0.004515965257  0.01607292866
##   5     0.07159729020  0.8869237640  0.004510365293  0.01395064039
##   8     0.07365915696  0.8777421774  0.005197103383  0.01691711063
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 5.
predicted_strength3 <- predict(model_result3,concrete_test)
cor(predicted_strength3, concrete_test$Strength)
## [1] 0.943384811

The default Random Forest algorithm helped improve our prediction, with a correlation of 0.943384811. Again, we can certainly try to improve further by changing the resampling scheme… The caret package offers multiple methods to try out; we will just try one to give an idea…
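
For reference, and purely as an illustration (the object below is not used further), here are a few of the resampling schemes that caret's trainControl() accepts:

# "boot"        bootstrap (caret's default, used in Step 6 above)
# "cv"          k-fold cross-validation (used in Step 7 below)
# "repeatedcv"  repeated k-fold cross-validation
# "LOOCV"       leave-one-out cross-validation
ctrl_example <- trainControl(method = "repeatedcv", number = 10, repeats = 3)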

Step 7: Testing Further with Resampling

model_result4 <- train(Strength ~ ., method='rf',data = concrete_train,verbose=FALSE,trControl = trainControl(method="cv"))
model_result4
## Random Forest 
## 
## 774 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 696, 696, 695, 697, 698, 696, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared   RMSE SD     Rsquared SD
##   2     0.06910599  0.9068129  0.01101706  0.04013275 
##   5     0.06227778  0.9121046  0.01232558  0.03945448 
##   8     0.06287753  0.9089120  0.01208088  0.03835530 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 5.
predicted_strength4 <- predict(model_result4,concrete_test)
cor(predicted_strength4, concrete_test$Strength)
## [1] 0.9428245

We observe that the prediction is practically unchanged in this case, with a correlation of 0.9428245. Still not bad for a quick analysis performed on existing data. Regardless of our property target, we have already identified key areas to investigate more deeply, and can clearly see some key ingredients (Cement, Blast, Fly, Superplasticizer, Water…) and the Age process variable as determining factors for strength performance. So naturally, one may want to display this model.
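
One quick way to back this observation with numbers is caret's varImp() function, which ranks the predictors of the fitted random forest; a minimal sketch:

# Rank the predictors of the cross-validated random forest by importance
varImp(model_result4)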

Step 8: Actual Display of a Random Forest Tree Solution

It turns out that so-called black-box models are, well, meant to stay in their box! However, the rpart and rpart.plot packages make it easy to fit and visualize a decision tree, even a fairly complex one.

# Fit a single regression tree as an interpretable view of the strength drivers
strength.tree <- rpart(Strength ~ .,data=concrete_train, control=rpart.control(minsplit=20,cp=0.002))
prp(strength.tree,compress=TRUE)

In the tree, the normalized strength values appear in the leaves and are ranked from low to high as the branches run from left to right.
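
If an actual strength figure is needed, a leaf value can be translated back to MPa with the denormalize helper introduced in Step 2 (the value 0.75 below is just an example):

denormalize(0.75, concrete$Strength)  # express a normalized leaf value on the MPa scale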

Conclusions

We hope this typical example demonstrates that Machine Learning algorithms are well positioned to help resolve formulation challenges, offering a fast, efficient and economical alternative to tedious experimentation. It is easy to imagine how similar questions can be resolved in all types of R&D, in materials, cosmetics, food or any scientific area.

Rubber formulations to minimize rolling resistance and emissions, modern composites to build renewable energy sources, lightweight transportation vehicles and next-generation public transit, as well as innovative UV-shield ointments and tasty snacks and drinks… all present similar challenges where only the nature of the inputs and outputs varies. Therefore, the method can and should be applied broadly!

Next time, we will review how to address another common challenge: classification and clustering. Until then, we hope this approach has triggered your interest.

Why not try to implement Machine Learning in your own scientific or technical area of expertise? Remember, PRC Consulting, LLC is dedicated to boosting innovation through improved Analytics, one customer at a time!

References

The following sources provided significant help and information in developing this Machine Learning analysis applied to formulations:

  1. UCI Machine Learning Repository
  2. caret
  3. NeuralNetTool
  4. rpart
  5. rpart.plot
  6. RStudio