Exercises 7.2 & 7.5 from the K&J book.

getwd()
## [1] "T:/00-624 HH Predictive Analytics/HMWK8 NonLinearRegresn/HMWK8 HH"
# clear the workspace
rm(list = ls())
library(mlbench)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(nnet)
library(ggplot2)
library(mice)
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x kernlab::alpha() masks ggplot2::alpha()
## x purrr::cross()   masks kernlab::cross()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x purrr::lift()    masks caret::lift()
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Prepare Data

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# impute missing data: m = 2 imputations, up to 3 iterations each, PMM method
tmp.data <- mice(ChemicalManufacturingProcess,m=2,maxit=3,meth='pmm',seed=125)
## 
##  iter imp variable
##   1   1  ManufacturingProcess01  ManufacturingProcess02  ManufacturingProcess03  ManufacturingProcess04  ManufacturingProcess05  ManufacturingProcess06  ManufacturingProcess07  ManufacturingProcess08  ManufacturingProcess10  ManufacturingProcess11  ManufacturingProcess12  ManufacturingProcess14  ManufacturingProcess22  ManufacturingProcess23  ManufacturingProcess24  ManufacturingProcess25  ManufacturingProcess26  ManufacturingProcess27  ManufacturingProcess28  ManufacturingProcess29  ManufacturingProcess30  ManufacturingProcess31  ManufacturingProcess33  ManufacturingProcess34  ManufacturingProcess35  ManufacturingProcess36  ManufacturingProcess40  ManufacturingProcess41
##   ... (imputations 1-2 within iterations 1-3 each print the same variable list; output truncated)
## Warning: Number of logged events: 162
ChemicalManufacturingProcess = complete(tmp.data)  # completed data set: 176 x 58

• The data set comes with the AppliedPredictiveModeling library. It contains 176 observations and 58 variables: the outcome variable Yield, 12 variables describing biological materials, and 45 variables describing the manufacturing process. I use the mice package for imputation with the PMM (predictive mean matching) method; two imputed data sets are generated (m = 2), and within each imputation a maximum of 3 iterations is allowed (maxit = 3). A quick sanity check of the completed data follows below.
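As a minimal sketch (not part of the original workflow), we can confirm that the completed data set has no remaining missing values and the expected dimensions:

# confirm the imputation left no NAs and the dimensions are unchanged
sum(is.na(ChemicalManufacturingProcess))  # expect 0
dim(ChemicalManufacturingProcess)         # expect 176 58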

# train test split
set.seed(200)
rows = nrow(ChemicalManufacturingProcess)

t.index <- sample(1:rows, size = round(0.75*rows), replace=FALSE)

df.train <- ChemicalManufacturingProcess[t.index ,]
df.test <- ChemicalManufacturingProcess[-t.index ,]

df.train.x = df.train[,-1]
df.train.y = df.train[,1]

df.test.x = df.test[,-1]
df.test.y = df.test[,1]

Now the data is complete and ready to be analyzed. The data is split into training and testing sets using a 75%/25% rule. Both contain the original 58 variables, that is, 57 explanatory variables and one outcome variable. The training data set has 132 observations and the testing data set has 44 observations.
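As an aside, a stratified alternative (a sketch using caret's createDataPartition, which balances the split on the outcome) would look like the following; the simple random split above is the one used for the analysis.

# alternative (not used here): 75/25 split stratified on the outcome
set.seed(200)
idx <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.75, list = FALSE)
# df.train <- ChemicalManufacturingProcess[idx, ]
# df.test  <- ChemicalManufacturingProcess[-idx, ]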

FUNCTION FOR COMMON MODELS

model.eval = function(modelmethod, gridSearch = NULL)
{
  # fit the model with centering/scaling and 10-fold cross-validation
  Model = train(x = df.train.x,
                y = df.train.y,
                method = modelmethod,
                tuneGrid = gridSearch,
                preProcess = c('center', 'scale'),
                trControl = trainControl(method = 'cv'))

  # evaluate on the held-out test set; print() also returns the
  # performance vector, so it can be captured by assignment
  Pred = predict(Model, newdata = df.test.x)
  modelperf = postResample(Pred, df.test.y)

  print(modelperf)
}

The function model.eval is defined so that the common training models can be fitted with a single call. It fixes the training predictors and outcome, accepts the model method and an optional tuning grid, centers and scales the predictors, and uses 10-fold cross-validation as the training control. It then predicts on the test set and reports test performance via postResample.

Next, we apply this function to k-nearest neighbors, support vector machines, and multivariate adaptive regression splines.

K-Nearest Neighbors

perfknn = model.eval('knn')
##      RMSE  Rsquared       MAE 
## 1.4998718 0.4976055 1.3024472

Support Vector Machine

perfsvm = model.eval('svmRadial')
##      RMSE  Rsquared       MAE 
## 1.2526982 0.6560941 1.0771329
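Here svmRadial is tuned over caret's default grid. An explicit grid over the kernel width sigma and cost C could be supplied the same way the MARS grid is below (the values here are illustrative assumptions, not tuned choices):

# optional: explicit tuning grid for the radial SVM (illustrative values)
svmGrid = expand.grid(sigma = c(0.005, 0.01, 0.05), C = 2^(-2:7))
# perfsvm = model.eval('svmRadial', svmGrid)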

Multivariate Adaptive Regression Splines

marsGrid = expand.grid(degree = 1:2, nprune = 2:38)
perfmars = model.eval('earth', marsGrid)
##      RMSE  Rsquared       MAE 
## 1.2103929 0.6674794 0.9872408

Neural Network NN

We then apply a neural network prediction model to this data set. Since the nnet syntax is quite different from the caret interface used above, we do not reuse the model.eval function here. We choose five hidden units, a maximum of 500 iterations, and decay values of 0, 0.01, and 0.1; note that nnet fits a single model, so unlike caret it does not tune over these values.

nnetFit <- nnet(x = df.train[, -1],
                y = df.train$Yield,
                size = 5,                 # five hidden units
                decay = c(0, 0.01, .1),
                trace = FALSE,
                maxit = 500,
                # caution: for regression, linout = TRUE is needed for a linear
                # output unit; the default logistic output is bounded in (0, 1)
                MaxNWts = 5 * (ncol(df.train[, -1]) + 1) + 5 + 1)
nnetFit
## a 57-5-1 network with 296 weights
## options were -
nnetPred <- predict(nnetFit, newdata = df.test[, -1])

perfnn <- postResample(pred=nnetPred, obs=df.test$Yield)
RMSE_nn = perfnn[1]
df.perf_all4 = rbind(data.frame(Name = 'KNN',  RMSE = perfknn[1]),
                     data.frame(Name = 'SVM',  RMSE = perfsvm[1]),
                     data.frame(Name = 'MARS', RMSE = perfmars[1]),
                     data.frame(Name = 'NN',   RMSE = RMSE_nn))
rownames(df.perf_all4) <- NULL  # drop the 'RMSE*' row names inherited from postResample
kable(df.perf_all4, caption = 'Model Comparisons')
Model Comparisons

Name    RMSE
KNN     1.499872
SVM     1.252698
MARS    1.210393
NN     39.394649

Performance evaluation in terms of RMSE clearly shows that the neural network has a huge RMSE, while the other three methods have similar RMSE values of around 1.2 to 1.5. The neural network's RMSE of roughly 39 is about the size of a typical Yield value itself, which points to the logistic output problem flagged in the code comment above: with the default linout = FALSE, nnet predictions are bounded in (0, 1), nowhere near the observed yields.
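For reference, a fairer neural network fit would go through caret so that the linear output unit, centering/scaling, and size/decay tuning are all handled consistently; a sketch (not run here, grid values are illustrative):

# sketch: tune the net through caret with the same pre-processing as the others
nnetGrid = expand.grid(size = c(3, 5, 7), decay = c(0, 0.01, 0.1))
# perfnn2 = model.eval('nnet', nnetGrid)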

df.perf.filtered <-
  df.perf_all4 %>%
  slice(1:3)   # drop the NN row, whose RMSE would dwarf the others

ggplot(data = df.perf.filtered,
       aes(x = Name, y = RMSE, fill = Name)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(aes(label = round(RMSE, 3)), vjust = 1, color = "white",
            position = position_dodge(0.9), size = 3.5)

7.5A.

Which nonlinear regression model gives the optimal resampling and test set performance?

The RMSE values of the three models on a comparable error scale (the neural network is excluded) are plotted in the bar graph above. MARS gives the best test-set performance, i.e., the lowest RMSE value.
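The comparison above uses test-set RMSE only. Resampling (cross-validated) performance could be compared as well via caret's resamples(); a sketch, assuming the train objects were saved rather than discarded inside model.eval (knnModel, svmModel, and marsModel below are hypothetical names for those saved fits):

# sketch (not run): compare cross-validated performance across saved fits
# resamps <- resamples(list(KNN = knnModel, SVM = svmModel, MARS = marsModel))
# summary(resamps)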

7.5B.

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

marsGrid = expand.grid(degree = 1:2, nprune = 2:38)

MARSModel = train(x = df.train.x, y = df.train.y, 
                  method = 'earth', 
                  tuneGrid = marsGrid, 
                  preProcess = c('center', 'scale'), 
                  trControl = trainControl(method='cv'))

print(varImp(MARSModel))
## earth variable importance
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   60.11
## ManufacturingProcess13   25.79
## ManufacturingProcess39    0.00

The optimal model is the MARS model. We further evaluate variable importance in this model using the earth method. The top four most important predictors are ManufacturingProcess32, 09, 13, and 39; all are process variables, so the process variables clearly dominate the list. Because the final MARS model retained only 4 of the 57 predictors, only four importances are reported rather than a top ten.
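To make the biological-versus-process count explicit, the importance table returned by varImp can be tallied directly (a minimal sketch):

# tally biological vs. process variables among the MARS importances
imp <- varImp(MARSModel)$importance
table(ifelse(grepl('^Biological', rownames(imp)), 'Biological', 'Process'))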

summary(MARSModel)
## Call: earth(x=data.frame[132,57], y=c(39.15,39.79,4...), keepxy=TRUE, degree=1,
##             nprune=5)
## 
##                                      coefficients
## (Intercept)                             39.654560
## h(0.708086-ManufacturingProcess09)      -0.815238
## h(-1.40448-ManufacturingProcess13)       3.063079
## h(ManufacturingProcess32- -0.817266)     1.152976
## h(0.153958-ManufacturingProcess39)      -0.271455
## 
## Selected 5 of 21 terms, and 4 of 57 predictors (nprune=5)
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 4 (additive model)
## GCV 1.153422    RSS 132.1979    GRSq 0.6278107    RSq 0.6718809
plot(MARSModel)

The first-degree RMSE is lowest, at about 1.5, for 1-14 terms, then jumps to 3.0 at 15 terms and above. The RMSE with the second-degree products also starts at about 1.4 for 1-14 terms, then jumps to around 3.5 from 14-20 terms, and stays near 5.0 thereafter.
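The chosen tuning values can also be read directly off the train object, which should agree with the summary above (degree = 1, nprune = 5 for this run):

# inspect the selected tuning parameters and the best cross-validated rows
MARSModel$bestTune
head(MARSModel$results[order(MARSModel$results$RMSE), ])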

7.5C

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

library(magrittr)  
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(corrplot)
## corrplot 0.84 loaded
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
plot(varImp(MARSModel), top=4)

df.train %>%
  # include Yield so the predictor-outcome correlations appear in the chart
  dplyr::select(Yield, ManufacturingProcess32, ManufacturingProcess09,
                ManufacturingProcess13, ManufacturingProcess39) %>%
  chart.Correlation(histogram = TRUE, pch = 19, method = 'pearson')

The Pearson correlations between the top four predictors and the outcome (Yield), together with the pairwise correlations among the predictors, are plotted above.

We can see that ManufacturingProcess32 and ManufacturingProcess09 show a positive linear relationship with Yield, while ManufacturingProcess39 shows little linear relationship with Yield at all.

There is a weaker, negative linear relationship between ManufacturingProcess13 and Yield.
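A formal test of any one of these correlations can be run directly on the training data (a minimal sketch using base R's cor.test):

# Pearson correlation test for one predictor-outcome pair
cor.test(df.train$Yield, df.train$ManufacturingProcess13, method = 'pearson')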

These plots are consistent with the fitted MARS model summarized above: the selected model is first-degree (additive), built from hinge functions of these four process variables rather than from interaction terms.

The same information is also plotted in the next section with the same four top predictors, as an alternative method for predictor-outcome visualization.

par(mfrow = c(2, 2))

# put Yield on the y-axis so the line from lm(Yield ~ predictor) matches the axes
plot(ChemicalManufacturingProcess$ManufacturingProcess32, ChemicalManufacturingProcess$Yield,
     xlab = "ManufacturingProcess32", ylab = "Yield")
abline(lm(ChemicalManufacturingProcess$Yield ~ ChemicalManufacturingProcess$ManufacturingProcess32), col = "red")

plot(ChemicalManufacturingProcess$ManufacturingProcess09, ChemicalManufacturingProcess$Yield,
     xlab = "ManufacturingProcess09", ylab = "Yield")
abline(lm(ChemicalManufacturingProcess$Yield ~ ChemicalManufacturingProcess$ManufacturingProcess09), col = "red")

plot(ChemicalManufacturingProcess$ManufacturingProcess13, ChemicalManufacturingProcess$Yield,
     xlab = "ManufacturingProcess13", ylab = "Yield")
abline(lm(ChemicalManufacturingProcess$Yield ~ ChemicalManufacturingProcess$ManufacturingProcess13), col = "red")

plot(ChemicalManufacturingProcess$ManufacturingProcess39, ChemicalManufacturingProcess$Yield,
     xlab = "ManufacturingProcess39", ylab = "Yield")
abline(lm(ChemicalManufacturingProcess$Yield ~ ChemicalManufacturingProcess$ManufacturingProcess39), col = "red")

The plots of the top four predictors against the outcome Yield are shown here. We can see that ManufacturingProcess13 is negatively associated with Yield, while ManufacturingProcess32 and ManufacturingProcess09 are positively predictive of Yield. No, the plots do not support the intuitive notion that the biological predictors are more indicative of yield than the manufacturing process variables; it is actually the contrary. Given that the manufacturing process is modifiable, these process variables offer a practical opportunity to improve yield.