Exercises 7.2 & 7.5 from the K&J book.
getwd()
## [1] "T:/00-624 HH Predictive Analytics/HMWK8 NonLinearRegresn/HMWK8 HH"
# clear the workspace
rm(list = ls())
library(mlbench)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(nnet)
library(ggplot2)
library(mice)
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x kernlab::alpha() masks ggplot2::alpha()
## x purrr::cross() masks kernlab::cross()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
## iteration of imputation of missing data
tmp.data <- mice(ChemicalManufacturingProcess,m=2,maxit=3,meth='pmm',seed=125)
##
## iter imp variable
## 1 1 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 ManufacturingProcess41
## 1 2 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 ManufacturingProcess41
## 2 1 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 ManufacturingProcess41
## 2 2 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 ManufacturingProcess41
## 3 1 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 ManufacturingProcess41
## 3 2 ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess14 ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess40 ManufacturingProcess41
## Warning: Number of logged events: 162
ChemicalManufacturingProcess = complete(tmp.data) ##176*58
• The data set comes from the AppliedPredictiveModeling package. It contains 176 observations and 58 variables: the outcome variable Yield, 12 variables describing the biological materials, and 45 variables describing the manufacturing process. I use the mice package for the imputation with the PMM (predictive mean matching) method; two imputations are generated, and within each imputation a maximum of 3 iterations is allowed.
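As a quick sanity check on the imputation (a minimal sketch), we can confirm that no missing values remain and that the dimensions are unchanged:
sum(is.na(ChemicalManufacturingProcess))  # expected: 0
dim(ChemicalManufacturingProcess)         # expected: 176 58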
# train test split
set.seed(200)
rows = nrow(ChemicalManufacturingProcess)
t.index <- sample(1:rows, size = round(0.75*rows), replace=FALSE)
df.train <- ChemicalManufacturingProcess[t.index ,]
df.test <- ChemicalManufacturingProcess[-t.index ,]
df.train.x = df.train[,-1]
df.train.y = df.train[,1]
df.test.x = df.test[,-1]
df.test.y = df.test[,1]
Now the data is complete and ready to be analyzed. The data is split into training and testing sets with a 75%/25% rule. Both contain the original 58 variables, i.e., 57 explanatory variables and one outcome variable. The training data set has 132 observations and the testing data set has 44 observations.
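As an aside, caret's createDataPartition is a common alternative that stratifies the split on the outcome; a minimal sketch, not used for the results below:
set.seed(200)
in.train <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.75, list = FALSE)
df.train.strat <- ChemicalManufacturingProcess[in.train, ]
df.test.strat  <- ChemicalManufacturingProcess[-in.train, ]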
model.eval = function(modelmethod, gridSearch = NULL)
{
  # train the given caret method with centering/scaling and cross-validation
  Model = train(x = df.train.x,
                y = df.train.y,
                method = modelmethod,
                tuneGrid = gridSearch,
                preProcess = c('center', 'scale'),
                trControl = trainControl(method='cv'))
  # evaluate on the held-out test set
  Pred = predict(Model, newdata = df.test.x)
  modelperf = postResample(Pred, df.test.y)
  print(modelperf)
}
The function model.eval is defined so that a common training pipeline can be applied across models. It takes the caret method name and an optional tuning grid, uses the predefined training X and Y, centers and scales the predictors, and uses cross-validation as the training control. Test-set performance is then computed with postResample.
Next, we apply this function to k-nearest neighbors, support vector machines, and multivariate adaptive regression splines (MARS).
perfknn = model.eval('knn')
## RMSE Rsquared MAE
## 1.4998718 0.4976055 1.3024472
perfsvm = model.eval('svmRadial')
## RMSE Rsquared MAE
## 1.2526982 0.6560941 1.0771329
marsGrid = expand.grid(degree = 1:2, nprune = 2:38)
perfmars = model.eval('earth', marsGrid)
## RMSE Rsquared MAE
## 1.2103929 0.6674794 0.9872408
We then apply a neural network model to these data. Since nnet's syntax is quite different from the rest, we do not use the wrapper function for it. We chose five hidden units, a maximum of 500 iterations, and decay values of 0, 0.01, and 0.1; note, however, that nnet accepts a single decay value, so this vector does not perform a grid search (a caret-based sketch that does tune these settings follows the test-set evaluation below).
# note: linout = TRUE is needed for regression with nnet; without it the
# logistic output unit bounds predictions in (0, 1), and with Yield near 40
# this produces the very large test RMSE reported below
nnetFit <- nnet(
x=df.train[,-1],
y = df.train$Yield,
size=5,
decay=c(0,0.01,.1),  # nnet expects a single decay value; tuning a grid requires caret
trace=FALSE,
maxit=500,
MaxNWts=5*(ncol(df.train[,-1])+1)+5+1)  # size*(p+1) + size + 1 weights
nnetFit
## a 57-5-1 network with 296 weights
## options were -
nnetPred <- predict (nnetFit, newdata=df.test[,-1])
perfnn <- postResample(pred=nnetPred, obs=df.test$Yield)
RMSE_nn = perfnn[1]
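For reference, here is a hedged sketch of how nnet could instead be tuned through caret's train, with linout = TRUE so the output unit is linear (appropriate for regression); the size and decay grid values are illustrative assumptions:
nnetGrid <- expand.grid(size = c(3, 5, 7), decay = c(0, 0.01, 0.1))
nnetTuned <- train(x = df.train.x, y = df.train.y,
                   method = 'nnet',
                   tuneGrid = nnetGrid,
                   preProcess = c('center', 'scale'),
                   trControl = trainControl(method = 'cv'),
                   linout = TRUE,   # linear output unit for regression
                   trace = FALSE, maxit = 500,
                   MaxNWts = 7 * (ncol(df.train.x) + 1) + 7 + 1)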
library(kableExtra)
df.perf_all4 = rbind(data.frame(Name = 'KNN', RMSE = perfknn[1]),
data.frame(Name = 'SVM', RMSE = perfsvm[1]),
data.frame(Name = 'MARS', RMSE = perfmars[1]),
data.frame(Name= 'NN', RMSE = RMSE_nn ))
kable (df.perf_all4, caption='Model Comparisons')
Name | RMSE
---|---
KNN | 1.499872
SVM | 1.252698
MARS | 1.210393
NN | 39.394649
Performance evaluation in terms of RMSE clearly shows that the neural network has a huge RMSE, while the other three methods have similar RMSEs of around 1.2 to 1.5. The neural network's inflated error is expected here: without linout = TRUE, nnet's logistic output unit bounds predictions in (0, 1), far below the observed Yield values near 40.
library(tidyverse)
df.perf.filtered <-
df.perf_all4 %>%
slice (1,2,3)
ggplot(data = df.perf.filtered ,
aes(x = Name, y = RMSE, fill=Name)) +
geom_bar(stat="identity", position=position_dodge()) +
geom_text(aes(label=RMSE), vjust=1, color="white",
position = position_dodge(0.9), size=3.5)
Which nonlinear regression model gives the optimal resampling and test set performance?
The RMSEs of the three comparable models are plotted in the bar chart above. MARS has the best performance, i.e., the lowest RMSE value.
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
marsGrid = expand.grid(degree = 1:2, nprune = 2:38)
MARSModel = train(x = df.train.x, y = df.train.y,
method = 'earth',
tuneGrid = marsGrid,
preProcess = c('center', 'scale'),
trControl = trainControl(method='cv'))
print(varImp(MARSModel))
## earth variable importance
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 60.11
## ManufacturingProcess13 25.79
## ManufacturingProcess39 0.00
The optimal model is the MARS model. We further evaluated the variable importance in this model, using the earth method. From that model we see that the top four most important predictors are ManufacturingProcess32, 09, 13, and 39. All four are manufacturing process variables, so the process variables clearly dominate the list; a sketch of the comparison with the linear model follows.
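The exercise also asks how the top ten important predictors compare with those from the optimal linear model of Exercise 6.3. A minimal sketch of that comparison, assuming an elastic net via method = 'glmnet' stands in for the earlier linear model (the method actually used in 6.3 may have differed):
linModel <- train(x = df.train.x, y = df.train.y,
                  method = 'glmnet',
                  preProcess = c('center', 'scale'),
                  trControl = trainControl(method = 'cv'))
plot(varImp(linModel), top = 10)  # compare side by side with varImp(MARSModel)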
summary(MARSModel)
## Call: earth(x=data.frame[132,57], y=c(39.15,39.79,4...), keepxy=TRUE, degree=1,
## nprune=5)
##
## coefficients
## (Intercept) 39.654560
## h(0.708086-ManufacturingProcess09) -0.815238
## h(-1.40448-ManufacturingProcess13) 3.063079
## h(ManufacturingProcess32- -0.817266) 1.152976
## h(0.153958-ManufacturingProcess39) -0.271455
##
## Selected 5 of 21 terms, and 4 of 57 predictors (nprune=5)
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 4 (additive model)
## GCV 1.153422 RSS 132.1979 GRSq 0.6278107 RSq 0.6718809
plot(MARSModel)
The first-degree RMSE is lowest, at around 1.5, for 1-14 terms, then jumps to about 3.0 at 15 terms and above. The RMSE with the second-degree products also starts at around 1.4 for 1-14 terms, then jumps to around 3.5 for 14-20 terms, and stays near 5.0 thereafter.
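The tuning parameters caret selected can be read directly off the fitted object; per the model summary above, a first-degree (additive) model with nprune = 5 was chosen:
MARSModel$bestTune  # expected: nprune = 5, degree = 1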
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(corrplot)
## corrplot 0.84 loaded
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
plot(varImp(MARSModel), top=4)
df.train %>%
  # include Yield so the chart shows predictor-outcome correlations as well
  dplyr::select(Yield, ManufacturingProcess32, ManufacturingProcess09,
                ManufacturingProcess13, ManufacturingProcess39) %>%
  chart.Correlation(histogram=TRUE, pch=19, method = 'pearson')
The Pearson correlations between the top four predictors and the outcome (Yield) are plotted above. ManufacturingProcess32 and ManufacturingProcess09 show positive correlations with Yield, ManufacturingProcess13 shows a negative correlation, and ManufacturingProcess39 shows little linear relationship to Yield at all.
Note that the final MARS model selected by caret is first-degree (degree = 1, an additive model, per the summary above), so these predictors enter through hinge functions of single variables rather than interaction terms.
The same four predictor-outcome relationships are plotted again below, as an alternative method for predictor-outcome visualization.
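As a numeric complement to these plots, a minimal sketch computing the Pearson correlation of each top predictor with Yield:
top4 <- c('ManufacturingProcess32', 'ManufacturingProcess09',
          'ManufacturingProcess13', 'ManufacturingProcess39')
cor(ChemicalManufacturingProcess[, top4], ChemicalManufacturingProcess$Yield)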
par(mfrow = c(2, 2))
# scatterplot of each top predictor (x-axis) against Yield (y-axis);
# putting the predictor on the x-axis makes the least-squares line from
# lm(Yield ~ predictor) match the plot orientation
for (v in c("ManufacturingProcess32", "ManufacturingProcess09",
            "ManufacturingProcess13", "ManufacturingProcess39")) {
  plot(ChemicalManufacturingProcess[[v]], ChemicalManufacturingProcess$Yield,
       xlab = v, ylab = "Yield")
  abline(lm(ChemicalManufacturingProcess$Yield ~ ChemicalManufacturingProcess[[v]]),
         col = "red")
}
The top four predictors are plotted against the outcome Yield here. We can see that ManufacturingProcess13 is negatively associated with Yield, while ManufacturingProcess32 and ManufacturingProcess09 are positively predictive of Yield. No, the plots do not support the intuitive notion that the biological predictors are more indicative of yield than the manufacturing process variables; it is actually the contrary. Given that the manufacturing process is modifiable, we feel that these process steps are the natural levers for improving yield.