Source Code: https://github.com/djlofland/DATA624_PredictiveAnalytics/tree/master/Homework_8

Problem 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\] where the x values are random variables uniformly distributed on [0, 1] (there are also five other non-informative variables created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

Load Data

# Libraries used below (assumed loaded here or in a setup chunk)
library(mlbench)     # mlbench.friedman1
library(caret)       # train, featurePlot, varImp, postResample
library(doParallel)  # makePSOCKcluster, registerDoParallel

set.seed(200)

trainingData <- mlbench.friedman1(200, sd = 1)

## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)

## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

KNN

cl <- makePSOCKcluster(5)
registerDoParallel(cl)

knnModel <- train(x = trainingData$x, y = trainingData$y,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10)

stopCluster(cl)

# Model results
plot(knnModel)

summary(knnModel)
##             Length Class      Mode     
## learn        2     -none-     list     
## k            1     -none-     numeric  
## theDots      0     -none-     list     
## xNames      10     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    1     -none-     logical  
## param        0     -none-     list
varImp(knnModel)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000
knnPred <- predict(knnModel, newdata = testData$x)

## The function 'postResample' can be used to get the test set
## performance values
(knnPerf <- postResample(pred = knnPred, obs = testData$y))
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

NNET

cl <- makePSOCKcluster(5)
registerDoParallel(cl)

## Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(
  .decay = c(0, 0.01, .1),
  .size = c(1:10),
  ## The next option is to use bagging (see the
  ## next chapter) instead of different random
  ## seeds.
  .bag = FALSE)

ctrl <- trainControl(method='cv', number=10)

nnetModel <- train(trainingData$x, trainingData$y,
  method = "avNNet",
  tuneGrid = nnetGrid,
  trControl = ctrl,
  linout = TRUE,
  trace = FALSE,
  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
  maxit = 500)

stopCluster(cl)

# Model results
plot(nnetModel)

summary(nnetModel)
##             Length Class      Mode     
## model        5     -none-     list     
## repeats      1     -none-     numeric  
## bag          1     -none-     logical  
## seeds        5     -none-     numeric  
## names       10     -none-     character
## terms        3     terms      call     
## coefnames   10     -none-     character
## xlevels      0     -none-     list     
## xNames      10     -none-     character
## problemType  1     -none-     character
## tuneValue    3     data.frame list     
## obsLevels    1     -none-     logical  
## param        4     -none-     list
varImp(nnetModel)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000
nnetPred <- predict(nnetModel, newdata = testData$x)

## The function 'postResample' can be used to get the test set
## performance values
(nnetPerf <- postResample(pred = nnetPred, obs = testData$y))
##      RMSE  Rsquared       MAE 
## 1.6644075 0.8905623 1.2423150

SVM

cl <- makePSOCKcluster(5)
registerDoParallel(cl)

svmModel <- train(trainingData$x, trainingData$y,
  method = "svmRadial",  # svmRadial, svmLinear, svmPoly
  tuneLength = 14,
  trControl = trainControl(method = "cv"))

stopCluster(cl)

# Model results
plot(svmModel)

summary(svmModel)
## Length  Class   Mode 
##      1   ksvm     S4
varImp(svmModel)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000
svmPred <- predict(svmModel, newdata = testData$x)

## The function 'postResample' can be used to get the test set
## performance values
(svmPerf <- postResample(pred = svmPred, obs = testData$y))
##      RMSE  Rsquared       MAE 
## 2.0893300 0.8232353 1.5877434

MARS

cl <- makePSOCKcluster(5)
registerDoParallel(cl)

# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

marsModel <- train(trainingData$x, trainingData$y,
  method = "earth",
  tuneGrid = marsGrid,
  trControl = trainControl(method = "cv"))

stopCluster(cl)

# Model results
plot(marsModel)

summary(marsModel)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
##             nprune=14)
## 
##                                 coefficients
## (Intercept)                        21.905319
## h(0.621722-X1)                    -15.726181
## h(X1-0.621722)                      9.234027
## h(0.601063-X2)                    -18.253527
## h(X2-0.601063)                     10.448545
## h(0.447442-X3)                      9.700589
## h(X3-0.606015)                     12.674694
## h(0.734892-X4)                     -9.863526
## h(X4-0.734892)                     10.297964
## h(0.850094-X5)                     -5.324175
## h(0.218266-X1) * h(X2-0.601063)   -55.358726
## h(X1-0.218266) * h(X2-0.601063)   -29.100250
## h(X1-0.621722) * h(X2-0.295997)   -26.833129
## h(0.649253-X1) * h(0.601063-X2)    27.120721
## 
## Selected 14 of 18 terms, and 5 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.62945    RSS 225.8601    GRSq 0.9338437    RSq 0.953688
varImp(marsModel)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.24
## X2   48.74
## X5   15.53
## X3    0.00
marsPred <- predict(marsModel, newdata = testData$x)

## The function 'postResample' can be used to get the test set
## performance values
(marsPerf <- postResample(pred = marsPred, obs = testData$y))
##      RMSE  Rsquared       MAE 
## 1.1722635 0.9448890 0.9324923

Results

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

(models_summary <- rbind(nnetPerf, marsPerf, svmPerf, knnPerf))
##              RMSE  Rsquared       MAE
## nnetPerf 1.664408 0.8905623 1.2423150
## marsPerf 1.172263 0.9448890 0.9324923
## svmPerf  2.089330 0.8232353 1.5877434
## knnPerf  3.204059 0.6819919 2.5683461

MARS gives the best test set performance (RMSE 1.17, \(R^2\) 0.94), followed by the neural network, then SVM, with KNN last. MARS also selects exactly the informative predictors: its importance list contains only X1–X5, with X6–X10 unused. For KNN, NNET, and SVM, varImp() falls back to the same model-free loess r-squared measure, so all three report an identical top five (X4, X1, X2, X5, X3); MARS's model-specific importance differs only in ranking X1 first and X4 second.
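
To make the importance comparison concrete, here is a minimal sketch (assuming the four fitted train objects above are still in scope) that tabulates each model's top five predictors by varImp():

# Pull the top-5 predictor names (by caret::varImp) from each tuned model
top5 <- function(model) {
  imp <- varImp(model)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:5]
}
sapply(list(KNN = knnModel, NNET = nnetModel,
            SVM = svmModel, MARS = marsModel), top5)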

Problem 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Load & Clean Data

# NOTE: Code copied from my Homework #7
# Additional libraries assumed loaded: AppliedPredictiveModeling
# (ChemicalManufacturingProcess), naniar (miss_var_summary, gg_miss_*),
# bnstruct (knn.impute), outlieR (outlier imputation)
data("ChemicalManufacturingProcess")

cmp <- as_tibble(ChemicalManufacturingProcess)

x_raw <- cmp[,2:58]
y_raw <- as.matrix(cmp$Yield)

print(paste(nrow(x_raw), ncol(x_raw)))
## [1] "176 57"
print(paste(nrow(y_raw), ncol(y_raw)))
## [1] "176 1"
# Various NA plots to inspect data
knitr::kable(miss_var_summary(cmp) %>% filter(n_miss > 0), 
             caption = 'Missing Values',
             format="html", 
             table.attr="style='width:50%;'") %>% 
  kableExtra::kable_styling()
Missing Values

variable                 n_miss   pct_miss
ManufacturingProcess03       15  8.5227273
ManufacturingProcess11       10  5.6818182
ManufacturingProcess10        9  5.1136364
ManufacturingProcess25        5  2.8409091
ManufacturingProcess26        5  2.8409091
ManufacturingProcess27        5  2.8409091
ManufacturingProcess28        5  2.8409091
ManufacturingProcess29        5  2.8409091
ManufacturingProcess30        5  2.8409091
ManufacturingProcess31        5  2.8409091
ManufacturingProcess33        5  2.8409091
ManufacturingProcess34        5  2.8409091
ManufacturingProcess35        5  2.8409091
ManufacturingProcess36        5  2.8409091
ManufacturingProcess02        3  1.7045455
ManufacturingProcess06        2  1.1363636
ManufacturingProcess01        1  0.5681818
ManufacturingProcess04        1  0.5681818
ManufacturingProcess05        1  0.5681818
ManufacturingProcess07        1  0.5681818
ManufacturingProcess08        1  0.5681818
ManufacturingProcess12        1  0.5681818
ManufacturingProcess14        1  0.5681818
ManufacturingProcess22        1  0.5681818
ManufacturingProcess23        1  0.5681818
ManufacturingProcess24        1  0.5681818
ManufacturingProcess40        1  0.5681818
ManufacturingProcess41        1  0.5681818
gg_miss_var(cmp)

gg_miss_upset(cmp)

# Impute missing using KNN
x_imputed <- knn.impute(as.matrix(x_raw), k=10)

# Check for columns with little variance - candidate to drop
lowVariance <- nearZeroVar(x_imputed, names = TRUE)
head(lowVariance)
## [1] "BiologicalMaterial07"
lowVariance <- nearZeroVar(x_imputed)

x_lowvar <- x_imputed[,-lowVariance]

# Deal with outliers ... impute to median
x_outliers <- outlieR::impute(x_lowvar, fill='median')

# Find and drop high correlation features
correlations <- cor(x_outliers)
highCorr <- findCorrelation(correlations, names=TRUE, cutoff=0.9)
(highCorr)
## [1] "BiologicalMaterial02"   "ManufacturingProcess26" "BiologicalMaterial04"  
## [4] "BiologicalMaterial12"   "ManufacturingProcess11" "ManufacturingProcess27"
## [7] "ManufacturingProcess44" "ManufacturingProcess40"
highCorr <- findCorrelation(correlations, cutoff=0.9)
x_corr <- x_outliers[,-highCorr]

# Transform our data (center, scale, BoxCox)
x_transf <- preProcess(x_corr, method=c('center', 'scale', 'BoxCox'))
x_transf <- predict(x_transf, x_corr)

Split Training/Testing:

# get training/test split
trainingRows <- createDataPartition(y_raw, p=0.8, list=FALSE)

# Build training datasets
trainX <- x_transf[trainingRows,]
trainY <- y_raw[trainingRows]

# put remaining rows into the test sets
testX <- x_transf[-trainingRows,]
testY <- y_raw[-trainingRows]

# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$Yield <- trainY

Part (A)

  1. Which nonlinear regression model gives the optimal resampling and test set performance?

Neural Net

set.seed(42424)

cl <- makePSOCKcluster(5)
registerDoParallel(cl)

## Create a specific candidate set of models to evaluate:
nnetGrid <- expand.grid(
  .decay = c(0, 0.01, .1),
  .size = c(1:10),
  ## The next option is to use bagging (see the
  ## next chapter) instead of different random
  ## seeds.
  .bag = FALSE)

ctrl <- trainControl(method='cv', number=10)

nnetTune <- train(trainX, trainY,
  method = "avNNet",
  tuneGrid = nnetGrid,
  trControl = ctrl,
  linout = TRUE,
  trace = FALSE,
  MaxNWts = 10 * (ncol(trainX) + 1) + 10 + 1,
  maxit = 500)

stopCluster(cl)

# Model results
plot(nnetTune)

summary(nnetTune)
##             Length Class      Mode     
## model        5     -none-     list     
## repeats      1     -none-     numeric  
## bag          1     -none-     logical  
## seeds        5     -none-     numeric  
## names       48     -none-     character
## terms        3     terms      call     
## coefnames   48     -none-     character
## xlevels      0     -none-     list     
## xNames      48     -none-     character
## problemType  1     -none-     character
## tuneValue    3     data.frame list     
## obsLevels    1     -none-     logical  
## param        4     -none-     list
varImp(nnetTune)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 48)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     89.09
## ManufacturingProcess13   79.04
## BiologicalMaterial03     77.63
## ManufacturingProcess09   75.31
## ManufacturingProcess31   74.87
## ManufacturingProcess36   72.14
## ManufacturingProcess17   69.93
## ManufacturingProcess06   62.10
## ManufacturingProcess30   55.24
## ManufacturingProcess33   53.74
## BiologicalMaterial11     46.95
## BiologicalMaterial01     40.10
## ManufacturingProcess18   38.18
## BiologicalMaterial09     36.04
## ManufacturingProcess25   35.93
## BiologicalMaterial08     35.81
## ManufacturingProcess20   34.31
## ManufacturingProcess29   34.04
## ManufacturingProcess15   33.15
modelPred <- predict(nnetTune, newdata=testX)

modelValues <- data.frame(obs = testY, pred=modelPred)
colnames(modelValues) = c('obs', 'pred')
(nnet_values <- defaultSummary(modelValues))
##      RMSE  Rsquared       MAE 
## 1.2146853 0.6426541 0.8962163

MARS

set.seed(42424)

# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

cl <- makePSOCKcluster(5)
registerDoParallel(cl)

marsTune <- train(trainX, trainY,
  method = "earth",
  tuneGrid = marsGrid,
  trControl = trainControl(method = "cv"))

stopCluster(cl)

# Model Results
plot(marsTune)

summary(marsTune)
## Call: earth(x=matrix[144,48], y=c(42.44,42.03,4...), keepxy=TRUE, degree=1,
##             nprune=3)
## 
##                                      coefficients
## (Intercept)                             39.647909
## h(0.553668-ManufacturingProcess09)      -1.037195
## h(ManufacturingProcess32- -0.813021)     1.323398
## 
## Selected 3 of 21 terms, and 2 of 48 predictors (nprune=3)
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 2 (additive model)
## GCV 1.369288    RSS 183.7224    GRSq 0.5808792    RSq 0.6039985
varImp(marsTune)
## earth variable importance
## 
##                        Overall
## ManufacturingProcess32     100
## ManufacturingProcess09       0
modelPred <- predict(marsTune, newdata=testX)

modelValues <- data.frame(obs = testY, pred=modelPred)
colnames(modelValues) = c('obs', 'pred')
(mars_values <- defaultSummary(modelValues))
##      RMSE  Rsquared       MAE 
## 1.3016929 0.5974956 1.0242889

SVM

set.seed(42424)

cl <- makePSOCKcluster(5)
registerDoParallel(cl)

svmRTune <- train(trainX, trainY,
  method = "svmRadial",  # svmRadial, svmLinear, svmPoly
  tuneLength = 14,
  trControl = trainControl(method = "cv"))

stopCluster(cl)

# Model Results
plot(svmRTune)

summary(svmRTune)
## Length  Class   Mode 
##      1   ksvm     S4
varImp(svmRTune)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 48)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     89.09
## ManufacturingProcess13   79.04
## BiologicalMaterial03     77.63
## ManufacturingProcess09   75.31
## ManufacturingProcess31   74.87
## ManufacturingProcess36   72.14
## ManufacturingProcess17   69.93
## ManufacturingProcess06   62.10
## ManufacturingProcess30   55.24
## ManufacturingProcess33   53.74
## BiologicalMaterial11     46.95
## BiologicalMaterial01     40.10
## ManufacturingProcess18   38.18
## BiologicalMaterial09     36.04
## ManufacturingProcess25   35.93
## BiologicalMaterial08     35.81
## ManufacturingProcess20   34.31
## ManufacturingProcess29   34.04
## ManufacturingProcess15   33.15
modelPred <- predict(svmRTune, newdata=testX)

modelValues <- data.frame(obs = testY, pred=modelPred)
colnames(modelValues) = c('obs', 'pred')
(svm_values <- defaultSummary(modelValues))
##     RMSE Rsquared      MAE 
## 1.146482 0.713360 0.859768

KNN

set.seed(42424)

cl <- makePSOCKcluster(5)
registerDoParallel(cl)

knnTune <- train(trainX, trainY,
  method = "knn",
  # Center and scaling will occur for new predictions too
  preProc = c("center", "scale"),
  tuneGrid = data.frame(.k = 1:20),
  trControl = trainControl(method = "cv"))

stopCluster(cl)

# Model Results
plot(knnTune)

summary(knnTune)
##             Length Class      Mode     
## learn        2     -none-     list     
## k            1     -none-     numeric  
## theDots      0     -none-     list     
## xNames      48     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    1     -none-     logical  
## param        0     -none-     list
varImp(knnTune)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 48)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     89.09
## ManufacturingProcess13   79.04
## BiologicalMaterial03     77.63
## ManufacturingProcess09   75.31
## ManufacturingProcess31   74.87
## ManufacturingProcess36   72.14
## ManufacturingProcess17   69.93
## ManufacturingProcess06   62.10
## ManufacturingProcess30   55.24
## ManufacturingProcess33   53.74
## BiologicalMaterial11     46.95
## BiologicalMaterial01     40.10
## ManufacturingProcess18   38.18
## BiologicalMaterial09     36.04
## ManufacturingProcess25   35.93
## BiologicalMaterial08     35.81
## ManufacturingProcess20   34.31
## ManufacturingProcess29   34.04
## ManufacturingProcess15   33.15
modelPred <- predict(knnTune, newdata=testX)

modelValues <- data.frame(obs = testY, pred=modelPred)
colnames(modelValues) = c('obs', 'pred')
(knn_values <- defaultSummary(modelValues))
##      RMSE  Rsquared       MAE 
## 1.4621971 0.4875616 1.0596875

Models Summary

(models_summary <- rbind(nnet_values, mars_values, svm_values, knn_values))
##                 RMSE  Rsquared       MAE
## nnet_values 1.214685 0.6426541 0.8962163
## mars_values 1.301693 0.5974956 1.0242889
## svm_values  1.146482 0.7133600 0.8597680
## knn_values  1.462197 0.4875616 1.0596875

SVM provides the best RMSE and \(R^2\) of the four tuned models (NNET, MARS, SVM, and KNN) on the test set, with the neural network a close second and KNN trailing on every metric.
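
With only 32 test set observations, the holdout metrics are noisy; below is a sketch for also comparing the cross-validated (resampling) performance of the four tuned models, assuming the train objects above are still in scope:

# Collect each model's 10-fold CV results and compare them directly
resamps <- resamples(list(NNET = nnetTune, MARS = marsTune,
                          SVM = svmRTune, KNN = knnTune))
summary(resamps)                   # resampled RMSE / Rsquared / MAE
bwplot(resamps, metric = "RMSE")   # side-by-side box plots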

Part (B)

  1. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

ManufacturingProcess* features dominate the top of the importance list for every model. The top 10 predictors are identical for NNET, SVM, and KNN, since caret's varImp() falls back to the same model-free loess r-squared measure for all three; only MARS, which retained just two predictors (ManufacturingProcess32 and ManufacturingProcess09), produced a different list. ManufacturingProcess32 tops every list, ManufacturingProcess09 appears in all four, and ManufacturingProcess13 and BiologicalMaterial06 also rank in the top 10 for NNET, SVM, and KNN.

Referring back to Problem 6.3 in Homework #7, these same four features also ranked at the top by varImp(). However, the nonlinear models improved on both the RMSE and \(R^2\) of their linear counterparts.
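
For a direct side-by-side with Homework #7, something like the sketch below could be used; linearTune is hypothetical here and stands in for the tuned linear model (e.g., PLS) train object from that assignment:

# Hypothetical: compare top-10 varImp predictors, nonlinear vs. linear.
# `linearTune` is a placeholder for the Homework #7 train object.
top10 <- function(model) {
  imp <- varImp(model)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}
data.frame(SVM = top10(svmRTune), Linear = top10(linearTune))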

Part (C)

  1. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

df <- cmp %>% 
  dplyr::select(c('ManufacturingProcess32', 'ManufacturingProcess13', 'ManufacturingProcess09', 'BiologicalMaterial06', 'Yield'))


caret::featurePlot(x=df['ManufacturingProcess32'], y=df['Yield'], plot="pairs", pch=20)

caret::featurePlot(x=df['ManufacturingProcess13'], y=df['Yield'], plot="pairs", pch=20)

caret::featurePlot(x=df['ManufacturingProcess09'], y=df['Yield'], plot="pairs", pch=20)

caret::featurePlot(x=df['BiologicalMaterial06'], y=df['Yield'], plot="pairs", pch=20)

Since we don’t actually know what each process physically maps to in the real world, it’s hard to apply domain intuition; there is no way to know how ManufacturingProcess32 differs from ManufacturingProcess09. That said, we do see clear patterns between the features and Yield, which we would expect given that these were shown to be the top predictors. Had we seen no visual relationship, a signal might still have been detectable in the math, but it would have been concerning. ManufacturingProcess32, ManufacturingProcess09, and BiologicalMaterial06 show Yield increasing as the feature increases, while ManufacturingProcess13 shows Yield decreasing as the feature increases.
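
As a rough numeric check on these visual trends, the direction of each relationship can be quantified with simple rank correlations (a minimal sketch; df is the data frame built above, and pairwise-complete observations tolerate the raw data's missing values):

# Spearman correlation of each top predictor with Yield;
# the sign should match the direction seen in the plots
feats <- c('ManufacturingProcess32', 'ManufacturingProcess13',
           'ManufacturingProcess09', 'BiologicalMaterial06')
sapply(feats, function(f) cor(df[[f]], df$Yield, method = 'spearman',
                              use = 'pairwise.complete.obs'))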