Introduction

In this homework assignment I work through exercises 7.2 and 7.5 from Kuhn and Johnson's Applied Predictive Modeling.

Question 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

y = 10 sin(πx1x2) + 20(x3 − 0.5)^2 + 10x4 + 5x5 + N(0, σ^2)

where the x values are random variables uniformly distributed on [0, 1] (five additional non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
## featurePlot(trainingData$x, trainingData$y)
## or other methods.

This creates a list with a vector ‘y’ and a matrix of predictors ‘x’. Also simulate a large test set to estimate the true error rate with good precision:

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
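To make the simulated structure concrete, the response can also be generated by hand from the equation above. This is a minimal sketch (friedman1_by_hand is a hypothetical helper for illustration; mlbench.friedman1 remains the authoritative implementation):

friedman1_by_hand <- function(n, sd = 1) {
  # 10 uniform predictors; only X1-X5 enter the response, X6-X10 are noise
  x <- matrix(runif(n * 10), ncol = 10,
              dimnames = list(NULL, paste0("X", 1:10)))
  y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5] + rnorm(n, sd = sd)
  list(x = data.frame(x), y = y)
}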

Tune several models on these data. For example:

KNN Model

knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center","scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
plot(knnModel)

knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

NN Model

# Checking for highly correlated predictors

tooHigh <- findCorrelation(cor(trainingData$x), cutoff = 0.75)
tooHigh
## integer(0)
# Creating a tuning grid

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))

# Using a 10-fold cross-validation

ctrl <- trainControl(method = "cv", number = 10)

set.seed(123)

# Tuning Model

nnetTune <- train(trainingData$x, trainingData$y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)
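A note on MaxNWts: a single-hidden-layer nnet with P predictors and H hidden units estimates H(P + 1) weights into the hidden layer plus H + 1 for the output, so the largest grid setting here (H = 10, P = 10) needs 10 * 11 + 10 + 1 = 121 weights, which is exactly the cap set above.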

nnetTune
## Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE     
##   0.00    1     2.428469  0.7653147  1.881622
##   0.00    2     2.658319  0.7176523  2.123341
##   0.00    3     2.439662  0.7576737  1.909810
##   0.00    4     2.395450  0.7718850  1.870523
##   0.00    5     3.158842  0.6600032  2.306494
##   0.00    6     5.114760  0.5927134  2.878301
##   0.00    7     3.998873  0.5863680  2.807368
##   0.00    8    13.038224  0.3380882  6.216311
##   0.00    9     3.418244  0.6064786  2.750387
##   0.00   10    14.282253  0.3903329  5.817233
##   0.01    1     2.428178  0.7651061  1.880566
##   0.01    2     2.628711  0.7352039  2.081463
##   0.01    3     2.500025  0.7465002  1.952391
##   0.01    4     2.487535  0.7612713  1.893652
##   0.01    5     2.627321  0.7308586  2.175922
##   0.01    6     2.779170  0.7162527  2.154943
##   0.01    7     2.974175  0.6754817  2.306175
##   0.01    8     3.029658  0.6786905  2.368330
##   0.01    9     3.530934  0.6095407  2.864381
##   0.01   10     3.427628  0.6325805  2.755656
##   0.10    1     2.441856  0.7622291  1.892003
##   0.10    2     2.571867  0.7378116  1.967507
##   0.10    3     2.209160  0.8028016  1.797465
##   0.10    4     2.384313  0.7706745  1.946165
##   0.10    5     2.703109  0.7134765  2.109896
##   0.10    6     2.691025  0.7343891  2.178368
##   0.10    7     2.807927  0.7008760  2.202846
##   0.10    8     2.906717  0.7081779  2.361708
##   0.10    9     3.246571  0.6386602  2.586619
##   0.10   10     3.239180  0.6467128  2.555378
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.1.
nnPred <- predict(nnetTune, testData$x)
postResample(nnPred, testData$y)
##      RMSE  Rsquared       MAE 
## 2.4763089 0.7565333 1.8564599

MARS Model

# Creating a tuning grid

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
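In caret's earth method, .degree sets the maximum interaction order of the hinge functions (1 = additive, 2 allows two-way products) and .nprune caps the number of terms retained after the pruning pass.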

set.seed(123)

# Tuning Model

library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsTune <- train(trainingData$x, trainingData$y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

marsTune
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.311247  0.2748122  3.603533
##   1        3      3.531005  0.5107259  2.857560
##   1        4      2.609132  0.7291471  2.109945
##   1        5      2.234494  0.8007350  1.788244
##   1        6      2.279819  0.7999357  1.803273
##   1        7      1.792708  0.8748522  1.398541
##   1        8      1.710582  0.8857656  1.323419
##   1        9      1.662155  0.8892531  1.291466
##   1       10      1.706154  0.8823897  1.306578
##   1       11      1.743116  0.8739494  1.360521
##   1       12      1.740790  0.8734421  1.357507
##   1       13      1.703492  0.8788657  1.326192
##   1       14      1.700604  0.8791430  1.324716
##   1       15      1.692444  0.8801290  1.317804
##   1       16      1.692444  0.8801290  1.317804
##   1       17      1.692444  0.8801290  1.317804
##   1       18      1.692444  0.8801290  1.317804
##   1       19      1.692444  0.8801290  1.317804
##   1       20      1.692444  0.8801290  1.317804
##   1       21      1.692444  0.8801290  1.317804
##   1       22      1.692444  0.8801290  1.317804
##   1       23      1.692444  0.8801290  1.317804
##   1       24      1.692444  0.8801290  1.317804
##   1       25      1.692444  0.8801290  1.317804
##   1       26      1.692444  0.8801290  1.317804
##   1       27      1.692444  0.8801290  1.317804
##   1       28      1.692444  0.8801290  1.317804
##   1       29      1.692444  0.8801290  1.317804
##   1       30      1.692444  0.8801290  1.317804
##   1       31      1.692444  0.8801290  1.317804
##   1       32      1.692444  0.8801290  1.317804
##   1       33      1.692444  0.8801290  1.317804
##   1       34      1.692444  0.8801290  1.317804
##   1       35      1.692444  0.8801290  1.317804
##   1       36      1.692444  0.8801290  1.317804
##   1       37      1.692444  0.8801290  1.317804
##   1       38      1.692444  0.8801290  1.317804
##   2        2      4.311247  0.2748122  3.603533
##   2        3      3.531005  0.5107259  2.857560
##   2        4      2.609132  0.7291471  2.109945
##   2        5      2.243508  0.7985944  1.788189
##   2        6      2.236723  0.7987764  1.770156
##   2        7      1.815177  0.8693557  1.425563
##   2        8      1.699050  0.8834064  1.317662
##   2        9      1.487692  0.9084049  1.182061
##   2       10      1.469496  0.9053535  1.160443
##   2       11      1.392318  0.9178210  1.085187
##   2       12      1.302695  0.9312685  1.032827
##   2       13      1.293800  0.9331208  1.033397
##   2       14      1.265082  0.9371588  1.012795
##   2       15      1.275804  0.9351561  1.019457
##   2       16      1.288843  0.9335588  1.031360
##   2       17      1.296439  0.9327583  1.035093
##   2       18      1.296439  0.9327583  1.035093
##   2       19      1.296439  0.9327583  1.035093
##   2       20      1.296439  0.9327583  1.035093
##   2       21      1.296439  0.9327583  1.035093
##   2       22      1.296439  0.9327583  1.035093
##   2       23      1.296439  0.9327583  1.035093
##   2       24      1.296439  0.9327583  1.035093
##   2       25      1.296439  0.9327583  1.035093
##   2       26      1.296439  0.9327583  1.035093
##   2       27      1.296439  0.9327583  1.035093
##   2       28      1.296439  0.9327583  1.035093
##   2       29      1.296439  0.9327583  1.035093
##   2       30      1.296439  0.9327583  1.035093
##   2       31      1.296439  0.9327583  1.035093
##   2       32      1.296439  0.9327583  1.035093
##   2       33      1.296439  0.9327583  1.035093
##   2       34      1.296439  0.9327583  1.035093
##   2       35      1.296439  0.9327583  1.035093
##   2       36      1.296439  0.9327583  1.035093
##   2       37      1.296439  0.9327583  1.035093
##   2       38      1.296439  0.9327583  1.035093
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
marsPred <- predict(marsTune, testData$x)

postResample(marsPred, testData$y)
##      RMSE  Rsquared       MAE 
## 1.1722635 0.9448890 0.9324923

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Out of the KNN, NN, and MARS models, the MARS model gives the best predictive performance on both resampling and the test set, with the lowest test RMSE (1.17) and the highest R-squared (0.94).
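The three test-set results can also be collected side by side; a small sketch reusing the predictions computed above:

rbind(KNN  = postResample(knnPred, testData$y),
      NNet = postResample(nnPred, testData$y),
      MARS = postResample(marsPred, testData$y))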

varImp(marsTune)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.24
## X2   48.74
## X5   15.53
## X3    0.00

Our MARS model selects the informative predictors X1–X5; however, the X3 predictor is given an importance of 0, which suggests the model gains no contribution from X3. If we know for certain that X3 is important, then this suggests the model could be tuned further to give a better outcome.
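One way to see how X3 drops out is to inspect the hinge functions the final model actually retained; summary() on the underlying earth object lists the selected terms:

summary(marsTune$finalModel)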

Question 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
data("ChemicalManufacturingProcess")

# Imputing with MICE
chemical_imputed <- mice(data = ChemicalManufacturingProcess, m = 5, maxit = 40,
                         meth = 'pmm', seed = 500, printFlag = FALSE)
## Warning: Number of logged events: 5400
data_imputed <- complete(chemical_imputed,1)
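With meth = 'pmm' (predictive mean matching), each missing value is imputed with an observed value borrowed from donors whose predicted means are closest; m = 5 completed data sets are generated, and complete(chemical_imputed, 1) extracts the first of them.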
# Removing near zero variance predictors

nzv <- nearZeroVar(data_imputed)
remaining_predictors <- data_imputed[,-nzv]
# Removing highly correlated predictors

high_correlation <- findCorrelation(cor(remaining_predictors[,2:57]), cutoff = .9, exact = TRUE)
length(high_correlation)
## [1] 10

There are 10 highly correlated predictors (pairwise correlation > 0.90) that I will remove.

# findCorrelation() indexed the predictor-only matrix (columns 2:57), so its
# indices must be shifted by one to line up with remaining_predictors
filtered_remaining <- remaining_predictors[,-(high_correlation + 1)]
ncol(filtered_remaining)
## [1] 47
# Creating split and test/train
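# (note: a set.seed() call before createDataPartition() would make this split reproducible)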
split <- createDataPartition(filtered_remaining$Yield, p = 0.75, list = FALSE)

train.x <- filtered_remaining[split, -1]
train.y <- filtered_remaining[split, 1]

test.x <- filtered_remaining[-split, -1]
test.y <- filtered_remaining[-split, 1]

KNN Model

knnTune <- train(train.x, train.y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnTune
## k-Nearest Neighbors 
## 
## 132 samples
##  46 predictor
## 
## Pre-processing: centered (46), scaled (46) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.379862  0.4273352  1.109757
##    7  1.375664  0.4284585  1.112894
##    9  1.381304  0.4282176  1.128005
##   11  1.388269  0.4257264  1.133378
##   13  1.390374  0.4275608  1.135833
##   15  1.400483  0.4262227  1.145369
##   17  1.409025  0.4271038  1.149309
##   19  1.421441  0.4184200  1.158681
##   21  1.423856  0.4209609  1.165372
##   23  1.426456  0.4309096  1.170421
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
knnPred <- predict(knnTune, test.x)
postResample(pred = knnPred, test.y)
##      RMSE  Rsquared       MAE 
## 1.5263825 0.4485173 1.1743506

NN Model

# Creating grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))
# Using 10 fold validation
ctrl <- trainControl(method = "cv", number = 10)

set.seed(123)

# Tuning Model

nnetTune <- train(train.x, train.y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(train.x) + 1) + 10 + 1,
                  maxit = 500)

nnetTune
## Neural Network 
## 
## 132 samples
##  46 predictor
## 
## Pre-processing: centered (46), scaled (46) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 117, 119, 120, 118, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    4.779850  0.2445153  2.206453
##   0.00    2    1.569938  0.3370528  1.253875
##   0.00    3    1.896079  0.3627347  1.409978
##   0.00    4    2.620677  0.2988770  2.023155
##   0.00    5    2.899390  0.3060834  2.352837
##   0.00    6    3.124734  0.1542018  2.462135
##   0.00    7    4.017534  0.1947839  2.960149
##   0.00    8    3.754563  0.2448613  2.823495
##   0.00    9    6.175078  0.1759941  4.482240
##   0.00   10    6.326637  0.1268561  4.096067
##   0.01    1    1.518869  0.4292017  1.183204
##   0.01    2    2.210348  0.4530363  1.593790
##   0.01    3    2.331191  0.2789183  1.758083
##   0.01    4    2.736868  0.2909688  2.033200
##   0.01    5    2.677014  0.2983575  1.953792
##   0.01    6    2.378304  0.2500730  1.831719
##   0.01    7    2.538277  0.1849682  1.997785
##   0.01    8    2.195123  0.3877242  1.718666
##   0.01    9    2.847662  0.2393914  2.250954
##   0.01   10    3.023707  0.2322443  2.306779
##   0.10    1    1.805248  0.4744852  1.166351
##   0.10    2    2.517003  0.3465457  1.713372
##   0.10    3    2.585820  0.3019622  1.809522
##   0.10    4    2.430351  0.3950712  1.680381
##   0.10    5    2.508810  0.2724510  1.784748
##   0.10    6    2.040357  0.3620717  1.381619
##   0.10    7    2.320429  0.3353555  1.635677
##   0.10    8    2.057117  0.3722615  1.549548
##   0.10    9    2.066339  0.3460202  1.599354
##   0.10   10    2.207382  0.2945016  1.719344
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.01.
nnPred <- predict(nnetTune, test.x)

postResample(nnPred, test.y)
##      RMSE  Rsquared       MAE 
## 2.1282666 0.1440941 1.6774653

MARS Model

# Creating Grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(123)

# Tuning Model
marsTune <- train(train.x, train.y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

marsTune
## Multivariate Adaptive Regression Spline 
## 
## 132 samples
##  46 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 118, 117, 119, 120, 118, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.371325  0.4443383  1.0949033
##   1        3      1.206355  0.5580595  0.9644577
##   1        4      1.107039  0.6589157  0.9023411
##   1        5      1.113524  0.6374006  0.9103233
##   1        6      1.124825  0.6381776  0.9127902
##   1        7      1.131973  0.6384136  0.9153945
##   1        8      1.118654  0.6491934  0.9313501
##   1        9      1.170994  0.6161922  0.9581622
##   1       10      1.167153  0.6201985  0.9396652
##   1       11      1.379872  0.6079873  1.0282629
##   1       12      1.442576  0.5884937  1.0562688
##   1       13      1.445923  0.5730241  1.0836112
##   1       14      1.434894  0.5774451  1.0582389
##   1       15      1.431721  0.5842213  1.0625070
##   1       16      1.429295  0.5884530  1.0598433
##   1       17      1.432440  0.5870429  1.0642719
##   1       18      1.441956  0.5857340  1.0742922
##   1       19      1.441956  0.5857340  1.0742922
##   1       20      1.441956  0.5857340  1.0742922
##   1       21      1.441956  0.5857340  1.0742922
##   1       22      1.441956  0.5857340  1.0742922
##   1       23      1.441956  0.5857340  1.0742922
##   1       24      1.441956  0.5857340  1.0742922
##   1       25      1.441956  0.5857340  1.0742922
##   1       26      1.441956  0.5857340  1.0742922
##   1       27      1.441956  0.5857340  1.0742922
##   1       28      1.441956  0.5857340  1.0742922
##   1       29      1.441956  0.5857340  1.0742922
##   1       30      1.441956  0.5857340  1.0742922
##   1       31      1.441956  0.5857340  1.0742922
##   1       32      1.441956  0.5857340  1.0742922
##   1       33      1.441956  0.5857340  1.0742922
##   1       34      1.441956  0.5857340  1.0742922
##   1       35      1.441956  0.5857340  1.0742922
##   1       36      1.441956  0.5857340  1.0742922
##   1       37      1.441956  0.5857340  1.0742922
##   1       38      1.441956  0.5857340  1.0742922
##   2        2      1.371325  0.4443383  1.0949033
##   2        3      1.167346  0.5943510  0.9302185
##   2        4      1.111407  0.6393386  0.9072126
##   2        5      1.125335  0.6282815  0.9054744
##   2        6      1.092162  0.6460384  0.8803609
##   2        7      1.080226  0.6683239  0.8717035
##   2        8      1.082905  0.6712948  0.8365283
##   2        9      1.117087  0.6660025  0.8686386
##   2       10      1.160621  0.6444833  0.9097187
##   2       11      1.143612  0.6592276  0.9251865
##   2       12      1.149449  0.6563472  0.9360115
##   2       13      1.146434  0.6592886  0.9428546
##   2       14      1.133104  0.6644922  0.9342078
##   2       15      1.139213  0.6648344  0.9338646
##   2       16      1.157414  0.6549112  0.9404125
##   2       17      1.157352  0.6582743  0.9435324
##   2       18      1.158051  0.6589929  0.9456224
##   2       19      1.158051  0.6589929  0.9456224
##   2       20      1.168467  0.6530093  0.9543794
##   2       21      1.185206  0.6440278  0.9694642
##   2       22      1.181133  0.6442686  0.9682029
##   2       23      1.181133  0.6442686  0.9682029
##   2       24      1.181133  0.6442686  0.9682029
##   2       25      1.181133  0.6442686  0.9682029
##   2       26      1.181133  0.6442686  0.9682029
##   2       27      1.181133  0.6442686  0.9682029
##   2       28      1.181133  0.6442686  0.9682029
##   2       29      1.181133  0.6442686  0.9682029
##   2       30      1.181133  0.6442686  0.9682029
##   2       31      1.181133  0.6442686  0.9682029
##   2       32      1.181133  0.6442686  0.9682029
##   2       33      1.181133  0.6442686  0.9682029
##   2       34      1.181133  0.6442686  0.9682029
##   2       35      1.181133  0.6442686  0.9682029
##   2       36      1.181133  0.6442686  0.9682029
##   2       37      1.181133  0.6442686  0.9682029
##   2       38      1.181133  0.6442686  0.9682029
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 7 and degree = 2.
marsPred <- predict(marsTune, test.x)

postResample(marsPred, test.y)
##      RMSE  Rsquared       MAE 
## 1.5992190 0.3886594 1.2759123

A.

Which nonlinear regression model gives the optimal resampling and test set performance?

Based on the three models run above, the MARS model gives the best resampling performance (cross-validated RMSE 1.08), while our KNN model gives the best test-set performance, with the lowest RMSE (1.53) and the largest R-squared (0.45).
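As with 7.2, the test-set metrics can be collected for a side-by-side view; a small sketch reusing the predictions above:

rbind(KNN  = postResample(knnPred, test.y),
      NNet = postResample(nnPred, test.y),
      MARS = postResample(marsPred, test.y))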

B.

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

varImp(knnTune)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 46)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     88.87
## ManufacturingProcess09   85.95
## ManufacturingProcess13   83.21
## ManufacturingProcess06   77.51
## ManufacturingProcess36   76.77
## ManufacturingProcess31   73.80
## BiologicalMaterial12     70.52
## BiologicalMaterial02     68.72
## ManufacturingProcess11   62.69
## BiologicalMaterial04     54.85
## ManufacturingProcess18   50.85
## BiologicalMaterial08     42.97
## ManufacturingProcess33   41.71
## BiologicalMaterial09     33.16
## ManufacturingProcess29   29.87
## ManufacturingProcess35   28.75
## ManufacturingProcess01   27.09
## ManufacturingProcess20   26.03
## BiologicalMaterial10     23.66
# Optimal linear method from homework 7 was found to be the PLS model
# (note: this re-partition is not seeded, so it differs from the split above)
split <- createDataPartition(filtered_remaining$Yield, p = 0.75, list = FALSE)

train.p <- filtered_remaining[split, -1]
train.y <- filtered_remaining[split, 1]

test.p <- filtered_remaining[-split, -1]
test.y <- filtered_remaining[-split, 1]

p_pro <- c("nzv", "center", "scale")
tr_ctrl <- trainControl(method = "boot", number = 5)

pls_fit <- train(train.p, train.y, method = "pls", tuneLength = 20,
                 preProcess = p_pro, trControl = tr_ctrl)

varImp(pls_fit)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
## pls variable importance
## 
##   only 20 most important variables shown (out of 46)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   96.09
## ManufacturingProcess09   95.61
## ManufacturingProcess36   86.12
## BiologicalMaterial02     74.55
## BiologicalMaterial06     73.37
## ManufacturingProcess06   72.30
## BiologicalMaterial08     64.12
## BiologicalMaterial12     63.52
## ManufacturingProcess12   62.39
## ManufacturingProcess33   57.43
## BiologicalMaterial04     55.22
## ManufacturingProcess11   53.33
## ManufacturingProcess04   47.99
## ManufacturingProcess34   47.15
## ManufacturingProcess02   40.64
## ManufacturingProcess35   36.29
## BiologicalMaterial10     35.03
## ManufacturingProcess10   29.83
## ManufacturingProcess43   28.28

The most important predictors from our KNN model are slightly different from those of our PLS linear model. Process variables dominate the KNN list: seven of its top ten are manufacturing process variables. Both models rank ManufacturingProcess32 as the top predictor, and several predictors appear in both models' top ten, but some differ: ManufacturingProcess31 and ManufacturingProcess11 enter the top ten, displacing two predictors (BiologicalMaterial08 and ManufacturingProcess12) from the PLS model's list, and most predictors' importance values and positions within the top ten have changed.
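The overlap can also be computed directly; a sketch using the varImp() tables from both models:

imp_knn <- varImp(knnTune)$importance
imp_pls <- varImp(pls_fit)$importance
top10_knn <- rownames(imp_knn)[order(-imp_knn$Overall)][1:10]
top10_pls <- rownames(imp_pls)[order(-imp_pls$Overall)][1:10]
intersect(top10_knn, top10_pls)  # shared between the two models
setdiff(top10_knn, top10_pls)    # unique to the KNN model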

C.

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Out of the top ten predictors in the nonlinear regression model, the predictors unique to it are ManufacturingProcess31 and ManufacturingProcess11.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks mice::filter(), stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::lift()   masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
top10 <- varImp(knnTune)$importance %>%
  arrange(-Overall) %>%
  head(10)

top10
##                          Overall
## ManufacturingProcess32 100.00000
## BiologicalMaterial06    88.87033
## ManufacturingProcess09  85.94624
## ManufacturingProcess13  83.21469
## ManufacturingProcess06  77.51353
## ManufacturingProcess36  76.77219
## ManufacturingProcess31  73.79702
## BiologicalMaterial12    70.51875
## BiologicalMaterial02    68.72283
## ManufacturingProcess11  62.68749
library(corrplot)
## corrplot 0.95 loaded
## 
## Attaching package: 'corrplot'
## The following object is masked from 'package:pls':
## 
##     corrplot
filtered_remaining %>%
  select(c("Yield", row.names(top10))) %>%
  cor() %>%
  corrplot()

filtered_remaining %>%
  select(c("Yield", row.names(top10))) %>%
  cor()
##                              Yield ManufacturingProcess32 BiologicalMaterial06
## Yield                   1.00000000           0.6083321497           0.47816342
## ManufacturingProcess32  0.60833215           1.0000000000           0.60059580
## BiologicalMaterial06    0.47816342           0.6005958009           1.00000000
## ManufacturingProcess09  0.50347051           0.0410030118           0.23005968
## ManufacturingProcess13 -0.50367972          -0.1012067889          -0.12186756
## ManufacturingProcess06  0.39458486           0.2116458104           0.23943828
## ManufacturingProcess36 -0.51798271          -0.7747446108          -0.52170786
## ManufacturingProcess31 -0.06180917          -0.0004030551          -0.05391156
## BiologicalMaterial12    0.36749764           0.3877760264           0.81285397
## BiologicalMaterial02    0.48151579           0.6298320895           0.95431130
## ManufacturingProcess11  0.31623978          -0.1073779089           0.07411573
##                        ManufacturingProcess09 ManufacturingProcess13
## Yield                              0.50347051            -0.50367972
## ManufacturingProcess32             0.04100301            -0.10120679
## BiologicalMaterial06               0.23005968            -0.12186756
## ManufacturingProcess09             1.00000000            -0.79135366
## ManufacturingProcess13            -0.79135366             1.00000000
## ManufacturingProcess06             0.37565912            -0.41390995
## ManufacturingProcess36            -0.05063795             0.09406382
## ManufacturingProcess31            -0.13092043             0.08643301
## BiologicalMaterial12               0.24585610            -0.11198335
## BiologicalMaterial02               0.21884418            -0.11246895
## ManufacturingProcess11             0.70622047            -0.57238455
##                        ManufacturingProcess06 ManufacturingProcess36
## Yield                              0.39458486            -0.51798271
## ManufacturingProcess32             0.21164581            -0.77474461
## BiologicalMaterial06               0.23943828            -0.52170786
## ManufacturingProcess09             0.37565912            -0.05063795
## ManufacturingProcess13            -0.41390995             0.09406382
## ManufacturingProcess06             1.00000000            -0.24416278
## ManufacturingProcess36            -0.24416278             1.00000000
## ManufacturingProcess31            -0.08333126             0.07228466
## BiologicalMaterial12               0.26843526            -0.36437219
## BiologicalMaterial02               0.26825422            -0.54785592
## ManufacturingProcess11             0.29001330             0.07134105
##                        ManufacturingProcess31 BiologicalMaterial12
## Yield                           -0.0618091742            0.3674976
## ManufacturingProcess32          -0.0004030551            0.3877760
## BiologicalMaterial06            -0.0539115594            0.8128540
## ManufacturingProcess09          -0.1309204309            0.2458561
## ManufacturingProcess13           0.0864330146           -0.1119834
## ManufacturingProcess06          -0.0833312630            0.2684353
## ManufacturingProcess36           0.0722846634           -0.3643722
## ManufacturingProcess31           1.0000000000           -0.1017085
## BiologicalMaterial12            -0.1017084932            1.0000000
## BiologicalMaterial02            -0.0498877843            0.7793419
## ManufacturingProcess11          -0.1398012286            0.1049372
##                        BiologicalMaterial02 ManufacturingProcess11
## Yield                            0.48151579             0.31623978
## ManufacturingProcess32           0.62983209            -0.10737791
## BiologicalMaterial06             0.95431130             0.07411573
## ManufacturingProcess09           0.21884418             0.70622047
## ManufacturingProcess13          -0.11246895            -0.57238455
## ManufacturingProcess06           0.26825422             0.29001330
## ManufacturingProcess36          -0.54785592             0.07134105
## ManufacturingProcess31          -0.04988778            -0.13980123
## BiologicalMaterial12             0.77934185             0.10493721
## BiologicalMaterial02             1.00000000             0.08361743
## ManufacturingProcess11           0.08361743             1.00000000
important_predictors <- filtered_remaining %>%
  select(c("Yield", row.names(top10)))

head(important_predictors)
##   Yield ManufacturingProcess32 BiologicalMaterial06 ManufacturingProcess09
## 1 38.00                    156                43.73                  43.00
## 2 42.44                    169                53.14                  46.57
## 3 42.03                    173                53.14                  45.07
## 4 41.42                    171                53.14                  44.92
## 5 42.49                    171                54.66                  44.96
## 6 43.57                    173                51.23                  45.32
##   ManufacturingProcess13 ManufacturingProcess06 ManufacturingProcess36
## 1                   35.5                  203.6                  0.019
## 2                   34.0                  210.0                  0.019
## 3                   34.8                  207.1                  0.018
## 4                   34.8                  213.3                  0.018
## 5                   34.6                  205.7                  0.017
## 6                   34.0                  208.9                  0.018
##   ManufacturingProcess31 BiologicalMaterial12 BiologicalMaterial02
## 1                   69.1                18.83                49.58
## 2                   68.7                21.05                60.97
## 3                   69.3                21.05                60.97
## 4                   69.3                21.05                60.97
## 5                   69.4                21.05                63.33
## 6                   68.2                20.76                58.36
##   ManufacturingProcess11
## 1                    9.7
## 2                    9.2
## 3                    9.0
## 4                    9.0
## 5                    9.2
## 6                   10.1
process31 <- important_predictors[,c("Yield", "ManufacturingProcess31")]

ggplot(process31, aes(x = ManufacturingProcess31, y = Yield)) +
  geom_point()

process11 <- important_predictors[,c("Yield", "ManufacturingProcess11")]

ggplot(process11, aes(x = ManufacturingProcess11, y = Yield)) +
  geom_point()

From the plots of the predictors unique to our KNN model, it seems that these manufacturing processes have little influence on yield on their own. Considering ManufacturingProcess32, which has the highest positive correlation with yield, I wonder whether the correlations seen in the process variables reflect limitations on which materials can be used in each process. My intuition is that the biological materials may have the more impactful influence, but further research is needed to make any recommendations.