In this homework assignment I complete exercises 7.2 and 7.5 from Kuhn and Johnson's Applied Predictive Modeling.
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
y = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + N(0, σ^2)
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five other non-informative predictors). The mlbench package contains a function called mlbench.friedman1 that simulates these data:
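Before running the book's code below, it can help to write the Friedman mean function out directly; a minimal sketch (friedman1_mean is a name I am introducing here, not part of mlbench), showing that only x1 through x5 enter the mean:
# The Friedman (1991) mean function: only the first five predictors are informative,
# so a good model should down-weight the remaining five noise variables.
friedman1_mean <- function(x) {
  10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 + 10 * x[, 4] + 5 * x[, 5]
}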
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
## featurePlot(trainingData$x, trainingData$y)
## or other methods.
This creates a list with a vector ‘y’ and a matrix
of predictors ‘x’. Also simulate a large test set to
estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
KNN Model
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center","scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
plot(knnModel)
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
NN Model
# Checking for predictors with multicollinearity
tooHigh <- findCorrelation(cor(trainingData$x), cutoff = 0.75)
tooHigh
## integer(0)
# Creating a tuning grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10))
# Using a 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
set.seed(123)
# Tuning Model
nnetTune <- train(trainingData$x, trainingData$y,
method = "nnet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
# Cap on total weights: 10 hidden units * (predictors + 1 bias) + 10 hidden-to-output weights + 1 output bias
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
nnetTune
## Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.428469 0.7653147 1.881622
## 0.00 2 2.658319 0.7176523 2.123341
## 0.00 3 2.439662 0.7576737 1.909810
## 0.00 4 2.395450 0.7718850 1.870523
## 0.00 5 3.158842 0.6600032 2.306494
## 0.00 6 5.114760 0.5927134 2.878301
## 0.00 7 3.998873 0.5863680 2.807368
## 0.00 8 13.038224 0.3380882 6.216311
## 0.00 9 3.418244 0.6064786 2.750387
## 0.00 10 14.282253 0.3903329 5.817233
## 0.01 1 2.428178 0.7651061 1.880566
## 0.01 2 2.628711 0.7352039 2.081463
## 0.01 3 2.500025 0.7465002 1.952391
## 0.01 4 2.487535 0.7612713 1.893652
## 0.01 5 2.627321 0.7308586 2.175922
## 0.01 6 2.779170 0.7162527 2.154943
## 0.01 7 2.974175 0.6754817 2.306175
## 0.01 8 3.029658 0.6786905 2.368330
## 0.01 9 3.530934 0.6095407 2.864381
## 0.01 10 3.427628 0.6325805 2.755656
## 0.10 1 2.441856 0.7622291 1.892003
## 0.10 2 2.571867 0.7378116 1.967507
## 0.10 3 2.209160 0.8028016 1.797465
## 0.10 4 2.384313 0.7706745 1.946165
## 0.10 5 2.703109 0.7134765 2.109896
## 0.10 6 2.691025 0.7343891 2.178368
## 0.10 7 2.807927 0.7008760 2.202846
## 0.10 8 2.906717 0.7081779 2.361708
## 0.10 9 3.246571 0.6386602 2.586619
## 0.10 10 3.239180 0.6467128 2.555378
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.1.
nnPred <- predict(nnetTune, testData$x)
postResample(nnPred, testData$y)
## RMSE Rsquared MAE
## 2.4763089 0.7565333 1.8564599
MARS Model
# Creating a tuning grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(123)
# Tuning Model
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsTune <- train(trainingData$x, trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTune
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.311247 0.2748122 3.603533
## 1 3 3.531005 0.5107259 2.857560
## 1 4 2.609132 0.7291471 2.109945
## 1 5 2.234494 0.8007350 1.788244
## 1 6 2.279819 0.7999357 1.803273
## 1 7 1.792708 0.8748522 1.398541
## 1 8 1.710582 0.8857656 1.323419
## 1 9 1.662155 0.8892531 1.291466
## 1 10 1.706154 0.8823897 1.306578
## 1 11 1.743116 0.8739494 1.360521
## 1 12 1.740790 0.8734421 1.357507
## 1 13 1.703492 0.8788657 1.326192
## 1 14 1.700604 0.8791430 1.324716
## 1 15 1.692444 0.8801290 1.317804
## 1 16 1.692444 0.8801290 1.317804
## 1 17 1.692444 0.8801290 1.317804
## 1 18 1.692444 0.8801290 1.317804
## 1 19 1.692444 0.8801290 1.317804
## 1 20 1.692444 0.8801290 1.317804
## 1 21 1.692444 0.8801290 1.317804
## 1 22 1.692444 0.8801290 1.317804
## 1 23 1.692444 0.8801290 1.317804
## 1 24 1.692444 0.8801290 1.317804
## 1 25 1.692444 0.8801290 1.317804
## 1 26 1.692444 0.8801290 1.317804
## 1 27 1.692444 0.8801290 1.317804
## 1 28 1.692444 0.8801290 1.317804
## 1 29 1.692444 0.8801290 1.317804
## 1 30 1.692444 0.8801290 1.317804
## 1 31 1.692444 0.8801290 1.317804
## 1 32 1.692444 0.8801290 1.317804
## 1 33 1.692444 0.8801290 1.317804
## 1 34 1.692444 0.8801290 1.317804
## 1 35 1.692444 0.8801290 1.317804
## 1 36 1.692444 0.8801290 1.317804
## 1 37 1.692444 0.8801290 1.317804
## 1 38 1.692444 0.8801290 1.317804
## 2 2 4.311247 0.2748122 3.603533
## 2 3 3.531005 0.5107259 2.857560
## 2 4 2.609132 0.7291471 2.109945
## 2 5 2.243508 0.7985944 1.788189
## 2 6 2.236723 0.7987764 1.770156
## 2 7 1.815177 0.8693557 1.425563
## 2 8 1.699050 0.8834064 1.317662
## 2 9 1.487692 0.9084049 1.182061
## 2 10 1.469496 0.9053535 1.160443
## 2 11 1.392318 0.9178210 1.085187
## 2 12 1.302695 0.9312685 1.032827
## 2 13 1.293800 0.9331208 1.033397
## 2 14 1.265082 0.9371588 1.012795
## 2 15 1.275804 0.9351561 1.019457
## 2 16 1.288843 0.9335588 1.031360
## 2 17 1.296439 0.9327583 1.035093
## 2 18 1.296439 0.9327583 1.035093
## 2 19 1.296439 0.9327583 1.035093
## 2 20 1.296439 0.9327583 1.035093
## 2 21 1.296439 0.9327583 1.035093
## 2 22 1.296439 0.9327583 1.035093
## 2 23 1.296439 0.9327583 1.035093
## 2 24 1.296439 0.9327583 1.035093
## 2 25 1.296439 0.9327583 1.035093
## 2 26 1.296439 0.9327583 1.035093
## 2 27 1.296439 0.9327583 1.035093
## 2 28 1.296439 0.9327583 1.035093
## 2 29 1.296439 0.9327583 1.035093
## 2 30 1.296439 0.9327583 1.035093
## 2 31 1.296439 0.9327583 1.035093
## 2 32 1.296439 0.9327583 1.035093
## 2 33 1.296439 0.9327583 1.035093
## 2 34 1.296439 0.9327583 1.035093
## 2 35 1.296439 0.9327583 1.035093
## 2 36 1.296439 0.9327583 1.035093
## 2 37 1.296439 0.9327583 1.035093
## 2 38 1.296439 0.9327583 1.035093
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
marsPred <- predict(marsTune, testData$x)
postResample(marsPred, testData$y)
## RMSE Rsquared MAE
## 1.1722635 0.9448890 0.9324923
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
Out of the NN, KNN, and MARS models, the MARS model appears to give the best predictive performance, with the lowest test-set RMSE (1.17) and the highest R-squared (0.94).
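To make that comparison easier to see, here is a minimal sketch (assuming the fitted objects knnModel, nnetTune, and marsTune from the chunks above are still in the workspace) that collects the three test-set results into one table:
# Held-out test-set metrics for the three Friedman-data models, sorted by RMSE
compare <- rbind(
  KNN  = postResample(predict(knnModel, testData$x), testData$y),
  NNet = postResample(predict(nnetTune, testData$x), testData$y),
  MARS = postResample(predict(marsTune, testData$x), testData$y)
)
compare[order(compare[, "RMSE"]), ]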
varImp(marsTune)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.24
## X2 48.74
## X5 15.53
## X3 0.00
Our MARS model does select the informative predictors X1–X5; however, X3 is assigned an importance of 0, which suggests the final model gains no contribution from it. If we knew for certain that X3 matters (as it does in the simulating equation), this would indicate the model could be tuned further to capture its effect and give a better outcome.
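To confirm which predictors actually enter the final MARS basis functions, rather than relying on varImp alone, one option is to inspect the fitted earth object directly; a minimal sketch, assuming marsTune is the caret object fitted above:
# Show the retained hinge terms and earth's own importance measure
summary(marsTune$finalModel)   # prints the selected basis functions and coefficients
evimp(marsTune$finalModel)     # nsubsets / GCV / RSS importance from the earth package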
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
library(AppliedPredictiveModeling)
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
data("ChemicalManufacturingProcess")
# Imputing with MICE
chemical_imputed <- mice(data = ChemicalManufacturingProcess,m=5,maxit=40,meth='pmm',seed=500,printFlag = FALSE)
## Warning: Number of logged events: 5400
data_imputed <- complete(chemical_imputed,1)
# Removing near zero variance predictors
nzv <- nearZeroVar(data_imputed)
remaining_predictors <- data_imputed[,-nzv]
# Removing highly correlated predictors
high_correlation <- findCorrelation(cor(remaining_predictors[,2:57]), cutoff = .9, exact = TRUE)
length(high_correlation)
## [1] 10
There are 10 predictors with pairwise correlations above 0.90, which I will remove:
# findCorrelation() returned indices relative to the predictor columns (2:57),
# so shift them by one before dropping columns from the data frame that still has Yield in column 1
filtered_remaining <- remaining_predictors[, -(high_correlation + 1)]
ncol(filtered_remaining)
## [1] 47
# Creating split and test/train
split <- createDataPartition(filtered_remaining$Yield, p = 0.75, list = FALSE)
train.x <- filtered_remaining[split, -1]
train.y <- filtered_remaining[split, 1]
test.x <- filtered_remaining[-split, -1]
test.y <- filtered_remaining[-split, 1]
KNN Model
knnTune <- train(train.x, train.y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnTune
## k-Nearest Neighbors
##
## 132 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.379862 0.4273352 1.109757
## 7 1.375664 0.4284585 1.112894
## 9 1.381304 0.4282176 1.128005
## 11 1.388269 0.4257264 1.133378
## 13 1.390374 0.4275608 1.135833
## 15 1.400483 0.4262227 1.145369
## 17 1.409025 0.4271038 1.149309
## 19 1.421441 0.4184200 1.158681
## 21 1.423856 0.4209609 1.165372
## 23 1.426456 0.4309096 1.170421
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
knnPred <- predict(knnTune, test.x)
postResample(pred = knnPred, test.y)
## RMSE Rsquared MAE
## 1.5263825 0.4485173 1.1743506
NN Model
# Creating grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10))
# Using 10 fold validation
ctrl <- trainControl(method = "cv", number = 10)
set.seed(123)
# Tuning Model
nnetTune <- train(train.x, train.y,
method = "nnet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(train.x) + 1) + 10 + 1,
maxit = 500)
nnetTune
## Neural Network
##
## 132 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 117, 119, 120, 118, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 4.779850 0.2445153 2.206453
## 0.00 2 1.569938 0.3370528 1.253875
## 0.00 3 1.896079 0.3627347 1.409978
## 0.00 4 2.620677 0.2988770 2.023155
## 0.00 5 2.899390 0.3060834 2.352837
## 0.00 6 3.124734 0.1542018 2.462135
## 0.00 7 4.017534 0.1947839 2.960149
## 0.00 8 3.754563 0.2448613 2.823495
## 0.00 9 6.175078 0.1759941 4.482240
## 0.00 10 6.326637 0.1268561 4.096067
## 0.01 1 1.518869 0.4292017 1.183204
## 0.01 2 2.210348 0.4530363 1.593790
## 0.01 3 2.331191 0.2789183 1.758083
## 0.01 4 2.736868 0.2909688 2.033200
## 0.01 5 2.677014 0.2983575 1.953792
## 0.01 6 2.378304 0.2500730 1.831719
## 0.01 7 2.538277 0.1849682 1.997785
## 0.01 8 2.195123 0.3877242 1.718666
## 0.01 9 2.847662 0.2393914 2.250954
## 0.01 10 3.023707 0.2322443 2.306779
## 0.10 1 1.805248 0.4744852 1.166351
## 0.10 2 2.517003 0.3465457 1.713372
## 0.10 3 2.585820 0.3019622 1.809522
## 0.10 4 2.430351 0.3950712 1.680381
## 0.10 5 2.508810 0.2724510 1.784748
## 0.10 6 2.040357 0.3620717 1.381619
## 0.10 7 2.320429 0.3353555 1.635677
## 0.10 8 2.057117 0.3722615 1.549548
## 0.10 9 2.066339 0.3460202 1.599354
## 0.10 10 2.207382 0.2945016 1.719344
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.01.
nnPred <- predict(nnetTune, test.x)
postResample(nnPred, test.y)
## RMSE Rsquared MAE
## 2.1282666 0.1440941 1.6774653
MARS Model
# Creating Grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(123)
# Tuning Model
marsTune <- train(train.x, train.y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsTune
## Multivariate Adaptive Regression Spline
##
## 132 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 119, 118, 117, 119, 120, 118, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.371325 0.4443383 1.0949033
## 1 3 1.206355 0.5580595 0.9644577
## 1 4 1.107039 0.6589157 0.9023411
## 1 5 1.113524 0.6374006 0.9103233
## 1 6 1.124825 0.6381776 0.9127902
## 1 7 1.131973 0.6384136 0.9153945
## 1 8 1.118654 0.6491934 0.9313501
## 1 9 1.170994 0.6161922 0.9581622
## 1 10 1.167153 0.6201985 0.9396652
## 1 11 1.379872 0.6079873 1.0282629
## 1 12 1.442576 0.5884937 1.0562688
## 1 13 1.445923 0.5730241 1.0836112
## 1 14 1.434894 0.5774451 1.0582389
## 1 15 1.431721 0.5842213 1.0625070
## 1 16 1.429295 0.5884530 1.0598433
## 1 17 1.432440 0.5870429 1.0642719
## 1 18 1.441956 0.5857340 1.0742922
## 1 19 1.441956 0.5857340 1.0742922
## 1 20 1.441956 0.5857340 1.0742922
## 1 21 1.441956 0.5857340 1.0742922
## 1 22 1.441956 0.5857340 1.0742922
## 1 23 1.441956 0.5857340 1.0742922
## 1 24 1.441956 0.5857340 1.0742922
## 1 25 1.441956 0.5857340 1.0742922
## 1 26 1.441956 0.5857340 1.0742922
## 1 27 1.441956 0.5857340 1.0742922
## 1 28 1.441956 0.5857340 1.0742922
## 1 29 1.441956 0.5857340 1.0742922
## 1 30 1.441956 0.5857340 1.0742922
## 1 31 1.441956 0.5857340 1.0742922
## 1 32 1.441956 0.5857340 1.0742922
## 1 33 1.441956 0.5857340 1.0742922
## 1 34 1.441956 0.5857340 1.0742922
## 1 35 1.441956 0.5857340 1.0742922
## 1 36 1.441956 0.5857340 1.0742922
## 1 37 1.441956 0.5857340 1.0742922
## 1 38 1.441956 0.5857340 1.0742922
## 2 2 1.371325 0.4443383 1.0949033
## 2 3 1.167346 0.5943510 0.9302185
## 2 4 1.111407 0.6393386 0.9072126
## 2 5 1.125335 0.6282815 0.9054744
## 2 6 1.092162 0.6460384 0.8803609
## 2 7 1.080226 0.6683239 0.8717035
## 2 8 1.082905 0.6712948 0.8365283
## 2 9 1.117087 0.6660025 0.8686386
## 2 10 1.160621 0.6444833 0.9097187
## 2 11 1.143612 0.6592276 0.9251865
## 2 12 1.149449 0.6563472 0.9360115
## 2 13 1.146434 0.6592886 0.9428546
## 2 14 1.133104 0.6644922 0.9342078
## 2 15 1.139213 0.6648344 0.9338646
## 2 16 1.157414 0.6549112 0.9404125
## 2 17 1.157352 0.6582743 0.9435324
## 2 18 1.158051 0.6589929 0.9456224
## 2 19 1.158051 0.6589929 0.9456224
## 2 20 1.168467 0.6530093 0.9543794
## 2 21 1.185206 0.6440278 0.9694642
## 2 22 1.181133 0.6442686 0.9682029
## 2 23 1.181133 0.6442686 0.9682029
## 2 24 1.181133 0.6442686 0.9682029
## 2 25 1.181133 0.6442686 0.9682029
## 2 26 1.181133 0.6442686 0.9682029
## 2 27 1.181133 0.6442686 0.9682029
## 2 28 1.181133 0.6442686 0.9682029
## 2 29 1.181133 0.6442686 0.9682029
## 2 30 1.181133 0.6442686 0.9682029
## 2 31 1.181133 0.6442686 0.9682029
## 2 32 1.181133 0.6442686 0.9682029
## 2 33 1.181133 0.6442686 0.9682029
## 2 34 1.181133 0.6442686 0.9682029
## 2 35 1.181133 0.6442686 0.9682029
## 2 36 1.181133 0.6442686 0.9682029
## 2 37 1.181133 0.6442686 0.9682029
## 2 38 1.181133 0.6442686 0.9682029
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 7 and degree = 2.
marsPred <- predict(marsTune, test.x)
postResample(marsPred, test.y)
## RMSE Rsquared MAE
## 1.5992190 0.3886594 1.2759123
Which nonlinear regression model gives the optimal resampling and test set performance?
Based on the three models above, MARS gives the best resampling performance (cross-validated RMSE of 1.08), but on the held-out test set the KNN model performs best, with the lowest RMSE (1.53) and the largest R-squared (0.45).
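As a quick side-by-side check, here is a minimal sketch (assuming knnTune, nnetTune, and marsTune from this exercise, plus test.x and test.y, are still in the workspace):
# Held-out test-set metrics for the chemical manufacturing models, sorted by RMSE
chem_compare <- rbind(
  KNN  = postResample(predict(knnTune, test.x), test.y),
  NNet = postResample(predict(nnetTune, test.x), test.y),
  MARS = postResample(predict(marsTune, test.x), test.y)
)
chem_compare[order(chem_compare[, "RMSE"]), ]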
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
varImp(knnTune)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 46)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 88.87
## ManufacturingProcess09 85.95
## ManufacturingProcess13 83.21
## ManufacturingProcess06 77.51
## ManufacturingProcess36 76.77
## ManufacturingProcess31 73.80
## BiologicalMaterial12 70.52
## BiologicalMaterial02 68.72
## ManufacturingProcess11 62.69
## BiologicalMaterial04 54.85
## ManufacturingProcess18 50.85
## BiologicalMaterial08 42.97
## ManufacturingProcess33 41.71
## BiologicalMaterial09 33.16
## ManufacturingProcess29 29.87
## ManufacturingProcess35 28.75
## ManufacturingProcess01 27.09
## ManufacturingProcess20 26.03
## BiologicalMaterial10 23.66
# Optimal linear method from homework 7 was found to be the PLS model
split <- createDataPartition(filtered_remaining$Yield, p = 0.75, list = FALSE)
train.p <- filtered_remaining[split, -1]
train.y <- filtered_remaining[split, 1]
test.p <- filtered_remaining[-split, -1]
test.y <- filtered_remaining[-split, 1]
p_pro <- c("nzv", "center", "scale")
tr_ctrl <- trainControl(method = "boot", number = 5)
pls_fit <- train(train.p, train.y, method = "pls", tuneLength = 20, preProcess = p_pro, trControl = tr_ctrl)
varImp(pls_fit)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
## pls variable importance
##
## only 20 most important variables shown (out of 46)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 96.09
## ManufacturingProcess09 95.61
## ManufacturingProcess36 86.12
## BiologicalMaterial02 74.55
## BiologicalMaterial06 73.37
## ManufacturingProcess06 72.30
## BiologicalMaterial08 64.12
## BiologicalMaterial12 63.52
## ManufacturingProcess12 62.39
## ManufacturingProcess33 57.43
## BiologicalMaterial04 55.22
## ManufacturingProcess11 53.33
## ManufacturingProcess04 47.99
## ManufacturingProcess34 47.15
## ManufacturingProcess02 40.64
## ManufacturingProcess35 36.29
## BiologicalMaterial10 35.03
## ManufacturingProcess10 29.83
## ManufacturingProcess43 28.28
The most important predictors from the KNN model differ slightly from those of the PLS linear model. Manufacturing process variables dominate the KNN list, accounting for seven of the top ten. Both models rank ManufacturingProcess32 first and share several other predictors in their top ten, but the lists are not identical: ManufacturingProcess31 and ManufacturingProcess11 enter the KNN top ten, displacing BiologicalMaterial08 and ManufacturingProcess12 from the PLS top ten, and most of the shared predictors' importance values and rank positions have shifted.
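One way to make the overlap explicit is to line the two top-ten lists up directly; a minimal sketch, assuming knnTune and pls_fit are still available:
# Compare the top-ten importance rankings of the nonlinear (KNN) and linear (PLS) models
knn_imp <- varImp(knnTune)$importance
pls_imp <- varImp(pls_fit)$importance
knn_top10 <- rownames(knn_imp)[order(-knn_imp$Overall)][1:10]
pls_top10 <- rownames(pls_imp)[order(-pls_imp$Overall)][1:10]
data.frame(rank = 1:10, KNN = knn_top10, PLS = pls_top10)
setdiff(knn_top10, pls_top10)   # predictors unique to the nonlinear model's top ten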
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Out of the top ten predictors in the nonlinear regression model, the ones unique to it relative to the PLS model are ManufacturingProcess31 and ManufacturingProcess11.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks mice::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
top10 <- varImp(knnTune)$importance %>%
arrange(-Overall) %>%
head(10)
top10
## Overall
## ManufacturingProcess32 100.00000
## BiologicalMaterial06 88.87033
## ManufacturingProcess09 85.94624
## ManufacturingProcess13 83.21469
## ManufacturingProcess06 77.51353
## ManufacturingProcess36 76.77219
## ManufacturingProcess31 73.79702
## BiologicalMaterial12 70.51875
## BiologicalMaterial02 68.72283
## ManufacturingProcess11 62.68749
library(corrplot)
## corrplot 0.95 loaded
##
## Attaching package: 'corrplot'
## The following object is masked from 'package:pls':
##
## corrplot
filtered_remaining %>%
select(c("Yield", row.names(top10))) %>%
cor() %>%
corrplot()
filtered_remaining %>%
select(c("Yield", row.names(top10))) %>%
cor()
## Yield ManufacturingProcess32 BiologicalMaterial06
## Yield 1.00000000 0.6083321497 0.47816342
## ManufacturingProcess32 0.60833215 1.0000000000 0.60059580
## BiologicalMaterial06 0.47816342 0.6005958009 1.00000000
## ManufacturingProcess09 0.50347051 0.0410030118 0.23005968
## ManufacturingProcess13 -0.50367972 -0.1012067889 -0.12186756
## ManufacturingProcess06 0.39458486 0.2116458104 0.23943828
## ManufacturingProcess36 -0.51798271 -0.7747446108 -0.52170786
## ManufacturingProcess31 -0.06180917 -0.0004030551 -0.05391156
## BiologicalMaterial12 0.36749764 0.3877760264 0.81285397
## BiologicalMaterial02 0.48151579 0.6298320895 0.95431130
## ManufacturingProcess11 0.31623978 -0.1073779089 0.07411573
## ManufacturingProcess09 ManufacturingProcess13
## Yield 0.50347051 -0.50367972
## ManufacturingProcess32 0.04100301 -0.10120679
## BiologicalMaterial06 0.23005968 -0.12186756
## ManufacturingProcess09 1.00000000 -0.79135366
## ManufacturingProcess13 -0.79135366 1.00000000
## ManufacturingProcess06 0.37565912 -0.41390995
## ManufacturingProcess36 -0.05063795 0.09406382
## ManufacturingProcess31 -0.13092043 0.08643301
## BiologicalMaterial12 0.24585610 -0.11198335
## BiologicalMaterial02 0.21884418 -0.11246895
## ManufacturingProcess11 0.70622047 -0.57238455
## ManufacturingProcess06 ManufacturingProcess36
## Yield 0.39458486 -0.51798271
## ManufacturingProcess32 0.21164581 -0.77474461
## BiologicalMaterial06 0.23943828 -0.52170786
## ManufacturingProcess09 0.37565912 -0.05063795
## ManufacturingProcess13 -0.41390995 0.09406382
## ManufacturingProcess06 1.00000000 -0.24416278
## ManufacturingProcess36 -0.24416278 1.00000000
## ManufacturingProcess31 -0.08333126 0.07228466
## BiologicalMaterial12 0.26843526 -0.36437219
## BiologicalMaterial02 0.26825422 -0.54785592
## ManufacturingProcess11 0.29001330 0.07134105
## ManufacturingProcess31 BiologicalMaterial12
## Yield -0.0618091742 0.3674976
## ManufacturingProcess32 -0.0004030551 0.3877760
## BiologicalMaterial06 -0.0539115594 0.8128540
## ManufacturingProcess09 -0.1309204309 0.2458561
## ManufacturingProcess13 0.0864330146 -0.1119834
## ManufacturingProcess06 -0.0833312630 0.2684353
## ManufacturingProcess36 0.0722846634 -0.3643722
## ManufacturingProcess31 1.0000000000 -0.1017085
## BiologicalMaterial12 -0.1017084932 1.0000000
## BiologicalMaterial02 -0.0498877843 0.7793419
## ManufacturingProcess11 -0.1398012286 0.1049372
## BiologicalMaterial02 ManufacturingProcess11
## Yield 0.48151579 0.31623978
## ManufacturingProcess32 0.62983209 -0.10737791
## BiologicalMaterial06 0.95431130 0.07411573
## ManufacturingProcess09 0.21884418 0.70622047
## ManufacturingProcess13 -0.11246895 -0.57238455
## ManufacturingProcess06 0.26825422 0.29001330
## ManufacturingProcess36 -0.54785592 0.07134105
## ManufacturingProcess31 -0.04988778 -0.13980123
## BiologicalMaterial12 0.77934185 0.10493721
## BiologicalMaterial02 1.00000000 0.08361743
## ManufacturingProcess11 0.08361743 1.00000000
important_predictors <- filtered_remaining %>%
select(c("Yield", row.names(top10)))
head(important_predictors)
## Yield ManufacturingProcess32 BiologicalMaterial06 ManufacturingProcess09
## 1 38.00 156 43.73 43.00
## 2 42.44 169 53.14 46.57
## 3 42.03 173 53.14 45.07
## 4 41.42 171 53.14 44.92
## 5 42.49 171 54.66 44.96
## 6 43.57 173 51.23 45.32
## ManufacturingProcess13 ManufacturingProcess06 ManufacturingProcess36
## 1 35.5 203.6 0.019
## 2 34.0 210.0 0.019
## 3 34.8 207.1 0.018
## 4 34.8 213.3 0.018
## 5 34.6 205.7 0.017
## 6 34.0 208.9 0.018
## ManufacturingProcess31 BiologicalMaterial12 BiologicalMaterial02
## 1 69.1 18.83 49.58
## 2 68.7 21.05 60.97
## 3 69.3 21.05 60.97
## 4 69.3 21.05 60.97
## 5 69.4 21.05 63.33
## 6 68.2 20.76 58.36
## ManufacturingProcess11
## 1 9.7
## 2 9.2
## 3 9.0
## 4 9.0
## 5 9.2
## 6 10.1
process31 <- important_predictors[,c("Yield", "ManufacturingProcess31")]
ggplot(process31, aes(x = ManufacturingProcess31, y = Yield)) +
geom_point()
process11 <- important_predictors[,c("Yield", "ManufacturingProcess11")]
ggplot(process11, aes(x = ManufacturingProcess11, y = Yield)) +
geom_point()
From the plots of the predictors unique to our KNN model, these two manufacturing processes do not appear to have much influence on yield: ManufacturingProcess31 shows essentially no correlation with yield (-0.06), and ManufacturingProcess11 only a modest positive one (0.32). Considering ManufacturingProcess32, the process variable with the highest positive correlation with yield, I wonder whether the correlations seen for the process variables reflect limitations on which materials can be used in each process. My intuition is that the biological materials may have the stronger influence on yield, but further research is needed before making any recommendations.
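As a small follow-up on that intuition, a minimal sketch (assuming filtered_remaining is still in the workspace) checking how ManufacturingProcess32 co-varies with the biological materials in the top-ten list:
# Correlation of the strongest process predictor with the top-ranked biological materials
cor(filtered_remaining$ManufacturingProcess32,
    filtered_remaining[, c("BiologicalMaterial02", "BiologicalMaterial06", "BiologicalMaterial12")])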