library(tidyverse)
library(MASS)
library(caret)
library(mlbench)
library(xgboost)
library(GGally)
library(e1071)
library(corrplot)
Let’s generate the simulated Friedman data using the same code as the text, and create the feature distribution plot in the same manner.
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)
# Set up test data
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Before we model, let’s get a feel for both the training features and the test data, starting with a ggpairs plot of the training set.
# Plot training data distributions
ggpairs(cbind(trainingData$x, trainingData$y))
While the scatter plots are a bit busy, the predictors look roughly uniform (the Friedman simulation draws them from [0, 1]), while the response appears roughly bell-shaped.
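A quick numeric summary of the held-out predictors complements the plot; a minimal sketch (output omitted here):
# Sketch: numeric summary of the simulated test-set predictors
summary(testData$x)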
We’ll train the KNN model specified in K&J, centering and scaling the predictors. This fit uses caret’s default tuning grid; the later models specify their own grids or tuneLength values.
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"))
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
Now we’ll predict against the test data using the KNN model and report the root mean squared error (RMSE), the metric we’ll use to compare the regression models in this exercise.
knnPred <- predict(knnModel, testData$x)
RMSE(knnPred, testData$y)
## [1] 3.117232
Let’s also train a decision tree on our data. Decision trees are flexible models but are often prone to overfitting the training data.
decisionTreeModel <- train(x=trainingData$x,
y=trainingData$y,
method="rpart",
tuneLength=10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
decisionTreeModel
## CART
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01228972 3.678817 0.4809688 2.990114
## 0.01387981 3.686479 0.4774770 2.997159
## 0.01429458 3.695995 0.4747507 3.007059
## 0.02618043 3.712253 0.4625909 3.033527
## 0.03515931 3.792122 0.4356117 3.101091
## 0.05860160 3.983127 0.3854415 3.270421
## 0.06528495 4.061607 0.3651742 3.339100
## 0.07513824 4.135943 0.3429854 3.392170
## 0.20070359 4.596144 0.1871626 3.794009
## 0.25672400 4.706893 0.1682766 3.883181
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01228972.
# Predict using decision tree
predictTree <- predict(decisionTreeModel, testData$x)
RMSE(predictTree, testData$y)
## [1] 3.381563
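To see which splits the selected tree actually uses, we could plot the final rpart fit; a minimal sketch, assuming the rpart.plot package is available (it is not loaded above):
# Sketch: visualize the final pruned tree (rpart.plot is an extra dependency)
library(rpart.plot)
rpart.plot(decisionTreeModel$finalModel)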
We’ll also train a MARS model, following the approach employed in K&J.
# Train spline model (MARS)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned <- train(x=trainingData$x,
y=trainingData$y, method = "earth", tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.327437 0.2666502 3.6133264
## 1 3 3.582260 0.5145400 2.9105565
## 1 4 2.640676 0.7227538 2.1614949
## 1 5 2.292028 0.8062647 1.8481401
## 1 6 2.258999 0.8151057 1.7940661
## 1 7 1.810340 0.8740701 1.4166846
## 1 8 1.701523 0.8897982 1.3223375
## 1 9 1.629183 0.8960408 1.2682728
## 1 10 1.660758 0.8964917 1.3076894
## 1 11 1.557953 0.9054063 1.2285821
## 1 12 1.561823 0.9051019 1.2479219
## 1 13 1.571426 0.9049790 1.2474697
## 1 14 1.570282 0.9055072 1.2430842
## 1 15 1.570282 0.9055072 1.2430842
## 1 16 1.570282 0.9055072 1.2430842
## 1 17 1.570282 0.9055072 1.2430842
## 1 18 1.570282 0.9055072 1.2430842
## 1 19 1.570282 0.9055072 1.2430842
## 1 20 1.570282 0.9055072 1.2430842
## 1 21 1.570282 0.9055072 1.2430842
## 1 22 1.570282 0.9055072 1.2430842
## 1 23 1.570282 0.9055072 1.2430842
## 1 24 1.570282 0.9055072 1.2430842
## 1 25 1.570282 0.9055072 1.2430842
## 1 26 1.570282 0.9055072 1.2430842
## 1 27 1.570282 0.9055072 1.2430842
## 1 28 1.570282 0.9055072 1.2430842
## 1 29 1.570282 0.9055072 1.2430842
## 1 30 1.570282 0.9055072 1.2430842
## 1 31 1.570282 0.9055072 1.2430842
## 1 32 1.570282 0.9055072 1.2430842
## 1 33 1.570282 0.9055072 1.2430842
## 1 34 1.570282 0.9055072 1.2430842
## 1 35 1.570282 0.9055072 1.2430842
## 1 36 1.570282 0.9055072 1.2430842
## 1 37 1.570282 0.9055072 1.2430842
## 1 38 1.570282 0.9055072 1.2430842
## 2 2 4.327437 0.2666502 3.6133264
## 2 3 3.582260 0.5145400 2.9105565
## 2 4 2.640676 0.7227538 2.1614949
## 2 5 2.292028 0.8062647 1.8481401
## 2 6 2.253887 0.8145545 1.8022680
## 2 7 1.805725 0.8768413 1.4274473
## 2 8 1.686811 0.8952010 1.2708834
## 2 9 1.609773 0.9008126 1.2478381
## 2 10 1.508613 0.9145215 1.2062349
## 2 11 1.375676 0.9222637 1.0939575
## 2 12 1.331723 0.9263653 1.0749753
## 2 13 1.258092 0.9348329 1.0065578
## 2 14 1.206714 0.9413494 0.9764079
## 2 15 1.203725 0.9426384 0.9741912
## 2 16 1.214990 0.9416954 0.9859887
## 2 17 1.210825 0.9417940 0.9854172
## 2 18 1.210825 0.9417940 0.9854172
## 2 19 1.210825 0.9417940 0.9854172
## 2 20 1.210825 0.9417940 0.9854172
## 2 21 1.210825 0.9417940 0.9854172
## 2 22 1.210825 0.9417940 0.9854172
## 2 23 1.210825 0.9417940 0.9854172
## 2 24 1.210825 0.9417940 0.9854172
## 2 25 1.210825 0.9417940 0.9854172
## 2 26 1.210825 0.9417940 0.9854172
## 2 27 1.210825 0.9417940 0.9854172
## 2 28 1.210825 0.9417940 0.9854172
## 2 29 1.210825 0.9417940 0.9854172
## 2 30 1.210825 0.9417940 0.9854172
## 2 31 1.210825 0.9417940 0.9854172
## 2 32 1.210825 0.9417940 0.9854172
## 2 33 1.210825 0.9417940 0.9854172
## 2 34 1.210825 0.9417940 0.9854172
## 2 35 1.210825 0.9417940 0.9854172
## 2 36 1.210825 0.9417940 0.9854172
## 2 37 1.210825 0.9417940 0.9854172
## 2 38 1.210825 0.9417940 0.9854172
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
We’ll predict with the MARS model to see its RMSE on the test data.
predictMars <- predict(marsTuned, testData$x)
RMSE(predictMars, testData$y)
## [1] 1.158995
We can use the varImp function to see which features are most important for a given model. For our MARS model, the informative predictors (X1–X5) are indeed the ones selected, although X3 contributes essentially nothing to the final fit.
varImp(marsTuned)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.24
## X2 48.73
## X5 15.52
## X3 0.00
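These importances can also be viewed graphically; a minimal sketch (plot omitted here):
# Sketch: dotplot of the MARS variable importances
plot(varImp(marsTuned))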
Lastly, let’s train an XGBoost (gradient boosting) model; we can use the xgboost package in R to do so. XGBoost can be used for both regression and classification tasks.
# Train XGBoost model
xgBoostModel <- xgboost(data = as.matrix(trainingData$x),
label = as.matrix(trainingData$y),
max.depth = 2, eta = 1, nthread = 2,
nrounds = 2)
## [1] train-rmse:3.478113
## [2] train-rmse:2.661574
xgBoostModel
## ##### xgb.Booster
## raw: 5.8 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, max.depth = 2, eta = 1, nthread = 2)
## params (as set within xgb.train):
## max_depth = "2", eta = "1", nthread = "2", validate_parameters = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## cb.evaluation.log()
## # of features: 10
## niter: 2
## nfeatures : 10
## evaluation_log:
## iter train_rmse
## <num> <num>
## 1 3.478113
## 2 2.661574
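For completeness, the boosted model could be scored against the held-out Friedman data in the same way as the other models; a minimal sketch (the numeric result is not reported here):
# Sketch: test-set RMSE for the boosted model (value not reported above)
xgbPred <- predict(xgBoostModel, as.matrix(testData$x))
RMSE(xgbPred, testData$y)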
Judged purely by RMSE on the held-out test data, the MARS model performs best of the models we fit.
# Load chemical data
library(AppliedPredictiveModeling)
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
library(earth)
library(nnet)
library(ModelMetrics)
##
## Attaching package: 'ModelMetrics'
## The following objects are masked from 'package:caret':
##
## confusionMatrix, precision, recall, sensitivity, specificity
## The following object is masked from 'package:base':
##
## kappa
data("ChemicalManufacturingProcess")
chemical <- ChemicalManufacturingProcess
chemical_features <- chemical %>% dplyr::select(-c("Yield"))
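Before imputing, we could quickly check how much data is actually missing; a minimal sketch (count omitted here):
# Sketch: total number of missing values in the raw chemical data
sum(is.na(chemical))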
We’ll impute this data the same way we did in HW7, using caret’s knnImpute (which centers and scales the data as part of the k-nearest-neighbor imputation).
# Impute missing values in the chemical data via k-NN
imputed <- preProcess(chemical,
method = c("knnImpute"))
trans <- predict(imputed, chemical)
We’ll also set up a train-test split.
# Split into train and test sets
# use roughly 80% of the dataset for training and 20% for testing
sample <- sample(c(TRUE, FALSE), nrow(trans), replace=TRUE, prob=c(0.8,0.2))
train <- trans[sample, ]
train_yield <- train$Yield
train <- train %>%
dplyr::select(-c("Yield"))
test <- trans[!sample, ]
test_yield <- test$Yield
test <- test %>%
dplyr::select(-c("Yield"))
Now we can try a MARS (spline) regression model, similar to how it’s done in K&J.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTuned <- train(x = train,
y = train_yield,
method = "earth", tuneGrid =marsGrid,
trControl = trainControl(method = "cv"))
marsTuned
## Multivariate Adaptive Regression Spline
##
## 133 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 117, 120, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.8255218 0.4255589 0.6448691
## 1 3 0.7076756 0.5594258 0.5588394
## 1 4 0.6829418 0.5905764 0.5398365
## 1 5 0.6777599 0.6074488 0.5355634
## 1 6 0.6798713 0.5942309 0.5554584
## 1 7 0.6721252 0.6150362 0.5452002
## 1 8 0.6681566 0.6248951 0.5575549
## 1 9 0.7061488 0.5979535 0.5845283
## 1 10 0.6849416 0.6115068 0.5773218
## 1 11 0.6943135 0.5966026 0.5773263
## 1 12 0.6836511 0.6152858 0.5704027
## 1 13 0.6901165 0.6107945 0.5734003
## 1 14 0.7025914 0.6027601 0.5818426
## 1 15 0.7064758 0.5990417 0.5851013
## 1 16 0.7076302 0.6002581 0.5841450
## 1 17 0.7094408 0.5995345 0.5861282
## 1 18 0.7094408 0.5995345 0.5861282
## 1 19 0.7094408 0.5995345 0.5861282
## 1 20 0.7158914 0.6004276 0.5918082
## 2 2 0.8255218 0.4255589 0.6448691
## 2 3 0.7219303 0.5442372 0.5657423
## 2 4 0.6890481 0.5887808 0.5338162
## 2 5 0.7412996 0.5358115 0.5891260
## 2 6 0.7382768 0.5366446 0.5826828
## 2 7 0.7380444 0.5549002 0.5894028
## 2 8 0.7576289 0.5438524 0.5993969
## 2 9 0.7308578 0.5696927 0.5866199
## 2 10 0.7703698 0.5487598 0.5947291
## 2 11 0.7534409 0.5581414 0.5845952
## 2 12 0.7324413 0.5880692 0.5766953
## 2 13 0.7420924 0.5880776 0.5893879
## 2 14 0.7430857 0.5924334 0.5876311
## 2 15 0.7405586 0.5913932 0.5820524
## 2 16 0.7488744 0.5893842 0.5894230
## 2 17 0.7272651 0.6127237 0.5722228
## 2 18 0.7503613 0.6034496 0.5792468
## 2 19 0.7866035 0.5670338 0.6099921
## 2 20 0.8747462 0.5084571 0.6474419
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 8 and degree = 1.
We can plot our spline model to see the cross-validated RMSE across the tuning grid (number of retained terms, by product degree).
plot(marsTuned)
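We could also score this MARS fit on the held-out chemical data, mirroring how the other models are evaluated below; a minimal sketch (value not reported here):
# Sketch: test-set RMSE for the chemical-data MARS model (value not reported above)
rmse(predict(marsTuned, test), test_yield)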
# Train a support vector machine with a radial basis kernel
svmFit <- train(train, train_yield, method = "svmRadial",
preProc = c("center", "scale"), tuneLength = 15,
trControl = trainControl(method = "cv"))
svmFit$resample
## RMSE Rsquared MAE Resample
## 1 0.8283012 0.4484280 0.7332573 Fold01
## 2 0.6631408 0.6081640 0.5371689 Fold03
## 3 0.5995672 0.7343060 0.5107410 Fold09
## 4 0.8339779 0.6160227 0.5526669 Fold02
## 5 0.4961339 0.7891719 0.3857777 Fold10
## 6 0.5396032 0.7705326 0.4690446 Fold05
## 7 0.5103868 0.8048235 0.4289214 Fold06
## 8 0.6169507 0.7511327 0.4705829 Fold04
## 9 0.6631854 0.6691171 0.5466585 Fold07
## 10 0.5388635 0.6775118 0.4165972 Fold08
We can get the RMSE of our SVM regression model by comparing its predictions against the test-set values.
rmse(predict(svmFit, test), test_yield)
## [1] 0.5324566
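The kernel width and cost that caret selected for the radial SVM can be inspected directly; a minimal sketch (output omitted here):
# Sketch: tuning values chosen for the radial SVM
svmFit$bestTune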
We can also train a KNN regression model to predict the Yield output variable.
knnTune <- train(train,
train_yield,
method = "knn",
preProc = c("center", "scale"), # setting this in the model training will make it occur for testing as well
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = "cv"))
plot(knnTune)
Based on cross-validated RMSE, the ideal number of neighbors for this model is 6.
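For comparison with the SVM, the tuned KNN model could be scored on the held-out data the same way; a minimal sketch (value not reported here):
# Sketch: test-set RMSE for the tuned KNN model (value not reported above)
rmse(predict(knnTune, test), test_yield)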
Lastly, we’ll train a neural network model on our chemical manufacturing data.
nnetFit <- train(train, train_yield,
method = "nnet",
tuneLength=10,
preProc = c("center", "scale"), trace = FALSE,
trControl = trainControl(method = "cv"))
nnPred <- predict(nnetFit, test)
nnetFit$results
## size decay RMSE Rsquared MAE RMSESD RsquaredSD
## 1 1 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 2 1 0.0001000000 0.9602047 0.35370438 0.7924751 0.13091826 0.14532767
## 3 1 0.0002371374 0.9499226 0.41670304 0.7820682 0.12074066 0.16328378
## 4 1 0.0005623413 0.8694557 0.49612601 0.7036847 0.13923502 0.18220159
## 5 1 0.0013335214 0.8415267 0.54236859 0.6788265 0.12621286 0.17267371
## 6 1 0.0031622777 0.8747694 0.46345512 0.7154521 0.10029505 0.10555984
## 7 1 0.0074989421 0.8801878 0.48722454 0.7207901 0.11207299 0.17585171
## 8 1 0.0177827941 0.8584324 0.51926562 0.6996166 0.12934376 0.16808123
## 9 1 0.0421696503 0.8759545 0.48453574 0.7177421 0.11902629 0.13898627
## 10 1 0.1000000000 0.8789621 0.48627958 0.7193764 0.13055838 0.13706046
## 11 3 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 12 3 0.0001000000 0.8851309 0.43581928 0.7255324 0.15429023 0.18578804
## 13 3 0.0002371374 0.9232621 0.44065404 0.7628656 0.17176893 0.19586877
## 14 3 0.0005623413 0.8798274 0.47550108 0.7181215 0.12910418 0.15554495
## 15 3 0.0013335214 0.8636654 0.52667724 0.7064073 0.09982384 0.14880142
## 16 3 0.0031622777 0.8445232 0.57487775 0.6793834 0.11292242 0.17136834
## 17 3 0.0074989421 0.8631240 0.49705203 0.7028483 0.11998867 0.14598445
## 18 3 0.0177827941 0.8498644 0.56946524 0.6893010 0.12048441 0.17287164
## 19 3 0.0421696503 0.8501513 0.54926951 0.6951087 0.12100173 0.18657704
## 20 3 0.1000000000 0.8587813 0.56454381 0.7033117 0.12326735 0.14980901
## 21 5 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 22 5 0.0001000000 0.9046827 0.54277513 0.7327564 0.18659897 0.24505477
## 23 5 0.0002371374 0.8989956 0.49684758 0.7337956 0.11807850 0.13631450
## 24 5 0.0005623413 0.8544478 0.53633134 0.6978395 0.14383473 0.18023297
## 25 5 0.0013335214 0.8499765 0.52747507 0.6922167 0.12069685 0.11569156
## 26 5 0.0031622777 0.8799698 0.49321847 0.7148081 0.12312564 0.21166670
## 27 5 0.0074989421 0.8306857 0.59491413 0.6729338 0.10853109 0.15489434
## 28 5 0.0177827941 0.8476035 0.56172126 0.6895430 0.13594806 0.17898397
## 29 5 0.0421696503 0.8506182 0.56302080 0.6933603 0.12800920 0.17785392
## 30 5 0.1000000000 0.8493008 0.56866416 0.6958157 0.12296593 0.17583374
## 31 7 0.0000000000 1.0446689 0.26118383 0.8647718 0.11687753 NA
## 32 7 0.0001000000 0.8758914 0.43401605 0.7052874 0.13718720 0.24252946
## 33 7 0.0002371374 0.8612501 0.50130019 0.7051495 0.13216836 0.19679401
## 34 7 0.0005623413 0.8665464 0.49691632 0.7058923 0.12099405 0.20848006
## 35 7 0.0013335214 0.8889666 0.45768304 0.7225195 0.12498205 0.15221620
## 36 7 0.0031622777 0.8450675 0.53785732 0.6908665 0.10777355 0.17835990
## 37 7 0.0074989421 0.8450696 0.55615284 0.6919681 0.11285030 0.16789082
## 38 7 0.0177827941 0.8473481 0.56596774 0.6920727 0.11814933 0.15178101
## 39 7 0.0421696503 0.8516854 0.54866716 0.6991686 0.12458335 0.17987535
## 40 7 0.1000000000 0.8532175 0.55598530 0.6995990 0.11531984 0.16347045
## 41 9 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 42 9 0.0001000000 0.9053400 0.47018450 0.7376198 0.12340510 0.18174748
## 43 9 0.0002371374 0.8581339 0.50279175 0.6945128 0.12360396 0.20117339
## 44 9 0.0005623413 0.8682516 0.47743254 0.7052188 0.10866332 0.15761384
## 45 9 0.0013335214 0.8487841 0.54335152 0.6923365 0.11285926 0.21172703
## 46 9 0.0031622777 0.8653819 0.50006243 0.7039278 0.13418057 0.17381697
## 47 9 0.0074989421 0.8614632 0.53728306 0.7051822 0.11501599 0.14209049
## 48 9 0.0177827941 0.8438667 0.57239374 0.6875099 0.12070424 0.19107887
## 49 9 0.0421696503 0.8475202 0.56139811 0.6914204 0.12451399 0.18518574
## 50 9 0.1000000000 0.8479317 0.56555419 0.6929261 0.12062881 0.17474524
## 51 11 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 52 11 0.0001000000 0.8863797 0.43661659 0.7174792 0.13482035 0.20425531
## 53 11 0.0002371374 0.8639678 0.49298862 0.6939276 0.08650186 0.14487318
## 54 11 0.0005623413 0.8305483 0.57425068 0.6703282 0.10233630 0.12548986
## 55 11 0.0013335214 0.8684857 0.50391917 0.7040714 0.10635001 0.13048214
## 56 11 0.0031622777 0.8417667 0.57740132 0.6771116 0.10407699 0.15965129
## 57 11 0.0074989421 0.8558486 0.51574617 0.7047413 0.11503821 0.16985105
## 58 11 0.0177827941 0.8494980 0.56452899 0.6975150 0.12409607 0.21118983
## 59 11 0.0421696503 0.8500332 0.55671838 0.6948788 0.12375800 0.17375419
## 60 11 0.1000000000 0.8510765 0.54909167 0.6974395 0.11635677 0.15479243
## 61 13 0.0000000000 1.0446688 0.03838475 0.8647717 0.11687763 NA
## 62 13 0.0001000000 0.8881632 0.44651960 0.7318770 0.14900779 0.16602109
## 63 13 0.0002371374 0.8556257 0.50977811 0.6939578 0.13067559 0.18075079
## 64 13 0.0005623413 0.8731109 0.45175150 0.7143269 0.10868515 0.08216590
## 65 13 0.0013335214 0.8568762 0.53019163 0.6932179 0.10799738 0.15061463
## 66 13 0.0031622777 0.8352275 0.59196217 0.6761294 0.10503114 0.09079615
## 67 13 0.0074989421 0.8263626 0.59095798 0.6707033 0.11344133 0.15139356
## 68 13 0.0177827941 0.8561148 0.54886071 0.7030678 0.12672822 0.16576423
## 69 13 0.0421696503 0.8564260 0.52442336 0.6996245 0.11112637 0.15545819
## 70 13 0.1000000000 0.8495124 0.56051409 0.6935039 0.11689193 0.18305543
## 71 15 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 72 15 0.0001000000 0.8499036 0.53533955 0.6900113 0.13009169 0.21611916
## 73 15 0.0002371374 0.8676816 0.51175672 0.7039072 0.13775367 0.15976834
## 74 15 0.0005623413 0.8276822 0.57173349 0.6618641 0.10675721 0.14457227
## 75 15 0.0013335214 0.8320246 0.59444792 0.6721161 0.12096892 0.13185455
## 76 15 0.0031622777 0.8233850 0.60369290 0.6685152 0.10945503 0.13689580
## 77 15 0.0074989421 0.8428127 0.57053362 0.6901207 0.11145676 0.17261132
## 78 15 0.0177827941 0.8426123 0.57641734 0.6884722 0.12346335 0.19082474
## 79 15 0.0421696503 0.8487272 0.55848547 0.6939770 0.11777717 0.18292071
## 80 15 0.1000000000 0.8635644 0.53296657 0.7079445 0.11293195 0.15492395
## 81 17 0.0000000000 NaN NaN NaN NA NA
## 82 17 0.0001000000 NaN NaN NaN NA NA
## 83 17 0.0002371374 NaN NaN NaN NA NA
## 84 17 0.0005623413 NaN NaN NaN NA NA
## 85 17 0.0013335214 NaN NaN NaN NA NA
## 86 17 0.0031622777 NaN NaN NaN NA NA
## 87 17 0.0074989421 NaN NaN NaN NA NA
## 88 17 0.0177827941 NaN NaN NaN NA NA
## 89 17 0.0421696503 NaN NaN NaN NA NA
## 90 17 0.1000000000 NaN NaN NaN NA NA
## 91 19 0.0000000000 NaN NaN NaN NA NA
## 92 19 0.0001000000 NaN NaN NaN NA NA
## 93 19 0.0002371374 NaN NaN NaN NA NA
## 94 19 0.0005623413 NaN NaN NaN NA NA
## 95 19 0.0013335214 NaN NaN NaN NA NA
## 96 19 0.0031622777 NaN NaN NaN NA NA
## 97 19 0.0074989421 NaN NaN NaN NA NA
## 98 19 0.0177827941 NaN NaN NaN NA NA
## 99 19 0.0421696503 NaN NaN NaN NA NA
## 100 19 0.1000000000 NaN NaN NaN NA NA
## MAESD
## 1 0.08179069
## 2 0.09489705
## 3 0.10224020
## 4 0.09996843
## 5 0.08066147
## 6 0.07348494
## 7 0.06944262
## 8 0.09017337
## 9 0.07262601
## 10 0.09171220
## 11 0.08179069
## 12 0.09257861
## 13 0.13249873
## 14 0.07640050
## 15 0.06140349
## 16 0.06181548
## 17 0.08826072
## 18 0.07627891
## 19 0.08372594
## 20 0.08515295
## 21 0.08179069
## 22 0.10681852
## 23 0.09903434
## 24 0.08349104
## 25 0.07196408
## 26 0.08392403
## 27 0.07088956
## 28 0.08362451
## 29 0.08537633
## 30 0.08441758
## 31 0.08179069
## 32 0.08430184
## 33 0.09400697
## 34 0.08130073
## 35 0.07973096
## 36 0.06914395
## 37 0.08214588
## 38 0.07643855
## 39 0.08817356
## 40 0.07436564
## 41 0.08179069
## 42 0.10362609
## 43 0.09640021
## 44 0.07572124
## 45 0.06970062
## 46 0.08666348
## 47 0.06942885
## 48 0.07986951
## 49 0.08811353
## 50 0.08145392
## 51 0.08179069
## 52 0.07795344
## 53 0.06262777
## 54 0.05374182
## 55 0.07082103
## 56 0.06149358
## 57 0.08325841
## 58 0.07992801
## 59 0.08491389
## 60 0.07618963
## 61 0.08179083
## 62 0.09293326
## 63 0.08156012
## 64 0.06745824
## 65 0.06255872
## 66 0.04168713
## 67 0.06879100
## 68 0.08115662
## 69 0.07811210
## 70 0.07726643
## 71 0.08179069
## 72 0.07880120
## 73 0.09582226
## 74 0.04938747
## 75 0.06542059
## 76 0.06247051
## 77 0.07361516
## 78 0.08266805
## 79 0.07930470
## 80 0.07686923
## 81 NA
## 82 NA
## 83 NA
## 84 NA
## 85 NA
## 86 NA
## 87 NA
## 88 NA
## 89 NA
## 90 NA
## 91 NA
## 92 NA
## 93 NA
## 94 NA
## 95 NA
## 96 NA
## 97 NA
## 98 NA
## 99 NA
## 100 NA
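The test-set predictions computed above as nnPred can be scored the same way as the other chemical models; a minimal sketch (value not reported here):
# Sketch: test-set RMSE for the neural network predictions (value not reported above)
rmse(nnPred, test_yield)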
Let’s use the varImp function to see which feature variables in our fit are most consequential.
(importance <- varImp(nnetFit))
## nnet variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess36 59.73
## ManufacturingProcess28 52.69
## ManufacturingProcess39 49.31
## ManufacturingProcess24 48.18
## ManufacturingProcess13 41.22
## ManufacturingProcess22 39.64
## ManufacturingProcess37 39.31
## ManufacturingProcess35 35.03
## ManufacturingProcess12 34.99
## ManufacturingProcess04 32.40
## ManufacturingProcess05 31.69
## ManufacturingProcess33 31.53
## ManufacturingProcess01 31.46
## ManufacturingProcess03 30.67
## BiologicalMaterial10 29.40
## BiologicalMaterial02 27.86
## ManufacturingProcess08 27.54
## ManufacturingProcess43 26.51
## ManufacturingProcess07 26.24
The most important variable for our neural network is ManufacturingProcess32. In fact, the manufacturing process variables dominate the top 10 most important predictors for this fit. This mirrors the linear models trained in Exercise 6.3, where the manufacturing process variables also carried more influence.
One way we can visualize the relationships among the most important predictors and the response is with a correlogram of those features alongside Yield.
# Extract the importance scores and sort them
importance <- importance$importance
importance$feature <- rownames(importance)
importance <- importance[order(importance$Overall, decreasing=TRUE), ]
# Take the top-10 feature names
important_features <- rownames(head(importance, 10))
# Create a correlogram of the top features and Yield from the imputed chemical data
correlation <- cor(imputed$data[, c(important_features, "Yield")])
corrplot(correlation)
We see some stronger correlations between certain feature variables; for instance, manufacturing processes 9 and 13 are strongly negatively correlated. Overall, the one biological material variable doesn’t correlate strongly with most of the manufacturing processes, which makes sense given that these measure quite different parts of the process.
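To dig into any one of these relationships, we could also plot the single most important predictor directly against Yield; a minimal sketch on the imputed (centered and scaled) data, plot omitted here:
# Sketch: top predictor vs. Yield on the imputed, scaled chemical data
ggplot(trans, aes(x = ManufacturingProcess32, y = Yield)) + geom_point() + geom_smooth(method = "lm")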