Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.
Exercise 7.2 Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\]
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)   # mlbench.friedman1()
library(caret)     # train(), featurePlot(), RMSE()
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)  # convert the 'x' matrix to a data frame
testData <- mlbench.friedman1(5000, sd = 1)   # large simulated test set used for the test-set RMSEs below
testData$x <- data.frame(testData$x)
featurePlot(trainingData$x, trainingData$y)
Before we model, here is a summary of both the training features and the test data, plotted via ggpairs.
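The summary code isn't reproduced above; a minimal sketch, assuming the GGally package for ggpairs, might look like:
library(GGally)          # ggpairs()
summary(trainingData$x)  # summaries of the training features
summary(testData$x)      # summaries of the test features
ggpairs(trainingData$x)  # pairwise plots of the training features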
While the scatter plots are a bit busy, some of the features look roughly normally distributed. I will specify a tuneLength of 10 across this exercise to standardize the tuning.
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"))
knnModel
k-Nearest Neighbors
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 3.466085 0.5121775 2.816838
7 3.349428 0.5452823 2.727410
9 3.264276 0.5785990 2.660026
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 9.
Now we'll predict against our test data using the KNN model and print out the root mean squared error (RMSE), a diagnostic metric we can use to evaluate the efficacy of a regression model.
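A sketch of the prediction step, using caret's RMSE(pred, obs) helper (the same pattern as the tree prediction below):
knnPred <- predict(knnModel, newdata = testData$x)
RMSE(knnPred, testData$y)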
[1] 3.117232
Let's train a decision tree on our data as well. Decision trees can be robust models, but are often prone to overfitting on the training data.
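The training call isn't shown above; a sketch consistent with the CART output that follows (method "rpart", a tuneLength of 10, and caret's default bootstrap resampling) would be:
decisionTreeModel <- train(x = trainingData$x,
                           y = trainingData$y,
                           method = "rpart",
                           tuneLength = 10)
decisionTreeModel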
CART
200 samples
10 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.01228972 3.678817 0.4809688 2.990114
0.01387981 3.686479 0.4774770 2.997159
0.01429458 3.695995 0.4747507 3.007059
0.02618043 3.712253 0.4625909 3.033527
0.03515931 3.792122 0.4356117 3.101091
0.05860160 3.983127 0.3854415 3.270421
0.06528495 4.061607 0.3651742 3.339100
0.07513824 4.135943 0.3429854 3.392170
0.20070359 4.596144 0.1871626 3.794009
0.25672400 4.706893 0.1682766 3.883181
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01228972.
# Predict using decision tree
predictTree <- predict(decisionTreeModel, testData$x)
RMSE(predictTree, testData$y)
[1] 3.381563
We’ll also train a MARS model similar to the method employed in K&J.
# Train spline model (MARS)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))
Multivariate Adaptive Regression Spline
200 samples
10 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
Resampling results across tuning parameters:
degree nprune RMSE Rsquared MAE
1 2 4.327437 0.2666502 3.6133264
1 3 3.582260 0.5145400 2.9105565
1 4 2.640676 0.7227538 2.1614949
1 5 2.292028 0.8062647 1.8481401
1 6 2.258999 0.8151057 1.7940661
1 7 1.810340 0.8740701 1.4166846
1 8 1.701523 0.8897982 1.3223375
1 9 1.629183 0.8960408 1.2682728
1 10 1.660758 0.8964917 1.3076894
1 11 1.557953 0.9054063 1.2285821
1 12 1.561823 0.9051019 1.2479219
1 13 1.571426 0.9049790 1.2474697
1 14 1.570282 0.9055072 1.2430842
1 15 1.570282 0.9055072 1.2430842
1 16 1.570282 0.9055072 1.2430842
1 17 1.570282 0.9055072 1.2430842
1 18 1.570282 0.9055072 1.2430842
1 19 1.570282 0.9055072 1.2430842
1 20 1.570282 0.9055072 1.2430842
1 21 1.570282 0.9055072 1.2430842
1 22 1.570282 0.9055072 1.2430842
1 23 1.570282 0.9055072 1.2430842
1 24 1.570282 0.9055072 1.2430842
1 25 1.570282 0.9055072 1.2430842
1 26 1.570282 0.9055072 1.2430842
1 27 1.570282 0.9055072 1.2430842
1 28 1.570282 0.9055072 1.2430842
1 29 1.570282 0.9055072 1.2430842
1 30 1.570282 0.9055072 1.2430842
1 31 1.570282 0.9055072 1.2430842
1 32 1.570282 0.9055072 1.2430842
1 33 1.570282 0.9055072 1.2430842
1 34 1.570282 0.9055072 1.2430842
1 35 1.570282 0.9055072 1.2430842
1 36 1.570282 0.9055072 1.2430842
1 37 1.570282 0.9055072 1.2430842
1 38 1.570282 0.9055072 1.2430842
2 2 4.327437 0.2666502 3.6133264
2 3 3.582260 0.5145400 2.9105565
2 4 2.640676 0.7227538 2.1614949
2 5 2.292028 0.8062647 1.8481401
2 6 2.253887 0.8145545 1.8022680
2 7 1.805725 0.8768413 1.4274473
2 8 1.686811 0.8952010 1.2708834
2 9 1.609773 0.9008126 1.2478381
2 10 1.508613 0.9145215 1.2062349
2 11 1.375676 0.9222637 1.0939575
2 12 1.331723 0.9263653 1.0749753
2 13 1.258092 0.9348329 1.0065578
2 14 1.206714 0.9413494 0.9764079
2 15 1.203725 0.9426384 0.9741912
2 16 1.214990 0.9416954 0.9859887
2 17 1.210825 0.9417940 0.9854172
2 18 1.210825 0.9417940 0.9854172
2 19 1.210825 0.9417940 0.9854172
2 20 1.210825 0.9417940 0.9854172
2 21 1.210825 0.9417940 0.9854172
2 22 1.210825 0.9417940 0.9854172
2 23 1.210825 0.9417940 0.9854172
2 24 1.210825 0.9417940 0.9854172
2 25 1.210825 0.9417940 0.9854172
2 26 1.210825 0.9417940 0.9854172
2 27 1.210825 0.9417940 0.9854172
2 28 1.210825 0.9417940 0.9854172
2 29 1.210825 0.9417940 0.9854172
2 30 1.210825 0.9417940 0.9854172
2 31 1.210825 0.9417940 0.9854172
2 32 1.210825 0.9417940 0.9854172
2 33 1.210825 0.9417940 0.9854172
2 34 1.210825 0.9417940 0.9854172
2 35 1.210825 0.9417940 0.9854172
2 36 1.210825 0.9417940 0.9854172
2 37 1.210825 0.9417940 0.9854172
2 38 1.210825 0.9417940 0.9854172
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 15 and degree = 2.
We'll predict using the MARS model to see the RMSE on the testing data.
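A sketch of the prediction step:
marsPred <- predict(marsTuned, newdata = testData$x)
RMSE(marsPred, testData$y)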
[1] 1.158995
We can use the varImp function to see which features are most important for a given model. In the case of our MARS model, we can see that the informative predictors (X1 through X5) are in fact the ones selected, although X3 contributes little to the final model.
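A sketch of the call on the tuned MARS object:
varImp(marsTuned)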
earth variable importance
Overall
X1 100.00
X4 75.24
X2 48.73
X5 15.52
X3 0.00
Lastly, let's train an XGBoost (gradient boosting) model using the xgboost library. XGBoost can be used for both regression and classification tasks.
library(xgboost)
# Train XGBoost model
xgboostModel <- xgboost(data = as.matrix(trainingData$x),
                        label = as.matrix(trainingData$y),
                        max.depth = 2, eta = 1, nthread = 2,
                        nrounds = 2)
[1] train-rmse:3.478113
[2] train-rmse:2.661574
##### xgb.Booster
raw: 5.8 Kb
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds,
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
early_stopping_rounds = early_stopping_rounds, maximize = maximize,
save_period = save_period, save_name = save_name, xgb_model = xgb_model,
callbacks = callbacks, max.depth = 2, eta = 1, nthread = 2)
params (as set within xgb.train):
max_depth = "2", eta = "1", nthread = "2", validate_parameters = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
cb.evaluation.log()
# of features: 10
niter: 2
nfeatures : 10
evaluation_log:
iter train_rmse
<num> <num>
1 3.478113
2 2.661574
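To put XGBoost on the same footing as the other models, a test-set prediction along these lines could be used (a sketch; the resulting RMSE is not reproduced in this write-up):
xgbPred <- predict(xgboostModel, as.matrix(testData$x))
RMSE(xgbPred, testData$y)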
In terms of test-set RMSE, the MARS model gives the best performance.
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Which nonlinear regression model gives the optimal resampling and test set performance?
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
library(AppliedPredictiveModeling)
library(kernlab)
library(earth)
library(nnet)
library(ModelMetrics)
library(dplyr)     # select() and the pipe
library(corrplot)  # corrplot(), used at the end
data("ChemicalManufacturingProcess")
chemical <- ChemicalManufacturingProcess
chemical_features <- chemical %>% dplyr::select(-c("Yield"))
We'll impute the missing values in this data set.
# Impute chemical yield data
imputed <- preProcess(chemical,
                      method = c("knnImpute"))
trans <- predict(imputed, chemical)
We'll also set up a train/test split.
# Split into train and test sets
# Use 80% of the dataset as the training set and 20% as the test set
sample <- sample(c(TRUE, FALSE), nrow(trans), replace = TRUE, prob = c(0.8, 0.2))
train <- trans[sample, ]
train_yield <- train$Yield
train <- train %>%
dplyr::select(-c("Yield"))
test <- trans[!sample, ]
test_yield <- test$Yield
test <- test %>%
dplyr::select(-c("Yield"))
Now we can try a spline regression (MARS) model.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTuned <- train(x = train,
                   y = train_yield,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))
marsTuned
Multivariate Adaptive Regression Spline
133 samples
57 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 120, 117, 120, 120, 120, 120, ...
Resampling results across tuning parameters:
degree nprune RMSE Rsquared MAE
1 2 0.8255218 0.4255589 0.6448691
1 3 0.7076756 0.5594258 0.5588394
1 4 0.6829418 0.5905764 0.5398365
1 5 0.6777599 0.6074488 0.5355634
1 6 0.6798713 0.5942309 0.5554584
1 7 0.6721252 0.6150362 0.5452002
1 8 0.6681566 0.6248951 0.5575549
1 9 0.7061488 0.5979535 0.5845283
1 10 0.6849416 0.6115068 0.5773218
1 11 0.6943135 0.5966026 0.5773263
1 12 0.6836511 0.6152858 0.5704027
1 13 0.6901165 0.6107945 0.5734003
1 14 0.7025914 0.6027601 0.5818426
1 15 0.7064758 0.5990417 0.5851013
1 16 0.7076302 0.6002581 0.5841450
1 17 0.7094408 0.5995345 0.5861282
1 18 0.7094408 0.5995345 0.5861282
1 19 0.7094408 0.5995345 0.5861282
1 20 0.7158914 0.6004276 0.5918082
2 2 0.8255218 0.4255589 0.6448691
2 3 0.7219303 0.5442372 0.5657423
2 4 0.7094490 0.5630614 0.5499513
2 5 0.7568039 0.5173411 0.5988227
2 6 0.7666427 0.5050106 0.6113483
2 7 0.7718170 0.5136246 0.6157548
2 8 0.7616770 0.5369341 0.6055975
2 9 0.7491831 0.5503402 0.5973027
2 10 0.7792853 0.5390793 0.6137423
2 11 0.7555746 0.5448990 0.5931787
2 12 0.7359925 0.5736795 0.5857371
2 13 0.7504894 0.5659826 0.6066187
2 14 0.7427183 0.5839756 0.5946789
2 15 0.7475103 0.5738970 0.5958920
2 16 0.7503278 0.5787836 0.5877399
2 17 0.7313502 0.5970206 0.5722539
2 18 0.7520954 0.5879849 0.5781150
2 19 0.7739194 0.5633356 0.5883682
2 20 0.8671464 0.5073033 0.6329745
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 8 and degree = 1.
We can plot our spline model to see the resulting errors over the cross-validation folds.
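A minimal sketch of that plot, using caret's plot method for train objects:
plot(marsTuned)  # resampled RMSE across nprune values, one curve per degree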
# Train a support vector machine (radial kernel)
svmFit <- train(train, train_yield, method = "svmRadial",
                preProc = c("center", "scale"), tuneLength = 15,
                trControl = trainControl(method = "cv"))
svmFit$resample
RMSE Rsquared MAE Resample
1 0.8283012 0.4484280 0.7332573 Fold01
2 0.6631408 0.6081640 0.5371689 Fold03
3 0.5995672 0.7343060 0.5107410 Fold09
4 0.8339779 0.6160227 0.5526669 Fold02
5 0.4961339 0.7891719 0.3857777 Fold10
6 0.5396032 0.7705326 0.4690446 Fold05
7 0.5103868 0.8048235 0.4289214 Fold06
8 0.6169507 0.7511327 0.4705829 Fold04
9 0.6631854 0.6691171 0.5466585 Fold07
10 0.5388635 0.6775118 0.4165972 Fold08
We can get the RMSE of our SVM regression model by comparing the predicted values against the testing values.
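A sketch of the prediction step, assuming ModelMetrics' rmse(actual, predicted) (loaded above):
svmPred <- predict(svmFit, newdata = test)
rmse(test_yield, svmPred)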
[1] 0.5324566
We can also train a KNN regression model to predict the Yield output variable.
knnTune <- train(train,
                 train_yield,
                 method = "knn",
                 preProc = c("center", "scale"), # setting this in the model training will make it occur for testing as well
                 tuneGrid = data.frame(.k = 1:20),
                 trControl = trainControl(method = "cv"))
Based on RMSE as our metric, the ideal number of neighbors for this model is k = 6.
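A sketch of how the chosen k and the test-set error can be checked (the test RMSE is not reproduced here):
knnTune$bestTune                       # the selected number of neighbors
knnPred <- predict(knnTune, newdata = test)
rmse(test_yield, knnPred)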
Lastly, we'll train a neural network model on our chemical processing data.
nnetFit <- train(train, train_yield,
                 method = "nnet",
                 tuneLength = 10,
                 preProc = c("center", "scale"),
                 trace = FALSE,
                 trControl = trainControl(method = "cv"))
size decay RMSE Rsquared MAE RMSESD RsquaredSD
1 1 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
2 1 0.0001000000 0.9542694 0.36453847 0.7813614 0.12680347 0.13445536
3 1 0.0002371374 0.9466420 0.43031975 0.7839454 0.11656901 0.17561625
4 1 0.0005623413 0.8684038 0.48985394 0.7026930 0.13789603 0.18213195
5 1 0.0013335214 0.8446965 0.54068892 0.6811947 0.11696280 0.17284932
6 1 0.0031622777 0.8645272 0.48673833 0.7052410 0.10097049 0.11720248
7 1 0.0074989421 0.8715234 0.50536294 0.7069608 0.13550898 0.17086414
8 1 0.0177827941 0.8583479 0.51724788 0.6969092 0.13224314 0.17456496
9 1 0.0421696503 0.8761372 0.48402230 0.7178496 0.11891438 0.13867091
10 1 0.1000000000 0.8789617 0.48628096 0.7193759 0.13055756 0.13705716
11 3 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
12 3 0.0001000000 0.8847194 0.44303924 0.7232908 0.15575958 0.18651356
13 3 0.0002371374 0.9136125 0.44481984 0.7599729 0.16114540 0.18413129
14 3 0.0005623413 0.8760403 0.48277974 0.7125063 0.13565689 0.15885110
15 3 0.0013335214 0.8759353 0.48344943 0.7079560 0.10010669 0.13028271
16 3 0.0031622777 0.8608407 0.54393529 0.6994648 0.12452724 0.17253287
17 3 0.0074989421 0.8762684 0.46127663 0.7152433 0.09785980 0.14163309
18 3 0.0177827941 0.8519816 0.56159612 0.6916056 0.12087608 0.17582455
19 3 0.0421696503 0.8501467 0.54940934 0.6950391 0.12099830 0.18651991
20 3 0.1000000000 0.8587715 0.56455783 0.7032994 0.12328505 0.14981945
21 5 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
22 5 0.0001000000 0.9057045 0.53932126 0.7351706 0.18596180 0.24244690
23 5 0.0002371374 0.8996320 0.48648469 0.7310962 0.11706869 0.15005490
24 5 0.0005623413 0.8565708 0.51215030 0.6905188 0.12265131 0.20342865
25 5 0.0013335214 0.8368429 0.56123984 0.6835676 0.11553301 0.12548573
26 5 0.0031622777 0.8804294 0.47370808 0.7119672 0.11455998 0.17370900
27 5 0.0074989421 0.8301849 0.59217979 0.6717010 0.10836512 0.15749782
28 5 0.0177827941 0.8476145 0.56158473 0.6895417 0.13596245 0.17881923
29 5 0.0421696503 0.8506137 0.56304234 0.6933575 0.12801265 0.17787857
30 5 0.1000000000 0.8493008 0.56866416 0.6958157 0.12296593 0.17583374
31 7 0.0000000000 1.0446689 0.26118383 0.8647718 0.11687753 NA
32 7 0.0001000000 0.8574782 0.47752734 0.6865403 0.12909445 0.23294308
33 7 0.0002371374 0.8603531 0.50368662 0.7014791 0.13061026 0.19354097
34 7 0.0005623413 0.8615605 0.50853759 0.7058293 0.12289662 0.19176478
35 7 0.0013335214 0.8796102 0.47752190 0.7025399 0.11890052 0.12476454
36 7 0.0031622777 0.8307254 0.57435029 0.6742875 0.10324503 0.18105575
37 7 0.0074989421 0.8476169 0.55170099 0.6945817 0.11314152 0.16246766
38 7 0.0177827941 0.8471812 0.56574926 0.6916247 0.11829469 0.15095870
39 7 0.0421696503 0.8516917 0.54865887 0.6991762 0.12458225 0.17986628
40 7 0.1000000000 0.8532175 0.55598530 0.6995990 0.11531984 0.16347045
41 9 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
42 9 0.0001000000 0.9043466 0.47150143 0.7388232 0.11826587 0.17845483
43 9 0.0002371374 0.8582291 0.50466915 0.6948912 0.12357845 0.19970104
44 9 0.0005623413 0.8694690 0.47886213 0.7030061 0.12053769 0.15717109
45 9 0.0013335214 0.8514352 0.54738913 0.6947003 0.11039419 0.21514490
46 9 0.0031622777 0.8639353 0.50400283 0.7044491 0.12396283 0.14768604
47 9 0.0074989421 0.8594343 0.54137496 0.7035777 0.11562709 0.14161259
48 9 0.0177827941 0.8414491 0.57866012 0.6862069 0.12019540 0.19730131
49 9 0.0421696503 0.8475194 0.56140092 0.6914190 0.12451328 0.18518926
50 9 0.1000000000 0.8479317 0.56555419 0.6929261 0.12062882 0.17474524
51 11 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
52 11 0.0001000000 0.9032802 0.41806037 0.7403833 0.16119939 0.23182849
53 11 0.0002371374 0.8566433 0.50962852 0.6845412 0.08235548 0.11708827
54 11 0.0005623413 0.8285419 0.58219777 0.6707848 0.10266987 0.11564880
55 11 0.0013335214 0.8591061 0.52244560 0.7030786 0.12167971 0.18321771
56 11 0.0031622777 0.8404660 0.58820246 0.6762268 0.10308377 0.15174876
57 11 0.0074989421 0.8558811 0.51505994 0.7042372 0.11462365 0.16931431
58 11 0.0177827941 0.8508009 0.56090913 0.6988181 0.12645181 0.21002875
59 11 0.0421696503 0.8500023 0.55680391 0.6948443 0.12377467 0.17385419
60 11 0.1000000000 0.8510770 0.54908798 0.6974404 0.11635630 0.15478813
61 13 0.0000000000 1.0446688 0.03838475 0.8647717 0.11687763 NA
62 13 0.0001000000 0.8870892 0.45962179 0.7270453 0.14429119 0.16362521
63 13 0.0002371374 0.8604686 0.49821858 0.6989204 0.12953505 0.17088320
64 13 0.0005623413 0.8608903 0.49233464 0.6993091 0.11269278 0.15447609
65 13 0.0013335214 0.8463950 0.55950500 0.6865408 0.11416761 0.16021417
66 13 0.0031622777 0.8402320 0.57954980 0.6784971 0.11309901 0.09686417
67 13 0.0074989421 0.8313959 0.57538517 0.6777195 0.11556793 0.15946094
68 13 0.0177827941 0.8557499 0.55001110 0.7026915 0.12692687 0.16642893
69 13 0.0421696503 0.8565759 0.52386043 0.6997531 0.11113101 0.15537879
70 13 0.1000000000 0.8495124 0.56051409 0.6935039 0.11689193 0.18305543
71 15 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
72 15 0.0001000000 0.8542083 0.52074764 0.6957404 0.12852321 0.22950819
73 15 0.0002371374 0.8694178 0.51607471 0.7108593 0.13644878 0.17359257
74 15 0.0005623413 0.8315018 0.57213846 0.6674878 0.09834591 0.13554296
75 15 0.0013335214 0.8455483 0.56258270 0.6779581 0.09967033 0.13183626
76 15 0.0031622777 0.8215251 0.60574189 0.6640445 0.10868630 0.13219710
77 15 0.0074989421 0.8430867 0.56370989 0.6901568 0.10724816 0.16527756
78 15 0.0177827941 0.8426215 0.57658867 0.6883842 0.12346134 0.19099020
79 15 0.0421696503 0.8487464 0.55842812 0.6940051 0.11773867 0.18295017
80 15 0.1000000000 0.8635650 0.53296557 0.7079452 0.11293156 0.15492506
81 17 0.0000000000 NaN NaN NaN NA NA
82 17 0.0001000000 NaN NaN NaN NA NA
83 17 0.0002371374 NaN NaN NaN NA NA
84 17 0.0005623413 NaN NaN NaN NA NA
85 17 0.0013335214 NaN NaN NaN NA NA
86 17 0.0031622777 NaN NaN NaN NA NA
87 17 0.0074989421 NaN NaN NaN NA NA
88 17 0.0177827941 NaN NaN NaN NA NA
89 17 0.0421696503 NaN NaN NaN NA NA
90 17 0.1000000000 NaN NaN NaN NA NA
91 19 0.0000000000 NaN NaN NaN NA NA
92 19 0.0001000000 NaN NaN NaN NA NA
93 19 0.0002371374 NaN NaN NaN NA NA
94 19 0.0005623413 NaN NaN NaN NA NA
95 19 0.0013335214 NaN NaN NaN NA NA
96 19 0.0031622777 NaN NaN NaN NA NA
97 19 0.0074989421 NaN NaN NaN NA NA
98 19 0.0177827941 NaN NaN NaN NA NA
99 19 0.0421696503 NaN NaN NaN NA NA
100 19 0.1000000000 NaN NaN NaN NA NA
MAESD
1 0.08179069
2 0.09057422
3 0.09096805
4 0.09828508
5 0.08010935
6 0.07460541
7 0.09438018
8 0.08859903
9 0.07266111
10 0.09171134
11 0.08179069
12 0.09504531
13 0.11965903
14 0.07498538
15 0.06373285
16 0.06775019
17 0.07355006
18 0.07886260
19 0.08367477
20 0.08517409
21 0.08179069
22 0.10685804
23 0.09948579
24 0.08028935
25 0.06367827
26 0.08156457
27 0.07301548
28 0.08369853
29 0.08537643
30 0.08441758
31 0.08179069
32 0.07856738
33 0.09525400
34 0.07709966
35 0.06512952
36 0.05739402
37 0.08031130
38 0.07680723
39 0.08817776
40 0.07436564
41 0.08179069
42 0.09815068
43 0.09695200
44 0.07446604
45 0.06897606
46 0.07737957
47 0.06402453
48 0.07901474
49 0.08811272
50 0.08145392
51 0.08179069
52 0.10536066
53 0.05325120
54 0.05489229
55 0.06970357
56 0.06044404
57 0.08363492
58 0.07910548
59 0.08491888
60 0.07618946
61 0.08179083
62 0.09114902
63 0.08495608
64 0.07093670
65 0.06179021
66 0.05818094
67 0.07714087
68 0.08122788
69 0.07819142
70 0.07726643
71 0.08179069
72 0.08113766
73 0.08894894
74 0.04757259
75 0.05831447
76 0.06865267
77 0.06683151
78 0.08264085
79 0.07928959
80 0.07686953
81 NA
82 NA
83 NA
84 NA
85 NA
86 NA
87 NA
88 NA
89 NA
90 NA
91 NA
92 NA
93 NA
94 NA
95 NA
96 NA
97 NA
98 NA
99 NA
100 NA
Let's use the varImp function to see which feature variables in our fit are most consequential.
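A sketch of the call, storing the result as importance since the correlation code further down references that object:
importance <- varImp(nnetFit)
importance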
nnet variable importance
only 20 most important variables shown (out of 57)
Overall
ManufacturingProcess32 100.00
ManufacturingProcess36 62.10
ManufacturingProcess28 53.94
ManufacturingProcess39 46.50
ManufacturingProcess24 45.11
ManufacturingProcess13 43.72
ManufacturingProcess12 40.63
ManufacturingProcess05 39.84
ManufacturingProcess37 36.81
ManufacturingProcess35 33.00
ManufacturingProcess33 32.89
ManufacturingProcess22 32.19
ManufacturingProcess01 31.88
ManufacturingProcess03 30.93
ManufacturingProcess04 30.35
BiologicalMaterial10 28.79
ManufacturingProcess07 28.48
ManufacturingProcess09 26.77
ManufacturingProcess16 26.61
BiologicalMaterial04 26.03
The most important variable for our neural network is ManufacturingProcess32, and the manufacturing process variables dominate the top 10 most important variables for this fit. This mirrors the linear models trained in Exercise 6.3, where the manufacturing process variables also had the greater influence.
One way we can visualize the relationships between the top predictors and the response is with a correlogram of the ten most important features and Yield.
# Pull the importance table out of the varImp object and sort it
importance <- importance$importance
importance$feature <- rownames(importance)
importance <- importance[order(importance$Overall, decreasing = TRUE), ]
# Get the names of the ten most important features
important_features <- rownames(head(importance, 10))
# Create a correlogram of the imputed chemical data
correlation <- cor(imputed$data[, c(important_features, "Yield")])
corrplot(correlation)
We see some strong correlations between feature variables; for instance, ManufacturingProcess09 and ManufacturingProcess13 are strongly negatively correlated. Overall, the one biological variable doesn't correlate well with most of the manufacturing processes, which makes sense, as these are likely very different kinds of processes.