Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.
Exercise 7.2 Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\]
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)   # mlbench.friedman1()
library(caret)     # train(), featurePlot(), RMSE()
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)  # convert the 'x' matrix to a data frame
testData <- mlbench.friedman1(5000, sd = 1)   # large simulated test set used for the test-set RMSEs below
testData$x <- data.frame(testData$x)
featurePlot(trainingData$x, trainingData$y)
Before we model, here is a summary of both the training features and the test data, plotted via ggpairs.
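The summary code isn't reproduced above; a minimal sketch, assuming the GGally package for ggpairs, might look like:
library(GGally)          # ggpairs()
summary(trainingData$x)  # summaries of the training features
summary(testData$x)      # summaries of the test features
ggpairs(trainingData$x)  # pairwise plots of the training features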
While the scatter plots are a bit busy, some of the features look roughly normally distributed. I will specify a tuneLength of 10 across this exercise to standardize the tuning.
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"))
knnModel
k-Nearest Neighbors
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 3.466085 0.5121775 2.816838
7 3.349428 0.5452823 2.727410
9 3.264276 0.5785990 2.660026
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 9.
Now we'll predict against our test data using the KNN model and print out the root mean squared error (RMSE), a diagnostic metric we can use to evaluate the efficacy of a regression model.
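A sketch of the prediction step, using caret's RMSE(pred, obs) helper (the same pattern as the tree prediction below):
knnPred <- predict(knnModel, newdata = testData$x)
RMSE(knnPred, testData$y)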
[1] 3.117232
Let's train a decision tree on our data as well. Decision trees can be robust models, but are often prone to overfitting on the training data.
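The training call isn't shown above; a sketch consistent with the CART output that follows (method "rpart", a tuneLength of 10, and caret's default bootstrap resampling) would be:
decisionTreeModel <- train(x = trainingData$x,
                           y = trainingData$y,
                           method = "rpart",
                           tuneLength = 10)
decisionTreeModel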
CART
200 samples
10 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.01228972 3.678817 0.4809688 2.990114
0.01387981 3.686479 0.4774770 2.997159
0.01429458 3.695995 0.4747507 3.007059
0.02618043 3.712253 0.4625909 3.033527
0.03515931 3.792122 0.4356117 3.101091
0.05860160 3.983127 0.3854415 3.270421
0.06528495 4.061607 0.3651742 3.339100
0.07513824 4.135943 0.3429854 3.392170
0.20070359 4.596144 0.1871626 3.794009
0.25672400 4.706893 0.1682766 3.883181
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01228972.
# Predict using decision tree
predictTree <- predict(decisionTreeModel, testData$x)
RMSE(predictTree, testData$y)
[1] 3.381563
We’ll also train a MARS model similar to the method employed in K&J.
# Train spline model (MARS)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))
Multivariate Adaptive Regression Spline
200 samples
10 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
Resampling results across tuning parameters:
degree nprune RMSE Rsquared MAE
1 2 4.327437 0.2666502 3.6133264
1 3 3.582260 0.5145400 2.9105565
1 4 2.640676 0.7227538 2.1614949
1 5 2.292028 0.8062647 1.8481401
1 6 2.258999 0.8151057 1.7940661
1 7 1.810340 0.8740701 1.4166846
1 8 1.701523 0.8897982 1.3223375
1 9 1.629183 0.8960408 1.2682728
1 10 1.660758 0.8964917 1.3076894
1 11 1.557953 0.9054063 1.2285821
1 12 1.561823 0.9051019 1.2479219
1 13 1.571426 0.9049790 1.2474697
1 14 1.570282 0.9055072 1.2430842
1 15 1.570282 0.9055072 1.2430842
1 16 1.570282 0.9055072 1.2430842
1 17 1.570282 0.9055072 1.2430842
1 18 1.570282 0.9055072 1.2430842
1 19 1.570282 0.9055072 1.2430842
1 20 1.570282 0.9055072 1.2430842
1 21 1.570282 0.9055072 1.2430842
1 22 1.570282 0.9055072 1.2430842
1 23 1.570282 0.9055072 1.2430842
1 24 1.570282 0.9055072 1.2430842
1 25 1.570282 0.9055072 1.2430842
1 26 1.570282 0.9055072 1.2430842
1 27 1.570282 0.9055072 1.2430842
1 28 1.570282 0.9055072 1.2430842
1 29 1.570282 0.9055072 1.2430842
1 30 1.570282 0.9055072 1.2430842
1 31 1.570282 0.9055072 1.2430842
1 32 1.570282 0.9055072 1.2430842
1 33 1.570282 0.9055072 1.2430842
1 34 1.570282 0.9055072 1.2430842
1 35 1.570282 0.9055072 1.2430842
1 36 1.570282 0.9055072 1.2430842
1 37 1.570282 0.9055072 1.2430842
1 38 1.570282 0.9055072 1.2430842
2 2 4.327437 0.2666502 3.6133264
2 3 3.582260 0.5145400 2.9105565
2 4 2.640676 0.7227538 2.1614949
2 5 2.292028 0.8062647 1.8481401
2 6 2.253887 0.8145545 1.8022680
2 7 1.805725 0.8768413 1.4274473
2 8 1.686811 0.8952010 1.2708834
2 9 1.609773 0.9008126 1.2478381
2 10 1.508613 0.9145215 1.2062349
2 11 1.375676 0.9222637 1.0939575
2 12 1.331723 0.9263653 1.0749753
2 13 1.258092 0.9348329 1.0065578
2 14 1.206714 0.9413494 0.9764079
2 15 1.203725 0.9426384 0.9741912
2 16 1.214990 0.9416954 0.9859887
2 17 1.210825 0.9417940 0.9854172
2 18 1.210825 0.9417940 0.9854172
2 19 1.210825 0.9417940 0.9854172
2 20 1.210825 0.9417940 0.9854172
2 21 1.210825 0.9417940 0.9854172
2 22 1.210825 0.9417940 0.9854172
2 23 1.210825 0.9417940 0.9854172
2 24 1.210825 0.9417940 0.9854172
2 25 1.210825 0.9417940 0.9854172
2 26 1.210825 0.9417940 0.9854172
2 27 1.210825 0.9417940 0.9854172
2 28 1.210825 0.9417940 0.9854172
2 29 1.210825 0.9417940 0.9854172
2 30 1.210825 0.9417940 0.9854172
2 31 1.210825 0.9417940 0.9854172
2 32 1.210825 0.9417940 0.9854172
2 33 1.210825 0.9417940 0.9854172
2 34 1.210825 0.9417940 0.9854172
2 35 1.210825 0.9417940 0.9854172
2 36 1.210825 0.9417940 0.9854172
2 37 1.210825 0.9417940 0.9854172
2 38 1.210825 0.9417940 0.9854172
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 15 and degree = 2.
We'll predict using the MARS model to see the RMSE on the testing data.
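A sketch of the prediction step:
marsPred <- predict(marsTuned, newdata = testData$x)
RMSE(marsPred, testData$y)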
[1] 1.158995
We can use the varImp function to see which features are most important for a given model. In the case of our MARS model, we can see that the informative predictors (X1 through X5) are in fact the ones selected, although X3 contributes little to the final model.
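A sketch of the call on the tuned MARS object:
varImp(marsTuned)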
earth variable importance
Overall
X1 100.00
X4 75.24
X2 48.73
X5 15.52
X3 0.00
Lastly, let's train an XGBoost (gradient boosting) model using the xgboost library. XGBoost can be used for both regression and classification tasks.
library(xgboost)
# Train XGBoost model
xgboostModel <- xgboost(data = as.matrix(trainingData$x),
                        label = as.matrix(trainingData$y),
                        max.depth = 2, eta = 1, nthread = 2,
                        nrounds = 2)
[1] train-rmse:3.478113
[2] train-rmse:2.661574
##### xgb.Booster
raw: 5.8 Kb
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds,
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
early_stopping_rounds = early_stopping_rounds, maximize = maximize,
save_period = save_period, save_name = save_name, xgb_model = xgb_model,
callbacks = callbacks, max.depth = 2, eta = 1, nthread = 2)
params (as set within xgb.train):
max_depth = "2", eta = "1", nthread = "2", validate_parameters = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
cb.evaluation.log()
# of features: 10
niter: 2
nfeatures : 10
evaluation_log:
iter train_rmse
<num> <num>
1 3.478113
2 2.661574
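To put XGBoost on the same footing as the other models, a test-set prediction along these lines could be used (a sketch; the resulting RMSE is not reproduced in this write-up):
xgbPred <- predict(xgboostModel, as.matrix(testData$x))
RMSE(xgbPred, testData$y)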
In terms of test-set RMSE, the MARS model gives the best performance.
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Which nonlinear regression model gives the optimal resampling and test set performance?
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
library(AppliedPredictiveModeling)
library(kernlab)
library(earth)
library(nnet)
library(ModelMetrics)
library(dplyr)     # select() and the pipe
library(corrplot)  # corrplot(), used at the end
data("ChemicalManufacturingProcess")
chemical <- ChemicalManufacturingProcess
chemical_features <- chemical %>% dplyr::select(-c("Yield"))
We'll impute the missing values in this data set.
# Impute chemical yield data
imputed <- preProcess(chemical,
                      method = c("knnImpute"))
trans <- predict(imputed, chemical)
We'll also set up a train/test split.
# Split into train and test sets
# Use 80% of the dataset as the training set and 20% as the test set
sample <- sample(c(TRUE, FALSE), nrow(trans), replace = TRUE, prob = c(0.8, 0.2))
train <- trans[sample, ]
train_yield <- train$Yield
train <- train %>%
dplyr::select(-c("Yield"))
test <- trans[!sample, ]
test_yield <- test$Yield
test <- test %>%
dplyr::select(-c("Yield"))
Now we can try a spline regression (MARS) model.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTuned <- train(x = train,
                   y = train_yield,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))
marsTuned
Multivariate Adaptive Regression Spline
133 samples
57 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 120, 117, 120, 120, 120, 120, ...
Resampling results across tuning parameters:
degree nprune RMSE Rsquared MAE
1 2 0.8255218 0.4255589 0.6448691
1 3 0.7076756 0.5594258 0.5588394
1 4 0.6829418 0.5905764 0.5398365
1 5 0.6777599 0.6074488 0.5355634
1 6 0.6798713 0.5942309 0.5554584
1 7 0.6721252 0.6150362 0.5452002
1 8 0.6681566 0.6248951 0.5575549
1 9 0.7061488 0.5979535 0.5845283
1 10 0.6849416 0.6115068 0.5773218
1 11 0.6943135 0.5966026 0.5773263
1 12 0.6836511 0.6152858 0.5704027
1 13 0.6901165 0.6107945 0.5734003
1 14 0.7025914 0.6027601 0.5818426
1 15 0.7064758 0.5990417 0.5851013
1 16 0.7076302 0.6002581 0.5841450
1 17 0.7094408 0.5995345 0.5861282
1 18 0.7094408 0.5995345 0.5861282
1 19 0.7094408 0.5995345 0.5861282
1 20 0.7158914 0.6004276 0.5918082
2 2 0.8255218 0.4255589 0.6448691
2 3 0.7219303 0.5442372 0.5657423
2 4 0.7094490 0.5630614 0.5499513
2 5 0.7568039 0.5173411 0.5988227
2 6 0.7666427 0.5050106 0.6113483
2 7 0.7718170 0.5136246 0.6157548
2 8 0.7616770 0.5369341 0.6055975
2 9 0.7491831 0.5503402 0.5973027
2 10 0.7792853 0.5390793 0.6137423
2 11 0.7555746 0.5448990 0.5931787
2 12 0.7359925 0.5736795 0.5857371
2 13 0.7504894 0.5659826 0.6066187
2 14 0.7427183 0.5839756 0.5946789
2 15 0.7475103 0.5738970 0.5958920
2 16 0.7503278 0.5787836 0.5877399
2 17 0.7313502 0.5970206 0.5722539
2 18 0.7520954 0.5879849 0.5781150
2 19 0.7739194 0.5633356 0.5883682
2 20 0.8671464 0.5073033 0.6329745
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 8 and degree = 1.
We can plot our spline model to see the resulting errors over the cross-validation folds.
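A minimal sketch of that plot, using caret's plot method for train objects:
plot(marsTuned)  # resampled RMSE across nprune values, one curve per degree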
# Train a support vector machine (radial kernel)
svmFit <- train(train, train_yield, method = "svmRadial",
                preProc = c("center", "scale"), tuneLength = 15,
                trControl = trainControl(method = "cv"))
svmFit$resample
RMSE Rsquared MAE Resample
1 0.8283012 0.4484280 0.7332573 Fold01
2 0.6631408 0.6081640 0.5371689 Fold03
3 0.5995672 0.7343060 0.5107410 Fold09
4 0.8339779 0.6160227 0.5526669 Fold02
5 0.4961339 0.7891719 0.3857777 Fold10
6 0.5396032 0.7705326 0.4690446 Fold05
7 0.5103868 0.8048235 0.4289214 Fold06
8 0.6169507 0.7511327 0.4705829 Fold04
9 0.6631854 0.6691171 0.5466585 Fold07
10 0.5388635 0.6775118 0.4165972 Fold08
We can get the RMSE of our SVM regression model by comparing the predicted values against the testing values.
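A sketch of the prediction step, assuming ModelMetrics' rmse(actual, predicted) (loaded above):
svmPred <- predict(svmFit, newdata = test)
rmse(test_yield, svmPred)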
[1] 0.5324566
We can also train a KNN regression model to predict the Yield output variable.
knnTune <- train(train,
                 train_yield,
                 method = "knn",
                 preProc = c("center", "scale"), # setting this in the model training will make it occur for testing as well
                 tuneGrid = data.frame(.k = 1:20),
                 trControl = trainControl(method = "cv"))
Based on RMSE as our metric, the ideal number of neighbors for this model is k = 6.
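A sketch of how the chosen k and the test-set error can be checked (the test RMSE is not reproduced here):
knnTune$bestTune                       # the selected number of neighbors
knnPred <- predict(knnTune, newdata = test)
rmse(test_yield, knnPred)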
Lastly, we'll train a neural network model on our chemical processing data.
nnetFit <- train(train, train_yield,
                 method = "nnet",
                 tuneLength = 10,
                 preProc = c("center", "scale"),
                 trace = FALSE,
                 trControl = trainControl(method = "cv"))
size decay RMSE Rsquared MAE RMSESD RsquaredSD
1 1 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
2 1 0.0001000000 0.9542694 0.36453847 0.7813614 0.12680347 0.13445536
3 1 0.0002371374 0.9466420 0.43031975 0.7839454 0.11656901 0.17561625
4 1 0.0005623413 0.8684038 0.48985394 0.7026930 0.13789603 0.18213195
5 1 0.0013335214 0.8446965 0.54068892 0.6811947 0.11696280 0.17284932
6 1 0.0031622777 0.8645272 0.48673833 0.7052410 0.10097049 0.11720248
7 1 0.0074989421 0.8715234 0.50536294 0.7069608 0.13550898 0.17086414
8 1 0.0177827941 0.8583479 0.51724788 0.6969092 0.13224314 0.17456496
9 1 0.0421696503 0.8761372 0.48402230 0.7178496 0.11891438 0.13867091
10 1 0.1000000000 0.8789617 0.48628096 0.7193759 0.13055756 0.13705716
11 3 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
12 3 0.0001000000 0.8847194 0.44303924 0.7232908 0.15575958 0.18651356
13 3 0.0002371374 0.9136125 0.44481984 0.7599729 0.16114540 0.18413129
14 3 0.0005623413 0.8760403 0.48277974 0.7125063 0.13565689 0.15885110
15 3 0.0013335214 0.8759353 0.48344943 0.7079560 0.10010669 0.13028271
16 3 0.0031622777 0.8608407 0.54393529 0.6994648 0.12452724 0.17253287
17 3 0.0074989421 0.8762684 0.46127663 0.7152433 0.09785980 0.14163309
18 3 0.0177827941 0.8519816 0.56159612 0.6916056 0.12087608 0.17582455
19 3 0.0421696503 0.8501467 0.54940934 0.6950391 0.12099830 0.18651991
20 3 0.1000000000 0.8587715 0.56455783 0.7032994 0.12328505 0.14981945
21 5 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
22 5 0.0001000000 0.9057045 0.53932126 0.7351706 0.18596180 0.24244690
23 5 0.0002371374 0.8996320 0.48648469 0.7310962 0.11706869 0.15005490
24 5 0.0005623413 0.8565708 0.51215030 0.6905188 0.12265131 0.20342865
25 5 0.0013335214 0.8368429 0.56123984 0.6835676 0.11553301 0.12548573
26 5 0.0031622777 0.8804294 0.47370808 0.7119672 0.11455998 0.17370900
27 5 0.0074989421 0.8301849 0.59217979 0.6717010 0.10836512 0.15749782
28 5 0.0177827941 0.8476145 0.56158473 0.6895417 0.13596245 0.17881923
29 5 0.0421696503 0.8506137 0.56304234 0.6933575 0.12801265 0.17787857
30 5 0.1000000000 0.8493008 0.56866416 0.6958157 0.12296593 0.17583374
31 7 0.0000000000 1.0446689 0.26118383 0.8647718 0.11687753 NA
32 7 0.0001000000 0.8574782 0.47752734 0.6865403 0.12909445 0.23294308
33 7 0.0002371374 0.8603531 0.50368662 0.7014791 0.13061026 0.19354097
34 7 0.0005623413 0.8615605 0.50853759 0.7058293 0.12289662 0.19176478
35 7 0.0013335214 0.8796102 0.47752190 0.7025399 0.11890052 0.12476454
36 7 0.0031622777 0.8307254 0.57435029 0.6742875 0.10324503 0.18105575
37 7 0.0074989421 0.8476169 0.55170099 0.6945817 0.11314152 0.16246766
38 7 0.0177827941 0.8471812 0.56574926 0.6916247 0.11829469 0.15095870
39 7 0.0421696503 0.8516917 0.54865887 0.6991762 0.12458225 0.17986628
40 7 0.1000000000 0.8532175 0.55598530 0.6995990 0.11531984 0.16347045
41 9 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
42 9 0.0001000000 0.9043466 0.47150143 0.7388232 0.11826587 0.17845483
43 9 0.0002371374 0.8582291 0.50466915 0.6948912 0.12357845 0.19970104
44 9 0.0005623413 0.8694690 0.47886213 0.7030061 0.12053769 0.15717109
45 9 0.0013335214 0.8514352 0.54738913 0.6947003 0.11039419 0.21514490
46 9 0.0031622777 0.8639353 0.50400283 0.7044491 0.12396283 0.14768604
47 9 0.0074989421 0.8594343 0.54137496 0.7035777 0.11562709 0.14161259
48 9 0.0177827941 0.8414491 0.57866012 0.6862069 0.12019540 0.19730131
49 9 0.0421696503 0.8475194 0.56140092 0.6914190 0.12451328 0.18518926
50 9 0.1000000000 0.8479317 0.56555419 0.6929261 0.12062882 0.17474524
51 11 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
52 11 0.0001000000 0.9032802 0.41806037 0.7403833 0.16119939 0.23182849
53 11 0.0002371374 0.8566433 0.50962852 0.6845412 0.08235548 0.11708827
54 11 0.0005623413 0.8285419 0.58219777 0.6707848 0.10266987 0.11564880
55 11 0.0013335214 0.8591061 0.52244560 0.7030786 0.12167971 0.18321771
56 11 0.0031622777 0.8404660 0.58820246 0.6762268 0.10308377 0.15174876
57 11 0.0074989421 0.8558811 0.51505994 0.7042372 0.11462365 0.16931431
58 11 0.0177827941 0.8508009 0.56090913 0.6988181 0.12645181 0.21002875
59 11 0.0421696503 0.8500023 0.55680391 0.6948443 0.12377467 0.17385419
60 11 0.1000000000 0.8510770 0.54908798 0.6974404 0.11635630 0.15478813
61 13 0.0000000000 1.0446688 0.03838475 0.8647717 0.11687763 NA
62 13 0.0001000000 0.8870892 0.45962179 0.7270453 0.14429119 0.16362521
63 13 0.0002371374 0.8604686 0.49821858 0.6989204 0.12953505 0.17088320
64 13 0.0005623413 0.8608903 0.49233464 0.6993091 0.11269278 0.15447609
65 13 0.0013335214 0.8463950 0.55950500 0.6865408 0.11416761 0.16021417
66 13 0.0031622777 0.8402320 0.57954980 0.6784971 0.11309901 0.09686417
67 13 0.0074989421 0.8313959 0.57538517 0.6777195 0.11556793 0.15946094
68 13 0.0177827941 0.8557499 0.55001110 0.7026915 0.12692687 0.16642893
69 13 0.0421696503 0.8565759 0.52386043 0.6997531 0.11113101 0.15537879
70 13 0.1000000000 0.8495124 0.56051409 0.6935039 0.11689193 0.18305543
71 15 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
72 15 0.0001000000 0.8542083 0.52074764 0.6957404 0.12852321 0.22950819
73 15 0.0002371374 0.8694178 0.51607471 0.7108593 0.13644878 0.17359257
74 15 0.0005623413 0.8315018 0.57213846 0.6674878 0.09834591 0.13554296
75 15 0.0013335214 0.8455483 0.56258270 0.6779581 0.09967033 0.13183626
76 15 0.0031622777 0.8215251 0.60574189 0.6640445 0.10868630 0.13219710
77 15 0.0074989421 0.8430867 0.56370989 0.6901568 0.10724816 0.16527756
78 15 0.0177827941 0.8426215 0.57658867 0.6883842 0.12346134 0.19099020
79 15 0.0421696503 0.8487464 0.55842812 0.6940051 0.11773867 0.18295017
80 15 0.1000000000 0.8635650 0.53296557 0.7079452 0.11293156 0.15492506
81 17 0.0000000000 NaN NaN NaN NA NA
82 17 0.0001000000 NaN NaN NaN NA NA
83 17 0.0002371374 NaN NaN NaN NA NA
84 17 0.0005623413 NaN NaN NaN NA NA
85 17 0.0013335214 NaN NaN NaN NA NA
86 17 0.0031622777 NaN NaN NaN NA NA
87 17 0.0074989421 NaN NaN NaN NA NA
88 17 0.0177827941 NaN NaN NaN NA NA
89 17 0.0421696503 NaN NaN NaN NA NA
90 17 0.1000000000 NaN NaN NaN NA NA
91 19 0.0000000000 NaN NaN NaN NA NA
92 19 0.0001000000 NaN NaN NaN NA NA
93 19 0.0002371374 NaN NaN NaN NA NA
94 19 0.0005623413 NaN NaN NaN NA NA
95 19 0.0013335214 NaN NaN NaN NA NA
96 19 0.0031622777 NaN NaN NaN NA NA
97 19 0.0074989421 NaN NaN NaN NA NA
98 19 0.0177827941 NaN NaN NaN NA NA
99 19 0.0421696503 NaN NaN NaN NA NA
100 19 0.1000000000 NaN NaN NaN NA NA
MAESD
1 0.08179069
2 0.09057422
3 0.09096805
4 0.09828508
5 0.08010935
6 0.07460541
7 0.09438018
8 0.08859903
9 0.07266111
10 0.09171134
11 0.08179069
12 0.09504531
13 0.11965903
14 0.07498538
15 0.06373285
16 0.06775019
17 0.07355006
18 0.07886260
19 0.08367477
20 0.08517409
21 0.08179069
22 0.10685804
23 0.09948579
24 0.08028935
25 0.06367827
26 0.08156457
27 0.07301548
28 0.08369853
29 0.08537643
30 0.08441758
31 0.08179069
32 0.07856738
33 0.09525400
34 0.07709966
35 0.06512952
36 0.05739402
37 0.08031130
38 0.07680723
39 0.08817776
40 0.07436564
41 0.08179069
42 0.09815068
43 0.09695200
44 0.07446604
45 0.06897606
46 0.07737957
47 0.06402453
48 0.07901474
49 0.08811272
50 0.08145392
51 0.08179069
52 0.10536066
53 0.05325120
54 0.05489229
55 0.06970357
56 0.06044404
57 0.08363492
58 0.07910548
59 0.08491888
60 0.07618946
61 0.08179083
62 0.09114902
63 0.08495608
64 0.07093670
65 0.06179021
66 0.05818094
67 0.07714087
68 0.08122788
69 0.07819142
70 0.07726643
71 0.08179069
72 0.08113766
73 0.08894894
74 0.04757259
75 0.05831447
76 0.06865267
77 0.06683151
78 0.08264085
79 0.07928959
80 0.07686953
81 NA
82 NA
83 NA
84 NA
85 NA
86 NA
87 NA
88 NA
89 NA
90 NA
91 NA
92 NA
93 NA
94 NA
95 NA
96 NA
97 NA
98 NA
99 NA
100 NA
Let's use the varImp function to see which feature variables in our fit are most consequential.
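A sketch of the call, storing the result as importance since the correlation code further down references that object:
importance <- varImp(nnetFit)
importance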
nnet variable importance
only 20 most important variables shown (out of 57)
Overall
ManufacturingProcess32 100.00
ManufacturingProcess36 62.10
ManufacturingProcess28 53.94
ManufacturingProcess39 46.50
ManufacturingProcess24 45.11
ManufacturingProcess13 43.72
ManufacturingProcess12 40.63
ManufacturingProcess05 39.84
ManufacturingProcess37 36.81
ManufacturingProcess35 33.00
ManufacturingProcess33 32.89
ManufacturingProcess22 32.19
ManufacturingProcess01 31.88
ManufacturingProcess03 30.93
ManufacturingProcess04 30.35
BiologicalMaterial10 28.79
ManufacturingProcess07 28.48
ManufacturingProcess09 26.77
ManufacturingProcess16 26.61
BiologicalMaterial04 26.03
The most important variable for our neural network is ManufacturingProcess32, and the manufacturing process variables dominate the top 10 most important variables for this fit. This mirrors the linear models trained in Exercise 6.3, where the manufacturing process variables also had the greater influence.
One way we can visualize the relationships between the top predictors and the response is with a correlogram of the ten most important features and Yield.
# Pull the importance table out of the varImp object and sort it
importance <- importance$importance
importance$feature <- rownames(importance)
importance <- importance[order(importance$Overall, decreasing = TRUE), ]
# Get the names of the ten most important features
important_features <- rownames(head(importance, 10))
# Create a correlogram of the imputed chemical data
correlation <- cor(imputed$data[, c(important_features, "Yield")])
corrplot(correlation)
We see some strong correlations between feature variables; for instance, ManufacturingProcess09 and ManufacturingProcess13 are strongly negatively correlated. Overall, the one biological variable doesn't correlate well with most of the manufacturing processes, which makes sense, as these are likely very different kinds of processes.