library(tidyverse)
library(MASS)
library(caret)
library(mlbench)
library(xgboost)
library(GGally)
library(e1071)
library(corrplot)
Let’s generate the simulated Friedman data using the same code as the text, and create the feature distribution plot in the same manner.
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)
# Set up test data
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Before we model, let’s get a feel for both the training features and the test data, starting with a ggpairs plot of the training set.
# Plot training data distributions
ggpairs(cbind(trainingData$x, trainingData$y))
While the scatter plots are a bit busy, the predictors look roughly uniform (the Friedman simulation draws them from [0, 1]), while the response appears roughly bell-shaped.
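A quick numeric summary of the held-out predictors complements the plot; a minimal sketch (output omitted here):
# Sketch: numeric summary of the simulated test-set predictors
summary(testData$x)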
We’ll train the KNN model specified in K&J, centering and scaling the predictors. This fit uses caret’s default tuning grid; the later models specify their own grids or tuneLength values.
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"))
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
Now we’ll predict against the test data using the KNN model and report the root mean squared error (RMSE), the metric we’ll use to compare the regression models in this exercise.
knnPred <- predict(knnModel, testData$x)
RMSE(knnPred, testData$y)
## [1] 3.117232
Let’s also train a decision tree on our data. Decision trees are flexible models but are often prone to overfitting the training data.
decisionTreeModel <- train(x=trainingData$x,
y=trainingData$y,
method="rpart",
tuneLength=10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
decisionTreeModel
## CART
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01228972 3.678817 0.4809688 2.990114
## 0.01387981 3.686479 0.4774770 2.997159
## 0.01429458 3.695995 0.4747507 3.007059
## 0.02618043 3.712253 0.4625909 3.033527
## 0.03515931 3.792122 0.4356117 3.101091
## 0.05860160 3.983127 0.3854415 3.270421
## 0.06528495 4.061607 0.3651742 3.339100
## 0.07513824 4.135943 0.3429854 3.392170
## 0.20070359 4.596144 0.1871626 3.794009
## 0.25672400 4.706893 0.1682766 3.883181
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01228972.
# Predict using decision tree
predictTree <- predict(decisionTreeModel, testData$x)
RMSE(predictTree, testData$y)
## [1] 3.381563
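To see which splits the selected tree actually uses, we could plot the final rpart fit; a minimal sketch, assuming the rpart.plot package is available (it is not loaded above):
# Sketch: visualize the final pruned tree (rpart.plot is an extra dependency)
library(rpart.plot)
rpart.plot(decisionTreeModel$finalModel)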
We’ll also train a MARS model, following the approach employed in K&J.
# Train spline model (MARS)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned <- train(x=trainingData$x,
y=trainingData$y, method = "earth", tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.327437 0.2666502 3.6133264
## 1 3 3.582260 0.5145400 2.9105565
## 1 4 2.640676 0.7227538 2.1614949
## 1 5 2.292028 0.8062647 1.8481401
## 1 6 2.258999 0.8151057 1.7940661
## 1 7 1.810340 0.8740701 1.4166846
## 1 8 1.701523 0.8897982 1.3223375
## 1 9 1.629183 0.8960408 1.2682728
## 1 10 1.660758 0.8964917 1.3076894
## 1 11 1.557953 0.9054063 1.2285821
## 1 12 1.561823 0.9051019 1.2479219
## 1 13 1.571426 0.9049790 1.2474697
## 1 14 1.570282 0.9055072 1.2430842
## 1 15 1.570282 0.9055072 1.2430842
## 1 16 1.570282 0.9055072 1.2430842
## 1 17 1.570282 0.9055072 1.2430842
## 1 18 1.570282 0.9055072 1.2430842
## 1 19 1.570282 0.9055072 1.2430842
## 1 20 1.570282 0.9055072 1.2430842
## 1 21 1.570282 0.9055072 1.2430842
## 1 22 1.570282 0.9055072 1.2430842
## 1 23 1.570282 0.9055072 1.2430842
## 1 24 1.570282 0.9055072 1.2430842
## 1 25 1.570282 0.9055072 1.2430842
## 1 26 1.570282 0.9055072 1.2430842
## 1 27 1.570282 0.9055072 1.2430842
## 1 28 1.570282 0.9055072 1.2430842
## 1 29 1.570282 0.9055072 1.2430842
## 1 30 1.570282 0.9055072 1.2430842
## 1 31 1.570282 0.9055072 1.2430842
## 1 32 1.570282 0.9055072 1.2430842
## 1 33 1.570282 0.9055072 1.2430842
## 1 34 1.570282 0.9055072 1.2430842
## 1 35 1.570282 0.9055072 1.2430842
## 1 36 1.570282 0.9055072 1.2430842
## 1 37 1.570282 0.9055072 1.2430842
## 1 38 1.570282 0.9055072 1.2430842
## 2 2 4.327437 0.2666502 3.6133264
## 2 3 3.582260 0.5145400 2.9105565
## 2 4 2.640676 0.7227538 2.1614949
## 2 5 2.292028 0.8062647 1.8481401
## 2 6 2.253887 0.8145545 1.8022680
## 2 7 1.805725 0.8768413 1.4274473
## 2 8 1.686811 0.8952010 1.2708834
## 2 9 1.609773 0.9008126 1.2478381
## 2 10 1.508613 0.9145215 1.2062349
## 2 11 1.375676 0.9222637 1.0939575
## 2 12 1.331723 0.9263653 1.0749753
## 2 13 1.258092 0.9348329 1.0065578
## 2 14 1.206714 0.9413494 0.9764079
## 2 15 1.203725 0.9426384 0.9741912
## 2 16 1.214990 0.9416954 0.9859887
## 2 17 1.210825 0.9417940 0.9854172
## 2 18 1.210825 0.9417940 0.9854172
## 2 19 1.210825 0.9417940 0.9854172
## 2 20 1.210825 0.9417940 0.9854172
## 2 21 1.210825 0.9417940 0.9854172
## 2 22 1.210825 0.9417940 0.9854172
## 2 23 1.210825 0.9417940 0.9854172
## 2 24 1.210825 0.9417940 0.9854172
## 2 25 1.210825 0.9417940 0.9854172
## 2 26 1.210825 0.9417940 0.9854172
## 2 27 1.210825 0.9417940 0.9854172
## 2 28 1.210825 0.9417940 0.9854172
## 2 29 1.210825 0.9417940 0.9854172
## 2 30 1.210825 0.9417940 0.9854172
## 2 31 1.210825 0.9417940 0.9854172
## 2 32 1.210825 0.9417940 0.9854172
## 2 33 1.210825 0.9417940 0.9854172
## 2 34 1.210825 0.9417940 0.9854172
## 2 35 1.210825 0.9417940 0.9854172
## 2 36 1.210825 0.9417940 0.9854172
## 2 37 1.210825 0.9417940 0.9854172
## 2 38 1.210825 0.9417940 0.9854172
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
We’ll predict with the MARS model to see its RMSE on the test data.
predictMars <- predict(marsTuned, testData$x)
RMSE(predictMars, testData$y)
## [1] 1.158995
We can use the varImp function to see which features are most important for a given model. For our MARS model, the informative predictors (X1–X5) are indeed the ones selected, although X3 contributes essentially nothing to the final fit.
varImp(marsTuned)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.24
## X2 48.73
## X5 15.52
## X3 0.00
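These importances can also be viewed graphically; a minimal sketch (plot omitted here):
# Sketch: dotplot of the MARS variable importances
plot(varImp(marsTuned))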
Lastly, let’s train an XGBoost (gradient boosting) model; we can use the xgboost package in R to do so. XGBoost can be used for both regression and classification tasks.
# Train XGBoost model
xgBoostModel <- xgboost(data = as.matrix(trainingData$x),
label = as.matrix(trainingData$y),
max.depth = 2, eta = 1, nthread = 2,
nrounds = 2)
## [1] train-rmse:3.478113
## [2] train-rmse:2.661574
xgBoostModel
## ##### xgb.Booster
## raw: 5.8 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, max.depth = 2, eta = 1, nthread = 2)
## params (as set within xgb.train):
## max_depth = "2", eta = "1", nthread = "2", validate_parameters = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## cb.evaluation.log()
## # of features: 10
## niter: 2
## nfeatures : 10
## evaluation_log:
## iter train_rmse
## <num> <num>
## 1 3.478113
## 2 2.661574
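For completeness, the boosted model could be scored against the held-out Friedman data in the same way as the other models; a minimal sketch (the numeric result is not reported here):
# Sketch: test-set RMSE for the boosted model (value not reported above)
xgbPred <- predict(xgBoostModel, as.matrix(testData$x))
RMSE(xgbPred, testData$y)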
Judged purely by RMSE on the held-out test data, the MARS model performs best of the models we fit.
# Load chemical data
library(AppliedPredictiveModeling)
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
library(earth)
library(nnet)
library(ModelMetrics)
##
## Attaching package: 'ModelMetrics'
## The following objects are masked from 'package:caret':
##
## confusionMatrix, precision, recall, sensitivity, specificity
## The following object is masked from 'package:base':
##
## kappa
data("ChemicalManufacturingProcess")
chemical <- ChemicalManufacturingProcess
chemical_features <- chemical %>% dplyr::select(-c("Yield"))
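Before imputing, we could quickly check how much data is actually missing; a minimal sketch (count omitted here):
# Sketch: total number of missing values in the raw chemical data
sum(is.na(chemical))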
We’ll impute this data the same way we did in HW7, using caret’s knnImpute (which centers and scales the data as part of the k-nearest-neighbor imputation).
# Impute missing values in the chemical data via k-NN
imputed <- preProcess(chemical,
method = c("knnImpute"))
trans <- predict(imputed, chemical)
We’ll also set up a train-test split.
# Split into train and test sets
# use roughly 80% of the dataset for training and 20% for testing
sample <- sample(c(TRUE, FALSE), nrow(trans), replace=TRUE, prob=c(0.8,0.2))
train <- trans[sample, ]
train_yield <- train$Yield
train <- train %>%
dplyr::select(-c("Yield"))
test <- trans[!sample, ]
test_yield <- test$Yield
test <- test %>%
dplyr::select(-c("Yield"))
Now we can try a MARS (spline) regression model, similar to how it’s done in K&J.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTuned <- train(x = train,
y = train_yield,
method = "earth", tuneGrid =marsGrid,
trControl = trainControl(method = "cv"))
marsTuned
## Multivariate Adaptive Regression Spline
##
## 133 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 117, 120, 120, 120, 120, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.8255218 0.4255589 0.6448691
## 1 3 0.7076756 0.5594258 0.5588394
## 1 4 0.6829418 0.5905764 0.5398365
## 1 5 0.6777599 0.6074488 0.5355634
## 1 6 0.6798713 0.5942309 0.5554584
## 1 7 0.6721252 0.6150362 0.5452002
## 1 8 0.6681566 0.6248951 0.5575549
## 1 9 0.7061488 0.5979535 0.5845283
## 1 10 0.6849416 0.6115068 0.5773218
## 1 11 0.6943135 0.5966026 0.5773263
## 1 12 0.6836511 0.6152858 0.5704027
## 1 13 0.6901165 0.6107945 0.5734003
## 1 14 0.7025914 0.6027601 0.5818426
## 1 15 0.7064758 0.5990417 0.5851013
## 1 16 0.7076302 0.6002581 0.5841450
## 1 17 0.7094408 0.5995345 0.5861282
## 1 18 0.7094408 0.5995345 0.5861282
## 1 19 0.7094408 0.5995345 0.5861282
## 1 20 0.7158914 0.6004276 0.5918082
## 2 2 0.8255218 0.4255589 0.6448691
## 2 3 0.7219303 0.5442372 0.5657423
## 2 4 0.6890481 0.5887808 0.5338162
## 2 5 0.7412996 0.5358115 0.5891260
## 2 6 0.7382768 0.5366446 0.5826828
## 2 7 0.7380444 0.5549002 0.5894028
## 2 8 0.7576289 0.5438524 0.5993969
## 2 9 0.7308578 0.5696927 0.5866199
## 2 10 0.7703698 0.5487598 0.5947291
## 2 11 0.7534409 0.5581414 0.5845952
## 2 12 0.7324413 0.5880692 0.5766953
## 2 13 0.7420924 0.5880776 0.5893879
## 2 14 0.7430857 0.5924334 0.5876311
## 2 15 0.7405586 0.5913932 0.5820524
## 2 16 0.7488744 0.5893842 0.5894230
## 2 17 0.7272651 0.6127237 0.5722228
## 2 18 0.7503613 0.6034496 0.5792468
## 2 19 0.7866035 0.5670338 0.6099921
## 2 20 0.8747462 0.5084571 0.6474419
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 8 and degree = 1.
We can plot our spline model to see the cross-validated RMSE across the tuning grid (number of retained terms, by product degree).
plot(marsTuned)
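We could also score this MARS fit on the held-out chemical data, mirroring how the other models are evaluated below; a minimal sketch (value not reported here):
# Sketch: test-set RMSE for the chemical-data MARS model (value not reported above)
rmse(predict(marsTuned, test), test_yield)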
# Train a support vector machine with a radial basis kernel
svmFit <- train(train, train_yield, method = "svmRadial",
preProc = c("center", "scale"), tuneLength = 15,
trControl = trainControl(method = "cv"))
svmFit$resample
## RMSE Rsquared MAE Resample
## 1 0.8283012 0.4484280 0.7332573 Fold01
## 2 0.6631408 0.6081640 0.5371689 Fold03
## 3 0.5995672 0.7343060 0.5107410 Fold09
## 4 0.8339779 0.6160227 0.5526669 Fold02
## 5 0.4961339 0.7891719 0.3857777 Fold10
## 6 0.5396032 0.7705326 0.4690446 Fold05
## 7 0.5103868 0.8048235 0.4289214 Fold06
## 8 0.6169507 0.7511327 0.4705829 Fold04
## 9 0.6631854 0.6691171 0.5466585 Fold07
## 10 0.5388635 0.6775118 0.4165972 Fold08
We can get the RMSE of our SVM regression model by comparing its predictions against the test-set values.
rmse(predict(svmFit, test), test_yield)
## [1] 0.5324566
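The kernel width and cost that caret selected for the radial SVM can be inspected directly; a minimal sketch (output omitted here):
# Sketch: tuning values chosen for the radial SVM
svmFit$bestTune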
We can also train a KNN regression model to predict the Yield output variable.
knnTune <- train(train,
train_yield,
method = "knn",
preProc = c("center", "scale"), # setting this in the model training will make it occur for testing as well
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = "cv"))
plot(knnTune)
Based on cross-validated RMSE, the ideal number of neighbors for this model is 6.
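For comparison with the SVM, the tuned KNN model could be scored on the held-out data the same way; a minimal sketch (value not reported here):
# Sketch: test-set RMSE for the tuned KNN model (value not reported above)
rmse(predict(knnTune, test), test_yield)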
Lastly, we’ll train a neural network model on our chemical manufacturing data.
nnetFit <- train(train, train_yield,
method = "nnet",
tuneLength=10,
preProc = c("center", "scale"), trace = FALSE,
trControl = trainControl(method = "cv"))
nnPred <- predict(nnetFit, test)
nnetFit$results
## size decay RMSE Rsquared MAE RMSESD RsquaredSD
## 1 1 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 2 1 0.0001000000 0.9602047 0.35370438 0.7924751 0.13091826 0.14532767
## 3 1 0.0002371374 0.9499226 0.41670304 0.7820682 0.12074066 0.16328378
## 4 1 0.0005623413 0.8694557 0.49612601 0.7036847 0.13923502 0.18220159
## 5 1 0.0013335214 0.8415267 0.54236859 0.6788265 0.12621286 0.17267371
## 6 1 0.0031622777 0.8747694 0.46345512 0.7154521 0.10029505 0.10555984
## 7 1 0.0074989421 0.8801878 0.48722454 0.7207901 0.11207299 0.17585171
## 8 1 0.0177827941 0.8584324 0.51926562 0.6996166 0.12934376 0.16808123
## 9 1 0.0421696503 0.8759545 0.48453574 0.7177421 0.11902629 0.13898627
## 10 1 0.1000000000 0.8789621 0.48627958 0.7193764 0.13055838 0.13706046
## 11 3 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 12 3 0.0001000000 0.8851309 0.43581928 0.7255324 0.15429023 0.18578804
## 13 3 0.0002371374 0.9232621 0.44065404 0.7628656 0.17176893 0.19586877
## 14 3 0.0005623413 0.8798274 0.47550108 0.7181215 0.12910418 0.15554495
## 15 3 0.0013335214 0.8636654 0.52667724 0.7064073 0.09982384 0.14880142
## 16 3 0.0031622777 0.8445232 0.57487775 0.6793834 0.11292242 0.17136834
## 17 3 0.0074989421 0.8631240 0.49705203 0.7028483 0.11998867 0.14598445
## 18 3 0.0177827941 0.8498644 0.56946524 0.6893010 0.12048441 0.17287164
## 19 3 0.0421696503 0.8501513 0.54926951 0.6951087 0.12100173 0.18657704
## 20 3 0.1000000000 0.8587813 0.56454381 0.7033117 0.12326735 0.14980901
## 21 5 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 22 5 0.0001000000 0.9046827 0.54277513 0.7327564 0.18659897 0.24505477
## 23 5 0.0002371374 0.8989956 0.49684758 0.7337956 0.11807850 0.13631450
## 24 5 0.0005623413 0.8544478 0.53633134 0.6978395 0.14383473 0.18023297
## 25 5 0.0013335214 0.8499765 0.52747507 0.6922167 0.12069685 0.11569156
## 26 5 0.0031622777 0.8799698 0.49321847 0.7148081 0.12312564 0.21166670
## 27 5 0.0074989421 0.8306857 0.59491413 0.6729338 0.10853109 0.15489434
## 28 5 0.0177827941 0.8476035 0.56172126 0.6895430 0.13594806 0.17898397
## 29 5 0.0421696503 0.8506182 0.56302080 0.6933603 0.12800920 0.17785392
## 30 5 0.1000000000 0.8493008 0.56866416 0.6958157 0.12296593 0.17583374
## 31 7 0.0000000000 1.0446689 0.26118383 0.8647718 0.11687753 NA
## 32 7 0.0001000000 0.8758914 0.43401605 0.7052874 0.13718720 0.24252946
## 33 7 0.0002371374 0.8612501 0.50130019 0.7051495 0.13216836 0.19679401
## 34 7 0.0005623413 0.8665464 0.49691632 0.7058923 0.12099405 0.20848006
## 35 7 0.0013335214 0.8889666 0.45768304 0.7225195 0.12498205 0.15221620
## 36 7 0.0031622777 0.8450675 0.53785732 0.6908665 0.10777355 0.17835990
## 37 7 0.0074989421 0.8450696 0.55615284 0.6919681 0.11285030 0.16789082
## 38 7 0.0177827941 0.8473481 0.56596774 0.6920727 0.11814933 0.15178101
## 39 7 0.0421696503 0.8516854 0.54866716 0.6991686 0.12458335 0.17987535
## 40 7 0.1000000000 0.8532175 0.55598530 0.6995990 0.11531984 0.16347045
## 41 9 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 42 9 0.0001000000 0.9053400 0.47018450 0.7376198 0.12340510 0.18174748
## 43 9 0.0002371374 0.8581339 0.50279175 0.6945128 0.12360396 0.20117339
## 44 9 0.0005623413 0.8682516 0.47743254 0.7052188 0.10866332 0.15761384
## 45 9 0.0013335214 0.8487841 0.54335152 0.6923365 0.11285926 0.21172703
## 46 9 0.0031622777 0.8653819 0.50006243 0.7039278 0.13418057 0.17381697
## 47 9 0.0074989421 0.8614632 0.53728306 0.7051822 0.11501599 0.14209049
## 48 9 0.0177827941 0.8438667 0.57239374 0.6875099 0.12070424 0.19107887
## 49 9 0.0421696503 0.8475202 0.56139811 0.6914204 0.12451399 0.18518574
## 50 9 0.1000000000 0.8479317 0.56555419 0.6929261 0.12062881 0.17474524
## 51 11 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 52 11 0.0001000000 0.8863797 0.43661659 0.7174792 0.13482035 0.20425531
## 53 11 0.0002371374 0.8639678 0.49298862 0.6939276 0.08650186 0.14487318
## 54 11 0.0005623413 0.8305483 0.57425068 0.6703282 0.10233630 0.12548986
## 55 11 0.0013335214 0.8684857 0.50391917 0.7040714 0.10635001 0.13048214
## 56 11 0.0031622777 0.8417667 0.57740132 0.6771116 0.10407699 0.15965129
## 57 11 0.0074989421 0.8558486 0.51574617 0.7047413 0.11503821 0.16985105
## 58 11 0.0177827941 0.8494980 0.56452899 0.6975150 0.12409607 0.21118983
## 59 11 0.0421696503 0.8500332 0.55671838 0.6948788 0.12375800 0.17375419
## 60 11 0.1000000000 0.8510765 0.54909167 0.6974395 0.11635677 0.15479243
## 61 13 0.0000000000 1.0446688 0.03838475 0.8647717 0.11687763 NA
## 62 13 0.0001000000 0.8881632 0.44651960 0.7318770 0.14900779 0.16602109
## 63 13 0.0002371374 0.8556257 0.50977811 0.6939578 0.13067559 0.18075079
## 64 13 0.0005623413 0.8731109 0.45175150 0.7143269 0.10868515 0.08216590
## 65 13 0.0013335214 0.8568762 0.53019163 0.6932179 0.10799738 0.15061463
## 66 13 0.0031622777 0.8352275 0.59196217 0.6761294 0.10503114 0.09079615
## 67 13 0.0074989421 0.8263626 0.59095798 0.6707033 0.11344133 0.15139356
## 68 13 0.0177827941 0.8561148 0.54886071 0.7030678 0.12672822 0.16576423
## 69 13 0.0421696503 0.8564260 0.52442336 0.6996245 0.11112637 0.15545819
## 70 13 0.1000000000 0.8495124 0.56051409 0.6935039 0.11689193 0.18305543
## 71 15 0.0000000000 1.0446689 NaN 0.8647718 0.11687754 NA
## 72 15 0.0001000000 0.8499036 0.53533955 0.6900113 0.13009169 0.21611916
## 73 15 0.0002371374 0.8676816 0.51175672 0.7039072 0.13775367 0.15976834
## 74 15 0.0005623413 0.8276822 0.57173349 0.6618641 0.10675721 0.14457227
## 75 15 0.0013335214 0.8320246 0.59444792 0.6721161 0.12096892 0.13185455
## 76 15 0.0031622777 0.8233850 0.60369290 0.6685152 0.10945503 0.13689580
## 77 15 0.0074989421 0.8428127 0.57053362 0.6901207 0.11145676 0.17261132
## 78 15 0.0177827941 0.8426123 0.57641734 0.6884722 0.12346335 0.19082474
## 79 15 0.0421696503 0.8487272 0.55848547 0.6939770 0.11777717 0.18292071
## 80 15 0.1000000000 0.8635644 0.53296657 0.7079445 0.11293195 0.15492395
## 81 17 0.0000000000 NaN NaN NaN NA NA
## 82 17 0.0001000000 NaN NaN NaN NA NA
## 83 17 0.0002371374 NaN NaN NaN NA NA
## 84 17 0.0005623413 NaN NaN NaN NA NA
## 85 17 0.0013335214 NaN NaN NaN NA NA
## 86 17 0.0031622777 NaN NaN NaN NA NA
## 87 17 0.0074989421 NaN NaN NaN NA NA
## 88 17 0.0177827941 NaN NaN NaN NA NA
## 89 17 0.0421696503 NaN NaN NaN NA NA
## 90 17 0.1000000000 NaN NaN NaN NA NA
## 91 19 0.0000000000 NaN NaN NaN NA NA
## 92 19 0.0001000000 NaN NaN NaN NA NA
## 93 19 0.0002371374 NaN NaN NaN NA NA
## 94 19 0.0005623413 NaN NaN NaN NA NA
## 95 19 0.0013335214 NaN NaN NaN NA NA
## 96 19 0.0031622777 NaN NaN NaN NA NA
## 97 19 0.0074989421 NaN NaN NaN NA NA
## 98 19 0.0177827941 NaN NaN NaN NA NA
## 99 19 0.0421696503 NaN NaN NaN NA NA
## 100 19 0.1000000000 NaN NaN NaN NA NA
## MAESD
## 1 0.08179069
## 2 0.09489705
## 3 0.10224020
## 4 0.09996843
## 5 0.08066147
## 6 0.07348494
## 7 0.06944262
## 8 0.09017337
## 9 0.07262601
## 10 0.09171220
## 11 0.08179069
## 12 0.09257861
## 13 0.13249873
## 14 0.07640050
## 15 0.06140349
## 16 0.06181548
## 17 0.08826072
## 18 0.07627891
## 19 0.08372594
## 20 0.08515295
## 21 0.08179069
## 22 0.10681852
## 23 0.09903434
## 24 0.08349104
## 25 0.07196408
## 26 0.08392403
## 27 0.07088956
## 28 0.08362451
## 29 0.08537633
## 30 0.08441758
## 31 0.08179069
## 32 0.08430184
## 33 0.09400697
## 34 0.08130073
## 35 0.07973096
## 36 0.06914395
## 37 0.08214588
## 38 0.07643855
## 39 0.08817356
## 40 0.07436564
## 41 0.08179069
## 42 0.10362609
## 43 0.09640021
## 44 0.07572124
## 45 0.06970062
## 46 0.08666348
## 47 0.06942885
## 48 0.07986951
## 49 0.08811353
## 50 0.08145392
## 51 0.08179069
## 52 0.07795344
## 53 0.06262777
## 54 0.05374182
## 55 0.07082103
## 56 0.06149358
## 57 0.08325841
## 58 0.07992801
## 59 0.08491389
## 60 0.07618963
## 61 0.08179083
## 62 0.09293326
## 63 0.08156012
## 64 0.06745824
## 65 0.06255872
## 66 0.04168713
## 67 0.06879100
## 68 0.08115662
## 69 0.07811210
## 70 0.07726643
## 71 0.08179069
## 72 0.07880120
## 73 0.09582226
## 74 0.04938747
## 75 0.06542059
## 76 0.06247051
## 77 0.07361516
## 78 0.08266805
## 79 0.07930470
## 80 0.07686923
## 81 NA
## 82 NA
## 83 NA
## 84 NA
## 85 NA
## 86 NA
## 87 NA
## 88 NA
## 89 NA
## 90 NA
## 91 NA
## 92 NA
## 93 NA
## 94 NA
## 95 NA
## 96 NA
## 97 NA
## 98 NA
## 99 NA
## 100 NA
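The test-set predictions computed above as nnPred can be scored the same way as the other chemical models; a minimal sketch (value not reported here):
# Sketch: test-set RMSE for the neural network predictions (value not reported above)
rmse(nnPred, test_yield)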
Let’s use the varImp function to see which feature variables in our fit are most consequential.
(importance <- varImp(nnetFit))
## nnet variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess36 59.73
## ManufacturingProcess28 52.69
## ManufacturingProcess39 49.31
## ManufacturingProcess24 48.18
## ManufacturingProcess13 41.22
## ManufacturingProcess22 39.64
## ManufacturingProcess37 39.31
## ManufacturingProcess35 35.03
## ManufacturingProcess12 34.99
## ManufacturingProcess04 32.40
## ManufacturingProcess05 31.69
## ManufacturingProcess33 31.53
## ManufacturingProcess01 31.46
## ManufacturingProcess03 30.67
## BiologicalMaterial10 29.40
## BiologicalMaterial02 27.86
## ManufacturingProcess08 27.54
## ManufacturingProcess43 26.51
## ManufacturingProcess07 26.24
The most important variable for our neural network is ManufacturingProcess32. In fact, the manufacturing process variables dominate the top 10 most important predictors for this fit. This mirrors the linear models trained in Exercise 6.3, where the manufacturing process variables also carried more influence.
One way we can visualize the relationships among the most important predictors and the response is with a correlogram of those features alongside Yield.
# Extract the importance scores and sort them
importance <- importance$importance
importance$feature <- rownames(importance)
importance <- importance[order(importance$Overall, decreasing=TRUE), ]
# Take the top-10 feature names
important_features <- rownames(head(importance, 10))
# Create a correlogram of the top features and Yield from the imputed chemical data
correlation <- cor(imputed$data[, c(important_features, "Yield")])
corrplot(correlation)
We see some stronger correlations between certain feature variables; for instance, manufacturing processes 9 and 13 are strongly negatively correlated. Overall, the one biological material variable doesn’t correlate strongly with most of the manufacturing processes, which makes sense given that these measure quite different parts of the process.
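To dig into any one of these relationships, we could also plot the single most important predictor directly against Yield; a minimal sketch on the imputed (centered and scaled) data, plot omitted here:
# Sketch: top predictor vs. Yield on the imputed, scaled chemical data
ggplot(trans, aes(x = ManufacturingProcess32, y = Yield)) + geom_point() + geom_smooth(method = "lm")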