Exercises from Chapter 7 of the textbook Applied Predictive Modeling by Kuhn & Johnson
\(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)
where the x values are random variables uniformly distributed on [0, 1] (5 additional non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)
set.seed(200)
trainingData = mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x = data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
plot(trainingData$y, main = "200 Computed Values using Uniformly Distributed Seed Inputs",
ylab = "Computed Value")
This question takes an interesting approach: it simulates 200 values like the ones plotted above from a non-linear function of uniformly distributed “predictors”, trains models on those 200 values, and then evaluates the models against a much larger test set in order to remove the chance outcomes that arise with smaller test sets. The main source of variance is now the relatively small training set, where 10-fold CV, for example, averages fits on 180 values evaluated against held-out batches of 20.
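As an aside (my own sketch, using the known noise level of sd = 1 from the simulation): the spread of an RMSE estimate shrinks as the evaluation set grows, which is why a 5000-point test set gives a much steadier error estimate than any 20-point CV fold could.
# RMSE a perfect model would score on n noisy points, repeated many times
rmseNoise = function(n) sqrt(mean(rnorm(n, mean = 0, sd = 1)^2))
set.seed(1)
sd(replicate(2000, rmseNoise(20)))   # noticeably variable across 20-point sets
sd(replicate(2000, rmseNoise(5000))) # nearly constant across 5000-point sets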
set.seed(200)
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData = mlbench.friedman1(5000, sd = 1)
testData$x = data.frame(testData$x)
# alias to simplify typing
trD = trainingData
teD = testData
Provided example:
set.seed(200)
knnModel <- train(x = trD$x,
y = trD$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.654912 0.4779838 2.958475
## 7 3.529432 0.5118581 2.861742
## 9 3.446330 0.5425096 2.780756
## 11 3.378049 0.5723793 2.719410
## 13 3.332339 0.5953773 2.692863
## 15 3.309235 0.6111389 2.663046
## 17 3.317408 0.6201421 2.678898
## 19 3.311667 0.6333800 2.682098
## 21 3.316340 0.6407537 2.688887
## 23 3.326040 0.6491480 2.705915
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
set.seed(200)
knnPred = predict(knnModel, newdata = teD$x)
postResample(pred = knnPred, obs = teD$y)
## RMSE Rsquared MAE
## 5.621993 NA 4.546538
Seeing this example code that normalizes the data makes me wonder whether centering and scaling really help when the predictors are already uniformly distributed on [0, 1], i.e. already on a common scale. Centering subtracts the mean (about 0.5) and scaling divides by the standard deviation (about 0.29), which actually stretches the values out to roughly -1.7 to 1.7 rather than compressing them; and since every predictor gets essentially the same transformation, the relative distances between observations should barely change. First a quick look at what the transformation does, then a refit without any pre-processing:
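A sketch of my own (not part of the assigned code), checking what caret's center/scale pre-processing does to one of the uniform predictors:
pp = preProcess(trD$x, method = c("center", "scale"))
ppX = predict(pp, trD$x)
summary(trD$x$X1) # roughly 0 to 1, centered near 0.5
summary(ppX$X1)   # roughly -1.7 to 1.7, centered at 0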
set.seed(200)
knnNonNorm <- train(x = trD$x,
y = trD$y,
method = "knn",
tuneLength = 10)
knnNonNorm
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.693100 0.4670964 2.984478
## 7 3.588241 0.4932495 2.900074
## 9 3.483328 0.5316156 2.821533
## 11 3.415450 0.5616735 2.761507
## 13 3.380127 0.5834403 2.738470
## 15 3.355314 0.5996582 2.716491
## 17 3.370422 0.6044661 2.732379
## 19 3.366822 0.6189193 2.736500
## 21 3.380526 0.6227753 2.749054
## 23 3.379533 0.6334530 2.754868
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
set.seed(200)
knnNonNormPred = predict(knnNonNorm, newdata = teD$x)
postResample(pred = knnNonNormPred, obs = teD$y)
## RMSE Rsquared MAE
## 5.577570 NA 4.508264
The non-normalized model did slightly better on the test-set errors reported above, although the normalized model had slightly better resampled RMSE and \(R^2\) during training.
Both settled on the same number of neighbors (k = 15).
Incidentally, it doesn’t matter if you just center the data (subtract 0.5 from every predictor); you get exactly the same models, because the Euclidean distances between observations don’t change at all. I’m sure that for some model types normalized data work better than raw uniform data, but that doesn’t appear to be the case for kNN.
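A minimal check of that claim (my own sketch): subtracting the same constant from every predictor leaves all pairwise Euclidean distances, and therefore the kNN neighborhoods, unchanged.
centeredX = trD$x - 0.5
all.equal(dist(trD$x), dist(centeredX)) # TRUE: identical distance matrices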
set.seed(200)
nnetGrid = expand.grid(.decay = c(0.05, 0.1, .15, .2),
.size = 8,
.bag = F)
nnetTune = train(trD$x, trD$y, method = "avNNet", tuneGrid = nnetGrid,
linout = T, trace = F, maxit = 500, maxNWts = 100)
## Warning: executing %dopar% sequentially: no parallel backend registered
plot(nnetTune, main = "Bootstrapped Training RMSE for 8 Hidden Units with Varying Decays")
Based on the plot, using a weight decay of 0.15:
set.seed(200)
nnetAvg = avNNet(trD$x, trD$y, size = 8, decay = 0.15, repeats = 5, linout = T,
trace = F, maxit = 500, maxNWts = 100)
nnetPreds = predict(nnetAvg, newdata = teD$x)
nnetResults = data.frame(obs = teD$y, pred = nnetPreds)
defaultSummary(nnetResults)
## RMSE Rsquared MAE
## 1.5612987 0.9025872 1.2090830
These are much better results than the kNN model. For the record, I tried 4, 8, 12, and 16 hidden units, and 8 was best.
Also worth mentioning: I didn’t normalize the uniform predictors here either.
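The code for that wider search isn't reproduced above; here is a sketch of what it looked like, with the grid values taken from the sentence above (and maxNWts raised so the 16-unit networks fit; treat the details as illustrative):
wideGrid = expand.grid(.size = c(4, 8, 12, 16),
.decay = c(0.05, 0.1, 0.15, 0.2),
.bag = FALSE)
nnetWide = train(trD$x, trD$y, method = "avNNet", tuneGrid = wideGrid,
linout = TRUE, trace = FALSE, maxit = 500, maxNWts = 200)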
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
set.seed(200)
plainMars = earth(trD$x, trD$y)
summary(plainMars)
## Call: earth(x=trD$x, y=trD$y)
##
## coefficients
## (Intercept) 18.451984
## h(0.621722-X1) -11.074396
## h(0.601063-X2) -10.744225
## h(X3-0.281766) 20.607853
## h(0.447442-X3) 17.880232
## h(X3-0.447442) -23.282007
## h(X3-0.636458) 15.150350
## h(0.734892-X4) -10.027487
## h(X4-0.734892) 9.092045
## h(0.850094-X5) -4.723407
## h(X5-0.850094) 10.832932
## h(X6-0.361791) -1.956821
##
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
The informative predictors were indeed recognized by the model. In order: X1, X4, X2, X5, X3.
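As a cross-check on that ordering (my own addition), earth's built-in evimp() ranks the predictors using its own criteria (number of model subsets, GCV, and RSS):
evimp(plainMars)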
plainMarsPreds = predict(plainMars, newdata = teD$x)
plainMarsResults = data.frame(obs = teD$y, pred = plainMarsPreds[,])
defaultSummary(plainMarsResults)
## RMSE Rsquared MAE
## 1.8108045 0.8690992 1.3840843
Let’s see if tuning a couple of parameters will improve those results.
set.seed(200)
marsGrid = expand.grid(.degree = 1:2, .nprune = 5:20)
marsTuned = train(trD$x, trD$y, method = "earth", tuneGrid = marsGrid,
newvar.penalty = 0.5,
trControl = trainControl(method = "cv", number = 10))
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 5 2.402099 0.7568218 1.9457323
## 1 6 2.289366 0.7762213 1.8514384
## 1 7 1.730346 0.8670812 1.3584142
## 1 8 1.671535 0.8753821 1.3215500
## 1 9 1.603034 0.8845449 1.2847966
## 1 10 1.620698 0.8820821 1.2989191
## 1 11 1.588064 0.8879146 1.2627053
## 1 12 1.577311 0.8890065 1.2494447
## 1 13 1.578157 0.8891578 1.2596942
## 1 14 1.579869 0.8895682 1.2546183
## 1 15 1.579869 0.8895682 1.2546183
## 1 16 1.579869 0.8895682 1.2546183
## 1 17 1.579869 0.8895682 1.2546183
## 1 18 1.579869 0.8895682 1.2546183
## 1 19 1.579869 0.8895682 1.2546183
## 1 20 1.579869 0.8895682 1.2546183
## 2 5 2.353031 0.7675326 1.8660663
## 2 6 2.215402 0.7888283 1.7377388
## 2 7 1.780715 0.8582379 1.3800123
## 2 8 1.666597 0.8813352 1.3063166
## 2 9 1.554979 0.8985663 1.2114735
## 2 10 1.426109 0.9165576 1.1509405
## 2 11 1.355926 0.9215169 1.0974748
## 2 12 1.254201 0.9329189 1.0259035
## 2 13 1.245268 0.9356153 1.0049496
## 2 14 1.232472 0.9377076 0.9963319
## 2 15 1.235843 0.9370977 1.0032840
## 2 16 1.205527 0.9403988 0.9839347
## 2 17 1.205527 0.9403988 0.9839347
## 2 18 1.205527 0.9403988 0.9839347
## 2 19 1.205527 0.9403988 0.9839347
## 2 20 1.205527 0.9403988 0.9839347
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 16 and degree = 2.
varImp(marsTuned)$importance
## Overall
## X1 100.00000
## X4 74.84542
## X2 48.59694
## X5 15.43919
## X3 0.00000
This tuned version uses just X1-X5 and their combinations (it did pick up on the X1*X2 interaction):
summary(marsTuned)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
## nprune=16, newvar.penalty=0.5)
##
## coefficients
## (Intercept) 22.721354
## h(0.655098-X1) -15.638291
## h(X1-0.655098) 9.284136
## h(0.647925-X2) -17.516017
## h(X2-0.647925) 9.858733
## h(0.447442-X3) 9.726195
## h(X3-0.606015) 12.604512
## h(0.734892-X4) -9.776458
## h(X4-0.734892) 9.960955
## h(0.850094-X5) -5.366065
## h(0.218266-X1) * h(X2-0.647925) -57.662217
## h(X1-0.218266) * h(X2-0.647925) -31.931918
## h(X1-0.655098) * h(X2-0.321448) -38.012887
## h(0.671787-X1) * h(0.647925-X2) 24.027065
##
## Selected 14 of 19 terms, and 5 of 10 predictors (nprune=16)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.595457 RSS 221.1483 GRSq 0.9352238 RSq 0.9546541
Judging by the resampled \(R^2\) (about .94), this tuned model looks like it may do better than the default MARS model from before (test \(R^2\) of .869).
newMarsPreds = predict(marsTuned, newdata = teD$x)
newMarsResults = data.frame(obs = teD$y, pred = newMarsPreds[,])
defaultSummary(newMarsResults)
## RMSE Rsquared MAE
## 1.1806023 0.9445875 0.9362056
That is quite a bit better than even the nnet model: 1.18 test RMSE here vs 1.81 for the default MARS and 1.56 for the nnet, and a test \(R^2\) of .945 here vs .869 for the default MARS and .903 for the nnet.
Upon reviewing the help page for earth, it appears that by trying different newvar.penalty values and going with the one that worked best (the highest one I tried), I made it preferable for the model to reuse combinations of already-chosen features rather than add new candidate ones. As the design of this dataset would have it, X1 and X2 are the most important ones, and those after X5 don’t matter at all. So assuming the model considers candidate features in order, this accident of ordering happened to work in its favor, and perhaps these results are a bit optimistic.
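Here is a sketch of the comparison described above (the penalty values are illustrative rather than the exact set I tried): refit with several newvar.penalty settings on the same CV folds and compare each fit's best resampled RMSE.
penalties = c(0, 0.1, 0.3, 0.5)
sapply(penalties, function(p) {
set.seed(200) # same CV folds for every penalty value
fit = train(trD$x, trD$y, method = "earth", tuneGrid = marsGrid,
newvar.penalty = p,
trControl = trainControl(method = "cv", number = 10))
min(fit$results$RMSE, na.rm = TRUE)
})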
I don’t know whether the model searches the predictors left to right in the matrix or in alphabetical order, so I’ll try changing both, to see whether the results were due to a lucky ordering of the features considered. First, the left-to-right ordering:
set.seed(200)
neworder = sample(1:10)
shufTrain = trD$x[,neworder]
shufTest = teD$x[,neworder]
shufTuned = train(shufTrain, trD$y, method = "earth", tuneGrid = marsGrid,
newvar.penalty = 0.5,
trControl = trainControl(method = "cv", number = 10))
varImp(shufTuned)$importance
## Overall
## X1 100.00000
## X4 74.84542
## X2 48.59694
## X5 15.43919
## X3 0.00000
head(shufTrain, 2)
## X6 X2 X7 X8 X4 X10 X1
## 1 0.3617906 0.6478064 0.8266609 0.4214081 0.1815996 0.5886216 0.5337724
## 2 0.4530593 0.4381528 0.6489601 0.8446239 0.6692491 0.7584008 0.5837650
## X9 X3 X5
## 1 0.5911144 0.8507853 0.9290398
## 2 0.9281931 0.6727266 0.1637978
Well, changing the left-to-right order of the columns didn’t affect anything, but maybe the model goes by alphabetical order, so I’ll try renaming the features in reverse alphabetical order.
renameTrain = trD$x
renameTest = teD$x
names(renameTrain) = LETTERS[10:1]
names(renameTest) = names(renameTrain)
renameTuned = train(renameTrain, trD$y, method = "earth", tuneGrid = marsGrid,
newvar.penalty = 0.5,
trControl = trainControl(method = "cv", number = 10))
varImp(renameTuned)$importance
## Overall
## J 100.00000
## G 74.89804
## I 48.68261
## F 15.50199
## H 0.00000
That didn’t change things either, so it looks like the effect of the high newvar.penalty setting wasn’t due to chance.
library(kernlab)
## Warning: package 'kernlab' was built under R version 4.0.5
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
set.seed(200)
svmRTuned = train(trD$x, trD$y,
method = 'svmRadial',
tuneLength = 14, trControl = trainControl(method = 'cv'))
svmLTuned = train(trD$x, trD$y,
method = 'svmLinear',
tuneLength = 14, trControl = trainControl(method = 'cv'))
polyGrid = expand.grid(.degree = 2:4, .scale = c(.05, .1, .15), .C = c(.25, .5, 1))
svmPTuned = train(trD$x, trD$y,
method = 'svmPoly',
tuneGrid = polyGrid, trControl = trainControl(method = 'cv'))
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 8
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0629932410345397
##
## Number of Support Vectors : 152
##
## Objective Function Value : -72.63
## Training error : 0.009178
svmLTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 172
##
## Objective Function Value : -54.9759
## Training error : 0.216921
svmPTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 0.5
##
## Polynomial kernel function.
## Hyperparameters : degree = 4 scale = 0.05 offset = 1
##
## Number of Support Vectors : 144
##
## Objective Function Value : -11.5111
## Training error : 0.029197
radialPreds = predict(svmRTuned, newdata = teD$x)
radialResults = data.frame(obs = teD$y, pred = radialPreds)
defaultSummary(radialResults)
## RMSE Rsquared MAE
## 2.0870811 0.8253497 1.5886934
linearPreds = predict(svmLTuned, newdata = teD$x)
linearResults = data.frame(obs = teD$y, pred = linearPreds)
defaultSummary(linearResults)
## RMSE Rsquared MAE
## 2.7610923 0.6997975 2.0815518
polyPreds = predict(svmPTuned, newdata = teD$x)
polyResults = data.frame(obs = teD$y, pred = polyPreds)
defaultSummary(polyResults)
## RMSE Rsquared MAE
## 2.093961 0.824972 1.574803
The RBF and polynomial kernels produced similar results, while the linear kernel lagged behind.
Overall, the SVM models weren’t as effective on this dataset as MARS or the neural network, but they did better than kNN.
The best of the models I tried was the MARS model with a high newvar.penalty; second best was the neural network with 8 hidden units and a high weight decay.
MARS did in fact find the important “predictors”.
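For a side-by-side view, here is a compact recomputation of the test-set metrics for the neural network, MARS, and SVM fits above (my own addition; it simply reuses the prediction objects already in the workspace):
rbind(avNNet = postResample(nnetPreds, teD$y),
MARSdefault = postResample(as.numeric(plainMarsPreds), teD$y),
MARStuned = postResample(as.numeric(newMarsPreds), teD$y),
svmRadial = postResample(radialPreds, teD$y),
svmPoly = postResample(polyPreds, teD$y),
svmLinear = postResample(linearPreds, teD$y))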
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# remove one feature that has near-zero variance
prepro = ChemicalManufacturingProcess %>% select(-c(BiologicalMaterial07))
# turn the 0's in row 108 into NA's, then impute the median for every NA in the dataset
prepro[108, c(37:39,41:43)] = NA # ManufacturingProcess 25-27 and 29-31
for (feat in names(prepro)) {
med = median(prepro[,feat], na.rm = T)
prepro[feat][is.na(prepro[feat])] = med
}
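# (my own sanity check, not in the original:) the imputation loop should leave no NA's behind
sum(is.na(prepro)) # expect 0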
# shorten the feature names, e.g. "ManufacturingProcess32" -> "M32", "BiologicalMaterial06" -> "B06", "Yield" -> "Yld"
shorten = function(n) paste0(substr(n, 1, 1), substr(n, nchar(n)-1, nchar(n)))
names(prepro) = sapply(names(prepro), shorten)
Split the 176 rows into 136 train / 40 test.
set.seed(624)
trains = sample(1:176, 136) # randomly choose 136 and test with the other 40
tests = setdiff(1:176, trains)
trainDF = prepro[trains,]
testDF = prepro[tests,]
trainX = trainDF %>% select(-c(Yld))
trainY = trainDF$Yld
testX = testDF %>% select(-c(Yld))
testY = testDF$Yld
Min-max scale the values, using the ranges of the training set so as not to leak information from the test set.
for (feat in names(trainX)){
hi = max(trainX[,feat])
lo = min(trainX[,feat])
trainX[,feat] = (trainX[,feat] - lo) / (hi - lo)
testX[,feat] = (testX[,feat] - lo) / (hi - lo)
}
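# (my own quick check:) the training predictors should now span exactly [0, 1];
# the test predictors can fall slightly outside, since they use the training min/max
range(as.matrix(trainX))
range(as.matrix(testX))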
featurePlot(trainX[, 1:28], trainY)
featurePlot(trainX[, 29:56], trainY)
The positive correlation between the Bio features and the response is evident here. Some of the Mfg features also show clear correlations, both positive (M32) and negative (M36). And because several features have lone 0’s (e.g., M20), these models will benefit from normalizing the features.
set.seed(624)
knnTune = train(trainX, trainY, method = "knn",
preProc = c("center", "scale"),
tuneGrid = data.frame(.k = seq(1,15,2)),
trControl = trainControl(method = "cv"))
knnTune
## k-Nearest Neighbors
##
## 136 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 124, 121, 121, 124, 123, 122, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 1.416087 0.5025799 1.1124934
## 3 1.210293 0.6047356 0.9290389
## 5 1.249059 0.5953135 1.0020743
## 7 1.277930 0.5805682 1.0546478
## 9 1.308584 0.5623801 1.0821914
## 11 1.332110 0.5503524 1.0958409
## 13 1.367348 0.5223980 1.1200447
## 15 1.369058 0.5301305 1.1275520
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
knnNormPred = predict(knnTune, newdata = testX)
knnNormResults = data.frame(obs = testY, pred = knnNormPred)
defaultSummary(knnNormResults)
## RMSE Rsquared MAE
## 1.2446672 0.4532793 1.0724167
kNN with k=3 and the features normalized produces a baseline RMSE of 1.245 and an \(R^2\) of .453. Hopefully this won’t be the optimal model, since it doesn’t get to the heart of the question about which predictors have the most effect on the response. For that, maybe MARS will help here.
set.seed(624)
marsGrid = expand.grid(.degree = 1:2, .nprune = 5:20)
marsTuned = train(trainX, trainY, method = "earth", tuneGrid = marsGrid,
newvar.penalty = 0.07, preProcess = c("center", "scale"),
trControl = trainControl(method = "cv"))
summary(marsTuned)
## Call: earth(x=data.frame[136,56], y=c(43.38,40.31,3...), keepxy=TRUE, degree=1,
## nprune=15, newvar.penalty=0.07)
##
## coefficients
## (Intercept) 39.818667
## h(1.3001-B06) -0.535678
## h(-1.09499-M09) -1.003246
## h(-1.36465-M13) 3.860237
## h(M22- -1.34292) -0.205026
## h(0.76318-M28) 0.431782
## h(1.15187-M30) -0.518481
## h(-1.16237-M32) -2.016012
## h(M32- -1.16237) 1.606808
## h(M32-1.01059) -1.443762
## h(-0.960163-M33) 2.156659
## h(0.218267-M39) -0.232838
## h(M39-0.218267) -6.938517
## h(M42-0.295014) -6.846153
##
## Selected 14 of 22 terms, and 10 of 56 predictors (nprune=15)
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: M32, M30, M13, M39, M09, B06, M28, M33, M22, M42, B01-unused, ...
## Number of terms at each degree of interaction: 1 13 (additive model)
## GCV 1.01665 RSS 88.81486 GRSq 0.7196 RSq 0.8172054
marsPreds = predict(marsTuned, newdata = testX)
marsResults = data.frame(obs = testY, pred = marsPreds[,])
defaultSummary(marsResults)
## RMSE Rsquared MAE
## 1.1554675 0.5355233 0.8798888
That .536 is quite a drop-off from the cross-validated \(R^2\), and not as good as a simple linear regression model (.596).
set.seed(624)
svmRadTuned = train(trainX, trainY,
method = 'svmRadial', preProcess = c("center", "scale"),
tuneLength = 14, trControl = trainControl(method = 'cv'))
polyGrid = expand.grid(.degree = 2:4, .scale = c(.05, .1, .15), .C = c(.25, .5, 1))
svmPolyTuned = train(trainX, trainY,
method = 'svmPoly', preProcess = c("center", "scale"),
tuneGrid = polyGrid, trControl = trainControl(method = 'cv'))
svmRadTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 4
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0139503949307737
##
## Number of Support Vectors : 118
##
## Objective Function Value : -75.1875
## Training error : 0.031368
radPreds = predict(svmRadTuned, newdata = testX)
radResults = data.frame(obs = testY, pred = radPreds)
defaultSummary(radResults)
## RMSE Rsquared MAE
## 0.9821407 0.6717444 0.7670203
The RBF model did quite a bit better than any of the other models so far, with a test \(R^2\) of .672.
svmPolyTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 0.25
##
## Polynomial kernel function.
## Hyperparameters : degree = 2 scale = 0.15 offset = 1
##
## Number of Support Vectors : 129
##
## Objective Function Value : -1.2469
## Training error : 0.009761
polPreds = predict(svmPolyTuned, newdata = testX)
polResults = data.frame(obs = testY, pred = polPreds)
defaultSummary(polResults)
## RMSE Rsquared MAE
## 5.3052920 0.0406753 1.7460609
Not nearly as good as the RBF SVM.
The optimal model was the SVM with RBF. Here are the top-20 predictor importances:
varImp(svmRadTuned)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## M13 100.00
## M32 96.49
## M17 88.25
## B06 83.63
## M09 79.40
## B03 70.95
## M36 68.71
## M31 64.92
## B02 62.62
## B12 59.61
## M06 58.95
## M33 50.11
## B04 48.81
## M30 47.95
## M11 46.60
## B11 42.04
## M29 40.86
## B08 40.66
## B01 38.30
## B09 38.26
Compared to the best linear model’s feature importance, determined by coefficient magnitude:
linfeats = c('B02', 'B03', 'M04', 'M09', 'M17', 'M27', 'M29', 'M32', 'M33', 'M43')
linvals = c(-3.614, 3.647, 2.556, 2.401, -3.071, -4.984, 7.204, 8.088, -4.209, 2.063)
featframe = data.frame(Feats = linfeats, Coef = linvals)
featframe[order(-abs(featframe$Coef)),]
## Feats Coef
## 8 M32 8.088
## 7 M29 7.204
## 6 M27 -4.984
## 9 M33 -4.209
## 2 B03 3.647
## 1 B02 -3.614
## 5 M17 -3.071
## 3 M04 2.556
## 4 M09 2.401
## 10 M43 2.063
Interestingly, the Bio features are more important in the SVM model than they were in the linear models. The Bio features were for the most part correlated with the response, so it makes sense that they matter to these models; but several Manufacturing features were highly correlated with some of the Bio features, which left the door open for a model to drop one of each correlated pair.
M13 was the most important predictor for the SVM but didn’t register as strongly for the linear model. It was highly correlated with M09 and M17, both of which were top ten in both models. M09, M32, B02, and B03 were top ten for both as well. The SVM kept B06 and B12, while the linear model kept M27 and ranked M29 higher. That M29 is one of the features the linear model used to capture some of the Bio features it discarded, since it is fairly correlated with the B06 and B12 that the SVM leaned on.
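A quick spot-check of those claims (my own addition, using the shortened feature names from the pre-processing step):
round(cor(prepro[, c("M13", "M09", "M17")]), 2)
round(cor(prepro[, c("M29", "B06", "B12")]), 2)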
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(prepro), type = "upper", diag = F) -> r # assign to suppress some printed output
As mentioned, the Bio features uniquely used by the SVM model are positively correlated with the response, especially B06. This B06, in turn, is highly correlated with features the linear model relied on most, like M29 and M33. It’s hard to know the most effective way to use this knowledge to increase Yield, since we don’t know what the features describe beyond their broad category. It was mentioned that Bio features can’t be altered, whereas Manufacturing ones can, so it might be better to have a top-10 list like the linear model’s, where more of the features are manufacturing ones. But if the experimenters can simply choose materials with the Bio feature values that work best, the SVM model shows a clearer path to improving Yield, given its higher predictive power.

One big question is why so many Bio features are so strongly correlated with Mfg features. A few more experimental runs that break those correlations might determine whether to focus on improving Yield through selection of Bio features or through adjustments to Mfg features.