Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)$$
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative predictors). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
library(caret)  # provides featurePlot() and train()
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## mlbench.friedman1 returns a list with a vector 'y' and a matrix of
## predictors 'x'. We convert the 'x' data from a matrix to a data frame;
## one reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## Also simulate a large test set to estimate the true error rate
## with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
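As a sanity check on the equation above, we can recompute the deterministic part of Friedman's equation from the generated predictors and confirm that the residual spread matches the simulation's sigma of 1 (a minimal sketch):
## Reconstruct the noiseless response from the simulated predictors
X <- trainingData$x
yhat <- 10 * sin(pi * X$X1 * X$X2) + 20 * (X$X3 - 0.5)^2 +
  10 * X$X4 + 5 * X$X5
sd(trainingData$y - yhat)  # should be close to 1, the noise standard deviation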
Tune several models on these data. For example, a k-nearest neighbors model:
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.565620 0.4887976 2.886629
## 7 3.422420 0.5300524 2.752964
## 9 3.368072 0.5536927 2.715310
## 11 3.323010 0.5779056 2.669375
## 13 3.275835 0.6030846 2.628663
## 15 3.261864 0.6163510 2.621192
## 17 3.261973 0.6267032 2.616956
## 19 3.286299 0.6281075 2.640585
## 21 3.280950 0.6390386 2.643807
## 23 3.292397 0.6440392 2.656080
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
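The resampling profile can also be inspected graphically; caret's plot method for train objects shows RMSE against the tuning parameter (plot not shown):
plot(knnModel)  # RMSE vs. number of neighbors k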
Evaluating the model’s performance:
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
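For reference, these three metrics are easy to reproduce by hand (a quick sketch; note that caret's default Rsquared is the squared correlation between predicted and observed values):
c(RMSE     = sqrt(mean((testData$y - knnPred)^2)),
  Rsquared = cor(testData$y, knnPred)^2,
  MAE      = mean(abs(testData$y - knnPred)))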
The next model I will tune and evaluate is a radial basis function SVM. I will center and scale the data in pre-processing and use a tune length of 10.
set.seed(200)
svmTuned <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "repeatedcv"))
svmTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.509258 0.8007195 2.008079
## 0.50 2.265063 0.8151861 1.787201
## 1.00 2.090305 0.8365821 1.636255
## 2.00 1.995634 0.8495588 1.557253
## 4.00 1.888238 0.8634860 1.501387
## 8.00 1.845986 0.8701926 1.486047
## 16.00 1.842402 0.8704345 1.485279
## 32.00 1.842402 0.8704345 1.485279
## 64.00 1.842402 0.8704345 1.485279
## 128.00 1.842402 0.8704345 1.485279
##
## Tuning parameter 'sigma' was held constant at a value of 0.06574662
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06574662 and C = 16.
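A note on the constant sigma: for method = "svmRadial", caret does not tune sigma over the grid; it fixes it using an estimate from kernlab's sigest function on the training predictors. A minimal sketch of that calculation, assuming it mirrors caret's internal default:
library(kernlab)
## sigest returns the 0.1, 0.5, and 0.9 quantile estimates of a suitable sigma range
sigmas <- sigest(as.matrix(trainingData$x), scaled = TRUE)
mean(sigmas[-2])  # average of the low and high estimates; should be close to 0.0657 above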
Evaluating the model’s performance:
svmPred <- predict(svmTuned, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0804235 0.8246038 1.5805232
The next model I will tune and evaluate is a MARS model. I will center and scale the data in pre-processing and tune over a grid of degree (1-2) and nprune (2-20) values.
set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTune <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "earth",
                  preProc = c("center", "scale"),
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))
marsTune
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.350138 0.2513413 3.5784792
## 1 3 3.644350 0.4731295 2.9614267
## 1 4 2.546913 0.7523393 2.0639448
## 1 5 2.209407 0.8000495 1.7664615
## 1 6 2.098155 0.8270593 1.6534878
## 1 7 1.719366 0.8812337 1.3882290
## 1 8 1.695204 0.8831099 1.3488029
## 1 9 1.742115 0.8770325 1.3682250
## 1 10 1.735453 0.8790493 1.3614383
## 1 11 1.703769 0.8857205 1.3551069
## 1 12 1.684606 0.8879534 1.3341565
## 1 13 1.690241 0.8868812 1.3355205
## 1 14 1.677364 0.8877485 1.3241164
## 1 15 1.689320 0.8860203 1.3337850
## 1 16 1.689320 0.8860203 1.3337850
## 1 17 1.689320 0.8860203 1.3337850
## 1 18 1.689320 0.8860203 1.3337850
## 1 19 1.689320 0.8860203 1.3337850
## 1 20 1.689320 0.8860203 1.3337850
## 2 2 4.350138 0.2513413 3.5784792
## 2 3 3.644350 0.4731295 2.9614267
## 2 4 2.546913 0.7523393 2.0639448
## 2 5 2.209407 0.8000495 1.7664615
## 2 6 2.147030 0.8163048 1.6856627
## 2 7 1.694724 0.8853223 1.3533072
## 2 8 1.588283 0.8969166 1.2356377
## 2 9 1.442477 0.9148789 1.1339275
## 2 10 1.393764 0.9212369 1.1120797
## 2 11 1.279830 0.9364074 1.0098606
## 2 12 1.217682 0.9427854 0.9598982
## 2 13 1.204097 0.9460219 0.9422898
## 2 14 1.196080 0.9463109 0.9386478
## 2 15 1.206115 0.9441910 0.9558298
## 2 16 1.212026 0.9446917 0.9549862
## 2 17 1.209095 0.9447813 0.9531492
## 2 18 1.209095 0.9447813 0.9531492
## 2 19 1.209095 0.9447813 0.9531492
## 2 20 1.209095 0.9447813 0.9531492
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
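To see which hinge functions the selected model actually uses, the final earth fit can be inspected directly (output not shown):
summary(marsTune$finalModel)  # lists the selected hinge terms and their coefficients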
Evaluating the model’s performance:
marsPred <- predict(marsTune, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.2779993 0.9338365 1.0147070
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0804235 0.8246038 1.5805232
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.2779993 0.9338365 1.0147070
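Collecting the three test-set results into a single table makes the comparison easier to read (a small convenience sketch):
rbind(KNN  = postResample(knnPred,  testData$y),
      SVM  = postResample(svmPred,  testData$y),
      MARS = postResample(marsPred, testData$y))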
varImp(marsTune)
## earth variable importance
##
## Overall
## X1 100.00
## X4 84.98
## X2 68.87
## X5 48.55
## X3 38.96
## X6 0.00
## X10 0.00
## X8 0.00
## X7 0.00
## X9 0.00
The MARS model gives the best test-set performance, with the lowest RMSE (1.278) and the highest R² (0.934). Using varImp to identify the most important MARS predictors confirms that they include the five informative predictors X1-X5, while the noise variables X6-X10 receive zero importance.
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
The first step is to read the data into a data frame and use k-nearest neighbors imputation (knnImpute) to fill in the missing values. Note that caret's knnImpute also centers and scales the predictors as part of the imputation.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
df1 <- ChemicalManufacturingProcess
dim(df1)
## [1] 176 58
df1pp <- preProcess(df1[, 2:ncol(df1)], method = c("knnImpute"))
df1 <- cbind(df1$Yield, predict(df1pp, df1[, 2:ncol(df1)]))
colnames(df1)[1] <- "Yield"
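A quick sanity check that the imputation left no missing values (output not shown):
sum(is.na(df1))  # expected to be 0 after knnImpute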
The second step is to split the data into training and test sets; I will use an 80%/20% split.
set.seed(333)
trainingRows <- createDataPartition(df1$Yield, p = 0.8, list = FALSE)
df1_train <- df1[trainingRows, ]
df1_test <- df1[-trainingRows, ]
The first model I will tune and evaluate is a KNN model. I will center and scale the data in pre-processing and use a tune length of 6.
library(caret)
set.seed(200)
knnModel <- suppressWarnings(train(x = df1_train[, 2:ncol(df1_train)],
                                   y = df1_train$Yield,
                                   method = "knn",
                                   preProc = c("center", "scale"),
                                   tuneLength = 6,
                                   trControl = trainControl(method = "repeatedcv")))
knnModel
## k-Nearest Neighbors
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 128, 129, 130, 129, 132, 129, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.296841 0.5588651 1.049013
## 7 1.335754 0.5386837 1.097550
## 9 1.324421 0.5752357 1.094814
## 11 1.327724 0.5714342 1.094795
## 13 1.350020 0.5509203 1.107042
## 15 1.363992 0.5438233 1.116888
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
Evaluating the model’s performance:
knnPred <- predict(knnModel, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = knnPred, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 1.3031733 0.5503026 0.9813125
The next model I will tune and evaluate is a radial basis function SVM; as before, I will center and scale the data in pre-processing and use a tune length of 10.
set.seed(200)
svmTuned <- train(x = df1_train[, 2:ncol(df1_train)],
                  y = df1_train$Yield,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "repeatedcv"))
svmTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 128, 129, 130, 129, 132, 129, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.386341 0.5543104 1.1455128
## 0.50 1.257204 0.5993286 1.0318796
## 1.00 1.134554 0.6549640 0.9261774
## 2.00 1.064801 0.6887278 0.8641585
## 4.00 1.051533 0.7004169 0.8568931
## 8.00 1.054729 0.7026112 0.8648906
## 16.00 1.055415 0.7017858 0.8649271
## 32.00 1.055415 0.7017858 0.8649271
## 64.00 1.055415 0.7017858 0.8649271
## 128.00 1.055415 0.7017858 0.8649271
##
## Tuning parameter 'sigma' was held constant at a value of 0.01372596
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01372596 and C = 4.
Evaluating the model’s performance:
svmPred <- predict(svmTuned, newdata = df1_test[,2:ncol(df1_test)])
postResample(pred = svmPred, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 1.2208660 0.6179264 0.8668962
On the test set, the SVM model performs better than KNN.
The next model I will tune and evaluate is a MARS model, again tuning over a grid of degree (1-2) and nprune (2-20) values with centering and scaling in pre-processing.
set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:20)
marsTune <- train(x = df1_train[, 2:ncol(df1_train)],
                  y = df1_train$Yield,
                  method = "earth",
                  preProc = c("center", "scale"),
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))
marsTune
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 129, 130, 129, 132, 129, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.469694 0.4099538 1.1524962
## 1 3 1.270394 0.5461842 1.0135638
## 1 4 1.237360 0.5645598 0.9940006
## 1 5 1.243428 0.5465747 0.9978504
## 1 6 1.243240 0.5625398 0.9675818
## 1 7 1.281018 0.5448316 1.0106055
## 1 8 1.261312 0.5602795 1.0181434
## 1 9 1.267466 0.5564907 1.0329659
## 1 10 1.302172 0.5403139 1.0488237
## 1 11 1.296341 0.5445436 1.0586088
## 1 12 1.273445 0.5541629 1.0393424
## 1 13 1.252598 0.5698303 1.0246146
## 1 14 1.229202 0.5807273 1.0091250
## 1 15 1.232702 0.5796333 1.0146955
## 1 16 1.231305 0.5806299 1.0147462
## 1 17 1.235551 0.5768138 1.0097048
## 1 18 1.235551 0.5768138 1.0097048
## 1 19 1.235551 0.5768138 1.0097048
## 1 20 1.235551 0.5768138 1.0097048
## 2 2 1.469694 0.4099538 1.1524962
## 2 3 1.281032 0.5539317 1.0244959
## 2 4 1.245999 0.5686795 1.0027749
## 2 5 1.239588 0.5755069 1.0083431
## 2 6 1.277475 0.5596173 1.0404891
## 2 7 1.290410 0.5601103 1.0537775
## 2 8 1.284861 0.5524504 1.0509238
## 2 9 1.278701 0.5694702 1.0456271
## 2 10 1.294947 0.5617347 1.0661414
## 2 11 1.251266 0.5946481 1.0199754
## 2 12 1.296675 0.5954595 1.0328179
## 2 13 1.313246 0.5910848 1.0133991
## 2 14 1.310340 0.5943687 1.0090514
## 2 15 1.311538 0.5921334 1.0090578
## 2 16 1.320320 0.5905494 1.0068345
## 2 17 1.336474 0.5847754 1.0142035
## 2 18 1.326958 0.5945070 1.0169338
## 2 19 1.314935 0.5768725 1.0182396
## 2 20 1.354841 0.5597531 1.0389166
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 1.
Evaluating the model’s performance:
marsPred <- predict(marsTune, newdata = df1_test[, 2:ncol(df1_test)])
postResample(pred = marsPred, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 1.0494165 0.6925441 0.8367074
Which nonlinear regression model gives the optimal resampling and test set performance?
Let's compare the models and determine which has the best test set performance:
postResample(pred = knnPred, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 1.3031733 0.5503026 0.9813125
postResample(pred = svmPred, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 1.2208660 0.6179264 0.8668962
postResample(pred = marsPred, obs = df1_test$Yield)
## RMSE Rsquared MAE
## 1.0494165 0.6925441 0.8367074
The MARS model gives the best performance, with the lowest RMSE (1.049) and the highest R² (0.693).
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
Most important MARS Predictors:
varImp(marsTune)
## earth variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09 68.202
## ManufacturingProcess13 36.528
## ManufacturingProcess28 28.445
## ManufacturingProcess33 28.445
## ManufacturingProcess15 27.292
## ManufacturingProcess39 26.812
## ManufacturingProcess04 23.173
## BiologicalMaterial03 12.184
## ManufacturingProcess23 6.087
## ManufacturingProcess10 0.000
## BiologicalMaterial06 0.000
## ManufacturingProcess40 0.000
## ManufacturingProcess41 0.000
## ManufacturingProcess36 0.000
## BiologicalMaterial05 0.000
## BiologicalMaterial08 0.000
## ManufacturingProcess25 0.000
## ManufacturingProcess19 0.000
## BiologicalMaterial04 0.000
The most important predictors are overwhelmingly process variables. Only one biological variable (BiologicalMaterial03) receives a nonzero importance score, and it ranks ninth, so it is relatively unimportant.
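This split can be tallied programmatically (a small sketch using the varImp scores):
## Count process vs. biological variables among the top 10 MARS predictors
imp   <- varImp(marsTune)$importance
top10 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))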
For comparison, here are the coefficients of the elastic net linear model, refit as in Exercise 6.3. Note that the coefficients are extracted at a fraction of s = 0.1; for a fully consistent comparison one could instead use the tuned value stored in df1_enet$bestTune.
df1_enet <- train(x = df1_train[, 2:ncol(df1_train)],
                  y = df1_train$Yield,
                  method = "enet",
                  metric = "RMSE",
                  trControl = trainControl(method = "repeatedcv"),
                  preProcess = c("center", "scale"))
enetCoef <- predict(df1_enet$finalModel,
                    newx = as.matrix(df1_test[, 2:ncol(df1_test)]),
                    s = 0.1, mode = "fraction",
                    type = "coefficients")
head(sort(enetCoef$coefficients),3)
## ManufacturingProcess13 ManufacturingProcess37 ManufacturingProcess17
## -0.25153249 -0.11488070 -0.06413573
tail(sort(enetCoef$coefficients),7)
## BiologicalMaterial03 ManufacturingProcess44 ManufacturingProcess06
## 0.07826640 0.09057314 0.10343127
## ManufacturingProcess15 ManufacturingProcess34 ManufacturingProcess09
## 0.15819222 0.17985524 0.51029411
## ManufacturingProcess32
## 0.72011230
ManufacturingProcess 32, 09, 34, 15, 06, 44, 17, 37, and 13, together with BiologicalMaterial03, were the strongest predictors in my elastic net linear model.
Five of the MARS model's top ten predictors (ManufacturingProcess32, 09, 13, 15, and BiologicalMaterial03) also appear in the linear model's top ten. The linear and nonlinear models therefore identify five of the same main predictors, but the nonlinear model shows better overall performance.
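The overlap can be verified programmatically (a sketch; enetCoef comes from the prediction above):
## Intersect the top-10 lists of the MARS and elastic net models
marsImp   <- varImp(marsTune)$importance
marsTop10 <- rownames(marsImp)[order(marsImp$Overall, decreasing = TRUE)][1:10]
enetTop10 <- names(sort(abs(enetCoef$coefficients), decreasing = TRUE))[1:10]
intersect(marsTop10, enetTop10)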
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
The five predictors that were not identified as top predictors in the linear model are: ManufacturingProcess33, ManufacturingProcess28, ManufacturingProcess39, ManufacturingProcess04, and ManufacturingProcess23.
uniquePreds <- c("ManufacturingProcess33", "ManufacturingProcess28",
                 "ManufacturingProcess39", "ManufacturingProcess04",
                 "ManufacturingProcess23")
featurePlot(df1_test[, uniquePreds], df1_test$Yield)
ManufacturingProcess33 appears to have a positive linear relationship with Yield, while ManufacturingProcess04 appears to have a negative linear relationship. ManufacturingProcess28, ManufacturingProcess39, and ManufacturingProcess23 seem to have nonlinear relationships. None of the top predictors unique to the nonlinear model are biological, so these plots reveal intuition only about the process predictors and their relationship with yield.