Data 624 Assignment 8
7.2
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations uses a nonlinear equation (see Applied Predictive Modeling, Question 7.2, for the equation) to create data where the x values are random variables uniformly distributed on [0, 1]; five additional non-informative predictors are also created in the simulation. The package mlbench contains a function called mlbench.friedman1 that simulates these data:
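For reference, the generating equation (the Friedman #1 benchmark) is y = 10 sin(pi * x1 * x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e, with e ~ N(0, sd^2); the five extra predictors never enter the equation. A hand-rolled sketch of the same simulation, for illustration only (the function name friedman1_manual is made up; the assignment itself uses mlbench.friedman1 below):

friedman1_manual <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)     # 10 uniform predictors on [0, 1]
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +     # only x1..x5 are informative
       20 * (x[, 3] - 0.5)^2 +
       10 * x[, 4] + 5 * x[, 5] +
       rnorm(n, sd = sd)                    # Gaussian noise; x6..x10 are pure noise
  list(x = x, y = y)
}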
library(mlbench)
library(caret)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

K-Nearest Neighbors (KNN)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)
knnModel

k-Nearest Neighbors
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 3.466085 0.5121775 2.816838
7 3.349428 0.5452823 2.727410
9 3.264276 0.5785990 2.660026
11 3.214216 0.6024244 2.603767
13 3.196510 0.6176570 2.591935
15 3.184173 0.6305506 2.577482
17 3.183130 0.6425367 2.567787
19 3.198752 0.6483184 2.592683
21 3.188993 0.6611428 2.588787
23 3.200458 0.6638353 2.604529
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
     RMSE  Rsquared       MAE
3.2040595 0.6819919 2.5683461
Support Vector Machine (SVM)
SVMModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "svmLinear",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)
SVMPred <- predict(SVMModel, newdata = testData$x)
SVMModel

Support Vector Machines with Linear Kernel
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results:
RMSE Rsquared MAE
2.551519 0.7476728 2.01746
Tuning parameter 'C' was held constant at a value of 1
postResample(pred = SVMPred, obs = testData$y)
     RMSE  Rsquared       MAE
2.7633860 0.6973384 2.0970616
Random Forest (RF)
RFModel <- train(x = trainingData$x,
                 y = trainingData$y,
                 method = "ranger",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)
note: only 9 unique complexity parameters in default grid. Truncating the grid to 9.
RFPred <- predict(RFModel, newdata = testData$x)
RFModel

Random Forest
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
mtry splitrule RMSE Rsquared MAE
2 variance 2.902871 0.7853247 2.415221
2 extratrees 3.125389 0.8020307 2.581730
3 variance 2.717609 0.7837701 2.255983
3 extratrees 2.877289 0.8133828 2.375101
4 variance 2.636851 0.7754399 2.186274
4 extratrees 2.735603 0.8127877 2.252930
5 variance 2.596968 0.7660308 2.140881
5 extratrees 2.637149 0.8110524 2.166540
6 variance 2.597802 0.7551409 2.140643
6 extratrees 2.583236 0.8064049 2.118032
7 variance 2.600109 0.7455207 2.139053
7 extratrees 2.534877 0.8048926 2.075316
8 variance 2.609986 0.7391331 2.144419
8 extratrees 2.517114 0.8005132 2.059029
9 variance 2.638492 0.7291585 2.167845
9 extratrees 2.500823 0.7969444 2.045371
10 variance 2.657646 0.7207815 2.177573
10 extratrees 2.489710 0.7929158 2.028374
Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 10, splitrule = extratrees
and min.node.size = 5.
postResample(pred = RFPred, obs = testData$y)
     RMSE  Rsquared       MAE
2.4153170 0.8141425 1.9216027
Partial Least Squares (PLS)
plsModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "pls",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)
plsPred <- predict(plsModel, newdata = testData$x)
plsModel

Partial Least Squares
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 2.688547 0.7075938 2.141657
2 2.501126 0.7516482 1.975048
3 2.505093 0.7522105 1.980098
4 2.508100 0.7517820 1.982606
5 2.508667 0.7516772 1.983134
6 2.508736 0.7516648 1.983233
7 2.508768 0.7516589 1.983275
8 2.508767 0.7516593 1.983274
9 2.508767 0.7516593 1.983274
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 2.
postResample(pred = plsPred, obs = testData$y)
    RMSE Rsquared      MAE
2.685591 0.710292 2.052676
Multivariate Adaptive Regression Splines (MARS)
MARSModel <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "earth",
                   preProcess = c("center", "scale"),
                   tuneLength = 10)
MARSPred <- predict(MARSModel, newdata = testData$x)
MARSModel

Multivariate Adaptive Regression Spline
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
nprune RMSE Rsquared MAE
2 4.513855 0.2226566 3.710019
3 3.726900 0.4697515 3.010978
4 2.705967 0.7244549 2.175573
6 2.314145 0.7905331 1.823292
7 1.858053 0.8642626 1.459393
9 1.758915 0.8792330 1.364881
10 1.761866 0.8790926 1.359756
12 1.768688 0.8787412 1.367119
13 1.767428 0.8790870 1.368607
15 1.782611 0.8772060 1.384283
Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 9 and degree = 1.
postResample(pred = MARSPred, obs = testData$y)
     RMSE  Rsquared       MAE
1.7901760 0.8705315 1.3712537
The MARS model performed best in this evaluation, with the lowest test-set RMSE (1.79) and the highest R-squared (0.87) of the five models.
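As a cross-check, caret's resamples() can compare the bootstrap distributions of the five fits directly. A minimal sketch, assuming the model objects above are still in the workspace (strictly, a paired comparison would require fitting all models with identical resampling indices via a shared trainControl):

resamps <- resamples(list(KNN = knnModel, SVM = SVMModel, RF = RFModel,
                          PLS = plsModel, MARS = MARSModel))
summary(resamps)   # side-by-side RMSE / R-squared / MAE over the 25 bootstrap reps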
7.5
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Data Prep
library(AppliedPredictiveModeling)
library(tidymodels)  # recipes + rsample
data(ChemicalManufacturingProcess)
df <-
  ChemicalManufacturingProcess %>%
  recipe(Yield ~ .) %>%
  step_impute_knn(all_predictors()) %>%
  step_naomit(all_outcomes()) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_predictors(), threshold = 0.9) %>%
  prep() %>%
  bake(ChemicalManufacturingProcess)

MARS
dfadj <- df
set.seed(1)
split <- dfadj %>% initial_split(prop = 0.8)
trainingData <- training(split) %>% select(-Yield)
trainingAnswer <- training(split) %>% pull(Yield)
testData <- testing(split) %>% select(-Yield)
testAnswer <- testing(split) %>% pull(Yield)
MARSModel <- train(x = trainingData,
                   y = trainingAnswer,
                   method = "earth",
                   preProcess = c("center", "scale"),
                   tuneLength = 10)
MARSPred <- predict(MARSModel, newdata = testData)
MARSModel

Multivariate Adaptive Regression Spline
141 samples
46 predictor
Pre-processing: centered (46), scaled (46)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 141, 141, 141, 141, 141, 141, ...
Resampling results across tuning parameters:
nprune RMSE Rsquared MAE
2 1.445212 0.3909780 1.126735
3 1.327861 0.4912781 1.038761
5 1.291479 0.5275262 1.024341
7 1.327385 0.5214789 1.030030
8 1.360571 0.5113547 1.052845
10 1.405551 0.4990792 1.088933
12 1.421788 0.4998446 1.102440
13 1.452651 0.4910400 1.122829
15 1.469972 0.4900091 1.140023
17 1.513381 0.4711792 1.174286
Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 5 and degree = 1.
postResample(pred = MARSPred, obs = testAnswer)
     RMSE  Rsquared       MAE
1.2241712 0.5471941 0.9581943
SVM
SVMModel <- train(x = trainingData,
                  y = trainingAnswer,
                  method = "svmRadial",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)
SVMPred <- predict(SVMModel, newdata = testData)
SVMModel

Support Vector Machines with Radial Basis Function Kernel
141 samples
46 predictor
Pre-processing: centered (46), scaled (46)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 141, 141, 141, 141, 141, 141, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 1.457277 0.4880247 1.1799509
0.50 1.353425 0.5289183 1.0904100
1.00 1.291395 0.5556078 1.0321077
2.00 1.252407 0.5732383 0.9912171
4.00 1.241901 0.5766854 0.9730877
8.00 1.230610 0.5821925 0.9628612
16.00 1.229129 0.5830298 0.9613244
32.00 1.229129 0.5830298 0.9613244
64.00 1.229129 0.5830298 0.9613244
128.00 1.229129 0.5830298 0.9613244
Tuning parameter 'sigma' was held constant at a value of 0.01498411
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01498411 and C = 16.
postResample(pred = SVMPred, obs = testAnswer)
     RMSE  Rsquared       MAE
1.0909770 0.6386468 0.8978089
KNN
knnModel <- train(x = trainingData,
                  y = trainingAnswer,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)
knnPred <- predict(knnModel, newdata = testData)
knnModel

k-Nearest Neighbors
141 samples
46 predictor
Pre-processing: centered (46), scaled (46)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 141, 141, 141, 141, 141, 141, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 1.481696 0.3705621 1.189947
7 1.464873 0.3794961 1.180520
9 1.471312 0.3763153 1.185161
11 1.458591 0.3920425 1.179344
13 1.457298 0.3961140 1.186204
15 1.457740 0.3998465 1.180835
17 1.457198 0.4056303 1.179678
19 1.460008 0.4083058 1.177618
21 1.464070 0.4090481 1.175194
23 1.467587 0.4114118 1.175814
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.
postResample(pred = knnPred, obs = testAnswer)
    RMSE Rsquared      MAE
1.427461 0.426987 1.139815
Random Forest
rfModel <- train(x = trainingData,
                 y = trainingAnswer,
                 method = "rf",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)
rfPred <- predict(rfModel, newdata = testData)
varImp(MARSModel)

earth variable importance
Overall
ManufacturingProcess32 100.00
ManufacturingProcess13 55.44
ManufacturingProcess09 15.57
ManufacturingProcess15 0.00
postResample(pred = rfPred, obs = testAnswer)
     RMSE  Rsquared       MAE
1.1841280 0.6290190 0.8604337
Among the nonlinear regression methods, the radial SVM achieved the lowest test-set RMSE (1.091), with random forest close behind (RMSE 1.184 and the lowest MAE, 0.860); both clearly outperformed MARS and KNN. Variable importance from the MARS model showed that ManufacturingProcess32 was the strongest predictor, followed by ManufacturingProcess13.
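Since the importance table above comes from the MARS fit, it is worth checking whether the random forest ranks the predictors similarly. A minimal sketch, assuming rfModel from above is still in the workspace (for method = "rf", caret's varImp() reports node-impurity-based importance by default):

varImp(rfModel)                  # importance from the randomForest fit
plot(varImp(rfModel), top = 10)  # top 10 predictors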