Data 624 Assignment 8

7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations uses a nonlinear equation (see Applied Predictive Modeling, Exercise 7.2, for the equation) to create data where the x values are random variables uniformly distributed on [0, 1]; the simulation also creates five additional non-informative predictors. The package mlbench contains a function called mlbench.friedman1 that simulates these data:
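
For reference, the generating equation for the Friedman 1 benchmark is

y = 10 sin(pi * x1 * x2) + 20 * (x3 - 0.5)^2 + 10 * x4 + 5 * x5 + e,  where e ~ N(0, sigma^2)

and x6 through x10 are the non-informative predictors that never enter the equation.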

library(mlbench)  # mlbench.friedman1()
library(caret)    # featurePlot(), train(), postResample()

set.seed(200)

# Simulate 200 training observations (10 predictors, 5 of them informative)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)  # convert predictor matrix to a data frame

# Plot each predictor against the response
featurePlot(trainingData$x, trainingData$y)

# A large test set gives a precise estimate of test-set performance
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

K-Nearest Neighbors (KNN)

knnModel <-train(x = trainingData$x,
                 y = trainingData$y,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)
knnModel
k-Nearest Neighbors 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   5  3.466085  0.5121775  2.816838
   7  3.349428  0.5452823  2.727410
   9  3.264276  0.5785990  2.660026
  11  3.214216  0.6024244  2.603767
  13  3.196510  0.6176570  2.591935
  15  3.184173  0.6305506  2.577482
  17  3.183130  0.6425367  2.567787
  19  3.198752  0.6483184  2.592683
  21  3.188993  0.6611428  2.588787
  23  3.200458  0.6638353  2.604529

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.
knnPred<- predict(knnModel, newdata = testData$x)

postResample(pred = knnPred, obs = testData$y)
     RMSE  Rsquared       MAE 
3.2040595 0.6819919 2.5683461 

Support Vector Machine (SVM)

SVMModel <-train(x = trainingData$x,
                 y = trainingData$y,
                 method = "svmLinear",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)

SVMPred<- predict(SVMModel, newdata = testData$x)
SVMModel
Support Vector Machines with Linear Kernel 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results:

  RMSE      Rsquared   MAE    
  2.551519  0.7476728  2.01746

Tuning parameter 'C' was held constant at a value of 1
postResample(pred = SVMPred, obs = testData$y)
     RMSE  Rsquared       MAE 
2.7633860 0.6973384 2.0970616 
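
A linear kernel cannot model the sine and quadratic terms in the generating equation. A radial-kernel variant (the same method used for Exercise 7.5 below) would be a natural comparison; a minimal sketch, reusing the objects above:

# Sketch: radial basis function kernel as an alternative to svmLinear
SVMRadialModel <- train(x = trainingData$x,
                        y = trainingData$y,
                        method = "svmRadial",
                        preProcess = c("center", "scale"),
                        tuneLength = 10)
SVMRadialPred <- predict(SVMRadialModel, newdata = testData$x)
postResample(pred = SVMRadialPred, obs = testData$y)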

Random Forest (RF)

RFModel <-train(x = trainingData$x,
                 y = trainingData$y,
                 method = "ranger",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)
note: only 9 unique complexity parameters in default grid. Truncating the grid to 9 .
RFPred<- predict(RFModel, newdata = testData$x)
RFModel
Random Forest 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  mtry  splitrule   RMSE      Rsquared   MAE     
   2    variance    2.902871  0.7853247  2.415221
   2    extratrees  3.125389  0.8020307  2.581730
   3    variance    2.717609  0.7837701  2.255983
   3    extratrees  2.877289  0.8133828  2.375101
   4    variance    2.636851  0.7754399  2.186274
   4    extratrees  2.735603  0.8127877  2.252930
   5    variance    2.596968  0.7660308  2.140881
   5    extratrees  2.637149  0.8110524  2.166540
   6    variance    2.597802  0.7551409  2.140643
   6    extratrees  2.583236  0.8064049  2.118032
   7    variance    2.600109  0.7455207  2.139053
   7    extratrees  2.534877  0.8048926  2.075316
   8    variance    2.609986  0.7391331  2.144419
   8    extratrees  2.517114  0.8005132  2.059029
   9    variance    2.638492  0.7291585  2.167845
   9    extratrees  2.500823  0.7969444  2.045371
  10    variance    2.657646  0.7207815  2.177573
  10    extratrees  2.489710  0.7929158  2.028374

Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 10, splitrule = extratrees
 and min.node.size = 5.
postResample(pred = RFPred, obs = testData$y)
     RMSE  Rsquared       MAE 
2.4153170 0.8141425 1.9216027 

Partial Least Squares (PLS)

plsModel <-train(x = trainingData$x,
                 y = trainingData$y,
                 method = "pls",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)

plsPred<- predict(plsModel, newdata = testData$x)
plsModel
Partial Least Squares 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared   MAE     
  1      2.688547  0.7075938  2.141657
  2      2.501126  0.7516482  1.975048
  3      2.505093  0.7522105  1.980098
  4      2.508100  0.7517820  1.982606
  5      2.508667  0.7516772  1.983134
  6      2.508736  0.7516648  1.983233
  7      2.508768  0.7516589  1.983275
  8      2.508767  0.7516593  1.983274
  9      2.508767  0.7516593  1.983274

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 2.
postResample(pred = plsPred, obs = testData$y)
    RMSE Rsquared      MAE 
2.685591 0.710292 2.052676 

Multivariate Adaptive Regression Splines (MARS)

MARSModel <-train(x = trainingData$x,
                 y = trainingData$y,
                 method = "earth",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)

MARSPred<- predict(MARSModel, newdata = testData$x)
MARSModel
Multivariate Adaptive Regression Spline 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  nprune  RMSE      Rsquared   MAE     
   2      4.513855  0.2226566  3.710019
   3      3.726900  0.4697515  3.010978
   4      2.705967  0.7244549  2.175573
   6      2.314145  0.7905331  1.823292
   7      1.858053  0.8642626  1.459393
   9      1.758915  0.8792330  1.364881
  10      1.761866  0.8790926  1.359756
  12      1.768688  0.8787412  1.367119
  13      1.767428  0.8790870  1.368607
  15      1.782611  0.8772060  1.384283

Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 9 and degree = 1.
postResample(pred = MARSPred, obs = testData$y)
     RMSE  Rsquared       MAE 
1.7901760 0.8705315 1.3712537 

The MARS model performed best in this evaluation, with the lowest test-set RMSE (1.79) and the highest R-squared (0.87).
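
The test-set metrics for all five models can also be collected into a single table for a side-by-side comparison (a minimal sketch, reusing the prediction objects created above):

results <- rbind(
  KNN  = postResample(pred = knnPred,  obs = testData$y),
  SVM  = postResample(pred = SVMPred,  obs = testData$y),
  RF   = postResample(pred = RFPred,   obs = testData$y),
  PLS  = postResample(pred = plsPred,  obs = testData$y),
  MARS = postResample(pred = MARSPred, obs = testData$y)
)
results[order(results[, "RMSE"]), ]  # lowest test RMSE first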

7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Data Prep

library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess data
library(tidymodels)                 # recipes, rsample, dplyr

data(ChemicalManufacturingProcess)

# Repeat the Exercise 6.3 preparation: impute missing predictor values,
# then drop uninformative and redundant predictors
df <-
ChemicalManufacturingProcess %>%
  recipe(Yield ~ .) %>%
  step_impute_knn(all_predictors()) %>%              # KNN imputation of missing values
  step_naomit(all_outcomes()) %>%                    # drop rows with a missing Yield
  step_nzv(all_predictors()) %>%                     # remove near-zero-variance predictors
  step_corr(all_predictors(), threshold = 0.9) %>%   # remove highly correlated predictors
  prep() %>%
  bake(new_data = ChemicalManufacturingProcess)
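
A quick sanity check confirms the imputation and filtering (a minimal sketch; the exact number of surviving predictors depends on which columns the filters remove, 46 here per the model output below):

dim(ChemicalManufacturingProcess)  # 176 rows, 58 columns (Yield + 57 predictors)
dim(df)                            # same rows, fewer columns after nzv/corr filtering
sum(is.na(df))                     # 0 expected once imputation has run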

MARS

# 80/20 split of the prepared data into training and test sets
set.seed(1)
split <- df %>% initial_split(prop = 0.8)

trainingData   <- training(split) %>% select(-Yield)
trainingAnswer <- training(split) %>% pull(Yield)   # response as a plain vector

testData   <- testing(split) %>% select(-Yield)
testAnswer <- testing(split) %>% pull(Yield)



MARSModel <-train(x = trainingData,
                 y = trainingAnswer,
                 method = "earth",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)

MARSPred<- predict(MARSModel, newdata = testData)
MARSModel
Multivariate Adaptive Regression Spline 

141 samples
 46 predictor

Pre-processing: centered (46), scaled (46) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 141, 141, 141, 141, 141, 141, ... 
Resampling results across tuning parameters:

  nprune  RMSE      Rsquared   MAE     
   2      1.445212  0.3909780  1.126735
   3      1.327861  0.4912781  1.038761
   5      1.291479  0.5275262  1.024341
   7      1.327385  0.5214789  1.030030
   8      1.360571  0.5113547  1.052845
  10      1.405551  0.4990792  1.088933
  12      1.421788  0.4998446  1.102440
  13      1.452651  0.4910400  1.122829
  15      1.469972  0.4900091  1.140023
  17      1.513381  0.4711792  1.174286

Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 5 and degree = 1.
postResample(pred = MARSPred, obs = testAnswer)
     RMSE  Rsquared       MAE 
1.2241712 0.5471941 0.9581943 

SVM

SVMModel <-train(x = trainingData,
                 y = trainingAnswer,
                 method = "svmRadial",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)

SVMPred<- predict(SVMModel, newdata = testData)
SVMModel
Support Vector Machines with Radial Basis Function Kernel 

141 samples
 46 predictor

Pre-processing: centered (46), scaled (46) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 141, 141, 141, 141, 141, 141, ... 
Resampling results across tuning parameters:

  C       RMSE      Rsquared   MAE      
    0.25  1.457277  0.4880247  1.1799509
    0.50  1.353425  0.5289183  1.0904100
    1.00  1.291395  0.5556078  1.0321077
    2.00  1.252407  0.5732383  0.9912171
    4.00  1.241901  0.5766854  0.9730877
    8.00  1.230610  0.5821925  0.9628612
   16.00  1.229129  0.5830298  0.9613244
   32.00  1.229129  0.5830298  0.9613244
   64.00  1.229129  0.5830298  0.9613244
  128.00  1.229129  0.5830298  0.9613244

Tuning parameter 'sigma' was held constant at a value of 0.01498411
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01498411 and C = 16.
postResample(pred = SVMPred, obs = testAnswer)
     RMSE  Rsquared       MAE 
1.0909770 0.6386468 0.8978089 

KNN

knnModel <-train(x = trainingData,
                 y = trainingAnswer,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)

knnPred<- predict(knnModel, newdata = testData)
knnModel
k-Nearest Neighbors 

141 samples
 46 predictor

Pre-processing: centered (46), scaled (46) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 141, 141, 141, 141, 141, 141, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   5  1.481696  0.3705621  1.189947
   7  1.464873  0.3794961  1.180520
   9  1.471312  0.3763153  1.185161
  11  1.458591  0.3920425  1.179344
  13  1.457298  0.3961140  1.186204
  15  1.457740  0.3998465  1.180835
  17  1.457198  0.4056303  1.179678
  19  1.460008  0.4083058  1.177618
  21  1.464070  0.4090481  1.175194
  23  1.467587  0.4114118  1.175814

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.
postResample(pred = knnPred, obs = testAnswer)
    RMSE Rsquared      MAE 
1.427461 0.426987 1.139815 

Random Forest

rfModel <-train(x = trainingData,
                 y = trainingAnswer,
                 method = "rf",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)

rfPred<- predict(rfModel, newdata = testData)
varImp(MARSModel)  # variable importance from the MARS fit above
earth variable importance

                       Overall
ManufacturingProcess32  100.00
ManufacturingProcess13   55.44
ManufacturingProcess09   15.57
ManufacturingProcess15    0.00
postResample(pred = rfPred, obs = testAnswer)
     RMSE  Rsquared       MAE 
1.1841280 0.6290190 0.8604337 

Among the nonlinear regression methods tried here, the radial-kernel SVM had the lowest test-set RMSE (1.09) and the highest R-squared (0.64), with the random forest close behind (RMSE = 1.18) and achieving the lowest MAE (0.86). Variable importance from the MARS model showed that ManufacturingProcess32 was the strongest predictor, followed by ManufacturingProcess13.
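
For comparison with the MARS ranking above, the random forest's own importance scores can be inspected the same way (a minimal sketch; output not shown):

rfImp <- varImp(rfModel)  # caret's scaled importance for the rf fit
rfImp
plot(rfImp, top = 10)     # ten most important predictors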