Running Code

Code
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE)

Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.

  • 7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2) where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data

  • ANSWER:

    Code
    library(mlbench)
    library(Seurat)
    set.seed(200)
    trainingData <- mlbench.friedman1(200, sd = 1)
    ## We convert the 'x' data from a matrix to a data frame
    ## One reason is that this will five the columns names.
    trainingData$x <- data.frame(trainingData$x)
    ## Look at the data using
    #featurePlot(trainingData$x, trainingData$y)
    
    ## or other methods.
    ## This creates a list with a vector 'y' and a matrix
    ## of predictors 'x'.  Also simulate a large test set to
    ## estimate the true error rate with good precision:
    testData <- mlbench.friedman1(5000, sd = 1)
    testData$x <- data.frame(testData$x)
  • Tune several models on these data. For example:

Code
library(caret)
knnModel <- train(x = trainingData$x, 
                  y = trainingData$y,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)
knnModel
k-Nearest Neighbors 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   5  3.466085  0.5121775  2.816838
   7  3.349428  0.5452823  2.727410
   9  3.264276  0.5785990  2.660026
  11  3.214216  0.6024244  2.603767
  13  3.196510  0.6176570  2.591935
  15  3.184173  0.6305506  2.577482
  17  3.183130  0.6425367  2.567787
  19  3.198752  0.6483184  2.592683
  21  3.188993  0.6611428  2.588787
  23  3.200458  0.6638353  2.604529

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.
Code
knnPred <- predict(knnModel, newdata = testData$x)  
 ##The function 'postResample' can be used to get the test set  > ##perforamnce values  
 postResample(pred = knnPred, obs = testData$y)  
     RMSE  Rsquared       MAE 
3.2040595 0.6819919 2.5683461 

Which models appear to give the best performance?

ANSWER: Here we have trained several models below.

MARS Choose and optimal RMSE when nprune = 13 and degree is 2.

Code
MARS_param <- expand.grid(.degree = 1:2, .nprune = 2:15)
MARS_model <- train(x = trainingData$x, 
                  y = trainingData$y,
                  method = "earth",
                  tuneGrid = MARS_param,
                  preProcess = c("center", "scale"),
                  tuneLength = 10)

MARS_model
Multivariate Adaptive Regression Spline 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  degree  nprune  RMSE      Rsquared   MAE     
  1        2      4.383438  0.2405683  3.597961
  1        3      3.645469  0.4745962  2.930453
  1        4      2.727602  0.7035031  2.184240
  1        5      2.449243  0.7611230  1.939231
  1        6      2.331605  0.7835496  1.833420
  1        7      1.976830  0.8421599  1.562591
  1        8      1.870959  0.8585503  1.464551
  1        9      1.804342  0.8683110  1.410395
  1       10      1.787676  0.8711960  1.386944
  1       11      1.790700  0.8707740  1.393076
  1       12      1.821005  0.8670619  1.419893
  1       13      1.858688  0.8617344  1.445459
  1       14      1.862343  0.8623072  1.446050
  1       15      1.871033  0.8607099  1.457618
  2        2      4.383438  0.2405683  3.597961
  2        3      3.644919  0.4742570  2.929647
  2        4      2.730222  0.7028372  2.183075
  2        5      2.481291  0.7545789  1.965749
  2        6      2.338369  0.7827873  1.825542
  2        7      2.030065  0.8328250  1.602024
  2        8      1.890997  0.8551326  1.477422
  2        9      1.742626  0.8757904  1.371910
  2       10      1.608221  0.8943432  1.255416
  2       11      1.474325  0.9111463  1.157848
  2       12      1.437483  0.9157967  1.120977
  2       13      1.439395  0.9164721  1.128309
  2       14      1.428565  0.9184503  1.118634
  2       15      1.434093  0.9182413  1.121622

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 14 and degree = 2.

The RMSE for MARS is much lower than KNN model

Code
MARS_pred <- predict(MARS_model, newdata = testData$x)
postResample(pred = MARS_pred, obs = testData$y)
     RMSE  Rsquared       MAE 
1.2779993 0.9338365 1.0147070 

Does MARS select the informative predictors (those named X1–X5)?

ANSWER: The variable importance are 4 variables below. Yes, the MARS model selects 4 variables as most important.

Code
varImp(MARS_model)
earth variable importance

   Overall
X1  100.00
X4   75.40
X2   49.00
X5   15.72
X3    0.00
Code
SVM_model <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "svmRadial",
                   preProcess = c("center", "scale"),
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
SVM_model
Support Vector Machines with Radial Basis Function Kernel 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
Resampling results across tuning parameters:

  C       RMSE      Rsquared   MAE     
    0.25  2.548095  0.7967225  2.001583
    0.50  2.290826  0.8118225  1.779407
    1.00  2.095918  0.8337552  1.634192
    2.00  1.989469  0.8462100  1.539721
    4.00  1.901332  0.8565135  1.510431
    8.00  1.879815  0.8588126  1.514502
   16.00  1.878866  0.8591568  1.518094
   32.00  1.878866  0.8591568  1.518094
   64.00  1.878866  0.8591568  1.518094
  128.00  1.878866  0.8591568  1.518094

Tuning parameter 'sigma' was held constant at a value of 0.06670077
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.06670077 and C = 16.
Code
SVM_predict <- predict(SVM_model, newdata = testData$x)
postResample(pred = SVM_predict, obs = testData$y)
     RMSE  Rsquared       MAE 
2.0829707 0.8242096 1.5826017 
Code
varImp(SVM_model)
loess r-squared variable importance

     Overall
X4  100.0000
X1   95.5047
X2   89.6186
X5   45.2170
X3   29.9330
X9    6.3299
X10   5.5182
X8    3.2527
X6    0.8884
X7    0.0000
Code
#install.packages(c('neuralnet'),dependencies = T)
library(neuralnet)

nnet_model <- neuralnet(trainingData$y~.,trainingData,
                  hidden=c(7,3),
linear.output = FALSE)
plot(nnet_model,rep = "best")

Code
nnet <- predict(nnet_model, newdata = testData$x)
postResample(pred = nnet, obs = testData$y)
       RMSE    Rsquared         MAE 
14.27692927  0.01560489 13.38691083 

ANSWER: The MARS model as the best Rsquared value of 0.9291489 and a lower RMSE , therefore we select this model.

Code
results <- data.frame(t(postResample(pred = knnPred, obs = testData$y))) %>% 
  mutate("Model" = "KNN")
results <- data.frame(t(postResample(pred = MARS_pred, obs = testData$y))) %>%
  mutate("Model"= "MARS") %>%
  bind_rows(results)
results <- data.frame(t(postResample(pred = SVM_predict, obs = testData$y))) %>%
  mutate("Model"= "SVM") %>%
  bind_rows(results)
results <- data.frame(t(postResample(pred = nnet, obs = testData$y))) %>%
  mutate("Model"= "Neural Network") %>%
  bind_rows(results)
results %>%
  dplyr::select(Model, RMSE, Rsquared, MAE) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
Model RMSE Rsquared MAE
MARS 1.277999 0.9338365 1.014707
SVM 2.082971 0.8242096 1.582602
KNN 3.204060 0.6819919 2.568346
Neural Network 14.276929 0.0156049 13.386911
  • 7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

  • (a) Which nonlinear regression model gives the optimal resampling and test set performance?

  • ANSWER: After trying several models, its seems that the PLS model has the highest Rquared value of 67.7% and the RMSE 0.619.

    Code
    library(AppliedPredictiveModeling)
    data(ChemicalManufacturingProcess)
    # Make this reproducible
    set.seed(42)
    knn <- preProcess(ChemicalManufacturingProcess, "knnImpute")
    df1 <- predict(knn, ChemicalManufacturingProcess)
    df <- df1 %>%
      select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
    in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
    train_df <- df[in_train, ]
    test_df <- df[-in_train, ]
    pls_model <- train(
      Yield ~ ., data = train_df, method = "pls",
      center = TRUE,
      scale = TRUE,
      trControl = trainControl("cv", number = 10),
      tuneLength = 25
    )
    pls_model
    Partial Least Squares 
    
    144 samples
     56 predictor
    
    No pre-processing
    Resampling: Cross-Validated (10 fold) 
    Summary of sample sizes: 130, 129, 128, 129, 130, 129, ... 
    Resampling results across tuning parameters:
    
      ncomp  RMSE       Rsquared   MAE      
       1     0.8824790  0.3779221  0.6711462
       2     1.1458456  0.4219806  0.7086431
       3     0.7363066  0.5244517  0.5688553
       4     0.8235294  0.5298005  0.5933120
       5     0.9670735  0.4846010  0.6371199
       6     0.9959036  0.4776684  0.6427478
       7     0.9119517  0.4986338  0.6200233
       8     0.9068621  0.5012144  0.6293371
       9     0.8517370  0.5220166  0.6163795
      10     0.8919356  0.5062912  0.6332243
      11     0.9173758  0.4934557  0.6463164
      12     0.9064125  0.4791526  0.6485663
      13     0.9255289  0.4542181  0.6620193
      14     1.0239913  0.4358371  0.6944056
      15     1.0754710  0.4365214  0.7077991
      16     1.1110579  0.4269065  0.7135684
      17     1.1492855  0.4210485  0.7222868
      18     1.1940639  0.4132534  0.7396357
      19     1.2271867  0.4079005  0.7494818
      20     1.2077102  0.4022859  0.7470327
      21     1.2082648  0.4026711  0.7452969
      22     1.2669285  0.3987044  0.7634170
      23     1.3663033  0.3970188  0.7957514
      24     1.4531634  0.3898475  0.8243034
      25     1.5624265  0.3820102  0.8612935
    
    RMSE was used to select the optimal model using the smallest value.
    The final value used for the model was ncomp = 3.
    Code
    pls <- predict(pls_model, test_df)
    results <- data.frame(t(postResample(pred = pls, obs = test_df$Yield))) %>%
      mutate("Model"= "PLS")
    plot(pls_model)

    Code
    MARS_grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
    MARS_model <- train(
      Yield ~ ., data = train_df, method = "earth",
      tuneGrid = MARS_grid,
      # If the following lines are uncommented, it throws an error
      #center = TRUE,
      #scale = TRUE,
      trControl = trainControl("cv", number = 10),
      tuneLength = 25
    )
    MARS_model
    Multivariate Adaptive Regression Spline 
    
    144 samples
     56 predictor
    
    No pre-processing
    Resampling: Cross-Validated (10 fold) 
    Summary of sample sizes: 130, 129, 130, 130, 130, 131, ... 
    Resampling results across tuning parameters:
    
      degree  nprune  RMSE       Rsquared   MAE      
      1        2      0.7727673  0.4024317  0.6055172
      1        3      0.6617934  0.5462836  0.5301205
      1        4      0.6414403  0.5782650  0.5175873
      1        5      0.6612068  0.5491861  0.5191206
      1        6      0.6578997  0.5541443  0.5065749
      1        7      0.6431856  0.5731468  0.5094160
      1        8      0.6365886  0.5818343  0.5009426
      1        9      0.6489700  0.5690713  0.5198826
      1       10      0.6289426  0.6018481  0.5107398
      1       11      0.6254798  0.6109447  0.4938393
      1       12      0.8175335  0.5491371  0.5524263
      1       13      0.8485872  0.5202012  0.5742223
      1       14      0.9496041  0.4980894  0.6161224
      1       15      0.9447012  0.4963106  0.6150941
      2        2      0.7727673  0.4024317  0.6055172
      2        3      0.7017341  0.4897201  0.5560432
      2        4      0.6723104  0.5388118  0.5456391
      2        5      0.6720395  0.5643163  0.5260406
      2        6      0.6054599  0.6330120  0.4700175
      2        7      0.5879508  0.6609611  0.4649161
      2        8      0.5579780  0.6910772  0.4407506
      2        9      0.5605913  0.6918493  0.4401424
      2       10      0.5766080  0.6752904  0.4484094
      2       11      0.5743771  0.6754628  0.4479538
      2       12      0.5773643  0.6804448  0.4451957
      2       13      0.5743328  0.6899189  0.4354640
      2       14      0.6066695  0.6585134  0.4483969
      2       15      0.6195630  0.6490841  0.4512092
    
    RMSE was used to select the optimal model using the smallest value.
    The final values used for the model were nprune = 8 and degree = 2.
    Code
    MARS_pred <- predict(MARS_model, test_df)
    results <- data.frame(t(postResample(pred = MARS_pred, obs = test_df$Yield))) %>%
      mutate("Model"= "MARS") %>% rbind(results)
    plot(MARS_model)

    Code
    SVM_model <- train(
      Yield ~ ., data = train_df, method = "svmRadial",
      center = TRUE,
      scale = TRUE,
      trControl = trainControl(method = "cv"),
      tuneLength = 25
    )
    SVM_model
    Support Vector Machines with Radial Basis Function Kernel 
    
    144 samples
     56 predictor
    
    No pre-processing
    Resampling: Cross-Validated (10 fold) 
    Summary of sample sizes: 130, 130, 129, 129, 129, 130, ... 
    Resampling results across tuning parameters:
    
      C           RMSE       Rsquared   MAE      
            0.25  0.7686126  0.4567865  0.6341644
            0.50  0.7187220  0.4883557  0.5932541
            1.00  0.6756247  0.5370899  0.5532656
            2.00  0.6502763  0.5676720  0.5259804
            4.00  0.6577992  0.5614980  0.5317055
            8.00  0.6316941  0.5959658  0.5141839
           16.00  0.6291552  0.5994754  0.5145133
           32.00  0.6291552  0.5994754  0.5145133
           64.00  0.6291552  0.5994754  0.5145133
          128.00  0.6291552  0.5994754  0.5145133
          256.00  0.6291552  0.5994754  0.5145133
          512.00  0.6291552  0.5994754  0.5145133
         1024.00  0.6291552  0.5994754  0.5145133
         2048.00  0.6291552  0.5994754  0.5145133
         4096.00  0.6291552  0.5994754  0.5145133
         8192.00  0.6291552  0.5994754  0.5145133
        16384.00  0.6291552  0.5994754  0.5145133
        32768.00  0.6291552  0.5994754  0.5145133
        65536.00  0.6291552  0.5994754  0.5145133
       131072.00  0.6291552  0.5994754  0.5145133
       262144.00  0.6291552  0.5994754  0.5145133
       524288.00  0.6291552  0.5994754  0.5145133
      1048576.00  0.6291552  0.5994754  0.5145133
      2097152.00  0.6291552  0.5994754  0.5145133
      4194304.00  0.6291552  0.5994754  0.5145133
    
    Tuning parameter 'sigma' was held constant at a value of 0.0137077
    RMSE was used to select the optimal model using the smallest value.
    The final values used for the model were sigma = 0.0137077 and C = 16.
    Code
    SVM_pred <- predict(SVM_model, test_df)
    results <- data.frame(t(postResample(pred = SVM_pred, obs = test_df$Yield))) %>%
      mutate("Model"= "SVM") %>% rbind(results)
    plot(SVM_model)

    Code
    knn_model <- train(
      Yield ~ ., data = train_df, method = "knn",
      center = TRUE,
      scale = TRUE,
      trControl = trainControl("cv", number = 10),
      tuneLength = 25
    )
    knn_model
    k-Nearest Neighbors 
    
    144 samples
     56 predictor
    
    No pre-processing
    Resampling: Cross-Validated (10 fold) 
    Summary of sample sizes: 132, 130, 128, 129, 128, 129, ... 
    Resampling results across tuning parameters:
    
      k   RMSE       Rsquared   MAE      
       5  0.6828734  0.5168641  0.5604420
       7  0.7347844  0.4695838  0.6141467
       9  0.7460879  0.4583270  0.6240751
      11  0.7476629  0.4585902  0.6358480
      13  0.7490971  0.4610305  0.6317689
      15  0.7591816  0.4463879  0.6369478
      17  0.7711858  0.4294319  0.6461468
      19  0.7778231  0.4224278  0.6543177
      21  0.7800437  0.4336780  0.6499864
      23  0.7816982  0.4390426  0.6504034
      25  0.7904133  0.4288995  0.6559515
      27  0.8021603  0.4151312  0.6675529
      29  0.8077564  0.4143896  0.6739736
      31  0.8159533  0.4104242  0.6797708
      33  0.8200913  0.4070435  0.6797057
      35  0.8228305  0.4099405  0.6821038
      37  0.8260244  0.4191607  0.6829727
      39  0.8319769  0.4057005  0.6858128
      41  0.8360941  0.4025312  0.6889323
      43  0.8354433  0.4036491  0.6876839
      45  0.8335381  0.4158074  0.6845551
      47  0.8379981  0.4122304  0.6914769
      49  0.8435241  0.4030341  0.6951803
      51  0.8439107  0.4086331  0.6962559
      53  0.8455989  0.4070050  0.6962722
    
    RMSE was used to select the optimal model using the smallest value.
    The final value used for the model was k = 5.
    Code
    knn_predict <- predict(knn_model, test_df)
    results <- data.frame(t(postResample(pred = knn_predict, obs = test_df$Yield))) %>%
      mutate("Model"= "KNN") %>% rbind(results)
    
    plot(knn_model)

Code
results1 <- data.frame(t(postResample(pred = knn_predict, obs = test_df$Yield))) %>% 
  mutate("Model" = "KNN")
results1 <- data.frame(t(postResample(pred = MARS_pred, obs = test_df$Yield))) %>%
  mutate("Model"= "MARS") %>%
  bind_rows(results1)
results1 <- data.frame(t(postResample(pred = SVM_pred, obs = test_df$Yield))) %>%
  mutate("Model"= "SVM") %>%
  bind_rows(results1)
results1 <- data.frame(t(postResample(pred = pls, obs = test_df$Yield))) %>%
  mutate("Model"= "PLS") %>%
  bind_rows(results1)

results1 %>%
  dplyr::select(Model, RMSE, Rsquared, MAE) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()
Model RMSE Rsquared MAE
PLS 0.6192577 0.6771122 0.5059984
SVM 0.6794955 0.6245393 0.4935834
KNN 0.7149819 0.5894607 0.5047973
MARS 0.7199111 0.5920601 0.5651110
  • (b) Which predictors are most important in the optimal nonlinear regression model?

  • ANSWER: In the PLS model, the Manufacturing process variables (32, 9 , 36, 13,17) are the highest

    Code
    library(dplyr)
    
    
    top <- varImp(pls_model)$importance %>%
      arrange(-Overall) %>%
      head(10) %>% kable() %>% kable_styling() 
    
    top1 <- varImp(SVM_model)$importance %>%
      arrange(-Overall) %>%
      head(10) %>% kable() %>% kable_styling()
    
    top
    Overall
    ManufacturingProcess32 100.00000
    ManufacturingProcess09 88.04157
    ManufacturingProcess36 82.20485
    ManufacturingProcess13 82.11199
    ManufacturingProcess17 80.24860
    ManufacturingProcess06 59.05610
    ManufacturingProcess11 55.92794
    BiologicalMaterial02 55.45684
    BiologicalMaterial06 54.64083
    BiologicalMaterial03 54.49979
    Code
    top1
    Overall
    ManufacturingProcess32 100.00000
    ManufacturingProcess13 93.81970
    ManufacturingProcess09 89.92917
    ManufacturingProcess17 88.19815
    BiologicalMaterial06 82.60809
    BiologicalMaterial03 79.43762
    ManufacturingProcess36 73.84749
    BiologicalMaterial12 72.36048
    ManufacturingProcess06 69.00493
    ManufacturingProcess11 62.33657
  • Do either the biological or process variables dominate the list?

  • ANSWER: For the PLS model with the higher RSQUARED, the manufacturing process variables dominate the list.

  • How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

  • ANSWER: Both PLS and SVM seem to have process variables at the top of the importance list. However, for the SVM model has few biological material variables (06 and 03) deemed important.

  • (c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model.

  • Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

    • ANSWER: We can view correlation plot as well as geomplot below. The geomplot confirms that manufacturing 17 has a negative correlation with yield. Also manufacturing 09 has a positive correlation with yield. The biological variables 06 and 03 how a positive correlation with yield.
    Code
    ggplot(train_df, aes(BiologicalMaterial12, Yield)) +
      geom_point()

    Code
    ggplot(train_df, aes(ManufacturingProcess17, Yield)) +
      geom_point()

    Code
    ggplot(train_df, aes(ManufacturingProcess09, Yield)) +
      geom_point()

    Code
    ggplot(train_df, aes(BiologicalMaterial06, Yield)) +
      geom_point()

    Code
    ggplot(train_df, aes(BiologicalMaterial03, Yield)) +
      geom_point()

    Code
    # Get top-10 feature names
    importance <- varImp(SVM_model)$importance
    importance$feature <- rownames(importance)
    importance <- importance[order(importance$Overall, decreasing=TRUE), ]
    
    # Get importance feature names
    important_f <- rownames(head(importance, 10))
    
    # Create correlelogram of imputed chemical data
    correlation <- cor(train_df[, c(important_f, "Yield")])
    # corrplot(correlation, method="circle")
    
    ggcorrplot(correlation, type = "upper", outline.color = "white",
           ggtheme = theme_classic,
           #colors = c("#6D9EC1", "white", "#E46726"),
           lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)

    Code
    #            
    # # Calculate the correlation matrix
    # correlation_matrix <- cor(numeric_data, use="complete.obs")
    # print(correlation_matrix)
    # corrplot(correlation_matrix, method="circle")