DATA 624 Homework8

Running Code

Code

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE)

Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.

7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2) where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data

ANSWER:

Code

library(mlbench)
library(Seurat)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will five the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
#featurePlot(trainingData$x, trainingData$y)

## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'.  Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

Code

library(caret)
knnModel <- train(x = trainingData$x, 
                  y = trainingData$y,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)
knnModel

k-Nearest Neighbors 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   5  3.466085  0.5121775  2.816838
   7  3.349428  0.5452823  2.727410
   9  3.264276  0.5785990  2.660026
  11  3.214216  0.6024244  2.603767
  13  3.196510  0.6176570  2.591935
  15  3.184173  0.6305506  2.577482
  17  3.183130  0.6425367  2.567787
  19  3.198752  0.6483184  2.592683
  21  3.188993  0.6611428  2.588787
  23  3.200458  0.6638353  2.604529

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.

Code

knnPred <- predict(knnModel, newdata = testData$x)  
 ##The function 'postResample' can be used to get the test set  > ##perforamnce values  
 postResample(pred = knnPred, obs = testData$y)

     RMSE  Rsquared       MAE 
3.2040595 0.6819919 2.5683461

Which models appear to give the best performance?

ANSWER: Here we have trained several models below.

MARS Choose and optimal RMSE when nprune = 13 and degree is 2.

Code

MARS_param <- expand.grid(.degree = 1:2, .nprune = 2:15)
MARS_model <- train(x = trainingData$x, 
                  y = trainingData$y,
                  method = "earth",
                  tuneGrid = MARS_param,
                  preProcess = c("center", "scale"),
                  tuneLength = 10)

MARS_model

Multivariate Adaptive Regression Spline 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  degree  nprune  RMSE      Rsquared   MAE     
  1        2      4.383438  0.2405683  3.597961
  1        3      3.645469  0.4745962  2.930453
  1        4      2.727602  0.7035031  2.184240
  1        5      2.449243  0.7611230  1.939231
  1        6      2.331605  0.7835496  1.833420
  1        7      1.976830  0.8421599  1.562591
  1        8      1.870959  0.8585503  1.464551
  1        9      1.804342  0.8683110  1.410395
  1       10      1.787676  0.8711960  1.386944
  1       11      1.790700  0.8707740  1.393076
  1       12      1.821005  0.8670619  1.419893
  1       13      1.858688  0.8617344  1.445459
  1       14      1.862343  0.8623072  1.446050
  1       15      1.871033  0.8607099  1.457618
  2        2      4.383438  0.2405683  3.597961
  2        3      3.644919  0.4742570  2.929647
  2        4      2.730222  0.7028372  2.183075
  2        5      2.481291  0.7545789  1.965749
  2        6      2.338369  0.7827873  1.825542
  2        7      2.030065  0.8328250  1.602024
  2        8      1.890997  0.8551326  1.477422
  2        9      1.742626  0.8757904  1.371910
  2       10      1.608221  0.8943432  1.255416
  2       11      1.474325  0.9111463  1.157848
  2       12      1.437483  0.9157967  1.120977
  2       13      1.439395  0.9164721  1.128309
  2       14      1.428565  0.9184503  1.118634
  2       15      1.434093  0.9182413  1.121622

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 14 and degree = 2.

The RMSE for MARS is much lower than KNN model

Code

MARS_pred <- predict(MARS_model, newdata = testData$x)
postResample(pred = MARS_pred, obs = testData$y)

     RMSE  Rsquared       MAE 
1.2779993 0.9338365 1.0147070

Does MARS select the informative predictors (those named X1–X5)?

ANSWER: The variable importance are 4 variables below. Yes, the MARS model selects 4 variables as most important.

Code

varImp(MARS_model)

earth variable importance

   Overall
X1  100.00
X4   75.40
X2   49.00
X5   15.72
X3    0.00

Code

SVM_model <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "svmRadial",
                   preProcess = c("center", "scale"),
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
SVM_model

Support Vector Machines with Radial Basis Function Kernel 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
Resampling results across tuning parameters:

  C       RMSE      Rsquared   MAE     
    0.25  2.548095  0.7967225  2.001583
    0.50  2.290826  0.8118225  1.779407
    1.00  2.095918  0.8337552  1.634192
    2.00  1.989469  0.8462100  1.539721
    4.00  1.901332  0.8565135  1.510431
    8.00  1.879815  0.8588126  1.514502
   16.00  1.878866  0.8591568  1.518094
   32.00  1.878866  0.8591568  1.518094
   64.00  1.878866  0.8591568  1.518094
  128.00  1.878866  0.8591568  1.518094

Tuning parameter 'sigma' was held constant at a value of 0.06670077
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.06670077 and C = 16.

Code

SVM_predict <- predict(SVM_model, newdata = testData$x)
postResample(pred = SVM_predict, obs = testData$y)

     RMSE  Rsquared       MAE 
2.0829707 0.8242096 1.5826017

Code

varImp(SVM_model)

loess r-squared variable importance

     Overall
X4  100.0000
X1   95.5047
X2   89.6186
X5   45.2170
X3   29.9330
X9    6.3299
X10   5.5182
X8    3.2527
X6    0.8884
X7    0.0000

Code

#install.packages(c('neuralnet'),dependencies = T)
library(neuralnet)

nnet_model <- neuralnet(trainingData$y~.,trainingData,
                  hidden=c(7,3),
linear.output = FALSE)
plot(nnet_model,rep = "best")

Code

nnet <- predict(nnet_model, newdata = testData$x)
postResample(pred = nnet, obs = testData$y)

       RMSE    Rsquared         MAE 
14.27692927  0.01560489 13.38691083

ANSWER: The MARS model as the best Rsquared value of 0.9291489 and a lower RMSE , therefore we select this model.

Code

results <- data.frame(t(postResample(pred = knnPred, obs = testData$y))) %>% 
  mutate("Model" = "KNN")
results <- data.frame(t(postResample(pred = MARS_pred, obs = testData$y))) %>%
  mutate("Model"= "MARS") %>%
  bind_rows(results)
results <- data.frame(t(postResample(pred = SVM_predict, obs = testData$y))) %>%
  mutate("Model"= "SVM") %>%
  bind_rows(results)
results <- data.frame(t(postResample(pred = nnet, obs = testData$y))) %>%
  mutate("Model"= "Neural Network") %>%
  bind_rows(results)
results %>%
  dplyr::select(Model, RMSE, Rsquared, MAE) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()

Model	RMSE	Rsquared	MAE
MARS	1.277999	0.9338365	1.014707
SVM	2.082971	0.8242096	1.582602
KNN	3.204060	0.6819919	2.568346
Neural Network	14.276929	0.0156049	13.386911

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
(a) Which nonlinear regression model gives the optimal resampling and test set performance?

ANSWER: After trying several models, its seems that the PLS model has the highest Rquared value of 67.7% and the RMSE 0.619.

Code

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Make this reproducible
set.seed(42)
knn <- preProcess(ChemicalManufacturingProcess, "knnImpute")
df1 <- predict(knn, ChemicalManufacturingProcess)
df <- df1 %>%
  select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
pls_model <- train(
  Yield ~ ., data = train_df, method = "pls",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
pls_model

Partial Least Squares 

144 samples
 56 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 130, 129, 128, 129, 130, 129, ... 
Resampling results across tuning parameters:

  ncomp  RMSE       Rsquared   MAE      
   1     0.8824790  0.3779221  0.6711462
   2     1.1458456  0.4219806  0.7086431
   3     0.7363066  0.5244517  0.5688553
   4     0.8235294  0.5298005  0.5933120
   5     0.9670735  0.4846010  0.6371199
   6     0.9959036  0.4776684  0.6427478
   7     0.9119517  0.4986338  0.6200233
   8     0.9068621  0.5012144  0.6293371
   9     0.8517370  0.5220166  0.6163795
  10     0.8919356  0.5062912  0.6332243
  11     0.9173758  0.4934557  0.6463164
  12     0.9064125  0.4791526  0.6485663
  13     0.9255289  0.4542181  0.6620193
  14     1.0239913  0.4358371  0.6944056
  15     1.0754710  0.4365214  0.7077991
  16     1.1110579  0.4269065  0.7135684
  17     1.1492855  0.4210485  0.7222868
  18     1.1940639  0.4132534  0.7396357
  19     1.2271867  0.4079005  0.7494818
  20     1.2077102  0.4022859  0.7470327
  21     1.2082648  0.4026711  0.7452969
  22     1.2669285  0.3987044  0.7634170
  23     1.3663033  0.3970188  0.7957514
  24     1.4531634  0.3898475  0.8243034
  25     1.5624265  0.3820102  0.8612935

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 3.

Code

pls <- predict(pls_model, test_df)
results <- data.frame(t(postResample(pred = pls, obs = test_df$Yield))) %>%
  mutate("Model"= "PLS")
plot(pls_model)

Code

MARS_grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
MARS_model <- train(
  Yield ~ ., data = train_df, method = "earth",
  tuneGrid = MARS_grid,
  # If the following lines are uncommented, it throws an error
  #center = TRUE,
  #scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
MARS_model

Multivariate Adaptive Regression Spline 

144 samples
 56 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 130, 129, 130, 130, 130, 131, ... 
Resampling results across tuning parameters:

  degree  nprune  RMSE       Rsquared   MAE      
  1        2      0.7727673  0.4024317  0.6055172
  1        3      0.6617934  0.5462836  0.5301205
  1        4      0.6414403  0.5782650  0.5175873
  1        5      0.6612068  0.5491861  0.5191206
  1        6      0.6578997  0.5541443  0.5065749
  1        7      0.6431856  0.5731468  0.5094160
  1        8      0.6365886  0.5818343  0.5009426
  1        9      0.6489700  0.5690713  0.5198826
  1       10      0.6289426  0.6018481  0.5107398
  1       11      0.6254798  0.6109447  0.4938393
  1       12      0.8175335  0.5491371  0.5524263
  1       13      0.8485872  0.5202012  0.5742223
  1       14      0.9496041  0.4980894  0.6161224
  1       15      0.9447012  0.4963106  0.6150941
  2        2      0.7727673  0.4024317  0.6055172
  2        3      0.7017341  0.4897201  0.5560432
  2        4      0.6723104  0.5388118  0.5456391
  2        5      0.6720395  0.5643163  0.5260406
  2        6      0.6054599  0.6330120  0.4700175
  2        7      0.5879508  0.6609611  0.4649161
  2        8      0.5579780  0.6910772  0.4407506
  2        9      0.5605913  0.6918493  0.4401424
  2       10      0.5766080  0.6752904  0.4484094
  2       11      0.5743771  0.6754628  0.4479538
  2       12      0.5773643  0.6804448  0.4451957
  2       13      0.5743328  0.6899189  0.4354640
  2       14      0.6066695  0.6585134  0.4483969
  2       15      0.6195630  0.6490841  0.4512092

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 8 and degree = 2.

Code

MARS_pred <- predict(MARS_model, test_df)
results <- data.frame(t(postResample(pred = MARS_pred, obs = test_df$Yield))) %>%
  mutate("Model"= "MARS") %>% rbind(results)
plot(MARS_model)

Code

SVM_model <- train(
  Yield ~ ., data = train_df, method = "svmRadial",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl(method = "cv"),
  tuneLength = 25
)
SVM_model

Support Vector Machines with Radial Basis Function Kernel 

144 samples
 56 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 130, 130, 129, 129, 129, 130, ... 
Resampling results across tuning parameters:

  C           RMSE       Rsquared   MAE      
        0.25  0.7686126  0.4567865  0.6341644
        0.50  0.7187220  0.4883557  0.5932541
        1.00  0.6756247  0.5370899  0.5532656
        2.00  0.6502763  0.5676720  0.5259804
        4.00  0.6577992  0.5614980  0.5317055
        8.00  0.6316941  0.5959658  0.5141839
       16.00  0.6291552  0.5994754  0.5145133
       32.00  0.6291552  0.5994754  0.5145133
       64.00  0.6291552  0.5994754  0.5145133
      128.00  0.6291552  0.5994754  0.5145133
      256.00  0.6291552  0.5994754  0.5145133
      512.00  0.6291552  0.5994754  0.5145133
     1024.00  0.6291552  0.5994754  0.5145133
     2048.00  0.6291552  0.5994754  0.5145133
     4096.00  0.6291552  0.5994754  0.5145133
     8192.00  0.6291552  0.5994754  0.5145133
    16384.00  0.6291552  0.5994754  0.5145133
    32768.00  0.6291552  0.5994754  0.5145133
    65536.00  0.6291552  0.5994754  0.5145133
   131072.00  0.6291552  0.5994754  0.5145133
   262144.00  0.6291552  0.5994754  0.5145133
   524288.00  0.6291552  0.5994754  0.5145133
  1048576.00  0.6291552  0.5994754  0.5145133
  2097152.00  0.6291552  0.5994754  0.5145133
  4194304.00  0.6291552  0.5994754  0.5145133

Tuning parameter 'sigma' was held constant at a value of 0.0137077
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.0137077 and C = 16.

Code

SVM_pred <- predict(SVM_model, test_df)
results <- data.frame(t(postResample(pred = SVM_pred, obs = test_df$Yield))) %>%
  mutate("Model"= "SVM") %>% rbind(results)
plot(SVM_model)

Code

knn_model <- train(
  Yield ~ ., data = train_df, method = "knn",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
knn_model

k-Nearest Neighbors 

144 samples
 56 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 132, 130, 128, 129, 128, 129, ... 
Resampling results across tuning parameters:

  k   RMSE       Rsquared   MAE      
   5  0.6828734  0.5168641  0.5604420
   7  0.7347844  0.4695838  0.6141467
   9  0.7460879  0.4583270  0.6240751
  11  0.7476629  0.4585902  0.6358480
  13  0.7490971  0.4610305  0.6317689
  15  0.7591816  0.4463879  0.6369478
  17  0.7711858  0.4294319  0.6461468
  19  0.7778231  0.4224278  0.6543177
  21  0.7800437  0.4336780  0.6499864
  23  0.7816982  0.4390426  0.6504034
  25  0.7904133  0.4288995  0.6559515
  27  0.8021603  0.4151312  0.6675529
  29  0.8077564  0.4143896  0.6739736
  31  0.8159533  0.4104242  0.6797708
  33  0.8200913  0.4070435  0.6797057
  35  0.8228305  0.4099405  0.6821038
  37  0.8260244  0.4191607  0.6829727
  39  0.8319769  0.4057005  0.6858128
  41  0.8360941  0.4025312  0.6889323
  43  0.8354433  0.4036491  0.6876839
  45  0.8335381  0.4158074  0.6845551
  47  0.8379981  0.4122304  0.6914769
  49  0.8435241  0.4030341  0.6951803
  51  0.8439107  0.4086331  0.6962559
  53  0.8455989  0.4070050  0.6962722

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.

Code

knn_predict <- predict(knn_model, test_df)
results <- data.frame(t(postResample(pred = knn_predict, obs = test_df$Yield))) %>%
  mutate("Model"= "KNN") %>% rbind(results)

plot(knn_model)

Code

results1 <- data.frame(t(postResample(pred = knn_predict, obs = test_df$Yield))) %>% 
  mutate("Model" = "KNN")
results1 <- data.frame(t(postResample(pred = MARS_pred, obs = test_df$Yield))) %>%
  mutate("Model"= "MARS") %>%
  bind_rows(results1)
results1 <- data.frame(t(postResample(pred = SVM_pred, obs = test_df$Yield))) %>%
  mutate("Model"= "SVM") %>%
  bind_rows(results1)
results1 <- data.frame(t(postResample(pred = pls, obs = test_df$Yield))) %>%
  mutate("Model"= "PLS") %>%
  bind_rows(results1)

results1 %>%
  dplyr::select(Model, RMSE, Rsquared, MAE) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()

Model	RMSE	Rsquared	MAE
PLS	0.6192577	0.6771122	0.5059984
SVM	0.6794955	0.6245393	0.4935834
KNN	0.7149819	0.5894607	0.5047973
MARS	0.7199111	0.5920601	0.5651110

(b) Which predictors are most important in the optimal nonlinear regression model?

ANSWER: In the PLS model, the Manufacturing process variables (32, 9 , 36, 13,17) are the highest

Code

library(dplyr)


top <- varImp(pls_model)$importance %>%
  arrange(-Overall) %>%
  head(10) %>% kable() %>% kable_styling() 

top1 <- varImp(SVM_model)$importance %>%
  arrange(-Overall) %>%
  head(10) %>% kable() %>% kable_styling()

top

	Overall
ManufacturingProcess32	100.00000
ManufacturingProcess09	88.04157
ManufacturingProcess36	82.20485
ManufacturingProcess13	82.11199
ManufacturingProcess17	80.24860
ManufacturingProcess06	59.05610
ManufacturingProcess11	55.92794
BiologicalMaterial02	55.45684
BiologicalMaterial06	54.64083
BiologicalMaterial03	54.49979

Code

top1

	Overall
ManufacturingProcess32	100.00000
ManufacturingProcess13	93.81970
ManufacturingProcess09	89.92917
ManufacturingProcess17	88.19815
BiologicalMaterial06	82.60809
BiologicalMaterial03	79.43762
ManufacturingProcess36	73.84749
BiologicalMaterial12	72.36048
ManufacturingProcess06	69.00493
ManufacturingProcess11	62.33657

Do either the biological or process variables dominate the list?
ANSWER: For the PLS model with the higher RSQUARED, the manufacturing process variables dominate the list.
How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
ANSWER: Both PLS and SVM seem to have process variables at the top of the importance list. However, for the SVM model has few biological material variables (06 and 03) deemed important.
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model.

Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

ANSWER: We can view correlation plot as well as geomplot below. The geomplot confirms that manufacturing 17 has a negative correlation with yield. Also manufacturing 09 has a positive correlation with yield. The biological variables 06 and 03 how a positive correlation with yield.

Code

ggplot(train_df, aes(BiologicalMaterial12, Yield)) +
  geom_point()

Code

ggplot(train_df, aes(ManufacturingProcess17, Yield)) +
  geom_point()

Code

ggplot(train_df, aes(ManufacturingProcess09, Yield)) +
  geom_point()

Code

ggplot(train_df, aes(BiologicalMaterial06, Yield)) +
  geom_point()

Code

ggplot(train_df, aes(BiologicalMaterial03, Yield)) +
  geom_point()

Code

# Get top-10 feature names
importance <- varImp(SVM_model)$importance
importance$feature <- rownames(importance)
importance <- importance[order(importance$Overall, decreasing=TRUE), ]

# Get importance feature names
important_f <- rownames(head(importance, 10))

# Create correlelogram of imputed chemical data
correlation <- cor(train_df[, c(important_f, "Yield")])
# corrplot(correlation, method="circle")

ggcorrplot(correlation, type = "upper", outline.color = "white",
       ggtheme = theme_classic,
       #colors = c("#6D9EC1", "white", "#E46726"),
       lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)

Code

#            
# # Calculate the correlation matrix
# correlation_matrix <- cor(numeric_data, use="complete.obs")
# print(correlation_matrix)
# corrplot(correlation_matrix, method="circle")