Exercises from Kuhn and Johnson, Applied Predictive Modeling, Chapter 7: Nonlinear Regression Models. All R code is displayed at the end for clarity.
Friedman (1991) introduced several benchmark data sets created by simulation. One of these used the following nonlinear equation to create data:
\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)\]
where the \(x\) values are random variables uniformly distributed on \([0,1]\) (the simulation also creates five non-informative predictors). The package mlbench contains a function mlbench.friedman1 that simulates these data. Tune several models on these data. Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?
We replicate the code provided by the authors to generate the simulated Friedman training and test datasets. The training set has 200 observations, while the test set has 5,000 observations so that the true error rate can be estimated with high precision.
set.seed(200)
trainingData = mlbench.friedman1( 200, sd = 1 )
## We convert the 'x' data from a matrix to a dataframe.
## One reason is that this will give the columns names
trainingData$x = data.frame( trainingData$x )
The featurePlot below shows the scatterplots between each predictor and the response variable.
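The featurePlot call and the large test set used later for the test-set metrics are repeated here from the code appendix:
## Look at the data using featurePlot (one panel per predictor)
featurePlot( trainingData$x, trainingData$y )
## Simulate a large test set to estimate the true error rate with good precision:
testData = mlbench.friedman1( 5000, sd = 1 )
testData$x = data.frame( testData$x )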
Our solution tunes a kNN model, a MARS model and an SVM model on the above training dataset. We conclude that the MARS model has the best test performance and includes only the informative variables.
We replicate the authors' kNN tuning process to confirm that our results match the stated numbers. We find that they are close but not identical, which suggests the underlying algorithms have changed even though we use the same random seed and code invocation.
options(digits = 3)
set.seed(1020)
knnModel = train( x= trainingData$x, y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.49 0.500 2.82
## 7 3.34 0.546 2.71
## 9 3.28 0.572 2.66
## 11 3.22 0.600 2.62
## 13 3.21 0.614 2.60
## 15 3.19 0.634 2.58
## 17 3.19 0.644 2.58
## 19 3.21 0.642 2.61
## 21 3.23 0.647 2.63
## 23 3.23 0.656 2.64
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
The kNN test set performance below shows an \(R^2\) of 68.2%.
## RMSE Rsquared MAE
## 3.204 0.682 2.568
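These figures are obtained by predicting on the 5,000-observation test set, repeated here from the code appendix:
knnPred <- predict( knnModel, newdata = testData$x )
## postResample returns the RMSE, R-squared and MAE on the test set
(knn_test_performance = postResample( pred = knnPred, obs = testData$y ) )
Next we tune a MARS model over degrees 1 and 2 and up to 20 retained terms.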
marsGrid = expand.grid( .degree = 1:2, .nprune = 2:20)
repeats = 4
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats, selectionFunction = "best", verboseIter = FALSE)
set.seed(1000)
marsTuned = caret::train(x = trainingData$x,
y = trainingData$y ,
method = "earth" ,
metric = "RMSE" ,
tuneGrid = marsGrid ,
preProc = c("center", "scale") ,
trControl = ctrl )
The model hinge functions and coefficients are shown below. The best model has degree 2 with 16 retained terms (nprune = 16) and a training \(R^2\) of 95.6%.
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
## nprune=16)
##
## coefficients
## (Intercept) 22.04
## h(0.507267-X1) -4.26
## h(X1-0.507267) 2.66
## h(0.325504-X2) -5.26
## h(-0.216741-X3) 2.91
## h(X3- -0.216741) 1.66
## h(0.953812-X4) -2.81
## h(X4-0.953812) 2.83
## h(1.17878-X5) -1.50
## h(-0.951872-X1) * h(X2-0.325504) -3.62
## h(X1- -0.951872) * h(X2-0.325504) -1.51
## h(X1-0.507267) * h(X2- -0.798188) -2.20
## h(0.606835-X1) * h(0.325504-X2) 1.98
## h(0.325504-X2) * h(X3-0.795427) 2.14
## h(X2-0.325504) * h(X3- -0.917499) 1.34
## h(X2-0.325504) * h(-0.917499-X3) 4.48
##
## Selected 16 of 21 terms, and 5 of 10 predictors (nprune=16)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 8 7
## GCV 1.64 RSS 214 GRSq 0.933 RSq 0.956
## RMSE Rsquared MAE
## 1.279 0.934 1.009
The test performance shows the model is very accurate, with an \(R^2\) of 93.4% and an RMSE of 1.28.
The variable importance plot ranks X1, X4, X2, X5 and X3 as the top variables. The MARS model therefore selects only the informative predictors \(X1, X2, X3, X4, X5\) and excludes the noise predictors.
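The MARS test metrics and importance plot are produced by the chunk below, repeated from the code appendix:
marsPred = predict( marsTuned, newdata = testData$x )
(mars_test_performance = postResample( pred = marsPred, obs = testData$y ) )
## scale = FALSE keeps the raw (unnormalized) importance scores
plot( varImp(marsTuned, scale = FALSE ) )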
We fit a support vector machine with a radial basis function kernel on the training data. After centering and scaling the predictors, we estimate performance with 10-fold cross-validation and use the oneSE rule to choose the model based on RMSE.
svmGrid = expand.grid( C = c( 1.0, 5, 10, 15, 30, 100 ), sigma = c( 0.005, 0.01, 0.015, 0.02, 0.04, 0.10 ))
repeats = 4
ctrl <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")
set.seed(1000)
svmTuned = caret::train(x = trainingData$x ,
y = trainingData$y,
method = "svmRadial" ,
metric = "RMSE" ,
tuneGrid = svmGrid,
preProc = c("center", "scale") ,
trControl = ctrl )
svmTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C sigma RMSE Rsquared MAE
## 1 0.005 2.63 0.767 2.14
## 1 0.010 2.43 0.778 1.96
## 1 0.015 2.35 0.787 1.88
## 1 0.020 2.32 0.793 1.85
## 1 0.040 2.17 0.818 1.70
## 1 0.100 2.17 0.829 1.70
## 5 0.005 2.38 0.779 1.91
## 5 0.010 2.26 0.800 1.78
## 5 0.015 2.11 0.824 1.65
## 5 0.020 1.99 0.844 1.55
## 5 0.040 1.88 0.864 1.45
## 5 0.100 2.05 0.845 1.62
## 10 0.005 2.35 0.785 1.87
## 10 0.010 2.12 0.821 1.66
## 10 0.015 1.94 0.852 1.51
## 10 0.020 1.88 0.862 1.47
## 10 0.040 1.89 0.865 1.46
## 10 0.100 2.05 0.845 1.62
## 15 0.005 2.30 0.793 1.83
## 15 0.010 2.02 0.839 1.57
## 15 0.015 1.89 0.861 1.47
## 15 0.020 1.86 0.866 1.45
## 15 0.040 1.93 0.863 1.50
## 15 0.100 2.05 0.845 1.62
## 30 0.005 2.17 0.813 1.69
## 30 0.010 1.89 0.862 1.47
## 30 0.015 1.86 0.868 1.45
## 30 0.020 1.83 0.873 1.43
## 30 0.040 1.91 0.867 1.51
## 30 0.100 2.05 0.845 1.62
## 100 0.005 1.90 0.860 1.47
## 100 0.010 1.84 0.872 1.45
## 100 0.015 1.88 0.868 1.45
## 100 0.020 1.99 0.857 1.54
## 100 0.040 1.90 0.868 1.51
## 100 0.100 2.05 0.845 1.62
##
## RMSE was used to select the optimal model using the one SE rule.
## The final values used for the model were sigma = 0.04 and C = 5.
## RMSE Rsquared MAE
## 2.031 0.833 1.529
The SVM model produces a respectable test performance, with an \(R^2\) of 83.3% and an RMSE of 2.03. However, its variable importance plot includes a number of spurious predictors: X9, X10, X8, X6 and X7.
Therefore, while the predictive accuracy is decent, the variable importance is misleading: the last five predictors in the plot are all independent of the response.
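As with the other models, the SVM test metrics and importance plot come from the chunk below, repeated from the code appendix:
svmPred = predict( svmTuned, newdata = testData$x )
(svm_test_performance = postResample( pred = svmPred, obs = testData$y ) )
plot( varImp(svmTuned, scale = FALSE ) )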
The table below summarizes the test performance of the three models. MARS beats both kNN and SVM on RMSE and \(R^2\), so we conclude that MARS has the best test performance. Because MARS performs variable selection, it retains only the informative predictors, whereas SVM and kNN also make use of uninformative predictors.
Test Performance for all three models
| models | RMSE | RSquared |
|---|---|---|
| kNN | 3.20 | 0.682 |
| MARS | 1.28 | 0.934 |
| SVM - Radial Basis Function | 2.03 | 0.833 |
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Which nonlinear regression model gives the optimal resampling and test set performance?
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
We will compare the performance of a MARS and an SVM model in this exercise. After pre-processing the data, we run the training process to obtain the optimal parameters for the MARS and SVM models. Then we produce the test set performance figures, which show that MARS slightly outperforms SVM.
To re-use the same data splitting, data imputation and pre-processing steps as before in Exercise 6.3, we replicate and run the code below and briefly recall the steps taken to obtain the prepared datasets. However, we omit the detailed analysis and diagnostic plots used to justify those steps.
- Rename the lengthy variable names: predictors beginning with BiologicalMaterial are shortened to B, and predictors beginning with ManufacturingProcess are shortened to M.
- Split the data into a 75/25 train/test split using stratified random sampling, with the response Yield cut into 5 quantile groups.
- Drop near-zero-variance predictors: B07 is dropped from both the training and test sets.
- Remove outlier zero values for predictors whose other values are all positive, replacing the zeros with the population median.
- Center and scale the predictors.
- Use kNN imputation to fill in missing values in the data set.
data("ChemicalManufacturingProcess")
chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]
#chemX %>% rename_at(vars(starts_with("ManufacturingProcess") ),
# funs(str_replace(., "ManufacturingProcess", "M"))) %>%
# rename_at(vars(starts_with("Biological")),
# funs( str_replace( ., "BiologicalMaterial", "B"))) -> chemX2
chemX %>%
rename_with(~ str_replace(., "ManufacturingProcess", "M"), starts_with("Manufacturing") ) %>%
rename_with(~ str_replace(., "BiologicalMaterial", "B"), starts_with("Biological") ) -> chemX3set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )
chemX_train_bv %>% mutate(
across(c(M18, M20, M16, M26, M25, M27, M31,
M29, M43, M30, M01, M42, M44, M45,
M39, M38 ) , # Run a function across the columns listed in across()
~if_else( condition = ( .x == 0 ) , # anonymous function applied to each column in across() .x refers to current column argument
true = median( . ) , # we use median without removal of NA
false = as.numeric(.) ) ) # keep the existing value if it is nonzero
) -> chemX_train_nz
# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))
# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc, chemX_train_nz )
# Becomes the final version of test data for prediction
chemX_test_pp = predict( preProcFunc, chemX_test_nz )
Now we fit a MARS non-linear regression model to the chemical processing data. We examine MARS models of degrees 1 and 2, since higher degrees are generally not advisable, and allow up to 25 retained model terms. We use repeated 10-fold cross-validation to generate the statistics below, selecting the tuning parameters with the smallest RMSE over the grid. A comparison of the RMSE and \(R^2\) statistics shows consistent results for the optimal tuning parameters.
marsGrid = expand.grid( .degree = 1:2, .nprune = 2:25 )
repeats = 4
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats, selectionFunction = "best", verboseIter = FALSE)
set.seed(1000)
marsTuned = caret::train(x = chemX_train_pp,
y = chemY_train,
method = "earth" ,
metric = "RMSE" ,
tuneGrid = marsGrid ,
trControl = ctrl )
## Multivariate Adaptive Regression Spline
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 4 times)
## Summary of sample sizes: 118, 118, 119, 119, 120, 118, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.42 0.447 1.12
## 1 3 1.22 0.580 1.01
## 1 4 1.23 0.573 1.01
## 1 5 1.27 0.546 1.03
## 1 6 1.37 0.521 1.09
## 1 7 1.38 0.521 1.09
## 1 8 1.45 0.502 1.13
## 1 9 1.46 0.490 1.14
## 1 10 1.48 0.482 1.15
## 1 11 1.51 0.479 1.17
## 1 12 1.53 0.477 1.17
## 1 13 1.55 0.461 1.20
## 1 14 1.56 0.458 1.20
## 1 15 1.57 0.456 1.20
## 1 16 1.59 0.447 1.22
## 1 17 1.59 0.449 1.22
## 1 18 1.60 0.447 1.22
## 1 19 1.60 0.447 1.22
## 1 20 1.60 0.448 1.22
## 1 21 1.62 0.442 1.23
## 1 22 1.62 0.442 1.23
## 1 23 1.62 0.442 1.23
## 1 24 1.62 0.442 1.23
## 1 25 1.62 0.442 1.23
## 2 2 1.42 0.447 1.12
## 2 3 1.29 0.537 1.05
## 2 4 1.26 0.553 1.03
## 2 5 1.27 0.579 1.02
## 2 6 1.31 0.563 1.05
## 2 7 1.31 0.581 1.04
## 2 8 1.34 0.562 1.06
## 2 9 1.37 0.545 1.07
## 2 10 1.48 0.505 1.14
## 2 11 1.49 0.503 1.15
## 2 12 1.56 0.494 1.18
## 2 13 1.60 0.493 1.18
## 2 14 1.59 0.491 1.19
## 2 15 1.66 0.483 1.22
## 2 16 1.63 0.489 1.20
## 2 17 1.65 0.485 1.21
## 2 18 1.65 0.481 1.22
## 2 19 1.66 0.475 1.22
## 2 20 1.67 0.475 1.22
## 2 21 1.70 0.466 1.24
## 2 22 1.73 0.459 1.26
## 2 23 1.73 0.459 1.26
## 2 24 1.73 0.463 1.27
## 2 25 1.73 0.465 1.26
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.
## Selected 3 of 21 terms, and 2 of 56 predictors (nprune=3)
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: M32, M09, B01-unused, B02-unused, B03-unused, B04-unused, ...
## Number of terms at each degree of interaction: 1 2 (additive model)
## GCV 1.58 RSS 193 GRSq 0.534 RSq 0.562
We conclude the best MARS model has degree 1 and 3 terms based on the cross validation results.
pred_mars_chemY = predict(marsTuned, newdata = chemX_test_pp)
testRMSE_mars_chem = caret::RMSE(pred_mars_chemY, chemY_test )
testR2_mars_chem = caret::R2( pred_mars_chemY, chemY_test)
Now we consider the support vector machine non-linear model. We use the radial basis function kernel and tune a grid over the costs \(C = 1, 5, 10, 15, 30, 100\) and kernel widths \(\sigma = 0.005, 0.01, 0.015, 0.02, 0.04, 0.10\). We use 10-fold cross-validation with the oneSE selection rule to generate the statistics below.
svmGrid = expand.grid( C = c( 1.0, 5, 10, 15, 30, 100 ), sigma = c( 0.005, 0.01, 0.015, 0.02, 0.04, 0.10 ))
repeats = 9
ctrl <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")
set.seed(1000)
svmTuned = caret::train(x = chemX_train_pp,
y = chemY_train,
method = "svmRadial" ,
metric = "RMSE" ,
tuneGrid = svmGrid,
trControl = ctrl )
svmTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 132 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 118, 118, 119, 119, 120, 118, ...
## Resampling results across tuning parameters:
##
## C sigma RMSE Rsquared MAE
## 1 0.005 1.23 0.578 1.005
## 1 0.010 1.19 0.611 0.970
## 1 0.015 1.20 0.618 0.979
## 1 0.020 1.24 0.609 1.007
## 1 0.040 1.40 0.517 1.133
## 1 0.100 1.71 0.295 1.407
## 5 0.005 1.07 0.652 0.880
## 5 0.010 1.04 0.684 0.851
## 5 0.015 1.07 0.687 0.870
## 5 0.020 1.13 0.671 0.916
## 5 0.040 1.34 0.574 1.092
## 5 0.100 1.70 0.321 1.398
## 10 0.005 1.04 0.672 0.849
## 10 0.010 1.03 0.697 0.835
## 10 0.015 1.07 0.687 0.870
## 10 0.020 1.13 0.671 0.916
## 10 0.040 1.34 0.574 1.092
## 10 0.100 1.70 0.321 1.398
## 15 0.005 1.04 0.677 0.850
## 15 0.010 1.03 0.697 0.835
## 15 0.015 1.07 0.687 0.870
## 15 0.020 1.13 0.671 0.916
## 15 0.040 1.34 0.574 1.092
## 15 0.100 1.70 0.321 1.398
## 30 0.005 1.02 0.696 0.837
## 30 0.010 1.03 0.697 0.835
## 30 0.015 1.07 0.687 0.870
## 30 0.020 1.13 0.671 0.916
## 30 0.040 1.34 0.574 1.092
## 30 0.100 1.70 0.321 1.398
## 100 0.005 1.02 0.696 0.837
## 100 0.010 1.03 0.697 0.835
## 100 0.015 1.07 0.687 0.870
## 100 0.020 1.13 0.671 0.916
## 100 0.040 1.34 0.574 1.092
## 100 0.100 1.70 0.321 1.398
##
## RMSE was used to select the optimal model using the one SE rule.
## The final values used for the model were sigma = 0.015 and C = 5.
Under the one-standard-error rule on RMSE, the tuning selects a radial basis SVM with cost \(C = 5\) and \(\sigma = 0.015\): a less complex model whose cross-validated RMSE is within one standard error of the best value.
pred_svm_chemY = predict(svmTuned, newdata = chemX_test_pp)
(testRMSE_svm_chem = caret::RMSE(pred_svm_chemY, chemY_test ) )
## [1] 1.24
(testR2_svm_chem = caret::R2( pred_svm_chemY, chemY_test) )
## [1] 0.571
Test Performance for both models
| models | RMSE | RSquared |
|---|---|---|
| MARS | 1.20 | 0.600 |
| SVM - Radial Basis Function | 1.24 | 0.571 |
The test results show that MARS slightly outperforms SVM on both RMSE and \(R^2\).
Next we examine the variable importance of the MARS model, which is the optimal non-linear model considered in the previous section. For completeness, we also report the variable importance for the SVM model.
The MARS model excludes all but two predictors (M32 and M09), while the SVM model retains all variables, the most important being M32, M13 and B06.
The manufacturing (process) predictors dominate in both the MARS and SVM models, although the SVM's third predictor B06 does not appear as a top predictor in either the MARS model or last week's PLS model. For completeness, we examine the MARS model coefficients below.
| term | coefficient |
|---|---|
| (Intercept) | 37.62 |
| h(M32- -1.16635) | 1.24 |
| h(M09- -0.948781) | 0.97 |
We interpret the MARS hinge functions to mean:
M32 is the dominant predictor and is positively associated with Yield. For standardized values of M32 above -1.166, the yield increases by 1.24% for each one-standard-deviation increase in M32. The MARS model includes no corresponding term for M32 below -1.166, so the effect of M32 is nonlinear.
M09 is a significant predictor, also positively associated with Yield. For standardized values of M09 above -0.949, the yield increases by 0.97% for each one-standard-deviation increase in M09.
The hinge function effects are non-linear and clearly show the importance of these two predictors over the others.
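Writing out the fitted model makes this explicit. Using the rounded coefficients from the table above and the MARS hinge function \(h(u) = \max(0, u)\), the fitted model is
\[\hat{Y} = 37.62 + 1.24\,h(M32 + 1.166) + 0.97\,h(M09 + 0.949)\]
where M32 and M09 are centered and scaled. Each hinge is zero below its knot and linear above it, so each predictor contributes to the predicted yield only beyond its threshold.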
repeats = 9
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats, selectionFunction = "oneSE")
set.seed(102030)
plsTune <- train(x = chemX_train_pp, y = chemY_train,
method = "pls",
tuneGrid = expand.grid(ncomp = 1:20),
trControl = ctrl ,
preProcess = c("center", "scale") ,
metric = "Rsquared"
)
pred_pls_chemY = predict(plsTune, newdata = chemX_test_pp)
plsTune
## Partial Least Squares
##
## 132 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold, repeated 9 times)
## Summary of sample sizes: 119, 118, 120, 120, 119, 120, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.35 0.473 1.114
## 2 1.26 0.527 1.020
## 3 1.20 0.566 0.988
## 4 1.23 0.562 0.993
## 5 1.23 0.560 1.007
## 6 1.24 0.562 1.019
## 7 1.24 0.567 1.023
## 8 1.24 0.568 1.027
## 9 1.28 0.556 1.055
## 10 1.30 0.549 1.080
## 11 1.33 0.539 1.093
## 12 1.36 0.526 1.114
## 13 1.38 0.523 1.125
## 14 1.40 0.520 1.132
## 15 1.41 0.515 1.139
## 16 1.42 0.509 1.143
## 17 1.43 0.506 1.150
## 18 1.44 0.507 1.150
## 19 1.44 0.508 1.148
## 20 1.45 0.504 1.152
##
## Rsquared was used to select the optimal model using the one SE rule.
## The final value used for the model was ncomp = 3.
The PLS coefficient plot shows that M32 has a large positive coefficient in each of the top 3 latent variables, visible as a spike in all three curves below.
In the table below, we show the PLS coefficients of the 3 latent variables for the 4 most influential predictors in the PLS variable importance plot. The coefficients indicate whether the latent variable movements coincide with the impact of each of the predictors M32, M09, M13 and M36. We see that M32 has positive coefficients on all three top latent variables, while M13 has negative coefficients on all three.
Latent Variable Coefficients
| LV1 | LV2 | LV3 | names |
|---|---|---|---|
| 0.102 | 0.177 | 0.198 | M09 |
| -0.109 | -0.193 | -0.227 | M13 |
| 0.129 | 0.209 | 0.346 | M32 |
| -0.108 | -0.156 | -0.238 | M36 |
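The latent variable coefficients in the table are extracted directly from the fitted pls object, as in the code appendix:
# drop() collapses the [predictor, response, component] coefficient array to a matrix
a = drop(plsTune$finalModel$coefficients)
df_pls_coefs = as_tibble( a )
colnames(df_pls_coefs) = c("LV1", "LV2", "LV3")
df_pls_coefs$names = rownames(a)
df_pls_coefs %>% filter( names %in% c( 'M32', 'M09', 'M13', 'M36' ) )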
While the mathematical structures of the two models differ, PLS and MARS assign a similar direction and size of impact to the shared top predictors M32 and M09.
In the MARS final model, we display below the marginal effects plots of the top predictors against the fitted values of the yield. In the plots of M32 and M09, the fitted smoother shows a hinge effect, bending upward at roughly the knot points of the hinge functions, consistent with the model specification.
In the corresponding marginal plot for the PLS final model, the predictors show a more linear effect on the predicted response in the training data.
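Both marginal plots are drawn with caret's featurePlot, plotting the pre-processed training predictors against each model's fitted values. The MARS version from the code appendix is shown below; the PLS version substitutes predict(plsTune, newdata = chemX_train_pp) for y:
vars_to_plot = c("M32", "M09")
caret::featurePlot(x = chemX_train_pp[, vars_to_plot],
                   y = marsTuned$finalModel$fitted.values, plot = "scatter",
                   type = c("p", "smooth"),  # scatter points plus a loess smoother
                   span = 0.5,
                   layout = c(2, 1),
                   main = "Marginal Effects Plot for MARS - Training Set")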
The presence of hinge effects or non-linearities in the yield is to be expected. The hinge function knots may correspond to chemical concentration, temperature or pressure thresholds that cause significant phase changes in the chemical response. Biochemists should be consulted on the interpretation of the model results.
We summarize all the R code used in this project in this appendix for ease of reading.
knitr::opts_chunk$set(echo = FALSE)
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(caret)
library(AppliedPredictiveModeling)
library(earth)
library(mlbench) # data for Friedman non-linear dataset
runProblem1 = TRUE
runModels1 = TRUE
runProblem2 = TRUE
set.seed(200)
trainingData = mlbench.friedman1( 200, sd = 1 )
## We convert the 'x' data from a matrix to a dataframe.
## One reason is that this will give the columns names
trainingData$x = data.frame( trainingData$x )
## Look at the data using
featurePlot( trainingData$x, trainingData$y)
## or other methods
## Simulate a large test set to estimate the true error rate with good precision:
testData = mlbench.friedman1( 5000, sd = 1 )
testData$x = data.frame( testData$x)
options(digits = 3)
set.seed(1020)
knnModel = train( x= trainingData$x, y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
knnPred <- predict( knnModel, newdata = testData$x)
## the function postResample can be used to get the test set
## performance values
(knn_test_performance = postResample(pred = knnPred, obs = testData$y) )
marsGrid = expand.grid( .degree = 1:2, .nprune = 2:20)
repeats = 4
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats, selectionFunction = "best", verboseIter = FALSE)
set.seed(1000)
marsTuned = caret::train(x = trainingData$x,
y = trainingData$y ,
method = "earth" ,
metric = "RMSE" ,
tuneGrid = marsGrid ,
preProc = c("center", "scale") ,
trControl = ctrl )
summary(marsTuned)
p1 = ggplot(marsTuned, metric = "Rsquared" ) + labs(y = "R-Squared", title = "Performance under Repeated CV: MARS")
p2 = ggplot(marsTuned, metric = "RMSE") + labs(y ="RMSE")
cowplot::plot_grid(p1, p2, ncol = 1)
marsPred = predict( marsTuned, newdata = testData$x )
(mars_test_performance = postResample(pred = marsPred, obs = testData$y ) )
plot( varImp(marsTuned, scale = FALSE ))
svmGrid = expand.grid( C = c( 1.0, 5, 10, 15, 30, 100 ), sigma = c( 0.005, 0.01, 0.015, 0.02, 0.04, 0.10 ))
repeats = 4
ctrl <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")
set.seed(1000)
svmTuned = caret::train(x = trainingData$x ,
y = trainingData$y,
method = "svmRadial" ,
metric = "RMSE" ,
tuneGrid = svmGrid,
preProc = c("center", "scale") ,
trControl = ctrl )
svmTuned
svmPred = predict( svmTuned, newdata = testData$x )
(svm_test_performance = postResample(pred = svmPred, obs = testData$y ) )
plot(varImp(svmTuned, scale = FALSE ))
test_results = data.frame(models = c("kNN", "MARS", "SVM - Radial Basis Function") ,
RMSE = c( knn_test_performance[1], mars_test_performance[1], svm_test_performance[1] ) ,
RSquared = c( knn_test_performance[2], mars_test_performance[2], svm_test_performance[2] )
)
test_results %>% kable(caption = "Test Performance for all three models", digits = 3 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
data("ChemicalManufacturingProcess")
chemX = ChemicalManufacturingProcess[2:58]
chemY = ChemicalManufacturingProcess[1]
#chemX %>% rename_at(vars(starts_with("ManufacturingProcess") ),
# funs(str_replace(., "ManufacturingProcess", "M"))) %>%
# rename_at(vars(starts_with("Biological")),
# funs( str_replace( ., "BiologicalMaterial", "B"))) -> chemX2
chemX %>%
rename_with(~ str_replace(., "ManufacturingProcess", "M"), starts_with("Manufacturing") ) %>%
rename_with(~ str_replace(., "BiologicalMaterial", "B"), starts_with("Biological") ) -> chemX3
set.seed(19372)
train_indices = createDataPartition(chemY$Yield , times = 1, list = FALSE, p = 0.75 )
chemX_test = chemX3[-train_indices , ]
chemX_train = chemX3[ train_indices , ]
chemY_test = chemY[-train_indices , ]
chemY_train = chemY[ train_indices , ]
# Save the non-zero-variance predictors as a "bv" (big variance set of predictors)
chemX_train_bv = chemX_train[, -nearZeroVar(chemX_train) ]
# Save the non-zero-variance predictors as a "bv" (big variance set of predictors)
chemX_test_bv = chemX_test[, -nearZeroVar(chemX_test) ]
chemX_train_bv %>% mutate(
across(c(M18, M20, M16, M26, M25, M27, M31,
M29, M43, M30, M01, M42, M44, M45,
M39, M38 ) , # Run a function across the columns listed in across()
~if_else( condition = ( .x == 0 ) , # anonymous function applied to each column in across() .x refers to current column argument
true = median( . ) , # we use median without removal of NA
false = as.numeric(.) ) ) # keep the existing value if it is nonzero
) -> chemX_train_nz
# Replace the outlier zeros with the median value for the variables below for test
chemX_test_bv %>% mutate(
across(c(M39, M42, M44, M45, M01 ) ,
~if_else( condition = ( .x == 0 ) ,
true = median( . ) ,
false = as.numeric(.) ) ) ) -> chemX_test_nz
# Build the caret function to preprocess the Chemical data and impute missing values.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(chemX_train_nz, method = c("center", "scale", "knnImpute"))
# Becomes the source data for the model building
chemX_train_pp = predict( preProcFunc, chemX_train_nz )
# Becomes the final version of test data for prediction
chemX_test_pp = predict( preProcFunc, chemX_test_nz )
marsGrid = expand.grid( .degree = 1:2, .nprune = 2:25 )
repeats = 4
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats, selectionFunction = "best", verboseIter = FALSE)
set.seed(1000)
marsTuned = caret::train(x = chemX_train_pp,
y = chemY_train,
method = "earth" ,
metric = "RMSE" ,
tuneGrid = marsGrid ,
trControl = ctrl )
p1 = ggplot(marsTuned, metric = "Rsquared" ) + labs(y = "R-Squared", title = "Performance under Repeated CV: MARS")
p2 = ggplot(marsTuned, metric = "RMSE") + labs(y ="RMSE")
cowplot::plot_grid(p1, p2, ncol = 1)
print(marsTuned)
print(marsTuned$finalModel)
pred_mars_chemY = predict(marsTuned, newdata = chemX_test_pp)
testRMSE_mars_chem = caret::RMSE(pred_mars_chemY, chemY_test )
testR2_mars_chem = caret::R2( pred_mars_chemY, chemY_test)
svmGrid = expand.grid( C = c( 1.0, 5, 10, 15, 30, 100 ), sigma = c( 0.005, 0.01, 0.015, 0.02, 0.04, 0.10 ))
repeats = 9
ctrl <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")
set.seed(1000)
svmTuned = caret::train(x = chemX_train_pp,
y = chemY_train,
method = "svmRadial" ,
metric = "RMSE" ,
tuneGrid = svmGrid,
trControl = ctrl )
svmTuned
p1 = ggplot(svmTuned, metric = "Rsquared" ) + labs(title = "SVM Performance in Cross Validation", y = "R-squared")
p2 = ggplot(svmTuned, metric = "RMSE") + labs(y = "RMSE")
cowplot::plot_grid(p1, p2, ncol = 1 )
pred_svm_chemY = predict(svmTuned, newdata = chemX_test_pp)
(testRMSE_svm_chem = caret::RMSE(pred_svm_chemY, chemY_test ) )
(testR2_svm_chem = caret::R2( pred_svm_chemY, chemY_test) )
test_results = data.frame(models = c("MARS", "SVM - Radial Basis Function") ,
RMSE = c( testRMSE_mars_chem, testRMSE_svm_chem) ,
RSquared = c( testR2_mars_chem, testR2_svm_chem )
)
test_results %>% kable(caption = "Test Performance for both models", digits = 3 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
p1 = plot(varImp(marsTuned, scale= FALSE), top = 10 , main = "MARS Variable Importance")
p2 = plot(varImp(svmTuned, scale = FALSE), top = 10, main = "SVM Variable Importance")
plot_grid(p1, p2)
coef(marsTuned$finalModel) %>% broom::tidy() %>% kable(digits = 2 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
repeats = 9
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = repeats, selectionFunction = "oneSE")
set.seed(102030)
plsTune <- train(x = chemX_train_pp, y = chemY_train,
method = "pls",
tuneGrid = expand.grid(ncomp = 1:20),
trControl = ctrl ,
preProcess = c("center", "scale") ,
metric = "Rsquared"
)
pred_pls_chemY = predict(plsTune, newdata = chemX_test_pp)
plsTune
pls::coefplot(plsTune$finalModel, ncomp=1:3,
labels = "names", legendpos = "top",
main = "PLS Coefficient Plot of 3 Latent Variables")
a = drop(plsTune$finalModel$coefficients)
df_pls_coefs =as_tibble( a )
colnames(df_pls_coefs) = c("LV1", "LV2", "LV3")
df_pls_coefs$names = rownames(a)
df_pls_coefs %>% filter( names %in% c( 'M32', 'M09', 'M13', 'M36' ) ) %>% kable(digits = 3 , caption = "Latent Variable Coefficients")
vars_to_plot = c("M32", "M09" )
#"M13", "B06")
theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch = 16
theme1$plot.line$col = "green"
theme1$plot.line$lwd = 2
trellis.par.set(theme1)
caret::featurePlot(x = chemX_train_pp[, vars_to_plot],
y = marsTuned$finalModel$fitted.values , plot = "scatter" ,
type = c("p", "smooth") ,
span = 0.5,
layout = c(2,1) ,
main = "Marginal Effects Plot for MARS - Training Set"
)
caret::featurePlot(x = chemX_train_pp[, vars_to_plot],
y = predict(plsTune, newdata=chemX_train_pp) , plot = "scatter" ,
type = c("p", "smooth") ,
span = 0.5,
layout = c(2,1) ,
main = "Marginal Effects Plot for PLS - Training Set"
)