Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)

where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
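A minimal sketch of that simulation, assuming the same setup as the book's example (a 200-point training set, a large 5,000-point test set, and seed 200):

```r
library(mlbench)
library(caret)

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)   # convert the 'x' matrix to a data frame

testData <- mlbench.friedman1(5000, sd = 1)    # large test set to estimate the true error
testData$x <- data.frame(testData$x)
```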

(a) Tune several models on these data.

Knn Model (Example from book)
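The fit below follows the example given in the book: predictors are centered and scaled, and the number of neighbors is tuned over caret's default grid (the exact resampling settings are an assumption).

```r
knnModel <- train(x = trainingData$x, y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)   # test-set RMSE, R-squared, MAE
```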

Looking at the plot above, we see that the best RMSE was achieved with k = 11 nearest neighbors, giving an RMSE of 3.27.

Multivariate Adaptive Regression Splines
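A sketch of the MARS tuning, assuming the earth package through caret and a grid over the degree of interaction and the number of retained terms:

```r
library(earth)

marsGrid <- expand.grid(degree = 1:2, nprune = 2:38)   # additive and two-way interaction models

set.seed(200)
marsModel <- train(x = trainingData$x, y = trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))

marsPred <- predict(marsModel, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
```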

Looking at the MARS model selection plot, we see that a MARS model of degree 2 with 17 terms is the one with the lowest RMSE, 1.24.

Support Vector Machine
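A sketch of the radial-basis SVM fit, assuming kernlab's svmRadial through caret; sigma is estimated analytically from the training data, so only the cost parameter is tuned:

```r
set.seed(200)
svmModel <- train(x = trainingData$x, y = trainingData$y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,                          # cost values from 2^-2 up to 2^11
                  trControl = trainControl(method = "cv"))

svmPred <- predict(svmModel, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
```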

The SVM model with the lowest RMSE, 2.02, had a cost of 32 and used 168 training set points as support vectors, which makes up 84% of the training set. The sigma value, which was held fixed across all candidate models, was 0.06.

(b) Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Regression model metrics
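The table below can be assembled from the test-set predictions of each tuned model (a sketch, assuming the model objects fitted above):

```r
rbind(KNN  = postResample(knnPred,  testData$y),
      MARS = postResample(marsPred, testData$y),
      SVM  = postResample(svmPred,  testData$y))
```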
model   RMSE     Rsquared   MAE
KNN     3.2700   0.6250     2.6287
MARS    1.2404   0.9419     0.9938
SVM     2.0208   0.8496     1.6423

Comparing the model metrics, we see that the MARS model had the best metrics across the board, with the SVM model coming in second and the KNN model performing the worst of the three. However, my concern with the MARS model is that it might be overfitting the data, given that its \(R^2\) is so much higher than that of the other two models.

Variable importance by Model
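The importance scores below come from caret's varImp(), scaled to 0-100 within each model (a sketch, assuming the model objects above). KNN and SVM have no model-specific importance measure, so caret falls back to the same model-free filter for both, which is why those two columns are identical.

```r
varImp(marsModel)   # earth's built-in importance measure
varImp(knnModel)    # model-free (filter) importance
varImp(svmModel)    # model-free (filter) importance
```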
predictor MARS KNN SVM
4 X4 100.00 100.00 100.00
2 X2 69.73 70.41 70.41
1 X1 42.88 72.78 72.78
5 X5 20.52 66.84 66.84
3 X3 0.00 29.76 29.76

Additionally, as seen in the table above, the importance assigned by MARS decays much more quickly after X4 than it does for the other two models. That is, each predictor after X4 is considered less and less important at a faster rate than in either the KNN or SVM model, which means the MARS model puts much of its predictive weight on the X4 predictor. As for whether MARS selects the informative predictors: it does pick up X1, X2, X4, and X5, but it assigns zero importance to X3.

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
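A sketch of re-using the Exercise 6.3 preprocessing; the imputation method (knnImpute), the split proportion, and the seed are assumptions standing in for whatever was used in the earlier homework. Only the SVM fit is shown; the MARS and KNN fits follow the same pattern.

```r
library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)

predictors <- ChemicalManufacturingProcess[, -1]   # biological and process predictors
yield      <- ChemicalManufacturingProcess$Yield

set.seed(123)
inTrain <- createDataPartition(yield, p = 0.75, list = FALSE)

ctrl <- trainControl(method = "cv")

svmFit <- train(x = predictors[inTrain, ], y = yield[inTrain],
                method = "svmRadial",
                preProc = c("knnImpute", "center", "scale"),
                tuneLength = 14,
                trControl = ctrl)
```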

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

Looking at the graph above, we see that the KNN model used to predict yield actually underperformed on the training data. The MARS and SVM models, on the other hand, have fairly consistent \(R^2\) values. I would select the SVM model since it has the highest \(R^2\) and its training and test set \(R^2\) values are nearly identical.
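A sketch of the comparison behind the graph: getTrainPerf() reports the cross-validated metrics for the chosen tuning parameters, and postResample() gives the corresponding test-set metrics.

```r
getTrainPerf(svmFit)                                        # resampled RMSE / R-squared / MAE

svmTestPred <- predict(svmFit, newdata = predictors[-inTrain, ])
postResample(pred = svmTestPred, obs = yield[-inTrain])     # test-set RMSE / R-squared / MAE
```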

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Top 10 SVM predictors
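The breakdown below can be computed from the SVM importance scores (a sketch; the naming convention "BiologicalMaterialXX" / "ManufacturingProcessXX" comes from the data set):

```r
svmImp <- varImp(svmFit)$importance
top10  <- rownames(svmImp)[order(svmImp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Process")) / 10
```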
type         predictors   proportion
Biological   4            0.40
Process      6            0.60

The top 10 predictors for the SVM model aren't dominated by either type: of the top 10, 60% are process predictors and 40% are biological predictors. This is slightly lower than in HW 7, where process predictors made up 70% of the top 10 for the linear model.

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Looking at the scatter plots above, we can see that across the board the biological materials increase yield. With the manufacturing processes, on the other hand, it depends on the process: some will decrease yield and some will increase it, as seen with ManufacturingProcess33 and ManufacturingProcess36.
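A sketch of how the scatter plots could be drawn, assuming the predictors unique to the SVM model are collected in uniqueTop (the two process variables named above are used here as placeholders, not the actual selection):

```r
library(ggplot2)
library(tidyr)

uniqueTop <- c("ManufacturingProcess33", "ManufacturingProcess36")   # placeholders

ChemicalManufacturingProcess |>
  pivot_longer(all_of(uniqueTop), names_to = "predictor", values_to = "value") |>
  ggplot(aes(x = value, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +          # linear trend per predictor
  facet_wrap(~ predictor, scales = "free_x")
```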