Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\(y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)
where the x values are random variables uniformly distributed on [0, 1]. The simulation also creates 5 other non-informative variables. The package mlbench contains a function called mlbench.friedman1 that simulates these data.
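As a rough illustration of what the simulation does, here is a minimal standard-library Python sketch of the equation above. It is a stand-in for the R call, not the mlbench implementation. The function name `simulate_friedman1`, the seed, and the noise level `sd=1.0` are assumptions made for the example; the sample size of 200 is inferred from the SVM results below (168 support vectors being 84% of the training set).

```python
import math
import random


def simulate_friedman1(n, sd=1.0, n_features=10, seed=None):
    """Simulate the Friedman (1991) benchmark described above.

    Only the first five columns enter the response; the remaining
    columns are non-informative predictors.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.random() for _ in range(n_features)]  # uniform draws
        signal = (
            10 * math.sin(math.pi * row[0] * row[1])
            + 20 * (row[2] - 0.5) ** 2
            + 10 * row[3]
            + 5 * row[4]
        )
        X.append(row)
        y.append(signal + rng.gauss(0.0, sd))  # additive N(0, sd^2) noise
    return X, y


# Assumed settings: 200 training rows, unit noise, arbitrary seed.
X_train, y_train = simulate_friedman1(200, sd=1.0, seed=200)
print(len(X_train), len(X_train[0]), len(y_train))
```

Because columns 6 through 10 never enter `signal`, a well-behaved model should assign them little or no importance.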
Looking at the plot above, we see that the KNN model with the best RMSE used k = 11 neighbors, producing an RMSE of 3.27.
Looking at the MARS model selection plot, we see that a MARS model of degree 2 with 17 terms has the lowest RMSE, 1.24.
The SVM model with the lowest RMSE, 2.02, had a cost of 32 and used 168 training points as support vectors, which is 84% of the training set. The sigma for all candidate models was 0.06.
| model | RMSE | Rsquared | MAE |
|---|---|---|---|
| KNN | 3.2700 | 0.6250 | 2.6287 |
| MARS | 1.2404 | 0.9419 | 0.9938 |
| SVM | 2.0208 | 0.8496 | 1.6423 |
Comparing the model metrics, we see that the MARS model had the best metrics across the board, with the SVM model coming in second and the KNN model performing the worst of all. However, my concern with the MARS model is that it might be overfitting the data, given that its \(R^2\) is so much higher than that of the other two models.
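For reference, the three metrics in the table can be computed from observed and predicted values as in the sketch below. It assumes that Rsquared is the squared Pearson correlation between observations and predictions, which I believe is the convention caret uses for this column; other definitions of \(R^2\) can give different numbers. The function name `regression_metrics` and the toy values are made up for illustration and are not the model outputs from this analysis.

```python
import math


def regression_metrics(observed, predicted):
    """Return RMSE, squared-correlation R^2, and MAE for paired values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r_squared = cov * cov / (var_o * var_p)  # assumed: squared correlation

    return {"RMSE": rmse, "Rsquared": r_squared, "MAE": mae}


# Toy values, for illustration only.
toy = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
print(toy)
```

Since RMSE squares the errors before averaging, it penalizes large misses more heavily than MAE does, which is why the two columns can rank models differently even though they agree here.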
| predictor | MARS | KNN | SVM |
|---|---|---|---|
| X4 | 100.00 | 100.00 | 100.00 |
| X2 | 69.73 | 70.41 | 70.41 |
| X1 | 42.88 | 72.78 | 72.78 |
| X5 | 20.52 | 66.84 | 66.84 |
| X3 | 0.00 | 29.76 | 29.76 |
Additionally, as seen in the table above, the importance of each subsequent predictor after X4 decays much more quickly in the MARS model than in the other two models. For example, MARS scores X1 at 42.88 and X5 at 20.52, while the KNN and SVM models score them at 72.78 and 66.84. This means that the MARS model is putting a lot of the weight of prediction onto the X4 predictor.
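One way to make the decay concrete is to sort each model's scores and look at the drop between successive ranks. The sketch below does this with the values copied from the table above; the helper names `ranked_scores` and `successive_drops` are made up for the example.

```python
# Importance scores copied from the table above.
importance = {
    "MARS": {"X4": 100.00, "X2": 69.73, "X1": 42.88, "X5": 20.52, "X3": 0.00},
    "KNN": {"X4": 100.00, "X2": 70.41, "X1": 72.78, "X5": 66.84, "X3": 29.76},
    "SVM": {"X4": 100.00, "X2": 70.41, "X1": 72.78, "X5": 66.84, "X3": 29.76},
}


def ranked_scores(scores):
    """Scores sorted from most to least important."""
    return sorted(scores.values(), reverse=True)


def successive_drops(scores):
    """Difference between each score and the next one down the ranking."""
    ranked = ranked_scores(scores)
    return [round(a - b, 2) for a, b in zip(ranked, ranked[1:])]


for model, scores in importance.items():
    print(model, ranked_scores(scores), successive_drops(scores))
```

This only summarizes the table; it does not recompute the importance scores themselves, which come from the fitted models.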
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Looking at the graph above, we see that the KNN model used to predict yield was actually underperforming on the training data. On the other hand, the MARS and SVM models have fairly consistent \(R^2\) values. I would select the SVM model since it has the highest \(R^2\) and its \(R^2\) values for the training and test data sets are nearly identical.
| type | predictors | percent |
|---|---|---|
| Biological | 4 | 40 |
| Process | 6 | 60 |
The top 10 predictors for the SVM model aren’t dominated by one predictor type or the other. Of the top 10 predictors, 60% are Process predictors while 40% are Biological predictors. The Process share is slightly lower than in HW 7, where the Process predictors made up 70% of the top 10 predictors in the Linear model.
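The tally above can be reproduced by grouping predictor names by their prefix, as in the sketch below. The list `top10` is a placeholder chosen only to match the 4 and 6 counts in the table; it is not the model's actual top 10, and the helper names are assumptions made for the example.

```python
from collections import Counter


def predictor_type(name):
    """Classify a predictor by its name prefix."""
    if name.startswith("BiologicalMaterial"):
        return "Biological"
    if name.startswith("ManufacturingProcess"):
        return "Process"
    return "Other"


def type_shares(names):
    """Map each predictor type to (count, percent of the list)."""
    counts = Counter(predictor_type(n) for n in names)
    total = len(names)
    return {t: (c, 100.0 * c / total) for t, c in counts.items()}


# Placeholder names only, NOT the actual top 10 from the SVM model.
top10 = (
    ["BiologicalMaterial_placeholder%d" % i for i in range(1, 5)]
    + ["ManufacturingProcess_placeholder%d" % i for i in range(1, 7)]
)
print(type_shares(top10))
```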
Looking at the scatter plots above, we can see that across the board yield increases with the Biological Material predictors. With the Manufacturing Process predictors, on the other hand, the direction depends on the process. Some decrease yield and some increase it, as seen with ManufacturingProcess33 and ManufacturingProcess36.