Questions

Question 7.2

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?

Following the code given, I loaded the test function and created training and test data.
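The chunks below assume the following packages are attached. This is my reconstruction of the setup chunk (the exact package list is an assumption); Metrics and mice are called via their namespaces later.

library(mlbench)   # mlbench.friedman1() simulated data
library(kernlab)   # ksvm()
library(caret)     # knnreg(), createDataPartition()
library(earth)     # MARS models
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess data
library(dplyr)     # select(), filter(), pipes
library(tidyr)     # gather()
library(ggplot2)   # plotting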

set.seed(200)
training.data <- mlbench.friedman1(200, sd=1)
test.data <- mlbench.friedman1(5000, sd=1)
test.data$x <- data.frame(test.data$x)

Support Vector Machine

First, I tested a support vector machine with a radial basis kernel, which resulted in an RMSE of about 2.26 on the test set.

svm.fit <- ksvm(x=training.data$x, y=training.data$y, kernel='rbfdot', kpar='automatic', C=1, epsilon=0.1)
svm.fit
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0601226829145354 
## 
## Number of Support Vectors : 157 
## 
## Objective Function Value : -41.707 
## Training error : 0.073535
Metrics::rmse(test.data$y, predict(svm.fit, test.data$x))
## [1] 2.255179

KNN

Next, I tested k-nearest neighbors (k = 5), which reported an RMSE of 3.15.

knn.fit <- knnreg(training.data$x, training.data$y, k=5)
knn.fit
## 5-nearest neighbor regression model
Metrics::rmse(test.data$y, predict(knn.fit, test.data$x))
## [1] 3.147854
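Both fits above use hand-picked hyperparameters (C and epsilon for the SVM, k = 5 for KNN). A minimal sketch of choosing k by cross-validation instead, using caret::train (the object name knn.tuned is hypothetical; the same idea applies to the SVM via method='svmRadial'):

knn.tuned <- caret::train(x=training.data$x, y=training.data$y,
                          method='knn',
                          preProc=c('center', 'scale'),  # KNN is distance-based, so center/scale first
                          tuneGrid=data.frame(k=1:20),   # candidate neighborhood sizes
                          trControl=caret::trainControl(method='cv', number=10))
knn.tuned$bestTune   # k chosen by resampling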

MARS

Finally, I tested MARS, which reported an RMSE of 1.81.

mars.fit <- earth(training.data$x, training.data$y)
mars.fit
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: training.data$x1, training.data$x4, training.data$x2, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
Metrics::rmse(test.data$y, predict(mars.fit, test.data$x))
## [1] 1.813647

MARS has the strongest fit of the three methods tested (lowest test RMSE). It selected 6 of the 10 predictors, including all five informative variables (X1-X5) plus one noise variable, and every informative variable was ranked as more important than the noise variable.
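A quick way to verify this is earth's built-in variable importance report (just a sketch of the check; output not shown):

earth::evimp(mars.fit)   # the informative predictors (X1-X5) should head the list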

Question 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

As directed, I copied my cleaning code from last week's assignment here. The end result is the cleaned chemical manufacturing data.

data(ChemicalManufacturingProcess)
Chemical.Y <- ChemicalManufacturingProcess$Yield
Chemical.X <- ChemicalManufacturingProcess %>%
  select(-Yield)

set.seed(123)
imputed.data <- mice::mice(Chemical.X, m=5, maxit=10, seed=123, printFlag=FALSE)
## Warning: Number of logged events: 1350
completed.data <- cbind(Chemical.Y, mice::complete(imputed.data, 1)) 

set.seed(123)

part <- caret::createDataPartition(completed.data$Chemical.Y, p=0.8, list=FALSE)
training <- completed.data %>%
  filter(row_number() %in% part)
testing <- completed.data %>%
  filter(!row_number() %in% part)
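The exercise also mentions re-using the earlier pre-processing steps. If the linear-model workflow centered and scaled the predictors, the equivalent here would look like the sketch below (not applied to the fits that follow; pp is a hypothetical object name):

pp <- caret::preProcess(training %>% select(-Chemical.Y), method=c('center', 'scale'))
training.scaled <- predict(pp, training %>% select(-Chemical.Y))  # transforms estimated on training data only
testing.scaled  <- predict(pp, testing %>% select(-Chemical.Y))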
1. Which nonlinear regression model gives the optimal resampling and test set performance?

svm.fit <- ksvm(Chemical.Y ~ ., data=training, kpar='automatic', C=1, epsilon=0.01)
Metrics::rmse(testing$Chemical.Y, predict(svm.fit, testing %>% select(-Chemical.Y)))
## [1] 1.209252
knn.fit <- knnreg(training %>% select(-Chemical.Y), training$Chemical.Y, k=3)
Metrics::rmse(testing$Chemical.Y, predict(knn.fit, testing %>% select(-Chemical.Y)))
## [1] 1.015829
mars.fit <- earth(training %>% select(-Chemical.Y), training$Chemical.Y, thresh=0.01)
Metrics::rmse(testing$Chemical.Y, predict(mars.fit, testing %>% select(-Chemical.Y)))
## [1] 1.041377

The k-nearest neighbors model with k set to 3 provides the best test RMSE. However, the MARS model is very close and may be more robust overall. Given that KNN predictions slow down as the training set grows (each prediction requires distances to every training point), it may be reasonable to select MARS even though its RMSE is slightly worse.
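The question also asks about resampling performance, which the test-set RMSEs above do not directly address. A hedged sketch of estimating cross-validated RMSE for the two leading candidates with caret (object names and tuneLength are assumptions; results not shown):

ctrl <- caret::trainControl(method='cv', number=10)
knn.cv  <- caret::train(Chemical.Y ~ ., data=training, method='knn',
                        preProc=c('center', 'scale'), tuneLength=10, trControl=ctrl)
mars.cv <- caret::train(Chemical.Y ~ ., data=training, method='earth',
                        tuneLength=10, trControl=ctrl)
min(knn.cv$results$RMSE)   # best cross-validated RMSE for each model
min(mars.cv$results$RMSE)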

2. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

KNN does not provide a direct measure of predictor importance, so I will use MARS as my model for the next two problems. Viewing the selected predictors makes it clear that MARS favors the manufacturing process variables over the biological material variables: only 7 predictors were selected, 6 of which are manufacturing processes.

In the last homework, I selected lasso regression as my optimal linear model. It selected 17 predictors that were roughly split between manufacturing process and biological material variables. So while the MARS model clearly favored the process variables, the lasso did not.

mars.fit
## Selected 9 of 21 terms, and 7 of 57 predictors
## Termination condition: RSq changed by less than 0.01 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 8 (additive model)
## GCV 1.117226    RSS 125.137    GRSq 0.6890954    RSq 0.7547763
mars.fit$coefficients
##                                 training$Chemical.Y
## (Intercept)                             39.44693942
## h(ManufacturingProcess32-154)            0.20024067
## h(46.67-ManufacturingProcess09)         -0.63441216
## h(33.1-ManufacturingProcess13)           2.77359020
## h(ManufacturingProcess39-6.9)           -2.36682154
## h(6.9-ManufacturingProcess39)           -0.40047145
## h(ManufacturingProcess01-9.9)            0.39878733
## h(BiologicalMaterial06-44.45)            0.11018121
## h(974-ManufacturingProcess05)            0.04489574
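To make the process-vs-biological split explicit, the selected predictors can be counted from earth's variable importance table (a sketch; the variable names come from the rownames of evimp()):

selected <- rownames(earth::evimp(mars.fit))
table(ifelse(grepl('^Manufacturing', selected), 'Process', 'Biological'))  # count of each type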
3. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Below I plotted the relationship between each top predictor and the response variable. Interestingly, most of the predictors have a nonlinear relationship with yield, which goes a long way toward explaining why the earlier linear methods had so much trouble capturing the information in these predictors. I would also note that several of the predictors appear to have outliers or influential points that a more robust solution would need to address.

Intuitively, we can see why each predictor aids in predicting the response: each one shows a clear trend against yield. The weakest relationship, ManufacturingProcess05, unsurprisingly belongs to the least influential predictor. Conversely, ManufacturingProcess32 appears to track the response so closely, with what looks like a single cubic trend, that I would wager that predictor alone could account for most of the variance in the response.

training %>%
  select(Chemical.Y, ManufacturingProcess32, ManufacturingProcess09, 
         ManufacturingProcess13, ManufacturingProcess39, ManufacturingProcess01, 
         BiologicalMaterial06, ManufacturingProcess05
         ) %>%
  gather(key='key', value='value', -Chemical.Y) %>%
  ggplot(aes(value, Chemical.Y)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~key, scales = 'free')
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
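As a rough check on that claim about ManufacturingProcess32, a single-predictor cubic fit (a hypothetical follow-up, not part of the original analysis) would show how much variance that one variable explains on its own:

m32.fit <- lm(Chemical.Y ~ poly(ManufacturingProcess32, 3), data=training)
summary(m32.fit)$r.squared   # share of training variance explained by this predictor alone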