Problem No. 7.2
First, let’s create some of Friedman’s simulated nonlinear data:
library(mlbench)   # for mlbench.friedman1()
library(caret)     # for train(), postResample(), varImp()
library(dplyr)     # for the %>% pipe
library(ggplot2)

set.seed(1804)
train <- mlbench.friedman1(200, sd=1)
train$x <- as.data.frame(train$x)
train_df <- train$x  # already a data frame after the line above
train_df$y <- train$y
test <- mlbench.friedman1(5000, sd=1)
test_df <- as.data.frame(test$x)
test_df$y <- test$y
The plots below show how each variable relates to \(y\). The first five variables are related to \(y\) in a non-linear fashion (by design), and the latter five are not related to \(y\) at all. This is apparent in this plot:
train_df %>%
  tidyr::gather(-y, key='var', value='value') %>%
  ggplot(aes(x = value, y = y)) +
  geom_point(alpha=0.33) +
  geom_smooth(se=FALSE) +
  facet_wrap(~ var, scales='free')
The next section attempts to model these data with various learning algorithms. First, fit a linear regression model as a baseline against which to compare the more sophisticated models:
m0 <- lm(y ~ ., train_df)
m0_pred <- predict(m0, test_df)
# out of sample RMSE
sqrt(mean((test_df$y - m0_pred)^2))
## [1] 2.704765
Interestingly, the linear model doesn’t look too bad. It recognizes that four of the first five variables are important. The adjusted \(R^2 = 0.72\), which is respectable, and the \(RMSE\) on the test set is 2.7. A simple plot of the predictions against \(y\) looks reasonable, despite some negative outliers.
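For reference, a minimal sketch of that diagnostic plot, using the m0_pred and test_df objects defined above:

ggplot(data.frame(pred = m0_pred, obs = test_df$y),
       aes(x = pred, y = obs)) +
  geom_point(alpha = 0.33) +                                   # observed vs. predicted
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed')   # perfect-fit line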
\(M_1\): Neural network model
The first algorithm to use is a neural network. Since none of the independent variables are correlated, there is no collinearity to pre-process away (neural networks can be sensitive to it). Test the decay parameter in \((0, 0.01, 0.1)\) and the number of hidden nodes between 3 and 9:
m1_grid <- expand.grid(decay = c(0, 0.01, 0.1),
                       size = 3:9)
m1 <- train(train$x, train$y,
            method='nnet',
            tuneGrid=m1_grid,
            preProcess=c('center', 'scale'),
            linout=TRUE,
            trace=FALSE)
The best model has 3 hidden nodes with decay \(= 0.1\). Evaluate on the test set:
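The test-set metrics below were presumably produced with caret’s postResample(); a minimal sketch:

# compare the network's held-out predictions to the true responses
postResample(pred = predict(m1, newdata = as.data.frame(test$x)),
             obs = test_df$y)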
## RMSE Rsquared MAE
## 2.8111016 0.6908836 2.1917741
Interestingly, the best neural network model performs worse than plain ol’ linear regression on the test set. It scores an \(RMSE = 2.81\) and \(R^2 = 0.69\).
\(M_2\): Multivariate adaptive regression splines model
Next, apply MARS. Test degrees of 1 and 2 and the number of retained terms (nprune) between 2 and 9:
m2_grid <- expand.grid(degree=1:2,
                       nprune=2:9)
set.seed(1804)
m2 <- train(x=train$x,
            y=train$y,
            method='earth',
            tuneGrid=m2_grid)
plot(m2)
The best model has \(nprune = 9\) and \(degree = 2\). The plot shows how \(RMSE\) changes with both tuning parameters. On the test data, the MARS model performs very well, with \(RMSE = 1.59\) and \(R^2 = 0.90\)! It is superior to both the neural network and linear regression:
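As before, a sketch of the evaluation call behind these numbers:

# held-out performance of the MARS model
postResample(pred = predict(m2, newdata = as.data.frame(test$x)),
             obs = test_df$y)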
## RMSE Rsquared MAE
## 1.5896446 0.8996629 1.2601951
Examining the (arguably) ‘most important variables’ with varImp() shows that the MARS algorithm correctly identifies the first five variables as informative and the latter five as unimportant.
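A minimal sketch of that check:

# scaled importance scores: V1-V5 should dominate, V6-V10 should be near zero
varImp(m2)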
\(M_3\): Support vector machines model
Next, try an SVM model with a radial basis kernel. Set the granularity of the tuning parameter grid with tuneLength:
set.seed(1804)
m3 <- train(x=train$x,
            y=train$y,
            method='svmRadial',
            preProc=c('center', 'scale'),
            tuneLength=20)
The best model has \(\sigma = 0.056\) and \(C = 8\). On the hold-out data set, it records \(RMSE = 2.13\) and \(R^2 = 0.82\), better than the neural network and linear regression but worse than MARS:
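And the corresponding evaluation sketch:

# held-out performance of the radial-kernel SVM
postResample(pred = predict(m3, newdata = as.data.frame(test$x)),
             obs = test_df$y)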
## RMSE Rsquared MAE
## 2.1253776 0.8212325 1.6571028
According to the (problematic) varImp(), the SVM algorithm identifies the first five variables as the most important; however, it also accords some minor importance to three uninformative variables. This failure to discriminate noise from signal is part of the reason the model does not perform as well as it could.
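The varImp() caveat: caret has no model-specific importance measure for kernel SVMs, so (as far as I can tell) it falls back to a model-free filter that scores each predictor by a univariate loess fit against \(y\). That fallback is what lets pure-noise variables pick up small scores. A sketch of the equivalent direct call:

# for svmRadial, varImp(m3) appears to reduce to this model-free filter
filterVarImp(x = train$x, y = train$y, nonpara = TRUE)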
Conclusion
At this point, I have trained linear regression, neural network, MARS, and SVM models. Their performances on the test data:
| \(i\) | Model | \(RMSE\) | \(R^2\) |
|---|---|---|---|
| 0 | Linear regression | \(2.70\) | — |
| 1 | Neural network | \(2.81\) | \(0.69\) |
| 2 | MARS | \(1.59\) | \(0.90\) |
| 3 | SVM | \(2.12\) | \(0.82\) |
On both measures, the MARS model performs best. MARS may be particularly well-suited to this dataset, where the important variables are related to \(y\) non-linearly; a piecewise-linear model can capture exactly that kind of structure. Additionally, MARS was the only algorithm to correctly include all and only the informative predictors (\(V1\)–\(V5\)).
Problem No. 7.5
Load the chemical manufacturing data and split it into train and test sets, using the same random seed as in the previous assignment:
data("ChemicalManufacturingProcess")
df <- ChemicalManufacturingProcess
colnames(df) <- tolower(colnames(df))
set.seed(1805)
train_ix <- createDataPartition(df$yield, p=0.8, list=FALSE)
df_train <- df[train_ix, ]
df_test <- df[-train_ix, ]
Next, train linear regression, glmnet (the winner last time), SVM, and MARS models. I will use the convenient caretEnsemble::caretList function, which makes it easy to specify and train multiple models with minimal code. (The MARS method does not seem to play nicely with caretList, so I train it separately.) As before, I use bagImpute to impute missing values.
This code takes some time to run, so I have told R Markdown not to run it during knitting:
library(caretEnsemble)
models <- caretList(x=df_train[, 2:58],
                    y=df_train$yield,
                    methodList=c('lm'),
                    preProcess='bagImpute',
                    tuneList=list(
                      svmRadial=caretModelSpec(method='svmRadial',
                                               tuneLength=14),
                      glmnet=caretModelSpec(method='glmnet',
                                            tuneGrid=expand.grid(alpha=0:1,
                                                                 lambda=10^seq(-3, 3, length=100)))
                    ))
# MARS model trained separately
mars_grid <- expand.grid(degree=1:2, nprune=2:5)
mars_model <- train(x=df_train[, 2:58],
                    y=df_train$yield,
                    preProcess='bagImpute',
                    method='earth',
                    tuneGrid=mars_grid)
Examine each model’s performance on the test set:
postResample(predict(models$svmRadial, newdata=df_test), obs=df_test$yield)
postResample(predict(models$glmnet, newdata=df_test), obs=df_test$yield)
postResample(predict(models$lm, newdata=df_test), obs=df_test$yield)
postResample(predict(mars_model, newdata=df_test), obs=df_test$yield)
The SVM model failed to train, but the other models’ performance is:
| Model | \(RMSE\) | \(R^2\) |
|---|---|---|
| Linear regression | 1.36 | 0.58 |
| GLMnet | 1.55 | 0.49 |
| MARS | 1.51 | 0.49 |
| SVM | — | — |
However, R warns that prediction from a rank-deficient fit may be misleading. As we know from the previous assignment, this stems from correlation among the independent variables.
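One way to see the collinearity directly is caret’s findCorrelation(); the 0.9 cutoff here is my own arbitrary choice:

# count predictors involved in very high pairwise correlations; these
# near-duplicates are what make the lm design matrix rank-deficient
cors <- cor(df_train[, 2:58], use='pairwise.complete.obs')
length(findCorrelation(cors, cutoff=0.9))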
This being the case, I judge the MARS model to be the best. It performs slightly better than the optimal model from the last assignment, the GLMnet.
According to varImp(), the MARS model uses only three variables: manufacturing processes 32, 9, and 13. None of the biological material variables are important to this model.