DATA624 - Homework 8

Author

Anthony Josue Roman

Exercise 7.2

library(caret)
library(mlbench)
library(earth)
library(kernlab)
library(nnet)
library(AppliedPredictiveModeling)

set.seed(200)
ctrl <- trainControl(method = "cv", number = 10)

trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

set.seed(200)

knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProcess = c("center", "scale"),
  tuneLength = 10,
  trControl = ctrl
)

marsModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "earth",
  tuneLength = 10,
  trControl = ctrl
)

svmModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 10,
  trControl = ctrl
)

nnetModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "nnet",
  preProcess = c("center", "scale"),
  tuneLength = 5,
  trControl = ctrl,
  linout = TRUE,
  trace = FALSE,
  maxit = 500
)

results <- resamples(list(
  KNN = knnModel,
  MARS = marsModel,
  SVM = svmModel,
  NeuralNet = nnetModel
))

summary(results)


Call:
summary.resamples(object = results)

Models: KNN, MARS, SVM, NeuralNet 
Number of resamples: 10 

MAE 
               Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
KNN       1.9925181 2.203288 2.386149 2.506584 2.768301 3.319190    0
MARS      0.9991096 1.173404 1.268388 1.275334 1.367821 1.697877    0
SVM       1.0807824 1.388131 1.504246 1.516022 1.730650 1.787318    0
NeuralNet 1.0068002 1.683876 1.806117 1.795324 2.013971 2.184521    0

RMSE 
              Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
KNN       2.582485 2.797964 3.039929 3.086639 3.292676 3.858210    0
MARS      1.170702 1.397844 1.643812 1.633284 1.881016 2.077305    0
SVM       1.307085 1.762298 1.930205 1.912882 2.096860 2.392585    0
NeuralNet 1.295561 2.059514 2.235439 2.237020 2.571122 2.673052    0

Rsquared 
               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
KNN       0.4810928 0.6450407 0.6663713 0.6822198 0.7565689 0.8154310    0
MARS      0.8169917 0.8692214 0.9073215 0.8987840 0.9393663 0.9614652    0
SVM       0.7281328 0.8247039 0.8541268 0.8534898 0.8981878 0.9398293    0
NeuralNet 0.6921375 0.7725302 0.7986020 0.8100231 0.8529043 0.9400158    0

dotplot(results)

models <- list(
  KNN = knnModel,
  MARS = marsModel,
  SVM = svmModel,
  NeuralNet = nnetModel
)

test_results <- lapply(models, function(model) {
  pred <- predict(model, newdata = testData$x)
  postResample(pred = pred, obs = testData$y)
})

test_results

$KNN
     RMSE  Rsquared       MAE 
3.1222641 0.6690472 2.4963650 

$MARS
     RMSE  Rsquared       MAE 
1.8136467 0.8677298 1.3911836 

$SVM
     RMSE  Rsquared       MAE 
2.0725274 0.8260536 1.5739454 

$NeuralNet
     RMSE  Rsquared       MAE 
2.2725013 0.7989897 1.6538021

varImp(marsModel)

earth variable importance

   Overall
X1  100.00
X4   82.08
X2   62.79
X5   38.07
X3   25.80
X6    0.00

summary(marsModel$finalModel)

Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=1,
            nprune=12)

               coefficients
(Intercept)       18.451984
h(0.621722-X1)   -11.074396
h(0.601063-X2)   -10.744225
h(X3-0.281766)    20.607853
h(0.447442-X3)    17.880232
h(X3-0.447442)   -23.282007
h(X3-0.636458)    15.150350
h(0.734892-X4)   -10.027487
h(X4-0.734892)     9.092045
h(0.850094-X5)    -4.723407
h(X5-0.850094)    10.832932
h(X6-0.361791)    -1.956821

Selected 12 of 18 terms, and 6 of 10 predictors (nprune=12)
Termination condition: Reached nk 21
Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
Number of terms at each degree of interaction: 1 11 (additive model)
GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

From the results of resampling, it can be stated that MARS gives the best performance. It produces the minimum RMSE (1.63), the minimum MAE (1.28), and the maximum \(R^2\) (0.90). Thus, it manages to capture the non-linear structure of the data. Consistently, the test set gives the lowest value of RMSE (1.81) and the highest value of \(R^2\) (0.87).

SVM is the second best, with average error values and strong \(R^2\) measures. The third place is occupied by the neural network, followed by the model based on KNN, which demonstrates the largest error rate and the minimum \(R^2\). This means that the algorithm for finding the local averages is inappropriate here because it cannot detect the underlying nonlinear relation between the predictors and target.

As for the identification of important variables in the regression model, the results of variable importance measure indicate that the main predictors are X1, X4, X2, X5, and X3. They correspond to the true predictors utilized in generating data in accordance with Friedman function. Other variables starting from the sixth one are noise and are almost useless in terms of MARS model building. The variable importance measure is reflected in the output of the final model as well. Thus, the MARS model performs optimally in terms of prediction and accurate identification of the predictors.

Exercise 7.5

library(caret)
library(AppliedPredictiveModeling)
library(earth)
library(kernlab)
library(nnet)

data(ChemicalManufacturingProcess)

chem_data <- ChemicalManufacturingProcess
chem_data <- chem_data[!is.na(chem_data$Yield), ]

nzv <- nearZeroVar(chem_data)
chem_data <- chem_data[, -nzv]

missing_rate <- sapply(chem_data, function(x) mean(is.na(x)))
chem_data <- chem_data[, missing_rate < 0.5]

set.seed(123)
trainIndex <- createDataPartition(chem_data$Yield, p = 0.8, list = FALSE)

trainData <- chem_data[trainIndex, ]
testData  <- chem_data[-trainIndex, ]

trainX <- trainData[, setdiff(names(trainData), "Yield")]
trainY <- trainData$Yield

testX <- testData[, setdiff(names(testData), "Yield")]
testY <- testData$Yield

ctrl <- trainControl(method = "cv", number = 10)

set.seed(123)

knn_model <- train(
  x = trainX,
  y = trainY,
  method = "knn",
  preProcess = c("medianImpute", "center", "scale"),
  tuneLength = 10,
  trControl = ctrl
)

mars_model <- train(
  x = trainX,
  y = trainY,
  method = "earth",
  preProcess = c("medianImpute", "center", "scale"),
  tuneLength = 10,
  trControl = ctrl
)

svm_model <- train(
  x = trainX,
  y = trainY,
  method = "svmRadial",
  preProcess = c("medianImpute", "center", "scale"),
  tuneLength = 10,
  trControl = ctrl
)

nnet_model <- train(
  x = trainX,
  y = trainY,
  method = "nnet",
  preProcess = c("medianImpute", "center", "scale"),
  tuneLength = 5,
  trControl = ctrl,
  linout = TRUE,
  trace = FALSE,
  maxit = 500
)

results <- resamples(list(
  KNN = knn_model,
  MARS = mars_model,
  SVM = svm_model,
  NeuralNet = nnet_model
))

summary(results)


Call:
summary.resamples(object = results)

Models: KNN, MARS, SVM, NeuralNet 
Number of resamples: 10 

MAE 
               Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
KNN       0.6115714 0.8091154 0.9655000 0.9828700 1.0667333 1.446714    0
MARS      0.7843782 0.9754770 1.0727365 1.0303828 1.1175051 1.234551    0
SVM       0.7658238 0.8155113 0.8539036 0.8840884 0.9586661 1.021509    0
NeuralNet 0.7886221 0.8722139 0.9557973 1.0683594 1.1640112 1.865406    0

RMSE 
               Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
KNN       0.7718007 0.977187 1.178366 1.246410 1.474923 1.857784    0
MARS      0.9712784 1.177282 1.250176 1.259343 1.365867 1.461194    0
SVM       0.8503731 1.003167 1.094046 1.099475 1.219176 1.381220    0
NeuralNet 0.8864439 1.090693 1.167982 1.327979 1.483263 2.308371    0

Rsquared 
               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
KNN       0.2486207 0.4913011 0.6061918 0.5809122 0.6834773 0.8099282    0
MARS      0.3471675 0.4603229 0.5137825 0.5414691 0.6248561 0.8398102    0
SVM       0.3406067 0.6194533 0.6955414 0.6654192 0.7414939 0.8469834    0
NeuralNet 0.2329528 0.4817500 0.6278817 0.5743561 0.6994350 0.7483293    0

dotplot(results)

models <- list(
  KNN = knn_model,
  MARS = mars_model,
  SVM = svm_model,
  NeuralNet = nnet_model
)

test_results <- lapply(models, function(model) {
  pred <- predict(model, newdata = testX)
  postResample(pred = pred, obs = testY)
})

test_results

$KNN
     RMSE  Rsquared       MAE 
1.4407772 0.3935666 1.2255625 

$MARS
     RMSE  Rsquared       MAE 
1.2792995 0.5273828 1.0512395 

$SVM
     RMSE  Rsquared       MAE 
1.2333550 0.5492522 1.0373187 

$NeuralNet
     RMSE  Rsquared       MAE 
1.8407270 0.2060201 1.5125262

Exercise 7.5 A

In terms of the resampling analysis, the SVM algorithm demonstrates the optimal performance in comparison with other nonlinear algorithms. The SVM algorithm yields the minimum value of the RMSE (approximately equal to 1.10) and the minimum value of MAE (approximately equal to 0.88), as well as the maximum value of \(R^2\) (approximately equal to 0.67).

The KNN model ranks second, as it has somewhat higher error rate and lower \(R^2\). The neural network is close to the KNN model in terms of its performance. MARS model is very similar to the SVM model; however, it demonstrates slightly worse performance in terms of the RMSE and \(R^2\) values.

Exercise 7.5 B

varImp(mars_model)

earth variable importance

                       Overall
ManufacturingProcess32  100.00
ManufacturingProcess17   48.01
ManufacturingProcess09   23.08
ManufacturingProcess02    0.00

According to variable importance obtained from the MARS model, the top predictors include ManufacturingProcess32, ManufacturingProcess17, and ManufacturingProcess09. Among all these, the ManufacturingProcess32 predictor appears to be extremely important, and its contribution towards yield prediction is significantly high compared to other variables. The ManufacturingProcess02 predictor does not contribute to the model at all.

It is clear that all the three predictors mentioned above are process variables, which implies that process variables dominate the model’s predictions. As a result, biological variables appear to have very low contributions when predicting the response of the dataset.

However, it might be possible that there is some overlap between the set of important predictors identified using MARS and the optimal linear model obtained from Exercise 6.3. Nonetheless, due to nonlinearity, the MARS model succeeds in identifying more influential process variables.

Exercise 7.5 C

top_vars <- c("ManufacturingProcess32",
              "ManufacturingProcess17",
              "ManufacturingProcess09")

par(mfrow = c(1, 3))

for (v in top_vars) {
  plot(trainX[[v]], trainY,
       xlab = v,
       ylab = "Yield",
       main = paste("Yield vs", v),
       pch = 16, col = "darkgray")
  
  lines(lowess(trainX[[v]], trainY), col = "blue", lwd = 2)
}

The plot between yield and the significant predictor ManufacturingProcess32 shows the non-linear relationship. It can be seen that there is a gradual increase with respect to yield initially and after that a sharp increase after a certain level. This means that once a certain threshold level is reached, there could be increased benefits from having a high value of ManufacturingProcess32.

Similarly, ManufacturingProcess17 also shows a non-linear relation but this time it is negative in nature. There is a decline in yield with increasing value of ManufacturingProcess17. Although there is some kind of curve in between, it is evident that there is overall decline in yield.

In the case of ManufacturingProcess09, we have positive relationship with increasing yield. As the predictor increases, so does the yield. Here, also, there is a smooth trend line in between, which confirms the overall positive trend with yield.

From these plots, one can conclude that yield is significantly affected by the non-linearity of some process variables. For example, threshold levels of some predictors and changes in slopes show the areas in which the process can benefit more from minor variations.