HW 8

Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two, but they have many parts. Please submit both a link to your RPubs page and the .Rmd file.

Use the doParallel library and set allowParallel = TRUE in your trainControl.
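A minimal sketch of how that parallel setup might look, assuming a local machine with at least two cores. The worker count of 2 and the object names `cl` and `ctrl` are illustrative choices, not part of the assignment.

```r
library(caret)
library(doParallel)

# Start a small local cluster; 2 workers is an arbitrary choice for illustration.
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# allowParallel = TRUE is already caret's default; stating it makes the intent explicit.
ctrl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)

# ... pass trControl = ctrl to each train() call here ...

# Release the workers and fall back to sequential execution when tuning is done.
stopCluster(cl)
registerDoSEQ()
```

Resampling iterations are what caret distributes across workers, so the benefit is largest for the big tuning grids (for example the 80-row SVM grid).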

7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)$$

where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
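The Friedman equation from the problem statement can also be written out directly in base R, which makes it easy to see which predictors carry signal. This is only a sketch: `simulate_friedman1` is a hypothetical helper, not the mlbench implementation, and it will not reproduce the exact draws of `mlbench.friedman1`.

```r
# Hypothetical helper mirroring the Friedman (1991) equation; not part of mlbench.
simulate_friedman1 <- function(n, sd = 1) {
  # Ten uniform(0, 1) predictors; only X1 to X5 enter the response.
  x <- matrix(runif(n * 10), ncol = 10,
              dimnames = list(NULL, paste0("X", 1:10)))
  signal <- 10 * sin(pi * x[, "X1"] * x[, "X2"]) +
    20 * (x[, "X3"] - 0.5)^2 +
    10 * x[, "X4"] +
    5 * x[, "X5"]
  # Add N(0, sd^2) noise to the deterministic signal.
  list(x = as.data.frame(x), y = signal + rnorm(n, mean = 0, sd = sd))
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)

# Linear correlation of each predictor with y. X6 to X10 are pure noise;
# X3 can also look weak here because its effect is quadratic around 0.5.
round(cor(sim$x, sim$y)[, 1], 2)
```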

# columns need names 
if (is.null(colnames(trainingData$x))) {
  colnames(trainingData$x) <- paste0("X", 1:ncol(trainingData$x))
}

knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10
)

print(knnModel)

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

MARS has the best performance, with a cross-validated RMSE of 1.391925 (degree = 2, nprune = 10). In the output below, the neural network's best RMSE is 1.914587 and the SVM's is 2.002586, while KNN is the weakest at roughly 3.2 to 3.3 depending on the bootstrap run. MARS also selects the informative predictors: its importance table lists only X1 to X5 (X1 100.00000, X4 75.57379, X2 49.29161, X5 15.94173, X3 0.00000).
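One caveat when ranking these RMSE values: the KNN fit uses caret's default bootstrap resampling (25 reps), while MARS, SVM, and the neural network use 10-fold cross-validation. A possible way to put every model on identical resamples is sketched below. The names `folds`, `shared_ctrl`, and `fold_results` are illustrative, and `"lm"` is only a lightweight stand-in for the heavier models.

```r
library(mlbench)
library(caret)

set.seed(200)
sim_train <- mlbench.friedman1(200, sd = 1)
sim_train$x <- data.frame(sim_train$x)

# One set of 10 training-fold indices, reused by every model.
set.seed(235)
folds <- createFolds(sim_train$y, k = 10, returnTrain = TRUE)
shared_ctrl <- trainControl(method = "cv", index = folds)

knn_fit <- train(sim_train$x, sim_train$y, method = "knn",
                 preProcess = c("center", "scale"), tuneLength = 10,
                 trControl = shared_ctrl)

# "lm" is a stand-in here; the MARS, SVM, and nnet calls would take the same trControl.
lm_fit <- train(sim_train$x, sim_train$y, method = "lm",
                trControl = shared_ctrl)

# resamples() lines up the per-fold metrics because the folds are identical.
fold_results <- resamples(list(KNN = knn_fit, LM = lm_fit))
summary(fold_results, metric = "RMSE")
```

With shared folds, differences in RMSE reflect the models rather than the particular resamples each one happened to see.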

#Tune several models on these data
knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10
)

print(knnModel)

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.596096  0.4930100  2.930570
##    7  3.459955  0.5319522  2.798379
##    9  3.368640  0.5669301  2.726710
##   11  3.357390  0.5808369  2.723971
##   13  3.355299  0.5920307  2.726627
##   15  3.343154  0.6053457  2.730079
##   17  3.313771  0.6249174  2.703343
##   19  3.305186  0.6393157  2.701323
##   21  3.306140  0.6497242  2.709582
##   23  3.328424  0.6524463  2.730748
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 19.

set.seed(235)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:10)
marsmodel <- train(trainingData$x, trainingData$y,
                   method = "earth",
                   # Explicitly declare the candidate models to  test
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))

## Loading required package: earth

## Loading required package: Formula

## Loading required package: plotmo

## Loading required package: plotrix

marsmodel

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.490017  0.2220081  3.699914
##   1        3      3.695379  0.4633081  3.037251
##   1        4      2.681257  0.7234989  2.185523
##   1        5      2.296355  0.7932909  1.833138
##   1        6      2.285594  0.7941082  1.792822
##   1        7      1.780806  0.8751494  1.384274
##   1        8      1.727727  0.8778758  1.355689
##   1        9      1.718990  0.8757342  1.357720
##   1       10      1.721599  0.8781766  1.351784
##   2        2      4.490017  0.2220081  3.699914
##   2        3      3.695379  0.4633081  3.037251
##   2        4      2.651081  0.7304574  2.139694
##   2        5      2.272867  0.7985207  1.821653
##   2        6      2.198442  0.8071944  1.723420
##   2        7      1.749857  0.8798856  1.361137
##   2        8      1.739416  0.8831931  1.342930
##   2        9      1.418495  0.9183112  1.114303
##   2       10      1.391925  0.9219279  1.109642
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 10 and degree = 2.

#checking for MARS informative predictors

library(tibble)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mars_feature_importance <- varImp(marsmodel, method = "gcv")$importance |>
    arrange(desc(Overall)) |>
    rownames_to_column(var = "Feature")

cols <- c("Feature", "Importance")

colnames(mars_feature_importance) <- cols

knitr::kable(mars_feature_importance, format = "simple")

Feature	Importance
X1	100.00000
X4	75.57379
X2	49.29161
X5	15.94173
X3	0.00000

svmGrid <- expand.grid(.sigma = seq(0.1, 2, by = 0.1), 
                       .C = 2^(2:5))

svm_model <- train(trainingData$x, trainingData$y,
                   method = "svmRadial",  
                   tuneGrid = svmGrid,
                   trControl = trainControl(method = "cv"))

print(svm_model)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   sigma  C   RMSE      Rsquared    MAE     
##   0.1     4  2.004124  0.84213903  1.624731
##   0.1     8  2.002586  0.84253307  1.621716
##   0.1    16  2.002586  0.84253307  1.621716
##   0.1    32  2.002586  0.84253307  1.621716
##   0.2     4  2.550963  0.76256554  2.027480
##   0.2     8  2.550963  0.76256554  2.027480
##   0.2    16  2.550963  0.76256554  2.027480
##   0.2    32  2.550963  0.76256554  2.027480
##   0.3     4  3.146359  0.68771692  2.482975
##   0.3     8  3.146359  0.68771692  2.482975
##   0.3    16  3.146359  0.68771692  2.482975
##   0.3    32  3.146359  0.68771692  2.482975
##   0.4     4  3.703836  0.61645636  2.965680
##   0.4     8  3.703836  0.61645636  2.965680
##   0.4    16  3.703836  0.61645636  2.965680
##   0.4    32  3.703836  0.61645636  2.965680
##   0.5     4  4.129748  0.54432652  3.344175
##   0.5     8  4.129748  0.54432652  3.344175
##   0.5    16  4.129748  0.54432652  3.344175
##   0.5    32  4.129748  0.54432652  3.344175
##   0.6     4  4.413870  0.47171027  3.597855
##   0.6     8  4.413870  0.47171027  3.597855
##   0.6    16  4.413870  0.47171027  3.597855
##   0.6    32  4.413870  0.47171027  3.597855
##   0.7     4  4.592388  0.40184650  3.752623
##   0.7     8  4.592388  0.40184650  3.752623
##   0.7    16  4.592388  0.40184650  3.752623
##   0.7    32  4.592388  0.40184650  3.752623
##   0.8     4  4.702830  0.33958770  3.846434
##   0.8     8  4.702830  0.33958770  3.846434
##   0.8    16  4.702830  0.33958770  3.846434
##   0.8    32  4.702830  0.33958770  3.846434
##   0.9     4  4.771884  0.28701461  3.906782
##   0.9     8  4.771884  0.28701461  3.906782
##   0.9    16  4.771884  0.28701461  3.906782
##   0.9    32  4.771884  0.28701461  3.906782
##   1.0     4  4.816151  0.24433823  3.946255
##   1.0     8  4.816151  0.24433823  3.946255
##   1.0    16  4.816151  0.24433823  3.946255
##   1.0    32  4.816151  0.24433823  3.946255
##   1.1     4  4.845214  0.21031635  3.972238
##   1.1     8  4.845214  0.21031635  3.972238
##   1.1    16  4.845214  0.21031635  3.972238
##   1.1    32  4.845214  0.21031635  3.972238
##   1.2     4  4.864840  0.18346956  3.989848
##   1.2     8  4.864840  0.18346956  3.989848
##   1.2    16  4.864840  0.18346956  3.989848
##   1.2    32  4.864840  0.18346956  3.989848
##   1.3     4  4.878428  0.16249191  4.002023
##   1.3     8  4.878428  0.16249191  4.002023
##   1.3    16  4.878428  0.16249191  4.002023
##   1.3    32  4.878428  0.16249191  4.002023
##   1.4     4  4.888056  0.14587578  4.010645
##   1.4     8  4.888056  0.14587578  4.010645
##   1.4    16  4.888056  0.14587578  4.010645
##   1.4    32  4.888056  0.14587578  4.010645
##   1.5     4  4.895038  0.13234071  4.016902
##   1.5     8  4.895038  0.13234071  4.016902
##   1.5    16  4.895038  0.13234071  4.016902
##   1.5    32  4.895038  0.13234071  4.016902
##   1.6     4  4.900204  0.12105815  4.021540
##   1.6     8  4.900204  0.12105815  4.021540
##   1.6    16  4.900204  0.12105815  4.021540
##   1.6    32  4.900204  0.12105815  4.021540
##   1.7     4  4.904096  0.11130649  4.025039
##   1.7     8  4.904096  0.11130649  4.025039
##   1.7    16  4.904096  0.11130649  4.025039
##   1.7    32  4.904096  0.11130649  4.025039
##   1.8     4  4.907081  0.10275351  4.028065
##   1.8     8  4.907081  0.10275351  4.028065
##   1.8    16  4.907081  0.10275351  4.028065
##   1.8    32  4.907081  0.10275351  4.028065
##   1.9     4  4.909409  0.09517436  4.030515
##   1.9     8  4.909409  0.09517436  4.030515
##   1.9    16  4.909409  0.09517436  4.030515
##   1.9    32  4.909409  0.09517436  4.030515
##   2.0     4  4.911250  0.08830632  4.032461
##   2.0     8  4.911250  0.08830632  4.032461
##   2.0    16  4.911250  0.08830632  4.032461
##   2.0    32  4.911250  0.08830632  4.032461
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.1 and C = 8.

nnGrid <- expand.grid(.size = c(5, 10, 15), 
                      .decay = c(0.1, 0.01)) 

nn_model <- train(trainingData$x, trainingData$y,
                  method = "nnet", 
                  tuneGrid = nnGrid,
                  trControl = trainControl(method = "cv"),
                  linout = TRUE, 
                  trace = FALSE)  

print(nn_model)

## Neural Network 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared   MAE     
##    5    0.01   2.438537  0.7729618  1.893834
##    5    0.10   2.188256  0.8083202  1.689090
##   10    0.01   2.736651  0.7286443  2.120302
##   10    0.10   2.043694  0.8377757  1.575420
##   15    0.01   2.443838  0.7800169  1.904620
##   15    0.10   1.914587  0.8550439  1.544981
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 15 and decay = 0.1.

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)
library(caret)
library(RANN)
library(ggplot2)
library(dplyr)

data(ChemicalManufacturingProcess)
# Initially had issues accessing the data (it came up as NULL)
yield <- ChemicalManufacturingProcess$Yield
processPredictors <- ChemicalManufacturingProcess[, -1]  

# Impute using knn
preProc <- preProcess(processPredictors, method = "knnImpute")
processPredictors_imputed <- predict(preProc, processPredictors)

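# Start with near-zero-variance (NZV) predictors: identify and remove them.
# setdiff() keeps every column when nearZeroVar() finds none.
nzv_columns <- nearZeroVar(processPredictors_imputed)
keep_columns <- setdiff(seq_len(ncol(processPredictors_imputed)), nzv_columns)
Predictors_cleaned <- processPredictors_imputed[, keep_columns]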

data_combined <- data.frame(Yield = yield, Predictors_cleaned)

trainIndex <- createDataPartition(data_combined$Yield, p = .8, 
                                  list = FALSE, 
                                  times = 1)
trainData <- data_combined[trainIndex, ]
testData <- data_combined[-trainIndex, ]

# Prepare training and test matrices
x_train <- as.matrix(trainData[, -1])  # Exclude the target column (Yield) for predictors
y_train <- trainData$Yield

x_test <- as.matrix(testData[, -1])
y_test <- testData$Yield

set.seed(324)

# KNN Model
knn_grid <- expand.grid(.k = seq(3, 15, by = 2))  # Define a range of k values

knn_model <- train(x_train, y_train,
                   method = "knn",
                   tuneGrid = knn_grid,
                   trControl = trainControl(method = "cv", number = 10))

# Results for KNN
print(knn_model)

## k-Nearest Neighbors 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 130, 129, 131, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE      
##    3  1.138516  0.6538275  0.9042020
##    5  1.172183  0.6513499  0.9564651
##    7  1.277074  0.5948033  1.0584678
##    9  1.297873  0.5668331  1.0574808
##   11  1.312362  0.5634861  1.0594674
##   13  1.333685  0.5450198  1.0845789
##   15  1.334827  0.5594116  1.0816259
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.

# MARS Model
set.seed(324)

mars_grid <- expand.grid(.degree = 1:2, .nprune = 2:20)

mars_model <- train(x_train, y_train,
                    method = "earth",
                    tuneGrid = mars_grid,
                    trControl = trainControl(method = "cv", number = 10))

print(mars_model)

## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 130, 129, 131, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.477643  0.4310503  1.1832759
##   1        3      1.321451  0.5333627  1.0592448
##   1        4      1.165102  0.6299526  0.9569337
##   1        5      1.220376  0.5898154  0.9957056
##   1        6      1.311361  0.5488447  1.0402853
##   1        7      1.316415  0.5388407  1.0432122
##   1        8      1.326002  0.5291970  1.0467461
##   1        9      1.370446  0.5083005  1.0944542
##   1       10      1.384346  0.5154384  1.0876090
##   1       11      1.394004  0.5144958  1.0923066
##   1       12      1.391703  0.5228326  1.0874111
##   1       13      1.395228  0.5226364  1.0926455
##   1       14      1.380421  0.5339409  1.0764225
##   1       15      1.593219  0.4719343  1.1645826
##   1       16      1.594854  0.4663980  1.1700563
##   1       17      1.593295  0.4690904  1.1697660
##   1       18      1.589681  0.4707103  1.1674545
##   1       19      1.595051  0.4717230  1.1715371
##   1       20      1.591921  0.4735404  1.1685802
##   2        2      1.477643  0.4310503  1.1832759
##   2        3      1.314930  0.5771918  1.0464400
##   2        4      1.165308  0.6348322  0.9520699
##   2        5      1.245112  0.5846747  0.9998722
##   2        6      1.251620  0.5875025  0.9938204
##   2        7      1.292136  0.5585730  1.0104557
##   2        8      1.284327  0.5629706  1.0153324
##   2        9      1.348968  0.5626971  1.0429043
##   2       10      1.357001  0.5664570  1.0547706
##   2       11      1.387198  0.5564458  1.0817275
##   2       12      1.438911  0.5360715  1.1125429
##   2       13      1.431899  0.5312524  1.0905656
##   2       14      1.670693  0.4977115  1.1807686
##   2       15      1.647313  0.4973974  1.1605058
##   2       16      1.685850  0.5148702  1.1930428
##   2       17      4.421883  0.4445947  1.9701953
##   2       18      4.598073  0.4442805  2.0244132
##   2       19      4.543661  0.4452384  2.0175349
##   2       20      5.506856  0.4465259  2.2595784
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 4 and degree = 1.

# SVM with Radial Basis Kernel
set.seed(324)

svm_grid <- expand.grid(.sigma = seq(0.01, 0.1, by = 0.01), .C = 2^(2:5))

svm_model <- train(x_train, y_train,
                   method = "svmRadial",
                   tuneGrid = svm_grid,
                   trControl = trainControl(method = "cv", number = 10))
print(svm_model)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 130, 129, 131, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   sigma  C   RMSE      Rsquared   MAE      
##   0.01    4  1.023055  0.7327905  0.8024418
##   0.01    8  1.026482  0.7325642  0.8085265
##   0.01   16  1.025010  0.7333342  0.8064179
##   0.01   32  1.025010  0.7333342  0.8064179
##   0.02    4  1.052246  0.7256570  0.8344373
##   0.02    8  1.046309  0.7277819  0.8283659
##   0.02   16  1.046309  0.7277819  0.8283659
##   0.02   32  1.046309  0.7277819  0.8283659
##   0.03    4  1.130665  0.6985535  0.8960079
##   0.03    8  1.130665  0.6985535  0.8960079
##   0.03   16  1.130665  0.6985535  0.8960079
##   0.03   32  1.130665  0.6985535  0.8960079
##   0.04    4  1.222891  0.6644626  0.9742151
##   0.04    8  1.222891  0.6644626  0.9742151
##   0.04   16  1.222891  0.6644626  0.9742151
##   0.04   32  1.222891  0.6644626  0.9742151
##   0.05    4  1.315910  0.6245970  1.0513053
##   0.05    8  1.315910  0.6245970  1.0513053
##   0.05   16  1.315910  0.6245970  1.0513053
##   0.05   32  1.315910  0.6245970  1.0513053
##   0.06    4  1.404357  0.5836113  1.1237833
##   0.06    8  1.404357  0.5836113  1.1237833
##   0.06   16  1.404357  0.5836113  1.1237833
##   0.06   32  1.404357  0.5836113  1.1237833
##   0.07    4  1.484331  0.5462909  1.1927214
##   0.07    8  1.484331  0.5462909  1.1927214
##   0.07   16  1.484331  0.5462909  1.1927214
##   0.07   32  1.484331  0.5462909  1.1927214
##   0.08    4  1.553301  0.5118922  1.2541863
##   0.08    8  1.553301  0.5118922  1.2541863
##   0.08   16  1.553301  0.5118922  1.2541863
##   0.08   32  1.553301  0.5118922  1.2541863
##   0.09    4  1.611123  0.4794827  1.3052716
##   0.09    8  1.611123  0.4794827  1.3052716
##   0.09   16  1.611123  0.4794827  1.3052716
##   0.09   32  1.611123  0.4794827  1.3052716
##   0.10    4  1.659486  0.4507024  1.3481776
##   0.10    8  1.659486  0.4507024  1.3481776
##   0.10   16  1.659486  0.4507024  1.3481776
##   0.10   32  1.659486  0.4507024  1.3481776
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01 and C = 4.

# Neural Network
set.seed(324)

nn_grid <- expand.grid(.size = c(5, 10, 15), .decay = c(0.1, 0.01))

nn_model <- train(x_train, y_train,
                  method = "nnet",
                  tuneGrid = nn_grid,
                  trControl = trainControl(method = "cv", number = 10),
                  linout = TRUE, 
                  trace = FALSE)  

print(nn_model)

## Neural Network 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 130, 129, 131, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE       Rsquared    MAE     
##    5    0.01    3.684185  0.16365973  2.789183
##    5    0.10    3.017182  0.21908165  2.452500
##   10    0.01   12.202386  0.11551708  7.084475
##   10    0.10   10.782109  0.07642328  7.217990
##   15    0.01    6.221714  0.22031869  4.396151
##   15    0.10   10.021429  0.14286062  5.648800
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5 and decay = 0.1.

Which nonlinear regression model gives the optimal resampling and test set performance?

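The Support Vector Machine model gives the optimal resampling and test set performance (reported RMSE of 1.16). Its cross-validated RMSE of 1.023055 (sigma = 0.01, C = 4) is the lowest of the four models, ahead of KNN (1.138516), MARS (1.165102), and the neural network (3.017182).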
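Test set performance can be checked with the same `predict()` and `postResample()` pattern used for KNN in problem 7.2. The sketch below uses a KNN fit as a stand-in; the same two calls would apply to `svm_model` and the other fits. It also assumes the preprocessing is estimated on the training rows only, which differs from the code above, where imputation happens before the split. The names `in_train`, `pp`, `x_tr`, and `x_te` are illustrative.

```r
library(AppliedPredictiveModeling)
library(caret)
library(RANN)  # needed by preProcess(method = "knnImpute")

data(ChemicalManufacturingProcess)
yield <- ChemicalManufacturingProcess$Yield
predictors <- ChemicalManufacturingProcess[, -1]

set.seed(324)
in_train <- createDataPartition(yield, p = 0.8, list = FALSE)

# Assumption: fit the preprocessing on the training rows only, then apply it to both sets.
pp <- preProcess(predictors[in_train, ], method = c("nzv", "knnImpute"))
x_tr <- predict(pp, predictors[in_train, ])
x_te <- predict(pp, predictors[-in_train, ])

knn_fit <- train(x_tr, yield[in_train], method = "knn",
                 tuneGrid = expand.grid(k = seq(3, 15, by = 2)),
                 trControl = trainControl(method = "cv", number = 10))

# Resampling estimate next to the held-out estimate.
getTrainPerf(knn_fit)
test_perf <- postResample(pred = predict(knn_fit, newdata = x_te),
                          obs = yield[-in_train])
test_perf
```

Comparing `getTrainPerf()` with `test_perf` for each model shows whether the resampling ranking holds up on rows the model never saw.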

set.seed(324)
# Extract cross-validation performance for each model
knn_cv <- getTrainPerf(knn_model)
svm_cv <- getTrainPerf(svm_model)
mars_cv <- getTrainPerf(mars_model)
nn_cv <- getTrainPerf(nn_model)


# Display cross-validation RMSE for each model
cat("KNN CV RMSE:", knn_cv$TrainRMSE, "\n")

## KNN CV RMSE: 1.138516

cat("SVM CV RMSE:", svm_cv$TrainRMSE, "\n")

## SVM CV RMSE: 1.023055

cat("MARS CV RMSE:", mars_cv$TrainRMSE, "\n")

## MARS CV RMSE: 1.165102

cat("Neural Network CV RMSE:", nn_cv$TrainRMSE, "\n")

## Neural Network CV RMSE: 3.017182

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

For both the linear and nonlinear optimal models, the process variables dominate the list; however, the SVM model has 4 biological variables in its top 10, while ridge has only 1. Six variables appear in the top 10 of both models: ManufacturingProcess32, ManufacturingProcess09, ManufacturingProcess13, ManufacturingProcess36, BiologicalMaterial03, and ManufacturingProcess17. In the top 10 for ridge only: ManufacturingProcess34, ManufacturingProcess33, ManufacturingProcess04, and ManufacturingProcess37. In the top 10 for SVM only: BiologicalMaterial06, BiologicalMaterial02, BiologicalMaterial12, and ManufacturingProcess31.
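The overlap between the two top-ten lists can be checked with base R set operations rather than by eye. In this sketch the two vectors are transcribed from the answer above, using the zero-padded column spelling of the data set as an assumption; the order within each vector is not meaningful.

```r
# Top-10 predictors per model, transcribed from the comparison above.
svm_top10 <- c("ManufacturingProcess32", "ManufacturingProcess09",
               "ManufacturingProcess13", "ManufacturingProcess36",
               "BiologicalMaterial03", "ManufacturingProcess17",
               "BiologicalMaterial06", "BiologicalMaterial02",
               "BiologicalMaterial12", "ManufacturingProcess31")
ridge_top10 <- c("ManufacturingProcess32", "ManufacturingProcess09",
                 "ManufacturingProcess13", "ManufacturingProcess36",
                 "BiologicalMaterial03", "ManufacturingProcess17",
                 "ManufacturingProcess34", "ManufacturingProcess33",
                 "ManufacturingProcess04", "ManufacturingProcess37")

shared     <- intersect(svm_top10, ridge_top10)  # in both lists
svm_only   <- setdiff(svm_top10, ridge_top10)    # unique to the SVM list
ridge_only <- setdiff(ridge_top10, svm_top10)    # unique to the ridge list

# Number of biological predictors in each list.
n_bio_svm   <- sum(grepl("^BiologicalMaterial", svm_top10))
n_bio_ridge <- sum(grepl("^BiologicalMaterial", ridge_top10))

list(shared = shared, svm_only = svm_only, ridge_only = ridge_only,
     biological = c(svm = n_bio_svm, ridge = n_bio_ridge))
```

In practice the two vectors could be pulled from the fitted objects instead of typed in, for example from the row names of the sorted importance tables.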

svm_importance <- varImp(svm_model, scale = TRUE)
plot(svm_importance, top = 10)

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-8

ridge_model <- glmnet(x_train, y_train, alpha = 0)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_lambda <- cv_ridge$lambda.min

ridge_importance <- as.data.frame(as.matrix(coef(ridge_model, s = best_lambda))[-1, , drop = FALSE])  # Remove intercept
colnames(ridge_importance) <- "Importance"
ridge_importance <- ridge_importance[order(-abs(ridge_importance$Importance)), , drop = FALSE]

top_predictors <- head(ridge_importance, 10)
ggplot(top_predictors, aes(x = reorder(rownames(top_predictors), Importance), y = Importance)) +
    geom_bar(stat = "identity", fill = "skyblue") +
    coord_flip() +
    labs(title = "Top 10 Important Predictors in Ridge Regression",
         x = "Predictor",
         y = "Importance") +
    theme_minimal()

library(ggplot2)
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following objects are masked from 'package:Matrix':
## 
##     expand, pack, unpack

ridge_importance <- ridge_importance[order(-abs(ridge_importance$Importance)), , drop = FALSE]
top_predictors <- head(ridge_importance, 3) 

top_predictor_names <- rownames(top_predictors)

subset_data <- data_combined[, c(top_predictor_names, "Yield")]

long_data <- gather(subset_data, key = "Predictor", value = "Value", -Yield)

ggplot(long_data, aes(x = Value, y = Yield)) +
    geom_point(alpha = 0.6, color = "darkgreen") +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +  
    facet_wrap(~Predictor, scales = "free") + 
    labs(title = "Relationship Between Top 3 Predictors and Yield",
         x = "Predictor Value",
         y = "Yield") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5))

## `geom_smooth()` using formula = 'y ~ x'

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

I added the plot of the top 3 ridge predictors so I can compare more easily. The first difference is the fitted line, since I used ‘lm’ smoothing for the ridge predictors and ‘loess’ for the SVM predictors. BiologicalMaterial06 shows a positive, roughly linear relationship with yield up to about 1.5, where a downward trend starts. BiologicalMaterial02 shows a positive linear relationship. BiologicalMaterial12 has much more curvature. ManufacturingProcess31 has an outlier around -15 on the x-axis that makes the fit look curved, but most of the data points lie around 0 and trend in a negative direction.

unique_predictors <- c("BiologicalMaterial06", "BiologicalMaterial02", "BiologicalMaterial12", "ManufacturingProcess31")

for (predictor in unique_predictors) {
    # Ensure the predictor exists in the data
    if (predictor %in% colnames(data_combined)) {
        # .data[[predictor]] replaces the deprecated aes_string()
        p <- ggplot(data_combined, aes(x = .data[[predictor]], y = Yield)) +
            geom_point(alpha = 0.6, color = "darkgreen") +
            geom_smooth(method = "loess", se = FALSE, color = "blue") + 
            labs(title = paste("Relationship between", predictor, "and Yield"),
                 x = predictor,
                 y = "Yield") +
            theme_minimal() +
            theme(plot.title = element_text(hjust = 0.5)) 

        print(p)  
    } else {
        message("Predictor ", predictor, " not found in data.")
    }
}

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

Marjete Vucinaj

2024-11-09