In Kuhn and Johnson do problems 7.2 and 7.5. There are only two but they consist of many parts. Please submit both a link to your Rpubs and the .rmd file.
Friedman (1991) introduced several benchmark data sets create by
simulation. One of these simulations used the following nonlinear
equation to create data: where the x values are random variables
uniformly distributed between [0, 1] (there are also 5 other
non-informative variables also created in the simulation). The package
mlbench contains a function called mlbench.friedman1 that simulates
these data:
Code Given(1):
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.2
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
print(featurePlot(trainingData$x, trainingData$y))
## or other methods.
## This creates a list with a vector 'y' and a matrix of predictors 'x'. Also simulate a large test set to estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
Code Given(2):
library(caret)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set perforamnce values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
Working with MARS Model
## Setting up the parameters for the MARS model allowing individual predictors, as well as limited dual combinations of them. Allowing large range for pruning.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:50)
set.seed(100)
marsFit <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneGrid = marsGrid,
preProc = c("center", "scale"),
trControl = trainControl(method = "cv"))
print(summary(marsFit$finalModel))
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
## nprune=14)
##
## coefficients
## (Intercept) 22.0339290
## h(0.507267-X1) -4.5157241
## h(X1-0.507267) 2.6841329
## h(0.325504-X2) -5.4438679
## h(-0.216741-X3) 3.3683600
## h(X3- -0.216741) 2.0371575
## h(0.953812-X4) -2.7853388
## h(X4-0.953812) 2.7366442
## h(1.17878-X5) -1.5636213
## h(X1- -0.951872) * h(X2-0.325504) -0.7790969
## h(X1-0.507267) * h(X2- -0.798188) -2.6276789
## h(0.606835-X1) * h(0.325504-X2) 2.1773145
## h(0.325504-X2) * h(X3-0.795427) 1.7739671
## h(X2-0.325504) * h(X3- -0.917499) 0.5726623
##
## Selected 14 of 21 terms, and 5 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 8 5
## GCV 1.841667 RSS 255.2757 GRSq 0.9252276 RSq 0.9476564
#Predictions
marsPred <- predict(marsFit, newdata = testData$x)
print(postResample(pred = marsPred, obs = testData$y))
## RMSE Rsquared MAE
## 1.2779993 0.9338365 1.0147070
# RMSE Rsquared MAE
#1.2433451 0.9364207 0.9824062
Working with Neural Network Model
### Neural net model, we have 1-10 neurons, keeping inline with text books decay and bagging suggestions,
set.seed(100)
netGrid <- expand.grid(.decay = c(0, 0.01, 0.1),.size = 1:10,.bag = FALSE)
nnetTune <- train(trainingData$x, trainingData$y,method = "avNNet", tuneGrid = netGrid,
trControl = trainControl(method = "cv"),
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,maxit = 500)
## Warning: executing %dopar% sequentially: no parallel backend registered
print(summary(nnetTune$bestTune))
## size decay bag
## Min. :5 Min. :0.1 Mode :logical
## 1st Qu.:5 1st Qu.:0.1 FALSE:1
## Median :5 Median :0.1
## Mean :5 Mean :0.1
## 3rd Qu.:5 3rd Qu.:0.1
## Max. :5 Max. :0.1
print(summary(nnetTune$finalModel))
## Length Class Mode
## model 5 -none- list
## repeats 1 -none- numeric
## bag 1 -none- logical
## seeds 5 -none- numeric
## names 10 -none- character
## terms 3 terms call
## coefnames 10 -none- character
## xlevels 0 -none- list
## xNames 10 -none- character
## problemType 1 -none- character
## tuneValue 3 data.frame list
## obsLevels 1 -none- logical
## param 3 -none- list
print(nnetTune$results)
## decay size bag RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.00 1 FALSE 2.392711 0.7610354 1.897330 0.5399693 0.17419578 0.4641855
## 11 0.01 1 FALSE 2.385381 0.7602926 1.887906 0.5431008 0.17543766 0.4628980
## 21 0.10 1 FALSE 2.393965 0.7596431 1.894191 0.5491281 0.17609869 0.4603040
## 2 0.00 2 FALSE 2.410532 0.7567109 1.907478 0.5537566 0.16917116 0.4919607
## 12 0.01 2 FALSE 2.425125 0.7510903 1.935991 0.5736786 0.17578732 0.4898851
## 22 0.10 2 FALSE 2.423612 0.7525959 1.935872 0.5655335 0.16649632 0.4884680
## 3 0.00 3 FALSE 2.043957 0.8224281 1.630751 0.5108291 0.12133400 0.3649992
## 13 0.01 3 FALSE 2.151209 0.8016018 1.701951 0.5304276 0.14210227 0.3833116
## 23 0.10 3 FALSE 2.169914 0.7982380 1.726854 0.5615105 0.14535072 0.4255456
## 4 0.00 4 FALSE 2.289347 0.8130639 1.749187 1.2305832 0.12997012 0.7232752
## 14 0.01 4 FALSE 2.091925 0.8154383 1.676653 0.4658305 0.10671510 0.3146718
## 24 0.10 4 FALSE 2.059080 0.8224160 1.648610 0.4180618 0.11004539 0.2603083
## 5 0.00 5 FALSE 2.445600 0.7709399 1.824446 0.6475054 0.15948604 0.4273511
## 15 0.01 5 FALSE 2.169742 0.7999255 1.738715 0.4283499 0.11256277 0.3392557
## 25 0.10 5 FALSE 1.975656 0.8394000 1.578979 0.4361593 0.10985162 0.3170231
## 6 0.00 6 FALSE 2.898295 0.7388800 2.052725 1.7653452 0.22072875 0.9189415
## 16 0.01 6 FALSE 2.262032 0.8056619 1.817195 0.3837171 0.08434748 0.2665117
## 26 0.10 6 FALSE 2.152198 0.8098015 1.693056 0.4125129 0.11092059 0.2616601
## 7 0.00 7 FALSE 3.351563 0.6644147 2.460366 1.0264160 0.12674850 0.5790942
## 17 0.01 7 FALSE 2.318301 0.7861811 1.856908 0.4915103 0.12867124 0.3401214
## 27 0.10 7 FALSE 2.161512 0.8163011 1.693526 0.4391635 0.08772441 0.2562003
## 8 0.00 8 FALSE 6.513566 0.4418645 3.563297 5.3438631 0.28363806 1.9345711
## 18 0.01 8 FALSE 2.413847 0.7772629 1.938009 0.5063174 0.11741514 0.3397851
## 28 0.10 8 FALSE 2.273716 0.7922525 1.822713 0.3808189 0.10972838 0.2184895
## 9 0.00 9 FALSE 4.484215 0.5644107 2.877950 2.3737627 0.22792454 0.9598735
## 19 0.01 9 FALSE 2.317190 0.7847500 1.857641 0.5353001 0.13951823 0.4198470
## 29 0.10 9 FALSE 2.315333 0.7811273 1.785409 0.5720570 0.13029365 0.3979768
## 10 0.00 10 FALSE 3.422545 0.6247430 2.439739 1.5522354 0.22789677 0.6384685
## 20 0.01 10 FALSE 2.480407 0.7408505 1.995656 0.5927652 0.16156925 0.4396292
## 30 0.10 10 FALSE 2.334803 0.7692182 1.872733 0.5149409 0.13810727 0.4015950
print(plot(nnetTune))
### Executing on Test Data
nnetPred <- predict(nnetTune, newdata = testData$x)
print(postResample(pred = nnetPred, obs = testData$y))
## RMSE Rsquared MAE
## 2.1113956 0.8277556 1.5739011
# RMSE Rsquared MAE
#2.1917095 0.8121672 1.6442031
Working with SVM Model
set.seed(100)
svmRTuned <- train(trainingData$x, trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
print(summary(svmRTuned$finalModel))
## Length Class Mode
## 1 ksvm S4
print(svmRTuned$results)
## sigma C RMSE Rsquared MAE RMSESD RsquaredSD
## 1 0.06509124 0.25 2.530787 0.7922715 2.013175 0.2696024 0.12291001
## 2 0.06509124 0.50 2.259539 0.8064569 1.789962 0.3313399 0.11781033
## 3 0.06509124 1.00 2.099789 0.8274242 1.656154 0.3481393 0.10198585
## 4 0.06509124 2.00 2.002943 0.8412934 1.583791 0.3311129 0.08451813
## 5 0.06509124 4.00 1.943618 0.8504425 1.546586 0.3292231 0.07589338
## 6 0.06509124 8.00 1.918711 0.8547582 1.532981 0.3162050 0.07104595
## 7 0.06509124 16.00 1.920651 0.8536189 1.536116 0.3185506 0.07191691
## 8 0.06509124 32.00 1.920651 0.8536189 1.536116 0.3185506 0.07191691
## 9 0.06509124 64.00 1.920651 0.8536189 1.536116 0.3185506 0.07191691
## 10 0.06509124 128.00 1.920651 0.8536189 1.536116 0.3185506 0.07191691
## 11 0.06509124 256.00 1.920651 0.8536189 1.536116 0.3185506 0.07191691
## 12 0.06509124 512.00 1.920651 0.8536189 1.536116 0.3185506 0.07191691
## 13 0.06509124 1024.00 1.920651 0.8536189 1.536116 0.3185506 0.07191691
## 14 0.06509124 2048.00 1.920651 0.8536189 1.536116 0.3185506 0.07191691
## MAESD
## 1 0.2661719
## 2 0.3248628
## 3 0.3121841
## 4 0.2760162
## 5 0.2546977
## 6 0.2460647
## 7 0.2495101
## 8 0.2495101
## 9 0.2495101
## 10 0.2495101
## 11 0.2495101
## 12 0.2495101
## 13 0.2495101
## 14 0.2495101
## Test Data
svmPred <- predict(svmRTuned, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0631908 0.8275736 1.5662213
# RMSE Rsquared MAE
#2.0202829 0.8327653 1.5675884
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
Here are the output results of each of my models:
Model | RMSE | R² | MAE |
---|---|---|---|
MARS | 1.2433 | 0.9364 | 0.9824 |
SVM (Radial) | 2.0203 | 0.8328 | 1.5676 |
Neural Network | 2.1917 | 0.8122 | 1.6442 |
The MARS model is by far the best with the lowest Root Mean Squared error and the highest R^2. This means this model explains about 93 percent of the variance in the target variable with the least error, as MAE is also the lowest in this model. The runner up would be the SVM model, with the neural network model coming in third place.
With respect to the MARS model, yes the model chose X variables 1 through 5. The ranking of importance of variable by influence was listed as X1, X4, X2, X5, X3. This generally matches the initially provide formula with the MARS model ignoring the variables that are not relevant.
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Training the data as in Exercise 6.3
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.2
data(ChemicalManufacturingProcess)
# Prepping the data
set.seed(100)
## IMPUTING Keeping it simple with taking the median values of each of the columns, as most columns that have NA only have one NA
imputed <- preProcess(ChemicalManufacturingProcess, method = "medianImpute")
chem_df_imputed <- predict(imputed, ChemicalManufacturingProcess)
## Establishing the x and y vars
X <- chem_df_imputed[, -which(names(chem_df_imputed) == "Yield")]
y <- chem_df_imputed$Yield
## putting together for working & spltting
ChemData <- list(x = X, y = y)
## Splitting 70/30
splitIndex <- createDataPartition(ChemData$y, p = 0.7, list = FALSE)
ChemtrainingData <- list(x = data.frame(ChemData$x[splitIndex, ]),y = ChemData$y[splitIndex])
ChemtestData <- list( x = data.frame(ChemData$x[-splitIndex, ]),y = ChemData$y[-splitIndex])
#Looking at which of the predictor variables create the best fitting non-linear model to predict yield
set.seed(100)
### Firstly the Neural Net Model
netGrid <- expand.grid(.decay = c(0, 0.01, 0.1),.size = 1:10,.bag = FALSE)
nnetTune <- train( ChemtrainingData$x,
ChemtrainingData$y,
method = "avNNet", tuneGrid = netGrid,
trControl = trainControl(method = "cv"),
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,maxit = 500)
#----Evaluation
nnetPred <- predict(nnetTune, newdata = ChemtestData$x)
print(postResample(pred = nnetPred, obs = ChemtestData$y))
## RMSE Rsquared MAE
## 1.3310979 0.4190896 1.0918633
# RMSE Rsquared MAE
#1.3745797 0.5428737 1.0879853
### MARS model
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:50)
marsFit <- train(x = ChemtrainingData$x,
y = ChemtrainingData$y,
method = "earth",
tuneGrid = marsGrid,
preProc = c("center", "scale"),
trControl = trainControl(method = "cv"))
#----Evaluation
marsPred <- predict(marsFit, newdata = ChemtestData$x)
print(postResample(pred = marsPred, obs = ChemtestData$y))
## RMSE Rsquared MAE
## 1.4072009 0.4851382 1.1332918
# RMSE Rsquared MAE
#1.2135116 0.6004976 0.9666867
#KNN Model
knnModel <- train(x = ChemtrainingData$x,
y = ChemtrainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
#----Evaluation
knnPred <- predict(knnModel, newdata = ChemtestData$x)
print(postResample(pred = knnPred, obs = ChemtestData$y))
## RMSE Rsquared MAE
## 1.2268780 0.4636449 1.0207692
# RMSE Rsquared MAE
#1.4009888 0.5174132 1.1011325
##SVM model
svmRTuned <- train(ChemtrainingData$x, ChemtrainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
#----Evaluation
svmPred <- predict(svmRTuned, newdata = ChemtestData$x)
print(postResample(pred = svmPred, obs = ChemtestData$y))
## RMSE Rsquared MAE
## 1.2175550 0.5212596 0.9650886
# RMSE Rsquared MAE
#1.1690926 0.6398953 0.8977043
Of all the models ran on the data, the SVM model is the best performing model. The R^2 from this model was 0.634 and the RMSE was 1.16. The MAE was 0.897. OVerall these were the best numbers. The second place model was the MARS model and it had an RMSE of 1.213, an r^2 of 0.600, and a MAE 0.966. Taking a look at the SVM model, the top ten most important predictors include both biological and process variables. Out of the ten, six were process variables and four were biological. This means that process variables were a bit more dominant in the “most important” predictors. When compared to the results of the linear model from 6.3, there are many more biological variables dominating in the SVM model results. However, the ManufacturingProcess32 predicotr is still number one, and the ManufacturingProcess13 is still in the top 3.
## SVM Model results for important variables
print(varImp(svmRTuned))
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.64
## BiologicalMaterial06 76.19
## ManufacturingProcess17 73.26
## ManufacturingProcess09 69.65
## BiologicalMaterial12 66.34
## BiologicalMaterial02 66.06
## ManufacturingProcess36 65.18
## ManufacturingProcess31 64.17
## BiologicalMaterial03 63.32
## ManufacturingProcess11 54.80
## ManufacturingProcess06 53.67
## BiologicalMaterial04 50.26
## ManufacturingProcess30 49.24
## ManufacturingProcess29 44.38
## BiologicalMaterial08 44.11
## BiologicalMaterial09 42.22
## BiologicalMaterial11 39.01
## ManufacturingProcess33 38.36
## ManufacturingProcess02 36.08
I looked at BiologicalMaterial06, BiologicalMaterial03 and BiologicalMaterial12. The first two of these variables were in the top ten for both model, but were much higher in the SVM model than the linear model. The third was not in the top 10 of the linear model, but was the 6th most important in the SVM model. When plotted against yield,all of these variables showed some one ambiguous, not so linear patterns. This explains the discrepency in model variable results.
library(ggplot2)
vars_to_plot <- c("BiologicalMaterial06", "BiologicalMaterial03", "BiologicalMaterial12")
for (var in vars_to_plot) {
print(
ggplot(ChemicalManufacturingProcess, aes_string(x = var, y = "Yield")) +
geom_point() +
#geom_smooth(method = "loess", se = FALSE, color = "blue") +
labs(title = paste("Yield vs", var))
)
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.