Problem 7.2 Friedman (1991) introduced several benchmark data sets created by simulation

For this exercise we train several nonlinear regression models on these simulated data.

library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(200)
trainingData <- mlbench.friedman1(200,sd = 1)
trainingData$x <- data.frame(trainingData$x)

featurePlot(trainingData$x,trainingData$y)

## Creates a list with vector y and a matrix of predictors x; also simulates a large test set to estimate the true error rate with good precision
testData <- mlbench.friedman1(5000,sd = 1)
testData$x <- data.frame(testData$x)
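
For reference, mlbench.friedman1() draws ten predictors uniformly on [0, 1], of which only the first five enter the response; the generating equation (from the mlbench documentation) is sketched below, with a small helper defined here purely for illustration.

## Friedman (1991) benchmark: y = 10*sin(pi*x1*x2) + 20*(x3 - 0.5)^2 + 10*x4 + 5*x5 + N(0, sd^2)
friedman1Mean <- function(x1, x2, x3, x4, x5) {
  10 * sin(pi * x1 * x2) + 20 * (x3 - 0.5)^2 + 10 * x4 + 5 * x5
}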

KNN Model (from the textbook)

## Tune several models on these data, starting with KNN.
knnmodel <- train(x = trainingData$x,y = trainingData$y,method = "knn",preProc = c("center","scale"),tuneLength = 10)

knnmodel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
plot(knnmodel)

KnnPred <- predict(knnmodel,newdata = testData$x)

## use postResample to get test-set performance estimates
postResample(KnnPred,testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

KNN does not appear to have done very well on this data set (test RMSE of about 3.2 and R-squared of about 0.68), so we will try other models.

MARS Model

## Taking a look at the MARS model from the textbook
library(earth)
## Warning: package 'earth' was built under R version 4.3.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.3.3
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.3.2
marsfit <- earth(trainingData$x,trainingData$y)

summary(marsfit)
## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
# decided to tweak the grid from the textbook.
set.seed(100)
marsGrid <- expand.grid(.degree = 1:2,.nprune = 2:15)
Marsmodel <- train(trainingData$x,trainingData$y,method = "earth",tuneGrid = marsGrid,trControl = trainControl(method = "cv"))
Marsmodel
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.327937  0.2544880  3.600474
##   1        3      3.572450  0.4912720  2.895811
##   1        4      2.596841  0.7183600  2.106341
##   1        5      2.370161  0.7659777  1.918669
##   1        6      2.276141  0.7881481  1.810001
##   1        7      1.766728  0.8751831  1.390215
##   1        8      1.780946  0.8723243  1.401345
##   1        9      1.665091  0.8819775  1.325515
##   1       10      1.663804  0.8821283  1.327657
##   1       11      1.657738  0.8822967  1.331730
##   1       12      1.653784  0.8827903  1.331504
##   1       13      1.648496  0.8823663  1.316407
##   1       14      1.639073  0.8841742  1.312833
##   1       15      1.639073  0.8841742  1.312833
##   2        2      4.327937  0.2544880  3.600474
##   2        3      3.572450  0.4912720  2.895811
##   2        4      2.661826  0.7070510  2.173471
##   2        5      2.404015  0.7578971  1.975387
##   2        6      2.243927  0.7914805  1.783072
##   2        7      1.856336  0.8605482  1.435682
##   2        8      1.754607  0.8763186  1.396841
##   2        9      1.603578  0.8938666  1.261361
##   2       10      1.492421  0.9084998  1.168700
##   2       11      1.317350  0.9292504  1.033926
##   2       12      1.304327  0.9320133  1.019108
##   2       13      1.277510  0.9323681  1.002927
##   2       14      1.269626  0.9350024  1.003346
##   2       15      1.266217  0.9359400  1.013893
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
plot(Marsmodel)

Looking at the model summary, we see that X1 through X5, which the textbook identifies as the informative predictors, were used in the model (X6 also enters with a small coefficient). The final values chosen by the tuned model are nprune = 15 and degree = 2.

Marspred <- predict(Marsmodel,testData$x)

# use postResample on the test set
postResample(Marspred,testData$y)
##      RMSE  Rsquared       MAE 
## 1.1589948 0.9460418 0.9250230
varImp(Marsmodel)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.24
## X2   48.73
## X5   15.52
## X3    0.00

The varImp output shows that X1 had the highest overall importance, followed by X4 and then X2.

Neural Network Model

We will now fit a basic neural network model. The code below mimics the computing section from the textbook.

## First we check for highly correlated predictors and create our grid

## The predictors are not highly correlated with each other, so the result below is empty
tooCorrelated <- findCorrelation(cor(trainingData$x),cutoff = .75)
tooCorrelated
## integer(0)

Since the predictors aren’t highly correlated, we can skip the step in the textbook where the correlated predictors are filtered out and just set up the tuning grid (a sketch of that filtering step is shown below for reference).
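
Had any predictors been flagged, the filtering step would simply drop those columns from both sets before training; a minimal sketch, not needed here because tooCorrelated is empty:

## only required if tooCorrelated were non-empty
# trainingData$x <- trainingData$x[,-tooCorrelated]
# testData$x <- testData$x[,-tooCorrelated]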

## Create a specific set of candidate models to evaluate

## The textbook suggests weight decay values between 0 and 0.1; size ranges from 1 to 10 here to play around with it
nnetGrid <- expand.grid(size = c(1:10),decay = c(0,.01,.1))

# stick with 10 fold cross validation again
set.seed(100)
ctrl1 <- trainControl(method = "cv",number = 10)
## Create the model with the train function from caret
# up to 10 hidden units are tried and the maximum number of iterations is 500
## The .bag parameter from the textbook belongs to the model-averaged avNNet, not to method = "nnet", so it is omitted here
set.seed(100)
nnetTune <- train(trainingData$x,trainingData$y,method = "nnet",tuneGrid = nnetGrid,trControl = ctrl1,preProc = c("center","scale"),linout = TRUE,trace = FALSE,MaxNWts = 10 * (ncol(trainingData$x)+ 1) + 10 + 1, maxit = 500)
nnetTune
## Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared   MAE     
##    1    0.00   2.540546  0.7254252  2.008197
##    1    0.01   2.385425  0.7602782  1.887777
##    1    0.10   2.393895  0.7596503  1.894167
##    2    0.00   2.566314  0.7212618  2.052922
##    2    0.01   2.620311  0.7133931  2.039737
##    2    0.10   2.592617  0.7216119  2.105093
##    3    0.00   2.302390  0.7766520  1.829400
##    3    0.01   2.278686  0.7724293  1.803214
##    3    0.10   2.472966  0.7519439  1.977240
##    4    0.00   2.433270  0.7497128  1.862868
##    4    0.01   2.485616  0.7430911  2.025216
##    4    0.10   2.358607  0.7772165  1.880072
##    5    0.00   2.515220  0.7472471  2.029753
##    5    0.01   2.430803  0.7607865  1.932468
##    5    0.10   2.311168  0.7845703  1.830923
##    6    0.00   6.784753  0.5364566  3.582875
##    6    0.01   2.613883  0.7309686  2.115403
##    6    0.10   2.534052  0.7478124  2.000938
##    7    0.00   6.272111  0.4964439  3.730671
##    7    0.01   3.152634  0.6475999  2.502389
##    7    0.10   2.576842  0.7834795  2.042534
##    8    0.00   9.132172  0.4711305  5.258124
##    8    0.01   2.990065  0.6809087  2.372930
##    8    0.10   2.911291  0.6864066  2.328065
##    9    0.00   8.411991  0.4618170  4.241654
##    9    0.01   3.173268  0.6966041  2.567788
##    9    0.10   2.903788  0.6864446  2.313660
##   10    0.00   4.954073  0.4182651  3.602381
##   10    0.01   3.508813  0.5912309  2.787225
##   10    0.10   2.973902  0.6608165  2.360836
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.01.
plot(nnetTune)

The model chose a weight decay of 0.01 with 3 hidden units to get the lowest RMSE.

NNpred <- predict(nnetTune,testData$x)

postResample(NNpred,testData$y)
##      RMSE  Rsquared       MAE 
## 1.9478991 0.8496598 1.4767314

On the test data the neural network did worse than MARS but better than KNN.
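
For reference, the textbook's model-averaged network uses method = "avNNet", where a bag parameter does belong in the tuning grid; a hedged sketch of that call (not run here, and the grid values are my own choice):

## sketch only: averaged neural networks as in the textbook's computing section
# nnetAvGrid <- expand.grid(size = 1:10, decay = c(0, .01, .1), bag = FALSE)
# avNNetTune <- train(trainingData$x, trainingData$y, method = "avNNet",
#                     tuneGrid = nnetAvGrid, trControl = ctrl1,
#                     preProc = c("center", "scale"), linout = TRUE, trace = FALSE,
#                     MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1, maxit = 500)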


SVM model

Finally, we attempt to build support vector machine models. I will try SVMs with different kernels and keep the one that performs best.

SVMRadial

Here I will build an SVM with the radial kernel, a tuneLength of 12, and standard 10-fold cross-validation.

set.seed(100)
svmRtuned1 <- train(trainingData$x,trainingData$y,method = "svmRadial",preProc = c("center","scale"),tuneLength = 12,trControl = ctrl1)
svmRtuned1
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  2.530787  0.7922715  2.013175
##     0.50  2.259539  0.8064569  1.789962
##     1.00  2.099789  0.8274242  1.656154
##     2.00  2.002943  0.8412934  1.583791
##     4.00  1.943618  0.8504425  1.546586
##     8.00  1.918711  0.8547582  1.532981
##    16.00  1.920651  0.8536189  1.536116
##    32.00  1.920651  0.8536189  1.536116
##    64.00  1.920651  0.8536189  1.536116
##   128.00  1.920651  0.8536189  1.536116
##   256.00  1.920651  0.8536189  1.536116
##   512.00  1.920651  0.8536189  1.536116
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06509124
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06509124 and C = 8.
plot(svmRtuned1)

The plot and the summary show that the lowest RMSE occurs at sigma ≈ 0.065 and cost C = 8, and that higher costs give the same RMSE and R-squared values.
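
That constant sigma is caret's analytical estimate, computed with kernlab's sigest() on the training predictors; a quick way to see the range it is drawn from (an extra check, not part of the original output):

## sigest() returns low/median/high estimates of the kernel width sigma
library(kernlab)
sigest(as.matrix(trainingData$x))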

svmRtunedpred <- predict(svmRtuned1,testData$x)

postResample(svmRtunedpred,testData$y)
##      RMSE  Rsquared       MAE 
## 2.0631908 0.8275736 1.5662213

The SVM with the radial kernel did reasonably well on the test data (RMSE ≈ 2.06, R-squared ≈ 0.83).

SVMLinear

Here I will build an SVM with the linear kernel, using the same tuneLength and cross-validation settings.

set.seed(100)
svmRtuned2 <- train(trainingData$x,trainingData$y,method = "svmLinear",preProc = c("center","scale"),tuneLength = 12,trControl = ctrl1)
svmRtuned2
## Support Vector Machines with Linear Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.414092  0.7548203  1.965221
## 
## Tuning parameter 'C' was held constant at a value of 1

There is no tuning plot for the linear kernel because C was held constant, but the resampling summary shows it performed worse than the radial kernel, with a higher RMSE and a lower R-squared.
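
Note that svmLinear exposes only the cost C as a tuning parameter, and with the default grid it stays fixed at 1 regardless of tuneLength; to actually tune the cost one could pass an explicit grid, e.g. (a sketch, not run):

## sketch: supply a cost grid so C is actually tuned
# svmLinGrid <- data.frame(C = 2^(-2:7))
# svmLtuned <- train(trainingData$x, trainingData$y, method = "svmLinear",
#                    preProc = c("center","scale"), tuneGrid = svmLinGrid, trControl = ctrl1)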

svmRtunedpred2 <- predict(svmRtuned2,testData$x)

postResample(svmRtunedpred2,testData$y)
##      RMSE  Rsquared       MAE 
## 2.7633860 0.6973384 2.0970616

The SVM with the linear kernel also performed worse on the test data.


The SVM with a polynomial kernel was taking a very long time to fit, so I omitted it here (a sketch of how it could be set up follows). Comparing the other kernels, the SVM with the default radial kernel performed better than the linear kernel, so for SVM we will stick with the radial kernel.
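
A minimal sketch of how the polynomial kernel could be set up, in case one wanted to revisit it (the grid values are my own guess and the call is left unevaluated):

## sketch: polynomial kernel tuned over degree, scale, and cost
# svmPGrid <- expand.grid(degree = 1:2, scale = c(0.01, 0.005), C = 2^(-2:5))
# svmPtuned <- train(trainingData$x, trainingData$y, method = "svmPoly",
#                    preProc = c("center","scale"), tuneGrid = svmPGrid, trControl = ctrl1)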


Comparing Values

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Tabble <- bind_rows(knn = postResample(KnnPred,testData$y),NN = postResample(NNpred,testData$y),SVMRadial = postResample(svmRtunedpred,testData$y),Mars = postResample(Marspred,testData$y))
Tabble$id = c("knn","Neural Network","Support Vector Machine","MARS")

Tabble
## # A tibble: 4 × 4
##    RMSE Rsquared   MAE id                    
##   <dbl>    <dbl> <dbl> <chr>                 
## 1  3.20    0.682 2.57  knn                   
## 2  1.95    0.850 1.48  Neural Network        
## 3  2.06    0.828 1.57  Support Vector Machine
## 4  1.16    0.946 0.925 MARS

Looking at the various nonlinear models created in this exercise, the MARS model performed best, with the lowest RMSE and the highest R-squared on the test set, so we would go with the MARS model.


Problem 7.5 (revisiting the chemical manufacturing data from Exercise 6.3)

For this exercise we will reuse the same data cleaning and preprocessing from the previous homework, but apply different (nonlinear) models.

Data Preprocessing

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3
data("ChemicalManufacturingProcess")
## we have to find the columns with missing values

na_counts <- colSums(is.na(ChemicalManufacturingProcess))

cols_w_na <- names(na_counts[na_counts > 0])

cols_w_na
##  [1] "ManufacturingProcess01" "ManufacturingProcess02" "ManufacturingProcess03"
##  [4] "ManufacturingProcess04" "ManufacturingProcess05" "ManufacturingProcess06"
##  [7] "ManufacturingProcess07" "ManufacturingProcess08" "ManufacturingProcess10"
## [10] "ManufacturingProcess11" "ManufacturingProcess12" "ManufacturingProcess14"
## [13] "ManufacturingProcess22" "ManufacturingProcess23" "ManufacturingProcess24"
## [16] "ManufacturingProcess25" "ManufacturingProcess26" "ManufacturingProcess27"
## [19] "ManufacturingProcess28" "ManufacturingProcess29" "ManufacturingProcess30"
## [22] "ManufacturingProcess31" "ManufacturingProcess33" "ManufacturingProcess34"
## [25] "ManufacturingProcess35" "ManufacturingProcess36" "ManufacturingProcess40"
## [28] "ManufacturingProcess41"

It appears the missing values are all in the ManufacturingProcess columns.

## Check each column and impute it 

trans <- preProcess(ChemicalManufacturingProcess,method = "knnImpute")

We use the preProcess function and apply knnImpute, following Section 3.9 of the textbook.

## Applying the transformation with predict() is how the imputed values are actually produced
imp <- predict(trans,newdata = ChemicalManufacturingProcess)

head(imp)
##        Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 -1.1792673           -0.2261036           -1.5140979          -2.68303622
## 2  1.2263678            2.2391498            1.3089960          -0.05623504
## 3  1.0042258            2.2391498            1.3089960          -0.05623504
## 4  0.6737219            2.2391498            1.3089960          -0.05623504
## 5  1.2534583            1.4827653            1.8939391           1.13594780
## 6  1.8386128           -0.4081962            0.6620886          -0.59859075
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1            0.2201765            0.4941942           -1.3828880
## 2            1.2964386            0.4128555            1.1290767
## 3            1.2964386            0.4128555            1.1290767
## 4            1.2964386            0.4128555            1.1290767
## 5            0.9414412           -0.3734185            1.5348350
## 6            1.5894524            1.7305423            0.6192092
##   BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1           -0.1313107            -1.233131           -3.3962895
## 2           -0.1313107             2.282619           -0.7227225
## 3           -0.1313107             2.282619           -0.7227225
## 4           -0.1313107             2.282619           -0.7227225
## 5           -0.1313107             1.071310           -0.1205678
## 6           -0.1313107             1.189487           -1.7343424
##   BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1            1.1005296            -1.838655           -1.7709224
## 2            1.1005296             1.393395            1.0989855
## 3            1.1005296             1.393395            1.0989855
## 4            1.1005296             1.393395            1.0989855
## 5            0.4162193             0.136256            1.0989855
## 6            1.6346255             1.022062            0.7240877
##   ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1              0.2154105              0.5662872              0.3765810
## 2             -6.1497028             -1.9692525              0.1979962
## 3             -6.1497028             -1.9692525              0.1087038
## 4             -6.1497028             -1.9692525              0.4658734
## 5             -0.2784345             -1.9692525              0.1087038
## 6              0.4348971             -1.9692525              0.5551658
##   ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1              0.5655598            -0.44593467             -0.5414997
## 2             -2.3669726             0.99933318              0.9625383
## 3             -3.1638563             0.06246417             -0.1117745
## 4             -3.3232331             0.42279841              2.1850322
## 5             -2.2075958             0.84537219             -0.6304083
## 6             -1.2513352             0.49486525              0.5550403
##   ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1             -0.1596700             -0.3095182             -1.7201524
## 2             -0.9580199              0.8941637              0.5883746
## 3              1.0378549              0.8941637             -0.3815947
## 4             -0.9580199             -1.1119728             -0.4785917
## 5              1.0378549              0.8941637             -0.4527258
## 6              1.0378549              0.8941637             -0.2199332
##   ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1            -0.07700901            -0.09157342             -0.4806937
## 2             0.52297397             1.08204765             -0.4806937
## 3             0.31428424             0.55112383             -0.4806937
## 4            -0.02483658             0.80261406             -0.4806937
## 5            -0.39004361             0.10403009             -0.4806937
## 6             0.28819802             1.41736795             -0.4806937
##   ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1             0.97711512              0.8093999              1.1846438
## 2            -0.50030980              0.2775205              0.9617071
## 3             0.28765016              0.4425865              0.8245152
## 4             0.28765016              0.7910592              1.0817499
## 5             0.09066017              2.5334227              3.3282665
## 6            -0.50030980              2.4050380              3.1396277
##   ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1              0.3303945              0.9263296              0.1505348
## 2              0.1455765             -0.2753953              0.1559773
## 3              0.1455765              0.3655246              0.1831898
## 4              0.1967569              0.3655246              0.1695836
## 5              0.4754056             -0.3555103              0.2076811
## 6              0.6261033             -0.7560852              0.1423710
##   ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1              0.4563798              0.3109942              0.2109804
## 2              1.5095063              0.1849230              0.2109804
## 3              1.0926437              0.1849230              0.2109804
## 4              0.9829430              0.1562704              0.2109804
## 5              1.6192070              0.2938027             -0.6884239
## 6              1.9044287              0.3998171             -0.5599376
##   ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1             0.05833309              0.8317688              0.8907291
## 2            -0.72230090             -1.8147683             -1.0060115
## 3            -0.42205706             -1.2132826             -0.8335805
## 4            -0.12181322             -0.6117969             -0.6611496
## 5             0.77891831              0.5911745              1.5804530
## 6             1.07916216             -1.2132826             -1.3508734
##   ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1              0.1200183              0.1256347              0.3460352
## 2              0.1093082              0.1966227              0.1906613
## 3              0.1842786              0.2159831              0.2104362
## 4              0.1708910              0.2052273              0.1906613
## 5              0.2726365              0.2912733              0.3432102
## 6              0.1146633              0.2417969              0.3516852
##   ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1              0.7826636              0.5943242              0.7566948
## 2              0.8779201              0.8347250              0.7566948
## 3              0.8588688              0.7746248              0.2444430
## 4              0.8588688              0.7746248              0.2444430
## 5              0.8969714              0.9549255             -0.1653585
## 6              0.9160227              1.0150257              0.9615956
##   ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1             -0.1952552             -0.4568829              0.9890307
## 2             -0.2672523              1.9517531              0.9890307
## 3             -0.1592567              2.6928719              0.9890307
## 4             -0.1592567              2.3223125              1.7943843
## 5             -0.1412574              2.3223125              2.5997378
## 6             -0.3572486              2.6928719              2.5997378
##   ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1             -1.7202722            -0.88694718             -0.6557774
## 2              1.9568096             1.14638329             -0.6557774
## 3              1.9568096             1.23880740             -1.8000420
## 4              0.1182687             0.03729394             -1.8000420
## 5              0.1182687            -2.55058120             -2.9443066
## 6              0.1182687            -0.51725073             -1.8000420
##   ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1             -1.1540243              0.7174727              0.2317270
## 2              2.2161351             -0.8224687              0.2317270
## 3             -0.7046697             -0.8224687              0.2317270
## 4              0.4187168             -0.8224687              0.2317270
## 5             -1.8280562             -0.8224687              0.2981503
## 6             -1.3787016             -0.8224687              0.2317270
##   ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1             0.05969714            -0.06900773             0.20279570
## 2             2.14909691             2.34626280            -0.05472265
## 3            -0.46265281            -0.44058781             0.40881037
## 4            -0.46265281            -0.44058781            -0.31224099
## 5            -0.46265281            -0.44058781            -0.10622632
## 6            -0.46265281            -0.44058781             0.15129203
##   ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1             2.40564734            -0.01588055             0.64371849
## 2            -0.01374656             0.29467248             0.15220242
## 3             0.10146268            -0.01588055             0.39796046
## 4             0.21667191            -0.01588055            -0.09355562
## 5             0.21667191            -0.32643359            -0.09355562
## 6             1.48397347            -0.01588055            -0.33931365

All of the values look transformed because the knnImpute method in preProcess also centers and scales the data; the missing values themselves were imputed with KNN (the preProcess default is k = 5 neighbors).
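
A quick sanity check that the imputation left no missing values behind (an extra line, not in the original output):

## should be zero after knnImpute
sum(is.na(imp))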

## Separate the response (Yield) from the predictors before splitting
impnoY <- imp %>%
  select(-Yield)
set.seed(1)
trainRow <- createDataPartition(imp$Yield, p=0.8, list=FALSE)
imp.train <- impnoY[trainRow, ]
Yield.train <- imp[trainRow,]$Yield
imp.test <- impnoY[-trainRow, ]
Yield.test <- imp[-trainRow,]$Yield

A. Train Different Nonlinear Models

KNN Model

## Note: training produced a zero-variance predictor warning for BiologicalMaterial07
knnmodel2 <- train(x = imp.train,y = Yield.train,method = "knn",preProc = c("center","scale"),tuneLength = 10)

knnmodel2
## k-Nearest Neighbors 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.8291537  0.3668913  0.6533378
##    7  0.8211288  0.3769018  0.6488012
##    9  0.8080362  0.3954928  0.6430315
##   11  0.8027950  0.4033675  0.6446828
##   13  0.7954209  0.4188734  0.6373942
##   15  0.8003301  0.4156814  0.6410707
##   17  0.8041667  0.4135066  0.6434692
##   19  0.8080031  0.4134072  0.6473769
##   21  0.8120783  0.4125170  0.6503615
##   23  0.8174328  0.4090511  0.6541833
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
plot(knnmodel2)

We see that thirteen nearest neighbors gives the lowest RMSE. Based on the resampled training results, KNN does not appear to fit these data especially well.

knnpred3 <- predict(knnmodel2,imp.test)
postResample(knnpred3,Yield.test)
##      RMSE  Rsquared       MAE 
## 0.5931631 0.6003265 0.5191605

KNN performed better on the test data than in resampling, suggesting the model generalizes reasonably well to new data.


Neural Network

## First we remove correlated predictors and create our grid

tooCorrelated <- findCorrelation(cor(imp.train),cutoff = .75)
tooCorrelated
##  [1]  2  6  8  1 12  4 44 27 41 42 21 26 25 54 57 56 38 37 43 30 52

Some of the predictors are highly correlated with each other, so we remove them from our training and test sets.

## We can create new train and test sets according to the textbook
trainxnet <- imp.train[,-tooCorrelated]
testxnet <- imp.test[,-tooCorrelated]

And now we can create our model

## Create a specific set of candidate models to evaluate

## The textbook suggests weight decay values between 0 and 0.1; size ranges from 1 to 5 here to play around with it
nnetGrid2 <- expand.grid(size = c(1:5),decay = c(0,.01,.1))

# stick with 10 fold cross validation
set.seed(102)
ctrl2 <- trainControl(method = "cv",number = 10)
## Create the model with the train function from caret
# up to 5 hidden units are tried and the maximum number of iterations is 500
## As before, the .bag parameter belongs to avNNet and is omitted for method = "nnet"
set.seed(102)
nnetTune2 <- train(trainxnet,Yield.train,method = "nnet",tuneGrid = nnetGrid2,trControl = ctrl2,preProc = c("center","scale"),linout = TRUE,trace = FALSE,MaxNWts = 10 * (ncol(trainxnet)+ 1) + 10 + 1, maxit = 500)
nnetTune2
## Neural Network 
## 
## 144 samples
##  36 predictor
## 
## Pre-processing: centered (36), scaled (36) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 130, 130, 128, 129, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE       Rsquared   MAE      
##   1     0.00   0.9208304  0.3632362  0.7653970
##   1     0.01   0.8638859  0.3767211  0.6847736
##   1     0.10   0.7981602  0.4796208  0.6396534
##   2     0.00   0.9369980  0.3752629  0.7527670
##   2     0.01   0.9366374  0.4003136  0.7079207
##   2     0.10   0.8862205  0.4525297  0.6856467
##   3     0.00   0.8853312  0.4331014  0.7078507
##   3     0.01   1.1003103  0.3732871  0.8775116
##   3     0.10   0.9017420  0.4690088  0.7068700
##   4     0.00   1.1458739  0.3571929  0.9214667
##   4     0.01   1.0886435  0.3867733  0.8657484
##   4     0.10   0.9806198  0.4010995  0.7518177
##   5     0.00   1.3230848  0.2166120  1.0446854
##   5     0.01   1.0292122  0.3139016  0.8082850
##   5     0.10   0.8210496  0.4731805  0.6478778
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.1.
plot(nnetTune2)

The final parameters giving the lowest RMSE were size = 1 and weight decay = 0.1.

NNpred2 <-predict(nnetTune2,testxnet)
postResample(NNpred2,Yield.test)
##      RMSE  Rsquared       MAE 
## 0.6999993 0.4787150 0.5919719

The neural network with these settings performed worse on the test set than the KNN model did.


MARS

## Create a grid, use up to three degrees..
marsgrid2 <- expand.grid(.degree = 1:3,.nprune = 2:30)

set.seed(102)
marsTuned <- train(imp.train,Yield.train,method = "earth",trControl = trainControl(method = "cv"),tuneGrid = marsgrid2)
marsTuned
## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 130, 130, 128, 129, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE      
##   1        2      0.7401828  0.5076835  0.5859174
##   1        3      0.6700539  0.6143070  0.5478887
##   1        4      0.6336196  0.6339889  0.5213229
##   1        5      0.6380850  0.6223464  0.5330441
##   1        6      0.6493623  0.6225629  0.5253757
##   1        7      0.6781870  0.5902393  0.5623102
##   1        8      0.6670887  0.5997152  0.5330688
##   1        9      0.6570267  0.6124918  0.5262138
##   1       10      0.6388756  0.6317758  0.5094791
##   1       11      0.6552031  0.6037919  0.5232743
##   1       12      0.6577968  0.6088902  0.5296082
##   1       13      0.6690541  0.6000783  0.5423353
##   1       14      0.6689753  0.5993107  0.5396548
##   1       15      0.6777116  0.5914683  0.5506578
##   1       16      0.6843519  0.5875380  0.5539098
##   1       17      0.6847553  0.5897997  0.5561409
##   1       18      0.6830386  0.5906434  0.5553786
##   1       19      0.6858592  0.5875657  0.5591931
##   1       20      0.6832162  0.5870476  0.5605455
##   1       21      0.6855152  0.5834529  0.5660780
##   1       22      0.6855152  0.5834529  0.5660780
##   1       23      0.6855152  0.5834529  0.5660780
##   1       24      0.6855152  0.5834529  0.5660780
##   1       25      0.6855152  0.5834529  0.5660780
##   1       26      0.6855152  0.5834529  0.5660780
##   1       27      0.6855152  0.5834529  0.5660780
##   1       28      0.6855152  0.5834529  0.5660780
##   1       29      0.6855152  0.5834529  0.5660780
##   1       30      0.6855152  0.5834529  0.5660780
##   2        2      0.7401828  0.5076835  0.5859174
##   2        3      0.6858121  0.5699007  0.5520895
##   2        4      0.6710553  0.5797737  0.5440391
##   2        5      0.6822840  0.5778308  0.5530664
##   2        6      0.6845224  0.5639186  0.5548198
##   2        7      0.7442537  0.5219663  0.5662516
##   2        8      0.7586647  0.5112531  0.5762678
##   2        9      0.8138133  0.4577877  0.5959540
##   2       10      0.8431044  0.4171317  0.6106736
##   2       11      0.8217784  0.4393290  0.5945527
##   2       12      0.8017329  0.4579097  0.5893465
##   2       13      1.1532784  0.4466222  0.7026523
##   2       14      1.1544112  0.4427076  0.7034843
##   2       15      1.1460628  0.4506502  0.7035306
##   2       16      1.1397684  0.4590504  0.7018349
##   2       17      1.1387850  0.4796401  0.6967700
##   2       18      1.1673808  0.4660005  0.7139729
##   2       19      1.1705725  0.4691587  0.7219652
##   2       20      1.1802355  0.4645687  0.7273957
##   2       21      1.1802355  0.4645687  0.7273957
##   2       22      1.1802355  0.4645687  0.7273957
##   2       23      1.1802355  0.4645687  0.7273957
##   2       24      1.1802355  0.4645687  0.7273957
##   2       25      1.1802355  0.4645687  0.7273957
##   2       26      1.1802355  0.4645687  0.7273957
##   2       27      1.1802355  0.4645687  0.7273957
##   2       28      1.1802355  0.4645687  0.7273957
##   2       29      1.1802355  0.4645687  0.7273957
##   2       30      1.1802355  0.4645687  0.7273957
##   3        2      0.7401828  0.5076835  0.5859174
##   3        3      0.6521967  0.6021238  0.5292473
##   3        4      0.6465669  0.6188624  0.5236559
##   3        5      0.6709029  0.5711027  0.5551302
##   3        6      0.6761283  0.5651955  0.5720745
##   3        7      0.6763515  0.5786132  0.5566866
##   3        8      0.6621672  0.5995342  0.5372266
##   3        9      0.6634479  0.6006042  0.5373258
##   3       10      0.6734689  0.5910775  0.5378345
##   3       11      0.6654518  0.5918416  0.5321995
##   3       12      0.6884503  0.5758691  0.5560502
##   3       13      0.7053613  0.5595778  0.5637820
##   3       14      0.7096525  0.5615177  0.5685833
##   3       15      0.7334520  0.5394750  0.5862697
##   3       16      0.7774417  0.5261401  0.6039763
##   3       17      0.8292614  0.5185614  0.6346414
##   3       18      0.8283822  0.5243659  0.6320672
##   3       19      0.8285162  0.5213362  0.6333991
##   3       20      0.8354142  0.5185224  0.6428146
##   3       21      0.8298619  0.5242205  0.6424197
##   3       22      0.8446325  0.5104117  0.6562010
##   3       23      0.8454015  0.5102857  0.6539848
##   3       24      0.8347014  0.5121538  0.6502187
##   3       25      0.8384520  0.5077728  0.6523437
##   3       26      1.0147562  0.4619587  0.7194270
##   3       27      1.0096110  0.4613875  0.7194332
##   3       28      0.9787094  0.4631364  0.7072870
##   3       29      0.9817739  0.4590483  0.7092167
##   3       30      0.9817739  0.4590483  0.7092167
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 4 and degree = 1.
plot(marsTuned)

We see various fluctuations across the different degrees, but ultimately the lowest RMSE comes from a first-degree model with nprune = 4 terms.

maarsPred <- predict(marsTuned,imp.test)
postResample(maarsPred,Yield.test)
##      RMSE  Rsquared       MAE 
## 0.6321206 0.5108776 0.5290423

The MARS model did slightly worse than the KNN model on the test set, but better than the neural network.

SVM (Radial Kernel)

I am going to stick with the default radial kernel for our SVM model.

set.seed(123)
svmRTune <- train(imp.train,Yield.train,method = "svmRadial",preProc = c("center","scale"),tuneLength = 10,trControl = trainControl(method = "cv"))
svmRTune
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 129, 129, 130, 128, 131, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE       Rsquared   MAE      
##     0.25  0.7919846  0.4880654  0.6359145
##     0.50  0.7241610  0.5438409  0.5798627
##     1.00  0.6657428  0.6114859  0.5334832
##     2.00  0.6353959  0.6311046  0.5076299
##     4.00  0.6296930  0.6363929  0.4999562
##     8.00  0.6291753  0.6396011  0.4962631
##    16.00  0.6292338  0.6397384  0.4963570
##    32.00  0.6292338  0.6397384  0.4963570
##    64.00  0.6292338  0.6397384  0.4963570
##   128.00  0.6292338  0.6397384  0.4963570
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01460699
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01460699 and C = 8.
plot(svmRTune)

As with the MARS model, training produced a warning about a single zero-variance predictor, but the final model chose a cost of C = 8 with sigma ≈ 0.0146.

svmpred <- predict(svmRTune,imp.test)
postResample(svmpred,Yield.test)
##      RMSE  Rsquared       MAE 
## 0.5337424 0.6735164 0.4331323

The SVM performed well on the test set too; the R-squared is fairly high (≈ 0.67).


Comparing Values

Tablee2 <- bind_rows(knn = postResample(knnpred3,Yield.test),NN = postResample(NNpred2,Yield.test),mars = postResample(maarsPred,Yield.test),svm = postResample(svmpred,Yield.test))
Tablee2$id = c("KNN","NN","MARS","SVM")
Tablee2
## # A tibble: 4 × 4
##    RMSE Rsquared   MAE id   
##   <dbl>    <dbl> <dbl> <chr>
## 1 0.593    0.600 0.519 KNN  
## 2 0.700    0.479 0.592 NN   
## 3 0.632    0.511 0.529 MARS 
## 4 0.534    0.674 0.433 SVM

Looking at the table, the SVM with the default radial kernel performed best on the test set.


B. Variable Importance

plot(varImp(svmRTune),top = 20)

According to the variable importance plot, the ManufacturingProcess predictors play a big part in predicting yield. Among the top ten we see a mix of ManufacturingProcess and BiologicalMaterial predictors. Comparing the top ten predictors from this nonlinear model with the linear model from the previous homework, the nonlinear model gives a fairly even mix of process and biological predictors, whereas the linear model placed greater importance on the manufacturing process variables than on the biological materials.
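
As a cross-check within this exercise, the importance rankings of the other nonlinear fits can be inspected the same way; for example (MARS only scores the predictors it actually retained):

## compare with the MARS model's importance ranking
varImp(marsTuned)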

C. Explore The Relationships.

Looking at the top predictors, the variable importance plot suggests focusing on ManufacturingProcess32, 13, and 36, along with BiologicalMaterial06 and 03, in order to maximize yield. This also suggests that the SVM captured patterns the linear model could not, and that some biological materials are quite important in determining the highest yield.
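
One way to look at those relationships directly is to plot the top predictors against Yield using the imputed data; a minimal sketch, with the predictor names taken from the importance plot above:

## scatter plots (with smoothers) of the top predictors versus Yield
topPreds <- c("ManufacturingProcess32","ManufacturingProcess13","ManufacturingProcess36",
              "BiologicalMaterial06","BiologicalMaterial03")
featurePlot(x = imp[,topPreds], y = imp$Yield, plot = "scatter", type = c("p","smooth"))
## and their simple correlations with Yield
cor(imp[,topPreds], imp$Yield)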


Fin