Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)
where the x values are random variables uniformly distributed on [0, 1] (five other non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
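To make the simulation equation concrete, here is a minimal base R sketch of the same data-generating process. The helper name `simulate_friedman1` is hypothetical and is not part of mlbench. Its random draws are not expected to match `mlbench.friedman1` value for value, even with the same seed.

```r
# Hypothetical helper: simulate Friedman's benchmark equation by hand.
# Ten uniform predictors are drawn, but only X1-X5 enter the response.
simulate_friedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)
  colnames(x) <- paste0("X", 1:10)
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # Add Gaussian noise with standard deviation 'sd'
  list(x = as.data.frame(x), y = signal + rnorm(n, sd = sd))
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)
length(sim$y)
```

Columns X6 through X10 are pure noise, which is what makes this a useful check on whether a model's feature selection picks out only the informative predictors.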
Tune several models on these data. For example:
library(caret)
knnModel <-
train(
x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10
)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
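The three numbers reported by `postResample` can be reproduced by hand. The sketch below is a base R approximation, assuming Rsquared is the squared correlation between predictions and observations. The function name `regression_metrics` and the toy vectors are made up for illustration.

```r
# Hypothetical helper mirroring the three metrics shown above.
regression_metrics <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared correlation
    MAE      = mean(abs(pred - obs))        # mean absolute error
  )
}

# Toy data, not taken from the simulation
obs_toy  <- c(1, 2, 3, 4)
pred_toy <- c(1.5, 2, 2.5, 4.5)
m <- regression_metrics(pred_toy, obs_toy)
m
```

RMSE is in the units of the response and lower is better, while Rsquared lies between 0 and 1 and higher is better, so the two should not be read interchangeably.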
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
I read the question as asking me to evaluate the results of the KNN models, then fit a MARS model and evaluate its feature selection.
For the KNN models, the best resampled performance is at k = 17. We get an R-squared of 0.64 on the bootstrap resamples of the training data and 0.68 on the test data (RMSE of 3.18 and 3.20 respectively).
MARS has built-in feature selection, so the fitted model only uses X1, X2, X3, X4, X5 and X6: all five informative predictors plus one non-informative predictor, X6, which ranks last in importance and has the smallest coefficient. We get an R-squared of 0.92 and 0.87 on the training and test data respectively. The MARS model has an RMSE of 1.8 on the test data compared to an RMSE of 3.2 for the KNN model, so MARS gives the better performance.
library("earth")
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsFit <- earth(trainingData$x, trainingData$y)
marsFit
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
summary(marsFit)
## Call: earth(x=trainingData$x, y=trainingData$y)
##
## coefficients
## (Intercept) 18.451984
## h(0.621722-X1) -11.074396
## h(0.601063-X2) -10.744225
## h(X3-0.281766) 20.607853
## h(0.447442-X3) 17.880232
## h(X3-0.447442) -23.282007
## h(X3-0.636458) 15.150350
## h(0.734892-X4) -10.027487
## h(X4-0.734892) 9.092045
## h(0.850094-X5) -4.723407
## h(X5-0.850094) 10.832932
## h(X6-0.361791) -1.956821
##
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
marsPred <- predict(marsFit, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.8136467 0.8677298 1.3911836
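Each `h(...)` term in the MARS summary above is a hinge function, which is zero on one side of its knot and linear on the other. A minimal base R sketch evaluates the X1 term by hand. The coefficient and knot are copied from the summary output, while the input values in `x1` are arbitrary.

```r
# Hinge function used by MARS: h(z) = max(0, z), applied elementwise
h <- function(z) pmax(0, z)

# Arbitrary X1 values: below the knot, at the knot, above the knot
x1 <- c(0.2, 0.621722, 0.9)

# Contribution of the term  -11.074396 * h(0.621722 - X1)
contribution <- -11.074396 * h(0.621722 - x1)
contribution
```

The term only affects predictions when X1 is below the knot at 0.621722. At or above the knot it contributes nothing, which is how MARS builds piecewise linear fits.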
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models. (a) Which nonlinear regression model gives the optimal resampling and test set performance?
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
cmp_df <- ChemicalManufacturingProcess
sum(is.na(cmp_df))
## [1] 106
We can use the ‘colSums()’ function to understand missingness by predictor. The most missing values in any one column is 15 (ManufacturingProcess03). I’m going to replace the missing values with the column mean, which is what ‘na.aggregate()’ from zoo does by default.
colSums(is.na(cmp_df))
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0 0 0
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0 0 0
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 0 0 0
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 0 0 0
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0 1 3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## 15 1 1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 2 1 1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## 0 9 10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 1 0 1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 0 0 0
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## 0 0 0
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 0 1 1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 1 5 5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## 5 5 5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## 5 5 0
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 5 5 5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 5 0 0
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## 0 1 1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## 0 0 0
## ManufacturingProcess45
## 0
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
cmp_df <- na.aggregate(cmp_df)
sum(is.na(cmp_df))
## [1] 0
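For readers without zoo installed, column-mean imputation can be sketched in base R. The data frame `toy` and the helper `impute_mean` are hypothetical names used only for this illustration.

```r
# Toy data frame with one missing value per column
toy <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))

# Hypothetical helper: replace NAs in a numeric vector with its mean
impute_mean <- function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}

toy_imputed <- as.data.frame(lapply(toy, impute_mean))
toy_imputed
sum(is.na(toy_imputed))
```

Mean imputation is simple but ignores relationships between predictors. It is used here because few values are missing in any one column.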
columns_to_remove <- nearZeroVar(cmp_df)
cmp_df <- cmp_df[,-columns_to_remove]
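The idea behind the near-zero-variance filter can be sketched in base R. This is a rough approximation under stated assumptions, not caret's implementation: a column is flagged when the ratio of its most common value to its second most common value is high and it has few distinct values. The cutoffs 95/5 and 10 percent are assumed here to mirror caret's defaults, and the function name `is_near_zero_var` is hypothetical.

```r
# Hypothetical helper approximating the near-zero-variance rule
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)  # constant column: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]          # dominance of top value
  pct_unique <- 100 * length(counts) / length(x)   # percent distinct values
  freq_ratio > freq_cut && pct_unique < unique_cut
}

mostly_zero <- c(rep(0, 98), 1, 2)  # one value dominates
well_spread <- 1:100                # every value distinct

is_near_zero_var(mostly_zero)
is_near_zero_var(well_spread)
```

Columns like `mostly_zero` carry almost no information and can destabilize some models, which is the motivation for dropping them before fitting.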
# Set the random number seed so we can reproduce the results
set.seed(123456789)
# By default, the numbers are returned as a list. Using
# list = FALSE, a matrix of row numbers is generated.
# These samples are allocated to the training set.
trainingRows <- createDataPartition(cmp_df$Yield, p = .70, list = FALSE)
head(trainingRows)
## Resample1
## [1,] 2
## [2,] 3
## [3,] 4
## [4,] 5
## [5,] 7
## [6,] 8
# Subset the data into objects for training using
# integer sub-setting.
train <- cmp_df[trainingRows,]
library(dplyr) # provides select() and filter()
train <- train |> select(-Yield)
trans <- preProcess(train, method = c("center", "scale"))
train <- predict(trans, train)
yield_train <- cmp_df$Yield[trainingRows]
head(train)
## BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 2 2.0957767 1.252483 -0.1268052
## 3 2.0957767 1.252483 -0.1268052
## 4 2.0957767 1.252483 -0.1268052
## 5 1.3778129 1.837914 1.0949364
## 7 1.3911085 2.120707 1.1359172
## 8 0.6731447 1.904892 1.0462716
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 2 1.1800471 0.3587809 1.089012
## 3 1.1800471 0.3587809 1.089012
## 4 1.1800471 0.3587809 1.089012
## 5 0.8504608 -0.3993435 1.496600
## 7 0.7458303 -0.5039124 1.440288
## 8 1.7293575 0.3901516 1.512689
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 2 2.311820 -0.7707901 1.0177401
## 3 2.311820 -0.7707901 1.0177401
## 4 2.311820 -0.7707901 1.0177401
## 5 1.088856 -0.1581108 0.3809166
## 7 1.834566 0.2094968 0.3653844
## 8 2.028451 0.6506259 1.6234990
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 2 1.38641812 1.1472202 -5.4714537
## 3 1.38641812 1.1472202 -5.4714537
## 4 1.38641812 1.1472202 -5.4714537
## 5 0.12116426 1.1472202 -0.2242367
## 7 0.01677037 0.7370656 0.1680786
## 8 1.49707564 1.6940931 0.4132757
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 2 -1.969428 -0.008089843 -2.3181036
## 3 -1.969428 -0.008089843 -3.1108362
## 4 -1.969428 -0.008089843 -3.2693827
## 5 -1.969428 -0.008089843 -2.1595570
## 7 -1.969428 1.016858040 0.2186408
## 8 -1.969428 0.515287799 -0.4155453
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 2 0.9688887 0.8891570 -1.0286226
## 3 0.0417505 -0.1547302 0.9643337
## 4 0.3983421 2.0770287 -1.0286226
## 5 0.8165268 -0.6586758 0.9643337
## 7 -0.4347856 0.8891570 -1.0286226
## 8 0.2783977 1.5010909 0.9643337
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 2 0.8604992 0.5591263 -0.02499045
## 3 0.8604992 -0.4168261 -0.02499045
## 4 -1.1527442 -0.5144214 -0.02499045
## 5 0.8604992 -0.4883960 -0.02499045
## 7 0.8604992 2.3743977 3.10870450
## 8 0.8604992 1.9319660 1.29654056
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 2 -0.006996376 -0.4879186 -0.4456372
## 3 -0.006996376 -0.4879186 0.3100416
## 4 -0.006996376 -0.4879186 0.3100416
## 5 -0.006996376 -0.4879186 0.1211219
## 7 3.019944677 -0.4879186 -1.9569949
## 8 2.733635723 -0.4879186 -0.8234766
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 2 0.278423622 0.9457981 0.3928445
## 3 0.440595210 0.8079242 0.3928445
## 4 0.782957451 1.0664377 0.6854134
## 5 2.494768657 3.3241223 2.2782886
## 7 -1.955940478 -0.7086884 -1.7364069
## 8 0.008137642 1.1181404 0.5391290
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 2 -0.2144985 0.6185123 1.4529711
## 3 0.4094971 0.8268674 1.0426138
## 4 0.4094971 0.7226898 0.9346251
## 5 -0.2924979 1.0143870 1.5609598
## 7 -0.3704974 -1.6525586 -0.3612397
## 8 -0.5264963 -1.4858745 -0.1668600
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 2 0.7597951 0.2744285 -0.68368241
## 3 0.7597951 0.2744285 -0.38090877
## 4 0.5570959 0.2744285 -0.07813513
## 5 1.5300521 -0.7018171 0.83018578
## 7 -1.2469271 2.2269197 -1.28922968
## 8 -0.6388295 0.2744285 -0.98645605
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 2 -1.7829988 -0.9201897 0.3224479
## 3 -1.1918441 -0.7459859 1.0168982
## 4 -0.6006894 -0.5717820 0.8928892
## 5 0.5816199 1.6928682 1.8353575
## 7 -1.1918441 -1.2685975 -1.5128850
## 8 -0.6006894 -1.0943936 -1.2400652
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 2 1.2489958 0.91027573 0.8234824
## 3 1.4505103 1.06428140 0.8034346
## 4 1.3385578 0.91027573 0.8034346
## 5 2.2341779 2.09831946 0.8435302
## 7 0.1294708 -0.36577125 0.8234824
## 8 0.1742518 -0.05775991 0.8034346
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 2 2.035497 1.0357345 -1.3228383
## 3 1.874264 0.2846139 -0.8912206
## 4 1.874264 0.2846139 -0.8912206
## 5 2.357963 -0.3162826 -0.8192844
## 7 1.713030 2.9886481 -2.1141372
## 8 1.713030 2.5379758 -1.8983284
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 2 1.9201778 1.0509993 1.9873701
## 3 2.6579069 1.0509993 1.9873701
## 4 2.2890423 1.9190132 0.1249846
## 5 2.2890423 2.7870272 0.1249846
## 7 0.0758552 0.6169924 0.1249846
## 8 0.4447197 0.6169924 0.1249846
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 2 1.05962903 -0.735134 2.14246805
## 3 1.15314672 -1.905827 -0.68145550
## 4 -0.06258329 -1.905827 0.40466895
## 5 -2.68107870 -3.076521 -1.76757994
## 7 -2.02645485 -0.735134 -0.46423061
## 8 -1.74590177 -0.735134 -0.02978083
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 2 -0.6760464 0.2356983 1.9819345
## 3 -0.6760464 0.2356983 -0.5004885
## 4 -0.6760464 0.2356983 -0.5004885
## 5 -0.6760464 0.3000740 -0.5004885
## 7 -0.6760464 0.3000740 -0.5004885
## 8 -0.6760464 0.3000740 -0.5004885
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 2 2.0995668 0.02815815 -0.03692521
## 3 -0.4781192 0.42096439 0.06482425
## 4 -0.4781192 -0.19006753 0.16657371
## 5 -0.4781192 -0.01548698 0.16657371
## 7 -0.4781192 0.29002897 -0.24042412
## 8 -0.4781192 0.15909356 -0.13867466
## ManufacturingProcess44 ManufacturingProcess45
## 2 0.30054609 0.15921800
## 3 0.03623605 0.37150866
## 4 0.03623605 -0.05307267
## 5 -0.22807398 -0.05307267
## 7 0.56485612 0.15921800
## 8 0.56485612 0.15921800
# Do the same for the test set using negative integers.
test <- cmp_df[-trainingRows,]
test <- test |> select(-Yield)
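# Note: this estimates centering and scaling from the test set itself. Reusing the
# 'trans' object fitted on 'train' avoids that, but would change the numbers below.
trans <- preProcess(test, method = c("center", "scale"))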
test <- predict(trans, test)
yield_test <- cmp_df$Yield[-trainingRows]
head(test)
## BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 -0.1755184 -1.3793558 -2.40754671
## 6 -0.3862653 0.8007961 -0.41775223
## 9 0.9430609 2.1019346 1.19269294
## 15 -0.2079410 1.9355677 0.63917697
## 18 2.0778515 1.0689697 0.03794411
## 21 1.8671047 1.1434624 0.05225918
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 0.3542804 0.651668 -1.2603016
## 6 2.0820641 2.014069 0.7176365
## 9 2.4589059 0.597889 1.6380370
## 15 -0.3140803 1.267139 1.5984783
## 18 2.4375752 1.840781 0.9259793
## 21 2.4731263 2.020044 0.9998224
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 1 -1.211512 -3.17093088 1.389593
## 6 1.136085 -1.58036748 2.054042
## 9 1.923388 0.72479687 2.199390
## 15 1.980647 0.01019592 -1.019035
## 18 1.737298 0.10240249 2.843075
## 21 1.866130 0.37902222 3.009187
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 1 -1.758777 -1.5392922 -0.10566420
## 6 1.035161 0.6717192 0.59433809
## 9 1.505207 1.4621844 0.59433809
## 15 1.227637 2.2984736 -0.02390474
## 18 1.973228 1.4850964 0.94761971
## 21 2.348049 1.8860570 1.21258092
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 1 -0.002005454 0.01550889 -0.08922738
## 6 -2.008492079 0.01550889 -1.36399595
## 9 -2.008492079 0.83950282 -0.71455053
## 15 -2.008492079 0.01550889 -0.38982783
## 18 -2.008492079 0.01550889 -0.38982783
## 21 -2.008492079 0.83950282 -1.20163459
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 1 0.04971722 0.1250541 0.1764062
## 6 0.55858429 0.7382875 1.2411437
## 9 0.11058032 0.6564290 -0.8064284
## 15 4.05840476 0.2880655 -0.8064284
## 18 0.71353303 0.5336412 -0.8064284
## 21 3.45545206 -0.2030857 1.2411437
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 1 0.08726288 -1.6101083 0.06710068
## 6 0.97879500 -0.1392355 0.06710068
## 9 -1.02143731 1.0526787 0.82638896
## 15 -1.02143731 0.5708411 0.38914364
## 18 -1.02143731 0.1207033 0.53489208
## 21 0.97879500 -0.2026352 -0.19385013
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 1 0.01681627 0.03113167 0.9982485
## 6 0.01681627 -0.46351599 -0.6549019
## 9 2.49064599 -0.46351599 -0.7651119
## 15 1.19201577 -0.46351599 0.7778284
## 18 1.76918475 -0.46351599 0.9982485
## 21 0.90343127 -0.46351599 0.6676184
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 1 0.8324958 1.209825 0.2823790
## 6 2.5025585 3.126916 0.4449408
## 9 0.7365152 1.815222 0.2448647
## 15 1.2164182 2.017021 0.2948837
## 18 0.8516919 1.462074 0.2980099
## 21 1.8114981 2.101104 0.4136789
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 1 0.8562408 0.18038449 0.5551000
## 6 -0.9558037 0.17591996 2.0540011
## 9 -0.5243645 0.03751929 0.1917301
## 15 0.6836651 0.15657363 1.5543674
## 18 0.8562408 0.10002281 0.3734150
## 21 0.5973773 0.18931357 1.8950268
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 1 0.2715747 0.09499888 -0.1044680
## 6 0.3202221 -0.56366000 0.9619927
## 9 0.1005240 0.09499888 -0.8182676
## 15 0.2417585 0.09499888 1.5554127
## 18 0.2260658 0.09499888 -1.1149777
## 21 0.3343456 0.09499888 -0.2248475
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 1 -0.001520213 -0.2299603 0.16487392
## 6 -1.269930888 -1.5985493 0.16190785
## 9 -0.012299907 -1.2491648 0.11889974
## 15 -0.012299907 0.8471423 0.16635696
## 18 -1.269930888 -1.4238571 0.07885771
## 21 0.616515583 -0.8997803 0.17377215
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 1 0.1707123 0.2895910 0.9717201
## 6 0.2350214 0.2927257 1.1000139
## 9 0.2052487 0.1704753 1.0816862
## 15 0.2314487 0.2488409 1.0816862
## 18 0.1480850 0.2049562 0.9900478
## 21 0.2219214 0.3068315 1.0450308
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 1 0.4514727 0.5923419 -0.02438267
## 6 0.6963849 0.7307769 -0.11567137
## 9 0.6264100 1.0076468 -0.13595775
## 15 0.6613974 0.5923419 -0.08524180
## 18 0.4164853 1.0768643 -0.03452586
## 21 0.6264100 0.5923419 -0.06495542
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 1 -0.4040165 0.9122196 -1.7476413
## 6 2.7566080 2.3496567 0.1069984
## 9 0.3396599 0.5528604 0.1069984
## 15 0.3396599 0.1935011 0.1069984
## 18 -0.2180974 0.1935011 0.1069984
## 21 -0.2180974 -0.1658581 0.1069984
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 1 -0.6869526 -0.5069243 -1.2503431
## 6 -0.2965533 -1.6530140 -1.4938557
## 9 -0.1989535 -0.5069243 0.4542445
## 15 -0.2965533 -0.5069243 1.4282946
## 18 0.3866454 0.6391654 0.2107320
## 21 0.1914458 0.6391654 -0.2762931
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 1 0.6901983 0.2200479 0.1810502
## 6 -1.4209964 0.2200479 -0.3685787
## 9 0.6901983 0.3630791 -0.3685787
## 15 0.6901983 0.2200479 -0.3685787
## 18 0.6901983 0.3630791 -0.3685787
## 21 0.6901983 0.2915635 2.7341653
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 1 0.2466006 -0.07204426 4.28571934
## 6 -0.3503305 -0.60723023 2.68817521
## 9 -0.3503305 -1.14241620 0.09216601
## 15 -0.3503305 -0.07204426 -0.10752701
## 18 -0.3503305 0.46314170 1.68971013
## 21 4.6840274 -1.14241620 2.28878918
## ManufacturingProcess44 ManufacturingProcess45
## 1 -0.5717719 1.2985131
## 6 -0.5717719 -0.9522429
## 9 0.5717719 -0.3895539
## 15 0.5717719 -0.9522429
## 18 -0.5717719 0.1731351
## 21 -2.8588594 -0.9522429
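The test set above is centered and scaled with its own means and standard deviations. A common alternative is to estimate those values on the training set only and apply them to both sets. A minimal base R sketch with made-up data follows. The names `train_toy` and `test_toy` are hypothetical, and with caret the same idea amounts to calling `predict()` on the test set with the `preProcess` object fitted on the training set.

```r
# Toy training and test sets (not the chemical manufacturing data)
train_toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
test_toy  <- data.frame(a = c(2.5, 5),     b = c(25, 50))

# Estimate centering and scaling values from the training set only
mu    <- colMeans(train_toy)
sigma <- sapply(train_toy, sd)

# Apply the SAME training-set values to both sets
train_scaled <- as.data.frame(scale(train_toy, center = mu, scale = sigma))
test_scaled  <- as.data.frame(scale(test_toy,  center = mu, scale = sigma))

test_scaled
```

With this approach the test columns will not have mean exactly zero, which is expected: the test set is treated as new data on the scale the model was trained on.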
marsFit <- earth(train, yield_train)
marsFit
## Selected 10 of 21 terms, and 7 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 9 (additive model)
## GCV 0.9849263 RSS 87.57107 GRSq 0.7289215 RSq 0.8024562
summary(marsFit)
## Call: earth(x=train, y=yield_train)
##
## coefficients
## (Intercept) 39.925652
## h(-0.85297-ManufacturingProcess05) 2.072296
## h(0.611177-ManufacturingProcess09) -1.038454
## h(-1.39024-ManufacturingProcess13) 4.219073
## h(-0.846306-ManufacturingProcess32) -2.107534
## h(ManufacturingProcess32- -0.846306) 1.112444
## h(-1.11904-ManufacturingProcess33) 2.583950
## h(-0.0625833-ManufacturingProcess35) 0.519896
## h(0.106947-ManufacturingProcess39) -0.242464
## h(ManufacturingProcess39-0.106947) -2.518510
##
## Selected 10 of 21 terms, and 7 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 9 (additive model)
## GCV 0.9849263 RSS 87.57107 GRSq 0.7289215 RSq 0.8024562
marsPred <- predict(marsFit, newdata = test)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = marsPred, obs = yield_test)
## RMSE Rsquared MAE
## 1.3116790 0.4665333 1.0908784
knnModel <-
train(
x = train,
y = yield_train,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10
)
knnModel
## k-Nearest Neighbors
##
## 124 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.411625 0.4674504 1.131015
## 7 1.406354 0.4714071 1.125800
## 9 1.405609 0.4805191 1.128503
## 11 1.422081 0.4759467 1.144328
## 13 1.440176 0.4648049 1.169865
## 15 1.451331 0.4630822 1.184253
## 17 1.461761 0.4608139 1.193029
## 19 1.469425 0.4601417 1.202707
## 21 1.474191 0.4631285 1.206779
## 23 1.483928 0.4615102 1.216426
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
knnPred <- predict(knnModel, newdata = test)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = yield_test)
## RMSE Rsquared MAE
## 1.4703213 0.2767066 1.0799103
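Of the two nonlinear models, MARS does better on the test set (RMSE 1.31, R-squared 0.47) than KNN (RMSE 1.47, R-squared 0.28). For comparison, the optimal linear model was a ridge regression model, refit here:
library(elasticnet) # provides enet()
ridgeModel <- enet(x = as.matrix(train), y = yield_train, lambda = 0.22525)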
ridgePred <- predict(ridgeModel, newx = as.matrix(test), s = 1, mode = "fraction",type = "fit")
lmValues1 <- data.frame(obs = yield_test, pred = ridgePred$fit)
names(lmValues1) <- c('obs', 'pred')
defaultSummary(lmValues1)
## RMSE Rsquared MAE
## 1.2683048 0.4988622 1.0184721
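To compare the three models side by side, the test-set metrics printed above can be collected into one table. This is only a bookkeeping sketch: the numbers are copied from the `postResample` and `defaultSummary` outputs in this document, and the data frame name `results` is arbitrary.

```r
# Test-set metrics copied from the outputs above
results <- data.frame(
  model    = c("MARS", "KNN", "Ridge"),
  RMSE     = c(1.3116790, 1.4703213, 1.2683048),
  Rsquared = c(0.4665333, 0.2767066, 0.4988622),
  MAE      = c(1.0908784, 1.0799103, 1.0184721)
)

# Sort by test-set RMSE, best model first
results[order(results$RMSE), ]
```

On these numbers the ridge model has the lowest test RMSE, with MARS close behind and KNN last.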
Here are the top predictors for the ridge regression model, taken as those with an absolute coefficient above 0.2. All 8 listed are Manufacturing Process predictors.
ridgeCoef<- predict(ridgeModel, newx = as.matrix(test), s = 1, mode = "fraction",type = "coefficients")
coef <- as.data.frame(ridgeCoef$coefficients)
names(coef) <- c("coefficients")
coef|> as.data.frame() |> filter(abs(coefficients) > 0.2)
## coefficients
## ManufacturingProcess04 0.2079555
## ManufacturingProcess09 0.3593582
## ManufacturingProcess13 -0.3360078
## ManufacturingProcess17 -0.3218476
## ManufacturingProcess26 0.2916862
## ManufacturingProcess29 0.2510196
## ManufacturingProcess32 0.4919538
## ManufacturingProcess36 -0.2783848
We don’t get 10 top predictors from our MARS model. We only get 7, all of them Manufacturing Process predictors. The top 2 predictors are the same for both models: ManufacturingProcess32 and ManufacturingProcess09.
evimp(marsFit)
## nsubsets gcv rss
## ManufacturingProcess32 9 100.0 100.0
## ManufacturingProcess09 8 67.0 69.0
## ManufacturingProcess13 7 42.5 47.3
## ManufacturingProcess33 6 28.3 35.4
## ManufacturingProcess35 4 23.1 28.3
## ManufacturingProcess05 3 16.6 22.2
## ManufacturingProcess39 2 5.9 14.7
We see that ManufacturingProcess32 and ManufacturingProcess09, which are the most influential predictors, have the steepest slopes.
The question asks us to look specifically for predictors that are unique to the optimal nonlinear regression model. ManufacturingProcess39 is not found in the ridge regression predictors and has a relationship not seen in the other predictors. The Yield increases as ManufacturingProcess39 increases in value, but at high values of ManufacturingProcess39 the Yield actually drops. So too high a value of ManufacturingProcess39 reverses the relationship and hurts the Yield.
plotmo(marsFit)
## plotmo grid: BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## -0.08470601 -0.07713991 -0.1268052
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## -0.1461453 -0.001981749 -0.1176615
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 0.02994867 -0.0600821 -0.1627132
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## -0.198281 -0.02172056 0.1190392
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 0.5107121 0.01371756 0.3771873
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## -0.06198524 -0.2627186 0.9643337
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 0.8604992 0.02235246 -0.1273025
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 0.00335214 -0.4879186 0.1211219
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 0.01714718 -0.1485758 -0.07038956
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 0.09749931 -0.04822413 -0.16686
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## -0.0611367 -0.1439625 -0.07813513
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## -0.009534753 -0.2233743 -0.08678172
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## -0.1280199 -0.07976072 0.6630997
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## -0.2217679 -0.04482494 0.1158871
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## -0.1085771 0.1829854 0.1249846
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## -0.06258329 0.4355593 -0.02978083
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 0.7447969 0.2356983 -0.5004885
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## -0.4781192 0.2027387 -0.1386747
## ManufacturingProcess44 ManufacturingProcess45
## 0.3005461 0.159218
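The rise-then-fall shape described for ManufacturingProcess39 can be checked directly from the two hinge terms in the MARS summary. The sketch below uses the coefficients and knot from that output. The grid `mp39` of centered-and-scaled values is arbitrary, and only this predictor's contribution is computed, not the full prediction.

```r
# Hinge function used by MARS
h <- function(z) pmax(0, z)

# Arbitrary grid of scaled ManufacturingProcess39 values
mp39 <- seq(-2, 2, by = 0.5)

# Contribution of the two ManufacturingProcess39 terms to predicted Yield
partial <- -0.242464 * h(0.106947 - mp39) - 2.518510 * h(mp39 - 0.106947)

data.frame(mp39 = mp39, partial = round(partial, 3))
```

The contribution climbs gently until the knot at 0.106947 and then falls much faster, because the coefficient on the right-hand hinge is about ten times larger in magnitude.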
Lastly, we look at the relationships between the most important predictors and Yield by building a reduced model with just the most influential predictors and increasing ‘degree’ to 2 to see whether any interaction terms enter the model. We see that the interactions ManufacturingProcess32:ManufacturingProcess09 and ManufacturingProcess09:ManufacturingProcess35 are influential in the Yield. As Process 09 increases and Process 35 goes down, the Yield increases.
rd_df <- cmp_df[, c("Yield", "ManufacturingProcess32", "ManufacturingProcess09", "ManufacturingProcess13", "ManufacturingProcess33", "ManufacturingProcess35")]
mars2 <- earth(Yield ~ ., data = rd_df, degree = 2) # degree = 2 allows two-way interaction terms
plotmo(mars2)
## plotmo grid: ManufacturingProcess32 ManufacturingProcess09
## 158 45.73
## ManufacturingProcess13 ManufacturingProcess33 ManufacturingProcess35
## 34.6 64 495.5965
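With `degree = 2`, a MARS interaction term is a product of two hinge functions, so it is nonzero only when both predictors are on the active side of their knots. The sketch below illustrates that structure. The knots and coefficient are hypothetical values chosen for readability and are NOT taken from the fitted `mars2` model.

```r
# Hinge function used by MARS
h <- function(z) pmax(0, z)

# Hypothetical interaction term: active when Process 09 is above its knot
# and Process 35 is below its knot. All default values are made up.
interaction_term <- function(p09, p35, knot09 = 45, knot35 = 495, beta = 0.01) {
  beta * h(p09 - knot09) * h(knot35 - p35)
}

term <- interaction_term(p09 = c(44, 46, 47), p35 = c(490, 490, 480))
term
```

In this toy setup the term is zero while Process 09 is below its knot and grows as Process 09 rises and Process 35 falls, which is the kind of pattern the interaction plot is used to look for.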