DATA624

R Markdown

Exercise 7.2.

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)

where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

 library(mlbench)
 library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

 set.seed(200)
 trainingData <- mlbench.friedman1(200, sd = 1)
 ## We convert the 'x' data from a matrix to a data frame
 ## One reason is that this will give the columns names.
 trainingData$x <- data.frame(trainingData$x)
 ## Look at the data using
 featurePlot(trainingData$x, trainingData$y)

 ## or other methods.

 ## This creates a list with a vector 'y' and a matrix
 ## of predictors 'x'. Also simulate a large test set to
 ## estimate the true error rate with good precision:
 testData <- mlbench.friedman1(5000, sd = 1)
 testData$x <- data.frame(testData$x)
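
To make the simulation concrete, here is a minimal base R sketch that follows the Friedman equation term by term. The helper name `simulate_friedman1` and its arguments are my own, not part of mlbench, and it will not reproduce the exact draws of `mlbench.friedman1` even with the same seed.

```r
# Hypothetical helper: simulate Friedman's first benchmark in base R.
# This is NOT the mlbench implementation; it only follows the equation above.
simulate_friedman1 <- function(n, sd = 1, n_noise = 5) {
  p <- 5 + n_noise
  # All predictors are independent Uniform(0, 1) draws
  x <- matrix(runif(n * p), nrow = n, ncol = p,
              dimnames = list(NULL, paste0("X", seq_len(p))))
  # Only X1..X5 enter the signal; the remaining columns are pure noise
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # Add Gaussian noise with standard deviation 'sd'
  y <- signal + rnorm(n, mean = 0, sd = sd)
  list(x = as.data.frame(x), y = y, signal = signal)
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)
```

Because X6 to X10 never appear in `signal`, any model that assigns them importance is fitting noise.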

Tune several models on these data. For example:

library(caret)
knnModel <-
  train(
    x = trainingData$x,
    y = trainingData$y,
    method = "knn",
    preProc = c("center", "scale"),
    tuneLength = 10
  )
                  
knnModel

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

I read the question as asking me to evaluate the KNN results, then fit a MARS model and evaluate its feature selection.

For the KNN models, the best resampled performance is at k = 17, with an RMSE of 3.18 and an R-squared of 0.64. On the test data the same model gives an RMSE of 3.20 and an R-squared of 0.68.
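
RMSE and R-squared are easy to mix up when reading the output, so here is a small sketch of how the three reported metrics can be computed by hand. The helper name `regression_metrics` and the toy numbers are my own; as far as I know, R-squared here is the squared correlation between observed and predicted values, which is how `postResample` reports it.

```r
# Hypothetical helper mirroring the three metrics reported for a numeric outcome
regression_metrics <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((obs - pred)^2)),  # same units as the response
    Rsquared = cor(obs, pred)^2,            # squared correlation, between 0 and 1
    MAE      = mean(abs(obs - pred))        # same units as the response
  )
}

# Toy values, not taken from the exercise
obs  <- c(10, 12, 15, 19, 24)
pred <- c(11, 11, 16, 18, 26)
regression_metrics(pred, obs)
```

RMSE and MAE are in the units of y and lower is better, while R-squared is unitless and higher is better.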

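When we fit the MARS model, feature selection is built into the model build, so the model uses only X1, X2, X3, X4, X5, and X6. It selects all five informative predictors (X1–X5) but also keeps X6, which is non-informative; X6 ranks last in importance and has the smallest coefficient. We get an R-squared of 0.92 on the training data and 0.87 on the test data. The MARS model has an RMSE of 1.8 on the test data, compared to 3.2 for the KNN model, so MARS gives the better performance of the two.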

library("earth")

## Loading required package: Formula

## Loading required package: plotmo

## Loading required package: plotrix

marsFit <- earth(trainingData$x, trainingData$y)
marsFit

## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

summary(marsFit)

## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
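
The `h()` terms in the summary are hinge functions, h(z) = max(0, z). As a small sketch, the two X4 terms printed above can be evaluated by hand; the helper names `hinge` and `x4_contribution` are my own.

```r
# Hinge function used in the MARS terms: h(z) = max(0, z)
hinge <- function(z) pmax(0, z)

# Contribution of X4 to the fitted value, using the two X4 coefficients
# printed by summary(marsFit) above
x4_contribution <- function(x4) {
  -10.027487 * hinge(0.734892 - x4) + 9.092045 * hinge(x4 - 0.734892)
}

x4_grid <- c(0, 0.25, 0.5, 0.734892, 1)
data.frame(X4 = x4_grid, contribution = x4_contribution(x4_grid))

# Slope of the X4 contribution on each side of the knot
left_slope  <- diff(x4_contribution(c(0.2, 0.3))) / 0.1
right_slope <- diff(x4_contribution(c(0.8, 0.9))) / 0.1
c(left = left_slope, right = right_slope)
```

The slope is about 10.0 below the knot and about 9.1 above it, which is close to the true 10x_4 term in the Friedman equation.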

marsPred <- predict(marsFit, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = marsPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 1.8136467 0.8677298 1.3911836

Exercise 7.5.

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models. (a) Which nonlinear regression model gives the optimal resampling and test set performance?

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

cmp_df <- ChemicalManufacturingProcess
sum(is.na(cmp_df))

## [1] 106

We can use the ‘colSums()’ function to understand missingness by predictor. The largest count is 15 missing values, in ManufacturingProcess03. I’m going to replace the missing values with the column mean.

colSums(is.na(cmp_df))

##                  Yield   BiologicalMaterial01   BiologicalMaterial02 
##                      0                      0                      0 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##                      0                      0                      0 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##                      0                      0                      0 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##                      0                      0                      0 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##                      0                      1                      3 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##                     15                      1                      1 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##                      2                      1                      1 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##                      0                      9                     10 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##                      1                      0                      1 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##                      0                      0                      0 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##                      0                      0                      0 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##                      0                      1                      1 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##                      1                      5                      5 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##                      5                      5                      5 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##                      5                      5                      0 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      5                      5                      5 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##                      5                      0                      0 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##                      0                      1                      1 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##                      0                      0                      0 
## ManufacturingProcess45 
##                      0

library(zoo)

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

cmp_df <- na.aggregate(cmp_df)
sum(is.na(cmp_df))

## [1] 0
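
For reference, column-mean imputation can be sketched in a few lines of base R. The helper name `impute_column_means` and the toy data frame are my own; I am assuming this matches the default behavior of `na.aggregate`, which fills each missing value with the mean of its column.

```r
# Hypothetical base R version of column-mean imputation
impute_column_means <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      # Replace each NA with the mean of the observed values in that column
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Toy data, not the chemical manufacturing data
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 10, 20))
impute_column_means(toy)
```

One caveat: imputing before the train/test split means the column means are computed from rows that later end up in the test set.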

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

columns_to_remove <- nearZeroVar(cmp_df)
cmp_df <- cmp_df[,-columns_to_remove]

# Set the random number seed so we can reproduce the results
set.seed(123456789)
# By default, the numbers are returned as a list. Using
# list = FALSE, a matrix of row numbers is generated.
# These samples are allocated to the training set.
trainingRows <- createDataPartition(cmp_df$Yield, p = .70, list = FALSE)
head(trainingRows)

##      Resample1
## [1,]         2
## [2,]         3
## [3,]         4
## [4,]         5
## [5,]         7
## [6,]         8
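
As I understand it, for a numeric outcome `createDataPartition` samples within percentile groups of the response so that the training and test sets have similar Yield distributions. Here is a rough base R sketch of that idea. The helper `stratified_rows` is my own approximation, not caret's implementation, and it will not return the same rows.

```r
# Hypothetical helper: stratified split for a numeric outcome.
# An approximation of the idea only, not caret's algorithm.
stratified_rows <- function(y, p = 0.7, groups = 5) {
  # Cut the outcome into percentile-based bins
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  # Sample a fraction p of the row indices inside each bin
  rows <- unlist(lapply(split(seq_along(y), bins), function(idx) {
    # sample.int avoids the sample(x) pitfall when a bin holds a single index
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  }))
  sort(unname(rows))
}

set.seed(123456789)
y_toy <- rnorm(100, mean = 40, sd = 2)  # toy outcome, not the real Yield
train_rows <- stratified_rows(y_toy, p = 0.7)
length(train_rows)
```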

# Subset the data into objects for training using
# integer sub-setting.
train <- cmp_df[trainingRows,]
library(dplyr)  # select() and filter() come from dplyr
train <- train |> select(-Yield)

trans <- preProcess(train, method = c("center", "scale"))
train <- predict(trans, train)

yield_train <- cmp_df$Yield[trainingRows]
head(train)

##   BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 2            2.0957767             1.252483           -0.1268052
## 3            2.0957767             1.252483           -0.1268052
## 4            2.0957767             1.252483           -0.1268052
## 5            1.3778129             1.837914            1.0949364
## 7            1.3911085             2.120707            1.1359172
## 8            0.6731447             1.904892            1.0462716
##   BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 2            1.1800471            0.3587809             1.089012
## 3            1.1800471            0.3587809             1.089012
## 4            1.1800471            0.3587809             1.089012
## 5            0.8504608           -0.3993435             1.496600
## 7            0.7458303           -0.5039124             1.440288
## 8            1.7293575            0.3901516             1.512689
##   BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 2             2.311820           -0.7707901            1.0177401
## 3             2.311820           -0.7707901            1.0177401
## 4             2.311820           -0.7707901            1.0177401
## 5             1.088856           -0.1581108            0.3809166
## 7             1.834566            0.2094968            0.3653844
## 8             2.028451            0.6506259            1.6234990
##   BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 2           1.38641812            1.1472202             -5.4714537
## 3           1.38641812            1.1472202             -5.4714537
## 4           1.38641812            1.1472202             -5.4714537
## 5           0.12116426            1.1472202             -0.2242367
## 7           0.01677037            0.7370656              0.1680786
## 8           1.49707564            1.6940931              0.4132757
##   ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 2              -1.969428           -0.008089843             -2.3181036
## 3              -1.969428           -0.008089843             -3.1108362
## 4              -1.969428           -0.008089843             -3.2693827
## 5              -1.969428           -0.008089843             -2.1595570
## 7              -1.969428            1.016858040              0.2186408
## 8              -1.969428            0.515287799             -0.4155453
##   ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 2              0.9688887              0.8891570             -1.0286226
## 3              0.0417505             -0.1547302              0.9643337
## 4              0.3983421              2.0770287             -1.0286226
## 5              0.8165268             -0.6586758              0.9643337
## 7             -0.4347856              0.8891570             -1.0286226
## 8              0.2783977              1.5010909              0.9643337
##   ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 2              0.8604992              0.5591263            -0.02499045
## 3              0.8604992             -0.4168261            -0.02499045
## 4             -1.1527442             -0.5144214            -0.02499045
## 5              0.8604992             -0.4883960            -0.02499045
## 7              0.8604992              2.3743977             3.10870450
## 8              0.8604992              1.9319660             1.29654056
##   ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 2           -0.006996376             -0.4879186             -0.4456372
## 3           -0.006996376             -0.4879186              0.3100416
## 4           -0.006996376             -0.4879186              0.3100416
## 5           -0.006996376             -0.4879186              0.1211219
## 7            3.019944677             -0.4879186             -1.9569949
## 8            2.733635723             -0.4879186             -0.8234766
##   ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 2            0.278423622              0.9457981              0.3928445
## 3            0.440595210              0.8079242              0.3928445
## 4            0.782957451              1.0664377              0.6854134
## 5            2.494768657              3.3241223              2.2782886
## 7           -1.955940478             -0.7086884             -1.7364069
## 8            0.008137642              1.1181404              0.5391290
##   ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 2             -0.2144985              0.6185123              1.4529711
## 3              0.4094971              0.8268674              1.0426138
## 4              0.4094971              0.7226898              0.9346251
## 5             -0.2924979              1.0143870              1.5609598
## 7             -0.3704974             -1.6525586             -0.3612397
## 8             -0.5264963             -1.4858745             -0.1668600
##   ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 2              0.7597951              0.2744285            -0.68368241
## 3              0.7597951              0.2744285            -0.38090877
## 4              0.5570959              0.2744285            -0.07813513
## 5              1.5300521             -0.7018171             0.83018578
## 7             -1.2469271              2.2269197            -1.28922968
## 8             -0.6388295              0.2744285            -0.98645605
##   ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 2             -1.7829988             -0.9201897              0.3224479
## 3             -1.1918441             -0.7459859              1.0168982
## 4             -0.6006894             -0.5717820              0.8928892
## 5              0.5816199              1.6928682              1.8353575
## 7             -1.1918441             -1.2685975             -1.5128850
## 8             -0.6006894             -1.0943936             -1.2400652
##   ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 2              1.2489958             0.91027573              0.8234824
## 3              1.4505103             1.06428140              0.8034346
## 4              1.3385578             0.91027573              0.8034346
## 5              2.2341779             2.09831946              0.8435302
## 7              0.1294708            -0.36577125              0.8234824
## 8              0.1742518            -0.05775991              0.8034346
##   ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 2               2.035497              1.0357345             -1.3228383
## 3               1.874264              0.2846139             -0.8912206
## 4               1.874264              0.2846139             -0.8912206
## 5               2.357963             -0.3162826             -0.8192844
## 7               1.713030              2.9886481             -2.1141372
## 8               1.713030              2.5379758             -1.8983284
##   ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 2              1.9201778              1.0509993              1.9873701
## 3              2.6579069              1.0509993              1.9873701
## 4              2.2890423              1.9190132              0.1249846
## 5              2.2890423              2.7870272              0.1249846
## 7              0.0758552              0.6169924              0.1249846
## 8              0.4447197              0.6169924              0.1249846
##   ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 2             1.05962903              -0.735134             2.14246805
## 3             1.15314672              -1.905827            -0.68145550
## 4            -0.06258329              -1.905827             0.40466895
## 5            -2.68107870              -3.076521            -1.76757994
## 7            -2.02645485              -0.735134            -0.46423061
## 8            -1.74590177              -0.735134            -0.02978083
##   ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 2             -0.6760464              0.2356983              1.9819345
## 3             -0.6760464              0.2356983             -0.5004885
## 4             -0.6760464              0.2356983             -0.5004885
## 5             -0.6760464              0.3000740             -0.5004885
## 7             -0.6760464              0.3000740             -0.5004885
## 8             -0.6760464              0.3000740             -0.5004885
##   ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 2              2.0995668             0.02815815            -0.03692521
## 3             -0.4781192             0.42096439             0.06482425
## 4             -0.4781192            -0.19006753             0.16657371
## 5             -0.4781192            -0.01548698             0.16657371
## 7             -0.4781192             0.29002897            -0.24042412
## 8             -0.4781192             0.15909356            -0.13867466
##   ManufacturingProcess44 ManufacturingProcess45
## 2             0.30054609             0.15921800
## 3             0.03623605             0.37150866
## 4             0.03623605            -0.05307267
## 5            -0.22807398            -0.05307267
## 7             0.56485612             0.15921800
## 8             0.56485612             0.15921800

# Do the same for the test set using negative integers.
test <- cmp_df[-trainingRows,]
test <- test |> select(-Yield)

trans <- preProcess(test, method = c("center", "scale"))
test <- predict(trans, test)

yield_test <- cmp_df$Yield[-trainingRows]
head(test)

##    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1            -0.1755184           -1.3793558          -2.40754671
## 6            -0.3862653            0.8007961          -0.41775223
## 9             0.9430609            2.1019346           1.19269294
## 15           -0.2079410            1.9355677           0.63917697
## 18            2.0778515            1.0689697           0.03794411
## 21            1.8671047            1.1434624           0.05225918
##    BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1             0.3542804             0.651668           -1.2603016
## 6             2.0820641             2.014069            0.7176365
## 9             2.4589059             0.597889            1.6380370
## 15           -0.3140803             1.267139            1.5984783
## 18            2.4375752             1.840781            0.9259793
## 21            2.4731263             2.020044            0.9998224
##    BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 1             -1.211512          -3.17093088             1.389593
## 6              1.136085          -1.58036748             2.054042
## 9              1.923388           0.72479687             2.199390
## 15             1.980647           0.01019592            -1.019035
## 18             1.737298           0.10240249             2.843075
## 21             1.866130           0.37902222             3.009187
##    BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 1             -1.758777           -1.5392922            -0.10566420
## 6              1.035161            0.6717192             0.59433809
## 9              1.505207            1.4621844             0.59433809
## 15             1.227637            2.2984736            -0.02390474
## 18             1.973228            1.4850964             0.94761971
## 21             2.348049            1.8860570             1.21258092
##    ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 1            -0.002005454             0.01550889            -0.08922738
## 6            -2.008492079             0.01550889            -1.36399595
## 9            -2.008492079             0.83950282            -0.71455053
## 15           -2.008492079             0.01550889            -0.38982783
## 18           -2.008492079             0.01550889            -0.38982783
## 21           -2.008492079             0.83950282            -1.20163459
##    ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 1              0.04971722              0.1250541              0.1764062
## 6              0.55858429              0.7382875              1.2411437
## 9              0.11058032              0.6564290             -0.8064284
## 15             4.05840476              0.2880655             -0.8064284
## 18             0.71353303              0.5336412             -0.8064284
## 21             3.45545206             -0.2030857              1.2411437
##    ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 1              0.08726288             -1.6101083             0.06710068
## 6              0.97879500             -0.1392355             0.06710068
## 9             -1.02143731              1.0526787             0.82638896
## 15            -1.02143731              0.5708411             0.38914364
## 18            -1.02143731              0.1207033             0.53489208
## 21             0.97879500             -0.2026352            -0.19385013
##    ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 1              0.01681627             0.03113167              0.9982485
## 6              0.01681627            -0.46351599             -0.6549019
## 9              2.49064599            -0.46351599             -0.7651119
## 15             1.19201577            -0.46351599              0.7778284
## 18             1.76918475            -0.46351599              0.9982485
## 21             0.90343127            -0.46351599              0.6676184
##    ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 1               0.8324958               1.209825              0.2823790
## 6               2.5025585               3.126916              0.4449408
## 9               0.7365152               1.815222              0.2448647
## 15              1.2164182               2.017021              0.2948837
## 18              0.8516919               1.462074              0.2980099
## 21              1.8114981               2.101104              0.4136789
##    ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 1               0.8562408             0.18038449              0.5551000
## 6              -0.9558037             0.17591996              2.0540011
## 9              -0.5243645             0.03751929              0.1917301
## 15              0.6836651             0.15657363              1.5543674
## 18              0.8562408             0.10002281              0.3734150
## 21              0.5973773             0.18931357              1.8950268
##    ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 1               0.2715747             0.09499888             -0.1044680
## 6               0.3202221            -0.56366000              0.9619927
## 9               0.1005240             0.09499888             -0.8182676
## 15              0.2417585             0.09499888              1.5554127
## 18              0.2260658             0.09499888             -1.1149777
## 21              0.3343456             0.09499888             -0.2248475
##    ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 1            -0.001520213             -0.2299603             0.16487392
## 6            -1.269930888             -1.5985493             0.16190785
## 9            -0.012299907             -1.2491648             0.11889974
## 15           -0.012299907              0.8471423             0.16635696
## 18           -1.269930888             -1.4238571             0.07885771
## 21            0.616515583             -0.8997803             0.17377215
##    ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 1               0.1707123              0.2895910              0.9717201
## 6               0.2350214              0.2927257              1.1000139
## 9               0.2052487              0.1704753              1.0816862
## 15              0.2314487              0.2488409              1.0816862
## 18              0.1480850              0.2049562              0.9900478
## 21              0.2219214              0.3068315              1.0450308
##    ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 1               0.4514727              0.5923419            -0.02438267
## 6               0.6963849              0.7307769            -0.11567137
## 9               0.6264100              1.0076468            -0.13595775
## 15              0.6613974              0.5923419            -0.08524180
## 18              0.4164853              1.0768643            -0.03452586
## 21              0.6264100              0.5923419            -0.06495542
##    ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 1              -0.4040165              0.9122196             -1.7476413
## 6               2.7566080              2.3496567              0.1069984
## 9               0.3396599              0.5528604              0.1069984
## 15              0.3396599              0.1935011              0.1069984
## 18             -0.2180974              0.1935011              0.1069984
## 21             -0.2180974             -0.1658581              0.1069984
##    ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 1              -0.6869526             -0.5069243             -1.2503431
## 6              -0.2965533             -1.6530140             -1.4938557
## 9              -0.1989535             -0.5069243              0.4542445
## 15             -0.2965533             -0.5069243              1.4282946
## 18              0.3866454              0.6391654              0.2107320
## 21              0.1914458              0.6391654             -0.2762931
##    ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 1               0.6901983              0.2200479              0.1810502
## 6              -1.4209964              0.2200479             -0.3685787
## 9               0.6901983              0.3630791             -0.3685787
## 15              0.6901983              0.2200479             -0.3685787
## 18              0.6901983              0.3630791             -0.3685787
## 21              0.6901983              0.2915635              2.7341653
##    ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 1               0.2466006            -0.07204426             4.28571934
## 6              -0.3503305            -0.60723023             2.68817521
## 9              -0.3503305            -1.14241620             0.09216601
## 15             -0.3503305            -0.07204426            -0.10752701
## 18             -0.3503305             0.46314170             1.68971013
## 21              4.6840274            -1.14241620             2.28878918
##    ManufacturingProcess44 ManufacturingProcess45
## 1              -0.5717719              1.2985131
## 6              -0.5717719             -0.9522429
## 9               0.5717719             -0.3895539
## 15              0.5717719             -0.9522429
## 18             -0.5717719              0.1731351
## 21             -2.8588594             -0.9522429
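
The test set above is centered and scaled with its own means and standard deviations. A common alternative is to reuse the values learned on the training set, so the test predictors are on the same scale the model was trained on. Here is a minimal base R sketch of that idea, with hypothetical helper names and toy data:

```r
# Hypothetical helpers: learn centering and scaling values on the training
# predictors only, then apply those same values to any other data set.
fit_scaler <- function(train_x) {
  list(center = colMeans(train_x), scale = apply(train_x, 2, sd))
}
apply_scaler <- function(scaler, x) {
  as.data.frame(scale(x, center = scaler$center, scale = scaler$scale))
}

# Toy data, not the chemical manufacturing data
train_toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
test_toy  <- data.frame(a = c(5, 6),       b = c(50, 60))

scaler <- fit_scaler(train_toy)
apply_scaler(scaler, train_toy)  # training columns now have mean 0 and sd 1
apply_scaler(scaler, test_toy)   # test columns use the training mean and sd
```

With caret, I believe the equivalent is to call `predict()` on the test predictors with the `preProcess` object that was fitted on `train`. The results printed in this report were not produced that way, so they would change.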

marsFit <- earth(train, yield_train)
marsFit

## Selected 10 of 21 terms, and 7 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 9 (additive model)
## GCV 0.9849263    RSS 87.57107    GRSq 0.7289215    RSq 0.8024562

summary(marsFit)

## Call: earth(x=train, y=yield_train)
## 
##                                      coefficients
## (Intercept)                             39.925652
## h(-0.85297-ManufacturingProcess05)       2.072296
## h(0.611177-ManufacturingProcess09)      -1.038454
## h(-1.39024-ManufacturingProcess13)       4.219073
## h(-0.846306-ManufacturingProcess32)     -2.107534
## h(ManufacturingProcess32- -0.846306)     1.112444
## h(-1.11904-ManufacturingProcess33)       2.583950
## h(-0.0625833-ManufacturingProcess35)     0.519896
## h(0.106947-ManufacturingProcess39)      -0.242464
## h(ManufacturingProcess39-0.106947)      -2.518510
## 
## Selected 10 of 21 terms, and 7 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 9 (additive model)
## GCV 0.9849263    RSS 87.57107    GRSq 0.7289215    RSq 0.8024562

marsPred <- predict(marsFit, newdata = test)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = marsPred, obs = yield_test)

##      RMSE  Rsquared       MAE 
## 1.3116790 0.4665333 1.0908784

knnModel <-
  train(
    x = train,
    y = yield_train,
    method = "knn",
    preProc = c("center", "scale"),
    tuneLength = 10
  )
                  
knnModel

## k-Nearest Neighbors 
## 
## 124 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.411625  0.4674504  1.131015
##    7  1.406354  0.4714071  1.125800
##    9  1.405609  0.4805191  1.128503
##   11  1.422081  0.4759467  1.144328
##   13  1.440176  0.4648049  1.169865
##   15  1.451331  0.4630822  1.184253
##   17  1.461761  0.4608139  1.193029
##   19  1.469425  0.4601417  1.202707
##   21  1.474191  0.4631285  1.206779
##   23  1.483928  0.4615102  1.216426
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.

knnPred <- predict(knnModel, newdata = test)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = yield_test)

##      RMSE  Rsquared       MAE 
## 1.4703213 0.2767066 1.0799103
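
To show what k means in the tuning table, here is a minimal sketch of KNN regression: each prediction is the average response of the k closest training points. The helper `knn_predict` is my own and uses plain Euclidean distance; it is not how caret implements the model.

```r
# Hypothetical helper: k-nearest-neighbour regression with Euclidean distance
knn_predict <- function(train_x, train_y, new_x, k) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  apply(new_x, 1, function(row) {
    # Distance from this new point to every training point
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    # Average the response over the k closest training points
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Toy data, not the chemical manufacturing data
toy_x <- data.frame(x1 = c(0, 1, 2, 10), x2 = c(0, 1, 2, 10))
toy_y <- c(1, 2, 3, 100)
pred_toy <- knn_predict(toy_x, toy_y, data.frame(x1 = 0.9, x2 = 0.9), k = 3)
pred_toy
```

Because the prediction depends on distances, predictors on larger scales would dominate, which is why centering and scaling matter for this model.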

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

The optimal linear model was a ridge regression model.

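## Fit the ridge model; lambda is the ridge penalty
ridgeModel <- enet(x = as.matrix(train), y = yield_train, lambda = 0.22525)

## s = 1 with mode = "fraction" requests the full solution, i.e. ridge only
ridgePred <- predict(ridgeModel, newx = as.matrix(test), s = 1, mode = "fraction", type = "fit")

## Collect observed and predicted values for defaultSummary()
lmValues1 <- data.frame(obs = yield_test, pred = ridgePred$fit)
names(lmValues1) <- c('obs', 'pred')
defaultSummary(lmValues1)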

##      RMSE  Rsquared       MAE 
## 1.2683048 0.4988622 1.0184721

Here are the top predictors for the ridge regression model, taken as those with an absolute coefficient greater than 0.2. All 8 of them are Manufacturing Process predictors.

ridgeCoef <- predict(ridgeModel, newx = as.matrix(test), s = 1, mode = "fraction", type = "coefficients")

coef <- as.data.frame(ridgeCoef$coefficients)
names(coef) <- c("coefficients")

coef |> filter(abs(coefficients) > 0.2)

##                        coefficients
## ManufacturingProcess04    0.2079555
## ManufacturingProcess09    0.3593582
## ManufacturingProcess13   -0.3360078
## ManufacturingProcess17   -0.3218476
## ManufacturingProcess26    0.2916862
## ManufacturingProcess29    0.2510196
## ManufacturingProcess32    0.4919538
## ManufacturingProcess36   -0.2783848
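
Since the exercise asks for a top-ten list, an alternative to a fixed cutoff is to rank the coefficients by absolute value and keep the first few. The sketch below does this with the eight values printed above; applied to the full coefficient vector, the same helper would return a true top ten. The helper name `top_n_coef` is hypothetical, and comparing magnitudes this way assumes the predictors were put on a common scale before fitting.

```r
# Coefficients copied from the output above (those with |coefficient| > 0.2)
ridge_coef <- c(
  ManufacturingProcess04 =  0.2079555,
  ManufacturingProcess09 =  0.3593582,
  ManufacturingProcess13 = -0.3360078,
  ManufacturingProcess17 = -0.3218476,
  ManufacturingProcess26 =  0.2916862,
  ManufacturingProcess29 =  0.2510196,
  ManufacturingProcess32 =  0.4919538,
  ManufacturingProcess36 = -0.2783848
)

# Order by magnitude, largest first; head() caps the list at n entries
top_n_coef <- function(coefs, n = 10) {
  head(coefs[order(abs(coefs), decreasing = TRUE)], n)
}

top_n_coef(ridge_coef, n = 3)
```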

Our MARS model does not give us 10 top predictors. It retains only 7, all of which are Manufacturing Process predictors. The top 2 predictors are the same for both models, ManufacturingProcess32 and ManufacturingProcess09, and ManufacturingProcess13 ranks third in both.

evimp(marsFit)

##                        nsubsets   gcv    rss
## ManufacturingProcess32        9 100.0  100.0
## ManufacturingProcess09        8  67.0   69.0
## ManufacturingProcess13        7  42.5   47.3
## ManufacturingProcess33        6  28.3   35.4
## ManufacturingProcess35        4  23.1   28.3
## ManufacturingProcess05        3  16.6   22.2
## ManufacturingProcess39        2   5.9   14.7
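
To make the comparison between the two lists explicit, the following sketch uses base R set operations on the predictor names printed in the two outputs above. Note the assumption: the ridge list contains only the coefficients above the 0.2 cutoff, so "MARS only" here means "not among those eight", not "absent from the ridge model".

```r
# Predictor names copied from the ridge output (|coefficient| > 0.2)
ridge_top <- c(
  "ManufacturingProcess04", "ManufacturingProcess09", "ManufacturingProcess13",
  "ManufacturingProcess17", "ManufacturingProcess26", "ManufacturingProcess29",
  "ManufacturingProcess32", "ManufacturingProcess36"
)

# Predictor names copied from the evimp(marsFit) output, in importance order
mars_top <- c(
  "ManufacturingProcess32", "ManufacturingProcess09", "ManufacturingProcess13",
  "ManufacturingProcess33", "ManufacturingProcess35", "ManufacturingProcess05",
  "ManufacturingProcess39"
)

shared    <- intersect(mars_top, ridge_top)  # in both lists
mars_only <- setdiff(mars_top, ridge_top)    # in the MARS list only

shared
mars_only
```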

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

We see that ManufacturingProcess32 and ManufacturingProcess09, which are the most influential predictors, have the steepest slopes.

The question asks specifically about predictors that are unique to the optimal nonlinear regression model. ManufacturingProcess39 is not among the ridge regression predictors listed above, and its relationship with Yield differs from that of the other predictors: Yield increases as ManufacturingProcess39 increases, but at high values of ManufacturingProcess39 Yield drops. In other words, too high a value of ManufacturingProcess39 reverses the relationship and hurts Yield.

plotmo(marsFit)

##  plotmo grid:    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##                           -0.08470601          -0.07713991           -0.1268052
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##            -0.1461453         -0.001981749           -0.1176615
##  BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
##            0.02994867           -0.0600821           -0.1627132
##  BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
##             -0.198281          -0.02172056              0.1190392
##  ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
##               0.5107121             0.01371756              0.3771873
##  ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
##             -0.06198524             -0.2627186              0.9643337
##  ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
##               0.8604992             0.02235246             -0.1273025
##  ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
##              0.00335214             -0.4879186              0.1211219
##  ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
##              0.01714718             -0.1485758            -0.07038956
##  ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
##              0.09749931            -0.04822413               -0.16686
##  ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
##              -0.0611367             -0.1439625            -0.07813513
##  ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
##            -0.009534753             -0.2233743            -0.08678172
##  ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
##              -0.1280199            -0.07976072              0.6630997
##  ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
##              -0.2217679            -0.04482494              0.1158871
##  ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
##              -0.1085771              0.1829854              0.1249846
##  ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
##             -0.06258329              0.4355593            -0.02978083
##  ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
##               0.7447969              0.2356983             -0.5004885
##  ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
##              -0.4781192              0.2027387             -0.1386747
##  ManufacturingProcess44 ManufacturingProcess45
##               0.3005461               0.159218
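
The rise-then-fall shape described for ManufacturingProcess39 is the kind of relationship a MARS model builds from a mirrored pair of hinge functions. The sketch below illustrates the idea in base R. The knot and the two slopes are hypothetical values chosen for illustration; they are not read from marsFit.

```r
# Hinge function, the MARS building block: max(0, x) applied elementwise
hinge <- function(x) pmax(0, x)

# Hypothetical knot and slope magnitudes, NOT taken from marsFit
knot <- 0.5
rise <- 1.0  # slope magnitude below the knot
fall <- 2.0  # slope magnitude above the knot

# Contribution of one predictor built from a mirrored pair of hinges:
# it climbs towards 0 as x approaches the knot, then declines past it
partial_effect <- function(x) {
  -rise * hinge(knot - x) - fall * hinge(x - knot)
}

x <- seq(-2, 2, by = 0.5)
data.frame(x = x, effect = partial_effect(x))
```

A linear model such as ridge regression has a single slope per predictor, so it cannot represent a peak like this, which is one reason a predictor can matter to MARS while receiving a small ridge coefficient.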

Lastly, we look at the relationships between the most important predictors and Yield by building a reduced model with just the most influential predictors and increasing the ‘degree’ to 2 to see whether we get any interaction terms in our model. We see that the ManufacturingProcess32:ManufacturingProcess09 and ManufacturingProcess09:ManufacturingProcess35 interactions are influential for Yield. Yield increases as ManufacturingProcess09 increases and ManufacturingProcess35 decreases.

rd_df <- cmp_df[, c("Yield", "ManufacturingProcess32", "ManufacturingProcess09", "ManufacturingProcess13", "ManufacturingProcess33", "ManufacturingProcess35")]
mars2 <- earth(Yield ~ ., data = rd_df, degree = 2) # degree = 2 allows two-way interaction terms
plotmo(mars2)

##  plotmo grid:    ManufacturingProcess32 ManufacturingProcess09
##                                     158                  45.73
##  ManufacturingProcess13 ManufacturingProcess33 ManufacturingProcess35
##                    34.6                     64               495.5965
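
With degree = 2, a MARS interaction term is a product of two hinge functions, so it contributes only when both hinges are active. The sketch below shows a term that grows as one predictor increases and the other decreases, matching the direction described for ManufacturingProcess09 and ManufacturingProcess35. The knots, the coefficient, and the names `a` and `b` are hypothetical; none of them are read from mars2.

```r
# Hinge function, the MARS building block: max(0, x) applied elementwise
hinge <- function(x) pmax(0, x)

# Hypothetical knots and coefficient, NOT taken from mars2
knot_a <- 0
knot_b <- 0
beta   <- 1.5

# Degree-2 term: nonzero only when a is above its knot AND b is below its knot
interaction_term <- function(a, b) {
  beta * hinge(a - knot_a) * hinge(knot_b - b)
}

# Evaluate the term on a small grid of (a, b) values
grid <- expand.grid(a = c(-1, 1, 2), b = c(-2, -1, 1))
grid$term <- interaction_term(grid$a, grid$b)
grid
```

In the fitted model, the actual terms and their knots can be read from the printed model summary rather than assumed as they are here.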

DATA624_HW8

William Aiken

11/9/2024
