Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)$$

where the x values are random variables uniformly distributed on [0, 1] (five additional non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

Create Simulated Training Set
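A minimal sketch of the simulation step, following the exercise's use of mlbench.friedman1; the seed is an assumption:

library(mlbench)

set.seed(200)  # assumed seed; any fixed value makes the run reproducible
trainingData <- mlbench.friedman1(200, sd = 1)
## Convert the 'x' matrix to a data frame so caret can use the columns directly
trainingData$x <- data.frame(trainingData$x)

## A large hold-out set to estimate the true error rate
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)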

View the Data with Skim and Plot with Feature Plot
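A sketch of the summary and plotting calls, assuming the objects created above (the feature plot itself is not reproduced here):

library(skimr)
library(caret)

## Skim the combined training set; flattening the list produces the
## x.X1 ... x.X10 column names seen below
skim(data.frame(trainingData))

## Scatter plots of each predictor against the response
featurePlot(x = trainingData$x, y = trainingData$y)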

Data summary
Name trainingData
Number of rows 200
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
x.X1 0 1 0.51 0.29 0.00 0.25 0.51 0.75 1.00 ▇▇▇▇▇
x.X2 0 1 0.56 0.29 0.01 0.32 0.56 0.80 0.99 ▅▆▆▇▇
x.X3 0 1 0.50 0.28 0.00 0.26 0.54 0.75 1.00 ▇▇▇▇▇
x.X4 0 1 0.50 0.30 0.00 0.21 0.51 0.75 0.99 ▇▅▇▆▇
x.X5 0 1 0.52 0.29 0.00 0.28 0.52 0.78 0.99 ▆▆▇▆▇
x.X6 0 1 0.51 0.29 0.00 0.25 0.53 0.76 1.00 ▇▆▇▇▇
x.X7 0 1 0.51 0.28 0.00 0.31 0.51 0.76 0.99 ▆▆▇▆▇
x.X8 0 1 0.53 0.30 0.01 0.26 0.57 0.80 1.00 ▆▅▆▆▇
x.X9 0 1 0.50 0.28 0.00 0.26 0.49 0.72 1.00 ▆▆▇▆▆
x.X10 0 1 0.55 0.28 0.01 0.32 0.57 0.78 0.99 ▅▆▇▇▇
y 0 1 14.73 5.06 0.80 11.04 14.90 18.49 26.72 ▁▅▇▇▂

KNN Model
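One plausible caret call behind the output below: bootstrapping with 25 reps is caret's default resampling scheme, and tuneLength = 10 produces the odd k values from 5 to 23. The seed is an assumption:

set.seed(921)  # hypothetical seed
knnModel <- train(x = trainingData$x, y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel

## Test-set performance for the tuned model (the last two lines of output below)
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)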

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.726109  0.4646980  3.014146
##    7  3.609130  0.4983845  2.916210
##    9  3.573850  0.5157660  2.876988
##   11  3.557243  0.5313498  2.865519
##   13  3.538419  0.5485584  2.862030
##   15  3.527742  0.5645340  2.849736
##   17  3.540169  0.5725169  2.865204
##   19  3.530327  0.5863954  2.851141
##   21  3.519411  0.6000860  2.842288
##   23  3.522842  0.6084923  2.844349
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 21.
##      RMSE  Rsquared       MAE 
## 3.3454465 0.6725733 2.7038661

MARS Model
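A sketch of the MARS fit, consistent with the tuning grid in the output below (degree 1 to 2, nprune 2 to 15); method = "earth" requires the earth package, and the seed is an assumption:

marsGrid <- expand.grid(degree = 1:2, nprune = 2:15)

set.seed(921)  # hypothetical seed
marsModel <- train(x = trainingData$x, y = trainingData$y,
                   method = "earth",
                   preProc = c("center", "scale"),
                   tuneGrid = marsGrid)
marsModel

marsPred <- predict(marsModel, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
varImp(marsModel)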

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.164892  0.3611916  3.409897
##   1        3      3.660936  0.5062227  2.959827
##   1        4      3.244858  0.6187136  2.631157
##   1        5      2.805473  0.7142806  2.242194
##   1        6      2.505767  0.7703123  1.976545
##   1        7      2.467969  0.7775514  1.945278
##   1        8      2.215528  0.8214284  1.761183
##   1        9      2.140487  0.8342432  1.707313
##   1       10      2.158827  0.8315906  1.725630
##   1       11      2.182385  0.8294915  1.732187
##   1       12      2.221737  0.8232456  1.758400
##   1       13      2.245196  0.8199335  1.777172
##   1       14      2.252059  0.8192026  1.790685
##   1       15      2.267975  0.8178146  1.805161
##   2        2      4.208856  0.3476404  3.451024
##   2        3      3.736080  0.4869258  2.995870
##   2        4      3.284380  0.6084662  2.653734
##   2        5      2.852638  0.7034535  2.294502
##   2        6      2.514765  0.7672643  1.972002
##   2        7      2.281683  0.8090696  1.795828
##   2        8      2.093803  0.8393584  1.663283
##   2        9      1.824083  0.8783372  1.467502
##   2       10      1.588086  0.9080874  1.267878
##   2       11      1.461354  0.9219269  1.174227
##   2       12      1.423772  0.9260926  1.132255
##   2       13      1.434791  0.9245093  1.128864
##   2       14      1.431186  0.9249019  1.124739
##   2       15      1.450474  0.9230373  1.140071
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 12 and degree = 2.
##      RMSE  Rsquared       MAE 
## 1.1890158 0.9434200 0.9512591
## earth variable importance
## 
##    Overall
## X4  100.00
## X1   77.46
## X5   59.43
## X2   39.88
## X3    0.00

SVM Model
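A sketch of the SVM fit. Unlike the two models above, the output below shows 10-fold cross-validation; tuneLength = 10 yields the cost values 0.25 through 128, while sigma is estimated analytically by kernlab and held constant:

set.seed(921)  # hypothetical seed
svmModel <- train(x = trainingData$x, y = trainingData$y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))
svmModel

svmPred <- predict(svmModel, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
varImp(svmModel)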

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  2.869165  0.7532017  2.312402
##     0.50  2.585576  0.7823054  2.055335
##     1.00  2.409369  0.8062289  1.894586
##     2.00  2.321492  0.8173656  1.854516
##     4.00  2.205788  0.8351372  1.774367
##     8.00  2.152113  0.8422696  1.735025
##    16.00  2.151148  0.8391198  1.728622
##    32.00  2.151463  0.8390591  1.728927
##    64.00  2.151463  0.8390591  1.728927
##   128.00  2.151463  0.8390591  1.728927
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05916194
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05916194 and C = 16.
##     RMSE Rsquared      MAE 
## 1.936793 0.849370 1.513800
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   44.0360
## X5   33.8046
## X2   27.0904
## X3   18.6716
## X10   3.9053
## X7    1.3407
## X9    1.0309
## X8    0.6161
## X6    0.0000

Neural Network Model
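A sketch of the model-averaged neural network fit. The grid matches the output below (decay 0, 0.01, 0.1; size 1 to 10); the NaN rows for size > 5 suggest MaxNWts was set low enough to reject the larger networks, so the cap shown here is an inference rather than a certainty:

nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1),
                        size = 1:10,
                        bag = FALSE)

set.seed(921)  # hypothetical seed
nnetModel <- train(x = trainingData$x, y = trainingData$y,
                   method = "avNNet",
                   preProc = c("center", "scale"),
                   tuneGrid = nnetGrid,
                   trControl = trainControl(method = "cv"),
                   linout = TRUE, trace = FALSE, maxit = 500,
                   ## inferred weight cap: networks with size > 5 exceed it,
                   ## which would explain the NaN rows below
                   MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1)
nnetModel

postResample(pred = predict(nnetModel, newdata = testData$x), obs = testData$y)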

## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.765430  0.7214950  2.222278
##   0.00    2    2.454579  0.7755543  1.958788
##   0.00    3    2.345981  0.7944327  1.878345
##   0.00    4    2.384023  0.7976148  1.825766
##   0.00    5    2.571703  0.7713210  2.013408
##   0.00    6         NaN        NaN       NaN
##   0.00    7         NaN        NaN       NaN
##   0.00    8         NaN        NaN       NaN
##   0.00    9         NaN        NaN       NaN
##   0.00   10         NaN        NaN       NaN
##   0.01    1    2.739902  0.7233210  2.182945
##   0.01    2    2.502228  0.7644033  1.992281
##   0.01    3    2.327996  0.7978076  1.839162
##   0.01    4    2.313584  0.7936864  1.902121
##   0.01    5    2.399065  0.7850531  1.919526
##   0.01    6         NaN        NaN       NaN
##   0.01    7         NaN        NaN       NaN
##   0.01    8         NaN        NaN       NaN
##   0.01    9         NaN        NaN       NaN
##   0.01   10         NaN        NaN       NaN
##   0.10    1    2.740531  0.7201455  2.160466
##   0.10    2    2.542600  0.7600431  2.034750
##   0.10    3    2.447342  0.7772923  1.961562
##   0.10    4    2.281908  0.8054697  1.865467
##   0.10    5    2.333053  0.8006064  1.930540
##   0.10    6         NaN        NaN       NaN
##   0.10    7         NaN        NaN       NaN
##   0.10    8         NaN        NaN       NaN
##   0.10    9         NaN        NaN       NaN
##   0.10   10         NaN        NaN       NaN
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.1 and bag = FALSE.
##      RMSE  Rsquared       MAE 
## 2.1282054 0.8177093 1.6732538
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   44.0360
## X5   33.8046
## X2   27.0904
## X3   18.6716
## X10   3.9053
## X7    1.3407
## X9    1.0309
## X8    0.6161
## X6    0.0000

Findings

The MARS model is the clear winner when it comes to performance, posting the best RMSE, Rsquared, and MAE of the four models. This likely results from the feature selection and pruning built into MARS: as the variable importance output above shows, MARS retained only the informative predictors (X1 through X5) and discarded the noise variables.

RMSE Rsquared MAE Model
2.128205 0.8177093 1.6732538 Neural Network
1.936793 0.8493700 1.5138001 SVM
1.189016 0.9434200 0.9512591 MARS
3.345447 0.6725733 2.7038661 KNN
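A sketch of how the table above can be assembled from the test-set predictions computed in the earlier sketches:

rbind("Neural Network" = postResample(predict(nnetModel, testData$x), testData$y),
      "SVM"            = postResample(svmPred,  testData$y),
      "MARS"           = postResample(marsPred, testData$y),
      "KNN"            = postResample(knnPred,  testData$y))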

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
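A sketch of the shared setup. The seed, split ratio, and exact preProcess method list are assumptions, but they are consistent with the console output below: knnImpute centers and scales every column (including Yield, hence the small RMSE values), a near-zero-variance filter drops one predictor to leave 56, and an 80/20 split of the 176 samples yields 144 training rows:

library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)

## Impute missing values; knnImpute also centers and scales, and "nzv"
## drops the near-zero-variance predictor (57 -> 56 predictors)
pp   <- preProcess(ChemicalManufacturingProcess,
                   method = c("knnImpute", "nzv"))
chem <- predict(pp, ChemicalManufacturingProcess)

## 80/20 split, stratified on the response
set.seed(517)  # hypothetical seed
inTrain   <- createDataPartition(chem$Yield, p = 0.8, list = FALSE)
chemTrain <- chem[inTrain, ]
chemTest  <- chem[-inTrain, ]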

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

KNN
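A plausible call for the KNN fit below, assuming the chemTrain object from the setup sketch; 10-fold cross-validation with tuneLength = 25 gives the k values from 5 to 53. The MARS, SVM, and neural network fits that follow reuse the same pattern, swapping in method = "earth" with the Exercise 7.2 grid, method = "svmRadial" with tuneLength = 25, and method = "avNNet" with the Exercise 7.2 grid, respectively:

set.seed(517)  # hypothetical seed
knnChem <- train(Yield ~ ., data = chemTrain,
                 method = "knn",
                 tuneLength = 25,
                 trControl = trainControl(method = "cv"))
knnChem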

## k-Nearest Neighbors 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 128, 129, 130, 129, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.7017857  0.4962847  0.5606515
##    7  0.7236139  0.4746428  0.5969490
##    9  0.7300591  0.4668491  0.6013968
##   11  0.7460436  0.4490707  0.6177046
##   13  0.7485067  0.4442744  0.6213027
##   15  0.7586368  0.4291603  0.6258871
##   17  0.7624259  0.4375912  0.6294124
##   19  0.7627696  0.4449437  0.6254787
##   21  0.7635900  0.4490280  0.6242544
##   23  0.7732659  0.4412447  0.6310587
##   25  0.7730124  0.4461752  0.6353608
##   27  0.7896268  0.4196195  0.6475769
##   29  0.7988773  0.4028687  0.6572028
##   31  0.8024417  0.4113070  0.6624965
##   33  0.8075687  0.4045510  0.6647987
##   35  0.8120014  0.4029270  0.6670706
##   37  0.8154453  0.3950009  0.6669333
##   39  0.8200178  0.3964573  0.6710715
##   41  0.8227606  0.3978416  0.6722223
##   43  0.8253769  0.3947873  0.6756019
##   45  0.8279943  0.3936810  0.6783199
##   47  0.8268730  0.4025667  0.6780751
##   49  0.8340750  0.3925289  0.6843632
##   51  0.8364038  0.3974095  0.6845187
##   53  0.8378127  0.3955672  0.6875498
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.

MARS

## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 129, 130, 132, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE      
##   1        2      0.7722023  0.3941996  0.5999321
##   1        3      0.6627211  0.5546441  0.5313042
##   1        4      0.6425518  0.5768163  0.5156998
##   1        5      0.6585956  0.5596575  0.5180076
##   1        6      0.6627592  0.5524007  0.5198269
##   1        7      0.6604789  0.5589514  0.5153873
##   1        8      0.6543415  0.5634510  0.5078212
##   1        9      0.6604364  0.5567869  0.5187573
##   1       10      0.6650075  0.5440094  0.5148558
##   1       11      0.6628266  0.5524088  0.5212075
##   1       12      0.6746918  0.5472778  0.5276799
##   1       13      0.6583636  0.5646934  0.5106820
##   1       14      0.6608923  0.5579993  0.5158548
##   1       15      0.6629204  0.5566854  0.5176025
##   2        2      0.7617132  0.4083614  0.5915262
##   2        3      0.6801367  0.5184663  0.5504259
##   2        4      0.6298407  0.5873378  0.5176954
##   2        5      0.6249017  0.5959341  0.5122018
##   2        6      0.6007549  0.6286730  0.4823449
##   2        7      0.5634810  0.6679785  0.4605940
##   2        8      0.5510920  0.6848203  0.4448332
##   2        9      0.5560838  0.6808989  0.4394491
##   2       10      0.5708689  0.6600874  0.4609790
##   2       11      0.5605395  0.6742829  0.4541922
##   2       12      0.5693831  0.6739141  0.4619783
##   2       13      0.5673132  0.6724538  0.4510969
##   2       14      0.5665736  0.6723828  0.4533791
##   2       15      0.5905837  0.6500161  0.4711157
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 8 and degree = 2.

SVM

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 130, 130, 130, 130, 128, ... 
## Resampling results across tuning parameters:
## 
##   C           RMSE       Rsquared   MAE      
##         0.25  0.7637178  0.4612331  0.6304200
##         0.50  0.7138185  0.4953123  0.5846336
##         1.00  0.6639811  0.5514336  0.5374400
##         2.00  0.6315988  0.5856940  0.5107699
##         4.00  0.6294062  0.5859009  0.5046017
##         8.00  0.6087599  0.6065785  0.4889208
##        16.00  0.6084278  0.6069749  0.4888434
##        32.00  0.6084278  0.6069749  0.4888434
##        64.00  0.6084278  0.6069749  0.4888434
##       128.00  0.6084278  0.6069749  0.4888434
##       256.00  0.6084278  0.6069749  0.4888434
##       512.00  0.6084278  0.6069749  0.4888434
##      1024.00  0.6084278  0.6069749  0.4888434
##      2048.00  0.6084278  0.6069749  0.4888434
##      4096.00  0.6084278  0.6069749  0.4888434
##      8192.00  0.6084278  0.6069749  0.4888434
##     16384.00  0.6084278  0.6069749  0.4888434
##     32768.00  0.6084278  0.6069749  0.4888434
##     65536.00  0.6084278  0.6069749  0.4888434
##    131072.00  0.6084278  0.6069749  0.4888434
##    262144.00  0.6084278  0.6069749  0.4888434
##    524288.00  0.6084278  0.6069749  0.4888434
##   1048576.00  0.6084278  0.6069749  0.4888434
##   2097152.00  0.6084278  0.6069749  0.4888434
##   4194304.00  0.6084278  0.6069749  0.4888434
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01436322
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01436322 and C = 16.

Neural Net

## Model Averaged Neural Network 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 129, 129, 129, 129, 131, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE      
##   0.00    1    0.8550454  0.3905974  0.6736735
##   0.00    2    0.8397700  0.4677405  0.6781118
##   0.00    3    1.5264613  0.4818655  1.1051510
##   0.00    4    0.7159878  0.5603617  0.5629924
##   0.00    5    0.7164835  0.5410727  0.5823812
##   0.00    6          NaN        NaN        NaN
##   0.00    7          NaN        NaN        NaN
##   0.00    8          NaN        NaN        NaN
##   0.00    9          NaN        NaN        NaN
##   0.00   10          NaN        NaN        NaN
##   0.01    1    0.7755530  0.5041892  0.6271978
##   0.01    2    0.8231924  0.4582981  0.6547267
##   0.01    3    0.7477910  0.5072772  0.5946967
##   0.01    4    0.6783638  0.6022529  0.5179515
##   0.01    5    0.6213627  0.6623922  0.4979006
##   0.01    6          NaN        NaN        NaN
##   0.01    7          NaN        NaN        NaN
##   0.01    8          NaN        NaN        NaN
##   0.01    9          NaN        NaN        NaN
##   0.01   10          NaN        NaN        NaN
##   0.10    1    0.7522723  0.5255260  0.6046651
##   0.10    2    0.7397727  0.5402556  0.5802931
##   0.10    3    0.6544117  0.6037688  0.5332527
##   0.10    4    0.6473491  0.6268090  0.5231965
##   0.10    5    0.6113875  0.6511967  0.4892838
##   0.10    6          NaN        NaN        NaN
##   0.10    7          NaN        NaN        NaN
##   0.10    8          NaN        NaN        NaN
##   0.10    9          NaN        NaN        NaN
##   0.10   10          NaN        NaN        NaN
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.

Findings

It appears that the neural network and SVM models performed better than the other models on an RMSE basis, while the MARS and SVM models also posted high Rsquared values during resampling. Overall, the SVM may be the best model to go with, as it shows relatively strong performance across all three metrics (RMSE, Rsquared, and MAE).

Model RMSE Rsquared MAE
Neural Network 0.6190814 0.6740072 0.4705560
SVM 0.6808183 0.6244615 0.4939512
KNN 0.7149819 0.5894607 0.5047973
MARS 0.7199111 0.5920601 0.5651110

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

The most important predictors are ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess09, ManufacturingProcess17, and BiologicalMaterial06. The manufacturing process variables dominate the list both in number and in prominence: four of the top five predictors are manufacturing processes. Compared to the optimal linear model from last week, both the linear and nonlinear models selected ManufacturingProcess32 as the most important variable. After that, however, only one other variable, ManufacturingProcess09, appeared in the top ten of both the linear and nonlinear models.
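The scores below come from caret's model-free loess r-squared importance measure applied to the tuned SVM; the object name is an assumption:

varImp(svmChem)  # svmChem: the "svmRadial" fit described in part (a)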

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   93.82
## ManufacturingProcess09   89.93
## ManufacturingProcess17   88.20
## BiologicalMaterial06     82.61
## BiologicalMaterial03     79.44
## ManufacturingProcess36   73.85
## BiologicalMaterial12     72.36
## ManufacturingProcess06   69.00
## ManufacturingProcess11   62.34
## ManufacturingProcess31   56.39
## BiologicalMaterial02     50.34
## BiologicalMaterial11     48.53
## BiologicalMaterial09     44.76
## ManufacturingProcess30   41.87
## BiologicalMaterial08     40.24
## ManufacturingProcess29   38.54
## ManufacturingProcess33   38.16
## BiologicalMaterial04     36.92
## ManufacturingProcess25   36.83

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Below, I have plotted the top five nonlinear predictors that were not included in the list of top linear predictors. For some of the predictors we can easily see a positive or negative relationship with the response variable. For others (M36 and M12) there is no discernible relationship. I believe the predictors with no discernible relationship reflect the higher-order transformations that take place within the SVM regression.
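A sketch of the plotting call, assuming the chemTrain object from the setup sketch. The five predictor names are placeholders, since the exact set unique to the nonlinear model depends on last week's linear importance list:

## hypothetical list of top nonlinear predictors absent from the linear top ten
uniqueTop <- c("BiologicalMaterial06", "BiologicalMaterial03",
               "ManufacturingProcess36", "BiologicalMaterial12",
               "ManufacturingProcess06")

featurePlot(x = chemTrain[, uniqueTop],
            y = chemTrain$Yield,
            plot = "scatter",
            type = c("p", "smooth"))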