NonLinear Regression

Jack Wright

7.2 , 7.5

7.2

Benchmark data set created by simulation.

One of the simulations used the following nonlinear equation

\[y = 10sin(\pi x_1 x_2)+20(x_3 -0.5)^2+10x_4 + 5x_5 +N(0, \sigma^2)\]

x are the values of random variables uniformly distributed between [0,1]

5 non-informative variables also created for simulation

mlbench contains a function called mlbench.friedman1 that sims these data:

library(mlbench)
## Warning: package 'mlbench' was built under R version 4.1.3
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(200)
trainingData<-mlbench.friedman1(200, sd = 1)
## convert the 'x' data froma matrix to a data frame
## one reason is that this will give column names
trainingData$x<-data.frame(trainingData$x)
## look at the data using
featurePlot(trainingData$x, trainingData$y)

##  this creates a list with vector 'y' and a matrix
##  of predictors 'x'. Also simulate a large test set to 
##  estimate the true error rate with good precision:
testData<-mlbench.friedman1(500, sd = 1)
testData$x<-data.frame(testData$x)

Tune several models on these data. for example:

library(caret)
knnModel<-train(x = trainingData$x,
                y = trainingData$y, 
                method = 'knn',
                preProc = c('center', 'scale'),
                tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.587483  0.5016797  2.925321
##    7  3.449601  0.5457896  2.820733
##    9  3.366991  0.5780134  2.745968
##   11  3.319312  0.6028553  2.692924
##   13  3.338963  0.6074127  2.709832
##   15  3.324987  0.6225090  2.703993
##   17  3.312809  0.6390199  2.684094
##   19  3.320180  0.6507115  2.689365
##   21  3.348821  0.6496607  2.717876
##   23  3.355189  0.6584384  2.726909
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <-predict(knnModel, newdata = testData$x)
## the function 'postResample' can be used to get the test set
## performance values

knn_metrics<-postResample(pred = knnPred, obs = testData$y)

Lets check the variable importance for the knn Model.

varImp(knnModel)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

Interestingly there are variables we know are not important (because we created them) that the knn views as important. This clearly means the model is not working well. The reason why is because the data has non-neighborhood correlation.

which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?

Lets rebuild the model for MARS with some tuning parameters for number of predictors.

marsGrid <- expand.grid(degree = 1:2, nprune = seq(2,14, by =2))
set.seed(921)
marsModel<-train(x = trainingData$x, 
                 y = trainingData$y,
                 method = 'earth',
                 preProc = c('center','scale'),
                 tuneGrid = marsGrid)
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.1.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.1.3
## Loading required package: plotrix
## Loading required package: TeachingDemos
## Warning: package 'TeachingDemos' was built under R version 4.1.3
marsPred <-predict(marsModel, newdata = testData$x)
mars_metrics<-postResample(pred = marsPred, obs = testData$y)

Lets check the variable importance for the MARS

varImp(marsModel)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.40
## X2   49.00
## X5   15.72
## X3    0.00

The MARS model correctly pulls out the most important features.

Now lets build and train the SVM model.

use plotmo package to create plots of each predictor vs outcome

set.seed(921)
svmRmodel<-train(x = trainingData$x, 
                 y = trainingData$y,
                 method = 'svmRadial',
                 preProc = c('center','scale'),
                 tuneLength = 8)
svmRpred<-predict(svmRmodel, newdata = testData$x)

varImp(svmRmodel)
## loess r-squared variable importance
## 
##      Overall
## X4  100.0000
## X1   95.5047
## X2   89.6186
## X5   45.2170
## X3   29.9330
## X9    6.3299
## X10   5.5182
## X8    3.2527
## X6    0.8884
## X7    0.0000

The SVM model does rate the important variables the highest, but there is some noise that leaks in from the false data.

plotmo(svmRmodel)
##  plotmo grid:    X1        X2       X3        X4        X5        X6        X7
##           0.5139349 0.5106664 0.537307 0.4445841 0.5343299 0.4975981 0.4688035
##        X8        X9       X10
##  0.497961 0.5288716 0.5359218

looking at the SVM plots, we can see that the non-important variables are virtually zero.

svm_metrics<-postResample(pred = svmRpred, obs = testData$y)

Now lets look at the metrics to determine which model performed the best.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## v purrr   0.3.4
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
df<-as.data.frame(rbind(knn_metrics, mars_metrics, svm_metrics))
df%>%arrange(desc(Rsquared))
##                  RMSE  Rsquared      MAE
## mars_metrics 1.323799 0.9317627 1.027932
## svm_metrics  2.193731 0.8128594 1.634044
## knn_metrics  3.333159 0.6741453 2.653670

Overall the mars model works the best. This is a bit suprising, because one of the SVM’s selling points is that it is able to identify sinusoidal data with the radial kernel.

7.5

a

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.1.2
data("ChemicalManufacturingProcess")

b A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values

Lets take a look at the missing values

library(visdat)
vis_miss(ChemicalManufacturingProcess)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.

Lets first see if We can remove some columns with nearZeroVar and how that impacts the missingness

#knnDescr<-solTrainXtrans[,-nearZeroVar(solTrainXtrans)]

chem_dat<-ChemicalManufacturingProcess[,-nearZeroVar(ChemicalManufacturingProcess)]

vis_miss(chem_dat)

Ironically it INCREASED the percentage of data missing! (because we are removing columns that had a higher percentage of populated data, this isn’t a problem)

Now lets get a list of columns that contain missing data.

names(which(colSums(is.na(chem_dat))>0))
##  [1] "ManufacturingProcess01" "ManufacturingProcess02" "ManufacturingProcess03"
##  [4] "ManufacturingProcess04" "ManufacturingProcess05" "ManufacturingProcess06"
##  [7] "ManufacturingProcess07" "ManufacturingProcess08" "ManufacturingProcess10"
## [10] "ManufacturingProcess11" "ManufacturingProcess12" "ManufacturingProcess14"
## [13] "ManufacturingProcess22" "ManufacturingProcess23" "ManufacturingProcess24"
## [16] "ManufacturingProcess25" "ManufacturingProcess26" "ManufacturingProcess27"
## [19] "ManufacturingProcess28" "ManufacturingProcess29" "ManufacturingProcess30"
## [22] "ManufacturingProcess31" "ManufacturingProcess33" "ManufacturingProcess34"
## [25] "ManufacturingProcess35" "ManufacturingProcess36" "ManufacturingProcess40"
## [28] "ManufacturingProcess41"

… so basically all of the column have NA values.

Since the missing data seems fairly spread out, we will just impute during the preprocessing process for our model.

Lets now split our data into predictors and response

library(RANN)
## Warning: package 'RANN' was built under R version 4.1.3
response<-chem_dat%>%select(Yield)
features<-chem_dat%>%select(-Yield)

set.seed(123)
train_rows<-createDataPartition(response$Yield,
                                p = .7,
                                list = FALSE)
train_pred<-features[train_rows,]
train_resp<-response[train_rows,]

test_pred<-features[-train_rows,]
test_resp<-response[-train_rows,]

#preprocess data
pp<-preProcess(train_pred, method = c('BoxCox', 'center','scale','knnImpute'))

pp_train_pred<-predict(pp,train_pred)
pp_test_pred<-predict(pp,test_pred)

Since we already removed the near zero variance columns, we should remove the high correlation variables as well

pred_cor<-cor(pp_train_pred)
high_corr_pp<-findCorrelation(pred_cor)
pp_train_pred<-pp_train_pred[,-high_corr_pp]
pp_test_pred<-pp_test_pred[,-high_corr_pp]

c

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?

I would not expect this data to have neighborhood correlation so I won’t use a knn. I also like the idea of the self pruning of variables, so I’ll start with a MARS model

ctrl <- trainControl(method = "repeatedcv", repeats = 5)
marsGrid <- expand.grid(degree = 1:2, nprune = seq(2,30, by =2))
set.seed(921)
marsModel<-train(x = pp_train_pred, 
                 y = train_resp,
                 method = 'earth',
                 preProc = c('center','scale'),
                 trControl = ctrl,
                 tuneGrid = marsGrid)
marsPred <-predict(marsModel, newdata = pp_test_pred)

postResample(pred = marsPred, obs = test_resp)
##      RMSE  Rsquared       MAE 
## 1.0949893 0.6432769 0.8859736

Lets look at what the most important predictors the MARS model selected are

varImp(marsModel)
## earth variable importance
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   59.69
## ManufacturingProcess13   21.22
## ManufacturingProcess01    0.00
plotmo(marsModel)
##  plotmo grid:    BiologicalMaterial01 BiologicalMaterial03 BiologicalMaterial04
##                           -0.04498146            -0.166129         -0.009073643
##  BiologicalMaterial05 BiologicalMaterial06 BiologicalMaterial08
##           -0.06373363          -0.05786777           0.07287398
##  BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
##            -0.0338531           -0.1758822          -0.08481589
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##              0.08200158              0.4864915             0.03424434
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##               0.3312441            -0.09860299             -0.2402823
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##              -0.9038186              0.9336492             0.04686526
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##              -0.1058449            -0.06053385             -0.5129976
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##               0.1326757              0.0851508            -0.06382909
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess19
##              0.08746425             0.08163835             -0.1607842
##  ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
##              0.09026192               -0.19479             -0.1742694
##  ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess26
##              0.01003158              -0.128005             0.07788206
##  ManufacturingProcess28 ManufacturingProcess30 ManufacturingProcess32
##               0.7153686           -0.003758284            -0.08208428
##  ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
##               0.1813987              0.1113707            -0.07283974
##  ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
##               -0.432863              -0.072112              0.6817466
##  ManufacturingProcess39 ManufacturingProcess41 ManufacturingProcess43
##               0.2167978              -0.455224             -0.1210006
##  ManufacturingProcess44 ManufacturingProcess45
##               0.2662542              0.1323683

I really like using the MARS model here, because I don’t lose the predictors, and it will be easy to explain to decision makers how we could change the manufacturing process.